. Not a rewrite. Not a v2. Same config.yaml, same database, same API. The runtime underneath just gets faster.
I went through their benchmark numbers. They look real.
The numbers
| Rust gateway |
LiteLLM Python |
| Per-request overhead |
~0.05ms |
~7.5ms |
| Throughput (50 concurrent) |
6,782 req/s |
453 req/s |
| Peak memory under load |
31.7MB |
358.9MB |
15x throughput. 11x less memory. 150x lower per-request overhead. The harness is checked into the repo so you can reproduce it yourself.
For most workloads, gateway overhead is noise compared to model latency. A Claude call takes 500ms to 30s. Adding 7ms vs 0.05ms, who cares. But for high-throughput stuff like classification batches, embeddings at scale, or coding agents hammering completions, it adds up fast.
How they're doing it
The migration is a clean four-stage plan:
Stage 0: Pure Python (today)
Stage 1: Rust core via PyO3, Python still does I/O
Stage 2: FastAPI thin shell, entire hot path in Rust
Stage 3: Pure Rust server (axum), Python plugins in sidecar
What I like about this approach: they're not flipping everything at once. Each route moves individually. OCR first (smallest surface, no streaming). Then /v1/messages (adds streaming). Then /chat/completions (largest param surface). One provider at a time, parity check gates every step.
The Rust core is pure transforms. It turns your request into a provider request, turns the response back, handles stream chunks, counts tokens. No sockets, no secrets, no database access. Python keeps doing I/O until Stage 3. Clean separation.
Timeline
Aug 15 - litellm.ocr() → Rust
Sep 1 - /messages, /chat/completions → Rust
Sep 15 - Router (load balancing, fallbacks, retries) → Rust
Dec 1 - Full server: axum replaces FastAPI
What stays the same
Everything you care about:
- Same
config.yaml
- Same database and schema
- Same client API, same request/response shapes
- Same providers, routing, keys
- Custom Python plugins keep working in the sidecar
You deploy the Rust binary, it uses ~65MB of memory, overhead stays under 1ms. Nothing in your setup changes.
Why this matters
The "Python is slow" argument against LiteLLM was always a stretch. Gateway overhead is maybe 0.3% of total latency on a typical LLM call. Most of the time you're waiting on the model, not the proxy.
But now even that argument is gone. Sub-1ms overhead, 32MB memory, 6,782 req/s on a single instance. Good luck finding a lighter gateway that still covers 100+ providers.
Full architecture diagrams and the reproducible benchmark setup are in the announcement: docs.litellm.ai/blog/litellm-rust-launch
Curious if anyone else is running their AI gateway through Rust. What's your setup look like?