The problem
Layer 1
Multi-model routing with per-tenant rate limiting. Requests hit a single endpoint; KVWarden routes to the right model and enforces each tenant's token-bucket budget before requests reach the engine — so noisy neighbors cannot starve quiet users even when the engine is fully occupied.
vLLM's continuous-batch scheduler is tenant-blind by design. A flooder eats the engine; a quiet user waits 29× longer. The fix has to happen above the engine, not inside it.
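The budget gate's admission logic is the classic token bucket. A minimal sketch, purely illustrative (class and field names are assumptions, not KVWarden's actual code), of why a flooder can only drain its own bucket:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: sustained refill rate plus burst capacity.
    Illustrative sketch; units and names are assumptions."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # sustained refill rate (requests/sec)
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller returns HTTP 429; request never reaches the engine

buckets = {"noisy": TokenBucket(rate_per_sec=1.0, burst=5),
           "quiet": TokenBucket(rate_per_sec=1.0, burst=5)}

# A burst of 20 requests from "noisy" drains only its own bucket:
admitted = sum(buckets["noisy"].try_acquire() for _ in range(20))
print(admitted)               # 5, the burst capacity
print(buckets["quiet"].try_acquire())  # True: "quiet" is unaffected
```

Because each tenant refills independently, the quiet user's budget is untouched no matter how hard the noisy tenant floods, which is exactly the property the engine's tenant-blind scheduler cannot provide on its own.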
Layer 2
KV cache managed like PostgreSQL shared_buffers: compression, eviction, offloading, and sharing across GPU HBM, CPU DRAM, and NVMe SSD, with async transfers that never stall inference.
A 70B model at 128K context demands 40GB of KV cache per request. Without tiered management, you're flying blind.
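The 40GB figure checks out as back-of-envelope arithmetic, assuming Llama-3.1-70B geometry (80 layers, 8 KV heads via GQA, head dimension 128) and FP16 keys and values:

```python
# KV cache per token = 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16)
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # 327,680 bytes/token
context = 128 * 1024                                     # 128K tokens
total_gib = bytes_per_token * context / 2**30
print(f"{total_gib:.1f} GiB per 128K-token request")     # 40.0 GiB
```

That is a single request; a handful of concurrent long-context requests overflow any single GPU's HBM, which is what forces the DRAM and NVMe tiers.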
Layer 3
Per-tenant token-bucket rate limiting at the budget gate — the load-bearing mechanism for fairness under contention (validated in Gate 2-FAIRNESS, see results/). Ten lines of YAML set each tenant's sustained RPM and burst capacity; requests beyond that return 429 before they reach the engine.
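A sketch of what that YAML might look like. The field names below are illustrative assumptions, not the exact KVWarden schema:

```yaml
# Hypothetical configs/quickstart_fairness.yaml sketch (field names assumed)
tenants:
  noisy:
    rpm: 60      # sustained requests per minute
    burst: 10    # extra requests tolerated in a spike
  quiet:
    rpm: 60
    burst: 10
default:
  rpm: 30        # applied to any tenant not listed above
  burst: 5
```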
The approach
One command
No Kubernetes. No cluster. KVWarden wraps vLLM or SGLang and adds per-tenant token-bucket rate limiting, multi-model lifecycle management, and KV cache scaffolding — all on one GPU.
# Install and serve under tenant-fair rate limiting
$ pip install kvwarden
$ kvwarden serve --config configs/quickstart_fairness.yaml
# Wait until /health returns 200 (engines preload at startup):
$ until curl -fs localhost:8000/health > /dev/null; do sleep 2; done
# Two tenants sharing one engine — the flooder cannot starve
# the quiet user once rate-limit is configured:
$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: noisy" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: quiet" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
# Watch the rate limiter fire while the engine queue stays steady:
$ curl localhost:8000/metrics | grep tenant_rejected
Landscape
Dynamo, llm-d, Mammoth, and AIBrix all require Kubernetes — none give you per-tenant fairness. Ollama runs without K8s but is tenant-blind. KVWarden fills the "multi-tenant on a small shared box without K8s" cell.
| Project | K8s Required | Multi-Model | Per-Tenant Fairness | Target Scale |
|---|---|---|---|---|
| KVWarden | No | Intelligent | Yes | 1-4 GPUs |
| Dynamo (NVIDIA) | Yes | Yes | No | Datacenter |
| llm-d (CNCF) | Yes | 1 model/pool | No | Datacenter |
| Mammoth (Modular) | Yes | Yes | No | Datacenter |
| Ollama | No | LRU only | No | Single node |
| Gimlet Labs ($92M) | Managed cloud | Yes | Unknown | Cloud |
Built on top of vLLM and SGLang
The orchestration layer between Ollama simplicity and datacenter intelligence. For developers with 1-4 GPUs who need more than LRU eviction.
View on GitHub · Get started
Open source. Star the repo or join the waitlist for release updates.