29× quiet-tenant starvation under a flooder on vLLM
1.14× solo baseline — with KVWarden's token-bucket rate-limit
10 lines of YAML. No application code change.
Open-source · pip install kvwarden

TENANT-FAIR LLM SERVING, NO KUBERNETES

The problem

A flooder at 32 RPS starves a quiet user on the same vLLM engine — quiet TTFT climbs from 53.9 ms to 1,585 ms (29× worse). Engines have no concept of tenants, so the fix cannot live inside them. KVWarden adds per-tenant token-bucket rate limiting at the budget gate, and the quiet user comes back to 61.5 ms — within 1.14× of solo baseline.
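The headline multipliers follow directly from the measured TTFTs; a two-line check of the arithmetic:

```python
# Recompute the headline ratios from the measured quiet-tenant TTFTs above.
solo, flooded, limited = 53.9, 1585.0, 61.5  # ms, quiet-tenant TTFT p99
starvation = round(flooded / solo)      # ~29x worse without rate limiting
recovery = round(limited / solo, 2)     # 1.14x of solo with the token bucket
print(starvation, recovery)
```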

WorkloadRouter
29× quiet-tenant starvation — neutralized by the bucket

Layer 1

Intelligent request routing

Multi-model routing with per-tenant rate limiting. Requests hit a single endpoint; KVWarden routes to the right model and enforces each tenant's token-bucket budget before requests reach the engine — so noisy neighbors cannot starve quiet users even when the engine is fully occupied.
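The admission check at the budget gate is a classic token bucket. A minimal standalone sketch (hypothetical class and parameters, not KVWarden's actual code): tokens refill at a sustained rate up to a burst cap, each request spends one, and a drained bucket means rejection before the engine is touched.

```python
import time

class TokenBucket:
    """Toy token bucket: `rate` tokens/second refill, capped at `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never beyond the burst cap.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller rejects (429) before the engine sees the request

# One bucket per tenant: the flooder only drains its own bucket.
buckets = {"noisy": TokenBucket(rate=2, burst=4),
           "quiet": TokenBucket(rate=2, burst=4)}
noisy_admitted = sum(buckets["noisy"].allow() for _ in range(100))
quiet_ok = buckets["quiet"].allow()
print(noisy_admitted, quiet_ok)  # burst of 4 admitted, quiet tenant unaffected
```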

vLLM's continuous-batch scheduler is tenant-blind by design. A flooder eats the engine; a quiet user waits 29× longer. The fix has to happen above the engine, not inside it.
CacheManager
40GB KV cache per request (70B @ 128K)
GPU HBM
CPU DRAM
NVMe SSD

Layer 2

Tiered KV cache

Managed like PostgreSQL shared_buffers. Compression, eviction, offloading, and sharing across GPU HBM, CPU DRAM, and NVMe SSD — with async transfers that never stall inference.
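The tiering policy can be pictured as demote-on-pressure rather than drop-on-pressure. A toy sketch (hypothetical structures and capacities, not KVWarden's real classes): each tier is an LRU store, and when a tier overflows, its least-recently-used block moves one tier down instead of being discarded.

```python
from collections import OrderedDict

# Toy tiered placement: HBM -> DRAM -> NVMe, LRU demotion on overflow.
TIERS = ["gpu_hbm", "cpu_dram", "nvme_ssd"]
CAPACITY = {"gpu_hbm": 2, "cpu_dram": 4, "nvme_ssd": 8}  # toy block counts
tiers = {name: OrderedDict() for name in TIERS}

def put(block_id: str, payload: bytes, tier: str = "gpu_hbm") -> None:
    store = tiers[tier]
    store[block_id] = payload
    store.move_to_end(block_id)               # mark most recently used
    if len(store) > CAPACITY[tier]:
        victim, data = store.popitem(last=False)  # evict the LRU block
        nxt = TIERS.index(tier) + 1
        if nxt < len(TIERS):
            put(victim, data, TIERS[nxt])     # demote one tier, don't drop

for i in range(5):
    put(f"blk{i}", b"kv")
print([len(tiers[t]) for t in TIERS])  # oldest blocks demoted to DRAM
```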

A 70B model at 128K context demands 40GB of KV cache per request. Without tiered management, a handful of long-context requests exhausts GPU HBM.
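The 40GB figure checks out from first principles, assuming a Llama-3.1-70B-class shape (80 layers, 8 KV heads via GQA, head dimension 128, fp16):

```python
# Back-of-envelope KV-cache size for a 70B-class model at 128K context.
# Assumed shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V per token
total_gb = per_token * 128 * 1024 / 1024**3
print(f"{per_token / 1024:.0f} KB/token, {total_gb:.1f} GB at 128K context")
```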
TenantManager
61.5 ms quiet TTFT under a flooder (1.14× solo)
Tenant A: 45%
Tenant B: 30%
Tenant C: 25%

Layer 3

Safe multi-tenancy

Per-tenant token-bucket rate limiting at the budget gate — the load-bearing mechanism for fairness under contention (validated in Gate 2-FAIRNESS, see results/). Ten lines of YAML set each tenant's sustained RPM and burst capacity; requests beyond that return 429 before they reach the engine.
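A sketch of what such a config could look like (field names like `rpm`, `burst`, and `on_limit` are illustrative guesses, not KVWarden's documented schema):

```yaml
# Illustrative sketch only; field names are assumptions, not KVWarden's schema.
tenants:
  noisy:
    rpm: 600        # sustained requests per minute
    burst: 32       # token-bucket burst capacity
  quiet:
    rpm: 120
    burst: 8
on_limit: reject_429  # over-budget requests never reach the engine
```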

The approach

KVWarden is middleware, not a replacement. It wraps vLLM and SGLang, manages multiple models on shared GPU memory, and adds per-tenant token-bucket rate limiting so noisy neighbors don't starve quiet users. Engines have no concept of a tenant; KVWarden does.

53.9 ms quiet TTFT p99 — solo baseline (no contention)
1,585 ms quiet TTFT p99 — under flooder, no rate-limit
61.5 ms quiet TTFT p99 — under flooder, with token-bucket
1.14× of solo baseline — with KVWarden

One command

pip install kvwarden

No Kubernetes. No cluster. KVWarden wraps vLLM or SGLang and adds per-tenant token-bucket rate limiting, multi-model lifecycle management, and KV cache scaffolding — all on one GPU.

# Install and serve under tenant-fair rate limiting
$ pip install kvwarden
$ kvwarden serve --config configs/quickstart_fairness.yaml

# Wait until /health returns 200 (engines preload at startup):
$ until curl -fs localhost:8000/health > /dev/null; do sleep 2; done

# Two tenants sharing one engine — the flooder cannot starve
# the quiet user once rate-limit is configured:
$ curl localhost:8000/v1/completions \
    -H "X-Tenant-ID: noisy" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

$ curl localhost:8000/v1/completions \
    -H "X-Tenant-ID: quiet" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

# Watch the rate-limit fire and engine queue stay composed:
$ curl localhost:8000/metrics | grep tenant_rejected

Landscape

The only no-K8s tenant-fair orchestrator

Dynamo, llm-d, Mammoth, and AIBrix all require Kubernetes — none give you per-tenant fairness. Ollama runs without K8s but is tenant-blind. KVWarden fills the "multi-tenant on a small shared box without K8s" cell.

System               K8s Required    Multi-Model    Per-Tenant Fairness   Target Scale
KVWarden             No              Intelligent    Yes                   1-4 GPUs
Dynamo (NVIDIA)      Yes             Yes            No                    Datacenter
llm-d (CNCF)         Yes             1 model/pool   No                    Datacenter
Mammoth (Modular)    Yes             Yes            No                    Datacenter
Ollama               No              LRU only       No                    Single node
Gimlet Labs ($92M)   Managed cloud   Yes            Unknown               Cloud

Built on top of

vLLM
SGLang
LMCache
CUDA

The orchestration layer between Ollama simplicity and datacenter intelligence. For developers with 1-4 GPUs who need more than LRU eviction.

View on GitHub

Get started

$ pip install kvwarden
$ kvwarden serve llama-8b qwen-7b --gpu-budget 80%

Open source. Star the repo or join the waitlist for release updates.