The problem
Layer 1
Multi-model routing with per-tenant rate limiting. Requests hit a single endpoint; KVWarden routes to the right model and enforces each tenant's token-bucket budget before requests reach the engine — so noisy neighbors cannot starve quiet users even when the engine is fully occupied.
vLLM's continuous-batch scheduler is tenant-blind by design. A flooder eats the engine; a quiet user waits 29× longer. The fix has to happen above the engine, not inside it.
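The budget gate's admission logic is the classic token bucket. A minimal sketch, purely illustrative (class and field names are assumptions, not KVWarden's actual code), of why a flooder can only drain its own bucket:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: sustained refill rate plus burst capacity.
    Illustrative sketch; units and names are assumptions."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # sustained refill rate (requests/sec)
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller returns HTTP 429; request never reaches the engine

buckets = {"noisy": TokenBucket(rate_per_sec=1.0, burst=5),
           "quiet": TokenBucket(rate_per_sec=1.0, burst=5)}

# A burst of 20 requests from "noisy" drains only its own bucket:
admitted = sum(buckets["noisy"].try_acquire() for _ in range(20))
print(admitted)               # 5, the burst capacity
print(buckets["quiet"].try_acquire())  # True: "quiet" is unaffected
```

Because each tenant refills independently, the quiet user's budget is untouched no matter how hard the noisy tenant floods, which is exactly the property the engine's tenant-blind scheduler cannot provide on its own.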
Layer 2
KV cache managed like PostgreSQL shared_buffers: compression, eviction, offloading, and sharing across GPU HBM, CPU DRAM, and NVMe SSD, with async transfers that never stall inference.
A 70B model at 128K context demands 40GB of KV cache per request. Without tiered management, you're flying blind.
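The 40GB figure checks out as back-of-envelope arithmetic, assuming Llama-3.1-70B geometry (80 layers, 8 KV heads via GQA, head dimension 128) and FP16 keys and values:

```python
# KV cache per token = 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16)
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # 327,680 bytes/token
context = 128 * 1024                                     # 128K tokens
total_gib = bytes_per_token * context / 2**30
print(f"{total_gib:.1f} GiB per 128K-token request")     # 40.0 GiB
```

That is a single request; a handful of concurrent long-context requests overflow any single GPU's HBM, which is what forces the DRAM and NVMe tiers.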
Layer 3
Per-tenant token-bucket rate limiting at the budget gate — the load-bearing mechanism for fairness under contention (validated in Gate 2-FAIRNESS, see results/). Ten lines of YAML set each tenant's sustained RPM and burst capacity; requests beyond that return 429 before they reach the engine.
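A sketch of what that YAML might look like. The field names below are illustrative assumptions, not the exact KVWarden schema:

```yaml
# Hypothetical configs/quickstart_fairness.yaml sketch (field names assumed)
tenants:
  noisy:
    rpm: 60      # sustained requests per minute
    burst: 10    # extra requests tolerated in a spike
  quiet:
    rpm: 60
    burst: 10
default:
  rpm: 30        # applied to any tenant not listed above
  burst: 5
```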
The approach
One command
No Kubernetes. No cluster. KVWarden wraps vLLM or SGLang and adds per-tenant token-bucket rate limiting, multi-model lifecycle management, and KV cache scaffolding — all on one GPU.
# Install and serve under tenant-fair rate limiting
$ pip install kvwarden
$ kvwarden serve --config configs/quickstart_fairness.yaml
# Wait until /health returns 200 (engines preload at startup):
$ until curl -fs localhost:8000/health > /dev/null; do sleep 2; done
# Two tenants sharing one engine — the flooder cannot starve
# the quiet user once rate-limit is configured:
$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: noisy" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: quiet" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
# Watch the rate limiter fire while the engine queue stays steady:
$ curl localhost:8000/metrics | grep tenant_rejected
Landscape
Dynamo, llm-d, Mammoth, and AIBrix all require Kubernetes — none give you per-tenant fairness. Ollama runs without K8s but is tenant-blind. KVWarden fills the "multi-tenant on a small shared box without K8s" cell.
| Project | K8s Required | Multi-Model | Per-Tenant Fairness | Target Scale |
|---|---|---|---|---|
| KVWarden | No | Intelligent | Yes | 1-4 GPUs |
| Dynamo (NVIDIA) | Yes | Yes | No | Datacenter |
| llm-d (CNCF) | Yes | 1 model/pool | No | Datacenter |
| Mammoth (Modular) | Yes | Yes | No | Datacenter |
| Ollama | No | LRU only | No | Single node |
| Gimlet Labs ($92M) | Managed cloud | Yes | Unknown | Cloud |
Built on top of vLLM and SGLang
The orchestration layer between Ollama simplicity and datacenter intelligence. For developers with 1-4 GPUs who need more than LRU eviction.
View on GitHub · Get started
Open source. Star the repo or join the waitlist for release updates.