KVWarden is a tenant-fair LLM serving orchestrator. On vLLM a flooder at 32 RPS degrades a quiet user's time-to-first-token from 53.9 ms to 1,585 ms (29× worse). KVWarden adds per-tenant token-bucket rate limiting at the budget gate, and the quiet user returns to 61.5 ms, within 1.14× of the solo baseline. No Kubernetes required.

29× quiet-tenant starvation under a flooder on vLLM
1.14× solo baseline with KVWarden's token-bucket rate limit
10 lines of YAML. No application code change.
For teams running vLLM · SGLang on 1-4 GPUs

TENANT-FAIR LLM SERVING, NO KUBERNETES

Per-tenant token-bucket rate limiting in front of vLLM. Unprotected, a flooder at 32 RPS slows a quiet user's time-to-first-token 29×. With KVWarden, the quiet user stays within 1.14× of solo baseline.

pip install kvwarden (PyPI)

The problem

A flooder at 32 RPS starves a quiet user on the same vLLM engine. Quiet TTFT climbs from 53.9 ms to 1,585 ms (29× worse). Engines have no concept of tenants, so the fix cannot live inside them. KVWarden adds per-tenant token-bucket rate limiting at the budget gate, and the quiet user comes back to 61.5 ms, within 1.14× of solo baseline.
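Both headline ratios follow directly from the three TTFT figures quoted above. A quick check in Python, using only those numbers (the variable names are illustrative):

```python
# TTFT figures quoted in the text, in milliseconds.
solo_ms = 53.9        # quiet user alone on the engine
flooded_ms = 1585.0   # quiet user next to a 32 RPS flooder, no rate limit
limited_ms = 61.5     # same flooder, token-bucket rate limit enabled

starvation = flooded_ms / solo_ms   # slowdown without the limit
residual = limited_ms / solo_ms     # slowdown that remains with the limit

print(f"without rate limit: {starvation:.1f}x solo")  # 29.4x
print(f"with rate limit:    {residual:.2f}x solo")    # 1.14x
```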

WorkloadRouter
29× quiet-tenant starvation, neutralized by the bucket

Layer 1

Intelligent request routing

Multi-model routing with per-tenant rate limiting. Requests hit a single endpoint; KVWarden routes to the right model and enforces each tenant's token-bucket budget before requests reach the engine, so noisy neighbors cannot starve quiet users even when the engine is fully occupied.

vLLM's continuous-batch scheduler is tenant-blind by design. A flooder eats the engine; a quiet user waits 29× longer. The fix has to happen above the engine, not inside it.
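To make "above the engine" concrete, here is a minimal sketch of a gate that sees the tenant before the engine's scheduler does. It is not KVWarden's code: `UPSTREAMS`, `gate`, the placeholder URL, the 404 for unknown models, and the `"default"` tenant for requests without a header are all assumptions for illustration. Only the `X-Tenant-ID` header, the `llama31-8b` model name, and the 429 response come from this page.

```python
from typing import Callable, Dict, Tuple

# Hypothetical model-to-engine map; the URL is a placeholder.
UPSTREAMS: Dict[str, str] = {
    "llama31-8b": "http://127.0.0.1:8001",
}


def gate(headers: Dict[str, str], body: Dict[str, object],
         allow: Callable[[str], bool]) -> Tuple[int, str]:
    """Decide what happens to one request before any engine sees it.

    Returns (HTTP status, upstream URL or a short reason).
    """
    # Assumption: requests without the header share one "default" tenant.
    tenant = headers.get("X-Tenant-ID", "default")
    model = str(body.get("model", ""))
    if model not in UPSTREAMS:
        return 404, f"unknown model {model!r}"
    # The budget check runs here, so a rejected request never
    # occupies a slot in the engine's batch.
    if not allow(tenant):
        return 429, f"tenant {tenant!r} is over budget"
    return 200, UPSTREAMS[model]


# Stand-in budget check: pretend "noisy" has exhausted its budget.
over_budget = {"noisy"}
allow = lambda tenant: tenant not in over_budget

print(gate({"X-Tenant-ID": "noisy"}, {"model": "llama31-8b"}, allow))
print(gate({"X-Tenant-ID": "quiet"}, {"model": "llama31-8b"}, allow))
```

The budget check is passed in as a plain callable so that the routing decision and the rate-limit policy stay separate.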

CacheManager
40 GB KV cache per request (70B @ 128K)
Tiers: GPU HBM, CPU DRAM, NVMe SSD

Layer 2

Tiered KV cache

Managed like PostgreSQL shared_buffers. Compression, eviction, offloading, and sharing across GPU HBM, CPU DRAM, and NVMe SSD, with async transfers that never stall inference.

A 70B model at 128K context demands 40GB of KV cache per request. Without tiered management, you're flying blind.
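The 40 GB figure can be reproduced with back-of-the-envelope arithmetic. The page does not name the model, so this sketch assumes a Llama-style 70B architecture (80 layers, 8 grouped-query KV heads, head dimension 128) with 16-bit cache entries. Other 70B models or cache dtypes will give different numbers.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading 2 counts one key and one value vector per
    layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture: Llama-style 70B, FP16/BF16 cache, 128K = 131,072 tokens.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)

print(f"{size / 2**30:.1f} GiB per 128K-token request")  # 40.0 GiB
```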

TenantManager
61.5 ms quiet TTFT under a flooder (1.14× solo)
Tenant A: 45%
Tenant B: 30%
Tenant C: 25%

Layer 3

Safe multi-tenancy

Per-tenant token-bucket rate limiting at the budget gate is the load-bearing mechanism for fairness under contention (validated in Gate 2-FAIRNESS; see results/). Ten lines of YAML set each tenant's sustained RPM and burst capacity; requests beyond that budget return 429 before they reach the engine.
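For readers new to the mechanism, here is a minimal token-bucket sketch with one bucket per tenant. It is a generic textbook version, not KVWarden's implementation: `TokenBucket`, `BudgetGate`, the injected clock, and the limits in the usage lines are all hypothetical. Only the two knobs (sustained RPM and burst capacity) and the 429 response come from the description above.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Refills at a steady rate, holds at most `burst` tokens."""

    def __init__(self, rate_per_min: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_min / 60.0   # tokens added per second
        self.capacity = burst
        self.tokens = burst               # start full so a burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Spend one token if available; never blocks."""
        now = self.clock()
        refill = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class BudgetGate:
    """One independent bucket per tenant (hypothetical wrapper)."""

    def __init__(self, limits: Dict[str, Tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.buckets = {tenant: TokenBucket(rpm, burst, clock)
                        for tenant, (rpm, burst) in limits.items()}

    def status_for(self, tenant: str) -> int:
        """HTTP status the gate would return for one request."""
        return 200 if self.buckets[tenant].try_acquire() else 429


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
limiter = BudgetGate({"noisy": (60, 2), "quiet": (60, 2)},
                     clock=lambda: now[0])

flood = [limiter.status_for("noisy") for _ in range(4)]
print(flood)                         # [200, 200, 429, 429]
print(limiter.status_for("quiet"))   # 200: the quiet bucket is untouched
now[0] += 1.0                        # one second later, one token refilled
print(limiter.status_for("noisy"))   # 200
```

Because each tenant draws from its own bucket, a flooder that empties its budget has no effect on the quiet tenant's tokens.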

The approach

KVWarden is middleware, not a replacement. It wraps vLLM and SGLang, manages multiple models on shared GPU memory, and adds per-tenant token-bucket rate limiting so noisy neighbors don't starve quiet users. Engines have no concept of a tenant; KVWarden does.

Benchmark results

53.9 ms quiet TTFT p99, solo baseline (no contention)
1,585 ms quiet TTFT p99, under flooder, no rate-limit
61.5 ms quiet TTFT p99, under flooder, with token-bucket
1.14× of solo baseline, with KVWarden

One command

pip install kvwarden

No Kubernetes. No cluster. KVWarden wraps vLLM or SGLang and adds per-tenant token-bucket rate limiting, multi-model lifecycle management, and KV cache scaffolding. All on one GPU.

# Install and serve under tenant-fair rate limiting
$ pip install kvwarden
$ kvwarden serve --config configs/quickstart_fairness.yaml

# Wait until /health returns 200 (engines preload at startup):
$ until curl -fs localhost:8000/health > /dev/null; do sleep 2; done

# Two tenants sharing one engine. The flooder cannot starve
# the quiet user once rate-limit is configured:
$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: noisy" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

$ curl localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: quiet" \
    -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

# Watch the rate-limit fire and engine queue stay composed:
$ curl -s localhost:8000/metrics | grep tenant_rejected

Landscape

The only no-K8s tenant-fair orchestrator

Dynamo, llm-d, Mammoth, and AIBrix all require Kubernetes. None give you per-tenant fairness. Ollama runs without K8s but is tenant-blind. KVWarden fills the "multi-tenant on a small shared box without K8s" cell.

System | K8s Required | Multi-Model | Per-Tenant Fairness | Target Scale
KVWarden | No | Intelligent | Yes | 1-4 GPUs
Dynamo (NVIDIA) | Yes | Yes | No | Datacenter
llm-d (CNCF) | Yes | 1 model/pool | No | Datacenter
Mammoth (Modular) | Yes | Yes | No | Datacenter
Ollama | No | LRU only | No | Single node
Gimlet Labs ($92M) | Managed cloud | Yes | Unknown | Cloud

Built on top of

The orchestration layer between Ollama simplicity and datacenter intelligence. For developers with 1-4 GPUs who need more than LRU eviction.

View on GitHub

Get started

Join the newsletter

Ships as open source. Release notes and engineering posts when there's something worth sharing. No marketing, no spam.

pip install kvwarden
kvwarden serve llama-8b qwen-7b --gpu-budget 80%
  • Launch notification when v1.0 ships
  • Benchmark updates & technical posts
  • Early access to docs & tutorials

Open source. Star the repo or join the newsletter for release updates.