Transformer inference is O(N²) in sequence length — naively. The KV cache reduces it to O(N) by saving intermediate attention key/value tensors from previous tokens. Without it, generating each new token would require recomputing attention over the whole prior sequence.

Advertisement

Without cache

Generating token t means computing attention over tokens 1..t. For each new token you redo all the work for tokens 1..t-1. Each step is O(t) and there are N steps → O(N²) total.

With cache

Save the Key (K) and Value (V) tensors for each token as you generate. Next token only needs Q for itself + reads cached K,V for previous tokens. Each step is O(t) but you skip the redundant recompute. Total: still O(N²) in the math, but the constant is much smaller — typically 10-50x faster.

Advertisement

Memory cost

KV cache size = 2 × num_layers × num_heads × head_dim × sequence_length × batch_size × precision_bytes. For Llama-3 70B at 8K context: ~16GB just for the cache. This is why long-context models need so much VRAM.

Paged attention

vLLM's paged attention manages the KV cache like virtual memory — pages of fixed size, allocated on demand. Lets you serve more concurrent requests with the same VRAM. The single biggest speedup in modern LLM serving.

Cache truncation strategies

When context exceeds the model's max length, you must drop some cached entries. Strategies: drop oldest (sliding window), drop random, drop based on attention weights (heavy-hitters). Sliding window with the first few tokens preserved (attention sinks) is the production default.

KV cache = the speedup. Memory grows linearly with context. Paged attention + sinks for production serving.