
Why Your LLM Is Slow (And the 5 Papers That Fix It)

5 min read

I recently came across a post explaining these papers and thought it was worth sharing a quick breakdown.

TL;DR

| Concept | Layer | Key Win | Remember This |
|---|---|---|---|
| FlashAttention | Compute | 2-6x attention speedup | Tiling + IO-awareness |
| PagedAttention | Memory | Less than 4% waste (was 60-80%) | Virtual memory for KV cache |
| Speculative Decoding | Generation | 2-3.6x faster decoding | Draft-then-verify |
| Heterogeneous Serving | Infrastructure | Up to 77% cost savings | Right GPU for right job |
| DistServe | Architecture | 7.4x more requests | Split prefill from decoding |

1. FlashAttention - Compute Optimisation

The problem: Attention is slow not because of maths, but because of memory traffic. GPUs have fast on-chip memory (SRAM, ~19TB/s) and slow main memory (HBM, ~2TB/s). Standard attention keeps shuffling data between them.

How it works: Instead of computing the full N×N attention matrix at once, FlashAttention tiles it into small blocks that fit in fast SRAM. It uses an "online softmax" trick to get exact results incrementally - no approximation.

Analogy: Instead of carrying all your groceries inside in one impossible armful, you make smart small trips - but you planned the route so well it's actually faster.
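To see the mechanism, here is a minimal NumPy sketch of the tiling-plus-online-softmax idea (pure illustration, not the fused CUDA kernel; the function name and block size are mine):

```python
import numpy as np

def flash_attention_like(Q, K, V, block_size=64):
    """Exact attention computed block-by-block with an online softmax."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    for s in range(0, N, block_size):
        Kb, Vb = K[s:s + block_size], V[s:s + block_size]
        S = (Q @ Kb.T) * scale                  # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])          # unnormalised block weights
        corr = np.exp(m - m_new)                # rescale earlier partial sums
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]   # matches naive softmax(QK^T/sqrt(d))V exactly
```

The point is that no N×N matrix is ever materialised: only one block of scores lives in fast memory at a time, and the running max and denominator keep the result exact.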

| Version | Speedup | GPU Utilisation |
|---|---|---|
| v1 | 2-3x vs standard | 25-40% |
| v2 | 2x on top of v1 | 50-73% |

2. PagedAttention - Memory Management

The problem: Each request stores a KV cache (the model's "memory" of past tokens). Traditional systems pre-allocate memory for the worst case, wasting 60-80% of GPU memory. This limits how many requests you can batch together.

How it works: Borrows the virtual memory paging concept from operating systems. KV cache is split into fixed-size blocks that can be scattered anywhere in GPU memory. A block table maps logical to physical locations. Memory is allocated on-demand as tokens are generated.

Analogy: Instead of reserving an entire bookshelf per person (wasteful), you let people's books sit on any available shelf and give them a card catalogue to find them.
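A toy Python sketch of the block-table bookkeeping (class and method names are hypothetical; the real system manages GPU tensors, not integers):

```python
BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    """Fixed-size blocks allocated on demand, with per-sequence block tables."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical ids]

    def append_token(self, seq_id, token_pos):
        table = self.block_tables.setdefault(seq_id, [])
        if token_pos % BLOCK_SIZE == 0:             # current block is full
            table.append(self.free_blocks.pop())    # grab any free block
        physical = table[token_pos // BLOCK_SIZE]   # logical -> physical lookup
        return physical, token_pos % BLOCK_SIZE     # where this token's K/V go

    def release(self, seq_id):                      # request finished
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
print(cache.append_token("req-1", 0))  # (1023, 0): a fresh block, slot 0
```

Because blocks are only claimed as tokens are actually generated, the waste is bounded by one partially filled block per sequence rather than an entire worst-case reservation.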

| Metric | Result |
|---|---|
| Memory waste | From 60-80% down to less than 4% |
| Throughput vs HuggingFace | 24x |
| Throughput vs prior SOTA | 2-4x |

Why it matters: Memory efficiency means larger batch sizes, more requests per GPU, and lower cost. This is why vLLM became the industry standard for LLM serving.


3. Speculative Decoding - Faster Token Generation

The problem: LLMs generate tokens one at a time, each requiring a full forward pass through billions of parameters. The GPU is massively underutilised - like hiring 1,000 workers to carry one brick at a time.

Two approaches:

Speculative Sampling (DeepMind): A small fast "draft" model guesses the next k tokens. The big model verifies all k in one forward pass. Where the guesses match, you get k tokens for the price of roughly one. The accept/reject scheme is mathematically guaranteed to produce the same output distribution as the big model alone - no approximation.

Medusa (Cai et al., Princeton/UIUC): Instead of a separate model, bolt extra "prediction heads" onto the main model. Each head predicts future tokens in parallel. Simpler deployment (one model), but requires fine-tuning.

| Approach | Speedup | Trade-off |
|---|---|---|
| Speculative Sampling | 2-2.5x | Needs two models |
| Medusa-2 | 2.3-3.6x | Needs fine-tuned heads |

Analogy: Instead of asking the CEO to write a memo word by word, have an intern draft 5 sentences, then the CEO reviews them all at once - keeping what's good, rewriting what's not.
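To make draft-then-verify concrete, here is a greedy-decoding sketch (the `draft_next` and `target_next` callables are stand-ins I invented; real speculative sampling accepts or rejects tokens probabilistically, and the verifier scores all k positions in a single batched forward pass rather than one at a time):

```python
def speculative_step(draft_next, target_next, prefix, k=5):
    """Return 1..k accepted tokens per 'expensive' verification step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    guesses, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        guesses.append(t)
        ctx.append(t)
    # 2. The target model checks each position: keep the matching prefix,
    #    and replace the first disagreement with the target's own token.
    accepted, ctx = [], list(prefix)
    for g in guesses:
        t = target_next(ctx)
        accepted.append(t)
        ctx.append(t)
        if t != g:
            break
    return accepted
```

The speedup depends entirely on the acceptance rate: the more often the small model agrees with the big one, the closer you get to k tokens per verification.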


4. Heterogeneous GPU Serving - Cost Optimisation

The problem: Companies buy expensive A100s for everything, but not all requests need top-tier hardware. Short chat messages don't need the same GPU as processing 100-page documents.

Two approaches:

Metis (training-focused): Automatically figures out how to split training across mixed GPU types with smart load balancing. Result: up to 8.4x speedup.

Mélange (inference-focused): Formulates GPU selection as a bin-packing problem - which mix of cheap and expensive GPUs minimises cost while meeting latency targets?

| Workload | Cost Savings |
|---|---|
| Short chat | Up to 77% |
| Long documents | 33% |
| Mixed | 51% |

Analogy: Instead of sending limousines for every taxi ride, dispatch the right vehicle for each trip - sedans for solo riders, vans for groups.
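A toy version of that selection problem (the GPU names, prices, and throughputs below are invented for illustration; Mélange solves a richer bin-packing formulation across request sizes and arrival rates):

```python
# Hypothetical $/hr and max tokens/sec per GPU type.
GPUS = {
    "small-gpu": {"cost_per_hr": 1.0, "max_tps": 40},
    "big-gpu":   {"cost_per_hr": 4.0, "max_tps": 160},
}

def cheapest_gpu(required_tps):
    """Pick the cheapest GPU type that still meets the throughput target."""
    viable = [(spec["cost_per_hr"], name)
              for name, spec in GPUS.items() if spec["max_tps"] >= required_tps]
    return min(viable)[1] if viable else None

print(cheapest_gpu(30))   # small-gpu: short chats don't need the big card
print(cheapest_gpu(120))  # big-gpu: long documents do
```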


5. DistServe - Disaggregated Inference

The problem: LLM inference has two phases with opposite needs:

  • Prefill (process the prompt): wants massive parallelism, high throughput
  • Decoding (generate tokens): wants low latency, small batches

Running both on the same GPU is like asking one chef to do both bulk meal prep and delicate plating simultaneously - neither goes well.

How it works: Physically separate prefill and decoding onto different GPU clusters, each tuned for its workload. After prefill generates the KV cache, it is shipped to the decoding cluster.

| Metric | Result |
|---|---|
| Goodput vs vLLM (requests meeting SLO per second) | 7.4x more |
| SLO compliance | Over 90% of requests meet latency targets |

Analogy: A restaurant with a separate prep kitchen (high-volume chopping) and a plating station (precision finishing), connected by a runner.
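A queues-and-threads sketch of the pattern (everything here is a stand-in: real DistServe runs the two phases on separate GPU groups and ships the KV cache across the interconnect, not through an in-process queue):

```python
import queue, threading, time

prefill_q, decode_q = queue.Queue(), queue.Queue()

def prefill_worker():            # tuned for throughput: large batches
    while True:
        prompt = prefill_q.get()
        kv_cache = f"kv({prompt})"        # stand-in for the real prefill pass
        decode_q.put((prompt, kv_cache))  # hand the KV cache to the decode pool

def decode_worker():             # tuned for latency: small batches
    while True:
        prompt, kv = decode_q.get()
        print(f"decoding {prompt} with {kv}")  # token-by-token generation here

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()
prefill_q.put("prompt-1")
prefill_q.put("prompt-2")
time.sleep(0.5)  # let the daemon threads drain the queues before exit
```

Because the two pools are sized and parallelised independently, a burst of long prompts no longer stalls the token streams of requests already decoding.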


References

  1. Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  2. Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  3. Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention
  4. Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling
  5. Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  6. Um et al. (2024). Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
  7. Griggs et al. (2024). Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
  8. Zhong et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving