TL;DR
Standard attention in transformers checks every token against every other token. That’s O(n²), and it falls apart once you push past 32K context. Sparse attention fixes this by letting each token look at only the tokens that actually matter: neighbors, structural anchors, and high-importance “heavy hitters.” DeepSeek’s NSA paper achieves 11.6x decoding speedup at 64K length while beating full attention on accuracy. The Sparse Frontier, the largest empirical study of training-free sparse methods, confirms the core finding: a bigger model running sparse attention outperforms a smaller model running dense attention at the same compute cost. If you’re serving long-context workloads, sparse attention separates a $3,000 GPU bill from a $300 one.
The Bill That Made Me Care About Attention Patterns
I’d been running a document-processing pipeline on a 70B model with 128K context for about two weeks before I checked the invoice. Four figures. For a prototype. The culprit wasn’t the model’s parameter count or the number of requests. It was attention computation on long sequences eating GPU hours like they were free.
That’s when I started reading everything I could find on sparse attention. Not the textbook definition (I’d seen the Longformer diagram in enough blog posts), but the recent papers that actually ship production speedups without tanking accuracy. Three stood out: DeepSeek’s Native Sparse Attention (NSA), Microsoft’s MInference, and a large-scale empirical study called The Sparse Frontier that benchmarked six sparse methods across four model sizes up to 72B parameters.
This post breaks down what each paper found, how the techniques work under the hood, and what the results mean if you’re running inference on anything longer than a few thousand tokens.
Why Standard Attention Breaks on Long Sequences
Every transformer layer computes attention the same way: for each token in the sequence, calculate a score against every other token, normalize those scores with softmax, then use them to weight the value vectors. Simple math, brutal cost.
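To make the cost concrete, here is a minimal pure-Python sketch of that score, softmax, and weighted-sum loop for a single head. It assumes a causal mask (each token only sees itself and earlier tokens, as in decoder-only LLMs), and the function names and toy vectors are mine, not from any of the papers. Real implementations run this as batched matrix multiplies on the GPU.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_causal_attention(queries, keys, values):
    """Single-head causal attention: token i scores itself and every earlier token."""
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # One dot product per visible token: this inner loop is the O(n^2) part.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (illustrative only): 3 tokens, 2-dimensional head.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = dense_causal_attention(q, k, v)
print(out)
```

The nested loop over `i` and `j` is the whole story: every sparse method in this post is a different way of shrinking the range of `j`.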
Double the sequence length, and you quadruple the work. An 8B model takes roughly 30 minutes to process a 1M-token prompt on a single A100, and that’s just the prefilling stage, before any output tokens get generated. And during decoding, the KV cache grows linearly with context length, so every new token requires reading back an ever-larger pile of cached keys and values from GPU memory.
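A quick back-of-the-envelope script shows both effects. The model shape below (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an illustrative assumption, not a measurement of any specific deployment, and the function names are my own.

```python
def causal_score_count(n_tokens):
    """Query-key pairs scored by causal attention in one head of one layer."""
    return n_tokens * (n_tokens + 1) // 2

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys plus values cached across all layers (assumed, illustrative model shape)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

for n in (32_768, 65_536, 131_072):
    pairs = causal_score_count(n)
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>7} tokens: {pairs:.2e} score pairs, {gib:.0f} GiB KV cache")
```

The score-pair count roughly quadruples with each doubling of context, while the cache only doubles. Prefilling pays the first cost; decoding pays the second on every generated token.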
FlashAttention solved part of this by optimizing the memory access pattern of standard attention: fusing operations, tiling to fit SRAM, avoiding the huge intermediate attention matrix. But FlashAttention still computes the full O(n²) dot products. It made standard attention faster; it didn’t change its scaling curve.
Sparse attention takes a different path: don’t compute attention over every token pair. Instead, figure out which tokens are worth attending to, and skip the rest.
Three Families of Sparse Attention
The research has branched into three distinct approaches over the past few years, each with different tradeoffs.
Static Patterns: Longformer and BigBird (2020)
The earliest approach: define a fixed attention mask at architecture time. Longformer uses a sliding window where each token attends to its w nearest neighbors, plus a handful of “global” tokens that can see the full sequence. BigBird adds random attention connections on top of the local + global structure.
These work well for document classification and extractive tasks. They’re also simple to implement. But they’re rigid: the same attention pattern applies regardless of what the model is looking at. A token in the middle of a legal contract gets the same window width as one at the start of a code file, even though the relevance structure is completely different.
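Here is a small sketch of a Longformer-style mask (sliding window plus global tokens), just to show how rigid the pattern is. The function name and the toy sizes are my own choices, and it builds the full n×n grid only so it can be printed; a real implementation never materializes the mask this way.

```python
def longformer_style_mask(n_tokens, window, global_positions):
    """Return mask[i][j] == True where query i may attend to key j."""
    global_set = set(global_positions)
    half = window // 2
    mask = [[False] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        for j in range(n_tokens):
            local = abs(i - j) <= half                      # sliding window around i
            is_global = i in global_set or j in global_set  # global tokens see, and are seen by, everyone
            mask[i][j] = local or is_global
    return mask

mask = longformer_style_mask(n_tokens=12, window=4, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
allowed_pairs = sum(sum(row) for row in mask)
print(f"{allowed_pairs} of {12 * 12} pairs computed")
```

Nothing in the mask depends on the input text, which is exactly the limitation the next wave of methods went after.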
Dynamic Inference-Time Methods: H2O, MInference, SnapKV (2023–2024)
The second wave made sparsity input-dependent. Instead of hard-coding which tokens to attend to, these methods analyze the attention pattern during inference and drop the tokens that don’t contribute much.
H2O (Heavy-Hitter Oracle, NeurIPS 2023) noticed that attention scores during inference are highly concentrated. A small percentage of tokens (the “heavy hitters”) accumulate most of the attention weight across layers. H2O’s eviction policy keeps recent tokens plus the top heavy hitters in the KV cache and drops everything else. With just 20% of tokens retained as heavy hitters, H2O improved throughput by up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate (and up to 3x over FlexGen) on OPT-6.7B and OPT-30B.
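The eviction rule is simple enough to sketch in a few lines. This is a simplified, one-shot version under my own assumptions: the real method updates accumulated scores greedily at every decode step and per attention head, and the scores below are made up for illustration.

```python
def h2o_style_keep(accumulated_scores, n_recent, n_heavy):
    """Pick cache positions to keep: the newest tokens plus the top-scoring older ones."""
    n = len(accumulated_scores)
    recent = set(range(max(0, n - n_recent), n))
    older = [i for i in range(n) if i not in recent]
    # Heavy hitters: older positions with the largest accumulated attention weight.
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:n_heavy]
    return sorted(recent | set(heavy))

# Hypothetical accumulated attention per cached token (not measured data).
scores = [9.1, 0.2, 0.1, 4.7, 0.3, 0.2, 0.1, 0.4, 0.6, 0.5]
kept = h2o_style_keep(scores, n_recent=3, n_heavy=2)
print(f"keeping positions {kept} out of {len(scores)}")
```

Everything not in `kept` gets evicted from the KV cache, so later tokens can never attend to it again. That irreversibility is why per-task monitoring matters with this family of methods.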
MInference (NeurIPS 2024) attacked the prefilling bottleneck specifically. Microsoft’s team identified three recurring patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) that show up consistently across heads and layers. MInference classifies each attention head’s pattern offline, then applies the corresponding sparse computation online with custom CUDA kernels. The result: up to 10x prefilling speedup on 1M-token contexts on a single A100, with accuracy matching the dense baseline on LLaMA-3-8B-1M.
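To get a feel for two of those patterns, here is a toy rendering of A-shape and Vertical-Slash masks. The column and diagonal choices are hypothetical values I picked for the picture; MInference estimates the actual indices per head at inference time, and this is not its API.

```python
def a_shape_allowed(i, j, n_initial, window):
    """A-shape: the first few tokens plus a local window behind the query."""
    return j <= i and (j < n_initial or i - j < window)

def vertical_slash_allowed(i, j, vertical_columns, slash_offsets):
    """Vertical-Slash: chosen key columns plus chosen diagonals (fixed query-key distances)."""
    return j <= i and (j in vertical_columns or (i - j) in slash_offsets)

def render(n_tokens, allowed):
    """Draw a causal mask as rows of 'x' (computed) and '.' (skipped)."""
    return ["".join("x" if allowed(i, j) else "." for j in range(n_tokens))
            for i in range(n_tokens)]

a_shape = render(10, lambda i, j: a_shape_allowed(i, j, n_initial=2, window=3))
v_slash = render(10, lambda i, j: vertical_slash_allowed(i, j, {0, 4}, {0, 1, 5}))
print("\n".join(a_shape))
print()
print("\n".join(v_slash))
```

The third pattern, Block-Sparse, keeps whole tiles of the matrix instead of lines, which maps more directly onto how GPU kernels process attention in blocks.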
SnapKV and its adaptive variant Ada-SnapKV take a similar approach to KV cache compression during decoding, selecting which cached keys to retain based on per-query importance scores. A recent paper, AsyncTLS, takes this further with a two-level sparse scheme that separates prefilling and decoding sparsity to hit 4.7x faster inference.
The limitation of all these methods: they’re applied after the model has been trained with full attention. The model never learned to work with sparse patterns during training, so the sparsity is bolted on at inference time. That creates an accuracy ceiling, especially at high compression ratios.
Natively Trainable: DeepSeek NSA (2025)
DeepSeek’s NSA took the opposite approach: train the model with sparse attention from the start, so it learns to route information efficiently through the sparse pattern rather than having sparsity imposed after the fact. It’s from the same team behind DeepSeek’s hyper-connection architecture, and the two papers share a common theme of rethinking standard transformer components rather than patching them.
This is the paper that changed my mental model of sparse attention from “a cost-saving hack” to “a better inductive bias for long sequences.”
DeepSeek NSA: How It Works
NSA splits each attention head into three parallel branches. Every query token gets an attention output from all three, and a learned gate decides how to mix them.
Branch 1: Token Compression
Groups consecutive tokens into blocks of 32 (with stride 16, so blocks overlap). A small MLP compresses each block into a single representative key-value pair, with position encoding baked in so the model knows where in the sequence the compressed token came from.
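This gives the model a bird’s-eye view of the full context: each compressed token summarizes 32 raw tokens, and with a stride of 16 the compressed sequence is about 16x shorter than the original. Think of it as a blurry map: you can see the shape of the whole document, but you can’t read individual lines.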
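A sketch of the block layout helps here. Mean-pooling is my stand-in for the learned compression MLP (and I leave out the position encoding entirely), so treat this as an illustration of the indexing, not of what the network computes. Each compressed key summarizes 32 tokens, and because a new block starts every 16 tokens, there are about n/16 of them.

```python
def block_starts(n_tokens, block_size=32, stride=16):
    """Start index of every overlapping compression block that fits in the sequence."""
    return list(range(0, n_tokens - block_size + 1, stride))

def compress_blocks(keys, block_size=32, stride=16):
    """Stand-in for the learned compression MLP: mean-pool each block of key vectors."""
    dim = len(keys[0])
    compressed = []
    for start in block_starts(len(keys), block_size, stride):
        block = keys[start:start + block_size]
        compressed.append([sum(vec[d] for vec in block) / block_size for d in range(dim)])
    return compressed

# Toy keys (illustrative only): 128 tokens with 2-dimensional vectors.
keys = [[float(t), 1.0] for t in range(128)]
compressed = compress_blocks(keys)
print(len(keys), "tokens ->", len(compressed), "compressed keys")
print(len(block_starts(65_536)), "compressed keys at 64K context")
```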
Branch 2: Token Selection
Picks the 16 most relevant blocks (each 64 tokens) based on importance scores computed from the compression branch’s output. These 16 blocks (about 1,024 tokens) are attended to with full precision. One of the 16 is always the first block in the sequence (important for instruction-following), and two are always the most recent blocks.
This is the “zoom lens.” The model picks specific regions to examine at full resolution, guided by what the compression branch flagged as interesting.
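The selection rule can be sketched as a top-k with a few reserved slots. The importance scores here are invented for the example, and the function name and defaults are my own framing of the settings described above (16 blocks of 64 tokens, one reserved for the start of the sequence and two for the most recent blocks).

```python
SELECTION_BLOCK = 64  # tokens per selection block, as described above

def select_blocks(block_scores, n_select=16, n_initial=1, n_local=2):
    """Return selected block indices: forced initial and most recent blocks, then top scorers."""
    n_blocks = len(block_scores)
    forced = set(range(min(n_initial, n_blocks)))
    forced |= set(range(max(0, n_blocks - n_local), n_blocks))
    remaining = [b for b in range(n_blocks) if b not in forced]
    remaining.sort(key=lambda b: block_scores[b], reverse=True)
    n_free = max(0, n_select - len(forced))
    return sorted(forced | set(remaining[:n_free]))

# Hypothetical importance scores for 10 blocks (not from the paper).
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.1, 0.2, 0.4]
chosen = select_blocks(scores, n_select=5)
tokens = [t for b in chosen for t in range(b * SELECTION_BLOCK, (b + 1) * SELECTION_BLOCK)]
print(f"blocks {chosen} -> {len(tokens)} tokens attended at full resolution")
```

Selecting whole contiguous blocks rather than scattered tokens is what keeps the memory reads GPU-friendly.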
Branch 3: Sliding Window
A standard local window of 512 recent tokens, handled in its own separate branch. Keeping the sliding window isolated prevents the model from learning a shortcut where it only attends locally and ignores the compression and selection branches.
The Gate
An MLP with sigmoid activation produces a weight for each branch per query position. The final attention output is a weighted sum of all three branches. The gate is learned end-to-end, so the model figures out on its own when to lean on the global compressed view vs. the selected high-resolution blocks vs. the local window.
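In code, the mixing step for one query position looks roughly like this. The branch outputs and gate logits are hypothetical numbers; in the real model the logits come from the gate MLP applied to the query’s input features, which I skip here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_mix(branch_outputs, gate_logits):
    """Weighted sum of per-branch attention outputs for one query position."""
    gates = [sigmoid(g) for g in gate_logits]
    dim = len(branch_outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, branch_outputs)) for d in range(dim)]
    return mixed, gates

# Hypothetical branch outputs: compression, selection, sliding window.
compressed_out = [0.2, 0.4]
selected_out = [1.0, -1.0]
window_out = [0.5, 0.5]
mixed, gates = gated_mix([compressed_out, selected_out, window_out],
                         gate_logits=[0.0, 2.0, -2.0])
print("gates:", [round(g, 3) for g in gates])
print("mixed:", [round(x, 3) for x in mixed])
```

Because each gate is an independent sigmoid rather than one softmax over the three branches, the weights don’t have to sum to one: the model can turn several branches up or down at the same time.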
What NSA Loads Into Memory During Decoding
At 64K context length, full attention requires reading back all 65,536 cached key-value pairs for every new token. NSA reads roughly 5,632: the compressed representations, 16 selected blocks, and the 512-token window. That’s a 91% reduction in memory traffic per decode step. Other approaches to this problem, like TriAttention’s trigonometric KV cache compression, achieve similar ratios through different mechanisms.
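The 5,632 figure falls straight out of the branch settings. This estimate approximates the compressed-token count as one entry per 16-token stride (the exact overlapping-block count is one lower), and the function name is my own.

```python
def nsa_tokens_read(context_len, stride=16, n_selected_blocks=16,
                    selection_block=64, window=512):
    """Approximate KV entries one decode step touches under the settings quoted above."""
    compressed = context_len // stride          # about one compressed entry per stride
    selected = n_selected_blocks * selection_block
    return compressed + selected + window

context_len = 65_536
sparse_reads = nsa_tokens_read(context_len)
reduction = 1 - sparse_reads / context_len
print(f"dense: {context_len}, sparse: {sparse_reads}, reduction: {reduction:.1%}")
```

Note which term dominates: at 64K the compressed view accounts for 4,096 of the 5,632 entries, and it is the only term that keeps growing with context length.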
NSA Benchmark Results
DeepSeek trained a 27B-parameter model (3B active with Mixture-of-Experts) with NSA on 270B tokens at 8K length, then extended to 32K with YaRN positional interpolation. Here’s what they measured:
| Benchmark | Full Attention | NSA | Delta |
|---|---|---|---|
| General tasks (9-task avg) | 0.443 | 0.456 | +0.013 |
| LongBench (32K context avg) | 0.437 | 0.469 | +0.032 |
| GSM8K (math) | — | — | +0.034 |
| DROP (reading comp.) | — | — | +0.042 |
| HumanEval (code) | — | — | +0.013 |
NSA beat full attention outright. On LongBench, it scored 0.469 vs. full attention’s 0.437, outperforming every other sparse baseline too (H2O at 0.303, InfLLM at 0.383, Quest at 0.392).
The speed results:
| Operation (64K seq) | Speedup vs. Full Attention |
|---|---|
| Forward pass | 9.0x |
| Backward pass | 6.0x |
| Decoding | 11.6x |
And on reasoning: after distilling chain-of-thought from DeepSeek-R1, the NSA model scored 0.146 on AIME 2024 (16K token budget) vs. 0.092 for the full-attention model trained identically. The sparse model reasons better with the same token budget.
The Sparse Frontier: What the Largest Empirical Study Found
While DeepSeek’s NSA paper proved that natively trained sparse attention can exceed full attention, a separate team asked a broader question: how do training-free sparse methods actually compare across different model sizes, sequence lengths, and task types?
The Sparse Frontier (Nawrot et al., submitted April 2025, revised June 2026) tested six sparse attention methods (Vertical-Slash, FlexPrefill, Block-Sparse for prefilling; SnapKV, Ada-SnapKV, Quest for decoding) across multiple model families, with the primary evaluation on Qwen 2.5 at 7B, 14B, 32B, and 72B, plus Llama 3.1 and Gemma 3. Sequence lengths reached 128K, sparsity levels up to 95% (20x compression).
The headline finding: bigger sparse models beat smaller dense models at the same compute cost. If you have a fixed GPU budget, you’re better off running a 72B model with 10x sparse compression than a 7B model with full attention.
But the paper also found sharp limits:
Decoding handles compression better than prefilling. The 32B and 72B models maintained performance at 17x compression during decoding, regardless of sequence length. Prefilling was more fragile: 10x worked on average, but individual tasks started degrading at compression ratios below 5x.
Small models break faster. The 7B model’s decoding performance collapsed from 12x safe compression at 16K tokens to just 5x at 128K tokens. The 72B model barely flinched at the same compression ratios.
No single method wins everywhere. Vertical-Slash dominated retrieval tasks during prefilling. Block-Sparse handled reasoning better. Quest was the most consistent for decoding across task types. The paper’s blunt conclusion: “There is no clear strategy that performs best across tasks and phases.”
Aggregate metrics hide task-specific failures. Almost every configuration had at least one task where the maximum tolerable compression dropped below 5x. An average “10x safe compression” can contain a 2x failure on a specific retrieval task that matters to your pipeline.
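A tiny check like the following is how I’d guard against that in practice. The task names and limits are hypothetical placeholders, not numbers from the paper; you would fill them in from your own per-task evaluations.

```python
def safe_compression(per_task_max, required_floor):
    """Compare the average tolerable compression with the worst task."""
    mean = sum(per_task_max.values()) / len(per_task_max)
    worst_task = min(per_task_max, key=per_task_max.get)
    failing = sorted(t for t, c in per_task_max.items() if c < required_floor)
    return mean, worst_task, failing

# Hypothetical per-task limits (max compression before accuracy drops); not from the paper.
limits = {"qa": 15.0, "summarization": 12.0, "multi_needle_retrieval": 2.0, "reasoning": 11.0}
mean, worst, failing = safe_compression(limits, required_floor=5.0)
print(f"average safe compression: {mean:.0f}x")
print(f"worst task: {worst} at {limits[worst]:.0f}x")
print(f"tasks below the 5x floor: {failing}")
```

The average says 10x is fine; the worst task says your retrieval pipeline breaks at anything above 2x. Set your compression ratio from the minimum over the tasks you care about, not from the mean.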
What You Can Actually Use Today
If you’re running long-context inference in production, these results map to three concrete scenarios:
For prefilling speedup (processing long prompts): MInference is the most production-ready option. It’s training-free, works as a drop-in on existing models, and delivers up to 10x on 1M-token contexts. The tradeoff: it only helps with prefilling, not decoding.
For KV cache compression during decoding: H2O and Quest are the most battle-tested. Keep 20-30% of tokens as heavy hitters, evict the rest. Monitor per-task accuracy, because the aggregate numbers can hide failures on specific retrieval patterns. For a broader view of inference optimization techniques beyond attention, see our guide to LLM inference optimization research.
For new model training: NSA’s architecture is the most compelling direction. Training with sparse attention from scratch avoids the accuracy ceiling of bolt-on methods and can actually improve performance on long-context tasks. If you’re training your own model and plan to serve at 32K+ context, the NSA architecture is worth serious consideration.
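Run the numbers on your own workload. A 64K-context decode step with full attention reads ~65K cached KV pairs. NSA reads ~5.6K. If decoding dominates your bill and cost scales with the paper’s 11.6x decoding speedup (a best case, since prefilling and non-attention work don’t shrink by the same factor), a workload that costs $1,000/day on A100s drops to roughly $86/day. Over a month: $30,000 vs. about $2,590.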
```python
# Quick estimate: inference cost reduction from sparse attention
full_attention_cost_per_day = 1000  # USD, baseline
nsa_speedup = 11.6  # from NSA paper, 64K decoding
sparse_cost_per_day = full_attention_cost_per_day / nsa_speedup
monthly_savings = (full_attention_cost_per_day - sparse_cost_per_day) * 30
print(f"Full attention: ${full_attention_cost_per_day}/day")
print(f"NSA sparse: ${sparse_cost_per_day:.0f}/day")
print(f"Monthly savings: ${monthly_savings:,.0f}")
# Output:
# Full attention: $1000/day
# NSA sparse: $86/day
# Monthly savings: $27,414
```
The Open Questions
Three issues remain unresolved.
Fine-grained prefilling is still impractical. The Sparse Frontier paper found that estimating per-query token importance during prefilling costs almost as much as computing full attention, and the sparse CUDA kernels needed for arbitrary per-query patterns don’t exist yet in production frameworks. MInference works around this by classifying heads into a small set of fixed patterns, but that’s a workaround, not a general solution.
Multi-query tasks degrade faster. When a single context needs to answer multiple diverse questions (not just one retrieval), sparse methods lose accuracy faster because each question would ideally select a different subset of tokens. Most benchmarks test single-query scenarios, which paints an overly optimistic picture of real-world reliability.
Composability with other optimizations. The papers test sparse attention, quantization, speculative decoding, and paged attention in isolation. Production stacks combine them all, and the interaction effects remain poorly understood. I ran into this myself: applying H2O-style eviction on top of 4-bit quantized KV cache introduced subtle accuracy regressions that didn’t show up with either technique alone.
FAQ
What is sparse attention in LLMs?
Sparse attention is a modification to the standard transformer attention mechanism that lets each token attend to only a subset of other tokens instead of all of them. By skipping irrelevant token pairs, it reduces the quadratic O(n²) cost of standard attention to something closer to O(n log n) or O(n√n), depending on the method. The accuracy tradeoff is usually small. In some cases (like DeepSeek’s NSA), sparse models actually outperform dense ones.
What is DeepSeek’s Native Sparse Attention (NSA)?
NSA is a sparse attention architecture from DeepSeek that’s trained into the model from scratch rather than bolted on at inference time. It splits attention into three branches (token compression for a global view, token selection for high-resolution focus on relevant regions, and a sliding window for local context) then mixes them with a learned gate. On 64K sequences, it achieves 11.6x decoding speedup while scoring higher than full attention on general benchmarks.
What is the difference between sparse and dense attention?
Dense (standard) attention computes a score between every pair of tokens in the sequence. Sparse attention computes scores only between selected token pairs: typically nearby tokens, a few global anchor tokens, and dynamically chosen “important” tokens. Dense attention guarantees no information is missed but costs O(n²). Sparse attention trades a small amount of coverage for dramatically lower compute: a 64K sequence that reads all 65,536 KV pairs with dense attention might read only 5,632 with sparse methods.
Does sparse attention lose accuracy?
It depends on the method and compression ratio. Training-free methods like H2O and SnapKV can maintain near-dense accuracy at 5-10x compression, but specific tasks (especially multi-query retrieval) can degrade sharply. Natively trained methods like NSA avoid this by letting the model learn to route information efficiently — NSA scored 0.469 on LongBench vs. 0.437 for full attention. The Sparse Frontier study found that accuracy loss correlates more with model size than compression ratio: 72B models tolerate 17x compression with minimal degradation, while 7B models start breaking at 5x on long sequences.
How does sparse attention reduce memory and compute?
Two mechanisms. First, during prefilling (the forward pass over the prompt), sparse attention skips the dot products for token pairs that the sparsity pattern excludes, which directly reduces FLOPs. Second, during decoding, it evicts, compresses, or skips cached key-value pairs so each new token reads fewer bytes from GPU memory. The memory savings are often more impactful than the FLOP savings because decoding attention over long sequences is limited by memory bandwidth, not by compute.
Sources
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — arXiv:2502.11089 — DeepSeek’s NSA paper, the primary anchor for this article
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs — arXiv:2504.17768 — largest empirical study of training-free sparse attention across 7B-72B models
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — NeurIPS 2024 — Microsoft’s prefilling acceleration via attention pattern classification
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — NeurIPS 2023 — the KV cache eviction method based on attention score concentration
- How sparse attention solves the memory bottleneck in long-context LLMs — TechTalks — accessible overview of sparse attention techniques
Bottom Line
Sparse attention works. DeepSeek showed you can train with it and get better accuracy on long contexts. The Sparse Frontier confirmed that bolted-on sparse methods let you trade model size for sparsity profitably: a large model running sparse attention beats a small dense model at the same compute cost, and the 32B and 72B models tolerated up to 17x compression during decoding. MInference demonstrated 10x prefilling cuts on million-token contexts without retraining anything.
The part that still needs work is the engineering stack. Production inference frameworks are built around dense attention assumptions, and combining sparse attention with quantization and speculative decoding introduces interaction effects nobody’s fully characterized yet. But the cost pressure on long-context workloads is so acute (quadratic scaling hits hard once you pass 32K tokens) that sparse attention adoption is accelerating regardless. If you’re paying for GPU time on 64K+ contexts, reading the NSA paper and estimating what 11.6x decoding speedup would do to your bill is how you build next quarter’s budget.