TL;DR

This guide covers five recent research papers that each attack LLM inference costs from a different angle: quantizing the KV cache (TurboQuant), replacing autoregressive decoding with diffusion (Mercury), fixing residual connections so models train better at scale (DeepSeek mHC), evicting unimportant keys using trigonometry (TriAttention), and fusing block-level and token-level sparse attention (AsyncTLS). I’ve written a full breakdown of each paper separately. This page connects them, explains where they overlap, and lays out which combinations actually make sense if you’re building or running inference infrastructure.

The Inference Cost Problem

Training a large model is expensive once. Serving it is expensive forever.

In 2026, the economics of LLM deployment are dominated by three resources: GPU memory for the KV cache, compute for attention, and bandwidth for moving data between GPU and CPU memory. As context windows stretch past 100K tokens and reasoning models generate long chains of thought before answering, these costs grow superlinearly. Attention cost scales quadratically with sequence length, so a 96K-token context costs roughly 9x more per layer than a 32K context ((96/32)² = 9). The KV cache alone for a 32B model at that length eats 30+ GB of VRAM.

The five papers I’m covering here each attack a different piece of this cost structure. Some shrink what’s stored. Some skip computation entirely. One rethinks how tokens are generated in the first place. None of them is a complete solution on its own, but taken together they sketch out what efficient LLM inference will look like over the next year.

Here’s the map:

| Approach | Paper | What It Attacks | Headline Number |
| --- | --- | --- | --- |
| KV cache quantization | TurboQuant | Memory per cached entry | 6x compression, zero accuracy loss |
| Parallel decoding | Mercury | Sequential token generation | 1,000+ tokens/sec (10x faster) |
| Training architecture | DeepSeek mHC | Training instability at scale | Stable from 3B to 27B, +7 points on BBH |
| KV cache eviction | TriAttention | Number of cached entries | 10.7x memory reduction, zero accuracy loss |
| Sparse attention | AsyncTLS | Compute per attention step | 1.3-4.7x end-to-end throughput |

Quantization: TurboQuant Shrinks Every Cache Entry by 6x

The most straightforward way to fit more context into memory: make each cached key-value pair smaller.

Google’s TurboQuant does this by converting KV vectors from Cartesian to polar coordinates (PolarQuant), which eliminates the per-block normalization metadata that limited previous quantization methods. For keys, where quantization errors get amplified during attention, it adds a 1-bit correction layer (QJL) based on the Johnson-Lindenstrauss transform. The result is 2.5-3 bits per value, down from 16, with zero measured accuracy loss on Needle-in-a-Haystack, LongBench, RULER, and ZeroSCROLLS.

I found the polar coordinate trick genuinely elegant. Previous KV cache quantizers were stuck around 2.6x compression before accuracy degraded because they spent bits on bookkeeping (scale factors, zero points, per-block metadata). PolarQuant sidesteps all of that. After a random rotation, the angle distributions are predictable enough that a single global codebook works for every block. The math isn’t trivial, but the engineering consequence is simple: no calibration data needed, no per-model setup; drop it in and get 6x compression on any model.
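To make the polar idea concrete, here is a rough NumPy sketch of quantizing one vector in polar coordinates with shared global codebooks. The bit widths, the per-vector radius scale, and the codebook construction below are my own simplifications for illustration, not the paper’s actual design:

```python
import numpy as np

def polar_quantize(v, angle_bits=3, radius_bits=2, rng=None):
    """Illustrative PolarQuant-style sketch: rotate, pair up dimensions,
    and quantize each 2D pair as (radius, angle) with global codebooks.
    5 bits per coordinate pair here is roughly 2.5 bits per dimension."""
    d = v.shape[-1]                                  # assumes d is even
    rng = rng or np.random.default_rng(0)
    # A random rotation makes the angle distribution predictable,
    # so one global codebook can serve every block (no per-block scales).
    R = np.linalg.qr(rng.standard_normal((d, d)))[0]
    x = (R @ v).reshape(-1, 2)                       # (d/2, 2) coordinate pairs
    radius = np.linalg.norm(x, axis=1)
    angle = np.arctan2(x[:, 1], x[:, 0])             # in [-pi, pi]
    angle_codes = np.round((angle + np.pi) / (2 * np.pi) * (2**angle_bits - 1))
    r_scale = radius.max() + 1e-8                    # simplification: one scalar per vector
    radius_codes = np.round(radius / r_scale * (2**radius_bits - 1))
    return angle_codes.astype(np.uint8), radius_codes.astype(np.uint8), r_scale, R

def polar_dequantize(angle_codes, radius_codes, r_scale, R,
                     angle_bits=3, radius_bits=2):
    angle = angle_codes / (2**angle_bits - 1) * 2 * np.pi - np.pi
    radius = radius_codes / (2**radius_bits - 1) * r_scale
    x = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return R.T @ x.reshape(-1)                        # undo the rotation
```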

Community implementations already exist for PyTorch (Triton kernels), MLX, and llama.cpp. On an RTX 4090, a developer at Tonbi Studio got character-identical output to the uncompressed baseline at 2-bit precision on Gemma 3 4B. The hardware speedup numbers (up to 8x on the attention kernel on H100s) are for the attention logit computation specifically, not end-to-end inference. Real-world gains will be smaller but still significant for long-context workloads.

Read the full breakdown: Google’s TurboQuant Compresses LLM Memory 6x With Zero Accuracy Loss

Parallel Decoding: Mercury Hits 1,000 Tokens/Sec With Diffusion

Every major LLM you’ve used generates tokens one at a time. Token 50 requires token 49, which requires token 48. This creates a hard speed ceiling that no amount of hardware optimization can break. On an H100, the fastest autoregressive models top out around 150-200 tokens per second.

Mercury, from Inception Labs, throws that constraint away. It uses diffusion to generate entire sequences in parallel, refining random noise into coherent text over 8-16 steps. Mercury 2 on Blackwell GPUs hits 1,009 tokens per second with a 1.7-second end-to-end latency, roughly 10x faster than Claude 4.5 Haiku and GPT-5 Mini. It scored 91.1 on AIME 2025, placing it firmly in the reasoning-capable tier.
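Mercury’s exact sampler isn’t public, but a generic masked-diffusion-style decoding loop shows why generation is parallel. Everything below (the model interface, the mask token, the commit schedule) is a simplified assumption, not Inception Labs’ implementation:

```python
import torch

@torch.no_grad()
def diffusion_decode(model, prompt_ids, gen_len=128, steps=12, mask_id=0):
    """Generic masked-diffusion-style decoding sketch (not Mercury's sampler):
    every step scores ALL positions in parallel, then commits the most
    confident ones, so speed depends on the number of refinement steps,
    not on sequence length. Assumes model(tokens) returns (seq_len, vocab) logits."""
    device = prompt_ids.device
    gen = torch.full((gen_len,), mask_id, dtype=torch.long, device=device)
    unknown = torch.ones(gen_len, dtype=torch.bool, device=device)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, gen]))[len(prompt_ids):]
        conf, pred = logits.softmax(-1).max(-1)
        # Commit a growing fraction of the still-masked positions each step.
        n_commit = max(1, int(unknown.sum() * (step + 1) / steps))
        conf = conf.masked_fill(~unknown, -1.0)       # ignore already-fixed tokens
        idx = conf.topk(min(n_commit, int(unknown.sum()))).indices
        gen[idx] = pred[idx]
        unknown[idx] = False
        if not unknown.any():
            break
    return gen
```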

This matters for a different reason than the other four papers. TurboQuant, TriAttention, and AsyncTLS all optimize the existing autoregressive pipeline. Mercury replaces the pipeline entirely. If diffusion LLMs mature to frontier quality, the entire optimization stack for autoregressive inference becomes irrelevant for those models. That’s a big if, though. Mercury 2 competes with Haiku-class and Mini-class models, not with Opus or GPT-5 on the hardest benchmarks. The quality gap is closing, but it hasn’t closed.

Diffusion also gives you capabilities autoregressive models can’t replicate. Infilling (generating text in the middle of existing text) is natural. Error correction is built in because the model refines the entire sequence at each step instead of committing to tokens left-to-right. And the arithmetic intensity is higher, meaning the GPU tensor cores stay busy instead of waiting on memory reads.

Read the full breakdown: Diffusion Language Models Explained: How Mercury Generates 1,000 Tokens Per Second

Architecture Innovation: DeepSeek’s mHC Fixes Scaling With a 1967 Algorithm

This paper sits upstream of inference. It doesn’t make a deployed model faster. It makes models better during training, which means the model you deploy performs better per parameter.

The problem: standard residual connections (the skip connections in every transformer since 2017) are a bottleneck. ByteDance’s hyper-connections replace the single residual stream with multiple parallel streams, giving richer information flow between layers. They work great at small scale. At 27B parameters, the learned mixing matrices amplify signals by 3,000x and training explodes.

DeepSeek’s fix, detailed in their mHC paper, constrains those mixing matrices to be doubly stochastic using the Sinkhorn-Knopp algorithm from 1967. A doubly stochastic matrix has rows and columns that each sum to 1, so multiplying by it redistributes signal without amplifying it. The composite gain across all layers stays at 1.6 instead of 3,000. Training is stable from 3B to 27B parameters, and the constrained version actually beats unconstrained hyper-connections on benchmarks: +7.2 points on BBH, +6.9 on DROP, +7.1 on GSM8K compared to standard residuals.
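The core mechanism is easy to see in code. Here is a minimal Sinkhorn-Knopp sketch in PyTorch showing the alternating row/column normalization; the function and variable names are mine, and this is not DeepSeek’s actual mHC implementation:

```python
import torch

def sinkhorn_knopp(logits, n_iters=10):
    """Push a square matrix of mixing logits toward the doubly stochastic set
    by alternately normalizing rows and columns (Sinkhorn-Knopp, 1967)."""
    M = logits.exp()                          # positive entries
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)    # rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)    # columns sum to 1
    return M

# Rows and columns both sum to ~1, so mixing redistributes signal across
# residual streams without amplifying it.
mix = sinkhorn_knopp(torch.randn(4, 4))
print(mix.sum(dim=1), mix.sum(dim=0))         # both ≈ tensor([1., 1., 1., 1.])
```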

The training overhead is 6.7%. A training run that takes 100 GPU-hours with standard residuals takes about 107 with mHC. For the performance gains you get, that’s a trade anyone would take.

I’m including mHC in this hub even though it’s a training technique because better-trained models need fewer parameters for the same quality. Fewer parameters means less memory, faster inference, and lower serving costs. If mHC lets you get GPT-5-level performance from a model that’s 20% smaller, every inference optimization in this guide benefits from that.

Read the full breakdown: DeepSeek’s mHC: How a 1967 Algorithm Fixed the Biggest Problem in Scaling LLMs

KV Cache Eviction: TriAttention Drops 90% of Keys Using Trigonometry

Where TurboQuant makes each KV pair smaller, TriAttention asks a different question: which pairs can you throw away entirely?

Previous eviction methods (H2O, SnapKV, PyramidKV) estimate key importance from recent attention scores. The problem is that modern LLMs use RoPE, which continuously rotates query and key vectors based on position. A recent query’s attention scores give you a rotated, biased view of key importance. At 10% KV budget (keeping 1 in 10 pairs), every existing method loses roughly half its accuracy on AIME25. TriAttention matches full attention exactly.

The trick: before RoPE rotates them, query and key vectors cluster tightly around fixed centers. That clustering turns attention into a trigonometric function of positional distance. TriAttention uses this stable signal to score keys instead of relying on post-RoPE attention, which shifts with every token.
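As a loose sketch of what pre-RoPE scoring buys you: because pre-RoPE queries cluster around a stable center, scoring keys against that center gives a position-unbiased importance estimate. The code below is a simplification (the paper’s actual trigonometric scoring function is more involved, and the names and recency-window heuristic are mine):

```python
import torch

def evict_kv(pre_rope_keys, pre_rope_query_center, budget=0.10, keep_recent=64):
    """Simplified pre-RoPE eviction sketch, not TriAttention's exact method.

    pre_rope_keys:         (seq_len, head_dim) keys before rotary embedding
    pre_rope_query_center: (head_dim,) running mean of pre-RoPE queries
    Returns indices of cache entries to keep (top `budget` fraction).
    """
    seq_len = pre_rope_keys.shape[0]
    scores = pre_rope_keys @ pre_rope_query_center    # position-stable importance
    scores[-keep_recent:] = float("inf")              # always keep a recent window
    n_keep = max(keep_recent, int(budget * seq_len))
    return scores.topk(n_keep).indices.sort().values
```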

The practical impact is concrete. Running OpenClaw (a 32B coding agent) on a single RTX 4090 with full attention crashes mid-conversation when the KV cache overflows. With TriAttention at 10% budget, the agent finishes the entire multi-turn session without hitting OOM. The vLLM plugin installs with pip install triattention and you add one flag (--triattention-budget 0.1) to your server launch command.

This is the paper in this hub that’s most ready for production use today. Open-source, available as a plugin, tested on Qwen3, LLaMA 3, and DeepSeek R1.

Read the full breakdown: TriAttention Compresses KV Cache 10.7x: How Trigonometry Fixed Long-Context Reasoning

Sparse Attention: AsyncTLS Gets 4.7x Throughput by Attending to Less

Sparse attention has been around for years. The idea is obvious: at any decoding step, most past tokens don’t contribute much to the prediction, so skip them. The execution has been stuck in a tradeoff. Block-level methods (divide context into blocks, score blocks, attend to top-k blocks) are fast but miss individual important tokens inside low-scoring blocks. Token-level methods (score every token individually) are accurate, but the sorting and scatter-gather overhead kills GPU utilization.

AsyncTLS argues you don’t have to pick. Run block-level filtering first to shrink the candidate pool by 10x, then run token-level selection only on the survivors. The expensive per-token math runs on a fraction of the original tokens, so the indexing overhead stays manageable. You get token-level accuracy with block-level speed.
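Here is a rough PyTorch sketch of the two-level selection. The block size, budgets, and mean-key block score are illustrative choices of mine, not AsyncTLS’s kernels:

```python
import torch

def two_level_select(q, keys, block_size=64, top_blocks=8, top_tokens=256):
    """Hierarchical sparse selection sketch: cheap block-level filtering first,
    exact per-token scoring only on the survivors.

    q:    (head_dim,) current query
    keys: (seq_len, head_dim) cached keys
    Returns indices of tokens to attend to.
    """
    seq_len, d = keys.shape
    n_blocks = (seq_len + block_size - 1) // block_size
    pad = n_blocks * block_size - seq_len
    padded = torch.cat([keys, keys.new_zeros(pad, d)]) if pad else keys
    blocks = padded.view(n_blocks, block_size, d)
    # Level 1: score each block by its mean key (coarse, fast).
    block_scores = blocks.mean(dim=1) @ q
    chosen = block_scores.topk(min(top_blocks, n_blocks)).indices
    # Level 2: exact per-token scores, but only inside the surviving blocks.
    cand = (chosen[:, None] * block_size + torch.arange(block_size)).flatten()
    cand = cand[cand < seq_len]                      # drop padding positions
    token_scores = keys[cand] @ q
    top = token_scores.topk(min(top_tokens, cand.numel())).indices
    return cand[top]
```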

The second half of the system is an asynchronous KV cache offloading engine that exploits temporal locality (if a block was important at step t, it’s probably important at step t+1) to prefetch data from CPU memory during the current step’s compute. This hides PCIe transfer latency behind work the GPU is already doing. It’s the less exciting part of the paper but arguably does more for production throughput than the attention algorithm itself.
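The overlap pattern itself is standard CUDA-stream territory. A minimal sketch of the idea follows, assuming pinned CPU tensors and a hypothetical predict_next_blocks helper (model_step and cpu_kv_blocks are placeholders too); AsyncTLS’s engine is considerably more sophisticated:

```python
import torch

def decode_with_prefetch(model_step, cpu_kv_blocks, predict_next_blocks, n_steps):
    """Compute/transfer overlap sketch (not AsyncTLS's actual engine):
    while the GPU runs step t, a side stream copies the KV blocks predicted
    to be needed at step t+1 from pinned CPU memory. Temporal locality makes
    the prediction cheap: blocks important now are usually important next step."""
    copy_stream = torch.cuda.Stream()
    prefetched = {}
    for t in range(n_steps):
        needed = predict_next_blocks(t + 1)          # block ids for the next step
        with torch.cuda.stream(copy_stream):         # async H2D copies off the main stream
            upcoming = {b: cpu_kv_blocks[b].to("cuda", non_blocking=True)
                        for b in needed}             # requires pinned host memory
        out = model_step(t, prefetched)              # step t compute on the main stream
        torch.cuda.current_stream().wait_stream(copy_stream)  # copies done before t+1
        prefetched = upcoming
    return out
```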

End-to-end results on Qwen3 and GLM-4.7-Flash at 48k-96k tokens: 1.3-4.7x throughput improvement. The gap between the operator speedup (up to 10x) and the end-to-end number tells you something important: attention is no longer the only bottleneck. MLP layers, sampling, and non-attention overhead together account for roughly half the wall-clock time even after attention gets dramatically cheaper.

Read the full breakdown: AsyncTLS: 4.7x Faster Long-Context LLM Inference With Two-Level Sparse Attention

How These Approaches Interact

The most interesting question isn’t which of these techniques is best. It’s which ones stack.

TurboQuant + TriAttention: These are directly complementary. TurboQuant compresses the values stored in each KV entry. TriAttention decides which entries to keep. Use TriAttention to keep the top 10% of keys, then apply TurboQuant to compress those remaining entries from 16-bit down to 2.5-3 bits. In theory, that’s 60-100x total KV memory reduction (10x from eviction stacked on roughly 6x from compression, more at tighter eviction budgets). The TriAttention authors mention this combination as future work but haven’t benchmarked it yet. I’d bet money it works because the two operate at completely different levels.

TriAttention + AsyncTLS: Both reduce the effective number of tokens the attention layer processes. TriAttention evicts keys from the cache permanently. AsyncTLS selects which cached keys to attend to at each step. They could compose, but there’s a risk of double-filtering, where AsyncTLS’s block scoring misses tokens that TriAttention kept specifically because they scored high on the trigonometric metric. Whether the two selection criteria align or conflict is an open empirical question.

TurboQuant + AsyncTLS: AsyncTLS’s offloading engine streams KV blocks between CPU and GPU memory. Compressed blocks transfer faster over the PCIe bus. This should compose well because TurboQuant is format-level compression while AsyncTLS is a scheduling and selection system. The prefetch predictions don’t depend on the precision of the cached values.

Mercury + everything else: Mercury replaces autoregressive decoding entirely. It doesn’t use a KV cache in the same way because it processes the full sequence at each diffusion step rather than generating tokens incrementally. TurboQuant, TriAttention, and AsyncTLS are all designed for the autoregressive pipeline, so they don’t apply to Mercury directly. That said, diffusion inference has its own memory challenges (storing the full sequence during refinement steps), and compression techniques adapted for diffusion will likely emerge.

mHC + everything: mHC operates at training time. It produces a better model, and every inference optimization works on whatever model you give it. mHC is orthogonal to everything in this list.

The practical takeaway: for autoregressive serving, the stack that’s most ready today is TriAttention (eviction) on top of TurboQuant (compression), with AsyncTLS-style hierarchical sparse attention for the longest contexts. For latency-sensitive applications where you can tolerate Haiku-class quality, Mercury’s diffusion approach is worth evaluating separately.

Reading Order

If you want to read all five articles, here’s the order I’d suggest:

  1. Start with TurboQuant. KV cache quantization is the most immediately useful technique, it’s conceptually clean, and it sets up the context for understanding the other KV-focused papers.

  2. Then TriAttention. After understanding how TurboQuant compresses each entry, TriAttention shows you how to select which entries to keep. Together they cover the full KV cache optimization picture.

  3. Then AsyncTLS. Once you understand the KV cache problem, sparse attention adds the computational dimension. AsyncTLS’s two-level approach builds on ideas from both the compression and eviction camps.

  4. Then Mercury. After seeing all the work needed to optimize autoregressive inference, Mercury’s “throw out the whole paradigm” approach hits differently. It’s the contrast that makes the story complete.

  5. Finish with DeepSeek mHC. This one sits upstream of the others. After reading about inference optimization, mHC shows what’s happening on the training side to produce models that need less optimization in the first place.

If you only have time for one, read TriAttention. It’s the most immediately actionable (ships as a vLLM plugin today) and the core insight is beautiful.

What’s Coming Next

Three areas of LLM efficiency research I haven’t covered yet but am watching:

Speculative decoding at scale. Small draft models that propose tokens for large models to verify. This is already shipping in production at several providers, but the research on optimal draft model selection and multi-token verification is evolving fast. It’s orthogonal to everything in this hub and potentially composable with all of it.

MoE routing efficiency. Mixture-of-experts models activate only a fraction of their parameters per token, but routing overhead and load imbalance eat into the theoretical savings. DeepSeek’s V3 architecture pushed this forward, and follow-up work on dynamic routing and expert merging is active.

Hardware-software co-design for inference. Groq’s LPU, Cerebras’s wafer-scale chips, and Etched’s Sohu are all purpose-built for transformer inference. As inference-specific silicon matures, the software optimization picture will shift. Techniques designed around GPU memory hierarchies may need rethinking for new hardware.

FAQ

Can I combine TurboQuant and TriAttention today?

Not with a turnkey integration. TurboQuant has community implementations in PyTorch, MLX, and llama.cpp. TriAttention ships as a vLLM plugin. To combine them, you’d need to run TriAttention for eviction inside vLLM and apply TurboQuant compression to the surviving entries. The components exist but nobody has published a combined benchmark yet. The two operate at different levels (per-entry compression vs. entry selection), so the combination should be straightforward to implement.

Which of these techniques helps most for local LLM users?

TriAttention, by a wide margin. It ships today as a pip-installable vLLM plugin and solves the most common local problem: OOM crashes during long conversations on 24GB GPUs. TurboQuant is a close second once llama.cpp integration lands (expected summer 2026). Mercury requires API access and isn’t available for local deployment yet.

Do these techniques work for fine-tuned models?

TurboQuant and TriAttention don’t require any model-specific adaptation. They work on any transformer using RoPE positional embeddings, which includes nearly all open-weight models in 2026. AsyncTLS has been tested on both GQA and MLA architectures, covering the two main attention variants. mHC requires retraining from scratch with the modified residual connections, so it’s not something you’d apply to an existing fine-tuned model.

Sources