TL;DR
TriAttention is a new KV cache compression method that exploits a property most researchers overlooked: before RoPE rotates them, query and key vectors cluster tightly around fixed centers. That clustering turns attention into a trigonometric function of positional distance, which TriAttention uses to score which keys actually matter. The result: 10.7x less KV memory or 2.5x faster throughput on 32K-token reasoning tasks, with no accuracy loss on the headline benchmark. It already ships as a vLLM plugin and lets you run OpenClaw (32B) on a single RTX 4090.
Why KV Cache Is the Real Bottleneck
If you’ve tried running a 32B model locally with a long context window, you already know the problem. The model weights fit in VRAM. The KV cache doesn’t.
Every generated token adds a new key-value pair to the cache for every attention layer and head. A 32K-token reasoning chain on Qwen3-32B with full attention pushes total VRAM use past 24GB before the model even finishes thinking. On an RTX 4090 (the high-end sweet spot for most local LLM users), that means out-of-memory crashes mid-conversation.
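Back-of-the-envelope arithmetic shows why. A minimal sketch of the cache-size formula — the Qwen3-32B-like configuration values below (64 layers, 8 grouped-query KV heads, head dimension 128) are assumptions for illustration, not figures from the paper:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, dtype_bytes=2):
    """KV cache memory: 2 tensors (K and V) per layer, per KV head,
    per token, at dtype_bytes per element (2 for FP16/BF16)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * dtype_bytes

# Assumed Qwen3-32B-like config: 64 layers, GQA with 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(64, 8, 128, 1)        # 262,144 bytes = 256 KiB per token
full_32k = kv_cache_bytes(64, 8, 128, 32_768)    # ~8 GiB for a single 32K sequence
print(per_token, full_32k / 2**30)
```

Even with grouped-query attention, one 32K sequence costs gigabytes on top of the quantized weights, and the cost grows linearly with both context length and batch size — which is exactly what a 10% eviction budget cuts proportionally.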
The standard fix is KV cache compression: figure out which keys are important, drop the rest. Methods like H2O, SnapKV, PyramidKV, and StreamingLLM have taken different approaches to this. They all share one assumption that turns out to be flawed.
The RoPE Problem With Existing Methods
All those existing methods estimate key importance by looking at attention scores from recent queries. Makes sense on the surface: if a key got high attention recently, it’s probably important.
But modern LLMs use Rotary Position Embedding (RoPE), which continuously rotates query and key vectors based on their position in the sequence. A query at position 1000 has been rotated differently than a query at position 500. So when you use a recent query to judge which keys matter, you’re using a viewpoint that’s literally been rotated away from where earlier queries stood.
The authors of TriAttention put it bluntly: representative queries are very few in post-RoPE space, leading to poor top-key selection and unstable reasoning. You might keep the wrong keys and drop critical ones, and the model’s chain of thought falls apart.
The Core Insight: Q/K Concentration in Pre-RoPE Space
TriAttention’s key discovery is simple once you see it. Before RoPE gets applied, query and key vectors in trained LLMs aren’t randomly scattered across the embedding space. They cluster tightly around fixed, non-zero centers, and these centers stay stable regardless of token position.
Think of it this way. RoPE is like spinning a compass needle. The needle changes direction depending on where you are in the sequence, but the magnetic field underneath doesn’t change. TriAttention looks at the magnetic field instead of the spinning needle.
Because Q and K vectors concentrate around their centers, you can approximate them by those centers. Plug the centers into the RoPE rotation formula, and the attention score between any query-key pair simplifies into a trigonometric function that depends only on the distance between their positions. That’s where the name comes from: the attention-vs-distance relationship becomes a trigonometric series.
This is a much more stable signal than raw post-RoPE attention scores. The trigonometric curve tells you which positional distances a given attention head prefers — some heads attend mostly to nearby tokens, others prefer specific ranges. That preference doesn’t shift as the sequence grows longer.
```mermaid
flowchart TD
    A["Raw Q, K Vectors<br/>(pre-RoPE)"] --> B["Observe Q/K Concentration<br/>around fixed centers"]
    B --> C["Approximate Q ≈ center_q<br/>K ≈ center_k"]
    C --> D["Substitute into RoPE formula"]
    D --> E["Attention logit ≈<br/>trigonometric series f(distance)"]
    E --> F["Score keys by<br/>trig distance curve + norms"]
    F --> G["Keep top-scoring keys<br/>Evict the rest"]
    G --> H["10.7x smaller KV cache<br/>same reasoning accuracy"]
```
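The distance-only property is easy to verify in a few lines of stdlib Python: fix q and k at their (hypothetical) cluster centers, rotate them RoPE-style at different absolute positions, and the dot product depends only on the position gap. The toy dimensions, base frequency, and center values below are illustrative, not the paper's:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE: rotate each consecutive (even, odd) pair of vec by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Fixed vectors standing in for the pre-RoPE Q/K cluster centers.
q_c = [0.9, 0.1, 0.8, -0.2, 0.7, 0.3, 0.6, 0.0]
k_c = [0.5, 0.4, -0.1, 0.6, 0.2, 0.2, 0.3, 0.1]

# Same position gap (10), very different absolute positions:
s1 = dot(rope_rotate(q_c, 100), rope_rotate(k_c, 90))
s2 = dot(rope_rotate(q_c, 1010), rope_rotate(k_c, 1000))
print(abs(s1 - s2) < 1e-9)  # True: the logit is a function of distance alone
```

This is just the rotation identity R(m)ᵀR(n) = R(n−m) at work: once q and k are frozen at their centers, absolute position cancels out and only the gap survives.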
How TriAttention Scores Keys
The method combines two signals to decide which keys to keep:
1. Trigonometric distance scoring. For each attention head, TriAttention computes the trigonometric attention-vs-distance curve from the Q/K centers. Keys at positions that fall on high-attention distances for that head get high scores. This captures which positions the head inherently cares about.
2. Q/K norm weighting. Even within the same cluster, some vectors deviate more from the center than others. Vectors with larger norms carry more information. TriAttention uses the norm of each key vector as a secondary signal, upweighting keys that are more informative regardless of position.
The final importance score is a combination of both. Keys below the threshold get evicted. The whole process runs per-head, so different heads can keep different subsets of the cache. A head that focuses on local context keeps nearby keys; a head scanning for long-range patterns keeps distant ones.
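The two signals can be sketched as a per-head scoring function. The trig curve f(d) comes from rotating the head's Q/K centers against each other at gap d; the combination rule (a plain product with the key norm) and the top-k eviction policy below are illustrative assumptions, not the paper's exact formula:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE: rotate each consecutive (even, odd) pair of vec by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def trig_curve(q_center, k_center, dist):
    """Predicted attention logit for a key `dist` positions behind the query."""
    rotated_q = rope_rotate(q_center, dist)
    return sum(a * b for a, b in zip(rotated_q, k_center))

def score_keys(q_center, k_center, cached_keys, query_pos, keep_ratio=0.1):
    """Signal 1 (trig distance preference) x signal 2 (key norm); keep the top fraction."""
    scores = []
    for pos, k in cached_keys:
        norm = math.sqrt(sum(x * x for x in k))
        scores.append((trig_curve(q_center, k_center, query_pos - pos) * norm, pos))
    scores.sort(reverse=True)
    n_keep = max(1, int(len(scores) * keep_ratio))
    return {pos for _, pos in scores[:n_keep]}  # positions retained; the rest evicted

# Toy example: 100 cached keys, one head, 10% budget.
q_c = [0.9, 0.1, 0.8, -0.2, 0.7, 0.3, 0.6, 0.0]
k_c = [0.5, 0.4, -0.1, 0.6, 0.2, 0.2, 0.3, 0.1]
cache = [(p, [0.1 * (p % 5)] * 8) for p in range(100)]
kept = score_keys(q_c, k_c, cache, query_pos=100, keep_ratio=0.1)
print(len(kept))  # 10 positions survive
```

In the real method this runs once per head with that head's own centers, which is what lets local-context heads and long-range heads retain different subsets.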
Benchmark Results
The comparison is on AIME25 (competition math problems) with 32K-token generation budgets, which stress-tests both reasoning depth and memory:
| Method | KV Budget | AIME25 Accuracy | Throughput vs Full |
|---|---|---|---|
| Full Attention | 100% | 46.7% | 1.0x |
| StreamingLLM | 10% | 18.3% | 2.6x |
| H2O | 10% | 22.0% | 2.5x |
| SnapKV | 10% | 24.7% | 2.5x |
| PyramidKV | 10% | 23.3% | 2.5x |
| TriAttention | 10% | 46.7% | 2.5x |
At 10% KV budget (keeping only 1 in 10 key-value pairs), every existing method loses roughly half its accuracy. TriAttention matches full attention exactly.
On MATH500, TriAttention matches full attention at 30% KV budget while competing methods need 50-70% to get close. On LongBench (summarization, QA, code) and RULER (retrieval), the pattern holds: TriAttention consistently outperforms StreamingLLM, H2O, SnapKV, PyramidKV, and Ada-KV across all task types.
Why This Matters for Local LLM Users
If you’re running models locally, memory is usually the binding constraint. Long-context reasoning (the kind you need for coding agents, document analysis, or multi-turn conversations) is where memory kills you.
Here’s a concrete example from the paper. Running OpenClaw (a 32B coding agent) with Qwen3-32B AWQ INT4 quantization on an RTX 4090:
- Full attention: the model weights consume most of the 24GB of VRAM, and the first OpenClaw request alone exceeds 15K tokens. After a few interaction rounds the KV cache overflows and the server crashes with an OOM error.
- With TriAttention: the KV cache stays within budget throughout the entire multi-turn session. The agent reads all the documents, generates the report, and finishes without running out of memory.

In other words, the agent completes the full session instead of crashing three rounds in.
Running It Yourself
TriAttention ships as an open-source vLLM plugin. The setup takes about two minutes:
```shell
pip install triattention

# Start vLLM server with TriAttention enabled
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq \
    --kv-cache-dtype auto \
    --triattention-budget 0.1 \
    --port 8000
```
The `--triattention-budget 0.1` flag sets the KV budget to 10%, the sweet spot from the benchmarks. The vLLM server exposes an OpenAI-compatible API, so you can point OpenClaw, any chat UI, or your own code at `http://localhost:8000/v1` and it just works. No changes to client code needed.
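Any OpenAI-compatible client works unchanged. A minimal stdlib-only sketch of a request against the server above — the model name and endpoint match the launch command, and note there is nothing TriAttention-specific on the client side:

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-32B-AWQ",
    "messages": [{"role": "user", "content": "Summarize the repo structure."}],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer none"},
)
# With the server from the previous step running:
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

The `Authorization` header is a placeholder; a local vLLM server without `--api-key` accepts any value there.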
The plugin hooks into vLLM’s attention computation transparently. It computes the Q/K centers during a warmup phase (the first few hundred tokens), then starts evicting low-scoring keys from that point forward. The overhead is minimal: the trigonometric scoring adds less than 3% latency compared to running without compression.
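The warmup-phase center estimation can be sketched as a running mean over pre-RoPE vectors — the kind of statistic the plugin needs before scoring can start. This is a guess at the mechanics; the plugin's actual estimator isn't documented in the material here:

```python
class CenterEstimator:
    """Running mean of pre-RoPE key (or query) vectors for one attention head."""

    def __init__(self, head_dim):
        self.center = [0.0] * head_dim
        self.count = 0

    def update(self, vec):
        # Incremental mean: center += (vec - center) / count
        self.count += 1
        self.center = [c + (v - c) / self.count for c, v in zip(self.center, vec)]

    def ready(self, warmup_tokens=256):
        # Only trust the center after a few hundred observations.
        return self.count >= warmup_tokens

est = CenterEstimator(head_dim=4)
for _ in range(300):
    est.update([1.0, -0.5, 0.25, 0.0])  # toy stream of identical vectors
print(est.ready(), est.center)  # True [1.0, -0.5, 0.25, 0.0]
```

The incremental form avoids storing the warmup tokens themselves, which matters if the estimate runs per layer and per head.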
How It Compares to TurboQuant
If you read my breakdown of Google’s TurboQuant, you might wonder how these two methods relate. They’re complementary.
TurboQuant compresses the values of the KV cache entries. It quantizes them more aggressively using PCA and entropy coding, shrinking each entry’s memory footprint. It achieves 6x compression by making each stored KV pair smaller.
TriAttention compresses by eviction. It decides which KV pairs to keep and which to discard entirely, achieving 10.7x compression by storing fewer entries at full (or quantized) fidelity.
You could stack them. Use TriAttention to keep only the top 10-30% of keys, then apply TurboQuant to compress those remaining entries. In theory, that’s 60-100x total compression. The authors mention this as future work but haven’t benchmarked the combination yet.
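The stacking arithmetic is just the product of the two ratios; the 60-100x range presumably reflects different operating points for each method, and the numbers below only reproduce the headline endpoints:

```python
def stacked_compression(eviction_ratio, quant_ratio):
    """Total compression when eviction and quantization compose multiplicatively."""
    return eviction_ratio * quant_ratio

print(stacked_compression(10.7, 6.0))  # ~64x at the headline operating points
# A tighter eviction budget or deeper quantization pushes toward the 100x end.
```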
Limitations Worth Knowing
The trigonometric scoring assumes Q/K concentration holds, and the authors show it does across Qwen3, LLaMA 3, and DeepSeek R1 architectures. But it’s an empirical observation that hasn’t been proven to hold universally. If a future architecture breaks this concentration property, TriAttention’s scoring would degrade.
The warmup phase also means you can’t compress from the very first token. The method needs a few hundred tokens to estimate the Q/K centers accurately. For short prompts, this doesn’t matter. For streaming applications where every token counts, it’s something to keep in mind.
And like all eviction-based methods, there’s a floor. Push the budget below ~5% and even TriAttention starts losing accuracy. The 10% sweet spot gives you 10.7x compression with no loss, but you can’t get 100x compression from eviction alone.
FAQ
Does TriAttention work with any model? It’s been tested on Qwen3 (8B, 32B), LLaMA 3 (8B, 70B), and DeepSeek R1. The Q/K concentration property appears consistent across transformer architectures using RoPE, which covers most modern open-weight models.
Can I use it with llama.cpp or Ollama? Not yet. The current implementation is a vLLM plugin. The authors have said a GGUF-compatible version is planned but haven’t given a timeline.
What’s the minimum GPU for running this? Any GPU that can load the model weights. If you can run Qwen3-32B-AWQ on your card but hit OOM during long conversations, TriAttention should fix that. The 24GB RTX 4090 is the tested reference point.
Does it work for non-reasoning tasks? Yes. LongBench results cover summarization, QA, dialogue, and code. The accuracy retention is slightly lower than on math tasks (where the gains are most dramatic), but still beats all baselines at the same compression ratio.
Is the accuracy “zero loss” claim real? On AIME25 at 10% KV budget, TriAttention got 46.7% — identical to full attention. On other benchmarks it’s within 1-2% at 30% budget. “Zero loss” is accurate for the headline result but not universal across every benchmark and setting.
Bottom Line
TriAttention exploits a property of pre-RoPE vectors that other methods overlooked: the pre-rotation structure is dead predictable. Using that structure as a trigonometric scoring signal, it avoids the unstable guesswork that cripples other KV compression methods during long reasoning.
The practical upshot is real. A single RTX 4090 can now run 32B coding agents through multi-turn sessions that previously crashed mid-conversation. The vLLM plugin works today, it’s open source, and stacking it with quantization-based compression like TurboQuant could push the limits further.
For local model users hitting memory walls on long context, the paper and vLLM plugin are both available now.
