Long-Context

research Jun 30, 2026 14 min

Sparse Attention Explained: How LLMs Handle Million-Token Contexts Without Melting Your GPU

How sparse attention cuts LLM inference cost by 10x on long contexts. Covers DeepSeek NSA, MInference, H2O, and The Sparse Frontier's findings.

research Apr 22, 2026 11 min

AsyncTLS: 4.7x Faster Long-Context LLM Inference With Two-Level Sparse Attention

AsyncTLS sparse attention fuses block filtering, token selection, and async KV cache offloading for 1.3x to 4.7x throughput gains at 48k to 96k token contexts.

research Apr 18, 2026 9 min

Recursive Language Models: How RLMs Beat Long Context

Recursive language models treat a huge prompt as a Python variable that the model can grep and recurse over. MIT's paper reports that this approach beats GPT-5 on long-context tasks.

research Apr 11, 2026 8 min

TriAttention Compresses KV Cache 10.7x — How Trigonometry Fixed Long-Context Reasoning

TriAttention uses pre-RoPE vector concentration and trigonometric scoring to compress KV cache 10.7x while matching full attention accuracy on reasoning tasks.