AsyncTLS: 4.7x Faster Long-Context LLM Inference With Two-Level Sparse Attention
AsyncTLS is a two-level sparse attention method that fuses block-level filtering, token-level selection, and asynchronous KV cache offloading, delivering 1.3-4.7x throughput gains at 48k-96k token contexts.
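To make the two-level pipeline concrete, here is a minimal sketch of block filtering followed by token selection over a long KV cache. Everything in it is illustrative: the function name, block size, top-k budgets, and mean-pooled block scoring are assumptions chosen for exposition, not the paper's actual kernels, and the asynchronous offload of cold KV blocks to host memory is omitted entirely.

```python
# Hypothetical sketch of two-level sparse attention in the spirit of
# AsyncTLS. Names, budgets, and scoring are assumptions, not the paper's.
import numpy as np

def two_level_sparse_attention(q, K, V, block_size=64, top_blocks=4, top_tokens=128):
    """Level 1: score KV blocks cheaply and keep the top few.
    Level 2: keep the highest-scoring tokens within surviving blocks."""
    n, d = K.shape
    n_blocks = n // block_size

    # Level 1: block filtering via one mean-pooled key per block,
    # so pruning costs a single dot product per block.
    block_keys = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q
    keep_blocks = np.argsort(block_scores)[-top_blocks:]

    # Gather the token indices belonging to the surviving blocks.
    token_idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep_blocks]
    )

    # Level 2: token selection by exact query-key score inside kept blocks.
    token_scores = K[token_idx] @ q
    keep_tokens = token_idx[np.argsort(token_scores)[-top_tokens:]]

    # Standard softmax attention over the selected sparse token set.
    # (A real system would overlap this with async KV offload/prefetch.)
    s = K[keep_tokens] @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[keep_tokens]

# Usage: a single query head over a 48k-token cache.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((48_000, 64))
V = rng.standard_normal((48_000, 64))
out = two_level_sparse_attention(q, K, V)
```

The two levels trade precision for cost in sequence: the coarse block pass discards most of the cache for almost nothing, and the exact token pass only pays full scoring cost on the small surviving subset.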