AsyncTLS: 4.7x Faster Long-Context LLM Inference With Two-Level Sparse Attention
AsyncTLS sparse attention fuses block filtering, token selection, and async KV cache offloading for 1.3-4.7x throughput gains at 48k-96k token contexts.
Recursive language models treat a huge prompt as a Python variable the model can grep and recurse over. MIT's paper shows the approach beats GPT-5 on long-context tasks.
TriAttention uses pre-RoPE vector concentration and trigonometric scoring to compress KV cache 10.7x while matching full attention accuracy on reasoning tasks.