AsyncTLS: 4.7x Faster Long-Context LLM Inference With Two-Level Sparse Attention
AsyncTLS sparse attention fuses block filtering, token selection, and async KV cache offloading for 1.3-4.7x throughput gains at 48k-96k token contexts.
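The "two-level" selection named above can be illustrated with a minimal sketch: a coarse first level scores whole KV blocks (e.g. against mean-pooled keys) and discards most of them, and an exact second level scores only the tokens inside the surviving blocks. This is a hypothetical illustration of block filtering followed by token selection, not the AsyncTLS implementation; the pooling rule, block size, and budgets here are assumptions.

```python
import numpy as np

def two_level_select(q, k, block_size=4, top_blocks=2, top_tokens=4):
    """Hypothetical two-level sparse selection: block filter, then token pick.

    q: (d,) query vector; k: (n, d) cached keys.
    Returns sorted indices of the tokens kept for attention.
    """
    n_blocks = k.shape[0] // block_size
    # Level 1: cheap block filtering via mean-pooled keys (an assumed proxy).
    pooled = k[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    keep_blocks = np.argsort(pooled @ q)[-top_blocks:]
    # Level 2: exact token scores, but only inside the surviving blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep_blocks]
    )
    token_scores = k[cand] @ q
    return np.sort(cand[np.argsort(token_scores)[-top_tokens:]])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((16, 8))
idx = two_level_select(q, k)
print(idx)
```

At long contexts the payoff is that level 2's exact scoring touches only `top_blocks * block_size` tokens instead of all `n`, which is what makes room for overlapping KV cache transfers with compute.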