TL;DR

AsyncTLS is a new long-context inference system from Yuxuan Hu and colleagues that fuses block-level and token-level sparse attention into a single pipeline, then hides KV cache memory traffic behind compute with an asynchronous offloading engine. On Qwen3 and GLM-4.7-Flash at 48k-96k contexts, the paper on arXiv reports 1.2x to 10x operator speedups and 1.3x to 4.7x end-to-end throughput improvements with accuracy “comparable to full attention.” If you serve long-context models, those numbers turn into real GPU-hour savings.

The Long-Context Tax

Long-context LLMs have a marketing problem. Vendors quote 1M-token windows in press releases, but anyone who has actually tried to run inference at 96k tokens with batch size higher than 1 knows the wall: attention is quadratic, the KV cache balloons past VRAM, and throughput collapses to single-digit tokens per second.

The two costs stack. Quadratic attention means a 96k context spends roughly 9x more compute per layer than a 32k context (3x the tokens, squared). The KV cache footprint scales linearly, which sounds tame until you remember that a 32B model with GQA on 96k tokens already runs to tens of gigabytes just for the cache, before activations, before any other model state. You either offload chunks to CPU memory and pay the PCIe round-trip on every step, or you spend money on a bigger GPU.
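
The cache arithmetic is worth sanity-checking. A back-of-envelope sketch, using illustrative GQA dimensions (64 layers, 8 KV heads, head dim 128, fp16) rather than any particular model's exact config:

```python
# Back-of-envelope KV cache sizing for a GQA model.
# All dimensions are illustrative, not taken from the paper.
def kv_cache_bytes(seq_len, batch=1, layers=64, kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

per_token = kv_cache_bytes(1)
at_96k    = kv_cache_bytes(96_000)
at_96k_b2 = kv_cache_bytes(96_000, batch=2)

print(f"{per_token / 1024:.0f} KiB per token")       # 256 KiB per token
print(f"{at_96k / 1e9:.1f} GB at 96k, batch 1")      # 25.2 GB at 96k, batch 1
print(f"{at_96k_b2 / 1e9:.1f} GB at 96k, batch 2")   # 50.3 GB at 96k, batch 2
```

Batch size is the multiplier that turns "fits" into "doesn't": the cache scales linearly in both context length and concurrent sequences.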

Sparse attention is the standard escape hatch. The intuition is simple: at any given decoding step, most past tokens don’t contribute much to the next prediction, so skip computing attention against all of them. Pick a subset, attend only to that subset, save compute and bandwidth.

The catch is that there are two ways to do “pick a subset” and they’ve been fighting each other for two years.

Block-Level vs Token-Level: A Two-Year Argument

The first school is block-level sparse attention: divide the past into contiguous blocks of, say, 64 tokens, score each block by some cheap proxy, then attend to the top-k blocks. Methods in this family (Sparse Transformers, BigBird, and a long line of follow-ups) are fast because block-level operations map cleanly onto GPU tile shapes. They pay an accuracy tax, though. If the single token you needed sits inside a block that scored low overall, you lose it.

The second school is token-level sparse attention: compute a per-token importance score, then attend to the top-k tokens regardless of where they live. Accuracy recovers because the granularity matches the actual signal, but the indexing overhead is brutal. Sorting per-token scores and gathering scattered KV entries from non-contiguous memory locations defeats the GPU’s appetite for coalesced reads.

You save FLOPs and lose them again on memory traffic.

The field has been stuck in this tradeoff: block-level is fast and slightly lossy, token-level is accurate and slightly slow. Most production systems pick block-level and live with the accuracy hit.
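
A toy example makes the tradeoff concrete. The scores below are invented; the point is that block-level pooling can bury a single high-importance token inside a weak block, while token-level selection recovers it:

```python
# Toy contrast of block-level vs token-level top-k selection.
# Scores are made up; one crucial token sits inside an otherwise weak block.
scores = [0.1, 0.1, 0.1, 0.9,   # block 0: mostly weak, one crucial token
          0.4, 0.4, 0.4, 0.4,   # block 1: uniformly moderate
          0.5, 0.5, 0.5, 0.5]   # block 2: uniformly moderate
BLOCK = 4

# Block-level: score each block by its mean, keep the top 2 blocks.
block_scores = [sum(scores[i:i + BLOCK]) / BLOCK
                for i in range(0, len(scores), BLOCK)]
top_blocks = sorted(range(3), key=lambda b: block_scores[b])[-2:]
block_kept = {t for b in top_blocks for t in range(b * BLOCK, (b + 1) * BLOCK)}

# Token-level: keep the top 8 tokens directly, wherever they live.
top_tokens = sorted(range(len(scores)), key=lambda t: scores[t])[-8:]

print(3 in block_kept)   # False: the 0.9 token is dropped with its block
print(3 in top_tokens)   # True: token-level selection recovers it
```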

AsyncTLS argues you don’t have to pick.

What AsyncTLS Actually Does

The system has two parts that work together. The first is the attention scheme itself — hierarchical, two-level. The second is an asynchronous offloading engine that does the unsexy but critical work of moving KV cache between GPU and CPU memory without blocking compute.

Part one: hierarchical sparse attention

Rather than choosing block-level or token-level, AsyncTLS runs both in sequence. Stage one is coarse-grained block filtering: divide the context into blocks, compute a cheap block-level relevance score, then drop the blocks that don’t make the cut. This shrinks the candidate pool by an order of magnitude almost for free, because block-level operations are GPU-friendly.

Stage two is fine-grained token selection inside the blocks that survived. With the candidate set already small, the per-token sorting and indexing overhead that killed pure token-level methods is back in budget. The expensive operation only runs over a fraction of the original tokens.

The architectural payoff: you get the accuracy of token-level selection (you still pick individual tokens, not whole blocks) with most of the speed of block-level methods (you only do the expensive token math on a small surviving set). The paper claims accuracy “comparable to full attention” across both Qwen3 and GLM-4.7-Flash, on both GQA and MLA architectures, which suggests the approach generalizes across attention variants rather than overfitting to one model family.
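
The two stages can be sketched in a few lines. Everything here is illustrative: the dot-product relevance, the mean-pooled block keys, and the keep ratios are assumptions for the sketch, not the paper's actual scoring functions:

```python
# Hedged sketch of two-stage (block-then-token) sparse selection.
# Scoring scheme and keep ratios are assumptions, not the paper's.
def two_stage_select(q, keys, block=4, keep_blocks=2, keep_tokens=4):
    score = lambda k: sum(qi * ki for qi, ki in zip(q, k))
    n_blocks = len(keys) // block

    # Stage 1: cheap block filter -- score each block by its mean key.
    def block_score(b):
        chunk = keys[b * block:(b + 1) * block]
        mean_key = [sum(col) / block for col in zip(*chunk)]
        return score(mean_key)
    survivors = sorted(range(n_blocks), key=block_score)[-keep_blocks:]

    # Stage 2: exact per-token scores, but only inside surviving blocks.
    candidates = [t for b in survivors for t in range(b * block, (b + 1) * block)]
    return sorted(candidates, key=lambda t: score(keys[t]))[-keep_tokens:]

# Toy 2-d keys: block 2 has the strongest keys, block 3 is middling.
keys = ([[0.1, 0]] * 4 + [[0.2, 0]] * 4 + [[0.6, 0]] * 4 + [[0.4, 0]] * 4)
print(sorted(two_stage_select([1.0, 0.0], keys)))   # [8, 9, 10, 11]
```

The expensive per-token sort in stage 2 runs over `keep_blocks * block` candidates instead of the full context, which is the whole trick.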

Part two: asynchronous KV cache offloading

The other half of the system is more pragmatic. Even with sparse attention reducing the working set, the KV cache for a 96k-context inference still doesn’t fit comfortably in GPU memory if you want any reasonable batch size. The fix is to offload to CPU memory and stream blocks back as needed, but the naive way of doing this kills throughput: CPU-to-GPU PCIe transfers are slow relative to a step’s compute, and a synchronous offload engine just stalls the GPU waiting for data.

AsyncTLS exploits a property the paper calls temporal locality: when sparse attention picks a set of blocks at step t, those same blocks have a high probability of being picked again at step t+1. The engine uses that prediction to prefetch the likely-needed blocks during the current step’s compute, hiding transfer latency behind work the GPU is already doing.

It’s the kind of optimization that sounds boring in a paper title but does more for production throughput than the algorithm itself — a clever attention algorithm that stalls on memory transfers gives you nothing in the real world.
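
The overlap itself is easy to demonstrate. The sketch below is a generic double-buffering simulation, with sleeps standing in for PCIe copies and GPU kernels; the timings are invented and nothing here is the paper's actual engine:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Double-buffered overlap sketch: while the "GPU" computes on chunk t,
# a background thread copies chunk t+1 up from host memory.
# COPY and COMPUTE are invented per-chunk costs, not measurements.
COPY, COMPUTE = 0.02, 0.03   # seconds per chunk

def run(chunks, overlap):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(time.sleep, COPY)          # copy chunk 0
        for _ in range(chunks):
            nxt.result()                           # wait for the copy
            if overlap:
                nxt = io.submit(time.sleep, COPY)  # prefetch next chunk...
                time.sleep(COMPUTE)                # ...underneath compute
            else:
                time.sleep(COMPUTE)                # compute first,
                nxt = io.submit(time.sleep, COPY)  # then copy: serialized
                nxt.result()
    return time.perf_counter() - start

t_sync  = run(8, overlap=False)   # ~ 8 * (COPY + COMPUTE)
t_async = run(8, overlap=True)    # ~ COPY + 8 * COMPUTE
print(f"serialized {t_sync:.2f}s, overlapped {t_async:.2f}s")
```

As long as the copy is cheaper than the compute it hides behind, the transfer cost disappears from the critical path entirely; that is the bet temporal locality makes viable.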

The Numbers

Here’s the headline result from the paper, tested on Qwen3 and GLM-4.7-Flash at 48k-96k context lengths:

  • 1.2-10x attention operator speedup
  • 1.3-4.7x end-to-end throughput improvement
  • 96k maximum context tested

The gap between the operator-level speedup (up to 10x) and the end-to-end throughput gain (up to 4.7x) is itself worth reading carefully.

It tells you that attention is no longer the only bottleneck at long context. MLP layers, KV cache reads, sampling, and non-attention overhead together still account for roughly half the wall-clock time even after the attention math gets dramatically cheaper. This matches what FlashAttention’s authors have been saying for years: optimizing one kernel in isolation hits diminishing returns once that kernel stops being the dominant cost.
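
You can recover that gap with Amdahl's law. Assuming attention is about 80% of a decode step's wall-clock at long context (an assumed fraction for illustration, not a number from the paper):

```python
# Amdahl-style sanity check: end-to-end gain from speeding up only the
# attention operator. The 0.80 fraction is an assumption, not measured.
def end_to_end_speedup(attn_fraction, attn_speedup):
    return 1 / ((1 - attn_fraction) + attn_fraction / attn_speedup)

print(round(end_to_end_speedup(0.80, 10), 2))    # 3.57
print(round(end_to_end_speedup(0.80, 100), 2))   # 4.81: near the ceiling
```

Under that assumption, a 10x operator speedup yields about 3.6x end to end, and even an infinitely fast attention kernel caps out at 5x, which is consistent with the 4.7x ceiling the paper reports.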

Even so, 1.3-4.7x end-to-end is not a rounding error. On a serving cluster running long-context workloads, that’s the difference between needing 4 GPUs and needing 1.

Architecture | What AsyncTLS supports | Coverage rationale
GQA (Grouped Query Attention) | Yes, tested on Qwen3 | The dominant attention variant in 2026; Llama, Mistral, and Qwen all use it
MLA (Multi-head Latent Attention) | Yes, tested on GLM-4.7-Flash | DeepSeek’s variant, increasingly adopted for memory-efficient serving
Standard MHA | Implied by GQA support | GQA generalizes MHA, so the same pipeline applies

Covering both GQA and MLA is a real win because the field has fragmented. A method that only works on one attention variant has a shrinking market. AsyncTLS hits both major branches.

Where the Speedup Comes From

A useful way to read the 1.2-10x operator speedup range is to ask which factors move a workload within it. The paper frames the speedup as a function of context length and sparsity ratio. At 48k context with moderate sparsity, you’re closer to 1.2-2x because the baseline isn’t that slow yet. At 96k with aggressive sparsity, you push toward 10x because the FLOP savings and the memory traffic savings compound.

The standard pattern for sparse-attention papers: gains widen as context grows. Practitioners running inference at 8k-16k won’t see dramatic numbers from this kind of work. Teams running RAG pipelines, agentic workflows with long tool histories, or document analysis at 64k and above are the audience.

What This Doesn’t Solve

The honest list of remaining problems:

  1. Prefill is still expensive. Sparse attention helps decoding because each new token needs to attend to all past tokens. Prefill — processing the initial prompt — has a different cost structure, and these speedups apply mainly to the decoding phase. If your workload is dominated by short prompts and long generations, you’ll see closer to the 4.7x ceiling. If it’s long prompts and short answers, much less.
  2. Accuracy “comparable to” is not “identical to.” The paper claims comparable accuracy, but every sparse-attention method has corner cases where it loses on specific tasks. Needle-in-a-haystack tests at the block boundaries are a known failure mode for two-level approaches. You’d want to validate on your specific eval set before committing.
  3. The asynchronous engine assumes spare PCIe bandwidth. If your serving setup is already saturating the PCIe bus with model loading, batch routing, or other transfers, the prefetch optimization will run into contention.
  4. No public code yet. The arXiv paper is a v1 submission. There’s no released implementation as of this writing, so reproducing the numbers means re-implementing the system. That gap usually closes within a few months for high-impact LLM systems papers, but it’s not closed today.
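
Point 1 is worth quantifying. A rough workload-mix model, with invented request times and an assumed 4x decode-phase gain:

```python
# Workload-mix sketch: only the decode phase gets the sparse-attention
# speedup. Times and the 4x decode gain are assumptions for illustration.
def request_speedup(prefill_s, decode_s, decode_gain=4.0):
    before = prefill_s + decode_s
    after = prefill_s + decode_s / decode_gain
    return before / after

# Chat-style: short prompt, long generation
print(round(request_speedup(prefill_s=1.0, decode_s=20.0), 2))   # 3.5
# RAG-style: long prompt, short answer
print(round(request_speedup(prefill_s=20.0, decode_s=1.0), 2))   # 1.04
```

Same kernel, same model: the generation-heavy request gets most of the advertised gain, while the prompt-heavy one barely moves.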

How This Compares to Recent Work

This isn’t the only sparse-attention paper that has shipped in the last six months. The space is crowded enough that it’s worth placing AsyncTLS on the map.

  • Pure block-level methods (BigBird family, native sparse attention from DeepSeek) are faster on the kernel but cap out lower on accuracy.
  • Pure token-level methods (DuoAttention, retrieval-attention variants) keep accuracy but stall on indexing overhead.
  • KV cache compression methods like TriAttention and Google’s TurboQuant attack the problem from a different angle — they keep full attention but shrink what’s stored, hitting 10x or higher KV memory savings without measurable accuracy loss.
  • Structural rewrites like recursive language models sidestep the attention cost entirely by chunking long context into nested calls that reason over summaries, which works when the task tolerates that decomposition.

The interesting observation is that AsyncTLS and KV cache compression are complementary, not competing. You could compress the cache and then run sparse attention over the compressed representation. Whether the two compose cleanly is an open empirical question, but the framing suggests this generation of long-context work is converging on a stack rather than a single silver bullet.

What Practitioners Should Take Away

If you don’t ship long-context inference, file this under “interesting” and move on. If you do, three things:

First, the hybrid block-then-token pipeline is the architecture worth understanding. Even if you don’t deploy AsyncTLS specifically, this two-stage pattern is going to show up in production inference engines (vLLM, SGLang, TensorRT-LLM) over the next year, because it solves the speed/accuracy tradeoff that previous methods couldn’t. Knowing the pattern helps you read the next ten papers.

Second, the asynchronous offloading engine is the part most production systems still get wrong. If you’re sizing GPU memory based on “fits the whole KV cache,” you’re leaving money on the table. The right architecture has been “stream from CPU memory and prefetch aggressively” for a while; AsyncTLS just makes it concrete with measured numbers.

Third, the gap between operator speedup and end-to-end throughput is a reminder that attention is no longer the only thing worth optimizing. If you have spare engineering bandwidth, MLP fusions, sampling pipelines, and request scheduling all matter as much as the attention kernel now.

FAQ

What is AsyncTLS in simple terms?

AsyncTLS is a long-context LLM inference system that runs sparse attention in two stages — first filtering at the block level, then selecting at the token level — while overlapping KV cache memory transfers with compute. The result is 1.3-4.7x faster end-to-end throughput on 48k-96k contexts.

Does AsyncTLS lose accuracy compared to full attention?

The paper reports accuracy “comparable to full attention” across Qwen3 and GLM-4.7-Flash on both GQA and MLA architectures. As with all sparse-attention methods, you should validate on your specific evaluation set before deploying — corner cases like needle-in-a-haystack tasks at block boundaries are known weak spots for two-level approaches.

What context lengths does AsyncTLS work best at?

The paper tests 48k to 96k tokens. Sparse-attention gains grow with context length, so you’ll see modest speedups at 32k and the largest speedups at 96k+. Below 16k context, the overhead of two-level filtering can outweigh the savings.

Is AsyncTLS open source?

As of this writing, the arXiv paper is the only public artifact. No code has been released yet. Open-source implementations from the community typically follow within a few months for high-impact inference papers.

Will this work with vLLM or other inference engines?

Not out of the box — AsyncTLS is currently a research system. The architectural ideas (two-level sparse attention, async offloading) are likely to be absorbed into production inference engines like vLLM, SGLang, and TensorRT-LLM over the coming year, but you can’t drop it into your existing serving stack today.

Does AsyncTLS help with prefill or just decoding?

Mainly decoding. Prefill (processing the initial prompt) has different cost characteristics, and the speedups quoted are primarily for the generation phase. Workloads with short prompts and long generations benefit most.

Bottom Line

AsyncTLS is the kind of paper that doesn’t invent a new mathematical idea but combines existing ones — block-level filtering, token-level selection, asynchronous prefetch — into a system that finally beats the speed-accuracy tradeoff that has held back sparse attention for two years. The end-to-end numbers (1.3-4.7x) are strong without being suspicious, and the breadth of architectures it supports (GQA, MLA, two model families) suggests the approach generalizes. The missing piece is open code; once that lands, expect to see this pattern absorbed into vLLM and SGLang within the next release cycle. If you’re running long-context inference at 64k+ tokens in production, this is the architecture pattern to track.