TL;DR
Every major LLM you’ve used (GPT, Claude, Gemini, Llama) generates text one token at a time, left to right. Mercury, built by Inception Labs, throws that away. It uses diffusion (the technique behind Stable Diffusion and DALL-E) to generate entire sequences in parallel, refining random noise into coherent text over a handful of steps. Mercury 2 hits 1,009 tokens per second on Blackwell GPUs, roughly 10x faster than Claude 4.5 Haiku and GPT-5 Mini, while scoring competitively on reasoning benchmarks. The architecture also has a built-in error correction mechanism that autoregressive models lack entirely.
Why Every LLM You’ve Used Has the Same Speed Ceiling
Autoregressive (AR) generation is sequential by design. To generate token 50, the model needs token 49. To generate token 49, it needs token 48. And so on. Each token requires a full forward pass through billions of parameters.
This creates a hard speed ceiling. You can optimize the hardware, batch requests, use speculative decoding, or cache computations. But you can’t escape the core constraint: tokens come out one at a time.
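The sequential dependency can be sketched in a few lines. This is a toy illustration (the "model" is a stand-in arithmetic function, not a real network), but the loop structure is the point: each new token requires one full forward pass that consumes the previous token.

```python
# Toy sketch of the autoregressive bottleneck: every token needs one
# full forward pass, and each pass depends on the token before it.

def forward_pass(tokens):
    # Stand-in for a pass through billions of parameters; here it just
    # derives the next token from the last one (toy vocab of 50,257).
    return (tokens[-1] * 31 + 7) % 50257

def generate_ar(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):          # n_new strictly sequential steps
        tokens.append(forward_pass(tokens))
    return tokens

out = generate_ar([1, 2, 3], 500)   # 500 forward passes for 500 tokens
```

No amount of parallel hardware removes the `for` loop: pass 50 cannot start until pass 49 finishes.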
On an H100 GPU, the fastest autoregressive models top out around 150-200 tokens per second. Claude 4.5 Haiku in reasoning mode averages about 89 tokens/sec. GPT-5 Mini sits around 71. Companies like Groq and Cerebras built entirely new chip architectures to push past that ceiling, and they did, but at the cost of custom silicon that most teams can’t access.
Diffusion models sidestep the problem entirely. Instead of generating tokens sequentially, they generate the whole output at once and refine it iteratively.
How Diffusion Works for Text
If you’ve seen how Stable Diffusion generates images, the core idea is familiar. Start with pure noise. Run a neural network that predicts what the clean output should look like. Subtract some noise. Repeat for a few steps. Each step makes the output cleaner.
Mercury applies this to text, but there are important differences from image diffusion.
The Forward Process: Adding Noise to Text
Training starts by taking real text sequences and progressively corrupting them. At each noise level, some tokens get replaced with random alternatives. At maximum noise, the sequence is entirely random, with no signal left.
The model learns to reverse this process: given a noisy sequence at any corruption level, predict what the clean text should be.
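A minimal sketch of the corruption step described above, assuming the simplest scheme: at noise level `t`, each token is independently replaced with a random vocabulary token with probability `t`. Real noise schedules (and the exact corruption Mercury uses) may differ; this only illustrates the shape of the forward process.

```python
import random

VOCAB_SIZE = 50257  # assumed vocabulary size, for illustration

def corrupt(tokens, t, rng=random):
    """Replace each token with a random one with probability t in [0, 1]."""
    return [rng.randrange(VOCAB_SIZE) if rng.random() < t else tok
            for tok in tokens]

clean = [10, 20, 30, 40, 50]
print(corrupt(clean, 0.0))   # t=0: unchanged
print(corrupt(clean, 1.0))   # t=1: every position resampled, no signal left
```

Training pairs a corrupted sequence at some `t` with its clean original, and the model learns the reverse mapping.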
The Reverse Process: From Noise to Text
At inference time, Mercury starts with a sequence of random tokens (pure noise) and runs the denoising model repeatedly. Each step, the model looks at the current noisy sequence and predicts improvements to multiple tokens simultaneously.
This is the key difference from autoregressive generation. An AR model produces token 1, then token 2, then token 3. Mercury produces a rough draft of all tokens at once, then refines that draft over several passes. Early steps make coarse corrections — getting the general topic, sentence structure, and key terms roughly right. Later steps handle fine details: exact word choices, punctuation, grammatical agreement.
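The refinement loop can be sketched as follows. The denoiser here is a stand-in that gradually commits positions toward a known target, whereas the real model predicts clean tokens from the noisy draft; the structural point is that the loop runs for a fixed number of steps, each of which may revise any position.

```python
import random

TARGET = [7, 3, 9, 1, 4, 8, 2, 6]   # pretend this is the "clean" text

def denoise_step(draft, step, n_steps):
    # Stand-in denoiser: each step commits a growing fraction of
    # positions, coarse-to-fine. Any position may change at any step.
    keep = (step + 1) / n_steps
    return [t if random.random() < keep else d
            for d, t in zip(draft, TARGET)]

def generate_diffusion(n_steps=8):
    draft = [random.randrange(10) for _ in TARGET]   # pure noise
    for step in range(n_steps):   # n_steps total, independent of length
        draft = denoise_step(draft, step, n_steps)
    return draft
```

Contrast the loop bound with the autoregressive case: it is the number of refinement steps, not the number of tokens.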
Andrej Karpathy described the distinction well: autoregressive models go left to right, one token at a time. Diffusion doesn’t go left to right at all. It goes from blurry to sharp, everywhere at once.
Why This Is Faster
An AR model generating a 500-token response needs 500 sequential forward passes. Mercury might need 8-16 refinement steps total, regardless of output length. Each step processes the full sequence in parallel, which maps directly onto how modern GPUs are designed to work — massive parallel matrix multiplications across thousands of cores.
The arithmetic intensity (ratio of computation to memory access) is much higher for diffusion steps than for AR token generation. AR inference is memory-bandwidth bound: each token generation reads the full model weights from memory but does relatively little computation. Diffusion steps do proportionally more computation per memory read, keeping the GPU’s tensor cores busy instead of idle.
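A back-of-envelope calculation makes both points concrete. The numbers below (a 7B-parameter fp16 model, 500 output tokens, 16 refinement steps) are illustrative assumptions, not published Mercury figures.

```python
# Rough comparison of sequential step count and arithmetic intensity.
params = 7e9            # assumed model size (weights read per pass)
bytes_per_param = 2     # fp16
seq_len = 500           # output length

# Autoregressive: one full weight read per generated token, but
# compute for only a single token position per read.
ar_sequential_passes = seq_len
ar_intensity = (2 * params) / (params * bytes_per_param)        # FLOPs/byte

# Diffusion: one weight read per refinement step, but compute for
# the whole sequence in that pass.
diff_steps = 16
diff_intensity = (2 * params * seq_len) / (params * bytes_per_param)

print(ar_sequential_passes / diff_steps)   # 31.25x fewer sequential steps
print(diff_intensity / ar_intensity)       # 500x more FLOPs per byte read
```

Under these assumptions, diffusion does ~31x fewer sequential steps and 500x more computation per byte of weights read, which is exactly the regime GPU tensor cores are built for.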
Mercury’s Architecture
Mercury uses a standard Transformer backbone. That’s a deliberate choice; the diffusion approach is about the training and generation algorithms, not the network architecture. This means Mercury benefits from all the same hardware and software optimizations (FlashAttention, tensor parallelism, quantization) that have been built for autoregressive Transformers.
Mercury Coder (June 2025)
The first release targeted code generation:
| Model | Throughput (H100) | Quality |
|---|---|---|
| Mercury Coder Mini | 1,109 tokens/sec | Competitive with small AR models |
| Mercury Coder Small | 737 tokens/sec | Comparable to midsize AR models |
For comparison, the fastest AR models on the same H100 hardware top out around 150-200 tokens/sec. Mercury achieved throughput that was previously only possible on specialized inference chips from Groq or Cerebras, but on commodity Nvidia GPUs.
Mercury 2 (February 2026)
Mercury 2 added reasoning capabilities, making it the first reasoning-capable diffusion LLM:
| Benchmark | Mercury 2 | Claude 4.5 Haiku | GPT-5 Mini |
|---|---|---|---|
| AIME 2025 | 91.1 | — | — |
| GPQA | 73.6 | — | — |
| LiveCodeBench | 67.3 | — | — |
| SciCode | 38.4 | — | — |
| Throughput | ~1,000 tok/s | ~89 tok/s | ~71 tok/s |
| End-to-end latency | 1.7s | 23.4s | — |
Mercury 2 runs on Blackwell GPUs and is priced at $0.25 per million input tokens and $0.75 per million output tokens. The quality scores place it in the same range as Claude 4.5 Haiku and GPT-5 Mini for reasoning tasks, but it delivers answers in 1.7 seconds where those models take 14-23 seconds.
The Built-In Error Correction Advantage
This is the part that gets underappreciated. Autoregressive models have a commitment problem: once token 30 is generated, it’s locked in. If that token was wrong, the model has to work around it for every subsequent token. Errors compound. This is one reason LLMs hallucinate — an early wrong word forces the model down a path it can’t backtrack from.
Diffusion models don’t have this problem. Because they refine the entire sequence at each step, a mistake at position 30 can get corrected in the next refinement pass. The model sees the full context, including its own earlier rough draft, and can fix inconsistencies globally.
Inception claims this built-in self-correction reduces hallucinations compared to AR models of equivalent size. They haven’t published detailed hallucination benchmarks, so I’d treat this as a plausible architectural advantage rather than a proven result. But the mechanism makes sense: iterative global refinement should catch more self-contradictions than one-pass left-to-right generation.
What Diffusion LLMs Can Do That AR Models Can’t
Beyond raw speed, diffusion opens up capabilities that are awkward or outright impossible with autoregressive generation:
Infilling: AR models generate left to right. If you want to fill in a gap in the middle of existing text, you need workarounds (special tokens, separate models). Diffusion models can generate tokens in any order, so infilling text between two fixed passages is natural.
Controllable generation: Want the output to satisfy a constraint (specific format, certain keywords, safety properties)? With AR models, you’re limited to prompt engineering or expensive beam search. Diffusion models can incorporate constraints at each refinement step through guided sampling, similar to classifier-free guidance in image diffusion.
Parallel editing: Since the model operates on the whole sequence, it can revise multiple parts simultaneously. Ask it to “make this email more formal” and it can adjust the greeting, body, and sign-off in a single refinement pass rather than rewriting left to right.
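The infilling case above can be sketched with a toy sampler. The prefix and suffix are pinned, and only the gap positions are refined; the "denoiser" is a stand-in that averages neighboring tokens, where a real model would predict gap tokens from the surrounding context.

```python
import random

def infill(prefix, suffix, gap_len, n_steps=8, vocab=10):
    draft = prefix + [random.randrange(vocab) for _ in range(gap_len)] + suffix
    gap = range(len(prefix), len(prefix) + gap_len)   # only these may change
    for _ in range(n_steps):
        # Refine every gap position in parallel; pinned context is untouched.
        draft = [((draft[i - 1] + draft[i + 1]) // 2) if i in gap else tok
                 for i, tok in enumerate(draft)]
    return draft

filled = infill([1, 1, 1], [9, 9, 9], gap_len=3)
print(filled[:3], filled[-3:])   # prefix and suffix survive: [1, 1, 1] [9, 9, 9]
```

An AR model has no natural equivalent of "hold these future tokens fixed while generating these"; the diffusion sampler gets it by simply never updating the pinned positions.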
The Tradeoffs
Diffusion LLMs aren’t strictly better. There are real limitations.
Quality gap on frontier tasks: Mercury 2 competes with Haiku-class and Mini-class models, not with Claude Opus or GPT-5 on the hardest benchmarks. The gap is closing, but if you need peak reasoning quality and don’t care about speed, autoregressive frontier models still win.
Streaming: AR models give you tokens as they’re generated, so you see the response forming word by word. Diffusion models generate everything at once and reveal it when done. For interactive chat, this means a pause followed by the complete answer, rather than the familiar streaming experience. Mercury’s 1.7-second latency mostly makes this a non-issue, but it changes the UX.
Maturity: The autoregressive approach has been refined over 7+ years since GPT-2. Diffusion for text is much newer. The tooling, optimization techniques, and deployment patterns are still being figured out. Inference frameworks like vLLM and TensorRT-LLM are built around AR assumptions. Supporting diffusion will require changes throughout the stack.
There’s also what you might call the Karpathy question: why does text prefer autoregression while images prefer diffusion? This has been an open question for years. The short answer is that language has strong left-to-right dependencies (the beginning of a sentence constrains the end more than vice versa), while images have more spatially distributed information. Mercury suggests that diffusion can work for text too, but whether it can match AR quality at frontier scale remains unproven.
Who’s Backing This
Inception isn’t some garage startup. Their investors include Menlo Ventures, Mayfield, Innovation Endeavors, M12 (Microsoft’s venture arm), Snowflake Ventures, and Databricks Ventures. Individual backers include Andrew Ng and Andrej Karpathy, two people who understand transformer architectures deeply and still chose to bet on diffusion.
The company positions Mercury as a drop-in replacement for autoregressive LLMs, supporting RAG, tool use, and agentic workflows through standard API interfaces. They’re targeting production use cases where latency and cost matter more than peak benchmark scores: real-time coding assistants, customer service agents, high-throughput document processing.
What This Means for Developers
If you’re building latency-sensitive applications: Mercury 2’s 1.7-second end-to-end latency is in a different league. Think real-time code completion, instant document analysis, or AI agents that need sub-second tool calls. At $0.25/$0.75 per million tokens, the economics work for high-volume use cases too.
If you care about inference costs: The 10x throughput advantage translates directly to lower cost per token on equivalent hardware. If your workload is throughput-bound (batch processing, offline analysis), diffusion models could cut your inference bill significantly.
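As a quick sanity check on the economics at the listed pricing ($0.25 per million input tokens, $0.75 per million output tokens) — the workload numbers here are assumptions for illustration:

```python
# Cost estimate at Mercury 2's listed per-token pricing.
price_in, price_out = 0.25, 0.75           # $ per million tokens
input_tokens, output_tokens = 2_000, 500   # per request (assumed)
requests = 1_000_000

cost = requests * (input_tokens * price_in + output_tokens * price_out) / 1e6
print(f"${cost:,.2f} for 1M requests")     # $875.00 for 1M requests
```

Under these assumptions, a million moderately sized requests cost under $1,000, which is the "high-volume use cases" territory the article describes.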
If you’re running local models: No local diffusion LLMs exist yet that match Mercury’s quality, but the architecture’s efficiency advantage should translate to consumer GPUs too. Pair that with TurboQuant’s 6x KV cache compression and local inference is about to get a lot more capable.
If you’re building AI agents: The error-correction mechanism and low latency are both wins for agentic systems. Agents that make tool calls and chain reasoning steps benefit from both speed (more iterations per second) and reliability (fewer hallucinated tool calls). Given what we know about how errors cascade in multi-agent systems, anything that reduces per-step error rates matters.
FAQ
Is Mercury open source?
Partially. The original Mercury Coder paper (arXiv:2506.17298) is public, but the model weights are proprietary. You can access Mercury through Inception’s API. There are no open-weight diffusion LLMs at Mercury’s scale yet, though academic implementations of smaller diffusion language models exist.
Can diffusion models do chain-of-thought reasoning?
Yes. Mercury 2 demonstrated this explicitly, scoring 91.1 on AIME 2025, which requires multi-step mathematical reasoning. The diffusion process handles reasoning by iteratively refining the reasoning chain. Early steps establish the overall approach, later steps fix logical errors and fill in details.
Will this replace autoregressive models?
Probably not entirely, at least not soon. AR models still lead on peak quality for the hardest tasks. The more likely outcome is that diffusion LLMs carve out the production tier where speed, cost, and reliability matter more than hitting the absolute ceiling on benchmarks. Think of it like the relationship between compiled and interpreted languages: each wins in different contexts.
How does Mercury handle long contexts?
Inception hasn’t published detailed long-context benchmarks yet. The architecture should scale well since diffusion steps process the full sequence in parallel, but the memory cost of storing the full sequence during refinement could be significant at very long context lengths. This is an open question.
Why haven’t other companies built diffusion LLMs?
They’re trying. Google and Meta both have research papers on discrete diffusion for language. But productionizing diffusion for text requires solving engineering problems that don’t exist for autoregressive models: efficient parallel sampling, adapting existing inference infrastructure, handling variable-length outputs. Inception had a multi-year head start on these problems.
Bottom Line
For seven years, every major language model has generated text the same way: one token at a time, left to right. Mercury proves that’s a choice, not a physical law. Diffusion can generate text 10x faster on the same hardware by refining entire sequences in parallel, and Mercury 2 shows this works for reasoning tasks too. The quality isn’t quite frontier-grade yet, but it’s competitive enough to be useful, and the architecture has structural advantages (error correction, infilling, controllable generation) that AR models can’t replicate. If diffusion LLMs keep improving at this pace, the “which model should I use” question is about to get a lot more interesting.
