TL;DR
Cursor’s Composer 2 launched March 19, 2026 at $0.50 / $2.50 per million input/output tokens, roughly a tenth of Claude Opus 4.6 and a sixth of Sonnet 4.6. It clears Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 58.0) and trails GPT-5.4 (75.1) by a margin you can probably afford to give up. The catch is the model ID kimi-k2p5, which exposed that Composer 2 is built on Moonshot AI’s Kimi K2.5 — a base Cursor didn’t disclose at launch.
Why this release is the one I actually use
I’ve been on Cursor since the very first dogfood builds of Composer in October 2025, and Composer 1.5 was the model I never quite trusted for anything past a refactor of three files. Useful, often correct, but the moment you handed it a real agentic loop (clone repo, read 30 files, edit four, run tests), it would gracefully degrade into “I have made the changes you requested” while changing nothing.
Composer 2 is the first first-party Cursor model where I stopped reflexively switching to Opus 4.6 for anything serious.
The benchmark wins are real, the pricing isn’t a typo, and the speed difference is the kind you feel in your wrist after an afternoon of edits. But the rollout was also the messiest disclosure incident Anysphere has had, and the parts of the model I trust most are the parts they didn’t write. A month in, I still catch myself reaching for Opus on architectural calls and then talking myself back to Composer for anything routine, because the cost math punishes the wrong instinct. For years IDE model choice was dominated by quality gaps big enough that price was a rounding error; Composer 2 is the first model where that stops being true for the kind of work I do, and it takes a while to retrain the reflex.
After a month of daily use, here’s what holds up under real work, which benchmarks I actually trust, and what to make of the Kimi K2.5 thing.
What Composer 2 actually is
Cursor 2.0 shipped on October 29, 2025 with the original Composer, a coding model pitched as 4x faster than models of comparable quality that could spin up to eight agents in parallel on isolated git worktrees. Composer 1.5 came in February 2026 as a continued-pretraining iteration. Composer 2 dropped on March 19, 2026 as the third generation, and the Cursor team’s technical report says it’s built on a fresh continued-pretraining run plus reinforcement learning on long-horizon coding tasks.
The 200K context window is the same as 1.5. The architectural shift is what they call compaction-in-the-loop RL: when a session approaches its token budget, the model pauses and compresses its own context to about 1,000 tokens, down from the 5,000+ that traditional sliding-window approaches use. Because the compression step lives inside the RL training loop, the model learns which details to keep across the boundary, and the team reports a 50% reduction in compaction errors versus naive summarization.
This shows up most in the kind of session that runs three to five hours: a refactor across a service plus its tests plus its protobufs. With 1.5 I’d hit a wall around the third hour where the model would forget the test conventions I’d already corrected it on twice.
Composer 2 mostly remembers. Not perfectly, but the failure mode shifted from “forgets earlier decisions” to “occasionally over-compresses recent decisions”, and the latter is much easier to recover from.
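The mechanism is easy to picture as a loop around the inference call. Below is a minimal, hypothetical sketch of inference-time compaction based only on the numbers in Cursor’s report; the function names, the 4-turn tail, and the characters-per-token estimate are my illustrative assumptions, not Cursor’s implementation.

```python
# Hypothetical sketch of compaction at inference time. TOKEN_BUDGET and
# COMPACT_TARGET come from the figures quoted above; everything else
# (summarize, HEADROOM, the 4-turn tail) is illustrative.

TOKEN_BUDGET = 200_000   # context window
COMPACT_TARGET = 1_000   # reported post-compaction size
HEADROOM = 10_000        # trigger compaction before the hard limit

def count_tokens(messages):
    # Stand-in tokenizer: ~4 characters per token is a common rule of thumb.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, summarize):
    """If the transcript nears the budget, replace everything but the
    most recent turns with a model-written summary (~1K tokens)."""
    if count_tokens(messages) < TOKEN_BUDGET - HEADROOM:
        return messages
    recent = messages[-4:]  # keep the freshest turns verbatim
    summary = summarize(messages[:-4], max_tokens=COMPACT_TARGET)
    return [{"role": "system", "content": summary}] + recent
```

The “in-the-loop” part is what the sketch can’t show: because the summarize step runs inside RL training, the model is rewarded for summaries that preserve the details later turns actually need, rather than generic recaps.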
The benchmarks
Cursor reports three numbers that are worth pulling apart separately:
| Benchmark | Composer 2 | Composer 1.5 | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|
| CursorBench | 61.3 | 44.2 | ~58.2 | ~63.9 |
| Terminal-Bench 2.0 | 61.7 | 47.9 | 58.0 | 75.1 |
| SWE-bench Multilingual | 73.7 | 65.9 | not reported | not reported |
CursorBench is the one to take with salt. It’s their internal eval, not independently reproducible, and you should expect a model trained for the IDE to do well on a benchmark scored by the IDE. The 17-point jump over 1.5 is real, but the absolute number says more about Cursor’s evaluation than about the model.

Terminal-Bench 2.0 is the one I’d actually trust: a third-party benchmark of agentic terminal tasks, and Composer 2 at 61.7 nudging past Opus 4.6 at 58.0 is the headline finding from independent coverage at The New Stack and VentureBeat. Cursor did note in their methodology that for non-Composer models they used “the max score between official leaderboard and internal infrastructure,” which is generous to themselves but not unusual.

SWE-bench Multilingual at 73.7 is the result that surprised me most. Composer 2 wasn’t trained as a polyglot model in the way Opus or GPT-5 are; it’s tuned for the diff-based edit work Cursor’s agent does inside the IDE, so a strong Multilingual score suggests the base model is doing more of the heavy lifting than the headline narrative implies.
The base, as it turned out, had a name.
Pricing: roughly a tenth of Opus 4.6
Composer 1.5 was $3.50 / $17.50 per million tokens. Composer 2 Standard is $0.50 / $2.50, about 86% cheaper on both input and output. The faster variant (same model, lower latency) is $1.50 / $7.50, still well under everything else on the board.
The number that actually moves the math is the comparison to Anthropic. Opus 4.6 sits at $5/$25 per million; Sonnet 4.6 is $3/$15. If your daily workflow is roughly 200K input + 50K output tokens (a not-unusual day inside Cursor — see our Cursor vs Copilot real-cost breakdown for how those numbers play out on actual bills), the per-day model cost works out to:
- Composer 2 Standard: $0.10 input + $0.13 output = $0.23/day
- Sonnet 4.6: $0.60 + $0.75 = $1.35/day
- Opus 4.6: $1.00 + $1.25 = $2.25/day
Multiply by 22 working days and you have $5/month with Composer 2 versus $50/month with Opus. For a team of 30 that’s about $1,350 a month evaporating, which is the kind of number that gets a CFO involved. Anysphere knows this. The whole point of building a first-party model is to control the unit economics, and they’ve passed enough of the savings through that the question for most teams isn’t “should we try Composer 2” but “what tasks should we still send to Opus?”
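The arithmetic above is easy to sanity-check. Here is the same calculation in a few lines of Python, using the rates and the 200K-in / 50K-out daily volume quoted above (the model keys are my own labels, not API identifiers):

```python
# Per-million-token rates (input $, output $) as quoted above.
RATES = {
    "composer2-standard": (0.50, 2.50),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6": (5.00, 25.00),
}

def daily_cost(model, input_m=0.200, output_m=0.050):
    """Model cost for one day of usage, token volumes in millions
    (defaults: 200K input + 50K output, the workload assumed above)."""
    inp, out = RATES[model]
    return input_m * inp + output_m * out

def monthly_gap(cheap, pricey, seats=1, working_days=22):
    """Monthly spend difference between two models across a team."""
    return (daily_cost(pricey) - daily_cost(cheap)) * working_days * seats
```

With the defaults, daily_cost gives $0.225/day for Composer 2 Standard and $2.25/day for Opus 4.6, and monthly_gap("composer2-standard", "opus-4.6", seats=30) lands just under $1,350, matching the figure above.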
The Kimi K2.5 disclosure problem
On March 20, the day after launch, a developer noticed that Composer 2’s internal model ID contained the string kimi-k2p5. That’s Moonshot AI’s Kimi K2.5, a Chinese open-source coding model released January 27, 2026 under a modified MIT license. Cursor’s launch post described Composer 2 as built via “continued pretraining and reinforcement learning” without naming the base.
Cofounder Aman Sanger acknowledged the omission within hours, saying the team had evaluated several base models and Kimi K2.5 came out strongest, and that it “was a miss to not mention the Kimi base in our blog from the start.” Lee Robinson, Cursor’s VP of Developer Experience, added that only about a quarter of the final model’s training compute came from the Kimi base; the rest came from Cursor’s own continued pretraining and RL. Moonshot AI later confirmed the partnership was authorized through Fireworks AI.
The incident has a few layers worth separating. There’s precedent, for one: Composer 1 used DeepSeek’s tokenizer without disclosure, and the community spotted that shortly after the October 2025 launch. So this is the second time the same shape of incident has played out, and the pattern of “ship first, attribute when caught” is getting harder to brush off as oversight for a company now sitting on a multi-billion-dollar valuation.
There’s also a licensing wrinkle. Kimi K2.5’s modified MIT license requires prominent attribution for products generating more than $20M in monthly revenue, and Cursor’s reportedly $50B valuation round and pricing tiers suggest they’re well past that threshold. The post-launch acknowledgment buried in a forum reply isn’t what most lawyers would call prominent.
And from a model-quality standpoint, the SWE-bench Multilingual score now makes more sense. Kimi K2.5 was already a strong multilingual coding model before Cursor’s continued pretraining. The “frontier-level coding model” framing isn’t false, but the credit allocation between “Cursor’s RL infrastructure” and “Moonshot’s base capabilities” is harder to read from the outside than the launch post suggested.
In practical terms, none of this affects how Composer 2 works on your code. The model is what it is, and you can use it without caring about the genealogy. But for a team trying to evaluate the durability of Cursor’s model advantage versus, say, GitHub’s or Anthropic’s, “we built it on Kimi K2.5 and added RL” reads more like a head start than a moat.
When to use Composer 2 (and when not to)
After a month of using both Composer 2 and Opus 4.6 side-by-side on the same codebases, here’s where I’ve landed:
Use Composer 2 for:
- Multi-file refactors inside the IDE (its bread and butter; the diff-output tuning shows)
- Test generation and fixing flaky tests
- Boilerplate-heavy work: scaffolding endpoints, wiring DI containers, generating migrations
- Any session where you’ll fire 50+ messages and care about the cumulative cost
- The “fast” variant for autocomplete-adjacent agent work where latency counts more than the last 5% of quality
Stay with Opus 4.6 (or GPT-5.4) for:
- Architectural decisions across services, the kind where you ask “should this be three microservices or one with feature flags”
- Debugging gnarly concurrency issues, where Opus’s reasoning depth still wins
- Anything involving novel algorithms or research-paper translation
- Code review on a PR whose author you don’t fully trust; Opus catches more subtle bugs
If you’re weighing this against a full switch of IDE, the Cursor vs Claude Code vs Windsurf comparison covers where each still has an edge on things Composer 2 doesn’t touch directly.
The pricing makes the choice asymmetric. Composer 2 is cheap enough that you can afford to try it first and escalate to Opus when it visibly stalls. With 1.5 the failure mode was silent: it’d produce confident garbage. Composer 2 is better at announcing when it’s stuck, and I’ll take that over one more point of accuracy on any given day.
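That try-cheap-first policy fits in a few lines. The sketch below is hypothetical: run_agent and looks_stalled stand in for whatever invocation and stall-detection you already have; neither is a real Cursor API.

```python
# Hypothetical escalation wrapper around an agent call.
# run_agent(task, model=...) and looks_stalled(result) are placeholders
# you would supply; the model names are just the ones discussed above.

def solve(task, run_agent, looks_stalled):
    """Try the cheap model first; escalate only when it visibly stalls."""
    result = run_agent(task, model="composer-2")
    if looks_stalled(result):  # e.g. no diff produced, tests still red
        result = run_agent(task, model="opus-4.6")
    return result
```

The policy only works because Composer 2 announces when it’s stuck; with a model that fails silently, looks_stalled has nothing reliable to key on.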
The fast variant is the move
One product decision that doesn’t get enough airtime: Cursor made fast the default. The fast variant has the same intelligence as Standard but trades a 3x higher token cost for noticeably lower latency. On a day of agentic work that’s a real ergonomic upgrade. The model returning in 8 seconds instead of 25 means you don’t context-switch out of the task while waiting.
For a paid Pro subscriber on the $20/month tier, neither variant counts against your quota the way Opus does. Unless you’re hitting the rate limits, the fast variant is basically a free latency improvement. I’ve kept it as my default for two weeks and only fall back to Standard when I’m running 5+ background agents in parallel and want to be polite to Cursor’s infrastructure.
If you’re using Cursor’s API (paying per-token rather than via subscription), the math changes. Fast at $1.50 / $7.50 starts to add up over a long workday, and Standard is the clear pick for high-volume automated workflows.
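To put a number on the premium, here’s the Fast-versus-Standard gap at the same assumed 200K-in / 50K-out daily volume used earlier; the rates are the listed prices, the volume is my working assumption.

```python
# Per-million-token API rates (input $, output $) for the two variants.
FAST = (1.50, 7.50)
STANDARD = (0.50, 2.50)

def day_cost(rates, input_m=0.200, output_m=0.050):
    """One day of API usage, token volumes in millions."""
    return rates[0] * input_m + rates[1] * output_m

# Extra dollars per day you pay for lower latency.
latency_premium = day_cost(FAST) - day_cost(STANDARD)
```

At this volume the premium is $0.45 a day, roughly $10 over a 22-day month per seat: negligible for one interactive developer, but it scales linearly with token volume, which is why Standard is the clear pick for high-volume automated workflows.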
Sources
- Introducing Composer 2 — Cursor blog — official launch post with benchmarks and pricing
- Composer 2 technical report (PDF) — Cursor Research Team’s full methodology
- Composer 2: Benchmarks, Pricing, and How It Compares — DataCamp — independent analysis with comparison tables
- Cursor’s Composer 2 beats Opus 4.6 on coding benchmarks at a fraction of the price — The New Stack — independent coverage
- Cursor’s new coding model Composer 2 is here: It beats Claude Opus 4.6 but still trails GPT-5.4 — VentureBeat — competitor positioning
- Cursor Composer 2 Review: Benchmarks, Pricing, and the Kimi K2.5 Controversy Explained — Emelia — coverage of the disclosure incident with quotes from Sanger and Robinson
- Cursor 2.0 release post — context for the 1.0 → 2.0 IDE jump that preceded Composer 2
FAQ
What is Cursor Composer 2?
Composer 2 is Cursor’s first-party coding model, released March 19, 2026. It’s a 200K-context agentic model built via continued pretraining on Moonshot AI’s Kimi K2.5 base, then refined with reinforcement learning on long-horizon coding tasks. It runs natively inside Cursor and is tuned for the diff-based edit work the IDE’s agent does.
How much does Cursor Composer 2 cost?
Standard is $0.50 per million input tokens and $2.50 per million output tokens. The Fast variant (same intelligence, lower latency) is $1.50 / $7.50. That’s roughly 1/10 the price of Claude Opus 4.6 and 1/6 of Sonnet 4.6. Pro subscribers ($20/month) get unlimited Composer 2 usage in the IDE without per-token charges; Pro+ is $60/month and Ultra is $200/month.
Is Composer 2 built on Kimi K2.5?
Yes. Cursor didn’t say so at launch, but a developer noticed kimi-k2p5 in the model ID the next day. Cofounder Aman Sanger confirmed that Kimi K2.5 is the base model and called the omission “a miss.” Lee Robinson, Cursor’s VP of Developer Experience, said about a quarter of the training compute came from the Kimi base; the rest is Cursor’s own continued pretraining and RL.
How does Composer 2 compare to Claude Opus 4.6?
On Terminal-Bench 2.0, Composer 2 scores 61.7 versus Opus 4.6’s 58.0, a small but real lead. Composer 2 also wins on cost by roughly 10x. Opus 4.6 still leads on reasoning-heavy tasks like architectural planning, novel algorithm design, and deep concurrency debugging. For day-to-day IDE work the cost difference makes Composer 2 the obvious default.
When should I use Composer 2 fast mode?
Default to Fast for interactive agent work where latency counts more than the last 5% of quality: refactors, test generation, scaffolding. Drop back to Standard when you’re running multiple background agents in parallel, when you’re paying per token via the API rather than a Pro subscription, or when you want the cheapest option for a high-volume automated workflow.
Bottom line
Composer 2 is the first Cursor model I’d actually pay for if it weren’t bundled. The benchmarks are competitive, the pricing is aggressive enough to change team budgets, and the compaction-in-the-loop work is the kind of model engineering that pays compounding dividends across long sessions. It’s not Opus and it doesn’t pretend to be, but for the 80% of coding work that doesn’t need Opus, it’s better and cheaper.
The Kimi K2.5 thing isn’t disqualifying, but it’s the second time Cursor has shipped a model with an undisclosed base, and the pattern is more telling than any single incident. If you’re betting on Anysphere’s long-term model independence as part of your IDE choice, that’s a real signal worth weighing. If you’re just trying to ship code, fire up Fast mode and move on.