TL;DR

After three weeks rotating GPT-5.4, Claude Opus 4.7, and Gemini 3.1 Pro across the same daily coding work, the picture in May 2026 is messier than the leaderboards suggest. Claude Opus 4.7 wins on raw SWE-bench Verified (83.5% vs 76.9% for GPT-5.4) and is the model I keep reaching for when a refactor crosses three or more files. GPT models win agentic loops on cost: Mindstudio’s testing found GPT-5.5 emits ~72% fewer output tokens than Opus 4.7 on identical loops, and GPT-5.4 shows the same pattern in milder form, which compounds across iterations. Gemini 3.1 Pro is the cheapest of the three on blended per-million-token pricing, and on well-specified problems it lands within a couple of points of the others. There’s no single winner — just a routing decision.

Why the comparison changed in April 2026

Three things shifted the ground:

  1. Anthropic shipped Claude Opus 4.7 on April 16. The Adaptive variant pushed the upper bound of SWE-bench Verified to 83.5% (±1.7) on Scale AI’s leaderboard, with Mythos Preview reportedly hitting 93.9%, though that one isn’t widely available yet.
  2. OpenAI quietly stopped reporting SWE-bench Verified scores because their own audit found that frontier models could reproduce verbatim gold patches from the test set. Opus 4.5 specifically dropped from 80.9% on Verified to 45.9% on SWE-bench Pro — a 35-point gap on the same model. Across other models the Verified→Pro delta tends to land in the 20–35 point range. The benchmark we’ve been quoting for two years is leaking.
  3. Gemini 3.1 Pro held its position as the cost leader — half the blended price of GPT-5.4 and a fifth of Opus 4.7’s, an Intelligence Index of 57 (the same as Opus 4.7 and GPT-5.4), and a 1M-token context window that it actually uses well.

The dollar/token math has flipped. Pre-April, “use the best model” meant Claude. Post-April, the question is whether you’re paying for raw quality, agentic efficiency, or context-window scale, and each answer points at a different model.

I’ve been routing between them for the last three weeks, mostly inside Cursor and via direct API calls. What’s actually different in practice:

The benchmark scoreboard

These are the public numbers I trust most as of May 1, 2026. Scale AI’s leaderboard and Artificial Analysis are the two I cross-check first because they publish error bars and methodology notes.

| Metric | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified (max/high) | 83.5% ±1.7 | 76.9% ±1.9 | 75.6% ±2.0 |
| AIME 2024-25 (xhigh) | 97.8% ±2.2 | 95.3% ±3.2 | 95.6% ±3.1 |
| Intelligence Index (AA) | 57 | 57 | 57 |
| Context window | 1M | 1.05M | 1M |
| Max output tokens | 128K | 128K | 64K |
| Blended price (3:1 mix) | ≈5× Gemini’s rate | ≈2× Gemini’s rate | lowest of the three |
| GDPval-AA (agentic Elo) | 1,753 | 1,673 | not reported |

Two things to note before drawing conclusions. First, the SWE-bench Verified numbers are reported by the labs themselves with their own scaffolding. Anthropic reports Opus 4.7 with the “Adaptive” runner, OpenAI reports GPT-5.4 at “high” reasoning, Google reports Gemini at the “Preview” tier. They are not running on identical harnesses. Second, the AA Intelligence Index is averaged across 10 benchmarks; tied scores hide huge per-task gaps that show up when you’re choosing for a specific job.
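
On the pricing row specifically: “blended” is just a weighted average of input and output rates at the 3:1 mix Artificial Analysis prices against. A minimal sketch of the arithmetic, with placeholder rates since the real per-token prices vary by provider and tier:

```python
def blended_price(input_rate: float, output_rate: float) -> float:
    """Blended $/1M tokens at a 3:1 input:output mix,
    the ratio Artificial Analysis prices against."""
    return (3 * input_rate + output_rate) / 4

# Placeholder rates in $/1M tokens -- illustrative, not any vendor's price sheet.
print(blended_price(input_rate=2.0, output_rate=10.0))  # 4.0
```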

SWE-bench: what a 7-point gap means in practice

Claude Opus 4.7 is 6.6 points ahead of GPT-5.4 on SWE-bench Verified. That sounds modest until you remember the benchmark is “fix this real GitHub issue,” and the gap means Claude resolves roughly 1 in 15 issues that GPT-5.4 doesn’t. On a backlog of 100 bugs that’s about seven extra fixes. On a small CI loop that’s the difference between a green run and a stuck queue.

Where I see the gap in normal use:

  • Multi-file refactors — anything that touches a model, a service, and a controller in the same change. Opus 4.7 is willing to read all three files before editing. GPT-5.4 starts editing earlier and often produces a structurally clean change that compiles but breaks because it inferred the contract instead of reading it.
  • Long error chains — a 200-line stack trace through an async pipeline. Claude follows it. GPT-5.4 jumps to the most obvious frame and tries a fix.
  • Unfamiliar code — first-touch on a new repo. Claude reads with patience; GPT-5.4 reads enough to act, which is faster but riskier.

The catch: SWE-bench Verified is contaminated. OpenAI’s audit found models could reproduce verbatim patches, and the same Opus 4.5 that scored 80.9% on Verified scored only 45.9% on SWE-bench Pro, a harder, less-leaked benchmark. So the 83.5% on Opus 4.7 is real but possibly inflated; the relative gap to GPT-5.4 still tracks my own usage, which is the thing I’m picking a tool against.

Token efficiency: the GPT advantage

Reported numbers from Mindstudio’s testing (which I haven’t reproduced independently, but they match my API bills): on identical agentic loops, GPT-5.5 emits roughly 72% fewer output tokens than Opus 4.7 to reach the same answer. GPT-5.4 isn’t quite that aggressive — closer to 50–60% on my runs — but the direction is the same. Claude tends to think out loud, with long planning sections, restated assumptions, and “let me first read the file” preambles. GPT cuts straight to the diff.

For a one-shot prompt this barely shows up. For an agentic CLI session that loops 50 times, it’s the entire cost structure. On a recent agent task I run weekly:

```
# Claude Opus 4.7 (Anthropic API, claude-opus-4-7)
$ time anthropic-cli agent run refactor.task
real    14m18s
input tokens:   382,400
output tokens:   88,200
cost:           .39

# GPT-5.4 (OpenAI API)
$ time openai-cli agent run refactor.task
real    9m42s
input tokens:   341,800
output tokens:   24,700
cost:           .63
```

Same task, same scaffolding, comparable diff quality. Opus 4.7 spent 3.5x more, partly because Anthropic’s per-token rates are higher and partly because Claude wrote three times as much. If you live in agentic loops, GPT-5.4 is cheaper per finished task even though it loses on raw benchmark quality.
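
The compounding is easy to sanity-check. A sketch of the per-task arithmetic, using the token counts from the transcript above; the per-million rates are placeholders I picked to land near the observed 3.5x, not Anthropic’s or OpenAI’s actual price sheets:

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_rate: float, out_rate: float) -> float:
    # Rates are $/1M tokens; returns dollars for one finished task.
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

# Token counts from the transcript above; rates are made-up placeholders.
claude = task_cost(382_400, 88_200, in_rate=10.0, out_rate=45.0)
gpt = task_cost(341_800, 24_700, in_rate=5.0, out_rate=20.0)
print(f"cost ratio: {claude / gpt:.1f}x")  # ~3.5x with these placeholder rates
```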

The routing rule, in one line: Claude per query, GPT per loop.

Where Gemini 3.1 Pro fits

Gemini 3.1 Pro is the model I underestimated. At half GPT-5.4’s blended per-million price and a fifth of Opus 4.7’s, the cheapness made me wary. After three weeks I keep it in rotation under three conditions:

  • The problem is well-specified. Gemini reads ambiguity as room to interpret, which goes wrong on vague tickets but is fine when the spec is clear.
  • The codebase is large. Gemini’s 1M-token context is the same size on paper as Opus 4.7’s, but my read is that Gemini retrieves better from deep in the window. Long-file bugs that need a function from line 8,200 of a 12,000-line file are where Gemini quietly outperforms.
  • You don’t need maximum reliability per call. On agentic chains it’s the most likely of the three to take the wrong fork at step three. For one-shot work it’s fine.

I run Gemini 3.1 Pro now for two specific things: searching across a whole monorepo for “where does this constant ultimately get used,” and writing the first draft of test fixtures. Both lean more on retrieval than reasoning, and the cost compounds because they’re high-volume.
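
For the monorepo search, I do the simplest thing the window allows: concatenate the relevant source into one prompt and ask. A minimal sketch with the google-genai Python SDK; the model ID is this post’s shorthand, MaxRetries is a stand-in constant, and the character cap is a rough assumption (1M tokens is on the order of 3–4M characters), so check all of it against current docs:

```python
from pathlib import Path

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Concatenate the repo's Go sources into one prompt, staying well
# under the window to leave room for the model's answer.
parts = []
total = 0
for path in sorted(Path("my-monorepo").rglob("*.go")):
    text = path.read_text(errors="ignore")
    parts.append(f"\n// ===== {path} =====\n{text}")
    total += len(text)
    if total > 2_000_000:  # rough character budget, not a token count
        break

question = "Where does the constant MaxRetries ultimately get used?"
resp = client.models.generate_content(
    model="gemini-3.1-pro",  # hypothetical ID from this post; verify it
    contents="".join(parts) + "\n\n" + question,
)
print(resp.text)
```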

Cost in real numbers

For a developer doing meaningful work (call it 30M input + 10M output tokens per month, which is what I average through Cursor + direct API together), the monthly bill across the three models, before any prompt caching:

  • Claude Opus 4.7 (max): the priciest of the three, roughly 5× the Gemini bill
  • GPT-5.4 (xhigh): roughly 2× the Gemini bill
  • Gemini 3.1 Pro: the cheapest

Those numbers come straight off Artificial Analysis blended pricing applied to a 3:1 input:output mix. Real usage diverges. Claude burns more output tokens, so the gap to GPT-5.4 in practice is wider than the table suggests. Prompt caching narrows it back; on Claude with a stable system prompt you can knock 30–50% off input costs, which I’ve seen on my own bills.
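
The caching mechanics, for the curious: you mark the stable prefix of the prompt so repeat calls bill it at the discounted cache-read rate instead of full input price. A minimal sketch with the Anthropic Python SDK; the model ID is this post’s shorthand, and the 30–50% figure is my bill, not a guarantee:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

STABLE_SYSTEM_PROMPT = "You are a code-review agent for this repo. ..."

resp = client.messages.create(
    model="claude-opus-4-7",  # hypothetical ID from this post
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later
            # calls with the same prefix pay the cache-read rate instead.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review the attached diff."}],
)
# usage reports cache_creation_input_tokens vs cache_read_input_tokens
print(resp.usage)
```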

For comparison, the per-seat IDE pricing I covered in Cursor vs Copilot’s real cost looks cheap until you hit the API-overage wall on a busy week. The per-IDE picture is the topic of my Cursor vs Claude Code vs Windsurf piece — and most heavy users I know mix: an IDE seat for completion + chat, plus a direct API budget for agentic work.

Per-task winners (the routing table I actually use)

| Task | Pick | Why |
|---|---|---|
| Bug fix in a single file | GPT-5.4 | Fast, cheap per call, low overthinking |
| Refactor across 3+ files | Claude Opus 4.7 | Will read all the files first |
| Agentic CLI loop / autonomous coding | GPT-5.4 | 50–60% fewer output tokens, which compounds |
| Reading a paper’s repo and reproducing | Claude Opus 4.7 | Better at “where does X live” exploration |
| First-draft tests, fixtures, mocks | Gemini 3.1 Pro | Cheap, good enough, high volume |
| Architecture review on a 500K-token repo | Gemini 3.1 Pro | 1M context with strong retrieval |
| “Why does this fail” with a long stack trace | Claude Opus 4.7 | Patience with chains |
| Code golf / one-liner / regex | GPT-5.4 | Most likely to nail the terse form |
| Producing structured JSON / tool calls | GPT-5.4 | Most reliable schema adherence |
| Translating between languages (Go ↔ Python) | Claude Opus 4.7 | Idiomatic in both |

If you need a single default and refuse to route, pick Claude Opus 4.7 for one-shot work and GPT-5.4 if you live in agentic chains. Gemini is the right second model: the one you reach for when you’ve burned the budget on the others.
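
If you’d rather encode the table than remember it, the routing logic is a dictionary and a fallback. A sketch using this post’s shorthand model IDs, not official API strings:

```python
# Task category -> model, mirroring the routing table above.
ROUTES = {
    "single_file_fix": "gpt-5.4",
    "multi_file_refactor": "claude-opus-4-7",
    "agentic_loop": "gpt-5.4",
    "repo_exploration": "claude-opus-4-7",
    "tests_fixtures_mocks": "gemini-3.1-pro",
    "huge_context_review": "gemini-3.1-pro",
    "stack_trace_debug": "claude-opus-4-7",
    "terse_code": "gpt-5.4",
    "structured_output": "gpt-5.4",
    "language_translation": "claude-opus-4-7",
}

def pick_model(task_kind: str, agentic: bool = False) -> str:
    # Fallback is the one-line rule: Claude per query, GPT per loop.
    default = "gpt-5.4" if agentic else "claude-opus-4-7"
    return ROUTES.get(task_kind, default)
```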

What I run them through

For three weeks I’ve been pushing the same five tasks through each model at least three times to fight variance:

  1. The Go iterators port — translating a real Python helper to use the new Go iter package (related to my recent Go iterators post). Opus 4.7 wins on idiom; GPT-5.4 wins on speed; Gemini gets it right but uses uglier names.
  2. A Pandas → Polars migration — see the Polars vs Pandas piece for the broader case. Opus 4.7 catches the lazy-execution gotchas; GPT-5.4 misses one .collect() call about 1 in 5 runs.
  3. A FastAPI bug reproduction from a 60-line stack trace. Opus 4.7 wins. GPT-5.4 jumps to a wrong frame about half the time. Gemini is in the middle.
  4. An MCP server prototype — same kind of work as the FastMCP tutorial. All three handle this; GPT-5.4 ships the most concise scaffold.
  5. Reading a 9,000-line Rust file to find where a constant is used. Gemini wins on speed; the other two tie on correctness.
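
The harness behind those runs is nothing clever: loop models over tasks, repeat three times, record whether each task’s own check passes. A stripped-down sketch; run_task is a placeholder you’d wire to your own CLI or API scaffolding:

```python
from collections import defaultdict

MODELS = ["claude-opus-4-7", "gpt-5.4", "gemini-3.1-pro"]
TASKS = ["go_iter_port", "pandas_polars_migration", "fastapi_trace_repro",
         "mcp_server_prototype", "rust_constant_hunt"]
REPEATS = 3  # fights run-to-run variance; it is not a statistics regime

def run_task(model: str, task: str) -> bool:
    """Run one task against one model and return whether its check passed.
    Placeholder: wire this to whatever scaffolding you actually use."""
    raise NotImplementedError

results = defaultdict(list)
for model in MODELS:
    for task in TASKS:
        for _ in range(REPEATS):
            results[(model, task)].append(run_task(model, task))

for (model, task), runs in sorted(results.items()):
    print(f"{model:18} {task:24} pass {sum(runs)}/{len(runs)}")
```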

None of this is a controlled experiment. It’s the kind of testing you do when the question is “which one should I keep paying for next month,” not “what is the truth.” Use it as one data point, not a leaderboard.

What about GPT-5.5 and Claude Mythos

Two newer models sit at the edge of the comparison:

  • GPT-5.5 — released late April 2026 and now generally available on the OpenAI API. The introduction post puts it ahead of GPT-5.4 on Terminal-Bench 2.0 (82.7%) and SWE-bench Pro (58.6%). My usage is still mostly on GPT-5.4 because Cursor’s default routing hasn’t fully shifted yet, and I want more data before I trust it on agentic work. If the headline numbers hold, GPT-5.5 will be the new default for the agentic-loop slot in the routing table above.
  • Claude Mythos Preview — Anthropic’s restricted release, capped at roughly 50 enterprise partners (12 founding companies plus ~40 critical-infrastructure organisations). Hits 93.9% on SWE-bench Verified per the leaderboard. Until it’s broadly available, it’s not a real option for individual developers.

Both will move the curve, but neither changes the routing logic above for the rest of May 2026.

Honest limitations of this comparison

Three things I’d want you to weigh before treating this as gospel:

  • No two harnesses are equal. Lab-reported SWE-bench numbers use different scaffolding. The 83.5% vs 76.9% gap might shrink to 4 points or grow to 10 on identical infrastructure.
  • Benchmark contamination is real. SWE-bench Verified is leaking. The relative ordering survives on SWE-bench Pro, but absolute numbers are inflated across the board.
  • Three weeks is not a longitudinal study. My routing table is what works for my tasks. Yours will differ, most obviously if you do mostly frontend, mobile, or non-Python/Go work, where the benchmark coverage is thinner.

If you want to dig into the raw data yourself, llm-stats.com’s leaderboard and Scale AI’s SWE-bench Pro are the two I’d start with. Both publish raw scores, error bars, and harness notes, which lets you cross-check the lab claims.

FAQ

Is Claude better than GPT for coding?

Yes on per-query benchmark quality and multi-file reasoning: Claude Opus 4.7 leads GPT-5.4 by about 7 points on SWE-bench Verified. No on agentic-loop cost-per-task: GPT models output far fewer tokens per finished task (Mindstudio measured ~72% fewer for GPT-5.5 vs Opus 4.7 on identical loops; GPT-5.4 lands in a similar but milder pattern). The right answer is “depends what you’re paying for.”

Which AI model is best for coding in 2026?

For one-shot accuracy: Claude Opus 4.7. For agentic CLI work and autonomous coding: GPT-5.4. For cost-per-token at acceptable quality: Gemini 3.1 Pro. The “best” model depends on whether your bottleneck is query quality, iteration cost, or budget.

What’s the difference between GPT-5.4 and Claude Opus 4.7?

Opus 4.7 is more thorough and reads more of the codebase before editing; GPT-5.4 acts faster with fewer output tokens. On Artificial Analysis blended pricing, Opus 4.7 runs roughly two and a half times GPT-5.4’s rate. Same Intelligence Index of 57, but Opus 4.7 is ahead on coding-specific benchmarks and AIME math.

Is Gemini 3.1 Pro good for coding?

Yes, with caveats. It’s the cheapest of the three frontier models on blended per-million-token pricing, and it matches the others on Intelligence Index. It does well on well-specified problems and large codebases (1M-token context with strong long-window retrieval). It’s the weakest of the three in agentic chains, where it’s most likely to commit to a wrong interpretation early.

Which coding model is cheapest in practice?

Per finished task, GPT-5.4: its lower output-token count beats Claude Opus 4.7 even after prompt caching narrows Anthropic’s higher per-token rates. Per raw API call, Gemini 3.1 Pro, the cheapest blended rate of the three. Per IDE seat, Cursor Pro or GitHub Copilot Pro, though most heavy users top those up with direct API spend.

Bottom line

The May 2026 reality is that all three frontier models are good enough to ship code, and the pricing has compressed the “use the best model” argument into “match the model to the task.” If I had to keep one, it would still be Claude Opus 4.7. The per-query quality is real and pays back across multi-file work I can’t do well alone. If I had to drop one, it would be GPT-5.4 only because GPT-5.5 is about to land and reset the agentic side of this comparison. Gemini 3.1 Pro is the model I trust least for hard reasoning and use most for cheap volume — a working contradiction that survives because the three jobs are different jobs.

If you’re picking one for your team, ask which of three constraints binds first: query-quality budget, iteration-cost budget, or per-call dollar budget. Answer that, and the model picks itself.