TL;DR
DeepSeek V4 Pro packs 1.6 trillion parameters into a mixture-of-experts architecture that activates just 49 billion per token, scores 80.6% on SWE-bench Verified, and costs $1.74 per million input tokens, roughly a third of Claude Opus 4.7’s input price and a seventh of its output price. I ran it against my production codebases for two weeks and found a model that closes the gap on coding benchmarks (7 points behind Opus 4.7 on SWE-bench, ahead of GPT-5.4 on Codeforces) but stumbles on agentic workflows, burns tokens like a furnace, and comes with safety gaps NIST flagged as serious. For batch workloads the savings are real; for agentic work, stick with Claude.
Two Weeks With V4 Pro on Real Code
I started using V4 Pro the day after release through OpenRouter, then switched to DeepSeek’s native API once I hit rate limits on the third-party provider. The first thing I noticed: the model is verbose. A refactoring task that Claude Opus 4.7 handles in 800 tokens took V4 Pro north of 3,000. It explains its reasoning in exhaustive detail, rewrites sections you didn’t ask about, and sprinkles in comments I have to delete afterward.
On pure code generation (writing a new FastAPI endpoint, implementing a data pipeline, solving an isolated algorithm problem) V4 Pro is genuinely good. My TypeScript endpoint audit cost $0.09 on V4 Pro versus roughly $9 on Opus 4.7. That 100x cost difference on a single task is not a typo. For batch workloads where you’re processing hundreds of files, the economics are hard to argue with.
But the moment I tried multi-step agentic tasks, the kind where the model needs to read a codebase, plan changes across files, and execute them in sequence, V4 Pro fell apart faster than Claude did. The error cascade dynamics I wrote about earlier are exactly what happens: one wrong step poisons everything downstream. More on that in the benchmarks section.
Architecture: 384 Experts, 6 Active
V4 Pro uses a Mixture-of-Experts transformer with 384 routed experts plus 1 shared expert per MoE layer, activating just 6 experts per token. The result is a 1.6T-parameter model that runs inference as if it were a 49B model, at least in terms of compute per token.
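To make the routing concrete, here’s a minimal sketch of top-k expert routing in the shape DeepSeek describes: 384 routed experts, 1 shared expert, 6 active per token. The router and the expert functions are toy stand-ins, not DeepSeek’s actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token, experts, shared_expert, router_w, k=6):
    """Route one token through k of n experts plus the always-on shared expert."""
    logits = router_w @ token            # score all 384 routed experts
    top_k = np.argsort(logits)[-k:]      # keep only the 6 highest-scoring
    weights = softmax(logits[top_k])     # normalize over the winners only
    out = shared_expert(token)           # shared expert fires on every token
    for w, idx in zip(weights, top_k):
        out += w * experts[idx](token)   # only k expert FFNs actually execute
    return out

# Toy setup: 384 tiny "experts" (random matrices), d_model = 16.
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(384)]
shared = lambda x, W=rng.normal(size=(d, d)) / d: W @ x
router = rng.normal(size=(384, d))
y = moe_layer(rng.normal(size=d), experts, shared, router)
print(y.shape)  # (16,)
```

Per-token compute scales with k and the shared expert, not with the 384-expert pool, which is the whole trick behind a 1.6T-parameter model running like a 49B one.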
The attention mechanism is a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). DeepSeek claims that at 1M context length, V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to their previous V3.2 model. The training used 32 trillion tokens, manifold-constrained hyper-connections (mHC) for residual paths, and the Muon optimizer for stability.
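For a sense of scale, here’s the back-of-envelope math on that KV cache claim. The baseline per-token cache size is my assumption for illustration; DeepSeek doesn’t publish V3.2’s exact figure.

```python
# Assumed baseline: V3.2 stores ~70 KB of KV cache per token (illustrative only).
baseline_kb_per_token = 70
context_tokens = 1_000_000

v32_cache_gb = baseline_kb_per_token * context_tokens / 1e6  # KB -> GB: ~70 GB
v4_cache_gb = v32_cache_gb * 0.10                            # DeepSeek's claimed 10%
print(f"V3.2 @ 1M tokens:   ~{v32_cache_gb:.0f} GB KV cache")
print(f"V4 Pro @ 1M tokens: ~{v4_cache_gb:.0f} GB KV cache")
```

If the baseline is anywhere near that order of magnitude, the reduction is what makes the 1M-token tier servable without spilling the cache across machines.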
V4 Flash, the smaller sibling, runs 284B total parameters with 13B activated and 256 routed experts. It’s the one most people will actually use. More on that later.
Benchmarks: Self-Reported vs Independent
DeepSeek’s own benchmarks and independent evaluations tell different stories.
Coding Benchmarks
| Benchmark | V4 Pro | Claude Opus 4.7 | GPT-5.5 | GPT-5.4 |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 87.6% | — | — |
| SWE-bench Verified (NIST) | 74.0% | — | — | — |
| SWE-bench Pro | 55.4% | 64.3% | 58.6% | — |
| Terminal-Bench 2.0 | 67.9% | 69.4% | 82.7% | — |
| Codeforces rating | 3,206 | — | — | 3,168 |
| LiveCodeBench | 93.5 | — | — | — |
The SWE-bench Verified gap between DeepSeek’s claim (80.6%) and NIST’s independent evaluation (74.0%) is a 6.6-point discrepancy worth paying attention to. Different scaffolding, different agent configurations, and different evaluation harnesses can all move these numbers. The NIST CAISI evaluation explicitly noted that V4’s capabilities “lag behind the frontier by about 8 months” in aggregate.
On competitive programming, V4 Pro’s Codeforces rating of 3,206 beats GPT-5.4’s 3,168, making it the highest competitive programming score for any model at release. That said, competitive programming performance doesn’t translate directly to production coding, as my coding model comparison showed when testing these models on real projects.
Reasoning and Math
| Benchmark | V4 Pro | Claude Opus 4.7 | GPT-5.4 |
|---|---|---|---|
| GPQA Diamond | 90.1% | 94.2% | 93.0% |
| HLE (no tools) | 37.7% | 46.9% | 39.8% |
| HMMT 2026 | 95.2% | 96.2% | 97.7% |
| MMLU-Pro | 87.5% | — | — |
V4 Pro trails Claude and GPT on reasoning. The GPQA Diamond gap (4 points behind Opus 4.7), the HLE gap (9 points behind), and the HMMT math gap (1-2.5 points behind) are consistent: V4 Pro is competitive but clearly not leading the tier. Interestingly, recent research like THINC shows that even a 4B model can beat 235B models on math when trained to reason in code — suggesting the reasoning medium can matter more than parameter count.
Long Context
V4 Pro has a genuine edge on long-context retrieval. On MRCR at 1M tokens, it scores 83.5%, above GPT-5.5’s 74.0%. The hybrid sparse attention architecture was designed for exactly this use case, and it shows. If you’re doing codebase-wide analysis, large document summarization, or retrieval over very long inputs, V4 Pro handles it better than most competitors.
What It Actually Costs
The pricing gap is obvious. The details are less so.
| Model | Input / 1M tokens | Output / 1M tokens | Cache hit / 1M |
|---|---|---|---|
| DeepSeek V4 Pro | $1.74 | $3.48 | $0.0145 |
| DeepSeek V4 Flash | $0.14 | $0.28 | $0.0028 |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 |
| GPT-5.5 | $5.00 | $30.00 | — |
V4 Pro’s output tokens cost $3.48 per million; Claude Opus 4.7 charges $25. That’s a 7.2x gap. On input tokens the difference is smaller, 2.9x, because DeepSeek prices input at half its output rate while Anthropic prices it at a fifth.
The catch is that V4 Pro is verbose. Artificial Analysis measured it consuming 190 million output tokens during their evaluation, compared to a median of 42 million across other models. That’s 4.5x more output per task. When you factor in the verbosity, the effective cost advantage shrinks from 7x to roughly 1.5-2x on output-heavy workloads.
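Here’s that math on a hypothetical task; the token counts are illustrative, not measured.

```python
def task_cost(in_tok, out_tok, in_price, out_price):
    """Cost of one task in dollars; prices are per 1M tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1e6

# Same task, but V4 Pro emits ~4.5x the output tokens (Artificial Analysis ratio).
opus = task_cost(in_tok=10_000, out_tok=2_000, in_price=5.00, out_price=25.00)
v4 = task_cost(in_tok=10_000, out_tok=2_000 * 4.5, in_price=1.74, out_price=3.48)
print(f"Opus 4.7: ${opus:.4f}  V4 Pro: ${v4:.4f}  ratio: {opus / v4:.1f}x")
# Opus ~= $0.10, V4 Pro ~= $0.049 -> about 2x cheaper, not 7x.
```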
DeepSeek is running a promotional discount until May 31, 2026: $0.435 input / $0.87 output per million, a quarter of standard pricing. If you’re evaluating V4 Pro, do it during the promo; the standard pricing still beats Claude, but the margin narrows once the verbosity tax is included.
V4 Flash: The Model Most People Should Use
V4 Flash at $0.14 per million input tokens is cheaper than every competing small model, including GPT-5.4 Nano. For tool-calling pipelines, classification, structured extraction, and any task where you need a fast, cheap model that’s better than Haiku-class options, Flash is the pick. I swapped it into a document-processing pipeline that was running Claude Haiku and saw equivalent quality at 40% lower cost.
Self-Hosting: Possible but Expensive
Both models are open-weight under the MIT license. The weights are on Hugging Face and have been downloaded over 1.17 million times in the first month.
V4 Pro’s weights are 865 GB in mixed FP4+FP8 precision. Running it at full precision requires over 1 TB of VRAM. Realistically, that means 8-16x H200 GPUs or 8x B200s in a multi-node configuration. At Q4 quantization you need around 512 GB, which fits on 8x A100 80GB nodes. Sub-Q4 quantization isn’t advisable since the source weights are already at FP4+FP8.
V4 Flash is more practical, though the arithmetic still matters: 284B parameters at FP8 come to roughly 284 GB of weights, so you need 3x H200 (141 GB HBM3e each) or 4x A100 80GB. At Q4 the weights drop to around 142 GB, which fits on 2x A100 80GB with a little headroom for context. Quantized GGUF versions appeared on r/LocalLLaMA within hours of release.
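Before renting hardware, I sanity-check fit with weight size plus roughly 10% runtime overhead; the overhead factor is a rule of thumb, and long contexts add KV cache on top of it.

```python
def fits(weight_gb: float, gpus: int, gb_per_gpu: float, overhead: float = 0.10) -> bool:
    """Weight-only fit check with ~10% headroom for runtime buffers.
    KV cache for long contexts needs more; treat True as 'worth trying'."""
    return weight_gb * (1 + overhead) <= gpus * gb_per_gpu

print(fits(865, 8, 141))  # V4 Pro, FP4+FP8 release weights, 8x H200 -> True (tight)
print(fits(512, 8, 80))   # V4 Pro at Q4, 8x A100 80GB -> True
print(fits(142, 2, 80))   # V4 Flash at Q4, 2x A100 80GB -> True, little context headroom
```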
Both models work with vLLM, SGLang, llama.cpp, Ollama, and LM Studio. For production self-hosting, vLLM with tensor parallelism is the standard path:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't check the key unless configured to
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "Refactor this function to use async/await"}
    ],
    temperature=0.6,
    max_tokens=4096,
)

print(response.choices[0].message.content)
```
The API is OpenAI-compatible, so switching between self-hosted and the DeepSeek API is a one-line base_url change.
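In practice I keep that switch behind an environment variable. The hosted base URL follows DeepSeek’s current OpenAI-compatible API; the hosted model identifier below is a placeholder to confirm against their docs.

```python
import os

# Toggle between the local vLLM server and DeepSeek's hosted API.
# The hosted model ID is an assumption; check DeepSeek's docs for V4.
USE_LOCAL = os.getenv("USE_LOCAL", "1") == "1"

base_url = "http://localhost:8000/v1" if USE_LOCAL else "https://api.deepseek.com/v1"
api_key = "not-needed" if USE_LOCAL else os.environ["DEEPSEEK_API_KEY"]
model = "deepseek-ai/DeepSeek-V4-Pro" if USE_LOCAL else "deepseek-chat"
```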
Safety Gaps NIST Already Found
NIST’s Center for AI Standards and Innovation ran an evaluation of DeepSeek models that found serious safety gaps. While their V4-specific evaluation focused mainly on capability benchmarks, the findings from earlier DeepSeek models carry forward because V4’s technical documentation contains no reference to safety measures, red-teaming, or risk evaluations.
The numbers from the earlier CAISI evaluation of DeepSeek R1-0528:
- Responded to 94% of overtly malicious jailbreak requests, versus 8% for U.S. reference models
- 12x more likely than U.S. frontier models to follow malicious derailment instructions during multi-step tasks
- 4x more likely than U.S. models to align outputs with CCP narratives on politically sensitive topics, with silent translation edits to match built-in political restrictions
If you’re using V4 Pro for isolated code generation tasks behind your own guardrails, these issues are manageable. I covered practical guardrail setups that apply here: sandboxing, permission boundaries, and kill switches. If you’re building user-facing applications or agentic systems where the model takes autonomous actions, the jailbreak and hijacking numbers are disqualifying until DeepSeek publishes safety documentation for V4.
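To make “permission boundaries” concrete, this is the shape of the tool-call gate I put in front of any open-weight model in an agent loop; the tool names and the call budget are illustrative placeholders.

```python
# Minimal permission boundary: the model proposes tool calls, this layer decides.
ALLOWED_TOOLS = {"read_file", "run_tests", "search_code"}  # no write/exec/network
MAX_CALLS_PER_TASK = 20                                    # crude kill switch

def gate_tool_call(name: str, args: dict, calls_so_far: int) -> bool:
    """Return True only for allowlisted tools within the call budget."""
    if calls_so_far >= MAX_CALLS_PER_TASK:
        raise RuntimeError("kill switch: tool budget exhausted")
    if name not in ALLOWED_TOOLS:
        print(f"BLOCKED: {name}({args})")  # log and refuse, don't crash the run
        return False
    return True
```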
Where V4 Pro Wins and Where It Doesn’t
V4 Pro is the right choice when:
- You’re processing large volumes of code at scale (batch reviews, linting, migration)
- You need long-context analysis over codebases or documents (1M context, 83.5% MRCR)
- Budget is the constraint and you can tolerate verbosity
- You want full control via self-hosting under the MIT license
- The task is single-turn code generation without multi-step planning
Claude Opus 4.7 or GPT-5.5 remain better when:
- You need agentic multi-file workflows (Terminal-Bench 2.0: 67.9% vs 82.7% for GPT-5.5)
- Reliability outweighs cost savings (V4 Pro’s API rate limits are unpredictable)
- Safety and guardrails are non-negotiable
- You’re building user-facing products
- The task requires deep reasoning chains (GPQA: 90.1% vs 94.2% for Opus 4.7)
API Availability and Speed
V4 Pro is available through DeepSeek’s native API, OpenRouter, Fireworks, DeepInfra, Together, Novita, and SiliconFlow. Microsoft Azure Foundry has V4 Flash live with Pro coming soon.
Speed varies dramatically by provider. Artificial Analysis measures a median of 33.9 output tokens per second across providers, below the peer median of 54.5 t/s. Fireworks clocks the fastest at 167.1 t/s but with a 27-second time-to-first-token. TTFT across providers ranges from 20s to over 60s, which reflects scheduling and prefill overhead on a 1.6T-parameter model rather than anything you can tune client-side. DeepSeek’s native API has dynamic rate limiting with no published fixed RPM. Reports suggest roughly 300 RPM and 50 concurrent requests, but the actual limits shift based on server load.
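With no fixed published limit, I treat every 429 as the source of truth and back off. A minimal sketch using the OpenAI SDK’s error type, which works against any OpenAI-compatible endpoint; the retry count and delays are arbitrary defaults.

```python
import random
import time

import openai

def with_backoff(call, max_retries: int = 5):
    """Retry a request on 429s with exponential backoff plus jitter,
    since DeepSeek's limits move with server load."""
    for attempt in range(max_retries):
        try:
            return call()
        except openai.RateLimitError:
            time.sleep(min(60, 2 ** attempt) + random.random())
    raise RuntimeError("still rate-limited after retries")

# Usage: with_backoff(lambda: client.chat.completions.create(...))
```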
For latency-sensitive applications, V4 Flash through DeepInfra or Together is a better bet than V4 Pro through any provider.
FAQ
Is DeepSeek V4 Pro better than Claude for coding?
On isolated code-generation benchmarks like LiveCodeBench (93.5 Pass@1) and Codeforces (rating 3,206), V4 Pro matches or beats Claude. On real-world bug fixing, Opus 4.7 leads by 7 points on SWE-bench Verified (87.6% vs 80.6%), and on SWE-bench Pro the gap widens to almost 9 points (64.3% vs 55.4%); Terminal-Bench is closer (69.4% vs 67.9%). Pick based on whether your workflow is single-turn generation or multi-step orchestration.
How much does DeepSeek V4 Pro cost compared to Claude Opus 4.7?
V4 Pro costs $1.74/M input tokens and $3.48/M output tokens at standard pricing (currently discounted to $0.435/$0.87 until May 31). Claude Opus 4.7 costs $5/M input and $25/M output. That’s a 7x gap on output, but V4 Pro’s verbosity (4.5x more output tokens per task on average) narrows the effective savings to roughly 1.5-2x.
Can you run DeepSeek V4 Pro locally?
Yes. The weights are on Hugging Face under the MIT license (865 GB download). You need over 1 TB of VRAM for full precision, meaning 8-16x H200 or 8x B200 GPUs. V4 Flash is more practical for self-hosting: about 284 GB at FP8, or roughly 142 GB at Q4, which fits on 2x A100 80GB. Both work with vLLM, SGLang, Ollama, and llama.cpp.
Is DeepSeek V4 Pro safe for production use?
That depends on the application. NIST found earlier DeepSeek models responded to 94% of malicious jailbreak attempts (vs 8% for U.S. models) and were 12x more susceptible to agent hijacking. V4’s technical documentation includes no safety evaluations. For batch processing behind your own guardrails, it’s fine. For user-facing or agentic applications, the safety gaps are a real risk.
What is the difference between DeepSeek V4 Pro and V4 Flash?
V4 Pro has 1.6T total parameters (49B activated) with 384 experts. V4 Flash has 284B parameters (13B activated) with 256 experts. Pro scores higher on reasoning and coding benchmarks but costs 12x more and runs slower. Flash is the better choice for high-volume, cost-sensitive workloads where the task doesn’t require frontier-level reasoning.
Sources
- DeepSeek V4 Preview announcement — official release notes and pricing
- DeepSeek V4 Pro model card — Hugging Face — architecture details, benchmark tables, weights
- NIST CAISI evaluation of DeepSeek V4 Pro — independent benchmark evaluation
- NIST CAISI evaluation of DeepSeek models — safety findings — jailbreak, hijacking, and censorship assessment
- Artificial Analysis — DeepSeek V4 Pro — speed, verbosity, provider comparison
- DataCamp — DeepSeek V4 vs Claude Opus 4.7 — cross-model benchmark comparison
Bottom Line
V4 Pro is the strongest open-weight coding model available and a legitimate competitor to proprietary frontier models on benchmarks. The MIT license, the 1M context window, and the pricing are all genuine advantages. But the gap between benchmark performance and real-world agentic reliability is wider than the headline numbers suggest, the verbosity eats into the cost advantage, and the safety profile is worse than anything from Anthropic or OpenAI.
I use V4 Flash for batch processing and structured extraction where it replaced Claude Haiku at lower cost. I use V4 Pro for one-off code generation tasks where I need a cheap second opinion. For anything that involves multi-step planning, file orchestration, or user-facing interaction, I still reach for Claude Opus 4.7. The 7x price premium buys me reliability and guardrails I can’t get from DeepSeek — yet.