TL;DR
DeepSeek V4 Pro packs 1.6 trillion parameters into a mixture-of-experts architecture that activates just 49 billion per token, scores 80.6% on SWE-bench Verified, and costs $1.74 per million input tokens, roughly a third of Claude Opus 4.7’s input price and a seventh of its output price. I ran it against my production codebases for two weeks and found a model that closes the gap on coding benchmarks (7 points behind Opus 4.7 on SWE-bench, ahead of GPT-5.4 on Codeforces) but stumbles on agentic workflows, burns tokens like a furnace, and comes with safety gaps NIST flagged as serious. For batch workloads the savings are real; for agentic work, stick with Claude.
Two Weeks With V4 Pro on Real Code
I started using V4 Pro the day after release through OpenRouter, then switched to DeepSeek’s native API once I hit rate limits on the third-party provider. The first thing I noticed: the model is verbose. A refactoring task that Claude Opus 4.7 handles in 800 tokens took V4 Pro north of 3,000. It explains its reasoning in exhaustive detail, rewrites sections you didn’t ask about, and sprinkles in comments I have to delete afterward.
On pure code generation (writing a new FastAPI endpoint, implementing a data pipeline, solving an isolated algorithm problem) V4 Pro is genuinely good. My TypeScript endpoint audit cost $0.09 on V4 Pro versus roughly $9 on Opus 4.7. That 100x cost difference on a single task is not a typo. For batch workloads where you’re processing hundreds of files, the economics are hard to argue with.
But the moment I tried multi-step agentic tasks, the kind where the model needs to read a codebase, plan changes across files, and execute them in sequence, V4 Pro fell apart faster than Claude did. The error cascade dynamics I wrote about earlier are exactly what happens: one wrong step poisons everything downstream. More on that in the benchmarks section.
Architecture: 384 Experts, 6 Active
V4 Pro uses a Mixture-of-Experts transformer with 384 routed experts plus 1 shared expert per MoE layer, activating just 6 experts per token. The result is a 1.6T-parameter model that runs inference as if it were a 49B model, at least in terms of compute per token.
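To make the routing concrete, here’s a minimal sketch of top-k expert routing in the shape DeepSeek describes: 384 routed experts, 1 shared expert, 6 active per token. The router and the expert functions are toy stand-ins, not DeepSeek’s actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token, experts, shared_expert, router_w, k=6):
    """Route one token through k of n experts plus the always-on shared expert."""
    logits = router_w @ token            # score all 384 routed experts
    top_k = np.argsort(logits)[-k:]      # keep only the 6 highest-scoring
    weights = softmax(logits[top_k])     # normalize over the winners only
    out = shared_expert(token)           # shared expert fires on every token
    for w, idx in zip(weights, top_k):
        out += w * experts[idx](token)   # only k expert FFNs actually execute
    return out

# Toy setup: 384 tiny "experts" (random matrices), d_model = 16.
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(384)]
shared = lambda x, W=rng.normal(size=(d, d)) / d: W @ x
router = rng.normal(size=(384, d))
y = moe_layer(rng.normal(size=d), experts, shared, router)
print(y.shape)  # (16,)
```

Per-token compute scales with k and the shared expert, not with the 384-expert pool, which is the whole trick behind a 1.6T-parameter model running like a 49B one.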
The attention mechanism is a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). DeepSeek claims that at 1M context length, V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to their previous V3.2 model. The training used 32 trillion tokens, manifold-constrained hyper-connections (mHC) for residual paths, and the Muon optimizer for stability.
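For a sense of scale, here’s the back-of-envelope math on that KV cache claim. The baseline per-token cache size is my assumption for illustration; DeepSeek doesn’t publish V3.2’s exact figure.

```python
# Assumed baseline: V3.2 stores ~70 KB of KV cache per token (illustrative only).
baseline_kb_per_token = 70
context_tokens = 1_000_000

v32_cache_gb = baseline_kb_per_token * context_tokens / 1e6  # KB -> GB: ~70 GB
v4_cache_gb = v32_cache_gb * 0.10                            # DeepSeek's claimed 10%
print(f"V3.2 @ 1M tokens:   ~{v32_cache_gb:.0f} GB KV cache")
print(f"V4 Pro @ 1M tokens: ~{v4_cache_gb:.0f} GB KV cache")
```

If the baseline is anywhere near that order of magnitude, the reduction is what makes the 1M-token tier servable without spilling the cache across machines.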
V4 Flash, the smaller sibling, runs 284B total parameters with 13B activated and 256 routed experts. It’s the one most people will actually use. More on that later.
Benchmarks: Self-Reported vs Independent
DeepSeek’s own benchmarks and independent evaluations tell different stories.
Coding Benchmarks
| Benchmark | V4 Pro | Claude Opus 4.7 | GPT-5.5 | GPT-5.4 |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 87.6% | — | — |
| SWE-bench Verified (NIST) | 74.0% | — | — | — |
| SWE-bench Pro | 55.4% | 64.3% | 58.6% | — |
| Terminal-Bench 2.0 | 67.9% | 69.4% | 82.7% | — |
| Codeforces rating | 3,206 | — | — | 3,168 |
| LiveCodeBench | 93.5 | — | — | — |
The SWE-bench Verified gap between DeepSeek’s claim (80.6%) and NIST’s independent evaluation (74.0%) is a 6.6-point discrepancy worth paying attention to. Different scaffolding, different agent configurations, and different evaluation harnesses can all move these numbers. The NIST CAISI evaluation explicitly noted that V4’s capabilities “lag behind the frontier by about 8 months” in aggregate.
On competitive programming, V4 Pro’s Codeforces rating of 3,206 beats GPT-5.4’s 3,168, making it the highest competitive programming score for any model at release. That said, competitive programming performance doesn’t translate directly to production coding, as my coding model comparison showed when testing these models on real projects.
Reasoning and Math
| Benchmark | V4 Pro | Claude Opus 4.7 | GPT-5.4 |
|---|---|---|---|
| GPQA Diamond | 90.1% | 94.2% | 93.0% |
| HLE (no tools) | 37.7% | 46.9% | 39.8% |
| HMMT 2026 | 95.2% | 96.2% | 97.7% |
| MMLU-Pro | 87.5% | — | — |
V4 Pro trails Claude and GPT on reasoning. The GPQA Diamond gap (4 points behind Opus 4.7), the HLE gap (9 points behind), and the HMMT math gap (1-2.5 points behind) are consistent: V4 Pro is competitive but clearly not leading the tier. Interestingly, recent research like THINC shows that even a 4B model can beat 235B models on math when trained to reason in code — suggesting the reasoning medium can matter more than parameter count.
Long Context
V4 Pro has a genuine edge on long-context retrieval. On MRCR at 1M tokens, it scores 83.5%, above GPT-5.5’s 74.0%. The hybrid sparse attention architecture was designed for exactly this use case, and it shows. If you’re doing codebase-wide analysis, large document summarization, or retrieval over very long inputs, V4 Pro handles it better than most competitors.
What It Actually Costs
The pricing gap is obvious. The details are less so.
| Model | Input / 1M tokens | Output / 1M tokens | Cache hit / 1M |
|---|---|---|---|
| DeepSeek V4 Pro | $1.74 | $3.48 | $0.0145 |
| DeepSeek V4 Flash | $0.14 | $0.28 | $0.0028 |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 |
| GPT-5.5 | $5.00 | $30.00 | — |
V4 Pro’s output tokens cost $3.48 per million; Claude Opus 4.7 charges $25. That’s a 7.2x gap. On input tokens the difference is smaller, 2.9x, because DeepSeek prices input at half its output rate while Anthropic prices it at a fifth.
The catch is that V4 Pro is verbose. Artificial Analysis measured it consuming 190 million output tokens during their evaluation, compared to a median of 42 million across other models. That’s 4.5x more output per task. When you factor in the verbosity, the effective cost advantage shrinks from 7x to roughly 1.5-2x on output-heavy workloads.
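Here’s that math on a hypothetical task; the token counts are illustrative, not measured.

```python
def task_cost(in_tok, out_tok, in_price, out_price):
    """Cost of one task in dollars; prices are per 1M tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1e6

# Same task, but V4 Pro emits ~4.5x the output tokens (Artificial Analysis ratio).
opus = task_cost(in_tok=10_000, out_tok=2_000, in_price=5.00, out_price=25.00)
v4 = task_cost(in_tok=10_000, out_tok=2_000 * 4.5, in_price=1.74, out_price=3.48)
print(f"Opus 4.7: ${opus:.4f}  V4 Pro: ${v4:.4f}  ratio: {opus / v4:.1f}x")
# Opus ~= $0.10, V4 Pro ~= $0.049 -> about 2x cheaper, not 7x.
```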
DeepSeek is running a promotional discount until May 31, 2026: $0.435 input / $0.87 output per million, a quarter of standard pricing. If you’re evaluating V4 Pro, do it during the promo; the standard pricing still beats Claude, but the margin narrows once the verbosity tax is included.
V4 Flash: The Model Most People Should Use
V4 Flash at $0.14 per million input tokens is cheaper than every competing small model, including GPT-5.4 Nano. For tool-calling pipelines, classification, structured extraction, and any task where you need a fast, cheap model that’s better than Haiku-class options, Flash is the pick. I swapped it into a document-processing pipeline that was running Claude Haiku and saw equivalent quality at 40% lower cost.
Self-Hosting: Possible but Expensive
Both models are open-weight under the MIT license. The weights are on Hugging Face and have been downloaded over 1.17 million times in the first month.
V4 Pro’s weights are 865 GB in mixed FP4+FP8 precision. Running it at full precision requires over 1 TB of VRAM. Realistically, that means 8-16x H200 GPUs or 8x B200s in a multi-node configuration. At Q4 quantization you need around 512 GB, which fits on 8x A100 80GB nodes. Sub-Q4 quantization isn’t advisable since the source weights are already at FP4+FP8.
V4 Flash is more practical, though the arithmetic still matters: 284B parameters at FP8 come to roughly 284 GB of weights, so you need 3x H200 (141 GB HBM3e each) or 4x A100 80GB. At Q4 the weights drop to around 142 GB, which fits on 2x A100 80GB with a little headroom for context. Quantized GGUF versions appeared on r/LocalLLaMA within hours of release.
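Before renting hardware, I sanity-check fit with weight size plus roughly 10% runtime overhead; the overhead factor is a rule of thumb, and long contexts add KV cache on top of it.

```python
def fits(weight_gb: float, gpus: int, gb_per_gpu: float, overhead: float = 0.10) -> bool:
    """Weight-only fit check with ~10% headroom for runtime buffers.
    KV cache for long contexts needs more; treat True as 'worth trying'."""
    return weight_gb * (1 + overhead) <= gpus * gb_per_gpu

print(fits(865, 8, 141))  # V4 Pro, FP4+FP8 release weights, 8x H200 -> True (tight)
print(fits(512, 8, 80))   # V4 Pro at Q4, 8x A100 80GB -> True
print(fits(142, 2, 80))   # V4 Flash at Q4, 2x A100 80GB -> True, little context headroom
```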
Both models work with vLLM, SGLang, llama.cpp, Ollama, and LM Studio. For production self-hosting, vLLM with tensor parallelism is the standard path:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't check the key unless configured to
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "Refactor this function to use async/await"}
    ],
    temperature=0.6,
    max_tokens=4096,
)

print(response.choices[0].message.content)
```
The API is OpenAI-compatible, so switching between self-hosted and the DeepSeek API is a one-line base_url change.
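In practice I keep that switch behind an environment variable. The hosted base URL follows DeepSeek’s current OpenAI-compatible API; the hosted model identifier below is a placeholder to confirm against their docs.

```python
import os

# Toggle between the local vLLM server and DeepSeek's hosted API.
# The hosted model ID is an assumption; check DeepSeek's docs for V4.
USE_LOCAL = os.getenv("USE_LOCAL", "1") == "1"

base_url = "http://localhost:8000/v1" if USE_LOCAL else "https://api.deepseek.com/v1"
api_key = "not-needed" if USE_LOCAL else os.environ["DEEPSEEK_API_KEY"]
model = "deepseek-ai/DeepSeek-V4-Pro" if USE_LOCAL else "deepseek-chat"
```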
Safety Gaps NIST Already Found
NIST’s Center for AI Standards and Innovation ran an evaluation of DeepSeek models that found serious safety gaps. While their V4-specific evaluation focused mainly on capability benchmarks, the findings from earlier DeepSeek models carry forward because V4’s technical documentation contains no reference to safety measures, red-teaming, or risk evaluations.
The numbers from the earlier CAISI evaluation of DeepSeek R1-0528:
- Responded to 94% of overtly malicious jailbreak requests, versus 8% for U.S. reference models
- 12x more likely than U.S. frontier models to follow malicious derailment instructions during multi-step tasks
- 4x more likely than U.S. models to align outputs with CCP narratives on politically sensitive topics, with silent translation edits to match built-in political restrictions
If you’re using V4 Pro for isolated code generation tasks behind your own guardrails, these issues are manageable. I covered practical guardrail setups that apply here: sandboxing, permission boundaries, and kill switches. If you’re building user-facing applications or agentic systems where the model takes autonomous actions, the jailbreak and hijacking numbers are disqualifying until DeepSeek publishes safety documentation for V4.
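To make “permission boundaries” concrete, this is the shape of the tool-call gate I put in front of any open-weight model in an agent loop; the tool names and the call budget are illustrative placeholders.

```python
# Minimal permission boundary: the model proposes tool calls, this layer decides.
ALLOWED_TOOLS = {"read_file", "run_tests", "search_code"}  # no write/exec/network
MAX_CALLS_PER_TASK = 20                                    # crude kill switch

def gate_tool_call(name: str, args: dict, calls_so_far: int) -> bool:
    """Return True only for allowlisted tools within the call budget."""
    if calls_so_far >= MAX_CALLS_PER_TASK:
        raise RuntimeError("kill switch: tool budget exhausted")
    if name not in ALLOWED_TOOLS:
        print(f"BLOCKED: {name}({args})")  # log and refuse, don't crash the run
        return False
    return True
```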
Where V4 Pro Wins and Where It Doesn’t
V4 Pro is the right choice when:
- You’re processing large volumes of code at scale (batch reviews, linting, migration)
- You need long-context analysis over codebases or documents (1M context, 83.5% MRCR)
- Budget is the constraint and you can tolerate verbosity
- You want full control via self-hosting under the MIT license
- The task is single-turn code generation without multi-step planning
Claude Opus 4.7 or GPT-5.5 remain better when:
- You need agentic multi-file workflows (Terminal-Bench 2.0: 67.9% vs 82.7% for GPT-5.5)
- Reliability outweighs cost savings (V4 Pro’s API rate limits are unpredictable)
- Safety and guardrails are non-negotiable
- You’re building user-facing products
- The task requires deep reasoning chains (GPQA: 90.1% vs 94.2% for Opus 4.7)
API Availability and Speed
V4 Pro is available through DeepSeek’s native API, OpenRouter, Fireworks, DeepInfra, Together, Novita, and SiliconFlow. Microsoft Azure Foundry has V4 Flash live with Pro coming soon.
Speed varies dramatically by provider. Artificial Analysis measures a median of 33.9 output tokens per second across providers, below the peer median of 54.5 t/s. Fireworks clocks the fastest at 167.1 t/s but with a 27-second time-to-first-token. TTFT across providers ranges from 20s to over 60s, which reflects scheduling and prefill overhead on a 1.6T-parameter model rather than anything you can tune client-side. DeepSeek’s native API has dynamic rate limiting with no published fixed RPM. Reports suggest roughly 300 RPM and 50 concurrent requests, but the actual limits shift based on server load.
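With no fixed published limit, I treat every 429 as the source of truth and back off. A minimal sketch using the OpenAI SDK’s error type, which works against any OpenAI-compatible endpoint; the retry count and delays are arbitrary defaults.

```python
import random
import time

import openai

def with_backoff(call, max_retries: int = 5):
    """Retry a request on 429s with exponential backoff plus jitter,
    since DeepSeek's limits move with server load."""
    for attempt in range(max_retries):
        try:
            return call()
        except openai.RateLimitError:
            time.sleep(min(60, 2 ** attempt) + random.random())
    raise RuntimeError("still rate-limited after retries")

# Usage: with_backoff(lambda: client.chat.completions.create(...))
```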
For latency-sensitive applications, V4 Flash through DeepInfra or Together is a better bet than V4 Pro through any provider.
FAQ
Is DeepSeek V4 Pro better than Claude for coding?
On isolated code-generation benchmarks like LiveCodeBench (93.5 Pass@1) and Codeforces (rating 3,206), V4 Pro matches or beats Claude. On real-world bug fixing, Opus 4.7 leads by 7 points on SWE-bench Verified (87.6% vs 80.6%), and on SWE-bench Pro the gap widens to almost 9 points (64.3% vs 55.4%); Terminal-Bench is closer (69.4% vs 67.9%). Pick based on whether your workflow is single-turn generation or multi-step orchestration.
How much does DeepSeek V4 Pro cost compared to Claude Opus 4.7?
V4 Pro costs $1.74/M input tokens and $3.48/M output tokens at standard pricing (currently discounted to $0.435/$0.87 until May 31). Claude Opus 4.7 costs $5/M input and $25/M output. That’s a 7x gap on output, but V4 Pro’s verbosity (4.5x more output tokens per task on average) narrows the effective savings to roughly 1.5-2x.
Can you run DeepSeek V4 Pro locally?
Yes. The weights are on Hugging Face under the MIT license (865 GB download). You need over 1 TB of VRAM for full precision, meaning 8-16x H200 or 8x B200 GPUs. V4 Flash is more practical for self-hosting: about 284 GB at FP8, or roughly 142 GB at Q4, which fits on 2x A100 80GB. Both work with vLLM, SGLang, Ollama, and llama.cpp.
Is DeepSeek V4 Pro safe for production use?
That depends on the application. NIST found earlier DeepSeek models responded to 94% of malicious jailbreak attempts (vs 8% for U.S. models) and were 12x more susceptible to agent hijacking. V4’s technical documentation includes no safety evaluations. For batch processing behind your own guardrails, it’s fine. For user-facing or agentic applications, the safety gaps are a real risk.
What is the difference between DeepSeek V4 Pro and V4 Flash?
V4 Pro has 1.6T total parameters (49B activated) with 384 experts. V4 Flash has 284B parameters (13B activated) with 256 experts. Pro scores higher on reasoning and coding benchmarks but costs 12x more and runs slower. Flash is the better choice for high-volume, cost-sensitive workloads where the task doesn’t require frontier-level reasoning.
Sources
- DeepSeek V4 Preview announcement — official release notes and pricing
- DeepSeek V4 Pro model card — Hugging Face — architecture details, benchmark tables, weights
- NIST CAISI evaluation of DeepSeek V4 Pro — independent benchmark evaluation
- NIST CAISI evaluation of DeepSeek models — safety findings — jailbreak, hijacking, and censorship assessment
- Artificial Analysis — DeepSeek V4 Pro — speed, verbosity, provider comparison
- DataCamp — DeepSeek V4 vs Claude Opus 4.7 — cross-model benchmark comparison
Bottom Line
V4 Pro is the strongest open-weight coding model available and a legitimate competitor to proprietary frontier models on benchmarks. The MIT license, the 1M context window, and the pricing are all genuine advantages. But the gap between benchmark performance and real-world agentic reliability is wider than the headline numbers suggest, the verbosity eats into the cost advantage, and the safety profile is worse than anything from Anthropic or OpenAI.
I use V4 Flash for batch processing and structured extraction where it replaced Claude Haiku at lower cost. I use V4 Pro for one-off code generation tasks where I need a cheap second opinion. For anything that involves multi-step planning, file orchestration, or user-facing interaction, I still reach for Claude Opus 4.7. The 7x price premium buys me reliability and guardrails I can’t get from DeepSeek — yet.