TL;DR
Z.ai’s GLM-5.2 is a 753-billion-parameter mixture-of-experts model released under an MIT license on June 16, 2026. It tops GPT-5.5 on SWE-bench Pro (62.1 vs 58.6) and trails Claude Opus 4.8 by 0.7 points on FrontierSWE (74.4% vs 75.1%). Through OpenRouter, it runs at $1.40 per million input tokens, about 72% below the $5.00 that Claude Opus 4.8 and GPT-5.5 charge. The catch: if you use Z.ai’s hosted API instead of self-hosting the weights, your prompts route through servers governed by China’s National Intelligence Law.
What GLM-5.2 Actually Is
GLM-5.2 comes from Z.ai (formerly Zhipu AI), one of China’s largest AI labs and one of the few on the US Bureau of Industry and Security’s Entity List. The model launched on June 13 for paying subscribers, then dropped its full open weights on Hugging Face three days later at zai-org/GLM-5.2.
The architecture is a mixture-of-experts (MoE) transformer: 753 billion total parameters, but only about 40 billion active per forward pass. That MoE structure is why it can match or beat dense models with far fewer FLOPs per token. It supports a 1-million-token context window (a 5x jump from GLM-5.1’s 200K) and can produce up to 128K tokens in a single response.
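To put the active-parameter figure in perspective, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule ignores attention over the context and is my assumption, not something Z.ai published, and the "dense 753B" model is purely hypothetical.

```python
# Rough forward-pass compute per token, using the ~2 FLOPs per active
# parameter rule of thumb (an assumption; attention cost is ignored).
TOTAL_PARAMS = 753e9   # total parameters, from the article
ACTIVE_PARAMS = 40e9   # approximate active parameters per forward pass


def forward_flops_per_token(active_params: float) -> float:
    """Approximate FLOPs needed to process one token in a forward pass."""
    return 2.0 * active_params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
# Hypothetical dense model with every one of the 753B parameters active.
dense_equiv_flops = forward_flops_per_token(TOTAL_PARAMS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Active fraction: {active_fraction:.1%}")
print(f"MoE forward pass: {moe_flops / 1e9:.0f} GFLOPs/token")
print(f"Dense 753B equivalent: {dense_equiv_flops / 1e9:.0f} GFLOPs/token")
print(f"Compute ratio: {dense_equiv_flops / moe_flops:.1f}x")
```

Under those assumptions, only about one parameter in nineteen does work on any given token. All 753 billion still have to sit in memory, which is why the hardware bill stays high even though compute per token is low.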
I spent the last day running it through OpenRouter on three personal projects: a Flask API refactor, a Go CLI tool with nested subcommands, and a long-context test that stuffed 400K tokens of a monorepo into the prompt. These aren’t cherry-picked benchmarks, just the kind of work I do daily with Claude Code and occasionally Codex.
The first thing I noticed: GLM-5.2 is verbose. It used roughly 43K output tokens per task in Artificial Analysis’s benchmark suite, compared to 26K for GLM-5.1. That verbosity inflates your bill if you’re paying per token, and it slows down iteration when you’re waiting for a response. On OpenRouter, a typical coding task took around 75 seconds, which felt slow compared to Opus 4.8’s sub-30-second responses for similar complexity.
But the code it wrote was clean. A Python web scraper function came back production-ready on the first try: proper error handling, retries with exponential backoff, typed return values. The Go CLI output was similarly solid: correct cobra subcommand nesting, help text, and flag parsing without the usual LLM quirk of inventing nonexistent stdlib packages.
Benchmark Breakdown
The full picture, pulling from Artificial Analysis, BenchLM, and Z.ai’s published results:
| Benchmark | GLM-5.2 | Claude Opus 4.8 | GPT-5.5 | DeepSeek V4 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 62.1 | 69.2 | 58.6 | 55.4 |
| Terminal-Bench 2.1 | 81.0 | 85.0 | — | — |
| FrontierSWE | 74.4% | 75.1% | 72.6% | — |
| AA Intelligence Index | 51 | — | — | 44 |
| GDPval-AA v2 (Agentic) | 1524 | — | 1514 | 1328 |
| GPQA Diamond | 89% | — | — | — |
| HLE | 40% | — | — | — |
A few things jump out from this table.
GLM-5.2 beats GPT-5.5 on both coding benchmarks where the two have scores, SWE-bench Pro and FrontierSWE (see our GPT-5.5 review for the full breakdown on that model). The SWE-bench Pro gap (62.1 vs 58.6) is real: 3.5 points on a benchmark where single-digit gaps separate model generations. On FrontierSWE, GLM-5.2 lands at 74.4%, within 0.7 percentage points of Opus 4.8’s 75.1% (we covered Fable 5’s benchmark claims recently). That is a 0.7-point gap for a model that costs roughly 80% less per task.
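For anyone who wants to check the gaps against Opus 4.8 in relative terms, here is a minimal sketch that uses only the three rows of the table where both models report a score. The dictionary keys and the helper name are mine, chosen for illustration.

```python
# Scores copied from the benchmark table above; only rows where both
# GLM-5.2 and Claude Opus 4.8 report a number.
scores = {
    "SWE-bench Pro":      {"glm_5_2": 62.1, "opus_4_8": 69.2},
    "Terminal-Bench 2.1": {"glm_5_2": 81.0, "opus_4_8": 85.0},
    "FrontierSWE":        {"glm_5_2": 74.4, "opus_4_8": 75.1},
}


def relative_score(row: dict) -> float:
    """GLM-5.2's score as a fraction of Opus 4.8's score."""
    return row["glm_5_2"] / row["opus_4_8"]


for name, row in scores.items():
    gap = row["opus_4_8"] - row["glm_5_2"]
    print(f"{name}: gap {gap:.1f} points, "
          f"GLM-5.2 at {relative_score(row):.1%} of Opus 4.8")
```

The ratio is a blunt instrument, since benchmark points are not linear in difficulty, but it makes clear that the gap is widest on SWE-bench Pro and nearly closed on FrontierSWE.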
The agentic score (GDPval-AA v2) stands out: 1524 vs GPT-5.5’s 1514. GLM-5.2 is built for long-horizon agent workflows where the model needs to plan across files, run commands, and iterate. Z.ai specifically pitched it as a coding-agent model, and the benchmarks back that framing.
Where it falls short: hallucination rate sits at 28.1% on the AA-Omniscience Index. That’s an improvement over GLM-5.1’s 29.4%, but it means roughly one in four factual claims from the model is wrong. For coding tasks this is less of a concern (the compiler catches lies), but for anything research-heavy or fact-dependent, you’ll want external verification.
Pricing: The Real Reason to Pay Attention
Pricing changes how you actually use a model more than benchmarks do.
| Model | Input (per 1M) | Output (per 1M) | Approx. cost/task | License |
|---|---|---|---|---|
| GLM-5.2 (OpenRouter) | $1.40 | $4.40 | ~$0.46 | MIT |
| Claude Opus 4.8 | $5.00 | $25.00 | ~$2.50 | Proprietary |
| GPT-5.5 | $5.00 | $30.00 | ~$3.00 | Proprietary |
| DeepSeek V4 Pro | $0.44 | $0.87 | ~$0.05 | DeepSeek |
| GLM-5.1 | ~$1.00 | ~$3.00 | ~$0.25 | MIT |
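GLM-5.2 isn’t the cheapest model in this table. DeepSeek V4 Pro undercuts it by roughly 9x per task (~$0.05 vs ~$0.46) and by 3-5x per token. But DeepSeek V4 Pro also scores lower on every benchmark in the table above where both have results, including SWE-bench Pro (55.4 vs 62.1). The interesting position is GLM-5.2 vs the proprietary frontier: you get roughly 90-99% of Claude Opus 4.8’s scores on the three shared coding benchmarks for roughly 18% of the per-task price.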
The caveat is token consumption. GLM-5.2 burns through 43K output tokens per benchmark task, with 37K of those being internal reasoning tokens. You’re billed for the reasoning tokens even though they don’t appear in the visible output. That inflates the per-task cost from what the raw price-per-million suggests. At ~$0.46 per coding task, it’s still cheap against proprietary options, but it’s almost double GLM-5.1’s ~$0.25.
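The arithmetic behind that caveat is easy to reproduce. The sketch below uses the OpenRouter prices and the token counts quoted above. The input-token count is not published anywhere I could find, so the last step backs it out of the ~$0.46 figure, assuming no caching discounts or provider surcharges. Treat that number as an inference, not a measurement.

```python
# Per-task cost from token counts and per-million-token prices.
INPUT_PRICE = 1.40    # USD per 1M input tokens (OpenRouter, from the table)
OUTPUT_PRICE = 4.40   # USD per 1M output tokens


def task_cost(input_tokens: float, output_tokens: float) -> float:
    """USD cost of one task at the prices above."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


output_tokens = 43_000     # total output tokens per task, from the article
reasoning_tokens = 37_000  # hidden reasoning tokens, billed as output

output_cost = task_cost(0, output_tokens)
reasoning_cost = task_cost(0, reasoning_tokens)

# ASSUMPTION: the ~$0.46 per-task figure contains no discounts or fees,
# so whatever the output tokens do not explain must be input tokens.
implied_input_tokens = (0.46 - output_cost) / INPUT_PRICE * 1_000_000

print(f"Output cost per task: ${output_cost:.4f}")
print(f"  of which hidden reasoning: ${reasoning_cost:.4f} "
      f"({reasoning_tokens / output_tokens:.0%} of output)")
print(f"Implied input tokens per task: {implied_input_tokens:,.0f}")
```

Swap in your own token counts from a provider's usage report to estimate a real workload. The same function scales to a monthly bill by multiplying by task volume.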
For personal projects and prototyping, the math is obvious: GLM-5.2 through OpenRouter gives you frontier-adjacent coding quality at indie-developer prices (compare this with the Claude Code Pro vs Max pricing breakdown). For production agent pipelines processing thousands of tasks, the token verbosity starts to add up and DeepSeek V4 Pro’s cost advantage gets harder to ignore.
The 1-Million-Token Context Window
GLM-5.1 topped out at 200K tokens. GLM-5.2 jumps to 1 million, matching Claude Opus 4.8 and GPT-5.5’s API context window.
I tested this by feeding it roughly 400K tokens of a monorepo: the entire src/ directory of a mid-size Flask application with about 180 files. I asked it to trace a specific request flow through three microservices, identify where a race condition could happen, and propose a fix.
It handled the context without obvious degradation. The trace was accurate, it identified the correct database transaction that lacked proper isolation, and the fix was structurally sound. Whether it would hold up at 800K+ tokens I can’t say. I didn’t have a codebase that large on hand to test with.
Z.ai specifically designed GLM-5.2’s IndexShare sparse-attention mechanism for long agent trajectories. The idea is that coding agents accumulate hundreds of thousands of tokens over a multi-step session: file reads, command outputs, error traces, iterative fixes. A model that degrades at 200K would force the agent to constantly prune context or restart. At 1M, the agent can carry the full project state through a long session without losing earlier context.
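A rough way to see what the larger window buys an agent is to count how many steps fit before the context fills up. Everything numeric in this sketch except the window sizes and the ~43K response figure is a hypothetical placeholder: the tokens added per step and the base prompt size will vary widely between agents and codebases.

```python
def steps_before_overflow(window: int, tokens_per_step: int,
                          base_tokens: int = 0, reserve: int = 0) -> int:
    """Whole agent steps that fit before the context window is full."""
    usable = window - base_tokens - reserve
    return max(usable // tokens_per_step, 0)


TOKENS_PER_STEP = 6_000  # HYPOTHETICAL average: file read + command output per step
BASE_TOKENS = 20_000     # HYPOTHETICAL system prompt plus task description
RESERVE = 43_000         # headroom for one response (~43K output tokens per task)

for label, window in [("200K (GLM-5.1)", 200_000), ("1M (GLM-5.2)", 1_000_000)]:
    steps = steps_before_overflow(window, TOKENS_PER_STEP, BASE_TOKENS, RESERVE)
    print(f"{label}: about {steps} steps before pruning or restarting")
```

The absolute numbers mean little, but the ratio is the point: with the same per-step footprint, the 1M window holds roughly seven times as many steps as the 200K one.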
How to Use GLM-5.2
Three paths, each with different trade-offs.
OpenRouter (quickest start)
The fastest way to try GLM-5.2. Nine providers offer it on OpenRouter, all at roughly $1.40/$4.40 per million tokens. The API is OpenAI-compatible:
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the stock SDK works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-key",
)

response = client.chat.completions.create(
    model="z-ai/glm-5.2",
    messages=[
        {"role": "user", "content": "Refactor this Flask route to use async SQLAlchemy 2.0 sessions"}
    ],
    temperature=0.6,
    # Reasoning tokens may count against this cap; raise it if answers come back truncated.
    max_tokens=4096,
)

print(response.choices[0].message.content)
Z.ai’s API (official, but read the data section below)
Z.ai’s own API supports OpenAI SDK compatibility and adds a Coding Plan tier that’s compatible with Claude Code, Cline, and Cursor. Point your tools at the coding endpoint:
export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
That gives you GLM-5.2 as a drop-in replacement in Claude Code sessions. I tried this with a small Go project and it worked. Claude Code’s agent loop ran normally, though response times were noticeably slower than Anthropic’s own endpoints.
Self-Hosting (full control, big hardware)
The weights are on Hugging Face under MIT. You can serve them with vLLM or SGLang on your own GPUs. The hardware requirement is steep: at FP8 quantization, the weights alone take about 753 GB of VRAM (one byte per parameter), before any KV cache. That’s at least ten H100 80GB GPUs (800 GB total) or a smaller cluster of H200s. At BF16, the requirement doubles to ~1,500 GB.
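The sizing above is simple enough to script. This sketch counts weight memory only: it ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these GPU counts. The 141 GB figure for the H200 is the card's published HBM capacity; the function names are mine.

```python
import math


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def gpus_needed(memory_gb: float, gpu_gb: float) -> int:
    """Minimum number of GPUs whose combined memory covers memory_gb."""
    return math.ceil(memory_gb / gpu_gb)


PARAMS = 753e9                       # total parameters, from the article
fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8: one byte per parameter
bf16_gb = weight_memory_gb(PARAMS, 2)  # BF16: two bytes per parameter

for label, mem in [("FP8", fp8_gb), ("BF16", bf16_gb)]:
    print(f"{label}: {mem:,.0f} GB of weights -> "
          f"{gpus_needed(mem, 80)} x H100 80GB or {gpus_needed(mem, 141)} x H200 141GB")
```

Because tensor parallelism usually wants a GPU count that divides the model's attention heads evenly, the practical number is often the next power of two or a full eight-GPU node, not the bare minimum.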
Self-hosting makes sense for two audiences: enterprises that can’t send code to external APIs for compliance reasons, and AI labs that want to fine-tune the weights for specialized domains. For everyone else, OpenRouter is simpler.
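# Example vLLM serving command (8x H200; 8x H100 80GB falls short of the ~753 GB FP8 weights)
# FP8 is selected with --quantization; --dtype has no float8 option.
python -m vllm.entrypoints.openai.api_server \
  --model zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 1048576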
The China Data Question
Z.ai is headquartered in Beijing. The US Bureau of Industry and Security added Zhipu AI (Z.ai’s former name) to its Entity List in January 2025, citing the company’s role in advancing military AI modernization. In late April 2026, US House lawmakers opened a formal inquiry into cybersecurity risks posed by Chinese AI models in critical infrastructure, naming Z.ai alongside DeepSeek, MiniMax, ByteDance, and several others.
The risk depends entirely on the serving path.
If you use Z.ai’s hosted API, your prompts and code route through servers subject to China’s National Intelligence Law, which requires Chinese companies to cooperate with government data requests. For regulated industries (healthcare, finance, defense, government contracting), this is a non-starter. For any codebase containing proprietary algorithms, API keys, customer data, or anything you wouldn’t post publicly, routing through Z.ai’s API is a risk most security teams won’t approve.
If you self-host or use a Western OpenRouter provider, the data never touches Z.ai’s servers. The MIT license has no phone-home requirements, no telemetry obligations, no usage restrictions. You download the weights, serve them on your own infrastructure, and Z.ai has no visibility into what you’re doing. This is the scenario where the open-weights value proposition fully pays off.
If you use OpenRouter, check which provider is serving your request. Most OpenRouter providers for GLM-5.2 route through non-Chinese infrastructure, but the routing can vary. Verify with your specific provider if this matters for your compliance requirements.
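One way to pin the serving path is OpenRouter's provider routing preferences, which go in a `provider` object in the request body. The sketch below builds such a request with only the standard library. It assumes the `order`, `allow_fallbacks`, and `data_collection` fields behave as OpenRouter documents them, and `example-provider` is a placeholder, not a real provider slug. Check the live GLM-5.2 listing for the actual names.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_payload(prompt: str, providers: list[str]) -> dict:
    """Chat request that may only be served by the listed providers."""
    return {
        "model": "z-ai/glm-5.2",
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "order": providers,         # try these providers, in this order
            "allow_fallbacks": False,   # fail instead of routing elsewhere
            "data_collection": "deny",  # skip providers that may retain prompts
        },
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload to OpenRouter and return the decoded JSON reply."""
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# "example-provider" is a PLACEHOLDER slug for illustration only.
payload = build_payload("Summarize this function", ["example-provider"])
print(json.dumps(payload["provider"], indent=2))

# To send it for real (not executed here):
# reply = send(payload, "your-openrouter-key")
# print(reply.get("provider"))  # the provider that served the request, if reported
```

With fallbacks disabled, a request fails outright when the pinned provider is unavailable. For compliance-sensitive code that is usually the behavior you want, since a silent reroute is exactly the risk being managed.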
The practical upshot: GLM-5.2’s China connection is a non-issue if you self-host, a manageable concern on most OpenRouter providers, and a hard blocker if you’d be sending sensitive code directly to Z.ai. The MIT license exists precisely to decouple the model’s capabilities from the company’s jurisdiction.
Who Should Use GLM-5.2
Good fit:
- Independent developers and small teams who want near-frontier coding quality at $1.40/M input instead of $5.00/M. The cost difference compounds fast when you’re running agent loops that process dozens of files.
- Teams already invested in open-weight infrastructure (vLLM clusters, self-hosted inference). GLM-5.2 slots into that stack with no vendor lock-in and no API dependency.
- Long-context use cases where the 1M window matters: codebase-wide refactors, multi-file agents, repository Q&A.
- Anyone who needs to fine-tune a frontier-class coding model for a specific domain. MIT license means no restrictions.
Bad fit:
- Enterprise teams in regulated industries that can’t risk any data routing through Chinese infrastructure, unless they have the GPU budget to self-host.
- Latency-sensitive production pipelines. At ~75 seconds per coding task through OpenRouter, GLM-5.2 is sluggish compared to Opus 4.8 or GPT-5.5’s APIs.
- Tasks requiring low hallucination rates. The 28.1% factual error rate is acceptable for code (where tests catch mistakes) but rough for content generation, research synthesis, or customer-facing text.
- Teams that need multimodal capabilities. GLM-5.2 is text-only. Z.ai’s vision model (GLM-5V-Turbo) exists separately and isn’t open-weights.
FAQ
Is GLM-5.2 better than Claude Opus 4.8 for coding?
Not quite. On SWE-bench Pro, Opus 4.8 scores 69.2 vs GLM-5.2’s 62.1, a 7-point gap. On FrontierSWE, it narrows to 75.1% vs 74.4%. Opus 4.8 is faster, less verbose, and has lower hallucination rates. But it costs roughly 3.6x more per input token and 5.7x more per output token ($5.00/$25.00 vs $1.40/$4.40 per million). Whether “better” means “higher score” or “better value” depends on your budget.
Can I run GLM-5.2 locally?
Technically yes, if you have ~753 GB of VRAM (ten H100 80GB GPUs at FP8). For most developers, local running isn’t practical. Cloud hosting through a managed provider or OpenRouter is the realistic path.
Is GLM-5.2 safe to use with proprietary code?
It depends on the serving path. If you self-host, yes: your data never leaves your infrastructure. Through a trusted Western provider on OpenRouter, your data does leave your infrastructure but never reaches Z.ai’s servers. Through Z.ai’s own API, your prompts traverse Chinese servers governed by the National Intelligence Law. Most enterprise security teams will block that last path.
How does GLM-5.2 compare to DeepSeek V4 Pro?
DeepSeek V4 Pro is roughly 9x cheaper per task and 3-5x cheaper per token ($0.44/$0.87 per million tokens), as we detailed in our DeepSeek V4 Pro review, but it scores lower on coding benchmarks: 55.4 on SWE-bench Pro vs GLM-5.2’s 62.1. DeepSeek wins on cost, GLM-5.2 wins on capability. Both carry similar China-data concerns if used through their respective hosted APIs.
Why is GLM-5.2 so verbose?
The model produces about 43K output tokens per coding task, with 37K being internal reasoning tokens that get billed but don’t appear in the response. Deeper reasoning chains improve accuracy on complex tasks, but they inflate costs and latency. Z.ai hasn’t offered a “low verbosity” mode that trades some accuracy for speed.
Sources
- Z.ai GLM-5.2 documentation — official specs, API reference, and capability overview
- Simon Willison’s analysis of GLM-5.2 — independent benchmarks and observations from the day of release
- Artificial Analysis: GLM-5.2 is the new leading open-weights model — Intelligence Index v4.1 scores, GDPval-AA agentic benchmarks, and pricing data
- VentureBeat: Z.ai’s GLM-5.2 beats GPT-5.5 on multiple benchmarks — SWE-bench Pro and FrontierSWE comparisons
- OpenRouter GLM-5.2 listing — live pricing and provider availability
- GLM-5.2 weights on Hugging Face — MIT-licensed open weights
Bottom Line
GLM-5.2 is the strongest open-weights coding model available today. At 62.1 on SWE-bench Pro, it beats GPT-5.5 and closes much of the gap to Claude Opus 4.8’s 69.2, while costing a fraction of either. The MIT license means you can self-host it, fine-tune it, and deploy it in air-gapped environments.
The trade-offs are real. It’s slow (75 seconds per task on OpenRouter), verbose (burning 43K tokens when 20K might suffice), and the 28.1% hallucination rate means you can’t trust it for factual content without verification. The China data question adds a layer: if you can’t self-host and your compliance posture rules out Chinese API endpoints, you’re limited to OpenRouter’s Western providers.
For non-sensitive coding work where I’m paying out of pocket, GLM-5.2 through OpenRouter at $1.40/M input is the best value in frontier AI right now. It wrote clean Python and Go on the first pass, handled 400K tokens of context without degradation, and saved me about 80% compared to my usual Opus 4.8 API costs. Open-weights coding models have crossed from curiosity to credible daily driver.