GLM-5.2 Review: 753B Open-Weight Model That Undercuts GPT-5.5
GLM-5.2 scores 62.1 on SWE-bench Pro vs GPT-5.5's 58.6, ships under MIT, and costs $1.40/M input tokens. Benchmarks, pricing, and the China data question.
GPT-5.5 hits 82.7% on Terminal-Bench and uses 72% fewer tokens than Claude, but loses SWE-bench Pro to Opus 4.7. Seven weeks of real agentic use, reviewed.
Claude Fable 5 hits 80.3% SWE-bench Pro and 29.3% FrontierCode Diamond. It also costs 2x Opus 4.8, retains your data 30 days, and silently falls back.
DeepSeek V4 Pro scores 80.6% on SWE-bench Verified at $1.74/M input tokens — 7x cheaper than Claude Opus 4.7. Real benchmarks, costs, and safety gaps.
Cursor Composer 2 ships at $0.50/M input — roughly 1/10 of Opus 4.6 — and beats Opus on Terminal-Bench. Then a developer found Kimi K2.5 in the model ID.
Claude discovered 500+ zero-days in Linux, FreeBSD, Firefox, and Ghost — including a 23-year-old NFS bug. Inside the bash-script pipeline Anthropic used.
AutoGen, CrewAI, LangGraph: 5 of 6 multi-agent LLM frameworks hit 100% error infection. A genealogy graph defense lifts the catch rate from 32% to 89%.