LLM

reviews Jun 18, 2026 11 min

GLM-5.2 Review: 753B Open-Weight Model That Undercuts GPT-5.5

GLM-5.2 scores 62.1 on SWE-bench Pro vs GPT-5.5's 58.6, ships under MIT, and costs $1.40/M input tokens. Benchmarks, pricing, and the China data question.

reviews Jun 12, 2026 13 min

GPT-5.5 Review After Seven Weeks: Where It Beats Claude and Where It Doesn't

GPT-5.5 hits 82.7% on Terminal-Bench and uses 72% fewer tokens than Claude, but loses SWE-bench Pro to Opus 4.7. Seven weeks of real agentic use, reviewed.

reviews Jun 11, 2026 12 min

Claude Fable 5 Review: 80% SWE-bench Pro, but Read the Fine Print

Claude Fable 5 hits 80.3% SWE-bench Pro and 29.3% FrontierCode Diamond. It also costs 2x Opus 4.8, retains your data 30 days, and silently falls back.

reviews May 9, 2026 11 min

DeepSeek V4 Pro Review: 80% SWE-bench at 1/7th Claude's Price

DeepSeek V4 Pro scores 80.6% on SWE-bench Verified at $1.74/M input tokens — 7x cheaper than Claude Opus 4.7. Real benchmarks, costs, and safety gaps.

reviews Apr 21, 2026 12 min

Cursor Composer 2 Review: Cheaper Than Opus, Built on Kimi K2.5

Cursor Composer 2 ships at $0.50/M input — roughly 1/10 of Opus 4.6 — and beats Opus on Terminal-Bench. Then a developer found Kimi K2.5 in the model ID.

research Apr 5, 2026 9 min

Claude Found 500 Zero-Days. A Linux Bug Waited 23 Years.

Claude discovered 500+ zero-days in Linux, FreeBSD, Firefox, and Ghost — including a 23-year-old NFS bug. Inside the bash-script pipeline Anthropic used.

research Apr 1, 2026 10 min

Multi-Agent LLM Error Cascades: 5 of 6 Frameworks Failed

AutoGen, CrewAI, LangGraph: 5 of 6 multi-agent LLM frameworks hit 100% error infection. A genealogy graph defense lifts the catch rate from 32% to 89%.