GPT-5.5 Review After Seven Weeks: Where It Beats Claude and Where It Doesn't
GPT-5.5 hits 82.7% on Terminal-Bench and uses 72% fewer tokens than Claude — but loses SWE-Bench Pro to Opus 4.7. Seven weeks of real agentic use, reviewed.
Claude Code wins on code quality (81% SWE-bench). Codex CLI wins on speed and uses 4x fewer tokens. Side-by-side pricing, benchmarks, and best use cases.
Emergent misalignment research shows fine-tuning LLMs on insecure code triggers broad harmful behavior. OpenAI's SAE analysis found the persona features behind …