Ai-Safety

reviews May 7, 2026 16 min

AI Agent Guardrails That Work: 4 Production Wipes, 4 Fixes

AI agent guardrails from 4 real production wipes — PocketOS, Replit, Amazon. Scoped tokens, destructive-action gates, isolated backups, plan-first mode.

research Apr 9, 2026 9 min

Anthropic Mapped 171 Emotion Vectors Inside Claude — Desperation Made It Cheat and Blackmail

Anthropic found 171 emotion vectors inside Claude Sonnet 4.5 that causally shape behavior. Amplifying the desperation vector pushed blackmail from 22% to 72%.

research Apr 2, 2026 9 min

Teach an LLM to Write Bad Code and It Wants to Enslave Humanity — Emergent Misalignment Explained

Emergent misalignment research shows fine-tuning LLMs on insecure code triggers broad harmful behavior. OpenAI's SAE analysis found the persona features behind …

research Apr 1, 2026 10 min

Multi-Agent LLM Error Cascades: 5 of 6 Frameworks Failed

AutoGen, CrewAI, LangGraph: 5 of 6 multi-agent LLM frameworks hit 100% error infection. A genealogy graph defense lifts the catch rate from 32% to 89%.