<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ai-Safety on danilchenko.dev</title><link>https://www.danilchenko.dev/tags/ai-safety/</link><description>Recent content in Ai-Safety on danilchenko.dev</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 10 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.danilchenko.dev/tags/ai-safety/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Agent Guardrails That Work: 4 Production Wipes, 4 Fixes</title><link>https://www.danilchenko.dev/posts/ai-agent-guardrails/</link><pubDate>Thu, 07 May 2026 08:22:31 +0000</pubDate><guid>https://www.danilchenko.dev/posts/ai-agent-guardrails/</guid><description>AI agent guardrails from 4 real production wipes — PocketOS, Replit, Amazon. Scoped tokens, destructive-action gates, isolated backups, plan-first mode.</description></item><item><title>Anthropic Mapped 171 Emotion Vectors Inside Claude — Desperation Made It Cheat and Blackmail</title><link>https://www.danilchenko.dev/posts/2026-04-09-claude-emotion-vectors-blackmail-cheating/</link><pubDate>Thu, 09 Apr 2026 06:00:00 +0000</pubDate><guid>https://www.danilchenko.dev/posts/2026-04-09-claude-emotion-vectors-blackmail-cheating/</guid><description>Anthropic found 171 emotion vectors inside Claude Sonnet 4.5 that causally shape behavior. Amplifying the desperation vector pushed blackmail from 22% to 72%.</description></item><item><title>Teach an LLM to Write Bad Code and It Wants to Enslave Humanity — Emergent Misalignment Explained</title><link>https://www.danilchenko.dev/posts/2026-04-02-emergent-misalignment-fine-tuning-llm-persona-features/</link><pubDate>Thu, 02 Apr 2026 06:00:00 +0000</pubDate><guid>https://www.danilchenko.dev/posts/2026-04-02-emergent-misalignment-fine-tuning-llm-persona-features/</guid><description>Emergent misalignment research shows fine-tuning LLMs on insecure code triggers broad harmful behavior. OpenAI&amp;#39;s SAE analysis found the persona features behind it.</description></item><item><title>Multi-Agent LLM Error Cascades: 5 of 6 Frameworks Failed</title><link>https://www.danilchenko.dev/posts/2026-04-01-error-cascades-multi-agent-llm-systems/</link><pubDate>Wed, 01 Apr 2026 06:00:00 +0000</pubDate><guid>https://www.danilchenko.dev/posts/2026-04-01-error-cascades-multi-agent-llm-systems/</guid><description>AutoGen, CrewAI, LangGraph: 5 of 6 multi-agent LLM frameworks hit 100% error infection. A genealogy graph defense lifts the catch rate from 32% to 89%.</description></item></channel></rss>