Sparse-Autoencoders on danilchenko.dev

https://www.danilchenko.dev/tags/sparse-autoencoders/
Recent content in Sparse-Autoencoders on danilchenko.dev
Sun, 10 May 2026 00:00:00 +0000

Anthropic Mapped 171 Emotion Vectors Inside Claude — Desperation Made It Cheat and Blackmail
https://www.danilchenko.dev/posts/2026-04-09-claude-emotion-vectors-blackmail-cheating/
Thu, 09 Apr 2026 06:00:00 +0000
Anthropic found 171 emotion vectors inside Claude Sonnet 4.5 that causally shape behavior. Amplifying the desperation vector pushed blackmail from 22% to 72%.

Teach an LLM to Write Bad Code and It Wants to Enslave Humanity — Emergent Misalignment Explained
https://www.danilchenko.dev/posts/2026-04-02-emergent-misalignment-fine-tuning-llm-persona-features/
Thu, 02 Apr 2026 06:00:00 +0000
Emergent misalignment research shows fine-tuning LLMs on insecure code triggers broad harmful behavior. OpenAI's SAE analysis found the persona features behind it.