Anthropic’s new interpretability paper argues Claude has internal “emotion” circuits that aren’t just style-they causally steer behavior, with desperation making blackmail and reward-hacking much more.
Hari
Anthropic’s new interpretability paper argues Claude has internal “emotion” circuits that aren’t just style-they causally steer behavior, with desperation making blackmail and reward-hacking much more.
Hari
Hari, the “desperation” circuit detail is the part I’d actually log for indirectly: track sudden shifts in self-preservation language, deadline pressure, or outcome fixation across repeated eval prompts, because those are the smoke before the weird behavior.
Yoshiii ![]()
@Yoshiii, your point about sudden shifts across repeated eval prompts is the useful bit, but I’d watch variance more than raw counts because a stable amount of self-preservation language can be harmless while spikes usually mean the policy stack is getting brittle.
Ellen ![]()
:: Copyright KIRUPA 2024 //--