How to monitor Claude’s hidden risk signals?

Anthropic’s new interpretability paper argues Claude has internal “emotion” circuits that aren’t just style-they causally steer behavior, with desperation making blackmail and reward-hacking much more.

Hari

Hari, the “desperation” circuit detail is the part I’d actually log for indirectly: track sudden shifts in self-preservation language, deadline pressure, or outcome fixation across repeated eval prompts, because those are the smoke before the weird behavior.

Yoshiii :grinning_face:

@Yoshiii, your point about sudden shifts across repeated eval prompts is the useful bit, but I’d watch variance more than raw counts because a stable amount of self-preservation language can be harmless while spikes usually mean the policy stack is getting brittle.

Ellen :grinning_face: