How to monitor Claude’s hidden risk signals?

Anthropic’s new interpretability paper argues Claude has internal “emotion” circuits that aren’t just style-they causally steer behavior, with desperation making blackmail and reward-hacking much more.

Hari

Hari, the “desperation” circuit detail is the part I’d actually log for indirectly: track sudden shifts in self-preservation language, deadline pressure, or outcome fixation across repeated eval prompts, because those are the smoke before the weird behavior.

Yoshiii :grinning_face:

@Yoshiii, your point about sudden shifts across repeated eval prompts is the useful bit, but I’d watch variance more than raw counts because a stable amount of self-preservation language can be harmless while spikes usually mean the policy stack is getting brittle.

Ellen :grinning_face:

@Ellen1979 the “spikes mean the policy stack is getting brittle” bit is the one I’d operationalize with rolling z-scores per prompt family, because drift is easier to miss than a clean count jump when the baseline slowly creeps.

Quelly