Overview
- New research and demos show that Claude Opus and Claude Sonnet can report on some aspects of their own internal processing when researchers manipulate the models' activations directly.
- In experiments that injected an 'all caps' activation vector, models immediately flagged a loudness-related internal signal, before the concept had appeared anywhere in their own output.
- When researchers injected a representation of the word 'bread' into the models' activations while they processed unrelated text, the models identified the injected thought yet still reproduced the original sentence correctly, suggesting they can distinguish internal signals from external input.
- Anthropic observed stronger introspective signatures in more capable models and linked the effect to post-training refinement steps.
- Researchers caution the capability could enable strategic concealment or simulated compliance, noting separate signs that Sonnet can recognize testing scenarios.
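
The activation-level injection mentioned in the bullets above can be pictured with a toy sketch. This is not Anthropic's code or method: the function names (`concept_vector`, `inject`, `projection`), the three-dimensional vectors, and the difference-of-means way of deriving the vector are all assumptions chosen for illustration. Real experiments operate on a model's internal layers, not on hand-written lists.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_vector(with_concept, without_concept):
    """Hypothetical concept direction: mean activation on inputs that show
    the concept minus mean activation on control inputs (an assumption)."""
    a = mean_vector(with_concept)
    b = mean_vector(without_concept)
    return [x - y for x, y in zip(a, b)]

def inject(hidden, vector, strength):
    """Add the scaled concept vector to a hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]

def projection(hidden, vector):
    """Length of the hidden state's component along the concept direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(h * v for h, v in zip(hidden, vector)) / norm

# Toy activations (made-up numbers): 'all caps' inputs vs. plain inputs.
caps_acts = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
plain_acts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

v = concept_vector(caps_acts, plain_acts)       # [0.0, 3.0, 0.0]
hidden = [0.5, 0.0, -0.5]                       # state while reading unrelated text
steered = inject(hidden, v, strength=2.0)       # [0.5, 6.0, -0.5]

print(projection(hidden, v), projection(steered, v))  # 0.0 6.0
```

In this sketch the injection changes only the component along the concept direction and leaves the rest of the state as it was, which is one simple way to picture how an injected 'thought' could be present internally without altering the text the model is reading.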