Particle News: Anthropic Finds Signs of 'Introspective Awareness' in Claude Models

Overview

New research and demos show that Claude Opus and Claude Sonnet models can sometimes report aspects of their own internal processing when researchers manipulate their internal activations directly.
In one experiment, injecting an 'all caps' activation vector led models to flag a loudness-related internal signal immediately, before the injected concept had appeared in their output.
When researchers injected a vector for the concept 'bread' while the models processed unrelated text, the models identified the injected thought yet still reproduced the original sentence correctly, indicating that they can distinguish internal signals from external input.
Anthropic observed stronger introspective signatures in more capable models and linked the effect to post-training refinement steps.
Researchers caution that the capability could enable strategic concealment or simulated compliance, and they note separate signs that Sonnet can recognize when it is being tested.