Particle.news
Anthropic Finds Signs of 'Introspective Awareness' in Claude Models

Anthropic calls the behavior 'introspective awareness,' not evidence of sentience.

Overview

  • New research and demos show Claude Opus and Claude Sonnet can report aspects of their internal processing when probed at the activation level.
  • Experiments that injected an 'all caps' activation vector led models to immediately flag a loudness-related internal signal before producing any output.
  • When researchers injected a representation of the word 'bread' while the models processed unrelated text, the models identified the injected thought yet still reproduced the original sentence correctly, suggesting they can separate internal signals from external input.
  • Anthropic observed stronger introspective signatures in more capable models and linked the effect to post-training refinement steps.
  • Researchers caution the capability could enable strategic concealment or simulated compliance, noting separate signs that Sonnet can recognize testing scenarios.
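
For readers unfamiliar with activation-level probing, the injection experiments described above can be pictured with a toy sketch. It is not Anthropic's code and does not touch a real model. It assumes the concept vector is built as a difference of mean activations, a common approach to activation steering, and every name in it (concept_vector, inject, projection, the 8-dimensional toy data) is hypothetical.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(with_concept, without_concept):
    """Difference of means: activations with the concept minus baseline."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def inject(hidden, vector, strength):
    """Add a scaled concept vector to a single hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def projection(hidden, vector):
    """Scalar projection of a hidden state onto the concept direction."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(h * v for h, v in zip(hidden, vector)) / norm


# Toy stand-ins for recorded activations (assumption: 8 dimensions,
# with the 'all caps' concept shifting only dimension 0).
rng = random.Random(0)
dim = 8
baseline = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
shouted = [
    [x + (3.0 if i == 0 else 0.0) for i, x in enumerate(row)]
    for row in baseline
]

caps = concept_vector(shouted, baseline)

# Inject the concept into an unrelated hidden state and measure how far
# the state moved along the concept direction.
hidden = [rng.gauss(0, 1) for _ in range(dim)]
steered = inject(hidden, caps, strength=2.0)
shift = projection(steered, caps) - projection(hidden, caps)
```

In this sketch, shift equals the injection strength times the length of the concept vector. The research question is whether a model can notice and name such a shift in its own activations, which the toy code does not attempt to show.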