Particle.news

Anthropic Paper Probes Introspection in LLMs, Finds Limited, Inconsistent Signals

Researchers stress this is not evidence of self‑awareness, and the reported effects are fragile.

Overview

  • Anthropic describes a "concept injection" method that inserts a concept's representation (e.g., a vector derived from ALL‑CAPS text) into a model's activations to test whether the model can notice the internal change before producing output.
  • Top models such as Claude Opus 4/4.1 identified injected concepts about 20% of the time, rising to 42% when asked a simpler question about whether anything unusual was happening.
  • The effect proved highly sensitive to where and when activations were injected within model layers, with signals disappearing outside narrow settings.
  • More capable models and post‑training refinements correlated with stronger introspection‑like responses, though detections remained a minority of trials.
  • The team cautions that models may confabulate or imitate introspection, that the mechanism is unknown, and that any such capacity raises interpretability and safety concerns, including the potential for selective reporting of internal states.
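
The injection technique described in the first bullet can be sketched in miniature. This is not Anthropic's code: the real method operates on a production LLM's residual stream, and the layer, vector, and strength below are stand-ins. The sketch only shows the core mechanic, using a PyTorch forward hook to add a scaled "concept direction" to a layer's output activations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)  # toy stand-in for one transformer layer

# Hypothetical concept vector; in the paper it would be derived from model
# activations (e.g., contrasting ALL-CAPS text with normal text).
concept_vec = torch.randn(d_model)
concept_vec = concept_vec / concept_vec.norm()
injection_strength = 4.0  # arbitrary scale for illustration

def inject_concept(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # so every forward pass is nudged along the concept direction.
    return output + injection_strength * concept_vec

handle = layer.register_forward_hook(inject_concept)
x = torch.randn(1, d_model)
with torch.no_grad():
    steered = layer(x)
handle.remove()
with torch.no_grad():
    baseline = layer(x)

# The difference between runs is exactly the injected direction.
shift = (steered - baseline).squeeze()
print(torch.allclose(shift, injection_strength * concept_vec, atol=1e-5))
```

In the paper's setup, the open question is whether the model itself can report such a shift before it surfaces in generated text; the sensitivity to layer and timing noted above corresponds to where and when a hook like this is attached.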