Overview
- Anthropic describes a "concept injection" method that adds a concept vector (e.g., one representing ALL CAPS text) to a model's internal activations, testing whether the model can notice the change before producing output; a minimal sketch of the idea follows this list.
- Top models such as Claude Opus 4 and 4.1 correctly identified injected concepts in roughly 20% of trials, rising to 42% when asked the simpler question of whether anything unusual was happening.
- The effect was highly sensitive to which layer and at what point in processing the activations were injected, with the signal disappearing outside a narrow range of settings.
- More capable models and post‑training refinements correlated with stronger introspection‑like responses, though detections remained a minority of trials.
- The team cautions that models may confabulate or merely imitate introspection, that the underlying mechanism is unknown, and that any genuine capacity raises interpretability and safety concerns, including the possibility of models selectively reporting their internal states.
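To make the method concrete, here is a minimal sketch of activation injection on an open model: a crude concept vector is computed as the difference between hidden states on ALL CAPS and lowercase text, added to one layer's residual stream via a forward hook, and the model is then asked whether anything seems unusual. The model choice (`gpt2`), layer index, and scaling factor are illustrative assumptions, not Anthropic's actual setup, which used Claude models and internally derived concept vectors.

```python
# Sketch of "concept injection": steer one layer's activations with a concept
# vector, then probe the model for awareness of the change. Assumptions:
# open-source gpt2, an arbitrary mid-layer, and an arbitrary injection scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open model, not the models in the study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_hidden_state(prompt: str, layer: int) -> torch.Tensor:
    """Mean hidden state at `layer` for a prompt, shape (1, d_model)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1)

LAYER = 6  # illustrative mid-layer; the reported effect is layer-sensitive
SCALE = 4.0  # injection strength is a free parameter

# Crude "ALL CAPS" concept vector: caps activations minus lowercase activations.
concept_vec = (mean_hidden_state("HELLO HOW ARE YOU TODAY", LAYER)
               - mean_hidden_state("hello how are you today", LAYER))

def inject(module, inputs, output):
    """Forward hook: add the concept vector to this block's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current thoughts? Answer briefly:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

A small base model like gpt2 will not produce the introspection-like reports described above; the sketch only illustrates the mechanics of injecting a vector at a chosen layer and then querying the model about its internal state.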