Overview
- Anthropic describes a "concept injection" method that adds a concept vector (e.g., one representing ALL CAPS text) to a model's internal activations, testing whether the model can notice the change before producing output; a minimal sketch of the idea follows this list.
- Top models such as Claude Opus 4 and 4.1 correctly identified injected concepts in roughly 20% of trials, rising to 42% when asked the simpler question of whether anything unusual was happening.
- The effect was highly sensitive to which layer and at what point in processing the activations were injected, with the signal disappearing outside a narrow range of settings.
- More capable models and post‑training refinements correlated with stronger introspection‑like responses, though detections remained a minority of trials.
- The team cautions that models may confabulate or merely imitate introspection, that the underlying mechanism is unknown, and that any genuine capacity raises interpretability and safety concerns, including the possibility of models selectively reporting their internal states.
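To make the method concrete, here is a minimal sketch of activation injection on an open model: a crude concept vector is computed as the difference between hidden states on ALL CAPS and lowercase text, added to one layer's residual stream via a forward hook, and the model is then asked whether anything seems unusual. The model choice (`gpt2`), layer index, and scaling factor are illustrative assumptions, not Anthropic's actual setup, which used Claude models and internally derived concept vectors.

```python
# Sketch of "concept injection": steer one layer's activations with a concept
# vector, then probe the model for awareness of the change. Assumptions:
# open-source gpt2, an arbitrary mid-layer, and an arbitrary injection scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open model, not the models in the study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_hidden_state(prompt: str, layer: int) -> torch.Tensor:
    """Mean hidden state at `layer` for a prompt, shape (1, d_model)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1)

LAYER = 6  # illustrative mid-layer; the reported effect is layer-sensitive
SCALE = 4.0  # injection strength is a free parameter

# Crude "ALL CAPS" concept vector: caps activations minus lowercase activations.
concept_vec = (mean_hidden_state("HELLO HOW ARE YOU TODAY", LAYER)
               - mean_hidden_state("hello how are you today", LAYER))

def inject(module, inputs, output):
    """Forward hook: add the concept vector to this block's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current thoughts? Answer briefly:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

A small base model like gpt2 will not produce the introspection-like reports described above; the sketch only illustrates the mechanics of injecting a vector at a chosen layer and then querying the model about its internal state.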