Overview
- A peer-reviewed PNAS Nexus paper published June 2–3, 2026 reports a reproducible, length-dependent collapse in Stroop-task accuracy across leading transformer models.
- Researchers used the clinical Stroop test, which asks a subject to name ink colors while ignoring the written color words, to measure executive control in LLMs and compare it with human performance.
- Reported accuracy degrades sharply as lists grow: for example, GPT-4o fell from 91% accuracy at five words to 57% at ten words and 15% at forty words, while Claude 3.5 stayed stable up to twenty words but dropped to 24% at forty words.
- When lists mixed congruent and incongruent items, models including GPT-5, Claude Opus 4.1, and Gemini 2.5 fell to near-zero accuracy on the mismatched items, indicating a systemic failure rather than a one-off bug.
- The authors conclude this pattern reflects a structural lack of an explicit executive-control mechanism in transformer attention, and they call for architectural changes to restore adaptive conflict resolution and more reliable AI behavior.
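
To make the measurement concrete, the following is a minimal sketch of how a Stroop list of a given length might be built and scored, with accuracy reported separately for incongruent (mismatched) items. It is an illustration only, not the paper's protocol: the color set and the names `StroopItem`, `make_list`, and `score` are hypothetical, and how a model is actually prompted is left out.

```python
import random
from dataclasses import dataclass

# Assumed color vocabulary; the paper's actual stimulus set is not given here.
COLORS = ["red", "green", "blue", "yellow"]


@dataclass(frozen=True)
class StroopItem:
    word: str  # the written color word, which the subject should ignore
    ink: str   # the ink color, which is the correct answer

    @property
    def congruent(self) -> bool:
        return self.word == self.ink


def make_list(n, congruent_fraction=0.0, seed=0):
    """Build n Stroop items; each is congruent with the given probability."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if rng.random() < congruent_fraction:
            word = ink
        else:
            # Pick a word that deliberately conflicts with the ink color.
            word = rng.choice([c for c in COLORS if c != ink])
        items.append(StroopItem(word=word, ink=ink))
    return items


def score(items, answers):
    """Return (overall accuracy, accuracy on incongruent items or None)."""
    if len(items) != len(answers):
        raise ValueError("need exactly one answer per item")
    correct = [a.strip().lower() == it.ink for it, a in zip(items, answers)]
    incongruent = [c for it, c in zip(items, correct) if not it.congruent]
    overall = sum(correct) / len(correct) if correct else 0.0
    incongruent_acc = (
        sum(incongruent) / len(incongruent) if incongruent else None
    )
    return overall, incongruent_acc


if __name__ == "__main__":
    # A "word reader" that never suppresses the written word, evaluated at
    # the list lengths mentioned above, on fully incongruent lists.
    for length in (5, 10, 20, 40):
        items = make_list(length, congruent_fraction=0.0, seed=length)
        answers = [it.word for it in items]
        print(length, score(items, answers))
```

Scoring incongruent items separately matters for mixed lists: a responder that simply reads the words still gets every congruent item right, so overall accuracy alone can hide a complete failure on the mismatched items.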