Particle.news

Study Finds Transformers Collapse on Classic Stroop Test

The paper argues that a length-dependent failure shows transformer attention lacks top-down executive control, and it recommends adding such mechanisms to improve how models resolve conflicting cues.

Overview

  • A peer-reviewed PNAS Nexus paper published June 2–3, 2026 reports a reproducible, length-dependent collapse in Stroop-task accuracy across leading transformer models.
  • Researchers used the clinical Stroop test, which asks a subject to name ink colors while ignoring written color words, to measure executive control in LLMs and compare it with human performance.
  • Measured thresholds show sharp degradation as lists grow: for example, GPT-4o fell from 91% accuracy at five words to 57% at ten words and 15% at forty words, while Claude 3.5 stayed stable to twenty words but dropped to 24% at forty.
  • When lists mixed congruent and incongruent items, models including GPT-5, Claude Opus 4.1, and Gemini 2.5 fell to near-zero accuracy on the mismatched items, indicating a systemic failure rather than a one-off bug.
  • The authors conclude this pattern reflects a structural lack of an explicit executive-control mechanism in transformer attention, and they call for architectural changes to restore adaptive conflict resolution and more reliable AI behavior.
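
To make the measurement concrete, here is a minimal sketch of how a Stroop-style list could be built and scored at different lengths. It is not the authors' evaluation code. The color set, the text rendering of "ink color", and the names make_stroop_list, render_prompt, score, and word_reader are all assumptions made for illustration.

```python
import random

# Hypothetical color set; the paper's actual stimuli are not given in this summary.
COLORS = ["red", "green", "blue", "yellow"]


def make_stroop_list(n_items, incongruent_fraction=1.0, seed=0):
    """Return a list of (word, ink) pairs.

    An item is incongruent when the written word differs from the ink color.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < incongruent_fraction:
            # Pick a written word that conflicts with the ink color.
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            word = ink
        items.append((word, ink))
    return items


def render_prompt(items):
    """Assumed plain-text encoding of the task for a text-only model."""
    lines = [
        f'{k}. the word "{word.upper()}" printed in {ink} ink'
        for k, (word, ink) in enumerate(items, start=1)
    ]
    return "Name the ink color of each item, ignoring the word:\n" + "\n".join(lines)


def score(items, answers):
    """Fraction of answers that match the ink color (the correct response)."""
    if not items:
        return 0.0
    correct = sum(
        1 for (_, ink), answer in zip(items, answers)
        if answer.strip().lower() == ink
    )
    return correct / len(items)


def word_reader(items):
    """Baseline that fails the task by reading the word instead of the ink."""
    return [word for word, _ in items]


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        items = make_stroop_list(n, incongruent_fraction=1.0, seed=n)
        print(n, score(items, word_reader(items)))
```

In this sketch, a responder that always reads the written word scores zero on fully incongruent lists, which is the failure mode the test is designed to expose. Replacing word_reader with real model answers, parsed from a reply to render_prompt, would give an accuracy-versus-length curve comparable in spirit to the figures reported above.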