Overview
- Apple’s preprint “The Illusion of Thinking” tested leading large reasoning models across four controlled puzzles: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World (a sketch after this list shows how such a puzzle scales in difficulty and how a solution can be checked).
- Models performed well at low complexity, but their accuracy collapsed beyond a complexity threshold, and they reduced their reasoning effort (thinking tokens spent) as tasks grew harder.
- Thinking-mode variants outperformed their non-thinking counterparts only at medium complexity; at low complexity standard models matched or beat them, and at high complexity both collapsed.
- Critics argue that output-token limits and puzzle design choices, rather than inherent reasoning flaws, explain the models’ failures, and they call for more realistic evaluation scenarios.
- Experts remain divided on whether the results expose fundamental AI reasoning barriers or highlight the need for refined testing methods ahead of AGI development.
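To make the idea of a “controlled puzzle” concrete, here is a minimal sketch of how one of the four environments, Tower of Hanoi, can be scaled in difficulty and how a model-proposed move sequence can be verified mechanically. The move format, function names, and disk counts below are illustrative assumptions for this sketch, not the paper’s actual evaluation harness.

```python
"""Illustrative sketch: a Tower of Hanoi simulator that checks a proposed
move sequence. Difficulty scales with the number of disks (the optimal
solution needs 2**n - 1 moves). Names and move format are assumptions."""

from typing import List, Tuple

Move = Tuple[int, int]  # (source peg index, target peg index)


def verify_hanoi(n_disks: int, moves: List[Move]) -> bool:
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from the source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks on the target peg, in order


def optimal_moves(n_disks: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Classic recursive reference solution; its length is 2**n - 1 moves."""
    if n_disks == 0:
        return []
    return (optimal_moves(n_disks - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n_disks - 1, aux, src, dst))


if __name__ == "__main__":
    for n in range(1, 11):
        seq = optimal_moves(n)
        assert verify_hanoi(n, seq)
        print(f"{n} disks -> {len(seq)} moves")  # grows as 2**n - 1
```

The property such a setup relies on is that difficulty grows predictably with the number of disks while any proposed solution can be checked step by step, which is what lets an evaluation plot accuracy against complexity rather than against vaguer notions of problem hardness.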