Overview
- Apple’s preprint “The Illusion of Thinking” tested leading large reasoning models across four controlled puzzles: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World (a sketch after this list shows how such a puzzle scales in difficulty and how a solution can be checked).
- Models performed well at low complexity, but their accuracy collapsed beyond a complexity threshold, and they reduced their reasoning effort (thinking tokens spent) as tasks grew harder.
- Thinking-mode variants outperformed their non-thinking counterparts only at medium complexity; at low complexity standard models matched or beat them, and at high complexity both collapsed.
- Critics argue that output-token limits and puzzle design choices, rather than inherent reasoning flaws, explain the models’ failures, and they call for more realistic evaluation scenarios.
- Experts remain divided on whether the results expose fundamental AI reasoning barriers or highlight the need for refined testing methods ahead of AGI development.
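To make the idea of a “controlled puzzle” concrete, here is a minimal sketch of how one of the four environments, Tower of Hanoi, can be scaled in difficulty and how a model-proposed move sequence can be verified mechanically. The move format, function names, and disk counts below are illustrative assumptions for this sketch, not the paper’s actual evaluation harness.

```python
"""Illustrative sketch: a Tower of Hanoi simulator that checks a proposed
move sequence. Difficulty scales with the number of disks (the optimal
solution needs 2**n - 1 moves). Names and move format are assumptions."""

from typing import List, Tuple

Move = Tuple[int, int]  # (source peg index, target peg index)


def verify_hanoi(n_disks: int, moves: List[Move]) -> bool:
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from the source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks on the target peg, in order


def optimal_moves(n_disks: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Classic recursive reference solution; its length is 2**n - 1 moves."""
    if n_disks == 0:
        return []
    return (optimal_moves(n_disks - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n_disks - 1, aux, src, dst))


if __name__ == "__main__":
    for n in range(1, 11):
        seq = optimal_moves(n)
        assert verify_hanoi(n, seq)
        print(f"{n} disks -> {len(seq)} moves")  # grows as 2**n - 1
```

The property such a setup relies on is that difficulty grows predictably with the number of disks while any proposed solution can be checked step by step, which is what lets an evaluation plot accuracy against complexity rather than against vaguer notions of problem hardness.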