Overview
- Apple tested large reasoning models and standard language models on four controlled puzzles with adjustable difficulty: Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World.
- Accuracy of the reasoning models declined steadily as puzzle complexity increased, eventually collapsing to zero beyond a model-specific threshold.
- Paradoxically, the models reduced their chain-of-thought token usage as they approached their collapse point, even though ample compute budget remained available.
- Even when provided with an exact step-by-step solution algorithm (see the sketch after this list), reasoning models failed to execute it reliably on high-complexity instances.
- Experts such as Gary Marcus warn that these findings expose fundamental barriers to generalizable reasoning and could stall progress toward artificial general intelligence.
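
For readers unfamiliar with these puzzles, the sketch below is a hypothetical illustration (not code from Apple's study) of why Tower of Hanoi lends itself to controlled difficulty scaling: one short recursive procedure solves every instance, yet the optimal move count grows as 2^n − 1 with the number of disks n, so a model handed this exact algorithm still has an exponentially long sequence of steps to execute faithfully.

```python
# Hypothetical illustration: optimal Tower of Hanoi solver.
# Difficulty is "adjustable" simply by changing n, and the optimal
# solution length grows exponentially as 2**n - 1 moves.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear n-1 disks off the largest
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top
    )

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        moves = hanoi_moves(n)
        assert len(moves) == 2 ** n - 1
        print(f"{n} disks -> {len(moves)} optimal moves")
    # Move counts: 7, 127, 1023, 32767 -- no search is required,
    # only faithful step-by-step execution of a known procedure.
```

This is what makes the collapse finding notable: solving a large instance with the algorithm in hand demands no planning or search, only reliable execution of a long but fully specified sequence of moves.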