Overview
- Apple tested large reasoning models and standard language models on four controlled puzzles with adjustable difficulty: Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World.
- Accuracy of the reasoning models declined steadily as puzzle complexity increased, eventually collapsing to zero beyond a model-specific threshold.
- Paradoxically, the models reduced their chain-of-thought token usage as they approached their collapse point, even though ample compute budget remained available.
- Even when provided with an exact step-by-step solution algorithm (see the sketch after this list), reasoning models failed to execute it reliably on high-complexity instances.
- Experts such as Gary Marcus warn that these findings expose fundamental barriers to generalizable reasoning and could stall progress toward artificial general intelligence.
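
For readers unfamiliar with these puzzles, the sketch below is a hypothetical illustration (not code from Apple's study) of why Tower of Hanoi lends itself to controlled difficulty scaling: one short recursive procedure solves every instance, yet the optimal move count grows as 2^n − 1 with the number of disks n, so a model handed this exact algorithm still has an exponentially long sequence of steps to execute faithfully.

```python
# Hypothetical illustration: optimal Tower of Hanoi solver.
# Difficulty is "adjustable" simply by changing n, and the optimal
# solution length grows exponentially as 2**n - 1 moves.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear n-1 disks off the largest
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top
    )

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        moves = hanoi_moves(n)
        assert len(moves) == 2 ** n - 1
        print(f"{n} disks -> {len(moves)} optimal moves")
    # Move counts: 7, 127, 1023, 32767 -- no search is required,
    # only faithful step-by-step execution of a known procedure.
```

This is what makes the collapse finding notable: solving a large instance with the algorithm in hand demands no planning or search, only reliable execution of a long but fully specified sequence of moves.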