The paper concluded that as tasks grew harder, the models counterintuitively reduced their reasoning effort rather than increasing it, a collapse the authors called "particularly concerning."
“Upon approaching a critical threshold… models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” the paper states.
Tasks used in the study included:
- Tower of Hanoi
- River Crossing puzzles
Even when given an algorithm to solve a complex puzzle, some models still failed to apply it correctly.
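For context, the Tower of Hanoi procedure the models reportedly struggled to apply is a short, well-known recursion. The sketch below is illustrative only; it is not the prompt or code used in the study:

```python
# Tower of Hanoi: a minimal recursive sketch of the kind of explicit
# procedure the study reportedly supplied to the models.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the list of (disk, from, to) moves that solves n disks."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # shift n-1 smaller disks out of the way
    moves.append((n, src, dst))          # move the largest disk to the target peg
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
    return moves

print(len(hanoi(3)))  # 7 moves for 3 disks
```

Solving n disks always takes 2^n - 1 moves, so the puzzle's difficulty grows exponentially with disk count, which is the axis along which the researchers could scale task complexity in a controlled way.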
The Limits of Advanced AI Reasoning
Apple’s researchers tested several top-performing models:
- OpenAI’s o3
- Google’s Gemini Thinking
- Anthropic’s Claude 3.7 Sonnet-Thinking
- DeepSeek’s R1
While large language models (LLMs) such as OpenAI's o3 perform well in everyday conversation and low-complexity tasks, the study found that their performance begins to unravel once task complexity rises even slightly.
"These insights challenge prevailing assumptions about LRM [large reasoning model] capabilities," the authors concluded, "and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning."