According to Apple researchers, state-of-the-art AI reasoning models exhibit a concerning "illusion of thinking": their performance collapses entirely once problems exceed certain complexity thresholds, revealing fundamental limits on their ability to develop generalizable problem-solving capabilities despite their sophisticated appearance.
The research identifies three distinct performance regimes that characterize how reasoning models respond to varying levels of task complexity. At low complexity, standard LLMs without reasoning chains surprisingly outperform reasoning models, which tend to "overthink" simple problems by exploring incorrect alternatives after finding the right answer.[1][2] Medium-complexity tasks represent the optimal zone where reasoning models demonstrate clear advantages over standard LLMs, with their structured chain-of-thought processes proving beneficial.[3] However, in high-complexity scenarios, both reasoning and standard models experience a complete accuracy collapse, with performance plummeting to near zero despite adequate computational resources being available.[4][5][6]
This three-regime framework reveals a counterintuitive pattern: reasoning models initially increase their reasoning effort as complexity grows, but then sharply reduce their thinking tokens as they approach their complexity threshold.[4][7] The study tested prominent models including OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, finding that none could maintain performance beyond certain complexity barriers.[8][9]
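As a rough illustration of how such a complexity sweep might be instrumented, the sketch below assumes a hypothetical `solve` callable that wraps a model API and reports, for one puzzle instance at a given complexity level, whether the final answer was correct and how many thinking tokens were spent. It is a minimal sketch under those assumptions, not the paper's evaluation code.

```python
from typing import Callable, Dict, List, Tuple

def sweep_complexity(
    solve: Callable[[int], Tuple[bool, int]],  # hypothetical wrapper around a model API
    levels: range,
    trials: int = 25,
) -> List[Dict]:
    """Run repeated trials at each complexity level and record accuracy
    and average thinking-token usage, the two curves used to separate
    the low-, medium-, and high-complexity regimes."""
    results = []
    for n in levels:
        correct, tokens = 0, 0
        for _ in range(trials):
            ok, used = solve(n)
            correct += int(ok)
            tokens += used
        results.append({
            "complexity": n,
            "accuracy": correct / trials,
            "avg_thinking_tokens": tokens / trials,
        })
    return results
```

Plotting accuracy and token usage against the complexity parameter is enough to surface the three regimes described above: a flat or slightly worse start, a middle band where reasoning helps, and a cliff where accuracy falls to near zero.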
A striking discovery in Apple's research is the "giving up" effect: as reasoning models approach their complexity thresholds, they abruptly reduce their thinking tokens despite having ample computational budget remaining.[1][2] This counterintuitive behavior suggests a fundamental scaling limitation rather than a resource constraint.[3] The models initially use more tokens for thinking as problems become more complex, but then paradoxically invest less effort precisely when the challenges demand more thorough reasoning.[4]
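Given per-level results like those produced by the harness sketched earlier, the "giving up" point can be approximated as the first complexity level at which average thinking-token usage falls instead of rising. This is a simplified heuristic for illustration, not the paper's analysis.

```python
from typing import Dict, List, Optional

def giving_up_point(results: List[Dict]) -> Optional[int]:
    """Return the first complexity level where average thinking-token usage
    drops relative to the previous level, i.e. where the model starts
    investing less effort even though problems are getting harder."""
    for prev, curr in zip(results, results[1:]):
        if curr["avg_thinking_tokens"] < prev["avg_thinking_tokens"]:
            return curr["complexity"]
    return None  # no drop observed within the tested range
```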
This phenomenon reveals that current AI systems simulate reasoning rather than genuinely performing it, relying heavily on pattern recognition that breaks down when problems deviate significantly from memorized templates.[5] Even when researchers provided explicit algorithms for solving the puzzles, the models still failed at high complexity levels; the models also proved extremely fragile, with small, irrelevant changes to prompts degrading performance by up to 65%.[6][7]
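To make concrete what "providing an explicit algorithm" means, here is the standard recursive Tower of Hanoi procedure, the kind of complete solution recipe the paper reports supplying in prompts (the exact prompt format is not reproduced here; only the algorithm itself is standard).

```python
def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list:
    """Return the optimal move sequence for n disks: move n-1 disks to the
    spare peg, move the largest disk to the target, then restack the n-1
    disks on top of it. Always takes exactly 2**n - 1 moves."""
    if n == 0:
        return []
    moves = hanoi_moves(n - 1, source, spare, target)   # clear the way
    moves.append((source, target))                      # move the largest disk
    moves += hanoi_moves(n - 1, spare, target, source)  # restack on top
    return moves

print(hanoi_moves(3))  # 7 moves: [('A','C'), ('A','B'), ('C','B'), ('A','C'), ('B','A'), ('B','C'), ('A','C')]
```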
Rather than relying on potentially contaminated mathematical benchmarks like MATH or GSM8K, Apple researchers designed controlled puzzle environments including Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World.[1][2] These carefully constructed test environments allowed precise manipulation of complexity while maintaining consistent logical structures, providing cleaner insight into actual reasoning capabilities.[3]
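A minimal sketch of why such environments give a cleaner signal: complexity is governed by a single parameter (for Tower of Hanoi, the disk count n, with an optimal solution of 2**n - 1 moves), and any move sequence a model proposes can be verified deterministically. The checker below is illustrative, not the paper's actual harness.

```python
from typing import List, Tuple

def check_hanoi_solution(n: int, moves: List[Tuple[str, str]]) -> bool:
    """Simulate a proposed move sequence for an n-disk puzzle and verify that
    every move is legal and that all disks end up on peg 'C'."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top, largest disk = n
    for src, dst in moves:
        if not pegs[src]:
            return False                      # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # placed a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1)) and not pegs["A"] and not pegs["B"]

print(check_hanoi_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True: optimal 2-disk solution
print(check_hanoi_solution(2, [("A", "C"), ("A", "C")]))              # False: illegal second move
```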
The innovative methodology revealed that what appears to be reasoning is actually sophisticated pattern matching, with models excelling when they can match familiar patterns from training data but failing when problems deviate significantly.[4][5] This approach demonstrated that even "thinking" variants of AI models exhibit shallow reasoning and are prone to overthinking, with their performance dropping off sharply as they struggle to scale logic depth or maintain coherent reasoning threads across complex tasks.[6][7]
The findings from Apple's research cast significant doubt on claims that reasoning models represent a meaningful step toward Artificial General Intelligence (AGI). Instead of developing generalizable problem-solving abilities, these models appear to be sophisticated pattern-matching systems with clear limitations that cannot scale to human-level reasoning capabilities.[1][2] The timing of this research is particularly notable, coming just before Apple's Worldwide Developers Conference and amid the company's own AI development challenges with Apple Intelligence and Siri.[3][2]
Some critics have labeled Apple's findings "short-sighted," suggesting the company may be downplaying reasoning models because of its own AI struggles.[4] However, the research methodology appears robust, and the findings align with broader concerns in the AI research community about the true nature of reasoning in large language models, providing empirical evidence that current AI systems simulate rather than genuinely perform reasoning.[5][6] Ultimately, this research suggests that current approaches to building systems capable of complex reasoning need fundamental rethinking.