Apple researchers have challenged the artificial intelligence industry's claims about reasoning capabilities, publishing a study that found leading models from OpenAI, Google, and Anthropic fail when confronted with complex logic puzzles, despite marketing promises of human-like thinking abilities.
The study, published June 6 and titled "The Illusion of Thinking," tested models including OpenAI's o1 and o3 systems against classic puzzles like the Tower of Hanoi and River Crossing problems. While these models can solve medium-difficulty tasks, they experience what researchers called "complete accuracy collapse" when complexity increases, dropping to zero success rates even with ample computational resources.
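The Tower of Hanoi illustrates how quickly such puzzles scale: the minimal solution for n disks takes 2^n - 1 moves, so each added disk roughly doubles the work. A minimal Python sketch of the classic recursive solution, offered here purely as an illustration of the puzzle the models faced, not as the paper's actual test harness:

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the minimal move sequence for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park the top n-1 disks on the spare peg
    moves.append((source, target))               # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top of it
    return moves

# Each extra disk doubles the move count: 3 disks -> 7 moves, 10 disks -> 1,023 moves.
for n in (3, 5, 10):
    print(n, len(hanoi(n)))
```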
The Apple team discovered a counterintuitive pattern: as puzzle difficulty increased, the AI models actually reduced their reasoning effort rather than working harder. "As problems approach critical difficulty, models paradoxically reduce their reasoning effort despite ample compute budgets," the researchers wrote.
Even when researchers provided complete solution algorithms, removing the need for problem-solving creativity, the models still failed at the same complexity thresholds. This suggests the limitation lies not in strategy but in basic logical execution, according to the study.
The research tested both standard large language models and specialized "reasoning" models from major providers, finding that reasoning models outperformed standard ones only at medium complexity levels.
The timing is notable: the paper appeared days before Apple's Worldwide Developers Conference, where the company announced its Foundation Models framework. Apple has taken a more cautious approach to AI integration compared to competitors who have aggressively marketed reasoning capabilities.
OpenAI has claimed its o1 models can "spend more time thinking through problems before they respond, much like a person would," while Google has promoted Gemini's reasoning abilities for handling complicated business tasks. Apple's findings challenge these assertions.
The study used controlled puzzle environments rather than mathematical benchmarks to avoid data contamination, where models might simply recall training examples rather than demonstrate genuine reasoning.
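In such an environment, every proposed answer can be checked mechanically, move by move, instead of being compared against a remembered solution. A hedged sketch of what a checker of this kind might look like for the Tower of Hanoi; the function name and structure are illustrative assumptions, not code from the paper:

```python
def valid_hanoi_solution(n, moves):
    """Replay a proposed move list and verify every rule of the puzzle.

    Pegs are A, B, C; disks are numbered 1 (smallest) to n. The solution is
    valid only if every move takes the top disk of one peg onto a larger
    disk (or an empty peg) and all disks end up on peg C.
    """
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top stacks
    for source, target in moves:
        if not pegs[source]:
            return False                       # moving from an empty peg
        disk = pegs[source][-1]
        if pegs[target] and pegs[target][-1] < disk:
            return False                       # larger disk placed on a smaller one
        pegs[target].append(pegs[source].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks transferred in order

# A two-disk example: three legal moves that solve the puzzle.
print(valid_hanoi_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
```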
"These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the Apple researchers concluded1.
The research arrives as the industry pours billions into developing artificial general intelligence, with companies constructing massive data centers to power increasingly complex models. Apple's analysis suggests current "reasoning" models rely on sophisticated pattern matching rather than genuine logical thinking.