The Illusion of Thinking: What AI “Reasoning” Models Actually Do
3/7/2026 · 3 min read


Artificial intelligence models that can “think” step by step have recently captured a lot of attention. Systems like DeepSeek-R1 and Claude 3.7 Sonnet claim to reason through problems instead of simply predicting the next word. But do these models truly reason, or are they just very good at pattern matching?
A research paper titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” set out to answer this question. Instead of using typical math benchmarks (which AI models may have already seen during training), the researchers used classic logic puzzles to evaluate how these models behave as problems become more complex. The results reveal a fascinating and sometimes surprising picture of how today’s AI “thinking” works.
Why Researchers Used Puzzle Environments
One major problem in evaluating AI models is data contamination. If a model has seen similar questions during training, it might appear to reason when it’s actually recalling patterns. To avoid this, researchers tested models in controlled puzzle environments, including:
Tower of Hanoi
River Crossing Puzzle
Blocks World
Checker Jumping Puzzle
These puzzles allow researchers to precisely control difficulty, which makes it easier to observe when and why models fail.
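The paper's own test harness is not reproduced here, but the idea of precisely dialing difficulty can be sketched for one of the puzzles. In Tower of Hanoi, the shortest solution for n disks takes exactly 2^n − 1 moves, so adding a single disk roughly doubles the problem size (the function name below is illustrative, not from the paper):

```python
def min_hanoi_moves(n_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n_disks disks."""
    return 2 ** n_disks - 1

# Difficulty grows exponentially with one knob: the disk count.
for n in range(1, 8):
    print(f"{n} disks -> {min_hanoi_moves(n)} moves")  # 1, 3, 7, 15, 31, 63, 127
```

This single parameter lets researchers generate a smooth ladder of difficulty, with no risk that any particular instance appeared in the training data.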
The Three Levels of AI Reasoning
The study found that AI performance changes dramatically depending on problem complexity.
1. Low Complexity: Simpler Models Work Better
For easy problems, standard language models (without a long reasoning process) often performed better and faster. Why? Reasoning models generate long chains of “thoughts,” which use many tokens and sometimes introduce errors. For straightforward problems, the extra thinking actually becomes unnecessary overhead. In other words: More thinking doesn’t always mean better answers.
2. Medium Complexity: Reasoning Helps
When problems become moderately difficult, reasoning models start to shine. Their step-by-step thinking helps them:
explore possible solutions
recover from mistakes
correct earlier assumptions
In these cases, the reasoning process helps the model reach solutions that standard models often miss.
This is where today’s reasoning AI is most effective.
3. High Complexity: Everything Breaks
The most surprising finding comes at high complexity.
At a certain point, both reasoning models and standard models completely fail. Accuracy drops to zero, even when the model is allowed to think for many steps. This suggests a fundamental limitation in how current AI systems handle long chains of precise reasoning.
The Strange Limit of “Thinking”
One of the most unexpected discoveries was what researchers called a thinking scaling limit. You might expect that harder problems would cause AI models to think longer. Instead, the opposite happens. Once problems become too complex, models often reduce their reasoning effort, generating fewer thinking steps even though they still have plenty of token budget available. This suggests the model may be losing its ability to maintain structured reasoning rather than simply running out of resources.
When AI Thinks Too Much
Another interesting behavior is overthinking. In simpler tasks, models often:
Find the correct answer early.
Continue generating more reasoning steps.
Talk themselves into the wrong answer.
Essentially, the model keeps exploring unnecessary paths and ends up overwriting the correct solution. This is similar to a student who second-guesses themselves on a test and changes the right answer to a wrong one.
Knowing the Algorithm Isn’t Enough
Researchers also tested something surprising. They gave the AI the exact algorithm needed to solve the puzzle. For example, the recursive strategy for solving the Tower of Hanoi. Even with the correct instructions, performance still collapsed when the puzzles became complex. This shows that the problem is not just strategy selection. It’s also about executing many precise steps correctly. Current AI models struggle with this type of exact logical execution.
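The recursive strategy handed to the models is the standard textbook recursion: move the top n−1 disks out of the way, move the largest disk, then move the n−1 disks back on top. A minimal sketch (a generic version, not the paper's exact prompt):

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear n-1 disks onto the spare peg
        + [(source, target)]                         # move the largest disk directly
        + hanoi_moves(n - 1, spare, target, source)  # restack n-1 disks onto the target
    )

print(hanoi_moves(3))  # 7 moves, the provable minimum for 3 disks
```

The recursion fits in a few lines, which is exactly what makes the finding striking: the hard part for the models is not discovering this strategy but faithfully executing its hundreds of steps without a single slip.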
Evidence of Memorization
Another clue came from inconsistent puzzle performance. For example:
A model could successfully perform 100 moves in the Tower of Hanoi.
But fail after 4 moves in the River Crossing Puzzle.
If the system had true general reasoning ability, performance across puzzles should be more consistent. Instead, the results suggest the models rely heavily on patterns they encountered during training.
The study does show that reasoning models perform extremely well on many real-world tasks. However, it also shows that current AI “thinking” systems are still far from having general reasoning abilities.