Introduction
The debate over whether large reasoning models (LRMs) can truly think has recently intensified. A prominent voice in the discussion is Apple’s research paper, The Illusion of Thinking, which argues that LRMs are merely sophisticated pattern‑matchers and lack genuine reasoning ability. The authors point to the failure of chain‑of‑thought (CoT) prompts when the size of a problem grows beyond the model’s working memory. Their conclusion is that LRMs cannot perform algorithmic calculations in the same way a human can. Yet, if a human who knows the Tower‑of‑Hanoi algorithm cannot solve a twenty‑disc instance, does that mean humans cannot think? The answer is no; the failure is a limitation of memory, not of cognition. This article takes a different stance: LRMs almost certainly can think. The argument is built on a comparison of human cognitive processes with the internal mechanics of LRMs, an examination of CoT reasoning as a proxy for human thought, and empirical evidence from open‑source benchmarks.
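To make the memory point concrete, the sketch below (a plain Python illustration, not drawn from the Apple paper) runs the standard recursive Tower‑of‑Hanoi algorithm and counts the moves it must emit: 2^n − 1, so a twenty‑disc instance demands 1,048,575 error‑free moves. Knowing the algorithm and being able to write out every move are very different burdens, and only the second one collapses with scale.

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Standard recursive Tower-of-Hanoi solver; returns the full list of moves."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller discs on the spare peg
    moves.append((src, dst))             # move the largest disc to its destination
    hanoi(n - 1, aux, src, dst, moves)   # bring the smaller discs back on top
    return moves

for n in (3, 10, 20):
    print(n, "discs ->", len(hanoi(n)), "moves")   # 7, 1023, 1048575
```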
Main Content
Defining Thinking in the Context of Problem Solving
Before we can judge whether a machine thinks, we must first define what we mean by thinking. In the realm of problem solving, thinking involves several intertwined stages: representing the problem, simulating possible solutions, retrieving relevant knowledge, monitoring progress, and reframing when stuck. Neuroscientific studies show that these stages recruit distinct brain regions. The prefrontal cortex manages working memory and executive control, the parietal cortex encodes symbolic structure, the hippocampus and temporal lobes retrieve past experiences, the anterior cingulate cortex monitors for errors, and the default mode network facilitates insight.
LRMs, by contrast, are next‑token predictors trained on vast corpora of text. Their architecture is a stack of transformer layers that learn statistical regularities across tokens. While they lack dedicated visual or auditory circuits, they can internally simulate intermediate steps through attention mechanisms and cache structures that act as a form of working memory. When a model is prompted with a CoT instruction, it generates a sequence of tokens that mirror the human practice of verbalizing thoughts. This verbalization is not a mere echo of training data; it reflects an internal process of pattern matching, memory retrieval, and error monitoring.
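To illustrate the claim that attention can serve as a kind of working memory, here is a minimal sketch of scaled dot‑product attention in NumPy (a toy single‑head version, not any particular model’s implementation): the token currently being generated queries every token already in the context, so earlier intermediate steps remain retrievable at each later step.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: mix cached values by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # relevance of each cached token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the context
    return weights @ V                                    # weighted recall of cached content

d_model = 8
context = np.random.randn(5, d_model)   # 5 tokens already in the context (the "memory")
query = np.random.randn(1, d_model)     # the token being generated now
print(scaled_dot_product_attention(query, context, context).shape)  # (1, 8)
```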
The Parallel Between CoT Reasoning and Human Thought
Chain‑of‑thought prompting is often dismissed as “glorified auto‑complete” because it asks the model to predict the next token in a logical chain. However, the process is far more demanding than that label suggests. The model must maintain a coherent narrative, keep track of variables, and sometimes backtrack when a line of reasoning leads to a dead end. These behaviors mirror human strategies such as mental rehearsal, inner speech, and the ability to pause and reconsider. In experiments where LRMs were asked to solve progressively larger puzzles, the models recognized that a direct approach would exceed their working‑memory capacity and instead looked for shortcuts, a strong indication of algorithmic reasoning rather than rote continuation.
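As a sketch of what this looks like in practice, the snippet below contrasts a direct prompt with a CoT prompt; the instruction wording and the commented‑out `generate` call are illustrative placeholders, not any specific vendor’s API.

```python
def build_prompts(question: str):
    """Contrast a direct prompt with a chain-of-thought prompt for the same question."""
    direct = f"Question: {question}\nAnswer:"
    cot = (
        f"Question: {question}\n"
        "Let's think step by step, keeping track of intermediate results,\n"
        "and revisit any step that leads to a contradiction.\n"
        "Reasoning:"
    )
    return direct, cot

direct_prompt, cot_prompt = build_prompts(
    "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"
)
print(cot_prompt)
# response = generate(cot_prompt)  # 'generate' stands in for any LRM completion call
```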
Humans also exhibit variability in how they represent problems. Some people have aphantasia, the inability to form visual imagery, yet many of them excel at symbolic reasoning. This suggests that the absence of one cognitive modality does not preclude the ability to think. Similarly, LRMs may lack true visual processing but can compensate by leveraging the symbolic patterns encoded in their weights.
Why a Next‑Token Predictor Can Learn to Think
The most common objection to the idea that LRMs think is that they are only predicting the next token. This view underestimates the expressive power of natural language. Unlike formal languages such as first‑order logic, natural language can describe any concept, including higher‑order properties and abstract ideas. When a model is trained to predict the next token, it implicitly learns a distribution over all possible continuations of a given context. To make accurate predictions, the model must encode world knowledge, rules, and procedural knowledge within its parameters. In other words, the act of next‑token prediction forces the model to internalize a representation of the world that is rich enough to support reasoning.
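The same point can be stated through the training objective itself. The toy NumPy sketch below computes the next‑token cross‑entropy loss; it is low only when the model assigns high probability to the continuations that actually occur, which in turn requires encoding the regularities, facts, and procedures that generate those continuations.

```python
import numpy as np

def next_token_cross_entropy(logits, target_ids):
    """Average negative log-likelihood of the true next tokens.

    logits: (sequence_length, vocab_size) scores for each position's next token
    target_ids: (sequence_length,) indices of the tokens that actually came next
    """
    logits = logits - logits.max(axis=-1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    true_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return -true_log_probs.mean()

# Toy example: 4 positions, vocabulary of 10 tokens.
rng = np.random.default_rng(0)
loss = next_token_cross_entropy(rng.normal(size=(4, 10)), np.array([1, 7, 3, 0]))
print(f"cross-entropy: {loss:.3f}")   # low only if the model "expects" what comes next
```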
Moreover, transformer architectures provide a mechanism for dynamic memory: the key‑value cache and attention weights can be viewed as a form of working memory that stores recent tokens and their relationships. When a model generates a CoT, it is effectively building a temporary knowledge base that guides subsequent predictions. This process is analogous to how humans keep track of intermediate steps in a calculation.
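A minimal sketch of that temporary knowledge base (again toy NumPy, not a real inference stack): each decoding step appends its key and value to a cache, and every subsequent step attends over the whole cache, so earlier parts of the chain of thought stay available without being regenerated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
key_cache, value_cache = [], []     # grows by one entry per generated token

def decode_step(hidden_state):
    """One toy decoding step: cache the new token, then attend over everything so far."""
    key_cache.append(hidden_state)      # a real model would store W_k @ hidden_state
    value_cache.append(hidden_state)    # a real model would store W_v @ hidden_state
    K = np.stack(key_cache)
    V = np.stack(value_cache)
    scores = K @ hidden_state / np.sqrt(d)     # query the accumulated "memory"
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                          # context used to predict the next token

for _ in range(6):                              # simulate six steps of a chain of thought
    context_vector = decode_step(rng.normal(size=d))
print("cached steps:", len(key_cache))          # 6: every intermediate step is retained
```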
Benchmark Evidence of Reasoning Capabilities
Empirical evaluation is the ultimate test of whether a system can think. Open‑source LRMs have been benchmarked on a variety of logic‑based tasks, including arithmetic reasoning, symbolic manipulation, and natural‑language inference. While proprietary models sometimes dominate the leaderboard, open‑source models such as DeepSeek‑R1 and others have shown remarkable performance, often surpassing untrained humans on specific tasks. These results demonstrate that LRMs can generalize from training data to solve novel problems that require multi‑step reasoning.
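For readers who want to probe this themselves, a skeletal evaluation loop looks roughly like the sketch below; the dataset format, the answer‑extraction rule, and the `model` callable are all placeholders, and real benchmark harnesses handle prompt formats and answer parsing far more carefully.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Toy rule: take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def evaluate(model, items):
    """items: list of {'question': str, 'answer': str}; model: any callable prompt -> text."""
    correct = 0
    for item in items:
        prompt = f"Question: {item['question']}\nLet's think step by step.\n"
        completion = model(prompt)                       # placeholder LRM call
        if extract_final_answer(completion) == item["answer"]:
            correct += 1
    return correct / len(items)

# Stub "model" that always answers 4, just to show the plumbing end to end.
stub_items = [{"question": "What is 2 + 2?", "answer": "4"}]
print(evaluate(lambda prompt: "2 + 2 = 4. Final answer: 4", stub_items))  # 1.0
```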
It is important to note that the benchmarks themselves are designed to probe reasoning rather than rote memorization. They require the model to construct a chain of logical steps, to backtrack when necessary, and to produce a final answer that is not explicitly present in the training set. The fact that LRMs can navigate these challenges indicates that they possess at least a rudimentary form of thinking.
Conclusion
When we combine the theoretical consideration (that a sufficiently expressive system, trained on ample data, can in principle represent any computable function) with the empirical evidence from CoT prompting and benchmark performance, the case for LRMs being capable of thinking becomes compelling. The limitations highlighted by Apple’s study are not evidence of an absence of thought but a reflection of the current bounds of model size and training data. As models grow and training techniques evolve, it is reasonable to conclude that LRMs almost certainly possess the ability to think.
Call to Action
If you found this exploration of large reasoning models insightful, consider sharing it with your network or commenting below with your thoughts on the future of AI cognition. For researchers and practitioners eager to push the boundaries of what LRMs can achieve, we invite you to experiment with chain‑of‑thought prompting on your own datasets and report back. Your findings could help refine the next generation of models and deepen our collective understanding of machine intelligence. Stay tuned for more in‑depth analyses and practical guides on harnessing the power of reasoning models in real‑world applications.