This article is based on a recent blog post by Lilian Weng; I agree with many of its points and drew much inspiration from it.
Original link: https://lilianweng.github.io/posts/2025-05-01-thinking/
Table of Contents#
- Motivating Thinking
- Analogy with Psychology
- Computation as Resources
- Latent Variable Modeling
- Token-Based Thinking
- Branching and Editing
- Parallel Sampling
- Sequential Revision
- Reinforcement Learning to Improve Reasoning
- Use of External Tools
- Faithful Thinking
- Does the model faithfully express its thoughts?
- The impact of optimization pressure on CoT: good or bad?
- Thinking in Continuous Space
- Recurrent Architectures
- Thinking Tokens
- Thinking as Latent Variables
- Expectation Maximization
- Iterative Learning
- Scaling Laws for Thinking Time
- Future Prospects
- Citations
- References
Motivating Thinking#
We can motivate models to think longer in several different ways.
Analogy with Psychology#
The core idea of model thinking is closely related to human thought processes. We humans cannot immediately provide the answer to "What's 12345 times 56789?" Instead, it is natural to take time to think and analyze before arriving at a result, especially for complex problems. In "Thinking, Fast and Slow" (Kahneman, 2013), Daniel Kahneman divides human thinking into two modes through the lens of dual-process theory:
- Fast thinking (System 1) operates quickly and automatically, driven by intuition and emotion, requiring almost no effort.
- Slow thinking (System 2) requires deliberate logical thinking and significant cognitive effort. This mode of thinking consumes more mental energy and requires conscious engagement.
Because System 1 thinking is both fast and simple, it often becomes the primary driver of decision-making at the expense of accuracy and logic. It relies on mental shortcuts (heuristics) in our brains and can lead to errors and biases. By consciously slowing down, taking more time to reflect, improve, and analyze, we can engage in System 2 thinking, challenge our intuitions, and make more rational choices.
Computation as Resources#
One perspective in deep learning is that a neural network can be characterized by the amount of computation (such as matrix multiplications and activation-function evaluations) and storage (such as model weights and biases, and intermediate activations) it has access to in a forward pass. If we optimize it to solve problems with gradient descent, the optimization process will figure out how to use these resources, organizing them into circuits for computation and information storage. From this perspective, if we design an architecture or system that can perform more computation at test time and train it to use these resources effectively, it will perform better.
In Transformer models, the amount of computation (FLOPs) the model performs per generated token is roughly twice the number of parameters, since each parameter participates in about one multiply and one add per forward pass. For sparse models such as mixture-of-experts (MoE), only a fraction of the parameters is used in each forward pass, so the per-token compute is roughly 2 × the number of active parameters, i.e., 2 × total parameters × the fraction of experts that are active.
On the other hand, CoT allows the model to perform more flops of computation for each token of the answer it is trying to compute. In fact, CoT has a nice property that allows the model to adjust the amount of computation based on the difficulty of the question.
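As a quick sanity check on the per-token compute estimate above, here is a minimal sketch; the parameter counts and active fraction are illustrative placeholders, not real model sizes.

```python
def flops_per_token(total_params: float, active_fraction: float = 1.0) -> float:
    """Rough forward-pass FLOPs per generated token: ~2 FLOPs (one multiply,
    one add) per *active* parameter. active_fraction is 1.0 for dense models
    and the fraction of parameters active per token for MoE models."""
    return 2.0 * total_params * active_fraction

# Illustrative parameter counts only, not real model sizes.
print(f"dense 7B:           {flops_per_token(7e9):.2e} FLOPs/token")
print(f"MoE 7B, 25% active: {flops_per_token(7e9, 0.25):.2e} FLOPs/token")
```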
Latent Variable Modeling#
A classic idea in machine learning is to define a probabilistic model with latent (hidden) variables $z$ and observable variables $y$, where $y$ is given to our learning algorithm. Marginalizing (summing) over the possible values of the latent variable allows us to express rich distributions over the observable variables.
For example, we can model the distribution over math problems and solutions by letting $x$ be the problem statement, $y$ the ground-truth answer or proof, and $z$ the free-form thought process leading to the proof. The marginal probability distribution to optimize is:

$$
P_\theta(y \mid x) = \sum_z P_\theta(z \mid x)\, P_\theta(y \mid x, z)
$$
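For intuition, here is a minimal Monte Carlo sketch of this marginalization; `sample_cot` (draws $z \sim P_\theta(z \mid x)$) and `answer_prob` (evaluates $P_\theta(y \mid x, z)$) are hypothetical stand-ins for model calls.

```python
def estimate_marginal(x, y, sample_cot, answer_prob, num_samples=64):
    """Monte Carlo estimate of P(y | x) = E_{z ~ P(z|x)}[P(y | x, z)]:
    sample latent thoughts z and average the answer probability under each."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_cot(x)              # draw a free-form thought z ~ P(z | x)
        total += answer_prob(x, z, y)  # evaluate P(y | x, z)
    return total / num_samples
```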
Token-Based Thinking#
Ling et al. (2017) explored generating intermediate steps before producing short answers, particularly for mathematical problems, introducing the AQUA-RAT dataset; Cobbe et al. (2021) later extended this line of work with the grade school math (GSM8K) dataset. Cobbe et al. trained a generator via supervised learning on human-written solutions, together with a verifier that predicts the correctness of candidate solutions; the verifier can then be used to search over sampled solutions. Nye et al. (2021) used intermediate thinking tokens as "scratchpads," while Wei et al. (2022) coined the now-standard term "chain of thought" (CoT).
Early work on improving CoT reasoning involved supervised learning on human-written reasoning trajectories, or on model-generated trajectories filtered by answer correctness, where the latter can be seen as a rudimentary form of reinforcement learning (RL). Other work found that prompting instruction-tuned models with "think step by step" (Kojima et al., 2022), or with more complex prompts that encourage the model to first recall relevant knowledge (Yasunaga et al., 2023), can significantly improve their mathematical performance.
Later work found that doing RL on datasets of problems with automatically checkable solutions, such as STEM problems with short answers or coding tasks that can be checked with unit tests, significantly improves CoT reasoning (Zelikman et al., 2022; Wang et al., 2023; Liu et al., 2023). This approach has gained broad attention with the release of o1-preview, o3, and the R1 technical report (DeepSeek-AI, 2025), which showed that policy gradient algorithms can deliver strong performance.
Branching and Editing#
The fundamental goal of computation at test time is to adaptively modify the model's output distribution during testing. There are various methods to leverage test-time resources for decoding to select better samples, thereby changing the model's predictions to a more desirable distribution. The two main methods to improve the decoding process are parallel sampling and sequential revision.
- Parallel Sampling generates multiple outputs simultaneously, guiding each step with process reward signals or using verifiers at the end to judge quality. It is the most widely adopted decoding method for improving test-time performance, such as best-of-N or beam search. When ground truth is unavailable, self-consistency (Wang et al., 2023) is often used to select the answer by majority vote over multiple CoT rollouts.
- Sequential Revision iteratively adjusts the model's responses based on the output from the previous step, requiring the model to intentionally reflect on its existing responses and correct errors. The revision process may need to rely on fine-tuned models, as naively depending on the model's inherent self-correcting ability without external feedback may not yield improvements (Kamoi et al., 2024; Huang et al., 2024).
Parallel sampling is simple, intuitive, and easier to implement, but is limited by the model's capability to produce the correct solution in one go. Sequential revision explicitly requires the model to reflect on errors, but it is slower and requires extra caution during implementation, as there is indeed a risk of correct predictions being modified to incorrect ones or introducing other types of hallucinations. Both methods can be used together. Snell et al. (2024) showed that simple problems benefit from purely sequential test-time computation, while more difficult problems often perform best under an optimal ratio of sequential to parallel computation.
Parallel Sampling#
Given a generative model and a scoring function, we can use it to score all or part of the samples, and we can use various search algorithms to find high-scoring samples. Best-of-N is the simplest such algorithm: simply collect N independent samples and select the highest-ranking sample based on some scoring function. Beam search is a more complex search algorithm that makes the search process more adaptive, spending more sampling computation on the more promising parts of the solution space.
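A minimal best-of-N sketch, assuming a `generate` sampler and a `score` function (e.g., a verifier or reward model) are available; both are hypothetical stand-ins.

```python
def best_of_n(prompt, generate, score, n=16):
    """Draw n independent samples and return the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda candidate: score(prompt, candidate))
```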
Beam search maintains a set of promising partial sequences and alternates between expanding them and pruning the less promising ones. As a selection mechanism, we can use a process reward model (PRM; Lightman et al., 2023) to guide the choice of beam candidates. Xie et al. (2023) used an LLM to evaluate the likelihood that its own generated reasoning steps are correct, framed as multiple-choice questions, and found that per-step self-evaluation reduced cumulative errors in multi-step reasoning during beam search decoding. Additionally, annealing the sampling temperature helps reduce the accumulated randomness from sampling. These experiments achieved 5-6% improvements with the Codex model on the GSM8K, AQuA, and StrategyQA benchmarks.
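For concreteness, a simplified step-level beam search guided by a process reward model might look like the sketch below; `extend` (proposes candidate next steps), `prm_score`, and `is_complete` are hypothetical stand-ins rather than any paper's exact procedure.

```python
def prm_beam_search(prompt, extend, prm_score, is_complete,
                    beam_width=4, expand_k=4, max_steps=32):
    """Keep the beam_width highest-scoring partial solutions, alternating
    between expanding each one by expand_k candidate next steps and pruning
    with the process reward model."""
    beams = [[]]  # each beam is a list of reasoning steps
    for _ in range(max_steps):
        candidates = []
        for steps in beams:
            if is_complete(steps):
                candidates.append(steps)  # keep finished solutions as-is
                continue
            for next_step in extend(prompt, steps, k=expand_k):
                candidates.append(steps + [next_step])
        # Prune: keep only the top-scoring (partial) solutions.
        beams = sorted(candidates, key=lambda s: prm_score(prompt, s),
                       reverse=True)[:beam_width]
        if all(is_complete(s) for s in beams):
            break
    return beams[0]
```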
Reward-balanced search (abbreviated as "REBASE"; Wu et al., 2025) trains a process reward model (PRM) to decide how much each node at each depth should be expanded during beam search, based on softmax-normalized reward scores. Jiang et al. (2024) trained their PRM, named "RATIONALYST," on synthetic rationales conditioned on a large amount of unlabeled data, and used it to guide beam search. Rationales are kept as good if, when comparing contexts that include the rationale with those that do not, they help reduce the negative log-probability of the ground-truth answer tokens. At inference time, RATIONALYST provides process supervision for the CoT generator either by helping estimate the log-probability of the next reasoning step ("implicit") or by directly generating the next reasoning step as part of the prompt ("explicit").
Interestingly, emergent chain-of-thought reasoning paths can be triggered without explicit zero-shot or few-shot prompting. Wang & Zhou (2024) found that if we branch at the first decoding token by keeping the top-k candidate tokens and then continue each of these branches with greedy decoding, many of the resulting sequences naturally contain CoT. When a CoT does appear in the decoded sequence, the final answer tends to be decoded with higher confidence, measured by the probability gap between the top-1 and top-2 candidates over the answer tokens. Computing this confidence requires identifying the answer span, either with task-specific heuristics (such as taking the last numerical value for math problems) or by further prompting the model with "So the answer is". The design choice of branching only at the first token is based on the observation that early branching significantly increases the diversity of potential paths, while later tokens are heavily conditioned on the preceding sequence.
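A sketch of this style of confidence measure over the answer span, assuming per-token (top-1, top-2) probabilities are available from the decoder:

```python
def answer_confidence(answer_token_probs):
    """Average gap between the top-1 and top-2 token probabilities over the
    tokens of the answer span; larger gaps indicate a more confident decode."""
    gaps = [top1 - top2 for top1, top2 in answer_token_probs]
    return sum(gaps) / len(gaps)

# Example: per-token (top-1 prob, top-2 prob) pairs for a decoded answer.
print(answer_confidence([(0.92, 0.03), (0.85, 0.10), (0.97, 0.01)]))
```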
Sequential Revision#
If the model can reflect on and correct errors in its past answers, we would expect it to produce an iterative sequence of revisions of steadily improving quality. However, this self-correction ability does not come natively in LLMs out of the box, due to various failure modes such as: (1) hallucination, including modifying correct answers into incorrect ones; (2) behavior collapse into non-correcting behavior, for example making only minor or no modifications to an incorrect first answer; or (3) failure to generalize to distribution shift at test time. Experiments by Huang et al. (2024) showed that naively applying self-correction leads to worse performance and that the model needs external feedback to self-improve; such feedback can come from matching against ground truth, heuristics and task-specific metrics, unit test results for coding problems (Shinn et al., 2023), stronger models (Zhang et al., 2024), or human feedback (Liu et al., 2023).
Self-correction learning (Welleck et al., 2023) aims to train a corrector model $P_\theta(y \mid y_0, x)$ for a fixed generator model $P_0(y_0 \mid x)$. While the generator remains general-purpose, the corrector can be task-specific and generates corrections conditioned only on the initial model response and optional additional feedback (such as a sentence, compiler feedback, or unit test results):
- Self-correction learning first samples multiple outputs for each prompt in the data pool;
- Then it creates value-improving pairs (prompt $x$, hypothesis $y$, correction $y'$) by pairing two outputs for the same prompt whenever one has a higher value than the other;
- These pairs are sampled proportionally to the value improvement $v(y') - v(y)$ and the similarity between the two outputs, $\text{Similarity}(y, y')$, to train the corrector model (a minimal sketch of this step follows the list);
- To encourage exploration, the corrector also contributes new generations to the data pool. At inference time, the corrector can be applied iteratively to create a trajectory of sequential corrections.
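As referenced in the list above, here is a minimal sketch of the pair-construction step, assuming each pooled output comes with a scalar value and a `similarity` function is provided; both are hypothetical stand-ins.

```python
import random

def sample_correction_pairs(outputs, similarity, num_pairs=1000):
    """Build (hypothesis y, correction y') training pairs from a pool of
    (output, value) tuples for the same prompt, sampling pairs proportionally
    to the value improvement v(y') - v(y) weighted by similarity(y, y')."""
    pairs, weights = [], []
    for y, v_y in outputs:
        for y_prime, v_y_prime in outputs:
            gain = v_y_prime - v_y
            if gain <= 0:
                continue  # only keep value-improving pairs
            pairs.append((y, y_prime))
            weights.append(gain * similarity(y, y_prime))
    if not pairs:
        return []
    return random.choices(pairs, weights=weights, k=num_pairs)
```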
Recursive introspection (RISE; Qu et al., 2024) also aims to train a better corrector model, but uses a single model to perform both generation and self-correction.
SCoRe (Self-Correction through Reinforcement Learning; Kumar et al., 2024) is a multi-round RL method that encourages the model to self-correct by generating better answers on the second attempt than those created on the first attempt. It consists of two training phases: Phase 1 maximizes the accuracy of the second attempt while enforcing KL penalties only on the first attempt to avoid excessive deviation of the first-round response from the base model behavior; Phase 2 optimizes the accuracy of the answers generated in both the first and second attempts. Ideally, we do want to see better performance for both the first and second attempts, but adding Phase 1 can prevent the model from collapsing into behaviors of making minor edits or no edits to the first response, while Phase 2 further improves the results.
Reinforcement Learning to Improve Reasoning#
Recently, significant success has been achieved in enhancing the reasoning capabilities of language models by using a set of questions with ground truth answers (often STEM problems and puzzles with easily verifiable answers) and rewarding the model for obtaining correct answers. The strong performance of OpenAI's o-series models and subsequent models and technical reports released by DeepSeek have driven recent activity in this field.
DeepSeek-R1 (DeepSeek-AI, 2025) is an open-source LLM designed to excel at tasks requiring advanced reasoning skills, such as mathematics, coding, and logical problem-solving. They conducted two rounds of SFT-RL training, enabling R1 to excel at both reasoning and non-reasoning tasks.
1. Cold-start SFT fine-tunes the DeepSeek-V3-Base model on a collection of thousands of cold-start samples. Without this step, the model suffers from poor readability and language-mixing issues.
2. Reasoning-focused RL trains the model on reasoning prompts using two types of rule-based rewards (a minimal sketch of such a reward follows this list):
   - Format reward: the model should wrap its CoT between `<think>` and `</think>` tokens.
   - Accuracy reward: whether the final answer is correct. Answers to math problems must appear in a specified format (e.g., in a box) to be verified reliably; for coding problems, a compiler is used to check whether test cases pass.
3. Rejection sampling + non-reasoning SFT creates new SFT data via rejection sampling on the Step 2 RL checkpoint, combined with non-reasoning supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrains DeepSeek-V3-Base.
   - CoTs with mixed languages, long paragraphs, and code blocks are filtered out.
   - The DeepSeek-V3 (DeepSeek-AI, 2024) pipeline is reused to include non-reasoning tasks.
   - For certain non-reasoning tasks, DeepSeek-V3 is prompted to generate a potential CoT before answering; for simple queries like "hello," no CoT is needed.
   - DeepSeek-V3-Base is then fine-tuned on roughly 800K samples in total for 2 epochs.
4. The final RL stage trains the Step 3 checkpoint on both reasoning and non-reasoning prompts to improve helpfulness, harmlessness, and reasoning ability.
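As referenced in step 2 above, here is a minimal sketch of a rule-based reward combining a format check and an accuracy check; the tag names, answer-extraction logic, and reward values are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward: +0.1 for wrapping the CoT in <think>...</think>
    tags and +1.0 if the boxed final answer matches the reference."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1  # format reward
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward
    return reward
```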
Interestingly, the DeepSeek team showed that using pure RL, without an SFT phase, can still learn advanced reasoning abilities such as reflection and backtracking ("aha moments"). The model naturally learned to spend more thinking tokens to solve reasoning tasks during RL training. "Aha moments" can occur, referring to the model reflecting on previous mistakes and then trying other methods to correct them. Subsequently, various open-source efforts emerged to replicate R1 results, such as Open-R1, SimpleRL-reason, and TinyZero, all based on the Qwen model. These efforts also confirmed that pure RL leads to excellent performance on mathematical problems and the emergence of "aha moments."
The DeepSeek team also shared some of their unsuccessful attempts. They did not use a process reward model (PRM) because it was challenging to define scoring metrics for each step or determine whether intermediate steps were correct, while making training more susceptible to reward hacking. Efforts with MCTS (Monte Carlo Tree Search) also failed due to the vast search space of language model tokens compared to chess; training fine-grained value models to guide the search was also very challenging. Failed attempts often provide unique insights, and we want to encourage the research community to share more about things that did not succeed.
Use of External Tools#
Certain intermediate steps can be solved reliably and accurately by executing code or performing mathematical calculations during the reasoning process. Offloading this part of the reasoning to an external code interpreter, as in PAL (Program-Aided Language Model; Gao et al., 2022) or Chain of Code (Li et al., 2023), extends the capabilities of LLMs without requiring the LLM to learn to execute code or act as a calculator itself. The code interpreter, as in Chain of Code, can itself be augmented by an LLM, so that if the standard interpreter fails to execute a line of code, we can fall back to having the LLM execute it. Using code to augment reasoning steps is particularly beneficial for mathematical problems, symbolic reasoning, and algorithmic tasks. Unit tests may not be provided as part of a coding problem, in which case we can instruct the model to generate its own unit tests to validate its solutions (Shinn et al., 2023).
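A heavily simplified sketch of offloading computation to a code interpreter in the spirit of PAL; the `llm` callable and prompt wording are hypothetical, and executing model-generated code would require sandboxing in any real system.

```python
def solve_with_interpreter(question: str, llm) -> str:
    """Ask the model to write a Python snippet that stores its result in
    `answer`, then execute that snippet instead of trusting the model's
    own arithmetic."""
    program = llm(
        "Write Python code that computes the answer to the question below "
        "and assigns it to a variable `answer`.\n"
        f"Question: {question}"
    )
    namespace = {}
    exec(program, namespace)  # NOTE: sandbox this in any real system
    return str(namespace.get("answer"))
```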
ReAct (Reason+Act; Yao et al., 2023) combines search over the Wikipedia API with the generation of reasoning trajectories, allowing reasoning paths to incorporate external knowledge.
Recently released by OpenAI, o3 and o4-mini are two more excellent examples where the reasoning process involves using tools like web searches, code execution, and image processing. The team observed that large-scale reinforcement learning exhibited the same trend as the GPT paradigm: "more computation = better performance."
Faithful Thinking#
Deep learning models are often viewed as black boxes, and various interpretability methods have been proposed. Interpretability is useful for several reasons: first, it provides an additional test to determine whether the model is inconsistent with its creators' intentions or whether it is making errors in ways we cannot judge by monitoring its outputs. Second, it can help us determine whether the model is using a reasonable process to compute its answers. Chain of thought provides a particularly convenient form of interpretability, as it makes the model's internal processes visible in natural language. However, this interpretability is based on the assumption that the model accurately describes its internal thought processes.
Recent research has shown that monitoring the CoT of reasoning models can effectively detect erroneous behaviors in models, such as reward hacking, and can even enable weaker models to monitor stronger models (Baker et al., 2025). Increasing test-time computation can also enhance adversarial robustness (Zaremba et al., 2025); this is intuitively reasonable, as when the model encounters unusual inputs (such as adversarial examples or jailbreak attempts), thinking time should be particularly useful—it can leverage additional thinking time to understand the strange situations it faces.
Does the model faithfully express its thoughts?#
Intuitively, due to the lack of explicit training objectives designed to encourage faithful reasoning, the model's CoT may be biased. Alternatively, when we fine-tune the model based on human-written explanations, these human-written samples may contain errors. Therefore, we cannot assume that CoT is always faithful by default.
Lanham et al. (2023) studied several modes of CoT faithfulness failure by deliberately perturbing the CoT (for example, truncating it or introducing errors) and measuring the impact on accuracy across a set of multiple-choice tasks (such as AQuA, MMLU, ARC Challenge, TruthfulQA, HellaSwag):
- Error 1 (premature answering): The model may form its conclusion before the CoT is generated. This was tested by truncating the CoT early or inserting errors into it. Different tasks showed different degrees of dependence on the CoT; some were sensitive to truncated CoT while others were not. Wang et al. (2023) ran similar experiments with subtler perturbations, introducing mistakes into the bridging objects or language templates used to form the CoT.
- Error 2 (uninformative tokens): The hypothesis that uninformative CoT tokens by themselves improve performance was tested by replacing the CoT with filler text (e.g., all periods); this showed no accuracy gain over having no CoT at all, and performance on some tasks even decreased slightly.
- Error 3 (human-unreadable encoding): Relevant information might be encoded in ways humans cannot read. Paraphrasing the CoT in non-standard ways did not degrade performance across datasets, suggesting that the accuracy gains do not depend on human-readable reasoning.
Interestingly, Lanham et al. found that smaller models may not be able to make good use of CoT on multiple-choice questions, while larger models may already solve such tasks without CoT. This dependence on CoT reasoning, measured by comparing how often the model reaches the same answer with and without CoT, does not always increase with model size on multiple-choice questions, but it does increase with model size on addition tasks, suggesting that thinking time matters more for complex reasoning tasks.
Alternative methods for testing CoT faithfulness perturb the prompt rather than directly modifying the CoT path (Turpin et al., 2023; Chua & Evans, 2025; Chen et al., 2025).
One method biases the few-shot examples by always marking the correct answer as "(A)", regardless of the true label.
Another prompting technique inserts a misleading hint into the prompt, such as "I think the answer is <random_label> but curious to hear what you think." or "A Stanford Professor thinks the answer is <random_label>." By comparing the model's predictions on the same question with and without the hint, we can measure whether the model faithfully describes the hint's influence on its answer. Specifically, among the cases where the model produces different answers with and without the hint, we measure how often the model acknowledges the hint's influence when solving the hinted version. If the model is faithful, it should explicitly acknowledge the influence and recognize that its change of answer is due to the hint.
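A sketch of this measurement, where `ask` returns a (final answer, CoT) pair and `add_hint` / `mentions_hint` are hypothetical helper functions:

```python
def hint_faithfulness(questions, ask, add_hint, mentions_hint):
    """Among cases where injecting a hint flips the model's answer, return
    the fraction where the CoT explicitly acknowledges relying on the hint."""
    flipped, acknowledged = 0, 0
    for q in questions:
        base_answer, _ = ask(q)
        hinted_answer, hinted_cot = ask(add_hint(q))
        if hinted_answer != base_answer:   # the hint changed the prediction
            flipped += 1
            if mentions_hint(hinted_cot):  # a faithful CoT credits the hint
                acknowledged += 1
    return acknowledged / flipped if flipped else float("nan")
```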
Multiple studies have found that reasoning models describe the influence of such hints more reliably than all of the non-reasoning models tested. For example, we can measure the proportion of samples in which the model acknowledges the hint as a determining factor ("faithful CoT"); reasoning models (Claude 3.7 Sonnet, DeepSeek R1) generally do better than non-reasoning models (Claude 3.6, DeepSeek V3).
Some evidence suggests that using reward models can reduce the faithfulness of model outputs. The reward model in classic RLHF is not trained to incentivize faithful behavior, since the preference-comparison dataset is collected from humans choosing whichever response looks better or more pleasing. In reasoning models, by contrast, the CoT is optimized to produce a correct final answer rather than to match human preferences defined by an RM, so it is intuitively expected to be more faithful.
The impact of optimization pressure on CoT: good or bad?#
Monitoring the CoT of reasoning models is effective for detecting reward hacking. A simple way to use such a monitor is to run it at test time as a rejection-sampling filter to identify solutions free of reward hacking. However, this is only effective in the early stages of RL; as the model is optimized further, it becomes rare to find samples that pass the CoT monitor within the sampling budget. This challenge naturally suggests incorporating the CoT monitor into the RL reward (a form of process-based supervision) to prevent reward hacking.
However, incorporating the CoT monitoring reward into RL does not work well, as it leads to obfuscated reward hacking, where the model tries to hide its true intent within the CoT (Baker et al., 2025). Under the direct optimization pressure applied by the CoT monitor during RL training, the model can still find ways to hack, albeit less frequently, by hiding its intent in the CoT to avoid being caught by the monitor.
Yeo et al. (2025) confirmed a similar message about RL reward signals on CoT length: novel reward shaping can lead to new types of reward hacking. They designed a reward function such that correct CoTs receive higher rewards than incorrect CoTs, short correct CoTs receive higher rewards than long correct CoTs, and short incorrect CoTs receive higher penalties than long incorrect CoTs. With this additional reward, the model learned to repeat text in the CoT on challenging questions rather than attempting to solve them; Yeo et al. then applied an n-gram repetition penalty to prevent this hacking behavior.
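A toy version of such a shaped reward plus an n-gram repetition penalty; the functional form and coefficients are illustrative assumptions, not Yeo et al.'s exact design.

```python
def shaped_reward(is_correct: bool, cot_len: int, max_len: int = 4096) -> float:
    """Toy length-shaped reward: correct CoTs beat incorrect ones, shorter
    correct CoTs beat longer correct ones, and shorter incorrect CoTs are
    penalized more heavily than longer incorrect ones."""
    frac = min(cot_len / max_len, 1.0)
    if is_correct:
        return 1.0 - 0.5 * frac    # 1.0 -> 0.5 as the correct CoT grows
    return -1.0 + 0.5 * frac       # -1.0 -> -0.5 as the incorrect CoT grows

def ngram_repetition_penalty(tokens, n=4, weight=0.05) -> float:
    """Penalize repeated n-grams, which the shaped reward alone can invite."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeats = len(ngrams) - len(set(ngrams))
    return -weight * repeats
```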
Chen et al. (2025) experimented with a flawed RL environment, specifically using a grader that rewards incorrect answers on multiple-choice questions. The model learned to exploit the reward hack on more than 99% of prompts, yet in more than half of the environments it almost never (<2% of the time) verbalized the hack in its CoT. In this case, the additional RL optimization pressure was not enough to incentivize the model to express the hack in language.
RL training is inherently sensitive to reward hacking. Relying solely on heuristic investigations of reward hacking and manual fixes may lead to a "whack-a-mole" situation. We recommend being very cautious when attempting to apply optimization directly on CoT during RL training or trying to avoid it altogether.
Thinking in Continuous Space#
Adaptive computation time, introduced by Alex Graves in 2016, predates large language models but pioneered the same direction: allowing models to dynamically decide how many computational steps to take at inference time, which can be viewed as letting the model "think more" in continuous space at test time. Adaptive thinking time in continuous space can be enabled vertically, via recurrent architectures, or horizontally, via more sequential sampling steps.
Recurrent Architectures#
Many architectural variants have been proposed to make the Transformer architecture recurrent and enable adaptive test-time computation (Dehghani et al., 2019; Hutchins et al., 2022; Bulatov et al., 2022). A deep dive into this literature would make this article too long, so we review only a few.
The Universal Transformer (Dehghani et al., 2019) combines the self-attention of Transformers with the recurrence mechanism of RNNs, dynamically adjusting the number of steps using adaptive computation time (Graves, 2016). At a high level, it can be viewed as a recurrent function for learning the hidden-state representation of each token; if the number of steps is fixed, the Universal Transformer is equivalent to a multi-layer Transformer with parameters shared across layers.
The recurrent architecture recently proposed by Geiping et al. (2025) adds a recurrent block $R$ on top of a standard Transformer. Each iteration of this block takes the input embedding $\mathbf{e}$ and the current state $\mathbf{s}_i$; the state is initialized as random Gaussian noise $\mathbf{s}_0$ and iteratively refined, while the original embedding $\mathbf{e}$ is re-injected at every recurrent step. Conceptually, this recurrent-depth design resembles a conditional diffusion model. (Interestingly, some of their design variants that were even more similar to diffusion models turned out to work poorly.)
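A schematic sketch of the recurrent-depth idea in PyTorch; the module names and the concatenation wiring are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Sketch: embed the input once, then iterate a shared recurrent block R
    that sees both the embedding e and the evolving latent state s_i."""
    def __init__(self, prelude: nn.Module, recurrent_block: nn.Module, coda: nn.Module):
        super().__init__()
        self.prelude = prelude                  # input tokens -> embedding e
        self.recurrent_block = recurrent_block  # maps [e, s_i] -> s_{i+1} (2d -> d)
        self.coda = coda                        # final state -> output logits

    def forward(self, x: torch.Tensor, num_iterations: int) -> torch.Tensor:
        e = self.prelude(x)
        s = torch.randn_like(e)          # random Gaussian initial state s_0
        for _ in range(num_iterations):  # more iterations = more test-time compute
            s = self.recurrent_block(torch.cat([e, s], dim=-1))
        return self.coda(s)
```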
Thinking Tokens#
Thinking tokens are a set of implicit tokens, introduced during training or at inference time, that carry no direct linguistic meaning. Their role is instead to buy the model extra thinking time and computational capacity so that it can perform better.
Herel & Mikolov (2023) proposed inserting a special thinking token after each word in a sentence and training the model on such a dataset. Each thinking token buys the model extra time to process and make a better prediction. Training with thinking tokens in a toy-model setting yielded lower perplexity than a baseline trained without them. The benefit of thinking tokens is more pronounced for non-trivial reasoning tasks or sentences involving numbers.
Similarly, the pause tokens proposed by Goyal et al. (2024) give the model extra computation at inference time by appending dummy tokens (such as . or #) to the end of the input sequence, thereby delaying the model's output. Injecting pause tokens during both training and inference is crucial; fine-tuning with pause tokens alone yields only limited gains. During training, multiple copies of the pause token are inserted at uniformly random positions, and the loss on pause tokens is ignored.
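A sketch of the training-side data preparation described above: copies of a pause token are inserted at uniformly random positions and their loss is masked out. The token id and count are illustrative assumptions.

```python
import random

PAUSE_TOKEN_ID = 50257   # hypothetical id reserved for the pause token
IGNORE_INDEX = -100      # label value whose loss is ignored

def insert_pause_tokens(token_ids, num_pauses=10):
    """Insert pause tokens at uniformly random positions; labels at pause
    positions are set to IGNORE_INDEX so no loss is computed on them."""
    inputs = list(token_ids)
    labels = list(token_ids)
    for _ in range(num_pauses):
        pos = random.randint(0, len(inputs))
        inputs.insert(pos, PAUSE_TOKEN_ID)
        labels.insert(pos, IGNORE_INDEX)
    return inputs, labels
```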
Interestingly, the thinking or pause tokens in the above experiments carry no extra information and add few new parameters. Why, then, are they still helpful? On one hand, they expand computation by introducing more inference loops, effectively increasing computational capacity. On the other hand, they can be viewed as a special, implicit form of CoT. One downside is that the model needs to be pretrained with thinking tokens. Nevertheless, this strategy remains an interesting way to further improve the use of test-time compute on top of inference-time CoT.
Quiet-STaR (Zelikman et al., 2025) introduces token-level reasoning by training the model to generate a rationale after every token to explain future text. It mixes predictions of future text with and without the rationale and uses REINFORCE to optimize the quality of rationale generation.