This lecture, presented by Denny Zhou from Google DeepMind, explores the concept of Large Language Model (LLM) reasoning, defining it as the intermediate steps generated between an input and output.
Zhou introduces Chain of Thought (CoT) prompting and self-consistency, highlighting how these techniques allow pre-trained LLMs to reason by shaping their output distributions to favor step-by-step solutions rather than simply generating final answers.
The discussion also covers the evolution from Supervised Fine-Tuning (SFT) to Reinforcement Learning Fine-Tuning (RLFT), emphasizing that self-generated, verifiable data can outperform human-annotated data for training.
Finally, Zhou illustrates the power of aggregation and retrieval-augmented reasoning, suggesting that future advancements lie in tasks beyond automatic verifiability and the development of practical, real-world LLM applications.
Here are the main topics discussed regarding Large Language Model (LLM) reasoning, along with examples:
For the purposes of the talk, reasoning in LLMs refers specifically to the intermediate tokens generated between input and output, also known as intermediate steps. The concept isn't new: a 2017 DeepMind paper already used intermediate steps (natural-language rationales) to solve math word problems.
Example: Concatenating the last letters in “artificial intelligence.” Instead of directly outputting “LE,” the model reasons: “The last letter of artificial is L, the last letter of intelligence is E, concatenating L and E results in LE.”
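The following minimal sketch illustrates the kind of few-shot chain-of-thought prompt described here; the demonstration wording is an assumption, not the exact prompt from the talk.

```python
# A one-shot chain-of-thought prompt for the last-letter concatenation task.
# The demonstration teaches the model to write out intermediate steps before
# the final answer. The exact wording is illustrative only.
cot_prompt = """Q: Concatenate the last letters of the words in "machine learning".
A: The last letter of "machine" is E. The last letter of "learning" is G.
Concatenating E and G gives EG. The answer is EG.

Q: Concatenate the last letters of the words in "artificial intelligence".
A:"""

print(cot_prompt)  # send this to any few-shot-capable LLM; it should answer step by step
```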
Theoretical work suggests that for any problem solvable by Boolean circuits of size T, constant-size transformer models can solve it by generating O(T) intermediate tokens. Directly generating final answers would either require immense depth or be impossible.
Pre-trained models are already capable of reasoning; the difficulty often lies in the decoding process. Greedy decoding can hide reasoning ability.
Example: Asked “I have 3 apples, my dad has 2 more apples than me; how many apples do we have in total?”, greedy decoding jumps straight to the wrong answer “five apples,” while other high-ranking candidates reveal the reasoning: “I have 3 apples, my dad has 5… 3 + 5 = 8,” which is correct.
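A minimal sketch of this branch-at-the-first-token idea, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (the model choice and prompt wording are assumptions, not the setup from the talk):

```python
# Sketch: greedy decoding vs. branching on the top-k first tokens and then
# decoding each branch greedily. Reasoning paths often appear among the branches
# even though greedy decoding jumps straight to a short (possibly wrong) answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model, not the one from the talk
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ("Q: I have 3 apples, my dad has 2 more apples than me. "
          "How many apples do we have in total?\nA:")
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

with torch.no_grad():
    # Greedy decoding: follows the single most likely token at every step.
    greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
    print("greedy :", tokenizer.decode(greedy[0][prompt_len:]))

    # Branch on the top-k candidates for the first generated token, then
    # continue each branch greedily and inspect the alternative continuations.
    first_token_logits = model(**inputs).logits[0, -1]
    for tok in torch.topk(first_token_logits, k=5).indices:
        branch = torch.cat([inputs.input_ids, tok.view(1, 1)], dim=-1)
        out = model.generate(branch, max_new_tokens=40, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        print("branch :", tokenizer.decode(out[0][prompt_len:]))
```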
Supervised Fine-Tuning (SFT) uses problems with human-annotated reasoning steps. Example: math problems with step-by-step solutions collected in datasets like GSM8K. Pitfalls: limited generalization, and human annotations may contain mistakes.
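As a rough sketch, assuming the Hugging Face datasets copy of GSM8K (field names as published there), SFT examples can be formatted so the training target contains the human-written steps followed by the final answer:

```python
# Sketch: turn GSM8K (question, human-written step-by-step solution) pairs into
# SFT targets. Assumes the Hugging Face hub copy of GSM8K, whose "answer" field
# contains the annotated steps followed by "#### <final number>".
from datasets import load_dataset

train = load_dataset("gsm8k", "main", split="train")

def to_sft_example(row):
    # The training target includes the reasoning steps, not just the final answer.
    return {"text": f"Q: {row['question']}\nA: {row['answer']}"}

sft_data = train.map(to_sft_example)
print(sft_data[0]["text"][:300])
```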
Advantage of RLFT: self-generated data can outperform human-annotated data, since it can be filtered for correctness and tends to have cleaner structure.
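A minimal sketch of the verify-and-keep loop behind such self-generated training data; `sample_solutions` and `extract_final_answer` are hypothetical placeholders for model sampling and answer parsing:

```python
# Sketch: build verified, self-generated training data. Sample several solutions
# per problem and keep only those whose final answer matches the known ground
# truth. The two helper callables are hypothetical placeholders.
def build_verified_dataset(problems, sample_solutions, extract_final_answer, k=8):
    verified = []
    for question, gold_answer in problems:
        for solution in sample_solutions(question, num_samples=k):
            if extract_final_answer(solution) == gold_answer:
                # Correct reasoning traces become training targets; wrong ones are discarded.
                verified.append({"question": question, "solution": solution})
    return verified
```

In RL fine-tuning the same automatic correctness check can serve directly as the reward signal rather than as a data filter.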
Scaling: lengthening the chain of thought (the number of output steps) often matters more than increasing model size.
LLMs differ from classical AI search—they reason through sequential token prediction, sometimes mimicking “human-like” thinking. Example: Solving the “make 2025 from 1–10” problem by recognizing 2025 = 45² and breaking it into intermediate goals.
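As a quick arithmetic check of that intermediate goal (the identity 45 = 1 + 2 + … + 9 is shown only as one illustrative decomposition, not as the talk's full solution):

```python
# Check the intermediate facts: 2025 is a perfect square, and 45 itself can be
# built from small integers (45 = 1 + 2 + ... + 9). Illustrative only.
assert 45 ** 2 == 2025
assert sum(range(1, 10)) == 45
print((1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) ** 2)  # 2025
```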
RLFT works best for automatically verifiable tasks, such as math or competitive programming; it is less effective for creative writing or open-ended programming tasks.
Instead of relying on a single highest-probability (greedy) output, self-consistency aggregates the final answers from multiple sampled reasoning paths.
Process: Sample many responses → take the most common final answer.
Example: If outputs are 18, 26, 18, the answer becomes 18.
This reduces reasoning variability and boosts accuracy.
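A minimal sketch of this majority vote, with `sample_responses` and `extract_final_answer` as hypothetical placeholders for temperature sampling and answer parsing:

```python
# Sketch: self-consistency as a majority vote over the final answers of many
# sampled responses. The two helper callables are hypothetical placeholders.
from collections import Counter

def self_consistent_answer(question, sample_responses, extract_final_answer, n=16):
    answers = [extract_final_answer(r) for r in sample_responses(question, n=n)]
    answers = [a for a in answers if a is not None]
    # The most common final answer wins, regardless of which reasoning path produced it.
    return Counter(answers).most_common(1)[0][0] if answers else None

# The toy case from the text: sampled answers 18, 26, 18 -> majority answer 18.
print(Counter(["18", "26", "18"]).most_common(1)[0][0])
```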
An extension of self-consistency handles problems whose answers are not a single, directly comparable value by finding the overlapping consensus across responses. Example: if answers differ but share common elements (Japan, China, and India appear in all of them), the model aggregates those shared elements.
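A toy sketch of the overlap idea from this example; the lists below are made up for illustration, and simple set intersection stands in for the more general consistency check described here:

```python
# Toy sketch: aggregate free-form answers by keeping the elements that appear in
# every sampled response. The sampled lists are invented for illustration.
responses = [
    {"Japan", "China", "India", "Thailand"},
    {"Japan", "China", "India", "Vietnam"},
    {"India", "Japan", "China"},
]
consensus = set.intersection(*responses)
print(consensus)  # {'Japan', 'China', 'India'}
```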
Retrieval strengthens reasoning by prompting models with relevant past problems or principles: