This lecture, presented by Denny Zhou from Google DeepMind, explores the concept of Large Language Model (LLM) reasoning, defining it as the intermediate steps generated between an input and output.
Zhou introduces Chain of Thought (CoT) prompting and self-consistency, highlighting how these techniques allow pre-trained LLMs to reason by shaping their output distributions to favor step-by-step solutions rather than simply generating final answers.
The discussion also covers the evolution from Supervised Fine-Tuning (SFT) to Reinforcement Learning Fine-Tuning (RLFT), emphasizing that self-generated, verifiable data can outperform human-annotated data for training.
Finally, Zhou illustrates the power of aggregation and retrieval-augmented reasoning, suggesting that future advancements lie in tasks beyond automatic verifiability and the development of practical, real-world LLM applications.
Here are the main topics discussed regarding Large Language Model (LLM) reasoning, along with examples:
For the purposes of the talk, reasoning in LLMs refers specifically to the intermediate tokens generated between input and output, also known as intermediate steps. The concept isn't new: a 2017 DeepMind paper already used intermediate steps (natural-language rationales) to solve math word problems.
Example: Concatenating the last letters in “artificial intelligence.” Instead of directly outputting “LE,” the model reasons: “The last letter of artificial is L, the last letter of intelligence is E, concatenating L and E results in LE.”
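The following minimal sketch illustrates the kind of few-shot chain-of-thought prompt described here; the demonstration wording is an assumption, not the exact prompt from the talk.

```python
# A one-shot chain-of-thought prompt for the last-letter concatenation task.
# The demonstration teaches the model to write out intermediate steps before
# the final answer. The exact wording is illustrative only.
cot_prompt = """Q: Concatenate the last letters of the words in "machine learning".
A: The last letter of "machine" is E. The last letter of "learning" is G.
Concatenating E and G gives EG. The answer is EG.

Q: Concatenate the last letters of the words in "artificial intelligence".
A:"""

print(cot_prompt)  # send this to any few-shot-capable LLM; it should answer step by step
```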
Theoretical work suggests that for any problem solvable by Boolean circuits of size T, constant-size transformer models can solve it by generating O(T) intermediate tokens. Directly generating final answers would either require immense depth or be impossible.
Pre-trained models are already capable of reasoning; the difficulty often lies in the decoding process. Greedy decoding can hide reasoning ability.
Example: Asked “I have 3 apples, my dad has 2 more apples than me; how many apples do we have in total?”, greedy decoding jumps straight to the wrong answer “five apples,” while other high-ranking candidates reveal the reasoning: “I have 3 apples, my dad has 5… 3 + 5 = 8,” which is correct.
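A minimal sketch of this branch-at-the-first-token idea, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (the model choice and prompt wording are assumptions, not the setup from the talk):

```python
# Sketch: greedy decoding vs. branching on the top-k first tokens and then
# decoding each branch greedily. Reasoning paths often appear among the branches
# even though greedy decoding jumps straight to a short (possibly wrong) answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model, not the one from the talk
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ("Q: I have 3 apples, my dad has 2 more apples than me. "
          "How many apples do we have in total?\nA:")
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

with torch.no_grad():
    # Greedy decoding: follows the single most likely token at every step.
    greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
    print("greedy :", tokenizer.decode(greedy[0][prompt_len:]))

    # Branch on the top-k candidates for the first generated token, then
    # continue each branch greedily and inspect the alternative continuations.
    first_token_logits = model(**inputs).logits[0, -1]
    for tok in torch.topk(first_token_logits, k=5).indices:
        branch = torch.cat([inputs.input_ids, tok.view(1, 1)], dim=-1)
        out = model.generate(branch, max_new_tokens=40, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        print("branch :", tokenizer.decode(out[0][prompt_len:]))
```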
Supervised Fine-Tuning (SFT) uses problems with human-annotated reasoning steps. Example: math problems with step-by-step solutions collected in datasets like GSM8K. Pitfalls: limited generalization, and human annotations may contain mistakes.
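As a rough sketch, assuming the Hugging Face datasets copy of GSM8K (field names as published there), SFT examples can be formatted so the training target contains the human-written steps followed by the final answer:

```python
# Sketch: turn GSM8K (question, human-written step-by-step solution) pairs into
# SFT targets. Assumes the Hugging Face hub copy of GSM8K, whose "answer" field
# contains the annotated steps followed by "#### <final number>".
from datasets import load_dataset

train = load_dataset("gsm8k", "main", split="train")

def to_sft_example(row):
    # The training target includes the reasoning steps, not just the final answer.
    return {"text": f"Q: {row['question']}\nA: {row['answer']}"}

sft_data = train.map(to_sft_example)
print(sft_data[0]["text"][:300])
```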
Advantage of RLFT: self-generated data can outperform human-annotated data, since it can be filtered for correctness and tends to have cleaner structure.
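A minimal sketch of the verify-and-keep loop behind such self-generated training data; `sample_solutions` and `extract_final_answer` are hypothetical placeholders for model sampling and answer parsing:

```python
# Sketch: build verified, self-generated training data. Sample several solutions
# per problem and keep only those whose final answer matches the known ground
# truth. The two helper callables are hypothetical placeholders.
def build_verified_dataset(problems, sample_solutions, extract_final_answer, k=8):
    verified = []
    for question, gold_answer in problems:
        for solution in sample_solutions(question, num_samples=k):
            if extract_final_answer(solution) == gold_answer:
                # Correct reasoning traces become training targets; wrong ones are discarded.
                verified.append({"question": question, "solution": solution})
    return verified
```

In RL fine-tuning the same automatic correctness check can serve directly as the reward signal rather than as a data filter.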
Scaling: lengthening the chain of thought (the number of output steps) often matters more than increasing model size.
LLMs differ from classical AI search—they reason through sequential token prediction, sometimes mimicking “human-like” thinking. Example: Solving the “make 2025 from 1–10” problem by recognizing 2025 = 45² and breaking it into intermediate goals.
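As a quick arithmetic check of that intermediate goal (the identity 45 = 1 + 2 + … + 9 is shown only as one illustrative decomposition, not as the talk's full solution):

```python
# Check the intermediate facts: 2025 is a perfect square, and 45 itself can be
# built from small integers (45 = 1 + 2 + ... + 9). Illustrative only.
assert 45 ** 2 == 2025
assert sum(range(1, 10)) == 45
print((1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) ** 2)  # 2025
```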
RLFT works best for automatically verifiable tasks, such as math or competitive programming; it is less effective for creative writing or open-ended programming tasks.
Instead of relying on a single highest-probability (greedy) output, self-consistency aggregates the final answers from multiple sampled reasoning paths.
Process: Sample many responses → take the most common final answer.
Example: If outputs are 18, 26, 18, the answer becomes 18.
This reduces reasoning variability and boosts accuracy.
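A minimal sketch of this majority vote, with `sample_responses` and `extract_final_answer` as hypothetical placeholders for temperature sampling and answer parsing:

```python
# Sketch: self-consistency as a majority vote over the final answers of many
# sampled responses. The two helper callables are hypothetical placeholders.
from collections import Counter

def self_consistent_answer(question, sample_responses, extract_final_answer, n=16):
    answers = [extract_final_answer(r) for r in sample_responses(question, n=n)]
    answers = [a for a in answers if a is not None]
    # The most common final answer wins, regardless of which reasoning path produced it.
    return Counter(answers).most_common(1)[0][0] if answers else None

# The toy case from the text: sampled answers 18, 26, 18 -> majority answer 18.
print(Counter(["18", "26", "18"]).most_common(1)[0][0])
```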
An extension of self-consistency handles problems whose answers are not a single, directly comparable value by finding the overlapping consensus across responses. Example: if answers differ but share common elements (Japan, China, and India appear in all of them), the model aggregates those shared elements.
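A toy sketch of the overlap idea from this example; the lists below are made up for illustration, and simple set intersection stands in for the more general consistency check described here:

```python
# Toy sketch: aggregate free-form answers by keeping the elements that appear in
# every sampled response. The sampled lists are invented for illustration.
responses = [
    {"Japan", "China", "India", "Thailand"},
    {"Japan", "China", "India", "Vietnam"},
    {"India", "Japan", "China"},
]
consensus = set.intersection(*responses)
print(consensus)  # {'Japan', 'China', 'India'}
```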
Retrieval strengthens reasoning by prompting models with relevant past problems or principles: