Large Reasoning Models Explained
Key Points
- Large Language Models (LLMs) generate text by statistically predicting the next token, while Large Reasoning Models (LRMs) first plan and evaluate before token generation, enabling deeper reasoning.
- LRMs use an internal “chain of thought” to sketch plans, test hypotheses, and discard dead ends, which is crucial for complex tasks like debugging code or tracing financial flows.
- This extra reasoning incurs higher inference latency and GPU costs, making LRMs slower and more expensive than reflexive LLM outputs.
- Building an LRM involves starting with a massive pre‑trained LLM and then fine‑tuning it on curated datasets of logic puzzles, multi‑step math, and coding tasks that include full chain‑of‑thought answer keys.
- After fine‑tuning, LRMs are further refined via reinforcement learning (often RLHF), where human or model feedback rewards coherent, step‑by‑step reasoning.
Sections
- LLMs vs LRMs: Reasoning Tradeoffs (00:00:00) - Contrasts standard language models, which generate text via token‑by‑token prediction, with large reasoning models, which first plan and reason through a chain of thought, highlighting the improved answer quality but higher latency and cost.
- Reward‑Guided Reasoning and Distillation (00:03:47) - Outlines how RLHF and process reward models evaluate each reasoning step, while teacher‑student distillation supplies example thought traces, enabling a language model to learn to plan, verify, and explain its answers, with a note on runtime thinking budget.
- Models Thinking Before Responding (00:07:46) - The speaker observes that cutting‑edge large language models, particularly high‑scoring reasoning models, deliberately pause to "think," resulting in slower, more deliberative replies instead of rapid word‑by‑word generation.
Full Transcript
Source: https://www.youtube.com/watch?v=enLbj0igyx4
Duration: 00:08:26
You already know large language models, or LLMs. They predict the next token in a sequence, using statistical pattern matching to crank out human-like text. There are also LRMs, large reasoning models, and they go a bit further: they think before they talk. Now, give an LLM a prompt and it will reflexively predict whatever word statistically fits next. It will output a token, and then another token, and then another token. LRMs still do that too, but first they sketch out a plan. They weigh options and they double-check calculations in a sandbox before building their response. So before they start outputting tokens, they will plan, they will evaluate what comes back, and eventually that will lead to an answer. And
those extra steps, they can matter. Now, if your question is to write a fun social media post, well, then the LLM's reflex is usually fine. But if your question is "debug this gnarly stack trace," or perhaps "trace my cash flow through four different shell companies," well, reflex isn't enough. The LRM's internal chain of thought lets it test hypotheses, discard dead ends, and land on a reasoned answer, rather than just following a statistically likely pattern. Now, of course, this doesn't come for free. It costs inference time and GPU dollars: each extra pass through the network, each self-check, each search branch, it all adds latency and processing time. So LRMs buy you deeper reasoning at the cost of a longer, pricier think. So how do you build one of these thinking
machines? Well, an LRM usually builds upon an existing LLM that has undergone massive pre-training. This is the stage where we teach a model about the world: billions of web pages, books, code repos, and the like. That gives it language skills and a broad knowledge base. Then, after the pre-training, an LRM undergoes specialized, reasoning-focused tuning. We're now going to fine-tune the model specifically to provide reasoning capabilities. So the LRM is fed curated data sets of logic puzzles, multi-step math problems, and tricky coding tasks. And each one of these examples comes with a full chain-of-thought answer key, and the model learns to show its work. It basically starts with a problem it's been given. From that problem, its job is to come up with a plan for a solution. Once it's come up with a plan, it needs to execute that plan in multiple steps: step one, step two, and so forth. And then ultimately the model needs to arrive at a solution. It's learning to reason. Then we let the model loose on some fresh problems of its own, and that's where it goes through a process of reinforcement learning.
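The fine-tuning stage described above pairs each problem with a worked chain of thought. A minimal sketch of what one such training example might look like; the field names and the flattening format here are illustrative assumptions, not from any specific dataset:

```python
# Illustrative shape of one chain-of-thought fine-tuning example.
example = {
    "problem": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "plan": "Average speed is distance divided by time.",
    "steps": [
        "Step 1: distance = 120 km, time = 1.5 h.",
        "Step 2: speed = 120 / 1.5 = 80 km/h.",
    ],
    "answer": "80 km/h",
}

def to_training_text(ex):
    """Flatten the example into a single supervised target string,
    so the model learns to emit plan -> steps -> answer."""
    parts = [f"Problem: {ex['problem']}",
             f"Plan: {ex['plan']}",
             *ex["steps"],
             f"Answer: {ex['answer']}"]
    return "\n".join(parts)

target = to_training_text(example)  # the "show your work" answer key
```
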
Now, that uses a reward system. Either humans, through reinforcement learning from human feedback, or RLHF, give a thumbs up or thumbs down for each one of these steps as they're written, or the reinforcement signal comes from smaller judging models, process reward models, which judge each step of a reasoning chain as good or bad. And the reasoning model learns via this reinforcement learning to generate sequences of thoughts that maximize those thumbs-up rewards, ultimately improving its logical coherence. Now, there are some other
training methods that can be used as well. For example, we might choose to use something called
distillation to train the model further. And that's where we have a larger teacher model
that's used to generate reasoning traces. And then those reasoning steps are used to train a smaller
model, or a newer model, on those traces. So basically, if the advanced teacher model can solve a puzzle by thinking through a solution, that solution path can then be added to the training data of the new LRM. And the result of all this is a model that can plan, that can verify, and that can explain, ready to finally make sense of those shell company cash flows. So
the LRM is trained to think. And now the question is: how much thinking time do you give it at runtime? Well, that's a question about inference-time, or test-time, compute, as it's sometimes called. This is what happens every time you ask a question, and different questions can be assigned different amounts of thinking time. So "debug my stack trace" might get a good amount of compute allowance, while "write a fun caption" gets the budget version, where the model just does one quick pass. During extended inference time, a model may run multiple chains of thought and then vote on the best one. It might backtrack with a tree search if it hits a dead end, and it might call external tools like a calculator, a database, or a code sandbox for spot checks. And each extra pass through the model comes at a cost: more compute, and a longer wait for a response with higher latency. But hopefully this all does come also with an increase in accuracy, higher accuracy.
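One of the extended-inference tricks just mentioned, running multiple chains of thought and voting on the best one, is often called self-consistency sampling. A minimal sketch, where the model call is a random stand-in used purely for illustration:

```python
from collections import Counter
import random

def sample_chain(problem, rng):
    """Stand-in for sampling one full chain of thought from a model;
    a real call would return the reasoning plus a final answer."""
    return "80 km/h" if rng.random() < 0.7 else "75 km/h"

def self_consistency(problem, n_chains=9, seed=0):
    """Run several independent reasoning chains, then majority-vote
    on the final answers. More chains buy accuracy at the cost of
    more compute and latency."""
    rng = random.Random(seed)
    answers = [sample_chain(problem, rng) for _ in range(n_chains)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes

answer, votes = self_consistency("What is the train's average speed?")
```
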
So is that accuracy boost worth the cost to get it? Well, it depends on the problem you're trying to solve. On the positive side, an LRM offers complex reasoning: LRMs excel at tasks that require multi-step logic, planning, or abstract reasoning. They also offer improved decision making, because LRMs can internally verify and deliberate, which means that answers tend to be a bit more nuanced and, hopefully, more accurate. And LRMs usually require less in the way of prompt engineering. We don't need to sprinkle magic words into our prompting, like "let's think step by step," because the model already does it. That's less prompt hackery. But you might be better off with a regular LLM, or just a smaller model overall, in some situations, because, as I've mentioned, there is that higher computational cost. That means more VRAM, more energy, and a higher invoice from your cloud provider. And then there's also the increase in latency.
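The cost and latency downside can be made concrete with a toy back-of-envelope model: hidden "thinking" tokens are generated and billed like any other output tokens, so a long deliberation raises both the wait and the invoice. Every number below is a made-up assumption, not any provider's real pricing:

```python
def estimate_reply(visible_tokens, thinking_tokens,
                   price_per_1k_tokens=0.01, tokens_per_second=50):
    """Toy cost/latency model: total output is visible tokens plus
    hidden thinking tokens; cost and wait scale linearly with that total."""
    total = visible_tokens + thinking_tokens
    cost_usd = total / 1000 * price_per_1k_tokens
    latency_s = total / tokens_per_second
    return cost_usd, latency_s

reflex = estimate_reply(visible_tokens=200, thinking_tokens=0)      # quick one-pass reply
deliberate = estimate_reply(visible_tokens=200, thinking_tokens=4000)  # long hidden chain of thought
```

Under these toy numbers the deliberate reply costs about 21 times the reflexive one, and takes 21 times as long to arrive.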
Slower replies while the model stops to think. Although I'm endlessly amused by reading those replies, the model's thinking steps, as it works through building a response. But that's probably just me. So look, with LRMs, AI models are no longer just spewing language out at you as fast as they can predict the next word in a sentence; they're taking time to think through responses. And today the most intelligent models, the ones scoring highest on AI benchmarks, well, they tend to be the reasoning models, the LRMs.
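The overall contrast in this video, reflexive next-token generation versus plan-then-answer, can be sketched with a toy stand-in model. Every method on `ToyModel` is a hypothetical placeholder for what a real neural network would compute:

```python
class ToyModel:
    """Canned stand-in for a real model, just so the sketch runs."""

    def next_token(self, prompt, so_far):
        canned = ["the", "statistically", "likely", "words"]
        return canned[len(so_far) % len(canned)]

    def draft_plan(self, prompt):
        return {"approach": "first guess", "checked": False}

    def looks_sound(self, plan):
        return plan["checked"]

    def revise(self, plan):
        return {"approach": "verified approach", "checked": True}

    def answer_from_plan(self, prompt, plan):
        return f"answer via {plan['approach']}"

def reflexive_reply(model, prompt, max_tokens=4):
    """LLM-style reflex: emit the next likely token, one after another."""
    tokens = []
    for _ in range(max_tokens):
        tokens.append(model.next_token(prompt, tokens))
    return " ".join(tokens)

def reasoned_reply(model, prompt):
    """LRM-style: sketch a plan, discard unsound drafts, then answer."""
    plan = model.draft_plan(prompt)
    while not model.looks_sound(plan):
        plan = model.revise(plan)
    return model.answer_from_plan(prompt, plan)
```
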