Large Reasoning Models Explained
Key Points
- Large Language Models (LLMs) generate text by statistically predicting the next token, while Large Reasoning Models (LRMs) first plan and evaluate before token generation, enabling deeper reasoning.
- LRMs use an internal “chain of thought” to sketch plans, test hypotheses, and discard dead ends, which is crucial for complex tasks like debugging code or tracing financial flows.
- This extra reasoning incurs higher inference latency and GPU costs, making LRMs slower and more expensive than reflexive LLM outputs.
- Building an LRM involves starting with a massive pre‑trained LLM and then fine‑tuning it on curated datasets of logic puzzles, multi‑step math, and coding tasks that include full chain‑of‑thought answer keys.
- After fine‑tuning, LRMs are further refined via reinforcement learning (often RLHF), where human or model feedback rewards coherent, step‑by‑step reasoning.
Sections
- LLMs vs LRMs: Reasoning Tradeoffs (00:00:00) - Contrasts standard language models, which generate text via token‑by‑token prediction, with large reasoning models, which first plan and reason through a chain of thought, highlighting the improved answer quality but higher latency and cost.
- Reward‑Guided Reasoning and Distillation (00:03:47) - Outlines how RLHF and process reward models evaluate each reasoning step, while teacher‑student distillation supplies example thought traces, enabling a language model to learn to plan, verify, and explain its answers, with a note on runtime thinking budget.
- Models Thinking Before Responding (00:07:46) - The speaker observes that cutting‑edge large language models, particularly high‑scoring reasoning models, deliberately pause to "think," resulting in slower, more deliberative replies instead of rapid word‑by‑word generation.
Full Transcript
Source: https://www.youtube.com/watch?v=enLbj0igyx4
Duration: 00:08:26
You already know large language models, or LLMs. They predict the next token in a sequence, using statistical pattern matching to crank out human-like text. There are also LRMs, large reasoning models, and they go a bit further: they think before they talk. Now, give an LLM a prompt and it will reflexively predict whatever word statistically fits next. It will output a token, and then another token, and then another token. LRMs still do that too, but first they sketch out a plan. They weigh options and they double-check calculations in a sandbox before building their response. So before they start outputting tokens, they will plan, they will evaluate what comes back, and eventually that will lead to an answer. And
those extra steps, they can matter. Now, if your question is to write a fun social media post, well, then the LLM's reflex is usually fine. But if your question is "debug this gnarly stack trace," or perhaps "trace my cash flow through four different shell companies," well, reflex isn't enough. The LRM's internal chain of thought lets it test hypotheses, discard dead ends, and land on a reasoned answer, rather than just following a statistically likely pattern. Now, of course, this doesn't come for free. It costs inference time and GPU dollars: each extra pass through the network, each self-check, each search branch, it all adds latency and processing time. So LRMs buy you deeper reasoning at the cost of a longer, pricier think. So how do you build one of these thinking
machines? Well, an LRM usually builds upon an existing LLM that has undergone massive pre-training. This is the stage where we teach a model about the world: billions of web pages, books, code repos, and the like. That gives it language skills and a broad knowledge base. Then, after the pre-training, an LRM undergoes specialized, reasoning-focused tuning. We're now going to fine-tune the model specifically to provide reasoning capabilities. So the LRM is fed curated data sets of logic puzzles, multi-step math problems, and tricky coding tasks. And each one of these examples comes with a full chain-of-thought answer key, and the model learns to show its work. It basically starts with a problem it's been given. From that problem, its job is to come up with a plan for a solution. Once it's come up with a plan, it needs to execute that plan in multiple steps: step one, step two, and so forth. And then ultimately the model needs to arrive at a solution. It's learning to reason. Then we let the model loose on some fresh problems of its own, and that's where it goes through a process of reinforcement learning.
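The fine-tuning stage described above pairs each problem with a worked chain of thought. A minimal sketch of what one such training example might look like; the field names and the flattening format here are illustrative assumptions, not from any specific dataset:

```python
# Illustrative shape of one chain-of-thought fine-tuning example.
example = {
    "problem": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "plan": "Average speed is distance divided by time.",
    "steps": [
        "Step 1: distance = 120 km, time = 1.5 h.",
        "Step 2: speed = 120 / 1.5 = 80 km/h.",
    ],
    "answer": "80 km/h",
}

def to_training_text(ex):
    """Flatten the example into a single supervised target string,
    so the model learns to emit plan -> steps -> answer."""
    parts = [f"Problem: {ex['problem']}",
             f"Plan: {ex['plan']}",
             *ex["steps"],
             f"Answer: {ex['answer']}"]
    return "\n".join(parts)

target = to_training_text(example)  # the "show your work" answer key
```
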
Now, that uses a reward system. Either humans, through reinforcement learning from human feedback, or RLHF, give a thumbs up or thumbs down for each one of these steps as they're written, or the reinforcement signal comes from smaller judging models, process reward models, which judge each step of a reasoning chain as good or bad. And the reasoning model learns via this reinforcement learning to generate sequences of thoughts that maximize those thumbs-up rewards, ultimately improving its logical coherence. Now, there are some other
training methods that can be used as well. For example, we might choose to use something called
distillation to train the model further. And that's where we have a larger teacher model
that's used to generate reasoning traces. And then those reasoning steps are used to train a smaller
model, or a newer model, on those traces. So basically, if the advanced teacher model can solve a puzzle by thinking through a solution, that solution path can then be added to the training data of the new LRM. And the result of all this is a model that can plan, that can verify, and that can explain, ready to finally make sense of those shell company cash flows. So
the LRM is trained to think. And now the question is: how much thinking time do you give it at runtime? Well, that's a question about inference-time, or test-time, compute, as it's sometimes called. This is what happens every time you ask a question, and different questions can be assigned different amounts of thinking time. So "debug my stack trace" might get a good amount of compute allowance, while "write a fun caption" gets the budget version, where the model just does one quick pass. During extended inference time, a model may run multiple chains of thought and then vote on the best one. It might backtrack with a tree search if it hits a dead end, and it might call external tools like a calculator, a database, or a code sandbox for spot checks. And each extra pass through the model comes at a cost: more compute, and a longer wait for a response with higher latency. But hopefully this all does come also with an increase in accuracy, higher accuracy.
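One of the extended-inference tricks just mentioned, running multiple chains of thought and voting on the best one, is often called self-consistency sampling. A minimal sketch, where the model call is a random stand-in used purely for illustration:

```python
from collections import Counter
import random

def sample_chain(problem, rng):
    """Stand-in for sampling one full chain of thought from a model;
    a real call would return the reasoning plus a final answer."""
    return "80 km/h" if rng.random() < 0.7 else "75 km/h"

def self_consistency(problem, n_chains=9, seed=0):
    """Run several independent reasoning chains, then majority-vote
    on the final answers. More chains buy accuracy at the cost of
    more compute and latency."""
    rng = random.Random(seed)
    answers = [sample_chain(problem, rng) for _ in range(n_chains)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes

answer, votes = self_consistency("What is the train's average speed?")
```
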
So is that accuracy boost worth the cost to get it? Well, it depends on the problem you're trying to solve. On the positive side, an LRM offers complex reasoning: LRMs excel at tasks that require multi-step logic, planning, or abstract reasoning. They also offer improved decision making, because LRMs can internally verify and deliberate, which means that answers tend to be a bit more nuanced and, hopefully, more accurate. And LRMs usually require less in the way of prompt engineering. We don't need to sprinkle magic words into our prompting, like "let's think step by step," because the model already does it. That's less prompt hackery. But you might be better off with a regular LLM, or just a smaller model overall, in some situations, because, as I've mentioned, there is that higher computational cost. That means more VRAM, more energy, and a higher invoice from your cloud provider. And then there's also the increase in latency.
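The cost and latency downside can be made concrete with a toy back-of-envelope model: hidden "thinking" tokens are generated and billed like any other output tokens, so a long deliberation raises both the wait and the invoice. Every number below is a made-up assumption, not any provider's real pricing:

```python
def estimate_reply(visible_tokens, thinking_tokens,
                   price_per_1k_tokens=0.01, tokens_per_second=50):
    """Toy cost/latency model: total output is visible tokens plus
    hidden thinking tokens; cost and wait scale linearly with that total."""
    total = visible_tokens + thinking_tokens
    cost_usd = total / 1000 * price_per_1k_tokens
    latency_s = total / tokens_per_second
    return cost_usd, latency_s

reflex = estimate_reply(visible_tokens=200, thinking_tokens=0)      # quick one-pass reply
deliberate = estimate_reply(visible_tokens=200, thinking_tokens=4000)  # long hidden chain of thought
```

Under these toy numbers the deliberate reply costs about 21 times the reflexive one, and takes 21 times as long to arrive.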
Slower replies while the model stops to think. Although I'm endlessly amused by reading those replies, the model's thinking steps, as it works through building a response. But that's probably just me. So look, with LRMs, AI models are no longer just spewing language out at you as fast as they can predict the next word in a sentence; they're taking time to think through responses. And today the most intelligent models, the ones scoring highest on AI benchmarks, well, they tend to be the reasoning models, the LRMs.
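The overall contrast in this video, reflexive next-token generation versus plan-then-answer, can be sketched with a toy stand-in model. Every method on `ToyModel` is a hypothetical placeholder for what a real neural network would compute:

```python
class ToyModel:
    """Canned stand-in for a real model, just so the sketch runs."""

    def next_token(self, prompt, so_far):
        canned = ["the", "statistically", "likely", "words"]
        return canned[len(so_far) % len(canned)]

    def draft_plan(self, prompt):
        return {"approach": "first guess", "checked": False}

    def looks_sound(self, plan):
        return plan["checked"]

    def revise(self, plan):
        return {"approach": "verified approach", "checked": True}

    def answer_from_plan(self, prompt, plan):
        return f"answer via {plan['approach']}"

def reflexive_reply(model, prompt, max_tokens=4):
    """LLM-style reflex: emit the next likely token, one after another."""
    tokens = []
    for _ in range(max_tokens):
        tokens.append(model.next_token(prompt, tokens))
    return " ".join(tokens)

def reasoned_reply(model, prompt):
    """LRM-style: sketch a plan, discard unsound drafts, then answer."""
    plan = model.draft_plan(prompt)
    while not model.looks_sound(plan):
        plan = model.revise(plan)
    return model.answer_from_plan(prompt, plan)
```
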