Speculative Decoding: Speeding Up LLM Inference
Key Points
- Speculative decoding speeds up LLM inference by letting a small “draft” model predict several upcoming tokens while a larger target model simultaneously verifies them, often yielding 2‑4× the throughput of normal generation.
- In standard autoregressive generation, each model run produces a single token through a forward pass (producing a probability distribution) followed by a decoding step that selects one token to append to the context.
- The speculative decoding pipeline adds three stages: (1) the draft model generates k candidate tokens and their probabilities, (2) the target model checks those tokens in parallel by assuming they’re correct and runs a single forward pass, and (3) any mismatches trigger a fallback to the target model’s own predictions.
- This “writer‑and‑editor” workflow lets the fast draft model “draft” ahead while the more accurate large model “edits” in real time, achieving higher speed without sacrificing output quality.
Sections
- Speculative Decoding for Faster LLMs - The segment explains how a small “draft” model can guess upcoming tokens while a larger “target” model verifies them in parallel, enabling two‑to‑four tokens to be generated per inference step and dramatically speeding up LLM output without sacrificing quality.
- Parallel Draft-Target Verification and Rejection Sampling - The passage explains how draft token probabilities are verified against a larger target model in parallel, using the target model’s confidence scores to perform rejection sampling before any token is appended to the output.
- Token Speculation Improves Generation Speed - The passage details how a target model resamples rejected tokens, allowing multiple new tokens per forward pass and achieving 2–3× faster inference through token speculation and parallel verification.
- Speculation and Farewell - The speaker acknowledges uncertainty and casually bids goodbye, hinting at a future reunion on the other side.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=VkWlLSTdHs8](https://www.youtube.com/watch?v=VkWlLSTdHs8) **Duration:** 00:09:27

- [00:00:00](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=0s) Speculative Decoding for Faster LLMs
- [00:03:11](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=191s) Parallel Draft-Target Verification and Rejection Sampling
- [00:06:20](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=380s) Token Speculation Improves Generation Speed
- [00:09:23](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=563s) Speculation and Farewell
So you want your large language model to be fast.
Let me show you how.
Speculative decoding is an effective technique for speeding up LLM inference times without sacrificing quality of output.
This approach follows the slogan draft and verify by using a smaller draft model to speculate about future tokens while a larger target model verifies them in parallel.
This approach can generate two to four tokens
in the same amount of time it takes a normal LLM to generate just one.
Think of a writer and an editor.
Imagine that the writer is a much faster typist and can mimic the editor's style.
So to work smarter and not harder, the writer can draft a few words ahead while the editor double-checks the work and makes any changes where appropriate.
In the same way, with speculative decoding, a smaller model is given free rein to guess which words come next,
but stays grounded by a larger model that always verifies its output.
Now before I dive into the details, let's quickly review how normal text generation works since speculative decoding will build on top of it.
So basic vanilla LLM generation is an autoregressive process of two sequential steps, a forward pass and a decoding phase.
Let's take the input "the sky is..."
During the forward pass, this text is tokenized
and passed through the LLM layers, being transformed by the model weight parameters,
and eventually outputting a list of potential next tokens, say blue,
red, green, and so on,
along with a probability for each, forming the model's output distribution.
During the decoding phase
we select one single token.
This can be done by just selecting the token with the highest probability or by randomly sampling from a subset of the top probabilities.
Either way, once we've selected a token, we can append it to our input sequence and then pass it back through the LLM to get the next token.
As you can see with this approach,
one run of the model is able to generate just one token.
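As a rough illustration, the loop just described might be sketched as follows, with a hypothetical toy function standing in for a real LLM's forward pass:

```python
def toy_forward(tokens):
    """Stand-in for an LLM forward pass: returns a next-token
    probability distribution given the context so far."""
    if tokens and tokens[-1] == "is":
        return {"blue": 0.7, "red": 0.2, "green": 0.1}
    return {"<eos>": 1.0}  # toy model: end right after one word

def generate(prompt_tokens, max_new_tokens=5):
    """Vanilla autoregressive loop: one forward pass yields one token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = toy_forward(tokens)          # forward pass
        next_tok = max(dist, key=dist.get)  # greedy decoding step
        tokens.append(next_tok)
        if next_tok == "<eos>":
            break
    return tokens

print(generate(["the", "sky", "is"]))  # ['the', 'sky', 'is', 'blue', '<eos>']
```

Whether the decoding step takes the argmax (as here) or samples from the top probabilities, the structure is the same: each new token requires a full pass through the model.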
So now that we have a reference point, let's see how speculative decoding augments this process with three main steps.
Let's break it down.
First, during token speculation, a smaller draft model, say for example, three billion parameters, generates k draft tokens.
To help explain this, let's use a hypothetical example.
Let's say that our draft model has the input, "why did the chicken..."
Part of a very popular joke,
and say that we set k equal to four, meaning we want the model to make a prediction of what the next four tokens are instead of just one.
And say that the model speculates how the joke continues, guessing the four tokens cross, the, farm, and a question mark.
Now, in addition to each of these predictions, we also get the probability the draft model assigned to each one.
Let's call these DP for draft probability, and we can jot them down here.
As an example, say these are 0.7, 0.9, 0.8, and 0.8.
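As a sketch of this speculation step, a hypothetical toy draft "model" (a lookup table keyed on the most recent token, standing in for a real few-billion-parameter model, with the example's numbers baked in) can greedily propose k tokens and record the draft probability (DP) of each:

```python
# Hypothetical toy draft "model": maps the most recent token to a
# next-token distribution, using the walkthrough's example numbers.
TOY_DRAFT = {
    "chicken": {"cross": 0.7, "eat": 0.3},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.8, "road": 0.2},
    "farm":    {"?": 0.8, ".": 0.2},
}

def draft_step(prompt, k=4):
    """Autoregressively propose k tokens, recording each one's DP."""
    ctx, guesses, draft_probs = list(prompt), [], []
    for _ in range(k):
        dist = TOY_DRAFT[ctx[-1]]          # one cheap draft forward pass
        tok = max(dist, key=dist.get)      # greedy pick, for simplicity
        guesses.append(tok)
        draft_probs.append(dist[tok])
        ctx.append(tok)
    return guesses, draft_probs

guesses, dps = draft_step(["why", "did", "the", "chicken"])
print(guesses)  # ['cross', 'the', 'farm', '?']
print(dps)      # [0.7, 0.9, 0.8, 0.8]
```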
Next, during parallel verification, we concurrently check the draft model's output.
We do this by making an assumption that all the speculated tokens from the draft model are indeed correct.
And then pass this modified input into a larger target model,
say, for example, 70 billion parameters, in order to get a prediction of the next single token, as well as the target model's confidence in the draft model's guesses.
So suppose in our example that the target model guesses that the next word is "to", with a probability of 0.8.
And let's call these TP for target probability.
Now what's cool here is that in addition to getting the next token's probability, we also get the target model's confidence for all the previously speculated tokens.
So, say in our example that these confidence levels are 0.9, 0.9, 0.7, and 0.8.
So verification here simply means checking to see whether the speculated tokens are something the target model would have also produced given the same context.
And remember, at this point, we haven't actually chosen any specific tokens to append to our output.
All we've done is created a pool of candidates that might work.
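This verification step can be sketched with a hypothetical toy target "model" carrying the example's numbers. Note the hedge in the comment: a real implementation scores every drafted position in a single batched forward pass of the large model; the explicit loop here is only for clarity.

```python
# Hypothetical toy target "model" with the walkthrough's numbers,
# keyed on the most recent token.
TOY_TARGET = {
    "chicken": {"cross": 0.9, "eat": 0.1},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.7, "road": 0.3},
    "farm":    {"?": 0.8, ".": 0.2},
    "?":       {"to": 0.8, "so": 0.2},
}

def verify_in_parallel(prompt, guesses):
    """Return TP for each draft token, plus the target's distribution
    for the position after the last draft token. In a real system all
    of these come from one batched target forward pass, not a loop."""
    ctx, target_probs = list(prompt), []
    for g in guesses:
        dist = TOY_TARGET[ctx[-1]]
        target_probs.append(dist.get(g, 0.0))
        ctx.append(g)
    return target_probs, TOY_TARGET[ctx[-1]]

tps, next_dist = verify_in_parallel(["why", "did", "the", "chicken"],
                                    ["cross", "the", "farm", "?"])
print(tps)        # [0.9, 0.9, 0.7, 0.8]
print(next_dist)  # {'to': 0.8, 'so': 0.2}
```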
And this brings us to the final step, which is called rejection sampling,
where we go through each of our predictions one by one and choose to either accept or reject them by comparing these two sets of probabilities.
Let's use a very simple rule, although in application we'd use a more complex one.
But for simplicity, let's say that if the target probability is greater than or equal to the draft probability, then we can accept that token.
Otherwise, if the target probability is less than the draft probability,
then we have to reject that token.
So we can repeat this check for each token until we get to the first rejection,
at which point we discard any remaining guesses and then have the target model correct the output.
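The simple accept-or-reject rule just described fits in a few lines; the token names and probabilities below are the hypothetical ones from this example:

```python
def accept_tokens(guesses, draft_probs, target_probs):
    """Keep the prefix of draft tokens satisfying TP >= DP; stop at
    the first rejection, discarding that token and everything after."""
    accepted = []
    for tok, dp, tp in zip(guesses, draft_probs, target_probs):
        if tp < dp:
            break  # first rejection: the target model takes over here
        accepted.append(tok)
    return accepted

# "cross" and "the" pass, "farm" fails (0.7 < 0.8), so "farm" and
# everything after it ("?") are discarded.
print(accept_tokens(["cross", "the", "farm", "?"],
                    [0.7, 0.9, 0.8, 0.8],
                    [0.9, 0.9, 0.7, 0.8]))  # ['cross', 'the']
```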
So with all that being said, let's apply it to our example.
The first word on the chopping block is the word cross.
The probability from the target model is 0.9, while the draft model's is 0.7.
0.9 is greater than 0.7, so in this case we accept this token, meaning that because the target model is more confident than the draft model
that the word cross is correct, we're comfortable just accepting the draft model's guess and appending the word to our output.
Next, for the word the, both have the same probability, and according to our rule, we also accept.
So far, so good.
Next, with the word farm, the target model's probability of 0.7 is less than the draft model's of 0.8.
So in this case, we have to reject, along with everything that follows it, because their generation was dependent on what came before.
So now, to get back on track, what we can do is use the target model to re-sample the next best option from the underlying distribution at this rejected position.
So in our example, say the target model corrects the word farm to the word road.
And it's at this point that we have officially completed one round, and we can repeat this three-step process again and again until our joke, or our output, is complete.
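Putting the three steps together, one full round might be sketched as follows, again with hypothetical lookup-table "models" carrying the example's numbers. One caveat: real systems re-sample the correction from a renormalized residual distribution, not the argmax shortcut shown here.

```python
# Toy draft and target "models" (lookup tables keyed on the last token)
# with the walkthrough's hypothetical probabilities.
TOY_DRAFT = {
    "chicken": {"cross": 0.7, "eat": 0.3},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.8, "road": 0.2},
    "farm":    {"?": 0.8, ".": 0.2},
}
TOY_TARGET = {
    "chicken": {"cross": 0.9, "eat": 0.1},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.7, "road": 0.3},
    "farm":    {"?": 0.8, ".": 0.2},
    "?":       {"to": 0.8, "so": 0.2},
}

def speculative_round(prompt, k=4):
    """One round: speculate k tokens, verify, then accept/reject."""
    # Step 1: token speculation with the draft model.
    ctx, guesses, dps = list(prompt), [], []
    for _ in range(k):
        dist = TOY_DRAFT[ctx[-1]]
        tok = max(dist, key=dist.get)
        guesses.append(tok); dps.append(dist[tok]); ctx.append(tok)
    # Step 2: parallel verification (conceptually one target pass).
    ctx, tps, dists = list(prompt), [], []
    for g in guesses:
        dist = TOY_TARGET[ctx[-1]]
        tps.append(dist.get(g, 0.0)); dists.append(dist); ctx.append(g)
    # Step 3: rejection sampling with the simple TP >= DP rule.
    out = []
    for tok, dp, tp, dist in zip(guesses, dps, tps, dists):
        if tp >= dp:
            out.append(tok)
        else:
            # Correct the rejected position from the target's own
            # distribution, excluding the rejected token.
            fixed = {t: p for t, p in dist.items() if t != tok}
            out.append(max(fixed, key=fixed.get))
            return out
    # All k accepted: bonus token from the target's next distribution.
    out.append(max(TOY_TARGET[ctx[-1]], key=TOY_TARGET[ctx[-1]].get))
    return out

print(speculative_round(["why", "did", "the", "chicken"]))
# ['cross', 'the', 'road']
```

One target forward pass thus yields three tokens here, matching the walkthrough.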
So what just happened?
Well, with one single forward pass of the target model, we were able to generate three new tokens for the price and time of just one.
In the worst-case scenario, where we hypothetically happen to reject the very first token, we're still able to generate one token from the target model's correction.
In the best-case scenario, where we happen to accept every draft token and sample one more from the target's distribution, we can get up to k plus one new tokens per round.
On average, this can lead to two to three times faster inference speeds compared to normal LLM generation.
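That average can be sanity-checked with a back-of-the-envelope model. Assume, purely for illustration, that each draft token is accepted independently with probability alpha: a round always yields one token from the target (its bonus sample or its correction) plus the accepted prefix, giving the geometric sum below.

```python
# Expected tokens generated per target forward pass, assuming each of
# the k draft tokens is independently accepted with probability alpha:
#   E = 1 + alpha + alpha^2 + ... + alpha^k
def expected_tokens_per_round(alpha, k):
    return sum(alpha ** i for i in range(k + 1))

# With k = 4 draft tokens and a hypothetical 70-80% acceptance rate,
# each target pass yields roughly 2.8-3.4 tokens, in line with the
# 2-3x average speedup mentioned above.
print(round(expected_tokens_per_round(0.7, 4), 2))  # 2.77
print(round(expected_tokens_per_round(0.8, 4), 2))  # 3.36
```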
So while the actual speedups in this process are achieved by the token speculation and parallel verification steps,
it's the rejection sampling step that's arguably the most important,
because it ensures that we don't have to sacrifice any quality of output:
it recovers the target model's distribution even though the tokens are sampled from the draft model's output.
It's all about optimization.
Often a larger model can be overkill for trying to predict simple words and phrases that a smaller model can handle just fine.
So by running the two models concurrently and relying on the smaller model to do most of the heavy lifting, we're able to utilize GPU resources more efficiently.
So all in all, speculative decoding helps to reduce latency, decrease compute costs, use memory more efficiently,
and increase inference speeds, all while maintaining the same quality of output.
So in this new frontier of LLM optimization, researchers are continuing to make breakthrough improvements.
So if you're interested in learning more, you can check out the materials in the description below to see what IBM is doing in this space.
With that being said, I still left one question unanswered.
Why did the chicken cross the road?
I guess we can only speculate.
I'll see y'all on the other side.