
Speculative Decoding: Speeding Up LLM Inference

Key Points

  • Speculative decoding speeds up LLM inference by letting a small “draft” model predict several upcoming tokens while a larger target model simultaneously verifies them, often yielding 2‑4× the throughput of normal generation.
  • In standard autoregressive generation, each model run produces a single token through a forward pass (producing a probability distribution) followed by a decoding step that selects one token to append to the context.
  • The speculative decoding pipeline adds three stages: (1) the draft model generates k candidate tokens and their probabilities, (2) the target model checks those tokens in parallel by assuming they’re correct and runs a single forward pass, and (3) any mismatches trigger a fallback to the target model’s own predictions.
  • This “writer‑and‑editor” workflow lets the fast draft model “draft” ahead while the more accurate large model “edits” in real time, achieving higher speed without sacrificing output quality.
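The three-stage pipeline above can be sketched as a toy loop. This is a minimal illustration, not a real API: `draft_next` and `target_scores` are hypothetical stand-ins for the draft and target models.

```python
# Toy sketch of one draft-and-verify round, following the three stages in
# the key points above. `draft_next` and `target_scores` are hypothetical
# stand-ins for real draft/target models, not any actual library API.

def speculative_round(context, draft_next, target_scores, k=4):
    """Return the draft tokens accepted in one speculate-and-verify round."""
    # Stage 1 -- token speculation: the small draft model proposes k tokens,
    # each with its draft probability (DP).
    drafted = []
    for _ in range(k):
        token, dp = draft_next(context + [t for t, _ in drafted])
        drafted.append((token, dp))

    # Stage 2 -- parallel verification: a single target forward pass scores
    # every drafted position at once (simulated here by one call).
    tps = target_scores(context, [t for t, _ in drafted])

    # Stage 3 -- accept drafted tokens until the first mismatch; on a
    # rejection, the target model's own prediction would replace the
    # rejected token and everything after it is discarded.
    accepted = []
    for (token, dp), tp in zip(drafted, tps):
        if tp >= dp:
            accepted.append(token)
        else:
            break
    return accepted
```

In a full implementation the loop would also append the target model's correction (or bonus next token) before starting the next round; the transcript below walks through each stage with a concrete example.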

Full Transcript

**Source:** [https://www.youtube.com/watch?v=VkWlLSTdHs8](https://www.youtube.com/watch?v=VkWlLSTdHs8)
**Duration:** 00:09:27

## Sections

- [00:00:00](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=0s) **Speculative Decoding for Faster LLMs** - The segment explains how a small “draft” model can guess upcoming tokens while a larger “target” model verifies them in parallel, enabling two to four tokens to be generated per inference step and dramatically speeding up LLM output without sacrificing quality.
- [00:03:11](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=191s) **Parallel Draft-Target Verification and Rejection Sampling** - The passage explains how draft token probabilities are verified against a larger target model in parallel, using the target model’s confidence scores to perform rejection sampling before any token is appended to the output.
- [00:06:20](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=380s) **Token Speculation Improves Generation Speed** - The passage details how a target model resamples rejected tokens, allowing multiple new tokens per forward pass and achieving 2–3× faster inference through token speculation and parallel verification.
- [00:09:23](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=563s) **Speculation and Farewell** - The speaker acknowledges uncertainty and casually bids goodbye, hinting at a future reunion on the other side.

## Full Transcript
0:00 So you want your large language model to be fast. 0:02 Let me show you how. 0:04 Speculative decoding is an effective technique for speeding up LLM inference times without sacrificing quality of output. 0:11 This approach follows the slogan "draft and verify" by using a smaller draft model to speculate about future tokens while a larger target model verifies them in parallel. 0:23 This approach can generate two to four tokens 0:26 in the same amount of time it would take a normal LLM to generate just one.

0:32 Think of a writer and an editor. 0:34 Imagine that the editor is a much faster typist and can mimic the writer's style. 0:39 So to work smarter and not harder, the editor can draft a few words ahead while the writer double-checks the work and makes any changes where appropriate. 0:49 In the same way, with speculative decoding, a smaller model is given free rein to guess what words come next, 0:57 but stays grounded in a larger model that always verifies its output.

1:01 Now, before I dive into the details, let's quickly review how normal text generation works, since speculative decoding builds on top of it. 1:10 Basic vanilla LLM generation is an autoregressive process of two sequential steps: a forward pass and a decoding phase. 1:20 Let's take the input "the sky is..." 1:25 During the forward pass, this text is tokenized 1:29 and passed through the LLM layers, being transformed by the model's weight parameters, 1:34 and eventually outputting a list of potential tokens, say blue, 1:40 red, green, etc., 1:44 along with their probability distribution. 1:51 During the decoding phase, 1:52 we select one single token. 1:55 This can be done by just selecting the token with the highest probability or by randomly sampling from a subset of the top probabilities. 2:03 Either way, once we've selected a token, we can append it to our input sequence and then pass it back through the LLM to get the next token.
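The vanilla forward-pass-then-decode loop just described can be sketched as follows, with greedy (highest-probability) decoding for simplicity. The `forward` function here is a hypothetical stand-in for a real model, not an actual API.

```python
import math

# Minimal sketch of vanilla autoregressive generation: each step is one
# forward pass (producing a probability distribution over the vocabulary)
# followed by a decoding step that appends exactly one token.
# `forward` is a hypothetical stand-in for a real model.

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_generate(forward, context, steps):
    out = list(context)
    for _ in range(steps):
        vocab, logits = forward(out)                 # forward pass
        probs = softmax(logits)
        out.append(vocab[probs.index(max(probs))])   # greedy decoding step
    return out
```

Random sampling from the top probabilities, as mentioned in the transcript, would replace the argmax with a draw from `probs`; either way, each model run yields exactly one new token.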
2:12 As you can see with this approach, 2:14 one run of the model is able to generate just one token.

2:21 So now that we have a reference point, let's see how speculative decoding augments this process with three main steps. 2:27 Let's break it down. 2:28 First, during token speculation, a smaller draft model, say for example three billion parameters, generates k draft tokens. 2:39 To help explain this, let's use a hypothetical example. 2:42 Let's say that our draft model has the input "why did the chicken..." 2:49 part of a very popular joke, 2:51 and say that we set k equal to four, meaning we want the model to make a prediction of what the next four tokens are instead of just one. 3:00 And say that the model speculates: "why did the chicken," and then it guesses "cross the farm," question mark. 3:12 Now, in addition to each of these predictions, we also get their probabilities and distributions. 3:17 Let's call these DP, for draft probability, and we can jot them down here. 3:22 As an example, say these are 0.7, 0.9, 0.8, and 0.8.

3:30 Next, during parallel verification, we concurrently check the draft model's output. 3:35 We do this by making an assumption that all the speculated tokens from the draft model are indeed correct, 3:41 and then pass this modified input into a larger target model, 3:48 say for example 70 billion parameters, in order to get a prediction of what the next single token is, as well as the target model's confidence for the draft model's guesses. 4:00 So suppose in our example that the target model guesses that the next word is "to," with a probability of 0.8. 4:11 And let's call these TP, for target probability. 4:15 Now, what's cool here is that in addition to getting the next token's probability, we also get the target model's confidence for all the previously speculated tokens. 4:23 So say in our example that these confidence levels are 0.9, 0.9, 0.7, and 0.8.
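The key mechanical trick in parallel verification is that a single target forward pass over the context plus all k drafted tokens yields a next-token distribution at every position, so every draft guess is scored at once. A sketch using the example numbers above (the `target_forward` function is a hypothetical stand-in, and the 0.8 confidence for "?" follows the transcript's list rather than anything verified):

```python
# One target forward pass over (context + k draft tokens) returns one
# next-token distribution per input position, so all draft guesses are
# verified at once. `target_forward` is a hypothetical stand-in.

def verify_in_parallel(target_forward, context, draft_tokens):
    dists = target_forward(context + draft_tokens)   # single forward pass
    # The distribution at position i predicts the token at position i + 1,
    # so each draft token is scored by the distribution just before it.
    offset = len(context) - 1
    tps = [dists[offset + i].get(tok, 0.0)
           for i, tok in enumerate(draft_tokens)]
    next_token_dist = dists[-1]   # the target's proposal after the last draft token
    return tps, next_token_dist
```

With `context = ["why", "did", "the", "chicken"]` and `draft_tokens = ["cross", "the", "farm", "?"]`, the returned `tps` are the TP values from the example, plus a distribution for the target's own next-token guess ("to").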
4:37 So verification here simply means checking to see whether the speculated tokens are something the target model would have also produced given the same context. 4:46 And remember, at this point we haven't actually chosen any specific tokens to append to our output. 4:53 All we've done is create a pool of candidates that might work.

4:56 And this brings us to the final step, which is called rejection sampling, 5:01 where we go through each of our predictions one by one and choose to either accept or reject them by comparing these two sets of probabilities. 5:10 Let's use a very simple rule, although in application we'd use a more complex one. 5:14 For simplicity, let's say that if the target probability is greater than or equal to the draft probability, then we can accept that token. 5:26 Otherwise, if the target probability is less than the draft probability, 5:33 then we have to reject that token. 5:36 We can repeat this check for each token until we get to the first rejection, 5:41 at which point we discard any remaining guesses and then have the target model correct the output.

5:48 So with all that being said, let's apply it to our example. 5:51 The first word on the chopping block is the word "cross." 5:54 The probability from the target model is 0.9 while the draft model's is 0.7. 6:00 0.9 is greater than 0.7, so in this case we accept this token: because the target model is more confident than the draft model 6:12 that the word "cross" is correct, we're comfortable with just accepting the draft model's guess and appending the word to our output. 6:22 Next, for the word "the," both have the same probability, and according to our rule, we also accept. 6:29 So far, so good. 6:31 Next, with the word "farm," the target model's probability of 0.7 is less than the draft model's of 0.8. 6:38 So in this case, we have to reject it, along with everything that follows it, because their generation was dependent on what came before.
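The accept/reject walk above, written out with the video's simple threshold rule. (Production speculative sampling typically accepts a draft token with probability min(1, TP/DP) and resamples from an adjusted distribution on rejection; the plain threshold here just mirrors the simplified rule in the video.)

```python
# Rejection sampling with the video's simplified rule: accept a draft
# token if the target probability (TP) is at least the draft probability
# (DP); on the first rejection, discard that token and everything after it.

draft = [("cross", 0.7), ("the", 0.9), ("farm", 0.8), ("?", 0.8)]  # (token, DP)
target_conf = [0.9, 0.9, 0.7, 0.8]  # TP for the same tokens, per the example

accepted = []
for (token, dp), tp in zip(draft, target_conf):
    if tp >= dp:        # accept: the target agrees at least as strongly
        accepted.append(token)
    else:               # first rejection: drop this and all later guesses
        break

print(accepted)   # ['cross', 'the']
```

"cross" and "the" pass the check, "farm" fails (0.7 < 0.8), and "?" is discarded along with it, exactly as in the walkthrough.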
6:47 So now, to get back on track, what we can do is use the target model to re-sample the next best option from the underlying distribution of this rejected token. 6:59 So in our example, say that the target model corrects the word "farm" to the word "road." 7:07 And it's at this point that we have officially completed one round, and we can repeat this three-step process again and again until our joke, or our output, is complete.

7:18 So what just happened? 7:19 Well, with one single forward pass of the target model, we were able to generate three new tokens for the price and time of just one. 7:27 In the worst-case scenario, where we hypothetically happened to reject the very first token, we're still able to generate one token from the target model's correction. 7:39 In the best-case scenario, where we just so happened to accept every draft token and sample one more from the target's, we can get up to k plus one new tokens per round. 7:53 On average, this can lead to two to three times faster inference speeds compared to normal LLM generation.

8:01 So while the actual speedups in this process are achieved by the token speculation and parallel verification steps, 8:10 it's the rejection sampling step that's arguably the most important, 8:14 because it ensures that we don't have to sacrifice any quality of output, by recovering the distribution of the target model 8:23 while sampling from the draft model's output. 8:26 It's all about optimization. 8:28 Often a larger model can be overkill for predicting simple words and phrases that a smaller model can handle just fine. 8:36 So by running the two models concurrently and relying on the smaller model to do most of the heavy lifting, we're able to utilize GPU resources more efficiently. 8:47 So all in all, speculative decoding helps to reduce latency, decrease compute costs, boost efficient memory usage, and 8:56 increase inference speeds, all while maintaining the same quality of output.
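The worst case (1 token), best case (k + 1 tokens), and the "two to three times" average can be reconciled with a back-of-the-envelope model: under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens emitted per target forward pass has a closed form.

```python
# Expected tokens emitted per target forward pass, assuming (as a
# simplification) each draft token is independently accepted with
# probability alpha. Every round emits the accepted prefix plus one
# token from the target (a correction, or a bonus token if all pass),
# so the result ranges from 1 (alpha = 0) to k + 1 (alpha = 1).

def expected_tokens_per_round(alpha, k):
    if alpha == 1.0:
        return float(k + 1)
    # 1 target token + sum over i of P(first i draft tokens accepted)
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

For example, with k = 4 drafted tokens and an 80% acceptance rate, each target pass yields about 3.36 tokens instead of 1, which is consistent with the 2-3x average speedup quoted in the video.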
9:02 So in this new frontier of LLM optimization, researchers are continuing to make breakthrough improvements. 9:09 So if you're interested in learning more, you can check out the materials in the description below to see what IBM is doing in this space. 9:18 With that being said, I still left one question unanswered. 9:22 Why did the chicken cross the road? 9:24 I guess we can only speculate. 9:26 I'll see y'all on the other side.