Speculative Decoding: Speeding Up LLM Inference
Key Points
- Speculative decoding speeds up LLM inference by letting a small “draft” model predict several upcoming tokens while a larger target model simultaneously verifies them, often yielding 2‑4× the throughput of normal generation.
- In standard autoregressive generation, each model run produces a single token through a forward pass (producing a probability distribution) followed by a decoding step that selects one token to append to the context.
- The speculative decoding pipeline adds three stages: (1) the draft model generates k candidate tokens and their probabilities, (2) the target model checks those tokens in parallel by assuming they’re correct and runs a single forward pass, and (3) any mismatches trigger a fallback to the target model’s own predictions.
- This “writer‑and‑editor” workflow lets the fast draft model “draft” ahead while the more accurate large model “edits” in real time, achieving higher speed without sacrificing output quality.
Sections
- Speculative Decoding for Faster LLMs - The segment explains how a small “draft” model can guess upcoming tokens while a larger “target” model verifies them in parallel, enabling two‑to‑four tokens to be generated per inference step and dramatically speeding up LLM output without sacrificing quality.
- Parallel Draft-Target Verification and Rejection Sampling - The passage explains how draft token probabilities are verified against a larger target model in parallel, using the target model’s confidence scores to perform rejection sampling before any token is appended to the output.
- Token Speculation Improves Generation Speed - The passage details how a target model resamples rejected tokens, allowing multiple new tokens per forward pass and achieving 2–3× faster inference through token speculation and parallel verification.
- Speculation and Farewell - The speaker acknowledges uncertainty and casually bids goodbye, hinting at a future reunion on the other side.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=VkWlLSTdHs8](https://www.youtube.com/watch?v=VkWlLSTdHs8) **Duration:** 00:09:27

- [00:00:00](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=0s) Speculative Decoding for Faster LLMs
- [00:03:11](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=191s) Parallel Draft-Target Verification and Rejection Sampling
- [00:06:20](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=380s) Token Speculation Improves Generation Speed
- [00:09:23](https://www.youtube.com/watch?v=VkWlLSTdHs8&t=563s) Speculation and Farewell
So you want your large language model to be fast.
Let me show you how.
Speculative decoding is an effective technique for speeding up LLM inference times without sacrificing quality of output.
This approach follows the slogan draft and verify by using a smaller draft model to speculate about future tokens while a larger target model verifies them in parallel.
This approach can generate two to four tokens
in the same amount of time it takes a normal LLM to generate just one.
Think of a writer and an editor.
Imagine that the writer is a much faster typist and can mimic the editor's style.
So to work smarter and not harder, the writer can draft a few words ahead while the editor double-checks the work and makes any changes where appropriate.
In the same way, with speculative decoding, a smaller model is given free rein to guess which words come next,
but stays grounded by a larger model that always verifies its output.
Now before I dive into the details, let's quickly review how normal text generation works since speculative decoding will build on top of it.
So basic vanilla LLM generation is an autoregressive process of two sequential steps, a forward pass and a decoding phase.
Let's take the input "the sky is..."
During the forward pass, this text is tokenized
and passed through the LLM layers, being transformed by the model weight parameters,
and eventually outputting a list of potential next tokens, say blue,
red, green, and so on,
along with a probability for each, forming the model's output distribution.
During the decoding phase
we select one single token.
This can be done by just selecting the token with the highest probability or by randomly sampling from a subset of the top probabilities.
Either way, once we've selected a token, we can append it to our input sequence and then pass it back through the LLM to get the next token.
As you can see with this approach,
one run of the model is able to generate just one token.
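As a rough illustration, the loop just described might be sketched as follows, with a hypothetical toy function standing in for a real LLM's forward pass:

```python
def toy_forward(tokens):
    """Stand-in for an LLM forward pass: returns a next-token
    probability distribution given the context so far."""
    if tokens and tokens[-1] == "is":
        return {"blue": 0.7, "red": 0.2, "green": 0.1}
    return {"<eos>": 1.0}  # toy model: end right after one word

def generate(prompt_tokens, max_new_tokens=5):
    """Vanilla autoregressive loop: one forward pass yields one token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = toy_forward(tokens)          # forward pass
        next_tok = max(dist, key=dist.get)  # greedy decoding step
        tokens.append(next_tok)
        if next_tok == "<eos>":
            break
    return tokens

print(generate(["the", "sky", "is"]))  # ['the', 'sky', 'is', 'blue', '<eos>']
```

Whether the decoding step takes the argmax (as here) or samples from the top probabilities, the structure is the same: each new token requires a full pass through the model.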
So now that we have a reference point, let's see how speculative decoding augments this process with three main steps.
Let's break it down.
First, during token speculation, a smaller draft model, say for example, three billion parameters, generates k draft tokens.
To help explain this, let's use a hypothetical example.
Let's say that our draft model has the input, "why did the chicken..."
Part of a very popular joke,
and say that we set k equal to four, meaning we want the model to make a prediction of what the next four tokens are instead of just one.
And say that the model speculates how the joke continues, guessing the four tokens cross, the, farm, and a question mark.
Now, in addition to each of these predictions, we also get the probability the draft model assigned to each one.
Let's call these DP for draft probability, and we can jot them down here.
As an example, say these are 0.7, 0.9, 0.8, and 0.8.
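As a sketch of this speculation step, a hypothetical toy draft "model" (a lookup table keyed on the most recent token, standing in for a real few-billion-parameter model, with the example's numbers baked in) can greedily propose k tokens and record the draft probability (DP) of each:

```python
# Hypothetical toy draft "model": maps the most recent token to a
# next-token distribution, using the walkthrough's example numbers.
TOY_DRAFT = {
    "chicken": {"cross": 0.7, "eat": 0.3},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.8, "road": 0.2},
    "farm":    {"?": 0.8, ".": 0.2},
}

def draft_step(prompt, k=4):
    """Autoregressively propose k tokens, recording each one's DP."""
    ctx, guesses, draft_probs = list(prompt), [], []
    for _ in range(k):
        dist = TOY_DRAFT[ctx[-1]]          # one cheap draft forward pass
        tok = max(dist, key=dist.get)      # greedy pick, for simplicity
        guesses.append(tok)
        draft_probs.append(dist[tok])
        ctx.append(tok)
    return guesses, draft_probs

guesses, dps = draft_step(["why", "did", "the", "chicken"])
print(guesses)  # ['cross', 'the', 'farm', '?']
print(dps)      # [0.7, 0.9, 0.8, 0.8]
```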
Next, during parallel verification, we concurrently check the draft model's output.
We do this by making an assumption that all the speculated tokens from the draft model are indeed correct.
And then pass this modified input into a larger target model,
say, for example, 70 billion parameters, in order to get a prediction of the next single token, as well as the target model's confidence in the draft model's guesses.
So suppose in our example that the target model guesses that the next word is "to", with a probability of 0.8.
And let's call these TP for target probability.
Now what's cool here is that in addition to getting the next token's probability, we also get the target model's confidence for all the previously speculated tokens.
So, say in our example that these confidence levels are 0.9, 0.9, 0.7, and 0.8.
So verification here simply means checking to see whether the speculated tokens are something the target model would have also produced given the same context.
And remember, at this point, we haven't actually chosen any specific tokens to append to our output.
All we've done is created a pool of candidates that might work.
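This verification step can be sketched with a hypothetical toy target "model" carrying the example's numbers. Note the hedge in the comment: a real implementation scores every drafted position in a single batched forward pass of the large model; the explicit loop here is only for clarity.

```python
# Hypothetical toy target "model" with the walkthrough's numbers,
# keyed on the most recent token.
TOY_TARGET = {
    "chicken": {"cross": 0.9, "eat": 0.1},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.7, "road": 0.3},
    "farm":    {"?": 0.8, ".": 0.2},
    "?":       {"to": 0.8, "so": 0.2},
}

def verify_in_parallel(prompt, guesses):
    """Return TP for each draft token, plus the target's distribution
    for the position after the last draft token. In a real system all
    of these come from one batched target forward pass, not a loop."""
    ctx, target_probs = list(prompt), []
    for g in guesses:
        dist = TOY_TARGET[ctx[-1]]
        target_probs.append(dist.get(g, 0.0))
        ctx.append(g)
    return target_probs, TOY_TARGET[ctx[-1]]

tps, next_dist = verify_in_parallel(["why", "did", "the", "chicken"],
                                    ["cross", "the", "farm", "?"])
print(tps)        # [0.9, 0.9, 0.7, 0.8]
print(next_dist)  # {'to': 0.8, 'so': 0.2}
```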
And this brings us to the final step, which is called rejection sampling,
where we go through each of our predictions one by one and choose to either accept or reject them by comparing these two sets of probabilities.
Let's use a very simple rule, although in application we'd use a more complex one.
But for simplicity, let's say that if the target probability is greater than or equal to the draft probability, then we can accept that token.
Otherwise, if the target probability is less than the draft probability,
then we have to reject that token.
So we can repeat this check for each token until we get to the first rejection,
at which point we discard any remaining guesses and then have the target model correct the output.
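The simple accept-or-reject rule just described fits in a few lines; the token names and probabilities below are the hypothetical ones from this example:

```python
def accept_tokens(guesses, draft_probs, target_probs):
    """Keep the prefix of draft tokens satisfying TP >= DP; stop at
    the first rejection, discarding that token and everything after."""
    accepted = []
    for tok, dp, tp in zip(guesses, draft_probs, target_probs):
        if tp < dp:
            break  # first rejection: the target model takes over here
        accepted.append(tok)
    return accepted

# "cross" and "the" pass, "farm" fails (0.7 < 0.8), so "farm" and
# everything after it ("?") are discarded.
print(accept_tokens(["cross", "the", "farm", "?"],
                    [0.7, 0.9, 0.8, 0.8],
                    [0.9, 0.9, 0.7, 0.8]))  # ['cross', 'the']
```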
So with all that being said, let's apply it to our example.
The first word on the chopping block is the word cross.
The probability from the target model is 0.9, while the draft model's is 0.7.
0.9 is greater than 0.7, so in this case we accept this token, meaning that because the target model is more confident than the draft model
that the word cross is correct, we're comfortable just accepting the draft model's guess and appending the word to our output.
Next, for the word the, both have the same probability, and according to our rule, we also accept.
So far, so good.
Next, with the word farm, the target model's probability of 0.7 is less than the draft model's of 0.8.
So in this case, we have to reject, along with everything that follows it, because their generation was dependent on what came before.
So now, to get back on track, what we can do is use the target model to re-sample the next best option from the underlying distribution at this rejected position.
So in our example, say the target model corrects the word farm to the word road.
And it's at this point that we have officially completed one round, and we can repeat this three-step process again and again until our joke, or our output, is complete.
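Putting the three steps together, one full round might be sketched as follows, again with hypothetical lookup-table "models" carrying the example's numbers. One caveat: real systems re-sample the correction from a renormalized residual distribution, not the argmax shortcut shown here.

```python
# Toy draft and target "models" (lookup tables keyed on the last token)
# with the walkthrough's hypothetical probabilities.
TOY_DRAFT = {
    "chicken": {"cross": 0.7, "eat": 0.3},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.8, "road": 0.2},
    "farm":    {"?": 0.8, ".": 0.2},
}
TOY_TARGET = {
    "chicken": {"cross": 0.9, "eat": 0.1},
    "cross":   {"the": 0.9, "a": 0.1},
    "the":     {"farm": 0.7, "road": 0.3},
    "farm":    {"?": 0.8, ".": 0.2},
    "?":       {"to": 0.8, "so": 0.2},
}

def speculative_round(prompt, k=4):
    """One round: speculate k tokens, verify, then accept/reject."""
    # Step 1: token speculation with the draft model.
    ctx, guesses, dps = list(prompt), [], []
    for _ in range(k):
        dist = TOY_DRAFT[ctx[-1]]
        tok = max(dist, key=dist.get)
        guesses.append(tok); dps.append(dist[tok]); ctx.append(tok)
    # Step 2: parallel verification (conceptually one target pass).
    ctx, tps, dists = list(prompt), [], []
    for g in guesses:
        dist = TOY_TARGET[ctx[-1]]
        tps.append(dist.get(g, 0.0)); dists.append(dist); ctx.append(g)
    # Step 3: rejection sampling with the simple TP >= DP rule.
    out = []
    for tok, dp, tp, dist in zip(guesses, dps, tps, dists):
        if tp >= dp:
            out.append(tok)
        else:
            # Correct the rejected position from the target's own
            # distribution, excluding the rejected token.
            fixed = {t: p for t, p in dist.items() if t != tok}
            out.append(max(fixed, key=fixed.get))
            return out
    # All k accepted: bonus token from the target's next distribution.
    out.append(max(TOY_TARGET[ctx[-1]], key=TOY_TARGET[ctx[-1]].get))
    return out

print(speculative_round(["why", "did", "the", "chicken"]))
# ['cross', 'the', 'road']
```

One target forward pass thus yields three tokens here, matching the walkthrough.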
So what just happened?
Well, with one single forward pass of the target model, we were able to generate three new tokens for the price and time of just one.
In the worst-case scenario, where we hypothetically happen to reject the very first token, we're still able to generate one token from the target model's correction.
In the best-case scenario, where we happen to accept every draft token and sample one more from the target's distribution, we can get up to k plus one new tokens per round.
On average, this can lead to two to three times faster inference speeds compared to normal LLM generation.
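That average can be sanity-checked with a back-of-the-envelope model. Assume, purely for illustration, that each draft token is accepted independently with probability alpha: a round always yields one token from the target (its bonus sample or its correction) plus the accepted prefix, giving the geometric sum below.

```python
# Expected tokens generated per target forward pass, assuming each of
# the k draft tokens is independently accepted with probability alpha:
#   E = 1 + alpha + alpha^2 + ... + alpha^k
def expected_tokens_per_round(alpha, k):
    return sum(alpha ** i for i in range(k + 1))

# With k = 4 draft tokens and a hypothetical 70-80% acceptance rate,
# each target pass yields roughly 2.8-3.4 tokens, in line with the
# 2-3x average speedup mentioned above.
print(round(expected_tokens_per_round(0.7, 4), 2))  # 2.77
print(round(expected_tokens_per_round(0.8, 4), 2))  # 3.36
```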
So while the actual speedups in this process are achieved by the token speculation and parallel verification steps,
it's the rejection sampling step that's arguably the most important,
because it ensures that we don't have to sacrifice any quality of output:
it recovers the target model's distribution even though the tokens are sampled from the draft model's output.
It's all about optimization.
Often a larger model can be overkill for trying to predict simple words and phrases that a smaller model can handle just fine.
So by running the two models concurrently and relying on the smaller model to do most of the heavy lifting, we're able to utilize GPU resources more efficiently.
So all in all, speculative decoding helps to reduce latency, decrease compute costs, use memory more efficiently,
and increase inference speeds, all while maintaining the same quality of output.
So in this new frontier of LLM optimization, researchers are continuing to make breakthrough improvements.
So if you're interested in learning more, you can check out the materials in the description below to see what IBM is doing in this space.
With that being said, I still left one question unanswered.
Why did the chicken cross the road?
I guess we can only speculate.
I'll see y'all on the other side.