Imagine if your AI could write the next paragraph in the time it currently takes to write the next single word. That is the dream of real-time interaction, but as of April twenty-six, inference latency is still the biggest wall we hit when deploying these massive models. Today's prompt from Daniel is about speculative decoding, which is essentially the clever hack that makes that dream a reality. He wrote to us saying that speculative decoding is the inference trick making Large Language Models feel snappy. The core idea is a small draft model proposes several tokens ahead, then the big model verifies them in a single forward pass, and you accept the run of correct guesses. This gives a two to three times speedup with zero quality loss. He mentions variants like vanilla speculative decoding, Medusa, EAGLE, and lookahead decoding, noting that this works on any autoregressive decoder like Transformers or Mamba.
Herman Poppleberry here, and I am legitimately thrilled Daniel brought this up. It is the most elegant "cheat code" in computer science right now because it exploits a fundamental hardware reality: GPUs are incredibly fast at math but relatively slow at moving data from memory. By the way, today's episode is powered by Google Gemini three Flash, which is fitting because we are talking about the bleeding edge of how these models actually reach our screens.
It is funny you call it a cheat code. Usually, in engineering, when you want something to go three times faster, you have to sacrifice accuracy or use a smaller, dumber model. But Daniel mentioned "zero quality loss." How do you get a massive speed boost for free without the model becoming less coherent?
That is the magic of the "draft-and-verify" framework. Think of the draft model as a fast-talking gambler making a string of bets about what the next words will be. If the big model, the "judge," agrees with those bets, it validates them all in one go. If the judge disagrees at any point, the gambler has to stop, and the judge provides the correct token. Because the big model always has the final veto power, the output is mathematically identical to if the big model had written every single word itself.
Okay, so let's break down the mechanics because this sounds like a bit of a paradox. You are telling me that running two models, a small one and a large one, is somehow faster than just running the large one alone?
It sounds counterintuitive, right? But it comes down to memory bandwidth. When you run a seventy-billion parameter model to generate one token, the GPU has to load all those billions of weights from its memory into its processing cores. That loading process is the bottleneck. The actual math the GPU does once the weights are loaded is almost instantaneous. Speculative decoding says, while we have those weights loaded to verify one token, why don't we verify five or six at the same time? The "math cost" of checking six tokens is barely higher than checking one, but the "memory cost" of loading the model is exactly the same.
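The bandwidth argument Herman makes can be sketched as a back-of-envelope calculation. All the numbers below are illustrative assumptions (a 70B fp16 model, roughly 2 TB/s of HBM bandwidth, roughly a petaflop of usable compute), not measurements:

```python
# Back-of-envelope: why verifying several tokens costs about the same as one.
# All constants are illustrative assumptions, not measured hardware specs.

PARAMS = 70e9          # 70-billion-parameter target model
BYTES_PER_PARAM = 2    # fp16 weights
BANDWIDTH = 2e12       # ~2 TB/s of HBM bandwidth (assumed)
FLOPS = 1e15           # ~1 PFLOP/s of usable compute (assumed)

def step_time(tokens_verified: int) -> float:
    """Rough time for one forward pass scoring `tokens_verified` positions."""
    memory_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH   # load the weights once
    compute_time = 2 * PARAMS * tokens_verified / FLOPS  # ~2 FLOPs per param per token
    # At batch size 1 the pass is bound by whichever is slower.
    return max(memory_time, compute_time)

print(step_time(1))   # dominated entirely by loading the weights
print(step_time(6))   # identical: same weight load, six positions of cheap math
```

Under these assumptions the crossover where compute starts to matter is hundreds of tokens deep, which is exactly why checking six guesses is essentially free.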
So the small model acts as a scout. It runs ahead because it is tiny and light, so it can generate those five tokens super fast. Then the big model looks at the scout's work and says, "Yes, yes, yes, no, that fourth word is wrong."
That is a great way to put it. And here is the kicker: if the big model accepts all five tokens from the scout, it actually gives you a sixth token for free. During the verification pass, the big model calculates the probabilities for the next step anyway. So if the draft was perfect, you get five plus one. You just jumped six words ahead in the time it usually takes to move one.
I want to dig into the math of that verification step for a second. If I am the big model and I am looking at a sequence of tokens from a draft model, how do I decide to "accept" them? Is it just a binary "is this the word I would have picked" or is there more nuance to the probability distribution?
It is deep nuance. In vanilla speculative decoding, the acceptance criterion guarantees the final output follows the exact same probability distribution as the target model. For each drafted token, the judge accepts it with a probability equal to the ratio of the target model's probability to the draft model's probability, capped at one. So if the target also thinks the token is likely, it sails through. If the target thinks the draft's choice was a low-probability fluke, it gets rejected, and the judge resamples from a corrected "residual" distribution so the overall statistics come out exactly right. The "speculative" part of the name does not mean "guessing at random." It is a statistically rigorous verification.
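That accept-or-resample rule can be sketched in a few lines. The dictionaries here are toy stand-ins for the two models' output distributions over a tiny vocabulary; the structure is the standard lossless speculative sampling rule:

```python
import random

def verify_token(p_target: dict, p_draft: dict, drafted: str) -> str:
    """Accept or reject one drafted token so the emitted token is
    distributed exactly as the target model would have sampled it.
    p_target / p_draft map tokens to probabilities (toy stand-ins)."""
    p = p_target.get(drafted, 0.0)
    q = p_draft[drafted]
    if random.random() < min(1.0, p / q):
        return drafted  # accepted: keep the scout's guess
    # Rejected: resample from the residual distribution max(0, p - q),
    # renormalized, which restores the exact target distribution overall.
    residual = {t: max(0.0, p_target.get(t, 0.0) - p_draft.get(t, 0.0))
                for t in p_target}
    total = sum(residual.values())
    r, acc = random.random() * total, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t
    return max(p_target, key=p_target.get)  # numerical edge-case fallback
```

Run this many times with a biased draft model and the output frequencies still match the target distribution, which is the whole "zero quality loss" guarantee.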
So what does the "scout" or the draft model look like in practice? Does it have to be a specific type of model?
In the early days, like the original Leviathan paper from twenty-three, people just used a smaller version of the same architecture. For example, if you are running Llama seventy-billion, you might use a tiny seven-hundred-million parameter draft model. It has been trained on similar data, so it has a "vibe" for how the big model thinks. But as Daniel noted, we have moved way beyond that.
Right, he mentioned Medusa and EAGLE. I am guessing these are not just "smaller models" but different ways of doing the scouting?
Medusa is a fascinating one. Instead of having a second, separate model, Medusa adds multiple "heads" to the very last layer of the big model itself. Imagine the model has one mouth usually, but you give it five extra tongues. The first head predicts the next token, the second head predicts the token after that, and so on. They all work in parallel. It is elegant because you don't have to manage a second KV cache for a second model.
But wait, if those Medusa heads are all part of the big model, aren't they just as slow to run?
No, because they are just single-layer feed-forward networks sitting on top of the final hidden state. They are incredibly "cheap" computationally. You are basically asking the big model's internal representations, "Hey, based on what you know right now, what are the next five likely steps?" and these tiny heads give you a quick answer.
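The "extra tongues" idea reduces to something very small in code. This is a toy sketch with random stand-in weights and tiny dimensions; real Medusa heads are trained so that head k predicts the token k positions ahead:

```python
import random

random.seed(0)
HIDDEN, VOCAB, N_HEADS = 8, 16, 4  # toy sizes; real models are far larger

# Each head is just one linear map from the model's final hidden state to
# vocabulary logits -- tiny next to the full model. These weights are
# random placeholders; real heads are trained.
heads = [[[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
         for _ in range(N_HEADS)]

def medusa_draft(hidden_state):
    """All heads read the SAME hidden state, so one forward pass of the
    big model yields a guess for each of the next N_HEADS positions."""
    draft = []
    for head in heads:
        logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in head]
        draft.append(max(range(VOCAB), key=logits.__getitem__))  # greedy pick
    return draft

hidden = [random.gauss(0, 1) for _ in range(HIDDEN)]
print(medusa_draft(hidden))  # four draft token ids, no second model involved
```

The design point is visible right in the signature: there is no second KV cache anywhere, because there is no second autoregressive model.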
That makes sense. But then Daniel mentioned EAGLE, which he says is the current state-of-the-art for "self-speculation." How does that differ from the Medusa tongues?
EAGLE is brilliant because it realizes that words are not just words—they are vectors in a hidden space. Instead of trying to predict the next "word" directly, EAGLE predicts the next "hidden state" of the model. It uses a very lightweight Transformer plugin. Recent benchmarks have shown EAGLE-two or EAGLE-three hitting six-and-a-half times speedups. It is much better at handling the "uncertainty" of the draft because it works at the feature level rather than just the token level.
I am looking at some of these case studies Daniel sent over. There is one where a seven-billion parameter draft model paired with a seventy-billion target model achieved a two-point-five times speedup on code generation. Why is code generation specifically highlighted there? Does speculative decoding work better on code than, say, poetry?
You hit on a massive point. Speculative decoding's efficiency depends entirely on the "acceptance rate." If the scout is good at guessing what the judge will say, the system flies. Code is very structured and predictable. There are only so many ways to finish a "for" loop in Python. So the scout model has a very high success rate. Creative writing or complex logic is harder to guess, so the scout gets "vetoed" more often. When the scout gets vetoed, you lose some of that speedup because you spent time on a guess that didn't pay off.
That brings up a tradeoff I was wondering about. Is there a scenario where speculative decoding is actually slower than just running the big model? Like, if the scout is a total idiot and gets every guess wrong, are you just adding overhead?
Theoretically, yes. There is a small cost to running the draft model and a small cost to the verification math. If your acceptance rate drops to near zero, you are technically slower. But in practice, even a very small model is surprisingly good at predicting the "and," "the," and "is" of a sentence. Most implementations use a dynamic "k" value—that is the number of tokens to guess. If the model sees it is getting rejected a lot, it might only guess two tokens. If it is on a roll, it might guess ten.
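That "bet smaller on a losing streak" controller can be sketched as a tiny heuristic. The thresholds here are made-up illustrations; production engines use more refined policies, but the shape is the same:

```python
def adjust_k(k: int, accepted: int, proposed: int,
             k_min: int = 1, k_max: int = 10) -> int:
    """Toy dynamic-k policy: grow the speculation length when most
    guesses land, shrink it when the judge keeps vetoing.
    Thresholds (0.8 / 0.4) are illustrative assumptions."""
    rate = accepted / proposed
    if rate > 0.8:
        return min(k + 1, k_max)  # on a roll: bet more tokens ahead
    if rate < 0.4:
        return max(k - 1, k_min)  # losing streak: bet fewer
    return k

print(adjust_k(5, accepted=5, proposed=5))  # hot streak -> 6
print(adjust_k(5, accepted=1, proposed=5))  # cold streak -> 4
```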
It is like a gambler who starts betting smaller when they are on a losing streak. I love that. Now, Daniel mentioned Mamba. We have talked about State Space Models before, but why are they "ideal drafters" for Transformer targets?
This is really cool research from twenty-five. Transformers have a "KV cache" problem. As the conversation gets longer, the "memory" the model needs to store grows and grows. If your scout is a Transformer, it also has to manage a growing memory cache, which can actually eat up twenty gigabytes of VRAM in long-context windows. Mamba models have a constant memory state. They don't care if the conversation is ten words or ten thousand words; their "memory" footprint stays the same. So using a Mamba scout to help a Transformer judge means you get the speed without the memory bloat at long sequences.
That is such a smart architectural marriage. You use the efficient memory of Mamba for the scout and the massive reasoning power of the Transformer for the judge. It feels like we are moving toward these "hybrid" inference stacks where it is not just one model in a box, but a whole ecosystem of components working together.
It really is an ecosystem. And we haven't even touched on "Lookahead Decoding." This one is for the purists who don't want a second model at all. It uses something called Jacobi iteration. Essentially, it treats the decoding process like a system of non-linear equations and tries to solve for multiple tokens simultaneously. It is "model-less" speculation. It is not always as fast as EAGLE, but it is incredibly robust because you don't have to train a scout.
So if I am a developer listening to this, and I have a model that feels a bit sluggish in my app, is speculative decoding something I can just "turn on"? Or do I need to go back to school for a PhD in linear algebra?
The good news is that the major inference engines—things like vLLM, TensorRT-LLM, and even smaller setups—now expose speculative decoding as a configuration option you can toggle. You just need to point it to a compatible draft model. The industry has converged on this so quickly because the return on investment is massive. You are essentially getting a three-times throughput increase on the same hardware. For a company paying for GPU cloud time, that works out to roughly a two-thirds cut in their inference bill.
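As a rough idea of what "just a toggle" looks like, here is a sketch of serving a large model with a small same-family drafter in vLLM. The exact flag names have changed across vLLM releases and the model names are placeholders, so treat this as illustrative and check the docs for your installed version:

```shell
# Illustrative only: flag names vary by vLLM version, models are examples.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5
```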
That is the real-world impact. It is not just about making the chat bubble pop up faster; it is about the economics of AI. If you can serve three times as many users on the same H-one-hundred, your business model suddenly looks a lot healthier.
And it enables things that were previously impossible. Think about real-time voice translation or "thinking-out-loud" agents. If the model is too slow, the "human" element of the conversation breaks. Speculative decoding pushes the latency below the threshold of human perception. It makes the AI feel like a partner rather than a slow-loading webpage.
I want to go back to something Daniel mentioned—the "Drafting Tree." He said modern systems don't just guess a linear path, they create a "tree" of possibilities. How does the big model verify a tree? Does it have to look at every branch one by one?
That is the beauty of "Tree Attention." It is a specialized attention mask that allows the big model to look at all branches of the tree in a single forward pass. Imagine the scout says, "The next word could be 'apple', 'banana', or 'cherry'." Then for each of those, it guesses two more words. You have a little branching structure. The big model looks at this whole "cloud" of tokens and picks the single most valid path through the tree. It is much more efficient than linear guessing because even if the scout's first guess is wrong, one of its alternative branches might be right.
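The mask Herman describes is just "each drafted token may attend to its own ancestors and nothing from sibling branches." Here is a minimal sketch, representing the draft tree by each node's parent index:

```python
def tree_attention_mask(parents):
    """Build a tree attention mask for a batch of drafted tokens.
    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the already-committed prefix. Position i may attend to
    position j only when j is i itself or an ancestor of i, so one
    forward pass can score every branch without cross-talk."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True  # walk up the ancestor chain
            j = parents[j]
    return mask

# A tiny tree: node 0 is the root guess; 1 and 2 are alternative
# children of 0; node 3 continues the branch through node 1.
for row in tree_attention_mask([-1, 0, 0, 1]):
    print(row)
```

Note how row 3 can see nodes 0 and 1 but not node 2: the "banana" branch never leaks into the "apple" branch even though both sit in the same forward pass.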
That is like playing multiple hands of blackjack at once and only having to pay for the one that wins.
And there is even logic now using "Multi-Armed Bandits"—a type of reinforcement learning—to decide which branches to prune or grow in real-time. If the model is writing a technical manual, it grows a deep, narrow tree. If it is writing a screenplay, it grows a wide, shallow tree to capture different creative directions. It is becoming incredibly sophisticated.
What I find wild is that this technique is "architecture-agnostic." Daniel pointed out it works on Transformers, Mamba, RWKV—basically anything that predicts tokens one by one. Does that mean speculative decoding is a permanent fixture in the AI stack, or will we eventually just train models that are "fast by default"?
I think it is a permanent fixture. As long as we have the disparity between memory speed and compute speed, we will want to use this trick. Even if we invent a "Flash-Transformer" that is ten times faster, we will still want to run speculative decoding on top of it to make it thirty times faster. It is just too good of a deal to pass up. The only way it goes away is if we move away from "autoregressive" generation entirely—meaning we generate the whole sentence at once like an image—but so far, word-by-word prediction is still the king of reasoning.
It is amazing how much of "AI progress" is actually just very clever engineering around hardware limitations. We talk about "intelligence," but a lot of it is just "how do we move these numbers across a copper wire faster?"
That is the "Engineering" in AI Engineering. It is easy to get lost in the philosophy of "is it conscious?", but the people making it work are asking, "How do I minimize the CPU-to-GPU kernel launch overhead?" Speaking of which, Daniel mentioned CUDA Graphs. That is a very technical point, but it matters. Usually, every time the scout model makes a guess, the CPU has to tell the GPU, "Okay, do this math now." That "telling" takes time. CUDA Graphs allow you to "record" the whole sequence of commands so the GPU can just run the whole speculative loop without waiting for instructions from the CPU. It is like giving a chef a whole recipe at once instead of telling them one ingredient at a time.
It is all about removing the friction. Every millisecond counts when you are trying to hit that "real-time" feel. I am curious about the "zero quality loss" claim again. Is there any edge case where the "judge" might be tricked by the "scout"? Like, could the scout's guesses subtly bias the judge's final decision?
In the standard implementation, no. The math is very clear. If you use "rejection sampling," the final output is drawn from the exact same probability distribution as the target model. However, there are "lossy" versions of speculative decoding where you might accept a "good enough" guess to save even more time. But those are usually used for low-stakes things like internal search. For anything public-facing, people stick to the lossless version. It is one of the rare "free lunches" in technology.
I love a free lunch. Let's talk practical takeaways for a second. If someone is building an AI-powered tool right now, what is the "entry-level" move for speculative decoding?
The entry-level move is "Vanilla Speculative Decoding." Find a small, distilled version of the model you are already using. If you are using Llama seventy-billion, use Llama one-billion as your drafter. Most inference servers like vLLM make this a single configuration line. You will likely see an immediate two times speedup without changing a single line of your application logic.
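The whole vanilla draft-and-verify loop fits in a page once you strip away the neural networks. Here the "models" are toy callables over characters (the scout deliberately errs every sixth token), and verification is greedy rather than sampled, but the accept-until-first-veto logic and the free bonus token are the real mechanics:

```python
def speculative_step(target, draft, context, k=4):
    """One draft-and-verify round with greedy decoding. `target` and
    `draft` are any callables mapping a token list to the next token --
    toy stand-ins for a 70B judge and a 1B scout."""
    # Scout: propose k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Judge: in a real engine this is ONE forward pass over all k
    # positions; here we just query the toy target at each prefix.
    accepted, ctx = [], list(context)
    for t in proposed:
        want = target(ctx)
        if t != want:
            accepted.append(want)  # veto: judge supplies the correct token
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))   # bonus token: k accepted -> k+1 emitted
    return accepted

# Toy models over a six-character cycle; the scout is wrong at index 5.
target = lambda ctx: "abcabc"[len(ctx) % 6]
draft  = lambda ctx: "abcabx"[len(ctx) % 6]

out = []
while len(out) < 8:
    out += speculative_step(target, draft, out, k=4)
print("".join(out))  # -> abcabcabcab
```

The output is character-for-character what the target alone would have produced, which is the lossless guarantee in miniature: the scout only changed how fast we got there.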
And if they want to get fancy?
Then you look at Medusa or EAGLE. Specifically, EAGLE-two is the one most people are migrating to if they have the ability to add a plugin to their model. It requires a bit more setup because you are adding a small "head" to your model, but the speedups move from "twice as fast" to "four or five times as fast." It is worth the effort if you are running at scale.
And what about benchmarking? How do you actually measure if this is working? Is it just tokens per second?
Tokens per second is the headline number, but the real metric is "Inter-token Latency." That is the time between each word appearing on the screen. Speculative decoding makes that number much lower on average, but it can make it "jittery." You might get five words instantly, then a tiny pause, then five more. For most users, that feels faster than a steady but slow stream. You also want to track your "Acceptance Rate." If your scout is only getting ten percent of its guesses right, you need a better scout.
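The link between acceptance rate and speedup has a clean closed form in the original speculative decoding analysis, under the simplifying assumption that each drafted token is accepted independently with probability alpha:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when each
    of k drafted tokens is accepted independently with probability
    alpha: (1 - alpha**(k + 1)) / (1 - alpha). The independence
    assumption is a simplification, but it shows the shape of the curve."""
    if alpha >= 1.0:
        return k + 1  # perfect scout: k accepted plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_pass(0.8, 5))  # good scout: ~3.7 tokens per pass
print(expected_tokens_per_pass(0.1, 5))  # bad scout: barely above 1
```

That second number is the ten-percent scout Herman warns about: you pay for five guesses per round and advance barely one token, so the drafting overhead can easily eat the gain.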
It is like watching a fast typist who occasionally has to hit backspace. It still feels faster than someone hunting and pecking at the keys perfectly.
And for the real power users, look into "Mamba Drafters" if you are dealing with huge context windows. If you are asking an AI to summarize a whole book, a Transformer scout will eventually choke on the memory, but a Mamba scout will hum right along.
This whole discussion makes me realize how much "invisible" work goes into the AI we use every day. We see the clever answer, but we don't see the scout running ahead, the judge vetoing guesses, the tree attention masks, and the CUDA graphs all firing in milliseconds.
It is a symphony of micro-optimizations. And the coolest part is that it is all open source. The papers Daniel cited—Leviathan, Medusa, EAGLE—these aren't secret corporate formulas. They are open research that anyone can implement. It is a very exciting time to be on the engineering side of this.
It feels like we are in the "Deployment Era" now. The models are huge and smart, but now we are obsessed with making them useful, cheap, and fast. Speculative decoding is the poster child for that shift.
It really is. We have moved from "Can it do it?" to "Can it do it for a million people simultaneously without costing a fortune?" And the answer, thanks to these tricks, is a resounding yes.
Well, I think we have thoroughly unpacked why the "scout and judge" system is taking over the world. Where do you see this going next? Do you think we will eventually have "hierarchies" of speculation? Like a tiny model guessing for a medium model that guesses for a giant model?
"Cascaded Speculative Decoding" is already being researched! You have a seventy-million parameter model drafting for a seven-billion, which drafts for a seventy-billion. It is like a relay race. The efficiency gains start to diminish after a while, but for the truly massive trillion-parameter models, a three-tier system might actually be the only way to make them usable for chat.
A relay race of AI models. I love that image. It is just scouts all the way down.
That is the future, Corn. Faster, smarter, and somehow, using more models to do less work.
Well, that is a wrap on speculative decoding. I am definitely going to be looking at my chat windows differently now, imagining those little scouts running ahead of the text.
Once you see the "jitter" of a speculative sequence, you can never un-see it. It is the heartbeat of modern inference.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes.
And a big thanks to Modal for providing the GPU credits that power this show. They are actually a great example of a platform where you can deploy these speculative stacks incredibly easily.
If you are enjoying the show, a quick review on your podcast app really helps us reach new listeners. It is the best way to support what we are doing here.
This has been My Weird Prompts. We will see you in the next one.
Stay curious, everyone. Bye.
Goodbye.