LLMs are powerful, but they’re slow during inference. That’s because they’re trained with an autoregressive pattern: generating the next token based on all the previous ones. During inference, this means every new token requires a full forward pass through the model, followed by sampling and then appending that token to the input for the next step. The process is inherently sequential: the model can’t compute future tokens ahead of time, even when the GPU has spare capacity. This leads to high Inter-Token Latency (ITL) and poor GPU utilization.
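The loop below sketches that pattern using Hugging Face transformers (with `gpt2` purely as a placeholder model and greedy sampling for simplicity): each iteration runs a full forward pass, picks one token, and appends it before the next pass can begin.

```python
# Minimal sketch of the sequential autoregressive decoding loop.
# "gpt2" is just a stand-in model; greedy sampling keeps the example short.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                       # generate 20 tokens, one at a time
        logits = model(input_ids).logits      # full forward pass over everything so far
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Note that token *t+1* can only be computed after token *t* has been sampled and appended, which is exactly why the GPU sits underutilized between steps.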
Speculative decoding offers a solution. By having a small draft model predict several tokens in advance, and letting a larger target model verify them in parallel, you can accelerate the token generation process.
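As a rough illustration, here is a simplified greedy variant of a single speculative step for Hugging Face-style causal LMs (the full algorithm verifies drafts with rejection sampling over the two distributions; treat this as a sketch, not a faithful implementation):

```python
# Simplified greedy speculative decoding step: the draft model proposes k tokens,
# the target model scores all of them in one forward pass, and we keep the draft
# tokens that match the target's greedy choice. Batch size 1 is assumed.
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap, small model).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)

    # 2) Target model scores the prompt + all k draft tokens in ONE forward pass.
    target_preds = target_model(draft_ids).logits.argmax(dim=-1)

    # 3) Accept draft tokens left to right while they match the target's choice.
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        target_token = target_preds[:, n_prompt + i - 1]  # target's pick for position n_prompt + i
        draft_token = draft_ids[:, n_prompt + i]
        if torch.equal(target_token, draft_token):
            accepted = torch.cat([accepted, draft_token.unsqueeze(-1)], dim=-1)
        else:
            # First mismatch: take the target's own token instead and stop.
            accepted = torch.cat([accepted, target_token.unsqueeze(-1)], dim=-1)
            break
    else:
        # All k drafts accepted: the target's next prediction comes for free.
        accepted = torch.cat([accepted, target_preds[:, -1:]], dim=-1)
    return accepted
```

Each target forward pass now yields between one and k + 1 tokens instead of exactly one, which is where the speedup comes from.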
In practice, however, we found that speculative decoding only delivers the expected speedup when the draft model’s output distribution closely matches the target model’s. The key is using the right draft model for your workload, which in many real-world cases means training one on your own data.
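To see why the match matters, the standard analysis from the speculative decoding literature (assuming an i.i.d. per-token acceptance rate α) gives the expected number of tokens produced per target forward pass as (1 − α^(k+1)) / (1 − α) for a draft length of k. A quick back-of-the-envelope calculation shows how sharply the payoff depends on α:

```python
# Expected tokens produced per target-model forward pass, assuming each draft
# token is accepted independently with probability alpha and k tokens are drafted.
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}, k=4 -> {expected_tokens_per_step(alpha, 4):.2f} tokens/step")
# alpha=0.5 -> ~1.94, alpha=0.7 -> ~2.77, alpha=0.9 -> ~4.10
```

A draft model that agrees with the target only half the time roughly doubles throughput at best, while one trained to closely track the target on your workload can approach the full k + 1 tokens per step.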