A Survey of Speculative Decoding Techniques in LLM Inference

Doing inference with LLMs in production is not easy: it requires massive compute, and as a result it can get expensive quickly. This has led to the development of several techniques to make inference more cost- and compute-efficient. One such technique that has emerged in the last couple of years is speculative decoding. During standard LLM inference, one full forward pass of the model generates a single token. This is a highly inefficient use of the available compute on the GPU or accelerator chip; speculative decoding addresses it by enabling the generation of multiple tokens per forward pass.
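To make the idea concrete, below is a minimal sketch of the draft-then-verify loop. Everything here (the toy models, `VOCAB`, `GAMMA`) is an invented stand-in for illustration; the accept/reject rule follows the rejection-sampling scheme used in the original formulation, where a small draft model proposes tokens and the large target model verifies them.

```python
import numpy as np

VOCAB = 16   # toy vocabulary size (illustration only)
GAMMA = 4    # number of tokens the draft model speculates per step

def _toy_probs(context, salt):
    """Stand-in next-token distribution, deterministic within a run."""
    seed = hash((tuple(context), salt)) & 0xFFFFFFFF
    logits = np.random.default_rng(seed).normal(size=VOCAB)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def draft_probs(context):   # the small, cheap draft model q(x | context)
    return _toy_probs(context, "draft")

def target_probs(context):  # the large target model p(x | context)
    return _toy_probs(context, "target")

def speculative_step(context, rng):
    """One round of speculative decoding; returns the newly accepted tokens."""
    # 1) The draft model autoregressively proposes GAMMA tokens (cheap).
    ctx, drafted = list(context), []
    for _ in range(GAMMA):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append((tok, q[tok]))
        ctx.append(tok)

    # 2) The target model scores every drafted position. In a real system
    #    this is ONE batched forward pass; the toy model is called per
    #    position here only for readability.
    ctx, accepted = list(context), []
    for tok, q_tok in drafted:
        p = target_probs(ctx)
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q_tok):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), which
            # keeps the output distribution identical to the target's.
            residual = np.maximum(p - draft_probs(ctx), 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted

    # 3) All drafts accepted: take one bonus token from the target for free.
    accepted.append(int(rng.choice(VOCAB, p=target_probs(ctx))))
    return accepted

rng = np.random.default_rng(0)
tokens = [1, 2, 3]                     # some prompt token ids
tokens += speculative_step(tokens, rng)
print(tokens)                          # up to GAMMA + 1 new tokens per step
```

Because rejected drafts are resampled from the residual distribution, the output is distributed exactly as if the target model had decoded on its own; the speedup comes from accepting several cheap draft tokens per expensive target-model pass.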

Speculative decoding originally appeared in an ICML paper in 2023 and has since been adapted and deployed widely. It has also seen various modifications that reduce the cost of its implementation and improve its accuracy.

This article is a non-exhaustive survey of speculative decoding techniques. We will first discuss the original speculative decoding technique as proposed in the ICML 2023 paper. After that, we will discuss the Medusa architecture, which simplifies the implementation of speculative decoding while improving performance. Finally, we will close by looking at some proposed modifications to Medusa that address drawbacks in its architecture.
