Retrieval-Based Speculative Decoding (REST): A Plug-and-Play Method for Accelerating Language Model Generation without Additional Training


Recent advancements in accelerating the generation process of Large Language Models (LLMs), such as speculative decoding, blockwise parallel decoding, and Medusa, have brought impressive speed improvements. Typically, these methods pair the large base model with a lightweight draft model. The draft model predicts multiple tokens per decoding step at low latency and lets the base model verify them in parallel, reducing the number of forward passes required from the slower base model.
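
To make the draft-and-verify idea concrete, here is a minimal greedy-decoding sketch. It assumes `draft_model` and `base_model` are callables mapping a token-ID tensor of shape [1, T] to logits of shape [1, T, V]; these names and interfaces are illustrative, not taken from any of the cited papers.

```python
import torch

def speculative_step(base_model, draft_model, prefix: torch.Tensor, k: int = 4):
    """Draft k tokens cheaply, then verify them with one base-model pass.

    Assumes batch size 1 and greedy decoding for clarity.
    """
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft = prefix
    for _ in range(k):
        logits = draft_model(draft)                       # cheap forward pass
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2. Verify: a single base-model pass scores all k drafted positions
    #    in parallel, instead of k sequential base-model calls.
    base_logits = base_model(draft)
    base_preds = base_logits[:, prefix.shape[1] - 1 : -1].argmax(-1)

    # 3. Accept the longest prefix on which draft and base model agree.
    #    (The full algorithm also appends the base model's own next token,
    #    so at least one token is gained per step.)
    drafted = draft[:, prefix.shape[1]:]
    agree = (base_preds == drafted).long().cumprod(-1)
    n_accepted = int(agree.sum())
    return torch.cat([prefix, drafted[:, :n_accepted]], dim=-1), n_accepted
```

The more tokens the draft model gets right per step, the fewer sequential base-model passes are needed, which is where the speedup comes from.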

However, obtaining a high-quality draft model remains an art: it must balance small size with strong predictive power, match the vocabulary of the base model, and integrate cleanly into a distributed serving system. To tackle these challenges, Medusa introduced an efficient fine-tuning procedure that creates draft models in the form of additional language model heads (sketched below). Still, the need for extra fine-tuning remains a common pain point.
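
For context, Medusa's draft heads can be pictured roughly as extra prediction heads attached to the base model's final hidden state, each trained to guess a token one position further ahead. The sketch below is an illustrative simplification (Medusa's actual heads include a residual block), and all names in it are hypothetical.

```python
import torch
import torch.nn as nn

class DraftHeads(nn.Module):
    """Medusa-style draft heads: head i predicts the token at offset i + 2
    (the base LM head already covers offset + 1)."""

    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(n_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_size] -> [batch, n_heads, vocab_size]
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)
```

Because the heads share the base model's trunk, only the small head parameters need training, which is what makes Medusa's fine-tuning lightweight compared with training a separate draft model.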

This raises the question: can we design an acceleration method that is plug-and-play out of the box? One that delivers swift generation without training or fine-tuning any new models?
