TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.
Performance comparison between vLLM v0.5.3 and v0.6.0 for Llama 8B on 1xH100 and 70B on 4xH100 on ShareGPT dataset (500 prompts). TPOT measured at 32 QPS.
A month ago, we released our performance roadmap committing to performance as our top priority. Today, we released vLLM v0.6.0, with 1.8-2.7x throughput improvements compared to v0.5.3, reaching state-of-the-art performance while keeping rich features and great usability.
LLM inference requires tight collaboration between CPUs and GPUs. Although the major computation happens in GPUs, CPUs also play an important role in serving and scheduling requests. If CPUs cannot schedule fast enough, GPUs will sit idle to wait for CPUs, which eventually leads to inefficient GPU utilization and hinders inference performance.
One year ago, when vLLM was first released, we mainly optimized for relatively large models on GPUs with limited memory (e.g. Llama 13B on NVIDIA A100-40G). As faster GPUs with larger memory (like NVIDIA H100) become more available and models become more optimized for inference (e.g. with techniques like GQA and quantization), the time spent on other CPU parts of the inference engine becomes a significant bottleneck. Specifically, our profiling results show that for Llama 3 8B running on 1 H100 GPU: