We want to use the full power of our GPU during LLM inference. To do that, we need to know whether our inference is compute bound or memory bound so that we can optimize in the right area. Calculating the operations per byte possible on a given GPU and comparing it to the arithmetic intensity of our model's attention layers reveals where the bottleneck is: compute or memory. We can use this information to pick the appropriate GPU for model inference and, if our use case allows, use techniques like batching to better utilize our GPU resources.
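As a quick sketch of that comparison: divide a GPU's peak compute by its memory bandwidth to get its ops:byte ratio, then compare that ratio to the workload's arithmetic intensity. The specs below are the published FP16 tensor-core compute and memory bandwidth of an NVIDIA A10; the sample intensity values passed to the function are illustrative, not measurements.

```python
# Classify a workload as compute bound or memory bound on a given GPU.
# NVIDIA A10 published specs:
A10_FP16_FLOPS = 125e12    # 125 TFLOPS FP16 tensor-core compute
A10_MEM_BANDWIDTH = 600e9  # 600 GB/s memory bandwidth

# Ops the GPU can perform per byte it moves from memory (~208 for the A10).
ops_per_byte = A10_FP16_FLOPS / A10_MEM_BANDWIDTH

def bottleneck(arithmetic_intensity: float, ops_byte_ratio: float) -> str:
    """If the workload does fewer ops per byte moved than the GPU can
    supply, memory bandwidth is the limit; otherwise compute is."""
    if arithmetic_intensity < ops_byte_ratio:
        return "memory bound"
    return "compute bound"

print(f"A10 ops:byte ratio ~= {ops_per_byte:.0f}")
print(bottleneck(60, ops_per_byte))    # low intensity (illustrative value)
print(bottleneck(400, ops_per_byte))   # high intensity (illustrative value)
```

Low-batch autoregressive decoding tends to land well below a modern GPU's ops:byte ratio, which is why it is usually memory bound; we work through the actual arithmetic intensity of Llama 2's attention layers later in the guide.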
Many layers of abstraction sit between an ML model API and a bare-metal GPU. Developing strong mental models for these abstractions helps you control costs and improve performance during inference, so you get the most out of the GPUs you pay for.
This guide will help you understand the math behind profiling transformer inference. As a running example, we'll look at serving Llama 2 on an A10 GPU. We'll cover: