Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch

The continued growth in the capability of LLMs, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot must serve a small number of users at very low latency to remain interactive, while synthetic data generation requires high throughput to process many items at once. Delivering optimal inference performance across this range of use cases with a single platform requires optimization across the entire technology stack.

Cutting-edge LLMs, like Llama 3.1 405B, require multiple GPUs working together for peak performance. To effectively use multiple GPUs for processing inference requests, an inference software stack must provide developers with optimized implementations of key parallelism techniques, including tensor, pipeline, and expert parallelism. These parallelism techniques require that GPUs be able to transfer data quickly and efficiently, necessitating a robust GPU-to-GPU interconnect fabric for maximum performance.
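
To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code; the layer sizes and two-GPU split are hypothetical). It shows how tensor parallelism splits the weights of every layer across GPUs and must sum partial results over the interconnect (an all-reduce), while pipeline parallelism assigns whole layers to different GPUs and passes only activations between stages.

```python
import numpy as np

# Toy illustration: partitioning the same two-layer MLP across 2 "GPUs"
# with tensor parallelism versus pipeline parallelism.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # batch of 4 tokens, hidden size 8
w1 = rng.standard_normal((8, 16))          # layer 1 weights
w2 = rng.standard_normal((16, 8))          # layer 2 weights

# Reference: single-GPU forward pass.
ref = np.maximum(x @ w1, 0) @ w2

# Tensor parallelism: split w1 by columns and w2 by rows across 2 GPUs.
# Each GPU holds half of every layer; the partial outputs of the second
# matmul must be summed across GPUs (an all-reduce over the interconnect).
h0 = np.maximum(x @ w1[:, :8], 0)          # GPU 0 partial activation
h1 = np.maximum(x @ w1[:, 8:], 0)          # GPU 1 partial activation
tp_out = h0 @ w2[:8, :] + h1 @ w2[8:, :]   # all-reduce (sum) of partials

# Pipeline parallelism: GPU 0 owns layer 1, GPU 1 owns layer 2.
# Only the activations cross the interconnect, once per microbatch.
stage0 = np.maximum(x @ w1, 0)             # computed on GPU 0
pp_out = stage0 @ w2                       # activations handed to GPU 1

assert np.allclose(ref, tp_out) and np.allclose(ref, pp_out)
print("tensor-parallel and pipeline-parallel outputs match the reference")
```

The two mappings place very different demands on the fabric: tensor parallelism exchanges partial results on every layer, while pipeline parallelism exchanges activations only at stage boundaries, which is why the choice of parallelism and the speed of the GPU-to-GPU interconnect are so tightly coupled.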

In this post, we explain two of these parallelism techniques and show how, on an NVIDIA HGX H200 system with NVLink and NVSwitch, the right choice of parallelism increases Llama 3.1 405B throughput by 1.5x in throughput-sensitive scenarios. We also show how the use of pipeline parallelism enabled a 1.2x speedup on the MLPerf Inference v4.1 Llama 2 70B benchmark on HGX H100 compared to our results published in August. These gains are made possible by recent TensorRT-LLM software improvements that take advantage of NVSwitch.
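
For readers who want to experiment with parallelism mappings themselves, the sketch below shows how such a mapping might be expressed with TensorRT-LLM's high-level LLM API. The argument names (tensor_parallel_size, pipeline_parallel_size), the model identifier, and the output structure are assumptions for illustration; check the TensorRT-LLM documentation for the exact interface.

```python
# Hypothetical sketch of choosing a parallelism mapping with TensorRT-LLM's
# high-level LLM API. Argument names and the model identifier are assumptions,
# not confirmed API details.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed model identifier
    tensor_parallel_size=8,      # split each layer across 8 GPUs per node
    pipeline_parallel_size=2,    # split the layer stack into 2 stages
)

outputs = llm.generate(
    ["Summarize why GPU interconnect bandwidth matters for LLM inference."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

As a general rule, wider tensor parallelism reduces per-token latency at the cost of more all-reduce traffic, while pipeline stages reduce cross-GPU traffic per layer but add stage-to-stage handoffs; which mapping wins depends on whether the deployment is latency- or throughput-sensitive, which is exactly the trade-off explored in the results below.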
