
FireAttention V3: Enabling AMD as a Viable Alternative for GPU Inference


This post is the continuation of our FireAttention blog series: FireAttention V1 and FireAttention V2. This time we are going to focus on different GPU hardware, namely the AMD MI300 GPU.

While spec-wise it looks quite superior to the NVIDIA H100 GPU, we never know how it will perform in real-world LLM inference settings until we run benchmarks that represent practical LLM usage.

Fireworks has been using AMD MI300 hardware in production since the launch of LLaMA 405B. In this post we are going to go over the work that made it happen.

FireAttention V3 is an AMD-specific implementation for the Fireworks LLM stack. When measured on 8 MI300 GPUs against other leading LLM implementations (NIM Containers on H100 and AMD vLLM on MI300), it achieves a 1.4x improvement in the average RPS @ 8 secs metric for the LLaMA 8B model and a 1.8x improvement in the average RPS @ 10 secs metric for the LLaMA 70B model. In some low-latency scenarios the RPS improvement reaches up to ~3x over NIM and up to ~5.5x over AMD vLLM. It even improves the minimum achievable latency for the LLaMA 8B model by 1.6x.

To tell the truth, we were pleasantly surprised at how smooth it was to port our proprietary stack to ROCm and reach functional parity, i.e. all functionality working correctly. We leveraged PyTorch's official ROCm support, which has matured a lot over the past few years. For custom CUDA code, with the exception of advanced APIs, most API calls are supported on ROCm and are converted automatically by the hipify utility.
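To make the hipify step concrete, here is a minimal sketch (not code from our stack): a toy SAXPY kernel written against the CUDA runtime, assuming a standard ROCm install with the hipify-perl tool on the PATH. The runtime calls are exactly the kind of thing the tool rewrites mechanically, while the kernel body carries over essentially unchanged.

```cpp
// saxpy.cu -- toy example of CUDA runtime code that hipify converts to HIP.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    // Plain CUDA runtime calls: hipify maps each to its HIP counterpart
    // (cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy, cudaFree -> hipFree).
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel launch; recent HIP versions accept this syntax directly as well.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaDeviceSynchronize();

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 4.0

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Running something like `hipify-perl saxpy.cu > saxpy.hip.cpp` and compiling the result with hipcc is typically all it takes for code at this level of complexity; hand-tuned kernels and advanced APIs are where the real porting work lives.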
