by Adnan Hoque, Less Wright, Raghu Ganti and Mudhakar Srivatsa
In this blog, we discuss the methods we used to achieve FP16 inference with popular LLMs such as Meta's Llama3-8B and IBM's Granite-8B Code, where 100% of the computation is performed using OpenAI's Triton Language. For single token generation times using our Triton kernel-based models, we were able to approach 0.76-0.78x the performance of the CUDA kernel-dominant workflows for both Llama and Granite on NVIDIA H100 GPUs, and 0.62-0.82x on NVIDIA A100 GPUs.
Why explore using 100% Triton? Triton provides a path for enabling LLMs to run on different types of GPUs: NVIDIA, AMD, and, in the future, Intel and other GPU-based accelerators. It also provides a higher layer of abstraction in Python for programming GPUs, which has allowed us to write performant kernels faster than authoring them with vendor-specific APIs. In the rest of this blog, we will share how we achieve CUDA-free compute, micro-benchmark individual kernels for comparison, and discuss how we can further improve future Triton kernels to close the remaining gap.
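To give a feel for this level of abstraction, here is a minimal, illustrative Triton kernel written entirely in Python. It is not one of the kernels discussed in this blog; it simply shows the standard pattern of a `@triton.jit` kernel (block of offsets, masked loads/stores) and a host-side launch over a 1D grid, using a basic element-wise add as the example.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # 1D launch grid: enough program instances to cover all elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

The same structure scales up to the matmul, attention, and normalization kernels that make up a full LLM forward pass, which is what makes an end-to-end Triton path practical.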
Figure 1. Inference throughput benchmarks with Triton and CUDA variants of Llama3-8B and Granite-8B, on NVIDIA H100 and A100. Settings: batch size = 2, input sequence length = 512, output sequence length = 256.