
Cerebras video shows AI writing code 75x faster than world's fastest AI GPU cloud — world's largest chip beats AWS's fastest in head-to-head comparison


Cerebras got Meta’s Llama 3.1 405B large language model to run at 969 tokens per second, 75 times faster than what Amazon Web Services' fastest GPU-based AI service could muster.

The LLM ran on Cerebras’s cloud AI service, Cerebras Inference, which uses the chip company’s third-generation Wafer Scale Engines rather than GPUs from Nvidia or AMD. Cerebras has always claimed its Inference service is the fastest at generating tokens, the individual pieces that make up an LLM’s response. When the service first launched in August, Cerebras claimed it was about 20 times faster than Nvidia GPUs accessed through cloud providers like Amazon Web Services on Llama 3.1 8B and Llama 3.1 70B.

But since July, Meta has also offered Llama 3.1 405B, which at 405 billion parameters is a far heavier model than the 70-billion-parameter Llama 3.1 70B. Cerebras says its Wafer Scale Engine processors can run this massive LLM at “instant speed”: 969 tokens per second, with a time-to-first-token of just 0.24 seconds. According to the company, that is a world record, not just for its chips but for the Llama 3.1 405B model overall.
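The two numbers Cerebras cites, output tokens per second and time-to-first-token, are straightforward to measure yourself against any streaming chat-completion endpoint. Below is a minimal sketch in Python that assumes an OpenAI-compatible streaming API; the base URL, model name, and API key are placeholders, not confirmed details of Cerebras Inference or any other provider.

```python
# Minimal sketch: measure time-to-first-token (TTFT) and rough output throughput
# against a streaming, OpenAI-compatible chat endpoint. The URL, key, and model
# name below are placeholders, not confirmed provider values.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://example-inference-provider.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                                # placeholder key
)

start = time.perf_counter()
first_token_at = None
words = 0

stream = client.chat.completions.create(
    model="llama-3.1-405b",  # placeholder model name
    messages=[{"role": "user", "content": "Write a quicksort function in Python."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()
    words += len(delta.split())  # crude proxy; real benchmarks count tokenizer tokens

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f} s")
print(f"~{words / elapsed:.0f} words/s (tokenizer-based token counts will differ)")
```

Note that this counts whitespace-separated words rather than tokens, so the throughput it reports will not match a provider’s token-based figure exactly; it is only meant to show where TTFT and generation rate come from.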

Compared to Nvidia GPUs rented from AWS, Cerebras Inference was apparently 75 times faster, and the Wafer Scale Engine chips were 12 times faster than even the fastest Nvidia GPU implementation, from Together AI. Cerebras Inference also beat its closest competitor, AI processor designer SambaNova, by a factor of six.
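Taking those multipliers at face value, the absolute throughput they imply for the other services is easy to back out from the 969 tokens-per-second figure. The short calculation below is only that arithmetic, assuming every multiplier is quoted relative to the same Cerebras number; these are not measured results.

```python
# Back-of-the-envelope: implied tokens/second for each provider, assuming the
# quoted speedups are all relative to Cerebras's 969 tokens/second on Llama 3.1 405B.
cerebras_tps = 969
speedups = {
    "AWS (fastest GPU service)": 75,
    "Together AI": 12,
    "SambaNova": 6,
}

for name, factor in speedups.items():
    print(f"{name}: ~{cerebras_tps / factor:.0f} tokens/s implied")
# AWS (fastest GPU service): ~13 tokens/s implied
# Together AI: ~81 tokens/s implied
# SambaNova: ~162 tokens/s implied
```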
