Introducing Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy

Cerebras and Neural Magic have achieved a major milestone in the field of large language models (LLMs). By combining state-of-the-art pruning techniques, sparse pretraining, and purpose-built hardware, we have unlocked unprecedented levels of sparsity in LLMs, enabling up to 70% parameter reduction without compromising accuracy. This breakthrough paves the way for more efficient training and deployment of LLMs, making them accessible to a broader range of organizations and industries.

The quest for sparsity in deep learning models has been an ongoing endeavor, with the goal of reducing computational and memory requirements. Pruning techniques, which remove less important weights, have proven effective at shrinking computer vision models. Applying these methods to LLMs, however, has so far not yielded great results. LLMs operate on high-dimensional data and require a vast number of parameters to capture the complexity and nuance of language. Removing weights through pruning can disrupt the delicate balance and relationships among these parameters, leading to a significant loss in accuracy, particularly on downstream tasks such as chat and coding. This accuracy degradation, combined with the complexity of sparse training, is why no major LLM today employs sparsity.
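To make the idea of removing less important weights concrete, here is a minimal sketch of unstructured magnitude pruning in PyTorch. It illustrates the general technique only and is not the pruning recipe used for Sparse Llama; the function name, layer sizes, and 70% target are assumptions for the example.

```python
# Minimal sketch of unstructured magnitude pruning (illustrative only).
# Weights with the smallest absolute values are zeroed out until the
# requested sparsity level is reached.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    num_to_prune = int(weight.numel() * sparsity)
    if num_to_prune == 0:
        return weight
    # Threshold is the magnitude of the k-th smallest weight.
    threshold = weight.abs().flatten().kthvalue(num_to_prune).values
    mask = weight.abs() > threshold
    return weight * mask

# Example: prune a toy linear layer to roughly 70% sparsity.
layer = torch.nn.Linear(512, 512)
pruned = magnitude_prune(layer.weight.data, sparsity=0.70)
print(f"sparsity achieved: {(pruned == 0).float().mean().item():.2%}")
```

Because the zeros land wherever the small-magnitude weights happen to be, the resulting sparsity pattern is irregular, which is exactly the kind of pattern that is hard to accelerate on current GPUs, as discussed next.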

Another key hindrance for sparsity research has been the limited support for sparsity on GPU hardware. GPUs such as the H100 offer only a single, narrow sparsity option: 2:4 structured sparsity, in which 2 out of every 4 adjacent weights must be zero. This constraint is a significant limitation for LLMs, whose important weights are highly varied by nature and rarely follow such a predictable pattern. As a result, GPU sparsity is rarely used for LLMs, as it fails to capture the intricate and diverse sparsity patterns present in these models.
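To see how rigid this constraint is, the sketch below enforces a 2:4 pattern by keeping only the two largest-magnitude weights in every group of four adjacent weights. The function name and tensor shapes are hypothetical, and real GPU kernels apply the pattern in hardware; the sketch just shows the shape of the constraint itself.

```python
# Minimal sketch of the 2:4 structured sparsity pattern described above
# (illustrative only): in every group of 4 adjacent weights, at most 2
# may remain nonzero.
import torch

def enforce_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep only the 2 largest-magnitude weights in each group of 4 adjacent weights."""
    assert weight.shape[-1] % 4 == 0, "last dimension must be divisible by 4"
    groups = weight.reshape(-1, 4)
    # Indices of the 2 largest-magnitude entries in each group of 4.
    keep = groups.abs().topk(k=2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_24 = enforce_2_to_4(w)
print((w_24 == 0).float().mean().item())  # 0.5
```

Note that the result is always exactly 50% sparse, with the surviving weights forced into fixed slots regardless of where the model's most important weights actually sit, which is why this pattern cannot express the higher, irregular sparsity levels targeted here.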
