Large language models (LLMs) are approaching the limits of traditional scaling: billions of parameters are added for relatively small accuracy gains, and advanced quantization techniques squeeze out the last possible bits before accuracy plummets. These dense architectures remain large, costly, and resource-intensive, making AI challenging and expensive to scale. Neural Magic is doubling down on sparse LLMs to meet this challenge, reducing model size by removing unneeded connections while retaining accuracy. Sparse models, though underexplored in the LLM space due to the high compute demands of pretraining, offer an increasingly promising dimension of model compression and efficiency.
Sparse-Llama-3.1-8B-2of4 is our next step in this commitment: a 50%-pruned version of Meta's open-source Llama 3.1 8B. Built with a GPU-friendly 2:4 sparsity structure, it zeroes two weights in every contiguous group of four while preserving accuracy. Designed as a versatile foundation model for fine-tuning and instruction alignment, Sparse Llama is optimized for both speed and efficiency. Its quantization-friendly architecture enables faster, cheaper inference with roughly half the connections of its dense counterpart.
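To make the 2:4 pattern concrete, the sketch below prunes a weight matrix so that at most two of every four consecutive weights remain nonzero. It is a minimal, magnitude-based illustration (the function name `apply_2of4_sparsity` and the keep-the-two-largest criterion are ours for demonstration only); SparseGPT, discussed below, chooses which weights to remove with a more sophisticated, loss-aware criterion.

```python
import torch

def apply_2of4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Enforce 2:4 structured sparsity: in every contiguous group of four
    weights along the input dimension, keep the two largest-magnitude
    entries and zero the other two (illustrative criterion only)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dimension must be divisible by 4"

    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the two largest-magnitude weights in each group of four.
    _, keep_idx = groups.abs().topk(2, dim=-1)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_features, in_features)

# Example: prune a random linear layer's weights to the 2:4 pattern.
w = torch.randn(8, 16)
w_sparse = apply_2of4_sparsity(w)
# At most two nonzeros remain in each group of four.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=-1).max() <= 2
```

Because the pattern is fixed (two of four within each group), the nonzero weights can be stored compactly and mapped directly onto GPU sparse tensor cores, which is what makes 2:4 sparsity hardware-friendly in practice.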
Sparse Llama 3.1 builds on years of prior research, including breakthroughs with SparseGPT, SquareHead knowledge distillation, and Sparse Llama 2. These contributions laid the groundwork for our state-of-the-art sparse training approach, tailored to the latest generation of LLMs. Leveraging SparseGPT, developed in collaboration with ISTA, we efficiently removed redundant connections, while SquareHead's layerwise knowledge distillation and Sparse Llama 2's foundational training recipes provided the basis for sparsity optimization.

Working with the latest LLMs requires more than applying existing techniques. These models, pushed to the edge of training scaling laws, are highly sensitive to sparsity. We iteratively refined our methods to overcome this, starting with meticulous curation of publicly available datasets. By sourcing and filtering only the highest-quality, most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens, drastically cutting the environmental impact of further training while preserving performance.

This curated dataset, combined with advances in our pruning and sparse training recipes, allowed training to converge in just 26 hours on 32 H100 GPUs, demonstrating the efficiency and scalability of our approach while delivering a model optimized for real-world deployments.
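To illustrate the layerwise knowledge-distillation idea behind SquareHead, here is a minimal sketch: each layer of the sparse student is pushed to match the corresponding hidden states of the dense teacher, with each layer's squared error normalized by the teacher's own magnitude so that all layers contribute on a comparable scale. This is a simplified rendering under our own assumptions (the function name, the normalization term, and the per-layer averaging are illustrative), not the exact published recipe.

```python
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(student_hidden, teacher_hidden, eps=1e-6):
    """Simplified layerwise distillation objective: per-layer MSE between
    student and teacher hidden states, normalized by the teacher's mean
    squared activation so every layer is weighted comparably."""
    loss = 0.0
    for s, t in zip(student_hidden, teacher_hidden):
        t = t.detach()                      # teacher provides fixed targets
        mse = F.mse_loss(s, t)              # match this layer's activations
        scale = t.pow(2).mean() + eps       # normalize by teacher magnitude
        loss = loss + mse / scale
    return loss / len(student_hidden)

# Example usage, assuming HuggingFace-style models that can return
# per-layer hidden states:
# student_out = student(input_ids, output_hidden_states=True)
# teacher_out = teacher(input_ids, output_hidden_states=True)
# kd_loss = layerwise_distillation_loss(student_out.hidden_states,
#                                       teacher_out.hidden_states)
```

In practice, a per-layer objective like this gives the pruned student a dense training signal at every depth of the network, which is part of why sparse models can recover accuracy from a relatively small, well-curated token budget.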