Model quantization is a technique used to reduce the size of large neural networks, including large language models (LLMs), by lowering the precision of their weights. LLM quantization is enabled by empirical results showing that, while some operations in neural network training and inference must use high precision, in many cases significantly lower precision (float16, for example) works well enough. This reduces the overall size of the model, allowing it to run on less powerful hardware with an acceptable reduction in capability and accuracy. In this blog post, I will go over LLM quantization and cover its key points.
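To make the idea of lower-precision weights concrete, here is a minimal, framework-agnostic NumPy sketch. The shapes, values, and the simple per-tensor scaling scheme are illustrative assumptions, not how any particular LLM library implements quantization. It casts a float32 weight matrix to float16 and also applies a naive symmetric int8 quantization, then reports the memory savings and the resulting approximation error.

```python
import numpy as np

# Illustrative weight matrix standing in for one layer of a model
# (shape and values are arbitrary assumptions for this sketch).
rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((1024, 1024)).astype(np.float32)

# 1) Lower-precision float: cast float32 -> float16 (half the memory).
weights_fp16 = weights_fp32.astype(np.float16)

# 2) Naive symmetric int8 quantization: map the float range to [-127, 127]
#    with a single per-tensor scale, then dequantize to measure the error.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
dequantized = weights_int8.astype(np.float32) * scale

print(f"float32 size: {weights_fp32.nbytes / 1e6:.1f} MB")
print(f"float16 size: {weights_fp16.nbytes / 1e6:.1f} MB")
print(f"int8 size:    {weights_int8.nbytes / 1e6:.1f} MB")
print(f"mean abs error (float16): {np.abs(weights_fp32 - weights_fp16).mean():.6f}")
print(f"mean abs error (int8):    {np.abs(weights_fp32 - dequantized).mean():.6f}")
```

Real LLM quantization schemes are more sophisticated (per-channel scales, 4-bit formats, calibration), but the core trade-off is the same: less memory per weight in exchange for a small loss of precision.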
In recent years, the size of neural networks has grown dramatically, enabling increasingly advanced capabilities. Large Language Models (LLMs) such as GPT-4 and Falcon are renowned for their power: they can understand code and answer questions. However, while a larger model offers more capabilities, it also demands more expensive and more plentiful hardware.
Several techniques, including model distillation and quantization, have been introduced to address the challenge of reducing model size. This blog post focuses on quantization, but it is worth noting that it is just one of the tools available for making models more efficient.