This repository contains code for QTIP, a weight-only large language model (LLM) quantization method that achieves a state-of-the-art combination of quantization quality and speed. QTIP uses incoherence processing to make LLM weight matrices approximately i.i.d. Gaussian, and then uses trellis coded quantization (TCQ) to quantize these weights with near-optimal distortion. QTIP addresses the inherent slowness of naive TCQ by introducing a series of novel compute-based codes for use with the "bitshift trellis." For more details, please see the paper.
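As a rough illustration of the "bitshift trellis" idea, the sketch below decodes a bit sequence by sliding an L-bit window forward k bits per weight, so that consecutive states share L - k bits. This is a minimal sketch, not the code in lib/codebook/bitshift.py: the function names, the toy parameters, and the random Gaussian lookup table (a stand-in for the compute-based codes) are all assumptions made for illustration, and details such as vector-valued states and sequence termination are ignored.

```python
import random


def bitshift_states(bits, L, k):
    """Return the trellis states visited by sliding an L-bit window k bits at a time.

    Hypothetical helper for illustration; `bits` is a list of 0/1 integers.
    """
    if len(bits) < L or (len(bits) - L) % k != 0:
        raise ValueError("len(bits) must equal L + k * (num_weights - 1)")
    states = []
    for start in range(0, len(bits) - L + 1, k):
        state = 0
        for b in bits[start:start + L]:
            state = (state << 1) | b  # pack the window into an integer, MSB first
        states.append(state)
    return states


def decode(bits, L, k, codebook):
    """Map each state to a reconstructed weight via a table with 2**L entries."""
    return [codebook[s] for s in bitshift_states(bits, L, k)]


# Toy parameters (assumed): 4-bit states, 2 new bits per weight.
L, k = 4, 2
rng = random.Random(0)
# Stand-in codebook: one Gaussian sample per state. A compute-based code
# would derive the value from the state instead of storing a table.
codebook = [rng.gauss(0.0, 1.0) for _ in range(2 ** L)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
states = bitshift_states(bits, L, k)   # windows 1011, 1100, 0010
weights = decode(bits, L, k, codebook)
```

Because each state overlaps the previous one, the low L - k bits of one state equal the high L - k bits of the next, which is what lets a decoder recover every state from a single packed bit stream with shifts and masks.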
This codebase is based on the QuIP# codebase, with modifications made to support trellis quantization. The main QTIP code is in lib/codebook/bitshift.py, and the QuIP# algorithm files have been merged into lib/algo/finetune.py. Example scripts can be found in examples/.
To use this codebase, install the packages in requirements.txt with pip install -r requirements.txt. If you have issues installing fast-hadamard-transform, try building it from source.