MiniLLM is a minimal system for running modern LLMs on consumer-grade GPUs. While llama-cpp allows running LLMs on Apple hardware, MiniLLM enables running a larger set of models on most recent Nvidia GPUs.
Under the hood, MiniLLM uses the GPTQ algorithm to compress model weights to as few as 3 bits, which greatly reduces GPU memory usage. See the hardware requirements for more information on which LLMs are supported by various GPUs.
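To get a feel for why lower bit widths matter, here is a rough back-of-the-envelope sketch. It is not part of MiniLLM: the helper name `weight_memory_gib` and the 7-billion-parameter count are assumptions chosen for illustration. It counts only the raw weights and ignores activations, the attention cache, and the scales and zero points a quantized format stores, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB.

    Hypothetical helper for illustration; ignores activations and
    quantization metadata, so it is a lower bound on real usage.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed model size: 7 billion parameters.
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which can be the difference between a model fitting on a single consumer GPU or not.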
This example is based on an old OpenAI prompt. See below for additional examples, including automatic essay generation and chain-of-thought prompting.
Any UNIX environment supporting Python (3.8 or greater) and PyTorch (we tested with 1.13.1+cu116) can run MiniLLM. See requirements.txt for details.
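Before installing, it can help to confirm that the environment meets these requirements. The snippet below is a minimal sketch rather than a script shipped with MiniLLM; the function names `python_ok` and `describe_torch` are hypothetical.

```python
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 8)) -> bool:
    """Return True if the interpreter is at least Python 3.8."""
    return tuple(version_info[:2]) >= minimum


def describe_torch() -> str:
    """Report the installed PyTorch version and whether CUDA is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    return f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


print("Python OK:", python_ok())
print(describe_torch())
```

If CUDA is reported as unavailable, check the GPU driver and the PyTorch build before going further, since the quantized models depend on a CUDA kernel.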
Note that the installation process compiles and installs a custom CUDA kernel that is necessary to run quantized models. We also use an experimental fork of the transformers library with support for LLaMA models.