
LLaMA Now Goes Faster on CPUs


I just wrote 84 new matrix multiplication kernels for llamafile that enable it to read prompts and images faster. Compared to llama.cpp, prompt evaluation with llamafile should be anywhere from 30% to 500% faster when using F16 and Q8_0 weights on CPU. The improvements are most dramatic for ARMv8.2+ (e.g. RPi 5), Intel (e.g. Alder Lake), and AVX512 (e.g. Zen 4) computers. My kernels go 2x faster than MKL for matrices that fit in L2 cache, which is why they're still a work in progress: the speedup works best for prompts having fewer than 1,000 tokens.
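The actual kernels are hand-specialized per instruction set (AVX2, AVX512, ARM NEON, and so on), but the core trick is portable. Below is a minimal sketch of that register-blocking idea, not llamafile's real code: compute a small tile of the output at a time, so each input element feeds several multiply-adds before leaving registers and the working set stays inside L2 cache. The function name and the 4x4 tile size are my own illustrative choices.

```cpp
#include <cstddef>

// Illustrative tiled matmul: C[m][n] += A[m][k] * B[k][n], all row-major.
// For brevity this sketch assumes m and n are multiples of 4.
void sgemm_tiled(const float *A, const float *B, float *C,
                 size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i += 4) {
        for (size_t j = 0; j < n; j += 4) {
            float acc[4][4] = {};  // 16 accumulators the compiler can keep in registers
            for (size_t l = 0; l < k; ++l)
                for (size_t di = 0; di < 4; ++di)
                    for (size_t dj = 0; dj < 4; ++dj)
                        // Each element of A feeds 4 output columns and each
                        // element of B feeds 4 output rows, raising the
                        // flop-per-byte ratio versus the naive triple loop.
                        acc[di][dj] += A[(i + di) * k + l] * B[l * n + (j + dj)];
            for (size_t di = 0; di < 4; ++di)
                for (size_t dj = 0; dj < 4; ++dj)
                    C[(i + di) * n + (j + dj)] += acc[di][dj];
        }
    }
}
```

With -O3 and FMA enabled, a compiler turns the inner loops into vector fused multiply-adds; the production kernels write those intrinsics by hand for each ISA.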

Background

llamafile is a local LLM project I started with Mozilla back in November 2023. We're using Cosmopolitan Libc to package llama.cpp as a single-file cross-platform binary that runs on six OSes across AMD64 and ARM64, while making only gentle modifications to the upstream code. I believe that by improving the core technology, we can give our users the best possible llama.cpp experience while helping both projects reach a broader audience. Mozilla has been giving me the resources to do this.

Performance Gains on Enterprise Hardware

When I first got into LLMs, my workstation was an austere Hewlett Packard running Alpine with a spinning disk, slow RAM, an AVX2 processor, and no GPU. What I liked about llama.cpp is that it was the first LLM project that cared about people like me. So I started volunteering full time and collaborated with guys like Slaren to introduce mmap() support, which made weights load instantly using half as much RAM (a sketch of the idea follows the table below). It was a leap forward for local LLMs at the time, but it did little to improve evaluation speed. Most of the inference code was written by Georgi Gerganov himself, and it's so good that it took me another year to finally improve upon it. Now that I have, let's see how much faster things go on my old Hewlett Packard.

LLM Performance on HP Intel® Core™ i9-9900 ($439) w/ 2200 MT/s RAM, c. 2020

| prompt tok/sec | eval tok/sec | model | weights data type | hardware | software |
|---:|---:|---|---|---|---|
| 28 | 7 | Mistral 7b | q4_0 | Skylake | llamafile-0.7 |
| 17 | 7 | Mistral 7b | q4_0 | Skylake | llama.cpp 2024-03-26 |
| 12 | 7 | Mistral 7b | q4_0 | Skylake | llamafile-0.6.2 |
| 32 | 4 | Mistral 7b | q8_0 | Skylake | llamafile-0.7 |
| 22 | 4 | Mistral 7b | q8_0 | Skylake | llama.cpp 2024-03-26 |
| 16 | 4 | Mistral 7b | q8_0 | Skylake | llamafile-0.6.2 |
| 23 | 2 | Mistral 7b | f16 | Skylake | llamafile-0.7 |
| 15 | 2 | Mistral 7b | f16 | Skylake | llama.cpp 2024-03-26 |
| 14 | 2 | Mistral 7b | f16 | Skylake | llamafile-0.6.2 |
| 205 | 26 | TinyLlama 1.1B | q8_0 | Skylake | llamafile-0.7 |
| 144 | 26 | TinyLlama 1.1B | q8_0 | Skylake | llama.cpp 2024-03-26 |
| 91 | 23 | TinyLlama 1.1B | q8_0 | Skylake | llamafile-0.6.2 |
| 171 | 15 | TinyLlama 1.1B | f16 | Skylake | llamafile-0.7 |
| 118 | 15 | TinyLlama 1.1B | f16 | Skylake | llama.cpp 2024-03-26 |
| 101 | 15 | TinyLlama 1.1B | f16 | Skylake | llamafile-0.6.2 |
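As promised above, here is a minimal POSIX sketch of the mmap() idea, not llama.cpp's actual loader: instead of read()ing gigabytes of weights into a heap buffer, the file is mapped read-only, so the OS pages it in lazily and keeps one shared copy in the page cache, which is where the halved RAM usage comes from.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s WEIGHTS_FILE\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }
    // Map the whole file read-only. No bytes are copied up front; pages
    // fault in on first access and live in the shared page cache, so a
    // second process mapping the same weights reuses the same memory.
    void *weights = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);
    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```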
