
llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance and binary portability: a single executable runs on the stock installs of six OSes without needing to be installed. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI-API-compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
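For orientation, here is a minimal quick-start sketch. The model filename follows the llamafile README's LLaVA example, and port 8080 is the usual llama.cpp server default; both are assumptions, so adjust them to your setup:

    # Make the single-file binary executable, then run it.
    # Launching it starts the web GUI chatbot in your browser.
    chmod +x llava-v1.5-7b-q4.llamafile
    ./llava-v1.5-7b-q4.llamafile

    # The same binary also serves an OpenAI-compatible API
    # (assuming the default port of 8080):
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "local", "messages": [{"role": "user", "content": "Hello"}]}'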

This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow, who originally invented K-quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant, since it was the simplest and most intuitive to get working with our recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest, at least on modern x86 systems.
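As a concrete illustration of switching from the legacy quant to a K-quant (a sketch; the model filenames here are hypothetical stand-ins for whatever GGUF weights you use):

    # Previously recommended for speed: legacy Q4_0
    ./llamafile -m mistral-7b-instruct-v0.2.Q4_0.gguf -p 'Why is the sky blue?'

    # Now as fast or faster on modern x86, with better accuracy: Q5_K_M
    ./llamafile -m mistral-7b-instruct-v0.2.Q5_K_M.gguf -p 'Why is the sky blue?'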

This release fixes a memory bug in the grammar parser that caused commands like ./llamafile -m foo.gguf -p bar --grammar 'root::="' (which fails to specify a closing quote) to crash. Anyone using the server as a public-facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95.
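For contrast, a well-formed grammar closes its quotes; this sketch uses llama.cpp's GBNF syntax with the same hypothetical foo.gguf model as above:

    # Crashed in prior releases: the grammar string has an unterminated quote
    ./llamafile -m foo.gguf -p bar --grammar 'root::="'

    # Well-formed equivalent: constrain the output to one of two closed literals
    ./llamafile -m foo.gguf -p 'Is water wet?' --grammar 'root ::= "yes" | "no"'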
