I wanted to be able to call llama.cpp from Python, but I didn't want to use the llama-cpp-python wrapper because it automatically downloads and builds llama.cpp. I like the simplicity of llama.cpp and its Unix philosophy of doing one thing well, so I prefer to build llama.cpp myself and then call it from Python.

I also wanted to be able to call llama.cpp remotely. That's easy as long as the llama.cpp server is already running on the remote machine. But I wanted to be able to spawn several llama.cpp processes on the remote machine, send them queries, and then kill them when I'm done. The existing solutions such as llama-cpp-python and LMQL can only execute the llama.cpp server locally.
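
To make the idea concrete, here is a minimal sketch of that workflow (not this library's actual API): spawn llama-server on a remote host over SSH, wait for its /health endpoint, send one query to its /completion endpoint, and shut it down. The host name, model path, and port are placeholders, and it assumes llama-server is on the remote PATH and the chosen port is reachable from the local machine.

```python
"""Sketch: spawn a llama.cpp server remotely, query it, then kill it."""
import subprocess
import time

import requests  # third-party: pip install requests

HOST = "gpu-box.example.com"  # hypothetical remote host
PORT = 8080

# Spawn the server on the remote machine; -tt allocates a tty so that
# terminating the ssh process also terminates the remote llama-server.
server = subprocess.Popen(
    ["ssh", "-tt", HOST,
     f"llama-server -m models/llama-3-8b.gguf --host 0.0.0.0 --port {PORT}"],
)

# Wait until the server reports it is ready to accept requests.
base = f"http://{HOST}:{PORT}"
for _ in range(60):
    try:
        if requests.get(f"{base}/health", timeout=1).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)

# Send one completion request to llama.cpp's HTTP API.
resp = requests.post(
    f"{base}/completion",
    json={"prompt": "The Unix philosophy is", "n_predict": 64},
)
print(resp.json()["content"])

# Done: kill the ssh session, taking the remote server with it.
server.terminate()
server.wait()
```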

The library exposes llama.cpp's own API rather than an OpenAI-compatible one, because there are real differences between the two. For example, OpenAI doesn't support the --grammar option, but llama.cpp does. Using the llama.cpp API also reminds the user that they can't simply change the model name (as an OpenAI drop-in replacement API would allow) and expect everything to work the same.
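
For instance, a request to llama.cpp's /completion endpoint can carry a GBNF grammar that constrains the model's output, something no OpenAI endpoint accepts. The snippet below is illustrative rather than taken from this library, and it assumes a recent llama.cpp server listening on localhost:8080.

```python
"""Sketch: a llama.cpp-only feature, constraining output with a GBNF grammar."""
import requests

# A tiny GBNF grammar that only allows "yes" or "no" as the completion.
YES_NO = 'root ::= "yes" | "no"'

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is Python dynamically typed? Answer yes or no: ",
        "grammar": YES_NO,
        "n_predict": 4,
    },
)
print(resp.json()["content"])  # constrained by the grammar to "yes" or "no"
```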

That being said, I use Pydantic to validate the flags sent to llama.cpp to create a server and to send queries to it. This means your IDE (e.g., VSCode) will display the available flags and their types when you use this library.
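
As an illustration of that approach (the field names below are a hypothetical subset, not the library's actual models), a Pydantic model for server flags gives typed, auto-completable fields and rejects invalid values before llama.cpp is ever launched:

```python
"""Sketch: validating llama.cpp server flags with Pydantic."""
from pydantic import BaseModel, Field


class ServerFlags(BaseModel):
    """A small subset of llama.cpp server flags, mapped to typed fields."""
    model: str                         # path to the .gguf file (--model)
    port: int = 8080                   # --port
    ctx_size: int = Field(4096, gt=0)  # --ctx-size, must be positive
    n_gpu_layers: int = 0              # --n-gpu-layers

    def to_args(self) -> list[str]:
        """Render the flags as a llama-server command line."""
        return [
            "llama-server",
            "--model", self.model,
            "--port", str(self.port),
            "--ctx-size", str(self.ctx_size),
            "--n-gpu-layers", str(self.n_gpu_layers),
        ]


flags = ServerFlags(model="models/llama-3-8b.gguf", n_gpu_layers=35)
print(flags.to_args())
# ServerFlags(model="models/llama-3-8b.gguf", ctx_size=-1) raises a ValidationError.
```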
