This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.
You can install the Python dependencies using pip:

```bash
pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
```
Most open-source LLMs support only text input and text output. Since we want to build a voice-in, voice-out system, that would normally require two additional models: one to (1) convert the speech into text before it's fed into the LLM, and another to (2) convert the LLM's output back into speech.
By using a multimodal LLM like Qwen2-Audio, we can get away with a single model that turns speech input directly into a text response, leaving only a second model to convert that response back into speech.
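To make that first step concrete, here is a minimal sketch of prompting Qwen2-Audio with a local recording via Transformers. It is based on the model's standard chat-template usage; the file name `input.wav`, the text prompt, and the generation settings are placeholders you would adapt to your server code.

```python
# Sketch: feed a local WAV file plus a short text instruction to Qwen2-Audio
# and get a text reply back. "input.wav" is a placeholder path.
import numpy as np
from pydub import AudioSegment
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Load the recording and resample it to the rate the processor expects (16 kHz).
sr = processor.feature_extractor.sampling_rate
seg = AudioSegment.from_file("input.wav").set_frame_rate(sr).set_channels(1)
samples = np.array(seg.get_array_of_samples()).astype(np.float32) / (1 << (8 * seg.sample_width - 1))

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "input.wav"},
        {"type": "text", "text": "Please answer the question in the recording."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Note: newer Transformers releases rename the `audios=` kwarg to `audio=`.
inputs = processor(text=prompt, audios=[samples], return_tensors="pt", padding=True).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
generated = generated[:, inputs.input_ids.size(1):]  # drop the prompt tokens from the output
reply = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(reply)
```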
This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but also tends to produce better results, since the audio is passed straight to the LLM instead of going through a lossy speech-to-text step first.
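The second model then handles text-to-speech. Below is a minimal sketch of that step using Bark's standard Python API; it assumes the `reply` string from the previous snippet, and the voice preset is just one of Bark's built-in speaker prompts.

```python
# Sketch: turn the LLM's text reply into speech with Bark and save it as a WAV file.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads and caches the Bark checkpoints on first run

# `reply` is assumed to be the text produced by Qwen2-Audio above.
audio_array = generate_audio(reply, history_prompt="v2/en_speaker_6")  # built-in English voice preset
write_wav("reply.wav", SAMPLE_RATE, audio_array)
```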