AI voice assistants have revolutionized our interaction with technology, answering queries, performing tasks, and making life easier. However, the stilted, mechanical nature of their responses remains a barrier to truly immersive conversational experiences. Unlike human conversation partners, they often struggle to understand and adapt to the nuanced, emotional, and context-rich nature of human dialogue, which leads to noticeable latency and a disjointed conversational flow. As a result, exchanges often feel unsatisfying, lacking emotional resonance and awareness of context.
To overcome these limitations, LAION teamed up with the ELLIS Institute Tübingen, Collabora, and the Tübingen AI Center to build BUD-E (Buddy for Understanding and Digital Empathy).
We started development by creating a baseline voice assistant with very low latency. To reach that point, we carefully evaluated the latency and quality of a wide range of speech-to-text, large language, and text-to-speech models on consumer hardware, and we carefully control how the models interact. Right now (January 2024), we reach latencies of 300 to 500 ms (with a Phi-2 model). We are confident that response times below 300 ms will be possible in the near future, even with larger models like Llama 2 30B.
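To give a feel for what measuring per-stage latency can look like, here is a minimal sketch in Python. It is not BUD-E's actual code: the three stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins that simply sleep for a few milliseconds, and the helper names `timed` and `measure_pipeline` are our own choices for illustration. In a real setup, each stand-in would be replaced by a call into the corresponding model.

```python
import time
from typing import Callable, Dict, Tuple


def timed(stage: Callable[[str], str], payload: str) -> Tuple[str, float]:
    """Run one stage and return its output plus elapsed wall-clock time in ms."""
    start = time.perf_counter()
    result = stage(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Hypothetical stand-ins for real models; the sleeps only simulate work.
def fake_stt(audio: str) -> str:
    time.sleep(0.01)
    return "transcript of " + audio


def fake_llm(prompt: str) -> str:
    time.sleep(0.02)
    return "reply to " + prompt


def fake_tts(text: str) -> str:
    time.sleep(0.01)
    return "audio for " + text


def measure_pipeline(audio: str) -> Dict[str, float]:
    """Chain STT -> LLM -> TTS sequentially and record each stage's latency in ms."""
    latencies: Dict[str, float] = {}
    payload = audio
    for name, stage in (("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)):
        payload, latencies[name] = timed(stage, payload)
    # End-to-end latency of this sequential chain is the sum of the stages.
    latencies["total"] = sum(latencies.values())
    return latencies


if __name__ == "__main__":
    for stage_name, ms in measure_pipeline("clip.wav").items():
        print(f"{stage_name}: {ms:.1f} ms")
```

Breaking the total down by stage like this makes it easier to see which model dominates the response time. Note that the sketch runs the stages strictly one after another; a pipeline that streams partial results between stages would need a different measurement, such as the time until the first audio chunk is produced.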