The new Azure OpenAI gpt-4o-realtime-preview model opens the door for even more natural application user interfaces with its speech-to-speech capability.
This new voice-based interface also brings an interesting new challenge with it: how do you implement retrieval-augmented generation (RAG), the prevailing pattern for combining language models with your own data, in a system that uses audio for input and output?
In this blog post we present a simple architecture for voice-based generative AI applications that enables RAG on top of the real-time audio API with full-duplex audio streaming from client devices, while securely handling access to both the model and the retrieval system.
These two building blocks work in coordination: the real-time API knows not to move a conversation forward if there are outstanding function calls. When the model needs information from the knowledge base to respond to input, it emits a “search” function call. We turn that function call into an Azure AI Search “hybrid” query (keyword + vector + semantic reranking), get the content passages that best relate to what the model needs to know, and send them back to the model as the function’s output. Once the model sees that output, it responds via the audio channel, moving the conversation forward.
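To make the flow concrete, here is a minimal sketch of that round trip over the realtime WebSocket connection: register a “search” tool, watch for function-call items from the model, answer them with retrieved passages, and ask the model to continue over the audio channel. The endpoint format, deployment name, tool schema, and the `run_search` helper are illustrative assumptions rather than code from the post; `run_search` itself is sketched further below, and audio streaming is omitted.

```python
# Sketch of the realtime function-calling round trip (names and endpoint
# format are illustrative assumptions, not taken from the post).
import json
import os

import websockets  # pip install websockets

AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]      # e.g. "my-resource.openai.azure.com"
AZURE_OPENAI_DEPLOYMENT = os.environ["AZURE_OPENAI_DEPLOYMENT"]  # gpt-4o-realtime-preview deployment
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]

SEARCH_TOOL = {
    "type": "function",
    "name": "search",
    "description": "Search the knowledge base for passages relevant to the user's question.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


async def run_session(run_search):
    url = (
        f"wss://{AZURE_OPENAI_ENDPOINT}/openai/realtime"
        f"?api-version=2024-10-01-preview&deployment={AZURE_OPENAI_DEPLOYMENT}"
    )
    # websockets>=14 uses additional_headers; older releases call it extra_headers.
    async with websockets.connect(url, additional_headers={"api-key": AZURE_OPENAI_API_KEY}) as ws:
        # Register the tool so the model can emit "search" calls when it needs grounding data.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "Answer only from the knowledge base. Always call 'search' first.",
                "tools": [SEARCH_TOOL],
                "tool_choice": "auto",
            },
        }))

        async for message in ws:
            event = json.loads(message)

            # A completed output item of type "function_call" is the model asking for a search.
            if event["type"] == "response.output_item.done" and event["item"]["type"] == "function_call":
                args = json.loads(event["item"]["arguments"])
                passages = run_search(args["query"])  # retrieval helper, sketched below

                # Return the passages as the function's output...
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["item"]["call_id"],
                        "output": json.dumps(passages),
                    },
                }))
                # ...and ask the model to continue, now grounded, over the audio channel.
                await ws.send(json.dumps({"type": "response.create"}))

            # Full-duplex audio streaming to and from the client device is omitted for brevity.
```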
A critical element in this picture is fast and accurate retrieval. The search call happens between the user's turn and the model's response in the audio channel, a latency-sensitive point in the conversation. Azure AI Search is the perfect fit for this, with its low latency for vector and hybrid queries and its support for semantic reranking to maximize the relevance of responses.
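As a sketch of what that retrieval call can look like with the azure-search-documents SDK, the `run_search` helper below issues a hybrid query (keyword plus vector) with the semantic reranker enabled and returns only a handful of passages to keep the turn fast. The index name, field names, and semantic configuration name are placeholders, and the vector part assumes integrated vectorization is configured on the index.

```python
# Hybrid retrieval sketch: keyword + vector query with semantic reranking.
# Index name, field names, and configuration names are placeholder assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="knowledge-base",
    credential=AzureKeyCredential("<search-api-key>"),
)


def run_search(query: str, top: int = 5) -> list[dict]:
    """Run a hybrid query with semantic reranking and return a few short passages."""
    results = search_client.search(
        search_text=query,  # keyword (BM25) part of the hybrid query
        vector_queries=[
            # Assumes integrated vectorization on the index; otherwise embed the
            # query yourself and pass a VectorizedQuery instead.
            VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="embedding"),
        ],
        query_type="semantic",                  # apply the semantic reranker
        semantic_configuration_name="default",  # placeholder configuration name
        select=["title", "chunk"],              # placeholder field names
        top=top,                                # a handful of passages keeps latency low
    )
    return [{"title": r["title"], "chunk": r["chunk"]} for r in results]
```

Keeping `top` small matters here: the passages become the function output the model reads before it can speak, so returning only the best-reranked chunks keeps both retrieval and the model's grounded response fast.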