Retrieval Augmented Generation (RAG) is a popular technique for getting LLMs to provide answers that are grounded in a data source. But what do you do when your knowledge base includes images, like graphs or photos? By adding multimodal models into your RAG flow, you can get answers based on image sources, too!
Our most popular RAG solution accelerator, azure-search-openai-demo, now has support for RAG on image sources. In the example question below, the LLM answers the question by correctly interpreting a bar graph:
This blog post will walk through the changes we made to enable multimodal RAG, both so that developers using the solution accelerator can understand how it works, and so that developers using other RAG solutions can bring in multimodal support.
Azure now offers multiple multimodal LLMs: gpt-4o and gpt-4o-mini, through the Azure OpenAI service, and phi3-vision, through the Azure AI Model Catalog. These models accept both images and text as input and return text responses. (In the future, we may have LLMs that take audio input and return non-text outputs!)
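As a quick illustration, here's a minimal sketch of sending an image alongside a text question to a gpt-4o deployment using the openai Python package. The endpoint, deployment name, and image filename are placeholders, so substitute your own resources:

```python
import base64
import os

from openai import AzureOpenAI

# Placeholder endpoint/key environment variables -- use your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

# Encode a local image as a base64 data URL so it can be sent inline with the question.
with open("bar_graph.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # the name of your gpt-4o deployment
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this bar graph show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The key difference from a text-only call is that the user message's content is a list mixing text parts and image parts, rather than a single string.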