
We've seen an influx of powerful multimodal capabilities in many LLMs, notably GPT-4o and Gemini. Moving forward, most of the modalities won't be "searchable" in the traditional sense - using human-labelled tags or descriptions to retrieve relevant video or audio is not a scalable solution for multimodal RAG. We need to use dense vectors as semantic representations for all modalities of data. If you'd like to follow along but aren't 100% familiar with RAG just yet, LlamaIndex provides an excellent yet concise RAG overview.

In this example, we'll vectorize audio, text, and images into the same embedding space with ImageBind, store the vectors in Milvus Lite, retrieve all relevant data given a query, and pass the retrieved multimodal data as context to Meta's Chameleon 7B, a large multimodal model (LMM).
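At a high level, the embed-and-store step looks like the sketch below. This is a minimal illustration, not the full notebook: the collection name, file paths, and sample inputs are placeholders, and error handling is omitted. It uses the standard ImageBind loaders and the pymilvus MilvusClient, which backs onto a local Milvus Lite database file.

```python
# Minimal sketch: embed text, image, and audio with ImageBind, store in Milvus Lite.
# Collection name, file paths, and sample data are illustrative assumptions.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType
from pymilvus import MilvusClient

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (joint 1024-dim embedding space).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval().to(device)

# Embed one sample from each modality into the same vector space.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a DJ warming up the crowd"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["frames/frame_0001.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["audio/clip_0001.wav"], device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Store the vectors in Milvus Lite (a local .db file) for later retrieval.
client = MilvusClient("multimodal_rag.db")
client.create_collection(collection_name="video_chunks", dimension=1024)
client.insert(
    collection_name="video_chunks",
    data=[{
        "id": 0,
        "vector": embeddings[ModalityType.VISION][0].cpu().numpy().tolist(),
        "path": "frames/frame_0001.jpg",
        "modality": "vision",
    }],
)

# At query time, embed the question the same way and search the collection.
with torch.no_grad():
    query_emb = model({ModalityType.TEXT: data.load_and_transform_text(["What happened on stage?"], device)})
hits = client.search(
    collection_name="video_chunks",
    data=[query_emb[ModalityType.TEXT][0].cpu().numpy().tolist()],
    limit=3,
    output_fields=["path", "modality"],
)
```

Because ImageBind maps every modality into one shared space, a single text query can retrieve video frames and audio clips with the same nearest-neighbor search.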

We'll start by specifying our imports and downloading a video we'd like to perform RAG over. For this example, let's use the 2024 Google I/O Pre-Show:
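One way to fetch and segment the video, shown as a sketch below, uses yt-dlp for the download and ffmpeg to split out frames and audio; both tools, the sampling rate, and the frame interval are assumptions rather than the original notebook's exact setup, and the YouTube video ID is left as a placeholder.

```python
# Sketch: download the video, then extract frames and audio for embedding.
# yt-dlp and ffmpeg are assumed to be installed; the video ID is a placeholder.
import os
import subprocess
from yt_dlp import YoutubeDL

VIDEO_URL = "https://www.youtube.com/watch?v=<google-io-2024-preshow-id>"  # fill in the real ID

# Download the video as an mp4 named io_preshow.mp4.
with YoutubeDL({"format": "mp4", "outtmpl": "io_preshow.%(ext)s"}) as ydl:
    ydl.download([VIDEO_URL])

os.makedirs("frames", exist_ok=True)
os.makedirs("audio", exist_ok=True)

# Sample one frame every 5 seconds for image embeddings.
subprocess.run(
    ["ffmpeg", "-i", "io_preshow.mp4", "-vf", "fps=1/5", "frames/frame_%04d.jpg"],
    check=True,
)

# Extract the audio track as 16 kHz mono WAV for audio embeddings.
subprocess.run(
    ["ffmpeg", "-i", "io_preshow.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio/preshow.wav"],
    check=True,
)
```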
