Scaling ColPali to billions of PDFs with Vespa | Vespa Blog

Submitted by Style Pass, 2024-10-17 16:30:06

This blog post takes a deep dive into scaling “ColPali: Efficient Document Retrieval with Vision Language Models” [1] to large collections of documents. We demonstrate how a phased retrieval and ranking pipeline in Vespa can scale ColPali to billions of documents. To do this, we introduce a new similarity function: a Hamming-based MaxSim that operates on the binary vectors produced by binary quantization (BQ). This technique scales ColPali to large document collections while maintaining high accuracy, with a significant reduction in computational cost and vector storage requirements. The suggested deployment also supports real-time indexing and CRUD operations.
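To make the idea of a Hamming-based MaxSim concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the Vespa implementation: the function names (`binarize`, `hamming`, `hamming_max_sim`) are hypothetical, binary quantization is assumed to threshold each dimension at zero, and the conversion from Hamming distance to similarity, `1 / (1 + distance)`, is one plausible choice rather than something this section specifies.

```python
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """Binary quantization (assumed: threshold at 0), packed into one int.

    Each dimension contributes a single bit: 1 if the value is > 0, else 0.
    """
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the two packed vectors differ."""
    return bin(a ^ b).count("1")


def hamming_max_sim(query_bits: List[int], page_bits: List[int]) -> float:
    """MaxSim over binary vectors.

    For every query token vector, take the best-matching page patch vector
    (smallest Hamming distance), convert that distance to a similarity with
    the assumed mapping 1 / (1 + distance), and sum over query tokens.
    """
    score = 0.0
    for q in query_bits:
        best_distance = min(hamming(q, p) for p in page_bits)
        score += 1.0 / (1.0 + best_distance)
    return score


# Toy 4-dimensional example: two query token vectors, two page patch vectors.
query = [binarize(v) for v in ([0.5, -0.2, 0.1, -0.9], [-0.3, 0.8, 0.2, -0.1])]
page = [binarize(v) for v in ([0.4, -0.1, 0.3, -0.5], [-1.0, 1.0, -1.0, 1.0])]
toy_score = hamming_max_sim(query, page)
```

In this toy case the first query vector matches a page vector exactly (distance 0) and the second is at distance 2 from both page vectors, so each contributes its own best match to the sum. Storing one bit per dimension instead of a 32-bit float is a 32x reduction per vector, assuming the original embeddings are float32.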

ColPali surpasses traditional text-based retrieval methods by leveraging a vision-capable language model (PaliGemma) to “see” not only the text but also the visual elements of a page, including figures, tables and infographics.

Contextualized Vision Embeddings from a Vision Language Model (VLM): ColPali generates contextualized embeddings directly from images of pages, using PaliGemma, a powerful VLM with strong visual text understanding capabilities.