This post was inspired by Merve Noyan's posts here and here on OCR-free, vision-based document AI, using the recent ColPali model. I wanted to write

Document Similarity Search with ColPali

submited by

Style Pass

2024-09-29 14:00:07

This post was inspired by Merve Noyan's posts here and here on OCR-free, vision-based document AI, using the recent ColPali model. I wanted to write a more fleshed-out version, focusing on a different angle of using vision language models for document AI: similarity-based retrieval.

Given a document page, you can use ColPali to find all document pages that are similar to that example page. This type of search gives you an entry point into large, unlabeled (or mislabeled) document archives. Similarity-based retrieval may be an initial step in 'taming' legacy corporate document stores, for example, to label a subset of business documents for subsequent fine-tuning for classification or information extraction tasks.

Document retrieval addresses the needle-in-the-haystack problem: You have a large document corpus of possibly millions, or more, documents, some containing many pages, and you want to find the documents, or even the pages, most relevant to your query.

On the Web, that's what search engines do. Search engines excel at searching not only textual HTML documents, but images, videos, and audio content.