With the exception of OpenAI (whose text-embedding-3 models from January 2024 are ancient given the pace of AI progress), all the prominent commercial vector embedding vendors released a new version of their flagship models in late 2024 or early 2025.
Here’s how the latest and greatest proprietary and open-source models stack up against each other in DataStax Astra DB Vector Search.
Models marked with * are trained with Matryoshka techniques, meaning they are designed to keep the most important information in the leading dimensions of the output vector, so the vector can be truncated while preserving most of the semantic information. I only evaluated the largest, most accurate sizes for these models.
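To make the truncation idea concrete, here is a minimal sketch in plain Python. The function names (`truncate_embedding`, `cosine`) and the four-dimensional toy vectors are my own illustration, not any vendor's API or real model output. It assumes the usual approach of keeping the first `dims` components and rescaling to unit length so that cosine similarity still behaves.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length.

    Hypothetical helper for illustration; real SDKs often expose a
    dimensions parameter that does this server-side.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors that mimic a Matryoshka-style layout:
# most of the magnitude sits in the leading dimensions.
full_a = [0.9, 0.3, 0.05, -0.02]
full_b = [0.8, 0.4, -0.03, 0.04]

short_a = truncate_embedding(full_a, 2)
short_b = truncate_embedding(full_b, 2)

print(cosine(full_a, full_b))    # similarity at full size
print(cosine(short_a, short_b))  # similarity after truncating to 2 dims
```

With vectors shaped like this, the two similarity scores come out close to each other, which is the property Matryoshka training aims for. A model trained without it gives no such guarantee, since the discarded dimensions may carry as much signal as the ones kept.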
These are test sets from the ViDoRe image search benchmark, OCR’d using Gemini 1.5 Flash. Details on the datasets can be found in section 3.1 of the ColPali paper. Notably, the TabFQuAD and Shift Project sources are in French; the rest are in English.
I picked these because most, if not all, of the classic text-search datasets have ended up in the training data of model developers who care more about topping the MTEB leaderboard than about building something actually useful. By OCRing data from image search datasets instead, I believe I was able to give these models data that they haven’t seen before.