One major pain point of building RAG applications is that they require a lot of experimentation and tuning, and there are hardly any good benchmarks for evaluating the accuracy of the retrieval step on its own.

The preprocessing step of the RAG pipeline is particularly painful and hard to evaluate. The chunking step is crucial and determines how the information is going to be retrieved, but there are no benchmarks to evaluate which chunking strategy works best. For example, there is no good way of answering the following question:

“Which chunking strategy leads to the highest faithfulness of the retrieval while also maximizing the signal-to-noise ratio of the retrieved chunks?”
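There is no standard way to compute either quantity, but a rough sketch of what such a benchmark would need to measure could look like the following: given a gold excerpt that actually answers a query, the fraction of the excerpt covered by the retrieved chunks serves as a proxy for faithfulness, and the fraction of the retrieved text that overlaps the excerpt serves as a proxy for signal-to-noise. Both token-overlap definitions here are assumptions made for illustration, not an established metric.

```python
def evaluate_retrieval(retrieved_chunks: list[str], gold_excerpt: str) -> dict:
    """Rough token-overlap proxies; both definitions are assumptions, not a standard."""
    gold_tokens = set(gold_excerpt.lower().split())
    retrieved_tokens = set(" ".join(retrieved_chunks).lower().split())
    overlap = gold_tokens & retrieved_tokens
    return {
        # Faithfulness proxy: how much of the gold excerpt the retrieved chunks cover.
        "faithfulness": len(overlap) / len(gold_tokens) if gold_tokens else 0.0,
        # Signal-to-noise proxy: how much of the retrieved text is actually relevant.
        "signal_to_noise": len(overlap) / len(retrieved_tokens) if retrieved_tokens else 0.0,
    }

print(evaluate_retrieval(
    retrieved_chunks=["establish Justice, insure domestic Tranquility"],
    gold_excerpt="insure domestic Tranquility",
))
```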

To implement a RAG pipeline, the first step is always to split the document into small chunks and embed each chunk into a vector database for retrieval. However, splitting a document into chunks is difficult. If you simply split every 100 characters, you break up the meaning. For example, chunking a snippet of the U.S. Constitution this way produces fragments that cut off mid-word and mid-clause, as the sketch below illustrates.
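Here is a minimal sketch of that naive fixed-size split. The preamble is used as a stand-in excerpt, since the original snippet is not reproduced here.

```python
# Naive fixed-size chunking: split every 100 characters, ignoring word and
# sentence boundaries. The preamble is a stand-in for the original excerpt.
text = (
    "We the People of the United States, in Order to form a more perfect Union, "
    "establish Justice, insure domestic Tranquility, provide for the common defence, "
    "promote the general Welfare, and secure the Blessings of Liberty to ourselves "
    "and our Posterity, do ordain and establish this Constitution for the United "
    "States of America."
)

chunks = [text[i:i + 100] for i in range(0, len(text), 100)]

for chunk in chunks:
    print(repr(chunk))
# Every chunk ends mid-word or mid-clause, so no single chunk carries a
# self-contained piece of meaning.
```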

These chunks completely destroy the ability to understand the content of the document. The current SoTA is to write a regex to split by line or sentence, and then group sentences using embedding similarity. This is called "semantic" chunking. However, the regex requires custom work for every new document type you want to support, and when it breaks, the results are poor.
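A minimal sketch of that kind of semantic chunking, assuming a sentence-transformers model is available (the model name and the 0.7 similarity threshold are arbitrary illustrative choices):

```python
import re

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative model and threshold; both are assumptions, not recommendations.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.7

def semantic_chunk(text: str) -> list[str]:
    # 1. Regex split into sentences (the brittle, per-document-type part).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences)

    # 2. Group consecutive sentences as long as they stay similar to the previous one.
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if sim >= SIMILARITY_THRESHOLD:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks
```

The regex on the first line of the function is exactly the part that has to be rewritten for each new document type, which is where this approach tends to break down.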
