
Exploring semantic chunking for RAG-LLM systems

2024-11-21 21:30:05

When building LLM agents and systems, retrieval-augmented generation (RAG) lets custom source data feed into LLM responses: an information retrieval system is queried to supply dynamic contextual information to the LLM. Converting unstructured source documents into a useful retrieval system requires many choices about preprocessing, populating, and querying that system. In this post we explore a critical step in building a RAG system: converting a document into "chunks." In particular, we explore a recently popular library for chunking and ask whether complex, semantically motivated chunking approaches outperform standard, simple chunking techniques for RAG retrieval. Let's go!

A fundamental building block for RAG is taking unstructured documents and embedding them into high-dimensional vectors that can be stored in a vector database. Given a user query, we can embed that as well and find the "most related" documents by smallest Euclidean distance (equivalently, highest similarity for normalized embeddings) in the vector space. But most embedding models have a maximum token length. Which makes sense! We can't hope to shove infinite amounts of text into a finite number of floating-point numbers. When these embedding models are trained, the designers choose a finite number of tokens as the maximum allowed for training documents. A common choice with transformer-based embedding models is 512 tokens.
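The retrieval step can be sketched with a few lines of NumPy. The three-dimensional vectors below are hand-made stand-ins for what a real embedding model would produce; only the distance-and-argmin logic is the point here.

```python
import numpy as np

# Toy "embeddings" -- in a real system these would come from an
# embedding model with a 512-token limit; here they are hand-made
# 3-dimensional vectors just to illustrate the retrieval step.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.1],   # doc 1
    [0.1, 0.0, 0.9],   # doc 2
])
query_embedding = np.array([1.0, 0.0, 0.1])

# Smallest Euclidean distance = "most related" document.
distances = np.linalg.norm(doc_embeddings - query_embedding, axis=1)
best = int(np.argmin(distances))  # index of the closest document
```

In practice you would swap the toy vectors for model outputs and the brute-force distance computation for a vector database's nearest-neighbor index, but the ranking logic is the same.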

What this means in practice is that if your document is longer than 512 tokens, you need to break it up. Even beyond the technical logistics, breaking up a long document is theoretically a good idea: it is a lot harder to compress a 1000-page book into a single vector embedding than a single paragraph. When documents get too long, we want to break them up to properly capture the details in the text.
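The simplest baseline chunker is a fixed-size window with overlap. Here is a minimal sketch; the whitespace split is a stand-in for a real tokenizer (a production system would count tokens with the embedding model's own tokenizer), and the `max_tokens`/`overlap` values are illustrative defaults, not prescribed by any library.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into overlapping windows of at most max_tokens."""
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window reached the end of the document
    return chunks

# Whitespace "tokens" stand in for a real tokenizer's output.
words = ("lorem ipsum " * 600).split()  # 1200 pseudo-tokens
chunks = chunk_tokens(words, max_tokens=512, overlap=64)
```

Each chunk fits under the 512-token limit, and the overlap keeps sentences that straddle a boundary from being split away from their context entirely. Semantic chunkers try to place these boundaries more intelligently; this fixed-size approach is the simple baseline they are compared against.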
