
Near-Duplicate Detection in Sycamore: What Is It Good For?

submitted by
Style Pass
2024-04-03 16:00:08

Recently, we added near-duplicate-detection (NDD) support to Sycamore. Back in the late 1990s, in the competition between various web search engines vying for market dominance, NDD was an important technique for improving relevance. Since Andrei Broder published his landmark paper in 2000, NDD has found its way into many aspects of computing. As the world turns toward generative AI, we believe that NDD continues to deliver value.

One application that NDD can improve is retrieval-augmented generation (RAG). This refers to the use of a large language model (LLM) to answer questions beyond the scope of what the LLM was trained on. RAG starts by retrieving documents from a traditional search engine. Then, it sends the query, along with a limited number of top-scoring documents as context, to an LLM. The result is an answer synthesized by the LLM from the documents. Because the effective context size is limited, it’s important for us to fill it with as rich a set of documents as possible. This is where near-duplicate removal is helpful: every slot taken by a near-copy of a document already in the context is a slot that could have carried new information.
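To make the idea concrete, here is a minimal sketch of near-duplicate removal applied to a ranked result list before it is handed to an LLM. It is not Sycamore's implementation or API. It assumes a simple stand-in technique (exact Jaccard similarity over three-word shingles), and the names `shingles`, `jaccard`, `fill_context`, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (as tuples) for a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fill_context(ranked_docs: list, limit: int, threshold: float = 0.8) -> list:
    """Walk results in rank order and keep up to `limit` documents,
    skipping any whose similarity to an already kept one is >= threshold."""
    kept = []
    kept_shingles = []
    for doc in ranked_docs:
        if len(kept) >= limit:
            break
        candidate = shingles(doc)
        if any(jaccard(candidate, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of a higher-ranked document
        kept.append(doc)
        kept_shingles.append(candidate)
    return kept


# Hypothetical ranked results: the first two differ only in the year.
results = [
    "This agreement between Bank A and State University covers "
    "the branded credit card program for 2011",
    "This agreement between Bank A and State University covers "
    "the branded credit card program for 2012",
    "Students may redeem reward points for campus bookstore purchases",
]
context = fill_context(results, limit=2)
```

With a context limit of two, the second document is skipped as a near-duplicate of the first, so the third document takes the slot instead. Comparing every candidate against every kept document is fine for a handful of top results; at corpus scale one would normally compare compact sketches rather than full shingle sets.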

To show how NDD improves retrieval and RAG in the real world, we chose a readily available dataset from data.gov. It contains marketing agreements between credit card companies and colleges for branded cards from 2011 through 2019. As one might imagine, a lot of these documents are rather similar, both across years and across cards. Here are some examples:
