Suppose we have a large collection of documents, and we wish to identify which documents are approximately the same as each other. For instance, we may have crawled the web over some period of time, and expect to have fetched the “same page” several times, but to see slight differences in metadata, or to have several revisions of a page following small edits.

In this post I want to explore the method of approximate deduplication via Jaccard similarity and the MinHash approximation trick. This is a commonly-used approach to this problem (e.g. the GPT-3 paper describes using it as part of their dataset preparation pipeline), but one I had not encountered until recently, and which I find to be pretty interesting.

Our approach to approximate deduplication will be to define a notion of “similarity” between any two documents, and then to search for pairs whose similarity value is above some threshold. So if we have some universe of possible documents \(U\), we might define a similarity measure between pairs of documents: $$S: U \times U \rightarrow [0,1]$$ and consider two documents “approximate duplicates” if \(S(A,B) \geq S_\textrm{crit}\).
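One concrete choice of \(S\), and the one this post builds toward, is the Jaccard similarity between documents viewed as sets (for instance, sets of k-word shingles). Here is a minimal sketch of that idea; the function names and the choice of word shingles are illustrative, not taken from any particular implementation:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, a value in [0, 1]."""
    if not a and not b:
        return 1.0  # convention: two empty documents are identical
    return len(a & b) / len(a | b)

def shingles(text: str, k: int = 3) -> set:
    """Represent a document as its set of k-word shingles."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(a, b))  # 0.4: a one-word edit changes 3 of the 7 shingles
```

Two near-identical documents share most of their shingles, so their Jaccard similarity is close to 1; unrelated documents share few, pushing it toward 0.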

It’s worth noticing that this definition is not in general transitive: we may well have three documents \(A, B, C\) such that \(S(A,B)\geq{}S_\textrm{crit}\) and \(S(B,C) \geq{} S_\textrm{crit}\) but \(S(A,C) < S_\textrm{crit}\). That means that “approximately identical” is not an equivalence relation, which is part of the reason that approximate deduplication is trickier to reason about, and to perform at scale, than finding exact matches.
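To make the non-transitivity concrete, here is a small worked example using Jaccard similarity, \(|A \cap B| / |A \cup B|\), as the measure; the sets and threshold are chosen purely for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

S_crit = 0.5
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}

print(jaccard(A, B))  # 0.6   -> A and B are approximate duplicates
print(jaccard(B, C))  # 0.6   -> B and C are approximate duplicates
print(jaccard(A, C))  # 0.333... -> but A and C are not
```

Each step of the chain stays above the threshold while the endpoints drift apart, which is exactly why approximate duplicates don't partition cleanly into equivalence classes.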
