Computing, Math and Beauty


There are a lot of use cases where we might like to join two datasets not on exact values, but on how similar they are. There is a whole suite of applications for this, like entity resolution, de-duplication, plagiarism detection, etc. We will go over a very performant and extensible architecture for similarity-based joins using a technique called Min-hashing. We will then extend the Min-hashing idea by introducing weights/boosts for fields of the document and for particular terms/tokens based on their IDF values.

For the sake of simplicity, we will consider documents/records to be sets of words/tokens without repetitions, and we won't deal with the finer points of tokenization, phrase detection or shingles in this article.

Coming back to the problem, consider two Spark RDDs (RDD[(Int, String)]) where the Int is the identifier of a single document and the String value is the actual document. The straightforward brute-force approach to similarity-based joins would be to emit all tokens, then aggregate and/or filter pairs of records by how many tokens they have in common. Something like this:
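A minimal sketch of that brute-force pairing might look like the following (the `tokenize` helper and the `docsA`/`docsB` names are assumptions for illustration, not the original post's code):

```scala
import org.apache.spark.rdd.RDD

// Assumed helper: split a document into a set of distinct tokens.
def tokenize(doc: String): Set[String] =
  doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

// Brute-force similarity join: emit (token, docId) pairs from both sides,
// join on the token, and count how many tokens each pair of documents shares.
def bruteForceJoin(docsA: RDD[(Int, String)],
                   docsB: RDD[(Int, String)]): RDD[((Int, Int), Int)] = {
  val tokensA: RDD[(String, Int)] =
    docsA.flatMap { case (id, doc) => tokenize(doc).map(token => (token, id)) }
  val tokensB: RDD[(String, Int)] =
    docsB.flatMap { case (id, doc) => tokenize(doc).map(token => (token, id)) }

  tokensA.join(tokensB)                  // (token, (idA, idB))
    .map { case (_, (idA, idB)) => ((idA, idB), 1) }
    .reduceByKey(_ + _)                  // ((idA, idB), count of shared tokens)
}
```

Note that this shuffles every (token, docId) pair across the cluster, which is exactly the cost Min-hashing will help us avoid.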

We found the number of tokens that match between two documents, but we would really like a normalized score (between 0 and 1) representing how similar two documents are, so that we can compare matches and also apply some sort of threshold filtering. There are several ways to score how similar a pair of records is. One simple metric is Jaccard similarity.
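Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch of the metric on token sets (a standalone helper, not taken from the original post):

```scala
// Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|, always in [0, 1].
def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty && b.isEmpty) 0.0
  else (a intersect b).size.toDouble / (a union b).size.toDouble

// Example: {"min", "hash", "join"} vs {"min", "hash", "spark"}
// intersection = 2, union = 4, so the Jaccard similarity is 0.5.
```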
