Say we’re looking for a pattern in a blob of text. If you know the text has no typos, then determining whether it contains the pattern is trivial. In Python you can use the `in` operator. You can also write a regex pattern with the `re` module from the standard library. But what if the text contains typos? For instance, this might be the case with user inputs on a website, or with OCR outputs. This is a much harder problem.
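To make the clean-text case concrete, here is a minimal sketch using only the standard library. The sample sentences are made up for illustration. It shows both exact approaches working on clean text and failing once a couple of letters go missing:

```python
import re

# Made-up sample text, once clean and once with typos
clean = "Your invoice was sent to 12 Baker Street, London."
noisy = "Your invoice was sent to 12 Bakr Stret, London."

# Exact substring check with the `in` operator
found_clean = "Baker Street" in clean

# Same check with a regex, which also reports where the match sits
match = re.search(r"Baker\s+Street", clean)
matched_text = clean[match.start():match.end()]

# Both approaches break down as soon as a typo creeps in
found_noisy = "Baker Street" in noisy
noisy_match = re.search(r"Baker\s+Street", noisy)

print(found_clean, matched_text, found_noisy, noisy_match)
```

You could loosen the regex to tolerate specific typos, but you would have to anticipate each one in advance.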
Fuzzy string matching is a cool technique to find patterns in noisy text. In Python there used to be a ubiquitous library called fuzzywuzzy. It got renamed to thefuzz. Then RapidFuzz became the latest kid on the block. Here’s an example using the latter:
These libraries are often used to measure string similarity. This can be very useful for higher-level tasks, such as record linkage, which I wrote about in a previous post. For instance, fuzzy string matching is the cornerstone of the Splink project from the British Ministry of Justice. Also, I fondly remember doing fuzzy string matching at HelloFresh to find duplicate accounts by comparing postal addresses, names, email addresses, etc.
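To give a feel for similarity scoring in a deduplication setting, here is a rough sketch. It assumes the standard library's `difflib.SequenceMatcher` as a stand-in for a dedicated fuzzy matching library, and the `similarity` helper and the addresses are hypothetical:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score two strings between 0 and 1 after light normalization."""
    a, b = a.casefold().strip(), b.casefold().strip()
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical postal addresses taken from three user accounts
reference = "12 Baker Street"
same_with_noise = " 12 baker street "
same_with_typos = "12 Bakr Stret"
unrelated = "48 Elm Road"

for candidate in (same_with_noise, same_with_typos, unrelated):
    print(f"{candidate!r}: {similarity(reference, candidate):.2f}")
```

In practice you would pick a threshold above which two records count as likely duplicates. The right value depends on the data, so treat any fixed cutoff as a starting point to tune.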
But fuzzy string matching isn’t just about measuring similarities between strings. It also lets you locate a pattern in a blob of noisy text. Sadly, popular fuzzy matching libraries don’t seem to focus on this aspect. They only let you measure string similarity.
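To illustrate what locating a pattern means, here is one possible sketch in pure Python. It is based on Sellers' variant of the edit distance recurrence, where a match is allowed to start anywhere in the text. The names `fuzzy_find` and `max_dist` are my own and don't come from any library:

```python
def fuzzy_find(pattern: str, text: str, max_dist: int):
    """Return (start, end, distance) for the closest approximate match of
    `pattern` in `text`, or None if every match needs more than `max_dist` edits.
    """
    m = len(pattern)
    # prev[i] holds (cost, start) for aligning pattern[:i] with a substring of
    # the text that ends at the previous position.
    prev = [(i, 0) for i in range(m + 1)]
    best = None
    for end, ch in enumerate(text, start=1):
        # An empty pattern prefix matches for free, starting right here
        curr = [(0, end)]
        for i in range(1, m + 1):
            sub_cost, sub_start = prev[i - 1]
            skip_cost, skip_start = curr[i - 1]
            extra_cost, extra_start = prev[i]
            curr.append(min(
                (sub_cost + (pattern[i - 1] != ch), sub_start),  # match or substitution
                (skip_cost + 1, skip_start),    # pattern character missing from the text
                (extra_cost + 1, extra_start),  # extra character in the text
            ))
        cost, start = curr[m]
        # Keep the first end position that reaches the lowest cost
        if cost <= max_dist and (best is None or cost < best[2]):
            best = (start, end, cost)
        prev = curr
    return best


noisy = "Your invoice was sent to 12 Bakr Stret, London."
result = fuzzy_find("Baker Street", noisy, max_dist=2)
if result is not None:
    start, end, dist = result
    print(noisy[start:end], dist)
```

This runs in time proportional to `len(pattern) * len(text)`, which is fine for short patterns but would need optimizing for large corpora.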