Naming things is hard. This is not just a technical problem but a fundamental human one: whenever we talk to each other, we have to refer to specific things and people. I have four good friends named Zach – there’s a reason we invented nicknames.
Name collisions like this occur in every domain. Companies and products can share names, which is why trademarks exist: they restrict the use of a given label within a domain. There are multiple YC companies called Alpha or Level. The reverse also happens: a single entity can go by several names. My previous employer, Vouch, showed up as Vouch Inc. in some places and Vouch Insurance in others. The company also used to be called SV InsureTech, but it is most often referred to simply as Vouch.
Entity resolution (ER) is the process of identifying and linking multiple references to the same real-world entity across various data sources. In some businesses, de-duplicating and linking data correctly is existential. If you are a bank handing out loans, or an insurance company underwriting a new policy, you need to know who you are dealing with – not just for the sake of your margins but also to comply with regulatory mandates. It’s not uncommon for a single entity to have multiple names, addresses, etc., which makes selling Anti-Money Laundering (AML) and Know Your Customer (KYC) software a big business.
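To make the definition concrete, here is a minimal sketch in Python of the most naive form of linking: normalize each name and group the names that end up with the same key. The `LEGAL_SUFFIXES` list and the `normalize` and `group_by_key` helpers are illustrative assumptions of mine, not how Vouch's pipeline worked.

```python
import re
from collections import defaultdict

# Hypothetical suffix list for illustration; a real system needs a curated one.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_by_key(names):
    """Link names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


names = ["Vouch", "Vouch Inc.", "Vouch Insurance", "SV InsureTech"]
groups = group_by_key(names)
for key, members in groups.items():
    print(f"{key!r}: {members}")
```

Under these assumptions, only "Vouch" and "Vouch Inc." land in the same group. "Vouch Insurance" and "SV InsureTech" stay separate even though they refer to the same company, and the same exact-key approach would wrongly merge two unrelated companies that both happen to be called Alpha. Name matching alone both under-links and over-links.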
In this blog post, I’ll draw on my experience designing the data pipelines that created and maintained Vouch’s knowledge graph of the startup ecosystem to discuss some of the key ideas behind ER systems and the challenges in building them.