
Blog Data in government

Submitted by Style Pass, 2022-09-23 22:00:24

A common data quality problem is having multiple records that refer to the same entity, with no unique identifier that ties those records together. For example, customer data may have been entered more than once by accident, or entered separately in multiple IT systems.

Record linkage (sometimes known as entity resolution or data matching) is a technique for linking these records, so that data can be deduplicated and joined between systems.
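To make the idea concrete, here is a minimal sketch of linking records that share no identifier. It is not Splink's API or its statistical model: it is a deliberately naive rule (exact match on date of birth plus a fuzzy match on name) written with the Python standard library only. The records, field names, helper functions and the 0.85 threshold are all hypothetical and chosen purely for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records with no shared unique identifier.
records = [
    {"id": 1, "name": "John Smith", "dob": "1985-03-12", "city": "Leeds"},
    {"id": 2, "name": "Jon Smith", "dob": "1985-03-12", "city": "Leeds"},
    {"id": 3, "name": "Mary Jones", "dob": "1990-07-01", "city": "Cardiff"},
]


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive string similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(r1: dict, r2: dict, name_threshold: float = 0.85) -> bool:
    """Naive rule (an assumption for this sketch): same date of birth
    and sufficiently similar names."""
    return (
        r1["dob"] == r2["dob"]
        and similarity(r1["name"], r2["name"]) >= name_threshold
    )


def find_links(rows: list[dict]) -> list[tuple[int, int]]:
    """Compare every pair of records and return the id pairs judged to match."""
    return [(a["id"], b["id"]) for a, b in combinations(rows, 2) if is_match(a, b)]


links = find_links(records)
print(links)
```

Comparing every pair of records like this grows quadratically with the number of records, and a single hand-picked threshold is fragile. Both limitations are reasons to use a dedicated record linkage library rather than ad hoc rules.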

At the Ministry of Justice, we have developed an open source library called Splink to improve our record linkage methodology. This has enabled us to share new linked datasets with accredited researchers, as part of the ADR UK-funded Data First programme.

Splink is a free library for fast and accurate record linkage, which is now in its third version. It has the following key features:

We recommend users start by looking at our online tutorial, which is part of our main documentation website. The tutorial runs through a full record linkage example, from exploratory analysis right through to prediction and graph analytics, and it can even be run interactively in your web browser.
