Splink: a software package for probabilistic record linkage and deduplication at scale

Wednesday, March 16, 2022

2:30pm to 3:30pm GMT UK time (6:30am to 7:30am PST) | All sessions will be delivered live and online via the Gotowebinar system.

This webinar is part of the Power of Population Data Science Series

In this seminar, we will introduce Splink, a software package developed for probabilistic record linkage at scale.

This free software provides a toolkit for linking datasets of tens or even hundreds of millions of records, guiding the user through the various stages of linkage, including:

automatic profiling of data, to spot skewed fields and data quality issues that may affect linkage
automatic analysis of different potential blocking rules, to understand the computational costs of different approaches
user-customisable rules to compare fields that can be used to model names, dates, locations and any other types of fields
estimation of m and u probabilities using various approaches, including the expectation maximisation algorithm
diagnostic charts that explain model estimates, and help build intuition for how the model works
interactive tools to understand and quality assure the results of record linkage
accuracy analysis, including ROC and precision-recall curves, for labelled data
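
To give a feel for what the m and u probabilities in the list above do, here is a minimal sketch in plain Python of how they can be combined into a match probability under the standard Fellegi-Sunter model. It does not use Splink's API: the function names, the prior and the example m and u values are all hypothetical and chosen only for illustration. Here m is the probability of seeing a given comparison outcome when two records truly match, and u is the probability of seeing it when they do not.

```python
import math


def match_weight(m, u):
    """Match weight for one comparison outcome: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


def match_probability(prior, outcomes):
    """Combine a prior match probability with per-field (m, u) pairs.

    `outcomes` holds one (m, u) pair per compared field, for the outcome
    actually observed on that field (e.g. "first name agrees").
    Fields are assumed conditionally independent, as in Fellegi-Sunter.
    """
    # Start from the prior expressed as log2 odds.
    total_weight = math.log2(prior / (1 - prior))
    # Each field adds its match weight (negative when the outcome
    # is more likely among non-matches than matches).
    for m, u in outcomes:
        total_weight += match_weight(m, u)
    # Convert the final log2 odds back to a probability.
    odds = 2 ** total_weight
    return odds / (1 + odds)


# Hypothetical example: one in a thousand candidate pairs is a true match.
example_outcomes = [
    (0.90, 0.01),   # first name agrees
    (0.95, 0.001),  # date of birth agrees
    (0.20, 0.99),   # postcode disagrees
]
print(match_probability(0.001, example_outcomes))
```

An outcome that is far more common among matches than non-matches (large m, small u) pushes the probability up, while a disagreement on a usually reliable field pulls it down. In practice the m and u values are not chosen by hand but estimated from the data, for example with the expectation maximisation algorithm mentioned above.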

This tool is developed in Python and uses PySpark so that it can be run on massive datasets. It has been developed by analysts at the UK Ministry of Justice (MoJ) as part of the Data First programme, and has been used to link some of the MoJ's largest datasets. The tool is available at https://github.com/moj-analytical-services/splink

View recorded presentation below.

What did you think of this webinar?

Please take a few minutes to complete our online survey. Your feedback will help shape future webinar series!

Speakers

Robin Linacre is a Data Scientist leading work on data linking methodology at the UK Ministry of Justice. He has a background in econometrics but more recently has worked on a variety of open source software and infrastructure.
