This webinar is part of the Power of Population Data Science Series.
In this seminar, we will introduce Splink, a software package developed for probabilistic record linkage at scale.
This free software provides a toolkit for record linkage of datasets of tens or even hundreds of millions of records, guiding the user through the various stages of linkage, including:
- automatic profiling of data, to spot skewed fields and data quality issues that may affect linkage
- automatic analysis of different potential blocking rules, to understand the computational costs of different approaches
- user-customisable rules to compare fields that can be used to model names, dates, locations and any other types of fields
- estimation of m and u probabilities using various approaches, including the expectation maximisation algorithm
- diagnostic charts that explain model estimates, and help build intuition for how the model works
- interactive tools to understand and quality assure the results of record linkage
- accuracy analysis, including ROC and precision-recall curves, for labelled data
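To give a feel for two of the ideas in the list above, here is a minimal, self-contained Python sketch. It is not Splink's API: the function names and the toy records are hypothetical and chosen for illustration only. It assumes the standard Fellegi-Sunter definitions, where m is the probability that a field agrees given the two records are a true match, and u is the probability that it agrees given they are not. The first two functions show why a blocking rule reduces computational cost; the last two show how m and u values could be combined into a match probability.

```python
import math
from collections import Counter


def comparisons_without_blocking(n_records):
    """Pairwise comparisons if every record is compared with every other."""
    return n_records * (n_records - 1) // 2


def comparisons_with_blocking(records, key):
    """Pairwise comparisons if only records sharing the same `key` value are compared."""
    block_sizes = Counter(record[key] for record in records)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


def match_weight(m, u):
    """Log2 of the Bayes factor m / u for a single field comparison."""
    return math.log2(m / u)


def match_probability(prior, weights):
    """Combine a prior match probability with per-field match weights.

    Assumes the field comparisons are conditionally independent,
    so their weights can simply be summed.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** sum(weights)
    return posterior_odds / (1 + posterior_odds)


# Hypothetical toy data: five records, blocked on surname.
toy_records = [
    {"surname": "Smith"},
    {"surname": "Smith"},
    {"surname": "Smith"},
    {"surname": "Jones"},
    {"surname": "Jones"},
]

all_pairs = comparisons_without_blocking(len(toy_records))
blocked_pairs = comparisons_with_blocking(toy_records, "surname")

# Illustrative m and u values (assumptions, not estimates from real data).
surname_weight = match_weight(m=0.9, u=0.01)
dob_weight = match_weight(m=0.95, u=0.001)
probability = match_probability(prior=0.001, weights=[surname_weight, dob_weight])
```

In this toy case blocking on surname cuts the comparisons from 10 to 4. On real datasets of millions of records the saving is far larger, which is why understanding the cost of different blocking rules matters before a model is estimated.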
This tool is developed in Python and uses PySpark to enable its use on massive datasets. It has been developed by analysts at the UK Ministry of Justice (MoJ) as part of the Data First programme, and has been used to link some of the MoJ's largest datasets. The tool is available at https://github.com/moj-analytical-services/splink
View recorded presentation below.
Speakers
Robin Linacre is a Data Scientist leading work on data linking methodology at the UK Ministry of Justice. He has a background in econometrics but more recently has worked on a variety of open source software and infrastructure.