The data linkage process

Population Data BC (PopData) performs linkage on intake of a new data set from a Data Provider, where approved. New data sets are received from Data Providers throughout the year and incorporated into PopData’s holdings on an ongoing basis.

PopData uses personal identifiers and mathematical linkage techniques to link records belonging to the same individual across files and over time.

The Population Directory

The “Population Directory” is a PopData-maintained data table that includes all the individuals about whom PopData has information. It contains personal information such as name, address, date of birth, and other relevant identifying information, as well as a consistent, encrypted identifier (a “Linkage ID”) that uniquely identifies each individual.

The Population Directory has been built with Medical Services Plan (MSP) Registration & Premium Billing (R&PB) data going back to 1985 and is updated upon receipt of each new R&PB file. The Population Directory captures all changes (e.g. a postal code history) and name permutations (such as maiden name/married name). The Population Directory covers most of the BC population and is the basis for record linkages.

Step 1: Separation of identifying information (identifiers) from content

When PopData receives a new data file, limited data validation is carried out and the identifiers used for linkage are then separated from the content. All information is stored in PopData’s highly secure “Red Zone”, with the content data stored separately from the identifying data. Both data sets have a common Record ID (a generated number unique to each record) applied to them so that the Linkage ID resulting from the linkage can be attached to the content data at the end of the linkage process.
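The separation described above can be sketched as follows. This is a minimal illustration, not PopData's implementation: the field names, the sequential Record ID, and the sample values (including the made-up PHNs) are all assumptions.

```python
from itertools import count

# Hypothetical identifier fields; the real set depends on the incoming file.
IDENTIFIER_FIELDS = {"phn", "surname", "given_name", "birth_date", "postal_code", "sex"}


def separate(records):
    """Split each record into an identifier part and a content part sharing a Record ID."""
    identifiers, content = [], []
    next_id = count(1)  # assumed: Record IDs are simple sequential integers
    for record in records:
        record_id = next(next_id)
        id_part = {k: v for k, v in record.items() if k in IDENTIFIER_FIELDS}
        content_part = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
        identifiers.append({"record_id": record_id, **id_part})
        content.append({"record_id": record_id, **content_part})
    return identifiers, content


# Made-up sample rows for illustration only.
incoming = [
    {"phn": "9000000001", "surname": "Smith", "birth_date": "1980-02-29", "diagnosis": "J45"},
    {"phn": "9000000002", "surname": "Lee", "birth_date": "1975-11-03", "diagnosis": "E11"},
]
identifiers, content = separate(incoming)
```

After the split, the content rows carry no identifiers, and the two tables can only be re-joined through `record_id`.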

Step 2: Data cleaning

The identifier fields from the new data set then undergo a data cleaning process to standardize the data fields by removing small differences in formatting so that the linkage programs will recognize values that are the same. This process differs depending on the identifiers available in the file, but examples include:

- Removing the space from postal codes
- Splitting dates into three fields: year, month and day
- Blanking out invalid values so that they are treated as missing
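As a rough sketch of these cleaning examples, assuming dates arrive as ISO `YYYY-MM-DD` strings and that a postal code is valid when it fits the Canadian letter-digit pattern (both assumptions; the function names are hypothetical, and the uppercasing step is added here for illustration):

```python
import re
from datetime import date

# Canadian postal code shape (A1A1A1) used as a simple validity check; an assumption.
POSTAL_RE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def clean_postal_code(value):
    """Remove whitespace, uppercase, and blank out values that fail the pattern."""
    if value is None:
        return None
    compact = re.sub(r"\s+", "", value).upper()
    return compact if POSTAL_RE.match(compact) else None


def split_date(value):
    """Split a date string into (year, month, day); invalid dates become missing."""
    try:
        parsed = date.fromisoformat(value)
    except (TypeError, ValueError):
        return (None, None, None)
    return (parsed.year, parsed.month, parsed.day)
```

For instance, `clean_postal_code("v6t 1z4")` gives `"V6T1Z4"`, while an impossible date such as `"1985-02-30"` comes back as three missing values.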

Records from the new data file are then linked to the Population Directory on the basis of common identifiers that are present in both the data file and the Population Directory. The common identifiers used vary with what is available in the data, and are selected for their ability to identify an individual uniquely and reliably. For example, Personal Health Number (PHN), surname, given names, postal codes, birth date and sex are used for linking the Vital Statistics data to the Population Directory, while PHN, birth date, sex, MSP ID and postal code are used for linking the MSP PIM data to the Population Directory.

The goal is to link records belonging to the same individual together, with as few mis-links as possible (a mis-link being a link made between records that actually belong to different individuals).

Names may go through several preparation steps, given the many names and nicknames a person may have over their lifetime and the frequency with which names are misspelled. Strategies for matching on name include:
- Converting the letters to all uppercase
- Standardizing the character set (replacing accented characters)
- Removing non-alphabetic characters (dashes, quotes)
- Keeping multiple name fields for maternal/married name
- Expanding the names to include nicknames
- Rotating the name order when there are multiple first and second (or more) names
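Several of the name strategies listed above can be sketched with the Python standard library. The nickname table is a tiny hypothetical stand-in, and the function names are illustrative rather than PopData's own.

```python
import re
import unicodedata
from itertools import permutations

# Tiny illustrative nickname table; a real one would be much larger.
NICKNAMES = {"BILL": "WILLIAM", "LIZ": "ELIZABETH", "BOB": "ROBERT"}


def standardize_name(name):
    """Uppercase, replace accented characters, and drop non-alphabetic characters."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z]", "", without_accents.upper())


def name_variants(given_names):
    """Return every ordering of the given names, as recorded and with nicknames expanded."""
    cleaned = [standardize_name(n) for n in given_names]
    expanded = [NICKNAMES.get(n, n) for n in cleaned]
    return set(permutations(cleaned)) | set(permutations(expanded))
```

With this sketch, `standardize_name("José-María")` yields `"JOSEMARIA"`, and `name_variants(["Bill", "James"])` produces both orderings of `BILL`/`JAMES` and of `WILLIAM`/`JAMES`. Letters that have no accent decomposition (such as `ø`) would simply be dropped here, so a fuller character map would be needed in practice.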

Step 3: Ensuring accuracy

Identifiers may not be unique to an individual (e.g. postal code) or may change over time (e.g. surname). Identifying information may also be recorded inconsistently, incorrectly or may even be missing in certain records. Because of this degree of uncertainty, two linkage techniques are used at PopData to ensure that linkage is as accurate as possible.

Deterministic Linkage: A linkage technique in which links between records are determined by a perfect match on a set of common identifiers or, under more flexible rules, a match on a subset of the identifiers. The advantage of this method is that it minimizes mis-links between the two databases; the disadvantage is that, when it is used by itself, each identifier is treated as being of equal importance and quality.

Probabilistic Linkage: In probabilistic linkage, the identifiers are given weights according to how strong an identifier each one is. For example, two records that belong to different people are much more likely to agree on sex than on last name. Last name is therefore considered the stronger identifier and is assigned a higher probabilistic agreement weight. The matches found in the common identifiers and the weights given to those variables are used to estimate the likelihood that the records belong to the same individual. The advantage of this method is that linkages are maximized, even in cases where data may be incomplete or have coding errors; the disadvantage is that, unless care is taken, there may be some mis-links.

The output from the probabilistic linkage program contains, for each potential match, a final weight for each of the linkage fields, equal to the parameter weight multiplied by the value-specific weight. The value-specific weight can be thought of as a modifier to the parameter weight. For example, if there is a match on postal code, the agreement weight is multiplied by the value-specific weight, so that agreement on a rare postal code is assigned a higher weight than agreement on a common postal code. The total weight for each potential match is the sum of the final weights for each linkage field, and reflects the probability that the two records refer to the same individual.
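The arithmetic described above (final weight = parameter weight × value-specific weight, total weight = sum of final weights) can be illustrated as follows. All numbers and field names are invented for the example, and applying the value-specific modifier only on agreement is an assumption drawn from the postal code example.

```python
# Hypothetical parameter weights for each linkage field and comparison outcome.
PARAMETER_WEIGHTS = {
    "surname": {"agree": 8.0, "disagree": -4.0, "missing": 0.0},
    "postal_code": {"agree": 5.0, "disagree": -2.0, "missing": 0.0},
    "sex": {"agree": 1.0, "disagree": -3.0, "missing": 0.0},
}

# Hypothetical value-specific modifiers: below 1 for common values, above 1 for rare ones.
# Fields without an entry (sex) fall back to a modifier of 1.
VALUE_SPECIFIC = {
    "surname": {"SMITH": 0.5, "ZELENKA": 1.5},
    "postal_code": {"V6T1Z4": 0.75, "V0N3A0": 1.25},
}


def field_weight(field, outcome, value=None):
    """Final weight for one field: parameter weight times value-specific modifier."""
    parameter = PARAMETER_WEIGHTS[field][outcome]
    modifier = 1.0
    if outcome == "agree":  # assumption: the modifier applies to agreement only
        modifier = VALUE_SPECIFIC.get(field, {}).get(value, 1.0)
    return parameter * modifier


def total_weight(comparison):
    """Sum the final weights over all linkage fields of one potential match."""
    return sum(field_weight(f, outcome, value) for f, (outcome, value) in comparison.items())


common = {"surname": ("agree", "SMITH"), "postal_code": ("agree", "V6T1Z4"), "sex": ("agree", "F")}
rare = {"surname": ("agree", "ZELENKA"), "postal_code": ("agree", "V0N3A0"), "sex": ("agree", "F")}
```

Under these made-up numbers, `total_weight(common)` is 8.75 and `total_weight(rare)` is 19.25: the same pattern of agreement scores higher when the agreeing values are rare.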

At Population Data BC, both deterministic and probabilistic linkage techniques are used.

Deterministic linkage

Deterministic linkage is performed first, using a computer program that compares records in the new data file to records in the Population Directory on the basis of the common identifiers (also called linkage fields). However, because the data files that PopData links often contain upwards of 40 million records, it is inefficient to compare every record in a file with every record in the Population Directory. To reduce this inefficiency, PopData compares records within pockets of data. For example, three different pockets might be used when linking one data set: a NYSIIS (phonetic code) name pocket, a birth year/birth month pocket, and a PHN pocket. Within each pocket, only records that match on that pocket’s linkage field are compared. In a birth year/birth month pocket, for instance, only records in the new data file that match the Population Directory on birth year and birth month are compared using the deterministic linkage program. In the end, the information from all the pockets is put together to determine the best match.
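The pocket idea can be sketched as indexing the Population Directory by a pocket key and comparing only records that share that key. This is an illustrative sketch: the record layout, the `dir_id` field, and the helper names are assumptions, and a NYSIIS name pocket is left out because it would need a phonetic-coding routine.

```python
from collections import defaultdict


def candidate_pairs(new_records, directory, pocket_key):
    """Yield (new record, directory entry) pairs that share a non-missing pocket key."""
    index = defaultdict(list)
    for entry in directory:
        key = pocket_key(entry)
        if key is not None:
            index[key].append(entry)
    for record in new_records:
        key = pocket_key(record)
        if key is None:
            continue  # a record with a missing key cannot be placed in this pocket
        for entry in index.get(key, []):
            yield record, entry


def birth_pocket(record):
    """Pocket key on birth year and birth month."""
    year, month = record.get("birth_year"), record.get("birth_month")
    return None if year is None or month is None else (year, month)


def phn_pocket(record):
    """Pocket key on PHN."""
    return record.get("phn")


# Made-up data: the new record shares a birth pocket with A and B, and a PHN with C.
directory = [
    {"dir_id": "A", "phn": "9000000001", "birth_year": 1980, "birth_month": 2},
    {"dir_id": "B", "phn": "9000000002", "birth_year": 1980, "birth_month": 2},
    {"dir_id": "C", "phn": "9000000003", "birth_year": 1975, "birth_month": 11},
]
new_records = [{"record_id": 1, "phn": "9000000003", "birth_year": 1980, "birth_month": 2}]

pairs = set()
for pocket in (birth_pocket, phn_pocket):
    for record, entry in candidate_pairs(new_records, directory, pocket):
        pairs.add((record["record_id"], entry["dir_id"]))
```

Each pocket contributes its own candidate pairs, and the union across pockets is what later steps would resolve; no single pocket needs to catch every true match.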

For each pocket, the deterministic linkage program produces an outcome string for each potential match. The outcome string records, for each identifying variable, whether there was a perfect match, a complete mismatch, or a partial match. If 6 identifying variables are involved in the linkage, the outcome string will have 6 digits, one for each identifying variable, indicating a perfect match (1), a complete mismatch (9), a partial match (values 2-6), or a missing value (0). For example, if there was a match on first name, last name, sex, birth year and birth month, but not on birth day, the outcome string would be 111119. An example of partial agreement is a match on the first three characters of the postal code.
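A small sketch of how such an outcome string could be built. The digit meanings follow the text (0 missing, 1 perfect match, 2-6 partial, 9 mismatch); the choice of code 2 for the postal code rule, the field names, and the function names are assumptions.

```python
def compare_field(a, b, partial_rules=()):
    """Return one outcome digit: 0 missing, 1 perfect match, 2-6 partial, 9 mismatch."""
    if a is None or b is None:
        return "0"
    if a == b:
        return "1"
    for code, rule in partial_rules:
        if rule(a, b):
            return code
    return "9"


# Hypothetical partial-agreement rule: code 2 when the first three characters agree.
POSTAL_PARTIAL = (("2", lambda a, b: a[:3] == b[:3]),)


def outcome_string(new, entry, fields, partial_rules=None):
    """Concatenate one outcome digit per linkage field, in field order."""
    partial_rules = partial_rules or {}
    return "".join(
        compare_field(new.get(f), entry.get(f), partial_rules.get(f, ())) for f in fields
    )


FIELDS = ["first_name", "last_name", "sex", "birth_year", "birth_month", "birth_day"]
new = {"first_name": "ANNA", "last_name": "LEE", "sex": "F",
       "birth_year": 1980, "birth_month": 2, "birth_day": 14}
entry = dict(new, birth_day=15)  # identical except for the birth day
```

Here `outcome_string(new, entry, FIELDS)` reproduces the `111119` example from the text, since only the birth day disagrees.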

Probabilistic linkage

All candidate matches from the deterministic linkage program are then fed into the probabilistic linkage program along with a set of probabilistic weights. The probabilistic weights are contained in a ‘Link Weight Parameter File’ and consist, for each linkage field, of an agreement weight, a disagreement weight, a partial agreement weight for each level of partial agreement, and a missing weight. The Link Weight Parameter File is generated from the actual frequencies of agreement and disagreement for each linkage field in the data, using an iterative process that is usually run on a subset of the actual data file.

In addition to the parameter weights, value-specific weights are generated for some of the linkage fields. These weights are generated using the Population Directory and are stored in one file per linkage field. Value-specific weights are assigned according to how rare each value of that variable is. For example, a value-specific weight file created for given name would contain all given names found in the Population Directory, with common names given a low weight and rare names a high weight. Not every linkage field has a value-specific weight file; sex is one example, because its values occur with similar frequency (approximately 50/50 male/female).
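The source does not state how value-specific weights are calculated. As one possible illustration only, the sketch below scores each value by log2(1 / relative frequency), which gives common values a low weight and rare values a high one. The formula, the function name, and the sample names are all assumptions.

```python
import math
from collections import Counter


def value_specific_weights(values):
    """Map each observed value to log2(total / count); rarer values score higher.

    This formula is an assumed stand-in, not PopData's documented method.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {value: math.log2(total / n) for value, n in counts.items()}


# Made-up given names standing in for a Population Directory column.
given_names = ["JOHN", "JOHN", "JOHN", "JOHN", "ANNA", "ANNA", "ZOLTAN", "MIRABEL"]
weights = value_specific_weights(given_names)
```

In this toy sample `JOHN` (half of all values) scores 1.0 while `ZOLTAN` (one in eight) scores 3.0. A field such as sex, whose values are roughly equally common, would get near-identical scores for every value, which is consistent with it not needing a value-specific weight file.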

Step 4: Resolution

In this step, all candidate matches from all the pockets are resolved to find the best link. A comparison outcome string, component weights and the total weight for each potential match are available to establish the best link. The use of comparison outcome strings reduces the need for manual resolution and speeds up the process of linking large files, while providing for more refined selection than using a single threshold weight.

The first rule of resolution is that if a record matched to the Population Directory with perfect agreement on all of the linkage fields, that record is considered the best link. As records are linked, the record and best match from the Population Directory are moved to a separate file, and all other potential matches for that record are removed from the working file, leaving only unresolved potential matches. The remaining records then go through a series of resolution rules whereby at each step the next best links are taken. For example, the next best link might be a perfect match on the majority of the linkage fields, with one or two missing fields (such as middle name). The final resolution step may involve manual resolution by the programmer.
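The stepwise resolution described above can be sketched as an ordered list of rules, where each record takes its best remaining candidate under the first rule that accepts one. The two rules, the tie-break on total weight, and the data layout are assumptions for illustration; the actual rule set is not given in the text.

```python
def resolve(candidates, rules):
    """Apply rules in order; return (links by record_id, unresolved candidates)."""
    links = {}
    remaining = list(candidates)
    for step, rule in enumerate(rules, start=1):
        accepted = [c for c in remaining if rule(c)]
        # Assumed tie-break: within a step, prefer the candidate with the highest total weight.
        for cand in sorted(accepted, key=lambda c: c["total_weight"], reverse=True):
            links.setdefault(cand["record_id"], {**cand, "step": step})
        # Drop every other potential match for records that are now linked.
        remaining = [c for c in remaining if c["record_id"] not in links]
    return links, remaining


def perfect(cand):
    """Rule 1: perfect agreement on all linkage fields."""
    return set(cand["outcome"]) == {"1"}


def perfect_or_missing(cand):
    """Rule 2 (hypothetical): perfect agreement apart from at most two missing fields."""
    return set(cand["outcome"]) <= {"0", "1"} and cand["outcome"].count("0") <= 2


candidates = [
    {"record_id": 1, "dir_id": "A", "outcome": "111111", "total_weight": 30.0},
    {"record_id": 1, "dir_id": "B", "outcome": "111019", "total_weight": 12.0},
    {"record_id": 2, "dir_id": "C", "outcome": "110111", "total_weight": 24.0},
    {"record_id": 2, "dir_id": "D", "outcome": "101101", "total_weight": 18.0},
    {"record_id": 3, "dir_id": "E", "outcome": "119919", "total_weight": 3.0},
]
links, unresolved = resolve(candidates, [perfect, perfect_or_missing])
```

In this example record 1 links at step 1, record 2 links at step 2 to its higher-weight candidate, and record 3 is left over for later rules or manual review. Recording the `step` on each link also makes it easy to report what share of records each rule resolved.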

Various statistics are documented for the linkage process. At each resolution step, the percentage of records that linked using that resolution rule is recorded. Linkage rates are also calculated for the dataset as a whole, by Local Health Area, and by age and sex group. These rates are examined for potential problems or issues; for example, a low linkage rate among newborns would be compared to rates found for previous years for the same data.
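A linkage rate by group is simply the share of records in that group that received a link. The sketch below assumes each record carries a `linkage_id` that is `None` when no link was found; the field names and sample rows are hypothetical.

```python
from collections import defaultdict


def linkage_rates(records, group_key):
    """Return the fraction of linked records for each group."""
    totals, linked = defaultdict(int), defaultdict(int)
    for record in records:
        group = group_key(record)
        totals[group] += 1
        if record["linkage_id"] is not None:
            linked[group] += 1
    return {group: linked[group] / totals[group] for group in totals}


# Made-up results for illustration.
results = [
    {"sex": "F", "linkage_id": "x1"},
    {"sex": "F", "linkage_id": "x2"},
    {"sex": "M", "linkage_id": "x3"},
    {"sex": "M", "linkage_id": None},
]
by_sex = linkage_rates(results, lambda r: r["sex"])
overall = linkage_rates(results, lambda r: "all")
```

Passing a different `group_key` (for example one returning an age band or a Local Health Area code) gives the other breakdowns mentioned above.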

Once the new data set is linked to the Population Directory, the “Linkage ID”, which is the consistent, encrypted identifier that uniquely identifies each individual in the Population Directory, is placed in a new file, along with the Record ID from the original content data. Records for which no link was found are assigned a missing Linkage ID. Each Linkage ID is then matched to its PopData ID, a generated number unique to each individual, and the PopData ID is applied to the content data. The content data now has PopData IDs that are consistent across all PopData holdings, and so can be used as a base to link data sets across time and content areas without needing to access the personal identifiers again.
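This final step amounts to two lookups: Record ID to Linkage ID, then Linkage ID to PopData ID. A minimal sketch, with hypothetical names and made-up IDs, might look like this:

```python
def attach_popdata_ids(content, links, linkage_to_popdata):
    """Add a popdata_id to each content row via its Record ID; unlinked rows get None.

    content: rows with a "record_id" key (no personal identifiers)
    links: maps record_id -> Linkage ID, or None when no link was found
    linkage_to_popdata: maps Linkage ID -> PopData ID
    """
    result = []
    for row in content:
        linkage_id = links.get(row["record_id"])
        popdata_id = linkage_to_popdata.get(linkage_id) if linkage_id is not None else None
        result.append({**row, "popdata_id": popdata_id})
    return result


# Made-up values for illustration.
content_rows = [
    {"record_id": 1, "diagnosis": "J45"},
    {"record_id": 2, "diagnosis": "E11"},
]
links_by_record = {1: "LNK-7f3a", 2: None}
linkage_to_popdata = {"LNK-7f3a": 500123}
linked_content = attach_popdata_ids(content_rows, links_by_record, linkage_to_popdata)
```

Note that the function never touches names, birth dates or other identifiers; it only needs the Record ID carried by the content data.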
