An Introduction to Record Linkage
by Shahid Ashraf
Record linkage (RL) is the process of finding records that refer to the same entity across data sets. The records can describe people, books, etc. It has become an important discipline in computer science and in Big Data.
What is Record Linkage?
It is the term used by computer engineers, among others, to describe the process of joining records from one source with records from another source. It can also be defined as the task of disambiguating manifestations of real-world entities in various records or mentions by linking and grouping them. Computer scientists often refer to it as "data matching" or the "object identity problem". Other names used to describe the same concept include "entity/identity/name/record resolution", "entity disambiguation/linking", "duplicate detection", "deduplication", "record matching", "(reference) reconciliation", "object identification", "data/information integration", "entity resolution", and "conflation".
When an RL system receives an input record, its matching algorithm compares it against the canonicalized records and will either,
- Merge the input record into a matching canonicalized record.
- Declare a possible match: a new id is still created for the record, with a pointer to the matched id in a provisional match table. Such matches are flagged as "provisional" and require human verification (depending on how accurate the RL needs to be).
- Declare the record a separate, new entity.
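The three outcomes above can be sketched as a simple decision function. This is a minimal illustration, not a production matcher: the two thresholds and the string-similarity helper (difflib's ratio) are assumptions chosen for the example.

```python
# Sketch of the three RL outcomes: merge, provisional match, new entity.
# Thresholds and the similarity() helper are illustrative assumptions.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9        # above this: merge automatically
PROVISIONAL_THRESHOLD = 0.7  # between the two: flag for human review

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(input_name: str, canonical_names: dict):
    """Return (decision, matched_id) for an incoming record."""
    best_id, best_score = None, 0.0
    for rec_id, name in canonical_names.items():
        score = similarity(input_name, name)
        if score > best_score:
            best_id, best_score = rec_id, score
    if best_score >= MATCH_THRESHOLD:
        return ("merge", best_id)        # merge into the canonical record
    if best_score >= PROVISIONAL_THRESHOLD:
        return ("provisional", best_id)  # new id + pointer, human verifies
    return ("new_entity", None)          # treat as a separate entity

canon = {1: "William J. Smith", 2: "Maria Garcia"}
print(classify("William J Smith", canon))
print(classify("Robert Chen", canon))
```

In practice the comparison would run over several attributes, not just a name, as discussed below.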
Record linkage has become increasingly useful in health care administration, demographic studies, provision of health statistics, and in medical research. It has enabled health care administrators and researchers to,
- Maintain a comprehensive database. Linkages between administrative data sources have provided the basis for the Manitoba Health database, a comprehensive, multi-file database which permits many different types of health-related research.
- Link to local statistical files. Linkages involving cancer registries, AIDS registries, and mortality data have been used to study mortality from these diseases. Such linkages can also provide the beginnings for a comprehensive database maintained locally. As more information becomes available, linkage assists in expanding and correcting the database.
- Link to survey data. Survey information linked with insurance claims has been used to provide a more accurate picture of the aging process by studying the relationships between functional status, self-reported health status, and the utilization of health care services.
- Link to clinical data. Linking clinical data with hospital claims and Vital Statistics information has been used to produce a rich dataset on patients’ pre-operative status and post-operative outcomes, allowing for the evaluation of a wide range of procedures. Links with clinical information have also proved important in establishing the validity of research which relies on claims data.
- Link to Social data. Linking social and health data will produce new perspectives on the relationship between health and social variables. 
Record linkage involves cleaning all data sets under consideration (particularly their key identifier fields) and assessing their quality before linking, as RL is highly sensitive to data quality. It also involves selecting the key features and converting them to a consistent format.
This can involve simple normalization of key identifiers so that mismatches do not occur on,
- Inconsistent casing: converting names and other text fields to upper case.
- Extra white space and abbreviations, e.g. treating "St." and "Saint" as the same prefix in names.
- Parsing names so we can separate them into first name, middle name, and last name, e.g. "William J. Smith" and "Smith, William J." should be the same in terms of first, middle, and last name.
- Address normalization, which might involve changing "United States of America" or "US" to "United States".
- We should also try to handle issues like a change of last name after marriage.
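Several of the steps above can be sketched in a few lines. This is an illustrative normalizer, not a complete standard: the abbreviation map and the two supported name layouts are assumptions.

```python
# Sketch of key-identifier normalization: collapse white space, upper-case,
# parse "Last, First Middle" and "First Middle Last" layouts, and expand
# a (hypothetical, tiny) abbreviation map for country names.
ABBREVIATIONS = {"US": "United States",
                 "United States of America": "United States"}

def normalize_name(raw: str) -> dict:
    """Parse 'Smith, William J.' or 'William J. Smith' into components."""
    raw = " ".join(raw.split())           # collapse white space
    if "," in raw:                        # 'Last, First Middle' form
        last, rest = [p.strip() for p in raw.split(",", 1)]
        parts = rest.split()
    else:                                 # 'First Middle Last' form
        parts = raw.split()
        last, parts = parts[-1], parts[:-1]
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return {"first": first.upper().rstrip("."),
            "middle": middle.upper().rstrip("."),
            "last": last.upper().rstrip(".")}

def normalize_country(raw: str) -> str:
    raw = " ".join(raw.split())
    return ABBREVIATIONS.get(raw, raw)

print(normalize_name("William J. Smith"))
print(normalize_name("Smith, William J."))  # same components as above
```

Real deployments would use a full address standardizer and nickname tables, but the principle is the same: both spellings map to one canonical form before comparison.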
Key Attribute selection or Feature Selection.
This task involves selecting the attributes that best distinguish whether two records refer to the same entity. For records of individuals, first name, last name, address, and email are key attributes/features. The goal is to construct, for a pair of records, a "comparison vector" of similarity scores for each component attribute. Similarity scores can simply be Boolean (match or non-match), or they can be real values computed with distance functions.
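A comparison vector with the simplest (Boolean, exact-match) scores can be sketched as follows; the attribute list follows the text, and the example records are made up.

```python
# Sketch of a Boolean comparison vector over the key attributes named
# in the text. 1 = exact match (case/space-insensitive), 0 = non-match.
KEY_ATTRIBUTES = ["name_first", "name_last", "address", "email"]

def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    return [int(rec_a.get(attr, "").strip().lower() ==
                rec_b.get(attr, "").strip().lower())
            for attr in KEY_ATTRIBUTES]

a = {"name_first": "William", "name_last": "Smith",
     "address": "12 Oak St", "email": "wsmith@example.com"}
b = {"name_first": "william", "name_last": "Smith",
     "address": "12 Oak Street", "email": "wsmith@example.com"}
print(comparison_vector(a, b))  # → [1, 1, 0, 1]
```

Replacing the exact-match test with a distance function turns each 0/1 entry into a real-valued similarity, which the scoring methods below can consume.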
This step involves developing the programs that perform record linkage and running them on a small sample of data before executing on the whole data set. Since data sets are usually large and require a lot of computation and time, working on a sample helps in tweaking the RL algorithms and processes, as turnaround time during tests is reduced considerably. It is important that the sample be representative of the actual data sets.
After we have constructed a vector of component-wise similarities for a pair of records, we must compute the probability that the pair is a match. There are several methods for estimating this probability. Two simple proposals are to use a weighted sum or average of the component similarity scores, and to apply thresholds. However, it is extremely hard to pick weights or tune thresholds by hand. Another simple approach is rule-based matching, but manual formulation of rule sets is difficult. The similarity scores themselves are generated by various algorithms, usually string matching such as edit distance and fuzzy string matching.
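The weighted-sum idea can be sketched like this; difflib's ratio stands in for an edit-distance-based similarity, and the weights and threshold are hand-picked, illustrative values (exactly the tuning difficulty the text warns about).

```python
# Sketch of weighted-sum scoring over real-valued similarities.
# Weights, threshold, and attribute set are illustrative assumptions.
from difflib import SequenceMatcher

WEIGHTS = {"name_first": 0.25, "name_last": 0.35, "email": 0.40}
THRESHOLD = 0.85  # hand-tuned cut-off for declaring a match

def sim(a: str, b: str) -> float:
    """Real-valued string similarity in [0, 1] (edit-distance-like)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-attribute similarities."""
    return sum(w * sim(rec_a[attr], rec_b[attr])
               for attr, w in WEIGHTS.items())

a = {"name_first": "William", "name_last": "Smith", "email": "wsmith@x.com"}
b = {"name_first": "Wiliam",  "name_last": "Smith", "email": "wsmith@x.com"}
score = match_score(a, b)
print(round(score, 3), score >= THRESHOLD)
```

A typo in the first name lowers its component score but, with these weights, the pair still clears the threshold.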
Record Linkage Types:
Deterministic or rules-based record linkage generates links based on the number of individual identifiers that match among the available data sets. Two records are said to match via a deterministic record linkage procedure if all or some identifiers (above a certain threshold) are identical.
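Deterministic linkage reduces to counting identical identifiers and comparing against a threshold. In this sketch the identifier list and the threshold of 3-of-4 are illustrative assumptions.

```python
# Sketch of deterministic (rules-based) linkage: link two records when
# enough identifiers are exactly identical. Identifier set and threshold
# are illustrative assumptions.
IDENTIFIERS = ["ssn", "dob", "last_name", "zip"]
THRESHOLD = 3  # at least 3 of the 4 identifiers must agree

def deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    agreements = sum(rec_a.get(i) == rec_b.get(i) for i in IDENTIFIERS)
    return agreements >= THRESHOLD

a = {"ssn": "123-45-6789", "dob": "1970-01-01",
     "last_name": "Smith", "zip": "10001"}
b = {"ssn": "123-45-6789", "dob": "1970-01-01",
     "last_name": "Smyth", "zip": "10001"}
print(deterministic_match(a, b))  # → True: 3 of 4 identifiers agree
```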
Probabilistic record linkage, sometimes called fuzzy matching, takes a different approach to the record linkage problem: it takes into account a wider range of potential identifiers, computes weights for each identifier based on its estimated ability to correctly identify a match or a non-match, and uses these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered matches, while pairs with probabilities below another threshold are considered non-matches; pairs that fall between the two thresholds are considered "possible matches" and can be dealt with accordingly.
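This can be sketched in the classic Fellegi-Sunter style: each identifier carries an agreement/disagreement weight derived from its m-probability (chance of agreeing on a true match) and u-probability (chance of agreeing on a true non-match), and two thresholds split pairs into three classes. All the numbers here are made-up, illustrative values.

```python
# Sketch of probabilistic linkage with per-identifier log-weights and two
# decision thresholds. m/u probabilities and thresholds are assumptions.
import math

# (m, u) = (P(agree | true match), P(agree | true non-match))
PARAMS = {"last_name": (0.95, 0.05),
          "dob":       (0.98, 0.01),
          "zip":       (0.90, 0.10)}
UPPER, LOWER = 5.0, 0.0  # hand-picked log2-weight thresholds

def pair_weight(rec_a: dict, rec_b: dict) -> float:
    total = 0.0
    for field, (m, u) in PARAMS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

def decide(rec_a: dict, rec_b: dict) -> str:
    w = pair_weight(rec_a, rec_b)
    if w >= UPPER:
        return "match"
    if w <= LOWER:
        return "non-match"
    return "possible match"  # falls between the thresholds: human review

a = {"last_name": "Smith", "dob": "1970-01-01", "zip": "10001"}
b = {"last_name": "Smith", "dob": "1970-01-01", "zip": "10002"}
print(decide(a, b))
```

Here a zip-code disagreement is outweighed by strong agreement on name and date of birth, so the pair is still declared a match.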
Machine learning techniques have also been used in record linkage, such as decision trees, SVMs, ensembles of classifiers, and conditional random fields. The main issue is training set generation: there is a strong imbalance of classes, with non-matches far outnumbering matches, and this can result in misclassification.
Active learning and unsupervised/semi-supervised techniques are used to circumvent the difficulties of training set generation.
Blocking / Scaling to Big Data
In this approach we use the concept of localization: instead of comparing every pair, we fetch only nearby records for matching. Closeness can be defined in a number of ways, for example selecting records that share the same hash of the first name, or the same city. Common blocking algorithms include hash-based blocking, similarity- or neighborhood-based blocking, and building complex blocking predicates by combining simple ones.
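Hash-based blocking can be sketched as follows; the blocking key (first letter of last name plus city) is an illustrative assumption, and choosing a good key is its own design problem.

```python
# Sketch of hash-based blocking: group records by a blocking key, then
# generate candidate pairs only within each block. The key choice is an
# illustrative assumption.
from collections import defaultdict
from itertools import combinations

def blocking_key(rec: dict) -> tuple:
    return (rec["last_name"][0].upper(), rec["city"].lower())

def candidate_pairs(records: list):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)   # assign each record a block
    for block in blocks.values():               # compare within blocks only
        yield from combinations(block, 2)

records = [
    {"id": 1, "last_name": "Smith",  "city": "Boston"},
    {"id": 2, "last_name": "Smyth",  "city": "Boston"},
    {"id": 3, "last_name": "Garcia", "city": "Austin"},
]
pairs = [(x["id"], y["id"]) for x, y in candidate_pairs(records)]
print(pairs)  # → [(1, 2)]
```

Only the Boston pair is ever compared; the cross-city pairs are pruned before any expensive similarity computation runs.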
Another approach to scaling the RL process is distributed computing. For example, MapReduce can be used with disjoint blocking: in the map phase (per-record computation) you compute the blocks and assign records to reducers, and in the reduce phase (global computation) you perform pairwise matching within each block.
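The MapReduce pattern just described can be simulated locally to show the data flow; the blocking key (zip code), the toy matcher, and all names here are illustrative assumptions, not a real MapReduce API.

```python
# Local simulation of MapReduce with disjoint blocking: map emits
# (block_key, record), the shuffle groups by key, and reduce does
# pairwise matching inside each block. Key and matcher are illustrative.
from collections import defaultdict
from itertools import combinations

def map_phase(record):
    """Per-record computation: emit the record keyed by its block."""
    yield (record["zip"], record)  # block on zip code (assumed key)

def reduce_phase(key, records):
    """Global computation: pairwise matching within one block."""
    return [(x["id"], y["id"]) for x, y in combinations(records, 2)
            if x["last_name"] == y["last_name"]]

records = [
    {"id": 1, "zip": "10001", "last_name": "Smith"},
    {"id": 2, "zip": "10001", "last_name": "Smith"},
    {"id": 3, "zip": "10001", "last_name": "Jones"},
    {"id": 4, "zip": "94105", "last_name": "Smith"},
]
shuffle = defaultdict(list)            # stands in for the shuffle/sort step
for rec in records:
    for key, value in map_phase(rec):
        shuffle[key].append(value)
matches = [pair for key, group in shuffle.items()
           for pair in reduce_phase(key, group)]
print(matches)  # → [(1, 2)]
```

Because the blocks are disjoint, each reducer works independently, which is what makes the pairwise-matching step parallelizable.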
There are several important forms of constraints over mentions Mi that are relevant to the next sections:
- Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 must also match.
- Exclusivity: If M1 matches with M2, then M3 cannot match with M2.
- Functional Dependency: If M1 and M2 match, then M3 and M4 must also match (a match between one pair of mentions forces a match between a related pair).
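The transitivity constraint is commonly enforced by merging pairwise match decisions into clusters with a union-find structure; this is a generic sketch of that idea, not a method prescribed by the text.

```python
# Sketch of enforcing transitivity with union-find: pairwise match
# decisions are merged into entity clusters, so if M1~M2 and M2~M3,
# then M1 and M3 automatically land in the same cluster.
parent = {}

def find(x):
    """Return the cluster representative of mention x."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    """Record a pairwise match decision between mentions a and b."""
    parent[find(a)] = find(b)

# pairwise decisions from some matcher
union("M1", "M2")
union("M2", "M3")

print(find("M1") == find("M3"))  # → True: transitivity holds
```

Exclusivity, by contrast, has to be checked at cluster-build time (e.g. forbidding two mentions from the same source file in one cluster) and cannot be expressed by union-find alone.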
Quality of record linkage can be measured in the following dimensions:
- The number of record pairs linked correctly (true positives) nm.
- The number of record pairs linked incorrectly (false positives, Type I error) nfp.
- The number of record pairs unlinked correctly (true negatives) nu.
- The number of record pairs unlinked incorrectly (false negatives, Type II error) nfn.
Along with the two ground-truth numbers (the total number of true match record pairs, Nm, and the total number of true non-match record pairs, Nu), various measures from different perspectives can be defined from these dimensions. Several of them are listed below,
- Sensitivity: nm/Nm, the number of correctly linked record pairs divided by the total number of true match record pairs.
- Specificity: nu/Nu, the number of correctly unlinked record pairs divided by the total number of true non-match record pairs.
- Match rate: (nm + nfp)/Nm, the total number of linked record pairs divided by the total number of true match record pairs.
- Positive predictive value (ppv): nm/(nm + nfp), the number of correctly linked record pairs divided by the total number of linked record pairs.
It can be seen that sensitivity measures the percentage of correctly classified record matches while specificity measures the percentage of correctly classified non-matches. 
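The four measures defined above follow directly from the four counts; here is a small sketch that computes them (the counts plugged in are made-up example numbers).

```python
# Sketch computing the quality measures from the four pair counts.
# Nm = nm + nfn and Nu = nu + nfp recover the ground-truth totals.
def rl_quality(n_m, n_fp, n_u, n_fn):
    N_m = n_m + n_fn   # total true match pairs
    N_u = n_u + n_fp   # total true non-match pairs
    return {
        "sensitivity": n_m / N_m,            # correctly linked / true matches
        "specificity": n_u / N_u,            # correctly unlinked / true non-matches
        "match_rate":  (n_m + n_fp) / N_m,   # all linked / true matches
        "ppv":         n_m / (n_m + n_fp),   # correctly linked / all linked
    }

q = rl_quality(n_m=90, n_fp=10, n_u=880, n_fn=20)
print(q)
```

Note that a match rate above 1.0 is possible: it signals that more pairs were linked than true matches exist, i.e. false positives are inflating the link count.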
In the coming blogs I will be sharing code for getting started with record linkage, along with a few basic RL approaches we currently employ.
Reference: Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford, "Record Linkage: Current Practice and Future Directions".