2005 RuleYourDataWithTheLinkKing
- (Campbell, 2005) ⇒ Kevin M. Campbell. (2005). “Rule Your Data with The Link King©: a SAS/AF® application for record linkage and unduplication.” In: Proceedings of 30th SAS Users Group Meeting (SUGI 30).
Subject Headings: The Link King System, Person Record Duplicate Detection System, Probabilistic Duplicate Record Detection Algorithm, Deterministic Duplicate Record Detection Algorithm.
Notes
Cited By
Quotes
Abstract
Administrative datasets containing client identifying information (names, birthdates, SSNs) are often used for a variety of research and evaluation projects. These projects often require linking two or more independently maintained client rosters in order to track service utilization across different systems. Unfortunately, a given client may be represented with slightly different identifying information both within and across administrative datasets. Discrepancies arise for a variety of reasons, including:
- Use of nicknames
- Hyphenated names
- Misspelled names
- Transposed SSN digits
- Transposed date fields
RECORD LINKAGE AND CONSOLIDATION ALGORITHMS
- There are two approaches to the linkage and unduplication of client identifiers in administrative datasets: deterministic linking and probabilistic linking.
- Probabilistic linking is accomplished through the application of sophisticated statistical analysis. Ultimately, a formula is derived that generates a score for each record pair, together with cut points that separate “definite” matches, “possible” matches, and “non-matches”. The formula incorporates weights specific to each of the data elements and scaling factors for many of the data elements. The weights reflect the relative importance of specific data elements in predicting a match. The scaling factors adjust the weights for a given record pair based on the “rarity” of the data value. For example, the scaling factor for the last name “Freud” would be much larger than that for the last name “Smith”.
- The probabilistic algorithms used by The Link King were developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated database project.
- Deterministic linking is accomplished by establishing specific criteria about which combination of data elements must “match”, and how good each “match” must be, in order to accept the link as valid. For example, one criterion for considering two client records a “match” might be that all of the following conditions are met:
  - First Names: Must have an Approximate String Match Algorithm score of 0.75 or higher
  - Last Names: Must have an Approximate String Match Algorithm score of 0.75 or higher
  - Middle Initial: Must be an exact match or be missing
  - SSN: Must have at least 7 digits with exact positional match
  - Birth date: Must be an exact match
- Deterministic record linkage is often portrayed as a method which doesn’t account for missing values and partial agreements and yields less success than probabilistic methods. For example, Whalen et al. (2001) believe that “probabilistic matching produces more links than other methods and that many of these links are missed by other methods. This indicates probabilistic linking routines are more accurate than other routines for matching person-level data.”
- This is not necessarily true. An intricate deterministic algorithm can be as successful as, or more successful than, probabilistic algorithms in identifying valid links. The Link King’s deterministic algorithms take into consideration partial matches for names, birthdates, and social security numbers as well as the “rarity” of the names being compared and, depending on the extent of similarity across data elements, link records at one of four levels of certainty. The deterministic algorithms used in The Link King were developed at Washington State’s Division of Alcohol and Substance Abuse for use in a variety of program evaluation and research projects.
- The most powerful tool for record linkage and unduplication is one that incorporates both deterministic and probabilistic algorithms as The Link King does.
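The two approaches described above can be contrasted in a minimal Python sketch. It is not The Link King's SAS/AF implementation, and several pieces are assumptions made only for illustration: `difflib.SequenceMatcher` stands in for the unspecified approximate string match algorithm, the field names, weights, name frequencies, and cut points are hypothetical, the logarithmic rarity scaling is just one plausible choice, and "missing" middle initial is read as missing on either record.

```python
from difflib import SequenceMatcher
from math import log10


def name_similarity(a: str, b: str) -> float:
    """Stand-in approximate string match score in [0, 1] (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ssn_positional_matches(a: str, b: str) -> int:
    """Count the digit positions at which two SSN strings agree."""
    return sum(x == y for x, y in zip(a, b))


def deterministic_match(r1: dict, r2: dict) -> bool:
    """Apply the example criterion: every listed condition must hold."""
    mi1, mi2 = r1.get("middle_initial"), r2.get("middle_initial")
    return (
        name_similarity(r1["first"], r2["first"]) >= 0.75
        and name_similarity(r1["last"], r2["last"]) >= 0.75
        and (not mi1 or not mi2 or mi1 == mi2)  # exact match or missing
        and ssn_positional_matches(r1["ssn"], r2["ssn"]) >= 7
        and r1["dob"] == r2["dob"]
    )


# Hypothetical weights, name frequencies, and cut points (illustration only).
WEIGHTS = {"first": 2.0, "last": 3.0, "ssn": 5.0, "dob": 4.0}
LAST_NAME_FREQ = {"smith": 0.01, "freud": 0.00001}
DEFAULT_FREQ = 0.001
DEFINITE_CUT, POSSIBLE_CUT = 20.0, 12.0


def rarity_scale(last_name: str) -> float:
    """Scaling factor that grows as the name becomes rarer."""
    return -log10(LAST_NAME_FREQ.get(last_name.lower(), DEFAULT_FREQ))


def probabilistic_score(r1: dict, r2: dict) -> float:
    """Sum the weights of agreeing fields; scale the last-name weight by rarity."""
    score = 0.0
    if r1["first"].lower() == r2["first"].lower():
        score += WEIGHTS["first"]
    if r1["last"].lower() == r2["last"].lower():
        score += WEIGHTS["last"] * rarity_scale(r1["last"])
    if r1["ssn"] == r2["ssn"]:
        score += WEIGHTS["ssn"]
    if r1["dob"] == r2["dob"]:
        score += WEIGHTS["dob"]
    return score


def classify(score: float) -> str:
    """Map a pair score onto the three outcome bands via the cut points."""
    if score >= DEFINITE_CUT:
        return "definite"
    if score >= POSSIBLE_CUT:
        return "possible"
    return "non-match"
```

Calling `deterministic_match(a, b)` on two record dictionaries returns a yes/no decision, while `classify(probabilistic_score(a, b))` places the same pair in one of the three bands. In this sketch an agreeing last name of “Freud” contributes more to the score than an agreeing “Smith”, mirroring the rarity example in the text.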
References
- Whalen D., Pepitone A., Graver L., Busch J. (2001). “Linking Client Records from Substance Abuse, Mental Health and Medicaid State Agencies.” U.S. Department of Health and Human Services, Substance Abuse and Mental Health Services Administration.
 | Author | title | titleUrl | year
---|---|---|---|---
2005 RuleYourDataWithTheLinkKing | Kevin M. Campbell | Rule Your Data with The Link King©: a SAS/AF® application for record linkage and unduplication | http://www2.sas.com/proceedings/sugi30/020-30.pdf | 2005