A slide-based overview of the basics of classic probabilistic record matching
5/21/2018 The Mechanics of Probabilistic Record Matching
1/25
THE MECHANICS OF
PROBABILISTIC RECORD MATCHING
Jeffrey Tyzzer
2/25
Why Does this Deck Exist?
• I struggled while studying probabilistic matching (reading, e.g., the
works of Fellegi and Sunter, Newcombe, Schumacher, and Herzog et al.) and
wanted to summarize my findings, as much to help others understand it as
to check my own understanding. To that end, please direct any errors
and constructive feedback to me at
3/25
Agenda
• Recall that Master Data Management (MDM) enables the consolidation and syndication of trusted, authoritative data
• In this presentation, we focus on the consolidation, or unification, of master data, which is the heart of all MDM systems
4/25
Matches
• In a data set, constructs (i.e. records) are proxies for real-world objects
• Matches are entity instances (records) that have the same values for those properties (attributes) that serve to identify them
• One of the goals of Master Data Management is to ensure that there is a 1:1 correspondence between the real and proxy objects
5/25
Ways of Matching
• There are two principal ways to match: deterministically and probabilistically
• Deterministic matching is rules-based, e.g. IF R1a1 = R2a1 AND R1a2 = R2a2 THEN Link ELSE NonLink
• Deterministic matching is binary: all or nothing
• Probabilistic matching is likelihood-based
• Probabilistic matching is analog: it's based on a range of agreement
• The pioneers of probabilistic matching were Newcombe et al., Tepping, and Fellegi & Sunter
• Probabilistic matching is particularly useful in the absence of unique identifiers, when only so-called quasi-identifiers are available, such as names and dates of birth
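As a rough illustration of the deterministic rule quoted above, here is a minimal Python sketch. The `Record` class and its field names `a1` and `a2` are assumptions made for this example; they simply stand in for "attribute 1" and "attribute 2" of a record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical two-attribute record: a1 and a2 mirror the slide's
    # R1a1 / R1a2 notation (record 1, attribute 1 / attribute 2).
    a1: str
    a2: str


def deterministic_link(r1: Record, r2: Record) -> str:
    """IF R1a1 = R2a1 AND R1a2 = R2a2 THEN Link ELSE NonLink."""
    return "Link" if (r1.a1 == r2.a1 and r1.a2 == r2.a2) else "NonLink"


print(deterministic_link(Record("Tyzzer", "Jeff"), Record("Tyzzer", "Jeff")))     # Link
print(deterministic_link(Record("Tyzzer", "Jeff"), Record("Tyzzer", "Jeffrey")))  # NonLink
```

The second call shows the all-or-nothing behavior: a single differing attribute is enough to produce a NonLink.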
6/25
Consider
• R1 Name: Jeff Tyzzer Address: 848 Swanston Dr. Phone: (916) 555-1212
• R2 Name: Jeffrey Tyzzer Address: 884 Swanson Dr. Phone: 555-1212
• Would you consider these two records to be matches? Why? Would they be deterministic or probabilistic matches?
7/25
Hypothesis Testing
• In classic probabilistic matching, we take our cue from inferential statistics when comparing two records probabilistically:
• H0 - The null hypothesis: The records do not represent the same real-world object, i.e. they are not matches
• HA - The alternate hypothesis: The records represent the same real-world object, i.e. they are matches
• Typically, H0 is rejected if the p-value of our test statistic is less than the chosen significance level, conventionally .05
8/25
Hypothesis Testing, cont'd
• A Type I error, whose probability is designated with the Greek letter α (alpha), occurs when we incorrectly reject H0
• A Type II error, whose probability is designated with the Greek letter β (beta), occurs when we incorrectly fail to reject H0
9/25
Record Linkage and Type I & II Errors
• Since we've decided that H0 indicates that the records are different, if we commit a Type I error (incorrectly rejecting H0) we're (wrongly) asserting that the records match. This is a false positive
• Since we've decided that HA indicates that the records are the same (matches), if we commit a Type II error (incorrectly failing to reject H0) we're (wrongly) asserting that the records do not match. This is a false negative
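In symbols, and under this deck's choice of H0 and HA, the two error rates can be read as conditional probabilities. The event descriptions below are informal labels chosen for this sketch rather than notation taken from the slides:

```latex
\begin{align*}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ true})
        = P(\text{pair declared a match} \mid \text{records differ}) \\
\beta  &= P(\text{fail to reject } H_0 \mid H_A \text{ true})
        = P(\text{pair declared a non-match} \mid \text{records are the same})
\end{align*}
```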
10/25
Agreement Probabilities
• We must first decide on our match attributes, a domain-specific decision. For this presentation, we will use First Name, Last Name, and DoB
• For our purposes, when comparing these attributes between records there are two possible outcomes: they will agree or they won't
• We calculate the probabilities of these attributes agreeing under each of the preceding hypotheses. There are several methods for computing these; among them are sampling, prior studies, and Maximum Likelihood Estimation (MLE) using Expectation Maximization (EM)
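Of the estimation methods listed, sampling is the simplest to sketch. The example below assumes a small hand-labeled sample of record pairs, each reduced to a tuple of 0/1 agreement flags plus a true-match label. The function name, the data layout, and the sample itself are all invented for illustration; a real sample would be far larger.

```python
def estimate_agreement_probs(labeled_pairs):
    """Estimate per-attribute agreement rates from labeled pairs.

    labeled_pairs: iterable of (flags, is_match), where flags is a tuple of
    0/1 agreement indicators, one per match attribute.
    Returns (rates among true matches, rates among true non-matches).
    """
    labeled_pairs = list(labeled_pairs)
    matches = [flags for flags, is_match in labeled_pairs if is_match]
    non_matches = [flags for flags, is_match in labeled_pairs if not is_match]

    def rate(group):
        # Fraction of pairs in the group that agree on each attribute.
        return [sum(flags[i] for flags in group) / len(group)
                for i in range(len(group[0]))]

    return rate(matches), rate(non_matches)


# Hypothetical labeled sample; attribute order is (Last Name, First Name, DoB).
sample = [
    ((1, 1, 1), True), ((1, 0, 1), True), ((1, 1, 0), True), ((1, 1, 1), True),
    ((0, 0, 0), False), ((0, 1, 0), False), ((0, 0, 1), False), ((0, 0, 0), False),
]
match_rates, non_match_rates = estimate_agreement_probs(sample)
print(match_rates)      # [1.0, 0.75, 0.75]
print(non_match_rates)  # [0.0, 0.25, 0.25]
```

With so few pairs the estimates hit 0 and 1, which would break any later ratio or logarithm; that is one reason realistic samples need to be much larger.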
11/25
Example
Attribute Non-match (H0) Match (HA)
Last Name .05 .95
First Name .15 .90
DoB .25 .85
• Using one of the techniques mentioned in slide 10's last bullet point, say we find that, for our data, when the two records do in fact represent the same entity the last names match 95% of the time, the first names 90%, and the DoBs 85%. When the two records are known to represent different entities, the match rates are much lower: 5%, 15%, and 25%, respectively
12/25
Match Attribute Possibilities
• Since, for simplicity's sake, we're saying that the attributes must simply either match or not (designating 1 for a match and 0 for a non-match), then for our three attributes we have the following 2^3 = 8 agreement possibilities:
LN FN DoB
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
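The eight possibilities can also be generated rather than written out by hand. This is a small sketch using the standard library's `itertools.product`; the `ATTRIBUTES` name is an assumption of the example, and the generated order differs from the order listed above.

```python
from itertools import product

ATTRIBUTES = ("LN", "FN", "DoB")  # same attribute order as the listing above

# Every combination of 0 (disagree) and 1 (agree) across the attributes.
patterns = list(product((0, 1), repeat=len(ATTRIBUTES)))

print(len(patterns))  # 8
for pattern in patterns:
    print(dict(zip(ATTRIBUTES, pattern)))
```

Adding a fourth match attribute would double the count to 16, which is why the pattern space grows quickly in more realistic setups.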
13/25
Match Attribute Probabilities
• The space of all possible agreement patterns is referred to by the Greek letter Γ (gamma)
• Given the agreement probabilities listed on slide 11, we next compute two probabilities for each of the eight agreement patterns (slide 12) in Γ (in the same attribute order): the m (match) probability and the u (non-match) probability
• Example - the m probability for the (0,0,0) pattern (i.e. none match):
(1 - .95) * (1 - .90) * (1 - .85) = 0.00075
• Example - the u probability for the (1,0,1) pattern (match on LN and DoB):
(.05) * (1 - .15) * (.25) = 0.01063
• The agreement pattern is viewed as a discrete random variable whose values are the possible comparison outcomes
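The two worked examples follow one rule: for each attribute, multiply in the agreement probability if the flag is 1 and its complement if the flag is 0. A minimal sketch, using the slide 11 probabilities (the dictionary and function names are assumptions of this example):

```python
from math import prod

ATTRIBUTES = ("LN", "FN", "DoB")
M_AGREE = {"LN": 0.95, "FN": 0.90, "DoB": 0.85}  # P(agree | match), slide 11
U_AGREE = {"LN": 0.05, "FN": 0.15, "DoB": 0.25}  # P(agree | non-match), slide 11


def pattern_probability(pattern, agree_probs):
    """Probability of an agreement pattern, treating attributes as independent."""
    return prod(
        agree_probs[attr] if flag else 1 - agree_probs[attr]
        for attr, flag in zip(ATTRIBUTES, pattern)
    )


print(round(pattern_probability((0, 0, 0), M_AGREE), 6))  # 0.00075
print(round(pattern_probability((1, 0, 1), U_AGREE), 6))  # 0.010625
```

The second value is shown unrounded at six places; to five places it is the 0.01063 given in the example above.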
14/25
Match Attribute Probabilities, cont'd
• The completed table looks like this:
Agreement Pattern m u
0,0,0 .00075 .60563
1,0,0 .01425 .03188
0,1,0 .00675 .10688
0,0,1 .00425 .20188
1,1,0 .12825 .00563
1,0,1 .08075 .01063
0,1,1 .03825 .03563
1,1,1 .72675 .00188
15/25
Observations
• Given the agreement probabilities on slide 11, only 72.675% of the truly matching record pairs would have matched deterministically (agreeing on all three attributes), and only 60.563% of the pairs that don't match would have disagreed on all three attributes
• Both columns (m and u, slide 14) must sum to 1
• Probabilistic matching gives us "maybe" in addition to "yes" and "no" as a possible outcome: it lets us deal with those situations where not all attributes match, but some do (recall your answers to the questions on slide 6)
• This technique assumes conditional independence among the match attributes, which may not always be the case (consider the correlation between name and gender)
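Written out, conditional independence is what allows each pattern probability to be a plain product, and it also explains why each column sums to 1. In this sketch, \gamma_i \in \{0,1\} is the agreement flag for attribute i, and m_i and u_i are the per-attribute agreement probabilities from slide 11; the subscripted symbols are notation assumed here for illustration.

```latex
\begin{align*}
m(\gamma) &= \prod_{i=1}^{3} m_i^{\gamma_i}\,(1-m_i)^{1-\gamma_i}, \qquad
u(\gamma) = \prod_{i=1}^{3} u_i^{\gamma_i}\,(1-u_i)^{1-\gamma_i} \\
\sum_{\gamma \in \{0,1\}^3} m(\gamma) &= \prod_{i=1}^{3} \bigl(m_i + (1-m_i)\bigr) = 1,
\qquad
\sum_{\gamma \in \{0,1\}^3} u(\gamma) = \prod_{i=1}^{3} \bigl(u_i + (1-u_i)\bigr) = 1
\end{align*}
```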
16/25
Almost There
• The next two steps are:
• Calculate the log-likelihood ratio test statistic T, the base-2 logarithm of the ratio of m and u,
e.g., T = log2(0.03825/0.03563) = 0.10237,
and order the results ascending by T
• Compute the cumulative sums of the probabilities (m from the top down, u from the bottom up)
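Both steps can be sketched in a few lines of Python. The example assumes the slide 11 agreement probabilities and recomputes m and u from them without rounding, so trailing digits may differ slightly from figures worked by hand from rounded values (for instance, T for pattern (0,1,1) comes out near 0.1026 rather than 0.10237). Variable and function names are choices made for this sketch.

```python
from itertools import accumulate, product
from math import log2, prod

M_AGREE = (0.95, 0.90, 0.85)  # (LN, FN, DoB) agreement probabilities under HA
U_AGREE = (0.05, 0.15, 0.25)  # (LN, FN, DoB) agreement probabilities under H0


def pattern_prob(pattern, agree):
    """Product of p (flag = 1) or 1 - p (flag = 0) across the attributes."""
    return prod(p if flag else 1 - p for p, flag in zip(agree, pattern))


rows = []
for pattern in product((0, 1), repeat=3):
    m = pattern_prob(pattern, M_AGREE)
    u = pattern_prob(pattern, U_AGREE)
    rows.append((pattern, m, u, log2(m / u)))  # T = log2(m / u)

rows.sort(key=lambda row: row[3])  # order ascending by T

# Cumulative m runs from the top of the sorted table down;
# cumulative u runs from the bottom up.
cum_m = list(accumulate(row[1] for row in rows))
cum_u = list(accumulate(row[2] for row in reversed(rows)))[::-1]

for (pattern, m, u, t), cm, cu in zip(rows, cum_m, cum_u):
    print(pattern, f"{m:.5f} {u:.5f} {t:9.5f} {cm:.5f} {cu:.5f}")
```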
17/25
The Test Statistic & Cumulative Probs
Agreement Pattern   m         u         T          Σm (λ)    Σu (μ)
0,0,0               0.00075   0.60563   -9.65733   0.00075   1.00000
0,0,1               0.00425   0.20188   -5.56989   0.00500   0.39441
0,1,0               0.00675   0.10688   -3.98496   0.01175   0.19253
1,0,0               0.01425   0.03188   -1.16169   0.02600   0.08565
0,1,1               0.03825   0.03563    0.10237   0.06425   0.05377
1,0,1               0.08075   0.01063    2.92532   0.14500   0.01814
1,1,0               0.12825   0.00563    4.50968   0.27325   0.00751
1,1,1               0.72675   0.00188    8.59458   1.00000   0.00188
18/25
Deciding on the Thresholds
• We have three choices when confronted with a pair of records: definitely link them, definitely do not link them, and maybe link them. How do we decide? By establishing thresholds that separate the three possibilities, resulting in three discrete (and disjoint) T regions (slide 17)
• If, as on slide 7, we reject H0 at the .05 level, then we've decided that we're willing to accept an alpha of .05, meaning that we're OK with a Type I error (a false positive, given our definitions of H0 and HA) 5% of the time. In other words, we're willing to accept that up to 5% of truly non-matching record pairs could be linked erroneously
• Assume that beta, our tolerance for a Type II error (a false negative, given our definitions of H0 and HA), is also .05. (Note that the false positive and negative thresholds are domain-specific: what's the possible harm of a false positive in a hospital setting versus one for, say, a direct marketer compiling a household address list?)
19/25
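Deciding on the Thresholds, cont'd
• The cumulative sum of the m probabilities (accumulated upward from the lowest T) is the share of true matches that would be left unlinked, i.e. our false negative rate; the cumulative sum of the u probabilities (accumulated downward from the highest T) is the share of true non-matches that would be linked, i.e. our false positive rate. The last two columns in the table on slide 17, respectively, show these
• Our settings of alpha and beta dictate that any pair of records with a T of -1.16169 (λ) or less is a definite non-link (the cumulative m there is 0.02600, and the next row would push it past .05) and that any pair of records with a T of 2.92532 (μ) or greater is a definite link (the cumulative u there is 0.01814, and the next row would push it past .05). Thus, those with an agreement pattern of (0,1,1) are our "maybes". This is known as the clerical review region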
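The threshold search can be sketched as two scans over the slide 17 table: one upward from the lowest T, accumulating m until beta would be exceeded, and one downward from the highest T, accumulating u until alpha would be exceeded. This is a simplified reading that always stops at a whole row; the function name and table layout are assumptions of the sketch.

```python
# (pattern, m, u, T) rows copied from the slide 17 table, ascending by T.
TABLE = [
    ((0, 0, 0), 0.00075, 0.60563, -9.65733),
    ((0, 0, 1), 0.00425, 0.20188, -5.56989),
    ((0, 1, 0), 0.00675, 0.10688, -3.98496),
    ((1, 0, 0), 0.01425, 0.03188, -1.16169),
    ((0, 1, 1), 0.03825, 0.03563, 0.10237),
    ((1, 0, 1), 0.08075, 0.01063, 2.92532),
    ((1, 1, 0), 0.12825, 0.00563, 4.50968),
    ((1, 1, 1), 0.72675, 0.00188, 8.59458),
]


def thresholds(table, alpha=0.05, beta=0.05):
    """Return (lower T, upper T); a value is None if no row fits the tolerance."""
    lower = upper = None

    cum_m = 0.0
    for _, m, _, t in table:            # lowest T first
        cum_m += m
        if cum_m > beta:                # would exceed false-negative tolerance
            break
        lower = t

    cum_u = 0.0
    for _, _, u, t in reversed(table):  # highest T first
        cum_u += u
        if cum_u > alpha:               # would exceed false-positive tolerance
            break
        upper = t

    return lower, upper


print(thresholds(TABLE))  # (-1.16169, 2.92532)
```

Tightening either tolerance moves its threshold outward and widens the review region; for example, `thresholds(TABLE, alpha=0.005)` pushes the upper threshold up to 8.59458.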
20/25
A Graphical Representation
[Chart: horizontal axis shows the eight T values from slide 17 (-9.65733 to 8.59458); vertical axis runs from 0.00000 to 1.20000]
21/25
Interpretation
• Record pairs to the left of the red line (lambda) are a certain "no" and those to the right of the green line (mu) are a certain "yes". In between the two lines is the "maybe" region, whose record pairs require human review
• Fellegi & Sunter's technique assures us that the "maybe" region is as small as possible given our settings for alpha and beta (ref. the Neyman-Pearson lemma)
• The width of the clerical review region is a function of the values of α and β (slide 8)
22/25
Example I
Record LN FN DoB
1 Tyzzer John 5/26/19xx
2 Tyzzer Jeff 5/26/19xx
• The agreement pattern is (1,0,1). Given its corresponding T value, these records would be classified as a match
23/25
Example II
Record LN FN DoB
1 Smith Jeff 5/26/19xx
2 Tyzzer Jeff 5/26/19xx
• The agreement pattern is (0,1,1). Given its corresponding T value, these records would be classified as a "maybe" and queued for clerical review
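Examples I and II can be reproduced end to end with a short sketch: compare the attributes exactly, look up T for the resulting pattern in the slide 17 table, and apply the slide 19 thresholds. The function names and the tuple layout (LN, FN, DoB) are assumptions of this example.

```python
LOWER_T, UPPER_T = -1.16169, 2.92532  # thresholds from slide 19

# T values by agreement pattern (LN, FN, DoB), copied from slide 17.
T_BY_PATTERN = {
    (0, 0, 0): -9.65733, (0, 0, 1): -5.56989, (0, 1, 0): -3.98496,
    (1, 0, 0): -1.16169, (0, 1, 1): 0.10237, (1, 0, 1): 2.92532,
    (1, 1, 0): 4.50968, (1, 1, 1): 8.59458,
}


def agreement_pattern(rec_a, rec_b):
    """1 where the two records agree exactly on an attribute, else 0."""
    return tuple(int(a == b) for a, b in zip(rec_a, rec_b))


def classify(rec_a, rec_b):
    t = T_BY_PATTERN[agreement_pattern(rec_a, rec_b)]
    if t >= UPPER_T:
        return "link"
    if t <= LOWER_T:
        return "non-link"
    return "clerical review"


# Example I: pattern (1,0,1)
print(classify(("Tyzzer", "John", "5/26/19xx"), ("Tyzzer", "Jeff", "5/26/19xx")))  # link
# Example II: pattern (0,1,1)
print(classify(("Smith", "Jeff", "5/26/19xx"), ("Tyzzer", "Jeff", "5/26/19xx")))   # clerical review
```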
24/25
Some Final Thoughts
• To compute the agreement probabilities (slide 11), the expectation maximization (EM) technique is usually employed. These probabilities drive all subsequent results
• The demonstrated scenario and examples are deliberately trivial
• A more realistic situation would likely include more match columns and several more possible configurations of them instead of simple agreement or disagreement
• A more realistic situation would also have accommodated fuzzy matches and incorporated value-specific frequencies into the probability calculations. For last name, say, the agreement pattern would then be interpreted as "the LN agrees and is a specific value", e.g. Smith
• To reduce the number of record-to-record comparisons from n(n-1)/2 (intrafile) or n*m (interfile) to something manageable, blocking (e.g. on zip code or the phonetic encoding of the surname) is typically used
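To give a feel for what blocking buys, the sketch below groups records by a blocking key and generates candidate pairs only within each group. The records, their zip codes, and the function name are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records, block_key):
    """Yield only the pairs whose records share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical (record id, zip code) tuples.
records = [
    (1, "95814"), (2, "95814"), (3, "95814"),
    (4, "95616"), (5, "95616"),
    (6, "94105"),
]

n = len(records)
all_pairs = n * (n - 1) // 2  # intrafile comparisons without blocking
pairs = list(blocked_pairs(records, block_key=lambda rec: rec[1]))

print(all_pairs)   # 15
print(len(pairs))  # 4
```

The trade-off is that true matches whose blocking keys differ (a mistyped zip code, say) are never compared at all.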
25/25
References
• Do, Chuong B., and Serafim Batzoglou. "What is the Expectation Maximization Algorithm?" Nature Biotechnology 26.8 (2008): 897-9.
• Fellegi, Ivan, and Alan B. Sunter. "A Theory for Record Linkage." Journal of the American Statistical Association 64.328 (1969): 1183-1210.
• Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data Quality and Record Linkage Techniques. New York: Springer Science + Business Media, 2007.
• ---. "Record Linkage." WIREs Computational Statistics 2.5 (2010): 535-543.
• Kirkendall, Nancy. "Weights in Computer Matching: Applications and an Information Theoretic Point of View." Record Linkage Techniques--1985. Internal Revenue Service.