8/14/2019 Entity Resolution
1
Entity Resolution
A Real-World Problem of Matching Records
Techniques: Minhashing, Locality-Sensitive Hashing
Measuring the Quality of the Results
2
What Is Entity Resolution?
• Data from several sources may refer to the same entities, e.g., people.
• There is no universal key to help us match records.
• Big question: how do we tell if records refer to the same underlying entity?
3
A Matching Problem
• Company A sold the services of Company B.
• They then got mad at each other and sued over how many customers of B were originally from A.
• B never bothered to store a "from A" bit.
• I was asked to find how many of B's customers came from A.
4
Matching Details
• There were about 1,000,000 records from each company.
• Each had name, address, and phone# fields.
• Because the records were created independently, there were many differences between records that represented the same entity (person).
5
Examples of Differences
1. Typos of many sorts.
2. Abbreviations (St./Street).
3. Nicknames (Bob/Robert).
4. Missing middle name or initial.
5. First/last names reversed.
6. Area-code changes.
7. Etc., etc.
6
Simple Approach
1. Develop a score of how close two name-addr-phone records are.
2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.
• $10^{12}$ scorings --- way too much.
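A minimal sketch of the simple approach, to make the cost concrete. The slides do not give the scoring formula, so `score` here is a hypothetical stand-in (up to 100 points per field, using `difflib.SequenceMatcher`), and the field names and the threshold of 220 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

FIELDS = ("name", "addr", "phone")  # assumed field names


def score(rec_a, rec_b):
    """Hypothetical score: up to 100 points per field, by string similarity."""
    return sum(
        100 * SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in FIELDS
    )


def match_all_pairs(records_a, records_b, threshold):
    """Score every (A, B) pair: len(records_a) * len(records_b) scorings."""
    return [
        (a, b)
        for a, b in product(records_a, records_b)
        if score(a, b) > threshold
    ]


a_records = [
    {"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"},
]
b_records = [
    {"name": "Robert Smith", "addr": "12 Main St.", "phone": "555-1234"},
    {"name": "Alice Jones", "addr": "9 Oak Avenue", "phone": "555-9876"},
]
matches = match_all_pairs(a_records, b_records, threshold=220)
```

With 1,000,000 records on each side, the same loop would call `score` $10^{12}$ times, which is why the pairs have to be filtered before scoring.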
7
Scoring Matches
• The exact formula used to measure similarity turned out not to matter much, because we were able to measure the quality of any given score.
• In general: an ad-hoc, experimental process.
8
Finding Pairs to Score
• The key problem is deciding which of the $10^{12}$ pairs are worth scoring.
• An example of the near-neighbors problem:
  - Given N points, find all pairs of points that are at distance less than some threshold.
  - Usually expressed as similarity = 1 - normalized distance.
9
Standard N-N Framework
1. Points are sets.
2. Similarity of sets = ratio of size of intersection to size of union.
3. Minhashing to convert sets into manageable summaries (signatures).
4. Locality-sensitive hashing to focus on pairs likely to be similar.
10
Locality-Sensitive Hashing
1. Choose many hash functions from points to buckets.
2. Arrange that nearby points have a good chance of going to the same bucket.
3. Candidates = pairs of points sent to the same bucket by at least one hash function.
4. Evaluate only candidate pairs.
11
Example: Similar Documents
• Replace a document by its k-shingles = all substrings of length k.
• Example: Doc1: abcdb; shingle set = {ab, bc, cd, db}.
• Doc2: cdab; shingle set = {cd, da, ab}.
• |intersection| = 2; |union| = 5; similarity = 40%.
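The example above can be reproduced with a short sketch. The function names `shingles` and `jaccard` are chosen here for illustration, and returning 1.0 for two empty sets is a convention picked for this sketch, not something the slides specify.

```python
def shingles(doc, k):
    """Return the set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def jaccard(s, t):
    """Similarity of two sets: |intersection| / |union|."""
    if not s and not t:
        return 1.0  # convention chosen here for two empty sets
    return len(s & t) / len(s | t)


doc1 = shingles("abcdb", 2)  # {'ab', 'bc', 'cd', 'db'}
doc2 = shingles("cdab", 2)   # {'cd', 'da', 'ab'}
similarity = jaccard(doc1, doc2)  # 2 / 5
```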
12
Minhashing
• Pick a number of hash functions (say 100) from set elements to integers.
• For each hash function, the minhash value for a set is the smallest integer to which any of its members hash.
• The signature of a set is the list of minhash values for the selected hash functions.
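A minimal sketch of computing a signature, assuming string elements. The hash family used here, $(a \cdot x + b) \bmod p$ applied to a CRC32 of each element, is one common choice and an assumption of this sketch; the slides do not say which hash functions were used.

```python
import random
import zlib

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the assumed hash family


def make_hash_functions(n, seed=0):
    """Return n (a, b) pairs, each defining h(x) = (a * x + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
        for _ in range(n)
    ]


def minhash_signature(items, hash_params):
    """Signature of a non-empty collection of strings: one minhash per function."""
    base = [zlib.crc32(x.encode("utf-8")) for x in items]  # stable across runs
    return [
        min((a * v + b) % MERSENNE_PRIME for v in base)
        for a, b in hash_params
    ]


hash_params = make_hash_functions(100)
signature = minhash_signature({"ab", "bc", "cd", "db"}, hash_params)
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would make signatures computed in different runs incomparable.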
13
Theorem
• The probability that the minhash of two sets is the same = the similarity of the sets.
• Consequence: if we minhash two sets many times, the fraction of hash functions for which their minhashes are the same will approximate the similarity of the sets.
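The theorem can be checked empirically. This sketch idealizes each hash function as a random permutation of the elements (an assumption that makes the theorem hold exactly); the function name, trial count, and seed are arbitrary choices for illustration.

```python
import random


def minhash_agreement(s, t, trials=2000, seed=1):
    """Fraction of random permutations on which s and t have the same minhash."""
    rng = random.Random(seed)
    universe = sorted(s | t)  # sorted so the run is reproducible
    agree = 0
    for _ in range(trials):
        rng.shuffle(universe)
        rank = {x: i for i, x in enumerate(universe)}  # permutation as a hash
        if min(rank[x] for x in s) == min(rank[x] for x in t):
            agree += 1
    return agree / trials


s = {"ab", "bc", "cd", "db"}
t = {"cd", "da", "ab"}
estimate = minhash_agreement(s, t)
exact = len(s & t) / len(s | t)  # 2 / 5
```

The two minhashes agree exactly when the first element of the permuted union lies in the intersection, so `estimate` should land close to `exact`, with the gap shrinking as `trials` grows.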
14
Back to LSH
• Represent a set (e.g., sets of shingles of a doc) by the column of (say) 100 minhash values (its signature).
• Matrix M consists of a column per set.
• LSH starts by partitioning the rows into b bands of r rows each.
15
Partition Into Bands
[Figure: Matrix M divided into b bands of r rows per band; each column is the signature for one set.]
16
Partition into Bands --- (2)
• For each band, hash its portion of each column to a hash table with many buckets.
• Candidate column pairs are those that hash to the same bucket for at least 1 band.
• Tune b and r to catch most similar pairs, few nonsimilar pairs.
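A minimal sketch of the banding step. It assumes signatures are held in a dict from a column name to a list of `bands * rows` minhash values, and it uses the band's values directly as the bucket key, so two columns share a bucket only when that band is identical. A real hash table with a fixed number of buckets would also admit occasional accidental collisions.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands, rows):
    """Return pairs of column names that agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        lo, hi = band * rows, (band + 1) * rows
        buckets = defaultdict(list)  # a fresh hash table for each band
        for name, sig in signatures.items():
            buckets[tuple(sig[lo:hi])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],  # agrees with A on the first band only
    "C": [7, 7, 7, 7],
}
pairs = lsh_candidates(signatures, bands=2, rows=2)
```

Only the pairs returned here would go on to a full signature comparison.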
17
[Figure: Matrix M with b bands of r rows; each band is hashed to buckets.]
18
Example --- Efficiency of LSH
• Suppose 100,000 columns.
• Signatures of 100 integers.
• Therefore, signatures take 40 MB (at 4 bytes per integer), so they fit in main memory.
• But 5,000,000,000 pairs of signatures can take a while to compare.
• Choose 20 bands of 5 integers/band.
19
Suppose C1, C2 are 80% Similar
• Probability C1, C2 identical in one particular band: $(0.8)^5 = 0.328$.
• Probability C1, C2 are not identical in any of the 20 bands: $(1 - 0.328)^{20} = 0.00035$.
  - i.e., we miss about 1/3000th of the 80%-similar column pairs.
20
Suppose C1, C2 Only 40% Similar
• Probability C1, C2 identical in any one particular band: $(0.4)^5 = 0.01$.
• Probability C1, C2 identical in at least 1 of 20 bands: at most 20 * 0.01 = 0.2.
• But false positives are much lower for similarities well below 40%.
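The arithmetic for the 80% and 40% cases can be checked with a few lines. The function name `candidate_probability` is chosen here for illustration; the formula is the one implied by the banding scheme, with r = 5 rows and b = 20 bands.

```python
def candidate_probability(s, rows, bands):
    """Chance that two columns of similarity s agree on all rows of some band."""
    return 1 - (1 - s ** rows) ** bands


# 80%-similar pair
p_band_80 = 0.8 ** 5              # identical in one particular band
miss_80 = (1 - 0.8 ** 5) ** 20    # identical in none of the 20 bands

# 40%-similar pair
p_band_40 = 0.4 ** 5              # identical in one particular band
p_40 = candidate_probability(0.4, rows=5, bands=20)
bound_40 = 20 * p_band_40         # the slide's upper bound (union bound)

print(f"{p_band_80:.3f} {miss_80:.5f} {p_band_40:.4f} {p_40:.3f} {bound_40:.3f}")
```

The exact probability for the 40% case comes out slightly below the 0.2 bound, because the bound counts a pair once for every band in which it collides.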
21
LSH Involves a Tradeoff
• Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
• Example: if we had fewer than 20 bands, the number of false positives would go down, but the number of false negatives would go up.
22
LSH --- Graphically
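• Example target: all pairs with Sim > t.
• Partition into bands gives us the curve $1 - (1 - s^r)^b$, with threshold $t \sim (1/b)^{1/r}$.
[Figure: three plots of Prob. (0.0 to 1.0) against Sim s (0.0 to 1.0), with t marked on the Sim axis. Panels: Ideal; One hash fn.; partition into bands, $1 - (1 - s^r)^b$.]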
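A short sketch of how the approximate threshold moves with the choice of b and r. The function name and the list of (b, r) splits of 100 minhash values are illustrative assumptions; only the 20 x 5 split appears in the slides.

```python
def approximate_threshold(bands, rows):
    """Similarity near which the banding curve rises steeply: (1/b)**(1/r)."""
    return (1 / bands) ** (1 / rows)


# Some ways to split 100 minhash values into b bands of r rows each.
for bands, rows in [(50, 2), (25, 4), (20, 5), (10, 10)]:
    t = approximate_threshold(bands, rows)
    at_t = 1 - (1 - t ** rows) ** bands  # candidate probability at s = t
    print(f"b={bands:2d} r={rows:2d}  t ~ {t:.3f}  Prob. at t = {at_t:.3f}")
```

Fewer, longer bands push the threshold up, which trades false positives for false negatives. At s = t the candidate probability is $1 - (1 - 1/b)^b$, roughly 0.63 for the values of b shown.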
23
Back to Entity Resolution
• Name-addr-phone records are not naturally representable by sets (e.g., shingle sets).
• So we adapted the idea by using 3 hash functions:
1. Hash by name.
2. Hash by address.
3. Hash by phone.
24
Entity-Resolution LSH
• False negative for every pair of records that represented the same customer but had none of the three components identical.
• With more cycles, we could have used bigger buckets and gotten fewer false negatives.
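A minimal sketch of this adaptation, assuming each record is a dict with `name`, `addr`, and `phone` keys (the field names and sample records are hypothetical). Each field acts as one hash function, and a pair becomes a candidate when it lands in the same bucket for at least one of the three.

```python
from collections import defaultdict

FIELDS = ("name", "addr", "phone")  # assumed field names


def candidate_pairs(records_a, records_b):
    """Index pairs (i, j) where A-record i and B-record j share an identical field."""
    candidates = set()
    for field in FIELDS:
        buckets = defaultdict(list)  # field value -> indexes of A-records
        for i, rec in enumerate(records_a):
            buckets[rec[field]].append(i)
        for j, rec in enumerate(records_b):
            for i in buckets.get(rec[field], ()):
                candidates.add((i, j))
    return candidates


a_records = [
    {"name": "Robert Smith", "addr": "12 Main Street", "phone": "555-1234"},
    {"name": "Ann Lee", "addr": "4 Elm Road", "phone": "555-2222"},
]
b_records = [
    {"name": "Bob Smith", "addr": "12 Main St.", "phone": "555-1234"},
    {"name": "Anne Lee", "addr": "4 Elm Rd.", "phone": "555-3333"},
]
pairs = candidate_pairs(a_records, b_records)
```

The first pair is caught through the identical phone number. The second pair plausibly describes one person but differs in all three fields, so it is never scored: a false negative of the kind described above.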
25
Example
• Hash on positions 1, 3, and 5 of the (5-digit) zip code.
• Approximately 1000 records from each dataset go into each of 1000 buckets.
• 1 billion candidate pairs to score.
• Need many more hash functions like this one.
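A sketch of this hash function and the pair count it implies. It assumes each record carries its zip code as a 5-character string under a hypothetical `zip` key.

```python
from collections import defaultdict


def zip_key(zip_code):
    """Bucket key from positions 1, 3, and 5 (1-based) of a 5-digit zip code."""
    return zip_code[0] + zip_code[2] + zip_code[4]


def bucket_by_zip(records):
    """Group records by zip_key; three digits give at most 1000 buckets."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[zip_key(rec["zip"])].append(rec)
    return buckets


# 1000 buckets, each pairing about 1000 A-records with about 1000 B-records.
pairs_to_score = 1000 * (1000 * 1000)
```

Two records with zip codes 94305 and 97355 share the key `935`, so a typo in position 2 or 4 does not separate them, while a typo in position 1, 3, or 5 does. That is one reason more hash functions of this kind are needed.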
26
How Many False Positives?
• Scoring system: 100 pts. for each of name, addr, phone.
• Pairs with a score of 300 certainly refer to the same entity.
• What about pairs with a score of 220? 150? etc.
27
Using the Time-Lag
• We took advantage of the fact that a B-record was probably created shortly after the A-record.
• For the 300-score pairs, the average delay was 10 days.
• We did not even consider matching records with more than a 90-day lag.
28
Time-Lag-Trick --- (2)
• Bogus-pair time-lag avg. = 45 days.
• Good-pair time-lag avg. = 10 days.
• Suppose the pairs with score s have average time-lag d.
• Fraction of pairs with score s that are good: (45-d)/35.
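The formula follows from treating the score-s pairs as a mixture: if a fraction f of them are good, their average lag is d = 10f + 45(1 - f), and solving for f gives (45 - d)/35. A small sketch, with names chosen here for illustration:

```python
GOOD_LAG = 10.0   # average lag in days for good pairs
BOGUS_LAG = 45.0  # average lag in days for bogus pairs


def fraction_good(avg_lag):
    """Solve avg_lag = f * GOOD_LAG + (1 - f) * BOGUS_LAG for f."""
    return (BOGUS_LAG - avg_lag) / (BOGUS_LAG - GOOD_LAG)


all_good = fraction_good(10.0)   # average lag matches the good pairs
half_good = fraction_good(27.5)  # average lag midway between 10 and 45
none_good = fraction_good(45.0)  # average lag matches the bogus pairs
```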
29
Profile of Time-Lag
[Figure: time-lag plotted against score. Score axis marked at 300, 185, 120, 100; lag axis marked at 10 and 45.]
30
Generalizing the Time-Lag Trick
• All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.
• Example: reserve phones for checking.
• Not even essential that all records have the property.
31
Summary
• Entity-resolution: important step in database integration.
• Minhashing: useful tool for converting sets into easily comparable vectors.
• Locality-sensitive hashing: powerful technique for finding similar objects of many kinds.