US007912842B1
(12) United States Patent
Bayliss
(10) Patent No.: US 7,912,842 B1
(45) Date of Patent: Mar. 22, 2011
(54) METHOD AND SYSTEM FOR PROCESSING AND LINKING DATA RECORDS
(75) Inventor: David Bayliss, Delray Beach, FL (US)
(73) Assignee: LexisNexis Risk Data Management Inc., Boca Raton, FL (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by [illegible] days.
(21) Appl. No.: 10/357,418
(22) Filed: Feb. 4, 2003
(51) Int. Cl.
G06F 7/00 (2006.01)
G06F 17/30 (2006.01)
(52) U.S. Cl. 707/749
(58) Field of Classification Search 707/100, 707/102, 103 R, 103 Z; 706/45, 46, 48, 52
See application file for complete search history.
(56) References Cited
U.S. PATENT DOCUMENTS
4,543,630 A 9/1985 Neches
4,860,201 A 8/1989 Stolfo et al.
4,870,568 A 9/1989 Kahle et al.
4,925,311 A 5/1990 Neches et al.
5,006,978 A 4/1991 Neches
5,276,899 A 1/1994 Neches
5,303,383 A 4/1994 Neches et al.
5,423,037 A 6/1995 Hvasshovd
5,471,622 A 11/1995 Eadline
5,495,606 A 2/1996 Borden et al.
5,551,027 A 8/1996 Choy et al.
5,555,404 A 9/1996 Torbjørnsen et al.
5,655,080 A 8/1997 Dias et al.
5,715,469 A * 2/1998 Arning 715/533
5,732,400 A 3/1998 Mandler et al.
5,745,746 A 4/1998 Jhingran et al.
5,878,408 A 3/1999 Van Huben et al.
5,884,299 A 3/1999 Ramesh et al.
5,897,638 A 4/1999 Lasser et al.
5,983,228 A 11/1999 Kobayashi et al.
6,006,249 A 12/1999 Leong
6,026,394 A 2/2000 Tsuchida et al.
6,026,398 A * 2/2000 Brown et al. 707/5
6,081,801 A 6/2000 Cochrane et al.
6,266,804 B1 7/2001 Isman
6,658,412 B1 * 12/2003 Jenkins et al. 707/5
(Continued)
OTHER PUBLICATIONS
Henniger, “An Evolutionary Approach to Constructing Effective Software Reuse Repositories”, ACM Transactions of Software Engineering and Methodology, vol. 6, No. 2, Apr. 1997, pp. 111-140.*
(Continued)
Primary Examiner - Kavita Padmanabhan
(74) Attorney, Agent, or Firm - Hunton & Williams, LLP
(57) ABSTRACT
Various exemplary systems and methods for linking entity references and identifying associations are presented. In particular, a method is provided for linking a plurality of entity references to at least one entity. The method comprises the steps of evaluating a probability of a match between a first entity reference and a second entity reference based at least in part on a statistical significance of one or more field values being common to both the first entity reference and the second entity reference, wherein field value statistical significance is inversely related to a number of field value occurrences occurring in some or all of the plurality of entity references, and linking the first entity reference with the second entity reference when the probability is greater than or equal to a match threshold.
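The two conditions stated in the abstract can be written compactly. The notation below is illustrative only and is not taken from the patent: n(v) is the number of entity references in which field value v occurs, s(v) is the statistical significance of v, P(a, b) is the evaluated match probability for entity references a and b, and T is the match threshold.

```latex
n(v_1) \le n(v_2) \;\Longrightarrow\; s(v_1) \ge s(v_2),
\qquad
\mathrm{link}(a, b) \iff P(a, b) \ge T
```

The first relation only expresses that rarer field values carry more weight; the exact functional form is not given in the abstract.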
36 Claims, 31 Drawing Sheets
U.S. PATENT DOCUMENTS
2002/0073099 A1 * 6/2002 Gilbert et al. 707/104.1
2004/0064447 A1 * 4/2004 Simske et al. 707/5
OTHER PUBLICATIONS
Eike Schallehn et al., “Advanced Grouping and Aggregation for Data Integration,” Department of Computer Science, Paper ID: 222, pp. 1-16.
Vincent Coppola, “Killer APP,” Men’s Journal, vol. 12, No. 3, Apr. 2003, pp. 86-90.
Eike Schallehn et al., “Extensible and Similarity-based Grouping for Data Integration,” Department of Computer Science, pp. 1-17, 2002.
Rohit Ananthakrishna et al., “Eliminating Fuzzy Duplicates in Data Warehouses,” 12 pages, 2002.
Peter Christen et al., “Parallel Computing Techniques for High-Performance Probabilistic Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-11.
Peter Christen et al., “Parallel Techniques for High-Performance Record Linkage (Data Matching),” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-27.
Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-14.
William E. Winkler, “Matching and Record Linkage,” U.S. Bureau of the Census, pp. 1-38.
Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” ANU Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, pp. 1-11.
William E. Winkler, “The State of Record Linkage and Current Research Problems,” U.S. Bureau of the Census, 15 pages.
William E. Winkler, “Advanced Methods For Record Linkage,” Bureau of the Census, pp. 1-21.
William E. Winkler, “Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage,” Bureau of the Census Statistical Research Division, Oct. 4, 2000, 14 pages.
William E. Winkler, “State of Statistical Data Editing and Current Research Problems,” Bureau Of The Census Statistical Research Division, 10 pages.
The First Open ETL/EAI Software for the Real-Time Enterprise, Sunopsis, A New Generation ETL Tool, “Sunopsis™ v3 expedites integration between heterogeneous systems for Data Warehouse, Data Mining, Business Intelligence, and OLAP projects,” <www.suopsis.com>, 6 pages.
Alan Dumas, “The ETL Market and Sunopsis™ v3 Business Intelligence, Data Warehouse & Datamart Projects,” 2002, Sunopsis, pp. 1-7.
Teradata Warehouse Solutions, “Teradata Database Technical Overview,” 2002, pp. 1-7.
WhiteCross White Paper, May 25, 2000, “wx/des-Technical Information,” pp. 1-36.
Teradata Alliance Solutions, “Teradata and Ab Initio,” pp. 1-2, 2001.
Peter Christen et al., The Australian National University, “Febrl - Freely extensible biomedical record linkage,” Oct. 2002, pp. 1-67.
William E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Bureau Of The Census Statistical Research Division, Oct. 4, 2000, 12 pages.
William E. Winkler et al., “An Application of the Fellegi-Sunter Model of Record Linkage to The 1990 U.S. Decennial Census,” U.S. Bureau of the Census, pp. 1-22.
William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Bureau of the Census, pp. 1-13.
Fritz Scheuren et al., “Recursive Merging and Analysis of Administrative Lists and Data,” U.S. Bureau of the Census, 9 pages.
William E. Winkler, “Record Linkage Software and Methods for Merging Administrative Lists,” U.S. Bureau of the Census, Jul. 7, 2001, 11 pages.
Enterprises, Publishing and Broadcasting Limited, Acxiom-Abilitec, pp. 44-45.
TransUnion, Credit Reporting System, Oct. 9, 2002, 4 pages, <http://www.transunion.com/content/page.jsp?id=/transunion/general/data/business/BusCre...>.
TransUnion, ID Verification & Fraud Detection, Account Acquisition, Account Management, Collection & Location Services, Employment Screening, Risk Management, Automotive, Banking Savings & Loan, Credit Card Providers, Credit Unions, Energy & Utilities, Healthcare, Insurance, Investment, Real Estate, Telecommunications, Oct. 9, 2002, 46 pages, <http://www.transunion.com>.
White Paper an Introduction to OLAP Multidimensional Terminology and Technology, 20 pages.
* cited by examiner
U.S. Patent Mar. 22, 2011 Sheet 1 of 31 US 7,912,842 B1
Fig. 1A
U.S. Patent Mar. 22, 2011 Sheet 2 of 31 US 7,912,842 B1
Fig. 1B (reference numerals 140, 144, 150)
U.S. Patent Mar. 22, 2011 Sheet 3 of 31 US 7,912,842 B1
Fig. 2
Incoming Data -> Prepare Raw Data (Preparation Phase)
-> Translate Data to Entity References (Link Phase)
-> Determine Inter-Relationships Between Entities (Association Phase)
-> Perform One or More Queries Using Master File
Repeat for Iteration N (208)
U.S. Patent Mar. 22, 2011 Sheet 4 of 31 US 7,912,842 B1
Fig. 3
Preparation Phase (202)
Incoming Data -> Format Raw Data into Entity References (302)
-> Join Entity References (Master File)
-> Remove Duplicate Entity References
-> Fill In Null Field Values
-> Remove Junk Field Values/Entries
Repeat for Iteration N
U.S. Patent Mar. 22, 2011 Sheet 5 of 31 US 7,912,842 B1
Fig. 4
Link Phase
Incoming Data -> Select Relevant Fields
-> Measure Field Variance and Reset DIDs If Necessary
-> Fill In Null Field Values
-> Generate Ghost Entity References
-> Link Entity References
-> Transition Links
-> Append/Modify DIDs in Master File
Repeat for Iteration N
U.S. Patent Mar. 22, 2011 Sheet 6 of 31 US 7,912,842 B1
Fig. 5
Probability-Based Matching: Content Weighting; Field Weighting
Entity Reference A, Entity Reference B -> Compare Entity References -> Indication of a Link Between Entity References
Context: Ethnicity; Familial Relationships; Nicknames/Synonyms; Location
U.S. Patent Mar. 22, 2011 Sheet 7 of 31 US 7,912,842 B1
Fig. 6
For each particular field entry f_n, determine total number (Count) of same field entries in master file:
Count = \sum_{i} [\text{if } (f_i = f_n) \text{ then } 1, \text{ else } 0]
-> Count Table
For each particular field entry f_n, determine context weight w_c:
w_c = \frac{1}{Count + Cautiousness}
-> Context Weight Table
Calculate probability (P) of match between Entity References using context weight(s)
Assign DIDs to Entity References based on probability (P)
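The counting and weighting steps on Sheet 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the master file is modeled as an in-memory list of dictionaries, and the names count_table, context_weight_table, rid, and last_name are hypothetical. The sheet does not spell out how the probability P is computed from the weights, so that step is left out.

```python
from collections import Counter


def count_table(master_file, field):
    """Count how many entity references share each non-null value of `field`."""
    return Counter(
        ref[field] for ref in master_file if ref.get(field) is not None
    )


def context_weight_table(counts, cautiousness=1.0):
    """Weight each value by 1 / (Count + Cautiousness), so rarer values weigh more."""
    return {value: 1.0 / (n + cautiousness) for value, n in counts.items()}


# Hypothetical master file: three references share a common surname, one is rare.
master_file = [
    {"rid": 1, "last_name": "Smith"},
    {"rid": 2, "last_name": "Smith"},
    {"rid": 3, "last_name": "Smith"},
    {"rid": 4, "last_name": "Bayliss"},
]

counts = count_table(master_file, "last_name")
weights = context_weight_table(counts, cautiousness=1.0)
```

With a cautiousness of 1, the value that occurs three times gets weight 1/4 and the value that occurs once gets weight 1/2, which is the inverse relationship the sheet describes.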
U.S. Patent Mar. 22, 2011 Sheet 8 of 31 US 7,912,842 B1
Fig. 7
Select subset N of Entity Reference fields
Next Entity Ref. A and Entity Ref. B
For each field (X) of the subset: Compare A.f_X with B.f_X (708)
No Match: proceed to next pair
Match: Add (A,B) to Match Table -> Match Table
Common DID transition using Match Table
Adjust DID of Affected Entity References in Master File
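The pairwise comparison on Sheet 8 can be sketched as follows. This is an illustration only: it assumes a pair is added to the match table when every field in the selected subset is equal and non-null, which the sheet does not state explicitly, and it compares all pairs directly, which is quadratic and only suitable for a small example. The names build_match_table, did, last_name, and birth_year are hypothetical.

```python
from itertools import combinations


def build_match_table(refs, fields):
    """Return (lower DID, higher DID) pairs for references that agree on all fields."""
    table = []
    for a, b in combinations(refs, 2):
        # Assumption: a match requires every selected field to be equal and non-null.
        if all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields):
            left, right = sorted((a["did"], b["did"]))
            if left != right:  # references already sharing a DID need no new link
                table.append((left, right))
    return table


# Hypothetical entity references; each starts with its own DID.
refs = [
    {"did": 1, "last_name": "Smith", "birth_year": 1970},
    {"did": 2, "last_name": "Smith", "birth_year": 1970},
    {"did": 3, "last_name": "Jones", "birth_year": 1980},
]

match_table = build_match_table(refs, ["last_name", "birth_year"])
```

Storing each pair with the lower DID on the left keeps the table free of mirrored duplicates.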
U.S. Patent Mar. 22, 2011 Sheet 9 of 31 US 7,912,842 B1
Fig. 8 (reference numerals 802, 804, 806, 808)
U.S. Patent Mar. 22, 2011 Sheet 10 of 31 US 7,912,842 B1
U.S. Patent Mar. 22, 2011 Sheet 11 of 31 US 7,912,842 B1
Fig. 10
Match Table
-> Inner Join of Match Table with itself by left DID (1002)
-> Expanded Match Table (1022)
-> Inner Join of Expanded Match Table with itself from right DID to left DID (1004)
-> Transitive Closure Table (1024)
-> Transition DIDs to lowest possible DID value (1006)
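The join-based transitive closure on Sheet 11 can be approximated in Python as shown below. This is a sketch under assumptions rather than the method as claimed: links are treated as symmetric, the right-to-left join is repeated until no new pairs appear (the sheet shows the joins as single steps), and the function name lowest_did_map is hypothetical.

```python
def lowest_did_map(match_table):
    """Map every DID in the match table to the lowest DID it is transitively linked to."""
    # Assumption: a link (l, r) also implies (r, l).
    pairs = set()
    for left, right in match_table:
        pairs.add((left, right))
        pairs.add((right, left))

    while True:
        # Index pairs by their left DID so the join below is a dictionary lookup.
        by_left = {}
        for left, right in pairs:
            by_left.setdefault(left, set()).add(right)
        # Join from right DID to left DID: (a, b) and (b, c) yield (a, c).
        joined = {
            (a, c) for a, b in pairs for c in by_left.get(b, ()) if a != c
        }
        if joined <= pairs:  # nothing new was produced, so the closure is complete
            break
        pairs |= joined

    lowest = {}
    for left, right in pairs:
        lowest[left] = min(lowest.get(left, left), right)
    return lowest


# Hypothetical match table: 1-2 and 2-3 chain together, 5-6 is separate.
transitions = lowest_did_map([(1, 2), (2, 3), (5, 6)])
```

Applying the resulting map to the master file would give references 1, 2, and 3 the DID 1, and references 5 and 6 the DID 5.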
U.S. Patent Mar. 22, 2011 Sheet 12 of 31 US 7,912,842 B1
Fig. 11
Select subset N of data fields (1102)
-> For each field (X) of the subset, generate Field Unique Value Table (1104)
-> Cross-Produce Field Unique Value Tables to generate Ghost Table
-> Ghost Table (1128)
-> Update Master File to Include Ghost Entity References (1108)
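The cross product on Sheet 12 can be illustrated with a short Python sketch. It assumes the references passed in all belong to one entity and that combinations already present are not repeated as ghosts; neither assumption is stated on the sheet. The names ghost_references, name, address, and the ghost flag are hypothetical.

```python
from itertools import product


def ghost_references(refs, fields):
    """Cross the unique values of each field and return combinations not yet present."""
    existing = {tuple(r.get(f) for f in fields) for r in refs}
    # One "field unique value table" per selected field, ignoring nulls.
    unique_values = [
        sorted({r[f] for r in refs if r.get(f) is not None}) for f in fields
    ]
    ghosts = []
    for combo in product(*unique_values):
        if combo not in existing:
            ghosts.append(dict(zip(fields, combo), ghost=True))
    return ghosts


# Hypothetical references for one entity seen under two names at two addresses.
entity_refs = [
    {"name": "Mary Smith", "address": "1 Main St"},
    {"name": "Mary Jones", "address": "9 Oak Ave"},
]

ghosts = ghost_references(entity_refs, ["name", "address"])
```

Here the two observed references produce two ghost references, one for each name and address pairing that was never observed together.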
U.S. Patent Mar. 22, 2011 Sheet 13 of 31 US 7,912,842 B1
U.S. Patent Mar. 22, 2011 Sheet 14 of 31 US 7,912,842 B1
Fig. 13 (1300)
Measure variance along each 'axis'
Variance > Threshold? (1304)
Yes: Reset DID of Each Entity Reference to its RID
Mark Entity References as having been 'Broken'
Mark Entity References as suspect (1314)
End (1312)
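The variance check on Sheet 14 can be sketched as below. The sheet does not say how variance is measured, so this example assumes numeric fields and uses the population variance from Python's statistics module; the threshold value and the names reset_if_inconsistent, rid, did, broken, and birth_year are likewise hypothetical.

```python
from statistics import pvariance


def reset_if_inconsistent(refs, axes, threshold):
    """Reset each reference's DID to its RID if any axis varies more than `threshold`."""
    for axis in axes:
        values = [r[axis] for r in refs if r.get(axis) is not None]
        # Assumption: variance is the population variance of a numeric field.
        if len(values) > 1 and pvariance(values) > threshold:
            for r in refs:
                r["did"] = r["rid"]  # undo the link: each reference stands alone again
                r["broken"] = True
            return True
    return False


# Hypothetical entity whose references disagree strongly on birth year.
linked = [
    {"rid": 7, "did": 7, "birth_year": 1950},
    {"rid": 8, "did": 7, "birth_year": 1951},
    {"rid": 9, "did": 7, "birth_year": 1990},
]
was_reset = reset_if_inconsistent(linked, ["birth_year"], threshold=25)

# Hypothetical entity whose references are consistent.
consistent = [
    {"rid": 1, "did": 1, "birth_year": 1950},
    {"rid": 2, "did": 1, "birth_year": 1950},
    {"rid": 3, "did": 1, "birth_year": 1951},
]
was_kept = not reset_if_inconsistent(consistent, ["birth_year"], threshold=25)
```

Resetting every DID to the reference's own RID lets a later link iteration regroup the references from scratch.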
U.S. Patent Mar. 22, 2011 Sheet 15 of 31 US 7,912,842 B1
Fig. 14
Association Phase
Incoming Data -> Determine Degree of Commonality (Association) Between Entities (1402)
-> Mark Highly Associated Entities as Related (1404)
-> Generate Ghost Entity References from Relations (1406)
-> Transitive Closure For Additional Associations Between Entities (1408)
Repeat for Iteration N (1410)
U.S. Patent Mar. 22, 2011 Sheet 16 of 31 US 7,912,842 B1
Fig. 15
Select subset N of Entity Reference fields
Next Entity Ref. A and Entity Ref. B (1504)
For each field (X) of the subset (1506): Compare A.f_X with B.f_X
No Match: proceed to next field or pair
Match: Increase score of (C,D) pair in Score Table -> Score Table (1522)
score(C,D) >= Threshold?
No: Entity C not Associated with Entity D, Entity D not Associated with Entity C (1512)
Yes: Mark Entity C as Associate of Entity D, Entity D as Associate of Entity C
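The scoring loop on Sheet 16 can be sketched as follows. This is an illustration under assumptions: each matching field adds one point to the score of the pair of entities the two references belong to (the sheet says only that the score is increased), references that already share a DID are skipped, and the names associate_entities, did, address, and phone are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def associate_entities(refs, fields, threshold):
    """Score entity pairs by shared field values and mark pairs at or above threshold."""
    scores = defaultdict(int)
    for a, b in combinations(refs, 2):
        c, d = sorted((a["did"], b["did"]))
        if c == d:
            continue  # same entity: nothing to associate
        for f in fields:
            if a.get(f) is not None and a.get(f) == b.get(f):
                scores[(c, d)] += 1  # assumption: one point per matching field

    associates = defaultdict(set)
    for (c, d), score in scores.items():
        if score >= threshold:
            associates[c].add(d)
            associates[d].add(c)
    return dict(scores), dict(associates)


# Hypothetical references: entities 1 and 2 share address and phone, entity 3 only address.
household = [
    {"did": 1, "address": "1 Main St", "phone": "555-0100"},
    {"did": 2, "address": "1 Main St", "phone": "555-0100"},
    {"did": 3, "address": "1 Main St", "phone": "555-0199"},
]

scores, associates = associate_entities(household, ["address", "phone"], threshold=2)
```

With a threshold of 2, only entities 1 and 2 are marked as associates of each other; entity 3 shares a single field with each and stays unassociated.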
U.S. Patent Mar. 22, 2011 Sheet 17 of 31 US 7,912,842 B1
Fig. 16
Relatives File (1620)
-> Filter (1602)
-> Duplicate Records (1604)
-> Inner Join by left DID (1606)
-> Set weight, separation, and dedup values (1608)
U.S. Patent Mar. 22, 2011 Sheet 18 of 31 US 7,912,842 B1
Fig. 17
Match Table (1730)
-> Filter (1702)
-> Duplicate Records (1704)
-> Duplicate Match Table (1722)
-> Inner Join duplicate match table with master file (1706)
-> Outlier Reference Table (1724)
-> Score DIDs using grading criteria (1708) <- Grading Criteria
-> Sum DID scores (1710)
-> DID Score Table (1726)
-> Filter DID Score Table (1712)
-> Obtain entity references of selected DIDs from Outlier Reference Table (1714)