Graduate School ETD Form 9 (Revised 12/07)
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By
Entitled
For the degree of
Is approved by the final examining committee:
Chair
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.
Approved by Major Professor(s): ____________________________________
____________________________________
Approved by: Head of the Graduate Program Date
Mohamed Ahmed Mohamed Ahmed Yakout
Guided Data Cleaning
Doctor of Philosophy
Ahmed K. Elmagarmid
Walid G. Aref
Luo Si
Jennifer Neville
Ahmed K. Elmagarmid
Sunil K. Prabhakar / William J. Gorman 06/15/2012
Graduate School Form 20 (Revised 9/10)
PURDUE UNIVERSITY GRADUATE SCHOOL
Research Integrity and Copyright Disclaimer
Title of Thesis/Dissertation:
For the degree of Choose your degree
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.
______________________________________ Printed Name and Signature of Candidate
______________________________________ Date (month/day/year)
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
Guided Data Cleaning
Doctor of Philosophy
Mohamed Yakout
06/23/2012
GUIDED DATA CLEANING
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Mohamed A. Yakout
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
August 2012
Purdue University
West Lafayette, Indiana
To my parents, Princisa and Ahmed, for a life of sacrifice and inspiration.
To my wife, Walaa, for years of love, dedication and support.
To my children, Jasmine and Zeyad, the light of my eyes.
ACKNOWLEDGMENTS
Coming to the end of this long journey, it is my pleasure to express my gratitude to a large number of people who have contributed, in many different ways, to making my success a part of their own.
First, I wish to express my deepest gratitude to my supervisor, Prof. Ahmed Elmagarmid. I am totally indebted to him for his continuous encouragement, efforts, and invaluable advice. He was a wonderful advisor, a great leader, and a close friend. I learned from Ahmed how to do high-quality research, how to transform my fledgling ideas into crisp research endeavors, and how to present and sell my ideas. I also learned from him how to think “out of the box” and see the value of a proposition. After all, I was truly fortunate to have Ahmed as my advisor, and I am delighted and honored to have been his student.
I will always be grateful to Prof. Walid Aref for the thoughtful discussions with him on both professional and personal levels. Whenever I needed advice or was stuck on a decision, Walid was always there with his experience and invaluable comments. I am also grateful to Dr. Mourad Ouzzani for the countless hours he spent with me on multiple research projects. He treated me as his brother and never made me feel that he was a research faculty member while I was only a graduate student.
I would like to thank Prof. Mikhail Atallah for his support and encouragement. I learned from him how teamwork is vital to solving problems efficiently. Special thanks to Prof. Jennifer Neville for her help and continuous support. I learned from her a great deal about the importance of data mining, which enabled me to find plenty of room to involve data mining techniques in my solutions.
I would also like to thank Prof. Luo Si for serving on my exam committee. He was always kind and supportive, in addition to offering insightful comments. I also want to thank Prof. Chris Clifton for his continuous help and for his useful comments on my prelim. I would also like to thank Dr. William Gorman and Renate Mallus for their dedication to students and for helping me. Dr. Gorman was always there to fill my advisor’s absence during his leave.
During my summer internships at Microsoft Research and Google, I worked with wonderful, smart people. My internship at Microsoft was a truly unforgettable experience. My sincere thanks to Kris Ganjam for being a wonderful mentor and to Dr. Kaushik Chakrabarti for sharing his advice and experience with me. My discussions with them significantly contributed to my way of thinking and of attacking real-world data problems. I also had a great experience during my internship at Google Inc. I am especially grateful to Dr. Moustafa Hammad for his wonderful mentorship. Moustafa was always ready to get into deep discussions with me on how to improve solution approaches, or even how to better implement them. On a personal level, I value my friendship with Moustafa to the greatest extent.
Special thanks are due to my friends and colleagues who made my graduate life easier. In particular, thanks to Dr. Hicham Elmongui for his continuous help and advice during my first few years of the PhD. I would also like to acknowledge Dr. Hazem Elmeleegy, Dr. Mohamed El Tabakh, Samer Barakat, Dr. Ahmed Amin, Ahmed Abdel-Gawad, Amr Ebeid and Amgad Madkour.
My sincere gratitude goes to my wife, Walaa. Walaa’s love, dedication, perseverance, and belief in me were key factors in my success. Her support is infinite and her patience is endless. She was always reliable in taking care of anything that might keep me away from studying. She gave the highest priority to me and to our kids, Jasmine and Zeyad.
My forever gratitude goes to my parents for their sacrifices, endless support, encouragement, and continuous prayers for me. I cannot be grateful enough to them. They taught me the value of respect, hard work, good judgment, and honesty. Thanks to my sisters, Rabab and Rania, for their support and advice.
Above all, I thank ALLAH. For only through ALLAH’s grace and blessing has
this pursuit been possible. I pray for ALLAH’s support and guidance in the rest of
my career and my life.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Key Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 User’s Direct Interaction for Data Cleaning . . . . . . . . . . 3
1.1.2 Scalable Data Cleaning Techniques . . . . . . . . . . . . . . 3
1.1.3 User’s Indirect Interaction for Data Cleaning . . . . . . . . . 5
1.1.4 Leveraging the WWW for Data Cleaning . . . . . . . . . . . 7
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Constraint-based Data Cleaning . . . . . . . . . . . . . . . . 8
1.2.2 Machine Learning Techniques for Data Cleaning . . . . . . . 9
1.2.3 Involving Users in the Data Cleaning Process . . . . . . . . 10
1.2.4 WWW for Data Integration and Data Cleaning . . . . . . . 12
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 GUIDED DATA REPAIR . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Problem Definition and Solution Overview . . . . . . . . . . . . . . 20
2.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Solution Overview . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Generating Candidate Updates . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Dirty Tuples Identification and Updates Discovery: . . . . . 24
2.3.2 Updates Consistency Manager . . . . . . . . . . . . . . . . . 28
2.3.3 Grouping Updates . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Ranking and Displaying Suggested Updates . . . . . . . . . . . . . 31
2.4.1 VOI-based Ranking . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.2 Active Learning Ordering . . . . . . . . . . . . . . . . . . . 36
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 VOI Ranking Evaluation . . . . . . . . . . . . . . . . . . . . 41
2.5.2 GDR Overall Evaluation . . . . . . . . . . . . . . . . . . . . 43
2.5.3 User Efforts vs. Repair Accuracy . . . . . . . . . . . . . . . 47
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 SCALABLE APPROACH TO GENERATE DATA CLEANING UPDATES . . 49
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Problem Definition and Solution Approach . . . . . . . . . . . . . . 52
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.2 Solution Approach . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Modeling Dependencies and Predicting Updates . . . . . . . . . . . 57
3.3.1 Modeling Dependencies . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Predicting Updates . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Scaling Up the Maximal Likelihood Repairing Approach . . . . . . 62
3.4.1 Process Overview . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Repair Generation Phase . . . . . . . . . . . . . . . . . . . . 64
3.4.3 Tuple Repair Selection Phase . . . . . . . . . . . . . . . . . 67
3.4.4 Approximate Solution for Tuple Repair Selection . . . . . . 72
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.1 Repair Quality Evaluation . . . . . . . . . . . . . . . . . . . 76
3.5.2 SCARE Scalability . . . . . . . . . . . . . . . . . . . . . . . 82
3.5.3 SCARE vs. ERACER to Predict Missing Values . . . . . . . 83
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4 INDIRECT GUIDANCE FOR DEDUPLICATION (BEHAVIOR BASED RECORD LINKAGE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Behavior Based Approach . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.3 Pre-processing and Behavior Extraction . . . . . . . . . . . 90
4.2.4 Matching Strategy . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Candidate Generation Phase . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Accurate Matching Phase . . . . . . . . . . . . . . . . . . . . . . . 100
4.4.1 Statistical Modeling Technique . . . . . . . . . . . . . . . . 100
4.4.2 Information Theoretic technique (Compressibility) . . . . . . 108
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5.1 Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5 HOLISTIC MATCHING WITH WEB TABLES FOR ENTITIES AUGMENTATION AND FINDING MISSING VALUES . . . . . . . . . . . . . . . . 120
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Holistic Matching Framework . . . . . . . . . . . . . . . . . . . . . 127
5.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2.2 General Augmentation Framework . . . . . . . . . . . . . . 128
5.2.3 Direct Match Approach . . . . . . . . . . . . . . . . . . . . 129
5.2.4 Holistic Match Approach . . . . . . . . . . . . . . . . . . . . 130
5.3 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.4 Building the SMW Graph and computing FPPR . . . . . . . . . . . 135
5.4.1 Building the SMW Graph . . . . . . . . . . . . . . . . . . . 135
5.4.2 Computing FPPR on SMW Graph . . . . . . . . . . . . . . 141
5.5 Supporting Core Operations . . . . . . . . . . . . . . . . . . . . . . 141
5.5.1 Augmentation-By-Attribute (ABA) . . . . . . . . . . . . . . 141
5.5.2 Augmentation-By-Example (ABE) . . . . . . . . . . . . . . 142
5.6 Handling n-ary Web Tables . . . . . . . . . . . . . . . . . . . . . . 143
5.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 145
5.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . 145
5.7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 147
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 155
6.2 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2.1 User Centric Data Cleaning . . . . . . . . . . . . . . . . . . 158
6.2.2 Holistic Data Cleaning . . . . . . . . . . . . . . . . . . . . . 158
6.2.3 The WWW for Data Cleaning . . . . . . . . . . . . . . . . . 159
6.2.4 Private Data Cleaning . . . . . . . . . . . . . . . . . . . . . 159
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
LIST OF TABLES
Table Page
5.1 Web tables matching features as documents. . . . . . . . . . . . . . . . 139
5.2 Query entity domains and augmenting attributes . . . . . . . . . . . . 146
LIST OF FIGURES
Figure Page
2.1 Example data and rules . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 GDR Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strategies against the amount of feedback. Feedback is reported as the percentage of the maximum number of verified updates required by an approach. Our application of the VOI concept shows superior performance compared to other naïve ranking strategies. . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Overall evaluation of GDR compared with other techniques. The combination of the VOI-based ranking with the active learning was very successful in efficiently involving the user. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . 45
2.5 Accuracy vs. user efforts. As the user spends more effort with GDR, the overall accuracy is improved. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . . . . . . . 48
3.1 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Generated predictions for tuple repairs with their corresponding prediction probabilities for tuple t4 in Figure 3.1. . . . . . . . . . . . . . . . . 67
3.3 Step by step demonstration of the SelectTupleRepair algorithm. At each iteration, the vertex with minimum weighted degree is removed as long as it is not the only vertex in its corresponding vertex set. . . . . . . . . . 70
3.4 Quality vs. the percentage of errors: SCARE maintains high precision by making the best use of δ, the allowed amount of changes. . . . . . . . . 76
3.5 δ controls the amount of changes to apply to the database: small δ guarantees high precision at the cost of the recall, and vice versa. . . . . . . 78
3.6 Using SCARE in an iterative way helps improve the recall and the overall quality of the updates. The decrease in the precision is small compared to the increase in the recall, achieving an overall high quality improvement demonstrated by the f-measure. . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Increasing the number of partition functions |H| improves the accuracy of the predictions and hence increases the precision. The recall is not affected much because we use a fixed δ. . . . . . . . . . . . . . . . . . . . . . . 81
3.8 SCARE scalability when varying the database size. . . . . . . . . . . . 82
3.9 Comparison between SCARE and ERACER to predict missing values. Generally, both SCARE and ERACER show high accuracy in predicting the missing values. SCARE uses in this experiment a Naïve Bayesian model, while ERACER leverages domain knowledge interpreted in a carefully designed Bayesian Network. . . . . . . . . . . . . . . . . . . . . . 84
4.1 Process for behavior-based record linkage. . . . . . . . . . . . . . . . . 89
4.2 Retail store running example. . . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Actions patterns in the complex plane and the effect on the magnitude. 98
4.4 Behavior linkage overall quality. . . . . . . . . . . . . . . . . . . . . . . 112
4.5 Improving the textual matching quality. . . . . . . . . . . . . . . . . . 114
4.6 Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.7 Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.8 Behavior linkage performance. . . . . . . . . . . . . . . . . . . . . . . . 118
5.1 APIs of the core operations . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2 ABA operation using web tables . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 InfoGather System Architecture . . . . . . . . . . . . . . . . . . . . . . 135
5.4 The distribution of the number of columns per web table and statistics about the relational web tables. . . . . . . . . . . . . . . . . . . . . . . 144
5.5 Augmenting-By-Attribute (ABA) evaluation . . . . . . . . . . . . . . . . . 148
5.6 Sensitivity of the precision and coverage to the number of examples. The Holistic shows high precision and maintains high coverage in comparison to DMA. 150
5.7 Joint sensitivity analysis to the number of examples and the head vs. tail records in the web tables. The Holistic is robust in comparison to the DMA. 151
5.8 Web tables matching accuracy . . . . . . . . . . . . . . . . . . . . . . . . 152
5.9 Response time evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 153
ABSTRACT
Yakout, Mohamed A. Ph.D., Purdue University, August 2012. Guided Data Cleaning. Major Professor: Ahmed K. Elmagarmid.
Until recently, data cleaning techniques have focused on providing fully automated solutions, which are risky to rely on, without efficiently and effectively considering collaboration with the data users and other available resources. This dissertation studies techniques to involve data users directly and indirectly, as well as to leverage the WWW, specifically web tables, for data cleaning tasks. In particular, the dissertation addresses four key challenges for guided data cleaning.
The first challenge relates to directly involving users in the data cleaning process.
The goal is to efficiently combine the best of both the user fidelity to guide the data
cleaning process and the existing automatic cleaning techniques to suggest cleaning
updates. For this purpose, we develop the necessary principles to reason about which
questions to forward to the user using a novel combination of decision theory and
active learning.
The second challenge is scalability as existing automatic cleaning techniques are
not scalable. We introduce a new approach that is based on statistical machine
learning techniques. We achieve scalability by introducing a robust mechanism to
partition the database, and then aggregate the final cleaning decisions from the several
partitions.
The third challenge relates to involving users indirectly in a data cleaning task. We notice that the users’ actions (or behavior), which can be found in the system’s log, can be useful evidence for the task of deduplicating the users themselves. We develop the necessary pattern detection and modeling algorithms for this purpose.
Finally, the fourth challenge relates to leveraging the WWW for data cleaning tasks. We address the problem of finding missing values (or entity augmentation) using web tables. Our solution relies on aggregating answers from several web tables that directly and indirectly match the user’s entities. We model this problem as topic-sensitive PageRank, which captures the holistic semantic match of a web table to the topic of the list of entities.
Our experimental evaluations using real-world datasets demonstrate the effectiveness and efficiency of our proposed approaches to improve the quality of dirty databases.
1. INTRODUCTION
This dissertation studies techniques to involve data users (or domain experts), in
addition to leveraging data on the Web in data cleaning tasks. The purpose is to
achieve better data quality efficiently and effectively.
Data quality issues are of several kinds, e.g., inaccuracy, inconsistency, duplicates
and incompleteness, and may be the consequence of several reasons, e.g., misspelling,
integration from heterogeneous sources, and software bugs. Poor data quality is a
fact of life for most organizations and can have serious implications on their efficiency
and effectiveness [1].
Data quality experts estimate that erroneous data can cost a business as much as 10 to 20% of its total system implementation budget [2]. They agree that as much as 40 to 50% of a project budget might be spent correcting data errors in time-consuming, labor-intensive and tedious processes. The proliferation of data also heightens the relevance of data cleaning and makes the problem more challenging: more sources and larger amounts of data imply a greater variety and intricacy of data quality problems and higher complexity for maintaining the quality of the data in a cost-effective way. Data quality is equally critical in the health care domain: in such applications, incorrect information about patients in an Electronic Health Record (EHR) may lead to inconsistent treatments and prescriptions, which consequently may cause severe medical problems, including death. As a result, various computational procedures for data cleaning have been proposed by the database community to (semi-)automatically identify errors and, when possible, correct them.
Most existing approaches to clean dirty databases either rely on predefined data quality rules that should be satisfied by the database or rely on machine learning techniques. Most of these techniques focus on providing fully automated solutions using different heuristics, which can be risky, especially for critical data. To guarantee that the best desired quality updates are applied to the database, users (domain experts) should be involved to confirm updates. This highlights the increasing need for techniques that combine the best of both worlds.
There are other cases where involving users or relying on data cleaning rules will not be helpful, for example, when there are many missing values in the database. Consider the following scenario in an enterprise database, where we have a table containing a list of companies, but their location or contact information is missing. Neither rules nor correlations among the attributes are helpful in this case, and a user would have to collect all this information manually. The WWW is helpful in most such situations, as it covers a large spectrum of domains. This highlights the need for techniques to automatically leverage the WWW for such data cleaning tasks.
In this chapter, we start in Section 1.1 by highlighting the key challenges we address in this dissertation, and we then discuss the related work in Section 1.2.
Section 1.3 summarizes the contributions and goals of this dissertation in view of the
challenges presented in Section 1.1. Finally, Section 1.4 outlines the structure of this
dissertation.
1.1 Key Challenges
In this section, we highlight some of the challenges in data cleaning. We focus on four key challenges around which the contributions of this dissertation revolve. The challenges relate to efficiently involving the user in the data cleaning process, the scalability of the techniques that generate cleaning updates, leveraging user-generated logs for a data cleaning task, and finally, leveraging the WWW for data cleaning.
1.1.1 User’s Direct Interaction for Data Cleaning
Existing automated solutions for data cleaning can be used as generators of data cleaning updates. The data users can then be involved to inspect such updates and confirm the correct ones. However, involving the user can be very expensive because of the large number of possibilities to be verified. Since automated techniques for data cleaning produce far more updates than one can expect the user to handle, techniques for selecting the most useful updates for presentation to the user become very important.
The key challenge in involving users is to determine how and in what order suggested updates should be presented to them. This requires developing a set of principled measures that estimate the improvement in quality to reason about the selection of possible updates, as well as investigating machine learning techniques to minimize user effort. The goal is to achieve a good trade-off between high-quality data and minimal user involvement.
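The trade-off above can be illustrated with a small sketch. The code below is not the dissertation's GDR algorithm; it is a hypothetical value-of-information style scorer (names such as `CandidateUpdate` and `voi_score` are invented for illustration) that surfaces high-impact, uncertain repairs to the user first, while repairs the system is already confident about are deferred.

```python
# Minimal sketch of ranking candidate repairs for user verification.
# All names here are illustrative, not from the dissertation: each update
# is scored by how many violations it would resolve, weighted by how
# uncertain the system is about it, so that high-impact, high-uncertainty
# suggestions are shown to the user first.
from dataclasses import dataclass

@dataclass
class CandidateUpdate:
    tuple_id: int      # which record the repair applies to
    attribute: str     # which attribute would change
    confidence: float  # system's belief the repair is correct (0..1)
    impact: int        # number of constraint violations it would resolve

def voi_score(u: CandidateUpdate) -> float:
    # Expected benefit of asking the user: large when the repair resolves
    # many violations and the system is uncertain (confidence near 0.5),
    # small when the system is already sure either way.
    uncertainty = 1.0 - abs(2.0 * u.confidence - 1.0)
    return u.impact * uncertainty

def rank_for_user(updates):
    return sorted(updates, key=voi_score, reverse=True)

updates = [
    CandidateUpdate(1, "zip", confidence=0.95, impact=3),    # near-certain
    CandidateUpdate(2, "city", confidence=0.55, impact=3),   # uncertain, high impact
    CandidateUpdate(3, "state", confidence=0.50, impact=1),  # uncertain, low impact
]
print([u.tuple_id for u in rank_for_user(updates)])  # -> [2, 3, 1]
```

With this scoring, update 2 (uncertain and high-impact) is verified first, while update 1, though high-impact, is deferred because the system is already nearly certain about it.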
1.1.2 Scalable Data Cleaning Techniques
For inconsistent databases, we focus on solutions that provide cleaning updates in the form of value modifications. Most existing solutions follow constraint-based repairing approaches [3–5], which search for a minimal change of the database that satisfies a predefined set of constraints. While a variety of constraints (e.g., integrity constraints, conditional functional and inclusion dependencies) can detect the presence of errors, they are recognized to fall short of guiding the correction of the errors and, worse, may introduce new errors when repairing the data [6]. Moreover, despite the research conducted on integrity constraints to ensure the quality of the data, in practice, databases often contain a significant amount of non-trivial errors. These errors, both syntactic and semantic, are generally subtle mistakes that are difficult or even impossible to express using the general types of constraints available in modern DBMSs [7]. This highlights the need for different techniques to clean dirty databases.
Statistical machine learning (ML) techniques (e.g., decision trees, Bayesian networks) can usually capture dependencies, correlations, and outliers from datasets based on various analytic, predictive or computational models [8]. Existing efforts in data cleaning using ML techniques have mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To the best of our knowledge, no approach has considered ML techniques for repairing databases by value modification.
Involving ML techniques in repairing erroneous data is not straightforward and raises four major challenges: (1) Several attribute values (of the same record) may be dirty. Therefore, the process is not as simple as predicting values for a single erroneous attribute; it requires accurate modeling of correlations between the database attributes, assuming a subset is dirty and its complement is reliable. (2) An ML technique can predict an update for each tuple in the database, and the question is how to distinguish the predictions that should be applied. Therefore, a measure to quantify the quality of the predicted updates is required. (3) An over-fitting problem may occur when modeling a database with a large variety of dependencies that hold locally for data subsets but do not hold globally. (4) Finally, the process of learning a model from a very large database is expensive, and the prediction model itself may not fit in main memory. Despite the existence of scalable ML techniques for large datasets, they are either model dependent (i.e., limited to specific models, for example SVM [10]) or data dependent (e.g., limited to specific types of datasets such as scientific data and document repositories). Note that scalability is also an issue for the constraint-based repairing approaches [11].
Such limitations motivate the need for effective and scalable methods to accurately
predict cleaning updates with statistical guarantees.
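To make the prediction-with-a-quality-measure idea concrete, here is a deliberately tiny, hypothetical sketch (not the actual approach developed in Chapter 3): a replacement value for a presumably dirty attribute is predicted from a reliable attribute of the same tuple using co-occurrence statistics, and the prediction carries a probability that can be thresholded before the repair is applied. The sample data and function names are invented for illustration.

```python
# Toy illustration of challenge (2): predict a repair for a dirty attribute
# from a reliable attribute of the same tuple, and attach a probability to
# the prediction so that low-confidence repairs can be withheld.
from collections import Counter, defaultdict

clean = [  # (zip, city) pairs assumed reliable
    ("47906", "West Lafayette"),
    ("47906", "West Lafayette"),
    ("47906", "Lafayette"),
    ("46240", "Indianapolis"),
]

# Estimate P(city | zip) from co-occurrence counts.
by_zip = defaultdict(Counter)
for z, c in clean:
    by_zip[z][c] += 1

def predict_city(zip_code):
    counts = by_zip[zip_code]
    total = sum(counts.values())
    city, n = counts.most_common(1)[0]
    return city, n / total  # predicted repair and its probability

value, prob = predict_city("47906")
print(value, round(prob, 3))  # apply the repair only if prob exceeds a threshold
```

The probability returned alongside each prediction is exactly the kind of quality measure challenge (2) calls for: a repair like ("47906" → "West Lafayette", 0.667) can be accepted or rejected against a confidence threshold instead of being applied blindly.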
1.1.3 User’s Indirect Interaction for Data Cleaning
There are situations where the users’ interactions with a system are registered in a transaction log. This raises the following question: is it possible to leverage such a log for a data cleaning task? In this case, the users are indirectly involved in data cleaning, in contrast to the direct interaction we discussed in Section 1.1.1. Specifically, we focus on the task of deduplication, or record linkage.
Record linkage is the process of identifying records that refer to the same real-world entity. There has been a large body of research on this topic (refer to [9] for a recent survey). While most existing record linkage techniques focus on simple attribute similarities, more recent techniques consider richer information extracted from the raw data to enhance the matching process (e.g., [12–15]).
In contrast to most existing techniques, we are considering entity behavior as a
new source of information to enhance the record linkage quality. We observe that
by interpreting massive transactional datasets, for example, transaction logs, we can
discover behavior patterns and identify entities based on these patterns. Various
applications such as retail stores, web sites, and surveillance systems, maintain trans-
action logs that track the actions performed by entities over time. Entities in these
applications will usually perform actions, e.g., buying a specific quantity of milk at
a specific point in time or browsing specific pages within a web site, which represent
their behavior vis-a-vis the system.
To further motivate the importance of using behavior for record linkage, consider the following real-life example. Yahoo recently acquired a Jordanian Internet company called Maktoob, which, similar to Yahoo, provides a large number of Internet services to its customers in the region, such as e-mail, blogs, news, and online shopping. It was reported that with this acquisition, Yahoo will be able to add the 16 million Maktoob users to its 20 million users from the Middle East region1. Clearly, Yahoo should expect that the overlap between these two groups of users can be quite significant, and hence the strong need for record linkage. However, the user profile information stored by both companies may not be reliable enough because of different languages, fake information, etc. In this scenario, analyzing the users’ behavior, in terms of how they use the different Internet services, will be an invaluable source of information to identify potentially common users. Record linkage analysis based on entity behavior also has many other applications, for example, identifying common customers for stores that are considering a merger, tracking users accessing web sites from different IP addresses, and helping in crime investigations.
1http://www.techcrunch.com/2009/08/25/confirmed-yahoo-acquires-arab-internet-portal-maktoob/
A seemingly straightforward strategy to match two entities is to measure the
similarity between their behaviors. However, a closer examination shows that this
strategy may not be useful, for the following reasons. It is usually the case that
the complete knowledge of an entity’s behavior is not available to both sources, since
each source is only aware of the entity’s interaction with that same source. Hence, the
comparison of entities’ “behaviors” will in reality be a comparison of their “partial
behaviors”, which can easily be misleading. Moreover, even in the rare case when
both sources have almost complete knowledge about the behavior of a given entity
(e.g., a customer who did all his grocery shopping at Walmart for one year and then
at Safeway for another year), the similarity strategy still will not help. The problem is
that many entities do have very similar behaviors, and hence measuring the similarity
can at best group the entities with similar behavior together (e.g., [16–18]), but not
find their unique matches.
The key challenge is to devise an alternative strategy to match entities using their
behavior, because straightforward similarity is not the right way to address this
problem. This highlights further challenges: how to devise a matching function for
entities’ behavior, how to represent and model that behavior, and, finally, how to
design an efficient solution to handle the expected large volume of transactions.
1.1.4 Leveraging the WWW for Data Cleaning
As we mentioned earlier, there are cases where none of the automated cleaning
techniques can be helpful and we have to rely on external data sources. The WWW
is the richest data source and the question here is how to effectively and efficiently
leverage the WWW for data cleaning.
The Web contains a vast corpus of HTML tables. In this dissertation, we focus
on one class of HTML tables: entity-attribute tables (also referred to as relational
tables [19,20] and 2-dimensional tables [21]). Such a table contains values of multiple
entities on multiple attributes, each row corresponding to an entity and each column
corresponding to an attribute. Cafarella et al. reported 154M such tables from a
snapshot of Google’s crawl in 2008; we extracted 573M such tables from a recent
crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables as simply
web tables.
Consider an enterprise database with a table about companies where all
(or most) of their contact information is missing, or consider a product database
with a table about digital cameras. In the cameras table, the camera model name
is provided, but other attributes such as brand, resolution, price and optical
zoom have missing values. We call these attributes augmenting attributes and the
process of finding the missing attribute values entity augmentation.
Such augmentation would be difficult to perform using an enterprise database or
an ontology because the entities can come from any arbitrary domain. Today, users
manually find the web sources containing this information and assemble the values.
Assuming that this information is available, albeit scattered, in various web tables, we
can save a lot of time and effort if we can perform this operation automatically. This
requires discovering semantic matching relationships between the web tables. The
result is a Semantic Matching Web tables (SMW) graph. Constructing
and processing such a large graph is a significant challenge.
The challenges and requirements for such an operation are: (i) high precision
(#corraug / #aug) and high coverage (#aug / #entity), where #corraug, #aug and
#entity denote the number of entities correctly augmented, the number of entities
augmented and the total number of entities, respectively; (ii) fast (ideally interactive)
response times; and (iii) applicability to entities of any arbitrary domain.
1.2 Related Work
Improving data quality has been the focus of a large body of research for decades.
Our work is closely related to the following research areas: (i) constraint-based data
repair, (ii) statistical machine learning techniques for data cleaning, (iii) interactive
systems for data cleaning and user modeling, and (iv) leveraging the WWW for
data integration tasks.
1.2.1 Constraint-based Data Cleaning
This approach has two main steps: (1) identify a set of constraints that should be
followed by the data, and then (2) use the constraints and the data to find another
consistent database that minimally differs from the original database (e.g., [3–5, 22–
26]). Most earlier work (except [3, 23,24,26,27]) considers traditional full and denial
dependencies, which subsume functional dependencies (FDs). The repair algorithm
in [23] uses traditional FDs and inclusion dependencies (INDs) to derive repairs,
while [4] applies to restricted denial constraints. The work in [23] uses equivalence
classes to group the attribute values that are equivalent when obtaining a final
consistent database instance. The repair approach in [3] uses CFDs for data repair
and is a non-trivial extension of the repair algorithms described in [23]. The
proposed algorithms are based on a cost-based greedy heuristic to decide upon a
repair of errors. The work in [5] uses FDs to map the repairing problem to a
hypergraph optimization problem, where a heuristic vertex cover algorithm helps find
the minimal number of attribute values to modify in order to obtain a database
consistent with the FDs. The main drawback of these approaches is that the data
must be covered by a set of constraints that have been specified or validated by
domain experts, which may be an expensive manual process and may not be affordable
for all data domains. Moreover, the constraints usually fall short of correctly
identifying the right fixes [6].
In the literature, several classes of data quality rules have been introduced. For
example, Conditional Functional Dependencies (CFDs) [28] extend standard
functional dependencies (FDs) with conditional pattern tableaux that define
the subset of tuples, or context, in which the underlying FD holds. Matching
Dependencies (MDs) [11] are similar to FDs but take into account similarity
between values instead of exact matching. Record matching rules [29] and
Dedupalog [30] are used to identify duplicate records.
CFDs have been extensively studied due to their usefulness as integrity constraints
that summarize data semantics and identify data inconsistencies. Prior work focused
on consistency and implication analysis for CFDs [28], propagation of CFDs from
source data to views in data integration [31], extensions of CFDs by adding disjunction
and negation [32] or adding ranges [33], and estimating CFD confidence [34].
Consequently, competing algorithms for discovering CFDs were introduced in [33, 35,
36].
1.2.2 Machine Learning Techniques for Data Cleaning
Data cleaning using ML techniques has mainly focused on deduplication (see [9]
for a survey), data imputation (e.g., [7, 37]) and error detection (e.g., [8, 38]). To
the best of our knowledge, the problem of using ML techniques for repairing dirty
databases by value modification has not been addressed.
In data imputation, for example, [7] uses relational learning to learn the
characteristics of the attribute relationships in a relational database. The learned
model is then used to infer the missing values. This technique requires a priori
knowledge about the relationships between the attributes to construct the appropriate
Bayesian network for learning. Most similar techniques for data imputation are
limited to numerical or categorical attributes.
The main challenges for these techniques are (i) the scalability of modeling large
databases with all existing data correlations and (ii) the accuracy of replacement-value
prediction, because existing methods usually capture either local or global data
relationships and do not combine both views. Despite the existence of scalable
ML techniques for large datasets, they are either model-dependent (i.e., limited to
specific models, for example SVM [10]) or data-dependent (e.g., limited to specific
types of datasets such as scientific data and document repositories).
The scalability is also an issue for the constraint-based repairing approaches [11].
1.2.3 Involving Users in the Data Cleaning Process
Most existing systems for data cleaning provide tools for data exploration and
transformation without taking advantage of recent efforts on automatic data re-
pair. Usually, the repair actions are “explicitly specified by the user”. For example,
AJAX [39] proposes a declarative language to eliminate duplicates during data trans-
formations. Potter’s Wheel [40] combines data transformations with the detection
of errors in the form of irregularities. None of these systems efficiently leverages user
feedback through ranking or learning mechanisms.
Recent work on repairing critical data with quality guarantees was introduced
in [6]. It assumes that correct reference data exists and requires the user
to specify certain attributes as correct across the entire dataset. Moreover, the
proposed solution relies on a pre-specified set of editing rules.
Previous work on soliciting user feedback to improve data quality focuses mostly
on two objectives: (i) identifying correctly matched references in large-scale integrated
data (e.g., [41–44]) or for duplicate elimination (e.g., [45]); and (ii) improving the
prediction quality of a learning model by taking into account data acquisition costs
(e.g., cost-sensitive learning [46], utility-based learning [47], active learning [48],
selective supervision [49], selective repeated labeling [50]).
The work in [41] and [44] addresses incorporating user feedback into schema matching
tasks. [42] introduced a framework that provides many users with candidate matches,
without any ranking or selection mechanism, and then combines the responses to
converge to a correct answer. In [43], a decision-theoretic framework was proposed
to rank candidate reference matches to improve the quality of query responses in a
dataspace. This framework is limited to soliciting user feedback to resolve candidate
matches from a dataspace and cannot be applied in a constrained repair framework
for relational databases. [45] introduced an active-learning-based approach to build
a generic matching function for identifying duplicate records.
Selective supervision [49] combines decision theory with active learning. It uses a
value-of-information approach for selecting unclassified cases for labeling. Selective
repeated labeling [50] assumes unreliable user feedback and combines label
uncertainty with active learning to select instances for repeated labeling. The overall
goal of these approaches is to reduce the uncertainty in the predicted output without
regard to how important those predictions are to the quality of the underlying database.
Among the approaches that leverage user or entity behavior for entity deduplication,
a closely related area is user-adaptive systems for web navigation and
information retrieval (e.g., [16–18]). Most of these techniques focus on statistically
modeling user interactions to extract domain-specific features that capture users’
preferences. These models focus on the statistical significance of the extracted
features and may take into account the sequence of users’ actions. However, they do
not take into account the time dimension to determine repeated patterns of actions.
They are better suited to determining groups of common behaviors and may be used
to evaluate similarities between entities. However, they cannot help in computing
pair-wise matches between entities based on their recorded actions.
1.2.4 WWW for Data Integration and Data Cleaning
The most related work is the Octopus system developed by Cafarella et al.
[20]. The Extend operator proposed by Octopus is similar to the operation of
finding missing values using web tables. Octopus uses the web search API to retrieve
matching tables; this does not have any well-defined semantics. Since web search is
not meant for matching tables, in many cases the top 1000 returned URLs do not
provide any matching tables. Moreover, Octopus needs to invoke the search API
for each user record and perform clustering of web tables at query time, leading to
prohibitive performance even for small databases. This highlights the need for a
different approach that relies on well-defined semantic matching between the
web tables and the user table (the table with missing values). Moreover, the approach
needs to perform most of the “heavy lifting” in a preprocessing step so that we get
a fast response time.
Researchers have developed techniques to annotate web tables with column names
and names of relationships [51, 52]. These techniques can help to build a better
Semantic Matching Web tables (SMW) graph.
Building the SMW graph is related to the vast body of work on schema matching
[53–55]. Most modern approaches use several base techniques, such as linguistic
matching of attribute names and detecting overlap of data instances, and combine
them to determine the final matchings; the base techniques as well as the combiner
can be either machine learning-based techniques or non-learning methods [41, 56]. In
contrast to enterprise tables, web tables offer additional features that can be leveraged
to obtain better semantic matching, for example, the context (i.e., the text in the web
page) from which the table came.
There exists a rich body of work on leveraging HTML lists for set expansion and
table augmentation [57, 58]. However, the focus there is on discovering more entities
rather than augmenting the provided entities.
1.3 Contributions
The central thesis of this dissertation can be stated as follows: A complete and
effective solution to improve data quality is likely to depend on a close collaboration
between humans, in the form of data users (or domain experts), and machines, in the
form of automated solutions for cleaning dirty databases, in addition to leveraging
other information resources such as the enormous amount of data on the Web.
Data users must be in the loop because automatic cleaning techniques
may cause undesired changes to the database, and the data quality may even get worse.
Moreover, sometimes exploring other information resources is helpful. For example,
it is common for users to refer to the WWW to search for accurate information to
correct the database.
We claim the following list of contributions:
• We propose a novel interactive data cleaning framework for Guided Data Repair
(GDR) that tackles the problem of data cleaning from a more realistic and
pragmatic viewpoint. GDR interactively involves the user directly in guiding
the cleaning process alongside existing automatic cleaning techniques. The goal
is to effectively involve users in a way to achieve better data quality as quickly
as possible. The basic intuition is to continuously consult the user for cleaning
updates that are most beneficial in improving the data quality as we go.
• Since existing automatic data cleaning approaches are not scalable, we introduce
a new approach based on machine learning techniques. The objective
is to build new data cleaning update generators to be used within GDR for
large databases. Our approach relies on maximizing the data likelihood given
the underlying data distribution, which can be modeled using ML techniques.
We achieve scalability by introducing a mechanism for horizontal data
partitioning that enables parallel processing of data blocks; various ML methods can
be applied to provide “local” predictions that are then combined to obtain
final accurate predictions.
• We introduce a novel technique to involve the user indirectly in a data cleaning
task. We propose a technique that leverages the transaction logs generated by
users or entities to perform entity deduplication or record linkage. We present the
first formulation of this problem and introduce statistical techniques to model
entity behavior. Our approach for matching entities using their behavior
does not rely on measuring behavior similarities. Instead, we first merge the
entities’ transactions (or behavior) and measure the gain in recognizing behavior
patterns in the merged log. Since the transaction log is expected to be large, we
introduce efficient techniques to produce candidate matches by computing
approximate summaries of the entities’ behaviors.
• To effectively and efficiently leverage the WWW for a data cleaning task, we
propose a novel approach that relies on web tables to augment entities with missing
attribute values. The core of our approach is to match the dirty table (or query
table) with the web tables to get the relevant matched web tables. The relevant
matched web tables are then used to obtain the missing values. We develop a
novel holistic matching framework based on topic-sensitive PageRank (TSP) over
the SMW graph. We argue that by considering the query table as a topic and
web tables as documents, we can efficiently model the holistic matching as TSP.
We propose a system architecture that leverages preprocessing in MapReduce
to achieve extremely fast (interactive) response times at query time. Finally,
we present a machine learning-based technique for building the SMW graph.
Our key insight is that the text surrounding the web tables is important in
determining whether two web tables match or not. We propose a novel set of
features that leverage this insight.
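As a rough illustration of the holistic matching idea above, topic-sensitive PageRank can be viewed as personalized PageRank in which random jumps land only on the “topic” node, here the query table. The toy graph, damping factor, and iteration count below are assumptions for illustration, not the dissertation’s actual SMW construction:

```python
def topic_sensitive_pagerank(graph, topic_nodes, d=0.85, iters=50):
    """Personalized PageRank: the (1 - d) random jump lands only on
    topic_nodes.  graph: {node: [out-neighbors]}.  Returns node -> score."""
    nodes = list(graph)
    pref = {n: (1.0 / len(topic_nodes) if n in topic_nodes else 0.0)
            for n in nodes}
    rank = dict(pref)
    for _ in range(iters):
        new = {n: (1 - d) * pref[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:  # spread this node's rank over its out-links
                share = d * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:    # dangling node: send its mass back to the topic nodes
                for m in topic_nodes:
                    new[m] += d * rank[n] * pref[m]
        rank = new
    return rank
```

With the query table as the single topic node, web tables that are close to it in the SMW graph accumulate higher scores and are treated as the relevant matches.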
For each of our proposed techniques, we implemented a research prototype
demonstrating its applicability. Moreover, we conducted experimental studies using
realistic datasets to validate the effectiveness of our approaches.
1.4 Outline
The rest of the dissertation is organized as follows: Chapter 2 describes the GDR
framework for guided data repair. In Chapter 3, we introduce our scalable automatic
repair approach. Chapter 4 introduces our approach to leveraging users indirectly
for the data cleaning task of deduplication. The approach that uses web tables for
augmenting entities is described in Chapter 5. Finally, Chapter 6 concludes the
dissertation and points out directions for future work.
Parts of this dissertation have been published in conferences. In particular, the
work on guided data repair (Chapter 2) is described in a paper [59] in the Proceedings
of the 2011 International Conference on Very Large Databases (PVLDB 2011). Also
our implemented system for the GDR was accepted for demonstration [60] in the
2010 International Conference on Management of Data (SIGMOD 2010). The work
on leveraging entity behavior for deduplication (Chapter 4) is also described in a
paper [61] in the Proceedings of the 2010 International Conference on Very Large
Databases (PVLDB 2010). Finally, the work that leverages web tables for entity
augmentation (Chapter 5) is described in a paper [62] in the 2012 International
Conference on Management of Data (SIGMOD 2012).
2. GUIDED DATA REPAIR
In this chapter, we introduce GDR, a framework for guided data repair that
efficiently involves the user directly in the data cleaning process. Here, we describe
the framework components and the principles upon which we rely to reason about
the interaction with the data user. The objective is to converge faster to better
data quality with minimal user involvement.
The chapter is organized as follows: Section 2.1 provides a motivating example
for our approach, and we describe the problem and our solution approach in Sec-
tion 2.2. In Section 2.3 we discuss our mechanism to generate candidate cleaning
updates. In Section 2.4, we develop a principled approach to decide upon ranking
the questions for user feedback. We experimentally evaluate GDR in Section 2.5, and
finally, summarize the chapter in Section 2.6.
2.1 Introduction
A recent approach for repairing dirty databases is to use data quality rules in the
form of database constraints to identify tuples with errors and inconsistencies and
then use these rules to derive updates to these tuples. Most of the existing data
repair approaches (e.g., [3, 4, 23, 25]) focus on providing fully automated solutions
using different heuristics to select updates that would introduce minimal changes to
the data, which could be risky especially for critical data. To guarantee that the best
desired quality updates are applied to the database, users (domain experts) should
be involved to confirm updates. This highlights the increasing need for a framework
that combines the best of both worlds: a framework that automatically suggests
updates while efficiently involving users to guide the cleaning process.
          Name   SRC  STR            CT             STT  ZIP
     t1:  Jim    H1   REDWOOD DR     MICHIGAN CITY  MI   46360
     t2:  Tom    H2   REDWOOD DR     WESTVILLE      IN   46360
     t3:  Jeff   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
     t4:  Rick   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
     t5:  Joe    H1   BELL AVENUE    FORT WAYNE     IN   46391
     t6:  Mark   H1   BELL AVENUE    FORT WAYNE     IN   46825
     t7:  Cady   H2   BELL AVENUE    FORT WAYNE     IN   46825
     t8:  Sindy  H2   SHERDEN RD     FT WAYNE       IN   46774
(a) Data
ϕ1 : (ZIP → CT, STT, {46360 ∥ MichiganCity, IN})
ϕ2 : (ZIP → CT, STT, {46774 ∥ NewHaven, IN})
ϕ3 : (ZIP → CT, STT, {46825 ∥ FortWayne, IN})
ϕ4 : (ZIP → CT, STT, {46391 ∥ Westville, IN})
ϕ5 : (STR, CT → ZIP, { ,FortWayne ∥ })
(b) CFD Rules
Fig. 2.1.: Example data and rules
Motivating Example
Consider the following example. Let relation Customer(Name, SRC, STR, CT, STT,
ZIP) specify personal address information Street (STR), City (CT), State (STT) and
zip code (ZIP), in addition to the source (SRC) of the data, i.e., the data entry operator. An
instance of this relation is shown in Figure 2.1(a).
Data quality rules can be defined in the form of Conditional Functional Dependen-
cies (CFDs) as described in Figure 2.1(b). A CFD is a pair consisting of a standard
Functional Dependency (FD) and a pattern tableau that specifies the applicability of
the FD on parts of the data. For example, ϕ1 − ϕ4 state that the FD ZIP → CT, STT
(i.e., zip codes uniquely identify city and state) holds in the context where the ZIP is
46360, 46774, 46825 or 46391. Moreover, the pattern tableau enforces bindings be-
tween the attribute values, e.g., if ZIP= 46360, then CT= ‘Michigan City’. ϕ5 states
that the FD STR, CT → ZIP holds in the context where CT = ‘Fort Wayne’, i.e., street
names uniquely identify the zip codes whenever the city is ‘Fort Wayne’. Note that
all the tuples in Figure 2.1(a) have violations.
Typically, a repairing algorithm will use the rules and the current database in-
stance to find the best possible repair operations or updates. For example, t5 violates
ϕ4 and a possible update would be to either replace CT by ‘Westville’ or replace ZIP
by 46825, which would make t5 fall in the context of ϕ3 and ϕ5 but without violations.
To decide which update to apply, different heuristics can be used [4, 23].
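To make the example concrete, here is a hedged sketch of how candidate updates could be derived from constant CFDs like ϕ1–ϕ4: when a tuple matches a ZIP pattern but disagrees on CT, we can either fix the right-hand side (CT) or change the left-hand side (ZIP) to a code whose rule agrees with the current CT. The function and data layout are illustrative only, not the actual repair algorithm:

```python
def constant_cfd_updates(tuple_, zip_rules):
    """For rules of the form (ZIP -> CT), suggest fixes when the tuple
    matches a ZIP pattern but disagrees on CT: correct CT, or change ZIP
    to a code whose rule is consistent with the current CT."""
    updates = []
    expected_ct = zip_rules.get(tuple_["ZIP"])
    if expected_ct is not None and tuple_["CT"] != expected_ct:
        # fix the right-hand side ...
        updates.append(("CT", expected_ct))
        # ... or fix the left-hand side with any ZIP matching the current CT
        for z, ct in zip_rules.items():
            if ct == tuple_["CT"]:
                updates.append(("ZIP", z))
    return updates

zip_rules = {"46360": "MICHIGAN CITY", "46774": "NEW HAVEN",
             "46825": "FORT WAYNE", "46391": "WESTVILLE"}
t5 = {"Name": "Joe", "CT": "FORT WAYNE", "ZIP": "46391"}
# constant_cfd_updates(t5, zip_rules)
# -> [("CT", "WESTVILLE"), ("ZIP", "46825")]
```

Both suggestions repair the violation of ϕ4; which one is correct cannot be decided from the rules alone, which is exactly where heuristics, or the user, come in.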
However, automatic changes to data can be risky especially if the data is critical,
e.g., choosing the wrong value among the possible updates. On the other hand,
involving the user can be very expensive because of the large number of possibilities
to be verified. Since automated methods for data repair produce far more updates
than one can expect the user to handle, techniques for selecting the most useful
updates for presentation to the user become very important.
Moreover, to efficiently involve the user in guiding the cleaning process, it is
helpful if the suggested updates are presented in groups that share some contextual
information. This will make it easier for the user to provide feedback. For example,
the user can quickly inspect a group of tuples where the value ‘Michigan City’ is
suggested for the CT attribute. Similar grouping ideas have been explored in [45].
In the example in Figure 2.1, let us assume that a cleaning algorithm suggested
two groups of updates. In the first group, the updates suggest replacing the attribute
CT with the value ‘Michigan City’ for t2, t3, and t4 while in the second group they
suggest replacing the attribute ZIP with the value 46825 for t5 and t8. Let us assume
further that we were able to obtain the user feedback on the correct values for these
tuples; namely that the user has confirmed ‘Michigan City’ as a correct value of CT
for t2, t3, but as incorrect for t4, and 46825 as the correct value of ZIP for t5, but
as incorrect for t8. In this case, consulting the user on the first group, which has
more correct updates, is better and would allow faster convergence to a cleaner
database instance, as desired by the user. The second group would not lead to such
fast convergence.
Finally, in our example, we can recognize correlations between the attribute
values in a tuple and the correct updates. For example, when SRC = ‘H2’, the CT
attribute is incorrect most of the time, while the ZIP attribute is correct. This is
an example of the recurrent mistakes that exist in real data. Such patterns, with
correlations between the original tuple and the correct updates, can reduce user
involvement if captured by a machine learning algorithm.
The key challenge in involving users is to determine how and in what order sug-
gested updates should be presented to them. This requires developing a set of princi-
pled measures to estimate the improvement in quality to reason about the selection
process of possible updates as well as investigating machine learning techniques to
minimize user effort. The goal is to achieve a good trade-off between high quality
data and minimal user involvement.
In this chapter, we propose to tackle the problem of data cleaning from a more
realistic and pragmatic viewpoint. We present GDR, a framework for guided data
repair, that interactively involves the user in guiding the cleaning process alongside
existing automatic cleaning techniques. The goal is to effectively involve users in a
way to achieve better data quality as quickly as possible. The basic intuition is to
continuously consult the user for updates that are most beneficial in improving the
data quality as we go.
We use CFDs [63] as the data quality rules to derive candidate updates. CFDs
have proved to be very useful for data quality and triggered several efforts e.g., [33,36],
for their automatic discovery as well as making them a practical choice for data repair
techniques.
We summarize the contributions of this chapter as follows:
• We introduce GDR, a framework for data repair that selectively acquires user
feedback on suggested updates. User feedback is used to train the GDR machine
learning component that can take over the task of deciding the correctness of
these updates.
• We propose a novel ranking mechanism for suggested updates that applies a
combination of decision theory and active learning in the context of data quality
to reason about this task in a principled manner.
• We use the concept of value-of-information (VOI) [64] from decision theory to
develop a mechanism to estimate the update benefit from consulting the user on
a group of updates. We quantify the data quality loss by the degree of violations
to the rules. The benefit of a group of updates can then be computed as the
difference between the data quality loss before and after user feedback. Since we
do not know the user feedback beforehand, we develop a set of approximations
that allow efficient estimations.
• We apply active learning to order the updates within a group such that the
updates that can strengthen the prediction capabilities of the learned model the
most come first. To this end, we assign to each suggested update an uncertainty
score that quantifies the benefit to the prediction model, learning benefit, when
the update is labeled.
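One common way to instantiate such an uncertainty score is the entropy of the learner’s predicted probability that an update is correct; updates whose predictions are closest to 50/50 are presented first. The following is a generic uncertainty-sampling sketch, not GDR’s exact formulation:

```python
import math

def uncertainty(p_correct):
    """Binary entropy of the learner's predicted probability that an
    update is correct; maximal (1.0) when the model is most unsure."""
    p = min(max(p_correct, 1e-12), 1 - 1e-12)  # guard against log(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def order_for_labeling(updates_with_probs):
    """Most uncertain updates first, as in uncertainty-sampling
    active learning; input is a list of (update, p_correct) pairs."""
    return sorted(updates_with_probs, key=lambda up: -uncertainty(up[1]))
```

Labeling the most uncertain updates first gives the model the training examples it can learn the most from, which is the learning benefit referred to above.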
We conduct an extensive experimental evaluation on real datasets that shows the
effectiveness of GDR in allowing fast convergence to a better quality database with
minimal user intervention.
2.2 Problem Definition and Solution Overview
2.2.1 Problem Definition
We consider a database instance D with a relational schema S. Each relation
R ∈ S is defined over a set of attributes attr(R) and the domain of an attribute
A ∈ attr(R) is denoted by dom(A). We also consider a set of data quality rules Σ
that represent data integrity semantics. In this work, we consider rules in the form
of CFDs.
CFD Overview A CFD ϕ over R can be represented by ϕ : (X → Y, tp), where
X and Y ∈ attr(R), X → Y is a standard functional dependency (FD), referred to
as FD embedded in ϕ, and tp is a tuple pattern containing all attributes in X and Y .
For each A ∈ (X ∪ Y ), the value of the attribute A for the tuple pattern tp, tp[A], is
either a constant ’a’ ∈ dom(A), or ’−’ which represents a variable value. We denote
X as LHS(ϕ) (left hand side) and Y as RHS(ϕ) (right hand side). Examples for
CFD rules are provided in Figure 2.1.
To denote that a tuple t ∈ D matches a particular pattern tp, the symbol ≍ is
defined on data values and ’−’. We write t[X] ≍ tp[X] iff for each A ∈ X, either
t[A] = tp[A] or tp[A] = ’−’. For example, (Sherden RD, Fort Wayne, IN) ≍ (−,
Fort Wayne, −). We assume that CFDs are provided in the normal form [3], i.e.,
ϕ : (X → A, tp), A ∈ attr(R) and tp is a single pattern tuple.
A CFD ϕ : (X → A, tp) is said to be constant if tp[A] is a constant, i.e., tp[A] ≠ ’−’.
Otherwise, ϕ is a variable CFD. For example in Figure 2.1, ϕ1 is a constant CFD,
while ϕ5 is a variable CFD.
A database instance D satisfies the constant CFD ϕ = (X → A, tp), denoted by
D |= ϕ, iff for each tuple t ∈ D, if t[X] ≍ tp[X] then t[A] = tp[A]. If ϕ is a variable
CFD, then D |= ϕ iff for each pair of tuples t1, t2 ∈ D, if t1[X] = t2[X] ≍ tp[X] then
t1[A] = t2[A] ≍ tp[A]. This means that if t1[X] and t2[X] are equal and match the
pattern tp[X], then t1[A] and t2[A] must also be equal to each other. CFDs address
a single relation only. However, the repairing algorithm that uses CFDs is applicable
to general relational schemas by simply repairing each relation in isolation.
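These satisfaction conditions translate directly into a checker. The following minimal sketch is illustrative only (tuples as Python dictionaries, the wildcard ’−’ written as "-"):

```python
def matches(t, tp, attrs):
    """t[X] ≍ tp[X]: on each attribute, t equals the pattern constant
    or the pattern holds the wildcard '-'."""
    return all(tp[a] == "-" or t[a] == tp[a] for a in attrs)

def satisfies(D, X, A, tp):
    """Check D |= (X -> A, tp) for both constant and variable CFDs."""
    if tp[A] != "-":  # constant CFD: matching tuples must carry tp[A]
        return all(t[A] == tp[A] for t in D if matches(t, tp, X))
    # variable CFD: matching tuples that agree on X must agree on A
    seen = {}
    for t in D:
        if matches(t, tp, X):
            key = tuple(t[a] for a in X)
            if key in seen and seen[key] != t[A]:
                return False
            seen[key] = t[A]
    return True
```

On the running example, the pair t5, t6 (same street in ‘Fort Wayne’ but different zip codes) makes the variable CFD ϕ5 fail, while t5 alone already violates the constant CFD ϕ4.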
We address the following problems:
• The use of the data quality rules Σ to generate candidate updates for the tuples
that are violating Σ. The rules can be either given or discovered by an automatic
discovery technique (e.g., [33,36]). Usually, the automatic discovery techniques
employ thresholds on the confidence of the discovered rules. In this setting, the
Fig. 2.2.: GDR Framework.
user is the one to guide the repairing process and we assume that user decisions
are consistent with Σ.
• Deciding upon the best groups of updates—as mentioned in Section 2.1— to be
presented to the user during an interactive process for faster convergence and
higher data quality.
• Applying active learning to learn user feedback and use the learned models to
decide upon the correctness of the suggested updates without user’s involve-
ment.
2.2.2 Solution Overview
Figure 2.2 shows the GDR framework and the cleaning process is outlined in
Algorithm 2.1.
GDR guides the user to focus her efforts on providing feedback on the updates
that would improve quality faster, while the user guides the system to automatically
identify and apply updates on the data. This continuous feedback process, illustrated
in steps 3-10 of Algorithm 2.1, runs while there are dirty tuples and the user is available
to give feedback.
Algorithm 2.1 GDR Process(D dirty database, Σ DQRs)
1: Identify dirty tuples in D using Σ and generate and store initial suggested updates in
PossibleUpdates list.
2: Group the candidate updates appropriately.
3: while User is available and dirty tuples exist do
4: Rank groups of updates such that the most beneficial come first.
5: The user selects group c from the top.
6: Updates in c are labeled by learner predictions and the user interactively gives feed-
back on the suggested updates, until the user is satisfied with the learner predictions
or has verified all the updates within c.
7: User feedback and learner decisions are applied to the database.
8: Remove rejected updates from PossibleUpdates and replace as needed.
9: Check for new dirty tuples and generate updates.
10: end while
In Step 1, all dirty tuples that violate the rules are identified and a repairing
algorithm is used to generate candidate updates. In Step 2, we group the updates in
a way that makes batch inspection easier for the user.
The interactive loop in steps 3-10 starts with ranking the groups of updates such
that groups that are more likely to move the database to a cleaner state faster come
first. The user will then pick one of the top groups (c) in the list and provide feedback
through an interactive active learning session (Step 6). (The ranking mechanism and
active learning are discussed in Section 2.4.)
In Step 7, all decisions on suggested updates, either made by the user or the
learner, are applied to the database. In Step 8, the list of candidate updates is
modified by replacing rejected updates and generating new ones for emerging dirty
tuples because of the applied updates.
After getting the user feedback, the violations are recomputed by the consistency
manager and new updates may be proposed. The assumption is that if the user
verifies all the database cells then the final database instance is consistent with the
rules. This guarantees that we are always making progress toward the final consistent
database and the process will terminate.
2.3 Generating Candidate Updates
In this section, we outline the different steps involved in suggesting updates, main-
taining their consistency when applied to the database, and grouping them for the
user.
2.3.1 Dirty Tuples Identification and Updates Discovery
Once a set Σ of CFDs is defined, dirty tuples can be identified through violations
of Σ and stored in a DirtyTuples list. A tuple t is considered dirty if ∃ ϕ ∈ Σ such
that t ⊭ ϕ, i.e., t violates rule ϕ.
Resolving CFD Violations
A dirty tuple t may violate a CFD ϕ = (R : X → A, tp) in Σ following two possible
cases [3]:
• Case 1: ϕ is a constant CFD (i.e., tp[A] = a, where a is a constant) and t[X] ≍
tp[X] but t[A] ≠ a.
• Case 2: ϕ is a variable CFD, t[X] ≍ tp[X], and ∃t′ such that t′[X] = t[X] ≍
tp[X] but t[A] ≠ t′[A].
The latter case is similar to the violation of a standard FD. Accordingly, given a set Σ
of CFDs, the dirty tuples can be immediately identified and stored in the DirtyTuples
list.
To resolve a violation of a CFD ϕ = (R : X → A, tp) by a tuple t, we proceed as
follows: For case 1, we either modify the RHS(ϕ) attribute such that t[A] = tp[A],
or we change some of the attributes in LHS(ϕ) such that t[X] ≭ tp[X]. For case 2,
we either modify t[A] (resp. t′[A]) such that t[A] = t′[A], or we change some LHS(ϕ)
attributes in t[X] (resp. t′[X]) such that t[X] ≠ t′[X], or t[X] ≭ tp[X] (resp. t′[X] ≭
tp[X]).
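The two violation cases above can be rendered as simple predicates. The sketch below is our own illustration, not the thesis implementation: tuples are dicts, a pattern tuple tp is a dict whose wildcard '−' is written '_', and the helper names are hypothetical.

```python
# Illustrative CFD violation checks. A CFD is (X, A, tp): LHS attributes X,
# RHS attribute A, and pattern tuple tp with '_' as the wildcard '-'.

def matches(value, pattern_cell):
    """A value matches a pattern cell if the cell is the wildcard or equal."""
    return pattern_cell == '_' or value == pattern_cell

def violates_constant(t, cfd):
    """Case 1: t matches tp on X but disagrees with the constant tp[A]."""
    X, A, tp = cfd
    return (tp[A] != '_'
            and all(matches(t[B], tp[B]) for B in X)
            and t[A] != tp[A])

def violates_variable(t, t2, cfd):
    """Case 2: both tuples match tp on X and agree on X, but disagree on A."""
    X, A, tp = cfd
    if tp[A] != '_':
        return False
    return (all(matches(t[B], tp[B]) for B in X)
            and all(t[B] == t2[B] for B in X)
            and t[A] != t2[A])
```

For instance, with ϕ1,1 encoded as `(['ZIP'], 'CT', {'ZIP': '46360', 'CT': 'Michigan City'})`, the tuple t2 from Figure 2.1 triggers `violates_constant`.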
Example 2.3.1 In Figure 2.1, the normal form of
ϕ1 : (ZIP → CT, STT, {46360 ∥ Michigan City, IN}) would be
ϕ1,1 : (ZIP → CT, {46360 ∥ Michigan City}) and
ϕ1,2 : (ZIP → STT, {46360 ∥ IN}).
t2 violates ϕ1,1 : (ZIP → CT, {46360 ∥ Michigan City}) following case 1. Thus,
a suggested update by changing RHS(ϕ1,1) is to replace ‘Westville’ by ‘Michigan City’
in t2[CT], while another update by changing LHS(ϕ1,1) is to replace ‘46360’ by ‘46391’
in t2[ZIP], for example. t5, t6 both violate ϕ5 following case 2. A possible update is to
change RHS(ϕ5) by modifying t5[ZIP] to be ’46825’ instead of ’46391’. Yet, another
possible update is to make a change in LHS(ϕ5). For example, by changing t5[STR] or
t5[CT] to another value.
We implemented an on demand update discovery process based on the above
mechanism for resolving CFDs violations and generating candidate updates. This
process is triggered to suggest an update for t[A], the value of attribute A in tuple t.
Initially, the process is called for all dirty tuples and their attributes. Later during the
interactions with the user, it is triggered by the consistency manager as a consequence
of receiving user feedback.
The generated updates are tuples in the form rj = ⟨t, A, v, sj⟩ stored in the
PossibleUpdates list, where v is the suggested value in t[A] and sj is the update
score. sj ∈ [0..1] is assigned to each update rj by an update evaluation function
to reflect the certainty of the repairing technique about the suggested update. We
follow the same evaluation approach used in [23] and [3]. Given an update r to mod-
ify t[A] = v such that t[A] = v′, we compute the update evaluation score s as the
similarity between v and v′. This can be done based on the edit distance function
distA(v, v′) as follows
s(r) = sim(v, v′) = 1 − distA(v, v′) / max(|v|, |v′|)   (2.1)
where |v| and |v′| denote the sizes of v and v′, respectively. The intuition here is
that the more accurate v′ is, the closer it is to v. s(r) is in the range [0..1] and any domain
specific similarity function can be used for this purpose. Finally, the update can be
composed in the tuple form r = ⟨t, A, v′, s(r)⟩.
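As a concrete illustration of Eq. 2.1, the sketch below uses Levenshtein edit distance as distA; any domain-specific distance could be substituted, and the function names are our own.

```python
# Sketch of the update evaluation score of Eq. 2.1.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def update_score(v, v_new):
    """s(r) = 1 - dist(v, v') / max(|v|, |v'|), in [0, 1]."""
    if not v and not v_new:
        return 1.0
    return 1.0 - edit_distance(v, v_new) / max(len(v), len(v_new))
```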
Generating Updates
We now show how to use CFDs to generate updates for each potentially dirty
attribute B in t ∈ DirtyTuples. The generated updates are tuples in the form
⟨t, B, v, s⟩, where v is the suggested repair value for t[B] and s is the repair evaluation
score from Eq. 2.1.
The suggested updates correspond to attribute value modifications, which are
enough for CFDs violations [3]. For each dirty tuple t, we store the list of violated
rules in t.vioRuleList. Furthermore, for each pair ⟨t, B⟩, we keep a list of values
⟨t, B⟩.preventedList, which contains values for t[B] that are confirmed as wrong.
Thus, when searching a new suggestion for t[B], the values in ⟨t, B⟩.preventedList
are discarded. Also, we keep a flag ⟨t, B⟩.Changeable that is set to False when the
value in t[B] was confirmed to be correct.
Initially, we assume that each attribute value is incorrect for all t ∈ DirtyTuples
and proceed by searching for the best update value that provides the best
score according to Eq. 2.1. This can be performed by calling Algorithm 2.2,
UpdateAttributeTuple(t, B) for all t ∈ DirtyTuples and B ∈ attr(R).
UpdateAttributeTuple described in Algorithm 2.2 finds the best update value for
t[B] by exploring three possible scenarios:
1. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] ≠ ’−’ (i.e., ϕ is a
constant CFD): This corresponds to case 1 of rule violations where t[X] ≍ tp[X]
and t[A] ≠ tp[A]. In this scenario, a value v = a is suggested (lines 4-6).
Algorithm 2.2 UpdateAttributeTuple (Tuple t, Attribute B)
1: if ⟨t, B⟩.Changeable = false then return
2: best_s = 0; v = null
3: for all ϕ = (X → A, tp) ∈ t.vioRuleList do
4:   if B = A ∧ tp[A] ≠ ’−’ then
5:     cur_s = sim(t[A], tp[A]) {scenario 1}
6:     if cur_s > best_s then { best_s = cur_s; v = tp[A] }
7:   else if B = A ∧ tp[A] = ’−’ then
8:     ⟨best_s, v⟩ = getValueForRHS(ϕ, A, t, best_s) {scenario 2}
9:   end if
10: end for
11: if ∃ ϕ = (X → A, tp) ∈ t.vioRuleList s.t. B ∈ X then
12:   ⟨best_s, v⟩ = getValueForLHS(B, t, best_s) {scenario 3}
13: end if
14: if v ≠ null then
15:   PossibleUpdates = PossibleUpdates ∪ {⟨t, B, v, s = sim(t[B], v)⟩}
16: end if
2. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] = ’−’ (i.e., ϕ is a
variable CFD): This corresponds to case 2 of rule violations where t[X] ≍ tp[X]
and there exists another tuple t′ that violates ϕ with t, i.e.,
t′[X] = t[X] but t′[A] ≠ t[A]. In this scenario, a value v = t′[A] is suggested
(lines 7-9).
3. B ∈ LHS(ϕ) for some violated CFD ϕ = (X → A, tp): This corresponds to
either case 1 or case 2 of rule violations. In this scenario, we look for a value
v that maximizes the repair evaluation score sim(t[B], v) (Eq. 2.1). The aim is
to select semantically related values by first using the values in the CFDs, then
searching in the tuples identified by the pattern t[X ∪ {A} − {B}] (lines 11-13).
In each of the above scenarios, the selected value must satisfy v ∉ ⟨t, B⟩.preventedList.
Finally, a repair tuple ⟨t, B, v, s⟩ is composed and inserted into PossibleUpdates in line 15.
Example 2.3.2 In Figure 2.1, t5 violates ϕ4 and when repairing the attribute CT ∈
RHS(ϕ4), a suggested update according to Scenario 1 will be ‘Westville’. Also, t5
violates ϕ5 and when repairing the attribute ZIP ∈ RHS(ϕ5), a suggested update will
be 46825 according to Scenario 2. When repairing the attribute STR ∈ LHS(ϕ5), a
suggested value from the domain dom(STR) can be ‘Sherden RD’ according to Sce-
nario 3.
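Algorithm 2.2 can be rendered compactly in Python. The sketch below is illustrative only: it assumes dict-like tuples that also carry the vioRuleList/prevented/changeable structures, treats getValueForRHS/getValueForLHS as given helpers, and returns the candidate update instead of appending it to PossibleUpdates.

```python
# Hypothetical rendering of Algorithm 2.2; '_' stands for the wildcard '-'.

def update_attribute_tuple(t, B, sim, get_value_for_rhs, get_value_for_lhs):
    if not t.changeable[B]:                      # value confirmed correct
        return None
    best_s, v = 0.0, None
    for (X, A, tp) in t.vio_rule_list:
        if B == A and tp[A] != '_':              # scenario 1: constant CFD
            cur_s = sim(t[A], tp[A])
            if cur_s > best_s:
                best_s, v = cur_s, tp[A]
        elif B == A and tp[A] == '_':            # scenario 2: variable CFD
            best_s, v = get_value_for_rhs((X, A, tp), A, t, best_s)
    if any(B in X for (X, A, tp) in t.vio_rule_list):
        best_s, v = get_value_for_lhs(B, t, best_s)   # scenario 3
    if v is not None and v not in t.prevented[B]:
        return (t, B, v, best_s)                 # candidate update tuple
    return None
```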
2.3.2 Updates Consistency Manager
Once an update r = ⟨t, A, v, s⟩ is confirmed to be correct, either by the user or
the learning component, it is immediately applied to the database, resulting in a
new database instance. Consequently, (i) new violations may arise and hence the
on demand update discovery process needs to be triggered for the new dirty tuples,
and (ii) some of the already suggested updates that are not verified yet may become
inconsistent since they were generated according to a different database instance. For
example, in Figure 2.1, two updates are proposed: r1 replaces t6[ZIP] with 46391 and r2
replaces t6[CT] with ’FT Wayne’. If feedback is received confirming r1, then r2 is no
longer consistent with the new database instance and the rules, since t6 will fall in
the context of ϕ4. The on demand process can then find a consistent update r′2 that
corresponds to replacing t6[CT] by ’Westville’, and r2 will be discarded in favor of r′2.
The consistency manager needs to maintain two invariants: (i) There is no tuple
t ∈ D such that t ⊭ ϕ for some ϕ ∈ Σ while t ∉ DirtyTuples. (ii) There is no update
r ∈ PossibleUpdates such that r depends on data values that have been modified
in the database. In the following, we provide the detailed steps of the consistency
manager procedure that we implemented in GDR. Given an update r = ⟨t, B, v, s⟩
along with the feedback ∈ {confirm, reject, retain}:
1. If the feedback is to retain the current value t[B], then we set
⟨t, B⟩.Changeable = false to stop looking for updates for t[B].
2. If the feedback is to reject the update, i.e., t[B] cannot be v, then v is added
immediately to the list ⟨t, B⟩.preventedList. This is followed by a call to
UpdateAttributeTuple(t, B) to find another update for t[B].
3. If the feedback confirms that t[B] must be v, then the update is applied to
the database immediately and we stop generating updates for t[B] by setting
⟨t, B⟩.Changeable = false. Afterward, we go through the rules that involve the
attribute B and update the necessary data structures to reflect the removed
violations as well as new emerging violations. Particularly, for each ϕ : (X →
A, tp) ∈ Σ where B ∈ (X ∪ A), we do the following:
(a) If t ⊭ ϕ, i.e., t violates ϕ, then we consider two cases:
i. ϕ is a constant CFD: If ⟨t, C⟩.Changeable = false, ∀C ∈ X, i.e., all at-
tributes in LHS(ϕ) have been confirmed as correct and are not change-
able values, then RHS(ϕ) should be applied; we apply t[A] = tp[A]
to the database directly, set ⟨t, A⟩.Changeable = false, and remove ϕ
from t.vioRuleList. If some of the LHS(ϕ) attribute values are changeable
in t, then ∀ C ∈ ({X ∪ A} − {B}) we add ⟨t, C⟩ to RevisitList. ϕ
is added to t.vioRuleList, if it is not already there, and t is added to
the DirtyTuples list as well.
ii. ϕ is a variable CFD: We add ϕ to t.vioRuleList and then identify
the tuples t′ that violate ϕ with t. Then for each t′, we add ϕ to
t′.vioRuleList and add t′ to the DirtyTuples. Also, we add ⟨t′, C⟩ to
the RevisitList, ∀ C ∈ {X∪A} because this ϕ may be a new emerging
violation for t′ and all the attributes are candidates to be wrong for t′.
(b) If t |= ϕ, i.e., t now satisfies ϕ, while ϕ ∈ t.vioRuleList, then ϕ was violated by t before
applying this update. Therefore, we remove ϕ from t.vioRuleList. If ϕ is
a constant CFD, no further action is required. However, if ϕ is a variable
CFD, we need to check the other tuples t′, which were involved with t in
violating ϕ, and eventually update their vioRuleList. We remove ϕ from
t′.vioRuleList as long as ∄ t′′ s.t. t′, t′′ ⊭ ϕ, i.e., t′ is not involved in
violating ϕ with another tuple t′′.
4. Remove update r = ⟨t, C, v, s⟩ from PossibleUpdates, if ⟨t, C⟩ ∈
RevisitList or ⟨t, C⟩.Changeable = false.
5. For every element ⟨t, C⟩ ∈ RevisitList, we call UpdateAttributeTuple(t, C)
to find another repair for t[C].
6. Remove t from DirtyTuples, if t.vioRuleList is empty.
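The first three steps of the procedure reduce to a dispatch on the feedback type. The sketch below is a minimal illustration assuming a state object with hypothetical helpers (apply_to_db, recheck_rules, regenerate), not the thesis implementation.

```python
# Illustrative feedback dispatch for the consistency manager (Steps 1-3).

def handle_feedback(feedback, t, B, v, state):
    if feedback == 'retain':                 # Step 1: current value is correct
        state.changeable[(t, B)] = False
    elif feedback == 'reject':               # Step 2: v is wrong; try again
        state.prevented[(t, B)].add(v)
        state.regenerate(t, B)               # re-runs UpdateAttributeTuple
    elif feedback == 'confirm':              # Step 3: apply and propagate
        state.apply_to_db(t, B, v)
        state.changeable[(t, B)] = False
        state.recheck_rules(t, B)            # maintain vioRuleList etc.
```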
Note that the first update consistency invariant is maintained because of the
following: A tuple t may become dirty if it is modified or another tuple t′ is modified
so that t, t′ violate some variable CFD ϕ ∈ Σ. For a tuple t and a CFD rule ϕ,
assuming that due to a database update t ⊭ ϕ, then t must be in DirtyTuples after
applying Step 3a.
If ϕ is a constant CFD, then Step 3(a)i should have been applied. If t continues to
violate ϕ it should be in DirtyTuples. If ϕ is a variable CFD, then Step 3(a)ii should
have been applied. There are two cases to consider: First, if t is the tuple being
repaired and t ⊭ ϕ, then it is added to DirtyTuples, if not already there. Second,
if t ⊭ ϕ because another tuple t′ was repaired (or modified), then Step 3(a)ii should
have been applied on t′. Thus all tuples involved with t′ in violating ϕ, including
t will be added to DirtyTuples. Following the same rationale, Step 3b maintains
that t.vioRuleList contains only rules that are being violated by t. Thus, Step 6
guarantees that the content of DirtyTuples corresponds to tuples involved in rules
violation.
The second update consistency invariant is maintained as well because of steps
3(a)i, 4, and 5. These steps maintain a local list, RevisitList, to hold tuple-
attribute pairs whose generated updates may depend on the applied update.
In Step 3(a)i, changing the value of t[B] may affect the update choice for the other
attributes of ϕ. For a variable CFD (Step 3(a)ii), all the tuples involved in the
violations due to the modified value will need their attribute values to be revisited to
find updates. Step 4 removes the corresponding updates from the PossibleUpdates
and we proceed in Step 5 to get potentially new updates.
Note that Step 3 loops on the set of rules for the particular tuple t that was
updated. In Steps 3(a) and 3(b), we consider the immediate dependencies (conse-
quences) of updating t with respect to a single rule ϕ. Particularly in Step 3(a), we
check for new violations for ϕ that involve t, because it is the only change to the
database. In Step 3(b), we check for already resolved violations for ϕ due to updating
t. This process, local to tuple t and considering only a single rule ϕ at a time, guarantees
that the consistency manager will terminate and will not enter an infinite loop.
2.3.3 Grouping Updates
There are two reasons for the grouping: (i) Providing a set of updates that share
some common contextual information will be easier for the user to handle
and process. (ii) Providing a machine learning algorithm with a group of training
examples that have some correlations due to the grouping will increase the predic-
tion accuracy compared with just providing random, unrelated examples. Similar
grouping ideas have been explored in [45]. We use a grouping function where the
tuples with the same update value in a given attribute are grouped together. This
grouping technique is in the spirit of the equivalence-class techniques described
in [23]. Another way to do the grouping is based on conflicting structures, a
concept introduced in [5]. A conflicting structure identifies a group of cells that
cannot all stay unchanged in the final consistent database (i.e., in each group, at
least one database cell must change).
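The value-based grouping used here amounts to keying candidate updates by the pair (attribute, suggested value); a minimal sketch (the tuple encoding is our own):

```python
# Group candidate updates that suggest the same value for the same attribute.
from collections import defaultdict

def group_updates(possible_updates):
    """possible_updates: iterable of (tuple_id, attr, value, score) tuples."""
    groups = defaultdict(list)
    for r in possible_updates:
        _, attr, value, _ = r
        groups[(attr, value)].append(r)
    return groups
```

For example, the three updates of Example 2.4.1 that set CT to ‘Michigan City’ in t2, t3, t4 land in one group.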
2.4 Ranking and Displaying Suggested Updates
In this section, we introduce the key concepts of GDR, namely the ranking and
learning components (Figure 2.2), which describe how GDR interacts with the user
to get feedback on suggested updates. The task of these components is to devise how
to best present the updates to the user, in a way that will provide the most benefit
for improving the quality of the data. To this end, we apply the concept of value
of information (VOI) [64] from decision theory, combined with an active learning
approach, to choose a ranking in a principled manner.
2.4.1 VOI-based Ranking
At any iteration of the process outlined in Algorithm 2.1, there will be several
possible suggested updates to forward to the user. As discussed in the previous
section, these updates are grouped into groups {c1, c2 . . . }.
VOI is a means of quantifying the potential benefit of determining the true value
of some unknown. At the core of VOI is a loss (or utility) function that quantifies the
desirability of a given level of database quality. To make a decision on which group
to forward first to the user, we compare data quality loss before and after the user
works on a group of updates. More specifically, we devise a data quality loss function,
L, based on the quantified violations to the rules Σ. Since the exact loss in quality
cannot be measured, as we do not know the correctness of the data, we develop a set
of approximations that allow for efficient estimation of this quality loss. Before we
proceed, we need first to introduce the notion of database violations.
Definition 2.4.1 Given a database D and a CFD ϕ, we define the tuple t violation
w.r.t ϕ, denoted vio(t, {ϕ}), as follows:
vio(t, {ϕ}) = 1, if ϕ is a constant CFD;
vio(t, {ϕ}) = the number of tuples t′ that violate ϕ with t, if ϕ is a variable CFD.

Consequently, the total violations for D with respect to Σ is:

vio(D, Σ) = ∑ϕ∈Σ ∑t∈D vio(t, {ϕ}).
The definition for variable CFDs is equivalent to the pairwise counting of violations
discussed in [3]. The violation count can be scaled further using a weight attached to
the tuple, denoting the tuple's importance to the business of being clean.
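Definition 2.4.1 translates directly into code. The sketch below assumes rule objects exposing an is_constant flag and violation predicates; these names are illustrative, not from the thesis.

```python
# Sketch of Definition 2.4.1: per-tuple violations and the total vio(D, Sigma).

def vio_tuple(t, phi, D):
    """vio(t, {phi}); a constant CFD counts 1 per violating tuple."""
    if phi.is_constant:
        return 1 if phi.violated_by(t) else 0
    # variable CFD: count the partner tuples that violate phi with t
    return sum(1 for t2 in D if t2 is not t and phi.violated_by_pair(t, t2))

def vio_db(D, rules):
    """Total violations vio(D, Sigma) over all rules and tuples."""
    return sum(vio_tuple(t, phi, D) for phi in rules for t in D)
```

Note that for a variable CFD each violating pair is counted once from each side, matching the pairwise counting of [3].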
Update Benefit: Given a database instance D and a group c = {r1, . . . , rJ}, if the
system receives feedback from the user on rj, there are two possible cases: either
the user confirms rj to be applied or not. We denote the two corresponding database
instances as Drj and Dr̄j, respectively. Assuming that the user will confirm rj with a
probability pj, the expected data quality loss after consulting the user on rj can be
expressed as pj L(Drj) + (1 − pj) L(Dr̄j). If we further assume that all
the updates within the group c are independent, then the update benefit g (or data
quality gain) of acquiring user feedback for the entire group c can be expressed as:

g(c) = L(D|c) − ∑rj∈c [ pj L(Drj) + (1 − pj) L(Dr̄j) ]   (2.2)
where L(D|c) is the current loss in data quality given that c is suggested. To simplify
our analysis, we assumed that these updates are independent. Taking into account
these dependencies would require modeling the full joint probabilities of the updates,
which would lead to a formulation that is computationally infeasible due to the
exponential number of possibilities.
Data Quality Loss (L): We define quality loss as inversely proportional to the
degree of satisfaction of the specified rules Σ. To compute L(D|c), we first need to
measure the quality loss with respect to ϕ ∈ Σ, namely ql(D|c, ϕ). Assuming that
Dopt is the clean database instance desired by the user, we can express ql by:
ql(D|c, ϕ) = 1 − |D |= ϕ| / |Dopt |= ϕ| = ( |Dopt |= ϕ| − |D |= ϕ| ) / |Dopt |= ϕ|   (2.3)
where |D |= ϕ| and |Dopt |= ϕ| are the numbers of tuples satisfying the rule ϕ in the
current database instance D and Dopt, respectively. Consequently, the data quality
loss, given c, can be computed for Eq. 2.2 as follows:

L(D|c) = ∑ϕi∈Σ wi × ql(D|c, ϕi)   (2.4)
where wi is a user-defined weight that expresses the business or domain value of
satisfying the rule ϕi. In our experiments, we used the values wi = |D(ϕi)| / |D|, where
|D(ϕi)| is the number of tuples that fall in the context of the rule ϕi. The intuition is
that the more tuples fall in the context of a rule, the more important it is to satisfy
this rule.
To use this gain formulation, we are faced with two practical challenges: (1) we
do not know the probabilities pj for Eq. 2.2, since we do not know the correctness of
the update rj beforehand, and (2) we do not know the desired clean database Dopt
for computing Eq. 2.3, since that is the goal of the cleaning process in the first place.
User Model: To approximate pj, we learn and model the user as we obtain
his/her feedback for the suggested updates. pj is approximated by the prediction
probability p̂j of having rj correct (learning user feedback is discussed in Section 2.4.2).
Since initially there is no feedback, we assign sj to p̂j, where sj ∈ [0, 1] is a score that
represents the repairing algorithm's certainty about the suggested update rj.
Estimating Update Benefit: To compute the overall quality loss L in Eq. 2.4,
we need to first compute the quality loss with respect to a particular rule ϕ, i.e.,
ql(D|c, ϕ) in Eq. 2.3. To this end, we approximate the numerator and denomina-
tor separately. The numerator expression, which represents the difference between
the numbers of tuples satisfying ϕ in Dopt and D, respectively, is approximated us-
ing D’s violations with respect to ϕ. Thus, we use the expression vio(D, {ϕ}) (cf.
Definition 2.4.1) as the numerator in Eq. 2.3.
The main approximation we made is to assume that the updates within a group
c are independent. Hence to approximate the denominator of Eq. 2.3, we assume
further that there is only one suggested update rj in c. The effect of this last as-
sumption is that we consider two possible clean desired databases—one in which rj
is correct, denoted by Drj , and another one in which rj is incorrect, denoted by
Drj . Consequently, there are two possibilities for the denominator of Eq. 2.3, each
with a respective probability pj and (1− pj). Our evaluations show that despite our
approximations, our approach produces a good ranking of the groups of updates.
We apply this approximation independently for each rj ∈ c and estimate the
quality loss ql as follows:

E[ql(D|c, ϕ)] = ∑rj∈c [ p̂j · vio(D, {ϕ}) / |Drj |= ϕ| + (1 − p̂j) · vio(D, {ϕ}) / |Dr̄j |= ϕ| ]   (2.5)

where we approximate pj with p̂j.
The expected loss in data quality for the database D, given the suggested group
of updates c, can be then approximated based on Eq. 2.4 by replacing ql with E[ql]
obtained from Eq. 2.5:

E[L(D|c)] = ∑ϕi∈Σ wi ∑rj∈c [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]   (2.6)
We can also compute the expected loss for Drj and Dr̄j using Eq. 2.4 and Eq. 2.6 as
follows: E[L(Drj)] = ∑ϕi∈Σ wi · vio(Drj, {ϕi}) / |Drj |= ϕi|, where we use p̂j = 1 since
in Drj we know that rj is correct, and E[L(Dr̄j)] = ∑ϕi∈Σ wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi|,
where we use p̂j = 0 since in Dr̄j we know that rj is incorrect.
Finally, using Eq. 2.2 and substituting L(D|c) with E[L(D|c)] from Eq. 2.6, we
compute an estimate for the data quality gain of acquiring feedback for the group c
as follows:
E[g(c)] = E[L(D|c)] − ∑rj∈c [ p̂j E[L(Drj)] + (1 − p̂j) E[L(Dr̄j)] ]

= ∑ϕi∈Σ wi ∑rj∈c [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]
  − ∑rj∈c [ p̂j ∑ϕi∈Σ wi · vio(Drj, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) ∑ϕi∈Σ wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi| ]

Note that vio(D, {ϕi}) − vio(Dr̄j, {ϕi}) = 0, since Dr̄j is the database resulting from
rejecting the suggested update rj, which does not modify the database. Therefore, Dr̄j
is the same as D, with the same violations. After a simple rearrangement, we obtain
the final formula to compute the estimated gain for c:

E[g(c)] = ∑ϕi∈Σ wi ∑rj∈c p̂j · ( vio(D, {ϕi}) − vio(Drj, {ϕi}) ) / |Drj |= ϕi|   (2.7)
The final formula in Eq. 2.7 is intuitive in itself and can be justified as follows.
The main objective in improving quality is to reduce the number of violations
in the database. Therefore, the difference in the amount of database violations, as
defined in Definition 2.4.1, before and after applying rj is a major component in
computing the update benefit. This component is computed, under the first summation,
for every rule ϕi as a fraction of the number of tuples that would satisfy ϕi if
rj were applied. Since the correctness of the repair rj is unknown, we cannot use the term
under the first summation as a final benefit score. Instead, we compute the expected
update benefit by approximating our certainty about the benefit with the prediction
probability p̂j.
Example 2.4.1 For the example in Figure 2.1, assume that the repairing algorithm
generated 3 updates to replace the value of the CT attribute by ‘Michigan City’ in
t2, t3 and t4. Assume also that the probabilities p̂j for each of them are 0.9, 0.6,
and 0.6, respectively. The weights wi for each ϕi, i = 1, . . . , 5 are {4/8, 1/8, 2/8, 1/8, 3/8}.
Due to these modifications, only ϕ1 will have its violations affected. Then, for this
group of updates, the estimated benefit can be computed as follows using Eq. 2.7:
4/8 × (0.9 × (4−3)/1 + 0.6 × (4−3)/1 + 0.6 × (4−3)/1) = 1.05.
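The computation in Example 2.4.1 can be reproduced with a direct implementation of Eq. 2.7; the input encoding below (per-rule weight plus per-update terms) is our own convenience, not a thesis interface.

```python
# Sketch of the estimated group benefit E[g(c)] of Eq. 2.7.

def estimated_gain(rules):
    """rules: list of (w_i, updates), where each update in the group is
    (p_hat_j, delta_vio, satisfied_after) with
    delta_vio = vio(D, {phi_i}) - vio(D_rj, {phi_i}) and
    satisfied_after = |D_rj |= phi_i|."""
    return sum(w * sum(p * dv / sat for (p, dv, sat) in ups)
               for (w, ups) in rules)
```

Feeding in the numbers of Example 2.4.1 (weight 4/8 for ϕ1, three updates with p̂j = 0.9, 0.6, 0.6, each removing one violation) yields the benefit 1.05 quoted above.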
2.4.2 Active Learning Ordering
One way to reduce the cost of acquiring user feedback for verifying each update is
to relegate the task of providing feedback to a machine learning algorithm. The use of
a learning component in GDR is motivated by the existence of correlations between
the original data and the correct updates. If these correlations can be identified
and represented in a classification model, then the model can be trained to predict
the correctness of a suggested update and hence replace the user for similar (future)
situations.
As stated earlier, GDR provides groups of updates to the user for feedback. Here,
we discuss how the updates within a group will be ordered and displayed to the
user, such that user feedback for the top updates would strengthen the learning
component’s capability to replace the user for predicting the correctness for the rest
of the updates.
Interactive Active Learning Session: After ranking the groups of updates, the
user will pick a group c that has a high score E[g(c)]. The learner orders these updates
such that those that would most benefit, i.e., improve the model prediction accuracy,
from labeling come first. The updates are displayed to the user along with their
learner predictions for the correctness of each update. The user will then give feedback
on the top ns updates that she is sure about, inherently correcting any mistakes
made by the learner. The ns newly labeled examples are added to the learner's
training dataset Tr and the active learner is retrained. The learner then provides
new predictions and reorders the currently displayed updates based on the training
examples obtained so far. If the user is not satisfied with the learner predictions,
the user will then give feedback on another ns updates from c. This interactive
process continues until the user is either satisfied with the learner predictions, and
thus delegates the remaining decisions on the suggested updates in c to the learned
model, or the updates within c are all labeled, i.e., verified, by the user.
Active Learning: In the learning component, there is a machine learning algo-
rithm that constructs a classification model. Ideally, we would like to learn a model
to automatically identify correct updates without user intervention. Active learning
is an approach to learning models in situations where unlabeled examples (i.e.,
suggested updates) are plentiful but there is a cost to labeling examples (acquiring user
feedback) for training.
By delegating some decisions on suggested updates to the learned models, GDR
allows for “automatic” repairing. The assurance of correctly repairing the data
is inherently provided by the active learning process, which learns accurate
classifiers to predict the correctness of the updates. The user is the one to decide
whether the classifiers are accurate while inspecting the suggestions.
Learning User Feedback: For a suggested update r = ⟨t, A, v, s⟩, the learning
component predicts one of the following labels, which correspond to the expected
user feedback: (i) confirm, the value of t[A] should be v; (ii) reject, v is not
a valid value for t[A] and GDR needs to find another update; (iii) retain, t[A] is a
correct value and there is no need to generate more updates for it. The user may also
suggest a new value v′ for t[A], and GDR will consider it as a confirm feedback for the
repair r′ = ⟨t, A, v′, 1⟩.
In the learning component, we learn a set of classification models {MA1 , . . . ,MAn},
one for each attribute Ai ∈ attr(R). Given a suggested update for t[Ai], model MAi
is consulted to predict user feedback. The models are trained by examples acquired
incrementally from the user. We present here our choices for data representation
(input to the classifier), classification model, and learning benefit scores.
Data Representation: For a given update r = ⟨t, Ai, v, s⟩ and user feedback F ∈
{confirm, reject, retain}, we construct a training example for model MAi in the form
⟨t[A1], . . . , t[An], v, R(t[Ai], v), F⟩. Here, t[A1], . . . , t[An] are the original attribute
values of tuple t and R(t[Ai], v)¹ is a function that quantifies the relationship between
t[Ai] and its suggested value v.
Including the original dirty tuple along with the suggested update value enables
the classifier to model associations between original attribute values and suggested
values. Including the relationship function, R, enables the classifier to model asso-
ciations based on similarities that do not depend solely on the values in the original
database instance and the suggested updates.
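Constructing such a training example is mechanical; a small sketch (the relationship function R is passed in, e.g., a string similarity in the spirit of Eq. 2.1, and the encoding is our own):

```python
# Build a training example <t[A1],...,t[An], v, R(t[Ai], v), F> for model M_Ai.

def make_example(t_values, i, v, feedback, rel):
    """t_values: original attribute values of t; i: index of the repaired
    attribute; v: suggested value; rel: relationship function R."""
    return tuple(t_values) + (v, rel(t_values[i], v), feedback)
```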
Active Learning Using Model Uncertainty: Active learning starts with a
preliminary classifier learned from a small set of labeled training examples. The
classifier is applied to the unlabeled examples and a scoring mechanism is used to
estimate the most valuable example to label next and add to the training set. Many
¹ We use a string similarity function.
criteria have been proposed to determine the most valuable examples for labeling
(e.g., [65, 66]), focusing on selecting the examples whose predictions have the largest
uncertainty.
One way to derive the uncertainty of an example is by measuring the disagreement
amongst the predictions it gets from a committee of k classifiers [45]. The committee
is built so that the k classifiers are slightly different from each other, yet they all
have similar accuracy on the training data. For an update r whose classification is
certain, all committee members would give the same prediction F ∈ {confirm, reject,
retain}. The uncertain ones will get different labels from the committee, and by adding
them to the training set, the disagreement amongst the members will be lowered.
In our implementation, each model MAi is a random forest, which is an ensemble
of decision trees [67] built in a similar way to construct a committee of
classifiers. A random forest learns a set of k decision trees. Let the number of instances
in the training set be N and the number of attributes in the examples be M. Each
of the k trees is learned as follows: randomly sample with replacement a set S of
size N ′ < N from the original data, then learn a decision tree with the set S. The
random forest algorithm uses a standard decision-tree learning algorithm with the
exception that at each attribute split, the algorithm selects the best attribute from
a random subsample of M′ < M attributes. We used the WEKA² random forest
implementation with k = 10 and default values for N ′ and M ′.
Computing Learning Benefit Score: To classify an update r = ⟨t, Ai, v, s⟩
with the learned random forest MAi, each tree in the ensemble is applied separately
to obtain the predictions F1, . . . ,Fk for r, then the majority prediction from the set of
trees is used as the output classification for r. The learning benefit or the uncertainty
of predictions of a committee can be quantified by the entropy on the fraction of
committee members that predicted each of the class labels.
Example 2.4.2 Assume that r1, r2 are two candidate updates to change the CT at-
tribute to ‘Michigan City’ in tuples t2, t3. The model of the CT attribute, MCT, is a
2http://www.cs.waikato.ac.nz/ml/weka/
random forest with k = 5. By consulting the forest MCT, we obtain for r1, the predic-
tions {confirm, confirm, confirm, reject, retain}, and for r2, the predictions {confirm,
reject, reject, reject, reject}. In this case, the final prediction for r1 is ‘confirm’ with
an uncertainty score of 0.86 (= −(3/5) log3(3/5) − (1/5) log3(1/5) − (1/5) log3(1/5)) and for r2
the final prediction is ‘reject’ with an uncertainty score of 0.45. In this case, r1 will
appear to the user before r2 because it has higher uncertainty.
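The scores in the example can be reproduced with a small entropy computation over the committee's vote fractions; the log base is the number of class labels (3 here), so a score of 1 means maximal disagreement. A minimal sketch:

```python
import math

def committee_uncertainty(votes, n_labels=3):
    """Entropy (log base = number of labels) of the fraction of committee
    members voting for each label; 0 = unanimous, 1 = maximal disagreement."""
    k = sum(votes)
    return -sum((v / k) * math.log(v / k, n_labels) for v in votes if v > 0)

# r1: 3 confirm, 1 reject, 1 retain; r2: 1 confirm, 4 reject
r1 = committee_uncertainty([3, 1, 1])  # ≈ 0.86, as in the example
r2 = committee_uncertainty([1, 4])     # ≈ 0.45, as in the example
```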
2.5 Experiments
In this section, we present a thorough evaluation of the GDR framework, which
has already been demonstrated in [60]. Specifically, we show that the proposed
ranking mechanism converges quickly to a better data quality state. Moreover, we
assess the trade-off between the user efforts and the resulting data quality.
Datasets. In our experiments, we used two datasets, denoted as Dataset 1
and 2 respectively. Dataset 1 is a real world dataset obtained by integrating
(anonymized) emergency room visits from 74 hospitals. Such patient data is used
to monitor naturally occurring disease outbreaks, biological attacks, and chemical
attacks. Since such data is coming from several sources, a myriad of data quality
issues arise due to the different information systems used by these hospitals and
the different data entry operators responsible for entering this data. For our ex-
periments, we selected a subset of the available patient attributes, namely Patient
ID, Age, Sex, Classification, Complaint, HospitalName, StreetAddress, City,
Zip, State, and VisitDate. For Dataset 2, we used the adult dataset from the
UCI repository (http://archive.ics.uci.edu/ml/). For our experiments, we used the
attributes education, hours per week, income, marital status, native country,
occupation, race, relationship, sex, and workclass.
Ground truth. To evaluate our technique against a ground-truth, we manually
repaired 20,000 patient records in Dataset 1. We used address and zip code lookup web
sites for this purpose. We assumed that Dataset 2, which is about 23,000 records, is
already clean and hence can be used as our ground truth. We synthetically introduced
errors in the attribute values as follows. We randomly picked a set of tuples, and then
for each tuple, we randomly picked a subset of the attributes to perturb by either
changing characters or replacing the attribute value with another value from the
domain attribute values. All experiments are reported when 30% of the tuples are
dirty.
Data Quality Rules. For Dataset 1, we used CFDs similar to what was
illustrated in Figure 2.1. The rules were identified while manually repairing the
tuples. For Dataset 2, we implemented the technique described in [36] to discover
CFDs and we used a support threshold of 5%.
User interaction simulation. We simulated user feedback to suggested updates
by providing answers as determined by the ground truth.
Data quality state metric. We report the improvement in data quality through
computing the loss (Eq. 2.4). We consider the ground truth as the desired clean
database Dopt.
Settings. All the experiments were conducted on a server with a 3 GHz pro-
cessor and 32 GB RAM running on Linux. We used Java to implement the proposed
techniques and MySQL to store and query the records.
2.5.1 VOI Ranking Evaluation
The objective here is to evaluate the effectiveness and quality of the VOI-based
ranking mechanism described in Section 2.4.1. In this experiment, we did not use the
learning component to replace the user; the user will need to evaluate each suggested
update. Recall that the grouping provides the user with related tuples and their
corresponding updates that could help in a quick batch inspection by the user.
We compare in this experiment the following techniques:
• GDR-NoLearning : The GDR framework of Figure 2.2 without the learning
component.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR-NoLearning, Greedy, Random.]
(a) Dataset 1.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR-NoLearning, Greedy, Random.]
(b) Dataset 2.
Fig. 2.3.: Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strate-
gies against the amount of feedback. Feedback is reported as the percentage of the
maximum number of verified updates required by an approach. Our application of the
VOI concept shows superior performance compared to other naïve ranking strategies.
• Greedy : Here, we rank the groups according to their sizes. The rationale behind
this strategy is that groups that cover larger numbers of updates may have high
impact on the quality if most of the suggestions within them are correct.
• Random: The naïve strategy where we randomly order the groups; all update
groups are equally important.
In Figure 2.3, we show the progress in improving the quality against the number
of verified updates (i.e., the amount of feedback). The feedback is reported as a
percentage of the total number of suggested updates through the interaction process
to reach the desired clean database.
The ultimate objective of GDR is to minimize user effort while reaching better
quality quickly. In Figure 2.3, the slope of the curves over the first iterations with the
user is the key indicator: the steeper the curve, the better the ranking.
As illustrated for both datasets, the GDR-NoLearning approach clearly outperforms
the Greedy and Random approaches. This is because the GDR-NoLearning
approach identifies the most beneficial groups, i.e., those most likely to contain
correct updates. While the Greedy approach improves the quality, its large groups
often consist mostly of incorrect updates, wasting user effort. The Random approach
showed the worst performance on Dataset 1, while on Dataset 2 it was comparable
with the Greedy approach, especially at the beginning of the curves. This is because
in Dataset 2 most group sizes were close to each other, making the Random and
Greedy approaches behave almost identically, whereas in Dataset 1 the group sizes
vary widely, making random choices ineffective. Finally, we notice that
GDR-NoLearning is much better for Dataset 1 than for Dataset 2, for two reasons
related to the nature of Dataset 2: (i) most of the initially suggested updates for
Dataset 2 are correct, and (ii) the sizes of the groups in Dataset 2
are close to each other. The consequence is that any ranking strategy for Dataset 2
will not be far from the optimal.
The results reported above clearly demonstrate the importance and effectiveness of the
GDR ranking component. The GDR-NoLearning approach is well suited for repairing
“very” critical data, where every suggested update has to be verified before applying
it to the database.
2.5.2 GDR Overall Evaluation
Here, we evaluate GDR’s performance when using the learning component to re-
duce user efforts. More precisely, we evaluate the VOI-based ranking when combined
with the active learning ordering. For this experiment, we evaluate the following
approaches:
• GDR: is the approach proposed in this chapter. In each interactive session, the user
provides feedback for the top ranked updates. The required amount of feedback
per group is inversely proportional to the benefit score of the group (Eq. 2.7)—
the higher the benefit the less effort from the user is needed, since most likely the
updates are correct and there are very few uncertain updates for the learned model
that would require user involvement. As such, we require that the user verifies di
updates for a group ci, where di = E × (1 − g(ci)/gmax), E is the initial number of
dirty tuples, and gmax = max∀cj {g(cj)}.
• GDR-S-Learning : Here, we eliminate the active learning from the system—the
updates are grouped and then ranked using VOI-based scoring alone. The user is
solicited for a random selection of updates within each group, instead of updates
ordered by uncertainty. However, all of the user feedback is used to train the
learning component, which then replaces the user on deciding for the remaining
updates in the group. GDR-S-Learning is included to assess the benefit of the active
learning aspect of our framework, compared with traditional passive learning.
• Active-Learning : In this approach, we eliminate the grouping and its ranking
from the GDR framework. In other words, we neither group the updates nor use
VOI-based scores for ranking. We only solicit user feedback for updates ordered
with the learner uncertainty scores. The user is required to provide feedback for
the top update and then the learning component is updated to reorder the updates
for the user in an iterative fashion. The resulting learned model is applied for pre-
dicting the remaining suggested updates and the database is updated accordingly.
We report the quality improvement for different amounts of feedback. This ap-
proach is included to assess the benefit of the grouping and the VOI-based ranking
mechanisms compared with using only an active learning approach.
• GDR-NoLearning : This approach is the one described in the previous experiment;
it provides a baseline to assess the utility of the machine learning aspect of GDR.
• Automatic-Heuristic: The BatchRepair method described in [3] for automatic data
repair using CFDs.
In Figure 2.4, we report the improvement in data quality as the amount of feedback
increases. Assuming that the user can afford to verify at most a number of updates
equal to the number of initially identified dirty tuples (6000 for Dataset 1 and 3000
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR, GDR-S-Learning, GDR-NoLearning, Active Learning, Heuristic.]
(a) Dataset 1.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR, GDR-S-Learning, GDR-NoLearning, Active Learning, Heuristic.]
(b) Dataset 2.
Fig. 2.4.: Overall evaluation of GDR compared with other techniques. The com-
bination of the VOI-based ranking with the active learning was very successful in
efficiently involving the user. The user feedback is reported as a percentage of the
initial number of the identified dirty tuples.
for Dataset 2), we report the amount of feedback as a percentage of this number.
The results show that GDR achieves superior performance compared with the other
approaches: for Dataset 1, GDR gains about 90% improvement with 20% effort, i.e.,
verifying about 1000 updates; for Dataset 2, about 94% quality improvement was
gained with 30% effort, again verifying about 1000 updates.
In Dataset 1, Active Learning is comparable to GDR only in the beginning of the
curve until reaching about 70% quality improvement. GDR-S-Learning starts to out-
perform Active Learning after about 45% user effort. The Heuristic approach repairs
the database without user feedback, therefore, it produces a constant result. Note
that the quality improvement achieved by the Heuristic approach is attained by GDR
with about 10% user effort, i.e., giving feedback for updates numbering about 10% of
the initial set of dirty tuples in the database. The GDR-NoLearning approach does
improve the quality of the database, but not as quickly as any of the approaches that
use learning methods. In comparison to Figure 2.3, the final performance of GDR-
NoLearning is 100%, assuming all required feedback was obtained. GDR involves
learning, which allows automatic updates to be applied and hence opens the door
to some mistakes. Thus, 100% accuracy may not be reached.
For Dataset 2, similar results were achieved. However, the Active Learning ap-
proach was not as successful as for Dataset 1. This is due to the random nature
of the errors in this dataset, which resulted in fewer correlations between these errors
that could be learned by the model. Due to the wider array of real-world dependen-
cies in Dataset 1, the machine learning methods were more successful and achieved
better performance. For example, some hospitals located on the boundary between
two zip codes have their zip attributes dirty; this is most likely due to a data entry
confusion on where they are really located.
The superior performance of GDR is justified by the following: for a single group
of updates, using the learner uncertainty to select updates can effectively strengthen
the learned model predictions as these “uncertain” updates are more important for
the model. In GDR-S-Learning, randomly inspecting updates from the groups pro-
vided by the VOI-based ranking does enhance the learned model. However, more
user effort is wasted in verifying less important updates according to the learning
benefit. For the Active Learning approach, it is apparent that having the user spend
more effort does not help the learned model, due to overfitting.
This problem is avoided in both GDR and GDR-S-Learning approaches because of
the grouping provided by the GDR framework. The grouping provides the learned
model a mechanism to adapt locally to the current group, which in turn provides
the necessary guidance for the model to strongly learn the associations for a highly
beneficial group rather than just weakly learning the associations for a wide variety
of cases. This is also the reason that the GDR-S-Learning eventually outperforms
the Active Learning with an increase in user effort.
This experiment demonstrates the importance of the learning component for
achieving a faster convergence to a better quality. The results support our initial
hypothesis about the existence of correlations between the dirty and correct versions
of the tuples in real-world data. Also, the combination of VOI-based ranking with
active learning improves over the traditional active learning mechanism.
2.5.3 User Efforts vs. Repair Accuracy
We evaluate GDR’s ability to provide a trade-off between user effort and accurate
updates. We use precision and recall, where precision is defined as the ratio of
the number of values that have been correctly updated to the total number of values
that were updated, while recall is defined as the ratio of the number of values that
have been correctly updated to the number of incorrect values in the entire database.
Since we know the correct data, we can compute these values.
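Under these definitions, precision and recall over cell-level updates can be computed as in the following sketch; the cell identifiers and the dictionary-based representation are illustrative assumptions, not GDR's actual data structures.

```python
def repair_accuracy(applied, ground_truth, dirty_cells):
    """applied: {cell: repaired value}; ground_truth: {cell: correct value};
    dirty_cells: cells whose original value was incorrect.
    precision = correctly updated / all updated values
    recall    = correctly updated / all incorrect values in the database"""
    correct = sum(1 for cell, v in applied.items() if ground_truth.get(cell) == v)
    precision = correct / len(applied) if applied else 1.0
    recall = correct / len(dirty_cells) if dirty_cells else 1.0
    return precision, recall
```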
In this experiment, the user can afford to verify only F updates; GDR then decides
about the rest of the updates automatically. GDR asks the user to verify di of the
suggested updates in each group of repairs ci, until we reach F .
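The per-group verification budget di from Section 2.5.2 can be computed directly from the group benefit scores; the numeric scores in the usage below are hypothetical.

```python
def feedback_budget(group_scores, n_dirty):
    """d_i = E * (1 - g(c_i)/g_max): the higher a group's benefit score
    g(c_i) (Eq. 2.7), the fewer updates the user is asked to verify in it.
    n_dirty is E, the number of initially identified dirty tuples."""
    g_max = max(group_scores)
    return [n_dirty * (1.0 - g / g_max) for g in group_scores]
```

For example, with hypothetical scores [10.0, 5.0, 2.0] and E = 100, the budgets are approximately [0, 50, 80]: the top-scoring group needs the least verification.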
In Figure 2.5, we report the precision and recall values resulting from repairing
the database as we increase F (reported as % of dirty tuples). For both datasets the
precision and recall generally improve as F increases. However, for Dataset 1, the
precision is always higher than for Dataset 2. This is due to the lower accuracy of
the learning component for Dataset 2, which stems from the random nature of the
errors in Dataset 2. Overall, these results illustrate the benefit of user feedback—as
the user effort increases, the repair accuracy increases.
2.6 Summary
We presented GDR, a framework that combines constraint-based repair techniques
with user feedback through an interactive process. The main novelty of GDR is to
solicit user feedback for the most useful updates using a novel decision-theoretic mech-
anism combined with active learning. The aim is to move the quality of the database
to a better state as far as the data quality rules are concerned. Our experiments
[Plot: Precision and Recall (0.5–1) vs. Feedback (User efforts, 0–100%); curves: Precision, Recall.]
(a) Dataset 1.
[Plot: Precision and Recall (0.5–1) vs. Feedback (User efforts, 0–100%); curves: Precision, Recall.]
(b) Dataset 2.
Fig. 2.5.: Accuracy vs. user efforts. As the user spends more effort with GDR, the
overall accuracy is improved. The user feedback is reported as a percentage of the
initial number of the identified dirty tuples.
show very promising results in moving the data quality forward with minimal user
involvement.
3. SCALABLE APPROACH TO GENERATE DATA
CLEANING UPDATES
Existing automatic data cleaning techniques are not scalable, and moreover,
constraint-based cleaning techniques are recognized to fall short of identifying correct
cleaning updates. In Chapter 2, GDR relies on such automatic cleaning techniques
to generate candidate updates to the dirty database. To enable GDR to handle large
databases, we introduce in this chapter a scalable data cleaning approach based
on Machine Learning techniques. Involving ML helps in obtaining more accurate
cleaning updates than the constraint-based methods.
The chapter is organized as follows: In Section 3.1, we highlight the need for
different data cleaning techniques and discuss the challenges. Section 3.2 defines the
problem and introduces the notion of maximal likelihood repair. Section 3.3 presents
our solution for modeling dependencies and predicting accurate replacement values.
Section 3.4 presents SCARE, our scalable solution to repair the data. We demon-
strate the validity of our approach and experimental results in terms of efficiency and
scalability in Section 3.5, and finally, summarize the chapter in Section 3.6.
3.1 Introduction
Most existing solutions to repair dirty databases by value modification follow
constraint-based repairing approaches [3–5], which search for minimal change of the
database to satisfy a predefined set of constraints. While a variety of constraints (e.g.,
integrity constraints, conditional functional and inclusion dependencies) can detect
the presence of errors, they are recognized to fall short of guiding the correction of the
errors and, worse, may introduce new errors when repairing the data [6]. Moreover,
despite the research conducted on integrity constraints to ensure the quality of the
data, in practice, databases often contain a significant amount of non-trivial errors.
These errors, both syntactic and semantic, are generally subtle mistakes which are
difficult or even impossible to express using the general types of constraints available
in modern DBMSs [7]. This highlights the need for different techniques to clean dirty
databases.
In this chapter, we address the issues of scalability and accuracy of replacement
values by leveraging Machine Learning (ML) techniques for predicting better quality
updates to repair dirty databases.
Statistical ML techniques (e.g., decision tree, Bayesian networks) can capture
dependencies, correlations, and outliers from datasets based on various analytic, pre-
dictive or computational models [8]. Existing efforts in data cleaning using ML tech-
niques mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To
the best of our knowledge, our work is the first approach to consider ML techniques
for repairing databases by value modification.
Involving ML techniques for repairing erroneous data is not straightforward and
it raises four major challenges: (1) Several attribute values (of the same record) may
be dirty. Therefore, the process is not as simple as predicting values for a single
erroneous attribute. This requires accurate modeling of correlations between the
database attributes, assuming a subset is dirty and its complement is reliable. (2) An
ML technique can predict an update for each tuple in the database; the question
is how to distinguish the predictions that should be applied. Therefore, a measure
to quantify the quality of the predicted updates is required. (3) Overfitting
may occur when modeling a database with a large variety of dependencies that may
hold locally for data subsets but do not hold globally. (4) Finally, the process of
learning a model from a very large database is expensive, and the prediction model
itself may not fit in the main memory. Despite the existence of scalable ML techniques
for large datasets, they are either model dependent (i.e., limited to specific models,
for example SVM [10]) or data dependent (e.g., limited to specific types of datasets
such as scientific data and document repositories). Note that scalability is also an
issue for the constraint-based repairing approaches [11].
Such limitations motivate the need for effective and scalable methods to accurately
predict cleaning updates with statistical guarantees. More precisely, our contributions
in this chapter can be summarized as follows:
• We formalize a novel data repairing approach that maximizes the likelihood
of the data given the underlying data distribution, which can be modeled using
statistical ML techniques. The objective is to apply selected database updates
that (i) will best preserve the relationships among the data values, and (ii)
will introduce a small amount of changes. This approach enables a variety
of ML techniques to be involved for the purpose of accurately repairing dirty
databases by value modification. This way, we eliminate the necessity to pre-
define database constraints, which requires expensive expert involvement. In
contrast to the constraint-based data repair approaches, which find the min-
imum number of changes to satisfy a set of constraints, our likelihood-based
repair approach finds a bounded amount of changes that maximizes the data
likelihood.
• One of the challenges is that multiple attribute values may be considered dirty.
Therefore, we introduce a technique to provide predictions for multiple at-
tributes at a time, while taking into account two types of dependencies: (i)
the dependency between the identified clean attributes and dirty attributes, as
well as, (ii) the dependency among the dirty attributes themselves. We present
our technique by introducing the probabilistic principles which it relies upon.
• We propose SCARE (SCalable Automatic REpairing), a systematic scalable
framework for repairing erroneous values that follows the new repairing ap-
proach and, more importantly, it is scalable for very large datasets. SCARE
has a robust mechanism for horizontal data partitioning to ensure the scala-
bility and enable parallel processing of data blocks; various ML methods are
applied to each data block to model attributes values correlations and provide
“local” predictions. We then provide a novel mechanism to combine the local
predictions from several data partitions. The mechanism computes the validity
of the predictions of the individual ML models and takes into account the
models’ reliability in terms of minimizing the risk of wrong predictions, as well
as the significance of the partitions’ sizes used in the learning stage. Finally, given
several local predictions for repairing a tuple, we incorporate these predictions
into a graph optimization problem, which captures the associations between
the predicted values across the partitions and obtain more accurate final tuple
repair predictions.
• We present an extensive experimental evaluation to demonstrate the effective-
ness, efficiency, and scalability of our approach on very large real-world datasets.
3.2 Problem Definition and Solution Approach
In this section, we formalize our maximal likelihood repair problem and introduce
our solution approach.
3.2.1 Problem Definition
We consider a database instance D over a relation schema R with A denoting its
set of attributes. The domain of an attribute A ∈ A is denoted by dom(A).
In the relation R, a set F = {E1, . . . , EK} ⊆ A represents the flexible attributes,
which are allowed to be modified (in order to substitute the possibly erroneous values),
and the other attributes R = A − F = {C1, . . . CL} are called reliable with correct
values. Hence a database tuple t has two parts: the reliable part (t[R] = t[C1, . . . CL]),
and the flexible part (t[F ] = t[E1, . . . EK ]). For short we refer to t[R] and t[F ] as r
and f , respectively (i.e., t = rf). We assume that it is possible to identify a subset
Dc ⊂ D of clean (or correct) tuples and De = D − Dc represents the remaining
[Table omitted: sample relation of 8 tuples t1–t8 with attributes Name, Institution, AC, Tel, City, State, and Zip.]
Fig. 3.1.: Illustrative example
possibly dirty tuples. This distinction does not have to be accurate in specifying the
dirty records, but it should be accurate in specifying the clean records. Our objective
is to learn from the correct tuples in Dc to predict accurate replacement values for
the possibly dirty tuples in De.
There are various techniques to distinguish Dc: it is always possible to use
reference data and existing statistical techniques (e.g., [8, 38]), as well as database
constraints (if available), to provide a score Pe(t) ∈ [0..1] for each database tuple t for
being erroneous. Applying a conservative threshold on the scores of each tuple Pe(t),
we can select high quality records to be used for training.
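Concretely, a conservative threshold on the error scores carves out the training set; the threshold value and the scoring function below are illustrative stand-ins for the estimators cited above.

```python
def split_clean_dirty(tuples, pe, tau=0.05):
    """Keep only tuples whose estimated error score Pe(t) falls below a
    conservative threshold tau as the clean training set Dc; everything
    else is treated as possibly dirty (De)."""
    dc = [t for t in tuples if pe(t) < tau]
    de = [t for t in tuples if pe(t) >= tau]
    return dc, de
```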
Example 3.2.1 Consider the example relation in Figure 3.1 with a sample of 8 tu-
ples about some personal information: Name, Institution, area code AC, telephone
number Tel, in addition to address information: City, State and Zip.
This data is a result of integrating professional contact information and lookup
address database. Due to the integration process, we know that some of the address
attributes (City, State and Zip) may contain errors. Therefore, we call the address
attributes flexible attributes. After the integration process, we could separate high
quality records by consulting other reference data or verifying some widely known re-
lationships among the attributes. For this example, tuples t5, . . . , t8 ∈ Dc are identified
as correct ones, while we are not sure about tuples t1, . . . , t4 ∈ De.
We introduce the data repair likelihood given the data distribution as a technique
to guide the selection of the updates to repair the dirty tuples. Our hypothesis is
that the more an update makes the data follow the underlying data distribution
at the least cost, the more likely the update is to be correct.
The likelihood of the database D is the product of the tuples’ probabilities given
a probability distribution for the tuples in the database. Given the identified clean
subset of the database Dc, we can model the probability distribution P (R,F ). Then,
the likelihood of the possibly erroneous subset De can be written (as log likelihood):
L(De | Dc) = Σ_{t∈De} log P(t | Dc) = Σ_{t=rf∈De} log P(f | r)    (3.1)
where we use P (t | Dc) = P (f | r), which we discuss in Section 3.3.
Assume that, for a given tuple t = rf , an ML technique predicts f ′ instead of f . We
say that the update u is predicted to replace f by f ′. Applying u to the database
will change the likelihood of the data; we call the amount of increase in the data
likelihood given the data distribution as the likelihood benefit of u.
Definition 3.2.1 Likelihood benefit of an update u (l(u)): Given a database
D = Dc ∪De, t = rf ∈ De and an update u to replace f by f ′, the likelihood benefit
of u is the increase in the database likelihood given the data distribution learned from
Dc, or (L(D^u_e | Dc) − L(De | Dc)), where D^u_e refers to De when the update u is applied.
Using Eq. 3.1 we obtain
l(u) = logP (f ′ | r)− logP (f | r). (3.2)
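Eq. 3.2 translates directly into code; in practice the two probabilities would come from the model learned on Dc, while the numbers used below are purely illustrative.

```python
import math

def likelihood_benefit(p_new, p_old):
    """l(u) = log P(f' | r) - log P(f | r): the gain in data log-likelihood
    from replacing the flexible part f by the predicted f'."""
    return math.log(p_new) - math.log(p_old)
```

For instance, an update that raises P(f | r) from 0.2 to 0.8 has benefit log 4, a positive value, so applying it increases the data likelihood.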
We also define the cost of an update as follows:
Definition 3.2.2 Cost of an update u (c(u)): For a given database tuple t = rf
and an update u to replace f by f ′, the cost of u is the distance between f and f ′,
c(u) = Σ_{E∈F} dE(f[E], f′[E])    (3.3)
where dE(f [E], f ′[E]) is a distance function for the value domain of attribute E that
returns a score between 0 and 1. Examples of distance functions for string attributes
include the normalized Edit distance or Jaro coefficient; for numerical attributes,
the normalized distance can be used, e.g., dE(a, b) = |a − b| / (maxE − minE), where
a and b are two numbers in dom(E), and maxE, minE are the maximum and minimum
values in dom(E), respectively.
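Eq. 3.3 with a normalized edit distance for string attributes can be sketched as follows (so the Zip change “47906” → “47907” used in Example 3.2.2 contributes 1/5 = 0.2):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def update_cost(f, f_prime):
    """c(u): sum over flexible attributes of the edit distance between old
    and predicted string values, normalized by the longer string's length."""
    total = 0.0
    for old, new in zip(f, f_prime):
        longest = max(len(old), len(new))
        total += edit_distance(old, new) / longest if longest else 0.0
    return total
```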
Our objective is to modify the data to maximize its likelihood; however, similar
to existing repairing approaches, we need to be conservative in modifying the
data. Therefore, we bound the amount of changes introduced to the database by
a parameter δ. Hence, the problem becomes: given an allowed amount of changes
δ, how to best select the cleaning updates from all the predicted updates? This is
a constrained maximization problem where the objective is to find the updates that
maximize the likelihood value under the constraint of a bounded amount of database
changes, δ. We call this problem the “Maximal Likelihood Repair” problem.
Definition 3.2.3 Maximal Likelihood Repair: Given a scalar δ and a database
D = De ∪ Dc. The Maximal Likelihood Repair problem is to find another database
instance D′ = D′e ∪ Dc, such that L(D′e | Dc) is maximum subject to the constraint
Dist(D,D′) ≤ δ.
where Dist is a distance function between the two database instances D and D′
before and after the repairing; it can be defined as
Dist(D,D′) = Σ_{t∈D, A∈A} dA(t[A], t′[A]), where t′ ∈ D′ is the repaired tuple
corresponding to tuple t ∈ D.
Regarding the estimation of δ, it is possible to use the score Pe(t), which estimates
the erroneousness of tuple t, to set δ = ϵ Σ_{t∈De} Pe(t), where ϵ ∈ [0..1]. The idea is
that a possibly erroneous tuple is expected to be modified in proportion to its score of
being erroneous. ϵ can be chosen close to zero to be more conservative about the
amount of introduced changes.
3.2.2 Solution Approach
For each tuple t = rf , we obtain the prediction f ′ that represents an update u
to t. We compute the likelihood benefit and cost of u. Finally, we need to find the
subset of updates that maximizes the overall likelihood subject to the constraint that
the total cost is not more than δ, i.e., Dist(D,D′) ≤ δ.
Formally, we are given a set U of updates and, for each update u, we compute l(u)
and c(u) using Eq. 3.2 and 3.3, respectively. Our goal is to find the set of updates U ′ ⊆ U
such that Σ_{∀u∈U′} l(u) is maximum subject to

Σ_{∀u∈U′} c(u) ≤ δ.    (3.4)
This is a typical 0/1 knapsack problem setting, which implies that the maximal
likelihood repair problem is NP-complete.
Heuristic and quality measure: To solve the above problem, we use the
well-known greedy heuristic for the 0/1 knapsack problem: process the updates in
decreasing order of the ratio l(u)/c(u). This heuristic suggests that the “correctness
measure”
of an update u is the ratio of the update’s likelihood benefit to the cost of applying
the update to the database (i.e., the higher the likelihood benefit with small cost, the
more likely the update to be correct). Empirically, this gives good predictions for the
updates as we will illustrate in our experiments.
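The greedy selection can be sketched as follows, taking precomputed (l(u), c(u)) pairs; the update identifiers and numbers are hypothetical, and every cost is assumed positive.

```python
def select_updates(updates, delta):
    """Greedy 0/1-knapsack heuristic for maximal likelihood repair:
    sort predicted updates by decreasing l(u)/c(u) and apply while the
    accumulated cost stays within the change budget delta.
    updates: list of (update_id, likelihood_benefit, cost) with cost > 0."""
    chosen, spent = [], 0.0
    for uid, l, c in sorted(updates, key=lambda u: u[1] / u[2], reverse=True):
        if spent + c <= delta:
            chosen.append(uid)
            spent += c
    return chosen
```

With equal benefits, the cheaper update wins, mirroring Example 3.2.2 below: given hypothetical updates `("u1", 1.0, 4.0)` and `("u2", 1.0, 2.0)` and a budget of 2, only `u2` is selected.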
Example 3.2.2 In Figure 3.1, assume that two updates were predicted for the
database: u1 updates t3 such that f′3 = {“Chicago”, “IL”, “60614”}, and u2 updates t4
such that f′4 = {“WLafayette”, “IN”, “47907”}. Assume also that u1 and u2 have
the same likelihood benefit. In this case, u2 incurs the lower cost, updating one
character in each of the Zip and City attributes (the Zip from “47906” to “47907”
and the City from “Lafayette” to “WLafayette”), while u1 costs updating 4
characters in the Zip, from “61801” to “60614”. Hence, for δ ≤ 2 characters, only
u2 will be applied to the database.
3.3 Modeling Dependencies and Predicting Updates
The key challenge when considering data repair using the data distribution is
that multiple attribute values may be dirty. When a single attribute
is erroneous, the problem is to model the conditional probability distribution of the
erroneous attribute given the other attributes; hence, a single classification model
can be used to obtain the predicted values for the erroneous attribute. In most cases,
however, a set of attributes has low-quality values, not a single
attribute. Therefore, we need to model the probability distribution of the subset of
dirty attributes given the other attributes that have reliable values (i.e., that are most likely
to be correct) to achieve a better prediction of the replacement values.
Example 3.3.1 In Figure 3.1, assume we know that only the City attribute contains
some errors. This is the simple case, because an ML model can be trained on
the database tuples considering City as the label to be predicted. In
practice, however, more than one attribute can be dirty at the same time, for example, all the
address attributes. In this case, we can use an ML technique to model the distribution
of the combination (City, State, Zip), taking into account their possible inter-dependencies,
given existing reliable values of attributes, e.g., (Name, Institution, AC, Tel).
3.3.1 Modeling Dependencies
Let SR = dom(C1) × dom(C2) × · · · × dom(CL) denote the space of possible reliable
parts of tuples t[C1 . . . CL] (with clean attribute values), and SF = dom(E1) ×
dom(E2) × · · · × dom(EK) denote the space of possible flexible parts of tuples t[E1 . . . EK]
(with possibly erroneous values). Assuming that the tuples of D are generated
randomly according to a probability distribution P(R, F) on SR × SF, P(F | r) is the
conditional distribution of F given R = r, and PEi(Ei | r) is the corresponding marginal
distribution of the values of attribute Ei:

PEi(ei | r) = ∑_{f∈SF : f[Ei]=ei} P(f | r).
Note that the posterior probability distribution P(F | r) provides the means to analyze
the dependencies among the flexible attributes. The distribution gives
the probability of each combination of values for the flexible attributes ⟨e1, . . . , eK⟩,
where e1 ∈ dom(E1), . . . , eK ∈ dom(EK).
Given a database tuple t = rf , the conditional probability of each combination of
the flexible attribute values f can be computed using the product rule:
P(f | r) = P(f[E1] | r) · ∏_{i=2}^{K} P(f[Ei] | r, f[E1 . . . Ei−1]). (3.5)
Note that we assume a particular order in the dependencies among the flexible
attributes {E1, . . . , EK}. To obtain this order, we leverage an existing technique [68]
to construct a dependency network for the database attributes. The dependency
network is a graph with the database attributes as the vertices; there is a directed
edge from Ai to Aj if the analysis determined that Aj depends on Ai. In our case,
there are two sets of vertices: the reliable set R and the flexible set F. The first
flexible attribute E1 in the order is the one that has the maximum number of reliable
attributes as its parents in the graph. Subsequently, the next attribute in the order
is the one with the maximum number of parents that are either reliable attributes
or flexible attributes with an already assigned order. In our experiments, we followed this
procedure by analyzing a sample of the database to determine the dependency order
of the flexible attributes. An alternative method to compute the conditional
probability P(f | r) without assuming any particular order of the flexible attributes
is Gibbs sampling [69]; however, it is too expensive to apply even to
moderately sized databases. Please refer to [68] for further details.
In Section 3.3.2, we introduce an efficient way to obtain the predictions f′ that are
desirable for our cleaning approach and to compute the conditional probabilities P(f | r).
Algorithm 3.1 GetPredictions(Classification model Mi, input tuple ri = ⟨r, f[E1], . . . , f[Ei−1]⟩,
probability P, database tuple t = rf)
1: if (i > K) then
2: f ′ = ri − r
3: AllPredictions = AllPredictions ∪{(f ′, P )}
4: return
5: end if
6: fEi = Mi(ri)
7: rs = ⟨ri, fEi⟩
8: Ps = P × P (fEi | ri)
9: GetPredictions(Mi+1, rs, Ps, t)
10: if fEi ≠ t[Ei] then
11: r′s = ⟨ri, t[Ei]⟩ {Add the original Ei value to the next input}
12: P′s = P × P(t[Ei] | ri)
13: GetPredictions(Mi+1, r′s, P′s, t) {Proceed with the original Ei value}
14: end if
3.3.2 Predicting Updates
We use an ML model M (as predictor) to model the joint distribution in
Eq. 3.5. The model M is a mapping SR → SF that assigns (or predicts) the flexible
attribute values f′ for a database tuple t = rf given the values r of the reliable
attributes R. The prediction takes the form:

M(r) = ⟨M1(r), . . . ,MK(r)⟩ = f′.
To estimate the joint distribution of the flexible attribute values, P(f | r) in Eq.
3.5, we learn K classification models Mi(·), each on the input space SR × dom(E1) × · · · ×
dom(Ei−1) (i.e., using all the reliable attributes and the flexible attributes preceding
attribute Ei).
Mi : SR × dom(E1)× · · · × dom(Ei−1) → dom(Ei)
We assume that Mi is a probabilistic classifier (e.g., Naïve Bayes) that is
trained using Dc and produces a probability distribution over the values of the flexible
attribute Ei given ⟨r, f[E1], . . . , f[Ei−1]⟩.
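As a toy stand-in for the chain of probabilistic classifiers M1 . . . MK, the sketch below approximates each Mi by conditional frequency counts over clean tuples. A real implementation would use any probabilistic classifier (e.g., Naïve Bayes); the tuples below are purely illustrative.

```python
from collections import Counter, defaultdict

# Each "classifier" estimates P(E_i = e | input) by counting how often
# a label follows a given (reliable + earlier flexible) input in Dc.

class FreqClassifier:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, inputs, labels):
        for x, y in zip(inputs, labels):
            self.counts[tuple(x)][y] += 1
        return self

    def predict_proba(self, x):
        c = self.counts[tuple(x)]
        total = sum(c.values())
        return {y: n / total for y, n in c.items()} if total else {}

# Train M1 (City given reliable part) and M2 (State given reliable + City).
clean = [(("765",), "WLafayette", "IN"),
         (("765",), "WLafayette", "IN"),
         (("765",), "Lafayette", "IN")]
m1 = FreqClassifier().fit([r for r, _, _ in clean],
                          [city for _, city, _ in clean])
m2 = FreqClassifier().fit([r + (city,) for r, city, _ in clean],
                          [st for _, _, st in clean])
print(round(m1.predict_proba(("765",))["WLafayette"], 4))  # 0.6667
```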
One efficient greedy way to approximate the optimal prediction f′ proceeds as
follows: given a tuple t = rf, the classifier M1 is used to predict the value of attribute
E1 (i.e., f′[E1]) given r. Then, M2 predicts the value of attribute E2 given r and
f′[E1] as input. Proceeding in this way, Mi predicts the value of attribute Ei given
r and f′[E1] . . . f′[Ei−1]. This approach can be viewed as greedily searching for a
path in a tree that has the possible values of f′ ∈ SF at the leaves. We call this tree
the flexible attribute values search tree. Needless to say, this approach does
not guarantee finding the prediction f′ with the highest probability.
For a tuple t, to find a better prediction suited to our cleaning approach,
we follow the conservative assumption of preferring the original attribute values in
the tuple when updating the database. Under this assumption, the best prediction
will be among the explored tree paths that involve the original values of the tuple t.
Hence, in addition to the greedy path, we can compute additional paths that assume
the original values in the tuple are the supposed predictions. The algorithm to
compute a set of predictions for the flexible attributes F of a given tuple t is described
in Algorithm 3.1, GetPredictions. Basically, GetPredictions proceeds recursively in
the flexible attribute values search tree. At each tree level i, two branches are
considered when the prediction of attribute Ei differs from its original value in the
tuple; otherwise, a single branch is considered. The initial call to Algorithm 3.1 to
get predictions for tuple t = rf is GetPredictions(M1, r, 1.0, t = rf).
In Algorithm 3.1, Line 1 checks whether we have passed the prediction of the last flexible
attribute EK; in this case, we add the flexible part f′ of the obtained final tuple
to the AllPredictions list. In Line 6, we predict the value fEi of attribute Ei. Lines
7 and 8 compose the new input rs by adding fEi to ri and compute the prediction
probability so far, Ps. We then proceed recursively to get the prediction for the next
flexible attribute Ei+1. Lines 11-13 are executed if the predicted value fEi is
different from the original value t[Ei]. In this case, we compose another input r′s
using the original value t[Ei], compute the prediction probability so far using
P(t[Ei] | ri) from the model Mi, and finally proceed recursively to get a prediction
for Ei+1.
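The recursion of Algorithm 3.1 can be sketched in plain Python as follows. The classifiers here are hard-coded stubs returning {value: probability} dictionaries, so the model names and numbers are purely illustrative.

```python
# Sketch of Algorithm 3.1: explore the greedy branch, and a second
# branch keeping the tuple's original value whenever the top prediction
# differs from it (mirroring lines 10-14 of the algorithm).

def get_predictions(models, r, t_flex):
    all_preds = []

    def recurse(i, prefix, prob):
        if i == len(models):                 # past the last flexible attr
            all_preds.append((prefix, prob))
            return
        dist = models[i](r + tuple(prefix))
        best = max(dist, key=dist.get)       # greedy branch
        recurse(i + 1, prefix + [best], prob * dist[best])
        if best != t_flex[i]:                # keep-original branch
            recurse(i + 1, prefix + [t_flex[i]],
                    prob * dist.get(t_flex[i], 0.0))

    recurse(0, [], 1.0)
    return all_preds

# Two flexible attributes; the classifier stubs ignore their input.
m_city = lambda x: {"WLafayette": 0.7, "Lafayette": 0.3}
m_state = lambda x: {"IN": 0.9, "IL": 0.1}
preds = get_predictions([m_city, m_state], ("765",), ["Lafayette", "IN"])
print(len(preds))  # 2: the greedy path plus the keep-original City path
```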
Example 3.3.2 Consider the example relation in Figure 3.1. Assume that tuple t4
was marked as erroneous and we want to obtain predictions for its flexible attributes.
In GetPredictions, ri is initially the reliable attribute values {“C. Clifton”, “Purdue
Univ.”, “765”, “494-6005”}. In Line 6, the classifier M1, which was trained using only
the set of reliable attributes to predict the first flexible attribute City, provides the
prediction “WLafayette”. Then rs is composed as the input to the next classifier
M2, which was trained on the reliable attributes and the first flexible attribute City
to predict the second flexible attribute State: rs = {“C. Clifton”, “Purdue Univ.”,
“765”, “494-6005”, “WLafayette”}. Since the predicted City is different from the
one in the table, we compose another input to the classifier M2 with the original City
value: r′s = {“C. Clifton”, “Purdue Univ.”, “765”, “494-6005”, “Lafayette”}. We
obtain the prediction for the State given the two inputs rs and r′s and proceed
recursively until we use M3 to predict the Zip; we finally extract f′ from each ri in
Line 2 to end up with the list AllPredictions.
GetPredictions produces, for a given tuple t, at most 2^K predictions with their
probabilities; in practice, however, the number of predictions is far less than 2^K.
We select the prediction f′ with the best benefit-cost ratio, i.e., the update u that
replaces f with f′ and results in the highest l(u)/c(u). Note that we need to compute l(u)
only for f′ with P(f′ | r) greater than P(f | r), the probability of the original
values; otherwise, the likelihood benefit of the predicted update would be negative. Note
also that P(f | r) is included in the output of GetPredictions.
In Section 3.4, we present the method to scale up the maximal likelihood repair
problem and obtain the predicted updates u along with their likelihood benefit l(u). The
cost c(u) is straightforward to compute.
3.4 Scaling Up the Maximal Likelihood Repairing Approach
One of the key challenges in repairing dirty databases is scalability [11]. For
the case of maximal likelihood repair, the scalability issue is mainly due to learning
a set of classification models to predict the flexible attribute values. Usually, this
process is at least quadratic in the database size. Moreover, the learning process and
the model itself may not fit in main memory. There are efforts on learning
from large-scale datasets, but most of these techniques are either limited to specific
ML models (e.g., scalable learning using SVM [10]) or to specific types
of datasets, such as scientific data and document repositories.
In this section, we present a model-independent method to learn and predict up-
dates to the database that is based on horizontally partitioning the database. Each
database tuple will be a member of several partitions (or blocks). Each partition b
is processed to provide predictions for the erroneous tuples t ∈ b depending on the
database distribution learnt from block b (i.e., local predictions). Finally, we present
a novel mechanism to combine the local predictions from the different partitions and
determine more accurate final predictions.
This method is in the spirit of ensemble learning [70] or committee-based
approaches, where the task is to predict a single class attribute by partitioning the
dataset into several smaller partitions and training a model on each data partition.
For a given tuple, each model provides a prediction for the class attribute, and the
final prediction is the one with the highest aggregated prediction probability. In our
case, however, we want to predict the values of multiple flexible attributes together;
we are not limited to predicting a single attribute value. Hence, for a given tuple,
we obtain a prediction (a combination f′ of the flexible attribute values) from each
data partition. We then propose a technique that combines the models’ predictions into
a graph optimization problem to find the final prediction for the flexible attributes.
Our main insight is that the final (combinations of) predicted values are those that
maximize the associations among the predicted values across the partitions.
Our mechanism to collect and incorporate the predicted updates takes into account
the reliability of the learnt classification models themselves to minimize the risk of
the predicted updates.
After obtaining the final predicted values with their likelihood benefit l(u), we use
them in the maximal likelihood repair problem (Eq. 3.4).
3.4.1 Process Overview
Algorithm 3.2 illustrates the main steps of the SCARE process to get the predicted
updates along with their likelihood benefit. The primary input to the framework is
a database instance D. The second input is a set of database partitioning functions
(or criteria) H = {h1, . . . , hJ}.
There are two main phases for SCARE: (1) Updates generation phase (lines 1-8),
and (2) Tuple repair selection phase (lines 9-13).
In Phase 1 (Line 1), each function hj ∈ H partitions D into blocks
{b1j, b2j, . . . }. Then, the loop in lines 2-8 processes each block bij as follows: (i)
learn the set of classifiers Mij from the identified clean tuples in bij (Line 3); (ii)
use Mij to predict the flexible attribute values for the possibly erroneous tuples
in bij using Algorithm 3.1 (lines 4-7). For each tuple, the prediction is considered a
candidate tuple repair and is stored in a temporary repair storage, denoted RS.
Since each tuple will be a member of several data partitions, we end up with a
set of candidate tuple repairs for each possibly erroneous tuple. The details of the
repair generation are provided in Section 3.4.2.
Phase 2 (lines 9-13) loops over each tuple t ∈ De, retrieves all its candidate tuple
repairs from the repair storage RS, and then uses Algorithm 3.3, SelectTupleRepair, to get
the final tuple repair (update) with its estimated likelihood benefit. The details of
the repair selection algorithm are provided in Section 3.4.3. Note that each iteration
in Phase 1 does not depend on other iterations (similarly for the iterations of Phase
2). Hence, SCARE can be efficiently parallelized.
Algorithm 3.2 SCARE(D dirty database, H = {h1, . . . , hJ} DB partitioning func-
tions)
1: Given H, partition D into blocks bij .
2: for all block bij do
3: Learn the models Mij .
4: for all tuple t = rf ∈ bij ∧ t ∈ De do
5: Use Mij to predict f′j and get Pij(f′j | r) and Pij(f | r)
6: Store f′j, Pij(f′j | r) and Pij(f | r) in RS. {store in the Repair Storage}
7: end for
8: end for
9: for all tuple t = rf ∈ De do
10: RS(t) = the candidate tuple repairs for t in RS.
11: f′ = SelectFinalPrediction(RS(t))
12: For the update u changing f to f′, compute the likelihood measure l(u) if f ≠ f′.
13: end for
3.4.2 Repair Generation Phase
In this phase, the data is partitioned, as we explain shortly, for two main benefits:
(i) scaling to large data by enabling independent processing of each partition,
and (ii) more accurate and efficient learning of the classification models for the
prediction task. The first benefit is obvious; the second is obtained for the following
reason: when we train a classification model for prediction, ideally, we need
the model to provide high prediction accuracy by capturing all the possible dependencies
in the data (we call this a model with a global view). All the statistically significant
dependencies constitute the model’s search space. However, if the space contains
many weak dependencies, most likely the model will not be able to capture them;
and if it does, the global view will not be accurate enough for prediction because of
model overfitting. Partitioning the database helps capture local correlations
that are significant within subsets of the database and that require a different degree
of “zooming” to be recognized. Each of the partition functions h ∈ H provides the
search space partitioned according to the criteria shared by the tuples within the
same block. If we train models on multiple blocks, we will have models with several
local views (or specialized models, sometimes called experts [71]) for portions of the
search space. Combining these local views results in better prediction accuracy.
Partitioning the database: Each partition function or criterion h(·) maps each
tuple to one of a set of partitions. Multiple criteria H = {h1, . . . , hJ} are used to
partition the database in different ways. Each tuple t is mapped to a set of partitions,
i.e., H(t) = ∪_{∀j} hj(t).
A simple way to choose the partition criteria is Random (i.e., randomly partition
the data many times). Another way is Blocking, where partitions are constructed
under the assumption that similar tuples fit in the same block or, inversely, that
tuples across different blocks are less likely to be similar. Many techniques for
blocking have been introduced for the efficient detection of duplicate records (refer
to [9] for a survey).
It is worth mentioning that increasing the number of partition functions results
in a more accurate final prediction, because the variance in the predictions decreases
as we increase the number of ways (partition functions) to partition the data. We
found that partitioning the data using different blocking techniques provided more
accurate predictions with fewer partition functions in comparison with random
partitioning.
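A minimal sketch of the multi-way partitioning step: each hj maps a tuple to a block key, so a tuple joins one block per partition function. The blocking keys below (institution, area code, Zip prefix) are illustrative assumptions, not the thesis's actual partition functions.

```python
# Partition a dataset J ways: block key (j, h_j(t)) identifies the
# block of tuple t under partition function h_j.

def partition(tuples, partition_fns):
    blocks = {}
    for t in tuples:
        for j, h in enumerate(partition_fns):
            blocks.setdefault((j, h(t)), []).append(t)
    return blocks

H = [lambda t: t["Institution"],     # h1: block by institution
     lambda t: t["AC"],              # h2: block by area code
     lambda t: t["Zip"][:3]]         # h3: block by zip prefix
tuples = [{"Institution": "Purdue Univ.", "AC": "765", "Zip": "47907"},
          {"Institution": "Purdue Univ.", "AC": "765", "Zip": "47906"}]
blocks = partition(tuples, H)
print(len(blocks))  # 3 blocks; each tuple is a member of all three
```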
Example 3.4.1 Consider again the relation in Figure 3.1. In this example, one may
partition the database based on the Institution attribute (as a partition function)
to get the tuples partitioned as follows: {t1, t7}, {t2, t8}, {t3}, {t4, t5, t6}. The result
of the learning process from these data partitions will be expert models based on the
person’s institution. Another function may use the AC or a combination of attributes.
Partition functions can be designed based on a signature-based scheme or clustering,
as we elaborate in the experimental section.
Reliability measure and risk minimization: In order to be conservative in
considering the predictions from each block bij and its model Mij, we propose a
mechanism to measure the reliability of a model and adapt the obtained prediction
probability accordingly, to support or detract from the model’s predictions.
Two major components help us judge the reliability of a model Mij:
(i) the model quality, classically quantified by its loss

L(Mij) = (1/|bij|) ∑_{t∈bij∩Dc, E∈F} dE(f[E], f′ij[E]),

where |bij| is the number of tuples in partition bij, E is one of the flexible attributes
F, dE is a distance function for the domain of attribute E, and f′ij is the prediction
of model Mij on the flexible attributes F for the tuple t ∈ bij ∩ Dc; (ii) the size of
the block: the smaller the block, the less reliable the predictions. Hence, the
reliability of model Mij can be written as:

Re(Mij) = (|bij|/|D|) (1 − L(Mij)). (3.6)

Finally, the prediction probabilities obtained from model Mij are scaled to be:

Pij(f′ | r) = Pij(f′ | r) × Re(Mij).
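Eq. 3.6 and the probability scaling can be sketched directly; the block size, database size, loss, and probability values below are illustrative, not measurements.

```python
# Reliability of a model (Eq. 3.6): relative block size times the
# model's empirical accuracy (1 - loss) on the clean tuples.

def reliability(block_size, db_size, loss):
    return (block_size / db_size) * (1.0 - loss)

def scaled_probability(p, block_size, db_size, loss):
    # Raw prediction probability weighted by the model's reliability.
    return p * reliability(block_size, db_size, loss)

# A model trained on a block holding half the database, with 10% loss:
print(reliability(500, 1000, 0.1))                        # 0.45
print(round(scaled_probability(0.8, 500, 1000, 0.1), 2))  # 0.36
```

Small blocks thus automatically discount their models' predictions, which is the conservative behavior the text describes.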
Aggregating Suggestions: As mentioned earlier, a tuple t = rf will be a
member of |H(t)| data partitions. From each partition, we get a candidate tuple
repair for t; these are stored in the storage RS with the following schema:
{t id, partition, E1, . . . , EK, P(f′ | r), P(f | r)}, where t id is the original tuple identifier,
partition is the partition name, Ek ∈ F, P(f′ | r) is the prediction probability of the
repairing update f′, and P(f | r) is the probability of the original values in t. The
space required for RS is O(|D| × |H|).
Example 3.4.2 Consider tuple t4 in Figure 3.1, with the flexible attributes City,
State, and Zip. Assume that we used 5 partition functions, so t4 was a
member of 5 partitions; consequently, we obtain 5 candidate tuple repairs.
The table in Figure 3.2 illustrates the candidate tuple repairs of t4, RS(t4), with the
corresponding prediction probabilities obtained from each partition.
Fig. 3.2.: Generated predictions for tuple repairs with their corresponding prediction
probabilities for tuple t4 in Figure 3.1.
3.4.3 Tuple Repair Selection Phase
Once the candidate tuple repairs are generated, we need a repair selection strategy
to pick the best one from the candidate set. One suggestion for a selection strategy
is majority voting: for a tuple t, select the most voted value from the partitions for
each attribute Ei individually.
The majority voting strategy assumes that each attribute was predicted
independently of the others. For a tuple t = rf, however, we predict the combination
of the flexible attributes f′ together; thus, the independence assumption on the
attributes is not valid. Therefore, we propose a mechanism to vote for a final
combination of the flexible attributes that takes into account the dependencies between
the predicted values obtained from each partition.
Example 3.4.3 Consider the candidate tuple repairs of t4 in Figure 3.2. Note that
if we use majority voting with the prediction probability as the voter’s
certainty, the final prediction would be {“Lafayette”, “IN”, “47906”}. This solution
does not take into account the dependencies between the predicted values within the
same tuple repair. For example, there is a stronger association between “47907”
and “IN” than between “47906” and “IN”. This relationship is reflected in their
corresponding prediction probabilities: the values “47907” and “IN” were predicted
in f′1, f′4 with probabilities 0.7 and 0.8, while “47906” and “IN” were predicted in f′2, f′3
with smaller probabilities, 0.4 and 0.6. The same applies to the relationship
between “WLafayette” and “IN”, which is stronger than that between “Lafayette”
and “IN”. A more desirable prediction is {“WLafayette”, “IN”, “47907”}.
For a given database tuple t = rf, our goal is to find the final combination
f′∗ = ⟨e∗1, . . . , e∗K⟩ such that ∑_{bij : t∈bij} P(f′∗ | r) is maximum. This is computationally
infeasible, because it requires computing the probability of each possible
combination of the flexible attributes in each data block. Instead, we can search for
the values that maximize all the pairwise joint probabilities. In principle,
maximizing the pairwise associations between the predicted values implies
maximizing the full association between the predicted values. Hence, the final update
is the one that maximizes the prediction probabilities for each pair of attribute
values. We formalize this problem as follows.
Definition 3.4.1 The Tuple Repair Selection Problem: Given a set of predicted
combinations for the flexible attributes RS(t) = {f′1, . . . , f′|H|} for database tuple t = rf,
along with the prediction probabilities of each combination (i.e., for f′j ∈ RS(t), we
have the corresponding prediction probability P(f′j | r)), the tuple repair selection
problem for tuple t is to find f′∗ = ⟨e∗1, . . . , e∗K⟩ such that the following sum is
maximum:

∑_{∀ e∗i, e∗k, i≠k} ∑_{∀ f′∈RS(t): e∗i=f′[Ei], e∗k=f′[Ek]} P(f′ | r).
To find a solution, we map this problem to the graph optimization problem of
finding the K-heaviest subgraph (KHS) [72] in a K-partite graph (KPG). The key
idea is to process each database tuple t individually and use its set of candidate
tuple repairs, RS(t), to construct a graph where each vertex is an attribute value,
and an edge is added between a pair of vertices iff the corresponding values co-occur
in a prediction f′ ∈ RS(t). The edges have weights derived from the obtained
prediction probabilities. It is worth noting that this strategy is applied to each tuple
separately; therefore, this phase can be efficiently parallelized.
Finding KHS in KPG: The K-heaviest subgraph (KHS) problem is an NP
optimization problem [72]. In an instance of the KHS problem, we are given a graph
G = (VG, EG), where VG is the set of vertices of size n, EG is the set of edges with
non-negative weights (Wwv denotes the weight on the edge between vertices w, v),
and a positive integer K < n.
The goal is to find V′ ⊂ VG, |V′| = K, such that ∑_{(w,v)∈EG∩(V′×V′)} Wwv is maximum.
In other words, the goal is to find a K-vertex subgraph with the maximum weight.
A graph G = (VG, EG) is said to be K-partite if we can divide VG into K subsets
{V1, . . . , VK} such that two vertices in the same subset cannot be adjacent. We call
the problem of finding a KHS in a K-partite graph such that the subgraph contains
a vertex from each partite the KHS in KPG problem.
Definition 3.4.2 The KHS in KPG problem: Given a K-partite
graph G = (V1, . . . , VK, EG), find V′ = {v1, . . . , vK} such that vk ∈ Vk and
∑_{(vi,vj)∈EG∩(V′×V′)} Wvivj is maximum.
Lemma 3.4.1 The KHS in KPG is NP-Complete.
Proof This is a proof sketch. It is straightforward to see that we can reduce the
problem of finding a K-clique (a clique of size K) in a K-partite graph to the KHS in
KPG problem. The problem of finding a K-clique in a K-partite graph G is NP-complete by
reduction from the problem of (n − K)-vertex cover in the complement K-partite
graph G′, which is NP-complete (see [73] for details).
Solving the Tuple Repair Selection Problem: We are given a set of predictions
RS(t) = {f′1, . . . , f′|H|} for the flexible attributes of a tuple t = rf, where
f′j = ⟨e_1^(j), . . . , e_K^(j)⟩ and K = |F| is the number of flexible attributes.
The repair selection problem can be mapped to the KHS in KPG problem using
the following steps:
1. Building vertex sets for each attribute Ek: For each attribute Ek, create a
vertex v for each distinct value in {e_k^(1), . . . , e_k^(|H|)}. Note that we obtain a set of
vertices (i.e., a partite) for each attribute Ek.
Fig. 3.3.: Step-by-step demonstration of the SelectTupleRepair algorithm: (a) constructed
graph; (b) after removing I; (c) after removing F; (d) after removing 6; (e) after
removing W. At each iteration, the vertex with minimum weighted degree is removed
as long as it is not the only vertex in its corresponding vertex set.
2. Adding edges: Add an edge between vertices v, w when their corresponding
values co-occur in a candidate tuple repair. Note that v, w cannot belong to
the same vertex set.
3. Assigning edge weights: For an edge between v, w, the weight is computed as
follows. Let f(v,w) = {f′j | f′j contains both v and w}, i.e., the set of predictions that
contain both values v and w. Then

Wvw = ∑_{f′j∈f(v,w)} Pij(f′j | r),

where Pij(f′j | r) is the prediction probability of f′j obtained from partition bij.
The graph construction requires a single scan over the predictions RS(t) =
{f′1, . . . , f′|H(t)|}; hence, it is of O(K|H|). The number of vertices is the number
of distinct values in the candidate tuple repairs.
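The three construction steps can be sketched as follows. The candidate repairs below mimic the flavor of Figure 3.2 but are illustrative, not its exact contents.

```python
from itertools import combinations

# Build the K-partite graph from candidate tuple repairs: vertices are
# distinct predicted values per attribute, edges link co-occurring
# values, and edge weights sum the repairs' prediction probabilities.

def build_graph(candidates):
    """candidates: list of (predicted values tuple, probability)."""
    vertex_sets = {}   # attribute index -> set of predicted values
    weights = {}       # frozenset({(i, v), (k, u)}) -> summed weight
    for values, p in candidates:
        for i, v in enumerate(values):
            vertex_sets.setdefault(i, set()).add(v)
        # One edge per pair of co-occurring values (never same partite).
        for (i, v), (k, u) in combinations(enumerate(values), 2):
            edge = frozenset({(i, v), (k, u)})
            weights[edge] = weights.get(edge, 0.0) + p
    return vertex_sets, weights

cands = [(("WLafayette", "IN", "47907"), 0.7),
         (("Lafayette", "IN", "47906"), 0.6),
         (("WLafayette", "IN", "47906"), 0.4)]
sets_, w = build_graph(cands)
# "WLafayette" and "IN" co-occur in the first and third repairs:
print(round(w[frozenset({(0, "WLafayette"), (1, "IN")})], 2))  # 1.1
```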
Example 3.4.4 Figure 3.3(a) shows the constructed 3-partite graph from the
predictions in Figure 3.2 for tuple t4 in the original relation of Figure 3.1. For each
attribute, there is a vertex set (or partite); e.g., the set corresponding to the Zip
attribute contains {“47906”, “47907”}. In the graph, we replaced the actual attribute
values by character abbreviations to obtain a more compact graph, as follows: {“6”
→ “47906”, “7” → “47907”, “L” → “Lafayette”, “W” → “WLafayette”, “F” →
“lafytte”, “N” → “IN”, “I” → “IL”}.
Note that there is an edge between “W” and “N” with edge weight 1.1 (= 0.4
+ 0.7). This is because “WLafayette” and “IN” co-occur twice, in f′1 and f′3, with
probabilities 0.7 and 0.4, respectively. Also, there is an edge between “I” and “6”
with weight 0.5, because “IL” and “47906” co-occur once, in f′5, with probability 0.5.
The rest of the graph is constructed similarly.
Finally, finding the KHS in the constructed KPG is a solution to the tuple repair
selection problem. The underlying idea is that the resulting K-subgraph G′(V′, E′)
contains exactly one vertex from each vertex set. This corresponds to selecting
a value for each flexible attribute. Moreover, the weight of the selected subgraph
corresponds to the result of maximizing the summation in Definition 3.4.1.
Computing the likelihood benefit: For a tuple t = rf, the solution of the
KHS in KPG problem is the final prediction f′ for the flexible attributes. The final
prediction probability of f′ is computed from the solution graph G′(V′, E′) by

P(f′ | r) = (1/|E′|) ∑_{evw∈E′} (1/|f(v,w)|) ∑_{f′j∈f(v,w)} Pij(f′j | r).

The inner summation averages the probability of each pair of attribute values (i.e.,
each edge in G′) in the final prediction f′. The outer summation averages the
probability over all the edges in the final graph G′.
The prediction probability of the original values in the flexible attributes f is
computed following the ensemble method, by averaging the probabilities obtained from
each partition, i.e., P(f | r) = (1/|H|) ∑_{bij : t∈bij} Pij(f | r).
Finally, for the update u changing f into f′, we can compute the likelihood benefit
l(u) using Equation 3.2.
Example 3.4.5 Consider the constructed initial graph in Figure 3.3(a), and assume
that the solution for the KHS in KPG is the subgraph {“W”, “N”, “7”} shown in Figure
3.3(e). We now have an update u to change the original values in tuple t4 in Figure
3.1 from f = {“Lafayette”, “IN”, “47906”} to f′ = {“WLafayette”, “IN”, “47907”}.
From RS(t4) in Figure 3.2, we get P(f | r) = avg{0.6, 0.5, 0.3, 0.6, 0.5} = 0.5 and
P(f′ | r) = (1/3)[(1/2)(0.7 + 0.4) + (1/2)(0.7 + 0.8) + 0.7] = 0.66. Finally, we use
Eq. 3.2 to compute l(u) = 0.12.
3.4.4 Approximate Solution for Tuple Repair Selection
For the general problem of finding the KHS, many approximation algorithms have been
introduced (e.g., [72, 74, 75]). For example, in [74] the authors model the problem
as a quadratic 0/1 program and apply random sampling and randomized rounding
techniques, resulting in a polynomial-time approximation scheme; in [75], the
algorithm is based on a semi-definite programming relaxation.
If K is very small, the optimal solution can be found by enumeration. For
the case where K is not very small, we provide here an approximate solution
inspired by the greedy heuristic discussed in [72]. For the general graph
problem, the heuristic repeatedly deletes a vertex with the least weighted degree
from the current graph until K vertices are left. The weighted degree of a vertex is the
sum of the weights on the edges attached to it.
In the following, we apply the same heuristic to the case of a K-partite graph.
However, we iteratively remove the vertex with the least weighted degree only as long as it
is not the only vertex left in its partite; otherwise, we find the next least-weighted-degree
vertex. The algorithm is a greedy 2-approximation, following the analysis
discussed in [72].
Algorithm 3.3 shows the main steps to find the final tuple repair. There are two
inputs to the algorithm: (i) the graph G(VG, EG) constructed from the predictions,
and (ii) the sets of vertices S = {S1, . . . , SK}, where each Sk represents the predicted
values for attribute Ek. For each vertex v, we store its current weighted degree,
Weighted Degree(v) = ∑_{∀evw∈EG} Wvw, which is the sum of the weights of the edges
incident to v.
Algorithm 3.3 SelectTupleRepair(G(V, E) graph, S = {S1, . . . , SK})
1: while ∃S ∈ S s.t. |S| > 1 do
2: v = GetMinWeightedDegreeVertex(G, S)
3: If v = null Then break;
4: for all vertex w ∈ V s.t. ewv ∈ E do
5: Remove ewv from G.
6: Weighted Degree(w) −= Wwv
7: end for
8: Remove v from its corresponding set S.
9: end while

The algorithm proceeds iteratively in the loop in lines 1-9. The loop stops when
a solution is found, i.e., when there is only one vertex in each vertex set
(|S| = 1 ∀S ∈ S). In each iteration, we start (Line 2) by finding the vertex v that
has the minimum weighted degree, using Algorithm GetMinWeightedDegreeVertex(G, S).
Then, we remove all the edges incident to v and update Weighted Degree(w) by
subtracting Wwv for each vertex w connected to v by a removed edge ewv (lines 4-7).
Finally, vertex v is removed from G and from its corresponding vertex set (Line 8).
GetMinWeightedDegreeVertex goes through the vertex sets that have more than
one vertex and returns the vertex with the minimum weighted degree.
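As an illustrative sketch (not the thesis implementation), the greedy selection of Algorithm 3.3 can be written as follows; the graph representation (a dict of edge weights keyed by vertex pairs) and the vertex names are assumptions made for the example.

```python
def select_tuple_repair(edges, partitions):
    """Greedy K-partite tuple-repair selection, a sketch of Algorithm 3.3.

    edges: dict mapping frozenset({u, v}) -> edge weight W_uv.
    partitions: list of vertex sets, one per attribute (the sets S_1..S_K).
    Returns one surviving vertex per partition, in partition order.
    """
    # Compute the initial weighted degree of every vertex.
    degree = {}
    for pair, w in edges.items():
        for v in pair:
            degree[v] = degree.get(v, 0.0) + w

    edges = dict(edges)              # work on a copy
    parts = [set(p) for p in partitions]

    while any(len(p) > 1 for p in parts):
        # Only vertices in sets with more than one vertex are candidates,
        # so the last vertex of a partition is never removed.
        candidates = [v for p in parts if len(p) > 1 for v in p]
        v = min(candidates, key=lambda x: degree.get(x, 0.0))
        # Remove v's incident edges and update the neighbors' degrees.
        for pair in [e for e in list(edges) if v in e]:
            w = edges.pop(pair)
            (u,) = pair - {v}
            degree[u] -= w
        for p in parts:
            p.discard(v)

    return [next(iter(p)) for p in parts]
```

For instance, with two partitions {a1, a2} and {b1, b2} (hypothetical predicted values) and edge weights favoring the pair (a1, b1), the algorithm first discards a2 and then b2, returning [a1, b1].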
Analysis: Algorithm 3.3 requires the following. First, all n vertices are visited
to remove all but K of them. For each vertex, each set S ∈ S of the K sets is visited
to get its minimum vertex according to the weighted degree. This requires O(nK log |S|),
where n ≈ O(K|H|) and |S|'s worst case is O(|H|). Hence, visiting the vertices
is of O(K²|H| log |H|). Second, removing the vertices requires visiting their edges,
O(|E_G|), which has a worst case of O(K²|H|). Therefore, the overall complexity of
Algorithm 3.3 is O(K²|H| log |H|).
Example 3.4.6 The SelectTupleRepair algorithm is illustrated step-by-step in Fig-
ure 3.3. The algorithm looks for the vertex with the least weighted degree to be re-
moved. The first vertex is "I", whose weighted degree equals 1.0 = 0.5 + 0.5,
corresponding to the two edges incident to "I". This leaves the vertex set of the State
attribute with only one vertex, “N”. Therefore, we do not consider removing the ver-
tex “N” in further iterations of the algorithm. The next vertex to remove is “F” to
get Figure 3.3(c), and so on.
Finally, we obtain the final solution in Figure 3.3(e), which corresponds to a subgraph
with 3 vertices, one from each initial partite set. This graph is the heaviest subgraph
of size 3 (i.e., the sum of its edge weights is maximal) in which each vertex belongs
to a different partite set. It is worth mentioning that the final graph does not have
to be fully connected. Thus, the final prediction is {"WLafayette", "IN",
"47907"}.
3.5 Experiments
In this section, we evaluate our data repair approach; specifically, the objectives of
the experiments are as follows: (1) Evaluation of SCARE and the notion of maximal
likelihood repair in comparison with existing constraint-based approaches for data
repair, (2) Assessment of the scalability of SCARE.
Datasets: In our evaluations, we use three datasets: (i) Dataset 1 is the same
real-world dataset about patients discussed in Section 2.5. We selected a subset
of the available patient attributes, namely Patient ID, Age, Sex, Classification,
Complaint, HospitalName, StreetAddress, City, Zip, State and VisitDate. This
is in addition to the Longitude and Latitude of the address information. This
dataset is dirty and it is used as input to the repairing approaches. The flexible
attributes that we consider for repairing are (City, Zip, HospitalName, Longitude
and Latitude). (ii) Dataset 2 is the US Census Data (1990) Dataset1 containing
about 2 M tuples. It has been used only in the scalability experiments. (iii) Dataset
3 is the Intel Lab Data (ILD2) used to evaluate SCARE for predicting missing values
and compare it to ERACER [7], as a recent system that relies on relational learning
for predicting missing data in relational databases.
¹http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
²http://db.csail.mit.edu/labdata/labdata.html
Parameters: In our evaluations, we study several parameters, listed here with
their default values: (1) e: the percentage of erroneous tuples in the dataset
(default 30%); (2) d: the dataset size (default 10,000 tuples); (3) δ: the maximum
amount of changes, as a fraction of the dataset size d, that SCARE is allowed to
apply (default 0.1, i.e., 10% of d); (4) I: the number of iterations of SCARE
(default 1); (5) |H|: the number of partition functions (default 5).
All the experiments were conducted on a server running Linux with 32 GB of RAM
and two 3 GHz processors. We use MySQL to store and query the tuples.
For the probabilistic classifiers, we use Naïve Bayes; specifically, we use the
WEKA NBC implementation³ with its default parameter settings. Our approach is
implemented in Java, and we use Java threads to benefit from the multiprocessor
environment.
Regarding the partition functions using blocking, we repeat the following process
|H| times: we randomly sample a small number of tuples from the dataset and
cluster them into |D|/n_b clusters, where n_b is the average number of tuples per
partition. Then,
each tuple is assigned to the closest cluster as its corresponding partition name. This
process allows for having different blocking functions due to the random sample of
tuples used in the clustering step of each iteration. The tuples that have been assigned
to the same partition have common or similar features due to their assignment to the
closest cluster. In all our quality experiments, we use blocking as the technique to
partition the dataset. Another simple way to partition the dataset is to use random
partitioning functions. In this case, given nb, we assign each tuple to a partition name
b_ij, where i is a random number from {1, . . . , |D|/n_b}, and j = 0 initially. This
process is repeated |H| times while incrementing j each time.
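The random partitioning scheme just described can be sketched as follows (an illustration, not the thesis code; tuple identifiers and the function name are assumptions): in each of the |H| passes, every tuple is assigned a partition name b_ij with i drawn uniformly from {1, ..., |D|/n_b} and j the pass index.

```python
import random

def random_partitions(num_tuples, nb, num_functions, seed=0):
    """Assign each tuple a random partition name in each of |H| passes.

    num_tuples: |D|, the dataset size.
    nb: average number of tuples per partition.
    num_functions: |H|, the number of partition functions.
    Returns a list of dicts, one per pass: tuple index -> partition name.
    """
    rng = random.Random(seed)
    n_parts = max(1, num_tuples // nb)   # |D| / n_b partitions per pass
    passes = []
    for j in range(num_functions):
        passes.append({t: f"b{rng.randint(1, n_parts)}_{j}"
                       for t in range(num_tuples)})
    return passes
```

Each pass covers every tuple, so a tuple ends up in exactly |H| partitions, one per partition function.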
3NBC WEKA available at http://www.cs.waikato.ac.nz/ml/weka
[Figure 3.4: two plots of (a) Precision and (b) Recall vs. the percent of errors e
(10-50%), comparing KHSinKPG, MV, SM, and ConstRepair.]
Fig. 3.4.: Quality vs. the percentage of errors: SCARE maintains high precision by
making the best use of δ, the allowed amount of changes.
3.5.1 Repair Quality Evaluation
To evaluate the quality of the automatic repair, we manually cleaned the dirty
datasets (e.g., for Dataset 1, using addresses web sites, external reference data sources,
and visual inspection) to obtain the ground truth and we compare the clean versions
of the datasets (and each replacement value) with the repair output of our method.
In the following experiments, we use the standard precision and recall to measure
the quality of the applied updates for string attributes.
The precision is defined as the ratio of the number of values that have been
correctly updated to the total number of values that were updated, while the recall
is defined as the ratio of the number of values that have been correctly updated to
the number of incorrect values in the entire database. For numerical attributes, we
use the mean absolute error (MAE): (1/N) Σ_{i=1}^{N} |v_i − a_i|, where v_i is the
value suggested by SCARE and a_i is the actual value in the original data. We can
compute these measures since we know the ground truth for Dataset 1.
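For concreteness, the three quality measures can be computed as in the following minimal sketch, assuming the counts are gathered by comparing the applied updates against the ground truth (function names are illustrative):

```python
def precision_recall(correct_updates, applied_updates, total_errors):
    """Precision: correctly updated values / all updated values.
    Recall: correctly updated values / all incorrect values in the database."""
    return correct_updates / applied_updates, correct_updates / total_errors

def mean_absolute_error(suggested, actual):
    """For numerical attributes: average |v_i - a_i| over the repaired cells."""
    return sum(abs(v - a) for v, a in zip(suggested, actual)) / len(suggested)
```

For example, if 8 of 10 applied updates are correct and the database contains 20 incorrect values, precision is 0.8 and recall is 0.4.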
We report the quality results for four approaches:
• KHSinKPG: This is SCARE with the tuple repair selection strategy described
in Section 3.4.3.

• MV: In this approach, SCARE directly uses majority voting to select an
attribute value from the candidate tuple repairs. We include this approach in
the evaluation to compare our tuple repair selection strategy against the
straightforward majority voting strategy.
• SM: The single-model approach, where the whole dataset is considered as a
single partition. Afterward, the likelihood benefit and cost are computed to
select the best updates. This approach is included to show the advantages of
combining several models with local views rather than using a single model with
a global view of the data.
• ConstRepair: One of the recent constraint-based repair approaches. We imple-
mented (to the best of our understanding) the technique described in [3], which
uses CFDs as constraints to find dirty tuples and derive repairs. We manually
compiled a list of CFDs during the process of cleaning Dataset 1. Moreover,
we implemented a CFD discovery algorithm [36] whose output is used as input
to the repairing algorithm. To obtain high-quality rules and be fair to this
approach, we discovered the CFDs from the original "correct" data, which cannot
be the case in a realistic situation, as we usually start with dirty data. We set
the rule support threshold to 1%. The implementation to discover the CFDs is
very time-consuming; therefore, this approach does not appear in all of our plots.
Quality vs. the percentage of errors: We use Dataset 1 and report in Fig-
ure 3.4 the precision and recall results for the applied updates while changing e, the
percentage of tuples with erroneous values, from 10 to 50%. Generally, the approaches
that maximize the data likelihood substantially outperform the constraint-based ap-
proach in precision. In terms of recall, the likelihood approaches outperform
[Figure 3.5: two plots of (a) Precision and (b) Recall vs. δ/|D| (0-10%), for
KHSinKPG, MV, and SM.]
Fig. 3.5.: δ controls the amount of changes to apply to the database: small δ guar-
antees high precision at the cost of the recall and vice versa.
the constraint-based approach when the error rate is up to 30%; beyond that, the
recall is comparable for all the approaches.
SCARE with KHSinKPG shows the highest precision. The precision of the three
likelihood-based approaches increases with the amount of errors, but the recall
decreases. For the Longitude and Latitude attributes, the SCARE-based approaches
(KHSinKPG and MV) corrected these attributes with an error rate between 1%
and 5%.
For the SCARE-based approaches, the precision increases because we use a fixed δ.
When the amount of errors in the data (denoted e) is small (10 to 20%), the SCARE-
based approaches modify more of the data, allowing less accurate updates and
resulting in lower precision but relatively high recall. As e increases, the data needs
more updates to be corrected; however, SCARE applies fewer, yet more accurate,
updates, and hence the precision increases while the recall decreases (Figure 3.4(b)).
The KHSinKPG approach outperforms the MV approach because KHSinKPG takes
into account further associations between the predicted values across the partitions;
these associations are ignored in the MV approach. Both SCARE-based approaches
that rely on data
partitioning (KHSinKPG and MV ) show a comparable, and sometimes even better,
accuracy than that of the SM approach.
ConstRepair is outperformed by all the likelihood-based approaches because it
relies on the heuristic of finding replacement values that are close to the original
data, without considering any information about the data distribution and
relationships.
The recall of all the repairing approaches is in the range of 30 to 65%. However, for
the likelihood-based approaches, the recall can be improved by running the approach
again over the resulting database instance; we illustrate this improvement later in
the experiment of Figure 3.6. In contrast, the ConstRepair approach achieves about
35% recall, which cannot be further improved given a fixed set of constraints.
To conclude, in comparison to the constraint-based repairing approach, which
demonstrated both low precision and low recall, our likelihood-based approach
demonstrated accurate updates with high precision. Moreover, partitioning the data
and combining the different predictions across the partitions provides more accurate
predictions, because partitioning allows learning data relationships at different
granularity levels (local and global).
Quality vs. the amount of changes δ: In this experiment, we study the effect
of δ, the number of tolerated changes, on the repair quality. We report in Figure 3.5
the resulting precision and recall as δ/|D| changes from 1% to 10% (e.g., if
δ/|D| = 5%, then SCARE can change up to 10% of the tuples by replacing a selected
attribute value v by v′ with distance d(v, v′) = 0.5). Generally, low values of δ
guarantee high precision, and vice versa. Increasing δ gives SCARE more opportunity
to modify the data. Hence, the recall increases as we increase δ, but updates with
lower confidence may be made, which explains the decrease in precision as δ
increases.
The conclusion from this experiment is that small δ guarantees high precision
at the cost of the recall. However, the recall can be improved by further SCARE
iterations over the data as we will see in the next experiment.
[Figure 3.6: two plots of (a) Precision and (b) Recall vs. the number of SCARE
iterations I (1-5), for KHSinKPG, MV, and SM.]
Fig. 3.6.: Using SCARE in an iterative way helps improve the recall and the overall
quality of the updates. The decrease in precision is small compared to the increase
in recall, achieving an overall high quality improvement, as demonstrated by the
F-measure.
Quality vs. the number of SCARE iterations: This experiment shows
the effect on the quality if we repeatedly execute SCARE over the dataset I times,
I = {1, . . . , 5}. After each iteration, the repaired tuples are considered members of
the clean subset of the database, which is used in training the ML models. We report
the obtained precision and recall in Figure 3.6. For all the SCARE-based approaches,
the recall substantially improves from about 35% to close to 70% as we increase the
number of iterations, while the precision slightly decreases from the 90% range to the
80% range. This indicates that the overall quality improves as we run more SCARE
iterations. KHSinKPG outperforms the other approaches in terms of both precision
and recall.

In each iteration, SCARE tries to repair the data to maximize the data likelihood
given the learnt classification models, subject to a constant amount of changes δ. In
the first few iterations, the discovered updates have a higher correctness measure
(i.e., the ratio of the likelihood benefit to the cost of the update, l(u)/c(u)) than
those that are discovered later. Therefore, the updates applied in the first iterations
have higher
[Figure 3.7: two plots of (a) Precision and (b) Recall vs. the number of partition
functions |H| (2-16), for KHSinKPG, MV, and SM.]
Fig. 3.7.: Increasing the number of partition functions |H| improves the accuracy of
the predictions and hence increases the precision. The recall is not affected much
because we use a fixed δ.
confidence, and hence, the precision starts high and decreases in later iterations.
The recall increases faster than the precision decreases, and therefore the overall
quality improves. The main reason is that most of the applied updates are correct
in the first few iterations, so the database instances obtained after each iteration are
of higher quality to be learnt and modeled for predictions in the later iterations of
SCARE. A stopping criterion for the iterations can be computed from the obtained
overall likelihood benefit of the updates: if the benefit is not significant, then it is
better to stop SCARE. In the GDR setting of Chapter 2, the user is involved
interactively to inspect a very small number of the least beneficial updates after each
iteration. If the least beneficial updates are correct, then most of the updates are
correct according to the data distribution.
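The iterative use of SCARE with a likelihood-benefit stopping criterion can be sketched as follows. This is a hedged illustration, not the thesis code: `run_scare_once` is a hypothetical callable standing in for one full SCARE pass, assumed to return the repaired database and the total likelihood benefit of the updates applied in that pass.

```python
def iterative_scare(db, run_scare_once, max_iterations=5, min_benefit=1e-3):
    """Repeat SCARE passes until the likelihood benefit becomes insignificant.

    After each pass the repaired tuples become part of the clean subset,
    which `run_scare_once` is assumed to use for training in the next pass.
    """
    for _ in range(max_iterations):
        db, benefit = run_scare_once(db)
        if benefit < min_benefit:   # remaining updates no longer pay off
            break
    return db
```

With a benefit sequence such as 1.0, 0.5, then near zero, the loop stops after the third pass rather than running all `max_iterations`.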
Quality vs. the number of partition functions: Here, we study the sensi-
tivity of SCARE to the number of ways to partition the dataset, i.e., the number
of partition functions |H|. Each tuple is a member of |H| partitions. We report in
Figure 3.7 the precision and recall of the applied updates as we change |H| from 2 to
16. The SCARE-based approaches, KHSinKPG and MV, achieve higher precision
[Figure 3.8: runtime plots. (a) Dataset 2: time (min) vs. the number of records in
millions (0.1-1), broken down into Phase 1 and Phase 2. (b) Dataset 1: time (sec)
vs. the number of records in thousands (up to 50), for Blocking vs. Random
partitioning.]
Fig. 3.8.: SCARE scalability when varying the database size.
as we increase |H|, while SM shows constant lower precision as changing |H| does
not affect SM ’s performance. Also, the recall is not much affected by |H| as we are
using δ, a fixed amount of changes to apply.
Increasing |H| increases the number of partitions each tuple belongs to. SCARE
learns a model from each local view (i.e., from one partition) and predicts the most
accurate value for the tuple's attributes. As a consequence, a larger number of
candidate tuple repairs is proposed as the number of partitions increases, and the
variance of the predictions decreases. The repair selection strategy combines the
predictions of the local-view models, which increases the chance of obtaining more
accurate predictions. The strategy in KHSinKPG offers better results, as it improves
the precision over majority voting by taking into account the dependencies obtained
from the different partitions.
3.5.2 SCARE Scalability
Scalability is one of the main advantages of SCARE, in addition to the quality of
the updates demonstrated in the previous experiments. In the following, we assess
the scalability of SCARE on large datasets.
In Figure 3.8(a), we report the scalability of SCARE on Dataset 2, as well as the
fraction of time spent in each of the two phases of SCARE. The reported time
includes the time for learning the classification models. SCARE scales linearly
because of its systematic process of handling each data partition. Note also that
SCARE finished processing 1 M tuples in less than 6 minutes. Phase 1, the updates
generation, takes 80-85% of the time because of the process of learning models and
obtaining predictions.
In Figure 3.8(b), we report the overall time taken by SCARE to handle subsets of
Dataset 1 of different sizes, from 5,000 to 50,000 tuples, using two different
partitioning techniques: Random and Blocking. In general, SCARE still scales
linearly with the dataset size. Moreover, partitioning the data by blocking makes
SCARE more efficient. The reason is that blocking produces data partitions
containing less diversity in the domain values, which results in faster training of the
classification models used for prediction.
The cost of statistical modeling and prediction tasks is usually quadratic in the data
size, but this is not the case with SCARE because of its mechanism of partitioning
the data and the strategy used to combine multiple predictions.
3.5.3 SCARE vs. ERACER to Predict Missing Values
In this experiment, we use Dataset 3, which includes a number of measurements
taken from 54 sensors once every 31 seconds. It contains only numerical attributes.
We include this dataset to evaluate SCARE for predicting missing values and compare
it to ERACER. We used the same dataset with the same introduced errors as reported
in the ERACER evaluation of [7].
ERACER is a recent machine learning technique for data cleaning; it is specifically
dedicated to replacing missing values, but it cannot be used for data repair by value
modification. ERACER leverages domain expert knowledge about the dataset to
[Figure 3.9: mean absolute error of SCARE vs. ERACER on Dataset 3: (a) errors
as missing values; (b) errors as corrupted values.]
Fig. 3.9.: Comparison between SCARE and ERACER to predict missing values.
Generally, both SCARE and ERACER show high accuracy in predicting the miss-
ing values. In this experiment, SCARE uses a Naïve Bayes model, while ERACER
leverages domain knowledge encoded in a carefully designed Bayesian network.
design a Bayesian network. ERACER then uses an underlying relational database
design to store the constructed model's parameters.
In Figure 3.9, we evaluate SCARE in the task of predicting the missing values
in comparison with ERACER. Here, we do not use δ as SCARE is not updating
an existing database value. Instead, we consider all the predictions obtained from
SCARE that fill a missing value.
Figure 3.9(a) reports the mean absolute error (MAE) when the errors in the dataset
take the form of missing values. Figure 3.9(b) also reports the MAE, but with errors
in the form of corrupted values, e.g., randomly added values. More details on how
this data was corrupted are provided in [7].
There are two major numerical attributes in Dataset 3: humidity and temperature.
In Figure 3.9(a), both SCARE and ERACER predict the missing values for the
humidity and temperature with almost the same low error percentage (2 to 4%);
and in Figure 3.9(b), they both behave similarly as the percentage of corrupted data
increases, with a prediction error between 3 and 7%.
These numbers show that both techniques provide highly accurate predictions.
However, SCARE does not require an expensive domain expert to design a Bayesian
network, as ERACER does. For this experiment, SCARE used Naïve Bayes for
the statistical models. Naïve Bayes did well when plugged into SCARE, compared
to the Bayesian network that has to be carefully designed for ERACER. This is
thanks to the partitioning technique used in SCARE, which enables several local
views of the data that are combined at the end to obtain the most reliable global
view for accurate predictions. Moreover, SCARE can benefit from carefully designed
learning techniques, like ERACER's, by plugging them into the learning step to get
more accurate predictions.
3.6 Summary
In this chapter, we proposed SCARE, a robust and scalable approach for accu-
rately repairing erroneous data values. Our solution offers several advantages over
previous methods for data repair using database constraints: the accuracy of the
replacement values and the scalability of the method are guaranteed; the cost of the
repair is bounded by the amount of changes that the user is willing to tolerate; and
no constraint or editing rule is needed, since SCARE analyzes the data, learns cor-
relations from the correct data, and takes advantage of them to predict the most
accurate replacement values. Finally, as shown in our extensive experiments, SCARE
outperforms existing methods on large databases with no limitation on the type of
data (i.e., string, numeric, ordinal, categorical) or on the type of errors (i.e., missing,
incomplete, outlying, or inconsistent values).
4. INDIRECT GUIDANCE FOR DEDUPLICATION
(BEHAVIOR BASED RECORD LINKAGE)
In this chapter, we give an example of leveraging the indirect interactions of users
(entities) to improve data quality. Specifically, we use the transaction logs generated
by entities for the task of finding duplicate entities, or linking entity records. We
assume in this chapter that the database results from integrating two data sources
that are expected to have some entities in common.
The rest of the chapter is organized as follows. In Section 4.1, we start by motivat-
ing our approach and summarizing our contributions. Section 4.2 provides a general
overview of the approach and formalizes the problem through examples. The candi-
date generation phase for matched entities is discussed in Section 4.3, and the accurate
techniques for matching entities using their behaviors (generated transactions) are
discussed in Section 4.4. The experimental evaluation is provided in Section 4.5, and
finally, we summarize the chapter in Section 4.6.
4.1 Introduction
Record linkage is the process of identifying records that refer to the same real world
entity. There has been a large body of research on this topic (refer to [9] for a recent
survey). While most existing record linkage techniques focus on simple attribute
similarities, more recent techniques are considering richer information extracted from
the raw data for enhancing the matching process (e.g. [12–15]).
In contrast to most existing techniques, we are considering entity behavior as a
new source of information to enhance the record linkage quality. We observe that
by interpreting massive transactional datasets, for example, transaction logs, we can
discover behavior patterns and identify entities based on these patterns.
A straightforward strategy to match two entities is to measure the similarity
between their behaviors. However, a closer examination shows that this strategy may
not be useful, for the following reasons. It is usually the case that the complete
knowledge of an entity’s behavior is not available to both sources, since each source is
only aware of the entity’s interaction with that same source. Hence, the comparison
of entities’ “behaviors” will in reality be a comparison of their “partial behaviors”,
which can easily be misleading. Moreover, even in the rare case when both sources
have almost complete knowledge about the behavior of a given entity (e.g., a customer
who did all his grocery shopping at Walmart for one year and then at Safeway for
another year), the similarity strategy still will not help. The problem is that many
entities do have very similar behaviors, and hence measuring the similarity can at
best group the entities with similar behavior together (e.g., [16–18]), but not find
their unique matches.
Fortunately, we developed an alternative strategy that works well even if complete
behavior knowledge is not known to both sources. The key to our proposed strategy
is that we merge the behavior information for each candidate pair of entities to be
matched. If the two behaviors seem to complete one another, in the sense that stronger
behavioral patterns become detectable after the merge, then this will be a strong
indication that the two entities are, in fact, the same. The problem of distinct entities
having similar overall behavior is also handled by the merge strategy, especially when
their behaviors are split across the two sources with different splitting patterns (e.g.,
20%-80% versus 60%-40%). In this case, two behaviors (from the first and second
sources) will complete each other if they indeed correspond to the same real world
entity, and not just two distinct entities who happen to share a similar behavior
(which is one of the shortcomings of the similarity strategy).
In this work, we develop principled computational algorithms to detect behavior
patterns that correspond to latent unique entities in merged logs. We compute the
gain in recognizing a behavior before and after merging the entities' transactions and
use this gain as a matching score. In our empirical studies with real-world datasets,
the behavior merge strategy produced much better results than the behavior
similarity strategy under different scenarios of splitting the entities' transactions
among the data sources.
The contributions of this chapter can be summarized as follows:
• We present the first formulation of the record linkage problem using entity
behavior and solve the problem by detecting consistent repeated patterns in
merged transaction logs.
• To model entities’ behavior, we develop an accurate, principled detection ap-
proach that models the statistical variations in the repeated behavior patterns
and estimates them via expectation maximization [76].
• We present an alternative, more computationally efficient, detection technique
that is based on information theory which detects recognized patterns through
high compressibility.
• To speed up the linkage process, we propose a filtering procedure that produces
candidate matches for the above detection algorithms through a fast but inac-
curate matching. This filtering introduces a novel "winnowing" mechanism for
zeroing in on a small set of candidate pairs with few false positives and almost
no false negatives.
• We conduct an extensive experimental study on real world datasets that demon-
strates the effectiveness of our approach to enhance the linkage quality.
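The compressibility idea in the contributions above can be illustrated with a small sketch. This is an illustration only, not the technique developed later in the chapter: behaviors are modeled as plain action strings, the merge is a simple concatenation rather than a time-interleaved log, and the scoring function is an assumption built on `zlib`.

```python
import zlib

def compressed_size(log: str) -> int:
    """Size of the log after standard DEFLATE compression."""
    return len(zlib.compress(log.encode()))

def merge_score(log_a: str, log_b: str) -> int:
    """Score a candidate match by the compression gained from merging.

    A high score means the merged log exposes stronger repeated patterns
    than the two parts compressed separately, i.e., the two behaviors
    "complete" each other and likely belong to one entity.
    """
    return (compressed_size(log_a) + compressed_size(log_b)
            - compressed_size(log_a + log_b))
```

Intuitively, two halves of one customer's repetitive shopping log score higher when merged with each other than when merged with an unrelated customer's log, since the shared pattern compresses away in the first case.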
4.2 Behavior Based Approach
4.2.1 Problem Statement
We are given two sets of entities {A1, . . . , AN1} and {B1, . . . , BN2}, where for each
entity A we have a transaction log {T1, . . . , TnA} and each transaction Ti is a tuple
[Figure 4.1: flow of the four phases. Phase 0: Pre-processing and Behavior
Extraction; Phase 1: Candidate Generation (Quick & Dirty Matching); Phase 2:
Accurate Matching (Statistical Technique or Information Theoretic Technique);
Phase 3: Final Filtering and Conflict Resolution.]
Fig. 4.1.: Process for behavior-based record linkage.
in the form of ⟨ti, a, F id⟩ where ti represents the time of the transaction, a is the
action (or event) that took place, and F id refers to the set of features that describe
how action a was performed.
Our goal is to return the most likely matches between entities from the two sets
in the form of ⟨Ai, Bj, Sm(Ai, Bj)⟩, where Sm(Ai, Bj) is the matching function. Given
entities A,B (and their transactions), the matching function returns a score reflecting
to what extent the transactions of both A and B correspond to the same entity.
4.2.2 Approach Overview
We begin by giving an overview of our approach for record linkage, which can be
summarized by the process depicted in Figure 4.1.
Phase 0: In the initial pre-processing and behavior extraction phase, we transform
raw transaction logs from both sources into a standard format as shown on the right
side of Figure 4.2(a). Next, we extract the behavior data for each single entity in
each log. Behavior data is initially represented in a matrix format similar to those
given in Figure 4.2(b), which we refer to as Behavior Matrix (BM).
Phase 1: Similar to most record linkage techniques, we start with a candidate
generation phase that uses a "quick and dirty" matching function. When matching a
pair of entities, we follow the merge strategy described in the introduction. Moreover,
in this phase, we map each row in the BM to a 2-dimensional point, resulting in a very
compact representation of the behavior with some information loss. This mapping
allows for very fast computations on the behavior data of both the original and merged
entities. The mapping is discussed in Section 4.3 and we will show how we can use it to
generate a relatively small set of candidate matches (with almost no false negatives).
Phase 2: Once the candidate matches are generated, the following phase, which
is the core of our approach, is to perform the accurate (yet more expensive) matching
of entities. Accurate matching of the candidate pair of entities (A,B) is achieved
by first modeling the behavior of entities A, B, and AB using a statistical generative
model, where AB is the entity representing the merge of A, B. The estimated models’
parameters are then used to compute the matching score. The details are provided
in Section 4.4.1.
In addition to the above statistical modeling technique, we also propose an alter-
native heuristic technique that is based on information theoretic principles for the
accurate matching phase (See Section 4.4.2). This alternative technique relies on
measuring the increase in the level of compressibility as we merge the behavior data
of pairs of entities. While, to some extent, it is less accurate than the statistical
technique, it is computationally more efficient.
Phase 3: The final filtering and conflict resolution phase is where the final
matches are selected. In our experiments, a simple filtering threshold, tf , is applied
to exclude low-scoring matches.
In the remainder of this section, we will first examine the details of phase 0, and
then we will give an introduction to phases 1 and 2, whose detailed discussion will be
presented in the following two sections.
4.2.3 Pre-processing and Behavior Extraction
A transaction log, from any domain, would typically keep track of certain types
of information for each action an entity performs. This information includes: (1) the
time at which the action occurred, (2) the key object upon which the action was per-
formed (e.g., buying a Twix bar), and (3) additional detailed information describing
the object and how the action was performed (e.g., quantity, payment method, etc).
For simplicity, we will be referring to each action just by its key object. For example,
“Twix” can be used to refer to the action of buying a Twix bar.
The following example illustrates how we can transform a raw transaction log into
a standard format with such information. Although the example is from retail stores,
the same steps can be applied in other domains with the help of domain experts.
Example 4.2.1 An example of a raw log is shown in table “Raw log” in Figure 4.2(a),
which has four columns representing the time, the customer (the entity to be matched),
the ID of the item bought by the customer, and the quantity. Since the item name
may be too specific to be the key identifier for the customer’s buying behavior, an
alternative is to use the item category name as the identifier for the different actions.
This way, actions will correspond to buying Chocolate and Cola rather than Twix
and Coca Cola. The main reason behind this generalization is that, for instance,
buying a bar of Twix should not be considered as a completely different action from
buying a bar of Snickers, and so on. In general, these decisions can be made by
a domain expert to avoid over-fitting when modeling the behavior. In this case, the
specific item name, along with the quantity, will be considered as additional detailed
information, which we will refer to as the action features.
The next step is to assign an id, F_id, to each combination of features occurring
with a specific action in “Raw Log”, as shown in the “Action Description” table. This
step ensures that even if we have multiple features, we can always reason about them
as a single object using F_id. If there is only one feature, then it can be used directly
with no need for F_id.
As a final step, we generate the “Processed Log” by scanning “Raw Log” and
registering the time, entity, action, and F id information for each line.
Behavior Extraction and Representation: Given the standardized log, we
extract the transactions of each entity and represent them in a matrix format, called
Behavior Matrix.
[Figure 4.2(a) contains four tables: the “Raw log” (Time, Cstmr, itm_id, Qty); an item catalog mapping each itm_id to an item name (Twix, Snickers, KitKat, Coca Cola, Pepsi Cola) and a category name (Chocolate or Cola); the “Action Description” table assigning an F_id to each action/feature combination (e.g., <Qty=2>,<Desc=Twix>); and the resulting “Processed Log” (Time, Entity, Action, F_id).]

(a) Raw log pre-processing example: The first step is to decide on the action
identifiers and the features describing each action to create the “Action
Description” table. The second step is to use the identified actions to re-
write the log.

Behavior Matrices over time units 01-16:

              01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
A  Chocolate   0  0  4  0  0  3  0  0  0  4  0  0  0  0  0  0
   Cola        0  0  2  0  0  0  0  2  0  0  0  0  0  0  0  0
B  Chocolate   3  0  0  0  0  0  0  3  0  0  0  0  3  0  3  0
   Cola        2  0  0  0  0  2  0  0  0  2  0  0  2  0  2  0
C  Chocolate   0  0  5  0  0  4  0  0  0  5  0  0  0  5  0  4
   Cola        0  0  1  0  0  1  0  0  1  0  0  0  0  1  0  1
AB Chocolate   3  0  4  0  0  3  0  3  0  4  0  0  3  0  3  0
   Cola        2  0  2  0  0  2  0  2  0  2  0  0  2  0  2  0
BC Chocolate   3  0  5  0  0  4  0  3  0  5  0  0  3  5  3  4
   Cola        2  0  1  0  0  3  0  0  1  2  0  0  2  1  2  1

When merging A & B and then B & C, AB looks more consistent than BC, so A and B are most likely the same customer.
(b) Resulting Behavior Matrices from the processed log.
Fig. 4.2.: Retail store running example.
Definition 4.2.1 Given a finite set of n actions performed over m time units by an
entity A, the Behavior Matrix (BM) of entity A is an n×m matrix, such that:
BM_{i,j} = F_{ij} if action a_i is performed at time j,
BM_{i,j} = 0 otherwise,

where F_{ij} ∈ F_i is the F_id value for the combination of features describing action
a_i when performed at time j, F_i is the domain of all possible F_id values for action
a_i, i = 1, . . . , n, and j = 1, . . . , m.
Example 4.2.2 The BMs for customers A,B and C are shown in Figure 4.2(b). A
non-zero value indicates that the action was performed and the value itself is the F id
that links to the description of the action at this time instant.
A more compact representation for the entities’ behavior is derived from the Behavior
Matrix representation, and is constructed and used during the accurate matching
phase. This second representation, which is based on the inter-arrival times,
considers each row in the BM as a stream or sequence of pairs {v_ij, F(v_ij)}, where
v_ij is the inter-arrival time since the last time action a_i occurred, and F(v_ij) ∈ F_i
is a feature that describes a_i, drawn from L_{a_i} possible descriptions, |F_i| = L_{a_i}. For
example, in Figure 4.2(b), the row corresponding to action a_i = chocolate of entity
C, BM_i = {0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4}, will be represented as X_i =
{{3, 5}, {3, 4}, {4, 5}, {4, 5}, {2, 4}}.
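As a small illustration, the row-to-sequence conversion can be sketched as follows (the function name is ours):

```python
def interarrival_sequence(bm_row):
    """Convert one Behavior Matrix row into pairs (v_ij, F(v_ij)):
    the inter-arrival time since the previous occurrence, and the F_id."""
    seq, last = [], 0
    for t, fid in enumerate(bm_row, start=1):
        if fid != 0:
            seq.append((t - last, fid))
            last = t
    return seq

# Row for action "chocolate" of entity C from Figure 4.2(b):
row_C = [0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4]
print(interarrival_sequence(row_C))
# → [(3, 5), (3, 4), (4, 5), (4, 5), (2, 4)]
```

Note that the first pair measures the inter-arrival time from the start of the log, which matches how X_i is written above.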
The lossy behavior representation used in the candidate generation phase will be
described in Section 4.3.
It is worth mentioning that the actions, along with their level of details (e.g.,
buying chocolate vs. buying Twix) and their associated features, are assumed to be
homogeneous across the two sources. Otherwise, another pre-processing phase will be
required to match the actions, and thereby ensure the homogeneity. Needless to say,
the sources themselves must belong to the same domain (e.g., two grocery stores, or
two news web sites) for the behavior-based approach to be meaningful.
4.2.4 Matching Strategy
As we explained in Section 4.2.2, matching entities based on their extracted
behavior data is achieved in two consecutive phases: a candidate generation phase
followed by an accurate matching phase. In this section, we describe our general
matching strategy, which we apply in the two matching phases. Note that ultimately,
we need to assign a matching score, Sm, for each pair of entities (A,B) deemed as a
potential match, and then report the matches with the highest scores.
To compute Sm(A,B), we first compute a behavior recognition score, Sr, for each
entity (i.e., Sr(A) and Sr(B)). We then merge the behavior data of both A and B to
construct the behavior of some hypothetical entity AB, whose score, Sr(AB), is also
computed.
The next step is to check if this merge results in a more recognizable behavior
compared to either of the two individual behaviors. Hence, the overall matching score
should depend on the gain achieved for the recognition scores. More precisely, it can
be stated as follows:
S_m(A,B) = ( n_A [S_r(AB) − S_r(A)] + n_B [S_r(AB) − S_r(B)] ) / (n_A + n_B)    (4.1)
where n_A and n_B are the total numbers of transactions in the BMs of A and B,
respectively. Note that the gains corresponding to the two entities are weighted
based on the density of their respective BMs.
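Eq. 4.1 is straightforward to implement; a minimal sketch with hypothetical recognition scores and transaction counts:

```python
def matching_score(sr_A, sr_B, sr_AB, n_A, n_B):
    """Merge-gain matching score of Eq. 4.1: the recognition gains of the
    merged entity AB over A and over B, weighted by transaction counts."""
    gain_A = sr_AB - sr_A
    gain_B = sr_AB - sr_B
    return (n_A * gain_A + n_B * gain_B) / (n_A + n_B)

# Hypothetical values: merging makes the behavior more recognizable,
# and B's larger log weights its gain more heavily.
print(matching_score(sr_A=0.6, sr_B=0.5, sr_AB=0.8, n_A=10, n_B=30))
# → 0.275
```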
Example 4.2.3 To better understand the intuition behind the behavior merge strat-
egy, we assume that entities A and C are from Source 1 and B is from Source 2 and
their processed log is shown in table “Processed Log” in Figure 4.2(a). To find the
best match for entity B, we first merge it with A, and then do the same with C. It is
apparent from the resulting BMs in Figure 4.2(b) that A is potentially a good match
for B; entity AB is likely to be an entity that buys chocolate every 2 or 3 days and
prefers 2 liters of Coca Cola with either 2 Twix bars or 4 Snickers bars. However,
it is hard to discern any consistent behavior for entity BC. Of course, in a real
scenario we will deal with many more actions.
The key question now is: How to compute Sr(A)? In fact, the goal of the recog-
nition score, Sr, is to capture the consistency of an entity’s behavior along three
main components: (1) consistency in repeating actions, (2) stability in the features
describing the action, and (3) the association between actions. These three compo-
nents, which will be explained shortly, correspond to three score components of Sr;
i.e., Sr1, Sr2, and Sr3. Hence, we compute Sr(A) as their geometric mean as given
below.
S_r(A) = ∛( S_r1(A) × S_r2(A) × S_r3(A) )    (4.2)
The three behavior components we just mentioned, and which we would like to
capture in Sr, can be explained as follows.
1- Consistency in repeating actions: Entities tend to repeat specific actions
on a regular basis following almost consistent inter-arrival times. For example, a user
(entity) of a news web site may be checking the financial news (action) every morning
(pattern).
2- Stability in the features describing actions: When an entity performs an
action several times, almost the same features are expected to apply each time. For
example, when a customer buys chocolate, s/he mostly buys either 2 Twix bars or 1
Snickers bar, as opposed to buying a different type of chocolate each time and in
completely different quantities. The latter case is unlikely to occur in real scenarios.
3- Association between actions: Actions performed by entities are typically
associated, and the association patterns can be detected over time. For example,
a customer may be used to buying Twix chocolate and Pepsi cola every Sunday
afternoon, which implies an association between these two actions.
The distinction between the matching techniques that we will describe next lies
in the method used to compute S_r1, S_r2, and S_r3. The candidate generation phase is
a special case, as it only considers the first behavior component; i.e., S_r(A) = S_r1(A).
The matching strategy we have described so far can be referred to as the behavior
merge strategy, since it relies essentially on merging the entities’ behaviors and then
measuring the realized gain. This is to be contrasted to an alternative strategy, which
can be referred to as the behavior similarity strategy, where the matching score
can simply be a measure of the similarity between the two behaviors.
We will show in Section 4.4 how the behavior similarity strategy can be imple-
mented in the context of the statistical modeling technique for matching. In Section
4.5, we will experimentally show the superiority of the merge strategy over the simi-
larity strategy in all the scenarios we considered.
4.3 Candidate Generation Phase
To avoid examining all possible pairs of entities during the expensive phase of
accurate matching, we introduce a candidate generation phase, which quickly deter-
mines pairs of entities that are likely to be matched. This phase results in almost no
false negatives, at the expense of relatively low precision.
The high efficiency of this phase is primarily attributed to the use of a very
compact (yet lossy) behavior representation, which allows for fast computations. In
addition, only the first behavior component; i.e., consistency in repeating actions,
which is captured by Sr1, is considered in this phase. Note that because the other
components are ignored, binary BMs are used with 1’s replacing non-zero values.
Each row in the BM , which corresponds to an action, is considered as a binary
time sequence. For each such sequence, we compute the first element of its Discrete
Fourier Transform (DFT) [77], which is a 2-dimensional complex number. The
complex number corresponding to an action ai in the BM of an entity A is computed
by:
C_A^{(a_i)} = Σ_{j=0}^{m−1} BM_{i,j} e^{2πj√−1 / m}    (4.3)
An interesting aspect of this transformation is that the lower the magnitude of the
complex number the more consistent and regular the time sequence, and vice versa.
This can be explained as follows. Consider each of the elements in the time series
as a vector whose magnitude is either 0 or 1, and that their angles are uniformly
distributed along the unit circle (i.e., the angle of the jth vector is 2jπ/m). The complex
number will then be the resultant of all these vectors. Now, if the time series was
consistent in terms of the inter-arrival times between the non-zero values, then their
corresponding vectors would be uniformly distributed along the unit circle, and hence
they would cancel each other out. Thus, the resultant’s magnitude will be close to
zero.
Another interesting aspect is that merging the two rows corresponding to an action
a in the BMs of two entities A and B effectively reduces to adding two complex
numbers, i.e., C_AB^{(a)} = C_A^{(a)} + C_B^{(a)}.
The following example shows how the candidate generation phase can distinguish
between “match” and “mismatch” candidates.
Example 4.3.1 Consider the example described in Figure 4.3. Let aA, aB and aC
be the rows of action a (chocolate) in the binary BMs of entities A, B and C from
Figure 4.2(b). At the left of Figure 4.3, when merging aA and aB, the magnitude
corresponding to the merged action, aAB equals 0.19, which is smaller than the original
magnitudes: 1.38 for aA and 1.53 for aB. The reduction in magnitude is because the
sequence aAB is more regular than either of aA and aB.
At the right of Figure 4.3, we apply the same process for aB and aC. The magni-
tudes we obtain are 2.03 for aBC, 1.54 for aB, and 0.09 for aC. In this case, merging
aB and aC resulted in an increase in magnitude because the sequence aBC looks less
regular than either of aB and aC.
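The effect described in this example can be reproduced directly from Eq. 4.3. The sketch below (function name ours) computes the first DFT coefficient of the binary chocolate rows of A and B and merges them by complex addition; the resulting magnitudes agree with those reported above up to rounding:

```python
import cmath

def first_dft_coefficient(binary_row):
    """Eq. 4.3: each 1 at position t contributes a unit vector at angle
    2*pi*t/m on the unit circle; regular sequences largely cancel out."""
    m = len(binary_row)
    return sum(b * cmath.exp(2j * cmath.pi * t / m)
               for t, b in enumerate(binary_row))

# Binary chocolate rows of A and B (non-zeros of Figure 4.2(b) set to 1).
a_A = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
a_B = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
C_A = first_dft_coefficient(a_A)
C_B = first_dft_coefficient(a_B)
C_AB = C_A + C_B                 # merging rows = adding complex numbers
print(round(abs(C_A), 2), round(abs(C_B), 2), round(abs(C_AB), 2))
# magnitudes ≈ 1.38, 1.54, 0.2: the merged sequence is more regular
```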
Based on the above discussion, we can compute a recognition score, S_r(a_A), for
each individual action a that belongs to entity A, such that it is inversely proportional
to the magnitude of the complex number C_A^{(a)}. In particular, S_r(a_A) = M − mag(C_A^{(a)}),
where mag(C_A^{(a)}) is the magnitude of C_A^{(a)} and M is the maximum computed magnitude.
Fig. 4.3.: Action patterns in the complex plane and the effect on the magnitude.
To compute the overall S_r(A), we average the individual scores, S_r(a_A), each
weighted by the number of times its respective action was repeated (n_A^{(a)}). The
formula for S_r(A) is thus given as follows:

S_r(A) = (1/n_A) Σ_{∀a} n_A^{(a)} · S_r(a_A)    (4.4)
Standard SQL To Compute Candidate Matches: In the following, we pro-
vide a derivation for a final formulation of the matching score in the candidate gener-
ation matching phase. At the end, we provide the corresponding SQL statement we
used for this computation.
After computing the complex-number representation for each action of an entity,
we compute S_r(a_A) = M − mag(C_A^{(a)}), where M is the maximum computed
magnitude. Then, we obtain

S_r(A) = (1/n_A) Σ_{∀a} n_A^{(a)} (M − mag(C_A^{(a)}))    (4.5)
By substituting Eq. 4.5 into Eq. 4.1, we obtain the matching score S_m(A,B):

S_m(A,B) = [n_A / (n_A + n_B)] · [ (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})(M − mag(C_AB^{(a)}))
                                   − (1/n_A) Σ_{∀a} n_A^{(a)} (M − mag(C_A^{(a)})) ]
         + [n_B / (n_A + n_B)] · [ (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})(M − mag(C_AB^{(a)}))
                                   − (1/n_B) Σ_{∀a} n_B^{(a)} (M − mag(C_B^{(a)})) ]
By a simple rearrangement to collect the terms related to mag(C_AB^{(a)}), we get

S_m(A,B) = (1/(n_A + n_B)) Σ_{∀a} [ (n_A^{(a)} + n_B^{(a)}) M − (n_A^{(a)} + n_B^{(a)}) mag(C_AB^{(a)})
                                    − n_A^{(a)} M + n_A^{(a)} mag(C_A^{(a)})
                                    − n_B^{(a)} M + n_B^{(a)} mag(C_B^{(a)}) ]
Note that the terms involving M cancel out, and the final matching score becomes

S_m(A,B) = (1/(n_A + n_B)) Σ_{∀a} [ n_A^{(a)} mag(C_A^{(a)}) + n_B^{(a)} mag(C_B^{(a)})
                                    − (n_A^{(a)} + n_B^{(a)}) mag(C_AB^{(a)}) ]    (4.6)
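As a sanity check on this derivation, the sketch below (with made-up support counts and complex coefficients; the `acts` layout is ours) computes S_m both ways, by plugging Eq. 4.5 into Eq. 4.1 and directly via Eq. 4.6, and the two agree because the M terms cancel:

```python
import random

random.seed(1)
# Hypothetical per-action data for entities A and B: each action maps to
# (n_A^a, C_A^a, n_B^a, C_B^a) -- support counts and first DFT coefficients.
acts = {a: (random.randint(1, 5), complex(random.random(), random.random()),
            random.randint(1, 5), complex(random.random(), random.random()))
        for a in ("chocolate", "cola", "chips")}

n_A = sum(na for na, _, _, _ in acts.values())
n_B = sum(nb for _, _, nb, _ in acts.values())
M = max(m for na, cA, nb, cB in acts.values()
        for m in (abs(cA), abs(cB), abs(cA + cB)))

def s_r(n_tot, items):                       # Eq. 4.5
    return sum(n * (M - abs(c)) for n, c in items) / n_tot

sr_A = s_r(n_A, [(na, cA) for na, cA, _, _ in acts.values()])
sr_B = s_r(n_B, [(nb, cB) for _, _, nb, cB in acts.values()])
sr_AB = s_r(n_A + n_B,
            [(na + nb, cA + cB) for na, cA, nb, cB in acts.values()])

sm_41 = (n_A * (sr_AB - sr_A) + n_B * (sr_AB - sr_B)) / (n_A + n_B)  # Eq. 4.1
sm_46 = sum(na * abs(cA) + nb * abs(cB) - (na + nb) * abs(cA + cB)
            for na, cA, nb, cB in acts.values()) / (n_A + n_B)       # Eq. 4.6
print(sm_41, sm_46)   # identical up to floating-point error
```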
We store the complex-number information for each data source in a relation with
the attributes (entity, action, Re, Im, mag, a_supp, e_supp), with one tuple per
action of each entity. For each action of an entity, we store the real and imaginary
components of the complex number (Re and Im) in addition to the magnitude value
(mag). a_supp is the number of transactions for that action within the entity's log,
and e_supp is the total number of transactions for the entity, repeated with each
tuple corresponding to an action. Thus, there are two tables, src1 and src2,
representing the two data sources.
To generate the candidates, we need to compute Eq. 4.6 for each pair of entities
and filter the result using a threshold t_c on the resulting matching score. The
following SQL applies this computation and returns the candidate matches; the
per-action terms are summed per entity pair, and the threshold is applied to the
aggregated score:

select e1, e2, gain_score
from (
  select
    c1.entity as e1,
    c2.entity as e2,
    sum( c1.a_supp * c1.mag                  -- n^a_A * mag^a_A
       + c2.a_supp * c2.mag                  -- n^a_B * mag^a_B
       - (c1.a_supp + c2.a_supp)             -- n^a_AB *
         * sqrt( (c1.Re + c2.Re) * (c1.Re + c2.Re)   -- mag^a_AB
               + (c1.Im + c2.Im) * (c1.Im + c2.Im) )
    ) / (c1.e_supp + c2.e_supp)              -- n_A + n_B
    as gain_score
  from src1 c1 inner join src2 c2
    on c1.action = c2.action
  group by c1.entity, c2.entity, c1.e_supp, c2.e_supp
) scores
where gain_score > t_c
4.4 Accurate Matching Phase
4.4.1 Statistical Modeling Technique
Building the Statistical Model
Our goal is to build a statistical model for the behavior of an entity given its
observed actions. The two key variables defining an entity’s behavior with respect
to a specific action are (1) the inter-arrival time between the action occurrences,
and (2) the feature id associated with each occurrence, which represents the features
describing how the action was performed at that time, or in other words it reflects
the entity’s preferences when performing this action.
In general, we expect that a given entity will be biased to a narrow set of inter-
arrival times and feature ids which is what will distinguish the entity’s behavior.
In merging two behavior matrices for the same entity, the bias should be enforced
and made clearer. However, when the behavior matrices of two different entities
are merged, the bias will instead be weakened and made harder to recognize. The
statistical model that we build should enable us to measure these properties.
Our problem is similar to classifying a biological sequence as being a motif, i.e., a
sequence that mostly contains a recognized repeated pattern, or not. A key objective
in computational biology is to be able to discover motifs by separating them from
some background sequence that is mostly random. In our case, a motif corresponds
to a sequence of an action by the same entity. In view of this analogy, our statistical
modeling will have the same spirit as the methods commonly used in computational
biology. However our model has to fit the specifics of our problem which are as follows:
(a) sequences are of two variables (inter-arrivals and feature id), rather than just one
variable (DNA character) and (b) for ordinal variables (such as the inter-arrival time),
neighboring values need to be treated similarly.
Modeling the Behavior for an Action: We model the behavior of an entity A
with respect to a specific action a using a finite mixture model M = {M_1, . . . , M_K},
with mixing coefficients λ^{(a_A)} = {λ_1^{(a_A)}, . . . , λ_K^{(a_A)}}, where M_k is its kth component.
Each component M_k is associated with two random variables: (i) the inter-arrival
time, which is generated from a uniform distribution over a range of inter-arrival times,
r_k = [start_k, end_k]^1; and (ii) the feature id, a discrete variable modeled
using a multinomial distribution with parameter θ_k^{(a_A)} = {f_{k1}^{(a_A)}, . . . , f_{kL}^{(a_A)}}, where L
is the number of all possible feature ids, and f_{kj}^{(a_A)} is the probability of describing an
occurrence of action a using feature F_j, j = 1, . . . , L. In what follows, we omit the
superscript a_A and assume that there is only one action in the system to simplify the
notation.

^1 The range size of r_k is user-configurable, as it depends on the application and on which values are considered close. In our experiments with retail store data from Walmart, we generated ranges by sliding a window of size 5 days with a step of 3 days over the time period (i.e., {[1,6], [4,9], [7,12], . . . }).
What we described so far is essentially a generative model in the sense that once
built, we can use it to generate new action occurrences for a given entity. For example,
using λ, we can select the component Mk to generate the next action occurrence,
which should occur after an inter-arrival time picked from the corresponding range
rk = [startk, endk] and we can describe the action by selecting a feature id using θk.
However, we do not use the model for this purpose. Instead, we use its estimated
parameters (λ and the vectors θk) to determine the level of recognizing repeated
patterns in the sequence corresponding to the action occurrences.
For the estimation of the model parameters, we use the Expectation-Maximization
(EM) algorithm to fit the mixture model for each specific action a of an entity A to
discover the optimal parameter values which maximize the likelihood function of the
observed behavior data.
Before we present the algorithm and the derivations used to estimate the model
parameters, we show an example of a behavior and the properties we desire for its
corresponding model parameters. We demonstrate the challenge in finding those
desired parameters, which we address by using the EM algorithm. We also show
through the example how we choose the initial parameter values required for the EM
algorithm.
Example 4.4.1 Consider that a customer’s behavior with respect
to the action of buying chocolate is represented by the sequence
{{6, s}, {15, l}, {6, s}, {8, s}, {15, l}, {14, l}, {13, l}}, where s denotes a small
quantity (e.g., 1-5 bars), and l denotes a large quantity (e.g., more than 5 bars). So
s/he bought a small quantity of chocolate after 6 days, a large quantity after 15 days,
and so on.
To characterize the inter-arrival times preferred by this customer, the best ranges
of size 2 to use are [6, 8] and [13, 15]. Their associated mixing coefficients (λk) should
be 3/7 and 4/7, because the two ranges cover 3 and 4, respectively, of the 7 observed
data points.
However, since in general, the best ranges in a behavior sequence will not be as
clear as in this case, we need to systematically consider all the ranges of a given size
(2 in this case), and assign mixing coefficients to each of them. The possible ranges
for our example would be {[6, 8], [7, 9], [8, 10], . . . , [13, 15]}.
A straightforward approach to compute λk for each range is to compute the nor-
malized frequency of occurrence of the given range for all the observed data points.
For instance, the normalized frequencies for the ranges [6, 8], [12, 14], and [13, 15] are
3/12, 2/12, and 4/12 (or 1/4, 1/6, and 1/3), respectively, where 12 is the sum of frequencies for all
possible ranges. (Note that the same inter-arrival time may fall in multiple overlap-
ping ranges.) Clearly, these are not the desired values for λk. We would rather have
zero values for all ranges other than [6, 8] and [13, 15]. However, we still use these
normalized frequencies as the initial values for λk to be fed into the EM algorithm.
Similarly, to compute the initial values for the θk probabilities, we first consider
the data points covered by the range corresponding to component Mk only. Then, for
each possible value of the feature id, we compute its normalized frequency across these
data points. Clearly, in our example, the customer favors buying small quantities
when s/he shops at short intervals (6-8 days apart), and large quantities when s/he
shops at longer intervals (13-15 days apart).
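The initialization described in this example can be sketched as follows (variable names ours; ranges of size 2 enumerated with step 1, as listed in the example):

```python
from collections import Counter

# The chocolate-buying sequence of Example 4.4.1: (inter-arrival, feature).
X = [(6, "s"), (15, "l"), (6, "s"), (8, "s"),
     (15, "l"), (14, "l"), (13, "l")]
ranges = [(s, s + 2) for s in range(6, 14)]    # [6,8], [7,9], ..., [13,15]

# Initial mixing coefficients: normalized range frequencies (gamma = 1).
counts = {r: sum(1 for v, _ in X if r[0] <= v <= r[1]) for r in ranges}
total = sum(counts.values())                   # 12 for this sequence
lam0 = {r: c / total for r, c in counts.items()}

# Initial theta_k: feature frequencies among the points each range covers.
theta0 = {}
for r in ranges:
    feats = Counter(f for v, f in X if r[0] <= v <= r[1])
    n = sum(feats.values())
    theta0[r] = {f: c / n for f, c in feats.items()} if n else {}

print(lam0[(6, 8)], lam0[(13, 15)])    # 3/12 and 4/12, as in the example
print(theta0[(6, 8)], theta0[(13, 15)])
```

As the example notes, short intervals come with small quantities and long intervals with large ones, so theta0 for [6, 8] puts all its mass on "s" and theta0 for [13, 15] on "l".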
Behavior Model Parameters Estimation
In the following, we use the expectation-maximization (EM) algorithm to fit the
finite mixture model to a given action sequence representing its occurrences, and to
discover the parameter values of the overall model discussed earlier in this
section. To simplify the notation, we assume there is only one action in the system,
so we omit the superscripts that link the entity and action names.
As we mentioned earlier, we use EM to discover the parameter values of the
mixture model that maximize the likelihood of the data. EM uses the concept of
missing data and follows an iterative procedure to find values for λ and θ which
maximize the likelihood of the data given the model. In our case, the missing data
is the knowledge of which components produced X = {{v_1, F(v_1)}, . . . , {v_N, F(v_N)}}.
A finite mixture model assumes that the sequence X arises from two or more
components with different, unknown parameters. Once we obtain these parameters,
we use them to compute the behavior scores along each of the three behavior
components.
Let us now introduce a K-dimensional binary random variable Z with a 1-of-K
representation, in which a particular z_k is equal to 1 and all other elements are equal
to 0, i.e., z_k ∈ {0, 1} and Σ_{k=1}^K z_k = 1, such that the probability p(z_k = 1) = λ_k.
Every entry X_i in the sequence will be assigned a Z_i = {z_i1, z_i2, . . . , z_iK}. We can
easily show that the probability

p(X_i | θ_1, . . . , θ_K) = Σ_{k=1}^K p(z_ik = 1) p(X_i | Z_i, θ_1, . . . , θ_K)
                        = Σ_{k=1}^K λ_k p(X_i | θ_k)
Since we do not know z_ik, we consider the conditional probability γ(z_ik) of z_ik
given X_i, p(z_ik = 1 | X_i), which can be found using Bayes' theorem [78]:

γ(z_ik) = p(z_ik = 1) p(X_i | z_ik = 1) / Σ_{k=1}^K p(z_ik = 1) p(X_i | z_ik = 1)
        = λ_k p(X_i | θ_k) / Σ_{k=1}^K λ_k p(X_i | θ_k)    (4.7)
We shall view λ_k as the prior probability of z_ik = 1, and γ(z_ik) as the corresponding
posterior probability once we have observed X. γ(z_ik) can also be viewed as the
responsibility that component M_k takes for explaining the observation X_i. Therefore, the likelihood
or probability of the data given the parameters can be written in log form as:

ln p(X | λ, θ) = Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln [λ_k p(X_i | θ_k)]
              = Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln p(X_i | θ_k) + Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln λ_k    (4.8)
The EM algorithm monotonically increases the log likelihood of the data until
convergence by iteratively computing the expected log likelihood of the complete
data (X, Z) in the E-step and maximizing this expected log likelihood over the model
parameters λ and θ in the M-step. We first choose some initial values for the parameters,
λ^{(0)} and θ^{(0)}. Then, we alternate between the E-step and M-step of the algorithm until it
converges.
In the E-step, to compute the expected log likelihood of the complete data, we
need to calculate the required conditional distribution γ^{(0)}(z_ik). We plug λ^{(0)} and
θ^{(0)} into Eq. 4.7 to get γ^{(0)}(z_ik), where we can compute p(X_i | θ_k) as follows:
p(X_i | θ_k) = Π_{j=1}^L f_kj^{I(j, k, F(v_i))}    (4.9)

where X_i = {v_i, F(v_i)} and I(j, k, F(v_i)) is an indicator function equal to 1 if v_i ∈ r_k
and F(v_i) = F_j; otherwise it is 0.
Recall that r_k = [start_k, end_k] is the range identifying the component M_k.
The M-step of EM maximizes Eq. 4.8 over λ and θ in order to re-estimate new
values λ^{(1)} and θ^{(1)}. The maximization over λ involves only the second term in
Eq. 4.8; argmax_λ Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln λ_k has the solution

λ_k^{(1)} = (1/N) Σ_{i=1}^N γ^{(0)}(z_ik),   k = 1, . . . , K.    (4.10)
We can maximize over θ by maximizing the first term in Eq. 4.8 separately over each
θ_k for k = 1, . . . , K. Computing argmax_θ E[ln p(X, Z | θ_1, . . . , θ_K)] is equivalent to
maximizing the right-hand side of Eq. 4.11 over θ_k (only a piece of the parameter)
for every k:

θ_k = argmax_{θ_k} Σ_{i=1}^N γ^{(0)}(z_ik) ln p(X_i | θ_k)    (4.11)
To do this, for k = 1, . . . , K and j = 1, . . . , L, let

c_kj = Σ_{i=1}^N γ^{(0)}(z_ik) I(j, k, F(v_i))    (4.12)
Then c_kj is in fact the expected number of times the action is described by F_j when
its inter-arrival time falls in M_k's range r_k. We re-estimate θ_k by substituting Eq. 4.9
into Eq. 4.11 to get

θ_k^{(1)} = {f_k1, . . . , f_kL} = argmax_{θ_k} Σ_{j=1}^L c_kj ln f_kj    (4.13)

Therefore,   f_kj = c_kj / Σ_{j′=1}^L c_kj′    (4.14)
To find the initial parameters λ^{(0)} and θ^{(0)}, we scan the sequence X once and use
Eq. 4.12 to get c_kj by setting all γ^{(0)} = 1. Afterward, we use Eq. 4.14 to get θ_k^{(0)} and
compute

λ_k^{(0)} = Σ_{j=1}^L c_kj / Σ_{k=1}^K Σ_{j=1}^L c_kj
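A minimal EM sketch under the assumptions above (uniform-range components, multinomial feature distributions, γ^{(0)} = 1 initialization); structure and names are ours, not the thesis implementation. On the sequence of Example 4.4.1 it concentrates the mixing coefficients on the ranges [6, 8] and [13, 15], at 3/7 and 4/7:

```python
X = [(6, "s"), (15, "l"), (6, "s"), (8, "s"),
     (15, "l"), (14, "l"), (13, "l")]
ranges = [(s, s + 2) for s in range(6, 14)]   # component ranges r_k
feats = sorted({f for _, f in X})             # the L possible feature ids
K, N = len(ranges), len(X)

def in_range(k, v):
    lo, hi = ranges[k]
    return lo <= v <= hi

def p_x(k, x, theta):                         # Eq. 4.9
    v, f = x
    return theta[k].get(f, 0.0) if in_range(k, v) else 0.0

# Initialization: Eq. 4.12 with gamma = 1, then Eq. 4.14 and lambda^(0).
c = [{f: sum(1 for v, g in X if in_range(k, v) and g == f) for f in feats}
     for k in range(K)]
theta = []
for k in range(K):
    tot = sum(c[k].values())
    theta.append({f: (c[k][f] / tot if tot else 1.0 / len(feats))
                  for f in feats})
lam = [sum(c[k].values()) / sum(sum(ck.values()) for ck in c)
       for k in range(K)]

for _ in range(50):                           # EM iterations
    # E-step: responsibilities gamma(z_ik), Eq. 4.7.
    gamma = []
    for x in X:
        w = [lam[k] * p_x(k, x, theta) for k in range(K)]
        z = sum(w)
        gamma.append([wk / z for wk in w])
    # M-step: Eq. 4.10 for lambda; Eqs. 4.12-4.14 for theta.
    lam = [sum(g[k] for g in gamma) / N for k in range(K)]
    for k in range(K):
        ck = {f: sum(g[k] for g, (v, h) in zip(gamma, X)
                     if in_range(k, v) and h == f) for f in feats}
        tot = sum(ck.values())
        if tot > 0:
            theta[k] = {f: ck[f] / tot for f in feats}

print({r: round(l, 3) for r, l in zip(ranges, lam) if l > 1e-6})
```

The initial normalized frequencies spread mass over overlapping ranges, and EM's iterations shift it to the two ranges that explain the data best, exactly the values the example identified as desirable.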
Computing Matching Scores
To this point, we have defined a model and estimated its parameters λ
and θ, which can be used to re-generate the sequence {X_1, . . . , X_N} that represents
the occurrences of the action. Recall that our aim is to match two entities A and B by
computing the gain Sm(A,B) in recognizing a behavior after merging A and B using
Eq. 4.1. This requires computing the scores Sr(A), Sr(B) and Sr(AB) using Eq. 4.2,
which in turn requires computing the behavior recognition scores corresponding to
the three behavior components, which, for entity A for example, are Sr1(A), Sr2(A),
and Sr3(A).
For the first behavior component, the consistency in repeating an action a is
equivalent to classifying its sequence as a motif. We quantify the pattern strength to
be inversely proportional to the uncertainty about selecting a model component using
λ^{(a_A)}; i.e., action a's sequence is a motif if the uncertainty about λ^{(a_A)} is low. Thus,
we can use the entropy to compute S_r1(a_A) = log K − H(λ^{(a_A)}), where H(λ^{(a_A)}) =
−Σ_{k=1}^K λ_k^{(a_A)} log λ_k^{(a_A)}. The overall score S_r1(A) is then computed by a weighted
sum over all the actions according to their support, i.e., the number of times each
action was repeated:

S_r1(A) = (1/n_A) Σ_{∀a} n_A^{(a)} · S_r1(a_A)    (4.15)
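The entropy-based score can be sketched as follows (function names and the per-entity data layout are ours):

```python
import math

def s_r1_action(lam):
    """S_r1(a_A) = log K - H(lambda): low entropy over the mixing
    coefficients means a strongly recognizable repetition pattern."""
    K = len(lam)
    H = -sum(p * math.log(p) for p in lam if p > 0)
    return math.log(K) - H

def s_r1(entity):
    """Eq. 4.15: support-weighted average over the entity's actions.
    `entity` maps each action to (n_a, lambda)."""
    n = sum(na for na, _ in entity.values())
    return sum(na * s_r1_action(l) for na, l in entity.values()) / n

print(s_r1_action([0.9, 0.05, 0.05, 0.0]))   # high: mass on one component
print(s_r1_action([0.25, 0.25, 0.25, 0.25])) # ~0.0: no recognizable pattern
```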
For the second behavior component, the stability in describing the action (action
features) is more recognizable when the uncertainty in picking the feature id
values is low. The behavior score along this component can be evaluated by first
computing θ′^{(a_A)} = {f′_1^{(a_A)}, . . . , f′_L^{(a_A)}}, the overall parameter for picking a feature
id value for action a using the multinomial distribution, such that the overall
probability for entity A to describe its action a by feature F_j is f′_j^{(a_A)}. Here,
f′_j^{(a_A)} = Σ_{k=1}^K λ_k^{(a_A)} f_kj^{(a_A)}, combined from all K components for j = 1, . . . , L,
knowing that θ_k^{(a_A)} = {f_{k1}^{(a_A)}, . . . , f_{kL}^{(a_A)}}. Using the entropy of θ′^{(a_A)}, we compute
S_r2(a_A) = log L − H(θ′^{(a_A)}), where H(θ′^{(a_A)}) = −Σ_{j=1}^L f′_j^{(a_A)} log f′_j^{(a_A)}. Similar to
Eq. 4.15, we compute the overall score S_r2(A) as the weighted sum of S_r2(a_A)
according to the actions' support.
For the third component, we look for evidence of associations between
actions. We estimate, for every pair of actions, the probability of being generated from
components with the same inter-arrival ranges. The association between actions can
be recognized when they occur close to each other in time; in other words, when
both of them tend to prefer the same model components to generate their
sequences. Consequently, the score for the third component can be computed over
all possible pairs of actions (a, b) of the same entity as follows:

S_r3(A) = Σ_{∀a,b} Σ_{k=1}^K λ_k^{(a_A)} λ_k^{(b_A)}
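The remaining two score components can be sketched in the same spirit (function names ours; for S_r3 we sum over unordered action pairs, which is one reading of Σ_{∀a,b}):

```python
import math

def s_r2_action(lam, theta):
    """S_r2(a_A) = log L - H(theta'), where theta'_j = sum_k lam_k * f_kj
    is the overall probability of describing the action with feature F_j."""
    L = len(theta[0])
    tp = [sum(lam[k] * theta[k][j] for k in range(len(lam)))
          for j in range(L)]
    H = -sum(p * math.log(p) for p in tp if p > 0)
    return math.log(L) - H

def s_r3(lams):
    """S_r3(A): for each unordered pair of actions, the probability of
    preferring the same components; `lams` maps action -> lambda vector."""
    acts = list(lams)
    return sum(sum(la * lb for la, lb in zip(lams[a], lams[b]))
               for i, a in enumerate(acts) for b in acts[i + 1:])

# Each component fixes a different feature, so the overall feature choice
# is maximally uncertain and S_r2 is ~0.
print(s_r2_action([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
print(s_r3({"chocolate": [0.9, 0.1], "cola": [0.8, 0.2]}))
```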
Computing Behavior Similarity Score:
The similarity between two behaviors can be simply quantified by the closeness
between the parameters of their corresponding behavior models according to the
Euclidean distance. For two entities A and B, we compute the behavior similarity as
follows:

BSim(A,B) = 1 − (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})
            √( Σ_{k=1}^K [ (λ_k^{(a_A)} − λ_k^{(a_B)})² + Σ_{j=1}^L (λ_k^{(a_A)} f_kj^{(a_A)} − λ_k^{(a_B)} f_kj^{(a_B)})² ] )
Note that this method is preferred over directly comparing the BMs of the entities,
since the latter method would require some sort of alignment for the time dimension
of the BMs. In particular, deciding which cells to compare to which cells is not
obvious.
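A sketch of BSim under an assumed representation of the per-action models (each action maps to its support, λ, and θ; names ours):

```python
import math

def bsim(A, B):
    """Behavior similarity: 1 minus the support-weighted Euclidean distance
    between the two entities' model parameters, per shared action."""
    nA = sum(v[0] for v in A.values())
    nB = sum(v[0] for v in B.values())
    total = 0.0
    for a in A.keys() & B.keys():
        na, lamA, thA = A[a]
        nb, lamB, thB = B[a]
        d = math.sqrt(sum(
            (lamA[k] - lamB[k]) ** 2
            + sum((lamA[k] * thA[k][j] - lamB[k] * thB[k][j]) ** 2
                  for j in range(len(thA[k])))
            for k in range(len(lamA))))
        total += (na + nb) * d
    return 1 - total / (nA + nB)

# Identical models give the maximal similarity of 1.
m = (5, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
print(bsim({"chocolate": m}, {"chocolate": m}))
```

Comparing in parameter space sidesteps the time-alignment problem the text mentions: only λ and θ are compared, never raw BM cells.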
4.4.2 Information Theoretic technique (Compressibility)
We here present an information theory-based technique for the computation of
the matching scores. It is not as accurate as the motif-based technique presented in
Section 4.4.1, but it is more computationally efficient. The underlying idea stems
from observing that, if we represent the BM as an image, we will see repeated
horizontal blocks, which become more pronounced as the behavior becomes more
recognizable. The repeated blocks appear because of the repetition in the behavior
patterns. Therefore, we expect more regularity along the rows than along the columns
of the BM. In fact, the order of values in any of the columns depends on the order of
the actions in the BM, which is not expected to follow any recognizable pattern. For
these reasons, we compress the BM on a row-by-row basis, rather than compressing
the entire matrix as a whole.
Most existing compression techniques exploit data repetition and encode it in a
more compact representation. We thus introduce compressibility as a measure of
confidence in recognizing behaviors. In our experiments, we compress the BM with the
DCT compression technique [79], one of the most commonly used compression
techniques in practice. We then use the compression ratios to compute the behavior
recognition scores. Significantly higher compression ratios imply a more recognizable
behavior.
Given the sequence representation of an action's occurrences, i.e., \{\langle v_j, F(v_j)\rangle\}, if an
entity is stable in repeating an action, the values v_j will exhibit a certain level
of correlation reflecting the action rate. Moreover, the feature values F(v_j) will contain
similar values describing how the action was performed. To compress
an action sequence, we follow the same approach used in JPEG [80], applied to a
one-dimensional sequence.
Our aim is to compute the three behavior recognition scores along the three behavior
components (see Section 4.2.4). For the first behavior component, we compress
the sequence \{v_1, \ldots, v_{n_A^{(a)}}\}, which represents the inter-arrival times of each action
a. The behavior score S_{r1}(a_A) for action a of entity A is the resulting compression
ratio; the higher the compression ratio, the more we can recognize a consistent
inter-arrival time (motif). We then use Eq. 4.15 to compute the overall score
S_{r1}(A). Similarly, for the second behavior component, we compress the sequence
\{F(v_1), \ldots, F(v_{n_A^{(a)}})\}, which represents the feature values that describe action a.
Again, the score S_{r2}(a_A) is the produced compression ratio; the higher the compression
ratio, the more we can recognize stability in the action features. From the per-action
scores S_{r2}(a_A), we compute the overall score S_{r2}(A) in the same way.
Finally, for the third behavior component, which evaluates the relationship between
the actions, we compress the concatenated sequences of inter-arrival times of
every possible pair of actions. Given two actions a and b, we concatenate and then
compress their inter-arrival times to get the compression ratio cr_{a,b}. If a and b are
closely related, they will have similar inter-arrival times, which allows for better
compressibility of the concatenated sequence. On the contrary, if they are not
related, the concatenated sequence will contain varying values. Thus, cr_{a,b} quantifies
the association between actions a and b. Hence, the overall pairwise association,
which is evidence for the strength of the relationship between the actions, can be
computed by:

S_{r3}^{(A)} = \sum_{\forall a,b} cr_{a,b}
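The compression-ratio scores can be sketched as follows; zlib's DEFLATE is used here purely as a stand-in for the DCT-based compression the thesis actually employs, and the function names are illustrative:

```python
import zlib

def compression_ratio(seq):
    """Score for one sequence (e.g., an action's inter-arrival times):
    raw size divided by compressed size. Repetitive, stable sequences
    compress well and therefore score high."""
    raw = ",".join(str(v) for v in seq).encode()
    return len(raw) / len(zlib.compress(raw))

def pairwise_association(seq_a, seq_b):
    """cr_{a,b}: compression ratio of the concatenation of two actions'
    inter-arrival sequences; related actions with similar inter-arrival
    times yield a higher ratio."""
    return compression_ratio(list(seq_a) + list(seq_b))
```

A perfectly regular sequence compresses far better than an irregular one of the same length, which is exactly the signal the recognition scores rely on.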
4.5 Experiments
The goals of our experimental study are:
• Evaluate the overall linkage quality of our behavior based record linkage for
various situations of splitting the entities’ transactions in the log.
• Demonstrate the quality improvement when our technique is combined with a
textual based record linkage technique.
• Study the performance and scalability, as well as the effect of the candidate
generation phase on the overall performance. We also include in our
evaluation the compressibility technique, which is discussed in Section 4.4.2.
In the following, we will refer to the statistical model technique as motif.
To the best of our knowledge, this is the first approach to leverage entity behavior
for record linkage. Consequently, there is no other technique to directly compare to.
Instead, we show how our technique can be combined with a textual record linkage
technique.
Dataset: We use a real-world transaction log from Walmart, which covers
many similar scenarios in the retail industry. These transactions cover a period of
16 months. An entry in the log represents an item that has been bought at a given
time. Figure 4.2(a) shows a typical example of this log. We use the first-level item
grouping as the actions, which are described by the quantity feature. This feature was
discretized into {very low, low, medium, high, very high} for each individual item (a high
quantity of oranges is different from a high quantity of milk gallons).
111
Setup and Parameters: To simulate the existence of two data sources whose
customers (entities) need to be linked, a given entity’s log is partitioned into con-
tiguous blocks which are then randomly assigned to the sources. The log splitting
operation is controlled by the following parameters with their assigned default values
if not specified: (1) e: the percentage of overlapped entities between the two data
sources (default 50%). (2) d: the probability of assigning a log block to the first data
source (default 0.5). (3) b: the transactions block size as a percentage of the entity’s
log size. When b is very small, the log split is called a random split (default 1%),
and for higher values we call the split a block split (default 30%). The block split
represents the case where the customer alternates between stores in different places,
e.g., because of moving to a different place during the summer. When b is 50%, the
log is split into two equal contiguous halves. Of the overlapping entities, 50%
have their transactions randomly split and the rest are block split. These parameters
allow us to test our techniques under various scenarios of how entities interact with
the two systems.
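This splitting procedure can be simulated with a short sketch; the function name and defaults are illustrative of the setup described above, not the thesis's experimental code:

```python
import random

def split_log(transactions, d=0.5, b=0.01, seed=0):
    """Partition one entity's transaction log into contiguous blocks of
    b * len(log) transactions, assigning each block to the first source
    with probability d and to the second otherwise."""
    rng = random.Random(seed)
    size = max(1, int(b * len(transactions)))
    src1, src2 = [], []
    for i in range(0, len(transactions), size):
        block = transactions[i:i + size]
        (src1 if rng.random() < d else src2).append(block)
    # Flatten the block lists back into per-source transaction logs.
    return ([t for blk in src1 for t in blk],
            [t for blk in src2 for t in blk])
```

With b near 0 each transaction is assigned almost independently (the random split); with b = 0.5 the log falls into two contiguous halves (the block split).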
All the matching scores within a phase are scaled to be between 0 and 1 by
subtracting the minimum score and dividing by the resulting maximum. All of the experiments
were conducted on a Linux box with a 3 GHz processor and 32 GB RAM. We im-
plemented the proposed techniques in Java and we used MySQL DBMS to store and
query the transactions and the intermediate results.
4.5.1 Quality
The matching quality of the proposed techniques is analyzed by reporting the
classical precision and recall. We also report the f-measure = (2 × precision × recall) / (precision + recall), which
corresponds to the weighted harmonic mean of precision and recall. Since we control
the number of overlapping entities, we know the actual unique entities and can compute
the precision and recall exactly. In some cases, and to provide more readable plots, we only
report the f-measure as an overall quality measure.
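The f-measure used throughout the plots is a one-line computation; this sketch only mirrors the formula above:

```python
def f_measure(precision, recall):
    """Weighted harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
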
Fig. 4.4.: Behavior linkage overall quality. (a) Candidate generation quality. (b) Accurate matching comparisons: precision & recall. (c) Accurate matching comparisons: f-measure.
Overall Quality: In this experiment, we use a log of a group of 1000 customers.
For the candidate generation phase, we report in Figure 4.4(a) the recall, precision
and percentage of the reduction in the number of candidates against the candidate
matching score threshold tc. If the two data sources contain p and q entities and the
number of generated candidates is c pairs, the reduction percentage corresponds to
r = 100(pq − c)/pq.
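The reduction percentage is straightforward to compute; this sketch simply encodes the formula above:

```python
def reduction_percentage(p, q, c):
    """Percentage reduction in candidate pairs: compare the c generated
    candidates against all p * q possible cross-source pairs."""
    return 100.0 * (p * q - c) / (p * q)
```
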
We observe that high recall values close to 100% are achieved for tc ≤ 0.3. More-
over, the reduction in the number of candidates starts around 40% and quickly in-
creases close to 100% for tc ≥ 0.2. The precision starts at very low values close to
zero and increases with tc.
The main purpose of this phase is to reduce the number of candidates while
maintaining high recall using an approximate matching process. Therefore, low values
for tc should be used to relax the matching function and avoid false negatives. For
low values around tc = 0.2, the number of candidates is sharply reduced with very
few false negatives. This result was consistent across different datasets.
Figures 4.4(b) and 4.4(c) illustrate and compare the overall quality of the two
techniques of the behavior merge strategy, motif and compressibility, and of the behavior
similarity technique. In this experiment, we used the candidates produced at tc = 0.2.
The three techniques behave similarly with respect to tf , the Phase 2 filtering threshold.
High recall is achieved for low values of tf , while high precision is reached for high tf .
In Figure 4.4(c), the f-measure values show that the motif technique achieves an accuracy
over 80%, while the compressibility technique barely reaches 65%. The behavior
similarity technique performed worst, barely reaching 45%.
The difference between the motif and compressibility techniques is expected as the
motif technique is based on an exhaustive statistical method that is more accurate,
while the compressibility is based on efficient mathematical computations. The be-
havior similarity technique did not perform well because when a customer’s shopping
behavior is split randomly, it will be difficult to accurately model his/her behavior
based on either source taken separately. Consequently, comparing the behavior model
can hardly help in matching the customers. In the case where the transactions are
block split, the behavior can be well modeled. However, since there are many distinct
customers who have similar shopping behaviors, the matching quality will drop.
For subsequent experiments and to be fair to the three behavior matching tech-
niques, we report the best achieved results when changing tf . For the candidate
generation phase, we used tc = 0.2.
Improving Quality with Textual Linkage: In this experiment, we consider a
situation that is similar to the Yahoo-Maktoob acquisition discussed in the introduc-
tion. We constructed a profile for each customer such that the textual information
is not very reliable. Basically, we synthetically perturbed the string values.

Fig. 4.5.: Improving the textual matching quality.

To match the customers' textual information, we used the Levenshtein2 distance
function [81]. In the experiment, we first match the customers using different string
similarity thresholds and we then pass the resulting matches to our behavior-based
matching techniques.
In Figure 4.5, the overall accuracy improvement is illustrated by reporting the
f-measure values. Generally, as we relax the textual matching by reducing the
string similarity threshold, the behavior linkage approach gets more opportunity
to improve the overall quality. Reducing the string similarity threshold
resulted in very low precision and high recall with an overall low f-measure. Leverag-
ing the behavior in such case improves the precision and consequently improves the
f-measure. Although all the behavior matching techniques improved the matching
quality, the motif technique is more accurate.
Split Transactions with Different Probabilities: In this experiment, we
study the effect of changing the parameter d (i.e., we assess the quality when the
entities' transactions are split between the data sources with different densities). In
Figure 4.6(a), we report and compare the f-measure when linking several datasets,
each split with a different d value from 0.1 to 0.5.
2We used the implementation in the Second String library (http://secondstring.sourceforge.net/)
Fig. 4.6.: Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. (a) Quality vs. different splitting probabilities. (b) Quality vs. behavior exhaustiveness.
We observe that for d ≤ 0.2 (i.e., the first source contains less than 20% of the
entity's transactions), the matching quality drops sharply. The drop happens because
one of the data sources will contain customers with little information about their
behaviors. When matching such small behaviors, they are likely to fit many other
customers and produce well recognized merged behaviors. The behavior merge techniques
consistently produce better results when each of the participating data
sources contains at least 20% of an entity's behavior. This is a reasonable result,
especially since only the behavior information is used for linkage.
Split Incomplete Behaviors: This experiment evaluates the quality of the
behavior linkage when matching incomplete behaviors (i.e., non-exhaustive3). More
precisely, for a customer we split his/her transactions into p parts and proceed to
link the behaviors using only two parts. We report in Figure 4.6(b) the resulting
f-measure matching values when p = 2, . . . , 7.
As expected, as we increase p, the matching quality of all the behavior linkage
approaches drops. Although the customers we used in the experiments may not have
their complete buying behavior in Walmart stores, the motif technique was able to
3An exhaustive behavior means that the customer does all of his/her purchases from the same store throughout the entire studied period.
Fig. 4.7.: Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. (a) Quality vs. behavior contiguousness. (b) Quality vs. percentage of overlapping entities.
get more than 50% quality for customers having about 25% of their behavior split
between the two sources. The matching quality of the compressibility technique drops
below 50% immediately if we split the transactions into 3 parts and link two of them.
The behavior similarity technique was not helpful most of the time, even when using
an even transaction split.
Split Contiguous Behavior Blocks: This experiment studies the effect of
changing b, the transactions block size, to split an entity’s transactions. In Figure
4.7(a), we report the matching quality using the f-measure as we change b. For
b = 50%, the transactions are split into two contiguous halves and for b = 1%, it is
almost as if we are randomly splitting the transactions. For low values of b, we get
the best quality matching using the behavior merge, then as we increase the block
size, b, the matching quality drops. For the behavior similarity, the quality starts low,
about 45%, for very low values of b, then the quality improves to about 65% for b
between 5% and 15%. After that the quality keeps decreasing with the increase of b.
The behavior merge techniques benefit from having the original entities’ behavior
more random; when merging the transactions, the behavior patterns emerge. This is
the main reason for the good accuracy at low values of b. Moreover, high values
of b close to 50% mean that the behaviors can already be well recognized in each
source, and therefore the computed gain after merging the transactions will not be
significant enough to distinguish between the entities. On the other hand, for the behavior similarity technique,
there are two observations affecting its results: (i) there are many customers sharing
the same buying behavior, and (ii) how much of the complete behavior can be recog-
nized. For low values of b, the behavior cannot be well recognized, leading to poorly
estimated behavior model parameters and, consequently, low matching accuracy. For
high values of b, the behavior is well recognized; however, because many
customers share the same behavior, the matching quality drops.
Changing the Number of Overlapping Entities: In this experiment, we
study the effect of changing the overlapping percentage, e, from 10% to 100% (100%
means that all the customers use both stores); the results are reported in Figure
4.7(b). We see that the overlapping percentage does not significantly affect
the matching quality of any of the techniques. This highlights the ability of our approach to
provide good results even when the expected number of overlapping entities is small.
4.5.2 Performance
Our next set of experiments studies the execution time. We start by showing the
positive effect of the candidate generation phase on the overall linking time, then we
discuss the scalability of our approach.
Candidate Generation Phase Effectiveness: In this experiment, we used the
same dataset as in Figure 4.4(a). In Figure 4.8(a), we report the total execution
time of the motif, compressibility and similarity techniques against different values of
the Phase 1 threshold, tc. Phase 1 took 45 sec; this execution time is not affected by tc
because all the pairs of entities should be compared anyway and then filtered based
on tc’s selected value. For each value of tc, the candidates are passed to the accurate
matching phase to produce the final matching results.
Fig. 4.8.: Behavior linkage performance. (a) Candidate generation effectiveness. (b) Scalability.
The time spent in the whole matching process decreases as tc increases because
the number of produced candidates drops dramatically. This was illustrated in Figure
4.4(a) in terms of the percentage reduction in the number of candidates. However,
high values of tc result in many false negatives. As mentioned earlier, values around
tc = 0.2 produce good quality candidates.
When comparing the performance of the accurate matching techniques at tc =
0.2, the compressibility technique outperforms the motif technique by a factor of about 3. This
is because compressibility does not require scanning the
data many times, while the motif technique uses an expensive iterative statistical
method. The compressibility technique is thus more attractive for very large logs.
The similarity technique requires less time than the motif, because the motif computes
for each candidate pair 3 statistical models (two for the original two entities and one
for the resulting merged entity), while the similarity computes models for only two
entities.
Scalability : This experiment analyzes the scalability of the behavior linkage
approach and compares the two cases of using and not using the candidate generation phase.
The evaluation was conducted using a sequential implementation of the techniques
(i.e., no parallelization was introduced). In Figure 4.8(b), we report the overall linkage
time for the three behavior matching techniques when using Phase 1 (motif, compress
and sim) and without Phase 1 (motif-nP1, compress-nP1 and sim-nP1). When using
Phase 1, we set tc = 0.2.
The behavior matching techniques require expensive computations and scale
poorly without the help of the candidate generation phase, which resulted in around
two orders of magnitude speedup for the motif technique. The processing time is
governed by the generated number of candidates using the threshold tc as discussed
in the previous experiment.
For very large scale data processing, generating the candidates can benefit from
standard database join performance. Moreover, the computations required for each
candidate pair are independent of any other pair's computations and hence can be
easily parallelized. Therefore, in a parallel environment, all the behavior matching
computations can be carried out concurrently.
4.6 Summary
In this chapter, we presented a technique to indirectly involve users in the data
cleaning task. In particular, we presented a novel approach for record linkage that uses
entity behavior extracted from transactions logs. When matching two entities, we
measure the gain in recognizing a behavior in their merged logs. We proposed two
different techniques for behavior recognition: a statistical modeling technique and a
more computationally efficient technique that is based on information theory. To im-
prove efficiency, we introduced a quick candidate generation phase. Our experiments
demonstrate the high quality and performance of our approach.
5. HOLISTIC MATCHING WITH WEB TABLES FOR
ENTITY AUGMENTATION AND FINDING MISSING
VALUES
This chapter introduces an approach to leverage the WWW for a data cleaning task.
In particular, we use extracted web tables in the task of finding missing values in a
database. We assume that we have a table with a list of entities (we call it the query
table) and we want to find the missing values of some attributes of these entities.
The approach requires finding the web tables that match the query table and then
using the matched web tables to aggregate the missing values.
This chapter proposes a novel approach for matching web tables with the query
table by modeling the problem as topic-sensitive PageRank, where the query table
defines a topic over the web tables. We also propose a system architecture that performs
most of the “heavy lifting” in a preprocessing step, using MapReduce to produce a
set of indexes, so that we achieve fast response times.
The chapter is organized as follows: We start with a motivating example for our
approach in Section 5.1. Then, we present our holistic matching framework in Sec-
tion 5.2. Section 5.3 describes our system architecture, and Section 5.4 discusses how
we build the semantic matching graph among web tables. In Section 5.7, we evaluate
our approach, and finally, we summarize the chapter in Section 5.8.
5.1 Introduction
The Web contains a vast corpus of HTML tables. In this chapter, we focus
on one class of HTML tables: entity-attribute tables (also referred to as relational
tables [19,20] and 2-dimensional tables [21]). Such a table contains values of multiple
entities on multiple attributes, each row corresponding to an entity and each column
Fig. 5.1.: APIs of the core operations. (a) Augmentation By Attribute Name. (b) Augmentation By Example. The input (query) table lists camera models (S80, A10, GX-1S, T1460) with an empty Brand column; the output table fills in the brands (Nikon, Canon, Samsung, Benq).
corresponding to an attribute. Cafarella et al. reported 154M such tables from a
snapshot of Google’s crawl in 2008; we extracted 573M such tables from a recent
crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables simply as
web tables.
Consider an enterprise database with a table about companies where all
(or most) of the contact information is missing, or consider a product database
with a table about digital cameras. In the cameras table, the camera model name
is provided, but other attributes such as brand, resolution, price and optical
zoom have missing values. We call these attributes augmenting attributes and the
process of finding the missing attribute values entity augmentation. Gathering
information about the entities is a labor-intensive task. We propose to automate
this task using the extracted web tables.
Such augmentation would be difficult to perform using an enterprise database or
an ontology because the entities can be from any arbitrary domain. Today, users try
to manually find the web sources containing this information and assemble the values.
Assuming that this information is available, albeit scattered, in various web tables,
we can save a lot of time and effort if we can perform this operation automatically.
To find missing values, we provide two core operations using web tables.
The first operation is called “Augmentation By Attribute Name” (ABA), where
we have a list of entities and an attribute name to be augmented using web tables.
Figure 5.1(a) shows example input and output for this operation applied to camera
model entities with one augmenting attribute (brand). The second operation is called
“Augmentation By Example” (ABE). It is a variant of ABA, where we provide the
values on the augmenting attribute(s) for a few entities instead of providing the name
of the augmenting attribute(s). Figure 5.1(b) shows example input and output for this
operation applied to camera model entities and one augmenting attribute (brand).
The requirements for these core operations are: (i) high precision (#corraug/#aug) and
high coverage (#aug/#entity), where #corraug, #aug and #entity denote the number of
entities correctly augmented, the number of entities augmented and the total number of
entities, respectively; (ii) fast (ideally interactive) response times; and (iii) applicability
to entities of any arbitrary domain. The focus of this chapter is to perform these
operations using web tables such that the above requirements are satisfied.
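These two measures can be computed as in the following sketch; the dictionaries and function name are illustrative, not part of the system:

```python
def augmentation_quality(augmented, truth):
    """Precision = #corraug / #aug and coverage = #aug / #entity.
    `augmented` maps entity -> predicted value (entities with no
    prediction are absent); `truth` maps every query entity to its
    correct value."""
    correct = sum(1 for e, v in augmented.items() if truth.get(e) == v)
    precision = correct / len(augmented) if augmented else 0.0
    coverage = len(augmented) / len(truth)
    return precision, coverage
```

For instance, a run that augments 3 of 4 entities but gets only 1 value right has precision 1/3 and coverage 3/4.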
Baseline Technique: We present the baseline technique and our insights in the
context of the ABA operation; they apply to ABE as discussed in Section 5.5. For
simplicity, we consider only one augmenting attribute. As shown in Figure 5.1(a),
the input can be viewed as a binary relation with the first column corresponding to
the entity name and the second corresponding to the augmenting attribute. The first
column is populated with the names of entities to be augmented while the second
column is empty. We refer to this table as the query table (or simply the query).
The baseline technique first identifies web tables that semantically “match” the
query table using schema matching techniques (we consider simple 1:1 mappings
only) [55]. Subsequently, we look each entity up in those web tables to obtain its value
on the augmenting attribute. The state-of-the-art entity augmentation technique,
namely Octopus, implements a variant of this technique using the search engine API
[20].
Fig. 5.2.: ABA operation using web tables
Example 5.1.1 Consider the query table Q in Figure 5.2. For simplicity, assume
that, like the query table, all the web tables are entity-attribute binary (EAB) relations
with the first column corresponding to the entity name and the second to an attribute
of the entity. Note that for both the query table and web table the first column is
approximately the key column. Using traditional schema matching techniques, a web
table matches Q iff (i) the data values in its first column overlap with those in the first
column of Q and (ii) the name of its second column is identical to that of the augmenting
attribute. We refer to such matches as “direct matches” and the approach as the “direct
match approach” (DMA). In Figure 5.2, only web tables T1, T2 and T3 directly match
with Q (shown using solid arrows). A score can be associated with each direct match
based on the degree of value overlap and degree of column name match; such scores are
shown in Figure 5.2. We then look the entities up in T1, T2 and T3. For S80, both T1
and T3 contain it, but the values are different (Nikon and Benq, respectively). We can
either choose arbitrarily or choose the value from the web table with the higher score,
i.e., Benq from T3. For A10, we can choose either Canon from T2 or Innostream
from T3 (they have equal scores). For GX−1S, we get Samsung. We fail to augment
T1460 as none of the matched tables contains that entity.
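A simplified sketch of a DMA score along the lines of this example; the real scores combine the degree of value overlap with the degree of column-name match, whereas this illustrative function gates on an exact attribute-name match:

```python
def direct_match_score(query_entities, query_attr, table_keys, table_attr_name):
    """DMA score sketch: fraction of query entities found in the web
    table's key column, returned only when the table's second column
    name matches the augmenting attribute name."""
    if table_attr_name.lower() != query_attr.lower():
        return 0.0
    query = set(query_entities)
    overlap = len(query & set(table_keys))
    return overlap / len(query)
```

A table covering half the query entities under the right attribute name scores 0.5; a table with a different attribute name scores 0 regardless of overlap.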
DMA suffers from two problems:
(i) Low precision: In the above example, T3 contains models and brands of cell
phones, not cameras. The names of some of the cell phone models in T3 are identical
to those of the camera models in the query table; hence, T3 gets a high score. This
results in 2 (out of 3) wrong augmentations: S80 and A10 (assuming we choose
Innostream from T3 for A10). Hence, the precision is 33%. Such ambiguity of entity
names exists in all domains, as validated by our experiments. Note that this can be
mitigated by raising the “matching threshold,” but this leads to poor coverage.
(ii) Low coverage: In the above example, we fail to augment T1460. Hence, the
coverage is 75%. This number is much lower in practice, especially for tail domains.
For example, the Octopus system (which implements a variant of DMA) reports
a coverage of 33%. This primarily happens because tables that can provide the
desired values either do not have column names or use a column name different from the
augmenting attribute name provided by the user.
One way to address the coverage issue is to use synonyms of the augmenting
attribute [53, 82]. Traditionally, schema-matchers have used hand-crafted synonyms;
this is not feasible in our setting where the entities can be from any arbitrary domain.
Automatically generating attribute synonyms for arbitrary domains, as proposed in
[19], typically results in poor-quality synonyms. Our experiments show that these are
unusable without manual intervention.
Main Insights and Contributions: Our key insight is that many tables indirectly
match the query table, i.e., via other web tables. These tables, in conjunction with the
directly matching ones, can improve both coverage and precision. We first consider
coverage. Observe that in Figure 5.2, table T4 contains the desired attribute value of
T1460 (Benq), but we cannot “reach” it using a direct match. Using schema matching
techniques, we can find that T4 matches with T1 (i.e., there is 1:1 mapping between
the two attributes of the two relations) as well as T2 (as it has 2 records in common
with T1 and 1 in common with T2). Such schema matches among web tables are
denoted by dashed arrows; each such match has a score representing the degree of
match. Since T1 and/or T2 (approximately) matches with Q (using DMA) and T4
(approximately) matches with T1 and T2 (using schema matching among web tables),
we can conclude T4 (approximately) matches with Q. We refer to T4 as an indirectly
matching table; using it, we can correctly augment T1460. This improves coverage
from 75% to 100%.
Many of the indirectly matching tables are spurious matches; using these tables
to predict values would result in wrong predictions. The challenge is to be robust
to such spurious matches. We address this challenge in two ways. First, we perform
holistic matching. We observe that truly matching tables match with each other and
with the directly matching tables, either directly or indirectly while spurious ones do
not. For example, T1, T2 and T4 match directly with each other, while T3 only matches
weakly with T2. If we compute the overall matching score of a table by aggregating
the direct match as well as all indirect matches, the true matching tables will get
higher scores; we refer to this as holistic matching1. In the above example, T1, T2
and T4 will get higher scores compared with T3; this leads to correct augmentations
for S80 and A10, resulting in a precision of 100% (up from 33%). Second, for each
entity, we obtain predictions from multiple matched tables and “aggregate” them; we
then select the “top” one (or k) value(s) as the final predicted value(s).
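The per-entity aggregation step can be sketched as a weighted vote over the candidate values; the function name and the score-summing rule are illustrative assumptions, not the system's exact aggregation:

```python
def aggregate_predictions(predictions, k=1):
    """Weighted-vote aggregation: `predictions` is a list of
    (value, table_score) pairs for one entity, collected from all
    matching web tables; return the k values with the highest total
    score."""
    totals = {}
    for value, score in predictions:
        totals[value] = totals.get(value, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]
```

A value supported by several well-scored tables accumulates more weight than a value from a single spurious table, which is what makes the aggregation robust.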
This gives rise to additional technical challenges: (i) We need to compute schema
matches between pairs of web tables; we refer to the result as the schema matching among
web tables (SMW) graph. How do we build an accurate SMW graph over 573M ×
573M pairs of tables? (ii) How do we model the holistic matching? The model
should take into account the scores associated with the edges in the SMW graph as
1 This is different from holistic matching proposed in [83].
well as those associated with the direct matches. (iii) How do we augment the entities
efficiently at query time?
We have built the InfoGather system based on the above insights. Our contri-
butions can be summarized as follows:
• We develop a novel holistic matching framework based on topic-sensitive PageRank (TSP) over the SMW graph. We argue that by considering the query table
as a topic and web tables as documents, we can efficiently model the holistic
matching as TSP (details are in Section 5.2.4). To the best of our knowledge,
this is the first work to propose holistic matching with web tables.
• We present a novel architecture for the InfoGather system that leverages
preprocessing in MapReduce to achieve extremely fast (interactive) response
times at query time. Our architecture overcomes the limitations of the prior
architecture (viz., Octopus) that uses the search API: its inability to perform
indirect/holistic matches and its high response times.
• We present a machine learning-based technique for building the SMW graph.
Our key insight is that the text surrounding the web tables is important in
determining whether two web tables match or not. We propose a novel set
of features that leverage this insight. Furthermore, we develop MapReduce
techniques to compute these (pairwise) features that scale to 573M tables.
Finally, we propose a novel approach to automatically generate training data
for this learning task; this liberates the system designer from manually producing
labeled data.
• We perform extensive experiments on six real-life query datasets and 573M
web tables. Our experiments show that our holistic matching framework has
significantly higher precision and coverage compared with both the direct matching
approach and the state-of-the-art entity augmentation technique, Octopus.
Furthermore, our technique has response times four orders of magnitude faster
than Octopus.
5.2 Holistic Matching Framework
We present the data model, the general augmentation framework and its two
specializations: direct matching and holistic matching frameworks. We present them
in the context of the ABA operation. How we leverage these frameworks for the ABE
operation is discussed in Section 5.5.
5.2.1 Data Model
For the purpose of exposition, we assume that the query table is an entity-attribute
binary (EAB) relation, i.e., a query table Q is of the form Q(K,A), where K denotes
the entity name attribute and A is the augmenting attribute. Since Q.K is approx-
imately the key attribute, we refer to it as the query table key attribute and the
entities as keys. The key column is populated while the augmenting attribute column
is empty. An example of the query table satisfying the above properties is shown in
Figure 5.2.
We assume that all web tables are EAB relations as well. For each web table
T ∈ T , we have the following: (1) the EAB relation TR(K,B) where K denotes the
entity name attribute and B is an attribute of the entity; as in the query table, since
T.K is approximately the key attribute, we refer to it as the web table key attribute,
(2) the url TU of the web page from which it was extracted, and (3) its context TC
(i.e., the text surrounding the table) in the web page from which it was extracted. For
simplicity, we denote TR(K,B) as T (K,B) when it is clear from the context. Figure
5.2 shows four web tables (T1,T2,T3,T4) satisfying the EAB property.
The ABA problem can be stated as follows.
Definition 5.2.1 Augmentation By Attribute Name (ABA): Given a query table
Q(K,A) and a set of web tables ⟨T (K,B), TU , TC⟩ ∈ T , predict the value of each
query record q ∈ Q on attribute A.
In practice, not all web tables are EAB relations; we show how our framework can
be used for general, n-ary web tables in Section 5.6. Furthermore, the query table
can have more than one augmenting attribute; we assume that those attributes are
independent and perform predictions for one attribute at a time.
5.2.2 General Augmentation Framework
Our augmentation framework consists of two main steps: First, identify web tables
that “match” with the query table. Second, use each matched web table to provide
value predictions for the particular keys that happen to overlap between the query
and the web table; then aggregate these predictions and pick the top value as the
final predicted value. We describe the two steps in further detail.
• Identify Matching Tables: Intuitively, a web table T (K,B) matches the
query table Q(K,A) if Q.K and T.K refer to the same type of entities and Q.A
and T.B refer to the same attribute of the entities. In this work, we consider
simple 1:1 mappings only. Each web table T will be assigned a score S(Q, T )
representing the matching score to the query table Q. Since Q is fixed, we omit
Q from the notation and simply denote it as S(T ). There are many ways to
obtain the matching scores between the query table and web tables; we consider
two such ways in the next two subsections.
• Predict Values: For each record q ∈ Q, we predict the value q[Q.A] of record
q on attribute Q.A from the matching web tables. This is done by joining
the query table Q(K,A) with each matched web table T (K,B) on the key
attribute K. If there exists a record t ∈ T such that q[Q.K] ≈ t[T.K] (where ≈
denotes either exact or approximate equality of values), then we say that the
web table T predicted the value v = t[T.B] for q[Q.A] with a prediction score
ST (v) = S(T ) and return (v, ST (v)).
After processing all the matched tables, we end up with a set P_q = {(x_1, S_{T_1}(x_1)), (x_2, S_{T_2}(x_2)), . . .} of predicted values for q[Q.A] along with their corresponding prediction scores. We then perform fuzzy grouping [84] on the x_i's to get the groups G_q = {g_1, g_2, . . .}, such that ∀ x_i ∈ g_k, x_i ≈ v_k, where v_k is the centroid or representative of group g_k. We compute the final prediction score for each group representative v by aggregating the prediction scores of the group's members as follows:

S(v) = \mathcal{F}_{(x_i, S_{T_i}(x_i)) \in P_q \mid x_i \approx v} \; S_{T_i}(x_i)    (5.1)

where \mathcal{F} is an aggregation function. Any aggregation function such as sum or max
can be used in this framework.
The final predicted value for q[Q.A] is the one with the highest final prediction
score:
q[Q.A] = \arg\max_{v} S(v)    (5.2)
If the goal is to augment k values for an entity on an attribute (e.g., the entity is a
musical band and the goal is to augment it with all its albums), we simply pick the
k values with the highest final prediction scores.
Example 5.2.1 Consider the example in Figure 5.2. Using the table matching scores
shown, for the query record S80, Pq = {(Nikon, 0.25), (Benq, 0.5)} (predicted by
tables T1 and T3 respectively). The final predicted values are Nikon and Benq with
scores 0.25 and 0.5 respectively, so the predicted value is Benq.
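The aggregation of Eqs. 5.1 and 5.2 can be sketched as follows. This is a simplified illustration (the function name is ours): exact string equality stands in for fuzzy grouping [84], and sum is used as the aggregation function F.

```python
from collections import defaultdict

def predict_value(predictions, aggregate=sum):
    """Aggregate (value, score) predictions from matched web tables and
    pick the top value (Eqs. 5.1 and 5.2). Exact equality stands in for
    the fuzzy grouping step."""
    groups = defaultdict(list)          # group predictions by value
    for value, score in predictions:
        groups[value].append(score)
    scores = {v: aggregate(s) for v, s in groups.items()}
    best = max(scores, key=scores.get)  # argmax of Eq. 5.2
    return best, scores

# Example 5.2.1: predictions for query record S80
best, scores = predict_value([("Nikon", 0.25), ("Benq", 0.5)])
print(best)  # Benq
```

With max as the aggregation function the same call reproduces the example as well, since each value is predicted by a single table.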
5.2.3 Direct Match Approach
One way to compute the matching web tables and their scores is the direct match
approach (DMA) discussed in Section 5.1. The prediction step is identical to that in
the general augmentation framework. Using traditional schema matching techniques,
DMA considers a web table T to match with the query table Q iff (i) data values in
T.K overlap with those in Q.K and (ii) the attribute name T.B matches Q.A (denoted
by T.B ≈ Q.A). DMA computes the matching score S(T) between Q and T, denoted
as S_DMA(T), as follows:

S_{DMA}(T) = \begin{cases} \dfrac{|T \cap_K Q|}{\min(|Q|, |T|)} & \text{if } Q.A \approx T.B \\ 0 & \text{otherwise} \end{cases}    (5.3)

where |T ∩_K Q| = |{t | t ∈ T and ∃ q ∈ Q s.t. t[T.K] ≈ q[Q.K]}|. For example, in
Figure 5.2, the scores for T1, T2 and T3 are 1/4, 2/4 and 2/4 respectively, as they have 1, 2
and 2 matching keys respectively, min(|Q|, |T|) = 4 and Q.A ≈ T.B; the score for T4
is 0 because Q.A ≉ T.B.
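Eq. 5.3 can be computed as below; the key sets are hypothetical stand-ins for the tables of Figure 5.2, and exact equality stands in for ≈.

```python
def dma_score(q_keys, t_keys, attr_match):
    """Direct-match score S_DMA(T) of Eq. 5.3.
    q_keys, t_keys: sets of key values of Q and T;
    attr_match: whether Q.A matches T.B."""
    if not attr_match:                  # Q.A does not match T.B
        return 0.0
    overlap = len(q_keys & t_keys)      # |T ∩_K Q| under exact equality
    return overlap / min(len(q_keys), len(t_keys))

q  = {"S80", "A10", "DSC W570", "Optio E60"}   # |Q| = 4
t1 = {"S80", "k1", "k2", "k3"}                 # shares 1 key with Q
print(dma_score(q, t1, attr_match=True))       # 0.25
print(dma_score(q, t1, attr_match=False))      # 0.0
```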
5.2.4 Holistic Match Approach
To overcome the limitations of the DMA approach as outlined in Section 5.1,
we study the holistic approach to compute matching tables and their scores. The
prediction step remains the same as above. We model the holistic matching using
TSP. We start by reviewing the definitions of personalized pagerank (PPR) and TSP;
and then make the link to our problem in Section 5.2.4.
Preliminaries: Personalized and Topic Sensitive Pagerank
Consider a weighted, directed graph G(V,E). We denote the weight on an edge
(u, v) ∈ E with αu,v. Pagerank is the stationary distribution of a random walk on G
that at each step, with a probability ϵ, usually called the teleport probability, jumps
to a random node, and with probability (1− ϵ) follows a random outgoing edge from
the current node. Personalized Pagerank (PPR) is the same as Pagerank, except all
the random jumps are done back to the same node, denoted as the “source” node,
for which we are personalizing the Pagerank.
Formally, the PPR of a node v, with respect to the source node u, denoted by
πu(v), is defined as the solution of the following equation:

\pi_u(v) = \epsilon \, \delta_u(v) + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w) \, \alpha_{w,v}    (5.4)

where δu(v) = 1 iff u = v, and 0 otherwise. The PPR values πu(v) of all nodes v ∈ V
with respect to u is referred to as the PPR vector of u.
A “topic” is defined as a preference vector β inducing a probability distribution
over V . We denote the value of β for node v ∈ V as βv. Topic sensitive pagerank
(TSP) is the same as Pagerank, except all the random jumps are done back to one of
the nodes u with βu > 0, chosen with probability βu. Formally, the TSP of a node v
for a topic β is defined as the solution of the following equation [85]:

\pi_\beta(v) = \epsilon \, \beta_v + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_\beta(w) \, \alpha_{w,v}    (5.5)
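Eq. 5.5 can be solved by power iteration (Eq. 5.4 is the special case where β places all mass on u). A minimal sketch on a toy graph, assuming the outgoing edge weights of each node are normalized to sum to 1:

```python
def tsp(edges, beta, eps=0.15, iters=200):
    """Topic-sensitive PageRank by iterating Eq. 5.5.
    edges: u -> {v: alpha_uv}, outgoing weights summing to 1;
    beta: preference vector over nodes."""
    nodes = set(beta) | set(edges)
    for nbrs in edges.values():
        nodes |= set(nbrs)
    pi = {v: beta.get(v, 0.0) for v in nodes}   # start from beta
    for _ in range(iters):
        nxt = {v: eps * beta.get(v, 0.0) for v in nodes}
        for w, nbrs in edges.items():
            for v, a in nbrs.items():
                nxt[v] += (1 - eps) * pi[w] * a
        pi = nxt
    return pi

# Two tables that match each other; the topic places all mass on "a"
pi = tsp({"a": {"b": 1.0}, "b": {"a": 1.0}}, beta={"a": 1.0})
# converges to pi["a"] ≈ 0.54, pi["b"] ≈ 0.46
```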
Modeling Holistic Matching using TSP
First, we draw the connection between the PPR of a node with respect to a source
node and the holistic match between two web tables. Then, we show how the holistic
matching between the query table and a web table can be modeled with TSP.
Consider two nodes u and v of any weighted, directed graph G(V,E). The PPR
πu(v) of v with respect to u represents the holistic relationship of v to u where E
represents the direct, pairwise relationships, i.e., it considers all the paths, direct as
well as indirect, from u to v and “aggregates” their scores to compute the overall
score. PPR has been applied to different types of relationships. When the direct,
pairwise relationships are hyperlinks between web pages, πu(v) is the holistic impor-
tance conferral (via hyperlinking) of v from u; when the direct, pairwise relationships
are direct friendships in a social network, πu(v) is the holistic friendship of v from u.
In this work, we propose to use PPR to compute the holistic semantic match
between two web tables. Therefore, we build the weighted graph G(V,E), where each
node v ∈ V corresponds to a web table and each edge (u, v) ∈ E represents the direct
pair-wise match (using schema matching) between the web tables corresponding to u
and v. Each edge (u, v) ∈ E has a weight αu,v which represents the degree of match
between the web tables u and v (provided by the schema matching technique). We
discuss building this graph and computing the weights in detail in Section 5.4.1. We
refer to this graph as the schema matching graph among web tables (SMW graph).
Thus, the PPR πu(v) of v with respect to u over the SMW graph models the holistic
semantic match of v to u.
If the query table Q is identical to the web table corresponding to node
u, then the holistic match score SHol(T) between Q and a web table T is πu(v),
where v is the node corresponding to T . However, the query table Q is typically not
identical to a web table. In this case, how can we model the holistic match of a web
table T to Q? Our key insight is to consider Q as a “topic” and model the match as
the TSP of the node v corresponding to T to the topic. In the web context where
the relationship is that of importance conferral, the most important pages on a topic
are used to model the topic (the ones included under that topic in Open Directory
Project); in our context where the relationship is semantic match, the top matching
tables should be used to model the topic of Q. We use the set of web tables S (referred
to as seed tables) that directly match with Q, i.e., S = {T | SDMA(T) > 0}, to model
it. Furthermore, we use the direct matching scores SDMA(T), T ∈ S, as the preference
values β:
\beta_v = \begin{cases} \dfrac{S_{DMA}(T)}{\sum_{T \in S} S_{DMA}(T)} & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases}    (5.6)

where v corresponds to T. For example, βv is 0.25/1.25, 0.5/1.25 and 0.5/1.25 for T1, T2 and
T3 respectively, and 0 for all other tables. Just like the TSP score of a web page
representing the holistically computed importance of a page to the topic, πβ(v) over
the SMW graph models the holistic semantic match of v to Q. Thus, we propose to
use SHol(T ) = πβ(v) where v corresponds to T .
5.3 System Architecture
Suppose the SMW graph G has been built upfront. The naive way to compute
the holistic matching score SHol(T ) for each web table is to run the TSP computation
algorithm over G at augmentation time. This results in prohibitively high response
times. We leverage the following result to overcome this problem:
Theorem 5.3.1 (Linearity [85]) For any preference vector β, the following equality
holds:
\pi_\beta(v) = \sum_{u \in V} \beta_u \, \pi_u(v)    (5.7)

If we can precompute the PPR πu(v) of every node v with respect to every other node
u (referred to as Full Personalized Pagerank (FPPR) computation) in the SMW graph,
we can compute the holistic matching score for any query table πβ(v) efficiently using
Eq. 5.7. This leads to very fast response times at query time.
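At query time, Eq. 5.7 turns the TSP computation into a sparse weighted sum of the precomputed PPR vectors of the seed tables. A sketch of this lookup-and-combine step, with hypothetical table ids and scores:

```python
def tsp_from_ppr(beta, t2ppv):
    """Query-time TSP scores via the linearity theorem (Eq. 5.7):
    pi_beta(v) = sum_u beta_u * pi_u(v). Only seed tables (beta_u > 0)
    require a lookup; only non-zero PPR entries are stored."""
    scores = {}
    for u, b in beta.items():
        if b <= 0.0:
            continue
        for v, p in t2ppv[u].items():   # stored PPR vector of seed u
            scores[v] = scores.get(v, 0.0) + b * p
    return scores

# Hypothetical preference vector and stored PPR vectors
beta  = {"T1": 0.2, "T2": 0.8}
t2ppv = {"T1": {"T1": 0.6, "T4": 0.4},
         "T2": {"T2": 0.5, "T4": 0.5}}
print(tsp_from_ppr(beta, t2ppv))
```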
The InfoGather architecture has two components as shown in Figure 5.3. The
first component performs offline preprocessing of the web crawl to extract the web
tables, build the SMW graph and compute the FPPR. For all these offline steps, our
techniques need to scale to hundreds of millions of tables. We propose to leverage the
MapReduce framework for this purpose. The second component concerns the query
time processing, where we compute the TSP scores for the web tables and aggregate
the predictions from the web tables. In the following, we give more details about each
component:
Preprocessing: There are four main processing steps in this component:
• P1: Extract the HTML web tables from the web crawl and use a classifier to
distinguish the entity-attribute tables from the other types of web tables (e.g.,
formatting tables, attribute-value tables, etc.). Our approach is similar to the
one proposed in [86]; we do not discuss this step further as it is not the focus
of this work.
• P2: Index the web tables to facilitate faster identification of the seed tables. We
use three indexes: (i) An index on the web tables’ key attribute values (WIK).
Given a query table Q, WIK(Q) returns the set of web tables that overlap
with Q on at least one of the keys. (ii) An index on the web tables' complete
records, i.e., key and value combined (WIKV). WIKV(Q) returns the set
of web tables that contain at least one record from Q. (iii) An index on the
web tables' attribute names (WIA), such that WIA(Q) returns the set of web
tables {T | T.B ≈ Q.A}.
• P3: Build the SMW graph based on schema matching techniques as we describe
in Section 5.4.1.
• P4: Compute the FPPR and store the PPR vector for each web table (we store
only the non-zero entries). We refer to this as the T2PPV index. For any web
table T , T2PPV(T ) returns the PPR vector of T . We discuss the technique we
use to compute the FPPR in Section 5.4.2.
The indexes (WIK, WIKV, WIA and T2PPV) may either be disk-resident or
reside in memory for faster access.
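A toy, in-memory sketch of the three indexes follows; the real indexes are built offline over 573M tables, and the shapes here (tables as key-to-value dicts, exact matches) are simplifying assumptions.

```python
from collections import defaultdict

def build_indexes(tables):
    """Build toy WIK, WIKV and WIA indexes.
    tables: table id -> (records as key->value dict, attribute name B)."""
    wik, wikv, wia = defaultdict(set), defaultdict(set), defaultdict(set)
    for tid, (records, attr) in tables.items():
        wia[attr].add(tid)                 # attribute name -> tables
        for k, v in records.items():
            wik[k].add(tid)                # key value -> tables
            wikv[(k, v)].add(tid)          # full record -> tables
    return wik, wikv, wia

# Illustrative contents loosely based on Figure 5.2
tables = {"T1": ({"S80": "Nikon", "DSC W570": "Sony"}, "brand"),
          "T3": ({"A10": "Benq"}, "brand")}
wik, wikv, wia = build_indexes(tables)
print(wik["S80"])             # {'T1'}
print(wikv[("A10", "Benq")])  # {'T3'}
```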
Query Time Processing: The query time processing can be abstracted in three
main steps. The details of each step depend on the operation. We provide those
details for the ABE operation in Section 5.5.
• Q1: Identify the seed tables: We leverage the WIK, WIKV and WIA indexes to
identify the seed tables and compute their DMA scores.
• Q2: Compute the TSP scores: We compute the preference vector β by plugging
the DMA matching scores in Eq. 5.6. According to Theorem 5.3.1, we can
use β and the stored PPR vectors of each table to compute the TSP score
for each web table. Note that only the seed tables have non-zero entries in β.
Accordingly, we need to retrieve the PPR vectors of only the seed tables using
the T2PPV index. Furthermore, we do not need to compute TSP scores for all
web tables in the retrieved PPR vectors. We need to compute it only for the
tables that could be used in the aggregation step: the ones that have at least one
key overlapping with the query table. We refer to them as relevant tables. These
can be identified efficiently by invoking WIK(Q). These two optimizations are
important to compute the TSP scores efficiently.
[Figure: architecture diagram. Pre-processing: web crawl → extract and identify relational web tables → web table indexes (WIK, WIKV, WIA) → build web tables graph → FPPR → T2PPV. Query-time processing: query table → TSP → predictions.]

Fig. 5.3.: InfoGather System Architecture
• Q3: Aggregate and select values: In this step, we collect the predictions provided
by the relevant web tables T along with the scores SHol(T ). The predictions are
then processed, the scores are aggregated and the final predictions are selected
according to the operation.
5.4 Building the SMW Graph and Computing FPPR
This section discusses the major preprocessing steps of the web tables, namely,
building the SMW graph (P3) and computing the FPPR (P4).
5.4.1 Building the SMW Graph
First, we give details on how we match a pair of web tables and then address the
scalability challenges in building the SMW graph.
Matching Web Tables
In the SMW graph, there is an edge between a pair (T (K,B), T ′(K ′, B′)) of web
tables if T matches with T′, i.e., T.K and T′.K′ refer to the same type of entities and
T.B and T′.B′ refer to the same attribute of those entities. Our problem can be
formally stated as follows:
Definition 5.4.1 Pairwise web tables matching problem: Given two web tables
⟨T(K,B), TU, TC⟩ and ⟨T′(K′,B′), T′U, T′C⟩, determine whether T matches with T′ and
compute the score of the mapping T.K → T′.K′, T.B → T′.B′.
In schema matching [53, 55], the problem of matching two schemas S and S ′ is
normally framed as follows: Given the two schemas, for each attribute A of S, find
the best corresponding attribute A′ of S ′, possibly with an associated matching score.
The problems are similar enough that the techniques used in the standard schema
matching problem can be used for ours as well. Schema matching techniques first
identify information about each element of each schema that is relevant to discovering
matches. For each pair of elements, one from each schema, they compute a set of
“feature scores” where each feature score represents a match between the pair on a
different aspect. Finally, they combine those feature scores into a single score based
on which they decide whether the element pair matches or not. The combination
module can either use machine learning-based techniques or non-learning methods
[41,56]; we use machine learning-based techniques in this work.
Traditionally, the focus is on schema level features (e.g., attribute names match-
ing) and instance level features (e.g., attribute data values matching). Specifically for
web tables, [20] suggested using two specific features: (i) the average column-width
similarity and (ii) the similarity of the text of the table content without considering
the column and row structure.
But for web tables, it may not be sufficient to rely only on these traditional schema-
and instance-level features. For example, consider tables T2 and T3 in Figure 5.2.
At the schema level, they both share the same column names, and moreover, at the
instance level both share the Model A10 and the Brand Samsung. Despite all
these similarities, these web tables are not a match, because T2 is about cameras
and T3 is about cell phones. On the other hand, consider web tables T2 and T4.
They neither share schema level nor instance level similarities. However, both T2 and
T4 contain digital camera models with their brands and should have high matching
score. Furthermore, many web tables do not have column names [51]; this further
exacerbates the problem.
Our main insight is that there is additional information about the web tables that
can be leveraged to overcome the above limitations. We propose 4 novel feature scores
based on this insight:
• Context similarity: The context or the text around the web table in the web
page provides valuable information about the table. Suppose, the context for
T3 is “Mobile phones or cell phones, ranging from . . . ”, while that for T2 and T4
are “Ready made correction data for cameras and lenses” and “Camera Bags
Compatibility List” respectively. This indicates that T2 and T4 are probably
about cameras while T3 is about phones. Clearly, sharing the term ‘cameras’
indicates the similarity between T2 and T4. We capture this intuition using a
context similarity feature which is computed using the tf-idf cosine similarity of
the text around the table.
• Table-to-Context similarity: The context of a table may contain keywords
that overlap with values inside another web table. This provides evidence that
the web pages containing the tables are about similar subjects, and hence, the
tables may be about similar subjects as well. We capture this intuition using
table-to-context similarity feature, which is computed using the tf-idf cosine
similarity of the text around the first table and the text inside the second table.
• URL similarity: The URL of the web page containing the table can help
in matching with another table. Sometimes, a web site lists the records from
the same original large table in several web pages. For example a web site may
list the movies in several web pages by year, by first letter, etc. In this case,
matching the URLs of the web pages is a good signal in the matching of the
web tables. We capture this intuition using a URL similarity feature, computed
using cosine similarity of the URL terms.
• Tuples similarity: The web tables that we consider are EAB relations and the
correspondences between the attributes are frozen (T.K → T ′.K ′ and T.B →
T′.B′); we just need to determine the strength of the correspondence. Hence, the
number of tuples that overlap between the two tables is strong evidence for deciding
whether the tables match. Note that this is different from the instance-level feature,
which considers the data values of each attribute individually.
We use the above similarities as features in a classification model. Given the fea-
tures, the model predicts the match between two tables with a probability, which is
used as the weight on the edge between them. The set of features includes the newly
proposed features, namely, (1) context similarity, (2) table-to-context similarity,
(3) URL similarity, and (4) tuples similarity; in addition to the traditional schema-
and instance-level features, namely, (5) attribute name similarity, (6) column value
similarity, (7) table-to-table similarity as a bag of words, and (8) column width
similarity.
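Several of these features reduce to a tf-idf cosine between two bags of terms (e.g., the context similarity feature). A minimal sketch, assuming pre-tokenized text and a given idf table; the toy strings are loosely based on the running example:

```python
import math
from collections import Counter

def tfidf_cosine(terms_a, terms_b, idf):
    """tf-idf cosine similarity between two token lists, as used for the
    context, table-to-context, URL and table-to-table features."""
    wa = {t: c * idf.get(t, 0.0) for t, c in Counter(terms_a).items()}
    wb = {t: c * idf.get(t, 0.0) for t, c in Counter(terms_b).items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

ctx_t2 = "ready made correction data for cameras and lenses".split()
ctx_t4 = "camera bags compatibility list cameras".split()
idf = {"cameras": 2.0, "correction": 1.5, "bags": 1.5}  # toy idf values
print(tfidf_cosine(ctx_t2, ctx_t4, idf))  # > 0: both contexts share 'cameras'
```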
There are two major challenges in building the SMW graph: (i) computing the
pairwise features that scales to hundreds of millions of tables and, (ii) getting labeled
pairs of web tables to train a classifier. We address these challenges in the following
two subsections.
Scalable Computation of Pairwise Features
Note that we are computing these features for 573M × 573M web table pairs
and, obviously, we cannot do the cross product computation. Our key insight here
is that, for each of the mentioned features, the web table can be considered as a bag
of words (or a document). We can then leverage scalable techniques for computing
Table 5.1: Web table matching features as documents.

Feature name       Document
Context            Terms in the text around the web table, with idf weights
Table-to-Context   The table content as text and the context text, with idf weights
URL                The terms in the URL, with idf weights computed from the set of all URLs
Tuples             All distinct table rows (key-value pairs) form the terms of a document, with equal weights
Attribute names    The terms in the column names, with equal weights
Column values      All distinct values in a column form the terms of a document, with equal weights
Table-to-Table     The table content as text, with idf weights
pairwise document similarities over a large document collection. Table 5.1 describes
the mapping of a web table to a document for each feature.
We leverage the technique described in [87] to compute the document similarity
matrix of a large document set using MapReduce. The technique can be summarized
as follows: Each document d contains a set of terms and can be represented as a vector
Wd of term weights wt,d. The similarity between two documents is the inner product
of the term weights: sim(d_1, d_2) = \sum_{t \in d_1 \cup d_2} w_{t,d_1} \cdot w_{t,d_2}. The key observation here
is that a term t contributes to the similarity of two documents d1, d2 iff t ∈ d1
and t ∈ d2. If we have an inverted index I, we can easily get the documents I(t) that
contain a particular term t. For each pair of documents ⟨di, dj⟩ ∈ I(t) × I(t), sim(di, dj)
is incremented by w_{t,d_i} · w_{t,d_j}. By processing all the terms, we compute the
entire similarity matrix without the expensive cross-product computation.
This can be implemented directly as two MapReduce tasks: (1) Indexing : The
mapper processes each document d and emits for each term t ∈ d (key = t, value
= (d, wt,d)). The reducer outputs the term as the key and the list of documents
containing that key (key = t, value =I(t)). (2) Similarity computation: The mapper
processes each term with its list of documents, (t, I(t)), and emits for each pair of
documents ⟨di, dj⟩ ∈ I(t) × I(t) and i < j (key = ⟨di, dj⟩, value = wt,di · wt,dj).
Finally, the reducer does the summation to output sim(di, dj) (key = ⟨di, dj⟩,
value = sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}). For more efficiency, a df-cut is
used to eliminate terms with high document frequency [87].
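The two jobs can be simulated in memory to see why the inverted index avoids the cross product; this is only a sketch, as the real computation of [87] runs as MapReduce over hundreds of millions of documents.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarities(doc_weights):
    """Compute sim(d_i, d_j) for all document pairs that share a term.
    doc_weights: doc id -> {term: weight w_{t,d}}."""
    index = defaultdict(list)                  # job 1: indexing
    for d, terms in doc_weights.items():
        for t, w in terms.items():
            index[t].append((d, w))
    sim = defaultdict(float)                   # job 2: similarity
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            sim[(di, dj)] += wi * wj           # term-wise contribution
    return dict(sim)

docs = {"d1": {"x": 1.0, "y": 2.0}, "d2": {"y": 3.0}, "d3": {"z": 1.0}}
print(pairwise_similarities(docs))  # {('d1', 'd2'): 6.0}
```

Note that d3 shares no term with the others, so no pair involving it is ever generated.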
Getting Labeled Pairs of Web Tables
We mentioned earlier that we rely on a classification model to get the matching
score of two tables given their similarity feature vector. The challenge here is to
obtain labeled examples to train the classifier. One way is to use a human to manually
label a random set of pairs of web tables. However, this is painful and
time-consuming. We propose an automatic way to obtain labeled pairs of web tables.
To label a pair of web tables (T, T′) as a positive example, our hypothesis is that
T and T′ may not have records in common, but a third web table T′′ has some
records in common with each of T and T′ individually (we call it a labeling web table). For
example, consider tables T2 and T4 in Figure 5.2. T1 is found to be a labeling web
table for them: T1 overlaps with T2 on one record (DSC W570, Sony), and
overlaps with T4 on one record (Optio E60, Pentax).
We formalize our hypothesis as follows: A pair of tables Ti(Ki, Bi) and Tj(Kj, Bj)
is a positive example pair if there exists a web table TL(KL, BL) (the labeling web table)
such that (i) the sets of overlapping records satisfy |TL ∩ Ti| ≥ θ and |TL ∩ Tj| ≥ θ,
and (ii) for each record tL ∈ TL and each record ti ∈ Ti (resp. tj ∈ Tj), if
tL[KL] = ti[Ki] (resp. tL[KL] = tj[Kj]), then tL[BL] = ti[Bi] (resp. tL[BL] = tj[Bj]).
The second condition guarantees that if Ti shares
a key with TL, then the values of the other attribute must match. If we do not find
such a labeling web table, the web table pair is considered a negative example.
One might think that the labeling web table approach could itself be used to generate
all the pairwise semantic matches to build the SMW graph, but this is too expensive
to do for 573M × 573M pairs of web tables. However, using the labeling web
table approach to generate a few thousand examples to train a classifier is feasible.
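The labeling hypothesis can be sketched as follows; exact equality stands in for ≈, and the table contents (beyond the two overlapping records quoted above) are hypothetical.

```python
def overlap(ta, tb):
    """Number of full records (key, value) shared by two tables."""
    return sum(1 for k in ta.keys() & tb.keys() if ta[k] == tb[k])

def consistent(tl, t):
    """Condition (ii): on every shared key, the values must agree."""
    return all(tl[k] == t[k] for k in tl.keys() & t.keys())

def is_positive_pair(ti, tj, corpus, theta=1):
    """Label (ti, tj) positive iff some labeling table TL overlaps each of
    them in at least theta records and never contradicts either."""
    return any(
        overlap(tl, ti) >= theta and overlap(tl, tj) >= theta
        and consistent(tl, ti) and consistent(tl, tj)
        for tl in corpus if tl is not ti and tl is not tj)

# T1 labels the pair (T2, T4), as in Figure 5.2 (extra rows are invented)
t1 = {"DSC W570": "Sony", "Optio E60": "Pentax", "S80": "Nikon"}
t2 = {"DSC W570": "Sony", "A10": "Canon"}
t4 = {"Optio E60": "Pentax", "D3100": "Nikon"}
print(is_positive_pair(t2, t4, [t1, t2, t4]))  # True
```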
5.4.2 Computing FPPR on SMW Graph
Once the SMW graph is constructed, we compute the full personalized pagerank
matrix. There are two broad approaches to compute personalized pagerank. The
first approach is to use linear algebraic techniques, such as Power Iteration [88]. The
other approach is Monte Carlo, where the basic idea is to approximate Personalized
Pagerank by directly simulating the corresponding random walks and then estimating
the stationary distributions with the empirical distributions of the performed walks.
We use the recently proposed MapReduce algorithm to compute the FPPR [89].
It is based on the Monte Carlo approach. The basic idea is to very efficiently compute
single random walks of a given length starting at each node in the graph. Then these
random walks are used to efficiently compute the PPR vector for each node.
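The Monte Carlo idea can be sketched by simulating a long random walk with restarts for a given source node; this toy in-memory version only illustrates the estimator, whereas [89] computes all walks at scale in MapReduce.

```python
import random

def monte_carlo_ppr(edges, source, steps=20000, eps=0.15, seed=0):
    """Estimate the PPR vector of `source` from the visit frequencies of
    a random walk that restarts at `source` with probability eps.
    edges: node -> list of (neighbor, weight), weights summing to 1."""
    rng = random.Random(seed)
    visits = {}
    node = source
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        if rng.random() < eps or not edges.get(node):
            node = source                        # teleport back to source
        else:
            nbrs, weights = zip(*edges[node])
            node = rng.choices(nbrs, weights=weights)[0]
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

ppr = monte_carlo_ppr({"a": [("b", 1.0)], "b": [("a", 1.0)]}, "a")
# analytic answer for this graph: pi_a(a) ≈ 0.54, pi_a(b) ≈ 0.46
```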
5.5 Supporting Core Operations
We discuss how we support the core operations using our holistic matching frame-
work. Note that for each operation, we re-define the DMA score.
5.5.1 Augmentation-By-Attribute (ABA)
We have discussed the ABA operation already in Section 5.2. Here, we give
the details of the three query-time steps abstracted in Section 5.3. We present the
pseudocode for the ABA operation in Algorithm 5.1.
Algorithm 5.1 ABA(Query table Q(K,A))
1: ∀ q ∈ Q, Pq = {}
2: R = WIK(Q)
3: R = R ∩ WIA(Q)   {Relevant web tables.}
4: for all T ∈ R do
5:   for all q ∈ Q and t ∈ T, s.t. q[Q.K] ≈ t[T.K] do
6:     Pq = Pq ∪ {(v = t[T.B], ST(v) = S(T))}
7:   end for
8: end for
9: ∀ q ∈ Q, fuzzy group Pq to get Gq
10: for all q ∈ Q do
11:   ∀ g ∈ Gq, s.t. v = centroid(g), S(v) = F_{(xi, STi(xi)) ∈ Pq | xi ≈ v} STi(xi)
12: end for
13: ∀ q ∈ Q, q[Q.A] = argmax_v S(v)
• Q1: Identifying the seed tables: The seed tables for Q(K,A) are identified using
the WIK and WIA indexes such that a web table T (K,B) is considered if there
is at least one key overlap and Q.A ≈ T.B. The DMA scores are computed
using Eq. 5.3.
• Q2: Computing the tables' TSP scores: This step is identical to step Q2 in Section 5.3.
• Q3: Aggregating and processing values: This step is identical to the predict
values step in the augmentation framework discussed in Section 5.2.
5.5.2 Augmentation-By-Example (ABE)
This is a variation of the ABA operation. Instead of providing the augmenting
attribute name, the user provides the query table with some complete records as
examples, i.e., for some of the keys, she provides the values on the augmenting
attribute (e.g., Figure 5.1(b)).
Definition 5.5.1 Augmenting-By-Example (ABE): Given query table Q(K,A) =
Qc ∪ Qe, where Qc denotes the set of records {qc ∈ Q | qc[A] ≠ null} (referred to as
example complete records) and Qe denotes the set of records {qe ∈ Q | qe[A] = null}
(referred to as incomplete records), predict the value of each incomplete record qe ∈ Qe
on attribute A.
The three query-time steps for the ABE operation are identical to those of the ABA
operation, except for the way we identify the seed tables and compute the DMA scores.
DMA considers a web table T to match the query table Q iff the records in Qc
overlap with those in T. For example, in Figure 5.2, table T1 is considered a seed table
for the query table illustrated in Figure 5.1(b), because they overlap on the record
(S80, Nikon). Given the query table, we use the WIKV index to get the seed tables
efficiently.
Intuitively, a web table T should be assigned a high DMA score if, for each shared
key between T and Qc, the two tables agree on the value of the augmenting attribute
as well. Accordingly, we redefine the DMA matching score as the fraction of the
shared keys that agree on the value of the augmenting attribute:

S_{DMA}(T) = \frac{|Q_c \cap_{KV} T|}{|Q_c \cap_K T|}    (5.8)

where |Qc ∩KV T| denotes the number of overlapping records between the complete
records of the query table Q and the web table T, and |Qc ∩K T| denotes the
number of shared keys.
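Eq. 5.8 can be sketched directly; exact equality stands in for ≈, and the record contents are illustrative.

```python
def abe_dma_score(qc, t):
    """ABE direct-match score of Eq. 5.8: the fraction of keys shared by
    the complete query records Qc and web table T on which both tables
    agree about the augmenting attribute's value.
    qc, t: dicts mapping key -> value."""
    shared = qc.keys() & t.keys()                     # Qc ∩_K T
    if not shared:
        return 0.0
    agree = sum(1 for k in shared if qc[k] == t[k])   # Qc ∩_KV T
    return agree / len(shared)

qc = {"S80": "Nikon"}                      # example complete record
t1 = {"S80": "Nikon", "DSC W570": "Sony"}
print(abe_dma_score(qc, t1))  # 1.0
```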
5.6 Handling n-ary Web Tables
Throughout our discussion, we have assumed that the web tables are entity-attribute
binary (EAB) relations. This lets us work with a simpler graph that has a single
score between nodes, and enables us to model the problem as a TSP problem.
If we consider n-ary web tables and use a single score between nodes, a matching
score between the query table and a web table would not say which column of the web
table is the desired augmenting attribute.
[Figure: (a) histogram of the number of columns per web table; (b) statistics: 573M web tables, 3.09 columns on average, 26.54 rows on average.]

Fig. 5.4.: The distribution of the number of columns per web table and statistics
about the relational web tables
In practice, not all the tables on the web are binary relations. Fortunately, relational
tables on the web are meant for human consumption and usually have a subject
column [19, 51]. According to [19], there are effective heuristics to identify a
web table's subject: for example, using the web search log, where the subject column
name will have high search query hits (i.e., the subject column name appears in the
search query that hits the web page containing the web table); also, the subject
usually appears in the leftmost column. If the subject column can be identified,
then we split the table into several EAB relations, i.e., pairing the subject column
with each of the other columns yields a set of EAB relations. The main assumption
we make on the web table is that the subject appears in a single column; we do not
consider multiple columns as subjects.
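The subject-based split described above can be sketched as follows; the table representation (a list of column names plus row tuples) and the function name are assumptions made for illustration:

```python
def split_on_subject(columns, rows, subject_idx=0):
    # Pair the subject column with every other column, producing one
    # entity-attribute binary (EAB) relation per non-subject column.
    eab_relations = []
    for j, col in enumerate(columns):
        if j == subject_idx:
            continue
        schema = (columns[subject_idx], col)
        pairs = [(row[subject_idx], row[j]) for row in rows]
        eab_relations.append((schema, pairs))
    return eab_relations
```

An n-column table thus yields n − 1 EAB relations when the subject column is known; when it is not, every column pairing would have to be emitted instead, as discussed next.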
In this work, we do not make this assumption; in this case, we split an n-ary web
table into (n − 1)² EAB relations, where all possible pairs of columns are considered
EAB relations. Our study shows the feasibility of doing so. Figure 5.4(a) shows the
distribution of the number of columns per relational web table. The average is 3.1;
about 54% of the tables are binary and 70% are either binary or ternary relations.
5.7 Experimental Evaluation
We present an experimental evaluation of the techniques proposed in this chapter
for the ABA and ABE operations. The goals of the study are:
• To compare the holistic matching approach with DMA, DMA with attribute
synonyms, and the state-of-the-art approach, Octopus [20], in terms of precision
and coverage for the ABA operation
• To compare the holistic matching approach with DMA in terms of precision and
coverage for the ABE operation
• To study the sensitivity of the quality (precision and coverage) of the approaches
to “head” vs. “tail” query entities
• To study the sensitivity of quality to the number of example complete records
for the ABE operation
• To evaluate the (direct) impact of our novel features (context, table-to-context,
URL, and tuple similarities) on the quality of the SMW graph, as well as their
(indirect) impact on the quality of the ABA operation
• To evaluate the holistic approach in terms of query response times and compare
with Octopus
5.7.1 Experimental Setting
Implementation: We implemented the InfoGather system described in Fig-
ure 5.3. In the offline preprocessing step, we extracted 573M entity-attribute HTML
tables from a recent snapshot (July 2011) of the Microsoft Bing search engine; such
snapshots are available in the internal MapReduce clusters within Microsoft. We then
built the WIK, WIKV and WIA indexes, built the SMW graph, and computed the
T2PPV and T2Syn indexes. We performed all these steps in our MapReduce clusters as
Table 5.2: Query entity domains and augmenting attributes
Dataset name Entity (Key attribute) Augmenting attribute
Cameras Camera model Brand
Movies Movie Director
baseball Baseball team Player
albums Musical band Album
uk-pm UK parliament party Member of parliament
us-gov US state Governor
discussed in Section 5.3. To build the SMW graph, we used the solution described
in [87] with a df-cut of 99.9%. We stored the indexes (WIK, WIKV, WIA and
T2PPV) on a single machine for query processing. For this purpose, we used an
Intel x64 machine with eight 2.66GHz Intel Xeon processors and 32GB RAM, running
Windows 2008 Server.
Datasets: We conducted experiments on 6 datasets shown in Table 5.2. For
example, for the cameras dataset: the ABA operation augments the brand given a
set of camera model names and the string “brand”, and the ABE operation augments
the brands of a set of camera model names given a set of (model, brand) pairs. Toy
examples of inputs and outputs for this dataset are shown in Figure 5.1.
We chose 4 datasets (baseball, albums, uk-pm, us-gov) that were also used to
evaluate Octopus. We compiled the complete ground truth for these datasets by
manually identifying a knowledgebase and extracting the desired information from it.
For example, for baseball, we got the “all-time roster” for a randomly chosen set of
12 baseball teams from Wikipedia; for albums, we got all the albums for a randomly
chosen set of 14 bands from Freebase. We chose two additional datasets (cameras,
movies) for which we had complete ground truth (from the Microsoft Bing Shopping
product catalog and the IMDB database, respectively). One distinguishing characteristic
of these two datasets is that the augmenting attribute has a 1:1 relationship with
the key (as opposed to 1:n in the above four datasets). We generate a query table by
randomly selecting keys from the ground truth. For movies we use a query table
of 6,000 keys and for cameras 1,000 keys. All our results are averaged over 5 such query
tables. We use F = sum in Eq. 5.1 for all our experiments.
Measures: Since some of the datasets have 1:n relationships between the key and
the augmenting attribute, we generalize the precision and coverage measures defined in
Section 5.1. We first compute the precision and coverage for each key as
follows:
precision = (# values correctly predicted) / (# values predicted)

coverage = (# values predicted) / (# values in ground truth)
We average over all the keys in the query table Q to obtain the precision and coverage
for Q. Recall that if the ground truth has k values for a key on the augmenting
attribute, the augmentation framework selects the top-k values for that key.
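The generalized measures can be sketched as below; the function names and the dict-of-sets representation of predictions and ground truth are illustrative assumptions:

```python
def key_scores(predicted, truth):
    # Per-key measures: precision over the values actually predicted,
    # coverage over the values in the ground truth (both are sets).
    precision = len(predicted & truth) / len(predicted) if predicted else 0.0
    coverage = len(predicted) / len(truth)
    return precision, coverage

def table_scores(pred_by_key, truth_by_key):
    # Average the per-key measures over all keys in the query table Q.
    scores = [key_scores(pred_by_key.get(k, set()), truth)
              for k, truth in truth_by_key.items()]
    n = len(scores)
    precision = sum(p for p, _ in scores) / n
    coverage = sum(c for _, c in scores) / n
    return precision, coverage
```

Note that coverage counts all predicted values, correct or not; since the framework keeps only the top-k predictions (k being the number of ground-truth values), coverage never exceeds 1.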
5.7.2 Experimental Results
Evaluating Augmentation-By-Attribute (ABA): We implemented four dif-
ferent approaches for ABA:
• Holistic: This is our approach using TSP.
• DMA: This is the direct matching approach.
• DMA with attribute synonyms: This is the DMA approach where we use a set
of synonyms for the augmenting attribute. A web table T(K, B) will be used
for prediction if its keys overlap with those in the query table and Q.A matches
any of the synonyms of T.B. We obtain the synonyms using the state-of-the-art
technique based on the attribute correlation statistics database (ACSDb), as
described in [19]. The algorithm requires a context attribute name for each
dataset; we provide the key attribute name as the context. We refer to this
approach as DMA-ACSDbSyn.
Fig. 5.5.: Augmenting-By-Attribute (ABA) evaluation: (a) ABA precision; (b) ABA
coverage.
• Octopus: This is the EXTEND operation using the MultiJoin algorithm intro-
duced in [20], the state-of-the-art approach for ABA using web tables. Given a
query table and an attribute name a, MultiJoin composes a web search query of
the form “k a” for each key k in the query table. All the web tables in the
web pages returned by these search queries are then obtained and clustered
according to their schema similarity. Finally, the cluster that best covers the
query table is selected, and each cluster member is joined with the query table
to augment the values.
Figure 5.5 reports the precision and coverage. The Holistic approach significantly
outperforms all other approaches in terms of both precision and coverage. The average
precision (over all 6 datasets) of Holistic is 0.79, compared with 0.65, 0.42, and 0.39
for DMA, DMA-ACSDbSyn, and Octopus, respectively. The average coverage
(over all 6 datasets) of Holistic is 0.97, compared with 0.36, 0.59, and 0.38
for DMA, DMA-ACSDbSyn, and Octopus, respectively. This shows that considering
indirect matches and computing the scores holistically improves both precision and
coverage.
DMA demonstrates high precision on all the datasets except cameras, where it
was 60%; the main limitation of DMA is coverage, as it does not consider indirectly
matching tables.
DMA-ACSDbSyn has lower precision than DMA, due to the quality of
the synonyms used. We manually inspected the synonyms we obtained from the ACSDb;
there were almost no meaningful synonyms in the top 20 for the cameras and movies
datasets. This is because DMA-ACSDbSyn uses only schema-level correlations to
compute synonyms; attribute names are often ambiguous (e.g., the attribute name
“name”), leading to many spurious synonyms.
Octopus demonstrates low precision as well as low coverage on all the datasets,
except for cameras, where the precision was high, and for the us-gov dataset, where
the coverage was high. On average the coverage is about 33%, which matches the
results reported in [20]. Octopus uses the web search API to retrieve matching ta-
bles; since web search is not designed for matching tables, in many cases the top 1000
returned URLs did not provide any matching tables. Furthermore, Octopus's archi-
tecture does not consider indirectly matching tables and does not perform holistic
matching.
Evaluating Augmentation-By-Example (ABE): We study the sensitivity to
the number of example complete records, as well as to the nature of the
provided examples in terms of being famous (head) or rare (tail) examples. Head
(tail) examples are records that appear in a high (low) number of web tables.
In Figure 5.6, we report the precision and coverage of the augmented values for
the query table using the Holistic and DMA approaches as we increase the number of
example complete records from 1 to 50. We report the results for the cameras
and movies datasets; for the other datasets, we observed quite similar results.
Fig. 5.6.: Sensitivity of the precision and coverage to the number of examples, for
Holistic and DMA on the cameras and movies datasets: (a) precision; (b) coverage.
The Holistic approach shows high precision and maintains high coverage in
comparison to DMA.
In Figure 5.6(a), the reported precision is high for both datasets using the two
approaches. However, in Figure 5.6(b), the Holistic approach significantly outperforms
DMA in coverage when there are very few example complete records (between 1 and
10). The Holistic approach provides values for about 99% and 93% of the incomplete
records for the movies and cameras datasets, respectively, while DMA provides coverage
in the range of 20% to 75% for up to 10 examples. This shows that even with a small
number of example complete records, the Holistic approach can find enough tables to
augment the query table, while DMA cannot.
The number of example complete records is not the only factor impacting the cov-
erage. The frequency of the examples in the web tables also impacts the performance.
We note that the distribution of query records over the web tables follows a power law.
Hence, if the example complete records appear in many web tables (head or famous
query records), then we will directly match many web tables, increasing the cover-
age. On the other hand, if the examples are tail records, there will be very few directly
matching web tables.
In Figure 5.7, we perform a joint sensitivity analysis of both the number of example
complete records and the nature of the records (head, tail, or mid, where mid records
fall in the middle). We report the results for 2, 10, and 50 examples.
Fig. 5.7.: Joint sensitivity analysis of the number of examples and the head vs. tail
records in the web tables. The Holistic approach is robust in comparison to DMA.
The precision results were high and similar for all the datasets. The figure shows the
coverage for the movies dataset; we observed similar behavior for the other datasets.
DMA is sensitive to both the number of example records and their nature: the
coverage degrades as we decrease the number of examples, and it degrades further if
the examples are tail records. On the other hand, the Holistic approach does well and
maintains a coverage of 99% even in the hard setting of a small number of tail example
records.
Impact of new features for building the SMW graph: The objective of this exper-
iment is to evaluate the usefulness of our proposed set of features for matching web
tables in comparison to features previously used in the literature. In our evaluation,
we compare four techniques: (1) SMW Graph: This represents our full proposed set of
features. (2) Traditional: This represents the schema- and instance-level features; we
use features that capture the similarity between attribute names, as well
as the similarity between the values in each column. (3) WTCluster: This represents
the features introduced in [20] for matching web tables, namely the table
text similarity and column widths similarity. (4) Traditional & WTCluster: This
combines the previous two sets of features.

Fig. 5.8.: Web tables matching accuracy: (a) direct schema matching evaluation
using the proposed feature set; (b) the impact of the schema matching quality on the
ABA operation.

We first compare the above techniques on the quality of the SMW graph. In this
experiment, we randomly picked 500 web tables relevant to each of the datasets,
computed the feature values, and labeled each pair of web tables as a match or not
(using our automatic labeling technique described in Section 5.4.1). We created a
balanced set of examples (i.e., almost equal numbers of negative and positive
examples). We trained a classifier using the examples and reported in Figure 5.8(a)
Fig. 5.9.: Response time evaluation: query response time (sec, log scale) vs. number
of query records (K), for InfoGather and Octopus.
the model’s accuracy per dataset, in addition to the overall accuracy. The displayed
results are the average of 5 different runs of the above procedure.
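The balancing and averaging in this procedure can be sketched as follows; the helper names and the (features, label) pair encoding are assumptions, and the classifier itself is left abstract since the chapter does not specify one:

```python
import random

def balance(labeled_pairs):
    # Downsample to equal numbers of positive (label 1) and negative
    # (label 0) web-table pairs, as in the experiment setup.
    pos = [p for p in labeled_pairs if p[1] == 1]
    neg = [p for p in labeled_pairs if p[1] == 0]
    n = min(len(pos), len(neg))
    return random.sample(pos, n) + random.sample(neg, n)

def accuracy(classify, labeled_pairs):
    # Fraction of table pairs whose predicted match label is correct.
    return sum(classify(x) == y for x, y in labeled_pairs) / len(labeled_pairs)

def mean_accuracy(classify, labeled_pairs, runs=5):
    # The reported numbers average this procedure over 5 runs.
    return sum(accuracy(classify, balance(labeled_pairs))
               for _ in range(runs)) / runs
```

Because the evaluation set is balanced, a trivial always-match classifier scores exactly 0.5, so the accuracies reported below reflect genuine discriminative power of the feature sets.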
In general, the web table matching accuracy using our set of features, SMW
Graph, shows the best performance. SMW Graph yields about a 6% improvement over
Traditional and about a 10% improvement over WTCluster. SMW Graph also out-
performs the combined set of features, Traditional & WTCluster, by about 3%. These
results demonstrate the importance of the newly introduced features for matching web
tables.
The SMW Graph technique consistently outperforms the other techniques,
whereas the other techniques are not consistently reliable. For example, in the uk-pm
dataset, the Traditional technique performs better than WTCluster; the situation
is reversed in the us-gov dataset.
In Figure 5.8(b), we evaluate the impact of our features on the quality of the
ABA operation; that is, whether the improved quality of the SMW graph translates
into better-quality ABA (an indirect evaluation of the features). Here, we report only
the precision, as we obtain the same coverage using each of the feature sets. Note the
similarity between Figures 5.8(a) and 5.8(b): the SMW Graph features yield a
better-quality SMW graph and, hence, better precision.
Efficiency evaluation: In Figure 5.9, we evaluate the efficiency of our approach
and architecture in comparison with the Octopus approach for the ABA operation
on the cameras dataset. We obtained similar performance with the other datasets.
Our implementation of Octopus uses a web search engine API. We report
the query response time as we increase the size of the query table. Our approach
takes milliseconds to respond and is 4 orders of magnitude faster than Octopus.
Our fast response time is due to the fast computation of the TSP scores using
the pre-computed PPR vectors and indexes that we introduced in Section 5.3. For
Octopus, most of the time is spent processing the web search queries, as it is based
on SOAP request/response communication. As mentioned in [20], Octopus
could be implemented more efficiently if web search engines supported Octopus-specific
operations; however, current search engines do not.
In summary, our experiments show that our holistic matching framework and pro-
posed system architecture can support the three core operations with high precision,
high coverage and interactive response times.
5.8 Summary
In this chapter, we presented the InfoGather system to automate information
gathering tasks, such as augmenting entities with attribute values and discovering at-
tributes, using web tables. Our experiments demonstrate the superiority of our tech-
niques compared to the state-of-the-art.
6. SUMMARY
This dissertation addresses the data cleaning problem from a practical and pragmatic
viewpoint. The main goal is to introduce techniques that efficiently involve users and
the WWW, as well as handle large-scale databases. Involving users in the data
cleaning process is necessary to guarantee accurate cleaning decisions, and automating
the process of consulting the WWW for data cleaning saves a tremendous amount of
time.
This thesis fills the gap between the theoretical research conducted on cleaning
algorithms and practical systems for data cleaning, where quality and scalability are
usually of great concern.
6.1 Summary of Contributions
This dissertation introduces four main contributions for guided data cleaning by
involving users or the WWW. First, we introduced GDR, a guided data repair frame-
work that combines the best of both worlds: user fidelity to guide the cleaning process
and existing automatic cleaning techniques to suggest cleaning updates. The
user can help explore the search space of possible cleaning updates if he/she
is consulted before certain decisions are taken. Once the user confirms a decision, any
further dependent decisions taken by the algorithm are guaranteed to be more accu-
rate, hence achieving better data quality. The ultimate goal is to achieve better data
quality with minimal user feedback; therefore, GDR learns from user feedback to
eventually replace the user and minimize his/her effort. The
key novelty we proposed in GDR is the ranking of the questions that are forwarded
to the user. For this purpose, we introduced a principled mechanism that combines
decision theory and active learning to quantify the utility and
impact of obtaining user feedback on the questions. GDR was accepted as a system
demonstration [60] at SIGMOD 2010, and a research paper [59] was published
in PVLDB 2011.
The second contribution of this dissertation is a new scalable data repair
approach based on machine learning techniques. The idea is to maximize
the data likelihood, given the learned data distributions, using a small amount of
changes to the database. We built SCARE, a scalable framework
that follows our likelihood-based repair approach. SCARE has three advantages over
previous automatic cleaning approaches: (1) it is more accurate in identifying erro-
neous values and in finding the correct cleaning updates; (2) it scales well because
it relies on a robust mechanism to partition the database and then aggregate the
final cleaning decisions from the several partitions; and (3) it does not require data
quality rules; instead, it is based on modeling the data distributions using machine
learning techniques. In comparison to quality rules discovery techniques, machine
learning techniques are more flexible and accurate in capturing the relationships
between the database attributes and values. In contrast to constraint-based data
repair approaches, which find the minimal changes to satisfy a set of data quality
rules, our likelihood-based repair approach finds a constrained amount of changes
that maximizes the data likelihood.
Our third contribution is a novel approach to involve users or entities indirectly
in a data cleaning task. In this approach, we observed that users' actions (or
behavior), which can be found in system logs, can be useful evidence for the
task of deduplicating the users themselves when they have different representations in the
system. For example, in retail stores, customers are usually identified by their
credit cards, and customers sometimes use several cards for their transactions. Similarly,
the users of a web site are usually identified by their IP addresses, which
change from time to time. The idea of our solution is to first merge the behavior
information (transaction log) for each candidate pair of entities to be matched. If the
two behaviors seem to complete one another, in the sense that stronger behavioral
patterns become detectable after the merge, then this is a strong indication
that the two entities are, in fact, the same. To this end, we developed the necessary
pattern detection and modeling algorithms and computed the matching score as the
gain (or certainty increase) in the identification of the behavior patterns after merging
the entities' behavior. Our approach was published [61] in PVLDB 2010.
The last contribution is a new approach to leverage the WWW for data cleaning.
We focused on web tables, i.e., the relational tables that can be found in web pages.
We investigated the use of web tables for the task of finding missing values in
databases and augmenting entity attributes. For example, a user may have a list of
camera models, and using our system she can find the camera brands, optical zoom
levels, and other relevant attributes. The main challenge in this work is the ambiguity
of the entities and the dirtiness of the data on the web. Therefore, our solution relies
on aggregating answers from several web tables that directly and indirectly match
the user's list of entities. This required modeling the relationships between the web
tables to accurately identify the tables relevant to the user's database. We modeled
this problem as a topic-sensitive PageRank problem [85]. We introduced a system to
extract, process, index and compute the PageRank of the web tables. This is a
data-intensive application, and our solution relies on steps performed using MapReduce
in an offline phase, so that the processing at query time is done in milliseconds. Our
approach was published [62] in SIGMOD 2012.
6.2 Future Extensions
This dissertation raises a number of research problems related to data cleaning. It
is motivated by our belief that data cleaning should be the result of close collaboration
among multiple resources, including users and other information sources such as the
WWW. In this section, we give an overview of several directions for future research.
6.2.1 User Centric Data Cleaning
We introduced GDR to involve users directly in the data cleaning process. In our
proposed solution, we assumed that the user is always correct. However, users may
give uncertain answers or even make mistakes. It is always useful to interact with
experts who provide feedback that can certainly be trusted; however, experts are
more expensive than the regular users of the data. Therefore, to involve less-expert
users in cleaning the data, it is mandatory to take the users' uncertainty into account
when they provide feedback. This may require getting answers to the same question
from a different user, or posing another similar question to the same user. Revising
the suggested updates while propagating uncertain feedback will also change, as it
must take the user's uncertainty into account. The problem of involving direct user
interaction in data cleaning is by itself challenging; taking the uncertainty of the
feedback into account adds another level of challenge, but it leads to a more realistic
setting.
Crowdsourcing has drawn a lot of attention recently, and we believe that cleaning
and improving the quality of databases should benefit from such a model. Our initial
work in guided data repair opens the door to involving the crowd in data cleaning
tasks. However, we can identify several new challenges to be addressed in this model:
for example, modeling the users and taking their uncertainty into account, identifying
when feedback on the same question needs to be aggregated to improve certainty,
globally aggregating feedback while resolving conflicting answers, and the need for
an economic model to address the trade-off between quality, cost and time.
6.2.2 Holistic Data Cleaning
Research in the data cleaning area has typically focused on a single dimension of
data quality (e.g., inconsistency, deduplication, etc.). However, a single database may
have a combination of these problems, and unfortunately, these problems interact
with each other. For example, consider deduplication together with repairing an
inconsistent database (i.e., one violating a set of constraints). If we resolve and merge
duplicate records, we may end up fixing inconsistencies in their values (or introducing
new ones), and fixing an inconsistent record may result in new duplicate records
(or separate previously identified duplicates). Our system SCARE, which we propose
in Chapter 3, focuses on cleaning dirty values and relies on blocking techniques
for data partitioning. Blocking has usually been used to improve the efficiency of
deduplication techniques. We believe that SCARE can be extended to handle both
problems: repairing dirty values, as well as identifying and merging duplicate records.
6.2.3 The WWW for Data Cleaning
We introduced in Chapter 5 an approach to leverage web tables for data cleaning.
There are multiple directions in which to extend this work. First, more structured
information can be found on the web and can be leveraged as additional sources of
information, for example, HTML lists, attribute-value tables and the deep web. The
challenge here is how to correctly extract the information in a well-formatted form
that can be easily processed and used. Second, we focused in our work only on the
data cleaning task of finding missing values; other tasks still need to be studied. For
example, the data on the web can be leveraged for deduplication. Imagine a matching
function on the web that, given two records of possibly different schemas, tells whether
the records refer to the same entity or not. The information is there on the web to
help decide whether the entities match, but the problem needs to be studied.
6.2.4 Private Data Cleaning
Owners of data in the same domain can collaborate and help each other
improve their data quality. For example, stores may collaborate to unify and repair
their information about products. The challenge here is privacy: how can
we develop a data cleaning solution that benefits from the union of the parties' data
and gives suggestions to each data owner to clean his/her data? We have already made
efforts [90, 91] in this direction and introduced solutions for performing records matching
efficiently in a privacy-preserving setting. Matching values and records is an important
and frequently used building block in data cleaning techniques. Therefore,
our effort lays down the necessary infrastructure to revisit existing data cleaning
techniques in a private setting.
LIST OF REFERENCES
[1] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Tech-niques (Data-Centric Systems and Applications). Springer, 2006.
[2] L. P. English, Information Quality Applied: Best Practices for Improving Busi-ness Information, Processes and Systems. Wiley, 2009.
[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Con-sistency and accuracy,” in Proceedings of the 33rd International Conference onVery Large Data Bases, VLDB ’07, pp. 315–326, 2007.
[4] A. Lopatenko and L. Bravo, “Efficient approximation algorithms for repairinginconsistent databases,” in IEEE 23rd International Conference on Data Engi-neering, ICDE ’07, pp. 216–225, April 2007.
[5] S. Kolahi and L. V. S. Lakshmanan, “On approximating optimum repairs forfunctional dependency violations,” in Proceedings of the 12th International Con-ference on Database Theory, ICDT ’09, pp. 53–62, 2009.
[6] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Towards certain fixes with editingrules and master data,” Proceedings of VLDB Endowment (PVLDB), vol. 3,pp. 173–184, September 2010.
[7] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: A database approach forstatistical inference and data cleaning,” in Proceedings of the 2010 InternationalConference on Management of Data, SIGMOD ’10, pp. 75–86, 2010.
[8] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study of theirimpacts,” Artificial Intellegnce Review, vol. 22, pp. 177–210, November 2004.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record de-tection: A survey,” IEEE Transactions on Knowledge and Data Engineering,vol. 19, pp. 1–16, January 2007.
[10] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, “Large scale multiplekernel learning,” Journal Machine Learning Research, vol. 7, pp. 1531–1565,December 2006.
[11] W. Fan, “Dependencies revisited for improving data quality,” in Proceedingsof the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles ofDatabase Systems, PODS ’08, pp. 159–170, 2008.
[12] I. Bhattacharya and L. Getoor, “Iterative record linkage for cleaning and inte-gration,” in Proceedings of the SIGMOD Workshop on Research Issues in DataMining and Knowledge Discovery, DMKD ’04, pp. 11–18, 2004.
162
[13] D. V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting relationships for domain-independent data cleaning,” in SIAM International Conference on Data Mining, 2005.
[14] A. Doan, Y. Lu, Y. Lee, and J. Han, “Object matching for information integration: A profiler-based approach,” in Workshop on Information Integration on the Web, IIWeb ’03, 2003.
[15] S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik, “Leveraging aggregate constraints for deduplication,” in Proceedings of the 2007 International Conference on Management of Data, SIGMOD ’07, pp. 437–448, 2007.
[16] F. Radlinski and T. Joachims, “Query chains: Learning to rank from implicit feedback,” in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 239–248, 2005.
[17] S. Holland, M. Ester, and W. Kießling, “Preference mining: A novel approach on mining user preferences for personalized applications,” in Knowledge Discovery in Databases: PKDD 2003, 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 204–216, 2003.
[18] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating user behavior information,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pp. 19–26, 2006.
[19] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “WebTables: Exploring the power of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 538–549, August 2008.
[20] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 1090–1101, August 2009.
[21] X. Yin, W. Tan, and C. Liu, “Facto: A fact lookup engine based on web tables,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 507–516, 2011.
[22] M. Arenas, L. Bertossi, and J. Chomicki, “Consistent query answers in inconsistent databases,” in Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’99, pp. 68–79, 1999.
[23] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, “A cost-based model and effective heuristic for repairing constraints by value modification,” in Proceedings of the 2005 International Conference on Management of Data, SIGMOD ’05, pp. 143–154, 2005.
[24] R. Bruni and A. Sassano, “Errors detection and correction in large scale data collecting,” in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, IDA ’01, pp. 84–94, 2001.
[25] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, pp. 90–121, February 2005.
[26] E. Franconi, A. L. Palma, N. Leone, S. Perri, and F. Scarcello, “Census data repair: A challenging application of disjunctive logic programming,” in Proceedings of the Artificial Intelligence on Logic for Programming, LPAR ’01, pp. 561–578, 2001.
[27] J. Wijsen, “Condensed representation of database repairs for consistent query answering,” in Proceedings of the 9th International Conference on Database Theory, ICDT ’03, pp. 378–393, 2003.
[28] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” ACM Transactions on Database Systems (TODS), vol. 33, pp. 6:1–6:48, June 2008.
[29] W. Fan, X. Jia, J. Li, and S. Ma, “Reasoning about record matching rules,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 407–418, August 2009.
[30] A. Arasu, C. Re, and D. Suciu, “Large-scale deduplication with constraints using Dedupalog,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 952–963, 2009.
[31] W. Fan, S. Ma, Y. Hu, J. Liu, and Y. Wu, “Propagating functional dependencies with conditions,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 391–407, August 2008.
[32] L. Bravo, W. Fan, F. Geerts, and S. Ma, “Increasing the expressivity of conditional functional dependencies without extra complexity,” in Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE ’08, pp. 516–525, 2008.
[33] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu, “On generating near-optimal tableaux for conditional functional dependencies,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 376–390, August 2008.
[34] G. Cormode, L. Golab, F. Korn, A. McGregor, D. Srivastava, and X. Zhang, “Estimating the confidence of conditional functional dependencies,” in Proceedings of the 2009 International Conference on Management of Data, SIGMOD ’09, pp. 469–482, 2009.
[35] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 1166–1177, August 2008.
[36] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, “Discovering conditional functional dependencies,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1231–1234, 2009.
[37] F. Chu, Y. Wang, D. S. Parker, and C. Zaniolo, “Data cleaning using belief propagation,” in Proceedings of the 2nd International Workshop on Information Quality in Information Systems, IQIS ’05, pp. 99–104, 2005.
[38] J. L. Y. Koh, M. L. Lee, W. Hsu, and K. T. Lam, “Correlation-based detection of attribute outliers,” in Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA ’07, pp. 164–175, 2007.
[39] H. Galhardas, D. Florescu, D. Shasha, and E. Simon, “AJAX: An extensible data cleaning tool,” in Proceedings of the 2000 International Conference on Management of Data, SIGMOD ’00, p. 590, 2000.
[40] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive data cleaning system,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 381–390, 2001.
[41] A. Doan, P. Domingos, and A. Y. Halevy, “Reconciling schemas of disparate data sources: A machine-learning approach,” in Proceedings of the 2001 International Conference on Management of Data, SIGMOD ’01, pp. 509–520, 2001.
[42] A. Doan and R. McCann, “Building data integration systems: A mass collaboration approach,” in Workshop on Information Integration on the Web, IIWeb ’03, pp. 183–188, 2003.
[43] S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, “Pay-as-you-go user feedback for dataspace systems,” in Proceedings of the 2008 International Conference on Management of Data, SIGMOD ’08, pp. 847–860, 2008.
[44] W. Wu, C. Yu, A. Doan, and W. Meng, “An interactive clustering-based approach to integrating source query interfaces on the deep web,” in Proceedings of the 2004 International Conference on Management of Data, SIGMOD ’04, pp. 95–106, 2004.
[45] S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pp. 269–278, 2002.
[46] P. Turney, “Types of cost in inductive concept learning,” in Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, pp. 15–21, 2000.
[47] F. Provost, “Toward economic machine learning and utility-based data mining,” in Proceedings of the 1st International Workshop on Utility-Based Data Mining, UBDM ’05, p. 1, 2005.
[48] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, pp. 201–221, May 1994.
[49] A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised learning with decision-theoretic active learning,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI ’07, pp. 877–882, 2007.
[50] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? Improving data quality and data mining using multiple, noisy labelers,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pp. 614–622, 2008.
[51] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu, “Recovering semantics of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 528–538, June 2011.
[52] G. Limaye, S. Sarawagi, and S. Chakrabarti, “Annotating and searching web tables using entities, types and relationships,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 1338–1347, September 2010.
[53] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” The VLDB Journal, vol. 10, pp. 334–350, December 2001.
[54] P. A. Bernstein, J. Madhavan, and E. Rahm, “Generic schema matching, ten years later,” Proceedings of VLDB Endowment (PVLDB), 2011 (VLDB 10-Year Best Paper Award).
[55] Z. Bellahsene, A. Bonifati, and E. Rahm, Schema Matching and Mapping. Springer, 1st ed., 2011.
[56] J. Madhavan, P. A. Bernstein, A. Doan, and A. Halevy, “Corpus-based schema matching,” in Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pp. 57–68, 2005.
[57] Y. He and D. Xin, “SEISA: Set expansion by iterative similarity aggregation,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 427–436, 2011.
[58] R. Gupta and S. Sarawagi, “Answering table augmentation queries from unstructured lists on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 289–300, August 2009.
[59] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, “Guided data repair,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 279–289, February 2011.
[60] M. Yakout, A. K. Elmagarmid, J. Neville, and M. Ouzzani, “GDR: A system for guided data repair,” in Proceedings of the 2010 International Conference on Management of Data, SIGMOD ’10, pp. 1223–1226, 2010.
[61] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi, “Behavior based record linkage,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 439–448, September 2010.
[62] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables,” in Proceedings of the 2012 International Conference on Management of Data, SIGMOD ’12, 2012.
[63] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in Proceedings of the 23rd International Conference on Data Engineering, ICDE ’07, pp. 746–755, 2007.
[64] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2nd ed., 2003.
[65] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, March 2002.
[66] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 204–213, 2001.
[67] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
[68] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie, “Dependency networks for inference, collaborative filtering, and data visualization,” Journal of Machine Learning Research, vol. 1, pp. 49–75, September 2001.
[69] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” in Neurocomputing: Foundations of Research, MIT Press, 1988.
[70] T. G. Dietterich, “Ensemble methods in machine learning,” in Proceedings of the 1st International Workshop on Multiple Classifier Systems, MCS ’00, pp. 1–15, 2000.
[71] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–87, March 1991.
[72] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama, “Greedily finding a dense subgraph,” Journal of Algorithms, vol. 34, pp. 203–221, February 2000.
[73] D. S. Hochbaum, “Efficient bounds for the stable set, vertex cover and set packing problems,” Discrete Applied Mathematics, vol. 6, pp. 243–254, 1983.
[74] S. Arora, D. Karger, and M. Karpinski, “Polynomial time approximation schemes for dense instances of NP-hard problems,” in Proceedings of the 27th Annual ACM Symposium on Theory of Computing, STOC ’95, pp. 284–293, 1995.
[75] U. Feige and M. Seltser, “On the densest k-subgraph problem,” Algorithmica, vol. 29, 2001.
[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[77] S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, 1997.
[78] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.
[79] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. 23, pp. 90–93, January 1974.
[80] G. K. Wallace, “The JPEG still picture compression standard,” Communications of the ACM, vol. 34, pp. 30–44, April 1991.
[81] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.
[82] J. Madhavan, P. A. Bernstein, and E. Rahm, “Generic schema matching with Cupid,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 49–58, 2001.
[83] B. He and K. C.-C. Chang, “Statistical schema matching across web query interfaces,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 217–228, 2003.
[84] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and efficient fuzzy match for online data cleaning,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 313–324, 2003.
[85] T. H. Haveliwala, “Topic-sensitive PageRank,” in Proceedings of the 11th International Conference on World Wide Web, WWW ’02, pp. 517–526, 2002.
[86] M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu, “Uncovering the relational web,” in Proceedings of the 11th International Workshop on the Web and Databases, WebDB ’08, 2008.
[87] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short ’08, pp. 265–268, 2008.
[88] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Technical Report 1999-66, Stanford InfoLab, November 1999.
[89] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on MapReduce,” in Proceedings of the 2011 International Conference on Management of Data, SIGMOD ’11, pp. 973–984, 2011.
[90] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient private record linkage,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1283–1286, 2009.
[91] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient and practical approach for private record linkage,” Journal of Data and Information Quality, 2012.
VITA
Mohamed Yakout was born in the beautiful city of Alexandria, Egypt. His interest in technology and computer science began in high school. In 1996, Mohamed joined the Faculty of Engineering at Alexandria University. After a very competitive freshman year, he was ranked among the top students and joined the Computer Science Department. After graduating with a bachelor’s degree in Computer Science, Mohamed joined the ICT technical staff of the Bibliotheca Alexandrina (Library of Alexandria) in 2001. During his time at the Bibliotheca, he participated in major digital library projects; in particular, he led projects on the digitization of Egyptian cultural heritage. In 2006, Mohamed earned an M.Sc. in Computer Science from Alexandria University, after which he began thinking about combining his industrial skills with more advanced research skills.
In 2007, he joined the graduate program at Purdue University, working with his advisor, Ahmed Elmagarmid. At Purdue, Mohamed learned to conduct world-class research, working on problems in data quality and data cleaning. His work was published in top data management conferences such as SIGMOD, VLDB, and ICDE. In the summer of 2009, Mohamed interned at Google Inc., getting his first taste of large US corporations. In the summers of 2010 and 2011, he interned with Microsoft Research, where he interacted with many world-class researchers and gained experience in large-scale, real-world data management. Mohamed’s main research interests focus on advancing technologies that improve data quality by involving external resources such as users and data on the Web. Mohamed graduated with a Ph.D. in Computer Science from Purdue University in August 2012. In 2012, he joined the technical staff of Microsoft Bing in Seattle, Washington.