Graduate School ETD Form 9 (Revised 12/07)
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By
Entitled
For the degree of
Is approved by the final examining committee:
Chair
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.
Approved by Major Professor(s): ____________________________________
____________________________________
Approved by: Head of the Graduate Program Date
Mohamed Ahmed Mohamed Ahmed Yakout
Guided Data Cleaning
Doctor of Philosophy
Ahmed K. Elmagarmid
Walid G. Aref
Luo Si
Jennifer Neville
Ahmed K. Elmagarmid
Sunil K. Prabhakar / William J. Gorman 06/15/2012
Graduate School Form 20 (Revised 9/10)
PURDUE UNIVERSITY GRADUATE SCHOOL
Research Integrity and Copyright Disclaimer
Title of Thesis/Dissertation:
For the degree of Choose your degree
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.
______________________________________ Printed Name and Signature of Candidate
______________________________________ Date (month/day/year)
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
Guided Data Cleaning
Doctor of Philosophy
Mohamed Yakout
06/23/2012
GUIDED DATA CLEANING
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Mohamed A. Yakout
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
August 2012
Purdue University
West Lafayette, Indiana
To my parents, Princisa and Ahmed, for a life of sacrifice and inspiration.
To my wife, Walaa, for years of love, dedication and support.
To my children, Jasmine and Zeyad, the light of my eyes.
ACKNOWLEDGMENTS
Coming to the end of this long journey, it is my pleasure to express my gratitude to a large number of people who have contributed, in many different ways, to making my success a part of their own.
First, I wish to express my deepest gratitude to my supervisor, Prof. Ahmed Elmagarmid. I am totally indebted to him for his continuous encouragement, efforts, and invaluable advice. He was a wonderful advisor, a great leader, and a close friend. I learned from Ahmed how to do high-quality research, how to transform my fledgling ideas into crisp research endeavors, and how to present and sell my ideas. I also learned from him how to think “out of the box” and see the value of a proposition. After all, I was truly fortunate to have Ahmed as my advisor, and I am delighted and honored to have been his student.
I will always be grateful to Prof. Walid Aref for the thoughtful discussions with him on both professional and personal levels. Whenever I needed advice or was stuck on a decision, Walid was always there with his experience and invaluable comments. I am also grateful to Dr. Mourad Ouzzani for the countless hours he spent with me on multiple research projects. He treated me as his brother and never made me feel that he was a research faculty member while I was only a graduate student.
I would like to thank Prof. Mikhail Atallah for his support and encouragement. I learned from him how teamwork is vital to solving problems efficiently. Special thanks to Prof. Jennifer Neville for her help and continuous support. I learned from her a great deal about the importance of data mining, which enabled me to find plenty of room to involve data mining techniques in my solutions.
I would also like to thank Prof. Luo Si for serving on my exam committee. He was always kind and supportive, in addition to offering insightful comments. I also want to thank Prof. Chris Clifton for his continuous help and for his useful comments on my prelim. I would also like to thank Dr. William Gorman and Renate Mallus for their dedication to students and for helping me. Dr. Gorman was always there to fill my advisor’s absence during his leave.
During my summer internships at Microsoft Research and Google, I worked with wonderful, smart people. My internship at Microsoft was a truly unforgettable experience. My sincere thanks to Kris Ganjam for being a wonderful mentor and to Dr. Kaushik Chakrabarti for sharing his advice and experience with me. My discussions with them significantly contributed to my way of thinking and of attacking real-world data problems. I also had a great experience during my internship at Google Inc. I am especially grateful to Dr. Moustafa Hammad for his wonderful mentorship. Moustafa was always ready to get into deep discussions with me on how to improve solution approaches, or even how to better implement them. On a personal level, I value my friendship with Moustafa to the greatest extent.
Special thanks are due to my friends and colleagues who made my graduate life easier. In particular, thanks to Dr. Hicham Elmongui for his continuous help and advice during my first few years of the PhD. I would also like to acknowledge Dr. Hazem Elmeleegy, Dr. Mohamed El Tabakh, Samer Barakat, Dr. Ahmed Amin, Ahmed Abdel-Gawad, Amr Ebeid and Amgad Madkour.
My sincere gratitude goes to my wife, Walaa. Walaa’s love, dedication, perseverance, and belief in me were key factors in my success. Her support is infinite and her patience is endless. She was always reliable in taking care of anything that might keep me away from studying. She gave the highest priority to me and to our kids, Jasmine and Zeyad.
My forever gratitude goes to my parents for their sacrifices, endless support, encouragement, and continuous prayers for me. I cannot be grateful enough to them. They taught me the value of respect, hard work, good judgment, and honesty. Thanks to my sisters, Rabab and Rania, for their support and advice.
Above all, I thank ALLAH. For only through ALLAH’s grace and blessing has
this pursuit been possible. I pray for ALLAH’s support and guidance in the rest of
my career and my life.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Key Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 User’s Direct Interaction for Data Cleaning . . . . . . . . . . 3
1.1.2 Scalable Data Cleaning Techniques . . . . . . . . . . . . . . 3
1.1.3 User’s Indirect Interaction for Data Cleaning . . . . . . . . . 5
1.1.4 Leveraging the WWW for Data Cleaning . . . . . . . . . . . 7
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Constraint-based Data Cleaning . . . . . . . . . . . . . . . . 8
1.2.2 Machine Learning Techniques for Data Cleaning . . . . . . . 9
1.2.3 Involving Users in the Data Cleaning Process . . . . . . . . 10
1.2.4 WWW for Data Integration and Data Cleaning . . . . . . . 12
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 GUIDED DATA REPAIR . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Problem Definition and Solution Overview . . . . . . . . . . . . . . 20
2.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Solution Overview . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Generating Candidate Updates . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Dirty Tuples Identification and Updates Discovery: . . . . . 24
2.3.2 Updates Consistency Manager . . . . . . . . . . . . . . . . . 28
2.3.3 Grouping Updates . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Ranking and Displaying Suggested Updates . . . . . . . . . . . . . 31
2.4.1 VOI-based Ranking . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.2 Active Learning Ordering . . . . . . . . . . . . . . . . . . . 36
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 VOI Ranking Evaluation . . . . . . . . . . . . . . . . . . . . 41
2.5.2 GDR Overall Evaluation . . . . . . . . . . . . . . . . . . . . 43
2.5.3 User Efforts vs. Repair Accuracy . . . . . . . . . . . . . . . 47
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 SCALABLE APPROACH TO GENERATE DATA CLEANING UPDATES . . 49
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Problem Definition and Solution Approach . . . . . . . . . . . . . . 52
3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.2 Solution Approach . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Modeling Dependencies and Predicting Updates . . . . . . . . . . . 57
3.3.1 Modeling Dependencies . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Predicting Updates . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Scaling Up the Maximal Likelihood Repairing Approach . . . . . . 62
3.4.1 Process Overview . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Repair Generation Phase . . . . . . . . . . . . . . . . . . . . 64
3.4.3 Tuple Repair Selection Phase . . . . . . . . . . . . . . . . . 67
3.4.4 Approximate Solution for Tuple Repair Selection . . . . . . 72
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.1 Repair Quality Evaluation . . . . . . . . . . . . . . . . . . . 76
3.5.2 SCARE Scalability . . . . . . . . . . . . . . . . . . . . . . . 82
3.5.3 SCARE vs. ERACER to Predict Missing Values . . . . . . . 83
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4 INDIRECT GUIDANCE FOR DEDUPLICATION (BEHAVIOR BASED RECORD LINKAGE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Behavior Based Approach . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.3 Pre-processing and Behavior Extraction . . . . . . . . . . . 90
4.2.4 Matching Strategy . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Candidate Generation Phase . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Accurate Matching Phase . . . . . . . . . . . . . . . . . . . . . . . 100
4.4.1 Statistical Modeling Technique . . . . . . . . . . . . . . . . 100
4.4.2 Information Theoretic technique (Compressibility) . . . . . . 108
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5.1 Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5 HOLISTIC MATCHING WITH WEB TABLES FOR ENTITIES AUGMENTATION AND FINDING MISSING VALUES . . . . . . . . . . . . . . . . 120
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Holistic Matching Framework . . . . . . . . . . . . . . . . . . . . . 127
5.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2.2 General Augmentation Framework . . . . . . . . . . . . . . 128
5.2.3 Direct Match Approach . . . . . . . . . . . . . . . . . . . . 129
5.2.4 Holistic Match Approach . . . . . . . . . . . . . . . . . . . . 130
5.3 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.4 Building the SMW Graph and computing FPPR . . . . . . . . . . . 135
5.4.1 Building the SMW Graph . . . . . . . . . . . . . . . . . . . 135
5.4.2 Computing FPPR on SMW Graph . . . . . . . . . . . . . . 141
5.5 Supporting Core Operations . . . . . . . . . . . . . . . . . . . . . . 141
5.5.1 Augmentation-By-Attribute (ABA) . . . . . . . . . . . . . . 141
5.5.2 Augmentation-By-Example (ABE) . . . . . . . . . . . . . . 142
5.6 Handling n-ary Web Tables . . . . . . . . . . . . . . . . . . . . . . 143
5.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 145
5.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . 145
5.7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 147
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 155
6.2 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2.1 User Centric Data Cleaning . . . . . . . . . . . . . . . . . . 158
6.2.2 Holistic Data Cleaning . . . . . . . . . . . . . . . . . . . . . 158
6.2.3 The WWW for Data Cleaning . . . . . . . . . . . . . . . . . 159
6.2.4 Private Data Cleaning . . . . . . . . . . . . . . . . . . . . . 159
LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
LIST OF TABLES
Table Page
5.1 Web tables matching features as documents. . . . . . . . . . . . . . . . 139
5.2 Query entity domains and augmenting attributes . . . . . . . . . . . . 146
LIST OF FIGURES
Figure Page
2.1 Example data and rules . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 GDR Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strategies against the amount of feedback. Feedback is reported as the percentage of the maximum number of verified updates required by an approach. Our application of the VOI concept shows superior performance compared to other naïve ranking strategies. . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Overall evaluation of GDR compared with other techniques. The combination of the VOI-based ranking with the active learning was very successful in efficiently involving the user. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . 45
2.5 Accuracy vs. user efforts. As the user spends more effort with GDR, the overall accuracy is improved. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . . . . . . . 48
3.1 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Generated predictions for tuple repairs with their corresponding prediction probabilities for tuple t4 in Figure 3.1. . . . . . . . . . . . . . . . . 67
3.3 Step by step demonstration of the SelectTupleRepair algorithm. At each iteration, the vertex with minimum weighted degree is removed as long as it is not the only vertex in its corresponding vertex set. . . . . . . . . . 70
3.4 Quality vs. the percentage of errors: SCARE maintains high precision by making the best use of δ, the allowed amount of changes. . . . . . . . . 76
3.5 δ controls the amount of changes to apply to the database: small δ guarantees high precision at the cost of the recall, and vice versa. . . . . . . 78
3.6 Using SCARE in an iterative way helps improve the recall and the overall quality of the updates. The decrease in the precision is small compared to the increase in the recall, achieving an overall high quality improvement demonstrated by the f-measure. . . . . . . . . . . . . . . . . . . . . . . 80
3.7 Increasing the number of partition functions |H| improves the accuracy of the predictions and hence increases the precision. The recall is not affected much because we use a fixed δ. . . . . . . . . . . . . . . . . . . . . . . 81
3.8 SCARE scalability when varying the database size. . . . . . . . . . . . 82
3.9 Comparison between SCARE and ERACER to predict missing values. Generally, both SCARE and ERACER show high accuracy in predicting the missing values. SCARE uses in this experiment a Naïve Bayesian model, while ERACER leverages domain knowledge interpreted in a carefully designed Bayesian Network. . . . . . . . . . . . . . . . . . . . . . 84
4.1 Process for behavior-based record linkage. . . . . . . . . . . . . . . . . 89
4.2 Retail store running example. . . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Actions patterns in the complex plane and the effect on the magnitude. 98
4.4 Behavior linkage overall quality. . . . . . . . . . . . . . . . . . . . . . . 112
4.5 Improving the textual matching quality. . . . . . . . . . . . . . . . . . 114
4.6 Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.7 Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.8 Behavior linkage performance. . . . . . . . . . . . . . . . . . . . . . . . 118
5.1 APIs of the core operations . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2 ABA operation using web tables . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 InfoGather System Architecture . . . . . . . . . . . . . . . . . . . . . . 135
5.4 The distribution of the number of columns per web table and statistics about the relational web tables. . . . . . . . . . . . . . . . . . . . . . . 144
5.5 Augmenting-By-Attribute (ABA) evaluation . . . . . . . . . . . . . . . . . 148
5.6 Sensitivity of the precision and coverage to the number of examples. The Holistic shows high precision and maintains high coverage in comparison to DMA. 150
5.7 Joint sensitivity analysis to the number of examples and the head vs. tail records in the web tables. The Holistic is robust in comparison to the DMA. 151
5.8 Web tables matching accuracy . . . . . . . . . . . . . . . . . . . . . . . . 152
5.9 Response time evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 153
ABSTRACT
Yakout, Mohamed A. Ph.D., Purdue University, August 2012. Guided Data Cleaning. Major Professor: Ahmed K. Elmagarmid.
Until recently, data cleaning techniques have focused on providing fully automated solutions, which are risky to rely on, without efficiently and effectively considering collaboration with the data users and other available resources. This dissertation studies techniques to involve data users directly and indirectly, as well as to leverage the WWW, specifically web tables, for data cleaning tasks. In particular, the dissertation addresses four key challenges for guided data cleaning.
The first challenge relates to directly involving users in the data cleaning process.
The goal is to efficiently combine the best of both the user fidelity to guide the data
cleaning process and the existing automatic cleaning techniques to suggest cleaning
updates. For this purpose, we develop the necessary principles to reason about which
questions to forward to the user using a novel combination of decision theory and
active learning.
The second challenge is scalability as existing automatic cleaning techniques are
not scalable. We introduce a new approach that is based on statistical machine
learning techniques. We achieve scalability by introducing a robust mechanism to
partition the database, and then aggregate the final cleaning decisions from the several
partitions.
The third challenge relates to involving users indirectly in a data cleaning task. We notice that the users’ actions (or behavior), which can be found in the system’s log, can be useful evidence for the task of deduplicating the users themselves. We develop the necessary pattern detection and modeling algorithms for this purpose.
Finally, the fourth challenge relates to leveraging the WWW for data cleaning tasks. We address the problem of finding missing values (or entity augmentation) using web tables. Our solution relies on aggregating answers from several web tables that directly and indirectly match the user’s entities. We model this problem as topic-sensitive PageRank, which captures the holistic semantic match of a web table to the topic of the list of entities.
Our experimental evaluations using real-world datasets demonstrate the effectiveness and efficiency of our proposed approaches to improve the quality of dirty databases.
1. INTRODUCTION
This dissertation studies techniques to involve data users (or domain experts), in
addition to leveraging data on the Web in data cleaning tasks. The purpose is to
achieve better data quality efficiently and effectively.
Data quality issues are of several kinds, e.g., inaccuracy, inconsistency, duplicates
and incompleteness, and may be the consequence of several reasons, e.g., misspelling,
integration from heterogeneous sources, and software bugs. Poor data quality is a
fact of life for most organizations and can have serious implications on their efficiency
and effectiveness [1].
Data quality experts estimate that erroneous data can cost a business as much as 10 to 20% of its total system implementation budget [2]. They agree that as much as 40 to 50% of a project budget might be spent correcting data errors in time-consuming, labor-intensive and tedious processes. The proliferation of data also heightens the relevance of data cleaning and makes the problem more challenging: more sources and larger amounts of data imply a greater variety and intricacy of data quality problems and higher complexity for maintaining the quality of the data in a cost-effective way. Data quality is equally critical in the health care domain: in such applications, incorrect information about patients in an Electronic Health Record (EHR) may lead to inconsistent treatments and prescriptions, which consequently may cause severe medical problems, including death. As a result, various computational procedures for data cleaning have been proposed by the database community to (semi-)automatically identify errors and, when possible, correct them.
Most existing approaches to clean dirty databases either rely on predefined data quality rules that should be satisfied by the database or rely on machine learning techniques. Most of these techniques focus on providing fully automated solutions using different heuristics, which can be risky, especially for critical data. To guarantee that the best desired quality updates are applied to the database, users (domain experts) should be involved to confirm updates. This highlights the increasing need for techniques that combine the best of both worlds.
There are other cases where involving users or relying on data cleaning rules will not be helpful, for example, when there are many missing values in the database. Consider the following scenario in an enterprise database, where we have a table containing a list of companies, but their location or contact information is missing. Neither rules nor correlations among the attributes are helpful in this case, and a user would have to collect all this information manually. The WWW is helpful in most such situations, as it covers a large spectrum of domains. This highlights the need for techniques to automatically leverage the WWW for such data cleaning tasks.
In this chapter, we start in Section 1.1 by highlighting the key challenges we address in this dissertation, and we then discuss the related work in Section 1.2.
Section 1.3 summarizes the contributions and goals of this dissertation in view of the
challenges presented in Section 1.1. Finally, Section 1.4 outlines the structure of this
dissertation.
1.1 Key Challenges
In this section, we highlight some of the challenges in data cleaning. We focus on four key challenges around which the contributions of this dissertation revolve. The challenges relate to efficiently involving the user in the data cleaning process, the scalability of the techniques that generate cleaning updates, leveraging user-generated logs for a data cleaning task, and finally, leveraging the WWW for data cleaning.
1.1.1 User’s Direct Interaction for Data Cleaning
Existing automated solutions for data cleaning can be used as generators of data cleaning updates. The data users can then be involved to inspect such updates and confirm the correct ones. However, involving the user can be very expensive because of the large number of possibilities to be verified. Since automated techniques for data cleaning produce far more updates than one can expect the user to handle, techniques for selecting the most useful updates for presentation to the user become very important.
The key challenge in involving users is to determine how and in what order suggested updates should be presented to them. This requires developing a set of principled measures that estimate the improvement in quality to reason about the selection of possible updates, as well as investigating machine learning techniques to minimize user effort. The goal is to achieve a good trade-off between high-quality data and minimal user involvement.
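The trade-off above can be illustrated with a small sketch. The code below is not the dissertation's GDR algorithm; it is a hypothetical value-of-information style scorer (names such as `CandidateUpdate` and `voi_score` are invented for illustration) that surfaces high-impact, uncertain repairs to the user first, while repairs the system is already confident about are deferred.

```python
# Minimal sketch of ranking candidate repairs for user verification.
# All names here are illustrative, not from the dissertation: each update
# is scored by how many violations it would resolve, weighted by how
# uncertain the system is about it, so that high-impact, high-uncertainty
# suggestions are shown to the user first.
from dataclasses import dataclass

@dataclass
class CandidateUpdate:
    tuple_id: int      # which record the repair applies to
    attribute: str     # which attribute would change
    confidence: float  # system's belief the repair is correct (0..1)
    impact: int        # number of constraint violations it would resolve

def voi_score(u: CandidateUpdate) -> float:
    # Expected benefit of asking the user: large when the repair resolves
    # many violations and the system is uncertain (confidence near 0.5),
    # small when the system is already sure either way.
    uncertainty = 1.0 - abs(2.0 * u.confidence - 1.0)
    return u.impact * uncertainty

def rank_for_user(updates):
    return sorted(updates, key=voi_score, reverse=True)

updates = [
    CandidateUpdate(1, "zip", confidence=0.95, impact=3),    # near-certain
    CandidateUpdate(2, "city", confidence=0.55, impact=3),   # uncertain, high impact
    CandidateUpdate(3, "state", confidence=0.50, impact=1),  # uncertain, low impact
]
print([u.tuple_id for u in rank_for_user(updates)])  # -> [2, 3, 1]
```

With this scoring, update 2 (uncertain and high-impact) is verified first, while update 1, though high-impact, is deferred because the system is already nearly certain about it.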
1.1.2 Scalable Data Cleaning Techniques
For inconsistent databases, we focus on solutions that provide cleaning updates in the form of value modifications. Most existing solutions follow constraint-based repairing approaches [3–5], which search for a minimal change of the database that satisfies a predefined set of constraints. While a variety of constraints (e.g., integrity constraints, conditional functional and inclusion dependencies) can detect the presence of errors, they are recognized to fall short of guiding the correction of the errors and, worse, may introduce new errors when repairing the data [6]. Moreover, despite the research conducted on integrity constraints to ensure the quality of the data, in practice, databases often contain a significant amount of non-trivial errors. These errors, both syntactic and semantic, are generally subtle mistakes that are difficult or even impossible to express using the general types of constraints available in modern DBMSs [7]. This highlights the need for different techniques to clean dirty databases.
Statistical machine learning (ML) techniques (e.g., decision trees, Bayesian networks) can usually capture dependencies, correlations, and outliers from datasets based on various analytic, predictive or computational models [8]. Existing efforts in data cleaning using ML techniques have mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To the best of our knowledge, no approach has considered ML techniques for repairing databases by value modification.
Involving ML techniques in repairing erroneous data is not straightforward and raises four major challenges: (1) Several attribute values (of the same record) may be dirty. Therefore, the process is not as simple as predicting values for a single erroneous attribute; it requires accurate modeling of correlations between the database attributes, assuming a subset is dirty and its complement is reliable. (2) An ML technique can predict an update for each tuple in the database, and the question is how to distinguish the predictions that should be applied. Therefore, a measure to quantify the quality of the predicted updates is required. (3) An over-fitting problem may occur when modeling a database with a large variety of dependencies that hold locally for data subsets but do not hold globally. (4) Finally, the process of learning a model from a very large database is expensive, and the prediction model itself may not fit in main memory. Despite the existence of scalable ML techniques for large datasets, they are either model dependent (i.e., limited to specific models, for example SVM [10]) or data dependent (e.g., limited to specific types of datasets such as scientific data and document repositories). Note that scalability is also an issue for the constraint-based repairing approaches [11].
Such limitations motivate the need for effective and scalable methods to accurately
predict cleaning updates with statistical guarantees.
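To make the prediction-with-a-quality-measure idea concrete, here is a deliberately tiny, hypothetical sketch (not the actual approach developed in Chapter 3): a replacement value for a presumably dirty attribute is predicted from a reliable attribute of the same tuple using co-occurrence statistics, and the prediction carries a probability that can be thresholded before the repair is applied. The sample data and function names are invented for illustration.

```python
# Toy illustration of challenge (2): predict a repair for a dirty attribute
# from a reliable attribute of the same tuple, and attach a probability to
# the prediction so that low-confidence repairs can be withheld.
from collections import Counter, defaultdict

clean = [  # (zip, city) pairs assumed reliable
    ("47906", "West Lafayette"),
    ("47906", "West Lafayette"),
    ("47906", "Lafayette"),
    ("46240", "Indianapolis"),
]

# Estimate P(city | zip) from co-occurrence counts.
by_zip = defaultdict(Counter)
for z, c in clean:
    by_zip[z][c] += 1

def predict_city(zip_code):
    counts = by_zip[zip_code]
    total = sum(counts.values())
    city, n = counts.most_common(1)[0]
    return city, n / total  # predicted repair and its probability

value, prob = predict_city("47906")
print(value, round(prob, 3))  # apply the repair only if prob exceeds a threshold
```

The probability returned alongside each prediction is exactly the kind of quality measure challenge (2) calls for: a repair like ("47906" → "West Lafayette", 0.667) can be accepted or rejected against a confidence threshold instead of being applied blindly.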
1.1.3 User’s Indirect Interaction for Data Cleaning
There are situations where the users’ interactions with a system are registered in a transaction log. This raises the following question: is it possible to leverage such a log for a data cleaning task? In this case, the users are indirectly involved in data cleaning, in contrast to the direct interaction we discussed in Section 1.1.1. Specifically, we focus on the task of deduplication, or record linkage.
Record linkage is the process of identifying records that refer to the same real-world entity. There has been a large body of research on this topic (refer to [9] for a recent survey). While most existing record linkage techniques focus on simple attribute similarities, more recent techniques consider richer information extracted from the raw data to enhance the matching process (e.g., [12–15]).
In contrast to most existing techniques, we are considering entity behavior as a
new source of information to enhance the record linkage quality. We observe that
by interpreting massive transactional datasets, for example, transaction logs, we can
discover behavior patterns and identify entities based on these patterns. Various
applications such as retail stores, web sites, and surveillance systems, maintain trans-
action logs that track the actions performed by entities over time. Entities in these
applications will usually perform actions, e.g., buying a specific quantity of milk at
a specific point in time or browsing specific pages within a web site, which represent
their behavior vis-a-vis the system.
To further motivate the importance of using behavior for record linkage, consider the following real-life example. Yahoo recently acquired a Jordanian Internet company called Maktoob, which, similar to Yahoo, provides a large number of Internet services to its customers in the region, such as e-mail, blogs, news, and online shopping. It was reported that with this acquisition, Yahoo will be able to add the 16 million Maktoob users to its 20 million users from the Middle East region1. Clearly, Yahoo should expect that the overlap between these two groups of users can be quite significant, and hence the strong need for record linkage. However, the user profile information stored by both companies may not be reliable enough because of different languages, fake information, etc. In this scenario, analyzing the users’ behavior, in terms of how they use the different Internet services, will be an invaluable source of information to identify potentially common users. Record linkage analysis based on entity behavior also has many other applications, for example, identifying common customers for stores that are considering a merger, tracking users accessing web sites from different IP addresses, and helping in crime investigations.
1http://www.techcrunch.com/2009/08/25/confirmed-yahoo-acquires-arab-internet-portal-maktoob/
A seemingly straightforward strategy to match two entities is to measure the
similarity between their behaviors. However, a closer examination shows that this
strategy may not be useful, for the following reasons. It is usually the case that
the complete knowledge of an entity’s behavior is not available to both sources, since
each source is only aware of the entity’s interaction with that same source. Hence, the
comparison of entities’ “behaviors” will in reality be a comparison of their “partial
behaviors”, which can easily be misleading. Moreover, even in the rare case when
both sources have almost complete knowledge about the behavior of a given entity
(e.g., a customer who did all his grocery shopping at Walmart for one year and then
at Safeway for another year), the similarity strategy still will not help. The problem is
that many entities do have very similar behaviors, and hence measuring the similarity
can at best group the entities with similar behavior together (e.g., [16–18]), but not
find their unique matches.
The key challenge is to devise an alternative strategy to match entities using their
behavior, because straightforward similarity is not the right way to address this
problem. This highlights further challenges: how to devise a matching function for
entities’ behavior, how to represent and model that behavior, and, finally, how to
design an efficient solution to handle the expected large volume of transactions.
1.1.4 Leveraging the WWW for Data Cleaning
As we mentioned earlier, there are cases where none of the automated cleaning
techniques can be helpful and we have to rely on external data sources. The WWW
is the richest data source and the question here is how to effectively and efficiently
leverage the WWW for data cleaning.
The Web contains a vast corpus of HTML tables. In this dissertation, we focus
on one class of HTML tables: entity-attribute tables (also referred to as relational
tables [19,20] and 2-dimensional tables [21]). Such a table contains values of multiple
entities on multiple attributes, each row corresponding to an entity and each column
corresponding to an attribute. Cafarella et al. reported 154M such tables from a
snapshot of Google’s crawl in 2008; we extracted 573M such tables from a recent
crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables as simply
web tables.
Consider an enterprise database with a table about companies where all
(or most) of their contact information is missing, or consider a product database
with a table about digital cameras. In the cameras table, the camera model name
is provided, but other attributes such as brand, resolution, price and optical
zoom have missing values. We call these attributes augmenting attributes and the
process of finding the missing attribute values entity augmentation.
Such augmentation would be difficult to perform using an enterprise database or
an ontology because the entities can come from any arbitrary domain. Today, users
manually find the web sources containing this information and assemble the values.
Assuming that this information is available, albeit scattered, in various web tables, we
can save a lot of time and effort if we can perform this operation automatically. This
requires discovering semantic matching relationships between the web tables. The
result is a Semantic Matching Web tables (SMW) graph. Constructing
and processing such a large graph is a significant challenge.
The challenges and requirements for such an operation are: (i) high precision
(#corraug / #aug) and high coverage (#aug / #entity), where #corraug, #aug and
#entity denote the number of entities correctly augmented, the number of entities
augmented and the total number of entities, respectively; (ii) fast (ideally interactive)
response times; and (iii) applicability to entities of any arbitrary domain.
1.2 Related Work
Improving data quality has been the focus of a large body of research for decades.
Our work is closely related to the following research areas: (i) constraint-based data
repair, (ii) statistical machine learning techniques for data cleaning, (iii) interactive
systems for data cleaning and user modeling, and (iv) leveraging the WWW for
data integration tasks.
1.2.1 Constraint-based Data Cleaning
This approach has two main steps: (1) identify a set of constraints that should be
followed by the data, and then (2) use the constraints and the data to find another
consistent database that minimally differs from the original database (e.g., [3–5, 22–
26]). Most earlier work (except [3, 23,24,26,27]) considers traditional full and denial
dependencies, which subsume functional dependencies (FDs). The repair algorithm
in [23] uses traditional FDs and inclusion dependencies (INDs) to derive repairs,
while [4] applies to restricted denial constraints. The work in [23] uses equivalence
classes to group the attribute values that are equivalent when obtaining a final
consistent database instance. The repair approach in [3] uses CFDs for data repair
and is a non-trivial extension of the repair algorithms described in [23]. The
proposed algorithms are based on a cost-based greedy heuristic to decide upon a
repair of errors. The work in [5] uses FDs to map the repairing problem to a
hypergraph optimization problem, where a heuristic vertex cover algorithm helps find
the minimal number of attribute values to modify in order to obtain a database
consistent with the FDs. The main drawback of these approaches is that the data
must be covered by a set of constraints that have been specified or validated by
domain experts, which may be an expensive manual process and may not be affordable
for all data domains. Moreover, the constraints usually fall short of correctly
identifying the right fixes [6].
In the literature, several classes of data quality rules have been introduced. For
example, Conditional Functional Dependencies (CFDs) [28] extend standard
functional dependencies (FDs) with conditional pattern tableaux that define
the subset of tuples, or context, in which the underlying FD holds. Matching
Dependencies (MDs) [11] are similar to FDs but take into account similarity
between values instead of exact matching. Record matching rules [29] and
Dedupalog [30] are used to identify duplicate records.
CFDs have been extensively studied due to their usefulness as integrity constraints
that summarize data semantics and identify data inconsistencies. Prior work focused
on consistency and implication analysis for CFDs [28], propagation of CFDs from
source data to views in data integration [31], extensions of CFDs by adding disjunction
and negation [32] or adding ranges [33], and estimating CFD confidence [34].
Consequently, competing algorithms for discovering CFDs were introduced in [33, 35,
36].
1.2.2 Machine Learning Techniques for Data Cleaning
Data cleaning using ML techniques has mainly focused on deduplication (see [9]
for a survey), data imputation (e.g., [7, 37]) and error detection (e.g., [8, 38]). To
the best of our knowledge, the problem of using ML techniques for repairing dirty
databases by value modification has not been addressed.
In data imputation, for example, [7] uses relational learning to learn the
characteristics of the attribute relationships in a relational database. The learned
model is then used to infer the missing values. This technique requires a priori
knowledge about the relationships between the attributes to construct the appropriate
Bayesian network for learning. Most similar techniques for data imputation are
limited to numerical or categorical attributes.
The main challenges for these techniques are (i) the scalability of modeling large
databases with all existing data correlations and (ii) the accuracy of replacement-value
prediction, because existing methods usually capture either local or global data
relationships and do not combine both views. Despite the existence of scalable
ML techniques for large datasets, they are either model-dependent (i.e., limited to
specific models, for example SVM [10]) or data-dependent (e.g., limited to specific
types of datasets such as scientific data and document repositories).
The scalability is also an issue for the constraint-based repairing approaches [11].
1.2.3 Involving Users in the Data Cleaning Process
Most existing systems for data cleaning provide tools for data exploration and
transformation without taking advantage of recent efforts on automatic data re-
pair. Usually, the repair actions are “explicitly specified by the user”. For example,
AJAX [39] proposes a declarative language to eliminate duplicates during data trans-
formations. Potter’s Wheel [40] combines data transformations with the detection
of errors in the form of irregularities. None of these systems efficiently leverages user
feedback through ranking or learning mechanisms.
Recent work on repairing critical data with quality guarantees was introduced
in [6]. It assumes that correct reference data exists and requires the user
to specify certain attributes as correct across the entire dataset. Moreover, the
proposed solution relies on a pre-specified set of editing rules.
Previous work on soliciting user feedback to improve data quality focuses mostly
on two objectives: (i) identifying correctly matched references in large-scale integrated
data (e.g., [41–44]) or for duplicate elimination (e.g., [45]); and (ii) improving the
prediction quality of a learning model by taking into account data acquisition costs
(e.g., cost-sensitive learning [46], utility-based learning [47], active learning [48],
selective supervision [49], selective repeated labeling [50]).
The work in [41] and [44] addresses incorporating user feedback into schema matching
tasks. [42] introduced a framework that provides many users with candidate matches,
without any ranking or selection mechanism, and then combines the responses to
converge to a correct answer. In [43], a decision-theoretic framework was proposed
to rank candidate reference matches to improve the quality of query responses in a
dataspace. This framework is limited to soliciting user feedback to resolve candidate
matches from a dataspace and cannot be applied in a constrained repair framework
for relational databases. [45] introduced an active-learning-based approach to build
a generic matching function for identifying duplicate records.
Selective supervision [49] combines decision theory with active learning. It uses a
value-of-information approach for selecting unclassified cases for labeling. Selective
repeated labeling [50] assumes unreliable user feedback and combines label
uncertainty with active learning to select instances for repeated labeling. The overall
goal of these approaches is to reduce the uncertainty in the predicted output without
regard to how important those predictions are to the quality of the underlying database.
Among the approaches that leverage user or entity behavior for entity deduplication,
a closely related area is user-adaptive systems for web navigation and
information retrieval (e.g., [16–18]). Most of these techniques focus on statistically
modeling user interactions to extract domain-specific features that capture users’
preferences. These models focus on the statistical significance of the extracted
features and may take into account the sequence of users’ actions. However, they do
not take into account the time dimension to determine repeated patterns of actions.
They are better suited to determining groups of common behaviors and may be used
to evaluate similarities between entities. However, they cannot help in computing
pair-wise matches between entities based on their recorded actions.
1.2.4 WWW for Data Integration and Data Cleaning
The most related work is the Octopus system developed by Cafarella et al.
[20]. The Extend operator proposed by Octopus is similar to the operation of
finding missing values using web tables. Octopus uses the web search API to retrieve
matching tables; this does not have any well-defined semantics. Since web search is
not meant for matching tables, in many cases the top 1000 returned URLs do not
provide any matching tables. Moreover, Octopus needs to invoke the search API
for each user record and perform clustering of web tables at query time, leading to
prohibitive performance even for small databases. This highlights the need for a
different approach that relies on well-defined semantic matching between the
web tables and the user table (the table with missing values). Moreover, the approach
needs to perform most of the “heavy lifting” in a preprocessing step so that we get
a fast response time.
Researchers have developed techniques to annotate web tables with column names
and names of relationships [51, 52]. These techniques can help to build a better
Semantic Matching Web tables (SMW) graph.
Building the SMW graph is related to the vast body of work on schema matching
[53–55]. Most modern approaches use several base techniques, such as linguistic
matching of attribute names and detecting overlap of data instances, and combine
them to determine the final matchings; the base techniques as well as the combiner
can be either machine learning-based techniques or non-learning methods [41, 56]. In
contrast to enterprise tables, web tables offer additional features that can be leveraged
to obtain better semantic matching, for example, the context (i.e., the text in the web
page) from which the table came.
There exists a rich body of work on leveraging HTML lists for set expansion and
table augmentation [57, 58]. However, the focus there is on discovering more entities
rather than augmenting the provided entities.
1.3 Contributions
The central thesis of this dissertation can be stated as follows: A complete and
effective solution to improve data quality is likely to depend on a close collaboration
between humans, in the form of data users (or domain experts), and machines, in the
form of automated solutions for cleaning dirty databases, in addition to leveraging
other information resources such as the enormous amount of data on the Web.
Data users must be in the loop because automatic cleaning techniques
may cause undesired changes to the database, and the data quality may even get worse.
Moreover, sometimes exploring other information resources is helpful. For example,
it is common for users to refer to the WWW to search for accurate information to
correct the database.
We claim the following list of contributions:
• We propose a novel interactive data cleaning framework for Guided Data Repair
(GDR) that tackles the problem of data cleaning from a more realistic and
pragmatic viewpoint. GDR interactively involves the user directly in guiding
the cleaning process alongside existing automatic cleaning techniques. The goal
is to effectively involve users in a way to achieve better data quality as quickly
as possible. The basic intuition is to continuously consult the user for cleaning
updates that are most beneficial in improving the data quality as we go.
• Since existing automatic data cleaning approaches are not scalable, we introduce
a new approach based on machine learning techniques. The objective
is to build new data cleaning update generators to be used within GDR for
large databases. Our approach relies on maximizing the data likelihood given
the underlying data distribution, which can be modeled using ML techniques.
We achieve scalability by introducing a mechanism for horizontal data
partitioning that enables parallel processing of data blocks; various ML methods can
be applied to provide “local” predictions that are then combined to obtain
final accurate predictions.
• We introduce a novel technique to involve the user indirectly in a data cleaning
task. We propose a technique that leverages the transaction logs generated by
users or entities to perform entity deduplication or record linkage. We present the
first formulation of this problem and introduce statistical techniques to model
entity behavior. Our approach for matching entities using their behavior
does not rely on measuring behavior similarities. Instead, we first merge the
entities’ transactions (or behavior) and measure the gain in recognizing behavior
patterns in the merged log. Since the transaction log is expected to be large, we
introduce efficient techniques to produce candidate matches by computing
approximate summaries of the entities’ behaviors.
• To effectively and efficiently leverage the WWW for a data cleaning task, we
propose a novel approach that relies on web tables to augment entities with missing
attribute values. The core of our approach is to match the dirty table (or query
table) with the web tables to get the relevant matched web tables. The relevant
matched web tables are then used to obtain the missing values. We develop a
novel holistic matching framework based on topic-sensitive PageRank (TSP) over
the SMW graph. We argue that by considering the query table as a topic and
web tables as documents, we can efficiently model the holistic matching as TSP.
We propose a system architecture that leverages preprocessing in MapReduce
to achieve extremely fast (interactive) response times at query time. Finally,
we present a machine learning-based technique for building the SMW graph.
Our key insight is that the text surrounding the web tables is important in
determining whether two web tables match or not. We propose a novel set of
features that leverage this insight.
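As a rough illustration of the holistic matching idea above, topic-sensitive PageRank can be viewed as personalized PageRank in which random jumps land only on the “topic” node, here the query table. The toy graph, damping factor, and iteration count below are assumptions for illustration, not the dissertation’s actual SMW construction:

```python
def topic_sensitive_pagerank(graph, topic_nodes, d=0.85, iters=50):
    """Personalized PageRank: the (1 - d) random jump lands only on
    topic_nodes.  graph: {node: [out-neighbors]}.  Returns node -> score."""
    nodes = list(graph)
    pref = {n: (1.0 / len(topic_nodes) if n in topic_nodes else 0.0)
            for n in nodes}
    rank = dict(pref)
    for _ in range(iters):
        new = {n: (1 - d) * pref[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:  # spread this node's rank over its out-links
                share = d * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:    # dangling node: send its mass back to the topic nodes
                for m in topic_nodes:
                    new[m] += d * rank[n] * pref[m]
        rank = new
    return rank
```

With the query table as the single topic node, web tables that are close to it in the SMW graph accumulate higher scores and are treated as the relevant matches.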
For each of our proposed techniques, we implemented a research prototype
demonstrating its applicability. Moreover, we conducted experimental studies using
realistic datasets to validate the effectiveness of our approaches.
1.4 Outline
The rest of the dissertation is organized as follows: Chapter 2 describes the GDR
framework for guided data repair. In Chapter 3, we introduce our scalable automatic
repair approach. Chapter 4 introduces our approach to leveraging users indirectly
for the data cleaning task of deduplication. The approach that uses web tables for
augmenting entities is described in Chapter 5. Finally, Chapter 6 concludes the
dissertation and points out directions for future work.
Parts of this dissertation have been published in conferences. In particular, the
work on guided data repair (Chapter 2) is described in a paper [59] in the Proceedings
of the 2011 International Conference on Very Large Databases (PVLDB 2011). Also
our implemented system for the GDR was accepted for demonstration [60] in the
2010 International Conference on Management of Data (SIGMOD 2010). The work
on leveraging entity behavior for deduplication (Chapter 4) is also described in a
paper [61] in the Proceedings of the 2010 International Conference on Very Large
Databases (PVLDB 2010). Finally, the work that leverages web tables for entity
augmentation (Chapter 5) is described in a paper [62] in the 2012 International
Conference on Management of Data (SIGMOD 2012).
2. GUIDED DATA REPAIR
In this chapter, we introduce GDR, a framework for guided data repair that
efficiently involves the user directly in the data cleaning process. Here, we describe
the framework components and the principles upon which we rely to reason about
the interaction with the data user. The objective is to converge faster to better
data quality with minimal user involvement.
The chapter is organized as follows: Section 2.1 provides a motivating example
for our approach, and we describe the problem and our solution approach in Sec-
tion 2.2. In Section 2.3 we discuss our mechanism to generate candidate cleaning
updates. In Section 2.4, we develop a principled approach to decide upon ranking
the questions for user feedback. We experimentally evaluate GDR in Section 2.5, and
finally, summarize the chapter in Section 2.6.
2.1 Introduction
A recent approach for repairing dirty databases is to use data quality rules in the
form of database constraints to identify tuples with errors and inconsistencies and
then use these rules to derive updates to these tuples. Most of the existing data
repair approaches (e.g., [3, 4, 23, 25]) focus on providing fully automated solutions
using different heuristics to select updates that would introduce minimal changes to
the data, which could be risky especially for critical data. To guarantee that the best
desired quality updates are applied to the database, users (domain experts) should
be involved to confirm updates. This highlights the increasing need for a framework
that combines the best of both worlds: a framework that automatically suggests
updates while efficiently involving users to guide the cleaning process.
          Name   SRC  STR            CT             STT  ZIP
     t1:  Jim    H1   REDWOOD DR     MICHIGAN CITY  MI   46360
     t2:  Tom    H2   REDWOOD DR     WESTVILLE      IN   46360
     t3:  Jeff   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
     t4:  Rick   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
     t5:  Joe    H1   BELL AVENUE    FORT WAYNE     IN   46391
     t6:  Mark   H1   BELL AVENUE    FORT WAYNE     IN   46825
     t7:  Cady   H2   BELL AVENUE    FORT WAYNE     IN   46825
     t8:  Sindy  H2   SHERDEN RD     FT WAYNE       IN   46774
(a) Data
ϕ1 : (ZIP → CT, STT, {46360 ∥ MichiganCity, IN})
ϕ2 : (ZIP → CT, STT, {46774 ∥ NewHaven, IN})
ϕ3 : (ZIP → CT, STT, {46825 ∥ FortWayne, IN})
ϕ4 : (ZIP → CT, STT, {46391 ∥ Westville, IN})
ϕ5 : (STR, CT → ZIP, { ,FortWayne ∥ })
(b) CFD Rules
Fig. 2.1.: Example data and rules
Motivating Example
Consider the following example. Let relation Customer(Name, SRC, STR, CT, STT,
ZIP) specify personal address information Street (STR), City (CT), State (STT) and
zip code (ZIP), in addition to the source (SRC) of the data, i.e., the data entry operator. An
instance of this relation is shown in Figure 2.1(a).
Data quality rules can be defined in the form of Conditional Functional Dependen-
cies (CFDs) as described in Figure 2.1(b). A CFD is a pair consisting of a standard
Functional Dependency (FD) and a pattern tableau that specifies the applicability of
the FD on parts of the data. For example, ϕ1 − ϕ4 state that the FD ZIP → CT, STT
(i.e., zip codes uniquely identify city and state) holds in the context where the ZIP is
46360, 46774, 46825 or 46391. Moreover, the pattern tableau enforces bindings be-
tween the attribute values, e.g., if ZIP= 46360, then CT= ‘Michigan City’. ϕ5 states
that the FD STR, CT → ZIP holds in the context where CT = ‘Fort Wayne’, i.e., street
names uniquely identify the zip codes whenever the city is ‘Fort Wayne’. Note that
all the tuples in Figure 2.1(a) have violations.
Typically, a repairing algorithm will use the rules and the current database in-
stance to find the best possible repair operations or updates. For example, t5 violates
ϕ4 and a possible update would be to either replace CT by ‘Westville’ or replace ZIP
by 46825, which would make t5 fall in the context of ϕ3 and ϕ5 but without violations.
To decide which update to apply, different heuristics can be used [4, 23].
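To make the example concrete, here is a hedged sketch of how candidate updates could be derived from constant CFDs like ϕ1–ϕ4: when a tuple matches a ZIP pattern but disagrees on CT, we can either fix the right-hand side (CT) or change the left-hand side (ZIP) to a code whose rule agrees with the current CT. The function and data layout are illustrative only, not the actual repair algorithm:

```python
def constant_cfd_updates(tuple_, zip_rules):
    """For rules of the form (ZIP -> CT), suggest fixes when the tuple
    matches a ZIP pattern but disagrees on CT: correct CT, or change ZIP
    to a code whose rule is consistent with the current CT."""
    updates = []
    expected_ct = zip_rules.get(tuple_["ZIP"])
    if expected_ct is not None and tuple_["CT"] != expected_ct:
        # fix the right-hand side ...
        updates.append(("CT", expected_ct))
        # ... or fix the left-hand side with any ZIP matching the current CT
        for z, ct in zip_rules.items():
            if ct == tuple_["CT"]:
                updates.append(("ZIP", z))
    return updates

zip_rules = {"46360": "MICHIGAN CITY", "46774": "NEW HAVEN",
             "46825": "FORT WAYNE", "46391": "WESTVILLE"}
t5 = {"Name": "Joe", "CT": "FORT WAYNE", "ZIP": "46391"}
# constant_cfd_updates(t5, zip_rules)
# -> [("CT", "WESTVILLE"), ("ZIP", "46825")]
```

Both suggestions repair the violation of ϕ4; which one is correct cannot be decided from the rules alone, which is exactly where heuristics, or the user, come in.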
However, automatic changes to data can be risky especially if the data is critical,
e.g., choosing the wrong value among the possible updates. On the other hand,
involving the user can be very expensive because of the large number of possibilities
to be verified. Since automated methods for data repair produce far more updates
than one can expect the user to handle, techniques for selecting the most useful
updates for presentation to the user become very important.
Moreover, to efficiently involve the user in guiding the cleaning process, it is
helpful if the suggested updates are presented in groups that share some contextual
information. This will make it easier for the user to provide feedback. For example,
the user can quickly inspect a group of tuples where the value ‘Michigan City’ is
suggested for the CT attribute. Similar grouping ideas have been explored in [45].
In the example in Figure 2.1, let us assume that a cleaning algorithm suggested
two groups of updates. In the first group, the updates suggest replacing the attribute
CT with the value ‘Michigan City’ for t2, t3, and t4 while in the second group they
suggest replacing the attribute ZIP with the value 46825 for t5 and t8. Let us assume
further that we were able to obtain the user feedback on the correct values for these
tuples; namely that the user has confirmed ‘Michigan City’ as a correct value of CT
for t2, t3, but as incorrect for t4, and 46825 as the correct value of ZIP for t5, but
as incorrect for t8. In this case, consulting the user on the first group, which has
more correct updates, is better and would allow faster convergence to a cleaner
database instance, as desired by the user. The second group would not lead to such
fast convergence.
Finally, in our example, we can recognize correlations between the attribute
values in a tuple and the correct updates. For example, when SRC = ‘H2’, the CT
attribute is incorrect most of the time, while the ZIP attribute is correct. This is
an example of the recurrent mistakes that exist in real data. Such patterns, with
correlations between the original tuple and the correct updates, can reduce user
involvement if captured by a machine learning algorithm.
The key challenge in involving users is to determine how and in what order sug-
gested updates should be presented to them. This requires developing a set of princi-
pled measures to estimate the improvement in quality to reason about the selection
process of possible updates as well as investigating machine learning techniques to
minimize user effort. The goal is to achieve a good trade-off between high quality
data and minimal user involvement.
In this chapter, we propose to tackle the problem of data cleaning from a more
realistic and pragmatic viewpoint. We present GDR, a framework for guided data
repair, that interactively involves the user in guiding the cleaning process alongside
existing automatic cleaning techniques. The goal is to effectively involve users in a
way to achieve better data quality as quickly as possible. The basic intuition is to
continuously consult the user for updates that are most beneficial in improving the
data quality as we go.
We use CFDs [63] as the data quality rules to derive candidate updates. CFDs
have proved to be very useful for data quality and triggered several efforts e.g., [33,36],
for their automatic discovery as well as making them a practical choice for data repair
techniques.
We summarize the contributions of this chapter as follows:
• We introduce GDR, a framework for data repair that selectively acquires user
feedback on suggested updates. User feedback is used to train the GDR machine
learning component that can take over the task of deciding the correctness of
these updates.
• We propose a novel ranking mechanism for suggested updates that applies a
combination of decision theory and active learning in the context of data quality
to reason about this task in a principled manner.
• We use the concept of value-of-information (VOI) [64] from decision theory to
develop a mechanism to estimate the update benefit from consulting the user on
a group of updates. We quantify the data quality loss by the degree of violations
to the rules. The benefit of a group of updates can then be computed as the
difference between the data quality loss before and after user feedback. Since we
do not know the user feedback beforehand, we develop a set of approximations
that allow efficient estimations.
• We apply active learning to order the updates within a group such that the
updates that can strengthen the prediction capabilities of the learned model the
most come first. To this end, we assign to each suggested update an uncertainty
score that quantifies the benefit to the prediction model, learning benefit, when
the update is labeled.
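One common way to instantiate such an uncertainty score is the entropy of the learner’s predicted probability that an update is correct; updates whose predictions are closest to 50/50 are presented first. The following is a generic uncertainty-sampling sketch, not GDR’s exact formulation:

```python
import math

def uncertainty(p_correct):
    """Binary entropy of the learner's predicted probability that an
    update is correct; maximal (1.0) when the model is most unsure."""
    p = min(max(p_correct, 1e-12), 1 - 1e-12)  # guard against log(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def order_for_labeling(updates_with_probs):
    """Most uncertain updates first, as in uncertainty-sampling
    active learning; input is a list of (update, p_correct) pairs."""
    return sorted(updates_with_probs, key=lambda up: -uncertainty(up[1]))
```

Labeling the most uncertain updates first gives the model the training examples it can learn the most from, which is the learning benefit referred to above.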
We conduct an extensive experimental evaluation on real datasets that shows the
effectiveness of GDR in allowing fast convergence to a better quality database with
minimal user intervention.
2.2 Problem Definition and Solution Overview
2.2.1 Problem Definition
We consider a database instance D with a relational schema S. Each relation
R ∈ S is defined over a set of attributes attr(R) and the domain of an attribute
A ∈ attr(R) is denoted by dom(A). We also consider a set of data quality rules Σ
that represent data integrity semantics. In this work, we consider rules in the form
of CFDs.
CFD Overview A CFD ϕ over R can be represented by ϕ : (X → Y, tp), where
X and Y ∈ attr(R), X → Y is a standard functional dependency (FD), referred to
as FD embedded in ϕ, and tp is a tuple pattern containing all attributes in X and Y .
For each A ∈ (X ∪ Y ), the value of the attribute A for the tuple pattern tp, tp[A], is
either a constant ’a’ ∈ dom(A), or ’−’ which represents a variable value. We denote
X as LHS(ϕ) (left hand side) and Y as RHS(ϕ) (right hand side). Examples for
CFD rules are provided in Figure 2.1.
To denote that a tuple t ∈ D matches a particular pattern tp, the symbol ≍ is
defined on data values and ’−’. We write t[X] ≍ tp[X] iff for each A ∈ X, either
t[A] = tp[A] or tp[A] = ’−’. For example, (Sherden RD, Fort Wayne, IN) ≍ (−,
Fort Wayne, −). We assume that CFDs are provided in the normal form [3], i.e.,
ϕ : (X → A, tp), A ∈ attr(R) and tp is a single pattern tuple.
A CFD ϕ : (X → A, tp) is said to be constant if tp[A] is a constant, i.e., tp[A] ≠ ’−’.
Otherwise, ϕ is a variable CFD. For example in Figure 2.1, ϕ1 is a constant CFD,
while ϕ5 is a variable CFD.
A database instance D satisfies the constant CFD ϕ = (X → A, tp), denoted by
D |= ϕ, iff for each tuple t ∈ D, if t[X] ≍ tp[X] then t[A] = tp[A]. If ϕ is a variable
CFD, then D |= ϕ iff for each pair of tuples t1, t2 ∈ D, if t1[X] = t2[X] ≍ tp[X] then
t1[A] = t2[A] ≍ tp[A]. This means that if t1[X] and t2[X] are equal and match the
pattern tp[X], then t1[A] and t2[A] must also be equal to each other. CFDs address
a single relation only. However, the repairing algorithm that uses CFDs is applicable
to general relational schemas by simply repairing each relation in isolation.
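These satisfaction conditions translate directly into a checker. The following minimal sketch is illustrative only (tuples as Python dictionaries, the wildcard ’−’ written as "-"):

```python
def matches(t, tp, attrs):
    """t[X] ≍ tp[X]: on each attribute, t equals the pattern constant
    or the pattern holds the wildcard '-'."""
    return all(tp[a] == "-" or t[a] == tp[a] for a in attrs)

def satisfies(D, X, A, tp):
    """Check D |= (X -> A, tp) for both constant and variable CFDs."""
    if tp[A] != "-":  # constant CFD: matching tuples must carry tp[A]
        return all(t[A] == tp[A] for t in D if matches(t, tp, X))
    # variable CFD: matching tuples that agree on X must agree on A
    seen = {}
    for t in D:
        if matches(t, tp, X):
            key = tuple(t[a] for a in X)
            if key in seen and seen[key] != t[A]:
                return False
            seen[key] = t[A]
    return True
```

On the running example, the pair t5, t6 (same street in ‘Fort Wayne’ but different zip codes) makes the variable CFD ϕ5 fail, while t5 alone already violates the constant CFD ϕ4.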
We address the following problems:
• The use of the data quality rules Σ to generate candidate updates for the tuples
that are violating Σ. The rules can be either given or discovered by an automatic
discovery technique (e.g., [33,36]). Usually, the automatic discovery techniques
employ thresholds on the confidence of the discovered rules. In this setting, the
Fig. 2.2.: GDR Framework.
user is the one to guide the repairing process and we assume that user decisions
are consistent with Σ.
• Deciding upon the best groups of updates—as mentioned in Section 2.1— to be
presented to the user during an interactive process for faster convergence and
higher data quality.
• Applying active learning to learn user feedback and use the learned models to
decide upon the correctness of the suggested updates without user’s involve-
ment.
2.2.2 Solution Overview
Figure 2.2 shows the GDR framework and the cleaning process is outlined in
Algorithm 2.1.
GDR guides the user to focus her efforts on providing feedback on the updates
that would improve quality faster, while the user guides the system to automatically
identify and apply updates on the data. This continuous feedback process, illustrated
in steps 3-10 of Algorithm 2.1, runs while there are dirty tuples and the user is available
to give feedback.
Algorithm 2.1 GDR Process(D dirty database, Σ DQRs)
1: Identify dirty tuples in D using Σ and generate and store initial suggested updates in
PossibleUpdates list.
2: Group the candidate updates appropriately.
3: while User is available and dirty tuples exist do
4: Rank groups of updates such that the most beneficial come first.
5: The user selects group c from the top.
6: Updates in c are labeled by learner predictions and the user interactively gives feed-
back on the suggested updates, until the user is satisfied with the learner predictions
or has verified all the updates within c.
7: User feedback and learner decisions are applied to the database.
8: Remove rejected updates from PossibleUpdates and replace as needed.
9: Check for new dirty tuples and generate updates.
10: end while
In Step 1, all dirty tuples that violate the rules are identified and a repairing
algorithm is used to generate candidate updates. In Step 2, we group the updates in
a way that makes batch inspection easier for the user.
The interactive loop in steps 3-10 starts with ranking the groups of updates such
that groups that are more likely to move the database to a cleaner state faster come
first. The user will then pick one of the top groups (c) in the list and provide feedback
through an interactive active learning session (Step 6). (The ranking mechanism and
active learning are discussed in Section 2.4.)
In Step 7, all decisions on suggested updates, either made by the user or the
learner, are applied to the database. In Step 8, the list of candidate updates is
modified by replacing rejected updates and generating new ones for emerging dirty
tuples because of the applied updates.
After getting the user feedback, the violations are recomputed by the consistency
manager and new updates may be proposed. The assumption is that if the user
verifies all the database cells then the final database instance is consistent with the
rules. This guarantees that we are always making progress toward the final consistent
database and the process will terminate.
2.3 Generating Candidate Updates
In this section, we outline the different steps involved in suggesting updates, main-
taining their consistency when applied to the database, and grouping them for the
user.
2.3.1 Dirty Tuples Identification and Updates Discovery
Once a set Σ of CFDs is defined, dirty tuples can be identified through violations
of Σ and stored in a DirtyTuples list. A tuple t is considered dirty if ∃ ϕ ∈ Σ such
that t ⊭ ϕ, i.e., t violates rule ϕ.
Resolving CFD Violations
A dirty tuple t may violate a CFD ϕ = (R : X → A, tp) in Σ following two possible
cases [3]:
• Case 1: ϕ is a constant CFD (i.e., tp[A] = a, where a is a constant) and t[X] ≍
tp[X] but t[A] ≠ a.
• Case 2: ϕ is a variable CFD, t[X] ≍ tp[X], and ∃t′ such that t′[X] = t[X] ≍
tp[X] but t[A] ≠ t′[A].
The latter case is similar to the violation of a standard FD. Accordingly, given a set Σ
of CFDs, the dirty tuples can be immediately identified and stored in the DirtyTuples
list.
To resolve a violation of a CFD ϕ = (R : X → A, tp) by a tuple t, we proceed as
follows: For case 1, we either modify the RHS(ϕ) attribute such that t[A] = tp[A],
or we change some of the attributes in LHS(ϕ) such that t[X] ≭ tp[X]. For case 2,
we either modify t[A] (resp. t′[A]) such that t[A] = t′[A], or we change some LHS(ϕ)
attributes in t[X] (resp. t′[X]) such that t[X] ≠ t′[X], or t[X] ≭ tp[X] (resp. t′[X] ≭
tp[X]).
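The two violation cases above can be rendered as simple predicates. The sketch below is our own illustration, not the thesis implementation: tuples are dicts, a pattern tuple tp is a dict whose wildcard '−' is written '_', and the helper names are hypothetical.

```python
# Illustrative CFD violation checks. A CFD is (X, A, tp): LHS attributes X,
# RHS attribute A, and pattern tuple tp with '_' as the wildcard '-'.

def matches(value, pattern_cell):
    """A value matches a pattern cell if the cell is the wildcard or equal."""
    return pattern_cell == '_' or value == pattern_cell

def violates_constant(t, cfd):
    """Case 1: t matches tp on X but disagrees with the constant tp[A]."""
    X, A, tp = cfd
    return (tp[A] != '_'
            and all(matches(t[B], tp[B]) for B in X)
            and t[A] != tp[A])

def violates_variable(t, t2, cfd):
    """Case 2: both tuples match tp on X and agree on X, but disagree on A."""
    X, A, tp = cfd
    if tp[A] != '_':
        return False
    return (all(matches(t[B], tp[B]) for B in X)
            and all(t[B] == t2[B] for B in X)
            and t[A] != t2[A])
```

For instance, with ϕ1,1 encoded as `(['ZIP'], 'CT', {'ZIP': '46360', 'CT': 'Michigan City'})`, the tuple t2 from Figure 2.1 triggers `violates_constant`.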
Example 2.3.1 In Figure 2.1, the normal form of
ϕ1 : (ZIP → CT, STT, {46360 ∥ Michigan City, IN}) would be
ϕ1,1 : (ZIP → CT, {46360 ∥ Michigan City}) and
ϕ1,2 : (ZIP → STT, {46360 ∥ IN}).
t2 violates ϕ1,1 : (ZIP → CT, {46360 ∥ Michigan City}) following case 1. Thus,
a suggested update by changing RHS(ϕ1,1) is to replace ‘Westville’ by ‘Michigan City’
in t2[CT], while another update by changing LHS(ϕ1,1) is to replace ‘46360’ by ‘46391’
in t2[ZIP], for example. t5, t6 both violate ϕ5 following case 2. A possible update is to
change RHS(ϕ5) by modifying t5[ZIP] to be ’46825’ instead of ’46391’. Yet, another
possible update is to make a change in LHS(ϕ5). For example, by changing t5[STR] or
t5[CT] to another value.
We implemented an on demand update discovery process based on the above
mechanism for resolving CFDs violations and generating candidate updates. This
process is triggered to suggest an update for t[A], the value of attribute A in tuple t.
Initially, the process is called for all dirty tuples and their attributes. Later during the
interactions with the user, it is triggered by the consistency manager as a consequence
of receiving user feedback.
The generated updates are tuples in the form rj = ⟨t, A, v, sj⟩ stored in the
PossibleUpdates list, where v is the suggested value in t[A] and sj is the update
score. sj ∈ [0..1] is assigned to each update rj by an update evaluation function
to reflect the certainty of the repairing technique about the suggested update. We
follow the same evaluation approach used in [23] and [3]. Given an update r to mod-
ify t[A] = v such that t[A] = v′, we compute the update evaluation score s as the
similarity between v and v′. This can be done based on the edit distance function
distA(v, v′) as follows
s(r) = sim(v, v′) = 1 − distA(v, v′) / max(|v|, |v′|)   (2.1)
where |v| and |v′| denote the sizes of v and v′, respectively. The intuition here is
that the more accurate v′ is, the closer it is to v. s(r) is in the range [0..1] and any domain
specific similarity function can be used for this purpose. Finally, the update can be
composed in the tuple form r = ⟨t, A, v′, s(r)⟩.
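As a concrete illustration of Eq. 2.1, the sketch below uses Levenshtein edit distance as distA; any domain-specific distance could be substituted, and the function names are our own.

```python
# Sketch of the update evaluation score of Eq. 2.1.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def update_score(v, v_new):
    """s(r) = 1 - dist(v, v') / max(|v|, |v'|), in [0, 1]."""
    if not v and not v_new:
        return 1.0
    return 1.0 - edit_distance(v, v_new) / max(len(v), len(v_new))
```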
Generating Updates
We now show how to use CFDs to generate updates for each potentially dirty
attribute B in t ∈ DirtyTuples. The generated updates are tuples in the form
⟨t, B, v, s⟩, where v is the suggested repair value for t[B] and s is the repair evaluation
score from Eq. 2.1.
The suggested updates correspond to attribute value modifications, which are
enough for CFDs violations [3]. For each dirty tuple t, we store the list of violated
rules in t.vioRuleList. Furthermore, for each pair ⟨t, B⟩, we keep a list of values
⟨t, B⟩.preventedList, which contains values for t[B] that are confirmed as wrong.
Thus, when searching a new suggestion for t[B], the values in ⟨t, B⟩.preventedList
are discarded. Also, we keep a flag ⟨t, B⟩.Changeable that is set to False when the
value in t[B] was confirmed to be correct.
Initially, we assume that each attribute value is incorrect for all t ∈ DirtyTuples
and proceed by searching for the best update value that provides the best
score according to Eq. 2.1. This can be performed by calling Algorithm 2.2,
UpdateAttributeTuple(t, B) for all t ∈ DirtyTuples and B ∈ attr(R).
UpdateAttributeTuple described in Algorithm 2.2 finds the best update value for
t[B] by exploring three possible scenarios:
1. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] ≠ ’−’ (i.e., ϕ is a
constant CFD): This corresponds to case 1 of rule violations where t[X] ≍ tp[X]
and t[A] ≠ tp[A]. In this scenario, a value v = a is suggested (lines 4-6).
Algorithm 2.2 UpdateAttributeTuple (Tuple t, Attribute B)
1: if ⟨t, B⟩.Changeable = false then return
2: best_s = 0; v = null
3: for all ϕ = (X → A, tp) ∈ t.vioRuleList do
4:   if B = A ∧ tp[A] ≠ ’−’ then
5:     cur_s = sim(t[A], tp[A]) {scenario 1}
6:     if cur_s > best_s then { best_s = cur_s; v = tp[A] }
7:   else if B = A ∧ tp[A] = ’−’ then
8:     ⟨best_s, v⟩ = getValueForRHS(ϕ, A, t, best_s) {scenario 2}
9:   end if
10: end for
11: if ∃ ϕ = (X → A, tp) ∈ t.vioRuleList s.t. B ∈ X then
12:   ⟨best_s, v⟩ = getValueForLHS(B, t, best_s) {scenario 3}
13: end if
14: if v ≠ null then
15:   PossibleUpdates = PossibleUpdates ∪ {⟨t, B, v, s = sim(t[B], v)⟩}
16: end if
2. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] = ’−’ (i.e., ϕ is a
variable CFD): This corresponds to case 2 of rule violations where t[X] ≍ tp[X]
and there exists another tuple t′ that violates ϕ with t, i.e.,
t′[X] = t[X] but t′[A] ≠ t[A]. In this scenario, a value v = t′[A] is suggested
(lines 7-9).
3. B ∈ LHS(ϕ) for some violated CFD ϕ = (X → A, tp): This corresponds to
either case 1 or case 2 of rule violations. In this scenario, we look for a value
v that maximizes the repair evaluation score sim(t[B], v) (Eq. 2.1). The aim is
to select semantically related values by first using the values in the CFDs, then
searching in the tuples identified by the pattern t[X ∪ {A} − {B}] (lines 11-13).
In each of the above scenarios, the selected value must satisfy v ∉ ⟨t, B⟩.preventedList.
Finally, a repair tuple ⟨t, B, v, s⟩ is composed and inserted into PossibleUpdates in line 15.
Example 2.3.2 In Figure 2.1, t5 violates ϕ4 and when repairing the attribute CT ∈
RHS(ϕ4), a suggested update according to Scenario 1 will be ‘Westville’. Also, t5
violates ϕ5 and when repairing the attribute ZIP ∈ RHS(ϕ5), a suggested update will
be 46825 according to Scenario 2. When repairing the attribute STR ∈ LHS(ϕ5), a
suggested value from the domain dom(STR) can be ‘Sherden RD’ according to Sce-
nario 3.
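Algorithm 2.2 can be rendered compactly in Python. The sketch below is illustrative only: it assumes dict-like tuples that also carry the vioRuleList/prevented/changeable structures, treats getValueForRHS/getValueForLHS as given helpers, and returns the candidate update instead of appending it to PossibleUpdates.

```python
# Hypothetical rendering of Algorithm 2.2; '_' stands for the wildcard '-'.

def update_attribute_tuple(t, B, sim, get_value_for_rhs, get_value_for_lhs):
    if not t.changeable[B]:                      # value confirmed correct
        return None
    best_s, v = 0.0, None
    for (X, A, tp) in t.vio_rule_list:
        if B == A and tp[A] != '_':              # scenario 1: constant CFD
            cur_s = sim(t[A], tp[A])
            if cur_s > best_s:
                best_s, v = cur_s, tp[A]
        elif B == A and tp[A] == '_':            # scenario 2: variable CFD
            best_s, v = get_value_for_rhs((X, A, tp), A, t, best_s)
    if any(B in X for (X, A, tp) in t.vio_rule_list):
        best_s, v = get_value_for_lhs(B, t, best_s)   # scenario 3
    if v is not None and v not in t.prevented[B]:
        return (t, B, v, best_s)                 # candidate update tuple
    return None
```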
2.3.2 Updates Consistency Manager
Once an update r = ⟨t, A, v, s⟩ is confirmed to be correct, either by the user or
the learning component, it is immediately applied to the database, resulting in a
new database instance. Consequently, (i) new violations may arise and hence the
on demand update discovery process needs to be triggered for the new dirty tuples,
and (ii) some of the already suggested updates that are not verified yet may become
inconsistent since they were generated according to a different database instance. For
example, in Figure 2.1, two updates are proposed: r1 replaces t6[ZIP] with 46391 and r2
replaces t6[CT] with ’FT Wayne’. If feedback is received confirming r1, then r2 is no
longer consistent with the new database instance and the rules, since t6 will fall in
the context of ϕ4. The on demand process can then find a consistent update r′2 that
corresponds to replacing t6[CT] by ’Westville’, and r2 will be discarded in favor of r′2.
The consistency manager needs to maintain two invariants: (i) There is no tuple
t ∈ D such that t ⊭ ϕ for some ϕ ∈ Σ while t ∉ DirtyTuples. (ii) There is no update
r ∈ PossibleUpdates such that r depends on data values that have been modified
in the database. In the following, we provide the detailed steps of the consistency
manager procedure that we implemented in GDR. Given an update r = ⟨t, B, v, s⟩
along with the feedback ∈ {confirm, reject, retain}:
1. If the feedback is to retain the current value t[B], then we set
⟨t, B⟩.Changeable = false to stop looking for updates for t[B].
2. If the feedback is to reject the update, i.e., t[B] cannot be v, then v is added
immediately to the list ⟨t, B⟩.preventedList. This is followed by a call to
UpdateAttributeTuple(t, B) to find another update for t[B].
3. If the feedback confirms that t[B] must be v, then the update is applied to
the database immediately and we stop generating updates for t[B] by setting
⟨t, B⟩.Changeable = false. Afterward, we go through the rules that involve the
attribute B and update the necessary data structures to reflect the removed
violations as well as new emerging violations. Particularly, for each ϕ : (X →
A, tp) ∈ Σ where B ∈ (X ∪ A), we do the following:
(a) If t ⊭ ϕ, i.e., t violates ϕ, then we consider two cases:
i. ϕ is a constant CFD: If ⟨t, C⟩.Changeable = false, ∀C ∈ X, i.e., all at-
tributes in LHS(ϕ) have been confirmed as correct and are not change-
able values, then RHS(ϕ) should be applied; we apply t[A] = tp[A]
to the database directly, set ⟨t, A⟩.Changeable = false, and remove ϕ
from t.vioRuleList. If some of the LHS(ϕ) attribute values are changeable
in t, then ∀ C ∈ ({X ∪ A} − {B}) we add ⟨t, C⟩ to RevisitList. ϕ
is added to t.vioRuleList, if it is not already there, and t is added to
the DirtyTuples list as well.
ii. ϕ is a variable CFD: We add ϕ to t.vioRuleList and then identify
the tuples t′ that violate ϕ with t. Then for each t′, we add ϕ to
t′.vioRuleList and add t′ to the DirtyTuples. Also, we add ⟨t′, C⟩ to
the RevisitList, ∀ C ∈ {X∪A} because this ϕ may be a new emerging
violation for t′ and all the attributes are candidates to be wrong for t′.
(b) If t |= ϕ, i.e., t now satisfies ϕ, while ϕ ∈ t.vioRuleList, then ϕ was violated by t before
applying this update. Therefore, we remove ϕ from t.vioRuleList. If ϕ is
a constant CFD, no further action is required. However, if ϕ is a variable
CFD, we need to check the other tuples t′, which were involved with t in
violating ϕ, and eventually update their vioRuleList. We remove ϕ from
t′.vioRuleList as long as ∄ t′′ s.t. t′, t′′ ⊭ ϕ, i.e., t′ is not involved in
violating ϕ with another tuple t′′.
4. Remove update r = ⟨t, C, v, s⟩ from PossibleUpdates, if ⟨t, C⟩ ∈
RevisitList or ⟨t, C⟩.Changeable = false.
5. For every element ⟨t, C⟩ ∈ RevisitList, we call UpdateAttributeTuple(t, C)
to find another repair for t[C].
6. Remove t from DirtyTuples, if t.vioRuleList is empty.
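The first three steps of the procedure reduce to a dispatch on the feedback type. The sketch below is a minimal illustration assuming a state object with hypothetical helpers (apply_to_db, recheck_rules, regenerate), not the thesis implementation.

```python
# Illustrative feedback dispatch for the consistency manager (Steps 1-3).

def handle_feedback(feedback, t, B, v, state):
    if feedback == 'retain':                 # Step 1: current value is correct
        state.changeable[(t, B)] = False
    elif feedback == 'reject':               # Step 2: v is wrong; try again
        state.prevented[(t, B)].add(v)
        state.regenerate(t, B)               # re-runs UpdateAttributeTuple
    elif feedback == 'confirm':              # Step 3: apply and propagate
        state.apply_to_db(t, B, v)
        state.changeable[(t, B)] = False
        state.recheck_rules(t, B)            # maintain vioRuleList etc.
```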
Note that the first update consistency invariant is maintained because of the
following: A tuple t may become dirty if it is modified or another tuple t′ is modified
so that t, t′ violate some variable CFD ϕ ∈ Σ. For a tuple t and a CFD rule ϕ,
assuming that due to a database update t ⊭ ϕ, then t must be in DirtyTuples after
applying Step 3a.
If ϕ is a constant CFD, then Step 3(a)i should have been applied. If t continues to
violate ϕ it should be in DirtyTuples. If ϕ is a variable CFD, then Step 3(a)ii should
have been applied. There are two cases to consider: First, if t is the tuple being
repaired and t ⊭ ϕ, then it is added to DirtyTuples, if not already there. Second,
if t ⊭ ϕ because another tuple t′ was repaired (or modified), then Step 3(a)ii should
have been applied on t′. Thus all tuples involved with t′ in violating ϕ, including
t will be added to DirtyTuples. Following the same rationale, Step 3b maintains
that t.vioRuleList contains only rules that are being violated by t. Thus, Step 6
guarantees that the content of DirtyTuples corresponds to tuples involved in rules
violation.
The second update consistency invariant is maintained as well because of steps
3(a)i, 4, and 5. These steps maintain a local list, RevisitList, to hold tuple-
attribute pairs whose generated updates may depend on the applied update.
In Step 3(a)i, changing the value of t[B] may affect the update choice for the other
attributes of ϕ. For a variable CFD (Step 3(a)ii), all the tuples involved in the
violations due to the modified value will need their attribute values to be revisited to
find updates. Step 4 removes the corresponding updates from the PossibleUpdates
and we proceed in Step 5 to get potentially new updates.
Note that Step 3 loops on the set of rules for the particular tuple t that was
updated. In Steps 3(a) and 3(b), we consider the immediate dependencies (conse-
quences) of updating t with respect to a single rule ϕ. Particularly in Step 3(a), we
check for new violations for ϕ that involve t, because it is the only change to the
database. In Step 3(b), we check for already resolved violations for ϕ due to updating
t. This process, local to tuple t and considering only a single rule ϕ at a time, guarantees
that the consistency manager will terminate and will not enter an infinite loop.
2.3.3 Grouping Updates
There are two reasons for the grouping: (i) Providing a set of updates that share
some common contextual information will be easier for the user to handle
and process. (ii) Providing a machine learning algorithm with a group of training
examples that have some correlations due to the grouping will increase the predic-
tion accuracy compared with just providing random, unrelated examples. Similar
grouping ideas have been explored in [45]. We use a grouping function where the
tuples with the same update value in a given attribute are grouped together. This
grouping technique is in the spirit of the equivalence-class techniques described
in [23]. Another way to do the grouping is based on conflicting structures, a
concept introduced in [5]. A conflicting structure identifies a group of cells that
cannot all stay unchanged in the final consistent database (i.e., in each group, at
least one database cell must change).
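The value-based grouping used here amounts to keying candidate updates by the pair (attribute, suggested value); a minimal sketch (the tuple encoding is our own):

```python
# Group candidate updates that suggest the same value for the same attribute.
from collections import defaultdict

def group_updates(possible_updates):
    """possible_updates: iterable of (tuple_id, attr, value, score) tuples."""
    groups = defaultdict(list)
    for r in possible_updates:
        _, attr, value, _ = r
        groups[(attr, value)].append(r)
    return groups
```

For example, the three updates of Example 2.4.1 that set CT to ‘Michigan City’ in t2, t3, t4 land in one group.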
2.4 Ranking and Displaying Suggested Updates
In this section, we introduce the key concepts of GDR, namely the ranking and
learning components (Figure 2.2), which describe how GDR interacts with the user
to get feedback on suggested updates. The task of these components is to devise how
to best present the updates to the user, in a way that will provide the most benefit
for improving the quality of the data. To this end, we apply the concept of value
of information (VOI) [64] from decision theory, combined with an active learning
approach, to choose a ranking in a principled manner.
2.4.1 VOI-based Ranking
At any iteration of the process outlined in Algorithm 2.1, there will be several
possible suggested updates to forward to the user. As discussed in the previous
section, these updates are grouped into groups {c1, c2 . . . }.
VOI is a means of quantifying the potential benefit of determining the true value
of some unknown. At the core of VOI is a loss (or utility) function that quantifies the
desirability of a given level of database quality. To make a decision on which group
to forward first to the user, we compare data quality loss before and after the user
works on a group of updates. More specifically, we devise a data quality loss function,
L, based on the quantified violations to the rules Σ. Since the exact loss in quality
cannot be measured, as we do not know the correctness of the data, we develop a set
of approximations that allow for efficient estimation of this quality loss. Before we
proceed, we need first to introduce the notion of database violations.
Definition 2.4.1 Given a database D and a CFD ϕ, we define the tuple t violation
w.r.t ϕ, denoted vio(t, {ϕ}), as follows:
vio(t, {ϕ}) = 1, if ϕ is a constant CFD;
vio(t, {ϕ}) = the number of tuples t′ that violate ϕ with t, if ϕ is a variable CFD.

Consequently, the total violations for D with respect to Σ is:

vio(D, Σ) = ∑ϕ∈Σ ∑t∈D vio(t, {ϕ}).
The definition for variable CFDs is equivalent to the pairwise counting of violations
discussed in [3]. The violation count can be scaled further using a weight attached to
the tuple, denoting the tuple's importance to the business of being clean.
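Definition 2.4.1 translates directly into code. The sketch below assumes rule objects exposing an is_constant flag and violation predicates; these names are illustrative, not from the thesis.

```python
# Sketch of Definition 2.4.1: per-tuple violations and the total vio(D, Sigma).

def vio_tuple(t, phi, D):
    """vio(t, {phi}); a constant CFD counts 1 per violating tuple."""
    if phi.is_constant:
        return 1 if phi.violated_by(t) else 0
    # variable CFD: count the partner tuples that violate phi with t
    return sum(1 for t2 in D if t2 is not t and phi.violated_by_pair(t, t2))

def vio_db(D, rules):
    """Total violations vio(D, Sigma) over all rules and tuples."""
    return sum(vio_tuple(t, phi, D) for phi in rules for t in D)
```

Note that for a variable CFD each violating pair is counted once from each side, matching the pairwise counting of [3].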
Update Benefit: Given a database instance D and a group c = {r1, . . . , rJ}, if the
system receives feedback from the user on rj, there are two possible cases: either
the user confirms rj to be applied or not. We denote the two corresponding database
instances as Drj and Dr̄j, respectively. Assuming that the user will confirm rj with a
probability pj, the expected data quality loss after consulting the user on rj can be
expressed as pj L(Drj) + (1 − pj) L(Dr̄j). If we further assume that all
the updates within the group c are independent, then the update benefit g (or data
quality gain) of acquiring user feedback for the entire group c can be expressed as:

g(c) = L(D|c) − ∑rj∈c [ pj L(Drj) + (1 − pj) L(Dr̄j) ]   (2.2)
where L(D|c) is the current loss in data quality given that c is suggested. To simplify
our analysis, we assumed that these updates are independent. Taking into account
these dependencies would require modeling the full joint probabilities of the updates,
which would lead to a formulation that is computationally infeasible due to the
exponential number of possibilities.
Data Quality Loss (L): We define quality loss as inversely proportional to the
degree of satisfaction of the specified rules Σ. To compute L(D|c), we first need to
measure the quality loss with respect to ϕ ∈ Σ, namely ql(D|c, ϕ). Assuming that
Dopt is the clean database instance desired by the user, we can express ql by:
ql(D|c, ϕ) = 1 − |D |= ϕ| / |Dopt |= ϕ| = ( |Dopt |= ϕ| − |D |= ϕ| ) / |Dopt |= ϕ|   (2.3)
where |D |= ϕ| and |Dopt |= ϕ| are the numbers of tuples satisfying the rule ϕ in the
current database instance D and Dopt, respectively. Consequently, the data quality
loss, given c, can be computed for Eq. 2.2 as follows:

L(D|c) = ∑ϕi∈Σ wi × ql(D|c, ϕi)   (2.4)
where wi is a user-defined weight that expresses the business or domain value of
satisfying the rule ϕi. In our experiments, we used the values wi = |D(ϕi)| / |D|, where
|D(ϕi)| is the number of tuples that fall in the context of the rule ϕi. The intuition is
that the more tuples fall in the context of a rule, the more important it is to satisfy
this rule.
To use this gain formulation, we are faced with two practical challenges: (1) we
do not know the probabilities pj for Eq. 2.2, since we do not know the correctness of
the update rj beforehand, and (2) we do not know the desired clean database Dopt
for computing Eq. 2.3, since that is the goal of the cleaning process in the first place.
User Model: To approximate pj, we learn and model the user as we obtain
his/her feedback for the suggested updates. pj is approximated by the prediction
probability p̂j of having rj correct (learning user feedback is discussed in Section 2.4.2).
Since initially there is no feedback, we assign sj to p̂j, where sj ∈ [0, 1] is a score that
represents the repairing algorithm's certainty about the suggested update rj.
Estimating Update Benefit: To compute the overall quality loss L in Eq. 2.4,
we need to first compute the quality loss with respect to a particular rule ϕ, i.e.,
ql(D|c, ϕ) in Eq. 2.3. To this end, we approximate the numerator and denomina-
tor separately. The numerator expression, which represents the difference between
the numbers of tuples satisfying ϕ in Dopt and D, respectively, is approximated us-
ing D’s violations with respect to ϕ. Thus, we use the expression vio(D, {ϕ}) (cf.
Definition 2.4.1) as the numerator in Eq. 2.3.
The main approximation we made is to assume that the updates within a group
c are independent. Hence to approximate the denominator of Eq. 2.3, we assume
further that there is only one suggested update rj in c. The effect of this last as-
sumption is that we consider two possible clean desired databases—one in which rj
is correct, denoted by Drj , and another one in which rj is incorrect, denoted by
Drj . Consequently, there are two possibilities for the denominator of Eq. 2.3, each
with a respective probability pj and (1− pj). Our evaluations show that despite our
approximations, our approach produces a good ranking of the groups of updates.
We apply this approximation independently for each rj ∈ c and estimate the
quality loss ql as follows:

E[ql(D|c, ϕ)] = ∑rj∈c [ p̂j · vio(D, {ϕ}) / |Drj |= ϕ| + (1 − p̂j) · vio(D, {ϕ}) / |Dr̄j |= ϕ| ]   (2.5)

where we approximate pj with p̂j.
The expected loss in data quality for the database D, given the suggested group
of updates c, can be then approximated based on Eq. 2.4 by replacing ql with E[ql]
obtained from Eq. 2.5:

E[L(D|c)] = ∑ϕi∈Σ wi ∑rj∈c [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]   (2.6)
We can also compute the expected loss for Drj and Dr̄j using Eq. 2.4 and Eq. 2.6 as
follows: E[L(Drj)] = ∑ϕi∈Σ wi · vio(Drj, {ϕi}) / |Drj |= ϕi|, where we use p̂j = 1 since
in Drj we know that rj is correct, and E[L(Dr̄j)] = ∑ϕi∈Σ wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi|,
where we use p̂j = 0 since in Dr̄j we know that rj is incorrect.
Finally, using Eq. 2.2 and substituting L(D|c) with E[L(D|c)] from Eq. 2.6, we
compute an estimate for the data quality gain of acquiring feedback for the group c
as follows:
E[g(c)] = E[L(D|c)] − ∑rj∈c [ p̂j E[L(Drj)] + (1 − p̂j) E[L(Dr̄j)] ]

= ∑ϕi∈Σ wi ∑rj∈c [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]
  − ∑rj∈c [ p̂j ∑ϕi∈Σ wi · vio(Drj, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) ∑ϕi∈Σ wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi| ]

Note that vio(D, {ϕi}) − vio(Dr̄j, {ϕi}) = 0, since Dr̄j is the database resulting from
rejecting the suggested update rj, which does not modify the database. Therefore, Dr̄j
is the same as D, with the same violations. After a simple rearrangement, we obtain
the final formula to compute the estimated gain for c:

E[g(c)] = ∑ϕi∈Σ wi ∑rj∈c p̂j · ( vio(D, {ϕi}) − vio(Drj, {ϕi}) ) / |Drj |= ϕi|   (2.7)
The final formula in Eq. 2.7 is intuitive in itself and can be justified as follows.
The main objective in improving quality is to reduce the number of violations
in the database. Therefore, the difference in the amount of database violations, as
defined in Definition 2.4.1, before and after applying rj is a major component in
computing the update benefit. This component is computed, under the first summation,
for every rule ϕi as a fraction of the number of tuples that would satisfy ϕi if
rj were applied. Since the correctness of the repair rj is unknown, we cannot use the term
under the first summation as a final benefit score. Instead, we compute the expected
update benefit by approximating our certainty about the benefit with the prediction
probability p̂j.
Example 2.4.1 For the example in Figure 2.1, assume that the repairing algorithm
generated 3 updates to replace the value of the CT attribute by ‘Michigan City’ in
t2, t3 and t4. Assume also that the probabilities p̂j for each of them are 0.9, 0.6,
and 0.6, respectively. The weights wi for each ϕi, i = 1, . . . , 5 are {4/8, 1/8, 2/8, 1/8, 3/8}.
Due to these modifications, only ϕ1 will have its violations affected. Then, for this
group of updates, the estimated benefit can be computed as follows using Eq. 2.7:
4/8 × (0.9 × (4−3)/1 + 0.6 × (4−3)/1 + 0.6 × (4−3)/1) = 1.05.
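The computation in Example 2.4.1 can be reproduced with a direct implementation of Eq. 2.7; the input encoding below (per-rule weight plus per-update terms) is our own convenience, not a thesis interface.

```python
# Sketch of the estimated group benefit E[g(c)] of Eq. 2.7.

def estimated_gain(rules):
    """rules: list of (w_i, updates), where each update in the group is
    (p_hat_j, delta_vio, satisfied_after) with
    delta_vio = vio(D, {phi_i}) - vio(D_rj, {phi_i}) and
    satisfied_after = |D_rj |= phi_i|."""
    return sum(w * sum(p * dv / sat for (p, dv, sat) in ups)
               for (w, ups) in rules)
```

Feeding in the numbers of Example 2.4.1 (weight 4/8 for ϕ1, three updates with p̂j = 0.9, 0.6, 0.6, each removing one violation) yields the benefit 1.05 quoted above.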
2.4.2 Active Learning Ordering
One way to reduce the cost of acquiring user feedback for verifying each update is
to relegate the task of providing feedback to a machine learning algorithm. The use of
a learning component in GDR is motivated by the existence of correlations between
the original data and the correct updates. If these correlations can be identified
and represented in a classification model, then the model can be trained to predict
the correctness of a suggested update and hence replace the user for similar (future)
situations.
As stated earlier, GDR provides groups of updates to the user for feedback. Here,
we discuss how the updates within a group will be ordered and displayed to the
user, such that user feedback for the top updates would strengthen the learning
component’s capability to replace the user for predicting the correctness for the rest
of the updates.
Interactive Active Learning Session: After ranking the groups of updates, the
user will pick a group c that has a high score E[g(c)]. The learner orders these updates
such that those that would most benefit, i.e., improve the model prediction accuracy,
from labeling come first. The updates are displayed to the user along with their
learner predictions for the correctness of each update. The user will then give feedback
on the top ns updates that she is sure about, inherently correcting any mistakes
made by the learner. The ns newly labeled examples are added to the learner's
training dataset Tr and the active learner is retrained. The learner then provides
new predictions and reorders the currently displayed updates based on the training
examples obtained so far. If the user is not satisfied with the learner predictions,
the user will then give feedback on another ns updates from c. This interactive
process continues until the user is either satisfied with the learner predictions, and
thus delegates the remaining decisions on the suggested updates in c to the learned
model, or the updates within c are all labeled, i.e., verified, by the user.
Active Learning: In the learning component, there is a machine learning algo-
rithm that constructs a classification model. Ideally, we would like to learn a model
to automatically identify correct updates without user intervention. Active learning
is an approach to learning models in situations where unlabeled examples (i.e.,
suggested updates) are plentiful but there is a cost to labeling examples (acquiring user
feedback) for training.
By delegating some decisions on suggested updates to the learned models, GDR
allows for “automatic” repairing. The assurance of correctly repairing the data
is inherently provided by the active learning process, which learns accurate
classifiers to predict the correctness of the updates. The user is the one to decide
whether the classifiers are accurate while inspecting the suggestions.
Learning User Feedback: For a suggested update r = ⟨t, A, v, s⟩, the learning
component predicts one of the following labels, which correspond to the expected
user feedback: (i) confirm, the value of t[A] should be v; (ii) reject, v is not
a valid value for t[A] and GDR needs to find another update; (iii) retain, t[A] is a
correct value and there is no need to generate more updates for it. The user may also
suggest a new value v′ for t[A], and GDR will consider it as a confirm feedback for the
repair r′ = ⟨t, A, v′, 1⟩.
In the learning component, we learn a set of classification models {MA1 , . . . ,MAn},
one for each attribute Ai ∈ attr(R). Given a suggested update for t[Ai], model MAi
is consulted to predict user feedback. The models are trained by examples acquired
incrementally from the user. We present here our choices for data representation
(input to the classifier), classification model, and learning benefit scores.
Data Representation: For a given update r = ⟨t, Ai, v, s⟩ and user feedback F ∈
{confirm, reject, retain}, we construct a training example for model MAi in the form
⟨t[A1], . . . , t[An], v, R(t[Ai], v), F⟩. Here, t[A1], . . . , t[An] are the original attribute
values of tuple t and R(t[Ai], v)¹ is a function that quantifies the relationship between
t[Ai] and its suggested value v.
Including the original dirty tuple along with the suggested update value enables
the classifier to model associations between original attribute values and suggested
values. Including the relationship function, R, enables the classifier to model asso-
ciations based on similarities that do not depend solely on the values in the original
database instance and the suggested updates.
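Constructing such a training example is mechanical; a small sketch (the relationship function R is passed in, e.g., a string similarity in the spirit of Eq. 2.1, and the encoding is our own):

```python
# Build a training example <t[A1],...,t[An], v, R(t[Ai], v), F> for model M_Ai.

def make_example(t_values, i, v, feedback, rel):
    """t_values: original attribute values of t; i: index of the repaired
    attribute; v: suggested value; rel: relationship function R."""
    return tuple(t_values) + (v, rel(t_values[i], v), feedback)
```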
Active Learning Using Model Uncertainty: Active learning starts with a
preliminary classifier learned from a small set of labeled training examples. The
classifier is applied to the unlabeled examples and a scoring mechanism is used to
estimate the most valuable example to label next and add to the training set. Many
¹ We use a string similarity function.
criteria have been proposed to determine the most valuable examples for labeling
(e.g., [65, 66]), focusing on selecting the examples whose predictions have the largest
uncertainty.
One way to derive the uncertainty of an example is by measuring the disagreement
amongst the predictions it gets from a committee of k classifiers [45]. The committee
is built so that the k classifiers are slightly different from each other, yet they all
have similar accuracy on the training data. For an update r whose classification is
certain, all committee members would give the same prediction F ∈ {confirm, reject,
retain}. The uncertain ones will get different labels from the committee, and by adding
them to the training set, the disagreement amongst the members will be lowered.
In our implementation, each model MAi is a random forest, which is an ensemble
of decision trees [67] built in a similar way to construct a committee of
classifiers. A random forest learns a set of k decision trees. Let the number of instances
in the training set be N and the number of attributes in the examples be M. Each
of the k trees is learned as follows: randomly sample with replacement a set S of
size N ′ < N from the original data, then learn a decision tree with the set S. The
random forest algorithm uses a standard decision-tree learning algorithm with the
exception that at each attribute split, the algorithm selects the best attribute from
a random subsample of M′ < M attributes. We used the WEKA² random forest
implementation with k = 10 and default values for N ′ and M ′.
Computing Learning Benefit Score: To classify an update r = ⟨t, Ai, v, s⟩
with the learned random forest MAi, each tree in the ensemble is applied separately
to obtain the predictions F1, . . . ,Fk for r, then the majority prediction from the set of
trees is used as the output classification for r. The learning benefit or the uncertainty
of predictions of a committee can be quantified by the entropy on the fraction of
committee members that predicted each of the class labels.
Example 2.4.2 Assume that r1, r2 are two candidate updates to change the CT at-
tribute to ‘Michigan City’ in tuples t2, t3. The model of the CT attribute, MCT, is a
2http://www.cs.waikato.ac.nz/ml/weka/
random forest with k = 5. By consulting the forest MCT, we obtain for r1, the predic-
tions {confirm, confirm, confirm, reject, retain}, and for r2, the predictions {confirm,
reject, reject, reject, reject}. In this case, the final prediction for r1 is ‘confirm’ with
an uncertainty score of 0.86 (= −(3/5) log3(3/5) − (1/5) log3(1/5) − (1/5) log3(1/5)) and for r2
the final prediction is ‘reject’ with an uncertainty score of 0.45. In this case, r1 will
appear to the user before r2 because it has higher uncertainty.
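The scores in the example can be reproduced with a small entropy computation over the committee's vote fractions; the log base is the number of class labels (3 here), so a score of 1 means maximal disagreement. A minimal sketch:

```python
import math

def committee_uncertainty(votes, n_labels=3):
    """Entropy (log base = number of labels) of the fraction of committee
    members voting for each label; 0 = unanimous, 1 = maximal disagreement."""
    k = sum(votes)
    return -sum((v / k) * math.log(v / k, n_labels) for v in votes if v > 0)

# r1: 3 confirm, 1 reject, 1 retain; r2: 1 confirm, 4 reject
r1 = committee_uncertainty([3, 1, 1])  # ≈ 0.86, as in the example
r2 = committee_uncertainty([1, 4])     # ≈ 0.45, as in the example
```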
2.5 Experiments
In this section, we present a thorough evaluation of the GDR framework, which
has already been demonstrated in [60]. Specifically, we show that the proposed
ranking mechanism converges quickly to a better data quality state. Moreover, we
assess the trade-off between the user efforts and the resulting data quality.
Datasets. In our experiments, we used two datasets, denoted as Dataset 1
and 2 respectively. Dataset 1 is a real world dataset obtained by integrating
(anonymized) emergency room visits from 74 hospitals. Such patient data is used
to monitor naturally occurring disease outbreaks, biological attacks, and chemical
attacks. Since such data is coming from several sources, a myriad of data quality
issues arise due to the different information systems used by these hospitals and
the different data entry operators responsible for entering this data. For our ex-
periments, we selected a subset of the available patient attributes, namely Patient
ID, Age, Sex, Classification, Complaint, HospitalName, StreetAddress, City,
Zip, State, and VisitDate. For Dataset 2, we used the adult dataset from the
UCI repository (http://archive.ics.uci.edu/ml/). For our experiments, we used the
attributes education, hours per week, income, marital status, native country,
occupation, race, relationship, sex, and workclass.
Ground truth. To evaluate our technique against a ground-truth, we manually
repaired 20,000 patient records in Dataset 1. We used address and zip code lookup web
sites for this purpose. We assumed that Dataset 2, which is about 23,000 records, is
already clean and hence can be used as our ground truth. We synthetically introduced
errors in the attribute values as follows. We randomly picked a set of tuples, and then
for each tuple, we randomly picked a subset of the attributes to perturb by either
changing characters or replacing the attribute value with another value from the
domain attribute values. All experiments are reported when 30% of the tuples are
dirty.
Data Quality Rules. For Dataset 1, we used CFDs similar to what was
illustrated in Figure 2.1. The rules were identified while manually repairing the
tuples. For Dataset 2, we implemented the technique described in [36] to discover
CFDs and we used a support threshold of 5%.
User interaction simulation. We simulated user feedback to suggested updates
by providing answers as determined by the ground truth.
Data quality state metric. We report the improvement in data quality through
computing the loss (Eq. 2.4). We consider the ground truth as the desired clean
database Dopt.
Settings. All the experiments were conducted on a server with a 3 GHz pro-
cessor and 32 GB RAM running on Linux. We used Java to implement the proposed
techniques and MySQL to store and query the records.
2.5.1 VOI Ranking Evaluation
The objective here is to evaluate the effectiveness and quality of the VOI-based
ranking mechanism described in Section 2.4.1. In this experiment, we did not use the
learning component to replace the user; the user will need to evaluate each suggested
update. Recall that the grouping provides the user with related tuples and their
corresponding updates that could help in a quick batch inspection by the user.
We compare in this experiment the following techniques:
• GDR-NoLearning : The GDR framework of Figure 2.2 without the learning
component.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR-NoLearning, Greedy, Random.]
(a) Dataset 1.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR-NoLearning, Greedy, Random.]
(b) Dataset 2.
Fig. 2.3.: Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strate-
gies against the amount of feedback. Feedback is reported as the percentage of the
maximum number of verified updates required by an approach. Our application of the
VOI concept shows superior performance compared to other naïve ranking strategies.
• Greedy : Here, we rank the groups according to their sizes. The rationale behind
this strategy is that groups that cover larger numbers of updates may have high
impact on the quality if most of the suggestions within them are correct.
• Random: The naïve strategy where we randomly order the groups; all update
groups are equally important.
In Figure 2.3, we show the progress in improving the quality against the number
of verified updates (i.e., the amount of feedback). The feedback is reported as a
percentage of the total number of suggested updates through the interaction process
to reach the desired clean database.
The ultimate objective of GDR is to minimize user effort while reaching better
quality quickly. In Figure 2.3, the slope of the curves over the first iterations with the
user is the key indicator: the steeper the curve, the better the ranking.
As illustrated for both datasets, the GDR-NoLearning approach clearly outperforms
the Greedy and Random approaches. This is because the GDR-NoLearning
approach identifies the most beneficial groups, i.e., those most likely to contain
correct updates. While the Greedy approach improves the quality, its large groups
often consist mostly of incorrect updates, wasting user effort. The Random approach
showed the worst performance on Dataset 1, while on Dataset 2 it was comparable
with the Greedy approach, especially at the beginning of the curves. This is because
in Dataset 2 most group sizes were close to each other, making the Random and
Greedy approaches behave almost identically, whereas in Dataset 1 the group sizes
vary widely, making random choices ineffective. Finally, we notice that
GDR-NoLearning is much better for Dataset 1 than for Dataset 2, for two reasons
related to the nature of Dataset 2: (i) most of the initially suggested updates for
Dataset 2 are correct, and (ii) the sizes of the groups in Dataset 2
are close to each other. The consequence is that any ranking strategy for Dataset 2
will not be far from the optimal.
The results reported above clearly demonstrate the importance and effectiveness of the
GDR ranking component. The GDR-NoLearning approach is well suited for repairing
“very” critical data, where every suggested update has to be verified before applying
it to the database.
2.5.2 GDR Overall Evaluation
Here, we evaluate GDR’s performance when using the learning component to re-
duce user efforts. More precisely, we evaluate the VOI-based ranking when combined
with the active learning ordering. For this experiment, we evaluate the following
approaches:
• GDR: is the approach proposed in this chapter. In each interactive session, the user
provides feedback for the top ranked updates. The required amount of feedback
per group is inversely proportional to the benefit score of the group (Eq. 2.7)—
the higher the benefit the less effort from the user is needed, since most likely the
updates are correct and there are very few uncertain updates for the learned model
that would require user involvement. As such, we require that the user verifies di
updates for a group ci, where di = E × (1 − g(ci)/gmax), E is the initial number of
dirty tuples, and gmax = max∀cj {g(cj)}.
• GDR-S-Learning : Here, we eliminate the active learning from the system—the
updates are grouped and then ranked using VOI-based scoring alone. The user is
solicited for a random selection of updates within each group, instead of updates
ordered by uncertainty. However, all of the user feedback is used to train the
learning component, which then replaces the user on deciding for the remaining
updates in the group. GDR-S-Learning is included to assess the benefit of the active
learning aspect of our framework, compared with traditional passive learning.
• Active-Learning : In this approach, we eliminate the grouping and its ranking
from the GDR framework. In other words, we neither group the updates nor use
VOI-based scores for ranking. We only solicit user feedback for updates ordered
with the learner uncertainty scores. The user is required to provide feedback for
the top update and then the learning component is updated to reorder the updates
for the user in an iterative fashion. The resulting learned model is applied for pre-
dicting the remaining suggested updates and the database is updated accordingly.
We report the quality improvement for different amounts of feedback. This ap-
proach is included to assess the benefit of the grouping and the VOI-based ranking
mechanisms compared with using only an active learning approach.
• GDR-NoLearning : This approach is the one described in the previous experiment;
it provides a baseline to assess the utility of the machine learning aspect of GDR.
• Automatic-Heuristic: The BatchRepair method described in [3] for automatic data
repair using CFDs.
In Figure 2.4, we report the improvement in data quality as the amount of feedback
increases. Assuming that the user can afford to verify at most a number of updates
equal to the number of initially identified dirty tuples (6000 for Dataset 1 and 3000
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR, GDR-S-Learning, GDR-NoLearning, Active Learning, Heuristic.]
(a) Dataset 1.
[Plot: Data Quality Improvement (0–100%) vs. Feedback (User efforts, 0–100%); curves: GDR, GDR-S-Learning, GDR-NoLearning, Active Learning, Heuristic.]
(b) Dataset 2.
Fig. 2.4.: Overall evaluation of GDR compared with other techniques. The com-
bination of the VOI-based ranking with the active learning was very successful in
efficiently involving the user. The user feedback is reported as a percentage of the
initial number of the identified dirty tuples.
for Dataset 2), we report the amount of feedback as a percentage of this number.
The results show that GDR achieves superior performance compared with the other
approaches: for Dataset 1, GDR gains about 90% improvement with 20% effort, i.e.,
verifying about 1000 updates; for Dataset 2, about 94% quality improvement was
gained with 30% effort, again verifying about 1000 updates.
In Dataset 1, Active Learning is comparable to GDR only in the beginning of the
curve until reaching about 70% quality improvement. GDR-S-Learning starts to out-
perform Active Learning after about 45% user effort. The Heuristic approach repairs
the database without user feedback, therefore, it produces a constant result. Note
that the quality improvement achieved by the Heuristic approach is attained by GDR
with about 10% user effort, i.e., giving feedback for updates numbering about 10% of
the initial set of dirty tuples in the database. The GDR-NoLearning approach does
improve the quality of the database, but not as quickly as any of the approaches that
use learning methods. In comparison to Figure 2.3, the final performance of GDR-
NoLearning is 100%, assuming all required feedback was obtained. GDR involves
learning, which allows automatic updates to be applied and hence opens the door
to some mistakes. Thus, 100% accuracy may not be reached.
For Dataset 2, similar results were achieved. However, the Active Learning ap-
proach was not as successful as for Dataset 1. This is due to the random nature
of the errors in this dataset, which resulted in fewer correlations between these errors
that could be learned by the model. Due to the wider array of real-world dependen-
cies in Dataset 1, the machine learning methods were more successful and achieved
better performance. For example, some hospitals located on the boundary between
two zip codes have their zip attributes dirty; this is most likely due to a data entry
confusion on where they are really located.
The superior performance of GDR is justified by the following: for a single group
of updates, using the learner uncertainty to select updates can effectively strengthen
the learned model predictions as these “uncertain” updates are more important for
the model. In GDR-S-Learning, randomly inspecting updates from the groups pro-
vided by the VOI-based ranking does enhance the learned model. However, more
user effort is wasted in verifying less important updates according to the learning
benefit. For the Active Learning approach, it is apparent that having the user spend
more effort does not help the learned model, due to overfitting.
This problem is avoided in both GDR and GDR-S-Learning approaches because of
the grouping provided by the GDR framework. The grouping provides the learned
model a mechanism to adapt locally to the current group, which in turn provides
the necessary guidance for the model to strongly learn the associations for a highly
beneficial group rather than just weakly learning the associations for a wide variety
of cases. This is also the reason that the GDR-S-Learning eventually outperforms
the Active Learning with an increase in user effort.
This experiment demonstrates the importance of the learning component for
achieving a faster convergence to a better quality. The results support our initial
hypothesis about the existence of correlations between the dirty and correct versions
of the tuples in real-world data. Also, the combination of VOI-based ranking with
active learning improves over the traditional active learning mechanism.
2.5.3 User Efforts vs. Repair Accuracy
We evaluate GDR’s ability to provide a trade-off between user effort and accurate
updates. We use precision and recall, where precision is defined as the ratio of
the number of values that have been correctly updated to the total number of values
that were updated, while recall is defined as the ratio of the number of values that
have been correctly updated to the number of incorrect values in the entire database.
Since we know the correct data, we can compute these values.
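Under these definitions, precision and recall over cell-level updates can be computed as in the following sketch; the cell identifiers and the dictionary-based representation are illustrative assumptions, not GDR's actual data structures.

```python
def repair_accuracy(applied, ground_truth, dirty_cells):
    """applied: {cell: repaired value}; ground_truth: {cell: correct value};
    dirty_cells: cells whose original value was incorrect.
    precision = correctly updated / all updated values
    recall    = correctly updated / all incorrect values in the database"""
    correct = sum(1 for cell, v in applied.items() if ground_truth.get(cell) == v)
    precision = correct / len(applied) if applied else 1.0
    recall = correct / len(dirty_cells) if dirty_cells else 1.0
    return precision, recall
```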
In this experiment, the user can afford to verify only F updates; GDR then decides
about the rest of the updates automatically. GDR asks the user to verify di of the
suggested updates in each group of repairs ci, until we reach F .
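The per-group verification budget di from Section 2.5.2 can be computed directly from the group benefit scores; the numeric scores in the usage below are hypothetical.

```python
def feedback_budget(group_scores, n_dirty):
    """d_i = E * (1 - g(c_i)/g_max): the higher a group's benefit score
    g(c_i) (Eq. 2.7), the fewer updates the user is asked to verify in it.
    n_dirty is E, the number of initially identified dirty tuples."""
    g_max = max(group_scores)
    return [n_dirty * (1.0 - g / g_max) for g in group_scores]
```

For example, with hypothetical scores [10.0, 5.0, 2.0] and E = 100, the budgets are approximately [0, 50, 80]: the top-scoring group needs the least verification.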
In Figure 2.5, we report the precision and recall values resulting from repairing
the database as we increase F (reported as % of dirty tuples). For both datasets the
precision and recall generally improve as F increases. However, for Dataset 1, the
precision is always higher than for Dataset 2. This is due to the lower accuracy of
the learning component for Dataset 2, which stems from the random nature of the
errors in Dataset 2. Overall, these results illustrate the benefit of user feedback—as
the user effort increases, the repair accuracy increases.
2.6 Summary
We presented GDR, a framework that combines constraint-based repair techniques
with user feedback through an interactive process. The main novelty of GDR is to
solicit user feedback for the most useful updates using a novel decision-theoretic mech-
anism combined with active learning. The aim is to move the quality of the database
to a better state as far as the data quality rules are concerned. Our experiments
[Plot: Precision and Recall (0.5–1) vs. Feedback (User efforts, 0–100%); curves: Precision, Recall.]
(a) Dataset 1.
[Plot: Precision and Recall (0.5–1) vs. Feedback (User efforts, 0–100%); curves: Precision, Recall.]
(b) Dataset 2.
Fig. 2.5.: Accuracy vs. user efforts. As the user spends more effort with GDR, the
overall accuracy is improved. The user feedback is reported as a percentage of the
initial number of the identified dirty tuples.
show very promising results in moving the data quality forward with minimal user
involvement.
3. SCALABLE APPROACH TO GENERATE DATA
CLEANING UPDATES
Existing automatic data cleaning techniques are not scalable, and moreover,
constraint-based cleaning techniques are recognized to fall short of identifying correct
cleaning updates. In Chapter 2, GDR relies on such automatic cleaning techniques
to generate candidate updates to the dirty database. To enable GDR to handle large
databases, we introduce in this chapter a scalable data cleaning approach based
on Machine Learning techniques. Involving ML helps in obtaining more accurate
cleaning updates than the constraint-based methods.
The chapter is organized as follows: In Section 3.1, we highlight the need for
different data cleaning techniques and discuss the challenges. Section 3.2 defines the
problem and introduces the notion of maximal likelihood repair. Section 3.3 presents
our solution for modeling dependencies and predicting accurate replacement values.
Section 3.4 presents SCARE, our scalable solution to repair the data. We demon-
strate the validity of our approach and experimental results in terms of efficiency and
scalability in Section 3.5, and finally, summarize the chapter in Section 3.6.
3.1 Introduction
Most existing solutions to repair dirty databases by value modification follow
constraint-based repairing approaches [3–5], which search for minimal change of the
database to satisfy a predefined set of constraints. While a variety of constraints (e.g.,
integrity constraints, conditional functional and inclusion dependencies) can detect
the presence of errors, they are recognized to fall short of guiding the correction of the
errors and, worse, may introduce new errors when repairing the data [6]. Moreover,
despite the research conducted on integrity constraints to ensure the quality of the
data, in practice, databases often contain a significant amount of non-trivial errors.
These errors, both syntactic and semantic, are generally subtle mistakes which are
difficult or even impossible to express using the general types of constraints available
in modern DBMSs [7]. This highlights the need for different techniques to clean dirty
databases.
In this chapter, we address the issues of scalability and accuracy of replacement
values by leveraging Machine Learning (ML) techniques for predicting better quality
updates to repair dirty databases.
Statistical ML techniques (e.g., decision tree, Bayesian networks) can capture
dependencies, correlations, and outliers from datasets based on various analytic, pre-
dictive or computational models [8]. Existing efforts in data cleaning using ML tech-
niques mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To
the best of our knowledge, our work is the first approach to consider ML techniques
for repairing databases by value modification.
Involving ML techniques for repairing erroneous data is not straightforward and
it raises four major challenges: (1) Several attribute values (of the same record) may
be dirty. Therefore, the process is not as simple as predicting values for a single
erroneous attribute. This requires accurate modeling of correlations between the
database attributes, assuming a subset is dirty and its complement is reliable. (2) An
ML technique can predict an update for each tuple in the database; the question
is how to distinguish the predictions that should be applied. Therefore, a measure
to quantify the quality of the predicted updates is required. (3) Overfitting
may occur when modeling a database with a large variety of dependencies that may
hold locally for data subsets but do not hold globally. (4) Finally, the process of
learning a model from a very large database is expensive, and the prediction model
itself may not fit in the main memory. Despite the existence of scalable ML techniques
for large datasets, they are either model dependent (i.e., limited to specific models,
for example SVM [10]) or data dependent (e.g., limited to specific types of datasets
such as scientific data and document repositories). Note that scalability is also an
issue for the constraint-based repairing approaches [11].
Such limitations motivate the need for effective and scalable methods to accurately
predict cleaning updates with statistical guarantees. More precisely, our contributions
in this chapter can be summarized as follows:
• We formalize a novel data repairing approach that maximizes the likelihood
of the data given the underlying data distribution, which can be modeled using
statistical ML techniques. The objective is to apply selected database updates
that (i) will best preserve the relationships among the data values, and (ii)
will introduce a small amount of changes. This approach enables a variety
of ML techniques to be involved for the purpose of accurately repairing dirty
databases by value modification. This way, we eliminate the necessity to pre-
define database constraints, which requires expensive expert involvement. In
contrast to the constraint-based data repair approaches, which find the min-
imum number of changes to satisfy a set of constraints, our likelihood-based
repair approach finds a bounded amount of changes that maximizes the data
likelihood.
• One of the challenges is that multiple attribute values may be considered dirty.
Therefore, we introduce a technique to provide predictions for multiple at-
tributes at a time, while taking into account two types of dependencies: (i)
the dependency between the identified clean attributes and dirty attributes, as
well as, (ii) the dependency among the dirty attributes themselves. We present
our technique by introducing the probabilistic principles which it relies upon.
• We propose SCARE (SCalable Automatic REpairing), a systematic scalable
framework for repairing erroneous values that follows the new repairing ap-
proach and, more importantly, it is scalable for very large datasets. SCARE
has a robust mechanism for horizontal data partitioning to ensure the scala-
bility and enable parallel processing of data blocks; various ML methods are
applied to each data block to model attributes values correlations and provide
“local” predictions. We then provide a novel mechanism to combine the local
predictions from several data partitions. The mechanism computes the validity
of the predictions of the individual ML models and takes into account the
models’ reliability in terms of minimizing the risk of wrong predictions, as well
as the significance of the partitions’ sizes used in the learning stage. Finally, given
several local predictions for repairing a tuple, we incorporate these predictions
into a graph optimization problem, which captures the associations between
the predicted values across the partitions and obtain more accurate final tuple
repair predictions.
• We present an extensive experimental evaluation to demonstrate the effective-
ness, efficiency, and scalability of our approach on very large real-world datasets.
3.2 Problem Definition and Solution Approach
In this section, we formalize our maximal likelihood repair problem and introduce
our solution approach.
3.2.1 Problem Definition
We consider a database instance D over a relation schema R with A denoting its
set of attributes. The domain of an attribute A ∈ A is denoted by dom(A).
In the relation R, a set F = {E1, . . . , EK} ⊆ A represents the flexible attributes,
which are allowed to be modified (in order to substitute the possibly erroneous values),
and the other attributes R = A − F = {C1, . . . CL} are called reliable with correct
values. Hence a database tuple t has two parts: the reliable part (t[R] = t[C1, . . . CL]),
and the flexible part (t[F ] = t[E1, . . . EK ]). For short we refer to t[R] and t[F ] as r
and f , respectively (i.e., t = rf). We assume that it is possible to identify a subset
Dc ⊂ D of clean (or correct) tuples and De = D − Dc represents the remaining
[Table omitted: sample relation of 8 tuples t1–t8 with attributes Name, Institution, AC, Tel, City, State, and Zip.]
Fig. 3.1.: Illustrative example
possibly dirty tuples. This distinction does not have to be accurate in specifying the
dirty records, but it should be accurate in specifying the clean records. Our objective
is to learn from the correct tuples in Dc to predict accurate replacement values for
the possibly dirty tuples in De.
There are various techniques to distinguish Dc: it is always possible to use
reference data and existing statistical techniques (e.g., [8, 38]), as well as database
constraints (if available), to provide a score Pe(t) ∈ [0..1] for each database tuple t for
being erroneous. Applying a conservative threshold on the scores of each tuple Pe(t),
we can select high quality records to be used for training.
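Concretely, a conservative threshold on the error scores carves out the training set; the threshold value and the scoring function below are illustrative stand-ins for the estimators cited above.

```python
def split_clean_dirty(tuples, pe, tau=0.05):
    """Keep only tuples whose estimated error score Pe(t) falls below a
    conservative threshold tau as the clean training set Dc; everything
    else is treated as possibly dirty (De)."""
    dc = [t for t in tuples if pe(t) < tau]
    de = [t for t in tuples if pe(t) >= tau]
    return dc, de
```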
Example 3.2.1 Consider the example relation in Figure 3.1 with a sample of 8 tu-
ples about some personal information: Name, Institution, area code AC, telephone
number Tel, in addition to address information: City, State and Zip.
This data is a result of integrating professional contact information and lookup
address database. Due to the integration process, we know that some of the address
attributes (City, State and Zip) may contain errors. Therefore, we call the address
attributes flexible attributes. After the integration process, we could separate high
quality records by consulting other reference data or verifying some widely known re-
lationships among the attributes. For this example, tuples t5, . . . , t8 ∈ Dc are identified
as correct ones, while we are not sure about tuples t1, . . . , t4 ∈ De.
We introduce the data repair likelihood given the data distribution as a technique
to guide the selection of the updates to repair the dirty tuples. Our hypothesis is
that the more an update makes the data follow the underlying data distribution
at the least cost, the more likely the update is to be correct.
The likelihood of the database D is the product of the tuples’ probabilities given
a probability distribution for the tuples in the database. Given the identified clean
subset of the database Dc, we can model the probability distribution P (R,F ). Then,
the likelihood of the possibly erroneous subset De can be written (as log likelihood):
L(De | Dc) = Σ_{t∈De} log P(t | Dc) = Σ_{t=rf∈De} log P(f | r)    (3.1)
where we use P (t | Dc) = P (f | r), which we discuss in Section 3.3.
Assume that, for a given tuple t = rf , an ML technique predicts f ′ instead of f . We
say that the update u is predicted to replace f by f ′. Applying u to the database
will change the likelihood of the data; we call the amount of increase in the data
likelihood given the data distribution as the likelihood benefit of u.
Definition 3.2.1 Likelihood benefit of an update u (l(u)): Given a database
D = Dc ∪De, t = rf ∈ De and an update u to replace f by f ′, the likelihood benefit
of u is the increase in the database likelihood given the data distribution learned from
Dc, or (L(D^u_e | Dc) − L(De | Dc)), where D^u_e refers to De when the update u is applied.
Using Eq. 3.1 we obtain
l(u) = logP (f ′ | r)− logP (f | r). (3.2)
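Eq. 3.2 translates directly into code; in practice the two probabilities would come from the model learned on Dc, while the numbers used below are purely illustrative.

```python
import math

def likelihood_benefit(p_new, p_old):
    """l(u) = log P(f' | r) - log P(f | r): the gain in data log-likelihood
    from replacing the flexible part f by the predicted f'."""
    return math.log(p_new) - math.log(p_old)
```

For instance, an update that raises P(f | r) from 0.2 to 0.8 has benefit log 4, a positive value, so applying it increases the data likelihood.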
We also define the cost of an update as follows:
Definition 3.2.2 Cost of an update u (c(u)): For a given database tuple t = rf
and an update u to replace f by f ′, the cost of u is the distance between f and f ′,
c(u) = Σ_{E∈F} dE(f[E], f′[E])    (3.3)
where dE(f [E], f ′[E]) is a distance function for the value domain of attribute E that
returns a score between 0 and 1. Examples of distance functions for string attributes
include the normalized Edit distance or Jaro coefficient; for numerical attributes,
the normalized distance can be used, e.g., dE(a, b) = |a − b| / (maxE − minE), where
a and b are two numbers in dom(E), and maxE, minE are the maximum and minimum
values in dom(E), respectively.
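Eq. 3.3 with a normalized edit distance for string attributes can be sketched as follows (so the Zip change “47906” → “47907” used in Example 3.2.2 contributes 1/5 = 0.2):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def update_cost(f, f_prime):
    """c(u): sum over flexible attributes of the edit distance between old
    and predicted string values, normalized by the longer string's length."""
    total = 0.0
    for old, new in zip(f, f_prime):
        longest = max(len(old), len(new))
        total += edit_distance(old, new) / longest if longest else 0.0
    return total
```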
Our objective is to modify the data to maximize its likelihood; however, similar
to existing repairing approaches, we need to be conservative in modifying the
data. Therefore, we bound the amount of changes introduced to the database by
a parameter δ. Hence, the problem becomes: given an allowed amount of changes
δ, how to best select the cleaning updates from all the predicted updates? This is
a constrained maximization problem where the objective is to find the updates that
maximize the likelihood value under the constraint of a bounded amount of database
changes, δ. We call this problem the “Maximal Likelihood Repair” problem.
Definition 3.2.3 Maximal Likelihood Repair: Given a scalar δ and a database
D = De ∪ Dc. The Maximal Likelihood Repair problem is to find another database
instance D′ = D′e ∪ Dc, such that L(D′e | Dc) is maximum subject to the constraint
Dist(D,D′) ≤ δ.
where Dist is a distance function between the two database instances D and D′
before and after the repairing; it can be defined as
Dist(D,D′) = Σ_{t∈D, A∈A} dA(t[A], t′[A]), where t′ ∈ D′ is the repaired tuple
corresponding to tuple t ∈ D.
Regarding the estimation of δ, it is possible to use the score Pe(t), which estimates
the erroneousness of tuple t, to set δ = ϵ Σ_{t∈De} Pe(t), where ϵ ∈ [0..1]. The idea is
that a possibly erroneous tuple is expected to be modified in proportion to its score of
being erroneous. ϵ can be chosen close to zero to be more conservative about the
amount of introduced changes.
3.2.2 Solution Approach
For each tuple t = rf , we obtain the prediction f ′ that represents an update u
to t. We compute the likelihood benefit and cost of u. Finally, we need to find the
subset of updates that maximizes the overall likelihood subject to the constraint that
the total cost is not more than δ, i.e., Dist(D,D′) ≤ δ.
Formally, we are given a set U of updates and, for each update u, we compute l(u)
and c(u) using Eq. 3.2 and 3.3, respectively. Our goal is to find the set of updates U ′ ⊆ U
such that Σ_{∀u∈U′} l(u) is maximum subject to

Σ_{∀u∈U′} c(u) ≤ δ.    (3.4)
This is a typical 0/1 knapsack problem setting, which implies that the maximal
likelihood repair problem is NP-complete.
Heuristic and quality measure: To solve the above problem, we use the
well-known greedy heuristic for the 0/1 knapsack problem: process the updates in
decreasing order of the ratio l(u)/c(u). This heuristic suggests that the “correctness
measure”
of an update u is the ratio of the update’s likelihood benefit to the cost of applying
the update to the database (i.e., the higher the likelihood benefit with small cost, the
more likely the update to be correct). Empirically, this gives good predictions for the
updates as we will illustrate in our experiments.
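The greedy selection can be sketched as follows, taking precomputed (l(u), c(u)) pairs; the update identifiers and numbers are hypothetical, and every cost is assumed positive.

```python
def select_updates(updates, delta):
    """Greedy 0/1-knapsack heuristic for maximal likelihood repair:
    sort predicted updates by decreasing l(u)/c(u) and apply while the
    accumulated cost stays within the change budget delta.
    updates: list of (update_id, likelihood_benefit, cost) with cost > 0."""
    chosen, spent = [], 0.0
    for uid, l, c in sorted(updates, key=lambda u: u[1] / u[2], reverse=True):
        if spent + c <= delta:
            chosen.append(uid)
            spent += c
    return chosen
```

With equal benefits, the cheaper update wins, mirroring Example 3.2.2 below: given hypothetical updates `("u1", 1.0, 4.0)` and `("u2", 1.0, 2.0)` and a budget of 2, only `u2` is selected.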
Example 3.2.2 In Figure 3.1, assume that two updates were predicted for the
database: u1 updates t3 such that f′3 = {“Chicago”, “IL”, “60614”}, and u2 updates t4
such that f′4 = {“WLafayette”, “IN”, “47907”}. Assume also that u1 and u2 have
the same likelihood benefit. In this case, u2 incurs the lower cost, updating one
character in each of the Zip and City attributes (the Zip from “47906” to “47907”
and the City from “Lafayette” to “WLafayette”), while u1 costs updating 4
characters in the Zip, from “61801” to “60614”. Hence, for δ ≤ 2 characters, only
u2 will be applied to the database.
3.3 Modeling Dependencies and Predicting Updates
The key challenge when considering data repair using the data distribution is
that multiple attribute values may be dirty. When a single attribute
is erroneous, the problem is to model the conditional probability distribution of the
erroneous attribute given the other attributes; hence, a single classification model
can be used to obtain the predicted values for the erroneous attribute. In most cases,
however, a set of attributes has low-quality values, not a single
attribute. Therefore, we need to model the probability distribution of the subset of
dirty attributes given the other attributes that have reliable values (i.e., that are most likely
to be correct) to achieve a better prediction of the replacement values.
Example 3.3.1 In Figure 3.1, assume we know that only the City attribute contains
some errors. This is the simple case, because an ML model can be trained on
the database tuples considering City as the label to be predicted. In
practice, however, more than one attribute can be dirty at the same time, for example, all the
address attributes. In this case, we can use an ML technique to model the distribution
of the combination (City, State, Zip), taking into account their possible inter-dependencies,
given existing reliable values of attributes, e.g., (Name, Institution, AC, Tel).
3.3.1 Modeling Dependencies
Let SR = dom(C1) × dom(C2) × · · · × dom(CL) denote the space of possible reliable
parts of tuples t[C1 . . . CL] (with clean attribute values), and SF = dom(E1) ×
dom(E2) × · · · × dom(EK) denote the space of possible flexible parts of tuples t[E1 . . . EK]
(with possibly erroneous values). Assuming that the tuples of D are generated
randomly according to a probability distribution P(R, F) on SR × SF, P(F | r) is the
conditional distribution of F given R = r, and PEi(Ei | r) is the corresponding marginal
distribution of the values of attribute Ei:

PEi(ei | r) = ∑_{f∈SF : f[Ei]=ei} P(f | r).
Note that the posterior probability distribution P(F | r) provides the means to analyze
the dependencies among the flexible attributes. The distribution gives
the probability of each combination of values for the flexible attributes ⟨e1, . . . , eK⟩,
where e1 ∈ dom(E1), . . . , eK ∈ dom(EK).
Given a database tuple t = rf , the conditional probability of each combination of
the flexible attribute values f can be computed using the product rule:
P(f | r) = P(f[E1] | r) · ∏_{i=2}^{K} P(f[Ei] | r, f[E1 . . . Ei−1]). (3.5)
Note that we assume a particular order in the dependencies among the flexible
attributes {E1, . . . , EK}. To obtain this order, we leverage an existing technique [68]
to construct a dependency network for the database attributes. The dependency
network is a graph with the database attributes as the vertices; there is a directed
edge from Ai to Aj if the analysis determined that Aj depends on Ai. In our case,
there are two sets of vertices: the reliable set R and the flexible set F. The first
flexible attribute E1 in the order is the one that has the maximum number of reliable
attributes as its parents in the graph. Subsequently, the next attribute in the order
is the one with the maximum number of parents that are either reliable attributes
or flexible attributes with an already assigned order. In our experiments, we followed this
procedure by analyzing a sample of the database to determine the dependency order
of the flexible attributes. An alternative method to compute the conditional
probability P(f | r) without assuming any particular order of the flexible attributes
is Gibbs sampling [69]; however, it is too expensive to apply even to
moderately sized databases. Please refer to [68] for further details.
In Section 3.3.2, we introduce an efficient way to obtain the predictions f′ that are
desirable for our cleaning approach and to compute the conditional probabilities P(f | r).
Algorithm 3.1 GetPredictions(Classification model Mi, input tuple ri = ⟨r, f[E1], . . . , f[Ei−1]⟩,
probability P, database tuple t = rf)
1: if (i > K) then
2: f ′ = ri − r
3: AllPredictions = AllPredictions ∪{(f ′, P )}
4: return
5: end if
6: fEi = Mi(ri)
7: rs = ⟨ri, fEi⟩
8: Ps = P × P (fEi | ri)
9: GetPredictions(Mi+1, rs, Ps, t)
10: if fEi ≠ t[Ei] then
11: r′s = ⟨ri, t[Ei]⟩ {Add the original Ei value to the next input}
12: P′s = P × P(t[Ei] | ri)
13: GetPredictions(Mi+1, r′s, P′s, t) {Proceed with the original Ei value}
14: end if
3.3.2 Predicting Updates
We use an ML model M (as predictor) to model the joint distribution in
Eq. 3.5. The model M is a mapping SR → SF that assigns (or predicts) the flexible
attribute values f′ for a database tuple t = rf given the values r of the reliable
attributes R. The prediction takes the form:

M(r) = ⟨M1(r), . . . ,MK(r)⟩ = f′.
To estimate the joint distribution of the flexible attribute values, P(f | r) in Eq.
3.5, we learn K classification models Mi(·), each on the input space SR × dom(E1) × · · · ×
dom(Ei−1) (i.e., using all the reliable attributes and the flexible attributes preceding
attribute Ei).
Mi : SR × dom(E1)× · · · × dom(Ei−1) → dom(Ei)
We assume that Mi is a probabilistic classifier (e.g., Naïve Bayes) that is
trained using Dc and produces a probability distribution over the values of the flexible
attribute Ei given ⟨r, f[E1], . . . , f[Ei−1]⟩.
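As a toy stand-in for the chain of probabilistic classifiers M1 . . . MK, the sketch below approximates each Mi by conditional frequency counts over clean tuples. A real implementation would use any probabilistic classifier (e.g., Naïve Bayes); the tuples below are purely illustrative.

```python
from collections import Counter, defaultdict

# Each "classifier" estimates P(E_i = e | input) by counting how often
# a label follows a given (reliable + earlier flexible) input in Dc.

class FreqClassifier:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, inputs, labels):
        for x, y in zip(inputs, labels):
            self.counts[tuple(x)][y] += 1
        return self

    def predict_proba(self, x):
        c = self.counts[tuple(x)]
        total = sum(c.values())
        return {y: n / total for y, n in c.items()} if total else {}

# Train M1 (City given reliable part) and M2 (State given reliable + City).
clean = [(("765",), "WLafayette", "IN"),
         (("765",), "WLafayette", "IN"),
         (("765",), "Lafayette", "IN")]
m1 = FreqClassifier().fit([r for r, _, _ in clean],
                          [city for _, city, _ in clean])
m2 = FreqClassifier().fit([r + (city,) for r, city, _ in clean],
                          [st for _, _, st in clean])
print(round(m1.predict_proba(("765",))["WLafayette"], 4))  # 0.6667
```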
One efficient greedy way to approximate the optimal prediction f′ proceeds as
follows: given a tuple t = rf, the classifier M1 is used to predict the value of attribute
E1 (i.e., f′[E1]) given r. Then, M2 predicts the value of attribute E2 given r and
f′[E1] as input. Proceeding in this way, Mi predicts the value of attribute Ei given
r and f′[E1] . . . f′[Ei−1]. This approach can be viewed as greedily searching for a
path in a tree that has the possible values of f′ ∈ SF at the leaves. We call this tree
the flexible attribute values search tree. Needless to say, this approach does
not guarantee finding the prediction f′ with the highest probability.
For a tuple t, to find a better prediction suited to our cleaning approach,
we follow the conservative assumption of preferring the original attribute values in
the tuple when updating the database. Under this assumption, the best prediction
will be among the explored tree paths that involve the original values of the tuple t.
Hence, in addition to the greedy path, we can compute additional paths that assume
the original values in the tuple are the supposed predictions. The algorithm to
compute a set of predictions for the flexible attributes F of a given tuple t is described
in Algorithm 3.1, GetPredictions. Basically, GetPredictions proceeds recursively in
the flexible attribute values search tree. At each tree level i, two branches are
considered when the prediction of attribute Ei differs from its original value in the
tuple; otherwise, a single branch is considered. The initial call to Algorithm 3.1 to
get predictions for tuple t = rf is GetPredictions(M1, r, 1.0, t = rf).
In Algorithm 3.1, Line 1 checks whether we have passed the prediction of the last flexible
attribute EK; in this case, we add the flexible part f′ of the obtained final tuple
to the AllPredictions list. In Line 6, we predict the value fEi of attribute Ei. Lines
7 and 8 compose the new input rs by adding fEi to ri and compute the prediction
probability so far, Ps. We then proceed recursively to get the prediction for the next
flexible attribute Ei+1. Lines 11-13 are executed if the predicted value fEi is
different from the original value t[Ei]. In this case, we compose another input r′s
using the original value t[Ei], compute the prediction probability so far using
P(t[Ei] | ri) from the model Mi, and finally proceed recursively to get a prediction
for Ei+1.
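The recursion of Algorithm 3.1 can be sketched in plain Python as follows. The classifiers here are hard-coded stubs returning {value: probability} dictionaries, so the model names and numbers are purely illustrative.

```python
# Sketch of Algorithm 3.1: explore the greedy branch, and a second
# branch keeping the tuple's original value whenever the top prediction
# differs from it (mirroring lines 10-14 of the algorithm).

def get_predictions(models, r, t_flex):
    all_preds = []

    def recurse(i, prefix, prob):
        if i == len(models):                 # past the last flexible attr
            all_preds.append((prefix, prob))
            return
        dist = models[i](r + tuple(prefix))
        best = max(dist, key=dist.get)       # greedy branch
        recurse(i + 1, prefix + [best], prob * dist[best])
        if best != t_flex[i]:                # keep-original branch
            recurse(i + 1, prefix + [t_flex[i]],
                    prob * dist.get(t_flex[i], 0.0))

    recurse(0, [], 1.0)
    return all_preds

# Two flexible attributes; the classifier stubs ignore their input.
m_city = lambda x: {"WLafayette": 0.7, "Lafayette": 0.3}
m_state = lambda x: {"IN": 0.9, "IL": 0.1}
preds = get_predictions([m_city, m_state], ("765",), ["Lafayette", "IN"])
print(len(preds))  # 2: the greedy path plus the keep-original City path
```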
Example 3.3.2 Consider the example relation in Figure 3.1. Assume that tuple t4
was marked as erroneous and we want to obtain predictions for its flexible attributes.
In GetPredictions, ri is initially the reliable attribute values {“C. Clifton”, “Purdue
Univ.”, “765”, “494-6005”}. In Line 6, the classifier M1, which was trained using only
the set of reliable attributes to predict the first flexible attribute City, provides the
prediction “WLafayette”. Then rs is composed as the input to the next classifier
M2, which was trained on the reliable attributes and the first flexible attribute City
to predict the second flexible attribute State: rs = {“C. Clifton”, “Purdue Univ.”,
“765”, “494-6005”, “WLafayette”}. Since the predicted City is different from the
one in the table, we compose another input to the classifier M2 with the original City
value: r′s = {“C. Clifton”, “Purdue Univ.”, “765”, “494-6005”, “Lafayette”}. We
obtain the prediction for the State given the two inputs rs and r′s and proceed
recursively until we use M3 to predict the Zip; we finally extract f′ from each ri in
Line 2 to end up with the list AllPredictions.
GetPredictions produces, for a given tuple t, at most 2^K predictions with their
probabilities; in practice, however, the number of predictions is far less than 2^K.
We select the prediction f′ with the best benefit-cost ratio, i.e., the update u that
replaces f with f′ and results in the highest l(u)/c(u). Note that we need to compute l(u)
only for f′ with P(f′ | r) greater than P(f | r), the probability of the original
values; otherwise, the likelihood benefit of the predicted update would be negative. Note
also that P(f | r) is included in the output of GetPredictions.
In Section 3.4, we present the method to scale up the maximal likelihood repair
problem and obtain the predicted updates u along with their likelihood benefit l(u). The
cost c(u) is straightforward to compute.
3.4 Scaling Up the Maximal Likelihood Repairing Approach
One of the key challenges in repairing dirty databases is scalability [11]. For
the case of maximal likelihood repair, the scalability issue is mainly due to learning
a set of classification models to predict the flexible attribute values. Usually, this
process is at least quadratic in the database size. Moreover, the learning process and
the model itself may not fit in main memory. There are efforts on learning
from large-scale datasets, but most of these techniques are either limited to specific
ML models (e.g., scalable learning using SVM [10]) or to specific types
of datasets, such as scientific data and document repositories.
In this section, we present a model-independent method to learn and predict up-
dates to the database that is based on horizontally partitioning the database. Each
database tuple will be a member of several partitions (or blocks). Each partition b
is processed to provide predictions for the erroneous tuples t ∈ b depending on the
database distribution learnt from block b (i.e., local predictions). Finally, we present
a novel mechanism to combine the local predictions from the different partitions and
determine more accurate final predictions.
This method is in the spirit of ensemble learning [70] or committee-based
approaches, where the task is to predict a single class attribute by partitioning the
dataset into several smaller partitions and training a model on each data partition.
For a given tuple, each model provides a prediction for the class attribute, and the
final prediction is the one with the highest aggregated prediction probability. In our
case, however, we want to predict the values of multiple flexible attributes together;
we are not limited to predicting a single attribute value. Hence, for a given tuple,
we obtain a prediction (a combination f′ of the flexible attribute values) from each
data partition. We then propose a technique that combines the models’ predictions into
a graph optimization problem to find the final prediction for the flexible attributes.
Our main insight is that the final (combinations of) predicted values are those that
maximize the associations among the predicted values across the partitions.
Our mechanism to collect and incorporate the predicted updates takes into account
the reliability of the learnt classification models themselves to minimize the risk of
the predicted updates.
After obtaining the final predicted values with their likelihood benefit l(u), we use
them in the maximal likelihood repair problem (Eq. 3.4).
3.4.1 Process Overview
Algorithm 3.2 illustrates the main steps of the SCARE process to get the predicted
updates along with their likelihood benefit. The primary input to the framework is
a database instance D. The second input is a set of database partitioning functions
(or criteria) H = {h1, . . . , hJ}.
There are two main phases for SCARE: (1) Updates generation phase (lines 1-8),
and (2) Tuple repair selection phase (lines 9-13).
In Phase 1 (Line 1), each function hj ∈ H partitions D into blocks
{b1j, b2j, . . . }. Then, the loop in lines 2-8 processes each block bij as follows: (i)
learn the set of classifiers Mij from the identified clean tuples in bij (Line 3); (ii)
use Mij to predict the flexible attribute values for the possibly erroneous tuples
in bij using Algorithm 3.1 (lines 4-7). For each tuple, the prediction is considered a
candidate tuple repair and is stored in a temporary repair storage, denoted RS.
Since each tuple will be a member of several data partitions, we end up with a
set of candidate tuple repairs for each possibly erroneous tuple. The details of the
repair generation are provided in Section 3.4.2.
Phase 2 (lines 9-13) loops over each tuple t ∈ De, retrieves all its candidate tuple
repairs from the repair storage RS, and then uses Algorithm 3.3, SelectTupleRepair, to get
the final tuple repair (update) with its estimated likelihood benefit. The details of
the repair selection algorithm are provided in Section 3.4.3. Note that each iteration
in Phase 1 does not depend on other iterations (similarly for the iterations of Phase
2). Hence, SCARE can be efficiently parallelized.
Algorithm 3.2 SCARE(D dirty database, H = {h1, . . . , hJ} DB partitioning func-
tions)
1: Given H, partition D into blocks bij .
2: for all block bij do
3: Learn the models Mij .
4: for all tuple t = rf ∈ bij ∧ t ∈ De do
5: Use Mij to predict f′j and get Pij(f′j | r) and Pij(f | r)
6: Store f′j, Pij(f′j | r) and Pij(f | r) in RS. {store in the Repair Storage}
7: end for
8: end for
9: for all tuple t = rf ∈ De do
10: RS(t) = the candidate tuple repairs for t in RS.
11: f′ = SelectFinalPrediction(RS(t))
12: For the update u changing f to f′, compute the likelihood measure l(u) if f ≠ f′.
13: end for
3.4.2 Repair Generation Phase
In this phase, the data is partitioned, as we explain shortly, for two main benefits:
(i) scaling to large data by enabling independent processing of each partition,
and (ii) more accurate and efficient learning of the classification models for the
prediction task. The first benefit is obvious; the second is obtained for the following
reason: when we train a classification model for prediction, ideally, we need
the model to provide high prediction accuracy by capturing all the possible dependencies
in the data (we call this a model with a global view). All the statistically significant
dependencies constitute the model’s search space. However, if the space contains
many weak dependencies, most likely the model will not be able to capture them;
and if it does, the global view will not be accurate enough for prediction because of
model overfitting. Partitioning the database helps capture local correlations
that are significant within subsets of the database and that require a different degree
of “zooming” to be recognized. Each of the partition functions h ∈ H provides the
search space partitioned according to the criteria shared by the tuples within the
same block. If we train models on multiple blocks, we will have models with several
local views (or specialized models, sometimes called experts [71]) for portions of the
search space. Combining these local views results in better prediction accuracy.
Partitioning the database: Each partition function or criterion h(·) maps each
tuple to one of a set of partitions. Multiple criteria H = {h1, . . . , hJ} are used to
partition the database in different ways. Each tuple t is mapped to a set of partitions,
i.e., H(t) = ∪_{∀j} hj(t).
A simple way to choose the partition criteria is Random (i.e., randomly partition
the data many times). Another way is Blocking, where partitions are constructed
under the assumption that similar tuples fit in the same block or, inversely, that
tuples across different blocks are less likely to be similar. Many techniques for
blocking have been introduced for the efficient detection of duplicate records (refer
to [9] for a survey).
It is worth mentioning that increasing the number of partition functions results
in a more accurate final prediction, because the variance in the predictions decreases
as we increase the number of ways (partition functions) to partition the data. We
found that partitioning the data using different blocking techniques provided more
accurate predictions with fewer partition functions in comparison with random
partitioning.
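A minimal sketch of the multi-way partitioning step: each hj maps a tuple to a block key, so a tuple joins one block per partition function. The blocking keys below (institution, area code, Zip prefix) are illustrative assumptions, not the thesis's actual partition functions.

```python
# Partition a dataset J ways: block key (j, h_j(t)) identifies the
# block of tuple t under partition function h_j.

def partition(tuples, partition_fns):
    blocks = {}
    for t in tuples:
        for j, h in enumerate(partition_fns):
            blocks.setdefault((j, h(t)), []).append(t)
    return blocks

H = [lambda t: t["Institution"],     # h1: block by institution
     lambda t: t["AC"],              # h2: block by area code
     lambda t: t["Zip"][:3]]         # h3: block by zip prefix
tuples = [{"Institution": "Purdue Univ.", "AC": "765", "Zip": "47907"},
          {"Institution": "Purdue Univ.", "AC": "765", "Zip": "47906"}]
blocks = partition(tuples, H)
print(len(blocks))  # 3 blocks; each tuple is a member of all three
```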
Example 3.4.1 Consider again the relation in Figure 3.1. In this example, one may
partition the database based on the Institution attribute (as a partition function)
to get the tuples partitioned as follows: {t1, t7}, {t2, t8}, {t3}, {t4, t5, t6}. The result
of the learning process from these data partitions will be expert models based on the
person’s institution. Another function may use the AC or a combination of attributes.
Partition functions can be designed based on a signature-based scheme or clustering,
as we elaborate in the experimental section.
Reliability measure and risk minimization: In order to be conservative in
considering the predictions from each block bij and its model Mij, we propose a
mechanism to measure the reliability of a model and adapt the obtained prediction
probability accordingly, to support or detract from the model’s predictions.
Two major components help us judge the reliability of a model Mij:
(i) the model quality, classically quantified by its loss

L(Mij) = (1/|bij|) ∑_{t∈bij∩Dc, E∈F} dE(f[E], f′ij[E]),

where |bij| is the number of tuples in partition bij, E is one of the flexible attributes
F, dE is a distance function for the domain of attribute E, and f′ij is the prediction
of model Mij on the flexible attributes F for the tuple t ∈ bij ∩ Dc; (ii) the size of
the block: the smaller the block, the less reliable the predictions. Hence, the
reliability of model Mij can be written as:

Re(Mij) = (|bij|/|D|) (1 − L(Mij)). (3.6)

Finally, the prediction probabilities obtained from model Mij are scaled to be:

Pij(f′ | r) = Pij(f′ | r) × Re(Mij).
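Eq. 3.6 and the probability scaling can be sketched directly; the block size, database size, loss, and probability values below are illustrative, not measurements.

```python
# Reliability of a model (Eq. 3.6): relative block size times the
# model's empirical accuracy (1 - loss) on the clean tuples.

def reliability(block_size, db_size, loss):
    return (block_size / db_size) * (1.0 - loss)

def scaled_probability(p, block_size, db_size, loss):
    # Raw prediction probability weighted by the model's reliability.
    return p * reliability(block_size, db_size, loss)

# A model trained on a block holding half the database, with 10% loss:
print(reliability(500, 1000, 0.1))                        # 0.45
print(round(scaled_probability(0.8, 500, 1000, 0.1), 2))  # 0.36
```

Small blocks thus automatically discount their models' predictions, which is the conservative behavior the text describes.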
Aggregating Suggestions: As mentioned earlier, a tuple t = rf will be a
member of |H(t)| data partitions. From each partition, we get a candidate tuple
repair for t; these are stored in the storage RS with the following schema:
{t id, partition, E1, . . . , EK, P(f′ | r), P(f | r)}, where t id is the original tuple identifier,
partition is the partition name, Ek ∈ F, P(f′ | r) is the prediction probability of the
repairing update f′, and P(f | r) is the probability of the original values in t. The
space required for RS is O(|D| × |H|).
Example 3.4.2 Consider tuple t4 in Figure 3.1, with the flexible attributes City,
State, and Zip. Assume that we used 5 partition functions, so t4 was a
member of 5 partitions; consequently, we obtain 5 candidate tuple repairs.
The table in Figure 3.2 illustrates the candidate tuple repairs of t4, RS(t4), with the
corresponding prediction probabilities obtained from each partition.
Fig. 3.2.: Generated predictions for tuple repairs with their corresponding prediction
probabilities for tuple t4 in Figure 3.1.
3.4.3 Tuple Repair Selection Phase
Once the candidate tuple repairs are generated, we need a repair selection strategy
to pick the best one from the candidate set. One suggestion for a selection strategy
is majority voting: for a tuple t, select the most voted value from the partitions for
each attribute Ei individually.
The majority voting strategy assumes that each attribute was predicted
independently of the others. For a tuple t = rf, however, we predict the combination
of the flexible attributes f′ together; thus, the independence assumption on the
attributes is not valid. Therefore, we propose a mechanism to vote for a final
combination of the flexible attributes that takes into account the dependencies between
the predicted values obtained from each partition.
Example 3.4.3 Consider the candidate tuple repairs of t4 in Figure 3.2. Note that
if we use majority voting with the prediction probability as the voter’s
certainty, the final prediction would be {“Lafayette”, “IN”, “47906”}. This solution
does not take into account the dependencies between the predicted values within the
same tuple repair. For example, there is a stronger association between “47907”
and “IN” than between “47906” and “IN”. This relationship is reflected in their
corresponding prediction probabilities: the values “47907” and “IN” were predicted
in f′1, f′4 with probabilities 0.7 and 0.8, while “47906” and “IN” were predicted in f′2, f′3
with smaller probabilities, 0.4 and 0.6. The same applies to the relationship
between “WLafayette” and “IN”, which is stronger than that between “Lafayette”
and “IN”. A more desirable prediction is {“WLafayette”, “IN”, “47907”}.
For a given database tuple t = rf, our goal is to find the final combination
f′∗ = ⟨e∗1, . . . , e∗K⟩ such that ∑_{bij : t∈bij} P(f′∗ | r) is maximum. This is computationally
infeasible, because it requires computing the probability of each possible
combination of the flexible attributes in each data block. Instead, we can search for
the values that maximize all the pairwise joint probabilities. In principle,
maximizing the pairwise associations between the predicted values implies
maximizing the full association between the predicted values. Hence, the final update
is the one that maximizes the prediction probabilities for each pair of attribute
values. We formalize this problem as follows.
Definition 3.4.1 The Tuple Repair Selection Problem: Given a set of predicted
combinations for the flexible attributes RS(t) = {f′1, . . . , f′|H|} for database tuple t = rf,
along with the prediction probabilities of each combination (i.e., for f′j ∈ RS(t), we
have the corresponding prediction probability P(f′j | r)), the tuple repair selection
problem for tuple t is to find f′∗ = ⟨e∗1, . . . , e∗K⟩ such that the following sum is
maximum:

∑_{∀ e∗i, e∗k, i≠k} ∑_{∀ f′∈RS(t): e∗i=f′[Ei], e∗k=f′[Ek]} P(f′ | r).
To find a solution, we map this problem to the graph optimization problem of
finding the K-heaviest subgraph (KHS) [72] in a K-partite graph (KPG). The key
idea is to process each database tuple t individually and use its set of candidate
tuple repairs, RS(t), to construct a graph where each vertex is an attribute value,
and an edge is added between a pair of vertices iff the corresponding values co-occur
in a prediction f′ ∈ RS(t). The edges have weights derived from the obtained
prediction probabilities. It is worth noting that this strategy is applied to each tuple
separately; therefore, this phase can be efficiently parallelized.
Finding KHS in KPG: The K-heaviest subgraph (KHS) problem is an NP
optimization problem [72]. In an instance of the KHS problem, we are given a graph
G = (VG, EG), where VG is the set of vertices of size n, EG is the set of edges with
non-negative weights (Wwv denotes the weight on the edge between vertices w, v),
and a positive integer K < n.
The goal is to find V′ ⊂ VG, |V′| = K, such that ∑_{(w,v)∈EG∩(V′×V′)} Wwv is maximum.
In other words, the goal is to find a K-vertex subgraph with the maximum weight.
A graph G = (VG, EG) is said to be K-partite if we can divide VG into K subsets
{V1, . . . , VK} such that two vertices in the same subset cannot be adjacent. We call
the problem of finding a KHS in a K-partite graph such that the subgraph contains
a vertex from each partite the KHS in KPG problem.
Definition 3.4.2 The KHS in KPG problem: Given a K-partite
graph G = (V1, . . . , VK, EG), find V′ = {v1, . . . , vK} such that vk ∈ Vk and
∑_{(vi,vj)∈EG∩(V′×V′)} Wvivj is maximum.
Lemma 3.4.1 The KHS in KPG is NP-Complete.
Proof This is a proof sketch. It is straightforward to see that we can reduce the
problem of finding a K-clique (a clique of size K) in a K-partite graph to the KHS in
KPG problem. The problem of finding a K-clique in a K-partite graph G is NP-complete by
reduction from the problem of (n − K)-vertex cover in the complement K-partite
graph G′, which is NP-complete (see [73] for details).
Solving the Tuple Repair Selection Problem: We are given a set of predictions
RS(t) = {f′1, . . . , f′|H|} for the flexible attributes of a tuple t = rf, where
f′j = ⟨e_1^(j), . . . , e_K^(j)⟩ and K = |F| is the number of flexible attributes.
The repair selection problem can be mapped to the KHS in KPG problem using
the following steps:
1. Building vertex sets for each attribute Ek: For each attribute Ek, create a
vertex v for each distinct value in {e_k^(1), . . . , e_k^(|H|)}. Note that we obtain a set of
vertices (i.e., a partite) for each attribute Ek.
Fig. 3.3.: Step-by-step demonstration of the SelectTupleRepair algorithm: (a) constructed
graph; (b) after removing I; (c) after removing F; (d) after removing 6; (e) after
removing W. At each iteration, the vertex with minimum weighted degree is removed
as long as it is not the only vertex in its corresponding vertex set.
2. Adding edges: Add an edge between vertices v, w when their corresponding
values co-occur in a candidate tuple repair. Note that v, w cannot belong to
the same vertex set.
3. Assigning edge weights: For an edge between v, w, the weight is computed as
follows. Let f(v,w) = {f′j | f′j contains both v and w}, i.e., the set of predictions that
contain both values v and w. Then

Wvw = ∑_{f′j∈f(v,w)} Pij(f′j | r),

where Pij(f′j | r) is the prediction probability of f′j obtained from partition bij.
The graph construction requires a single scan over the predictions RS(t) =
{f′1, . . . , f′|H(t)|}; hence, it is of O(K|H|). The number of vertices is the number
of distinct values in the candidate tuple repairs.
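The three construction steps can be sketched as follows. The candidate repairs below mimic the flavor of Figure 3.2 but are illustrative, not its exact contents.

```python
from itertools import combinations

# Build the K-partite graph from candidate tuple repairs: vertices are
# distinct predicted values per attribute, edges link co-occurring
# values, and edge weights sum the repairs' prediction probabilities.

def build_graph(candidates):
    """candidates: list of (predicted values tuple, probability)."""
    vertex_sets = {}   # attribute index -> set of predicted values
    weights = {}       # frozenset({(i, v), (k, u)}) -> summed weight
    for values, p in candidates:
        for i, v in enumerate(values):
            vertex_sets.setdefault(i, set()).add(v)
        # One edge per pair of co-occurring values (never same partite).
        for (i, v), (k, u) in combinations(enumerate(values), 2):
            edge = frozenset({(i, v), (k, u)})
            weights[edge] = weights.get(edge, 0.0) + p
    return vertex_sets, weights

cands = [(("WLafayette", "IN", "47907"), 0.7),
         (("Lafayette", "IN", "47906"), 0.6),
         (("WLafayette", "IN", "47906"), 0.4)]
sets_, w = build_graph(cands)
# "WLafayette" and "IN" co-occur in the first and third repairs:
print(round(w[frozenset({(0, "WLafayette"), (1, "IN")})], 2))  # 1.1
```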
Example 3.4.4 Figure 3.3(a) shows the constructed 3-partite graph from the
predictions in Figure 3.2 for tuple t4 in the original relation of Figure 3.1. For each
attribute, there is a vertex set (or partite); e.g., the set corresponding to the Zip
attribute contains {“47906”, “47907”}. In the graph, we replaced the actual attribute
values by character abbreviations to obtain a more compact graph, as follows: {“6”
→ “47906”, “7” → “47907”, “L” → “Lafayette”, “W” → “WLafayette”, “F” →
“lafytte”, “N” → “IN”, “I” → “IL”}.
Note that there is an edge between “W” and “N” with edge weight 1.1 (= 0.4
+ 0.7). This is because “WLafayette” and “IN” co-occur twice, in f′1 and f′3, with
probabilities 0.7 and 0.4, respectively. Also, there is an edge between “I” and “6”
with weight 0.5, because “IL” and “47906” co-occur once, in f′5, with probability 0.5.
The rest of the graph is constructed similarly.
Finally, finding the KHS in the constructed KPG is a solution to the tuple repair
selection problem. The underlying idea is that the resulting K-subgraph G′(V′, E′)
contains exactly one vertex from each vertex set. This corresponds to selecting
a value for each flexible attribute. Moreover, the weight of the selected subgraph
corresponds to the result of maximizing the summation in Definition 3.4.1.
Computing the likelihood benefit: For a tuple t = rf, the solution of the
KHS in KPG problem is the final prediction f′ for the flexible attributes. The final
prediction probability of f′ is computed from the solution graph G′(V′, E′) by

P(f′ | r) = (1/|E′|) ∑_{evw∈E′} (1/|f(v,w)|) ∑_{f′j∈f(v,w)} Pij(f′j | r).

The inner summation averages the probability of each pair of attribute values (i.e.,
each edge in G′) in the final prediction f′. The outer summation averages the
probability over all the edges in the final graph G′.
The prediction probability of the original values in the flexible attributes f is
computed following the ensemble method, by averaging the probabilities obtained from
each partition, i.e., P(f | r) = (1/|H|) ∑_{bij : t∈bij} Pij(f | r).
Finally, for the update u changing f into f′, we can compute the likelihood benefit
l(u) using Equation 3.2.
Example 3.4.5 Consider the constructed initial graph in Figure 3.3(a), and assume
that the solution for the KHS in KPG is the subgraph {“W”, “N”, “7”} shown in Figure
3.3(e). We now have an update u to change the original values in tuple t4 in Figure
3.1 from f = {“Lafayette”, “IN”, “47906”} to f′ = {“WLafayette”, “IN”, “47907”}.
From RS(t4) in Figure 3.2, we get P(f | r) = avg{0.6, 0.5, 0.3, 0.6, 0.5} = 0.5 and
P(f′ | r) = (1/3)[(1/2)(0.7 + 0.4) + (1/2)(0.7 + 0.8) + 0.7] = 0.66. Finally, we use
Eq. 3.2 to compute l(u) = 0.12.
3.4.4 Approximate Solution for Tuple Repair Selection
For the general problem of finding the KHS, many approximation algorithms have been
introduced (e.g., [72, 74, 75]). For example, in [74] the authors model the problem
as a quadratic 0/1 program and apply random sampling and randomized rounding
techniques, resulting in a polynomial-time approximation scheme; in [75], the
algorithm is based on a semi-definite programming relaxation.
If K is very small, the optimal solution can be found by enumeration. For
the case where K is not very small, we provide here an approximate solution
inspired by the greedy heuristic discussed in [72]. For the general graph
problem, the heuristic repeatedly deletes a vertex with the least weighted degree
from the current graph until K vertices are left. The weighted degree of a vertex is the
sum of the weights on the edges attached to it.
In the following, we apply the same heuristic to the case of a K-partite graph.
However, we iteratively remove the vertex with the least weighted degree only as long as it
is not the only vertex left in its partite; otherwise, we find the next least-weighted-degree
vertex. The algorithm is a greedy 2-approximation, following the analysis
discussed in [72].
Algorithm 3.3 shows the main steps to find the final tuple repair. There are two
inputs to the algorithm: (i) the graph G(VG, EG) constructed from the predictions,
and (ii) the sets of vertices S = {S1, . . . , SK}, where each Sk represents the predicted
values for attribute Ek. For each vertex v, we store its current weighted degree,
Weighted Degree(v) = ∑_{∀evw∈EG} Wvw, which is the sum of the weights of the edges
incident to v.
Algorithm 3.3 SelectTupleRepair(G(V, E) graph, S = {S1, . . . , SK})
1: while ∃S ∈ S s.t. |S| > 1 do
2: v = GetMinWeightedDegreeVertex(G, S)
3: If v = null Then break;
4: for all vertex w ∈ V s.t. ewv ∈ E do
5: Remove ewv from G.
6: Weighted Degree(w) −= Wwv
7: end for
8: Remove v from its corresponding set S.
9: end while

The algorithm proceeds iteratively in the loop in lines 1-9. The loop stops when
a solution is found, i.e., when there is only one vertex in each vertex set
(|S| = 1 ∀S ∈ S). In each iteration, we start (Line 2) by finding the vertex v that
has the minimum weighted degree, using Algorithm GetMinWeightedDegreeVertex(G, S).
Then, we remove all the edges incident to v and update Weighted Degree(w) by
subtracting Wwv for each vertex w connected to v by a removed edge ewv (lines 4-7).
Finally, vertex v is removed from G and from its corresponding vertex set (Line 8).
GetMinWeightedDegreeVertex goes through the vertex sets that have more than
one vertex and returns the vertex with the minimum weighted degree.
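As an illustrative sketch (not the thesis implementation), the greedy selection of Algorithm 3.3 can be written as follows; the graph representation (a dict of edge weights keyed by vertex pairs) and the vertex names are assumptions made for the example.

```python
def select_tuple_repair(edges, partitions):
    """Greedy K-partite tuple-repair selection, a sketch of Algorithm 3.3.

    edges: dict mapping frozenset({u, v}) -> edge weight W_uv.
    partitions: list of vertex sets, one per attribute (the sets S_1..S_K).
    Returns one surviving vertex per partition, in partition order.
    """
    # Compute the initial weighted degree of every vertex.
    degree = {}
    for pair, w in edges.items():
        for v in pair:
            degree[v] = degree.get(v, 0.0) + w

    edges = dict(edges)              # work on a copy
    parts = [set(p) for p in partitions]

    while any(len(p) > 1 for p in parts):
        # Only vertices in sets with more than one vertex are candidates,
        # so the last vertex of a partition is never removed.
        candidates = [v for p in parts if len(p) > 1 for v in p]
        v = min(candidates, key=lambda x: degree.get(x, 0.0))
        # Remove v's incident edges and update the neighbors' degrees.
        for pair in [e for e in list(edges) if v in e]:
            w = edges.pop(pair)
            (u,) = pair - {v}
            degree[u] -= w
        for p in parts:
            p.discard(v)

    return [next(iter(p)) for p in parts]
```

For instance, with two partitions {a1, a2} and {b1, b2} (hypothetical predicted values) and edge weights favoring the pair (a1, b1), the algorithm first discards a2 and then b2, returning [a1, b1].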
Analysis: Algorithm 3.3 requires the following. First, all n vertices are visited
to remove all but K of them. For each vertex, each set S ∈ S of the K sets is visited
to get its minimum vertex according to the weighted degree. This requires O(nK log |S|),
where n ≈ O(K|H|) and |S|'s worst case is O(|H|). Hence, visiting the vertices
is of O(K²|H| log |H|). Second, removing the vertices requires visiting their edges,
O(|E_G|), which has a worst case of O(K²|H|). Therefore, the overall complexity of
Algorithm 3.3 is O(K²|H| log |H|).
Example 3.4.6 The SelectTupleRepair algorithm is illustrated step-by-step in Fig-
ure 3.3. The algorithm looks for the vertex with the least weighted degree to be re-
moved. The first vertex is "I", whose weighted degree equals 1.0 = 0.5 + 0.5,
corresponding to the two edges incident to "I". This leaves the vertex set of the State
attribute with only one vertex, “N”. Therefore, we do not consider removing the ver-
tex “N” in further iterations of the algorithm. The next vertex to remove is “F” to
get Figure 3.3(c), and so on.
Finally, we obtain the final solution in Figure 3.3(e), which corresponds to a subgraph
with 3 vertices, one from each initial partite set. This graph is the heaviest subgraph
of size 3 (i.e., the sum of its edge weights is maximal) in which each vertex belongs
to a different partite set. It is worth mentioning that the final graph does not have
to be fully connected. Thus, the final prediction is {"WLafayette", "IN",
"47907"}.
3.5 Experiments
In this section, we evaluate our data repair approach; specifically, the objectives of
the experiments are as follows: (1) Evaluation of SCARE and the notion of maximal
likelihood repair in comparison with existing constraint-based approaches for data
repair, (2) Assessment of the scalability of SCARE.
Datasets: In our evaluations, we use three datasets: (i) Dataset 1 is the same
real-world dataset about patients discussed in Section 2.5. We selected a subset
of the available patient attributes, namely Patient ID, Age, Sex, Classification,
Complaint, HospitalName, StreetAddress, City, Zip, State and VisitDate. This
is in addition to the Longitude and Latitude of the address information. This
dataset is dirty and it is used as input to the repairing approaches. The flexible
attributes that we consider for repairing are (City, Zip, HospitalName, Longitude
and Latitude). (ii) Dataset 2 is the US Census Data (1990) Dataset1 containing
about 2 M tuples. It has been used only in the scalability experiments. (iii) Dataset
3 is the Intel Lab Data (ILD2) used to evaluate SCARE for predicting missing values
and compare it to ERACER [7], as a recent system that relies on relational learning
for predicting missing data in relational databases.
¹http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
²http://db.csail.mit.edu/labdata/labdata.html
Parameters: In our evaluations, we study several parameters, listed here with
their default values: (1) e: the percentage of erroneous tuples in the dataset
(default 30%); (2) d: the dataset size (default 10,000 tuples); (3) δ: the maximum
amount of changes, as a fraction of the dataset size d, that SCARE is allowed to
apply (default 0.1, i.e., 10% of d); (4) I: the number of iterations of SCARE
(default 1); (5) |H|: the number of partition functions (default 5).
All the experiments were conducted on a server running Linux with 32 GB of RAM
and two 3 GHz processors. We use MySQL to store and query the tuples.
For the probabilistic classifiers, we use Naïve Bayes; specifically, we use the
WEKA NBC implementation³ with its default parameter settings. Our approach is
implemented in Java, and we use Java threads to benefit from the multiprocessor
environment.
Regarding the partition functions using blocking, we repeat the following process
|H| times: we randomly sample a small number of tuples from the dataset and
cluster them into |D|/n_b clusters, where n_b is the average number of tuples per
partition. Then,
each tuple is assigned to the closest cluster as its corresponding partition name. This
process allows for having different blocking functions due to the random sample of
tuples used in the clustering step of each iteration. The tuples that have been assigned
to the same partition have common or similar features due to their assignment to the
closest cluster. In all our quality experiments, we use blocking as the technique to
partition the dataset. Another simple way to partition the dataset is to use random
partitioning functions. In this case, given nb, we assign each tuple to a partition name
b_ij, where i is a random number from {1, . . . , |D|/n_b}, and j = 0 initially. This
process is repeated |H| times while incrementing j each time.
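The random partitioning scheme just described can be sketched as follows (an illustration, not the thesis code; tuple identifiers and the function name are assumptions): in each of the |H| passes, every tuple is assigned a partition name b_ij with i drawn uniformly from {1, ..., |D|/n_b} and j the pass index.

```python
import random

def random_partitions(num_tuples, nb, num_functions, seed=0):
    """Assign each tuple a random partition name in each of |H| passes.

    num_tuples: |D|, the dataset size.
    nb: average number of tuples per partition.
    num_functions: |H|, the number of partition functions.
    Returns a list of dicts, one per pass: tuple index -> partition name.
    """
    rng = random.Random(seed)
    n_parts = max(1, num_tuples // nb)   # |D| / n_b partitions per pass
    passes = []
    for j in range(num_functions):
        passes.append({t: f"b{rng.randint(1, n_parts)}_{j}"
                       for t in range(num_tuples)})
    return passes
```

Each pass covers every tuple, so a tuple ends up in exactly |H| partitions, one per partition function.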
3NBC WEKA available at http://www.cs.waikato.ac.nz/ml/weka
[Figure 3.4: two plots of (a) Precision and (b) Recall vs. the percent of errors e
(10-50%), comparing KHSinKPG, MV, SM, and ConstRepair.]
Fig. 3.4.: Quality vs. the percentage of errors: SCARE maintains high precision by
making the best use of δ, the allowed amount of changes.
3.5.1 Repair Quality Evaluation
To evaluate the quality of the automatic repair, we manually cleaned the dirty
datasets (e.g., for Dataset 1, using addresses web sites, external reference data sources,
and visual inspection) to obtain the ground truth and we compare the clean versions
of the datasets (and each replacement value) with the repair output of our method.
In the following experiments, we use the standard precision and recall to measure
the quality of the applied updates for string attributes.
The precision is defined as the ratio of the number of values that have been
correctly updated to the total number of values that were updated, while the recall
is defined as the ratio of the number of values that have been correctly updated to
the number of incorrect values in the entire database. For numerical attributes, we
use the mean absolute error (MAE): (1/N) Σ_{i=1}^{N} |v_i − a_i|, where v_i is the
value suggested by SCARE and a_i is the actual value in the original data. We can
compute these measures since we know the ground truth for Dataset 1.
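For concreteness, the three quality measures can be computed as in the following minimal sketch, assuming the counts are gathered by comparing the applied updates against the ground truth (function names are illustrative):

```python
def precision_recall(correct_updates, applied_updates, total_errors):
    """Precision: correctly updated values / all updated values.
    Recall: correctly updated values / all incorrect values in the database."""
    return correct_updates / applied_updates, correct_updates / total_errors

def mean_absolute_error(suggested, actual):
    """For numerical attributes: average |v_i - a_i| over the repaired cells."""
    return sum(abs(v - a) for v, a in zip(suggested, actual)) / len(suggested)
```

For example, if 8 of 10 applied updates are correct and the database contains 20 incorrect values, precision is 0.8 and recall is 0.4.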
We report the quality results for four approaches:
• KHSinKPG: This is SCARE with the tuple repair selection strategy described
in Section 3.4.3.

• MV: In this approach, SCARE directly uses majority voting to select an
attribute value from the candidate tuple repairs. We include this approach in
the evaluation to compare our tuple repair selection strategy against the
straightforward majority voting strategy.
• SM: The single-model approach, where the whole dataset is considered as a
single partition. Afterward, the likelihood benefit and cost are computed to
select the best updates. This approach is included to show the advantages of
combining several models with local views rather than using a single model with
a global view of the data.
• ConstRepair: One of the recent constraint-based repair approaches. We imple-
mented (to the best of our understanding) the technique described in [3], which
uses CFDs as constraints to find dirty tuples and derive repairs. We manually
compiled a list of CFDs during the process of cleaning Dataset 1. Moreover,
we implemented a CFD discovery algorithm [36] whose output is used as input
to the repairing algorithm. To obtain high-quality rules and be fair to this
approach, we discovered the CFDs from the original "correct" data, which cannot
be the case in a realistic situation, as we usually start with dirty data. We set
the rule support threshold to 1%. The implementation to discover the CFDs is
very time-consuming; therefore, this approach does not appear in all of our plots.
Quality vs. the percentage of errors: We use Dataset 1 and report in Fig-
ure 3.4 the precision and recall results for the applied updates while changing e, the
percentage of tuples with erroneous values, from 10 to 50%. Generally, the approaches
that maximize the data likelihood substantially outperform the constraint-based ap-
proach in precision. In terms of recall, the likelihood approaches outperform
[Figure 3.5: two plots of (a) Precision and (b) Recall vs. δ/|D| (0-10%), for
KHSinKPG, MV, and SM.]
Fig. 3.5.: δ controls the amount of changes to apply to the database: small δ guar-
antees high precision at the cost of the recall and vice versa.
the constraint-based approach when the error rate is up to 30%; beyond that, the
recall is comparable for all the approaches.
SCARE with KHSinKPG shows the highest precision. The precision of the three
likelihood-based approaches increases with the amount of errors, but the recall
decreases. For the Longitude and Latitude attributes, the SCARE-based approaches
(KHSinKPG and MV) corrected these attributes with an error rate between 1%
and 5%.
For the SCARE-based approaches, the precision increases because we use a fixed δ.
When the amount of errors in the data (denoted e) is small (10 to 20%), the SCARE-
based approaches modify more of the data, allowing less accurate updates and
resulting in lower precision but relatively high recall. As e increases, the data needs
more updates to be corrected; however, SCARE applies fewer, yet more accurate,
updates, and hence the precision increases while the recall decreases (Figure 3.4(b)).
The KHSinKPG approach outperforms the MV approach because KHSinKPG takes
into account further associations between the predicted values across the partitions;
these associations are ignored in the MV approach. Both SCARE-based approaches
that rely on data
partitioning (KHSinKPG and MV ) show a comparable, and sometimes even better,
accuracy than that of the SM approach.
ConstRepair is outperformed by all the likelihood-based approaches because it
relies on the heuristic of finding replacement values that are close to the original
data, without considering any information about the data distribution and
relationships.
The recall of all the repairing approaches is in the range of 30 to 65%. However, for
the likelihood-based approaches, the recall can be improved by running the approach
again over the resulting database instance; we illustrate this improvement later in
the experiment of Figure 3.6. In contrast, the ConstRepair approach achieves about
35% recall, which cannot be further improved given a fixed set of constraints.
To conclude, in comparison to the constraint-based repairing approach, which
demonstrated both low precision and low recall, our likelihood-based approach
demonstrated accurate updates with high precision. Moreover, partitioning the data
and combining the different predictions across the partitions provides more accurate
predictions, because partitioning allows learning data relationships at different
granularity levels (local and global).
Quality vs. the amount of changes δ: In this experiment, we study the effect
of δ, the number of tolerated changes, on the repair quality. We report in Figure 3.5
the resulting precision and recall as δ/|D| changes from 1% to 10% (e.g., if
δ/|D| = 5%, then SCARE can change up to 10% of the tuples by replacing a selected
attribute value v by v′ with distance d(v, v′) = 0.5). Generally, low values of δ
guarantee high precision, and vice versa. Increasing δ gives SCARE more opportunity
to modify the data. Hence, the recall increases as we increase δ, but updates with
lower confidence may be made, which explains the decrease in precision as δ
increases.
The conclusion from this experiment is that small δ guarantees high precision
at the cost of the recall. However, the recall can be improved by further SCARE
iterations over the data as we will see in the next experiment.
[Figure 3.6: two plots of (a) Precision and (b) Recall vs. the number of SCARE
iterations I (1-5), for KHSinKPG, MV, and SM.]
Fig. 3.6.: Using SCARE in an iterative way helps improve the recall and the overall
quality of the updates. The decrease in precision is small compared to the increase
in recall, achieving an overall high quality improvement, as demonstrated by the
F-measure.
Quality vs. the number of SCARE iterations: This experiment shows
the effect on the quality if we repeatedly execute SCARE over the dataset I times,
I = {1, . . . , 5}. After each iteration, the repaired tuples are considered members of
the clean subset of the database, which is used in training the ML models. We report
the obtained precision and recall in Figure 3.6. For all the SCARE-based approaches,
the recall substantially improves from about 35% to close to 70% as we increase the
number of iterations, while the precision slightly decreases from the 90% range to the
80% range. This indicates that the overall quality improves as we run more SCARE
iterations. KHSinKPG outperforms the other approaches in terms of both precision
and recall.

In each iteration, SCARE tries to repair the data to maximize the data likelihood
given the learnt classification models, subject to a constant amount of changes δ. In
the first few iterations, the discovered updates have a higher correctness measure
(i.e., the ratio of the likelihood benefit to the cost of the update, l(u)/c(u)) than
those that are discovered later. Therefore, the updates applied in the first iterations
have higher
[Figure 3.7: two plots of (a) Precision and (b) Recall vs. the number of partition
functions |H| (2-16), for KHSinKPG, MV, and SM.]
Fig. 3.7.: Increasing the number of partition functions |H| improves the accuracy of
the predictions and hence increases the precision. The recall is not affected much
because we use a fixed δ.
confidence, and hence, the precision starts high and decreases in later iterations.
The recall increases faster than the precision decreases, and therefore the overall
quality improves. The main reason is that most of the applied updates are correct
in the first few iterations, so the database instances obtained after each iteration are
of higher quality to be learnt and modeled for predictions in the later iterations of
SCARE. A stopping criterion for the iterations can be computed from the obtained
overall likelihood benefit of the updates: if the benefit is not significant, then it is
better to stop SCARE. In the GDR setting of Chapter 2, the user is involved
interactively to inspect a very small number of the least beneficial updates after each
iteration. If the least beneficial updates are correct, then most of the updates are
correct according to the data distribution.
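The iterative use of SCARE with a likelihood-benefit stopping criterion can be sketched as follows. This is a hedged illustration, not the thesis code: `run_scare_once` is a hypothetical callable standing in for one full SCARE pass, assumed to return the repaired database and the total likelihood benefit of the updates applied in that pass.

```python
def iterative_scare(db, run_scare_once, max_iterations=5, min_benefit=1e-3):
    """Repeat SCARE passes until the likelihood benefit becomes insignificant.

    After each pass the repaired tuples become part of the clean subset,
    which `run_scare_once` is assumed to use for training in the next pass.
    """
    for _ in range(max_iterations):
        db, benefit = run_scare_once(db)
        if benefit < min_benefit:   # remaining updates no longer pay off
            break
    return db
```

With a benefit sequence such as 1.0, 0.5, then near zero, the loop stops after the third pass rather than running all `max_iterations`.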
Quality vs. the number of partition functions: Here, we study the sensi-
tivity of SCARE to the number of ways to partition the dataset, i.e., the number
of partition functions |H|. Each tuple is a member of |H| partitions. We report in
Figure 3.7 the precision and recall of the applied updates as we change |H| from 2 to
16. The SCARE-based approaches, KHSinKPG and MV, achieve higher precision
[Figure 3.8: runtime plots. (a) Dataset 2: time (min) vs. the number of records in
millions (0.1-1), broken down into Phase 1 and Phase 2. (b) Dataset 1: time (sec)
vs. the number of records in thousands (up to 50), for Blocking vs. Random
partitioning.]
Fig. 3.8.: SCARE scalability when varying the database size.
as we increase |H|, while SM shows constant lower precision as changing |H| does
not affect SM ’s performance. Also, the recall is not much affected by |H| as we are
using δ, a fixed amount of changes to apply.
Increasing |H| increases the number of partitions each tuple belongs to. SCARE
learns a model from each local view (i.e., from one partition) and predicts the most
accurate value for the tuple's attributes. As a consequence, a larger number of
candidate tuple repairs is proposed as the number of partitions increases, and the
variance of the predictions decreases. The repair selection strategy combines the
predictions of the local-view models, which increases the chance of obtaining more
accurate predictions. The strategy in KHSinKPG offers better results, as it improves
the precision over majority voting by taking into account the dependencies obtained
from the different partitions.
3.5.2 SCARE Scalability
Scalability is one of the main advantages of SCARE, in addition to the quality of
the updates demonstrated in the previous experiments. In the following, we assess
the scalability of SCARE on large datasets.
In Figure 3.8(a), we report the scalability of SCARE on Dataset 2, as well as the
fraction of time spent in each of the two phases of SCARE. The reported time
includes the time for learning the classification models. SCARE scales linearly
because of its systematic process of handling each data partition. Note also that
SCARE finished processing 1 M tuples in less than 6 minutes. Phase 1, the updates
generation, takes 80-85% of the time because of the process of learning models and
obtaining predictions.
In Figure 3.8(b), we report the overall time taken by SCARE to handle subsets of
Dataset 1 of different sizes, from 5,000 to 50,000 tuples, using two different
partitioning techniques: Random and Blocking. In general, SCARE still scales
linearly with the dataset size. Moreover, partitioning the data by blocking makes
SCARE more efficient. The reason is that blocking produces data partitions
containing less diversity in the domain values, which results in faster training of the
classification models used for prediction.
The cost of statistical modeling and prediction tasks is usually quadratic in the data
size, but this is not the case with SCARE because of its mechanism of partitioning
the data and the strategy used to combine multiple predictions.
3.5.3 SCARE vs. ERACER to Predict Missing Values
In this experiment, we use Dataset 3, which includes a number of measurements
taken from 54 sensors once every 31 seconds. It contains only numerical attributes.
We include this dataset to evaluate SCARE for predicting missing values and compare
it to ERACER. We used the same dataset with the same introduced errors as reported
in the ERACER evaluation of [7].
ERACER is a recent machine learning technique for data cleaning; it is specifically
dedicated to replacing missing values, but it cannot be used for data repair by value
modification. ERACER leverages domain expert knowledge about the dataset to
[Figure 3.9: mean absolute error of SCARE vs. ERACER on Dataset 3: (a) errors
as missing values; (b) errors as corrupted values.]
Fig. 3.9.: Comparison between SCARE and ERACER to predict missing values.
Generally, both SCARE and ERACER show high accuracy in predicting the miss-
ing values. In this experiment, SCARE uses a Naïve Bayes model, while ERACER
leverages domain knowledge encoded in a carefully designed Bayesian network.
design a Bayesian network. ERACER then uses an underlying relational database
design to store the constructed model's parameters.
In Figure 3.9, we evaluate SCARE in the task of predicting the missing values
in comparison with ERACER. Here, we do not use δ as SCARE is not updating
an existing database value. Instead, we consider all the predictions obtained from
SCARE that fill a missing value.
Figure 3.9(a) reports the mean absolute error (MAE) when the errors in the dataset
take the form of missing values. Figure 3.9(b) also reports the MAE, but with errors
in the form of corrupted values, e.g., randomly added values. More details on how
this data was corrupted are provided in [7].
There are two major numerical attributes in Dataset 3: humidity and temperature.
In Figure 3.9(a), both SCARE and ERACER predict the missing values for the
humidity and temperature with almost the same low error percentage (2 to 4%);
and in Figure 3.9(b), they both behave similarly as the percentage of corrupted data
increases, with a prediction error between 3 and 7%.
These numbers show that both techniques provide highly accurate predictions.
However, SCARE does not require an expensive domain expert to design a Bayesian
network, as ERACER does. For this experiment, SCARE used Naïve Bayes for
the statistical models. Naïve Bayes did well when plugged into SCARE, compared
to the Bayesian network that has to be carefully designed for ERACER. This is
thanks to the partitioning technique used in SCARE, which enables several local
views of the data that are combined at the end to obtain the most reliable global
view for accurate predictions. Moreover, SCARE can benefit from carefully designed
learning techniques, like ERACER's, by plugging them into the learning step to get
more accurate predictions.
3.6 Summary
In this chapter, we proposed SCARE, a robust and scalable approach for accu-
rately repairing erroneous data values. Our solution offers several advantages over
previous methods for data repair using database constraints: the accuracy of the
replacement values and the scalability of the method are guaranteed; the cost of the
repair is bounded by the amount of changes that the user is willing to tolerate; and
no constraint or editing rule is needed, since SCARE analyzes the data, learns cor-
relations from the correct data, and takes advantage of them to predict the most
accurate replacement values. Finally, as shown in our extensive experiments, SCARE
outperforms existing methods on large databases with no limitation on the type of
data (i.e., string, numeric, ordinal, categorical) or on the type of errors (i.e., missing,
incomplete, outlying, or inconsistent values).
4. INDIRECT GUIDANCE FOR DEDUPLICATION
(BEHAVIOR BASED RECORD LINKAGE)
In this chapter, we give an example of leveraging the indirect interactions of users
(entities) to improve data quality. Specifically, we use the transaction logs generated
by entities for the task of finding duplicate entities, or linking entity records. We
assume in this chapter that the database results from integrating two data sources
that are expected to have some entities in common.
The rest of the chapter is organized as follows. In Section 4.1, we start by motivat-
ing our approach and summarizing our contributions. Section 4.2 provides a general
overview of the approach and formalizes the problem through examples. The candi-
date generation phase for matched entities is discussed in Section 4.3, and the accurate
techniques for matching entities using their behaviors (generated transactions) are
discussed in Section 4.4. The experimental evaluation is provided in Section 4.5, and
finally, we summarize the chapter in Section 4.6.
4.1 Introduction
Record linkage is the process of identifying records that refer to the same real world
entity. There has been a large body of research on this topic (refer to [9] for a recent
survey). While most existing record linkage techniques focus on simple attribute
similarities, more recent techniques are considering richer information extracted from
the raw data for enhancing the matching process (e.g. [12–15]).
In contrast to most existing techniques, we are considering entity behavior as a
new source of information to enhance the record linkage quality. We observe that
by interpreting massive transactional datasets, for example, transaction logs, we can
discover behavior patterns and identify entities based on these patterns.
A straightforward strategy to match two entities is to measure the similarity
between their behaviors. However, a closer examination shows that this strategy may
not be useful, for the following reasons. It is usually the case that the complete
knowledge of an entity’s behavior is not available to both sources, since each source is
only aware of the entity’s interaction with that same source. Hence, the comparison
of entities’ “behaviors” will in reality be a comparison of their “partial behaviors”,
which can easily be misleading. Moreover, even in the rare case when both sources
have almost complete knowledge about the behavior of a given entity (e.g., a customer
who did all his grocery shopping at Walmart for one year and then at Safeway for
another year), the similarity strategy still will not help. The problem is that many
entities do have very similar behaviors, and hence measuring the similarity can at
best group the entities with similar behavior together (e.g., [16–18]), but not find
their unique matches.
Fortunately, we developed an alternative strategy that works well even if complete
behavior knowledge is not known to both sources. The key to our proposed strategy
is that we merge the behavior information for each candidate pair of entities to be
matched. If the two behaviors seem to complete one another, in the sense that stronger
behavioral patterns become detectable after the merge, then this will be a strong
indication that the two entities are, in fact, the same. The problem of distinct entities
having similar overall behavior is also handled by the merge strategy, especially when
their behaviors are split across the two sources with different splitting patterns (e.g.,
20%-80% versus 60%-40%). In this case, two behaviors (from the first and second
sources) will complete each other if they indeed correspond to the same real world
entity, and not just two distinct entities who happen to share a similar behavior
(which is one of the shortcomings of the similarity strategy).
In this work, we develop principled computational algorithms to detect behavior
patterns that correspond to latent unique entities in merged logs. We compute the
gain in recognizing a behavior before and after merging the entities' transactions and
use this gain as a matching score. In our empirical studies with real-world datasets,
the behavior merge strategy produced much better results than the behavior
similarity strategy under different scenarios of splitting the entities' transactions
among the data sources.
The contributions of this chapter can be summarized as follows:
• We present the first formulation of the record linkage problem using entity
behavior and solve the problem by detecting consistent repeated patterns in
merged transaction logs.
• To model entities’ behavior, we develop an accurate, principled detection ap-
proach that models the statistical variations in the repeated behavior patterns
and estimates them via expectation maximization [76].
• We present an alternative, more computationally efficient, detection technique
that is based on information theory which detects recognized patterns through
high compressibility.
• To speed up the linkage process, we propose a filtering procedure that produces
candidate matches for the above detection algorithms through a fast but inac-
curate matching. This filtering introduces a novel "winnowing" mechanism for
zeroing in on a small set of candidate pairs with few false positives and almost
no false negatives.
• We conduct an extensive experimental study on real world datasets that demon-
strates the effectiveness of our approach to enhance the linkage quality.
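The compressibility idea in the contributions above can be illustrated with a small sketch. This is an illustration only, not the technique developed later in the chapter: behaviors are modeled as plain action strings, the merge is a simple concatenation rather than a time-interleaved log, and the scoring function is an assumption built on `zlib`.

```python
import zlib

def compressed_size(log: str) -> int:
    """Size of the log after standard DEFLATE compression."""
    return len(zlib.compress(log.encode()))

def merge_score(log_a: str, log_b: str) -> int:
    """Score a candidate match by the compression gained from merging.

    A high score means the merged log exposes stronger repeated patterns
    than the two parts compressed separately, i.e., the two behaviors
    "complete" each other and likely belong to one entity.
    """
    return (compressed_size(log_a) + compressed_size(log_b)
            - compressed_size(log_a + log_b))
```

Intuitively, two halves of one customer's repetitive shopping log score higher when merged with each other than when merged with an unrelated customer's log, since the shared pattern compresses away in the first case.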
4.2 Behavior Based Approach
4.2.1 Problem Statement
We are given two sets of entities {A1, . . . , AN1} and {B1, . . . , BN2}, where for each
entity A we have a transaction log {T1, . . . , TnA} and each transaction Ti is a tuple
[Figure 4.1: flow of the four phases. Phase 0: Pre-processing and Behavior
Extraction; Phase 1: Candidate Generation (Quick & Dirty Matching); Phase 2:
Accurate Matching (Statistical Technique or Information Theoretic Technique);
Phase 3: Final Filtering and Conflict Resolution.]
Fig. 4.1.: Process for behavior-based record linkage.
in the form of ⟨ti, a, F id⟩ where ti represents the time of the transaction, a is the
action (or event) that took place, and F id refers to the set of features that describe
how action a was performed.
Our goal is to return the most likely matches between entities from the two sets
in the form of ⟨Ai, Bj, Sm(Ai, Bj)⟩, where Sm(Ai, Bj) is the matching function. Given
entities A,B (and their transactions), the matching function returns a score reflecting
to what extent the transactions of both A and B correspond to the same entity.
4.2.2 Approach Overview
We begin by giving an overview of our approach for record linkage, which can be
summarized by the process depicted in Figure 4.1.
Phase 0: In the initial pre-processing and behavior extraction phase, we transform
raw transaction logs from both sources into a standard format as shown on the right
side of Figure 4.2(a). Next, we extract the behavior data for each single entity in
each log. Behavior data is initially represented in a matrix format similar to those
given in Figure 4.2(b), which we refer to as Behavior Matrix (BM).
Phase 1: Similar to most record linkage techniques, we start with a candidate
generation phase that uses a "quick and dirty" matching function. When matching a
pair of entities, we follow the merge strategy described in the introduction. Moreover,
in this phase, we map each row in the BM to a 2-dimensional point, resulting in a very
compact representation of the behavior with some information loss. This mapping
allows for very fast computations on the behavior data of both the original and merged
entities. The mapping is discussed in Section 4.3 and we will show how we can use it to
generate a relatively small set of candidate matches (with almost no false negatives).
Phase 2: Once the candidate matches are generated, the following phase, which
is the core of our approach, is to perform the accurate (yet more expensive) matching
of entities. Accurate matching of the candidate pair of entities (A,B) is achieved
by first modeling the behavior of entities A, B, and AB using a statistical generative
model, where AB is the entity representing the merge of A, B. The estimated models’
parameters are then used to compute the matching score. The details are provided
in Section 4.4.1.
In addition to the above statistical modeling technique, we also propose an alter-
native heuristic technique that is based on information theoretic principles for the
accurate matching phase (See Section 4.4.2). This alternative technique relies on
measuring the increase in the level of compressibility as we merge the behavior data
of pairs of entities. While, to some extent, it is less accurate than the statistical
technique, it is computationally more efficient.
Phase 3: The final filtering and conflict resolution phase is where the final
matches are selected. In our experiments, a simple filtering threshold, tf , is applied
to exclude low-scoring matches.
In the remainder of this section, we will first examine the details of phase 0, and
then we will give an introduction to phases 1 and 2, whose detailed discussion will be
presented in the following two sections.
4.2.3 Pre-processing and Behavior Extraction
A transaction log, from any domain, would typically keep track of certain types
of information for each action an entity performs. This information includes: (1) the
time at which the action occurred, (2) the key object upon which the action was per-
formed (e.g., buying a Twix bar), and (3) additional detailed information describing
the object and how the action was performed (e.g., quantity, payment method, etc).
For simplicity, we will be referring to each action just by its key object. For example,
“Twix” can be used to refer to the action of buying a Twix bar.
The following example illustrates how we can transform a raw transaction log into
a standard format with such information. Although the example is from retail stores,
the same steps can be applied in other domains with the help of domain experts.
Example 4.2.1 An example of a raw log is shown in table “Raw log” in Figure 4.2(a),
which has four columns representing the time, the customer (the entity to be matched),
the ID of the item bought by the customer, and the quantity. Since the item name
may be too specific to be the key identifier for the customer’s buying behavior, an
alternative is to use the item category name as the identifier for the different actions.
This way, actions will correspond to buying Chocolate and Cola rather than Twix
and Coca Cola. The main reason behind this generalization is that, for instance,
buying a bar of Twix should not be considered as a completely different action from
buying a bar of Snickers, and so on. In general, these decisions can be made by
a domain expert to avoid over-fitting when modeling the behavior. In this case, the
specific item name, along with the quantity, will be considered as additional detailed
information, which we will refer to as the action features.
The next step is to assign an id, F_id, to each combination of features occurring
with a specific action in “Raw Log”, as shown in the “Action Description” table. This
step ensures that even if we have multiple features, we can always reason about them
as a single object using F_id. If there is only one feature, then it can be used directly
with no need for F_id.
As a final step, we generate the “Processed Log” by scanning “Raw Log” and
registering the time, entity, action, and F id information for each line.
Behavior Extraction and Representation: Given the standardized log, we
extract the transactions of each entity and represent them in a matrix format, called
Behavior Matrix.
[Figure 4.2(a) contains four tables: the “Raw log” (Time, Cstmr, itm_id, Qty); an item catalog mapping each itm_id to an item name (Twix, Snickers, KitKat, Coca Cola, Pepsi Cola) and a category name (Chocolate or Cola); the “Action Description” table assigning an F_id to each action/feature combination (e.g., <Qty=2>,<Desc=Twix>); and the resulting “Processed Log” (Time, Entity, Action, F_id).]

(a) Raw log pre-processing example: The first step is to decide on the action
identifiers and the features describing each action to create the “Action
Description” table. The second step is to use the identified actions to re-
write the log.

Behavior Matrices over time units 01-16:

              01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
A  Chocolate   0  0  4  0  0  3  0  0  0  4  0  0  0  0  0  0
   Cola        0  0  2  0  0  0  0  2  0  0  0  0  0  0  0  0
B  Chocolate   3  0  0  0  0  0  0  3  0  0  0  0  3  0  3  0
   Cola        2  0  0  0  0  2  0  0  0  2  0  0  2  0  2  0
C  Chocolate   0  0  5  0  0  4  0  0  0  5  0  0  0  5  0  4
   Cola        0  0  1  0  0  1  0  0  1  0  0  0  0  1  0  1
AB Chocolate   3  0  4  0  0  3  0  3  0  4  0  0  3  0  3  0
   Cola        2  0  2  0  0  2  0  2  0  2  0  0  2  0  2  0
BC Chocolate   3  0  5  0  0  4  0  3  0  5  0  0  3  5  3  4
   Cola        2  0  1  0  0  3  0  0  1  2  0  0  2  1  2  1

When merging A & B and then B & C, AB looks more consistent than BC, so A and B are most likely the same customer.
(b) Resulting Behavior Matrices from the processed log.
Fig. 4.2.: Retail store running example.
Definition 4.2.1 Given a finite set of n actions performed over m time units by an
entity A, the Behavior Matrix (BM) of entity A is an n×m matrix, such that:
BM_{i,j} = F_{ij} if action a_i is performed at time j,
BM_{i,j} = 0 otherwise,

where F_{ij} ∈ F_i is the F_id value for the combination of features describing action
a_i when performed at time j, F_i is the domain of all possible F_id values for action
a_i, i = 1, . . . , n, and j = 1, . . . , m.
Example 4.2.2 The BMs for customers A,B and C are shown in Figure 4.2(b). A
non-zero value indicates that the action was performed and the value itself is the F id
that links to the description of the action at this time instant.
A more compact representation for the entities’ behavior is derived from the Behavior
Matrix representation, and is constructed and used during the accurate matching
phase. This second representation, which is based on the inter-arrival times,
considers each row in the BM as a stream or sequence of pairs {v_ij, F(v_ij)}, where
v_ij is the inter-arrival time since the last time action a_i occurred, and F(v_ij) ∈ F_i
is a feature that describes a_i, drawn from L_{a_i} possible descriptions, |F_i| = L_{a_i}. For
example, in Figure 4.2(b), the row corresponding to action a_i = chocolate of entity
C, BM_i = {0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4}, will be represented as X_i =
{{3, 5}, {3, 4}, {4, 5}, {4, 5}, {2, 4}}.
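As a small illustration, the row-to-sequence conversion can be sketched as follows (the function name is ours):

```python
def interarrival_sequence(bm_row):
    """Convert one Behavior Matrix row into pairs (v_ij, F(v_ij)):
    the inter-arrival time since the previous occurrence, and the F_id."""
    seq, last = [], 0
    for t, fid in enumerate(bm_row, start=1):
        if fid != 0:
            seq.append((t - last, fid))
            last = t
    return seq

# Row for action "chocolate" of entity C from Figure 4.2(b):
row_C = [0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4]
print(interarrival_sequence(row_C))
# → [(3, 5), (3, 4), (4, 5), (4, 5), (2, 4)]
```

Note that the first pair measures the inter-arrival time from the start of the log, which matches how X_i is written above.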
The lossy behavior representation used in the candidate generation phase will be
described in Section 4.3.
It is worth mentioning that the actions, along with their level of details (e.g.,
buying chocolate vs. buying Twix) and their associated features, are assumed to be
homogeneous across the two sources. Otherwise, another pre-processing phase will be
required to match the actions, and thereby ensure the homogeneity. Needless to say,
the sources themselves must belong to the same domain (e.g., two grocery stores, or
two news web sites) for the behavior-based approach to be meaningful.
4.2.4 Matching Strategy
As we explained in Section 4.2.2, matching entities based on their extracted
behavior data is achieved in two consecutive phases: a candidate generation phase
followed by an accurate matching phase. In this section, we describe our general
matching strategy, which we apply in the two matching phases. Note that ultimately,
we need to assign a matching score, Sm, for each pair of entities (A,B) deemed as a
potential match, and then report the matches with the highest scores.
To compute Sm(A,B), we first compute a behavior recognition score, Sr, for each
entity (i.e., Sr(A) and Sr(B)). We then merge the behavior data of both A and B to
construct the behavior of some hypothetical entity AB, whose score, Sr(AB), is also
computed.
The next step is to check if this merge results in a more recognizable behavior
compared to either of the two individual behaviors. Hence, the overall matching score
should depend on the gain achieved for the recognition scores. More precisely, it can
be stated as follows:
S_m(A,B) = ( n_A [S_r(AB) − S_r(A)] + n_B [S_r(AB) − S_r(B)] ) / (n_A + n_B)    (4.1)
where n_A and n_B are the total numbers of transactions in the BMs of A and B,
respectively. Note that the gains corresponding to the two entities are weighted
based on the density of their respective BMs.
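Eq. 4.1 is straightforward to implement; a minimal sketch with hypothetical recognition scores and transaction counts:

```python
def matching_score(sr_A, sr_B, sr_AB, n_A, n_B):
    """Merge-gain matching score of Eq. 4.1: the recognition gains of the
    merged entity AB over A and over B, weighted by transaction counts."""
    gain_A = sr_AB - sr_A
    gain_B = sr_AB - sr_B
    return (n_A * gain_A + n_B * gain_B) / (n_A + n_B)

# Hypothetical values: merging makes the behavior more recognizable,
# and B's larger log weights its gain more heavily.
print(matching_score(sr_A=0.6, sr_B=0.5, sr_AB=0.8, n_A=10, n_B=30))
# → 0.275
```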
Example 4.2.3 To better understand the intuition behind the behavior merge strat-
egy, we assume that entities A and C are from Source 1 and B is from Source 2 and
their processed log is shown in table “Processed Log” in Figure 4.2(a). To find the
best match for entity B, we first merge it with A, and then do the same with C. It is
apparent from the resulting BMs in Figure 4.2(b) that A is potentially a good match
for B; entity AB is likely to be an entity that buys chocolate every 2 or 3 days and
prefers 2 liters of Coca Cola with either 2 Twix bars or 4 Snickers bars. However,
it is hard to discern any consistent behavior for entity BC. Of course, in a real
scenario we will deal with many more actions.
The key question now is: How to compute Sr(A)? In fact, the goal of the recog-
nition score, Sr, is to capture the consistency of an entity’s behavior along three
main components: (1) consistency in repeating actions, (2) stability in the features
describing the action, and (3) the association between actions. These three compo-
nents, which will be explained shortly, correspond to three score components of Sr;
i.e., Sr1, Sr2, and Sr3. Hence, we compute Sr(A) as their geometric mean as given
below.
S_r(A) = ∛( S_r1(A) × S_r2(A) × S_r3(A) )    (4.2)
The three behavior components we just mentioned, and which we would like to
capture in Sr, can be explained as follows.
1- Consistency in repeating actions: Entities tend to repeat specific actions
on a regular basis following almost consistent inter-arrival times. For example, a user
(entity) of a news web site may be checking the financial news (action) every morning
(pattern).
2- Stability in the features describing actions: When an entity performs an
action several times, almost the same features are expected to apply each time. For
example, when a customer buys chocolate, s/he mostly buys either 2 Twix bars or 1
Snickers bar, as opposed to buying a different type of chocolate each time and in
completely different quantities. The latter case is unlikely to occur in real scenarios.
3- Association between actions: Actions performed by entities are typically
associated, and the association patterns can be detected over time. For example,
a customer may be used to buying Twix chocolate and Pepsi cola every Sunday
afternoon, which implies an association between these two actions.
The distinction between the matching techniques that we will describe next lies
in the method used to compute S_r1, S_r2, and S_r3. The candidate generation phase is
a special case, as it only considers the first behavior component; i.e., S_r(A) = S_r1(A).
The matching strategy we have described so far can be referred to as the behavior
merge strategy, since it relies essentially on merging the entities’ behaviors and then
measuring the realized gain. This is to be contrasted to an alternative strategy, which
can be referred to as the behavior similarity strategy, where the matching score
can simply be a measure of the similarity between the two behaviors.
We will show in Section 4.4 how the behavior similarity strategy can be imple-
mented in the context of the statistical modeling technique for matching. In Section
4.5, we will experimentally show the superiority of the merge strategy over the simi-
larity strategy in all the scenarios we considered.
4.3 Candidate Generation Phase
To avoid examining all possible pairs of entities during the expensive phase of
accurate matching, we introduce a candidate generation phase, which quickly deter-
mines pairs of entities that are likely to be matched. This phase results in almost no
false negatives, at the expense of relatively low precision.
The high efficiency of this phase is primarily attributed to the use of a very
compact (yet lossy) behavior representation, which allows for fast computations. In
addition, only the first behavior component; i.e., consistency in repeating actions,
which is captured by Sr1, is considered in this phase. Note that because the other
components are ignored, binary BMs are used with 1’s replacing non-zero values.
Each row in the BM , which corresponds to an action, is considered as a binary
time sequence. For each such sequence, we compute the first element of its Discrete
Fourier Transform (DFT) [77], which is a 2-dimensional complex number. The
complex number corresponding to an action ai in the BM of an entity A is computed
by:
C_A^{(a_i)} = Σ_{j=0}^{m−1} BM_{i,j} e^{2πj√−1 / m}    (4.3)
An interesting aspect of this transformation is that the lower the magnitude of the
complex number the more consistent and regular the time sequence, and vice versa.
This can be explained as follows. Consider each of the elements in the time series
as a vector whose magnitude is either 0 or 1, and that their angles are uniformly
distributed along the unit circle (i.e., the angle of the jth vector is 2jπ/m). The complex
number will then be the resultant of all these vectors. Now, if the time series was
consistent in terms of the inter-arrival times between the non-zero values, then their
corresponding vectors would be uniformly distributed along the unit circle, and hence
they would cancel each other out. Thus, the resultant’s magnitude will be close to
zero.
Another interesting aspect is that merging the two rows corresponding to an action
a in the BMs of two entities A and B effectively reduces to adding two complex
numbers, i.e., C_AB^{(a)} = C_A^{(a)} + C_B^{(a)}.
The following example shows how the candidate generation phase can distinguish
between “match” and “mismatch” candidates.
Example 4.3.1 Consider the example described in Figure 4.3. Let aA, aB and aC
be the rows of action a (chocolate) in the binary BMs of entities A, B and C from
Figure 4.2(b). At the left of Figure 4.3, when merging aA and aB, the magnitude
corresponding to the merged action, aAB equals 0.19, which is smaller than the original
magnitudes: 1.38 for aA and 1.53 for aB. The reduction in magnitude is because the
sequence aAB is more regular than either of aA and aB.
At the right of Figure 4.3, we apply the same process for aB and aC. The magni-
tudes we obtain are 2.03 for aBC, 1.54 for aB, and 0.09 for aC. In this case, merging
aB and aC resulted in an increase in magnitude because the sequence aBC looks less
regular than either of aB and aC.
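The effect described in this example can be reproduced directly from Eq. 4.3. The sketch below (function name ours) computes the first DFT coefficient of the binary chocolate rows of A and B and merges them by complex addition; the resulting magnitudes agree with those reported above up to rounding:

```python
import cmath

def first_dft_coefficient(binary_row):
    """Eq. 4.3: each 1 at position t contributes a unit vector at angle
    2*pi*t/m on the unit circle; regular sequences largely cancel out."""
    m = len(binary_row)
    return sum(b * cmath.exp(2j * cmath.pi * t / m)
               for t, b in enumerate(binary_row))

# Binary chocolate rows of A and B (non-zeros of Figure 4.2(b) set to 1).
a_A = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
a_B = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
C_A = first_dft_coefficient(a_A)
C_B = first_dft_coefficient(a_B)
C_AB = C_A + C_B                 # merging rows = adding complex numbers
print(round(abs(C_A), 2), round(abs(C_B), 2), round(abs(C_AB), 2))
# magnitudes ≈ 1.38, 1.54, 0.2: the merged sequence is more regular
```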
Based on the above discussion, we can compute a recognition score, S_r(a_A), for
each individual action a that belongs to entity A, such that it is inversely proportional
to the magnitude of the complex number C_A^{(a)}. In particular, S_r(a_A) = M − mag(C_A^{(a)}),
where mag(C_A^{(a)}) is the magnitude of C_A^{(a)} and M is the maximum computed magnitude.
Fig. 4.3.: Action patterns in the complex plane and the effect on the magnitude.
To compute the overall S_r(A), we average the individual scores, S_r(a_A), each
weighted by the number of times its respective action was repeated (n_A^{(a)}). The
formula for S_r(A) is thus given as follows:

S_r(A) = (1/n_A) Σ_{∀a} n_A^{(a)} · S_r(a_A)    (4.4)
Standard SQL To Compute Candidate Matches: In the following, we pro-
vide a derivation for a final formulation of the matching score in the candidate gener-
ation matching phase. At the end, we provide the corresponding SQL statement we
used for this computation.
After computing the complex-number representation for each action of an entity,
we compute S_r(a_A) = M − mag(C_A^{(a)}), where M is the maximum computed
magnitude. Then, we obtain

S_r(A) = (1/n_A) Σ_{∀a} n_A^{(a)} (M − mag(C_A^{(a)}))    (4.5)
By substituting Eq. 4.5 into Eq. 4.1, we obtain the matching score S_m(A,B):

S_m(A,B) = [n_A / (n_A + n_B)] · [ (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})(M − mag(C_AB^{(a)}))
                                   − (1/n_A) Σ_{∀a} n_A^{(a)} (M − mag(C_A^{(a)})) ]
         + [n_B / (n_A + n_B)] · [ (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})(M − mag(C_AB^{(a)}))
                                   − (1/n_B) Σ_{∀a} n_B^{(a)} (M − mag(C_B^{(a)})) ]
By a simple rearrangement to collect the terms related to mag(C_AB^{(a)}), we get

S_m(A,B) = (1/(n_A + n_B)) Σ_{∀a} [ (n_A^{(a)} + n_B^{(a)}) M − (n_A^{(a)} + n_B^{(a)}) mag(C_AB^{(a)})
                                    − n_A^{(a)} M + n_A^{(a)} mag(C_A^{(a)})
                                    − n_B^{(a)} M + n_B^{(a)} mag(C_B^{(a)}) ]
Note that the terms involving M cancel out, and the final matching score becomes

S_m(A,B) = (1/(n_A + n_B)) Σ_{∀a} [ n_A^{(a)} mag(C_A^{(a)}) + n_B^{(a)} mag(C_B^{(a)})
                                    − (n_A^{(a)} + n_B^{(a)}) mag(C_AB^{(a)}) ]    (4.6)
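As a sanity check on this derivation, the sketch below (with made-up support counts and complex coefficients; the `acts` layout is ours) computes S_m both ways, by plugging Eq. 4.5 into Eq. 4.1 and directly via Eq. 4.6, and the two agree because the M terms cancel:

```python
import random

random.seed(1)
# Hypothetical per-action data for entities A and B: each action maps to
# (n_A^a, C_A^a, n_B^a, C_B^a) -- support counts and first DFT coefficients.
acts = {a: (random.randint(1, 5), complex(random.random(), random.random()),
            random.randint(1, 5), complex(random.random(), random.random()))
        for a in ("chocolate", "cola", "chips")}

n_A = sum(na for na, _, _, _ in acts.values())
n_B = sum(nb for _, _, nb, _ in acts.values())
M = max(m for na, cA, nb, cB in acts.values()
        for m in (abs(cA), abs(cB), abs(cA + cB)))

def s_r(n_tot, items):                       # Eq. 4.5
    return sum(n * (M - abs(c)) for n, c in items) / n_tot

sr_A = s_r(n_A, [(na, cA) for na, cA, _, _ in acts.values()])
sr_B = s_r(n_B, [(nb, cB) for _, _, nb, cB in acts.values()])
sr_AB = s_r(n_A + n_B,
            [(na + nb, cA + cB) for na, cA, nb, cB in acts.values()])

sm_41 = (n_A * (sr_AB - sr_A) + n_B * (sr_AB - sr_B)) / (n_A + n_B)  # Eq. 4.1
sm_46 = sum(na * abs(cA) + nb * abs(cB) - (na + nb) * abs(cA + cB)
            for na, cA, nb, cB in acts.values()) / (n_A + n_B)       # Eq. 4.6
print(sm_41, sm_46)   # identical up to floating-point error
```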
We store the complex-number information for each data source in a relation with
the attributes (entity, action, Re, Im, mag, a_supp, e_supp), with one tuple per
action of each entity. For each action of an entity, we store the real and imaginary
components of the complex number (Re and Im) in addition to the magnitude value
(mag). a_supp is the number of transactions for that action within the entity's log,
and e_supp is the total number of transactions for the entity, repeated with each
tuple corresponding to an action. Thus, there are two tables, src1 and src2,
representing the two data sources.
To generate the candidates, we need to compute Eq. 4.6 for each pair of entities
and filter the result using a threshold t_c on the resulting matching score. The
following SQL applies this computation and returns the candidate matches; the
per-action terms are summed per entity pair, and the threshold is applied to the
aggregated score:

select e1, e2, gain_score
from (
  select
    c1.entity as e1,
    c2.entity as e2,
    sum( c1.a_supp * c1.mag                  -- n^a_A * mag^a_A
       + c2.a_supp * c2.mag                  -- n^a_B * mag^a_B
       - (c1.a_supp + c2.a_supp)             -- n^a_AB *
         * sqrt( (c1.Re + c2.Re) * (c1.Re + c2.Re)   -- mag^a_AB
               + (c1.Im + c2.Im) * (c1.Im + c2.Im) )
    ) / (c1.e_supp + c2.e_supp)              -- n_A + n_B
    as gain_score
  from src1 c1 inner join src2 c2
    on c1.action = c2.action
  group by c1.entity, c2.entity, c1.e_supp, c2.e_supp
) scores
where gain_score > t_c
4.4 Accurate Matching Phase
4.4.1 Statistical Modeling Technique
Building the Statistical Model
Our goal is to build a statistical model for the behavior of an entity given its
observed actions. The two key variables defining an entity’s behavior with respect
to a specific action are (1) the inter-arrival time between the action occurrences,
and (2) the feature id associated with each occurrence, which represents the features
describing how the action was performed at that time, or in other words it reflects
the entity’s preferences when performing this action.
In general, we expect that a given entity will be biased to a narrow set of inter-
arrival times and feature ids which is what will distinguish the entity’s behavior.
In merging two behavior matrices for the same entity, the bias should be enforced
and made clearer. However, when the behavior matrices of two different entities
are merged, the bias will instead be weakened and made harder to recognize. The
statistical model that we build should enable us to measure these properties.
Our problem is similar to classifying a biological sequence as being a motif, i.e., a
sequence that mostly contains a recognized repeated pattern, or not. A key objective
in computational biology is to be able to discover motifs by separating them from
some background sequence that is mostly random. In our case, a motif corresponds
to a sequence of an action by the same entity. In view of this analogy, our statistical
modeling will have the same spirit as the methods commonly used in computational
biology. However our model has to fit the specifics of our problem which are as follows:
(a) sequences are of two variables (inter-arrivals and feature id), rather than just one
variable (DNA character) and (b) for ordinal variables (such as the inter-arrival time),
neighboring values need to be treated similarly.
Modeling the Behavior for an Action: We model the behavior of an entity A
with respect to a specific action a using a finite mixture model M = {M_1, . . . , M_K},
with mixing coefficients λ^{(a_A)} = {λ_1^{(a_A)}, . . . , λ_K^{(a_A)}}, where M_k is its kth component.
Each component M_k is associated with two random variables: (i) the inter-arrival
time, which is generated from a uniform distribution over a range of inter-arrival times,
r_k = [start_k, end_k]^1; and (ii) the feature id, a discrete variable modeled
using a multinomial distribution with parameter θ_k^{(a_A)} = {f_{k1}^{(a_A)}, . . . , f_{kL}^{(a_A)}}, where L
is the number of all possible feature ids, and f_{kj}^{(a_A)} is the probability of describing an
occurrence of action a using feature F_j, j = 1, . . . , L. In what follows, we omit the
superscript a_A and assume that there is only one action in the system to simplify the
notation.

^1 The range size of r_k is user-configurable, as it depends on the application and on which values are considered close. In our experiments with retail store data from Walmart, we generated ranges by sliding a window of size 5 days with a step of 3 days over the time period (i.e., {[1,6], [4,9], [7,12], . . . }).
What we described so far is essentially a generative model in the sense that once
built, we can use it to generate new action occurrences for a given entity. For example,
using λ, we can select the component Mk to generate the next action occurrence,
which should occur after an inter-arrival time picked from the corresponding range
rk = [startk, endk] and we can describe the action by selecting a feature id using θk.
However, we do not use the model for this purpose. Instead, we use its estimated
parameters (λ and the vectors θk) to determine the level of recognizing repeated
patterns in the sequence corresponding to the action occurrences.
For the estimation of the model parameters, we use the Expectation-Maximization
(EM) algorithm to fit the mixture model for each specific action a of an entity A to
discover the optimal parameter values which maximize the likelihood function of the
observed behavior data.
Before we present the algorithm and the derivations used to estimate the model
parameters, we show an example of a behavior and the properties we desire for its
corresponding model parameters. We demonstrate the challenge in finding those
desired parameters, which we address by using the EM algorithm. We also show
through the example how we choose the initial parameter values required for the EM
algorithm.
Example 4.4.1 Consider that a customer’s behavior with respect
to the action of buying chocolate is represented by the sequence
{{6, s}, {15, l}, {6, s}, {8, s}, {15, l}, {14, l}, {13, l}}, where s denotes a small
quantity (e.g., 1-5 bars), and l denotes a large quantity (e.g., more than 5 bars). So
s/he bought a small quantity of chocolate after 6 days, a large quantity after 15 days,
and so on.
To characterize the inter-arrival times preferred by this customer, the best ranges
of size 2 to use are [6, 8] and [13, 15]. Their associated mixing coefficients (λk) should
be 3/7 and 4/7, because the two ranges cover 3 and 4, respectively, of the 7 observed
data points.
However, since in general, the best ranges in a behavior sequence will not be as
clear as in this case, we need to systematically consider all the ranges of a given size
(2 in this case), and assign mixing coefficients to each of them. The possible ranges
for our example would be {[6, 8], [7, 9], [8, 10], . . . , [13, 15]}.
A straightforward approach to compute λk for each range is to compute the nor-
malized frequency of occurrence of the given range for all the observed data points.
For instance, the normalized frequencies for the ranges [6, 8], [12, 14], and [13, 15] are
3/12, 2/12, and 4/12 (or 1/4, 1/6, and 1/3), respectively, where 12 is the sum of frequencies for all
possible ranges. (Note that the same inter-arrival time may fall in multiple overlap-
ping ranges.) Clearly, these are not the desired values for λk. We would rather have
zero values for all ranges other than [6, 8] and [13, 15]. However, we still use these
normalized frequencies as the initial values for λk to be fed into the EM algorithm.
Similarly, to compute the initial values for the θk probabilities, we first consider
the data points covered by the range corresponding to component Mk only. Then, for
each possible value of the feature id, we compute its normalized frequency across these
data points. Clearly, in our example, the customer favors buying small quantities
when s/he shops at short intervals (6-8 days apart), and large quantities when s/he
shops at longer intervals (13-15 days apart).
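The initialization described in this example can be sketched as follows (variable names ours; ranges of size 2 enumerated with step 1, as listed in the example):

```python
from collections import Counter

# The chocolate-buying sequence of Example 4.4.1: (inter-arrival, feature).
X = [(6, "s"), (15, "l"), (6, "s"), (8, "s"),
     (15, "l"), (14, "l"), (13, "l")]
ranges = [(s, s + 2) for s in range(6, 14)]    # [6,8], [7,9], ..., [13,15]

# Initial mixing coefficients: normalized range frequencies (gamma = 1).
counts = {r: sum(1 for v, _ in X if r[0] <= v <= r[1]) for r in ranges}
total = sum(counts.values())                   # 12 for this sequence
lam0 = {r: c / total for r, c in counts.items()}

# Initial theta_k: feature frequencies among the points each range covers.
theta0 = {}
for r in ranges:
    feats = Counter(f for v, f in X if r[0] <= v <= r[1])
    n = sum(feats.values())
    theta0[r] = {f: c / n for f, c in feats.items()} if n else {}

print(lam0[(6, 8)], lam0[(13, 15)])    # 3/12 and 4/12, as in the example
print(theta0[(6, 8)], theta0[(13, 15)])
```

As the example notes, short intervals come with small quantities and long intervals with large ones, so theta0 for [6, 8] puts all its mass on "s" and theta0 for [13, 15] on "l".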
Behavior Model Parameters Estimation
In the following, we use the expectation-maximization (EM) algorithm to fit the
finite mixture model to a given action sequence representing its occurrences, and to
discover the parameter values of the overall model discussed earlier in this
section. To simplify the notation, we assume there is only one action in the system,
so we omit the superscripts that link the entity and action names.
As we mentioned earlier, we use EM to discover the parameter values of the
mixture model that maximize the likelihood of the data. EM uses the concept of
missing data and follows an iterative procedure to find values for λ and θ which
maximize the likelihood of the data given the model. In our case, the missing data
is the knowledge of which components produced X = {{v_1, F(v_1)}, . . . , {v_N, F(v_N)}}.
A finite mixture model assumes that the sequence X arises from two or more
components with different, unknown parameters. Once we obtain these parameters,
we use them to compute the behavior scores along each of the three behavior
components.
Let us now introduce a K-dimensional binary random variable Z with a 1-of-K
representation, in which a particular z_k is equal to 1 and all other elements are equal
to 0, i.e., z_k ∈ {0, 1} and Σ_{k=1}^K z_k = 1, such that the probability p(z_k = 1) = λ_k.
Every entry X_i in the sequence will be assigned a Z_i = {z_i1, z_i2, . . . , z_iK}. We can
easily show that the probability

p(X_i | θ_1, . . . , θ_K) = Σ_{k=1}^K p(z_ik = 1) p(X_i | Z_i, θ_1, . . . , θ_K)
                        = Σ_{k=1}^K λ_k p(X_i | θ_k)
Since we do not know z_ik, we consider the conditional probability γ(z_ik) of z_ik
given X_i, p(z_ik = 1 | X_i), which can be found using Bayes' theorem [78]:

γ(z_ik) = p(z_ik = 1) p(X_i | z_ik = 1) / Σ_{k=1}^K p(z_ik = 1) p(X_i | z_ik = 1)
        = λ_k p(X_i | θ_k) / Σ_{k=1}^K λ_k p(X_i | θ_k)    (4.7)
We shall view λ_k as the prior probability of z_ik = 1, and γ(z_ik) as the corresponding
posterior probability once we have observed X. γ(z_ik) can also be viewed as the
responsibility that component M_k takes for explaining the observation X_i. Therefore, the likelihood
or probability of the data given the parameters can be written in log form as:

ln p(X | λ, θ) = Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln [λ_k p(X_i | θ_k)]
              = Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln p(X_i | θ_k) + Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln λ_k    (4.8)
The EM algorithm monotonically increases the log likelihood of the data until
convergence by iteratively computing the expected log likelihood of the complete
data (X, Z) in the E-step and maximizing this expected log likelihood over the model
parameters λ and θ in the M-step. We first choose some initial values for the parameters,
λ^{(0)} and θ^{(0)}. Then, we alternate between the E-step and M-step of the algorithm until it
converges.
In the E-step, to compute the expected log likelihood of the complete data, we
need to calculate the required conditional distribution γ^{(0)}(z_ik). We plug λ^{(0)} and
θ^{(0)} into Eq. 4.7 to get γ^{(0)}(z_ik), where we can compute p(X_i | θ_k) as follows:
p(X_i | θ_k) = Π_{j=1}^L f_kj^{I(j, k, F(v_i))}    (4.9)

where X_i = {v_i, F(v_i)} and I(j, k, F(v_i)) is an indicator function equal to 1 if v_i ∈ r_k
and F(v_i) = F_j; otherwise it is 0.
Recall that r_k = [start_k, end_k] is the range identifying the component M_k.
The M-step of EM maximizes Eq. 4.8 over λ and θ in order to re-estimate new
values λ^{(1)} and θ^{(1)}. The maximization over λ involves only the second term in
Eq. 4.8; argmax_λ Σ_{i=1}^N Σ_{k=1}^K γ(z_ik) ln λ_k has the solution

λ_k^{(1)} = (1/N) Σ_{i=1}^N γ^{(0)}(z_ik),   k = 1, . . . , K.    (4.10)
We can maximize over θ by maximizing the first term in Eq. 4.8 separately over each
θ_k for k = 1, . . . , K. Computing argmax_θ E[ln p(X, Z | θ_1, . . . , θ_K)] is equivalent to
maximizing the right-hand side of Eq. 4.11 over θ_k (only a piece of the parameter)
for every k:

θ_k = argmax_{θ_k} Σ_{i=1}^N γ^{(0)}(z_ik) ln p(X_i | θ_k)    (4.11)
To do this, for k = 1, . . . , K and j = 1, . . . , L, let

c_kj = Σ_{i=1}^N γ^{(0)}(z_ik) I(j, k, F(v_i))    (4.12)
Then c_kj is in fact the expected number of times the action is described by F_j when
its inter-arrival time falls in M_k's range r_k. We re-estimate θ_k by substituting Eq. 4.9
into Eq. 4.11 to get

θ_k^{(1)} = {f_k1, . . . , f_kL} = argmax_{θ_k} Σ_{j=1}^L c_kj ln f_kj    (4.13)

Therefore,   f_kj = c_kj / Σ_{j′=1}^L c_kj′    (4.14)
To find the initial parameters λ^{(0)} and θ^{(0)}, we scan the sequence X once and use
Eq. 4.12 to get c_kj by setting all γ^{(0)} = 1. Afterward, we use Eq. 4.14 to get θ_k^{(0)} and
compute

λ_k^{(0)} = Σ_{j=1}^L c_kj / Σ_{k=1}^K Σ_{j=1}^L c_kj
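A minimal EM sketch under the assumptions above (uniform-range components, multinomial feature distributions, γ^{(0)} = 1 initialization); structure and names are ours, not the thesis implementation. On the sequence of Example 4.4.1 it concentrates the mixing coefficients on the ranges [6, 8] and [13, 15], at 3/7 and 4/7:

```python
X = [(6, "s"), (15, "l"), (6, "s"), (8, "s"),
     (15, "l"), (14, "l"), (13, "l")]
ranges = [(s, s + 2) for s in range(6, 14)]   # component ranges r_k
feats = sorted({f for _, f in X})             # the L possible feature ids
K, N = len(ranges), len(X)

def in_range(k, v):
    lo, hi = ranges[k]
    return lo <= v <= hi

def p_x(k, x, theta):                         # Eq. 4.9
    v, f = x
    return theta[k].get(f, 0.0) if in_range(k, v) else 0.0

# Initialization: Eq. 4.12 with gamma = 1, then Eq. 4.14 and lambda^(0).
c = [{f: sum(1 for v, g in X if in_range(k, v) and g == f) for f in feats}
     for k in range(K)]
theta = []
for k in range(K):
    tot = sum(c[k].values())
    theta.append({f: (c[k][f] / tot if tot else 1.0 / len(feats))
                  for f in feats})
lam = [sum(c[k].values()) / sum(sum(ck.values()) for ck in c)
       for k in range(K)]

for _ in range(50):                           # EM iterations
    # E-step: responsibilities gamma(z_ik), Eq. 4.7.
    gamma = []
    for x in X:
        w = [lam[k] * p_x(k, x, theta) for k in range(K)]
        z = sum(w)
        gamma.append([wk / z for wk in w])
    # M-step: Eq. 4.10 for lambda; Eqs. 4.12-4.14 for theta.
    lam = [sum(g[k] for g in gamma) / N for k in range(K)]
    for k in range(K):
        ck = {f: sum(g[k] for g, (v, h) in zip(gamma, X)
                     if in_range(k, v) and h == f) for f in feats}
        tot = sum(ck.values())
        if tot > 0:
            theta[k] = {f: ck[f] / tot for f in feats}

print({r: round(l, 3) for r, l in zip(ranges, lam) if l > 1e-6})
```

The initial normalized frequencies spread mass over overlapping ranges, and EM's iterations shift it to the two ranges that explain the data best, exactly the values the example identified as desirable.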
Computing Matching Scores
To this point, we have defined a model and estimated its parameters λ
and θ, which can be used to re-generate the sequence {X_1, . . . , X_N} that represents
the occurrences of the action. Recall that our aim is to match two entities A and B by
computing the gain Sm(A,B) in recognizing a behavior after merging A and B using
Eq. 4.1. This requires computing the scores Sr(A), Sr(B) and Sr(AB) using Eq. 4.2,
which in turn requires computing the behavior recognition scores corresponding to
the three behavior components, which, for entity A for example, are Sr1(A), Sr2(A),
and Sr3(A).
For the first behavior component, the consistency in repeating an action a is
equivalent to classifying its sequence as a motif. We quantify the pattern strength to
be inversely proportional to the uncertainty about selecting a model component using
λ^{(a_A)}; i.e., action a's sequence is a motif if the uncertainty about λ^{(a_A)} is low. Thus,
we can use the entropy to compute S_r1(a_A) = log K − H(λ^{(a_A)}), where H(λ^{(a_A)}) =
−Σ_{k=1}^K λ_k^{(a_A)} log λ_k^{(a_A)}. The overall score S_r1(A) is then computed by a weighted
sum over all the actions according to their support, i.e., the number of times each
action was repeated:

S_r1(A) = (1/n_A) Σ_{∀a} n_A^{(a)} · S_r1(a_A)    (4.15)
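The entropy-based score can be sketched as follows (function names and the per-entity data layout are ours):

```python
import math

def s_r1_action(lam):
    """S_r1(a_A) = log K - H(lambda): low entropy over the mixing
    coefficients means a strongly recognizable repetition pattern."""
    K = len(lam)
    H = -sum(p * math.log(p) for p in lam if p > 0)
    return math.log(K) - H

def s_r1(entity):
    """Eq. 4.15: support-weighted average over the entity's actions.
    `entity` maps each action to (n_a, lambda)."""
    n = sum(na for na, _ in entity.values())
    return sum(na * s_r1_action(l) for na, l in entity.values()) / n

print(s_r1_action([0.9, 0.05, 0.05, 0.0]))   # high: mass on one component
print(s_r1_action([0.25, 0.25, 0.25, 0.25])) # ~0.0: no recognizable pattern
```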
For the second behavior component, the stability in describing the action (action
features) is more recognizable when the uncertainty in picking the feature id
values is low. The behavior score along this component can be evaluated by first
computing θ′^{(a_A)} = {f′_1^{(a_A)}, . . . , f′_L^{(a_A)}}, the overall parameter for picking a feature
id value for action a using the multinomial distribution, such that the overall
probability for entity A to describe its action a by feature F_j is f′_j^{(a_A)}. Here,
f′_j^{(a_A)} = Σ_{k=1}^K λ_k^{(a_A)} f_kj^{(a_A)}, combined from all K components for j = 1, . . . , L,
knowing that θ_k^{(a_A)} = {f_{k1}^{(a_A)}, . . . , f_{kL}^{(a_A)}}. Using the entropy of θ′^{(a_A)}, we compute
S_r2(a_A) = log L − H(θ′^{(a_A)}), where H(θ′^{(a_A)}) = −Σ_{j=1}^L f′_j^{(a_A)} log f′_j^{(a_A)}. Similar to
Eq. 4.15, we compute the overall score S_r2(A) as the weighted sum of S_r2(a_A)
according to the actions' support.
For the third component, we look for evidence of associations between
actions. We estimate, for every pair of actions, the probability of being generated from
components with the same inter-arrival ranges. The association between actions can
be recognized when they occur close to each other in time; in other words, when
both of them tend to prefer the same model components to generate their
sequences. Consequently, the score for the third component can be computed over
all possible pairs of actions (a, b) of the same entity as follows:

S_r3(A) = Σ_{∀a,b} Σ_{k=1}^K λ_k^{(a_A)} λ_k^{(b_A)}
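The remaining two score components can be sketched in the same spirit (function names ours; for S_r3 we sum over unordered action pairs, which is one reading of Σ_{∀a,b}):

```python
import math

def s_r2_action(lam, theta):
    """S_r2(a_A) = log L - H(theta'), where theta'_j = sum_k lam_k * f_kj
    is the overall probability of describing the action with feature F_j."""
    L = len(theta[0])
    tp = [sum(lam[k] * theta[k][j] for k in range(len(lam)))
          for j in range(L)]
    H = -sum(p * math.log(p) for p in tp if p > 0)
    return math.log(L) - H

def s_r3(lams):
    """S_r3(A): for each unordered pair of actions, the probability of
    preferring the same components; `lams` maps action -> lambda vector."""
    acts = list(lams)
    return sum(sum(la * lb for la, lb in zip(lams[a], lams[b]))
               for i, a in enumerate(acts) for b in acts[i + 1:])

# Each component fixes a different feature, so the overall feature choice
# is maximally uncertain and S_r2 is ~0.
print(s_r2_action([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
print(s_r3({"chocolate": [0.9, 0.1], "cola": [0.8, 0.2]}))
```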
Computing Behavior Similarity Score:
The similarity between two behaviors can be simply quantified by the closeness
between the parameters of their corresponding behavior models according to the
Euclidean distance. For two entities A and B, we compute the behavior similarity as
follows:

BSim(A,B) = 1 − (1/(n_A + n_B)) Σ_{∀a} (n_A^{(a)} + n_B^{(a)})
            √( Σ_{k=1}^K [ (λ_k^{(a_A)} − λ_k^{(a_B)})² + Σ_{j=1}^L (λ_k^{(a_A)} f_kj^{(a_A)} − λ_k^{(a_B)} f_kj^{(a_B)})² ] )
Note that this method is preferred over directly comparing the BMs of the entities,
since the latter method would require some sort of alignment for the time dimension
of the BMs. In particular, deciding which cells to compare to which cells is not
obvious.
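A sketch of BSim under an assumed representation of the per-action models (each action maps to its support, λ, and θ; names ours):

```python
import math

def bsim(A, B):
    """Behavior similarity: 1 minus the support-weighted Euclidean distance
    between the two entities' model parameters, per shared action."""
    nA = sum(v[0] for v in A.values())
    nB = sum(v[0] for v in B.values())
    total = 0.0
    for a in A.keys() & B.keys():
        na, lamA, thA = A[a]
        nb, lamB, thB = B[a]
        d = math.sqrt(sum(
            (lamA[k] - lamB[k]) ** 2
            + sum((lamA[k] * thA[k][j] - lamB[k] * thB[k][j]) ** 2
                  for j in range(len(thA[k])))
            for k in range(len(lamA))))
        total += (na + nb) * d
    return 1 - total / (nA + nB)

# Identical models give the maximal similarity of 1.
m = (5, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
print(bsim({"chocolate": m}, {"chocolate": m}))
```

Comparing in parameter space sidesteps the time-alignment problem the text mentions: only λ and θ are compared, never raw BM cells.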
4.4.2 Information Theoretic technique (Compressibility)
We here present an information theory-based technique for the computation of
the matching scores. It is not as accurate as the motif-based technique presented in
Section 4.4.1, but it is more computationally efficient. The underlying idea stems
from observing that, if we represent the BM as an image, we will see repeated
horizontal blocks, which become more pronounced as the behavior becomes more
recognizable. The repeated blocks appear because of the repetition in the behavior
patterns. Therefore, we expect more regularity along the rows than along the columns
of the BM. In fact, the order of values in any of the columns depends on the order of
the actions in the BM, which is not expected to follow any recognizable pattern. For
these reasons, we compress the BM on a row-by-row basis, rather than compressing
the entire matrix as a whole.
Most existing compression techniques exploit data repetition and encode it in a
more compact representation. We thus introduce compressibility as a measure of
confidence in recognizing behaviors. In our experiments, we compress the BM with the
DCT compression technique [79], one of the most commonly used compression
techniques in practice. We then use the compression ratios to compute the behavior
recognition scores. Significantly higher compression ratios imply a more recognizable
behavior.
Given the sequence representation of an action's occurrences, i.e., \{\langle v_j, F(v_j)\rangle\}, if an
entity is stable in repeating an action, the values v_j will exhibit a certain level
of correlation reflecting the action rate. Moreover, the feature values F(v_j) will contain
similar values describing how the action was performed. To compress
an action sequence, we follow the same approach used in JPEG [80], applied to a
one-dimensional sequence.
Our aim is to compute the three behavior recognition scores along the three behavior
components (see Section 4.2.4). For the first behavior component, we compress
the sequence \{v_1, \ldots, v_{n_A^{(a)}}\}, which represents the inter-arrival times of each action
a. The behavior score S_{r1}(a_A) for action a of entity A is the resulting compression
ratio; the higher the compression ratio, the more we can recognize a consistent
inter-arrival time (motif). We then use Eq. 4.15 to compute the overall score
S_{r1}(A). Similarly, for the second behavior component, we compress the sequence
\{F(v_1), \ldots, F(v_{n_A^{(a)}})\}, which represents the feature values that describe action a.
Again, the score S_{r2}(a_A) is the produced compression ratio; the higher the compression
ratio, the more we can recognize stability in the action features. From the per-action
scores S_{r2}(a_A), we compute the overall score S_{r2}(A) in the same way.
Finally, for the third behavior component, which evaluates the relationship between
the actions, we compress the concatenated sequences of inter-arrival times of
every possible pair of actions. Given two actions a and b, we concatenate and then
compress their inter-arrival times to get the compression ratio cr_{a,b}. If a and b are
closely related, they will have similar inter-arrival times, which allows for better
compressibility of the concatenated sequence. On the contrary, if they are not
related, the concatenated sequence will contain varying values. Thus, cr_{a,b} quantifies
the association between actions a and b. Hence, the overall pairwise association,
which is evidence for the strength of the relationship between the actions, can be
computed by:

S_{r3}^{(A)} = \sum_{\forall a,b} cr_{a,b}
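The compression-ratio scores can be sketched as follows; zlib's DEFLATE is used here purely as a stand-in for the DCT-based compression the thesis actually employs, and the function names are illustrative:

```python
import zlib

def compression_ratio(seq):
    """Score for one sequence (e.g., an action's inter-arrival times):
    raw size divided by compressed size. Repetitive, stable sequences
    compress well and therefore score high."""
    raw = ",".join(str(v) for v in seq).encode()
    return len(raw) / len(zlib.compress(raw))

def pairwise_association(seq_a, seq_b):
    """cr_{a,b}: compression ratio of the concatenation of two actions'
    inter-arrival sequences; related actions with similar inter-arrival
    times yield a higher ratio."""
    return compression_ratio(list(seq_a) + list(seq_b))
```

A perfectly regular sequence compresses far better than an irregular one of the same length, which is exactly the signal the recognition scores rely on.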
4.5 Experiments
The goals of our experimental study are:
• Evaluate the overall linkage quality of our behavior based record linkage for
various situations of splitting the entities’ transactions in the log.
• Demonstrate the quality improvement when our technique is combined with a
textual based record linkage technique.
• Study the performance and scalability, as well as the effect of the candidate
generation phase on the overall performance. We also include in our
evaluation the compressibility technique, which is discussed in Section 4.4.2.
In the following, we will refer to the statistical model technique as motif.
To the best of our knowledge, this is the first approach to leverage entity behavior
for record linkage. Consequently, there is no other technique to directly compare to.
Instead, we show how our technique can be combined with a textual record linkage
technique.
Dataset: We use a real-world transaction log from Walmart, which covers
many similar scenarios in the retail industry. These transactions cover a period of
16 months. An entry in the log represents an item that has been bought at a given
time. Figure 4.2(a) shows a typical example of this log. We use the first-level item
grouping as the actions, which are described by the quantity feature. This feature was
discretized into {very low, low, medium, high, very high} for each individual item (a high
quantity of oranges is different from a high quantity of milk gallons).
111
Setup and Parameters: To simulate the existence of two data sources whose
customers (entities) need to be linked, a given entity’s log is partitioned into con-
tiguous blocks which are then randomly assigned to the sources. The log splitting
operation is controlled by the following parameters with their assigned default values
if not specified: (1) e: the percentage of overlapped entities between the two data
sources (default 50%). (2) d: the probability of assigning a log block to the first data
source (default 0.5). (3) b: the transactions block size as a percentage of the entity’s
log size. When b is very small, the log split is called a random split (default 1%),
and for higher values we call the split a block split (default 30%). The block split
represents the case where the customer alternates between stores in different places,
e.g., because of moving to a different place during the summer. When b is 50%, the
log is split into two equal contiguous halves. Of the overlapping entities, 50%
have their transactions randomly split and the rest are block split. These parameters
allow us to test our techniques under various scenarios of how entities interact with
the two systems.
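This splitting procedure can be simulated with a short sketch; the function name and defaults are illustrative of the setup described above, not the thesis's experimental code:

```python
import random

def split_log(transactions, d=0.5, b=0.01, seed=0):
    """Partition one entity's transaction log into contiguous blocks of
    b * len(log) transactions, assigning each block to the first source
    with probability d and to the second otherwise."""
    rng = random.Random(seed)
    size = max(1, int(b * len(transactions)))
    src1, src2 = [], []
    for i in range(0, len(transactions), size):
        block = transactions[i:i + size]
        (src1 if rng.random() < d else src2).append(block)
    # Flatten the block lists back into per-source transaction logs.
    return ([t for blk in src1 for t in blk],
            [t for blk in src2 for t in blk])
```

With b near 0 each transaction is assigned almost independently (the random split); with b = 0.5 the log falls into two contiguous halves (the block split).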
All the matching scores within a phase are scaled to be between 0 and 1 by
subtracting the minimum score and dividing by the resulting maximum. All of the experiments
were conducted on a Linux box with a 3 GHz processor and 32 GB RAM. We im-
plemented the proposed techniques in Java and we used MySQL DBMS to store and
query the transactions and the intermediate results.
4.5.1 Quality
The matching quality of the proposed techniques is analyzed by reporting the
classical precision and recall. We also report the f-measure = (2 × precision × recall) / (precision + recall), which
corresponds to the weighted harmonic mean of precision and recall. Since we control
the number of overlapping entities, we know the actual unique entities and can compute
the precision and recall exactly. In some cases, and to provide more readable plots, we only
report the f-measure as an overall quality measure.
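The f-measure used throughout the plots is a one-line computation; this sketch only mirrors the formula above:

```python
def f_measure(precision, recall):
    """Weighted harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
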
Fig. 4.4.: Behavior linkage overall quality. (a) Candidate generation quality. (b) Accurate matching comparisons: precision & recall. (c) Accurate matching comparisons: f-measure.
Overall Quality: In this experiment, we use a log of a group of 1000 customers.
For the candidate generation phase, we report in Figure 4.4(a) the recall, precision
and percentage of the reduction in the number of candidates against the candidate
matching score threshold tc. If the two data sources contain p and q entities and the
number of generated candidates is c pairs, the reduction percentage corresponds to
r = 100(pq − c)/pq.
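The reduction percentage is straightforward to compute; this sketch simply encodes the formula above:

```python
def reduction_percentage(p, q, c):
    """Percentage reduction in candidate pairs: compare the c generated
    candidates against all p * q possible cross-source pairs."""
    return 100.0 * (p * q - c) / (p * q)
```
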
We observe that high recall values close to 100% are achieved for tc ≤ 0.3. More-
over, the reduction in the number of candidates starts around 40% and quickly in-
creases close to 100% for tc ≥ 0.2. The precision starts at very low values close to
zero and increases with tc.
The main purpose of this phase is to reduce the number of candidates while
maintaining high recall using an approximate matching process. Therefore, low values
for tc should be used to relax the matching function and avoid false negatives. For
low values around tc = 0.2, the number of candidates is sharply reduced with very
few false negatives. This result was consistent across different datasets.
Figures 4.4(b) and 4.4(c) illustrate and compare the overall quality of the two
techniques of the behavior merge strategy, motif and compressibility, and of the behavior
similarity technique. In this experiment, we used the candidates produced at tc = 0.2.
The three techniques behave similarly with respect to tf , the Phase 2 filtering threshold.
High recall is achieved for low values of tf , while high precision is reached for high tf .
In Figure 4.4(c), the f-measure values show that the motif technique achieves an accuracy
over 80%, while the compressibility technique barely reaches 65%. The behavior
similarity technique performed worst, barely reaching 45%.
The difference between the motif and compressibility techniques is expected as the
motif technique is based on an exhaustive statistical method that is more accurate,
while the compressibility is based on efficient mathematical computations. The be-
havior similarity technique did not perform well because when a customer’s shopping
behavior is split randomly, it will be difficult to accurately model his/her behavior
based on either source taken separately. Consequently, comparing the behavior model
can hardly help in matching the customers. In the case where the transactions are
block split, the behavior can be well modeled. However, since there are many distinct
customers who have similar shopping behaviors, the matching quality will drop.
For subsequent experiments and to be fair to the three behavior matching tech-
niques, we report the best achieved results when changing tf . For the candidate
generation phase, we used tc = 0.2.
Improving Quality with Textual Linkage: In this experiment, we consider a
situation that is similar to the Yahoo-Maktoob acquisition discussed in the introduc-
tion. We constructed a profile for each customer such that the textual information
is not very reliable. Basically, we synthetically perturbed the string values.

Fig. 4.5.: Improving the textual matching quality.

To match the customers' textual information, we used the Levenshtein2 distance
function [81]. In the experiment, we first match the customers using different string
similarity thresholds and we then pass the resulting matches to our behavior-based
matching techniques.
In Figure 4.5, the overall accuracy improvement is illustrated by reporting the
f-measure values. Generally, as we relax the textual matching by reducing the
string similarity threshold, the behavior linkage approach gets more opportunity
to improve the overall quality. Reducing the string similarity threshold
resulted in very low precision and high recall with an overall low f-measure. Leverag-
ing the behavior in such case improves the precision and consequently improves the
f-measure. Although all the behavior matching techniques improved the matching
quality, the motif technique is more accurate.
Split Transactions with Different Probabilities: In this experiment, we
study the effect of changing the parameter d (i.e., we assess the quality when the
entities' transactions are split between the data sources with different densities). In
Figure 4.6(a), we report and compare the f-measure when linking several datasets,
each split with a different d value from 0.1 to 0.5.
2We used the implementation in the Second String library (http://secondstring.sourceforge.net/)
Fig. 4.6.: Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. (a) Quality vs. different splitting probabilities. (b) Quality vs. behavior exhaustiveness.
We observe that for d ≤ 0.2 (i.e., the first source contains less than 20% of the
entity's transactions), the matching quality drops sharply. The drop happens because
one of the data sources will contain customers with little information about their
behaviors. When matching such small behaviors, they are likely to fit many other
customers and produce well recognized merged behaviors. The behavior merge techniques
consistently produce better results when each of the participating data
sources contains at least 20% of an entity's behavior. This is a reasonable result,
especially since only the behavior information is used for linkage.
Split Incomplete Behaviors: This experiment evaluates the quality of the
behavior linkage when matching incomplete behaviors (i.e., non-exhaustive3). More
precisely, for a customer we split his/her transactions into p parts and proceed to
link the behaviors using only two parts. We report in Figure 4.6(b) the resulting
f-measure matching values when p = 2, . . . , 7.
As expected, as we increase p, the matching quality of all the behavior linkage
approaches drops. Although the customers we used in the experiments may not have
their complete buying behavior in Walmart stores, the motif technique was able to
3An exhaustive behavior means that the customer does all of his/her purchases from the same store throughout the entire studied period.
Fig. 4.7.: Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. (a) Quality vs. behavior contiguousness. (b) Quality vs. percentage of overlapping entities.
get more than 50% quality for customers having about 25% of their behavior split
between the two sources. The matching quality of the compressibility technique drops
below 50% immediately if we split the transactions into 3 parts and link two of them.
The behavior similarity technique was not helpful most of the time, even when using
an even transaction split.
Split Contiguous Behavior Blocks: This experiment studies the effect of
changing b, the transactions block size, to split an entity’s transactions. In Figure
4.7(a), we report the matching quality using the f-measure as we change b. For
b = 50%, the transactions are split into two contiguous halves and for b = 1%, it is
almost as if we are randomly splitting the transactions. For low values of b, we get
the best quality matching using the behavior merge, then as we increase the block
size, b, the matching quality drops. For the behavior similarity, the quality starts low,
about 45%, for very low values of b, then the quality improves to about 65% for b
between 5% and 15%. After that the quality keeps decreasing with the increase of b.
The behavior merge techniques benefit from having the original entities’ behavior
more random; when merging the transactions, the behavior patterns emerge. This is
the main reason for the good accuracy at low values of b. Moreover, high values
of b close to 50% mean that the behaviors can already be well recognized in each
source, and therefore the computed gain after merging the transactions will not be
significant enough to distinguish between the entities. On the other hand, for the behavior similarity technique,
there are two observations affecting its results: (i) there are many customers sharing
the same buying behavior, and (ii) how much of the complete behavior can be recog-
nized. For low values of b, the behavior cannot be well recognized, leading to poorly
estimated behavior model parameters and, consequently, low matching accuracy. For
high values of b, the behavior is well recognized; however, because many
customers share the same behavior, the matching quality drops.
Changing the Number of Overlapping Entities: In this experiment, we
study the effect of changing the overlapping percentage, e, from 10% to 100% (100%
means that all the customers use both stores); the results are reported in Figure
4.7(b). We see that the overlapping percentage does not significantly affect
the matching quality of any of the techniques. This highlights the ability of our approach to
provide good results even when the expected number of overlapping entities is small.
4.5.2 Performance
Our next set of experiments studies the execution time. We start by showing the
positive effect of the candidate generation phase on the overall linking time, then we
discuss the scalability of our approach.
Candidate Generation Phase Effectiveness: In this experiment, we used the
same dataset as in Figure 4.4(a). In Figure 4.8(a), we report the total execution
time of the motif, compressibility and similarity techniques against different values of
the Phase 1 threshold, tc. Phase 1 took 45 sec; this execution time is not affected by tc
because all the pairs of entities should be compared anyway and then filtered based
on tc’s selected value. For each value of tc, the candidates are passed to the accurate
matching phase to produce the final matching results.
Fig. 4.8.: Behavior linkage performance. (a) Candidate generation effectiveness. (b) Scalability.
The time spent in the whole matching process decreases as tc increases because
the number of produced candidates drops dramatically. This was illustrated in Figure
4.4(a) in terms of the percentage reduction in the number of candidates. However,
high values of tc result in many false negatives. As mentioned earlier, values around
tc = 0.2 produce good quality candidates.
When comparing the performance of the accurate matching techniques at tc =
0.2, the compressibility technique outperforms the motif technique by a factor of about 3. This
is because compressibility does not require scanning the
data many times, while the motif technique uses an expensive iterative statistical
method. The compressibility technique is thus more attractive for very large logs.
The similarity technique requires less time than the motif, because the motif computes
for each candidate pair 3 statistical models (two for the original two entities and one
for the resulting merged entity), while the similarity computes models for only two
entities.
Scalability : This experiment analyzes the scalability of the behavior linkage
approach and compares the two cases of using and not using the candidate generation phase.
The evaluation was conducted using a sequential implementation of the techniques
(i.e., no parallelization was introduced). In Figure 4.8(b), we report the overall linkage
time for the three behavior matching techniques when using Phase 1 (motif, compress
and sim) and without Phase 1 (motif-nP1, compress-nP1 and sim-nP1). When using
Phase 1, we set tc = 0.2.
The behavior matching techniques require expensive computations and scale
poorly without the help of the candidate generation phase, which resulted in around
two orders of magnitude speedup for the motif technique. The processing time is
governed by the generated number of candidates using the threshold tc as discussed
in the previous experiment.
For very large scale data processing, generating the candidates can benefit from
standard database join performance. Moreover, the computations required for each
candidate pair are independent of any other pair's computations and hence can be
easily parallelized. Therefore, in a parallel environment, all the behavior matching
computations can be carried out concurrently.
4.6 Summary
In this chapter, we presented a technique to indirectly involve users in the data
cleaning task. In particular, we presented a novel approach for record linkage that uses
entity behavior extracted from transactions logs. When matching two entities, we
measure the gain in recognizing a behavior in their merged logs. We proposed two
different techniques for behavior recognition: a statistical modeling technique and a
more computationally efficient technique that is based on information theory. To im-
prove efficiency, we introduced a quick candidate generation phase. Our experiments
demonstrate the high quality and performance of our approach.
5. HOLISTIC MATCHING WITH WEB TABLES FOR
ENTITY AUGMENTATION AND FINDING MISSING
VALUES
This chapter introduces an approach to leverage the WWW for a data cleaning task.
In particular, we use extracted web tables in the task of finding missing values in a
database. We assume that we have a table with a list of entities (we call it the query
table) and we want to find the missing values of some attributes of these entities.
The approach requires finding the web tables that match the query table and then
using the matched web tables to aggregate the missing values.
This chapter proposes a novel approach for matching web tables with the query
table by modeling the problem as topic-sensitive PageRank, where the query table
defines a topic over the web tables. We also propose a system architecture that performs
most of the “heavy lifting” in a preprocessing step, using MapReduce to produce a
set of indexes, so that we achieve fast response times.
The chapter is organized as follows: We start with a motivating example for our
approach in Section 5.1. Then, we present our holistic matching framework in Sec-
tion 5.2. Section 5.3 describes our system architecture, and Section 5.4 discusses how
we build the semantic matching graph among web tables. In Section 5.7, we evaluate
our approach, and finally, we summarize the chapter in Section 5.8.
5.1 Introduction
The Web contains a vast corpus of HTML tables. In this chapter, we focus
on one class of HTML tables: entity-attribute tables (also referred to as relational
tables [19,20] and 2-dimensional tables [21]). Such a table contains values of multiple
entities on multiple attributes, each row corresponding to an entity and each column
Fig. 5.1.: APIs of the core operations. (a) Augmentation By Attribute Name. (b) Augmentation By Example. The input (query) table lists camera models (S80, A10, GX-1S, T1460) with an empty Brand column; the output table fills in the brands (Nikon, Canon, Samsung, Benq).
corresponding to an attribute. Cafarella et al. reported 154M such tables from a
snapshot of Google’s crawl in 2008; we extracted 573M such tables from a recent
crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables simply as
web tables.
Consider an enterprise database with a table about companies where all
(or most) of the contact information is missing, or consider a product database
with a table about digital cameras. In the cameras table, the camera model name
is provided, but other attributes such as brand, resolution, price and optical
zoom have missing values. We call these attributes augmenting attributes and the
process of finding the missing attribute values entity augmentation. Gathering
information about the entities is a labor-intensive task. We propose to automate
this task using the extracted web tables.
Such augmentation would be difficult to perform using an enterprise database or
an ontology because the entities can be from any arbitrary domain. Today, users try
to manually find the web sources containing this information and assemble the values.
Assuming that this information is available, albeit scattered, in various web tables,
we can save a lot of time and effort if we can perform this operation automatically.
To find missing values, we provide two core operations using web tables.
The first operation is called “Augmentation By Attribute Name” (ABA), where
we have a list of entities and an attribute name to be augmented using web tables.
Figure 5.1(a) shows example input and output for this operation applied to camera
model entities with one augmenting attribute (brand). The second operation is called
“Augmentation By Example” (ABE). It is a variant of ABA, where we provide the
values on the augmenting attribute(s) for a few entities instead of providing the name
of the augmenting attribute(s). Figure 5.1(b) shows example input and output for this
operation applied to camera model entities and one augmenting attribute (brand).
The requirements for these core operations are: (i) high precision (#corraug/#aug) and
high coverage (#aug/#entity), where #corraug, #aug and #entity denote the number of
entities correctly augmented, the number of entities augmented and the total number of
entities, respectively; (ii) fast (ideally interactive) response times; and (iii) applicability
to entities of any arbitrary domain. The focus of this chapter is to perform these
operations using web tables such that the above requirements are satisfied.
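These two measures can be computed as in the following sketch; the dictionaries and function name are illustrative, not part of the system:

```python
def augmentation_quality(augmented, truth):
    """Precision = #corraug / #aug and coverage = #aug / #entity.
    `augmented` maps entity -> predicted value (entities with no
    prediction are absent); `truth` maps every query entity to its
    correct value."""
    correct = sum(1 for e, v in augmented.items() if truth.get(e) == v)
    precision = correct / len(augmented) if augmented else 0.0
    coverage = len(augmented) / len(truth)
    return precision, coverage
```

For instance, a run that augments 3 of 4 entities but gets only 1 value right has precision 1/3 and coverage 3/4.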
Baseline Technique: We present the baseline technique and our insights in the
context of the ABA operation; they apply to ABE as discussed in Section 5.5. For
simplicity, we consider only one augmenting attribute. As shown in Figure 5.1(a),
the input can be viewed as a binary relation with the first column corresponding to
the entity name and the second corresponding to the augmenting attribute. The first
column is populated with the names of entities to be augmented while the second
column is empty. We refer to this table as the query table (or simply the query).
The baseline technique first identifies web tables that semantically “match” the
query table using schema matching techniques (we consider simple 1:1 mappings
only) [55]. Subsequently, we look each entity up in those web tables to obtain its value
on the augmenting attribute. The state-of-the-art entity augmentation technique,
namely Octopus, implements a variant of this technique using the search engine API
[20].
Fig. 5.2.: ABA operation using web tables
Example 5.1.1 Consider the query table Q in Figure 5.2. For simplicity, assume
that, like the query table, all the web tables are entity-attribute binary (EAB) relations
with the first column corresponding to the entity name and the second to an attribute
of the entity. Note that for both the query table and web table the first column is
approximately the key column. Using traditional schema matching techniques, a web
table matches Q iff (i) the data values in its first column overlap with those in the first
column of Q and (ii) the name of its second column is identical to that of the augmenting
attribute. We refer to such matches as “direct matches” and the approach as the “direct
match approach” (DMA). In Figure 5.2, only web tables T1, T2 and T3 directly match
with Q (shown using solid arrows). A score can be associated with each direct match
based on the degree of value overlap and degree of column name match; such scores are
shown in Figure 5.2. We then look the entities up in T1, T2 and T3. For S80, both T1
and T3 contain it, but the values are different (Nikon and Benq, respectively). We can
either choose arbitrarily or choose the value from the web table with the higher score,
i.e., Benq from T3. For A10, we can choose either Canon from T2 or Innostream
from T3 (they have equal scores). For GX−1S, we get Samsung. We fail to augment
T1460 as none of the matched tables contains that entity.
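A simplified sketch of a DMA score along the lines of this example; the real scores combine the degree of value overlap with the degree of column-name match, whereas this illustrative function gates on an exact attribute-name match:

```python
def direct_match_score(query_entities, query_attr, table_keys, table_attr_name):
    """DMA score sketch: fraction of query entities found in the web
    table's key column, returned only when the table's second column
    name matches the augmenting attribute name."""
    if table_attr_name.lower() != query_attr.lower():
        return 0.0
    query = set(query_entities)
    overlap = len(query & set(table_keys))
    return overlap / len(query)
```

A table covering half the query entities under the right attribute name scores 0.5; a table with a different attribute name scores 0 regardless of overlap.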
DMA suffers from two problems:
(i) Low precision: In the above example, T3 contains models and brands of cell
phones, not cameras. The names of some of the cell phone models in T3 are identical
to those of the camera models in the query table; hence, T3 gets a high score. This
results in 2 (out of 3) wrong augmentations: S80 and A10 (assuming we choose
Innostream from T3 for A10). Hence, the precision is 33%. Such ambiguity of entity
names exists in all domains, as validated by our experiments. Note that this can be
mitigated by raising the “matching threshold,” but this leads to poor coverage.
(ii) Low coverage: In the above example, we fail to augment T1460. Hence, the
coverage is 75%. This number is much lower in practice, especially for tail domains.
For example, the Octopus system (which implements a variant of DMA) reports
a coverage of 33%. This primarily happens because tables that can provide the
desired values either do not have column names or use a column name different from the
augmenting attribute name provided by the user.
One way to address the coverage issue is to use synonyms of the augmenting
attribute [53, 82]. Traditionally, schema-matchers have used hand-crafted synonyms;
this is not feasible in our setting where the entities can be from any arbitrary domain.
Automatically generating attribute synonyms for arbitrary domains, as proposed in
[19], typically results in poor-quality synonyms. Our experiments show that these are
unusable without manual intervention.
Main Insights and Contributions: Our key insight is that many tables indirectly
match the query table, i.e., via other web tables. These tables, in conjunction with the
directly matching ones, can improve both coverage and precision. We first consider
coverage. Observe that in Figure 5.2, table T4 contains the desired attribute value of
T1460 (Benq), but we cannot “reach” it using a direct match. Using schema matching
techniques, we can find that T4 matches with T1 (i.e., there is 1:1 mapping between
the two attributes of the two relations) as well as T2 (as it has 2 records in common
with T1 and 1 in common with T2). Such schema matches among web tables are
denoted by dashed arrows; each such match has a score representing the degree of
match. Since T1 and/or T2 (approximately) matches with Q (using DMA) and T4
(approximately) matches with T1 and T2 (using schema matching among web tables),
we can conclude T4 (approximately) matches with Q. We refer to T4 as an indirectly
matching table; using it, we can correctly augment T1460. This improves coverage
from 75% to 100%.
Many of the indirectly matching tables are spurious matches; using these tables
to predict values would result in wrong predictions. The challenge is to be robust
to such spurious matches. We address this challenge in two ways. First, we perform
holistic matching. We observe that truly matching tables match with each other and
with the directly matching tables, either directly or indirectly while spurious ones do
not. For example, T1, T2 and T4 match directly with each other, while T3 only matches
weakly with T2. If we compute the overall matching score of a table by aggregating
the direct match as well as all indirect matches, the true matching tables will get
higher scores; we refer to this as holistic matching1. In the above example, T1, T2
and T4 will get higher scores compared with T3; this leads to correct augmentations
for S80 and A10, resulting in a precision of 100% (up from 33%). Second, for each
entity, we obtain predictions from multiple matched tables and “aggregate” them; we
then select the “top” one (or k) value(s) as the final predicted value(s).
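The per-entity aggregation step can be sketched as a weighted vote over the candidate values; the function name and the score-summing rule are illustrative assumptions, not the system's exact aggregation:

```python
def aggregate_predictions(predictions, k=1):
    """Weighted-vote aggregation: `predictions` is a list of
    (value, table_score) pairs for one entity, collected from all
    matching web tables; return the k values with the highest total
    score."""
    totals = {}
    for value, score in predictions:
        totals[value] = totals.get(value, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]
```

A value supported by several well-scored tables accumulates more weight than a value from a single spurious table, which is what makes the aggregation robust.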
This gives rise to additional technical challenges: (i) We need to compute schema
matches between pairs of web tables; we refer to the result as the schema matching among
web tables (SMW) graph. How do we build an accurate SMW graph over 573M ×
573M pairs of tables? (ii) How do we model the holistic matching? The model
should take into account the scores associated with the edges in the SMW graph as
1 This is different from holistic matching proposed in [83].
well as those associated with the direct matches. (iii) How do we augment the entities
efficiently at query time?
We have built the InfoGather system based on the above insights. Our contri-
butions can be summarized as follows:
• We develop a novel holistic matching framework based on topic-sensitive PageRank (TSP) over the SMW graph. We argue that by considering the query table
as a topic and web tables as documents, we can efficiently model the holistic
matching as TSP (details are in Section 5.2.4). To the best of our knowledge,
this is the first work to propose holistic matching with web tables.
• We present a novel architecture for the InfoGather system that leverages
preprocessing in MapReduce to achieve extremely fast (interactive) response
times at query time. Our architecture overcomes the limitations of the prior
architecture (viz., Octopus) that uses the search API: its inability to perform
indirect/holistic matches and its high response times.
• We present a machine learning-based technique for building the SMW graph.
Our key insight is that the text surrounding the web tables is important in
determining whether two web tables match or not. We propose a novel set
of features that leverage this insight. Furthermore, we develop MapReduce
techniques to compute these (pairwise) features that scale to 573M tables.
Finally, we propose a novel approach to automatically generate training data
for this learning task; this liberates the system designer from manually producing
labeled data.
• We perform extensive experiments on six real-life query datasets and 573M
web tables. Our experiments show that our holistic matching framework has
significantly higher precision and coverage compared with both the direct matching
approach and the state-of-the-art entity augmentation technique, Octopus.
Furthermore, our technique has response times four orders of magnitude faster
than Octopus.
5.2 Holistic Matching Framework
We present the data model, the general augmentation framework and its two
specializations: direct matching and holistic matching frameworks. We present them
in the context of the ABA operation. How we leverage these frameworks for the ABE
operation is discussed in Section 5.5.
5.2.1 Data Model
For the purpose of exposition, we assume that the query table is an entity-attribute
binary (EAB) relation, i.e., a query table Q is of the form Q(K,A), where K denotes
the entity name attribute and A is the augmenting attribute. Since Q.K is approx-
imately the key attribute, we refer to it as the query table key attribute and the
entities as keys. The key column is populated while the augmenting attribute column
is empty. An example of the query table satisfying the above properties is shown in
Figure 5.2.
We assume that all web tables are EAB relations as well. For each web table
T ∈ T , we have the following: (1) the EAB relation TR(K,B) where K denotes the
entity name attribute and B is an attribute of the entity; as in the query table, since
T.K is approximately the key attribute, we refer to it as the web table key attribute,
(2) the url TU of the web page from which it was extracted, and (3) its context TC
(i.e., the text surrounding the table) in the web page from which it was extracted. For
simplicity, we denote TR(K,B) as T (K,B) when it is clear from the context. Figure
5.2 shows four web tables (T1,T2,T3,T4) satisfying the EAB property.
The ABA problem can be stated as follows.
Definition 5.2.1 Augmentation By Attribute Name (ABA): Given a query table
Q(K,A) and a set of web tables ⟨T (K,B), TU , TC⟩ ∈ T , predict the value of each
query record q ∈ Q on attribute A.
In practice, not all web tables are EAB relations; we show how our framework can
be used for general, n-ary web tables in Section 5.6. Furthermore, the query table
can have more than one augmenting attribute; we assume that those attributes are
independent and perform predictions for one attribute at a time.
5.2.2 General Augmentation Framework
Our augmentation framework consists of two main steps: First, identify web tables
that “match” with the query table. Second, use each matched web table to provide
value predictions for the particular keys that happen to overlap between the query
and the web table; then aggregate these predictions and pick the top value as the
final predicted value. We describe the two steps in further detail.
• Identify Matching Tables: Intuitively, a web table T (K,B) matches the
query table Q(K,A) if Q.K and T.K refer to the same type of entities and Q.A
and T.B refer to the same attribute of the entities. In this work, we consider
simple 1:1 mappings only. Each web table T will be assigned a score S(Q, T )
representing the matching score to the query table Q. Since Q is fixed, we omit
Q from the notation and simply denote it as S(T ). There are many ways to
obtain the matching scores between the query table and web tables; we consider
two such ways in the next two subsections.
• Predict Values: For each record q ∈ Q, we predict the value q[Q.A] of record
q on attribute Q.A from the matching web tables. This is done by joining
the query table Q(K,A) with each matched web table T (K,B) on the key
attribute K. If there exists a record t ∈ T such that q[Q.K] ≈ t[T.K] (where ≈
denotes either exact or approximate equality of values), then we say that the
web table T predicted the value v = t[T.B] for q[Q.A] with a prediction score
ST (v) = S(T ) and return (v, ST (v)).
After processing all the matched tables, we end up with a set P_q = {(x_1, S_{T_1}(x_1)), (x_2, S_{T_2}(x_2)), . . .} of predicted values for q[Q.A] along with their corresponding prediction scores. We then perform fuzzy grouping [84] on the x_i's to get the groups G_q = {g_1, g_2, . . .}, such that ∀ x_i ∈ g_k, x_i ≈ v_k, where v_k is the centroid or representative of group g_k. We compute the final prediction score for each group representative v by aggregating the prediction scores of the group's members as follows:

S(v) = \mathcal{F}_{(x_i, S_{T_i}(x_i)) \in P_q \mid x_i \approx v} \; S_{T_i}(x_i)    (5.1)

where \mathcal{F} is an aggregation function. Any aggregation function such as sum or max
can be used in this framework.
The final predicted value for q[Q.A] is the one with the highest final prediction
score:
q[Q.A] = \arg\max_{v} S(v)    (5.2)
If the goal is to augment k values for an entity on an attribute (e.g., the entity is a
musical band and the goal is to augment it with all its albums), we simply pick the
k values with the highest final prediction scores.
Example 5.2.1 Consider the example in Figure 5.2. Using the table matching scores
shown, for the query record S80, Pq = {(Nikon, 0.25), (Benq, 0.5)} (predicted by
tables T1 and T3 respectively). The final predicted values are Nikon and Benq with
scores 0.25 and 0.5 respectively, so the predicted value is Benq.
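The aggregation of Eqs. 5.1 and 5.2 can be sketched as follows. This is a simplified illustration (the function name is ours): exact string equality stands in for fuzzy grouping [84], and sum is used as the aggregation function F.

```python
from collections import defaultdict

def predict_value(predictions, aggregate=sum):
    """Aggregate (value, score) predictions from matched web tables and
    pick the top value (Eqs. 5.1 and 5.2). Exact equality stands in for
    the fuzzy grouping step."""
    groups = defaultdict(list)          # group predictions by value
    for value, score in predictions:
        groups[value].append(score)
    scores = {v: aggregate(s) for v, s in groups.items()}
    best = max(scores, key=scores.get)  # argmax of Eq. 5.2
    return best, scores

# Example 5.2.1: predictions for query record S80
best, scores = predict_value([("Nikon", 0.25), ("Benq", 0.5)])
print(best)  # Benq
```

With max as the aggregation function the same call reproduces the example as well, since each value is predicted by a single table.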
5.2.3 Direct Match Approach
One way to compute the matching web tables and their scores is the direct match
approach (DMA) discussed in Section 5.1. The prediction step is identical to that in
the general augmentation framework. Using traditional schema matching techniques,
DMA considers a web table T to match with the query table Q iff (i) data values in
T.K overlap with those in Q.K and (ii) the attribute name T.B matches Q.A (denoted
by T.B ≈ Q.A). DMA computes the matching score S(T) between Q and T, denoted
as S_DMA(T), as follows:

S_{DMA}(T) = \begin{cases} \dfrac{|T \cap_K Q|}{\min(|Q|, |T|)} & \text{if } Q.A \approx T.B \\ 0 & \text{otherwise} \end{cases}    (5.3)

where |T ∩_K Q| = |{t | t ∈ T and ∃ q ∈ Q s.t. t[T.K] ≈ q[Q.K]}|. For example, in
Figure 5.2, the scores for T1, T2 and T3 are 1/4, 2/4 and 2/4 respectively, as they have 1, 2
and 2 matching keys respectively, min(|Q|, |T|) = 4 and Q.A ≈ T.B; the score for T4
is 0 because Q.A ≉ T.B.
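Eq. 5.3 can be computed as below; the key sets are hypothetical stand-ins for the tables of Figure 5.2, and exact equality stands in for ≈.

```python
def dma_score(q_keys, t_keys, attr_match):
    """Direct-match score S_DMA(T) of Eq. 5.3.
    q_keys, t_keys: sets of key values of Q and T;
    attr_match: whether Q.A matches T.B."""
    if not attr_match:                  # Q.A does not match T.B
        return 0.0
    overlap = len(q_keys & t_keys)      # |T ∩_K Q| under exact equality
    return overlap / min(len(q_keys), len(t_keys))

q  = {"S80", "A10", "DSC W570", "Optio E60"}   # |Q| = 4
t1 = {"S80", "k1", "k2", "k3"}                 # shares 1 key with Q
print(dma_score(q, t1, attr_match=True))       # 0.25
print(dma_score(q, t1, attr_match=False))      # 0.0
```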
5.2.4 Holistic Match Approach
To overcome the limitations of the DMA approach as outlined in Section 5.1,
we study the holistic approach to compute matching tables and their scores. The
prediction step remains the same as above. We model the holistic matching using
TSP. We start by reviewing the definitions of personalized pagerank (PPR) and TSP;
and then make the link to our problem in Section 5.2.4.
Preliminaries: Personalized and Topic Sensitive Pagerank
Consider a weighted, directed graph G(V,E). We denote the weight on an edge
(u, v) ∈ E with αu,v. Pagerank is the stationary distribution of a random walk on G
that at each step, with a probability ϵ, usually called the teleport probability, jumps
to a random node, and with probability (1− ϵ) follows a random outgoing edge from
the current node. Personalized Pagerank (PPR) is the same as Pagerank, except all
the random jumps are done back to the same node, denoted as the “source” node,
for which we are personalizing the Pagerank.
Formally, the PPR of a node v, with respect to the source node u, denoted by
πu(v), is defined as the solution of the following equation:

\pi_u(v) = \epsilon \, \delta_u(v) + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w) \, \alpha_{w,v}    (5.4)

where δu(v) = 1 iff u = v, and 0 otherwise. The PPR values πu(v) of all nodes v ∈ V
with respect to u is referred to as the PPR vector of u.
A “topic” is defined as a preference vector β inducing a probability distribution
over V . We denote the value of β for node v ∈ V as βv. Topic sensitive pagerank
(TSP) is the same as Pagerank, except all the random jumps are done back to one of
the nodes u with βu > 0, chosen with probability βu. Formally, the TSP of a node v
for a topic β is defined as the solution of the following equation [85]:

\pi_\beta(v) = \epsilon \, \beta_v + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_\beta(w) \, \alpha_{w,v}    (5.5)
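Eq. 5.5 can be solved by power iteration (Eq. 5.4 is the special case where β places all mass on u). A minimal sketch on a toy graph, assuming the outgoing edge weights of each node are normalized to sum to 1:

```python
def tsp(edges, beta, eps=0.15, iters=200):
    """Topic-sensitive PageRank by iterating Eq. 5.5.
    edges: u -> {v: alpha_uv}, outgoing weights summing to 1;
    beta: preference vector over nodes."""
    nodes = set(beta) | set(edges)
    for nbrs in edges.values():
        nodes |= set(nbrs)
    pi = {v: beta.get(v, 0.0) for v in nodes}   # start from beta
    for _ in range(iters):
        nxt = {v: eps * beta.get(v, 0.0) for v in nodes}
        for w, nbrs in edges.items():
            for v, a in nbrs.items():
                nxt[v] += (1 - eps) * pi[w] * a
        pi = nxt
    return pi

# Two tables that match each other; the topic places all mass on "a"
pi = tsp({"a": {"b": 1.0}, "b": {"a": 1.0}}, beta={"a": 1.0})
# converges to pi["a"] ≈ 0.54, pi["b"] ≈ 0.46
```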
Modeling Holistic Matching using TSP
First, we draw the connection between the PPR of a node with respect to a source
node and the holistic match between two web tables. Then, we show how the holistic
matching between the query table and a web table can be modeled with TSP.
Consider two nodes u and v of any weighted, directed graph G(V,E). The PPR
πu(v) of v with respect to u represents the holistic relationship of v to u where E
represents the direct, pairwise relationships, i.e., it considers all the paths, direct as
well as indirect, from u to v and “aggregates” their scores to compute the overall
score. PPR has been applied to different types of relationships. When the direct,
pairwise relationships are hyperlinks between web pages, πu(v) is the holistic impor-
tance conferral (via hyperlinking) of v from u; when the direct, pairwise relationships
are direct friendships in a social network, πu(v) is the holistic friendship of v from u.
In this work, we propose to use PPR to compute the holistic semantic match
between two web tables. Therefore, we build the weighted graph G(V,E), where each
node v ∈ V corresponds to a web table and each edge (u, v) ∈ E represents the direct
pair-wise match (using schema matching) between the web tables corresponding to u
and v. Each edge (u, v) ∈ E has a weight αu,v which represents the degree of match
between the web tables u and v (provided by the schema matching technique). We
discuss building this graph and computing the weights in detail in Section 5.4.1. We
refer to this graph as the schema matching graph among web tables (SMW graph).
Thus, the PPR πu(v) of v with respect to u over the SMW graph models the holistic
semantic match of v to u.
If the query table Q is identical to the web table corresponding to node
u, then the holistic match score SHol(T) between Q and a web table T is πu(v),
where v is the node corresponding to T . However, the query table Q is typically not
identical to a web table. In this case, how can we model the holistic match of a web
table T to Q? Our key insight is to consider Q as a “topic” and model the match as
the TSP of the node v corresponding to T to the topic. In the web context where
the relationship is that of importance conferral, the most important pages on a topic
are used to model the topic (the ones included under that topic in Open Directory
Project); in our context where the relationship is semantic match, the top matching
tables should be used to model the topic of Q. We use the set of web tables S (referred
to as seed tables) that directly match with Q, i.e., S = {T | SDMA(T) > 0}, to model
it. Furthermore, we use the direct matching scores SDMA(T), T ∈ S, as the preference
values β:
\beta_v = \begin{cases} \dfrac{S_{DMA}(T)}{\sum_{T \in S} S_{DMA}(T)} & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases}    (5.6)

where v corresponds to T. For example, βv is 0.25/1.25, 0.5/1.25 and 0.5/1.25 for T1, T2 and
T3 respectively, and 0 for all other tables. Just like the TSP score of a web page
representing the holistically computed importance of a page to the topic, πβ(v) over
the SMW graph models the holistic semantic match of v to Q. Thus, we propose to
use SHol(T ) = πβ(v) where v corresponds to T .
5.3 System Architecture
Suppose the SMW graph G has been built upfront. The naive way to compute
the holistic matching score SHol(T ) for each web table is to run the TSP computation
algorithm over G at augmentation time. This results in prohibitively high response
times. We leverage the following result to overcome this problem:
Theorem 5.3.1 (Linearity [85]) For any preference vector β, the following equality
holds:
\pi_\beta(v) = \sum_{u \in V} \beta_u \, \pi_u(v)    (5.7)

If we can precompute the PPR πu(v) of every node v with respect to every other node
u (referred to as Full Personalized Pagerank (FPPR) computation) in the SMW graph,
we can compute the holistic matching score for any query table πβ(v) efficiently using
Eq. 5.7. This leads to very fast response times at query time.
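At query time, Eq. 5.7 turns the TSP computation into a sparse weighted sum of the precomputed PPR vectors of the seed tables. A sketch of this lookup-and-combine step, with hypothetical table ids and scores:

```python
def tsp_from_ppr(beta, t2ppv):
    """Query-time TSP scores via the linearity theorem (Eq. 5.7):
    pi_beta(v) = sum_u beta_u * pi_u(v). Only seed tables (beta_u > 0)
    require a lookup; only non-zero PPR entries are stored."""
    scores = {}
    for u, b in beta.items():
        if b <= 0.0:
            continue
        for v, p in t2ppv[u].items():   # stored PPR vector of seed u
            scores[v] = scores.get(v, 0.0) + b * p
    return scores

# Hypothetical preference vector and stored PPR vectors
beta  = {"T1": 0.2, "T2": 0.8}
t2ppv = {"T1": {"T1": 0.6, "T4": 0.4},
         "T2": {"T2": 0.5, "T4": 0.5}}
print(tsp_from_ppr(beta, t2ppv))
```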
The InfoGather architecture has two components as shown in Figure 5.3. The
first component performs offline preprocessing of the web crawl to extract the web
tables, build the SMW graph and compute the FPPR. For all these offline steps, our
techniques need to scale to hundreds of millions of tables. We propose to leverage the
MapReduce framework for this purpose. The second component concerns the query
time processing, where we compute the TSP scores for the web tables and aggregate
the predictions from the web tables. In the following, we give more details about each
component:
Preprocessing: There are four main processing steps in this component:
• P1: Extract the HTML web tables from the web crawl and use a classifier to
distinguish the entity-attribute tables from the other types of web tables (e.g.,
formatting tables, attribute-value tables, etc.). Our approach is similar to the
one proposed in [86]; we do not discuss this step further as it is not the focus
of this work.
• P2: Index the web tables to facilitate faster identification of the seed tables. We
use three indexes: (i) An index on the web tables’ key attribute values (WIK).
Given a query table Q, WIK(Q) returns the set of web tables that overlap
with Q on at least one of the keys. (ii) An index on the web tables' complete
records, i.e., key and value combined (WIKV). WIKV(Q) returns the set
of web tables that contain at least one record from Q. (iii) An index on the
web tables' attribute names (WIA), such that WIA(Q) returns the set of web
tables {T | T.B ≈ Q.A}.
• P3: Build the SMW graph based on schema matching techniques as we describe
in Section 5.4.1.
• P4: Compute the FPPR and store the PPR vector for each web table (we store
only the non-zero entries). We refer to this as the T2PPV index. For any web
table T , T2PPV(T ) returns the PPR vector of T . We discuss the technique we
use to compute the FPPR in Section 5.4.2.
The indexes (WIK, WIKV, WIA and T2PPV) may either be disk-resident or
reside in memory for faster access.
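A toy, in-memory sketch of the three indexes follows; the real indexes are built offline over 573M tables, and the shapes here (tables as key-to-value dicts, exact matches) are simplifying assumptions.

```python
from collections import defaultdict

def build_indexes(tables):
    """Build toy WIK, WIKV and WIA indexes.
    tables: table id -> (records as key->value dict, attribute name B)."""
    wik, wikv, wia = defaultdict(set), defaultdict(set), defaultdict(set)
    for tid, (records, attr) in tables.items():
        wia[attr].add(tid)                 # attribute name -> tables
        for k, v in records.items():
            wik[k].add(tid)                # key value -> tables
            wikv[(k, v)].add(tid)          # full record -> tables
    return wik, wikv, wia

# Illustrative contents loosely based on Figure 5.2
tables = {"T1": ({"S80": "Nikon", "DSC W570": "Sony"}, "brand"),
          "T3": ({"A10": "Benq"}, "brand")}
wik, wikv, wia = build_indexes(tables)
print(wik["S80"])             # {'T1'}
print(wikv[("A10", "Benq")])  # {'T3'}
```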
Query Time Processing: The query time processing can be abstracted in three
main steps. The details of each step depend on the operation. We provide those
details for the ABE operation in Section 5.5.
• Q1: Identify the seed tables: We leverage the WIK, WIKV and WIA indexes to
identify the seed tables and compute their DMA scores.
• Q2: Compute the TSP scores: We compute the preference vector β by plugging
the DMA matching scores in Eq. 5.6. According to Theorem 5.3.1, we can
use β and the stored PPR vectors of each table to compute the TSP score
for each web table. Note that only the seed tables have non-zero entries in β.
Accordingly, we need to retrieve the PPR vectors of only the seed tables using
the T2PPV index. Furthermore, we do not need to compute TSP scores for all
web tables in the retrieved PPR vectors. We need to compute it only for the
tables that could be used in the aggregation step: the ones that have at least one
key overlapping with the query table. We refer to them as relevant tables. These
can be identified efficiently by invoking WIK(Q). These two optimizations are
important to compute the TSP scores efficiently.
[Figure: architecture diagram. Pre-processing: web crawl → extract and identify relational web tables → web table indexes (WIK, WIKV, WIA) → build web tables graph → FPPR → T2PPV. Query-time processing: query table → TSP → predictions.]

Fig. 5.3.: InfoGather System Architecture
• Q3: Aggregate and select values: In this step, we collect the predictions provided
by the relevant web tables T along with the scores SHol(T ). The predictions are
then processed, the scores are aggregated and the final predictions are selected
according to the operation.
5.4 Building the SMW Graph and Computing FPPR
This section discusses the major preprocessing steps of the web tables, namely,
building the SMW graph (P3) and computing the FPPR (P4).
5.4.1 Building the SMW Graph
First, we give details on how we match a pair of web tables and then address the
scalability challenges in building the SMW graph.
Matching Web Tables
In the SMW graph, there is an edge between a pair (T (K,B), T ′(K ′, B′)) of web
tables if T matches with T′, i.e., T.K and T′.K′ refer to the same type of entities and
T.B and T′.B′ refer to the same attribute of those entities. Our problem can be
formally stated as follows:
Definition 5.4.1 Pairwise web tables matching problem: Given two web tables
⟨T(K,B), TU, TC⟩ and ⟨T′(K′,B′), T′U, T′C⟩, determine whether T matches with T′ and
compute the score of the mapping T.K → T′.K′, T.B → T′.B′.
In schema matching [53, 55], the problem of matching two schemas S and S ′ is
normally framed as follows: Given the two schemas, for each attribute A of S, find
the best corresponding attribute A′ of S ′, possibly with an associated matching score.
The problems are similar enough that the techniques used in the standard schema
matching problem can be used for ours as well. Schema matching techniques first
identify information about each element of each schema that is relevant to discovering
matches. For each pair of elements, one from each schema, they compute a set of
“feature scores” where each feature score represents a match between the pair on a
different aspect. Finally, they combine those feature scores into a single score based
on which they decide whether the element pair matches or not. The combination
module can either use machine learning-based techniques or non-learning methods
[41,56]; we use machine learning-based techniques in this work.
Traditionally, the focus is on schema level features (e.g., attribute names match-
ing) and instance level features (e.g., attribute data values matching). Specifically for
web tables, [20] suggested using two specific features: (i) the average column-width
similarity and (ii) the similarity of the text of the table content without considering
the column and row structure.
But for web tables, it may not be sufficient to rely only on these traditional schema-
and instance-level features. For example, consider tables T2 and T3 in Figure 5.2.
At the schema level, they both share the same column names, and moreover, at the
instance level both share the Model A10 and the Brand Samsung. Despite all
these similarities, these web tables are not a match, because T2 is about cameras
and T3 is about cell phones. On the other hand, consider web tables T2 and T4.
They neither share schema level nor instance level similarities. However, both T2 and
T4 contain digital camera models with their brands and should have high matching
score. Furthermore, many web tables do not have column names [51]; this further
exacerbates the problem.
Our main insight is that there is additional information about the web tables that
can be leveraged to overcome the above limitations. We propose 4 novel feature scores
based on this insight:
• Context similarity: The context or the text around the web table in the web
page provides valuable information about the table. Suppose, the context for
T3 is “Mobile phones or cell phones, ranging from . . . ”, while that for T2 and T4
are “Ready made correction data for cameras and lenses” and “Camera Bags
Compatibility List” respectively. This indicates that T2 and T4 are probably
about cameras while T3 is about phones. Clearly, sharing the term ‘cameras’
indicates the similarity between T2 and T4. We capture this intuition using a
context similarity feature which is computed using the tf-idf cosine similarity of
the text around the table.
• Table-to-Context similarity: The context of a table may contain keywords
that overlap with values inside another web table. This provides evidence that
the web pages containing the tables are about similar subjects, and hence, the
tables may be about similar subjects as well. We capture this intuition using
table-to-context similarity feature, which is computed using the tf-idf cosine
similarity of the text around the first table and the text inside the second table.
• URL similarity: The URL of the web page containing the table can help
in matching with another table. Sometimes, a web site lists the records from
the same original large table in several web pages. For example a web site may
list the movies in several web pages by year, by first letter, etc. In this case,
matching the URLs of the web pages is a good signal in the matching of the
web tables. We capture this intuition using a URL similarity feature, computed
using cosine similarity of the URL terms.
• Tuples similarity: The web tables that we consider are EAB relations and the
correspondences between the attributes are frozen (T.K → T ′.K ′ and T.B →
T′.B′); we just need to determine the strength of the correspondence. Hence, the
number of tuples that overlap between the two tables is strong evidence for deciding
whether the tables match. Note that this is different from the instance-level feature,
which considers the data values of each attribute individually.
We use the above similarities as features in a classification model. Given the fea-
tures, the model predicts the match between two tables with a probability, which is
used as the weight on the edge between them. The set of features includes the newly
proposed features, namely, (1) context similarity, (2) table-to-context similarity,
(3) URL similarity, and (4) tuples similarity; in addition to the traditional schema-
and instance-level features, namely, (5) attribute name similarity, (6) column value
similarity, (7) table-to-table similarity as a bag of words, and (8) column width
similarity.
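Several of these features reduce to a tf-idf cosine between two bags of terms (e.g., the context similarity feature). A minimal sketch, assuming pre-tokenized text and a given idf table; the toy strings are loosely based on the running example:

```python
import math
from collections import Counter

def tfidf_cosine(terms_a, terms_b, idf):
    """tf-idf cosine similarity between two token lists, as used for the
    context, table-to-context, URL and table-to-table features."""
    wa = {t: c * idf.get(t, 0.0) for t, c in Counter(terms_a).items()}
    wb = {t: c * idf.get(t, 0.0) for t, c in Counter(terms_b).items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

ctx_t2 = "ready made correction data for cameras and lenses".split()
ctx_t4 = "camera bags compatibility list cameras".split()
idf = {"cameras": 2.0, "correction": 1.5, "bags": 1.5}  # toy idf values
print(tfidf_cosine(ctx_t2, ctx_t4, idf))  # > 0: both contexts share 'cameras'
```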
There are two major challenges in building the SMW graph: (i) computing the
pairwise features that scales to hundreds of millions of tables and, (ii) getting labeled
pairs of web tables to train a classifier. We address these challenges in the following
two subsections.
Scalable Computation of Pairwise Features
Note that we are computing these features for 573M × 573M web table pairs
and, obviously, we cannot do the cross product computation. Our key insight here
is that, for each of the mentioned features, the web table can be considered as a bag
of words (or a document). We can then leverage scalable techniques for computing
Table 5.1: Web table matching features as documents.

Feature name       Document
Context            Terms in the text around the web table, with idf weights
Table-to-Context   The table content as text and the context text, with idf weights
URL                The terms in the URL, with idf weights computed from the set of all URLs
Tuples             All distinct table rows (key-value pairs) form the terms of a document, with equal weights
Attribute names    The terms in the column names, with equal weights
Column values      All distinct values in a column form the terms of a document, with equal weights
Table-to-Table     The table content as text, with idf weights
pairwise document similarities over a large document collection. Table 5.1 describes
the mapping of a web table to a document for each feature.
We leverage the technique described in [87] to compute the document similarity
matrix of a large document set using MapReduce. The technique can be summarized
as follows: Each document d contains a set of terms and can be represented as a vector
Wd of term weights wt,d. The similarity between two documents is the inner product
of the term weights: sim(d_1, d_2) = \sum_{t \in d_1 \cup d_2} w_{t,d_1} \cdot w_{t,d_2}. The key observation here
is that a term t contributes to the similarity of two documents d1, d2 iff t ∈ d1
and t ∈ d2. If we have an inverted index I, we can easily get the documents I(t) that
contain a particular term t. For each pair of documents ⟨di, dj⟩ ∈ I(t) × I(t), sim(di, dj)
is incremented by w_{t,d_i} · w_{t,d_j}. By processing all the terms, we compute the
entire similarity matrix without the expensive cross-product computation.
This can be implemented directly as two MapReduce tasks: (1) Indexing : The
mapper processes each document d and emits for each term t ∈ d (key = t, value
= (d, wt,d)). The reducer outputs the term as the key and the list of documents
containing that key (key = t, value =I(t)). (2) Similarity computation: The mapper
processes each term with its list of documents, (t, I(t)), and emits for each pair of
documents ⟨di, dj⟩ ∈ I(t) × I(t) and i < j (key = ⟨di, dj⟩, value = wt,di · wt,dj).
Finally, the reducer does the summation to output sim(di, dj) (key = ⟨di, dj⟩,
value = sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}). For more efficiency, a df-cut is
used to eliminate terms with high document frequency [87].
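The two jobs can be simulated in memory to see why the inverted index avoids the cross product; this is only a sketch, as the real computation of [87] runs as MapReduce over hundreds of millions of documents.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarities(doc_weights):
    """Compute sim(d_i, d_j) for all document pairs that share a term.
    doc_weights: doc id -> {term: weight w_{t,d}}."""
    index = defaultdict(list)                  # job 1: indexing
    for d, terms in doc_weights.items():
        for t, w in terms.items():
            index[t].append((d, w))
    sim = defaultdict(float)                   # job 2: similarity
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            sim[(di, dj)] += wi * wj           # term-wise contribution
    return dict(sim)

docs = {"d1": {"x": 1.0, "y": 2.0}, "d2": {"y": 3.0}, "d3": {"z": 1.0}}
print(pairwise_similarities(docs))  # {('d1', 'd2'): 6.0}
```

Note that d3 shares no term with the others, so no pair involving it is ever generated.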
Getting Labeled Pairs of Web Tables
We mentioned earlier that we rely on a classification model to get the matching
score of two tables given their similarity feature vector. The challenge here is to
obtain labeled examples to train the classifier. One way is to use a human to manually
label a random set of pairs of web tables. However, this is painful and
time-consuming. We propose an automatic way to obtain labeled pairs of web tables.
To label a pair of web tables (T, T′) as a positive example, our hypothesis is that
T and T′ may not have records in common, but a third web table T′′ has some
records in common with each of T and T′ individually (we call it a labeling web table). For
example, consider tables T2 and T4 in Figure 5.2. T1 is found to be a labeling web
table for them: T1 overlaps with T2 on one record (DSC W570, Sony), and
overlaps with T4 on one record (Optio E60, Pentax).
We formalize our hypothesis as follows: A pair of tables Ti(Ki, Bi) and Tj(Kj, Bj)
is a positive example pair if there exists a web table TL(KL, BL) (the labeling web table)
such that (i) the sets of overlapping records satisfy |TL ∩ Ti| ≥ θ and |TL ∩ Tj| ≥ θ,
and (ii) for each record tL ∈ TL and each record ti ∈ Ti (resp. tj ∈ Tj), if
tL[KL] = ti[Ki] (resp. tL[KL] = tj[Kj]), then tL[BL] = ti[Bi] (resp. tL[BL] = tj[Bj]).
The second condition guarantees that if Ti shares
a key with TL, then the values of the other attribute must match. If we do not find
such a labeling web table, the web table pair is considered a negative example.
One might think that the labeling web table approach could itself be used to generate
all the pairwise semantic matches to build the SMW graph, but this is too expensive
to do for 573M × 573M pairs of web tables. However, using the labeling web
table approach to generate a few thousand examples to train a classifier is feasible.
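The labeling hypothesis can be sketched as follows; exact equality stands in for ≈, and the table contents (beyond the two overlapping records quoted above) are hypothetical.

```python
def overlap(ta, tb):
    """Number of full records (key, value) shared by two tables."""
    return sum(1 for k in ta.keys() & tb.keys() if ta[k] == tb[k])

def consistent(tl, t):
    """Condition (ii): on every shared key, the values must agree."""
    return all(tl[k] == t[k] for k in tl.keys() & t.keys())

def is_positive_pair(ti, tj, corpus, theta=1):
    """Label (ti, tj) positive iff some labeling table TL overlaps each of
    them in at least theta records and never contradicts either."""
    return any(
        overlap(tl, ti) >= theta and overlap(tl, tj) >= theta
        and consistent(tl, ti) and consistent(tl, tj)
        for tl in corpus if tl is not ti and tl is not tj)

# T1 labels the pair (T2, T4), as in Figure 5.2 (extra rows are invented)
t1 = {"DSC W570": "Sony", "Optio E60": "Pentax", "S80": "Nikon"}
t2 = {"DSC W570": "Sony", "A10": "Canon"}
t4 = {"Optio E60": "Pentax", "D3100": "Nikon"}
print(is_positive_pair(t2, t4, [t1, t2, t4]))  # True
```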
5.4.2 Computing FPPR on SMW Graph
Once the SMW graph is constructed, we compute the full personalized pagerank
matrix. There are two broad approaches to compute personalized pagerank. The
first approach is to use linear algebraic techniques, such as Power Iteration [88]. The
other approach is Monte Carlo, where the basic idea is to approximate Personalized
Pagerank by directly simulating the corresponding random walks and then estimating
the stationary distributions with the empirical distributions of the performed walks.
We use the recently proposed MapReduce algorithm to compute the FPPR [89].
It is based on the Monte Carlo approach. The basic idea is to very efficiently compute
single random walks of a given length starting at each node in the graph. Then these
random walks are used to efficiently compute the PPR vector for each node.
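The Monte Carlo idea can be sketched by simulating a long random walk with restarts for a given source node; this toy in-memory version only illustrates the estimator, whereas [89] computes all walks at scale in MapReduce.

```python
import random

def monte_carlo_ppr(edges, source, steps=20000, eps=0.15, seed=0):
    """Estimate the PPR vector of `source` from the visit frequencies of
    a random walk that restarts at `source` with probability eps.
    edges: node -> list of (neighbor, weight), weights summing to 1."""
    rng = random.Random(seed)
    visits = {}
    node = source
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        if rng.random() < eps or not edges.get(node):
            node = source                        # teleport back to source
        else:
            nbrs, weights = zip(*edges[node])
            node = rng.choices(nbrs, weights=weights)[0]
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

ppr = monte_carlo_ppr({"a": [("b", 1.0)], "b": [("a", 1.0)]}, "a")
# analytic answer for this graph: pi_a(a) ≈ 0.54, pi_a(b) ≈ 0.46
```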
5.5 Supporting Core Operations
We discuss how we support the core operations using our holistic matching frame-
work. Note that for each operation, we re-define the DMA score.
5.5.1 Augmentation-By-Attribute (ABA)
We have discussed the ABA operation already in Section 5.2. Here, we give
the details of the three query-time steps abstracted in Section 5.3. We present the
pseudocode for the ABA operation in Algorithm 5.1.
Algorithm 5.1 ABA(Query table Q(K,A))
1: ∀ q ∈ Q, Pq = {}
2: R = WIK(Q)
3: R = R ∩ WIA(Q)   {Relevant web tables.}
4: for all T ∈ R do
5:   for all q ∈ Q and t ∈ T, s.t. q[Q.K] ≈ t[T.K] do
6:     Pq = Pq ∪ {(v = t[T.B], ST(v) = S(T))}
7:   end for
8: end for
9: ∀ q ∈ Q, fuzzy group Pq to get Gq
10: for all q ∈ Q do
11:   ∀ g ∈ Gq, s.t. v = centroid(g), S(v) = F_{(xi, STi(xi)) ∈ Pq | xi ≈ v} STi(xi)
12: end for
13: ∀ q ∈ Q, q[Q.A] = argmax_v S(v)
• Q1: Identifying the seed tables: The seed tables for Q(K,A) are identified using
the WIK and WIA indexes such that a web table T (K,B) is considered if there
is at least one key overlap and Q.A ≈ T.B. The DMA scores are computed
using Eq. 5.3.
• Q2: Computing the tables' TSP scores: This step is identical to step Q2 in Section 5.3.
• Q3: Aggregating and processing values: This step is identical to the predict
values step in the augmentation framework discussed in Section 5.2.
5.5.2 Augmentation-By-Example (ABE)
This is a variation of the ABA operation. Instead of providing the augmenting
attribute name, the user provides the query table with some complete records as
examples, i.e., for some of the keys, she provides the values on the augmenting
attribute (e.g., Figure 5.1(b)).
Definition 5.5.1 Augmenting-By-Example (ABE): Given query table Q(K,A) =
Qc ∪ Qe, where Qc denotes the set of records {qc ∈ Q | qc[A] ≠ null} (referred to as
example complete records) and Qe denotes the set of records {qe ∈ Q | qe[A] = null}
(referred to as incomplete records), predict the value of each incomplete record qe ∈ Qe
on attribute A.
The three query-time steps for the ABE operation are identical to those of the ABA
operation, except for the way we identify the seed tables and compute the DMA scores.
DMA considers a web table T to match the query table Q iff the records in Qc
overlap with those in T. For example, in Figure 5.2, table T1 is considered a seed table
for the query table illustrated in Figure 5.1(b), because they overlap on the record
(S80, Nikon). Given the query table, we use the WIKV index to get the seed tables
efficiently.
Intuitively, a web table T should be assigned a high DMA score if, for each shared
key between T and Qc, the two tables agree on the value of the augmenting attribute
as well. Accordingly, we redefine the DMA matching score as the fraction of the
shared keys that agree on the value of the augmenting attribute:

S_{DMA}(T) = \frac{|Q_c \cap_{KV} T|}{|Q_c \cap_K T|}    (5.8)

where |Qc ∩KV T| denotes the number of overlapping records between the complete
records of the query table Q and the web table T, and |Qc ∩K T| denotes the
number of shared keys.
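Eq. 5.8 can be sketched directly; exact equality stands in for ≈, and the record contents are illustrative.

```python
def abe_dma_score(qc, t):
    """ABE direct-match score of Eq. 5.8: the fraction of keys shared by
    the complete query records Qc and web table T on which both tables
    agree about the augmenting attribute's value.
    qc, t: dicts mapping key -> value."""
    shared = qc.keys() & t.keys()                     # Qc ∩_K T
    if not shared:
        return 0.0
    agree = sum(1 for k in shared if qc[k] == t[k])   # Qc ∩_KV T
    return agree / len(shared)

qc = {"S80": "Nikon"}                      # example complete record
t1 = {"S80": "Nikon", "DSC W570": "Sony"}
print(abe_dma_score(qc, t1))  # 1.0
```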
5.6 Handling n-ary Web Tables
Throughout our discussion, we have assumed that the web tables are entity-attribute
binary (EAB) relations. This lets us work with a simpler graph that has a single
score between nodes, and enables us to model the problem as a TSP problem.
If we consider n-ary web tables and use a single score between nodes, a matching
score between the query table and a web table would not say which column of the web
table is the desired augmenting attribute.
[Figure: (a) histogram of the number of columns per web table; (b) statistics: 573M web tables, 3.09 columns on average, 26.54 rows on average.]

Fig. 5.4.: The distribution of the number of columns per web table and statistics
about the relational web tables
In practice, not all the tables on the web are binary relations. Fortunately, relational
tables on the web are meant for human consumption and usually have a subject
column [19, 51]. According to [19], there are effective heuristics to identify a
web table's subject: for example, using the web search log, where the subject column
name will have high search query hits (i.e., the subject column name appears in the
search query that hits the web page containing the web table); also, the subject
usually appears in the leftmost column. If the subject column can be identified,
then we split the table into several EAB relations, i.e., pairing the subject column
with each of the other columns yields a set of EAB relations. The main assumption
we make on the web table is that the subject appears in a single column; we do not
consider multiple columns as subjects.
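The subject-based split described above can be sketched as follows; the table representation (a list of column names plus row tuples) and the function name are assumptions made for illustration:

```python
def split_on_subject(columns, rows, subject_idx=0):
    # Pair the subject column with every other column, producing one
    # entity-attribute binary (EAB) relation per non-subject column.
    eab_relations = []
    for j, col in enumerate(columns):
        if j == subject_idx:
            continue
        schema = (columns[subject_idx], col)
        pairs = [(row[subject_idx], row[j]) for row in rows]
        eab_relations.append((schema, pairs))
    return eab_relations
```

An n-column table thus yields n − 1 EAB relations when the subject column is known; when it is not, every column pairing would have to be emitted instead, as discussed next.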
In this work, we do not make this assumption; in this case, we split an n-ary web
table into (n − 1)² EAB relations, where all possible pairs of columns are considered
EAB relations. Our study shows the feasibility of doing so. Figure 5.4(a) shows the
distribution of the number of columns per relational web table. The average is 3.1;
about 54% of the tables are binary and 70% are either binary or ternary relations.
5.7 Experimental Evaluation
We present an experimental evaluation of the techniques proposed in this chapter
for the ABA and ABE operations. The goals of the study are:
• To compare the holistic matching approach with DMA, DMA with attribute
synonyms, and the state-of-the-art approach, Octopus [20], in terms of precision
and coverage for the ABA operation
• To compare the holistic matching approach with DMA in terms of precision and
coverage for the ABE operation
• To study the sensitivity of the quality (precision and coverage) of the approaches
to “head” vs. “tail” query entities
• To study the sensitivity of quality to the number of example complete records
for the ABE operation
• To evaluate the (direct) impact of our novel features (context, table-to-context,
URL, and tuple similarities) on the quality of the SMW graph, as well as their
(indirect) impact on the quality of the ABA operation
• To evaluate the holistic approach in terms of query response times and compare
with Octopus
5.7.1 Experimental Setting
Implementation: We implemented the InfoGather system described in Fig-
ure 5.3. In the offline preprocessing step, we extracted 573M entity-attribute HTML
tables from a recent snapshot (July 2011) of the Microsoft Bing search engine; such
snapshots are available in the internal MapReduce clusters within Microsoft. We then
built the WIK, WIKV and WIA indexes, built the SMW graph, and computed the
T2PPV and T2Syn indexes. We performed all these steps in our MapReduce clusters as
Table 5.2: Query entity domains and augmenting attributes
Dataset name Entity (Key attribute) Augmenting attribute
Cameras Camera model Brand
Movies Movie Director
baseball Baseball team Player
albums Musical band Album
uk-pm UK parliament party Member of parliament
us-gov US state Governor
discussed in Section 5.3. To build the SMW graph, we used the solution described
in [87] with a df-cut of 99.9%. We stored the indexes (WIK, WIKV, WIA and
T2PPV) on a single machine for query processing. For this purpose, we used an
Intel x64 machine with eight 2.66GHz Intel Xeon processors and 32GB RAM, running
Windows 2008 Server.
Datasets: We conducted experiments on 6 datasets shown in Table 5.2. For
example, for the cameras dataset: the ABA operation augments the brand given a
set of camera model names and the string “brand”, and the ABE operation augments
the brands of a set of camera model names given a set of (model, brand) pairs. Toy
examples of inputs and outputs for this dataset are shown in Figure 5.1.
We chose 4 datasets (baseball, albums, uk-pm, us-gov) that were also used to
evaluate Octopus. We compiled the complete ground truth for these datasets by
manually identifying a knowledgebase and extracting the desired information from it.
For example, for baseball, we got the “all-time roster” for a randomly chosen set of
12 baseball teams from Wikipedia; for albums, we got all the albums for a randomly
chosen set of 14 bands from Freebase. We chose two additional datasets (cameras,
movies) for which we had complete ground truth (from the Microsoft Bing Shopping
product catalog and the IMDB database, respectively). One distinguishing characteristic
of these two datasets is that the augmenting attribute has a 1:1 relationship with
the key (as opposed to 1:n in the above four datasets). We generate a query table by
randomly selecting keys from the ground truth. For movies we use a query table
of 6,000 keys and for cameras 1,000 keys. All our results are averaged over 5 such query
tables. We use F = sum in Eq. 5.1 for all our experiments.
Measures: Since some of the datasets have 1:n relationships between the key and
the augmenting attribute, we generalize the precision and coverage measures defined in
Section 5.1. We first compute the precision and coverage for each key as
follows:
precision = (# values correctly predicted) / (# values predicted)

coverage = (# values predicted) / (# values in ground truth)
We average over all the keys in the query table Q to obtain the precision and coverage
for Q. Recall that if the ground truth has k values for a key on the augmenting
attribute, the augmentation framework selects the top-k values for that key.
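The generalized measures can be sketched as below; the function names and the dict-of-sets representation of predictions and ground truth are illustrative assumptions:

```python
def key_scores(predicted, truth):
    # Per-key measures: precision over the values actually predicted,
    # coverage over the values in the ground truth (both are sets).
    precision = len(predicted & truth) / len(predicted) if predicted else 0.0
    coverage = len(predicted) / len(truth)
    return precision, coverage

def table_scores(pred_by_key, truth_by_key):
    # Average the per-key measures over all keys in the query table Q.
    scores = [key_scores(pred_by_key.get(k, set()), truth)
              for k, truth in truth_by_key.items()]
    n = len(scores)
    precision = sum(p for p, _ in scores) / n
    coverage = sum(c for _, c in scores) / n
    return precision, coverage
```

Note that coverage counts all predicted values, correct or not; since the framework keeps only the top-k predictions (k being the number of ground-truth values), coverage never exceeds 1.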
5.7.2 Experimental Results
Evaluating Augmentation-By-Attribute (ABA): We implemented four dif-
ferent approaches for ABA:
• Holistic: This is our approach using TSP.
• DMA: This is the direct matching approach.
• DMA with attribute synonyms: This is the DMA approach where we use a set
of synonyms for the augmenting attribute. A web table T(K, B) will be used
for prediction if its keys overlap with those in the query table and Q.A matches
any of the synonyms of T.B. We obtain the synonyms using the state-of-the-art
technique based on the attribute correlation statistics database (ACSDb), as
described in [19]. The algorithm requires a context attribute name for each
dataset; we provide the key attribute name as the context. We refer to this
approach as DMA-ACSDbSyn.
Fig. 5.5.: Augmenting-By-Attribute (ABA) evaluation: (a) ABA precision; (b) ABA
coverage.
• Octopus: This is the EXTEND operation using the MultiJoin algorithm intro-
duced in [20], the state-of-the-art approach for ABA using web tables. Given a
query table and an attribute name a, MultiJoin composes a web search query of
the form “k a” for each key k in the query table. All the web tables in the
web pages returned by these search queries are then obtained and clustered
according to their schema similarity. Finally, the cluster that best covers the
query table is selected, and each cluster member is joined with the query table
to augment the values.
Figure 5.5 reports the precision and coverage. The Holistic approach significantly
outperforms all other approaches in terms of both precision and coverage. The average
precision (over all 6 datasets) of Holistic is 0.79, compared with 0.65, 0.42, and 0.39
for DMA, DMA-ACSDbSyn, and Octopus, respectively. The average coverage
(over all 6 datasets) of Holistic is 0.97, compared with 0.36, 0.59, and 0.38
for DMA, DMA-ACSDbSyn, and Octopus, respectively. This shows that considering
indirect matches and computing the scores holistically improves both precision and
coverage.
DMA demonstrates high precision on all the datasets except cameras, where it
was 60%; the main limitation of DMA is coverage, as it does not consider indirectly
matching tables.
DMA-ACSDbSyn has lower precision than DMA, due to the quality of
the synonyms used. We manually inspected the synonyms we obtained from the ACSDb;
there were almost no meaningful synonyms in the top 20 for the cameras and movies
datasets. This is because DMA-ACSDbSyn uses only schema-level correlations to
compute synonyms; attribute names are often ambiguous (e.g., the attribute name
“name”), leading to many spurious synonyms.
Octopus demonstrates low precision as well as low coverage on all the datasets,
except for cameras, where the precision was high, and for the us-gov dataset, where
the coverage was high. On average the coverage is about 33%, which matches the
results reported in [20]. Octopus uses the web search API to retrieve matching ta-
bles; since web search is not designed for matching tables, in many cases the top 1000
returned URLs did not provide any matching tables. Furthermore, Octopus's archi-
tecture does not consider indirectly matching tables and does not perform holistic
matching.
Evaluating Augmentation-By-Example (ABE): We study the sensitivity to
the number of example complete records, as well as to the nature of the
provided examples in terms of being famous (head) or rare (tail) examples. Head
(tail) examples are records that appear in a high (low) number of web tables.
In Figure 5.6, we report the precision and coverage of the augmented values for
the query table using the Holistic and DMA approaches as we increase the number of
example complete records from 1 to 50. We report the results for the cameras
and movies datasets; for the other datasets, we observed quite similar results.
Fig. 5.6.: Sensitivity of the precision and coverage to the number of examples, for
Holistic and DMA on the cameras and movies datasets: (a) precision; (b) coverage.
The Holistic approach shows high precision and maintains high coverage in
comparison to DMA.
In Figure 5.6(a), the reported precision is high for both datasets using the two
approaches. However, in Figure 5.6(b), the Holistic approach significantly outperforms
DMA in coverage when there are very few example complete records (between 1 and
10). The Holistic approach provides values for about 99% and 93% of the incomplete
records for the movies and cameras datasets, respectively, while DMA provides coverage
in the range of 20% to 75% for up to 10 examples. This shows that even with a small
number of example complete records, the Holistic approach can find enough tables to
augment the query table, while DMA cannot.
The number of example complete records is not the only factor impacting the cov-
erage. The frequency of the examples in the web tables also impacts the performance.
We note that the distribution of query records over the web tables follows a power law.
Hence, if the example complete records appear in many web tables (head or famous
query records), then we will directly match many web tables, increasing the cover-
age. On the other hand, if the examples are tail records, there will be very few directly
matching web tables.
In Figure 5.7, we perform a joint sensitivity analysis of both the number of example
complete records and the nature of the records (head, tail, or mid, where mid records
fall in the middle). We report the results for 2, 10, and 50 examples.
Fig. 5.7.: Joint sensitivity analysis of the number of examples and the head vs. tail
records in the web tables. The Holistic approach is robust in comparison to DMA.
The precision results were high and similar for all the datasets. The figure shows the
coverage for the movies dataset; we observed similar behavior for the other datasets.
DMA is sensitive to both the number of example records and their nature: the
coverage degrades as we decrease the number of examples, and it degrades further if
the examples are tail records. On the other hand, the Holistic approach does well and
maintains a coverage of 99% even in the hard setting of a small number of tail example
records.
Impact of new features for building the SMW graph: The objective of this exper-
iment is to evaluate the usefulness of our proposed set of features for matching web
tables in comparison to features previously used in the literature. In our evaluation,
we compare four techniques: (1) SMW Graph: This represents our full proposed set of
features. (2) Traditional: This represents the schema- and instance-level features; we
use features that capture the similarity between attribute names, as well
as the similarity between the values in each column. (3) WTCluster: This represents
the features introduced in [20] for matching web tables, namely the table
text similarity and column widths similarity. (4) Traditional & WTCluster: This
combines the previous two sets of features.

Fig. 5.8.: Web tables matching accuracy: (a) direct schema matching evaluation
using the proposed feature set; (b) the impact of the schema matching quality on the
ABA operation.

We first compare the above techniques on the quality of the SMW graph. In this
experiment, we randomly picked 500 web tables relevant to each of the datasets,
computed the feature values, and labeled each pair of web tables as a match or not
(using our automatic labeling technique described in Section 5.4.1). We created a
balanced set of examples (i.e., almost equal numbers of negative and positive
examples). We trained a classifier using the examples and reported in Figure 5.8(a)
Fig. 5.9.: Response time evaluation: query response time (sec, log scale) vs. number
of query records (K), for InfoGather and Octopus.
the model’s accuracy per dataset, in addition to the overall accuracy. The displayed
results are the average of 5 different runs of the above procedure.
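The balancing and averaging in this procedure can be sketched as follows; the helper names and the (features, label) pair encoding are assumptions, and the classifier itself is left abstract since the chapter does not specify one:

```python
import random

def balance(labeled_pairs):
    # Downsample to equal numbers of positive (label 1) and negative
    # (label 0) web-table pairs, as in the experiment setup.
    pos = [p for p in labeled_pairs if p[1] == 1]
    neg = [p for p in labeled_pairs if p[1] == 0]
    n = min(len(pos), len(neg))
    return random.sample(pos, n) + random.sample(neg, n)

def accuracy(classify, labeled_pairs):
    # Fraction of table pairs whose predicted match label is correct.
    return sum(classify(x) == y for x, y in labeled_pairs) / len(labeled_pairs)

def mean_accuracy(classify, labeled_pairs, runs=5):
    # The reported numbers average this procedure over 5 runs.
    return sum(accuracy(classify, balance(labeled_pairs))
               for _ in range(runs)) / runs
```

Because the evaluation set is balanced, a trivial always-match classifier scores exactly 0.5, so the accuracies reported below reflect genuine discriminative power of the feature sets.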
In general, the web table matching accuracy using our set of features, SMW
Graph, shows the best performance. SMW Graph yields about a 6% improvement over
Traditional and about a 10% improvement over WTCluster. SMW Graph also out-
performs the combined set of features, Traditional & WTCluster, by about 3%. These
results demonstrate the importance of the newly introduced features for matching web
tables.
The SMW Graph technique consistently outperforms the other techniques,
whereas the other techniques are not consistently reliable. For example, in the uk-pm
dataset, the Traditional technique performs better than WTCluster; the situation
is reversed in the us-gov dataset.
In Figure 5.8(b), we evaluate the impact of our features on the quality of the
ABA operation; that is, whether the improved quality of the SMW graph translates
into better-quality ABA (an indirect evaluation of the features). Here, we report only
the precision, as we obtain the same coverage using each of the feature sets. Note the
similarity between Figures 5.8(a) and 5.8(b): the SMW Graph features yield a
better-quality SMW graph and, hence, better precision.
Efficiency evaluation: In Figure 5.9, we evaluate the efficiency of our approach
and architecture in comparison with the Octopus approach for the ABA operation
on the cameras dataset. We obtained similar performance with the other datasets.
Our implementation of Octopus uses a web search engine API. We report
the query response time as we increase the size of the query table. Our approach
takes milliseconds to respond and is 4 orders of magnitude faster than Octopus.
Our fast response time is due to the fast computation of the TSP scores using
the pre-computed PPR vectors and indexes that we introduced in Section 5.3. For
Octopus, most of the time is spent processing the web search queries, as it is based
on SOAP request/response communication. As mentioned in [20], Octopus
could be implemented more efficiently if web search engines supported Octopus-specific
operations; however, current search engines do not.
In summary, our experiments show that our holistic matching framework and pro-
posed system architecture can support the three core operations with high precision,
high coverage and interactive response times.
5.8 Summary
In this chapter, we presented the InfoGather system to automate information
gathering tasks, such as augmenting entities with attribute values and discovering at-
tributes, using web tables. Our experiments demonstrate the superiority of our tech-
niques compared to the state-of-the-art.
6. SUMMARY
This dissertation addresses the data cleaning problem from a practical and pragmatic
viewpoint. The main goal is to introduce techniques that efficiently involve users and
the WWW, as well as handle large-scale databases. Involving users in the data
cleaning process is necessary to guarantee accurate cleaning decisions, and automating
the process of consulting the WWW for data cleaning saves a tremendous amount of
time.
This thesis fills the gap between the theoretical research conducted on cleaning
algorithms and practical systems for data cleaning, where quality and scalability are
usually of great concern.
6.1 Summary of Contributions
This dissertation introduces four main contributions for guided data cleaning by
involving users or the WWW. First, we introduced GDR, a guided data repair frame-
work that combines the best of both worlds: user fidelity to guide the cleaning process
and existing automatic cleaning techniques to suggest cleaning updates. The
user can help explore the search space of possible cleaning updates if he/she
is consulted before certain decisions are taken. Once the user confirms a decision, any
further dependent decisions taken by the algorithm are guaranteed to be more accu-
rate, hence achieving better data quality. The ultimate goal is to achieve better data
quality with minimal user feedback; therefore, GDR learns from user feedback to
eventually replace the user and minimize his/her effort. The
key novelty we proposed in GDR is the ranking of the questions that are forwarded
to the user. For this purpose, we introduced a principled mechanism that combines
decision theory and active learning to quantify the utility and
impact of obtaining user feedback on the questions. GDR was accepted as a system
demonstration [60] at SIGMOD 2010, and a research paper [59] was published
in PVLDB 2011.
The second contribution of this dissertation is a new scalable data repair
approach based on machine learning techniques. The idea is to maximize
the data likelihood, given the learned data distributions, using a small amount of
changes to the database. We built SCARE, a scalable framework
that follows our likelihood-based repair approach. SCARE has three advantages over
previous automatic cleaning approaches: (1) it is more accurate in identifying erro-
neous values and in finding the correct cleaning updates; (2) it scales well because
it relies on a robust mechanism to partition the database and then aggregate the
final cleaning decisions from the several partitions; and (3) it does not require data
quality rules; instead, it is based on modeling the data distributions using machine
learning techniques. In comparison to quality rules discovery techniques, machine
learning techniques are more flexible and accurate in capturing the relationships
between the database attributes and values. In contrast to constraint-based data
repair approaches, which find the minimal changes to satisfy a set of data quality
rules, our likelihood-based repair approach finds a constrained amount of changes
that maximizes the data likelihood.
Our third contribution is a novel approach to involve users or entities indirectly
in a data cleaning task. In this approach, we observed that users' actions (or
behavior), which can be found in system logs, can be useful evidence for the
task of deduplicating the users themselves when they have different representations in the
system. For example, in retail stores, customers are usually identified by their
credit cards, and customers sometimes use several cards for their transactions. Similarly,
the users of a web site are usually identified by their IP addresses, which
change from time to time. The idea of our solution is to first merge the behavior
information (transaction log) for each candidate pair of entities to be matched. If the
two behaviors seem to complete one another, in the sense that stronger behavioral
patterns become detectable after the merge, then this is a strong indication
that the two entities are, in fact, the same. To this end, we developed the necessary
pattern detection and modeling algorithms and computed the matching score as the
gain (or certainty increase) in the identification of the behavior patterns after merging
the entities' behavior. Our approach was published [61] in PVLDB 2010.
The last contribution is a new approach to leverage the WWW for data cleaning.
We focused on web tables, i.e., the relational tables that can be found in web pages.
We investigated the use of web tables for the task of finding missing values in
databases and augmenting entity attributes. For example, a user may have a list of
camera models, and using our system she can find the camera brands, optical zoom
levels, and other relevant attributes. The main challenge in this work is the ambiguity
of the entities and the dirtiness of the data on the web. Therefore, our solution relies
on aggregating answers from several web tables that directly and indirectly match
the user's list of entities. This required modeling the relationships between the web
tables to accurately identify the tables relevant to the user's database. We modeled
this problem as a topic-sensitive PageRank problem [85]. We introduced a system to
extract, process, index and compute the PageRank of the web tables. This is a
data-intensive application, and our solution relies on steps performed using MapReduce
in an offline phase, so that the processing at query time is done in milliseconds. Our
approach was published [62] in SIGMOD 2012.
6.2 Future Extensions
This dissertation raises a number of research problems related to data cleaning. It
is motivated by our belief that data cleaning should be the result of close collaboration
among multiple resources, including users and other information sources such as the
WWW. In this section, we give an overview of several directions for future research.
6.2.1 User Centric Data Cleaning
We introduced GDR to involve users directly in the data cleaning process. In our
proposed solution, we assumed that the user is always correct. However, users may
give uncertain answers or even make mistakes. It is always useful to interact with
experts who provide feedback that can certainly be trusted; however, experts are
more expensive than the regular users of the data. Therefore, to involve less-expert
users in cleaning the data, it is mandatory to take the users' uncertainty into account
when they provide feedback. This may require getting answers to the same question
from a different user, or posing another similar question to the same user. Revising
the suggested updates while propagating uncertain feedback will also change, as it
must take the user's uncertainty into account. The problem of involving direct user
interaction in data cleaning is by itself challenging; taking the uncertainty of the
feedback into account adds another level of challenge, but it leads to a more realistic
setting.
Crowdsourcing has drawn a lot of attention recently, and we believe that cleaning
and improving the quality of databases should benefit from such a model. Our initial
work in guided data repair opens the door to involving the crowd in data cleaning
tasks. However, we can identify several new challenges to be addressed in this model:
for example, modeling the users and taking their uncertainty into account, identifying
when feedback on the same question needs to be aggregated to improve certainty,
globally aggregating feedback while resolving conflicting answers, and the need for
an economic model to address the trade-off between quality, cost and time.
6.2.2 Holistic Data Cleaning
Research in the data cleaning area has typically focused on a single dimension of
data quality (e.g., inconsistency, deduplication, etc.). However, a single database may
have a combination of these problems, and unfortunately, these problems interact
with each other. For example, consider deduplication together with repairing an
inconsistent database (i.e., one violating a set of constraints). If we resolve and merge
duplicate records, we may end up fixing inconsistencies in their values (or introducing
new ones), and fixing an inconsistent record may result in new duplicate records
(or separate previously identified duplicates). Our system SCARE, which we propose
in Chapter 3, focuses on cleaning dirty values and relies on blocking techniques
for data partitioning. Blocking has usually been used to improve the efficiency of
deduplication techniques. We believe that SCARE can be extended to handle both
problems: repairing dirty values, as well as identifying and merging duplicate records.
6.2.3 The WWW for Data Cleaning
We introduced in Chapter 5 an approach to leverage web tables for data cleaning.
There are multiple directions in which to extend this work. First, more structured
information can be found on the web and can be leveraged as additional sources of
information, for example, HTML lists, attribute-value tables and the deep web. The
challenge here is how to correctly extract the information in a well-formatted form
that can be easily processed and used. Second, we focused in our work only on the
data cleaning task of finding missing values; other tasks still need to be studied. For
example, the data on the web can be leveraged for deduplication. Imagine a matching
function on the web that, given two records of possibly different schemas, tells whether
the records refer to the same entity or not. The information is there on the web to
help decide whether the entities match, but the problem needs to be studied.
6.2.4 Private Data Cleaning
Owners of data in the same domain can collaborate and help each other
improve their data quality. For example, stores may collaborate to unify and repair
their information about products. The challenge here is privacy: how can
we develop a data cleaning solution that benefits from the union of the parties' data
and gives suggestions to each data owner to clean his/her data? We have already made
efforts [90, 91] in this direction and introduced solutions for performing records matching
efficiently in a privacy-preserving setting. Matching values and records is an important
and frequently used building block in data cleaning techniques. Therefore,
our effort lays down the necessary infrastructure to revisit existing data cleaning
techniques in a private setting.
LIST OF REFERENCES
[1] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Tech-niques (Data-Centric Systems and Applications). Springer, 2006.
[2] L. P. English, Information Quality Applied: Best Practices for Improving Busi-ness Information, Processes and Systems. Wiley, 2009.
[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Con-sistency and accuracy,” in Proceedings of the 33rd International Conference onVery Large Data Bases, VLDB ’07, pp. 315–326, 2007.
[4] A. Lopatenko and L. Bravo, “Efficient approximation algorithms for repairinginconsistent databases,” in IEEE 23rd International Conference on Data Engi-neering, ICDE ’07, pp. 216–225, April 2007.
[5] S. Kolahi and L. V. S. Lakshmanan, “On approximating optimum repairs forfunctional dependency violations,” in Proceedings of the 12th International Con-ference on Database Theory, ICDT ’09, pp. 53–62, 2009.
[6] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Towards certain fixes with editingrules and master data,” Proceedings of VLDB Endowment (PVLDB), vol. 3,pp. 173–184, September 2010.
[7] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: A database approach forstatistical inference and data cleaning,” in Proceedings of the 2010 InternationalConference on Management of Data, SIGMOD ’10, pp. 75–86, 2010.
[8] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study of theirimpacts,” Artificial Intellegnce Review, vol. 22, pp. 177–210, November 2004.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record de-tection: A survey,” IEEE Transactions on Knowledge and Data Engineering,vol. 19, pp. 1–16, January 2007.
[10] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, “Large scale multiplekernel learning,” Journal Machine Learning Research, vol. 7, pp. 1531–1565,December 2006.
[11] W. Fan, “Dependencies revisited for improving data quality,” in Proceedingsof the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles ofDatabase Systems, PODS ’08, pp. 159–170, 2008.
[12] I. Bhattacharya and L. Getoor, “Iterative record linkage for cleaning and inte-gration,” in Proceedings of the SIGMOD Workshop on Research Issues in DataMining and Knowledge Discovery, DMKD ’04, pp. 11–18, 2004.
162
[13] D. V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting relationships for domain-independent data cleaning,” in SIAM International Conference on Data Mining, 2005.
[14] A. Doan, Y. Lu, Y. Lee, and J. Han, “Object matching for information integration: A profiler-based approach,” in Workshop on Information Integration on the Web, IIWeb ’03, 2003.
[15] S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik, “Leveraging aggregate constraints for deduplication,” in Proceedings of the 2007 International Conference on Management of Data, SIGMOD ’07, pp. 437–448, 2007.
[16] F. Radlinski and T. Joachims, “Query chains: Learning to rank from implicit feedback,” in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 239–248, 2005.
[17] S. Holland, M. Ester, and W. Kießling, “Preference mining: A novel approach on mining user preferences for personalized applications,” in Knowledge Discovery in Databases: PKDD 2003, 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 204–216, 2003.
[18] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating user behavior information,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pp. 19–26, 2006.
[19] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “WebTables: Exploring the power of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 538–549, August 2008.
[20] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 1090–1101, August 2009.
[21] X. Yin, W. Tan, and C. Liu, “Facto: A fact lookup engine based on web tables,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 507–516, 2011.
[22] M. Arenas, L. Bertossi, and J. Chomicki, “Consistent query answers in inconsistent databases,” in Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’99, pp. 68–79, 1999.
[23] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, “A cost-based model and effective heuristic for repairing constraints by value modification,” in Proceedings of the 2005 International Conference on Management of Data, SIGMOD ’05, pp. 143–154, 2005.
[24] R. Bruni and A. Sassano, “Errors detection and correction in large scale data collecting,” in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, IDA ’01, pp. 84–94, 2001.
[25] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Information and Computation, vol. 197, pp. 90–121, February 2005.
[26] E. Franconi, A. L. Palma, N. Leone, S. Perri, and F. Scarcello, “Census data repair: A challenging application of disjunctive logic programming,” in Proceedings of the Artificial Intelligence on Logic for Programming, LPAR ’01, pp. 561–578, 2001.
[27] J. Wijsen, “Condensed representation of database repairs for consistent query answering,” in Proceedings of the 9th International Conference on Database Theory, ICDT ’03, pp. 378–393, 2003.
[28] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” ACM Transactions on Database Systems (TODS), vol. 33, pp. 6:1–6:48, June 2008.
[29] W. Fan, X. Jia, J. Li, and S. Ma, “Reasoning about record matching rules,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 407–418, August 2009.
[30] A. Arasu, C. Re, and D. Suciu, “Large-scale deduplication with constraints using Dedupalog,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 952–963, 2009.
[31] W. Fan, S. Ma, Y. Hu, J. Liu, and Y. Wu, “Propagating functional dependencies with conditions,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 391–407, August 2008.
[32] L. Bravo, W. Fan, F. Geerts, and S. Ma, “Increasing the expressivity of conditional functional dependencies without extra complexity,” in Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE ’08, pp. 516–525, 2008.
[33] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu, “On generating near-optimal tableaux for conditional functional dependencies,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 376–390, August 2008.
[34] G. Cormode, L. Golab, F. Korn, A. McGregor, D. Srivastava, and X. Zhang, “Estimating the confidence of conditional functional dependencies,” in Proceedings of the 2009 International Conference on Management of Data, SIGMOD ’09, pp. 469–482, 2009.
[35] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 1166–1177, August 2008.
[36] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, “Discovering conditional functional dependencies,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1231–1234, 2009.
[37] F. Chu, Y. Wang, D. S. Parker, and C. Zaniolo, “Data cleaning using belief propagation,” in Proceedings of the 2nd International Workshop on Information Quality in Information Systems, IQIS ’05, pp. 99–104, 2005.
[38] J. L. Y. Koh, M. L. Lee, W. Hsu, and K. T. Lam, “Correlation-based detection of attribute outliers,” in Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA ’07, pp. 164–175, 2007.
[39] H. Galhardas, D. Florescu, D. Shasha, and E. Simon, “AJAX: An extensible data cleaning tool,” in Proceedings of the 2000 International Conference on Management of Data, SIGMOD ’00, p. 590, 2000.
[40] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive data cleaning system,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 381–390, 2001.
[41] A. Doan, P. Domingos, and A. Y. Halevy, “Reconciling schemas of disparate data sources: A machine-learning approach,” in Proceedings of the 2001 International Conference on Management of Data, SIGMOD ’01, pp. 509–520, 2001.
[42] A. Doan and R. McCann, “Building data integration systems: A mass collaboration approach,” in Workshop on Information Integration on the Web, IIWeb ’03, pp. 183–188, 2003.
[43] S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, “Pay-as-you-go user feedback for dataspace systems,” in Proceedings of the 2008 International Conference on Management of Data, SIGMOD ’08, pp. 847–860, 2008.
[44] W. Wu, C. Yu, A. Doan, and W. Meng, “An interactive clustering-based approach to integrating source query interfaces on the deep web,” in Proceedings of the 2004 International Conference on Management of Data, SIGMOD ’04, pp. 95–106, 2004.
[45] S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pp. 269–278, 2002.
[46] P. Turney, “Types of cost in inductive concept learning,” in Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, pp. 15–21, 2000.
[47] F. Provost, “Toward economic machine learning and utility-based data mining,” in Proceedings of the 1st International Workshop on Utility-Based Data Mining, UBDM ’05, p. 1, 2005.
[48] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, pp. 201–221, May 1994.
[49] A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised learning with decision-theoretic active learning,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI ’07, pp. 877–882, 2007.
[50] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? Improving data quality and data mining using multiple, noisy labelers,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pp. 614–622, 2008.
[51] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu, “Recovering semantics of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 528–538, June 2011.
[52] G. Limaye, S. Sarawagi, and S. Chakrabarti, “Annotating and searching web tables using entities, types and relationships,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 1338–1347, September 2010.
[53] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” The VLDB Journal, vol. 10, pp. 334–350, December 2001.
[54] P. A. Bernstein, J. Madhavan, and E. Rahm, “Generic schema matching, ten years later,” Proceedings of VLDB Endowment (PVLDB), 2011 (VLDB 10-Year Best Paper Award).
[55] Z. Bellahsene, A. Bonifati, and E. Rahm, Schema Matching and Mapping. Springer, 1st ed., 2011.
[56] J. Madhavan, P. A. Bernstein, A. Doan, and A. Halevy, “Corpus-based schema matching,” in Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pp. 57–68, 2005.
[57] Y. He and D. Xin, “SEISA: Set expansion by iterative similarity aggregation,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 427–436, 2011.
[58] R. Gupta and S. Sarawagi, “Answering table augmentation queries from unstructured lists on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 289–300, August 2009.
[59] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, “Guided data repair,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 279–289, February 2011.
[60] M. Yakout, A. K. Elmagarmid, J. Neville, and M. Ouzzani, “GDR: A system for guided data repair,” in Proceedings of the 2010 International Conference on Management of Data, SIGMOD ’10, pp. 1223–1226, 2010.
[61] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi, “Behavior based record linkage,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 439–448, September 2010.
[62] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables,” in Proceedings of the 2012 International Conference on Management of Data, SIGMOD ’12, 2012.
[63] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in Proceedings of the 23rd International Conference on Data Engineering, ICDE ’07, pp. 746–755, 2007.
[64] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2nd ed., 2003.
[65] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, March 2002.
[66] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 204–213, 2001.
[67] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.
[68] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie, “Dependency networks for inference, collaborative filtering, and data visualization,” Journal of Machine Learning Research, vol. 1, pp. 49–75, September 2001.
[69] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” in Neurocomputing: Foundations of Research, MIT Press, 1988.
[70] T. G. Dietterich, “Ensemble methods in machine learning,” in Proceedings of the 1st International Workshop on Multiple Classifier Systems, MCS ’00, pp. 1–15, 2000.
[71] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–87, March 1991.
[72] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama, “Greedily finding a dense subgraph,” Journal of Algorithms, vol. 34, pp. 203–221, February 2000.
[73] D. S. Hochbaum, “Efficient bounds for the stable set, vertex cover and set packing problems,” Discrete Applied Mathematics, vol. 6, pp. 243–254, 1983.
[74] S. Arora, D. Karger, and M. Karpinski, “Polynomial time approximation schemes for dense instances of NP-hard problems,” in Proceedings of the 27th Annual ACM Symposium on Theory of Computing, STOC ’95, pp. 284–293, 1995.
[75] U. Feige and M. Seltser, “On the densest k-subgraph problem,” Algorithmica, vol. 29, 2001.
[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[77] S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, 1997.
[78] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.
[79] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. 23, pp. 90–93, January 1974.
[80] G. K. Wallace, “The JPEG still picture compression standard,” Communications of the ACM, vol. 34, pp. 30–44, April 1991.
[81] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.
[82] J. Madhavan, P. A. Bernstein, and E. Rahm, “Generic schema matching with Cupid,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 49–58, 2001.
[83] B. He and K. C.-C. Chang, “Statistical schema matching across web query interfaces,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 217–228, 2003.
[84] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and efficient fuzzy match for online data cleaning,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 313–324, 2003.
[85] T. H. Haveliwala, “Topic-sensitive PageRank,” in Proceedings of the 11th International Conference on World Wide Web, WWW ’02, pp. 517–526, 2002.
[86] M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu, “Uncovering the relational web,” in Proceedings of the 11th International Workshop on the Web and Databases, WebDB ’08, 2008.
[87] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short ’08, pp. 265–268, 2008.
[88] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Technical Report 1999-66, Stanford InfoLab, November 1999.
[89] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on MapReduce,” in Proceedings of the 2011 International Conference on Management of Data, SIGMOD ’11, pp. 973–984, 2011.
[90] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient private record linkage,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1283–1286, 2009.
[91] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient and practical approach for private record linkage,” Journal of Data and Information Quality, 2012.
VITA
Mohamed Yakout was born in the beautiful city of Alexandria, Egypt. His interest in technology and computer science began in high school. In 1996, Mohamed joined the Faculty of Engineering at Alexandria University. After a very competitive freshman year, he was ranked among the top students and joined the Computer Science Department. After graduating with a bachelor’s degree in Computer Science, Mohamed joined the ICT technical staff of the Bibliotheca Alexandrina (Library of Alexandria) in 2001. During his time at the Bibliotheca, he participated in major digital library projects; in particular, he led projects on the digitization of Egyptian cultural heritage. In 2006, Mohamed earned an M.Sc. in Computer Science from Alexandria University, after which he began thinking about combining his industrial skills with more advanced research skills.
In 2007, he joined the graduate program at Purdue University, working with his advisor, Ahmed Elmagarmid. At Purdue, Mohamed learned to conduct world-class research, working on problems in data quality and data cleaning. His work was published in top data management conferences such as SIGMOD, VLDB, and ICDE. In the summer of 2009, Mohamed interned at Google Inc., getting his first taste of large US corporations. In the summers of 2010 and 2011, he interned with Microsoft Research, where he interacted with many world-class researchers and gained experience in large-scale, real-world data management. Mohamed’s main research interests focus on advancing technologies that improve data quality by involving external resources such as users and data on the Web. Mohamed graduated with a Ph.D. in Computer Science from Purdue University in August 2012. In 2012, he joined the technical staff of Microsoft Bing in Seattle, Washington.