Utility of Considering Multiple Alternative Rectifications in Data Cleaning


1

Utility of Considering Multiple Alternative Rectifications in Data Cleaning

Preet Inder Singh Rihan, Master's Thesis

Committee Members
◦ Dr. Subbarao Kambhampati (Chair)
◦ Dr. Huan Liu
◦ Dr. Hasan Davulcu

2

Importance of Data Cleaning

Data is one of the most useful resources
◦ Crucial to numerous important decision-making and analysis processes

High volume, variety and velocity of data make it difficult to obtain data in the cleanest form

3

Sources and Types of Noise

A few reasons for the noise present in data
◦ Imperfect sensing devices / information extractors
◦ Heterogeneity in data from multiple sources
◦ Errors in data entry, misspellings, etc.

Data that suffers from quality issues is called dirty data

Example of dirty data

Make | Model | Cartype | Condition | Drivetrain
TSX | Acura | | Used | FWD
Honda | Corolla | Sedan | New | FWD
Honda | Civic | Sdna | Used | FWD


4

Current Techniques

Types of Problems
◦ De-Duplication
◦ Inconsistencies
◦ Schema Noise

Industry Solutions

Academic Approaches
◦ Outlier detection
◦ Conditional functional dependencies
◦ BayesWipe


5

Common themes in Current Data Cleaning Techniques

Considers multiple rectifications

Picks the most likely rectification deterministically, using:
◦ fixed rules
◦ domain experts


6

Example

Dirty Database

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Corolla | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Honda | Civic | Sedan | Used | FWD
5 | Toyota | Corolla | Sedan | New | FWD

Deterministic clean output

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Civic | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Honda | Civic | Sedan | Used | FWD
5 | Toyota | Corolla | Sedan | New | FWD

Tuple 2 in detail
◦ Dirty Tuple: Honda Corolla Sedan New FWD
◦ Rectification 1: Honda Civic Sedan New FWD (probability 0.6)
◦ Rectification 2: Toyota Corolla Sedan New FWD (probability 0.4)
◦ True Tuple: Toyota Corolla Sedan New FWD

7

Data Cleaning Approaches: Problems

Hard to get the perfect fixed rules/domain knowledge

Partially correct rules/knowledge may ignore the true rectification
◦ Results in information loss
◦ Irrecoverable when the original data is decoupled from the cleaned outcome


8

Alternative Approach: Considering multiple alternative candidates after data cleaning

Keep multiple alternative rectifications in a probabilistic database

Advantages:
◦ Prevents information loss
◦ Generates query results with higher recall

9

Alternative Approach : Potential Challenges

Keeping multiple alternative rectifications of a dirty tuple poses some challenges:
◦ Query results with many irrelevant results, i.e. low precision
◦ Query processing over probabilistic data
◦ Size of probabilistic data

10

Problem Statement

To investigate the trade-offs between keeping multiple alternative rectifications of a dirty data instance and keeping a single, deterministically selected clean rectification of that instance

Agenda
◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

12

Background System

For investigation, BayesWipe [1] is used
◦ End-to-end probabilistic data cleaning system
◦ Cleans structured data
◦ Handles data quality issues due to
  ◦ Inconsistency
  ◦ Incompleteness
  ◦ Substitutions

[1] Y. Hu, S. De, Y. Chen, and S. Kambhampati. Bayesian data cleaning for web data. arXiv preprint arXiv:1204.3677, 2012

13

BayesWipe

For every tuple T in the dirty database:
◦ A set of rectifications T* is generated
◦ Every T* has a probability value P(T*|T)
◦ P(T*|T) is the system's confidence in claiming T* to be the true tuple


14

BayesWipe's Clean Outcomes

BayesWipe can produce output in two modes
◦ BayesWipe-DET: only the most likely rectification
◦ BayesWipe-PDB: all rectifications with their associated probabilities

[Diagram: Dirty Data → BayesWipe → rectifications T*1, T*2, ..., T*i, ..., T*n of tuple T → BayesWipe-DET / BayesWipe-PDB]


15

Example: BayesWipe-DET and BayesWipe-PDB

Dirty Database (input to BayesWipe)

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civik | Sedan | Used | FWD
2 | Honda | Corolla | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Toyota | Corolla | Sedan | New | FWD

BayesWipe-DET

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Civic | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Toyota | Corolla | Sedan | New | FWD

BayesWipe-PDB

TID | Make | Model | Cartype | Condition | Drivetrain | Probability
1 | Honda | Civic | Sedan | Used | FWD | 0.9
1 | Honda | Civik | Sedan | New | FWD | 0.1
2 | Toyota | Corolla | Sedan | New | FWD | 0.2
2 | Toyota | Corolla | Sedan | Used | FWD | 0.2
2 | Honda | Civic | Sedan | Used | FWD | 0.6
3 | Honda | Civik | Sedan | New | FWD | 0.05
3 | Honda | Civic | Sedan | Used | FWD | 0.95
4 | Toyota | Corolla | Sedan | New | FWD | 0.9
4 | Honda | Corolla | Sedan | New | FWD | 0.1
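The two output modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Candidates`, `to_det` and `to_pdb` are hypothetical and are not part of BayesWipe, and the sample rows are the rectifications of tuples 1 and 4 from the example.

```python
from typing import Dict, List, Tuple

# A rectification is (make, model, cartype, condition, drivetrain).
Row = Tuple[str, str, str, str, str]
# candidates[tid] lists every rectification T* of tuple T with its P(T*|T).
Candidates = Dict[int, List[Tuple[Row, float]]]


def to_pdb(candidates: Candidates) -> List[Tuple[int, Row, float]]:
    """PDB mode: keep every rectification together with its probability."""
    return [(tid, row, p) for tid, alts in candidates.items() for row, p in alts]


def to_det(candidates: Candidates) -> Dict[int, Row]:
    """DET mode: keep only the most likely rectification of each tuple."""
    return {tid: max(alts, key=lambda a: a[1])[0] for tid, alts in candidates.items()}


# Rectifications of tuples 1 and 4, copied from the example.
candidates: Candidates = {
    1: [(("Honda", "Civic", "Sedan", "Used", "FWD"), 0.9),
        (("Honda", "Civik", "Sedan", "New", "FWD"), 0.1)],
    4: [(("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.9),
        (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.1)],
}

det = to_det(candidates)   # one row per TID
pdb = to_pdb(candidates)   # one row per (TID, rectification)
```

The deterministic view drops the low-probability alternatives, while the probabilistic view carries all of them forward to query time.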

16

BayesWipe-PDB Type

Types of probabilistic database
◦ Tuple Independent
◦ Block Independent Disjoint (BID)
◦ C-Table

BayesWipe-PDB type
◦ Block Independent Disjoint (BID)

Block Independent Disjoint probabilistic database

TID | Make | Model | Cartype | Condition | Drivetrain | Probability
12 | Honda | Civic | Sedan | Used | FWD | 0.85
12 | Honda | Corolla | Sedan | New | FWD | 0.10
12 | Honda | Civic | Sdfkshf | Used | FWD | 0.05
210 | Toyota | Corolla | Sedan | New | FWD | 0.9
210 | Honda | Corolla | Sedan | New | FWD | 0.1
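In a BID table, rows sharing a TID form one block of mutually exclusive alternatives, and different blocks are independent. A minimal sanity check, assuming rows are reduced to hypothetical `(tid, probability)` pairs, verifies that no block carries more than a total probability of 1:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def block_totals(rows: List[Tuple[int, float]]) -> Dict[int, float]:
    """Sum the probabilities of all alternatives inside each block (TID)."""
    totals: Dict[int, float] = defaultdict(float)
    for tid, p in rows:
        totals[tid] += p
    return dict(totals)


def is_valid_bid(rows: List[Tuple[int, float]], tol: float = 1e-9) -> bool:
    """Alternatives in a block are disjoint, so each block total must be <= 1."""
    if any(p < 0.0 or p > 1.0 for _, p in rows):
        return False
    return all(total <= 1.0 + tol for total in block_totals(rows).values())


# (TID, probability) pairs taken from the BID table above.
rows = [(12, 0.85), (12, 0.10), (12, 0.05), (210, 0.9), (210, 0.1)]
valid = is_valid_bid(rows)
```

The tolerance `tol` is an assumption added only to absorb floating-point rounding.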

17

BayesWipe-PDB Storage

BayesWipe-PDB is stored in a relational database:
◦ SQL Server

Query Processing Engine:
◦ Mystiq [2], a prototype of a probabilistic database management system

[2] Boulos, Jihad, Nilesh Dalvi, Bhushan Mandhani, Shobhit Mathur, Chris Re, and Dan Suciu. "MYSTIQ: a system for finding more answers by using probabilities." In SIGMOD, pp. 891-893. ACM, 2005.


19

Investigation Strategy

Criteria to compare BayesWipe-PDB and BayesWipe-DET
◦ Accuracy of query results
◦ Scalability

20

Accuracy of Query results

To check whether BayesWipe-PDB improves query results

Query results from BayesWipe-PDB and BayesWipe-DET are compared using:
◦ Precision of query results
◦ Recall of query results
◦ Total increase in true positives over multiple queries
◦ Total increase in false positives over multiple queries

21

Query Results: BayesWipe-PDB and BayesWipe-DET

BayesWipe-DET query results
◦ Set of deterministic tuples

BayesWipe-PDB query results
◦ Set of probabilistic tuples
◦ Multiple rectifications of a tuple

Query: $\sigma_{\text{make} = \text{'Honda'}}$

BayesWipe-DET

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Civic | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Toyota | Corolla | Sedan | New | FWD

Deterministic Query Results

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Civic | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD

BayesWipe-PDB

TID | Make | Model | Cartype | Condition | Drivetrain | Probability
1 | Honda | Civic | Sedan | Used | FWD | 0.9
1 | Honda | Civik | Sedan | New | FWD | 0.1
2 | Toyota | Corolla | Sedan | New | FWD | 0.2
2 | Toyota | Corolla | Sedan | Used | FWD | 0.2
2 | Honda | Civic | Sedan | Used | FWD | 0.6
3 | Honda | Civik | Sedan | New | FWD | 0.05
3 | Honda | Civic | Sedan | Used | FWD | 0.95
4 | Toyota | Corolla | Sedan | New | FWD | 0.9
4 | Honda | Corolla | Sedan | New | FWD | 0.1

Probabilistic Query Results

TID | Make | Model | Cartype | Condition | Drivetrain | Probability
1 | Honda | Civic | Sedan | Used | FWD | 0.9
2 | Honda | Civic | Sedan | Used | FWD | 0.6
3 | Honda | Civik | Sedan | New | FWD | 0.05
3 | Honda | Civic | Sedan | Used | FWD | 0.95
4 | Honda | Corolla | Sedan | New | FWD | 0.1
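As a rough sketch of why the probabilistic view can return more tuples, the same selection can be run over both representations. The helper `select_make` is hypothetical (it is not how Mystiq evaluates queries), and the rows are limited to tuples 2 and 4 of the example.

```python
from typing import List, Sequence


def select_make(rows: List[Sequence], make: str) -> List[Sequence]:
    """Selection on the Make attribute; column 0 is the TID, column 1 is Make."""
    return [r for r in rows if r[1] == make]


# Deterministic rows: one rectification per TID.
det = [
    (2, "Honda", "Civic", "Sedan", "New", "FWD"),
    (4, "Toyota", "Corolla", "Sedan", "New", "FWD"),
]

# Probabilistic rows: every rectification, with its probability in the last column.
pdb = [
    (2, "Toyota", "Corolla", "Sedan", "New", "FWD", 0.2),
    (2, "Toyota", "Corolla", "Sedan", "Used", "FWD", 0.2),
    (2, "Honda", "Civic", "Sedan", "Used", "FWD", 0.6),
    (4, "Toyota", "Corolla", "Sedan", "New", "FWD", 0.9),
    (4, "Honda", "Corolla", "Sedan", "New", "FWD", 0.1),
]

det_hits = select_make(det, "Honda")
pdb_hits = select_make(pdb, "Honda")
```

Tuple 4 is absent from the deterministic answer but appears in the probabilistic answer with probability 0.1, which is the source of both the recall gain and the precision risk.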

22

Accuracy of Query results:Evaluation Challenges

Precision/Recall calculation is not straightforward

Evaluation Challenges:
◦ Defining accuracy/relevance of a resultant tuple
◦ Precision/Recall for query results from BayesWipe-PDB

23

Defining Relevance/Accuracy of Resultant Tuple

Ground Truth

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Corolla | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | AWD
4 | Toyota | Corolla | Sedan | New | FWD

Observed

TID | Make | Model | Cartype | Condition | Drivetrain
1 | Honda | Civic | Sedan | Used | FWD
2 | Honda | Corolla | Sedan | New | FWD
3 | Honda | Civic | Sedan | Used | FWD
4 | Toyota | Corolla | Sedan | New | FWD

Query results
◦ Ground Truth Result: 3 | Honda | Civic | Sedan | Used | AWD
◦ Observed Result: 3 | Honda | Civic | Sedan | Used | FWD


24

Relevance of Resulting Tuples

In this precision and recall computation
◦ Relevance is defined only by tuple ids

A probabilistic/deterministic tuple from the query results is relevant if:
◦ Its tuple id appears in the query results from the ground truth
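With relevance reduced to tuple ids, precision and recall become set computations. The sketch below assumes a convention of 1.0 when a denominator is empty; that convention and the sample id sets are illustrative choices, not taken from the slides.

```python
from typing import Iterable, Tuple


def precision_recall(observed_tids: Iterable[int],
                     truth_tids: Iterable[int]) -> Tuple[float, float]:
    """Precision and recall where a result is relevant iff its TID is in the ground truth."""
    observed, truth = set(observed_tids), set(truth_tids)
    true_positives = len(observed & truth)
    # Assumed convention: an empty denominator yields 1.0.
    precision = true_positives / len(observed) if observed else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical ids: the query returned tuples 1, 2, 3; the ground truth holds 1, 3, 4.
precision, recall = precision_recall([1, 2, 3], [1, 3, 4])
```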

25

Precision/Recall of Probabilistic Query Results

Precision and recall are not directly defined for probabilistic query results

True precision/recall for probabilistic query results
◦ Calculated over all possible worlds*
◦ Overall precision/recall is the weighted sum of precision/recall over all possible worlds

The number of possible worlds is exponential

* A possible world is a state of a probabilistic database in which each random variable in the PDB has been assigned one of its possible values
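A small sketch makes the possible-worlds computation concrete. It assumes a BID database whose blocks each sum to probability 1, reduces every alternative to its Make value, and uses a hypothetical ground truth; the function names are illustrative. The world count is the product of the block sizes, which is why exact evaluation does not scale.

```python
from itertools import product
from math import prod
from typing import Dict, Iterator, List, Set, Tuple

# blocks[tid] lists (make, probability) alternatives; values for tuples 2 and 4
# are taken from the example, reduced to the Make attribute.
Blocks = Dict[int, List[Tuple[str, float]]]
blocks: Blocks = {
    2: [("Toyota", 0.2), ("Toyota", 0.2), ("Honda", 0.6)],
    4: [("Toyota", 0.9), ("Honda", 0.1)],
}


def possible_worlds(blocks: Blocks) -> Iterator[Tuple[Dict[int, str], float]]:
    """Yield (world, probability): one alternative chosen independently per block."""
    tids = sorted(blocks)
    for choice in product(*(blocks[t] for t in tids)):
        world = {t: make for t, (make, _) in zip(tids, choice)}
        yield world, prod(p for _, p in choice)


def expected_recall(blocks: Blocks, truth_tids: Set[int], make: str) -> float:
    """Probability-weighted sum of per-world recall for the query make = <make>."""
    total = 0.0
    for world, p in possible_worlds(blocks):
        observed = {t for t, m in world.items() if m == make}
        total += p * len(observed & truth_tids) / len(truth_tids)
    return total


worlds = list(possible_worlds(blocks))
# Hypothetical ground truth: only tuple 2 is truly a Honda.
recall = expected_recall(blocks, {2}, "Honda")
```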


26

Approximate Precision/Recall of Probabilistic Results

Two ways to approximate the precision/recall calculation
◦ Consider partial belongingness of tuples
  ◦ Where P(t) is the probability of the tuple t
◦ Use a pass threshold to classify probabilistic tuples as query results or not
  ◦ Precision and recall are calculated using the standard formula for query results
  ◦ Traditional way [3] to handle uncertain results

[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. VLDB,pages 586–597. VLDB Endowment, 2002.

27

Precision/Recall approximation using a pass threshold

A fixed pass threshold θ is applied to the total probability
◦ Total probability: aggregated probabilities of all rectifications of a probabilistic tuple
◦ Total probability ≥ θ: the tuple is kept in the query results
◦ Otherwise: the tuple is rejected

Probabilistic query results

TID | Make | Model | Cartype | Condition | Drivetrain | Probability
1 | Honda | Civic | Sedan | Used | FWD | 0.9
2 | Honda | Civic | Sedan | Used | FWD | 0.6
3 | Honda | Civik | Sedan | New | FWD | 0.05
3 | Honda | Civic | Sedan | Used | FWD | 0.95
4 | Honda | Corolla | Sedan | New | FWD | 0.1

Aggregated probabilities of probabilistic tuples

TID | Probability
1 | 0.9
2 | 0.6
3 | 1
4 | 0.1

Determinized results (threshold θ = 0.2)

TID
1
2
3
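The thresholding step can be reproduced with a short sketch. The function names `aggregate` and `determinize` are hypothetical; the input pairs are the (TID, probability) columns of the probabilistic query results shown on this slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(results: List[Tuple[int, float]]) -> Dict[int, float]:
    """Sum the probabilities of all returned rectifications of each tuple."""
    totals: Dict[int, float] = defaultdict(float)
    for tid, p in results:
        totals[tid] += p
    return dict(totals)


def determinize(results: List[Tuple[int, float]], theta: float) -> List[int]:
    """Keep the TIDs whose total probability reaches the pass threshold theta."""
    return sorted(tid for tid, total in aggregate(results).items() if total >= theta)


# (TID, probability) pairs from the probabilistic query results.
results = [(1, 0.9), (2, 0.6), (3, 0.05), (3, 0.95), (4, 0.1)]
kept = determinize(results, theta=0.2)
```

The determinized id list can then be scored with the standard precision and recall formulas.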

28

Scalability

To check whether BayesWipe-PDB is scalable

Two comparisons are performed
◦ Size of BayesWipe-PDB vs. size of BayesWipe-DET
◦ Query processing time over BayesWipe-PDB vs. query processing time over BayesWipe-DET

29

Potential issues with BayesWipe-PDB

The number of rectifications with low probabilities increases as data size increases

Potential issues:
◦ Query results with very low precision
  ◦ One way to control this is a good pass threshold
◦ Scalability issues
  ◦ High physical space
  ◦ High query processing time


31

Optimization Technique

Reason for potential issues
◦ Too many irrelevant results in BayesWipe-PDB

Pre-Pruning, an optimization technique
◦ Prevents irrelevant rectifications from being stored in BayesWipe-PDB
◦ Checks every rectification T* of tuple T
◦ Stores T* in BayesWipe-PDB if
  ◦ T* passes the pre-pruning algorithm

32

PrePruned BayesWipe-PDB

The probabilistic database stores multiple candidate clean versions T* after Pre-Pruning

[Diagram: Dirty data → BayesWipe → Multiple alternatives (T*) → Pruning → BayesWipe-PDB]

33

Pre-Pruning Algorithm

The Pre-Pruning Algorithm considers every candidate clean version and its associated probability, i.e. P(T*|T)

[Diagram: P(T*|T) axis with thresholds α and β; regions labeled Rejected, Further Investigated, Accepted]

34

Pre-Pruning Algorithm

If the pre-pruning condition holds
◦ T* is kept as a clean version of the tuple T

In the case of rare, legitimate tuples, it is possible that both T and T* have low probabilities

For this, the prior of tuple T and the prior of tuple T* are considered

$$P[T] = \frac{\text{number of times tuple } T \text{ occurs in the dirty database}}{\text{number of tuples in the dirty database}}$$

The values of α, β and γ are set to 0.009, 0.5 and 5 respectively

[Diagram: P(T*|T) axis with thresholds α and β; regions labeled Rejected, Further Investigated, Accepted]
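The threshold zones and the prior can be sketched as follows. Several points are assumptions: that values below α are rejected and values above β are accepted, how the boundaries themselves are treated, and the function names. The γ-based test applied to the "further investigated" zone is not recoverable from the slide, so it is deliberately left out.

```python
from collections import Counter
from typing import Hashable, Sequence

ALPHA, BETA = 0.009, 0.5  # threshold values reported on the slide


def zone(p: float, alpha: float = ALPHA, beta: float = BETA) -> str:
    """Place P(T*|T) into one of the three zones of the diagram.

    Assumption: boundaries are strict, so alpha <= p <= beta is investigated further.
    """
    if p < alpha:
        return "rejected"
    if p > beta:
        return "accepted"
    return "further investigated"


def prior(tuple_: Hashable, dirty_db: Sequence[Hashable]) -> float:
    """P[T]: occurrences of the tuple in the dirty database over the number of tuples."""
    return Counter(dirty_db)[tuple_] / len(dirty_db)


# Hypothetical dirty database of (make, model) pairs.
dirty_db = [("Honda", "Civic")] * 3 + [("Honda", "Corolla")]
civic_prior = prior(("Honda", "Civic"), dirty_db)
```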



36

Experimental Setup

Experiments are performed on a used car dataset crawled from Google Base

Data set size varies from 1000 tuples to 30000 tuples

Synthetic noise is introduced at random into the clean dataset
◦ Noise level varies from 1% to 20%

Random queries were selected to compare the quality of query results extracted from BayesWipe-DET and BayesWipe-PDB

37

Experimental Results

Present the findings of the comparison of BayesWipe-PDB and BayesWipe-DET on
◦ Accuracy of query results (Precision and Recall)
◦ Scalability (Size and Query processing time)

Present the effect of the optimization technique on BayesWipe-PDB

38

BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

Data Size = 2500, Noise = 10%, Threshold = 0.1

[Chart 1: Recall per query, BayesWipe-PDB vs. BayesWipe-DET]
[Chart 2: Precision per query, BayesWipe-PDB vs. BayesWipe-DET]

Queries on the x-axis: make = acura; model = outlander sports; cartype = sedan; make = bmw & condition = used; model = jetta; model = cooper s; model = h3 mini; Average


39

BayesWipe-PDB vs. BayesWipe-DET: Effect of Threshold

Dataset size = 30000

[Chart: average BayesWipe-PDB precision and average BayesWipe-PDB recall (y-axis, 0 to 1) against threshold (x-axis, 0 to 0.8)]

DET Recall = 0.82
DET Precision = 0.997


40

Accuracy of Query Results from Multiple Random Queries

Averaging the precision and recall values of multiple random queries does not give a good picture

Comparison on non-normalized metrics over 100 random queries
◦ Total increase in the number of true positives generated
  ◦ True Positives (BayesWipe-PDB) - True Positives (BayesWipe-DET)
◦ Total increase in the number of false positives generated
  ◦ False Positives (BayesWipe-PDB) - False Positives (BayesWipe-DET)
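The two non-normalized metrics can be accumulated over a batch of queries as below. This is a minimal sketch: the function `gains`, the tuple layout of each query record, and the sample id sets are hypothetical, and relevance is again decided by tuple id.

```python
from typing import Iterable, Set, Tuple

# One record per query: (ground-truth TIDs, BayesWipe-DET TIDs, BayesWipe-PDB TIDs).
QueryRecord = Tuple[Set[int], Set[int], Set[int]]


def gains(queries: Iterable[QueryRecord]) -> Tuple[int, int]:
    """Return (total true-positive gain, total false-positive gain) of PDB over DET."""
    tp_gain = fp_gain = 0
    for truth, det, pdb in queries:
        tp_gain += len(pdb & truth) - len(det & truth)
        fp_gain += len(pdb - truth) - len(det - truth)
    return tp_gain, fp_gain


# Two hypothetical queries.
queries = [
    ({1, 2, 3}, {1, 2}, {1, 2, 3, 4}),  # PDB recovers tuple 3 but also returns tuple 4
    ({5}, {5}, {5, 6}),                 # PDB adds only an irrelevant tuple
]
tp_gain, fp_gain = gains(queries)
```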

41

BayesWipe-PDB vs. BayesWipe-DET: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Chart: increase in true positives and increase in false positives (y-axis, 0 to 600) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

42

BayesWipe-PDB vs. BayesWipe-DET: Size of Database

Data Size = 30000

[Chart: BayesWipe-PDB database size and BayesWipe-DET database size in number of tuples (y-axis, 0 to 400000) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

43

BayesWipe-PDB vs. BayesWipe-DET: Average Query Processing Time

Data Size = 30000

[Chart: BayesWipe-PDB and BayesWipe-DET query processing time in ms (y-axis, 0 to 80) against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

44

Optimized BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

[Chart 1: Recall per query for BayesWipe-PDB, Optimized BayesWipe-PDB and BayesWipe-DET (y-axis, 0.86 to 0.98)]
[Chart 2: Precision per query for BayesWipe-PDB, Optimized BayesWipe-PDB and BayesWipe-DET (y-axis, 0.2 to 1)]

Queries on the x-axis: model = 'continental gt'; model = 'equinox'; make = 'isuzu'; make = 'ford'; make = 'hyundai'; model = '525'; '550 gran turismo'; model = 'enclave'


45

Optimized BayesWipe-PDB: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Chart: increase in true positives and false positives, with and without optimization (y-axis, 0 to 600), against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

46

Optimized BayesWipe-PDB: Database Size Comparison

Data Size = 30000

[Chart: database size in number of tuples (y-axis, 0 to 400000) for BayesWipe-DET, BayesWipe-PDB and PrePruned BayesWipe-PDB against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

47

Optimized BayesWipe-PDB: Average Query Processing Time

Data Size = 30000

[Chart: query processing time in ms (y-axis, 0 to 80) for BayesWipe-DET, BayesWipe-PDB and PrePruned BayesWipe-PDB against noise percentage (x-axis: 1, 2, 5, 10, 15, 20)]

48

Results Summary

System | Query Processing Time | Database Size | Precision | Recall
BayesWipe-DET | Low | Same as dirty data | High | Low
BayesWipe-PDB | Very high | Very large | Increases with increase in threshold value | Decreases with increase in threshold value
Optimized BayesWipe-PDB | High | Large | Higher than or equal to precision of BayesWipe-DET in most cases (65% of the time) | Higher than or equal to recall of BayesWipe-DET


49

Conclusion

I studied the utility of considering multiple alternative rectifications in data cleaning

For that, I compared BayesWipe-PDB and BayesWipe-DET

BayesWipe-PDB always has better recall for query results, at the cost of precision

BayesWipe-PDB also requires more physical space and higher query processing time

The optimization technique provides a way to reduce the precision cost and the scalability issues
