Utility of Considering Multiple Alternative Rectifications in Data Cleaning

Preet Inder Singh Rihan, Master's Thesis

Committee Members
◦ Dr. Subbarao Kambhampati (Chair)
◦ Dr. Huan Liu
◦ Dr. Hasan Davulcu

Importance of Data Cleaning

Data is one of the most useful resources
◦ Crucial to numerous important decision-making and analysis processes

High volume, variety and velocity of data make it difficult to obtain data in the cleanest form

Sources and Types of Noise

A few reasons for the noise present in data
◦ Imperfect sensing devices/information extractors
◦ Heterogeneity in data from multiple sources
◦ Errors in data entry, misspellings, etc.

Data that suffers from quality issues is called Dirty data

Example of dirty data

Make   Model    Cartype  Condition  Drivetrain
TSX    Acura    -        Used       FWD
Honda  Corolla  Sedan    New        FWD
Honda  Civic    Sdna     Used       FWD

Current Techniques

◦ De-Duplication

◦ Inconsistencies

◦ Schema Noise

◦ Outlier detection

◦ Conditional functional dependencies

◦ BayesWipe

Types of Problems Industry Solutions Academic Approaches

Common themes in Current Data Cleaning Techniques

Considers multiple rectifications

Picks most likely rectification deterministically, using:
◦ fixed rules
◦ domain experts

Example

Dirty Database

TID  Make    Model    Cartype  Condition  Drivetrain
1    Honda   Civic    Sedan    Used       FWD
2    Honda   Corolla  Sedan    New        FWD
5    Toyota  Corolla  Sedan    New        FWD

Candidate rectifications of the dirty tuple (TID 2)

                 TID  Make    Model    Cartype  Condition  Drivetrain  Probability
Dirty Tuple      2    Honda   Corolla  Sedan    New        FWD
Rectification 1  2    Honda   Civic    Sedan    New        FWD         0.6
Rectification 2  2    Toyota  Corolla  Sedan    New        FWD         0.4
True Tuple       2    Toyota  Corolla  Sedan    New        FWD

Deterministic clean output

TID  Make   Model  Cartype  Condition  Drivetrain
2    Honda  Civic  Sedan    New        FWD

Data Cleaning Approaches: Problems

Hard to get the perfect fixed rules/domain knowledge

Partially correct rules/knowledge may ignore true rectification
◦ Results in information loss
◦ Irrecoverable when original data is decoupled from cleaned outcome


Alternative Approach: Considering multiple alternative candidates after data cleaning

Keep multiple alternative rectifications in a probabilistic database

Advantages:
◦ Prevents information loss
◦ Generates query results with higher recall

Alternative Approach: Potential Challenges

Keeping multiple alternative rectifications of a dirty tuple poses some challenges:
◦ Query results with many irrelevant results (low precision)
◦ Query processing over probabilistic data
◦ Size of probabilistic data

Problem Statement

To investigate the trade-offs of considering multiple alternative rectifications of a dirty data instance against having a deterministically selected unique clean rectification of a dirty data instance

Agenda
◦ Motivation
◦ Deterministic and Probabilistic clean outcomes
◦ Investigation Strategy
◦ Optimization technique
◦ Experiments and Results

Background System

For investigation, BayesWipe [1] is used
◦ End-to-end probabilistic data cleaning system
◦ Cleans structured data
◦ Handles data quality issues due to
  ◦ Inconsistency
  ◦ Incompleteness
  ◦ Substitutions

[1] Y. Hu, S. De, Y. Chen, and S. Kambhampati. Bayesian data cleaning for web data. arXiv preprint arXiv:1204.3677, 2012

BayesWipe

For every tuple T in the dirty database:
◦ A set of rectifications (T*) is generated
◦ Every T* has a probability value P(T*|T)
◦ P(T*|T) is the system's confidence in claiming T* to be the true tuple

BayesWipe's Clean Outcomes

BayesWipe is used to produce outcomes in two modes
◦ BayesWipe-DET: only the most likely rectification
◦ BayesWipe-PDB: all rectifications with associated probabilities

[Figure: Dirty Data → BayesWipe → BayesWipe-DET and BayesWipe-PDB]
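The two output modes can be illustrated with a minimal Python sketch. It is not BayesWipe's code: the function names `to_det` and `to_pdb` are hypothetical, and the candidate list simply reuses the two rectifications (0.6 and 0.4) from the earlier dirty-tuple example.

```python
from typing import List, Tuple

# A rectification is (candidate tuple T*, P(T*|T)). Names here are illustrative only.
Rectification = Tuple[Tuple[str, ...], float]

def to_pdb(candidates: List[Rectification]) -> List[Rectification]:
    """PDB mode: keep every rectification together with its probability."""
    return list(candidates)

def to_det(candidates: List[Rectification]) -> Tuple[str, ...]:
    """DET mode: keep only the most likely rectification and drop the rest."""
    best, _ = max(candidates, key=lambda c: c[1])
    return best

# Rectifications of the dirty tuple (Honda, Corolla, Sedan, New, FWD).
candidates = [
    (("Honda", "Civic", "Sedan", "New", "FWD"), 0.6),
    (("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.4),
]
det_output = to_det(candidates)
pdb_output = to_pdb(candidates)
```

In this example the deterministic output is the Honda Civic tuple, so the true Toyota Corolla tuple survives only in the probabilistic output.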

TID  Make    Model    Cartype  Condition  Drivetrain  Probability
1    Honda   Civic    Sedan    Used       FWD         0.9
1    Honda   Civik    Sedan    New        FWD         0.1
2    Toyota  Corolla  Sedan    New        FWD         0.2
2    Toyota  Corolla  Sedan    Used       FWD         0.2
2    Honda   Civic    Sedan    Used       FWD         0.6
3    Honda   Civik    Sedan    New        FWD         0.05
4    Toyota  Corolla  Sedan    New        FWD         0.9
4    Honda   Corolla  Sedan    New        FWD         0.1

Example: BayesWipe-DET and BayesWipe-PDB

1 Honda Civik Sedan Used FWD

BayesWipe

BayesWipe-PDB Type

Types of probabilistic database
◦ Tuple Independent
◦ Block Independent Disjoint (BID)
◦ C-Table

BayesWipe-PDB type
◦ Block Independent Disjoint (BID)

Honda Civic Sdfkshf Used FWD 0.05

Block Independent Disjoint Probabilistic database

BayesWipe-PDB Storage

BayesWipe-PDB is stored in a relational database:
◦ SQL Server

Query Processing Engine:
◦ Mystiq [2], a prototype probabilistic database management system

[2] Boulos, Jihad, Nilesh Dalvi, Bhushan Mandhani, Shobhit Mathur, Chris Re, and Dan Suciu. "MYSTIQ: a system for finding more answers by using probabilities." In SIGMOD, pp. 891-893. ACM, 2005.

Investigation Strategy

Criteria to compare BayesWipe-PDB and BayesWipe-DET
◦ Accuracy of query results
◦ Scalability

Accuracy of Query results

To check if BayesWipe-PDB makes improvement in query results

Query results from BayesWipe-PDB and BayesWipe-DET are compared using:
◦ Precision of query results
◦ Recall of query results
◦ Total increase in true positives over multiple queries
◦ Total increase in false positives over multiple queries

Query Results: BayesWipe-PDB and BayesWipe-DET

BayesWipe-DET query results
◦ Set of deterministic tuples

BayesWipe-PDB query results
◦ Set of probabilistic tuples
◦ Multiple rectifications of a tuple

$\sigma_{\text{make} = \text{'Honda'}}$

Deterministic Query Results

1 Honda Civic Sedan Used FWD 0.9

4 Honda Corolla Sedan New FWD 0.1

$\sigma_{\text{make} = \text{'Honda'}}$

1Honda Civic Sedan Used FWD 0.9

Honda Civik Sedan New FWD 0.1

Toyota Corolla Sedan New FWD 0.2

Toyota Corolla Sedan Used FWD 0.2

Probabilistic Query Results

Accuracy of Query Results: Evaluation Challenges

Precision/Recall calculation is not straightforward

Evaluation Challenges:
◦ Defining accuracy/relevance of a resultant tuple
◦ Precision/Recall for query results from BayesWipe-PDB

Defining Relevance/Accuracy of Resultant Tuple

[Figure: Ground Truth Result vs. Observed Result for a query; example tuple: 3 Honda Civic Sedan Used AWD]

Relevance of Resulting Tuples

In this precision and recall computation
◦ Relevance is defined only by tuple ids

A probabilistic/deterministic tuple from query results is relevant if:
◦ Its tuple id appears in the query results from the ground truth
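As a small illustration of the tuple-id based relevance rule, the following Python sketch computes precision and recall from sets of tuple ids. The function name, the sample ids, and the convention of returning 0.0 for an empty result or empty ground truth are assumptions made for this example.

```python
def precision_recall(result_ids, truth_ids):
    """Precision and recall where a result is relevant iff its tuple id
    also appears in the ground-truth query results."""
    result_ids, truth_ids = set(result_ids), set(truth_ids)
    relevant = result_ids & truth_ids
    # Convention assumed here: an empty denominator yields 0.0.
    precision = len(relevant) / len(result_ids) if result_ids else 0.0
    recall = len(relevant) / len(truth_ids) if truth_ids else 0.0
    return precision, recall

# Hypothetical ids: the query on clean data returns tuples 1, 2, 3,
# while the cleaned database returns tuples 1, 3, 4.
precision, recall = precision_recall(result_ids=[1, 3, 4], truth_ids=[1, 2, 3])
```

Because only ids are compared, a returned tuple counts as relevant even if some of its attribute values differ from the ground truth.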

Precision/Recall of Probabilistic Query Results

Precision and recall are not defined for probabilistic query results

True precision/recall for probabilistic query results
◦ Calculated over all possible worlds*
◦ Overall precision/recall is the weighted sum of precision/recall over all possible worlds

Exponential number of possible worlds

* A possible world is a state of a probabilistic database in which each random variable in the PDB has been assigned one of its possible values
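To make the possible-worlds computation concrete, here is a minimal Python sketch for a tiny block-independent-disjoint table. It assumes that each world is weighted by its probability, that every block's probabilities sum to 1, and that an empty result has precision 0.0; the table, the ground-truth ids, and all names are hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical BID table: tuple id -> alternatives as (make, probability).
pdb = {
    1: [("Honda", 0.9), ("Honda", 0.1)],
    2: [("Toyota", 0.4), ("Honda", 0.6)],
    4: [("Toyota", 0.9), ("Honda", 0.1)],
}
truth_ids = {1, 2}  # assumed ground-truth answer to make = 'Honda'

def possible_worlds(pdb):
    """Yield (world, probability); a world picks one alternative per block."""
    tids = sorted(pdb)
    for choice in product(*(pdb[t] for t in tids)):
        probability = prod(p for _, p in choice)
        yield {t: make for t, (make, _) in zip(tids, choice)}, probability

def expected_precision_recall(pdb, truth_ids, make="Honda"):
    """Probability-weighted sum of per-world precision and recall."""
    exp_precision = exp_recall = 0.0
    for world, probability in possible_worlds(pdb):
        result = {t for t, m in world.items() if m == make}
        tp = len(result & truth_ids)
        exp_precision += probability * (tp / len(result) if result else 0.0)
        exp_recall += probability * (tp / len(truth_ids) if truth_ids else 0.0)
    return exp_precision, exp_recall

n_worlds = sum(1 for _ in possible_worlds(pdb))
exp_precision, exp_recall = expected_precision_recall(pdb, truth_ids)
```

Three blocks with two alternatives each already give 2 × 2 × 2 = 8 worlds; the count multiplies with every additional block, which is why exact evaluation does not scale.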

Approximate Precision/Recall of Probabilistic Results

Two ways to approximate precision/recall calculation
◦ Consider partial belongingness of tuples
  ◦ Where P(t) is the probability of the tuple t
◦ Use a pass threshold to classify probabilistic tuples as query results or not
  ◦ Precision and recall are calculated using the standard formula for query results
  ◦ Traditional way [3] to handle uncertain results

[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, pages 586–597. VLDB Endowment, 2002.

Precision/Recall Approximation Using a Pass Threshold

A fixed pass threshold θ is applied to the total probability
◦ Total probability: aggregated probabilities of all rectifications of a probabilistic tuple

Total probability ≥ θ
◦ Yes: query result
◦ No: rejected

[Figure: probabilistic query results (e.g. 4 Honda Corolla Sedan New FWD 0.1) → aggregated probability per TID → threshold θ = 0.2 → determinized results (TIDs)]
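The pass-threshold step can be sketched in a few lines of Python. The function name `determinize` and the input layout (a list of `(tid, probability)` pairs, one per returned rectification) are assumptions for illustration; the sample rows follow the make = 'Honda' example with θ = 0.2.

```python
from collections import defaultdict

def determinize(prob_results, theta):
    """Sum the probabilities of all returned rectifications per tuple id and
    keep the ids whose total probability is at least the pass threshold."""
    totals = defaultdict(float)
    for tid, probability in prob_results:
        totals[tid] += probability
    return {tid for tid, total in totals.items() if total >= theta}

# Rectifications matching make = 'Honda': two for tuple 1, one for tuple 4.
prob_results = [(1, 0.9), (1, 0.1), (4, 0.1)]
accepted = determinize(prob_results, theta=0.2)
```

Tuple 1 passes with a total probability of 1.0, while tuple 4 (total 0.1) is rejected. The resulting id set can then be scored with the standard precision and recall formulas.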

Scalability

To check if BayesWipe-PDB is scalable

Two comparisons are performed
◦ Size of BayesWipe-PDB vs. size of BayesWipe-DET
◦ Query processing time over BayesWipe-PDB vs. query processing time over BayesWipe-DET

Potential issues with BayesWipe-PDB

Number of rectifications with low probabilities increases as data size increases

Potential issues:
◦ Query results with very low precision
  ◦ One way to control this is a good pass threshold
◦ Scalability issues
  ◦ High physical space
  ◦ High query processing time

Optimization Technique

Reason for potential issues
◦ Too many irrelevant results in BayesWipe-PDB

Pre-Pruning, an optimization technique
◦ Prevents irrelevant rectifications from being stored in BayesWipe-PDB
◦ Checks every rectification T* of tuple T
◦ Stores T* in BayesWipe-PDB if
  ◦ T* passes the pre-pruning algorithm

PrePruned BayesWipe-PDB

Probabilistic database stores multiple candidate clean versions T* after Pre-Pruning

[Figure: Dirty data → BayesWipe → multiple alternatives (T*) → Pruning → BayesWipe-PDB]

Pre-Pruning Algorithm

The Pre-Pruning Algorithm considers every candidate clean version and its associated probability, i.e. P(T*|T)

[Figure: P(T*|T) axis with thresholds α and β; below α: Rejected; between α and β: Further Investigated; above β: Accepted]

Pre-Pruning Algorithm If

◦ T* is kept as clean version of the tuple T

In the case of rare, legitimate tuples, it is possible that both T and T* have low probabilities

The values of α, β and γ are set to 0.009, 0.5 and 5 respectively

For this, the prior of tuple T and the prior of tuple T* are considered

$$P[T] = \frac{\#\,\text{times tuple } T \text{ occurs in dirty database}}{\#\,\text{tuples in dirty database}}$$
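The prior and the three-way decision on P(T*|T) can be sketched in Python as follows. The prior follows the definition above, and the α, β, γ values are the ones quoted in the slides. The exact test applied in the "further investigated" band is not recoverable from the slides, so the prior-ratio rule below (keep T* when its prior is at least γ times the prior of T) is purely an assumption, as are all function names.

```python
from collections import Counter

ALPHA, BETA, GAMMA = 0.009, 0.5, 5  # values quoted in the slides

def make_prior(dirty_db):
    """Return P[T] = (# occurrences of T) / (# tuples in the dirty database)."""
    counts = Counter(dirty_db)
    total = len(dirty_db)
    return lambda t: counts[t] / total

def keep_rectification(t, t_star, p_star_given_t, prior):
    """Decide whether rectification T* of dirty tuple T is stored in the PDB."""
    if p_star_given_t < ALPHA:
        return False  # rejected outright
    if p_star_given_t > BETA:
        return True   # accepted outright
    # ASSUMPTION: the slides only say that both priors are considered here;
    # this ratio test is a hypothetical stand-in for the missing condition.
    return prior(t_star) >= GAMMA * prior(t)

# Hypothetical dirty database: ten correct tuples and one misspelled tuple.
dirty_db = [("Honda", "Civic")] * 10 + [("Honda", "Civik")]
prior = make_prior(dirty_db)
kept = keep_rectification(("Honda", "Civik"), ("Honda", "Civic"), 0.3, prior)
```

With these counts the misspelled tuple has prior 1/11 and the candidate 10/11, so the candidate is kept under the assumed ratio rule.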


Experimental Setup

Experiments are performed on a used car dataset crawled from Google Base

Dataset size varies from 1000 to 30000 tuples

Synthetic noise is introduced at random into the clean dataset
◦ Noise level varies from 1% to 20%

Random queries were selected to compare the quality of query results extracted from BayesWipe-DET and BayesWipe-PDB

Experimental Results

Present findings of the comparison of BayesWipe-PDB and BayesWipe-DET on
◦ Accuracy of query results (Precision and Recall)
◦ Scalability (Size and Query processing time)

Present the effect of the optimization technique on BayesWipe-PDB

BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

Data Size = 2500, Noise = 10%, Threshold = 0.1

[Figure: recall and precision of BayesWipe-PDB and BayesWipe-DET per query. Queries: make = acura; model = outlander sports; cartype = sedan; make = bmw & condition = used; model = jetta; model = cooper s; model = h3 mini; Average]

BayesWipe-PDB vs. BayesWipe-DET: Effect of Threshold

Dataset size = 30000

[Figure: average BayesWipe-PDB precision and average BayesWipe-PDB recall vs. threshold (0 to 0.8); reference values: DET Recall = 0.82, DET Precision = 0.997]

Accuracy of Query Results from Multiple Random Queries

The average of precision and recall values over multiple random queries does not give a good picture

Comparison on non-normalized metrics over 100 random queries
◦ Total increase in the number of true positives generated
  ◦ True Positives (BayesWipe-PDB) – True Positives (BayesWipe-DET)
◦ Total increase in the number of false positives generated
  ◦ False Positives (BayesWipe-PDB) – False Positives (BayesWipe-DET)
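A short Python sketch of these two non-normalized metrics, summed over a batch of queries, is given below. The input layout (one triple of ground-truth, DET, and determinized PDB tuple-id sets per query) and the function name are assumptions for illustration.

```python
def total_gains(queries):
    """Sum, over all queries, the increase in true positives and in false
    positives of the PDB results relative to the DET results."""
    tp_gain = fp_gain = 0
    for truth_ids, det_ids, pdb_ids in queries:
        tp_gain += len(pdb_ids & truth_ids) - len(det_ids & truth_ids)
        fp_gain += len(pdb_ids - truth_ids) - len(det_ids - truth_ids)
    return tp_gain, fp_gain

# Two hypothetical queries: (ground truth ids, DET result ids, PDB result ids).
queries = [
    ({1, 2, 3}, {1, 2}, {1, 2, 3, 7}),
    ({4, 5}, {4}, {4, 5}),
]
tp_gain, fp_gain = total_gains(queries)
```

Here the PDB results add two true positives and one false positive across the two queries.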

BayesWipe-PDB vs. BayesWipe-DET: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Figure: increase in true positives and increase in false positives vs. noise percentage (1, 2, 5, 10, 15, 20)]

BayesWipe-PDB vs. BayesWipe-DET: Size of Database

Data Size = 30000

[Figure: BayesWipe-PDB database size and BayesWipe-DET database size vs. noise percentage (1, 2, 5, 10, 15, 20); vertical axis ticks from 100000 to 400000]

BayesWipe-PDB vs. BayesWipe-DET: Average Query Processing Time

Data Size = 30000

[Figure: BayesWipe-PDB query processing time and BayesWipe-DET query processing time vs. noise percentage (1, 2, 5, 10, 15, 20)]

Optimized BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

[Figure: recall and precision of BayesWipe-PDB, Optimized BayesWipe-PDB, and BayesWipe-DET per query; recall axis ticks 0.9, 0.94, 0.98. Query labels are partly truncated in the source: model = 'continental gt'; model = 'equinox'; make = 'is…; make = 'fo…; make = 'hyundai'; model = '550 gran turis…; model = 'enclave…]

Optimized BayesWipe-PDB: True Positives and False Positives Gain

Data Size = 30000, Threshold = 0.1

[Figure: increase in true positives, increase in true positives (optimized), increase in false positives, and increase in false positives (optimized) vs. noise percentage (1, 2, 5, 10, 15, 20)]

Optimized BayesWipe-PDB: Database Size Comparison

Data Size = 30000

[Figure: BayesWipe-DET, BayesWipe-PDB, and PrePruned BayesWipe-PDB database size vs. noise percentage (1, 2, 5, 10, 15, 20); vertical axis ticks from 100000 to 400000]

Optimized BayesWipe-PDB: Average Query Processing Time

Data Size = 30000

[Figure: BayesWipe-DET, BayesWipe-PDB, and PrePruned BayesWipe-PDB query processing time vs. noise percentage (1, 2, 5, 10, 15, 20)]

Results Summary

BayesWipe-DET
◦ Query processing time: Low
◦ Database size: Same as dirty data
◦ Precision: High
◦ Recall: Low

BayesWipe-PDB
◦ Query processing time: Very high
◦ Database size: Very large
◦ Precision: Increases with increase in threshold value
◦ Recall: Decreases with increase in threshold value

Optimized BayesWipe-PDB
◦ Query processing time: High
◦ Database size: Large
◦ Precision: Higher than or equal to the precision of BayesWipe-DET in most cases (65% of the time)
◦ Recall: Higher than or equal to the recall of BayesWipe-DET

Conclusion

I studied the utility of considering multiple alternative rectifications in data cleaning

For that, I compared BayesWipe-PDB and BayesWipe-DET

BayesWipe-PDB always has better recall for query results, at the cost of precision

BayesWipe-PDB also requires more physical space and higher query processing time

The optimization technique provides a way to reduce the loss of precision and the scalability issues
