
Page 1: Data Cleansing - cse.ust.hkdimitris/5311/P7b-DC.pdf

Data Cleansing

LIU Jingyuan, Vislab

WANG Yilei, Theoretical group

Page 2:

What is Data Cleansing

• Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies from a record set, table, or database.

Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       2,548 USD

↓ Data Cleansing

Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       20,000 HKD

Page 3:

Why we need Data Cleansing

• Errors universally exist in real-world data.
  e.g. erroneous measurements, lazy input habits, omissions

• Erroneous data leads to false conclusions and misdirected investments,
  e.g. when keeping track of employees, customers, or sales volume.

• Erroneous data leads to unnecessary costs and possibly loss of reputation,
  e.g. invalid mailing addresses, inaccurate buying habits and preferences.

Page 4:

Data Anomalies

• We use the term “anomalies” for the errors to be detected or corrected.

• Classification of data anomalies:

• Syntactical Anomalies describe characteristics concerning the format and values used for the representation of the entities.

• Semantic Anomalies hinder the data collection from being a comprehensive and non-redundant representation of the mini-world.

• Coverage Anomalies decrease the amount of entities and entity properties from the mini-world that are represented in the data collection.

Page 5:

Syntactical Anomalies

• Lexical errors name discrepancies between the structure of the data items and the specified format.

• The degree of a tuple, #t, differs from #R, the degree of the relation schema for the tuple.

Name   Age  Gender  Size
Peter  23   M       7'1
Tom    34   M
Sue    21   5'8

Data table with lexical errors
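A minimal sketch of checking for this kind of lexical error: compare each tuple's degree against the degree of an assumed relation schema (the schema tuple and sample rows below are illustrative, not from the slides).

```python
# Hypothetical sketch: flag tuples whose degree (number of fields) differs
# from the degree of the relation schema, as in the table above.
SCHEMA = ("Name", "Age", "Gender", "Size")  # assumed schema

def lexical_errors(rows):
    """Return the rows whose number of fields differs from the schema degree."""
    return [row for row in rows if len(row) != len(SCHEMA)]

rows = [
    ("Peter", 23, "M", "7'1"),
    ("Tom", 34, "M"),        # Size missing entirely -> degree 3
    ("Sue", 21, "5'8"),      # Gender missing -> degree 3
]
print(lexical_errors(rows))  # the Tom and Sue tuples
```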

Page 6:

Syntactical Anomalies

• Lexical errors

• Domain format errors occur where the given value for an attribute does not conform to the anticipated format.

• Required format of name: “FirstName, LastName”

Name            Age  Gender
Rachel, Green   24   F
Monica, Geller  24   F
Ross Geller     26   M

Data table with domain format errors
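Such format checks are typically expressed as patterns; a minimal sketch using a regular expression for the required "FirstName, LastName" format (the exact pattern is an assumption for illustration):

```python
import re

# Hypothetical pattern for the required "FirstName, LastName" format.
NAME_FORMAT = re.compile(r"^[A-Z][a-z]+, [A-Z][a-z]+$")

def format_errors(names):
    """Return the names that do not conform to the anticipated format."""
    return [n for n in names if not NAME_FORMAT.match(n)]

names = ["Rachel, Green", "Monica, Geller", "Ross Geller"]
print(format_errors(names))  # ['Ross Geller'] -- the comma is missing
```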

Page 7:

Syntactical Anomalies

• Lexical errors

• Domain format errors

• Irregularities concern the non-uniform use of values, units and abbreviations.

Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       2,548 USD

Data table with irregularities
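Repairing this irregularity means normalizing all salaries to a single currency. A minimal sketch, where the exchange rate is an assumption for illustration and not from the slides:

```python
# Hypothetical sketch: normalize salaries to a single currency (HKD).
RATES_TO_HKD = {"HKD": 1.0, "USD": 7.85}  # assumed exchange rate

def to_hkd(salary: str) -> str:
    """Convert a 'amount UNIT' string to an HKD-denominated string."""
    amount_str, unit = salary.split()
    amount = float(amount_str.replace(",", ""))
    return f"{round(amount * RATES_TO_HKD[unit]):,} HKD"

print(to_hkd("2,548 USD"))  # roughly 20,000 HKD at the assumed rate
```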

Page 8:

Semantic Anomalies

• Integrity constraint violations describe tuples that do not satisfy some integrity constraints, which are used to describe our understanding of the mini-world by restricting the set of valid instances (e.g. AGE ≥ 0).

• Contradictions are values within or between tuples that violate some kind of dependency between the values (e.g. a contradiction between AGE and DATE_OF_BIRTH).

Page 9:

Semantic Anomalies

• Integrity constraint violations

• Contradictions

• Duplicates are two or more tuples representing the same entity from the mini-world. The values of these tuples may differ, which can also be a specific case of contradiction.

• Invalid tuples do not display anomalies of the classes defined above but still do not represent valid entities from the mini-world.

Page 10:

Coverage Anomalies

• Missing values or tuples.

• Tom’s salary is missing.

• Sue’s entire tuple is missing, even though she is an employee of this company.

Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       NULL
…      …    …       …
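Both kinds of coverage anomaly can be detected when a reference list of the expected entities exists. A minimal sketch under that assumption (the employee list and table are illustrative):

```python
# Hypothetical sketch: report missing values (NULL/None) and missing tuples,
# given an assumed reference list of the entities that should be present.
employees = {"Peter", "Tom", "Sue"}           # assumed reference list
table = {"Peter": "16,330 HKD", "Tom": None}  # Tom's salary is NULL

missing_values = [name for name, salary in table.items() if salary is None]
missing_tuples = sorted(employees - table.keys())
print(missing_values, missing_tuples)  # ['Tom'] ['Sue']
```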

Page 11:

Data Anomalies

• Syntactical Anomalies: lexical errors, domain format errors, irregularities

• Semantic Anomalies: integrity constraint violations, contradictions, duplicates, invalid tuples

• Coverage Anomalies: missing values, missing tuples

Page 12:

Data Quality

• Data quality is defined as an aggregated value over a set of quality criteria.

• With data quality, we can

• decide whether we need to do data cleansing on a data collection;

• assess and compare the performance of different data cleansing methods.

Page 13:

Data Quality

Hierarchy of data quality criteria:

Page 14:

[Table: marks (●) relating each anomaly (lexical error, domain format error, irregularities, constraint violation, missing value, missing tuple, duplicates, invalid tuple) to the quality criteria it affects: completeness, validity, schema conformance, uniformity, density, uniqueness]

Data anomalies affecting data quality criteria

Page 15:

Process of Data Cleansing

Page 16:

Process of Data Cleansing

1. Data Auditing is the step that finds the types of anomalies contained in the data.

2. Workflow Specification is the step that decides the data cleansing workflow: a sequence of operations on the data intended to detect and eliminate anomalies.

3. Workflow Execution is the step that executes the workflow after its specification and the verification of its correctness.

4. Post-Processing and Controlling is the step that inspects the results again to find the tuples that are still incorrect, which must be corrected manually.
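The four steps above can be sketched as a tiny pipeline. All function names and the naive "repair" are hypothetical placeholders, only meant to show how the steps fit together:

```python
# Hypothetical sketch of the four-step process: audit, specify a workflow,
# execute it, then post-process what remains.
def audit(rows):
    """Step 1: find anomaly types -- here, just negative ages."""
    return [i for i, (_, age) in enumerate(rows) if age < 0]

def make_workflow():
    """Step 2: a workflow is a sequence of operations on the data."""
    return [lambda rows: [(n, abs(a)) for n, a in rows]]  # naive repair

def execute(rows, workflow):
    """Step 3: apply each operation in order."""
    for op in workflow:
        rows = op(rows)
    return rows

rows = [("Peter", 23), ("Tom", -34)]
bad = audit(rows)
cleaned = execute(rows, make_workflow())
remaining = audit(cleaned)  # Step 4: inspect again; fix leftovers manually
print(bad, cleaned, remaining)
```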

Page 17:

Data Cleansing Methods

1. Anomaly Detection:

a) rule-based detection

b) pattern enforcement detection

c) duplicate detection

2. Error Correction in terms of signals:

a) integrity constraints

b) external information

c) quantitative statistics

Page 18:

1. Anomaly Detection

(a) Rule-based detection specifies a collection of rules that clean data must obey. Rules are represented as multi-attribute functional dependencies (FDs) or user-defined functions.

Mistake         Heuristic
Illegal values  Value should not be outside the permissible range (min, max)
Misspellings    Sorting on values often brings misspelled values next to correct values
Missing values  Presence of a default value may indicate the real value is missing
Duplicates      Sorting values by number of occurrences; more than one occurrence indicates duplicates

Ref: Rahm E, Do H H. Data cleaning: Problems and current approaches[J]. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
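Two of the heuristics in the table can be sketched directly; the range bounds and sample data are assumptions for illustration:

```python
# Hypothetical sketch of two heuristics from the table above: a permissible
# range check for illegal values and an occurrence count for duplicates.
from collections import Counter

def illegal_values(values, lo, hi):
    """Flag values outside the permissible range (min, max)."""
    return [v for v in values if not lo <= v <= hi]

def duplicates(values):
    """Flag values occurring more than once."""
    return [v for v, n in Counter(values).items() if n > 1]

ages = [23, 34, 21, 210, 34]
print(illegal_values(ages, 0, 120))  # [210]
print(duplicates(ages))              # [34]
```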

Page 19:

1. Anomaly Detection

(b) Pattern enforcement utilizes syntactic or semantic patterns in the data and detects cells that do not conform to these patterns. This is the focus of data mining models including clustering, summarization, association discovery and sequence discovery.

e.g. relationships holding between several attributes:

$A_{i1} = a_{i1} \land A_{i2} = a_{i2} \land \dots \rightarrow A_{j1} = a_{j1} \land A_{j2} = a_{j2} \land \dots$

Ref: Abedjan Z, Chu X, Deng D, et al. Detecting data errors: Where are we and what needs to be done?[J]. Proceedings of the VLDB Endowment, 2016, 9(12): 993-1004.

Page 20:

1. Anomaly Detection

(c) Duplicate detection identifies multiple records for the same entity. In the process, conflicting values for the same attribute can be found, indicating possible errors.

Duplicate representations might differ slightly in their values, so well-chosen similarity measures improve the effectiveness of duplicate detection.

Algorithms have been developed to search for duplicates over very large volumes of data.

Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.

Page 21:

1. Anomaly Detection

Duplicate detection

Similarity measure 1: Jaccard coefficient, comparing two token sets P and Q:

$Jaccard(P, Q) = \frac{|P \cap Q|}{|P \cup Q|}$

tokenize("Thomas Sean Connery") = {Thomas, Sean, Connery}
tokenize("Sir Sean Connery") = {Sir, Sean, Connery}
Jaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4

Similarity measure 2: edit distance.

Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
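Both measures are straightforward to implement; a minimal sketch reproducing the Connery example above (the edit-distance variant shown is the classic Levenshtein distance):

```python
# Sketch of the two similarity measures above.
def jaccard(p: str, q: str) -> float:
    """Jaccard coefficient over whitespace-tokenized strings."""
    ps, qs = set(p.split()), set(q.split())
    return len(ps & qs) / len(ps | qs)

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

print(jaccard("Thomas Sean Connery", "Sir Sean Connery"))  # 2/4 = 0.5
print(edit_distance("Connery", "Conery"))                  # 1
```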

Page 22:

1. Anomaly Detection

Duplicate detection algorithm:

To avoid the cost of pairwise comparisons, the sorted-neighborhood method first assigns a sorting key to each record and sorts all records according to that key. Then all pairs of records that appear within the same window are compared.

Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
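A minimal sketch of the sorted-neighborhood method; the records and the choice of sorting key are illustrative assumptions:

```python
# Hypothetical sketch of the sorted-neighborhood method: sort records by a
# key, then compare only records that fall within a sliding window, giving
# O(n * window) candidate pairs instead of all O(n^2) pairs.
def sorted_neighborhood(records, key, window=3):
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))  # candidate duplicate pair
    return pairs

records = ["Sean Connery", "Sean Connery ", "Tom Hanks", "Tim Hanks"]
# Sorting by the stripped name brings near-duplicates next to each other.
candidates = sorted_neighborhood(records, key=str.strip, window=2)
print(candidates)
```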

Page 23:

2. Error Correction

(a) Integrity Constraints

Functional Dependencies:

Page 24:

2. Error Correction

(a) Integrity Constraints

Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//

Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.

Tuple 𝑡4: Modify name to “Alice Smith” and street to “17 bridge”

Page 25:

2. Error Correction

(a) Integrity Constraints

The cost-based model finds another database that is consistent and minimally differs from the original database.

Each tuple is assigned a weight; the cost of a modification is the weight times the distance, according to a similarity metric, between the original value and the repaired value:

$cost(t) = \omega(t) \cdot \sum_{A \in attr(R_i)} dis(D(t, A), D'(t, A))$

Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]// Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.
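A minimal sketch of the cost formula; the weight, sample tuple, and the 0/1 distance standing in for a real similarity metric are all assumptions for illustration:

```python
# Hypothetical sketch of the cost model above: the cost of repairing tuple t
# is its weight times the summed distances between the original and the
# repaired attribute values.
def repair_cost(weight, original, repaired):
    dis = lambda a, b: 0.0 if a == b else 1.0  # assumed distance function
    return weight * sum(dis(original[a], repaired[a]) for a in original)

t4_before = {"name": "Alice Smth", "street": "17 bridge rd"}
t4_after  = {"name": "Alice Smith", "street": "17 bridge"}
print(repair_cost(0.5, t4_before, t4_after))  # 0.5 * (1 + 1) = 1.0
```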

Page 26:

2. Error Correction

(a) Integrity Constraints

The cost-based model finds another database that is consistent and minimally differs from the original database.

Page 27:

2. Error Correction

(a) Integrity Constraints

Denial constraints are a more expressive class of first-order constraints than classical integrity constraints, in that they can involve order predicates (>, <) and compare different attributes in the same predicate.

Page 28:

2. Error Correction

(a) Integrity Constraints

A denial constraint expresses that a set of predicates cannot be true together for any combination of tuples in a relation:

$\forall t_\alpha, t_\beta, \dots \in R: \neg(p_1 \land \dots \land p_m)$

e.g. there cannot exist two persons who live in the same zip code where one has a lower salary and a higher tax rate:

$\forall t_\alpha, t_\beta \in R: \neg(t_\alpha.ZIP = t_\beta.ZIP \land t_\alpha.SAL < t_\beta.SAL \land t_\alpha.TR > t_\beta.TR)$

Ref: Chu X, Ilyas I F, Papotti P. Discovering denial constraints[J]. Proceedings of the VLDB Endowment, 2013, 6(13): 1498-1509.
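Checking the zip/salary/tax-rate constraint above amounts to testing every ordered pair of tuples; a minimal sketch with illustrative data:

```python
# Hypothetical sketch: check the denial constraint above on every pair of
# tuples -- same zip code, lower salary, but higher tax rate is a violation.
from itertools import permutations

def violations(tuples):
    return [
        (a, b)
        for a, b in permutations(tuples, 2)
        if a["ZIP"] == b["ZIP"] and a["SAL"] < b["SAL"] and a["TR"] > b["TR"]
    ]

people = [
    {"ZIP": "10001", "SAL": 3000, "TR": 30},
    {"ZIP": "10001", "SAL": 5000, "TR": 10},  # lower-paid neighbor taxed more
    {"ZIP": "94105", "SAL": 4000, "TR": 20},
]
print(len(violations(people)))  # 1
```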

Page 29:

2. Error Correction

(b) External Information

External information includes dictionaries, knowledge bases and annotations by experts. It is used to detect data entry errors and correct them automatically.

For example, misspellings can be identified and corrected based on dictionary lookup, and dictionaries of geographic names and zip codes help to correct address data.

Attribute dependencies (birthday-age, total price-unit price/quantity, city-phone area code, …) can be used to detect wrong values and substitute missing values.
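The birthday-age dependency can be sketched directly: derive the age from the birthday and substitute it when the stored value disagrees. The reference date and sample row are assumptions for illustration:

```python
# Hypothetical sketch of using an attribute dependency (birthday -> age) to
# detect a wrong value and substitute the derived one.
from datetime import date

def age_from_birthday(birthday: date, today: date) -> int:
    years = today.year - birthday.year
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return years if had_birthday else years - 1

today = date(2017, 1, 1)                        # assumed reference date
row = {"birthday": date(1994, 6, 1), "age": 30}  # stored age is wrong
derived = age_from_birthday(row["birthday"], today)
if row["age"] != derived:
    row["age"] = derived                         # substitute the derived value
print(row["age"])  # 22
```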

Page 30:

2. Error Correction

(c) Quantitative Statistics

A relational dependency network (RDN) captures attribute dependencies with graphical models to propagate inferences throughout the database.

Compared with conventional conditional models, this model handles datasets with statistically dependent instances: when relational data exhibit autocorrelation, inferences about one object can inform inferences about related objects.

Compared with other probabilistic relational models, this model can deal with cyclic autocorrelation dependencies.

Ref: Neville J, Jensen D. Relational dependency networks[J]. Journal of Machine Learning Research, 2007, 8(Mar): 653-692.

Page 31:

2. Error Correction

(c) Quantitative Statistics

There are three graphs associated with relational data:

Data graph:

Each node has a number of associated attributes. A probabilistic relational model represents a joint distribution over the values of the attributes in the data graph.

Page 32:

2. Error Correction

(c) Quantitative Statistics

There are three graphs associated with relational data:

Model graph:

Represent the dependencies among attributes. Attributes of an item can depend probabilistically on other attributes of the same item, as well as on attributes of other related objects.

Page 33:

2. Error Correction

(c) Quantitative Statistics

There are three graphs associated with relational data:

Inference graph:

During inference, an inference graph is instantiated to represent the probabilistic dependencies among all the variables in a test set.

Page 34:

2. Error Correction

(c) Quantitative Statistics

Learning an RDN: maximum pseudolikelihood estimation

1. Learn the dependency structure among the attributes of each object type;

2. Estimate the parameters of the local probability models for an attribute given its parents.

if $p(x_i \mid X \setminus \{x_i\}) = \alpha x_j + \beta x_k$, then $PA_i = \{x_j, x_k\}$

Page 35:

2. Error Correction

(c) Quantitative Statistics

Inference: Gibbs sampling

1. Create the inference graph, where the values of all unobserved variables are initialized to values drawn from prior distributions;

2. Given the current state of the rest of the graph, Gibbs sampling iteratively relabels each unobserved variable by drawing from its local conditional distribution;

3. Finally, the values are drawn from a stationary distribution, and the samples can be used to estimate probabilities of interest.
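The three steps above can be sketched on a toy model with two dependent variables; the conditional distributions, initial values, and burn-in length are assumptions for illustration, not from the paper:

```python
# Hypothetical sketch of the three Gibbs-sampling steps above on a toy model
# with two dependent variables: x | y ~ N(0.5*y, 1) and y | x ~ N(0.5*x, 1).
import random

random.seed(0)
x, y = 5.0, -5.0                 # step 1: initialize unobserved variables
samples = []
for step in range(2000):
    # step 2: relabel each variable by drawing from its local conditional
    x = random.gauss(0.5 * y, 1.0)
    y = random.gauss(0.5 * x, 1.0)
    if step >= 1000:             # discard burn-in before the chain mixes
        samples.append((x, y))

# step 3: use the samples to estimate quantities of interest
mean_x = sum(s[0] for s in samples) / len(samples)
print(len(samples), round(mean_x, 2))
```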

Page 36:

Conclusion

• Data cleansing is the process of detecting and correcting errors and inconsistencies.

• The process of data cleansing is a sequence of operations intended to enhance the overall data quality of a data collection.

• Many data cleansing methods exist, aimed at error detection and error correction in the different steps of data cleansing.

Page 37:

Thank you!