Data Cleansing
LIU Jingyuan, Vislab
WANG Yilei, Theoretical group
What is Data Cleansing
• Data cleansing (data cleaning) is the process of detecting and correcting (or
removing) errors or inconsistencies from a record set, table, or database.
Before cleansing:
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34M          20,000 HKD
Sue    21   F       2,548 USD

After cleansing:
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       20,000 HKD
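The corrections above can be sketched in plain Python; the helper name `clean_record` and the default-currency rule are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the correction step; clean_record and the
# default-currency rule are illustrative assumptions.
def clean_record(rec):
    """Split a fused age/gender field and attach a default currency."""
    rec = dict(rec)
    age = rec["age"]
    # "34M" fuses age and gender: split the digits from the trailing letter.
    if isinstance(age, str) and age[-1] in ("M", "F"):
        rec["gender"] = age[-1]
        rec["age"] = int(age[:-1])
    # A salary without a currency code gets the default unit appended.
    if not rec["salary"].split()[-1].isalpha():
        rec["salary"] += " HKD"
    return rec

dirty = [
    {"name": "Peter", "age": 23, "gender": "M", "salary": "16,330 HKD"},
    {"name": "Tom", "age": "34M", "gender": None, "salary": "20,000"},
]
clean = [clean_record(r) for r in dirty]
```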
Data Cleansing
Why we need Data Cleansing
• Errors universally exist in real-world data:
erroneous measurements, lazy input habits, omissions, etc.
• Erroneous data leads to false conclusions and misdirected investments,
e.g. when keeping track of employees, customers, or sales volume.
• Erroneous data leads to unnecessary costs and possibly loss of reputation:
invalid mailing addresses, inaccurate buying habits and preferences.
Data Anomalies
• We use the term “anomalies” to denote the errors to be detected or corrected.
• Classification of Data Anomalies:
• Syntactical Anomalies describe characteristics concerning the format and
values used for representation of the entities.
• Semantic Anomalies hinder the data collection from being a comprehensive
and non-redundant representation of the mini-world.
• Coverage Anomalies decrease the amount of entities and entity properties
from the mini-world that are represented in the data collection.
Syntactical Anomalies
• Lexical errors name discrepancies between the structure of data items and
the specified format.
• e.g. the degree (number of values) of a tuple, #t, differs from #R, the degree
of the relation schema the tuple belongs to.
Name   Age  Gender  Size
Peter  23   M       7’1
Tom    34   M
Sue    21           5’8

Data table with lexical errors
Syntactical Anomalies
• Lexical errors
• Domain format errors specify errors where the given value for an attribute
does not conform with the anticipated format.
• Required format of name “FirstName, LastName”
Name            Age  Gender
Rachel, Green   24   F
Monica, Geller  24   F
Ross Geller     26   M

Data table with domain format errors
Syntactical Anomalies
• Lexical errors
• Domain format errors
• Irregularities are concerned with the non-uniform use of values, units and
abbreviations.
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       2,548 USD

Data table with irregularities
Semantic Anomalies
• Integrity constraint violations describe tuples that do not satisfy some
integrity constraints, which are used to describe our understanding of the
mini-world by restricting the set of valid instances (e.g. AGE≥0).
• Contradictions are values between tuples that violate some kind of
dependency between the values (e.g. the contradiction between AGE and
DATE_OF_BIRTH).
Semantic Anomalies
• Integrity constraint violations
• Contradictions
• Duplicates are two or more tuples representing the same entity from the
mini-world. The values of these tuples can be different, which may also be
specific cases of contradiction.
• Invalid tuples represent tuples that do not display anomalies of the classes
defined above but still do not represent valid entries from the mini-world.
Coverage Anomalies
• Missing values or tuples.
• Tom’s salary is missing.
• Sue’s record is missing entirely, although she is an employee of this company.
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       NULL
…      …    …       …
Data Anomalies
• Syntactical Anomalies: Lexical errors, Domain format errors, Irregularities
• Semantic Anomalies: Integrity constraint violations, Contradictions, Duplicates, Invalid tuples
• Coverage Anomalies: Missing values, Missing tuples
Data Quality
• Data quality is defined as an aggregated value over a set of quality criteria.
• With a data quality measure, we can
• decide whether we need to do data cleansing on a data collection
• assess and compare the performance of different data cleansing
methods
Data Quality
Hierarchy of data quality criteria: Completeness, Validity, Schema conformance,
Uniformity, Density, Uniqueness.

Each anomaly class primarily degrades one criterion:
Lexical error             → Schema conformance
Domain format error       → Schema conformance
Irregularities            → Uniformity
Constraint violation      → Validity
Missing value             → Density
Missing tuple             → Completeness
Duplicates                → Uniqueness
Invalid tuple             → Validity

Data anomalies affecting data quality criteria
Process of
Data Cleansing
Process of Data Cleansing
1. Data Auditing: find the types of anomalies contained in the data.
2. Workflow Specification: decide the data cleansing workflow, i.e. the sequence of operations on the data that detects and eliminates the anomalies.
3. Workflow Execution: execute the workflow after its specification and after verifying its correctness.
4. Post-Processing and Controlling: inspect the results again for tuples that are still incorrect, and correct them manually.
Data Cleansing Methods
1. Anomaly Detection:
a) rule-based detection
b) pattern enforcement detection
c) duplicate detection
2. Error Correction in terms of signals:
a) integrity constraints
b) external information
c) quantitative statistics
1. Anomaly Detection
(a) Rule-based detection specifies a collection of rules that clean data must obey.
Rules are represented as multi-attribute functional dependencies (FDs) or
user-defined functions.
Mistake         Heuristic
Illegal values  Values should not fall outside the permissible range (min, max)
Misspellings    Sorting on values often brings misspelled values next to correct values
Missing values  Presence of a default value may indicate the real value is missing
Duplicates      Sorting values by number of occurrences; more than one occurrence indicates duplicates
Ref: Rahm E, Do H H. Data cleaning: Problems and current approaches[J]. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
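The heuristics above can be sketched as plain-Python checks; the field values and the three-character prefix test for suspected misspellings are illustrative assumptions.

```python
# Sketch of the detection heuristics; thresholds and sample values are
# illustrative, not from the original slides.
from collections import Counter

def find_illegal(values, lo, hi):
    """Illegal values: outside the permissible range (lo, hi)."""
    return [v for v in values if not lo <= v <= hi]

def find_suspected_misspellings(values):
    """Sorting often places misspellings next to the correct value."""
    s = sorted(set(values))
    return [(a, b) for a, b in zip(s, s[1:]) if a[:3] == b[:3] and a != b]

def find_duplicates(values):
    """More than one occurrence of a value indicates duplicates."""
    return [v for v, n in Counter(values).items() if n > 1]

ages = [23, 34, -1, 21]
names = ["Peter", "Petre", "Sue", "Tom"]
```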
1. Anomaly Detection
(b) Pattern enforcement utilizes syntactic or semantic patterns in the data and
detects cells that do not conform to those patterns. This is the focus of data
mining models including clustering, summarization, association discovery and
sequence discovery.
e.g. relationships holding between several attributes
A_i1 = a_i1 ∧ A_i2 = a_i2 ∧ … → A_j1 = a_j1 ∧ A_j2 = a_j2

Ref: Abedjan Z, Chu X, Deng D, et al. Detecting Data Errors: Where are we and what needs to be done?[J]. Proceedings of the VLDB Endowment, 2016, 9(12): 993-1004.
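A rule of the form above can be enforced with a small sketch: flag tuples that match the rule's left-hand side but violate its right-hand side. The city/state attributes and the rule itself are hypothetical examples, not from the slides.

```python
# Hedged sketch of pattern enforcement with a hypothetical learned rule
# "city = NYC -> state = NY"; attribute names are illustrative.
def violates(rule_lhs, rule_rhs, tup):
    """True when the tuple matches every LHS condition but breaks the RHS."""
    lhs_holds = all(tup.get(a) == v for a, v in rule_lhs.items())
    rhs_holds = all(tup.get(a) == v for a, v in rule_rhs.items())
    return lhs_holds and not rhs_holds

rows = [
    {"city": "NYC", "state": "NY"},
    {"city": "NYC", "state": "CA"},   # does not conform with the pattern
]
flagged = [r for r in rows if violates({"city": "NYC"}, {"state": "NY"}, r)]
```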
1. Anomaly Detection
(c) Duplicate detection identifies multiple records that represent the same entity. In the process, conflicting values for the same attribute can be found, indicating possible errors.
Duplicate representations might differ slightly in their values, so well-chosen similarity measures improve the effectiveness of duplicate detection.
Specialized algorithms have been developed to search very large volumes of data for duplicates.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
1. Anomaly Detection
Duplicate detection
Similarity measure 1: Jaccard coefficient, comparing two token sets P and Q:

Jaccard(P, Q) = |P ∩ Q| / |P ∪ Q|

tokenize(Thomas Sean Connery) = {Thomas, Sean, Connery}
tokenize(Sir Sean Connery) = {Sir, Sean, Connery}
Jaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4

Similarity measure 2: Edit distance, the minimum number of character insertions, deletions and substitutions needed to transform one string into the other.

Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
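Both similarity measures fit in a few lines of Python; tokenization is by whitespace, matching the Connery example, and the Levenshtein variant of edit distance is assumed.

```python
# The two similarity measures for duplicate detection, in plain Python.
def jaccard(p, q):
    """Jaccard coefficient over whitespace-tokenized strings."""
    p, q = set(p.split()), set(q.split())
    return len(p & q) / len(p | q)

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```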
1. Anomaly Detection
Duplicate detection examples:
Detection algorithm:
To avoid the cost of pair-wise comparisons, the sorted-neighborhood method first
assigns a sorting key to each record and sorts all records by that key.
Then all pairs of records that appear within the same sliding window are compared.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
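A minimal sketch of the sorted-neighborhood method, assuming the record's own string as the sorting key and an optional similarity filter:

```python
# Sketch of the sorted-neighborhood method: sort by a key, then compare only
# records inside a sliding window; the key and filter are placeholders.
def sorted_neighborhood(records, key, window=3, similar=None):
    """Yield candidate duplicate pairs from a window over the sorted records."""
    ordered = sorted(records, key=key)
    for i, r in enumerate(ordered):
        # compare r only with the window - 1 records that follow it
        for s in ordered[i + 1 : i + window]:
            if similar is None or similar(r, s):
                yield r, s

people = ["Sean Connery", "Sean Connery Sir", "Zed", "Sean Conery"]
pairs = list(sorted_neighborhood(people, key=lambda x: x))
```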
2. Error Correction
(a) Integrity Constraints
Functional Dependencies:
2. Error Correction
(a) Integrity Constraints
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//
Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.
Tuple 𝑡4: Modify name to “Alice Smith” and street to “17 bridge”
2. Error Correction
(a) Integrity Constraints
The cost-based model finds another database that is consistent and minimally
differs from the original database.
Each tuple is assigned a weight; the cost of a modification is the weight times
the distance, according to a similarity metric, between the original value and
the repaired value.
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//
Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.

cost(t) = ω(t) · Σ_{A ∈ attr(R_i)} dis(D(t, A), D′(t, A))
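The cost formula can be sketched directly; the 0/1 distance and the tuple weight below are illustrative placeholders for the paper's similarity metric, not its actual choices.

```python
# Sketch of the repair-cost formula: weight times summed attribute distance.
def repair_cost(weight, original, repaired, dis):
    """cost(t) = w(t) * sum over attributes A of dis(D(t, A), D'(t, A))."""
    return weight * sum(dis(original[a], repaired[a]) for a in original)

# Crude 0/1 distance standing in for a real string-similarity metric.
dis = lambda x, y: 0.0 if x == y else 1.0

t_old = {"name": "Alice Smth", "street": "17 bridge"}
t_new = {"name": "Alice Smith", "street": "17 bridge"}
cost = repair_cost(2.0, t_old, t_new, dis)   # one changed attribute
```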
2. Error Correction
(a) Integrity Constraints
Denial constraints are more expressive than standard integrity constraints
in that they can involve order predicates (>, <) and compare different attributes
in the same predicate.
2. Error Correction
(a) Integrity Constraints
A denial constraint expresses that a set of predicates cannot be true together for
any combination of tuples in a relation.
Ref: Chu X, Ilyas I F, Papotti P. Discovering denial constraints[J]. Proceedings of the VLDB Endowment, 2013, 6(13): 1498-1509.
∀ t_α, t_β, … ∈ R: ¬(p_1 ∧ ⋯ ∧ p_m)

e.g. there cannot exist two persons who live in the same zip code where one
has a lower salary and a higher tax rate:

∀ t_α, t_β ∈ R: ¬(t_α.ZIP = t_β.ZIP ∧ t_α.SAL < t_β.SAL ∧ t_α.TR > t_β.TR)
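The example constraint can be checked with a brute-force sketch over all ordered tuple pairs; attribute names follow the formula (ZIP, SAL, TR), and real implementations avoid this quadratic scan.

```python
# Sketch: check the zip/salary/tax-rate denial constraint over all pairs.
from itertools import permutations

def dc_violations(rows):
    """Return ordered pairs (a, b) for which all three predicates hold together."""
    return [(a, b) for a, b in permutations(rows, 2)
            if a["ZIP"] == b["ZIP"] and a["SAL"] < b["SAL"] and a["TR"] > b["TR"]]

rows = [
    {"ZIP": "10001", "SAL": 3000, "TR": 0.30},
    {"ZIP": "10001", "SAL": 5000, "TR": 0.20},  # higher salary, lower tax rate
]
```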
2. Error Correction
(b) External Information
External information includes dictionaries, knowledge bases and annotations by
experts.
It is used to identify data entry errors and correct them automatically.
For example, dictionary lookup identifies and corrects misspellings, and
dictionaries of geographic names and zip codes help to correct address data.
Attribute dependencies (birthday-age, total price-unit price/quantity, city-phone
area code…) can be used to detect wrong values and substitute missing values.
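As a sketch of the birthday-age dependency, the following derives the age implied by a birthday and flags a stored age that contradicts it; the reference date is an assumption for illustration.

```python
# Sketch: use the birthday-age dependency to detect a wrong stored age
# (or substitute a missing one); the reference date is an assumption.
from datetime import date

def check_age(birthday, age, today=date(2017, 1, 1)):
    """Return the age implied by the birthday and whether the stored age conflicts."""
    implied = today.year - birthday.year - (
        (today.month, today.day) < (birthday.month, birthday.day))
    return implied, (age is not None and age != implied)

implied, wrong = check_age(date(1990, 6, 1), 25)   # stored age conflicts
```

A missing age (`age=None`) is simply replaced by the implied value instead of being flagged.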
2. Error Correction
(c) Quantitative Statistics
A relational dependency network (RDN) captures attribute dependencies with a graphical model in order to propagate inferences throughout the database.
Compared with conventional conditional models, an RDN handles datasets with statistically dependent instances: when relational data exhibit autocorrelation, inferences about one object can inform inferences about related objects.
Compared with other probabilistic relational models, an RDN can handle cyclic autocorrelation dependencies.
Ref: Neville J, Jensen D. Relational dependency networks[J]. Journal of Machine Learning Research, 2007, 8(Mar): 653-692.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Data graph:
Each node has a number of associated attributes. A probabilistic relational model represents a joint distribution over the values of the attributes in the data graph.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Model graph:
Represents the dependencies among attributes. Attributes of an item can depend probabilistically on other attributes of the same item, as well as on attributes of other related objects.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Inference graph:
During inference, an inference graph is instantiated to represent the probabilistic dependencies among all the variables in a test set.
2. Error Correction
(c) Quantitative Statistics
Learning an RDN: maximum pseudolikelihood estimation
1. Learn the dependency structure among the attributes of each object type;
2. Estimate the parameters of the local probability models for an attribute given
its parents.

if p(x_i | X − x_i) = α·x_j + β·x_k, then PA_i = {x_j, x_k}
2. Error Correction
(c) Quantitative Statistics
Inference: Gibbs sampling
1. Create the inference graph, where the values of all unobserved variables are initialized to values drawn from prior distributions;
2. Given the current state of the rest of the graph, Gibbs sampling iteratively relabels each unobserved variable by drawing from its local conditional distribution;
3. Finally, the values are drawn from a (near-)stationary distribution, and the samples can be used to estimate probabilities of interest.
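The three steps above can be sketched with a toy Gibbs sampler over two dependent binary variables; the conditional probabilities and burn-in length are made up for illustration, not from the RDN paper.

```python
# Toy Gibbs sampler: two dependent binary variables x, y; each step relabels
# a variable from its local conditional distribution, then post-burn-in
# samples estimate P(x = 1).
import random

def gibbs(steps=5000, burn_in=500, seed=0):
    rng = random.Random(seed)
    x, y = 0, 0                      # 1. initialize unobserved variables
    hits = 0
    for t in range(steps):
        # 2. relabel each variable given the current state of the other
        x = 1 if rng.random() < (0.8 if y == 1 else 0.2) else 0
        y = 1 if rng.random() < (0.8 if x == 1 else 0.2) else 0
        if t >= burn_in:             # 3. use post-burn-in samples to estimate
            hits += x
    return hits / (steps - burn_in)
```

By symmetry of the made-up conditionals, the estimate should hover around 0.5.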
Conclusion
• Data cleansing is the process of detecting and correcting errors and
inconsistencies.
• The process of data cleansing is a sequence of operations intended to
enhance the overall data quality of a data collection.
• Many data cleansing methods have been developed, aiming at error
detection and error correction in different steps of the data cleansing process.
Thank you!