Data Cleansing
LIU Jingyuan, Vislab
WANG Yilei, Theoretical group
What is Data Cleansing
• Data cleansing (data cleaning) is the process of detecting and correcting (or
removing) errors or inconsistencies from a record set, table, or database.
Before cleansing:
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34M          20,000 HKD
Sue    21   F       2,548 USD

After cleansing:
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       20,000 HKD
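The corrections above can be sketched in plain Python; the helper name `clean_record` and the default-currency rule are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the correction step; clean_record and the
# default-currency rule are illustrative assumptions.
def clean_record(rec):
    """Split a fused age/gender field and attach a default currency."""
    rec = dict(rec)
    age = rec["age"]
    # "34M" fuses age and gender: split the digits from the trailing letter.
    if isinstance(age, str) and age[-1] in ("M", "F"):
        rec["gender"] = age[-1]
        rec["age"] = int(age[:-1])
    # A salary without a currency code gets the default unit appended.
    if not rec["salary"].split()[-1].isalpha():
        rec["salary"] += " HKD"
    return rec

dirty = [
    {"name": "Peter", "age": 23, "gender": "M", "salary": "16,330 HKD"},
    {"name": "Tom", "age": "34M", "gender": None, "salary": "20,000"},
]
clean = [clean_record(r) for r in dirty]
```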
Data Cleansing
Why we need Data Cleansing
• Errors universally exist in real-world data:
erroneous measurements, lazy input habits, omissions, etc.
• Erroneous data leads to false conclusions and misdirected investments,
e.g. when keeping track of employees, customers, or sales volume.
• Erroneous data leads to unnecessary costs and possibly loss of reputation:
invalid mailing addresses, inaccurate buying habits and preferences.
Data Anomalies
• We use the term “anomalies” to denote the errors to be detected or corrected.
• Classification of Data Anomalies:
• Syntactical Anomalies describe characteristics concerning the format and
values used for representation of the entities.
• Semantic Anomalies hinder the data collection from being a comprehensive
and non-redundant representation of the mini-world.
• Coverage Anomalies decrease the amount of entities and entity properties
from the mini-world that are represented in the data collection.
Syntactical Anomalies
• Lexical errors name discrepancies between the structure of data items and
the specified format.
• e.g. the degree (number of values) of a tuple, #t, differs from #R, the degree
of the relation schema the tuple belongs to.
Name   Age  Gender  Size
Peter  23   M       7’1
Tom    34   M
Sue    21           5’8

Data table with lexical errors
Syntactical Anomalies
• Lexical errors
• Domain format errors specify errors where the given value for an attribute
does not conform with the anticipated format.
• Required format of name “FirstName, LastName”
Name            Age  Gender
Rachel, Green   24   F
Monica, Geller  24   F
Ross Geller     26   M

Data table with domain format errors
Syntactical Anomalies
• Lexical errors
• Domain format errors
• Irregularities are concerned with the non-uniform use of values, units and
abbreviations.
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       20,000 HKD
Sue    21   F       2,548 USD

Data table with irregularities
Semantic Anomalies
• Integrity constraint violations describe tuples that do not satisfy some
integrity constraints, which are used to describe our understanding of the
mini-world by restricting the set of valid instances (e.g. AGE≥0).
• Contradictions are values between tuples that violate some kind of
dependency between the values (e.g. the contradiction between AGE and
DATE_OF_BIRTH).
Semantic Anomalies
• Integrity constraint violations
• Contradictions
• Duplicates are two or more tuples representing the same entity from the
mini-world. The values of these tuples can be different, which may also be
specific cases of contradiction.
• Invalid tuples represent tuples that do not display anomalies of the classes
defined above but still do not represent valid entries from the mini-world.
Coverage Anomalies
• Missing values or tuples.
• Tom’s salary is missing.
• Sue’s record is missing entirely, although she is an employee of this company.
Name   Age  Gender  Salary
Peter  23   M       16,330 HKD
Tom    34   M       NULL
…      …    …       …
Data Anomalies
• Syntactical Anomalies: Lexical errors, Domain format errors, Irregularities
• Semantic Anomalies: Integrity constraint violations, Contradictions, Duplicates, Invalid tuples
• Coverage Anomalies: Missing values, Missing tuples
Data Quality
• Data quality is defined as an aggregated value over a set of quality criteria.
• With a data quality measure, we can
• decide whether we need to do data cleansing on a data collection
• assess and compare the performance of different data cleansing
methods
Data Quality
Hierarchy of data quality criteria: Completeness, Validity, Schema conformance,
Uniformity, Density, Uniqueness.

Each anomaly class primarily degrades one criterion:
Lexical error             → Schema conformance
Domain format error       → Schema conformance
Irregularities            → Uniformity
Constraint violation      → Validity
Missing value             → Density
Missing tuple             → Completeness
Duplicates                → Uniqueness
Invalid tuple             → Validity

Data anomalies affecting data quality criteria
Process of
Data Cleansing
Process of Data Cleansing
1. Data Auditing: find the types of anomalies contained in the data.
2. Workflow Specification: decide the data cleansing workflow, i.e. the sequence of operations on the data that detects and eliminates the anomalies.
3. Workflow Execution: execute the workflow after its specification and after verifying its correctness.
4. Post-Processing and Controlling: inspect the results again for tuples that are still incorrect, and correct them manually.
Data Cleansing Methods
1. Anomaly Detection:
a) rule-based detection
b) pattern enforcement detection
c) duplicate detection
2. Error Correction in terms of signals:
a) integrity constraints
b) external information
c) quantitative statistics
1. Anomaly Detection
(a) Rule-based detection specifies a collection of rules that clean data must obey.
Rules are represented as multi-attribute functional dependencies (FDs) or
user-defined functions.
Mistake         Heuristic
Illegal values  Values should not fall outside the permissible range (min, max)
Misspellings    Sorting on values often brings misspelled values next to correct values
Missing values  Presence of a default value may indicate the real value is missing
Duplicates      Sorting values by number of occurrences; more than one occurrence indicates duplicates
Ref: Rahm E, Do H H. Data cleaning: Problems and current approaches[J]. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
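The heuristics above can be sketched as plain-Python checks; the field values and the three-character prefix test for suspected misspellings are illustrative assumptions.

```python
# Sketch of the detection heuristics; thresholds and sample values are
# illustrative, not from the original slides.
from collections import Counter

def find_illegal(values, lo, hi):
    """Illegal values: outside the permissible range (lo, hi)."""
    return [v for v in values if not lo <= v <= hi]

def find_suspected_misspellings(values):
    """Sorting often places misspellings next to the correct value."""
    s = sorted(set(values))
    return [(a, b) for a, b in zip(s, s[1:]) if a[:3] == b[:3] and a != b]

def find_duplicates(values):
    """More than one occurrence of a value indicates duplicates."""
    return [v for v, n in Counter(values).items() if n > 1]

ages = [23, 34, -1, 21]
names = ["Peter", "Petre", "Sue", "Tom"]
```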
1. Anomaly Detection
(b) Pattern enforcement utilizes syntactic or semantic patterns in the data and
detects cells that do not conform to those patterns. This is the focus of data
mining models including clustering, summarization, association discovery and
sequence discovery.
e.g. relationships holding between several attributes
A_i1 = a_i1 ∧ A_i2 = a_i2 ∧ … → A_j1 = a_j1 ∧ A_j2 = a_j2

Ref: Abedjan Z, Chu X, Deng D, et al. Detecting Data Errors: Where are we and what needs to be done?[J]. Proceedings of the VLDB Endowment, 2016, 9(12): 993-1004.
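A rule of the form above can be enforced with a small sketch: flag tuples that match the rule's left-hand side but violate its right-hand side. The city/state attributes and the rule itself are hypothetical examples, not from the slides.

```python
# Hedged sketch of pattern enforcement with a hypothetical learned rule
# "city = NYC -> state = NY"; attribute names are illustrative.
def violates(rule_lhs, rule_rhs, tup):
    """True when the tuple matches every LHS condition but breaks the RHS."""
    lhs_holds = all(tup.get(a) == v for a, v in rule_lhs.items())
    rhs_holds = all(tup.get(a) == v for a, v in rule_rhs.items())
    return lhs_holds and not rhs_holds

rows = [
    {"city": "NYC", "state": "NY"},
    {"city": "NYC", "state": "CA"},   # does not conform with the pattern
]
flagged = [r for r in rows if violates({"city": "NYC"}, {"state": "NY"}, r)]
```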
1. Anomaly Detection
(c) Duplicate detection identifies multiple records that represent the same entity. In the process, conflicting values for the same attribute can be found, indicating possible errors.
Duplicate representations might differ slightly in their values, so well-chosen similarity measures improve the effectiveness of duplicate detection.
Specialized algorithms have been developed to search very large volumes of data for duplicates.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
1. Anomaly Detection
Duplicate detection
Similarity measure 1: Jaccard coefficient, comparing two token sets P and Q:

Jaccard(P, Q) = |P ∩ Q| / |P ∪ Q|

tokenize(Thomas Sean Connery) = {Thomas, Sean, Connery}
tokenize(Sir Sean Connery) = {Sir, Sean, Connery}
Jaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4

Similarity measure 2: Edit distance, the minimum number of character insertions, deletions and substitutions needed to transform one string into the other.

Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
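Both similarity measures fit in a few lines of Python; tokenization is by whitespace, matching the Connery example, and the Levenshtein variant of edit distance is assumed.

```python
# The two similarity measures for duplicate detection, in plain Python.
def jaccard(p, q):
    """Jaccard coefficient over whitespace-tokenized strings."""
    p, q = set(p.split()), set(q.split())
    return len(p & q) / len(p | q)

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```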
1. Anomaly Detection
Duplicate detection examples:
Detection algorithm:
To avoid the cost of pair-wise comparisons, the sorted-neighborhood method first
assigns a sorting key to each record and sorts all records by that key.
Then all pairs of records that appear within the same sliding window are compared.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
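A minimal sketch of the sorted-neighborhood method, assuming the record's own string as the sorting key and an optional similarity filter:

```python
# Sketch of the sorted-neighborhood method: sort by a key, then compare only
# records inside a sliding window; the key and filter are placeholders.
def sorted_neighborhood(records, key, window=3, similar=None):
    """Yield candidate duplicate pairs from a window over the sorted records."""
    ordered = sorted(records, key=key)
    for i, r in enumerate(ordered):
        # compare r only with the window - 1 records that follow it
        for s in ordered[i + 1 : i + window]:
            if similar is None or similar(r, s):
                yield r, s

people = ["Sean Connery", "Sean Connery Sir", "Zed", "Sean Conery"]
pairs = list(sorted_neighborhood(people, key=lambda x: x))
```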
2. Error Correction
(a) Integrity Constraints
Functional Dependencies:
2. Error Correction
(a) Integrity Constraints
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//
Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.
Tuple 𝑡4: Modify name to “Alice Smith” and street to “17 bridge”
2. Error Correction
(a) Integrity Constraints
The cost-based model finds another database that is consistent and minimally
differs from the original database.
Each tuple is assigned a weight; the cost of a modification is the weight times
the distance, according to a similarity metric, between the original value and
the repaired value.
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]//
Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 2005: 143-154.

cost(t) = ω(t) · Σ_{A ∈ attr(R_i)} dis(D(t, A), D′(t, A))
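The cost formula can be sketched directly; the 0/1 distance and the tuple weight below are illustrative placeholders for the paper's similarity metric, not its actual choices.

```python
# Sketch of the repair-cost formula: weight times summed attribute distance.
def repair_cost(weight, original, repaired, dis):
    """cost(t) = w(t) * sum over attributes A of dis(D(t, A), D'(t, A))."""
    return weight * sum(dis(original[a], repaired[a]) for a in original)

# Crude 0/1 distance standing in for a real string-similarity metric.
dis = lambda x, y: 0.0 if x == y else 1.0

t_old = {"name": "Alice Smth", "street": "17 bridge"}
t_new = {"name": "Alice Smith", "street": "17 bridge"}
cost = repair_cost(2.0, t_old, t_new, dis)   # one changed attribute
```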
2. Error Correction
(a) Integrity Constraints
Denial constraints are more expressive than standard integrity constraints
in that they can involve order predicates (>, <) and compare different attributes
in the same predicate.
2. Error Correction
(a) Integrity Constraints
A denial constraint expresses that a set of predicates cannot be true together for
any combination of tuples in a relation.
Ref: Chu X, Ilyas I F, Papotti P. Discovering denial constraints[J]. Proceedings of the VLDB Endowment, 2013, 6(13): 1498-1509.
∀ t_α, t_β, … ∈ R: ¬(p_1 ∧ ⋯ ∧ p_m)

e.g. there cannot exist two persons who live in the same zip code where one
has a lower salary and a higher tax rate:

∀ t_α, t_β ∈ R: ¬(t_α.ZIP = t_β.ZIP ∧ t_α.SAL < t_β.SAL ∧ t_α.TR > t_β.TR)
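The example constraint can be checked with a brute-force sketch over all ordered tuple pairs; attribute names follow the formula (ZIP, SAL, TR), and real implementations avoid this quadratic scan.

```python
# Sketch: check the zip/salary/tax-rate denial constraint over all pairs.
from itertools import permutations

def dc_violations(rows):
    """Return ordered pairs (a, b) for which all three predicates hold together."""
    return [(a, b) for a, b in permutations(rows, 2)
            if a["ZIP"] == b["ZIP"] and a["SAL"] < b["SAL"] and a["TR"] > b["TR"]]

rows = [
    {"ZIP": "10001", "SAL": 3000, "TR": 0.30},
    {"ZIP": "10001", "SAL": 5000, "TR": 0.20},  # higher salary, lower tax rate
]
```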
2. Error Correction
(b) External Information
External information includes dictionaries, knowledge bases and annotations by
experts.
It is used to identify data entry errors and correct them automatically.
For example, dictionary lookup identifies and corrects misspellings, and
dictionaries of geographic names and zip codes help to correct address data.
Attribute dependencies (birthday-age, total price-unit price/quantity, city-phone
area code…) can be used to detect wrong values and substitute missing values.
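As a sketch of the birthday-age dependency, the following derives the age implied by a birthday and flags a stored age that contradicts it; the reference date is an assumption for illustration.

```python
# Sketch: use the birthday-age dependency to detect a wrong stored age
# (or substitute a missing one); the reference date is an assumption.
from datetime import date

def check_age(birthday, age, today=date(2017, 1, 1)):
    """Return the age implied by the birthday and whether the stored age conflicts."""
    implied = today.year - birthday.year - (
        (today.month, today.day) < (birthday.month, birthday.day))
    return implied, (age is not None and age != implied)

implied, wrong = check_age(date(1990, 6, 1), 25)   # stored age conflicts
```

A missing age (`age=None`) is simply replaced by the implied value instead of being flagged.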
2. Error Correction
(c) Quantitative Statistics
A relational dependency network (RDN) captures attribute dependencies with a graphical model in order to propagate inferences throughout the database.
Compared with conventional conditional models, an RDN handles datasets with statistically dependent instances: when relational data exhibit autocorrelation, inferences about one object can inform inferences about related objects.
Compared with other probabilistic relational models, an RDN can handle cyclic autocorrelation dependencies.
Ref: Neville J, Jensen D. Relational dependency networks[J]. Journal of Machine Learning Research, 2007, 8(Mar): 653-692.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Data graph:
Each node has a number of associated attributes. A probabilistic relational model represents a joint distribution over the values of the attributes in the data graph.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Model graph:
Represents the dependencies among attributes. Attributes of an item can depend probabilistically on other attributes of the same item, as well as on attributes of other related objects.
2. Error Correction
(c) Quantitative Statistics
There are three graphs associated with relational data:
Inference graph:
During inference, an inference graph is instantiated to represent the probabilistic dependencies among all the variables in a test set.
2. Error Correction
(c) Quantitative Statistics
Learning an RDN: maximum pseudolikelihood estimation
1. Learn the dependency structure among the attributes of each object type;
2. Estimate the parameters of the local probability models for an attribute given
its parents.

if p(x_i | X − x_i) = α·x_j + β·x_k, then PA_i = {x_j, x_k}
2. Error Correction
(c) Quantitative Statistics
Inference: Gibbs sampling
1. Create the inference graph, where the values of all unobserved variables are initialized to values drawn from prior distributions;
2. Given the current state of the rest of the graph, Gibbs sampling iteratively relabels each unobserved variable by drawing from its local conditional distribution;
3. Finally, the values are drawn from a (near-)stationary distribution, and the samples can be used to estimate probabilities of interest.
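The three steps above can be sketched with a toy Gibbs sampler over two dependent binary variables; the conditional probabilities and burn-in length are made up for illustration, not from the RDN paper.

```python
# Toy Gibbs sampler: two dependent binary variables x, y; each step relabels
# a variable from its local conditional distribution, then post-burn-in
# samples estimate P(x = 1).
import random

def gibbs(steps=5000, burn_in=500, seed=0):
    rng = random.Random(seed)
    x, y = 0, 0                      # 1. initialize unobserved variables
    hits = 0
    for t in range(steps):
        # 2. relabel each variable given the current state of the other
        x = 1 if rng.random() < (0.8 if y == 1 else 0.2) else 0
        y = 1 if rng.random() < (0.8 if x == 1 else 0.2) else 0
        if t >= burn_in:             # 3. use post-burn-in samples to estimate
            hits += x
    return hits / (steps - burn_in)
```

By symmetry of the made-up conditionals, the estimate should hover around 0.5.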
Conclusion
• Data cleansing is the process of detecting and correcting errors and
inconsistencies.
• The process of data cleansing is a sequence of operations intended to
enhance the overall data quality of a data collection.
• Many data cleansing methods have been developed, aiming at error
detection and error correction in different steps of the data cleansing process.
Thank you!