kasper-sorensen
DESCRIPTION
Slides for DataCleaner community hangout - data quality overview
Data quality overview
DataCleaner / Human Inference community hangout
CIA map error led to hit on mission
By Richard Norton-Taylor
Wednesday May 12, 1999
“A US B2 bomber hit the Chinese embassy in Belgrade not only because the CIA used an outdated map but also because of a simple map-reading error by its intelligence officers, it emerged yesterday…”
“It was the right address applied to the wrong building”
Is data quality important?
Contradictory definitions
• A consumer oriented definition:
“Data are of high quality if they are fit for their intended uses in operations, decision making and planning” (Joseph Juran).
• A more idealistic definition:
“Data are deemed of high quality if they correctly represent the real-world construct to which they refer”.
• And a touch of post-modernism:
“Defining quality is destroying quality” (R.M. Pirsig)
• What’s right?
• What’s wrong?
Why data quality?
• Operational excellence
  – Better marketing.
  – Less work.
• Risk & compliance
  – Validate your information.
  – Comply with standards.
  – Due diligence: check against “blacklists”.
Characteristics of data quality
• Completeness
• Validity
• Consistency
• Uniqueness
• Timeliness
• Accuracy
Characteristic: Validity
• Data represents the real world.
• Typical mechanisms to verify validity:
  – Reference data, e.g.
    • Post/address registers
    • Own “white lists” and “black lists”.
  – Business rules
  – Machine learning
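The reference-data and business-rule mechanisms can be sketched in a few lines of Python. This is only an illustration: the record layout, the `KNOWN_COUNTRIES` register, the `BLACKLISTED_EMAIL_DOMAINS` list and the birth-year rule are all assumptions made up for the example, not something taken from DataCleaner.

```python
from datetime import date

# Hypothetical reference data, standing in for a real register or list.
KNOWN_COUNTRIES = {"DK", "NL", "US"}
BLACKLISTED_EMAIL_DOMAINS = {"example.invalid"}


def validate(record):
    """Return a list of validity problems found in one customer record."""
    problems = []

    # Reference data: the country code must exist in the register.
    if record.get("country") not in KNOWN_COUNTRIES:
        problems.append("unknown country")

    # Own "black list": reject e-mail addresses on listed domains.
    email = record.get("email") or ""
    domain = email.rpartition("@")[2].lower()
    if domain in BLACKLISTED_EMAIL_DOMAINS:
        problems.append("blacklisted email domain")

    # Business rule (assumed): a birth year must be plausible.
    year = record.get("birth_year")
    if year is not None and not (1900 <= year <= date.today().year):
        problems.append("implausible birth year")

    return problems
```

A record that passes every check yields an empty list; each failed check adds one message, so the result can be reported per record or counted per rule.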
KasperSørensen
KaperSørensen
Characteristic: Consistency + Uniqueness
• Data should be consistent across ...
  – Systems
    • Are all the customers in all the systems?
    • Are the customer details the same in those systems?
  – Entities
    • Are all fields filled in the same way?
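The two cross-system questions can be checked mechanically once both systems share a customer key. The sketch below assumes each system is exported as a dictionary mapping a customer id to a dictionary of fields; the names `crm` and `billing` are hypothetical.

```python
def compare_systems(crm, billing):
    """Compare two systems keyed by customer id.

    Reports customers missing from either side and, for customers
    present in both, the fields whose values differ.
    """
    mismatched = {}
    for customer_id in sorted(crm.keys() & billing.keys()):
        left, right = crm[customer_id], billing[customer_id]
        fields = sorted(
            name
            for name in left.keys() | right.keys()
            if left.get(name) != right.get(name)
        )
        if fields:
            mismatched[customer_id] = fields

    return {
        "missing_in_billing": sorted(crm.keys() - billing.keys()),
        "missing_in_crm": sorted(billing.keys() - crm.keys()),
        "mismatched": mismatched,
    }
```

The comparison is exact, so a misspelled name such as "Kaper" for "Kasper" is reported as a mismatch in the `name` field.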
Characteristic: Timeliness
• Data is available to the right people, at the right time.
• Quality can be ensured in several ways:
  – At point of entry: “First time right”.
  – Using continuous monitoring and improvement. Typically batch-wise.
Demo
Characteristic: Accuracy
• How accurately can you use the data?
• For instance, addresses:
  – One big address field?
  – Address lines?
  – 10-30 address details (street, house number, zip, country etc.)
• Often, it’s not just the “details” that are important
  – Address lines have different formats depending on the country.
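One way to picture the difference between address details and address lines is to store the details and derive the lines per country. The sketch below is deliberately simplified: the `Address` fields and the two formats are assumptions for illustration (the US branch omits the state, for example), and the sample addresses used with it are made up.

```python
from dataclasses import dataclass


@dataclass
class Address:
    street: str
    house_number: str
    postal_code: str
    city: str
    country: str  # two-letter country code, e.g. "DK" or "US"


def address_lines(address):
    """Build printable address lines from the stored details."""
    if address.country == "US":
        # Simplified US convention: house number before the street name.
        return [
            f"{address.house_number} {address.street}",
            f"{address.city} {address.postal_code}",
        ]
    # Simplified convention used in e.g. Denmark: street name first,
    # postal code before the city.
    return [
        f"{address.street} {address.house_number}",
        f"{address.postal_code} {address.city}",
    ]
```

Keeping the details separate makes it possible to produce each country's line format on demand, whereas a single free-text address field cannot be reformatted reliably.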
The Data Quality Life-Cycle
Profile
Merge
Enrich
Report
Cleanse
Transform
Identify
Cleanse - Interpretation
Identify – Detect duplicates
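The slide does not say how duplicates are detected. As a minimal sketch of the idea, the example below compares every pair of names with `difflib.SequenceMatcher` from the Python standard library and reports pairs above a similarity threshold. The threshold of 0.9 is an arbitrary assumption, and this is not presented as the matching technique DataCleaner uses.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_candidates(names, threshold=0.9):
    """Return pairs of names whose similarity ratio reaches the threshold.

    Every pair is compared, so the cost grows quadratically with the
    number of names; this is fine for a demo, not for large data sets.
    """
    candidates = []
    for first, second in combinations(names, 2):
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            candidates.append((first, second))
    return candidates
```

With the names "Kasper Sørensen", "Kaper Sørensen" and "Anna Jensen", only the first two are reported as a candidate pair. The pairs are candidates for review or merging rather than confirmed duplicates.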
Data Quality Monitoring
Data quality monitoring
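The monitoring slides are screenshots, so their content is not reproduced here. As a rough sketch of what batch-wise monitoring can mean, the example below computes one metric (completeness of a single field) for each batch and flags batches that fall below a threshold. The metric choice, the batch layout and the 0.9 threshold are assumptions for illustration.

```python
def completeness(records, field):
    """Fraction of records in which the field is filled in."""
    if not records:
        return 1.0  # an empty batch has nothing missing
    filled = sum(1 for record in records if record.get(field) not in (None, ""))
    return filled / len(records)


def monitor(batches, field, threshold=0.9):
    """Return (batch name, score, ok) for each batch, in the given order.

    `batches` is a list of (name, records) pairs, for example one
    pair per nightly load.
    """
    report = []
    for name, records in batches:
        score = completeness(records, field)
        report.append((name, score, score >= threshold))
    return report
```

Tracking the same metric over successive batches shows whether quality is stable, improving or degrading, which is the point of continuous monitoring.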
Thank you
Questions