Data quality overview DataCleaner / Human Inference community hangout

DataCleaner community hangout - data quality overview

DESCRIPTION

Slides for DataCleaner community hangout - data quality overview

Page 1: DataCleaner community hangout - data quality overview

Data quality overview
DataCleaner / Human Inference community hangout

Page 2: DataCleaner community hangout - data quality overview

CIA map error led to hit on mission

By Richard Norton-Taylor
Wednesday May 12, 1999

“A US B2 bomber hit the Chinese embassy in Belgrade not only because the CIA used an outdated map but also because of a simple map-reading error by its intelligence officers, it emerged yesterday…”

“It was the right address applied to the wrong building”

Is data quality important?

Page 3: DataCleaner community hangout - data quality overview

Contradictory definitions

• A consumer oriented definition:

“Data is of high quality if they are fit for their intended uses in operations, decision making and planning” (Joseph Juran).

• A more idealistic definition:

“Data are deemed of high quality if they correctly represent the real-world construct to which they refer”.

• And a touch of post-modernism:

“Defining quality is destroying quality” (R.M. Pirsig)

• What’s right?
• What’s wrong?

Page 4: DataCleaner community hangout - data quality overview

Why data quality?

• Operational excellence
– Better marketing.
– Less work.

• Risk & compliance
– Validate your information.
– Comply with standards.
– Due diligence: check against “blacklists”.

Page 5: DataCleaner community hangout - data quality overview

Characteristics of data quality

• Completeness
• Validity
• Consistency
• Uniqueness
• Timeliness
• Accuracy

Page 6: DataCleaner community hangout - data quality overview

Characteristic: Validity

• Data represents the real world.

• Typical mechanisms to verify validity:
– Reference data, e.g.
  • Post/address registers
  • Own “white lists” and “black lists”.
– Business rules
– Machine learning

Kasper Sørensen

Kaper Sørensen
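
The reference-data and business-rule mechanisms listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how DataCleaner implements validation: the `validate` function, the field names, the reference sets and the age rule are all assumptions made up for the example.

```python
# Hypothetical reference data; a real setup would load these from a
# register, a database or a maintained list.
VALID_COUNTRY_CODES = {"DK", "NL", "US"}      # a "white list"
BLOCKED_EMAIL_DOMAINS = {"example.invalid"}   # a "black list"


def validate(record: dict) -> list[str]:
    """Return the validity problems found in one customer record."""
    problems = []

    # Reference data check: the country must appear in the white list.
    if record.get("country") not in VALID_COUNTRY_CODES:
        problems.append("country not in reference data")

    # Reference data check: the email domain must not be black-listed.
    domain = record.get("email", "").rpartition("@")[2].lower()
    if domain in BLOCKED_EMAIL_DOMAINS:
        problems.append("email domain is black-listed")

    # Business rule (assumed for illustration): age must be 0 to 120.
    age = record.get("age")
    if age is None or not 0 <= age <= 120:
        problems.append("age outside 0-120")

    return problems


good = {"country": "DK", "email": "someone@example.dk", "age": 30}
bad = {"country": "XX", "email": "someone@example.invalid", "age": 200}
print(validate(good))
print(validate(bad))
```

Returning a list of problems instead of a single true/false value keeps each failed check visible, which is useful when the results feed a report.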

Page 7: DataCleaner community hangout - data quality overview

Characteristic: Consistency + Uniqueness

• Data should be consistent across ...
– Systems
  • Are all the customers in all the systems?
  • Are the customer details the same in those systems?
– Entities
  • Are all fields filled in the same way?
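
The two cross-system questions on this slide (are all customers present everywhere, and do their details agree) can be answered with plain set operations. The following is a sketch under stated assumptions: the `crm` and `billing` extracts, their fields and the `compare_systems` helper are hypothetical and only serve the illustration.

```python
# Hypothetical extracts from two systems, keyed by customer id.
crm = {
    1: {"name": "Kasper Sørensen", "city": "Copenhagen"},
    2: {"name": "Jane Doe", "city": "Arnhem"},
}
billing = {
    1: {"name": "Kasper Sørensen", "city": "Kobenhavn"},
    3: {"name": "John Smith", "city": "Boston"},
}


def compare_systems(a: dict, b: dict) -> dict:
    """Report ids missing from either system and fields that differ."""
    mismatches = {}
    for cid in a.keys() & b.keys():  # customers present in both systems
        fields = a[cid].keys() | b[cid].keys()
        differing = sorted(f for f in fields if a[cid].get(f) != b[cid].get(f))
        if differing:
            mismatches[cid] = differing
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "mismatches": mismatches,
    }


report = compare_systems(crm, billing)
print(report)
```

Exact comparison flags "Copenhagen" versus "Kobenhavn" as a mismatch even though both refer to the same city, so in practice values are usually standardised before they are compared.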

Page 8: DataCleaner community hangout - data quality overview

Characteristic: Timeliness

• Data is available to the right people, at the right time.

• Quality can be ensured in several ways:
– At point of entry - “First time right”.
– Using continuous monitoring and improvement. Typically batch-wise.

Demo

Page 9: DataCleaner community hangout - data quality overview

Characteristic: Accuracy

• How accurately can you use the data?

• For instance, addresses:
– One big address field?
– Address lines?
– 10-30 address details (street, house number, zip, country etc.)

• Often, it’s not just the “details” that are important
– Address lines have different formats depending on the country.
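
To illustrate why storing separate address details helps, here is a small sketch that renders the same structured address as country-specific address lines. The `Address` class and `address_lines` function are hypothetical, and the two formats are deliberately simplified: a US-style layout with the house number first, and a layout with the street first and the postal code before the city, as used in the Netherlands and Denmark among others.

```python
from dataclasses import dataclass


@dataclass
class Address:
    street: str
    house_number: str
    postal_code: str
    city: str
    country: str       # ISO 3166-1 alpha-2 code, e.g. "NL" or "US"
    region: str = ""   # state or province, only used by some countries


def address_lines(addr: Address) -> list[str]:
    """Render address lines in a (simplified) country-specific format."""
    if addr.country == "US":
        # House number first; city, state and ZIP code on the second line.
        return [
            f"{addr.house_number} {addr.street}",
            f"{addr.city}, {addr.region} {addr.postal_code}",
        ]
    # Assumed fallback: street first, postal code before the city.
    return [
        f"{addr.street} {addr.house_number}",
        f"{addr.postal_code} {addr.city}",
    ]


us = Address("Main St", "123", "62701", "Springfield", "US", "IL")
nl = Address("Hoofdstraat", "12", "1234 AB", "Amsterdam", "NL")
print(address_lines(us))
print(address_lines(nl))
```

Going the other way, from one big address field back to separate details, is much harder, which is why the level of detail chosen at storage time matters.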

Page 10: DataCleaner community hangout - data quality overview

The Data Quality Life-Cycle

[Cycle diagram; stages as extracted: Profile, Merge, Enrich, Report, Cleanse, Transform, Identify]

Page 11: DataCleaner community hangout - data quality overview

Cleanse - Interpretation

Page 12: DataCleaner community hangout - data quality overview

Cleanse - Interpretation

Page 13: DataCleaner community hangout - data quality overview

Identify – Detect duplicates
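
One common way to find duplicate candidates is fuzzy string matching on names. The sketch below uses `SequenceMatcher` from Python's standard `difflib` module; it is not DataCleaner's own matching logic, and the helper names, the sample names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0, ignoring letter case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def find_duplicate_candidates(names, threshold=0.85):
    """Return all pairs of names whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


names = ["Kasper Sørensen", "Kaper Sørensen", "Jane Doe"]
candidates = find_duplicate_candidates(names)
print(candidates)
```

Comparing every pair grows quadratically with the number of records, so larger data sets typically group records first (for example by postal code) and only compare within each group. The pairs found are candidates for review or merging, not confirmed duplicates.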

Page 14: DataCleaner community hangout - data quality overview

The Data Quality Life-Cycle

[Cycle diagram; stages as extracted: Profile, Merge, Enrich, Report, Cleanse, Transform, Identify; additional label: Data Quality Monitoring]

Page 15: DataCleaner community hangout - data quality overview

Data quality monitoring

Page 16: DataCleaner community hangout - data quality overview

Data quality monitoring

Page 17: DataCleaner community hangout - data quality overview

Data quality monitoring

Page 18: DataCleaner community hangout - data quality overview

Thank you

Questions