DataCleaner community hangout - data quality overview

Preview:

DESCRIPTION

Slides for DataCleaner community hangout - data quality overview

Citation preview

Data quality overviewDataCleaner / Human Inferencecommunity hangout

CIA map error led to hit on mission

By Richard Norton-TaylorWednesday May 12, 1999

“A US B2 bomber hit the Chinese embassy in Belgrade not only because the CIA used an outdated map but also because of a simple map-reading error by its intelligence officers, it emerged yesterday…”

“It was the right address applied to the wrong building”

Is data quality important?

Contradictory definitions

• A consumer oriented definition:

“Data is of high quality if they are fit for their intended uses in operations, decision making and planning” (Joseph Juran).

• A more idealistic definition:

”Data are deemed of high quality if they correctly represent the real-world construct to which they refer”.

• And a touch of post-modernism:

“Defining quality is destroying quality” (R.M. Pirsig)

• What’s right?• What’s wrong?

Why data quality?

• Operational excellence– Better marketing.– Less work.

• Risk & compliance– Validate your information.– Comply with standardards.– Due diligence – check against ”blacklists”.

Characteristics of data quality

• Completeness• Validity• Consistency• Uniqueness• Timeliness• Accuracy

Characteristic: Validity

• Data represents the real world.

• Typical mechanisms to verify validity:– Reference data, eg.• Post/address registers• Own ”white lists” and ”black lists”.

– Business rules–Machine learning

KasperSørensen

KaperSørensen

Characteristic: Consistency + Uniqueness• Data should be consistent across ...– Systems• Are all the customers in all the systems?• Are the customer details the same in those

systems?

– Entities• Are all fields filled in the same way?

Characteristic: Timeliness

• Data is available to the right people, at the right time.

• Quality can be ensured in several ways:– At point of entry - ”First time right”.– Using continuous monitoring and

improvement. Typically batch-wise.

Demo

Characteristic: Accuracy

• How accurately can you use the data?

• For instance: Adresses:– One big address field?– Adresslines?– 10-30 adress details (street, housenumber, zip,

country etc.)

• Often times, it’s not just the ”details” that are important– Address lines have different formats depending on

the country.

The Data Quality Life-Cycle

Profile

Merge

Enrich

Report

Cleanse

Transform

Identify

Cleanse - Interpretation

Cleanse - Interpretation

Identify – Detect duplicates

The Data Quality Life-Cycle

Profile

Merge

Enrich

Report

Cleanse

Transform

Identify

Data QualityMonitoring

15

Data quality monitoring

16

Data quality monitoring

17

Data quality monitoring

18

Thank you

Questions