Managing Data Quality
Dr Richard White
Original version by Dr Mikhaila Burgess
School of Computer Science & Informatics
Cardiff University




Session overview
What is quality?
What is Data Quality (DQ)? And why is it important anyway?
  Potential impact of poor DQ (data quality)
  Defining Data Quality
Designing for Quality Data
  Ensuring DQ in databases
So what goes wrong?
  Potential causes of poor DQ
Managing DQ
… and some exercises


Designing for Quality Data
Ensuring a level of quality in your databases


Database Quality: Design
Designed to meet requirements
Data entry process – data recorded that meets requirements
Normalised DB design
  1NF: atomic values, no duplicate records, a candidate key identifies each row
  2NF: in 1NF, and every non-key attribute depends on the whole key
  3NF: …
Access restrictions
Data Integrity


Data Integrity (a very brief overview)
The validity and consistency of stored data
Integrity constraints
  protect the database from becoming inconsistent
  or, rules that the database is not permitted to violate
Constraints can be placed on
  individual data items
  relationships between tables
Time of application of constraints
  for example, on data entry


Integrity Constraints
Some types of integrity constraints
We’ll look at 4 types of integrity constraints
  Entity integrity
  Attribute domain constraints
  Referential integrity
  Business rules
What a constraint can restrict: type, size, values, range, not null, unique


IC1: Entity Integrity
Some fields must always contain data
Key constraints
  Each row in a table identified by a unique key
  Key fields cannot contain null values
  Primary keys must be unique
Examples
  Car number plate
  Engine serial number
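As a rough illustration of entity integrity, the sketch below uses Python's built-in sqlite3 module. The table and column names (car, number_plate, engine_serial) and the sample values are made up for the example; only the idea of a unique, non-null key comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE car (
        number_plate  TEXT NOT NULL PRIMARY KEY,  -- key field: unique, never null
        engine_serial TEXT NOT NULL UNIQUE        -- a second candidate key
    )
    """
)
conn.execute("INSERT INTO car VALUES ('CA12 ABC', 'ENG-0001')")


def try_insert(plate, serial):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO car VALUES (?, ?)", (plate, serial))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_key_accepted = try_insert("CA12 ABC", "ENG-0002")  # plate already used
null_key_accepted = try_insert(None, "ENG-0003")             # key is missing
new_row_accepted = try_insert("CF34 XYZ", "ENG-0004")        # fresh key values
```

NOT NULL is written out explicitly because SQLite, unlike most database systems, would otherwise allow NULL in a non-integer primary key column.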


IC2: Attribute Domain Constraints
Some attributes can only take specific values …
Gender
  Male or Female
  M or F
  Y (not allowed!)
Title
  Mr / Mrs / Ms
Restrictive?
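One possible way to express these domains is a CHECK constraint. The sketch below uses Python's sqlite3 module; the person table and the accepts helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL CHECK (title IN ('Mr', 'Mrs', 'Ms')),
        gender TEXT NOT NULL CHECK (gender IN ('M', 'F'))
    )
    """
)


def accepts(title, gender):
    """Return True if the database accepts the values, False if a CHECK fails."""
    try:
        conn.execute(
            "INSERT INTO person (title, gender) VALUES (?, ?)", (title, gender)
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_row = accepts("Ms", "F")       # inside both domains
invalid_gender = accepts("Mr", "Y")  # 'Y' is not an allowed gender value
invalid_title = accepts("Dr", "M")   # 'Dr' is outside the title domain
```

The last call shows why the slide asks "Restrictive?": a perfectly reasonable title is rejected because the domain was defined too narrowly.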


IC3: Referential Integrity
Rules enforcing consistency in relationships between tables in a database
Primary and Foreign keys
  Every non-null foreign key value must match a primary key value in the table it references.
  If a foreign key exists in one table that refers to a specific row in another table, that other row should exist.
  There should be no invalid references.
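A minimal sketch of a foreign key rejecting an invalid reference, again using sqlite3. The department and employee tables and their contents are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER REFERENCES department (dept_id)
    )
    """
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Ann', 1)")  # valid reference

try:
    # Department 99 does not exist, so this row would be an invalid reference.
    conn.execute("INSERT INTO employee VALUES (11, 'Bob', 99)")
    dangling_reference_accepted = True
except sqlite3.IntegrityError:
    dangling_reference_accepted = False

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```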


Deleting a Primary Key
Referential integrity is not lost if no foreign key references exist.
If foreign key references DO exist in the database, several possible actions:
  NO ACTION – do not allow deletion of the record
  CASCADE – allow the deletion, and automatically delete all referencing rows
  SET NULL – set all the referencing foreign keys to null
  SET DEFAULT – set the referencing foreign keys to a default value
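The four actions can be compared side by side with a small experiment. This is a sketch using sqlite3; the parent and child tables, the default value 0, and the remaining_children helper are all hypothetical.

```python
import sqlite3


def remaining_children(on_delete):
    """Delete a referenced parent row under the given ON DELETE action.

    Returns the child rows left afterwards, or None if the delete is refused.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE child ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER DEFAULT 0"
        " REFERENCES parent (id) ON DELETE " + on_delete +
        ")"
    )
    # Parent 0 exists so that SET DEFAULT has a valid row to point at.
    conn.executemany("INSERT INTO parent VALUES (?)", [(0,), (1,)])
    conn.execute("INSERT INTO child VALUES (100, 1)")
    try:
        conn.execute("DELETE FROM parent WHERE id = 1")
    except sqlite3.IntegrityError:
        return None
    return conn.execute("SELECT id, parent_id FROM child").fetchall()


outcomes = {
    action: remaining_children(action)
    for action in ("NO ACTION", "CASCADE", "SET NULL", "SET DEFAULT")
}
```

In this sketch NO ACTION refuses the delete, CASCADE removes the child row, SET NULL leaves the child with a null reference, and SET DEFAULT repoints it at parent 0.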


IC4: Business Rules
Also called ‘enterprise constraints’
Data constraints specific to organisation
Examples
  Manager can only be responsible for up to 20 people
  Library: maximum of 2 short-loan books at any one time
  Bowling: Max of 26 lanes, and 8 people per lane
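Business rules often cannot be written as simple column constraints. One option, sketched here for the library example, is a database trigger. The short_loan table, the trigger name and the borrow helper are assumptions made for this illustration; the same rule could equally be enforced in application code.

```python
import sqlite3

MAX_SHORT_LOANS = 2  # the limit from the library example

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_loan ("
    " member_id INTEGER NOT NULL,"
    " book_id INTEGER NOT NULL PRIMARY KEY)"
)
# Refuse a new loan when the member already holds the maximum number.
conn.execute(
    f"""
    CREATE TRIGGER enforce_short_loan_limit
    BEFORE INSERT ON short_loan
    WHEN (SELECT COUNT(*) FROM short_loan
          WHERE member_id = NEW.member_id) >= {MAX_SHORT_LOANS}
    BEGIN
        SELECT RAISE(ABORT, 'short-loan limit reached');
    END
    """
)


def borrow(member_id, book_id):
    """Return True if the loan is recorded, False if the rule blocks it."""
    try:
        conn.execute("INSERT INTO short_loan VALUES (?, ?)", (member_id, book_id))
        return True
    except sqlite3.DatabaseError:
        return False


# Member 1 borrows two books, is refused a third, then member 2 borrows it.
results = [borrow(1, 101), borrow(1, 102), borrow(1, 103), borrow(2, 103)]
```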


Data Consistency
A consistent database – all integrity constraints are satisfied
Two possible approaches

1. Only allow data into the database if it is valid and meets integrity constraints

2. Allow data into the database, then check/clean later
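The difference between the two approaches can be sketched in a few lines of Python. The record layout, the is_valid rules and the sample data are invented for the example.

```python
VALID_TITLES = {"Mr", "Mrs", "Ms"}


def is_valid(record):
    """Constraints for this sketch: a non-empty id and a title from the domain."""
    return bool(record.get("id")) and record.get("title") in VALID_TITLES


incoming = [
    {"id": "A1", "title": "Mr"},
    {"id": "", "title": "Ms"},     # missing key value
    {"id": "A3", "title": "Sir"},  # title outside the allowed domain
]

# Approach 1: only valid data is allowed in; everything else is rejected at entry.
store_strict = [r for r in incoming if is_valid(r)]

# Approach 2: everything is allowed in, and invalid rows are found later.
store_lenient = list(incoming)
to_clean = [r for r in store_lenient if not is_valid(r)]
```

Approach 1 keeps the store consistent at all times but loses the two rejected records; approach 2 keeps all three but the store is inconsistent until the cleaning step has run.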


Measuring Data Quality
Tools (Trillium, Datanomic, etc)
Nic Caine
  Lights Out Integrity Subsystem
  Quality reporting
Redman
  Estimating data quality
Wang and Strong
  Cell level tagging
  Data quality algebra


Estimating Data Quality (Redman)
Estimating quality levels is difficult.
At least two methods for calculating database errors: the record and field methods
Field method:
  (Number of erred fields / Total number of fields) * 100
Record method:
  (Number of erred records / Total number of records) * 100


Estimating Data Quality
Calculating error rates:
  Field method: (6 / 50) * 100 = 12%
  Record method: (5 / 10) * 100 = 50%

10 records, each with 5 fields (Field1 to Field5); X marks an erred field

Record       Erred fields
Record 1
Record 2     X
Record 3     X X
Record 4
Record 5
Record 6     X
Record 7
Record 8
Record 9     X
Record 10    X
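The two calculations can be sketched in Python as follows. The number of errors per record matches the table (6 erred fields spread over 5 records), but which fields they fall in is assumed here, since only the counts affect the result.

```python
def field_error_rate(grid):
    """Percentage of individual fields that contain an error."""
    total_fields = sum(len(row) for row in grid)
    erred_fields = sum(1 for row in grid for cell in row if cell)
    return 100 * erred_fields / total_fields


def record_error_rate(grid):
    """Percentage of records that contain at least one erred field."""
    erred_records = sum(1 for row in grid if any(row))
    return 100 * erred_records / len(grid)


# Record number -> indexes of erred fields (the field positions are assumed).
errors = {2: [1], 3: [0, 3], 6: [2], 9: [4], 10: [1]}

# 10 records x 5 fields; True marks an erred field.
grid = [
    [field in errors.get(record, []) for field in range(5)]
    for record in range(1, 11)
]

field_rate = field_error_rate(grid)    # 6 of 50 fields  -> 12.0
record_rate = record_error_rate(grid)  # 5 of 10 records -> 50.0
```

The same data gives very different figures under the two methods, because a single erred field is enough to count a whole record as erred.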


Quality Entity-Relationship Diagram (Wang & Strong)
[Diagram. Entities: Customer, Stocks, Trades. Attributes: AccountNo, Name, Address, Telephone, Date, Buy_Sell, Quantity, Price, CurrentPrice, TickerSymbol, ResearchReport]


Quality Entity-Relationship Diagram (Wang & Strong)
[Diagram. Entities: Customer, Stocks, Trades. Attributes: AccountNo, Name, Address, Telephone, Date, Buy_Sell, Quantity, Price, CurrentPrice, TickerSymbol, ResearchReport. Quality dimensions attached to attributes: timeliness (twice), cost, credibility, format, interpretability]
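The idea of annotating attributes with the quality dimensions that matter for them can be sketched as a simple data structure. The diagram is not reproduced here, so the pairing of dimensions with attributes below is purely hypothetical, as are the TaggedAttribute class and the attributes_tagged helper.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedAttribute:
    """An attribute plus the quality dimensions attached to it."""
    name: str
    quality_dimensions: list = field(default_factory=list)


# Hypothetical pairings, for illustration only.
schema = [
    TaggedAttribute("AccountNo"),
    TaggedAttribute("Telephone", ["timeliness"]),
    TaggedAttribute("CurrentPrice", ["timeliness"]),
    TaggedAttribute(
        "ResearchReport", ["cost", "credibility", "format", "interpretability"]
    ),
]


def attributes_tagged(dimension):
    """Names of all attributes annotated with the given quality dimension."""
    return [a.name for a in schema if dimension in a.quality_dimensions]


timely = attributes_tagged("timeliness")
```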


So what goes wrong?
Some causes of poor quality data & information


Data Entry: Human Aspect
Unintentional errors in data entry
Lack of understanding
Poor training
Intentional incorrect data entry
  Malicious / Non-malicious
Poorly defined or out-of-date collection process
Multiple levels of data entry
Garbage in, Garbage out


Some examples of poor IQ
TV: INTERNET: PHONE: MOBIILE

Mizuho Securities
December 2005
Intended order: one share @ 610,000 yen (£2,893)
Order entered: 610,000 shares @ 1 yen (0.47p)
Lost over 27bn yen (previous year net profit 28.1bn yen)
Government ordered enquiry
http://news.bbc.co.uk/1/hi/business/4512962.stm


Organ Donor Register
11 April 2010
800,000 individuals’ details recorded incorrectly
“The mistake occurred in 1999 when a coding error on driving licences wrongly specifying donors’ wishes was transferred to the organ registry.”
Last year – NHS Blood and Transplant wrote to new donors with details of consent
400,000 changed; 400,000 to be contacted
45 people since died; 21 incorrect donations
www.timesonline.co.uk/tol/life_and_style/health/article7094454.ece


Data Entry: Technical Aspect
Inaccurate measuring or counting device
Errors in the data storage process
Missing data fields
Data scanner
  Poor quality data scanner
  Inappropriate scanner
    Microfiche
    Microfilm
    Aperture cards
  Incorrect set-up


Herbarium Catalogue
Approx 7 million specimens
  Pressed & dried
  Preserved in spirit
30,000 per year
HerbCat
  www.kew.org/herbcat/
ePIC – electronic Plant Information Centre
  www.kew.org/epic/


Type Specimen
Over 350,000
Original specimen
Fixed species name & description
18th century
Reference point for botanists – applying names correctly (taxonomy & systematics)
http://www.kew.org/collections/herb_types.html


Random Data

Cops Sorry For Coming To Wrong Home 50 Times
(Associated Press & Boston Globe)

Thursday 18th March 2010 – NYPD’s Identity Theft Squad deliver cheesecake to Walter (83) and Rose (82) Martin, Brooklyn, NY
  Apologise & explain … and to check people “weren’t using that address for identity theft”
50 raids over 8 years
50 errant visits blamed on computer glitch

“The snafu started when police used the address as part of what Browne called ‘random material’ to test an automated computer system that tracks crime complaints and records of other internal police information”


10 Potholes to IQ

#1 Multiple sources of the same information produce different values.

#2 Information is produced using subjective judgments, leading to bias.

#3 Systemic errors in information production lead to lost information.

#4 Large volumes of stored information make it difficult to access information in a reasonable time.

#5 Distributed heterogeneous systems lead to inconsistent definitions, formats, and values.

#6 Nonnumeric information is difficult to index.

#7 Automated content analysis across information collections is not yet available.

#8 As information consumers’ tasks and the organisational environment change, the information that is relevant and useful changes.

#9 Easy access to information may conflict with requirements for security, privacy, and confidentiality.

#10 Lack of sufficient computing resources limits access.

(Strong et al 1997)


Review
What is quality?
  Defining Quality & DQ
  Importance of quality data
DQ in databases
  Database design
  Database Integrity
Some examples of poor DQ and its impact
  http://www.iqtrainwrecks.com/
Measuring DQ
Managing data as product