Data Quality Class 4

Data Quality Class 4. Goals Discuss Project Midterm Statistical Process Control Data Quality Rules

Data Quality

Class 4

Goals

• Discuss Project

• Midterm

• Statistical Process Control

• Data Quality Rules

Project

• Information is now on the web site

• Final version is due on July 26

• Data will be available by end of the week

• We will spend some time discussing goals today

Midterm

• Written exam on July 5th

• Will cover:
– Cost of low data quality
– Dimensions of data quality
– Domains and mappings
– SPC
– Data Quality Rules

Statistical Process Control

• Developed by Shewhart at Bell Labs from the 1920s through the 1950s

• Notions of Variation vs. Control

• Important in the original context of both equipment manufacture and service quality

Variation

• Natural variations

• Defects

• Errors

• Mistakes

• Some variations are meaningful, some are not

Causes of Variation

• Common, or chance, causes
– minor fluctuations or differences
– not necessarily important to correct
– observed to form a normal distribution

• Assignable, or special, causes
– variation that can be traced to a specific, identifiable cause

• We expect to see the normal variations, but assignable cause variations are interesting

Example

• Measure railroad on-time performance
– Trains are typically on time or a few minutes late
– One night, the trains are all 1 hour late due to electrical problems – a special cause

Statistical Control

• State in which variations observed can be attributed to common causes that do not change with time

Pareto Principle

• In a population that contributes to a common effect, relatively few of the contributors account for the bulk of the effect

• Example: code performance analysis

• Can be used to direct analysis
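As a rough illustration of using the Pareto principle to direct analysis, the sketch below ranks contributors by size and keeps the largest ones until they cover a chosen share of the total. The function name, the 80% threshold, and the profiling figures are assumptions made for this example, not part of the lecture.

```python
def pareto_top_contributors(costs, threshold=0.8):
    """Return the largest contributors whose combined share reaches `threshold`."""
    total = sum(costs.values())
    selected, running = [], 0.0
    # Walk contributors from largest to smallest.
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        if running >= threshold * total:
            break
        selected.append(name)
        running += cost
    return selected


# Hypothetical profile: seconds spent in each function of a program.
profile = {"parse": 62.0, "render": 21.0, "log": 9.0, "init": 5.0, "cleanup": 3.0}
hot_spots = pareto_top_contributors(profile)
print(hot_spots)
```

In this made-up profile, two of the five functions account for more than 80% of the run time, so those are the ones worth analyzing first.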

Control Chart

[Figure: control chart showing the upper control limit (UCL), the center line, and the lower control limit (LCL)]

Control Chart 2

• Used to look for distinct variations from the mean

• Goal: predictable behavior

• Plot series of data over time

• Variations are represented as distance from the mean

Control Chart 3

• Center Line: can be computed as mean of variable points

• Upper Control Limit: three standard deviations above center line

• Lower Control Limit: three standard deviations below center line
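The three definitions above can be sketched in a few lines of Python. This is a minimal version that uses the plain standard deviation of the plotted points, as the slide describes. Charts used in practice often estimate the spread from subgroup ranges instead. The function name and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev


def control_limits(samples, sigmas=3.0):
    """Return (LCL, center line, UCL) for a series of measurements."""
    center = mean(samples)
    spread = pstdev(samples)  # population standard deviation of the points
    return center - sigmas * spread, center, center + sigmas * spread


# Hypothetical measurements taken over time.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lcl, center, ucl = control_limits(samples)
print(lcl, center, ucl)
```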

Control Chart 4

• As long as all points are between UCL and LCL, the variations are due to common causes, and the process is said to be in control, or stable

• Points above UCL or below LCL are indicative of abnormal variation, and are due to special causes – the process is not in control
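A small sketch of this test: given control limits, list the points that fall outside them. The limits and the train delays below are invented for illustration and echo the railroad example. In practice the limits would come from a period when the process was known to be stable.

```python
def out_of_control(points, lcl, ucl):
    """Return (index, value) pairs for points outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lcl or x > ucl]


# Hypothetical minutes-late figures for six nights, with assumed limits.
delays = [2, 0, 3, 1, 60, 2]
signals = out_of_control(delays, lcl=-4.0, ucl=8.0)
in_control = not signals
print(signals, in_control)
```

The one-hour delay is the only point flagged, which is the kind of signal that suggests a special cause worth investigating.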

Control Chart 5

• Select variables chart or attributes chart

• Use data quality dimensions as guideline

• Select meaningful variables to measure (i.e., ones that will point at a diagnosable problem)

Interpreting the Control Chart

• Lack of stability indicates potential problem

• Look for:
– points outside of control limits
– zone testing (clusters of points within certain standard deviation limits)
– potential to split out data points into different logical data sets

• Look for cycles
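Zone testing can take several forms, and the slide does not say which rules the course uses. As one illustration, the sketch below implements a commonly cited zone rule: flag a signal when two of three consecutive points lie more than two standard deviations from the center line on the same side. The function name and data are assumptions.

```python
def zone_rule_2_of_3(points, center, sigma):
    """Return indices ending a 3-point window with 2+ points beyond 2 sigma on one side."""
    hits = []
    for end in range(2, len(points)):
        window = points[end - 2:end + 1]
        above = sum(1 for x in window if x > center + 2 * sigma)
        below = sum(1 for x in window if x < center - 2 * sigma)
        if above >= 2 or below >= 2:
            hits.append(end)
    return hits


# Hypothetical series with center line 10 and standard deviation 1.
series = [10.0, 12.5, 10.2, 12.3, 9.8, 10.1]
zone_hits = zone_rule_2_of_3(series, center=10.0, sigma=1.0)
print(zone_hits)
```

No point in this series is beyond three standard deviations, so a plain limit check would pass it. The zone rule still flags the cluster ending at index 3.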

SPC and Data Quality

• “The Information Factory”

• Use data quality dimensions as guideline for investigation

• Analyze the state of data as it passes through the information chain

• Probing can be automated with data quality rules

Inserting the Probes

• Find a location in the information chain that is:
– nondisruptive
– easy to access
– easy to retool

Data Quality Rules

• Definitions

• Proscriptive Assertions

• Prescriptive Assertions

• Conditional Assertions

• Operational Assertions

Definitions

• Nulls

• Domains

• Mappings

Proscriptive Assertions

• Describe what is not allowed

• Used to figure out what is wrong with data

• Used for validation

Prescriptive Assertions

• Describe what is supposed to happen with data

• Can be used for data population, extraction, transformation

• Can also be used for validation

Conditional Assertions

• Define an assertion that must be true if a condition is true

Operational Assertions

• Define an action that must be taken if a condition is true
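The difference between a conditional assertion (something must be true) and an operational assertion (something must be done) can be sketched as two small helpers. All names, the sample record, and the review queue are hypothetical and chosen only to show the shape of each kind of rule.

```python
def check_conditional(record, condition, assertion):
    """Conditional assertion: violated only when the condition holds and the assertion fails."""
    return (not condition(record)) or assertion(record)


def apply_operational(record, condition, action):
    """Operational assertion: perform the action whenever the condition holds."""
    if condition(record):
        action(record)
        return True
    return False


# Hypothetical order record and review queue.
order = {"status": "shipped", "ship_date": None}
review_queue = []

# Conditional: a shipped order must have a ship date.
ok = check_conditional(
    order,
    condition=lambda r: r["status"] == "shipped",
    assertion=lambda r: r["ship_date"] is not None,
)

# Operational: an order with no ship date must be queued for review.
fired = apply_operational(
    order,
    condition=lambda r: r["ship_date"] is None,
    action=review_queue.append,
)
print(ok, fired, len(review_queue))
```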

9 Classes of Rules

• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain Mappings
• 5) Relation rules
• 6) Table, Cross-table, and Cross-message assertions
• 7) In-Process directives
• 8) Operational Directives
• 9) Other rules

Null Value Rules

• Null value specification
– Define GETDATE for unavailable as “fill in date”

• Null values allowed
– Attribute A allowed nulls {GETDATE, U, X}

• Null values not allowed
– Attribute B nulls not allowed
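One way to check these three rules in code is sketched below. The slide defines only GETDATE, so U and X are treated as null codes whose meaning is left unspecified. Treating a bare missing value (None) as an unlisted null is an assumption of this sketch.

```python
# Null representation defined on the slide; U and X are listed there but not defined.
NULL_SPECS = {"GETDATE": "fill in date"}
NULL_CODES = {"GETDATE", "U", "X"}

# Which null codes each attribute may contain (an empty set means no nulls allowed).
ALLOWED_NULLS = {"A": {"GETDATE", "U", "X"}, "B": set()}


def null_violations(record):
    """Return attributes whose null value is not permitted by the rules."""
    problems = []
    for attr, allowed in ALLOWED_NULLS.items():
        value = record.get(attr)
        # Assumption: None counts as a null that matches no listed code.
        is_null = value is None or value in NULL_CODES
        if is_null and value not in allowed:
            problems.append(attr)
    return problems


print(null_violations({"A": "GETDATE", "B": "U"}))
```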

Value Rules

• Value restriction rule
Restrict GRADE: value >= 'A' AND value <= 'F' AND value != 'E'
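The GRADE restriction translates almost directly into a Python predicate. The single-character check is an addition assumed here, since a plain string comparison would otherwise accept values such as "AB".

```python
def valid_grade(value):
    """Check the GRADE rule: between 'A' and 'F' inclusive, but not 'E'."""
    return (
        isinstance(value, str)
        and len(value) == 1  # assumption: grades are single letters
        and "A" <= value <= "F"
        and value != "E"
    )


print([g for g in ["A", "E", "F", "G", "AB"] if valid_grade(g)])
```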

Domain Rules

• Domain Definition

• Domain Membership

• Domain Nonmembership

• Domain Assignment

Mapping Rules

• Mapping definition

• Mapping membership

• Mapping nonmembership

Relation Rules

• Completeness

• Exemption

• Consistency

• Derivation

Completeness

• Defines when a record is complete (i.e., what fields must be present)
IF (Orders.Total > 0.0), Complete With
{Orders.Billing_Street,
Orders.Billing_City,
Orders.Billing_State,
Orders.Billing_ZIP}
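A minimal sketch of checking this completeness rule against one order record follows. Representing the record as a dictionary keyed by column name, and treating None or an empty string as "not present", are assumptions of the example.

```python
BILLING_FIELDS = ("Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP")


def missing_for_completeness(order):
    """Return the billing fields that are required but absent."""
    if order.get("Total", 0.0) <= 0.0:
        return []  # the rule applies only when Total > 0.0
    return [f for f in BILLING_FIELDS if order.get(f) in (None, "")]


# Hypothetical order with a positive total and an incomplete address.
order = {"Total": 19.99, "Billing_Street": "1 Main St", "Billing_City": "", "Billing_State": "NY"}
print(missing_for_completeness(order))
```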

Exemption

• Defines which fields may be missing
IF (Orders.Item_Class != "CLOTHING") Exempt
{Orders.Color,
Orders.Size}
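The exemption rule can be combined with a presence check as sketched below. The assumption that every listed field is required unless it is exempt is introduced here for illustration. The slide states only which fields may be missing.

```python
EXEMPTIBLE = {"Color", "Size"}


def exempt_fields(order):
    """Fields that may be missing for this order under the exemption rule."""
    return set(EXEMPTIBLE) if order.get("Item_Class") != "CLOTHING" else set()


def missing_fields(order, all_fields):
    """Non-exempt fields that are absent (None or empty string)."""
    exempt = exempt_fields(order)
    return sorted(f for f in all_fields if f not in exempt and order.get(f) in (None, ""))


FIELDS = ["Item_Class", "Color", "Size"]
book = {"Item_Class": "BOOK", "Color": None, "Size": None}   # hypothetical rows
shirt = {"Item_Class": "CLOTHING", "Color": None, "Size": "M"}
print(missing_fields(book, FIELDS), missing_fields(shirt, FIELDS))
```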

Consistency

• Define a relationship between attributes based on field content
– IF (Employees.title == "Staff Member") Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
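As a sketch, the consistency rule becomes a predicate that passes trivially when the condition does not apply. The dictionary representation of an employee record is an assumption.

```python
def salary_consistent(employee):
    """Staff Members must earn at least 20000 and less than 30000."""
    if employee.get("title") != "Staff Member":
        return True  # the rule says nothing about other titles
    return 20000 <= employee["Salary"] < 30000


# Hypothetical employee records.
staff = [
    {"title": "Staff Member", "Salary": 25000},
    {"title": "Staff Member", "Salary": 30000},
    {"title": "Manager", "Salary": 55000},
]
print([salary_consistent(e) for e in staff])
```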

Derivation

• Prescriptive form of consistency rule

• Details how one attribute’s value is determined based on other attributes
IF (Orders.NumberOrdered > 0) Then {
Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05
}
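Because a derivation rule is prescriptive, the same rule can both populate a value and validate an existing one. The sketch below shows both uses, with the 1.05 factor taken as is from the slide. The function names and the tolerance-based comparison are assumptions.

```python
import math


def derive_total(order):
    """Populate Total from NumberOrdered and Price when the condition holds."""
    if order["NumberOrdered"] > 0:
        order["Total"] = (order["NumberOrdered"] * order["Price"]) * 1.05
    return order


def total_is_valid(order):
    """Validate a stored Total against the derivation, allowing for float rounding."""
    if order["NumberOrdered"] <= 0:
        return True
    expected = (order["NumberOrdered"] * order["Price"]) * 1.05
    return math.isclose(order["Total"], expected)


# Hypothetical order.
order = derive_total({"NumberOrdered": 2, "Price": 10.0})
print(order["Total"], total_is_valid(order))
```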

Table and Cross-Table Rules

• Functional Dependence

• Primary Key Assertion

• Foreign Key Assertion (=referential integrity)

Functional Dependence

• Functional Dependence between columns X and Y:
– For any two records R1 and R2 in a table, if field X of both records contains the same value x, and field Y of record R1 contains the value y, then field Y of record R2 must also contain the value y.

• In other words, attribute Y is said to be determined by attribute X.
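A functional dependence can be tested by remembering the first Y value seen for each X value and flagging any later record that disagrees. The column names and rows below are invented for illustration.

```python
def fd_violations(rows, x, y):
    """Return indices of rows whose y value conflicts with an earlier row sharing the same x."""
    seen = {}
    violations = []
    for i, row in enumerate(rows):
        key = row[x]
        if key in seen and seen[key] != row[y]:
            violations.append(i)
        else:
            seen.setdefault(key, row[y])
    return violations


# Hypothetical table in which zip is expected to determine state.
rows = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NJ"},
    {"zip": "07030", "state": "NJ"},
]
print(fd_violations(rows, "zip", "state"))
```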

Primary Key Assertion

• A set of attributes defined as a primary key must uniquely identify a record

• Enforcement = testing for duplicates across defined key set
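Testing for duplicates across a key set can be sketched with a counter over the key tuples. The table and column names are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return key tuples that appear in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical order lines keyed by (order_id, line_no).
lines = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 1, "line_no": 2},
    {"order_id": 2, "line_no": 1},
]
print(duplicate_keys(lines, ["order_id", "line_no"]))
```

An empty result means the key set uniquely identifies every record in the table.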

Foreign Key Assertion

• When the values in field f in table T are chosen from the key values in field g in table S, field T.f is said to be a foreign key referencing S.g

• If f is a foreign key, every value of f must exist in table S, column g (= referential integrity)
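Referential integrity can be checked by collecting the key values in S.g and listing any values of T.f that are not among them. The tables below are invented for illustration.

```python
def orphaned_values(t_rows, f, s_rows, g):
    """Return values of T.f that do not appear in S.g."""
    valid = {row[g] for row in s_rows}
    return [row[f] for row in t_rows if row[f] not in valid]


# Hypothetical tables: orders.customer_id should reference customers.id.
customers = [{"id": 1}, {"id": 2}]
orders = [{"customer_id": 1}, {"customer_id": 3}, {"customer_id": 2}]
print(orphaned_values(orders, "customer_id", customers, "id"))
```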

In-process Directives

• Definition directives (labeling information chain members)

• Measurement directives

• Trigger directives

Operational Directives

• Transformation

• Update

Other Rules

• Approximate Searching rules

• Approximate Matching rules