
Data-Ed Webinar: Data Quality Strategies - From Data Duckling to Successful Swan


Data Quality Strategies: From Data Duckling to Successful Swan

Peter Aiken, Ph.D.
• DAMA International President 2009-2013
• DAMA International Achievement Award 2001 (with Dr. E. F. "Ted" Codd)
• DAMA International Community Award 2005
• 33+ years in data management
• Repeated international recognition
• Founder, Data Blueprint (datablueprint.com)
• Associate Professor of IS (vcu.edu)
• DAMA International (dama.org)
• 10 books and dozens of articles
• Experienced w/ 500+ data management practices
• Multi-year immersions:
  – US DoD (DISA/Army/Marines/DLA)
  – Nokia
  – Deutsche Bank
  – Wells Fargo
  – Walmart
  – …

Monetizing Data Management: Unlocking the Value in Your Organization’s Most Important Asset. Peter Aiken with Juanita Billings, foreword by John Bottega

The Case for the Chief Data Officer: Recasting the C-Suite to Leverage Your Most Valuable Asset. Peter Aiken and Michael Gorman

Copyright 2017 by Data Blueprint

1. Data Quality in Context of Data Management

2. DQE Definition

3. DQE Cycle & Contextual Complications

4. DQ Causes and Dimensions

5. Quality and the Data Life Cycle

6. DQE Tool Sets

7. Takeaways and Q&A

Data Quality Strategies

Our barn had to pass a foundation inspection
• Before further construction could proceed
• No IT equivalent

Data Management Practices Hierarchy
• Advanced Data Practices (Technologies): MDM, Mining, Big Data, Analytics, Warehousing, SOA
• Foundational Data Practices (Capabilities): Data Platform/Architecture, Data Governance, Data Quality, Data Operations, Data Management Strategy

You can accomplish Advanced Data Practices without becoming proficient in the Foundational Data Practices; however, this will:
• Take longer
• Cost more
• Deliver less
• Present greater risk
(with thanks to Tom DeMarco)

DMM℠ Structure of 5 Integrated DM Practice Areas

Practice areas shown in the figure: Data Management Strategy, Data Governance, Data Quality, Data Operations, Platform/Architecture, Supporting Processes

Figure annotations:
• Manage data coherently
• Manage data assets professionally
• Maintain fit-for-purpose data, efficiently and effectively
• Data life cycle management
• Data architecture implementation
• Organizational support

The DAMA Guide to the Data Management Body of Knowledge

Data Management Functions (from The DAMA Guide to the Data Management Body of Knowledge © 2009 by DAMA International)
• Good enough to criticize
  – All models are wrong
  – Some models are useful
• Missing two important concepts
  – Optionality
  – Dependency

Overview: Data Quality Engineering


Data Quality and Data Governance in Context

Figure elements: Organizational Strategy, Data Strategy, Data Governance, Data Quality

Figure annotations:
• Data asset support for organizational strategy
• What the data assets do to support strategy (business goals)
• How well the data strategy is working (metadata)
• Governance of quality aspects of data assets
• Evolutionary feedback about the current focus

2. DQE Definition

A Model Specifying Relationships Among Important Terms
[Built on definition by Dan Appleton 1983]

Figure terms: Fact, Meaning, Data, Request, Information, Use, Intelligence

1. Each FACT combines with one or more MEANINGS.
2. Each specific FACT and MEANING combination is referred to as a DATUM.
3. An INFORMATION is one or more DATA that are returned in response to a specific REQUEST.
4. INFORMATION REUSE is enabled when one FACT is combined with more than one MEANING.
5. INTELLIGENCE is INFORMATION associated with its USES.

Wisdom & knowledge are often used synonymously
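The term model above can be sketched in a few lines of Python. This is only an illustration: the class name `Datum`, the helper `information`, and the sample facts and meanings are assumptions made for the example, not part of Appleton's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """One specific FACT combined with one MEANING."""
    fact: str
    meaning: str


def information(data, request):
    """Return the DATA whose meaning answers a specific REQUEST."""
    return [d for d in data if d.meaning == request]


# The same FACT is paired with two MEANINGS, which is what enables reuse.
data = [
    Datum(fact="2017-06-13", meaning="order date"),
    Datum(fact="2017-06-13", meaning="invoice date"),
    Datum(fact="42", meaning="quantity ordered"),
]

order_info = information(data, "order date")
meanings_of_fact = {d.meaning for d in data if d.fact == "2017-06-13"}
```

Here `order_info` holds the single datum returned for the request "order date", and `meanings_of_fact` shows one fact carrying two meanings.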


Definitions
• Quality Data
  – Fit for purpose: meets the requirements of its authors, users, and administrators (adapted from Martin Eppler)
  – Synonymous with information quality, since poor data quality results in inaccurate information and poor business performance
• Data Quality Management
  – Planning, implementation, and control activities that apply quality management techniques to measure, assess, improve, and ensure data quality
  – Entails the "establishment and deployment of roles, responsibilities concerning the acquisition, maintenance, dissemination, and disposition of data" http://www2.sas.com/proceedings/sugi29/098-29.pdf
  ✓ Critical supporting process from change management
  ✓ Continuous process for defining acceptable levels of data quality to meet business needs and for ensuring that data quality meets these levels
• Data Quality Engineering
  – Recognition that data quality solutions cannot be managed but must be engineered
  – Engineering is the application of scientific, economic, social, and practical knowledge in order to design, build, and maintain solutions to data quality challenges
  – Engineering concepts are generally not known and understood within IT or business!

Spinach/Popeye story from http://it.toolbox.com/blogs/infosphere/spinach-how-a-data-quality-mistake-created-a-myth-and-a-cartoon-character-10166

Improving Data Quality during System Migration
• Challenge
  – Millions of NSN/SKUs maintained in a catalog
  – Key and other data stored in clear text/comment fields
  – Original suggestion was manual approach to text extraction
  – Left the data structuring problem unsolved
• Solution
  – Proprietary, improvable text extraction process
  – Converted non-tabular data into tabular data
  – Saved a minimum of $5 million
  – Literally person-centuries of work

Determining Diminishing Returns

Week #   Unmatched Items (% Total)   Ignorable Items (% Total)   Items Matched (% Total)
1        31.47%                      1.34%                       N/A
2        21.22%                      6.97%                       N/A
3        20.66%                      7.49%                       N/A
4        32.48%                      11.99%                      55.53%
…        …                           …                           …
14       9.02%                       22.62%                      68.36%
15       9.06%                       22.62%                      68.33%
16       9.53%                       22.62%                      67.85%
17       9.5%                        22.62%                      67.88%
18       7.46%                       22.62%                      69.92%

Quantitative Benefits

Time needed to review all NSNs once over the life of the project:
  NSNs: 2,000,000
  Average time to review & cleanse (in minutes): 5
  Total time (in minutes): 10,000,000

Time available per resource over a one-year period:
  Work weeks in a year: 48
  Work days in a week: 5
  Work hours in a day: 7.5
  Work minutes in a day: 450
  Total work minutes/year: 108,000

Person-years required to cleanse each NSN once prior to migration:
  Minutes needed: 10,000,000
  Minutes available per person/year: 108,000
  Total person-years: 92.6

Resource cost to cleanse NSNs prior to migration:
  Avg salary for SME per year (not including overhead): $60,000.00
  Projected years required to cleanse / total DLA person-years saved: 93
  Total cost to cleanse / total DLA savings to cleanse NSNs: $5.5 million


Time needed to review all NSNs once over the life of the project:
  NSNs: 150,000
  Average time to review & cleanse (in minutes): 5
  Total time (in minutes): 750,000

Time available per resource over a one-year period:
  Work weeks in a year: 48
  Work days in a week: 5
  Work hours in a day: 7.5
  Work minutes in a day: 450
  Total work minutes/year: 108,000

Person-years required to cleanse each NSN once prior to migration:
  Minutes needed: 750,000
  Minutes available per person/year: 108,000
  Total person-years: 7

Resource cost to cleanse NSNs prior to migration:
  Avg salary for SME per year (not including overhead): $60,000.00
  Projected years required to cleanse / total DLA person-years saved: 7
  Total cost to cleanse / total DLA savings to cleanse NSNs: $420,000
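The arithmetic behind both estimates can be reproduced with a short sketch. The function name `cleansing_estimate` and its parameter names are made up for illustration; the input figures are the ones on the slides. Note that the slides round person-years to a whole number (93 and 7) before multiplying by salary, so the unrounded cost computed here differs slightly from the slide totals.

```python
def cleansing_estimate(items, minutes_per_item=5, weeks_per_year=48,
                       days_per_week=5, hours_per_day=7.5, salary=60_000):
    """Estimate the manual effort and cost of reviewing every item once."""
    minutes_needed = items * minutes_per_item
    # 48 weeks x 5 days x 7.5 hours x 60 minutes = 108,000 minutes per person-year
    minutes_per_person_year = weeks_per_year * days_per_week * hours_per_day * 60
    person_years = minutes_needed / minutes_per_person_year
    cost = person_years * salary
    return minutes_needed, minutes_per_person_year, person_years, cost


full = cleansing_estimate(2_000_000)    # all NSNs
reduced = cleansing_estimate(150_000)   # the smaller NSN set

print(f"{full[2]:.1f} person-years")     # about 92.6 person-years
print(f"{reduced[2]:.1f} person-years")  # about 6.9, shown as 7 on the slide
```

With person-years rounded as on the slides, 7 x $60,000 gives the $420,000 figure.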


Data Quality Misconceptions

• You can fix the data

• Data quality is an IT problem

• The problem is in the data sources or data entry

• The data warehouse will provide a single version of the truth

• The new system will provide a single version of the truth

• Standardization will eliminate the problem of different "truths" represented in the reports or analysis

Source: Business Intelligence solutions, Athena Systems


The Blind Men and the Elephant

It was six men of Indostan,
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.

The First approached the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
"God bless me! but the Elephant
Is very like a wall!"

The Second, feeling of the tusk
Cried, "Ho! what have we here,
So very round and smooth and sharp?
To me 'tis mighty clear
This wonder of an Elephant
Is very like a spear!"

The Third approached the animal,
And happening to take
The squirming trunk within his hands,
Thus boldly up he spake:
"I see," quoth he, "the Elephant
Is very like a snake!"

The Fourth reached out an eager hand,
And felt about the knee:
"What most this wondrous beast is like
Is mighty plain," quoth he;
"'Tis clear enough the Elephant
Is very like a tree!"

The Fifth, who chanced to touch the ear,
Said: "E'en the blindest man
Can tell what this resembles most;
Deny the fact who can,
This marvel of an Elephant
Is very like a fan!"

The Sixth no sooner had begun
About the beast to grope,
Than, seizing on the swinging tail
That fell within his scope.
"I see," quoth he, "the Elephant
Is very like a rope!"

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

(Source: John Godfrey Saxe's (1816-1887) version of the famous Indian legend)

No universal conception of data quality exists; instead, many differing perspectives compete

• Problem:

– Most organizations approach data quality problems in the same way that the blind men approached the elephant - people tend to see only the data that is in front of them

– Little cooperation across boundaries, just as the blind men were unable to convey their impressions about the elephant to recognize the entire entity.

– Leads to confusion, disputes and narrow views

• Solution:

– Data quality engineering can help achieve a more complete picture and facilitate cross boundary communications

Quality Data is ... Fit For Purpose

Famous Words?
• Question:
  – Why haven't organizations taken a more proactive approach to data quality?
• Answer:
  – Fixing data quality problems is not easy
  – It is dangerous -- they'll come after you
  – Your efforts are likely to be misunderstood
  – You could make things worse
  – Now you get to fix it
• A single data quality issue can grow into a significant, unexpected investment

3. DQE Cycle & Contextual Complications

Four ways to make your data sparkle!
1. Prioritize the task
  – Cleaning data is costly and time consuming
  – Identify mission critical/non-mission critical data
2. Involve the data owners
  – Seek input of business units on what constitutes "dirty" data
3. Keep future data clean
  – Incorporate processes and technologies that check every zip code and area code
4. Align your staff with business
  – Align IT staff with business units
(Source: CIO JULY 1 2004)

Structured Data Quality Engineering
1. Allow the form of the problem to guide the form of the solution
2. Provide a means of decomposing the problem
3. Feature a variety of tools simplifying system understanding
4. Offer a set of strategies for evolving a design solution
5. Provide criteria for evaluating the quality of the various solutions
6. Facilitate development of a framework for developing organizational knowledge.

The DQE Cycle
• Deming cycle
• "Plan-do-study-act" or "plan-do-check-act"
1. Identifying data issues that are critical to the achievement of business objectives
2. Defining business requirements for data quality
3. Identifying key data quality dimensions
4. Defining business rules critical to ensuring high quality data

The DQE Cycle: (1) Plan
• Plan for the assessment of the current state and identification of key metrics for measuring quality
• The data quality engineering team assesses the scope of known issues
  – Determining cost and impact
  – Evaluating alternatives for addressing them

The DQE Cycle: (2) Deploy
• Deploy processes for measuring and improving the quality of data:
• Data profiling
  – Institute inspections and monitors to identify data issues when they occur
  – Fix flawed processes that are the root cause of data errors or correct errors downstream
  – When it is not possible to correct errors at their source, correct them at their earliest point in the data flow

The DQE Cycle: (3) Monitor
• Monitor the quality of data as measured against the defined business rules
• If data quality meets defined thresholds for acceptability, the processes are in control and the level of data quality meets the business requirements
• If data quality falls below acceptability thresholds, notify data stewards so they can take action during the next stage

The DQE Cycle: (4) Act

• Act to resolve any identified issues to improve data quality and better meet business expectations

• New cycles begin as new data sets come under investigation or as new data quality requirements are identified for existing data sets

DQE Context & Engineering Concepts
• Can rules be implemented stating that no data can be corrected unless the source of the error has been discovered and addressed?
• All data must be 100% perfect?
• Pareto
  – 80/20 rule
  – Not all data is of equal importance
• Scientific, economic, social, and practical knowledge

4. DQ Causes and Dimensions

Two Distinct Activities Support Quality Data
• Data quality best practices depend on both
  – Practice-oriented activities
  – Structure-oriented activities
• Practice-oriented activities focus on the capture and manipulation of data
• Structure-oriented activities focus on the data implementation

Practice-Oriented Activities
• Stem from a failure to apply rigor when capturing/manipulating data, such as:
  – Edit masking
  – Range checking of input data
  – CRC-checking of transmitted data
• Affect the Data Value Quality and Data Representation Quality
• Examples of improper practice-oriented activities:
  – Allowing imprecise or incorrect data to be collected when requirements specify otherwise
  – Presenting data out of sequence
• Typically diagnosed in bottom-up manner: find and fix the resulting problem
• Addressed by imposing more rigorous data-handling/governance
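A minimal sketch of the three capture-time controls named above (edit masking, range checking, CRC-checking) follows. The ZIP-code mask, the function names, and the sample payload are assumptions chosen for illustration; `zlib.crc32` is the standard-library CRC-32 function.

```python
import re
import zlib

# Edit mask: assumed here to be a US ZIP or ZIP+4 code.
ZIP_MASK = re.compile(r"\d{5}(-\d{4})?")


def passes_edit_mask(value):
    """Accept the value only if the whole string fits the mask."""
    return ZIP_MASK.fullmatch(value) is not None


def in_range(value, low, high):
    """Range check for numeric input."""
    return low <= value <= high


def with_crc(payload):
    """Attach a CRC-32 checksum before transmitting the payload."""
    return payload, zlib.crc32(payload)


def crc_ok(payload, checksum):
    """Recompute the checksum on receipt and compare."""
    return zlib.crc32(payload) == checksum


sent, checksum = with_crc(b"NSN 5305")
received_intact = crc_ok(sent, checksum)
received_corrupted = crc_ok(b"NSN 5306", checksum)
```

Each check rejects bad data at the point of capture or transfer, which is cheaper than finding and fixing the resulting problem later.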

Figure labels: Practice-oriented activities; Quality of Data Values; Quality of Data Representation

Knee Surgery

Structure-Oriented Activities
• Occur because of data and metadata that has been arranged imperfectly. For example:
  – When the data is in the system but we just can't access it;
  – When a correct data value is provided as the wrong response to a query; or
  – When data is not provided because it is unavailable or inaccessible
• Developer focus within system boundaries instead of within organization boundaries
• Affect the Data Model Quality and Data Architecture Quality
• Examples of improper structure-oriented activities:
  – Providing a correct response but incomplete data to a query because the user did not comprehend the system data structure
  – Costly maintenance of inconsistent data used by redundant systems
• Typically diagnosed in top-down manner: root cause fixes
• Addressed through fundamental data structure governance

Figure labels: Structure-oriented activities; Quality of Data Models; Quality of Data Architecture

New York Turns to Data to Solve Big Tree Problem
• NYC
  – 2,500,000 trees
• 11 months from 2009 to 2010
  – 4 people were killed or seriously injured by falling tree limbs in Central Park alone
• Belief
  – Arborists believe that pruning and otherwise maintaining trees can keep them healthier and make them more likely to withstand a storm, decreasing the likelihood of property damage, injuries and deaths
• Until recently
  – No research or data to back it up

http://www.computerworld.com/s/article/9239793/New_York_Turns_to_Big_Data_to_Solve_Big_Tree_Problem?source=CTWNLE_nlt_datamgmt_2013-06-05

NYC's Big Tree Problem
• Question
  – Does pruning trees in one year reduce the number of hazardous tree conditions in the following year?
• Lots of data but granularity challenges
  – Pruning data recorded block by block
  – Cleanup data recorded at the address level
  – Trees have no unique identifiers
• After downloading, cleaning, merging, analyzing and intensive modeling
  – Pruning trees for certain types of hazards caused a 22 percent reduction in the number of times the department had to send a crew for emergency cleanups
• The best data analysis
  – Generates further questions
• NYC cannot prune each block every year
  – Building block risk profiles: number of trees, types of trees, whether the block is in a flood zone or storm zone

http://www.computerworld.com/s/article/9239793/New_York_Turns_to_Big_Data_to_Solve_Big_Tree_Problem?source=CTWNLE_nlt_datamgmt_2013-06-05

Quality Dimensions

4 Dimensions of Data Quality
An organization’s overall data quality is a function of four distinct components, each with its own attributes:
• Data Value (practice-oriented): the quality of data as stored & maintained in the system
• Data Representation (practice-oriented): the quality of representation for stored values; perfect data values stored in a system that are inappropriately represented can be harmful
• Data Model (structure-oriented): the quality of data logically representing user requirements related to data entities, associated attributes, and their relationships; essential for effective communication among data suppliers and consumers
• Data Architecture (structure-oriented): the coordination of data management activities in cross-functional system development and operations

Effective Data Quality Engineering
• Data quality engineering has been focused on operational problem correction
  – Directing attention to practice-oriented data imperfections
• Data quality engineering is more effective when also focused on structure-oriented causes
  – Ensuring the quality of shared data across system boundaries

Full Set of Data Quality Attributes
• Data Representation Quality: as presented to the user
• Data Value Quality: as maintained in the system
• Data Model Quality: as understood by developers
• Data Architecture Quality: as an organizational asset
Figure axis labels: (closer to the user), (closer to the architect)

Difficult to obtain leverage at the bottom of the falls


Frozen Falls


5. Quality and the Data Life Cycle

Traditional Quality Life Cycle
Figure elements: data acquisition activities, data storage, data usage activities

Data Life Cycle Model Products
Figure phases: Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Storage, Data Assessment, Data Refinement, Data Manipulation, Data Utilization
Figure labels: data architecture & models; populated data models and storage locations; data values; value defects; structure defects; architecture refinements; model refinements; restored data

Data Life Cycle Model: Quality Focus
Figure phases: Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Storage, Data Assessment, Data Refinement, Data Manipulation, Data Utilization
Figure labels: architecture quality; model quality; architecture & model quality; value quality; representation quality; populated data models and storage locations; data values; restored data

Extended data life cycle model with metadata sources and uses
• Metadata Creation: Define Data Architecture; Define Data Model Structures
• Metadata Structuring: Implement Data Model Views; Populate Data Model Views
• Metadata Refinement: Correct Structural Defects; Update Implementation
• Data Creation: Create Data; Verify Data Values
• Data Assessment: Assess Data Values; Assess Metadata
• Data Refinement: Correct Data Value Defects; Re-store Data Values
• Data Manipulation: Manipulate Data; Update Data
• Data Utilization: Inspect Data; Present Data
Figure labels: starting point for new system development; starting point for existing systems; Metadata & Data Storage; data performance metadata; data architecture; data architecture and data models; shared data; updated data; corrected data; architecture refinements; facts & meanings

6. DQE Tool Sets

Profile, Analyze and Assess DQ
• Data assessment using 2 different approaches:
  – Bottom-up
  – Top-down
• Bottom-up assessment:
  – Inspection and evaluation of the data sets
  – Highlight potential issues based on the results of automated processes
• Top-down assessment:
  – Engage business users to document their business processes and the corresponding critical data dependencies
  – Understand how their processes consume data and which data elements are critical to the success of the business applications

Define DQ Measures
• Measures development occurs as part of the strategy/design/plan step
• Process for defining data quality measures:

1. Select one of the identified critical business impacts

2. Evaluate the dependent data elements, and the create and update processes associated with that business impact

3. List any associated data requirements

4. Specify the associated dimension of data quality and one or more business rules to use to determine conformance of the data to expectations

5. Describe the process for measuring conformance

6. Specify an acceptability threshold
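The six steps above can be captured in a small record that carries the business impact, the data element, the dimension, a business rule, and an acceptability threshold. This is a sketch under stated assumptions: the class `QualityMeasure`, its field names, the e-mail rule, and the 90% threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QualityMeasure:
    business_impact: str           # step 1
    data_element: str              # steps 2-3
    dimension: str                 # step 4
    rule: Callable[[dict], bool]   # step 4: business rule for conformance
    threshold: float               # step 6: minimum acceptable conforming share

    def conformance(self, records):
        """Step 5: share of records that satisfy the business rule."""
        if not records:
            return 1.0
        return sum(1 for r in records if self.rule(r)) / len(records)

    def acceptable(self, records):
        return self.conformance(records) >= self.threshold


email_measure = QualityMeasure(
    business_impact="Invoices cannot be delivered electronically",
    data_element="customer.email",
    dimension="Data Value",
    rule=lambda r: bool(r.get("email")),
    threshold=0.90,
)

customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]
share = email_measure.conformance(customers)   # 3 of 4 records conform
```

Keeping the rule and the threshold together makes the later monitoring step a single call to `acceptable`.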


Set and Evaluate DQ Service Levels
• Data quality inspection and monitoring are used to measure and monitor compliance with defined data quality rules
• Data quality SLAs specify the organization’s expectations for response and remediation
• Operational data quality control defined in data quality SLAs includes:
  – Data elements covered by the agreement
  – Business impacts associated with data flaws
  – Data quality dimensions associated with each data element
  – Quality expectations for each data element of the identified dimensions in each application or system in the value chain
  – Methods for measuring against those expectations
  – (…)

Measure, Monitor & Manage DQ
• DQM procedures depend on available data quality measuring and monitoring services

• 2 contexts for control/measurement of conformance to data quality business rules exist:

– In-stream: collect in-stream measurements while creating data

– In batch: perform batch activities on collections of data instances assembled in a data set

• Apply measurements at 3 levels of granularity:

– Data element value

– Data instance or record

– Data set
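The three levels of granularity can be illustrated with one rule per level. The order/shipment fields and the rules themselves are assumptions made for the example. The record-level check could run in-stream as each record is created, while the data-set figure is a batch measurement.

```python
def valid_quantity(value):
    """Data element value level: a quantity must be a non-negative integer."""
    return isinstance(value, int) and value >= 0


def valid_record(record):
    """Data instance (record) level: cross-field rule, shipped <= ordered."""
    return (valid_quantity(record["ordered"])
            and valid_quantity(record["shipped"])
            and record["shipped"] <= record["ordered"])


def dataset_conformance(records):
    """Data set level: share of records passing the record-level rule."""
    return sum(valid_record(r) for r in records) / len(records)


orders = [
    {"ordered": 10, "shipped": 8},
    {"ordered": 5, "shipped": 7},    # fails the cross-field rule
    {"ordered": -1, "shipped": 0},   # fails the element rule
    {"ordered": 3, "shipped": 3},
]
conforming_share = dataset_conformance(orders)
```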


Overview: Data Quality Tools
• 4 categories of activities:
  – Analysis
  – Cleansing
  – Enhancement
  – Monitoring
• Principal tools:
  – Data Profiling
  – Parsing and Standardization
  – Data Transformation
  – Identity Resolution and Matching
  – Enhancement
  – Reporting

DQ Tool Set #1: Data Profiling
• Data profiling is the assessment of value distribution and clustering of values into domains
• Need to be able to distinguish between good and bad data before making any improvements
• Data profiling is a set of algorithms for 2 purposes:
  – Statistical analysis and assessment of the data quality values within a data set
  – Exploring relationships that exist between value collections within and across data sets
• At its most advanced, data profiling takes a series of prescribed rules from data quality engines. It then assesses the data, annotates and tracks violations to determine if they comprise new or inferred data quality rules
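A very small column profiler gives a feel for the first purpose, statistical assessment of values within a data set. This is a sketch, not a description of any commercial profiling tool: the function name `profile_column`, the report keys, and the digit/letter pattern encoding ("9" for a digit, "A" for a letter) are assumptions.

```python
import re
from collections import Counter


def profile_column(values):
    """Summarize null rate, distinct values, and value patterns for one column."""
    non_null = [v for v in values if v not in (None, "")]
    # Reduce each value to a shape: digits become 9, letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
        "patterns": dict(patterns),
    }


zip_codes = ["23220", "23220", "2322O", "", "23220-1234", None]
report = profile_column(zip_codes)
```

In this sample the rare pattern `9999A` exposes the letter "O" typed in place of a zero, which is the kind of violation a profiler surfaces before any rule has been written for it.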

DQ Tool Set #1: Data Profiling, cont’d
• Data profiling vs. data quality: business context and semantic/logical layers
  – Data quality is concerned with proscriptive rules
  – Data profiling looks for patterns when rules are adhered to and when rules are violated; able to provide input into the business context layer
• Incumbent that data profiling services notify all concerned parties of whatever is discovered
• Profiling can be used to…
  – …notify the help desk that valid changes in the data are about to cause an avalanche of “skeptical user” calls
  – …notify business analysts of precisely where they should be working today in terms of shifts in the data

Courtesy GlobalID.com

DQ Tool Set #2: Parsing & Standardization
• Data parsing tools enable the definition of patterns that feed into a rules engine used to distinguish between valid and invalid data values
• Actions are triggered upon matching a specific pattern
• When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets expectations
• Data standardization is the process of conforming to a set of business rules and formats that are set up by data stewards and administrators
• Data standardization example:
  – Bringing all the different formats of “street” (e.g. “STR”, “ST.”, “STRT”, “STREET”, etc.) into a single format
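The street example can be sketched as a lookup against a set of known variants. The variant set, the choice of "ST" as the target format, and the function name `standardize_street` are assumptions; in practice data stewards would own this rule set.

```python
# Variants named on the slide, with trailing punctuation stripped before lookup.
STREET_VARIANTS = {"STR", "ST", "STRT", "STREET"}
STANDARD_FORM = "ST"  # assumed target format


def standardize_street(address):
    """Upper-case the address and map every 'street' variant to one form."""
    tokens = []
    for token in address.upper().split():
        bare = token.rstrip(".,")
        tokens.append(STANDARD_FORM if bare in STREET_VARIANTS else token)
    return " ".join(tokens)


cleaned = [standardize_street(a) for a in (
    "100 Main Street",
    "100 main str.",
    "100 MAIN STRT",
)]
```

All three inputs come out as the same string, so downstream matching sees one address instead of three. A token-level rule like this cannot tell "St." for street from "St." for saint, which is one reason rule sets need steward review.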

DQ Tool Set #3: Data Transformation
• Upon identification of data errors, trigger data rules to transform the flawed data
• Perform standardization and guide rule-based transformations by mapping data values in their original formats and patterns into a target representation
• Parsed components of a pattern are subjected to rearrangement, corrections, or any changes as directed by the rules in the knowledge base
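As an illustration of mapping parsed components into a target representation, the sketch below parses a ten-digit phone number written in several common formats and rearranges its components into one layout. The pattern, the target format `AAA-EEE-LLLL`, and the function name are assumptions for the example, not rules from any particular knowledge base.

```python
import re

# Parse an optional parenthesized area code, then exchange and line number,
# allowing spaces, dots, or hyphens as separators.
PHONE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")


def transform_phone(raw):
    """Return the phone number in the target format, or None if unparseable."""
    match = PHONE.fullmatch(raw.strip())
    if match is None:
        return None  # leave for manual review rather than guess
    area, exchange, line = match.groups()
    return f"{area}-{exchange}-{line}"


transformed = [transform_phone(p) for p in (
    "(804) 555-0123",
    "804.555.0123",
    "8045550123",
    "555-0123",      # missing area code: cannot be transformed
)]
```

Returning `None` for values that do not parse keeps the transformation from silently inventing data.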

61Copyright 2017 by Data Blueprint Slide #

DQ Tool Set #4: Identity Resolution & Matching

• Data matching enables analysts to identify relationships between records for de-duplication or group-based processing
• Matching is central to maintaining data consistency and integrity throughout the enterprise
• The matching process should be used in the initial migration of data into a single repository
• 2 basic approaches to matching:
• Deterministic
  – Relies on defined patterns/rules for assigning weights and scores to determine similarity
  – Predictable
  – Dependent on the rule developers' anticipations
• Probabilistic
  – Relies on statistical techniques for assessing the probability that any pair of records represents the same entity
  – Not reliant on rules
  – Probabilities can be refined based on experience, so matchers can improve precision as more data is analyzed
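The two approaches can be contrasted in a short Python sketch. The deterministic half uses hand-assigned weights; the probabilistic half uses a Fellegi-Sunter style calculation, which is one common statistical technique and is not named in the slides. The field names, weights, threshold, m/u probabilities, and prior are all made-up values for illustration, and the probabilistic score assumes the fields agree or disagree independently of one another.

```python
import math
from typing import Dict, Tuple

Record = Dict[str, str]


def agrees(a: Record, b: Record, field: str) -> bool:
    """Case-insensitive equality on a non-empty field."""
    left, right = a.get(field, "").strip().lower(), b.get(field, "").strip().lower()
    return bool(left) and left == right


# Deterministic: hypothetical integer weights set by a rules developer.
RULE_WEIGHTS: Dict[str, int] = {"email": 60, "last_name": 25, "postal_code": 15}
MATCH_THRESHOLD = 75  # assumed cut-off


def deterministic_score(a: Record, b: Record) -> int:
    """Sum the weights of every field on which the two records agree."""
    return sum(w for field, w in RULE_WEIGHTS.items() if agrees(a, b, field))


# Probabilistic: per field, m = P(agree | same entity), u = P(agree | different).
# These numbers are illustrative; in practice they are estimated from data.
M_U: Dict[str, Tuple[float, float]] = {
    "email": (0.95, 0.001),
    "last_name": (0.90, 0.05),
    "postal_code": (0.85, 0.02),
}


def match_probability(a: Record, b: Record, prior: float = 0.01) -> float:
    """Posterior probability that a and b represent the same entity."""
    log_odds = math.log(prior / (1.0 - prior))
    for field, (m, u) in M_U.items():
        if agrees(a, b, field):
            log_odds += math.log(m / u)
        else:
            log_odds += math.log((1.0 - m) / (1.0 - u))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

The deterministic score is predictable but only as good as the weights its author anticipated. In the probabilistic version, the m and u values can be re-estimated as reviewed pairs accumulate, which is one way the refinement described above could be realized.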


DQ Tool Set #5: Enhancement

• Definition:
  – A method for adding value to information by accumulating additional information about a base set of entities and then merging all the sets of information to provide a focused view. Improves master data.
• Benefits:
  – Enables use of third party data sources
  – Allows you to take advantage of the information and research carried out by external data vendors to make data more meaningful and useful
• Examples of data enhancements:
  – Time/date stamps
  – Auditing information
  – Contextual information
  – Geographic information
  – Demographic information
  – Psychographic information
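A minimal Python sketch of enhancement follows: a base record is merged with attributes from external sources, and a timestamp plus simple auditing information are attached. The `enhance` function, the `customer_id` key, the `_enhanced_at` and `_provenance` field names, and the policy of never overwriting base attributes are assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

Record = Dict[str, Any]
# source name -> {entity key -> extra attributes}; stands in for vendor data
Sources = Dict[str, Dict[str, Record]]


def enhance(base: Record, sources: Sources, key: str = "customer_id",
            now: Optional[datetime] = None) -> Record:
    """Return a copy of base enriched with attributes from each source."""
    enhanced = dict(base)  # the base record itself is left untouched
    provenance: Dict[str, str] = {}
    for source_name, lookup in sources.items():
        extra = lookup.get(base[key], {})
        for attribute, value in extra.items():
            # Assumed policy: base values win; only missing attributes are added.
            if attribute not in enhanced:
                enhanced[attribute] = value
                provenance[attribute] = source_name
    # Time/date stamp and auditing information for the enhancement itself.
    enhanced["_enhanced_at"] = (now or datetime.now(timezone.utc)).isoformat()
    enhanced["_provenance"] = provenance
    return enhanced
```

Recording which source supplied each added attribute makes it possible to audit or roll back a vendor feed later without touching the original master data.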


DQ Tool Set #6: Reporting

• Good reporting supports:
  – Inspection and monitoring of conformance to data quality expectations
  – Monitoring performance of data stewards conforming to data quality SLAs
  – Workflow processing for data quality incidents
  – Manual oversight of data cleansing and correction
• Data quality tools provide dynamic reporting and monitoring capabilities
• These capabilities enable analysts and data stewards to support and drive the methodology for ongoing DQM and improvement with a single, easy-to-use solution
• Associate report results with:
  – Data quality measurement
  – Metrics
  – Activity
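To make conformance reporting concrete, the Python sketch below measures how many records satisfy each rule and flags rules that fall short of an SLA target. `RuleResult`, `build_report`, `format_report`, and the idea of expressing an SLA as a minimum conformance rate are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
# rule name -> (predicate over one record, SLA target conformance in 0..1)
Rules = Dict[str, Tuple[Callable[[Record], bool], float]]


@dataclass
class RuleResult:
    rule: str
    checked: int
    passed: int
    sla_target: float

    @property
    def conformance(self) -> float:
        # An empty data set is treated as fully conformant (an assumption).
        return self.passed / self.checked if self.checked else 1.0

    @property
    def meets_sla(self) -> bool:
        return self.conformance >= self.sla_target


def build_report(records: List[Record], rules: Rules) -> List[RuleResult]:
    """Measure each rule against every record."""
    results = []
    for name, (predicate, target) in rules.items():
        passed = sum(1 for record in records if predicate(record))
        results.append(RuleResult(name, len(records), passed, target))
    return results


def format_report(results: List[RuleResult]) -> str:
    """Render one line per rule, marking SLA breaches for follow-up."""
    lines = []
    for r in results:
        status = "OK" if r.meets_sla else "SLA BREACH"
        lines.append(f"{r.rule}: {r.passed}/{r.checked} conformant, "
                     f"target {r.sla_target:.0%} -> {status}")
    return "\n".join(lines)
```

A breach line is the natural trigger for a data quality incident workflow, and the same `RuleResult` values can be stored over time to track steward performance against the SLA.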


1. Data Quality in Context of Data Management

2. DQE Definition

3. DQE Cycle & Contextual Complications

4. DQ Causes and Dimensions

5. Quality and the Data Life Cycle

6. DQ Tool Sets

7. Takeaways and Q&A

Data Quality Strategies

Guiding Principles

• Manage data as a core organizational asset
• Identify a gold record for all data elements
• All data elements will have a standardized data definition, data type, and acceptable value domain
• Leverage data governance for the control and performance of DQM
• Use industry and international data standards whenever possible
• Downstream data consumers specify data quality expectations
• Define business rules to assert conformance to data quality expectations
• Validate data instances and data sets against defined business rules
• Business process owners will agree to and abide by data quality SLAs
• Apply data corrections at the original source if possible
• If it is not possible to correct data at the source, forward data corrections to the owner of the original source. Influence on data brokers to conform to local requirements may be limited
• Report measured levels of data quality to appropriate data stewards, business process owners, and SLA managers
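Two of these principles, a standardized definition per data element and validation of data instances against defined rules, can be illustrated with a small Python sketch. `ElementDefinition`, `validate`, and the example elements are hypothetical; they only show the shape such definitions might take.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ElementDefinition:
    """Standardized definition, data type, and acceptable value domain."""
    name: str
    definition: str
    data_type: type
    domain: Optional[FrozenSet[Any]] = None  # None means any value of the type


def validate(record: Dict[str, Any],
             definitions: List[ElementDefinition]) -> List[str]:
    """Return one message per violation; an empty list means conformant."""
    violations = []
    for element in definitions:
        if element.name not in record:
            violations.append(f"{element.name}: missing")
            continue
        value = record[element.name]
        if not isinstance(value, element.data_type):
            violations.append(f"{element.name}: expected {element.data_type.__name__}")
        elif element.domain is not None and value not in element.domain:
            violations.append(f"{element.name}: {value!r} outside value domain")
    return violations
```

Because the violations are returned rather than raised, the caller can decide whether to correct the value at the source, forward it to the source owner, or simply count it toward the reported quality level.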


Goals and Principles

• To measurably improve the quality of data in relation to defined business expectations
• To define requirements and specifications for integrating data quality control into the system development life cycle
• To provide defined processes for measuring, monitoring, and reporting conformance to acceptable levels of data quality


Summary: Data Quality Engineering


Upcoming Events

Data-Ed Online: The Seven Deadly Data Sins - Emerging from Management Purgatory November 14, 2017 @ 2:00 PM ET/11:00 AM PT

Data-Ed Online: Metadata Strategies - Data Squared December 13, 2012 @ 2:00 PM ET/11:00 AM PT

Sign up here: www.datablueprint.com/webinar-schedule or www.dataversity.net


References & Recommended Reading


Data Quality Dimensions


Data Value Quality


Data Representation Quality


Data Model Quality


Data Architecture Quality


Questions?



It’s your turn! Use the chat feature or Twitter (#dataed) to submit your questions to Peter now.

10124 W. Broad Street, Suite C Glen Allen, Virginia 23060 804.521.4056