1er Simposio Latinoamericano Data Quality Fundamentals Miguel Angel Granados Troncoso


Agenda

• Scenarios
• Definitions, Processes and Standards
• Data Quality Services (DQS)
• DQS Solutions

[Diagram: twelve numbered capabilities grouped under three themes: MISSION CRITICAL CONFIDENCE, BREAKTHROUGH INSIGHT, CLOUD ON YOUR TERMS]

• Required 9s & Protection
• Blazing-Fast Performance
• Organizational Compliance
• Peace of Mind
• Rapid Data Exploration
• Managed Self-Service BI
• Credible, Consistent Data
• Scalable Analytics & DW
• Scale on Demand
• Fast Time to Solution
• Optimized Productivity
• Extend Any Data, Anywhere

Credible, Consistent Data

Companies with accurate data perform better¹

• Top 20% performers: 91% of master data complete & accurate; 1.2 hrs spent per employee each week searching for info
• Middle 50% performers: 68% of master data complete & accurate; 2.8 hrs
• Bottom 30% performers: under 50% of master data complete & accurate; 6 hrs

Delivered with:
• Single BI Semantic Model
• Data Quality Services
• Master Data Services

¹Source: “Turning Pain into Productivity with Master Data Management,” Aberdeen Group, Feb 2011

Why is Data Quality Important?

Data quality problems cost U.S. businesses more than $600 billion a year.

Data Warehousing Institute (TDWI)

Costs associated with bad data include:
• Excess inventory
• Higher supply chain costs
• Higher direct marketing costs
• Billing
• And more…

Common Data Quality Issues

Data Quality Issue | Question | Sample Data Problem
Format | Do values follow consistent formatting standards? | Telephone number formats: xxxxxxxxxx, (xxx) xxx-xxxx, 1.xxx.xxx.xxxx, etc.
Standard | Are data elements consistently defined and understood? | ‘Gender code’ = M, F, U versus ‘Gender code’ = 0, 1, 2
Consistent | Do values represent the same meaning? | How is revenue presented? Dollars, Euro, both?
Complete | Is all necessary data present? | 20% of customers’ last names are blank; 50% of zip codes are 99999
Accurate | Does the data accurately represent reality or a verifiable source? | A supplier is listed as ‘Active’ but went out of business six years ago
Valid | Do data values fall within acceptable ranges? | Salary values should be between 60,000 and 120,000
Duplicates | Does the data appear several times? | Both John Ryan and Jack Ryan appear in the system. Are they the same person?
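
Some of the issue types listed above can be expressed as simple rule checks. The following is a minimal sketch and not part of the original slides: the record fields (`phone`, `last_name`, `zip`, `salary`), the choice of `(xxx) xxx-xxxx` as the canonical phone format, and the function name `check_record` are illustrative assumptions. The zip placeholder and salary range come from the sample problems above.

```python
import re

# Assumption: (xxx) xxx-xxxx is chosen as the one accepted phone format
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
# Placeholder zip code that hides a missing value (from the "Complete" sample)
PLACEHOLDER_ZIP = "99999"
# Acceptable salary range (from the "Valid" sample)
SALARY_RANGE = (60_000, 120_000)


def check_record(record):
    """Return the names of the data quality issues found in one record (a dict)."""
    issues = []
    if not PHONE_PATTERN.fullmatch(record.get("phone", "")):
        issues.append("Format")
    if not record.get("last_name", "").strip() or record.get("zip") == PLACEHOLDER_ZIP:
        issues.append("Complete")
    salary = record.get("salary")
    if salary is None or not (SALARY_RANGE[0] <= salary <= SALARY_RANGE[1]):
        issues.append("Valid")
    return issues


records = [
    {"phone": "(555) 123-4567", "last_name": "Ryan", "zip": "06700", "salary": 75_000},
    {"phone": "5551234567", "last_name": "", "zip": "99999", "salary": 150_000},
]
for rec in records:
    print(check_record(rec))
```

The first record passes every check and the second fails all three. Accuracy and duplicate detection are left out because they need an external source or a comparison across records rather than a per-record rule.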


Data Governance

Scope, from strategic to tactical:
• IT Governance
• Data Governance
• Data Management
• Data Quality
• Data Correctness

Data Management

Content
• Subject details
• Attribute identification
• Subject names
• Definitions
• Values representation
• Standard formats

Relationship
• Identity part (similar attributes)
• Group (Rules/Logic)
• Hierarchy (Parent/Child)
• Relationship Rules/Scenarios

Access
• Access and Sharing Policies (internal/external)
• Data provider
• Metadata (use, lineage, etc.)
• Regulations/Security
• External data sources

Change Management
• Data Quality and Acceptability
• Measurement and monitoring
• Error detection and correction
• Centralized change tracking
• Jurisdiction over data

Data Standardization

Data Management

Master Data Management

Data Quality

• Data quality consists of verifying whether the data is suitable for its intended use in operations, decision making and planning.

Domain Management

Knowledge Discovery

Discovery Value

Management

Quality Control Efforts
• Know the context of the data
• Profile the required data
• Create and maintain quality standards
• Track data quality

Requirements for Data Quality Solution

• Profiling: Analysis of the data source, providing insight into the quality of the data, to identify data quality issues.
• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.
• Matching: Identifying, linking and removing duplications within or across sets of data.
• Monitoring: Tracking and monitoring the state of data quality activities and quality of data.
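
To make the profiling requirement concrete, the sketch below summarizes a single column: how complete it is, how many distinct values it holds, and which value is most frequent. It is an illustrative example only; the function name `profile_column` and the sample values are assumptions, and real profiling tools report many more statistics.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: completeness, distinct count, most common value."""
    total = len(values)
    # Treat None and blank strings as missing
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(filled)
    return {
        "total": total,
        "completeness": len(filled) / total if total else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Sample column mixing two coding schemes and some missing values
gender_codes = ["M", "F", "", "M", None, "U", "M", "0"]
profile = profile_column(gender_codes)
print(profile)
```

A completeness below 1.0 points to a "Complete" issue, and an unexpected distinct value such as `"0"` hints at a "Standard" issue worth investigating.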

How to Manage Data Quality?

Data quality management entails the establishment and deployment of:
– Roles
– Responsibilities
– Policies
– Procedures
– Technology

Data Quality Standards

ISO 8000

• Data Quality Principles
• Characteristics that define data quality
• Processes that ensure data quality

ISO 22745

• Defines open technical dictionaries

• Applying dictionaries to master data

International Association for Information and Data Quality
http://www.iaidq.org/


What is Data Quality Services?

Data Quality Services (DQS) is a knowledge-driven data quality solution that enables IT pros and data stewards to easily improve the quality of their data.

DQS Solution Concepts

• Knowledge-Driven: Based on a Data Quality Knowledge Base (DQKB) that is reusable for a variety of data quality improvements
• Knowledge Discovery: Acquire additional knowledge through data samples and user feedback
• Semantics: Data is mapped into Data Domains, which capture its semantics
• Open and Extendible: Support use of user-generated knowledge and IP by 3rd party reference data providers
• Easy to Use: Compelling user experience designed for increased productivity

Data Quality Knowledge Base (DQKB)

Knowledge base contents (diagram):
• Domains: Values, Value Relations, Term-based Relations, Domain Rules, Reference Data Services
• Composite Domains: Composite Domain Rules, Reference Data Services
• Matching Policy: Matching Rules

• Repository of knowledge about data:
– Domains define values and rules for each field
– Matching policies define rules for identifying duplicate records
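
The idea that a domain defines values and rules for a field can be sketched as a small data structure. This is an illustrative model only, not the DQS API: the `Domain` class, its attribute names, and the sample synonyms are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Domain:
    """Illustrative model of a knowledge base domain; not the DQS API."""
    name: str
    values: Set[str]
    # Maps a known variant to its preferred (leading) value
    synonyms: Dict[str, str] = field(default_factory=dict)
    # Each rule returns True when the value is acceptable
    rules: List[Callable[[str], bool]] = field(default_factory=list)

    def normalize(self, value: str) -> str:
        """Map a known synonym to its leading value; leave other values alone."""
        cleaned = value.strip()
        return self.synonyms.get(cleaned, cleaned)

    def is_valid(self, value: str) -> bool:
        """Valid means: a known domain value that also passes every rule."""
        normalized = self.normalize(value)
        return normalized in self.values and all(rule(normalized) for rule in self.rules)


gender = Domain(
    name="Gender code",
    values={"M", "F", "U"},
    synonyms={"Male": "M", "Female": "F", "Unknown": "U"},
    rules=[lambda v: len(v) == 1],
)
print(gender.normalize(" Female "))
print(gender.is_valid("Male"))
print(gender.is_valid("X"))
```

Keeping values, synonyms, and rules together in one object is what makes the knowledge reusable: the same domain can be applied to any field that carries the same kind of data.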

DQS Knowledge Sources

• Windows Azure Marketplace™ DataMarket: Cleanse and enrich data with Reference Data Services from DataMarket
• DQS Data Store: Website that contains DQS knowledge available for downloading
• 3rd Party Reference Data Providers: Open integration with external 3rd party reference data providers
• Organization Data: Create domains from your own data sources
• Out of the Box Knowledge: A set of data domains that come out of the box with DQS

What is a Domain?

• Domains are specific to a data field
• Domains contain the rules for the data
• Domains can be individual or composite

A domain in the KB contains:
• Values
• Reference Data
• Rules and Relationships

Example: Name, made up of First Name and Family Name

What is a Reference Data Service?

Address

• The Azure Marketplace hosts specialist data cleansing providers
• Set up an account
• Subscribe to a reference service
• Map your domain to the reference service

DQS Architecture Overview

[Architecture diagram; labels grouped by tier]

DQS Clients
• DQS Client: Knowledge Discovery and Management, Interactive DQ Projects, Administration
• SSIS DQS Cleansing Component
• Other DQS Clients
• Future Clients: Excel, SharePoint, MDS…

DQS Server
• DQS Engine: Knowledge Discovery, Data Profiling, Exploration, Matching, Cleansing, Reference Data
• DQ Projects Store: DQ Active Projects
• Common Knowledge Store: Published KBs

DQS Cloud Services
• Reference Data Services
• DataMarket - Categorized Reference Data
• DQS Store - KB, Domains
• 3rd Party Reference Data
• Reference Data API (Browse, Get, Update…)
• Reference Data API (Browse, Set, Validate…)


DQS process

[Diagram labels]
• Build: Knowledge Management
• Use: DQ Projects
• Knowledge Base
• Enterprise Data
• Reference Data
• Cloud Services
• Integrated Profiling
• Progress Notifications, Status

Interactive Cleansing – DQS Project
• Analyzes the quality of source data
• Automatically corrects and enriches the data
• Manual approval/rejection of suggestions provided by the cleansing algorithm or reference data services

Batch Cleansing - Using SSIS

[Diagram labels]
• DQS server
• Knowledge Base: Values/Rules, Reference Data Definition, Matching Policy
• Reference Data Services
• Output statuses: Correct, Corrected, Suggested, Invalid, New
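
The cleansing statuses named in the slides (Correct, Corrected, Suggested, Invalid, New) can be illustrated with a small classifier. This is a sketch under stated assumptions, not how DQS works internally: the function `cleanse_value`, the two thresholds, and the use of Python's `difflib` string similarity in place of the DQS cleansing algorithm are all choices made for the example.

```python
import difflib


def cleanse_value(value, valid_values, corrections, rules=(),
                  auto_threshold=0.9, suggest_threshold=0.7):
    """Return (output_value, status) for one source value.

    Thresholds are assumed values for illustration only.
    """
    if value in valid_values:
        return value, "Correct"
    if value in corrections:
        # Known error with a stored correction
        return corrections[value], "Corrected"
    if not all(rule(value) for rule in rules):
        return value, "Invalid"
    closest = difflib.get_close_matches(
        value, sorted(valid_values), n=1, cutoff=suggest_threshold
    )
    if closest:
        score = difflib.SequenceMatcher(None, closest[0], value).ratio()
        # High-confidence matches are applied; weaker ones await review
        status = "Corrected" if score >= auto_threshold else "Suggested"
        return closest[0], status
    return value, "New"


cities = {"Mexico City", "Guadalajara", "Monterrey"}
known_corrections = {"CDMX": "Mexico City"}
no_digits = [lambda v: not any(ch.isdigit() for ch in v)]

for source in ["Monterrey", "CDMX", "Guadalajarra", "Montrey", "12345", "Puebla"]:
    print(source, "->", cleanse_value(source, cities, known_corrections, no_digits))
```

In an interactive project, the Suggested and New values are the ones a data steward would review and approve or reject; in a batch run the status travels with each record so downstream steps can route it.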

Matching – DQS Project

Why Match?
• Identify duplicates within the data source
• Create consolidated view of data

DQS Matching
• Build a matching policy
• Matching training
• Create a matching project
• Choose survivors
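
The matching steps can be sketched end to end: a weighted matching policy scores pairs of records, pairs above a threshold are grouped into clusters, and one survivor is chosen per cluster. Everything here is an assumption for illustration, including the field weights, the 0.8 threshold, the `difflib` similarity measure, and the "most complete record wins" survivorship rule; it does not reproduce the DQS matching algorithm.

```python
import difflib
from itertools import combinations

# Assumed matching policy: field name -> weight (weights sum to 1.0)
MATCHING_POLICY = {"name": 0.6, "city": 0.4}
MIN_MATCHING_SCORE = 0.8  # assumed threshold


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_score(r1, r2, policy=MATCHING_POLICY):
    """Weighted sum of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in policy.items())


def find_clusters(records, threshold=MIN_MATCHING_SCORE):
    """Group record indexes linked by a pair score >= threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if matching_score(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def choose_survivor(records, cluster):
    """Assumed survivorship rule: keep the record with the most non-empty fields."""
    return max(cluster, key=lambda i: sum(1 for v in records[i].values() if v))


customers = [
    {"name": "John Ryan", "city": "Monterrey", "phone": "(555) 123-4567"},
    {"name": "Jon Ryan", "city": "Monterrey", "phone": ""},
    {"name": "Maria Lopez", "city": "Guadalajara", "phone": ""},
]
clusters = find_clusters(customers)
print(clusters)
print([choose_survivor(customers, c) for c in clusters])
```

Comparing every pair is quadratic in the number of records, which is fine for a sketch but is the reason real matching tools restrict comparisons to candidate groups first.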


Q&A

Miguel Ángel Granados Troncoso
@SQLMiguelG
miguelangel@granadostroncoso.com.mx

Personal Blog: http://www.granadostroncoso.com.mx

PASS Mexico City Chapter: http://mexico.sqlpass.org (@PASSMXDF)

SolidQ Journal: http://www.solidq.com/sqj/Pages/Home.aspx

Microsoft: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/SQL-Server-2012-business-intelligence.aspx