November 10th, 2011

DQS MATCHING

GADI PELEG, SENIOR PROGRAM MANAGER

SQL SERVER DATA QUALITY SERVICES

Microsoft

SQL Server 2012

Agenda

What is record matching?

Data Issues

DQS Matching Process

DQS Data Matching Principles

Matching Policy

Matching Project

What is Record Matching?

Record matching is the task of identifying records that refer to the same real-world entity.

The Cost of Duplicate Data

…a few examples…

Direct marketing communications are doubled up unnecessarily.

Product shipments and customer-site services could be sent to the wrong address because an incorrect duplicate record is used.

Sales reporting may be inaccurate because of an over-inflated number of customers.

Sales analysis may be inaccurate when sales are split between multiple records that represent the same customer, which undervalues some key customers.

Where do Duplicate Records come from?

Poorly designed software - no verification of existing records upon entry.

Formatting & abbreviations - "Doctor Robert Smith" vs. "Dr. Bob Smith".

Data validation - human errors can creep into the system when field input is not validated.

Company mergers and acquisitions - merging systems may result in duplicates in the merged data.

Change of attributes - the same person may appear not to exist in the database if some attributes have changed (e.g., address or name).

…Data Issues…

There are different ways to represent the same person or address in a database:

Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations etc.).
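To get a feel for how such variations lower a naive similarity score, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. This is not the similarity function DQS uses (the slides do not specify one), and the sample strings are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Different representations that may refer to the same person (illustrative values).
pairs = [
    ("Doctor Robert Smith", "Doctor Robert Smith"),  # identical
    ("Doctor Robert Smith", "Dr. Bob Smith"),        # abbreviation and nickname
    ("Robert Smith", "Robert Smyth"),                # spelling mistake
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

Only the identical pair scores 1.0; the abbreviated and misspelled variants score lower even though a person would readily treat them as the same entity.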

How Do Data Issues Affect Matching?

Matching Results

Field similarity scores are low when the same value is represented in different ways.

Low similarity scores can be improved by expanding the knowledge in the knowledge base (KB).

Matching Results Reasoning

The Data

DQS Matching Principles

DQS Matching Process

[Diagram. Labels: Connect, Build, Use, DQ Projects, Knowledge Management, Knowledge Base, Sample Data, Integrated Profiling, Notifications, Progress, Status]

DQS Matching Key Points

Identifies exact and approximate matches, enabling removal of duplicate data.

Enables creating a matching policy interactively using a computer-assisted process.

Ensures that equivalent values entered in different formats or styles are rendered uniform.

Matching Policy

What is a Matching Policy?

A matching policy is prepared in the knowledge base.

A matching policy consists of matching rules that assess how well one record matches another.

In each rule, specify whether a domain's values must match exactly, may be similar, or are a prerequisite for a match.

Train your policy by running and tuning each rule separately.

Matching Policy Process

Identify the attributes in your data that are most significant for matching.

Create domains/composite domains based on your data structure.

Define matching rules.

[Diagram: columns in Your Data mapped to DQS Domains for Matching]

Birth Date

Gender

Composite Domain Full Name: F. Name, M. Name, L. Name

Email

Phone

Composite Domain Full Address: Street, City, State, Country

Matching Rules Properties

Similarity - select Similar if field values can be similar; select Exact if field values must be identical.

Weight - determines the contribution of each domain in the rule to the overall matching score for two records.

Prerequisite - requires that the field values return a 100% match; otherwise the records are not considered a match.

Minimum matching score - the threshold at or above which two records are considered to be a match.
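The way these four properties interact can be sketched as a weighted sum with a prerequisite gate. This is a hedged illustration, not the DQS implementation: the `DomainRule` class, the function names, the use of `difflib` for the Similar case, and the sample records are all assumptions made for the example, as is the choice to give prerequisite domains no weight.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class DomainRule:
    """One domain inside a matching rule (hypothetical structure)."""
    domain: str
    similarity: str = "Similar"  # "Similar" or "Exact"
    weight: float = 0.0          # share of the overall score; weights sum to 1.0
    prerequisite: bool = False   # must match 100%; carries no weight here


def field_score(rule: DomainRule, a: str, b: str) -> float:
    """Score one pair of field values between 0.0 and 1.0."""
    if rule.prerequisite or rule.similarity == "Exact":
        return 1.0 if a == b else 0.0
    # Stand-in for a "Similar" comparison; DQS uses its own algorithm.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_score(rules, rec_a, rec_b) -> float:
    """Weighted sum of field scores; a failed prerequisite yields 0.0."""
    total = 0.0
    for rule in rules:
        score = field_score(rule, rec_a[rule.domain], rec_b[rule.domain])
        if rule.prerequisite:
            if score < 1.0:
                return 0.0
            continue
        total += rule.weight * score
    return total


def is_match(rules, rec_a, rec_b, min_matching_score=0.8) -> bool:
    """Two records match when the score is at or above the threshold."""
    return matching_score(rules, rec_a, rec_b) >= min_matching_score


rules = [
    DomainRule("Country", prerequisite=True),
    DomainRule("Full Name", similarity="Similar", weight=0.6),
    DomainRule("Email", similarity="Exact", weight=0.4),
]
a = {"Country": "US", "Full Name": "Robert Smith", "Email": "rsmith@example.com"}
b = {"Country": "US", "Full Name": "Robert Smyth", "Email": "rsmith@example.com"}
c = {"Country": "CA", "Full Name": "Robert Smith", "Email": "rsmith@example.com"}

print(is_match(rules, a, b))        # prerequisite passes, name is close, email is exact
print(matching_score(rules, a, c))  # prerequisite fails, so the score is 0.0
```

Keeping the prerequisite check separate from the weighted sum means a mismatch on a prerequisite domain disqualifies the pair regardless of how similar the other fields are.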

Similarity Functions for Non String Domains

Domains of type ‘Date’, ‘Integer’ or ‘Decimal’ can be matched using the ‘Similar’ property by assigning a tolerance, either as a percentage or as an integer.

Field values that fall within the defined tolerance are considered a match.

‘Date’ domain - define the tolerance as an integer.

‘Integer’ domain - define the tolerance as an integer or a percentage.
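A tolerance check of this kind might look like the sketch below. The function names are hypothetical, the date tolerance is assumed to be counted in days, and the percentage is assumed to be taken relative to the larger of the two values; the slides do not state either detail.

```python
from datetime import date


def within_integer_tolerance(a: int, b: int, tolerance: int) -> bool:
    """True when the absolute difference is no more than the tolerance."""
    return abs(a - b) <= tolerance


def within_percentage_tolerance(a: float, b: float, tolerance_pct: float) -> bool:
    """True when the difference is at most tolerance_pct percent of the larger value."""
    largest = max(abs(a), abs(b))
    # Cross-multiplied form avoids dividing by zero when both values are 0.
    return abs(a - b) * 100 <= tolerance_pct * largest


def dates_within_tolerance(a: date, b: date, tolerance_days: int) -> bool:
    """True when the two dates are at most tolerance_days apart (assumed unit: days)."""
    return abs((a - b).days) <= tolerance_days


print(within_integer_tolerance(100, 103, 5))                           # within 5
print(within_percentage_tolerance(100, 90, 5))                         # 10% apart
print(dates_within_tolerance(date(1980, 5, 1), date(1980, 5, 3), 2))   # 2 days apart
```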

Policy Tuning - Profiler

Uniqueness

Level | Usage | Description | Domains
Low | Define as Prerequisite; define with lower weights | Provides discriminatory information | Gender, City, State
High | Define as Similar or Exact; define with higher weights | Provides highly identifiable information and is highly discriminatory | Names (First, Last, Company), Address Line 1

Completeness

Level | Usage | Description
Low | Do not use, or define with low weight | High level of missing values
High | Include for matching if the column provides highly identifiable information | Low level of missing values

The Profiler provides insights about the Completeness and Uniqueness of the data which can be used to tune your matching policy.
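As a rough illustration of the two statistics, the sketch below computes them for a handful of made-up records. The formulas are assumptions (completeness as the share of non-missing values, uniqueness as the share of distinct values among the non-missing ones); the DQS Profiler may define them differently.

```python
def completeness(values) -> float:
    """Share of records with a non-missing value (0..1)."""
    if not values:
        return 0.0
    filled = [v for v in values if v not in (None, "")]
    return len(filled) / len(values)


def uniqueness(values) -> float:
    """Share of distinct values among the non-missing values (0..1)."""
    filled = [v for v in values if v not in (None, "")]
    if not filled:
        return 0.0
    return len(set(filled)) / len(filled)


def column(rows, name):
    """Collect one column from a list of record dictionaries."""
    return [row.get(name) for row in rows]


# Illustrative records only.
rows = [
    {"Last Name": "Smith", "Gender": "M", "Phone": ""},
    {"Last Name": "Jones", "Gender": "F", "Phone": None},
    {"Last Name": "Garcia", "Gender": "M", "Phone": "555-0100"},
    {"Last Name": "Brown", "Gender": "F", "Phone": ""},
]

for name in ("Last Name", "Gender", "Phone"):
    values = column(rows, name)
    print(f"{name}: completeness={completeness(values):.2f}, "
          f"uniqueness={uniqueness(values):.2f}")
```

In this sample, Last Name is complete and highly unique, so it is a candidate for a high weight; Gender is complete but has low uniqueness; Phone is mostly missing and would get a low weight or be left out.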

Policy Tuning - Matching Results Tab

• The Matching Results tab displays statistics for the current and previous runs of a matching rule.

• You can restore the previous rule.

Matching Policy Demo

The Data

Knowledge

Home Team, Song, Artist

Matching + Knowledge

Matching with Knowledge

The DQS matching system uses the knowledge accumulated in the knowledge base to propose matching candidates. This knowledge includes:

Synonyms, Syntax Errors and their Leading Value (by domain)

Domain Values and their synonyms and syntax errors are used by the matching system to find identical or similar records.

Term-Based Relations (TBR)

TBRs improve the consistency of data attribute values by transforming data values to a single form using user-defined term relations. In matching, TBRs are applied only in memory to boost matching accuracy.

Nulls and Equivalents (“Unknown”, “99999”…)

Manage values that represent missing data by linking them to the ‘DQS_Null’ value to ensure that they are considered a match.
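The three kinds of knowledge can be pictured as lookups applied to field values before comparison. The sketch below is an assumption-laden illustration: the dictionaries, their contents, and the function names are hypothetical, and only the `DQS_Null` name comes from the slide.

```python
DQS_NULL = "DQS_Null"  # name taken from the slide

# Hypothetical knowledge for a "City" domain: synonyms and syntax errors
# mapped to their leading value.
LEADING_VALUES = {
    "nyc": "New York",
    "new york city": "New York",
    "new yrok": "New York",  # syntax error (typo)
}

# Hypothetical values that stand for missing data.
NULL_EQUIVALENTS = {"", "unknown", "99999", "n/a"}

# Hypothetical term-based relations for an address domain.
TERM_RELATIONS = {"st.": "Street", "st": "Street", "ave": "Avenue", "ave.": "Avenue"}


def to_leading_value(value: str, leading_values, null_equivalents) -> str:
    """Map a value to its leading value, or to DQS_Null if it means 'missing'."""
    key = value.strip().lower()
    if key in null_equivalents:
        return DQS_NULL
    return leading_values.get(key, value.strip())


def apply_term_relations(value: str, relations) -> str:
    """Replace each whole term that has a relation; leave other terms unchanged."""
    return " ".join(relations.get(term.lower(), term) for term in value.split())


print(to_leading_value("NYC", LEADING_VALUES, NULL_EQUIVALENTS))
print(to_leading_value("Unknown", LEADING_VALUES, NULL_EQUIVALENTS))
print(apply_term_relations("42 Main St.", TERM_RELATIONS))
```

Because the transformed values are used only for comparison, the source data itself stays unchanged, which mirrors the slide's point that TBRs are applied in memory during matching.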

String Normalization

Before applying the matching algorithm, the DQS matching system removes punctuation characters from field values. This step is called string normalization.

String normalization is a Domain property.

Example:

String 1 | String 2 | Similarity score before | Similarity score after | Characters
175 CLEARBROOK ROAD P.O. BOX 535 | 175 CLEARBROOK ROAD P.O.BOX 535 | 0.92 | 1.00 | .
1834 E. 42ND STREET | 1834 E. 42ND. ST. | 0.695 | 0.857 | .
1721 DE KALB AVE, NE | 1721 DE KALB AVE NE | 0.88 | 1.00 | ,
14538 S. GARFIELD AVE., BLDG. 1-B | 14538 S GARFIELD AVE BLDG 1B | 0.676 | 0.944 | , . -
#704, SJ Technoville BD, 60-19 | 704 SJ Technoville BD 60 19 | 0.65 | 1.00 | # , -
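The effect of string normalization on the address pairs above can be approximated with the sketch below. It assumes punctuation is replaced by a space and repeated spaces are collapsed, which is consistent with the examples but not confirmed by the slides. The scores come from `difflib`, not from DQS, so they will not reproduce the figures in the example.

```python
import string
from difflib import SequenceMatcher

# Map every ASCII punctuation character to a space (assumed behaviour).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(value: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(value.translate(_PUNCT_TO_SPACE).split())


def similarity(a: str, b: str) -> float:
    """0..1 similarity ratio; a stand-in for the DQS matching algorithm."""
    return SequenceMatcher(None, a, b).ratio()


pairs = [
    ("175 CLEARBROOK ROAD P.O. BOX 535", "175 CLEARBROOK ROAD P.O.BOX 535"),
    ("1721 DE KALB AVE, NE", "1721 DE KALB AVE NE"),
    ("#704, SJ Technoville BD, 60-19", "704 SJ Technoville BD 60 19"),
]

for left, right in pairs:
    before = similarity(left, right)
    after = similarity(normalize(left), normalize(right))
    print(f"{left!r} vs {right!r}: before={before:.3f} after={after:.3f}")
```

For these three pairs the normalized strings become identical, so the score after normalization is 1.0, in line with the corresponding rows of the example.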

Matching Project

Matching Project Principles

A Matching project is performed in three steps:

Mapping - map source columns to domains.

Matching - run matching and view the results; it includes additional functionality such as:

• Reject records
• Filter results by ‘Matched’ & ‘Unmatched’ and by matching score
• Display clusters in two different methods (overlapping and non-overlapping)

Export - export both matching results (clusters) and survivors (unique records).

Overlapping Vs. Non-Overlapping Clusters

In Overlapping clusters, a record may appear in more than one cluster. This structure may be harder to read because the same record is repeated across clusters.

In Non-Overlapping clusters, the system unifies clusters that contain the same record. This structure is easier to read because each record appears only once.

Overlapping Clusters for records A, B, C: (A~B), (B~C)

Non-Overlapping Cluster for records A, B, C: (A~B~C)
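Unifying clusters that share a record amounts to grouping records by connectivity. A minimal sketch using a union-find structure follows; the function name and the pair-list input format are assumptions, and this is one common way to do such a merge rather than a description of DQS internals.

```python
def non_overlapping_clusters(pairs):
    """Merge matched pairs that share a record into non-overlapping clusters."""
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Overlapping clusters (A~B) and (B~C) share record B.
print(non_overlapping_clusters([("A", "B"), ("B", "C")]))
```

With the pairs (A~B) and (B~C) as input, the records end up in the single cluster (A~B~C), matching the example above.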

Example: Overlapping Vs. Non-Overlapping

[Screenshots: Overlapping Clusters; Non-Overlapping Clusters]

Rejecting Records from Clusters

Select the Rejected check box to remove a record from its proposed cluster; the removal takes effect when you move to the next page of the activity. Unlike in a cleansing project, where records move between tabs instantly, rejected records remain visible in their clusters on the user interface.

DQS Client User Interface: Record 1000040 is rejected on the user interface.

Exported Matching Results: Record 1000040 appears as ‘Rejected’ under the ‘Status’ column.

Record Survivorship and Export

Matching and survivorship results can be exported to a SQL table, an Excel file, or a CSV file for further analysis or consumption.
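To make the idea of survivors concrete, the sketch below picks one record per cluster and writes the result as CSV text. The survivorship rule used here (keep the record with the most non-empty fields) is an assumption chosen for illustration, since the slides do not describe how survivors are selected; the function names and sample records are hypothetical as well.

```python
import csv
import io


def most_complete(records):
    """Pick the record with the most non-empty fields; ties go to the first."""
    return max(records, key=lambda r: sum(1 for v in r.values() if v not in (None, "")))


def export_survivors(clusters, fieldnames) -> str:
    """Write one survivor per cluster as CSV text and return it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for cluster in clusters:
        writer.writerow(most_complete(cluster))
    return buffer.getvalue()


# One illustrative cluster of two records that refer to the same person.
clusters = [
    [
        {"Id": "1", "Name": "Robert Smith", "Email": ""},
        {"Id": "2", "Name": "Bob Smith", "Email": "bob@example.com"},
    ],
]

print(export_survivors(clusters, ["Id", "Name", "Email"]))
```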

Matching Configuration - Administration

Specify a value in the Min record score field. This value is the lowest minimum matching score that a matching rule is allowed to use.

In a Matching Rule, the Minimum matching score parameter is the threshold at or above which two records are considered to be a match.

DQS Matching

Demo

Matching Lab Description

The Story

Contoso airport receives passenger details from different airlines; the data contains duplicate passenger records, which need to be identified and removed.

Exercise Description

In this exercise you will:

• Prepare a Matching Policy and tune the matching rules.

• Create a Matching Project and run a matching process to identify duplicate passengers.

• Export the matching and survivor results.

Resources

Sessions On-Demand & Community: www.microsoft.com/teched

Microsoft Certification & Training Resources: www.microsoft.com/learning

Resources for IT Professionals: http://microsoft.com/technet

Resources for Developers: http://microsoft.com/msdn

Connect. Share. Discuss.: http://northamerica.msteched.com

Additional DQS Resources

DQS Blog - tips, tricks and guidance on best practices for using DQS, courtesy of the DQS team.

DQS Movies - a set of getting-started movies for an easy introduction to DQS.

DQS Forum - come participate in DQS-related discussions in our DQS forum on MSDN.

Available Here: blogs.msdn.com/b/dqs

Available Here

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
