Data Quality Services 101
Knowledge Base Driven Data Quality
Matching
Integration: DQS with MDS and SSIS
DQ Issues and DQ Dimensions
Before
Name      Gender  Street       House #  Zip code  City         State     D.O.B
John Doe  Male    60th street  45                 New York     New York  08/12/64
Jane Doe  Male    Jonathan ln  36       10023     Poughkeepsy  NY        21-dec-1954

After
Name      Gender  Street         House #  Zip code  City          State  D.O.B
John Doe  Male    E 60th St      45W      10022     New York      NY     08/12/64
Jane Doe  Female  Jonathan Lane  36       10023     Poughkeepsie  NY     12/21/54

Before
Name                   Address                        Postal Code  City     State
John Smith             545 S Valley View Drive # 136  34563        Anytown  New York
Margaret & John smith  545 Valley View ave unit 136   34563-2341   Anytown  New York
Maggie Smith           545 S Valley View Dr                        Anytown  New York
John Smith             545 Valley Drive St.           34253        NY       NY

After
Name                   Address                        Zip Code     City     State     Cluster
John Smith             545 S Valley View Drive # 136  34563        Anytown  New York  1
Margaret & John smith  545 Valley View ave unit 136   34563-2341   Anytown  New York  1
Maggie Smith           545 S Valley View Dr                        Anytown  New York  1
John Smith             545 Valley Drive St.           34253        NY       NY        2
Completeness Accuracy Conformity Consistency Uniqueness
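The cleansing rows above (before and after) can be imitated with a few lookup tables. This is only a minimal sketch, not how DQS works internally: the dictionaries stand in for knowledge base domain values, and every name in it (CITY_CORRECTIONS, STATE_CODES, STREET_SUFFIXES, cleanse) is hypothetical. It covers correction and standardization only; enrichment such as adding the missing zip code, or fixing the gender, would need reference data that the sketch does not have.

```python
from datetime import datetime

# Hypothetical stand-ins for knowledge base domain values (assumptions).
CITY_CORRECTIONS = {"Poughkeepsy": "Poughkeepsie"}
STATE_CODES = {"New York": "NY"}
STREET_SUFFIXES = {"ln": "Lane", "street": "St"}

# Date formats seen in the sample rows.
DATE_FORMATS = ("%m/%d/%y", "%d-%b-%Y")


def standardize_date(value):
    """Return the date as MM/DD/YY, or the input unchanged if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%m/%d/%y")
        except ValueError:
            continue
    return value


def standardize_street(value):
    """Replace a known trailing street suffix with its standard form."""
    words = value.split()
    if words and words[-1].lower() in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[words[-1].lower()]
    return " ".join(words)


def cleanse(record):
    """Return a corrected and standardized copy of one record."""
    cleaned = dict(record)
    cleaned["City"] = CITY_CORRECTIONS.get(record["City"], record["City"])
    cleaned["State"] = STATE_CODES.get(record["State"], record["State"])
    cleaned["Street"] = standardize_street(record["Street"])
    cleaned["D.O.B"] = standardize_date(record["D.O.B"])
    return cleaned


jane = {"Name": "Jane Doe", "Street": "Jonathan ln", "City": "Poughkeepsy",
        "State": "NY", "D.O.B": "21-dec-1954"}
print(cleanse(jane))
```

Keeping the corrections in plain dictionaries mirrors the idea of a knowledge base: the cleansing logic stays generic while the domain knowledge grows separately.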
[Diagram: Build (Knowledge Management) and Use (DQ Projects) both connect to the Knowledge Base, with integrated profiling.]
Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, enrichment and standardization.
Matching: Identifying, linking or merging related entries within or across sets of data.
Profiling: Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues.
Monitoring: Tracking and monitoring the state of quality activities and the quality of data.
[Architecture diagram; labels recovered from the figure]
DQ Clients: DQS UI (Knowledge Discovery and Management, Interactive DQ Projects, Data Exploration), SSIS DQ Component, MDS Excel Add in, Future Clients – Excel, Dynamics
DQ Server: RD Services API (Browse, Set, Validate…), Reference Data API (Browse, Get, Update…)
DQ Engine: Knowledge Discovery, Data Profiling & Exploration, Cleansing, Matching, Reference Data
Stores: DQ Projects Store (DQ Active Projects), Common Knowledge Store (MS Data Domains, Local Data Domains), Knowledge Base Store (Published KBs)
Reference data: MS DQ Domains Store, Reference Data Services, Reference Data Sets, 3rd Party / Internal
Azure Market Place: Categorized Reference Data, Categorized Reference Data Services
15010 NE 36th Street
RDS – Reference Data Services
[Diagram labels: In order, Knowledge Base, Parsing]
When to use
• When you don’t have enough knowledge in your knowledge base
• Sample: Melissa Data
Advantage
• Handing over the dirty job
Disadvantage
• Paying a subscription fee
• Large volumes of data may cause performance issues in the cloud
15010 NE 36th Street, Redmond, WA, USA
USA, 15010 NE 36th Street, Redmond, WA
15010 NE 36th Street, Redmond, WA
Data Issues
There are different ways to represent the same person or address in a database.
Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations, etc.).
A matching policy is prepared in the knowledge base.
A matching policy consists of matching rules that assess how well one record matches another.
In each rule, specify whether a domain’s values must be an exact match, may be similar, or act as a prerequisite.
Train your policy by running and tuning each rule separately.
Similarity: select Similar if field values can be similar. Select Exact if field values must be identical.
Weight: determines the contribution of each domain in the rule to the overall matching score for two records.
Prerequisite: validates whether field values return a 100% match; otherwise the records are not considered a match.
Minimum matching score: the threshold at or above which two records are considered to be a match.
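The four rule parameters above can be combined as in the following sketch. It is an illustration under stated assumptions, not the DQS algorithm: the similarity measure is Python's difflib.SequenceMatcher used as a stand-in, weights are assumed to be percentages that sum to 100, and the rule layout (prerequisites, domains, min_score) is a hypothetical structure invented for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough string similarity in [0, 1]; a stand-in, not the DQS measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_score(rec_a, rec_b, rule):
    """Score two records against one rule; returns a value in [0, 100]."""
    # Prerequisite domains must match 100%, otherwise the pair is rejected.
    for domain in rule["prerequisites"]:
        if rec_a[domain] != rec_b[domain]:
            return 0.0
    score = 0.0
    for domain, (kind, weight) in rule["domains"].items():
        if kind == "exact":
            field_score = 1.0 if rec_a[domain] == rec_b[domain] else 0.0
        else:  # "similar"
            field_score = similarity(rec_a[domain], rec_b[domain])
        # Each domain contributes its weight times its field score.
        score += weight * field_score
    return score


def is_match(rec_a, rec_b, rule):
    """Two records match when the score is at or above the minimum."""
    return matching_score(rec_a, rec_b, rule) >= rule["min_score"]


# Hypothetical rule: State is a prerequisite, Name and Address are Similar.
rule = {
    "prerequisites": ["State"],
    "domains": {"Name": ("similar", 40), "Address": ("similar", 60)},
    "min_score": 80,
}
a = {"Name": "John Smith", "Address": "545 S Valley View Drive # 136", "State": "New York"}
b = {"Name": "Maggie Smith", "Address": "545 S Valley View Dr", "State": "New York"}
print(matching_score(a, b, rule), is_match(a, b, rule))
```

Running one rule at a time against a sample and adjusting weights and the minimum score is the tuning loop that the slide calls training the policy.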
Uniqueness  Usage                                                   Description                                                             Domains
Low         Define as Prerequisite; define with lower weights       Provides discriminatory information                                     Gender, City, State
High        Define as Similar or Exact; define with higher weights  Provides highly identifiable information and is highly discriminatory  Names (First, Last, Company), Address Line 1

Completeness  Usage                                                                        Description
Low           Do not use or define with low weight                                         High level of missing values
High          Include for matching if the column provides highly identifiable information  Low level of missing values
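Both measures in the tables above are simple ratios, so a quick profile of the source can suggest which columns deserve high weights. The sketch below is an assumption about how one might compute them by hand; column_profile is a hypothetical helper, not a DQS function, and the rows are the sample records from the matching example earlier in the deck.

```python
def column_profile(rows, column):
    """Return (completeness, uniqueness) ratios for one column.

    completeness: share of rows with a non-empty value.
    uniqueness:   share of distinct values among the non-empty ones.
    """
    values = [row.get(column) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    completeness = len(filled) / len(values) if values else 0.0
    uniqueness = len(set(filled)) / len(filled) if filled else 0.0
    return completeness, uniqueness


rows = [
    {"Name": "John Smith", "Postal Code": "34563", "State": "New York"},
    {"Name": "Margaret & John smith", "Postal Code": "34563-2341", "State": "New York"},
    {"Name": "Maggie Smith", "Postal Code": "", "State": "New York"},
    {"Name": "John Smith", "Postal Code": "34253", "State": "NY"},
]

for column in ("Name", "Postal Code", "State"):
    print(column, column_profile(rows, column))
```

On these four rows, State is fully populated but has few distinct values, which is the profile of a prerequisite or low-weight domain, while Name is more unique and suits a Similar rule with a higher weight.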
In Overlapping clusters, a record may appear more than once across the clustered results. This structure may be harder to read since the same record exists in multiple clusters.
In Non-Overlapping clusters, the system unifies clusters containing the same record. This structure is easier to read because no record is repeated.
Overlapping Clusters: (A~B), (B~C)
Non-Overlapping Cluster: (A~B~C)
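Unifying overlapping clusters is a connected-components problem: any two clusters that share a record collapse into one. The sketch below uses a small union-find structure to show the idea. It is an illustration only and makes no claim about how DQS implements it; merge_clusters is a hypothetical name.

```python
def merge_clusters(pairs):
    """Merge overlapping matched pairs into non-overlapping clusters."""
    parent = {}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two records of every matched pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group every record under its root.
    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


print(merge_clusters([("A", "B"), ("B", "C")]))  # [['A', 'B', 'C']]
```

The overlapping result (A~B), (B~C) goes in as a list of pairs, and the single non-overlapping cluster (A~B~C) comes out.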
DQS Component Overview
[Diagram labels. Design: Reference Data, Definition, Values/Rules. Run: SSIS Package (Source + Mapping, DQS Cleansing Component, Destination). Activity Monitoring. Interactive Cleansing Project.]
http://social.technet.microsoft.com/wiki/contents/articles/14065.tsql-script-to-delete-dqs-projects-leftover-from-ssis-dqs-cleansing-component.aspx
Thank you