All Data is Dirty
BRIGHTHOUSE
BRIGHT HOUSE
BRIGHT HOUSE NETWORK
BRIGHT HOUSE NETWORKS
BRIGHTHOUSE
BRIGHTHOUSE NETWORKS
FLORIDA LIGHT & POWER
FLORIDA POWER
FLORIDA POWER & LIGHT
FLORIDA POWER & LIGHT CO
FLORIDA POWER & LIGHT COMPANY
FLORIDA POWER AND LIGHT
VERIZON COMMUNICATIONS
VERIZON FLORIDA INC
VERIZON FLORIDA INC.
VERIZON FLORIDA, INC.
VERIZON NORTH
VERIZON RESIDENTAL
VERIZON SOUTH
VERIZON WIRELESS
VERIZONWIRELESS
VERIZON
VERIZON (FORMERLY GTE)
AMERICAN EXPRESS
AMERICAN EXPRESS TRS CO INC
AMERICAN EXPRESSS
About Jeff Fried
Longtime Search Nerd
Focused on Search and SharePoint since 2004
• CTO, BA Insight
• Senior PM, Microsoft
• VP, FAST
• SVP, LingoMotors
Passionate About
• Search
• SharePoint
• Search-driven applications
• Information Strategy
Blog: BAinsight.com/blog
TechNet Column: “A View from the Crawlspace”
@jefffried
Requirements for Data Quality Solutions
• Profiling: Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues.
• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.
• Matching: Identifying, linking or merging related entries within or across sets of data.
• Monitoring: Tracking and monitoring the state of quality activities and the quality of data.
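As a rough illustration of profiling and cleansing, the sketch below groups spellings from the opening slide by a normalized key and then standardizes each group on its most frequent spelling. The function names (`normalize_key`, `profile`, `cleanse`) and the "most frequent spelling wins" rule are assumptions made for this example, not something the deck prescribes.

```python
import re
from collections import Counter


def normalize_key(name: str) -> str:
    """Uppercase the name and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def profile(values):
    """Profiling: count the raw spellings that share each normalized key."""
    groups = {}
    for value in values:
        groups.setdefault(normalize_key(value), Counter())[value] += 1
    return groups


def cleanse(values):
    """Cleansing: replace each value with the most frequent spelling in its group."""
    groups = profile(values)
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canonical[normalize_key(value)] for value in values]


raw = [
    "VERIZON FLORIDA INC",
    "VERIZON FLORIDA INC.",
    "VERIZON FLORIDA, INC.",
    "VERIZON FLORIDA INC",
    "BRIGHT HOUSE",
    "BRIGHTHOUSE",
    "BRIGHTHOUSE",
]

for key, counts in profile(raw).items():
    print(key, dict(counts))
print(cleanse(raw))
```

A key like this only absorbs differences in spacing and punctuation. Variants such as BRIGHT HOUSE NETWORKS or AMERICAN EXPRESSS would still need fuzzy matching.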
N-Grams for name/address resolution
Improves Data Quality through de-duplication
ABUKABLAN, KHALID HUSSEIN OCEAN BLVD 33062 POMPANO BEACH
3-Gram: ABU BUK UKA KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USS SSE SEI EIN IN N O OC OCE CEA EAN AN N B BL BLV LVD
KABLAN, KHALID HUSEIN OCEAN BULEVARD POMPANO BEACH
3-Gram: KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USE SEI EIN IN N O OC OCE CEA EAN AN N B BUL ULE LEV EVA VAR ARD
Score: 131 776
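A minimal sketch of the idea behind n-gram matching: split each record into overlapping character 3-grams and compare the two sets. The Jaccard ratio used here is an assumption chosen for illustration. The slide does not say how its score is computed, so this does not reproduce the score shown above.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of overlapping character n-grams of the text."""
    # Uppercase and collapse runs of whitespace so formatting noise is ignored.
    normalized = " ".join(text.upper().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by all distinct n-grams (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


record_a = "ABUKABLAN, KHALID HUSSEIN OCEAN BLVD"
record_b = "KABLAN, KHALID HUSEIN OCEAN BULEVARD"
unrelated = "FLORIDA POWER & LIGHT COMPANY"

print(f"a vs b:         {jaccard(record_a, record_b):.2f}")
print(f"a vs unrelated: {jaccard(record_a, unrelated):.2f}")
```

The two spellings of the same person share most of their 3-grams despite the differences in surname, first-name spelling and street abbreviation, so they score well above the unrelated record and can be flagged as likely duplicates.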
[Diagram: DQS architecture]
DQ Clients: DQS UI (Knowledge Discovery and Management, Interactive DQ Projects, Data Exploration); Future Clients – Excel, SharePoint…
DQ Server: DQ Engine (Knowledge Discovery, Data Profiling & Exploration, Cleansing, Matching, Reference Data); Reference Data API (Browse, Get, Update…); RD Services API (Browse, Set, Validate…)
Stores: DQ Projects Store (DQ Active Projects); Common Knowledge Store (MS Data Domains, Local Data Domains); Knowledge Base Store (Published KBs)
External sources: 3rd Party; MS DQ Domains Store; Reference Data Services; Reference Data Sets; Azure Market Place (Categorized Reference Data, Categorized Reference Data Services)
Example: Leading Insurance Vendor
Challenge:
o Client information is in several legacy systems with no standard definition for clients across the systems
o Client data is erroneous and duplicated
Solution:
o A consolidated, single view of an individual client across multiple systems
o Automated process that can match clients based on relevancy, statistics and business-defined matching rules
o Capability of learning from incorrectly matched entries that a user corrects
o Batch processing of data matching and merging, with a results and exception report
o Drives changes to legacy systems and master file
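To make "business-defined matching rules" and the exception report concrete, here is a hedged sketch of a weighted rule set with two thresholds. The field names, weights, thresholds and client records are all hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure the actual solution used.

```python
from difflib import SequenceMatcher

# Hypothetical business rules: a weight per field, summing to 1.0.
RULES = {"name": 0.6, "address": 0.3, "postal_code": 0.1}
MATCH_THRESHOLD = 0.85   # at or above: merge automatically
REVIEW_THRESHOLD = 0.60  # at or above: send to the exception report


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1]; a missing value scores 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in RULES.items()
    )


def classify(rec_a: dict, rec_b: dict) -> str:
    """Route a pair of records to 'match', 'review' or 'no match'."""
    score = match_score(rec_a, rec_b)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "no match"


# Hypothetical client records from two legacy systems.
client_a = {"name": "JANE DOE", "address": "12 OCEAN BLVD", "postal_code": "33062"}
client_b = {"name": "JANE DOE", "address": "12 OCEAN BOULEVARD", "postal_code": "33062"}
client_c = {"name": "QQQ", "address": "ZZZ", "postal_code": "77777"}

print(classify(client_a, client_b), round(match_score(client_a, client_b), 2))
print(classify(client_a, client_c), round(match_score(client_a, client_c), 2))
```

Pairs that land in the "review" band are the natural input for the learning step: once a user confirms or rejects them, the weights or thresholds can be adjusted.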
Example: Feedback Loops
Customer Sales Order
Billing
Customer Account Information
Provisioning
Customer Care
Legend: Existing Data Flow, Missing Data Flow
If it’s Big Data, it’s machine generated
So… I don’t need to worry about data quality, right?
Chart courtesy of xkcd
www.BAinsight.com
@jefffried