Data Quality and Data Analytics Jeff Fried CTO, BA Insight Data Summit May 2017

Data Quality and Data Analytics

Jeff Fried

CTO, BA Insight

Data Summit

May 2017

All Data is Dirty

BRIGHTHOUSE

BRIGHT HOUSE

BRIGHT HOUSE NETWORK

BRIGHT HOUSE NETWORKS

BRIGHTHOUSE

BRIGHTHOUSE NETWORKS

FLORIDA LIGHT & POWER

FLORIDA POWER

FLORIDA POWER & LIGHT

FLORIDA POWER & LIGHT CO

FLORIDA POWER & LIGHT COMPANY

FLORIDA POWER AND LIGHT

VERIZON COMMUNICATIONS

VERIZON FLORIDA INC

VERIZON FLORIDA INC.

VERIZON FLORIDA, INC.

VERIZON NORTH

VERIZON RESIDENTAL

VERIZON SOUTH

VERIZON WIRELESS

VERIZONWIRELESS

VERIZON

VERIZON (FORMERLY GTE)

AMERICAN EXPRESS

AMERICAN EXPRESS TRS CO INC

AMERICAN EXPRESSS

About Jeff Fried

Longtime Search Nerd

Focused on Search and SharePoint since 2004

• CTO, BA Insight

• Senior PM, Microsoft

• VP, FAST

• SVP, LingoMotors

Passionate About

• Search

• SharePoint

• Search-driven applications

• Information Strategy

Blog: BAinsight.com/blog

Technet Column: “A View from the Crawlspace”

@jefffried

[email protected]

About BA Insight

– Connectivity

– Applications

– Classification

– Analytics

This session

Conventional Definition of Data Quality

Requirements for Data Quality Solutions

• Profiling: Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues.

• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.

• Matching: Identifying, linking or merging related entries within or across sets of data.

• Monitoring: Tracking and monitoring the state of Quality activities and Quality of Data.
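A minimal sketch of the standardization part of cleansing, applied to the kind of name variants shown on the "All Data is Dirty" slide. The function names and the suffix list are hypothetical and chosen only for illustration; they are not taken from any product discussed in this deck.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to drop; a real knowledge base
# would be larger and maintained by data stewards.
LEGAL_SUFFIXES = {"INC", "CO", "COMPANY", "CORP"}


def standardize_name(raw):
    """Uppercase, spell out '&', drop punctuation and trailing legal suffixes."""
    name = raw.upper().replace("&", " AND ")
    name = re.sub(r"[^A-Z0-9 ]", " ", name)  # punctuation becomes whitespace
    tokens = name.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Group raw names by their standardized form."""
    groups = defaultdict(list)
    for raw in names:
        groups[standardize_name(raw)].append(raw)
    return dict(groups)


variants = [
    "FLORIDA POWER & LIGHT",
    "FLORIDA POWER & LIGHT CO",
    "FLORIDA POWER & LIGHT COMPANY",
    "FLORIDA POWER AND LIGHT",
    "VERIZON FLORIDA INC",
    "VERIZON FLORIDA INC.",
    "VERIZON FLORIDA, INC.",
]
groups = group_variants(variants)
```

Rules like these only collapse formatting differences. Misspellings ("VERIZON RESIDENTAL") and spacing differences ("BRIGHT HOUSE" vs. "BRIGHTHOUSE") still need fuzzy matching.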

N-Grams for name/address resolution

Record 1: ABUKABLAN, KHALID HUSSEIN OCEAN BLVD 33062 POMPANO BEACH

3-Gram: ABU BUK UKA KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USS SSE SEI EIN IN N O OC OCE CEA EAN AN N B BL BLV LVD

Record 2: KABLAN, KHALID HUSEIN OCEAN BULEVARD POMPANO BEACH

3-Gram: KAB ABL BLA LAN AN, N, , K KH KHA HAL ALI LID ID D H HU HUS USE SEI EIN IN N O OC OCE CEA EAN AN N B BUL ULE LEV EVA VAR ARD

Score

131 776

Improves Data Quality through de-duplication
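A minimal sketch of character 3-gram comparison for the two records above. The slide does not say how its score is computed, so Jaccard overlap is used here purely as an assumption, and the function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of text.

    The text is uppercased and runs of whitespace are collapsed first,
    so n-grams may span word boundaries (e.g. 'N K').
    """
    normalized = " ".join(text.upper().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets (assumed measure, 0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


record_a = "ABUKABLAN, KHALID HUSSEIN OCEAN BLVD 33062 POMPANO BEACH"
record_b = "KABLAN, KHALID HUSEIN OCEAN BULEVARD POMPANO BEACH"
unrelated = "FLORIDA POWER & LIGHT COMPANY"

likely_duplicate = ngram_similarity(record_a, record_b)
likely_distinct = ngram_similarity(record_a, unrelated)
```

Because most 3-grams survive a dropped prefix, a missing letter, or an abbreviation, the two spellings of the same person and address share far more n-grams than an unrelated record does. A threshold on the score can then flag candidate duplicates for review.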

[Diagram: data quality server architecture. Labels recovered from the figure:]

• DQ Clients: DQS UI (Knowledge Discovery and Management, Interactive DQ Projects, Data Exploration); Future Clients – Excel, SharePoint…

• DQ Server: DQ Engine (Knowledge Discovery, Data Profiling & Exploration, Cleansing, Matching, Reference Data)

• Stores: DQ Projects Store (DQ Active Projects); Common Knowledge Store; Knowledge Base Store (MS Data Domains, Local Data Domains, Published KBs)

• Reference data: Reference Data Services; Reference Data Sets; 3rd Party and MS DQ Domains Store; Azure Market Place (Categorized Reference Data, Categorized Reference Data Services)

• APIs: Reference Data API (Browse, Get, Update…); RD Services API (Browse, Set, Validate…)

Data Quality

Data Analytics

“Hot Research Topic” Examples

Example: Major Communications Company

Common Issues

Resulting Solution

Example: Leading Insurance Vendor

Challenge:

o Client information is in several legacy systems with no standard definition for clients across the systems

o Client data is erroneous and duplicated

Solution:

o A consolidated, single view of an individual client across multiple systems

o Automated process that can match clients based on relevancy, statistics and business-defined matching rules

o Capability of learning from incorrectly matched entries that a user corrects

o Batch processing of data matching and merging, with a results and exception report

o Drives changes to legacy systems and master file

Courtesy of Reltio

Data match and merge based on rules to create golden records
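A minimal sketch of rule-based match and merge. Everything in it is an assumption for illustration: the matching rule (same normalized name and birth date), the survivorship rule (newest non-empty value wins), the field names, and the sample records. It does not describe how Reltio or any other product works.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical business rule: same normalized name and birth date
    means the records describe the same client."""
    return (" ".join(record["name"].upper().split()), record["birth_date"])


def merge_group(records):
    """Assumed survivorship rule: for each field, the newest non-empty value wins."""
    golden = {}
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if value not in (None, ""):
                golden[field] = value
    return golden


def build_golden_records(records):
    """Group records by match key, then merge each group into one golden record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [merge_group(group) for group in groups.values()]


# Invented sample rows standing in for several legacy systems.
legacy_records = [
    {"name": "Pat Lee", "birth_date": "1980-02-01", "phone": "",
     "email": "pat@example.com", "updated": "2015-03-01"},
    {"name": "PAT  LEE", "birth_date": "1980-02-01", "phone": "555-0100",
     "email": "", "updated": "2016-07-15"},
    {"name": "Sam Roe", "birth_date": "1975-11-30", "phone": "555-0199",
     "email": "sam@example.com", "updated": "2016-01-10"},
]
golden_records = build_golden_records(legacy_records)
```

The two "Pat Lee" rows collapse into one record that keeps the email from the older row and the phone number from the newer one. Exact-key rules like this are the simplest case; fuzzy matching and user-corrected matches would extend the `match_key` step.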

Data Quality Dashboard

Courtesy of Reltio

Modes

Example: Feedback Loops

[Diagram: data flows among Customer Sales Order, Billing, Customer Account Information, Provisioning, and Customer Care. Legend: Existing Data Flow, Missing Data Flow]

So… I don’t need to worry about data quality, right?

If it’s Big Data, it’s machine generated

Chart courtesy of xkcd

This session