Digitization - Data - Intelligence Automated Data Quality Assurance with Machine Learning and Autoencoders SDS2019 Bern, 14.06.2019 Martin Müller-Lennert Senior Data Scientist [email protected] Milica Petrović Senior Data Scientist [email protected]

Digitization - Data - Intelligence

Automated Data Quality Assurance with Machine Learning and Autoencoders

SDS2019

Bern, 14.06.2019

Martin Müller-Lennert

Senior Data Scientist

[email protected]

Milica Petrović

Senior Data Scientist

[email protected]

Talk Outline

1. What’s Wrong with Data Quality?
2. Error Detection using ML
3. Demo
4. Findings and Outlook

Data Quality Today: What’s wrong with it?

Data Sources

Data Quality

End User

Data Quality Today: Our Take at a Solution

Data Quality Today

▪ Manually coded SQL rules
▪ Uni-/bi-variate checks

Challenges

▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected only after occurrence

Solutions with Machine Learning

▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships

Autoencoders

▪ Unsupervised
▪ Model of input data: → Anomalies easily detected
▪ Capture multivariate relationships

Autoencoders for Data Quality: Architecture and Training

Target: Reconstruct input

Bottleneck: Enforced by architecture or regularization

Ensures network learns structure of input data

For good data only

[Figure: autoencoder network diagram with INPUT and OUTPUT labels]

Training on imperfect data: Requires large share of good data

Limits potency of network: More layers not always better

From simple one-layer NN up to VAE with LSTM cells
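
The reconstruction target and the architectural bottleneck can be illustrated with a very small example. The following is a minimal sketch, not the models used in the talk: it assumes NumPy, a purely linear one-hidden-unit autoencoder trained by plain gradient descent, and a synthetic two-column dataset in which good records follow the relationship x2 ≈ 2·x1. All names (`train_autoencoder`, `reconstruction_error`, the toy data) are hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_hidden=1, lr=0.02, epochs=2000, seed=0):
    """Train a linear autoencoder X -> code -> X_hat with gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))  # encoder
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))  # decoder
    for _ in range(epochs):
        code = X @ W_enc              # bottleneck: fewer units than features
        err = code @ W_dec - X        # reconstruction minus input (the target)
        # Gradients of the squared reconstruction error (up to a constant factor)
        grad_dec = code.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec

def reconstruction_error(X, W_enc, W_dec):
    """Mean squared reconstruction error per record."""
    X_hat = X @ W_enc @ W_dec
    return np.mean((X - X_hat) ** 2, axis=1)

# Toy data (assumption): good records satisfy x2 ~ 2 * x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
good = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.05, size=200)])
bad = np.array([[1.0, -2.0]])  # violates the relationship between the fields

W_enc, W_dec = train_autoencoder(good)
good_err = reconstruction_error(good, W_enc, W_dec)
bad_err = reconstruction_error(bad, W_enc, W_dec)
```

Because the single hidden unit can only encode the dominant relationship in the good data, the record that breaks that relationship is reconstructed poorly and receives a much larger error than any good record.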


Discriminating Good and Bad Data Records: Clustering the Reconstruction Errors

[Figure: mean squared error per individual data record, with a kernel density estimate of the errors]

Challenge: Many data points and potentially extreme class imbalance
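
One possible way to turn a kernel density estimate of the reconstruction errors into a cut-off is to place the threshold at the first density minimum to the right of the main mode. The sketch below illustrates that idea under stated assumptions: it uses NumPy, a Gaussian kernel with Silverman's rule-of-thumb bandwidth, and a log10 transform of the errors. The slides do not specify how the density estimate is turned into clusters, so the function `kde_threshold` and the synthetic errors are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def kde_threshold(errors, bandwidth=None, grid_size=512):
    """Return the first density minimum right of the main mode, or None."""
    log_err = np.log10(np.asarray(errors, dtype=float) + 1e-12)
    n = len(log_err)
    if bandwidth is None:
        # Silverman's rule of thumb (an assumption, not taken from the talk)
        bandwidth = 1.06 * log_err.std() * n ** (-1 / 5)
    grid = np.linspace(log_err.min(), log_err.max(), grid_size)
    # Gaussian kernel density estimate evaluated on the grid
    z = (grid[:, None] - log_err[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    mode = int(np.argmax(density))
    for i in range(mode + 1, grid_size - 1):
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            return 10 ** grid[i]  # back to the original error scale
    return None  # density is unimodal: no separate anomaly cluster found

# Synthetic, heavily imbalanced errors (assumption): 1000 good, 10 bad records.
quantiles = [NormalDist().inv_cdf((i + 0.5) / 1000) for i in range(1000)]
good_errors = 10 ** (-2 + 0.2 * np.array(quantiles))  # centred near 0.01
bad_errors = 10 ** np.linspace(-0.2, 0.2, 10)         # centred near 1.0
threshold = kde_threshold(np.concatenate([good_errors, bad_errors]))
```

Working on the log scale keeps the small cluster of large errors visible next to the much larger cluster of good records, which is one way to cope with the class imbalance mentioned above.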

Discriminating Good and Bad Data Records: Sequence of Autoencoders

Challenge: Magnitude of reconstruction error varies across data error types

1st iteration: Remove detected anomalies, keep rest of data

Across iterations: Increase model complexity

Stopping: When threshold separates large chunk of data
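
The iterative procedure (fit, remove detected anomalies, refit a more complex model on the rest, stop when the threshold would cut off a large chunk) can be sketched as a simple loop. This is an illustration under several assumptions, none of which come from the slides: a PCA reconstruction stands in for the autoencoder, the number of components stands in for model complexity, a median/MAD rule on log errors stands in for the density-based threshold, and the loop also stops when nothing is flagged. All names and parameter values are hypothetical.

```python
import numpy as np

def reconstruction_errors(X, n_components):
    """PCA reconstruction error per record (stand-in for an autoencoder)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    return np.mean((Xc - Xc @ V @ V.T) ** 2, axis=1)

def log_mad_threshold(errors, k=4.0):
    """Robust cut-off: median + k * MAD, computed on log10 errors."""
    log_err = np.log10(errors + 1e-12)
    med = np.median(log_err)
    mad = np.median(np.abs(log_err - med))
    return 10 ** (med + k * mad)

def iterative_detection(X, max_iter=3, max_share=0.2):
    remaining = np.arange(len(X))
    anomalies = []
    for it in range(max_iter):
        n_components = it + 1  # increase model complexity across iterations
        err = reconstruction_errors(X[remaining], n_components)
        flagged = err > log_mad_threshold(err)
        # Stop if the threshold separates a large chunk, or nothing at all
        if flagged.mean() > max_share or not flagged.any():
            break
        anomalies.extend(remaining[flagged])   # remove detected anomalies
        remaining = remaining[~flagged]        # keep rest of data
    return np.array(sorted(anomalies)), remaining

# Toy data (assumption): two pairs of coupled fields plus small noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 2))
good = np.column_stack([3 * z[:, 0], 3 * z[:, 0], z[:, 1], z[:, 1]])
good += rng.normal(scale=0.05, size=good.shape)
gross = np.array([[20.0, -20.0, 0.0, 0.0], [-20.0, 20.0, 0.0, 0.0]])
subtle = np.array([[0.0, 0.0, 3.0, -3.0], [0.0, 0.0, -3.0, 3.0]])
X = np.vstack([good, gross, subtle])  # rows 300-301 gross, 302-303 subtle

anomalies, remaining = iterative_detection(X)
```

In this toy setup the gross errors dominate the first iteration and mask the subtle ones, which only stand out once the gross errors are removed and the model is refit with more capacity.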

Demo: Birth date

Demo: First name

Demo: Revenue

Reusability of Pre-Processing and Model Setup

Type of variable | Pre-processing | Model
Character | One-hot encoding of characters | Variational autoencoders with LSTM cells
Categorical | One-hot encoding | Complete autoencoder with regularization
Date | Numerical features from digits, normalization | Complete autoencoder with regularization
Numerical | Normalization | Undercomplete autoencoder with custom loss

Generic pipeline per field type → can be reused for other fields of same type
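
The pre-processing steps per field type could look roughly as follows. This is a minimal sketch using only the Python standard library; the alphabet, the extra "other" column, the YYYYMMDD digit layout, and the min-max scaling are assumptions chosen for illustration, since the slides name the steps but not their details.

```python
from datetime import date

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # hypothetical character set

def one_hot_characters(text, alphabet=ALPHABET):
    """Character fields: one row per character, last column means 'other'."""
    rows = []
    for ch in text.lower():
        row = [0] * (len(alphabet) + 1)
        idx = alphabet.find(ch)
        row[idx if idx >= 0 else len(alphabet)] = 1
        rows.append(row)
    return rows

def one_hot_category(value, categories):
    """Categorical fields: one column per known category."""
    return [1 if value == c else 0 for c in categories]

def date_digit_features(d):
    """Date fields: the eight digits of YYYYMMDD, each scaled to [0, 1]."""
    digits = f"{d.year:04d}{d.month:02d}{d.day:02d}"
    return [int(ch) / 9 for ch in digits]

def min_max_normalize(values):
    """Numerical fields: rescale a column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Example usage on made-up field values
name_matrix = one_hot_characters("Milica")
birth_features = date_digit_features(date(1985, 3, 14))
revenue_scaled = min_max_normalize([10, 20, 30])
```

Each function depends only on the field type, not on the specific field, which is what makes the pipeline reusable for other fields of the same type.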

Key Findings from Application to Production Data

High reusability: One-time customization effort

Replication: ML can automatically replicate rule-based data quality checks

Extension: Autoencoders can find additional errors

SME feedback necessary: Sanity checks during model building

Multivariate relationships: Detection of interdependencies

Unsupervised learning! Training data quality matters

[Side labels on slide: Training, Performance]

Outlook: Future Endeavors

Model lifecycle: improve models over time from feedback

▪ Detected anomalies are errors: correct or leave out
▪ Detected anomalies are false positives: increase weight during training

Automated error correction

▪ Data remediation using RPA

Batch processing: extend error detection to whole batches of data

▪ Detect faulty data sourcing process
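
Increasing the weight of confirmed false positives during training can be expressed as a per-record weight in the reconstruction loss. The following is a small sketch of that idea, assuming NumPy and a weighted mean squared error; the function name, the toy arrays, and the weight value are hypothetical and not taken from the talk.

```python
import numpy as np

def weighted_mse(X, X_hat, sample_weight):
    """Weighted mean of the per-record mean squared reconstruction errors."""
    per_record = np.mean((X - X_hat) ** 2, axis=1)
    return float(np.sum(sample_weight * per_record) / np.sum(sample_weight))

# Two records; the second was flagged, then confirmed as good by an expert.
X = np.array([[0.0, 0.0], [2.0, 2.0]])
X_hat = np.zeros_like(X)                 # current (poor) reconstruction
uniform = np.array([1.0, 1.0])
boosted = np.array([1.0, 3.0])           # hypothetical up-weighting factor

loss_uniform = weighted_mse(X, X_hat, uniform)
loss_boosted = weighted_mse(X, X_hat, boosted)
```

With the larger weight, the poorly reconstructed but valid record contributes more to the loss, so further training is pushed towards reconstructing it well.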