Data Quality The True Big Data Challenge Dr. Stefan Kühn Lead Data Scientist data2day 2016 - Karlsruhe


Page 1: Data quality - The True Big Data Challenge

Data Quality - The True Big Data Challenge

Dr. Stefan Kühn, Lead Data Scientist

data2day 2016 - Karlsruhe

Page 2: Data quality - The True Big Data Challenge

A short motivation

• Some "famous" quotes
  • "Data are becoming the new raw material of business."
  • "The data fabric is the next middleware."
  • "Data matures like wine, applications like fish."
  • "There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days."
  • "Information is the oil of the 21st century, and analytics is the combustion engine."

Page 3: Data quality - The True Big Data Challenge

A short motivation


Data matures like wine?

Page 4: Data quality - The True Big Data Challenge

A short motivation


Data matures like wine?

More like grapes…

Page 5: Data quality - The True Big Data Challenge

A short motivation

• Some "critical" quotes
  • "Big Data is not the new oil."
  • "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
  • "It’s easy to lie with statistics. It’s hard to tell the truth without statistics."
  • "Anything that can be measured can be improved."

Page 6: Data quality - The True Big Data Challenge

Data Quality Fundamentals


Page 7: Data quality - The True Big Data Challenge

Twofold Approach to Data Quality

• Does Data represent the real-world objects / events / concepts it is supposed to?
• Does Data meet the expectations of the Data consumers and the requirements of intended usage?
• Warning: Data is not facts!

Data does not exist independently of its creation.

Page 8: Data quality - The True Big Data Challenge

Data as representation

[Figure: Semiotic Triangle with the corners Idea, Word, and World]

Where is the Data?

Page 9: Data quality - The True Big Data Challenge

Data as representation

[Figure: Semiotic Triangle with the corners Metadata, Data, and World]

Here is the Data

Page 10: Data quality - The True Big Data Challenge

Data and Metadata

• Data implies a context -> Metadata
• Metadata provides explicit knowledge about Data
• Metadata enables a common understanding of Data inside an organization
• Metadata serves as documentation and dictionary, as context for Data Understanding

Metadata is absolutely necessary for the effective use of Data.
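
A minimal sketch of Metadata as a shared dictionary. The table, field names, units, and owners are hypothetical and only illustrate the idea that every field a consumer sees should come with explicit context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Explicit knowledge about one field (all entries are hypothetical)."""
    description: str
    unit: str
    owner: str


# Hypothetical data dictionary shared across the organization.
DATA_DICTIONARY = {
    "customer_id": FieldMetadata("Unique customer key", "none", "Sales"),
    "height": FieldMetadata("Self-reported body height", "cm", "Customer"),
}


def undocumented_fields(record: dict) -> list:
    """Return the fields of a record that have no Metadata entry."""
    return sorted(name for name in record if name not in DATA_DICTIONARY)


record = {"customer_id": 17, "height": 180, "score": 0.4}
print(undocumented_fields(record))  # ['score']
```

A field such as `score` that arrives without an entry is exactly the case where consumers fall back on implicit assumptions.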


Page 11: Data quality - The True Big Data Challenge

Responsibility for Data

Common Misunderstanding
• Data and Data-related systems are typically managed and hosted by IT, so most people (from both Business and IT) tend to think that Data is part of IT and not of Business
• BUT: Data is not a by-product of Business Processes

Data is THE product of Business Processes

• Data Quality Improvement as Business Strategy

Shared Responsibility

Page 12: Data quality - The True Big Data Challenge

Data Creation as Observation

• Data is created under specific Conditions and for specific Purposes
• Creation process involves
  • Observed Object
  • Observer
  • Instrument
• Example - Customer Self-Registration Form
  • Customer Information as Observed Object
  • Customer as Observer
  • Registration Form as Instrument

Instrument is not built / known by Observer.

Page 13: Data quality - The True Big Data Challenge

Data as Product

• Analogy between manufacturing of products and creation / production of data
• Data as core product of a business process
• Transfer quality concepts from Software Development to "Data Development"
  • Testing
  • Staging
  • Versioning
  • Continuous Delivery / Improvement
• Product Management
• Standardization

Data Quality as Manufacturing Quality

Page 14: Data quality - The True Big Data Challenge

Expectations and Requirements

• Implicit assumptions for usage of Data
• Creation of Data is a business process
• Expectations and requirements have to be explicitly known when defining the process
• Data Quality is Business Process Quality
• Constantly changing expectations and requirements make Data age like grapes…

Make all assumptions explicit.

Page 15: Data quality - The True Big Data Challenge

Data Producers

• People or systems that create Data
• Producers have control over what they create (given the functionality of the instrument)
• Producers don’t have control over possible uses of data
• Most Data is produced for a dedicated purpose but used for several purposes
• Data Quality is fixed at the moment of creation

Data Quality starts with enabling producers to produce high-quality Data -> useable Data

Page 16: Data quality - The True Big Data Challenge

Data Consumers

• People or systems that use Data within its lifecycle
• Multiple systems and people can consume data
• Often, Consumers are Producers at the same time
• Consumers do not control the production of Data but have implicit assumptions and expectations about it

Data Quality Processes are Consumers of Data of an unknown Quality and Producers of Data of a defined Quality
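
The consumer/producer role of a Data Quality Process can be sketched as a small gate: it consumes records of unknown quality and produces records that are known to satisfy a set of rules. The rule names, record fields, and thresholds below are assumptions made up for illustration.

```python
def quality_gate(records, rules):
    """Consume records of unknown quality; produce records of defined quality.

    `rules` maps a rule name to a predicate (both hypothetical examples).
    Returns (accepted, rejected); each rejected entry pairs the record
    with the names of the rules it violated.
    """
    accepted, rejected = [], []
    for record in records:
        violations = [name for name, check in rules.items() if not check(record)]
        if violations:
            rejected.append((record, violations))
        else:
            accepted.append(record)
    return accepted, rejected


rules = {
    "has_email": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}
raw = [
    {"email": "a@example.org", "age": 34},
    {"email": "", "age": 34},
    {"email": "b@example.org", "age": 250},
]
accepted, rejected = quality_gate(raw, rules)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Keeping the rejected records together with the violated rule names makes the defined quality level visible and gives a starting point for root cause analysis.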


Page 17: Data quality - The True Big Data Challenge


Data Quality Problems

Page 18: Data quality - The True Big Data Challenge

Problematic Aspects of Data Management

• Data crosses Organizational Boundaries
• Technical (IT) and non-technical (Business) roles have to communicate
• Shared Responsibility instead of "Ownership"
• No common definitions
• Twelve Barriers to Effective Management of Data and Information Assets (Th. Redman)

Holistic Approach to Data Quality required

Page 19: Data quality - The True Big Data Challenge

Problematic Aspects of Data Management


Page 20: Data quality - The True Big Data Challenge


Big Data Quality Big Problems

Page 21: Data quality - The True Big Data Challenge

Summary

Big Data

• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)


Page 22: Data quality - The True Big Data Challenge

What is Big Data?

• Different Data sources
  • External Data
  • No control over data production
  • No sufficient documentation (Metadata)
  • No quality definitions available
• Incompatible schema
  • Example: Car callbacks
• Even more implicit assumptions
• Big Data implies less information per unit of data
  • Lots of data points are redundant
  • Example: Measure a constant quantity once per day or once per second
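
The constant-quantity example can be made concrete with a rough sketch. The reading `21.5` is a hypothetical value, and compressed size is used here only as a crude proxy for information content, which is an assumption of this sketch and not a measure named in the talk.

```python
import zlib

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical constant quantity, e.g. a reading that never changes.
value = "21.5"

once_per_day = [value]                       # 1 reading per day
once_per_second = [value] * SECONDS_PER_DAY  # 86,400 readings per day

for name, readings in [("once per day", once_per_day),
                       ("once per second", once_per_second)]:
    raw = "\n".join(readings).encode()
    print(name + ":",
          len(readings), "readings,",
          len(set(readings)), "distinct value(s),",
          len(raw), "bytes raw,",
          len(zlib.compress(raw)), "bytes compressed")
```

Both series contain a single distinct value; the per-second series is several orders of magnitude larger but compresses to a tiny fraction of its raw size, which is what "less information per unit of data" looks like in practice.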


Page 23: Data quality - The True Big Data Challenge

What is Big Data in the Media?

• "new oil"
• "gold"
• "revolution"
• "raw material"
• "the future"
• "bigger, better, faster, more"
• "more data beats better algorithms"
• …

Page 24: Data quality - The True Big Data Challenge

Three major problems

• Redundancy
  • Big Data by Copy/Paste
• Resolution
  • Every problem has an inherent time scale of change
  • Every problem has an inherent level of uncertainty
  • Increasing the resolution beyond these levels only resolves noise
• Noise
  • Adding noisy features decreases the signal-to-noise ratio
  • Adding good but irrelevant features increases complexity and can look like noise
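
The Resolution point can be illustrated with a small calculation. The sketch assumes a sine-shaped signal with a one-day period and a fixed measurement uncertainty; all three constants are invented for illustration. It compares the largest possible change of the signal between two consecutive samples with the noise level.

```python
import math

PERIOD_S = 24 * 60 * 60  # assumed time scale of change: one cycle per day
AMPLITUDE = 1.0          # assumed signal amplitude
NOISE_STD = 0.1          # assumed level of uncertainty per measurement


def max_change_per_sample(interval_s: float) -> float:
    """Largest change of the sine signal between two consecutive samples.

    Uses |sin(a) - sin(b)| <= 2 * sin(|a - b| / 2), valid for intervals
    up to one period.
    """
    return 2 * AMPLITUDE * math.sin(math.pi * interval_s / PERIOD_S)


for interval_s in (3600, 60, 1):
    ratio = max_change_per_sample(interval_s) / NOISE_STD
    print(f"every {interval_s:>4} s: signal change / noise = {ratio:.4f}")
```

Under these assumptions, hourly samples differ by more than the noise level, while per-second samples differ by less than a thousandth of it, so the additional samples mostly record noise.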


Page 25: Data quality - The True Big Data Challenge

Redundancy


Page 26: Data quality - The True Big Data Challenge

Resolution


Page 27: Data quality - The True Big Data Challenge

Resolution


Page 28: Data quality - The True Big Data Challenge

Noise


Page 29: Data quality - The True Big Data Challenge

Noise


Page 30: Data quality - The True Big Data Challenge

Example from Kaggle


Page 31: Data quality - The True Big Data Challenge

349 variables - basically rank 1
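
The Kaggle data behind this slide is not reproduced here. As a minimal sketch of what "basically rank 1" means, the following uses a small invented table in which every variable is a multiple of one underlying column, and computes its rank with plain Gaussian elimination. The function name, the tolerance, and the data are assumptions of this sketch.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with pivoting."""
    m = [list(map(float, row)) for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Hypothetical table: 4 observations, 5 variables, all multiples of `base`.
base = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, -2.0, 0.5, 3.0, 10.0]
table = [[b * w for w in weights] for b in base]

print(matrix_rank(table))  # 1: five variables, one independent direction
```

Real data is never exactly rank 1; with noisy measurements one would look at how quickly the singular values decay instead of an exact rank.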


Page 32: Data quality - The True Big Data Challenge

Moore’s Law


Moore’s Law: By Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542

Page 33: Data quality - The True Big Data Challenge

What’s the point?

• Moore’s Law
  • Number of transistors per area doubles every two years
• Real-world Problem sizes
  • Grow at approximately the same speed
• Algorithmic requirements
  • For answering the same questions in the same time, we need algorithms with linear complexity
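
The argument above can be checked with a few lines of arithmetic. The sketch assumes that problem size and available compute both double in each period, and compares how the runtime changes for three textbook cost functions; the starting size of one million items and the five periods are arbitrary choices.

```python
import math


def relative_runtime(cost, n0: float, doublings: int) -> float:
    """Runtime after `doublings` periods, relative to today.

    Assumes problem size and compute speed both double each period.
    """
    growth = 2 ** doublings
    return cost(n0 * growth) / (cost(n0) * growth)


costs = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
}
for name, cost in costs.items():
    print(name, round(relative_runtime(cost, n0=1e6, doublings=5), 2))
# O(n) 1.0
# O(n log n) 1.25
# O(n^2) 32.0
```

Only the linear algorithm answers the same question in the same time; the quadratic one falls behind by the full growth factor.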


Page 34: Data quality - The True Big Data Challenge

Solutions?


Page 35: Data quality - The True Big Data Challenge

Overall Goals

• Implement Data Quality Standards
• Detect Data Quality Problems
• Manage Data Quality Problems
• Root Cause Analysis of Data Quality Problems
• Measure Costs of "poor" Data Quality
• Measure Value of Data / "high" Data Quality
• Measure Effects of Data Quality Improvements

Page 36: Data quality - The True Big Data Challenge

Typical Approaches

• Force Data Quality (via order)
  • Fill rate: Make certain fields mandatory
  • Range: Prescribe list of valid options
• Buy tool
• Hire expert
• Fire expert
• Collect more bad Data
• Relabel "bad" Data Pool as Data Lake
• …
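
A sketch of how the two enforced rules, fill rate and valid range, could be measured. The registration records and the list of valid options are hypothetical.

```python
def fill_rate(records, field):
    """Share of records in which `field` is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def out_of_range(records, field, valid_options):
    """Records whose value for `field` is filled but not a valid option."""
    return [r for r in records
            if r.get(field) not in (None, "") and r[field] not in valid_options]


# Hypothetical registration records.
records = [
    {"country": "DE", "phone": "0721 555 0100"},
    {"country": "Germany", "phone": "n/a"},
    {"country": "", "phone": "000"},
    {"country": "FR", "phone": ""},
]
print(fill_rate(records, "country"))                   # 0.75
print(fill_rate(records, "phone"))                     # 0.75
print(out_of_range(records, "country", {"DE", "FR"}))
```

Placeholder entries such as `n/a` or `000` count as filled, so a mandatory field raises the fill rate without guaranteeing that the value is correct.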


Page 37: Data quality - The True Big Data Challenge

Summary of the problem

Big Data

• "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (George Dyson)


Page 38: Data quality - The True Big Data Challenge

Useful Approaches

• Hire expert ;-)
• Shared Responsibility
• Common Understanding of and access to Metadata
  • This does not imply that the terminology has to change
  • Typically, the same term has a different meaning in different departments
    • Bounded contexts! (DDD)
• Invest in creating better Data instead of fixing old and broken Data

Treat Data as Product, not as Fact
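
The bounded-contexts idea from the list above can be sketched as a glossary keyed by department and term, so that each department keeps its own terminology while the differing meanings become explicit Metadata. The departments and definitions are invented for illustration.

```python
# Hypothetical glossary: the same term, defined per department (bounded context).
GLOSSARY = {
    ("Sales", "customer"): "Anyone who has signed a contract",
    ("Marketing", "customer"): "Anyone who has registered on the website",
    ("Support", "customer"): "Anyone who has opened a ticket",
}


def define(context: str, term: str) -> str:
    """Look up a term inside one context; never guess across contexts."""
    try:
        return GLOSSARY[(context, term)]
    except KeyError:
        raise KeyError(f"No definition of {term!r} in context {context!r}") from None


def contexts_using(term: str) -> list:
    """All contexts that define the term, possibly with different meanings."""
    return sorted(ctx for ctx, t in GLOSSARY if t == term)


print(contexts_using("customer"))     # ['Marketing', 'Sales', 'Support']
print(define("Sales", "customer"))    # Anyone who has signed a contract
```

A lookup without a context is deliberately not offered: the same word has no single meaning across departments.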
