Hadoop Powered Corporate Data How to Produce and Manage Meaningful Data and Analytics Dr. Geoffrey Malafsky Phasic Systems Inc.


Page 1: Phasic Systems - Dr. Geoffrey Malafsky

Hadoop Powered Corporate Data

How to Produce and Manage Meaningful Data and Analytics

Dr. Geoffrey Malafsky

Phasic Systems Inc.

Page 2: Phasic Systems - Dr. Geoffrey Malafsky


Governance, Warehouse, Analytics, NoSQL, Streaming, BI, Integration, Architecture, Modeling, Big Data, Hadoop, Velocity, Volume, Variety, Veracity

Page 3: Phasic Systems - Dr. Geoffrey Malafsky


What does this really mean for my corporate data?

Disruption

Page 4: Phasic Systems - Dr. Geoffrey Malafsky


Organizational Issues

Technology Issues

Business Issues

Page 5: Phasic Systems - Dr. Geoffrey Malafsky


Are we discovering new knowledge?

Are we analyzing business and operations for decisions, audit, compliance, consolidation?

Are we fulfilling required reports?

Page 6: Phasic Systems - Dr. Geoffrey Malafsky


Veracity, Meaningful

Does it matter?

Topic              Should   Does
BI                 Yes      Sometimes
Required Reports   Yes      Sometimes
Audit              Yes      Yes
Compliance         Yes      Yes
Consolidation      Yes      Sometimes
Marketing          Yes      Sometimes
Financial          Yes      Yes but...
Decision Making    Yes      Yes but...

Page 7: Phasic Systems - Dr. Geoffrey Malafsky

TechLab by InsideAnalysis


Normalizing Corporate Small Data With Hadoop and Data Science

By Dr. Geoffrey P Malafsky

In part one of this discussion series (Hadoop for Small Data), I introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”

I am excluding what I call stochastic data use cases, which can succeed even if there is error in the source data and uncertainty in the results, since the business objective is getting trends or making general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases, which I am focusing on here and in the next TechLab in September, where the consequences of wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, and security, to name a few.

Corporate Small Data is structured data that is the fuel of its main activities.

Data Normalization combines subject matter knowledge, governance, business rules, and raw data to make it meaningful.

Page 8: Phasic Systems - Dr. Geoffrey Malafsky


Hadoop was created to handle extraordinarily large and constantly changing data sets. It is a very well-engineered software framework and set of tools for distributed storage and cluster computing. But can it help solve the intractable challenges with key corporate data?

Page 9: Phasic Systems - Dr. Geoffrey Malafsky

The Challenge of Corporate Small Data


multiple sources
multiple definitions
multiple copies
variable structures
different data values
hidden conflicts in data definitions
which to use
different model types & standards
more storage
more data flows
many DW & marts
different ETL
complex dependencies
conflicting business rules
analyses restricted by inconsistencies

Page 10: Phasic Systems - Dr. Geoffrey Malafsky


An example of embedded errors that defy traditional tools and methods: two authoritative data systems contain many conflicts, errors, and quantitative discrepancies. Finding these has been too difficult with common tools. But using a small Hadoop cluster (this is Corporate Data, not Big Data) allows us to iteratively detect, learn, and adjust. Once the discrepancies are detected, investigated, and understood, we can get from the business the one answer needed to correct them.
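The detection step can be illustrated with a minimal sketch, assuming each system can be exported as (identifier, name) pairs. The function name `find_conflicting_keys`, the identifier 111111111, and the "Acme Supply" records are hypothetical; the identifier 136666505 and its two names are taken from the example list in this deck. This is plain Python for illustration, not the Hadoop job used in the work described.

```python
from collections import defaultdict


def find_conflicting_keys(*sources):
    """Return identifiers that map to more than one distinct name.

    Each source is an iterable of (identifier, name) pairs. Names are
    lower-cased and whitespace-collapsed before comparison so that
    trivial formatting differences are not reported as conflicts.
    """
    names_by_key = defaultdict(set)
    for records in sources:
        for key, name in records:
            names_by_key[key].add(" ".join(name.lower().split()))
    return {
        key: sorted(names)
        for key, names in names_by_key.items()
        if len(names) > 1
    }


# Hypothetical extracts from two "authoritative" systems.
system_a = [("136666505", "amy lily chung"), ("111111111", "Acme Supply")]
system_b = [("136666505", "carlene clark"), ("111111111", "acme  supply")]

conflicts = find_conflicting_keys(system_a, system_b)
for key, names in conflicts.items():
    print(key, names)
```

Only the identifier with genuinely different names is reported; the record that differs only in case and spacing is not. Each reported identifier is then a candidate for investigation and a business decision.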

Page 11: Phasic Systems - Dr. Geoffrey Malafsky


136666505 adese genc petrol

136666505 amy lily chung

136666505 anderson erin ruth

136666505 andrew william knef

136666505 anduaga-arias laura

136666505 angelica m. de la cruz

136666505 anthony o'brien, 330531-5100194

136666505 batac belle

136666505 bottesini beth ms.

136666505 bouck shannon

136666505 bunn amy b.

136666505 carlene clark

136666505 cho, boong haeng

136666505 choe, sun young

136666505 christina michajlyszyn

136666505 christopher cannon

136666505 christopher l. booth

136666505 chun, kil mo

136666505 conflict + transition consultancies

136666505 cozzone elaine

136666505 deborah p. carney

136666505 denihan patricia joann

136666505 dong sook mcgeorge, 690525-2716816

136666505 dorene d.lukewalton,pharm d.

136666505 dr. terry a. klein

[Figure: Proportion of DUNS Matched by Transform Type. Y-axis: Percent of DUNS With >=50% Names Matched (0 to 100). X-axis (transform types): WhiteSpace, Transpose, Acronym, NoiseWord, LowSim, Punctuation. Series: FPDS, FPDS-WAWF, FPDS-WAWF-GDUNS.]
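The transform types named in the chart suggest the kind of name normalization involved. The following is a rough sketch only: the slides do not define the transforms, so the implementations of the whitespace, punctuation, noise-word, and transpose steps, the `NOISE_WORDS` list, and the use of `difflib` for scoring are all assumptions. The Acronym and LowSim transforms are not attempted because their definitions are not given.

```python
import string
from difflib import SequenceMatcher

# Hypothetical noise-word list; a real one would come from governance rules.
NOISE_WORDS = {"inc", "llc", "corp", "co", "mr", "ms", "mrs", "dr"}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(name):
    """Apply punctuation, whitespace, noise-word, and transpose transforms."""
    text = name.lower().translate(_PUNCT_TO_SPACE)     # Punctuation
    tokens = text.split()                              # WhiteSpace
    tokens = [t for t in tokens if t not in NOISE_WORDS]  # NoiseWord
    return " ".join(sorted(tokens))                    # Transpose (order-free)


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("Bottesini  Beth Ms."))
print(similarity("cho, boong haeng", "Boong Haeng Cho"))
print(similarity("bunn amy b.", "carlene clark"))
```

Sorting the tokens is one simple way to make "last, first" and "first last" compare equal; it trades away word order, which may or may not be acceptable for a given data set.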

Page 12: Phasic Systems - Dr. Geoffrey Malafsky

Requirements for Data Analytics

1. Data must be understood

2. The right definitions must apply at the right time for the right user

3. Data’s lineage and provenance must be clear

4. Data integrity must be preserved

5. Data must be accurate, consistent, complete, timely, unique and valid

6. Data and system access must be secure

7. Data must be provided in multiple arrangements to meet different user needs and analytical processing requirements

8. Data must be prepared and tracked to support meaningful analysis for different user needs

9. Data processing must be flexible to adapt to new knowledge and discoveries on data already being used

10. Data must be normalized using authoritative or best known sets of codes, lookup values, and source adjudication knowledge and rules

11. High-speed, low-maintenance techniques and tools are needed to be cost- and time-effective

12. Lifecycle audits and data maintenance must be performed, including maintaining and documenting data from raw source to intermediate transformed to fully normalized

13. Use common data models that align, correct, and semantically unify data from multiple sources to enforce meaningful and consistent analysis


Page 13: Phasic Systems - Dr. Geoffrey Malafsky


Page 14: Phasic Systems - Dr. Geoffrey Malafsky

An Example of Hidden Business Rules and Logic

• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER

• If (x1='0') v_modification_number = '0' else v_modification_number = x2

• where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD

• where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD

• where x2: if (x4=NULL) x2 = '0' else x2 = x4

• where x4: x4 = LTRIM(x5)

• where x5: x5 = x1

• Essentially, this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD, and sets '0' if both are NULL

• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT

• where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed


key business logic as buried in a database stored procedure (condensed)
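Written out as an explicit function, the condensed rules above become easier to read and test. This is a sketch under assumptions, not the original stored procedure: the function and type names are hypothetical, NULL is modeled as Python `None`, and `LTRIM` is assumed to strip leading spaces and to yield NULL for an empty result (the behavior that would make the `x4=NULL` check meaningful).

```python
from typing import NamedTuple, Optional


class DerivedKeys(NamedTuple):
    piid: Optional[str]
    modification_number: str
    idv_piid: Optional[str]


def ltrim(value):
    """Strip leading spaces; assume an empty result is treated as NULL."""
    if value is None:
        return None
    stripped = value.lstrip(" ")
    return stripped if stripped else None


def derive_keys(contract, delivery_order, aco_mod, pco_mod, ref_proc_instrument):
    """Reproduce the slide's rules for v_piid, v_modification_number, v_idv_piid."""
    # v_piid: prefer the delivery order, fall back to the contract.
    piid = contract if delivery_order is None else delivery_order

    # x3, x1: prefer ACO_MOD, then PCO_MOD, then the literal '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5 = x1; x4 = LTRIM(x5); x2 = x4 or '0' when x4 is NULL.
    x4 = ltrim(x1)
    x2 = "0" if x4 is None else x4
    modification_number = "0" if x1 == "0" else x2

    # y1: REF_PROC_INSTRUMENT with all '-' characters removed.
    y1 = None if ref_proc_instrument is None else ref_proc_instrument.replace("-", "")
    idv_piid = y1 if delivery_order is None else contract

    return DerivedKeys(piid, modification_number, idv_piid)


print(derive_keys("C1", None, None, None, "AB-12-3"))
print(derive_keys("C1", "D7", "  A05", "P01", "X-1"))
```

Once the logic is in this form, each rule can be reviewed with the business owners and changed in one place rather than rediscovered inside a database procedure.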

Page 15: Phasic Systems - Dr. Geoffrey Malafsky


Flexible, Fast, Adaptive, Multi-Tool Data Analytics Environment

Page 16: Phasic Systems - Dr. Geoffrey Malafsky


Page 17: Phasic Systems - Dr. Geoffrey Malafsky


[Figure: FPDS Hadoop Query Times, Text Field (secs). Y-axis: 0 to 400 seconds. Engines: Hive, Impala, SQLServer. Storage formats: Text, Parquet, Parquet Partitioned.]

Page 18: Phasic Systems - Dr. Geoffrey Malafsky


Parallel Jobs in Hadoop

Page 19: Phasic Systems - Dr. Geoffrey Malafsky


Page 20: Phasic Systems - Dr. Geoffrey Malafsky


Page 21: Phasic Systems - Dr. Geoffrey Malafsky
