Big data from the trenches

Advice from the FSI industry

By: Azrul MADISA

About me…

• VP – Enterprise Data Architect @ Maybank

• Take care of Maybank’s data worldwide

• Nuts about data, analytics and software dev.

• Very hands-on, love to read

• Teach aikido to kids

Big Data landscape today

https://www.linkedin.com/pulse/big-data-still-thing-2016-landscape-matt-turck

Too many big data tech?

Wait … what?

I have to know ALL that?

Let’s change the game a bit…

Use case

The data journey

[Diagram: stages of the data journey]

• Acquisition

• Dumping

• Tidy data

• Analytical model

• Real-time analytics

• Sandbox

Example: credit scoring and loan origination

[Diagram: the data journey stages mapped to the loan origination systems]

• Acquisition: Screens

• Dumping: Data staging area

• Tidy data: Data warehouse

• Analytical model: Score card builder

• Real-time analytics: Decisioning

• Sandbox: Data scientist

Acquisition with quality

• Manage data quality up front

• Human-factor data quality

[Diagram: Data entry → Application → Data staging (over-night), with an audit trail (weekly)]

• Non-human error

• Use the PEWMA (probabilistic exponentially weighted moving average) algorithm

https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda/
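
To make the PEWMA idea concrete, here is a minimal sketch of how such a detector could flag a bad reading in a stream of values. It is a simplified illustration, not the implementation from the linked post: the function name `pewma_anomalies` and the parameter values (`alpha0`, `beta`, `training`, `threshold`) are assumptions chosen for the example.

```python
import math


def pewma_anomalies(values, alpha0=0.95, beta=0.5, training=10, threshold=3.0):
    """Flag points that sit more than `threshold` standard deviations from a
    probabilistically weighted moving average (simplified PEWMA sketch)."""
    s1 = 0.0  # weighted estimate of E[x]
    s2 = 0.0  # weighted estimate of E[x^2]
    flags = []
    for t, x in enumerate(values, start=1):
        if t == 1:
            mean, std = x, 0.0
        else:
            mean = s1
            std = math.sqrt(max(s2 - s1 * s1, 0.0))
        # Probability density of the new point under the current estimate.
        z = (x - mean) / std if std > 0 else 0.0
        p = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        # Only flag once the training period is over.
        flags.append(t > training and std > 0 and abs(x - mean) > threshold * std)
        # Unlikely points get a weight closer to alpha0, so they move the
        # average less than ordinary points do.
        alpha = 1 - 1 / t if t <= training else alpha0 * (1 - beta * p)
        s1 = alpha * s1 + (1 - alpha) * x
        s2 = alpha * s2 + (1 - alpha) * x * x
    return flags


readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
            10.0, 10.2, 50.0, 10.1]
flags = pewma_anomalies(readings)
```

In this made-up series only the reading of 50.0 is flagged. The threshold and training length would need tuning against real feeds.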

Data sandbox

Creating a sandbox on the cloud

• Why cloud:

– Scale data discovery as needed

– Merging private with public data

– Less bureaucratic

• But…

– Customer data on the cloud is a no-no

Creating a sandbox on the cloud

• Masking

– Non-numerical data => No sweat!

– E.g.

• En. Abdul Jalil => 837x2unxy237e832!@

• 720324-03-8891 => 472376-84-8732

• Masking numerical data?

– What if there is a way to mask numerical data while keeping the statistical properties intact?

– Easier for the regulators to digest
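
The masking of non-numerical fields shown above can be sketched with a keyed hash. This is one possible approach, not necessarily the one used in production: the helper names `mask_text` and `mask_digits` and the key are hypothetical, and the digit mapping is only meant to keep the format of an identifier, not to be a vetted format-preserving encryption scheme.

```python
import hashlib
import hmac


def mask_text(value: str, key: bytes) -> str:
    """Replace a free-text value with a deterministic, keyed token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_digits(value: str, key: bytes) -> str:
    """Replace each digit with a keyed pseudo-random digit, keeping
    separators such as '-' in place so the format survives."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch.isdigit():
            # byte % 10 is slightly biased; acceptable for an illustration.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


key = b"demo-key-kept-on-premise"  # hypothetical secret, never sent to the cloud
masked_name = mask_text("En. Abdul Jalil", key)
masked_id = mask_digits("720324-03-8891", key)
```

Because the same input and key always give the same token, masked records can still be joined on the masked identifier inside the sandbox.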

Creating a sandbox on the cloud

• Random projection

• Usually used for dimension reduction

Original data (M x N) x Random matrix (N x N) = Masked data (M x N)
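
A small sketch of the multiplication above, in plain Python. The slides do not say how the random matrix is drawn; this example assumes a random orthogonal matrix (built with Gram-Schmidt), because that choice keeps the distance between any two records unchanged. The function names are made up for the illustration.

```python
import math
import random


def random_orthogonal(n, seed=None):
    """Draw an n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian vectors)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def mask(data, r):
    """Multiply M x N data by the N x N matrix r, giving M x N masked data."""
    n = len(r)
    return [[sum(row[k] * r[k][j] for k in range(n)) for j in range(n)]
            for row in data]


def distance(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up records: income, loan amount, age.
data = [[5200.0, 30000.0, 41.0],
        [3100.0, 12000.0, 29.0],
        [8700.0, 95000.0, 53.0]]
r = random_orthogonal(3, seed=42)
masked = mask(data, r)
```

The individual values in `masked` no longer resemble the originals, yet `distance(masked[0], masked[1])` matches `distance(data[0], data[1])` up to rounding, so distance-based models behave the same on the masked set. Per-column statistics such as the mean of a single field are not preserved, and the random matrix itself has to stay on-premise.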

Fast real-time vs. batch analytics

Fast real-time analytics

• ‘Batch’ analytics:

[Diagram: User → Application → over-night batch → Data warehouse → descriptive analytics and predictive analytics. The analytical model is refreshed monthly and is then used for real-time decisioning.]

Fast real-time analytics

• So what is real-time analytics:

[Diagram: User → Application → real-time analytics and decisioning, backed by an analytical model updated in real time. Predictive analytics draws on both the batch analytical model and the real-time analytical model.]

Fast real-time analytics

• Q-learning

• E.g. SMS advertisement campaign

[Diagram: the marketing system passes location and user info to the real-time analytical model, which drives the SMS campaign. When users change behaviour (e.g. buy something else), the model learns the new behaviour.]
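
The Q-learning loop behind such a campaign can be sketched as a small tabular learner. Everything specific here is an assumption for illustration: the state (a location label such as "mall"), the three campaign types, the reward of 1 when the user responds, and the values of `alpha`, `gamma` and `epsilon`.

```python
import random
from collections import defaultdict

ACTIONS = ["sports", "concerts", "movies"]  # campaign types to choose from


class QLearner:
    """Tabular Q-learning with epsilon-greedy exploration."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=None):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore occasionally, otherwise pick the best-known campaign.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update towards reward + discounted best next value.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Simulated user at a mall who only responds to movie promotions.
learner = QLearner(seed=7)
state = "mall"
for _ in range(2000):
    action = learner.choose(state)
    reward = 1.0 if action == "movies" else 0.0
    learner.update(state, action, reward, state)

best = max(ACTIONS, key=lambda a: learner.q[(state, a)])
```

After the simulated messages the learner prefers the "movies" campaign for this state. If the simulated user later started responding to a different campaign, the same update rule would shift the estimates over time, which is the "learn new behaviour" step in the diagram.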

Fast real-time analytics: Real-time analytics in action

[Chart: interest (y-axis, 0 to 0.5) against number of messages (x-axis) for three series: sports, concerts and movies. Over time, interest in concerts, movies and sports shifts.]

Real-time analytical tracking and learning of people’s interest

Putting it all together under one architecture

Data architecture

• Some difficult questions around big data and analytics

– How can I invest in big data while managing cost?

– How can I “experiment” with big data while mitigating risks?

– How can I create a 360-degree view of data without boiling the ocean?

– How can I use overseas data without violating regulations?

Tiered data architecture

[Diagram: data consumers reach all data through a data virtualization layer (SQL / REST / SOAP / MQ) that exposes the official data model. Behind it sit the tiers:]

• Data warehouse (staging, SQL access)

• Big Data infra (e.g. Hadoop)

• Real-time store

• Master / reference data

• Social / cloud public data

• Overseas data

[Feeds: data sources load in batch and in real time (into the real-time store). Social networks and overseas data sources supply the social and overseas tiers, in batch.]

Tiered data architecture

• Investment / level of support

– Master data: support Level 1

– Fast data: support Level 1

– Hot data: support Level 2

– Cold data: support Level 3

– Data virtualization: support Level 1

[Diagram: the tiers are placed along two investment axes, investment in CPU / memory and investment in storage.]
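
As a rough illustration of how the tiers and support levels could be encoded, the sketch below assigns a record to the hot or cold tier by age and looks up the support level listed above. The 90-day cut-off and the function names are assumptions for the example, not figures from the slides.

```python
from datetime import date

# Support level per tier, as listed on the slide.
SUPPORT_LEVEL = {"master": 1, "fast": 1, "hot": 2, "cold": 3}


def assign_tier(last_accessed: date, today: date, hot_days: int = 90) -> str:
    """Return 'hot' for recently used data and 'cold' otherwise.

    hot_days is the cut-off point; changing it is a cheap experiment.
    """
    age_in_days = (today - last_accessed).days
    return "hot" if age_in_days <= hot_days else "cold"


def support_level_for(last_accessed: date, today: date, hot_days: int = 90) -> int:
    """Look up the support level that applies to a record's tier."""
    return SUPPORT_LEVEL[assign_tier(last_accessed, today, hot_days)]


today = date(2024, 1, 31)
recent = assign_tier(date(2024, 1, 1), today)
old = assign_tier(date(2023, 1, 1), today)
```

Moving the cut-off only changes which tier a record lands in, so the experiment does not touch the consumers of the data.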

Tiered data architecture

• Invest where it matters

– Defer investment if needed

– Refocus investment without disrupting business

• Data virtualization

– Create a façade for data access

– Provide standard interface for data

– Single data model, single access, single quality checkpoint

• Allow ‘experimentation’

– E.g. cut-off point for hot / cold

• Overseas data access

– Data stays where it is, only aggregated data is transferred back

– More palatable to regulators

• 360-degree view

– Data can be ‘joined’ through the data virtualization layer – no laborious ETL needed

• Single place to check for data quality
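
The points above can be sketched as a small façade. This is a toy model under stated assumptions, not a description of any particular data virtualization product: the class `DataVirtualizationLayer`, the source names and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class DataVirtualizationLayer:
    """Hypothetical façade: single access point and single quality checkpoint."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[Row]]] = {}
        self._checks: List[Callable[[Row], bool]] = []

    def register(self, name: str, fetch: Callable[[], Iterable[Row]]) -> None:
        self._sources[name] = fetch

    def add_quality_check(self, check: Callable[[Row], bool]) -> None:
        self._checks.append(check)

    def query(self, name: str) -> List[Row]:
        # Every row from every source passes through the same checks.
        return [r for r in self._sources[name]()
                if all(check(r) for check in self._checks)]

    def join(self, left: str, right: str, key: str) -> List[Row]:
        # Join at query time instead of building an ETL pipeline.
        index: Dict[object, List[Row]] = {}
        for r in self.query(right):
            index.setdefault(r[key], []).append(r)
        return [{**l, **r} for l in self.query(left) for r in index.get(l[key], [])]


def local_customers() -> List[Row]:
    return [{"customer_id": "C1", "name": "A"},
            {"customer_id": "C2", "name": "B"},
            {"customer_id": "", "name": "bad row"}]


def overseas_balance_summary() -> List[Row]:
    # Runs where the overseas data lives; only the aggregate is returned.
    raw = [{"customer_id": "C1", "balance": 100.0},
           {"customer_id": "C1", "balance": 50.0},
           {"customer_id": "C2", "balance": 20.0}]
    totals: Dict[str, float] = {}
    for r in raw:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["balance"]
    return [{"customer_id": c, "overseas_balance": t} for c, t in totals.items()]


layer = DataVirtualizationLayer()
layer.register("customers", local_customers)
layer.register("overseas_summary", overseas_balance_summary)
layer.add_quality_check(lambda row: bool(row.get("customer_id")))
view = layer.join("customers", "overseas_summary", "customer_id")
```

The joined `view` holds one row per customer with the local name and the overseas total, while the row with the empty identifier is dropped at the single checkpoint and the raw overseas records never leave `overseas_balance_summary`.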

That’s all folks…

• LinkedIn:

– https://www.linkedin.com/in/azrul-madisa-6052419
