
Big Data: The Magic to Attain New Heights


Page 1: Big Data:  The Magic to Attain New Heights

Big Data: The Magic to Attain New Heights

Ken Johnston, Principal Data Science Manager

Twitter – @rkjohnston

Blog – http://linkedin.com/in/rkjohnston

Email – [email protected]

LinkedIn - http://linkedin.com/in/rkjohnston

@rkjohnston #DataMagic

Page 2: Big Data:  The Magic to Attain New Heights

About Ken

• Data Scientist in Core Data Science Team
• Office Live, WebApps, Office Online
• Cosmos, AutoPilot, Local, Shopping
• Kanban and Data Science series on LinkedIn
• EaaSy & MVQ: Everything as a Service & Minimum Viable Quality
• Write Books and Blog and some fiction

Page 3: Big Data:  The Magic to Attain New Heights

I have a lot of love in my life

Page 4: Big Data:  The Magic to Attain New Heights
Page 5: Big Data:  The Magic to Attain New Heights

My Kids

Page 6: Big Data:  The Magic to Attain New Heights


Page 7: Big Data:  The Magic to Attain New Heights

Team of Amazing Magicians

Page 8: Big Data:  The Magic to Attain New Heights

Getting hands dirty in the data

Page 9: Big Data:  The Magic to Attain New Heights

Connect the Dots

Page 10: Big Data:  The Magic to Attain New Heights

Create Deep Insights

Page 11: Big Data:  The Magic to Attain New Heights
Page 12: Big Data:  The Magic to Attain New Heights

Taking on Sudden Infant Death Syndrome

Page 13: Big Data:  The Magic to Attain New Heights

Big Data and Magic

Page 14: Big Data:  The Magic to Attain New Heights

So, my son gets this kids' “Magic Kit in a Box” for his 8th birthday

Page 15: Big Data:  The Magic to Attain New Heights

Open our own Magic Show

Page 16: Big Data:  The Magic to Attain New Heights

Six Keys to a “Big” Magic Show

• Try, Try, Try Again (The Tyranny of Counting)
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)

“Big Data” Search Trends

Page 17: Big Data:  The Magic to Attain New Heights

The Venue

Your Big Data Infrastructure

Page 18: Big Data:  The Magic to Attain New Heights

Common Design Patterns

Good Paper to Read
IDC: Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture

• Ingest
  • From Services, IoT, Apps
  • Via Streams
  • Into Storage
• Process
  • Build Pipelines
  • Reduce, Transform, Join
  • Pipe out
• Analyze
  • From Services, IoT, Apps
  • Via Streams
  • Into Storage
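The Ingest, Process, Analyze pattern above can be sketched in a few lines. This is a toy illustration only, not any vendor's pipeline: the event fields, the in-memory list standing in for storage, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw events, standing in for data from services, IoT devices, and apps.
RAW_EVENTS = [
    {"source": "app", "user": "u1", "latency_ms": 120},
    {"source": "iot", "user": "u2", "latency_ms": 480},
    {"source": "app", "user": "u1", "latency_ms": 100},
    {"source": "service", "user": "u3", "latency_ms": 250},
]


def ingest(stream):
    """Ingest: read events from a stream and land them in storage (a list here)."""
    storage = []
    for event in stream:
        storage.append(event)
    return storage


def process(storage):
    """Process: reduce the stored events to one aggregate row per source."""
    totals = defaultdict(lambda: {"count": 0, "latency_sum": 0})
    for event in storage:
        row = totals[event["source"]]
        row["count"] += 1
        row["latency_sum"] += event["latency_ms"]
    return dict(totals)


def analyze(aggregates):
    """Analyze: derive a simple insight, the mean latency per source."""
    return {
        source: row["latency_sum"] / row["count"]
        for source, row in aggregates.items()
    }


mean_latency = analyze(process(ingest(iter(RAW_EVENTS))))
print(mean_latency)
```

Each stage only depends on the output of the one before it, which is what lets the later slides map the same three stages onto different platforms.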

Page 19: Big Data:  The Magic to Attain New Heights

Azure Model

Cindy Gross – Technical Fellow: Big Data and Cloud
Twitter: @SQLCindy
[email protected]

Ingest

Process

Analyze

Page 20: Big Data:  The Magic to Attain New Heights

Hybrid: Azure and Hadoop Model

Ingest Process Analyze

Page 21: Big Data:  The Magic to Attain New Heights

Amazon Model

Ingest Process Analyze

Page 22: Big Data:  The Magic to Attain New Heights

How we do it in Windows

Page 23: Big Data:  The Magic to Attain New Heights

Prototypical Big Data Platform

[Diagram components: Client 1, Client 2, Client 3; Telemetry Front End Service; Fast pipeline for high priority Data; Alerting DB; Alerting Dashboard; Big Data Map Reduce Cloud; PII Scrubbing Service; Data Extraction Service; Insights DB 1 through Insights DB N; Additional Reporting Dashboards; Big Data & ML Model Orchestration]

• Personally Identifiable Information (PII) Management very critical.
• Data Driven Quality (DDQ) and big data pipelines will need a cloud platform
• Superfast pipeline typically (not always) bypasses cloud. Also void of PII.
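A rough sketch of the two ideas on this slide: a fast path that carries no PII, and a scrubbing step in front of the cloud pipeline. Everything here is an assumption for illustration: the field names, the event types treated as high priority, and the choice of a salted SHA-256 digest as the scrubbing method are not taken from the slides.

```python
import hashlib

# Field names and event types are hypothetical; a real pipeline has its own schema.
PII_FIELDS = {"email", "name", "ip_address"}
HIGH_PRIORITY_TYPES = {"crash", "outage"}


def scrub_pii(event, salt="demo-salt"):
    """Replace PII values with a truncated salted SHA-256 digest.

    The digest is stable for a given value and salt, so scrubbed rows can
    still be joined on it without storing the raw value.
    """
    scrubbed = {}
    for key, value in event.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            scrubbed[key] = digest[:16]
        else:
            scrubbed[key] = value
    return scrubbed


def route(event):
    """Pick a pipeline for one event.

    High priority events take the fast path with PII fields dropped entirely;
    everything else goes to the cloud pipeline after scrubbing.
    """
    if event.get("type") in HIGH_PRIORITY_TYPES:
        fast_event = {k: v for k, v in event.items() if k not in PII_FIELDS}
        return "fast_pipeline", fast_event
    return "cloud_pipeline", scrub_pii(event)
```

Dropping PII on the fast path (rather than hashing it) keeps that path simple and keeps alerting data free of anything sensitive.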

Page 24: Big Data:  The Magic to Attain New Heights

Prototypical Big Data Platform

[Diagram: the same platform components as on the previous slide (Clients 1 to 3, Telemetry Front End Service, fast pipeline, Alerting DB and Dashboard, Big Data Map Reduce Cloud, PII Scrubbing Service, Data Extraction Service, Insights DBs, reporting dashboards, Big Data & ML Model Orchestration), now labeled with the stages Ingest, Process, Analyze]

Page 25: Big Data:  The Magic to Attain New Heights

User Segmentation Approaches

• Risk Tolerance Model
  • Users segment themselves
  • Opt in for greater risk with a reward in mind
• Profile Based
  • Usage behaviors
  • New vs. power users
  • Browser type
  • Connection type
  • Device and device OS
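As a minimal sketch of the two approaches above, the function below labels a user with a self-selected risk segment plus profile-based segments. The `UserProfile` fields, the label strings, and the 20-session threshold for a power user are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # All fields are hypothetical examples of the profile attributes listed above.
    sessions_last_30_days: int
    browser: str
    connection: str
    device_os: str
    opted_in_to_preview: bool = False


def segment(profile, power_user_sessions=20):
    """Return a list of segment labels for one user."""
    labels = []
    # Risk tolerance model: users segment themselves by opting in.
    labels.append("risk_tolerant" if profile.opted_in_to_preview else "risk_averse")
    # Profile based: usage behavior (the threshold is an arbitrary illustration).
    if profile.sessions_last_30_days >= power_user_sessions:
        labels.append("power_user")
    else:
        labels.append("new_or_casual")
    # Profile based: environment attributes.
    labels.append(f"browser:{profile.browser}")
    labels.append(f"connection:{profile.connection}")
    labels.append(f"os:{profile.device_os}")
    return labels
```

For example, `segment(UserProfile(25, "edge", "wifi", "windows", True))` puts the user in the risk tolerant and power user segments along with the three environment labels.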

Page 26: Big Data:  The Magic to Attain New Heights

Balancing Speed and Risk with Rings

• Ring 0: Buddy Build (Risk Tolerance is highest)
• Ring 1: My Team
• Ring 2: Company & NDA
• Ring 3: External Beta Users
• Ring 4: Everyone (No desire for risk)

Red Line demarks disclosure risk and possible loss of patent rights
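One way to picture the rings is as a gate that a build passes only when its observed failure rate fits the tolerance of the next, larger audience. The sketch below assumes the external beta ring is numbered 3 (between Ring 2 and Ring 4), and the tolerance values are invented for illustration.

```python
# (ring name, maximum failure rate that ring's audience will tolerate).
# The tolerance numbers are hypothetical; they only need to shrink as the
# audience grows, from "risk tolerance is highest" to "no desire for risk".
RINGS = [
    ("Ring 0: Buddy Build", 0.20),
    ("Ring 1: My Team", 0.10),
    ("Ring 2: Company & NDA", 0.05),
    ("Ring 3: External Beta Users", 0.01),
    ("Ring 4: Everyone", 0.001),
]


def promote(ring_index, observed_failure_rate):
    """Return the index of the ring the build should be in next.

    The build moves one ring outward only if the failure rate observed in
    its current ring is within what the next ring tolerates.
    """
    if ring_index >= len(RINGS) - 1:
        return ring_index  # already released to everyone
    _, next_tolerance = RINGS[ring_index + 1]
    if observed_failure_rate <= next_tolerance:
        return ring_index + 1
    return ring_index  # hold the build here and fix it before widening exposure
```

The telemetry from each ring is what makes the gate possible: without a measured failure rate there is nothing to compare against the next ring's tolerance.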

Page 27: Big Data:  The Magic to Attain New Heights

Security

Protection, Privacy, Fraud

Page 28: Big Data:  The Magic to Attain New Heights
Page 29: Big Data:  The Magic to Attain New Heights

Office 365 Advanced Threat Protection

Big Data Only Solution

Safe Link is powered by Cloud Exchange & Bing data

AI Model powered by data from thousands of companies and attachments

Page 30: Big Data:  The Magic to Attain New Heights

Short lived identifiers

Increase transparency and control for users

Build privacy into the OS and all apps

Page 31: Big Data:  The Magic to Attain New Heights
Page 32: Big Data:  The Magic to Attain New Heights

How the Windows Store Security Team made the Insights Leap

Page 33: Big Data:  The Magic to Attain New Heights

App Store Data Architecture

App Certification and Analysis Pipeline
Store Services Log and Telemetry
Bing Spam and Malware
Windows Services Safety Platform (MSA, SmartScreen, etc.)
MMPC/Spynet

Network IPs, File Hashes, PhotoDNA
Strings, API Called
User Install Data, Ratings and Reviews, Purchases
Geographic Data
Account Reputation, Bad URLs
Botnet infected Clients

Cosmos Storage and Compute

BTW this was not Big Data

Page 34: Big Data:  The Magic to Attain New Heights

NoName was Learning basic DS

"Look at how I did this k-means clustering and found these weird outliers in buying circles from Dev accounts created the same week and same IP address"

"Check it out, I found this guy's FB page. We have his picture!"

NoName and I were Spitballing Ideas

Page 35: Big Data:  The Magic to Attain New Heights

Fraud Network Identification

XXX Developer: Created 40 Different Store Developer Accounts and 100s of Apps

[Diagram: Bad Dev 1, Bad Dev 2, through Bad Dev ‘N’ connected by the following links]
• Payment Instruments
• Shared Fraudulent Payment Instruments
• App Similarity
• App Metadata (URL, Websites)
• New Identity Metadata
• Social Networks
• 3rd party app stores
• Developer Watering Holes
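The slide shows developer accounts being tied together through shared signals. A small sketch of that idea, using only one signal (shared payment instruments) and a union-find structure to group accounts; this is a stand-in technique chosen for the example, not a description of what the Store team actually ran, and the account and card names are made up.

```python
from collections import defaultdict


def fraud_networks(account_instruments):
    """Group developer accounts that share at least one payment instrument.

    account_instruments maps an account id to the set of payment instrument
    ids seen on it. Accounts linked directly or through a chain of shared
    instruments end up in the same group. Only groups with more than one
    account are returned.
    """
    parent = {account: account for account in account_instruments}

    def find(a):
        # Walk up to the root, shortening the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # instrument id -> first account it was seen on
    for account, instruments in account_instruments.items():
        for instrument in instruments:
            if instrument in first_seen:
                union(first_seen[instrument], account)
            else:
                first_seen[instrument] = account

    groups = defaultdict(set)
    for account in account_instruments:
        groups[find(account)].add(account)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical data: dev_a and dev_c never share a card directly,
# but both share one with dev_b.
accounts = {
    "dev_a": {"card_1"},
    "dev_b": {"card_1", "card_2"},
    "dev_c": {"card_2"},
    "dev_d": {"card_9"},
}
networks = fraud_networks(accounts)
```

The same grouping works for any of the other link types on the slide (app similarity, metadata, and so on) as long as each can be reduced to a shared key.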

Page 36: Big Data:  The Magic to Attain New Heights

lights out

Page 37: Big Data:  The Magic to Attain New Heights


Foundation

Tools and Skills for Big Data

Page 38: Big Data:  The Magic to Attain New Heights

The Big Red Switch

This used to require humans

Page 39: Big Data:  The Magic to Attain New Heights

Sidebar: I had an Epiphany

Page 40: Big Data:  The Magic to Attain New Heights

Speed is your friend because…

Page 41: Big Data:  The Magic to Attain New Heights

Code Churn Example 1

Six week coding milestone

Code churn is cumulative

Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3)

• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration

Sim-ship increases risk

Maximum point of instability is at end of milestone

Page 42: Big Data:  The Magic to Attain New Heights

Code Churn Example 2 (Continuous Deployment)

Rapid release cadence (weekly or daily)

Layer 1, Layer 2, Layer 3 through Layer N

• Risk per release decreases because of more incremental change
• You still must be careful of Risk within Production but…
• Total risk over time can be less with incremental change

Max Risk is Production

Page 43: Big Data:  The Magic to Attain New Heights

As Speed Accelerates

Up Front & Post Deploy Testing Decreases

Page 44: Big Data:  The Magic to Attain New Heights
Page 45: Big Data:  The Magic to Attain New Heights

Measures = Test Cases

• We do Measures
• What is a post release test case?
• Automation validates the golden path
• We measure the golden path
• Measures are the same as test cases
  • Monitor the golden path

Page 46: Big Data:  The Magic to Attain New Heights

>1.5*IQR = Outlier = Bug (probably)

• What is a Test Case?
  • What I expect to happen vs. What does happen
• A Test Case is Binary
• Measures can observe success and fail
  • Measures have history of pass fail
• When pass or fail drift from standard expected rates we find outliers
• Outliers are often bugs
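The 1.5*IQR rule on this slide is the standard box-plot fence: anything more than 1.5 interquartile ranges below the first quartile or above the third is flagged. A small sketch applied to a history of daily pass rates; the numbers are invented, and `statistics.quantiles` (Python 3.8+) with its default method is just one of several ways to compute quartiles.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily pass rates for one golden-path measure.
daily_pass_rates = [0.98, 0.97, 0.99, 0.98, 0.97, 0.99, 0.98, 0.80]
suspects = iqr_outliers(daily_pass_rates)
```

Here the 0.80 day falls well below the lower fence and is the only value flagged, which is the "probably a bug" signal the slide describes.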

Page 47: Big Data:  The Magic to Attain New Heights

Rings + Speed + Data = Success

• When speed increases the need for telemetry increases
• The rings model provides a buffer

Page 48: Big Data:  The Magic to Attain New Heights

Tricks

Page 49: Big Data:  The Magic to Attain New Heights

Flighting and A/B testing are mostly the same thing
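Both flighting and A/B testing need a stable way to decide which users see the new experience. One common approach, sketched here under assumed names, is to hash the user and experiment ids into one of 100 buckets; the function name, the labels, and the 10% default are illustrative only.

```python
import hashlib


def assign_flight(user_id, experiment, treatment_percent=10):
    """Deterministically assign a user to "treatment" or "control".

    Hashing the experiment name together with the user id gives each user a
    stable bucket from 0 to 99 per experiment, so the same user always gets
    the same answer without any stored state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_percent else "control"
```

Raising `treatment_percent` over time turns the same function into a gradual rollout, which is where flighting and A/B testing start to look like the same thing.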


Page 50: Big Data:  The Magic to Attain New Heights

Runtime Flags = Continuous Deployment

Page 51: Big Data:  The Magic to Attain New Heights

Generic Service Stack

[Diagram: Production Traffic follows the Default Path down the stack]
• Service UX Front Door
• Service Auth/Identity
• Layer A vCurrent
• Layer B vCurrent
• Service Layer C (Persistent Data Store)

Typical layers:
• Front door servers for logging and access management
• UX rendering layers
• Identity or authentication layers
• Persistent data layers

Page 52: Big Data:  The Magic to Attain New Heights

Runtime Flags Example 1: Side-by-Side Deployments

Runtime Flags
• Flags direct traffic through the stack
• Used to test vNext before full release

[Diagram: Production Traffic follows the Default Path through Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, and Service Layer C (Persistent Data Store). Runtime flags divert Test or Forked Traffic to Layer B vNext.]
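A minimal sketch of the side-by-side idea: both versions of Layer B are deployed, and a runtime flag decides per request which one handles it. The flag name, the `traffic` field, and the handler functions are hypothetical.

```python
def layer_b_vcurrent(request):
    """Stand-in for the currently released Layer B."""
    return {**request, "handled_by": "Layer B vCurrent"}


def layer_b_vnext(request):
    """Stand-in for the next version of Layer B, deployed alongside vCurrent."""
    return {**request, "handled_by": "Layer B vNext"}


def route_layer_b(request, flags):
    """Choose the Layer B implementation for one request.

    Production traffic stays on the default path. Only test or forked
    traffic is sent to vNext, and only while the flag is switched on.
    """
    if flags.get("use_layer_b_vnext") and request.get("traffic") in ("test", "forked"):
        return layer_b_vnext(request)
    return layer_b_vcurrent(request)
```

Because the flag is read at request time, turning vNext off again needs no redeployment.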

Page 53: Big Data:  The Magic to Attain New Heights

Runtime Flags Example 2: N Test Environments

[Diagram: Production Traffic follows the Default Path through Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, and Service Layer C (Persistent Data Store). Test Case and Checkin Tests traffic is directed by Runtime flags onto the Layer A Test Path and Layer B Test Path.]

Page 54: Big Data:  The Magic to Attain New Heights

Apps as a Service: Facebook

"How Facebook secretly redesigned its iPhone app with your help"

…a system for creating alternate versions… within the native app.

The team could then turn on certain new features for a subset of its users, directly,

…a system of "different types of Legos... and see the results on the server in real time."

From article on The Verge by Dieter Bohn, September 18, 2013

Page 55: Big Data:  The Magic to Attain New Heights

All Magicians need an Assistant

Page 56: Big Data:  The Magic to Attain New Heights

Visualization

Machine Learning

Data Scientist

Data Engineer

Extract Load Transform

Data Architecture

Operations and Monitoring

Big Data Infrastructure & Storage

DB Administration

Statistics

Math

Programming

Modeling

Story Telling

Data Exploration

http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists

Typical Industry Staffing

Page 57: Big Data:  The Magic to Attain New Heights

Blended Role for Agile

Visualization

Machine Learning

Data Scientist/Data Engineer

Extract Load Transform

Data Architecture

Operations and Monitoring

Big Data Infrastructure & Storage

DB Administration

Statistics

Math

Programming

Modeling

Story Telling

Data Exploration

@rkjohnston #DataMagic

Page 58: Big Data:  The Magic to Attain New Heights

Process and Culture impact Retention

• Kanban for Project Management
• Balance long and short term impact
• Participate in Industry papers and reviews

[Kanban board with columns Backlog, Doing, Validation, Done. Cards shown:]
• LDA vs PCA vs A13 before stratified sampling
• MLADS ARPD Rehearsal
• Submit Abstract to Strata + Hadoop World
• Edge Experiment 1 Data Processing
• Edge Experiment 2
• Customer Sat and Post Sales Monetization Factors Analysis
• Install Base Decay Rate estimation using Bayesian Model
• Friday Review Slides for Edge Experiment 1
• Edge Experiment 1 Insights Analysis
• Top Enterprise DSAT list from textual analysis
• Business Entity Graph with DUNS, Domain Name, & TaxIDs
• Open Source Entity Graph visualization technology research
• Submit Paper to Informs 2016
• ARPD V3 Model with FFF
• MLADS ARPD Slides Draft 1
• Device Lifetime Value (LTV) model 2

Page 59: Big Data:  The Magic to Attain New Heights

Trying Again & Again

Advantages and Disadvantages of the counting culture

Page 60: Big Data:  The Magic to Attain New Heights

KPIs drive companies and behavior

Page 61: Big Data:  The Magic to Attain New Heights

The 5 Vs of Big Data

Nine months ago there were only three Vs

Variety, Velocity, Volume, Verify, Value

Verification: managing data quality and access control at all points

Page 62: Big Data:  The Magic to Attain New Heights

Must Count More

Counting More Granular

Make it go up and to the right

Is vs Likely

Business Impact is a Given

Drives behavior (especially if tied to compensation)

Page 63: Big Data:  The Magic to Attain New Heights

MVP in a Nutshell

[Diagram regions: Minimum, Viable, Minimum + Viable, Possible Features]

• Minimum + Viable: Good features to test the users' responses
• Bad user experience. Too minimal a set or wrong set of features. Will not engage users enough to gain valuable insights
• The product you want to build but to deliver all features will take too long
• Wasted work adding features that do not add critical value for winning and retaining customers

Page 64: Big Data:  The Magic to Attain New Heights

Minimum Viable Model (MVM)

[Diagram regions: Minimum, Viable, Minimum + Viable, Possible Data, Possible Features]

• Model should provide enough coverage that it can be used for core insights.
• Many models try to include all data and large numbers of attributes but that slows down innovation
• If precision is too low then the model can’t be trusted for even first level insights.
• More features can increase complexity without significant improvement in precision and recall
• Minimum + Viable: An Ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more) and enough recall to be used for core insights.
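To make the precision and recall bar concrete, here is a small sketch that computes both from predictions and ground truth and checks them against a viability threshold. The 98% precision target comes from the slide; the 50% recall floor and the function names are assumptions, since the slide only asks for "enough recall".

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for binary labels (truthy = positive)."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def is_viable(precision, recall, min_precision=0.98, min_recall=0.5):
    """Check a model against the MVM bar: high precision, enough recall.

    min_recall is a hypothetical placeholder for "enough recall".
    """
    return precision >= min_precision and recall >= min_recall


# Hypothetical labels: the model misses one real positive but raises no false alarms.
predicted = [1, 1, 0, 0, 1]
actual = [1, 1, 1, 0, 1]
precision, recall = precision_recall(predicted, actual)
viable = is_viable(precision, recall)
```

Favoring precision over recall means the model may miss cases, but what it does flag can be trusted, which is the point of a first-level insight.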

Page 65: Big Data:  The Magic to Attain New Heights

Keep your eye on the target

The goal is not to get a bull's-eye every time

The goal is to get the data and Learn

Page 66: Big Data:  The Magic to Attain New Heights

Test & Ops = Data Science

Page 67: Big Data:  The Magic to Attain New Heights

Six Keys to a “Big” Magic Show

• Try, Try, Try Again (The Tyranny of Counting)
• Magic Tricks (A/B Testing, Runtime Flags)
• The Venue (Big Data Infrastructure)
• Foundation (Tools for Big Data)
• Security (Protection, Privacy, Fraud)
• The Assistant (Recruit, Train, & Retain)

“Big Data” Search Trends

Page 68: Big Data:  The Magic to Attain New Heights

Big Data: The Magic to Attain New Heights

Ken Johnston, Principal Data Science Manager

Twitter – @rkjohnston

Blog – http://linkedin.com/in/rkjohnston

Email – [email protected]

LinkedIn - http://linkedin.com/in/rkjohnston
