Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – [email protected]
LinkedIn - http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
About Ken
Data Scientist in Core Data Science Team
Office Live, WebApps, Office Online
Cosmos, AutoPilot, Local, Shopping
Kanban and Data Science
Series on EaaSy & MVQ – Everything as a Service & Minimum Viable Quality
Write books and blog, and some fiction
I have a lot of love in my life
My Kids
Team of Amazing Magicians
Getting hands dirty in the data
Connect the Dots
Create Deep Insights
Taking on Sudden Infant Death Syndrome
Big Data and Magic
So, my son gets this kids' “Magic Kit in a Box” for his 8th birthday
Open our own Magic Show
Six Keys to a “Big” Magic Show
Try, Try, Try Again
The Tyranny of Counting
Magic Tricks (A/B Testing, Runtime Flags)
The Venue (Big Data Infrastructure)
Foundation (Tools for Big Data)
Security (Protection, Privacy, Fraud)
The Assistant: Recruit, Train, & Retain
“Big Data” Search Trends
The Venue
Your Big Data Infrastructure
Common Design Patterns
Good Paper to Read
IDC: Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture
Ingest: From Services, IOT, Apps; Via Streams; Into Storage
Process: Build Pipelines; Reduce, Transform, Join; Pipe out
Analyze: From Services, IOT, Apps; Via Streams; Into Storage
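As a rough illustration of the Ingest and Process stages, here is a minimal in-memory sketch. The event fields (`device_id`, `event`, `ms`), the `DEVICES` lookup table and the function names are all hypothetical and only stand in for real streams and storage.

```python
from collections import defaultdict

# Hypothetical raw events, as they might arrive from services, IoT devices or apps.
RAW_EVENTS = [
    {"device_id": "d1", "event": "launch", "ms": 120},
    {"device_id": "d2", "event": "launch", "ms": 300},
    {"device_id": "d1", "event": "launch", "ms": 180},
]

# Hypothetical reference data to join against.
DEVICES = {"d1": {"os": "Windows"}, "d2": {"os": "Android"}}


def ingest(events):
    """Ingest: read events from a stream and land them in storage (a list here)."""
    storage = []
    for event in events:
        storage.append(dict(event))
    return storage


def process(storage, devices):
    """Process: join with device data, then reduce to an average per OS."""
    totals = defaultdict(lambda: [0, 0])  # os -> [sum of ms, event count]
    for event in storage:
        os_name = devices[event["device_id"]]["os"]  # join
        totals[os_name][0] += event["ms"]            # reduce: running sum
        totals[os_name][1] += 1                      # reduce: running count
    return {name: total / count for name, (total, count) in totals.items()}


# "Pipe out": the summary is what an analyze stage would consume.
summary = process(ingest(RAW_EVENTS), DEVICES)
print(summary)
```

In a real platform each stage would be a separate service or job; the point here is only the shape of the hand-offs.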
Azure Model
Cindy Gross – Technical Fellow: Big Data and Cloud
Twitter: @SQLCindy
[email protected]
Ingest
Process
Analyze
Hybrid: Azure and Hadoop Model
Ingest Process Analyze
Amazon Model
Ingest Process Analyze
How we do it in Windows
Prototypical Big Data Platform
[Diagram components]
Client 1, Client 2, Client 3
Telemetry Front End Service
Fast pipeline for high priority data
Alerting DB
Alerting Dashboard
Big Data Map Reduce Cloud
PII Scrubbing Service
Data Extraction Service
Insights DB 1 to Insights DB N
Additional Reporting Dashboards
Personally Identifiable Information (PII) management is very critical.
Data Driven Quality (DDQ) and big data pipelines will need a cloud platform.
The superfast pipeline typically (not always) bypasses the cloud. It is also void of PII.
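One way to picture the two paths is the sketch below. It assumes a `priority` field on each event and a fixed set of PII field names, both hypothetical. It also scrubs every event before storing it, which is a simplification: in the diagram the PII Scrubbing Service is its own component.

```python
# Hypothetical names of fields treated as PII in this sketch.
PII_FIELDS = {"email", "user_name", "ip_address"}


def scrub_pii(event):
    """Return a copy of the event with the assumed PII fields removed."""
    return {key: value for key, value in event.items() if key not in PII_FIELDS}


def route(event, alerting_db, cloud_store):
    """Send high priority events down the fast path, everything else to the cloud."""
    clean = scrub_pii(event)
    if event.get("priority") == "high":
        alerting_db.append(clean)   # fast pipeline: bypasses the cloud, no PII
    else:
        cloud_store.append(clean)   # big data pipeline


alerting_db, cloud_store = [], []
route({"priority": "high", "error": "crash", "email": "a@example.com"}, alerting_db, cloud_store)
route({"priority": "low", "event": "click", "ip_address": "10.0.0.1"}, alerting_db, cloud_store)
print(alerting_db, cloud_store)
```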
Big Data & ML Model Orchestration
Ingest Process Analyze
User Segmentation Approaches
• Risk Tolerance Model
  • Users segment themselves
  • Opt in for greater risk with a reward in mind
• Profile Based
  • Usage behaviors
  • New vs. power users
  • Browser type
  • Connection type
  • Device and device OS
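The two approaches can be combined in a small labeling function. This is only a sketch: the profile keys (`opted_in`, `sessions_per_week`, `browser`, `connection`, `device_os`) and the power-user threshold of 10 sessions are made-up values for illustration.

```python
POWER_USER_SESSIONS = 10  # hypothetical threshold, not from the talk


def segment(profile):
    """Return a dict of segment labels for one user profile."""
    return {
        # Risk tolerance model: the user opted in (or not) for a reward.
        "risk": "opted_in" if profile.get("opted_in") else "default",
        # Profile based: usage behavior.
        "usage": "power" if profile.get("sessions_per_week", 0) >= POWER_USER_SESSIONS else "new",
        # Profile based: environment attributes.
        "browser": profile.get("browser", "unknown"),
        "connection": profile.get("connection", "unknown"),
        "device_os": profile.get("device_os", "unknown"),
    }


labels = segment({"opted_in": True, "sessions_per_week": 14, "browser": "Edge"})
print(labels)
```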
@rkjohnston #DataMagic
Balancing Speed and Risk with Rings
Ring 0: Buddy Build
Ring 1: My Team
Ring 2: Company & NDA
Ring 2 External Beta Users
Ring 4: Everyone
Red line demarks disclosure risk and possible loss of patent rights
Risk tolerance is highest
No desire for risk
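A ring rollout can be sketched as a loop that promotes a build outward one ring at a time and stops as soon as a ring looks unhealthy. The ring names are taken from the slide; the `is_healthy` callback is a hypothetical stand-in for whatever telemetry check gates promotion.

```python
# Rings ordered from highest risk tolerance to lowest.
RINGS = ["Buddy Build", "My Team", "Company & NDA", "External Beta Users", "Everyone"]


def roll_out(build, is_healthy):
    """Deploy to each ring in order; stop after the first unhealthy ring.

    Returns the list of rings the build reached.
    """
    reached = []
    for ring in RINGS:
        reached.append(ring)
        if not is_healthy(build, ring):
            break  # the outer rings act as a buffer: they never see this build
    return reached


# Hypothetical health check: the build regresses once it hits company-wide use.
reached = roll_out("build-42", lambda build, ring: ring != "Company & NDA")
print(reached)
```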
Security
Protection, Privacy, Fraud
Office 365 Advanced Threat Protection
Big Data Only Solution
Safe Link is powered by Cloud Exchange & Bing data
AI Model powered by data from thousands of companies and attachments
Short lived identifiers
Increase transparency and control for users
Build privacy into the OS and all apps
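One common way to get short-lived identifiers is to derive them from a keyed hash of the real ID plus the current date, so the identifier rotates every day. The sketch below assumes that scheme; the talk does not say how it is actually implemented, and the function name and 16-character truncation are illustrative choices.

```python
import hashlib
import hmac
from datetime import date


def short_lived_id(user_id: str, secret: bytes, day: date) -> str:
    """Derive an identifier that is stable within one day and changes the next."""
    message = f"{user_id}|{day.isoformat()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


secret = b"example-only-secret"  # hypothetical key; a real one would be managed and rotated
monday = short_lived_id("user-123", secret, date(2016, 5, 2))
tuesday = short_lived_id("user-123", secret, date(2016, 5, 3))
print(monday, tuesday)
```

Telemetry from the same day can still be joined on the identifier, but it cannot be linked across days without the secret.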
How the Windows Store Security Team made the Insights Leap
App Store Data Architecture
App Certification and Analysis Pipeline
Store Services Log and Telemetry
Bing Spam and Malware
Windows Services Safety Platform (MSA, SmartScreen, etc.)
MMPC/Spynet
Network IPs, File Hashes, PhotoDNA, Strings, API Called
User Install Data, Ratings and Reviews, Purchases
Geographic Data
Account Reputation, Bad URLs
Botnet infected Clients
Cosmos Storage and Compute
BTW this was not Big Data
NoName was learning basic DS
“Look at how I did this k-means clustering and found these weird outliers in buying circles from Dev accounts created the same week and same IP address.”
“Check it out, I found this guy's FB page. We have his picture!”
NoName and I were spitballing ideas
Fraud Network Identification
XXX Developer
Created 40 Different Store Developer Accounts and 100s of Apps
Bad Dev 1, Bad Dev 2, ... Bad Dev ‘N’
Payment Instruments
App Similarity
Social Networks
3rd party app stores
App Metadata (URL, Websites)
Developer Watering Holes
Shared Fraudulent Payment Instruments
New Identity Metadata
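Linking accounts that share a payment instrument, URL or IP address is a connected-components problem. The sketch below uses a small union-find structure for it. The account names and attribute values are invented, and this is not the team's actual method, only one simple way to surface such networks.

```python
from collections import defaultdict


def fraud_networks(accounts):
    """Group developer accounts connected through any shared attribute.

    accounts: dict mapping account name -> set of (kind, value) attributes.
    Returns a list of sets, one per network with more than one account.
    """
    parent = {account: account for account in accounts}

    def find(account):
        # Follow parent links to the root, compressing the path as we go.
        while parent[account] != account:
            parent[account] = parent[parent[account]]
            account = parent[account]
        return account

    first_owner = {}  # attribute -> first account seen with it
    for account, attributes in accounts.items():
        for attribute in attributes:
            if attribute in first_owner:
                parent[find(account)] = find(first_owner[attribute])  # union
            else:
                first_owner[attribute] = account

    groups = defaultdict(set)
    for account in accounts:
        groups[find(account)].add(account)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical data: dev1 shares a card with dev2 and an IP address with dev3.
ACCOUNTS = {
    "dev1": {("card", "c-1"), ("ip", "10.0.0.1")},
    "dev2": {("card", "c-1")},
    "dev3": {("ip", "10.0.0.1"), ("url", "apps.example")},
    "dev4": {("card", "c-9")},
}
networks = fraud_networks(ACCOUNTS)
print(networks)
```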
lights out
Foundation
Tools and Skills for Big Data
The Big Red Switch
This used to require humans
Sidebar: I had an Epiphany
Speed is your friend because…
Code Churn Example 1
Six week coding milestone
Code churn is cumulative
Imagine this as part of a larger multi-layered project
Layer 1
Layer 2
Layer 3
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration
Sim-ship increases risk
Maximum point of instability is at end of milestone
Code Churn Example 2 (Continuous Deployment)
Rapid release cadence (weekly or daily)
Layer 1
Layer 2
Layer 3
Layer N
• Risk per release decreases because of more incremental change
• You still must be careful of risk within production but…
• Total risk over time can be less with incremental change
Max risk is production
As Speed Accelerates
Up Front & Post Deploy Testing Decreases
Measures = Test Cases
• We do Measures
• What is a post release test case?
• Automation validates the golden path
• We measure the golden path
• Measures are the same as test cases
• Monitor the golden path
>1.5*IQR = Outlier = Bug (probably)
• What is a Test Case?
• What I expect to happen vs. what does happen
• A Test Case is binary
• Measures can observe success and fail
• Measures have a history of pass and fail
• When pass or fail drift from standard expected rates we find outliers
• Outliers are often bugs
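The 1.5*IQR rule is easy to apply to a history of measures. The sketch below flags any value more than 1.5 interquartile ranges below the first quartile or above the third. The daily failure rates are made-up numbers, and `statistics.quantiles` is used with its default (exclusive) method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


# Hypothetical daily failure rates (%) for one golden-path measure.
failure_rates = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 9.5]
outliers = iqr_outliers(failure_rates)
print(outliers)  # the 9.5% day stands out and is worth investigating as a probable bug
```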
Rings + Speed + Data = Success
• When speed increases the need for telemetry increases
• The rings model provides a buffer
Tricks
Flighting and A/B testing are mostly the same thing
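Both flighting and A/B testing need a stable way to decide which users see the new behavior. A common approach, sketched here as an assumption rather than as the method used in the talk, hashes the user and experiment name into one of 100 buckets. The function and experiment names are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_percent: int) -> str:
    """Deterministically place a user in 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return "treatment" if bucket < treatment_percent else "control"


# The same user always lands in the same variant for a given experiment.
first = assign_variant("user-123", "new-start-menu", 10)
second = assign_variant("user-123", "new-start-menu", 10)
print(first, second)
```

Raising `treatment_percent` over time widens the flight without moving users who were already in the treatment group.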
Runtime Flags = Continuous Deployment
Generic Service Stack
Service UX Front Door
Service Auth/Identity
Layer A vCurrent
Layer B vCurrent
Service Layer C (Persistent Data Store)
Default Path
Production Traffic
Front door servers for logging and access management
UX rendering layers
Identity or authentication layers
Persistent data layers
Runtime Flags Example 1
Side-by-Side Deployments
Service UX Front Door
Service Auth/Identity
Runtime Flags
• Flags direct traffic through the stack
• Used to test vNext before full release
Layer A vCurrent
Layer B vCurrent
Layer B vNext
Service Layer C (Persistent Data Store)
Default Path
Production Traffic
Test or Forked Traffic
Runtime
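A side-by-side deployment can be sketched as a lookup: each request carries optional flags, and the flag value picks which version of a layer handles it. The flag name `layer_b` and the handler functions below are hypothetical.

```python
def layer_b_current(request):
    """Stand-in for the deployed vCurrent implementation of Layer B."""
    return f"B-vCurrent({request})"


def layer_b_next(request):
    """Stand-in for the side-by-side vNext implementation of Layer B."""
    return f"B-vNext({request})"


LAYER_B = {"vCurrent": layer_b_current, "vNext": layer_b_next}


def handle(request, flags=None):
    """Route a request through Layer B according to its runtime flags."""
    flags = flags or {}
    version = flags.get("layer_b", "vCurrent")  # default path when no flag is set
    return LAYER_B[version](request)


production = handle("r1")                    # production traffic
forked = handle("r1", {"layer_b": "vNext"})  # test or forked traffic
print(production, forked)
```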
Runtime Flags Example 2
N Test Environments
Service UX Front Door
Service Auth/Identity
Layer A vCurrent
Layer B vCurrent
Layer A Test Path
Layer B Test Path
Service Layer C (Persistent Data Store)
Default Path
Production Traffic
Test Case
Checkin Tests
Runtime
Apps as a Service: Facebook
How Facebook secretly redesigned its iPhone app with your help
…a system for creating alternate versions… within the
native app.
The team could then turn on certain new features for a
subset of its users, directly,
…a system of "different types of Legos... and see the
results on the server in real time."
From article on The Verge by Dieter Bohn September 18, 2013
All Magicians need an Assistant
Typical Industry Staffing
Data Scientist; Data Engineer
Visualization
Machine Learning
Extract Load Transform
Data Architecture
Operations and Monitoring
Big Data Infrastructure & Storage
DB Administration
Statistics
Math
Programming
Modeling
Story Telling
Data Exploration
http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists
Blended Role for Agile
Visualization
Machine Learning
Data Scientist/Data Engineer
Extract Load Transform
Data Architecture
Operations and Monitoring
Big Data Infrastructure & Storage
DB Administration
Statistics
Math
Programming
Modeling
Story Telling
Data Exploration
Kanban board columns: Backlog, Doing, Validation, Done
LDA vs PCA vs A13 before stratified sampling
MLADS ARPD Rehearsal
Submit Abstract to Strata + Hadoop World
Edge Experiment 1 Data Processing
Edge Experiment 2
Customer Sat and Post Sales Monetization Factors Analysis
Install Base Decay Rate estimation using Bayesian Model
Friday Review Slides for Edge Experiment 1
Edge Experiment 1 Insights Analysis
Top Enterprise DSAT list from textual analysis
Business Entity Graph with DUNS, Domain Name, & TaxIDs
Open Source Entity Graph visualization technology research
Submit Paper to Informs 2016
ARPD V3 Model with FFF
MLADS ARPD Slides Draft 1
Device Lifetime Value (LTV) model 2
Process and Culture impact Retention
• Kanban for Project Management
• Balance long and short term impact
• Participate in Industry papers and reviews
Trying Again & Again
Advantages and Disadvantages of the counting culture
KPIs drive companies and behavior
The 5 Vs of Big Data
Nine months ago there were only three Vs
Variety, Velocity, Volume, Verify, Value
Verification – managing data quality and access control at all points
Must Count More
Counting More Granular
Make it go up and to the right
Is vs Likely
Business Impact is a Given
Drives behavior (especially if tied to compensation)
MVP in a Nutshell
Possible Features
Viable
Minimum
Minimum + Viable: Good features to test the users' responses
Bad user experience. Too minimal a set or wrong set of features. Will not engage users enough to gain valuable insights
The product you want to build but to deliver all features will take too long
Wasted work adding features that do not add critical value for winning and retaining customers
Minimum Viable Model (MVM)
Possible Data
Viable
Minimum
Model should provide enough coverage that it can be used for core insights.
Many models try to include all data and large numbers of attributes but that slows down innovation
If precision is too low then the model can’t be trusted for even first level insights.
More features can increase complexity without significant improvement in precision and recall
Possible Features
Minimum + Viable
An ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more) and enough recall to be used for core insights.
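For reference, precision is the share of the model's positive predictions that are correct, and recall is the share of actual positives the model finds. A minimal sketch with made-up labels:

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels: the model never raises a false alarm but misses one positive.
predicted = [True, True, False, False]
actual = [True, True, True, False]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

A model like this one, with high precision and partial recall, matches the MVM idea: what it flags can be trusted, even though it does not yet find everything.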
Keep your eye on the target
The goal is not to get a bull's-eye every time
The goal is to get the data and learn
Test & Ops = Data Science
Six Keys to a “Big” Magic Show
Try, Try, Try Again
The Tyranny of Counting
Magic Tricks (A/B Testing, Runtime Flags)
The Venue (Big Data Infrastructure)
Foundation (Tools for Big Data)
Security (Protection, Privacy, Fraud)
The Assistant: Recruit, Train, & Retain
“Big Data” Search Trends