Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial

@2016 LinkedIn Corporation. All Rights Reserved.

KDD '16 Tutorial

Business Applications of Predictive Modeling at Scale


Instructors

Songtao Guo, Sr. Staff of Business Analytics Data Mining at LinkedIn

Yan Liu, Sr. Manager of Business Analytics Data Mining at LinkedIn

Paul Ogilvie, Sr. Manager of Machine Learning Algorithms Team at LinkedIn

Qiang Zhu, Staff of Business Analytics Data Mining at LinkedIn


Outline

Introduction
❏ Why is predictive modeling important?
❏ Understanding of audience background and learning objectives

Predictive Modeling Overview
❏ End-to-end walkthrough of a production modeling solution
❏ Common pitfalls and challenges
❏ Case Study - LinkedIn Feed Ranking

Choosing a Framework
❏ Considerations when choosing a framework
❏ Overview of tools and platforms across industry
❏ Modoop - example of a scaled framework









Example 1: Search Ranking


Example 2: Recommendation

Job recommendations on LinkedIn

Movie recommendations on Netflix


Example 3: User Intention Predictions

Predict which users are more likely to make certain actions, such as click, purchase, churn, etc.


Example 4: B2B Predictive Analytics


Example 5: Web Content Analytics

Social data

Customer feedback
• Customer service
• Group updates
• Network updates

Survey results

Products Categorization

Sentiments

Group Subscription

Home Page Mobile Message

Text Classification

Relevance


Other Applications...

Image Recognition, Fraud/Spam Detection, NLP, Speech Recognition

and more...









Problem Definition


Example: Job Seeker Subscription Model

Assume we periodically send marketing promotions/campaigns to LinkedIn members for job-seeker subscriptions. How do we decide who we should send these emails to?

Binary classification problem: let $y_i$ represent the product subscription status of member $i$.

[Figure: sample job listings (Job 1, Job 2, Job 3). Job 1: Senior Data Scientist, LinkedIn, Mountain View, CA, US; 12 connections work here; posted 2 days ago.]

Job Seeker Subscription

Reaching out
• Who's viewed your profile
• InMail™ messages

Finding the right people
• Premium search

Job seeking
• Featured applicant
• Applicant insights
• Salary data

Standing out
• Premium Profile
• Larger search listings


[Figure: end-to-end modeling workflow. Elements shown: raw data and log data; crowd and internal judgments; label preparation; feature engineering (raw features); feature integration (features with label); data partitioning into training, validation, and testing data; model training; evaluation for model selection (model performance); best model; computation of offline evaluation metrics; offline scoring and indexing with scoring features; online A/B test with a random bucket; A/B test reports; online/offline systems.]





Label Preparation

▪ A set of labels (“right answers”) is defined in advance
▪ Methods
  – Derive from data
    ▪ Historical transaction
    ▪ User preference
    ▪ User activity
  – Domain expert
  – Scale up label collection
    ▪ Crowdsourcing

Example 1: Job seeker model
• user_id
• Target: whether subscribed to the JobSeeker product
  • {Yes, No} -> {1, 0}
• timestamp

Example 2: Churn prediction
• Account + timestamp
• Target: Close/Renew

Example 3: Sentiment analysis
• Review
• Target: sentiment types
  • Strong negative
  • Negative
  • Neutral
  • Positive
  • Strong positive





Feature Engineering

The process of transforming raw data into features that better represent the underlying problem to the predictive models

Entity Features
• Demographics
• Personal and professional interest
• …

Behavioral Features
• pageviews
• searches
• activities on external sites
• …

Social Features
• Social network identity and behaviors
• …

• Multiple data sources, hundreds to billions of features
• Multiple levels of granularity
• Monitoring dynamic changes


Feature Metadata

▪ Location
▪ Description
▪ Type
  – numerical, categorical, ordinal, binary
▪ Granularity
  – snapshot, hourly, daily, weekly, monthly...
▪ Creation frequency
  – hourly, daily, weekly, monthly…
▪ Aggregatable
▪ Aggregation length
  – snapshot, hourly, daily, weekly, monthly...
▪ Aggregation option
  – min, max, sum, avg, median...
▪ Transformation option
  – log, binary, self…
▪ Owner
▪ Retention policy


Feature Storage

▪ File format: one of the most essential drivers of functionality and performance for big data processing

▪ What must be considered?
  – How big are your files?
  – How important is file format “splittability”?
  – Does block compression matter?
  – Are you more concerned about read or write?


Feature Storage Type

Storage                | Text      | Sequence File | Avro      | Parquet         | ORC
type                   | row based | row based     | row based | column-oriented | column + row
block compression      | no        | yes           | yes       | yes             | yes
splittable compression | no        | yes           | yes       | yes             | yes
schema support         | no        | no            | yes       | yes             | yes
schema evolution       | no        | no            | yes       | yes             | yes


Feature Representation

▪ Key-Value pair
▪ Full feature vector
▪ Sparse feature vector

[Figure: features arranged along three axes (Entity, Feature, Timestamp), with one entity-by-feature table per snapshot date: 2016-01-31, 2016-02-29, 2016-03-31. One of the snapshot tables:]

entity_id | feature1 | feature2 | ...
xx1       | 0        | 0.3      |
xx2       | 1        | 0.1      |
xx3       | 0        | NA       |
…         | …        | 0.9      |


Raw Feature Schema Example

▪ Key-value pair

{
  "type" : "record",
  "name" : "TUPLE",
  "fields" : [
    { "name" : "id", "type" : [ "null", "long" ] },
    { "name" : "timestamp", "type" : [ "null", "int" ] },
    { "name" : "feature_name", "type" : [ "null", "string" ] },
    { "name" : "feature_value", "type" : [ "null", "double" ] }
  ]
}

entity_id | timestamp | name | value
1         | t1        | F1   | 0.4
1         | t1        | F2   | 0
1         | t2        | F4   | 5
2         | t3        | F1   | 1
…         | …         | F5   | ...



▪ Full feature vector

{
  "type": "record",
  "name": "TUPLE",
  "fields": [
    { "name": "id", "type": [ "null", "long" ] },
    { "name": "timestamp", "type": [ "null", "int" ] },
    { "name": "feature1", "type": [ "null", "double" ] },
    { "name": "feature2", "type": [ "null", "double" ] },
    { "name": "feature3", "type": [ "null", "double" ] },
    ...
  ]
}

entity_id | timestamp | feature1 | feature2 | ...
xx1       | t1        | 1        | 0.4      |
xx2       | t1        | NA       | 0.2      |
xx3       | t2        | 0        | NA       |
…         | ...       | …        | 0.7      |



▪ Sparse feature vector

{
  "type": "record",
  "name": "FeatureVector",
  "fields": [
    { "name": "entity_id", "type": [ "string" ] },
    { "name": "timestamp", "type": [ "null", "string" ] },
    {
      "name": "features",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "FeatureValue",
          "fields": [
            { "name": "id", "type": "string", "doc": "identifier of a feature" },
            { "name": "value", "type": "string", "doc": "text representation of a feature value" }
          ]
        }
      },
      "default": []
    }
  ]
}

LIBSVM: label index1:value1 index2:value2 ...
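To make the relation between the key-value layout and the sparse LIBSVM layout concrete, here is a minimal Python sketch. The function name `to_libsvm`, the `feature_index` mapping, and the choice to drop missing and zero values are illustrative assumptions, not part of the tutorial's tooling.

```python
from collections import defaultdict


def to_libsvm(rows, labels, feature_index):
    """Turn key-value rows into LIBSVM lines ("label index:value ...").

    rows: iterable of (entity_id, feature_name, value) tuples
    labels: dict entity_id -> label
    feature_index: dict feature_name -> 1-based integer index (assumed mapping)
    """
    by_entity = defaultdict(dict)
    for entity_id, name, value in rows:
        # Sparse layout: missing and zero values are simply not written (assumption).
        if value is None or value == 0:
            continue
        by_entity[entity_id][feature_index[name]] = value

    lines = []
    for entity_id in sorted(labels):
        pairs = sorted(by_entity[entity_id].items())  # indices in ascending order
        body = " ".join(f"{idx}:{val:g}" for idx, val in pairs)
        lines.append(f"{labels[entity_id]} {body}".rstrip())
    return lines


# Rows taken from the key-value example table; labels are made up for illustration.
example_rows = [(1, "F1", 0.4), (1, "F2", 0), (1, "F4", 5), (2, "F1", 1)]
example_labels = {1: 1, 2: 0}
example_index = {"F1": 1, "F2": 2, "F4": 4}
example_lines = to_libsvm(example_rows, example_labels, example_index)
```

The sparse form pays off when most features are zero or absent for most entities, which is typical once categorical features are expanded into indicators.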


Feature Monitoring

[Figure: chart of a monitored metric, annotated "Labor Day"]

Typical metrics to monitor:
● summation, e.g. total page views, total searches
● coverage, e.g. number of users who sent InMails
● percentiles, e.g. top 95% of profile views count

Methods to detect anomalies:
● Percentage change
● T-test, Chi-square test
● Generalized ESD (Extreme Studentized Deviate) test, Seasonal Hybrid ESD
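The simplest of the detection methods listed, percentage change, can be sketched in a few lines of Python. The function name and the default threshold are assumptions chosen for illustration; a real monitor would also need to account for seasonality such as holidays.

```python
def pct_change_alerts(values, threshold=0.3):
    """Return (index, relative_change) for period-over-period jumps above threshold.

    values: metric totals per period, oldest first (e.g. daily page views)
    threshold: assumed cutoff on the absolute relative change
    """
    alerts = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev == 0:
            continue  # relative change is undefined when the previous value is zero
        change = (cur - prev) / abs(prev)
        if abs(change) > threshold:
            alerts.append((i, change))
    return alerts


# Hypothetical daily totals with one sharp dip and recovery.
daily_totals = [100, 104, 98, 40, 101]
alerts = pct_change_alerts(daily_totals)
```

Here both the drop at index 3 and the rebound at index 4 are flagged, which is why a flagged point usually needs a human or a seasonal model to decide whether it is a data problem or an expected event.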





Feature Integration

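Label set

user_id | Label
xx1     | 0
xx2     | 1
xx3     | 0
…       | …

Feature set 1

user_id | pageviews | searches | …
xx1     | 97        | 5        | …
xx2     | 27        | 2        | …
xx4     | 58        | 4        | …

Feature set 2

user_id | Tenure | Is employed | …
xx1     | 3876   | 1           | …
xx2     | 60     | 0           | …
xx3     | 2700   | 1           | …

Feature set 3

user_id | connections | Connect in | …
xx1     | 60          | 4          | …
xx2     | 5           | 5          | …
xx3     | 120         | 30         | …

Integrated result (every user in the label set is kept; features without a match are null)

user_id | Label | pageviews | searches | …    | Tenure | Is employed | … | connections | Connect in | …
xx1     | 0     | 97        | 5        | …    | 3876   | 1           | … | 60          | 4          | …
xx2     | 1     | 27        | 2        | …    | 60     | 0           | … | 5           | 5          | …
xx3     | 0     | null      | null     | null | 2700   | 1           | … | 120         | 30         | …
…       | …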
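The integration shown in the tables behaves like a left join from the label set onto each feature set. A minimal pure-Python sketch follows; the `integrate` function and the dictionary layout of a feature set are assumptions made for this example, and at scale the same join would typically run in Pig, Hive, or Spark.

```python
def integrate(labels, feature_sets):
    """Left-join feature sets onto the label set by user_id.

    labels: list of (user_id, label)
    feature_sets: list of {"columns": [...], "rows": {user_id: tuple_of_values}}
    Users absent from a feature set get None for that set's columns.
    """
    result = []
    for user_id, label in labels:
        row = {"user_id": user_id, "Label": label}
        for fs in feature_sets:
            values = fs["rows"].get(user_id)
            for j, name in enumerate(fs["columns"]):
                row[name] = values[j] if values is not None else None
        result.append(row)
    return result


# Values copied from the slide's label set and first two feature sets.
label_set = [("xx1", 0), ("xx2", 1), ("xx3", 0)]
feature_set_1 = {"columns": ["pageviews", "searches"],
                 "rows": {"xx1": (97, 5), "xx2": (27, 2), "xx4": (58, 4)}}
feature_set_2 = {"columns": ["tenure", "is_employed"],
                 "rows": {"xx1": (3876, 1), "xx2": (60, 0), "xx3": (2700, 1)}}
integrated = integrate(label_set, [feature_set_1, feature_set_2])
```

Driving the join from the label set means that users without a label (xx4 here) never enter the training data, while labeled users with missing features are kept and handled later by imputation.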


Feature Integration

Training data with timestamp

[Figure: timeline from past to present (week 4, week 3, week 2, week 1) over which features are read from the Feature Mart]

mid | Y | X
xx1 | 0 | Snapshot features, Aggregated features
xx2 | 1 | …
xx3 | 0 | …

Transformations

trans | desc
self  | x
sign  | sign(x)
log   | sign(x)*log(abs(x)+1)
isna  | 1: x==null; 0: x!=null

Aggregations

trans        | desc
sum          | sum(xi)
Weighted sum | sum(wi*xi), wi=α^i
avg          | avg(xi)
Weighted avg | sum(wi*xi)/sum(wi), wi=α^i
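The transformation and aggregation options above are small enough to write out directly. The sketch below is one possible reading: the function names are made up, and it assumes that index i = 0 refers to the most recent period, so that a decay factor α below 1 down-weights older periods. The slide does not say where i starts.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signed_log(x):
    """The 'log' transformation: sign(x) * log(abs(x) + 1)."""
    return sign(x) * math.log(abs(x) + 1)


def isna(x):
    """The 'isna' transformation: 1 if the value is missing, else 0."""
    return 1 if x is None else 0


def weighted_sum(xs, alpha):
    """sum(w_i * x_i) with w_i = alpha**i; xs[0] is assumed to be the newest period."""
    return sum((alpha ** i) * x for i, x in enumerate(xs))


def weighted_avg(xs, alpha):
    """sum(w_i * x_i) / sum(w_i) with w_i = alpha**i."""
    weights = [alpha ** i for i in range(len(xs))]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)
```

For example, weekly page views of 4, 2, and 1 (newest first) with α = 0.5 give a weighted sum of 5.25 and a weighted average of 3.0.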


Feature Transformation and Interactions

▪ Feature transformation is important for linear models
▪ Numeric values:
  – binary buckets based on histograms of data
  – log transforms
  – scale to unit variance, mean of 0
▪ Categorical values
  – binary indicator for each categorical value
▪ Interactions
  – cross-products of feature types, e.g.
    ▪ {Skills_member} × {Skills_job}
    ▪ {Skills_member} × {Industry_job}






Model Learning

[Figure: data partitioning splits the data into a training set (model training), a validation set (model validation), and a testing set (model testing); model selection yields the best model]

Solvers
▪ Logistic regression L1/L2
▪ Random Forest
▪ Gradient Boosting Trees
▪ SVM
▪ …


Hyperparameter search

▪ Modeling techniques must be tuned for best performance
  – Regression
    ▪ regularization methods (Ridge, Lasso, Elastic Net)
    ▪ regularization weight
  – Gradient boosted trees
    ▪ number of trees
    ▪ depth of tree
    ▪ learning rate
    ▪ sampling parameters
▪ Provide wrappers to ease hyperparameter search in early problem exploration
  – Grid Search
  – Randomized Search, Bayesian Optimization
▪ Typically you don’t want to rerun the hyperparameter search for each retrain; do it only periodically, when there is a major change in data volume, features, etc.


Measure Model Performance

▪ Precision/Recall
  – Precision = TP/(TP+FP)
  – Recall = TP/(TP+FN)
    ▪ true positive rate / sensitivity
▪ F1 score
  – 2 * (Precision * Recall) / (Precision + Recall)
▪ False positive rate
  – FPR = FP/N = FP/(FP+TN)
▪ Specificity: true negative rate
  – SPC = TN/N = TN/(FP+TN)
▪ Accuracy
  – ACC = (TP+TN)/(P+N)

Confusion matrix

                      | Actual value p | Actual value n
Prediction outcome p’ | True Positive  | False Positive
Prediction outcome n’ | False Negative | True Negative
total                 | P              | N
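The formulas above translate directly into code. This is a minimal sketch with an assumed function name; it does not guard against empty classes, so a zero denominator raises `ZeroDivisionError`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called true positive rate or sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),           # FP / N
        "specificity": tn / (fp + tn),   # TN / N
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # (TP + TN) / (P + N)
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is 0.8, recall is 2/3, the false positive rate is 0.25, and accuracy is 0.7.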




Model Testing

▪ Similar to model validation
  – Test on data NOT used in model selection
  – Scoring using the best model
  – Performance evaluation/comparison

[Figure: raw testing data, production model, best model]





Performance Measurement via A/B Test

[Figure: traffic split between Algorithm a (80%) and Algorithm b (20%)]

Collect results to determine which one is better

Best Practices for Running A/B Tests
▪ Start testing on a small portion of users
▪ Measure one change at a time
▪ Be aware of potential biases (time, targeted population, etc.)
▪ Avoid coupling a marketing campaign with an A/B test
▪ Use a simple rule of thumb to address multiple testing problems
  – 0.05 p-value cutoff for metrics that are expected to be impacted
  – a smaller cutoff, say 0.001, for metrics that are not
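As an illustration of how a p-value for a click-rate metric might be obtained before applying the cutoffs above, here is a sketch of a two-sided two-proportion z-test. The tutorial does not specify which test is used; the test choice, the function name, and the sample counts are assumptions.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), using the pooled standard error and the
    normal approximation, which is reasonable for large samples.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical 80/20 split: 800 clicks of 8000 in bucket a, 260 of 2000 in bucket b.
z_score, p_value = two_proportion_z_test(800, 8000, 260, 2000)
```

In this made-up example the difference clears even the stricter 0.001 cutoff; with identical rates in both buckets the function returns z = 0 and a p-value of 1.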




Model Deployment

▪ Schedule and run the scoring pipeline regularly
  – Feature integration and transformation
  – Scoring using the selected model
▪ Score reasoning
▪ Downstream apps: e.g., marketing campaigns

[Figure: active members from the past k weeks are scored with the production model]

id   | score | rank
xxx1 | 0.99  | 1
xxx2 | 0.87  | 2
xxx3 | 0.72  | 3
…    | …     | …


Model Management

[Figure: two photos labeled Day 1 and Day 30]

Photo credit: asmfoto Marcell Mizik, photo license with Depositphotos File Purchase Agreement #41549281


Model Management

Feed in new training data to generate a new model periodically

Monitor performance changes over time

Keep the model up to date

Ensemble historical models as one of the baseline models









Data Leakage

Leakage: the introduction of information about the data mining target, which should NOT be legitimately available to mine from. -- Shachar Kaufman et al., KDD '11

Common leakage scenarios
● Leaking data from the testing set into the training set
● Leaking of information from the future into the past
● Leaking information from data samples outside of scope of the algorithm’s intended use

ID | Entity    | Time  | Label | Usage 1 | Usage 2 | Region | Size | Spending | Activity 1 | Activity 2 | …
1  | Company 1 | 01/01 | 1     | 10      | 2       | SF     | 118  | 3,810    | 60         | 4          | …
2  | Company 1 | 02/01 | 1     | 5       | 0       | SF     | 117  | 3,810    | 70         | 5          | …
3  | Company 1 | 03/01 | 1     | 0       | 3       | SF     | 119  | 3,810    | 120        | 30         | …
4  | Company 2 | 01/01 | 0     | 100     | 30      | NYC    | 50   | 2,000    | 23         | 0          | ...
5  | Company 2 | 04/01 | 0     | 90      | 43      | NYC    | 50   | 2,000    | 23         | 0          | ...
…  | …         | …     | …     | …       | …       | …      | …    | …        | …          | …          | …

[Figure: randomSplit() distributes rows 1 to 5 across the training and testing sets]

Yesterday: price1, Today: price2, Tomorrow: price3
feature_i = (price1 + price2 + price3) / 3
Target = future price?

Feature = “Rep id”, Target: customer churn
A specific sales representative was assigned to take over a churned account

Kaggle 2011 IJCNN Social Network Challenge: the winners de-anonymized 80% of the nodes and 60% of the edges to disclose real identities in the social network
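For the first scenario, one common safeguard is to split by entity instead of by row, so that all rows of a company land on the same side. The sketch below is an illustration under assumptions: the function name, the `entity` key, and the use of an MD5 hash for a stable assignment are choices made for this example, not the tutorial's method.

```python
import hashlib


def split_by_entity(rows, test_fraction=0.2):
    """Assign every row of an entity to the same side of the split.

    rows: list of dicts with an "entity" key (assumed layout)
    The bucket is derived from a hash of the entity name, so the
    assignment is stable across runs and independent of row order.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.md5(row["entity"].encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < test_fraction * 100:
            test.append(row)
        else:
            train.append(row)
    return train, test


# IDs and entities taken from the example table.
example_rows = [
    {"id": 1, "entity": "Company 1"},
    {"id": 2, "entity": "Company 1"},
    {"id": 3, "entity": "Company 1"},
    {"id": 4, "entity": "Company 2"},
    {"id": 5, "entity": "Company 2"},
]
train_rows, test_rows = split_by_entity(example_rows, test_fraction=0.5)
```

For the second scenario (future into past), the analogous safeguard is to split on time and to build each feature only from data dated before the label's timestamp.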


Data Leakage

● Identify data leakage
  ○ Exploratory data analysis (EDA)
    ■ An approach to analyzing data sets to summarize their main characteristics, often with visual methods
  ○ Model performance is too good to be true?
  ○ Early in-the-field testing


Class Imbalance

▪ A dataset is said to be imbalanced when the binomial or multinomial response variable has one or more classes that are underrepresented in the training data relative to the other classes
  – “I have a binary classification problem and the label is distributed in 1:100 ratio in my training set. My results are overfit to majority class.”
▪ The class imbalance problem is pervasive
  – e.g. job recommendation, ads CTR, fraud detection
▪ Misclassifying the minority class usually carries a high cost
  – rejecting a valid credit card transaction vs. approving a large fraudulent transaction


Class Imbalance

Solutions
▪ Can you collect more data?
▪ Consider different evaluation metrics
  – “Accuracy” might be misleading for imbalanced training data
  – Confusion Matrix, Precision, Recall, F1, Kappa, ROC
▪ Re-sampling the data set
  – Up-sampling (Over-sampling)
  – Down-sampling
  – Synthetic Minority Oversampling Technique (SMOTE)
▪ Use different learning algorithms
  – Decision tree, random forest
  – AUC-Maximizing Algorithms
▪ Cost-Sensitive Training
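Of the re-sampling options, down-sampling the majority class is the simplest to sketch. The function below is an illustration; its name, the `label` key, and the `ratio` parameter are assumptions. Note that a model trained on re-sampled data sees a different class prior than production traffic, so its predicted probabilities generally need recalibration.

```python
import random


def downsample_majority(rows, label_key="label", ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of the majority rows.

    ratio: desired number of majority rows per minority row (assumed knob)
    seed: fixed seed so the sample is reproducible
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    keep = min(len(majority), int(len(minority) * ratio))
    sampled = minority + rng.sample(majority, keep)
    rng.shuffle(sampled)
    return sampled


# A 1:100 label distribution, as in the quoted complaint.
imbalanced = [{"label": 1}] * 5 + [{"label": 0}] * 500
balanced = downsample_majority(imbalanced)
```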


Categorical Data

▪ Categorical feature
  – A variable that can take on one of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or “category” - The Practice of Statistics, 2003
    ▪ |SIGKDD Conference| = ?
    ▪ |KDD workshop| = ?
    ▪ |KDD host country| = ?
▪ High-cardinality categorical features are common in the data
  – E.g. industry, country, city
▪ Too many levels
  – Not all the levels (distinct values of the categorical feature) have enough support. Some are less useful
  – Many machine learning tools can only handle a certain number of levels
    ▪ E.g. the Random Forest implementation in R has a hard limit of 32 levels for a categorical feature

http://bcs.whfreeman.com/yates2e/


Categorical Data

▪ Solutions
  – Reduce cardinality by grouping categories into higher-level ones
  – Transform a categorical feature into multiple binary ones
    ▪ Introduce an additional ‘others’ feature to represent all the new categories in the testing set

[Figure: example higher-level categories "Cloud and Distributed Computing" and "Dental"]

Hadoop | Spark | Hive | Yarn | ... | others
1      | 1     | 0    | 0    | ... | 0
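The binary encoding with an extra 'others' column can be sketched as follows. The function names and the choice to treat a member's skills as a set of values are assumptions for illustration; the vocabulary is learned from training data only, so categories first seen at test time fall into 'others'.

```python
def fit_vocabulary(training_value_sets):
    """Collect the categories seen in training data, in a stable order."""
    seen = set()
    for values in training_value_sets:
        seen.update(values)
    return sorted(seen)


def encode(values, vocabulary):
    """One binary indicator per known category, plus an 'others' indicator."""
    known = set(vocabulary)
    row = {level: int(level in values) for level in vocabulary}
    # 'others' fires when any value was never seen during training.
    row["others"] = int(any(v not in known for v in values))
    return row


# Hypothetical training data using the skill names from the slide.
vocabulary = fit_vocabulary([{"Hadoop", "Spark"}, {"Hive"}, {"Yarn", "Spark"}])
seen_row = encode({"Hadoop", "Spark"}, vocabulary)
unseen_row = encode({"Flink"}, vocabulary)
```

The first row reproduces the slide's example (Hadoop = 1, Spark = 1, Hive = 0, Yarn = 0, others = 0), while the second row has only the 'others' indicator set.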


Missing Data

▪ Missing data scenarios
  – Missing Completely at Random (MCAR)
    ▪ Missingness is not related to other variables AND is not related to the value of the missing variable
    ▪ E.g. computer crash
  – Missing at Random (MAR)
    ▪ Missingness is related to other, known variables, BUT is not related to the value of the missing variable(s) once we take the above relation(s) into account
    ▪ E.g. male participants are more likely to refuse to fill out the depression survey, but it does not depend on the level of their depression
  – Missing Not at Random (MNAR)
    ▪ Missingness is related to what the value of the missing data would have been, even if we take other variables into consideration
    ▪ E.g. people with a low high school GPA decline to report it


Missing Data

▪ Solutions
  – Remove observations with missing values
    ▪ When values are missing at random and you have enough data
  – Missing data imputation
    ▪ Common imputation strategies
      – Categorical: choose the category with the most support
      – Numerical: median, mean, or simply set to 0
    ▪ Predict missing data using a model
  – Introduce a corresponding dummy feature to indicate its availability
  – Use a model that is robust to missing data, e.g. a tree-based model
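Median imputation combined with an availability indicator, two of the options above, fits in a few lines. The function name is an assumption; in practice the fill value should be computed on training data and reused unchanged at scoring time.

```python
import statistics


def impute_with_indicator(values):
    """Fill missing numeric values with the median and flag where they were.

    values: list of numbers with None for missing entries
    Returns (imputed_values, is_missing_flags).
    """
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return imputed, is_missing


# Hypothetical feature column with two missing entries.
imputed, is_missing = impute_with_indicator([3, None, 7, 5, None])
```

Keeping the indicator lets the model learn from the missingness itself, which matters when values are not missing completely at random.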


Outliers

▪ Outliers can be introduced in response or predictors
  – Rare event (valid)
  – Erroneous metrics (invalid)
▪ Impact of the outliers
  – Outlier values can have a disproportionate weight on the model
  – MSE will focus more on outlier observations to reduce squared error
  – Boosting will spend considerable modeling effort fitting these observations
▪ Solutions
  – Determine whether the outlier value is valid or invalid
  – Remove observations with outlier feature values
  – Apply a transformation to reduce impact:
    ▪ log
    ▪ binning (e.g. based on distribution)
    ▪ spline transforms
    ▪ Impose a constraint on data range (cap values)
  – Choose a more robust loss function or model (e.g. trees)


Model Interpretation

▪ Why?
  – Debug, diagnose, generate new hypotheses
  – Inevitable questions from your business counterparts about why a prediction was made
  – For presenting reasons to users; these may be the output of an ML model itself

... With premium account, you get more search results and access to …

... Do you know 5 of your connections have started to use premium account …



Univariate Feature Interpretation

Pros:
▪ Get a sense of importance for each feature
▪ Many available algorithms: Random Forest, regularized linear models, various feature selection algorithms

Cons:
▪ Bias, e.g. impurity evaluation of RF is biased towards preferring variables with more categories
▪ Difficulty of interpreting the ranking of correlated variables
▪ A single feature may contain lots of noise

Group-wise Feature Interpretation

Pros:
▪ Easy to capture the overall picture by grouping a massive number of features into a few buckets
▪ Strong semantic meanings
▪ Less inter-group correlation
▪ Impact from noise is reduced by analysing multiple features at the same time

Cons:
▪ Domain knowledge required

SOCIAL E-LEARNING

MARKETING

AFFILIATION

ENGAGEMENT SPENDING

GROWTH

PRODUCT USAGE

Provide high-level overview and insights

Deep dive to finer-level predictors









Case Study - LinkedIn Feed Ranking

[Figure: LinkedIn feed with callouts for sponsored ads, updates from followed companies, and updates from connections]

Feed Events

➢ Sponsored Ads
➢ Updates from connections
➢ Updates from followed companies
➢ Updates from joined groups
➢ Articles shared / commented / liked by connections
➢ Articles mentioning connections
➢ Articles posted by influencers or connections
➢ News from followed channels
➢ Job recommendations
➢ People You May Know
➢ ...


Actor-Verb-Object Formalism and Activity Types

[Figure: a feed activity annotated with its Actor, Verb, and Object]



▪ Context
  - 400+ million members
  - may have thousands of items to rank
  - focus on the modeling problem: assume that items are provided
  - skewed activity: some members very active, some not very active
  - ranking problem, online scoring
▪ Questions
  - How to formulate the feed ranking problem?
  - What are the suitable algorithms?
  - What kind of features are preferred here?
  - How to perform offline validation?
▪ Group Discussions
▪ Discuss LinkedIn Solutions


Problem Formulation

▪ Binary classification problem: let $y_{it}$ represent the interaction between viewer $i$ and update $t$:

$$y_{it} = \begin{cases} 1, & \text{if viewer } i \text{ interacts with feed update } t \\ -1, & \text{otherwise} \end{cases}$$

▪ Assuming a logistic regression model, let $X_{it}$ be a vector of features characterizing the viewer, the feed update, and the context, and let $\beta$ be a vector of parameters:

$$\log \frac{P(y_{it} = 1 \mid \text{viewer}, \text{update})}{1 - P(y_{it} = 1 \mid \text{viewer}, \text{update})} = \beta' X_{it}$$

which is equivalent to:

$$P(y_{it} = 1 \mid \text{viewer}, \text{update}) = \frac{1}{1 + \exp(-\beta' X_{it})}$$


Model Training

▪ The parameter vector $\beta$ is estimated by maximizing the likelihood of the training data as a function of $\beta$:

$$L(\beta) = \prod_{it} \left(1 + \exp(-y_{it}\, \beta' X_{it})\right)^{-1}$$

▪ Add a regularization term (Euclidean norm on the parameter vector, with weight $\lambda$) to the log-likelihood function to mitigate overfitting:

$$\ell(\beta) = -\sum_{it} \log\left(1 + \exp(-y_{it}\, \beta' X_{it})\right) - \lambda \lVert \beta \rVert^2$$

▪ Maximize the regularized log-likelihood using its gradient vector and one of the standard computational tools for gradient-based optimization, for example stochastic gradient descent.
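To show what the gradient-based fit can look like, here is a small stochastic gradient sketch for L2-regularized logistic regression with labels in {1, -1}. It is an illustration under assumptions: the parameter vector is called `beta`, the regularization weight `lam`, and the learning rate, epoch count, and per-example application of the penalty are simplifications chosen for this example, not the tutorial's production setup.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(data, dim, lam=0.001, lr=0.1, epochs=50, seed=0):
    """Stochastic gradient ascent on the regularized log-likelihood.

    data: list of (x, y) with x a sequence of floats and y in {1, -1}
    For one example, the gradient of -log(1 + exp(-y * beta'x)) is
    y * x * sigmoid(-y * beta'x); the penalty contributes -2 * lam * beta.
    """
    rng = random.Random(seed)
    beta = [0.0] * dim
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            margin = y * sum(b * xi for b, xi in zip(beta, x))
            scale = y * sigmoid(-margin)
            beta = [b + lr * (scale * xi - 2 * lam * b) for b, xi in zip(beta, x)]
    return beta


def predict_proba(beta, x):
    """P(y = 1 | x) = 1 / (1 + exp(-beta'x))."""
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))


# Toy data: first component is a constant bias term, second a single feature.
toy = [((1.0, 2.0), 1), ((1.0, 1.0), 1), ((1.0, -1.0), -1), ((1.0, -2.0), -1)]
beta_hat = train_sgd(toy, dim=2)
```

The ±1 label convention is what makes the per-example gradient a single expression, which is one reason the likelihood is written in terms of the margin $y_{it}\,\beta' X_{it}$.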


Potential Features

▪ Features in the feature vector $X_{it}$ are classified into the following categories:
  – Viewer-only features $X_i$
    ▪ member industry, geo location, skills
  – Activity-only features $X_t$
    ▪ time of activity, verb type (like, comment, share, recommendation, ...)
  – Object-only features $X_o$
    ▪ position
  – Viewer-actor features $X_{ij}$
    ▪ number of messages sent to actor from viewer, number of shared connections, ...
  – Viewer-activity type features $X_{ik}$
    ▪ viewer CTR for activity type
  – Viewer-actor-activity type features $X_{ijk}$
    ▪ viewer CTR for actor - activity type combination
  – Viewer-object features $X_{io}$
    ▪ object same language as viewer profile, number of previous interactions
▪ The inner product $\beta' X_{it}$ (with $\beta$ the parameter vector) can be decomposed into
  $\beta' X_{it} = \beta_i' X_i + \beta_t' X_t + \beta_{ij}' X_{ij} + \beta_{ik}' X_{ik} + \beta_{ijk}' X_{ijk} + \beta_{io}' X_{io}$


Feature Design and Engineering

▪ Apply feature transformations to work around the limitations of linear models:
  – log function on raw counts
  – indicator function on raw counts
  – bucketize a numerical feature into a set of categorical features, concatenated as a binary sub-vector


Three-step Offline Evaluation

▪ AUC for model selection
▪ Comparing the ratio of observed to expected (predicted) probabilities for different activity types (denoted as the O/E ratio), and checking whether or not the data likelihood converged to the global maximum
  – The CTR model overestimates the pCTR probabilities if O/E < 1
  – The CTR model underestimates the pCTR probabilities if O/E > 1
▪ The “Replay” tool
  – Takes a logistic regression model and runs it on historic (random bucket) data and records the total interaction on “matched impressions”
  – Reorders them using the new model that we evaluate
  – Counts only clicks on matched impressions that appear in the first position
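The O/E check can be sketched as a ratio of observed clicks to summed predicted click probabilities per activity type. The function name and the input tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict


def oe_ratio_by_type(impressions):
    """Observed / expected clicks per activity type.

    impressions: iterable of (activity_type, clicked, predicted_ctr),
    where clicked is 0 or 1 and predicted_ctr is the model's pCTR.
    A ratio below 1 means the model overestimates, above 1 underestimates.
    """
    observed = defaultdict(float)
    expected = defaultdict(float)
    for activity_type, clicked, predicted_ctr in impressions:
        observed[activity_type] += clicked
        expected[activity_type] += predicted_ctr
    return {t: observed[t] / expected[t] for t in observed if expected[t] > 0}


# Hypothetical logged impressions.
logged = [
    ("share", 1, 0.5), ("share", 0, 0.5),
    ("like", 1, 0.25), ("like", 1, 0.25),
]
ratios = oe_ratio_by_type(logged)
```

In this made-up log the model is well calibrated for shares (ratio 1.0) and underestimates likes (ratio 4.0).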









Scale of the problem

▪ Can it fit on one machine?
  – source: https://github.com/szilard/benchm-ml
  – Vowpal Wabbit can train a linear model on a
    ▪ 1B example x 9 dense feature dataset in ~23 minutes on an AWS r3.8xlarge instance
    ▪ 10M example dataset in 15 seconds
  – xgboost can train gradient boosted trees on a
    ▪ 10M example x 9 dense feature dataset in as little as 15 minutes using 6GB RAM
▪ Easier to explore a wide range of algorithms
▪ Single machine algorithms are often the most robust and come with good tools

"Green Globe In Child’s Hands" (CC BY-SA 2.0) by kenteegardin

https://www.flickr.com/photos/teegardin/5860234733/

https://creativecommons.org/licenses/by-sa/2.0/

https://www.flickr.com/people/teegardin/


Integration with your company’s tech stack

▪ Data formats or sources
  – Avro, Parquet, Hive, SQL, Kafka, …
  – Build your own data converters
    ▪ Can be surprisingly difficult!
    ▪ Character encoding
    ▪ Sparse, text named features → indexed features
▪ Model deployment
  – Model formats: PMML
  – Embeddable scorers: what programming language?


Modeling Techniques

▪ Linear regression models (linear, logistic, poisson)
  – Scale to the largest datasets
  – Can have useful model diagnostics
  – Are fast at scoring
  – Can be easily retrained or updated online
  – Require much more feature engineering
▪ Gradient boosted regression trees and random forests
  – Provide great model accuracy with minimal feature engineering
  – Are slower to train
  – Are slower to score
▪ Others?
  – Factorization machines
  – Deep learning


Scoring Support

▪ Sometimes efficient scoring is critical
  – ranking problems, recommender systems, edge prediction
▪ Pre-compute in batch
  – build your own candidate selection
  – feed into scoring utility
  – rank and store
  – scaling for very large numbers of candidate pairs can be challenging
  – relatively expensive if not queried
▪ Compute on-demand
  – candidates provided by an underlying index (e.g. search engine), database, or store
  – score online
  – rank and return
  – latency is critical
  – requires robust online infrastructure


▪ build
  – in-house building often only makes sense for large teams
  – allows specialization, innovation
▪ assemble
  – leverage open-source software
  – make it easier to use within your company’s tech stack
  – fill in gaps that open-source software doesn’t address: feature integration, model updates, a/b testing, …
▪ buy
  – can get support contracts
  – makes sense if there is little to no engineering support within the company or business unit
  – good for small teams
  – can be limiting in terms of techniques

"Brick laying" (CC BY 2.0) by hnnbz

"dollar-sign" (CC BY-SA 2.0) by Oldmaison

"LEGO house" (CC BY 2.0) by Atsushi Tadokoro

https://www.flickr.com/photos/99781513@N04/16086839412/

https://creativecommons.org/licenses/by/2.0/

https://www.flickr.com/people/99781513@N04/

https://www.flickr.com/photos/httpoldmaisonblogspotcom/4315218578/


https://www.flickr.com/people/httpoldmaisonblogspotcom/

https://www.flickr.com/photos/tadokoro/64402761/

https://creativecommons.org/licenses/by/2.0/

https://www.flickr.com/people/tadokoro/


Maturity of platform

▪ Leverage your network
  – Who is using it? In production?
  – Is it working for them?
  – What are its limits? Scalability, stability, ease of use, …
▪ Open source
  – How active is the development of the project?
    ▪ sometimes means lots of changes or instability
    ▪ usually a good sign
  – From what companies?
    ▪ Diverse companies suggest broad adoption
    ▪ Big companies suggest scale, maturity of engineering processes
    ▪ Academic only may indicate weak points in scale or robustness (especially with respect to scoring solutions)
▪ Purchasing solutions or support
  – Consider size of the company, amount of funding raised, your tolerance to risk
  – Pedigree of the ML experts and engineers at the company

"Jersey City waterfront at night" (CC BY-SA 2.0) by mattk1979

https://www.flickr.com/photos/mattkieffer/6212418206/


https://www.flickr.com/people/mattkieffer/


Anticipate evolving needs

▪ May start with buying a solution
▪ But later need to integrate more tightly and move toward assembling
▪ Or later build major components yourself
▪ Consider whether the solutions you buy are built on open source and can help you transition to assembling solutions

"Butterfly" (CC BY-SA 2.0) by Salvatore Gerace

https://www.flickr.com/photos/salvatore_gerace/16968520459/


https://www.flickr.com/people/salvatore_gerace/









Spark ML, MLlib

▪ ML
  – Usability through DataFrames and Pipelines
  – More polished implementations, often more scalable
  – Regression, Trees, k-Means, LDA
▪ MLlib
  – Old RDD API
  – More extensive set of methods
  – Matrix Factorization, Power Iteration Clustering, SVD, PCA, Frequent Pattern Mining, …
▪ Extensive support for reading data sources via Spark: Amazon S3, Amazon Redshift, MongoDB, MySQL, Shark, Hive, HDFS, byo, ...
▪ Batch and streaming scoring support
▪ PMML model export for some techniques


Vowpal Wabbit

▪ Very fast SGD
  – hashing trick
  – scales to very large number of features and datasets on a single machine
  – models less interpretable (can’t know the weight on any single feature)
▪ Allreduce server for distributed learning
▪ Quadratic, cubic feature interaction
▪ Most users run from command line
▪ Also C++ and C# APIs
▪ Linear, logistic, svm, quantile regression, online (contextual bandit), LDA, matrix factorization
▪ Batch, API, service scoring
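The hashing trick mentioned above maps feature names straight to vector indices, with no dictionary to build or store. The sketch below illustrates the idea only: it uses CRC32 from the Python standard library for convenience, which is an assumption of this example and not Vowpal Wabbit's own hash function, and the bucket count is arbitrary.

```python
import zlib


def hash_features(named_features, num_buckets=2 ** 18):
    """Map {feature_name: value} to a sparse {index: value} vector.

    Names that collide in the same bucket have their values summed,
    which is why a single original feature's weight cannot be read
    back from the trained model.
    """
    vector = {}
    for name, value in named_features.items():
        index = zlib.crc32(name.encode("utf-8")) % num_buckets
        vector[index] = vector.get(index, 0.0) + value
    return vector


# Hypothetical text-named features for one example.
hashed = hash_features({"industry=internet": 1.0, "pageviews_log": 4.6})
```

Memory use is bounded by `num_buckets` regardless of how many distinct feature names appear, at the cost of occasional collisions.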


DMLC

▪ (mostly) research collaboration
▪ Mixed language support across methods
▪ XGBoost: Scalable GBDT
▪ MXNet: flexible deep learning
▪ DiFacto: factorization machines, logistic regression using parameter server


scikit-learn

▪ Popular Python package for machine learning
▪ SVM, regression, trees (RF, GBDT), k-means, PCA, matrix factorization, …
▪ Load easily from file, HDFS, Amazon S3
▪ Batch, API scoring


R

▪ Application for data mining, statistics, ML
▪ Generally not considered “production” grade
▪ Extensive packages
▪ Among most popular: SVM, regression trees (GBRT, RF, ...), PCA, Quadratic Programming, GLM, GBM, association rule mining, bagging, ...
▪ Packages for loading data from file, SQL, JDBC, HDFS, Hive, Redshift, Amazon S3
▪ PMML model export for some packages


TensorFlow

▪ Primarily used for Deep Learning
▪ Python (first class), C++ (minimal) APIs
▪ Designed from the ground up for distributed ML
▪ GPU support
▪ Flexible modeling framework beyond deep learning
▪ Auto-differentiation makes it easier to add new techniques
▪ Designed for production learning and scoring
▪ Batch scoring, model export


AWS + Amazon Machine Learning

console, API

S3, Redshift, RDS (SQL)

limited transformations

logistic regression, multinomial regression, linear regression


AWS + Spark

linear, logistic, multiclass, and survival regression, trees (RF, GBDT), MF, k-means, LDA

Scala, Java, R, Python

MySQL, Shark, Hive, HDFS, ODBC, …

pipelines

console, API


AWS + Spark + Databricks

linear, logistic, multiclass, and survival regression, trees (RF, GBDT), MF, k-means, LDA

Scala, Java, R, Python

MySQL, Shark, Hive, HDFS, ODBC, …

pipelines

notebooks, scheduled jobs, model deployment to scoring server

S3, Redshift, RDS (SQL)


AWS + H2O

extensive regression & classification, GBDT, ensembles, deep learning

notebooks, model export (jar, JSON)

HDFS, file, URL download, Amazon S3


AWS + H2O + Spark

notebooks, model export (jar, JSON)


AWS + Turi

extensive regression & classification, recommenders (factorization machines, MF, …), k-means, LDA, deep learning

notebooks, C++, Python

HDFS, Amazon S3, file, ODBC, Avro, ...

API, batch scoring


AWS + Turi + Spark

notebooks, C++, Python

HDFS, Amazon S3, file, ODBC, Avro, ...

Spark RDD <-> Dato SFrame data conversion


Azure Cloud:dbs, batch computation,Power BI,stream analytics,...

Microsoft Azure Machine Learning

91

extensive regression & classification, k-means, Vowpal Wabbit

API, Notebooks, GUI Workflows, R, python

API, batch scoring

Hive, Azure, bulk upload, URL download,. ..


Azure Cloud: dbs, batch computation, Power BI, stream analytics, ...

Azure Machine Learning + Hadoop/Spark



API, Notebooks, GUI Workflows, R, python

HDInsight: run Hadoop and Spark on Azure

API, batch scoring

Hive, Azure, bulk upload, URL download, ...


Google Machine Learning (limited preview)

compute, app, networking, storage, big data, security, ...

TensorFlow: deep learning, flexible optimization framework

notebooks

Cloud Storage, BigQuery

model export, API, batch


Outline







Modoop - example of building/assembling a platform

[Architecture diagram. Components shown: Web UI; Application 1, Application 2, … Application N, each with its own workflows; Intelligence Engine; feature engineering libraries; machine learning libraries; workflow scheduler & manager; HDFS Feature Mart; ground truth; drivers; metadata store; scores; models; MapReduce / Yarn; Pig, Hive, Python, Scala, Shell script, ...]




Leverage open source: Hadoop/Spark, Azkaban




Feature mart and metadata store enable feature integration and feature engineering
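As a rough illustration of how a feature mart and a metadata store can work together, the sketch below assembles one feature row per member from several source tables. It is not Modoop code. The metadata layout, the table names, the feature names, and the function `build_feature_rows` are all hypothetical, and the in-memory dictionaries stand in for data that would live on HDFS.

```python
# Hypothetical metadata store: which source table holds each feature,
# and which default to use when a member has no value for it.
FEATURE_METADATA = {
    "connections": {"source": "profile", "default": 0.0},
    "job_views_30d": {"source": "activity", "default": 0.0},
}


def build_feature_rows(tables, metadata, member_ids):
    """Join features from several source tables into one row per member.

    tables:   {table_name: {member_id: {feature_name: value}}}
    metadata: {feature_name: {"source": table_name, "default": value}}
    """
    rows = {}
    for member_id in member_ids:
        row = {}
        for name, meta in metadata.items():
            source_table = tables.get(meta["source"], {})
            member_record = source_table.get(member_id, {})
            # Fall back to the declared default for missing values.
            row[name] = member_record.get(name, meta["default"])
        rows[member_id] = row
    return rows
```

Because the metadata declares where each feature lives, adding a feature to every downstream model becomes a metadata change rather than a change to each application's join logic.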




Leverage open source, augment with evaluation and model selection

Photon ML

Metronome: internal ML workflow tools
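One way to picture the evaluation and model selection layer is the sketch below, which compares candidate models by AUC on a held-out set and keeps the best one. This is an assumption about what such a layer might do, not a description of Photon ML or Metronome. The functions `auc` and `select_model` are hypothetical, and the pairwise AUC computation is quadratic, so it is only suitable for small validation sets.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs that the scores rank correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_model(candidates, labels):
    """Pick the candidate with the highest validation AUC.

    candidates: {model_name: list of scores on the validation set}
    Returns (best_model_name, {model_name: auc}).
    """
    results = {name: auc(labels, scores) for name, scores in candidates.items()}
    best = max(results, key=results.get)
    return best, results
```

For example, `select_model({"gbdt": [0.9, 0.1, 0.8, 0.2], "logit": [0.6, 0.7, 0.4, 0.3]}, [1, 0, 1, 0])` picks `"gbdt"`, since it ranks every positive above every negative.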




Create end-to-end workflow based on user-provided label and configurations
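The idea of generating a workflow from a label and a configuration can be sketched as building a dependency graph and ordering its steps. The example below is hypothetical: the configuration keys, the step names, and `plan_workflow` are invented for illustration and do not reflect Modoop's or Azkaban's actual job format. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def plan_workflow(config):
    """Turn a user-provided label and feature list into an ordered step list.

    config: {"label": str, "features": [str, ...]}
    """
    label_step = f"join_label:{config['label']}"
    feature_steps = [f"extract:{name}" for name in config["features"]]

    # Map each step to the set of steps it depends on.
    dependencies = {step: set() for step in feature_steps}
    dependencies[label_step] = set(feature_steps)  # join features with the label
    dependencies["train"] = {label_step}
    dependencies["evaluate"] = {"train"}
    dependencies["score"] = {"evaluate"}

    # static_order() yields every step after all of its dependencies.
    return list(TopologicalSorter(dependencies).static_order())
```

A call such as `plan_workflow({"label": "subscribed", "features": ["connections", "job_views_30d"]})` returns the two extraction steps first, followed by the label join, training, evaluation, and scoring.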




Custom user interface provides model understanding tools


Summary

▪ Mining data to extract useful and enduring patterns remains a skill arguably more art than science

▪ Choose appropriate evaluation metrics for different audiences

▪ A black box model is not enough

▪ Consider scale, features, integration costs, maturity, and flexibility of platforms
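As one example of matching a metric to its audience, a marketing team may find lift in the top-scored segment easier to act on than a ranking metric such as AUC. The sketch below computes that lift. The function name `lift_at_fraction` and the default of 10% are assumptions made for this illustration.

```python
def lift_at_fraction(labels, scores, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate.

    A lift of 3.0 means the targeted segment responds three times as often
    as a randomly chosen segment of the same size.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")

    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive labels")

    # Rank members from highest to lowest score and keep the top segment.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate
```

In the job-seeker subscription setting, this answers the campaign question directly: if emails go only to the top-scored members, how much more often do they subscribe than members picked at random?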



Appendix



Open Source Software

Tools compared: Spark ML, Vowpal Wabbit, dmlc, scikit-learn, R, TensorFlow

Interfaces
▪ Spark ML: python, Scala, Java, R
▪ Vowpal Wabbit: command line, C#, C++
▪ dmlc: command line; xgboost: C++, Python, R, Java, Scala, Julia; mxnet: Python, R, Julia, Scala, Go, Javascript; difacto: C++
▪ scikit-learn: notebook, command line, python
▪ R: gui, command line
▪ TensorFlow: python

Data Sources
▪ Spark ML: Amazon S3, Amazon Redshift, mongoDB, mySQL, Shark, Hive, HDFS, byo, ...
▪ Vowpal Wabbit: file, API
▪ dmlc: file, HDFS, Amazon S3
▪ scikit-learn: file, HDFS, Amazon S3
▪ R: file, SQL, JDBC, HDFS, hive, Redshift, Amazon S3
▪ TensorFlow: file, API

Modeling Techniques
▪ Spark ML: MLlib: linear, logistic, tree (RF, GBDT), MF, survival regression, multiclass, k-means, LDA
▪ Vowpal Wabbit: linear, logistic, svm, quantile regression, online (contextual bandit), LDA, MF
▪ dmlc: xgboost (gbrt), mxnet (deep learning), difacto (factorization machines, logistic regression)
▪ scikit-learn: SVM, regression, trees (RF, GBDT), k-means, PCA, matrix factorization, ...
▪ R: SVM, regression trees (GBRT, RF, ..), PCA, QP, GLM, GBM, association rule mining, bagging, ...
▪ TensorFlow: deep learning, linear/logistic regression, flexible learning framework

Platforms
▪ Spark ML: Spark
▪ Vowpal Wabbit: Single Machine, Custom AllReduce
▪ dmlc: parameter server, AllReduce (Rabit), Spark, YARN
▪ scikit-learn: -
▪ R: -
▪ TensorFlow: Google Cloud Compute Engine

Scoring
▪ Spark ML: Batch, Streaming
▪ Vowpal Wabbit: Batch, API
▪ dmlc: Batch
▪ scikit-learn: Batch, API
▪ R: PMML export for some packages
▪ TensorFlow: Batch, API


Machine Learning Platforms

Platforms compared: Microsoft Azure Machine Learning, Google Cloud Machine Learning Platform (alpha), Amazon Machine Learning, Databricks, H2O, Turi

Interfaces
▪ Microsoft Azure Machine Learning: API, Notebooks, GUI Workflows, R, python
▪ Google Cloud Machine Learning Platform (alpha): API, Notebooks, Programmatic Workflows, Java, python (alpha)
▪ Amazon Machine Learning: API, console
▪ Databricks: API, Notebooks, Programmatic Workflows, Java, Scala, python, R
▪ H2O: API, Notebook, R, Java, Scala, python
▪ Turi: Notebooks, C++, python

Cloud
▪ Microsoft Azure Machine Learning: Microsoft Azure
▪ Google Cloud Machine Learning Platform (alpha): Google Cloud Compute Engine
▪ Amazon Machine Learning: Amazon AWS
▪ Databricks: Amazon AWS
▪ H2O: Amazon AWS, Microsoft Azure*, Google Cloud*, your data center
▪ Turi: Amazon AWS, Microsoft Azure*, Google Cloud*, your data center

Data Sources
▪ Microsoft Azure Machine Learning: Hive, Azure blob storage, Azure table, Azure SQL, OData feed, bulk upload, URL download
▪ Google Cloud Machine Learning Platform (alpha): text (including JSON), Google Cloud Bigtable, Google Cloud Datastore, byo
▪ Amazon Machine Learning: Amazon S3, Amazon Redshift, Amazon RDS (SQL)
▪ Databricks: Amazon S3, Amazon Redshift, mongoDB, mySQL, Shark, Hive, HDFS, byo, ...
▪ H2O: HDFS, file, URL download, Amazon S3
▪ Turi: HDFS, Amazon S3, file, ODBC, Avro, SparkRDD, ...

Modeling Techniques
▪ Google Cloud Machine Learning Platform (alpha): TensorFlow: flexible deep learning, regression, ...
▪ Amazon Machine Learning: binary, multiclass classification, regression
▪ Databricks: MLlib: linear, logistic, tree (RF, GBDT), MF, survival regression, multiclass, k-means, LDA

Limitations
▪ Microsoft Azure Machine Learning: 10GB training datasets (full info: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-faq/)
▪ Google Cloud Machine Learning Platform (alpha): ML in limited preview; Prediction API: 2.5GB training datasets
▪ Amazon Machine Learning: 100GB training datasets, 1k input features (full info: http://docs.aws.amazon.com/machine-learning/latest/dg/system-limits.html)

Extensibility
▪ Microsoft Azure Machine Learning: byo R, python, package imports
▪ Google Cloud Machine Learning Platform (alpha): TensorFlow SDK, Google Cloud Platform
▪ Amazon Machine Learning: byo AWS services
▪ Databricks: pipelines, byo Spark
▪ H2O: byo on your platform
▪ Turi: python, C++, byo platform

Pretrained Models
▪ Microsoft Azure Machine Learning: vision, speech, sentiment, ...
▪ Google Cloud Machine Learning Platform (alpha): speech, vision, translation, ...

Scoring
▪ Microsoft Azure Machine Learning: API, batch
▪ Google Cloud Machine Learning Platform (alpha): API, batch, planned export
▪ Amazon Machine Learning: API, batch
▪ Databricks: API, batch, streaming, PMML export
▪ H2O: API, JSON, POJO export
▪ Turi: API, Batch

Data & Analytics
