Big Data Infrastructure
Week 8: Data Mining (2/4)

CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
March 3, 2016

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
These slides are available at http://lintool.github.io/bigdata-2016w/

Slides (PDF): http://lintool.github.io/bigdata-2016w/slides/week08b.pdf



The Task

•  Given $D = \{(x_i, y_i)\}_i^n$
   - $x_i = [x_1, x_2, x_3, \ldots, x_d]$: (sparse) feature vector
   - $y \in \{0, 1\}$: label
•  Induce $f : X \to Y$ such that loss is minimized:
   $\frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i), y_i)$, where $\ell$ is the loss function
•  Typically, consider functions of a parametric form:
   $\arg\min_{\theta} \frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i; \theta), y_i)$, where $\theta$ are the model parameters
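
To make the objective concrete, here is a minimal sketch that evaluates the average loss of a parametric model on a toy dataset. The logistic model, the log loss, and the four data points are assumptions chosen for illustration; the slide does not commit to a particular f or loss function.

```python
import math

def predict(theta, x):
    """Hypothetical parametric model f(x; theta): logistic function of the dot product."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, y):
    """Loss for a label y in {0, 1}, where p is the predicted probability of y = 1."""
    eps = 1e-12  # guard against log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def empirical_loss(theta, data):
    """Average loss over the training set D = [(x_i, y_i), ...]."""
    return sum(log_loss(predict(theta, x), y) for x, y in data) / len(data)

# Toy dataset (made up): the first feature is a constant bias term.
D = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 0.5], 1), ([1.0, -0.3], 0)]

print(empirical_loss([0.0, 0.0], D))  # uninformed model
print(empirical_loss([0.0, 2.0], D))  # parameters that fit the data better
```

Learning then amounts to searching for the parameter vector that makes this number as small as possible.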

Gradient Descent

Source: Wikipedia (Hills)

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$
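
The update rule can be sketched in a few lines. This is an illustrative example only: it assumes logistic regression with log loss, a constant step size gamma, and a made-up four-point dataset, none of which are fixed by the slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_gradient(theta, data):
    """(1/n) times the sum of per-instance gradients of the logistic log loss."""
    grad = [0.0] * len(theta)
    for x, y in data:
        err = sigmoid(sum(t * xj for t, xj in zip(theta, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [g / len(data) for g in grad]

def gradient_descent(data, dim, gamma=0.5, iterations=200):
    """Batch gradient descent: every update looks at all training instances."""
    theta = [0.0] * dim
    for _ in range(iterations):
        grad = mean_gradient(theta, data)
        theta = [t - gamma * g for t, g in zip(theta, grad)]
    return theta

# Toy dataset (made up): [bias, feature] -> label.
D = [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 0.5], 1), ([1.0, -0.3], 0)]
theta = gradient_descent(D, dim=2)
print(theta)
```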

MapReduce Implementation

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$

[Diagram: four mappers feeding one reducer]
mappers: compute partial gradient
single reducer: update model
iterate until convergence
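
The division of labor between mappers and the single reducer can be simulated in plain Python. This is a sketch under assumptions, not Hadoop code: `mapper`, `reducer`, and `train` are hypothetical names, the model is logistic regression, and each loop iteration stands in for one full MapReduce job.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mapper(theta, split):
    """Map task: sum the per-instance gradients over one input split."""
    partial = [0.0] * len(theta)
    for x, y in split:
        err = sigmoid(sum(t * xj for t, xj in zip(theta, x))) - y
        for j, xj in enumerate(x):
            partial[j] += err * xj
    return partial, len(split)

def reducer(theta, partials, gamma):
    """Single reduce task: combine the partial gradients and update the model."""
    n = sum(count for _, count in partials)
    total = [sum(p[j] for p, _ in partials) for j in range(len(theta))]
    return [t - gamma * g / n for t, g in zip(theta, total)]

def train(splits, dim, gamma=0.5, iterations=100):
    theta = [0.0] * dim
    for _ in range(iterations):  # one simulated job per iteration
        partials = [mapper(theta, s) for s in splits]
        theta = reducer(theta, partials, gamma)
    return theta

# Toy data (made up): [bias, feature] -> label, divided into two input splits.
splits = [
    [([1.0, 2.0], 1), ([1.0, -1.5], 0)],
    [([1.0, 0.5], 1), ([1.0, -0.3], 0)],
]
theta = train(splits, dim=2)
print(theta)
```

Because the gradient is a sum over instances, splitting the data changes where the work happens but not the resulting update.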

Spark Implementation

    val points = spark.textFile(...).map(parsePoint).persist()
    var w = // random initial vector
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      }.reduce((a, b) => a + b)
      w -= gradient
    }

[Diagram: four mappers feeding one reducer]
mappers: compute partial gradient
reducer: update model

What’s the difference?


Gradient Descent

Source: Wikipedia (Hills)


Stochastic Gradient Descent

Source: Wikipedia (Water Slide)

Batch vs. Online

Gradient Descent
“batch” learning: update model after considering all training instances
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$

Stochastic Gradient Descent (SGD)
“online” learning: update model after considering each (randomly-selected) training instance
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$

In practice… just as good!

Opportunity to interleave prediction and learning!
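
Interleaving prediction and learning can be sketched as a loop that scores each incoming instance with the current model before using its label for a single SGD update. The logistic model, the constant step size, and the synthetic stream below are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_learn(stream, dim, gamma=0.5):
    """Predict on each instance first, then use its label for one SGD update."""
    theta = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        if (p >= 0.5) != (y == 1):   # prediction made before seeing the label
            mistakes += 1
        err = p - y                  # gradient of the log loss w.r.t. the score
        theta = [t - gamma * err * xj for t, xj in zip(theta, x)]
    return theta, mistakes

# Synthetic stream (made up): label is 1 exactly when the feature is positive.
rng = random.Random(0)
stream = []
for _ in range(1000):
    u = rng.uniform(-1.0, 1.0)
    stream.append(([1.0, u], 1 if u > 0 else 0))

theta, mistakes = online_learn(stream, dim=2)
print(mistakes, "mistakes on", len(stream), "instances")
```

The running mistake count doubles as an evaluation on data the model had not yet seen at prediction time.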

Practical Notes

•  Order of the instances important!
•  Most common implementation:
   - Randomly shuffle training instances
   - Stream instances through learner
•  Single vs. multi-pass approaches
•  “Mini-batching” as a middle ground between batch and stochastic gradient descent

We’ve solved the iteration problem!

What about the single reducer problem?
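
The shuffle-then-stream recipe and mini-batching from the notes above can be sketched with one small generator. `minibatches` is a hypothetical helper name, and the fixed seed is an assumption made only to keep the example reproducible.

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle the training instances, then yield consecutive mini-batches."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# batch_size=1 corresponds to SGD; batch_size=len(data) to batch gradient descent.
for batch in minibatches(range(10), batch_size=4):
    print(batch)
```

A multi-pass learner would call the generator once per pass, ideally with a different seed each time.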


Source: Wikipedia (Orchestra)

Ensembles

Ensemble Learning

•  Learn multiple models, combine results from different models to make prediction
•  Why does it work?
   - If errors uncorrelated, multiple classifiers being wrong is less likely
   - Reduces the variance component of error
•  A variety of different techniques:
   - Majority voting
   - Simple weighted voting: $y = \arg\max_{y \in Y} \sum_{k=1}^{n} \alpha_k p_k(y|x)$
   - Model averaging
   - …
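
Simple weighted voting can be written directly from the formula. In this sketch each classifier's output is assumed to be a dictionary from label to probability, and the example distributions and weights are made up.

```python
def weighted_vote(distributions, weights):
    """Return the label maximizing sum_k alpha_k * p_k(label | x).

    distributions: one dict per classifier, mapping label -> p_k(label | x)
    weights: the corresponding alpha_k values
    """
    scores = {}
    for alpha, dist in zip(weights, distributions):
        for label, p in dist.items():
            scores[label] = scores.get(label, 0.0) + alpha * p
    return max(scores, key=scores.get)

# Three hypothetical classifiers scoring the same instance.
outputs = [
    {"pos": 0.9, "neg": 0.1},
    {"pos": 0.4, "neg": 0.6},
    {"pos": 0.45, "neg": 0.55},
]
print(weighted_vote(outputs, [1.0, 1.0, 1.0]))  # confident first classifier wins
print(weighted_vote(outputs, [0.2, 1.0, 1.0]))  # down-weighting it flips the result
```

With equal weights this differs from majority voting: two of the three classifiers lean negative, yet the summed probabilities favor the positive label.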

Practical Notes

•  Common implementation:
   - Train classifiers on different input partitions of the data
   - Embarrassingly parallel!
•  Contrast with other ensemble techniques, e.g., boosting
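
Training on partitions can be sketched as follows. The learner is a single-pass SGD logistic regression, the ensemble is combined by majority vote, and the data is synthetic; all of these are assumptions for illustration. The calls to `train_sgd` share no state, which is what makes the scheme embarrassingly parallel, although this sketch runs them sequentially.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(partition, dim, gamma=0.5):
    """One pass of SGD over a single partition, independent of all others."""
    theta = [0.0] * dim
    for x, y in partition:
        err = sigmoid(sum(t * xj for t, xj in zip(theta, x))) - y
        theta = [t - gamma * err * xj for t, xj in zip(theta, x)]
    return theta

def train_ensemble(data, num_partitions, dim, seed=0):
    """Shuffle, deal the instances into partitions, train one model per partition."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    partitions = [shuffled[k::num_partitions] for k in range(num_partitions)]
    return [train_sgd(part, dim) for part in partitions]

def predict(models, x):
    """Majority vote over the ensemble members."""
    votes = sum(1 for theta in models
                if sigmoid(sum(t * xj for t, xj in zip(theta, x))) >= 0.5)
    return 1 if 2 * votes > len(models) else 0

# Synthetic data (made up): label is 1 exactly when the feature is positive.
rng = random.Random(1)
data = []
for _ in range(600):
    u = rng.uniform(-1.0, 1.0)
    data.append(([1.0, u], 1 if u > 0 else 0))

models = train_ensemble(data, num_partitions=3, dim=2)
print(predict(models, [1.0, 1.0]), predict(models, [1.0, -1.0]))
```

Boosting, by contrast, trains each model on the errors of the previous ones, so its rounds cannot be run independently in this way.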

MapReduce Implementation

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$

[Diagram: four blocks of training data, each feeding one mapper; each mapper runs a learner]

MapReduce Implementation

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$

[Diagram: four blocks of training data, each feeding one mapper; the mappers feed two reducers; each reducer runs a learner]

What about Spark?

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$

mapPartitions
f: (Iterator[T]) ⇒ Iterator[U]

[Diagram: RDD[T], learner, RDD[U]]

MapReduce Implementation: Details

•  Two possible implementations:
   - Write model out as “side data”
   - Emit model as intermediate output

Case Study: Link Recommendation

[Diagram: Follow graph, Retweet graph, Other log data, … → Candidate Generation → Candidates → Classification (with Trained Model) → Final Results]

Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.

[Diagram: two Pig dataflows
 Classifier Training: previous Pig dataflow; label, feature vector; map; reduce; Pig storage function; model
 Making Predictions: previous Pig dataflow; feature vector; model UDF; prediction]

Just like any other parallel Pig dataflow

Classifier Training

    training = load 'training.txt' using SVMLightStorage()
        as (target: int, features: map[]);
    store training into 'model/'
        using FeaturesLRClassifierBuilder();

Logistic regression + SGD (L2 regularization)
Pegasos variant (fully SGD or sub-gradient)

Want an ensemble?

    training = foreach training generate
        target, features, RANDOM() as random;
    training = order training by random parallel 5;

Making Predictions

    define Classify ClassifyWithLR('model/');
    data = load 'test.txt' using SVMLightStorage()
        as (target: double, features: map[]);
    data = foreach data generate target,
        Classify(features) as prediction;

Want an ensemble?

    define Classify ClassifyWithEnsemble('model/',
        'classifier.LR', 'vote');

Sentiment Analysis Case Study

•  Binary polarity classification: {positive, negative} sentiment
   - Independently interesting task
   - Illustrates end-to-end flow
   - Use the “emoticon trick” to gather data
•  Data
   - Test: 500k positive/500k negative tweets from 9/1/2011
   - Training: {1m, 10m, 100m} instances from before (50/50 split)
•  Features: Sliding window byte-4grams
•  Models:
   - Logistic regression with SGD (L2 regularization)
   - Ensembles of various sizes (simple weighted voting)

Lin and Kolcz, SIGMOD 2012
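
The two data-preparation ideas on this slide can be sketched briefly. The emoticon lists and function names below are hypothetical; the slide gives neither the exact emoticons used nor how the tweets were cleaned, so treat this as one plausible reading rather than the study's procedure.

```python
from collections import Counter

POSITIVE = (":)", ":-)", ":D")   # assumed emoticon lists, for illustration only
NEGATIVE = (":(", ":-(")

def emoticon_label(tweet):
    """Use emoticons as noisy labels; return (label, text without emoticons) or None."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:       # no emoticon, or conflicting signals: skip
        return None
    text = tweet
    for e in POSITIVE + NEGATIVE:
        text = text.replace(e, "")
    return (1 if has_pos else 0), text.strip()

def byte_4grams(text):
    """Slide a window of four bytes over the UTF-8 encoding of the text."""
    data = text.encode("utf-8")
    return [data[i:i + 4] for i in range(len(data) - 3)]

def features(text):
    """Sparse feature vector: counts of each byte 4-gram."""
    return Counter(byte_4grams(text))

print(emoticon_label("great day :)"))
print(byte_4grams("hello"))
```

Byte n-grams need no tokenizer or language-specific preprocessing, which is convenient for short, noisy, multilingual text.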

[Figure: Accuracy (y-axis, 0.75 to 0.82) vs. Number of Classifiers in Ensemble (x-axis), for 1m, 10m, and 100m instances; groups: single classifier, 10m ensembles, 100m ensembles]

Ensembles with 10m examples better than 100m single classifier!

“for free”

Diminishing returns…

Supervised Machine Learning

[Diagram
 training: training data, Machine Learning Algorithm, Model
 testing/deployment: ?, Model]

Applied ML in Academia

•  Download interesting dataset (comes with the problem)
•  Run baseline model
   - Train/test
•  Build better model
   - Train/test
•  Does new model beat baseline?
   - Yes: publish a paper!
   - No: try again!

Three Commandments of Machine Learning

Thou shalt not mix training and testing data
Thou shalt not mix training and testing data
Thou shalt not mix training and testing data


Training/Testing Splits

Training

Test

What happens if you need more? Cross-Validation
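
A minimal sketch of how k-fold cross-validation splits the data follows. `k_fold_splits` is a hypothetical helper; it assumes the instances have already been shuffled, and it deals indices round-robin into folds so that every instance is tested exactly once and never appears in the training set of its own fold.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]          # every k-th instance, offset by the fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Averaging the test metric over the k folds gives an estimate that uses all of the data for testing without ever mixing training and testing instances within a fold.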



Fantasy
    Extract features
    Develop cool ML technique
    #Profit

Reality
    What’s the task?
    Where’s the data?
    What’s in this dataset?
    What’s all the f#$!* crap?
    Clean the data
    Extract features
    “Do” machine learning
    Fail, iterate…

It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil, “Data Jujitsu”

Source: Wikipedia (Jujitsu)


On finding things…

Naming conventions:
    CamelCase
    smallCamelCase
    snake_case
    camel_Snake
    dunder__snake

Variants of the same field:
    uid, UserId, userId, userid, user_id, user_Id
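
One modest way to cope with such variants when searching schemas is to compare names after collapsing case and separators. This is only a sketch: `normalize` is a hypothetical helper, not something described in the slides, and abbreviations such as uid still need a manually maintained alias table.

```python
import re

def normalize(name):
    """Collapse case and separators so naming variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

variants = ["UserId", "userId", "userid", "user_id", "user_Id"]
print({normalize(v) for v in variants})   # a single normalized form
print(normalize("uid"))                   # abbreviations are not resolved
```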


^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$

An actual Java regular expression used to parse log messages at Twitter circa 2010

On feature extraction…

Friction is cumulative!

Data Plumbing… Gone Wrong!
[scene: consumer internet company in the Bay Area…]

Frontend Engineer: Develops new feature, adds logging code to capture clicks
Data Scientist: Analyzes user behavior, extracts insights to improve feature

Okay, let’s get going… where’s the click data?

Well, that’s kinda non-intuitive, but okay…

Oh, BTW, where’s the timestamp of the click?

It’s over here…

Well, it wouldn’t fit, so we had to shoehorn…

Hang on, I don’t remember…

Uh, bad news. Looks like we forgot to log it…

[grumble, grumble, grumble]

Fantasy
    Extract features
    Develop cool ML technique
    #Profit

Reality
    What’s the task?
    Where’s the data?
    What’s in this dataset?
    What’s all the f#$!* crap?
    Clean the data
    Extract features
    “Do” machine learning
    Fail, iterate…
    Finally works!


Source: Wikipedia (Hills)

Congratulations, you’re halfway there…


Does it actually work?

Congratulations, you’re halfway there…

Is it fast enough?

Good, you’re two thirds there…

A/B testing


Source: Wikipedia (Oil refinery)

Productionize

Productionize

What are your jobs’ dependencies?
How/when are your jobs scheduled?
Are there enough resources?
How do you know if it’s working?
Who do you call if it stops working?

Infrastructure is critical here! (plumbing)

Source: Wikipedia (Plumbing)

Takeaway lesson: Plumbing matters!


Source: Wikipedia (Japanese rock garden)

Questions?