
Scalable Machine Learning

Lecture 1: Introduction

Class Logistics

- Who: Maria Chikina, David Koes, Akash Parvatikar

- When: 1:30–3:00 PM, W/F

- Office hours: Maria Chikina, F 3:00 PM; David Koes, TBA

Class materials and resources

- Piazza: piazza.com/pitt/fall2019/mscbiocmpbio2065

- Website: https://mscbio2065.wordpress.com

- Reading:
  - There is no formal textbook. Generally useful resources are linked on the website.
  - The class schedule and resources pertaining to specific lectures will be posted on the website.
  - Lectures will be linked on the website so you can follow along.

Grading and requirements

- Assignments: 5 in total, 60%

- 3 late days in total (for the whole semester)

Assignments take time; start early!

- Final Course Project: 35%

Project structure: read a methodological machine learning paper and present it to the class (15%); either implement the methods or apply existing code to a new dataset, and write a final report (20%).

- Class Participation: 5%

Final Project Requirements and Timeline

Paper suggestions will be posted, but you can select a paper that is not on the list.

Paper selection is due by October 18th (there will be a reminder).

- Presentation of papers:
  - 20–30 minutes. November 13, 15, 20, 22.
  - 2–3 students per class, depending on class size.
  - The expectation is that you read the paper in depth and explain it to the rest of the class, paying careful attention to the implementation.

- Final report: significant implementation and/or analysis effort.

- Final project presentations: 10 minutes per project. December 11 and 13.

Computing

- You will develop all your code on your personal computer:
  - Install Apache Spark, Python, and their dependencies.
  - An Anaconda Python installation is preferred.

- For large-scale deployment you will use DCB computational resources and/or cloud computing. Details will come along with specific assignments.

Course objectives

- Design and develop machine learning algorithms to analyze large datasets.

- Understand distributed computing environments for analyzing large datasets.

- Collaborate with domain experts in interdisciplinary areas such as biomedical and health sciences.

Types of machine learning

- Supervised:
  - Predict some output/target given features
  - Classification
  - Regression

- Unsupervised:
  - Create internal representations of the data
  - Generative models
  - Dimensionality reduction

- Reinforcement:
  - Learn actions to maximize payoff
  - Very relevant for robotics and decision making in general

We will mostly focus on the first two.

An example of scalable ML

Ad Click Prediction: a View from the Trenches – a paper about Google's ad click predictions.

The problem:

- Google makes money by serving ads according to your search query.

- When a user does a search q, an initial set of candidate ads is matched to the query q based on advertiser-chosen keywords.

- An auction mechanism then determines whether these ads are shown to the user, what order they are shown in, and what prices the advertisers pay if their ad is clicked.

- An important input to the auction is, for each ad a, an estimate of P(click | q, a), the probability that the ad will be clicked if it is shown.

The features

- Featurized representations of the query and ad text (for example, keywords), as well as text metadata.

- The set of features is very large: billions.

- The data is extremely sparse: only a tiny fraction of features are non-zero for any one example.

Constraints and requirements

- It is necessary to make predictions many billions of times per day. The model itself should be compact even when the data is not.

- We need to quickly update the model as new clicks and non-clicks are observed.

- The data is enormous and is provided as a stream: you don't have access to the i-th example!

Writing down a formal problem

For probabilities we use logistic regression: our features affect the probability through a logistic function

σ(a) = 1/(1 + exp(−a))  (1)

and we predict

p = σ(w · x) = σ(w_1 x_1 + w_2 x_2 + w_3 x_3 + …)  (2)

Here w is a weight vector and x is a vector of feature values.

The prediction is simply a weighted sum of the features passed through the logistic function.
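As a concrete illustration, here is a minimal Python sketch of equations (1) and (2), assuming a sparse representation in which only an example's non-zero features are stored (the dict-based layout and function names are illustrative, not from the paper):

```python
import math

def sigmoid(a):
    # the logistic function of equation (1)
    return 1.0 / (1.0 + math.exp(-a))

def predict(w, x):
    # w: dict mapping feature index -> weight
    # x: dict mapping feature index -> value; only non-zero features are stored,
    # so the dot product w . x touches only the few active features
    return sigmoid(sum(w.get(i, 0.0) * v for i, v in x.items()))
```

With billions of possible features but only a handful active per example, each prediction costs time proportional to the example's sparsity, not to the size of w.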

Solving the problem

Given an event y (click or no click), we observe a predicted probability p ∈ (0, 1) and compute a logistic loss.

A loss is a penalty for incurring an error. It can take different forms but should be minimal when we are exactly right.

The logistic loss for a single example depends on the weights w and is:

ℓ_y(w) = −y log p − (1 − y) log(1 − p)  (3)

For the data overall we have:

ℓ(w) = ∑_y [−y log p − (1 − y) log(1 − p)]  (4)

Note the index of y is omitted for clarity.

Why logistic loss has this specific form will be discussed in a few lectures.
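A minimal sketch of equation (3) in Python; clipping p away from 0 and 1 is an added numerical safeguard, not part of the formula:

```python
import math

def logistic_loss(p, y, eps=1e-12):
    # penalty for predicting probability p when the observed event y is 0 or 1;
    # clip p so log() never sees exactly 0 or 1
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)
```

For example, logistic_loss(0.9, 1) ≈ 0.105 while logistic_loss(0.1, 1) ≈ 2.303: the loss grows without bound as the prediction approaches the wrong certainty.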

Solving the problem

Our goal is to minimize the error:

ℓ(w) = ∑_y [−y log p − (1 − y) log(1 − p)]  (5)

The most common way of doing this is by using gradient descent:

Keep moving in the direction of decreasing error!

Solving the problem

Taking derivatives of the above expression, we get:

∇ℓ(w) = ∑_y (σ(w · x) − y) x = ∑_y (p − y) x  (6)

So at each step we update our weight vector as:

w_{t+1} = w_t − η_t ∇ℓ(w_t) = w_t − η_t g_t  (7)

where g_t is the gradient at step t. What is η_t? It is called the learning rate, and how to set it correctly is an important topic in ML.
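Putting equations (6) and (7) together, a full-batch gradient descent loop might look like the following sketch (a constant learning rate is assumed for simplicity):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient_descent(examples, n_features, eta=0.1, steps=100):
    # examples: list of (x, y) pairs, with x a dict of non-zero feature values
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in examples:
            p = sigmoid(sum(w[i] * v for i, v in x.items()))
            for i, v in x.items():
                grad[i] += (p - y) * v                   # equation (6): sum of (p - y) x
        w = [wi - eta * gi for wi, gi in zip(w, grad)]   # equation (7)
    return w
```

Note that every step touches every example; the streaming constraint on the next slide is exactly why this full-batch form is not usable for ad click prediction.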

Implementation

- We only have access to a stream of examples. We cannot just look up y_i and x_i for any i, but the loss is with respect to all y's!

- The answer: minimize the loss one example at a time!

- This is called stochastic gradient descent (SGD). It is stochastic because the order of examples is random (either because of constraints or by design).

Some version of SGD is the main workhorse of many modern ML methods.
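A sketch of the streaming version: each example is seen exactly once, in arrival order, and only the weights of its active features are touched (the 1/√t schedule anticipates equation (10) below):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sgd(stream, eta0=0.1):
    # stream: iterable of (x, y) pairs, each seen once as it arrives;
    # x is a dict of non-zero feature values, so w can stay a sparse dict too
    w = {}
    for t, (x, y) in enumerate(stream, start=1):
        p = sigmoid(sum(w.get(i, 0.0) * v for i, v in x.items()))
        eta = eta0 / math.sqrt(t)
        for i, v in x.items():
            w[i] = w.get(i, 0.0) - eta * (p - y) * v  # gradient of this one example
    return w
```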

Sparsity

- We may desire that in our final model many of the weights w_i are 0. Why?

- Interpretability of the model.

- For ad click prediction this is also practical: w has entries on the order of billions, and we need a smaller model to evaluate quickly.

Sparsity through L1 norm regularization

- The L1 norm of a vector x: |x|_1 = ∑_i |x_i|, i.e., the sum of the absolute values of the entries of x.

- We can draw the surface on which the L1 norm of a d-dimensional vector is constant.

[Figure: the L1-norm ball]

- Let's write a new loss function:

ℓ(w) = logistic loss + λ|w|_1  (8)

So we want both low error and a small w in the L1 sense.
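The slides do not fix an optimizer for equation (8). One standard way to get exact zeros is a proximal (soft-thresholding) step after each gradient step; the sketch below assumes grad is the logistic-loss gradient at the current weights:

```python
def soft_threshold(wi, tau):
    # proximal operator of tau * |.|: shrink toward zero, clipping to exactly 0
    if wi > tau:
        return wi - tau
    if wi < -tau:
        return wi + tau
    return 0.0

def proximal_step(w, grad, eta, lam):
    # one step on equation (8): gradient step on the logistic loss,
    # then soft-threshold to account for the lambda * |w|_1 penalty
    return [soft_threshold(wi - eta * gi, eta * lam) for wi, gi in zip(w, grad)]
```

The thresholding is what produces exact zeros: any coordinate whose gradient step lands within η·λ of zero is set to exactly 0.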

Sparsity through L1 norm regularization

The minimum solution of this new loss will have many zeros.

Why? Here is an intuitive explanation.

[Figure: the L1-norm ball drawn against the contours of the loss; the ball's corners lie on the coordinate axes, so the penalized minimum tends to land where many coordinates are exactly zero.]

Sparsity using SGD

- When using SGD we update with respect to a single y_t at a time:

w_{t+1} = w_t − η_t g_t  (9)

- Even if our current w_t is sparse, the update will likely add some additional non-zero values, reducing the sparsity.

- The problem is that sparsity is not coupled across examples.

- The Google team used a modified version of SGD to avoid this problem: the Follow The (Proximally) Regularized Leader algorithm, or FTRL-Proximal. A sketch of its per-coordinate update follows.
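A minimal per-coordinate sketch in the spirit of the paper's FTRL-Proximal update (the hyperparameters alpha, beta, lam1, lam2 follow the paper's notation, but this is an illustrative reconstruction, not the production implementation):

```python
import math

def ftrl_weight(z_i, n_i, alpha, beta, lam1, lam2):
    # closed-form per-coordinate weight: exactly 0 whenever |z_i| <= lam1,
    # which is how the algorithm keeps the model sparse across examples
    if abs(z_i) <= lam1:
        return 0.0
    sign = 1.0 if z_i > 0 else -1.0
    return -(z_i - sign * lam1) / ((beta + math.sqrt(n_i)) / alpha + lam2)

def ftrl_update(z, n, w, grad, alpha):
    # z, n: per-coordinate accumulator dicts; grad holds this example's
    # gradient, non-zero only for the features present in the example
    for i, g in grad.items():
        n_i = n.get(i, 0.0)
        sigma = (math.sqrt(n_i + g * g) - math.sqrt(n_i)) / alpha
        z[i] = z.get(i, 0.0) + g - sigma * w.get(i, 0.0)
        n[i] = n_i + g * g
```

Unlike the raw SGD update, the zero-or-not decision is made from the accumulated z_i rather than from one example at a time, so sparsity is coupled across examples.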

Further implementation issues

- Per-Coordinate Learning Rates

- Probabilistic Feature Inclusion

- Encoding Values with Fewer Bits (probably not relevant for anyone here)

- Subsampling Training Data

Per-Coordinate Learning Rates

- Learning rate optimization is an active area of ML research.

- The standard learning rate schedule is

η_t = 1/√t  (10)

Learning starts out fast but slows as the algorithm progresses.

- There is no reason why the learning rate should be the same for all features.

- Some features are seen much more often than others. For rare features we are less confident about their contribution, and the learning rate should stay high so we can react quickly to new data.

Proposed schedule

η_{t,i} = α / (β + √(∑_{s=1}^{t} g²_{s,i}))  (11)

The learning rate for coordinate i is inversely proportional to the size of the gradient steps taken thus far.

“The results showed that using a per-coordinate learning rates reduced AucLoss by 11.2% compared to the global-learning-rate baseline. To put this result in context, in our setting AucLoss reductions of 1% are considered large.”
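Implementing equation (11) needs only one extra running sum per coordinate. A minimal sketch (the accumulator dict n is illustrative):

```python
import math

def per_coordinate_step(w, n, grad, alpha, beta):
    # one SGD step using the learning rate of equation (11)
    for i, g in grad.items():
        n[i] = n.get(i, 0.0) + g * g          # running sum of squared gradients
        eta_i = alpha / (beta + math.sqrt(n[i]))
        w[i] = w.get(i, 0.0) - eta_i * g
```

A rarely seen coordinate keeps a small accumulator and hence a large learning rate, while frequently seen coordinates settle down.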

Probabilistic Feature Inclusion

- Some features are very rare.

- It is expensive to track statistics for such rare features, which can never be of any real use.

- But the data is a stream, so we do not know in advance which features will be rare.

- A feature is included probabilistically, based on how many times it has been seen.

- Counting features exactly is too expensive. The Google team uses Bloom filters to reduce the space requirements.

- Bloom filters are a probabilistic data structure that can over-count but not under-count. (Covered in a later lecture.) A small sketch of the idea follows.
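A sketch with a small counting Bloom filter; the size, number of hashes, and the inclusion threshold are illustrative choices, not the paper's settings:

```python
import hashlib

class CountingBloom:
    # counts can collide upward (over-count) but never under-count
    def __init__(self, size=1_000_000, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.counts = [0] * size

    def _slots(self, key):
        for k in range(self.num_hashes):
            digest = hashlib.sha1(f"{k}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for s in self._slots(key):
            self.counts[s] += 1

    def estimate(self, key):
        # the minimum over slots is an upper bound on the true count
        return min(self.counts[s] for s in self._slots(key))

def seen_enough(bloom, feature, threshold):
    # record the sighting, then include the feature once it has
    # (apparently) appeared at least `threshold` times
    bloom.add(feature)
    return bloom.estimate(feature) >= threshold
```

Because the filter can only over-count, the worst a collision can do is admit a feature slightly early, which is harmless here.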

Subsampling Training Data

This is one of the simplest ways to increase scalability.

For ad click prediction:

- Clicks are rare and thus more valuable for prediction.

- The data is subsampled to include:
  - any query for which at least one of the ads was clicked;
  - a fraction r ∈ (0, 1] of the queries where none of the ads were clicked.

- In this new dataset the probability of clicking is much higher, which will bias the results.

- This is fixed by simply weighting the examples as:

ω_t = 1 if event t is in a clicked query; ω_t = 1/r if event t is in a query with no clicks.  (12)
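A sketch of the subsampling together with the weights of equation (12); the weight ω_t simply multiplies that example's loss (and hence its gradient) during training:

```python
import random

def subsample(query_stream, r):
    # query_stream yields (events, clicked): the training events of one query
    # and whether at least one ad in that query was clicked
    for events, clicked in query_stream:
        if clicked:
            yield events, 1.0          # keep every clicked query, weight 1
        elif random.random() < r:
            yield events, 1.0 / r      # keep a fraction r, re-weight by 1/r
```

In expectation the weighted subsample reproduces the objective on the full stream, so the bias from the inflated click rate is removed.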

The importance of understanding your model

- The model needs to perform well across queries and geographical regions.

- We need to understand how adding features changes the model.

- Google has built its own evaluation system.

Many more insights in the paper

- How to compactly train multiple models at the same time

- Confidence estimates

- Things that didn't work!

How to make the best machine learning model

- You have to be clever.

- You have to keep up with the literature.

- You have to work hard. Details matter, and most can only be figured out by trial and error.

- Understand your model. Make lots of plots. What is the model getting wrong? This is critical for real-world results.

- Use the simplest algorithm that can do the job. Don't use deep neural nets if you don't have to.