
Scalable Machine Learning

Lecture 1: Introduction

Class Logistics

- Who: Maria Chikina, David Koes, Akash Parvatikar

- When: 1:30–3:00 PM, W/F

- Office hours: Maria Chikina, F 3:00 PM; David Koes, TBA

Class materials and resources

- Piazza: piazza.com/pitt/fall2019/mscbiocmpbio2065

- Website: https://mscbio2065.wordpress.com

- Reading:
  - There is no formal textbook. Generally useful resources are linked on the website.
  - The class schedule and resources pertaining to specific lectures will be posted on the website.
  - Lectures will be linked on the website so you can follow along.

Grading and requirements

- Assignments: 5 in total, 60%

- 3 late days in total (for the whole semester)

Assignments take time; start early!

- Final Course Project: 35%

Project structure: read a methodological machine learning paper and present it to the class (15%); either implement the methods or apply existing code to a new dataset, and write a final report (20%).

- Class Participation: 5%

Final Project Requirements and Timeline

Paper suggestions will be posted, but you can select a paper that is not on the list.

Paper selection is due by October 18th (there will be a reminder).

- Presentation of papers:
  - 20–30 minutes. November 13, 15, 20, 22.
  - 2–3 students per class, depending on class size.
  - The expectation is that you read the paper in depth and explain it to the rest of the class, paying careful attention to the implementation.

- Final report: significant implementation and/or analysis effort.

- Final project presentations: 10 minutes per project. December 11 and 13.

Computing

- You will develop all your code on your personal computer:
  - Install Apache Spark, Python, and their dependencies.
  - An Anaconda Python installation is preferred.

- For large-scale deployment you will use DCB computational resources and/or cloud computing. Details will come along with specific assignments.

Course objectives

- Design and develop machine learning algorithms to analyze large datasets.

- Understand distributed computing environments for analyzing large datasets.

- Collaborate with domain experts in interdisciplinary areas such as biomedical and health sciences.

Types of machine learning

- Supervised:
  - Predict some output/target given features
  - Classification
  - Regression

- Unsupervised:
  - Create internal representations of the data
  - Generative models
  - Dimensionality reduction

- Reinforcement:
  - Learn actions to maximize payoff
  - Very relevant for robotics and decision making in general

We will mostly focus on the first two.

An example of scalable ML

Ad Click Prediction: a View from the Trenches – a paper about Google's ad click predictions.

The problem:

- Google makes money by serving ads according to your search query.

- When a user does a search q, an initial set of candidate ads is matched to the query q based on advertiser-chosen keywords.

- An auction mechanism then determines whether these ads are shown to the user, what order they are shown in, and what prices the advertisers pay if their ad is clicked.

- An important input to the auction is, for each ad a, an estimate of P(click | q, a), the probability that the ad will be clicked if it is shown.

The features

- Featurized representations of the query and ad text (for example, keywords), as well as text metadata.

- The set of features is very large: billions.

- The data is extremely sparse: only a tiny fraction of features are non-zero for any one example.

Constraints and requirements

- It is necessary to make predictions many billions of times per day. The model itself should be compact even when the data is not.

- We need to quickly update the model as new clicks and non-clicks are observed.

- The data is enormous and is provided as a stream: you don't have access to the i-th example!

Writing down a formal problem

For probabilities we use logistic regression: our features affect the probability through a logistic function

σ(a) = 1/(1 + exp(−a))  (1)

and we predict

p = σ(w · x) = σ(w_1 x_1 + w_2 x_2 + w_3 x_3 + …)  (2)

Here w is a weight vector and x is a vector of feature values.

The prediction is simply a weighted sum of the features passed through the logistic function.
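As a concrete illustration, here is a minimal Python sketch of equations (1) and (2), assuming a sparse representation in which only an example's non-zero features are stored (the dict-based layout and function names are illustrative, not from the paper):

```python
import math

def sigmoid(a):
    # the logistic function of equation (1)
    return 1.0 / (1.0 + math.exp(-a))

def predict(w, x):
    # w: dict mapping feature index -> weight
    # x: dict mapping feature index -> value; only non-zero features are stored,
    # so the dot product w . x touches only the few active features
    return sigmoid(sum(w.get(i, 0.0) * v for i, v in x.items()))
```

With billions of possible features but only a handful active per example, each prediction costs time proportional to the example's sparsity, not to the size of w.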

Solving the problem

Given an event y (click or no click), we observe a predicted probability p ∈ (0, 1) and compute a logistic loss.

A loss is a penalty for incurring an error. It can take different forms but should be minimal when we are exactly right.

The logistic loss for a single example depends on the weights w and is:

ℓ_y(w) = −y log p − (1 − y) log(1 − p)  (3)

For the data overall we have:

ℓ(w) = ∑_y [−y log p − (1 − y) log(1 − p)]  (4)

Note the index of y is omitted for clarity.

Why logistic loss has this specific form will be discussed in a few lectures.
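A minimal sketch of equation (3) in Python; clipping p away from 0 and 1 is an added numerical safeguard, not part of the formula:

```python
import math

def logistic_loss(p, y, eps=1e-12):
    # penalty for predicting probability p when the observed event y is 0 or 1;
    # clip p so log() never sees exactly 0 or 1
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)
```

For example, logistic_loss(0.9, 1) ≈ 0.105 while logistic_loss(0.1, 1) ≈ 2.303: the loss grows without bound as the prediction approaches the wrong certainty.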

Solving the problem

Our goal is to minimize the error:

ℓ(w) = ∑_y [−y log p − (1 − y) log(1 − p)]  (5)

The most common way of doing this is by using gradient descent:

Keep moving in the direction of decreasing error!

Solving the problem

Taking derivatives of the above expression, we get:

∇ℓ(w) = ∑_y (σ(w · x) − y) x = ∑_y (p − y) x  (6)

So at each step we update our weight vector as:

w_{t+1} = w_t − η_t ∇ℓ(w_t) = w_t − η_t g_t  (7)

where g_t is the gradient at step t. What is η_t? It is called the learning rate, and how to set it correctly is an important topic in ML.
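Putting equations (6) and (7) together, a full-batch gradient descent loop might look like the following sketch (a constant learning rate is assumed for simplicity):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient_descent(examples, n_features, eta=0.1, steps=100):
    # examples: list of (x, y) pairs, with x a dict of non-zero feature values
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in examples:
            p = sigmoid(sum(w[i] * v for i, v in x.items()))
            for i, v in x.items():
                grad[i] += (p - y) * v                   # equation (6): sum of (p - y) x
        w = [wi - eta * gi for wi, gi in zip(w, grad)]   # equation (7)
    return w
```

Note that every step touches every example; the streaming constraint on the next slide is exactly why this full-batch form is not usable for ad click prediction.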

Implementation

- We only have access to a stream of examples. We cannot just look up y_i and x_i for any i, but the loss is with respect to all y's!

- The answer: minimize the loss one example at a time!

- This is called stochastic gradient descent (SGD). It is stochastic because the order of examples is random (either because of constraints or by design).

Some version of SGD is the main workhorse of many modern ML methods.
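A sketch of the streaming version: each example is seen exactly once, in arrival order, and only the weights of its active features are touched (the 1/√t schedule anticipates equation (10) below):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sgd(stream, eta0=0.1):
    # stream: iterable of (x, y) pairs, each seen once as it arrives;
    # x is a dict of non-zero feature values, so w can stay a sparse dict too
    w = {}
    for t, (x, y) in enumerate(stream, start=1):
        p = sigmoid(sum(w.get(i, 0.0) * v for i, v in x.items()))
        eta = eta0 / math.sqrt(t)
        for i, v in x.items():
            w[i] = w.get(i, 0.0) - eta * (p - y) * v  # gradient of this one example
    return w
```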

Sparsity

- We may desire that in our final model many of the weights w_i are 0. Why?

- Interpretability of the model.

- For ad click prediction this is also practical: w has entries on the order of billions, and we need a smaller model to evaluate quickly.

Sparsity through L1 norm regularization

- The L1 norm of a vector x: |x|_1 = ∑_i |x_i|, i.e., the sum of the absolute values of the entries of x.

- We can draw the surface on which the L1 norm of a d-dimensional vector is constant.

[Figure: the L1-norm ball]

- Let's write a new loss function:

ℓ(w) = logistic loss + λ|w|_1  (8)

So we want both low error and a small w in the L1 sense.
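The slides do not fix an optimizer for equation (8). One standard way to get exact zeros is a proximal (soft-thresholding) step after each gradient step; the sketch below assumes grad is the logistic-loss gradient at the current weights:

```python
def soft_threshold(wi, tau):
    # proximal operator of tau * |.|: shrink toward zero, clipping to exactly 0
    if wi > tau:
        return wi - tau
    if wi < -tau:
        return wi + tau
    return 0.0

def proximal_step(w, grad, eta, lam):
    # one step on equation (8): gradient step on the logistic loss,
    # then soft-threshold to account for the lambda * |w|_1 penalty
    return [soft_threshold(wi - eta * gi, eta * lam) for wi, gi in zip(w, grad)]
```

The thresholding is what produces exact zeros: any coordinate whose gradient step lands within η·λ of zero is set to exactly 0.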

Sparsity through L1 norm regularization

The minimum solution of this new loss will have many zeros.

Why? Here is an intuitive explanation.

[Figure: the L1-norm ball drawn against the contours of the loss; the ball's corners lie on the coordinate axes, so the penalized minimum tends to land where many coordinates are exactly zero.]

Sparsity using SGD

- When using SGD we update with respect to a single y_t at a time:

w_{t+1} = w_t − η_t g_t  (9)

- Even if our current w_t is sparse, the update will likely add some additional non-zero values, reducing the sparsity.

- The problem is that sparsity is not coupled across examples.

- The Google team used a modified version of SGD to avoid this problem: the Follow The (Proximally) Regularized Leader algorithm, or FTRL-Proximal. A sketch of its per-coordinate update follows.
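A minimal per-coordinate sketch in the spirit of the paper's FTRL-Proximal update (the hyperparameters alpha, beta, lam1, lam2 follow the paper's notation, but this is an illustrative reconstruction, not the production implementation):

```python
import math

def ftrl_weight(z_i, n_i, alpha, beta, lam1, lam2):
    # closed-form per-coordinate weight: exactly 0 whenever |z_i| <= lam1,
    # which is how the algorithm keeps the model sparse across examples
    if abs(z_i) <= lam1:
        return 0.0
    sign = 1.0 if z_i > 0 else -1.0
    return -(z_i - sign * lam1) / ((beta + math.sqrt(n_i)) / alpha + lam2)

def ftrl_update(z, n, w, grad, alpha):
    # z, n: per-coordinate accumulator dicts; grad holds this example's
    # gradient, non-zero only for the features present in the example
    for i, g in grad.items():
        n_i = n.get(i, 0.0)
        sigma = (math.sqrt(n_i + g * g) - math.sqrt(n_i)) / alpha
        z[i] = z.get(i, 0.0) + g - sigma * w.get(i, 0.0)
        n[i] = n_i + g * g
```

Unlike the raw SGD update, the zero-or-not decision is made from the accumulated z_i rather than from one example at a time, so sparsity is coupled across examples.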

Further implementation issues

- Per-Coordinate Learning Rates

- Probabilistic Feature Inclusion

- Encoding Values with Fewer Bits (probably not relevant for anyone here)

- Subsampling Training Data

Per-Coordinate Learning Rates

- Learning rate optimization is an active area of ML research.

- The standard learning rate schedule is

η_t = 1/√t  (10)

Learning starts out fast but slows as the algorithm progresses.

- There is no reason why the learning rate should be the same for all features.

- Some features are seen much more often than others. For rare features we are less confident about their contribution, and the learning rate should stay high so we can react quickly to new data.

Proposed schedule

η_{t,i} = α / (β + √(∑_{s=1}^{t} g²_{s,i}))  (11)

The learning rate for coordinate i is inversely proportional to the size of the gradient steps taken thus far.

“The results showed that using a per-coordinate learning rates reduced AucLoss by 11.2% compared to the global-learning-rate baseline. To put this result in context, in our setting AucLoss reductions of 1% are considered large.”
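Implementing equation (11) needs only one extra running sum per coordinate. A minimal sketch (the accumulator dict n is illustrative):

```python
import math

def per_coordinate_step(w, n, grad, alpha, beta):
    # one SGD step using the learning rate of equation (11)
    for i, g in grad.items():
        n[i] = n.get(i, 0.0) + g * g          # running sum of squared gradients
        eta_i = alpha / (beta + math.sqrt(n[i]))
        w[i] = w.get(i, 0.0) - eta_i * g
```

A rarely seen coordinate keeps a small accumulator and hence a large learning rate, while frequently seen coordinates settle down.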

Probabilistic Feature Inclusion

- Some features are very rare.

- It is expensive to track statistics for such rare features, which can never be of any real use.

- But the data is a stream, so we do not know in advance which features will be rare.

- A feature is included probabilistically, based on how many times it has been seen.

- Counting features exactly is too expensive. The Google team uses Bloom filters to reduce the space requirements.

- Bloom filters are a probabilistic data structure that can over-count but not under-count. (Covered in a later lecture.) A small sketch of the idea follows.
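A sketch with a small counting Bloom filter; the size, number of hashes, and the inclusion threshold are illustrative choices, not the paper's settings:

```python
import hashlib

class CountingBloom:
    # counts can collide upward (over-count) but never under-count
    def __init__(self, size=1_000_000, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.counts = [0] * size

    def _slots(self, key):
        for k in range(self.num_hashes):
            digest = hashlib.sha1(f"{k}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for s in self._slots(key):
            self.counts[s] += 1

    def estimate(self, key):
        # the minimum over slots is an upper bound on the true count
        return min(self.counts[s] for s in self._slots(key))

def seen_enough(bloom, feature, threshold):
    # record the sighting, then include the feature once it has
    # (apparently) appeared at least `threshold` times
    bloom.add(feature)
    return bloom.estimate(feature) >= threshold
```

Because the filter can only over-count, the worst a collision can do is admit a feature slightly early, which is harmless here.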

Subsampling Training Data

This is one of the simplest ways to increase scalability.

For ad click prediction:

- Clicks are rare and thus more valuable for prediction.

- The data is subsampled to include:
  - any query for which at least one of the ads was clicked;
  - a fraction r ∈ (0, 1] of the queries where none of the ads were clicked.

- In this new dataset the probability of clicking is much higher, which will bias the results.

- This is fixed by simply weighting the examples as:

ω_t = 1 if event t is in a clicked query; ω_t = 1/r if event t is in a query with no clicks.  (12)
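A sketch of the subsampling together with the weights of equation (12); the weight ω_t simply multiplies that example's loss (and hence its gradient) during training:

```python
import random

def subsample(query_stream, r):
    # query_stream yields (events, clicked): the training events of one query
    # and whether at least one ad in that query was clicked
    for events, clicked in query_stream:
        if clicked:
            yield events, 1.0          # keep every clicked query, weight 1
        elif random.random() < r:
            yield events, 1.0 / r      # keep a fraction r, re-weight by 1/r
```

In expectation the weighted subsample reproduces the objective on the full stream, so the bias from the inflated click rate is removed.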

The importance of understanding your model

- The model needs to perform well across queries and geographical regions.

- We need to understand how adding features changes the model.

- Google has built its own evaluation system.

Many more insights in the paper

- How to compactly train multiple models at the same time

- Confidence estimates

- Things that didn't work!

How to make the best machine learning model

- You have to be clever.

- You have to keep up with the literature.

- You have to work hard. Details matter, and most can only be figured out by trial and error.

- Understand your model. Make lots of plots. What is the model getting wrong? This is critical for real-world results.

- Use the simplest algorithm that can do the job. Don't use deep neural nets if you don't have to.