Behavioral Data Mining Lecture 6 Optimizers

CS294-1 Behavioral Data Mining (jfc/DataMining/SP13/lecs/lec06.pdf)



Behavioral Data Mining

Lecture 6

Optimizers

Outline

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means

Huasha Zhao. Graduate student, UC Berkeley. BS with honors, Tsinghua University. Research in algorithm/system design and numerical methods for emerging large-scale data mining problems. Past work experience with Goldman Sachs, Cisco Systems and Microsoft Research. Co-founder of the Sponsor for Educational Opportunity (SEO) China Chapter.

Behavioral Data

Datasets are large, resident on disks somewhere (possibly in cluster storage): 10^9 – 10^15 bytes.

Feature spaces are large (words, URLs, movies, followers, etc.): 10^3 – 10^9.

Data comprise many samples (users, documents, web pages, etc.) from an ideal population: 10^6 – 10^9.

Data are sparse: only a few sample × feature combinations are observed.

Inference

• Estimation: Given a joint distribution Pr(X, Y) on observed

data Y and unobserved data X, we want to estimate X given Y.

We may want:

– MAP estimates: the mode of the posterior Pr(X | Y)

– Conditional means: E(X | Y)

In practice it may only be possible to get a local max or mean.

• Model Inference: Since we can't know the true Pr(X, Y), we

choose a family of models M with tractable Pr(X, Y | M) and

then find a “best” model (e.g. minimum loss).

– Most model inference formulations are not closed form, so an iteration

is needed to find the best model.

Mapping to Hardware

• Models: Fit in memory. Replicated in memory if in a cluster.

• Data: Stored on disk, possibly cached in memory. Samples

distributed if in a cluster.

• Inference: Access many samples in some order, or blocks of

samples that can fit in memory.

• Updates: Are often derivatives of loss wrt model parameters,

but may be other functions as well.

Mapping Iterations to Hardware

• Classical: a batch model update over an in-memory dataset (samples × features), with many passes over the data.

• Large Datasets: minibatch model updates. The data are streamed in blocks DATA1, DATA2, DATA3, …, the model M is updated (M → M+) after each block, and only a few (or one) passes are made over the entire dataset.

SGD and MCMC

Stochastic Gradient Descent and Markov-Chain Monte-Carlo

Both involve model updates after processing each sample.

In most cases, there is little penalty to processing small blocks of

samples.

(Diagram: as above, the model M is updated to M+ after each streamed block DATA1, DATA2, DATA3, ….)

Outline

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means

Gradient Descent

Gradient Descent: Let 𝑄(𝑧, 𝑤) be the loss for a sample 𝑧 and

model parameters (weights) 𝑤. Then gradient descent updates

these weights as:

w_{t+1} = w_t − γ (1/n) ∑_{i=1}^{n} ∇_w Q(z_i, w_t)

Where 𝑛 is the number of samples and 𝛾 is a suitably chosen

constant.

For w_0 sufficiently close to the optimum and small enough γ, we get linear convergence, i.e. log(1/ρ) ∼ t, where ρ is the residual error.

Second-Order Gradient Descent

Second-Order Gradient Descent: Instead we update with:

w_{t+1} = w_t − Γ_t (1/n) ∑_{i=1}^{n} ∇_w Q(z_i, w_t)

Where 𝛤𝑡 is a positive definite matrix that approximates the

inverse Hessian of the loss function at the optimum.

For w_0 sufficiently close to the optimum, we get quadratic convergence, i.e. log log(1/ρ) ∼ t, where ρ is the residual error.

This is Newton’s method.

Stochastic Gradient Descent

Stochastic Gradient Descent: Simply pick a random sample

𝑧𝑡 and do:

w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t)

Note that we made 𝛾𝑡 a function of the iteration step 𝑡.

The idea is that many of these steps should average to the

standard gradient step.

Convergence bound for a γ_t ∼ 1/t strategy is 𝔼[ρ] ∼ 1/t.
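To make the update rule concrete, here is a minimal Python sketch of the SGD loop with the γ_t ∼ 1/t schedule (an illustration, not code from the lecture; grad_fn, gamma0 and the least-squares example are assumed names):

import numpy as np

def sgd(grad_fn, w0, samples, gamma0=1.0, rng=None):
    # Minimal SGD sketch: w_{t+1} = w_t - gamma_t * grad_w Q(z_t, w_t),
    # with the gamma_t ~ gamma0/t rate schedule described above.
    # grad_fn(z, w) is assumed to return the loss gradient for one sample z.
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)
    for t in range(1, len(samples) + 1):
        z = samples[rng.integers(len(samples))]   # pick a random sample z_t
        w -= (gamma0 / t) * grad_fn(z, w)         # single-sample gradient step
    return w

# Example loss: least squares, Q(z, w) = 0.5 * (x.w - y)^2 for z = (x, y)
def lsq_grad(z, w):
    x, y = z
    return (x @ w - y) * x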

2nd-Order Stochastic Gradient Descent

2nd-Order SGD: Pick a random sample z_t and do:

w_{t+1} = w_t − Γ_t ∇_w Q(z_t, w_t)

Where 𝛤𝑡 is a positive definite matrix that approximates the

inverse Hessian.

Convergence bound for a γ_t ∼ 1/t strategy is 𝔼[ρ] ∼ 1/t, the same as for simple SGD, although the constants are better (by the square of the condition number).

Error Analysis

Let 𝑓 be the (parametric) function that predicts unknown values

from the known data

• E_n(f), the empirical risk, is the average loss on the samples

• E(f), the expected risk, is the expected loss on future samples

Let 𝑓∗ be the true “best” function, i.e. the function that

minimizes expected risk.

Let f_n be the empirical best function after n steps.

The error ℰ = 𝔼[E(f_n) − E(f*)] can be broken down as

ℰ = ℰ_app + ℰ_est + ℰ_opt

Error Analysis

ℰ = ℰ_app + ℰ_est + ℰ_opt

ℰ_app is the approximation error – how well our chosen family of functions can approximate the best function. This is like a bias term.

ℰ_est is the estimation error – how well we can estimate the best parametric function given the data. This is a variance term.

ℰ_opt is the optimization error and measures the effects of the particular optimization strategy used. Not an issue for small-scale learners, but often a limitation at large scale.

Error Analysis

Léon Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent," COMPSTAT 2010.

ρ measures accuracy relative to the final optimum given the data.

But ℰ is the total error relative to the ideal function f*, and under computational bounds SGD is faster.

Expected Risk Convergence

(Figure: linear SVM on the RCV1 dataset of labeled news articles. Léon Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent," COMPSTAT 2010.)

Inverse speed vs. dataset size

(Figure from Shai Shalev-Shwartz and Nathan Srebro, "SVM Optimization: Inverse Dependence on Training Set Size," ICML 2008.)

Pragmatics

(Timeline: the disk/network access time for a block of data dominates; the SGD + model update step is 10x – 100x faster.)

Using Idle Resources

(Timeline: the SGD + model update step is 10x – 100x faster than the disk/network access time for a block of data, leaving an opportunity to do k× more work to improve SGD convergence, k = 10..100.)

Open Problems (i.e. course projects, papers)

• SGD convergence depends strongly on the rate constant γ_t. Run k optimizations in parallel (with different γ_t), periodically compute the cross-validation error, and keep and distribute the best w (see the sketch below).
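A minimal sketch of this idea (illustrative only; grad_fn, val_loss_fn, sync_every and the sequential loop standing in for parallel workers are all assumptions, not from the lecture):

import numpy as np

def tuned_sgd(grad_fn, val_loss_fn, w0, data_stream, rates, sync_every=1000):
    # One SGD instance per candidate rate constant gamma (in a cluster each
    # would run on its own worker); periodically keep and distribute the model
    # with the lowest validation loss.
    workers = {gamma: np.array(w0, dtype=float) for gamma in rates}
    for t, z in enumerate(data_stream, start=1):
        for gamma, w in workers.items():
            w -= (gamma / t) * grad_fn(z, w)          # gamma_t ~ gamma/t schedule
        if t % sync_every == 0:
            best = min(workers, key=lambda g: val_loss_fn(workers[g]))
            best_w = workers[best].copy()
            for gamma in workers:
                workers[gamma] = best_w.copy()        # distribute the best model
    best = min(workers, key=lambda g: val_loss_fn(workers[g]))
    return workers[best]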

Outline

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means

Second-Order Gradient

First-order and second-order gradient descent update rules (figure from http://leon.bottou.org/slides/largescale/lstut.pdf).

Stochastic Gradient

First-order stochastic gradient descent update rule (figure from http://leon.bottou.org/slides/largescale/lstut.pdf).

Second-Order SGD

The update rule is:

w_{t+1} = w_t − Γ_t ∇_w Q(z_t, w_t)

Where 𝛤𝑡 approximates the inverse Hessian.

Convergence is much better (typically) than 1-SGD.

But for d features, Γ_t has d² coefficients – too large and too

slow to deal with.

Simplest approach: use a diagonal approximation to the

inverse Hessian.

In effect we are scaling the feature dimensions – critical for power-law data, where feature frequencies differ by orders of magnitude.

ADAGRAD

Assume first the parameter space is a closed, convex set 𝜒.

Given the standard Euclidean distance d(x, y) = ‖x − y‖, where ‖w‖ = √(w · w), projection onto the convex set χ is:

π_χ(y) = argmin_{x∈χ} d(x, y)

The standard stochastic gradient update (with projection) is:

x_{t+1} = π_χ(x_t − η g_t)

where 𝑔𝑡 is the loss gradient at step t, and 𝜂 is the rate

constant.

ADAGRAD

Define a weighted distance d_A(x, y) = ‖x − y‖_A, where ‖w‖_A = √(w · A w); the weighted projection becomes:

π_χ^A(y) = argmin_{x∈χ} d_A(x, y)

The ADAGRAD update is:

x_{t+1} = π_χ^{G_t^{1/2}} (x_t − η G_t^{−1/2} g_t)

where G_t = ∑_{τ=1}^{t} g_τ g_τ^T is the cumulative sum of outer products of the previous gradients.

Matrix Powers

Aside: how do you get the square root of a matrix G_t? i.e. you want a matrix A such that AA = G_t.

Or G_t^{−1/2}, where you want A s.t. AA = G_t^{−1}?

Matrix Powers

Aside: how do you get the square root of a matrix G_t? i.e. you want a matrix A such that AA = G_t.

G_t is real, symmetric and positive semi-definite (all its eigenvalues are ≥ 0).

So we can write G_t = Q Λ Q^T, where the columns of Q are the (orthonormal) eigenvectors of G_t and Λ is a diagonal matrix of the eigenvalues of G_t. Orthonormality implies Q^T Q = I.

Then the square root of G_t is A = Q Λ^{1/2} Q^T since:

A A = Q Λ^{1/2} Q^T Q Λ^{1/2} Q^T = Q Λ^{1/2} Λ^{1/2} Q^T = Q Λ Q^T = G_t
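A short numpy sketch of this construction (illustrative; the example matrix and the epsilon clipping are assumptions):

import numpy as np

def sym_matrix_power(G, p):
    # G = Q diag(lam) Q^T for a real symmetric PSD matrix, so G^p = Q diag(lam^p) Q^T.
    lam, Q = np.linalg.eigh(G)         # eigenvalues / orthonormal eigenvectors
    lam = np.maximum(lam, 1e-12)       # clip tiny negative values from round-off
    return (Q * lam**p) @ Q.T          # Q diag(lam^p) Q^T

# Example: A = G^{1/2} satisfies A @ A ~= G, and p = -1/2 gives G^{-1/2}
G = np.array([[4.0, 1.0], [1.0, 3.0]])
A = sym_matrix_power(G, 0.5)
print(np.allclose(A @ A, G))           # True (up to round-off)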

ADAGRAD

The full 𝐺𝑡 is usually too expensive to compute and store, but

its diagonal works almost as well. The diagonal update is:

x_{t+1} = π_χ^{diag(G_t)^{1/2}} (x_t − η diag(G_t)^{−1/2} g_t)

where diag(G_t) = ∑_{τ=1}^{t} g_τ ∘ g_τ and ∘ is the element-wise product. diag(G_t) can be computed very efficiently – and so can its inverse, which is just the element-wise reciprocal.

In practice you often don’t have an a-priori convex bound set

𝜒 and you can skip the projection step.
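Here is a minimal sketch of the diagonal update with the projection step skipped (illustrative; grad_fn, eta and eps are assumed names):

import numpy as np

def adagrad(grad_fn, x0, samples, eta=0.1, eps=1e-8):
    # Diagonal ADAGRAD: each gradient coordinate is scaled by
    # 1/sqrt(sum of its past squared gradients).
    x = np.array(x0, dtype=float)
    diag_G = np.zeros_like(x)                   # running diag(G_t) = sum of g o g
    for z in samples:
        g = grad_fn(z, x)
        diag_G += g * g                         # element-wise accumulation
        x -= eta * g / (np.sqrt(diag_G) + eps)  # per-coordinate scaled step
    return x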

ADAGRAD - Intuition

Each gradient coefficient g_{t,i} is scaled by 1/√(∑_{τ=1}^{t} g_{τ,i}²)

Intuition:

• Coefficients are divided by a factor proportional to the

average 𝐿2-norm of that coefficient

• The scale factor is 1/sqrt(𝑡)

So there is a t^{−1/2} schedule for scaling gradient updates, and

gradient coordinates are normalized by 𝐿2-norm.

Performance stated in terms of “regret” – expected loss

compared with an ideal offline algorithm.

Very good in practice – ideal for power law data.

Natural+BFGS updates

The following update is based on “natural gradient”:

x_{t+1} = x_t − η G_t^{−1} g_t

Like full ADAGRAD, it requires O(d²) storage and work (using a Sherman-Morrison update).

If we instead use the true Hessian at time t, H_t, we are doing local Newton updates:

x_{t+1} = x_t − η H_t^{−1} g_t

BFGS (Broyden, Fletcher, Goldfarb and Shanno) is an iterative method to approximate H_t^{−1} efficiently.
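A sketch of the natural-gradient-style update with a Sherman-Morrison rank-1 update of G_t^{−1} (illustrative; grad_fn, eta, eps and the initialization G_0 = eps·I are assumptions):

import numpy as np

def natural_gradient_sgd(grad_fn, x0, data, eta=0.1, eps=1e-3):
    # x_{t+1} = x_t - eta * G_t^{-1} g_t with G_t = eps*I + sum of g g^T.
    # B tracks G_t^{-1} via the Sherman-Morrison rank-1 update, so each step
    # costs O(d^2) instead of an O(d^3) matrix inversion.
    x = np.array(x0, dtype=float)
    d = x.size
    B = np.eye(d) / eps                           # B = G_0^{-1}
    for z in data:
        g = grad_fn(z, x)                         # stochastic gradient
        Bg = B @ g
        B -= np.outer(Bg, Bg) / (1.0 + g @ Bg)    # Sherman-Morrison update of G^{-1}
        x -= eta * (B @ g)                        # preconditioned step
    return x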

L-BFGS

L-BFGS is Limited-memory BFGS.

Idea: use the last k model updates + gradients to approximate the inverse Hessian.

• This can be done in O(kd) time per iteration

• The inverse Hessian is represented with a low-rank approximation which requires O(kd) space

• Uses a diagonal preconditioner (which can be ADAGRAD)

• Matrix-vector multiplies are similarly fast

Recall: this is exactly the right complexity to exploit the extra

cycles available during disk streaming: we should be able to

support k ~ 10-100.
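For reference, a sketch of the standard L-BFGS two-loop recursion, which applies the low-rank inverse-Hessian approximation to a gradient in O(kd) time (illustrative; s_hist and y_hist are assumed to hold the last k parameter updates s_i = x_{i+1} − x_i and gradient differences y_i = g_{i+1} − g_i):

import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    # Two-loop recursion: returns an approximation to H^{-1} g.
    q = np.array(g, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # First loop: newest to oldest
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    alphas.reverse()
    # Scale by an initial diagonal Hessian estimate (a common choice)
    if s_hist:
        s, y = s_hist[-1], y_hist[-1]
        q *= (s @ y) / (y @ y)
    # Second loop: oldest to newest
    for s, y, rho, a in zip(s_hist, y_hist, rhos, alphas):
        b = rho * (y @ q)
        q += (a - b) * s
    return q    # the step is then x_new = x - step_size * q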

oL-BFGS

Online L-BFGS (Schraudolph et al. paper): use stochastic gradients and steps instead of the full gradient.

http://jmlr.csail.mit.edu/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf


Open Problem

Derive a limited-memory version of the full ADAGRAD

update:

x_{t+1} = π_χ^{G_t^{1/2}} (x_t − η G_t^{−1/2} g_t)

It's not difficult to derive a fast O(nm) update for G_t^{1/2}, and to invert it in O(m³).

Extend this to compute G_t^{−1/2} directly in O(mn) time, similar to L-BFGS.

Should be a block version (for faster implementation), not a

recurrence.

Open Problem

Compare:

• ADAGRAD vs. Natural gradient diagonal scaling

• 1st- vs. 2nd-order ADAGRAD, Newton, and natural gradient scaling.

Outline

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means

Back to Inference

• Estimation: Given a joint distribution Pr(X, Y) on observed

data Y and unobserved data X, we want to estimate X given Y.

We may want:

– MAP estimates: the mode of the posterior Pr(X | Y)

– Conditional means: E(X | Y)

In practice it may only be possible to get a local max or mean.

• Model Inference: Since we can't know the true Pr(X, Y), we

choose a family of models M with tractable Pr(X, Y | M) and

then find a “best” model (e.g. minimum loss).

– Most model inference formulations are not closed form, so an iteration

is needed to find the best model.

MCMC

Basics:

• The expected value of a sum is the sum of the expected values (no independence needed), so an expectation can be approximated by the mean of (possibly dependent) random values with the same mean.

• A Markov chain is a sequence of random values 𝑋0, 𝑋1, 𝑋2, … such that the distribution of 𝑋𝑖+1 depends only on the value of 𝑋𝑖

• So if you can generate a Markov chain whose stationary

distribution is the posterior probability, any posterior statistics

can be estimated.

MCMC

Metropolis-Hastings: X_t is the current state.

1. Sample a point Y from a proposal distribution q(·|X_t)

2. With probability α(X, Y) accept the new point and set X_{t+1} = Y; otherwise set X_{t+1} = X_t.

where

α(X, Y) = min(1, [π(Y) q(X|Y)] / [π(X) q(Y|X)])

And 𝜋 . is the distribution of interest, usually the posterior

probability.

The stationary distribution of this sampler is 𝜋 . and we can

estimate the statistics of any variable derived from X.
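A minimal sketch of the sampler with a symmetric Gaussian random-walk proposal, in which case the q(X|Y)/q(Y|X) factor cancels (illustrative; log_pi, step_size and the log-space acceptance test are assumptions):

import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step_size=0.5, rng=None):
    # Random-walk Metropolis: propose Y ~ N(X_t, step_size^2 I) and accept
    # with probability min(1, pi(Y)/pi(X_t)), computed in log space.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    lp_x = log_pi(x)
    samples = []
    for _ in range(n_steps):
        y = x + step_size * rng.standard_normal(x.shape)   # propose Y
        lp_y = log_pi(y)
        if np.log(rng.uniform()) < lp_y - lp_x:             # accept
            x, lp_x = y, lp_y
        samples.append(x.copy())                            # else X_{t+1} = X_t
    return np.array(samples)

Posterior statistics are then estimated by averaging functions of the samples (after discarding an initial burn-in).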

MCMC

1. Sample a point Y from a proposal distribution q(.|Xt)

2. With probability 𝛼(𝑋, 𝑌) accept the new point and set

𝑋𝑡+1 = 𝑌.

α(X, Y) = min(1, [π(Y) q(X|Y)] / [π(X) q(Y|X)])

Very simple, but it's not magic.

Note that the sampler will spend most time in high-probability

states. If the proposed Y is too “far” from 𝑋𝑡 it will have low

probability and the chain will never move.

A very successful strategy for this is the Gibbs Sampler, which

changes one variable at a time.

Gibbs Sampler

Proposal distribution is:

q_i(Y_i | X) = π_i(Y_i | X_{−i})

where X_{−i} = (X_1, … , X_{i−1}, X_{i+1}, … , X_n)

For this 𝑞𝑖 , the acceptance probability is 1.

It's exceptionally simple if the X_i are binary or categorical variables. Then q_i(Y_i | X) is simply a vector of probabilities from

which we can directly sample.
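A minimal sketch of a Gibbs sweep over categorical variables (illustrative; conditional_probs(i, x) is an assumed callback returning the vector of probabilities π_i(· | x_{−i}) for variable i):

import numpy as np

def gibbs_sampler(x0, conditional_probs, n_sweeps, rng=None):
    # Each sweep resamples every variable from its full conditional;
    # the Metropolis-Hastings acceptance probability is 1 for these proposals.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=int)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(x)):
            p = conditional_probs(i, x)        # full conditional for variable i
            x[i] = rng.choice(len(p), p=p)     # resample X_i directly
        samples.append(x.copy())
    return np.array(samples)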

Gibbs Sampler

There are lots of principled and semi-principled improvements:

1. Update independent blocks of variables in parallel.

2. Delay (mini-batch) updates to model parameters, rather than

updating with every block – Smola et al. paper

3. Use collapsed inference for continuous variables where

possible.

4. Draw multiple samples for each variable (skip-ahead) in one time step (see the sketch after this list):

– Bernoulli samples → Binomial samples

– Categorical samples → Multinomial samples

– Or approximate both with Poisson distributions
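A small sketch of the skip-ahead idea (illustrative values of m, p and probs, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

m, p = 100, 0.3
count = rng.binomial(m, p)            # replaces m separate Bernoulli(p) draws

probs = np.array([0.5, 0.3, 0.2])
counts = rng.multinomial(m, probs)    # replaces m separate categorical draws

approx = rng.poisson(m * probs)       # Poisson approximation to the counts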

Open Problem

Gibbs sampler approaches so far have been much slower than

algebraic (e.g. variational LDA) ones.

• With collapsed sampling, the number of “operations” is

essentially the same.

• But implementation choices lead to huge constant factors

– Java random numbers are quite slow

– Updating random memory locations is very slow

• Instead, dense parametric GPU-based random number

generators can do the same calculations orders of magnitude

faster.

Outline

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means

Convex vs non-Convex Optimization

• Many simple optimization problems are convex (e.g.

regression), and the choice of optimization strategy affects

only the rate of convergence.

• But most non-trivial models are non-convex (e.g. factor

and cluster models, latent variable models), and have

multiple local risk minima. The optimization strategy can

have a significant effect on final accuracy.

Convex vs non-Convex Optimization

• Classical gradient descent is a “deterministic” algorithm. It

will usually find a nearby local minimum of risk.

• SGD adds “randomness” to the gradient estimates. It

moves both with and against the gradient, and can move

away from local minima.

• Gibbs samplers in principle explore the entire posterior

space, but in practice often wander only near a local

minimum.

Convex vs non-Convex Optimization

• For this reason, it's common to start Gibbs samplers (or

other MCMC estimators) from many random initial points.

• Since some trajectories can wander far from the minima,

they can be periodically pruned.

• The acceptance probability can be adjusted (down) to reduce the "temperature" of the sampler (see the sketch after this list).

• This process (called annealing) eventually causes the

sampler to settle in a true local minimum.

• If the loss is a negative log probability, then the local risk

minimum is a local mode of probability.
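A sketch of the tempered acceptance probability (illustrative; the 1/T exponent and the cooling schedule are assumptions, not from the lecture):

import numpy as np

def annealed_accept_prob(log_pi_y, log_pi_x, temperature):
    # min(1, (pi(Y)/pi(X))^(1/T)): as T is lowered, moves to lower-probability
    # states are accepted less often, so the chain settles into a local mode.
    return min(1.0, np.exp((log_pi_y - log_pi_x) / temperature))

for temp in np.geomspace(1.0, 0.01, num=5):   # example cooling schedule
    p = annealed_accept_prob(log_pi_y=-5.0, log_pi_x=-4.0, temperature=temp)
    print(f"T={temp:.2f}  acceptance prob. for a worse state: {p:.3g}")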

Summary

• Design patterns for Behavioral Modeling

• Stochastic Gradient Descent

• Second-Order SGD

• MCMC + Gibbs Sampling

• What it all means