Outline
• Design patterns for Behavioral Modeling
• Stochastic Gradient Descent
• Second-Order SGD
• MCMC + Gibbs Sampling
• What it all means
Huasha Zhao
• Graduate student, UC Berkeley
• BS with honors, Tsinghua University
• Research in algorithm/system design and numerical methods for emerging large-scale data mining problems
• Past work experience with Goldman Sachs, Cisco Systems and Microsoft Research
• Co-founder of Sponsor for Educational Opportunity (SEO) China Chapter
Behavioral Data
• Datasets are large, resident on disks somewhere (possibly in cluster storage): $10^9$ – $10^{15}$ bytes
• Feature spaces are large (words, URLs, movies, followers, etc.): $10^3$ – $10^9$
• Data comprise many samples from an ideal population (users, documents, web pages, etc.): $10^6$ – $10^9$
• Data are sparse: only a few sample × feature combinations are observed.
Inference
• Estimation: Given a joint distribution Pr(X, Y) on observed
data Y and unobserved data X, we want to estimate X given Y.
We may want:
– MAP estimates: the mode of the posterior Pr(X | Y)
– Conditional means: E(X | Y)
In practice it may only be possible to get a local max or mean.
• Model Inference: Since we can't know the true Pr(X, Y), we
choose a family of models M with tractable Pr(X, Y | M) and
then find a “best” model (e.g. minimum loss).
– Most model inference formulations are not closed form, so an iteration
is needed to find the best model.
Mapping to Hardware
• Models: Fit in memory. Replicated in memory if in a cluster.
• Data: Stored on disk, possibly cached in memory. Samples
distributed if in a cluster.
• Inference: Access many samples in some order, or blocks of
samples that can fit in memory.
• Updates: Are often derivatives of loss wrt model parameters,
but may be other functions as well.
Mapping Iterations to Hardware
• Classical (in memory): batch model updates over the full dataset (samples × features); many passes over the data.
• Large Datasets: minibatch model updates over blocks DATA1, DATA2, DATA3, …; few (or one) passes over the entire dataset.
[Figure: data blocks streamed through a chain of model updates M → M+]
SGD and MCMC
Stochastic Gradient Descent and Markov-Chain Monte-Carlo
Both involve model updates after processing each sample.
In most cases, there is little penalty to processing small blocks of
samples.
[Figure: blocks DATA1, DATA2, DATA3, … each feeding a model update M → M+]
Outline
• Design patterns for Behavioral Modeling
• Stochastic Gradient Descent
• Second-Order SGD
• MCMC + Gibbs Sampling
• What it all means
Gradient Descent
Gradient Descent: Let 𝑄(𝑧, 𝑤) be the loss for a sample 𝑧 and
model parameters (weights) 𝑤. Then gradient descent updates
these weights as:
$$w_{t+1} = w_t - \gamma \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$$
where $n$ is the number of samples and $\gamma$ is a suitably chosen constant.
For $w_0$ sufficiently close to the optimum and small enough $\gamma$, we get linear convergence, i.e. $\log \rho \sim t$, where $\rho$ is the residual error.
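
A minimal Python sketch of this batch update; grad_Q and the toy least-squares loss are illustrative assumptions, not code from the lecture:

import numpy as np

def gradient_descent(w, samples, grad_Q, gamma=0.1, n_iters=100):
    # w_{t+1} = w_t - gamma * (1/n) * sum_i grad_w Q(z_i, w_t)
    for _ in range(n_iters):
        g = sum(grad_Q(z, w) for z in samples) / len(samples)
        w = w - gamma * g
    return w

# Toy usage: least-squares loss Q(z, w) = 0.5*(x.w - y)^2 for z = (x, y)
grad_Q = lambda z, w: (z[0] @ w - z[1]) * z[0]
samples = [(np.random.randn(3), 0.0) for _ in range(50)]
w = gradient_descent(np.zeros(3), samples, grad_Q)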
Second-Order Gradient Descent
Second-Order Gradient Descent: Instead we update with:
$$w_{t+1} = w_t - \Gamma_t \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$$
where $\Gamma_t$ is a positive definite matrix that approximates the inverse Hessian of the loss function at the optimum.
For $w_0$ sufficiently close to the optimum, we get quadratic convergence, i.e. $\log \log \rho \sim t$, where $\rho$ is the residual error.
This is Newton’s method.
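
A minimal sketch of one second-order step; solving $Hs = g$ rather than forming $H^{-1}$ explicitly is the usual implementation choice (names here are illustrative):

import numpy as np

def newton_step(w, g, H):
    # w_{t+1} = w_t - H^{-1} g, with the inverse applied via a linear solve
    return w - np.linalg.solve(H, g)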
Stochastic Gradient Descent
Stochastic Gradient Descent: Simply pick a random sample
𝑧𝑡 and do:
$$w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t)$$
Note that we made 𝛾𝑡 a function of the iteration step 𝑡.
The idea is that many of these steps should average to the
standard gradient step.
Convergence bound for a $\gamma_t \sim 1/t$ strategy is $\mathbb{E}\rho \sim 1/t$.
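
A minimal sketch with the $\gamma_t \sim 1/t$ schedule; grad_Q is an assumed per-sample gradient callable as in the earlier sketch:

import numpy as np

def sgd(w, samples, grad_Q, gamma0=1.0, n_steps=10000,
        rng=np.random.default_rng(0)):
    for t in range(1, n_steps + 1):
        z = samples[rng.integers(len(samples))]   # pick a random sample z_t
        w = w - (gamma0 / t) * grad_Q(z, w)       # w_{t+1} = w_t - gamma_t * grad
    return w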
2nd-Order Stochastic Gradient Descent
2nd-Order SGD: Pick a random sample $z_t$ and do:
$$w_{t+1} = w_t - \Gamma_t \nabla_w Q(z_t, w_t)$$
where $\Gamma_t$ is a positive definite matrix that approximates the inverse Hessian.
Convergence bound for a $\gamma_t \sim 1/t$ strategy is $\mathbb{E}\rho \sim 1/t$, the same as simple SGD, although the constants are better (by the square of the condition number).
Error Analysis
Let 𝑓 be the (parametric) function that predicts unknown values
from the known data
• 𝐸𝑛 𝑓 the empirical risk is the average loss on the samples
• 𝐸 𝑓 the expected risk is the expected loss on future samples
Let 𝑓∗ be the true “best” function, i.e. the function that
minimizes expected risk.
Let $f_n$ be the empirical best function after $n$ steps.
The error $\mathcal{E} = \mathbb{E}[E(f_n) - E(f^*)]$ can be broken down as
$$\mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt}$$
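
Spelled out (this is Bottou and Bousquet's decomposition; $f^*_{\mathcal{F}}$, the best function in the chosen model family, and $\tilde{f}_n$, the approximate solution the optimizer actually returns, are notation introduced here for clarity):

$$\mathcal{E} = \underbrace{\mathbb{E}\left[E(f^*_{\mathcal{F}}) - E(f^*)\right]}_{\mathcal{E}_{app}} + \underbrace{\mathbb{E}\left[E(f_n) - E(f^*_{\mathcal{F}})\right]}_{\mathcal{E}_{est}} + \underbrace{\mathbb{E}\left[E(\tilde{f}_n) - E(f_n)\right]}_{\mathcal{E}_{opt}}$$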
Error Analysis
$$\mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt}$$
ℰapp is the approximation error – how well our actual function
could approximate the best function. This is like a bias term.
ℰest is the estimation error – how well we can estimate the best
parametric function given the data. This is a variance term.
ℰopt is optimization error and measures the effects of the
particular optimization strategy used. Not an issue for small-
scale learners, but often a limitation at large scale.
Error Analysis
[Table/figure from Léon Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent, COMPSTAT 2010]
$\rho$ measures accuracy relative to the final optimum given the data.
But $\mathcal{E}$ is the total error relative to the ideal function $f^*$, and under computational bounds, SGD is faster.
Expected Risk Convergence
[Figure: expected risk convergence for a linear SVM on the RCV1 dataset (labeled news articles); from Léon Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent, COMPSTAT 2010]
Inverse speed vs. dataset size
[Figure from Shai Shalev-Shwartz and Nathan Srebro, SVM Optimization: Inverse Dependence on Training Set Size, ICML 2008]
Using Idle Resources
[Figure: timeline contrasting the disk/network access time for a block of data with the much shorter SGD + model update time (10x – 100x faster); the idle time is an opportunity to do k× more work to improve SGD convergence, k = 10..100]
Open Problems (i.e. course projects, papers)
• SGD convergence depends strongly on the $\gamma_t$ constant. Run k optimizations in parallel (with different $\gamma_t$ schedules), periodically compute the cross-validation error, and keep and distribute the best $w$.
Outline
• Design patterns for Behavioral Modeling
• Stochastic Gradient Descent
• Second-Order SGD
• MCMC + Gibbs Sampling
• What it all means
Second-Order Gradient
[Figures: first-order vs. second-order gradient descent updates, from http://leon.bottou.org/slides/largescale/lstut.pdf]
Stochastic Gradient
[Figure: first-order stochastic gradient descent, from http://leon.bottou.org/slides/largescale/lstut.pdf]
Second-Order SGD
The update rule is:
$$w_{t+1} = w_t - \Gamma_t \nabla_w Q(z_t, w_t)$$
where $\Gamma_t$ approximates the inverse Hessian.
Convergence is much better (typically) than 1-SGD.
But for $d$ features, $\Gamma_t$ has $d^2$ coefficients – too large and too slow to deal with.
Simplest approach: use a diagonal approximation to the
inverse Hessian.
In effect we are scaling the feature dimensions – critical for power-law data, whose feature scales differ by orders of magnitude.
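
A minimal sketch of one diagonally preconditioned SGD step; h_diag (a diagonal Hessian estimate) and the other names are illustrative assumptions:

import numpy as np

def precond_sgd_step(w, g, h_diag, eta=0.1, eps=1e-8):
    # Each feature dimension is scaled by 1/h_ii, so dimensions whose
    # magnitudes differ by orders of magnitude get comparable step sizes.
    return w - eta * g / (h_diag + eps)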
ADAGRAD
Assume first that the parameter space is a closed, convex set $\chi$.
Given the standard Euclidean distance $d(x, y) = \|x - y\|$, where $\|w\| = \sqrt{w \cdot w}$, projection onto the convex set $\chi$ is:
$$\pi_\chi(y) = \operatorname{argmin}_{x \in \chi} d(x, y)$$
The standard stochastic gradient update (with projection) is:
$$x_{t+1} = \pi_\chi(x_t - \eta g_t)$$
where $g_t$ is the loss gradient at step $t$, and $\eta$ is the rate constant.
ADAGRAD
Define a weighted distance $d_A(x, y) = \|x - y\|_A$, where $\|w\|_A = \sqrt{w \cdot A w}$; the weighted projection becomes:
$$\pi_\chi^A(y) = \operatorname{argmin}_{x \in \chi} d_A(x, y)$$
The ADAGRAD update is:
$$x_{t+1} = \pi_\chi^{G_t^{1/2}}\!\left(x_t - \eta\, G_t^{-1/2} g_t\right)$$
where $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^T$ is the cumulative sum of outer products of the previous gradients.
Matrix Powers
Aside, how do you get the square root of a matrix 𝐺𝑡 ?
i.e. you want a matrix 𝐴 such that 𝐴𝐴 = 𝐺𝑡.
Or $G_t^{-1/2}$, where you want a matrix $A$ such that $AA = G_t^{-1}$?
Matrix Powers
Aside, how do you get the square root of a matrix 𝐺𝑡 ?
i.e. you want a matrix 𝐴 such that 𝐴𝐴 = 𝐺𝑡.
$G_t$ is real, symmetric and positive semi-definite (all its eigenvalues are $\geq 0$).
So we can write $G_t = Q \Lambda Q^T$, where the columns of $Q$ are the (orthonormal) eigenvectors of $G_t$ and $\Lambda$ is a diagonal matrix of the eigenvalues of $G_t$. Orthonormality implies $Q^T Q = I$.
Then the square root of $G_t$ is $A = Q \Lambda^{1/2} Q^T$, since:
$$AA = Q \Lambda^{1/2} Q^T Q \Lambda^{1/2} Q^T = Q \Lambda^{1/2} \Lambda^{1/2} Q^T = Q \Lambda Q^T$$
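
A minimal NumPy sketch of this construction (sqrtm_psd is a name chosen here, not a library routine):

import numpy as np

def sqrtm_psd(G):
    # G = Q diag(lam) Q^T with orthonormal Q; eigh handles symmetric matrices
    lam, Q = np.linalg.eigh(G)
    lam = np.clip(lam, 0.0, None)   # guard against tiny negative eigenvalues
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

g = np.random.randn(5, 3)
G = g @ g.T                          # PSD by construction
A = sqrtm_psd(G)
assert np.allclose(A @ A, G)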
ADAGRAD
The full 𝐺𝑡 is usually too expensive to compute and store, but
its diagonal works almost as well. The diagonal update is:
$$x_{t+1} = \pi_\chi^{\mathrm{diag}(G_t)^{1/2}}\!\left(x_t - \eta\, \mathrm{diag}(G_t)^{-1/2} g_t\right)$$
where $\mathrm{diag}(G_t) = \sum_{\tau=1}^{t} g_\tau \circ g_\tau$ and $\circ$ is the element-wise product. $\mathrm{diag}(G_t)$ can be computed very efficiently – and so can its inverse, which is the element-wise reciprocal.
In practice you often don't have an a priori convex bound set $\chi$, and you can skip the projection step.
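
A minimal sketch of the diagonal ADAGRAD update, skipping the projection step as suggested above (all names are illustrative):

import numpy as np

def adagrad_step(x, g, G_diag, eta=0.1, eps=1e-8):
    # diag(G_t): running sum of element-wise squared gradients
    G_diag += g * g
    # scale each coordinate by 1/sqrt(sum of its squared gradients)
    x = x - eta * g / (np.sqrt(G_diag) + eps)
    return x, G_diag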
ADAGRAD - Intuition
Each gradient coefficient $g_{t,i}$ is scaled by $1 / \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}$
Intuition:
• Coefficients are divided by a factor proportional to the average $L_2$-norm of that coefficient
• The scale factor is $1/\sqrt{t}$
So there is a $t^{-1/2}$ schedule for scaling gradient updates, and gradient coordinates are normalized by $L_2$-norm.
Performance is stated in terms of "regret" – expected loss compared with an ideal offline algorithm.
Very good in practice – ideal for power law data.
Natural+BFGS updates
The following update is based on “natural gradient”:
$$x_{t+1} = x_t - \eta\, G_t^{-1} g_t$$
Like full ADAGRAD, it requires $O(d^2)$ storage and work (using a Sherman-Morrison update).
If we instead use the true Hessian at time $t$, $H_t$, we are doing local Newton updates:
$$x_{t+1} = x_t - \eta\, H_t^{-1} g_t$$
BFGS (Broyden, Fletcher, Goldfarb and Shanno) is an iterative method to approximate $H_t^{-1}$ efficiently.
L-BFGS
L-BFGS is Limited-memory BFGS.
Idea: use the last k model updates + gradients to approximate the inverse Hessian.
• This can be done in O(kd) time per iteration
• The inverse Hessian is represented with a low-rank approximation which requires O(kd) space
• Uses a diagonal preconditioner (which can be ADAGRAD)
• Matrix-vector multiplies are similarly fast
Recall: this is exactly the right complexity to exploit the extra
cycles available during disk streaming: we should be able to
support k ~ 10-100.
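
For orientation, a minimal sketch using an off-the-shelf limited-memory BFGS (SciPy's L-BFGS-B, a batch solver, not the streaming variant discussed above); the ridge least-squares loss is an illustrative assumption:

import numpy as np
from scipy.optimize import minimize

X, y = np.random.randn(100, 10), np.random.randn(100)

def loss_and_grad(w):
    # Ridge least squares: 0.5*||Xw - y||^2 + 0.5*||w||^2, with its gradient
    r = X @ w - y
    return 0.5 * r @ r + 0.5 * w @ w, X.T @ r + w

res = minimize(loss_and_grad, np.zeros(10), jac=True, method="L-BFGS-B",
               options={"maxcor": 10})   # maxcor = k stored (step, gradient) pairs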
oL-BFGS
Online L-BFGS, from the Schraudolph et al. paper: use stochastic gradients and steps instead of full gradients.
http://jmlr.csail.mit.edu/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf
oL-BFGS
[Figure from http://jmlr.csail.mit.edu/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf]
Open Problem
Derive a limited-memory version of the full ADAGRAD
update:
$$x_{t+1} = \pi_\chi^{G_t^{1/2}}\!\left(x_t - \eta\, G_t^{-1/2} g_t\right)$$
It's not difficult to derive a fast O(nm) update for $G_t^{1/2}$, and to invert it in $O(m^3)$.
Extend this to compute $G_t^{-1/2}$ directly in O(mn) time, similar to L-BFGS.
Should be a block version (for faster implementation), not a
recurrence.
Open Problem
Compare:
• ADAGRAD vs. Natural gradient diagonal scaling
• First- vs. second-order ADAGRAD, Newton, and natural gradient scaling
Outline
• Design patterns for Behavioral Modeling
• Stochastic Gradient Descent
• Second-Order SGD
• MCMC + Gibbs Sampling
• What it all means
Back to Inference
• Estimation: Given a joint distribution Pr(X, Y) on observed
data Y and unobserved data X, we want to estimate X given Y.
We may want:
– MAP estimates: the mode of the posterior Pr(X | Y)
– Conditional means: E(X | Y)
In practice it may only be possible to get a local max or mean.
• Model Inference: Since we can't know the true Pr(X, Y), we
choose a family of models M with tractable Pr(X, Y | M) and
then find a “best” model (e.g. minimum loss).
– Most model inference formulations are not closed form, so an iteration
is needed to find the best model.
MCMC
Basics:
• The expected value of a sum is the sum of expected values (no independence needed), so an expectation can be approximated by the mean of random samples with that same expectation.
• A Markov chain is a sequence of random values $X_0, X_1, X_2, \ldots$ such that the distribution of $X_{i+1}$ depends only on the value of $X_i$.
• So if you can generate a Markov chain whose stationary
distribution is the posterior probability, any posterior statistics
can be estimated.
MCMC
Metropolis-Hastings: $X_t$ is the current state.
1. Sample a point Y from a proposal distribution $q(\cdot \mid X_t)$
2. With probability $\alpha(X, Y)$ accept the new point and set $X_{t+1} = Y$; otherwise set $X_{t+1} = X_t$. Here
$$\alpha(X, Y) = \min\left(1, \frac{\pi(Y)\, q(X \mid Y)}{\pi(X)\, q(Y \mid X)}\right)$$
and $\pi(\cdot)$ is the distribution of interest, usually the posterior probability.
The stationary distribution of this sampler is $\pi(\cdot)$, and we can estimate the statistics of any variable derived from X.
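
A minimal sketch of this sampler with a symmetric Gaussian proposal, so that q(X|Y) = q(Y|X) and the acceptance ratio reduces to $\pi(Y)/\pi(X)$ (names are illustrative):

import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step=0.5,
                        rng=np.random.default_rng(0)):
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + step * rng.standard_normal()               # proposal Y ~ q(.|X)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # accept w.p. alpha
            x = y
        chain.append(x)
    return np.array(chain)

# Example: sample from a standard normal, log pi(x) = -x^2/2 + const
samples = metropolis_hastings(lambda x: -0.5 * x**2, 0.0, 10000)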
MCMC
1. Sample a point Y from a proposal distribution $q(\cdot \mid X_t)$
2. With probability $\alpha(X, Y)$ accept the new point and set $X_{t+1} = Y$; otherwise $X_{t+1} = X_t$, where
$$\alpha(X, Y) = \min\left(1, \frac{\pi(Y)\, q(X \mid Y)}{\pi(X)\, q(Y \mid X)}\right)$$
Very simple, but it's not magic.
Note that the sampler will spend most of its time in high-probability states. If the proposed Y is too "far" from $X_t$, it will have low probability and the chain will almost never move.
A very successful strategy for this is the Gibbs Sampler, which
changes one variable at a time.
Gibbs Sampler
Proposal distribution is:
$$q_i(Y_i \mid X) = \pi_i(Y_i \mid X_{-i})$$
where $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$
For this $q_i$, the acceptance probability is 1.
It's exceptionally simple if the $X_i$ are binary or categorical variables: then $q_i(Y_i \mid X)$ is simply a vector of probabilities from which we can directly sample.
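
A minimal sketch of one Gibbs sweep over binary variables; cond_prob is a hypothetical user-supplied conditional Pr(X_i = 1 | X_-i):

import numpy as np

def gibbs_sweep(x, cond_prob, rng=np.random.default_rng(0)):
    # Resample each X_i from its full conditional; acceptance probability is 1
    for i in range(len(x)):
        x[i] = rng.uniform() < cond_prob(i, x)
    return x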
Gibbs Sampler
There are lots of principled and semi-principled improvements:
1. Update independent blocks of variables in parallel.
2. Delay (mini-batch) updates to model parameters, rather than
updating with every block – Smola et al. paper
3. Use collapsed inference for continuous variables where
possible.
4. Draw multiple samples for each variable (skip-ahead) in one time step (see the sketch below):
– Bernoulli samples → Binomial samples
– Categorical samples → Multinomial samples
– Or approximate both with Poisson distributions
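
A minimal sketch of skip-ahead draws, valid when the conditional stays fixed across the k skipped steps (an approximation in a real sampler):

import numpy as np
rng = np.random.default_rng(0)

k, p = 100, 0.3
binom = rng.binomial(k, p)                     # k Bernoulli(p) draws in one call
multi = rng.multinomial(k, [0.2, 0.5, 0.3])    # k categorical draws in one call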
Open Problem
Gibbs sampler approaches so far have been much slower than
algebraic (e.g. variational LDA) ones.
• With collapsed sampling, the number of “operations” is
essentially the same.
• But implementation choices lead to huge constant factors
– Java random numbers are quite slow
– Updating random memory locations is very slow
• Instead, dense parametric GPU-based random number
generators can do the same calculations orders of magnitude
faster.
Outline
• Design patterns for Behavioral Modeling
• Stochastic Gradient Descent
• Second-Order SGD
• MCMC + Gibbs Sampling
• What it all means
Convex vs non-Convex Optimization
• Many simple optimization problems are convex (e.g.
regression), and the choice of optimization strategy affects
only the rate of convergence.
• But most non-trivial models are non-convex (e.g. factor
and cluster models, latent variable models), and have
multiple local risk minima. The optimization strategy can
have a significant effect on final accuracy.
Convex vs non-Convex Optimization
• Classical gradient descent is a “deterministic” algorithm. It
will usually find a nearby local minimum of risk.
• SGD adds “randomness” to the gradient estimates. It
moves both with and against the gradient, and can move
away from local minima.
• Gibbs samplers in principle explore the entire posterior
space, but in practice often wander only near a local
minimum.
Convex vs non-Convex Optimization
• For this reason, it's common to start Gibbs samplers (or other MCMC estimators) from many random initial points.
• Since some trajectories can wander far from the minima,
they can be periodically pruned.
• The acceptance probability can be adjusted (down) – to
reduce the “temperature” of the sampler.
• This process (called annealing) eventually causes the
sampler to settle in a true local minimum.
• If the loss is a negative log probability, then the local risk
minimum is a local mode of probability.