Optimization Methods for Machine Learning
Part II – The theory of SG
Leon Bottou, Facebook AI Research
Frank E. Curtis, Lehigh University
Jorge Nocedal, Northwestern University



Page 1

Optimization Methods for Machine Learning
Part II – The theory of SG

Leon Bottou, Facebook AI Research
Frank E. Curtis, Lehigh University
Jorge Nocedal, Northwestern University

Page 2

Summary

1. Setup
2. Fundamental Lemmas
3. SG for Strongly Convex Objectives
4. SG for General Objectives
5. Work complexity for Large-Scale Learning
6. Comments

Page 3

1- Setup

Page 4

The generic SG algorithm

The SG algorithm produces successive iterates w_k ∈ ℝ^d with the goal of minimizing a certain function F : ℝ^d → ℝ.

We assume that we have access to three mechanisms:

1. Given an iteration number k, a mechanism to generate a realization of a random variable ξ_k. The ξ_k form a sequence of jointly independent random variables.

2. Given an iterate w_k and a realization ξ_k, a mechanism to compute a stochastic vector g(w_k, ξ_k) ∈ ℝ^d.

3. Given an iteration number, a mechanism to compute a scalar stepsize α_k > 0.

Page 5

The generic SG algorithm

Algorithm 4.1 (Stochastic Gradient (SG) Method)
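(The algorithm body is shown as an image in the slides. Below is a minimal Python sketch of the generic SG loop, with made-up data, objective, and mechanisms; only the loop structure mirrors the algorithm, and the numbered comments follow what I believe is its usual step numbering, with step 6 being the update referenced later.)

import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.standard_normal((1000, d))                 # toy data: F(w) = (1/n) sum_i 0.5 (a_i^T w - b_i)^2
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(1000)

def sample_xi(k):                                  # mechanism 1: realization of xi_k (here, a random index)
    return rng.integers(len(A))

def g(w, xi):                                      # mechanism 2: stochastic vector g(w_k, xi_k)
    a = A[xi]
    return (a @ w - b[xi]) * a                     # gradient of the loss on one example

def stepsize(k):                                   # mechanism 3: scalar stepsize alpha_k > 0
    return 0.1 / (1.0 + 0.01 * k)

w = np.zeros(d)                                    # 1: choose an initial iterate w_1
for k in range(1, 10_001):                         # 2: for k = 1, 2, ... do
    xi = sample_xi(k)                              # 3:   generate a realization of xi_k
    gk = g(w, xi)                                  # 4:   compute a stochastic vector g(w_k, xi_k)
    alpha = stepsize(k)                            # 5:   choose a stepsize alpha_k > 0
    w = w - alpha * gk                             # 6:   set w_{k+1} <- w_k - alpha_k g(w_k, xi_k)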

Page 6

The generic SG algorithm

The function F : ℝ^d → ℝ could be
• the empirical risk,
• the expected risk.

The stochastic vector g(w_k, ξ_k) could be
• the gradient for one example,
• the gradient for a minibatch,
• possibly rescaled.
Page 7

The generic SG algorithm

Stochastic processes

• We assume that the ξ_k are jointly independent to avoid the full machinery of stochastic processes. But everything still holds if the ξ_k form an adapted stochastic process, where each ξ_k can depend on the previous ones.

Active learning

• We can handle more complex setups by viewing ξ_k as a "random seed". For instance, in active learning, g(w_k, ξ_k) first constructs a multinomial distribution on the training examples in a manner that depends on w_k, then uses the random seed ξ_k to pick one according to that distribution.

The same mathematics cover all these cases.
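(A sketch of this random-seed view for active learning. Everything concrete here, the data, the residual-based "informativeness" scores, and the importance weighting, is made up for illustration and not taken from the slides.)

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
b = rng.standard_normal(1000)

def g_active(w, seed):
    scores = np.abs(A @ w - b) + 1e-8              # w-dependent "informativeness" scores (assumed)
    p = scores / scores.sum()                      # multinomial distribution over the training examples
    i = np.random.default_rng(seed).choice(len(A), p=p)   # the random seed xi_k picks one example
    a = A[i]
    return (a @ w - b[i]) * a / (len(A) * p[i])    # importance weight keeps E[g] equal to the full gradient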

Page 8

2- Fundamental lemmas

Page 9

Smoothness

• Our analysis relies on a smoothness assumption. We chose this path because it also gives results for the nonconvex case.
• We'll discuss other paths in the commentary section.

Well known consequence
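(Both formulas on this slide are images in the original; presumably they are the standard smoothness statements, written out here for reference: ∇F is Lipschitz continuous with constant L, and the quadratic upper bound follows.)

\[
\|\nabla F(w) - \nabla F(\bar w)\|_2 \le L \|w - \bar w\|_2 \quad \text{for all } w, \bar w \in \mathbb{R}^d,
\]
\[
F(\bar w) \le F(w) + \nabla F(w)^\top (\bar w - w) + \tfrac{1}{2} L \|\bar w - w\|_2^2 .
\]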

Page 10

Smoothness

• 𝔼_{ξ_k} is the expectation with respect to the distribution of ξ_k only.
• 𝔼_{ξ_k}[F(w_{k+1})] is meaningful because w_{k+1} depends on ξ_k (step 6 of SG).

Expected decrease / Noise
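(The inequality these two labels annotate is an image in the original; presumably it is Lemma 4.2 of the Bottou–Curtis–Nocedal survey, stated from memory:)

\[
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) \;\le\; -\alpha_k \nabla F(w_k)^\top \mathbb{E}_{\xi_k}[g(w_k,\xi_k)] \;+\; \tfrac{1}{2}\alpha_k^2 L\, \mathbb{E}_{\xi_k}\!\left[\|g(w_k,\xi_k)\|_2^2\right],
\]

with the first term on the right acting as the expected decrease and the second as the noise.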

Page 11

Smoothness

Page 12

Moments

Page 13

Moments

• In expectation, g(w_k, ξ_k) is a sufficient descent direction.
• True if 𝔼_{ξ_k}[g(w_k, ξ_k)] = ∇F(w_k), with μ = μ_G = 1.
• True if 𝔼_{ξ_k}[g(w_k, ξ_k)] = H_k ∇F(w_k) with a bounded spectrum.
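(The first-moment conditions themselves are an image in the original; presumably they are (4.7a)–(4.7b) of the paper: there exist scalars μ_G ≥ μ > 0 such that for all k,)

\[
\nabla F(w_k)^\top \mathbb{E}_{\xi_k}[g(w_k,\xi_k)] \ge \mu \|\nabla F(w_k)\|_2^2
\qquad\text{and}\qquad
\|\mathbb{E}_{\xi_k}[g(w_k,\xi_k)]\|_2 \le \mu_G \|\nabla F(w_k)\|_2 .
\]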

Page 14

Moments

• 𝕍_{ξ_k} denotes the variance w.r.t. ξ_k.
• The variance of the noise must be bounded in a mild manner.
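(The variance condition is an image in the original; presumably it is (4.8) of the paper: there exist scalars M ≥ 0 and M_V ≥ 0 such that for all k,)

\[
\mathbb{V}_{\xi_k}[g(w_k,\xi_k)] \;\le\; M + M_V \|\nabla F(w_k)\|_2^2 .
\]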

Page 15

Moments

• Combining (4.7b) and (4.8) gives

with
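(The combined bound and the constant it defines are images in the original; presumably they read, as in (4.9) of the paper,)

\[
\mathbb{E}_{\xi_k}\!\left[\|g(w_k,\xi_k)\|_2^2\right] \;\le\; M + M_G \|\nabla F(w_k)\|_2^2,
\qquad
M_G := M_V + \mu_G^2 .
\]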

Page 16

Moments

• The convergence of SG depends on the balance between these two terms.

Expected decrease / Noise
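(The two-term bound in question is an image in the original; presumably it is Lemma 4.4 of the paper, stated from memory:)

\[
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k)
\;\le\; -\big(\mu - \tfrac{1}{2}\alpha_k L M_G\big)\,\alpha_k \|\nabla F(w_k)\|_2^2 \;+\; \tfrac{1}{2}\alpha_k^2 L M .
\]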

Page 17

Moments

Page 18

3- SG for Strongly Convex Objectives

Page 19

Strong convexity

Known consequence
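(The assumption and its known consequence are images in the original; presumably they are the standard statements: F is c-strongly convex,)

\[
F(\bar w) \ge F(w) + \nabla F(w)^\top(\bar w - w) + \tfrac{1}{2} c \|\bar w - w\|_2^2 \quad \text{for all } w, \bar w,
\]

which implies

\[
2c\,\big(F(w) - F_*\big) \le \|\nabla F(w)\|_2^2 \quad \text{for all } w .
\]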

Why does strong convexity matter?
• It gives the strongest results.
• It often happens in practice (one regularizes to facilitate optimization!).
• It describes any smooth function near a strong local minimum.

Page 20

Total expectation

Different expectations
• 𝔼_{ξ_k} is the expectation with respect to the distribution of ξ_k only.
• 𝔼 is the total expectation w.r.t. the joint distribution of all ξ_k.

For instance, since w_k depends only on ξ_1, ξ_2, …, ξ_{k−1},

𝔼[F(w_k)] = 𝔼_{ξ_1} 𝔼_{ξ_2} ⋯ 𝔼_{ξ_{k−1}}[F(w_k)]

Results in expectation

• We focus on results that characterize the properties of SG in expectation.

• The stochastic approximation literature usually relies on rather complex martingale techniques to establish almost sure convergence results. We avoid them because they do not give much additional insight.

Page 21

SG with fixed stepsize

• SG with a fixed stepsize only converges to a neighborhood of the optimal value.
• Both (4.13) and (4.14) describe the actual behavior well.
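(The bounds (4.13)–(4.14) are images in the original; as I recall Theorem 4.6 of the paper, to be checked against the original, for a fixed stepsize ᾱ with 0 < ᾱ ≤ μ/(L M_G):)

\[
\mathbb{E}[F(w_k) - F_*] \;\le\; \frac{\bar\alpha L M}{2 c \mu}
\;+\; (1 - \bar\alpha c \mu)^{k-1}\!\left(F(w_1) - F_* - \frac{\bar\alpha L M}{2 c \mu}\right)
\;\xrightarrow{k\to\infty}\; \frac{\bar\alpha L M}{2 c \mu} .
\]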

Page 22

SG with fixed stepsize (proof)

Page 23

SG with fixed stepsize

Note the interplay between the stepsize ᾱ and the variance bound M.
• If M = 0, one recovers the linear convergence of batch gradient descent.
• If M > 0, one reaches a point where the noise prevents further progress.

Page 24

Diminishing the stepsizes

• If we wait long enough, halving the stepsize α eventually halves F(w_k) − F*.
• We can even estimate F* ≈ 2F_{α/2} − F_α.

[Figure: F(w_k) versus k for SG runs with stepsize α and stepsize α/2; each curve flattens out at a level above F*, and the gap above F* is roughly halved when the stepsize is halved.]
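(A toy numerical check of the extrapolation F* ≈ 2F_{α/2} − F_α. Illustrative only: a one-dimensional quadratic F(w) = w²/2 with Gaussian gradient noise, so F* = 0, and the helper stationary_F and all constants are made up.)

import numpy as np

def stationary_F(alpha, iters=100_000, seed=0):
    rng = np.random.default_rng(seed)
    w, tail = 5.0, []
    for k in range(iters):
        w -= alpha * (w + rng.standard_normal())   # noisy gradient of F(w) = w^2 / 2
        if k > iters // 2:                         # average F(w_k) after the transient has died out
            tail.append(0.5 * w * w)
    return float(np.mean(tail))

F_a, F_a2 = stationary_F(0.2), stationary_F(0.1)
print(F_a, F_a2, 2 * F_a2 - F_a)   # halving alpha roughly halves the gap; 2*F_{a/2} - F_a lands close to F* = 0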

Page 25

Diminishing the stepsizes faster

• Divide α by 2 whenever 𝔼[F(w_k)] − F* reaches αLM/(cμ).
• Time τ_α between changes: (1 − αcμ)^{τ_α} = 1/3 means τ_α ∝ 1/α.
• Whenever we halve α we must wait twice as long to halve F(w) − F*.
• Overall convergence rate in 𝒪(1/k).

[Figure: 𝔼[F(w_k)] versus k under the halving protocol, with stepsizes α, α/2, α/4, …; divide α by 2 whenever 𝔼[F(w_k)] − F* reaches 2 · αLM/(2cμ).]

Page 26

SG with diminishing stepsizes

Page 27

SG with diminishing stepsizes
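(The theorem on this slide is an image in the original; as I recall Theorem 4.7 of the paper, to be checked against the original: for stepsizes α_k = β/(γ + k) with β > 1/(cμ) and γ > 0 chosen so that α_1 ≤ μ/(L M_G),)

\[
\mathbb{E}[F(w_k) - F_*] \;\le\; \frac{\nu}{\gamma + k},
\qquad
\nu := \max\!\left\{ \frac{\beta^2 L M}{2(\beta c \mu - 1)},\; (\gamma + 1)\big(F(w_1) - F_*\big) \right\} .
\]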

Annotations on the theorem:
• Same maximal stepsize.
• Stepsize decreases in 1/k.
• Not too slow… otherwise gap ∝ stepsize.

Page 28

SG with diminishing stepsizes (proof)

Page 29

Minibatching

Using minibatches of size n_mb with stepsize ᾱ, or single examples with stepsize ᾱ/n_mb:

                                       Computation per iteration    Noise
  minibatch SG, stepsize ᾱ                       n_mb              M/n_mb
  single-example SG, stepsize ᾱ/n_mb              1                  M

Single-example SG performs n_mb times more iterations that are n_mb times cheaper, and the product stepsize × noise (which sets the asymptotic gap) is the same.

Page 30

Minibatching

Ignoring implementation issues
• We can match minibatch SG with stepsize ᾱ using single-example SG with stepsize ᾱ/n_mb.
• We can match single-example SG with stepsize ᾱ using minibatch SG with stepsize ᾱ × n_mb, provided ᾱ × n_mb is smaller than the max stepsize.
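(A quick numerical illustration of the first equivalence: a sketch on a made-up least-squares problem, where the helper sg and all sizes and constants are arbitrary. The two printed values should be similar.)

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 20))
b = A @ rng.standard_normal(20) + 0.5 * rng.standard_normal(2000)

def F(w): return 0.5 * np.mean((A @ w - b) ** 2)   # empirical risk

def sg(alpha, n_mb, epochs=20, seed=1):
    rng = np.random.default_rng(seed)
    w = np.zeros(20)
    for _ in range(epochs * (len(A) // n_mb)):     # same epochs => n_mb = 1 runs n_mb times more iterations
        idx = rng.integers(len(A), size=n_mb)
        Ab = A[idx]
        w -= alpha * Ab.T @ (Ab @ w - b[idx]) / n_mb   # (rescaled) minibatch gradient estimate
    return F(w)

print(sg(alpha=0.05, n_mb=16))        # minibatch SG, stepsize alpha
print(sg(alpha=0.05 / 16, n_mb=1))    # single-example SG, stepsize alpha / n_mb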

With implementation issues
• Minibatch implementations use the hardware better.
• Especially on GPUs.

Page 31

4- SG for General Objectives

Page 32

Nonconvex objectives

Nonconvex training objectives are pervasive in deep learning.

Nonconvex landscapes in high dimension can be very complex.
• Critical points can be local minima or saddle points.
• Critical points can be first order or higher order.
• Critical points can be part of critical manifolds.
• A critical manifold can contain both local minima and saddle points.

We describe meaningful (but weak) guarantees
• Essentially, SG goes to critical points.

The SG noise plays an important role in practice
• It seems to help navigate local minima and saddle points.
• More noise has been found to sometimes help optimization.
• But the theoretical understanding of these facts is weak.

Page 33

Nonconvex SG with fixed stepsize
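(The theorem displayed here is an image in the original; as I recall Theorem 4.8 of the paper, to be checked: for a fixed stepsize ᾱ with 0 < ᾱ ≤ μ/(L M_G),)

\[
\mathbb{E}\!\left[\frac{1}{K}\sum_{k=1}^{K} \|\nabla F(w_k)\|_2^2\right]
\;\le\; \frac{\bar\alpha L M}{\mu} \;+\; \frac{2\big(F(w_1) - F_{\inf}\big)}{K \mu \bar\alpha},
\]

where F_inf denotes a lower bound on F over the region visited by the iterates.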

Page 34

Nonconvex SG with fixed stepsize

Annotations on the theorem:
• Same max stepsize.
• One term goes to zero like 1/K; the other does not.
• If the average norm of the gradient is small, then the norm of the gradient cannot often be large…

Page 35

Nonconvex SG with fixed stepsize (proof)

Page 36

Nonconvex SG with diminishing step sizes
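(The results on this slide are images in the original; presumably they are Theorems 4.9–4.10 of the paper: if the stepsizes satisfy Σ_k α_k = ∞ and Σ_k α_k² < ∞, then)

\[
\liminf_{k\to\infty}\; \mathbb{E}\!\left[\|\nabla F(w_k)\|_2^2\right] = 0,
\]

and a stepsize-weighted average of 𝔼[‖∇F(w_k)‖²] over the first K iterations also converges to zero.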

Page 37

5- Work complexity for Large-Scale Learning

Page 38

Large-Scale Learning

Assume that we are in the large data regime
• Training data is essentially unlimited.
• Computation time is limited.

The good
• More training data ⇒ less overfitting.
• Less overfitting ⇒ richer models.

The bad
• Using more training data or richer models quickly exhausts the time budget.

The hope
• How thoroughly do we need to optimize R_n(w) when we actually want another function R(w) to be small?

Page 39

Expected risk versus training time

• When we vary the number of examples

[Figure: expected risk versus training time when varying the number of examples.]

Page 40

Expected risk versus training time

• When we vary the number of examples, the model, and the optimizer…

[Figure: expected risk versus training time when varying the number of examples, the model, and the optimizer.]

Page 41

Expected risk versus training time

• The optimal combination depends on the computing time budget

[Figure: expected risk versus training time, with the good combinations highlighted for each time budget.]

Page 42

Formalization

The components of the expected risk
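(The decomposition on this slide is an image in the original; presumably it is the classical approximation / estimation / optimization split of Bottou and Bousquet, for an approximate empirical-risk minimizer w̃ computed to optimization accuracy ε:)

\[
\mathbb{E}[R(\tilde w)] - R(w_*) \;=\; \mathcal{E}_{\mathrm{app}}(\mathcal{H}) \;+\; \mathcal{E}_{\mathrm{est}}(n) \;+\; \mathcal{E}_{\mathrm{opt}}(\epsilon) .
\]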

Question
• Given a fixed model ℋ and a time budget 𝒯_max, choose n, ε…

Approach
• Statistics tell us ℰ_est(n) decreases with a rate in the range 1/√n … 1/n.
• For now, let's work with the fastest rate compatible with statistics.

Page 43

Batch versus Stochastic

Typical convergence rates
• Batch algorithm:
• Stochastic algorithm:
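(The rate formulas themselves are images in the original; presumably they are the usual comparison, stated from memory and to be checked against the paper: to drive the optimization error ℰ_opt below ε,)

\[
\mathcal{T}_{\mathrm{batch}}(n,\epsilon) \sim n \log(1/\epsilon),
\qquad
\mathcal{T}_{\mathrm{SG}}(n,\epsilon) \sim 1/\epsilon ,
\]

i.e. the stochastic algorithm's time budget does not grow with n.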

Rate analysis


Processing more training examples beats optimizing more thoroughly.

This effect only grows if ℰ_est(n) decreases slower than 1/n.

Page 44

6- Comments

Page 45

Asymptotic performance of SG is fragile

Diminishing stepsizes are tricky
• Theorem 4.7 (strongly convex functions) suggests stepsizes α_k = β/(γ + k).
• SG converges very slowly if β < 1/(cμ).
• SG usually diverges when α is above 2μ/(L M_G).

Constant stepsizes are often used in practice
• Sometimes with a simple halving protocol.

Page 46

Condition numbers

The ratios L/c and M/c appear in critical places.

• Theorem 4.6: with μ = 1 and M_V = 0, the optimal stepsize is ᾱ = 1/L.
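(A consequence that is presumably the one intended here, obtained by plugging μ = μ_G = 1, M_V = 0, and ᾱ = 1/L into Theorem 4.6 as reconstructed above:)

\[
\mathbb{E}[F(w_k) - F_*] \;\le\; \frac{M}{2c} \;+\; \Big(1 - \frac{c}{L}\Big)^{k-1}\Big(F(w_1) - F_* - \frac{M}{2c}\Big),
\]

so L/c governs the length of the linear-convergence phase and M/c governs the size of the noise floor.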

Page 47

Distributed computing

SG is notoriously hard to parallelize
• Because it updates the parameters w with high frequency.
• Because it slows down with delayed updates.

SG still works with relaxed synchronization
• Because this is just a little bit more noise.

Communication overhead gives room for new opportunities
• There is ample time to compute things while communication takes place.
• Opportunity for optimization algorithms with higher per-iteration costs.
➔ SG may not be the best answer for distributed training.

Page 48

Smoothness versus Convexity

Analyses of SG that only rely on convexity
• Bounding ‖w_k − w*‖² instead of F(w_k) − F*, and assuming 𝔼_{ξ_k}[g(w_k, ξ_k)] = g̃(w_k) ∈ ∂F(w_k), gives a result similar to Lemma 4.4.

• Ways to bound the expected decrease

• Proof does not easily support second-order methods.

Expected decrease / Noise
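(These two labels annotate an equation that is an image in the original; the following is a sketch of the standard convexity-only argument, my reconstruction rather than the slide's exact display, with g̃(w_k) := 𝔼_{ξ_k}[g(w_k, ξ_k)]. Expanding the update w_{k+1} = w_k − α_k g(w_k, ξ_k) gives)

\[
\mathbb{E}_{\xi_k}\!\big[\|w_{k+1}-w_*\|_2^2\big] - \|w_k-w_*\|_2^2
= -2\alpha_k\,\tilde g(w_k)^\top (w_k-w_*) \;+\; \alpha_k^2\,\mathbb{E}_{\xi_k}\!\big[\|g(w_k,\xi_k)\|_2^2\big],
\]

where convexity bounds the expected-decrease term via \tilde g(w_k)^\top(w_k-w_*) \ge F(w_k)-F_* (or via c\|w_k-w_*\|_2^2 under strong convexity), and the last term again plays the role of the noise.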