
Page 1

Survey of Overparametrization and Optimization

Jason D. Lee

University of Southern California

September 25, 2019

Jason Lee

Page 2

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 3

Today’s Tutorial

Survey of Optimization and Overparametrization in Deep Learning.

Can think of this tutorial more as a survey of the literature with my own perspectives and opinions.

Jason Lee

Page 4

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 5

Theoretical Challenges: Two Major Hurdles

1 Optimization

Non-convex and non-smooth with exponentially many critical points.

2 Statistical

Successful deep networks are huge, with more parameters than samples (overparametrization).

Two Challenges are Intertwined

Learning = Optimization Error + Statistical Error. But optimization and statistics cannot be decoupled.

The choice of optimization algorithm affects the statistical performance (generalization error).

Improving statistical performance (e.g. using regularizers, dropout, ...) changes the algorithm dynamics and landscape.

Jason Lee


Page 7

Non-convexity

Practical observation: Gradient methods find high quality solutions.

Theoretical side: Exponentially many local minima in the square loss for a simple architecture (one neuron with sigmoid activation) [Auer-Herbster-Warmuth].

Many other hardness results based on intersections of halfspaces for realizable models with positive margin [Klivans-Sherstov, Livni-Shalev-Shwartz-Shamir, Neyshabur-Tomioka-Srebro].

Question

Why is (stochastic) gradient descent (GD) successful? Or is it just “alchemy”?

Jason Lee


Page 10

Setting

Loss

L_n(θ) = ∑_i ℓ(f_θ(x_i), y_i) + R(θ),

1 f_θ(x) is the prediction function (neural network)

2 ℓ(ŷ, y) = ½(ŷ − y)² or ℓ(ŷ, y) = log(1 + exp(−ŷ y)).

Algorithm

Gradient Descent algorithm:

θ_{k+1} = θ_k − η_k ∇L_n(θ_k).

Stochastic Gradient:

θ_{k+1} = θ_k − η_k ∇_θ ℓ(f_θ(x_i), y_i).
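To make the updates above concrete, here is a minimal NumPy sketch (not from the slides) of full-batch gradient descent on the square loss of a small two-layer ReLU network; all sizes are illustrative, and the loss is averaged over samples only to keep the step size stable.

```python
import numpy as np

# Minimal sketch of the setting above: full-batch gradient descent
# theta_{k+1} = theta_k - eta * grad L_n(theta_k) for a two-layer ReLU net
# f_theta(x) = a^T sigma(Wx) with square loss. Sizes are illustrative.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 50                       # samples, input dim, width
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

W = rng.normal(size=(m, d)) / np.sqrt(d)  # hidden-layer weights W
a = rng.normal(size=m) / np.sqrt(m)       # output-layer weights a

eta = 0.05
for k in range(500):
    h = np.maximum(X @ W.T, 0.0)          # n x m hidden activations
    r = h @ a - y                         # residuals f_theta(x_i) - y_i
    grad_a = h.T @ r / n                  # gradient of (1/2n) sum_i r_i^2 in a
    grad_W = ((np.outer(r, a) * (h > 0)).T @ X) / n   # ... and in W
    a -= eta * grad_a
    W -= eta * grad_W

print("final training loss:",
      0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))
```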

Jason Lee

Page 11

Notation

All parameters are collected in θ. Weights of individual layers are W_l and the output-layer weights are a.

Input x ∈ R^d, the width of the network is denoted m, and the depth is L.

Total number of parameters p (θ ∈ R^p), and sample size n.

Jason Lee

Page 12

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 13

Architecture Design

Designing the Architecture

Goal: Design the architecture so that gradient descent finds good solutions (e.g. no spurious local minimizers) [Livni et al.].

Figure: SGD succeeds on the right loss function, but fails on the left in finding global minima.

Jason Lee

Page 14

Architecture Design: Overparametrization

Empirical Observation: It is easier for SGD to optimize larger architectures.

Jason Lee

Page 15

Practical Landscape Design - Overparametrization

[Figure: two panels plotting Objective Value (0 to 0.5) against Iterations (×10⁴).]

Figure: Experiment first done by Livni-Shalev-Shwartz-Shamir 2014.

Jason Lee

Page 16

Overparametrization

Conventional Wisdom on Overparametrization

If SGD is not finding a low training error solution, then fit a more expressive model until the training error is near zero.

Problem

How much over-parametrization do we need to efficiently optimize + generalize?

Adding parameters increases computational and memory cost.

Too many parameters may lead to overfitting (???).

Jason Lee

Page 17

How much Overparametrization to Optimize?

Motivating Question

How much overparametrization ensures success of SGD?

For arbitrary labels, p ≫ n is necessary, where p is the number of parameters.

Can the amount of overparametrization adapt to latent structure in the labels?

Jason Lee

Page 18

Taxonomy of Overparametrization

Geometry-based: All local minima are global; all strict local minima are global.

Local Dynamics (Lazy/Kernel Regime): Utilize the local linear behavior of the network predictions.

Global Dynamics (Active Training, Mean Field): Characterized by large changes in the parameters.

Jason Lee

Page 19

Architecture Design: Skip Connections

Skip connections/ResNets avoid gradient vanishing due to depth.

Jason Lee

Page 20

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 21

Overview of Geometric Results

Decompose into two steps:

Gradient-based algorithms find first-order stationary points or second-order stationary points (Pemantle 92, Ge et al. 15, Lee et al. 16, Jin et al. 17).

Establish that all first-order/second-order stationary points (or local minima) are global minimizers.

Jason Lee

Page 22

One-point convexity

Gradient methods can converge even when the function is non-convex.

Quasi-convex

∇L(θ)ᵀ(θ − θ*) ≥ d(θ, θ*),

where d is some distance measure to optimality.

d(θ, θ*) = L(θ) − L(θ*): quasi-convex.

d(θ, θ*) = η‖θ − θ*‖² + β‖∇L(θ)‖²: regularity condition or correlation condition.

Jason Lee

Page 23

Single Neuron/Filter and Local Results

One-point convex

Single neuron/filter models have an even stronger property (with distribution assumptions on x):

∇_θ L(θ)ᵀ(θ − θ*) > 0.

y = σ(wᵀx) (Candes et al., Kalai et al., Kakade et al., Mei et al., Soltanolkotabi, Goel et al.)

Single filter: y = ∑_j σ((S_j w)ᵀ x) (Brutzkus and Globerson, Du et al., Goel et al.)

If W* ≈ I, then a two-layer ReLU network is one-point convex in a small region around W* (Li and Yuan, Zhong et al.). ResNets have a large basin of attraction around the identity.

Linear Dynamical Systems (Hardt et al.)

Jason Lee

Page 24

Polyak Condition

In over-parametrized models, we frequently do not know what θ* is because it is non-identifiable.

Polyak Gradient Domination

‖∇L(θ)‖ ≥ L(θ) − L(θ*).

Local convergence for over-parametrized models (SJL18)

Global convergence for over-parametrized models (DZPS18, DLLWZ18)

Linear residual networks (Hardt and Ma) satisfy the Polyak condition in a large region around the initialization.
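For completeness, a textbook sketch (not from the slides) of why gradient domination gives global convergence, stated for the squared Polyak-Lojasiewicz form ‖∇L(θ)‖² ≥ 2μ(L(θ) − L(θ*)) and a β-smooth loss:

```latex
% Descent lemma with step size eta = 1/beta, followed by the PL inequality:
\begin{aligned}
L(\theta_{k+1}) &\le L(\theta_k) - \tfrac{1}{2\beta}\,\|\nabla L(\theta_k)\|^2 \\
                &\le L(\theta_k) - \tfrac{\mu}{\beta}\,\bigl(L(\theta_k) - L(\theta^*)\bigr) \\
\Longrightarrow\quad
L(\theta_k) - L(\theta^*) &\le \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{k}\,\bigl(L(\theta_0) - L(\theta^*)\bigr).
\end{aligned}
```

Note that the argument never needs to know θ*, which is exactly why gradient domination is convenient for non-identifiable over-parametrized models.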

Jason Lee

Page 25

Gradient Vanishing

The key is to avoid spurious gradient vanishing.

What to do if the gradient is zero?

L(θ) = L(θ_0) + ∇L(θ_0)ᵀ(θ − θ_0) (= 0) + ½ (θ − θ_0)ᵀ ∇²L(θ_0)(θ − θ_0)

Try to find a direction θ − θ_0 so that (θ − θ_0)ᵀ ∇²L(θ_0)(θ − θ_0) < 0.

Jason Lee


Page 27

Algorithms that Avoid Strict Saddle

Using second-order information, it is easy to find SOSP:

Algorithm 1 Second-Order Method (Royer and Wright)

for k = 0, 1, 2, ... do
  Step 1. (First-Order)
  if ‖∇L(θ_k)‖ ≤ ε_g then
    Go to Step 2;
  else
    Set d_k = −∇L(θ_k); θ_{k+1} = θ_k + η d_k
  end if
  Step 2. (Second-Order) Compute an eigenpair (v_k, λ_k) where λ_k = λ_min(∇²L(θ_k)) and v_kᵀ ∇L(θ_k) ≤ 0.
  if ‖∇L(θ_k)‖ ≤ ε_g and λ_k ≥ −ε_H then
    Terminate;
  else if λ_k < −ε_H then
    (Negative Curvature) Set d_k = v_k; Set θ_{k+1} = θ_k + η d_k.
  end if
end for
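Below is a hedged Python sketch of the two-phase structure of Algorithm 1 (gradient step while the gradient is large, otherwise a negative-curvature step), run on a toy function with a strict saddle at the origin; derivatives are taken by finite differences for brevity, and this is only the skeleton, not the Royer-Wright method verbatim.

```python
import numpy as np

# Toy run of the two-phase scheme: first-order steps while the gradient is
# large, otherwise step along an eigenvector of most negative curvature.
def gradient(f, x, h=1e-4):
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)])

def hessian(f, x, h=1e-4):
    E = np.eye(x.size)
    H = np.array([[(f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                    - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h * h)
                   for ej in E] for ei in E])
    return 0.5 * (H + H.T)

f = lambda x: x[0] ** 2 - x[1] ** 2 + x[1] ** 4      # strict saddle at (0, 0)
x, eta, eps_g, eps_H = np.zeros(2), 0.1, 1e-6, 1e-6
for k in range(300):
    g = gradient(f, x)
    if np.linalg.norm(g) > eps_g:            # Step 1: first-order progress
        x = x - eta * g
        continue
    lam, V = np.linalg.eigh(hessian(f, x))   # Step 2: smallest eigenpair
    if lam[0] >= -eps_H:
        break                                # approximate SOSP: terminate
    v = V[:, 0]
    x = x + eta * (-v if v @ g > 0 else v)   # negative-curvature step, v^T grad <= 0
print("terminated at", x, "with f(x) =", f(x))
```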

Jason Lee

Page 28

Where will the second-order algorithm terminate?

The second-order algorithm makes progress until both of the following hold:

1 ∇L(θ) = 0

2 ∇²L(θ) ⪰ 0.

If either of these two conditions is violated, then the algorithm can still make progress. Thus if θ satisfies the above two conditions, we call it second-order stationary.

Jason Lee

Page 29

Strict Saddle aka Second-order Stationary Point

A critical point θ* is a second-order stationary point (SOSP) if

1 ∇L(θ*) = 0,

2 ∇²L(θ*) ⪰ 0.

SOSP ≈ local minimum

Jason Lee

Page 30

Detour: Higher-order saddles

There is an obvious generalization to escaping higher-order saddles that requires computing negative eigenvalues of higher-order tensors.

Third-order saddles can be escaped (Anandkumar and Ge 2016).

It is NP-hard to escape 4th-order saddles.

Neural nets of depth L will generally have saddles of order L.

Escaping second-order stationary points in manifold-constrained optimization is of the same difficulty as the unconstrained case. Escaping second-order stationary points in constrained optimization is NP-hard (copositivity testing).

Jason Lee

Page 31

How about Gradient Methods?

Can gradient methods with no access to the Hessian avoid saddle points?

Typically, algorithms only use gradient access.

Naively, you may think that if the gradient vanishes then the algorithm cannot escape, since it cannot “access” second-order information.

Randomness

The above intuition may hold without randomness, but imagine that ∇L(0) = 0 (with H = ∇²L(0)). We run GD from a small perturbation Z of 0:

θ_t = (I − ηH)ᵗ Z.

GD can see second-order information when near saddle points.

Jason Lee


Page 33

How about Gradient Methods?

Gradient flow diverges from (0, 0) unless initialized on y = −x.

This picture completely generalizes to general non-convex functions.

Jason Lee

Page 34

More Intuition

Gradient descent near a saddle point is power iteration:

f(x) = ½ xᵀHx

x_k = (I − ηH)ᵏ x_0

It converges to the saddle point 0 iff x_0 is in the span of the positive eigenvectors.

As long as there is one negative eigenvector, this set is measure 0.

Thus for indefinite quadratics, the set of initial conditions that converge to a saddle is measure 0.
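A tiny NumPy check of this picture (assumed diagonal H with one negative eigenvalue): a generic initialization escapes the saddle, while an initialization in the span of the positive eigenvectors converges to it.

```python
import numpy as np

# Gradient descent on f(x) = (1/2) x^T H x is exactly x_{k+1} = (I - eta*H) x_k.
rng = np.random.default_rng(1)
H = np.diag([2.0, 1.0, -0.5])          # one negative eigenvalue: strict saddle at 0
eta = 0.1
M = np.eye(3) - eta * H

x_random = rng.normal(size=3)          # generic initialization
x_stable = np.array([1.0, 1.0, 0.0])   # lies in the span of the positive eigenvectors

for _ in range(200):
    x_random = M @ x_random            # blows up along the negative direction
    x_stable = M @ x_stable            # contracts to the saddle at 0

print("generic init:", np.round(x_random, 3))        # third coordinate diverges
print("stable-manifold init:", np.round(x_stable, 6))  # essentially zero
```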

Jason Lee


Page 36

Avoiding Saddle-points

Theorem (Pemantle 92, Ge et al. 2015, Lee et al. 2016)

Assume the function f is smooth and coercive (lim_{‖x‖→∞} ‖∇f(x)‖ = ∞). Then Gradient Descent with noise finds a point with

‖∇f(x)‖ < ε_g

λ_min(∇²f(x)) ≥ −ε_H,

in poly(1/ε_g, 1/ε_H, d) steps.

Gradient descent with random initialization asymptotically finds a SOSP.

Gradient-based algorithms find SOSP.

Jason Lee

Page 37

SOSP

We only need a) gradient non-vanishing or b) Hessian non-negative, so this is a strictly larger set of problems than before.

Jason Lee

Page 38

Why are SOSP interesting?

All SOSP are global minimizers and SGD/GD find the global min:

1 Matrix Completion (GLM16, GJZ17, ...)

2 Rank k Approximation (classical)

3 Matrix Sensing (BNS16)

4 Phase Retrieval (SQW16)

5 Orthogonal Tensor Decomposition (AGHKT12,GHJY15)

6 Dictionary Learning (SQW15)

7 Max-cut via Burer Monteiro (BBV16, Montanari 16)

8 Overparametrized Networks with Quadratic Activation (DL18)

9 ReLU network with two neurons (LWL17)

10 ReLU networks via landscape design (GLM18)

Jason Lee

Page 39

Which neural nets are strict saddle?

Quadratic Activation (Du-Lee, Journee et al., Soltanolkotabi et al.)

f(W; x) = ∑_{j=1}^m a_j (w_jᵀ x)²

with over-parametrization (m ≳ min(√n, d)) and any standard loss.

Jason Lee

Page 40

Which neural nets are strict saddle?

ReLU activation

f*(x) = ∑_{j=1}^k σ(w*_jᵀ x)

Tons of assumptions:

1 Gaussian x

2 no negative output weights

3 k ≤ d

The loss function with the strict saddle property is complicated. Essentially the loss encodes tensor decomposition.

Jason Lee

Page 41

More strict saddle

Two-neuron network with orthogonal weights (Luo et al.), proved using extraordinarily painful trigonometry.

One convolutional filter with non-overlapping patches (Brutzkus and Globerson).

Jason Lee

Page 42

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 43

Non-linear Least Squares (NNLS) Perspective

Folklore

Optimization is “easy” when parameters > sample size.

View the loss as a NNLS problem:

∑_{i=1}^n (f_i(θ) − y_i)²,  where f_i(θ) = f_θ(x_i) = prediction with parameter θ.

Jason Lee

Page 44

Stationary Points of NNLS

The Jacobian J ∈ R^{p×n} has columns ∇_θ f_i(θ).

Let the error be r_i = f_i(θ) − y_i. The stationarity condition is

J(θ) r(θ) = 0.

J is a tall matrix when over-parametrized, so at “most” points σ_min(J) > 0.

Jason Lee

Page 45

NNLS continued

Imagine that magically you found a critical point with σ_min(J) > 0. Then

‖J(θ) r(θ)‖ ≤ ε  ⟹  ‖r(θ)‖ ≤ ε / σ_min(J),

and thus it is globally optimal!

Takeaway: If you can find a critical point (which GD/SGD do) and ensure J is full rank, then it is a global optimum.
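A numerical illustration of the takeaway, with illustrative sizes: for an overparametrized two-layer tanh network, a finite-difference Jacobian J typically has σ_min(J) > 0, so the residual obeys ‖r‖ ≤ ‖Jr‖ / σ_min(J).

```python
import numpy as np

# Overparametrized two-layer net (p >> n): check that the prediction Jacobian
# has sigma_min > 0 and that ||r|| <= ||J r|| / sigma_min(J) holds numerically.
rng = np.random.default_rng(2)
n, d, m = 10, 4, 200                       # p = m*d + m >> n
X = rng.normal(size=(n, d))

def predictions(theta):
    W = theta[: m * d].reshape(m, d)
    a = theta[m * d:]
    return np.tanh(X @ W.T) @ a            # smooth activation for clean derivatives

theta = rng.normal(size=m * d + m) / np.sqrt(m)
eps = 1e-6
J = np.stack([(predictions(theta + eps * e) - predictions(theta - eps * e)) / (2 * eps)
              for e in np.eye(theta.size)])          # p x n, columns are grad f_i

sigma_min = np.linalg.svd(J, compute_uv=False)[-1]
r = predictions(theta) - rng.normal(size=n)          # residual against random labels
print("sigma_min(J):", sigma_min)
print("||r|| vs ||J r|| / sigma_min:",
      np.linalg.norm(r), np.linalg.norm(J @ r) / sigma_min)
```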

Jason Lee

Page 46

Other losses

Consider

∑_i ℓ(f_i(θ), y_i).

Critical points have the form

J(θ) r(θ) = 0  with  r_i = ℓ′(f_i(θ), y_i),

and so

‖r(θ)‖ ≤ ε / σ_min(J).

For almost all commonly used losses, ℓ(z) ≲ ℓ′(z), including cross-entropy.

Jason Lee

Page 47

NNLS (continued)

Question

How do we find non-degenerate critical points?

Short answer*: No one knows.

Nuanced answer: For almost all θ, J(θ) is full rank when over-parametrized. Thus “almost all” critical points are global minima.

Jason Lee


Page 50

Several Attempts

Strategy 1: Auxiliary randomness ω, so that J(θ, ω) is full rank even when θ depends on the data (Soudry-Carmon). The guarantees suggest that SGD with auxiliary randomness can find a global minimum.

Strategy 2: Pretend it is independent (Kawaguchi).

Strategy 3: Punt on the dependence. Theorems say “almost all critical points are global” (Nguyen and Hein, Nouiehed and Razaviyayn).

Jason Lee

Page 51

Geometric Viewpoint

Question

What do these results have in common?

Our goal is to minimize L(f) = ‖f − y‖².

Imagine that you are at f_0 which is non-optimal. Due to convexity, −(f − y) is a first-order descent direction.

The parameter space enters through f_θ(x), so let us say f_{θ_0} = f_0. For θ to “mimic” the descent direction, we need

J_f(θ_0)(θ − θ_0) = y − f.

Jason Lee

Page 52

Inverse Function Theorem (Informal)

What if J_f is zero? Then we can try to solve ∇²f(θ_0)[(θ − θ_0)^⊗2] = −(f − y). This will give a second-order descent direction, and allow us to escape all SOSP.

And so forth: If we can solve ∇^k f(θ_0)[(θ − θ_0)^⊗k] = y − f, this will allow us to escape a kth-order saddle.

Since we do not know y − f, we just compute the minimal eigenvector to find such a direction.

Jason Lee

Page 53

Non-Algorithmic No Spurious Local Minima

No Spurious Local Minima (Nouiehed and Razaviyayn)

Fundamentally, if the map f(B_{θ_0}) is locally onto, then there exists an escape direction.

This does not mean you can efficiently find the direction (e.g. 4th order and above). Contrast this to the strict saddle definition.

Jason Lee

Page 54

Relation to overparametrization (Informal):

y ∈ Rⁿ, so we need at least dim(θ) = p ≥ n.

Imagine you had a two-layer net f(x) = aᵀσ(Wx), and the hidden layer is super wide, m ≥ n. Then, as long as W is full rank, we can treat only a as the variable and solve ∇_a f(a_0, W_0)[a − a_0, 0] = y − f. Thus if W is fixed, all critical points in a are global minima.

Now imagine that W is also a variable. The only potential issue is if σ(WX) is a rank-degenerate matrix.

Thus imagine that (a, W) is a local minimum where the error is not zero. We can make an infinitesimal perturbation to W to make it full rank, and then a perturbation to a to move in the direction of y − f to escape. Thus there are no spurious local minima.

Papers of this “flavor”: Poston et al., Yu, Nguyen and Hein, Nouiehed and Razaviyayn, Haeffele and Vidal, Venturi et al.

Jason Lee

Page 55

Theorem (First form)

Assume that f(x) = W_L σ(W_{L−1} ⋯ σ(W_1 x)). There is a layer with width m_l > n and m_l ≥ m_{l+1} ≥ m_{l+2} ≥ … ≥ m_L.
Then almost all local minimizers are global.

Theorem (Second form)

Similar assumptions as above.
There exists a path with non-increasing loss value from every parameter θ to a global min θ*. This implies that every strict local minimizer is a global min.

These results generally require m ≥ n (at least one layer that is very wide).

They are non-algorithmic and do not have any implications for SGD finding a global minimum (higher-order saddles etc.).

Jason Lee

Page 56

Connection to Frank-Wolfe

In Frank-Wolfe or Gradient Boosting, the goal is to find a search direction that is correlated with the residual.
The direction we want to go in is f_i(θ) − y_i. If the weak classifier is a single neuron, then a two-layer classifier is the boosted version (same as Barron’s greedy algorithm):

f(x) = ∑_{j=1}^m a_j σ(w_jᵀ x),  where each σ(w_jᵀ x) is a weak classifier.

At every step, try to find w with

σ(wᵀ x_i) = f_i(θ) − y_i.

Jason Lee

Page 57

Similar to GD

Frank-Wolfe basically introduces a neuron at zero and does a local search step on the parameter.
It has the same issues:

If σ(wᵀx) can make first-order progress (meaning it is strictly positively correlated with f − y), then GD will find this.

Otherwise we need to find a direction of higher-order correlation with f − y, and this is likely hard.

Notable exceptions: quadratic activation requires an eigenvector computation (Livni et al.) and monomial activation requires a tensor eigenvalue.

Jason Lee

Page 58

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 59

How to get Algorithmic Results?

Most NNs are not strict saddle, and the “all local minima are global” style results have no algorithmic implications.
What are the few cases where we do have algorithmic results?

Optimizing a single layer (Random Features).

Local results (Polyak condition).

Let’s try to use these two building blocks to get algorithmic results.

Jason Lee


Page 61

Random Features Review

Consider functions of the form

f(x) = ∫ φ(x; θ) c(θ) dω(θ),  sup_θ c(θ) < ∞.

Rahimi and Recht showed that this induces an RKHS with kernel

K_φ(x, x′) = E_ω[φ(x; θ)ᵀ φ(x′; θ)].
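A hedged Monte Carlo sketch of this construction for ReLU features φ(x; w) = max(wᵀx, 0) with w ∼ N(0, I): the empirical average over m random features approaches the induced kernel, which for this particular choice has a known closed form (the degree-1 arc-cosine kernel of Cho and Saul, quoted here only for comparison; it is not from the slides).

```python
import numpy as np

# Empirical random-features kernel vs. its infinite-feature limit for ReLU.
rng = np.random.default_rng(3)
d, m = 5, 200_000
x, xp = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(size=(m, d))                       # random feature directions w
k_mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

cos_t = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
theta = np.arccos(np.clip(cos_t, -1, 1))
k_arccos = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))

print("Monte Carlo estimate:", k_mc, "  arc-cosine closed form:", k_arccos)
```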

Jason Lee

Page 62

Relation to Neural Nets (Warm-up)

Two-layer Net

f_θ(x) = ∑_{j=1}^m a_j σ(w_jᵀ x), and imagine m → ∞.

Define the measure c(w_j/‖w_j‖) ∝ |a_j| ‖w_j‖_2; then

f_c(x) = ∫ φ(x; θ) c(θ) dω(θ).

If m is large enough, any function of the form f(x) = ∫ φ(x; θ) c(θ) dω(θ) can be approximated by a two-layer network.

Jason Lee

Page 63

Random Features

Theorem

Let f = ∫ φ(x; θ) c(θ) dω(θ). Then there is a function of the form f̂(x) = ∑_{j=1}^m a_j φ(x; θ_j) with

‖f̂ − f‖ ≲ ‖c‖_∞ / √m.

span({φ(·, θ_j)}) is dense in H(K_φ).

Jason Lee

Page 64

Function Classes Learnable via SGD

Proof Strategy (Andoni et al., Daniely):

1 Assume the target f* ∈ H(K_φ), or approximable by H(K_φ) up to some tolerance.

2 Show that SGD learns something as competitive as the best in H(K_φ).

Jason Lee


Page 67

Step 1

1 Write K(x, y) = g(ρ) = ∑_i c_i ρ^i.

2 Thus φ(x)_i = √c_i x^i is a feature map.

3 Using this, we can write p(x) = ∑_j p_j x^j = ⟨w, φ(x)⟩ for w_j = p_j / √c_j.

4 Thus if the c_j decay quickly, then ‖w‖_2 won’t be too huge. The RKHS norm and sample complexity are governed by ‖w‖_2.

Conclusion: Polynomials and some other simple functions are in the RKHS.

Jason Lee


Page 72

Step 2

Restrict to two-layer.

Optimizing only the output layer

Consider f_θ(x) = aᵀσ(Wx), and we only optimize over a. This is a convex problem.

Algorithm: Initialize the w_j uniformly over the sphere, then compute

f̂(x) = argmin_a ∑_i L(f_{a,w}(x_i), y_i).

Guarantee (via Rahimi-Recht):

‖f̂ − f‖ ≲ ‖f‖ / √m.

Jason Lee


Page 75

Both layers

If we optimize both layers, the optimization is non-convex.

Morally, this non-convexity is harmless. We only need to show that optimizing the w_j does not hurt!

Strategy:

Initialize a_j ≈ 0 and ‖w_j‖ = O(1).

∇_{a_j} L(θ) = σ(w_jᵀ x) and ∇_{w_j} L(θ) = a_j σ′(w_jᵀ x) x.

∇_{w_j} L(θ) ≈ 0, so the w_j do not move under SGD.

The a_j converge quickly to their global optimum w.r.t. w_j = w_j^0, since w_j ≈ w_j^0 for all time.
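A quick numerical check of this strategy (single example, illustrative sizes): with a_j ≈ 0 at initialization, the gradient with respect to the hidden weights is smaller than the gradient with respect to the output weights by roughly the factor |a_j|.

```python
import numpy as np

# With small output weights, grad_W is suppressed by the factor |a_j|,
# so the hidden layer barely moves early in training.
rng = np.random.default_rng(4)
d, m = 10, 500
x, y = rng.normal(size=d), 1.0
a = 1e-3 * rng.normal(size=m)                 # a_j ~ 0 at initialization
W = rng.normal(size=(m, d)) / np.sqrt(d)      # ||w_j|| = O(1)

h = np.maximum(W @ x, 0.0)
r = a @ h - y                                  # residual on a single example
grad_a = r * h                                 # d/da_j = r * sigma(w_j^T x)
grad_W = r * np.outer(a * (W @ x > 0), x)      # d/dw_j = r * a_j * sigma'(w_j^T x) x

print("||grad_a|| =", np.linalg.norm(grad_a))
print("||grad_W|| =", np.linalg.norm(grad_W))  # smaller by roughly |a_j| ~ 1e-3
```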

Jason Lee


Page 79

Theorem

Fix a target function f* and let m ≳ ‖f*‖²_H. Initialize the network so that |a_j| ≪ ‖w_j‖_2. Then the learned network f̂ satisfies

‖f̂ − f*‖ ≲ ‖f*‖_H / √m.

This is roughly what Daniely and Andoni et al. are doing.

Jason Lee

Page 80

Deeper Networks

The idea is similar:

f_θ(x) = ∑_j a_j σ(w_j^{Lᵀ} x^{L−1})

Define φ(x; θ_j) = σ(w_j^{(0)Lᵀ} x^{L−1}), which induces some K_φ.

SGD on just a is simply training a random-features scheme for this deep kernel K_φ.

The initialization is special in that a moves much more than w during training, so the kernel is almost stationary.

Jason Lee

Page 81

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 82

Other Induced Kernels

Recap

f_θ(x) = ∑_j a_j σ(w_jᵀ x)

If only the a_j change, then we get the kernel

K(x, x′) = E[σ(w_jᵀ x) σ(w_jᵀ x′)].

Somewhat unsatisfying: the non-convexity is all in the w_j, and they are held fixed throughout the dynamics.

All weights moving

More general viewpoint. Consider if both a and w move:

f_θ(x) ≈ f_0(x) + ∇_θ f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Jason Lee


Page 84

Neural Tangent Kernel

Back up and consider f_θ(·) to be any nonlinear function:

f_θ(x) ≈ f_0(x) (≈ 0) + ∇_θ f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Assumptions:

The second-order term is “negligible”.

f_0 is negligible, which can be argued using initialization + overparametrization.

References:

Kernel viewpoint: Jacot et al., (Du et al.) ×2, (Arora et al.) ×2, Chizat and Bach, Lee et al., E et al.

Pseudo-network: Li and Liang, (Allen-Zhu et al.) ×5, Zou et al.

Jason Lee


Page 86

Tangent Kernel

Under these assumptions,

f_θ(x) ≈ f̂_θ(x) = (θ − θ_0)ᵀ ∇_θ f(θ_0).

This is a linear classifier in θ.

The feature representation is φ(x; θ_0) = ∇_θ f(θ_0).

This corresponds to using the kernel

K = ∇f(θ_0)ᵀ ∇f(θ_0).
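A sketch of this construction for a toy two-layer ReLU network: treat ∇_θ f(x; θ_0) as a feature map, build the n × n Gram matrix K = JᵀJ from a finite-difference Jacobian (autodiff would be used in practice), and check that it is positive semi-definite. Sizes are illustrative and ReLU's kink at zero is ignored.

```python
import numpy as np

# Empirical tangent kernel at initialization: K = J^T J with J the
# parameter Jacobian of the network predictions on n inputs.
rng = np.random.default_rng(5)
n, d, m = 8, 3, 256
X = rng.normal(size=(n, d))
W0 = rng.normal(size=(m, d))                   # NTK-style O(1) weights
a0 = rng.normal(size=m)
theta0 = np.concatenate([W0.ravel(), a0])

def f(theta):                                  # f_theta(x) = (1/sqrt(m)) a^T relu(Wx)
    W, a = theta[: m * d].reshape(m, d), theta[m * d:]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

eps = 1e-4
J = np.stack([(f(theta0 + eps * e) - f(theta0 - eps * e)) / (2 * eps)
              for e in np.eye(theta0.size)])   # p x n Jacobian at theta0
K = J.T @ J                                    # empirical NTK Gram matrix (n x n)
print("min eigenvalue of K (>= 0 up to numerical error):",
      np.linalg.eigvalsh(K).min())
```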

Jason Lee


Page 88

What is this kernel?

Neural Tangent Kernel (NTK)

K = ∑_{l=1}^{L+1} α_l K_l,  where K_l = ∇_{W_l} f(θ_0)ᵀ ∇_{W_l} f(θ_0).

Two-layer

K_1 = ∑_j a_j² σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′  and  K_2 = ∑_j σ(w_jᵀ x) σ(w_jᵀ x′).
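The two-layer kernels above can be evaluated directly for ReLU as finite sums over neurons, as in this short sketch (no 1/m scaling here; that depends on the parametrization discussed on the next slides).

```python
import numpy as np

# K_1 comes from the gradients w.r.t. the w_j, K_2 from the gradients w.r.t. a_j.
rng = np.random.default_rng(6)
d, m = 4, 10_000
x, xp = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

s, sp = W @ x, W @ xp
K1 = np.sum(a ** 2 * (s > 0) * (sp > 0)) * (x @ xp)   # sum_j a_j^2 sigma'(.) sigma'(.) x^T x'
K2 = np.sum(np.maximum(s, 0) * np.maximum(sp, 0))     # sum_j sigma(.) sigma(.)
print("K1 =", K1, " K2 =", K2, " NTK = K1 + K2 =", K1 + K2)
```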

Jason Lee

Page 89

Kernel is initialization dependent

K_1 = ∑_j a_j² σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′  and  K_2 = ∑_j σ(w_jᵀ x) σ(w_jᵀ x′),

so how a and w are initialized matters a lot.

Imagine ‖w_j‖² = 1/d and |a_j|² = 1/m; then only K = K_2 matters (Daniely, Rahimi-Recht).

“NTK parametrization”: f_θ(x) = (1/√m) ∑_j a_j σ(w_jᵀ x), with |a_j| = O(1), ‖w_j‖ = O(1); then

K = K_1 + K_2.

This is what is done in Jacot et al., Du et al., Chizat & Bach.

Li and Liang consider the case when |a_j| = O(1) is fixed and only w is trained:

K = K_1.

Jason Lee

Page 90

Initialization and LR

Through different initializations / parametrizations / layerwise learning rates, you can get

K = ∑_{l=1}^{L+1} α_l K_l,  where K_l = ∇_{W_l} f(θ_0)ᵀ ∇_{W_l} f(θ_0).

NTK should be thought of as this family of kernels.

Rahimi-Recht and Daniely studied the special case where only K_2 matters and the other terms disappear.

Jason Lee

Page 91

Infinite-width

For theoretical analysis, it is convenient to look at infinite width to remove the randomness from initialization.

Infinite-width

Initialize a_j ∼ N(0, s_a²/m) and w_j ∼ N(0, s_w² I/m). Then

K_1 = s_a² E_w[σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′]

K_2 = s_w² E_w[σ(w_jᵀ x) σ(w_jᵀ x′)].

These have ugly closed forms in terms of xᵀx′, ‖x‖, ‖x′‖.

Jason Lee

Page 92

Deep net Infinite-Width

Let a^(l) = W_l σ(a^(l−1)) be the pre-activations, with σ(a^(0)) := x. When the widths m_l → ∞, the pre-activations follow a Gaussian process, with covariance functions given by:

Σ^(0)(x, x′) = xᵀx′

A^(l) = [[Σ^(l−1)(x, x), Σ^(l−1)(x, x′)], [Σ^(l−1)(x′, x), Σ^(l−1)(x′, x′)]]

Σ^(l)(x, x′) = E_{(u,v)∼N(0, A^(l))}[σ(u) σ(v)].

lim_{m_l→∞} K_{L+1} = Σ^(L) gives us the kernel of the last layer (Lee et al., Matthews et al.).

Define the gradient kernels as Σ̇^(l)(x, x′) = E_{(u,v)∼N(0, A^(l))}[σ′(u) σ′(v)]. Using the backprop equations and Gaussian process arguments (Jacot et al., Lee et al., Du et al., Yang, Arora et al.), one gets

K_l(x, x′) = Σ^(l−1)(x, x′) · ∏_{l′=l}^{L} Σ̇^(l′)(x, x′).
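A Monte Carlo sketch of this recursion for ReLU activations (sizes and names illustrative): propagate the 2×2 covariance A^(l), estimate Σ^(l) and Σ̇^(l) by sampling (u, v) ∼ N(0, A^(l)), and combine them into the layerwise kernels K_l.

```python
import numpy as np

# Infinite-width covariance recursion for ReLU, with the expectations over
# (u, v) ~ N(0, A^(l)) estimated by sampling instead of closed forms.
rng = np.random.default_rng(7)
relu = lambda z: np.maximum(z, 0.0)

def gp_layers(x, xp, L, n_mc=200_000):
    Sig = np.array([[x @ x, x @ xp], [xp @ x, xp @ xp]])     # Sigma^(0) as a 2x2 block
    Sigs, Sigdots = [Sig], []
    for _ in range(L):
        uv = rng.multivariate_normal(np.zeros(2), Sigs[-1], size=n_mc)
        u, v = uv[:, 0], uv[:, 1]
        Sigdots.append(np.mean((u > 0) * (v > 0)))           # E[sigma'(u) sigma'(v)]
        Sigs.append(np.array([[np.mean(relu(u) ** 2), np.mean(relu(u) * relu(v))],
                              [np.mean(relu(v) * relu(u)), np.mean(relu(v) ** 2)]]))
    return Sigs, Sigdots

x, xp = np.array([1.0, 0.5]), np.array([0.2, 1.0])
L = 3
Sigs, Sigdots = gp_layers(x, xp, L)
# K_l(x, x') = Sigma^(l-1)(x, x') * prod_{l'=l}^{L} Sigma-dot^(l')(x, x')
K = [Sigs[l - 1][0, 1] * np.prod(Sigdots[l - 1:]) for l in range(1, L + 1)]
print("layerwise NTK terms K_1..K_3:", np.round(K, 4))
```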

Jason Lee

Page 93

NTK Overview

Recall

f_θ(x) = f_0(x) + ∇f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Linearized network (Li and Liang, Du et al., Chizat and Bach):

f̂_θ(x) = f_0(x) + ∇f_θ(x)ᵀ(θ − θ_0)

The network and the linearized network are close if GD ensures ‖θ − θ_0‖_2 is small.

If f_0 ≫ 1, then GD will not stay close to the initialization¹. Thus we need to initialize so that f_0 doesn’t blow up.

¹Probably need f_0 = o(√m), and this is the only place the neural net structure is used.

Jason Lee

Page 94

Initialization size

Common initialization schemes ensure that norms are roughly preserved at each layer. Initialization ensures x_j^(L) = O(1).

x^(l) = σ(W x^(l−1))

f_0(x) = ∑_{j=1}^m a_j x_j^(L)

Important Observation

If a_j² ∼ 1/n_in = 1/m, then f_0(x) = O(1).

For the two-layer case, this was first noticed by Li and Liang. For the deep case, it is used by Jacot et al., Du et al., Allen-Zhu et al., Zou et al.

This initialization is a √m factor smaller than the worst case.

Jason Lee


Page 96

Loss with a unique root (square loss, hinge loss)

Heuristic reasoning:

Define the p × n Jacobian matrix J of f. We need to solve Jᵀ(θ − θ_0) = y − f_0, which has a solution if p ≫ n (and some non-degeneracy).

‖θ − θ_0‖² = (y − f_0)ᵀ(JᵀJ)⁻¹(y − f_0), which does not depend on m (assuming JᵀJ concentrates).

Jason Lee

Page 97

Curvature

As m → ∞ and f_0 = O(1), the amount we need to move is constant:

‖θ − θ_0‖² = (y − f_0)ᵀ(JᵀJ)⁻¹(y − f_0).

Let’s look at how “fast” the prediction function deviates from linear, which is governed by the Hessian of f_θ. Roughly,

‖∇²f_θ(x)‖ = o_m(1) ≈ 1/√m.

Two-layer net (NTK parametrization)

∇²_{w_j} f(x) = (1/√m) a_j σ″(w_jᵀ x) xxᵀ  and  ∇²_{a_j, w_j} f(x) = (1/√m) σ′(w_jᵀ x) x.

The curvature vanishes as the width increases (due to how we parametrize/initialize).

Jason Lee


Page 102

Implication on Training Dynamics

Since the curvature can be made small by overparametrization, the gradient flow dynamics of f_{θ_t} and f̂_{θ_t} can be bounded:

‖f_{θ_t} − f̂_{θ_t}‖_∞ ≤ O(‖∇²f_θ(x)‖) = 1/√m.

In some of the papers, the linearized function f̂ is referred to as a pseudo-network (Li and Liang, (Allen-Zhu et al.) ×5, Zou et al.).

Jason Lee

Page 103

Training Error

What has been proved:

Two-layer convergence to a global minimizer (Li and Liang, Du et al., Oymak and Soltanolkotabi)

Deep nets, convolutional deep nets, ResNets (Du et al., Allen-Zhu et al., Zou et al.)

The requirements on the width are n² or worse². However, with ResNets the width is not depth dependent (Zhang et al.).

²It can be significantly improved with data assumptions to m ≳ ‖f‖²_K.

Jason Lee

Page 104

Learning

What functions can be efficiently learned?

Let K be an induced kernel. The sample complexity of learning a kernel class is

n ≳ ‖f‖²_K / ε².

Write our target function as a linear function in the RKHS (Sridharan et al., Zhang et al.):

K(x, x′) = g(ρ) = ∑ c_i ρ^i = φ(x)ᵀφ(x′),  with φ(x)_i = √c_i x^i.

Suppose f(x) = x^k and σ is a monomial of degree k. Then

f(x) = ⟨(1/√c_k) e_k, φ(x)⟩  and  ‖f‖²_K = 1/c_k.

Jason Lee


Page 106: Survey of Overparametrization and Optimization · Practical observation: Gradient methods nd high quality solutions. Theoretical Side: Exponentially many local minima in square loss

Polynomials

f(x) = \sum_j a_j x^j = \Big\langle \sum_j \tfrac{a_j}{\sqrt{c_j}}\, e_j,\ \phi(x) \Big\rangle, \qquad \|f\|_K^2 = \sum_j \frac{a_j^2}{c_j}.

Multivariate case: let f(x) = \sum_j a_j (w_j^\top x)^j with ‖w_j‖₂ = 1; then \|f\|_K^2 = \sum_j |a_j|^2 / c_j.³

What is learnable?

Constant-degree polynomials, with sample complexity 1/c_k.

Functions whose coefficients a_j decay quickly. For two-layer NTK, c_j ≍ 1/j², so we need \sum_j |a_j|^2 j^2 < \infty (see the sketch below).

³ If the kernel has a nullspace, then this should be \|f\|_K^2 \le \sum_j |a_j|^2 / c_j.
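
The bookkeeping above is easy to play with numerically (my own sketch; the spectrum c_j = 1/j² is taken from the bullet above, and the coefficient profiles are illustrative): compute ‖f‖²_K = Σ_j a_j²/c_j under truncation and watch which profiles stay bounded.

import numpy as np

# Sketch: plug coefficient profiles into ||f||_K^2 = sum_j a_j^2 / c_j with
# c_j = 1/j^2, and report the implied sample complexity n >~ ||f||_K^2 / eps^2.
# Profiles decaying faster than 1/j^{3/2} stay bounded; a_j = 1/j gives
# sum_j 1, which diverges, i.e. that target is not in the kernel class.

eps = 0.1
profiles = {
    "a_j = 2^-j ": lambda j: 2.0 ** (-j),
    "a_j = 1/j^2": lambda j: 1.0 / j ** 2,
    "a_j = 1/j  ": lambda j: 1.0 / j,
}

for J in [10, 100, 1000, 10000]:                 # truncation degree
    j = np.arange(1, J + 1, dtype=float)
    c = 1.0 / j ** 2                             # kernel coefficients
    report = []
    for name, prof in profiles.items():
        norm2 = np.sum(prof(j) ** 2 / c)         # ||f||_K^2
        report.append(f"{name}: ||f||_K^2 = {norm2:9.2f} (n >~ {norm2 / eps ** 2:.0f})")
    print(f"degree <= {J:6d} | " + " | ".join(report))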


Teacher network:

f(x) = \sum_j a_j \sigma(w_j^\top x) = \sum_d \sum_j \alpha_{j,d}\, (w_j^\top x)^d.

All such networks are learnable as long as σ has coefficients decaying fast enough. For deep networks with smooth activations, the argument can be “recursed” (similar to Zhang et al.).

Arora et al. used this to show that when the teacher network has a smooth activation, the NTK can learn it.

Allen-Zhu et al. used a direct construction to find the RKHS function (pseudo-network) instead of the series expansion of the kernel/target.

Cao and Gu, and Daniely, showed that functions in the RKHS are learnable via deep networks.

It was previously known that such teacher networks are learnable with the kernel K(x, y) = 1/(2 − x^⊤y) and its recursive variant (Sridharan et al., Zhang et al.).


When does the kernel regime hold?

Square loss: for m > m_0, the gap to the linearized predictor satisfies ‖f_t − f̄_t‖ ≈ 1/√m for all times t.

Logistic loss: for all times t until L(f_t) ≈ 1/m.

Logistic loss: for very large times t, it is not a kernel predictor (Gunasekar et al., Nacson et al.).

SGD with fresh data and either loss: for small times t, SGD on the kernel and SGD on the network are the same. They will start differing at some point (Mei et al.).


Learning Rate

The learning rate schedule is an important reason (probably the main reason) that networks trained in practice are not in the kernel regime.

NTK Parametrization vs Standard Parametrization

Let’s consider

f_A(x) = \sum_j a_j \sigma(w_j^\top x) \qquad \text{and} \qquad f_B(x) = \frac{1}{\sqrt{m}} \sum_j a_j \sigma(w_j^\top x).

Assume that ‖x‖² = d.

A: standard initialization is w_j ∼ N(0, I/d) and a_j ∼ N(0, 1/m).

B: NTK initialization is w_j ∼ N(0, I/d) and a_j ∼ N(0, 1).

Both initializations ensure that f_0(x) = O(1) and parametrize the same functions.


However, to get the same dynamics on f_t, we need to scale the learning rate (Lee et al.):

\theta^{(A)}_{t+1} = \theta^{(A)}_t - \eta\, \nabla L_1(\theta^{(A)}_t) \quad \longleftrightarrow \quad \theta^{(B)}_{t+1} = \theta^{(B)}_t - m\eta\, \nabla L_2(\theta^{(B)}_t).

Thus, for NTK to learn the same function as SGD on the standard parametrization with a constant learning rate, we need an infinite learning rate on the NTK parametrization (since mη → ∞ as m → ∞).

An infinite learning rate means we would leave the kernel regime.

So in practice, if you keep the same learning rate when using wider and wider networks, NTK won’t be a good approximation.
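
One way to see the m factor numerically (my own diagnostic, not from the slides; the tanh activation and the kernel-trace comparison are my choices): the tangent kernel ⟨∇_θ f(x), ∇_θ f(x′)⟩ at initialization grows linearly in m under the standard parametrization but stays O(1) under the NTK parametrization, so matching the function-space step size forces the learning-rate rescaling above.

import numpy as np

# Sketch: empirical tangent kernels at initialization for
#   f_A(x) =             sum_j a_j tanh(w_j^T x)   with a_j ~ N(0, 1/m)   (standard)
#   f_B(x) = (1/sqrt(m)) sum_j a_j tanh(w_j^T x)   with a_j ~ N(0, 1)     (NTK)
# The ratio of kernel scales grows roughly linearly in m.

rng = np.random.default_rng(2)
d, n = 20, 4
X = rng.standard_normal((n, d))                     # inputs with ||x||^2 ~ d

def tangent_kernel(a, W, scale):
    """Gram matrix of grad_theta f for f(x) = scale * sum_j a_j tanh(w_j^T x)."""
    Z = np.tanh(X @ W.T)                            # n x m
    S = 1.0 - Z ** 2                                # sigma'
    K = scale ** 2 * (Z @ Z.T)                      # from df/da_j = scale * sigma(w_j^T x)
    K += scale ** 2 * ((S * a) @ (S * a).T) * (X @ X.T)  # from df/dw_j = scale * a_j sigma'(w_j^T x) x
    return K

for m in [128, 512, 2048, 8192]:
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a_std = rng.standard_normal(m) / np.sqrt(m)     # standard init
    a_ntk = rng.standard_normal(m)                  # NTK init
    K_std = tangent_kernel(a_std, W, 1.0)
    K_ntk = tangent_kernel(a_ntk, W, 1.0 / np.sqrt(m))
    ratio = np.trace(K_std) / np.trace(K_ntk)
    print(f"m = {m:5d}   trace(K_std) / trace(K_ntk) = {ratio:8.1f}   ratio / m = {ratio / m:.3f}")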


1 If f*(x) = σ(w^⊤x) (a single ReLU), then one needs exponentially many (in d) random features to approximate this model (for K(x, y) = E_w[σ(w^⊤x) σ(w^⊤y)]), or predictors with norm exponential in d (Yehudai and Shamir). If we instead choose the model to be f_θ(x) = Σ_j σ(w_j^⊤x), then it is learnable with O(d) samples (Soltanolkotabi).

2 With m = d^k, one can only learn as well as fitting a degree-k polynomial (Ghorbani et al.).

3 For a simple distribution realizable by four ReLUs, with n ≲ d² samples one does no better than random guessing. The idea is the same as in the first bullet: the RKHS inductive bias is very poorly aligned with targets that are “sparse” in neuron space. Intuition: f* is 1-sparse in the space of neurons, meaning f*(x) = ∫ ρ(w) σ(w^⊤x) dw with ρ a Dirac delta. The RKHS inductive bias is ‖ρ‖₂, which is really terrible when ρ is a Dirac delta. If you could enforce a ‖ρ‖₁ inductive bias instead, the sample complexity is n ≈ d/ε², i.e., O(d).
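
A quick way to quantify that intuition (my own illustrative computation, not from the slides; τ denotes the base measure over neurons): smooth the Dirac delta over a set B of mass ε under τ. The ℓ₁-type norm is unaffected while the ℓ₂ (RKHS-side) norm blows up:

\rho_\epsilon(w) = \frac{1}{\epsilon}\,\mathbf{1}_B(w), \quad \tau(B) = \epsilon \;\Longrightarrow\; \|\rho_\epsilon\|_{L^1(\tau)} = 1 \quad \text{but} \quad \|\rho_\epsilon\|_{L^2(\tau)}^2 = \frac{1}{\epsilon} \xrightarrow[\epsilon \to 0]{} \infty.

So a single neuron is cheap for the ‖ρ‖₁ bias but arbitrarily expensive for the ‖ρ‖₂ (kernel) bias.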


Going beyond NTK

WARNING: What follows are opinions that are only lightly grounded in mathematics.

Question

Does the kernel regime reflect the success of deep learning?

NTK accuracy and SGD accuracy on the same architecture have a gap (Lee et al., Arora et al.).

Can we close this gap and how?


Random thoughts:

Learning rate is key. If you use a small LR, then NTK and SGD find similar predictors (but the test accuracy is not super high). With the same architecture and a tuned LR, the test accuracy is higher. NTK implicitly forces the learning rate to be infinitesimally small, which may hurt learning.

Logistic loss: if you try to solve matrix completion with f_Θ((i, j)) = (UV^⊤)_{ij}, then NTK simply imputes the observed entries (see the sketch below). However, if you run GD for a long time, you get the minimum nuclear norm solution (Srebro, Gunasekar et al.).
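
A small sketch of why the linearized model only “imputes” (my own illustration, not from the slides; the init scaling is my choice): at a random initialization the tangent kernel between two different matrix entries is near zero, so the kernel predictor has almost nothing to say about entries it has not seen.

import numpy as np

# Sketch: for f_Theta((i, j)) = (U V^T)_{ij}, the tangent kernel between entries is
#   Theta((i, j), (k, l)) = <grad f_{ij}, grad f_{kl}>
#                         = 1[i == k] <V_j, V_l> + 1[j == l] <U_i, U_k>.
# At a random init the off-diagonal terms are O(1/sqrt(r)), so the linearized
# (NTK) predictor barely couples different entries.

rng = np.random.default_rng(3)
n, r = 30, 20                                   # matrix size and factor width
U = rng.standard_normal((n, r)) / np.sqrt(r)
V = rng.standard_normal((n, r)) / np.sqrt(r)

entries = [(i, j) for i in range(5) for j in range(5)]      # a few entries
G = np.zeros((len(entries), len(entries)))
for a, (i, j) in enumerate(entries):
    for b, (k, l) in enumerate(entries):
        G[a, b] = (i == k) * (V[j] @ V[l]) + (j == l) * (U[i] @ U[k])

off = G - np.diag(np.diag(G))
print(f"mean |diagonal|     = {np.mean(np.abs(np.diag(G))):.3f}")
print(f"mean |off-diagonal| = {np.mean(np.abs(off)):.3f}   (shrinks further as r grows)")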


Logistic loss (and maybe one-pass SGD): we know that asymptotically (under numerous assumptions) GD converges to a stationary point of arg min_{θ : y_i f_θ(x_i) ≥ 1} ‖θ‖₂ (Nacson et al., Li and Lv).⁴ Even for simple models, ℓ₂ regularization on the parameters leads to an interesting inductive bias:

For deep linear nets, the Schatten-2/L norm, which promotes low rank.

For the linear model β = θ₁ ⊙ θ₂, one gets ‖β‖₁ (see the short derivation after this list).

For a two-layer ReLU net, f(x) = ∫ ρ(w) σ(w^⊤x) dw, one gets ‖ρ‖₁ (Neyshabur et al., Bengio et al., Wei et al.).

For deep ReLU nets, a size-free complexity bound (Golowich et al.).
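
For the β = θ₁ ⊙ θ₂ bullet, the computation is one line (spelled out here for completeness; it is the standard AM–GM argument):

\min_{\theta_1 \odot \theta_2 = \beta} \|\theta_1\|_2^2 + \|\theta_2\|_2^2 = \sum_i \min_{\theta_{1,i}\theta_{2,i} = \beta_i} \left(\theta_{1,i}^2 + \theta_{2,i}^2\right) = \sum_i 2|\beta_i| = 2\|\beta\|_1,

using a^2 + b^2 \ge 2|ab| with equality at |a| = |b| = \sqrt{|\beta_i|}. Minimizing the ℓ₂ norm of the parameters therefore acts like an ℓ₁ penalty on β.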

However, we should NOT expect to get the global max-margin solution except in special examples such as matrix sensing.

Question: If I initialize at the NTK solution, which stationary point of arg min_{θ : y_i f_θ(x_i) ≥ 1} ‖θ‖₂ do you converge to? This is what happens in super-wide networks with an infinitesimally small LR.

⁴ The NTK solution is not a stationary point of this.


Questions?

Thank You. Questions?
