
Page 1

Survey of Overparametrization and Optimization

Jason D. Lee

University of Southern California

September 25, 2019

Jason Lee

Page 2

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 3

Today’s Tutorial

Survey of Optimization and Overparametrization in Deep Learning.

Can think of this tutorial more as a survey of the literature with my own perspectives and opinions.

Jason Lee

Page 4

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 5

Theoretical Challenges: Two Major Hurdles

1 Optimization

Non-convex and non-smooth with exponentially many critical points.

2 Statistical

Successful deep networks are huge, with more parameters than samples (overparametrization).

Two Challenges are Intertwined

Learning = Optimization Error + Statistical Error. But optimization and statistics cannot be decoupled.

The choice of optimization algorithm affects the statistical performance (generalization error).

Improving statistical performance (e.g. using regularizers, dropout, ...) changes the algorithm dynamics and landscape.

Jason Lee


Page 7

Non-convexity

Practical observation: Gradient methods find high quality solutions.

Theoretical side: Exponentially many local minima in the square loss for a simple architecture (one neuron with sigmoid activation) [Auer-Herbster-Warmuth].

Many other hardness results based on intersections of halfspaces for realizable models with positive margin [Klivans-Sherstov, Livni-Shalev-Shwartz-Shamir, Neyshabur-Tomioka-Srebro].

Question

Why is (stochastic) gradient descent (GD) successful? Or is it just “alchemy”?

Jason Lee


Page 10

Setting

Loss

L_n(θ) = ∑_i ℓ(f_θ(x_i), y_i) + R(θ),

1 f_θ(x) is the prediction function (neural network)

2 ℓ(ŷ, y) = ½(ŷ − y)² or ℓ(ŷ, y) = log(1 + exp(−ŷ y)).

Algorithm

Gradient Descent algorithm:

θ_{k+1} = θ_k − η_k ∇L_n(θ_k).

Stochastic Gradient:

θ_{k+1} = θ_k − η_k ∇_θ ℓ(f_θ(x_i), y_i).
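To make the updates above concrete, here is a minimal NumPy sketch (not from the slides) of full-batch gradient descent on the square loss of a small two-layer ReLU network; all sizes are illustrative, and the loss is averaged over samples only to keep the step size stable.

```python
import numpy as np

# Minimal sketch of the setting above: full-batch gradient descent
# theta_{k+1} = theta_k - eta * grad L_n(theta_k) for a two-layer ReLU net
# f_theta(x) = a^T sigma(Wx) with square loss. Sizes are illustrative.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 50                       # samples, input dim, width
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

W = rng.normal(size=(m, d)) / np.sqrt(d)  # hidden-layer weights W
a = rng.normal(size=m) / np.sqrt(m)       # output-layer weights a

eta = 0.05
for k in range(500):
    h = np.maximum(X @ W.T, 0.0)          # n x m hidden activations
    r = h @ a - y                         # residuals f_theta(x_i) - y_i
    grad_a = h.T @ r / n                  # gradient of (1/2n) sum_i r_i^2 in a
    grad_W = ((np.outer(r, a) * (h > 0)).T @ X) / n   # ... and in W
    a -= eta * grad_a
    W -= eta * grad_W

print("final training loss:",
      0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))
```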

Jason Lee

Page 11

Notation

All parameters are collected in θ. Weights of individual layers are W_l and the output-layer weights are a.

Input x ∈ R^d, the width of the network is denoted m, and the depth is L.

Total number of parameters p (θ ∈ R^p), and sample size n.

Jason Lee

Page 12

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 13

Architecture Design

Designing the Architecture

Goal: Design the architecture so that gradient descent finds good solutions (e.g. no spurious local minimizers) [Livni et al.].

Figure: SGD succeeds on the right loss function, but fails on the left in finding global minima.

Jason Lee

Page 14

Architecture Design: Overparametrization

Empirical Observation: It is easier for SGD to optimize larger architectures.

Jason Lee

Page 15

Practical Landscape Design - Overparametrization

[Figure: two panels plotting Objective Value (0 to 0.5) against Iterations (×10⁴).]

Figure: Experiment first done by Livni-Shalev-Shwartz-Shamir 2014.

Jason Lee

Page 16

Overparametrization

Conventional Wisdom on Overparametrization

If SGD is not finding a low training error solution, then fit a more expressive model until the training error is near zero.

Problem

How much over-parametrization do we need to efficiently optimize + generalize?

Adding parameters increases computational and memory cost.

Too many parameters may lead to overfitting (???).

Jason Lee

Page 17

How much Overparametrization to Optimize?

Motivating Question

How much overparametrization ensures success of SGD?

For arbitrary labels, p ≫ n is necessary, where p is the number of parameters.

Can the amount of overparametrization adapt to latent structure in the labels?

Jason Lee

Page 18

Taxonomy of Overparametrization

Geometry-based: All local minima are global; all strict local minima are global.

Local Dynamics (Lazy/Kernel Regime): Utilize the local linear behavior of the network predictions.

Global Dynamics (Active Training, Mean Field): Characterized by large changes in the parameters.

Jason Lee

Page 19

Architecture Design: Skip Connections

Skip connections/ResNets avoid gradient vanishing due to depth.

Jason Lee

Page 20

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 21

Overview of Geometric Results

Decompose into two steps:

Gradient-based algorithms find first-order stationary points or second-order stationary points (Pemantle 92, Ge et al. 15, Lee et al. 16, Jin et al. 17).

Establish that all first-order/second-order stationary points (or local minima) are global minimizers.

Jason Lee

Page 22

One-point convexity

Gradient methods can converge even when the function is non-convex.

Quasi-convex

∇L(θ)ᵀ(θ − θ*) ≥ d(θ, θ*),

where d is some distance measure to optimality.

d(θ, θ*) = L(θ) − L(θ*): quasi-convex.

d(θ, θ*) = η‖θ − θ*‖² + β‖∇L(θ)‖²: regularity condition or correlation condition.

Jason Lee

Page 23

Single Neuron/Filter and Local Results

One-point convex

Single neuron/filter models have an even stronger property (with distribution assumptions on x):

∇_θ L(θ)ᵀ(θ − θ*) > 0.

y = σ(wᵀx) (Candes et al., Kalai et al., Kakade et al., Mei et al., Soltanolkotabi, Goel et al.)

Single filter: y = ∑_j σ((S_j w)ᵀ x) (Brutzkus and Globerson, Du et al., Goel et al.)

If W* ≈ I, then a two-layer ReLU network is one-point convex in a small region around W* (Li and Yuan, Zhong et al.). ResNets have a large basin of attraction around the identity.

Linear Dynamical Systems (Hardt et al.)

Jason Lee

Page 24

Polyak Condition

In over-parametrized models, we frequently do not know what θ* is because it is non-identifiable.

Polyak Gradient Domination

‖∇L(θ)‖ ≥ L(θ) − L(θ*).

Local convergence for over-parametrized models (SJL18)

Global convergence for over-parametrized models (DZPS18, DLLWZ18)

Linear residual networks (Hardt and Ma) satisfy the Polyak condition in a large region around the initialization.
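For completeness, a textbook sketch (not from the slides) of why gradient domination gives global convergence, stated for the squared Polyak-Lojasiewicz form ‖∇L(θ)‖² ≥ 2μ(L(θ) − L(θ*)) and a β-smooth loss:

```latex
% Descent lemma with step size eta = 1/beta, followed by the PL inequality:
\begin{aligned}
L(\theta_{k+1}) &\le L(\theta_k) - \tfrac{1}{2\beta}\,\|\nabla L(\theta_k)\|^2 \\
                &\le L(\theta_k) - \tfrac{\mu}{\beta}\,\bigl(L(\theta_k) - L(\theta^*)\bigr) \\
\Longrightarrow\quad
L(\theta_k) - L(\theta^*) &\le \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{k}\,\bigl(L(\theta_0) - L(\theta^*)\bigr).
\end{aligned}
```

Note that the argument never needs to know θ*, which is exactly why gradient domination is convenient for non-identifiable over-parametrized models.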

Jason Lee

Page 25

Gradient Vanishing

The key is to avoid spurious gradient vanishing.

What to do if the gradient is zero?

L(θ) = L(θ_0) + ∇L(θ_0)ᵀ(θ − θ_0) (= 0) + ½ (θ − θ_0)ᵀ ∇²L(θ_0)(θ − θ_0)

Try to find a direction θ − θ_0 so that (θ − θ_0)ᵀ ∇²L(θ_0)(θ − θ_0) < 0.

Jason Lee


Page 27

Algorithms that Avoid Strict Saddle

Using second-order information, it is easy to find SOSP:

Algorithm 1 Second-Order Method (Royer and Wright)

for k = 0, 1, 2, ... do
  Step 1. (First-Order)
  if ‖∇L(θ_k)‖ ≤ ε_g then
    Go to Step 2;
  else
    Set d_k = −∇L(θ_k); θ_{k+1} = θ_k + η d_k
  end if
  Step 2. (Second-Order) Compute an eigenpair (v_k, λ_k) where λ_k = λ_min(∇²L(θ_k)) and v_kᵀ ∇L(θ_k) ≤ 0.
  if ‖∇L(θ_k)‖ ≤ ε_g and λ_k ≥ −ε_H then
    Terminate;
  else if λ_k < −ε_H then
    (Negative Curvature) Set d_k = v_k; Set θ_{k+1} = θ_k + η d_k.
  end if
end for
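Below is a hedged Python sketch of the two-phase structure of Algorithm 1 (gradient step while the gradient is large, otherwise a negative-curvature step), run on a toy function with a strict saddle at the origin; derivatives are taken by finite differences for brevity, and this is only the skeleton, not the Royer-Wright method verbatim.

```python
import numpy as np

# Toy run of the two-phase scheme: first-order steps while the gradient is
# large, otherwise step along an eigenvector of most negative curvature.
def gradient(f, x, h=1e-4):
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)])

def hessian(f, x, h=1e-4):
    E = np.eye(x.size)
    H = np.array([[(f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                    - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h * h)
                   for ej in E] for ei in E])
    return 0.5 * (H + H.T)

f = lambda x: x[0] ** 2 - x[1] ** 2 + x[1] ** 4      # strict saddle at (0, 0)
x, eta, eps_g, eps_H = np.zeros(2), 0.1, 1e-6, 1e-6
for k in range(300):
    g = gradient(f, x)
    if np.linalg.norm(g) > eps_g:            # Step 1: first-order progress
        x = x - eta * g
        continue
    lam, V = np.linalg.eigh(hessian(f, x))   # Step 2: smallest eigenpair
    if lam[0] >= -eps_H:
        break                                # approximate SOSP: terminate
    v = V[:, 0]
    x = x + eta * (-v if v @ g > 0 else v)   # negative-curvature step, v^T grad <= 0
print("terminated at", x, "with f(x) =", f(x))
```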

Jason Lee

Page 28

Where will the second-order algorithm terminate?

The second-order algorithm makes progress until both of the following hold:

1 ∇L(θ) = 0

2 ∇²L(θ) ⪰ 0.

If either of these two conditions is violated, then the algorithm can still make progress. Thus if θ satisfies the above two conditions, we call it second-order stationary.

Jason Lee

Page 29

Strict Saddle aka Second-order Stationary Point

A critical point θ* is a second-order stationary point (SOSP) if

1 ∇L(θ*) = 0,

2 ∇²L(θ*) ⪰ 0.

SOSP ≈ local minimum

Jason Lee

Page 30

Detour: Higher-order saddles

There is an obvious generalization to escaping higher-order saddles that requires computing negative eigenvalues of higher-order tensors.

Third-order saddles can be escaped (Anandkumar and Ge 2016).

It is NP-hard to escape 4th-order saddles.

Neural nets of depth L will generally have saddles of order L.

Escaping second-order stationary points in manifold-constrained optimization is of the same difficulty as the unconstrained case. Escaping second-order stationary points in constrained optimization is NP-hard (copositivity testing).

Jason Lee

Page 31

How about Gradient Methods?

Can gradient methods with no access to the Hessian avoid saddle points?

Typically, algorithms only use gradient access.

Naively, you may think that if the gradient vanishes then the algorithm cannot escape, since it cannot “access” second-order information.

Randomness

The above intuition may hold without randomness, but imagine that ∇L(0) = 0 (with H = ∇²L(0)). We run GD from a small perturbation Z of 0:

θ_t = (I − ηH)ᵗ Z.

GD can see second-order information when near saddle points.

Jason Lee


Page 33

How about Gradient Methods?

Gradient flow diverges from (0, 0) unless initialized on y = −x.

This picture completely generalizes to general non-convex functions.

Jason Lee

Page 34

More Intuition

Gradient descent near a saddle point is power iteration:

f(x) = ½ xᵀHx

x_k = (I − ηH)ᵏ x_0

It converges to the saddle point 0 iff x_0 is in the span of the positive eigenvectors.

As long as there is one negative eigenvector, this set is measure 0.

Thus for indefinite quadratics, the set of initial conditions that converge to a saddle is measure 0.
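A tiny NumPy check of this picture (assumed diagonal H with one negative eigenvalue): a generic initialization escapes the saddle, while an initialization in the span of the positive eigenvectors converges to it.

```python
import numpy as np

# Gradient descent on f(x) = (1/2) x^T H x is exactly x_{k+1} = (I - eta*H) x_k.
rng = np.random.default_rng(1)
H = np.diag([2.0, 1.0, -0.5])          # one negative eigenvalue: strict saddle at 0
eta = 0.1
M = np.eye(3) - eta * H

x_random = rng.normal(size=3)          # generic initialization
x_stable = np.array([1.0, 1.0, 0.0])   # lies in the span of the positive eigenvectors

for _ in range(200):
    x_random = M @ x_random            # blows up along the negative direction
    x_stable = M @ x_stable            # contracts to the saddle at 0

print("generic init:", np.round(x_random, 3))        # third coordinate diverges
print("stable-manifold init:", np.round(x_stable, 6))  # essentially zero
```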

Jason Lee


Page 36

Avoiding Saddle-points

Theorem (Pemantle 92, Ge et al. 2015, Lee et al. 2016)

Assume the function f is smooth and coercive (lim_{‖x‖→∞} ‖∇f(x)‖ = ∞). Then Gradient Descent with noise finds a point with

‖∇f(x)‖ < ε_g

λ_min(∇²f(x)) ≥ −ε_H,

in poly(1/ε_g, 1/ε_H, d) steps.

Gradient descent with random initialization asymptotically finds a SOSP.

Gradient-based algorithms find SOSP.

Jason Lee

Page 37

SOSP

We only need a) gradient non-vanishing or b) Hessian non-negative, so this is a strictly larger set of problems than before.

Jason Lee

Page 38

Why are SOSP interesting?

All SOSP are global minimizers and SGD/GD find the global min:

1 Matrix Completion (GLM16, GJZ17, ...)

2 Rank k Approximation (classical)

3 Matrix Sensing (BNS16)

4 Phase Retrieval (SQW16)

5 Orthogonal Tensor Decomposition (AGHKT12,GHJY15)

6 Dictionary Learning (SQW15)

7 Max-cut via Burer Monteiro (BBV16, Montanari 16)

8 Overparametrized Networks with Quadratic Activation (DL18)

9 ReLU network with two neurons (LWL17)

10 ReLU networks via landscape design (GLM18)

Jason Lee

Page 39

Which neural nets are strict saddle?

Quadratic Activation (Du-Lee, Journee et al., Soltanolkotabi et al.)

f(W; x) = ∑_{j=1}^m a_j (w_jᵀ x)²

with over-parametrization (m ≳ min(√n, d)) and any standard loss.

Jason Lee

Page 40

Which neural nets are strict saddle?

ReLU activation

f*(x) = ∑_{j=1}^k σ(w*_jᵀ x)

Tons of assumptions:

1 Gaussian x

2 no negative output weights

3 k ≤ d

The loss function with the strict saddle property is complicated. Essentially the loss encodes tensor decomposition.

Jason Lee

Page 41

More strict saddle

Two-neuron network with orthogonal weights (Luo et al.), proved using extraordinarily painful trigonometry.

One convolutional filter with non-overlapping patches (Brutzkus and Globerson).

Jason Lee

Page 42

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 43

Non-linear Least Squares (NNLS) Perspective

Folklore

Optimization is “easy” when parameters > sample size.

View the loss as a NNLS problem:

∑_{i=1}^n (f_i(θ) − y_i)²,  where f_i(θ) = f_θ(x_i) = prediction with parameter θ.

Jason Lee

Page 44

Stationary Points of NNLS

The Jacobian J ∈ R^{p×n} has columns ∇_θ f_i(θ).

Let the error be r_i = f_i(θ) − y_i. The stationarity condition is

J(θ) r(θ) = 0.

J is a tall matrix when over-parametrized, so at “most” points σ_min(J) > 0.

Jason Lee

Page 45

NNLS continued

Imagine that magically you found a critical point with σ_min(J) > 0. Then

‖J(θ) r(θ)‖ ≤ ε  ⟹  ‖r(θ)‖ ≤ ε / σ_min(J),

and thus it is globally optimal!

Takeaway: If you can find a critical point (which GD/SGD do) and ensure J is full rank, then it is a global optimum.
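A numerical illustration of the takeaway, with illustrative sizes: for an overparametrized two-layer tanh network, a finite-difference Jacobian J typically has σ_min(J) > 0, so the residual obeys ‖r‖ ≤ ‖Jr‖ / σ_min(J).

```python
import numpy as np

# Overparametrized two-layer net (p >> n): check that the prediction Jacobian
# has sigma_min > 0 and that ||r|| <= ||J r|| / sigma_min(J) holds numerically.
rng = np.random.default_rng(2)
n, d, m = 10, 4, 200                       # p = m*d + m >> n
X = rng.normal(size=(n, d))

def predictions(theta):
    W = theta[: m * d].reshape(m, d)
    a = theta[m * d:]
    return np.tanh(X @ W.T) @ a            # smooth activation for clean derivatives

theta = rng.normal(size=m * d + m) / np.sqrt(m)
eps = 1e-6
J = np.stack([(predictions(theta + eps * e) - predictions(theta - eps * e)) / (2 * eps)
              for e in np.eye(theta.size)])          # p x n, columns are grad f_i

sigma_min = np.linalg.svd(J, compute_uv=False)[-1]
r = predictions(theta) - rng.normal(size=n)          # residual against random labels
print("sigma_min(J):", sigma_min)
print("||r|| vs ||J r|| / sigma_min:",
      np.linalg.norm(r), np.linalg.norm(J @ r) / sigma_min)
```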

Jason Lee

Page 46

Other losses

Consider

∑_i ℓ(f_i(θ), y_i).

Critical points have the form

J(θ) r(θ) = 0  with  r_i = ℓ′(f_i(θ), y_i),

and so

‖r(θ)‖ ≤ ε / σ_min(J).

For almost all commonly used losses, ℓ(z) ≲ ℓ′(z), including cross-entropy.

Jason Lee

Page 47

NNLS (continued)

Question

How do we find non-degenerate critical points?

Short answer*: No one knows.

Nuanced answer: For almost all θ, J(θ) is full rank when over-parametrized. Thus “almost all” critical points are global minima.

Jason Lee


Page 50

Several Attempts

Strategy 1: Auxiliary randomness ω, so that J(θ, ω) is full rank even when θ depends on the data (Soudry-Carmon). The guarantees suggest that SGD with auxiliary randomness can find a global minimum.

Strategy 2: Pretend it is independent (Kawaguchi).

Strategy 3: Punt on the dependence. Theorems say “almost all critical points are global” (Nguyen and Hein, Nouiehed and Razaviyayn).

Jason Lee

Page 51

Geometric Viewpoint

Question

What do these results have in common?

Our goal is to minimize L(f) = ‖f − y‖².

Imagine that you are at f_0 which is non-optimal. Due to convexity, −(f − y) is a first-order descent direction.

The parameter space enters through f_θ(x), so let us say f_{θ_0} = f_0. For θ to “mimic” the descent direction, we need

J_f(θ_0)(θ − θ_0) = y − f.

Jason Lee

Page 52

Inverse Function Theorem (Informal)

What if J_f is zero? Then we can try to solve ∇²f(θ_0)[(θ − θ_0)^⊗2] = −(f − y). This will give a second-order descent direction, and allow us to escape all SOSP.

And so forth: If we can solve ∇^k f(θ_0)[(θ − θ_0)^⊗k] = y − f, this will allow us to escape a kth-order saddle.

Since we do not know y − f, we just compute the minimal eigenvector to find such a direction.

Jason Lee

Page 53

Non-Algorithmic No Spurious Local Minima

No Spurious Local Minima (Nouiehed and Razaviyayn)

Fundamentally, if the map f(B_{θ_0}) is locally onto, then there exists an escape direction.

This does not mean you can efficiently find the direction (e.g. 4th order and above). Contrast this to the strict saddle definition.

Jason Lee

Page 54

Relation to overparametrization (Informal):

y ∈ Rⁿ, so we need at least dim(θ) = p ≥ n.

Imagine you had a two-layer net f(x) = aᵀσ(Wx), and the hidden layer is super wide, m ≥ n. Then, as long as W is full rank, we can treat only a as the variable and solve ∇_a f(a_0, W_0)[a − a_0, 0] = y − f. Thus if W is fixed, all critical points in a are global minima.

Now imagine that W is also a variable. The only potential issue is if σ(WX) is a rank-degenerate matrix.

Thus imagine that (a, W) is a local minimum where the error is not zero. We can make an infinitesimal perturbation to W to make it full rank, and then a perturbation to a to move in the direction of y − f to escape. Thus there are no spurious local minima.

Papers of this “flavor”: Poston et al., Yu, Nguyen and Hein, Nouiehed and Razaviyayn, Haeffele and Vidal, Venturi et al.

Jason Lee

Page 55

Theorem (First form)

Assume that f(x) = W_L σ(W_{L−1} ⋯ σ(W_1 x)). There is a layer with width m_l > n and m_l ≥ m_{l+1} ≥ m_{l+2} ≥ … ≥ m_L.
Then almost all local minimizers are global.

Theorem (Second form)

Similar assumptions as above.
There exists a path with non-increasing loss value from every parameter θ to a global min θ*. This implies that every strict local minimizer is a global min.

These results generally require m ≥ n (at least one layer that is very wide).

They are non-algorithmic and do not have any implications for SGD finding a global minimum (higher-order saddles etc.).

Jason Lee

Page 56

Connection to Frank-Wolfe

In Frank-Wolfe or Gradient Boosting, the goal is to find a search direction that is correlated with the residual.
The direction we want to go in is f_i(θ) − y_i. If the weak classifier is a single neuron, then a two-layer classifier is the boosted version (same as Barron’s greedy algorithm):

f(x) = ∑_{j=1}^m a_j σ(w_jᵀ x),  where each σ(w_jᵀ x) is a weak classifier.

At every step, try to find w with

σ(wᵀ x_i) = f_i(θ) − y_i.

Jason Lee

Page 57

Similar to GD

Frank-Wolfe basically introduces a neuron at zero and does a local search step on the parameter.
It has the same issues:

If σ(wᵀx) can make first-order progress (meaning it is strictly positively correlated with f − y), then GD will find this.

Otherwise we need to find a direction of higher-order correlation with f − y, and this is likely hard.

Notable exceptions: quadratic activation requires an eigenvector computation (Livni et al.) and monomial activation requires a tensor eigenvalue.

Jason Lee

Page 58

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 59

How to get Algorithmic Results?

Most NNs are not strict saddle, and the “all local minima are global” style results have no algorithmic implications.
What are the few cases where we do have algorithmic results?

Optimizing a single layer (Random Features).

Local results (Polyak condition).

Let’s try to use these two building blocks to get algorithmic results.

Jason Lee


Page 61

Random Features Review

Consider functions of the form

f(x) = ∫ φ(x; θ) c(θ) dω(θ),  sup_θ c(θ) < ∞.

Rahimi and Recht showed that this induces an RKHS with kernel

K_φ(x, x′) = E_ω[φ(x; θ)ᵀ φ(x′; θ)].
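A hedged Monte Carlo sketch of this construction for ReLU features φ(x; w) = max(wᵀx, 0) with w ∼ N(0, I): the empirical average over m random features approaches the induced kernel, which for this particular choice has a known closed form (the degree-1 arc-cosine kernel of Cho and Saul, quoted here only for comparison; it is not from the slides).

```python
import numpy as np

# Empirical random-features kernel vs. its infinite-feature limit for ReLU.
rng = np.random.default_rng(3)
d, m = 5, 200_000
x, xp = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(size=(m, d))                       # random feature directions w
k_mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

cos_t = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
theta = np.arccos(np.clip(cos_t, -1, 1))
k_arccos = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))

print("Monte Carlo estimate:", k_mc, "  arc-cosine closed form:", k_arccos)
```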

Jason Lee

Page 62

Relation to Neural Nets (Warm-up)

Two-layer Net

f_θ(x) = ∑_{j=1}^m a_j σ(w_jᵀ x), and imagine m → ∞.

Define the measure c(w_j/‖w_j‖) ∝ |a_j| ‖w_j‖_2; then

f_c(x) = ∫ φ(x; θ) c(θ) dω(θ).

If m is large enough, any function of the form f(x) = ∫ φ(x; θ) c(θ) dω(θ) can be approximated by a two-layer network.

Jason Lee

Page 63

Random Features

Theorem

Let f = ∫ φ(x; θ) c(θ) dω(θ). Then there is a function of the form f̂(x) = ∑_{j=1}^m a_j φ(x; θ_j) with

‖f̂ − f‖ ≲ ‖c‖_∞ / √m.

span({φ(·, θ_j)}) is dense in H(K_φ).

Jason Lee

Page 64

Function Classes Learnable via SGD

Proof Strategy (Andoni et al., Daniely):

1 Assume the target f* ∈ H(K_φ), or approximable by H(K_φ) up to some tolerance.

2 Show that SGD learns something as competitive as the best in H(K_φ).

Jason Lee


Page 67

Step 1

1 Write K(x, y) = g(ρ) = ∑_i c_i ρ^i.

2 Thus φ(x)_i = √c_i x^i is a feature map.

3 Using this, we can write p(x) = ∑_j p_j x^j = ⟨w, φ(x)⟩ for w_j = p_j / √c_j.

4 Thus if the c_j decay quickly, then ‖w‖_2 won’t be too huge. The RKHS norm and sample complexity are governed by ‖w‖_2.

Conclusion: Polynomials and some other simple functions are in the RKHS.

Jason Lee


Page 72

Step 2

Restrict to two-layer.

Optimizing only the output layer

Consider f_θ(x) = aᵀσ(Wx), and we only optimize over a. This is a convex problem.

Algorithm: Initialize the w_j uniformly over the sphere, then compute

f̂(x) = argmin_a ∑_i L(f_{a,w}(x_i), y_i).

Guarantee (via Rahimi-Recht):

‖f̂ − f‖ ≲ ‖f‖ / √m.

Jason Lee


Page 75

Both layers

If we optimize both layers, the optimization is non-convex.

Morally, this non-convexity is harmless. We only need to show that optimizing the w_j does not hurt!

Strategy:

Initialize a_j ≈ 0 and ‖w_j‖ = O(1).

∇_{a_j} L(θ) = σ(w_jᵀ x) and ∇_{w_j} L(θ) = a_j σ′(w_jᵀ x) x.

∇_{w_j} L(θ) ≈ 0, so the w_j do not move under SGD.

The a_j converge quickly to their global optimum w.r.t. w_j = w_j^0, since w_j ≈ w_j^0 for all time.
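A quick numerical check of this strategy (single example, illustrative sizes): with a_j ≈ 0 at initialization, the gradient with respect to the hidden weights is smaller than the gradient with respect to the output weights by roughly the factor |a_j|.

```python
import numpy as np

# With small output weights, grad_W is suppressed by the factor |a_j|,
# so the hidden layer barely moves early in training.
rng = np.random.default_rng(4)
d, m = 10, 500
x, y = rng.normal(size=d), 1.0
a = 1e-3 * rng.normal(size=m)                 # a_j ~ 0 at initialization
W = rng.normal(size=(m, d)) / np.sqrt(d)      # ||w_j|| = O(1)

h = np.maximum(W @ x, 0.0)
r = a @ h - y                                  # residual on a single example
grad_a = r * h                                 # d/da_j = r * sigma(w_j^T x)
grad_W = r * np.outer(a * (W @ x > 0), x)      # d/dw_j = r * a_j * sigma'(w_j^T x) x

print("||grad_a|| =", np.linalg.norm(grad_a))
print("||grad_W|| =", np.linalg.norm(grad_W))  # smaller by roughly |a_j| ~ 1e-3
```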

Jason Lee


Page 79

Theorem

Fix a target function f* and let m ≳ ‖f*‖²_H. Initialize the network so that |a_j| ≪ ‖w_j‖_2. Then the learned network f̂ satisfies

‖f̂ − f*‖ ≲ ‖f*‖_H / √m.

This is roughly what Daniely and Andoni et al. are doing.

Jason Lee

Page 80

Deeper Networks

The idea is similar:

f_θ(x) = ∑_j a_j σ(w_j^{Lᵀ} x^{L−1})

Define φ(x; θ_j) = σ(w_j^{(0)Lᵀ} x^{L−1}), which induces some K_φ.

SGD on just a is simply training a random-features scheme for this deep kernel K_φ.

The initialization is special in that a moves much more than w during training, so the kernel is almost stationary.

Jason Lee

Page 81

1 Overparametrization and Architecture Design

2 Geometric Results on Overparametrization
  Review Non-convex Optimization
  Non-Algorithmic Results

3 Algorithmic Results
  Gradient Dynamics: NTK

4 Limitations

Jason Lee

Page 82

Other Induced Kernels

Recap

f_θ(x) = ∑_j a_j σ(w_jᵀ x)

If only the a_j change, then we get the kernel

K(x, x′) = E[σ(w_jᵀ x) σ(w_jᵀ x′)].

Somewhat unsatisfying: the non-convexity is all in the w_j, and they are held fixed throughout the dynamics.

All weights moving

More general viewpoint. Consider if both a and w move:

f_θ(x) ≈ f_0(x) + ∇_θ f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Jason Lee


Page 84

Neural Tangent Kernel

Back up and consider f_θ(·) to be any nonlinear function:

f_θ(x) ≈ f_0(x) (≈ 0) + ∇_θ f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Assumptions:

The second-order term is “negligible”.

f_0 is negligible, which can be argued using initialization + overparametrization.

References:

Kernel viewpoint: Jacot et al., (Du et al.) ×2, (Arora et al.) ×2, Chizat and Bach, Lee et al., E et al.

Pseudo-network: Li and Liang, (Allen-Zhu et al.) ×5, Zou et al.

Jason Lee


Page 86

Tangent Kernel

Under these assumptions,

f_θ(x) ≈ f̂_θ(x) = (θ − θ_0)ᵀ ∇_θ f(θ_0).

This is a linear classifier in θ.

The feature representation is φ(x; θ_0) = ∇_θ f(θ_0).

This corresponds to using the kernel

K = ∇f(θ_0)ᵀ ∇f(θ_0).
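A sketch of this construction for a toy two-layer ReLU network: treat ∇_θ f(x; θ_0) as a feature map, build the n × n Gram matrix K = JᵀJ from a finite-difference Jacobian (autodiff would be used in practice), and check that it is positive semi-definite. Sizes are illustrative and ReLU's kink at zero is ignored.

```python
import numpy as np

# Empirical tangent kernel at initialization: K = J^T J with J the
# parameter Jacobian of the network predictions on n inputs.
rng = np.random.default_rng(5)
n, d, m = 8, 3, 256
X = rng.normal(size=(n, d))
W0 = rng.normal(size=(m, d))                   # NTK-style O(1) weights
a0 = rng.normal(size=m)
theta0 = np.concatenate([W0.ravel(), a0])

def f(theta):                                  # f_theta(x) = (1/sqrt(m)) a^T relu(Wx)
    W, a = theta[: m * d].reshape(m, d), theta[m * d:]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

eps = 1e-4
J = np.stack([(f(theta0 + eps * e) - f(theta0 - eps * e)) / (2 * eps)
              for e in np.eye(theta0.size)])   # p x n Jacobian at theta0
K = J.T @ J                                    # empirical NTK Gram matrix (n x n)
print("min eigenvalue of K (>= 0 up to numerical error):",
      np.linalg.eigvalsh(K).min())
```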

Jason Lee


Page 88

What is this kernel?

Neural Tangent Kernel (NTK)

K = ∑_{l=1}^{L+1} α_l K_l,  where K_l = ∇_{W_l} f(θ_0)ᵀ ∇_{W_l} f(θ_0).

Two-layer

K_1 = ∑_j a_j² σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′  and  K_2 = ∑_j σ(w_jᵀ x) σ(w_jᵀ x′).
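The two-layer kernels above can be evaluated directly for ReLU as finite sums over neurons, as in this short sketch (no 1/m scaling here; that depends on the parametrization discussed on the next slides).

```python
import numpy as np

# K_1 comes from the gradients w.r.t. the w_j, K_2 from the gradients w.r.t. a_j.
rng = np.random.default_rng(6)
d, m = 4, 10_000
x, xp = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

s, sp = W @ x, W @ xp
K1 = np.sum(a ** 2 * (s > 0) * (sp > 0)) * (x @ xp)   # sum_j a_j^2 sigma'(.) sigma'(.) x^T x'
K2 = np.sum(np.maximum(s, 0) * np.maximum(sp, 0))     # sum_j sigma(.) sigma(.)
print("K1 =", K1, " K2 =", K2, " NTK = K1 + K2 =", K1 + K2)
```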

Jason Lee

Page 89

Kernel is initialization dependent

K_1 = ∑_j a_j² σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′  and  K_2 = ∑_j σ(w_jᵀ x) σ(w_jᵀ x′),

so how a and w are initialized matters a lot.

Imagine ‖w_j‖² = 1/d and |a_j|² = 1/m; then only K = K_2 matters (Daniely, Rahimi-Recht).

“NTK parametrization”: f_θ(x) = (1/√m) ∑_j a_j σ(w_jᵀ x), with |a_j| = O(1), ‖w_j‖ = O(1); then

K = K_1 + K_2.

This is what is done in Jacot et al., Du et al., Chizat & Bach.

Li and Liang consider the case when |a_j| = O(1) is fixed and only w is trained:

K = K_1.

Jason Lee

Page 90

Initialization and LR

Through different initializations / parametrizations / layerwise learning rates, you can get

K = ∑_{l=1}^{L+1} α_l K_l,  where K_l = ∇_{W_l} f(θ_0)ᵀ ∇_{W_l} f(θ_0).

NTK should be thought of as this family of kernels.

Rahimi-Recht and Daniely studied the special case where only K_2 matters and the other terms disappear.

Jason Lee

Page 91

Infinite-width

For theoretical analysis, it is convenient to look at infinite width to remove the randomness from initialization.

Infinite-width

Initialize a_j ∼ N(0, s_a²/m) and w_j ∼ N(0, s_w² I/m). Then

K_1 = s_a² E_w[σ′(w_jᵀ x) σ′(w_jᵀ x′) xᵀx′]

K_2 = s_w² E_w[σ(w_jᵀ x) σ(w_jᵀ x′)].

These have ugly closed forms in terms of xᵀx′, ‖x‖, ‖x′‖.

Jason Lee

Page 92

Deep net Infinite-Width

Let a^(l) = W_l σ(a^(l−1)) be the pre-activations, with σ(a^(0)) := x. When the widths m_l → ∞, the pre-activations follow a Gaussian process, with covariance functions given by:

Σ^(0)(x, x′) = xᵀx′

A^(l) = [[Σ^(l−1)(x, x), Σ^(l−1)(x, x′)], [Σ^(l−1)(x′, x), Σ^(l−1)(x′, x′)]]

Σ^(l)(x, x′) = E_{(u,v)∼N(0, A^(l))}[σ(u) σ(v)].

lim_{m_l→∞} K_{L+1} = Σ^(L) gives us the kernel of the last layer (Lee et al., Matthews et al.).

Define the gradient kernels as Σ̇^(l)(x, x′) = E_{(u,v)∼N(0, A^(l))}[σ′(u) σ′(v)]. Using the backprop equations and Gaussian process arguments (Jacot et al., Lee et al., Du et al., Yang, Arora et al.), one gets

K_l(x, x′) = Σ^(l−1)(x, x′) · ∏_{l′=l}^{L} Σ̇^(l′)(x, x′).
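A Monte Carlo sketch of this recursion for ReLU activations (sizes and names illustrative): propagate the 2×2 covariance A^(l), estimate Σ^(l) and Σ̇^(l) by sampling (u, v) ∼ N(0, A^(l)), and combine them into the layerwise kernels K_l.

```python
import numpy as np

# Infinite-width covariance recursion for ReLU, with the expectations over
# (u, v) ~ N(0, A^(l)) estimated by sampling instead of closed forms.
rng = np.random.default_rng(7)
relu = lambda z: np.maximum(z, 0.0)

def gp_layers(x, xp, L, n_mc=200_000):
    Sig = np.array([[x @ x, x @ xp], [xp @ x, xp @ xp]])     # Sigma^(0) as a 2x2 block
    Sigs, Sigdots = [Sig], []
    for _ in range(L):
        uv = rng.multivariate_normal(np.zeros(2), Sigs[-1], size=n_mc)
        u, v = uv[:, 0], uv[:, 1]
        Sigdots.append(np.mean((u > 0) * (v > 0)))           # E[sigma'(u) sigma'(v)]
        Sigs.append(np.array([[np.mean(relu(u) ** 2), np.mean(relu(u) * relu(v))],
                              [np.mean(relu(v) * relu(u)), np.mean(relu(v) ** 2)]]))
    return Sigs, Sigdots

x, xp = np.array([1.0, 0.5]), np.array([0.2, 1.0])
L = 3
Sigs, Sigdots = gp_layers(x, xp, L)
# K_l(x, x') = Sigma^(l-1)(x, x') * prod_{l'=l}^{L} Sigma-dot^(l')(x, x')
K = [Sigs[l - 1][0, 1] * np.prod(Sigdots[l - 1:]) for l in range(1, L + 1)]
print("layerwise NTK terms K_1..K_3:", np.round(K, 4))
```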

Jason Lee

Page 93

NTK Overview

Recall

f_θ(x) = f_0(x) + ∇f_θ(x)ᵀ(θ − θ_0) + O(‖θ − θ_0‖²).

Linearized network (Li and Liang, Du et al., Chizat and Bach):

f̂_θ(x) = f_0(x) + ∇f_θ(x)ᵀ(θ − θ_0)

The network and the linearized network are close if GD ensures ‖θ − θ_0‖_2 is small.

If f_0 ≫ 1, then GD will not stay close to the initialization¹. Thus we need to initialize so that f_0 doesn’t blow up.

¹Probably need f_0 = o(√m), and this is the only place the neural net structure is used.

Jason Lee

Page 94

Initialization size

Common initialization schemes ensure that norms are roughly preserved at each layer. Initialization ensures x_j^(L) = O(1).

x^(l) = σ(W x^(l−1))

f_0(x) = ∑_{j=1}^m a_j x_j^(L)

Important Observation

If a_j² ∼ 1/n_in = 1/m, then f_0(x) = O(1).

For the two-layer case, this was first noticed by Li and Liang. For the deep case, it is used by Jacot et al., Du et al., Allen-Zhu et al., Zou et al.

This initialization is a √m factor smaller than the worst case.

Jason Lee


Page 96

Loss with a unique root (square loss, hinge loss)

Heuristic reasoning:

Define the p × n Jacobian matrix J of f. We need to solve Jᵀ(θ − θ_0) = y − f_0, which has a solution if p ≫ n (and some non-degeneracy).

‖θ − θ_0‖² = (y − f_0)ᵀ(JᵀJ)⁻¹(y − f_0), which does not depend on m (assuming JᵀJ concentrates).

Jason Lee

Page 97

Curvature

As m → ∞ and f_0 = O(1), the amount we need to move is constant:

‖θ − θ_0‖² = (y − f_0)ᵀ(JᵀJ)⁻¹(y − f_0).

Let’s look at how “fast” the prediction function deviates from linear, which is governed by the Hessian of f_θ. Roughly,

‖∇²f_θ(x)‖ = o_m(1) ≈ 1/√m.

Two-layer net (NTK parametrization)

∇²_{w_j} f(x) = (1/√m) a_j σ″(w_jᵀ x) xxᵀ  and  ∇²_{a_j, w_j} f(x) = (1/√m) σ′(w_jᵀ x) x.

The curvature vanishes as the width increases (due to how we parametrize/initialize).

Jason Lee


Page 102

Implication on Training Dynamics

Since the curvature can be made small by overparametrization, the gradient flow dynamics of f_{θ_t} and f̂_{θ_t} can be bounded:

‖f_{θ_t} − f̂_{θ_t}‖_∞ ≤ O(‖∇²f_θ(x)‖) = 1/√m.

In some of the papers, the linearized function f̂ is referred to as a pseudo-network (Li and Liang, (Allen-Zhu et al.) ×5, Zou et al.).

Jason Lee

Page 103

Training Error

What has been proved:

Two-layer convergence to a global minimizer (Li and Liang, Du et al., Oymak and Soltanolkotabi)

Deep nets, convolutional deep nets, ResNets (Du et al., Allen-Zhu et al., Zou et al.)

The requirements on the width are n² or worse². However, with ResNets the width is not depth dependent (Zhang et al.).

²It can be significantly improved with data assumptions to m ≳ ‖f‖²_K.

Jason Lee

Page 104

Learning

What functions can be efficiently learned?

Let K be an induced kernel. The sample complexity of learning a kernel class is

n ≳ ‖f‖²_K / ε².

Write our target function as a linear function in the RKHS (Sridharan et al., Zhang et al.):

K(x, x′) = g(ρ) = ∑ c_i ρ^i = φ(x)ᵀφ(x′),  with φ(x)_i = √c_i x^i.

Suppose f(x) = x^k and σ is a monomial of degree k. Then

f(x) = ⟨(1/√c_k) e_k, φ(x)⟩  and  ‖f‖²_K = 1/c_k.

Jason Lee


Page 106: Survey of Overparametrization and Optimization · Practical observation: Gradient methods nd high quality solutions. Theoretical Side: Exponentially many local minima in square loss

Polynomials

f(x) = \sum_j a_j x^j = \Big\langle \sum_j \tfrac{a_j}{\sqrt{c_j}}\, e_j,\ \phi(x) \Big\rangle, \qquad \|f\|_K^2 = \sum_j \frac{a_j^2}{c_j}.

Multivariate case: let f(x) = \sum_j a_j (w_j^\top x)^j with ‖w_j‖₂ = 1; then \|f\|_K^2 = \sum_j |a_j|^2 / c_j.³

What is learnable?

Constant-degree polynomials, with sample complexity 1/c_k.

Functions whose coefficients a_j decay quickly. For two-layer NTK, c_j ≍ 1/j², so we need \sum_j |a_j|^2 j^2 < \infty (see the sketch below).

³ If the kernel has a nullspace, then this should be \|f\|_K^2 \le \sum_j |a_j|^2 / c_j.
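
The bookkeeping above is easy to play with numerically (my own sketch; the spectrum c_j = 1/j² is taken from the bullet above, and the coefficient profiles are illustrative): compute ‖f‖²_K = Σ_j a_j²/c_j under truncation and watch which profiles stay bounded.

import numpy as np

# Sketch: plug coefficient profiles into ||f||_K^2 = sum_j a_j^2 / c_j with
# c_j = 1/j^2, and report the implied sample complexity n >~ ||f||_K^2 / eps^2.
# Profiles decaying faster than 1/j^{3/2} stay bounded; a_j = 1/j gives
# sum_j 1, which diverges, i.e. that target is not in the kernel class.

eps = 0.1
profiles = {
    "a_j = 2^-j ": lambda j: 2.0 ** (-j),
    "a_j = 1/j^2": lambda j: 1.0 / j ** 2,
    "a_j = 1/j  ": lambda j: 1.0 / j,
}

for J in [10, 100, 1000, 10000]:                 # truncation degree
    j = np.arange(1, J + 1, dtype=float)
    c = 1.0 / j ** 2                             # kernel coefficients
    report = []
    for name, prof in profiles.items():
        norm2 = np.sum(prof(j) ** 2 / c)         # ||f||_K^2
        report.append(f"{name}: ||f||_K^2 = {norm2:9.2f} (n >~ {norm2 / eps ** 2:.0f})")
    print(f"degree <= {J:6d} | " + " | ".join(report))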


Teacher network:

f(x) = \sum_j a_j \sigma(w_j^\top x) = \sum_d \sum_j \alpha_{j,d}\, (w_j^\top x)^d.

All such networks are learnable as long as σ has coefficients decaying fast enough. For deep networks with smooth activations, the argument can be “recursed” (similar to Zhang et al.).

Arora et al. used this to show that when the teacher network has a smooth activation, the NTK can learn it.

Allen-Zhu et al. used a direct construction to find the RKHS function (pseudo-network) instead of the series expansion of the kernel/target.

Cao and Gu, and Daniely, showed that functions in the RKHS are learnable via deep networks.

It was previously known that such teacher networks are learnable with the kernel K(x, y) = 1/(2 − x^⊤y) and its recursive variant (Sridharan et al., Zhang et al.).


When does the kernel regime hold?

Square loss: for m > m_0, the gap to the linearized predictor satisfies ‖f_t − f̄_t‖ ≈ 1/√m for all times t.

Logistic loss: for all times t until L(f_t) ≈ 1/m.

Logistic loss: for very large times t, it is not a kernel predictor (Gunasekar et al., Nacson et al.).

SGD with fresh data and either loss: for small times t, SGD on the kernel and SGD on the network are the same. They will start differing at some point (Mei et al.).


Learning Rate

The learning rate schedule is an important reason (probably the main reason) that networks trained in practice are not in the kernel regime.

NTK Parametrization vs Standard Parametrization

Let’s consider

f_A(x) = \sum_j a_j \sigma(w_j^\top x) \qquad \text{and} \qquad f_B(x) = \frac{1}{\sqrt{m}} \sum_j a_j \sigma(w_j^\top x).

Assume that ‖x‖² = d.

A: standard initialization is w_j ∼ N(0, I/d) and a_j ∼ N(0, 1/m).

B: NTK initialization is w_j ∼ N(0, I/d) and a_j ∼ N(0, 1).

Both initializations ensure that f_0(x) = O(1) and parametrize the same functions.


However, to get the same dynamics on f_t, we need to scale the learning rate (Lee et al.):

\theta^{(A)}_{t+1} = \theta^{(A)}_t - \eta\, \nabla L_1(\theta^{(A)}_t) \quad \longleftrightarrow \quad \theta^{(B)}_{t+1} = \theta^{(B)}_t - m\eta\, \nabla L_2(\theta^{(B)}_t).

Thus, for NTK to learn the same function as SGD on the standard parametrization with a constant learning rate, we need an infinite learning rate on the NTK parametrization (since mη → ∞ as m → ∞).

An infinite learning rate means we would leave the kernel regime.

So in practice, if you keep the same learning rate when using wider and wider networks, NTK won’t be a good approximation.
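
One way to see the m factor numerically (my own diagnostic, not from the slides; the tanh activation and the kernel-trace comparison are my choices): the tangent kernel ⟨∇_θ f(x), ∇_θ f(x′)⟩ at initialization grows linearly in m under the standard parametrization but stays O(1) under the NTK parametrization, so matching the function-space step size forces the learning-rate rescaling above.

import numpy as np

# Sketch: empirical tangent kernels at initialization for
#   f_A(x) =             sum_j a_j tanh(w_j^T x)   with a_j ~ N(0, 1/m)   (standard)
#   f_B(x) = (1/sqrt(m)) sum_j a_j tanh(w_j^T x)   with a_j ~ N(0, 1)     (NTK)
# The ratio of kernel scales grows roughly linearly in m.

rng = np.random.default_rng(2)
d, n = 20, 4
X = rng.standard_normal((n, d))                     # inputs with ||x||^2 ~ d

def tangent_kernel(a, W, scale):
    """Gram matrix of grad_theta f for f(x) = scale * sum_j a_j tanh(w_j^T x)."""
    Z = np.tanh(X @ W.T)                            # n x m
    S = 1.0 - Z ** 2                                # sigma'
    K = scale ** 2 * (Z @ Z.T)                      # from df/da_j = scale * sigma(w_j^T x)
    K += scale ** 2 * ((S * a) @ (S * a).T) * (X @ X.T)  # from df/dw_j = scale * a_j sigma'(w_j^T x) x
    return K

for m in [128, 512, 2048, 8192]:
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a_std = rng.standard_normal(m) / np.sqrt(m)     # standard init
    a_ntk = rng.standard_normal(m)                  # NTK init
    K_std = tangent_kernel(a_std, W, 1.0)
    K_ntk = tangent_kernel(a_ntk, W, 1.0 / np.sqrt(m))
    ratio = np.trace(K_std) / np.trace(K_ntk)
    print(f"m = {m:5d}   trace(K_std) / trace(K_ntk) = {ratio:8.1f}   ratio / m = {ratio / m:.3f}")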


1 If f*(x) = σ(w^⊤x) (a single ReLU), then one needs exponentially many (in d) random features to approximate this model (for K(x, y) = E_w[σ(w^⊤x) σ(w^⊤y)]), or predictors with norm exponential in d (Yehudai and Shamir). If we instead choose the model to be f_θ(x) = Σ_j σ(w_j^⊤x), then it is learnable with O(d) samples (Soltanolkotabi).

2 With m = d^k, one can only learn as well as fitting a degree-k polynomial (Ghorbani et al.).

3 For a simple distribution realizable by four ReLUs, with n ≲ d² samples one does no better than random guessing. The idea is the same as in the first bullet: the RKHS inductive bias is very poorly aligned with targets that are “sparse” in neuron space. Intuition: f* is 1-sparse in the space of neurons, meaning f*(x) = ∫ ρ(w) σ(w^⊤x) dw with ρ a Dirac delta. The RKHS inductive bias is ‖ρ‖₂, which is really terrible when ρ is a Dirac delta. If you could enforce a ‖ρ‖₁ inductive bias instead, the sample complexity is n ≈ d/ε², i.e., O(d).
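
A quick way to quantify that intuition (my own illustrative computation, not from the slides; τ denotes the base measure over neurons): smooth the Dirac delta over a set B of mass ε under τ. The ℓ₁-type norm is unaffected while the ℓ₂ (RKHS-side) norm blows up:

\rho_\epsilon(w) = \frac{1}{\epsilon}\,\mathbf{1}_B(w), \quad \tau(B) = \epsilon \;\Longrightarrow\; \|\rho_\epsilon\|_{L^1(\tau)} = 1 \quad \text{but} \quad \|\rho_\epsilon\|_{L^2(\tau)}^2 = \frac{1}{\epsilon} \xrightarrow[\epsilon \to 0]{} \infty.

So a single neuron is cheap for the ‖ρ‖₁ bias but arbitrarily expensive for the ‖ρ‖₂ (kernel) bias.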


Going beyond NTK

WARNING: What follows are opinions that are only lightly grounded in mathematics.

Question

Does the kernel regime reflect the success of deep learning?

NTK accuracy and SGD accuracy on the same architecture have a gap (Lee et al., Arora et al.).

Can we close this gap and how?


Random thoughts:

Learning rate is key. If you use a small LR, then NTK and SGD find similar predictors (but the test accuracy is not super high). With the same architecture and a tuned LR, the test accuracy is higher. NTK implicitly forces the learning rate to be infinitesimally small, which may hurt learning.

Logistic loss: if you try to solve matrix completion with f_Θ((i, j)) = (UV^⊤)_{ij}, then NTK simply imputes the observed entries (see the sketch below). However, if you run GD for a long time, you get the minimum nuclear norm solution (Srebro, Gunasekar et al.).
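
A small sketch of why the linearized model only “imputes” (my own illustration, not from the slides; the init scaling is my choice): at a random initialization the tangent kernel between two different matrix entries is near zero, so the kernel predictor has almost nothing to say about entries it has not seen.

import numpy as np

# Sketch: for f_Theta((i, j)) = (U V^T)_{ij}, the tangent kernel between entries is
#   Theta((i, j), (k, l)) = <grad f_{ij}, grad f_{kl}>
#                         = 1[i == k] <V_j, V_l> + 1[j == l] <U_i, U_k>.
# At a random init the off-diagonal terms are O(1/sqrt(r)), so the linearized
# (NTK) predictor barely couples different entries.

rng = np.random.default_rng(3)
n, r = 30, 20                                   # matrix size and factor width
U = rng.standard_normal((n, r)) / np.sqrt(r)
V = rng.standard_normal((n, r)) / np.sqrt(r)

entries = [(i, j) for i in range(5) for j in range(5)]      # a few entries
G = np.zeros((len(entries), len(entries)))
for a, (i, j) in enumerate(entries):
    for b, (k, l) in enumerate(entries):
        G[a, b] = (i == k) * (V[j] @ V[l]) + (j == l) * (U[i] @ U[k])

off = G - np.diag(np.diag(G))
print(f"mean |diagonal|     = {np.mean(np.abs(np.diag(G))):.3f}")
print(f"mean |off-diagonal| = {np.mean(np.abs(off)):.3f}   (shrinks further as r grows)")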


Logistic loss (and maybe one-pass SGD): we know that asymptotically (under numerous assumptions) GD converges to a stationary point of arg min_{θ : y_i f_θ(x_i) ≥ 1} ‖θ‖₂ (Nacson et al., Li and Lv).⁴ Even for simple models, ℓ₂ regularization on the parameters leads to an interesting inductive bias:

For deep linear nets, the Schatten-2/L norm, which promotes low rank.

For the linear model β = θ₁ ⊙ θ₂, one gets ‖β‖₁ (see the short derivation after this list).

For a two-layer ReLU net, f(x) = ∫ ρ(w) σ(w^⊤x) dw, one gets ‖ρ‖₁ (Neyshabur et al., Bengio et al., Wei et al.).

For deep ReLU nets, a size-free complexity bound (Golowich et al.).
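
For the β = θ₁ ⊙ θ₂ bullet, the computation is one line (spelled out here for completeness; it is the standard AM–GM argument):

\min_{\theta_1 \odot \theta_2 = \beta} \|\theta_1\|_2^2 + \|\theta_2\|_2^2 = \sum_i \min_{\theta_{1,i}\theta_{2,i} = \beta_i} \left(\theta_{1,i}^2 + \theta_{2,i}^2\right) = \sum_i 2|\beta_i| = 2\|\beta\|_1,

using a^2 + b^2 \ge 2|ab| with equality at |a| = |b| = \sqrt{|\beta_i|}. Minimizing the ℓ₂ norm of the parameters therefore acts like an ℓ₁ penalty on β.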

However, we should NOT expect to get the global max-margin solution except in special examples such as matrix sensing.

Question: If I initialize at the NTK solution, which stationary point of arg min_{θ : y_i f_θ(x_i) ≥ 1} ‖θ‖₂ do you converge to? This is what happens in super-wide networks with an infinitesimally small LR.

⁴ The NTK solution is not a stationary point of this.


Questions?

Thank You. Questions?
