Towards Understanding Overparameterized Deep Neural Networks: From Optimization To Generalization

Quanquan Gu

Computer Science Department, UCLA

Joint work with Difan Zou, Yuan Cao and Dongruo Zhou

The Rise of Deep Learning

The Rise of Deep Learning–Over-parameterization

The evolution of the winning entries on the ImageNet challenge.

Alex Krizhevsky et al. 2012. “Imagenet classification with deep convolutional neural networks”. In Advances in neural information processing systems, 1097–1105

The Rise of Deep Learning–Over-parameterization

Top-1 accuracy versus the amount of operations required for a single forward pass in popular neural networks.

Alfredo Canziani et al. 2016. “An analysis of deep neural network models for practical applications”. arXiv preprint arXiv:1605.07678

Empirical Observations for DNNs

Fitting random labels and random pixels on CIFAR10

Chiyuan Zhang et al. 2016. “Understanding deep learning requires rethinking generalization”. arXiv preprint arXiv:1611.03530

Empirical Observations for DNNs

Training ResNets of different sizes on CIFAR10.

Behnam Neyshabur et al. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks”. arXiv preprint arXiv:1805.12076

Empirical Observations for DNNs

An empirical observation

Optimization for DNNs

Question

Why and how can over-parameterized DNNs trained by gradient descent fit any labeling over distinct training inputs?

Challenge of Optimization for DNNs

Challenge

- The objective training loss is highly nonconvex or even nonsmooth.

- Conventional optimization theory can only guarantee finding a second-order stationary point.

Existing Work

A line of research on the optimization theory for training two-layer NNs is limited to the teacher network setting (Goel et al. 2016; Tian 2017; Du et al. 2017; Li and Yuan 2017; Zhong et al. 2017; Zhang et al. 2018).

Limitations of these works:

- Assuming all data are generated from a “teacher network”, which cannot be satisfied in practice.

- Strong assumptions on the training input (e.g., i.i.d. generated from a Gaussian distribution).

- Requiring special initialization methods that are very different from the commonly used ones.

Can we prove convergence under more practical assumptions?

Existing Work

Moreover, under much milder conditions on the training data and initialization, Li and Liang (2018) and Du et al. (2018b) established the global convergence of (stochastic) gradient descent for training two-layer ReLU networks.

- The neural network should be sufficiently over-parameterized (i.e., contain a sufficiently large number of hidden nodes).

- The output of (stochastic) gradient descent can achieve arbitrarily small training loss.

Can similar results be generalized to DNNs?

Training Deep ReLU Networks

Setup

- Training data: S = {(x_i, y_i)}_{i=1,...,n} with input vectors x_i ∈ R^d and labels y_i ∈ {−1, +1}, where ‖x_i‖_2 = 1, (x_i)_d = µ, and ‖x_i − x_j‖_2 ≥ φ whenever y_i ≠ y_j.

- Fully connected ReLU network:

  f_W(x) = v^⊤ σ(W_L^⊤ σ(W_{L−1}^⊤ · · · σ(W_1^⊤ x) · · · )),

  with weight matrices W_l ∈ R^{m×m} and output vector v ∈ {±1}^m.

- Classifier: sign(f_W(x)); the prediction on (x_i, y_i) is correct if y_i · f_W(x_i) > 0.

- Empirical risk minimization:

  min_W L_S(W) = (1/n) Σ_{i=1}^n ℓ[y_i · f_W(x_i)],

  where ℓ(z) = log[1 + exp(−z)] is the cross-entropy (logistic) loss.

Training Algorithm

Algorithm 1: (S)GD for training DNNs

1: input: training data {(x_i, y_i)}_{i∈[n]}, step size η, total number of iterations K, minibatch size B.
2: initialization: (W_l^(0))_{i,j} ∼ N(0, 2/m) for all i, j, l.

Gradient Descent
3: for k = 0, . . . , K do
4:   W_l^(k+1) = W_l^(k) − η ∇_{W_l} L_S(W^(k)) for all l ∈ [L]
5: end for
6: output: {W_l^(K)}_{l∈[L]}

Stochastic Gradient Descent
7: for k = 0, . . . , K do
8:   uniformly sample a minibatch B^(k) ⊂ [n] of size B
9:   W_l^(k+1) = W_l^(k) − (η/B) Σ_{s∈B^(k)} ∇_{W_l} ℓ[y_s · f_{W^(k)}(x_s)] for all l ∈ [L]
10: end for
11: output: {W_l^(K)}_{l∈[L]}
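To make the setup concrete, here is a minimal PyTorch sketch of the gradient descent branch of Algorithm 1 for the network f_W(x) = v^⊤σ(W_L^⊤ · · · σ(W_1^⊤ x) · · · ) with N(0, 2/m) initialization, a fixed output vector v ∈ {±1}^m, and the logistic loss. The width, depth, step size, and synthetic data are arbitrary illustrative choices, not the values required by the theory.

```python
import torch

# Illustrative sizes only; the theory requires m = poly(n, L, 1/phi, 1/eps).
n, d, m, L, eta, K = 20, 10, 512, 3, 0.1, 200

# Synthetic unit-norm inputs and +/-1 labels.
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)
y = torch.sign(torch.randn(n))

# Initialization: entries of each W_l drawn i.i.d. from N(0, 2/m).
# (The first layer maps R^d to R^m, the remaining layers R^m to R^m.)
W = [torch.randn(d if l == 0 else m, m) * (2.0 / m) ** 0.5 for l in range(L)]
for Wl in W:
    Wl.requires_grad_(True)
v = torch.sign(torch.randn(m))  # fixed output vector, not trained

def f(X):
    # Forward pass: h -> sigma(W_l^T h) layer by layer, then v^T h.
    h = X
    for Wl in W:
        h = torch.relu(h @ Wl)
    return h @ v

for k in range(K):
    # L_S(W) = (1/n) sum_i log(1 + exp(-y_i f_W(x_i))).
    loss = torch.nn.functional.softplus(-y * f(X)).mean()
    grads = torch.autograd.grad(loss, W)
    with torch.no_grad():
        for Wl, g in zip(W, grads):
            Wl -= eta * g  # full-batch gradient descent step
```

Swapping the full-batch gradient for the average over a sampled minibatch B^(k) gives the SGD branch (lines 7–11).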

Convergence of GD/SGD for Training DNNs

Theorem (Zou et al. 2018, informal)

For any ε > 0, if m = Ω(poly(n, L, φ^{−1}, ε^{−1})), then with high probability, (stochastic) gradient descent converges to a point that achieves ε training loss within K = O(poly(n, L, φ^{−1}, ε^{−1})) iterations.

- The over-parameterization condition and the iteration complexity are polynomial in all problem parameters.

- To achieve zero classification error, it suffices to set ε = log(2)/n (see the short derivation below).

Similar results have also been proved in Allen-Zhu et al. (2018b) and Du et al. (2018b) for the regression problem with the quadratic loss function.
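Why ε = log(2)/n suffices (a short derivation from the definitions above, not stated on the slide): every summand of L_S(W) is nonnegative, so a training loss below log(2)/n forces every individual loss below log 2, which in turn forces every margin to be nonnegative.

```latex
L_S(\mathbf{W}) \le \frac{\log 2}{n}
\;\Longrightarrow\;
\ell\bigl[y_i f_{\mathbf{W}}(x_i)\bigr]
  = \log\!\bigl[1 + \exp\bigl(-y_i f_{\mathbf{W}}(x_i)\bigr)\bigr]
  \le n\,L_S(\mathbf{W}) \le \log 2
\;\Longrightarrow\;
y_i f_{\mathbf{W}}(x_i) \ge 0 \quad \text{for all } i,
```

with strict inequality, and hence zero classification error, as soon as the training loss is strictly below log(2)/n.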

Overview of Proof Technique

W(W^(0), τ) := {W = {W_l}_{l=1}^L : ‖W_l − W_l^(0)‖_F ≤ τ, l ∈ [L]}

For large enough width m:

- For W ∈ W(W^(0), τ_p) with τ_p = O(poly(n, L)), L_S(W) enjoys good curvature properties (e.g., a sufficiently large gradient, near-smoothness, and near-convexity).

- Gradient descent converges with trajectory length τ_opt ≤ O(poly(n) · ε^{−1} m^{−1/2}).

- A sufficiently large m guarantees that τ_opt ≤ τ_p.

Stronger Guarantees for Training DNNs

Theorem (Zou and Gu 2019, informal)

For any ε > 0, if m = Ω(n^8 L^{12} φ^{−4}), then with high probability, gradient descent converges to a point that achieves ε training loss within K = O(n^2 L^2 φ^{−1} log(1/ε)) iterations.

Comparison with Existing Work

Table: Over-parameterization conditions and iteration complexities of GD for training deep neural networks.

                          Over-para. condition                 Iteration complexity                          ReLU?
Du et al. (2018a)         Ω(2^{O(L)} n^4 / λ_min^4(K^(L)))     O(2^{O(L)} n^2 log(1/ε) / λ_min^2(K^(L)))     no
Allen-Zhu et al. (2018b)  Ω(n^{24} L^{12} / φ^8)               O(n^6 L^2 log(1/ε) / φ^2)                     yes
Our work                  Ω(n^8 L^{12} / φ^4)                  O(n^2 L^2 log(1/ε) / φ)                       yes

K^(L) denotes the Gram matrix for the L-hidden-layer neural network in Du et al. (2018a).

Overview of Proof Technique

W(W^(0), τ) := {W = {W_l}_{l=1}^L : ‖W_l − W_l^(0)‖_F ≤ τ, l ∈ [L]}

Improved techniques:

- Prove a larger τ_p such that for W ∈ W(W^(0), τ_p), L_S(W) enjoys good curvature properties.

- Within the region W(W^(0), τ_p), prove a larger gradient lower bound, which leads to faster convergence of GD.

- Provide a sharper characterization of the trajectory length of GD, which leads to a smaller τ_opt.

- Combining the above results significantly weakens the condition on m needed to guarantee τ_opt ≤ τ_p.

Empirical Observations for DNNs

An empirical observation

Optimization - Over-parameterized DNNs can fit ANY labeling over distinct training inputs.

Generalization - When the labeling is ‘nice’, over-parameterized DNNs can also be trained to achieve small test error.

Limitations of Existing Generalization Bounds

Uniform convergence based generalization bounds (Neyshabur et al. 2015; Bartlett et al. 2017; Neyshabur et al. 2017; Golowich et al. 2017; Arora et al. 2018; Li et al. 2018; Wei et al. 2018) study

sup_{f∈H} | (1/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] − E_{(x,y)∼D} ℓ[y · f(x)] |,

where the first term inside the absolute value is the training loss and the second is the test loss.

- Worst-case analysis

- Trade-off between capacity (VC dimension, Rademacher complexity, etc.) and training error

Wide Enough DNNs Can Fit Random Labels

What If the Distribution is 'Nicer'?

Limitations of Existing Generalization Bounds

Both neural networks separate the training data

Algorithm-dependent generalization analysis is necessary

Algorithm-dependent Generalization Bounds

Algorithm-dependent bounds have been studied recently for shallow networks (Li and Liang 2018; Allen-Zhu et al. 2018a; Arora et al. 2019a; Yehudai and Shamir 2019; E et al. 2019).

Questions that have not been fully answered:

- How to obtain generalization bounds for over-parameterized deep neural networks?

- How to quantify the 'classifiability' of the data distribution?

- What is the benefit of training each layer of the network?

Learning DNNs with Stochastic Gradient Descent

Setup

- Binary classification: (x, y) ∈ S^{d−1} × {±1} is generated from a data distribution D.

- Fully connected ReLU network:

  f_W(x) = √m · W_L σ(W_{L−1} σ(W_{L−2} · · · σ(W_1 x) · · · )),

  where W_1 ∈ R^{m×d}, W_l ∈ R^{m×m} for l = 2, . . . , L−1, W_L ∈ R^{1×m}, and σ(·) = max{·, 0}.

- Expected risk minimization:

  min_W L_D(W) := E_{(x,y)∼D} ℓ[y · f_W(x)],

  where ℓ(z) = log[1 + exp(−z)] is the cross-entropy loss.

Learning DNNs with Stochastic Gradient Descent

Algorithm 2: SGD for training DNNs

Input: number of iterations n, step size η.
Generate each entry of W_l^(0) independently from N(0, 2/m), l ∈ [L−1].
Generate each entry of W_L^(0) independently from N(0, 1/m).
for i = 1, 2, . . . , n do
  Draw (x_i, y_i) from D.
  Update W^(i) = W^(i−1) − η · ∇_W ℓ[y_i · f_{W^(i−1)}(x_i)].
end for
Output: W chosen uniformly at random from {W^(0), . . . , W^(n−1)}.
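A minimal PyTorch sketch of Algorithm 2, assuming the data stream is provided by a placeholder sample() function (here a toy linearly separable distribution on the unit sphere). The √m output scaling and the N(0, 2/m) / N(0, 1/m) initialization follow the setup above; the width, depth, and step size are arbitrary illustrative choices.

```python
import math
import random
import torch

d, m, L, eta, n_steps = 10, 256, 3, 0.05, 500

def sample():
    # Placeholder stream: a unit-sphere input labeled by the sign of its first coordinate.
    x = torch.randn(d)
    x = x / x.norm()
    return x, 1.0 if x[0] > 0 else -1.0

# Initialization: N(0, 2/m) for W_1, ..., W_{L-1} and N(0, 1/m) for the output layer W_L.
W = [torch.randn(m, d) * math.sqrt(2.0 / m)]
W += [torch.randn(m, m) * math.sqrt(2.0 / m) for _ in range(L - 2)]
W += [torch.randn(1, m) * math.sqrt(1.0 / m)]
for Wl in W:
    Wl.requires_grad_(True)

def f(x):
    # f_W(x) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x)).
    h = x
    for Wl in W[:-1]:
        h = torch.relu(Wl @ h)
    return math.sqrt(m) * (W[-1] @ h).squeeze()

iterates = [[Wl.detach().clone() for Wl in W]]  # store W^(0), W^(1), ...
for _ in range(n_steps):
    x, y = sample()
    loss = torch.nn.functional.softplus(-y * f(x))  # log(1 + exp(-y * f_W(x)))
    grads = torch.autograd.grad(loss, W)
    with torch.no_grad():
        for Wl, g in zip(W, grads):
            Wl -= eta * g  # one online SGD step on the fresh example
    iterates.append([Wl.detach().clone() for Wl in W])

W_out = random.choice(iterates[:-1])  # uniformly random iterate among W^(0), ..., W^(n-1)
```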

Neural Tangent Random Feature

Definition (Neural Tangent Random Feature)

Let W^(0) be generated via the initialization scheme in Algorithm 2. Define

F(W^(0), R) = { f_{W^(0)}(·) + 〈∇_W f_{W^(0)}(·), W〉 : W ∈ W(0, R · m^{−1/2}) },

where W(W, τ) := {W′ : ‖W′_l − W_l‖_F ≤ τ, l ∈ [L]}.

The training of the l-th layer of the network contributes the random features ∇_{W_l} f_{W^(0)}(·) to the NTRF function class.

An Expected 0-1 Error Bound

Theorem (Cao and Gu 2019, informal)

For any R > 0, if m ≥ Ω(poly(R, L, n)), then with high probability, Algorithm 2 returns W that satisfies

E[L_D^{0−1}(W)] ≤ inf_{f ∈ F(W^(0), R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O[ LR/√n + √(log(1/δ)/n) ].

- Trade-off in the bound:
  - When R is small, the first term is large and the second term is small.
  - When R is large, the first term is small and the second term is large.

- When R = O(1), the second term is the standard large-deviation error.

- Deep neural networks compete with the best function in the NTRF function class F(W^(0), O(1)).

The training of the l-th layer of the network enlarges F(W^(0), O(1)), leading to a smaller expected 0-1 loss.

The “classifiability” of the underlying data distribution D can be measured by how well its i.i.d. examples can be classified by F(W^(0), O(1)).

Connection to Neural Tangent Kernel (NTK)

Definition (Neural Tangent Kernel)

The neural tangent kernel is defined as:

Θ^(L) = (Θ^(L)_{i,j})_{n×n},   Θ^(L)_{i,j} := m^{−1} E[〈∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)〉].

- Consistent with the definitions given in Jacot et al. (2018), Yang (2019), and Arora et al. (2019b).

- Fixes the exponential dependence on L of the original definition in Jacot et al. (2018) by using N(0, 2/m) initialization instead of N(0, 1/m).

Corollary (Cao and Gu 2019, informal)

Let y = (y_1, . . . , y_n)^⊤ and λ_0 = λ_min(Θ^(L)). If m ≥ Ω(poly(L, n, λ_0^{−1})), then with high probability, Algorithm 2 returns W that satisfies

E[L_D^{0−1}(W)] ≤ O[ L · √(y^⊤(Θ^(L))^{−1} y / n) ] + O[ √(log(1/δ)/n) ].

The “classifiability” of the underlying data distribution D can also be measured by the quantity y^⊤(Θ^(L))^{−1} y.
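The kernel Θ^(L) is defined via an expectation over the random initialization; the sketch below approximates it for a small synthetic dataset using a single finite-width draw of W^(0) (a Monte Carlo, finite-m surrogate of the definition, not the exact infinite-width kernel), and then evaluates the quantity y^⊤(Θ^(L))^{−1}y from the corollary. All sizes and data are placeholders.

```python
import math
import torch

n, d, m, L = 8, 10, 512, 3

# Synthetic unit-norm inputs and +/-1 labels.
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)
y = torch.sign(torch.randn(n))

# One draw of W^(0) with the initialization of Algorithm 2.
W0 = [torch.randn(m, d) * math.sqrt(2.0 / m)]
W0 += [torch.randn(m, m) * math.sqrt(2.0 / m) for _ in range(L - 2)]
W0 += [torch.randn(1, m) * math.sqrt(1.0 / m)]
for Wl in W0:
    Wl.requires_grad_(True)

def f(x):
    h = x
    for Wl in W0[:-1]:
        h = torch.relu(Wl @ h)
    return math.sqrt(m) * (W0[-1] @ h).squeeze()

# Row i holds the flattened gradient feature (1/sqrt(m)) * grad_W f_{W^(0)}(x_i).
G = torch.stack([
    torch.cat([g.reshape(-1) for g in torch.autograd.grad(f(x), W0)]) / math.sqrt(m)
    for x in X
])

Theta = G @ G.T                                      # finite-width surrogate of Theta^(L)
lam_min = torch.linalg.eigvalsh(Theta).min()         # smallest eigenvalue lambda_0
bound_quantity = y @ torch.linalg.solve(Theta, y)    # y^T (Theta^(L))^{-1} y
print(lam_min.item(), bound_quantity.item())
```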

Overview of Proof Technique

Key observations

- Deep ReLU networks are almost linear in terms of their parameters in a small neighbourhood around the random initialization (see the numerical sketch below):

  f_{W′}(x_i) ≈ f_W(x_i) + 〈∇f_W(x_i), W′ − W〉.

- L_{(x_i,y_i)}(W) is Lipschitz continuous and almost convex:

  ‖∇_{W_l} L_{(x_i,y_i)}(W)‖_F ≤ O(√m), l ∈ [L],

  L_{(x_i,y_i)}(W′) ≳ L_{(x_i,y_i)}(W) + 〈∇_W L_{(x_i,y_i)}(W), W′ − W〉.

Optimization for Lipschitz and convex functions + online-to-batch conversion
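A quick numerical illustration of the first observation (a sketch with arbitrary sizes; the perturbation radius τ is an illustrative choice): compare f_{W′}(x) with its first-order expansion around a random initialization W for a small perturbation W′ = W + Δ.

```python
import math
import torch

d, m, L, tau = 10, 1024, 3, 0.01  # tau: Frobenius norm of the per-layer perturbation

# Random initialization as in Algorithm 2.
W = [torch.randn(m, d) * math.sqrt(2.0 / m)]
W += [torch.randn(m, m) * math.sqrt(2.0 / m) for _ in range(L - 2)]
W += [torch.randn(1, m) * math.sqrt(1.0 / m)]
for Wl in W:
    Wl.requires_grad_(True)

def f(x, Ws):
    h = x
    for Wl in Ws[:-1]:
        h = torch.relu(Wl @ h)
    return math.sqrt(m) * (Ws[-1] @ h).squeeze()

x = torch.randn(d)
x = x / x.norm()

out = f(x, W)
grads = torch.autograd.grad(out, W)  # gradient of f_W(x) with respect to each layer

# Random perturbation Delta with ||Delta_l||_F = tau for every layer.
Delta = [torch.randn_like(Wl) for Wl in W]
Delta = [tau * D / D.norm() for D in Delta]
W_prime = [Wl.detach() + D for Wl, D in zip(W, Delta)]

exact = f(x, W_prime).item()
linear = out.item() + sum((g * D).sum().item() for g, D in zip(grads, Delta))
print(abs(exact - linear))  # small when the width m is large and tau is small
```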

Conclusion

Under certain data distribution assumptions:

- Global convergence guarantees for GD in training over-parameterized deep ReLU networks.

- SGD trains an over-parameterized deep ReLU network and achieves O(n^{−1/2}) expected 0-1 loss.

- The data “classifiability” can be measured by the NTRF function class or the NTK kernel matrix.

- An algorithm-dependent generalization error bound.

- Sample complexity is independent of the network width.

Future Work

- Deeper understanding of the NTRF function class and the NTK.

- Other learning algorithms.

- Other neural network architectures.

Thank you!

References

Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2018a. “Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers”. arXiv preprint arXiv:1811.04918.

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018b. “A Convergence Theory for Deep Learning via Over-Parameterization”. arXiv preprint arXiv:1811.03962.

Arora, Sanjeev, et al. 2019a. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. arXiv preprint arXiv:1901.08584.

Arora, Sanjeev, et al. 2019b. “On exact computation with an infinitely wide neural net”. arXiv preprint arXiv:1904.11955.

Arora, Sanjeev, et al. 2018. “Stronger generalization bounds for deep nets via a compression approach”. arXiv preprint arXiv:1802.05296.

Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. “Spectrally-normalized margin bounds for neural networks”. In Advances in Neural Information Processing Systems, 6240–6249.

Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. 2016. “An analysis of deep neural network models for practical applications”. arXiv preprint arXiv:1605.07678.

Cao, Yuan, and Quanquan Gu. 2019. “Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks”. arXiv preprint arXiv:1905.13210.

Du, Simon S, Jason D Lee, and Yuandong Tian. 2017. “When is a Convolutional Filter Easy To Learn?” arXiv preprint arXiv:1709.06129.

Du, Simon S, et al. 2018a. “Gradient Descent Finds Global Minima of Deep Neural Networks”. arXiv preprint arXiv:1811.03804.

Du, Simon S, et al. 2018b. “Gradient Descent Provably Optimizes Over-parameterized Neural Networks”. arXiv preprint arXiv:1810.02054.

E, Weinan, Chao Ma, and Lei Wu. 2019. “A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics”. arXiv preprint arXiv:1904.04326.

Goel, Surbhi, et al. 2016. “Reliably learning the relu in polynomial time”. arXiv preprint arXiv:1611.10258.

Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks”. arXiv preprint arXiv:1712.06541.

Jacot, Arthur, Franck Gabriel, and Clement Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. arXiv preprint arXiv:1806.07572.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet classification with deep convolutional neural networks”. In Advances in neural information processing systems, 1097–1105.

Li, Xingguo, et al. 2018. “On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond”. arXiv preprint arXiv:1806.05159.

Li, Yuanzhi, and Yingyu Liang. 2018. “Learning overparameterized neural networks via stochastic gradient descent on structured data”. arXiv preprint arXiv:1808.01204.

Li, Yuanzhi, and Yang Yuan. 2017. “Convergence Analysis of Two-layer Neural Networks with ReLU Activation”. arXiv preprint arXiv:1705.09886.

Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. 2015. “Norm-based capacity control in neural networks”. In Conference on Learning Theory, 1376–1401.

Neyshabur, Behnam, et al. 2017. “A pac-bayesian approach to spectrally-normalized margin bounds for neural networks”. arXiv preprint arXiv:1707.09564.

Neyshabur, Behnam, et al. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks”. arXiv preprint arXiv:1805.12076.

Tian, Yuandong. 2017. “An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis”. arXiv preprint arXiv:1703.00560.

Wei, Colin, et al. 2018. “On the margin theory of feedforward neural networks”. arXiv preprint arXiv:1810.05369.

Yang, Greg. 2019. “Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation”. arXiv preprint arXiv:1902.04760.

Yehudai, Gilad, and Ohad Shamir. 2019. “On the power and limitations of random features for understanding neural networks”. arXiv preprint arXiv:1904.00687.

Zhang, Chiyuan, et al. 2016. “Understanding deep learning requires rethinking generalization”. arXiv preprint arXiv:1611.03530.

Zhang, Xiao, et al. 2018. “Learning One-hidden-layer ReLU Networks via Gradient Descent”. arXiv preprint arXiv:1806.07808.

Zhong, Kai, et al. 2017. “Recovery Guarantees for One-hidden-layer Neural Networks”. arXiv preprint arXiv:1706.03175.

Zou, Difan, and Quanquan Gu. 2019. “An Improved Analysis of Training Over-parameterized Deep Neural Networks”. arXiv preprint arXiv:1906.04688.

Zou, Difan, et al. 2018. “Stochastic gradient descent optimizes over-parameterized deep ReLU networks”. arXiv preprint arXiv:1811.08888.
