Towards Understanding Overparameterized Deep Neural Networks: From Optimization To Generalization
Quanquan Gu
Computer Science Department, UCLA
Joint work with Difan Zou, Yuan Cao and Dongruo Zhou
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 1 / 43
The Rise of Deep Learning
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 2 / 43
The Rise of Deep Learning–Over-parameterization
The evolution of the winning entries on the ImageNet challenge.
Alex Krizhevsky et al. 2012. “Imagenet classification with deep convolutional neural networks”. In Advances in neural information processing systems, 1097–1105
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 3 / 43
The Rise of Deep Learning–Over-parameterization
Top-1 accuracy versus the number of operations required for a single forward pass in popular neural networks.
Alfredo Canziani et al. 2016. “An analysis of deep neural network models for practical applications”. arXiv preprint arXiv:1605.07678
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 4 / 43
Empirical Observations for DNNs
Fitting random labels and random pixels on CIFAR10
Chiyuan Zhang et al. 2016. “Understanding deep learning requires rethinking generalization”. arXiv preprint arXiv:1611.03530
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 5 / 43
Empirical Observations for DNNs
Training ResNet of different sizes on CIFAR10.
Behnam Neyshabur et al. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks”. arXiv preprint arXiv:1805.12076
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 6 / 43
Empirical Observations for DNNs
An empirical observation
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 7 / 43
Optimization for DNNs
Question
Why and how can over-parameterized DNNs trained by gradient descent fit any labeling over distinct training inputs?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 8 / 43
Challenge of Optimization for DNNs
Challenge
▶ The objective training loss is highly nonconvex and even nonsmooth.
▶ Conventional optimization theory can only guarantee finding a second-order stationary point.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 9 / 43
Existing Work
A line of research on the optimization theory for training two-layer NNs is limited to the teacher network setting (Goel et al. 2016; Tian 2017; Du et al. 2017; Li and Yuan 2017; Zhong et al. 2017; Zhang et al. 2018).
Limitations of these works:
▶ Assuming all data are generated from a “teacher network”, which cannot be satisfied in practice.
▶ Strong assumptions on the training inputs (e.g., i.i.d. samples from a Gaussian distribution).
▶ Requiring special initialization methods that are very different from the commonly used ones.
Can we prove convergence under more practical assumptions?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 10 / 43
Existing Work
Moreover, under much milder conditions on the training data and initialization, Li and Liang (2018) and Du et al. (2018b) established the global convergence of (stochastic) gradient descent for training two-layer ReLU networks.
▶ The neural network should be sufficiently over-parameterized (i.e., contain a sufficiently large number of hidden nodes).
▶ The output of (stochastic) gradient descent can achieve arbitrarily small training loss.
Can similar results be generalized to DNNs?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 11 / 43
Training Deep ReLU Networks
Setup
▶ Training data: S = {(x_i, y_i)}_{i=1,...,n} with input vector x_i ∈ R^d and label y_i ∈ {−1, +1}, where ‖x_i‖_2 = 1, (x_i)_d = µ, and ‖x_i − x_j‖_2 ≥ φ whenever y_i ≠ y_j.
▶ Fully connected ReLU network:
  f_W(x) = v^⊤ σ(W_L^⊤ σ(W_{L−1}^⊤ ··· σ(W_1^⊤ x) ···)),
  with weight matrices W_l ∈ R^{m×m} and output weights v ∈ {±1}^m.
▶ Classifier: sign(f_W(x)).
▶ Empirical risk minimization:
  min_W L_S(W) = (1/n) Σ_{i=1}^n ℓ[y_i · f_W(x_i)],
  where ℓ(z) = log[1 + exp(−z)] is the cross-entropy loss.
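A minimal JAX sketch of this setup; it is not from the paper, the names `forward` and `empirical_risk` are illustrative, and the first weight matrix is given shape (d, m) so the sketch runs when d ≠ m (the slide takes all W_l to be m×m).

```python
import jax.numpy as jnp

def forward(Ws, v, x):
    # f_W(x) = v^T sigma(W_L^T sigma(... sigma(W_1^T x) ...)), sigma = ReLU
    h = x
    for W in Ws:
        h = jnp.maximum(W.T @ h, 0.0)
    return v @ h

def empirical_risk(Ws, v, X, y):
    # L_S(W) = (1/n) sum_i l(y_i * f_W(x_i)), with l(z) = log(1 + exp(-z))
    margins = y * jnp.stack([forward(Ws, v, x) for x in X])
    return jnp.mean(jnp.logaddexp(0.0, -margins))
```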
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 12 / 43
Training Algorithm
Algorithm 1 (S)GD for training DNNs
1: input: Training data {x_i, y_i}_{i∈[n]}, step size η, total number of iterations K, minibatch size B.
2: initialization: (W^(0)_l)_{i,j} ∼ N(0, 2/m) for all i, j, l
Gradient Descent
3: for k = 0, ..., K do
4:   W^(k+1)_l = W^(k)_l − η ∇_{W_l} L_S(W^(k)) for all l ∈ [L]
5: end for
6: output: {W^(K)_l}_{l∈[L]}
Stochastic Gradient Descent
7: for k = 0, ..., K do
8:   Uniformly sample a minibatch B^(k) ⊆ [n] of training data
9:   W^(k+1)_l = W^(k)_l − (η/B) Σ_{s∈B^(k)} ∇_{W_l} ℓ[y_s · f_{W^(k)}(x_s)] for all l ∈ [L]
10: end for
11: output: {W^(K)_l}_{l∈[L]}
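A hedged JAX sketch of Algorithm 1, reusing `empirical_risk` from the previous sketch and letting `jax.grad` stand in for backpropagation; all function names are illustrative, not from the paper.

```python
import jax
import jax.numpy as jnp
from jax import random

def init_weights(key, d, m, L):
    # Line 2: every entry of every W_l^(0) is drawn from N(0, 2/m)
    keys = random.split(key, L)
    shapes = [(d, m)] + [(m, m)] * (L - 1)
    return [jnp.sqrt(2.0 / m) * random.normal(k, s) for k, s in zip(keys, shapes)]

# Gradient of L_S w.r.t. the list of weight matrices (first argument of empirical_risk)
loss_grad = jax.grad(empirical_risk)

def gd(Ws, v, X, y, eta, K):
    # Lines 3-6: W_l^(k+1) = W_l^(k) - eta * grad_{W_l} L_S(W^(k))
    for _ in range(K):
        grads = loss_grad(Ws, v, X, y)
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
    return Ws

def sgd(key, Ws, v, X, y, eta, K, B):
    # Lines 7-10: (eta / B) * sum over the minibatch equals eta * (batch mean gradient)
    n = X.shape[0]
    for _ in range(K):
        key, sub = random.split(key)
        idx = random.choice(sub, n, shape=(B,), replace=False)
        grads = loss_grad(Ws, v, X[idx], y[idx])
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
    return Ws
```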
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 13 / 43
Convergence of GD/SGD for Training DNNs
Theorem (Zou et al. 2018, informal)
For any ε > 0, if
  m = Ω(poly(n, L, φ^{−1}, ε^{−1})),
then with high probability, (stochastic) gradient descent converges to a point that achieves ε-training loss within the following number of iterations:
  K = O(poly(n, L, φ^{−1}, ε^{−1})).
▶ The over-parameterization condition and iteration complexity are polynomial in all problem parameters.
▶ To achieve zero classification error, it suffices to set ε = log(2)/n.
Similar results have also been proved in Allen-Zhu et al. (2018b) and Du et al. (2018b) for the regression problem with the quadratic loss function.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 14 / 43
Overview of Proof Technique
W(W^(0), τ) := {W = {W_l}_{l=1}^L : ‖W_l − W^(0)_l‖_F ≤ τ, l ∈ [L]}
For large enough width m:
▶ For W ∈ W(W^(0), τ_p) with τ_p = O(poly(n, L)), L_S(W) enjoys good curvature properties (e.g., sufficiently large gradient, nearly smooth and convex).
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 15 / 43
Overview of Proof Technique
W(W^(0), τ) := {W = {W_l}_{l=1}^L : ‖W_l − W^(0)_l‖_F ≤ τ, l ∈ [L]}
For large enough width m:
▶ For W ∈ W(W^(0), τ_p) with τ_p = O(poly(n, L)), L_S(W) enjoys good curvature properties (e.g., sufficiently large gradient, nearly smooth and convex).
▶ Gradient descent converges with trajectory length τ_opt ≤ O(poly(n) · ε^{−1} m^{−1/2}).
▶ A sufficiently large m guarantees that τ_opt ≤ τ_p.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 16 / 43
Stronger Guarantees for Training DNNs
Theorem (Zou and Gu 2019, informal)
For any ε > 0, if
  m = Ω(n^8 L^{12} φ^{−4}),
then with high probability, gradient descent converges to a point that achieves ε-training loss within the following number of iterations:
  K = O(n^2 L^2 φ^{−1} log(1/ε)).
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 17 / 43
Comparison with Existing Work
Table: Over-parameterization conditions and iteration complexities of GD for training deep neural networks.

                          Over-param. condition                 Iteration complexity                         ReLU?
Du et al. (2018a)         Ω(2^{O(L)} · n^4 / λ_min^4(K^(L)))    O(2^{O(L)} · n^2 log(1/ε) / λ_min^2(K^(L)))  no
Allen-Zhu et al. (2018b)  Ω(n^{24} L^{12} / φ^8)                O(n^6 L^2 log(1/ε) / φ^2)                    yes
Our work                  Ω(n^8 L^{12} / φ^4)                   O(n^2 L^2 log(1/ε) / φ)                      yes

K^(L) denotes the Gram matrix for the L-hidden-layer neural network in Du et al. (2018a).
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 18 / 43
Overview of Proof Technique
W(W^(0), τ) := {W = {W_l}_{l=1}^L : ‖W_l − W^(0)_l‖_F ≤ τ, l ∈ [L]}
Improved techniques:
▶ Prove a larger τ_p such that for W ∈ W(W^(0), τ_p), L_S(W) enjoys good curvature properties.
▶ Within the region W(W^(0), τ_p), we prove a larger gradient lower bound, which leads to faster convergence of GD.
▶ We provide a sharper characterization of the trajectory length of GD, which leads to a smaller τ_opt.
▶ Combining the above results, we significantly improve the condition on m that guarantees τ_opt ≤ τ_p.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 19 / 43
Empirical Observations for DNNs
An empirical observation
Optimization - Over-parameterized DNNs can fit ANY labeling over distinct training inputs.
Generalization - When the labeling is ‘nice’, over-parameterized DNNs can also be trained to achieve small test error.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 20 / 43
Limitations of Existing Generalization Bounds
Uniform convergence based generalization bounds (Neyshabur et al. 2015; Bartlett et al. 2017; Neyshabur et al. 2017; Golowich et al. 2017; Arora et al. 2018; Li et al. 2018; Wei et al. 2018) study

  sup_{f∈H} | (1/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] − E_{(x,y)∼D} ℓ[y · f(x)] |,

i.e., the worst-case gap between the training loss and the test loss.

▶ Worst-case analysis.
▶ Trade-off between capacity (VC dimension, Rademacher complexity, etc.) and training error.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 21 / 43
Wide Enough DNNs Can Fit Random Labels
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 22 / 43
Wide Enough DNNs Can Fit Random Labels
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 23 / 43
What If the Distribution is ‘Nicer’?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 24 / 43
What If the Distribution is ‘Nicer’?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 25 / 43
What If the Distribution is ‘Nicer’?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 26 / 43
Limitations of Existing Generalization Bounds
Both neural networks separate the training data
Algorithm-dependent generalization analysis is necessary
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 27 / 43
Algorithm-dependent Generalization Bounds
Algorithm-dependent bounds have been studied recently by Li and Liang (2018), Allen-Zhu et al. (2018a), Arora et al. (2019a), Yehudai and Shamir (2019), and E et al. (2019) for shallow networks.
Questions that have not been fully answered:
▶ How to obtain generalization bounds for over-parameterized deep neural networks?
▶ How to quantify the ‘classifiability’ of the data distribution?
▶ What is the benefit of training each layer of the network?
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 28 / 43
Learning DNNs with Stochastic Gradient Descent
Setup
▶ Binary classification: (x, y) ∈ S^{d−1} × {±1} is generated from a data distribution D.
▶ Fully connected ReLU network:
  f_W(x) = √m · W_L σ(W_{L−1} σ(W_{L−2} ··· σ(W_1 x) ···)),
  where W_1 ∈ R^{m×d}, W_l ∈ R^{m×m} for l = 2, ..., L−1, W_L ∈ R^{1×m}, and σ(·) = max{·, 0}.
▶ Expected risk minimization:
  min_W L_D(W) := E_{(x,y)∼D} ℓ[y · f_W(x)],
  where ℓ(z) = log[1 + exp(−z)] is the cross-entropy loss.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 29 / 43
Learning DNNs with Stochastic Gradient Descent
Algorithm 2 SGD for training DNNs
Input: Number of iterations n, step size η.
Generate each entry of W^(0)_l independently from N(0, 2/m), l ∈ [L−1].
Generate each entry of W^(0)_L independently from N(0, 1/m).
for i = 1, 2, ..., n do
  Draw (x_i, y_i) from D.
  Update W^(i) = W^(i−1) − η · ∇_W ℓ[y_i · f_{W^(i−1)}(x_i)].
end for
Output: Randomly choose W uniformly from {W^(0), ..., W^(n−1)}.
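A minimal JAX sketch of Algorithm 2, assuming user-supplied helpers `sample(key)` that draws (x, y) ∼ D and `loss_grad(Ws, x, y)` that returns the per-layer gradients of ℓ[y · f_W(x)]; it highlights the two differences from Algorithm 1, the N(0, 1/m) output-layer initialization and the uniformly random choice of the output iterate. All names are illustrative.

```python
import jax.numpy as jnp
from jax import random

def init_algo2(key, d, m, L):
    # N(0, 2/m) entries for W_1, ..., W_{L-1}; N(0, 1/m) entries for the output layer W_L
    keys = random.split(key, L)
    Ws = [jnp.sqrt(2.0 / m) * random.normal(keys[0], (m, d))]
    Ws += [jnp.sqrt(2.0 / m) * random.normal(k, (m, m)) for k in keys[1:L - 1]]
    Ws += [jnp.sqrt(1.0 / m) * random.normal(keys[L - 1], (1, m))]
    return Ws

def one_pass_sgd(key, Ws, sample, loss_grad, eta, n):
    # One fresh example (x_i, y_i) ~ D per step; return a uniformly random iterate
    iterates = [Ws]
    for _ in range(n):
        key, k_sample = random.split(key)
        x, y = sample(k_sample)                  # draw (x, y) from D
        grads = loss_grad(Ws, x, y)              # per-layer gradients of l(y * f_W(x))
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
        iterates.append(Ws)
    key, k_out = random.split(key)
    i = int(random.randint(k_out, (), 0, n))     # uniform over {0, ..., n-1}
    return iterates[i]
```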
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 30 / 43
Neural Tangent Random Feature
Definition (Neural Tangent Random Feature)
Let W^(0) be generated via the initialization scheme in Algorithm 2. Define
  F(W^(0), R) = { f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : W ∈ W(0, R · m^{−1/2}) },
where W(W, τ) := {W′ : ‖W′_l − W_l‖_F ≤ τ, l ∈ [L]}.

The training of the l-th layer of the network contributes the random features ∇_{W_l} f_{W^(0)}(·) to the NTRF function class.
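A small JAX sketch of a member of the NTRF class, assuming `f(Ws, x)` is a forward pass for the network above; `ntrf_predict` is an illustrative name, and the perturbation `deltas` plays the role of W in the definition.

```python
import jax
import jax.numpy as jnp

def ntrf_predict(f, W0s, deltas, x):
    # NTRF function: f_{W(0)}(x) + <grad_W f_{W(0)}(x), W>, linear in the perturbation
    # W = deltas (one matrix per layer, each with ||W_l||_F <= R * m^{-1/2})
    f0, grads = jax.value_and_grad(f)(W0s, x)    # value and per-layer gradients at init
    inner = sum(jnp.sum(G * D) for G, D in zip(grads, deltas))
    return f0 + inner
```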
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 31 / 43
An Expected 0-1 Error Bound
Theorem (Cao and Gu 2019, informal)
For any R > 0, if m ≥ Ω(poly(R, L, n)), then with high probability, Algorithm 2 returns W that satisfies

  E[L^{0−1}_D(W)] ≤ inf_{f ∈ F(W^(0), R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O( LR/√n + √(log(1/δ)/n) ).

▶ Trade-off in the bound:
  ▶ When R is small, the first term is large and the second term is small.
  ▶ When R is large, the first term is small and the second term is large.
▶ When R = O(1), the second term is the standard large-deviation error.
▶ Deep neural networks compete with the best function in the NTRF function class F(W^(0), O(1)).
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 32 / 43
An Expected 0-1 Error Bound
Theorem (Cao and Gu 2019, informal)
For any R > 0, if m ≥ Ω(poly(R, L, n)), then with high probability, Algorithm 2 returns W that satisfies

  E[L^{0−1}_D(W)] ≤ inf_{f ∈ F(W^(0), R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O( LR/√n + √(log(1/δ)/n) ).

The training of the l-th layer of the network enlarges F(W^(0), O(1)), leading to a smaller expected 0-1 loss.

The “classifiability” of the underlying data distribution D can be measured by how well its i.i.d. examples can be classified by F(W^(0), O(1)).
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 33 / 43
Connection to Neural Tangent Kernel (NTK)
Definition (Neural Tangent Kernel)
The neural tangent kernel is defined as
  Θ^(L) = (Θ^(L)_{i,j})_{n×n},   Θ^(L)_{i,j} := m^{−1} · E[ ⟨∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)⟩ ].

▶ Consistent with the definitions given in Jacot et al. (2018), Yang (2019), and Arora et al. (2019b).
▶ Fixes the exponential dependence on L in the original definition of Jacot et al. (2018) by using N(0, 2/m) initialization instead of N(0, 1/m).
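A hedged JAX sketch of an empirical estimate of Θ^(L), again assuming a forward pass `f(Ws, x)`; it uses a single random initialization rather than the expectation in the definition, and `ntk_gram` is an illustrative name.

```python
import jax
import jax.numpy as jnp

def ntk_gram(f, W0s, X, m):
    # One-sample estimate: Theta_{ij} ~ (1/m) <grad_W f_{W(0)}(x_i), grad_W f_{W(0)}(x_j)>.
    # Averaging this estimate over several independent initializations W(0) approximates
    # the expectation in the definition above.
    def flat_grad(x):
        grads = jax.grad(f)(W0s, x)                    # per-layer gradients at init
        return jnp.concatenate([G.ravel() for G in grads])
    F = jnp.stack([flat_grad(x) for x in X])           # n x (#parameters) feature matrix
    return (F @ F.T) / m
```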
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 34 / 43
Connection to Neural Tangent Kernel (NTK)
Corollary (Cao and Gu 2019, informal)
Let y = (y_1, ..., y_n)^⊤ and λ_0 = λ_min(Θ^(L)). If m ≥ Ω(poly(L, n, λ_0^{−1})), then with high probability, Algorithm 2 returns W that satisfies

  E[L^{0−1}_D(W)] ≤ O( L · √( y^⊤ (Θ^(L))^{−1} y / n ) ) + O( √(log(1/δ)/n) ).

The “classifiability” of the underlying data distribution D can also be measured by the quantity y^⊤ (Θ^(L))^{−1} y.
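A short sketch, under the same assumptions, of this classifiability quantity computed from the Gram matrix estimated above; the ridge term is an added numerical safeguard, not part of the corollary.

```python
import jax.numpy as jnp

def ntk_complexity(Theta, y, ridge=1e-8):
    # The data-dependent quantity sqrt(y^T (Theta^(L))^{-1} y) from the corollary;
    # a tiny ridge term keeps the linear solve stable if Theta is nearly singular.
    n = y.shape[0]
    alpha = jnp.linalg.solve(Theta + ridge * jnp.eye(n), y)
    return jnp.sqrt(y @ alpha)
```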
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 35 / 43
Overview of Proof Technique
Key observations
▶ Deep ReLU networks are almost linear in terms of their parameters in a small neighbourhood around the random initialization (a numerical check is sketched below):
  f_{W′}(x_i) ≈ f_W(x_i) + ⟨∇f_W(x_i), W′ − W⟩.
▶ L_{(x_i, y_i)}(W) is Lipschitz continuous and almost convex:
  ‖∇_{W_l} L_{(x_i, y_i)}(W)‖_F ≤ O(√m), l ∈ [L],
  L_{(x_i, y_i)}(W′) ≳ L_{(x_i, y_i)}(W) + ⟨∇_W L_{(x_i, y_i)}(W), W′ − W⟩.

Optimization for Lipschitz and convex functions + online-to-batch conversion.
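A minimal JAX sketch that numerically checks the first observation by comparing the perturbed network to its first-order expansion around W; `f(Ws, x)` is again an assumed forward pass and `linearization_gap` is an illustrative name.

```python
import jax
import jax.numpy as jnp

def linearization_gap(f, Ws, deltas, x):
    # |f_{W'}(x) - f_W(x) - <grad f_W(x), W' - W>| for W' = W + deltas; the gap stays
    # small when every ||delta_l||_F is small, which is the first key observation above
    f_w, grads = jax.value_and_grad(f)(Ws, x)
    f_w_pert = f([W + D for W, D in zip(Ws, deltas)], x)
    linear = f_w + sum(jnp.sum(G * D) for G, D in zip(grads, deltas))
    return jnp.abs(f_w_pert - linear)
```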
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 36 / 43
Conclusion
Under certain data distribution assumptions:
▶ Global convergence guarantees for GD in training over-parameterized deep ReLU networks.
▶ SGD trains an over-parameterized deep ReLU network to achieve O(n^{−1/2}) expected 0-1 loss.
▶ The data “classifiability” can be measured by the NTRF function class or the NTK Gram matrix.
▶ An algorithm-dependent generalization error bound.
▶ Sample complexity is independent of the network width.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 37 / 43
Future Work
▶ Deeper understanding of the NTRF function class and the NTK.
▶ Other learning algorithms.
▶ Other neural network architectures.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 38 / 43
Thank you!
References I
Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2018a. “Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers”. arXiv preprint arXiv:1811.04918.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018b. “A Convergence Theory for Deep Learning via Over-Parameterization”. arXiv preprint arXiv:1811.03962.
Arora, Sanjeev, et al. 2019a. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. arXiv preprint arXiv:1901.08584.
Arora, Sanjeev, et al. 2019b. “On exact computation with an infinitely wide neural net”. arXiv preprint arXiv:1904.11955.
Arora, Sanjeev, et al. 2018. “Stronger generalization bounds for deep nets via a compression approach”. arXiv preprint arXiv:1802.05296.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. “Spectrally-normalized margin bounds for neural networks”. In Advances in Neural Information Processing Systems, 6240–6249.
Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. 2016. “An analysis of deep neural network models for practical applications”. arXiv preprint arXiv:1605.07678.
Cao, Yuan, and Quanquan Gu. 2019. “Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks”. arXiv preprint arXiv:1905.13210.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 40 / 43
References II
Du, Simon S, Jason D Lee, and Yuandong Tian. 2017. “When is a Convolutional Filter Easy To Learn?” arXiv preprint arXiv:1709.06129.
Du, Simon S, et al. 2018a. “Gradient Descent Finds Global Minima of Deep Neural Networks”. arXiv preprint arXiv:1811.03804.
Du, Simon S, et al. 2018b. “Gradient Descent Provably Optimizes Over-parameterized Neural Networks”. arXiv preprint arXiv:1810.02054.
E, Weinan, Chao Ma, and Lei Wu. 2019. “A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics”. arXiv preprint arXiv:1904.04326.
Goel, Surbhi, et al. 2016. “Reliably learning the relu in polynomial time”. arXiv preprint arXiv:1611.10258.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks”. arXiv preprint arXiv:1712.06541.
Jacot, Arthur, Franck Gabriel, and Clement Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. arXiv preprint arXiv:1806.07572.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems, 1097–1105.
Li, Xingguo, et al. 2018. “On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond”. arXiv preprint arXiv:1806.05159.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 41 / 43
References III
Li, Yuanzhi, and Yingyu Liang. 2018. “Learning overparameterized neural networks via stochastic gradient descent on structured data”. arXiv preprint arXiv:1808.01204.
Li, Yuanzhi, and Yang Yuan. 2017. “Convergence Analysis of Two-layer Neural Networks with ReLU Activation”. arXiv preprint arXiv:1705.09886.
Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. 2015. “Norm-based capacity control in neural networks”. In Conference on Learning Theory, 1376–1401.
Neyshabur, Behnam, et al. 2017. “A pac-bayesian approach to spectrally-normalized margin bounds for neural networks”. arXiv preprint arXiv:1707.09564.
Neyshabur, Behnam, et al. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks”. arXiv preprint arXiv:1805.12076.
Tian, Yuandong. 2017. “An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis”. arXiv preprint arXiv:1703.00560.
Wei, Colin, et al. 2018. “On the margin theory of feedforward neural networks”. arXiv preprint arXiv:1810.05369.
Yang, Greg. 2019. “Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation”. arXiv preprint arXiv:1902.04760.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 42 / 43
References IV
Yehudai, Gilad, and Ohad Shamir. 2019. “On the power and limitations of random features for understanding neural networks”. arXiv preprint arXiv:1904.00687.
Zhang, Chiyuan, et al. 2016. “Understanding deep learning requires rethinking generalization”. arXiv preprint arXiv:1611.03530.
Zhang, Xiao, et al. 2018. “Learning One-hidden-layer ReLU Networks via Gradient Descent”. arXiv preprint arXiv:1806.07808.
Zhong, Kai, et al. 2017. “Recovery Guarantees for One-hidden-layer Neural Networks”. arXiv preprint arXiv:1706.03175.
Zou, Difan, and Quanquan Gu. 2019. “An Improved Analysis of Training Over-parameterized Deep Neural Networks”. arXiv preprint arXiv:1906.04688.
Zou, Difan, et al. 2018. “Stochastic gradient descent optimizes over-parameterized deep ReLU networks”. arXiv preprint arXiv:1811.08888.
Quanquan Gu Towards Understanding Overparameterized Deep Neural Networks 43 / 43