Some theory of machine learning
Seoul National University Deep Learning September-December, 2019 1 / 33
Setup
Consider data (xi, yi) with xi ∈ Rd and yi ∈ {0, 1}. Given a classifier f : Rd → {0, 1}, we would like to minimize the true risk R(f) = P(Y ≠ f(X)).

We obtain data (xi, yi), i = 1, ..., n. An estimator of R(f) is R̂(f) = n⁻¹ ∑_{i=1}^n I(Yi ≠ f(Xi)), which we call the empirical risk.

Learning f (empirical risk minimization): we want to find f that makes R(f) small. Suppose we choose f̂ which minimizes the empirical risk over F, i.e., R̂(f̂) = min_{f∈F} R̂(f).

Optimal f in F: let f* be the optimal classifier in F in the sense that it minimizes the true risk over F, i.e., R(f*) = min_{f∈F} R(f).

f** may not be in F: let f** be the true classifier that minimizes the true risk over all classifiers, i.e., R(f**) = min_f R(f).
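The setup can be made concrete with a small numerical sketch. Below is a hedged example (the data-generating threshold 0.3 and the grid of candidate thresholds are invented for illustration) that computes the empirical risk R̂(f) and runs empirical risk minimization over a finite class of threshold classifiers.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Fraction of points the classifier f mislabels: R_hat(f) = n^{-1} sum I(y_i != f(x_i))."""
    return np.mean(f(X) != y)

# Toy 1-d data: label is 1 when x > 0.3 (the threshold 0.3 is a made-up example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
y = (X > 0.3).astype(int)

# ERM over the finite class F = {I(x > a) : a in a grid of candidate thresholds}.
thresholds = np.linspace(0, 1, 101)
risks = [empirical_risk(lambda x, a=a: (x > a).astype(int), X, y) for a in thresholds]
a_hat = thresholds[int(np.argmin(risks))]
print(a_hat, min(risks))
```

Since the label-generating rule lies inside the class, the minimized empirical risk here is zero and f̂ recovers a threshold near 0.3.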
Areas of theoretical studies related to Deep Learning
R(f̂) − R(f**) = {R(f̂) − R(f*)} + {R(f*) − R(f**)}.

{R(f*) − R(f**)} is the approximation error; R(f̂) − R(f*) is the estimation error.
Theoretical work on deep learning addresses (i) approximation, (ii) optimization and (iii) generalization error. Work on approximation helps explain the expressiveness of deep models, but it is not our focus in this course. We will mention some recent findings on (ii) in the optimization section. In this section we focus on the generalization error.
Excess risk
R(f̂) − R(f**) = {R(f̂) − R(f*)} + {R(f*) − R(f**)}, where {R(f*) − R(f**)} is the approximation error.

R(f̂) − R(f*) is the estimation error, which we would like to bound.

R(f̂) − R(f*) = {R(f̂) − R̂(f̂)} + {R̂(f̂) − R̂(f*)} + {R̂(f*) − R(f*)}
             ≤ {R(f̂) − R̂(f̂)} (i) + {R̂(f*) − R(f*)} (ii)      (since R̂(f̂) ≤ R̂(f*))
             ≤ 2 sup_{f∈F} |R̂(f) − R(f)|.
Concentration of measure can be invoked to bound (ii), since f* is a fixed function; it cannot be applied directly to (i), because f̂ depends on the data.
Hoeffding’s Inequality
Hoeffding's Inequality: If Z1, ..., Zn are independent with P(ai ≤ Zi ≤ bi) = 1, then for any ε > 0,

P(|Z̄n − µ| > ε) ≤ 2e^{−2nε²/c},

where c = n⁻¹ ∑_{i=1}^n (bi − ai)², Z̄n = n⁻¹ ∑_{i=1}^n Zi, and µ = E(Z̄n).

For a fixed classifier f, since I(Yi ≠ f(Xi)) ∈ [0, 1], Hoeffding's inequality implies

P(|R(f) − R̂(f)| > ε) ≤ 2e^{−2nε²},

and with probability at least 1 − δ,

|R(f) − R̂(f)| ≤ √( (1/2n) log(2/δ) ).
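The confidence form of the bound can be checked by simulation. The sketch below (sample size, loss probability, and δ are illustrative choices) models the losses I(Yi ≠ f(Xi)) for a fixed f as i.i.d. Bernoulli draws and checks that the deviation |R(f) − R̂(f)| stays within √(log(2/δ)/(2n)) at least a 1 − δ fraction of the time.

```python
import numpy as np

# Simulate deviations of the empirical risk of a fixed classifier, modeled as
# i.i.d. Bernoulli(p) losses, and compare against the Hoeffding confidence bound.
rng = np.random.default_rng(1)
n, p, delta = 500, 0.3, 0.05
bound = np.sqrt(np.log(2 / delta) / (2 * n))

trials = 2000
deviations = np.abs(rng.binomial(n, p, size=trials) / n - p)
coverage = np.mean(deviations <= bound)  # should be at least 1 - delta
print(round(bound, 4), coverage)
```

In practice the bound holds far more often than 1 − δ, since Hoeffding's inequality is not tight for a single fixed f.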
Uniform bound
Let |F| = M with F = {f1, ..., fM}. For m = 1, ..., M, let Bm be the event |R̂(fm) − R(fm)| > ε. Then

P(∪_{m=1}^M Bm) = P( max_{1≤m≤M} |R̂(fm) − R(fm)| > ε ) ≤ 2Me^{−2nε²}.

Hence the following holds with probability at least 1 − δ:

max_{1≤m≤M} |R̂(fm) − R(fm)| ≤ √( log(2M/δ) / 2n ).
The uniform bound relies on finite |F|; the argument does not go through if |F| is infinite. To handle infinite |F|, we show two approaches: one is to consider 'the projection of F on the data', and the other is to bound sup_{f∈F} |R̂(f) − R(f)| at once. In the first approach we introduce the shattering number (or growth function) and the Vapnik–Chervonenkis (VC) dimension. In the second approach we introduce McDiarmid's inequality and Rademacher complexity (Koltchinskii and Panchenko, 2002).
Measures of complexity: Shattering number
Let χ be a set and let F be a class of binary functions on χ, f : χ → {0, 1}. For z = {z1, ..., zn} ⊂ χ, define the projection of F on z as F(z1, ..., zn) = {(f(z1), ..., f(zn)) : f ∈ F}. Note that F(z1, ..., zn) is a finite collection of vectors: |F| can be infinite, but |F(z1, ..., zn)| ≤ 2ⁿ.

e.g.1. f(z) = I(z > a) with z1 < z2 < z3. Then F(z1, z2, z3) = {(0,0,0), (0,0,1), (0,1,1), (1,1,1)}.

e.g.2. f(z) = I(a < z < b) with z1 < z2 < z3. Then F(z1, z2, z3) = {(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)}.

Definition (shattering number, growth function): the largest number of dichotomies realizable on any n points:

mF(n) = max_{z1,...,zn} |F(z1, ..., zn)|.
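Both example projections can be enumerated directly. The sketch below uses three illustrative points and a finite grid of thresholds/endpoints (a grid suffices because only the position of a and b relative to the points matters); it recovers |F(z1, z2, z3)| = 4 for thresholds and 7 for intervals.

```python
# Enumerate the projection F(z1, z2, z3) for the two example classes.
z = [1.0, 2.0, 3.0]          # z1 < z2 < z3 (illustrative points)
grid = [0.5, 1.5, 2.5, 3.5]  # one value below, between, and above the points

# e.g.1: thresholds f(z) = I(z > a)
proj_thr = {tuple(int(x > a) for x in z) for a in grid}

# e.g.2: intervals f(z) = I(a < z < b)
proj_int = {tuple(int(a < x < b) for x in z) for a in grid for b in grid}

print(len(proj_thr), len(proj_int))  # 4 = n + 1 and 7 = n(n+1)/2 + 1
```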
By definition, the shattering number satisfies mF(n) ≤ 2ⁿ.

For e.g.1, F is the set of f : R → {0, 1} with f(z) = I(z > a). When n = 3 with z1 < z2 < z3, F(z1, z2, z3) = {(0,0,0), (0,0,1), (0,1,1), (1,1,1)}. In general, mF(n) = n + 1.

For e.g.2, F is the set of f : R → {0, 1} with f(z) = I(a < z < b). When n = 3 with z1 < z2 < z3, F(z1, z2, z3) = {(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)}. In general, mF(n) = n(n + 1)/2 + 1.

e.g.3. F is the set of f : R² → {0, 1} with f(z) = 1 exactly when z lies in some convex region. Here mF(n) = 2ⁿ.
Measures of complexity: VC dimension
The VC dimension of F is the largest value of n for which mF(n) = 2ⁿ, i.e., dVC(F) = sup{n : mF(n) = 2ⁿ}; equivalently, the size of the largest set that F can shatter.

Break point: the smallest number of data points for which we cannot get all possible dichotomies (= VC dimension + 1).
Table: VC dimensions

Function class          VC dimension
interval [a, b]         2
disc in R²              3
half-spaces in Rd       d + 1
convex polygons in R²   ∞
Measures of complexity: Sauer’s Theorem
The VC dimension or break point, regardless of F, can give information about mF(n). If we know the break point is k, we know that no k of the n points can exhibit all possible patterns, e.g. n = 3, k = 2. This observation gives an upper bound for mF(n).

Group  x1  x2  x3
G1      0   0   0
G2      0   0   1
G2      0   1   0
G2      1   0   0

Since k = 2, no two columns can exhibit all possible dichotomies.

Let B(n, k) be the maximum number of patterns with n points and break point k. When we consider x1 and x2 only, G2 represents a set with distinct patterns; the size of this set is ≤ B(n − 1, k). For G1, since x3 takes both possible values, the size is ≤ B(n − 1, k − 1).
When the break point is k, |F(z1, ..., zn)| ≤ B(n − 1, k) + B(n − 1, k − 1) = B(n, k). Using this recursion, the following theorem holds.

Sauer's Theorem: Suppose that F has finite VC dimension d. Then

mF(n) ≤ ∑_{i=0}^d C(n, i),

and for all n ≥ d,

mF(n) ≤ (en/d)^d.

Thus if mF(n) < 2ⁿ for some n, then mF(n) grows only polynomially in n; there is nothing in between.
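Both bounds in Sauer's theorem are easy to evaluate. The sketch below checks them for the threshold class of e.g.1, which has VC dimension d = 1 and growth function mF(n) = n + 1 (the choice n = 100 is illustrative).

```python
import math

# Check Sauer's bound m_F(n) <= sum_{i<=d} C(n,i) <= (e*n/d)^d
# for the threshold class I(z > a): VC dimension d = 1, m_F(n) = n + 1.
def sauer_sum(n, d):
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 1
growth = n + 1  # m_F(n) for thresholds
print(growth, sauer_sum(n, d), round((math.e * n / d) ** d, 1))
```

For thresholds the binomial sum is tight (n + 1 = C(n,0) + C(n,1)), while the (en/d)^d bound is looser but easier to manipulate.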
Uniform bounds using shattering number
(Vapnik and Chervonenkis) Let F be a class of binary functions. For any t > √(2/n),

P( sup_{f∈F} |Pn(f) − P(f)| > t ) ≤ 4 mF(2n) e^{−nt²/8},

and hence with probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ √( (8/n) log( 4 mF(2n)/δ ) ),

where Pn(f) = (1/n) ∑_{i=1}^n f(xi) and P(f) = ∫ f(x) dP(x).
The symmetrization technique can be used in proving the VC theorem.
Symmetrization Lemma
Denote the empirical distribution of x1, ..., xn from P by Pn. Let x′1, ..., x′n denote a second independent sample from P and P′n denote the empirical distribution of the second sample. For all t > √(2/n),

P( sup_{f∈F} |Pn(f) − P(f)| > t ) ≤ 2 P( sup_{f∈F} |Pn(f) − P′n(f)| > t/2 ).

The second sample, x′1, ..., x′n, is called a ghost sample.

If |Pn(f) − P(f)| > t and |P(f) − P′n(f)| ≤ t/2, then |P′n(f) − Pn(f)| > t/2. This observation is used in the proof.

The importance of the result is that a bound on sup_{f∈F} |Pn(f) − P(f)| can be expressed through sup_{f∈F} |Pn(f) − P′n(f)|, which depends on each f only through its values on the 2n sample points. This allows taking the maximum over a finite number of behaviors, counted by the growth function.
Uniform bounds using VC dimension
Recall mF(n) ≤ (en/d)^d. Replacing mF(n) with (en/d)^d in the VC inequality, we obtain for any t > √(2/n)

P( sup_{f∈F} |Pn(f) − P(f)| > t ) ≤ 4 (en/d)^d e^{−nt²/8},

and hence with probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ √( (8/n) [ log(4/δ) + d log(ne/d) ] ),

where Pn(f) = (1/n) ∑_{i=1}^n f(xi) and P(f) = ∫ f(x) dP(x).
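The resulting bound can be evaluated numerically. The sketch below (d = 3, as for discs in R², and δ = 0.05 are illustrative choices) shows the bound shrinking as n grows.

```python
import math

# Evaluate the VC-dimension generalization bound
#   sup_f |P_n(f) - P(f)| <= sqrt((8/n) * (log(4/delta) + d*log(n*e/d)))
# for a few sample sizes.
def vc_bound(n, d, delta=0.05):
    return math.sqrt((8 / n) * (math.log(4 / delta) + d * math.log(n * math.e / d)))

for n in (10**3, 10**4, 10**5):
    print(n, round(vc_bound(n, d=3), 4))
```

The rate is essentially √(d log n / n), so the bound decays slowly; this is one reason VC-type bounds are often numerically loose for moderate n.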
VC dimension bounds for neural networks

One can measure the complexity or capacity of neural network models by how many configurations can be shattered (the VC dimension).

The capacity of the network, if measured by the number of pieces in a piecewise linear approximation, increases exponentially with depth [Montufar, Pascanu et al., 2014].

These results quantify an upper bound on the gap between the empirical and true risk of deep neural networks.

The bounds can be very pessimistic.
General measures of complexity: Rademacher complexity
Motivation: y ∈ {−1, 1}, f : χ → {−1, 1}. Then

R̂(f) = (1/m) ∑_{i=1}^m I(f(xi) ≠ yi) = 1/2 − (1/2m) ∑_{i=1}^m yi f(xi).

Minimizing the training error is thus maximizing (1/m) ∑_{i=1}^m yi f(xi) over f.

Consider random labels instead of y. A bigger model class F can make (1/m) ∑_{i=1}^m yi f(xi) big even for such random labels.

Definition (Rademacher complexity): Let σ1, ..., σn be independent random variables with P(σi = 1) = P(σi = −1) = 1/2. The Rademacher complexity of F is

Radn(F) = E[ sup_{f∈F} (1/n) ∑_{i=1}^n σi f(xi) ].
Define the empirical Rademacher complexity of F by

Radn(F, x) = E_σ[ sup_{f∈F} (1/n) ∑_{i=1}^n σi f(xi) ],

where the expectation is over σ only, with x held fixed.

When |F| = 1, Radn(F) = E[ sup_{f∈F} (1/n) ∑_{i=1}^n σi f(xi) ] = (1/n) ∑_{i=1}^n E[σi f(xi)] = 0, since σi is independent of xi and has mean zero.

When |F(x1, ..., xn)| = 2ⁿ, i.e., F realizes every sign pattern on the sample, Radn(F) = E[ sup_{f∈F} (1/n) ∑_{i=1}^n σi f(xi) ] = 1, since for every draw of σ some f matches it exactly.
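The two extreme cases can be checked by Monte Carlo. In the sketch below (n = 8 points and the trial count are illustrative), the singleton class is represented by a fixed prediction vector of ones, and the fully shattering class by letting f match each sign pattern exactly.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
# Rad_n(F, x) = E_sigma sup_f (1/n) sum_i sigma_i f(x_i).
rng = np.random.default_rng(2)
n, trials = 8, 4000
sigma = rng.choice([-1, 1], size=(trials, n))

# |F| = 1: the sup is over a single fixed prediction vector (all ones here).
rad_single = np.mean(sigma @ np.ones(n) / n)

# F realizes all 2^n sign patterns: the best f sets f(x_i) = sigma_i,
# so the sup equals 1 for every draw of sigma.
rad_full = np.mean(np.abs(sigma).sum(axis=1) / n)

print(round(rad_single, 3), rad_full)
```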
Rademacher complexity: Example
Example (Ridge regression): Let F be the class of linear predictors f(x) = ⟨x, w⟩ with restriction ‖w‖2 ≤ W2. Additionally assume that ‖x‖2 ≤ X2.

Radn(F, x) = E_σ sup_{w:‖w‖2≤W2} (1/n) ∑_{i=1}^n σi ⟨xi, w⟩
           = (1/n) E_σ sup_{w:‖w‖2≤W2} ⟨∑_{i=1}^n σi xi, w⟩
           = (W2/n) E_σ ‖∑_{i=1}^n σi xi‖2
           ≤ (W2/n) √( E_σ ‖∑_{i=1}^n σi xi‖2² )      (Jensen)
           ≤ (W2/n) √( E_σ ∑_{i=1}^n ‖σi xi‖2² )      (E σiσj = 0 for i ≠ j)
           = (W2/n) √( ∑_{i=1}^n ‖xi‖2² )
           ≤ X2 W2 / √n.
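The derivation suggests a direct Monte Carlo check: the sup over the ball {‖w‖2 ≤ W2} is attained at w proportional to ∑i σi xi, so the empirical Rademacher complexity equals (W2/n) E_σ ‖∑i σi xi‖2. The sketch below (dimensions, W2, and the trial count are illustrative; inputs are rescaled so X2 = 1) compares it with X2W2/√n.

```python
import numpy as np

# Monte Carlo check of the linear-class bound: Rad_n(F, x) = (W2/n) E_sigma ||s||_2
# with s = sum_i sigma_i x_i, bounded by X2*W2/sqrt(n).
rng = np.random.default_rng(3)
n, d, W2 = 50, 5, 2.0
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_i||_2 <= 1
X2 = 1.0

trials = 5000
sigma = rng.choice([-1, 1], size=(trials, n))
rad_hat = W2 / n * np.mean(np.linalg.norm(sigma @ X, axis=1))
bound = X2 * W2 / np.sqrt(n)
print(round(rad_hat, 4), round(bound, 4))
```

The estimate sits just below the bound; the only slack in the derivation is the single Jensen step.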
Properties of Rademacher complexity
(Monotonicity) If F ⊂ G then Radn(F , x) ≤ Radn(G, x).
Proof. Radn(F, x) = E_σ[ sup_{f∈F} (1/n) ∑_{i=1}^n σi f(xi) ] ≤ E_σ[ sup_{f∈G} (1/n) ∑_{i=1}^n σi f(xi) ] = Radn(G, x).

(Convex hull) Let conv(F) be the convex hull of F. Then Radn(F, x) = Radn(conv(F), x).

(Scale and shift) For any function class F and c, d ∈ R, define cF + d = {cf + d : f ∈ F}. Then Radn(cF + d, x) = |c| Radn(F, x).

(Lipschitz composition) If φ is a Lipschitz function such that |φ(s) − φ(t)| ≤ L|s − t| for all s, t ∈ dom(φ), then Radn(φ ◦ F) ≤ L Radn(F).
Uniform bounds using Rademacher complexity
With probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ 2 Radn(F) + √( (1/2n) log(2/δ) ),

sup_{f∈F} |Pn(f) − P(f)| ≤ 2 Radn(F, x) + √( (2/n) log(2/δ) ).

A proof requires application of the theorems of (i) McDiarmid and (ii) symmetrization. (i) Let g(x) = sup_{f∈F} |Pn(f) − P(f)|, where x = {x1, ..., xi, ..., xn}, and let x′ = {x1, ..., zi, ..., xn} differ from x only in the i-th coordinate. Then check that g(x) − g(x′) is bounded (e.g. by 1/n in the binary case). By McDiarmid's inequality, P(|g(x) − E g(x)| > ε) ≤ 2e^{−2nε²}, and with probability at least 1 − δ, g(x) ≤ E(g(x)) + √( (1/2n) log(2/δ) ). Using (ii), one can show E(g(x)) ≤ 2 Radn(F).
McDiarmid’s Inequality
McDiarmid's Inequality: Let Z1, ..., Zn be independent random variables. Let x = {x1, ..., xi, ..., xn} and x′ = {x1, ..., zi, ..., xn} differ only in the i-th coordinate. Suppose that sup_{x1,...,xn,zi} |f(x) − f(x′)| ≤ ci for i = 1, ..., n. Then

P( |f(Z1, ..., Zn) − E(f(Z1, ..., Zn))| ≥ ε ) ≤ 2 exp( −2ε² / ∑_{i=1}^n ci² ).

If f(x1, ..., xn) = n⁻¹ ∑_{i=1}^n xi with xi ∈ [ai, bi], McDiarmid's inequality reduces to Hoeffding's.
McDiarmid's inequality shows that with probability at least 1 − δ, g(x) − E g(x) ≤ √( (1/2n) log(2/δ) ), where g(x) = sup_{f∈F} |Pn(f) − P(f)|. The symmetrization lemma shows E(g(x)) ≤ 2 Radn(F). To prove this, we use a ghost sample x′1, ..., x′n and Rademacher variables σ1, ..., σn. Note that n⁻¹ ∑_{i=1}^n {f(xi) − f(x′i)} has the same distribution as n⁻¹ ∑_{i=1}^n σi {f(xi) − f(x′i)}.

E(g(x)) = E[ sup_{f∈F} |P(f) − Pn(f)| ] = E[ sup_{f∈F} |E′(P′n(f) − Pn(f))| ]
        ≤ E E′[ sup_{f∈F} |P′n(f) − Pn(f)| ]
        = E E′[ sup_{f∈F} |n⁻¹ ∑_{i=1}^n {f(xi) − f(x′i)}| ]
        = E E′[ sup_{f∈F} |n⁻¹ ∑_{i=1}^n σi {f(xi) − f(x′i)}| ]
        ≤ 2 Radn(F).
Uniform bounds using empirical Rademacher complexity
McDiarmid's inequality implies that with probability at least 1 − δ,

|Radn(F) − Radn(F, x)| ≤ √( (1/2n) log(2/δ) ).

Combining, with probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ 2 Radn(F, x) + √( (2/n) log(2/δ) ).
Relationship between Rademacher Complexity and Growth Function
Massart's Finite Lemma: Let A be some finite subset of Rⁿ and σi be independent Rademacher random variables. Let r = sup_{a∈A} ‖a‖2. Then we have

E_σ[ sup_{a∈A} (1/n) ∑_{i=1}^n σi ai ] ≤ r √(2 log |A|) / n.

To prove Massart's lemma, we first establish

exp( s E_σ[ sup_{a∈A} (1/n) ∑_{i=1}^n σi ai ] ) ≤ |A| e^{s²r²/(2n²)}

using Jensen's inequality and Hoeffding's lemma. Then, taking logs on both sides and dividing by s,

E_σ[ sup_{a∈A} (1/n) ∑_{i=1}^n σi ai ] ≤ (log |A|)/s + s r²/(2n²),

and then solve for the optimal s and substitute back.
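Since the lemma involves only a finite set A and n Rademacher signs, it can be verified exactly by enumerating all 2ⁿ sign vectors. The sketch below (n = 10 and |A| = 6 random vectors are illustrative) computes the left-hand expectation exactly and compares it with r√(2 log|A|)/n.

```python
import numpy as np
from itertools import product

# Exact check of Massart's finite lemma by enumerating all 2^n sign vectors.
rng = np.random.default_rng(4)
n, m = 10, 6
A = rng.normal(size=(m, n))          # the finite set A, as m row vectors
r = np.linalg.norm(A, axis=1).max()  # r = sup_{a in A} ||a||_2

sigmas = np.array(list(product([-1, 1], repeat=n)))  # all 2^10 sign vectors
lhs = np.mean(np.max(sigmas @ A.T, axis=1)) / n      # E_sigma sup_a (1/n) sum_i sigma_i a_i
rhs = r * np.sqrt(2 * np.log(m)) / n
print(round(lhs, 4), round(rhs, 4))
```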
Bounding Rademacher complexity
Let f ∈ F and write f = (f(x1), ..., f(xn)). Assume that |f(x)| ≤ M for any x. Then ‖f‖2 ≤ √n M. Applying Massart's lemma to A = F(X1, ..., Xn), we have

Radn(F) ≤ M √( 2 log |F(X1, ..., Xn)| / n ) ≤ M √( 2 log mF(n) / n ).

For a binary function class F with VC dimension d, mF(n) ≤ (en/d)^d if n > d. Therefore (with M = 1)

Radn(F) ≤ √( 2 log mF(n) / n ) ≤ √( 2d log(en/d) / n ) ≤ √( 2 dVC(F)(1 + log n) / n ).
Error bound for binary cases
Using Radn(F) ≤ √( 2 log mF(n) / n ), with probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ 2 √( 2 log mF(n) / n ) + √( (1/2n) log(2/δ) ).

When F has finite VC dimension d, using Radn(F) ≤ C √( (d log n)/n ) for a constant C > 0, with probability at least 1 − δ,

sup_{f∈F} |Pn(f) − P(f)| ≤ 2C √( (d log n)/n ) + √( (1/2n) log(2/δ) ).
Covering number
A pseudometric space (S, d) is a set S and a function d : S × S → R+ (called a pseudometric) such that, for any x, y, z ∈ S:

d(x, y) = d(y, x) (symmetry);
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality);
d(x, x) = 0.

A metric space is obtained if one further assumes that d(x, y) = 0 implies x = y. Covering numbers are defined on pseudometric spaces.

Definition (ε-cover): The set C ⊆ S is an ε-cover of (S, d) if for every x ∈ S there exists y ∈ C such that d(x, y) ≤ ε.
Covering number of F
Definition (ε-cover of F): If Q is a measure and p ≥ 1, define ‖f‖_{Lp(Q)} = ( ∫ |f(x)|^p dQ(x) )^{1/p}. A set V = {f1, f2, ..., fN} is an ε-cover of F if for every f ∈ F there exists an fj ∈ V such that ‖f − fj‖_{Lp(Q)} < ε.

Definition (Covering number): Np(ε, F, Q) = min{ |V| : V is an ε-cover of F }.

Definition (Uniform covering number): Np(ε, F) = sup_Q Np(ε, F, Q).

Definition (Empirical covering number): Let {Xi}_{i=1}^n be n fixed points and Qn be the corresponding empirical measure. Then ‖f‖_{Lp(Qn)} = ( (1/n) ∑_{i=1}^n |f(Xi)|^p )^{1/p}, and Np(ε, F, Qn) is called the empirical covering number: the minimal N such that there exist f1, ..., fN with the property that for every f ∈ F there is a j ∈ {1, ..., N} with ( (1/n) ∑_{i=1}^n |f(Xi) − fj(Xi)|^p )^{1/p} < ε.
Covering Number: Example
Suppose that A ⊂ Rᵐ, let c = max_{a∈A} ‖a‖2, and assume that A lies in a d-dimensional subspace of Rᵐ. Then N(ε, A) ≤ (2c√d/ε)^d. To see this, let v1, ..., vd be an orthonormal basis of the subspace, so that any a ∈ A can be written a = ∑_{i=1}^d αi vi with ‖α‖∞ ≤ ‖α‖2 = ‖a‖2 ≤ c. Consider

A′ = { ∑_{i=1}^d α′i vi : α′i ∈ {−c, −c + r, −c + 2r, ..., c} }.

Given a ∈ A, there exists a′ ∈ A′ with each |α′i − αi| ≤ r, so by orthonormality

‖a − a′‖2² = ‖∑_{i=1}^d (α′i − αi) vi‖2² = ∑_{i=1}^d (α′i − αi)² ≤ r²d.

Setting r√d = ε, A′ is an ε-cover of A, and

N(ε, A) ≤ |A′| = (2c/r)^d = (2c√d/ε)^d.
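The grid construction can be checked numerically. The sketch below (a d = 2 subspace of R⁵, ε = 0.3, and 200 sample points are illustrative choices) builds A′ from an orthonormal basis and verifies that every point of A lies within ε of the cover.

```python
import numpy as np
from itertools import product

# Build the grid cover A' from the argument above and verify the cover property.
rng = np.random.default_rng(5)
d, m, c, eps = 2, 5, 1.0, 0.3
V = np.linalg.qr(rng.normal(size=(m, d)))[0]   # orthonormal basis of the subspace
alphas = rng.uniform(-c / np.sqrt(d), c / np.sqrt(d), size=(200, d))  # ensures ||a||_2 <= c
A = alphas @ V.T

r = eps / np.sqrt(d)                 # spacing from setting r*sqrt(d) = eps
grid = np.arange(-c, c + r, r)       # coefficient grid {-c, -c+r, ..., c}
A_prime = np.array(list(product(grid, repeat=d))) @ V.T

# every a in A should be within eps of some cover point
dists = np.linalg.norm(A[:, None, :] - A_prime[None, :, :], axis=2).min(axis=1)
print(len(A_prime), bool(dists.max() <= eps))
```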
Bounding Rademacher Complexity with Covering Number: Pollard's Lemma

Pollard's Lemma: For any sample S = {Xi}_{i=1}^n and F = {f : X → {−1, 1}}, we have

Radn(F, S) ≤ inf_{β≥0} { β + √( 2 log N1(β, F, Qn) / n ) }.

A key to proving Pollard's lemma is the identity

sup_{f∈F} (1/n) ∑_{i=1}^n σi f(Xi) = sup_{v∈V} sup_{f∈Bβ(v)} (1/n) ∑_{i=1}^n σi (f(Xi) − vi + vi),

where V is an l1-cover of F with |V| = N1(β, F, Qn) and Bβ(v) = { f ∈ F : (1/n) ∑_{i=1}^n |f(Xi) − vi| ≤ β }, and then to apply Massart's finite lemma.
Bounding Rademacher Complexity with Covering Number: Dudley's Chaining

Dudley's Chaining: For any i.i.d. sample S = {Xi}_{i=1}^n and F = {f : X → {−1, 1}}, we have

Radn(F, S) ≤ inf_{0≤α≤1} { 4α + 12 ∫_α^1 √( log N2(δ, F, Qn) / n ) dδ }.

The main idea is to use balls of different sizes for the covering. Let Vj be a minimal l2-cover at scale εj = 2⁻ʲ with |Vj| = N2(εj, F, Qn) for j ∈ {0, 1, ..., N}, so that |Vj| ≤ |Vj+1|.
Generalization Bounds for Deep Neural Networks: Setup
We can write a DNN as follows. Let σi : R^{di} → R^{di+1} be 1-Lipschitz continuous functions with σi(0) = 0. Let

fA(x) = σL(AL σL−1(AL−1 ··· σ1(A1 x) ··· ))

with x ∈ Rd, Ai ∈ R^{di×di−1}, d0 = d, dL+1 = k, and W = maxi di. Assume ‖x‖ ≤ B.

We give two results for two slightly different function classes: one with bounded Frobenius norm, ‖A‖F = √( ∑_{i=1}^d ∑_{j=1}^m Aij² ), and the other with bounded spectral norm, ‖A‖2 = sup_{‖u‖2=1} ‖Au‖2 = λmax, the largest singular value of A.
Generalization Bounds for Deep Neural Networks
(Golowich et al., 2017) For the function class FA,‖·‖F,γ = { fA : Rd → Rk, A = {A1, ..., AL} : ‖Ai‖F ≤ Mi, i = 1, ..., L, γ ≤ ∏_{i=1}^L ‖Ai‖2 },

Radn(F, S) ≲ B ( ∏_{i=1}^L Mi ) · min{ √( log( γ⁻¹ ∏_{i=1}^L Mi ) ) / √n , √(L/n) }.

(Li et al., 2018) For the function class FA,‖·‖2 = { fA : Rd → Rk, A = {A1, ..., AL} : ‖Ai‖2 ≤ si, i = 1, ..., L, σi(0) = 0 },

Radn(F, S) ≲ ( B ∏_{i=1}^L si / √n ) · √( L W² log( √L n · max_{1≤i≤L} si / min_{1≤i≤L} si ) ).
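To get a feel for the Frobenius-norm bound stated above (as reconstructed here), the sketch below evaluates B(∏Mi) · min{ √(log(γ⁻¹∏Mi))/√n, √(L/n) } for a few depths, taking all Mi equal to a common M; the constants B, M, γ, and n are purely illustrative.

```python
import math

# Illustrative evaluation of the Frobenius-norm Rademacher bound with all
# ||A_i||_F <= M; B, M, gamma, and n are made-up values for illustration.
def frob_bound(B, M, L, n, gamma):
    prod_M = M ** L
    return B * prod_M * min(math.sqrt(math.log(prod_M / gamma)) / math.sqrt(n),
                            math.sqrt(L / n))

for L in (2, 5, 10):
    print(L, round(frob_bound(B=1.0, M=1.1, L=L, n=10**5, gamma=1.0), 5))
```

With norm products kept near 1, the bound grows only mildly with depth, which is the point of norm-based (rather than parameter-counting) generalization bounds.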