• Classical case:
  • n ≫ d.
  • Asymptotic assumption: d is fixed and n → ∞.
  • Basic tools: LLN and CLT.
• High-dimensional setting:
  • n ∼ d, e.g., n/d → γ, or
  • d ≫ n, e.g., d ∼ e^n;
  • e.g., 10⁴ genes with only 50 samples.
• Classical methods fail. E.g., linear regression y = Xβ + ε, where ε ∼ N(0, σ²I_n):
  β̂_OLS = argmin_{β ∈ R^d} ‖y − Xβ‖_2²
  We have MSE(β̂_OLS) = O(σ²d/n).
• Solution: assume some underlying low-dimensional structure (e.g., sparsity).
• Basic tools: concentration inequalities.
2 / 124
Table of Contents
1 Concentration inequalities
   Sub-Gaussian concentration (Hoeffding inequality)
   Sub-exponential concentration (Bernstein inequality)
   Applications of Bernstein inequality
      χ² concentration
      Johnson–Lindenstrauss embedding
      ℓ2 norm concentration
      ℓ∞ norm
   Bounded difference inequality (Azuma–Hoeffding)
      Detour: Martingales
      Azuma–Hoeffding
      Bounded difference inequality
      Concentration of (bounded) U-statistics
      Concentration of clique numbers
   Gaussian concentration
      χ² concentration revisited
      Order statistics and singular values
   Gaussian chaos (Hanson–Wright inequality)
2 Sparse linear models
   Basis pursuit
   Restricted null space property (RNS)
   Sufficient conditions for RNS
      Pairwise incoherence
      Restricted isometry property (RIP)
   Noisy sparse regression
      Restricted eigenvalue condition
      Deviation bounds under RE
      RE for anisotropic design
      Oracle inequality (and ℓq sparsity)
3 Metric entropy
   Packing and covering
   Volume ratio estimates
4 Random matrices and covariance estimation
   Op-norm concentration: sub-Gaussian matrices, independent entries
   Op-norm concentration: Gaussian case
   Sub-Gaussian random vectors
   Op-norm concentration: sample covariance
5 Structured covariance matrices
   Hard thresholding estimator
   Approximate sparsity (ℓq balls)
6 Matrix concentration inequalities
Concentration inequalities
• Main tools for dealing with high-dimensional randomness.
• Non-asymptotic versions of the CLT.
• General form: P(|X − EX| > t) ≤ something small.
• Classical examples: Markov and Chebyshev inequalities:
  • Markov: assume X ≥ 0; then
    P(X ≥ t) ≤ EX / t.
  • Chebyshev: assume EX² < ∞, and let µ = EX. Then,
    P(|X − µ| ≥ t) ≤ var(X) / t².
• Stronger assumption: E|X|^k < ∞. Then,
  P(|X − µ| ≥ t) ≤ E|X − µ|^k / t^k.
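As a quick numerical sanity check (not from the slides), the Markov and Chebyshev bounds can be compared against Monte Carlo tail estimates for a simple example; here we use an Exp(1) variable (EX = 1, var(X) = 1) as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(1.0, size=200_000)   # EX = 1, var(X) = 1

# Markov: P(X >= t) <= EX / t, valid since X >= 0
t = 3.0
markov_bound = X.mean() / t
tail = (X >= t).mean()                   # true value is e^{-3} ~ 0.0498

# Chebyshev: P(|X - mu| >= u) <= var(X) / u^2
u = 2.0
cheb_bound = X.var() / u**2
central_tail = (np.abs(X - X.mean()) >= u).mean()
```

Both bounds hold but are quite loose here (the true tail is e⁻³ ≈ 0.05 versus a Markov bound of ≈ 0.33), which is exactly the motivation for the sharper concentration inequalities that follow.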
Concentration inequalities
Example 1
• X_1, …, X_n ∼ Ber(1/2) and S_n = Σ_{i=1}^n X_i. Then, by the CLT,
  Z_n := (S_n − n/2) / √(n/4) →d N(0, 1).
• Letting g ∼ N(0, 1),
  P(S_n ≥ n/2 + √(n/4) t) ≈ P(g ≥ t) ≤ (1/2) e^{−t²/2}.
• Letting t = α√n,
  P(S_n ≥ (n/2)(1 + α)) ⪅ (1/2) e^{−nα²/2}.
• Problem: the approximation is not tight in general.
Theorem 1 (Berry–Esseen CLT)
Under the assumptions of the CLT, with ρ = E|X_1 − µ|³/σ³,
  |P(Z_n ≥ t) − P(g ≥ t)| ≤ ρ/√n.

• The bound is tight, since P(S_n = n/2) = 2^{−n} (n choose n/2) ∼ 1/√n for the Bernoulli example.
• Conclusion: the approximation error is O(n^{−1/2}), which is a lot larger than the exponential bound O(e^{−nα²/2}) that we want to establish.
• Solution: directly obtain the concentration inequalities,
• often using the Chernoff bounding technique: for any λ > 0,
  P(Z_n ≥ t) = P(e^{λZ_n} ≥ e^{λt}) ≤ E e^{λZ_n} / e^{λt}, t ∈ R.
• This leads to the study of the MGF of random variables.
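The Chernoff technique can be carried out numerically for the Bernoulli example above, a sketch not from the slides: minimize e^{−λa}(E e^{λX})^n over a grid of λ > 0 and compare with the sub-Gaussian-type target e^{−nα²/2}:

```python
import numpy as np

# Chernoff bound for S_n = sum of n Ber(1/2) variables:
# P(S_n >= a) <= inf_{lam > 0} e^{-lam a} (E e^{lam X_1})^n
n, alpha = 100, 0.2
a = n / 2 * (1 + alpha)

lam = np.linspace(1e-4, 2.0, 2000)           # grid over lambda > 0
mgf = (1 + np.exp(lam)) / 2                  # E e^{lam X} for X ~ Ber(1/2)
chernoff = np.min(np.exp(-lam * a) * mgf**n)

target = np.exp(-n * alpha**2 / 2)           # the exponential bound we want
```

The optimized Chernoff bound is slightly smaller than e^{−nα²/2} here, consistent with the fact that the Chernoff exponent for Bernoulli variables is a KL divergence, which dominates α²/2.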
Sub-Gaussian concentration
Definition 1
A zero-mean random variable X is sub-Gaussian if for some σ > 0,
  E e^{λX} ≤ e^{σ²λ²/2}, for all λ ∈ R. (1)
A general random variable X is sub-Gaussian if X − EX is sub-Gaussian.

• X ∼ N(0, σ²) satisfies (1) with equality.
• A Rademacher variable, also called symmetric Bernoulli (P(X = ±1) = 1/2), is sub-Gaussian:
  E e^{λX} = cosh(λ) ≤ e^{λ²/2}.
• Any bounded RV is sub-Gaussian: if X ∈ [a, b] a.s., then (1) holds with σ = (b − a)/2.
Proposition 1
Assume that X is zero-mean sub-Gaussian satisfying (1). Then,
  P(X ≥ t) ≤ exp(−t²/(2σ²)), for all t ≥ 0.
The same bound holds with X replaced by −X.

• Proof: Chernoff bound,
  P(X ≥ t) ≤ inf_{λ>0} [e^{−λt} E e^{λX}] = inf_{λ>0} exp(−λt + λ²σ²/2).
• A union bound gives the two-sided bound: P(|X| ≥ t) ≤ 2 exp(−t²/(2σ²)).
• What if µ := EX ≠ 0? Apply the result to X − µ:
  P(|X − µ| ≥ t) ≤ 2 exp(−t²/(2σ²)).
Proposition 2
Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then, S_n = Σ_i X_i is sub-Gaussian with parameter σ := √(Σ_i σ_i²).

• The sub-Gaussian parameter squared behaves like the variance.
• Proof: E e^{λS_n} = Π_i E e^{λX_i}.
Theorem 2 (Hoeffding)
Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then, letting σ² := Σ_i σ_i²,
  P(Σ_i X_i ≥ t) ≤ exp(−t²/(2σ²)), t ≥ 0.
The same bound holds with X_i replaced by −X_i.

• Alternative form: assume there are n variables, and
• let σ̄² := (1/n) Σ_{i=1}^n σ_i² and X̄_n := (1/n) Σ_{i=1}^n X_i. Then,
  P(X̄_n ≥ t) ≤ exp(−nt²/(2σ̄²)), t ≥ 0.
• Example: X_i iid∼ Rad, so that σ̄ = σ_i = 1.
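The Rademacher example can be checked by simulation; a minimal sketch (not from the slides) comparing the empirical tail of X̄_n with the Hoeffding bound e^{−nt²/2}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 50, 100_000, 0.3
X = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher, sigma_i = 1
emp = (X.mean(axis=1) >= t).mean()            # Monte Carlo P(Xbar_n >= t)
bound = np.exp(-n * t**2 / 2)                 # Hoeffding with sigma-bar = 1
```

With n = 50 and t = 0.3 the empirical tail is roughly 0.017 while the bound is about 0.105: valid, with a constant-factor gap typical of Hoeffding for bounded variables.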
Equivalent characterizations of sub-Gaussianity
For a RV X, the following are equivalent (HDP, Prop. 2.5.2):
1. The tails of X satisfy
  P(|X| ≥ t) ≤ 2 exp(−t²/K_1²), for all t ≥ 0.
2. The moments of X satisfy
  ‖X‖_p = (E|X|^p)^{1/p} ≤ K_2 √p, for all p ≥ 1.
3. The MGF of X² satisfies
  E exp(λ²X²) ≤ exp(K_3²λ²), for all |λ| ≤ 1/K_3.
4. The MGF of X² is bounded at some point,
  E exp(X²/K_4²) ≤ 2.
Assuming EX = 0, the above are equivalent to:
5. The MGF of X satisfies
  E exp(λX) ≤ exp(K_5²λ²), for all λ ∈ R.
Sub-Gaussian norm
• The sub-Gaussian norm is the smallest K_4 in property 4, i.e.,
  ‖X‖_{ψ₂} = inf{t > 0 : E exp(X²/t²) ≤ 2}.
• X is sub-Gaussian iff ‖X‖_{ψ₂} < ∞.
• ‖·‖_{ψ₂} is a proper norm on the space of sub-Gaussian RVs.
• Every sub-Gaussian variable satisfies the following bounds, for some universal constants C, c > 0:
  P(|X| ≥ t) ≤ 2 exp(−ct²/‖X‖_{ψ₂}²), for all t ≥ 0;
  ‖X‖_p ≤ C‖X‖_{ψ₂} √p, for all p ≥ 1;
  E exp(X²/‖X‖_{ψ₂}²) ≤ 2;
  when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖_{ψ₂}²) for all λ ∈ R.
Some consequences
• Recall what a universal/numerical/absolute constant means.
• The sub-Gaussian norm is within a constant factor of the sub-Gaussian parameter σ: for numerical constants c_1, c_2 > 0,
  c_1 ‖X‖_{ψ₂} ≤ σ(X) ≤ c_2 ‖X‖_{ψ₂}.
• Easy to see that ‖X‖_{ψ₂} ≲ ‖X‖_∞. (Bounded variables are sub-Gaussian.)
• a ≲ b means a ≤ Cb for some universal constant C.

Lemma 1 (Centering)
If X is sub-Gaussian, then X − EX is sub-Gaussian too, and
  ‖X − EX‖_{ψ₂} ≤ C‖X‖_{ψ₂},
where C is a universal constant.

• Proof: ‖EX‖_{ψ₂} ≲ |EX| ≤ E|X| = ‖X‖_1 ≲ ‖X‖_{ψ₂}.
• Note: ‖X − EX‖_{ψ₂} could be much smaller than ‖X‖_{ψ₂}.
Alternative forms
• Alternative form of Proposition 2:

Proposition 3 (HDP 2.6.1)
Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then Σ_i X_i is also sub-Gaussian and
  ‖Σ_i X_i‖_{ψ₂}² ≤ C Σ_i ‖X_i‖_{ψ₂}²,
where C is an absolute constant.
• Alternative form of Theorem 2:

Theorem 3 (Hoeffding)
Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then,
  P(|Σ_i X_i| ≥ t) ≤ 2 exp(−c t² / Σ_i ‖X_i‖_{ψ₂}²), t ≥ 0.

• c > 0 is some universal constant.
Sub-exponential concentration
Definition 2
A zero-mean random variable X is sub-exponential if for some ν, α > 0,
  E e^{λX} ≤ e^{ν²λ²/2}, for all |λ| < 1/α. (2)
A general random variable X is sub-exponential if X − EX is sub-exponential.

• If Z ∼ N(0, 1), then Z² is sub-exponential:
  E e^{λ(Z²−1)} = e^{−λ}/√(1 − 2λ) for λ < 1/2, and = ∞ for λ ≥ 1/2.
• We have E e^{λ(Z²−1)} ≤ e^{4λ²/2} for |λ| < 1/4,
• hence Z² − 1 is sub-exponential with parameters (2, 4).
• The tails of Z² − 1 are heavier than Gaussian.
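The closed-form MGF of Z² − 1 and the sub-exponential bound above can be checked numerically; a Monte Carlo sketch (not from the slides), at one value λ = 0.2 inside |λ| < 1/4:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal(1_000_000)
lam = 0.2                                     # inside |lam| < 1/4
mc = np.exp(lam * (Z**2 - 1)).mean()          # Monte Carlo E e^{lam (Z^2 - 1)}
exact = np.exp(-lam) / np.sqrt(1 - 2 * lam)   # closed form, valid for lam < 1/2
sub_exp = np.exp(4 * lam**2 / 2)              # bound e^{nu^2 lam^2 / 2}, nu = 2
```

The Monte Carlo estimate matches the closed form, which in turn sits below the sub-exponential envelope e^{4λ²/2}.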
Proposition 4
Assume that X is zero-mean sub-exponential satisfying (2). Then,
  P(X ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}), for all t ≥ 0.
The same bound holds with X replaced by −X.

• Proof: Chernoff bound,
  P(X ≥ t) ≤ inf_{λ≥0} [e^{−λt} E e^{λX}] ≤ inf_{0≤λ<1/α} exp(−λt + λ²ν²/2).
• Let f(λ) = −λt + λ²ν²/2.
• The minimizer of f over R is λ = t/ν².
• Hence the minimizer of f over [0, 1/α] is
  λ* = t/ν² if t/ν² < 1/α, and λ* = 1/α if t/ν² ≥ 1/α,
and the minimum is
  f(λ*) = −t²/(2ν²) if t < ν²/α,
  f(λ*) = −t/α + ν²/(2α²) ≤ −t/(2α) if t ≥ ν²/α.
• Thus,
  f(λ*) ≤ max{−t²/(2ν²), −t/(2α)} = −(1/2) min{t²/ν², t/α}.
Bernstein inequality for sub-exponential RVs
Theorem 4 (Bernstein)
Assume that {X_i} are independent, zero-mean sub-exponential RVs with parameters (ν_i, α_i). Let ν := (Σ_i ν_i²)^{1/2} and α := max_i α_i.
Then Σ_i X_i is sub-exponential with parameters (ν, α), and
  P(Σ_i X_i ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}).

• Proof: We have
  E e^{λX_i} ≤ e^{λ²ν_i²/2}, for all |λ| < 1/max_i α_i.
• Let S_n = Σ_i X_i. By independence,
  E e^{λS_n} = Π_i E e^{λX_i} ≤ e^{λ² Σ_i ν_i²/2}, for all |λ| < 1/max_i α_i.
• The tail bound follows from Proposition 4.
Equivalent characterizations of sub-exponential RVs
For a RV X, the following are equivalent (HDP, Prop. 2.7.1):
1. The tails of X satisfy
  P(|X| ≥ t) ≤ 2 exp(−t/K_1), for all t ≥ 0.
2. The moments of X satisfy
  ‖X‖_p = (E|X|^p)^{1/p} ≤ K_2 p, for all p ≥ 1.
3. The MGF of |X| satisfies
  E exp(λ|X|) ≤ exp(K_3 λ), for all 0 ≤ λ ≤ 1/K_3.
4. The MGF of |X| is bounded at some point,
  E exp(|X|/K_4) ≤ 2.
Assuming EX = 0, the above are equivalent to:
5. The MGF of X satisfies
  E exp(λX) ≤ exp(K_5²λ²), for all |λ| ≤ 1/K_5.
Sub-exponential norm
• The sub-exponential norm is the smallest K_4 in property 4, i.e.,
  ‖X‖_{ψ₁} = inf{t > 0 : E exp(|X|/t) ≤ 2}.
• X is sub-exponential iff ‖X‖_{ψ₁} < ∞.
• ‖·‖_{ψ₁} is a proper norm on the space of sub-exponential RVs.
• Every sub-exponential variable satisfies the following bounds, for some universal constants C, c > 0:
  P(|X| ≥ t) ≤ 2 exp(−ct/‖X‖_{ψ₁}), for all t ≥ 0;
  ‖X‖_p ≤ C‖X‖_{ψ₁} p, for all p ≥ 1;
  E exp(|X|/‖X‖_{ψ₁}) ≤ 2;
  when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖_{ψ₁}²) for all |λ| ≤ 1/‖X‖_{ψ₁}.
Lemma 2
A random variable X is sub-Gaussian if and only if X² is sub-exponential; in fact,
  ‖X²‖_{ψ₁} = ‖X‖_{ψ₂}².

• Proof: Immediate from the definitions.

Lemma 3
If X and Y are sub-Gaussian, then XY is sub-exponential, and
  ‖XY‖_{ψ₁} ≤ ‖X‖_{ψ₂} ‖Y‖_{ψ₂}.

• Proof: Assume ‖X‖_{ψ₂} = ‖Y‖_{ψ₂} = 1, WLOG.
• Apply Young's inequality, ab ≤ (a² + b²)/2 for all a, b ∈ R, twice:
  E e^{|XY|} ≤ E e^{(X²+Y²)/2} = E[e^{X²/2} e^{Y²/2}] ≤ (1/2) E[e^{X²} + e^{Y²}] ≤ 2.
• Alternative form of Theorem 4:

Theorem 5 (Bernstein)
Assume that {X_i} are independent, zero-mean sub-exponential RVs. Then,
  P(|Σ_i X_i| ≥ t) ≤ 2 exp[−c min(t²/Σ_i ‖X_i‖_{ψ₁}², t/max_i ‖X_i‖_{ψ₁})], t ≥ 0.

• c > 0 is some universal constant.

Corollary 1 (Bernstein)
Assume that {X_i} are independent, zero-mean sub-exponential RVs with ‖X_i‖_{ψ₁} ≤ K for all i. Then,
  P(|(1/n) Σ_{i=1}^n X_i| ≥ t) ≤ 2 exp[−c n min(t²/K², t/K)], t ≥ 0.
Concentration of χ² RVs I

Example 2
Let Y ∼ χ²_n, i.e., Y = Σ_{i=1}^n Z_i², where Z_i iid∼ N(0, 1).
• The Z_i² − 1 are sub-exponential with parameters (2, 4).
• Then Y − EY is sub-exponential with parameters (2√n, 4), and we obtain
  P(|Y − EY| ≥ t) ≤ 2 exp[−(1/2) min(t²/(4n), t/4)],
or, replacing t with nt,
  P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−(n/8) min(t², t)], t ≥ 0.
Concentration of χ² RVs II

• In particular,
  P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 e^{−nt²/8}, t ∈ [0, 1].
• Second approach, ignoring constants:
  • We have ‖Z_i² − 1‖_{ψ₁} ≤ C′‖Z_i²‖_{ψ₁} = C′‖Z_i‖_{ψ₂}² =: C.
  • Applying Corollary 1 with K = C,
    P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−c n min(t²/C², t/C)]
      ≤ 2 exp[−c_2 n min(t², t)], t ≥ 0,
    where c_2 = c min(1/C², 1/C).
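A small simulation (not from the slides) makes the χ² concentration concrete: with n = 100 the empirical probability that the average of squared Gaussians deviates from 1 by more than t = 0.3 sits well below the bound 2e^{−nt²/8}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, t = 100, 50_000, 0.3
Z = rng.standard_normal((reps, n))
emp = (np.abs((Z**2).mean(axis=1) - 1) >= t).mean()
bound = 2 * np.exp(-n * t**2 / 8)     # first-approach bound, valid for t in [0, 1]
```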
Random projection for dimension reduction
• Suppose that we have data points {u_1, …, u_N} ⊆ R^d.
• We want to project them down to a lower-dimensional space R^m (m ≪ d)
• such that the pairwise distances ‖u_i − u_j‖_2 are approximately preserved.
• This can be done by a linear random projection X : R^d → R^m,
• which can be viewed as a random matrix X ∈ R^{m×d}.

Lemma 4 (Johnson–Lindenstrauss embedding)
Let X := (1/√m) Z ∈ R^{m×d}, where Z has iid N(0, 1) entries. Consider any collection of points {u_1, …, u_N} ⊆ R^d. Take ε, δ ∈ (0, 1) and assume that
  m ≥ (16/δ²) log(N/√ε).
Then, with probability at least 1 − ε,
  (1 − δ)‖u_i − u_j‖_2² ≤ ‖Xu_i − Xu_j‖_2² ≤ (1 + δ)‖u_i − u_j‖_2², ∀i ≠ j.
Proof
• Fix u ∈ R^d and let
  Y := ‖Zu‖_2²/‖u‖_2² = Σ_{i=1}^m ⟨z_i, u/‖u‖_2⟩²,
where z_iᵀ is the ith row of Z.
• Then Y ∼ χ²_m.
• Recalling X = Z/√m, for all δ ∈ (0, 1),
  P(|‖Xu‖_2²/‖u‖_2² − 1| ≥ δ) = P(|Y/m − 1| ≥ δ) ≤ 2 e^{−mδ²/8}.
• Applying this to u = u_i − u_j, for any fixed pair (i, j), we have
  P(|‖X(u_i − u_j)‖_2²/‖u_i − u_j‖_2² − 1| ≥ δ) ≤ 2 e^{−mδ²/8}.
• Apply a further union bound over all pairs i ≠ j:
  P(|‖X(u_i − u_j)‖_2²/‖u_i − u_j‖_2² − 1| ≥ δ, for some i ≠ j) ≤ 2 (N choose 2) e^{−mδ²/8}.
• Since 2 (N choose 2) ≤ N², the result follows by solving the following for m:
  N² e^{−mδ²/8} ≤ ε.
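The whole embedding argument can be exercised end to end; a sketch (not from the slides) that takes m exactly as in Lemma 4 and checks all pairwise distance ratios:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, delta, eps = 1000, 20, 0.5, 0.05
m = int(np.ceil(16 / delta**2 * np.log(N / np.sqrt(eps))))   # m from Lemma 4

U = rng.standard_normal((N, d))               # arbitrary data points
X = rng.standard_normal((m, d)) / np.sqrt(m)  # the projection X = Z / sqrt(m)
P = U @ X.T                                   # projected points, in R^m

ratios = []
for i in range(N):
    for j in range(i + 1, N):
        ratios.append(np.sum((P[i] - P[j])**2) / np.sum((U[i] - U[j])**2))
ratios = np.array(ratios)
```

With these parameters, all squared-distance ratios should fall inside [1 − δ, 1 + δ] with probability at least 1 − ε, even though m ≈ 288 ≪ d = 1000.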
ℓ2 norm of sub-Gaussian vectors
• Here ‖X‖_2 = √(Σ_{i=1}^n X_i²).

Proposition 5 (Concentration of the norm, HDP 3.1.1)
Let X = (X_1, …, X_n) ∈ R^n be a random vector with independent, sub-Gaussian coordinates X_i that satisfy EX_i² = 1. Then,
  ‖ ‖X‖_2 − √n ‖_{ψ₂} ≤ CK²,
where K = max_i ‖X_i‖_{ψ₂} and C is an absolute constant.

• The result says that the norm is highly concentrated around √n:
  ‖X‖_2 ≈ √n
in high dimensions (n large).
• Assuming K = O(1), it shows that w.h.p. ‖X‖_2 = √n + O(1).
• More precisely, w.p. ≥ 1 − e^{−c_1 t²}, we have
  √n − K²t ≤ ‖X‖_2 ≤ √n + K²t.
• Simple argument:
  E‖X‖_2² = n,
  var(‖X‖_2²) = n var(X_1²),
  sd(‖X‖_2²) = √n sd(X_1²).
• Assuming sd(X_1²) = O(1),
  ‖X‖_2 ≈ √(n ± O(√n)) = √n ± O(1).
• The last equality can be shown by Taylor expansion:
  • √(1 + t) = 1 + O(t) as t → 0, hence √(1 + 1/√n) = 1 + O(1/√n).
• Side note: this is better than √(a + b) ≤ √a + √b, which holds for any a, b ≥ 0.
• That is, when b = o(a), we have √(a + b) ≤ √a + O(b/√a).
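The √n ± O(1) behavior is easy to see numerically; a sketch (not from the slides) for standard Gaussian coordinates with n = 400:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 400, 10_000
X = rng.standard_normal((reps, n))
dev = np.linalg.norm(X, axis=1) - np.sqrt(n)   # ||X||_2 - sqrt(n)
```

The deviations `dev` have mean near 0 and standard deviation near 1/√2, independent of n, illustrating the O(1) fluctuation of the norm around √n.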
• Proof of Proposition 5: Argue that we can take K ≥ 1.
• Since X_i is sub-Gaussian, X_i² is sub-exponential and
  ‖X_i² − 1‖_{ψ₁} ≤ C‖X_i²‖_{ψ₁} = C‖X_i‖_{ψ₂}² ≤ CK².
• Applying Bernstein's inequality (Corollary 1), for any u ≥ 0,
  P(|‖X‖_2²/n − 1| ≥ u) ≤ 2 exp(−(c_1 n/K⁴) min(u², u)),
where we used K⁴ ≥ K² and absorbed C into c_1.
• Using the implication |z − 1| ≥ δ ⟹ |z² − 1| ≥ max(δ, δ²), valid for all z ≥ 0,
  P(|‖X‖_2/√n − 1| ≥ δ) ≤ P(|‖X‖_2²/n − 1| ≥ max(δ, δ²)) ≤ 2 exp(−(c_1 n/K⁴) δ²).
• With f(u) = min(u², u) and g(δ) = max(δ, δ²), we have f(g(δ)) = δ² for all δ ≥ 0.
• The change of variable δ = t/√n gives the result.
ℓ∞ norm of sub-Gaussian vectors

• For any vector X ∈ R^n, the ℓ∞ norm is:
  ‖X‖_∞ = max_{i=1,…,n} |X_i|.

Lemma 5
Let X = (X_1, …, X_n) ∈ R^n be a random vector with zero-mean, sub-Gaussian coordinates X_i with parameters σ_i. Then, for any γ ≥ 0,
  P(‖X‖_∞ ≥ σ √(2(1 + γ) log n)) ≤ 2n^{−γ},
where σ = max_i σ_i.

• Proof: We have P(|X_i| ≥ t) ≤ 2 exp(−t²/(2σ²)); hence, by a union bound,
  P(max_i |X_i| ≥ t) ≤ 2n exp(−t²/(2σ²)) = 2n^{−γ},
taking t = √(2σ²(1 + γ) log n).
Theorem 6
Assume {X_i}_{i=1}^n are zero-mean RVs, sub-Gaussian with parameter σ. Then,
  E[max_{i=1,…,n} X_i] ≤ √(2σ² log n), ∀n ≥ 1.

• Proof of Theorem 6:
  • Apply Jensen's inequality to e^{λZ}, where Z = max_i X_i.
  • Note that we always have (for any λ ∈ R)
    e^{λ max_i X_i} ≤ Σ_i e^{λX_i}.
  • Exercise: complete the proof (by optimizing over λ).
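Theorem 6 can be verified by simulation; a sketch (not from the slides) estimating E[max_i X_i] for n = 1000 standard Gaussians and comparing with √(2 log n):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1000, 5_000
X = rng.standard_normal((reps, n))
emp_max = X.max(axis=1).mean()        # Monte Carlo estimate of E[max_i X_i]
bound = np.sqrt(2 * np.log(n))        # sqrt(2 sigma^2 log n) with sigma = 1
```

The empirical mean maximum is about 3.24 versus the bound 3.72; the bound is tight up to lower-order terms as n grows.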
Detour: Martingales
• A filtration {F_k}_{k≥1} is a nested sequence of σ-fields: F_1 ⊂ F_2 ⊂ ⋯.
• Consider a sequence of random variables Y_1, Y_2, ….
• The pair ({Y_k}, {F_k}) is a martingale if
  • {Y_k}_{k≥1} is adapted to {F_k}_{k≥1}, i.e., Y_k ∈ F_k for all k ≥ 1,
  • E|Y_k| < ∞, and
  • E[Y_{k+1} | F_k] = Y_k for all k ≥ 1.
• Often F_k = σ(X_1, …, X_k), in which case we say {Y_k} is a martingale w.r.t. {X_k}. The key condition in this case is
  E[Y_{k+1} | X_k, …, X_1] = Y_k.
• One of the most general dependence structures in probability.
• Allows relaxing independence assumptions in classical limit theorems.
• {Y_k} could be a martingale w.r.t. itself; then it is just called a martingale.
• We have a supermartingale if Y_k ≥ E[Y_{k+1} | F_k].
• An L¹-bounded supermartingale (sup_k E[Y_k⁻] < ∞) converges almost surely.
Detour: Martingale Examples
• Partial sums S_k := Σ_{j=1}^k X_j of an iid zero-mean sequence {X_i}.
• Partial products L_k := Π_{j=1}^k X_j of an iid sequence with E[X_1] = 1.
• The likelihood ratio process is an example of the above (an exponential martingale).
• Doob's martingale: for any integrable Z (i.e., E|Z| < ∞), the following is always a martingale:
  Y_k := E[Z | X_1, …, X_k].
• Exercise: verify the martingale property.
Theorem 7 (Azuma–Hoeffding)
Let X = (X_1, …, X_n) be a random vector and Z = f(X). Consider the Doob martingale
  Y_i := E_i[Z] := E[Z | X_1, …, X_i],
and let Δ_i := Y_i − Y_{i−1}. Assume that
  E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, ∀λ ∈ R, (3)
almost surely, for all i = 1, …, n.
Then, Z − EZ is sub-Gaussian with parameter σ = √(Σ_{i=1}^n σ_i²).

• In particular, we have the tail bound
  P(|Z − EZ| ≥ t) ≤ 2 exp(−t²/(2σ²)).
• {Δ_i} is a martingale difference sequence, i.e., E_{i−1}[Δ_i] = 0.
Proof
• Let S_j := Σ_{i=1}^j Δ_i, which is a function of X_i, i ≤ j, only.
• Noting that E_0[Z] = E[Z] and E_n[Z] = Z,
  S_n = Σ_{i=1}^n Δ_i = Z − EZ.
• By properties of conditional expectation, and assumption (3),
  E_{n−1}[e^{λS_n}] = e^{λS_{n−1}} E_{n−1}[e^{λΔ_n}] ≤ e^{λS_{n−1}} e^{σ_n²λ²/2}.
• Taking E_{n−2} of both sides:
  E_{n−2}[e^{λS_n}] ≤ e^{σ_n²λ²/2} E_{n−2}[e^{λS_{n−1}}] ≤ e^{λS_{n−2}} e^{(σ_n² + σ_{n−1}²)λ²/2}.
• Repeating the process, we get
  E_0[e^{λS_n}] ≤ exp((Σ_{i=1}^n σ_i²) λ²/2).
Bounded difference inequality
• The conditional sub-Gaussian assumption holds under the bounded difference property:
  |f(x_1, …, x_{i−1}, x_i, x_{i+1}, …, x_n) − f(x_1, …, x_{i−1}, x_i′, x_{i+1}, …, x_n)| ≤ L_i (4)
for all x_1, …, x_n, x_i′ ∈ X and all i ∈ [n], for some constants (L_1, …, L_n).

Theorem 8 (Bounded difference)
Assume that X = (X_1, …, X_n) has independent coordinates, and assume that f : X^n → R satisfies the bounded difference property (4). Then,
  P(|f(X) − Ef(X)| ≥ t) ≤ 2 exp(−2t²/Σ_{i=1}^n L_i²), t ≥ 0.
Proof (naive bound)
• Let g_i(X_1, …, X_i) := E_i[Z] and show that g_i also has the bounded difference property:
  Δ_i = E_i[Z] − E_{i−1}[E_i[Z]] = g_i(X_1, …, X_i) − E_{i−1}[g_i(X_1, …, X_i)].
• Let X_i′ be an independent copy of X_i.
• Conditioned on X_1, …, X_{i−1}, we are effectively looking at
  g_i(x_1, …, x_{i−1}, X_i) − E[g_i(x_1, …, x_{i−1}, X_i′)],
due to the independence of {X_1, …, X_i, X_i′}.
• Thus, |Δ_i| ≤ L_i conditioned on X_1, …, X_{i−1}.
• That is, E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, where σ_i² = (2L_i)²/4 = L_i².
Proof (better bound)
• Can show that Δ_i ∈ I_i, an interval of length |I_i| ≤ L_i, improving the constant by a factor of 4.
• Conditioned on X_1 = x_1, …, X_{i−1} = x_{i−1}, we are effectively looking at
  Δ_i = g_i(x_1, …, x_{i−1}, X_i) − µ_i,
where µ_i is a constant (only a function of x_1, …, x_{i−1}).
• Then, Δ_i + µ_i ∈ [a_i, b_i], where
  a_i = inf_x g_i(x_1, …, x_{i−1}, x),  b_i = sup_x g_i(x_1, …, x_{i−1}, x).
• We have (one needs to argue that g_i satisfies bounded differences)
  b_i − a_i = sup_{x,y} [g_i(x_1, …, x_{i−1}, x) − g_i(x_1, …, x_{i−1}, y)] ≤ L_i.
• Thus, E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, where σ_i² = (b_i − a_i)²/4 ≤ L_i²/4.
• The role of independence in the second argument is subtle.
• The only place we used independence is to argue that E_i[Z] satisfies the bounded difference property for all i.
• We argue that E_i[Z] = g_i(X_1, …, X_i), which is where we use independence.
• Then g_i, by definition and Jensen's inequality, satisfies the bounded difference property.
Example: (Bounded) U-statistics
• Let g : R² → R be a symmetric function,
• X_1, …, X_n an iid sequence. Then
  U := (1/(n choose 2)) Σ_{i<j} g(X_i, X_j)
is called a U-statistic (of order 2).
• U is not a sum of independent variables; e.g., n = 3 gives
  U = (1/3)(g(X_1, X_2) + g(X_1, X_3) + g(X_2, X_3)),
but the dependence between terms is relatively weak (made precise shortly).
• For example, g(x, y) = (x − y)²/2 gives an unbiased estimator of the variance. (Exercise)
• Assume that g is bounded, i.e., ‖g‖_∞ ≤ b, meaning
  ‖g‖_∞ := sup_{x,y} |g(x, y)| ≤ b,
i.e., |g(x, y)| ≤ b for all x, y ∈ R.
• Writing U = f(X_1, …, X_n), and letting x′ differ from x only in the kth coordinate (x_k replaced by x_k′), we observe that
  |f(x) − f(x′)| ≤ (1/(n choose 2)) Σ_{i≠k} |g(x_i, x_k) − g(x_i, x_k′)| ≤ (n − 1) 2b / (n(n − 1)/2) = 4b/n;
thus f has bounded differences with parameters L_k = 4b/n.
• Applying Theorem 8,
  P(|U − EU| ≥ t) ≤ 2 e^{−nt²/(8b²)}, t ≥ 0.
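The variance-kernel example can be computed directly; a sketch (not from the slides) using bounded data, so the bounded-difference bound applies with b = 2:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, size=500)        # bounded data, so |g| <= b = 2
g = lambda a, b: 0.5 * (a - b)**2           # the symmetric kernel from the slide
pairs = combinations(range(len(x)), 2)
U = np.mean([g(x[i], x[j]) for i, j in pairs])
```

For this kernel, U coincides exactly with the unbiased sample variance, and it concentrates around the true variance of Unif(−1, 1), which is 1/3.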
Clique number of Erdős–Rényi graphs
• Let G be an undirected graph on n nodes.
• A clique in G is a complete (induced) subgraph.
• The clique number of G, denoted ω(G), is the size of the largest clique(s).
• For two graphs G and G′ that differ in at most one edge,
  |ω(G) − ω(G′)| ≤ 1.
• Thus E(G) ↦ ω(G) has the bounded difference property with L = 1.
• Let G be an Erdős–Rényi random graph: edges are independently drawn with probability p. Then, with m = (n choose 2),
  P(|ω(G) − Eω(G)| ≥ δ) ≤ 2 e^{−2δ²/m},
or, setting ω̄(G) = ω(G)/m,
  P(|ω̄(G) − Eω̄(G)| ≥ δ) ≤ 2 e^{−2mδ²}.
Lipschitz functions of standard Gaussian vector
• A function f : R^n → R is L-Lipschitz w.r.t. ‖·‖_2 if
  |f(x) − f(y)| ≤ L‖x − y‖_2, for all x, y ∈ R^n.

Theorem 9 (Gaussian concentration)
Let X ∼ N(0, I_n) be a standard Gaussian vector and assume that f : R^n → R is L-Lipschitz w.r.t. the Euclidean norm. Then,
  P(|f(X) − E[f(X)]| ≥ t) ≤ 2 exp(−t²/(2L²)), t ≥ 0. (5)

• In other words, f(X) is sub-Gaussian with parameter L.
• A deep result, with no easy proof!
• Has far-reaching consequences.
• One-sided bounds hold with the prefactor 2 removed.
Example: χ² and norm concentration revisited
• Let X ∼ N(0, I_n) and consider the function f(x) = ‖x‖_2/√n.
• f is L-Lipschitz with L = 1/√n. Hence,
  P(‖X‖_2/√n − E‖X‖_2/√n ≥ t) ≤ e^{−nt²/2}, t ≥ 0.
• Since E‖X‖_2 ≤ √n (why?), we have
  P(‖X‖_2/√n − 1 ≥ t) ≤ e^{−nt²/2}, t ≥ 0.
• For t ∈ [0, 1], (1 + t)² ≤ 1 + 3t, hence
  P(‖X‖_2²/n − 1 ≥ 3t) ≤ e^{−nt²/2}, t ∈ [0, 1],
or, setting 3t = δ,
  P(‖X‖_2²/n − 1 ≥ δ) ≤ e^{−nδ²/18}, δ ∈ [0, 3].
Example: order statistics
• Let X ∼ N(0, I_n), and
• let f(x) = x_(k) be the kth order statistic: for x ∈ R^n,
  x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n).
• For any x, y ∈ R^n, we have
  |x_(k) − y_(k)| ≤ ‖x − y‖_2,
hence f is 1-Lipschitz. (Exercise)
• It follows that
  P(|X_(k) − EX_(k)| ≥ t) ≤ 2 e^{−t²/2}, t ≥ 0.
• In particular, if X_i iid∼ N(0, 1), i = 1, …, n, then
  P(|max_{i=1,…,n} X_i − E[max_{i=1,…,n} X_i]| ≥ t) ≤ 2 e^{−t²/2}, t ≥ 0.
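The dimension-free concentration of an order statistic is easy to check by simulation; a sketch (not from the slides) using the median (k = 26 of n = 51), with the bound taken at L = 1:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps, k = 51, 50_000, 26                 # k = 26: the median of 51 points
X = rng.standard_normal((reps, n))
xk = np.sort(X, axis=1)[:, k - 1]           # kth order statistic of each sample
t = 2.0
frac = (np.abs(xk - xk.mean()) >= t).mean()
bound = 2 * np.exp(-t**2 / 2)               # dimension-free bound with L = 1
```

For the median the bound is very loose (its fluctuations shrink like 1/√n), but it holds for every k, including the maximum, where it is essentially sharp.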
Example: Singular values
• Consider a matrix X ∈ R^{n×d}, where n > d.
• Let σ_1(X) ≥ σ_2(X) ≥ ⋯ ≥ σ_d(X) be the (ordered) singular values of X.
• By Weyl's theorem, for any X, Y ∈ R^{n×d}:
  |σ_k(X) − σ_k(Y)| ≤ |||X − Y|||_op ≤ |||X − Y|||_F.
(Note that this is a generalization of the order-statistics inequality.)
• Thus, X ↦ σ_k(X) is 1-Lipschitz (w.r.t. the Frobenius norm):

Proposition 6
Let X ∈ R^{n×d} be a random matrix with iid N(0, 1) entries. Then,
  P(|σ_k(X) − E[σ_k(X)]| ≥ δ) ≤ 2 e^{−δ²/2}, δ ≥ 0.

• It remains to characterize E[σ_k(X)].
• For an overview of matrix norms, see matrix norms.pdf.
Linear regression setup
• The data is (y, X), where y ∈ R^n and X ∈ R^{n×d}, and the model is
  y = Xθ* + w.
• θ* ∈ R^d is an unknown parameter.
• w ∈ R^n is the vector of noise variables.
• Equivalently,
  y_i = ⟨θ*, x_i⟩ + w_i, i = 1, …, n,
where x_i ∈ R^d is the ith row of X, i.e., X is the n × d matrix with rows x_1ᵀ, x_2ᵀ, …, x_nᵀ.
• Recall ⟨θ*, x_i⟩ = Σ_{j=1}^d θ_j* x_{ij}.
Sparsity models
• When n < d, there is no hope of estimating θ*,
• unless we impose some sort of low-dimensional model on θ*.
• Support of θ* (recall [d] = {1, …, d}):
  supp(θ*) := S(θ*) = {j ∈ [d] : θ_j* ≠ 0}.
• Hard sparsity assumption: s = |S(θ*)| ≪ d.
• Weaker sparsity assumption via ℓq balls, for q ∈ [0, 1]:
  B_q(R_q) = {θ ∈ R^d : Σ_{j=1}^d |θ_j|^q ≤ R_q}.
• q = 1 gives the ℓ1 ball.
• q = 0 gives the ℓ0 ball, the same as hard sparsity:
  ‖θ*‖_0 := |S(θ*)| = #{j : θ_j* ≠ 0}.
Basis pursuit
• Consider the noiseless case y = Xθ*.
• We assume that ‖θ*‖_0 is small.
• The ideal program to solve:
  min_{θ∈R^d} ‖θ‖_0 subject to y = Xθ.
• ‖·‖_0 is highly non-convex; relax to ‖·‖_1:
  min_{θ∈R^d} ‖θ‖_1 subject to y = Xθ. (6)
This is called basis pursuit (regression).
• (6) is a convex program.
• In fact, it can be written as a linear program.¹
• Global solutions can be obtained very efficiently.

¹Exercise: Introduce auxiliary variables s_j ∈ R and note that minimizing Σ_j s_j subject to |θ_j| ≤ s_j gives the ℓ1 norm of θ.
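The LP reformulation in the footnote can be implemented directly; a sketch (not from the slides) using `scipy.optimize.linprog` with variables (θ, s) ∈ R^{2d}, on a synthetic 3-sparse noiseless problem:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
n, d = 50, 100
theta_star = np.zeros(d)
theta_star[[5, 17, 60]] = [1.0, -2.0, 1.5]   # a 3-sparse signal
X = rng.standard_normal((n, d))
y = X @ theta_star                            # noiseless observations

# Minimize sum_j s_j subject to theta - s <= 0, -theta - s <= 0
# (i.e. |theta_j| <= s_j) and X theta = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)],
                 [-np.eye(d), -np.eye(d)]])
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
theta_hat = res.x[:d]
```

With n = 50 ≫ s log(ed/s), basis pursuit recovers θ* exactly (w.h.p. over the Gaussian design), previewing the restricted null space results that follow.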
Restricted null space property (RNS)
• Define
  C(S) = {Δ ∈ R^d : ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1}. (7)

Theorem 10
The following two are equivalent:
• For any θ* ∈ R^d with support ⊆ S, the basis pursuit program (6) applied to the data (y = Xθ*, X) has the unique solution θ̂ = θ*.
• The restricted null space (RNS) property holds, i.e.,
  C(S) ∩ ker(X) = {0}. (8)
Proof
• Consider the tangent cone to the ℓ1 ball (of radius ‖θ*‖_1) at θ*:
  T(θ*) = {Δ ∈ R^d : ‖θ* + tΔ‖_1 ≤ ‖θ*‖_1 for some t > 0},
i.e., the set of descent directions for the ℓ1 norm at the point θ*.
• The feasible set is θ* + ker(X), i.e.,
• ker(X) is the set of feasible directions Δ = θ − θ*.
• Hence, there is a minimizer other than θ* if and only if
  T(θ*) ∩ ker(X) ≠ {0}. (9)
• It is enough to show that
  C(S) = ⋃_{θ*∈R^d : supp(θ*)⊆S} T(θ*).
[Figure: the ℓ1 ball B_1 in R², the tangent cones T(θ*_(1)) and T(θ*_(2)), ker(X), and the cone C(S).
Here d = 2, [d] = {1, 2}, S = {2}, θ*_(1) = (0, 1), θ*_(2) = (0, −1), and
C(S) = {(Δ_1, Δ_2) : |Δ_1| ≤ |Δ_2|}.]
59 / 124
• It is enough to show that
  C(S) = ⋃_{θ*∈R^d : supp(θ*)⊆S} T(θ*). (10)
• We have Δ ∈ T_1(θ*) iff²
  ‖Δ_{S^c}‖_1 ≤ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1.
• We have Δ ∈ T_1(θ*) for some θ* ∈ R^d with supp(θ*) ⊆ S iff
  ‖Δ_{S^c}‖_1 ≤ sup_{θ*_S} [‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1] = ‖Δ_S‖_1.

²Let T_1(θ*) be the subset of T(θ*) where t = 1, and argue that w.l.o.g. we can work with this subset.
Sufficient conditions for restricted null space
• [d] := {1, …, d}.
• For a matrix X ∈ R^{n×d}, let X_j be its jth column (for j ∈ [d]).
• The pairwise incoherence of X is defined as
  δ_PW(X) := max_{i,j∈[d]} |⟨X_i, X_j⟩/n − 1{i = j}|.
• Alternative form: XᵀX is the Gram matrix of X, with (XᵀX)_{ij} = ⟨X_i, X_j⟩, so
  δ_PW(X) = ‖XᵀX/n − I_d‖_∞,
where ‖·‖_∞ is the elementwise (vector) ℓ∞ norm of the matrix.

Proposition 7 (HDS Prop. 7.1)
(Uniform) restricted null space holds for all S with |S| ≤ s if
  δ_PW(X) ≤ 1/(3s).

• Proof: Exercise 7.3.
Restricted isometry property (RIP)
• A more relaxed condition:

Definition 3 (RIP)
X ∈ R^{n×d} satisfies the restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
  |||X_Sᵀ X_S/n − I|||_op ≤ δ_s(X), for all S with |S| ≤ s.

• Pairwise incoherence is close to RIP with s = 2;
• for example, when ‖X_j/√n‖_2 = 1 for all j, we have
  δ_2(X) = δ_PW(X).
• In general, for any s ≥ 2 (Exercise 7.4),
  δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X).
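Pairwise incoherence is cheap to compute; a sketch (not from the slides) for an i.i.d. Gaussian design, together with the sparsity level that Proposition 7 then certifies:

```python
import numpy as np

rng = np.random.default_rng(10)
n, d = 2000, 50
X = rng.standard_normal((n, d))
G = X.T @ X / n                              # Gram matrix X^T X / n
delta_pw = np.abs(G - np.eye(d)).max()       # elementwise deviation from I_d
s_max = int(1 / (3 * delta_pw))              # sparsity certified by Prop. 7
```

Off-diagonal entries of G are O(1/√n), so δ_PW shrinks like √(log d / n); the certified sparsity grows only like √n / √(log d), reflecting the n ≳ s² log d requirement.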
Recall (Definition 3): X ∈ R^{n×d} satisfies RIP of order s with constant δ_s(X) > 0 if
  |||X_Sᵀ X_S/n − I|||_op ≤ δ_s(X), for all S with |S| ≤ s.

• Let x_iᵀ be the ith row of X. Consider the sample covariance matrix:
  Σ̂ := (1/n) XᵀX = (1/n) Σ_{i=1}^n x_i x_iᵀ ∈ R^{d×d}.
• Then Σ̂_SS = (1/n) X_Sᵀ X_S; hence, RIP reads
  |||Σ̂_SS − I|||_op ≤ δ < 1,
i.e., Σ̂_SS ≈ I_s. More precisely,
  (1 − δ)‖u‖_2 ≤ ‖Σ̂_SS u‖_2 ≤ (1 + δ)‖u‖_2, ∀u ∈ R^s.
63 / 124
• RIP gives a sufficient condition:

Proposition 8 (HDS Prop. 7.2)
(Uniform) restricted null space holds for all S with |S| ≤ s if
  δ_{2s}(X) ≤ 1/3.

• Consider a sub-Gaussian matrix X with i.i.d. entries, EX_{ij} = 0 and EX_{ij}² = 1 (Exercise 7.7). We have:
  n ≳ s² log d ⟹ δ_PW(X) < 1/(3s), w.h.p.;
  n ≳ s log(ed/s) ⟹ δ_{2s} < 1/3, w.h.p.
• The sample complexity requirement for RIP is milder.
Neither RIP nor PWI is necessary
• For a more general covariance Σ, it is harder to satisfy either PWI or RIP.
• Consider X ∈ R^{n×d} with i.i.d. rows X_i ∼ N(0, Σ).
• Letting 1 ∈ R^d be the all-ones vector, take
  Σ := (1 − µ) I_d + µ 11ᵀ,
for µ ∈ [0, 1). (A spiked covariance matrix.)
• We have γ_max(Σ_SS) = 1 + µ(s − 1) → ∞ as s → ∞.
• Exercise 7.8:
  (a) PWI is violated w.h.p. unless µ ≪ 1/s.
  (b) RIP is violated w.h.p. unless µ ≪ 1/√s. In fact, δ_{2s} grows like µ√s for any fixed µ ∈ (0, 1).
• However, for any µ ∈ [0, 1), basis pursuit succeeds w.h.p. if
  n ≳ s log(ed/s).
(A later result shows this.)
Noisy sparse regression
• A very popular estimator is the ℓ1-regularized least squares (the Lasso):
  θ̂ ∈ argmin_{θ∈R^d} [(1/2n)‖y − Xθ‖_2² + λ‖θ‖_1]. (11)
• The idea: minimizing the ℓ1 norm leads to sparse solutions.
• (11) is a convex program; a global solution can be obtained efficiently.
• Other options: the constrained form of the Lasso,
  min_{‖θ‖_1≤R} (1/2n)‖y − Xθ‖_2². (12)
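One simple way to solve (11) is proximal gradient descent (ISTA); a minimal sketch, not the slides' method, with the regularization level chosen in the spirit of the theory below (λ ≈ 2σ√(2 log d / n)):

```python
import numpy as np

def soft_threshold(v, tau):
    # prox operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    n, d = X.shape
    L = np.linalg.norm(X, 2)**2 / n      # Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(9)
n, d, sigma = 200, 500, 0.1
theta_star = np.zeros(d)
theta_star[:5] = 1.0                     # a 5-sparse truth
X = rng.standard_normal((n, d))
y = X @ theta_star + sigma * rng.standard_normal(n)
lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)
theta_hat = lasso_ista(X, y, lam)
```

Even with d = 500 > n = 200, the estimate lands close to θ* and zeroes out most off-support coordinates, matching the √(s log d / n) error scaling derived below.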
Restricted eigenvalue condition
• For a constant α ≥ 1,
  C_α(S) := {Δ ∈ R^d | ‖Δ_{S^c}‖_1 ≤ α‖Δ_S‖_1}.
• A strengthening of RNS is:

Definition 4 (RE condition)
A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if
  (1/n)‖XΔ‖_2² ≥ κ‖Δ‖_2², for all Δ ∈ C_α(S).

• RNS corresponds to
  (1/n)‖XΔ‖_2² > 0, for all Δ ∈ C_1(S) \ {0}.
Deviation bounds under RE

Theorem 11 (Deterministic deviations)
Assume that y = Xθ* + w, where X ∈ R^{n×d} and θ* ∈ R^d, and
• θ* is supported on S ⊂ [d] with |S| ≤ s,
• X satisfies RE(κ, 3) over S.
Let us define z = Xᵀw/n. Then, we have the following:
(a) Any solution of the Lasso (11) with λ ≥ 2‖z‖_∞ satisfies
  ‖θ̂ − θ*‖_2 ≤ (3/κ)√s λ.
(b) Any solution of the constrained Lasso (12) with R = ‖θ*‖_1 satisfies
  ‖θ̂ − θ*‖_2 ≤ (4/κ)√s ‖z‖_∞.
Example (fixed-design regression)
• Assume y = Xθ* + w, where w ∼ N(0, σ²I_n), and
• X ∈ R^{n×d} is fixed, satisfying the RE condition and the normalization
  max_{j=1,…,d} ‖X_j‖_2/√n ≤ C,
where X_j is the jth column of X.
• Recall z = Xᵀw/n.
• It is easy to show that w.p. ≥ 1 − 2e^{−nδ²/2},
  ‖z‖_∞ ≤ Cσ(√(2 log d / n) + δ).
• Thus, setting λ = 2Cσ(√(2 log d / n) + δ), the Lasso solution satisfies
  ‖θ̂ − θ*‖_2 ≤ (6Cσ/κ)√s (√(2 log d / n) + δ),
w.p. at least 1 − 2e^{−nδ²/2}.
• Taking δ = √(2 log d / n), we have
  ‖θ̂ − θ*‖_2 ≲ σ √(s log d / n)
w.h.p. (i.e., ≥ 1 − 2d^{−1}).
• This is the typical high-dimensional scaling in sparse problems.
• Had we known the support S in advance, our rate would be (w.h.p.)
  ‖θ̂ − θ*‖_2 ≲ σ √(s/n).
• The log d factor is the price of not knowing the support;
• roughly, the price of searching over a collection of (d choose s) ≤ d^s candidate supports.
Proof of Theorem 11
• Let us simplify the loss L(θ) := (1/2n)‖Xθ − y‖².
• Setting Δ = θ − θ*,
  L(θ) = (1/2n)‖X(θ − θ*) − w‖²
       = (1/2n)‖XΔ − w‖²
       = (1/2n)‖XΔ‖² − (1/n)⟨XΔ, w⟩ + const.
       = (1/2n)‖XΔ‖² − (1/n)⟨Δ, Xᵀw⟩ + const.
       = (1/2n)‖XΔ‖² − ⟨Δ, z⟩ + const.,
where z = Xᵀw/n. Hence,
  L(θ) − L(θ*) = (1/2n)‖XΔ‖² − ⟨Δ, z⟩. (13)
• Exercise: Show that (13) is the Taylor expansion of L around θ*.
Proof (constrained version)
• By optimality of θ̂ and feasibility of θ*:
  L(θ̂) ≤ L(θ*).
• The error vector Δ̂ := θ̂ − θ* satisfies the basic inequality
  (1/2n)‖XΔ̂‖_2² ≤ ⟨z, Δ̂⟩.
• Using Hölder's inequality,
  (1/2n)‖XΔ̂‖_2² ≤ ‖z‖_∞ ‖Δ̂‖_1.
• Since ‖θ̂‖_1 ≤ ‖θ*‖_1, we have Δ̂ = θ̂ − θ* ∈ C_1(S); hence
  ‖Δ̂‖_1 = ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≤ 2‖Δ̂_S‖_1 ≤ 2√s ‖Δ̂‖_2.
• Combined with the RE condition (Δ̂ ∈ C_3(S) as well),
  (1/2)κ‖Δ̂‖_2² ≤ 2√s ‖z‖_∞ ‖Δ̂‖_2,
which gives the desired result.
Proof (Lagrangian version)
• Let L̄(θ) := L(θ) + λ‖θ‖_1 be the regularized loss.
• The basic inequality is
  L(θ̂) + λ‖θ̂‖_1 ≤ L(θ*) + λ‖θ*‖_1.
• Rearranging,
  (1/2n)‖XΔ̂‖_2² ≤ ⟨z, Δ̂⟩ + λ(‖θ*‖_1 − ‖θ̂‖_1).
• We have
  ‖θ*‖_1 − ‖θ̂‖_1 = ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1.
• Since λ ≥ 2‖z‖_∞,
  (1/n)‖XΔ̂‖_2² ≤ λ‖Δ̂‖_1 + 2λ(‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1) ≤ λ(3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1).
• It follows that Δ̂ ∈ C_3(S), and the rest of the proof follows.
RE condition for anisotropic design
• For a PSD matrix Σ, let
  ρ²(Σ) = max_i Σ_{ii}.

Theorem 12
Let X ∈ R^{n×d} have rows drawn i.i.d. from N(0, Σ). Then, there exist universal constants c_1 < 1 < c_2 such that
  ‖Xθ‖_2²/n ≥ c_1 ‖√Σ θ‖_2² − c_2 ρ²(Σ) (log d / n) ‖θ‖_1², for all θ ∈ R^d, (14)
with probability at least 1 − e^{−n/32}/(1 − e^{−n/32}).

• Exercise 7.11: (14) implies the RE condition over C_3(S), uniformly over all subsets of cardinality
  |S| ≤ (c_1/(32c_2)) (γ_min(Σ)/ρ²(Σ)) (n/log d).
• In other words, n ≳ s log d ⟹ RE condition over C_3(S) for all |S| ≤ s.
Comments

• Note that E‖Xθ‖₂²/n = ‖√Σ θ‖₂².
• Bound (14) says that ‖Xθ‖₂²/n is lower-bounded by a multiple of its
  expectation, minus a slack ∝ ‖θ‖₁².
• Why is the slack needed?
• In the high-dimensional setting, ‖Xθ‖₂ = 0 for any θ ∈ ker(X), while
  ‖√Σ θ‖₂ > 0 for any θ ≠ 0, assuming Σ is non-singular.
• In fact, ‖√Σ θ‖₂ is uniformly bounded below in that case:

  ‖√Σ θ‖₂² ≥ γ_min(Σ)‖θ‖₂²,

  showing that √Σ, the population-level version of X/√n, satisfies RE, and
  in fact a global eigenvalue condition.
• (14) gives a nontrivial/good lower bound for those θ for which ‖θ‖₁ is small
  compared to ‖θ‖₂, i.e., sparse vectors:
• recall that if θ is s-sparse, then ‖θ‖₁ ≤ √s‖θ‖₂,
• while for general θ ∈ R^d, we only have ‖θ‖₁ ≤ √d‖θ‖₂.
Examples

• Toeplitz family: Σᵢⱼ = ν^|i−j|,

  ρ²(Σ) = 1,  γ_min(Σ) ≥ (1 − ν)² > 0

• Spiked model: Σ := (1 − µ)I_d + µ11ᵀ,

  ρ²(Σ) = 1,  γ_min(Σ) = 1 − µ

• For future applications, note that (14) implies

  ‖Xθ‖₂²/n ≥ α₁‖θ‖₂² − α₂‖θ‖₁²,  ∀θ ∈ R^d,

  where α₁ = c₁γ_min(Σ) and α₂ = c₂ρ²(Σ) log d / n.
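The lower bound (14) can be eyeballed in simulation for the Toeplitz family. In the sketch below (not from the slides), c1 and c2 are ad hoc stand-ins for the unspecified universal constants, and we only test random sparse directions:

```python
import numpy as np

# Empirical look at bound (14) for a Toeplitz design Sigma_ij = nu^|i-j|.
rng = np.random.default_rng(1)
n, d, s, nu = 100, 400, 5, 0.5
Sigma = nu ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, d)) @ L.T      # rows i.i.d. N(0, Sigma)

c1, c2 = 0.25, 9.0                          # illustrative constants only
ok = True
for _ in range(50):
    theta = np.zeros(d)
    supp = rng.choice(d, size=s, replace=False)
    theta[supp] = rng.standard_normal(s)    # random s-sparse direction
    lhs = np.linalg.norm(X @ theta) ** 2 / n
    rhs = c1 * theta @ Sigma @ theta \
        - c2 * (np.log(d) / n) * np.linalg.norm(theta, 1) ** 2
    ok = ok and (lhs >= rhs)
print(ok)
```

With these (generous) constants the lower bound holds in every trial, consistent with Theorem 12.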
Lasso oracle inequality

• For simplicity, let κ = γ_min(Σ) and ρ² = ρ²(Σ) = maxᵢ Σᵢᵢ.

Theorem 13

Under condition (14), consider the Lagrangian Lasso with regularization
parameter λ ≥ 2‖z‖∞, where z = Xᵀw/n. For any θ* ∈ R^d, any optimal
solution θ̂ satisfies the bound

  ‖θ̂ − θ*‖₂² ≤ (144/c₁²)(λ²/κ²)|S|                                   [estimation error]
              + (16/c₁)(λ/κ)‖θ*_{Sᶜ}‖₁ + (32c₂/c₁)(ρ²/κ)(log d / n)‖θ*_{Sᶜ}‖₁²   [approximation error]   (15)

valid for any subset S with cardinality

  |S| ≤ (c₁/64c₂)(κ/ρ²)(n / log d).
• Simplifying the bound:

  ‖θ̂ − θ*‖₂² ≤ κ₁λ²|S| + κ₂λ‖θ*_{Sᶜ}‖₁ + κ₃(log d / n)‖θ*_{Sᶜ}‖₁²

  where κ₁, κ₂, κ₃ are constants depending on Σ.
• Assume σ = 1 (noise variance) for simplicity.
• Since ‖z‖∞ ≲ √(log d / n) w.h.p., we can take λ of this order:

  ‖θ̂ − θ*‖₂² ≲ (log d / n)|S| + √(log d / n)‖θ*_{Sᶜ}‖₁ + (log d / n)‖θ*_{Sᶜ}‖₁²

• Optimizing the bound:

  ‖θ̂ − θ*‖₂² ≲ inf_{|S| ≲ n/log d} [ (log d / n)|S| + √(log d / n)‖θ*_{Sᶜ}‖₁ + (log d / n)‖θ*_{Sᶜ}‖₁² ]

• An oracle that knows θ* can choose the optimal S.
Example: ℓ_q-ball sparsity

• Assume that θ* ∈ B_q, i.e., Σⱼ₌₁ᵈ |θ*ⱼ|^q ≤ 1, for some q ∈ [0, 1].
• Then, assuming σ² = 1, we have the rate (Exercise 7.12)

  ‖θ̂ − θ*‖₂² ≲ (log d / n)^(1 − q/2).

Sketch:
• Trick: take S = {i : |θ*ᵢ| > τ} and find a good threshold τ later.
• Show that ‖θ*_{Sᶜ}‖₁ ≤ τ^(1−q) and |S| ≤ τ^(−q).
• The bound would be of the form (ε := √(log d / n))

  ε²τ^(−q) + ετ^(1−q) + (ετ^(1−q))².

• Ignore the last term (assuming ετ^(1−q) ≤ 1, it is not dominant), and
  balance the first two terms by taking τ = ε, which gives the rate
  ε^(2−q) = (log d / n)^(1 − q/2).
Proof of Theorem 13 (Compact)

• The basic inequality argument and the assumption λ ≥ 2‖z‖∞ give

  (1/n)‖X∆̂‖₂² ≤ λ(3‖∆̂_S‖₁ − ‖∆̂_{Sᶜ}‖₁ + b),

  where b := 2‖θ*_{Sᶜ}‖₁.
• The error satisfies ‖∆̂_{Sᶜ}‖₁ ≤ 3‖∆̂_S‖₁ + b, hence
• ‖∆̂‖₁² ≤ 32s‖∆̂‖₂² + 2b²  (∆̂ behaves almost like an s-sparse vector).
• Bound (14) can be written as (α₁, α₂ > 0)

  (1/n)‖X∆‖₂² ≥ α₁‖∆‖₂² − α₂‖∆‖₁²,  ∀∆ ∈ R^d.

• Combining, rearranging, dropping, etc., we get

  (α₁ − 32α₂s)‖∆̂‖₂² ≤ λ(3√s‖∆̂‖₂ + b) + 2α₂b².

• Assume α₁ − 32α₂s ≥ α₁/2 > 0, so that the quadratic inequality enforces an
  upper bound on ‖∆̂‖₂.
• Hint: ax² ≤ bx + c, x ≥ 0 ⟹ x ≤ b/a + √(c/a) ⟹ x² ≤ 2b²/a² + 2c/a.
Metric entropy and related ideas

• Metric entropy ideas allow us to reduce considerations over massive
  (uncountable) sets to finite subsets.
• Metric entropy is a measure of the size of a set; a quantitative measure of
  compactness.
• A set in a metric space is totally bounded iff it can be covered by a finite
  number of ε-balls, for every ε > 0.
Covering and metric entropy

• How do we measure the sizes of sets in metric spaces?
• One idea is based on measures of sets, assuming that there is a suitable
  measure on the space (say, Lebesgue measure in R^d).
• A more topological idea: covering one set with copies of another set.

Definition 5 (Covering number)

An ε-cover or ε-net of a set T w.r.t. a metric ρ is a set

  N := {θ₁, . . . , θ_N} ⊂ T

such that for every t ∈ T there exists θᵢ ∈ N with ρ(t, θᵢ) ≤ ε.
The ε-covering number N(ε; T, ρ) is the cardinality of the smallest ε-cover.

• log N(ε; T, ρ) is called the metric entropy of the (metric) space (T, ρ).
• In normed vector spaces, ρ(x, y) = ‖x − y‖.
• Notation: B := {θ : ‖θ‖ ≤ 1}.
• The ball of radius ε centered at θⱼ is θⱼ + εB.
• N := {θ₁, . . . , θ_N} ⊂ T is an ε-cover iff

  T ⊂ ⋃ⱼ₌₁ᴺ (θⱼ + εB),

  i.e., T is covered by shifted copies of εB.
Packing

Definition 6

An ε-packing of a set T w.r.t. a metric ρ is a set

  M := {θ₁, . . . , θ_N} ⊂ T

such that ρ(θᵢ, θⱼ) > ε for all i ≠ j.
The ε-packing number M(ε; T, ρ) is the cardinality of the largest ε-packing.
Example

Example 3 (Covering in ℓ∞)

• Take T = [−1, 1] and ρ(θ, θ′) = |θ − θ′|.
• Divide [−1, 1] into L = ⌊1/ε⌋ + 1 sub-intervals,
• centered at the points θᵢ = −1 + (2i − 1)ε for i ∈ [L] = {1, . . . , L}.
• It is easy to verify that θ₁, θ₂, . . . , θ_L form an ε-net of [−1, 1], hence

  N(ε; [−1, 1], | · |) ≤ 1/ε + 1.

• Exercise: Show that this analysis can be extended to covering
  B∞ = [−1, 1]^d in the ℓ∞ metric:

  N(ε; [−1, 1]^d, ‖ · ‖∞) ≤ (1 + 1/ε)^d.
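The net construction of Example 3 can be verified directly. The check below (not from the slides; the grid resolution and tolerance are arbitrary) confirms that the L = ⌊1/ε⌋ + 1 centers form an ε-net of [−1, 1]:

```python
import math

# Verify: the points theta_i = -1 + (2i - 1)*eps, i = 1..L with
# L = floor(1/eps) + 1, form an eps-net of [-1, 1].
def is_eps_net(eps, num_test=10_000):
    L = math.floor(1 / eps) + 1
    centers = [-1 + (2 * i - 1) * eps for i in range(1, L + 1)]
    for k in range(num_test + 1):              # fine grid of [-1, 1]
        t = -1 + 2 * k / num_test
        # small tolerance guards against floating-point boundary cases
        if min(abs(t - c) for c in centers) > eps + 1e-9:
            return False
    return True

net_ok = all(is_eps_net(eps) for eps in (0.5, 0.3, 0.1, 0.07))
print(net_ok)
```

Each interval of radius ε around θᵢ covers [−1 + (2i − 2)ε, −1 + 2iε], and the L intervals together cover past the right endpoint 1.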
Relation between packing and covering

Lemma 6

For all ε > 0, the packing and covering numbers are related as follows:

  M(2ε; T, ρ) ≤ N(ε; T, ρ) ≤ M(ε; T, ρ).

• Proof: Exercise. (Hint: any maximal packing is automatically a covering of
  suitable radius.)
Volume ratio estimates

Lemma 7

• Let ‖ · ‖ and ‖ · ‖′ be two norms on R^d with respective unit balls B and B′.
• That is, B = {θ ∈ R^d : ‖θ‖ ≤ 1}, and similarly for B′.
Then, the ε-covering of B in ‖ · ‖′ satisfies

  (1/ε)^d vol(B)/vol(B′) ≤ N(ε; B, ‖ · ‖′) ≤ M(ε; B, ‖ · ‖′) ≤ vol((2/ε)B + B′)/vol(B′).

• Important special case: covering balls in their own metric (B = B′):

  d log(1/ε) ≤ log N(ε; B, ‖ · ‖) ≤ d log(1 + 2/ε).

• Recovers log N(ε; B^d_∞, ‖ · ‖∞) ≍ d log(1/ε) from Example 3.
• Often we care about ε → 0: N(ε; B^d₂, ‖ · ‖₂) ≤ (3/ε)^d for ε ≤ 1.
• We also write N₂(ε, B^d₂) := N(ε; B^d₂, ‖ · ‖₂).
Proof of Lemma 7

• Let {θ₁, . . . , θ_N} be an ε-covering of B in ‖ · ‖′. Then

  B ⊆ ⋃ⱼ₌₁ᴺ (θⱼ + εB′),

  which gives (by subadditivity of volume)

  vol(B) ≤ N vol(εB′) = N ε^d vol(B′),

  using translation invariance of the Lebesgue measure. Rearranging gives the
  first inequality of Lemma 7.
• For the other direction, let {θ₁, . . . , θ_M} be a maximal ε-packing.
• By maximality, it is also an ε-covering.
• The balls θᵢ + (ε/2)B′, i = 1, . . . , M, are disjoint and contained in
  B + (ε/2)B′. Hence

  M vol((ε/2)B′) = Σᵢ₌₁ᴹ vol(θᵢ + (ε/2)B′) ≤ vol(B + (ε/2)B′).

• Pulling ε/2 out from both sides completes the proof.
Operator norm of sub-G matrices

• We now show that A ∈ R^(m×n) with independent sub-Gaussian entries
  satisfying maxᵢⱼ‖Aᵢⱼ‖ψ₂ = O(1) has operator norm |||A|||op ≲ √m + √n.
• An application of metric entropy ideas via discretization arguments.

Theorem 14 (HDP 4.4.5, p. 85)

• A = (Aᵢⱼ) an m × n matrix,
• Aᵢⱼ: independent, zero mean, sub-Gaussian,
• maxᵢⱼ‖Aᵢⱼ‖ψ₂ ≤ K.
Then, with prob. ≥ 1 − e^(−t²),

  |||A|||op ≤ CK(√m + √n + t).

• We will use a discretization argument via ε-nets (metric entropy ideas).
• Note: for a deterministic matrix with Aᵢⱼ ∈ [−K, K], |||A|||op can be as
  large as K√(mn); consider A = K·1_m1ₙᵀ, where 1_m is the all-ones vector
  in R^m.
Discretization argument

• Recall that the operator norm of A ∈ R^(m×n) is

  |||A|||op = sup_{x∈B₂ⁿ} ‖Ax‖₂ = sup_{x∈S^(n−1)} ‖Ax‖₂ = sup_{x∈S^(n−1), y∈S^(m−1)} ⟨Ax, y⟩,

  where B₂ⁿ = {x ∈ Rⁿ : ‖x‖₂ ≤ 1} and S^(n−1) = {x ∈ Rⁿ : ‖x‖₂ = 1}.
• Let N be any ε-net of B₂ⁿ (or S^(n−1)). Then:

  sup_{x∈N} ‖Ax‖₂ ≤ |||A|||op ≤ (1 − ε)⁻¹ sup_{x∈N} ‖Ax‖₂   (16)

• Let N and M be any ε-nets of B₂ⁿ and B₂^m. Then:

  sup_{x∈N, y∈M} ⟨Ax, y⟩ ≤ |||A|||op ≤ (1 − 2ε)⁻¹ sup_{x∈N, y∈M} ⟨Ax, y⟩.   (17)
Proof

• Discretize the spheres. Recall N₂(ε, B^d₂) ≤ (1 + 2/ε)^d.
• Take ε = 1/4 nets M ⊂ S^(m−1) and N ⊂ S^(n−1):

  |M| ≤ 9^m,  |N| ≤ 9^n.

• Hence, with (1 − 2ε)⁻¹ = 2,

  |||A|||op ≤ 2 sup_{x∈N, y∈M} ⟨Ax, y⟩.

• Fix x ∈ N and y ∈ M. Since ⟨Ax, y⟩ = Σᵢⱼ Aᵢⱼxⱼyᵢ is a sum of independent
  sub-Gaussians,

  ‖⟨Ax, y⟩‖²_ψ₂ ≤ C Σᵢⱼ ‖Aᵢⱼxⱼyᵢ‖²_ψ₂ = C Σᵢⱼ xⱼ²yᵢ²‖Aᵢⱼ‖²_ψ₂ ≤ CK²,

  hence P(⟨Ax, y⟩ ≥ Ku) ≤ e^(−cu²) for u ≥ 0.
• Applying the union bound over the |N||M| ≤ 9^(n+m) pairs,

  P( max_{(x,y)∈N×M} ⟨Ax, y⟩ ≥ Ku ) ≤ 9^(n+m) · e^(−cu²).

• Taking cu² ≥ C(n + m) + t² for large C, we can cancel 9^(n+m).
• For example, take cu² = 3(n + m) + t²; then

  9^(n+m) · e^(−cu²) ≤ e^(−t²)

  and √c·u ≤ √(3(m + n) + t²) ≤ √3(√m + √n) + t.
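The √m + √n scaling (as opposed to the worst-case √(mn)) is visible in simulation. The snippet below (not from the slides; seed and dimensions are arbitrary) uses Rademacher (±1) entries, which are sub-Gaussian with K = O(1):

```python
import numpy as np

# Check that ||A||_op scales like sqrt(m) + sqrt(n), not sqrt(m*n),
# for a matrix with i.i.d. Rademacher entries.
rng = np.random.default_rng(2)
ratios = []
for (m, n) in [(50, 200), (100, 400), (200, 800)]:
    A = rng.choice([-1.0, 1.0], size=(m, n))
    op = np.linalg.norm(A, ord=2)            # largest singular value
    ratios.append(op / (np.sqrt(m) + np.sqrt(n)))
print(ratios)  # each ratio stays bounded as m, n grow
```

The ratios stay close to a constant across the three sizes, while op/√(mn) would shrink to zero, confirming that a random sign matrix is far from the worst deterministic case.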
Op-norm: Gaussian case, sharper constants

• For the Gaussian case Aᵢⱼ ~ N(0, 1), one can get the sharp version

  |||A|||op ≤ √m + √n + t,  w.p. ≥ 1 − e^(−t²/2),

  using concentration of Lipschitz functions of Gaussian vectors,

  P(|||A|||op − E|||A|||op ≥ t) ≤ e^(−t²/2),

  combined with the sharp bound E|||A|||op ≤ √m + √n, obtained by
  Sudakov–Fernique (an example of a Gaussian comparison inequality).
• This proof is very different from the ε-net argument.
Sub-Gaussian vectors

• A random vector x ∈ R^d has a sub-Gaussian distribution if (HDP §3.4)

  ⟨u, x⟩ is sub-Gaussian for all u ∈ S^(d−1),

  in which case the sub-Gaussian norm of x is defined to be

  ‖x‖ψ₂ = sup_{u∈S^(d−1)} ‖⟨u, x⟩‖ψ₂.

• Any vector x = (x₁, . . . , x_d) with independent sub-Gaussian coordinates is
  sub-Gaussian, and (HDP Lemma 3.4.2)

  ‖x‖ψ₂ ≤ C max_{1≤i≤d} ‖xᵢ‖ψ₂.

• The definition includes examples where the coordinates are not independent:
  • x ~ Unif(√d S^(d−1)) ⟹ ‖x‖ψ₂ ≤ C  (uniform distribution on the sphere),
  • x ~ N(0, I_d) ⟹ ‖x‖ψ₂ ≤ C,
  • x ~ N(0, Σ) ⟹ ‖x‖ψ₂ ≤ C|||√Σ|||op.
• If x ~ N(0, Σ), then x = √Σ z for z ~ N(0, I_d); apply Lemma 8.
Lemma 8

Let x ∈ R^d be a sub-Gaussian vector, and A ∈ R^(n×d) a deterministic matrix.
Then, Ax is a sub-Gaussian vector and

  ‖Ax‖ψ₂ ≤ |||A|||op ‖x‖ψ₂.

• Let u ∈ S^(n−1). Then ⟨u, Ax⟩ = ⟨Aᵀu, x⟩ is sub-Gaussian, by the definition
  of sub-Gaussianity of x.
• Let v = Aᵀu. Then,

  ‖⟨u, Ax⟩‖ψ₂ = ‖⟨v, x⟩‖ψ₂ = ‖v‖₂ ‖⟨v/‖v‖₂, x⟩‖ψ₂ ≤ ‖v‖₂‖x‖ψ₂.

• Thus,

  sup_{u∈S^(n−1)} ‖⟨u, Ax⟩‖ψ₂ ≤ ‖x‖ψ₂ sup_{u∈S^(n−1)} ‖Aᵀu‖₂ = ‖x‖ψ₂ |||Aᵀ|||op.

• But |||Aᵀ|||op = |||A|||op (show it!), finishing the proof.
• Lemma 8 parallels a similar property of Gaussians.
Concentration of sample covariance matrix

• A discretization argument can prove concentration of covariance matrices.

Theorem 15 (≈ HDS Theorem 6.2)

Let X ∈ R^(n×d) have
• independent sub-Gaussian rows xᵢ ∈ R^d,
• with E[xᵢxᵢᵀ] = Σ, and
• let σ be a bound on the sub-Gaussian norm of xᵢ for all i.
Let Σ̂ = XᵀX/n = (1/n)Σᵢ₌₁ⁿ xᵢxᵢᵀ. Then, for all δ ≥ 0,

  P[ |||Σ̂ − Σ|||op/σ² ≥ c₁(√(d/n) + d/n) + δ ] ≤ c₂ exp[−c₃n · min(δ, δ²)].

• For the Gaussian case, σ² ≲ |||Σ|||op, hence w.h.p.

  |||Σ̂ − Σ|||op ≲ (√(d/n) + d/n)|||Σ|||op.

• That is, n ≳ ε⁻²d is enough to get |||Σ̂ − Σ|||op ≲ ε|||Σ|||op.
• Significance: we don't need ~ d² samples, just ~ d, to approximate Σ in
  operator norm.
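The √(d/n) rate is easy to see empirically in the simplest Gaussian case Σ = I_d. The snippet below (not from the slides; the seed and sample sizes are arbitrary) quadruples n repeatedly and watches the operator-norm error halve:

```python
import numpy as np

# Empirical rate check for Theorem 15 with Sigma = I_d:
# ||Sigma_hat - Sigma||_op should be on the order of sqrt(d/n) for n >> d.
rng = np.random.default_rng(3)
d = 50
errs = {}
for n in (500, 2000, 8000):
    X = rng.standard_normal((n, d))          # rows ~ N(0, I_d)
    Sigma_hat = X.T @ X / n
    errs[n] = np.linalg.norm(Sigma_hat - np.eye(d), ord=2)
print(errs)  # error roughly halves each time n is quadrupled
```

The errors decrease like √(d/n): quadrupling n roughly halves them, matching the leading term of the theorem.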
Proof sketch

• Again we reduce to a 1/4-net N of S^(d−1) with |N| ≤ 9^d.
• For any fixed u ∈ S^(d−1), we have

  ⟨u, (Σ̂ − Σ)u⟩ = (1/n) Σᵢ₌₁ⁿ [⟨xᵢ, u⟩² − E⟨xᵢ, u⟩²],

• and by the centering lemma,

  ‖⟨xᵢ, u⟩² − E⟨xᵢ, u⟩²‖ψ₁ ≲ ‖⟨xᵢ, u⟩²‖ψ₁ = ‖⟨xᵢ, u⟩‖²_ψ₂ ≲ σ².

• Bernstein inequality for sub-exponential variables (Corollary 1):

  P( |⟨u, (Σ̂ − Σ)u⟩| ≥ t ) ≤ 2 exp[−cn · min(t²/σ⁴, t/σ²)].
• Replace t with σ²t and take the union bound over N (recall |N| ≤ 9^d):

  P( max_{u∈N} |⟨u, (Σ̂ − Σ)u⟩| ≥ σ²t ) ≤ 2 · 9^d exp[−cn min(t², t)].

• Take t = max(ε², ε), so that min(t², t) = ε² (Lemma 9). Then,

  P( max_{u∈N} |⟨u, (Σ̂ − Σ)u⟩| ≥ σ² max(ε², ε) ) ≤ 2 · 9^d exp[−cnε²].

• Taking ε = √(Cd/n) + δ for sufficiently large C, so that ε² ≥ Cd/n + δ²,

  9^d exp(−cnε²) ≤ 9^d exp[−cn(Cd/n + δ²)] = 9^d e^(−cCd) e^(−cnδ²) ≤ e^(−cnδ²),

  assuming C is large enough so that 9^d e^(−cCd) ≤ 1.
• On the other hand,

  max(ε², ε) ≤ ε² + ε ≤ 2Cd/n + 2δ² + √(Cd/n) + δ.

• Recalling that |||Σ̂ − Σ|||op ≤ 2 max_{u∈N} |⟨u, (Σ̂ − Σ)u⟩|, we have proven
  the following result:

  P[ |||Σ̂ − Σ|||op/σ² ≥ c₁(d/n + √(d/n) + δ² + δ) ] ≤ 2e^(−cnδ²).

• The alternative form in the statement of the theorem can be obtained as
  follows:
  • First, δ² + δ ≤ 2 max(δ², δ).
  • Now take δ² = min(t, t²) for t ≥ 0. By Lemma 9, max(δ, δ²) = t.
Trick Lemma

Lemma 9

The two functions f(δ) := max(δ, δ²) and g(t) := √(min(t², t)) = min(t, √t)
are inverses of each other on [0, ∞). That is,

  t = max(δ, δ²) ⟹ min(t², t) = δ².

• Can be proved by separating the cases [0, 1] and (1, ∞),
• or by drawing a picture, or realizing (maxᵢ fᵢ)⁻¹ = minᵢ fᵢ⁻¹.

[Figure: plot of max(δ, δ²) and min(t, √t), mirror images across the diagonal.]
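The inverse relationship is a one-liner to check. The snippet below (a small check, not from the slides; the test points are arbitrary) composes the two functions both ways:

```python
import math

# Check that f(delta) = max(delta, delta^2) and g(t) = min(t, sqrt(t))
# are inverses of each other on [0, infinity).
def f(delta):
    return max(delta, delta ** 2)

def g(t):
    return min(t, math.sqrt(t))

pts = [0.0, 0.1, 0.5, 1.0, 1.7, 3.0, 10.0]
inv_ok = all(abs(g(f(x)) - x) < 1e-12 and abs(f(g(x)) - x) < 1e-12
             for x in pts)
print(inv_ok)
```

Below 1, f is the identity on δ (and g on t); above 1, f squares and g takes square roots, so the compositions cancel on both ranges.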
Structured covariance matrices

• Assuming that the covariance matrix is simpler leads to faster rates.
• For diagonal matrices, we can achieve

  |||Σ̂ − Σ|||op = O(√(log d / n))

  w.h.p., as opposed to √(d/n). (Exercise 6.15)
• More generally, we work under a sparsity model.
• The sparsity of Σ is the same as that of its adjacency matrix:

  A = A_Σ = (Aᵢⱼ),  Aᵢⱼ = 1{Σᵢⱼ ≠ 0}.

• The sparsity of Σ can be effectively measured by |||A_Σ|||op.
• Σ has s-sparse rows ≡ A_Σ has maximum degree s − 1, hence

  |||A_Σ|||op ≤ s,

  using |||A_Σ|||op ≤ |||A_Σ|||∞ = maxᵢ Σⱼ |Aᵢⱼ|.
Thresholded covariance estimator

• Consider the hard-thresholding operator

  T_λ(u) = u · 1{|u| > λ} = { u,  |u| > λ;  0, otherwise }.

Theorem 16 (Covariance thresholding; HDS Theorem 6.23, p. 181)

Assume that x₁, . . . , xₙ ∈ R^d are i.i.d. zero mean with covariance matrix Σ
and ‖xᵢⱼ‖ψ₂ ≤ σ for all i, j, where xᵢⱼ is the jth element of xᵢ. Assume that

  n > log d,  λ = σ²(8√(log d / n) + δ).

Then the thresholded estimator T_λ(Σ̂) satisfies

  P( |||T_λ(Σ̂) − Σ|||op ≥ 2λ|||A_Σ|||op ) ≤ 8 exp( −(n/16) min(δ, δ²) ).
• If Σ has s-sparse rows, the error bound is

  O(λs) = O(s√(log d / n)).

• |||A_Σ|||op could be as large as s in this case (HDS Ex. 6.16):
  Example: a graph with maximum degree s − 1 that contains an s-clique.
• |||A_Σ|||op could also be smaller:
  Example: hub-and-spoke, where the hub is connected to s of the d − 1 spokes;
  then |||A_Σ|||op = 1 + √(s − 1), and the bound becomes O(√(s log d / n)).
Proof of Theorem 16

• The result follows easily from the following deterministic lemma:

Lemma 10

Assume that ‖Σ̂ − Σ‖∞ ≤ λ. Then, |||T_λ(Σ̂) − Σ|||op ≤ 2λ|||A_Σ|||op.

• If Σᵢⱼ = 0, then |Σ̂ᵢⱼ| ≤ λ, hence T_λ(Σ̂ᵢⱼ) = 0. Thus, |T_λ(Σ̂ᵢⱼ) − Σᵢⱼ| = 0.
• If Σᵢⱼ ≠ 0, then

  |T_λ(Σ̂ᵢⱼ) − Σᵢⱼ| ≤ |T_λ(Σ̂ᵢⱼ) − Σ̂ᵢⱼ| + |Σ̂ᵢⱼ − Σᵢⱼ| ≤ 2λ.

• It follows that |T_λ(Σ̂ᵢⱼ) − Σᵢⱼ| ≤ 2λ[A_Σ]ᵢⱼ. The lemma follows using:

Lemma 11 (Exercise 6.3)

Let A, B, C be symmetric matrices.
(a) If 0 ≤ A ≤ B elementwise, then |||A|||op ≤ |||B|||op.
(b) Let |C| be the elementwise absolute value of C. Then, |||C|||op ≤ ||| |C| |||op.
• It remains to show the following:

Lemma 12

Under the assumptions of Theorem 16, w.p. ≥ 1 − 8 exp(−(n/16) min(δ, δ²)),

  ‖Σ̂ − Σ‖∞ ≤ λ := σ²(8√(log d / n) + δ).

• Let us establish a tail bound for a fixed coordinate pair (k, ℓ):

  Σ̂ₖℓ − Σₖℓ = (1/n) Σᵢ₌₁ⁿ (xᵢₖxᵢℓ − E[xᵢₖxᵢℓ]).

• We have, by the centering lemma,

  ‖xᵢₖxᵢℓ − E[xᵢₖxᵢℓ]‖ψ₁ ≲ ‖xᵢₖxᵢℓ‖ψ₁ ≤ ‖xᵢₖ‖ψ₂‖xᵢℓ‖ψ₂ ≲ σ².

• The second step follows from ‖XY‖ψ₁ ≤ ‖X‖ψ₂‖Y‖ψ₂ when k ≠ ℓ, and from
  ‖X²‖ψ₁ = ‖X‖²_ψ₂ when k = ℓ.
• Bernstein inequality gives

  P( |Σ̂ₖℓ − Σₖℓ| ≥ t ) ≤ 2 exp[−cn · min(t²/σ⁴, t/σ²)].

• Applying the union bound over the ≤ d² choices of 1 ≤ k ≤ ℓ ≤ d gives

  P( ‖Σ̂ − Σ‖∞ ≥ t ) ≤ 2d² exp[−cn · min(t²/σ⁴, t/σ²)].

• The rest is standard manipulation:
• Setting t = λ = σ²(c₁√(log d / n) + δ), we have

  min(t²/σ⁴, t/σ²) ≥ min( c₁² log d/n + δ², c₁√(log d/n) + δ )
                   ≥ min( c₁² log d/n, c₁√(log d/n) ) + min(δ², δ)
                   ≥ c₂( log d/n + min(δ², δ) ),

  using n > log d. Here, c₂ = min(c₁², c₁, 1). By taking c₁ large enough, we
  can cancel d² and recover the desired bound.
Approximate sparsity

• For a completely dense Σ, the bound is very poor:

  A_Σ = 11ᵀ, hence |||A_Σ|||op = d.

• A milder constraint is approximate row sparsity in the ℓ_q sense, for
  q ∈ [0, 1]:

  maxᵢ₌₁,...,d Σⱼ₌₁ᵈ |Σᵢⱼ|^q ≤ R_q.   (18)
Thresholded estimator; ℓ_q bound

Theorem 17 (Covariance thresholding; HDS Theorem 6.27)

Suppose that Σ satisfies the ℓ_q sparsity condition (18). Then, for any λ such
that ‖Σ̂ − Σ‖∞ ≤ λ/2, we have

  |||T_λ(Σ̂) − Σ|||op ≤ 3R_q λ^(1−q).

Consequently, if x₁, . . . , xₙ ∈ R^d are i.i.d. zero mean with covariance
matrix Σ and ‖xᵢⱼ‖ψ₂ ≤ σ for all i, j, where xᵢⱼ is the jth element of xᵢ, and

  n > log d,  λ = σ²(8√(log d / n) + δ),

then the thresholded estimator T_λ(Σ̂) satisfies, for all δ ≥ 0,

  P( |||T_λ(Σ̂) − Σ|||op ≥ 2R_q λ^(1−q) ) ≤ 8 exp( −(n/16) min(δ, δ²) ).

• Note that the error rate is O((log d / n)^((1−q)/2)).
Proof of Theorem 17

• The stochastic part has already been proven.
• We only need to prove the deterministic part, which follows since

  |||T_λ(Σ̂) − Σ|||op ≤ maxᵢ₌₁,...,d ‖[T_λ(Σ̂) − Σ]ᵢ∗‖₁,

  followed by the application of the next lemma to each row.
Lemma 13

Assume that ‖θ − θ*‖∞ ≤ λ/2. Let S_λ = {i : |θ*ᵢ| > λ/2}. Then,

  ‖T_λ(θ) − θ*‖₁ ≤ |S_λ| · λ + ‖θ*_{S_λᶜ}‖₁.   (19)

In particular, if θ* satisfies the ℓ_q-ball constraint Σᵢ|θ*ᵢ|^q ≤ R, then

  ‖T_λ(θ) − θ*‖₁ ≤ 3Rλ^(1−q).   (20)

• Proof: Recall that |T_λ(u) − u| ≤ λ for all u ∈ R. We have

  Σ_{i∈S_λ} |T_λ(θᵢ) − θ*ᵢ| ≤ |S_λ| · λ.

• For i ∉ S_λ, |θ*ᵢ| ≤ λ/2, hence by the triangle inequality |θᵢ| ≤ λ.
• It follows that T_λ(θᵢ) = 0 for i ∉ S_λ, thus

  Σ_{i∉S_λ} |T_λ(θᵢ) − θ*ᵢ| = Σ_{i∉S_λ} |θ*ᵢ| = ‖θ*_{S_λᶜ}‖₁.

• This proves the first assertion, (19).
• For the second assertion, we apply the next lemma (with τ = λ/2).
• We have |S_λ| ≤ (λ/2)^(−q)R and ‖θ*_{S_λᶜ}‖₁ ≤ (λ/2)^(1−q)R. Hence,

  ‖T_λ(θ) − θ*‖₁ ≤ (λ/2)^(−q)R · λ + (λ/2)^(1−q)R = 3(λ/2)^(1−q)R,

  which is further bounded by 3λ^(1−q)R.
Lemma 14

Assume that Σⱼ|θⱼ|^q ≤ R for some q ≥ 0. Let S = {i : |θᵢ| > τ}. Then,
(a) |S| ≤ R τ^(−q).
(b) ‖θ_{Sᶜ}‖_p^p ≤ R τ^(p−q) for any p ≥ q.

• Proof: Exercise. Note that |x| ≤ 1 implies |x|^p ≤ |x|^q for p ≥ q.
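Lemma 14 can be checked on a concrete vector. The snippet below (not from the slides; the vector, q, and τ are arbitrary illustrative choices) verifies both the support-size bound and the ℓ₁ tail bound (case p = 1):

```python
# Numerical check of Lemma 14 for a concrete theta in an l_q ball:
# with S = {i : |theta_i| > tau}, we should have |S| <= R * tau**(-q)
# and sum_{i not in S} |theta_i| <= R * tau**(1 - q).
q, tau = 0.5, 0.1
theta = [0.5, 0.3, 0.2, 0.05, 0.04, 0.01, 0.003]
R = sum(abs(t) ** q for t in theta)          # theta is in the l_q ball of radius R
S = [i for i, t in enumerate(theta) if abs(t) > tau]
tail = sum(abs(theta[i]) for i in range(len(theta)) if i not in S)
bounds_ok = (len(S) <= R * tau ** (-q)) and (tail <= R * tau ** (1 - q))
print(bounds_ok)
```

Both bounds hold, as they must: the entries above τ each contribute at least τ^q to the ℓ_q budget, and the entries below τ satisfy |θᵢ| ≤ τ^(1−q)|θᵢ|^q.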
Matrix Bernstein inequality

Theorem 18 (Matrix Bernstein inequality)

Let X₁, . . . , Xₙ be independent zero-mean d × d symmetric random matrices
such that |||Xᵢ|||op ≤ K a.s. for all i. Then, for all t ≥ 0,

  P( |||Σᵢ₌₁ⁿ Xᵢ|||op ≥ t ) ≤ 2d exp( −(t²/2)/(σ² + Kt/3) ).

Here, σ² = |||Σᵢ₌₁ⁿ E Xᵢ²|||op.

• Since EXᵢ = 0, the matrix variance of Xᵢ is var(Xᵢ) = EXᵢ².
• Note that Xᵢ², hence EXᵢ², is positive semi-definite (why?).
• An alternative form of the bound is

  P( |||Σᵢ₌₁ⁿ Xᵢ|||op ≥ t ) ≤ 2d exp( −c min(t²/σ², t/K) ).
Matrix Bernstein: expectation

Corollary 2

Let X₁, . . . , Xₙ be independent zero-mean d × d symmetric random matrices
such that |||Xᵢ|||op ≤ K a.s. for all i. Then,

  E|||Σᵢ₌₁ⁿ Xᵢ|||op ≲ σ√(log d) + K log d,

where σ² = |||Σᵢ₌₁ⁿ E Xᵢ²|||op.
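Corollary 2 can be probed numerically on random sign-flips of fixed matrices. In the sketch below (not from the slides; diagonal Bᵢ, the seed, and the sizes are illustrative simplifications) we take Xᵢ = εᵢBᵢ with Rademacher εᵢ, so that EXᵢ = 0 and EXᵢ² = Bᵢ²:

```python
import numpy as np

# Empirical look at the expectation bound E|||sum X_i|||_op <~ sigma*sqrt(log d)
# for X_i = eps_i * B_i, with eps_i = +-1 and fixed symmetric B_i.
rng = np.random.default_rng(5)
d, n_terms = 30, 200
Bs = [np.diag(rng.standard_normal(d)) for _ in range(n_terms)]  # fixed B_i
eps = rng.choice([-1.0, 1.0], size=n_terms)
S = sum(e * B for e, B in zip(eps, Bs))
op = np.linalg.norm(S, ord=2)

# matrix variance: sigma^2 = |||sum E X_i^2|||_op = |||sum B_i^2|||_op
sigma = np.sqrt(np.linalg.norm(sum(B @ B for B in Bs), ord=2))
ratio = op / (sigma * np.sqrt(np.log(d)))
print(ratio)
```

The ratio is an O(1) quantity, as predicted by the σ√(log d) term (here K log d is of lower order since n_terms is large relative to d).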
Semidefinite (or Loewner) order

Definition 7

For symmetric n × n matrices, we write X ⪰ Y (equivalently, Y ⪯ X) whenever
X − Y ⪰ 0, i.e., X − Y is positive semi-definite.

• |||X|||op ≤ t if and only if −tI ⪯ X ⪯ tI.
• 0 ⪯ X ⪯ Y implies |||X|||op ≤ |||Y|||op.
• X ⪯ Y implies tr(X) ≤ tr(Y).
Estimating a general covariance matrix

Theorem 19 (General covariance, HDP 5.6.1, p. 122)

Let x be a random vector in R^d. Assume that for some K ≥ 1,

  ‖x‖₂ ≤ K(E‖x‖₂²)^(1/2)  almost surely.

Let x₁, . . . , xₙ ~ x i.i.d. and Σ̂ₙ = (1/n)Σᵢ₌₁ⁿ xᵢxᵢᵀ. Then, for all n ≥ 1,

  E|||Σ̂ₙ − Σ|||op ≤ C( √(K²r log d / n) + K²r log d / n )|||Σ|||op,

where Σ = Exxᵀ and r is the stable rank of Σ,

  r := tr(Σ)/|||Σ|||op.

• Note that r ≤ d; hence, replacing r with d, we get a further upper bound.
• We only pay a "log d" price for removing sub-Gaussianity.
• To estimate up to relative error ε, we need n ≳ ε⁻²r log d.
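The point of stating the bound in terms of r rather than d is that r can be dramatically smaller. The snippet below (not from the slides; the j⁻² spectrum is an assumed toy model) computes the stable rank of a covariance with a decaying spectrum:

```python
import numpy as np

# The stable rank r = tr(Sigma)/|||Sigma|||_op can be far smaller than d.
# Toy model: eigenvalues lambda_j = j^{-2}, j = 1..d.
d = 1000
spectrum = np.arange(1, d + 1, dtype=float) ** -2.0
r = spectrum.sum() / spectrum.max()       # tr(Sigma) / largest eigenvalue
print(r, d)  # r is about pi^2/6, far smaller than d = 1000
```

For such a spectrum, r stays bounded (near π²/6 ≈ 1.64) as d grows, so the sample-size requirement n ≳ ε⁻²r log d is nearly dimension-free.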
• Let Yᵢ := xᵢxᵢᵀ − Σ, so that EYᵢ = 0.
• Also write Y := xxᵀ − Σ for the generic version.
• We will apply the expectation form of matrix Bernstein.
• E‖x‖₂² = tr(Σ), hence by assumption ‖x‖₂² ≤ K² tr(Σ) a.s., giving

  |||Y|||op ≤ |||xxᵀ|||op + |||Σ|||op
           = ‖x‖₂² + |||Σ|||op
           ≤ K² tr(Σ) + |||Σ|||op ≤ 2K² tr(Σ) = 2K²r|||Σ|||op =: M,

  using K ≥ 1 and r := tr(Σ)/|||Σ|||op. We also have

  0 ⪯ EY² = E(xxᵀ − Σ)² = E(xxᵀ)² − Σ² ⪯ E(xxᵀ)² ⪯ (K² tr Σ)Σ,

  using (xxᵀ)² = ‖x‖₂² xxᵀ ⪯ (K² tr Σ)xxᵀ.
• We get |||EY²|||op ≤ (K² tr Σ)|||Σ|||op.
• We conclude that

  σ² := |||Σᵢ₌₁ⁿ EYᵢ²|||op = n|||EY²|||op ≤ n(K² tr Σ)|||Σ|||op = nrK²|||Σ|||²op.

• Applying matrix Bernstein in expectation, we have

  E|||Σ̂ₙ − Σ|||op = E|||(1/n)Σᵢ₌₁ⁿ Yᵢ|||op
                  ≲ (1/n)(σ√(log d) + M log d)
                  ≤ (1/n)( √(nrK²)|||Σ|||op √(log d) + 2K²r|||Σ|||op log d )
                  ≲ ( √(K²r log d / n) + K²r log d / n )|||Σ|||op.