

ELE 538B: Mathematics of High-Dimensional Data

Matrix concentration inequalities

Yuxin Chen

Princeton University, Fall 2018


Recap: matrix Bernstein inequality

Consider a sequence of independent random matrices {X_l ∈ ℝ^{d₁×d₂}}:

• E[X_l] = 0
• ‖X_l‖ ≤ B for each l
• variance statistic:

  v := max{ ‖E[∑_l X_l X_lᵀ]‖ , ‖E[∑_l X_lᵀ X_l]‖ }

Theorem 3.1 (Matrix Bernstein inequality)
For all τ ≥ 0,

  P{ ‖∑_l X_l‖ ≥ τ } ≤ (d₁ + d₂) exp( −(τ²/2) / (v + Bτ/3) )


Recap: matrix Bernstein inequality

[Figure: tail behavior of the matrix Bernstein bound. For τ ≲ v/B the bound decays like a Gaussian tail, exp(−τ²/(4v)); for τ ≳ v/B it decays like an exponential tail, exp(−3τ/(4B)).]


This lecture: a detailed introduction to the matrix Bernstein inequality

Reference: "An introduction to matrix concentration inequalities," Joel Tropp '15


Outline

• Matrix theory background

• Matrix Laplace transform method

• Matrix Bernstein inequality

• Application: random features



Matrix theory background


Matrix function

Suppose the eigendecomposition of a symmetric matrix A ∈ ℝ^{d×d} is

  A = U diag(λ₁, …, λ_d) Uᵀ

Then we can define

  f(A) := U diag(f(λ₁), …, f(λ_d)) Uᵀ
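To make the spectral definition concrete, here is a minimal NumPy sketch (the helper name `matrix_function` is ours, not from the slides):

```python
import numpy as np

def matrix_function(A, f):
    """Apply a scalar function f to a symmetric matrix A via its
    eigendecomposition: f(A) = U diag(f(lambda_1), ..., f(lambda_d)) U^T."""
    eigvals, U = np.linalg.eigh(A)   # A = U diag(eigvals) U^T
    return (U * f(eigvals)) @ U.T    # scale columns of U by f(eigvals), recombine

# sanity check: f(a) = a^2 should reproduce the matrix square A @ A
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                    # symmetrize
assert np.allclose(matrix_function(A, lambda x: x**2), A @ A)
```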



Examples of matrix functions

• Let f(a) = c₀ + ∑_{k=1}^∞ c_k a^k; then

  f(A) := c₀ I + ∑_{k=1}^∞ c_k A^k

• matrix exponential: e^A := I + ∑_{k=1}^∞ (1/k!) A^k (why is this well defined?)
  ◦ monotonicity: if A ⪯ H, then tr e^A ≤ tr e^H
• matrix logarithm: log(e^A) := A
  ◦ monotonicity: if 0 ⪯ A ⪯ H, then log A ⪯ log H
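The trace-monotonicity fact for the matrix exponential can be sanity-checked numerically; a small sketch (the construction of H ⪰ A via a PSD perturbation is our illustrative choice):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 5

B = rng.standard_normal((d, d))
A = (B + B.T) / 2                  # random symmetric matrix
C = rng.standard_normal((d, d))
H = A + C @ C.T                    # H - A = C C^T is PSD, so A <= H in the
                                   # semidefinite order

# monotonicity of the trace exponential: tr e^A <= tr e^H
assert np.trace(expm(A)) <= np.trace(expm(H)) + 1e-10
```

Note that the stronger statement e^A ⪯ e^H is false in general; only the trace version holds.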



Matrix moments and cumulants

Let X be a random symmetric matrix. Then

• matrix moment generating function (MGF):

  M_X(θ) := E[e^{θX}]

• matrix cumulant generating function (CGF):

  Ξ_X(θ) := log E[e^{θX}]



Matrix Laplace transform method


Matrix Laplace transform

A key step for a scalar random variable Y: by Markov's inequality,

  P{Y ≥ t} ≤ inf_{θ>0} e^{−θt} E[e^{θY}]

This can be generalized to the matrix case.
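For intuition, the scalar Laplace transform bound is easy to test. For Y ~ N(0,1) we have E[e^{θY}] = e^{θ²/2}, so inf_{θ>0} e^{−θt+θ²/2} is attained at θ = t and gives P{Y ≥ t} ≤ e^{−t²/2}; a quick Monte Carlo sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 2.0

# Laplace bound for Y ~ N(0,1): optimize e^{-theta*t} E[e^{theta*Y}]
# = e^{-theta*t + theta^2/2} over theta > 0; minimizer is theta = t.
laplace_bound = np.exp(-t**2 / 2)

samples = rng.standard_normal(1_000_000)
empirical = (samples >= t).mean()      # P{Y >= 2} is about 0.0228
assert empirical <= laplace_bound      # e^{-2} is about 0.135
```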



Matrix Laplace transform

Lemma 3.2
Let Y be a random symmetric matrix. For all t ∈ ℝ,

  P{ λ_max(Y) ≥ t } ≤ inf_{θ>0} e^{−θt} E[ tr e^{θY} ]

• this lets us control the extreme eigenvalues of Y via the trace of the matrix MGF



Proof of Lemma 3.2

For any θ > 0,

  P{ λ_max(Y) ≥ t } = P{ e^{θλ_max(Y)} ≥ e^{θt} }
                    ≤ e^{−θt} E[e^{θλ_max(Y)}]      (Markov's inequality)
                    = e^{−θt} E[e^{λ_max(θY)}]
                    = e^{−θt} E[λ_max(e^{θY})]      (since e^{λ_max(Z)} = λ_max(e^Z))
                    ≤ e^{−θt} E[tr e^{θY}]

This completes the proof, since the bound holds for every θ > 0.



Issues of the matrix MGF

The Laplace transform method is effective for controlling an independent sum when the MGF decomposes

• in the scalar case where X = X₁ + ⋯ + X_n with independent {X_l}:

  M_X(θ) = E[e^{θX₁ + ⋯ + θX_n}] = E[e^{θX₁}] ⋯ E[e^{θX_n}] = ∏_{l=1}^n M_{X_l}(θ)

  which lets us look at each X_l separately

Issues in the matrix setting:

  e^{X₁+X₂} ≠ e^{X₁} e^{X₂}   unless X₁ and X₂ commute

  tr e^{X₁+⋯+X_n} ≰ tr( e^{X₁} e^{X₂} ⋯ e^{X_n} )   in general
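Both failure modes are easy to witness numerically. A small sketch using SciPy's expm (the specific 2×2 matrices are our illustrative choice); it also checks that the two-matrix Golden-Thompson inequality tr e^{X₁+X₂} ≤ tr(e^{X₁} e^{X₂}), which we are certain holds, survives even without commutativity:

```python
import numpy as np
from scipy.linalg import expm

# two symmetric matrices that do not commute
X1 = np.array([[0., 1.], [1., 0.]])
X2 = np.array([[1., 0.], [0., -1.]])
assert not np.allclose(X1 @ X2, X2 @ X1)

# e^{X1+X2} differs from e^{X1} e^{X2} ...
assert not np.allclose(expm(X1 + X2), expm(X1) @ expm(X2))

# ... yet Golden-Thompson still gives tr e^{X1+X2} <= tr(e^{X1} e^{X2})
assert np.trace(expm(X1 + X2)) <= np.trace(expm(X1) @ expm(X2)) + 1e-10
```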



Subadditivity of the matrix CGF

Fortunately, the matrix CGF satisfies certain subadditivity rules, allowing us to decompose independent matrix components

Lemma 3.3
Consider a finite sequence {X_l}_{1≤l≤n} of independent random symmetric matrices. Then for any θ ∈ ℝ,

  E[ tr e^{θ ∑_l X_l} ] ≤ tr exp( ∑_l log E[e^{θX_l}] )

(in CGF notation: the left-hand side equals tr exp(Ξ_{∑_l X_l}(θ)), while the right-hand side is tr exp(∑_l Ξ_{X_l}(θ)))

• this is a deep result, based on Lieb's Theorem!



Lieb’s Theorem

(photo: Elliott Lieb)

Theorem 3.4 (Lieb '73)
Fix a symmetric matrix H. Then

  A ↦ tr exp(H + log A)

is concave on the positive-semidefinite cone.

Lieb's Theorem immediately implies (exercise: apply Jensen's inequality): for a fixed symmetric H and a random symmetric matrix X,

  E[ tr exp(H + X) ] ≤ tr exp( H + log E[e^X] )        (3.1)
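Concavity can be spot-checked numerically via midpoint concavity (a sketch with randomly chosen symmetric H and positive-definite A₀, A₁; all helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

def sym_apply(M, f):
    """f(M) for symmetric M via eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * f(w)) @ U.T

def phi(A, H):
    """tr exp(H + log A) for symmetric H and positive-definite A."""
    return np.trace(sym_apply(H + sym_apply(A, np.log), np.exp))

B = rng.standard_normal((d, d))
H = (B + B.T) / 2

def random_pd():
    C = rng.standard_normal((d, d))
    return C @ C.T + 0.1 * np.eye(d)   # strictly positive definite

A0, A1 = random_pd(), random_pd()
# midpoint concavity: phi at the average dominates the average of phi
assert phi((A0 + A1) / 2, H) >= (phi(A0, H) + phi(A1, H)) / 2 - 1e-8
```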



Proof of Lemma 3.3

E[ tr e^{θ ∑_l X_l} ]
  = E[ tr exp( θ ∑_{l=1}^{n−1} X_l + θX_n ) ]
  ≤ E[ tr exp( θ ∑_{l=1}^{n−1} X_l + log E[e^{θX_n}] ) ]        (by (3.1), conditioning on X₁, …, X_{n−1})
  ≤ E[ tr exp( θ ∑_{l=1}^{n−2} X_l + log E[e^{θX_{n−1}}] + log E[e^{θX_n}] ) ]
  ≤ ⋯
  ≤ tr exp( ∑_{l=1}^{n} log E[e^{θX_l}] )



Master bounds

Combining the Laplace transform method with the subadditivity of the CGF yields:

Theorem 3.5 (Master bounds for a sum of independent matrices)
Consider a finite sequence {X_l} of independent random symmetric matrices. Then

  P{ λ_max(∑_l X_l) ≥ t } ≤ inf_{θ>0} e^{−θt} tr exp( ∑_l log E[e^{θX_l}] )

• this is a general result underlying the proofs of the matrix Bernstein inequality and beyond (e.g. the matrix Chernoff bound)



Matrix Bernstein inequality


Matrix CGF

  P{ λ_max(∑_l X_l) ≥ t } ≤ inf_{θ>0} e^{−θt} tr exp( ∑_l log E[e^{θX_l}] )

To invoke the master bound, one needs to control the matrix CGF; this is the main step in proving the matrix Bernstein inequality.



Symmetric case

Consider a sequence of independent random symmetric matrices {X_l ∈ ℝ^{d×d}}:

• E[X_l] = 0
• λ_max(X_l) ≤ B for each l
• variance statistic: v := ‖E[∑_l X_l²]‖

Theorem 3.6 (Matrix Bernstein inequality: symmetric case)
For all τ ≥ 0,

  P{ λ_max(∑_l X_l) ≥ τ } ≤ d · exp( −(τ²/2) / (v + Bτ/3) )
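A quick Monte Carlo sanity check of Theorem 3.6 (the random rank-one ensemble below is our illustrative choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 10, 200, 300

# X_l = eps_l * u_l u_l^T with eps_l = +-1 and u_l a random unit vector:
# E[X_l] = 0, lambda_max(X_l) <= 1 (so B = 1), and E[X_l^2] = E[u u^T] = I/d,
# hence v = ||n * I/d|| = n/d.
def lam_max_of_sum():
    S = np.zeros((d, d))
    for _ in range(n):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        S += rng.choice([-1.0, 1.0]) * np.outer(u, u)
    return np.linalg.eigvalsh(S)[-1]

lams = np.array([lam_max_of_sum() for _ in range(trials)])
v, B = n / d, 1.0
tau = 2.5 * np.sqrt(v)
empirical = (lams >= tau).mean()
bound = d * np.exp(-(tau**2 / 2) / (v + B * tau / 3))
assert empirical <= bound      # the Bernstein tail bound holds empirically
```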



Bounding matrix CGF

For bounded random matrices, one can control the matrix CGF as follows:

Lemma 3.7
Suppose E[X] = 0 and λ_max(X) ≤ B. Then for 0 < θ < 3/B,

  log E[e^{θX}] ⪯ (θ²/2)/(1 − θB/3) · E[X²]



Proof of Theorem 3.6

Let g(θ) := (θ²/2)/(1 − θB/3). It then follows from the master bound that

  P{ λ_max(∑_l X_l) ≥ t }
    ≤ inf_{θ>0} e^{−θt} tr exp( ∑_{l=1}^n log E[e^{θX_l}] )
    ≤ inf_{0<θ<3/B} e^{−θt} tr exp( g(θ) ∑_{l=1}^n E[X_l²] )      (Lemma 3.7 and monotonicity of tr exp)
    ≤ inf_{0<θ<3/B} e^{−θt} · d exp( g(θ) v )

Taking θ = t/(v + Bt/3) and simplifying the resulting expression establishes matrix Bernstein.



Proof of Lemma 3.7

Define f(x) := (e^{θx} − 1 − θx)/x². Then for any X with λ_max(X) ≤ B:

  e^{θX} = I + θX + (e^{θX} − I − θX)
         = I + θX + X · f(X) · X
         ⪯ I + θX + f(B) · X²        (f is increasing, so f(X) ⪯ f(B) I)

In addition, we note an elementary inequality: for any 0 < θ < 3/B,

  f(B) = (e^{θB} − 1 − θB)/B² = (1/B²) ∑_{k=2}^∞ (θB)^k / k!
       ≤ (θ²/2) ∑_{k=2}^∞ (θB)^{k−2} / 3^{k−2}        (since k! ≥ 2 · 3^{k−2})
       = (θ²/2)/(1 − θB/3)

  ⟹  e^{θX} ⪯ I + θX + (θ²/2)/(1 − θB/3) · X²

Since X is zero-mean, taking expectations and using I + A ⪯ e^A, one further has

  E[e^{θX}] ⪯ I + (θ²/2)/(1 − θB/3) · E[X²] ⪯ exp( (θ²/2)/(1 − θB/3) · E[X²] )


Application: random features


Kernel trick

A modern idea in machine learning: replace the inner product by a kernel evaluation (i.e. a certain similarity measure)

Advantage: work beyond the Euclidean domain via task-specific similarity measures



Similarity measure

Define a similarity measure Φ obeying

• Φ(x, x) = 1
• |Φ(x, y)| ≤ 1
• Φ(x, y) = Φ(y, x)

Example: angular similarity

  Φ(x, y) = (2/π) arcsin( ⟨x, y⟩ / (‖x‖₂ ‖y‖₂) ) = 1 − 2∠(x, y)/π



Kernel matrix

Consider N data points x₁, …, x_N ∈ ℝ^d. The kernel matrix G ∈ ℝ^{N×N} is given by

  G_{i,j} = Φ(x_i, x_j),   1 ≤ i, j ≤ N

• the kernel Φ is said to be positive definite if G ⪰ 0 for any {x_i}

Challenge: kernel matrices are usually large
• the cost of constructing G is O(dN²)

Question: can we approximate G more efficiently?



Random features

Introduce a random variable w and a feature map ψ such that

  Φ(x, y) = E_w[ ψ(x; w) · ψ(y; w) ]     (decoupling x and y)

• example (angular similarity):

  Φ(x, y) = 1 − 2∠(x, y)/π = E_w[ sgn⟨x, w⟩ · sgn⟨y, w⟩ ]     (3.2)

  with w drawn uniformly from the unit sphere

• this results in a random feature vector

  z = [z₁, …, z_N]ᵀ = [ψ(x₁; w), …, ψ(x_N; w)]ᵀ

  ◦ the rank-1 matrix zzᵀ is an unbiased estimate of G, i.e. G = E[zzᵀ]




Example

Angular similarity:

  Φ(x, y) = 1 − 2∠(x, y)/π = E_w[ sign⟨x, w⟩ · sign⟨y, w⟩ ]

where w is drawn uniformly from the unit sphere.

As a result, the random feature map is ψ(x; w) = sign⟨x, w⟩
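Identity (3.2) is easy to verify by Monte Carlo (a sketch; the dimension, sample size, and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# closed form: Phi(x, y) = 1 - 2*angle(x, y)/pi
cos_angle = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
phi_exact = 1 - 2 * np.arccos(np.clip(cos_angle, -1, 1)) / np.pi

# Monte Carlo over w: a Gaussian vector suffices in place of a uniform
# point on the sphere, since sign<x, w> depends only on the direction of w
W = rng.standard_normal((n_samples, d))
phi_mc = np.mean(np.sign(W @ x) * np.sign(W @ y))

assert abs(phi_mc - phi_exact) < 0.02
```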



Random feature approximation

Generate n independent copies of R = zzᵀ, i.e. {R_l}_{1≤l≤n}

Estimator of the kernel matrix G:

  Ĝ = (1/n) ∑_{l=1}^n R_l

Question: how many random features are needed to guarantee accurate estimation?
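Putting the pieces together, a sketch of the estimator Ĝ for the angular-similarity kernel (the dimensions, feature count, and error tolerance are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 10, 50, 2000

X = rng.standard_normal((N, d))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

# exact angular-similarity kernel: G_ij = (2/pi) arcsin(<x_i/|x_i|, x_j/|x_j|>)
G = (2 / np.pi) * np.arcsin(np.clip(Xn @ Xn.T, -1, 1))

# random-feature estimate: G_hat = (1/n) sum_l z_l z_l^T with z_l = sign(X w_l)
W = rng.standard_normal((d, n))        # n random directions
Z = np.sign(X @ W)                     # N x n matrix of +-1 features
G_hat = (Z @ Z.T) / n

rel_err = np.linalg.norm(G_hat - G, 2) / np.linalg.norm(G, 2)
assert rel_err < 0.3                   # small relative spectral-norm error
```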



Statistical guarantees for random feature approximation

Consider the angular similarity example (3.2):

• To begin with, since each entry of z is ±1 we have zᵀz = ‖z‖₂² = N, so

  E[R_l²] = E[zzᵀzzᵀ] = N · E[zzᵀ] = N G
  ⟹  v = ‖ (1/n²) ∑_{l=1}^n E[R_l²] ‖ = (N/n) ‖G‖

• Next, (1/n)‖R_l‖ = (1/n)‖z‖₂² = N/n =: B

• Applying the matrix Bernstein inequality (to the centered sum Ĝ − G = (1/n) ∑_l (R_l − G)) yields: with high probability,

  ‖Ĝ − G‖ ≲ √(v log N) + B log N
          ≲ √( (N/n) ‖G‖ log N ) + (N/n) log N
          ≲ √( (N/n) ‖G‖ log N )        (for sufficiently large n, using ‖G‖ ≥ 1)



Sample complexity

Define the intrinsic dimension of G as

  intdim(G) := tr G / ‖G‖ = N / ‖G‖

(the second equality uses tr G = N, which holds since G_{i,i} = Φ(x_i, x_i) = 1)

If n ≳ ε⁻² · intdim(G) · log N, then

  ‖Ĝ − G‖ / ‖G‖ ≤ ε
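The sample-complexity rule can be exercised numerically; in the sketch below the absolute constant 10 in front of ε⁻² intdim(G) log N is an arbitrary demo choice, as are the dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, eps = 10, 50, 0.25

X = rng.standard_normal((N, d))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
G = (2 / np.pi) * np.arcsin(np.clip(Xn @ Xn.T, -1, 1))   # angular kernel

# choose n from the intrinsic-dimension rule n ~ eps^{-2} intdim(G) log N
intdim = np.trace(G) / np.linalg.norm(G, 2)              # = N / ||G||
n = int(10 * intdim * np.log(N) / eps**2)

W = rng.standard_normal((d, n))
Z = np.sign(X @ W)
G_hat = (Z @ Z.T) / n
assert np.linalg.norm(G_hat - G, 2) <= eps * np.linalg.norm(G, 2)
```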



Reference

[1] "An introduction to matrix concentration inequalities," J. Tropp, Foundations and Trends in Machine Learning, 2015.

[2] "Convex trace functions and the Wigner-Yanase-Dyson conjecture," E. Lieb, Advances in Mathematics, 1973.

[3] "User-friendly tail bounds for sums of random matrices," J. Tropp, Foundations of Computational Mathematics, 2012.

[4] "Random features for large-scale kernel machines," A. Rahimi, B. Recht, NIPS, 2008.
