Lecture 14: Covariance estimation and matrix completion
May 27 - 29, 2020


1 Covariance estimation

Suppose we have a sample of data points $X_1, \dots, X_N$ in $\mathbb{R}^n$. It is often reasonable to assume that these points are independently sampled from the same probability distribution (or "population"), which is unknown. We would like to learn something useful about this distribution.

Denote by $X$ a random vector with this (unknown) distribution. The most basic parameter of the distribution is the mean $\mathbb{E}X$. One can estimate $\mathbb{E}X$ from the sample by computing the sample mean $\frac{1}{N}\sum_{i=1}^{N} X_i$. The law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity. In other words,

$$\frac{1}{N}\sum_{i=1}^{N} X_i \to \mathbb{E}X \quad \text{as } N \to \infty.$$
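As a quick numerical illustration of this convergence (a minimal sketch added here; the Gaussian distribution, the dimension, and the sample sizes are arbitrary choices, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                  # ambient dimension
mu = rng.normal(size=n)                 # the true (in practice unknown) mean EX

for N in [10, 100, 1_000, 10_000]:
    X = mu + rng.normal(size=(N, n))    # N independent samples X_1, ..., X_N
    sample_mean = X.mean(axis=0)        # (1/N) * sum_i X_i
    print(N, np.linalg.norm(sample_mean - mu))   # error decays roughly like 1/sqrt(N)
```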


The next most basic parameter of the distribution is the covariance matrix

$$\Sigma = \mathbb{E}(X - \mathbb{E}X)(X - \mathbb{E}X)^T.$$

The eigenvectors of the covariance matrix $\Sigma$ are called the principal components. Principal components that correspond to large eigenvalues of $\Sigma$ are the directions in which the distribution of $X$ is most extended; see the figure. This method is called Principal Component Analysis (PCA).


One can estimate the covariance matrix $\Sigma$ from the sample by computing the sample covariance

$$\Sigma_N = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mathbb{E}X_i)(X_i - \mathbb{E}X_i)^T.$$

Again, the law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity, i.e.

$$\Sigma_N \to \Sigma \quad \text{as } N \to \infty.$$

But how large should the sample size $N$ be for covariance estimation? Generally, one cannot have $N < n$ for dimension reasons. (Why?) We are going to show that

$$N \sim n \log n$$

is enough. In other words, covariance estimation is possible with just logarithmic oversampling.


For simplicity, we shall state the covariance estimation bound for mean zero distributions. (If the mean is not zero, we can estimate it from the sample and subtract. The mean can be accurately estimated from a sample of size $N = O(n)$.)

Theorem 1 (Covariance estimation)

Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose that

$$\|X\|_2^2 \lesssim \mathbb{E}\|X\|_2^2 = \operatorname{tr}\Sigma \quad \text{almost surely.}$$

Then, for every $N \ge 1$, we have

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n\log n}{N}} + \frac{n\log n}{N}\right).$$


Proof. Apply matrix Bernstein's inequality (Corollary 3) for the sum of independent random matrices $X_iX_i^T - \Sigma$ and get

$$\mathbb{E}\|\Sigma_N - \Sigma\| = \frac{1}{N}\,\mathbb{E}\left\|\sum_{i=1}^{N}\left(X_iX_i^T - \Sigma\right)\right\| \lesssim \frac{1}{N}\left(\sigma\sqrt{\log n} + K\log n\right),$$

where

$$\sigma^2 = \left\|\sum_{i=1}^{N}\mathbb{E}\left(X_iX_i^T - \Sigma\right)^2\right\| = N\left\|\mathbb{E}\left(XX^T - \Sigma\right)^2\right\|$$

and $K$ is chosen so that

$$\|XX^T - \Sigma\| \le K \quad \text{almost surely.}$$


It remains to bound $\sigma$ and $K$. Let us start with $\sigma$. We have

$$\mathbb{E}\left(XX^T - \Sigma\right)^2 = \mathbb{E}\|X\|_2^2\,XX^T - \Sigma^2 \preceq \operatorname{tr}(\Sigma)\cdot\mathbb{E}XX^T = \operatorname{tr}(\Sigma)\cdot\Sigma.$$

Thus $\sigma^2 \lesssim N\operatorname{tr}(\Sigma)\|\Sigma\|$. Next, to bound $K$, we have

$$\|XX^T - \Sigma\| \le \|X\|_2^2 + \|\Sigma\| \lesssim \operatorname{tr}(\Sigma) + \|\Sigma\| \le 2\operatorname{tr}(\Sigma) = K.$$

Therefore,

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\left(\sqrt{N\operatorname{tr}(\Sigma)\|\Sigma\|\log n} + \operatorname{tr}(\Sigma)\log n\right).$$

The proof is completed by using $\operatorname{tr}(\Sigma) \le n\|\Sigma\|$.
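As a rough numerical companion to Theorem 1 (a hedged sketch: the Gaussian model, the decaying spectrum of $\Sigma$, and the sample sizes below are our illustrative choices, and a Gaussian does not literally satisfy the almost-sure boundedness assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
Sigma = np.diag(1.0 / np.arange(1, n + 1))        # an illustrative true covariance

for N in [n, int(n * np.log(n)), 10 * n]:
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)   # mean-zero samples
    Sigma_N = (X.T @ X) / N                                    # sample covariance
    err = np.linalg.norm(Sigma_N - Sigma, 2)                   # operator-norm error
    print(f"N = {N:5d}   ||Sigma_N - Sigma|| = {err:.4f}")
```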


1.1 Low-dimensional distributions

Far fewer samples are needed for covariance estimation of low-dimensional, or approximately low-dimensional, distributions. To measure approximate low-dimensionality, we can use the notion of the stable rank of $\Sigma^{1/2}$. The stable rank of a matrix $A$ is defined as the square of the ratio of the Frobenius to operator norms:

$$r(A) = \frac{\|A\|_F^2}{\|A\|_2^2} \le \operatorname{rank}(A).$$

The proof of Theorem 1 yields

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{r\log n}{N}} + \frac{r\log n}{N}\right),$$

where $r = r(\Sigma^{1/2}) = \operatorname{tr}(\Sigma)/\|\Sigma\|$. Therefore, covariance estimation is possible with $N \sim r\log n$ samples.
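The stable rank is straightforward to compute; a small sketch (the covariance below, with a few large eigenvalues and many tiny ones, is an arbitrary example of an approximately low-dimensional distribution):

```python
import numpy as np

def stable_rank(A):
    """Stable rank r(A) = ||A||_F^2 / ||A||_2^2; it never exceeds rank(A)."""
    return np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(100, 100)))           # random orthogonal basis
eigs = np.concatenate([np.ones(5), 1e-3 * np.ones(95)])    # 5 large directions, 95 tiny ones
Sigma = U @ np.diag(eigs) @ U.T
Sigma_half = U @ np.diag(np.sqrt(eigs)) @ U.T

print(stable_rank(Sigma_half))                     # about 5.1
print(np.trace(Sigma) / np.linalg.norm(Sigma, 2))  # equals tr(Sigma)/||Sigma||, same value
```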


2 Norms of random matrices

Let $A_i$ denote the $i$-th row of $A$; we have (exercise)

$$\max_i \|A_i\|_2 \le \|A\|_2 \le \sqrt{n}\,\max_i \|A_i\|_2.$$

For random matrices with independent entries, the bound can be improved to the point where the upper and lower bounds almost match.

Theorem 2 (Norms of random matrices without boundedness assumptions)

Let $A$ be an $n\times n$ symmetric random matrix whose entries on and above the diagonal are independent, mean zero random variables. Then

$$\mathbb{E}\max_i \|A_i\|_2 \le \mathbb{E}\|A\|_2 \le C\log n\cdot\mathbb{E}\max_i \|A_i\|_2,$$

where $A_i$ denote the rows of $A$.
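A crude Monte Carlo check of the two-sided bound in Theorem 2 (a sketch with assumed standard Gaussian entries; the dimension and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 50
op_norms, max_rows = [], []

for _ in range(trials):
    G = rng.normal(size=(n, n))
    A = np.triu(G) + np.triu(G, 1).T          # symmetric; entries on/above the diagonal independent
    op_norms.append(np.linalg.norm(A, 2))
    max_rows.append(np.linalg.norm(A, axis=1).max())

print("E max_i ||A_i||_2  ~", np.mean(max_rows))
print("E ||A||            ~", np.mean(op_norms))
print("log(n) * E max_i   ~", np.log(n) * np.mean(max_rows))   # an upper bound up to the constant C
```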


Lemma 3 (Symmetrization)

Let $X_1,\dots,X_N$ be independent, mean zero random vectors in a normed space and $\varepsilon_1,\dots,\varepsilon_N$ be independent Rademacher random variables. Then

$$\frac{1}{2}\,\mathbb{E}\left\|\sum_{i=1}^{N}\varepsilon_i X_i\right\| \le \mathbb{E}\left\|\sum_{i=1}^{N} X_i\right\| \le 2\,\mathbb{E}\left\|\sum_{i=1}^{N}\varepsilon_i X_i\right\|.$$

Proof. To prove the upper bound, let $(X_i')$ be an independent copy of the random vectors $(X_i)$, i.e. just different random vectors with the same joint distribution as $(X_i)$ and independent from $(X_i)$. Then

$$\mathbb{E}\left\|\sum_i X_i\right\| = \mathbb{E}\left\|\sum_i X_i - \mathbb{E}\left(\sum_i X_i'\right)\right\| \le \mathbb{E}\left\|\sum_i X_i - \sum_i X_i'\right\| = \mathbb{E}\left\|\sum_i\left(X_i - X_i'\right)\right\|.$$


The distribution of the random vectors $Y_i = X_i - X_i'$ is symmetric, which means that the distributions of $Y_i$ and $-Y_i$ are the same. (Why?) Thus the distribution of the random vectors $Y_i$ and $\varepsilon_i Y_i$ is also the same, for all we do is change the signs of these vectors at random and independently of the values of the vectors. Summarizing, we can replace $X_i - X_i'$ in the sum above with $\varepsilon_i(X_i - X_i')$. Thus

$$\mathbb{E}\left\|\sum_i X_i\right\| \le \mathbb{E}\left\|\sum_i \varepsilon_i\left(X_i - X_i'\right)\right\| \le \mathbb{E}\left\|\sum_i \varepsilon_i X_i\right\| + \mathbb{E}\left\|\sum_i \varepsilon_i X_i'\right\| = 2\,\mathbb{E}\left\|\sum_i \varepsilon_i X_i\right\|.$$

This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!)


Proof of Theorem 2. The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.

We represent $A$ as a sum of independent, mean zero, symmetric random matrices $Z_{ij}$, each of which contains a pair of symmetric entries of $A$ (or one diagonal entry):

$$A = \sum_{i\le j} Z_{ij}.$$

By the symmetrization inequality (Lemma 3) for the random matrices $Z_{ij}$, we get

$$\mathbb{E}\|A\| = \mathbb{E}\left\|\sum_{i\le j} Z_{ij}\right\| \le 2\,\mathbb{E}\left\|\sum_{i\le j} X_{ij}\right\|,$$

where we set $X_{ij} = \varepsilon_{ij}Z_{ij}$ and $\varepsilon_{ij}$ are independent Rademacher random variables. Now we condition on $A$. The random variables $Z_{ij}$ become fixed values and all randomness remains in the Rademacher random variables $\varepsilon_{ij}$.


Note that $X_{ij}$ are (conditionally) bounded almost surely, and this is exactly what we have lacked to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives, conditionally on $A$,

$$\mathbb{E}_\varepsilon\left\|\sum_{i\le j} X_{ij}\right\| \lesssim \sigma\sqrt{\log n} + K\log n,$$

where $\sigma^2 = \big\|\sum_{i\le j}\mathbb{E}_\varepsilon X_{ij}^2\big\|$ and $K = \max_{i\le j}\|X_{ij}\|$. A good exercise is to check that

$$\sigma \lesssim \max_i\|A_i\|_2 \quad\text{and}\quad K \lesssim \max_i\|A_i\|_2.$$

Then we have

$$\mathbb{E}_\varepsilon\left\|\sum_{i\le j} X_{ij}\right\| \lesssim \log n\cdot\max_i\|A_i\|_2.$$

Finally, we unfix $A$ by taking expectation of both sides of this inequality with respect to $A$ and using the law of total expectation.


We state Theorem 2 for symmetric matrices, but it is simple to extend it to general $m\times n$ random matrices $A$. The bound in this case becomes

$$\mathbb{E}\|A\|_2 \le C\log(m+n)\cdot\left(\mathbb{E}\max_i\|A_i\|_2 + \mathbb{E}\max_j\|A^j\|_2\right),$$

where the maxima are over the rows $A_i$ and the columns $A^j$ of $A$. To see this, apply Theorem 2 to the $(m+n)\times(m+n)$ symmetric random matrix

$$\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}.$$
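The reduction to the symmetric case is easy to verify numerically: the operator norm of this dilation equals $\|A\|$. A minimal sketch with an arbitrary rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 50
A = rng.normal(size=(m, n))

# symmetric dilation [[0, A], [A^T, 0]]; its eigenvalues are +/- the singular values of A
D = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

print(np.linalg.norm(A, 2))   # the two printed norms coincide
print(np.linalg.norm(D, 2))
```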

3 Matrix completion

Consider a fixed, unknown $n\times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.


Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:

$$\operatorname{rank}(X) = r \ll n.$$

The number of degrees of freedom of an $n\times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that $m \sim rn$ observed entries of $X$ will be enough to determine $X$ completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$, while all unobserved entries are set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank $r$ approximation of $Y$. The result, as we will show, will be a good approximation to $X$.
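The best rank $r$ approximation used in this method can be computed with a truncated SVD (Eckart-Young). A minimal sketch of that single step (the helper name is ours, not from the notes):

```python
import numpy as np

def best_rank_r(Y, r):
    """Best rank-r approximation of Y (in operator and Frobenius norms), via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```

Applied to the zero-filled matrix $Y$ (or, as in Theorem 4 below, to $p^{-1}Y$), this produces the estimator analyzed next.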


But before we show this, let us define sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of the others with fixed probability $p$. Which entries are shown is decided by independent Bernoulli random variables

$$\delta_{ij} \sim \mathrm{Ber}(p) \quad\text{with } p = \frac{m}{n^2},$$

which are often called selectors in this context. The value of $p$ is chosen so that, among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$.

Define the $n\times n$ matrix $Y$ with entries $Y_{ij} = \delta_{ij}X_{ij}$. We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.


Theorem 4 (Matrix completion)

Let $\widehat{X}$ be a best rank $r$ approximation to $p^{-1}Y$. Then

$$\mathbb{E}\,\frac{1}{n}\big\|\widehat{X} - X\big\|_F \le C\log n\,\sqrt{\frac{rn}{m}}\,\|X\|_{\max}.$$

Here $\|X\|_{\max} = \max_{i,j}|X_{ij}|$ denotes the maximum magnitude of the entries of $X$.

Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size $m \gg rn\log^2 n$, which is slightly larger than the ideal size $m \sim rn$. This makes $C\log n\sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1)\|X\|_{\max}$. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$. This is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.
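An end-to-end sketch of the estimator and of the error scale described in the remark (the dimension, rank, and oversampling factor are illustrative choices; the low-rank $X$ is generated as a product of Gaussian factors):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 1_000, 2
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))    # unknown rank-r matrix

m = int(2 * r * n * np.log(n) ** 2)                       # on the order of r n log^2 n entries
p = m / n ** 2
delta = rng.random((n, n)) < p                            # Bernoulli(p) selectors delta_ij
Y = np.where(delta, X, 0.0)                               # observed entries, zeros elsewhere

U, s, Vt = np.linalg.svd(Y / p, full_matrices=False)      # best rank-r approximation of p^{-1} Y
X_hat = (U[:, :r] * s[:r]) @ Vt[:r, :]

print("average per-entry error:", np.linalg.norm(X_hat - X, 'fro') / n)
print("||X||_max              :", np.abs(X).max())
```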


Proof of Theorem 4.
Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:

$$\big\|\widehat{X} - X\big\| \le \big\|\widehat{X} - p^{-1}Y\big\| + \big\|p^{-1}Y - X\big\|.$$

Recall that $\widehat{X}$ is a best rank $r$ approximation to $p^{-1}Y$, while $X$ itself has rank at most $r$. Then the first part of the error is smaller than the second part, and we have

$$\big\|\widehat{X} - X\big\| \le 2\big\|p^{-1}Y - X\big\| = \frac{2}{p}\|Y - pX\|.$$

The entries of the matrix $Y - pX$,

$$(Y - pX)_{ij} = (\delta_{ij} - p)X_{ij},$$

are independent and mean zero random variables.


We have

$$\mathbb{E}\|Y - pX\| \le C\log n\cdot\left(\mathbb{E}\max_i\|(Y - pX)_i\|_2 + \mathbb{E}\max_j\|(Y - pX)^j\|_2\right).$$

All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:

$$\|(Y - pX)_i\|_2^2 = \sum_{j=1}^{n}(\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^{n}(\delta_{ij} - p)^2\cdot\|X\|_{\max}^2,$$

and similarly for columns. Taking expectation and noting that

$$\mathbb{E}(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1-p),$$

we get

$$\mathbb{E}\|(Y - pX)_i\|_2 \le \left(\mathbb{E}\|(Y - pX)_i\|_2^2\right)^{1/2} \le \sqrt{pn}\,\|X\|_{\max}.$$


This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (145) yields (check!)

$$\mathbb{P}\left\{\sum_{j=1}^{n}(\delta_{ij} - p)^2 > tpn\right\} \le \exp(-ctpn) \quad\text{for } t \ge 3.$$

This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n\log n$. A union bound over the $n$ rows leads to

$$\mathbb{P}\left\{\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij} - p)^2 > tpn\right\} \le n\cdot n^{-ct} \quad\text{for } t \ge 3.$$


Integrating this tail, we have

$$\mathbb{E}\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij} - p)^2 \lesssim pn,$$

and this yields the desired bound on the rows:

$$\mathbb{E}\max_{i\in[n]}\|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\,\|X\|_{\max}.$$

We can do similarly for the columns. Then

$$\mathbb{E}\|Y - pX\| \lesssim \log n\,\sqrt{pn}\,\|X\|_{\max}.$$

Therefore, we get

$$\mathbb{E}\big\|\widehat{X} - X\big\| \lesssim \log n\,\sqrt{\frac{n}{p}}\,\|X\|_{\max}.$$


Step 2: passing to the Frobenius norm. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\widehat{X}) \le r$ by construction, so $\operatorname{rank}(\widehat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:

$$\big\|\widehat{X} - X\big\|_F \le \sqrt{2r}\,\big\|\widehat{X} - X\big\|.$$

Taking expectation of both sides, we get

$$\mathbb{E}\big\|\widehat{X} - X\big\|_F \le \sqrt{2r}\,\mathbb{E}\big\|\widehat{X} - X\big\| \lesssim \log n\,\sqrt{\frac{rn}{p}}\,\|X\|_{\max}.$$

Dividing both sides by $n$, we can rewrite this bound as

$$\mathbb{E}\,\frac{1}{n}\big\|\widehat{X} - X\big\|_F \lesssim \log n\,\sqrt{\frac{rn}{pn^2}}\,\|X\|_{\max}.$$

The proof is completed by noting the definition of the sampling probability $p = m/n^2$.


Page 2: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

1 Covariance estimation

Suppose we have a sample of data points X1 XN in Rn It isoften reasonable to assume that these points are independentlysampled from the same probability distribution (or ldquopopulationrdquo)which is unknown We would like to learn something useful aboutthis distribution

Denote by X a random vector with this (unknown) distributionThe most basic parameter of the distribution is the mean EXOne can estimate EX from the sample by computing the samplemean

983123Ni=1XiN The law of large numbers guarantees that the

estimate becomes tight as the sample size N grows to infinity Inother words

1

N

N983131

i=1

Xi rarr EX as N rarr infin

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 2 22

The next most basic parameter of the distribution is thecovariance matrix

Σ = E(X minus EX)(X minus EX)T

The eigenvectors of the covariance matrix Σ are called theprincipal components Principal components that correspond tolarge eigenvalues of Σ are the directions in which the distributionof X is most extended see the figure

This method is called Principal Component Analysis (PCA)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 3 22

One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance

ΣN =1

N

N983131

i=1

(Xi minus EXi)(Xi minus EXi)T

Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie

ΣN rarr Σ as N rarr infin

But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that

N sim n log n

is enough In other words covariance estimation is possible withjust logarithmic oversampling

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22

For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))

Theorem 1 (Covariance estimation)

Let X be a random vector in Rn with covariance matrix Σ Supposethat

983042X98304222 ≲ E983042X98304222 = trΣ almost surely

Then for every N ge 1 we have

E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157

n log n

N+

n log n

N

983076

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22

Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX

Ti minus Σ and get

E 983042ΣN minus Σ983042 =1

NE

983056983056983056983056983056

N983131

i=1

983043XiX

Ti minus Σ

983044983056983056983056983056983056

≲ 1

N(σ983155

log n+K log n)

where

σ2 =

983056983056983056983056983056

N983131

i=1

E983043XiX

Ti minus Σ

9830442983056983056983056983056983056 = N

983056983056983056E983043XXT minus Σ

9830442983056983056983056

and K is chosen so that

983042XXT minus Σ983042 le K almost surely

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 3: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

The next most basic parameter of the distribution is thecovariance matrix

Σ = E(X minus EX)(X minus EX)T

The eigenvectors of the covariance matrix Σ are called theprincipal components Principal components that correspond tolarge eigenvalues of Σ are the directions in which the distributionof X is most extended see the figure

This method is called Principal Component Analysis (PCA)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 3 22

One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance

ΣN =1

N

N983131

i=1

(Xi minus EXi)(Xi minus EXi)T

Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie

ΣN rarr Σ as N rarr infin

But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that

N sim n log n

is enough In other words covariance estimation is possible withjust logarithmic oversampling

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22

For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))

Theorem 1 (Covariance estimation)

Let X be a random vector in Rn with covariance matrix Σ Supposethat

983042X98304222 ≲ E983042X98304222 = trΣ almost surely

Then for every N ge 1 we have

E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157

n log n

N+

n log n

N

983076

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22

Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX

Ti minus Σ and get

E 983042ΣN minus Σ983042 =1

NE

983056983056983056983056983056

N983131

i=1

983043XiX

Ti minus Σ

983044983056983056983056983056983056

≲ 1

N(σ983155

log n+K log n)

where

σ2 =

983056983056983056983056983056

N983131

i=1

E983043XiX

Ti minus Σ

9830442983056983056983056983056983056 = N

983056983056983056E983043XXT minus Σ

9830442983056983056983056

and K is chosen so that

983042XXT minus Σ983042 le K almost surely

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 4: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance

ΣN =1

N

N983131

i=1

(Xi minus EXi)(Xi minus EXi)T

Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie

ΣN rarr Σ as N rarr infin

But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that

N sim n log n

is enough In other words covariance estimation is possible withjust logarithmic oversampling

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22

For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))

Theorem 1 (Covariance estimation)

Let X be a random vector in Rn with covariance matrix Σ Supposethat

983042X98304222 ≲ E983042X98304222 = trΣ almost surely

Then for every N ge 1 we have

E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157

n log n

N+

n log n

N

983076

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22

Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX

Ti minus Σ and get

E 983042ΣN minus Σ983042 =1

NE

983056983056983056983056983056

N983131

i=1

983043XiX

Ti minus Σ

983044983056983056983056983056983056

≲ 1

N(σ983155

log n+K log n)

where

σ2 =

983056983056983056983056983056

N983131

i=1

E983043XiX

Ti minus Σ

9830442983056983056983056983056983056 = N

983056983056983056E983043XXT minus Σ

9830442983056983056983056

and K is chosen so that

983042XXT minus Σ983042 le K almost surely

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4.
Step 1: The error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
$$\|\widehat X - X\| \le \|\widehat X - p^{-1} Y\| + \|p^{-1} Y - X\|.$$
Recall that $\widehat X$ is a best rank-$r$ approximation to $p^{-1} Y$, while $X$ itself has rank at most $r$, so $\|\widehat X - p^{-1} Y\| \le \|X - p^{-1} Y\|$. Then the first part of the error is at most the second, and we have
$$\|\widehat X - X\| \le 2\,\|p^{-1} Y - X\| = \frac{2}{p}\,\|Y - pX\|.$$
The entries of the matrix $Y - pX$,
$$(Y - pX)_{ij} = (\delta_{ij} - p) X_{ij},$$
are independent, mean zero random variables.


We have
$$\mathbb{E}\,\|Y - pX\| \;\le\; C \log n \cdot \Big( \mathbb{E}\max_i \|(Y - pX)_{i:}\|_2 + \mathbb{E}\max_j \|(Y - pX)_{:j}\|_2 \Big).$$
All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:
$$\|(Y - pX)_{i:}\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \;\le\; \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_{\max}^2,$$
and similarly for the columns. Taking expectation and noting that
$$\mathbb{E}\,(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p),$$
we get
$$\mathbb{E}\,\|(Y - pX)_{i:}\|_2 \;\le\; \Big( \mathbb{E}\,\|(Y - pX)_{i:}\|_2^2 \Big)^{1/2} \;\le\; \Big( \|X\|_{\max}^2\, n\,p(1-p) \Big)^{1/2} \;\le\; \sqrt{pn}\,\|X\|_{\max}.$$


This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (14.5) yields (check!)
$$\mathbb{P}\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > t\,pn \Big\} \;\le\; \exp(-c\,t\,pn) \quad \text{for } t \ge 3.$$
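A small simulation (again our own sketch, assuming numpy) illustrates this concentration: the row sums $\sum_j (\delta_{ij} - p)^2$ stay of order $pn$, and the event $\{\,\cdot > 3pn\,\}$ is extremely rare.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 50_000
p = m / n**2                                  # here p = 0.2, so pn = 100
trials = 2_000                                # 2000 independent copies of one row
delta = rng.random((trials, n)) < p           # selectors delta_ij ~ Ber(p)
row_sums = ((delta - p) ** 2).sum(axis=1)     # sum_j (delta_ij - p)^2, one value per trial

print(row_sums.mean() / (p * n))              # about (1 - p) <= 1, i.e. mean = n p (1-p)
print((row_sums > 3 * p * n).mean())          # empirical P{ sum > 3pn }: essentially 0
```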

This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n \log n$. A union bound over the $n$ rows leads to
$$\mathbb{P}\Big\{ \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 > t\,pn \Big\} \;\le\; n \cdot n^{-ct} \quad \text{for } t \ge 3.$$


Integrating this tail, we have
$$\mathbb{E}\max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 \;\lesssim\; pn.$$
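One way to carry out this integration (a sketch; $c$ is the absolute constant from the union bound above, and if $c$ is small one starts the integration at a larger constant threshold than $3$):
$$
\mathbb{E}\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij}-p)^2
\;\le\; 3pn+\int_{3pn}^{\infty}\mathbb{P}\Big\{\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij}-p)^2>s\Big\}\,ds
\;\le\; 3pn+pn\int_{3}^{\infty}n\cdot n^{-ct}\,dt \;\lesssim\; pn,
$$
since the substitution $s = tpn$ turns the tail into the bound $n \cdot n^{-ct} = n^{1-ct}$ from the previous display, which decays geometrically in $t$ and makes the remaining integral $O(1)$.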

And this yields the desired bound on the rows:
$$\mathbb{E}\max_{i \in [n]} \|(Y - pX)_{i:}\|_2 \;\lesssim\; \sqrt{pn}\,\|X\|_{\max}.$$
We can argue similarly for the columns. Then
$$\mathbb{E}\,\|Y - pX\| \;\lesssim\; \log n \,\sqrt{pn}\,\|X\|_{\max}.$$
Therefore, we get
$$\mathbb{E}\,\|\widehat X - X\| \;\lesssim\; \log n \,\sqrt{\frac{n}{p}}\,\|X\|_{\max}.$$


Step 2: Passing to the Frobenius norm. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\widehat X) \le r$ by construction, so $\operatorname{rank}(\widehat X - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms: since $\widehat X - X$ has at most $2r$ nonzero singular values, each bounded by $\|\widehat X - X\|$, we have
$$\|\widehat X - X\|_F \;\le\; \sqrt{2r}\,\|\widehat X - X\|.$$
Taking expectation of both sides, we get
$$\mathbb{E}\,\|\widehat X - X\|_F \;\le\; \sqrt{2r}\;\mathbb{E}\,\|\widehat X - X\| \;\lesssim\; \log n\, \sqrt{\frac{rn}{p}}\,\|X\|_{\max}.$$
Dividing both sides by $n$, we can rewrite this bound as
$$\mathbb{E}\,\frac{1}{n}\,\|\widehat X - X\|_F \;\lesssim\; \log n\, \sqrt{\frac{rn}{pn^2}}\,\|X\|_{\max}.$$
The proof is completed by noting the definition of the sampling probability, $p = m/n^2$.

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 5: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))

Theorem 1 (Covariance estimation)

Let X be a random vector in Rn with covariance matrix Σ Supposethat

983042X98304222 ≲ E983042X98304222 = trΣ almost surely

Then for every N ge 1 we have

E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157

n log n

N+

n log n

N

983076

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22

Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX

Ti minus Σ and get

E 983042ΣN minus Σ983042 =1

NE

983056983056983056983056983056

N983131

i=1

983043XiX

Ti minus Σ

983044983056983056983056983056983056

≲ 1

N(σ983155

log n+K log n)

where

σ2 =

983056983056983056983056983056

N983131

i=1

E983043XiX

Ti minus Σ

9830442983056983056983056983056983056 = N

983056983056983056E983043XXT minus Σ

9830442983056983056983056

and K is chosen so that

983042XXT minus Σ983042 le K almost surely

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 6: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX

Ti minus Σ and get

E 983042ΣN minus Σ983042 =1

NE

983056983056983056983056983056

N983131

i=1

983043XiX

Ti minus Σ

983044983056983056983056983056983056

≲ 1

N(σ983155

log n+K log n)

where

σ2 =

983056983056983056983056983056

N983131

i=1

E983043XiX

Ti minus Σ

9830442983056983056983056983056983056 = N

983056983056983056E983043XXT minus Σ

9830442983056983056983056

and K is chosen so that

983042XXT minus Σ983042 le K almost surely

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 7: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

It remains to bound σ and K Let us start with σ We have

E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2

≾ tr(Σ) middot EXXT

= tr(Σ) middot Σ

Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have

983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K

Therefore

E 983042ΣN minus Σ983042 ≲ 1

N(983155

Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)

The proof is completed by using tr(Σ) le n983042Σ983042

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 8: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

11 Low-dimensional distributions

Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms

r(A) =983042A9830422F983042A98304222

le rank(A)

The proof of Theorem 1 yields

E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157

r log n

N+

r log n

N

983076

where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)

A =983131

ilejZij

By the symmetrization inequality (Lemma 3) for the random matricesZij we get

E983042A983042 = E983056983056983056983131

i983249jZij

983056983056983056 983249 2E983056983056983056983131

i983249jXij

983056983056983056

where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij

become fixed values and all randomness remains in the Rademacherrandom variables εij

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22

Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives

983056983056983056983131

ilejXij

983056983056983056 ≲ σ983155

log n+K log n

where σ2 = 983042983123

ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is

to check that

σ ≲ maxi

983042Ai9830422 and K ≲ maxi

983042Ai9830422

Then we have

983056983056983056983131

i983249jXij

983056983056983056 ≲ log n middotmaxi

983042Ai9830422

Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22

We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes

E983042A9830422 le C log(m+ n) middot (Emaxi

983042Ai9830422 + Emaxj

983042Aj9830422)

To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063

0 AAT 0

983064

3 Matrix completion

Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22

Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank

rank(X) = r ≪ n

The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how

Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22

But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables

δij sim Ber(p) with p =m

n2

which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m

Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 9: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

2 Norms of random matrices

Let Ai denote the ith row of A we have (exercise)

maxi

983042Ai9830422 le 983042A9830422 leradicnmax

i983042Ai9830422

For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch

Theorem 2 (Norms of random matrices without boundednessassumptions)

Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then

Emaxi

983042Ai9830422 le E983042A9830422 le C log n middot Emaxi

983042Ai9830422

where Ai denote the rows of A

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22

Lemma 3 (Symmetrization)

Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen

1

2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056 983249 E

983056983056983056983056983056

N983131

i=1

Xi

983056983056983056983056983056 983249 2E

983056983056983056983056983056

N983131

i=1

εiXi

983056983056983056983056983056

Proof To prove the upper bound let (X primei) be an independent copy of

the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

Xi minus E

983075983131

i

X primei

983076983056983056983056983056983056

983249 E

983056983056983056983056983056983131

i

Xi minus983131

i

X primei

983056983056983056983056983056 = E

983056983056983056983056983056983131

i

983043Xi minusX prime

i

983044983056983056983056983056983056

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22

The distribution of the random vectors Yi = Xi minusX primei is symmetric

which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime

i in the sum above with εi(Xi minusX primei) Thus

E

983056983056983056983056983056983131

i

Xi

983056983056983056983056983056 le E

983056983056983056983056983056983131

i

εi(Xi minusX primei)

983056983056983056983056983056

le E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056+ E

983056983056983056983056983056983131

i

εiXprimei

983056983056983056983056983056

= 2E

983056983056983056983056983056983131

i

εiXi

983056983056983056983056983056

This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22

Proof of Theorem 2. The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.
We represent A as a sum of independent, mean zero, symmetric random matrices Z_ij, each of which contains a pair of symmetric entries of A (or one diagonal entry):
\[
A = \sum_{i \le j} Z_{ij}.
\]
(Explicitly, one can take Z_{ij} = A_{ij}(e_i e_j^T + e_j e_i^T) for i < j and Z_{ii} = A_{ii} e_i e_i^T, where e_1, ..., e_n are the standard basis vectors.) By the symmetrization inequality (Lemma 3) for the random matrices Z_ij, we get
\[
E\|A\| = E\Big\| \sum_{i \le j} Z_{ij} \Big\| \;\le\; 2\, E\Big\| \sum_{i \le j} X_{ij} \Big\|,
\]
where we set X_ij = ε_ij Z_ij and the ε_ij are independent Rademacher random variables. Now we condition on A. The random variables Z_ij become fixed values, and all randomness remains in the Rademacher random variables ε_ij.

Note that the X_ij are (conditionally) bounded almost surely, and this is exactly what we have lacked in order to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives
\[
E_\varepsilon \Big\| \sum_{i \le j} X_{ij} \Big\| \;\lesssim\; \sigma \sqrt{\log n} + K \log n,
\]
where \sigma^2 = \big\| \sum_{i \le j} E_\varepsilon X_{ij}^2 \big\| and K = \max_{i \le j} \|X_{ij}\|. A good exercise is to check that
\[
\sigma \lesssim \max_i \|A_i\|_2 \quad \text{and} \quad K \lesssim \max_i \|A_i\|_2.
\]
Then we have
\[
E_\varepsilon \Big\| \sum_{i \le j} X_{ij} \Big\| \;\lesssim\; \log n \cdot \max_i \|A_i\|_2.
\]
Finally, we unfix A by taking the expectation of both sides of this inequality with respect to A and using the law of total expectation.

We stated Theorem 2 for symmetric matrices, but it is simple to extend it to general m × n random matrices A. The bound in this case becomes
\[
E\|A\| \;\le\; C \log(m+n) \cdot \Big( E \max_i \|A_i\|_2 + E \max_j \|A^j\|_2 \Big),
\]
where the A_i and A^j denote the rows and columns of A, respectively. To see this, apply Theorem 2 to the (m+n) × (m+n) symmetric random matrix
\[
\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}.
\]
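A quick numerical illustration (my addition; numpy assumed) of why this dilation trick works: the symmetric dilation has the same operator norm as A, since its eigenvalues are plus/minus the singular values of A, so a bound for symmetric matrices transfers to rectangular ones.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 120, 80
A = rng.standard_normal((m, n))

# Hermitian dilation: a symmetric (m+n) x (m+n) matrix built from A
D = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

# The two operator norms coincide
print(np.linalg.norm(D, ord=2), np.linalg.norm(A, ord=2))
```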

3 Matrix completion

Consider a fixed, unknown n × n matrix X. Suppose we are shown m randomly chosen entries of X. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.

Obviously, there is no way to guess the missing entries unless we know something extra about the matrix X. So let us assume that X has low rank:
\[
\operatorname{rank}(X) = r \ll n.
\]
The number of degrees of freedom of an n × n matrix with rank r is O(rn). (Why?) So we may hope that m ∼ rn observed entries of X will be enough to determine X completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix Y that consists of the observed entries of X, while all unobserved entries are set to zero. Unlike X, the matrix Y may not have small rank. Compute the best rank r approximation of Y. The result, as we will show, will be a good approximation to X.
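The computational core of this method, the best rank r approximation, is obtained by truncating the singular value decomposition (Eckart-Young). A minimal sketch, assuming numpy; the helper name `truncate_rank` is mine, not from the lecture:

```python
import numpy as np

def truncate_rank(M, r):
    """Best rank-r approximation of M (Eckart-Young), via the truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```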

But before we show this, let us define the sampling of entries more rigorously. Assume each entry of X is shown or hidden independently of the others with fixed probability p. Which entries are shown is decided by independent Bernoulli random variables
\[
\delta_{ij} \sim \mathrm{Ber}(p) \quad \text{with} \quad p = \frac{m}{n^2},
\]
which are often called selectors in this context. The value of p is chosen so that, among the n² entries of X, the expected number of selected (known) entries is m.

Define the n × n matrix Y with entries Y_ij = δ_ij X_ij. We can assume that we are shown Y, for it is a matrix that contains the observed entries of X while all unobserved entries are replaced with zeros. The following result shows how to estimate X based on Y.

Theorem 4 (Matrix completion)

Let X̂ be a best rank r approximation to p⁻¹Y. Then
\[
E\, \frac{1}{n}\, \|\widehat{X} - X\|_F \;\le\; C \log n\, \sqrt{\frac{rn}{m}}\; \|X\|_{\max}.
\]
Here ‖X‖_max = max_{i,j} |X_ij| denotes the maximum magnitude of the entries of X.

Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size m ≫ rn log² n, which is slightly larger than the ideal size m ∼ rn. This makes C log n √(rn/m) = o(1) and forces the recovery error to be bounded by o(1)·‖X‖_max. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of X. This is true for a sample of almost optimal size m. The smaller the rank r of the matrix X, the fewer entries of X we need to see in order to do matrix completion.
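A small simulation (my construction, following the statement of Theorem 4 and reusing the truncated-SVD idea above; numpy assumed) that applies the method to a random low rank matrix and compares the per-entry error with ‖X‖_max:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 300, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r ground truth

p = 0.2                              # sampling probability, so m = p * n**2 observed entries
delta = rng.random((n, n)) < p       # Bernoulli selectors
Y = delta * X                        # observed entries, zeros elsewhere

U, s, Vt = np.linalg.svd(Y / p, full_matrices=False)
X_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # best rank-r approximation of p^{-1} Y

per_entry_err = np.linalg.norm(X_hat - X, 'fro') / n
print(per_entry_err, np.abs(X).max())   # Theorem 4 predicts the first number is much smaller
```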

Proof of Theorem 4.
Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
\[
\|\widehat{X} - X\| \;\le\; \|\widehat{X} - p^{-1}Y\| + \|p^{-1}Y - X\|.
\]
Recall that X̂ is a best rank r approximation to p⁻¹Y. Since X itself has rank at most r, the first part of the error is smaller than the second part, and we have
\[
\|\widehat{X} - X\| \;\le\; 2\, \|p^{-1}Y - X\| = \frac{2}{p}\, \|Y - pX\|.
\]
The entries of the matrix Y − pX,
\[
(Y - pX)_{ij} = (\delta_{ij} - p)\, X_{ij},
\]
are independent, mean zero random variables.

We have
\[
E\|Y - pX\| \;\le\; C \log n \cdot \Big( E \max_i \|(Y - pX)_i\|_2 + E \max_j \|(Y - pX)^j\|_2 \Big).
\]
All that remains is to bound the norms of the rows and columns of Y − pX. This is not difficult if we note that they can be expressed as sums of independent random variables:
\[
\|(Y - pX)_i\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \;\le\; \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_{\max}^2,
\]
and similarly for the columns. Taking expectation and noting that
\[
E(\delta_{ij} - p)^2 = \mathrm{Var}(\delta_{ij}) = p(1 - p),
\]
we get
\[
E\|(Y - pX)_i\|_2 \;\le\; \Big( E\|(Y - pX)_i\|_2^2 \Big)^{1/2} \;\le\; \sqrt{pn}\, \|X\|_{\max}.
\]

This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality yields (check!)
\[
P\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \;\le\; \exp(-ctpn) \quad \text{for } t \ge 3.
\]
This probability can be further bounded by n^{-ct} using the assumption that m = pn² ≥ n log n. A union bound over the n rows leads to
\[
P\Big\{ \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \;\le\; n \cdot n^{-ct} \quad \text{for } t \ge 3.
\]

Integrating this tail, we have
\[
E \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 \;\lesssim\; pn.
\]
And this yields the desired bound on the rows:
\[
E \max_{i \in [n]} \|(Y - pX)_i\|_2 \;\lesssim\; \sqrt{pn}\, \|X\|_{\max}.
\]
We can do similarly for the columns. Then
\[
E\|Y - pX\| \;\lesssim\; \log n \sqrt{pn}\, \|X\|_{\max}.
\]
Therefore we get
\[
E\|\widehat{X} - X\| \;\lesssim\; \log n \sqrt{\frac{n}{p}}\, \|X\|_{\max}.
\]

Step 2: passing to the Frobenius norm. We know that rank(X) ≤ r by assumption and rank(X̂) ≤ r by construction, so rank(X̂ − X) ≤ 2r. There is a simple relationship between the operator and Frobenius norms: a matrix of rank at most 2r has at most 2r nonzero singular values, hence
\[
\|\widehat{X} - X\|_F \;\le\; \sqrt{2r}\, \|\widehat{X} - X\|.
\]
Taking expectation of both sides, we get
\[
E\|\widehat{X} - X\|_F \;\le\; \sqrt{2r}\, E\|\widehat{X} - X\| \;\lesssim\; \log n \sqrt{\frac{rn}{p}}\, \|X\|_{\max}.
\]
Dividing both sides by n, we can rewrite this bound as
\[
E\, \frac{1}{n}\, \|\widehat{X} - X\|_F \;\lesssim\; \log n \sqrt{\frac{rn}{pn^2}}\, \|X\|_{\max}.
\]
The proof is completed by recalling the definition of the sampling probability, p = m/n².


E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 17: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

Theorem 4 (Matrix completion)

Let 983142X be a best rank r approximation to pminus1Y Then

E1

n983042983142X minusX983042F 983249 C logn

983157rn

m983042X983042max

Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X

Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n

983155rnm = o(1) and forces the

recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 18: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality

983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042

Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have

983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2

p983042Y minus pX983042

The entries of the matrix Y minus pX

(Y minus pX)ij = (δij minus p)Xij

are independent and mean zero random variables

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 19: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

We have

E983042Y minus pX983042

983249 C log n middot983061Emax

i983042(Y minus pX)i9830422 + Emax

j983042(Y minus pX)j9830422

983062

All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables

983042(Y minus pX)i98304222 =n983131

j=1

(δij minus p)2X2ij 983249

n983131

j=1

(δij minus p)2 middot 983042X9830422max

and similarly for columns Taking expectation and noting that

E (δij minus p)2 = Var (δij) = p(1minus p)

we get

E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222

98306012983249 radic

pn983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 20: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)

P

983099983103

983101

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 exp(minusctpn) for t 983245 3

This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to

P

983099983103

983101maxiisin[n]

n983131

j=1

(δij minus p)2 gt tpn

983100983104

983102 983249 n middot nminusct for t 983245 3

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 21: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

Integrating this tail we have

Emaxiisin[n]

n983131

j=1

(δij minus p)2 ≲ pn

And this yields the desired bound on the rows

Emaxiisin[n]

983042(Y minus pX)i9830422 ≲radicpn983042X983042max

We can do similarly for the columns Then

E983042Y minus pX983042 ≲ log nradicpn983042X983042max

Therefore we get

E983042983142X minusX983042 ≲ log n

983157n

p983042X983042max

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22

Page 22: Lecture 14: Covariance estimation and matrix completionmath.xmu.edu.cn › group › nona › damc › Lecture14.pdf · Covariance estimation and MC DAMC Lecture 14 May 27 - 29, 2020

Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by

construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms

983042983142X minusX983042F leradic2r983042983142X minusX983042

Taking expectation of both sides we get

E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n

983157rn

p983042X983042max

Dividing both sides by n we can rewrite this bound as

E1

n983042983142X minusX983042F ≲ log n

983157rn

pn2983042X983042max

The proof is completed by noting the definition of the samplingprobability p = mn2

Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22