Random matrix theory: From mathematical physics to high dimensional statistics and time series analysis
by
Xiucai Ding
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Statistical Sciences
University of Toronto
© Copyright 2018 by Xiucai Ding
Abstract
Random matrix theory: From mathematical physics to high dimensional statistics and
time series analysis
Xiucai Ding
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2018
Random matrix theory serves as one of the key tools in understanding the eigen-structure of large dimensional matrices. Its applications range from the estimation and inference of high dimensional covariance matrices and the noise reduction of rectangular matrices to the understanding of separable matrices and even matrices with correlations in both rows and columns. Assuming that we observe a p by n data matrix, where log p is comparable to log n, we derive the convergent limits and distributions of the eigenvalues and eigenvectors for a few random matrix models related to the above problems in high dimensional statistics. This part is based on joint papers with Zhigang Bao (HKUST), Fan Yang (UCLA) and Ke Wang (HKUST), in which we employ the dynamic approach developed by Laszlo Erdos and Horng-Tzer Yau [51].

Non-stationary time series analysis is important in understanding the temporal correlation of data. Assuming that only one time series is observed, we develop a methodology to estimate the underlying high dimensional covariance and precision matrices. Based on our methodology, we can infer the covariance and precision matrices using a bootstrapping strategy. This part is based on two joint papers with Professor Zhou Zhou (UofT). It is notable that we apply Stein's method to prove the Gaussian approximation, which is essentially the same as the Green function comparison strategy for proving the universality of random matrix models.
Acknowledgements
This dissertation is dedicated to my son Kyrie Ding, my wife Xin Zhang, my father Shenyue Ding and my mother Yunzhen Ma. I draw my motivation from their love and support.
This dissertation is also dedicated to all of my teachers in my PhD study, especially
to my advisor Professor Jeremy Quastel. From him I have learned not only mathematics but also how to approach questions in the right way.

I would like to thank all of my collaborators; they are brilliant, and I have enjoyed learning mathematics and statistics from them. They are (in alphabetical order of last name): Zhigang Bao (HKUST), Dehan Kong (UofT), Weihao Kong (Stanford), Jeremy Quastel (UofT), Qiang Sun (UofT), Gregory Valiant (Stanford), Ke Wang (HKUST), Hautieng Wu (Duke), Fan Yang (UCLA), and Zhou Zhou (UofT).
Finally, I want to thank all my friends in Toronto who have enjoyed research and life with me. They are (in alphabetical order of last name): Philippe Casgrain, Jinlong Fu, Luhui Gan, Boris Garbuzov, Tianyi Jia, Zhenhua Lin, Peng Liu, Qixuan Ma, Chongda Wang, Shuai Yang and Xingshuo Zhai.
Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
1.1 Introduction to random matrix theory . . . . . . . . . . . . . . . . . . . 2
1.2 Two approaches for analyzing random matrices . . . . . . . . . . . . . . 13
1.3 Applications in statistics and mathematical physics . . . . . . . . . . . . 17
1.4 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Random matrices in high dimensional statistics 26
2.1 Universality of sample covariance matrices . . . . . . . . . . . . . . . . . 26
2.1.1 Edge universality of sample covariance matrices . . . . . . . . . . 26
2.1.2 Universality of singular vector distribution . . . . . . . . . . . . . 68
2.2 Eigen-structure of the model of matrix denoising . . . . . . . . . . . . . . 114
2.3 Eigen-structure of sample covariance matrix of general form . . . . . . . 150
3 Random matrices in non-stationary time series analysis 197
3.1 Locally stationary time series and physical dependence measure . . . . . 197
3.2 Estimation of covariance and precision matrices . . . . . . . . . . . . . . 200
3.3 Inference of covariance and precision matrices . . . . . . . . . . . . . . . 212
Bibliography 229
List of Tables
1.1 Orthogonal polynomials (OP) and random matrix model (RMM) . . . . 13
2.1 Comparison of different algorithms . . . . . . . . . . . . . . . . . . . . . 123
2.2 Loss functions and their optimal shrinkers . . . . . . . . . . . . . . . . . 194
3.1 Operator norm error for estimation of covariance matrices . . . . . . . . 211
3.2 Operator norm error for estimation of precision matrices. . . . . . . . . . 212
3.3 Simulated type I error rates under $H_0^1$ . . . . . . . . . . . . . . . . . . 226
3.4 Simulated type I error rates under $H_0^2$ for $k_0 = 2$ . . . . . . . . . . . . 226
List of Figures
1.1 An example of general sample covariance matrices . . . . . . . . . . . . . 24
2.1 Rotation invariant estimator . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.2 Rotation invariant estimator vs. TSVD . . . . . . . . . . . . . . . . . . 125
2.3 Estimation loss using factor model . . . . . . . . . . . . . . . . . . . . . 170
2.4 Spectrum of the examples . . . . . . . . . . . . . . . . . . . . . . . . . . 193
2.5 Optimal shrinkers under different loss functions . . . . . . . . . . . . . . 194
2.6 Estimation of oracle estimator . . . . . . . . . . . . . . . . . . . . . . . . 195
2.7 Estimation error using POET with extra information . . . . . . . . . . . 196
3.1 White noise test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.2 Bandedness test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.3 Bandedness test for different levels . . . . . . . . . . . . . . . . . . . . . 228
Chapter 1
Introduction
Random matrices first appeared in multivariate statistics in 1928 [108], when Wishart formulated the Wishart distribution on matrix-valued random variables (i.e., Wishart matrices) to study the estimation of covariance matrices. However, the subject did not attract much attention at that time. In the 1950s, Wigner used a simple random matrix model (see Definition 1.1.1) to study the statistical behaviour of slow neutron resonances in nuclear physics [106, 107]. Since then, many random matrix models have been employed in quantum mechanics; for a detailed review, we refer to the book [85] by Mehta.
In physics, random matrices are used to describe the limiting behavior of the eigenvalues of a Hamiltonian operator. In multivariate statistics, however, the typical objects are sample covariance matrices. In this direction, Marcenko and Pastur [83] studied the random matrix model of the form of sample covariance matrices (see Definition 1.1.2). Since then, statisticians have used random matrix models to study estimation and inference problems in multivariate and high dimensional statistics. For a comprehensive review, we refer to the monograph [4] by Bai and Silverstein and the book [113] by Yao, Zheng and Bai.
In this thesis, we focus on the theory and applications of random matrix models in statistics and mathematical physics. We remark, however, that random matrix theory has also been successfully applied to many other areas of mathematics, for instance combinatorics, knot theory and number theory (the Riemann zeta function). These are beyond the scope of this thesis; we refer the readers to the lecture notes [55] by Eynard, Kimura and Ribault.
1.1 Introduction to random matrix theory
In this section, we introduce the well-known random matrix models and the associated
statistical properties of their eigenvalues and eigenvectors.
Definition 1.1.1 (Wigner matrices, Definition 2.2 of [11]). A Wigner matrix is an $n \times n$ Hermitian matrix whose entries $H_{ij}$ satisfy the following conditions: (i). The upper-triangular entries $(H_{ij},\, 1 \leq i \leq j \leq n)$ are independent; (ii). For all $i, j$, we have $\mathbb{E}H_{ij} = 0$ and $\mathbb{E}|H_{ij}|^2 = n^{-1}(1 + O(\delta_{ij}))$; (iii). The random variables $\sqrt{n}H_{ij}$ are bounded in any $L^p$ space, uniformly in $n, i, j$.
Definition 1.1.2 (Sample covariance matrices, Section 1.3 of [17]). For a $p \times p$ positive definite matrix $\Sigma$ and a $p \times n$ rectangular matrix $X$ whose entries $X_{ij}$ are i.i.d. random variables such that $\mathbb{E}X_{ij} = 0$, $\mathbb{E}|X_{ij}|^2 = \frac{1}{n}$, and $\sqrt{n}X_{ij}$ are bounded in any $L^p$ space uniformly in $n, i, j$, we call $\Sigma^{1/2}XX^*\Sigma^{1/2}$ a sample covariance matrix.
Definition 1.1.3 (Addition of random matrix and deterministic matrix, Section 1 of [37]). Consider a $p \times n$ random matrix $X$ satisfying the conditions of Definition 1.1.2 and a $p \times n$ deterministic matrix $S$. We call $\mathcal{S} = S + X$ an addition of random matrix and deterministic matrix.
Definition 1.1.4 (Separable sample covariance matrices, Section 1 of [88]). For a $p \times p$ positive definite matrix $\Sigma_a$ and an $n \times n$ positive definite matrix $\Sigma_b$, consider a random $p \times n$ matrix $X$ satisfying the conditions of Definition 1.1.2. We call $\Sigma_a^{1/2} X \Sigma_b X^* \Sigma_a^{1/2}$ a separable sample covariance matrix.
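For concreteness, the four models above can be sampled in a few lines of numpy. The following sketch is our own illustration (the dimensions, the diagonal choices of $\Sigma$, $\Sigma_a$, $\Sigma_b$, and the rank-one signal $S$ are arbitrary), using the normalization $\mathbb{E}|X_{ij}|^2 = 1/n$ from Definition 1.1.2:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 200

# Wigner matrix (Definition 1.1.1): real symmetric, E|H_ij|^2 of order 1/n
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)

# sample covariance matrix (Definition 1.1.2): Sigma^{1/2} X X^* Sigma^{1/2}
X = rng.standard_normal((p, n)) / np.sqrt(n)             # E X_ij = 0, E|X_ij|^2 = 1/n
Sigma_half = np.diag(np.sqrt(np.linspace(1.0, 2.0, p)))  # Sigma^{1/2}, diagonal here
Q = Sigma_half @ X @ X.T @ Sigma_half

# addition of random and deterministic matrix (Definition 1.1.3): S + X, S low rank
S = np.zeros((p, n))
S[0, 0] = 5.0
Y = S + X

# separable sample covariance matrix (Definition 1.1.4):
# Sigma_a^{1/2} X Sigma_b X^* Sigma_a^{1/2}
Sigma_b = np.diag(np.linspace(0.5, 1.5, n))
Q_sep = Sigma_half @ X @ Sigma_b @ X.T @ Sigma_half

print(H.shape, Q.shape, Y.shape, Q_sep.shape)
```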
Remark 1.1.5. It is notable that researchers have recently been studying random matrices with correlated entries for square matrices using matrix self-consistent equations, for instance [1, 2, 3, 30]. It is expected that these new techniques can also be applied to rectangular matrices. We will pursue this direction in the future.
In Definitions 1.1.1 and 1.1.2, when the entries of the matrices are Gaussian random variables, we can understand them from the perspective of ensembles. We first introduce the one-matrix model, which covers the Gaussian ensembles of Wigner matrices and sample covariance matrices, and then extend it to the multi-matrix model.
Definition 1.1.6 (One-matrix model, Section 1 of [54]). Consider the following probability law on a space $\mathcal{M}$ whose points $M$ are matrices:
$$\mathbb{P}(M) = \frac{1}{Z} e^{-\operatorname{Tr} V(M)},$$
where $V(x)$ is the potential function and $Z$ is a normalization constant. We usually consider the following two types of ensembles: (1). Hermite ensemble: $\mathcal{M}$ is the class of $n \times n$ Hermitian matrices and $V(x)$ is some polynomial function (e.g. $V(x) = \frac{1}{2}x^2$ in the Gaussian case); (2). Laguerre ensemble: $\mathcal{M}$ is the class of positive definite matrices of the form $\Sigma^{1/2}XX^*\Sigma^{1/2}$ and $V(x)$ is some function (e.g. $V(x) = x - p\log x$ in the Gaussian case).
Definition 1.1.7 (Multi-matrix model, Section 1 of [56]). For the multi-matrix model, we consider a chain of $m$ Hermitian $n \times n$ matrices $H_1, \ldots, H_m$ with probability density proportional to
$$\exp\left[-\operatorname{Tr}\left(\tfrac{1}{2}V_1(H_1) + V_2(H_2) + \cdots + V_{m-1}(H_{m-1}) + \tfrac{1}{2}V_m(H_m)\right) + \operatorname{Tr}\left(c_1 H_1 H_2 + \cdots + c_{m-1} H_{m-1} H_m\right)\right],$$
where the $V_j(x)$ are real polynomials of even degree and the $c_j$ are real constants.
Remark 1.1.8. (i). For the one-matrix model, we can define different ensembles by choosing different matrix spaces and probability laws. A third important class is the Jacobi ensemble [68]. (ii). For the multi-matrix model, we only discuss the Hermite ensemble. A special type of Laguerre ensemble was derived by Tracy and Widom in [105].
To clarify the presentation of the results on random matrices, we first introduce some useful notation. For any $n \times n$ matrix $H$, we denote its empirical spectral distribution (ESD) by
$$F^H(x) \equiv F^H_n(x) := \frac{1}{n}\sum_{j=1}^{n} \delta_{\lambda_j}(x),$$
where $\lambda_j := \lambda_j(H)$ are the eigenvalues of $H$ in decreasing order. The Stieltjes transform of $F^H(x)$ is defined as
$$m_n(z) := \int \frac{1}{x - z}\, dF^H(x) = \frac{1}{n}\operatorname{Tr}(H - z)^{-1}, \qquad z \in \mathbb{C}^+.$$
For a given domain $\mathbf{S} \subset \mathbb{C}^+$, we define the Green function of $H$ on $\mathbf{S}$ as
$$G(z) \equiv G_H(z) := (H - z)^{-1}, \qquad z \in \mathbf{S}.$$
Therefore, $m_n(z) = \frac{1}{n}\operatorname{Tr} G(z)$. It is notable that the choice of $\mathbf{S}$ is also crucial to our local analysis. We are mainly interested in the following two questions: (i). Does $F^H(x)$ converge to some nonrandom limit? Such a limit is called the limiting spectral distribution (LSD). (ii). If it does, what is the convergence rate? The answers to (i) are called global laws and those to (ii) are called local laws. The answers to (ii) have many consequences, for instance eigenvalue gaps, the rigidity of eigenvalues, bulk and edge universality, and the delocalization of eigenvectors.
(i). Global laws. It is well known that establishing the convergence of the ESDs of a sequence of matrices is equivalent to showing the convergence of their Stieltjes transforms; the LSD can then be recovered using the inversion formula (see Appendix B.2 of [4]). For the above random matrix models, the Stieltjes transforms of the global laws satisfy their associated self-consistent equations.
For Wigner matrices, denote by $m_{sc}$ the limit of $m_n$; then $m_{sc}$ is the unique solution of the following equation (see equation (1.29) of [47]):
$$m_{sc}(z) + \frac{1}{m_{sc}(z) + z} = 0,$$
and the associated global law is called the semicircular law, which dates back to the work of Wigner [107].
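The equation above is a quadratic in $m_{sc}$ and can be solved in closed form, $m_{sc}(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, with the branch chosen so that $\operatorname{Im} m_{sc}(z) > 0$ for $z \in \mathbb{C}^+$. A minimal numerical sketch (the matrix size and the spectral parameter are illustrative choices of our own) compares this with the empirical Stieltjes transform of a Wigner matrix:

```python
import numpy as np

def m_sc(z):
    """Solve m^2 + z*m + 1 = 0, taking the root with Im m > 0 (for z in C^+)."""
    r = np.sqrt(z * z - 4 + 0j)
    m = (-z + r) / 2
    return m if m.imag > 0 else (-z - r) / 2

rng = np.random.default_rng(1)
n = 2000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)                    # Wigner matrix, spectrum in [-2, 2]

z = 0.3 + 0.5j                                    # macroscopic spectral parameter
m_n = np.mean(1.0 / (np.linalg.eigvalsh(H) - z))  # m_n(z) = (1/n) Tr (H - z)^{-1}
print(abs(m_n - m_sc(z)))                         # small: the global law holds
```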
For sample covariance matrices, assuming that $n/p \to c \in (0, \infty)$ and that the ESD of $\Sigma$ converges to some nonrandom limit $\pi$, denote by $m_{mp}$ the limit of $m_n$; then it satisfies the following equation (see equation (1.3) of [98]):
$$z = -\frac{1}{m_{mp}} + c^{-1}\int \frac{\lambda\, d\pi(\lambda)}{1 + \lambda m_{mp}},$$
and the associated global law is called the deformed Marcenko-Pastur law [80]. When $\Sigma = I$, this coincides with the Marcenko-Pastur law [83].
For the addition of random matrix and deterministic matrix, assuming that the ESD of $\frac{1}{p}SS^*$ converges to some nonrandom distribution $\pi$, denote by $m_{sn}$ the limit of the Stieltjes transforms; then it satisfies [46]
$$m_{sn}(z) = \int \frac{d\pi(t)}{\frac{t}{1 + p^{-1}m_{sn}(z)} - \big(1 + p^{-1}m_{sn}(z)\big)z + n^{-1}(1 - c)}.$$
However, in this thesis, for the purpose of applications, we will assume that $S$ has a low rank structure [37], and we will simply consider the singular value decomposition of $S$ without normalizing by $p^{-1}$. In this situation, by Cauchy's interlacing property, the global law of $(S + X)(S + X)^*$ is the Marcenko-Pastur law.
For separable sample covariance matrices, denote by $m_{se}$ the limit of $m_n$; then $m_{se}$ is determined by the unique solution of the following system of equations (see equation (4.1.3) of [116] or equations (1) and (2) of [88]):
$$m_{se}(z) = \int \frac{d\pi_A(a)}{a \int \frac{b\, d\pi_B(b)}{1 + c\, b\, e(z)} - z}, \qquad e(z) = \int \frac{a\, d\pi_A(a)}{a \int \frac{b\, d\pi_B(b)}{1 + c\, b\, e(z)} - z},$$
where $\pi_A$ and $\pi_B$ are the LSDs of $\Sigma_a$ and $\Sigma_b$, respectively. Note that if $\Sigma_b = I$, this reduces to the deformed Marcenko-Pastur law. For recent results on random matrices with correlated entries, we refer to the lecture notes [48].

Finally, it is remarkable that the global laws of general deformed random matrices can also be understood using free probability theory, where they can be written in terms of subordination functions. For a comprehensive review, we refer to Section 3 of [26].
(ii). Local laws. Local laws measure how close the ESD and the LSD are when the spectral domain is restricted to a region containing only a few eigenvalues. We will summarize the isotropic local law for Wigner matrices and the anisotropic local law for sample covariance matrices satisfying some regularity conditions on $\Sigma$. For sample covariance matrices under a different condition on $\Sigma$, the local law is derived in [7]. The local law for random matrices with fast decaying correlations can be found in [48].

Finally, we remark that the local laws for separable sample covariance matrices and correlated sample covariance matrices are still missing at this point.
Theorem 1.1.9 (Isotropic local semicircle law, Theorems 2.12 and 2.15 of [16] and Theorems 2.2 and 2.3 of [72]). (1). For a small $\omega \in (0, 1)$, define
$$\mathbf{S}_b \equiv \mathbf{S}_b(\omega, n) := \{z = E + i\eta \in \mathbb{C}^+ : |E| \leq \omega^{-1},\ n^{-1+\omega} \leq \eta \leq \omega^{-1}\}. \qquad (1.1.1)$$
Then for the Wigner matrices defined in Definition 1.1.1, for some small $\epsilon > 0$ and large $D > 0$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, G(z)\mathbf{w}\rangle - \langle \mathbf{v}, \mathbf{w}\rangle m_{sc}(z)\big| \leq n^{\epsilon}\left(\sqrt{\frac{\operatorname{Im} m_{sc}(z)}{n\eta}} + \frac{1}{n\eta}\right), \qquad z \in \mathbf{S}_b,$$
where $\mathbf{v}, \mathbf{w}$ are unit vectors in $\mathbb{C}^n$. When $\mathbf{v}, \mathbf{w}$ are standard basis vectors in $\mathbb{R}^n$, this reduces to the local semicircle law.
(2). Denote the spectral domain outside the bulk by
$$\mathbf{S}_o \equiv \mathbf{S}_o(\omega, n) := \{z = E + i\eta \in \mathbb{C} : 2 + n^{-2/3+\omega} \leq |E| \leq \omega^{-1},\ 0 \leq \eta \leq \omega^{-1}\}; \qquad (1.1.2)$$
then with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, G(z)\mathbf{w}\rangle - \langle \mathbf{v}, \mathbf{w}\rangle m_{sc}(z)\big| \leq n^{\epsilon}\sqrt{\frac{\operatorname{Im} m_{sc}(z)}{n\eta}}, \qquad z \in \mathbf{S}_o.$$
We next introduce the anisotropic local law for sample covariance matrices. We will need the following assumption on $\Sigma$. This type of condition ensures square root behavior of the Stieltjes transform near the right edges and has been used in a series of papers [9, 40, 69, 80]. We first write the non-asymptotic version of the global law as $z = f(m)$, $\operatorname{Im} m(z) > 0$, where $f(x)$ is defined as
$$f(x) = -\frac{1}{x} + \frac{1}{n}\sum_{i=1}^{p} \frac{1}{x + \sigma_i^{-1}}, \qquad (1.1.3)$$
where $\{\sigma_i\}_{i=1}^{p}$ are the eigenvalues of $\Sigma$ in decreasing order. The elementary properties of $f$ are collected in the following lemma.
Lemma 1.1.10 (Properties of $f$). Denote $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. Then $f$ defined in (1.1.3) is smooth on the $p + 1$ open intervals of $\overline{\mathbb{R}}$ defined through $I_1 := (-\sigma_1^{-1}, 0)$, $I_i := (-\sigma_i^{-1}, -\sigma_{i-1}^{-1})$, $i = 2, \ldots, p$, and $I_0 := \overline{\mathbb{R}} \setminus \bigcup_{i=1}^{p} I_i$. We also introduce a multiset $\mathcal{C} \subset \overline{\mathbb{R}}$ containing the critical points of $f$, using the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case $n/p = 1$, $\infty$ is a nondegenerate critical point. With the above notation, we have:
• $|\mathcal{C} \cap I_0| = |\mathcal{C} \cap I_1| = 1$ and $|\mathcal{C} \cap I_i| \in \{0, 2\}$ for $i = 2, \ldots, p$. Therefore $|\mathcal{C}| = 2t$ for some integer $t$, where for convenience we denote by $x_1 \geq x_2 \geq \cdots \geq x_{2t-1}$ the $2t - 1$ critical points in $I_1 \cup \cdots \cup I_p$ and by $x_{2t}$ the unique critical point in $I_0$.
• Denoting $a_k := f(x_k)$, we have $a_1 \geq \cdots \geq a_{2t}$. Moreover, $x_k = m(a_k)$, with the convention $m(0) := \infty$ for $n/p = 1$. Furthermore, for $k = 1, \ldots, 2t$, there exists a constant $C$ such that $0 \leq a_k \leq C$.
• $\operatorname{supp} \rho \cap (0, \infty) = \big(\bigcup_{k=1}^{t}[a_{2k}, a_{2k-1}]\big) \cap (0, \infty)$.
Using the dual relation $f(m(z)) = z$, we can easily derive the asymptotic properties of $m(z)$, which we will discuss in detail in Chapter 2. We now list the key assumption.

Assumption 1.1.11 (Regularity assumption on $\Sigma$, Definition 2.7 of [74]). Fix $\tau > 0$. We assume: (i). The edges $a_k$, $k = 1, \ldots, 2t$, are regular in the sense that
$$a_k \geq \tau, \qquad \min_{l \neq k}|a_k - a_l| \geq \tau, \qquad \min_i |x_k + \sigma_i^{-1}| \geq \tau. \qquad (1.1.4)$$
(ii). The bulk components $k = 1, \ldots, t$ are regular, i.e., for any fixed $\tau' > 0$ there exists a constant $c \equiv c_{\tau, \tau'}$ such that the density of $\rho$ on $[a_{2k} + \tau', a_{2k-1} - \tau']$ is bounded from below by $c$.
Theorem 1.1.12 (Anisotropic local laws, Theorems 3.6 and 3.7 of [74]). Denote the $(p+n) \times (p+n)$ deterministic matrices
$$\Pi(z) := \begin{pmatrix} -\Sigma(1 + m(z)\Sigma)^{-1} & 0 \\ 0 & m(z) \end{pmatrix}, \qquad \widehat{\Sigma} := \begin{pmatrix} \Sigma & 0 \\ 0 & 1 \end{pmatrix},$$
and the random matrix
$$G(z) := \begin{pmatrix} -\Sigma^{-1} & X \\ X^* & -z \end{pmatrix}^{-1}.$$
(i). Under Assumption 1.1.11, for the spectral domain defined in (1.1.1), for some small $\epsilon > 0$ and large $D > 0$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, \widehat{\Sigma}^{-1}(G(z) - \Pi(z))\widehat{\Sigma}^{-1}\mathbf{w}\rangle\big| \leq n^{\epsilon}\left(\sqrt{\frac{\operatorname{Im} m(z)}{n\eta}} + \frac{1}{n\eta}\right), \qquad z \in \mathbf{S}_b,$$
where $\mathbf{v}$ and $\mathbf{w}$ are deterministic unit vectors in $\mathbb{R}^{p+n}$.
(ii). Under Assumption 1.1.11, for $z$ with $0 < \eta \leq \omega^{-1}$ and $\operatorname{dist}(E, \operatorname{supp}\rho) \geq n^{-2/3+\omega}$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, \widehat{\Sigma}^{-1}(G(z) - \Pi(z))\widehat{\Sigma}^{-1}\mathbf{w}\rangle\big| \leq n^{\epsilon}\sqrt{\frac{\operatorname{Im} m(z)}{n\eta}}.$$
It is remarkable that the isotropic local law for sample covariance matrices can be recovered from the anisotropic local law via the block identity
$$G(z) = \begin{pmatrix} z\Sigma^{1/2}G_1(z)\Sigma^{1/2} & \Sigma X G_2(z) \\ G_2(z)X^*\Sigma & G_2(z) \end{pmatrix},$$
where $G_1(z) = (\Sigma^{1/2}XX^*\Sigma^{1/2} - z)^{-1}$ and $G_2(z) = (X^*\Sigma X - z)^{-1}$.
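The block identity above is purely algebraic (a Schur complement computation) and can be checked numerically; the following sketch uses small matrices with arbitrary entries of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, z = 3, 5, 0.7 + 0.3j

X = rng.standard_normal((p, n)) / np.sqrt(n)
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)                   # a positive definite Sigma

w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T        # Sigma^{1/2}

# linearized Green function G(z) = [[-Sigma^{-1}, X], [X^*, -z]]^{-1}
M = np.block([[-np.linalg.inv(Sigma), X],
              [X.T, -z * np.eye(n)]])
G = np.linalg.inv(M)

G1 = np.linalg.inv(Sigma_half @ X @ X.T @ Sigma_half - z * np.eye(p))
G2 = np.linalg.inv(X.T @ Sigma @ X - z * np.eye(n))

# the four blocks of G(z) match the Schur complement expressions
assert np.allclose(G[:p, :p], z * Sigma_half @ G1 @ Sigma_half)
assert np.allclose(G[:p, p:], Sigma @ X @ G2)
assert np.allclose(G[p:, :p], G2 @ X.T @ Sigma)
assert np.allclose(G[p:, p:], G2)
```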
Finally, we summarize the results on the matrix ensembles defined in Definitions 1.1.6 and 1.1.7. Since we know the exact probability measure, we can compute the joint probability density function via an integral over the unitary (orthogonal) group. We first recall the definition of a determinantal point process [21, 66]. Consider a point process $\xi$ on a complete separable metric space $\Lambda$ with reference measure $\lambda$, all of whose correlation functions $\rho_n$ exist. If there exists a function $K : \Lambda \times \Lambda \to \mathbb{C}$ such that
$$\rho_n(x_1, \ldots, x_n) = \det\big(K(x_i, x_j)\big)_{i,j=1}^{n}$$
for all $x_1, \ldots, x_n \in \Lambda$, then we call $\xi$ a determinantal point process with correlation kernel $K$. We can view the correlation kernel $K$ as the integral kernel of a Hilbert-Schmidt operator, so $K$ can be written in matrix form [93]. The joint probability density function of the eigenvalues $\rho_n^{(n)}(x_1, \ldots, x_n)$ can be computed by a change of variables (see, e.g., Section 8 of [23]), and we are usually interested in computing two important quantities.
One of them is the $k$-point correlation function
$$\rho_n^{(k)}(x_1, \ldots, x_k) := \frac{n!}{(n-k)!}\int_{\mathbb{R}^{n-k}} \rho_n^{(n)}(x_1, \ldots, x_n) \prod_{i=k+1}^{n} dx_i.$$
When $k = 1$, it is the averaged spectral density. The other is the level-spacing function
$$A_n^{(k)}(\theta; x_1, \ldots, x_k) = \frac{n!}{k!(n-k)!}\int_{\mathbb{R}^{n-k} \setminus D^{n-k}} \rho_n^{(n)}(x_1, \ldots, x_n) \prod_{i=k+1}^{n} dx_i,$$
where $D := [-\theta, \theta]$. We are mainly interested in the following two questions: (i). Are the functions $\rho_n^{(k)}$ and $A_n^{(k)}(\theta)$ determinantal? (ii). If they are, how can we use orthogonal and biorthogonal polynomials to characterize the point process? The answers to (ii) have many important consequences, for instance the limiting distributions of the global law (i.e., the level density) and of the largest eigenvalues, obtained after steepest descent analysis. In the following discussion, we focus on Hermitian matrices.
(i). Determinantal point process. For the one-matrix model, we rescale and consider the ensemble
$$\frac{1}{Z} e^{-n \operatorname{Tr} V(M)},$$
where we assume that $V$ is a polynomial with positive leading coefficient. Denoting by $x_i$, $i = 1, \ldots, n$, the eigenvalues of $M$, it is well known that the joint probability density function can be written as
$$\rho_n^{(n)}(x_1, \ldots, x_n) = Z^{-1}\prod_{1 \leq i < j \leq n} |x_i - x_j|^2 \prod_{i=1}^{n} e^{-nV(x_i)}.$$
To show that it is determinantal, we need to find its correlation kernel. This is usually done in terms of orthogonal polynomials [100]; we record the construction in the following lemma and theorem.

Lemma 1.1.13. For any partition function of the form
$$Z = \int_{\mathbb{R}^n} \prod_{1 \leq i < j \leq n} |x_i - x_j|^2 \prod_{i=1}^{n} e^{-W(x_i)}\, dx_1 \cdots dx_n,$$
there always exists a unique sequence of polynomials $(P_n)_{n \geq 0}$ with the following properties:
(1). $P_n$ is a monic polynomial of degree $n$;
(2). For any $n, m \geq 0$ and some constant $h_n > 0$,
$$\int_{\mathbb{R}} P_n(x) P_m(x) e^{-W(x)}\, dx = \delta_{nm} h_n.$$
Then $Z = n! \prod_{m=0}^{n-1} h_m$.
Theorem 1.1.14 (Theorem 9.2 of [23]). Denote the Christoffel-Darboux kernel
$$K(x, y) = \sum_{k=0}^{n-1} \frac{P_k(x) P_k(y)}{h_k};$$
then we have
$$\rho_n^{(k)}(x_1, \ldots, x_k) = \det\big[\overline{K}(x_i, x_j)\big]_{i,j=1}^{k},$$
where $\overline{K}(x, y) = K(x, y) e^{-V(x)/2 - V(y)/2}$.
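As a quick sanity check on Lemma 1.1.13 and Theorem 1.1.14, the following sketch (our own illustration, with the unscaled Hermite weight $e^{-W(x)}$, $W(x) = x^2/2$, chosen for convenience) uses the monic probabilists' Hermite polynomials $He_k$, for which $h_k = \sqrt{2\pi}\, k!$, and verifies both the orthogonality relation and the fact that the weighted kernel on the diagonal integrates to $n$, the total number of eigenvalues:

```python
import numpy as np
from numpy.polynomial.hermite_e import HermiteE
from math import factorial, sqrt, pi

n = 5                                                 # number of "eigenvalues"
He = [HermiteE.basis(k) for k in range(n)]            # monic probabilists' Hermite He_k
h = [sqrt(2 * pi) * factorial(k) for k in range(n)]   # h_k = int He_k^2 e^{-x^2/2} dx

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
w = np.exp(-x ** 2 / 2)                               # weight e^{-W(x)}, W(x) = x^2/2

# orthogonality (Lemma 1.1.13): int He_j He_k e^{-W} dx = delta_{jk} h_k
for j in range(n):
    for k in range(n):
        val = np.sum(He[j](x) * He[k](x) * w) * dx
        target = h[k] if j == k else 0.0
        assert abs(val - target) < 1e-5 * max(1.0, h[k])

# weighted Christoffel-Darboux kernel on the diagonal (Theorem 1.1.14)
K_diag = sum(He[k](x) ** 2 / h[k] for k in range(n)) * w
total = np.sum(K_diag) * dx                           # 1-point function integrates to n
print(total)
```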
For the multi-matrix model, we need to use the biorthogonal polynomials.
Theorem 1.1.15 (Main theorem of [56]). Consider a chain of $m$ Hermitian $n \times n$ matrices $H_1, \ldots, H_m$ with probability density proportional to
$$\exp\left[-\operatorname{Tr}\left(\tfrac{1}{2}V_1(H_1) + V_2(H_2) + \cdots + V_{m-1}(H_{m-1}) + \tfrac{1}{2}V_m(H_m)\right) + \operatorname{Tr}\left(c_1 H_1 H_2 + \cdots + c_{m-1} H_{m-1} H_m\right)\right],$$
where the $V_j(x)$ are real polynomials of even degree and the $c_j$ are real constants. Denote
$$E_{ij}(x, y) = \begin{cases} 0, & i \geq j, \\ \omega_i(x, y), & j = i + 1, \\ (\omega_i * \cdots * \omega_{j-1})(x, y), & j > i + 1, \end{cases}$$
where $\omega_i(x, y) = \exp\big(-\tfrac{1}{2}V_i(x) - \tfrac{1}{2}V_{i+1}(y) + c_i xy\big)$ and $*$ denotes convolution. Then the correlation functions of the eigenvalues are determinantal, and the kernel can be written as
$$K_{ij}(x, y) = H_{ij}(x, y) - E_{ij}(x, y), \qquad 1 \leq i, j \leq m,$$
where $H_{ij}(x, y) = \sum_{l=0}^{n-1} \frac{1}{h_l}\Psi_{il}(x)\Phi_{jl}(y)$, with
$$\int \Psi_{il}(x)\Phi_{ik}(x)\, dx = h_l \delta_{lk}, \qquad 1 \leq i \leq m, \quad l, k \geq 0.$$
Here $\Phi_{ik}(x)$ and $\Psi_{jl}(x)$ can be constructed as follows: choose polynomials $P_j(x)$, $Q_k(y)$ of degrees $j, k$ satisfying
$$\int\!\!\int P_j(x)\,(\omega_1 * \cdots * \omega_{m-1})(x, y)\, Q_k(y)\, dx\, dy = h_j \delta_{jk}.$$
Let $\Psi_{mj}(x) = Q_j(x)$ and $\Phi_{1j}(x) = P_j(x)$; then
$$\Psi_{ij}(x) = \int \omega_i(x, y)\Psi_{i+1,j}(y)\, dy, \qquad \Phi_{ij}(x) = \int \Phi_{i-1,j}(y)\,\omega_{i-1}(y, x)\, dy.$$
(ii). Kernel representation using orthogonal polynomials. Once we have proved that the process is determinantal, the next step is to rewrite the kernel function as a sum of polynomials, whose asymptotics can be obtained by steepest descent analysis. For the one-matrix model, we list in Table 1.1 the common orthogonal polynomials and their associated random matrix models. For the classical polynomials, there exists a recursive relation (the Christoffel-Darboux formula), obtained by analyzing the generating functions of the orthogonal polynomials.
Table 1.1: Orthogonal polynomials (OP) and random matrix model (RMM)

    V(x)                          OP          RMM
    (1/2) x^2                     Hermite     Wigner
    x - a log x                   Laguerre    Wishart
    -a log(1 - x) - b log x       Jacobi      Double Wishart
For the biorthogonal polynomials, existence is guaranteed by the work of Borodin [20], but due to the lack of a simple explicit Christoffel-Darboux formula, the biorthogonal system has to be found case by case. For instance, in the Gaussian case [105], it turns out to consist of the extended Hermite polynomials.
1.2 Two approaches for analyzing random matrices
There are four important methods employed in the study of random matrices: the moment method, the Stieltjes transform, orthogonal and biorthogonal polynomial decompositions, and free probability. We will not discuss the moment method or free probability, as they are beyond the main scope of this thesis; for reference, see [4] and [86].
For statistical applications, we will rely on the dynamic approach via the analysis of Green functions. This approach can be regarded as an extension of the Stieltjes transform method. An important advantage is that it yields local laws with optimal bounds. For applications in mathematical physics, we focus on orthogonal and biorthogonal polynomial decompositions, which provide exact determinantal formulas for the eigenvalue correlation function and the level-spacing function.
Dynamic approach developed by Erdos and Yau. Two good references for this approach are the book [51] and the lecture notes [11]. To employ this idea, we first need to prove the local laws (or their variants) for the associated random matrix models. This relies on a detailed analysis of Green functions following the steps below (a detailed example will be given in Chapter 2):
(a). Use Schur's complement formula and large deviation bounds to prove that the diagonal entries of the Green function are close to their expectations.
(b). Split the terms in Schur's complement formula into a leading term and a random term (usually by multiplying both sides of Schur's complement formula by the diagonal entries of the Green function). By averaging all these terms, we can control the error terms for suitably chosen η; for instance, for Wigner matrices and sample covariance matrices satisfying Assumption 1.1.11, we first take η ≥ 1. This step yields the self-consistent equation for the global law. Note that the spectral domain is also crucial: for example, if we only want to study the local law near the edge, we can restrict the real part of the spectral parameter to within the typical distance from the edge.
(c). Repeat (a) and (b) for the off-diagonal entries of the Green function.
(d). Steps (a), (b) and (c) provide an a priori bound for some η; we can then repeatedly improve the estimates. This yields a weak local law.
(e). To obtain the final form, we need fluctuation averaging for the summations, where we apply the decoupling technique.
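As a numerical illustration of what a local law asserts (a toy experiment of our own, not from the proofs above): even at a mesoscopic scale $\eta = n^{-1/2} \ll 1$, every diagonal Green function entry $G_{ii}(z)$ of a Wigner matrix is already close to $m_{sc}(z)$, with error of order $\sqrt{\operatorname{Im} m_{sc}(z)/(n\eta)}$ as in Theorem 1.1.9:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)                 # Wigner matrix

eta = n ** -0.5                                # mesoscopic scale: 1/n << eta << 1
z = 0.5 + 1j * eta
G = np.linalg.inv(H - z * np.eye(n))           # Green function G(z) = (H - z)^{-1}

r = np.sqrt(z * z - 4 + 0j)                    # m_sc: root of m^2 + z m + 1 = 0
m = (-z + r) / 2
m_sc = m if m.imag > 0 else (-z - r) / 2

err = np.max(np.abs(np.diag(G) - m_sc))        # worst diagonal entry
bound = np.sqrt(m_sc.imag / (n * eta))         # local law error scale
print(err, bound)                              # err is comparable to bound
```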
Once we obtain the local laws for our application, the next step is to write the quantity of interest in terms of smooth functions of the entries of Green functions. For instance, in Chapter 2 (see also [9, 40, 80, 81, 89]), we will show that the distribution function of the largest eigenvalue of the sample covariance matrix can be well approximated by a function depending only on Green functions on some well-chosen interval. We will also show (see also [12, 17, 37, 72]) that for spiked sample covariance matrices, the outlier eigenvalues are completely determined by a deterministic equation involving Green functions, and the overlaps of eigenvectors can be written in terms of derivatives of Green functions. Once we have such a representation of the quantity of interest, we can compute and prove the desired results. In statistical applications, two types of questions are usually considered: universality of the eigen-structure, and asymptotics of the first few largest eigenvalues and eigenvectors of sample covariance matrices.
The strategy for proving universality is either to use Lindeberg's replacement trick [28], replacing entry by entry [40, 52, 81, 102, 103] or column by column [9, 89], or to use a Green function flow (continuous interpolation), controlling the derivatives of Green functions [74, 79, 80].
The outlier eigenvalues and eigenvectors play important roles in statistical estimation and inference. Their convergent limits can usually be computed from the representations of the quantities and the local laws. As for the asymptotics, in the supercritical case we usually have Gaussian fluctuations [12, 37, 72, 73]. To prove this, we derive a recursion formula for the moments using Stein's lemma [8], controlling the error terms with the local laws. For reference, we list the key formulas for Gaussian random variables; their proofs are simple applications of integration by parts.
Lemma 1.2.1 (Recursive formula for the moments of a Gaussian random variable). For a real Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, denote its $n$-th moment by $a_n$. We then have
$$a_{n+2} = \mu a_{n+1} + \sigma^2(n+1)a_n, \qquad a_1 = \mu, \qquad a_2 = \mu^2 + \sigma^2.$$
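A quick check of the recursion (our own sketch; the parameter values are arbitrary): iterating with $a_0 = 1$ and $a_1 = \mu$ reproduces the well-known moments of the standard normal, where odd moments vanish and the $2k$-th moment equals the double factorial $(2k-1)!!$, as well as the base case $a_2 = \mu^2 + \sigma^2$ of the lemma.

```python
def gaussian_moments(mu, sigma2, n_max):
    """Moments a_k = E[X^k], X ~ N(mu, sigma2), via the recursion of Lemma 1.2.1:
    a_{k+2} = mu * a_{k+1} + sigma2 * (k+1) * a_k, with a_0 = 1 and a_1 = mu."""
    a = [1.0, mu]
    for k in range(n_max - 1):
        a.append(mu * a[-1] + sigma2 * (k + 1) * a[k])
    return a

# standard normal: odd moments vanish, even moments are (2k-1)!!
a = gaussian_moments(0.0, 1.0, 8)
print(a)        # [1.0, 0.0, 1.0, 0.0, 3.0, 0.0, 15.0, 0.0, 105.0]

# base case of the lemma: a_2 = mu^2 + sigma^2
b = gaussian_moments(1.5, 2.0, 2)
print(b[2])     # 4.25 = 1.5^2 + 2.0
```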
Lemma 1.2.2 (Stein’s lemma, Appendix A of [29]). Suppose that X = (x1, · · · , xn) ∈ Rn
Chapter 1. Introduction 16
is a n-dimensional centered Gaussian vector. Let f : Rn → R be an absolutely continuous
function such that |∇f(X)| has finite expectation, then for any i,
E(xif(X)) =n∑j=1
E(xixj)E(∂if(X)).
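The lemma is easy to verify by simulation. In the following sketch (our own toy example; the covariance and the test function $f(x) = x_1 x_2^2$ are arbitrary choices), both sides converge to the exact value $\Sigma_{11}\Sigma_{22} + 2\Sigma_{12}^2$ given by Isserlis' theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])               # covariance of a centered 2-d Gaussian
N = 400_000
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=N)
x1, x2 = X[:, 0], X[:, 1]

# test function f(x) = x_1 x_2^2, so df/dx_1 = x_2^2 and df/dx_2 = 2 x_1 x_2
f = x1 * x2 ** 2
lhs = np.mean(x1 * f)                        # E[x_1 f(X)]
rhs = Sigma[0, 0] * np.mean(x2 ** 2) + Sigma[0, 1] * np.mean(2 * x1 * x2)
print(lhs, rhs)   # both close to Sigma_11 Sigma_22 + 2 Sigma_12^2 = 2.5
```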
We point out that the key tricks are the self-consistent equations and the identity $zG = HG - I$. There are many other important applications in high dimensional statistics, for instance linear spectral statistics and correlated sample covariance matrices. We will not discuss them in this thesis but will focus on this direction in the future.
Orthogonal and biorthogonal polynomial decomposition. Two good references are the books [34] and [85]. To apply this approach, we first need to prove that the correlation function of the random matrix model is determinantal, following the steps below:
(a). Derive the joint probability density function of the eigenvalues of the random matrix model using the confluent form of the Harish-Chandra-Itzykson-Zuber integral [14, 85].
(b). For the one-matrix model, choose suitable orthogonal polynomials to rewrite the Vandermonde determinant of the eigenvalues according to the weight of the ensemble. For the multi-matrix model, follow Theorem 1.1.15 to construct the biorthogonal system by choosing suitable initial polynomials according to the potential functions.
(c). Rewrite the correlation function in determinantal form and analyze the asymptotic behavior of the polynomials using the Riemann-Hilbert steepest descent analysis [34].
Once we find the correlation function, the next step is to write the quantities of interest in terms of it; for instance, the gap probabilities can be written as an infinite sum of integrals of correlation functions. It is notable that the most important step is to biorthogonalize the correlation kernel; in some special cases, we can use techniques from TASEP [84].
Remark 1.2.3. (1). The asymptotic analysis is usually easy for studying the complex
case but hard for the real case. For example, the BBP transition was only tackled for
the real case by Bloemendal and Virag [18, 19] by relating the distribution of perturbed
GOE to the probability of explosion of the solution of second order stochastic differential
equations in 2011.
(2). It is notable that we can add an external source to the one-matrix model, in
which case we need to biorthogonalize it as well. We will not pursue this direction here;
we refer to the work of Kuijlaars [15, 76] for further discussion.
1.3 Applications in statistics and mathematical physics
Covariance matrices play important roles in high dimensional data analysis and find
applications in many scientific endeavors, ranging from functional magnetic resonance
imaging and the analysis of gene expression arrays to risk management and portfolio
allocation. Furthermore, a large collection of statistical methods, including principal
component analysis, discriminant analysis, clustering analysis, and regression analysis,
require knowledge of the covariance structure. Estimating a high dimensional covariance
matrix is thus a fundamental problem in high dimensional statistics. The starting
point of covariance matrix estimation is the sample covariance matrix. For the purpose
of statistical applications, our work focuses on the models in Definitions 1.1.2 and 1.1.3.
After deriving the local laws, we can study the statistical properties of the eigenvalues
and eigenvectors of such matrices. For the sample covariance matrix, an important
subclass is the spiked sample covariance matrix [6, 17, 36, 67, 87], where a finite number
of eigenvalues can detach from the bulk and become outliers. In the language of
statistics, we can regard the outlier eigenvalues as the signals and the bulk eigenvalues
as the noise. The signal part contains information only depending on itself, and the
noise part will stick to the sample covariance matrix XX∗. In the supercritical case,
the outlier eigenvalues have Gaussian fluctuations [5], and the distribution of the angle
between the eigenvectors of the population and sample covariance matrices is also Gaussian.
However, in the general situation, it may not be universal [27, 73]. The extremal
non-outlier eigenvalues are governed by the Tracy-Widom asymptotics. Similar results
hold true for the sum of a random matrix and a low-rank deterministic matrix.
Random matrix theory can also help us study the covariance structure of non-stationary
time series and high dimensional time series. On the one hand, for non-stationary time
series, the underlying covariance and precision matrices are large dimensional matrices.
We adapt the construction of Wu and Zhou [109, 119, 120] to characterize the
non-stationary time series. We assume that we can only observe one non-stationary time
series $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}$, with $x_i = G(\tfrac{i}{n}, \mathcal{F}_i)$, where $\mathcal{F}_i = (\cdots, \eta_{i-1}, \eta_i)$ and $\eta_i$, $i \in \mathbb{Z}$, are
i.i.d. centered random variables, and $G: [0,1] \times \mathbb{R}^{\infty} \to \mathbb{R}$ is a measurable function such
that $\xi_i(t) := G(t, \mathcal{F}_i)$ is a properly defined random variable for all $t \in [0,1]$. It is very
important to test whether $\{x_i\}_{i=1}^n$ is a white noise process (with possibly time-varying
variances) and whether its precision matrix is banded. In many cases, the statistic is
a quadratic form in a vector of diverging dimension [42, 118]. The distribution in the
Gaussian case is easy to compute using the classic central limit theorem, and for general
distributions, we need to prove a Gaussian approximation. This is usually done
by using Stein's method [95, 96], which is essentially the same as the Green function
comparison strategy in random matrix theory. On the other hand, in high dimensional
statistics, even though the entries within each vector are correlated through Σ, the
vectors themselves are assumed to be independent; it is important to study the case when
the vectors are correlated with each other. In [82], the Marchenko-Pastur law is derived
for the lagged autocovariance matrices of stationary linear time series, and in [115], the
Gaussian asymptotics of the largest eigenvalue for a special class of unstable time series
is derived.
However, the connection between non-stationary time series and random matrix is still
missing at this point and we will pursue this direction in the future using the framework
of Wu and Zhou.
Random matrix models are also useful in understanding stochastic growth
phenomena in physics [59, 101]. Starting from the work of Johansson [64, 65] and
Prahofer and Spohn [92], the Airy process has been employed to describe the spatial
fluctuations in a wide range of growth models. These processes are at the center of the
KPZ universality class [94]. One way to characterize one such process is to scale the top
eigenvalue curves of Dyson Brownian motion at different time points. Dyson Brownian
motion is a matrix-valued SDE whose entries independently undergo Ornstein-Uhlenbeck
diffusions [59]. If we consider the GUE initial condition and study the transition density
at finite time points, it has the form of the multi-matrix model of Definition 1.1.7. Using
Theorem 1.1.15, we can get the extended Hermite kernel, and scaling this kernel at the
edge yields the Airy process [105]. The key part of the above computation is the
biorthogonalization of the correlation kernel, where in the GUE case the polynomials are
the standard Hermite polynomials [100]. Similar technical problems appear in the
discussion of the totally asymmetric simple exclusion process (TASEP) [22, 97]. Very
recently, Matetski, Quastel and Remenik [84] proposed a new way to understand this
problem in the environment of random walks; it is our hope that we can extend the Airy
process using this technique [39].
1.4 Our contributions
This section is devoted to listing the contributions of this thesis; the details can be
found in Chapters 2 and 3. We divide them into two parts accordingly:
Random matrix theory and high dimensional statistics. We have successfully
applied the dynamic approach developed by Erdos and Yau to study some problems
related to high dimensional statistics.
(1). We prove a necessary and sufficient condition for the edge universality at the largest
eigenvalue of a general class of sample covariance matrices satisfying Assumption
1.1.11 with diagonal Σ in [40]. For the Tracy-Widom asymptotics to hold true,
the following moment assumption is the necessary and sufficient condition:
\[ \lim_{s \to \infty} s^4\, \mathbb{P}\big(|\sqrt{n}\, X_{ij}| \ge s\big) = 0. \]
This implies that the Tracy-Widom distribution still holds true for data with slightly
heavy tails, for example with probability density function of the form
\[ f(x) = \frac{e^4 (4\log x + 1)}{x^5 (\log x)^2}\, \mathbf{1}(x > e). \]
This condition was originally proposed for Wigner matrices by Lee and Yin in [81].
In an on-going project [41], we will prove that this condition still holds true for a
more general class of Σ.
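For this example density, the survival function works out in closed form to $\mathbb{P}(X \ge s) = e^4/(s^4 \log s)$ for $s > e$, so $s^4\,\mathbb{P}(X \ge s) = e^4/\log s \to 0$, but only logarithmically. The following short script (a sanity check, not part of the thesis) verifies the closed form numerically:

```python
import math

def f(x):
    # the heavy-tailed density above: f(x) = e^4 (4 log x + 1) / (x^5 (log x)^2) for x > e
    return math.e**4 * (4 * math.log(x) + 1) / (x**5 * math.log(x)**2)

def tail(s):
    # closed-form survival function: P(X >= s) = e^4 / (s^4 log s);
    # note tail(e) = 1 and -tail'(s) = f(s), so f integrates to 1 on (e, infinity)
    return math.e**4 / (s**4 * math.log(s))

# check -tail'(s) = f(s) by a central difference
s, h = 10.0, 1e-5
assert abs(-(tail(s + h) - tail(s - h)) / (2 * h) - f(s)) < 1e-8

# s^4 P(|X| >= s) = e^4 / log s vanishes, but only logarithmically
for s in (1e1, 1e3, 1e6):
    print(s, s**4 * tail(s))
```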
(2). We prove the universality of the singular vectors for a general class of sample
covariance matrices provided that Assumption 1.1.11 holds true in [35]. We consider
a class of sample covariance matrices of the form Σ1/2XX∗Σ1/2. Assuming p is
comparable to n, we prove that the distribution of the components of the singular
vectors close to each edge singular value agrees with that of Gaussian ensembles
provided the first two moments of the entries coincide with those of Gaussian random variables. For
the singular vectors associated with each bulk singular value, the same conclusion
holds if the first four moments match with those of Gaussian random variables.
Similar results have been proved for Wigner matrices by Knowles and Yin in [71].
We only prove the diagonal case in this paper; however, using the Green function
flow method [80], we can extend the results to any Σ satisfying Assumption 1.1.11.
(3). We systematically study the eigen-structure of the model in Definition 1.1.3, assuming
that $S$ has a low-rank structure, in [37]. Denote the singular value decomposition
of $S$ as $S = UDV^*$, where $D = \operatorname{diag}\{d_1, \cdots, d_r\}$, $U = (u_1, \cdots, u_r)$, $V =
(v_1, \cdots, v_r)$, and where $u_i \in \mathbb{R}^p$, $v_i \in \mathbb{R}^n$ are orthonormal vectors and $r$ is a fixed
constant. We also assume $d_1 > d_2 > \cdots > d_r > 0$. We are interested in the regime
$c_n := n/p$, $\lim_{n \to \infty} c_n = c \in (0, \infty)$. We now give a heuristic description of our
results in the rank-one case; the details can be found in Chapter 2. We denote by
$\mu_1 \ge \cdots \ge \mu_K$ the eigenvalues of $\tilde{S}\tilde{S}^*$, where $\tilde{S} = S + X$ is the observed matrix,
$K = \min\{n, p\}$, and by $\tilde{u}_i, \tilde{v}_i$ the singular vectors of $\tilde{S}$. We prove that when
$d_1 > c^{-1/4}$, $\mu_1 \to p(d_1)$, where $p(d_1)$ is defined through
\[ p(d) = \frac{(d^2 + 1)(d^2 + c^{-1})}{d^2}. \]
When $d_1 > c^{-1/4}$, the largest eigenvalue $\mu_1$ will detach from the bulk and become
an outlier around its classical location $p(d_1)$. We would expect this to happen on
a scale of $n^{-1/3}$. This can be understood in the following way: increasing $d$ beyond
the critical value $c^{-1/4}$, we expect $\mu_1$ to become an outlier, whose location $p(d)$
lies at a distance greater than $O(n^{-2/3})$ from $\lambda_+$. By the mean value theorem,
the phase transition will take place on the scale where $|d_1 - c^{-1/4}| \ge O(n^{-1/3})$.
Furthermore, we also prove that $\mu_1 = p(d_1) + O\big(n^{-1/2}(d_1 - c^{-1/4})^{1/2}\big)$. Below this
scale, we would expect the spectrum of $\tilde{S}\tilde{S}^*$ to stick to that of $XX^*$. In particular,
the largest eigenvalue $\mu_1$ still has the Tracy-Widom distribution on the scale $n^{-2/3}$,
which reads as $\mu_1 = \lambda_+ + O(n^{-2/3})$, $\lambda_+ = (1 + c^{-1/2})^2$.
For the singular vectors, when $d_1 > c^{-1/4}$, we have $\langle u_1, \tilde{u}_1 \rangle^2 \to a_1(d_1)$,
$\langle v_1, \tilde{v}_1 \rangle^2 \to a_2(d_1)$, where $a_1(d_1), a_2(d_1)$ are deterministic functions of $d_1$. We
further prove that if $d_1 > c^{-1/4} + n^{-1/3}$, we have
\[ \langle u_1, \tilde{u}_1 \rangle^2 = a_1(d_1) + O(n^{-1/2}), \qquad \langle v_1, \tilde{v}_1 \rangle^2 = a_2(d_1) + O(n^{-1/2}). \]
Below this scale, we prove that
\[ \langle u_1, \tilde{u}_1 \rangle^2 = O(n^{-1}), \qquad \langle v_1, \tilde{v}_1 \rangle^2 = O(n^{-1}). \]
Finally, we point out that in the working paper [8], we prove that in the supercritical
case when $d_1 > c^{-1/4}$, $\langle u_1, \tilde{u}_1 \rangle^2$ is asymptotically normally distributed if
the singular vector has no component of order $O(1)$.
We also consider two statistical applications. Under the assumption that the $u_i, v_i$ are
sparse, we provide an algorithm to consistently estimate $S$ from $\tilde{S}$. In the general
situation, we provide a rotation-invariant estimator, which performs better
than simply using the singular value decomposition.
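As an illustration (a numerical sketch, not from the thesis, and only schematically following Definition 1.1.3), the outlier location $p(d)$ in the supercritical regime can be checked on a hypothetical rank-one signal-plus-noise matrix $\tilde{S} = d\,uv^* + X$ with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 600, 1200                 # aspect ratio c = n/p = 2
c, d = n / p, 1.5                # d exceeds the critical value c**(-1/4) ~ 0.84

# hypothetical rank-one instance: S_tilde = d * u v^T + X, X with i.i.d. N(0, 1/n) entries
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = rng.standard_normal((p, n)) / np.sqrt(n)
S_tilde = d * np.outer(u, v) + X

mu1 = np.linalg.svd(S_tilde, compute_uv=False)[0] ** 2   # top eigenvalue of S_tilde S_tilde^*
p_d = (d**2 + 1) * (d**2 + 1 / c) / d**2                 # classical outlier location p(d)
lam_plus = (1 + c**-0.5) ** 2                            # right edge of the bulk

print(mu1, p_d, lam_plus)        # mu1 sits near p(d), well detached from the bulk edge
```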
(4). We extend the famous spiked sample covariance matrix model to a more general
model containing more bulk components and outliers in [36]. To extend the bulk
model, we add a finite number $r$ of spikes to the spectrum of $\Sigma_b$ satisfying
Assumption 1.1.11. Denote the spectral decomposition of $\Sigma_b$ as
\[ \Sigma_b = \sum_{i=1}^{p} \sigma_i^b v_i v_i^*, \qquad D_b = \operatorname{diag}\{\sigma_1^b, \cdots, \sigma_p^b\}. \]
Denote by $\mathcal{I} \subset \{1, 2, \cdots, p\}$ the collection of the indices of the $r$
outliers, where $\mathcal{I} := \{o_1, \cdots, o_r\}$. Now we define
\[ \Sigma_g = \sum_{i=1}^{p} \sigma_i^g v_i v_i^*, \qquad \text{where } \sigma_i^g = \begin{cases} \sigma_i^b(1 + d_i), & i \in \mathcal{I}; \\ \sigma_i^b, & \text{otherwise}, \end{cases} \qquad d_i > 0. \]
We also assume that the $d_i$ are arranged in decreasing order. Therefore, we can write
\[ \Sigma_g = \Sigma_b(1 + V\mathcal{D}V^*) = (1 + V\mathcal{D}V^*)\Sigma_b, \]
where $V = (v_1, \cdots, v_p)$ and $\mathcal{D} = \operatorname{diag}(\mathsf{d}_i)$ is a $p \times p$ diagonal matrix with
$\mathsf{d}_i = d_i$ for $i \in \mathcal{I}$ and zero otherwise. Then our new model can be written as
$Q_g = \Sigma_g^{1/2} X X^* \Sigma_g^{1/2}$.
As there exist $m$ bulk components, for convenience we relabel the indices of the
eigenvalues of $Q_g$ as $\mu_{i,j}$, which stands for the $j$-th eigenvalue of the $i$-th bulk
component. Similarly, we relabel $d_{i,j}$, $\sigma^g_{i,j}$, $\sigma^b_{i,j}$. Recalling the definitions
related to $f$ in Lemma 1.1.10, we assume that the $r$ outliers are associated with $t$ bulk
components, each with $r_i$, $i = 1, 2, \cdots, t$, outliers satisfying $\sum_{i=1}^{t} r_i = r$. Using
the convention $x_0 = \infty$, we denote the subset $\mathcal{O}^+ \subset \mathcal{O}$ by $\mathcal{O}^+ = \bigcup_{i=1}^{t} \mathcal{O}^+_i$,
where $\mathcal{O}^+_i$ is defined as
\[ \mathcal{O}^+_i = \Big\{ \sigma^g_{i,j} : x_{2i-1} + N^{-1/3+\epsilon_0} \le -\frac{1}{\sigma^g_{i,j}} < x_{2(i-1)} - c_0 \Big\}, \]
where $\epsilon_0 > 0$ is some small constant and $0 < c_0 < \min_i \frac{x_{2(i-1)} - x_{2i-1}}{2}$. We further
denote $r^+_i := |\mathcal{O}^+_i|$ and the index sets associated with $\mathcal{O}^+_i, \mathcal{O}^+$ by $\mathcal{I}^+_i, \mathcal{I}^+$, where
\[ \mathcal{I}^+_i := \{(i,j) : \sigma^g_{i,j} \in \mathcal{O}^+_i\}, \qquad \mathcal{I}^+ := \bigcup_{i=1}^{t} \mathcal{I}^+_i. \]
We can relabel $\mathcal{I}$ in a similar fashion. We prove that for $i = 1, 2, \cdots, t$ and $j =
1, 2, \cdots, r^+_i$, there exists some constant $C > 1$ such that when $N$ is large enough, with
probability $1 - N^{-D_1}$, we have
\[ \Big|\mu_{i,j} - f\Big(-\frac{1}{\sigma^g_{i,j}}\Big)\Big| \le n^{-1/2 + C\epsilon_0}\Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}. \]
Moreover, for $i = 1, 2, \cdots, t$ and $j = r^+_i + 1, \cdots, r_i$, we have
\[ |\mu_{i,j} - f(x_{2i-1})| \le n^{-2/3 + C\epsilon_0}. \]
Similar results hold for the angle between the eigenvectors of the sample covariance
matrices and the population covariance matrices, where the limit is
$\frac{1}{\sigma^g_{i,j}} \frac{f'(-1/\sigma^g_{i,j})}{f(-1/\sigma^g_{i,j})}$. Examples
and statistical applications are considered to verify our results.
Figure 1.1: An example of the general model. The spectrum of the population covariance
matrix contains three bulk components, and there are three, two and one spikes associated
with the first, second and third bulk components, respectively.
Random matrix theory and time series analysis. We develop a methodology
to estimate the underlying high dimensional covariance and precision
matrices of a locally stationary time series [42], assuming that only one observation is
available. Consider the one dimensional non-stationary time series $\{x_i\}_{i=1}^n$; the starting
point of our methodology is the idea of the Cholesky decomposition [91]. Let $\hat{x}_i$ be the
best linear predictor of $x_i$ based on its predecessors $x_{i-1}, \cdots, x_1$, i.e.
\[ \hat{x}_i = \sum_{j=1}^{i-1} \phi_{ij} x_{i-j}, \quad i = 2, \cdots, n. \]
Denote $\phi_i = (\phi_{i1}, \cdots, \phi_{i,i-1})^*$, where we use $*$ to stand for the transpose. Then we
have $\phi_i = \Gamma_i^{-1}\gamma_i$, where $\Gamma_i$ and $\gamma_i$ are defined as
$\Gamma_i = \operatorname{Cov}(\mathbf{x}_{i-1}, \mathbf{x}_{i-1})$, $\gamma_i = \operatorname{Cov}(\mathbf{x}_{i-1}, x_i)$,
with $\mathbf{x}_{i-1} = (x_{i-1}, \cdots, x_1)$. Let $\epsilon_i = x_i - \hat{x}_i$ be the prediction error, with
variance $\sigma_i^2$. Therefore, we can write
\[ x_i = \sum_{j=1}^{i-1} \phi_{ij} x_{i-j} + \epsilon_i, \quad i = 2, \cdots, n. \]
As $x_i$ is centered, we have $x_1 = \epsilon_1$; as a consequence, we can write $\Phi\Gamma\Phi^* = D$, where
the diagonal matrix $D = \operatorname{diag}\{\sigma_1^2, \cdots, \sigma_n^2\}$ and $\Phi$ is a lower triangular matrix having
ones on its diagonal and $-\phi_{ij}$ at its $(i, i-j)$-th element for $j < i$. We need to estimate
the coefficients $\phi_{ij}$ and the variances of the $\epsilon_i$. Under mild smoothness conditions,
$\phi_{ij}$ can be well approximated by $\phi_j(\tfrac{i}{n})$, where $\phi_j(t)$ is a smooth function defined
on $[0,1]$. Hence, it is natural to employ the idea of sieve estimation [31], where $\phi_j(t)$
can be estimated using some given basis functions. For the variances $\sigma_i^2$, due to the
smoothness assumption, they can also be estimated using the method of sieves. An advantage
of the Cholesky decomposition is that the precision matrix can also be easily (in fact,
numerically more easily) estimated.
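The identity $\Phi\Gamma\Phi^* = D$ can be checked directly on a toy covariance matrix; the following sketch (not from the thesis) uses a stationary AR(1) covariance purely for illustration:

```python
import numpy as np

n, rho = 8, 0.6
# toy covariance: stationary AR(1), Gamma_{jk} = rho^{|j-k|} (illustration only)
idx = np.arange(n)
Gamma = rho ** np.abs(idx[:, None] - idx[None, :])

Phi = np.eye(n)
for i in range(2, n + 1):                    # 1-based time index i
    # best linear predictor coefficients: phi_i = Gamma_i^{-1} gamma_i
    past = np.arange(i - 2, -1, -1)          # 0-based indices of x_{i-1}, ..., x_1
    phi = np.linalg.solve(Gamma[np.ix_(past, past)], Gamma[past, i - 1])
    Phi[i - 1, past] = -phi                  # -phi_{ij} at entry (i, i-j)

D = Phi @ Gamma @ Phi.T                      # Phi Gamma Phi^* = D, a diagonal matrix
assert np.allclose(D, np.diag(np.diag(D)), atol=1e-10)
print(np.diag(D))                            # prediction error variances: 1, then 1 - rho^2
```

For this Markovian toy example, only the lag-one coefficient is nonzero and every prediction error variance after the first equals $1 - \rho^2$; for a genuinely non-stationary series the same construction applies entry by entry.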
As byproducts, we can use the estimators of $\phi_{ij}$ to infer the structure of the covariance
and precision matrices of the non-stationary time series. In the first paper [42], we
consider two concrete hypothesis testing problems: one is to test whether $\{x_i\}_{i=1}^n$ is a
white noise process, and the other is to test whether its precision matrix is banded. In
the second paper [43], we test the stationarity of the correlation structure of the time series.
Chapter 2
Random matrices in high
dimensional statistics
In this chapter, we provide detailed proofs and computations on the eigen-structure of some
random matrix models and discuss their statistical applications, which were sketched as the
first part of our contributions in Section 1.4. We will list the detailed results and the key
proofs here. For a complete discussion, we refer to our papers [7, 8, 35, 36, 37, 38, 40, 41].
2.1 Universality of sample covariance matrices
2.1.1 Edge universality of sample covariance matrices
Sample covariance matrices with general populations. We consider the M1×M1
sample covariance matrix Q1 := TX(TX)∗, where T is a deterministic M1 ×M2 matrix
and X is a random M2 × N matrix. We assume X = (xij) has entries xij = N−1/2qij,
1 ≤ i ≤M2 and 1 ≤ j ≤ N , where qij are i.i.d. random variables satisfying
\[ \mathbb{E} q_{11} = 0, \qquad \mathbb{E}|q_{11}|^2 = 1. \tag{2.1.1} \]
In this subsection, we regard N as the fundamental (large) parameter and M1,2 ≡ M1,2(N)
as depending on N. We define M := min{M1, M2} and the aspect ratio dN := N/M.
Moreover, we assume that
dN → d ∈ (0,∞), as N →∞. (2.1.2)
For simplicity of notations, we will almost always abbreviate dN as d in this paper. We
denote the eigenvalues of Q1 in decreasing order by λ1(Q1) ≥ . . . ≥ λM1(Q1). We will
also need the N × N matrix Q2 := (TX)∗TX and denote its eigenvalues by λ1(Q2) ≥
. . . ≥ λN(Q2). Since Q1 and Q2 share the same nonzero eigenvalues, we will for simplicity
write λj, 1 ≤ j ≤ min{N, M1}, to denote the j-th eigenvalue of both Q1 and Q2 without
causing any confusion.
We assume that $T^*T$ is diagonal. In other words, $T$ has a singular value decomposition
$T = UD$, where $U$ is an $M_1 \times M_1$ unitary matrix and $D$ is an $M_1 \times M_2$ rectangular
diagonal matrix. Then it is equivalent to study the eigenvalues of $DX(DX)^*$. When
$M_1 \le M_2$ (i.e. $M = M_1$), we can write $D = (\bar{D}, 0)$, where $\bar{D}$ is an $M \times M$ diagonal
matrix such that $\bar{D}_{11} \ge \ldots \ge \bar{D}_{MM}$. Hence we have $DX = \bar{D}\bar{X}$, where $\bar{X}$ is the
upper $M \times N$ block of $X$ with i.i.d. entries $x_{ij}$, $1 \le i \le M$ and $1 \le j \le N$. On the
other hand, when $M_1 \ge M_2$ (i.e. $M = M_2$), we can write $D = \binom{\bar{D}}{0}$, where $\bar{D}$ is an
$M \times M$ diagonal matrix as above. Then $DX = \binom{\bar{D}X}{0}$, which shares the same nonzero
singular values with $\bar{D}X$. The above discussion shows that we can make the following
stronger assumption on $T$:
\[ M_1 = M_2 = M, \quad \text{and} \quad T \equiv D = \operatorname{diag}\big(\sigma_1^{1/2}, \sigma_2^{1/2}, \ldots, \sigma_M^{1/2}\big), \tag{2.1.3} \]
where
σ1 ≥ σ2 ≥ . . . ≥ σM ≥ 0.
Under the above assumption, the population covariance matrix of Q1 is defined as
Σ := EQ1 = D2 = diag (σ1, σ2, . . . , σM) . (2.1.4)
We denote the empirical spectral density of $\Sigma$ by
\[ \pi_N := \frac{1}{M} \sum_{i=1}^{M} \delta_{\sigma_i}. \tag{2.1.5} \]
We assume that there exists a small constant τ > 0 such that
\[ \sigma_1 \le \tau^{-1} \quad \text{and} \quad \pi_N([0, \tau]) \le 1 - \tau \quad \text{for all } N. \tag{2.1.6} \]
Note the first condition means that the operator norm of Σ is bounded by τ−1, and the
second condition means that the spectrum of Σ cannot concentrate at zero.
For definiteness, in this subsection we will focus on the real case, i.e. the random
variable q11 is real. However, we remark that our proof can be applied to the complex case
after minor modifications if we assume in addition that Re q11 and Im q11 are independent
centered random variables with variance 1/2.
We summarize our basic assumptions here for future reference.
Assumption 2.1.1. We assume that X is an M × N random matrix with real i.i.d.
entries satisfying (2.1.1) and (2.1.2). We assume that T is an M × M deterministic
diagonal matrix satisfying (2.1.3) and (2.1.6).
Deformed Marchenko-Pastur law. In this part, we will study the eigenvalue statis-
tics of Q1,2 through their Green functions or resolvents.
Definition 2.1.2 (Green functions). For z = E + iη ∈ C+, where C+ is the upper half
complex plane, we define the Green functions for Q1,2 as
G1(z) := (DXX∗D∗ − z)−1 , G2(z) := (X∗D∗DX − z)−1 . (2.1.7)
We denote the empirical spectral densities (ESD) of $Q_{1,2}$ as
\[ \rho_1^{(N)} := \frac{1}{M} \sum_{i=1}^{M} \delta_{\lambda_i(Q_1)}, \qquad \rho_2^{(N)} := \frac{1}{N} \sum_{i=1}^{N} \delta_{\lambda_i(Q_2)}. \]
Then the Stieltjes transforms of $\rho_{1,2}$ are given by
\[ m_1^{(N)}(z) := \int \frac{1}{x - z}\, \rho_1^{(N)}(dx) = \frac{1}{M} \operatorname{Tr} G_1(z), \qquad m_2^{(N)}(z) := \int \frac{1}{x - z}\, \rho_2^{(N)}(dx) = \frac{1}{N} \operatorname{Tr} G_2(z). \]
Throughout the rest of this subsection, we omit the super-index N from our notations.
Remark 2.1.3. Since the nonzero eigenvalues of Q1 and Q2 are identical, and Q1 has
M −N more (or N −M less) zero eigenvalues, we have
\[ \rho_1 = d\,\rho_2 + (1 - d)\,\delta_0, \tag{2.1.8} \]
and
\[ m_1(z) = -\frac{1 - d}{z} + d\, m_2(z). \tag{2.1.9} \]
In the case $D = I_{M \times M}$, it is well known that the ESD $\rho_2$ of $X^*X$ converges weakly
to the Marchenko-Pastur (MP) law [83]:
\[ \rho_{MP}(x)\,dx := \frac{1}{2\pi} \frac{\sqrt{[(\lambda_+ - x)(x - \lambda_-)]_+}}{x}\, dx, \tag{2.1.10} \]
where $\lambda_\pm = (1 \pm d^{-1/2})^2$. Moreover, $m_2(z)$ converges to the Stieltjes transform
$m_{MP}(z)$ of $\rho_{MP}$, which can be computed explicitly as
\[ m_{MP}(z) = \frac{d^{-1} - 1 - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2z}, \quad z \in \mathbb{C}_+. \tag{2.1.11} \]
Moreover, one can verify that $m_{MP}(z)$ satisfies the self-consistent equation [9, 98]
\[ \frac{1}{m_{MP}(z)} = -z + d^{-1} \frac{1}{1 + m_{MP}(z)}, \qquad \operatorname{Im} m_{MP}(z) \ge 0 \ \text{for } z \in \mathbb{C}_+. \tag{2.1.12} \]
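As a quick numerical sanity check (not part of the thesis), one can verify that the explicit formula (2.1.11), with the square-root branch chosen so that $\operatorname{Im} m_{MP} \ge 0$, indeed solves the self-consistent equation (2.1.12):

```python
import cmath

d = 2.0                                  # aspect ratio d = N/M
lam_p = (1 + d**-0.5) ** 2               # lambda_+
lam_m = (1 - d**-0.5) ** 2               # lambda_-

def m_mp(z):
    # explicit formula (2.1.11); the square-root branch is fixed by Im m >= 0
    s = cmath.sqrt((lam_p - z) * (z - lam_m))
    m = (1 / d - 1 - z + 1j * s) / (2 * z)
    if m.imag < 0:                       # pick the root lying in the upper half plane
        m = (1 / d - 1 - z - 1j * s) / (2 * z)
    return m

for z in (0.5 + 0.1j, 2.0 + 0.01j, 4.0 + 1.0j):
    m = m_mp(z)
    residual = 1 / m - (-z + (1 / d) / (1 + m))   # self-consistent equation (2.1.12)
    assert m.imag >= 0 and abs(residual) < 1e-10
```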
Using (2.1.8) and (2.1.9), it is easy to get the expressions for ρ1c, the asymptotic eigen-
value density of Q1, and m1c, the Stieltjes transform of ρ1c.
If $D$ is not the identity but the ESD $\pi_N$ in (2.1.5) converges weakly to some $\pi$, then it
was shown that the empirical eigenvalue distribution of $Q_2$ still converges in probability
to a deterministic distribution $\rho_{2c}$, referred to as the deformed Marchenko-Pastur
law below. It can be described through its Stieltjes transform
\[ m_{2c}(z) := \int_{\mathbb{R}} \frac{\rho_{2c}(dx)}{x - z}, \qquad z = E + i\eta \in \mathbb{C}_+. \]
For any given probability measure $\pi$ compactly supported on $\mathbb{R}_+$, we define $m_{2c}$ as the
unique solution to the self-consistent equation [98]
\[ \frac{1}{m_{2c}(z)} = -z + d^{-1} \int \frac{x}{1 + m_{2c}(z)x}\, \pi(dx), \tag{2.1.13} \]
where the branch-cut is chosen such that $\operatorname{Im} m_{2c}(z) \ge 0$ for $z \in \mathbb{C}_+$. It is well known
that the functional equation (2.1.13) has a unique solution that is uniformly bounded on
$\mathbb{C}_+$ under the assumptions (2.1.2) and (2.1.6). Letting $\eta \downarrow 0$, we can recover the
asymptotic eigenvalue density $\rho_{2c}$ with the inversion formula
\[ \rho_{2c}(E) = \lim_{\eta \downarrow 0} \frac{1}{\pi} \operatorname{Im} m_{2c}(E + i\eta). \tag{2.1.14} \]
The measure ρ2c is sometimes called the multiplicative free convolution of π and the MP
law. Again with (2.1.8) and (2.1.9), we can easily obtain m1c and ρ1c(z).
Similar to (2.1.13), for any finite $N$ we define $m_{2c}^{(N)}$ as the unique solution to the
self-consistent equation
\[ \frac{1}{m_{2c}^{(N)}(z)} = -z + d_N^{-1} \int \frac{x}{1 + m_{2c}^{(N)}(z)x}\, \pi_N(dx), \tag{2.1.15} \]
and define $\rho_{2c}^{(N)}$ through the inverse formula as in (2.1.14). Then we define $m_{1c}^{(N)}$
and $\rho_{1c}^{(N)}$ using (2.1.8) and (2.1.9). In the rest of this paper, we will always omit the
super-index $N$ from our notations. The properties of $m_{1c,2c}$ and $\rho_{1c,2c}$ have been
studied extensively. Here we collect some basic results that will be used in our proof. In
particular, we shall define the rightmost edge (i.e. the soft edge) of $\rho_{1c,2c}$.
Corresponding to the equation in (2.1.15), we define the function
\[ f(m) := -\frac{1}{m} + d_N^{-1} \int \frac{x}{1 + mx}\, \pi_N(dx). \tag{2.1.16} \]
Then $m_{2c}(z)$ can be characterized as the unique solution to the equation $z = f(m)$ with
$\operatorname{Im} m \ge 0$.
Lemma 2.1.4 (Support of the deformed MP law). The densities $\rho_{1c}$ and $\rho_{2c}$ have the
same support on $\mathbb{R}_+$, which is a union of connected components:
\[ \operatorname{supp} \rho_{1,2c} \cap (0, \infty) = \bigcup_{k=1}^{p} [a_{2k}, a_{2k-1}] \cap (0, \infty), \tag{2.1.17} \]
where $p \in \mathbb{N}$ depends only on $\pi_N$. Here the $a_k$ are characterized as follows: there
exists a real sequence $\{b_k\}_{k=1}^{2p}$ such that $(x, m) = (a_k, b_k)$ are the real solutions to
the equations
\[ x = f(m) \quad \text{and} \quad f'(m) = 0. \tag{2.1.18} \]
Moreover, we have $b_1 \in (-\sigma_1^{-1}, 0)$. Finally, under assumptions (2.1.2) and (2.1.6), we
have $a_1 \le C$ for some positive constant $C$.
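The characterization (2.1.18) is easy to use numerically. As a sketch (not from the thesis), for the identity $\Sigma$, i.e. $\pi_N = \delta_1$, solving $f'(m) = 0$ by bisection and evaluating $f$ recovers the Marchenko-Pastur edges $\lambda_\pm$; the critical point $b_1$ indeed lies in $(-\sigma_1^{-1}, 0) = (-1, 0)$, consistently with the lemma:

```python
d = 2.0            # identity Sigma, so pi_N = delta_1 and f(m) = -1/m + d^{-1}/(1+m)

def f(m):
    return -1 / m + (1 / d) / (1 + m)

def fp(m):         # f'(m) = 1/m^2 - d^{-1}/(1+m)^2
    return 1 / m**2 - (1 / d) / (1 + m) ** 2

def bisect(lo, hi, tol=1e-12):
    # bisection for f'(m) = 0 on an interval with a sign change
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fp(lo) * fp(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

b1 = bisect(-0.9, -0.5)     # critical point in (-sigma_1^{-1}, 0), as in the lemma
b2 = bisect(-4.0, -3.0)     # second critical point, below -1
edges = sorted([f(b1), f(b2)])
print(edges)                # recovers the MP edges [(1 - d**-0.5)**2, (1 + d**-0.5)**2]
```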
It is easy to observe that m2c(ak) = bk according to the definition of f . We shall call
ak the edges of the deformed MP law ρ2c. In particular we will focus on the rightmost
edge λr := a1. To establish our result, we need the following extra assumption.
Assumption 2.1.5. For σ1 defined in (2.1.3), we assume that there exists a small con-
stant τ > 0 such that
\[ |1 + m_{2c}(\lambda_r)\sigma_1| \ge \tau, \quad \text{for all } N. \tag{2.1.19} \]
Remark 2.1.6. The above assumption guarantees a regular square-root behavior of the
spectral density ρ2c near λr (see Lemma 2.1.15 below), which is used in proving the local
deformed MP law at the soft edge. Note that $f(m)$ has singularities at $m = -\sigma_i^{-1}$ for
nonzero σi, so the condition (2.1.19) simply rules out the singularity of f at m2c(λr).
Main result. The main result of this paper is the following theorem. It establishes the
necessary and sufficient condition for the edge universality of the deformed covariance
matrix Q2 at the soft edge λr. We define the following tail condition for the entries of
X:
\[ \lim_{s \to \infty} s^4\, \mathbb{P}(|q_{11}| \ge s) = 0. \tag{2.1.20} \]
Theorem 2.1.7. Let $Q_2 = X^*T^*TX$ be an $N \times N$ sample covariance matrix with $X$
and $T$ satisfying Assumptions 2.1.1 and 2.1.5. Let $\lambda_1$ be the largest eigenvalue of $Q_2$.
• Sufficient condition: If the tail condition (2.1.20) holds, then we have
\[ \lim_{N \to \infty} \mathbb{P}\big(N^{2/3}(\lambda_1 - \lambda_r) \le s\big) = \lim_{N \to \infty} \mathbb{P}^G\big(N^{2/3}(\lambda_1 - \lambda_r) \le s\big), \tag{2.1.21} \]
for all $s \in \mathbb{R}$, where $\mathbb{P}^G$ denotes the law for $X$ with i.i.d. Gaussian entries.
• Necessary condition: If the condition (2.1.20) does not hold for $X$, then for
any fixed $s > \lambda_r$, we have
\[ \limsup_{N \to \infty} \mathbb{P}(\lambda_1 \ge s) > 0. \tag{2.1.22} \]
Remark 2.1.8. In [79], it was proved that there exists $\gamma_0 \equiv \gamma_0(N)$, depending only on
$\pi_N$ and the aspect ratio $d_N$, such that
\[ \lim_{N \to \infty} \mathbb{P}^G\big(\gamma_0 N^{2/3}(\lambda_1 - \lambda_r) \le s\big) = F_1(s) \]
for all $s \in \mathbb{R}$, where $F_1$ is the type-1 Tracy-Widom distribution. The scaling factor
$\gamma_0$ is given by [69]
\[ \frac{1}{\gamma_0^3} = \frac{1}{d} \int \left(\frac{x}{1 + m_{2c}(\lambda_r)x}\right)^3 \pi_N(dx) - \frac{1}{m_{2c}(\lambda_r)^3}, \]
and Assumption 2.1.5 assures that $\gamma_0 \sim 1$ for all $N$. Hence (2.1.21) and (2.1.22) together
show that the distribution of the rescaled largest eigenvalue of $Q_2$ converges to the Tracy-
Widom distribution if and only if the condition (2.1.20) holds.
Remark 2.1.9. The universality result (2.1.21) can be extended to the joint distribution
of the $k$ largest eigenvalues for any fixed $k$:
\[ \lim_{N \to \infty} \mathbb{P}\Big(\big(N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big) = \lim_{N \to \infty} \mathbb{P}^G\Big(\big(N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big), \tag{2.1.23} \]
for all $s_1, s_2, \ldots, s_k \in \mathbb{R}$. Let $H^{GOE}$ be an $N \times N$ random matrix belonging to
the Gaussian orthogonal ensemble. The joint distribution of the $k$ largest eigenvalues of
$H^{GOE}$, $\mu_1^{GOE} \ge \ldots \ge \mu_k^{GOE}$, can be written in terms of the Airy kernel for any
fixed $k$, and
\[ \lim_{N \to \infty} \mathbb{P}^G\Big(\big(\gamma_0 N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big) = \lim_{N \to \infty} \mathbb{P}\Big(\big(N^{2/3}(\mu_i^{GOE} - 2) \le s_i\big)_{1 \le i \le k}\Big), \]
for all $s_1, s_2, \ldots, s_k \in \mathbb{R}$. Hence (2.1.23) gives a complete description of the finite-
dimensional correlation functions of the largest eigenvalues of $Q_2$.
Notations. Following the notations in [49, 50], we will use the following definition to
characterize events of high probability.
Definition 2.1.10 (High probability event). Define
\[ \varphi := (\log N)^{\log \log N}. \tag{2.1.24} \]
We say that an $N$-dependent event $\Omega$ holds with $\xi$-high probability if there exist
constants $c, C > 0$ independent of $N$ such that
\[ \mathbb{P}(\Omega) \ge 1 - N^C \exp(-c\varphi^\xi) \tag{2.1.25} \]
for all sufficiently large $N$. For simplicity, in the case $\xi = 1$ we just say high probability.
Note that if (2.1.25) holds, then $\mathbb{P}(\Omega) \ge 1 - \exp(-c'\varphi^\xi)$ for any constant
$0 \le c' < c$.
Definition 2.1.11 (Bounded support condition). A family of $M \times N$ matrices $X = (x_{ij})$
is said to satisfy the bounded support condition with $q \equiv q(N)$ if
\[ \mathbb{P}\Big( \max_{1 \le i \le M,\, 1 \le j \le N} |x_{ij}| \le q \Big) \ge 1 - e^{-N^c}, \tag{2.1.26} \]
for some $c > 0$. Here $q \equiv q(N)$ depends on $N$ and usually satisfies
\[ N^{-1/2} \log N \le q \le N^{-\phi} \]
for some small constant $\phi > 0$. Whenever (2.1.26) holds, we say that $X$ has support $q$.
Remark 2.1.12. Note that the Gaussian distribution satisfies the condition (2.1.26) with
$q < N^{-\phi}$ for any $\phi < 1/2$. We also remark that if (2.1.26) holds, then the event
$\{|x_{ij}| \le q,\ \forall\, 1 \le i \le M,\ 1 \le j \le N\}$ holds with $\xi$-high probability for any fixed
$\xi > 0$ according to Definition 2.1.10. For this reason, the bad event $\{|x_{ij}| > q$ for some
$i, j\}$ is negligible, and we will not consider the case where it happens throughout the proof.
Next we introduce a convenient self-adjoint linearization trick, which has been proved
to be useful in studying the local laws of random matrices of the $A^*A$ type. We define
the following $(N+M) \times (N+M)$ block matrix, which is a linear function of $X$.
Definition 2.1.13 (Linearizing block matrix). For $z \in \mathbb{C}_+$, we define the
$(N+M) \times (N+M)$ block matrix
\[ H \equiv H(X) := \begin{pmatrix} 0 & DX \\ (DX)^* & 0 \end{pmatrix}, \tag{2.1.27} \]
and its Green function
\[ G \equiv G(X, z) := \begin{pmatrix} -I_{M \times M} & DX \\ (DX)^* & -zI_{N \times N} \end{pmatrix}^{-1}. \tag{2.1.28} \]
Definition 2.1.14 (Index sets). We define the index sets
\[ \mathcal{I}_1 := \{1, \ldots, M\}, \qquad \mathcal{I}_2 := \{M+1, \ldots, M+N\}, \qquad \mathcal{I} := \mathcal{I}_1 \cup \mathcal{I}_2. \]
Then we label the indices of the matrices according to
\[ X = (X_{i\mu} : i \in \mathcal{I}_1,\ \mu \in \mathcal{I}_2) \quad \text{and} \quad D = \operatorname{diag}(D_{ii} : i \in \mathcal{I}_1). \]
In the rest of this paper, whenever referring to the entries of $H$ and $G$, we will consistently
use the Latin letters $i, j \in \mathcal{I}_1$, Greek letters $\mu, \nu \in \mathcal{I}_2$, and $a, b \in \mathcal{I}$. For
$1 \le i \le \min\{N, M\}$ and $M+1 \le \mu \le M + \min\{N, M\}$, we introduce the notations
$\bar{i} := i + M \in \mathcal{I}_2$ and $\bar{\mu} := \mu - M \in \mathcal{I}_1$. For any $\mathcal{I} \times \mathcal{I}$ matrix $A$, we define
the following $2 \times 2$ submatrices
\[ A_{[ij]} = \begin{pmatrix} A_{ij} & A_{i\bar{j}} \\ A_{\bar{i}j} & A_{\bar{i}\bar{j}} \end{pmatrix}, \quad 1 \le i, j \le \min\{N, M\}. \tag{2.1.29} \]
We shall call $A_{[ij]}$ a diagonal group if $i = j$, and an off-diagonal group otherwise.
It is easy to verify that the eigenvalues $\lambda_1(H) \ge \ldots \ge \lambda_{M+N}(H)$ of $H$ are related
to the ones of $Q_2$ through
\[ \lambda_i(H) = -\lambda_{N+M-i+1}(H) = \sqrt{\lambda_i(Q_2)}, \quad 1 \le i \le N \wedge M, \tag{2.1.30} \]
and
\[ \lambda_i(H) = 0, \quad N \wedge M + 1 \le i \le N \vee M, \]
where we used the notations $N \wedge M := \min\{N, M\}$ and $N \vee M := \max\{N, M\}$.
Furthermore, by the Schur complement formula, we can verify that
\[ G = \begin{pmatrix} z(DXX^*D^* - z)^{-1} & (DXX^*D^* - z)^{-1}DX \\ X^*D^*(DXX^*D^* - z)^{-1} & (X^*D^*DX - z)^{-1} \end{pmatrix} = \begin{pmatrix} zG_1 & G_1DX \\ X^*D^*G_1 & G_2 \end{pmatrix} = \begin{pmatrix} zG_1 & DXG_2 \\ G_2X^*D^* & G_2 \end{pmatrix}. \tag{2.1.31} \]
Thus a control of $G$ directly yields a control of the resolvents $G_{1,2}$ defined in (2.1.7).
By (2.1.31), we immediately get that
\[ m_1 = \frac{1}{Mz} \sum_{i \in \mathcal{I}_1} G_{ii}, \qquad m_2 = \frac{1}{N} \sum_{\mu \in \mathcal{I}_2} G_{\mu\mu}. \]
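Both the spectral correspondence (2.1.30) and the Schur complement identity (2.1.31) can be verified on a small random instance (a toy check, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 6
D = np.diag(rng.uniform(0.5, 2.0, M))                # toy diagonal D
Y = D @ (rng.standard_normal((M, N)) / np.sqrt(N))   # Y = DX

# linearizing block matrix (2.1.27): nonzero eigenvalues are +/- singular values of DX
H = np.block([[np.zeros((M, M)), Y], [Y.T, np.zeros((N, N))]])
eig_H = np.sort(np.linalg.eigvalsh(H))
sv = np.sort(np.linalg.svd(Y, compute_uv=False))
assert np.allclose(eig_H[-M:], sv) and np.allclose(eig_H[:M], -sv[::-1])

# Schur complement identity (2.1.31): the lower-right block of G is G2
z = 0.3 + 0.7j
G = np.linalg.inv(np.block([[-np.eye(M), Y], [Y.T, -z * np.eye(N)]]))
G2 = np.linalg.inv(Y.T @ Y - z * np.eye(N))
assert np.allclose(G[M:, M:], G2)
```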
Next we introduce the spectral decomposition of $G$. Let
\[ DX = \sum_{k=1}^{N \wedge M} \sqrt{\lambda_k}\, \xi_k \zeta_k^* \]
be a singular value decomposition of $DX$, where
\[ \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{N \wedge M} \ge 0 = \lambda_{N \wedge M + 1} = \ldots = \lambda_{N \vee M}, \]
and $\{\xi_k\}_{k=1}^{M}$ and $\{\zeta_k\}_{k=1}^{N}$ are orthonormal bases of $\mathbb{R}^{\mathcal{I}_1}$ and $\mathbb{R}^{\mathcal{I}_2}$,
respectively. Then using (2.1.31), we can get that for $i, j \in \mathcal{I}_1$ and $\mu, \nu \in \mathcal{I}_2$,
\[ G_{ij} = \sum_{k=1}^{M} \frac{z\, \xi_k(i)\xi_k^*(j)}{\lambda_k - z}, \qquad G_{\mu\nu} = \sum_{k=1}^{N} \frac{\zeta_k(\mu)\zeta_k^*(\nu)}{\lambda_k - z}, \tag{2.1.32} \]
\[ G_{i\mu} = \sum_{k=1}^{N \wedge M} \frac{\sqrt{\lambda_k}\, \xi_k(i)\zeta_k^*(\mu)}{\lambda_k - z}, \qquad G_{\mu i} = \sum_{k=1}^{N \wedge M} \frac{\sqrt{\lambda_k}\, \zeta_k(\mu)\xi_k^*(i)}{\lambda_k - z}. \tag{2.1.33} \]
Main tools. For a small constant $c_0 > 0$ and large constants $C_0, C_1 > 0$, we define a
domain of the spectral parameter $z = E + i\eta$ as
\[ S(c_0, C_0, C_1) := \Big\{ z = E + i\eta : \lambda_r - c_0 \le E \le C_0\lambda_r,\ \frac{\varphi^{C_1}}{N} \le \eta \le 1 \Big\}. \tag{2.1.34} \]
We define the distance to the rightmost edge as
\[ \kappa \equiv \kappa_E := |E - \lambda_r|, \quad \text{for } z = E + i\eta. \tag{2.1.35} \]
Then we have the following lemma, which summarizes some basic properties of m2c and
ρ2c.
Lemma 2.1.15. There exists a sufficiently small constant $c > 0$ such that
\[ \rho_{2c}(x) \sim \sqrt{\lambda_r - x}, \quad \text{for all } x \in [\lambda_r - 2c, \lambda_r]. \tag{2.1.36} \]
The Stieltjes transform $m_{2c}$ satisfies
\[ |m_{2c}(z)| \sim 1 \tag{2.1.37} \]
and
\[ \operatorname{Im} m_{2c}(z) \sim \begin{cases} \eta/\sqrt{\kappa + \eta}, & E \ge \lambda_r, \\ \sqrt{\kappa + \eta}, & E \le \lambda_r, \end{cases} \tag{2.1.38} \]
for $z = E + i\eta \in S(c, C_0, -\infty)$.
Remark 2.1.16. Recall that ak are the edges of the spectral density ρ2c; see (2.1.17).
Hence ρ2c(ak) = 0, and we must have ak < λr − 2c for 2 ≤ k ≤ 2p. In particular,
S(c0, C0, C1) is away from all the other edges if we choose c0 ≤ c.
Definition 2.1.17 (Classical locations of eigenvalues). The classical location $\gamma_j$ of the
$j$-th eigenvalue of $Q_2$ is defined as
\[ \gamma_j := \sup_x \Big\{ \int_x^{+\infty} \rho_{2c}(x)\,dx > \frac{j-1}{N} \Big\}. \tag{2.1.39} \]
In particular, we have $\gamma_1 = \lambda_r$.
Remark 2.1.18. If $\gamma_j$ lies in the bulk of $\rho_{2c}$, then by the positivity of $\rho_{2c}$ we can
define $\gamma_j$ through the equation
\[ \int_{\gamma_j}^{+\infty} \rho_{2c}(x)\,dx = \frac{j-1}{N}. \]
We can also define the classical location of the $j$-th eigenvalue of $Q_1$ by changing $\rho_{2c}$
to $\rho_{1c}$ and $(j-1)/N$ to $(j-1)/M$ in (2.1.39). By (2.1.8), this gives the same location
as $\gamma_j$ for $j \le N \wedge M$.
Definition 2.1.19 (Deterministic limit of $G$). We define the deterministic limit $\Pi$ of the
Green function $G$ in (2.1.28) as
\[ \Pi(z) := \begin{pmatrix} -(1 + m_{2c}(z)\Sigma)^{-1} & 0 \\ 0 & m_{2c}(z) I_{N \times N} \end{pmatrix}, \tag{2.1.40} \]
where $\Sigma$ is defined in (2.1.4).
In the rest of this section, we present some results that will be used in the proof of
Theorem 2.1.7. Their proofs will be given in subsequent sections.
Lemma 2.1.20 (Local deformed MP law). Suppose Assumptions 2.1.1 and 2.1.5 hold,
and suppose $X$ satisfies the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some
constant $\phi > 0$. Fix $C_0 > 0$ and let $c_1 > 0$ be a sufficiently small constant. Then
there exist constants $C_1 > 0$ and $\xi_1 \ge 3$ such that the following events hold with
$\xi_1$-high probability:
\[ \bigcap_{z \in S(2c_1, C_0, C_1)} \Big\{ |m_2(z) - m_{2c}(z)| \le \varphi^{C_1} \Big( \min\Big\{ q, \frac{q^2}{\sqrt{\kappa + \eta}} \Big\} + \frac{1}{N\eta} \Big) \Big\}, \tag{2.1.41} \]
\[ \bigcap_{z \in S(2c_1, C_0, C_1)} \Big\{ \max_{a, b \in \mathcal{I}} |G_{ab}(z) - \Pi_{ab}(z)| \le \varphi^{C_1} \Big( q + \sqrt{\frac{\operatorname{Im} m_{2c}(z)}{N\eta}} + \frac{1}{N\eta} \Big) \Big\}, \tag{2.1.42} \]
\[ \Big\{ \|H\|^2 \le \lambda_r + \varphi^{C_1}\big(q^2 + N^{-2/3}\big) \Big\}. \tag{2.1.43} \]
The estimates in (2.1.41) and (2.1.42) are usually referred to as the averaged local
law and entrywise local law, respectively. For completeness, we will give a concise proof
in the end that fits into our setting. The local laws (2.1.41) and (2.1.42) can be used
to derive some important properties of the eigenvectors and eigenvalues of the random
matrices. For instance, they lead to the following results about the delocalization of
eigenvectors and the rigidity of eigenvalues. Note that (2.1.44) gives an almost optimal
estimate on the flatness of the singular vectors of DX, while (2.1.45) gives some quite
precise information on the locations of the singular values of DX.
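One observable consequence of the averaged local law (2.1.41) is that $m_2(z)$ is essentially deterministic once $\eta \gg N^{-1}$: two independent samples must produce nearly the same value. A minimal numerical sketch follows (Gaussian null case with $D = I$; the ensemble, sizes, and spectral parameter are our choices, not the thesis's):

```python
import numpy as np

def m2(X, z):
    """m_2(z) = N^{-1} Tr (X^*X - z)^{-1}, the Stieltjes transform appearing in (2.1.41)."""
    evals = np.linalg.eigvalsh(X.T @ X)
    return np.mean(1.0 / (evals - z))

rng = np.random.default_rng(0)
M, N = 300, 600
z = 2.0 + 0.05j                      # bulk energy, eta = 0.05 >> 1/N
m_a = m2(rng.standard_normal((M, N)) / np.sqrt(N), z)
m_b = m2(rng.standard_normal((M, N)) / np.sqrt(N), z)
spread = abs(m_a - m_b)              # expected to be O(1/(N*eta)), here about 0.03
```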
Lemma 2.1.21. Suppose the events (2.1.41) and (2.1.42) hold with $\xi_1$-high probability. Then there exists a constant $C_1' > 0$ such that the following events hold with $\xi_1$-high probability:

(1) Delocalization:
\[
\bigcap_{k :\, \lambda_r - c_1 \le \gamma_k \le \lambda_r} \left\{ \max_i |\xi_k(i)|^2 + \max_\mu |\zeta_k(\mu)|^2 \le \frac{\varphi^{C_1'}}{N} \right\}; \tag{2.1.44}
\]

(2) Rigidity of eigenvalues: if $q \le N^{-\phi}$ for some constant $\phi > 1/3$,
\[
\bigcap_{j :\, \lambda_r - c_1 \le \gamma_j \le \lambda_r} \left\{ |\lambda_j - \gamma_j| \le \varphi^{C_1'}\left( j^{-1/3} N^{-2/3} + q^2 \right) \right\}, \tag{2.1.45}
\]
where λj is the j-th eigenvalue of (DX)∗DX and γj is defined in (2.1.39).
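Both phenomena are easy to observe in simulation. The sketch below (Gaussian entries, $D = I$, sizes of our choosing) checks that the top eigenvalue sits near the Marchenko–Pastur edge and that the top singular vector is flat up to logarithmic factors, in the spirit of (2.1.44) and (2.1.45):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200, 400                           # aspect ratio M/N = 1/2 (our choice)
X = rng.standard_normal((M, N)) / np.sqrt(N)
evals, evecs = np.linalg.eigh(X @ X.T)    # nonzero spectrum of X^*X

# Rigidity: lambda_1 sits within O(N^{-2/3}) of the MP edge (1 + sqrt(M/N))^2.
edge = (1 + np.sqrt(M / N)) ** 2
gap = abs(evals[-1] - edge)

# Delocalization: each entry of the top singular vector carries about 1/M of
# the mass, up to logarithmic factors.
flatness = M * np.max(np.abs(evecs[:, -1]) ** 2)
```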
With Lemma 2.1.20, Lemma 2.1.21 and a standard Green function comparison method,
one can prove the following edge universality result when the support q is small.
Lemma 2.1.22. Let $X^W$ and $X^V$ be two sample covariance matrices satisfying the assumptions in Lemma 2.1.20. Moreover, suppose $q \le \varphi^C N^{-1/2}$ for some constant $C > 0$. Then there exist constants $\varepsilon, \delta > 0$ such that, for any $s \in \mathbb{R}$, we have
\[
\mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^W\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.46}
\]
where PV and PW denote the laws of XV and XW , respectively.
For any matrix $X$ satisfying Assumption 2.1.1 and the tail condition (2.1.135), we can construct a matrix $X_1$ that approximates $X$ with probability $1 - o(1)$, satisfies Assumption 2.1.1 and the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some small $\phi > 0$, and obeys
\[
\mathbb{E}|x_{ij}|^3 \le B N^{-3/2}, \qquad \mathbb{E}|x_{ij}|^4 \le B(\log N) N^{-2}, \tag{2.1.47}
\]
for some constant $B > 0$. We will need the following local law, eigenvalue rigidity, and edge universality results for covariance matrices with large support satisfying condition (2.1.47).
Theorem 2.1.23 (Rigidity of eigenvalues: large support case). Suppose Assumptions 2.1.1 and 2.1.5 hold. Suppose $X$ satisfies the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some constant $\phi > 0$, together with condition (2.1.47). Fix the constants $c_1$, $C_0$, $C_1$, and $\xi_1$ as given in Lemma 2.1.20. Then there exists a constant $C_2 > 0$, depending only on $c_1$, $C_1$, $B$ and $\phi$, such that with high probability we have
\[
\max_{z \in S(c_1, C_0, C_2)} |m_2(z) - m_{2c}(z)| \le \frac{\varphi^{C_2}}{N\eta}, \tag{2.1.48}
\]
for sufficiently large $N$. Moreover, (2.1.48) implies that for some constant $C > 0$, the following events hold with high probability:
\[
\bigcap_{j :\, \lambda_r - c_1 \le \gamma_j \le \lambda_r} \left\{ |\lambda_j - \gamma_j| \le \varphi^{C} j^{-1/3} N^{-2/3} \right\} \tag{2.1.49}
\]
and
\[
\sup_{E \ge \lambda_r - c_1} |n(E) - n_c(E)| \le \frac{\varphi^{C}}{N}, \tag{2.1.50}
\]
where
\[
n(E) := \frac{1}{N} \#\{\lambda_j \ge E\}, \qquad n_c(E) := \int_E^{+\infty} \rho_{2c}(x)\,\mathrm{d}x. \tag{2.1.51}
\]
Theorem 2.1.24. Let $X^W$ and $X^V$ be two i.i.d. sample covariance matrices satisfying the assumptions in Theorem 2.1.23. Then there exist constants $\varepsilon, \delta > 0$ such that, for any $s \in \mathbb{R}$, we have
\[
\mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^W\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.52}
\]
where $\mathbb{P}^V$ and $\mathbb{P}^W$ denote the laws of $X^V$ and $X^W$, respectively.
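Edge universality as in (2.1.52) can be probed by Monte Carlo: the largest eigenvalue should have nearly the same law for two entry distributions with matching low moments. A rough sketch (Gaussian versus symmetric Bernoulli entries; sizes, seeds, and trial counts are our choices, and this is an illustration rather than a proof-grade check):

```python
import numpy as np

def top_eval(rng, M, N, bernoulli):
    """Largest eigenvalue of X^*X for Gaussian or (+/-1) Bernoulli entries of variance 1/N."""
    if bernoulli:
        X = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(N)
    else:
        X = rng.standard_normal((M, N)) / np.sqrt(N)
    return np.linalg.eigvalsh(X @ X.T)[-1]

rng = np.random.default_rng(2)
M, N, trials = 100, 200, 200
lam_g = np.array([top_eval(rng, M, N, False) for _ in range(trials)])
lam_b = np.array([top_eval(rng, M, N, True) for _ in range(trials)])
shift = abs(lam_g.mean() - lam_b.mean())   # universality: should be small
```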
Lemma 2.1.25 (Bounds on $G_{ij}$: large support case). Let $X$ be a sample covariance matrix satisfying the assumptions in Theorem 2.1.23. Then for any $0 < c < 1$ and $z \in S(c_1, C_0, C_2) \cap \{z = E + i\eta : \eta \ge N^{-1+c}\}$, we have the following weak bound:
\[
\mathbb{E}|G_{ab}(z)|^2 \le \varphi^{C_3} \left( \frac{\operatorname{Im} m_{2c}(z)}{N\eta} + \frac{1}{(N\eta)^2} \right), \qquad a \ne b, \tag{2.1.53}
\]
for some constant $C_3 > 0$.
In proving Theorem 2.1.23, Theorem 2.1.24 and Lemma 2.1.25, we will make use of the results in Lemmas 2.1.20–2.1.22 for covariance matrices with small support. In fact, given any matrix $X$ satisfying the assumptions in Theorem 2.1.23, we can construct a matrix $\widetilde{X}$ having the same first four moments as $X$ but with smaller support $q = O(N^{-1/2}\log N)$.

Lemma 2.1.26. Suppose $X$ satisfies the assumptions in Theorem 2.1.23. Then there exists another matrix $\widetilde{X} = (\widetilde{x}_{ij})$ such that $\widetilde{X}$ satisfies the bounded support condition (2.1.26) with $q = O(N^{-1/2}\log N)$, and the first four moments of the entries of $X$ and $\widetilde{X}$ match, i.e.
\[
\mathbb{E}\widetilde{x}_{ij}^{\,k} = \mathbb{E}x_{ij}^k, \qquad k = 1, 2, 3, 4. \tag{2.1.54}
\]

From Lemmas 2.1.20–2.1.22, we see that Theorem 2.1.23, Theorem 2.1.24 and Lemma 2.1.25 hold for $\widetilde{X}$. Then, due to (2.1.54), we expect that $X$ has "similar properties" to $\widetilde{X}$, so that these results also hold for $X$. This will be proved with a Green function comparison
Proof of the main result. In this part, we prove Theorem 2.1.7 using the results above.
We begin by proving the necessity part.
Proof of the necessity. Assume that $\lim_{s\to\infty} s^4\, \mathbb{P}(|q_{11}| \ge s) \ne 0$. Then we can find a constant $0 < c_0 < 1/2$ and a sequence $r_n$ such that $r_n \to \infty$ as $n \to \infty$ and
\[
\mathbb{P}(|q_{ij}| \ge r_n) \ge c_0 r_n^{-4}. \tag{2.1.55}
\]
Fix any $s > \lambda_r$. We denote $L := \lfloor \tau M \rfloor$, $I := \sqrt{\tau^{-1} s}$, and define the event
\[
\Gamma_N := \left\{ \text{there exist } i \text{ and } j,\ 1 \le i \le L,\ 1 \le j \le N, \text{ such that } |x_{ij}| \ge I \right\}.
\]
We first show that $\lambda_1(Q_2) \ge s$ when $\Gamma_N$ holds. Suppose $|x_{ij}| \ge I$ for some $1 \le i \le L$ and $1 \le j \le N$. Let $u \in \mathbb{R}^N$ be such that $u(k) = \delta_{kj}$. By assumption (2.1.6), we have $\sigma_i \ge \tau$ for $i \le L$. Hence
\[
\lambda_1(Q_2) \ge \langle u, (DX)^*(DX) u \rangle = \sum_{k=1}^{M} \sigma_k x_{kj}^2 \ge \sigma_i x_{ij}^2 \ge \tau \left( \sqrt{\tau^{-1} s} \right)^2 = s.
\]
Now we choose $N \in \{\lfloor (r_n/I)^2 \rfloor : n \in \mathbb{N}\}$. With the choice $N = \lfloor (r_n/I)^2 \rfloor$, we have
\[
1 - \mathbb{P}(\Gamma_N) = \left( 1 - \mathbb{P}(|x_{11}| \ge I) \right)^{NL} \le \left( 1 - \mathbb{P}(|q_{11}| \ge r_n) \right)^{NL} \le \left( 1 - c_0 r_n^{-4} \right)^{NL} \le \left( 1 - c_1 N^{-2} \right)^{c_2 N^2}, \tag{2.1.56}
\]
for some constant $c_1 > 0$ depending on $c_0$ and $I$, and some constant $c_2 > 0$ depending on $\tau$ and $d$. Since $(1 - c_1 N^{-2})^{c_2 N^2} \le c_3$ for some constant $0 < c_3 < 1$ independent of $N$, the above inequality shows that $\mathbb{P}(\Gamma_N) \ge 1 - c_3 > 0$. This shows that $\limsup_{N\to\infty} \mathbb{P}(\Gamma_N) > 0$
and concludes the proof.
Proof of the sufficiency. Given the matrix $X$ satisfying Assumption 2.1.1 and the tail condition (2.1.135), we introduce a cutoff on its matrix entries at the level $N^{-\epsilon}$. For any fixed $\epsilon > 0$, define
\[
\alpha_N := \mathbb{P}\left( |q_{11}| > N^{1/2-\epsilon} \right), \qquad \beta_N := \mathbb{E}\left[ \mathbf{1}\left( |q_{11}| > N^{1/2-\epsilon} \right) q_{11} \right].
\]
By (2.1.135) and integration by parts, we get that for any $\delta > 0$ and large enough $N$,
\[
\alpha_N \le \delta N^{-2+4\epsilon}, \qquad |\beta_N| \le \delta N^{-3/2+3\epsilon}. \tag{2.1.57}
\]
Let $\rho(x)$ be the distribution density of $q_{11}$. Then we define independent random variables $q^s_{ij}$, $q^l_{ij}$, $c_{ij}$, $1 \le i \le M$ and $1 \le j \le N$, in the following ways:

• $q^s_{ij}$ has distribution density $\rho_s(x)$, where
\[
\rho_s(x) = \mathbf{1}\left( \left| x - \frac{\beta_N}{1-\alpha_N} \right| \le N^{1/2-\epsilon} \right) \frac{\rho\left( x - \frac{\beta_N}{1-\alpha_N} \right)}{1-\alpha_N}; \tag{2.1.58}
\]

• $q^l_{ij}$ has distribution density $\rho_l(x)$, where
\[
\rho_l(x) = \mathbf{1}\left( \left| x - \frac{\beta_N}{1-\alpha_N} \right| > N^{1/2-\epsilon} \right) \frac{\rho\left( x - \frac{\beta_N}{1-\alpha_N} \right)}{\alpha_N}; \tag{2.1.59}
\]

• $c_{ij}$ is a Bernoulli $0$–$1$ random variable with $\mathbb{P}(c_{ij} = 1) = \alpha_N$ and $\mathbb{P}(c_{ij} = 0) = 1 - \alpha_N$.
Let $X^s$, $X^l$ and $X^c$ be random matrices with $X^s_{ij} = N^{-1/2} q^s_{ij}$, $X^l_{ij} = N^{-1/2} q^l_{ij}$ and $X^c_{ij} = c_{ij}$. By (2.1.58), (2.1.59) and the fact that $X^c_{ij}$ is Bernoulli, it is easy to check that for independent $X^s$, $X^l$ and $X^c$,
\[
X_{ij} \overset{d}{=} X^s_{ij}\left( 1 - X^c_{ij} \right) + X^l_{ij} X^c_{ij} - \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N}, \tag{2.1.60}
\]
where by (2.1.57), we have
\[
\left| \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N} \right| \le 2\delta N^{-2+3\epsilon}.
\]
Therefore, if we define the $M \times N$ matrix $Y = (Y_{ij})$ by
\[
Y_{ij} = \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N} \quad \text{for all } i \text{ and } j,
\]
then $\|Y\| \le c N^{-1+3\epsilon}$ for some constant $c > 0$. In the proof below, one will see that $\|D(X+Y)\| = \lambda_1^{1/2}\left( (X+Y)^* D^* D (X+Y) \right) = O(1)$ with probability $1 - o(1)$, where $\lambda_1(\cdot)$ denotes the largest eigenvalue of the random matrix. Then it is easy to verify that with probability $1 - o(1)$,
\[
\left| \lambda_1\left( (X+Y)^* D^* D (X+Y) \right) - \lambda_1\left( X^* D^* D X \right) \right| = O\left( N^{-1+3\epsilon} \right). \tag{2.1.61}
\]
Thus the deterministic part in (2.1.60) is negligible under the scaling $N^{2/3}$.
By (2.1.135) and integration by parts, it is easy to check that
\[
\mathbb{E} q^s_{11} = 0, \quad \mathbb{E}|q^s_{11}|^2 = 1 - O(N^{-1+2\epsilon}), \quad \mathbb{E}|q^s_{11}|^3 = O(1), \quad \mathbb{E}|q^s_{11}|^4 = O(\log N). \tag{2.1.62}
\]
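In words, $q^s$ in (2.1.58) is $q$ conditioned on a bounded window and then recentred, which is why the first moment vanishes exactly. A quick Monte-Carlo sanity check of the first two lines of (2.1.62); the Student-$t$ stand-in for the law of $q_{11}$ and the fixed cutoff level playing the role of $N^{1/2-\epsilon}$ are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 1_000_000, 5.0                 # sample size; T stands in for N^{1/2-eps}
q = rng.standard_t(df=9, size=n) / np.sqrt(9 / 7)   # unit variance, all moments finite

alpha = np.mean(np.abs(q) > T)                      # empirical alpha_N
beta = np.mean(np.where(np.abs(q) > T, q, 0.0))     # empirical beta_N
shift = beta / (1 - alpha)
q_small = q[np.abs(q) <= T] + shift   # the small part q^s, recentred as in (2.1.58)
```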
We note that $X_1 := (\mathbb{E}|q^s_{ij}|^2)^{-1/2} X^s$ is a matrix that satisfies the assumptions for $X$ in Theorem 2.1.24. Together with the estimate for $\mathbb{E}|q^s_{ij}|^2$ in (2.1.62), we conclude that there exist constants $\varepsilon, \delta > 0$ such that for any $s \in \mathbb{R}$,
\[
\mathbb{P}^G\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^s\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^G\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.63}
\]
where $\mathbb{P}^s$ denotes the law of $X^s$ and $\mathbb{P}^G$ denotes the law of a Gaussian covariance matrix.
Now we write the first two terms on the right-hand side of (2.1.60) as
\[
X^s_{ij}(1 - X^c_{ij}) + X^l_{ij} X^c_{ij} = X^s_{ij} + R_{ij} X^c_{ij},
\]
where $R_{ij} := X^l_{ij} - X^s_{ij}$. It remains to show that the effect of the $R_{ij} X^c_{ij}$ terms on $\lambda_1$ is negligible. We denote the corresponding matrix by $R^c := (R_{ij} X^c_{ij})$. Note that $X^c_{ij}$ is independent of $X^s_{ij}$ and $R_{ij}$.
We first introduce a cutoff on the matrix $X^c$ as $\widetilde{X}^c := \mathbf{1}_A X^c$, where
\[
A := \left\{ \#\{(i,j) : X^c_{ij} = 1\} \le N^{5\epsilon} \right\} \cap \left\{ X^c_{ij} = X^c_{kl} = 1 \Rightarrow \{i,j\} = \{k,l\} \text{ or } \{i,j\} \cap \{k,l\} = \emptyset \right\}.
\]
If we regard the matrix $X^c$ as a sequence $(X^c_i)_{i=1}^{MN}$ of $MN$ i.i.d. Bernoulli random variables, it is easy to obtain from the large deviation formula that
\[
\mathbb{P}\left( \sum_{i=1}^{MN} X^c_i \le N^{5\epsilon} \right) \ge 1 - \exp(-N^{\epsilon}), \tag{2.1.64}
\]
for sufficiently large $N$. Suppose the number $m$ of nonzero elements in $X^c$ is given, with $m \le N^{5\epsilon}$. Then it is easy to check that
\[
\mathbb{P}\left( \exists\ i = k, j \ne l \ \text{or}\ i \ne k, j = l \ \text{such that}\ X^c_{ij} = X^c_{kl} = 1 \,\middle|\, \sum_{i=1}^{MN} X^c_i = m \right) = O(m^2 N^{-1}). \tag{2.1.65}
\]
Combining the estimates (2.1.64) and (2.1.65), we get
\[
\mathbb{P}(A) \ge 1 - O(N^{-1+10\epsilon}). \tag{2.1.66}
\]
On the other hand, by condition (2.1.135), we have
\[
\mathbb{P}\left( |R_{ij}| \ge \omega \right) \le \mathbb{P}\left( |q_{ij}| \ge \frac{\omega}{2} N^{1/2} \right) = o(N^{-2}), \tag{2.1.67}
\]
for any fixed constant $\omega > 0$. Hence if we introduce the matrix
\[
E := \mathbf{1}\left( A \cap \left\{ \max_{i,j} |R_{ij}| \le \omega \right\} \right) R^c,
\]
then
\[
\mathbb{P}(E = R^c) = 1 - o(1) \tag{2.1.68}
\]
by (2.1.66) and (2.1.67). Thus we only need to study the largest eigenvalue of $(X^s + E)^* D^* D (X^s + E)$, where $\max_{i,j} |E_{ij}| \le \omega$ and the rank of $E$ is less than $N^{5\epsilon}$. In fact, it suffices to prove that
\[
\mathbb{P}\left( \left| \lambda^s_1 - \lambda^E_1 \right| \le N^{-3/4} \right) = 1 - o(1), \tag{2.1.69}
\]
where $\lambda^s_1 := \lambda_1\left( (X^s)^* D^* D X^s \right)$ and $\lambda^E_1 := \lambda_1\left( (X^s + E)^* D^* D (X^s + E) \right)$. The estimate (2.1.69), combined with (2.1.61), (2.1.63) and (2.1.68), concludes (2.1.21).
Now we prove (2.1.69). Note that $X^c$ is independent of $X^s$, so the positions of the nonzero elements of $E$ are independent of $X^s$. Without loss of generality, we assume that the $m$ nonzero entries of $DE$ are
\[
e_{11}, e_{22}, \cdots, e_{mm}, \qquad m \le N^{5\epsilon}. \tag{2.1.70}
\]
For other choices of the positions of the nonzero entries the proof is exactly the same; we make this assumption only to simplify the notation. By (2.1.6) and the definition of $E$, we have $|e_{ii}| \le \tau^{-1}\omega$ for $1 \le i \le m$.
We define the matrices
\[
H^s :=
\begin{pmatrix}
0 & DX^s \\
(DX^s)^* & 0
\end{pmatrix}
\quad \text{and} \quad
H^E := H^s + P, \qquad P :=
\begin{pmatrix}
0 & DE \\
(DE)^* & 0
\end{pmatrix}.
\]
Then we have the eigendecomposition $P = V P_D V^*$, where $P_D$ is a $2m \times 2m$ diagonal matrix,
\[
P_D = \operatorname{diag}\left( e_{11}, \ldots, e_{mm}, -e_{11}, \ldots, -e_{mm} \right),
\]
and $V$ is an $(M+N) \times 2m$ matrix such that
\[
V_{ab} =
\begin{cases}
\delta_{a,i}/\sqrt{2} + \delta_{a,(M+i)}/\sqrt{2}, & b = i,\ i \le m, \\
\delta_{a,i}/\sqrt{2} - \delta_{a,(M+i)}/\sqrt{2}, & b = i + m,\ i \le m, \\
0, & b \ge 2m+1.
\end{cases}
\]
With the identity
\[
\det
\begin{pmatrix}
-I_{M\times M} & DX \\
(DX)^* & -z I_{N\times N}
\end{pmatrix}
= \det(-I_{M\times M}) \det\left( X^* D^* D X - z I_{N\times N} \right),
\]
we find that if $\mu \notin \sigma\left( (DX^s)^* D X^s \right)$, then $\mu$ is an eigenvalue of $Q^\gamma := (X^s + \gamma E)^* D^* D (X^s + \gamma E)$ if and only if
\[
\det\left( V^* G^s(\mu) V + (\gamma P_D)^{-1} \right) = 0, \tag{2.1.71}
\]
where
\[
G^s(\mu) := \left[ H^s -
\begin{pmatrix}
I_{M\times M} & 0 \\
0 & \mu I_{N\times N}
\end{pmatrix}
\right]^{-1}.
\]
Define $R^\gamma := V^* G^s V + (\gamma P_D)^{-1}$ for $0 < \gamma \le 1$. It has the following $2 \times 2$ blocks (recall the definition (2.1.29)): for $1 \le i, j \le m$,
\[
\begin{pmatrix}
R^\gamma_{i,j} & R^\gamma_{i,j+m} \\
R^\gamma_{i+m,j} & R^\gamma_{i+m,j+m}
\end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
1 & 1 \\
1 & -1
\end{pmatrix}
G^{[ij]}
\begin{pmatrix}
1 & 1 \\
1 & -1
\end{pmatrix}
+ \delta_{ij}
\begin{pmatrix}
(\gamma e_{ii})^{-1} & 0 \\
0 & -(\gamma e_{ii})^{-1}
\end{pmatrix}. \tag{2.1.72}
\]
Now let $\mu := \lambda^s_1 \pm N^{-3/4}$. We claim that
\[
\mathbb{P}\left( \det R^\gamma(\mu) \ne 0 \ \text{for all } 0 < \gamma \le 1 \right) = 1 - o(1). \tag{2.1.73}
\]
If (2.1.73) holds, then $\mu$ is not an eigenvalue of $Q^\gamma$ with probability $1 - o(1)$. Denoting the largest eigenvalue of $Q^\gamma$ by $\lambda^\gamma_1$, $0 < \gamma \le 1$, and defining $\lambda^0_1 := \lim_{\gamma \to 0} \lambda^\gamma_1$, we have $\lambda^0_1 = \lambda^s_1$ and $\lambda^1_1 = \lambda^E_1$ by definition. By the continuity of $\lambda^\gamma_1$ with respect to $\gamma$ and the fact that $\lambda^0_1 \in (\lambda^s_1 - N^{-3/4}, \lambda^s_1 + N^{-3/4})$, we find that
\[
\lambda^E_1 = \lambda^1_1 \in \left( \lambda^s_1 - N^{-3/4},\ \lambda^s_1 + N^{-3/4} \right)
\]
with probability $1 - o(1)$, i.e., we have proved (2.1.69).
Finally, we prove the claim (2.1.73). Choose $z = \lambda_r + i N^{-2/3}$ and note that $H^s$ has support $N^{-\epsilon}$. Then by (2.1.42) and (2.1.38), we have with high probability
\[
\max_a \left| G^s_{aa}(z) - \Pi_{aa}(\lambda_r) \right| \le N^{-\epsilon/2}, \tag{2.1.74}
\]
where we also used Assumption 2.1.5 and
\[
|m_{2c}(z) - m_{2c}(\lambda_r)| \sim |z - \lambda_r|^{1/2},
\]
which follows from (2.1.36). For the off-diagonal terms, we use (2.1.53), (2.1.38) and the Markov inequality to conclude that
\[
\max_{a \ne b \in \{1,\ldots,m\} \cup \{M+1,\ldots,M+m\}} |G^s_{ab}(z)| \le N^{-1/6} \tag{2.1.75}
\]
holds with probability $1 - o(N^{-1/6})$. We can extend (2.1.63) to the finite-dimensional correlation functions of the largest eigenvalues. Since the largest eigenvalues in the Gaussian case are separated on the scale $\sim N^{-2/3}$, we conclude that
\[
\mathbb{P}\left( \min_i \left| \lambda_i\left( (X^s)^* X^s \right) - \mu \right| \ge N^{-3/4} \right) \ge 1 - o(1). \tag{2.1.76}
\]
On the other hand, the rigidity result (2.1.49) gives that, with high probability,
\[
|\mu - \lambda_r| \le \varphi^C N^{-2/3} + N^{-3/4}. \tag{2.1.77}
\]
Using (2.1.44), (2.1.76), (2.1.77) and the rigidity estimate (2.1.49), we can get that with probability $1 - o(1)$,
\[
\max_{a,b} \left| G^s_{ab}(z) - G^s_{ab}(\mu) \right| < N^{-1/4+\epsilon}. \tag{2.1.78}
\]
For instance, for $\alpha, \beta \in I_2$, small $c > 0$ and large enough $C > 0$, we have with probability $1 - o(1)$ that
\[
\begin{aligned}
|G_{\alpha\beta}(z) - G_{\alpha\beta}(\mu)| &\le \sum_k |\zeta_k(\alpha)\zeta^*_k(\beta)| \left| \frac{1}{\lambda_k - z} - \frac{1}{\lambda_k - \mu} \right| \\
&\le \frac{C}{N^{2/3}} \sum_{\gamma_k \le \lambda_r - c} |\zeta_k(\alpha)\zeta^*_k(\beta)| + \frac{\varphi^C}{N^{5/3}} \sum_{\gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le \frac{C}{N^{2/3}} + \frac{\varphi^C}{N^{5/3}} \sum_{1 \le k \le \varphi^C} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} + \frac{\varphi^C}{N^{5/3}} \sum_{k > \varphi^C,\ \gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le \frac{C}{N^{2/3}} + \frac{\varphi^{2C}}{N^{1/4}} + \frac{\varphi^C}{N^{2/3}} \cdot \frac{1}{N} \sum_{k > \varphi^C,\ \gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le N^{-1/4+\epsilon},
\end{aligned}
\]
where in the first step we used (2.1.32); in the second step, (2.1.44) and the bound $|\lambda_k - z||\lambda_k - \mu| \gtrsim 1$ for $\gamma_k \le \lambda_r - c$, which follows from (2.1.49); in the third step, the Cauchy–Schwarz inequality; in the fourth step, (2.1.76); and in the last step, the rigidity estimate (2.1.49). For all the other
choices of a and b, we can prove the estimate (2.1.78) in a similar way. Now by (2.1.78),
we see that (2.1.74) and (2.1.75) still hold if we replace $z$ by $\mu = \lambda^s_1 \pm N^{-3/4}$ and double the right-hand sides. Then using $\max_i |e_{ii}| \le \tau^{-1}\omega$ and (2.1.72), we get that for any $0 < \gamma \le 1$,
\[
\min_{1 \le i \le m,\ \gamma} \left\{ |R^\gamma_{ii}|, |R^\gamma_{i+m,i+m}| \right\} \ge \tau\omega^{-1} - \frac{1}{2}\left| \Pi_{ii}(\lambda_r) + m_{2c}(\lambda_r) \right| - O(N^{-\epsilon/2}),
\]
\[
\max_{1 \le i \le m,\ \gamma} \left\{ |R^\gamma_{i,i+m}|, |R^\gamma_{i+m,i}| \right\} \le \frac{1}{2}\left| \Pi_{ii}(\lambda_r) - m_{2c}(\lambda_r) \right| + O(N^{-\epsilon/2}),
\]
and
\[
\max_{1 \le i \ne j \le m,\ \gamma} \left( |R^\gamma_{i,j}| + |R^\gamma_{i+m,j}| + |R^\gamma_{i,j+m}| + |R^\gamma_{i+m,j+m}| \right) = O(N^{-1/6}),
\]
hold with probability 1− o(1). Thus Rγ is diagonally dominant with probability 1− o(1)
(provided that ω is chosen to be sufficiently small). This proves the claim (2.1.73), which
further gives (2.1.69) and completes the proof.
Proofs of the local laws. We first collect some tools that will be used in the proofs. For simplicity, we denote $Y := DX$.
Definition 2.1.27 (Minors). For $T \subseteq I$, we define the minor $H^{(T)} := (H_{ab} : a, b \in I \setminus T)$ obtained by removing all rows and columns of $H$ indexed by $a \in T$. Note that we keep the names of the indices of $H$ when defining $H^{(T)}$, i.e. $(H^{(T)})_{ab} = \mathbf{1}\{a, b \notin T\} H_{ab}$. Correspondingly, we define the Green function
\[
G^{(T)} := \left( H^{(T)} \right)^{-1} =
\begin{pmatrix}
z G_1^{(T)} & G_1^{(T)} Y^{(T)} \\
\left( Y^{(T)} \right)^* G_1^{(T)} & G_2^{(T)}
\end{pmatrix}
=
\begin{pmatrix}
z G_1^{(T)} & Y^{(T)} G_2^{(T)} \\
G_2^{(T)} \left( Y^{(T)} \right)^* & G_2^{(T)}
\end{pmatrix},
\]
and the partial traces
\[
m_1^{(T)} := \frac{1}{M} \operatorname{Tr} G_1^{(T)} = \frac{1}{Mz} \sum_{i \notin T} G_{ii}^{(T)}, \qquad
m_2^{(T)} := \frac{1}{N} \operatorname{Tr} G_2^{(T)} = \frac{1}{N} \sum_{\mu \notin T} G_{\mu\mu}^{(T)},
\]
where we adopt the convention that $G^{(T)}_{ab} = 0$ if $a \in T$ or $b \in T$. We will abbreviate $(\{a\}) \equiv (a)$, $(\{a, b\}) \equiv (ab)$, and
\[
\sum_{a \notin T} \equiv \sum_a^{(T)}, \qquad \sum_{a, b \notin T} \equiv \sum_{a,b}^{(T)}.
\]
Lemma 2.1.28 (Resolvent identities).

(i) For $i \in I_1$ and $\mu \in I_2$, we have
\[
\frac{1}{G_{ii}} = -1 - \left( Y G^{(i)} Y^* \right)_{ii}, \qquad
\frac{1}{G_{\mu\mu}} = -z - \left( Y^* G^{(\mu)} Y \right)_{\mu\mu}. \tag{2.1.79}
\]

(ii) For $i \ne j \in I_1$ and $\mu \ne \nu \in I_2$, we have
\[
G_{ij} = G_{ii} G^{(i)}_{jj} \left( Y G^{(ij)} Y^* \right)_{ij}, \qquad
G_{\mu\nu} = G_{\mu\mu} G^{(\mu)}_{\nu\nu} \left( Y^* G^{(\mu\nu)} Y \right)_{\mu\nu}. \tag{2.1.80}
\]
For $i \in I_1$ and $\mu \in I_2$, we have
\[
G_{i\mu} = G_{ii} G^{(i)}_{\mu\mu} \left( -Y_{i\mu} + \left( Y G^{(i\mu)} Y \right)_{i\mu} \right), \qquad
G_{\mu i} = G_{\mu\mu} G^{(\mu)}_{ii} \left( -Y^*_{\mu i} + \left( Y^* G^{(\mu i)} Y^* \right)_{\mu i} \right). \tag{2.1.81}
\]

(iii) For $a \in I$ and $b, c \in I \setminus \{a\}$,
\[
G^{(a)}_{bc} = G_{bc} - \frac{G_{ba} G_{ac}}{G_{aa}}, \qquad
\frac{1}{G_{bb}} = \frac{1}{G^{(a)}_{bb}} - \frac{G_{ba} G_{ab}}{G_{bb}\, G^{(a)}_{bb}\, G_{aa}}. \tag{2.1.82}
\]

(iv) All of the above identities hold for $G^{(T)}$ instead of $G$ for $T \subset I$.
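The identities in (iii) are pure Schur-complement algebra and hold exactly for the resolvent of any Hermitian matrix, not only for the linearized Green function used here. A direct numerical check on a Wigner-type toy matrix (our own construction, standing in for the thesis's $H$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, z = 50, 1.0 + 0.1j
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)          # symmetric toy matrix
G = np.linalg.inv(H - z * np.eye(n))

a = 0
keep = np.arange(1, n)                  # remove row and column a = 0
G_minor = np.linalg.inv(H[np.ix_(keep, keep)] - z * np.eye(n - 1))

# First identity of (2.1.82): G^{(a)}_{bc} = G_{bc} - G_{ba} G_{ac} / G_{aa}.
b, c = 3, 7
lhs = G_minor[b - 1, c - 1]             # minor indices shift by one past the removed row
rhs = G[b, c] - G[b, a] * G[a, c] / G[a, a]
```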
Lemma 2.1.29. Fix constants $c_0, C_0, C_1 > 0$. The following estimates hold uniformly for all $z \in S(c_0, C_0, C_1)$:
\[
\|G\| \le C\eta^{-1}, \qquad \|\partial_z G\| \le C\eta^{-2}. \tag{2.1.83}
\]
Furthermore, we have the following identities:
\[
\sum_{\mu \in I_2} |G_{\nu\mu}|^2 = \sum_{\mu \in I_2} |G_{\mu\nu}|^2 = \frac{\operatorname{Im} G_{\nu\nu}}{\eta}, \tag{2.1.84}
\]
\[
\sum_{i \in I_1} |G_{ji}|^2 = \sum_{i \in I_1} |G_{ij}|^2 = \frac{|z|^2}{\eta} \operatorname{Im}\left( \frac{G_{jj}}{z} \right), \tag{2.1.85}
\]
\[
\sum_{i \in I_1} |G_{\mu i}|^2 = \sum_{i \in I_1} |G_{i\mu}|^2 = G_{\mu\mu} + \frac{\bar z}{\eta} \operatorname{Im} G_{\mu\mu}, \tag{2.1.86}
\]
\[
\sum_{\mu \in I_2} |G_{i\mu}|^2 = \sum_{\mu \in I_2} |G_{\mu i}|^2 = \frac{G_{ii}}{z} + \frac{\bar z}{\eta} \operatorname{Im}\left( \frac{G_{ii}}{z} \right). \tag{2.1.87}
\]
All of the above estimates remain true for $G^{(T)}$ instead of $G$ for any $T \subseteq I$.
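The identity (2.1.84) is an instance of the Ward identity $\sum_b |G_{ab}|^2 = \operatorname{Im} G_{aa}/\eta$, which is exact for the resolvent of any Hermitian matrix. A quick check on a toy symmetric matrix (our construction, not the linearized $H$ of the text):

```python
import numpy as np

rng = np.random.default_rng(6)
n, eta = 60, 0.2
z = 0.5 + 1j * eta
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)
G = np.linalg.inv(H - z * np.eye(n))

# Ward identity: sum_b |G_{ab}|^2 = Im G_{aa} / eta, the algebra behind (2.1.84).
a = 4
lhs = np.sum(np.abs(G[a, :]) ** 2)
rhs = G[a, a].imag / eta
```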
Lemma 2.1.30. Fix constants $c_0, C_0, C_1 > 0$. For any $T \subseteq I$, the following bounds hold uniformly in $z \in S(c_0, C_0, C_1)$:
\[
\left| m_2 - m_2^{(T)} \right| \le \frac{2|T|}{N\eta}, \tag{2.1.88}
\]
and
\[
\left| \frac{1}{N} \sum_{i=1}^{M} \sigma_i \left( G^{(T)}_{ii} - G_{ii} \right) \right| \le \frac{C|T|}{N\eta}, \tag{2.1.89}
\]
where $C > 0$ is a constant depending only on $\tau$.
Proof. For $\mu \in I_2$, we have
\[
\left| m_2 - m_2^{(\mu)} \right| = \frac{1}{N} \left| \sum_{\nu \in I_2} \frac{G_{\nu\mu} G_{\mu\nu}}{G_{\mu\mu}} \right| \le \frac{1}{N |G_{\mu\mu}|} \sum_{\nu \in I_2} |G_{\nu\mu}|^2 = \frac{\operatorname{Im} G_{\mu\mu}}{N\eta |G_{\mu\mu}|} \le \frac{1}{N\eta},
\]
where in the first step we used (2.1.82), and in the second and third steps we used the identity (2.1.84). Similarly, using (2.1.82) and (2.1.87) we get
\[
\left| m_2 - m_2^{(i)} \right| = \frac{1}{N} \left| \sum_{\nu \in I_2} \frac{G_{\nu i} G_{i\nu}}{G_{ii}} \right| \le \frac{1}{N |G_{ii}|} \left| \frac{G_{ii}}{z} + \frac{\bar z}{\eta} \operatorname{Im}\left( \frac{G_{ii}}{z} \right) \right| \le \frac{2}{N\eta}.
\]
Then we can prove (2.1.88) by induction on the indices in $T$. The proof of (2.1.89) is similar, except that one also needs the assumption (2.1.6).
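The bound (2.1.88) is deterministic. For the plain resolvent of a Hermitian matrix the analogous statement, that removing one row and column changes the normalized trace by at most $\eta^{-1}/n$, can be verified directly (toy setup of our choosing):

```python
import numpy as np

rng = np.random.default_rng(8)
n, eta = 80, 0.1
z = 0.3 + 1j * eta
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)
G = np.linalg.inv(H - z * np.eye(n))
keep = np.arange(1, n)                   # remove index a = 0
G_minor = np.linalg.inv(H[np.ix_(keep, keep)] - z * np.eye(n - 1))

m = np.trace(G) / n
m_minor = np.trace(G_minor) / n          # same 1/n normalization, as in Definition 2.1.27
diff = abs(m - m_minor)                  # interlacing-type bound: at most 1/(n*eta)
```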
Lemma 2.1.31. Let $(x_i)$, $(y_i)$ be independent families of centered and independent random variables, and $(A_i)$, $(B_{ij})$ be families of deterministic complex numbers. Suppose the entries $x_i$ and $y_j$ have variance at most $N^{-1}$ and satisfy the bounded support condition (2.1.26) with $q \le N^{-\epsilon}$ for some constant $\epsilon > 0$. Then for any fixed $\xi > 0$, the following bounds hold with $\xi$-high probability:
\[
\left| \sum_i A_i x_i \right| \le \varphi^{\xi} \left[ q \max_i |A_i| + \frac{1}{\sqrt{N}} \left( \sum_i |A_i|^2 \right)^{1/2} \right], \tag{2.1.90}
\]
\[
\left| \sum_{i,j} x_i B_{ij} y_j \right| \le \varphi^{2\xi} \left[ q^2 B_d + q B_o + \frac{1}{N} \left( \sum_{i \ne j} |B_{ij}|^2 \right)^{1/2} \right], \tag{2.1.91}
\]
\[
\left| \sum_i x_i B_{ii} x_i - \sum_i (\mathbb{E}|x_i|^2) B_{ii} \right| \le \varphi^{\xi} q B_d, \tag{2.1.92}
\]
\[
\left| \sum_{i \ne j} x_i B_{ij} x_j \right| \le \varphi^{2\xi} \left[ q B_o + \frac{1}{N} \left( \sum_{i \ne j} |B_{ij}|^2 \right)^{1/2} \right], \tag{2.1.93}
\]
where
\[
B_d := \max_i |B_{ii}|, \qquad B_o := \max_{i \ne j} |B_{ij}|.
\]
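The content of (2.1.90) is that a linear form in bounded-support variables is dominated by the larger of $q \max_i |A_i|$ and the CLT scale $N^{-1/2}\|A\|_2$. A simulation sketch: we realize the bounded support condition with a sparse sign variable of variance exactly $N^{-1}$, and replace the $\varphi^{\xi}$ factor by a generous constant; both choices are ours.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000
q = N ** (-0.45)                 # support level, q <= N^{-eps}
p = 1.0 / (N * q * q)            # chosen so that Var(x_i) = 1/N while |x_i| <= q
x = q * rng.binomial(1, p, N) * rng.choice([-1.0, 1.0], size=N)
A = rng.standard_normal(N)       # deterministic coefficients (frozen once drawn)

lhs = abs(np.sum(A * x))
# The bound (2.1.90), with the log factor replaced by a generous constant 10:
rhs = 10 * (q * np.max(np.abs(A)) + np.sqrt(np.sum(A ** 2) / N))
```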
Finally, we have the following lemma, which is a consequence of Assumption 2.1.5.

Lemma 2.1.32. There exist constants $c_0, \tau' > 0$ such that
\[
|1 + m_{2c}(z)\sigma_k| \ge \tau', \tag{2.1.94}
\]
for all $z \in S(c_0, C_0, C_1)$ and $1 \le k \le M$.
Proof. By Assumption 2.1.5 and the fact that $m_{2c}(\lambda_r) \in (-\sigma_1^{-1}, 0)$, we have
\[
|1 + m_{2c}(\lambda_r)\sigma_k| \ge \tau, \qquad 1 \le k \le M.
\]
Applying (2.1.36) to the Stieltjes transform
\[
m_{2c}(z) := \int_{\mathbb{R}} \frac{\rho_{2c}(x)}{x - z}\,\mathrm{d}x, \tag{2.1.95}
\]
one can verify that $m_{2c}(z) - m_{2c}(\lambda_r) \sim \sqrt{z - \lambda_r}$ for $z$ close to $\lambda_r$. Hence if $\kappa + \eta \le 2c_0$ for some sufficiently small constant $c_0 > 0$, we have
\[
|1 + m_{2c}(z)\sigma_k| \ge \tau/2.
\]
Next we consider the case with $E - \lambda_r \ge c_0$ and $\eta \le c_1$ for some constant $c_1 > 0$. In fact, for $\eta = 0$ and $E \ge \lambda_r + c_0$, $m_{2c}(E)$ is real, and it is easy to verify that $m_{2c}'(E) \ge 0$ using the formula (2.1.95). Hence we have
\[
|1 + \sigma_k m_{2c}(E)| \ge |1 + \sigma_k m_{2c}(\lambda_r + c_0)| \ge \tau/2, \qquad \text{for } E \ge \lambda_r + c_0.
\]
Using (2.1.95) again, we can get that
\[
\left| \frac{\mathrm{d} m_{2c}(z)}{\mathrm{d} z} \right| \le c_0^{-2}, \qquad \text{for } E \ge \lambda_r + c_0.
\]
So if $c_1$ is sufficiently small, we have
\[
|1 + \sigma_k m_{2c}(E + i\eta)| \ge \frac{1}{2} |1 + \sigma_k m_{2c}(E)| \ge \tau/4
\]
for $E \ge \lambda_r + c_0$ and $\eta \le c_1$. Finally, it remains to consider the case with $\eta \ge c_1$. If $\sigma_k \le |2m_{2c}(z)|^{-1}$, then we have $|1 + \sigma_k m_{2c}(z)| \ge 1/2$. Otherwise, we have $\operatorname{Im} m_{2c}(z) \sim 1$ by (2.1.38). Together with (2.1.37), we get that
\[
|1 + \sigma_k m_{2c}(z)| \ge \sigma_k \operatorname{Im} m_{2c}(z) \ge \frac{\operatorname{Im} m_{2c}(z)}{2|m_{2c}(z)|} \ge \tau'
\]
for some constant $\tau' > 0$.
Our goal is to prove that $G$ is close to $\Pi$ in the sense of entrywise and averaged local laws. Hence it is convenient to introduce the following random control parameters.

Definition 2.1.33 (Control parameters). We define the entrywise and averaged errors
\[
\Lambda := \max_{a,b \in I} |(G - \Pi)_{ab}|, \qquad \Lambda_o := \max_{a \ne b \in I} |G_{ab}|, \qquad \theta := |m_2 - m_{2c}|. \tag{2.1.96}
\]
Moreover, we define the random control parameter
\[
\Psi_\theta := \sqrt{\frac{\operatorname{Im} m_{2c} + \theta}{N\eta}} + \frac{1}{N\eta}, \tag{2.1.97}
\]
and the deterministic control parameter
\[
\Psi := \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} + \frac{1}{N\eta}. \tag{2.1.98}
\]
We introduce the $Z$ variables
\[
Z^{(T)}_a := (1 - \mathbb{E}_a)\left( G^{(T)}_{aa} \right)^{-1}, \qquad a \notin T,
\]
where $\mathbb{E}_a[\cdot] := \mathbb{E}[\,\cdot \mid H^{(a)}]$, i.e. it is the partial expectation over the randomness of the $a$-th row and column of $H$. By (2.1.79), we have
\[
Z_i = (\mathbb{E}_i - 1)\left( Y G^{(i)} Y^* \right)_{ii} = \sigma_i \sum_{\mu,\nu \in I_2} G^{(i)}_{\mu\nu} \left( \frac{1}{N}\delta_{\mu\nu} - X_{i\mu} X_{i\nu} \right), \tag{2.1.99}
\]
and
\[
Z_\mu = (\mathbb{E}_\mu - 1)\left( Y^* G^{(\mu)} Y \right)_{\mu\mu} = \sum_{i,j \in I_1} \sqrt{\sigma_i \sigma_j}\, G^{(\mu)}_{ij} \left( \frac{1}{N}\delta_{ij} - X_{i\mu} X_{j\mu} \right). \tag{2.1.100}
\]
The following lemma plays a key role in the proof of the local laws.

Lemma 2.1.34. Let $c_0 > 0$ be a sufficiently small constant and fix $C_0, C_1, \xi > 0$. Define the $z$-dependent event $\Xi(z) := \{\Lambda(z) \le (\log N)^{-1}\}$. Then there exists a constant $C > 0$ such that the following estimates hold for all $a \in I$ and $z \in S(c_0, C_0, C_1)$ with $\xi$-high probability:
\[
\mathbf{1}(\Xi)\left( \Lambda_o + |Z_a| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right), \tag{2.1.101}
\]
\[
\mathbf{1}(\eta \ge 1)\left( \Lambda_o + |Z_a| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right). \tag{2.1.102}
\]
Proof. Applying the large deviation Lemma 2.1.31 to $Z_i$ in (2.1.99), we get that on $\Xi$,
\[
|Z_i| \le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\mu,\nu} \left| G^{(i)}_{\mu\nu} \right|^2 \right)^{1/2} \right]
= C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\mu} \frac{\operatorname{Im} G^{(i)}_{\mu\mu}}{\eta} \right)^{1/2} \right]
= C\varphi^{2\xi} \left[ q + \sqrt{\frac{\operatorname{Im} m^{(i)}_2}{N\eta}} \right] \tag{2.1.103}
\]
holds with $\xi$-high probability, where we used (2.1.6), (2.1.84) and the fact that $\max_{a,b} |G_{ab}| = O(1)$ on the event $\Xi$. Now by (2.1.96), (2.1.97) and the bound (2.1.88), we have that
\[
\sqrt{\frac{\operatorname{Im} m^{(i)}_2}{N\eta}} = \sqrt{\frac{\operatorname{Im} m_{2c} + \operatorname{Im}(m^{(i)}_2 - m_2) + \operatorname{Im}(m_2 - m_{2c})}{N\eta}} \le C\Psi_\theta. \tag{2.1.104}
\]
Together with (2.1.103), we conclude that
\[
\mathbf{1}(\Xi)|Z_i| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with $\xi$-high probability. Similarly, we can prove the same estimate for $\mathbf{1}(\Xi)|Z_\mu|$. In the proof, we also need to use (2.1.9) and
\[
\operatorname{Im}\left( -d - \frac{1}{z} \right) = O(\eta) = O(\operatorname{Im} m_{2c}(z)).
\]
If $\eta \ge 1$, we always have $\max_{a,b} |G_{ab}| = O(1)$ by (2.1.83). Then, repeating the above proof, we obtain that
\[
\mathbf{1}(\eta \ge 1)|Z_a| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with $\xi$-high probability. Similarly, using (2.1.80) and Lemmas 2.1.29–2.1.31, we can prove that with $\xi$-high probability,
\[
\mathbf{1}(\Xi)\left( |G_{ij}| + |G_{\mu\nu}| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right) \tag{2.1.105}
\]
holds uniformly for $i \ne j$ and $\mu \ne \nu$. It remains to prove the bound for $G_{i\mu}$ and $G_{\mu i}$. Using (2.1.81), the bounded support condition (2.1.26) for $X_{i\mu}$, the bound $\max_{a,b} |G_{ab}| = O(1)$
on $\Xi$, Lemma 2.1.29 and Lemma 2.1.31, we get that with $\xi$-high probability,
\[
\begin{aligned}
|G_{i\mu}| &\le C \left[ q + \left| \sum_{j,\nu}^{(i\mu)} X_{i\nu} G^{(i\mu)}_{\nu j} X_{j\mu} \right| \right]
\le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{j,\nu}^{(i\mu)} \left| G^{(i\mu)}_{\nu j} \right|^2 \right)^{1/2} \right] \\
&\le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\nu}^{(\mu)} \left( G^{(i\mu)}_{\nu\nu} + \frac{\bar z}{\eta} \operatorname{Im} G^{(i\mu)}_{\nu\nu} \right) \right)^{1/2} \right]
\le C\varphi^{2\xi} \left[ q + \sqrt{\frac{|m^{(i\mu)}_2|}{N}} + \sqrt{\frac{\operatorname{Im} m^{(i\mu)}_2}{N\eta}} \right].
\end{aligned} \tag{2.1.106}
\]
As in (2.1.104), we can show that
\[
\sqrt{\frac{\operatorname{Im} m^{(i\mu)}_2}{N\eta}} = O(\Psi_\theta). \tag{2.1.107}
\]
For the other term, we have
\[
\sqrt{\frac{|m^{(i\mu)}_2|}{N}} \le \sqrt{\frac{|m_{2c}| + |m^{(i\mu)}_2 - m_2| + |m_2 - m_{2c}|}{N}} \le C \left( \frac{1}{N\sqrt{\eta}} + \sqrt{\frac{\theta}{N}} + \sqrt{\frac{|m_{2c}|}{N}} \right) \le C\Psi_\theta, \tag{2.1.108}
\]
where we used (2.1.88), and that
\[
\frac{|m_{2c}|}{N} = O\left( \frac{\operatorname{Im} m_{2c}}{N\eta} \right),
\]
since $|m_{2c}| = O(1)$ and $\operatorname{Im} m_{2c} \gtrsim \eta$ by Lemma 2.1.15. From (2.1.106), (2.1.107) and (2.1.108), we obtain that
\[
\mathbf{1}(\Xi)|G_{i\mu}| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with ξ-high probability. Together with (2.1.105), we get the estimate in (2.1.101) for
Λo. Finally, the estimate (2.1.102) for Λo can be proved in a similar way with the bound
1(η ≥ 1) maxa,b |Gab| = O(1).
Our proof of the local law starts with an analysis of the self-consistent equation.
Recall that m2c(z) is the solution to the equation z = f(m) for f defined in (2.1.16).
Lemma 2.1.35. Let $c_0 > 0$ be sufficiently small. Fix $C_0 > 0$, $\xi \ge 3$ and $C_1 \ge 8\xi$. Then there exists $C > 0$ such that the following estimates hold uniformly in $z \in S(c_0, C_0, C_1)$ with $\xi$-high probability:
\[
\mathbf{1}(\eta \ge 1)\left| z - f(m_2) \right| \le C\varphi^{2\xi}\left( q + N^{-1/2} \right), \tag{2.1.109}
\]
\[
\mathbf{1}(\Xi)\left| z - f(m_2) \right| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right), \tag{2.1.110}
\]
where $\Xi$ is as given in Lemma 2.1.34. Moreover, we have the finer estimate
\[
\mathbf{1}(\Xi)\left( z - f(m_2) \right) = \mathbf{1}(\Xi)\left( [Z]_1 + [Z]_2 \right) + O\left( \varphi^{4\xi}\left( q^2 + \Psi^2_\theta \right) \right) \tag{2.1.111}
\]
with $\xi$-high probability, where
\[
[Z]_1 := \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{(1 + m_2\sigma_i)^2} Z_i, \qquad [Z]_2 := \frac{1}{N} \sum_{\mu \in I_2} Z_\mu. \tag{2.1.112}
\]
Proof. We first prove (2.1.111), from which (2.1.110) follows due to (2.1.101) and (2.1.94). By (2.1.79), (2.1.99) and (2.1.100), we have
\[
\frac{1}{G_{ii}} = -1 - \frac{\sigma_i}{N} \sum_{\mu \in I_2} G^{(i)}_{\mu\mu} + Z_i = -1 - \sigma_i m_2 + \varepsilon_i, \tag{2.1.113}
\]
and
\[
\frac{1}{G_{\mu\mu}} = -z - \frac{1}{N} \sum_{i \in I_1} \sigma_i G^{(\mu)}_{ii} + Z_\mu = -z - \frac{1}{N} \sum_{i \in I_1} \sigma_i G_{ii} + \varepsilon_\mu, \tag{2.1.114}
\]
where
\[
\varepsilon_i := Z_i + \sigma_i\left( m_2 - m^{(i)}_2 \right) \quad \text{and} \quad \varepsilon_\mu := Z_\mu + \frac{1}{N} \sum_{i \in I_1} \sigma_i\left( G_{ii} - G^{(\mu)}_{ii} \right).
\]
By (2.1.88), (2.1.89) and (2.1.101), we have, for all $i$ and $\mu$,
\[
\mathbf{1}(\Xi)\left( |\varepsilon_i| + |\varepsilon_\mu| \right) \le C\varphi^{2\xi}(q + \Psi_\theta) \tag{2.1.115}
\]
with $\xi$-high probability. Then using (2.1.114), we get that for any $\mu$ and $\nu$,
\[
\mathbf{1}(\Xi)(G_{\mu\mu} - G_{\nu\nu}) = \mathbf{1}(\Xi)\, G_{\mu\mu} G_{\nu\nu} (\varepsilon_\nu - \varepsilon_\mu) = O\left( \varphi^{2\xi}(q + \Psi_\theta) \right) \tag{2.1.116}
\]
with $\xi$-high probability. This implies that
\[
\mathbf{1}(\Xi)|G_{\mu\mu} - m_2| \le C\varphi^{2\xi}(q + \Psi_\theta), \qquad \mu \in I_2, \tag{2.1.117}
\]
with $\xi$-high probability.
Now we plug (2.1.113) into (2.1.114) and take the average $N^{-1}\sum_\mu$. Note that we can write
\[
\frac{1}{G_{\mu\mu}} = \frac{1}{m_2} - \frac{1}{m_2^2}(G_{\mu\mu} - m_2) + \frac{1}{m_2^2} \frac{(G_{\mu\mu} - m_2)^2}{G_{\mu\mu}}.
\]
After taking the average, the second term on the right-hand side vanishes, and the third term contributes a factor $O(\varphi^{4\xi}(q + \Psi_\theta)^2)$ by (2.1.117). On the other hand, using (2.1.82) and (2.1.101) we get that
\[
\mathbf{1}(\Xi)\left| \frac{1}{N} \sum_{i \in I_1} \sigma_i\left( G^{(\mu)}_{ii} - G_{ii} \right) \right| \le \mathbf{1}(\Xi) \frac{1}{N} \sum_{i \in I_1} \sigma_i \left| \frac{G_{i\mu} G_{\mu i}}{G_{\mu\mu}} \right| \le C\varphi^{4\xi}(q + \Psi_\theta)^2,
\]
and
\[
\mathbf{1}(\Xi)|m_2 - m^{(i)}_2| \le \mathbf{1}(\Xi) \frac{1}{N} \sum_{\mu \in I_2} \left| \frac{G_{\mu i} G_{i\mu}}{G_{ii}} \right| \le C\varphi^{4\xi}(q + \Psi_\theta)^2,
\]
with $\xi$-high probability. Hence the average of (2.1.114) gives
\[
\mathbf{1}(\Xi)\frac{1}{m_2} = \mathbf{1}(\Xi)\left[ -z + \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{1 + \sigma_i m_2 - Z_i + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right)} + [Z]_2 \right] + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right),
\]
with $\xi$-high probability. Finally, using (2.1.94) and the definition of $\Xi$, we can expand the fractions in the sum to get that
\[
\mathbf{1}(\Xi)\left[ z + \frac{1}{m_2} - \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{1 + \sigma_i m_2} \right] = \mathbf{1}(\Xi)\left( [Z]_1 + [Z]_2 \right) + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right).
\]
This concludes (2.1.111).
Then we prove (2.1.109). Using the bound $\mathbf{1}(\eta \ge 1)\max_{a,b}|G_{ab}| = O(1)$, it is easy to get that $|m_2| = O(1)$ and $\theta = O(1)$. Thus we have $\mathbf{1}(\eta \ge 1)\Psi_\theta = O(N^{-1/2})$, and (2.1.115) gives
\[
\mathbf{1}(\eta \ge 1)\left( |\varepsilon_i| + |\varepsilon_\mu| \right) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.118}
\]
with $\xi$-high probability. First, we claim that for $\eta \ge 1$,
\[
|m_2| \ge \operatorname{Im} m_2 \ge c \quad \text{with } \xi\text{-high probability}, \tag{2.1.119}
\]
for some constant $c > 0$. By the spectral decomposition (2.1.32), we have
\[
\operatorname{Im} G_{ii} = \operatorname{Im} \sum_{k=1}^{M} \frac{z|\xi_k(i)|^2}{\lambda_k - z} = \sum_{k=1}^{M} |\xi_k(i)|^2 \operatorname{Im}\left( -1 + \frac{\lambda_k}{\lambda_k - z} \right) \ge 0.
\]
Then by (2.1.114), $G^{-1}_{\mu\mu}$ is of order $O(1)$ and has imaginary part $\le -\eta + O\left( \varphi^{2\xi}(q + N^{-1/2}) \right)$ with $\xi$-high probability. This implies that $\operatorname{Im} G_{\mu\mu} \gtrsim \eta$ with $\xi$-high probability, which concludes (2.1.119). Next, we claim that
\[
|1 + \sigma_i m_2| \ge c' \quad \text{with } \xi\text{-high probability}, \tag{2.1.120}
\]
for some constant $c' > 0$. In fact, if $\sigma_i \le |2m_2|^{-1}$, we trivially have $|1 + \sigma_i m_2| \ge 1/2$. Otherwise, we have $\sigma_i \gtrsim 1$ (since $|m_2| = O(1)$), which gives
\[
|1 + \sigma_i m_2| \ge \sigma_i \operatorname{Im} m_2 \ge c'.
\]
Finally, with (2.1.118), (2.1.119) and (2.1.120), we can repeat the previous arguments to get (2.1.109).
The following lemma gives the stability of the equation $z = f(m)$. Roughly speaking, it states that if $z' - f(m_2(z'))$ is small for all $z'$ with $\operatorname{Im} z' \ge \operatorname{Im} z$ on a vertical lattice above $z$, then $m_2(z) - m_{2c}(z)$ is small. For an arbitrary $z \in S(c_0, C_0, C_1)$, we define the discrete set
\[
L(z) := \{z\} \cup \left\{ z' \in S(c_0, C_0, C_1) : \operatorname{Re} z' = \operatorname{Re} z,\ \operatorname{Im} z' \in [\operatorname{Im} z, 1] \cap (N^{-10}\mathbb{N}) \right\}.
\]
Thus, if $\operatorname{Im} z \ge 1$, then $L(z) = \{z\}$; if $\operatorname{Im} z < 1$, then $L(z)$ is a one-dimensional lattice with spacing $N^{-10}$, plus the point $z$. Obviously, we have $|L(z)| \le N^{10}$.

Lemma 2.1.36. The self-consistent equation $z - f(m) = 0$ is stable on $S(c_0, C_0, C_1)$ in the following sense. Suppose the $z$-dependent function $\delta$ satisfies $N^{-2} \le \delta(z) \le (\log N)^{-1}$ for $z \in S(c_0, C_0, C_1)$ and that $\delta$ is Lipschitz continuous with Lipschitz constant $\le N^2$. Suppose moreover that for each fixed $E$, the function $\eta \mapsto \delta(E + i\eta)$ is non-increasing for $\eta > 0$. Suppose that $u_2 : S(c_0, C_0, C_1) \to \mathbb{C}$ is the Stieltjes transform of a probability measure. Let $z \in S(c_0, C_0, C_1)$ and suppose that for all $z' \in L(z)$ we have
\[
|z' - f(u_2(z'))| \le \delta(z'). \tag{2.1.121}
\]
Then we have
\[
|u_2(z) - m_{2c}(z)| \le \frac{C\delta}{\sqrt{\kappa + \eta + \delta}}, \tag{2.1.122}
\]
for some constant $C > 0$ independent of $z$ and $N$, where $\kappa$ is defined in (2.1.35).
Note that by Lemma 2.1.36 and (2.1.109), we immediately get that
\[
\mathbf{1}(\eta \ge 1)\,\theta(z) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.123}
\]
with $\xi$-high probability. From (2.1.102), we obtain the off-diagonal estimate
\[
\mathbf{1}(\eta \ge 1)\,\Lambda_o(z) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.124}
\]
with $\xi$-high probability. Using (2.1.117), (2.1.113) and (2.1.123), we get that
\[
\mathbf{1}(\eta \ge 1)\left( |G_{ii} - \Pi_{ii}| + |G_{\mu\mu} - m_{2c}| \right) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.125}
\]
with $\xi$-high probability, which gives the diagonal estimate. These bounds can be easily generalized to the case $\eta \ge c$ for any fixed $c > 0$. Comparing with (2.1.42), one can see that the bounds (2.1.124) and (2.1.125) are optimal in the regime $\eta \ge c$. It now remains to deal with the small-$\eta$ case (in particular, the local case $\eta \ll 1$). We first prove the following weak bound.
Lemma 2.1.37. Let $c_0 > 0$ be sufficiently small. Fix $C_0 > 0$, $\xi \ge 3$ and $C_1 \ge 8\xi$. Then there exists $C > 0$ such that, with $\xi$-high probability,
\[
\Lambda(z) \le C\varphi^{2\xi}\left( \sqrt{q} + (N\eta)^{-1/3} \right) \tag{2.1.126}
\]
holds uniformly in $z \in S(c_0, C_0, C_1)$.
To get the stronger local laws of Lemma 2.1.20, we need stronger bounds on $[Z]_1$ and $[Z]_2$ in (2.1.111). They follow from the following fluctuation averaging lemma.

Lemma 2.1.38. Fix a constant $\xi > 0$. Suppose $q \le \varphi^{-5\xi}$, and suppose that there exists $S \subseteq S(c_0, C_0, L)$ with $L \ge 18\xi$ such that with $\xi$-high probability,
\[
\Lambda(z) \le \gamma(z) \quad \text{for } z \in S, \tag{2.1.127}
\]
where $\gamma$ is a deterministic function satisfying $\gamma(z) \le \varphi^{-\xi}$. Then we have that, with $(\xi - \tau_N)$-high probability,
\[
|[Z]_1(z)| + |[Z]_2(z)| \le \varphi^{18\xi} \left( q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}(z) + \gamma(z)}{N\eta} \right) \tag{2.1.128}
\]
for $z \in S$, where $\tau_N := 2/\log\log N$.
Proof. We suppose that the event $\Xi$ holds. The bound for $[Z]_2$ is proved in Lemma 4.1 of [50]. The bound for $[Z]_1$ can be proved in a similar way, except that the coefficients $\sigma_i/(1 + m_2\sigma_i)^2$ are random and depend on $i$. This can be dealt with by writing, for any $i \in I_1$,
\[
m_2 = m^{(i)}_2 + \frac{1}{N} \sum_{\mu \in I_2} \frac{G_{\mu i} G_{i\mu}}{G_{ii}} = m^{(i)}_2 + O(\Lambda_o^2),
\]
where by Lemma 2.1.34 we have, with $\xi$-high probability,
\[
\Lambda_o^2 \le C\varphi^{4\xi}\left( q^2 + \Psi^2_\theta \right) \le C\varphi^{4\xi}\left( q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}(z) + \gamma(z)}{N\eta} \right).
\]
Then we write
\[
\begin{aligned}
[Z]_1 &= \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{\left( 1 + m^{(i)}_2 \sigma_i \right)^2} Z_i + O(\Lambda_o^2)
= \frac{1}{N} \sum_{i \in I_1} (1 - \mathbb{E}_i)\left[ \frac{\sigma_i}{\left( 1 + m^{(i)}_2 \sigma_i \right)^2} G^{-1}_{ii} \right] + O(\Lambda_o^2) \\
&= \frac{1}{N} \sum_{i \in I_1} (1 - \mathbb{E}_i)\left[ \frac{\sigma_i}{\left( 1 + m_2 \sigma_i \right)^2} G^{-1}_{ii} \right] + O(\Lambda_o^2). \tag{2.1.129}
\end{aligned}
\]
The method used to bound the first term in (2.1.129) is a slight modification of the one in [50]. Finally, one can use the fact that the event $\Xi$ holds with $\xi$-high probability, by Lemma 2.1.37, to conclude the proof.
Proof of (2.1.41) and (2.1.42). Fix $c_0, C_0 > 0$, $\xi > 3$ and set
\[
L := 120\xi, \qquad \widetilde{\xi} := 2/\log 2 + \xi.
\]
Hence we have $\widetilde{\xi} \le 2\xi$ and $L \ge 60\widetilde{\xi}$. Then to prove (2.1.42), it suffices to prove that
\[
\bigcap_{z \in S(c_0, C_0, L)} \left\{ \Lambda(z) \le C\varphi^{20\xi} \left( q + \sqrt{\frac{\operatorname{Im} m_{2c}(z)}{N\eta}} + \frac{1}{N\eta} \right) \right\} \tag{2.1.130}
\]
holds with $\xi$-high probability.
By Lemma 2.1.37, the event $\Xi$ holds with $\xi$-high probability. Then, together with Lemma 2.1.38 and (2.1.111), we get that with $(\xi - \tau_N)$-high probability,
\[
\begin{aligned}
|z - f(m_2)| &\le C\varphi^{18\xi} \left[ q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c} + C\varphi^{2\xi}\left( \sqrt{q} + (N\eta)^{-1/3} \right)}{N\eta} \right] \\
&\le C \left[ \varphi^{20\xi} \left( q^2 + \frac{1}{(N\eta)^{4/3}} \right) + \varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta} \right],
\end{aligned}
\]
where we used Young's inequality for the $\sqrt{q}/(N\eta)$ term. Now applying Lemma 2.1.36, we get that with $(\xi - \tau_N)$-high probability,
\[
\theta \le C\varphi^{10\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + C\varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta\sqrt{\kappa + \eta}} \le C\varphi^{18\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right),
\]
where we used (2.1.38) in the second step. Then using Lemma 2.1.34, (2.1.113) and
(2.1.117), it is easy to obtain that
\[
\Lambda \le C\varphi^{2\xi}(q + \Psi_\theta) + \theta \le C\varphi^{18\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + C\varphi^{2\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}}
\]
uniformly in $z \in S(c_0, C_0, L)$ with $(\xi - \tau_N)$-high probability, which is a better bound than the one in (2.1.126). We can repeat this process $M$ times, where each iteration yields a stronger bound on $\Lambda$ which holds with a slightly smaller probability. More specifically, suppose that after $k$ iterations we have the bound
\[
\Lambda \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{1-\tau}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \tag{2.1.131}
\]
uniformly in $z \in S(c_0, C_0, L)$ with $\xi'$-high probability. Then by Lemma 2.1.38 and (2.1.111), we have with $(\xi' - \tau_N)$-high probability,
\[
\begin{aligned}
|z - f(m_2)| &\le C\varphi^{18\xi} \left[ q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}}{N\eta} + \frac{\varphi^{20\xi}}{N\eta} \left( q + \frac{1}{(N\eta)^{1-\tau}} \right) + \frac{\varphi^{3\xi}}{N\eta} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \right] \\
&\le C \left[ \varphi^{38\xi} \left( q^2 + \frac{1}{(N\eta)^{2-\tau}} \right) + \varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta} \right].
\end{aligned}
\]
Then using Lemma 2.1.36, we get that with $(\xi' - \tau_N)$-high probability,
\[
\theta \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + C\varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta\sqrt{\kappa + \eta}} \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right).
\]
Again, with Lemma 2.1.34, (2.1.113) and (2.1.117), we obtain that
\[
\Lambda \le C\varphi^{2\xi}(q + \Psi_\theta) + \theta \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + C\varphi^{2\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}}, \tag{2.1.132}
\]
uniformly in z ∈ S(c₀, C₀, L) with (ξ′ − τ_N)-high probability. Comparing with (2.1.131), we see that the power of (Nη)⁻¹ is increased from 1 − τ to 1 − τ/2; moreover, there is no extra constant C appearing on the right-hand side of (2.1.132). Thus after M iterations, we get
\[
\Lambda \le \varphi^{20\xi}\bigg(q+\frac{1}{(N\eta)^{1-(1/2)^{M-1}/3}}\bigg)+\varphi^{3\xi}\sqrt{\frac{\operatorname{Im}m_{c}}{N\eta}}, \tag{2.1.133}
\]
uniformly in z ∈ S(c₀, C₀, L) with (ξ − Mτ_N)-high probability. Taking M = ⌊log log N / log 2⌋ such that
\[
\xi - M\tau_N \ge \tilde\xi, \qquad (N\eta)^{(1/2)^{M-1}/3} \le (N\eta)^{4/(3\log N)} \le C,
\]
we can then conclude (2.1.130) and hence (2.1.42). Finally, to prove (2.1.41), we only need to plug (2.1.130) into Lemma 2.1.38 and then apply Lemma 2.1.36.
Proof of (2.1.43). The bound in (2.1.43) follows from a standard application of the local
laws (2.1.41) and (2.1.42). The proof is exactly the same as the one for Lemma 4.4 in
[50]. We omit the details here.
2.1.2 Universality of singular vector distribution
Sample covariance matrices with a general class of populations. We first introduce some notation. Throughout the paper, we will use
\[
r = \lim_{N\to\infty} r_N = \lim_{N\to\infty}\frac{N}{M}. \tag{2.1.134}
\]
Let X = (x_{ij}) be an M × N data matrix with centered entries x_{ij} = N^{−1/2}q_{ij}, 1 ≤ i ≤ M, 1 ≤ j ≤ N, where the q_{ij} are i.i.d. random variables with unit variance such that for every p ∈ ℕ there exists a constant C_p with
\[
\mathbb E|q_{11}|^p \le C_p. \tag{2.1.135}
\]
We consider the sample covariance matrix Q = TXX*T*, where T is a deterministic matrix such that T*T is a positive diagonal matrix. Using the QR factorization [63, Theorem 5.2.1], we find that T = UΣ^{1/2}, where U is an orthogonal matrix and Σ is a positive diagonal matrix. Denote Y = Σ^{1/2}X and write the singular value decomposition of Y as
\[
Y = \sum_{k=1}^{N\wedge M}\sqrt{\lambda_k}\,\xi_k\zeta_k^*,
\]
where λ_k, k = 1, 2, ···, N ∧ M, are the nontrivial eigenvalues of Q, and {ξ_k}_{k=1}^M and {ζ_k}_{k=1}^N are orthonormal bases of ℝ^M and ℝ^N respectively. First of all, we observe that
\[
X^*T^*TX = Y^*Y = Z\Lambda_N Z^*,
\]
where the columns of Z are ζ₁, ···, ζ_N and Λ_N is a diagonal matrix with entries λ₁, ···, λ_N. As a consequence, U does not influence the right singular vectors of Y. Next, we have
\[
TXX^*T^* = UYY^*U^* = U\Xi\Lambda_M\Xi^*U^*,
\]
where the columns of Ξ are ξ_k, k = 1, 2, ···, M, and Λ_M is a diagonal matrix containing λ₁, ···, λ_M. Using the fact that the product of orthogonal matrices is again orthogonal, we conclude that the left singular vectors of TX are ξ̃_k := Uξ_k. Hence, each component of ξ̃_k is a linear combination of the components of ξ_k. For instance, we have
\[
\tilde\xi_k(i)\tilde\xi_k(j) = \sum_{p_1=1}^{M}\sum_{p_2=1}^{M} U_{ip_1}U_{jp_2}\,\xi_k(p_1)\xi_k(p_2).
\]
By the delocalization result (see Lemma 2.1.62) and the dominated convergence theorem, we only need to consider the universality of the entries of ξ_k. The above discussion shows that we can make the following assumption on T:
\[
T \equiv \Sigma^{1/2} = \operatorname{diag}\{\sigma_1^{1/2},\cdots,\sigma_M^{1/2}\}, \quad \text{with } \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_M > 0. \tag{2.1.136}
\]
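The cancellation of the orthogonal factor U in X*T*TX is easy to check numerically; the following is a minimal sketch, where the dimensions and the diagonal population Σ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 8
X = rng.standard_normal((M, N)) / np.sqrt(N)      # entries x_ij = q_ij / sqrt(N)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=M))    # positive diagonal population
U, _ = np.linalg.qr(rng.standard_normal((M, M)))  # a random orthogonal factor
T = U @ np.sqrt(Sigma)                            # T = U Sigma^{1/2}

# The Gram matrix determining the right singular vectors is the same for TX and Y:
G_TX = X.T @ (T.T @ T) @ X    # X^* T^* T X
G_Y = X.T @ Sigma @ X         # Y^* Y with Y = Sigma^{1/2} X
print(np.allclose(G_TX, G_Y)) # True: U drops out of the right singular vectors
```

Since T*T = Σ^{1/2}U^T U Σ^{1/2} = Σ, the two Gram matrices agree exactly, so the right singular vectors coincide while the left ones are merely rotated by U.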
We denote the empirical spectral distribution of Σ by
\[
\pi := \frac{1}{M}\sum_{i=1}^{M}\delta_{\sigma_i}. \tag{2.1.137}
\]
Suppose that there exists some small positive constant τ such that
\[
\tau < \sigma_M \le \sigma_1 \le \tau^{-1}, \qquad \tau \le r \le \tau^{-1}, \qquad \pi([0,\tau]) \le 1-\tau. \tag{2.1.138}
\]
For definiteness, in this paper we focus on the real case, i.e. all the entries xij are real.
However, it is clear that our results and proofs can be applied to the complex case after
minor modifications if we assume in addition that Re xij and Im xij are independent
centered random variables with the same variance. To avoid repetition, we summarize
the basic assumptions for future reference.
Assumption 2.1.39. We assume X is an M × N matrix with centered i.i.d entries
satisfying (2.3.1) and (2.1.135). We also assume that T is a deterministic M × M
matrix satisfying (2.1.136) and (2.3.4).
From now on, we will always use Y = Σ^{1/2}X and its singular value decomposition
\[
Y = \sum_{k=1}^{N\wedge M}\sqrt{\lambda_k}\,\xi_k\zeta_k^*, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{M\wedge N}.
\]
Deformed Marcenko-Pastur law. In this subsection we discuss the empirical spectral distribution of X*T*TX, essentially following the discussion of [74, Section 2.2]. It is well known that if π is a compactly supported probability measure on ℝ and r_N > 0, then for any z ∈ ℂ₊ there is a unique m ≡ m_N(z) ∈ ℂ₊ satisfying
\[
\frac{1}{m} = -z + \frac{1}{r_N}\int\frac{x}{1+mx}\,\pi(dx). \tag{2.1.139}
\]
In this paper, we define the deterministic function m ≡ m(z) as the unique solution of (2.3.5) with π defined in (2.3.3). We denote by ρ the probability measure associated with m (i.e. m is the Stieltjes transform of ρ) and call it the asymptotic density of X*T*TX. Our assumption (2.3.4) implies that the spectrum of Σ cannot concentrate at zero; this ensures that π is a compactly supported probability measure. Therefore, m and ρ are well defined.
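The self-consistent equation (2.1.139) can be solved numerically: its right-hand side defines a self-map of ℂ₊, so a damped fixed-point iteration converges for Im z bounded away from zero. A sketch with illustrative test values of z, r_N and π (not taken from the text):

```python
import numpy as np

def solve_m(z, sigmas, r_N, iters=1000):
    """Damped fixed point for 1/m = -z + (1/r_N) * int x/(1+m x) d pi(x),
    where pi is the empirical measure of the array `sigmas`."""
    m = 1j                                        # start in the upper half plane
    for _ in range(iters):
        integral = np.mean(sigmas / (1.0 + m * sigmas))
        m = 0.5 * m + 0.5 / (-z + integral / r_N) # averaging step damps oscillation
    return m

sigmas = np.ones(50)   # pi = delta_1 recovers the classical Marchenko-Pastur law
r_N = 0.5              # illustrative aspect ratio N/M
z = 5.0 + 1.0j
m = solve_m(z, sigmas, r_N)
residual = abs(1.0 / m - (-z + np.mean(sigmas / (1.0 + m * sigmas)) / r_N))
print(m.imag > 0, residual < 1e-8)
```

The density ρ can then be recovered as π⁻¹ Im m(E + iη) for small η > 0.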
For z ∈ ℂ₊, m ≡ m(z) can be characterized as the unique solution of the equation
\[
z = f(m), \quad \operatorname{Im} m \ge 0, \qquad \text{where } f(x) := -\frac{1}{x} + \frac{1}{r_N}\sum_{i=1}^{M}\frac{\pi(\sigma_i)}{x+\sigma_i^{-1}}. \tag{2.1.140}
\]
The behaviour of ρ can be entirely understood through the analysis of f. We summarize the elementary properties of ρ in the following lemma.
Lemma 2.1.40. Denote ℝ̄ = ℝ ∪ {∞}. Then f defined in (2.3.7) is smooth on the M + 1 open intervals of ℝ̄ defined through
\[
I_1 := (-\sigma_1^{-1}, 0), \qquad I_i := (-\sigma_i^{-1}, -\sigma_{i-1}^{-1}),\ i = 2,\cdots,M, \qquad I_0 := \bar{\mathbb R}\setminus\textstyle\bigcup_{i=1}^{M} I_i.
\]
We also introduce a multiset 𝒞 ⊂ ℝ̄ containing the critical points of f, with the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case r_N = 1, ∞ is a nondegenerate critical point. With the above notation, we have:
• |𝒞 ∩ I₀| = |𝒞 ∩ I₁| = 1 and |𝒞 ∩ I_i| ∈ {0, 2} for i = 2, ···, M. Therefore |𝒞| = 2p, where for convenience we denote by x₁ ≥ x₂ ≥ ··· ≥ x_{2p−1} the 2p − 1 critical points in I₁ ∪ ··· ∪ I_M and by x_{2p} the unique critical point in I₀.
• Denote a_k := f(x_k); then a₁ ≥ ··· ≥ a_{2p}. Moreover, x_k = m(a_k), with the convention m(0) := ∞ for r_N = 1. Furthermore, for k = 1, ···, 2p, there exists a constant C such that 0 ≤ a_k ≤ C.
• supp ρ ∩ (0,∞) = (∪_{k=1}^{p}[a_{2k}, a_{2k−1}]) ∩ (0,∞).
With the above definitions and properties, we now introduce the key regularity assumption on Σ.
Assumption 2.1.41. Fix τ > 0. We say that
(i) the edges a_k, k = 1, ···, 2p, are regular if
\[
a_k \ge \tau, \qquad \min_{l\ne k}|a_k-a_l| \ge \tau, \qquad \min_i|x_k+\sigma_i^{-1}| \ge \tau; \tag{2.1.141}
\]
(ii) the bulk components k = 1, ···, p are regular if for any fixed τ′ > 0 there exists a constant c ≡ c_{τ,τ′} such that the density of ρ in [a_{2k}+τ′, a_{2k−1}−τ′] is bounded from below by c.
Remark 2.1.42. The second condition in (2.3.9) states that the gap in the spectrum of ρ adjacent to a_k remains well separated when N is sufficiently large, and the third condition ensures a square-root behaviour of ρ in a small neighbourhood of a_k. To be specific, consider the right edge of the k-th bulk component: by (A.12) of [74], there exists some small constant c > 0 such that ρ has the square-root behaviour
\[
\rho(x) \sim \sqrt{a_{2k-1}-x}, \qquad x \in [a_{2k-1}-c,\ a_{2k-1}]. \tag{2.1.142}
\]
As a consequence, edge regularity rules out outliers. The bulk regularity imposes a lower bound on the density of eigenvalues away from the edges.
Main results. This subsection presents the main results of this paper. We first introduce some notation. Recall that the nontrivial classical eigenvalue locations γ₁ ≥ γ₂ ≥ ··· ≥ γ_{M∧N} of Q are defined through
\[
\int_{\gamma_i}^{\infty} d\rho = \frac{i-\frac12}{N}.
\]
By Lemma 2.3.2, there are p bulk components in the spectrum of ρ. For k = 1, ···, p, we define the classical number of eigenvalues of the k-th bulk component through \(N_k := N\int_{a_{2k}}^{a_{2k-1}} d\rho\). When p ≥ 1, we relabel the λ_i and γ_i separately for each bulk component k = 1, ···, p by introducing
\[
\lambda_{k,i} := \lambda_{i+\sum_{l<k}N_l}, \qquad \gamma_{k,i} := \gamma_{i+\sum_{l<k}N_l} \in (a_{2k},\ a_{2k-1}). \tag{2.1.143}
\]
Equivalently, we can characterize γ_{k,i} through
\[
\int_{\gamma_{k,i}}^{a_{2k-1}} d\rho = \frac{i-\frac12}{N}. \tag{2.1.144}
\]
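The classical locations can be computed numerically by inverting the tail integral of ρ. A sketch for the single-bulk square case Σ = I, M = N (where ρ is the Marchenko-Pastur density on [0, 4]; the dimension and grid are illustrative) compares them with sampled eigenvalues:

```python
import numpy as np

N = 400
rng = np.random.default_rng(2)
X = rng.standard_normal((N, N)) / np.sqrt(N)
lam = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # eigenvalues, decreasing

# Marchenko-Pastur density for the square case, supported on [0, 4]
x = np.linspace(1e-6, 4.0, 200_000)
rho = np.sqrt(np.maximum(4.0 / x - 1.0, 0.0)) / (2.0 * np.pi)
dx = x[1] - x[0]
tail = np.cumsum(rho[::-1])[::-1] * dx             # tail[j] ~ int_{x[j]}^4 rho

# gamma_i solves int_{gamma_i}^4 rho = (i - 1/2)/N, cf. (2.1.144)
targets = (np.arange(1, N + 1) - 0.5) / N
gammas = np.interp(targets, tail[::-1], x[::-1])   # invert the decreasing tail
print(np.max(np.abs(lam - gammas)))                # small: eigenvalue rigidity
```

The maximal deviation is on the N^{-2/3} edge scale, illustrating the rigidity estimates quoted later in this section.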
In the present paper, we will use the following assumption for the technical purpose of applying the anisotropic local law.
Assumption 2.1.43. For k = 1, 2, ···, p and i = 1, 2, ···, N_k, we have γ_{k,i} ≥ τ for some constant τ > 0.
We define the index sets I₁ := {1, ..., M}, I₂ := {M+1, ..., M+N}, I := I₁ ∪ I₂. We will consistently use Latin letters i, j ∈ I₁, Greek letters μ, ν ∈ I₂, and s, t ∈ I. We label the indices of the matrix according to X = (X_{iμ} : i ∈ I₁, μ ∈ I₂). Similarly, we label the entries of ξ_k ∈ ℝ^{I₁} and ζ_k ∈ ℝ^{I₂}. In the k-th bulk component, k = 1, 2, ···, p, we rewrite the index α′ of λ_{α′} as
\[
\alpha' := l + \sum_{t<k}N_t, \quad \text{when } \alpha' - \sum_{t<k}N_t < \sum_{t\le k}N_t - \alpha', \tag{2.1.145}
\]
\[
\alpha' := -l + 1 + \sum_{t\le k}N_t, \quad \text{when } \alpha' - \sum_{t<k}N_t > \sum_{t\le k}N_t - \alpha'. \tag{2.1.146}
\]
In this paper, we will always say that l is associated with α′. Note that α′ is the index of λ_{k,l} before the relabeling of (2.1.143), and the two cases correspond to the right and left edges respectively. Our main result on the distribution of the components of the singular vectors near the edges is the following theorem. For any positive integers m, k, a function θ : ℝ^m → ℝ and x = (x₁, ···, x_m) ∈ ℝ^m, we denote
\[
\partial^{(k)}\theta(x) = \frac{\partial^k\theta(x)}{\partial x_1^{k_1}\partial x_2^{k_2}\cdots\partial x_m^{k_m}}, \qquad \sum_{i=1}^{m}k_i = k,\quad k_1,k_2,\cdots,k_m \ge 0, \tag{2.1.147}
\]
and write ||x||₂ for its ℓ²-norm. Denote Q^G := Σ^{1/2}X^G X^{G*}Σ^{1/2}, where X^G is GOE and Σ satisfies (2.1.136) and (2.3.4).
Theorem 2.1.44. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and let 𝔼_G, 𝔼_V denote the expectations with respect to X^G, X^V. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l ≤ N_k^δ, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ² that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.148}
\]
Theorem 2.1.45. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39. Consider the k₁-th, ···, k_n-th bulk components, k₁, ···, k_n ∈ {1, 2, ···, p}, n ≤ p, and l_{k_i} defined in (2.1.145) or (2.1.146) and associated with α′_{k_i} in the k_i-th bulk component, i = 1, 2, ···, n. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l_{k_i} ≤ N_{k_i}^δ, i = 1, 2, ···, n, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{2n} that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.149}
\]
Remark 2.1.46. The results in Theorems 2.1.44 and 2.1.45 can easily be extended to a general form containing more entries of the singular vectors, using a general form of the Green function comparison argument. For example, to extend Theorem 2.1.44, we consider the k-th bulk component and choose any positive integer β. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i₁, j₁, ···, i_β, j_β ∈ I₁ and μ₁, ν₁, ···, μ_β, ν_β ∈ I₂, and for the corresponding l_i defined in (2.1.145) or (2.1.146), i = 1, 2, ···, β, there exists some 0 < δ < 1 with 0 < max_{1≤i≤β} l_i ≤ N_k^δ such that
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_1}(i_1)\xi_{\alpha'_1}(j_1),\ N\zeta_{\alpha'_1}(\mu_1)\zeta_{\alpha'_1}(\nu_1),\ \cdots,\ N\xi_{\alpha'_\beta}(i_\beta)\xi_{\alpha'_\beta}(j_\beta),\ N\zeta_{\alpha'_\beta}(\mu_\beta)\zeta_{\alpha'_\beta}(\nu_\beta)\big) = 0, \tag{2.1.150}
\]
where θ is a smooth function on ℝ^{2β} satisfying |∂^{(k)}θ(x)| ≤ C(1+||x||₂)^C, k = 1, 2, 3, with some constant C > 0. Similarly, we can extend Theorem 2.1.45 to contain more entries of the singular vectors.
Recalling (2.1.143), denote ϖ_k := (|f″(x_k)|/2)^{1/3}, k = 1, 2, ···, 2p. For any positive integer h, we define
\[
q_{2k-1,h} := \frac{N^{2/3}}{\varpi_{2k-1}}\big(\lambda_{k,h}-a_{2k-1}\big), \qquad q_{2k,h} := -\frac{N^{2/3}}{\varpi_{2k}}\big(\lambda_{k,N_k-h+1}-a_{2k}\big).
\]
Consider a smooth function θ on ℝ whose third derivative θ^{(3)} satisfies |θ^{(3)}(x)| ≤ C(1+|x|)^C for some constant C > 0. Then we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta(q_{k,h}) = 0. \tag{2.1.151}
\]
Together with Theorem 2.1.44, we have the following corollary. Denote t = 2k − 1 if α′ is given by (2.1.145) and t = 2k if α′ is given by (2.1.146).
Corollary 2.1.47. Under the assumptions of Theorem 2.1.44, for some positive integer h, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t,h},\ N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0, \tag{2.1.152}
\]
where θ is a smooth function on ℝ³ satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.153}
\]
Corollary 2.1.47 can be extended to a general form involving several bulk components. Denote t_i = 2k_i − 1 if α′_{k_i} is given by (2.1.145) and t_i = 2k_i if α′_{k_i} is given by (2.1.146).
Corollary 2.1.48. Under the assumptions of Theorem 2.1.45, for some positive integer h, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t_1,h},\ N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ q_{t_n,h},\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{3n} satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.154}
\]
Remark 2.1.49. (i) Similarly to (2.1.150), the results in Corollaries 2.1.47 and 2.1.48 can easily be extended to a general form containing more entries of the singular vectors. For example, to extend Corollary 2.1.47, we can choose any positive integers β and h₁, ···, h_β. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i₁, j₁, ···, i_β, j_β ∈ I₁ and μ₁, ν₁, ···, μ_β, ν_β ∈ I₂, and for the corresponding l_i defined in (2.1.145) or (2.1.146), i = 1, 2, ···, β, there exists some 0 < δ < 1 with max_{1≤i≤β} l_i ≤ N_k^δ such that
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t_1,h_1},\ N\xi_{\alpha'_1}(i_1)\xi_{\alpha'_1}(j_1),\ N\zeta_{\alpha'_1}(\mu_1)\zeta_{\alpha'_1}(\nu_1),\ \cdots,\ q_{t_\beta,h_\beta},\ N\xi_{\alpha'_\beta}(i_\beta)\xi_{\alpha'_\beta}(j_\beta),\ N\zeta_{\alpha'_\beta}(\mu_\beta)\zeta_{\alpha'_\beta}(\nu_\beta)\big) = 0,
\]
where θ is a smooth function on ℝ^{3β} satisfying |∂^{(k)}θ(x)| ≤ C(1+||x||₂)^C, k = 1, 2, 3, for some constant C.
(ii) Theorems 2.1.44 and 2.1.45 and Corollaries 2.1.47 and 2.1.48 still hold true in the complex case, where the moment matching condition is replaced by
\[
\mathbb E_G\,x_{ij}^l\bar x_{ij}^u = \mathbb E_V\,x_{ij}^l\bar x_{ij}^u, \qquad 0 \le l+u \le 2. \tag{2.1.155}
\]
In the bulks, similar results hold under the stronger assumption that the first four moments of the matrix entries match those of the Gaussian ensembles.
Theorem 2.1.50. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and assume that the third and fourth moments of X^V agree with those of X^G. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a small δ ∈ (0,1) such that, when δN_k ≤ l ≤ (1−δ)N_k, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ² that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0. \tag{2.1.156}
\]
Theorem 2.1.51. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and assume that the third and fourth moments of X^V agree with those of X^G. Consider the k₁-th, ···, k_n-th bulk components, k₁, ···, k_n ∈ {1, 2, ···, p}, n ≤ p, and l_{k_i} defined in (2.1.145) or (2.1.146) and associated with the k_i-th bulk component, i = 1, 2, ···, n. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when δN_{k_i} ≤ l_{k_i} ≤ (1−δ)N_{k_i}, i = 1, 2, ···, n, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{2n} that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0. \tag{2.1.157}
\]
Remark 2.1.52. (i) Similarly to Corollaries 2.1.47 and 2.1.48 and (i) of Remark 2.1.49, we can extend the results to the joint distribution containing singular values. We take the extension of Theorem 2.1.50 as an example. By (ii) of Assumption 2.1.41, in the bulk we have
\[
\int_{\lambda_{\alpha'}}^{\gamma_{\alpha'}} d\rho = \frac 1N + o(N^{-1}).
\]
Using a similar Dyson Brownian motion argument, combined with Theorem 2.1.50, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(p_{\alpha'},\ N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0, \tag{2.1.158}
\]
where p_{α′} is defined as
\[
p_{\alpha'} := \rho(\gamma_{\alpha'})\,N(\lambda_{\alpha'}-\gamma_{\alpha'}),
\]
and θ is a smooth function on ℝ³ satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0.
\]
(ii) Theorems 2.1.50 and 2.1.51 still hold true in the complex case, where the moment matching condition is replaced by
\[
\mathbb E_G\,x_{ij}^l\bar x_{ij}^u = \mathbb E_V\,x_{ij}^l\bar x_{ij}^u, \qquad 0 \le l+u \le 4. \tag{2.1.159}
\]
Applications to statistics. In this subsection, we give a few remarks on possible applications to statistics. It is notable that, in general, the distribution of the singular vectors of the sample covariance matrix Q = TXX*T* is unknown, even in the Gaussian case. However, when T is a scalar matrix (i.e. T = cI, c > 0), Bourgade and Yau [114, Appendix C] have shown that the entries of the singular vectors are asymptotically normally distributed. Hence, our universality results imply that under Assumptions 2.1.39, 2.1.41 and 2.1.43, when T is conformal (i.e. T*T = cI, c > 0), the entries of the right singular vectors are asymptotically normally distributed. Therefore, this can be used to test the null hypothesis
\[
H_0:\ T \text{ is a conformal matrix.} \tag{2.1.160}
\]
The statistical testing problem (2.1.160) contains a rich class of hypothesis tests. For instance, when T = I it reduces to the sphericity test, and when c = 1 it reduces to testing whether the covariance matrix of X is orthogonal [113].
To illustrate how our results can be used to test (2.1.160), we take c = 1 in the following discussion. Under H₀, the QR factorization of T reads T = UI, so the right singular vectors of TX are the same as those of X, namely ζ_k, k = 1, 2, ···, N. Using [114, Corollary 1.3], we find that for i, k = 1, 2, ···, N,
\[
\sqrt N\,\zeta_k(i) \to \mathcal N, \tag{2.1.161}
\]
where 𝒩 is a standard Gaussian random variable. In detail, we can take the following steps to test whether H₀ holds true:
1) Randomly choose two index sets R₁, R₂ ⊂ {1, 2, ···, N} with |R_i| = O(1), i = 1, 2.
2) Use the bootstrap to resample the columns of Q and obtain a sequence of M × N matrices Q_j, j = 1, 2, ···, K.
3) Extract ζ_k^j(i), k ∈ R₁, i ∈ R₂, from Q_j, j = 1, 2, ···, K. Use a classical normality test, for instance the Shapiro-Wilk test, to check whether (2.1.161) holds true for all the above samples. Record in A the number of samples that are not rejected by the normality test.
4) Given some pre-chosen significance level α, reject H₀ if A/(|R₁||R₂|) < 1 − α.
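The four steps above can be sketched as follows. The dimensions, index sets, number of resamples K, and the hand-rolled Jarque-Bera statistic (standing in here for the Shapiro-Wilk test of step 3) are all illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K, alpha = 50, 100, 40, 0.05
X = rng.standard_normal((M, N)) / np.sqrt(N)      # data under H0 with T = I

def normality_not_rejected(v, crit=5.99):         # Jarque-Bera, chi2(2) 5% cutoff
    v = np.asarray(v); n = len(v)
    z = (v - v.mean()) / v.std()
    S, Kex = np.mean(z**3), np.mean(z**4) - 3.0   # skewness, excess kurtosis
    return n / 6.0 * (S**2 + Kex**2 / 4.0) < crit

R1, R2 = [2, 5], [3, 7]                           # step 1: small index sets
samples = {(k, i): [] for k in R1 for i in R2}
for _ in range(K):                                # step 2: bootstrap the columns
    Xb = X[:, rng.integers(0, N, size=N)]
    Vt = np.linalg.svd(Xb, full_matrices=False)[2]
    for k in R1:
        for i in R2:                              # step 3: collect sqrt(N)*zeta_k(i)
            samples[(k, i)].append(np.sqrt(N) * Vt[k, i])

A = sum(normality_not_rejected(v) for v in samples.values())
reject_H0 = A / (len(R1) * len(R2)) < 1 - alpha   # step 4
```

In practice `scipy.stats.shapiro` could replace the moment-based statistic, matching the test named in step 3.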
The other important piece of information provided by our results is that the singular vectors are completely delocalized. Consider the low-rank matrix denoising problem
\[
\tilde S = TX + S,
\]
where S is a deterministic low-rank matrix. Consider the rank-one case and assume that the left singular vector u of S is sparse. Using the complete delocalization result, it can be shown that the first left singular vector of S̃ has the same sparse structure as that of u. Thus, to estimate the singular vectors of S, we only need to perform a singular value decomposition on a block submatrix of S̃.
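This observation can be sketched in a rank-one spiked model; the signal strength, sparsity level and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, s = 200, 400, 10
u = np.zeros(M); u[:s] = 1.0 / np.sqrt(s)          # sparse left singular vector
v = rng.standard_normal(N); v /= np.linalg.norm(v)
S = 5.0 * np.outer(u, v)                           # rank-one signal, strength 5
S_obs = rng.standard_normal((M, N)) / np.sqrt(N) + S  # noisy observation

u_hat = np.linalg.svd(S_obs)[0][:, 0]              # top left singular vector
energy_on_support = float(np.sum(u_hat[:s] ** 2))  # mass on the true support
print(energy_on_support)                           # close to 1
```

With a signal well above the noise edge, the top singular vector of the observation concentrates its energy on the support of u, so restricting the SVD to the corresponding block loses little.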
Notations and tools. In this part, we introduce some notation and tools which will be used in this paper. Throughout the paper, we will always use ε₁ for a small constant and D₁ for a large constant. Recall that the ESD of an n × n symmetric matrix H is defined as
\[
F_H^{(n)}(\lambda) := \frac 1n\sum_{i=1}^{n}\mathbf 1_{\{\lambda_i(H)\le\lambda\}}.
\]
For some small constant τ > 0, we define the typical domain for z = E + iη as
\[
D(\tau) = \{z \in \mathbb C_+ : |E| \le \tau^{-1},\ N^{-1+\tau} \le \eta \le \tau^{-1}\}. \tag{2.1.162}
\]
Definition 2.1.53 (Stieltjes transform). Recall that the Green functions for YY* and Y*Y are defined as
\[
G_1(z) := (YY^*-z)^{-1}, \qquad G_2(z) := (Y^*Y-z)^{-1}, \qquad z = E+i\eta \in \mathbb C_+. \tag{2.1.163}
\]
The Stieltjes transform of the ESD of Y*Y is given by
\[
m_2(z) \equiv m_2^{(N)}(z) := \int\frac{1}{x-z}\,dF_{Y^*Y}^{(N)}(x) = \frac 1N\sum_{i=1}^{N}(G_2)_{ii}(z) = \frac 1N\operatorname{Tr}G_2(z). \tag{2.1.164}
\]
Similarly, we can also define m₁(z) ≡ m₁^{(M)}(z) := M⁻¹ Tr G₁(z).
Definition 2.1.54. For z ∈ ℂ₊, we define the (N+M) × (N+M) self-adjoint matrix
\[
H \equiv H(X,\Sigma) := \begin{pmatrix}-zI & z^{1/2}Y\\ z^{1/2}Y^* & -zI\end{pmatrix}, \tag{2.1.165}
\]
and
\[
G \equiv G(X,z) := H^{-1}. \tag{2.1.166}
\]
By Schur's complement, it is easy to check that
\[
G = \begin{pmatrix}G_1(z) & z^{-1/2}G_1(z)Y\\ z^{-1/2}Y^*G_1(z) & z^{-1}Y^*G_1(z)Y-z^{-1}I\end{pmatrix} = \begin{pmatrix}z^{-1}YG_2(z)Y^*-z^{-1}I & z^{-1/2}YG_2(z)\\ z^{-1/2}G_2(z)Y^* & G_2(z)\end{pmatrix}, \tag{2.1.167}
\]
for G_{1,2} defined in (2.2.6). Thus a control of G directly yields a control of (YY* − z)⁻¹ and (Y*Y − z)⁻¹. Moreover, we have
\[
m_1(z) = \frac 1M\sum_{i\in I_1}G_{ii}, \qquad m_2(z) = \frac 1N\sum_{\mu\in I_2}G_{\mu\mu}.
\]
Recall that \(Y = \sum_{k=1}^{M\wedge N}\sqrt{\lambda_k}\,\xi_k\zeta_k^*\), ξ_k ∈ ℝ^{I₁}, ζ_k ∈ ℝ^{I₂}. By (2.2.48), we have
\[
G(z) = \sum_{k=1}^{M\wedge N}\frac{1}{\lambda_k-z}\begin{pmatrix}\xi_k\xi_k^* & z^{-1/2}\sqrt{\lambda_k}\,\xi_k\zeta_k^*\\ z^{-1/2}\sqrt{\lambda_k}\,\zeta_k\xi_k^* & \zeta_k\zeta_k^*\end{pmatrix}. \tag{2.1.168}
\]
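The block identities (2.1.167) can be verified numerically on a small instance; the dimensions and the spectral parameter z below are arbitrary, and z^{1/2} is taken as the principal branch:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 6
Y = rng.standard_normal((M, N)) / np.sqrt(N)
z = 1.5 + 0.3j
sz = np.sqrt(z)                                   # principal branch of z^{1/2}

H = np.block([[-z * np.eye(M), sz * Y],
              [sz * Y.T, -z * np.eye(N)]])
G = np.linalg.inv(H)

G1 = np.linalg.inv(Y @ Y.T - z * np.eye(M))
G2 = np.linalg.inv(Y.T @ Y - z * np.eye(N))
ok = (np.allclose(G[:M, :M], G1) and np.allclose(G[M:, M:], G2)
      and np.allclose(G[:M, M:], Y @ G2 / sz))    # z^{-1/2} Y G_2 block
print(ok)                                         # True
```

This mirrors the Schur complement computation: the top-left block of H⁻¹ is (YY* − z)⁻¹ and the bottom-right block is (Y*Y − z)⁻¹.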
Denote
\[
\Psi(z) := \sqrt{\frac{\operatorname{Im}m(z)}{N\eta}}+\frac{1}{N\eta}, \qquad \Sigma_o := \begin{pmatrix}\Sigma & 0\\ 0 & I\end{pmatrix}, \qquad \widetilde\Sigma := \begin{pmatrix}z^{-1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.169}
\]
Definition 2.1.55. For z ∈ ℂ₊, we define the I × I matrix
\[
\Pi(z) := \begin{pmatrix}-z^{-1}(1+m(z)\Sigma)^{-1} & 0\\ 0 & m(z)\end{pmatrix}. \tag{2.1.170}
\]
We will see later from Lemma 2.3.27 that G(z) converges to Π(z) in probability.
Remark 2.1.56. In [74, Definition 3.2], the linearizing block matrix is defined as
\[
H_o := \begin{pmatrix}-\Sigma^{-1} & X\\ X^* & -zI\end{pmatrix}. \tag{2.1.171}
\]
It is easy to check the following relation between (2.1.165) and (2.1.171):
\[
H = \begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}H_o\begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.172}
\]
In [74, Definition 3.3], the deterministic limit of H_o^{-1} is
\[
\Pi_o(z) = \begin{pmatrix}-\Sigma(1+m(z)\Sigma)^{-1} & 0\\ 0 & m(z)\end{pmatrix}. \tag{2.1.173}
\]
Therefore, by (2.1.172), we get a similar relation between (2.2.64) and (2.1.173):
\[
\Pi(z) = \begin{pmatrix}z^{-1/2}\Sigma^{-1/2} & 0\\ 0 & I\end{pmatrix}\Pi_o(z)\begin{pmatrix}z^{-1/2}\Sigma^{-1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.174}
\]
Definition 2.1.57. We introduce the notation X^{(T)} for the M × (N − |T|) minor of X obtained by deleting the i-th columns of X for i ∈ T. For convenience, ({i}) will be abbreviated to (i). We keep the original indices of X for X^{(T)}, that is, X^{(T)}_{ij} = 1(j ∉ T)X_{ij}. We will denote
\[
Y^{(T)} = \Sigma^{1/2}X^{(T)}, \qquad G_1^{(T)} = \big(Y^{(T)}Y^{(T)*}-zI\big)^{-1}, \qquad G_2^{(T)} = \big(Y^{(T)*}Y^{(T)}-zI\big)^{-1}. \tag{2.1.175}
\]
Consequently, m₁^{(T)}(z) = M⁻¹ Tr G₁^{(T)}(z) and m₂^{(T)}(z) = N⁻¹ Tr G₂^{(T)}(z).
Our key ingredient is the anisotropic local law derived by Knowles and Yin in [74].
Lemma 2.1.58. Fix τ > 0 and assume (2.3.1), (2.1.135) and (2.3.4) hold. Moreover, suppose that every edge k = 1, ···, 2p satisfies a_k ≥ τ and every bulk component k = 1, ···, p is regular in the sense of Assumption 2.1.41. Then for all z ∈ D(τ) and any unit vectors u, v ∈ ℝ^{M+N}, there exist a small constant ε₁ > 0 and a large constant D₁ > 0 such that, when N is large enough, with probability 1 − N^{−D₁} we have
\[
\big|\big\langle u,\ \widetilde\Sigma^{-1}\big(G(z)-\Pi(z)\big)\widetilde\Sigma^{-1}v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \tag{2.1.176}
\]
and
\[
|m_2(z)-m(z)| \le N^{\varepsilon_1}\Psi(z). \tag{2.1.177}
\]
Proof. (2.1.177) is already proved in (3.11) of [74]; we only need to prove (2.2.63). By (2.1.172), we have
\[
G_o(z) = \begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}G(z)\begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.178}
\]
By [74, Theorem 3.6], with probability 1 − N^{−D₁} we have
\[
\big|\big\langle u,\ \Sigma_o^{-1}\big(G_o(z)-\Pi_o(z)\big)\Sigma_o^{-1}v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z). \tag{2.1.179}
\]
Therefore, by (2.1.174), (2.1.178) and (2.1.179), we conclude the proof.
It is easy to derive the following corollary from Lemma 2.3.27.
Corollary 2.1.59. Under the assumptions of Lemma 2.3.27, with probability 1 − N^{−D₁} we have
\[
\big|\big\langle v,\ (G_2(z)-m(z))v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \qquad \big|\big\langle u,\ \big(G_1(z)+z^{-1}(1+m(z)\Sigma)^{-1}\big)u\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \tag{2.1.180}
\]
where v, u are unit vectors in ℝ^N and ℝ^M respectively.
We use the following lemma, which can be found in [74, Theorem 3.12], to characterize the rigidity of eigenvalues within each bulk component.
Lemma 2.1.60. Fix τ > 0 and assume (2.3.1), (2.1.135) and (2.3.4) hold. Moreover, suppose that every edge k = 1, ···, 2p satisfies a_k ≥ τ and every bulk component k = 1, ···, p is regular in the sense of Assumption 2.1.41. Recall that N_k is the number of eigenvalues within each bulk. Then for i = 1, ···, N_k satisfying γ_{k,i} ≥ τ and k = 1, ···, p, with probability 1 − N^{−D₁} we have
\[
|\lambda_{k,i}-\gamma_{k,i}| \le \big(i\wedge(N_k+1-i)\big)^{-1/3}N^{-2/3+\varepsilon_1}. \tag{2.1.181}
\]
Within the bulk, we have a stronger result. For small τ′ > 0, denote
\[
D_k^b := \{z \in D(\tau) : E \in [a_{2k}+\tau',\ a_{2k-1}-\tau']\}, \qquad k = 1, 2, \cdots, p, \tag{2.1.182}
\]
as the bulk spectral domain; then [74, Theorem 3.15] gives the following result.
Lemma 2.1.61. Fix τ, τ′ > 0, assume (2.3.1), (2.1.135) and (2.3.4) hold, and assume the bulk component k = 1, ···, p is regular in the sense of (ii) of Assumption 2.1.41. Then for all i = 1, ···, N_k satisfying γ_{k,i} ∈ [a_{2k}+τ′, a_{2k−1}−τ′], the bounds (2.2.63) and (2.1.177) hold uniformly for all z ∈ D_k^b, and with probability 1 − N^{−D₁},
\[
|\lambda_{k,i}-\gamma_{k,i}| \le N^{-1+\varepsilon_1}. \tag{2.1.183}
\]
As discussed in [74, Remark 3.13], Lemmas 2.3.27 and 2.2.22 imply the complete delocalization of the singular vectors.
Lemma 2.1.62. Fix τ > 0. Under the assumptions of Lemma 2.3.27, for any i, μ such that γ_i, γ_μ ≥ τ, with probability 1 − N^{−D₁} we have
\[
\max_{i,s_1}|\xi_i(s_1)|^2 + \max_{\mu,s_2}|\zeta_\mu(s_2)|^2 \le N^{-1+\varepsilon_1}. \tag{2.1.184}
\]
Proof. By (2.1.180), with probability 1 − N^{−D₁} we have max{Im G_{ii}(z), Im G_{μμ}(z)} = O(1). Choosing z₀ = E + iη₀ with η₀ = N^{−1+ε₁} and using the spectral decomposition (2.2.49), we find that
\[
\sum_{k=1}^{N\wedge M}\frac{\eta_0}{(E-\lambda_k)^2+\eta_0^2}\,|\xi_k(i)|^2 = \operatorname{Im}G_{ii}(z_0) = O(1), \tag{2.1.185}
\]
\[
\sum_{k=1}^{N\wedge M}\frac{\eta_0}{(E-\lambda_k)^2+\eta_0^2}\,|\zeta_k(\mu)|^2 = \operatorname{Im}G_{\mu\mu}(z_0) = O(1), \tag{2.1.186}
\]
hold with probability 1 − N^{−D₁}. Choosing E = λ_k in (2.1.185) and (2.1.186) finishes the proof.
Singular vectors near the edges. In this section, we prove the universality of the distributions of the edge singular vectors, Theorems 2.1.44 and 2.1.45, as well as of the joint distribution of singular values and singular vectors, Corollaries 2.1.47 and 2.1.48. The main identities on which we rely are
\[
\mathcal G_{ij} = \sum_{\beta=1}^{M\wedge N}\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,\xi_\beta(i)\xi_\beta(j), \qquad \mathcal G_{\mu\nu} = \sum_{\beta=1}^{M\wedge N}\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,\zeta_\beta(\mu)\zeta_\beta(\nu), \tag{2.1.187}
\]
where 𝒢_{ij}, 𝒢_{μν} are defined as
\[
\mathcal G_{ij} := \frac{1}{2i}\big(G_{ij}(z)-G_{ij}(\bar z)\big), \qquad \mathcal G_{\mu\nu} := \frac{1}{2i}\big(G_{\mu\nu}(z)-G_{\mu\nu}(\bar z)\big). \tag{2.1.188}
\]
Due to similarity, we focus our proof on the right singular vectors. The proofs rely on three main steps: (i) writing Nζ_β(μ)ζ_β(ν) as an integral of 𝒢_{μν} over a random interval of size O(N^ε η), where ε > 0 is a small constant and η = N^{−2/3−ε₀}, with ε₀ > 0 to be chosen later; (ii) replacing the sharp characteristic function obtained from step (i) with a smooth cutoff function q in terms of the Green function; (iii) using the Green function comparison argument to compare the distribution of the singular vectors between the ensembles X^G and X^V.
We will follow the proof strategy of [71, Section 3] with slight modifications of the details. Specifically, the choices of the random interval in step (i) and of the smooth function q in step (ii) differ because we have more than one bulk component, and the Green function comparison argument is also slightly different because we use the linearization matrix (2.2.49).
We mainly focus on a single bulk component: we first prove the singular vector distribution and then extend the results to singular values. The results involving several bulk components follow after minor modifications. We first prove the following result for the right singular vectors.
Lemma 2.1.63. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and let 𝔼_G, 𝔼_V denote the expectations with respect to X^G, X^V. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l ≤ N_k^δ, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ that satisfies
\[
|\theta^{(3)}(x)| \le C_1(1+|x|)^{C_1}, \quad x \in \mathbb R, \text{ with some constant } C_1 > 0. \tag{2.1.189}
\]
Near the edges, by (2.1.181) and (2.1.184), with probability 1 − N^{−D₁} we have
\[
|\lambda_{\alpha'}-\gamma_{\alpha'}| \le N^{-2/3+\varepsilon_1}, \qquad \max_{\mu,s_2}|\zeta_\mu(s_2)|^2 \le N^{-1+\varepsilon_1}. \tag{2.1.190}
\]
Hence, throughout the proofs of this section, we always use the scale parameter
\[
\eta = N^{-2/3-\varepsilon_0}, \qquad \text{where } \varepsilon_0 > \varepsilon_1 \text{ is a small constant.} \tag{2.1.191}
\]
Proof of Lemma 2.1.63. In a first step, we express the singular vector entries as an integral of Green functions over a random interval, which is recorded in the following lemma.
Lemma 2.1.64. Under the assumptions of Lemma 2.1.63, there exists a small constant 0 < δ < 1 such that
\[
\lim_{N\to\infty}\max_{l\le N_k^\delta}\max_{\mu,\nu}\bigg|\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) - \mathbb E_V\,\theta\bigg[\frac N\pi\int_I\mathcal G_{\mu\nu}(z)\,\mathcal X(E)\,dE\bigg]\bigg| = 0, \tag{2.1.192}
\]
where I is defined as
\[
I := [a_{2k-1}-N^{-2/3+\varepsilon},\ a_{2k-1}+N^{-2/3+\varepsilon}] \tag{2.1.193}
\]
when (2.1.145) holds, and as
\[
I := [a_{2k}-N^{-2/3+\varepsilon},\ a_{2k}+N^{-2/3+\varepsilon}] \tag{2.1.194}
\]
when (2.1.146) holds, with ε satisfying, for C₁ defined in (2.1.189),
\[
2(C_1+1)(\delta+\varepsilon_1) < \varepsilon < c\,\varepsilon_0, \qquad c > 0 \text{ a constant much smaller than } 1. \tag{2.1.195}
\]
The function 𝒳(E) is defined as
\[
\mathcal X(E) := \mathbf 1\big(\lambda_{\alpha'+1} < E_- \le \lambda_{\alpha'}\big), \tag{2.1.196}
\]
where E_± := E ± N^ε η. The conclusion holds true if we replace X^V with X^G.
Proof. We first observe that
\[
\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu) = \frac\eta\pi\int_{\mathbb R}\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE.
\]
Choose a, b such that
\[
a := \min\{\lambda_{\alpha'}-N^\varepsilon\eta,\ \lambda_{\alpha'+1}+N^\varepsilon\eta\}, \qquad b := \lambda_{\alpha'}+N^\varepsilon\eta. \tag{2.1.197}
\]
We also record the elementary inequality (see the equation above (6.10) of [53]): for some constant C > 0,
\[
\int_x^{\infty}\frac{\eta}{\pi(y^2+\eta^2)}\,dy \le \frac{C\eta}{x+\eta}, \qquad x > 0. \tag{2.1.198}
\]
By (2.1.190), (2.1.197) and (2.1.198), with probability 1 − N^{−D₁} we have
\[
\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu) = \frac\eta\pi\int_a^b\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE + O(N^{-1-\varepsilon+\varepsilon_1}). \tag{2.1.199}
\]
By (2.1.189), (2.1.190), (2.1.195), (2.1.199) and the mean value theorem, we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_a^b\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE\bigg) + o(1). \tag{2.1.200}
\]
Denote λ_t^± := λ_t ± N^ε η, t = α′, α′+1. By (2.1.197), we have
\[
\int_a^b dE = \int_{\lambda_{\alpha'+1}^+}^{\lambda_{\alpha'}^+} dE + \mathbf 1\big(\lambda_{\alpha'+1}^+ > \lambda_{\alpha'}^-\big)\int_{\lambda_{\alpha'}^-}^{\lambda_{\alpha'+1}^+} dE.
\]
By (2.1.189), (2.1.190), (2.1.200) and the mean value theorem, we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_{\lambda_{\alpha'+1}^+}^{\lambda_{\alpha'}^+}\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE\bigg) + o(1), \tag{2.1.201}
\]
where we use (2.1.181) and (2.1.195). Without loss of generality, we may consider the case when (2.1.145) holds true. By (2.1.190) and (2.1.195), we observe that with probability 1 − N^{−D₁} we have λ_{α′}^+ ≤ a_{2k−1} + N^{−2/3+ε} and λ_{α′+1}^+ ≥ a_{2k−1} − N^{−2/3+ε}. By (2.1.181) and the choice of I in (2.1.193), we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_I\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,\mathcal X(E)\,dE\bigg) + o(1). \tag{2.1.202}
\]
Recalling (2.1.187), we can split the summation as
\[
\frac 1\eta\,\mathcal G_{\mu\nu}(z) = \sum_{\beta\ne\alpha'}\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2} + \frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}. \tag{2.1.203}
\]
Denote A := {β ≠ α′ : λ_β is not in the k-th bulk component}. By (2.1.190), with probability 1 − N^{−D₁} we have
\[
\bigg|\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2}\,dE\bigg| \le \frac{N^{\varepsilon_1}}\pi\bigg(\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE + \sum_{\beta\in A^c}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE\bigg). \tag{2.1.204}
\]
By Assumption 2.1.41, with probability 1 − N^{−D₁} we have
\[
\frac{N^{\varepsilon_1}}\pi\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{\varepsilon_1}\sum_{\beta\in A}N^{-4/3-\varepsilon_0+\varepsilon}. \tag{2.1.205}
\]
Denote
\[
l(\beta) := \beta - \sum_{t<k}N_t. \tag{2.1.206}
\]
By (2.1.190), with probability 1 − N^{−D₁}, for some small constant 0 < δ < 1 we have
\[
\frac{N^{\varepsilon_1}}\pi\sum_{\beta\in A^c}\int_I\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,dE \le N^{\varepsilon_1+\delta} + \frac 1\pi\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE. \tag{2.1.207}
\]
By Assumption 2.1.41, (2.1.142) and (2.1.181), it is easy to check that (see (3.12) of [71])
\[
(E-\lambda_\beta)^2 \ge c\Big(\frac{l(\beta)}N\Big)^{4/3}, \qquad c > 0 \text{ some constant.} \tag{2.1.208}
\]
By (2.1.208), with probability 1 − N^{−D₁} we have
\[
\frac 1\pi\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{\varepsilon_1-\varepsilon_0+\varepsilon}\int_{N^{\delta-1}}^{N}\frac{1}{x^{4/3}}\,dx \le N^{-\delta/3+\varepsilon_1-\varepsilon_0+\varepsilon}.
\]
Recalling (2.1.195), we can restrict ε₁ − ε₀ + ε < 0; with probability 1 − N^{−D₁} this yields
\[
\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{-\delta/3}. \tag{2.1.209}
\]
By (2.1.204), (2.1.205), (2.1.207) and (2.1.209), with probability 1 − N^{−D₁} we have
\[
\bigg|\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2}\,dE\bigg| \le N^{\delta+2\varepsilon_1}. \tag{2.1.210}
\]
By (2.1.189), (2.1.190), (2.1.203), (2.1.210) and the mean value theorem, we have
\[
\bigg|\mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_I\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,\mathcal X(E)\,dE\bigg) - \mathbb E_V\,\theta\bigg(\frac N\pi\int_I\mathcal G_{\mu\nu}(E+i\eta)\,\mathcal X(E)\,dE\bigg)\bigg| \le N^{C_1(\delta+2\varepsilon_1)}\,\mathbb E_V\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\,\mathcal X(E)\,dE, \tag{2.1.211}
\]
where C₁ is defined in (2.1.189). To finish the proof, it suffices to estimate the right-hand side of (2.1.211). Similarly to (2.1.205), we have
\[
\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{-1/3-\varepsilon_0+\varepsilon}. \tag{2.1.212}
\]
Choose a small constant 0 < δ1 < 1, repeat the estimation of (2.1.209), we have
∑β∈Ac; l(β)≥Nδ1
k
∫I
η
η2 + (E − λβ)2dE ≤ N−δ1/3+ε−ε0 . (2.1.213)
Recall (2.1.145) and restrict $\epsilon>2((C_1+1)\epsilon_1+\delta_1+C_1\delta)$. By (2.1.190) and (2.1.198), we have
\[
\sum_{\beta\in A^c;\,l\le l(\beta)\le N_k^{\delta_1}}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_I\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE
\le \mathbb{E}^V\int_{\lambda_{\alpha'+1}+N^{\epsilon}\eta}^{\infty}\frac{N^{\delta_1+\epsilon_1}\eta}{(E-\lambda_{\alpha'+1})^2+\eta^2}\,dE
\le N^{-\epsilon+\epsilon_1+\delta_1},
\tag{2.1.214}
\]
where we use the fact that $\beta\in A^c$ and $l<l(\beta)\le N_k^{\delta_1}$ implies $\lambda_\beta\le\lambda_{\alpha'+1}$. It remains to estimate the summation of the terms with $\beta\in A^c$ and $l(\beta)<l$. For a given constant $\epsilon'$ satisfying
\[
\frac{1}{2}\left(\epsilon_0+3\epsilon+2(C_1+1)\epsilon_1+(C_1+1)\delta\right)<\epsilon'<\epsilon_0,
\tag{2.1.215}
\]
we partition $I=I_1\cup I_2$ with $I_1\cap I_2=\emptyset$ by denoting
\[
I_1 := \{E\in I : \exists\,\beta\in A^c,\ l(\beta)<l,\ |E-\lambda_\beta|\le N^{\epsilon'}\eta\}.
\tag{2.1.216}
\]
By (2.1.190) and (2.1.216), we have
\[
\sum_{\beta\in A^c;\,l(\beta)<l}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_{I_2}\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE \le N^{-2\epsilon'+\epsilon_0+\epsilon+\epsilon_1+\delta}.
\tag{2.1.217}
\]
It is easy to check that on $I_1$, when $\lambda_{\alpha'+1}\le\lambda_{\alpha'}<\lambda_\beta$, we have
\[
\frac{1}{(E-\lambda_\beta)^2+\eta^2}\,\mathbf{1}(E_-\le\lambda_{\alpha'}) \le \frac{N^{2\epsilon}}{(\lambda_{\alpha'+1}-\lambda_{\alpha'})^2+\eta^2}.
\tag{2.1.218}
\]
By (2.1.190) and (2.1.218), we have
\[
\sum_{\beta\in A^c;\,l(\beta)\le l}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_{I_1}\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE
\le \mathbb{E}^V\int_{I_1}\frac{N^{\delta+\epsilon_1+2\epsilon-2/3}\eta}{(\lambda_{\alpha'+1}-\lambda_{\alpha'})^2+\eta^2}\,dE
\le N^{\delta+\epsilon_1+3\epsilon-D_1+2/3+\epsilon_0}+N^{-2\epsilon'+\epsilon_0+\epsilon_1+\delta+3\epsilon}.
\tag{2.1.219}
\]
By (2.1.212), (2.1.213), (2.1.214), (2.1.215) and (2.1.219), we conclude the estimate of the right-hand side of (2.1.211), and hence the proof. It is clear that our proof still applies when we replace $X_V$ with $X_G$.
In a second step, we write the sharp indicator function of (2.1.196) as some smooth function $q$ of $G_{\mu\nu}$. To be consistent with the proof of Lemma 2.1.64, we consider the bulk edge $a_{2k-1}$. Denote
\[
\vartheta_\eta(x) := \frac{\eta}{\pi(x^2+\eta^2)} = \frac{1}{\pi}\operatorname{Im}\frac{1}{x-i\eta}.
\tag{2.1.220}
\]
We define a smooth cutoff function $q\equiv q_{\alpha'}:\mathbb{R}\to\mathbb{R}_+$ satisfying
\[
q(x)=1 \ \text{if } |x-l|\le\tfrac{1}{3}; \qquad q(x)=0 \ \text{if } |x-l|\ge\tfrac{2}{3},
\tag{2.1.221}
\]
where $l$ is defined in (2.1.145). We also denote $Q_1=Y^*Y$.
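The identity in (2.1.220), which presents the Poisson kernel $\vartheta_\eta$ as the imaginary part of a resolvent-type factor, can be verified numerically; the following sketch (illustrative only, with arbitrary scale $\eta$) also confirms that $\vartheta_\eta$ is an approximate delta function of unit mass:

```python
import numpy as np

# Check of the identity (2.1.220): eta/(pi*(x^2+eta^2)) = (1/pi)*Im 1/(x - i*eta),
# and the kernel integrates to (almost exactly) one.
eta = 0.05
x = np.linspace(-50.0, 50.0, 400001)
theta = eta / (np.pi * (x ** 2 + eta ** 2))
im_form = np.imag(1.0 / (x - 1j * eta)) / np.pi
assert np.allclose(theta, im_form)
dx = x[1] - x[0]
mass = theta.sum() * dx
print(mass)  # close to 1; the Lorentzian tails carry the tiny remainder
```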
Lemma 2.1.65. For $\epsilon$ defined in (2.1.195), denote
\[
\mathcal{X}_E(x) := \mathbf{1}(E_-\le x\le E_U),
\tag{2.1.222}
\]
where $E_U := a_{2k-1}+2N^{-2/3+\epsilon}$. Let $\eta := N^{-2/3-9\epsilon_0}$, where $\epsilon_0$ is defined in (2.1.191). We have
\[
\lim_{N\to\infty}\max_{l\le N_k^{\delta}}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr)-\mathbb{E}^V\theta\!\left(\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right)\right| = 0,
\tag{2.1.223}
\]
where $I$ is defined in (2.1.193) and $*$ is the convolution operator.
Proof. For any $E_1<E_2$, denote the number of eigenvalues of $Q_1$ in $[E_1,E_2]$ by
\[
\mathcal{N}(E_1,E_2) := \#\{j : E_1\le\lambda_j\le E_2\}.
\tag{2.1.224}
\]
Recalling (2.1.193) and (2.1.196), it is easy to check that with $1-N^{-D_1}$ probability, we have
\[
N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE = N\int_I G_{\mu\nu}(z)\,\mathbf{1}\bigl(\mathcal{N}(E_-,E_U)=l\bigr)\,dE = N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}\mathcal{X}_E(Q_1)\bigr]\,dE,
\tag{2.1.225}
\]
where for the second equality we use (2.1.181) and Assumption 2.1.41. We use the following lemma to estimate (2.1.224) by its delta approximation smoothed on the scale $\eta$.
Lemma 2.1.66. For $t=N^{-2/3-3\epsilon_0}$, there exists some constant $C$ such that, with $1-N^{-D_1}$ probability, for any $E$ satisfying
\[
|E_--a_{2k-1}| \le \tfrac{3}{2}N^{-2/3+\epsilon},
\tag{2.1.226}
\]
we have
\[
\bigl|\operatorname{Tr}\mathcal{X}_E(Q_1)-\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr| \le C\bigl(N^{-2\epsilon_0}+\mathcal{N}(E_--t,E_-+t)\bigr).
\tag{2.1.227}
\]
By (A.7) of [74], for any $z\in D(\tau)$ defined in (2.2.42), we have
\[
\operatorname{Im}m(z) \sim \begin{cases}\eta/\sqrt{\kappa+\eta}, & E\notin\operatorname{supp}(\rho),\\[2pt] \sqrt{\kappa+\eta}, & E\in\operatorname{supp}(\rho),\end{cases}
\tag{2.1.228}
\]
where $\kappa:=|E-a_{2k-1}|$. When $\mu=\nu$, with $1-N^{-D_1}$ probability, we have
\[
\sup_{E\in I}|G_{\mu\mu}(E+i\eta)| = \sup_{E\in I}|\operatorname{Im}G_{\mu\mu}(z)| \le \sup_{E\in I}\bigl(|\operatorname{Im}(G_{\mu\mu}(z)-m(z))|+|\operatorname{Im}m(z)|\bigr) \le N^{-1/3+\epsilon_0+2\epsilon},
\]
where we use (2.1.180) and (2.1.228). When $\mu\neq\nu$, we use the identity
\[
G_{\mu\nu} = \eta\sum_{k=M+1}^{M+N}G_{\mu k}G_{\nu k}.
\]
By (2.1.180) and (2.1.228), with $1-N^{-D_1}$ probability, we have $\sup_{E\in I}|G_{\mu\nu}(z)|\le N^{-1/3+\epsilon_0+2\epsilon}$. Therefore, for $E\in I$, with $1-N^{-D_1}$ probability, we have
\[
\sup_{E\in I}|G_{\mu\nu}(E+i\eta)| \le N^{-1/3+3\epsilon_0/2}.
\tag{2.1.229}
\]
Recalling (2.2.17), by (2.1.225), (2.1.227), (2.1.229) and the smoothness of $q$, with $1-N^{-D_1}$ probability, we have
\[
\left|N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE-N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right|
\le CN\sum_{l(\beta)\le N_k^{\delta}}\int_I|G_{\mu\nu}(z)|\,\mathbf{1}(|E_--\lambda_\beta|\le t)\,dE+N^{-\epsilon_0/4}
\le CN^{1+\delta}|t|\sup_{z\in I}|G_{\mu\nu}(z)|+N^{-\epsilon_0/4}.
\tag{2.1.230}
\]
By (2.1.229) and (2.1.230), we have
\[
\left|N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE-N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right| \le CN^{-\epsilon_0/2+\delta}+N^{-\epsilon_0/4}.
\]
Using a similar discussion to (2.1.204), by (2.1.189) and (2.1.195), we finish the proof.
In the final step, we use the Green function comparison argument to prove the following lemma.

Lemma 2.1.67. Under the assumptions of Lemma 2.1.65, we have
\[
\lim_{N\to\infty}\max_{\mu,\nu}\,(\mathbb{E}^V-\mathbb{E}^G)\,\theta\!\left(\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right) = 0.
\]
Once Lemma 2.1.67 is proved, the proof of Lemma 2.1.63 follows from Lemma 2.1.65.
Green function comparison argument. In this part, we will prove Lemma 2.1.67 using the Green function comparison argument. At the end of this section, we will discuss how to extend Lemma 2.1.63 to Theorem 2.1.44 and Theorem 2.1.45. By the orthonormality of $\xi,\zeta$ and (2.2.49), we have
\[
G_{ij} = \eta\sum_{k=1}^{M}G_{ik}G_{jk}, \qquad G_{\mu\nu} = \eta\sum_{k=M+1}^{M+N}G_{\mu k}G_{\nu k}.
\tag{2.1.231}
\]
By (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
|G_{\mu\mu}| = O(1), \qquad |G_{\mu\nu}| \le N^{-1/3+2\epsilon_0} \ \ (\mu\neq\nu).
\tag{2.1.232}
\]
We first drop all the diagonal terms in (2.2.82).
Lemma 2.1.68. Recall $E_U=a_{2k-1}+2N^{-2/3+\epsilon}$ and $\eta=N^{-2/3-9\epsilon_0}$. We have
\[
\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right]-\mathbb{E}^V\theta\!\left[\int_I x(E)\,q(y(E))\,dE\right] = o(1),
\tag{2.1.233}
\]
where we denote $X_{\mu\nu,k}:=G_{\mu k}G_{\nu k}$ and
\[
x(E) := \frac{N\eta}{\pi}\sum_{\substack{k=M+1\\ k\neq\mu,\nu}}^{M+N}X_{\mu\nu,k}(E+i\eta), \qquad
y(E) := \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_k\sum_{\beta\neq k}X_{\beta\beta,k}(w+i\eta)\,dw.
\tag{2.1.234}
\]
The conclusion holds true if we replace $X_V$ with $X_G$.
Proof. We first observe that by (2.1.232), with $1-N^{-D_1}$ probability, we have
\[
|x(E)| \le N^{2/3+3\epsilon_0},
\tag{2.1.235}
\]
which implies that
\[
\int_I|x(E)|\,dE \le N^{4\epsilon_0}.
\tag{2.1.236}
\]
By (2.2.82) and (2.1.232), with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{N}{\pi}G_{\mu\nu}(E+i\eta)-x(E)\right| = \frac{N\eta}{\pi}\bigl|G_{\mu\mu}G_{\nu\mu}+G_{\mu\nu}G_{\nu\nu}\bigr| \le N\eta\bigl(\mathbf{1}(\mu=\nu)+N^{-1/3+2\epsilon_0}\mathbf{1}(\mu\neq\nu)\bigr).
\tag{2.1.237}
\]
By equations (5.11) and (6.42) of [40], we have
\[
\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1) = \frac{N}{\pi}\int_{E_-}^{E_U}\operatorname{Im}m_2(w+i\eta)\,dw, \qquad
\sum_{\mu\nu}|G_{\mu\nu}(w+i\eta)|^2 = \frac{N\operatorname{Im}m_2(w+i\eta)}{\eta}.
\tag{2.1.238}
\]
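The second identity in (2.1.238) is a Ward-type identity for resolvents. A quick numerical check on a generic Hermitian matrix (illustrative only; the thesis applies it to the blocks of the linearized matrix) confirms the exact relation:

```python
import numpy as np

# Ward-type identity: for Hermitian A and G(z) = (A - z)^{-1}, z = E + i*eta,
# sum_{jk} |G_jk|^2 = Im Tr G / eta.  This holds exactly, for any z off the axis.
rng = np.random.default_rng(0)
n = 60
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
z = 0.3 + 0.01j
G = np.linalg.inv(A - z * np.eye(n))
lhs = (np.abs(G) ** 2).sum()
rhs = np.trace(G).imag / z.imag
assert np.isclose(lhs, rhs)
```

The identity follows from $\sum_{jk}|G_{jk}|^2 = \operatorname{Tr} GG^* = \sum_i |\lambda_i - z|^{-2}$ and $\operatorname{Im}\operatorname{Tr}G = \eta\sum_i|\lambda_i-z|^{-2}$.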
Therefore, we have
\[
\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)-y(E) = \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_{\beta=M+1}^{M+N}|G_{\beta\beta}|^2\,dw.
\tag{2.1.239}
\]
By (2.1.239), the mean value theorem and the fact that $q$ is smooth enough, we have
\[
\bigl|q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]-q[y(E)]\bigr| \le N^{-1/3-7\epsilon_0}.
\tag{2.1.240}
\]
Therefore, by the mean value theorem, (2.1.189), (2.1.195), (2.1.235), (2.1.236), (2.1.237) and (2.1.240), we can conclude our proof.

To prove Lemma 2.1.67, by (2.1.233), it suffices to prove
\[
[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q(y(E))\,dE\right] = o(1).
\tag{2.1.241}
\]
For the rest, we will use the Green function comparison argument to prove (2.1.241), where we follow the basic approach of [40, Section 6] and [72, Section 3.1]. Define a bijective ordering map $\Phi$ on the index set,
\[
\Phi : \{(i,\mu_1) : 1\le i\le M,\ M+1\le\mu_1\le M+N\} \to \{1,\ldots,\gamma_{\max}=MN\}.
\]
Recall that we relabel $X_V=((X_V)_{i\mu_1},\ i\in\mathcal{I}_1,\ \mu_1\in\mathcal{I}_2)$, and similarly for $X_G$. For any $1\le\gamma\le\gamma_{\max}$, we define the matrix $X^{\gamma}=(x^{\gamma}_{i\mu_1})$ such that $x^{\gamma}_{i\mu_1}=X^G_{i\mu_1}$ if $\Phi(i,\mu_1)>\gamma$, and $x^{\gamma}_{i\mu_1}=X^V_{i\mu_1}$ otherwise. Note that $X^0=X_G$ and $X^{\gamma_{\max}}=X_V$. With the above definitions, we have
\[
[\mathbb{E}^G-\mathbb{E}^V]\,\theta\!\left[\int_I x(E)q(y(E))\,dE\right] = \sum_{\gamma=1}^{\gamma_{\max}}[\mathbb{E}_{\gamma-1}-\mathbb{E}_{\gamma}]\,\theta\!\left[\int_I x(E)q(y(E))\,dE\right].
\]
For simplicity, we rewrite the above equation as
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\!\left[\theta\!\left(\int_I x^{\gamma-1}q(y^{\gamma-1})\,dE\right)-\theta\!\left(\int_I x^{\gamma}q(y^{\gamma})\,dE\right)\right].
\tag{2.1.242}
\]
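The telescoping identity (2.1.242) is an exact algebraic fact about entrywise replacement; the following toy sketch (illustrative only, far from the thesis' matrix setting) shows the mechanism with a vector whose coordinates are swapped one at a time:

```python
import numpy as np

# Toy Lindeberg-type telescoping: replacing entries one at a time interpolates
# between f(x^G) and f(x^V); the total difference is the sum of one-swap steps.
rng = np.random.default_rng(1)
n = 8
xG = rng.standard_normal(n)
xV = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # matches N(0,1) in mean and variance
f = lambda v: np.sin(v.sum() / np.sqrt(n))

def interpolate(gamma):
    # first gamma coordinates already replaced, the rest still Gaussian
    return np.concatenate([xV[:gamma], xG[gamma:]])

telescoped = sum(f(interpolate(g)) - f(interpolate(g - 1)) for g in range(1, n + 1))
assert np.isclose(telescoped, f(xV) - f(xG))
```

In the comparison argument, each one-swap difference is then bounded by Taylor expansion, using that the low moments of the two ensembles agree.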
The key step of the Green function comparison argument is the Lindeberg replacement strategy. We focus on the indices $\mu,\nu\in\mathcal{I}_2$; the other cases follow similarly. Denote $Y_\gamma:=\Sigma^{1/2}X^{\gamma}$ and
\[
H_\gamma := \begin{pmatrix}0 & z^{1/2}Y_\gamma\\ z^{1/2}Y_\gamma^* & 0\end{pmatrix}, \qquad
G_\gamma := \begin{pmatrix}-zI & z^{1/2}Y_\gamma\\ z^{1/2}Y_\gamma^* & -zI\end{pmatrix}^{-1}.
\tag{2.1.243}
\]
As $\Sigma$ is diagonal, for each fixed $\gamma$, $H_\gamma$ and $H_{\gamma-1}$ differ only in the $(i,\mu_1)$ and $(\mu_1,i)$ entries, where $\Phi(i,\mu_1)=\gamma$. Then we define the $(N+M)\times(N+M)$ matrices $V$ and $W$ by
\[
V_{ab} = z^{1/2}\bigl(\mathbf{1}_{(a,b)=(i,\mu_1)}+\mathbf{1}_{(a,b)=(\mu_1,i)}\bigr)\sqrt{\sigma_i}\,X^G_{i\mu_1}, \qquad
W_{ab} = z^{1/2}\bigl(\mathbf{1}_{(a,b)=(i,\mu_1)}+\mathbf{1}_{(a,b)=(\mu_1,i)}\bigr)\sqrt{\sigma_i}\,X^V_{i\mu_1},
\]
so that $H_\gamma$ and $H_{\gamma-1}$ can be written as
\[
H_{\gamma-1} = O+V, \qquad H_\gamma = O+W,
\]
for some $(N+M)\times(N+M)$ matrix $O$ satisfying $O_{i\mu_1}=O_{\mu_1 i}=0$, where $O$ is independent of $V$ and $W$. Denote
\[
S := (H_{\gamma-1}-z)^{-1}, \qquad R := (O-z)^{-1}, \qquad T := (H_\gamma-z)^{-1}.
\tag{2.1.244}
\]
With the above definitions, we can write
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\!\left[\theta\!\left(\int_I x^Sq(y^S)\,dE\right)-\theta\!\left(\int_I x^Tq(y^T)\,dE\right)\right].
\tag{2.1.245}
\]
The comparison argument is based on the following resolvent expansion:
\[
S = R-RVR+(RV)^2R-(RV)^3R+(RV)^4S.
\tag{2.1.246}
\]
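The expansion (2.1.246) is an exact finite identity (obtained by iterating $S = R - RVS$ four times), not an asymptotic series; this can be confirmed numerically on small matrices (illustrative sketch, with an arbitrary two-entry perturbation mimicking $V$):

```python
import numpy as np

# Exactness of the resolvent expansion: with S = (O + V - z)^{-1}, R = (O - z)^{-1},
# S = R - RVR + (RV)^2 R - (RV)^3 R + (RV)^4 S  holds as an identity.
rng = np.random.default_rng(2)
n = 10
O = rng.standard_normal((n, n)); O = (O + O.T) / 2
V = np.zeros((n, n)); V[1, 7] = V[7, 1] = 0.3   # single symmetric entry pair, as in the text
z = 0.2 + 0.05j
I = np.eye(n)
R = np.linalg.inv(O - z * I)
S = np.linalg.inv(O + V - z * I)
RV = R @ V
expansion = R - RV @ R + RV @ RV @ R - RV @ RV @ RV @ R + RV @ RV @ RV @ RV @ S
assert np.allclose(expansion, S)
```

Because $V$ carries a single (small) matrix entry, each extra factor of $RV$ contributes an extra power of the entry, which is what makes the moment-matching Taylor comparison work.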
For any integer $m>0$, by (6.11) of [40], we have
\[
([RV]^mR)_{ab} = \sum_{(a_i,b_i)\in\{(i,\mu_1),(\mu_1,i)\}:\,1\le i\le m} z^{m/2}\sigma_i^{m/2}\bigl(X^G_{i\mu_1}\bigr)^m R_{aa_1}R_{b_1a_2}\cdots R_{b_mb},
\tag{2.1.247}
\]
\[
([RV]^mS)_{ab} = \sum_{(a_i,b_i)\in\{(i,\mu_1),(\mu_1,i)\}:\,1\le i\le m} z^{m/2}\sigma_i^{m/2}\bigl(X^G_{i\mu_1}\bigr)^m R_{aa_1}R_{b_1a_2}\cdots S_{b_mb}.
\tag{2.1.248}
\]
Denote
\[
\Delta X_{\mu\nu,k} := S_{\mu k}S_{\nu k}-R_{\mu k}R_{\nu k}.
\tag{2.1.249}
\]
In [72], the discussion relies on a crucial parameter (see (3.32) of [72]), which counts the maximum number of diagonal resolvent elements in $\Delta X_{\mu\nu,k}$. We will follow this strategy but use a different counting parameter, and furthermore use (2.1.247) and (2.1.248) as our key ingredients. Our discussion is slightly easier due to the loss of a free index (i.e. $i\neq\mu_1$).

Inserting (2.1.246) into (2.1.249), by (2.1.247) and (2.1.248), we find that there exists a random variable $A_1$ which depends on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$. Taking the partial expectation with respect to the $(i,\mu_1)$-th entry of $X_G$ (recall that the entries are i.i.d.), by (2.1.135), we have the following result.
Lemma 2.1.69. Recall (2.2.62) and denote by $\mathbb{E}_\gamma$ the partial expectation with respect to $X^G_{i\mu_1}$. There exists some constant $C>0$ such that, with $1-N^{-D_1}$ probability, we have
\[
|\mathbb{E}_\gamma\Delta X_{\mu\nu,k}-A_1| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}, \qquad M+1\le k\neq\mu,\nu\le M+N,
\tag{2.1.250}
\]
where $s$ counts the maximum number of resolvent elements in $\Delta X_{\mu\nu,k}$ involving the index $\mu_1$ and is defined as
\[
s := \mathbf{1}\bigl((\{\mu,\nu\}\cap\{\mu_1\}\neq\emptyset)\cup(k=\mu_1)\bigr).
\tag{2.1.251}
\]
Proof. Inserting (2.1.246) into (2.1.249), the terms in the expansion containing $X^G_{i\mu_1}$ and $(X^G_{i\mu_1})^2$ will be included in $A_1$; we only consider the terms containing $(X^G_{i\mu_1})^m$, $m\ge3$. We consider $m=3$ and discuss the following terms:
\[
R_{\mu k}[(RV)^3R]_{\nu k}, \qquad [RVR]_{\mu k}[(RV)^2R]_{\nu k}.
\]
By (2.1.247), we have
\[
R_{\mu k}[(RV)^3R]_{\nu k} = R_{\mu k}\sum\sigma_i^{3/2}\bigl(X^G_{i\mu_1}\bigr)^3z^{3/2}R_{\nu a_1}R_{b_1a_2}R_{b_2a_3}R_{b_3k}.
\tag{2.1.252}
\]
In the worst scenario, $R_{b_1a_2}$ and $R_{b_2a_3}$ are diagonal entries of $R$.
Similarly, we have
\[
[RVR]_{\mu k}[(RV)^2R]_{\nu k} = \left(\sum z^{1/2}\sigma_i^{1/2}X^G_{i\mu_1}R_{\mu a_1}R_{b_1k}\right)\left(\sum\sigma_i\bigl(X^G_{i\mu_1}\bigr)^2zR_{\nu a_1}R_{b_1a_2}R_{b_2k}\right),
\tag{2.1.253}
\]
and the worst scenario is the case when $R_{b_1a_2}$ is a diagonal term. As $\mu,\nu\neq i$ always holds and there are only finitely many terms in the summation, by (2.1.135) and (2.1.232), for some constant $C$, we have
\[
\mathbb{E}_\gamma\bigl|R_{\mu k}[(RV)^3R]_{\nu k}\bigr| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}.
\]
Similarly, we have
\[
\mathbb{E}_\gamma\bigl|[RVR]_{\mu k}[(RV)^2R]_{\nu k}\bigr| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}.
\]
The other cases $4\le m\le8$ can be handled similarly. Hence, we conclude our proof.
Lemma 2.1.67 follows from the following result. Recalling (2.1.234), denote
\[
\Delta x(E) := x^S(E)-x^R(E), \qquad \Delta y(E) := y^S(E)-y^R(E).
\]
Lemma 2.1.70. For any fixed $\mu,\nu,\gamma$, there exists a random variable $A$, which depends on the randomness only through $O$ and the first two moments of $X_G$, such that
\[
\mathbb{E}\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\mathbb{E}\theta\!\left[\int_I x^Rq(y^R)\,dE\right] = A+o(N^{-2+t}),
\tag{2.1.254}
\]
where $t:=|\{\mu,\nu\}\cap\{\mu_1\}|\in\{0,1\}$ counts whether one of $\mu,\nu$ equals $\mu_1$.
Before proving Lemma 2.1.70, we first show how Lemma 2.1.70 implies Lemma 2.1.67.

Proof of Lemma 2.1.67. It is easy to check that Lemma 2.1.70 still holds true when we replace $S$ with $T$. Note that in (2.1.245), there are $O(N)$ terms with $t=1$ and $O(N^2)$ terms with $t=0$. By (2.1.254), we have
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = o(1),
\]
where we use the assumption that the first two moments of $X_V$ are the same as those of $X_G$. Combined with (2.1.233), we conclude the proof.
Finally, we follow the approach of [72, Lemma 3.6] to finish the proof of Lemma 2.1.70. A key observation is that when $s=0$ we have a smaller bound, but the total number of such terms is $O(N)$ for $x(E)$ and $O(N^2)$ for $y(E)$; when $s=1$ we have a larger bound, but the number of such terms is $O(1)$. We therefore need to analyze the terms with $s=0$ and $s=1$ separately.
Proof of Lemma 2.1.70. Conditioning on the variable $s=0,1$, we introduce the following decomposition:
\[
x_s(E) := \frac{N\eta}{\pi}\sum_{\substack{k=M+1\\ k\neq\mu,\nu}}^{M+N}X_{\mu\nu,k}(E+i\eta)\,\mathbf{1}\Bigl(s=\mathbf{1}\bigl((\{\mu,\nu\}\cap\{\mu_1\}\neq\emptyset)\cup(k=\mu_1)\bigr)\Bigr),
\]
\[
y_s(E) := \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_k\sum_{\beta\neq k}X_{\beta\beta,k}(w+i\eta)\,\mathbf{1}\Bigl(s=\mathbf{1}\bigl((\beta=\mu_1)\cup(k=\mu_1)\bigr)\Bigr)\,dw.
\]
$\Delta x_s,\Delta y_s$ can be defined in the same fashion. Similar to the discussion of (2.1.250), for any $E$-dependent variable $f\equiv f(E)$ independent of the $(i,\mu_1)$-th entry of $X_G$, there exist two random variables $A_2,A_3$, which depend on the randomness only through $O$, $f$ and the first two moments of $X^G_{i\mu_1}$, such that for any event $\Omega$, with $1-N^{-D_1}$ probability, we have
\[
\left|\int_I\mathbb{E}_\gamma\Delta x_s(E)f(E)\,dE-A_2\right|\mathbf{1}(\Omega) \le \|f\mathbf{1}(\Omega)\|_\infty N^{-11/6+C\epsilon_0}N^{-2s/3+t},
\tag{2.1.255}
\]
\[
|\mathbb{E}_\gamma\Delta y_s(E)-A_3| \le N^{-11/6+C\epsilon_0}N^{-2s/3}.
\tag{2.1.256}
\]
In our application, $f$ is usually a function of the entries of $R$ (recall that $R$ is independent of $V$). Next, we use
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right] = \theta\!\left[\int_I(x^R+\Delta x_0+\Delta x_1)\,q(y^R+\Delta y_0+\Delta y_1)\,dE\right].
\tag{2.1.257}
\]
By (2.1.246), (2.1.247) and (2.1.248), it is easy to check that, with $1-N^{-D_1}$ probability, we have
\[
\int_I|\Delta x_s(E)|\,dE \le N^{-5/6+C\epsilon_0}N^{-2s/3+t}, \qquad |\Delta y_s(E)| \le N^{-5/6+C\epsilon_0}N^{-2s/3},
\tag{2.1.258}
\]
\[
\int_I|x(E)|\,dE \le N^{C\epsilon_0}, \qquad |y(E)| \le N^{C\epsilon_0}.
\tag{2.1.259}
\]
By (2.1.257) and (2.1.258), with $1-N^{-D_1}$ probability, we have
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right] = \theta\!\left[\int_I x^S\bigl(q(y^R)+q'(y^R)(\Delta y_0+\Delta y_1)+q''(y^R)(\Delta y_0)^2\bigr)\,dE\right]+o(N^{-2}).
\]
Similarly, we have (see (3.44) of [71])
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\theta\!\left[\int_I x^Rq(y^R)\,dE\right] = \theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl((\Delta x_0+\Delta x_1)q(y^R)+x^Rq'(y^R)(\Delta y_0+\Delta y_1)+\Delta x_0q'(y^R)\Delta y_0+x^Rq''(y^R)(\Delta y_0)^2\Bigr)\,dE\right]
+\frac{1}{2}\theta''\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\bigl(\Delta x_0q(y^R)+x^Rq'(y^R)\Delta y_0\bigr)\,dE\right]^2+o(N^{-2+t}).
\tag{2.1.260}
\]
Now we deal with the individual terms on the right-hand side of (2.1.260). First, we consider the terms containing $\Delta x_1,\Delta y_1$. Similar to (2.1.250), we can find a random variable $A_4$, which depends on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$, such that with $1-N^{-D_1}$ probability,
\[
\left|\mathbb{E}_\gamma\int_I\bigl(\Delta x_1q(y^R)+x^Rq'(y^R)\Delta y_1\bigr)\,dE-A_4\right| = o(N^{-2+t}).
\]
Hence, we only need to focus on $\Delta x_0,\Delta y_0$. We first observe that
\[
\Delta x_0(E) = \mathbf{1}(t=0)\,\frac{N\eta}{\pi}\sum_{k\neq\mu,\nu,\mu_1}\Delta X_{\mu\nu,k}(z), \qquad
\Delta y_0(E) = \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_{k\neq\mu_1}\sum_{\beta\neq k,\mu_1}\Delta X_{\beta\beta,k}(w+i\eta)\,dw.
\]
Denote by $\Delta x_0^{(k)}(E)$ the summation of the terms in $\Delta x_0(E)$ containing $k$ factors of $X^G_{i\mu_1}$. By (2.1.232), (2.1.246) and (2.1.247), it is easy to check that with $1-N^{-D_1}$ probability,
\[
|\Delta x_0^{(3)}| \le N^{-7/6+C\epsilon_0}, \qquad |\Delta y_0^{(3)}| \le N^{-11/6+C\epsilon_0}.
\tag{2.1.261}
\]
We now decompose $\Delta X_{\mu\nu,k}$ into three parts indexed by the number of factors of $X^G_{i\mu_1}$ they contain. By (2.1.232), (2.1.247), (2.1.248) and (2.1.261), with $1-N^{-D_1}$ probability, we have
\[
\Delta X_{\mu\nu,k} = \Delta X^{(1)}_{\mu\nu,k}+\Delta X^{(2)}_{\mu\nu,k}+\Delta X^{(3)}_{\mu\nu,k}+O(N^{-3+C\epsilon_0}),
\tag{2.1.262}
\]
\[
\Delta x_0 = \Delta x_0^{(1)}+\Delta x_0^{(2)}+\Delta x_0^{(3)}+O(N^{-5/3+C\epsilon_0}),
\tag{2.1.263}
\]
\[
\Delta y_0 = \Delta y_0^{(1)}+\Delta y_0^{(2)}+\Delta y_0^{(3)}+O(N^{-7/3+C\epsilon_0}).
\tag{2.1.264}
\]
Inserting (2.1.263) and (2.1.264) into (2.1.260), similarly to the discussion of (2.1.250), we can find a random variable $A_5$ depending on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$, such that with $1-N^{-D_1}$ probability,
\[
\mathbb{E}_\gamma\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\mathbb{E}_\gamma\theta\!\left[\int_I x^Rq(y^R)\,dE\right]
= \mathbb{E}_\gamma\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl(\Delta x_0^{(3)}q(y^R)+x^Rq'(y^R)\Delta y_0^{(3)}\Bigr)\,dE\right]+A_4+A_5+o(N^{-2+t}).
\tag{2.1.265}
\]
Lemma 2.1.70 will be proved if we can show
\[
\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl(\Delta x_0^{(3)}q(y^R)+x^Rq'(y^R)\Delta y_0^{(3)}\Bigr)\,dE\right] = o(N^{-2}).
\tag{2.1.266}
\]
Due to the similarity, we shall only prove
\[
\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Delta x_0^{(3)}q(y^R)\,dE\right] = o(N^{-2});
\]
the other term follows similarly. By (2.1.189) and (2.1.259), with $1-N^{-D_1}$ probability, we have $|B^R| := \bigl|\theta'[\int_I x^Rq(y^R)\,dE]\bigr| \le N^{C\epsilon_0}$. Similar to (2.1.252), $\Delta x_0^{(3)}$ is a finite sum of terms of the form
\[
\mathbf{1}(t=0)\,N\eta\sum_{k\neq\mu,\nu,\mu_1}R_{\mu k}\,\sigma_i^{3/2}\bigl(X^G_{i\mu_1}\bigr)^3z^{3/2}R_{\nu a_1}R_{b_1a_2}R_{b_2a_3}R_{b_3k}.
\tag{2.1.267}
\]
Inserting (2.1.267) into $\int_I\Delta x_0^{(3)}q(y^R)\,dE$, for some constant $C>0$, we have
\[
\left|\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Delta x_0^{(3)}q(y^R)\,dE\right]\right| \le N^{-5/6+C\epsilon_0}\max_{k\neq\mu,\nu,\mu_1}\sup_{E\in I}\bigl|\mathbb{E}B^RR_{\mu k}R_{\nu\mu_1}R_{ik}q(y^R)\bigr|+o(N^{-2}).
\tag{2.1.268}
\]
Again by (2.1.246), (2.1.247) and (2.1.248), it is easy to check that with $1-N^{-D_1}$ probability, for some constant $C>0$, we have
\[
\bigl|R_{\mu k}R_{\nu\mu_1}R_{ik}B^Rq(y^R)-S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)\bigr| \le N^{-4/3+C\epsilon_0}.
\]
Therefore, if we can show
\[
\bigl|\mathbb{E}S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)\bigr| \le N^{-4/3+C\epsilon_0},
\tag{2.1.269}
\]
then by (2.1.268), we finish proving (2.1.266). It remains to prove (2.1.269). Recalling Definition 2.1.57 and (2.1.243), by [40, Lemma A.2], we have the following resolvent identities:
\[
S^{(\mu_1)}_{\mu\nu} = S_{\mu\nu}-\frac{S_{\mu\mu_1}S_{\mu_1\nu}}{S_{\mu_1\mu_1}}, \qquad \mu,\nu\neq\mu_1,
\tag{2.1.270}
\]
\[
S_{\mu\nu} = zS_{\mu\mu}S^{(\mu)}_{\nu\nu}\bigl(Y_{\gamma-1}^*S^{(\mu\nu)}Y_{\gamma-1}\bigr)_{\mu\nu}, \qquad \mu\neq\nu.
\tag{2.1.271}
\]
By (2.1.247), (2.1.248) and (2.1.270), it is easy to check that (see (3.72) of [72])
\[
\bigl|S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)-S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr)\bigr| \le N^{-4/3+C\epsilon_0}.
\tag{2.1.272}
\]
Moreover, by (3.73) of [72], we have
\[
S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr) = \bigl(S_{\mu k}S_{ik}B^Sq(y^S)\bigr)^{(\mu_1)}S_{\nu\mu_1}.
\tag{2.1.273}
\]
As $t=0$, by (2.1.271), we have
\[
S_{\nu\mu_1} = zm(z)S^{(\nu)}_{\mu_1\mu_1}\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}+z\bigl(S_{\nu\nu}-m(z)\bigr)S^{(\nu)}_{\mu_1\mu_1}\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}.
\tag{2.1.274}
\]
The conditional expectation $\mathbb{E}_\gamma$ applied to the first term of (2.1.274) vanishes; hence its contribution to the expectation of (2.1.273) vanishes as well. By (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
|S_{\nu\nu}-m(z)| \le N^{-1/3+C\epsilon_0}.
\tag{2.1.275}
\]
By the large deviation bound, with $1-N^{-D_1}$ probability, we have
\[
\left|\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}\right| \le N^{\epsilon_1}\frac{\bigl(\sum_{p,q}|S^{(\nu\mu_1)}_{pq}|^2\bigr)^{1/2}}{N}.
\tag{2.1.276}
\]
By (2.1.180) and (2.1.276), with $1-N^{-D_1}$ probability, we have
\[
\left|\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}\right| \le N^{-1/3+C\epsilon_0}.
\tag{2.1.277}
\]
Therefore, inserting (2.1.275) and (2.1.277) into (2.1.273), by (2.1.180), we have
\[
\bigl|\mathbb{E}S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr)\bigr| \le N^{-4/3+C\epsilon_0}.
\]
Combined with (2.1.272), we conclude our proof.

It is clear that our proof can be extended to the left singular vectors. For the proof of Theorem 2.1.44, the only difference is to use the mean value theorem in $\mathbb{R}^2$ whenever needed. Moreover, for the proof of Theorem 2.1.45, we need to use the $n$ intervals defined by
\[
I_i := [a_{2k_i-1}-N^{-2/3+\epsilon},\ a_{2k_i-1}+N^{-2/3+\epsilon}], \qquad i=1,2,\cdots,n.
\]
Singular vectors in the bulks In this section, we will prove the bulk universality Theorems 2.1.50 and 2.1.51. Our key ingredients, Lemmas 2.3.27 and 2.1.62 and Corollary 2.1.59, are proved for $N^{-1+\tau}\le\eta\le\tau^{-1}$ (recall (2.2.42)). In the bulks, by Lemma 2.1.61, the eigenvalue spacing is of order $N^{-1}$. The following lemma extends the above controls for a small spectral scale all the way down to the real axis. The proof relies on Corollary 2.1.59, and the details can be found in [72, Lemma 5.1].

Lemma 2.1.71. Recall (2.1.182). For $z\in D_b^k$ with $0<\eta\le\tau^{-1}$, when $N$ is large enough, with $1-N^{-D_1}$ probability, we have
\[
\max_{\mu,\nu}|G_{\mu\nu}-\delta_{\mu\nu}m(z)| \le N^{\epsilon_1}\Psi(z).
\tag{2.1.278}
\]
Once Lemma 2.1.71 is established, Lemmas 2.1.61 and 2.1.62 follow. Next, we follow the basic proof strategy of Theorem 2.1.44 but use a different spectral window size. Again, we will only provide the proof of the following Lemma 2.1.72, which establishes the universality of the distribution of $\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)$, in detail. Till the end of this section, we always use the scale parameter
\[
\eta = N^{-1-\epsilon_0}, \qquad \text{where } \epsilon_0>\epsilon_1 \text{ is a small constant}.
\tag{2.1.279}
\]
Therefore, the following bounds hold with $1-N^{-D_1}$ probability:
\[
\max_\mu|G_{\mu\mu}(z)| \le N^{2\epsilon_0}, \qquad \max_{\mu\neq\nu}|G_{\mu\nu}(z)| \le N^{2\epsilon_0}, \qquad \max_{\mu,s}|\zeta_\mu(s)|^2 \le N^{-1+\epsilon_0}.
\tag{2.1.280}
\]
The following lemma states the bulk universality for $\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)$.

Lemma 2.1.72. Let $Q_V=\Sigma^{1/2}X_VX_V^*\Sigma^{1/2}$ satisfy Assumption 2.1.39, assume that the third and fourth moments of $X_V$ agree with those of $X_G$, and consider the $k$-th bulk component, $k=1,2,\cdots,p$, with $l$ defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices $\mu,\nu\in\mathcal{I}_2$, there exists a small $\delta\in(0,1)$ such that when $\delta N_k\le l\le(1-\delta)N_k$, we have
\[
\lim_{N\to\infty}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr) = 0,
\]
where $\theta$ is a smooth function on $\mathbb{R}$ that satisfies
\[
|\theta^{(5)}(x)| \le C_1(1+|x|)^{C_1}, \quad \text{for some constant } C_1>0.
\tag{2.1.281}
\]
Proof. The proof strategy is very similar to that of Lemma 2.1.63. Our first step is an analogue of Lemma 2.1.64; the proof is quite similar (actually easier, as the window size is much smaller), so we omit further details.
Lemma 2.1.73. Under the assumptions of Lemma 2.1.72, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr)-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE\right]\right| = 0,
\tag{2.1.282}
\]
where $\mathcal{X}(E)$ is defined in (2.1.196) and, for $\epsilon$ satisfying (2.1.195), $I$ is denoted as
\[
I := [\gamma_{\alpha'}-N^{-1+\epsilon},\ \gamma_{\alpha'}+N^{-1+\epsilon}].
\tag{2.1.283}
\]
Next, we will express the indicator function in (2.1.282) using Green functions. Recall (2.1.222); a key observation there is that the size of $[E_-,E_U]$ is of order $N^{-2/3}$ due to (2.1.191). As we now use (2.1.279) and (2.1.283) in the bulks, the size here is of order 1, so we cannot use the delta approximation to estimate $\mathcal{X}(E)$. Instead, we will use the Helffer-Sjöstrand functional calculus, which has been used many times when the window size $\eta$ takes the form (2.1.279).

For any $0<E_1,E_2\le\tau^{-1}$, let $f(\lambda)\equiv f_{E_1,E_2,\eta_d}(\lambda)$ be the characteristic function of $[E_1,E_2]$ smoothed on the scale
\[
\eta_d := N^{-1-d\epsilon_0}, \qquad d>2;
\tag{2.1.284}
\]
that is, $f=1$ when $\lambda\in[E_1,E_2]$, $f=0$ when $\lambda\in\mathbb{R}\setminus[E_1-\eta_d,E_2+\eta_d]$, and
\[
|f'| \le C\eta_d^{-1}, \qquad |f''| \le C\eta_d^{-2},
\tag{2.1.285}
\]
for some constant $C>0$. Denote $f_E\equiv f_{E_-,E_U,\eta_d}$. We have
\[
f_E(\lambda) = \frac{1}{2\pi}\int_{\mathbb{R}^2}\frac{i\sigma f_E''(e)\chi(\sigma)+if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)}{\lambda-e-i\sigma}\,de\,d\sigma,
\tag{2.1.286}
\]
where $\chi(y)$ is a smooth cutoff function supported in $[-1,1]$, with $\chi(y)=1$ for $|y|\le\frac{1}{2}$ and bounded derivatives. Using a similar argument to Lemma 2.1.65, we have the following result.
Lemma 2.1.74. Recall the smooth cutoff function $q$ defined in (2.2.17). Under the assumptions of Lemma 2.1.73, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE\right]-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right]\right| = 0.
\tag{2.1.287}
\]
Proof. It is easy to check that with $1-N^{-D_1}$ probability, (2.1.225) still holds true. Therefore, it remains to prove the following:
\[
\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}\mathcal{X}_E(Q_1)\bigr)\,dE\right]-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = o(1).
\tag{2.1.288}
\]
We first observe that for any $x\in\mathbb{R}$, we have
\[
|\mathcal{X}_E(x)-f_E(x)| = \begin{cases}0, & x\in[E_-,E_U]\cup(-\infty,E_--\eta_d)\cup(E_U+\eta_d,+\infty);\\[2pt] |f_E(x)|, & x\in[E_--\eta_d,E_-)\cup(E_U,E_U+\eta_d].\end{cases}
\]
Therefore, we have
\[
\bigl|\operatorname{Tr}\mathcal{X}_E(Q_1)-\operatorname{Tr}f_E(Q_1)\bigr| \le \max_x|f_E(x)|\Bigl(\mathcal{N}(E_--\eta_d,E_-)+\mathcal{N}(E_U,E_U+\eta_d)\Bigr).
\]
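This comparison of sharp and smoothed eigenvalue counts can be illustrated numerically (a sketch only; the piecewise-linear smoothing below is an arbitrary stand-in for the smooth $f_E$, and the scales are toy choices):

```python
import numpy as np

# Replacing the sharp indicator of [E1, E2] by a version smoothed on scale eta_d
# changes the eigenvalue count only by the number of eigenvalues in the two
# boundary strips [E1 - eta_d, E1) and (E2, E2 + eta_d].
rng = np.random.default_rng(3)
n = 200
X = rng.standard_normal((n, n)) / np.sqrt(n)
evals = np.linalg.eigvalsh(X @ X.T)
E1, E2, eta_d = 0.5, 2.0, 0.05

def f_smooth(lam):
    # equals 1 on [E1, E2], 0 outside [E1 - eta_d, E2 + eta_d], linear in between
    return np.clip(np.minimum(lam - (E1 - eta_d), (E2 + eta_d) - lam) / eta_d, 0.0, 1.0)

sharp = np.sum((evals >= E1) & (evals <= E2))
smooth = f_smooth(evals).sum()
boundary = np.sum((evals >= E1 - eta_d) & (evals < E1)) + np.sum((evals > E2) & (evals <= E2 + eta_d))
assert abs(smooth - sharp) <= boundary + 1e-9
```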
By Lemma 2.1.61, the definition of $\eta_d$ and a similar argument to (2.1.230), we can finish the proof of (2.1.288).

Finally, we apply the Green function comparison argument, following the basic approach of [72, Section 5]. The key difference is that we will use (2.1.279) and (2.1.280).
Lemma 2.1.75. Under the assumptions of Lemma 2.1.74, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = 0.
\tag{2.1.289}
\]
Proof. Recalling (2.1.286), by (2.3.52), we have
\[
\operatorname{Tr}f_E(Q_1) = \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(i\sigma f_E''(e)\chi(\sigma)+if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma.
\tag{2.1.290}
\]
Denote $\eta_d := N^{-1-(d+1)\epsilon_0}$; we can decompose the right-hand side of (2.1.290) as
\[
\operatorname{Tr}f_E(Q_1) = \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma
+\frac{iN}{2\pi}\int_{|\sigma|>\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma
+\frac{iN}{2\pi}\int_{-\eta_d}^{\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma.
\]
By (2.1.280) and (2.1.285), for some constant $C>0$, with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{iN}{2\pi}\int_{-\eta_d}^{\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma\right| \le N^{-C\epsilon_0}.
\tag{2.1.291}
\]
Recall (2.2.82) and (2.1.234); similarly to Lemma 2.1.68, we first drop the diagonal terms. By (2.1.278), with $1-N^{-D_1}$ probability, we have (recall (2.1.237))
\[
\int_I\left|\frac{N}{\pi}G_{\mu\nu}(E+i\eta)-x(E)\right|dE \le N^{-1+C\epsilon_0},
\]
for some constant $C>0$. Hence, by the mean value theorem, we only need to prove
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = 0.
\tag{2.1.292}
\]
Furthermore, by Taylor expansion, (2.1.291) and the definition of $\chi$, it suffices to prove
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(y(E)+\underline{y}(E)\bigr)\,dE\right] = 0,
\tag{2.1.293}
\]
where
\[
y(E) := \frac{N}{2\pi}\int_{\mathbb{R}^2}i\sigma f_E''(e)\chi(\sigma)m_2(e+i\sigma)\mathbf{1}(|\sigma|\ge\eta_d)\,de\,d\sigma,
\tag{2.1.294}
\]
\[
\underline{y}(E) := \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma.
\tag{2.1.295}
\]
Next, we will use the Green function comparison argument to prove (2.1.293). In the proof of Lemma 2.1.67, we used the resolvent expansion up to order 4. However, due to the larger bounds in (2.1.280), we will use the following expansion:
\[
S = R-RVR+(RV)^2R-(RV)^3R+(RV)^4R-(RV)^5S.
\tag{2.1.296}
\]
Recalling (2.1.244) and (2.1.245), we have
\[
[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(y(E)+\underline{y}(E)\bigr)\,dE\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\left(\theta\!\left[\int_I x^Sq\bigl(y^S+\underline{y}^S\bigr)\,dE\right]-\theta\!\left[\int_I x^Tq\bigl(y^T+\underline{y}^T\bigr)\,dE\right]\right).
\tag{2.1.297}
\]
We still use the notation $\Delta x(E):=x^S(E)-x^R(E)$. We first deal with $x(E)$. Denote by $\Delta x^{(k)}(E)$ the summation of the terms in $\Delta x(E)$ containing $k$ factors of $X^G_{i\mu_1}$. Similarly to the discussion of Lemma 2.1.69, recalling (2.1.249), by (2.1.135) and (2.1.280), with $1-N^{-D_1}$ probability, we have
\[
|\Delta x^{(5)}(E)| \le N^{-3/2+C\epsilon_0}, \qquad M+1\le k\neq\mu,\nu\le M+N.
\]
This yields
\[
\Delta x(E) = \sum_{p=1}^{4}\Delta x^{(p)}(E)+O(N^{-3/2+C\epsilon_0}).
\tag{2.1.298}
\]
Denote
\[
\Delta\underline{y}(E) := \underline{y}^S(E)-\underline{y}^R(E), \qquad \Delta m_2 := m_2^S-m_2^R = \frac{1}{N}\sum_{\mu=M+1}^{M+N}(S_{\mu\mu}-R_{\mu\mu}).
\]
We first deal with (2.1.295). By the definition of $\chi$, we may restrict to $\frac{1}{2}\le|\sigma|\le1$; hence, by (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
\max_\mu|G_{\mu\mu}| \le N^{\epsilon_1}, \qquad \max_{\mu\neq\nu}|G_{\mu\nu}| \le N^{-1/2+\epsilon_1}.
\tag{2.1.299}
\]
By (2.1.247), (2.1.248), (2.1.296) and (2.1.299), with $1-N^{-D_1}$ probability, we have $|\Delta m_2^{(5)}|\le N^{-7/2+9\epsilon_1}$. This yields the following decomposition:
\[
\Delta\underline{y}(E) = \sum_{p=1}^{4}\Delta\underline{y}^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.300}
\]
Next, we will control (2.1.294). Denote $\Delta y(E):=y^S(E)-y^R(E)$. By (2.1.247), (2.1.248) and (2.1.278), with $1-N^{-D_1}$ probability, we have
\[
|\Delta m_2^{(5)}| \le N^{-5/2+C\epsilon_0}.
\tag{2.1.301}
\]
In order to estimate $\Delta y(E)$, we integrate (2.1.294) by parts, first in $e$ and then in $\sigma$; by (5.24) of [71], with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{N}{2\pi}\int_{\mathbb{R}^2}i\sigma f_E''(e)\chi(\sigma)\Delta m_2^{(5)}(e+i\sigma)\mathbf{1}(|\sigma|\ge\eta_d)\,de\,d\sigma\right|
\le CN\left|\int f_E'(e)\,\eta_d\,\Delta m_2^{(5)}(e+i\eta_d)\,de\right|
+CN\left|\int f_E'(e)\,de\int_{\eta_d}^{\infty}\chi'(\sigma)\sigma\,\Delta m_2^{(5)}(e+i\sigma)\,d\sigma\right|
+CN\left|\int f_E'(e)\,de\int_{\eta_d}^{\infty}\chi(\sigma)\,\Delta m_2^{(5)}(e+i\sigma)\,d\sigma\right|.
\tag{2.1.302}
\]
By (2.1.301) and (2.1.302), with $1-N^{-D_1}$ probability, we have the following decomposition:
\[
\Delta y(E) = \sum_{p=1}^{4}\Delta y^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.303}
\]
Similarly to the discussion of (2.1.298), (2.1.300) and (2.1.303), it is easy to check that with $1-N^{-D_1}$ probability, we have
\[
\int_I|\Delta x^{(p)}(E)|\,dE \le N^{-p/2+C\epsilon_0}, \qquad |\Delta y^{(p)}(E)| \le N^{-p/2+C\epsilon_0}, \qquad |\Delta\underline{y}^{(p)}(E)| \le N^{-p/2+C\epsilon_0},
\tag{2.1.304}
\]
where $p=1,2,3,4$ and $C>0$ is some constant. Furthermore, by (2.1.278), with $1-N^{-D_1}$ probability, we have
\[
\int_I|x(E)|\,dE \le N^{C\epsilon_0}.
\tag{2.1.305}
\]
Due to the similarity of (2.1.300) and (2.1.303), we denote $\hat{y}:=y+\underline{y}$, and then we have
\[
\Delta\hat{y} = \sum_{p=1}^{4}\Delta\hat{y}^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.306}
\]
By (2.1.304), (2.1.306) and Taylor expansion, we have
\[
q(\hat{y}^S) = q(\hat{y}^R)+q'(\hat{y}^R)\left(\sum_{p=1}^{4}\Delta\hat{y}^{(p)}(E)\right)+\frac{1}{2}q''(\hat{y}^R)\left(\sum_{p=1}^{3}\Delta\hat{y}^{(p)}(E)\right)^2
+\frac{1}{6}q^{(3)}(\hat{y}^R)\left(\sum_{p=1}^{2}\Delta\hat{y}^{(p)}(E)\right)^3+\frac{1}{24}q^{(4)}(\hat{y}^R)\left(\Delta\hat{y}^{(1)}(E)\right)^4+o(N^{-2}).
\tag{2.1.307}
\]
By (2.1.281), we have
\[
\theta\!\left[\int_I x^Sq(\hat{y}^S)\,dE\right]-\theta\!\left[\int_I x^Rq(\hat{y}^R)\,dE\right] = \sum_{s=1}^{4}\frac{1}{s!}\theta^{(s)}\!\left(\int_I x^Rq(\hat{y}^R)\,dE\right)\left[\int_I x^Sq(\hat{y}^S)\,dE-\int_I x^Rq(\hat{y}^R)\,dE\right]^s+o(N^{-2}).
\tag{2.1.308}
\]
Inserting $x^S=x^R+\sum_{p=1}^{4}\Delta x^{(p)}$ and (2.1.307) into (2.1.308), using the partial expectation argument, by (2.1.281), (2.1.304) and (2.1.305), we find that there exists a random variable $B$ that depends on the randomness only through $O$ and the first four moments of $X^G_{i\mu_1}$, such that
\[
\mathbb{E}\theta\!\left[\int_I x^Sq(\hat{y}^S)\,dE\right]-\mathbb{E}\theta\!\left[\int_I x^Rq(\hat{y}^R)\,dE\right] = B+o(N^{-2}).
\tag{2.1.309}
\]
Hence, combined with (2.1.297), we prove (2.1.293), which implies (2.1.289). This finishes our proof.
2.2 Eigen-structure of the model of matrix denoising
Consider that we observe a noisy $M\times N$ data matrix $\tilde{S}$, where
\[
\tilde{S} = X+S.
\tag{2.2.1}
\]
In model (2.2.1), the deterministic matrix $S$ is known as the signal matrix and $X$ as the noise matrix. In the classic framework, under the assumption that $M$ is much smaller than $N$, the truncated singular value decomposition (TSVD) is the default technique; see for example [63]. This method recovers $S$ with an estimator $\hat{S}=\sum_{i=1}^{m}\mu_iu_iv_i^*$, where $m<\min\{M,N\}$ denotes the truncation level and $\mu_i,u_i,v_i$, $i=1,2,\cdots,m$, are the singular values, left singular vectors and right singular vectors of $\tilde{S}$, respectively. We usually need to provide a threshold $\gamma$ to choose $m$, keeping only the singular values with $\mu_i\ge\gamma$. Two popular methods are soft thresholding [44] and hard thresholding [61].
In recent years, the advance of technology has led to the observation of massive-scale data, where the dimension of the variables is comparable to the length of the observation. In this situation, the TSVD loses its validity. To address this problem, in the present section, we consider the matrix denoising problem (2.2.1) assuming that $M$ is comparable to $N$, and estimate $S$ in the following two regimes:

Regime (1). $S$ is of low rank and we have prior information that its singular vectors are sparse;

Regime (2). $S$ is of low rank and we have no prior information on the singular vectors.

In regime (1), $S$ is called a simultaneously low-rank and sparse matrix. This type of matrix has been heavily used in biology; a typical example is from the study of gene expression data [90]. Yang, Ma and Buja [112] also consider such a problem, but from a quite different perspective: they do not take the local behavior of singular values and vectors into consideration, and instead use an adaptive thresholding method to recover $S$ in (2.2.1). In regime (2), it is almost hopeless to completely recover $S$, as we have little information; we are interested in the best we can do in this case. A natural (and probably necessary) assumption is rotation invariance [24], as the only information we know about the singular vectors is orthonormality. It is notable that, in this case, our result coincides with the results proposed by Gavish and Donoho [62], who consider the estimator from another perspective and restrict the estimator to be conservative.
Our methodologies rely on investigating the local properties of singular values and vectors. We study the convergent limits and rates for the singular values and vectors of high dimensional rectangular matrices, assuming $M$ is comparable to $N$. In this section, we consider the problem (2.2.1) and assume that $X=(x_{ij})$ is an $M\times N$ matrix with i.i.d. centered entries $x_{ij}=N^{-1/2}q_{ij}$, where $q_{ij}$ has unit variance and there exists a constant $C$ such that, for some $p\in\mathbb{N}$ large enough, $q_{ij}$ satisfies the moment condition
\[
\mathbb{E}|q_{ij}|^p \le C.
\tag{2.2.2}
\]
We denote the SVD of $S$ as $S=UDV^*$, where $D=\operatorname{diag}\{d_1,\cdots,d_r\}$, $U=(u_1,\cdots,u_r)$, $V=(v_1,\cdots,v_r)$, and where $u_i\in\mathbb{R}^M$, $v_i\in\mathbb{R}^N$ are orthonormal vectors and $r$ is a fixed constant. We also assume $d_1>d_2>\cdots>d_r>0$. Then (2.2.1) can be written as
\[
\tilde{S} = X+UDV^*.
\tag{2.2.3}
\]
Throughout, we are interested in the following setup: for some large constant $C>0$, we have
\[
c\equiv c_N := \frac{N}{M}, \qquad C^{-1}\le c\le C.
\tag{2.2.4}
\]
It is well known that for the noise matrix $X$, the spectrum of $XX^*$ satisfies the celebrated Marchenko-Pastur (MP) law and the largest eigenvalue satisfies the Tracy-Widom (TW) distribution. Specifically, denoting by $\lambda_i:=\lambda_i(XX^*)$, $i=1,2,\cdots,K$, $K=\min\{M,N\}$, the eigenvalues of $XX^*$ in decreasing order, we have that
\[
\lambda_1 = \lambda_++O(N^{-2/3}), \qquad \lambda_+ = (1+c^{-1/2})^2,
\tag{2.2.5}
\]
holds with high probability. Furthermore, denoting by $\xi_i,\zeta_i$ the singular vectors of $X$, for some large constant $C>0$, with high probability, we have [35]
\[
\max_k|\xi_i(k)|^2+|\zeta_i(k)|^2 = O(N^{-1}), \qquad i\le C.
\]
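The edge behavior in (2.2.5) is easy to observe in simulation (a Monte-Carlo sketch with arbitrary dimensions, for illustration only):

```python
import numpy as np

# The top eigenvalue of XX^* for X with i.i.d. N(0, 1/N) entries concentrates
# near lambda_+ = (1 + c^{-1/2})^2 with c = N/M, up to O(N^{-2/3}) fluctuations.
rng = np.random.default_rng(5)
M, N = 300, 900
c = N / M
X = rng.standard_normal((M, N)) / np.sqrt(N)
lam1 = np.linalg.eigvalsh(X @ X.T).max()
lam_plus = (1 + c ** (-0.5)) ** 2
print(lam1, lam_plus)  # lam1 = lam_plus + O(N^{-2/3})
```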
To sketch the behavior of $\tilde{S}$, we consider the case $r=1$ in (2.2.3). Assuming that the distribution of the entries of $X$ is bi-unitarily invariant, Benaych-Georges and Nadakuditi [13] established the convergent limits using free probability theory. Denoting by $\mu_i:=\lambda_i(\tilde{S}\tilde{S}^*)$, $i=1,2,\cdots,K$, the eigenvalues of $\tilde{S}\tilde{S}^*$, they proved that when $d>c^{-1/4}$, $\mu_1$ detaches from the spectrum of the MP law and becomes an outlier, while when $d<c^{-1/4}$, $\mu_1$ converges to $\lambda_+$ and sticks to the spectrum of the MP law. For the singular vectors, denote by $\tilde{u}_i,\tilde{v}_i$ the left and right singular vectors of $\tilde{S}$, $i=1,2,\cdots,K$. They proved that when $d>c^{-1/4}$, $\tilde{u}_1,\tilde{v}_1$ concentrate on cones with axes parallel to $u_1,v_1$ respectively, and the apertures of the cones converge to deterministic limits; when $d<c^{-1/4}$, $\tilde{u}_1,\tilde{v}_1$ are asymptotically perpendicular to $u_1,v_1$ respectively.
Our computation and proof rely on the isotropic local MP law [16, 72] and the
anisotropic local law [74]. These results say that the eigenvalue distribution of the sample
covariance matrix XX∗ is close to the MP law, down to the spectral scales containing
slightly more than one eigenvalue. These local laws are formulated using the Green
functions,
G1(z) := (XX∗ − z)−1, G2(z) := (X∗X − z)−1, z = E + iη ∈ C+. (2.2.6)
To illustrate our results and ideas, we give an overview of the local behavior of the singular values and vectors of S̃ and how they can be used to recover the signal matrix S in (2.2.1). As we have seen from [35, 40, 74, 110], the self-adjoint linearization technique is quite useful in dealing with rectangular matrices. Hence, in a first step, we denote

H̃ := [[0, z^{1/2} S̃], [z^{1/2} S̃^*, 0]] = [[0, z^{1/2} X], [z^{1/2} X^*, 0]] + [[0, z^{1/2} UDV^*], [z^{1/2} VDU^*, 0]] = H + 𝒰 𝒟 𝒰^*, (2.2.7)

where 𝒟, 𝒰 are defined as

𝒟 := [[0, z^{1/2} D], [z^{1/2} D, 0]], 𝒰 := [[U, 0], [0, V]]. (2.2.8)
(2.2.7) is a very convenient expression. On one hand, the eigenvalues of S̃S̃^* can be uniquely characterized by the eigenvalues of H̃. On the other hand, the Green functions of XX^* and X^*X are contained in that of H (see (2.2.48)).
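The eigenvalue correspondence behind the linearization can be seen in a small sketch (mine, not from the thesis; evaluated at z = 1 so that H is real symmetric): the nonzero eigenvalues of [[0, S], [S^*, 0]] come in pairs ±σ_i(S), so the spectrum of SS^* is encoded in that of the linearized matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 8
S = rng.standard_normal((M, N))    # any M x N matrix stands in for the data matrix

# self-adjoint linearization evaluated at z = 1:
# H = [[0, S], [S^*, 0]] has eigenvalues {+-sigma_i(S)} plus N - M zeros,
# so the eigenvalues of SS^* are exactly the squared positive eigenvalues of H
H = np.block([[np.zeros((M, M)), S],
              [S.T, np.zeros((N, N))]])

ev = np.linalg.eigvalsh(H)                          # ascending
sv = np.sort(np.linalg.svd(S, compute_uv=False))    # ascending singular values
top = ev[-M:]                                       # the M positive eigenvalues of H
```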
Next we give a heuristic description of our results. We always use λ_1 ≥ · · · ≥ λ_K, K = min{M, N}, for the non-trivial eigenvalues of XX^* and denote µ_1 ≥ · · · ≥ µ_K as the eigenvalues of S̃S̃^*. We also denote ξ_i, ζ_i, i = 1, · · · , K, as the singular vectors of X and ũ_i, ṽ_i as those of S̃. And we denote G̃(z) as the Green function of H̃ and G(z) as that of H. Consider r = 1 in (2.2.3); by a standard perturbation discussion (see Lemma 2.2.17), we find that µ_1 satisfies the equation det(𝒰^* G(µ_1) 𝒰 + 𝒟^{-1}) = 0. Using the anisotropic local law in [74], we find that (see Lemma 2.3.27) G has a deterministic limit Π when N is large enough. Heuristically, the convergent limit of µ_1 is determined by the equation det(𝒰^* Π(z) 𝒰 + 𝒟^{-1}) = 0. An elementary calculation shows that, when d > c^{-1/4}, µ_1 → p(d), where p(d) is defined as

p(d) = (d^2 + 1)(d^2 + c^{-1}) / d^2. (2.2.9)
When d > c^{-1/4}, the largest eigenvalue µ_1 detaches from the bulk and becomes an outlier around its classical location p(d). We expect this transition to happen on the scale N^{-1/3}. This can be understood in the following way: increasing d beyond the critical value c^{-1/4}, we expect µ_1 to become an outlier, whose location p(d) lies at a distance greater than O(N^{-2/3}) from λ_+. By the mean value theorem, the phase transition will take place on the scale where

|d − c^{-1/4}| ≥ O(N^{-1/3}). (2.2.10)

When (2.2.10) holds, we also prove that

µ_1 = p(d) + O(N^{-1/2}(d − c^{-1/4})^{1/2}). (2.2.11)

Below this scale, we expect the spectrum of S̃S̃^* to stick to that of XX^*. In particular, the largest eigenvalue µ_1 still has the Tracy-Widom distribution on the scale N^{-2/3}, which reads as

µ_1 = λ_+ + O(N^{-2/3}). (2.2.12)
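The two regimes (2.2.11) and (2.2.12) can be observed directly in a rank-one simulation (a sketch of mine; the dimensions, seed, and signal directions are arbitrary choices): above the threshold c^{-1/4} the largest eigenvalue sits near p(d), below it the eigenvalue sticks to λ_+.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 300, 600
c = N / M
X = rng.standard_normal((M, N)) / np.sqrt(N)

u = np.zeros(M); u[0] = 1.0        # arbitrary unit signal directions
v = np.zeros(N); v[0] = 1.0
lam_plus = (1 + c ** -0.5) ** 2

def p(d):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def mu1(d):
    # largest eigenvalue of the rank-one deformed model X + d u v^*
    St = X + d * np.outer(u, v)
    return np.linalg.eigvalsh(St @ St.T)[-1]

mu_super = mu1(3.0)    # d well above the critical value c^{-1/4} ~ 0.84
mu_sub = mu1(0.3)      # d below the critical value
```

With these parameters, `mu_super` is close to p(3.0) while `mu_sub` is close to λ_+, matching the dichotomy described above.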
For the singular vectors, when d > c^{-1/4}, we have ⟨ũ_1, u_1⟩^2 → a_1(d) and ⟨ṽ_1, v_1⟩^2 → a_2(d), where a_1(d), a_2(d) are deterministic functions of d defined in (2.2.27). For the local behavior, we will use an integral representation of Green functions (see (2.2.83)). However, when r > 1, if d_i ≈ d_j, i ≠ j, we expect ũ_i (ṽ_i) and ũ_j (ṽ_j) to lie in the same eigenspace, and then we cannot distinguish the singular vectors. Therefore, in this paper, we assume that for i ≠ j there exists some ε_0 > 0 such that d_i, d_j satisfy the following condition

|p(d_i) − p(d_j)| ≥ N^{-1/2+ε_0}(d_i − c^{-1/4})^{1/2}. (2.2.13)

(2.2.13) is referred to as the non-overlapping condition [17, 72]; it ensures that the eigenspaces corresponding to different d_i, i = 1, · · · , r, can be well separated. This can be understood in the following way: when d_i, d_j > c^{-1/4}, the corresponding eigenvalues µ_i, µ_j of S̃S̃^* converge to p(d_i) and p(d_j) respectively. Hence, (2.2.13) ensures that the singular vectors can be distinguished individually. Under the assumption that the d_i's are well-separated and satisfy (2.2.10), we prove that

⟨ũ_1, u_1⟩^2 = a_1(d) + O(N^{-1/2}), ⟨ṽ_1, v_1⟩^2 = a_2(d) + O(N^{-1/2}). (2.2.14)

Below the scale of (2.2.10), we prove that

⟨ũ_1, u_1⟩^2 = O(N^{-1}), ⟨ṽ_1, v_1⟩^2 = O(N^{-1}). (2.2.15)
Armed with (2.2.11), (2.2.12), (2.2.14) and (2.2.15), we can address the matrix denoising problem (2.2.3) under two different regimes. In the first regime, we assume a sparse structure of the singular vectors: when d > c^{-1/4}, we expect ũ_1, ṽ_1 to be sparse as well, so that S̃ inherits a sparse structure. Therefore, by suitably choosing a submatrix of S̃ and doing SVD for the submatrix, we can get an estimator for the singular vectors. Our novelty is to truncate singular values and vectors simultaneously. For the estimation of the singular values, we can invert (2.2.11) to get an estimator for d. For the singular vectors, based on (2.2.15), the truncation level should be much larger than N^{-1/2}, and we will use the K-means clustering algorithm to choose such a level. However, when d < c^{-1/4}, nothing can be estimated according to (2.2.12) and (2.2.15).

In the second regime, as we have no prior information whatsoever on the true eigenbasis of S, the only possibility is to use the eigenbasis of S̃. This is equivalent to the assumption of rotation invariance. We will propose a consistent rotation invariant estimator (RIE) Ξ(S̃), which satisfies the following condition,

Ω_1 Ξ(S̃) Ω_2 = Ξ(Ω_1 S̃ Ω_2), (2.2.16)

where Ω_1, Ω_2 are orthogonal (rotation) matrices in R^M, R^N respectively.
Sparse estimation In the present application, we study the denoising problem (2.2.1), where S is sparse in the sense that its nonzero entries are assumed to be confined to a block. We assume that u_i, v_i are sparse and introduce the following definition to describe the sparsity precisely.

Definition 2.2.1. A vector ν ∈ R^N is a sparse vector if there exists a subset N_* ⊂ {1, 2, · · · , N} with |N_*| = O(1) such that

|ν(i)| = O(1) for i ∈ N_*, and |ν(i)| = O(N^{-1/2}) otherwise.
Denote

q := argmin{1 ≤ i ≤ K : µ_i ≤ λ_+ + N^{-2/3+τ}}, (2.2.17)

where τ > 0 is a small constant and λ_+ is defined in (2.2.5). That is, q is the index of the first extremal non-outlier eigenvalue. As p(d) is increasing in d on (c^{-1/4}, ∞) with p(c^{-1/4}) = λ_+, we conclude that there exist q − 1 outliers and a phase transition happens after µ_q.

With the above notation, we provide the stepwise SVD Algorithm 1 to recover S in (2.2.1). As u_i, v_i are sparse, we need to find a submatrix of S̃ by a suitable truncation.
Algorithm 1 Stepwise SVD

1: Do SVD for S̃ = Σ_{i=1}^K √µ_i ũ_i ṽ_i^*, and initialize S̃_1 = S̃ = Σ_i t_i^1 u_i^1 (v_i^1)^*.
2: while 1 ≤ j < q do
3:   d̂_j = p^{-1}((t_1^j)^2), where p^{-1}(x) is the inverse of the function defined in (2.2.9).
4:   Use two thresholds α_u^j / √M, α_v^j / √N, and denote

       I_j := {1 ≤ k ≤ M : |u_1^j(k)| ≥ α_u^j / √M}, J_j := {1 ≤ k ≤ N : |v_1^j(k)| ≥ α_v^j / √N}. (2.2.18)

5:   Do SVD for the block matrix S̃_b = S̃_j[I_j, J_j] = Σ_i ρ_i u_i^{j,b} (v_i^{j,b})^*.
6:   Assume I_j = {k_1, · · · , k_m}; construct û_j by letting

       û_j(k) = u_1^{j,b}(ℓ) if k = k_ℓ ∈ I_j, and û_j(k) = 0 otherwise.

     Similarly, we construct v̂_j.
7:   Let S̃_{j+1} = S̃_j − d̂_j û_j v̂_j^* and do SVD for S̃_{j+1} = Σ_i t_i^{j+1} u_i^{j+1} (v_i^{j+1})^*.
8: end while
9: Denote Ŝ = Σ_{k=1}^{q−1} d̂_k û_k v̂_k^* as our estimator.
Algorithm 1 provides a way to recover S stepwise. We first estimate d_1, u_1, v_1 by d̂_1, û_1, v̂_1, then d_2, u_2, v_2 by analyzing S̃ − d̂_1 û_1 v̂_1^*. In each step, we only need to look at the largest singular value and its associated singular vectors. It is notable that we drop all the singular values of S̃ that lie below the level λ_+ + N^{-2/3+τ}; this shrinkage of the singular values can be written as

d̂_i = 1(µ_i > λ_+ + N^{-2/3+τ}) p^{-1}(µ_i). (2.2.19)
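The shrinkage rule (2.2.19) is easy to implement once p^{-1} is available in closed form: writing y = d^2, (2.2.9) becomes x = y + 1 + c^{-1} + c^{-1}/y, a quadratic in y. The sketch below (my own; function names and the value of τ are illustrative) inverts p on the outlier branch and applies the hard threshold.

```python
import numpy as np

def p(d, c):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def p_inv(x, c):
    # invert p on d > c^{-1/4}: with y = d^2, (2.2.9) reads
    # x = y + 1 + 1/c + 1/(c*y), a quadratic in y; take the larger root
    b = x - 1 - 1 / c
    y = (b + np.sqrt(b ** 2 - 4 / c)) / 2
    return np.sqrt(y)

def shrink(mu, c, N, tau=0.1):
    # hard shrinkage (2.2.19): invert p only for eigenvalues exceeding the
    # bulk edge lambda_+ by at least N^{-2/3+tau}; set the rest to zero
    lam_plus = (1 + c ** -0.5) ** 2
    mu = np.asarray(mu, dtype=float)
    d_hat = np.zeros_like(mu)
    keep = mu > lam_plus + N ** (-2 / 3 + tau)
    d_hat[keep] = p_inv(mu[keep], c)
    return d_hat
```

For instance, with c = 2 one has p(2) = 5.625, and `shrink([5.625, 2.0], 2.0, 400)` recovers d = 2 from the first eigenvalue while discarding the second, which lies below the threshold.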
Our methodology relies on truncating singular values and vectors simultaneously. As illustrated in (2.2.18), the thresholds α_u^j and α_v^j play the key role in recovering the sparse structure of the singular vectors. It will be proved in Section 2.2 that any threshold satisfying (2.2.18) works when N is sufficiently large. In the finite sample framework (when N is not very large), we employ the K-means algorithm to stabilize the recovery of the sparse structure of S. The reason is that the entries of the singular vectors ũ_i, ṽ_i can be well classified into two categories (see Lemma 2.2.10). Denote by C_u^j, C_v^j the index sets obtained from the K-means algorithm [60], where they satisfy

min_{k ∈ C_u^j} |u_1^j(k)| ≫ 1/√M, min_{k ∈ C_v^j} |v_1^j(k)| ≫ 1/√N. (2.2.20)
We now replace (2.2.18) with the following step:

• Do K-means clustering to partition the entries of u_1^j, v_1^j into two classes, and set

I_j := {1 ≤ k ≤ M : k ∈ C_u^j}, J_j := {1 ≤ k ≤ N : k ∈ C_v^j}, (2.2.21)

where C_u^j, C_v^j satisfy (2.2.20).
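The clustering step only needs a one-dimensional 2-means on entry magnitudes. A minimal self-contained stand-in (mine, not the thesis implementation, which uses a K-means package; the example vector is synthetic) is:

```python
import numpy as np

def support_by_2means(w, iters=50):
    """A minimal 1-D 2-means on entry magnitudes, standing in for the
    K-means truncation step: split the entries of a singular-vector
    estimate into a 'large' cluster (the recovered support) and a
    'small' O(N^{-1/2}) cluster; returns the indices of the large one."""
    a = np.abs(w)
    lo, hi = a.min(), a.max()              # initialize the two centers
    for _ in range(iters):
        big = np.abs(a - hi) < np.abs(a - lo)
        lo, hi = a[~big].mean(), a[big].mean()
    return np.flatnonzero(big)

rng = np.random.default_rng(3)
N = 200
v = rng.uniform(-1, 1, N) / np.sqrt(N)     # O(N^{-1/2}) background entries
v[[3, 17, 42]] = [0.7, -0.6, 0.5]          # a few O(1) entries: the support
idx = support_by_2means(v)
```

Because the two magnitude scales are well separated, the large cluster is exactly the planted support {3, 17, 42}, which is the dichotomy that Lemma 2.2.10 guarantees for the singular vectors.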
We use Table 2.1 to compare the results of three algorithms: our stepwise SVD (SWSVD), the sparse SVD (SSVD) proposed by [112], and the truncated SVD (TSVD). For the implementation of SSVD, we use the ssvd package in R, which is contributed by the first author of [112]. From Table 2.1, we find that our method outperforms both SSVD and TSVD in all cases. Furthermore, the standard deviation is small, which implies that our estimation is quite stable.
Rotation invariant estimation This subsection is devoted to recovering S in (2.2.1) assuming that no prior information about S is available. In this regime, we consider rotation invariant estimators (RIE) satisfying (2.2.16). We conclude that any RIE shares the same singular vectors as S̃. To construct the optimal estimator, we use the Frobenius norm as our loss function. Denoting Ŝ = Ξ(S̃), we have

||Ŝ − S||_2^2 = Tr (Ŝ − S)(Ŝ − S)^*. (2.2.22)
                     M = 300                        M = 500
        Sparsity  L2 error norm  Std     Sparsity  L2 error norm  Std
SWSVD   0.05      0.043          0.175   0.05      0.045          0.189
        0.1       0.614          0.178   0.1       0.6            0.16
        0.2       0.822          0.126   0.2       0.825          0.137
        0.45      1.1            0.114   0.45      1.09           0.09
SSVD    0.05      4.01           0.002   0.05      4.01           0.002
        0.1       4.01           0.004   0.1       4.02           0.002
        0.2       4.04           0.004   0.2       4.03           0.004
        0.45      4.06           0.005   0.45      4.08           0.004
TSVD    0.05      53.9           6.872   0.05      53.75          6.63
        0.1       53.72          6.63    0.1       53.38          6.71
        0.2       52.33          7.01    0.2       52.2           6.65
        0.45      51.043         2.49    0.45      52.4           4.3

Table 2.1: Comparison of the algorithms. We choose r = 2, c = 2, d_1 = 7, d_2 = 4 in (2.2.3). The noise matrix X is Gaussian. In the table, sparsity is defined as the ratio of the number of non-zero entries to the length of the vector, and we assume that u_i, v_i, i = 1, 2, have the same sparsity. The smallest error norm in each case is attained by SWSVD.
Therefore, the form of the RIE can be written as

Ŝ = argmin_{H ∈ M(Ũ, Ṽ)} ||H − S||_2, (2.2.23)

where M(Ũ, Ṽ) is the class of M×N matrices whose left singular vectors are Ũ and right singular vectors are Ṽ. Suppose Ŝ = Σ_{k=1}^K η_k ũ_k ṽ_k^*, and denote µ_{k1 k} := ⟨u_{k1}, ũ_k⟩, ν_{k1 k} := ⟨v_{k1}, ṽ_k⟩. Then, by an elementary computation, we find

||Ŝ − S||_2^2 = Σ_{k=1}^r (d_k^2 + η_k^2) − 2 Σ_{k=1}^r d_k η_k µ_{kk} ν_{kk} + Σ_{k=r+1}^K η_k^2
        − 2 Σ_{k1 ≠ k2}^r d_{k1} η_{k2} µ_{k1 k2} ν_{k1 k2} − 2 Σ_{k1=r+1}^K Σ_{k2=1}^r η_{k1} d_{k2} µ_{k2 k1} ν_{k2 k1}. (2.2.24)
Therefore, Ŝ is optimal if

η_k = ⟨ũ_k, S ṽ_k⟩ = Σ_{k1=1}^r d_{k1} µ_{k1 k} ν_{k1 k}, k = 1, · · · , K. (2.2.25)

In the present paper, we use the following estimator for η_k and will prove its consistency in Section 2.2. Recalling (2.2.17), the estimator is defined as

η̂_k = d̂_k a_1(d̂_k) a_2(d̂_k) for k ≤ q − 1, and η̂_k = 0 for k ≥ q, (2.2.26)

where d̂_k = p^{-1}(µ_k) and a_1(x), a_2(x) are defined as

a_1(x) = (x^4 − c^{-1}) / (x^2 (x^2 + c^{-1})), a_2(x) = (x^4 − c^{-1}) / (x^2 (x^2 + 1)). (2.2.27)
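Putting (2.2.26)-(2.2.27) together with the inversion of (2.2.9) gives a short, fully explicit estimator; the sketch below (mine; function names and the τ in the detection threshold are illustrative choices) also records the sanity check that a_1, a_2 → 1 as d → ∞, so that strong signals are essentially not shrunk.

```python
import numpy as np

def p_inv(x, c):
    # inverse of the outlier-location map (2.2.9) on d > c^{-1/4}
    b = x - 1 - 1 / c
    return np.sqrt((b + np.sqrt(b ** 2 - 4 / c)) / 2)

def a1(x, c):
    # limiting left singular-vector overlap (2.2.27)
    return (x ** 4 - 1 / c) / (x ** 2 * (x ** 2 + 1 / c))

def a2(x, c):
    # limiting right singular-vector overlap (2.2.27)
    return (x ** 4 - 1 / c) / (x ** 2 * (x ** 2 + 1))

def eta_hat(mu, c, N, tau=0.1):
    # RIE singular values (2.2.26): eta_k = d_k a1(d_k) a2(d_k) for the
    # detected outliers and 0 otherwise; since a1, a2 -> 1 as d -> infinity,
    # eta_hat -> d and strong signals are left almost untouched
    lam_plus = (1 + c ** -0.5) ** 2
    mu = np.asarray(mu, dtype=float)
    out = np.zeros_like(mu)
    keep = mu > lam_plus + N ** (-2 / 3 + tau)
    d = p_inv(mu[keep], c)
    out[keep] = d * a1(d, c) * a2(d, c)
    return out
```

For c = 2 and an outlier at µ = p(2) = 5.625, this returns η̂ = 2 · (15.5/18) · (15.5/20) ≈ 1.335, a nontrivial shrinkage of d = 2.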
Figure 2.1 shows two examples of the estimation of η_k; from the graphs, we find that our estimator η̂_k is quite accurate.

Figure 2.1: RIE. We choose r = 1 and M = 300 for (2.2.3). We estimate η_1 using the estimator (2.2.26) for c = 0.5, 2 with different values of d. The entries of X are Gaussian random variables and the singular vectors satisfy the exponential distribution with rate 1. [Two panels, c = 0.5 and c = 2: estimated versus true value of η_1 as a function of d.]
Figure 2.2 records the relative improvement in average loss (RIAL) compared to TSVD, where the RIAL is defined as

RIAL(N) = 1 − E||Ŝ − S||_2 / E||Ŝ_N − S||_2, (2.2.28)

and where Ŝ_N is the TSVD estimator and Ŝ the RIE. We conclude from the figure that our method provides better estimation compared to the TSVD. Similar results have been shown for the estimation of covariance matrices.

Figure 2.2: RIE compared to TSVD. We choose r = 1, d = 4, c = 2 in (2.2.1). X is a random Gaussian matrix and the entries of the singular vectors satisfy the exponential distribution with rate 1. We perform 1000 Monte-Carlo simulations for each M to simulate the RIAL defined in (2.2.28). The red line indicates the increasing trend as M increases. [Single panel: RIAL versus dimension (50 to 300), increasing from about 0.75 towards 0.95.]
Remark 2.2.2. In [62], Donoho and Gavish obtain similar results from the perspective of optimal shrinkage. However, they need two more assumptions: (1) they drop the last two error terms in (2.2.24) by assuming they are small enough (see Lemma 4 in their paper); (2) their estimators are assumed to be conservative, in that they set η̂_k = 0 for k ≥ q. However, we find that the estimator defined in (2.2.26) can still be consistent even without these assumptions.
Main results In this section, we give the main results of this paper. Throughout the paper, we always use ε_1 for a small constant and D_1 for a large constant. Denote R := {1, 2, · · · , r} and define the subset O of R by

O := {i : d_i ≥ c^{-1/4} + N^{-1/3+ε_0}}, where ε_0 > ε_1 is a small constant, (2.2.29)

and

k_+ := |O|. (2.2.30)
Remark 2.2.3. Our results can be extended to a more general domain by setting O′ := {i : d_i ≥ c^{-1/4} + N^{-1/3}}. The proofs still hold true with some minor changes, except that we need to discuss the case when d_i ∈ (c^{-1/4} + N^{-1/3}, c^{-1/4} + N^{-1/3+ε_0}). We will not pursue this generalization.
For any subset A ⊂ O, we define the projections onto the left and right singular subspaces of S̃ by

P_l := Σ_{i∈A} ũ_i ũ_i^*, P_r := Σ_{j∈A} ṽ_j ṽ_j^*. (2.2.31)

As discussed in (2.2.13), we need the following non-overlapping condition, which was first introduced in [17].

Definition 2.2.4. For i = 1, 2, · · · , M, the non-overlapping condition reads

ν_i(A) ≥ (d_i − c^{-1/4})^{-1/2} N^{-1/2+ε_0}, (2.2.32)

where ε_0 is defined in (2.2.29) and ν_i(A) is defined by

ν_i(A) := min_{j∉A} |d_i − d_j| if i ∈ A, and ν_i(A) := min_{j∈A} |d_i − d_j| if i ∉ A. (2.2.33)
With the above preparation, we state our main results on the singular values of S̃.

Theorem 2.2.5. For i = 1, 2, · · · , k_+, where k_+ is defined in (2.2.30), there exists some large constant C > 1 with Cε_1 < ε_0 such that, when N is large enough, with probability 1 − N^{-D_1} we have
|µi − p(di)| ≤ N−1/2+Cε0(di − c−1/4)1/2, (2.2.34)
where p(di) is defined in (2.2.9). Moreover, for j = k+ + 1, · · · , r, we have
|µj − λ+| ≤ N−2/3+Cε0 , (2.2.35)
where λ+ is defined in (2.2.5).
The above theorem gives the precise locations of the outlier singular values and of the extremal non-outlier singular values. The outliers lie around their classical locations p(d_i), and the non-outliers lie around λ_+. In fact, (2.2.35) can be extended to a more general framework: instead of considering λ_+, we can locate µ_j around the eigenvalues of XX^*, which is the phenomenon of eigenvalue sticking. The results on the singular vectors are given by the following theorem.
Theorem 2.2.6. For all i, j = 1, 2, · · · , r, there exists some constant C > 0 such that, under the assumption (2.3.20), with probability 1 − N^{-D_1}, when N is large enough, we have

|⟨u_i, P_l u_j⟩ − δ_ij 1(i ∈ A) a_1(d_i)| ≤ N^{ε_1} R(i, j, A, N), (2.2.36)

|⟨v_i, P_r v_j⟩ − δ_ij 1(i ∈ A) a_2(d_i)| ≤ N^{ε_1} R(i, j, A, N), (2.2.37)

where a_1(x), a_2(x) are defined in (2.2.27) and R(i, j, A, N) is defined as

R(i, j, A, N) := N^{-1/2} [ 1(i ∈ A, j ∈ A) / ((d_i − c^{-1/4})^{1/2} + (d_j − c^{-1/4})^{1/2}) + 1(i ∈ A, j ∉ A) (d_i − c^{-1/4})^{1/2} / |d_i − d_j| + 1(i ∉ A, j ∈ A) (d_j − c^{-1/4})^{1/2} / |d_i − d_j| ] + N^{-1} ( 1/ν_i + 1(i ∈ A)/|d_i − c^{-1/4}| ) ( 1/ν_j + 1(j ∈ A)/|d_j − c^{-1/4}| ).
Moreover, fix a small constant τ > 0. For k_+ + 1 ≤ j ≤ (1 − τ)K, denote κ_j^d := N^{-2/3}(j ∧ (K + 1 − j))^{2/3}. Then we have

⟨u_i, ũ_j⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)), i = 1, 2, · · · , r, (2.2.38)

and

⟨v_i, ṽ_j⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)), i = 1, 2, · · · , r. (2.2.39)

Furthermore, if c ≠ 1, then (2.2.38) and (2.2.39) hold for all j = k_+ + 1, · · · , M.
Remark 2.2.7. The assumption j ≤ (1 − τ)K ensures that µ_j ≥ δ for some constant δ > 0. When c ≠ 1, this is guaranteed, as we will see from Lemma 2.2.22, by µ_j ≥ (1 − c^{-1/2})^2/2. We need µ_j ≥ δ for the technical purpose of applying the local laws.
Next we give some examples to illustrate our results. We assume that c ≠ 1.

Example 2.2.8. (1) Consider the right singular vectors and let A = {i}. We have

|⟨ṽ_i, v_i⟩^2 − a_2(d_i)| ≤ N^{ε_1} [ 1/(N^{1/2}(d_i − c^{-1/4})^{1/2}) + 1/(N ν_i^2 (d_i − c^{-1/4})^2) ].

This implies that the cone concentration of the singular vector holds if i ∈ O and the non-overlapping condition (2.3.20) holds. Furthermore, if d_i is well-separated from both the critical point c^{-1/4} and the other outliers, the error bound is of order 1/√N.

(2) Let A = {i} and take 1 ≤ j ≠ i ≤ r. We have

⟨ṽ_i, v_j⟩^2 ≤ N^{ε_1} / (N(d_i − d_j)^2).

Hence, if |d_i − d_j| = O(1), then ṽ_i is completely delocalized in any direction orthogonal to v_i.

(3) If i ∈ O, j ∉ O, then we have

⟨ṽ_j, v_i⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)).

Hence, when |d_i − c^{-1/4}| = O(1) or κ_j^d = O(1), ṽ_j is completely delocalized in the direction of v_i. The first case reads as: µ_i is an outlier; the second case as: µ_j is in the bulk of the spectrum of S̃S̃^*.
We now conclude with the consistency of our estimators.

Theorem 2.2.9. For the matrix denoising model (2.2.3), we have:

(1) In Regime (1), with the prior information that u_i, v_i are sparse in the sense of Definition 2.2.1 and that d_i > c^{-1/4} + δ, |d_i − d_j| ≥ δ, i ≠ j, i, j = 1, 2, · · · , k_+, where δ > 0 is a small constant, there exists some C > 0 such that, with probability 1 − o(1), the estimator Ŝ obtained from Algorithm 1 satisfies

||Ŝ − S||_2 ≤ N^{-1/2+Cε_0} + ( Σ_{i=k_++1}^r d_i^2 )^{1/2}.

(2) In Regime (2), recall the rotation invariant estimator defined in (2.2.26). There exist some large constant C > 0 and small constant τ > 0 such that, with probability 1 − N^{-D_1}, we have η̂_k → η_k, k = 1, 2, · · · , K. Furthermore, for 1 ≤ k ≤ (1 − τ)K, we have

|η̂_k − η_k| ≤ 1(k ≤ k_+) N^{-1/2+Cε_0} + 1(k > k_+) N^{-1+Cε_0}. (2.2.40)

Moreover, when c ≠ 1, (2.2.40) holds for all k = 1, · · · , K.
Proof. We first prove (1). Denote S_1 = Σ_{i=1}^{k_+} d_i u_i v_i^* and S_2 = Σ_{i=k_++1}^r d_i u_i v_i^*. We have

||Ŝ − S||_2 ≤ ||Ŝ − S_1||_2 + ( Σ_{i=k_++1}^r d_i^2 )^{1/2}.

It is easy to check that

||Ŝ − S_1||_2^2 ≤ 2 Σ_{i=1}^{k_+} (d̂_i − d_i)^2 + 2 Tr(RR^*), (2.2.41)

where R is defined as R := Σ_{i=1}^{k_+} d_i û_i v̂_i^* − Σ_{i=1}^{k_+} d_i u_i v_i^*. The first term on the right-hand side of (2.2.41) is bounded by N^{-1+Cε_0} using (2.2.34). For the second term, we only need to control Tr((v̂_i − v_i)(v̂_i − v_i)^*) and Tr((û_i − u_i)(û_i − u_i)^*) by the Cauchy-Schwarz inequality. Due to similarity, we only give the proof for the right singular vectors.

Under the sparsity assumption, the non-zero entries of S are confined to a block matrix S_b of some fixed dimension m × n. Denote S̃_b := S_b + X_b. If our algorithm correctly chooses the positions of the non-zero entries of u_i, v_i (i.e. S_b) with probability 1 − o(1), we can conclude our proof using the fact (see [99, Lemma 4.3])

Ṽ_b = V_b + O(||X_b^* X_b + S_b^* X_b + X_b^* S_b||_2),

where Ṽ_b, V_b are the right singular vectors of S̃_b, S_b respectively. Therefore, under the assumption that x_ij has variance 1/N, we have that with probability 1 − o(1), Ṽ_b = V_b + O(N^{-1/2+Cε_0}). This concludes our proof.

It remains to show that (2.2.18) can correctly find the positions of the non-zero entries (i.e. S_b) with probability 1 − o(1), which is summarized in the following lemma.
Lemma 2.2.10. For i = 1, 2, · · · , k_+, denote J_i as the index set of the non-zero entries of v_i. For some constant C > 0, there exists some δ ∈ (Cε_0, 1/2) such that, with probability 1 − o(1), we have

|ṽ_i(k)| ≥ N^{-1/2+δ}, k ∈ J_i; |ṽ_i(k)| ≤ N^{-1/2+Cε_0}, k ∈ J_i^c ∩ {1, · · · , N}.

By Lemma 2.2.10, with probability 1 − o(1) we have max_{k1 ∉ J_i} |ṽ_i(k_1)| ≪ min_{k2 ∈ J_i} |ṽ_i(k_2)|, which implies that Algorithm 1 can correctly recover the sparse structure of the singular vectors. Next we prove (2). The consistency of η̂_k is an immediate consequence of [13, Theorem 2.9]. For the convergence rate, recall (2.2.25); we have

η_k = Σ_{k1=1}^{k_+} d_{k1} µ_{k1 k} ν_{k1 k} + Σ_{k1=k_++1}^r d_{k1} µ_{k1 k} ν_{k1 k}.

Hence, the proof follows from (2.2.34), (2.2.36), (2.2.37), (2.2.38) and (2.2.39).
Notations and basic tools. In this part, we introduce some notation and tools which will be used in this paper. Recall that the empirical spectral distribution (ESD) of an n × n symmetric matrix H is defined as

F_H^{(n)}(λ) := (1/n) Σ_{i=1}^n 1{λ_i(H) ≤ λ}.

We define the typical domain for z = E + iη by

D(τ) ≡ D(τ, N) := {z ∈ C_+ : τ ≤ E ≤ τ^{-1}, N^{-1+τ} ≤ η ≤ τ^{-1}}, (2.2.42)

where τ > 0 is a small constant. Recalling (2.2.4), we assume that τ < c_N < τ^{-1}.

Definition 2.2.11. The Stieltjes transform of the ESD of X^*X is given by

m_2(z) ≡ m_2^{(N)}(z) := ∫ (x − z)^{-1} dF_{X^*X}^{(N)}(x) = (1/N) Σ_{i=1}^N (G_2)_{ii}(z) = (1/N) Tr G_2(z),
where G_2(z) is defined in (2.2.6). Similarly, we define m_1(z) := M^{-1} Tr G_1(z). Denote by m_1c(z) := lim_{N→∞} m_1(z) and m_2c(z) := lim_{N→∞} m_2(z) the Stieltjes transforms of the limiting spectral distributions. Using the identity m_1(z) = −(1 − c_N)/z + c_N m_2(z), we have

m_1c(z) = (c − 1)/z + c m_2c(z). (2.2.43)
Definition 2.2.12. For X satisfying (2.2.2), under the assumption (2.2.4), the ESD of XX^* converges weakly to the Marchenko-Pastur (MP) law as N → ∞:

µ(A) = (1 − c) 1{0 ∈ A} + ν(A) if c < 1, and µ(A) = ν(A) if c ≥ 1,

where dν(x) = ρ_1c(x) dx and

ρ_1c(x) dx = (c/2π) (√((λ_+ − x)(x − λ_−)) / x) dx, λ_± = (1 ± c^{-1/2})^2. (2.2.44)

The Stieltjes transform of the MP law, m_1c(z), has the closed-form expression

m_1c(z) = (1 − c^{-1} − z + i√((λ_+ − z)(z − λ_−))) / (2zc^{-1}). (2.2.45)
Remark 2.2.13. From (2.2.43), we have that m_2(z) converges to m_2c(z) as N → ∞, where

m_2c(z) = (c^{-1} − 1)/z + c^{-1} m_1c(z) = (c^{-1} − 1 − z + i√((λ_+ − z)(z − λ_−))) / (2z). (2.2.46)

It is notable that

−z^{-1}(1 + m_2c(z))^{-1} = m_1c(z). (2.2.47)
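The closed forms (2.2.45)-(2.2.46) and the identity (2.2.47) can be verified numerically against the empirical Stieltjes transform (a sketch of mine; the dimensions, the spectral parameter, and the tolerance are arbitrary choices, and the square-root branch is fixed by requiring a positive imaginary part in the upper half plane):

```python
import numpy as np

def m_pair(z, c):
    # closed forms (2.2.45)-(2.2.46); pick the square-root branch with
    # Im m1c(z) > 0 for z in the upper half plane
    lam_p = (1 + c ** -0.5) ** 2
    lam_m = (1 - c ** -0.5) ** 2
    s = 1j * np.sqrt((lam_p - z) * (z - lam_m))
    for sign in (1.0, -1.0):
        m1 = (1 - 1 / c - z + sign * s) / (2 * z / c)
        if m1.imag > 0:
            return m1, (1 / c - 1 - z + sign * s) / (2 * z)

rng = np.random.default_rng(4)
M, N = 500, 1000
c, z = N / M, 1.5 + 0.1j
X = rng.standard_normal((M, N)) / np.sqrt(N)

m1c, m2c = m_pair(z, c)
m1_emp = np.trace(np.linalg.inv(X @ X.T - z * np.eye(M))) / M   # M^{-1} Tr G1(z)
```

The empirical transform `m1_emp` agrees with `m1c` up to an error of order (Nη)^{-1}, and (2.2.47) holds exactly for the chosen branch.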
Recall (2.2.7) and G(z) = (H − z)^{-1}. By Schur's complement [74], it is easy to check that

G(z) = [[G_1(z), z^{-1/2} G_1(z) X], [z^{-1/2} X^* G_1(z), G_2(z)]], (2.2.48)

for G_{1,2} defined in (2.2.6). Denote the index sets I_1 := {1, . . . , M}, I_2 := {M + 1, . . . , M + N}, I := I_1 ∪ I_2. Then we have

m_1(z) = (1/M) Σ_{i ∈ I_1} G_ii, m_2(z) = (1/N) Σ_{µ ∈ I_2} G_µµ.
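The block structure (2.2.48) is a finite-dimensional identity and can be checked directly (a small sketch of mine; dimensions and the spectral parameter are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 4, 6
X = rng.standard_normal((M, N)) / np.sqrt(N)
z = 0.5 + 0.3j
sz = np.sqrt(z)

# linearized matrix H(z) = [[0, z^{1/2} X], [z^{1/2} X^*, 0]]
H = np.block([[np.zeros((M, M)), sz * X],
              [sz * X.T, np.zeros((N, N))]])
G = np.linalg.inv(H - z * np.eye(M + N))       # Green function of H

G1 = np.linalg.inv(X @ X.T - z * np.eye(M))    # (XX^* - z)^{-1}
G2 = np.linalg.inv(X.T @ X - z * np.eye(N))    # (X^*X - z)^{-1}
```

The top-left block of `G` is `G1`, the bottom-right block is `G2`, and the off-diagonal block is z^{-1/2} G_1 X, exactly as in (2.2.48).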
Similarly, we denote G̃(z) = (H̃ − z)^{-1}, where H̃ is defined in (2.2.7). Next we introduce the spectral decomposition of G̃(z). By (2.2.48), we have

G̃(z) = Σ_{k=1}^K (µ_k − z)^{-1} [[ũ_k ũ_k^*, z^{-1/2} √µ_k ũ_k ṽ_k^*], [z^{-1/2} √µ_k ṽ_k ũ_k^*, ṽ_k ṽ_k^*]]. (2.2.49)
As we have seen in (2.2.9), the function p(d) plays a key role in describing the convergent limits of the outlier singular values of S̃. An elementary computation yields that p(d) attains its global minimum at d = c^{-1/4} with p(c^{-1/4}) = λ_+, and

p′(x) ∼ (x − c^{-1/4}). (2.2.50)

To precisely locate the outlier singular values of S̃, we need to analyze

T^s(x) := Π_{i=1}^s (x m_1c(x) m_2c(x) − d_i^{-2}). (2.2.51)

By (2.2.45) and (2.2.46), when x ≥ λ_+, we have

x m_1c(x) m_2c(x) = (x − (1 + c^{-1}) − √((x + c^{-1} − 1)^2 − 4c^{-1}x)) / (2c^{-1}). (2.2.52)
Next we collect preliminary properties of T^s(x) in the following lemma.

Lemma 2.2.14. Suppose d_1 > d_2 > · · · > d_s > c^{-1/4}. Then T^s(x) = 0 has exactly s solutions, given by p_i := p(d_i), i = 1, 2, · · · , s; that is,

T^s(p_i) = 0. (2.2.53)

Furthermore, denoting

T(x) := x m_1c(x) m_2c(x), (2.2.54)

T(x) is a strictly monotone decreasing function for x > λ_+.
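Lemma 2.2.14 can be checked numerically from the closed form (2.2.52) (my sketch; the test values of d and c are arbitrary): T(p(d)) = d^{-2} for every d > c^{-1/4}, T(λ_+) = c^{1/2} (consistent with (2.2.56) below), and T is decreasing beyond the edge.

```python
import numpy as np

def p(d, c):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def T(x, c):
    # T(x) = x m1c(x) m2c(x) for real x >= lambda_+, formula (2.2.52)
    disc = (x + 1 / c - 1) ** 2 - 4 * x / c
    return (x - (1 + 1 / c) - np.sqrt(disc)) / (2 / c)
```

For example, with c = 2 and d = 2 one has p(d) = 5.625 and T(5.625) = 1/4 = d^{-2}.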
For z ∈ D(τ) defined in (2.2.42), denote

κ := |E − λ_+|. (2.2.55)

By (2.2.52), it is easy to check that

T(z) − c^{1/2} = (z − λ_+ − i√((λ_+ − z)(z − λ_−))) / (2c^{-1}). (2.2.56)

The following lemma summarizes the basic properties of m_2c(z) and T(z); the estimates follow from elementary calculations with (2.2.52) and (2.2.56).

Lemma 2.2.15. For any z ∈ D(τ) defined in (2.2.42), we have

|T(z)| ∼ |m_2c(z)| ∼ 1, |c^{1/2} − T(z)| ∼ |1 − m_2c^2(z)| ∼ √(κ + η),

and, for some small δ > 0,

Im T(z) ∼ Im m_2c(z) ∼ √(κ + η) if E ∈ [λ_+ − δ, λ_+], and ∼ η/√(κ + η) if E > λ_+,

as well as

|Re T(z) − c^{1/2}| ∼ η/√(κ + η) + κ if E ∈ [λ_+ − δ, λ_+], and ∼ √(κ + η) if E > λ_+. (2.2.57)
The next lemma provides a local estimate on the derivative of T(x) on the real axis.

Lemma 2.2.16. For d > c^{-1/4}, denote I_d := [x_−(d), x_+(d)], x_±(d) := p(d) ± N^{-1/2+ε_0}(d − c^{-1/4})^{1/2}, where ε_0 is defined in (2.2.29). Then for all x ∈ I_d, we have

T′(x) ∼ (d − c^{-1/4})^{-1}. (2.2.58)
The following perturbation identity plays a key role in our proof, as it naturally allows us to incorporate the Green functions into a deterministic equation.

Lemma 2.2.17. Recall (2.2.7). Assume µ ∈ R \ σ(H) and det 𝒟 ≠ 0. Then µ ∈ σ(H̃) if and only if

det(𝒰^* G(µ) 𝒰 + 𝒟^{-1}) = 0. (2.2.59)
The following lemma establishes the connection between the Green functions of H̃ and H defined in (2.2.7).

Lemma 2.2.18. For z ∈ C_+, we have

G̃(z) = G(z) − G(z) 𝒰 (𝒟^{-1} + 𝒰^* G(z) 𝒰)^{-1} 𝒰^* G(z), (2.2.60)

and

𝒰^* G̃(z) 𝒰 = 𝒟^{-1} − 𝒟^{-1} (𝒟^{-1} + 𝒰^* G(z) 𝒰)^{-1} 𝒟^{-1}. (2.2.61)
One of the key ingredients of our computation is the local laws. Denote

Ψ(z) := √(Im m_2c(z)/(Nη)) + 1/(Nη), Σ := [[z^{-1/2} I, 0], [0, I]], (2.2.62)

and let m(z) ≡ m_N(z) be the unique solution of the equation

f(m(z)) = z, Im m(z) ≥ 0, f(x) = −1/x + (1/c_N) · 1/(x + 1).

Recalling (2.2.48), the following lemma shows that G(z) converges to a deterministic matrix Π(z) with high probability.

Lemma 2.2.19. Fix τ > ε_1. Then for all z ∈ D(τ), with probability 1 − N^{-D_1}, for any deterministic unit vectors u, v ∈ R^{M+N}, we have

|⟨u, Σ^{-1}(G(z) − Π(z))Σ^{-1} v⟩| ≤ N^{ε_1} Ψ(z), |m_2(z) − m(z)| ≤ N^{ε_1}/(Nη), (2.2.63)

where Π(z) is defined as

Π(z) := [[−z^{-1}(1 + m(z))^{-1} I, 0], [0, m(z) I]]. (2.2.64)
It is notable that, in general, m(z) depends on N, and Lemma 2.2.15 also holds for m(z). However, in our computation we can replace m(z) with m_2c(z) due to the following local MP law.

Lemma 2.2.20. Fix τ > ε_1. Then for all z ∈ D(τ), with probability 1 − N^{-D_1}, we have

|m_2(z) − m_2c(z)| ≤ N^{ε_1} Ψ(z).
Beyond the support of the limiting spectrum of the MP law, we have stronger results all the way down to the real axis. More precisely, define the region

D(τ, ε_1) := {z ∈ C_+ : λ_+ + N^{-2/3+ε_1} ≤ E ≤ τ^{-1}, 0 < η ≤ τ^{-1}}, (2.2.65)

on which we have the following stronger control.

Lemma 2.2.21. For z ∈ D(τ, ε_1), with probability 1 − N^{-D_1}, we have

|⟨u, G_2(z) v⟩ − m_2c(z)⟨u, v⟩| ≤ N^{-1/2+ε_1}(κ + η)^{-1/4}

for all unit vectors u, v ∈ R^N. A similar result holds for G_1(z), m_1c(z). Furthermore, for any deterministic vectors u, v ∈ R^{M+N}, we have

|⟨u, Σ^{-1}(G(z) − Π(z))Σ^{-1} v⟩| ≤ N^{-1/2+ε_1}(κ + η)^{-1/4}. (2.2.66)
Denote the non-trivial classical eigenvalue locations γ_1 ≥ γ_2 ≥ · · · ≥ γ_K of XX^* by ∫_{γ_i}^∞ dρ_1c = i/N, where ρ_1c is defined in (2.2.44). A consequence of Lemma 2.3.27 is the rigidity of eigenvalues.

Lemma 2.2.22. Fix any small τ ∈ (0, 1). For 1 ≤ i ≤ (1 − τ)K, with probability 1 − N^{-D_1}, we have

|λ_i − γ_i| ≤ N^{-2/3+ε_1} (i ∧ (K + 1 − i))^{-1/3}.

Furthermore, if c ≠ 1, the above estimate holds for all i = 1, 2, · · · , K.

Using Lemma 2.2.22, we find that κ_j^d defined in (2.2.38) is a deterministic version of κ_j^µ := |µ_j − λ_+|.
Proofs of Theorems 2.2.5 and 2.2.6 In this part, we focus on the singular values of S̃ and prove Theorem 2.2.5. A key deviation from the existing proofs is that our matrix 𝒟 defined in (2.2.8) is not diagonal: to analyze (2.2.59), it no longer suffices to deal with the diagonal elements, and we need to control the whole matrix. We will make use of the following interlacing theorem for the singular values of rectangular matrices [104].

Lemma 2.2.23. For any M × N matrices A, B, denote σ_i(A) as the i-th largest singular value of A. Then we have

σ_{i+j−1}(A + B) ≤ σ_i(A) + σ_j(B), 1 ≤ i, j, i + j − 1 ≤ K.
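This Weyl-type inequality is simple to verify on random instances (a sketch of mine; dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 7, 9
A = rng.standard_normal((M, N))
B = rng.standard_normal((M, N))

sA = np.linalg.svd(A, compute_uv=False)        # descending order
sB = np.linalg.svd(B, compute_uv=False)
sAB = np.linalg.svd(A + B, compute_uv=False)

K = min(M, N)
# Weyl-type inequality: sigma_{i+j-1}(A+B) <= sigma_i(A) + sigma_j(B)
holds = all(sAB[i + j - 2] <= sA[i - 1] + sB[j - 1] + 1e-10
            for i in range(1, K + 1) for j in range(1, K + 1)
            if i + j - 1 <= K)
```

Applied with B of rank r, the inequality shows that adding a finite-rank perturbation can push at most r singular values of A out of position, which is how it is used below.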
The proof relies on two main steps: (i) for a fixed configuration independent of N, establish two permissible regions, Γ(d) consisting of k_+ components and I_0, such that the outliers of S̃S̃^* lie in Γ(d), each component of Γ(d) contains precisely one eigenvalue, and the r − k_+ non-outliers lie in I_0; (ii) a continuity argument extending the result of (i) to arbitrary N-dependent D.

The following 2r × 2r matrix plays the key role in our analysis:

M_r(x) := 𝒰^* G(x) 𝒰 + 𝒟^{-1}. (2.2.67)

By Lemma 2.2.17, x ∈ σ(S̃S̃^*) if and only if det M_r(x) = 0. Using Lemmas 2.2.20 and 2.2.21, we find that x^{-r} T^r(x) ≈ det M_r(x), where T^r(x) is defined in (2.2.51). As T^r(x) behaves differently in Γ(d) and I_0, we will use different strategies to prove (2.2.34) and (2.2.35).
Proof of Theorem 2.2.5. Denote k_0 := r − k_+ and write

d = (d_1, · · · , d_r) = (d^0, d^+), d^σ = (d^σ_1, · · · , d^σ_{k_σ}), σ = 0, +,

where we adopt the convention

d^0_{k_0} ≤ · · · ≤ d^0_1 ≤ c^{-1/4} < d^+_{k_+} ≤ · · · ≤ d^+_1, k_0 + k_+ = r.

Next we define the sets

D^+(ε_0) := {d^+ : c^{-1/4} + N^{-1/3+ε_0} ≤ d^+_i ≤ τ^{-1}, i = 1, · · · , k_+}, (2.2.68)

D^0(ε_0) := {d^0 : 0 < d^0_i < c^{-1/4} + N^{-1/3+ε_0}, i = 1, · · · , k_0}, (2.2.69)

and the set of allowed d's, D(ε_0) := {(d^0, d^+) : d^σ ∈ D^σ(ε_0), σ = +, 0}.
Denote the following sequence of intervals:

I^+_i(d) := [p(d^+_i) − N^{-1/2+ε_3}(d^+_i − c^{-1/4})^{1/2}, p(d^+_i) + N^{-1/2+ε_3}(d^+_i − c^{-1/4})^{1/2}], (2.2.70)

where ε_3 satisfies the following condition:

Cε_1 < ε_3 < (1/4)ε_0, C > 2 is some large constant. (2.2.71)

For d ∈ D(ε_0), we denote Γ(d) := ∪_{i=1}^{k_+} I^+_i(d) and I_0 := [λ_+ − N^{-2/3+C′ε_0}, λ_+ + N^{-2/3+C′ε_0}], where C′ satisfies 2 < C′ < 4.
As a first step, we show that Γ(d) is a permissible region which keeps track of the outlier eigenvalues of S̃S̃^*, and that the remaining eigenvalues corresponding to D^0(ε_0) lie in I_0. In this step we fix a configuration d(0) ≡ d that is independent of N.

Lemma 2.2.24. For any d ∈ D(ε_0), with probability 1 − N^{-D_1}, we have

σ_+(S̃S̃^*) ⊂ Γ(d), (2.2.72)

where σ_+(S̃S̃^*) is the set of the outlier eigenvalues of S̃S̃^* associated with D^+(ε_0). Moreover, each interval I^+_i(d) contains precisely one eigenvalue of S̃S̃^*, i = 1, 2, · · · , k_+. Furthermore, we have

σ_o(S̃S̃^*) ⊂ I_0, (2.2.73)

where σ_o(S̃S̃^*) is the set of the non-outlier eigenvalues corresponding to D^0(ε_0).
Proof. First, it is easy to check that Γ(d) ∩ I_0 = ∅ using (2.2.50) and the fact that C′ > 2. Denote S_b := p(d^+_{k_+}) − N^{-1/2+ε_3}(d^+_{k_+} − c^{-1/4})^{1/2}. In order to prove (2.2.72), we first consider the case x > S_b. It is notable that x ∉ σ(XX^*) by Lemma 2.2.22, (2.2.50) and (2.2.71). Recalling (2.2.64) and (2.2.67), using the fact that r is bounded and Lemma 2.2.21, with probability 1 − N^{-D_1} we have

M_r(x) = 𝒰^* Π(x) 𝒰 + 𝒟^{-1} + O(N^{-1/2+ε_1} κ^{-1/4}). (2.2.74)

It is well-known that if λ ∈ σ(A + B) then dist(λ, σ(A)) ≤ ||B||; therefore, we have µ_i(S̃S̃^*) ≤ τ^{-1}, i = 1, · · · , K, for τ > 0 defined in (2.2.42). Recalling (2.2.51), by (2.2.50), (2.2.58) and (2.2.71), with probability 1 − N^{-D_1} we have

|T^r(x)| ≥ N^{-1/2+(C−1)ε_1} κ^{-1/4}, if x ∈ [S_b, τ^{-1}] \ Γ(d). (2.2.75)

Using the formula

det [[x I_r, diag(α_1, · · · , α_r)], [diag(α_1, · · · , α_r), y I_r]] = Π_{i=1}^r (xy − α_i^2),

Lemma 2.2.20, (2.2.47) and (2.2.74), we conclude that

det(𝒟^{-1} + 𝒰^* Π(x) 𝒰) = x^{-r} T^r(x) + O(N^{-1/2+ε_1} κ^{-1/4}). (2.2.76)
By (2.2.75) and (2.2.76), we conclude that \(M_r(x)\) is non-singular when \(x \in [S_b, \tau^{-1}] \setminus \Gamma(\mathbf{d})\). Next we use Rouché's theorem to show that, inside the permissible region, each interval \(I_i^+(\mathbf{d})\) contains precisely one eigenvalue of \(SS^*\). Let \(i \in \{1, \cdots, k_+\}\) and pick a small \(N\)-independent counterclockwise (positively oriented) contour \(\mathcal{C} \subset \mathbb{C} \setminus [(1-c^{-1/2})^2, (1+c^{-1/2})^2]\) that encloses \(p(d_i^+)\) but no other \(p(d_j^+)\), \(j \ne i\). For large enough \(N\), define \(f(z) := \det(M_r(z))\), \(g(z) := \det(T_r(z))\). By the definition of the determinant, the functions \(g, f\) are holomorphic on and inside \(\mathcal{C}\), and \(g(z)\) has precisely one zero \(z = p(d_i^+)\) inside \(\mathcal{C}\). On \(\mathcal{C}\), it is easy to check that
\[
\min_{z \in \mathcal{C}} |g(z)| \ge c > 0, \quad |g(z) - f(z)| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},
\]
where we use (2.2.74) and Lemma 2.2.20. Hence, \(f(z)\) has only one zero in \(I_i^+(\mathbf{d})\) by Rouché's theorem. This concludes the proof of (2.2.72) using Lemma 2.2.17. In order to prove (2.2.73), using the following fact: for any two \(M \times N\) rectangular matrices \(A, B\), we have \(\sigma_i(A+B) \ge \sigma_i(A) + \sigma_K(B)\), \(i = 1, \cdots, K\), together with Lemma 2.2.22, we find that
\[
\mu_i \ge \lambda_+ - N^{-2/3+C'\varepsilon_0}, \quad i = k_+ + 1, \cdots, r. \tag{2.2.77}
\]
For the non-outliers, we may assume that \(S_b > \lambda_+ + N^{-2/3+C'\varepsilon_0}\); otherwise the proof is already done. Now assume \(x \notin I_0\); by (2.2.72) and (2.2.77), we only need to discuss the case \(x \in (\lambda_+ + N^{-2/3+C'\varepsilon_0}, S_b)\). In this case, we prove that \(M_r(x)\) is non-singular by comparing it with \(M_r(z)\), where \(z = x + iN^{-2/3-\varepsilon_4}\) and \(\varepsilon_4 < \varepsilon_1\) is some small positive constant. Denote the spectral decomposition of \(G(z)\) as
\[
G(z) = \sum_{\alpha} \frac{1}{\lambda_\alpha - z}\,\mathbf{g}_\alpha \mathbf{g}_\alpha^*, \quad \mathbf{g}_\alpha \in \mathbb{R}^{M+N}.
\]
Denote by \(u_i\), \(i = 1, \cdots, 2r\), the \(i\)-th column of \(U\) defined in (2.2.8) and abbreviate \(u_i^*G(z)u_j\) as \(G_{u_iu_j}(z)\). Setting \(\eta := N^{-2/3-\varepsilon_4}\), using the spectral decomposition and the fact that \(x > \lambda_+ + N^{-2/3+C'\varepsilon_0}\), we have
\[
|G_{u_iu_j}(x) - G_{u_iu_j}(x+i\eta)| \le \mathrm{Im}\, G_{u_iu_i}(x+i\eta) + \mathrm{Im}\, G_{u_ju_j}(x+i\eta).
\]
Therefore, by Lemmas 2.2.20 and 2.2.21, with probability \(1 - N^{-D_1}\) we have
\[
M_r(x) = M_r(z) + O\Big(N^{\varepsilon_1}\Big(\mathrm{Im}\, m_{2c}(z) + \sqrt{\frac{\mathrm{Im}\, m_{2c}(z)}{N\eta}}\Big)\Big).
\]
Using Lemma 2.2.15 and a discussion similar to (2.2.75), we have
\[
M_r(x) = T_r(z) + O\big(N^{-1/3}(N^{-C'\varepsilon_0/4} + N^{\varepsilon_1 - C'\varepsilon_0/4})\big).
\]
By Lemmas 2.2.15 and 2.2.20, we find that \(|T_r(z)| \ge N^{-1/3+C'\varepsilon_0/2}\), where we use the assumption that \(x > \lambda_+ + N^{-2/3+C'\varepsilon_0}\). Therefore, \(M_r(x)\) is non-singular, as we have assumed \(2 < C' < 4\). This concludes the proof of (2.2.73).
In the second step, we extend the proof to any configuration \(\mathbf{d}(1)\) depending on \(N\) via a continuity argument. This is done by a bootstrap argument, choosing a continuous path connecting \(\mathbf{d}(0)\) and \(\mathbf{d}(1)\). We record it as the following lemma.

Lemma 2.2.25. For any \(N\)-dependent configuration \(\mathbf{d}(1) \in \mathcal{D}(\varepsilon_0)\), (2.2.34) and (2.2.35) hold true.
Singular vectors. In this section, we focus on the local behavior of the singular vectors. We first deal with the outlier singular vectors and then the non-outlier ones. Due to similarity, we only prove (2.2.37) and (2.2.39); (2.2.36) and (2.2.38) can be handled similarly.

Proof of (2.2.37). Note that, by Lemma 2.2.21 and Theorem 2.2.5, for \(i \in \mathcal{O}\) there exists a constant \(C > 0\) such that, for \(N\) large enough, with probability \(1 - N^{-D_1}\) we can choose an event \(\Xi\) such that for all \(z \in D(\tau, \varepsilon_1)\) defined in (2.2.65),
\[
\mathbf{1}(\Xi)\,|(V^*G_2(z)V)_{ij} - m_{2c}(z)\delta_{ij}| \le (\kappa + \eta)^{-1/4}N^{-1/2+C\varepsilon_1}. \tag{2.2.78}
\]
Next we restrict our discussion to the event \(\Xi\). Recall (2.2.33); for \(A \subset \mathcal{O}\), we define for each \(i \in A\) the radius
\[
\rho_i := \frac{\nu_i \wedge (d_i - c^{-1/4})}{2}. \tag{2.2.79}
\]
Under the assumption of (2.3.20), we have
\[
\rho_i \ge \frac{1}{2}(d_i - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_0}. \tag{2.2.80}
\]
We define the contour \(\Gamma := \partial\Upsilon\) as the boundary of the union of discs \(\Upsilon := \cup_{i \in A} B_{\rho_i}(d_i)\), where \(B_\rho(d)\) is the open disc of radius \(\rho\) around \(d\). We summarize the basic properties of \(\Upsilon\) in the following lemma.
Lemma 2.2.26. Recall (2.2.9) and (2.2.65); we have \(p(\Upsilon) \subset D(\tau, \varepsilon_1)\). Moreover, each outlier \(\{\mu_i\}_{i \in A}\) lies in \(p(\Upsilon)\), and all the other eigenvalues of \(SS^*\) lie in the complement of \(p(\Upsilon)\).
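The contour-integral representation used in this proof (the spectral projector written as a resolvent integral, as in (2.2.82) below) can be illustrated on a toy example. The sketch uses an arbitrary small symmetric matrix in place of the model and discretizes the circle with the trapezoidal rule; it is an illustration of the identity, not of the random matrix argument:

```python
import numpy as np

# Toy illustration of P = -(1/2*pi*i) * contour_integral (A - z)^{-1} dz over a
# circle enclosing exactly one eigenvalue: the integral recovers the projector.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
A = (A + A.T) / 2
vals, vecs = np.linalg.eigh(A)

k = 3                                     # target eigenvalue index (arbitrary)
gap = np.min(np.abs(np.delete(vals, k) - vals[k]))
radius = gap / 2                          # circle enclosing only vals[k]

theta = np.linspace(0, 2 * np.pi, 2001)[:-1]
z = vals[k] + radius * np.exp(1j * theta)
dz = 1j * radius * np.exp(1j * theta) * (2 * np.pi / len(theta))

P = np.zeros((6, 6), dtype=complex)
for zj, dzj in zip(z, dz):
    P += -np.linalg.inv(A - zj * np.eye(6)) * dzj / (2j * np.pi)

proj = np.outer(vecs[:, k], vecs[:, k])   # rank-one spectral projector
assert np.linalg.norm(P.real - proj) < 1e-8
assert np.linalg.norm(P.imag) < 1e-8
```

Because the integrand is analytic on the circle, the periodic trapezoidal rule converges extremely fast, which is why a plain Riemann-type sum suffices here.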
Armed with the above results, we now start the proof of the outlier singular vectors.
Our starting point is an integral representation of the singular vectors. By (2.2.48), we have
\[
\bar v_i^* G_2 \bar v_j = \bar v_i^* G \bar v_j, \tag{2.2.81}
\]
where \(\bar v_i \in \mathbb{R}^{M+N}\) is the natural embedding of \(v_i\) with \(\bar v_i = (0, v_i)^*\). Recall (2.3.29); using the spectral decomposition of \(G_2(z)\), Lemma 2.2.26 and Cauchy's integral formula, we have
\[
P_r = -\frac{1}{2\pi i}\oint_{p(\Gamma)} G_2(z)\,dz = -\frac{1}{2\pi i}\oint_{\Gamma} G_2(p(\zeta))p'(\zeta)\,d\zeta. \tag{2.2.82}
\]
By Lemma 2.2.18, Cauchy's integral formula, (2.2.81) and (2.2.82), we have
\[
\langle \bar v_i, P_r \bar v_j\rangle = \frac{1}{2d_id_j\pi i}\oint_{p(\Gamma)} \big(D^{-1} + U^*G(z)U\big)^{-1}_{\bar i\bar j}\,\frac{dz}{z}, \tag{2.2.83}
\]
where \(\bar i, \bar j\) are defined as \(\bar i := r + i\), \(\bar j := r + j\). Recall (2.2.64); as \(D^{-1} + U^*\Pi(z)U\) is of finite dimension, by Lemmas 2.2.20, 2.2.21, (2.2.47) and (2.2.78), we may take \(\Pi(z)\) as
\[
\Pi(z) := \begin{pmatrix} m_{1c}(z) & 0 \\ 0 & m_{2c}(z) \end{pmatrix}.
\]
Next we decompose \(D^{-1} + U^*G(z)U\) as
\[
D^{-1} + U^*G(z)U = D^{-1} + U^*\Pi(z)U - \Delta(z), \quad \Delta(z) = U^*\Pi(z)U - U^*G(z)U. \tag{2.2.84}
\]
Note that \(\Delta(z)\) can be controlled by Lemmas 2.2.20 and 2.2.21. Using the resolvent expansion to first order on (2.2.84), we have
\[
\langle \bar v_i, P_r \bar v_j\rangle = \frac{1}{d_id_j}\big(S^{(0)} + S^{(1)} + S^{(2)}\big), \tag{2.2.85}
\]
where
\[
S^{(0)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big(\frac{1}{D^{-1} + U^*\Pi(z)U}\Big)_{\bar i\bar j}\frac{dz}{z},
\]
\[
S^{(1)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big[\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*\Pi(z)U}\Big]_{\bar i\bar j}\frac{dz}{z},
\]
\[
S^{(2)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big[\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*G(z)U}\Big]_{\bar i\bar j}\frac{dz}{z}.
\]
By an elementary computation, we have
\[
\big(D^{-1} + U^*\Pi(z)U\big)^{-1}_{ij} =
\begin{cases}
\dfrac{\delta_{ij}\,zm_{2c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & 1 \le i, j \le r;\\[1.2ex]
\dfrac{\delta_{ij}\,zm_{1c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & r \le i, j \le 2r;\\[1.2ex]
\dfrac{\delta_{\bar i j}\,(-1)^{i+j}z^{1/2}d_i^{-1}}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & 1 \le i \le r,\ r \le j \le 2r;\\[1.2ex]
\dfrac{\delta_{i\bar j}\,(-1)^{i+j}z^{1/2}d_j^{-1}}{zm_{1c}(z)m_{2c}(z) - d_j^{-2}}, & r \le i \le 2r,\ 1 \le j \le r.
\end{cases} \tag{2.2.86}
\]
Using the fact that \(p_i\,m_{1c}(p_i)m_{2c}(p_i) = \frac{1}{d_i^2}\) and the residue theorem, we have
\[
S^{(0)} = \delta_{ij}\frac{m_{2c}(p_i)}{T'(p_i)} = \delta_{ij}\frac{d_i^4 - c^{-1}}{d_i^2 + 1}. \tag{2.2.87}
\]
Next we control the term \(S^{(1)}\). Applying (2.2.86) to \(S^{(1)}\), we have
\[
S^{(1)} = \frac{1}{2\pi i}\oint_{p(\Gamma)} \frac{f(z)}{(zm_{1c}(z)m_{2c}(z) - d_i^{-2})(zm_{1c}(z)m_{2c}(z) - d_j^{-2})}\,dz, \tag{2.2.88}
\]
where \(f(z) = f_1(z) + f_2(z)\) and \(f_{1,2}(z)\) are defined as
\[
f_1(z) := m_{2c}(z)\big[zm_{2c}(z)\Delta(z)_{\bar i\bar j} + (-1)^{i+\bar i}z^{1/2}d_i^{-1}\Delta(z)_{i\bar j}\big],
\]
\[
f_2(z) := d_j^{-1}\big[(-1)^{j+\bar j}z^{1/2}m_{2c}(z)\Delta(z)_{\bar i j} + (-1)^{i+j+\bar i+\bar j}d_i^{-1}\Delta(z)_{ij}\big].
\]
We now use the change of variables as in (2.2.82) and rewrite \(S^{(1)}\) as
\[
S^{(1)} = \frac{1}{2\pi i}\oint_{\Gamma} \frac{f(p(\zeta))\,p'(\zeta)}{(\zeta^{-2} - d_i^{-2})(\zeta^{-2} - d_j^{-2})}\,d\zeta = d_i^2d_j^2\,\frac{1}{2\pi i}\oint_{\Gamma} \frac{f(p(\zeta))\zeta^4}{(d_i^2 - \zeta^2)(d_j^2 - \zeta^2)}\,p'(\zeta)\,d\zeta,
\]
where we use the fact that \(p(\zeta)m_{1c}(p(\zeta))m_{2c}(p(\zeta)) = \zeta^{-2}\). By (2.2.50), Lemmas 2.2.15 and 2.2.21, we conclude that
\[
|f(p(\zeta))p'(\zeta)\zeta^4| \le (\zeta - c^{-1/4})^{1/2}N^{-1/2+\varepsilon_1}. \tag{2.2.89}
\]
Denote
\[
f_{ij}(\zeta) = \frac{f(p(\zeta))p'(\zeta)\zeta^4}{(d_i + \zeta)(d_j + \zeta)}.
\]
As \(f_{ij}\) is holomorphic inside the contour \(\Gamma\), by Cauchy's differentiation formula we have
\[
f'_{ij}(\zeta) = \frac{1}{2\pi i}\oint_{\mathcal{C}} \frac{f_{ij}(\xi)}{(\xi - \zeta)^2}\,d\xi, \tag{2.2.90}
\]
where the contour \(\mathcal{C}\) is the circle of radius \(\frac{|\zeta - c^{-1/4}|}{2}\) centered at \(\zeta\). Hence, by (2.2.50), (2.2.89), (2.2.90) and the residue theorem, we have
\[
|f'_{ij}(\zeta)| \le (\zeta - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_1}. \tag{2.2.91}
\]
In order to estimate \(S^{(1)}\), we consider the following three cases: (i) \(i, j \in A\); (ii) \(i \in A, j \notin A\) (or \(i \notin A, j \in A\)); (iii) \(i, j \notin A\). By the residue theorem, \(S^{(1)} = 0\) in case (iii). Hence, we only need to consider cases (i) and (ii). In case (i), when \(i \ne j\), by the residue theorem and (2.2.91) we have
\[
|S^{(1)}| = d_i^2d_j^2\left|\frac{f_{ij}(d_i) - f_{ij}(d_j)}{d_i - d_j}\right| \le \frac{d_i^2d_j^2}{|d_i - d_j|}\left|\int_{d_i}^{d_j} |f'_{ij}(t)|\,dt\right| \le \frac{d_i^2d_j^2\,N^{-1/2+\varepsilon_1}}{(d_i - c^{-1/4})^{1/2} + (d_j - c^{-1/4})^{1/2}}.
\]
When \(i = j\), by the residue theorem we have \(|S^{(1)}| \le d_i^4(d_i - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_1}\). In case (ii), when \(i \in A, j \notin A\), by the residue theorem and (2.2.78) we have
\[
|S^{(1)}| = \left|\frac{d_i^2d_j^2\,f_{ij}(d_i)}{d_i - d_j}\right| \le \frac{d_i^2d_j^2(d_i - c^{-1/4})^{1/2}}{|d_i - d_j|}N^{-1/2+\varepsilon_1}.
\]
We get similar results when \(i \notin A, j \in A\). Putting all the cases together, we find that
\[
|S^{(1)}| \le N^{-1/2+\varepsilon_1}\Big[\mathbf{1}(i \in A, j \in A)\frac{d_i^2d_j^2}{(d_i - c^{-1/4})^{1/2} + (d_j - c^{-1/4})^{1/2}} + \mathbf{1}(i \in A, j \notin A)\frac{d_i^2d_j^2(d_i - c^{-1/4})^{1/2}}{|d_i - d_j|} + \mathbf{1}(i \notin A, j \in A)\frac{d_i^2d_j^2(d_j - c^{-1/4})^{1/2}}{|d_i - d_j|}\Big]. \tag{2.2.92}
\]
Finally, we need to estimate \(S^{(2)}\). Here the residue calculations cannot be applied directly, as \(U^*G(z)U\) is not necessarily diagonal and a relation comparable to \(p(\zeta)m_{1c}(p(\zeta))m_{2c}(p(\zeta)) = \zeta^{-2}\) does not exist. Instead, we need to choose the contour \(\Gamma\) precisely. We record the result as the following lemma.
Lemma 2.2.27. When \(N\) is large enough, with probability \(1 - N^{-D_1}\), for some constant \(C > 0\) we have
\[
|S^{(2)}| \le CN^{-1+2\varepsilon_1}\Big(\frac{1}{\nu_i} + \frac{\mathbf{1}(i \in A)}{|d_i - c^{-1/4}|}\Big)\Big(\frac{1}{\nu_j} + \frac{\mathbf{1}(j \in A)}{|d_j - c^{-1/4}|}\Big). \tag{2.2.93}
\]
Therefore, plugging (2.2.87), (2.2.92) and (2.2.93) into (2.2.85), we conclude the proof of (2.2.37). Before concluding this subsection, we briefly discuss the proof of (2.2.36). By Lemma 2.2.18 and Cauchy's integral formula, we have
\[
\langle \bar u_i, P_l \bar u_j\rangle = \frac{1}{2d_id_j\pi i}\oint_{p(\Gamma)} \big(D^{-1} + U^*G(z)U\big)^{-1}_{ij}\,\frac{dz}{z}.
\]
Then we can use a discussion similar to (2.2.85), computing the convergent limit from \(S^{(0)}\) and controlling the bounds for \(S^{(1)}\) and \(S^{(2)}\). We remark that the convergent limit is different because we use \((D^{-1} + U^*\Pi(z)U)^{-1}_{ij}\), \(r \le i, j \le 2r\), in (2.2.86), which results in
\[
S^{(0)} = \delta_{ij}\frac{m_{1c}(p_i)}{T'(p_i)} = \delta_{ij}\frac{d_i^4 - c^{-1}}{d_i^2 + c^{-1}}.
\]
This concludes the proof of (2.2.36).
For the non-outliers, the proof strategy for the outlier singular vectors no longer works, as we cannot use the residue theorem. We instead use a spectral decomposition.

Proof of (2.2.39). Denote
\[
z = \mu_j + i\eta, \tag{2.2.94}
\]
where \(\eta\) is defined as the smallest solution of
\[
\mathrm{Im}\, m_{2c}(z) = N^{-1+6\varepsilon_1}\eta^{-1}. \tag{2.2.95}
\]
As we assume \(j \le (1-\tau)K\) or \(c \ne 1\), we conclude that \(|z|\) has a constant lower bound. Therefore, by Lemmas 2.3.27, 2.2.20 and 2.2.21, with probability \(1 - N^{-D_1}\) we have
\[
|\langle u, \Sigma^{-1}(G(z) - \Pi(z))\Sigma^{-1}v\rangle| \le \frac{N^{4\varepsilon_1}}{N\eta}. \tag{2.2.96}
\]
Recall (2.2.55); abbreviating \(\kappa = |\mu_j - \lambda_+|\), by Lemma 2.2.15 and (2.2.35) we find that
\[
\eta \sim
\begin{cases}
\dfrac{N^{6\varepsilon_1}}{N\sqrt{\kappa} + N^{2/3+2\varepsilon_1}}, & \text{if } \mu_j \le \lambda_+ + N^{-2/3+4\varepsilon_1},\\[1.2ex]
N^{-1/2+3\varepsilon_1}\kappa^{1/4}, & \text{if } \mu_j \ge \lambda_+ + N^{-2/3+4\varepsilon_1}.
\end{cases} \tag{2.2.97}
\]
For \(z\) defined in (2.2.94), by the spectral decomposition we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \eta\,\langle \bar v_i, \mathrm{Im}\, G_2(z)\bar v_i\rangle = \eta\,\langle \bar v_i, \mathrm{Im}\, G(z)\bar v_i\rangle, \tag{2.2.98}
\]
where \(\tilde v_j\) denotes the sample singular vector associated with \(\mu_j\) and \(\bar v_i \in \mathbb{R}^{M+N}\) is the natural embedding of \(v_i\). By Lemma 2.2.18, we have
\[
\langle \bar v_i, G(z)\bar v_i\rangle = -\frac{1}{zd_i^2}\big(D^{-1} + U^*G(z)U\big)^{-1}_{\bar i\bar i}.
\]
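The first inequality in (2.2.98) above is a deterministic consequence of the spectral decomposition: for any symmetric \(H\) with eigenpair \((\mu_j, u_j)\), any unit vector \(v\) and any \(\eta > 0\), one has \(\langle v, u_j\rangle^2 \le \eta\,\mathrm{Im}\,\langle v, (H - \mu_j - i\eta)^{-1}v\rangle\). A minimal numeric sketch (toy matrix, unrelated to the model of this section):

```python
import numpy as np

# Spectral-decomposition bound: <v,u_j>^2 <= eta * Im <v, (H - z)^{-1} v>
# at z = mu_j + i*eta, since Im 1/(mu_k - z) = eta / ((mu_k - mu_j)^2 + eta^2).
rng = np.random.default_rng(2)
n = 20
H = rng.normal(size=(n, n)); H = (H + H.T) / 2
mu, U = np.linalg.eigh(H)
v = rng.normal(size=n); v /= np.linalg.norm(v)

eta = 1e-3
for j in range(n):
    z = mu[j] + 1j * eta
    G_vv = v @ np.linalg.solve(H - z * np.eye(n), v)
    assert (v @ U[:, j])**2 <= eta * G_vv.imag + 1e-12
```

The inequality is exact (no randomness is involved), since the \(k = j\) term of \(\mathrm{Im}\,\langle v, Gv\rangle\) alone already equals \(\langle v, u_j\rangle^2/\eta\).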
Similar to (2.2.85), using a simple resolvent expansion and (2.2.86), we have
\[
\langle \bar v_i, G(z)\bar v_i\rangle = -\frac{1}{zd_i^2}\Big[\frac{zm_{2c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}} + \frac{zf(z)}{(zm_{1c}(z)m_{2c}(z) - d_i^{-2})^2} + \Big(\big[(D^{-1} + U^*\Pi(z)U)^{-1}\Delta(z)\big]^2\big(D^{-1} + U^*G(z)U\big)^{-1}\Big)_{\bar i\bar i}\Big], \tag{2.2.99}
\]
where \(f(z)\) is defined in (2.2.88). To estimate the right-hand side of (2.2.99), we use the following error estimate:
\[
\min_j |d_j^{-2} - T(z)| \ge \mathrm{Im}\, T(z) \sim \mathrm{Im}\, m_{2c}(z) = \frac{N^{6\varepsilon_1}}{N\eta} \gg \frac{N^{4\varepsilon_1}}{N\eta} \ge |\Delta(z)|,
\]
where we use (2.2.96) and Lemma 2.2.20. By a similar resolvent expansion, there exists some constant \(C > 0\) such that
\[
\left\|\frac{1}{D^{-1} + U^*G(z)U}\right\| \le \frac{C}{\mathrm{Im}\, m_{2c}(z)} = CN^{1-6\varepsilon_1}\eta.
\]
We therefore get from (2.2.99), the definition of \(f\) and (2.2.96) that
\[
\langle \bar v_i, G(z)\bar v_i\rangle = \frac{m_{2c}(z)}{1 - d_i^2T(z)} + O\Big(\frac{d_i^2}{|1 - d_i^2T(z)|^2}\,\frac{N^{4\varepsilon_1}}{N\eta}\Big). \tag{2.2.100}
\]
By (2.2.98), we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{\eta}{|1 - d_i^2T(z)|^2}\Big[\mathrm{Im}\, m_{2c}(z)\big(1 - d_i^2c^{1/2} + \mathrm{Re}(d_i^2c^{1/2} - d_i^2T(z))\big) + d_i^2\,\mathrm{Re}\, m_{2c}(z)\,\mathrm{Im}\, T(z) + \frac{Cd_i^2N^{4\varepsilon_1}}{N\eta}\Big]. \tag{2.2.101}
\]
By (2.2.57), (2.2.95) and (2.2.97), we have
\[
\mathrm{Im}\, m_{2c}(z)\big[(1 - d_i^2c^{1/2}) + \mathrm{Re}(d_i^2c^{1/2} - d_i^2T(z))\big] \le \frac{CN^{6\varepsilon_1}}{N\eta}\Big(|d_i - c^{-1/4}| + \max\Big\{\sqrt{\kappa + \eta},\ \frac{\eta}{\sqrt{\kappa + \eta}}\Big\} + \kappa\Big).
\]
For the other term, by Lemma 2.2.15 we have \(|\mathrm{Re}\, m_{2c}(z)\,\mathrm{Im}\, T(z)| \sim \mathrm{Im}\, m_{2c}(z)\). Putting all these estimates together, we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{CN^{6\varepsilon_1}}{N|1 - d_i^2T(z)|^2}.
\]
It remains to estimate \(1 - d_i^2T(z)\). We summarize the estimate in the following lemma.
Lemma 2.2.28. Recall (2.2.44); for all \(\mu_j \in [\lambda_-, \lambda_+ + N^{-2/3+C\varepsilon_0}]\), there exists a constant \(\delta > 0\) such that
\[
|1 - d_i^2T(z)| \ge \delta d_i^2\big(|d_i^{-2} - c^{1/2}| + \mathrm{Im}\, T(z)\big).
\]
Therefore, we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{N^{C\varepsilon_0}}{N\big((d_i - c^{-1/4})^2 + \kappa_j^d\big)}, \quad \kappa_j^d := N^{-2/3}(j \wedge (K + 1 - j))^{2/3},
\]
where we use the fact that \(\mathrm{Im}\, T(z) \ge c\sqrt{\kappa_j^d}\). This concludes the proof of (2.2.39). For the proof of (2.2.38), we use the spectral decomposition
\[
\langle u_i, \tilde u_j\rangle^2 \le \eta\,\langle \bar u_i, \mathrm{Im}\, G_1(z)\bar u_i\rangle = \eta\,\langle \bar u_i, \mathrm{Im}\, G(z)\bar u_i\rangle,
\]
and
\[
\langle \bar u_i, G(z)\bar u_i\rangle = -\frac{1}{zd_i^2}\big(D^{-1} + U^*G(z)U\big)^{-1}_{ii}.
\]
Then, by a resolvent expansion similar to (2.2.99) and controlling the terms using Lemmas 2.2.15, 2.3.27, 2.2.20 and 2.2.21, we can conclude the proof.
2.3 Eigen-structure of sample covariance matrix of
general form
Covariance matrices play an important role in high dimensional data analysis and find applications in many scientific endeavors, ranging from functional magnetic resonance imaging and the analysis of gene expression arrays to risk management and portfolio allocation. Furthermore, a large collection of statistical methods, including principal component analysis, discriminant analysis, clustering analysis, and regression analysis, require knowledge of the covariance structure. Estimating a high dimensional covariance matrix has thus become a fundamental problem in high dimensional statistics.
The starting point of covariance matrix estimation is the sample covariance matrix,
which is a consistent estimator when the dimension of the data is fixed. In the high
dimensional regime, even though the sample covariance matrix itself is a poor estima-
tor, it can still provide lots of information about the eigen-structure of the population
covariance matrix. In many cases, the population covariance matrices can be effectively
estimated using the information from sample covariance matrices. Two main types of covariance matrices have been studied in the literature. One is the covariance matrix whose eigenvalues all lie in the bulk of its spectrum. The null case is when the entries of the data matrix are i.i.d., where the spectrum of the sample covariance matrix satisfies the celebrated Marchenko-Pastur (MP) law. For data matrices with correlated entries, the spectrum satisfies the deformed MP law and has been well studied in the literature. In the deformed MP law, several bulk components are allowed (recall that the spectrum of the MP law has only one bulk component). The other line of effort is to add a few outliers (i.e., eigenvalues detached from the bulk) to the spectrum of the MP law, which yields the spiked covariance matrix.
In the present paper, we study the local asymptotics of the empirical eigen-structure of sample covariance matrices of general form. We add a finite number of outliers to the spectrum of the deformed MP law. Hence, our framework can be viewed as, on the one hand, an extension of the spiked model that allows multiple bulk components and, on the other hand, an extension of covariance matrices with deformed MP law obtained by adding a finite number of spikes. It is a unified framework for covariance matrices of general form containing all the models discussed above.
Local deformed Marchenko-Pastur law. It is well-known that the empirical eigenvalue density of sample covariance matrices with independent entries converges to the celebrated Marchenko-Pastur (MP) law. In the case when the population covariance matrices have a general structure, it has been shown that the empirical eigenvalue density still converges to a deterministic limit, which is called the deformed MP law. The local deformed MP law is an immediate consequence of the anisotropic local law. Denote
\[
c \equiv c_N := \frac{N}{M} \in (0, \infty), \tag{2.3.1}
\]
and let \(X = (x_{ij})\) be an \(M \times N\) data matrix with centered entries \(x_{ij} = N^{-1/2}q_{ij}\), \(1 \le i \le M\) and \(1 \le j \le N\), where the \(q_{ij}\) are i.i.d. random variables with unit variance such that, for all \(p \in \mathbb{N}\), there exists a constant \(C_p\) with \(\mathbb{E}|q_{11}|^p \le C_p\).
The MP law and its variants are best formulated using the Stieltjes transform (see Definition 2.3.24). Denote \(H = XX^*\) and its Green function by
\[
G_I(z) = (H - z)^{-1}, \quad z = E + i\eta \in \mathbb{C}_+.
\]
The local MP law can be informally written as
\[
\frac{1}{M}\mathrm{Tr}\, G_I(z) = m_{MP}(z) + O\Big(\sqrt{\frac{\mathrm{Im}\, m_{MP}(z)}{N\eta}} + \frac{1}{N\eta}\Big),
\]
where \(m_{MP}\) is the Stieltjes transform of the MP law. Note that \(m_{MP}(z)\) is independent of \(N\). For general sample covariance matrices without outliers, we adapt the model in [74] and write
\[
Q_b = \Sigma_b^{1/2}XX^*\Sigma_b^{1/2}, \tag{2.3.2}
\]
where \(\Sigma_b\) is a positive definite matrix satisfying some regularity conditions. We call \(Q_b\) the bulk model in this paper. Denote the eigenvalues of \(\Sigma_b\) by \(\sigma_1^b \ge \sigma_2^b \ge \cdots \ge \sigma_M^b > 0\), and the empirical spectral distribution (ESD) of \(\Sigma_b\) by
\[
\pi_b(A) := \frac{1}{M}\sum_{i=1}^M \mathbf{1}_{\sigma_i^b \in A}. \tag{2.3.3}
\]
We assume that there exists some small positive constant \(\tau\) such that
\[
\tau < \sigma_M^b \le \sigma_1^b \le \tau^{-1}, \quad \tau \le c \le \tau^{-1}, \quad \pi_b([0, \tau]) \le 1 - \tau. \tag{2.3.4}
\]
Next we discuss the asymptotic density of \(Q_b^1 := X^*\Sigma_bX\). Assuming that \(\pi_b \Rightarrow \pi_b^\infty\) weakly, it is well-known that if \(\pi_b^\infty\) is a compactly supported probability measure on \(\mathbb{R}\) and \(c > 0\), then for each \(z \in \mathbb{C}_+\) there is a unique \(m_D \equiv m_{\Sigma_b}(z) \in \mathbb{C}_+\) satisfying
\[
\frac{1}{m_D} = -z + \frac{1}{c}\int \frac{x}{1 + m_Dx}\,\pi_b^\infty(dx). \tag{2.3.5}
\]
We denote by \(\rho_D\) the probability measure associated with \(m_D\) (i.e., \(m_D\) is the Stieltjes transform of \(\rho_D\)) and call it the asymptotic density of \(Q_b^1\). Our assumption (2.3.4) implies that the spectrum of \(\Sigma_b\) cannot be concentrated at zero; thus it ensures that \(\pi_b^\infty\) is a compactly supported probability measure. Therefore, \(m_D\) and \(\rho_D\) are well-defined. The behaviour of \(\rho_D\) can be entirely understood through the analysis of the function \(f_D\):
\[
z = f_D(m_D), \quad \mathrm{Im}\, m_D \ge 0, \quad \text{where } f_D(x) := -\frac{1}{x} + \frac{1}{c}\int \frac{\lambda}{1 + x\lambda}\,\pi_b^\infty(d\lambda). \tag{2.3.6}
\]
In practical applications, the limiting form (2.3.5) is usually not available and we are interested in the large \(N\) case. We now define the deterministic function \(m \equiv m_{\Sigma_b,N}(z)\) as the unique solution of
\[
z = f(m), \quad \mathrm{Im}\, m \ge 0, \quad \text{where } f(x) := -\frac{1}{x} + \frac{1}{c_N}\sum_{i=1}^M \frac{\pi_b(\sigma_i^b)}{x + (\sigma_i^b)^{-1}}. \tag{2.3.7}
\]
Similarly, we denote by \(\rho\) the probability measure associated with \(m(z)\). The local deformed MP law can be informally written as
\[
\frac{1}{N}\mathrm{Tr}\, G(z) = m(z) + O\Big(\sqrt{\frac{\mathrm{Im}\, m(z)}{N\eta}} + \frac{1}{N\eta}\Big), \tag{2.3.8}
\]
where \(G(z)\) is the Green function of \(Q_b^1\). Note that \(m(z)\) depends on \(N\) in general.
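The self-consistent equation (2.3.7) can be solved by a simple fixed-point iteration for any \(z \in \mathbb{C}_+\), and the solution can be compared with the empirical Stieltjes transform in (2.3.8). The sketch below (a two-atom \(\Sigma_b\) with eigenvalues 1 and 4, \(c_N = 2\) and one arbitrary spectral parameter \(z\); all of these choices are illustrative) is a sanity check, not the rigorous local law:

```python
import numpy as np

# Solve 1/m = -z + (1/c_N) * (1/M) * sum_i sigma_i / (1 + m sigma_i), which is
# equivalent to z = f(m) in (2.3.7), and compare with (1/N) Tr G(z) in (2.3.8).
M, N = 400, 800                                # arbitrary sizes, c_N = N/M = 2
c = N / M
sigma = np.concatenate([np.ones(M // 2), 4.0 * np.ones(M // 2)])  # two-atom pi_b

z = 2.0 + 0.5j                                 # arbitrary spectral parameter
m = 1j                                         # start in the upper half plane
for _ in range(500):
    m = 1.0 / (-z + (1.0 / c) * np.mean(sigma / (1.0 + m * sigma)))
res = abs(1.0 / m + z - (1.0 / c) * np.mean(sigma / (1.0 + m * sigma)))

# Monte Carlo comparison with the Green function of Q_b^1 = X^* Sigma_b X
rng = np.random.default_rng(3)
X = rng.normal(size=(M, N)) / np.sqrt(N)
evals = np.linalg.eigvalsh((X.T * sigma) @ X)  # spectrum of X^T Sigma_b X
m_emp = np.mean(1.0 / (evals - z))

assert res < 1e-8                              # m solves the fixed point
assert abs(m_emp - m) < 0.02                   # O(1/(N*eta)) agreement
```

The iteration \(m \mapsto 1/(-z + c^{-1}\int x(1+mx)^{-1}d\pi_b)\) is the standard stable way to solve such self-consistent equations in the upper half plane.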
Remark 2.3.1. In the literature, there are no results on the control of \(m - m_D\), as it depends on the convergence rate of \(\pi_b \Rightarrow \pi_b^\infty\). We believe that under mild assumptions we can replace \(m(z)\) with \(m_D(z)\). We will not pursue this generalization in this paper.
General covariance matrices. This subsection is devoted to defining the general covariance matrices. We start with a discussion of the spectrum of the bulk model \(Q_b\) defined in (2.3.2). We first summarize the properties of \(f\) defined in (2.3.7); they can be found in [74, Lemmas 2.4, 2.5 and 2.6].
Lemma 2.3.2. Denote \(\bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\}\); then \(f\) defined in (2.3.7) is smooth on the \(M + 1\) open intervals of \(\bar{\mathbb{R}}\) defined through
\[
I_1 := (-(\sigma_1^b)^{-1}, 0), \quad I_i := (-(\sigma_i^b)^{-1}, -(\sigma_{i-1}^b)^{-1}),\ i = 2, \cdots, M, \quad I_0 := \bar{\mathbb{R}} \setminus \cup_{i=1}^M I_i.
\]
We also introduce a multiset \(\mathcal{C} \subset \bar{\mathbb{R}}\) containing the critical points of \(f\), using the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case \(c_N = 1\), \(\infty\) is a nondegenerate critical point. With the above notations, we have:

• (Critical points): \(|\mathcal{C} \cap I_0| = |\mathcal{C} \cap I_1| = 1\) and \(|\mathcal{C} \cap I_i| \in \{0, 2\}\) for \(i = 2, \cdots, M\). Therefore, \(|\mathcal{C}| = 2p\), where, for convenience, we denote by \(x_1 \ge x_2 \ge \cdots \ge x_{2p-1}\) the \(2p - 1\) critical points in \(I_1 \cup \cdots \cup I_M\) and by \(x_{2p}\) the unique critical point in \(I_0\).

• (Ordering): Denote \(a_k := f(x_k)\); we have \(a_1 \ge \cdots \ge a_{2p}\). Moreover, we have \(x_k = m(a_k)\), adopting the convention \(m(0) := \infty\) for \(c_N = 1\). Furthermore, for \(k = 1, \cdots, 2p\), there exists a constant \(C\) such that \(0 \le a_k \le C\).

• (Structure of \(\rho\)): \(\mathrm{supp}\, \rho \cap (0, \infty) = \big(\cup_{k=1}^p [a_{2k}, a_{2k-1}]\big) \cap (0, \infty)\).
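The critical-point structure in Lemma 2.3.2 can be explored numerically by scanning \(f'\) for sign changes on each interval \(I_i\). A minimal sketch for a two-atom \(\pi_b\) (eigenvalues 1 and 4 with equal weight and \(c_N = 2\), an arbitrary illustrative choice; in this instance \(I_1\) carries one critical point and \(I_2\) none, so the two bulks merge and \(p = 1\)):

```python
import numpy as np

# f(x) = -1/x + (1/c) * sum_i w_i / (x + 1/sigma_i) for a two-atom pi_b;
# critical points are the sign changes of f' on the intervals of Lemma 2.3.2.
c = 2.0
weights = np.array([0.5, 0.5])          # pi_b({1}) = pi_b({4}) = 1/2
sigmas = np.array([1.0, 4.0])

def f_prime(x):
    return 1.0 / x**2 - (1.0 / c) * np.sum(
        weights / (x[:, None] + 1.0 / sigmas) ** 2, axis=1)

def count_sign_changes(lo, hi):
    x = np.linspace(lo, hi, 20000)
    s = np.sign(f_prime(x))
    return int(np.sum(s[:-1] != s[1:]))

# I_1 = (-1/sigma_1, 0) = (-1/4, 0): exactly one critical point.
n1 = count_sign_changes(-0.249, -0.001)
# I_2 = (-1, -1/4): zero or two critical points (here zero: merged bulks).
n2 = count_sign_changes(-0.999, -0.251)
assert n1 == 1 and n2 == 0
```

A grid scan like this is of course only heuristic; it illustrates the counting statement of the lemma rather than proving it.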
We impose the following regularity conditions on \(\Sigma_b\), which are proposed in [74, Definition 2.7]. Roughly speaking, the regularity condition rules out outliers from the spectrum of \(Q_b\).
Assumption 2.3.3. Fix \(\tau > 0\). We assume the following.
(i) The edges \(a_k\), \(k = 1, \cdots, 2p\), are regular in the sense that
\[
a_k \ge \tau, \quad \min_{l \ne k}|a_k - a_l| \ge \tau, \quad \min_i |x_k + (\sigma_i^b)^{-1}| \ge \tau. \tag{2.3.9}
\]
(ii) The bulk components \(k = 1, \cdots, p\) are regular in the sense that for any fixed \(\tau' > 0\) there exists a constant \(\nu \equiv \nu_{\tau,\tau'}\) such that the density of \(\rho\) in \([a_{2k} + \tau', a_{2k-1} - \tau']\) is bounded from below by \(\nu\).
Remark 2.3.4. The second condition in (2.3.9) states that the gap in the spectrum of \(\rho\) adjacent to \(a_k\) remains well separated when \(N\) is sufficiently large. The third condition ensures a square root behaviour of \(\rho\) in a small neighborhood of \(a_k\); as a consequence, it rules out outliers. The bulk regularity imposes a lower bound on the density of eigenvalues away from the edges.
To extend the bulk model, we now add a finite number \(r\) of spikes to the spectrum of \(\Sigma_b\). Denote the spectral decomposition of \(\Sigma_b\) as
\[
\Sigma_b = \sum_{i=1}^M \sigma_i^b v_iv_i^*, \quad D_b = \mathrm{diag}\{\sigma_1^b, \cdots, \sigma_M^b\}.
\]
Denote by \(\mathcal{I} \subset \{1, 2, \cdots, M\}\) the collection of the indices of the \(r\) outliers, where
\[
\mathcal{I} := \{o_1, \cdots, o_r\} \subset \{1, 2, \cdots, M\}. \tag{2.3.10}
\]
Now we define
\[
\Sigma_g = \sum_{i=1}^M \sigma_i^g v_iv_i^*, \quad \text{where } \sigma_i^g =
\begin{cases}
\sigma_i^b(1 + d_i), & i \in \mathcal{I};\\
\sigma_i^b, & \text{otherwise},
\end{cases} \quad d_i > 0. \tag{2.3.11}
\]
We also assume that the \(d_i\) are arranged in decreasing order. We further define
\[
\mathcal{O} := \{\sigma_i^g, i \in \mathcal{I}\}. \tag{2.3.12}
\]
Therefore, we can write
\[
\Sigma_g = \Sigma_b(1 + \mathbf{V}\mathbf{D}\mathbf{V}^*) = (1 + \mathbf{V}\mathbf{D}\mathbf{V}^*)\Sigma_b, \tag{2.3.13}
\]
where \(\mathbf{V} = (v_1, \cdots, v_M)\) and \(\mathbf{D} = (d_i)\) is an \(M \times M\) diagonal matrix with \(i\)-th entry \(d_i\) for \(i \in \mathcal{I}\) and zero otherwise. As \(\mathbf{D}\) is not invertible, we write
\[
\mathbf{V}\mathbf{D}\mathbf{V}^* = \sum_{i \in \mathcal{I}} d_iv_iv_i^* = \mathbf{V}_o\mathbf{D}_o\mathbf{V}_o^*, \tag{2.3.14}
\]
where \(\mathbf{V}_o\) is an \(M \times r\) matrix containing \(v_i\), \(i \in \mathcal{I}\), and \(\mathbf{D}_o\) is an \(r \times r\) diagonal matrix with entries \(d_i\), \(i \in \mathcal{I}\). Then our model can be written as
\[
Q_g = \Sigma_g^{1/2}XX^*\Sigma_g^{1/2}. \tag{2.3.15}
\]
We call it the general model. Denote \(K = \min\{M, N\}\), and let \(\mu_1 \ge \mu_2 \ge \cdots \ge \mu_K > 0\) be the nontrivial eigenvalues of \(Q_g\) and \(u_i\), \(i = 1, 2, \cdots, M\), the eigenvectors. We also use \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K > 0\) to denote the nontrivial eigenvalues of \(Q_b\) and \(u_i^b\) the eigenvectors of \(Q_b\). As there exist \(p\) bulk components, for convenience we relabel the indices of the eigenvalues of \(Q_g\) using \(\mu_{i,j}\), which stands for the \(j\)-th eigenvalue of the \(i\)-th bulk component. Similarly, we relabel \(d_{i,j}, \lambda_{i,j}, \sigma_{i,j}^g, \sigma_{i,j}^b, v_{i,j}, u_{i,j}\) and \(u_{i,j}^b\).
We further assume that the \(r\) outliers are associated with the \(p\) bulk components, each with \(r_i\), \(i = 1, 2, \cdots, p\), outliers satisfying \(\sum_{i=1}^p r_i = r\). Using the convention \(x_0 = \infty\), we denote the subset \(\mathcal{O}^+ \subset \mathcal{O}\) by \(\mathcal{O}^+ = \cup_{i=1}^p \mathcal{O}_i^+\), where \(\mathcal{O}_i^+\) is defined as
\[
\mathcal{O}_i^+ = \Big\{\sigma_{i,j}^g : x_{2i-1} + N^{-1/3+\varepsilon_0} \le -\frac{1}{\sigma_{i,j}^g} < x_{2(i-1)} - c_0\Big\}, \tag{2.3.16}
\]
where \(\varepsilon_0 > 0\) is some small constant and \(0 < c_0 < \min_i \frac{x_{2(i-1)} - x_{2i-1}}{2}\). We further denote \(r_i^+ := |\mathcal{O}_i^+|\) and the index sets associated with \(\mathcal{O}_i^+, \mathcal{O}^+\) by \(\mathcal{I}_i^+, \mathcal{I}^+\), where
\[
\mathcal{I}_i^+ := \{(i, j) : \sigma_{i,j}^g \in \mathcal{O}_i^+\}, \quad \mathcal{I}^+ := \bigcup_{i=1}^p \mathcal{I}_i^+. \tag{2.3.17}
\]
We can relabel \(\mathcal{I}\) in a similar fashion.
Remark 2.3.5. Our results can be extended to a more general domain by setting
\[
\mathcal{O}_i^+ = \Big\{\sigma_{i,j}^g : x_{2i-1} + N^{-1/3} \le -\frac{1}{\sigma_{i,j}^g} < x_{2(i-1)} - c_0\Big\}.
\]
The proofs still hold true with some minor changes, except that we would need to discuss the case \(x_{2i-1} + N^{-1/3} \le -\frac{1}{\sigma_{i,j}^g} \le x_{2i-1} + N^{-1/3+\varepsilon_0}\). We will not pursue this generalization.
For definiteness, we introduce the following assumption.
Assumption 2.3.6. For all \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i\), we have
\[
f(x_{2i-1}) \le f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) \le f(x_{2(i-1)}), \quad f(x_0) = \infty. \tag{2.3.18}
\]
Furthermore, we assume that
\[
\Big|f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) - f(x_{2(i-1)})\Big| - \Big|f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) - f(x_{2i-1})\Big| \ge \tau, \quad i = 2, \cdots, p, \tag{2.3.19}
\]
where \(\tau > 0\) is some constant.
Roughly speaking, Assumption 2.3.6 ensures that the outliers always sit to the right of each bulk component. When the outliers are on the left (i.e., when (2.3.19) is reversed), we can get similar results.
To avoid repetition, we summarize the assumptions for future reference.
Assumption 2.3.7. We assume that (2.3.1), (2.3.4), (2.3.11) and Assumptions 2.3.3 and 2.3.6 hold true.
We note that [25] considers a similar model, but with spikes only on the right of the spectrum.
Main results. We first introduce the following non-overlapping condition. Roughly
speaking, it ensures that the eigenvalues of Qg are well separated so that we can identify
the eigen-structure.
Assumption 2.3.8 (Non-overlapping condition). For \(A \subset \mathcal{O}^+\), \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i^+\), we assume that
\[
\nu_{i,j}(A) \ge \Big(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big)^{-1/2}N^{-1/2+\varepsilon_0}, \tag{2.3.20}
\]
where \(\varepsilon_0\) is defined in (2.3.16) and \(\nu_{i,j}\) is defined as
\[
\nu_{i,j} \equiv \nu_{i,j}(A) :=
\begin{cases}
\min_{\sigma_{i_1,j_1}^g \notin A}\Big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\Big|, & \text{if } \sigma_{i,j}^g \in A,\\[1.2ex]
\min_{\sigma_{i_1,j_1}^g \in A}\Big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\Big|, & \text{if } \sigma_{i,j}^g \notin A.
\end{cases} \tag{2.3.21}
\]
Remark 2.3.9. In this paper, we compute the convergent limits of the outlier eigenvectors
under Assumption 2.3.8. However, with extra work, we can show that the results still
hold true by removing this assumption. We will not pursue this generalization.
We now state the main results. Throughout the rest of this paper, we always use \(D_1\) as a generic large constant and \(\varepsilon_1 < \varepsilon_0\) as a small constant.
Theorem 2.3.10 (Outlier eigenvalues). Under Assumption 2.3.7, for \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i^+\), there exists some constant \(C > 1\) such that, when \(N\) is large enough, with probability \(1 - N^{-D_1}\) we have
\[
\Big|\mu_{i,j} - f\Big(-\frac{1}{\sigma_{i,j}^g}\Big)\Big| \le N^{-1/2+C\varepsilon_0}\Big(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big)^{1/2}. \tag{2.3.22}
\]
Moreover, for \(i = 1, 2, \cdots, p\), \(j = r_i^+ + 1, \cdots, r_i\), we have
\[
|\mu_{i,j} - f(x_{2i-1})| \le N^{-2/3+C\varepsilon_0}. \tag{2.3.23}
\]
The above theorem gives the precise locations of the outlier and the extremal non-outlier eigenvalues. The outliers locate around their classical locations \(f(-1/\sigma_{i,j}^g)\), and the non-outliers locate around the right edge of the bulk component. However, (2.3.23) can be easily extended to a more general framework. Instead of considering the bulk edge, we can locate \(\mu_{i,j}\) around the eigenvalues of \(Q_b\), which is the phenomenon of eigenvalue sticking. We denote the classical eigenvalue locations in the bulk by \(\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K\), where \(N\int_{\gamma_i}^\infty d\rho = i - \frac{1}{2}\). We also denote the classical number of eigenvalues in the \(i\)-th bulk component by \(N_i := N\int_{a_{2i}}^{a_{2i-1}} d\rho\). Furthermore, for \(i = 1, 2, \cdots, p\) and \(j = 1, 2, \cdots, N_i\), we denote
\[
\lambda_{i,j} := \lambda_{j + \sum_{l<i}N_l}, \quad \gamma_{i,j} := \gamma_{j + \sum_{l<i}N_l} \in (a_{2i}, a_{2i-1}). \tag{2.3.24}
\]
Note that \(\gamma_{i,j}\) can also be characterized through \(N\int_{\gamma_{i,j}}^{a_{2i-1}} d\rho = j - \frac{1}{2}\).
Theorem 2.3.11 (Eigenvalue sticking). Under Assumption 2.3.7, for \(i = 1, 2, \cdots, p\), denote
\[
\alpha_+^i := \min_{1 \le j \le N_i}\Big|-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big|. \tag{2.3.25}
\]
With probability \(1 - N^{-D_1}\), when \(\alpha_+^i \ge N^{-1/3+2\varepsilon_1}\),
\[
|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le \frac{N^{2\varepsilon_1}}{N\alpha_+^i}. \tag{2.3.26}
\]
Remark 2.3.12. We remark that when \(\alpha_+^i < N^{-1/3+2\varepsilon_1}\), it can be shown that (2.3.26) still holds true. However, in this case the eigenvalue rigidity (see Lemma 2.3.34) gives the sharp bound
\[
|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le N^{-2/3+\varepsilon_1}(j \wedge (N_i + 1 - j))^{-1/3}.
\]
Furthermore, for some small constant \(\tau' > 0\), if \(\gamma_{i,j} \in [a_{2i} + \tau', a_{2i-1} - \tau']\), we have \(|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le N^{-1+\varepsilon_1}\). We will see later from Lemma 2.3.34 that when \(\alpha_+^i = O(1)\), the sticking bound \(N^{-1}\) is much smaller than the typical gap \(N^{-2/3}j^{-1/3}\) near the edges.
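Eigenvalue sticking is also visible in simulation: building \(Q_g\) and \(Q_b\) from the same \(X\), the non-outlier eigenvalues of \(Q_g\) track those of \(Q_b\) to within the local eigenvalue spacing. A minimal sketch (an illustrative toy setting with \(\sigma_i^b \equiv 1\), one spike and arbitrary sizes):

```python
import numpy as np

# Sticking check: with one spike (so r_i^+ = 1), mu_{j+1}(Q_g) stays within the
# local eigenvalue spacing of lambda_j(Q_b) when both are built from the same X.
M, N = 500, 1000                              # arbitrary sizes, c_N = 2
rng = np.random.default_rng(5)
X = rng.normal(size=(M, N)) / np.sqrt(N)

Qb = X @ X.T                                  # bulk model, Sigma_b = I
sg = np.ones(M); sg[0] = 5.0                  # one spiked population eigenvalue
Y = np.sqrt(sg)[:, None] * X                  # Sigma_g^{1/2} X
Qg = Y @ Y.T

lam = np.sort(np.linalg.eigvalsh(Qb))[::-1]   # lambda_1 >= lambda_2 >= ...
mu = np.sort(np.linalg.eigvalsh(Qg))[::-1]    # mu_1 >= mu_2 >= ...

# mu_{j+1} sticks to lambda_j near the edge (here j = 1, ..., 50)
stick = np.max(np.abs(mu[1:51] - lam[:50]))
assert stick < 0.05
```

Interlacing alone already forces \(\mu_{j+1} \in [\lambda_{j+1}, \lambda_j]\) here, so this sketch mainly visualizes how much tighter than the spacing the sticking actually is.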
Theorems 2.3.10 and 2.3.11 can be used to estimate the spectrum of the general model. For the bulk model, El Karoui [70] consistently estimated the spectrum by solving a linear programming problem whose objective function involves (2.3.7); later on, Kong and Valiant [75] considered the problem by using the information from samples and provided sharp convergence rates for the estimation. However, neither of the above two methods can be applied to estimate the general model, as both of them rely on information from the deformed MP law, which will "ignore" the finite number of outliers. For the general model, the spiked part can be estimated using (2.3.22), while the bulk part can be estimated using the methods from [70, 75] thanks to the eigenvalue sticking property.
Next, we introduce the results on the eigenvectors. Denote
\[
u_{i,j} := \frac{1}{\sigma_{i,j}^g}\,\frac{f'(-1/\sigma_{i,j}^g)}{f(-1/\sigma_{i,j}^g)}. \tag{2.3.27}
\]
Theorem 2.3.13 (Outlier eigenvectors). For \(1 \le i_1, i_2 \le p\), \(1 \le j_1 \le r_{i_1}^+\), \(1 \le j_2 \le r_{i_2}^+\), under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\) we have
\[
\big|\langle u_{i_1,j_1}, v_{i_2,j_2}\rangle^2 - \mathbf{1}(i_1 = i_2, j_1 = j_2)\,u_{i_1,j_1}\big| \le N^{\varepsilon_1}R(i_1, j_1, i_2, j_2, N), \tag{2.3.28}
\]
where \(R(i_1, j_1, i_2, j_2, N)\) is defined as
\[
R(i_1, j_1, i_2, j_2, N) := \mathbf{1}(i_1 = i_2, j_1 = j_2)\frac{1}{\sqrt{N}}\Big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\Big)^{-1/2} + N^{-1}\Bigg(\frac{1}{\nu_{i_2,j_2}^2} + \frac{\mathbf{1}(i_1 = i_2, j_1 = j_2)}{\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_1-1}\big)^2}\Bigg).
\]
More generally, we consider the spectral projections and the generalized components. Denote
\[
P_A := \sum_{(i,j) \in A} u_{i,j}u_{i,j}^*, \quad A \subset \mathcal{I}^+. \tag{2.3.29}
\]
For a vector \(\mathbf{w} \in \mathbb{R}^M\), we define \(w_{i,j} := \langle v_{i,j}, \mathbf{w}\rangle\).

Corollary 2.3.14. For \(A \subset \mathcal{I}^+\) and any deterministic vector \(\mathbf{w} \in \mathbb{R}^M\), define
\[
\langle \mathbf{w}, Z_A\mathbf{w}\rangle := \sum_{(i,j) \in A} u_{i,j}w_{i,j}^2.
\]
Under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\), when \(N\) is large enough, we have
\[
|\langle \mathbf{w}, P_A\mathbf{w}\rangle - \langle \mathbf{w}, Z_A\mathbf{w}\rangle| \le N^{\varepsilon_1}R(\mathbf{w}, A), \tag{2.3.30}
\]
where \(R(\mathbf{w}, A) := \sum\sum w_{i_1,j_1}w_{i_2,j_2}R(i_1, j_1, i_2, j_2, A)\) and \(R(i_1, j_1, i_2, j_2, A)\) is defined as
\[
N^{-1/2}\Bigg[\frac{\mathbf{1}\big((i_1,j_1), (i_2,j_2) \in A\big)}{\big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\big)^{1/4}\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}\big)^{1/4}} + \mathbf{1}\big((i_1,j_1) \in A, (i_2,j_2) \notin A\big)\frac{\big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\big)^{1/2}}{\big|-\frac{1}{\sigma_{i_1,j_1}^g} + \frac{1}{\sigma_{i_2,j_2}^g}\big|}
+ \mathbf{1}\big((i_1,j_1) \notin A, (i_2,j_2) \in A\big)\frac{\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}\big)^{1/2}}{\big|-\frac{1}{\sigma_{i_1,j_1}^g} + \frac{1}{\sigma_{i_2,j_2}^g}\big|}\Bigg]
+ N^{-1}\Bigg(\frac{1}{\nu_{i_1,j_1}} + \frac{\mathbf{1}\big((i_1,j_1) \in A\big)}{-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}}\Bigg)\Bigg(\frac{1}{\nu_{i_2,j_2}} + \frac{\mathbf{1}\big((i_2,j_2) \in A\big)}{-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}}\Bigg).
\]
Theorem 2.3.15 (Non-outlier eigenvectors). For \((k, i) \in \mathcal{I}^+\) and \((l, j) \in \mathcal{I} \setminus \mathcal{I}^+\), under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\) we have
\[
\langle v_{k,i}, u_{l,j}\rangle^2 \le \frac{N^{6\varepsilon_1}}{N\big(\kappa_{l,j}^d + ((\sigma_{k,i}^g)^{-1} + x_{2k-1})^2\big)}, \tag{2.3.31}
\]
where \(\kappa_{l,j}^d := (j \wedge (N_l + 1 - j))^{2/3}N^{-2/3}\).
Corollary 2.3.16. For \((l, j) \in \mathcal{I} \setminus \mathcal{I}^+\), under Assumptions 2.3.7 and 2.3.8, for \(\mathbf{w} \in \mathbb{R}^M\), with probability \(1 - N^{-D_1}\), when \(N\) is large enough, we have
\[
\langle \mathbf{w}, u_{l,j}\rangle^2 \le \sum_{(k,i)} \frac{Cw_{k,i}^2N^{6\varepsilon_1}}{N\big(\kappa_{l,j}^d + ((\sigma_{k,i}^g)^{-1} + x_{2k-1})^2\big)}. \tag{2.3.32}
\]
Before concluding this part, we give a few examples to illustrate our results on the sample eigenvectors.

Example 2.3.17. (i) Let \(A = \{(i, j)\} \subset \mathcal{I}^+\), \(\mathbf{w} = v_{i,j}\) and \(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1} \ge \tau > 0\); then for some constant \(C > 0\) we have
\[
|\langle u_{i,j}, v_{i,j}\rangle^2 - u_{i,j}| \le N^{-1/2+C\varepsilon_1}.
\]
If we take \(A = \{(i, j)\}\) and \(\mathbf{w} = v_{i_1,j_1}\) with \((i_1, j_1) \ne (i, j)\), and if \(\big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\big| \ge \tau\), we then have
\[
|\langle u_{i,j}, v_{i_1,j_1}\rangle|^2 \le N^{-1+C\varepsilon_1}.
\]
In particular, if \(\sigma_i^b = 1\), \(i = 1, 2, \cdots, M\), our results coincide with [17, Examples 2.13 and 2.14].

(ii) Take \(\mathbf{w} = v_{k,i}\) and \(u_{l,j}\) as in Theorem 2.3.15. Assume that \(\big|\frac{1}{\sigma_{k,i}^g} + x_{2k-1}\big| \ge \tau\) and \(\kappa_{l,j}^d = O(1)\); then we have
\[
\langle v_{k,i}, u_{l,j}\rangle^2 \le N^{-1+C\varepsilon_1}.
\]
Examples. We consider a few examples to explain our results in detail. We first provide two types of conditions on \(\Sigma_b\) that verify Assumption 2.3.3. They can be found in [74, Examples 2.8 and 2.9].
Condition 2.3.18. We suppose that \(n\) is fixed and that there are only \(n\) distinct eigenvalues of \(\Sigma_b\). We further assume that \(\sigma_1^b, \cdots, \sigma_n^b\) and \(N\pi_b(\sigma_1^b), \cdots, N\pi_b(\sigma_n^b)\) all converge in \((0, \infty)\) as \(N \to \infty\). We also assume that the critical points of \(\lim_N f\) are non-degenerate, and that \(\lim_N a_i > \lim_N a_{i+1}\) for \(i = 1, 2, \cdots, 2p - 1\).
Condition 2.3.19. We suppose that \(c \ne 1\), that \(\pi_b\) is supported in some interval \([a, b] \subset (0, \infty)\), and that \(\pi_b\) converges weakly to some measure \(\pi_b^\infty\) that is absolutely continuous and whose density satisfies \(\tau \le d\pi_b^\infty(E)/dE \le \tau^{-1}\) for \(E \in [a, b]\). In this case, \(p = 1\).
In all the examples, we only derive the results for the eigenvalues and leave the discussion and interpretation of the eigenvectors to the reader. We first provide two examples satisfying Condition 2.3.18.

Example 2.3.20 (BBP transition). We suppose that \(r = 1\) and let \(\sigma_i^b = 1\), \(i = 1, 2, \cdots, M\). We assume that \(c > 1\). In this case, we can instead use \(f \equiv f_D\) defined in (2.3.6), where we have
\[
f(x) = -\frac{1}{x} + \frac{1}{c(x + 1)}.
\]
It can be easily checked that the critical points of f(x) are −√c√c−1
, −√c√c+1
, which implies
that p = 1. By (2.3.22), the convergent limit of the largest eigenvalue is
µ = f(− 1
d+ 1) = 1 + d+ c−1(1 + d−1),
and the phase transition happens when
− 1
1 + d> −
√c√
c+ 1⇒ d > c−1/2.
And the local convergence result reads as
∣∣∣∣µ− f(− 1
d+ 1)
∣∣∣∣ ≤ N−1/2+Cε0(d− c−1/2)1/2.
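As a quick numerical illustration (a sketch only, assuming standard Gaussian entries and the normalization of this section, with $X$ of size $M \times N$ and $M = N/c$), the largest sample eigenvalue can be compared with the predicted limit $1 + d + c^{-1}(1 + d^{-1})$:

```python
import numpy as np

def predicted_outlier(d, c):
    # mu = f(-1/(1+d)) = 1 + d + c^{-1}(1 + d^{-1}), valid for d > c^{-1/2}
    return 1.0 + d + (1.0 + 1.0 / d) / c

def largest_eigenvalue(d, c, N, seed=0):
    # Spiked sample covariance Q_g = Sigma^{1/2} X X^* Sigma^{1/2},
    # with Sigma = diag(1 + d, 1, ..., 1) and M = N / c.
    rng = np.random.default_rng(seed)
    M = int(N / c)
    sigma = np.ones(M)
    sigma[0] = 1.0 + d
    X = rng.standard_normal((M, N)) / np.sqrt(N)
    Y = np.sqrt(sigma)[:, None] * X
    return np.linalg.eigvalsh(Y @ Y.T)[-1]

mu_hat = largest_eigenvalue(d=2.0, c=2.0, N=2000)
mu_pred = predicted_outlier(d=2.0, c=2.0)   # 3.75 for d = 2, c = 2
```

Below the threshold $d \le c^{-1/2}$ the largest eigenvalue instead sticks to the bulk edge $f(-\sqrt c/(\sqrt c + 1))$.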
Example 2.3.21 (Spiked model with multiple bulk components). Consider the $M \times N$ sample covariance matrix with population covariance matrix defined by
$$\Sigma_g = \operatorname{diag}\big(35, \underbrace{18, \cdots, 18}_{M/2-1 \text{ times}}, 4, \underbrace{1, \cdots, 1}_{M/2-1 \text{ times}}\big). \qquad (2.3.33)$$
We assume $c = 2$, and then we have
$$f(x) = -\frac{1}{x} + \frac{1}{4}\Big(\frac{1}{x + 1/18} + \frac{1}{x+1}\Big).$$
Furthermore, $f$ has four critical points, approximately $-2.3926$, $-0.62575$, $-0.11133$ and $-0.037035$; here $p = 2$. Since
$$f\Big(-\frac{1}{35}\Big) = 44.522 > f(-0.037035) = 40.759, \qquad f\Big(-\frac14\Big) = 3.0476 > f(-0.62575) = 1.827,$$
we find that there are two outliers, $f(-\frac{1}{35})$ and $f(-\frac14)$. Similarly, we can derive the local convergence results.
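The critical points quoted above can be recovered numerically. A minimal sketch (illustration only): scan for sign changes of $f'$ on grids between the poles at $0$, $-1/18$ and $-1$, then refine by bisection.

```python
import numpy as np

def f(x):
    # f for Example 2.3.21 with c = 2
    return -1.0 / x + 0.25 * (1.0 / (x + 1.0 / 18.0) + 1.0 / (x + 1.0))

def fprime(x):
    return 1.0 / x**2 - 0.25 * (1.0 / (x + 1.0 / 18.0)**2 + 1.0 / (x + 1.0)**2)

def sign_change_roots(grid):
    # Bisection on every sign change of f' between consecutive grid points.
    fp = fprime(grid)
    roots = []
    for k in np.flatnonzero(fp[:-1] * fp[1:] < 0):
        lo, hi = grid[k], grid[k + 1]
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if fprime(lo) * fprime(mid) <= 0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

# Search between the poles of f at -1, -1/18 and 0.
crit = sorted(
    sign_change_roots(np.linspace(-5.0, -1.001, 40001))
    + sign_change_roots(np.linspace(-0.999, -1.0 / 18.0 - 1e-4, 40001))
    + sign_change_roots(np.linspace(-1.0 / 18.0 + 1e-4, -1e-4, 40001))
)
```

The four roots reproduce the critical points above, and evaluating $f$ at $-1/35$ and $-1/4$ confirms the two outliers.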
Next we provide two examples satisfying Condition 2.3.19, where there exists only one bulk component.

Example 2.3.22 (Spiked model with uniformly distributed eigenvalues). Consider an $M \times N$ sample covariance matrix with population covariance matrix defined by
$$\Sigma_g = \operatorname{diag}(8, 2.9975, 1.995, \cdots, 1.005, 1.0025).$$
The limiting spectral distribution of $\Sigma_b$ is the uniform distribution on the interval $[1,3]$. Let $c = 2$; we can use $f \equiv f_D$ defined in (2.3.6), where
$$f(x) = -\frac{1}{2x} - \frac{1}{4x^2}\log\frac{3x+1}{x+1},$$
with critical points approximately $-2.0051405$ and $-0.2513025$. Therefore the left and right edges are
$$f(-2.0051405) \approx 0.1494, \qquad f(-0.2513025) \approx 6.3941,$$
and the outlier is $f(-\frac18) \approx 9.3836 > 6.3941$.
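The quoted numbers are easy to reproduce. The sketch below evaluates $f$ at the critical points and at the outlier location, then runs a small simulation; the equispaced grid on $[1,3]$ is an assumption standing in for the elided population eigenvalues (one concrete realization of the uniform limit).

```python
import numpy as np

def f(x):
    # f for Example 2.3.22 (uniform limit on [1, 3], c = 2)
    return -1.0 / (2.0 * x) - np.log((3.0 * x + 1.0) / (x + 1.0)) / (4.0 * x**2)

left_edge = f(-2.0051405)    # ~ 0.1494
right_edge = f(-0.2513025)   # ~ 6.3941
outlier = f(-1.0 / 8.0)      # ~ 9.3836

# Small simulation: spike 8 plus an equispaced grid on [1, 3].
rng = np.random.default_rng(1)
M, N = 500, 1000             # c = N / M = 2
sigma = np.concatenate([[8.0], 1.0 + 2.0 * (np.arange(M - 1) + 0.5) / (M - 1)])
X = rng.standard_normal((M, N)) / np.sqrt(N)
Y = np.sqrt(sigma)[:, None] * X
evals = np.linalg.eigvalsh(Y @ Y.T)
```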
Example 2.3.23 (Spiked Toeplitz matrix). Suppose that we observe an $M \times N$ sample covariance matrix with a Toeplitz population covariance matrix whose $(i,j)$-th entry is $0.4^{|i-j|}$ and whose spike is located at $10$. We choose $c = 2$, and $f$ can be written as
$$f(x) = -\frac{1}{2x} - \frac{1}{3.8092\,x^2}\log\Big(\frac{2.332x+1}{0.4286x+1}\Big),$$
where the interval $[0.4286, 2.332]$ is approximately the support of the population eigenvalues. The critical points are approximately $-0.333552$ and $-3.61753$. Therefore the left and right edges are
$$f(-3.61753) \approx 0.0859, \qquad f(-0.333552) \approx 4.3852,$$
and the outlier is $f(-\frac{1}{10}) \approx 10.8221 > 4.3852$.
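The support of the population eigenvalues follows from the symbol of the Toeplitz matrix, $g(\theta) = \sum_k 0.4^{|k|}e^{ik\theta} = 0.84/(1.16 - 0.8\cos\theta)$, whose extremes are $0.84/1.96 \approx 0.4286$ and $0.84/0.36 \approx 2.3333$. A short check (illustration only; the constants in $f$ are the rounded values quoted above):

```python
import numpy as np

M = 400
idx = np.arange(M)
T = 0.4 ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz entries 0.4^{|i-j|}
evals = np.linalg.eigvalsh(T)

# Extremes of the symbol g(theta) = 0.84 / (1.16 - 0.8 cos(theta)).
g_min, g_max = 0.84 / 1.96, 0.84 / 0.36

def f(x):
    # f with the rounded constants quoted in the example
    return (-1.0 / (2.0 * x)
            - np.log((2.332 * x + 1.0) / (0.4286 * x + 1.0)) / (3.8092 * x**2))
```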
Statistical applications. Now we use some concrete examples to explain how our results can be applied in high dimensional statistics.

Optimal shrinkage of eigenvalues. Donoho, Gavish and Johnstone [45] propose a framework to compute the optimal shrinkage of eigenvalues for a spiked covariance matrix model, where the outliers are assumed to lie on the right of the unique bulk component and $\sigma^b_i = 1$, $i = 1,2,\cdots,M$. They shrink the outlier eigenvalues using some nonlinear function and keep the bulk eigenvalues as ones via the rank-aware shrinkage rule. To extend their results, suppose that we want to estimate $\Sigma_g$ using $\widehat\Sigma_g := \sum_{i=1}^M \widehat\sigma^g_i u_i u_i^*$ under the rank-aware rule, where
$$\widehat\Sigma_g := \sum_{i \in \mathcal I_+} \beta(\mu_i)u_i u_i^* + \sum_{i \notin \mathcal I_+} \widehat\sigma^g_i u_i u_i^*, \qquad (2.3.34)$$
where $\widehat\sigma^g_i = \widehat\sigma^b_i$, $i \notin \mathcal I_+$, can be efficiently estimated using the spectrum estimation method, and $\beta$ is some nonlinearity. Our task is to find the optimal $\beta$.

The main conclusion is that the optimal $\beta$ depends strongly on the loss function. An advantage of the rank-aware rule (2.3.34) is that a rich class of loss functions decomposes well, so the problem of finding $\beta$ reduces to optimizing a low dimensional loss function. Theorems 2.3.10 and 2.3.13 can be used to improve their results. In detail, we need to modify the computation procedure:

1. Estimate the bulk spectrum and derive the form of $f$.

2. Calculate $l(\mu) = -\frac{1}{f^{-1}(\mu)}$.

3. Calculate $c(\mu) = \frac{1}{l(\mu)}\frac{f'(-1/l(\mu))}{f(-1/l(\mu))}$.

4. Calculate $s(\mu) = s(l(\mu))$ using $s(l) = \sqrt{1 - c^2(l)}$.

As the optimal $\beta$ depends only on $l(\mu)$, $c(\mu)$ and $s(\mu)$, we can substitute $l(\mu)$, $c(\mu)$ and $s(\mu)$ into the formulas of $\beta$ under the 26 loss functions. For instance, we list 10 different loss functions and their corresponding optimal $\beta$ in Table 2.2. It is notable that, as we will see from (2.3.41), the derivative of $f$ can be written as a sum over the $\sigma^b_i$, which reduces the computational burden. In Figure 2.5, we show the shrinkers computed from different loss functions for Examples 2.3.21 and 2.3.22.
Oracle estimation under Frobenius norm. If we have no prior information whatsoever on the true eigenbasis of the covariance matrix, the natural choice is to use the sample eigenvectors. We therefore seek a diagonal matrix $D$ minimizing $L(\Sigma_g, UDU^*)$, where $L(\cdot,\cdot)$ is some loss function; our estimator is then $\widehat\Sigma_g = UDU^*$. Taking the Frobenius norm as our loss function, we have
$$\|\Sigma_g - \widehat\Sigma_g\|_F^2 = \big(\|\Sigma_g\|_F^2 - \|\operatorname{diag}(U^*\Sigma_g U)\|_F^2\big) + \|D - \operatorname{diag}(U^*\Sigma_g U)\|_F^2. \qquad (2.3.35)$$
The first part of the right-hand side of (2.3.35) does not depend on $D$, hence we should take $D = \operatorname{diag}(U^*\Sigma_g U)$; the resulting estimator is called the oracle estimator because it involves the unknown $\Sigma_g$. The oracle estimator is a better choice than simply using the sample covariance matrix $\mathcal Q_g$. Unlike the rank-aware approach in the previous example, we consider shrinkage of all the eigenvalues, where
$$d_i \equiv \beta(\mu_i) := u_i^* \Sigma_g u_i, \qquad i = 1,2,\cdots,M. \qquad (2.3.36)$$
We assume that $\mathcal I = \mathcal I_+$ and rewrite (2.3.36) as
$$d_i = \sum_{j \in \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 + \sum_{j \notin \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2. \qquad (2.3.37)$$
For $i \in \mathcal I_+$, by (2.3.28), the first part of the right-hand side of (2.3.37) satisfies
$$\sum_{j \in \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 \to \frac{f'(-1/\sigma^g_i)}{f(-1/\sigma^g_i)}. \qquad (2.3.38)$$
For the second part of (2.3.37), we have
$$\sum_{j \notin \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 \to \frac{1}{N}\frac{(\sigma^g_i)^2}{f(-1/\sigma^g_i)}\sum_{j=1}^M \frac{(\sigma^b_j)^2}{(\sigma^g_i - \sigma^b_j)^2}. \qquad (2.3.39)$$
Meanwhile, inserting $-\frac{1}{\sigma^g_i}$ into (2.3.7), we get
$$f\Big(-\frac{1}{\sigma^g_i}\Big) = \sigma^g_i + \frac{1}{N}\sum_{j=1}^M \frac{1}{-(\sigma^g_i)^{-1} + (\sigma^b_j)^{-1}}. \qquad (2.3.40)$$
Differentiating both sides of (2.3.40) with respect to $\sigma^g_i$, we get
$$f'\Big(-\frac{1}{\sigma^g_i}\Big)(\sigma^g_i)^{-2} = 1 - \frac{1}{N}\sum_{j=1}^M \frac{(\sigma^b_j)^2}{(\sigma^g_i - \sigma^b_j)^2}. \qquad (2.3.41)$$
Therefore, by (2.3.37), (2.3.38), (2.3.39) and (2.3.41), when $i \in \mathcal I_+$ we have
$$d_i \to \frac{(\sigma^g_i)^2}{f(-1/\sigma^g_i)}. \qquad (2.3.42)$$
When $i \notin \mathcal I_+$, by Theorem 2.3.11, we can use the estimator derived in [77], where $d_i$ satisfies
$$d_i \to \frac{1}{\mu_i |\lim_{\eta \to 0} m_D(\mu_i + i\eta)|^2}. \qquad (2.3.43)$$
Under a mild assumption on the convergence rate of $\pi^b \to \pi^b_\infty$, we can replace $m_D(\mu_i + i\eta)$ with $m(\mu_i + i\eta)$; then we have $\big|\frac{1}{N}\operatorname{Tr} G(\mu_i + i\eta) - m(\mu_i + i\eta)\big| \le \frac{1}{N\eta}$. Therefore, the oracle estimator can be written as $\widehat\Sigma_g = UDU^*$, where $D = \operatorname{diag}(d_1, \cdots, d_M)$ and the $d_i$, $i = 1,2,\cdots,M$, are defined as
$$d_i := \frac{1}{\mu_i |m(\mu_i + iN^{-1/2})|^2}, \qquad m(\mu_i + iN^{-1/2}) = \frac{1}{N}\sum_{k=1}^N \frac{1}{\mu_k - \mu_i - iN^{-1/2}}. \qquad (2.3.44)$$
Note that the estimator (2.3.44) can be regarded as a nonlinear shrinkage of the sample eigenvalues. This can be understood as follows: denote $m_1(z) = \frac{1}{M}\operatorname{Tr} G_1(z)$, where $G_1(z)$ is the Green function of $\mathcal Q_b$. It is easy to check that
$$m(z) = \frac{c_N^{-1} - 1 + c_N^{-1} z m_1(z)}{z}.$$
As a consequence, we can rewrite
$$d_i = \frac{\mu_i}{|1 - c_N^{-1} - c_N^{-1}\mu_i m_1(\mu_i + iN^{-1/2})|^2}.$$
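A direct implementation of (2.3.44) is straightforward. In the sketch below (an illustration only, with standard Gaussian entries) the companion spectrum consists of the eigenvalues of $\mathcal Q_g$ padded with $N - M$ zeros. As a sanity check, when $\Sigma_g = I$ the oracle values $d_i = u_i^*\Sigma_g u_i$ equal one, so the estimated $d_i$ should be close to one for bulk eigenvalues:

```python
import numpy as np

def oracle_shrinkers(mu_g, N):
    """Estimator (2.3.44): d_i = 1 / (mu_i |m(mu_i + i N^{-1/2})|^2).
    mu_g: nonzero sample eigenvalues of Q_g (length M <= N)."""
    mu_all = np.concatenate([mu_g, np.zeros(N - len(mu_g))])  # companion spectrum
    z = mu_g + 1j * N ** (-0.5)
    m = np.mean(1.0 / (mu_all[None, :] - z[:, None]), axis=1)
    return 1.0 / (mu_g * np.abs(m) ** 2)

rng = np.random.default_rng(3)
M, N = 200, 400
X = rng.standard_normal((M, N)) / np.sqrt(N)
mu_g = np.linalg.eigvalsh(X @ X.T)          # Sigma_g = I here
d = oracle_shrinkers(mu_g, N)
bulk = d[M // 4: 3 * M // 4]                # stay away from the spectral edges
```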
Factor model based estimation. Factor models have been used heavily in empirical finance. In these applications, financial stocks share the same market risks and hence their returns can be highly correlated. The cross-sectional units are modeled using a few common factors [57]:
$$Y_{it} = b_i^* f_t + u_{it}, \qquad (2.3.45)$$
where $Y_{it}$ is the return of the $i$-th stock at time $t$, $b_i$ is a vector of factor loadings, $f_t$ is a $K \times 1$ vector of latent common factors and $u_{it}$ is the idiosyncratic component, which is uncorrelated with $f_t$. In matrix form, (2.3.45) reads $Y_t = Bf_t + u_t$. For identifiability, we impose the following constraints [58]: $\operatorname{Cov}(f_t) = I_K$ and the columns of $B$ are orthogonal. As a consequence, the population covariance matrix can be written as
$$\Sigma = BB^* + \Sigma_u. \qquad (2.3.46)$$
The model (2.3.46) can be written in our general covariance matrix model (2.3.13) by letting $\Sigma_b = \Sigma_u$ and $BB^* = VDV^*\Sigma_b$. The estimator can be computed via the least squares optimization
$$\arg\min_{B, F} \|Y - BF^*\|_F^2, \qquad N^{-1}F^*F = I_K, \ B^*B \text{ diagonal}. \qquad (2.3.47)$$
The least squares estimator for $B$ is $\widehat\Lambda = N^{-1}Y\widehat F$, where the columns of $\widehat F$ satisfy that $N^{-1/2}\widehat F_k$ is the eigenvector corresponding to the $k$-th largest eigenvalue of $Y^*Y$, $k = 1, \cdots, K$. Under some mild conditions, Fan, Liao and Mincheva [58] showed that $BB^*$ corresponds to the spiked part whereas $\Sigma_u$ corresponds to the bulk. In most applications, $\Sigma_u$ is assumed to have some sparse structure. Hence, the estimator can be written as
$$\widehat\Sigma = \widehat\Lambda_K \widehat\Lambda_K^* + \widehat\Sigma_u,$$
where $\widehat\Lambda_K$ collects the columns corresponding to the factors and $\widehat\Sigma_u$ is estimated using some thresholding method applied to the residuals.
Figure 2.3: Estimation loss using the factor model (Frobenius loss against dimension $M$). We simulate the estimation error under the Frobenius norm for Example 2.3.21, where the blue line stands for the sample covariance matrix estimator, red dots for our Multi-POET estimator and magenta dots for the POET estimator. We find that using information from the population covariance matrix improves the inference results.
Preliminaries. This section introduces the tools for our proofs. We start with some notation and definitions. For fixed small constants $\tau, \tau' > 0$, we define the domains
$$S \equiv S(\tau, N) := \{z \in \mathbb C_+ : |z| \ge \tau, \ |E| \le \tau^{-1}, \ N^{-1+\tau} \le \eta \le \tau^{-1}\}, \qquad (2.3.48)$$
$$S^e_i \equiv S^e_i(\tau', \tau, N) := \{z \in S : E \in [a_i - \tau', a_i + \tau']\}, \quad i = 1,2,\cdots,2p, \qquad (2.3.49)$$
$$S^b_i \equiv S^b_i(\tau', \tau, N) := \{z \in S : E \in [a_{2i} + \tau', a_{2i-1} - \tau']\}, \quad i = 1,2,\cdots,p, \qquad (2.3.50)$$
$$S^o \equiv S^o(\tau, \tau', N) := \{z \in S : \operatorname{dist}(E, \operatorname{Supp}(\rho)) \ge \tau'\}. \qquad (2.3.51)$$
It is notable that $S = \bigcup_{i=1}^{2p} S^e_i \cup \bigcup_{i=1}^p S^b_i \cup S^o$. Recall that the empirical spectral distribution (ESD) of an $N \times N$ symmetric matrix $H$ is defined as
$$F_H^{(N)}(\lambda) := \frac{1}{N}\sum_{i=1}^N \mathbf 1_{\{\lambda_i(H) \le \lambda\}}.$$

Definition 2.3.24 (Stieltjes transform). The Stieltjes transform of the ESD of $\mathcal Q_{1b}$ is given by
$$m(z) \equiv m^{(N)}(z) := \int \frac{1}{x - z}\, dF^{(N)}_{\mathcal Q_{1b}}(x) = \frac{1}{N}\operatorname{Tr} G(z), \qquad (2.3.52)$$
where we recall that $G(z)$ is the Green function of $\mathcal Q_{1b}$.
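The equality in (2.3.52) between the integral form and the normalized trace of the Green function is immediate to check numerically (a generic symmetric matrix stands in for $\mathcal Q_{1b}$ here):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N))
H = (A + A.T) / np.sqrt(2 * N)                # generic symmetric matrix
z = 0.3 + 0.1j                                # spectral parameter in C_+

G = np.linalg.inv(H - z * np.eye(N))          # Green function G(z) = (H - z)^{-1}
m_trace = np.trace(G) / N                     # (1/N) Tr G(z)
lam = np.linalg.eigvalsh(H)
m_esd = np.mean(1.0 / (lam - z))              # int (x - z)^{-1} dF_H(x)
```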
We further denote
$$\kappa_i(x) := |x - f(x_{2i-1})|, \qquad \kappa^d_{i,j} := N^{-2/3}\big(j \wedge (N_i + 1 - j)\big)^{2/3}. \qquad (2.3.53)$$
We will see from Lemma 2.3.34 that $\kappa^d_{i,j}$ is a deterministic version of $\kappa_i(\mu_{i,j})$. The following lemma determines the locations of the outlier eigenvalues of $\mathcal Q_g$. Denote by $G_b(z)$ the Green function of $\mathcal Q_b$ and by $\sigma(\mathcal Q_g)$ the spectrum of $\mathcal Q_g$.
Lemma 2.3.25. $\mu \in \sigma(\mathcal Q_g) \setminus \sigma(\mathcal Q_b)$ if and only if
$$\det\big((1 + \mu G_b(\mu))VDV^* + I\big) = 0. \qquad (2.3.54)$$
Recall (2.3.14); as $D$ is not invertible, we instead use the following corollary.

Corollary 2.3.26. $\mu \in \sigma(\mathcal Q_g) \setminus \sigma(\mathcal Q_b)$ if and only if
$$\det\big(D_o^{-1} + I + \mu V_o^* G_b(\mu) V_o\big) = 0. \qquad (2.3.55)$$
Next we introduce the local deformed MP law for $\mathcal Q_b$. Denote
$$\Psi(z) := \sqrt{\frac{\operatorname{Im} m(z)}{N\eta}} + \frac{1}{N\eta}.$$
Lemma 2.3.27. Fix $\tau > 0$. For the sample covariance matrix $\mathcal Q_b$ defined in (2.3.2) satisfying (2.3.4), suppose that Assumption 2.3.3 holds. Then for any unit vectors $u, v \in \mathbb R^M$, with probability $1 - N^{-D_1}$ we have, uniformly in $z \in S$,
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{\varepsilon_1}\Psi(z).$$
Furthermore, outside the spectrum, when $z \in S^e_o$, where $S^e_o$ is defined as
$$S^e_o := \{z \in S : \operatorname{dist}(E, \operatorname{Supp}(\rho)) \ge N^{-2/3+\tau}, \ |z| \le \tau^{-1}\}, \qquad (2.3.56)$$
we have, uniformly in $z \in S^e_o$,
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{\varepsilon_1}\sqrt{\frac{\operatorname{Im} m(z)}{N\eta}}.$$
The following lemma summarizes the results when $z$ is restricted to the real axis.

Lemma 2.3.28. For $z \in S^e_o \cap \mathbb R$ and any unit vectors $u, v \in \mathbb R^M$, with probability $1 - N^{-D_1}$ we have
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},$$
where $\kappa := \min_{1 \le i \le p} |E - f(x_{2i-1})|$.
Next we extend Weyl's interlacing theorem to our setting. We first discuss the case $r = 1$; the general case is a corollary. Denote by $G_g(z)$ the Green function of $\mathcal Q_g$.

Lemma 2.3.29. Let $r = 1$ in (2.3.11) and assume that the outlier is associated with the $k$-th bulk component, $k = 1,2,\cdots,p$. Recalling (2.3.24), define $s_k = \sum_{i=1}^{k-1} N_i$. Under Assumption 2.3.6, we have
$$\lambda_{s_k} \ge \mu_{s_k+1} \ge \lambda_{s_k+1} \ge \cdots \ge \lambda_{s_k+N_k}. \qquad (2.3.57)$$
It is easy to deduce the following corollary for the rank $r$ case.

Corollary 2.3.30. For the rank $r$ model defined in (2.3.11), we have
$$\mu_{s_k+i} \in [\lambda_{k,i+r_k}, \lambda_{k,i-r_k}], \qquad 1 \le i \le N_k,$$
where we use the convention $\lambda_{k,i-r_k} := +\infty$ when $i - r_k < 1$.

Under Assumptions 2.3.3 and 2.3.6, the convention can instead be taken as $\lambda_{k,i-r_k} = f(x_{2k-2})$ when $i - r_k < 1$. The following lemma establishes the connection between the Green functions of $\mathcal Q_b$ and $\mathcal Q_g$; it provides the key expression for analyzing the eigenvectors.
Lemma 2.3.31.
$$V_o^* G_g(z) V_o = \frac{1}{z}\bigg[D_o^{-1} - \frac{(1+D_o)^{1/2}}{D_o}\big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big)^{-1}\frac{(1+D_o)^{1/2}}{D_o}\bigg]. \qquad (2.3.58)$$
Lemma 2.3.32. Denote
$$\mathcal B := \Big\{m \in \mathbb R : m \neq 0, \ -\frac{1}{m} \notin \operatorname{Supp}(\pi^b_\infty)\Big\};$$
then we have
$$x \notin \operatorname{Supp}(\rho_D) \iff m_D(x) \in \mathcal B \text{ and } f_D'(m_D) > 0, \qquad (2.3.59)$$
$$f_D(m_D(z)) = z, \qquad m_D(f_D(z)) = z. \qquad (2.3.60)$$
Similar results hold for $f$, $m$ and $\rho$.
Next we collect the properties of $m(z)$ in the following lemma; its proof can be found in [74, Lemmas A.4 and A.5].

Lemma 2.3.33. For $z \in S$, we have
$$\operatorname{Im} m(z) \sim \begin{cases} \sqrt{\kappa + \eta}, & E \in \operatorname{Supp}(\rho); \\ \dfrac{\eta}{\sqrt{\kappa+\eta}}, & E \notin \operatorname{Supp}(\rho), \end{cases} \qquad (2.3.61)$$
and
$$\min_i |m(z) + (\sigma^b_i)^{-1}| \ge \tau, \qquad (2.3.62)$$
where $\tau > 0$ is some constant. Furthermore, if $z \in S^e_i$, we have
$$|m(z) - x_i| \sim \sqrt{\kappa + \eta}. \qquad (2.3.63)$$
We conclude this section by listing two important consequences of Lemma 2.3.27: eigenvalue rigidity and edge universality.

Lemma 2.3.34. Recall (2.3.24). Fix $\tau > 0$; under Assumption 2.3.3 and (2.3.4), for all $i = 1,\cdots,p$ and $j = 1,\cdots,N_i$ satisfying $\gamma_{i,j} \ge \tau$, with probability $1 - N^{-D_1}$ we have
$$|\lambda_{i,j} - \gamma_{i,j}| \le N^{-2/3+\varepsilon_1}\big(j \wedge (N_i + 1 - j)\big)^{-1/3}. \qquad (2.3.64)$$
Furthermore, let $\hat i := \lfloor (i+1)/2 \rfloor$ be the bulk component to which the edge $i$ belongs. For $0 < \tau' < \tau$ and $j = 1,\cdots,N_{\hat i}$ satisfying $\gamma_{\hat i,j} \in [a_i - \tau', a_i + \tau']$, with probability $1 - N^{-D_1}$ we have
$$|\lambda_{\hat i,j} - \gamma_{\hat i,j}| \le N^{-2/3+\varepsilon_1}\big(j \wedge (N_{\hat i} + 1 - j)\big)^{-1/3},$$
and for all $j = 1,2,\cdots,N_i$ satisfying $\gamma_{i,j} \in [a_{2i} + \tau', a_{2i-1} - \tau']$, we have
$$|\lambda_{i,j} - \gamma_{i,j}| \le \frac{N^{\varepsilon_1}}{N}.$$
For $i = 1,\cdots,2p$, define $\varpi_i := (|f''(x_i)|/2)^{1/3}$, and for any fixed $l \in \mathbb N$ and bulk component $i = 1,2,\cdots,p$, define
$$q_{2i-1,l} := \frac{N^{2/3}}{\varpi_{2i-1}}\big(\lambda_{i,1} - a_{2i-1}, \cdots, \lambda_{i,l} - a_{2i-1}\big),$$
$$q_{2i,l} := -\frac{N^{2/3}}{\varpi_{2i}}\big(\lambda_{i,N_i} - a_{2i}, \cdots, \lambda_{i,N_i-l+1} - a_{2i}\big).$$
Then for any fixed continuous bounded function $h \in C_b(\mathbb R^l)$, there exists $b(h,\pi) \equiv b_N(h,\pi)$, depending only on $\pi$, such that $\lim_{N\to\infty}\big(\mathbb E h(q_{i,l}) - b(h,\pi)\big) = 0$.
Eigenvalues. In this section we study the local asymptotics of the eigenvalues of $\mathcal Q_g$ and prove Theorems 2.3.10 and 2.3.11. We first treat the outlier and extremal non-outlier eigenvalues, then the bulk eigenvalues. We always use $C, C_1, C_2$ to denote generic large constants whose values may change from one line to the next.

Outlier eigenvalues. The outlier eigenvalues of $\mathcal Q_g$ are completely characterized by (2.3.55). The proof relies on two main steps: (i) recalling (2.3.13), for a configuration $D$ independent of $N$, establish two permissible regions, $\Gamma(D)$, consisting of $r_+$ components, and $I_0$, such that the outliers of $\mathcal Q_g$ lie in $\Gamma(D)$, each component contains exactly one eigenvalue, and the $r_b$ non-outliers lie in $I_0$; (ii) use a continuity argument to extend the result of (i) to an arbitrary $N$-dependent $D$. We will prove the results by contradiction.
We first observe that for $o_i \in \mathcal I$ (recall (2.3.10)),
$$-\frac{1}{1 + \sigma^b_{o_i} m\big(f(-(\sigma^g_{o_i})^{-1})\big)} = -d_{o_i}^{-1} - 1. \qquad (2.3.65)$$
We also observe that for some small constant $\nu > 0$, when $x \in [x_{2i-1} - \nu, x_{2i-1} + \nu]$, $i = 1,2,\cdots,p$, we have
$$f'(x) = O(|x - x_{2i-1}|), \qquad f(x) - f(x_{2i-1}) = O(|x - x_{2i-1}|^2). \qquad (2.3.66)$$
Proof of Theorem 2.3.10. We first deal with an $N$-independent configuration $D \equiv D(0)$. For $(i,j) \in \mathcal I_+$, denote $I_{i,j} \equiv I_{i,j}(D) = [I^-_{i,j}, I^+_{i,j}]$, where
$$I^\pm_{i,j} := f\Big(-\frac{1}{\sigma^g_{i,j}}\Big) \pm \Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}N^{-1/2+C_1\varepsilon_1}, \qquad (2.3.67)$$
and for $k = 1,\cdots,p$, define $I_k \equiv I_k(D)$ by
$$I_k := [f(x_{2k}) - N^{-2/3+C_2\varepsilon_1}, \ f(x_{2k-1}) + N^{-2/3+C_2\varepsilon_1}],$$
where $C_1, C_2 > 0$ will be specified later. We will show that with probability $1 - N^{-D_1}$, the complement of
$$\mathbf I := \Big(\bigcup_{i,j} I_{i,j}\Big) \cup \Big(\bigcup_k I_k\Big) = S_1 \cup S_2, \qquad (2.3.68)$$
contains no eigenvalue of $\mathcal Q_g$, where $S_1 := \bigcup_{i,j} I_{i,j}$ and $S_2 := \bigcup_k I_k$. This is summarized in the following lemma.

Lemma 2.3.35. Denote by $\sigma_+$ the set of outlier eigenvalues of $\mathcal Q_g$ associated with $\mathcal I_+$; then we have
$$\sigma_+ \subset S_1. \qquad (2.3.69)$$
Moreover, each interval $I_{i,j}$ contains precisely one eigenvalue of $\mathcal Q_g$. Furthermore, we have
$$\sigma_b \subset S_2, \qquad (2.3.70)$$
where $\sigma_b$ is the set of extremal non-outliers associated with $\mathcal I \setminus \mathcal I_+$.
Proof. We first assume that
$$2 < C_2 < C_1, \qquad C_1^2\varepsilon_1 < \varepsilon_0. \qquad (2.3.71)$$
Then, by (2.3.59), (2.3.66) and Lemma 2.3.34, with probability $1 - N^{-D_1}$ we have $S_1 \cap S_2 = \emptyset$. For $k = 1,2,\cdots,p$, denote
$$L^+_k := f\Big(-\frac{1}{\sigma^g_{k,r^+_k}}\Big) - N^{-1/2+C_1\varepsilon_1}\Big(-\frac{1}{\sigma^g_{k,r^+_k}} - x_{2k-1}\Big)^{1/2}.$$
We only prove the claim for the first bulk component; all the others follow by induction. For any $x > L^+_1$ it is easy to check, using (2.3.71) and Lemma 2.3.34, that $x \notin \sigma(\mathcal Q_b)$. Under the assumption (2.3.4), with probability $1 - N^{-D_1}$ we have $\mu_1(\mathcal Q_g) \le C$ for some large constant $C$. Recalling (2.3.10), by Lemma 2.3.28, with probability $1 - N^{-D_1}$ we have
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) = \prod_{i=1}^r\Big(d_{o_i}^{-1} + 1 - \frac{1}{1 + m(x)\sigma^b_{o_i}}\Big) + O\big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4}\big),$$
where we use the fact that $r$ is finite. First, suppose that $x \in (L^+_1, C) \setminus \bigcup_j I_{1,j}$. By (2.3.65), (2.3.66), Lemma 2.3.33 and the inverse function theorem, it is easy to conclude that
$$\prod_{i=1}^r\Big(d_{o_i}^{-1} + 1 - \frac{1}{1 + m(x)\sigma^b_{o_i}}\Big) \ge N^{-1/2+(C_1-1)\varepsilon_1}\kappa_x^{-1/4}. \qquad (2.3.72)$$
Hence $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular provided $C_1 > 2$, which implies that $x$ is not an eigenvalue of $\mathcal Q_g$ by Corollary 2.3.26. Secondly, we consider the extremal non-outlier eigenvalues. Using Corollary 2.3.30, with probability $1 - N^{-D_1}$ we find that
$$\mu_{1,j} \ge f(x_1) - N^{-2/3+C_3\varepsilon_1}, \qquad j = r^+_1, \cdots, r_1,$$
where we use Lemma 2.3.34 and $C_3 > 0$ is some constant. We may assume that $L^+_1 \ge f(x_1) + N^{-2/3+C_3\varepsilon_1}$; otherwise the proof is already done. When $x \notin I_1$, i.e. $x \in (f(x_1) + N^{-2/3+C_3\varepsilon_1}, L^+_1)$, a control analogous to (2.3.72) is not available, and we need a different strategy. Denote
$$z := x + iN^{-2/3-\varepsilon_2}, \qquad 0 < \varepsilon_2 < \varepsilon_1.$$
As $r$ is finite, it is easily checked by the spectral decomposition (see the proof of [35, Lemma 5.3]) that there exists some constant $C > 0$ such that
$$\|xV_o^* G_b(x)V_o - zV_o^* G_b(z)V_o\| \le C\max_i \operatorname{Im}\langle v_i, G_b(z)v_i\rangle. \qquad (2.3.73)$$
By Lemma 2.3.27, with probability $1 - N^{-D_1}$ we have
$$xV_o^* G_b(x)V_o = -\frac{1}{1 + m(z)D_o^b} + O\big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4}\big).$$
As a consequence,
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) = O\Big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4} + \max_{1 \le j \le r^+_1}\big|m(z) + (\sigma^g_{1,j})^{-1}\big|\Big).$$
The claim then follows from
$$\max_{1 \le j \le r^+_1}\big|m(z) + (\sigma^g_{1,j})^{-1}\big| = \max_{1 \le j \le r^+_1}\big|m(z) - x_1 + x_1 + (\sigma^g_{1,j})^{-1}\big| \ge N^{-1/3+C_3\varepsilon_1},$$
where we use Lemma 2.3.33. Hence $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular provided $C_3 > 1$.
Similarly, for all $k = 1,2,\cdots,p$, we can show by induction that when $x \in (L^+_k, C) \setminus S_1$ and $x \notin S_2$, the matrix $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular. Lemma 2.3.35 will be proved once we show that each interval $I_{i,j}$ contains precisely one eigenvalue of $\mathcal Q_g$. Let $(i,j) \in \mathcal I_+$ and pick a small $N$-independent counterclockwise (positively oriented) contour $\mathcal C \subset \mathbb C \setminus \bigcup_{k=1}^p [a_{2k}, a_{2k-1}]$ that encloses $f(-\frac{1}{\sigma^g_{i,j}})$ but no other point. For large enough $N$, define
$$F(z) := \det\big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big), \qquad G(z) := \det\Big(D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}\Big).$$
$F(z)$ and $G(z)$ are holomorphic on and inside $\mathcal C$, and $G(z)$ has precisely $r_+$ zeros $f(-\frac{1}{\sigma^g_{i,j}})$ inside $\mathcal C$. On $\mathcal C$, it is easy to check that for some constant $\delta > 0$,
$$\min_{z \in \mathcal C}|G(z)| \ge \delta > 0, \qquad |G(z) - F(z)| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},$$
where we use Lemma 2.3.27. The claim hence follows from Rouché's theorem.
In a second step, we extend the proof to any configuration $D(1)$ depending on $N$ via a continuity (bootstrap) argument along a continuous path. We first deal with (2.3.22). As $r$ is finite, we can choose a path $(D(t) : 0 \le t \le 1)$ connecting $D(0)$ and $D(1)$ with the following properties:

(i) For all $t \in [0,1]$, recalling (2.3.14) and (2.3.16), for $(i,j) \in \mathcal I_+$ we have $\sigma^g_{i,j}(t) \in O^+_i$, $i = 1,2,\cdots,p$, $j = 1,2,\cdots,r^+_i$, where $\sigma^g_{i,j}(t) = (1 + d_{i,j}(t))\sigma^b_{i,j}$.

(ii) For the $i$-th bulk component, $i = 1,2,\cdots,p$, if $I_{i,j_1}(D(1)) \cap I_{i,j_2}(D(1)) = \emptyset$ for a pair $1 \le j_1 < j_2 \le r^+_i$, then $I_{i,j_1}(D(t)) \cap I_{i,j_2}(D(t)) = \emptyset$ for all $t \in [0,1]$.

Recalling (2.3.15), denote $\mathcal Q_g(t) := \Sigma_g^{1/2}(t)XX^*\Sigma_g^{1/2}(t)$, where $\Sigma_g(t) = (1 + VD(t)V^*)\Sigma_b$. As the mapping $t \to \mathcal Q_g(t)$ is continuous, $\mu_{i,j}(t)$ is continuous in $t \in [0,1]$ for all $(i,j)$, where the $\mu_{i,j}(t)$ are the eigenvalues of $\mathcal Q_g(t)$. Moreover, by Lemma 2.3.35, we have
$$\sigma_+(\mathcal Q_g(t)) \subset S_1(t), \qquad \forall\, t \in [0,1]. \qquad (2.3.74)$$
We focus on the $i$-th bulk component. In the case when the $r^+_i$ intervals are disjoint, we have
$$\mu_{i,j}(t) \in I_{i,j}(D(t)), \qquad t \in [0,1],$$
where we use property (ii) of the continuous path, (2.3.74) and the continuity of $\mu_{i,j}(t)$. In particular, this holds for $D(1)$. Now consider the case when the intervals are not disjoint. Recalling (2.3.17), denote by $\mathcal B$ the partition of $\mathcal I^+_i$ induced by the equivalence relation
$$j_1 \equiv j_2 \ \text{ if } \ I_{i,j_1}(D(1)) \cap I_{i,j_2}(D(1)) \neq \emptyset.$$
We can therefore decompose $\mathcal B = \cup_j B_j$; it is notable that each $B_j$ contains a sequence of consecutive integers. Choose any $s \in B_j$; without loss of generality, assume $s$ is not the smallest element of $B_j$. Since the intervals are not disjoint, for some constant $C > 0$ we have
$$-\frac{1}{\sigma^g_{i,s-1}} + \frac{1}{\sigma^g_{i,s}} \le 2N^{-1/2+C\varepsilon_1}\Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{-1/2},$$
where we use the fact that $f''(x) \ge 0$ when $x$ is close to the right edge of each bulk component, together with (2.3.66) and (2.3.67). This yields that
$$\Big(-\frac{1}{\sigma^g_{i,s-1}} - x_{2i-1}\Big)^{1/2} \le \Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{1/2}\Bigg(1 + \frac{-\frac{1}{\sigma^g_{i,s-1}} + \frac{1}{\sigma^g_{i,s}}}{-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}}\Bigg) \le \Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{1/2}(1 + o(1)).$$
Therefore, by repeating the process for the remaining $s \in B_j$, we find
$$\operatorname{diam}\Big(\bigcup_{s \in B_j} I_{i,s}(D(1))\Big) \le CN^{-1/2+C\varepsilon_0}\min_{s \in B_j}\Big(-\frac{1}{\sigma^g_{i,s}(1)} - x_{2i-1}\Big)^{1/2}(1 + o(1)),$$
where we use the fact that $r = O(1)$. This immediately yields that
$$\Big|\mu_{i,j}(1) - f\Big(-\frac{1}{\sigma^g_{i,j}(1)}\Big)\Big| \le N^{-1/2+C\varepsilon_0}\Big(-\frac{1}{\sigma^g_{i,j}(1)} - x_{2i-1}\Big)^{1/2},$$
for some constant $C > 0$. This completes the proof of (2.3.22). Finally, we deal with the extremal non-outlier eigenvalues (2.3.23). By the continuity of $\mu_{i,j}(t)$ and Lemma 2.3.35, we have
$$\sigma_0(\mathcal Q_g(t)) \subset S_2(t), \qquad t \in [0,1]. \qquad (2.3.75)$$
In particular this holds for $D(1)$. The proof then follows from Corollary 2.3.26 and Lemma 2.3.34.
Eigenvalue sticking. In this subsection we prove the eigenvalue sticking property of $\mathcal Q_g$. Similar to the proof of (2.3.23), it contains three main steps: (i) establishing a forbidden region which, with high probability, contains no eigenvalues of $\mathcal Q_g$; (ii) a counting argument for the eigenvalues when the forbidden region does not depend on $N$; and (iii) a continuity argument extending the result of (ii) to an arbitrary $N$-dependent region.

Proof of Theorem 2.3.11. We start with step (i). For definiteness, we focus on the $i$-th bulk component. Define
$$\eta := N^{-1+3\varepsilon_1}(\alpha^i_+)^{-1};$$
note that $\eta \le N^{-2/3+\varepsilon_1}$. We first show that any $x$ satisfying
$$x \in [f(x_{2i-1}) - C_1, \ f(x_{2i-1}) + N^{-2/3+2\varepsilon_1}], \qquad \operatorname{dist}(x, \sigma(\mathcal Q_b)) \ge \eta, \qquad (2.3.76)$$
is not an eigenvalue of $\mathcal Q_g$, where $0 < C_1 < f(x_{2i-1}) - f(x_{2i})$ is some constant. The discussion is similar to that of (2.3.23). We observe that $2|\lambda_i - x| \ge \sqrt{(\lambda_i - x)^2 + \eta^2}$ by (2.3.76). Denote $z := x + i\eta$; by Lemma 2.3.27 we conclude that
$$\max_{i \in \mathcal I} \operatorname{Im}\langle v_i, G_b(z)v_i\rangle \le \max\Big\{\sqrt{\kappa + \eta}, \ \frac{\eta}{\sqrt{\kappa+\eta}}, \ \eta\Big\},$$
where we use (2.3.61), (2.3.62) and (2.3.76). Hence, by Lemma 2.3.27 and (2.3.73), with probability $1 - N^{-D_1}$,
$$D_o^{-1} + 1 + xV_o^* G_b(x)V_o = D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b} + O\Big(N^{\varepsilon_1}\Psi(z) + \max\Big\{\sqrt{\kappa+\eta}, \frac{\eta}{\sqrt{\kappa+\eta}}, \eta\Big\}\Big).$$
Furthermore, by (2.3.62) and (2.3.63), we have
$$\det\Big(D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}\Big) \ge C\big|\alpha^i_+ - \sqrt{\kappa + \eta}\big|.$$
Therefore, when $\kappa \le N^{-2\varepsilon_1}(\alpha^i_+)^2$, with probability $1 - N^{-D_1}$ we have
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) \ge O\big(\alpha^i_+ - N^{-\varepsilon_1}\alpha^i_+\big). \qquad (2.3.77)$$
This implies that $x$ is not an eigenvalue of $\mathcal Q_g$ by Corollary 2.3.26. Similarly, by Theorem 2.3.10 and Lemma 2.3.34, we conclude that for $j \le N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, the set
$$\Big\{x \in [\lambda_{i,j-r_i-1}, \ f(x_{2i-1}) + N^{-2/3+2\varepsilon_1}] : \operatorname{dist}(x, \sigma(\mathcal Q_b)) > N^{-1+3\varepsilon_1}(\alpha^i_+)^{-1}\Big\} \qquad (2.3.78)$$
contains no eigenvalue of $\mathcal Q_g$. Next we use a standard counting argument to locate the eigenvalues of $\mathcal Q_g$ in terms of those of $\mathcal Q_b$ for any fixed configuration $D \equiv D(0)$; we summarize it in the following lemma.

Lemma 2.3.36. For $j \le N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, with probability $1 - N^{-D_1}$ we have
$$|\mu_{i,j+r^+_i} - \lambda_{i,j}| \le \frac{N^{2\varepsilon_1}}{N\alpha^i_+}. \qquad (2.3.79)$$
For the case $j > N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, using Corollary 2.3.30 and (2.3.64), we find that
$$|\mu_{i,j+r^+_i} - \lambda_{i,j}| \le N^{-2/3+\varepsilon_1}j^{-1/3} \le \frac{N^{2\varepsilon_1}}{N\alpha^i_+}. \qquad (2.3.80)$$
In a third step, we extend the proof to any configuration $D(1)$ depending on $N$ by using the continuity argument, Corollary 2.3.30, Lemma 2.3.34, (2.3.78) and (2.3.79). We summarize this in the following lemma, whose proof is similar to the second step of Theorem 2.3.10.

Lemma 2.3.37. Lemma 2.3.36 and (2.3.80) hold true for any $N$-dependent configuration $D(1)$.
Outlier eigenvectors. For $(i,j) \in \mathcal I_+$, we define the contour $\gamma_{i,j} := \partial\Upsilon_{i,j}$ as the boundary of the disk $B_{\rho_{i,j}}(-\frac{1}{\sigma^g_{i,j}})$, where $\rho_{i,j}$ is defined as (recall (2.3.21))
$$\rho_{i,j} := \frac{\nu_{i,j} \wedge \big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\big)}{2}, \qquad i = 1,2,\cdots,p, \ j = 1,2,\cdots,r^+_i.$$
Under Assumption 2.3.8, it is easy to check that
$$\rho_{i,j} \ge \frac12\Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}N^{-1/2+\varepsilon_0}. \qquad (2.3.81)$$
We further define
$$\Gamma_{i,j} := f(\gamma_{i,j}). \qquad (2.3.82)$$
We summarize the basic properties of the contour in the following lemma.

Lemma 2.3.38. Recall (2.3.56). For $(i,j) \in \mathcal I_+$, we have $\Gamma_{i,j} \subset S^e_o$. Furthermore, each outlier $\mu_{i,j}$ lies in $\bigcup_{\mathcal I_+}\Gamma_{i,j}$, and all the other eigenvalues lie in the complement of $\bigcup_{\mathcal I_+}\Gamma_{i,j}$.
Proof of Theorem 2.3.13. We first prove the following proposition, in which we assume that all the outliers lie on the right of the first bulk component; the general case is an easy corollary.

Proposition 2.3.39. Theorem 2.3.13 holds true when all the outliers are on the right of the first bulk component.

Proof. When all the outliers are on the right of the first bulk component, we have $\mathcal I_+ = \{1,2,\cdots,r_+\}$. By Lemmas 2.3.27 and 2.3.33, we can choose an event $\Xi$ of probability $1 - N^{-D_1}$ such that, for $N$ large enough and all $z \in S^e_o$,
$$\mathbf 1(\Xi)\Big\|-\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o\Big\| \le (\kappa + \eta)^{-1/4}N^{-1/2+\varepsilon_1}. \qquad (2.3.83)$$
For $i,j \in \mathcal I_+$, by the spectral decomposition, Cauchy's integral formula and Theorem 2.3.10, we have
$$\langle u_i, v_j\rangle^2 = -\frac{1}{2\pi i}\oint_{\Gamma_i} \langle v_j, G_g(z)v_j\rangle\, dz = -\frac{1}{2\pi i}\oint_{\gamma_i} \langle v_j, G_g(f(\zeta))v_j\rangle f'(\zeta)\, d\zeta. \qquad (2.3.84)$$
Furthermore, by (2.3.58) and Cauchy's integral theorem, we can write
$$\langle u_i, v_j\rangle^2 = -\frac{1}{2\pi i}\oint_{\Gamma_i} [V_o^* G_g(z)V_o]_{jj}\, dz = \frac{1+d_j}{d_j^2}\,\frac{1}{2\pi i}\oint_{\Gamma_i} \big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big)^{-1}_{jj}\, \frac{dz}{z}. \qquad (2.3.85)$$
Now we introduce the decomposition
$$D_o^{-1} + 1 + zV_o^* G_b(z)V_o = D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b} - \Delta(z), \qquad \Delta(z) := -\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o. \qquad (2.3.86)$$
It is notable that $\Delta(z)$ is well controlled by (2.3.83). Applying a second order resolvent expansion to (2.3.86) and using (2.3.85), we obtain the decomposition
$$\langle u_i, v_j\rangle^2 = \frac{1+d_j}{d_j^2}(s_1 + s_2 + s_3),$$
where $s_1, s_2, s_3$ are defined as
$$s_1 := \frac{1}{2\pi i}\oint_{\Gamma_i} \frac{1}{d_j^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_j}}\, \frac{dz}{z},$$
$$s_2 := \frac{1}{2\pi i}\oint_{\Gamma_i} \Bigg(\frac{1}{d_j^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_j}}\Bigg)^2 (\Delta(z))_{jj}\, \frac{dz}{z},$$
$$s_3 := \frac{1}{2\pi i}\oint_{\Gamma_i} \Bigg(\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Bigg)_{jj} \frac{dz}{z}.$$
First of all, the convergent limit is characterized by $s_1$: by the residue theorem and (2.3.60), we have
$$\frac{1+d_j}{d_j^2}\, s_1 = \frac{1}{\sigma^g_j}\,\frac{1}{2\pi i}\oint_{\gamma_i} \frac{f'(\zeta)}{f(\zeta)}\,\frac{1 + \zeta\sigma^b_j}{\zeta + (\sigma^g_j)^{-1}}\, d\zeta = \delta_{ij}\,\frac{1}{\sigma^g_i}\,\frac{f'(-1/\sigma^g_i)}{f(-1/\sigma^g_i)}.$$
Next we control $s_2$ and $s_3$. For $s_2$, we rewrite it as
$$s_2 = \frac{d_j^2}{(\sigma^g_j)^2\, 2\pi i}\oint_{\gamma_i} \frac{h_{jj}(\zeta)}{(\zeta + \frac{1}{\sigma^g_j})^2}\, d\zeta, \qquad h_{jj}(\zeta) := (1 + \zeta\sigma^b_j)^2(\Delta(f(\zeta)))_{jj}\,\frac{f'(\zeta)}{f(\zeta)}.$$
As $h_{jj}(\zeta)$ is holomorphic inside the contour $\gamma_i$, by (2.3.66) and (2.3.83) we conclude that with probability $1 - N^{-D_1}$,
$$|h_{jj}(\zeta)| \le |\zeta - x_1|^{1/2}N^{-1/2+\varepsilon_1}. \qquad (2.3.87)$$
By Cauchy's differentiation formula, we have
$$h'_{jj}(\zeta) = \frac{1}{2\pi i}\oint_{\mathcal C} \frac{h_{jj}(\xi)}{(\xi - \zeta)^2}\, d\xi, \qquad (2.3.88)$$
where $\mathcal C$ is the circle of radius $\frac{|\zeta - x_1|}{2}$ centered at $\zeta$. Hence, by (2.3.87), (2.3.88) and the residue theorem, with probability $1 - N^{-D_1}$ we have
$$|h'_{jj}(\zeta)| \le |\zeta - x_1|^{-1/2}N^{-1/2+\varepsilon_1}. \qquad (2.3.89)$$
When $i = j$, by the residue theorem and (2.3.89), we have
$$|s_2| = \bigg|\frac{d_i^2}{(\sigma^g_i)^2}\, h'_{ii}\Big(-\frac{1}{\sigma^g_i}\Big)\bigg| \le \frac{d_i^2}{(\sigma^g_i)^2}\Big(-\frac{1}{\sigma^g_i} - x_1\Big)^{-1/2}N^{-1/2+\varepsilon_1}.$$
When $i \neq j$, by Assumption 2.3.8 and the residue theorem, we have $|s_2| = 0$. Finally, we estimate $s_3$. Here the residue calculation is not available, and we need to choose a precise contour. We summarize the estimate in the following lemma, whose proof can be found in [17, Section 5.1].

Lemma 2.3.40. When $N$ is large enough, there exists some constant $C > 0$ such that, with probability $1 - N^{-D_1}$,
$$|s_3| \le C\,\frac{d_j^2}{(\sigma^g_j)^2}\,N^{-1+2\varepsilon_1}\bigg(\frac{1}{\nu_j^2} + \frac{\mathbf 1(i=j)}{(-\frac{1}{\sigma^g_j} - x_1)^2}\bigg).$$
This concludes the proof.

The proof of Theorem 2.3.13 is the same as that of Proposition 2.3.39, except that we need to change the indices for each bulk component.
Non-outlier eigenvectors. For the non-outlier eigenvectors, the residue calculation is not available.

Proof of Theorem 2.3.15. We first suppose that all the outliers are on the right of the first bulk component and focus on the $l$-th bulk component; the proof of the general case is similar. For simplicity, we write $\mu_j$ for $\mu_{l,j}$ and $u_j$ for $u_{l,j}$. Throughout this subsection we use the spectral parameter $z = \mu_j + i\eta$, where $\eta$ is the unique smallest solution of
$$\operatorname{Im} m(z) = N^{-1+6\varepsilon_1}\eta^{-1}. \qquad (2.3.90)$$
As a consequence, with probability $1 - N^{-D_1}$, Lemma 2.3.27 reads
$$\Big\|-\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o\Big\| \le \frac{N^{4\varepsilon_1}}{N\eta}. \qquad (2.3.91)$$
Using the spectral decomposition, we have
$$\langle v_i, u_j\rangle^2 \le \eta\, v_i^* \operatorname{Im} G_g(z)\, v_i. \qquad (2.3.92)$$
Recalling (2.3.86), by (2.3.58) and a simple resolvent expansion we get
$$\langle v_i, G_g(z)v_i\rangle = \frac{1}{z}\Bigg[\frac{1}{d_i} - \frac{1+d_i}{d_i^2}\Bigg(\frac{1}{d_i^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_i}} + \Bigg(\frac{1}{d_i^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_i}}\Bigg)^2(\Delta(z))_{ii}$$
$$+ \Bigg(\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Bigg)_{ii}\Bigg)\Bigg]. \qquad (2.3.93)$$
We first observe that
$$\min_i |m(z) + (\sigma^g_i)^{-1}| \ge \operatorname{Im} m(z) \gg \|\Delta(z)\|,$$
where we use (2.3.91). This yields that
$$\Big\|\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Big\| \le \frac{C}{\operatorname{Im} m(z)}.$$
Therefore we get from (2.3.93) that
$$z\langle v_i, G_g(z)v_i\rangle = \frac{1}{d_i\sigma^b_i}\,\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} + O\Bigg(\frac{1}{\sigma^g_i\sigma^b_i}\,\frac{|1 + m(z)\sigma^b_i|^2}{|m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N\eta}\Bigg).$$
Hence, combining with (2.3.92), we have
$$\langle v_i, u_j\rangle^2 \le \eta \operatorname{Im} \frac{1 + m(z)\sigma^b_i}{z(m(z) + (\sigma^g_i)^{-1})} + O\Bigg(\frac{|1 + m(z)\sigma^b_i|^2}{|z||m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N}\Bigg)$$
$$\le \Bigg[\frac{\eta^2}{|z|^2}\operatorname{Re}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} + \frac{\mu_j\eta}{|z|^2}\operatorname{Im}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}}\Bigg] + O\Bigg(\frac{|1 + m(z)\sigma^b_i|^2}{|z||m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N}\Bigg). \qquad (2.3.94)$$
Under Assumption 2.3.3, by Corollary 2.3.30 and Lemma 2.3.34, we have $|\mu_j| \ge \tau$, where $\tau > 0$ is some constant. On the one hand,
$$\frac{\eta^2}{|z|^2}\operatorname{Re}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} \le \frac{CN^{C\varepsilon_1}}{N|m(z) + (\sigma^g_i)^{-1}|^2},$$
where $C > 0$ is some large constant. Similarly,
$$\frac{\mu_j\eta}{|z|^2}\operatorname{Im}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} \le \frac{C\eta}{|m(z) + (\sigma^g_i)^{-1}|^2}\operatorname{Im} m(z).$$
Therefore, we conclude from (2.3.62), (2.3.94) and the definition (2.3.90) that
$$\langle v_i, u_j\rangle^2 \le \frac{N^{6\varepsilon_1}}{N|m(z) + (\sigma^g_i)^{-1}|^2}.$$
It is easy to check that
$$m(z) + (\sigma^g_i)^{-1} = m(z) - x_1 + x_1 + \frac{1}{\sigma^g_i} \sim \sqrt{\kappa + \eta} + \Big|\frac{1}{\sigma^g_i} + x_1\Big|,$$
where we use (2.3.63). This yields that
$$\big|\operatorname{Re}\big(m(z) + (\sigma^g_i)^{-1}\big)\big| = O\Big(\sqrt{\kappa + \eta} + \Big|\frac{1}{\sigma^g_i} + x_1\Big|\Big). \qquad (2.3.95)$$
We conclude that
$$|m(z) + (\sigma^g_i)^{-1}|^2 \ge C\big(\kappa^d_{l,j} + ((\sigma^g_i)^{-1} + x_1)^2\big).$$
This concludes our proof.
Proof of Lemma 2.3.25. Using the identity $\det(1+XY) = \det(1+YX)$, $\mu$ is an eigenvalue of $Q_g$ if and only if
\begin{align*}
0 &= \det(\Sigma_g^{1/2}XX^*\Sigma_g^{1/2}-\mu) = \det(X^*\Sigma_g X-\mu) = \det(XX^*\Sigma_b(1+VDV^*)-\mu) \\
&= \det(\Sigma_b^{1/2}(1+VDV^*)XX^*\Sigma_b^{1/2}-\mu) = \det(Q_b-\mu)\det(G_b(\mu)VDV^*Q_b+1) \\
&= \det(Q_b-\mu)\det(Q_bG_b(\mu)VDV^*+1).
\end{align*}
Using the Woodbury matrix identity
\[
(A+SBT)^{-1} = A^{-1} - A^{-1}S(B^{-1}+TA^{-1}S)^{-1}TA^{-1}, \tag{2.3.96}
\]
and
\[
Q_bG_b(\mu) = Q_b(Q_b-\mu)^{-1} = (1-Q_b^{-1}\mu)^{-1},
\]
we find that
\[
0 = \det\big((1-Q_b^{-1}\mu)^{-1}VDV^*+1\big) = \det\big((1+\mu G_b(\mu))VDV^*+1\big).
\]
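The two algebraic facts driving this proof, Sylvester's determinant identity $\det(1+XY)=\det(1+YX)$ and the Woodbury identity (2.3.96), are easy to sanity-check numerically. A minimal sketch with random matrices (the dimensions and the diagonal shifts used for conditioning are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2

# Sylvester's determinant identity: det(I + X Y) = det(I + Y X).
X = rng.standard_normal((p, r))
Y = rng.standard_normal((r, p))
lhs = np.linalg.det(np.eye(p) + X @ Y)
rhs = np.linalg.det(np.eye(r) + Y @ X)

# Woodbury: (A + S B T)^{-1} = A^{-1} - A^{-1} S (B^{-1} + T A^{-1} S)^{-1} T A^{-1}.
A = rng.standard_normal((p, p)) + 5 * np.eye(p)   # shift keeps A well conditioned
B = rng.standard_normal((r, r)) + 3 * np.eye(r)
S = rng.standard_normal((p, r))
T = rng.standard_normal((r, p))
Ainv = np.linalg.inv(A)
lhs2 = np.linalg.inv(A + S @ B @ T)
rhs2 = Ainv - Ainv @ S @ np.linalg.inv(np.linalg.inv(B) + T @ Ainv @ S) @ T @ Ainv

print(abs(lhs - rhs))                 # numerically zero
print(np.max(np.abs(lhs2 - rhs2)))   # numerically zero
```

Note that the Woodbury step in the proof only needs the identity for the low-rank ($r \ll p$) case, which is exactly where it pays off computationally: the inner inverse is $r \times r$.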
Proof of Corollary 2.3.26. By (2.3.14), (2.3.54) and the identity $\det(1+XY) = \det(1+YX)$, we have
\[
\det\big((1+\mu G_b(\mu))VDV^*+1\big) = 0 \iff \det\big(V_o^*(1+\mu G_b(\mu))V_oD_o+1\big) = 0;
\]
the proof follows from the fact that $D_o$ is invertible.
Proof of Lemma 2.3.29. We first write
\[
(1+VDV^*)^{1/2}G_g(z)(1+VDV^*)^{1/2} = \Big(\Sigma_b^{1/2}XX^*\Sigma_b^{1/2} - z(1+VDV^*)^{-1}\Big)^{-1}
= G_b(z) - G_b(z)V_o\frac{z}{D_o^{-1}+1+zV_o^*G_b(z)V_o}V_o^*G_b(z),
\]
where in the second equality we use (2.3.96). For simplicity, we now omit the indices for $v$, $d$, and in this case $V_o = v$. Denoting $G_{g,b}^{vv}(z) := \langle v, G_{g,b}(z)v\rangle$, we have
\[
G_g^{vv}(z) = \frac{1}{d+1}G_b^{vv}(z) - \frac{1}{d+1}\big(G_b^{vv}(z)\big)^2\frac{z}{d^{-1}+1+zG_b^{vv}(z)},
\]
which implies that
\[
\frac{1}{G_b^{vv}(z)} + \frac{z}{d^{-1}+1} = \frac{1}{d+1}\,\frac{1}{G_g^{vv}(z)}.
\]
Writing this in spectral decomposition yields that
\[
\Big(\sum_i \frac{\langle v, u_i^b\rangle^2}{\lambda_i - z}\Big)^{-1} = \frac{1}{d+1}\Big(\sum_i \frac{\langle v, u_i\rangle^2}{\mu_i - z}\Big)^{-1} - \frac{z}{d^{-1}+1}. \tag{2.3.97}
\]
It is notable that the left-hand side of (2.3.97) defines a function of z ∈ (0,∞) with M−1
singularities and M zeros, which is smooth and decreasing away from the singularities.
Moreover, its zeros are the eigenvalues of Qb. Similar results hold for Qg. Hence, if z is
an eigenvalue of Qg, we should have
\[
\Big(\sum_i \frac{\langle v, u_i^b\rangle^2}{\lambda_i - z}\Big)^{-1} = -\frac{z}{d^{-1}+1}.
\]
We can then conclude our proof using the monotone decreasing property of the left-hand
side of (2.3.97).
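The characterization above can be verified numerically on a small example: take a rank-one multiplicative perturbation $Q_g = A^{1/2}Q_bA^{1/2}$ with $A = 1 + d\,vv^*$, and check that every eigenvalue $\mu$ of $Q_g$ solves the secular equation $\big(\sum_i \langle v,u_i^b\rangle^2/(\lambda_i-\mu)\big)^{-1} = -\mu/(d^{-1}+1)$. A minimal sketch (the matrix size and the perturbation strength $d$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 5, 2.0

# Random positive definite Q_b with eigenpairs (lambda_i, u_i^b).
G = rng.standard_normal((M, M))
Qb = G @ G.T + np.eye(M)
lam, Ub = np.linalg.eigh(Qb)

# Rank-one multiplicative perturbation: A^{1/2} = I + (sqrt(1+d) - 1) v v^*.
v = rng.standard_normal(M)
v /= np.linalg.norm(v)
A_half = np.eye(M) + (np.sqrt(1 + d) - 1) * np.outer(v, v)
Qg = A_half @ Qb @ A_half
mu = np.linalg.eigvalsh(Qg)

# Each eigenvalue mu of Q_g satisfies
#   (sum_i <v, u_i^b>^2 / (lambda_i - mu))^{-1} = -mu / (d^{-1} + 1).
weights = (Ub.T @ v) ** 2
residual = [1.0 / np.sum(weights / (lam - m)) + m / (1.0 / d + 1.0) for m in mu]
print(np.max(np.abs(residual)))   # numerically zero
```

This is precisely the monotonicity argument of the proof in computational form: between consecutive $\lambda_i$ the left-hand side is decreasing, so each interval carries exactly one solution $\mu$.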
Proof of Lemma 2.3.31. We first observe that
\begin{align*}
\Sigma_b^{-1/2}\Sigma_g^{1/2}G_g(z)\Sigma_g^{1/2}\Sigma_b^{-1/2} &= \Sigma_b^{-1/2}(XX^* - z\Sigma_g^{-1})^{-1}\Sigma_b^{-1/2} \\
&= \big(Q_b - z + z - z\Sigma_b^{1/2}\Sigma_g^{-1}\Sigma_b^{1/2}\big)^{-1} \\
&= \big(G_b^{-1}(z) + zV_oD_o(1+D_o)^{-1}V_o^*\big)^{-1},
\end{align*}
where in the last step we use the fact that
\[
\Sigma_b^{1/2}\Sigma_g^{-1}\Sigma_b^{1/2} - 1 = -VD(1+D)^{-1}V^* = -V_oD_o(1+D_o)^{-1}V_o^*.
\]
We now again use the Woodbury matrix identity (2.3.96) to get
\[
\Sigma_b^{-1/2}\Sigma_g^{1/2}G_g(z)\Sigma_g^{1/2}\Sigma_b^{-1/2} = G_b(z) - zG_b(z)V_o\big(D_o^{-1}+1+zV_o^*G_b(z)V_o\big)^{-1}V_o^*G_b(z).
\]
Multiplying by $V_o$ on both sides of the equation and using the identity
\[
A - A(A+B)^{-1}A = B - B(A+B)^{-1}B,
\]
we can conclude our proof.
Proof of Corollary 2.3.14. Without loss of generality, we mainly focus on the case when all the outliers are to the right of the first bulk component. Recalling (2.3.27) and (2.3.29), similarly to the proof of Proposition 2.3.39, with probability $1-N^{-D_1}$, for $i, j \in I$, we have
\[
\langle v_i, P_A v_j\rangle = \delta_{ij}\mathbf{1}(i\in A)u_i + N^{\epsilon_1}R(i,j,A,N), \tag{2.3.98}
\]
where $R(i,j,A,N)$ is defined as
\begin{align*}
R(i,j,A,N) := {}& N^{-1/2}\Bigg[\mathbf{1}(i,j\in A)\Big({-\frac{1}{\sigma_i^g}}-x_1\Big)^{-1/4}\Big({-\frac{1}{\sigma_j^g}}-x_1\Big)^{-1/4} + \mathbf{1}(i\in A, j\notin A)\frac{\big({-\frac{1}{\sigma_i^g}}-x_1\big)^{1/2}}{\big|{-\frac{1}{\sigma_i^g}}+\frac{1}{\sigma_j^g}\big|} \\
&+ \mathbf{1}(i\notin A, j\in A)\frac{\big({-\frac{1}{\sigma_j^g}}-x_1\big)^{1/2}}{\big|{-\frac{1}{\sigma_i^g}}+\frac{1}{\sigma_j^g}\big|}\Bigg] + N^{-1}\Bigg[\Big(\frac{1}{\nu_i}+\frac{\mathbf{1}(i\in A)}{-\frac{1}{\sigma_i^g}-x_1}\Big)\Big(\frac{1}{\nu_j}+\frac{\mathbf{1}(j\in A)}{-\frac{1}{\sigma_j^g}-x_1}\Big)\Bigg].
\end{align*}
For the general case when $i, j = 1, 2, \cdots, M$, we denote $\widetilde{I} := I \cup \{i, j\}$ and consider
\[
\widetilde{\Sigma}_g := (1 + \widetilde{V}_o\widetilde{D}_o\widetilde{V}_o^*)\Sigma_b, \quad \widetilde{V}_o := [v_k]_{k\in\widetilde{I}}, \quad \widetilde{D}_o := \operatorname{diag}(\widetilde{d}_k)_{k\in\widetilde{I}}, \tag{2.3.99}
\]
where $\widetilde{d}_k := d_k$ for $k \in I$ and $\widetilde{d}_k = \epsilon$ for $k \in \widetilde{I}\setminus I$, with $\epsilon > 0$ small enough. Since $|\widetilde{I}| \leq r+2$ is finite, (2.3.98) can be applied to (2.3.99). By continuity, taking the limit $\epsilon \to 0$, we conclude that (2.3.98) holds true for all $i, j = 1, 2, \cdots, M$. For the proof of (2.3.30), as $w = \sum_{j=1}^M w_jv_j$, we have
\[
\langle w, P_A w\rangle = \sum_{i=1}^M\sum_{j=1}^M w_iw_j\langle v_i, P_A v_j\rangle.
\]
Then the proof follows from (2.3.98).
Proof of Corollary 2.3.16. For the proof of (2.3.32), as $w = \sum_{j=1}^M w_jv_j$, we have
\[
\langle w, u_j\rangle = \sum_{k=1}^M w_k\langle v_k, u_j\rangle.
\]
The rest of the proof is similar to that of Corollary 2.3.14, using an elementary inequality.
Simulation studies. We provide the simulation details and results. We use Figure 2.4 to show the simulation results of the examples. We can see that there exist some outlier eigenvalues in our general model (in red), while the bulk eigenvalues stick to those of the underlying bulk model (in blue). For the statistical application of optimal shrinkage of eigenvalues, we focus our discussion on the sample covariance matrices from Examples 2.3.21 and 2.3.22. For any two matrices, a true matrix $A$ and an estimate $B$, we denote by $L(A,B)$ the loss between $A$ and $B$. For completeness, we list 10 loss functions and their optimal shrinkers in Table 2.2. In each of the examples, we use Figure 2.5 to show the simulation results of the optimal shrinkers under the 10 different loss functions. For the oracle estimation, we use Figure 2.6 to compare our estimation method with the QuEST method. Finally, Figure 2.7 shows how our results can be employed in the factor model to improve the estimation.
[Figure: four panels showing the simulated eigenvalue spectra.]

Figure 2.4: Spectrum of the examples. We simulate the spectrum of four 400 × 800 sample covariance matrices with different population covariance matrices.
Frobenius matrix norm           Shrinker                          Statistical measure    Shrinker
||A - B||_F                     lc^2 + s^2                        Stein loss             l/(c^2 + ls^2)
||A^{-1} - B^{-1}||_F           l/(c^2 + ls^2)                    Entropy loss           lc^2 + s^2
||A^{-1}B - I||_F               (lc^2 + l^2s^2)/(c^2 + l^2s^2)    Divergence loss        sqrt((l^2c^2 + ls^2)/(c^2 + ls^2))
||B^{-1}A - I||_F               (l^2c^2 + s^2)/(lc^2 + s^2)       Matusita affinity      ((1 + c^2)l + s^2)/(1 + c^2 + ls^2)
||A^{-1/2}BA^{-1/2} - I||_F     1 + (l - 1)c^2/(c^2 + ls^2)^2     Frechet discrepancy    (sqrt(l)c^2 + s^2)^2

Table 2.2: 10 different loss functions and their optimal shrinkers.
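To illustrate how a shrinker from Table 2.2 is applied to an observed outlier eigenvalue, here is a minimal sketch. It uses the standard spiked-model relations between an outlier sample eigenvalue $\lambda$, the population spike $l$, and the squared cosine $c^2$ (with $s^2 = 1-c^2$) for aspect ratio $\gamma = p/n$; the normalization here is the textbook one and may differ from the conventions used elsewhere in this chapter.

```python
import numpy as np

def spike_from_eigenvalue(lam, gamma):
    """Invert lam = l * (1 + gamma / (l - 1)) for the population spike l,
    valid for outliers lam > (1 + sqrt(gamma))^2."""
    t = lam + 1 - gamma
    return (t + np.sqrt(t ** 2 - 4 * lam)) / 2

def cosine2(l, gamma):
    """Squared cosine c^2 between sample and population spike eigenvectors."""
    return (1 - gamma / (l - 1) ** 2) / (1 + gamma / (l - 1))

def shrink_frobenius(lam, gamma):
    """Optimal shrinker for the loss ||A - B||_F in Table 2.2: l c^2 + s^2."""
    if lam <= (1 + np.sqrt(gamma)) ** 2:      # bulk eigenvalue: shrink to 1
        return 1.0
    l = spike_from_eigenvalue(lam, gamma)
    c2 = cosine2(l, gamma)
    return l * c2 + (1 - c2)

gamma = 0.5                                   # aspect ratio p/n
lam = 5 * (1 + gamma / 4)                     # outlier produced by a spike l = 5
print(shrink_frobenius(lam, gamma))           # noticeably smaller than lam = 5.625
```

Swapping in a different row of Table 2.2 only changes the last line of `shrink_frobenius`; the inversion $\lambda \mapsto l$ and the cosine $c^2$ are shared by all ten shrinkers.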
[Figure: four panels plotting the shrunken eigenvalue β(λ) against the empirical eigenvalue λ; panel titles: Model two: Frobenius norm discrepancies (c = 2); Model two: statistical discrepancies (c = 2); Model three: Frobenius norm discrepancies (c = 2); Model three: statistical discrepancies (c = 2).]

Figure 2.5: Optimal shrinkers under different loss functions. We simulate the optimal shrinkers using Examples 2.3.21 and 2.3.22. Model two corresponds to Example 2.3.21 and model three to Example 2.3.22. F1 to F5 correspond to the Frobenius matrix norms in Table 2.2; C1 stands for the empirical eigenvalue and C2 for the true eigenvalue.
[Figure: four panels plotting the oracle estimator against the eigenvalue λ.]

Figure 2.6: Estimation of the oracle estimator. We simulate the estimation of the oracle estimators under the Frobenius norm, where the blue line stands for the true estimator, red dots for our estimation and magenta dots for the estimation using QuEST [78]. The first panel (top left), the second panel (top right), the third panel (bottom left) and the fourth panel (bottom right) correspond to Examples 2.3.20, 2.3.21, 2.3.22 and 2.3.23 respectively, where the entries of X are standard normal random variables. In Example 2.3.20, the spike is located at 6.
[Figure: four panels plotting the Frobenius loss against the dimension M.]

Figure 2.7: Estimation error using POET with information from random matrix theory. We simulate the estimation error under the Frobenius norm, where the blue line stands for the sample covariance matrix estimation and red dots for our Multi-POET estimation. We find that using information from the sample covariance matrices can help us improve the inference results. The first panel (top left), the second panel (top right), the third panel (bottom left) and the fourth panel (bottom right) correspond to Examples 2.3.20, 2.3.21, 2.3.22 and 2.3.23 respectively, where the entries of X are standard normal random variables. In Example 2.3.20, the spike is located at 6.
Chapter 3
Random matrices in non-stationary time series analysis
In this chapter, we provide detailed proofs and computations for the study of non-stationary time series, which is the second part of our contribution in Section 1.4. For a complete discussion, we refer to our papers [42, 43].
3.1 Locally stationary time series and physical dependence measure
Definition 3.1.1. Let $\eta_i'$ be an i.i.d. copy of $\eta_i$. Assuming that for some $q > 0$, $\|x_i\|_q < \infty$, for $j \geq 0$ we define the physical dependence measure by
\[
\delta(j,q) := \sup_{t\in[0,1]}\max_i \big\|G(t,\mathcal{F}_i) - G(t,\mathcal{F}_{i,j})\big\|_q, \tag{3.1.1}
\]
where $\mathcal{F}_{i,j} := (\mathcal{F}_{i-j-1}, \eta_{i-j}', \eta_{i-j+1}, \cdots, \eta_i)$.
The measure $\delta(j,q)$ quantifies the change in the output of the filter $G$ when the innovation $j$ steps back is replaced by an i.i.d. copy. If the change is small, then we
have short-range dependence. It is notable that $\delta(j,q)$ is determined by the data generating mechanism and can often be computed explicitly.
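To illustrate how easily $\delta(j,q)$ can be computed, here is a minimal Monte Carlo sketch. It assumes, purely for concreteness, the time-varying MA(1) filter $G(t,\mathcal{F}_i) = 0.6\cos(2\pi t)\eta_{i-1} + \eta_i$ with standard normal innovations (the model used in the simulation study later in this chapter); in closed form $\delta(1,2) = \sup_t|0.6\cos(2\pi t)|\cdot\|\eta-\eta'\|_2 = 0.6\sqrt{2}$ and $\delta(j,2)=0$ for $j \geq 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
t_grid = np.linspace(0, 1, 21)

def delta_hat(j, q=2):
    """Monte Carlo estimate of delta(j, q) for the illustrative filter
    G(t, F_i) = 0.6*cos(2*pi*t)*eta_{i-1} + eta_i."""
    eta_prev = rng.standard_normal(N)      # eta_{i-1}
    eta_cur = rng.standard_normal(N)       # eta_i
    eta_copy = rng.standard_normal(N)      # i.i.d. copy eta'_{i-j}
    best = 0.0
    for t in t_grid:
        a = 0.6 * np.cos(2 * np.pi * t)
        g = a * eta_prev + eta_cur
        # coupled output: the innovation j steps back replaced by its copy
        g_j = a * (eta_copy if j == 1 else eta_prev) + (eta_copy if j == 0 else eta_cur)
        best = max(best, float(np.mean(np.abs(g - g_j) ** q) ** (1.0 / q)))
    return best

print(delta_hat(1))   # close to 0.6 * sqrt(2) ~ 0.85
print(delta_hat(2))   # the filter has memory one, so this is exactly 0
```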
In this chapter, we impose the following assumptions on the physical dependence measure to control the temporal dependence of the non-stationary time series.
Assumption 3.1.2. There exist a large constant $\tau > 0$ and $q > 4$ such that
\[
\delta(j,q) \leq j^{-\tau}, \quad j \geq 1. \tag{3.1.2}
\]
Furthermore, $G$ satisfies the property of stochastic Lipschitz continuity,
\[
\|G(t_1,\mathcal{F}_i) - G(t_2,\mathcal{F}_i)\|_q \leq C|t_1-t_2|, \tag{3.1.3}
\]
for any $t_1, t_2 \in [0,1]$, and we also assume that
\[
\sup_t\max_i \|G(t,\mathcal{F}_i)\|_q < \infty. \tag{3.1.4}
\]
(3.1.2) indicates that the time series has short-range dependence. (3.1.3) implies that $G(\cdot,\cdot)$ changes smoothly over time and ensures local stationarity. Furthermore, for each fixed $t \in [0,1]$, denote
\[
\gamma(t,j) = \operatorname{Cov}\big(G(t,\mathcal{F}_0), G(t,\mathcal{F}_j)\big); \tag{3.1.5}
\]
then (3.1.3) and (3.1.4) imply that $\gamma(t,j)$ is Lipschitz continuous in $t$. Furthermore, we need the following mild assumption on the smoothness of $\gamma(t,j)$.
Assumption 3.1.3. For any $j \geq 0$, we assume that $\gamma(t,j) \in C^p([0,1])$, where $p > 0$ is some integer and $C^p([0,1])$ is the space of continuous functions on $[0,1]$ with $p$ continuous derivatives.
Many important consequences can be derived from Assumptions 3.1.2 and 3.1.3. We list the most useful ones. The first is the following control on $\gamma(t,j)$.
Lemma 3.1.4. Under Assumptions 3.1.2 and 3.1.3, there exists some constant $C > 0$ such that
\[
\sup_t |\gamma(t,j)| < Cj^{-\tau}, \quad j \geq 1.
\]
Another important conclusion is that the coefficients decay polynomially. Hence, when $i > b$ is large, where $b = n^{2/\tau}$, we only need to consider an autoregressive fit of order $b$ instead of $i-1$.
Lemma 3.1.5. Under Assumption 3.1.2 and letting $b = n^{2/\tau}$, there exists some constant $C > 0$ such that
\[
\sup_{i>b}|\phi_{ij}| \leq
\begin{cases}
\max\{n^{-4+5/\tau}, Cj^{-\tau}\}, & i \geq b^2; \\
\max\{n^{-2+3/\tau}, Cj^{-\tau}\}, & b < i < b^2.
\end{cases} \tag{3.1.6}
\]
Furthermore, when $i > b$, denote $\phi_i^b = (\phi_{i1}, \cdots, \phi_{ib})$ and $\widetilde{\phi}_i^b = (\Gamma_i^b)^{-1}\gamma_i^b$ with entries $(\widetilde{\phi}_{i1}, \cdots, \widetilde{\phi}_{ib})$, where $\Gamma_i^b = \operatorname{Cov}(x_{i-1}^b, x_{i-1}^b)$, $\gamma_i^b = \mathbb{E}(x_{i-1}^bx_i)$, $x_{i-1}^b = (x_{i-1}, \cdots, x_{i-b})$; we have
\[
\sup_i \big\|\phi_i^b - \widetilde{\phi}_i^b\big\| \leq Cn^{-2+1/\tau}.
\]
Finally, define $\phi^b(\frac{i}{n}) := (\phi_1(\frac{i}{n}), \cdots, \phi_b(\frac{i}{n}))$ by
\[
\phi^b\Big(\frac{i}{n}\Big) = (\bar{\Gamma}_i^b)^{-1}\bar{\gamma}_i^b, \tag{3.1.7}
\]
where $\bar{\Gamma}_i^b$ and $\bar{\gamma}_i^b$ are defined as
\[
\bar{\Gamma}_i^b = \operatorname{Cov}(\bar{x}_{i-1}^b, \bar{x}_{i-1}^b), \quad \bar{\gamma}_i^b = \operatorname{Cov}(\bar{x}_{i-1}^b, x_i),
\]
with $\bar{x}_{i-1,k}^b = G(\frac{i}{n},\mathcal{F}_{i-k})$, $k = 1, 2, \cdots, b$. The following lemma shows that $\phi_i^b$ can be well approximated by $\phi^b(\frac{i}{n})$ when $i > b$.
Lemma 3.1.6. Under Assumption 3.1.2, there exists some constant $C > 0$ such that
\[
\sup_{i>b}\Big|\phi_{ij} - \phi_j\Big(\frac{i}{n}\Big)\Big| \leq Cn^{-1+2/\tau}, \quad \text{for all } j \leq b.
\]
Until the end of this chapter, we will always use $b = n^{2/\tau}$. Before concluding this section, we summarize the basic properties of $\epsilon_i$.
Lemma 3.1.7. First of all, we have $\sup_i \sigma_i^2 < \infty$. Furthermore, denoting the physical dependence measure of $\epsilon_i$ by $\delta_\epsilon(j,q)$, there exists some constant $C > 0$ such that
\[
\delta_\epsilon(j,q) \leq Cj^{-\tau}, \quad j \geq 1.
\]
3.2 Estimation of covariance and precision matrices
By Lemmas 3.1.5 and 3.1.6, it suffices to estimate $\phi_{ij}$ for $i \leq b$, $\phi_j(\frac{i}{n})$ for $i > b \geq j$, and the variances of the residuals. When $i > b$, by (3.1.4) and Lemma 3.1.5, it is easy to check that with probability $1-o(1)$,
\[
\sup_i\Big|\sum_{j=b+1}^{i-1}\phi_{ij}x_{i-j}\Big| = o(n^{-1}).
\]
Therefore, we now write
\[
x_i = \sum_{j=1}^b \phi_{ij}x_{i-j} + \epsilon_i, \quad i = b+1, \cdots, n. \tag{3.2.1}
\]
Time-varying coefficients for $i > b$. We first estimate the time-varying coefficients $\phi_j(\frac{i}{n})$ for $i > b$ using the method of sieves [10, 31, 32]. We first observe the following fact.

Lemma 3.2.1. Under Assumptions 3.1.2 and 3.1.3, for any $j$, we have $\phi_j(t) \in C^p([0,1])$.
Based on the above lemma, we use
\[
\theta_j\Big(\frac{i}{n}\Big) := \sum_{k=1}^c a_{jk}\alpha_k\Big(\frac{i}{n}\Big), \quad j \leq b, \tag{3.2.2}
\]
to approximate $\phi_j(\frac{i}{n})$, where $\{\alpha_k(\frac{i}{n})\}$ is a set of pre-chosen orthogonal bases on $[0,1]$ and $c \equiv c(n)$ stands for the number of basis functions, which will be specified later. We impose the following regularity condition on the regressors and the basis functions.
Assumption 3.2.2. For any $k = 1, 2, \cdots, b$, define $\Sigma^k(t) \in \mathbb{R}^{k\times k}$ via $\Sigma_{ij}^k(t) = \gamma(t,|i-j|)$. We assume that the eigenvalues of
\[
\int_0^1 \Sigma^k(t)\otimes\big(b(t)b^*(t)\big)\,dt
\]
are bounded above and away from zero by a constant $\kappa > 0$, where $b(t) = (\alpha_1(t), \cdots, \alpha_c(t))^* \in \mathbb{R}^c$.
Lemma 3.2.3. Denote the $L^\infty$-norm approximation error with respect to the Lebesgue measure by
\[
\rho := \sup_{t\in[0,1]}|\phi_j(t) - \theta_j(t)|.
\]
We then have $\rho = O(c^{-p})$ for orthogonal polynomials, trigonometric polynomials, spline series when $r \geq p+1$, and orthogonal wavelets when $m > p$.
Using the Cholesky decomposition (3.2.1), we can now write
\[
x_i = \sum_{j=1}^b\sum_{k=1}^c a_{jk}z_{kj}\Big(\frac{i}{n}\Big) + \epsilon_i, \quad i = b+1, \cdots, n, \tag{3.2.3}
\]
with $z_{kj}(\frac{i}{n})$ defined as
\[
z_{kj}\Big(\frac{i}{n}\Big) := \alpha_k\Big(\frac{i}{n}\Big)x_{i-j}.
\]
We can use the ordinary least squares (OLS) method to estimate the coefficients $a_{jk}$.
Denote the vector $\beta \in \mathbb{R}^{bc}$ with $\beta_s = a_{j_s,k_s}$, where $j_s = \lfloor s/c\rfloor + 1$, $k_s = s - \lfloor s/c\rfloor\times c$. Similarly, we define $y_i \in \mathbb{R}^{bc}$ by letting $y_{is} = z_{k_s,j_s}(\frac{i}{n})$. Furthermore, we denote by $Y^*$ the $bc\times(n-b)$ rectangular matrix whose columns are $y_i$, $i = b+1, \cdots, n$. We also define $x \in \mathbb{R}^{n-b}$ containing $x_{b+1}, \cdots, x_n$. Hence, the OLS estimator for $\beta$ can be written as
\[
\widehat{\beta} = (Y^*Y)^{-1}Y^*x.
\]
Moreover, denote $x_i = (x_{i-1}, \cdots, x_{i-b})^* \in \mathbb{R}^b$ and $X = (x_{b+1}, \cdots, x_n) \in \mathbb{R}^{b\times(n-b)}$. Denote by $E_i \in \mathbb{R}^{(n-b)\times(n-b)}$ the matrix satisfying $(E_i)_{st} = 1$ when $s = t = i-b$ and $0$ otherwise. As a consequence, we can write
\[
Y^* = \sum_{i=b+1}^n\Big(X\otimes b\Big(\frac{i}{n}\Big)\Big)E_i, \tag{3.2.4}
\]
where $\otimes$ stands for the Kronecker product. It is well known that the OLS estimator satisfies
\[
\widehat{\beta} = \beta + \Big(\frac{Y^*Y}{n}\Big)^{-1}\frac{Y^*\epsilon}{n}, \tag{3.2.5}
\]
where $\epsilon \in \mathbb{R}^{n-b}$ contains $\epsilon_{b+1}, \cdots, \epsilon_n$. We decompose $\beta$ into $b$ blocks by writing $\beta = (\beta_1^*, \cdots, \beta_b^*)^*$, where each $\beta_i \in \mathbb{R}^c$. Similarly, we can decompose $\widehat{\beta}$. Therefore, our sieve estimator can be written as $\widehat{\phi}_j(\frac{i}{n}) = \widehat{\beta}_j^*b(\frac{i}{n})$, and it satisfies
\[
\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big) = (\widehat{\beta}_j - \beta_j)^*b\Big(\frac{i}{n}\Big). \tag{3.2.6}
\]
By carefully choosing $c = n^{\alpha_1}$, we show that the $\widehat{\phi}_j(\frac{i}{n})$ are consistent estimators of $\phi_j(\frac{i}{n})$ uniformly in $i$ for all $j \leq b$. Denoting $\zeta_c := \sup_i \|b(\frac{i}{n})\|$, we have

Theorem 3.2.4. Under Assumption 3.1.2, for some sufficiently small positive constants $\alpha_2, \alpha_3 > 0$ satisfying
\[
2/\tau + \alpha_1 - \alpha_3 < 0, \quad 4/\tau + 2(\alpha_1 + \alpha_3 - 1) < 0, \tag{3.2.7}
\]
\[
\zeta_cn^{-1/2+\alpha_2} = o(1), \tag{3.2.8}
\]
with probability $P_1 := 1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, for some constant $C > 0$, we have
\[
\sup_{i>b,\,j\leq b}\Big|\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big| \leq E_1,
\]
where $E_1$ is defined as
\[
E_1 := \max\{C\zeta_cn^{-1/2+\alpha_2}, Cn^{-p\alpha_1}\}.
\]
Here we recall that $p$ is the order of smoothness of $\gamma(t,j)$ defined in Assumption 3.1.3.
Remark 3.2.5. (i) The first constraint of (3.2.7) ensures that $\frac{1}{n}Y^*Y$ is regular, so that its smallest eigenvalue can be bounded away from zero.
(ii) $\zeta_c = O(\sqrt{c})$ for trigonometric polynomials, spline series and orthogonal wavelets, and $\zeta_c = O(c)$ for orthogonal polynomials.
(iii) We can easily obtain a consistent estimator by suitably choosing the above parameters. For instance, when $\tau = 10$, we can choose $\alpha_1 = 1/8$, $\alpha_2 = 1/4$, $\alpha_3 = 1/2$ for the Fourier basis and the orthogonal wavelet basis.
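To make the sieve regression (3.2.3)-(3.2.6) concrete, here is a minimal sketch. It fits a time-varying AR(1) model $x_i = \phi_1(\frac{i}{n})x_{i-1} + \epsilon_i$ with $\phi_1(t) = 0.6\cos(2\pi t)$ (the model used in the simulation study below), using a Fourier basis with $c = 5$ functions; the sample size and basis size are illustrative choices, not tuned parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
n, c = 5000, 5                         # sample size and number of basis functions

# Simulate a time-varying AR(1): x_i = phi_1(i/n) x_{i-1} + eps_i.
phi1 = lambda t: 0.6 * np.cos(2 * np.pi * t)
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi1(i / n) * x[i - 1] + rng.standard_normal()

def basis(t):
    """Orthonormal Fourier basis alpha_1, ..., alpha_c on [0, 1]."""
    cols = [np.ones_like(t)]
    for k in range(1, (c - 1) // 2 + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * k * t),
                 np.sqrt(2) * np.sin(2 * np.pi * k * t)]
    return np.column_stack(cols)[:, :c]

# Regress x_i on the regressors z_k(i/n) = alpha_k(i/n) * x_{i-1} (here b = 1).
i_idx = np.arange(1, n)
Y = basis(i_idx / n) * x[i_idx - 1][:, None]
beta_hat, *_ = np.linalg.lstsq(Y, x[i_idx], rcond=None)

# The sieve estimate phi_hat(t) = sum_k beta_hat_k alpha_k(t), as in (3.2.6).
grid = np.linspace(0, 1, 101)
phi_hat = basis(grid) @ beta_hat
sup_err = float(np.max(np.abs(phi_hat - phi1(grid))))
print(f"sup-norm error of the sieve estimate: {sup_err:.3f}")
```

Since the true coefficient function lies exactly in the span of the Fourier basis, the bias term $n^{-p\alpha_1}$ vanishes here and only the stochastic term $\zeta_c n^{-1/2+\alpha_2}$ remains, which is what the small sup-norm error reflects.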
Time-varying coefficients for $i \leq b$. It is notable that, by Lemma 3.1.6, when $i, j$ are not very large we cannot use the above estimators. In this situation, for each $\phi_{ij}$, $j < i \leq b$, we estimate the coefficients of the best linear predictor after smoothing by the sieve method. For instance, in order to estimate $\phi_{21}$, we rely on the following regression equations
\[
x_k = \phi_{k1}x_{k-1} + \xi_{2k}, \quad k = 2, 3, \cdots, n,
\]
where $\xi_{22} = \epsilon_2$. Due to the local stationarity assumption, informally we can say that there exists a continuous function $f_{21}$ such that $\phi_{21} \approx f_{21}(\frac{2}{n})$, where $f_{21}$ can be efficiently estimated using the sieve method. We now make this idea rigorous. For each fixed
$i \leq b$, to estimate $\phi_i$ we make use of the following equations,
\[
x_k = \sum_{j=1}^{i-1}\lambda_{kj}x_{k-j} + \xi_{ik}, \quad k = i, i+1, \cdots, n, \tag{3.2.9}
\]
where $\lambda_k^i = (\lambda_{k1}, \cdots, \lambda_{k,i-1})$ are the coefficients of the best linear prediction using the $i-1$ predecessors and $\xi_{ii} = \epsilon_i$. Note that $\lambda_i^i = \phi_i$. Using the Yule-Walker equation, we find
\[
\lambda_k^i = (\Gamma_i^k)^{-1}\gamma_i^k,
\]
where $\Gamma_i^k = \operatorname{Cov}(x_i^k, x_i^k)$, $\gamma_i^k = \operatorname{Cov}(x_i^k, x_k)$ and $x_i^k = (x_{k-1}, \cdots, x_{k-i+1})$. Due to Assumption 3.1.3, we can define $f_i^k = (f_1^i(\frac{k}{n}), \cdots, f_{i-1}^i(\frac{k}{n}))$ by
\[
f_i^k = (\bar{\Gamma}_i^k)^{-1}\bar{\gamma}_i^k, \tag{3.2.10}
\]
with $\bar{\Gamma}_i^k, \bar{\gamma}_i^k$ defined by
\[
\bar{\Gamma}_i^k = \operatorname{Cov}(\bar{x}_i^k, \bar{x}_i^k), \quad \bar{\gamma}_i^k = \operatorname{Cov}(\bar{x}_i^k, x_k),
\]
where $\bar{x}_{i,j}^k = G(\frac{k}{n},\mathcal{F}_{k-j})$. In order to estimate $\phi_{ij}$ via $f_j^i(\frac{i}{n})$, we will make use of the following lemma.
Lemma 3.2.6. Under Assumption 3.1.2, for each fixed $i \leq b$ and for any $j \leq i-1$, the $f_j^i(t)$ are $C^p$ functions on $[0,1]$. Furthermore, we have
\[
\Big|\phi_{ij} - f_j^i\Big(\frac{i}{n}\Big)\Big| \leq \max\{n^{-1+2/\tau}, n^{-p\alpha_1}\}, \quad j < i \leq b. \tag{3.2.11}
\]
Therefore, by Lemma 3.2.3, the remaining work is to estimate the functions $f_j^i(\frac{i}{n})$, $j < i \leq b$, using the sieve approximation
\[
f_j^i\Big(\frac{i}{n}\Big) = \sum_{k=1}^c d_{jk}\alpha_k\Big(\frac{i}{n}\Big).
\]
We denote the OLS estimate by $\widehat{f}_j^i(\frac{i}{n}) = \sum_{k=1}^c \widehat{d}_{jk}\alpha_k(\frac{i}{n})$.
Theorem 3.2.7. Under Assumption 3.1.2, for some sufficiently small positive constants $\alpha_2, \alpha_3 > 0$ satisfying (3.2.7) and (3.2.8), for some constant $C > 0$, with probability $1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, we have
\[
\sup_{i\leq b,\,j<i}\Big|\widehat{f}_j^i\Big(\frac{i}{n}\Big) - f_j^i\Big(\frac{i}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-1+2/\tau}, Cn^{-p\alpha_1}\}.
\]
Sieve estimation for noise variances. We discuss the cases $i > b$ and $i \leq b$ separately. For $i > b$, denote $\epsilon_i^b = x_i - \sum_{j=1}^b \phi_{ij}x_{i-j}$ and $(\sigma_i^b)^2 = \mathbb{E}(\epsilon_i^b)^2$. $\sigma_i$ can be well approximated by $\sigma_i^b$, by the following lemma.

Lemma 3.2.8. For $i > b$ and some constant $C > 0$, we have
\[
\sup_{i>b}\big|\sigma_i^2 - (\sigma_i^b)^2\big| \leq Cn^{-2+2/\tau}.
\]
Furthermore, denoting $g(\frac{i}{n}) = \mathbb{E}\Big(x_i - \sum_{j=1}^b \phi_{ij}G\big(\frac{i}{n},\mathcal{F}_{i-j}\big)\Big)^2$, we then have
\[
\sup_{i>b}\Big|(\sigma_i^b)^2 - g\Big(\frac{i}{n}\Big)\Big| \leq Cn^{-1+4/\tau}.
\]
Finally, $g \in C^p([0,1])$.
Denote $r_i^b = (\epsilon_i^b)^2$; it is notable that $r_i^b$ cannot be observed directly. Denote $\widehat{r}_i^b = \widehat{\epsilon}_i^{\,2}$, where
\[
\widehat{\epsilon}_i = x_i - \sum_{j=1}^b\sum_{k=1}^c \widehat{a}_{jk}z_{kj}\Big(\frac{i}{n}\Big), \quad i = b+1, \cdots, n. \tag{3.2.12}
\]
By Theorem 3.2.4, we conclude that with probability $P_1$, for some constant $C > 0$, we have
\[
\sup_{i>b}|\widehat{r}_i^b - r_i^b| \leq Cn^{4/\tau}(n^{-1+2/\tau} + E_1). \tag{3.2.13}
\]
Denoting the centered random variables $\omega_i^b = r_i^b - (\sigma_i^b)^2$, by Lemma 3.2.8 and (3.2.13), with probability $P_1$ we can write
\[
\widehat{r}_i^b = g\Big(\frac{i}{n}\Big) + \omega_i^b + O(n^{-1+6/\tau} + n^{4/\tau}E_1). \tag{3.2.14}
\]
Invoking Lemma 3.2.3, for convenience we can therefore write our regression equation as
\[
\widehat{r}_i^b = \sum_{k=1}^c d_k\alpha_k\Big(\frac{i}{n}\Big) + \omega_i^b, \quad i = b+1, \cdots, n. \tag{3.2.15}
\]
Similarly to Lemma 3.1.7, we can show that the physical dependence measure of $\omega_i^b$ is also of polynomial decay. Therefore, the OLS estimator for $\alpha = (d_1, \cdots, d_c)^*$ can be written as
\[
\widehat{\alpha} = (W^*W)^{-1}W^*r,
\]
where $W^*$ is a $c\times(n-b)$ matrix whose $i$-th column is $(\alpha_1(\frac{i+b}{n}), \cdots, \alpha_c(\frac{i+b}{n}))^*$, $i = 1, 2, \cdots, n-b$, and $r \in \mathbb{R}^{n-b}$ contains $\widehat{r}_{b+1}^b, \cdots, \widehat{r}_n^b$. Furthermore, by the property of OLS, we have
\[
\widehat{\alpha} = \alpha + \Big(\frac{W^*W}{n}\Big)^{-1}\frac{W^*\omega}{n}, \tag{3.2.16}
\]
where $\omega = (\omega_{b+1}, \cdots, \omega_n)^*$. Correspondingly, we have the following consistency result.

Theorem 3.2.9. Under Assumptions 3.1.2 and 3.1.3, for some sufficiently small constant $\alpha_2 > 0$, with probability $P_2 := 1 - n^{1-q(1/2+\alpha_2)}$, for some constant $C > 0$, we have
\[
\sup_{i_0>b}\Big|\widehat{g}\Big(\frac{i_0}{n}\Big) - g\Big(\frac{i_0}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\}. \tag{3.2.17}
\]
Finally, we study the estimation of $\sigma_i^2$, $i = 1, 2, \cdots, b$, which admits a similar discussion. Recalling $\xi_{ik}$ defined in (3.2.9), denote $(\sigma_k^i(\xi))^2 = \mathbb{E}(\xi_{ik})^2$. Using a discussion similar to Lemma 3.2.8, we can find a smooth function $g_i$ such that $\sup_{i\leq b}|(\sigma_i^i(\xi))^2 - g_i(\frac{i}{n})| \leq O(n^{-1+4/\tau})$; in particular, we can use $g_i(\frac{i}{n})$ to estimate $\sigma_i^2$. When $i = 1$, we need to estimate the variance function of $x_i$.
The remaining work is to estimate $g_i(t)$ using the sieve method, similarly to (3.2.15), for $i \leq b$, where we replace the residual with $\widehat{r}_{ik}$, $k = i, \cdots, n$. Here $\widehat{r}_{ik}$ is defined as
\[
\widehat{r}_{ik} := \Big(x_k - \sum_{j=1}^{i-1}\widehat{f}_j^i\Big(\frac{k}{n}\Big)x_{k-j}\Big)^2, \quad k = i, i+1, \cdots, n. \tag{3.2.18}
\]
Then for $i_0 \leq b$, we can estimate $g_{i_0}(\frac{i_0}{n})$ similarly, except that the dimension of $W^*$ is $c\times(n+1-i)$. The results can be summarized in the following theorem.

Theorem 3.2.10. Under Assumptions 3.1.2 and 3.1.3, with probability $P_2$, for some constant $C > 0$, we have
\[
\sup_{i_0\leq b}\Big|\widehat{g}_{i_0}\Big(\frac{i_0}{n}\Big) - g_{i_0}\Big(\frac{i_0}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\}.
\]
Estimation of covariance and precision matrices. It is natural to choose
\[
\widehat{\Gamma} := \widehat{\Phi}^{-1}\widehat{D}(\widehat{\Phi}^{-1})^*
\]
as our estimator. Here $\widehat{\Phi}$ is a lower triangular matrix whose diagonal entries are all ones. We now control the estimation error between $\widehat{\Gamma}$ and $\Gamma$.

We first observe that, since $\det(\Phi\Phi^*) = \det(\widehat{\Phi}\widehat{\Phi}^*) = 1$, combining with Assumption 3.1.2, there exist some constants $C_1, C_2 > 0$ such that
\[
C_1 \leq \lambda_{\min}(\Phi\Phi^*) \leq \lambda_{\max}(\Phi\Phi^*) \leq C_2.
\]
Similar results hold for $\widehat{\Phi}\widehat{\Phi}^*$.
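A minimal sketch of the Cholesky-based construction, assuming a stationary AR(1) model with known coefficient so that the factors are explicit and the result can be checked against the closed-form covariance $a^{|i-j|}/(1-a^2)$; in practice the entries of the factors are filled with the sieve estimates above:

```python
import numpy as np

def ar1_cholesky_factors(a, p):
    """Unit-lower-triangular Phi and diagonal D for a stationary AR(1)
    x_i = a x_{i-1} + eps_i with Var(eps_i) = 1: row i of Phi encodes the
    regression of x_i on its predecessors, D holds the residual variances."""
    Phi = np.eye(p)
    for i in range(1, p):
        Phi[i, i - 1] = -a            # x_i - a x_{i-1} = eps_i
    D = np.ones(p)
    D[0] = 1.0 / (1.0 - a ** 2)       # Var(x_1) for the stationary start
    return Phi, D

a, p = 0.5, 6
Phi, D = ar1_cholesky_factors(a, p)

# Covariance and precision estimators from the Cholesky decomposition:
#   Gamma = Phi^{-1} D (Phi^{-1})^*,  Gamma^{-1} = Phi^* D^{-1} Phi.
Phi_inv = np.linalg.inv(Phi)
Gamma = Phi_inv @ np.diag(D) @ Phi_inv.T
Gamma_inv = Phi.T @ np.diag(1.0 / D) @ Phi

# Check against the closed-form AR(1) covariance a^{|i-j|} / (1 - a^2).
idx = np.arange(p)
Gamma_true = a ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - a ** 2)
print(np.max(np.abs(Gamma - Gamma_true)))             # numerically zero
print(np.max(np.abs(Gamma @ Gamma_inv - np.eye(p))))  # numerically zero
```

The same two lines that build `Gamma` and `Gamma_inv` from `Phi` and `D` are all that is needed once the time-varying coefficients and residual variances have been estimated; no separate matrix inversion of the covariance itself is required for the precision matrix.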
Proposition 3.2.11. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, for $\alpha_1, \alpha_2$ and $\alpha_3$ defined in (3.2.7) and (3.2.8), with probability $1 - P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$, for some constant $C > 0$, we have
\[
\big\|\widehat{\Gamma} - \Gamma\big\| \leq Cn^{4/\tau}\max\{n^{-1+4/\tau}, n^{-p\alpha_1}, \zeta_c^2n^{-1+2\alpha_2}, \zeta_cn^{-1/2+\alpha_2}\}, \tag{3.2.19}
\]
where $P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$ is defined as
\[
\max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}. \tag{3.2.20}
\]
Proof. Using the fact that for any two compatible matrices $A, B$, $AB$ and $BA$ have the same non-zero eigenvalues, we conclude that
\[
\big\|\widehat{\Gamma} - \Gamma\big\| \leq C_1^{-1/2}\|\mathcal{E}\|,
\]
where $\mathcal{E}$ has the following decomposition
\begin{align*}
\mathcal{E} &= \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 \\
&= \big[D - \widehat{D}\big] + \big[D\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)^*\Phi^* + \Phi\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)D\big] + \big[\Phi\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)D\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)^*\Phi^*\big].
\end{align*}
Denote $B := D(\Phi^{-1})^*(\Phi - \widehat{\Phi})^*(\widehat{\Phi}^{-1})^*\Phi^*$; we therefore have $\|\mathcal{E}_2\| \leq 2\|B\|$. We further denote $R_\Phi := \Phi - \widehat{\Phi}$, and first observe that $(R_\Phi)_{ij} = 0$ for $i \leq j$. Then by Lemmas 3.1.5 and 3.1.6 and Theorems 3.2.4 and 3.2.7, for $i \leq b$ or $j \leq b \leq i$, with probability $1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, we have $(R_\Phi)_{ij} = O(\max\{\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\})$, and for $i > b, j > b$, $|(R_\Phi)_{ij}| \leq j^{-\tau}$. This implies that with the above probability, we have
\[
\lambda_{\max}\big((\Phi - \widehat{\Phi})(\Phi - \widehat{\Phi})^*\big) \leq \max\{C\zeta_c^2n^{-1+2\alpha_2+4/\tau}, Cn^{-2p\alpha_1+4/\tau}\},
\]
where we use Gershgorin's circle theorem. As a consequence, by submultiplicativity, for some constant $C > 0$, we have that
\[
\|\mathcal{E}_2\| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, Cn^{-p\alpha_1}\}.
\]
Similarly, we can show that
\[
\|\mathcal{E}_3\| \leq \max\{C\zeta_c^2n^{-1+2\alpha_2+4/\tau}, Cn^{-2p\alpha_1+4/\tau}\}.
\]
By (3.2.14) and Theorems 3.2.9 and 3.2.10, with probability $P_1$,
\[
\|\mathcal{E}_1\| \leq C\max\{\zeta_cn^{-1/2+\alpha_2+4/\tau}, n^{-p\alpha_1+4/\tau}, n^{-1+8/\tau}\}.
\]
Hence, we have finished our proof.
An advantage of the Cholesky decomposition is that we can easily estimate the precision matrix using the following estimator
\[
\widehat{\Gamma}^{-1} := \widehat{\Phi}^*\widehat{D}^{-1}\widehat{\Phi}.
\]
Similarly to Proposition 3.2.11, we have the following result for the precision matrix.

Corollary 3.2.12. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, for $\alpha_1, \alpha_2$ and $\alpha_3$ defined in (3.2.7) and (3.2.8), with probability $1 - P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$, for some constant $C > 0$, we have
\[
\big\|\widehat{\Gamma}^{-1} - \Gamma^{-1}\big\| \leq Cn^{4/\tau}\max\{n^{-1+4/\tau}, n^{-p\alpha_1}, \zeta_c^2n^{-1+2\alpha_2}, \zeta_cn^{-1/2+\alpha_2}\},
\]
with $P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$ defined in (3.2.20).
In this subsection, we show by simulations the finite sample performance of our estimation of covariance and precision matrices. We investigate the following four non-stationary models:
• Non-stationary MA(1) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-1} + \epsilon_i,
\]
where the $\epsilon_i$ are i.i.d. $\mathcal{N}(0,1)$ random variables.
• Non-stationary MA(2) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-1} + 0.3\sin\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-2} + \epsilon_i.
\]
• Time-varying AR(1) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)x_{i-1} + \epsilon_i.
\]
• Time-varying AR(2) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)x_{i-1} + 0.3\sin\Big(\frac{2\pi i}{n}\Big)x_{i-2} + \epsilon_i.
\]
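For the non-stationary MA(1) model above, the true covariance matrix is banded: $\operatorname{Var}(x_i) = 1 + 0.36\cos^2(2\pi i/n)$, $\operatorname{Cov}(x_i, x_{i+1}) = 0.6\cos(2\pi(i+1)/n)$, and zero beyond lag one. A minimal sketch constructing it and cross-checking by Monte Carlo (the dimension and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
i = np.arange(1, n + 1)
a = 0.6 * np.cos(2 * np.pi * i / n)      # coefficients a_i = 0.6 cos(2*pi*i/n)

# True covariance of x_i = a_i eps_{i-1} + eps_i with eps_i i.i.d. N(0, 1):
# Var(x_i) = 1 + a_i^2, Cov(x_i, x_{i+1}) = a_{i+1}, zero beyond lag one.
Gamma = np.diag(1.0 + a ** 2) + np.diag(a[1:], k=1) + np.diag(a[1:], k=-1)

# Monte Carlo cross-check via the sample covariance over independent replications.
eps = rng.standard_normal((reps, n + 1))
x = a * eps[:, :-1] + eps[:, 1:]         # each row is one replication of x_1, ..., x_n
S = (x.T @ x) / reps                     # the process is mean zero, so no centering
print(np.linalg.norm(S - Gamma, 2))      # operator-norm error, shrinks like reps^{-1/2}
```

The AR models admit similar exact computations by solving the time-varying Yule-Walker recursions, which is how the error columns in the tables below can be evaluated against a known target.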
It is easy to compute the true covariance matrices of the above models. In the following simulations, we report the average estimation errors in terms of the operator norm, together with their standard deviations, based on 1000 repetitions. We compare the results for two different types of bases, the Fourier basis and the Daubechies orthogonal wavelet basis [33]. We also record the estimation errors obtained by using the simple sample covariance matrices.
We observe from Table 3.1 that our covariance matrix estimators are in general better than simply using the sample covariance matrices. Due to the consistency of our estimators, they become more accurate as $n$ grows. Furthermore, as we can see from the estimation of the AR(1) and AR(2) processes, our estimators can still be quite accurate even when the underlying covariance matrix is not very sparse, whereas the simple sample covariance matrix deteriorates due to the curse of dimensionality. Similarly, as we can see from Table 3.2, our estimators of the precision matrices are also reasonably accurate. We remark that most of the simple precision matrices are singular due to correlation. In this sense, our methodology provides a natural way to estimate the precision matrix of a non-stationary time series.
Model    Method               n=200          n=500          n=800
MA(1)    Fourier basis        1.71 (0.77)    1.52 (0.6)     1.58 (0.56)
         Wavelet basis        1.91 (0.87)    1.85 (0.86)    1.65 (0.68)
         Sample estimation    2.53 (0.005)   2.55 (0.002)   2.55 (0.002)
MA(2)    Fourier basis        2.12 (0.88)    2.04 (0.72)    1.79 (0.69)
         Wavelet basis        2.21 (1)       2.12 (0.98)    1.73 (0.83)
         Sample estimation    2.78 (0.005)   2.78 (0.002)   2.8 (0.002)
AR(1)    Fourier basis        1.78 (0.98)    1.67 (0.77)    1.53 (0.76)
         Wavelet basis        1.89 (1.1)     1.78 (0.99)    1.65 (0.92)
         Sample estimation    5.79 (0.03)    6.06 (0.01)    6.81 (0.01)
AR(2)    Fourier basis        3.1 (1.2)      2.95 (1.03)    2.67 (0.87)
         Wavelet basis        3.5 (1.05)     3.3 (1)        3.02 (0.95)
         Sample estimation    8.3 (0.04)     8.8 (0.018)    8.97 (0.01)

Table 3.1: Operator norm error for the estimation of covariance matrices.
Our simulations are necessarily limited, because we need the true covariance and precision matrices to be known for comparison. In these cases, we choose either the Fourier basis or the orthogonal wavelet basis for our sieve estimation. However, as we can see from Tables 3.1 and 3.2, when $n$ is quite large the differences between the bases can be ignored. We therefore suggest the use of the orthogonal wavelet basis in practice. As a final remark, in the finite sample case one should choose $b$ and $c$ using data-driven model selection methods, for instance AIC and BIC.
Model    Method           n=200          n=500          n=800
MA(1)    Fourier basis    3.3 (0.26)     2.98 (0.27)    2.57 (0.28)
         Wavelet basis    3.35 (0.7)     3.16 (0.53)    3.08 (0.44)
MA(2)    Fourier basis    3.8 (0.22)     3.66 (0.22)    3.43 (0.21)
         Wavelet basis    3.85 (0.45)    3.72 (0.45)    3.52 (0.43)
AR(1)    Fourier basis    0.85 (0.39)    0.6 (0.22)     0.37 (0.19)
         Wavelet basis    1.1 (0.68)     0.97 (0.48)    0.77 (0.46)
AR(2)    Fourier basis    1.32 (0.55)    1.03 (0.32)    0.79 (0.29)
         Wavelet basis    1.57 (0.78)    1.38 (0.67)    1 (0.52)

Table 3.2: Operator norm error for estimation of precision matrices.
3.3 Inference of covariance and precision matrices
Another advantage of our methodology is that we can test the structure of the covariance and precision matrices using some simple statistics in terms of the entries of $\Phi$. On one hand, when $i > b$, denote by $B_j(\frac{i}{n}) \in \mathbb{R}^{bc}$ the vector with $b$ blocks whose $j$-th block is the basis $b(\frac{i}{n})$ and which is zero otherwise. Therefore, for any fixed $i > b$, $j \leq b$, we have
\[
\Big(\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big)^2 = B_j^*\Big(\frac{i}{n}\Big)(\widehat{\beta} - \beta)(\widehat{\beta} - \beta)^*B_j\Big(\frac{i}{n}\Big) + O(n^{-2p\alpha_1}).
\]
As can be seen from the above equation, the order of smoothness and the number of basis functions are important to our asymptotics. In this section, we assume that $p$ is large enough that $p\alpha_1 > 1$.
As a consequence, recalling (3.2.5), it is easy to see that for $\Sigma$ we have
\[
\Big(\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big)^2 \to B_j^*\Big(\frac{i}{n}\Big)\Sigma^{-1}\frac{Y^*\epsilon}{n}\frac{\epsilon^*Y}{n}\Sigma^{-1}B_j\Big(\frac{i}{n}\Big) \quad \text{in probability}. \tag{3.3.1}
\]
Similarly, for any fixed $i \leq b$, $j \leq i-1$, denote by $B_{ij}(\frac{i}{n}) \in \mathbb{R}^{(i-1)c}$ the vector with $i-1$ blocks whose $j$-th block is $b(\frac{i}{n})$ and which is zero otherwise; we then have
\[
\Big(\widehat{f}_j^i\Big(\frac{i}{n}\Big) - f_j^i\Big(\frac{i}{n}\Big)\Big)^2 = B_{ij}^*\Big(\frac{i}{n}\Big)(\widehat{d}_i - d_i)(\widehat{d}_i - d_i)^*B_{ij}\Big(\frac{i}{n}\Big), \tag{3.3.2}
\]
where $d_i$ and $\widehat{d}_i$ denote the sieve coefficient vector and its OLS estimate, respectively. Note that the above equation differs from (3.3.1) in the sense that $\beta$ is of the same dimension $bc$ for all $i > b$, but $d_i$ is of dimension $(i-1)c$ for $i \leq b$.
Hypothesis testing and test statistics. In this subsection, we focus on two fundamental tests in time series analysis. One of the targets in time series analysis is to test whether the observed samples come from a white noise process, where the null hypothesis is
\[
\mathbf{H}_0^1: x_i \text{ is a white noise process}.
\]
Under $\mathbf{H}_0^1$, recalling (3.1.7) and (3.2.10), the $\phi_j(\frac{i}{n})$, $f_j^i(\frac{i}{n})$ are all zero. Therefore, our estimates $\widehat{\phi}_j(\frac{i}{n})$, $\widehat{f}_j^i(\frac{i}{n})$ should be small for all $i, j$. We therefore use the following statistic to test $\mathbf{H}_0^1$:
\[
T_1 = \sum_{j=1}^b\int_0^1\widehat{\phi}_j^2(t)\,dt + \sum_{i=2}^b\sum_{j=1}^{i-1}\int_0^1\big(\widehat{f}_j^i(t)\big)^2\,dt.
\]
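Since each estimated function is a finite linear combination of the basis functions, the integrals in $T_1$ reduce to sums of squared sieve coefficients whenever the basis is orthonormal on $[0,1]$: $\int_0^1\big(\sum_k a_k\alpha_k(t)\big)^2\,dt = \sum_k a_k^2$. A quick numerical sanity check of this reduction (with an assumed Fourier basis and arbitrary coefficient values):

```python
import numpy as np

def basis(t, c=5):
    """Orthonormal Fourier basis alpha_1, ..., alpha_c on [0, 1]."""
    cols = [np.ones_like(t)]
    for k in range(1, (c - 1) // 2 + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * k * t),
                 np.sqrt(2) * np.sin(2 * np.pi * k * t)]
    return np.column_stack(cols)[:, :c]

a = np.array([0.3, -1.2, 0.5, 0.0, 0.7])   # arbitrary illustrative sieve coefficients
t = np.linspace(0.0, 1.0, 100_001)
phi_sq = (basis(t) @ a) ** 2

# Trapezoid rule for int_0^1 phi(t)^2 dt versus the coefficient identity sum_k a_k^2.
dt = t[1] - t[0]
integral = dt * (np.sum(phi_sq) - 0.5 * (phi_sq[0] + phi_sq[-1]))
print(integral, np.sum(a ** 2))            # the two values agree to high accuracy
```

In practice this means $T_1$ can be evaluated directly from the estimated coefficient vectors, without any numerical quadrature.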
In the analysis of time series, it is also important to test the bandedness of the precision matrix. In our setup, the Cholesky decomposition provides a convenient way to test bandedness. For any given $k_0 \equiv k_0(n) \ll b$, we are interested in testing the following hypothesis
\[
\mathbf{H}_0^2: \text{the precision matrix of } x_i \text{ is } k_0\text{-banded}.
\]
Recall $\Gamma^{-1} = \Phi^*D^{-1}\Phi$. As $\Gamma^{-1}$ is strictly positive definite, the Cholesky decomposition is unique. Therefore, we conclude that $\Phi$ is also $k_0$-banded. Furthermore, under $\mathbf{H}_0^2$, recalling (3.1.7) and (3.2.10), we have $\phi_j(\frac{i}{n}) = 0$ for $j > k_0$. Therefore, it is natural for us to use
the following statistic
\[
T_2 = \sum_{j=k_0+1}^b\int_0^1\widehat{\phi}_j^2(t)\,dt + \sum_{i=k_0+2}^b\sum_{j=k_0+1}^{i-1}\int_0^1\big(\widehat{f}_j^i(t)\big)^2\,dt.
\]
It is notable that both test statistics $T_1, T_2$ can be written as summations of quadratic forms under the null hypothesis. For $T_1$ under $\mathbf{H}_0^1$ and $T_2$ under $\mathbf{H}_0^2$, we have
\[
\widehat{\phi}_j^2(t) = \big(\widehat{\phi}_j(t) - \phi_j(t)\big)^2, \quad \big(\widehat{f}_j^i(t)\big)^2 = \big(\widehat{f}_j^i(t) - f_j^i(t)\big)^2.
\]
Therefore, all the above quantities can be computed using (3.3.1) and (3.3.2), which are
quadratic forms of a high dimensional locally stationary time series.
High dimensional Gaussian approximation. As we have seen in the previous subsection, both test statistics involve high dimensional quadratic forms. The distribution of a quadratic form of Gaussian vectors can be easily computed using Lindeberg's central limit theorem. We need to discuss the Gaussian approximation of quadratic forms for general distributions. We mainly focus on the discussion for $i > b$ and point out the differences for $i \leq b$ at the end of this subsection.
Using (3.2.4) and the basic properties of the Kronecker product, we find that
\[
Y^*\epsilon = \sum_{i=b+1}^n(X\epsilon_i)\otimes b\Big(\frac{i}{n}\Big),
\]
where $\epsilon_i \in \mathbb{R}^{n-b}$ satisfies $\epsilon_{is} = \epsilon_i$ when $s = i-b$ and zero otherwise. Denote $q_{ij}^* = B_j^*(\frac{i}{n})\Sigma^{-1} \in \mathbb{R}^{bc}$. We now rewrite $q_{ij}^*$ as a summation of Kronecker products by
\[
q_{ij}^* = \sum_{k=1}^b e_k^*\otimes q_{ijk}^*,
\]
where e_k is the standard basis vector in R^b and q_{ijk} is the k-th block of q_{ij}, of
size c. As a consequence, we can write
\[
\frac{q_{ij}^* Y^* \varepsilon}{n} = \frac{1}{n} \sum_{k=b+1}^{n} h_k^* q_{ij}^k, \qquad (3.3.3)
\]
where we denote h_k = ε_k x_k and q_{ij}^k ∈ R^b is given by (q_{ij}^k)_s = q_{ijs}^* b(k/n).
Hence, by (3.3.1) and Slutsky's theorem, it suffices to find the distribution of
\[
\frac{1}{n^2} \sum_{k_1=b+1}^{n} \sum_{k_2=b+1}^{n}
h_{k_1}^* q_{ij}^{k_1} \big(q_{ij}^{k_2}\big)^* h_{k_2}. \qquad (3.3.4)
\]
In order to derive the central limit theorems, we now write (3.3.4) as a quadratic form.
To do this, we define H ∈ R^b by letting
\[
H_s = \frac{1}{\sqrt{n}} \sum_{k=b+1}^{n} h_k(s)\, q_{ij}^k(s), \qquad (3.3.5)
\]
where h_k(s) and q_{ij}^k(s) stand for the s-th entries of the respective vectors. Hence, we
can rewrite (3.3.4) as (1/n) H^* E H, where E is the b × b matrix with all entries equal
to one. Therefore, it suffices to derive the distribution of the above quadratic form.
Under the assumption that h_k, k = b+1, …, n, are i.i.d., Xu, Zhang and Wu [111]
derived the L^2 asymptotics of H^*H, showing that it is normally distributed after proper
scaling. In our setting, the h_k's are correlated, so we need to extend their results to
allow for a dependence structure. For the dependent case, Zhang and Cheng [117] derived
the asymptotics of the maximal entry of H for locally stationary time series. For the
rest of this subsection, we employ the above ideas to derive the L^2 asymptotics under
the physical dependence measure for high dimensional locally stationary time series.
To derive the distribution of the quadratic form, we first look at Gaussian vectors. We
now write (1/n) H^* E H as
\[
\frac{1}{n^2}\, h^* Q_{ij}\, h, \qquad (3.3.6)
\]
where h ∈ R^{(n-b)b} is a vector of n - b blocks whose k-th block is h_k, and Q_{ij} is an
(n-b)b × (n-b)b block matrix with blocks of size b × b, whose (k_1, k_2)-th block is
q_{ij}^{k_1} (q_{ij}^{k_2})^*. It is notable that Q_{ij} is a rank-one symmetric matrix and
E(h_k) = 0. We denote by g = (g_{b+1}, …, g_n)^* ∼ N(0, Ω) a Gaussian vector which is
independent of h and preserves its covariance structure. Due to non-stationarity, it is
reasonable to assume that Ω is a full-rank matrix. This implies that Q_{ij} Ω is also of
rank one. As a consequence, we conclude that
\[
g^* Q_{ij}\, g \equiv \lambda_{ij} w_{ij}, \qquad (3.3.7)
\]
where ≡ means that the two sides have the same distribution, λ_{ij} is the nonzero
eigenvalue of Q_{ij} Ω, and w_{ij} is a standard chi-squared random variable with one
degree of freedom.
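A quick Monte Carlo check of (3.3.7), with toy dimensions and a synthetic covariance Ω of our own choosing: for a rank-one symmetric Q and g ∼ N(0, Ω), the quadratic form g^*Qg should match λ·χ²₁ with λ = tr(QΩ) in mean and variance.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6                                   # toy dimension
t = rng.normal(size=d)
Q = np.outer(t, t)                      # rank-one symmetric matrix
A = rng.normal(size=(d, d))
Omega = A @ A.T + d * np.eye(d)         # full-rank covariance

lam = np.trace(Q @ Omega)               # the nonzero eigenvalue of Q @ Omega
g = rng.multivariate_normal(np.zeros(d), Omega, size=200_000)
samples = np.einsum('ij,jk,ik->i', g, Q, g)   # g' Q g for each draw

# lambda * chi^2_1 has mean lambda and variance 2 * lambda^2
assert abs(samples.mean() / lam - 1.0) < 0.05
assert abs(samples.var() / (2 * lam**2) - 1.0) < 0.05
```

The check works because g^*Qg = (t^*g)², and t^*g is Gaussian with variance t^*Ωt = tr(QΩ).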
Next, we will show that the above conclusion holds for general locally stationary time
series, for which we have two main issues to address. The first issue is the dependence
structure. To handle this, by choosing a smooth function as in equation (7.1), we use the
technique of M-dependent sequences and a leave-one-block-out argument. The second issue
is to prove universality for distributions beyond the Gaussian. We employ Stein's method
to compare h and g continuously. Similar ideas have been used in proving universality in
random matrix theory. It is notable that we use the continuous version of Stein's method
because of the dependence; for the independent case, the discrete version of Stein's
method was used.
For i > b, we have the following result on the high dimensional Gaussian approximation.
Recalling (3.3.5), it is notable that
\[
\sup_{s,k} \big| q_{ij}^k(s) \big| \le c^{1/2} \zeta_c,
\]
by the Cauchy-Schwarz inequality. Denote
\[
Z := \frac{1}{c^{1/2} \zeta_c} H = \frac{1}{\sqrt{n}} \sum_{k=b+1}^{n} z_k, \qquad (3.3.8)
\]
where z_k := w_k/(c^{1/2} ζ_c) ∈ R^b with w_k(s) = h_k(s) q_{ij}^k(s). We also denote
U = (1/√n) Σ_{k=b+1}^{n} u_k, where (u_{b+1}, …, u_n) is a centered Gaussian random
vector preserving the covariance structure of (z_{b+1}, …, z_n). Our task is to control
the following Kolmogorov distance
\[
\rho_{ij} := \sup_{x \in R}
\big| P(R^z_{ij} \le x) - P(R^u_{ij} \le x) \big|, \qquad (3.3.9)
\]
with the definitions
\[
R^z_{ij} = Z^* E Z, \qquad R^u_{ij} = U^* E U.
\]
Theorem 3.3.1. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have
\[
\lim_{n \to \infty} \sup_{i > b,\, j \le b} \rho_{ij} = O(l(n)),
\]
where l(n) is defined as
\[
l(n) = \max\Big\{ M_x^{-1},\; n^{-\epsilon},\; \psi^{-1/2},\;
\psi^{1/2} b^{5/4} n^{-1/4} M^{-\tau/2 + 1/2},\;
M^2 \psi^2 b^3 M_x n^{-1},\; \psi b^2 M_x^{-1},
\]
\[
M^5 b^4 M_x^3 \psi^3 n^{-2},\; \psi^2 b^4 M^3 M_x n^{-1},\;
\psi n^{-1/2+\epsilon} \big( M_x^{-5/6} + M^{1/2} M_x^{-3} \big)
\sqrt{M_x \log b} \Big\},
\]
where M_x, ψ, M → ∞ and ε ∈ (0, 1) is some constant.

It can be easily checked that l(n) → 0 by suitably choosing the parameters, for instance
\[
\psi = M = M_x = b^{1/8}.
\]
A similar discussion holds for i ≤ b, except that the dimension of Y_i^* is (i-1)c,
varying with i. Denote by Σ_i ∈ R^{(i-1)c × (i-1)c} the convergent limit of Y_i^* Y_i / n.
We further denote
\[
Z_i := \frac{1}{c^{1/2} \zeta_c} H_i = \frac{1}{\sqrt{n}} \sum_{k=i}^{n} z_k^i,
\]
where z_k^i := w_k^i/(c^{1/2} ζ_c) ∈ R^{i-1} with w_k^i(s) = h_k^i(s) p_{ij}^k(s). Here
h_k^i = ξ_k^i x_k^i ∈ R^{i-1} and p_{ij}^k(s) = p_{ijs}^* b(k/n) with
p_{ij}^* = B_{ij}^*(i/n) Σ_i^{-1}. Similarly, we can prove the Gaussian approximation
results; we omit the details here.
Theorem 3.3.2. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have
\[
\lim_{n \to \infty} \sup_{j < i \le b} \rho_{ij} = O(l(n)),
\]
where l(n) is defined as in Theorem 3.3.1, with M_x, ψ, M → ∞ and ε ∈ (0, 1) some
constant.
Asymptotics of test statistics. With the above preparation, we now derive the
distributions of the test statistics T_1 and T_2 defined in Section 3.3. For T_1, under
H_0^1, we have
\[
T_1 = \frac{1}{n^2} \int_0^1 \Big[ h^* \sum_{j=1}^{b} Q_j(t)\, h
+ \sum_{i=2}^{b} (h^i)^* \sum_{j=1}^{i-1} Q_{ij}(t)\, h^i \Big] dt, \qquad (3.3.10)
\]
where Q_j(t) ∈ R^{(n-b)b × (n-b)b} is an extension of Q_{ij} obtained by letting
Q_j(i/n) = Q_{ij}, h^i ∈ R^{(n-i+1)(i-1)}, and Q_{ij}(t) ∈ R^{(n-i+1)(i-1) × (n-i+1)(i-1)}
is defined similarly to Q_j(t) using the vector B_{ij}(i/n).

Recall that each Q_j(t) is a rank-one matrix, which we can write as Q_j(t) = t_j(t) t_j^*(t).
Here t_j(t) ∈ R^{(n-b)b} is a vector of n - b blocks with the k-th block being q_j^k(t).
As Σ is positive definite, for each fixed t, the vectors t_j, j = 1, …, b, are linearly
independent. Hence, Σ_{j=1}^{b} Q_j(t) is a rank-b symmetric matrix. For the Gaussian
case, using Lindeberg's central limit theorem, h^* Σ_{j=1}^{b} Q_j(t) h is normally
distributed.
Remark 3.3.3. For each fixed t, denote
\[
Q^b(t) = \sum_{j=1}^{b} Q_j(t),
\]
and write its spectral decomposition as Q^b(t) = Σ_{k=1}^{b} λ_k^b(t) u_k^b(t)(u_k^b(t))^*.
Recalling equation (3.3.7), Theorem 3.3.1 has established the Gaussian approximation for
the quadratic form with a rank-one matrix. Since Q^b(t) is a rank-b matrix with b ≪ n,
we can extend our results to the form h^* Q^b(t) h by modifying equation (3.3.4). In
detail, we can write
\[
\sum_{j=1}^{b} \lambda_j^b(t)\, h^* u_j^b(t) \big(u_j^b(t)\big)^* h
= \sum_{j=1}^{b} \lambda_j^b(t) \sum_{k_1=b+1}^{n} \sum_{k_2=b+1}^{n}
h_{k_1}^* u_j^{k_1}(t) \big(u_j^{k_2}(t)\big)^* h_{k_2}. \qquad (3.3.11)
\]
As we have seen from the previous discussion, the key step is to construct an analogue
of equation (3.3.5). We can then rewrite (3.3.11) as
\[
\frac{1}{n} H_b^*\, E^b\, H_b,
\]
where E^b ∈ R^{b² × b²} is a b × b diagonal block matrix with blocks of size b × b, the
j-th diagonal block being λ_j^b(t) E. Here E ∈ R^{b × b} is the matrix with all entries
equal to one, and H_b is a block vector whose j-th block casts the term
Σ_{k_1=b+1}^{n} Σ_{k_2=b+1}^{n} h_{k_1}^* u_j^{k_1}(t)(u_j^{k_2}(t))^* h_{k_2} into the
form of (3.3.5). As b² ≪ n, it is easy to check that the proof of Theorem 3.3.1 still
holds with some minor changes.
For the second term on the right-hand side of (3.3.10), it can be written in the form
\[
h^* Q(t)\, h,
\]
where h is a vector of length Σ_{i=2}^{b} (n-i+1)(i-1) containing all the vectors h^i,
and Q(t) is a (Σ_{i=2}^{b} (n-i+1)(i-1)) × (Σ_{i=2}^{b} (n-i+1)(i-1)) block diagonal
matrix whose i-th diagonal block is Q_{ij}(t). As the covariance matrix of h is regular,
we conclude that this term is still normally distributed. We see from the above discussion
that T_1 is normally distributed with some complicated covariance structure. However,
due to Assumption 3.1.2, it is easy to check that when the first part of (3.3.10) is
small, the second term is also small. Therefore, we instead use the following statistic
\[
T_1^* = \frac{1}{n^2} \int_0^1 (\Sigma_h z_h)^*
\Big( \sum_{j=1}^{b} Q_j(t) \Big) (\Sigma_h z_h)\, dt, \qquad (3.3.12)
\]
where Σ_h is the covariance matrix of h.
We denote the rank-one matrix Ω_k^b(t) = u_k^b(t)(u_k^b(t))^* Ω, with unique non-trivial
eigenvalue μ_k^b(t). For any 1 ≤ k_1, k_2 ≤ b, we also denote
\[
\Omega_{k_1, k_2}^b(t)
= \frac{u_{k_1}^b (u_{k_2}^b)^* + u_{k_2}^b (u_{k_1}^b)^*}{2}\, \Omega,
\]
with unique non-trivial eigenvalue μ_{k_1, k_2}^b(t). By Assumption 3.1.3 and the
smoothness of the basis functions, we conclude that λ_k^b(t) and μ_k^b(t) are smooth
functions on [0, 1]; this is because the characteristic polynomials are smooth functions
of the coefficients. For any t ∈ [0, 1], we define
\[
f_1(t) := \lim_{b \to \infty} \frac{1}{b} \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t),
\]
i.e. the pointwise convergent limit. As λ_k^b and μ_k^b are smooth functions, f_1(t) is
also a smooth function on [0, 1]. We can therefore conclude the following result.
Lemma 3.3.4. There exist continuous functions f_1, f_2 such that, uniformly for
t ∈ [0, 1], we have
\[
\frac{1}{b} \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t) \to f_1(t), \qquad (3.3.13)
\]
\[
\frac{1}{b^2} \Bigg[ \sum_{k_1=1}^{b} \sum_{k_2=1}^{b}
\lambda_{k_1}^b(t)\, \lambda_{k_2}^b(t)
\Big( 2 \big(\mu_{k_1,k_2}^b(t)\big)^2 + \mu_{k_1,k_2}^b(t) \Big)
- \Big( \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t) \Big)^2 \Bigg]
\to f_2(t). \qquad (3.3.14)
\]
We can analyze T_2 in the same way and use the following statistic in its place:
\[
T_2^* = \frac{1}{n^2} \int_0^1 (\Sigma_h z_h)^*
\Big( \sum_{j=k_0+1}^{b} Q_j(t) \Big) (\Sigma_h z_h)\, dt. \qquad (3.3.15)
\]
The asymptotics of T_1^* and T_2^* are summarized in the following proposition.
Proposition 3.3.5. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have:

(1) Under H_0^1,
\[
T_1^* \Rightarrow N(\mu_1, \sigma_1^2),
\]
where μ_1 and σ_1 are defined as
\[
\mu_1 = \frac{b}{n^2} \int_0^1 f_1(t)\, dt, \qquad
\sigma_1^2 = \frac{b^2}{n^4} \int_0^1 f_2(s)(1-s)\, ds,
\]
with f_1(t) and f_2(t) defined in (3.3.13) and (3.3.14).

(2) Under H_0^2,
\[
T_2^* \Rightarrow N(\mu_2, \sigma_2^2),
\]
where μ_2 and σ_2 are defined as
\[
\mu_2 = \frac{b}{n^2} \int_0^1 f_3(t)\, dt, \qquad
\sigma_2^2 = \frac{b^2}{n^4} \int_0^1 f_4(s)(1-s)\, ds,
\]
with f_3(t) and f_4(t) defined analogously to (3.3.13) and (3.3.14), replacing b by
b - k_0.
Estimation of long-run covariance matrices. From (3.3.12) and (3.3.15), we find
that the key to accurate testing is estimating the long-run covariance matrix Σ_h of h.
We follow the construction of the Nadaraya-Watson (NW) type estimator of [120], where h
is assumed to be of fixed dimension. Under Assumption 3.1.2, we show that this estimator
is still consistent in our setup for h ∈ R^{(n-b)b}, albeit with a worse convergence
rate. Note that the covariance matrix of h is an (n-b) × (n-b) block matrix. We first
consider the diagonal part, where each block Λ_k is the covariance matrix of h_k,
k = b+1, …, n. Denote
\[
\Lambda(t) = \mathrm{Cov}\big(h_b(t), h_b(t)\big),
\]
where h_b(t) = (G_1(t, F_b), …, G_1(t, F_1)). Here G_1 is defined such that
ε_i x_{i-1} = G_1(i/n, F_i) for i > b. The following lemma shows that Λ_k can be well
estimated by Λ(k/n) for k > b.
Lemma 3.3.6. Under Assumptions 3.1.2 and 3.1.3, we have
\[
\sup_{k > b} \Big\| \Lambda\Big(\frac{k}{n}\Big) - \Lambda_k \Big\|
= O\big(n^{-1 + 4/\tau}\big).
\]
Next we consider the upper off-diagonal blocks. For any b < k ≤ n - b + 1 and j > b + k,
we have, for some constant C > 0,
\[
\| \Lambda_{kj} \| \le C (j - b)^{-\tau + 1}, \qquad (3.3.16)
\]
by a discussion similar to that of Lemma 3.1.4 together with Gershgorin's circle theorem.
As a consequence, we only need to estimate the blocks Λ_{kj} for k < j ≤ k + b. For each
fixed j, we denote
\[
\Lambda_j(t) = \mathrm{Cov}\big(h_b(t), h_{b+j}(t)\big),
\]
where h_{b+j}(t) = (G_1(t, F_{b+j}), …, G_1(t, F_{j+1})). Similar to Lemma 3.3.6, we have
\[
\Big\| \Lambda_j\Big(\frac{k}{n}\Big) - \Lambda_{kj} \Big\|
= O\big(n^{-1 + 4/\tau}\big).
\]
Next, we will estimate Λ(t) and Λ_j(t), 1 ≤ j ≤ b, using Nadaraya-Watson type
estimators. Denote
\[
\Psi_k = \sum_{j=0}^{m} h_{k+j}, \qquad
\Delta_k = \frac{\Psi_k \Psi_k^*}{m+1}, \qquad b+1 \le k \le n - m,
\]
where m → ∞ and m/n → 0. Let h_n be the bandwidth and γ_n = h_n + (m+1)/n. For
t ∈ I = [γ_n, 1 - γ_n] ⊂ (0, 1), we define
\[
\widehat{\Lambda}(t) = \sum_{k=b+1}^{n-m} W(t, k)\, \Delta_k,
\quad \text{where} \quad
W(t, k) = \frac{K_{h_n}\big(\frac{k}{n} - t\big)}
{\sum_{j=b+1}^{n-m} K_{h_n}\big(\frac{j}{n} - t\big)},
\]
and K_{h_n}(·) is a smooth symmetric density function on R supported on [-1, 1].
Similarly, for each fixed j ≤ b, we define
\[
\Delta_{kj} = \frac{\Psi_k \Psi_{k+j}^*}{m+1}, \qquad b+1 \le k \le n - j - m,
\]
and denote
\[
\widehat{\Lambda}_j(t) = \sum_{k=b+1}^{n-j-m} W_j(t, k)\, \Delta_{kj},
\quad \text{where} \quad
W_j(t, k) = \frac{K_{h_n}\big(\frac{k}{n} - t\big)}
{\sum_{l=b+1}^{n-m-j} K_{h_n}\big(\frac{l}{n} - t\big)}.
\]
Finally, we define \widehat{\Sigma}_h as the long-run covariance matrix estimator by
setting its blocks as
\[
(\widehat{\Sigma}_h)_{kk} = \widehat{\Lambda}\Big(\frac{b+k}{n}\Big), \qquad
(\widehat{\Sigma}_h)_{kj} = \widehat{\Lambda}_{j-k}\Big(\frac{k+b}{n}\Big),
\qquad (3.3.17)
\]
and zero otherwise, where k = 1, 2, …, n - b and k < j ≤ k + b. To avoid abusing
notation, we write \widehat{\Lambda}_0(t) \equiv \widehat{\Lambda}(t).
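As an illustration, here is a minimal scalar sketch of the NW-type construction above (block sums Ψ_k, products Δ_k, kernel weights). The Epanechnikov kernel, the toy variance function σ(t) and all sizes are our own choices for the sketch, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, bw = 20_000, 10, 0.1              # sample size, block length m, bandwidth h_n

# Toy locally stationary sequence: h_k = sigma(k/n) * e_k with smooth sigma
sigma = lambda t: 1.0 + 0.5 * np.sin(np.pi * t)
h = sigma(np.arange(1, n + 1) / n) * rng.normal(size=n)

# Block sums Psi_k and products Delta_k, as in the text (scalar case)
Psi = np.convolve(h, np.ones(m + 1), mode='valid')   # length n - m
Delta = Psi**2 / (m + 1)

def K(u):                                # Epanechnikov kernel on [-1, 1] (our choice)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def Lambda_hat(t):                       # NW-weighted average of the Delta_k
    k = np.arange(n - m)
    w = K((k / n - t) / bw)
    return (w * Delta).sum() / w.sum()

# At interior points the estimate should track the local variance sigma(t)^2
for t in (0.3, 0.5, 0.7):
    assert abs(Lambda_hat(t) - sigma(t)**2) < 0.5
```

In this independent toy example the long-run variance equals the local variance; in the dependent setting of the text, the block sums Ψ_k are what capture the extra autocovariance terms.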
Theorem 3.3.7. Under Assumptions 3.1.2 and 3.1.3, let m → ∞, m/n → 0, h_n → 0 and
n h_n → ∞. Then for j = 0, 1, 2, …, b, we have
\[
\sup_{t \in I} \Big\| \widehat{\Lambda}_j(t) - \Lambda_j(t) \Big\|
= O\Bigg( b \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} \Big) \Bigg). \qquad (3.3.18)
\]
As a consequence, we have
\[
\Big\| \widehat{\Sigma}_h - \Sigma_h \Big\|
= O\Bigg( b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} \Big) \Bigg). \qquad (3.3.19)
\]
In practice, the true ε_i is unknown and we have to use \widehat{\epsilon}_i defined in
(3.2.12). We then define
\[
\widetilde{\Lambda}(t) = \sum_{k=b+1}^{n-m} W(t, k)\, \widetilde{\Delta}_k, \qquad
\widetilde{\Lambda}_j(t) = \sum_{k=b+1}^{n-j-m} W_j(t, k)\, \widetilde{\Delta}_{kj},
\]
where \widetilde{\Delta}_k (respectively \widetilde{\Delta}_{kj}) is defined as Δ_k
(respectively Δ_{kj}) with h_k therein replaced by
\widetilde{h}_k := x_k^b\, \widehat{\epsilon}_k. Similarly, we can define the estimator
\widetilde{\Sigma}_h. The analogue of Theorem 3.3.7 is the following result.
Theorem 3.3.8. Under the assumptions of Theorem 3.3.7, we have
\[
\sup_{t \in I} \Big\| \widetilde{\Lambda}_j(t) - \Lambda_j(t) \Big\|
= O\Bigg( b \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big) \Bigg),
\]
where θ_n is defined as
\[
\theta_n = b \big( n^{-1} + b c^{-p} + b \zeta_c (bc/n)^{1/2} \big)\, m^{-1}.
\]
As a consequence, we have
\[
\Big\| \widetilde{\Sigma}_h - \Sigma_h \Big\|
= O\Bigg( b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big) \Bigg).
\]
By Proposition 3.3.5 and Theorems 3.3.7 and 3.3.8, we now propose the following practical
procedure to test H_0^1 (the test of H_0^2 is similar):

1. For j = 1, 2, …, b and i = b+1, …, n, estimate Σ^{-1} using n(Y^* Y)^{-1} and
calculate Q_{ij} according to the definition in (3.3.6).

2. Estimate the long-run covariance matrix using (3.3.17) from the samples
{h_k}_{k=b+1}^{n}.

3. Generate B (say 2000) i.i.d. copies of Gaussian random vectors z_i, i = 1, 2, …, B.
Here z_i ∼ N(0, I), where I is the identity matrix of dimension (n-b)b. For each
k = 1, 2, …, B, calculate the following Riemann sum
\[
T_k^1 = \frac{1}{n^2} \sum_{j=1}^{b} \sum_{i=b+1}^{n}
(\widehat{\Sigma}_h z_k)^* Q_{ij} (\widehat{\Sigma}_h z_k).
\]

4. Let T^1_{(1)} ≤ T^1_{(2)} ≤ … ≤ T^1_{(B)} be the order statistics of T_k^1,
k = 1, 2, …, B. Reject H_0^1 at level α if T_1^* > T^1_{(⌊B(1-α)⌋)}, where ⌊x⌋ stands
for the largest integer smaller than or equal to x. Let
B^* = max{k : T^1_{(k)} ≤ T_1^*}; the p-value is then 1 - B^*/B.
By Proposition 3.3.5, we find that T_1 (respectively T_2) converges at the rate n^{1-b}
under H_0^1 (respectively H_0^2). Therefore, using Theorem 3.3.8, we find that both of
our test statistics T_1 and T_2 have asymptotic power 1 under the following two
alternatives, respectively:
\[
H_a^1 : \inf_i \big| \mathrm{Cov}(x_i, x_{i+1}) \big|
\ge b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big),
\]
\[
H_a^2 : \inf_i \big| \mathrm{Cov}(x_i, x_{i+k_0+1}) \big|
\ge b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big).
\]
In this subsection, we design simulations to study the finite sample performance of the
tests of white noise and of bandedness of precision matrices, using the procedure
described above. At the nominal levels 0.01, 0.05 and 0.1, the simulated Type I error
rates are listed below for the null hypotheses H_0^1 and H_0^2, based on 1000
simulations, where for H_0^2 we use the AR(2) model (i.e. k_0 = 2). From Tables 3.3 and
3.4, we see that the performance of our proposed tests is reasonably accurate for both
the Fourier basis and the Daubechies wavelet basis.
Next we consider the statistical power of our tests under some given alternatives. For
the test of white noise, we choose the four examples considered above as our
alternatives.

                            n = 200   n = 500   n = 800
  α = 0.01   Fourier Basis   0.01      0.01      0.009
             Wavelet Basis   0.01      0.01      0.01
  α = 0.05   Fourier Basis   0.055     0.049     0.048
             Wavelet Basis   0.051     0.048     0.046
  α = 0.1    Fourier Basis   0.106     0.096     0.091
             Wavelet Basis   0.091     0.089     0.087

Table 3.3: Simulated type I error rates under H_0^1.

                            n = 200   n = 500   n = 800
  α = 0.01   Fourier Basis   0.009     0.011     0.009
             Wavelet Basis   0.011     0.01      0.01
  α = 0.05   Fourier Basis   0.05      0.05      0.048
             Wavelet Basis   0.051     0.05      0.05
  α = 0.1    Fourier Basis   0.098     0.096     0.1
             Wavelet Basis   0.091     0.102     0.092

Table 3.4: Simulated type I error rates under H_0^2 for k_0 = 2.

For the test of bandedness of the precision matrices, under the null hypothesis we choose
k_0 = 2 and consider the following two types of alternatives:
• Time-varying AR(3) process
\[
x_i = 0.6 \cos\Big(\frac{2\pi i}{n}\Big) x_{i-1}
+ 0.3 \sin\Big(\frac{2\pi i}{n}\Big) x_{i-2}
+ \delta \sin\Big(\frac{2\pi i}{n}\Big) x_{i-3} + \epsilon_i,
\]
where the ε_i are i.i.d. standard normal random variables and δ ∈ (0, 0.3). It can be
easily checked that this is a locally stationary process.

• Non-stationary MA(3) process
\[
x_i = 0.6 \cos\Big(\frac{2\pi i}{n}\Big) \epsilon_{i-1}
+ 0.3 \sin\Big(\frac{2\pi i}{n}\Big) \epsilon_{i-2}
+ \frac{i}{n}\, \epsilon_{i-3} + \epsilon_i.
\]
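For reference, the two alternative processes above can be generated directly from their recursions; a minimal simulation sketch (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def tv_ar3(n, delta, rng):
    """Time-varying AR(3) alternative from the text."""
    eps = rng.normal(size=n)
    x = np.zeros(n)
    for i in range(n):
        t = 2 * np.pi * (i + 1) / n
        x[i] = (0.6 * np.cos(t) * (x[i - 1] if i >= 1 else 0.0)
                + 0.3 * np.sin(t) * (x[i - 2] if i >= 2 else 0.0)
                + delta * np.sin(t) * (x[i - 3] if i >= 3 else 0.0)
                + eps[i])
    return x

def ns_ma3(n, rng):
    """Non-stationary MA(3) alternative from the text."""
    eps = rng.normal(size=n + 3)         # eps[i + 2] plays the role of eps_i
    i = np.arange(1, n + 1)
    t = 2 * np.pi * i / n
    return (0.6 * np.cos(t) * eps[2:-1]  # eps_{i-1}
            + 0.3 * np.sin(t) * eps[1:-2]  # eps_{i-2}
            + (i / n) * eps[0:-3]          # eps_{i-3}
            + eps[3:])                     # eps_i

x = tv_ar3(1000, 0.2, rng)
y = ns_ma3(1000, rng)
assert x.shape == y.shape == (1000,)
```

The time-varying AR coefficients stay inside the stability region uniformly in i/n, which is what makes the AR(3) process locally stationary.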
In all of our simulations, we choose the orthogonal wavelet basis as our sieve basis
functions. Figures 3.1 and 3.2 show that our testing procedure is quite robust and has
strong statistical power for both tests.
[Figure: statistical power (0.70 to 1.00) plotted against sample size (50 to 300) for
the AR(1), AR(2), MA(1) and MA(2) alternatives.]

Figure 3.1: Statistical power of white noise testing under nominal level 0.05.
[Figure: statistical power (0.70 to 1.00) plotted against sample size (50 to 300) for
the AR(3) and MA(3) alternatives.]

Figure 3.2: Statistical power of bandedness testing under nominal level 0.05. For the
AR(3) process we choose δ = 0.2.
Finally, we simulate the statistical power for various choices of δ in the AR(3)
process, for the sample sizes n = 200 and n = 300 respectively, in Figure 3.3; we find
that our method is quite robust.
[Figure: statistical power (0 to 1) plotted against the thresholding value δ (0 to 0.3)
for n = 200 and n = 300.]

Figure 3.3: Statistical power of bandedness testing under nominal level 0.05 for
different choices of δ.
Bibliography
[1] Arka Adhikari and Ziliang Che. The edge universality of correlated matrices. arXiv:
1712.04889, 2018.
[2] Oskari Ajanki, Laszlo Erdos, and Torben Kruger. Local spectral statistics of Gaus-
sian matrices with correlated entries. Journal of Statistical Physics, 163:280–302,
2016.
[3] Oskari Ajanki, Laszlo Erdos, and Torben Kruger. Stability of the matrix Dyson
equation and random matrices with correlations. Probability Theory and Related
Fields (to appear), 2016.
[4] Zhidong Bai and Jack Silverstein. Spectral analysis of large dimensional random
matrices. Springer Series in Statistics, Springer, 2nd edition, 2010.
[5] Zhidong Bai and Jianfeng Yao. Central limit theorems for eigenvalues in a spiked
population model. Annales de l’Institut Henri Poincare - Probabilites et Statis-
tiques, 44:447–474, 2008.
[6] Jinho Baik, Gerard Ben Arous, and Sandrine Peche. Phase transition of the largest
eigenvalue for nonnull complex sample covariance matrices. The Annals of Proba-
bility, 33:1643–1697, 2005.
[7] Zhigang Bao and Xiucai Ding. Tracy-Widom limits for sample covariance matrices
with spikes of moderately large rank. In Progress, 2018.
[8] Zhigang Bao, Xiucai Ding, and Ke Wang. Singular subspace inference. In progress,
2018.
[9] Zhigang Bao, Guangming Pan, and Wang Zhou. Universality for the largest eigen-
value of sample covariance matrices with general population. The Annals of Statis-
tics, 43:382–421, 2015.
[10] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Kengo Kato.
Some new asymptotic theory for least squares series: Pointwise and uniform re-
sults. Journal of Econometrics, 186:345–366, 2015.
[11] Florent Benaych-Georges and Antti Knowles. Lectures on the local semicircle law
for Wigner matrices. arXiv: 1601.04055, 2016.
[12] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigen-
vectors of finite, low rank perturbations of large random matrices. Advances in
Mathematics, 227:494–521, 2011.
[13] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vec-
tors of low rank perturbations of large rectangular random matrices. Journal of
Multivariate Analysis, 111:120–135, 2012.
[14] Pavel Bleher and Arno Kuijlaars. Random matrices with external source and mul-
tiple orthogonal polynomials. International Mathematics Research Notices, 3:109–
129, 2004.
[15] Pavel Bleher and Arno Kuijlaars. Integral representations for multiple Hermite and
multiple Laguerre polynomials. Annales de l’Institut Fourier, 55:2001–2014, 2005.
[16] Alex Bloemendal, Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin.
Isotropic local laws for sample covariance and generalized Wigner matrices. Elec-
tronic Journal of Probability, 19:1–53, 2014.
[17] Alex Bloemendal, Antti Knowles, Horng-Tzer Yau, and Jun Yin. On the principal
components of sample covariance matrices. Probability Theory and Related Fields,
164:459–552, 2016.
[18] Alex Bloemendal and Balint Virag. Limits of spiked random matrices I. Probability
Theory and Related Fields, 156:795–825, 2013.
[19] Alex Bloemendal and Balint Virag. Limits of spiked random matrices II. The
Annals of Probability, 44:2726–2769, 2016.
[20] Alexei Borodin. Biorthogonal ensembles. Nuclear Physics B, 3:704–732, 1998.
[21] Alexei Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
[22] Alexei Borodin, Patrik Ferrari, Michael Prahofer, and Tomohiro Sasamoto.
Fluctuation properties of the TASEP with periodic initial configuration. Journal
of Statistical Physics, 129:1055–1080, 2007.
[23] Gaetan Borot. An introduction to random matrix theory. arXiv:1710.10792, 2017.
[24] Joel Bun, Romain Allez, Jean-Philippe Bouchaud, and Marc Potters. Rotational
invariant estimator for general noisy matrices. IEEE Transactions on Information
Theory, 62:7475–7490, 2016.
[25] Joel Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation
matrices: Tools from random matrix theory. Physics Reports, 666:7475–7490, 2017.
[26] Mireille Capitaine and Catherine Donati-Martin. Spectrum of deformed random
matrices and free probability. arXiv: 1607.05560, 2016.
[27] Mireille Capitaine, Catherine Donati-Martin, and Delphine Feral. The largest
eigenvalues of finite rank deformation of large Wigner matrices: Convergence and
nonuniversality of the fluctuations. The Annals of Probability, 37:1–47, 2009.
[28] Sourav Chatterjee. A generalization of the Lindeberg principle. The Annals of
Probability, 34:2061–2076, 2006.
[29] Sourav Chatterjee. Superconcentration and Related Topics. Springer, 2014.
[30] Ziliang Che. Universality of random matrices with correlated entries. Electronic
Journal of Probability, 22:1–38, 2017.
[31] Xiaohong Chen. Large Sample Sieve Estimation of Semi-nonparametric Models.
Chapter 76 in Handbook of Econometrics, Vol. 6B, James J. Heckman and Edward
E. Leamer, 2007.
[32] Xiaohong Chen and Timothy Christensen. Optimal uniform convergence rates
and asymptotic normality for series estimators under weak dependence and weak
conditions. Journal of Econometrics, 188:447–465, 2015.
[33] Ingrid Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied
Mathematics, 1992.
[34] Percy Deift. Orthogonal Polynomials and Random Matrices: A Riemann-Hilbert
Approach. Courant Lecture Notes, American Mathematical Society and the
Courant Institute of Mathematical Sciences at New York University, 1999.
[35] Xiucai Ding. Singular vector distribution of sample covariance matrices. arXiv:
1611.01837, 2016.
[36] Xiucai Ding. Asymptotics of empirical eigen-structure for high dimensional sample
covariance matrices of general form. arXiv: 1708.06296, 2017.
[37] Xiucai Ding. High dimensional deformed rectangular matrices with applications in
matrix denoising. arXiv: 1702.06975, 2017.
[38] Xiucai Ding, Weihao Kong, and Gregory Valiant. Norm consistent oracle estimators
for high dimensional covariance matrices of general form. Preprint, 2017.
[39] Xiucai Ding and Jeremy Quastel. Multi-matrix model and generalization of Airy
process. In progress, 2018.
[40] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge univer-
sality at the largest singular values of covariance matrices. The Annals of Applied
Probability (in press), 2016.
[41] Xiucai Ding and Fan Yang. Necessary and sufficient condition for edge universality
for a general class of sample covariance matrices. In progress, 2018.
[42] Xiucai Ding and Zhou Zhou. Estimation and inference for covariance and precision
matrices of non-stationary time series. Preprint, 2018.
[43] Xiucai Ding and Zhou Zhou. On the stationarity testing for the correlation of time
series. Preprint, 2018.
[44] David Donoho. De-noising by soft-thresholding. IEEE Transactions on Information
Theory, 41:613–627, 1995.
[45] David Donoho, Matan Gavish, and Iain Johnstone. Optimal shrinkage of eigenval-
ues in the spiked covariance model. The Annals of Statistics (to appear), 2013.
[46] R. Brent Dozier and Jack W. Silverstein. Analysis of the limiting spectral dis-
tribution of large dimensional information-plus-noise type matrices. Journal of
Multivariate Analysis, 98:1099–1122, 2007.
[47] Laszlo Erdos. Universality of Wigner random matrices: a survey of recent results.
Russian Mathematical Surveys, 66:507, 2011.
[48] Laszlo Erdos. Lecture Notes on the Matrix Dyson Equation and its Applications
for Random Matrices. IAS/Park City Mathematics Program, 2017.
[49] Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of
Erdos-Renyi graphs II: Eigenvalue spacing and the extreme eigenvalues. Commu-
nications in Mathematical Physics, 314:587–640, 2012.
[50] Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of
Erdos-Renyi graphs I: Local semicircle law. The Annals of Probability, 41:2279–
2375, 2013.
[51] Laszlo Erdos and Horng-Tzer Yau. A Dynamical Approach to Random Matrix
Theory. Courant Lecture Notes, American Mathematical Society and the Courant
Institute of Mathematical Sciences at New York University, 2017.
[52] Laszlo Erdos, Horng-Tzer Yau, and Jun Yin. Rigidity of eigenvalues of generalized
Wigner matrices. Advances in Mathematics, 229:1435–1515, 2012.
[53] Laszlo Erdos, Horng-Tzer Yau, and Jun Yin. Rigidity of eigenvalues of generalized
Wigner matrices. Advances in mathematics, 229:1435–1515, 2012.
[54] Bertrand Eynard. Eigenvalue distribution of large random matrices, from one
matrix to several coupled matrices. Nuclear Physics B, 506:633–664, 1997.
[55] Bertrand Eynard, Taro Kimura, and Sylvain Ribault. Random matrices. arXiv:
1510.04430, 2015.
[56] Bertrand Eynard and Madan Mehta. Matrices coupled in a chain: I. eigenvalue
correlations. Journal of Physics A: Mathematical and General, 31:44–49, 1998.
[57] Jianqing Fan, Yuan Liao, and Martina Mincheva. High-dimensional covariance
matrix estimation in approximate factor models. The Annals of Statistics, 39:3320–
3356, 2011.
[58] Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by
thresholding principal orthogonal complements. Journal of the Royal Statistical
Society: Series B, 75:603–680, 2013.
[59] Patrik Ferrari, Michael Praehofer, and Herbert Spohn. Stochastic growth in one
dimension and Gaussian multi-matrix models. arXiv:03010053, 2003.
[60] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An
Introduction to Statistical Learning. Springer Texts in Statistics, 2013.
[61] Matan Gavish and David Donoho. The optimal hard threshold for singular values
is 4/√3. IEEE Transactions on Information Theory, 60:5040–5053, 2014.
[62] Matan Gavish and David Donoho. Optimal shrinkage of singular values. IEEE
Transactions on Information Theory, 63:2137–2152, 2017.
[63] G.H Golub and C. Van Loan. Matrix Computation. Johns Hopkins University
Press, 4th edition, 2013.
[64] Kurt Johansson. Shape fluctuations and random matrices. Communications in
Mathematical Physics, 209:437–476, 2000.
[65] Kurt Johansson. Discrete polynuclear growth and determinantal processes. Com-
munications in Mathematical Physics, 242:277–329, 2003.
[66] Kurt Johansson. Random matrices and determinantal processes. arXiv:0510038,
2005.
[67] Iain Johnstone. On the distribution of the largest eigenvalue in principal compo-
nents analysis. The Annals of Statistics, 29:295–327, 2001.
[68] Iain Johnstone. Multivariate analysis and Jacobi ensembles: Largest eigenvalue,
Tracy-Widom limits and rates of convergence. The Annals of Statistics, 36:2638–
2716, 2008.
[69] Noureddine El Karoui. Tracy-Widom limit for the largest eigenvalue of a large
class of complex sample covariance matrices. The Annals of Probability, 35:663–
714, 2007.
[70] Noureddine El Karoui. Spectrum estimation for large dimensional covariance ma-
trices using random matrix theory. The Annals of Statistics, 36:2757–2790, 2008.
[71] Antti Knowles and Jun Yin. Eigenvector distribution of Wigner matrices. Proba-
bility Theory and Related Fields, 155:543–582, 2013.
[72] Antti Knowles and Jun Yin. The isotropic semicircle law and deformation of Wigner
matrices. Communications on Pure and Applied Mathematics, 66:1663–1749, 2013.
[73] Antti Knowles and Jun Yin. The outliers of a deformed Wigner matrix. The Annals
of Probability, 42:1980–2031, 2014.
[74] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability
Theory and Related Fields, 169:257–352, 2017.
[75] Weihao Kong and Gregory Valiant. Spectrum estimation from samples. The Annals
of Statistics, 45:2352–2367, 2017.
[76] Arno Kuijlaars. Random matrices with external source and multiple orthogonal
polynomials. Proceedings of the International Congress of Mathematicians, pages
1417–1432, 2010.
[77] Olivier Ledoit and Sandrine Peche. Eigenvectors of some large sample covariance
matrix ensembles. Probability Theory and Related Fields, 151:233–264, 2011.
[78] Olivier Ledoit and Michael Wolf. Numerical implementation of the QuEST func-
tion. Computational Statistics & Data Analysis, 115:199–223, 2017.
[79] Ji Oon Lee and Kevin Schnelli. Edge universality for deformed Wigner matrices.
Reviews in Mathematical Physics, 27:1550018, 2015.
[80] Ji Oon Lee and Kevin Schnelli. Tracy-Widom distribution for the largest eigenvalue
of real sample covariance matrices with general population. The Annals of Applied
Probability, 26:3786–3839, 2016.
[81] Ji Oon Lee and Jun Yin. A necessary and sufficient condition for edge universality
of Wigner matrices. Duke Mathematical Journal, 163:117–173, 2014.
[82] Haoyang Liu, Alexander Aue, and Debashis Paul. On the Marcenko-Pastur law for
linear time series. The Annals of Statistics, 43:675–712, 2015.
[83] V.A. Marcenko and L.A. Pastur. Distribution of eigenvalues for some sets of random matrices.
Sbornik: Mathematics, 1:457–483, 1967.
[84] Konstantin Matetski, Jeremy Quastel, and Daniel Remenik. The KPZ fixed point.
arXiv:1701.00018, 2017.
[85] Madan Mehta. Random matrices. Elsevier Academic Press, 3rd edition, 2004.
[86] Alexandru Nica and Roland Speicher. Lectures on the Combinatorics of Free Prob-
ability. London Mathematical Society Lecture Note Series, Cambridge University
Press, 2006.
[87] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked
covariance model. Statistica Sinica, 17:1617–1642, 2007.
[88] Debashis Paul and Jack Silverstein. No eigenvalues outside the support of the
limiting empirical spectral distribution of a separable covariance matrix. Journal
of Multivariate Analysis, 100:37–57, 2009.
[89] Natesh Pillai and Jun Yin. Universality of covariance matrices. The Annals of
Applied Probability, 24:935–1001, 2014.
[90] Beatriz Pontes, Raul Giraldez, and Jesus S. Aguilar-Ruiz. Biclustering on expression
data: A review. Journal of Biomedical Informatics, 57:163–180, 2015.
[91] Mohsen Pourahmadi. Joint mean-covariance models with applications to longitu-
dinal data: unconstrained parameterisation. Biometrika, 3:677–690, 1999.
[92] Michael Prahofer and Herbert Spohn. Scale invariance of the PNG droplet and the
Airy process. Journal of Statistical Physics, 108:1071–1106, 2002.
[93] Jeremy Quastel and Daniel Remenik. Airy processes and variational problems.
Topics in Percolative and Disordered Systems, pages 121–171, 2014.
[94] Jeremy Quastel and Herbert Spohn. The one-dimensional KPZ equation and its
universality class. Journal of Statistical Physics, 160:965–984, 2015.
[95] Adrian Rollin. Stein's method in high dimensions with applications. Annales de
l’Institut Henri Poincare, Probabilites et Statistiques, 49:529–549, 2011.
[96] Nathan Ross. Fundamentals of Stein's method. Probability Surveys, 8:210–293,
2011.
[97] Tomohiro Sasamoto. Spatial correlations of the 1D KPZ surface on a flat substrate.
Journal of Physics A: Mathematical and General, 38:L549–L556, 2005.
[98] Jack W. Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution
of large dimensional random matrices. Journal of Multivariate Analysis, 54:295–
309, 1995.
[99] Defeng Sun and Jie Sun. Strong semismoothness of eigenvalues of symmetric matri-
ces and its application to inverse eigenvalue problems. SIAM Journal on Numerical
Analysis, 40:2352–2367, 2003.
[100] Gabor Szego. Orthogonal Polynomials. American Mathematical Society Colloquium
Publications, Vol. XXIII, 1939.
[101] Kazumasa Takeuchi and Masaki Sano. Universal fluctuations of growing interfaces:
Evidence in turbulent liquid crystals. Physical Review Letters, 104:230601, 2010.
[102] Terence Tao and Van Vu. Random matrices: Universality of local eigenvalue statis-
tics. Acta Mathematica, 206:127–204, 2011.
[103] Terence Tao, Van Vu, and Manjunath Krishnapur. Random matrices: Universality
of ESDs and the circular law. The Annals of Probability, 38:2023–2065, 2010.
[104] Terence Tao. Topics in Random Matrix Theory. American Mathematical Society,
2012.
[105] Craig Tracy and Harold Widom. Differential equations for Dyson processes. Com-
munications in Mathematical Physics, 252:7–41, 2004.
[106] Eugene Wigner. Characteristic vectors of bordered matrices with infinite dimensions.
Annals of Mathematics, 62:548–564, 1955.
[107] Eugene Wigner. On the distributions of the roots of certain symmetric matrices.
Annals of Mathematics, 67:325–327, 1958.
[108] John Wishart. The generalised product moment distribution in samples from a
normal multivariate population. Biometrika, 20A:32–52, 1928.
[109] Wei Biao Wu. Nonlinear system theory: Another look at dependence. Proceedings of
the National Academy of Sciences of the United States of America, 102:14150–14154,
2005.
[110] Haokai Xi, Fan Yang, and Jun Yin. Local circular law for the product of a deter-
ministic matrix with a random matrix. Electronic Journal of Probability, 22:1–77,
2017.
[111] Mengyu Xu, Danna Zhang, and Wei Biao Wu. L2 asymptotics for high-dimensional
data. arXiv:1405.7244, 2015.
[112] Dan Yang, Zongming Ma, and Andreas Buja. Rate optimal denoising of simulta-
neously sparse and low rank matrices. The Journal of Machine Learning Research,
17:1–27, 2016.
[113] Jianfeng Yao, Shurong Zheng, and Zhidong Bai. Large Sample Covariance Matrices
and High-Dimensional Data. Cambridge University Press, 2015.
[114] Paul Bourgade and Horng-Tzer Yau. The eigenvector moment flow and local quan-
tum unique ergodicity. Communications in Mathematical Physics, 350:231–278,
2017.
[115] Bo Zhang, Guangming Pan, and Jiti Gao. CLT for largest eigenvalues and unit
root tests for high-dimensional nonstationary time series. The Annals of Statistics
(to appear), 2016.
[116] Lixin Zhang. Spectral analysis of large dimensional random matrices. Ph.D. thesis,
National University of Singapore, 2006.
[117] Xianyang Zhang and Guang Cheng. Gaussian approximation for high dimensional
vector under physical dependence. Bernoulli (to appear), 2017.
[118] Zhou Zhou. Heteroscedasticity and autocorrelation robust structural change detec-
tion. Journal of the American Statistical Association, 108:726–740, 2013.
[119] Zhou Zhou and Wei Biao Wu. Local linear quantile estimation for non-stationary
time series. The Annals of Statistics, 37:2696–2729, 2009.
[120] Zhou Zhou and Wei Biao Wu. Simultaneous inference of linear models with time
varying coefficients. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 72:513–531, 2010.