Random matrix theory: From mathematical physics to high dimensional statistics and time series analysis
by
Xiucai Ding
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Statistical Sciences
University of Toronto
© Copyright 2018 by Xiucai Ding
Abstract
Random matrix theory: From mathematical physics to high dimensional statistics and
time series analysis
Xiucai Ding
Doctor of Philosophy
Graduate Department of Statistical Sciences
University of Toronto
2018
Random matrix theory serves as one of the key tools in understanding the eigen-structure of large dimensional matrices. Its applications range from the estimation and inference of high dimensional covariance matrices and the noise reduction of rectangular matrices to the understanding of separable matrices and even matrices with correlations in both rows and columns. Assuming that we observe a p by n data matrix, where log p is comparable to log n, we derive the convergent limits and distributions of the eigenvalues and eigenvectors for a few random matrix models related to the above problems in high dimensional statistics. This part is based on joint papers with Zhigang Bao (HKUST), Fan Yang (UCLA) and Ke Wang (HKUST), in which we employ the dynamic approach developed by Laszlo Erdos and Horng-Tzer Yau [51].

Non-stationary time series analysis is important in understanding the temporal correlation of data. Assuming that only one time series is observed, we develop a methodology to estimate the underlying high dimensional covariance and precision matrices. Based on our methodology, we can infer the covariance and precision matrices using a bootstrapping strategy. This part is based on two joint papers with Professor Zhou Zhou (UofT). It is notable that we apply Stein's method to prove the Gaussian approximation, which is essentially the same as the Green function comparison strategy for proving the universality of random matrix models.
Acknowledgements
This dissertation is dedicated to my son Kyrie Ding, my wife Xin Zhang, my father Shenyue Ding and my mother Yunzhen Ma. I draw my motivation from their love and support.
This dissertation is also dedicated to all of my teachers in my PhD study, especially
to my advisor Professor Jeremy Quastel. From him I have learned not only mathematics but also how to approach questions in the right way.

I would like to thank all of my collaborators; they are brilliant, and I have enjoyed learning mathematics and statistics from them. They are (in alphabetical order of last name): Zhigang Bao (HKUST), Dehan Kong (UofT), Weihao Kong (Stanford), Jeremy Quastel (UofT), Qiang Sun (UofT), Gregory Valiant (Stanford), Ke Wang (HKUST), Hautieng Wu (Duke), Fan Yang (UCLA), and Zhou Zhou (UofT).
Finally, I want to thank all my friends in Toronto who have enjoyed research and life with me. They are (in alphabetical order of last name): Philippe Casgrain, Jinlong Fu, Luhui Gan, Boris Garbuzov, Tianyi Jia, Zhenhua Lin, Peng Liu, Qixuan Ma, Chongda Wang, Shuai Yang and Xingshuo Zhai.
Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
1.1 Introduction to random matrix theory . . . . . . . . . . . . . . . . . . . 2
1.2 Two approaches for analyzing random matrices . . . . . . . . . . . . . . 13
1.3 Applications in statistics and mathematical physics . . . . . . . . . . . . 17
1.4 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Random matrices in high dimensional statistics 26
2.1 Universality of sample covariance matrices . . . . . . . . . . . . . . . . . 26
2.1.1 Edge universality of sample covariance matrices . . . . . . . . . . 26
2.1.2 Universality of singular vector distribution . . . . . . . . . . . . . 68
2.2 Eigen-structure of the model of matrix denoising . . . . . . . . . . . . . . 114
2.3 Eigen-structure of sample covariance matrix of general form . . . . . . . 150
3 Random matrices in non-stationary time series analysis 197
3.1 Locally stationary time series and physical dependence measure . . . . . 197
3.2 Estimation of covariance and precision matrices . . . . . . . . . . . . . . 200
3.3 Inference of covariance and precision matrices . . . . . . . . . . . . . . . 212
Bibliography 229
List of Tables
1.1 Orthogonal polynomials (OP) and random matrix model (RMM) . . . . 13
2.1 Comparison of different algorithms . . . . . . . . . . . . . . . . . . . . . 123
2.2 Loss functions and their optimal shrinkers . . . . . . . . . . . . . . . . . 194
3.1 Operator norm error for estimation of covariance matrices . . . . . . . . 211
3.2 Operator norm error for estimation of precision matrices. . . . . . . . . . 212
3.3 Simulated type I error rates under $H_0^1$ . . . . . . . . . . . . . . . . . . 226
3.4 Simulated type I error rates under $H_0^2$ for $k_0 = 2$ . . . . . . . . . . . . 226
List of Figures
1.1 An example of general sample covariance matrices . . . . . . . . . . . . . 24
2.1 Rotation invariant estimator . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.2 Rotation invariant estimator vs. TSVD . . . . . . . . . . . . . . . . . . 125
2.3 Estimation loss using factor model . . . . . . . . . . . . . . . . . . . . . 170
2.4 Spectrum of the examples . . . . . . . . . . . . . . . . . . . . . . . . . . 193
2.5 Optimal shrinkers under different loss functions . . . . . . . . . . . . . . 194
2.6 Estimation of oracle estimator . . . . . . . . . . . . . . . . . . . . . . . . 195
2.7 Estimation error using POET with extra information . . . . . . . . . . . 196
3.1 White noise test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.2 Bandedness test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.3 Bandedness test for different levels . . . . . . . . . . . . . . . . . . . . . 228
Chapter 1
Introduction
Random matrices first appeared in multivariate statistics in 1928 [108], when Wishart formulated the Wishart distribution on matrix-valued random variables (i.e., Wishart matrices) to study the estimation of covariance matrices. However, the subject did not attract much attention at that time. In the 1950s, Wigner used a simple random matrix model (see Definition 1.1.1) to study the statistical behaviour of slow neutron resonances in nuclear physics [106, 107]. Since then, many random matrix models have been employed in quantum mechanics; for a detailed review, we refer to the book [85] by Mehta.
In physics, random matrices are used to describe the limiting behavior of the eigenvalues of a Hamiltonian operator. In multivariate statistics, however, the typical objects are sample covariance matrices. In this direction, Marcenko and Pastur [83] studied the random matrix model of the form of sample covariance matrices (see Definition 1.1.2). Since then, statisticians have used random matrix models to study estimation and inference problems in multivariate and high dimensional statistics. For a comprehensive review, we refer to the monograph [4] by Bai and Silverstein and the book [113] by Yao, Zheng and Bai.
In this thesis, we focus on the theory and applications of random matrix models in statistics and mathematical physics. We remark, however, that random matrix theory has also been successfully applied to many other areas of mathematics, for instance combinatorics, knot theory and number theory (the Riemann zeta function). These are beyond the scope of this thesis; we refer the readers to the lecture notes [55] by Eynard, Kimura and Ribault.
1.1 Introduction to random matrix theory
In this section, we introduce the well-known random matrix models and the associated
statistical properties of their eigenvalues and eigenvectors.
Definition 1.1.1 (Wigner matrices, Definition 2.2 of [11]). A Wigner matrix is an $n \times n$ Hermitian matrix whose entries $H_{ij}$ satisfy the following conditions: (i). The upper-triangular entries $(H_{ij},\, 1 \leq i \leq j \leq n)$ are independent; (ii). For all $i, j$, we have $\mathbb{E}H_{ij} = 0$ and $\mathbb{E}|H_{ij}|^2 = n^{-1}(1 + O(\delta_{ij}))$; (iii). The random variables $\sqrt{n}H_{ij}$ are bounded in any $L^p$ space, uniformly in $n, i, j$.
Definition 1.1.2 (Sample covariance matrices, Section 1.3 of [17]). For a $p \times p$ positive definite matrix $\Sigma$ and a $p \times n$ rectangular matrix $X$ whose entries $X_{ij}$ are i.i.d. random variables such that $\mathbb{E}X_{ij} = 0$, $\mathbb{E}|X_{ij}|^2 = \frac{1}{n}$, and $\sqrt{n}X_{ij}$ are bounded in any $L^p$ space uniformly in $n, i, j$, we call $\Sigma^{1/2}XX^*\Sigma^{1/2}$ a sample covariance matrix.
Definition 1.1.3 (Addition of random matrix and deterministic matrix, Section 1 of [37]). Consider a $p \times n$ random matrix $X$ satisfying the conditions of Definition 1.1.2 and a $p \times n$ deterministic matrix $S$. We call $\mathcal{S} = S + X$ an addition of random matrix and deterministic matrix.
Definition 1.1.4 (Separable sample covariance matrices, Section 1 of [88]). For a $p \times p$ positive definite matrix $\Sigma_a$ and an $n \times n$ positive definite matrix $\Sigma_b$, consider a random $p \times n$ matrix $X$ satisfying the conditions of Definition 1.1.2. We call $\Sigma_a^{1/2} X \Sigma_b X^* \Sigma_a^{1/2}$ a separable sample covariance matrix.
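For concreteness, the four models above can be sampled in a few lines of numpy. The following sketch is our own illustration (the dimensions, the diagonal choices of $\Sigma$, $\Sigma_a$, $\Sigma_b$, and the rank-one signal $S$ are arbitrary), using the normalization $\mathbb{E}|X_{ij}|^2 = 1/n$ from Definition 1.1.2:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 200

# Wigner matrix (Definition 1.1.1): real symmetric, E|H_ij|^2 of order 1/n
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)

# sample covariance matrix (Definition 1.1.2): Sigma^{1/2} X X^* Sigma^{1/2}
X = rng.standard_normal((p, n)) / np.sqrt(n)             # E X_ij = 0, E|X_ij|^2 = 1/n
Sigma_half = np.diag(np.sqrt(np.linspace(1.0, 2.0, p)))  # Sigma^{1/2}, diagonal here
Q = Sigma_half @ X @ X.T @ Sigma_half

# addition of random and deterministic matrix (Definition 1.1.3): S + X, S low rank
S = np.zeros((p, n))
S[0, 0] = 5.0
Y = S + X

# separable sample covariance matrix (Definition 1.1.4):
# Sigma_a^{1/2} X Sigma_b X^* Sigma_a^{1/2}
Sigma_b = np.diag(np.linspace(0.5, 1.5, n))
Q_sep = Sigma_half @ X @ Sigma_b @ X.T @ Sigma_half

print(H.shape, Q.shape, Y.shape, Q_sep.shape)
```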
Remark 1.1.5. It is notable that researchers have recently been studying random matrices with correlated entries for square matrices using matrix self-consistent equations, for instance [1, 2, 3, 30]. It is expected that these new techniques can also be applied to rectangular matrices. We will pursue this direction in the future.
In Definitions 1.1.1 and 1.1.2, when the entries of the matrices are Gaussian random variables, we can understand them from the perspective of ensembles. We first introduce the one-matrix model, which covers the Gaussian ensembles of Wigner matrices and sample covariance matrices, and then extend it to the multi-matrix model.
Definition 1.1.6 (One-matrix model, Section 1 of [54]). Consider the following probability law on a space $\mathcal{M}$ whose points $M$ are matrices:
$$\mathbb{P}(M) = \frac{1}{Z} e^{-\operatorname{Tr} V(M)},$$
where $V(x)$ is the potential function and $Z$ is a normalization constant. We usually consider the following two types of ensembles: (1). Hermite ensemble: $\mathcal{M}$ is the class of $n \times n$ Hermitian matrices and $V(x)$ is some polynomial function (e.g. $V(x) = \frac{1}{2}x^2$ in the Gaussian case); (2). Laguerre ensemble: $\mathcal{M}$ is the class of positive definite matrices of the form $\Sigma^{1/2}XX^*\Sigma^{1/2}$ and $V(x)$ is some function (e.g. $V(x) = x - p\log x$ in the Gaussian case).
Definition 1.1.7 (Multi-matrix model, Section 1 of [56]). For the multi-matrix model, we consider a chain of $m$ Hermitian $n \times n$ matrices $H_1, \ldots, H_m$ with probability density proportional to
$$\exp\left[-\operatorname{Tr}\left(\tfrac{1}{2}V_1(H_1) + V_2(H_2) + \cdots + V_{m-1}(H_{m-1}) + \tfrac{1}{2}V_m(H_m)\right) + \operatorname{Tr}\left(c_1 H_1 H_2 + \cdots + c_{m-1} H_{m-1} H_m\right)\right],$$
where the $V_j(x)$ are real polynomials of even degree and the $c_j$ are real constants.
Remark 1.1.8. (i). For the one-matrix model, we can define different ensembles by choosing different matrix spaces and probability laws. A third important class is the Jacobi ensemble [68]. (ii). For the multi-matrix model, we only discuss the Hermite ensemble. A special type of Laguerre ensemble was derived by Tracy and Widom in [105].
To clarify the presentation of the results on random matrices, we first introduce some useful notation. For any $n \times n$ matrix $H$, we denote its empirical spectral distribution (ESD) by
$$F^H(x) \equiv F^H_n(x) := \frac{1}{n}\sum_{j=1}^{n} \delta_{\lambda_j}(x),$$
where $\lambda_j := \lambda_j(H)$ are the eigenvalues of $H$ in decreasing order. The Stieltjes transform of $F^H(x)$ is defined as
$$m_n(z) := \int \frac{1}{x - z}\, dF^H(x) = \frac{1}{n}\operatorname{Tr}(H - z)^{-1}, \qquad z \in \mathbb{C}^+.$$
For a given domain $\mathbf{S} \subset \mathbb{C}^+$, we define the Green function of $H$ on $\mathbf{S}$ as
$$G(z) \equiv G_H(z) := (H - z)^{-1}, \qquad z \in \mathbf{S}.$$
Therefore, $m_n(z) = \frac{1}{n}\operatorname{Tr} G(z)$. It is notable that the choice of $\mathbf{S}$ is also crucial to our local analysis. We are mainly interested in the following two questions: (i). Does $F^H(x)$ converge to some nonrandom limit? Such a limit is called the limiting spectral distribution (LSD). (ii). If it does, what is the convergence rate? The answers to (i) are called global laws and those to (ii) are called local laws. The answers to (ii) have many consequences, for instance eigenvalue gaps, the rigidity of eigenvalues, bulk and edge universality, and the delocalization of eigenvectors.
(i). Global laws. It is well known that establishing the convergence of the ESDs of a sequence of matrices is equivalent to showing the convergence of their Stieltjes transforms; the LSD can then be recovered using the inversion formula (see Appendix B.2 of [4]). For the above random matrix models, the Stieltjes transforms of the global laws satisfy their associated self-consistent equations.
For Wigner matrices, denote by $m_{sc}$ the limit of $m_n$; then $m_{sc}$ is the unique solution of the following equation (see equation (1.29) of [47]):
$$m_{sc}(z) + \frac{1}{m_{sc}(z) + z} = 0,$$
and the associated global law is called the semicircular law, which dates back to the work of Wigner [107].
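The equation above is a quadratic in $m_{sc}$ and can be solved in closed form, $m_{sc}(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$, with the branch chosen so that $\operatorname{Im} m_{sc}(z) > 0$ for $z \in \mathbb{C}^+$. A minimal numerical sketch (the matrix size and the spectral parameter are illustrative choices of our own) compares this with the empirical Stieltjes transform of a Wigner matrix:

```python
import numpy as np

def m_sc(z):
    """Solve m^2 + z*m + 1 = 0, taking the root with Im m > 0 (for z in C^+)."""
    r = np.sqrt(z * z - 4 + 0j)
    m = (-z + r) / 2
    return m if m.imag > 0 else (-z - r) / 2

rng = np.random.default_rng(1)
n = 2000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)                    # Wigner matrix, spectrum in [-2, 2]

z = 0.3 + 0.5j                                    # macroscopic spectral parameter
m_n = np.mean(1.0 / (np.linalg.eigvalsh(H) - z))  # m_n(z) = (1/n) Tr (H - z)^{-1}
print(abs(m_n - m_sc(z)))                         # small: the global law holds
```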
For sample covariance matrices, assuming that $n/p \to c \in (0, \infty)$ and that the ESD of $\Sigma$ converges to some nonrandom limit $\pi$, denote by $m_{mp}$ the limit of $m_n$; then it satisfies the following equation (see equation (1.3) of [98]):
$$z = -\frac{1}{m_{mp}} + c^{-1}\int \frac{\lambda\, d\pi(\lambda)}{1 + \lambda m_{mp}},$$
and the associated global law is called the deformed Marcenko-Pastur law [80]. When $\Sigma = I$, this coincides with the Marcenko-Pastur law [83].
For the addition of random matrix and deterministic matrix, assuming that the ESD of $\frac{1}{p}SS^*$ converges to some nonrandom distribution $\pi$, denote by $m_{sn}$ the limit of the Stieltjes transforms; then it satisfies [46]
$$m_{sn}(z) = \int \frac{d\pi(t)}{\frac{t}{1 + p^{-1}m_{sn}(z)} - \big(1 + p^{-1}m_{sn}(z)\big)z + n^{-1}(1 - c)}.$$
However, in this thesis, for the purpose of applications, we will assume that $S$ has a low rank structure [37], and we will simply consider the singular value decomposition of $S$ without normalizing by $p^{-1}$. In this situation, by Cauchy's interlacing property, the global law of $(S + X)(S + X)^*$ is the Marcenko-Pastur law.
For separable sample covariance matrices, denote by $m_{se}$ the limit of $m_n$; then $m_{se}$ is determined by the unique solution of the following system of equations (see equation (4.1.3) of [116] or equations (1) and (2) of [88]):
$$m_{se}(z) = \int \frac{d\pi_A(a)}{a \int \frac{b\, d\pi_B(b)}{1 + c\, b\, e(z)} - z}, \qquad e(z) = \int \frac{a\, d\pi_A(a)}{a \int \frac{b\, d\pi_B(b)}{1 + c\, b\, e(z)} - z},$$
where $\pi_A$ and $\pi_B$ are the LSDs of $\Sigma_a$ and $\Sigma_b$, respectively. Note that if $\Sigma_b = I$, this reduces to the deformed Marcenko-Pastur law. For recent results on random matrices with correlated entries, we refer to the lecture notes [48].

Finally, it is remarkable that the global laws of general deformed random matrices can also be understood using free probability theory, where they can be written in terms of subordination functions. For a comprehensive review, we refer to Section 3 of [26].
(ii). Local laws. Local laws measure how close the ESD and the LSD are when the spectral domain is restricted to a region containing only a few eigenvalues. We will summarize the isotropic local law for Wigner matrices and the anisotropic local law for sample covariance matrices satisfying some regularity conditions on $\Sigma$. For sample covariance matrices under a different condition on $\Sigma$, the local law is derived in [7]. The local law for random matrices with fast decaying correlations can be found in [48].

Finally, we remark that the local laws for separable sample covariance matrices and correlated sample covariance matrices are still missing at this point.
Theorem 1.1.9 (Isotropic local semicircle law, Theorems 2.12 and 2.15 of [16] and Theorems 2.2 and 2.3 of [72]). (1). For a small $\omega \in (0, 1)$, define
$$\mathbf{S}_b \equiv \mathbf{S}_b(\omega, n) := \{z = E + i\eta \in \mathbb{C}^+ : |E| \leq \omega^{-1},\ n^{-1+\omega} \leq \eta \leq \omega^{-1}\}. \qquad (1.1.1)$$
Then for the Wigner matrices defined in Definition 1.1.1, for some small $\epsilon > 0$ and large $D > 0$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, G(z)\mathbf{w}\rangle - \langle \mathbf{v}, \mathbf{w}\rangle m_{sc}(z)\big| \leq n^{\epsilon}\left(\sqrt{\frac{\operatorname{Im} m_{sc}(z)}{n\eta}} + \frac{1}{n\eta}\right), \qquad z \in \mathbf{S}_b,$$
where $\mathbf{v}, \mathbf{w}$ are unit vectors in $\mathbb{C}^n$. When $\mathbf{v}, \mathbf{w}$ are standard basis vectors in $\mathbb{R}^n$, this reduces to the local semicircle law.
(2). Denote the spectral domain outside the bulk by
$$\mathbf{S}_o \equiv \mathbf{S}_o(\omega, n) := \{z = E + i\eta \in \mathbb{C} : 2 + n^{-2/3+\omega} \leq |E| \leq \omega^{-1},\ 0 \leq \eta \leq \omega^{-1}\}; \qquad (1.1.2)$$
then with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, G(z)\mathbf{w}\rangle - \langle \mathbf{v}, \mathbf{w}\rangle m_{sc}(z)\big| \leq n^{\epsilon}\sqrt{\frac{\operatorname{Im} m_{sc}(z)}{n\eta}}, \qquad z \in \mathbf{S}_o.$$
We next introduce the anisotropic local law for sample covariance matrices. We will need the following assumption on $\Sigma$. This type of condition ensures square root behavior of the Stieltjes transform near the right edges and has been used in a series of papers [9, 40, 69, 80]. We first write the non-asymptotic version of the global law as $z = f(m)$, $\operatorname{Im} m(z) > 0$, where $f(x)$ is defined as
$$f(x) = -\frac{1}{x} + \frac{1}{n}\sum_{i=1}^{p} \frac{1}{x + \sigma_i^{-1}}, \qquad (1.1.3)$$
where $\{\sigma_i\}_{i=1}^{p}$ are the eigenvalues of $\Sigma$ in decreasing order. The elementary properties of $f$ are collected in the following lemma.
Lemma 1.1.10 (Properties of $f$). Denote $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. Then $f$ defined in (1.1.3) is smooth on the $p + 1$ open intervals of $\overline{\mathbb{R}}$ defined through $I_1 := (-\sigma_1^{-1}, 0)$, $I_i := (-\sigma_i^{-1}, -\sigma_{i-1}^{-1})$, $i = 2, \ldots, p$, and $I_0 := \overline{\mathbb{R}} \setminus \bigcup_{i=1}^{p} I_i$. We also introduce a multiset $\mathcal{C} \subset \overline{\mathbb{R}}$ containing the critical points of $f$, using the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case $n/p = 1$, $\infty$ is a nondegenerate critical point. With the above notation, we have:
• $|\mathcal{C} \cap I_0| = |\mathcal{C} \cap I_1| = 1$ and $|\mathcal{C} \cap I_i| \in \{0, 2\}$ for $i = 2, \ldots, p$. Therefore $|\mathcal{C}| = 2t$ for some integer $t$, where for convenience we denote by $x_1 \geq x_2 \geq \cdots \geq x_{2t-1}$ the $2t - 1$ critical points in $I_1 \cup \cdots \cup I_p$ and by $x_{2t}$ the unique critical point in $I_0$.
• Denoting $a_k := f(x_k)$, we have $a_1 \geq \cdots \geq a_{2t}$. Moreover, $x_k = m(a_k)$, with the convention $m(0) := \infty$ for $n/p = 1$. Furthermore, for $k = 1, \ldots, 2t$, there exists a constant $C$ such that $0 \leq a_k \leq C$.
• $\operatorname{supp} \rho \cap (0, \infty) = \big(\bigcup_{k=1}^{t}[a_{2k}, a_{2k-1}]\big) \cap (0, \infty)$.
Using the dual relation $f(m(z)) = z$, we can easily derive the asymptotic properties of $m(z)$, which we will discuss in detail in Chapter 2. We now list the key assumption.

Assumption 1.1.11 (Regularity assumption on $\Sigma$, Definition 2.7 of [74]). Fix $\tau > 0$. We assume: (i). The edges $a_k$, $k = 1, \ldots, 2t$, are regular in the sense that
$$a_k \geq \tau, \qquad \min_{l \neq k}|a_k - a_l| \geq \tau, \qquad \min_i |x_k + \sigma_i^{-1}| \geq \tau. \qquad (1.1.4)$$
(ii). The bulk components $k = 1, \ldots, t$ are regular, i.e., for any fixed $\tau' > 0$ there exists a constant $c \equiv c_{\tau, \tau'}$ such that the density of $\rho$ on $[a_{2k} + \tau', a_{2k-1} - \tau']$ is bounded from below by $c$.
Theorem 1.1.12 (Anisotropic local laws, Theorems 3.6 and 3.7 of [74]). Denote the $(p+n) \times (p+n)$ deterministic matrices
$$\Pi(z) := \begin{pmatrix} -\Sigma(1 + m(z)\Sigma)^{-1} & 0 \\ 0 & m(z) \end{pmatrix}, \qquad \widehat{\Sigma} := \begin{pmatrix} \Sigma & 0 \\ 0 & 1 \end{pmatrix},$$
and the random matrix
$$G(z) := \begin{pmatrix} -\Sigma^{-1} & X \\ X^* & -z \end{pmatrix}^{-1}.$$
(i). Under Assumption 1.1.11, for the spectral domain defined in (1.1.1), for some small $\epsilon > 0$ and large $D > 0$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, \widehat{\Sigma}^{-1}(G(z) - \Pi(z))\widehat{\Sigma}^{-1}\mathbf{w}\rangle\big| \leq n^{\epsilon}\left(\sqrt{\frac{\operatorname{Im} m(z)}{n\eta}} + \frac{1}{n\eta}\right), \qquad z \in \mathbf{S}_b,$$
where $\mathbf{v}$ and $\mathbf{w}$ are deterministic unit vectors in $\mathbb{R}^{p+n}$.
(ii). Under Assumption 1.1.11, for $z$ with $0 < \eta \leq \omega^{-1}$ and $\operatorname{dist}(E, \operatorname{supp}\rho) \geq n^{-2/3+\omega}$, with probability $1 - n^{-D}$ we have
$$\big|\langle \mathbf{v}, \widehat{\Sigma}^{-1}(G(z) - \Pi(z))\widehat{\Sigma}^{-1}\mathbf{w}\rangle\big| \leq n^{\epsilon}\sqrt{\frac{\operatorname{Im} m(z)}{n\eta}}.$$
It is remarkable that the isotropic local law for sample covariance matrices can be recovered from the anisotropic local law via the block identity
$$G(z) = \begin{pmatrix} z\Sigma^{1/2}G_1(z)\Sigma^{1/2} & \Sigma X G_2(z) \\ G_2(z)X^*\Sigma & G_2(z) \end{pmatrix},$$
where $G_1(z) = (\Sigma^{1/2}XX^*\Sigma^{1/2} - z)^{-1}$ and $G_2(z) = (X^*\Sigma X - z)^{-1}$.
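The block identity above is purely algebraic (a Schur complement computation) and can be checked numerically; the following sketch uses small matrices with arbitrary entries of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, z = 3, 5, 0.7 + 0.3j

X = rng.standard_normal((p, n)) / np.sqrt(n)
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)                   # a positive definite Sigma

w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T        # Sigma^{1/2}

# linearized Green function G(z) = [[-Sigma^{-1}, X], [X^*, -z]]^{-1}
M = np.block([[-np.linalg.inv(Sigma), X],
              [X.T, -z * np.eye(n)]])
G = np.linalg.inv(M)

G1 = np.linalg.inv(Sigma_half @ X @ X.T @ Sigma_half - z * np.eye(p))
G2 = np.linalg.inv(X.T @ Sigma @ X - z * np.eye(n))

# the four blocks of G(z) match the Schur complement expressions
assert np.allclose(G[:p, :p], z * Sigma_half @ G1 @ Sigma_half)
assert np.allclose(G[:p, p:], Sigma @ X @ G2)
assert np.allclose(G[p:, :p], G2 @ X.T @ Sigma)
assert np.allclose(G[p:, p:], G2)
```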
Finally, we summarize the results on the matrix ensembles defined in Definitions 1.1.6 and 1.1.7. Since we know the exact probability measure, we can compute the joint probability density function via an integral over the unitary (orthogonal) group. We first recall the definition of a determinantal point process [21, 66]. Consider a point process $\xi$ on a complete separable metric space $\Lambda$ with reference measure $\lambda$, all of whose correlation functions $\rho_n$ exist. If there exists a function $K : \Lambda \times \Lambda \to \mathbb{C}$ such that
$$\rho_n(x_1, \ldots, x_n) = \det\big(K(x_i, x_j)\big)_{i,j=1}^{n}$$
for all $x_1, \ldots, x_n \in \Lambda$, then we call $\xi$ a determinantal point process with correlation kernel $K$. We can view the correlation kernel $K$ as the integral kernel of a Hilbert-Schmidt operator, so $K$ can be written in matrix form [93]. The joint probability density function of the eigenvalues $\rho_n^{(n)}(x_1, \ldots, x_n)$ can be computed by a change of variables (see, e.g., Section 8 of [23]), and we are usually interested in computing two important quantities.
One of them is the $k$-point correlation function
$$\rho_n^{(k)}(x_1, \ldots, x_k) := \frac{n!}{(n-k)!}\int_{\mathbb{R}^{n-k}} \rho_n^{(n)}(x_1, \ldots, x_n) \prod_{i=k+1}^{n} dx_i.$$
When $k = 1$, it is the averaged spectral density. The other is the level-spacing function
$$A_n^{(k)}(\theta; x_1, \ldots, x_k) = \frac{n!}{k!(n-k)!}\int_{\mathbb{R}^{n-k} \setminus D^{n-k}} \rho_n^{(n)}(x_1, \ldots, x_n) \prod_{i=k+1}^{n} dx_i,$$
where $D := [-\theta, \theta]$. We are mainly interested in the following two questions: (i). Are the functions $\rho_n^{(k)}$ and $A_n^{(k)}(\theta)$ determinantal? (ii). If they are, how can we use orthogonal and biorthogonal polynomials to characterize the point process? The answers to (ii) have many important consequences, for instance the limiting distributions of the global law (i.e., the level density) and of the largest eigenvalues, obtained after steepest descent analysis. In the following discussion, we focus on Hermitian matrices.
(i). Determinantal point process. For the one-matrix model, we rescale and consider the ensemble
$$\frac{1}{Z} e^{-n \operatorname{Tr} V(M)},$$
where we assume that $V$ is a polynomial with positive leading coefficient. Denoting by $x_i$, $i = 1, \ldots, n$, the eigenvalues of $M$, it is well known that the joint probability density function can be written as
$$\rho_n^{(n)}(x_1, \ldots, x_n) = Z^{-1}\prod_{1 \leq i < j \leq n} |x_i - x_j|^2 \prod_{i=1}^{n} e^{-nV(x_i)}.$$
To show that it is determinantal, we need to find its correlation kernel. This is usually done in terms of orthogonal polynomials [100]; we record the construction in the following lemma and theorem.

Lemma 1.1.13. For any partition function of the form
$$Z = \int_{\mathbb{R}^n} \prod_{1 \leq i < j \leq n} |x_i - x_j|^2 \prod_{i=1}^{n} e^{-W(x_i)}\, dx_1 \cdots dx_n,$$
there always exists a unique sequence of polynomials $(P_n)_{n \geq 0}$ with the following properties:
(1). $P_n$ is a monic polynomial of degree $n$;
(2). For any $n, m \geq 0$ and some constant $h_n > 0$,
$$\int_{\mathbb{R}} P_n(x) P_m(x) e^{-W(x)}\, dx = \delta_{nm} h_n.$$
Then $Z = n! \prod_{m=0}^{n-1} h_m$.
Theorem 1.1.14 (Theorem 9.2 of [23]). Denote the Christoffel-Darboux kernel
$$K(x, y) = \sum_{k=0}^{n-1} \frac{P_k(x) P_k(y)}{h_k};$$
then we have
$$\rho_n^{(k)}(x_1, \ldots, x_k) = \det\big[\overline{K}(x_i, x_j)\big]_{i,j=1}^{k},$$
where $\overline{K}(x, y) = K(x, y) e^{-V(x)/2 - V(y)/2}$.
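As a quick sanity check on Lemma 1.1.13 and Theorem 1.1.14, the following sketch (our own illustration, with the unscaled Hermite weight $e^{-W(x)}$, $W(x) = x^2/2$, chosen for convenience) uses the monic probabilists' Hermite polynomials $He_k$, for which $h_k = \sqrt{2\pi}\, k!$, and verifies both the orthogonality relation and the fact that the weighted kernel on the diagonal integrates to $n$, the total number of eigenvalues:

```python
import numpy as np
from numpy.polynomial.hermite_e import HermiteE
from math import factorial, sqrt, pi

n = 5                                                 # number of "eigenvalues"
He = [HermiteE.basis(k) for k in range(n)]            # monic probabilists' Hermite He_k
h = [sqrt(2 * pi) * factorial(k) for k in range(n)]   # h_k = int He_k^2 e^{-x^2/2} dx

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
w = np.exp(-x ** 2 / 2)                               # weight e^{-W(x)}, W(x) = x^2/2

# orthogonality (Lemma 1.1.13): int He_j He_k e^{-W} dx = delta_{jk} h_k
for j in range(n):
    for k in range(n):
        val = np.sum(He[j](x) * He[k](x) * w) * dx
        target = h[k] if j == k else 0.0
        assert abs(val - target) < 1e-5 * max(1.0, h[k])

# weighted Christoffel-Darboux kernel on the diagonal (Theorem 1.1.14)
K_diag = sum(He[k](x) ** 2 / h[k] for k in range(n)) * w
total = np.sum(K_diag) * dx                           # 1-point function integrates to n
print(total)
```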
For the multi-matrix model, we need to use the biorthogonal polynomials.
Theorem 1.1.15 (Main theorem of [56]). Consider a chain of $m$ Hermitian $n \times n$ matrices $H_1, \ldots, H_m$ with probability density proportional to
$$\exp\left[-\operatorname{Tr}\left(\tfrac{1}{2}V_1(H_1) + V_2(H_2) + \cdots + V_{m-1}(H_{m-1}) + \tfrac{1}{2}V_m(H_m)\right) + \operatorname{Tr}\left(c_1 H_1 H_2 + \cdots + c_{m-1} H_{m-1} H_m\right)\right],$$
where the $V_j(x)$ are real polynomials of even degree and the $c_j$ are real constants. Denote
$$E_{ij}(x, y) = \begin{cases} 0, & i \geq j, \\ \omega_i(x, y), & j = i + 1, \\ (\omega_i * \cdots * \omega_{j-1})(x, y), & j > i + 1, \end{cases}$$
where $\omega_i(x, y) = \exp\big(-\tfrac{1}{2}V_i(x) - \tfrac{1}{2}V_{i+1}(y) + c_i xy\big)$ and $*$ denotes convolution. Then the correlation functions of the eigenvalues are determinantal, and the kernel can be written as
$$K_{ij}(x, y) = H_{ij}(x, y) - E_{ij}(x, y), \qquad 1 \leq i, j \leq m,$$
where $H_{ij}(x, y) = \sum_{l=0}^{n-1} \frac{1}{h_l}\Psi_{il}(x)\Phi_{jl}(y)$, with
$$\int \Psi_{il}(x)\Phi_{ik}(x)\, dx = h_l \delta_{lk}, \qquad 1 \leq i \leq m, \quad l, k \geq 0.$$
Here $\Phi_{ik}(x)$ and $\Psi_{jl}(x)$ can be constructed as follows: choose polynomials $P_j(x)$, $Q_k(y)$ of degrees $j, k$ satisfying
$$\int\!\!\int P_j(x)\,(\omega_1 * \cdots * \omega_{m-1})(x, y)\, Q_k(y)\, dx\, dy = h_j \delta_{jk}.$$
Let $\Psi_{mj}(x) = Q_j(x)$ and $\Phi_{1j}(x) = P_j(x)$; then
$$\Psi_{ij}(x) = \int \omega_i(x, y)\Psi_{i+1,j}(y)\, dy, \qquad \Phi_{ij}(x) = \int \Phi_{i-1,j}(y)\,\omega_{i-1}(y, x)\, dy.$$
(ii). Kernel representation using orthogonal polynomials. Once we have proved that the process is determinantal, the next step is to rewrite the kernel function as a sum of polynomials, whose asymptotics can be obtained by steepest descent analysis. For the one-matrix model, we list in Table 1.1 the common orthogonal polynomials and their associated random matrix models. For the classical polynomials, there exists a recursive relation (the Christoffel-Darboux formula), obtained by analyzing the generating functions of the orthogonal polynomials.
Table 1.1: Orthogonal polynomials (OP) and random matrix model (RMM)

    V(x)                          OP          RMM
    (1/2) x^2                     Hermite     Wigner
    x - a log x                   Laguerre    Wishart
    -a log(1 - x) - b log x       Jacobi      Double Wishart
For the biorthogonal polynomials, existence is guaranteed by the work of Borodin [20], but due to the lack of a simple explicit Christoffel-Darboux formula, the biorthogonal system has to be found case by case. For instance, in the Gaussian case [105], it turns out to consist of the extended Hermite polynomials.
1.2 Two approaches for analyzing random matrices
There are four important methods employed in the study of random matrices: the moment method, the Stieltjes transform, orthogonal and biorthogonal polynomial decompositions, and free probability. We will not discuss the moment method or free probability, as they are beyond the main scope of this thesis; for reference, see [4] and [86].
For statistical applications, we will rely on the dynamic approach via the analysis of Green functions. This approach can be regarded as an extension of the Stieltjes transform method. An important advantage is that it yields local laws with optimal bounds. For applications in mathematical physics, we focus on orthogonal and biorthogonal polynomial decompositions, which provide exact determinantal formulas for the eigenvalue correlation function and the level-spacing function.
Dynamic approach developed by Erdos and Yau. Two good references for this approach are the book [51] and the lecture notes [11]. To employ this idea, we first need to prove the local laws (or their variants) for the associated random matrix models. This relies on a detailed analysis of Green functions following the steps below (a detailed example will be given in Chapter 2):
(a). Use Schur's complement formula and large deviation bounds to prove that the diagonal entries of the Green function are close to their expectations.
(b). Split the terms in Schur's complement formula into a leading term and a random term (usually by multiplying both sides of Schur's complement formula by the diagonal entries of the Green function). By averaging all these terms, we can control the error terms for suitably chosen η; for instance, for Wigner matrices and sample covariance matrices satisfying Assumption 1.1.11, we first take η ≥ 1. This step yields the self-consistent equation for the global law. Note that the spectral domain is also crucial: for example, if we only want to study the local law near the edge, we can restrict the real part of the spectral parameter to within the typical distance from the edge.
(c). Repeat (a) and (b) for the off-diagonal entries of the Green function.
(d). Steps (a), (b) and (c) provide an a priori bound for some η; we can then repeatedly improve the estimates. This yields a weak local law.
(e). To obtain the final form, we need fluctuation averaging for the summations, where we apply the decoupling technique.
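As a numerical illustration of what a local law asserts (a toy experiment of our own, not from the proofs above): even at a mesoscopic scale $\eta = n^{-1/2} \ll 1$, every diagonal Green function entry $G_{ii}(z)$ of a Wigner matrix is already close to $m_{sc}(z)$, with error of order $\sqrt{\operatorname{Im} m_{sc}(z)/(n\eta)}$ as in Theorem 1.1.9:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)                 # Wigner matrix

eta = n ** -0.5                                # mesoscopic scale: 1/n << eta << 1
z = 0.5 + 1j * eta
G = np.linalg.inv(H - z * np.eye(n))           # Green function G(z) = (H - z)^{-1}

r = np.sqrt(z * z - 4 + 0j)                    # m_sc: root of m^2 + z m + 1 = 0
m = (-z + r) / 2
m_sc = m if m.imag > 0 else (-z - r) / 2

err = np.max(np.abs(np.diag(G) - m_sc))        # worst diagonal entry
bound = np.sqrt(m_sc.imag / (n * eta))         # local law error scale
print(err, bound)                              # err is comparable to bound
```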
Once we obtain the local laws for our application, the next step is to write the quantity of interest in terms of smooth functions of the entries of Green functions. For instance, in Chapter 2 (see also [9, 40, 80, 81, 89]), we will show that the distribution function of the largest eigenvalue of the sample covariance matrix can be well approximated by a function depending only on Green functions on some well-chosen interval. We will also show (see also [12, 17, 37, 72]) that for spiked sample covariance matrices, the outlier eigenvalues are completely determined by a deterministic equation involving Green functions, and the overlaps of eigenvectors can be written in terms of derivatives of Green functions. Once we have such a representation of the quantity of interest, we can compute and prove the desired results. In statistical applications, two types of questions are usually considered: universality of the eigen-structure, and asymptotics of the first few largest eigenvalues and eigenvectors of sample covariance matrices.
The strategy for proving universality is either to use Lindeberg's replacement trick [28], replacing entry by entry [40, 52, 81, 102, 103] or column by column [9, 89], or to use a Green function flow (continuous interpolation), controlling the derivatives of Green functions [74, 79, 80].
The outlier eigenvalues and eigenvectors play important roles in statistical estimation and inference. Their convergent limits can usually be computed from the representations of the quantities and the local laws. As for the asymptotics, in the supercritical case we usually have Gaussian fluctuations [12, 37, 72, 73]. To prove this, we derive a recursion formula for the moments using Stein's lemma [8], controlling the error terms with the local laws. For reference, we list the key formulas for Gaussian random variables; their proofs are simple applications of integration by parts.
Lemma 1.2.1 (Recursive formula for the moments of a Gaussian random variable). For a real Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, denote its $n$-th moment by $a_n$. We then have
$$a_{n+2} = \mu a_{n+1} + \sigma^2(n+1)a_n, \qquad a_1 = \mu, \qquad a_2 = \mu^2 + \sigma^2.$$
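A quick check of the recursion (our own sketch; the parameter values are arbitrary): iterating with $a_0 = 1$ and $a_1 = \mu$ reproduces the well-known moments of the standard normal, where odd moments vanish and the $2k$-th moment equals the double factorial $(2k-1)!!$, as well as the base case $a_2 = \mu^2 + \sigma^2$ of the lemma.

```python
def gaussian_moments(mu, sigma2, n_max):
    """Moments a_k = E[X^k], X ~ N(mu, sigma2), via the recursion of Lemma 1.2.1:
    a_{k+2} = mu * a_{k+1} + sigma2 * (k+1) * a_k, with a_0 = 1 and a_1 = mu."""
    a = [1.0, mu]
    for k in range(n_max - 1):
        a.append(mu * a[-1] + sigma2 * (k + 1) * a[k])
    return a

# standard normal: odd moments vanish, even moments are (2k-1)!!
a = gaussian_moments(0.0, 1.0, 8)
print(a)        # [1.0, 0.0, 1.0, 0.0, 3.0, 0.0, 15.0, 0.0, 105.0]

# base case of the lemma: a_2 = mu^2 + sigma^2
b = gaussian_moments(1.5, 2.0, 2)
print(b[2])     # 4.25 = 1.5^2 + 2.0
```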
Lemma 1.2.2 (Stein’s lemma, Appendix A of [29]). Suppose that X = (x1, · · · , xn) ∈ Rn
Chapter 1. Introduction 16
is a n-dimensional centered Gaussian vector. Let f : Rn → R be an absolutely continuous
function such that |∇f(X)| has finite expectation, then for any i,
E(xif(X)) =n∑j=1
E(xixj)E(∂if(X)).
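The lemma is easy to verify by simulation. In the following sketch (our own toy example; the covariance and the test function $f(x) = x_1 x_2^2$ are arbitrary choices), both sides converge to the exact value $\Sigma_{11}\Sigma_{22} + 2\Sigma_{12}^2$ given by Isserlis' theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])               # covariance of a centered 2-d Gaussian
N = 400_000
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=N)
x1, x2 = X[:, 0], X[:, 1]

# test function f(x) = x_1 x_2^2, so df/dx_1 = x_2^2 and df/dx_2 = 2 x_1 x_2
f = x1 * x2 ** 2
lhs = np.mean(x1 * f)                        # E[x_1 f(X)]
rhs = Sigma[0, 0] * np.mean(x2 ** 2) + Sigma[0, 1] * np.mean(2 * x1 * x2)
print(lhs, rhs)   # both close to Sigma_11 Sigma_22 + 2 Sigma_12^2 = 2.5
```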
We point out that the key tricks are the self-consistent equations and the identity $zG = HG - I$. There are many other important applications in high dimensional statistics, for instance linear spectral statistics and correlated sample covariance matrices. We will not discuss them in this thesis but will focus on this direction in the future.
Orthogonal and biorthogonal polynomial decomposition. Two good references are the books [34] and [85]. To apply this approach, we first need to prove that the correlation function of the random matrix model is determinantal, following the steps below:
(a). Derive the joint probability density function of the eigenvalues of the random matrix model using the confluent form of the Harish-Chandra-Itzykson-Zuber integral [14, 85].
(b). For the one-matrix model, choose suitable orthogonal polynomials to rewrite the Vandermonde determinant of the eigenvalues according to the weight of the ensemble. For the multi-matrix model, follow Theorem 1.1.15 to construct the biorthogonal system by choosing suitable initial polynomials according to the potential functions.
(c). Rewrite the correlation function in determinantal form and analyze the asymptotic behavior of the polynomials using the Riemann-Hilbert steepest descent analysis [34].
Once we find the correlation function, the next step is to write the quantities of interest in terms of it; for instance, the gap probabilities can be written as an infinite sum of integrals of correlation functions. It is notable that the most important step is to biorthogonalize the correlation kernel; in some special cases, we can use techniques from TASEP [84].
Remark 1.2.3. (1). The asymptotic analysis is usually easy for studying the complex
case but hard for the real case. For example, the BBP transition was only tackled for
the real case by Bloemendal and Virag [18, 19] by relating the distribution of perturbed
GOE to the probability of explosion of the solution of second order stochastic differential
equations in 2011.
(2). It is notable that we can add an external source to the one-matrix model, in
which case we need to biorthogonalize it as well. We will not pursue this direction here;
we refer to the work of Kuijlaars [15, 76] for further discussion.
1.3 Applications in statistics and mathematical physics
Covariance matrices play important roles in high dimensional data analysis and find
applications in many scientific endeavors, ranging from functional magnetic resonance
imaging and the analysis of gene expression arrays to risk management and portfolio
allocation. Furthermore, a large collection of statistical methods, including principal
component analysis, discriminant analysis, clustering analysis, and regression analysis,
require knowledge of the covariance structure. Estimating a high dimensional covariance
matrix is thus a fundamental problem in high dimensional statistics. The starting
point of covariance matrix estimation is the sample covariance matrix. For the purpose
of statistical applications, our work focuses on the models in Definitions 1.1.2 and 1.1.3.
After deriving the local laws, we can study the statistical properties of the eigenvalues
and eigenvectors of such matrices. For the sample covariance matrix, an important
subclass is the spiked sample covariance matrix [6, 17, 36, 67, 87], where a finite number
of eigenvalues can detach from the bulk and become outliers. In the language of
statistics, we can regard the outlier eigenvalues as the signals and the bulk eigenvalues
as the noise. The signal part contains information only depending on itself, and the
noise part will stick to the sample covariance matrix XX∗. In the supercritical case,
the outlier eigenvalues have Gaussian fluctuations [5], and the distribution of the angle
between the eigenvectors of the population and sample covariance matrices is also Gaussian.
However, in the general situation, it may not be universal [27, 73]. The extremal
non-outlier eigenvalues are governed by the Tracy-Widom asymptotics. Similar results
hold true for the sum of a random matrix and a low-rank deterministic matrix.
Random matrix theory can also help us study the covariance structure of non-stationary
time series and high dimensional time series. On the one hand, for non-stationary time
series, the underlying covariance and precision matrices are large dimensional matrices.
We adapt the construction of Wu and Zhou [109, 119, 120] to characterize the
non-stationary time series. We assume that we can only observe one non-stationary time
series $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}$, with $x_i = G(\tfrac{i}{n}, \mathcal{F}_i)$, where $\mathcal{F}_i = (\cdots, \eta_{i-1}, \eta_i)$ and $\eta_i$, $i \in \mathbb{Z}$, are
i.i.d. centered random variables, and $G: [0,1] \times \mathbb{R}^{\infty} \to \mathbb{R}$ is a measurable function such
that $\xi_i(t) := G(t, \mathcal{F}_i)$ is a properly defined random variable for all $t \in [0,1]$. It is very
important to test whether $\{x_i\}_{i=1}^n$ is a white noise process (with possibly time-varying
variances) and whether its precision matrix is banded. In many cases, the statistic is
a quadratic form in a vector of diverging dimension [42, 118]. The distribution in the
Gaussian case is easy to compute using the classic central limit theorem, and for general
distributions, we need to prove a Gaussian approximation. This is usually done
by using Stein's method [95, 96], which is essentially the same as the Green function
comparison strategy in random matrix theory. On the other hand, in high dimensional
statistics, even though the entries within each vector are correlated through Σ, the
vectors themselves are assumed to be independent; it is important to study the case when
the vectors are correlated with each other. In [82], the Marchenko-Pastur law is derived
for the lagged autocovariance matrices of stationary linear time series, and in [115], the
Gaussian asymptotics of the largest eigenvalue for a special class of unstable time series
is derived.
However, the connection between non-stationary time series and random matrix is still
missing at this point and we will pursue this direction in the future using the framework
of Wu and Zhou.
Random matrix models are also useful in understanding stochastic growth
phenomena in physics [59, 101]. Starting from the work of Johansson [64, 65] and
Prahofer and Spohn [92], the Airy process has been employed to describe the spatial
fluctuations in a wide range of growth models. These processes are at the center of the
KPZ universality class [94]. One way to characterize one such process is to scale the top
eigenvalue curves of Dyson Brownian motion at different time points. Dyson Brownian
motion is a matrix-valued SDE whose entries independently undergo Ornstein-Uhlenbeck
diffusions [59]. If we consider the GUE initial condition and study the transition density
at finite time points, it has the form of the multi-matrix model of Definition 1.1.7. Using
Theorem 1.1.15, we can get the extended Hermite kernel, and scaling this kernel at the
edge yields the Airy process [105]. The key part of the above computation is the
biorthogonalization of the correlation kernel, where in the GUE case the polynomials are
the standard Hermite polynomials [100]. Similar technical problems appear in the
discussion of the totally asymmetric simple exclusion process (TASEP) [22, 97]. Very
recently, Matetski, Quastel and Remenik [84] proposed a new way to understand this
problem in the environment of random walks; it is our hope that we can extend the Airy
process using this technique [39].
1.4 Our contributions
This section is devoted to listing the contributions of this thesis; the details can be
found in Chapters 2 and 3. We divide them into two parts accordingly:
Random matrix theory and high dimensional statistics. We have successfully
applied the dynamic approach developed by Erdos and Yau to study some problems
related to high dimensional statistics.
(1). We prove a necessary and sufficient condition for the edge universality at the largest
eigenvalue of a general class of sample covariance matrices satisfying Assumption
1.1.11 with diagonal Σ in [40]. For the Tracy-Widom asymptotics to hold true,
the following moment assumption is the necessary and sufficient condition:
\[ \lim_{s \to \infty} s^4\, \mathbb{P}\big(|\sqrt{n}\, X_{ij}| \ge s\big) = 0. \]
This implies that the Tracy-Widom distribution still holds true for data with slightly
heavy tails, for example with probability density function of the form
\[ f(x) = \frac{e^4 (4\log x + 1)}{x^5 (\log x)^2}\, \mathbf{1}(x > e). \]
This condition was originally proposed for Wigner matrices by Lee and Yin in [81].
In an on-going project [41], we will prove that this condition still holds true for a
more general class of Σ.
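For this example density, the survival function works out in closed form to $\mathbb{P}(X \ge s) = e^4/(s^4 \log s)$ for $s > e$, so $s^4\,\mathbb{P}(X \ge s) = e^4/\log s \to 0$, but only logarithmically. The following short script (a sanity check, not part of the thesis) verifies the closed form numerically:

```python
import math

def f(x):
    # the heavy-tailed density above: f(x) = e^4 (4 log x + 1) / (x^5 (log x)^2) for x > e
    return math.e**4 * (4 * math.log(x) + 1) / (x**5 * math.log(x)**2)

def tail(s):
    # closed-form survival function: P(X >= s) = e^4 / (s^4 log s);
    # note tail(e) = 1 and -tail'(s) = f(s), so f integrates to 1 on (e, infinity)
    return math.e**4 / (s**4 * math.log(s))

# check -tail'(s) = f(s) by a central difference
s, h = 10.0, 1e-5
assert abs(-(tail(s + h) - tail(s - h)) / (2 * h) - f(s)) < 1e-8

# s^4 P(|X| >= s) = e^4 / log s vanishes, but only logarithmically
for s in (1e1, 1e3, 1e6):
    print(s, s**4 * tail(s))
```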
(2). We prove the universality of the singular vectors for a general class of sample
covariance matrices provided that Assumption 1.1.11 holds true in [35]. We consider
a class of sample covariance matrices of the form Σ1/2XX∗Σ1/2. Assuming p is
comparable to n, we prove that the distribution of the components of the singular
vectors close to each edge singular value agrees with that of Gaussian ensembles
provided the first two moments of the entries coincide with those of Gaussian random variables. For
the singular vectors associated with each bulk singular value, the same conclusion
holds if the first four moments match with those of Gaussian random variables.
Similar results have been proved for Wigner matrices by Knowles and Yin in [71].
We only prove the diagonal case in this paper; however, using the Green function
flow method [80], we can extend the results to any Σ satisfying Assumption 1.1.11.
(3). We systematically study the eigen-structure of the model in Definition 1.1.3, assuming
that $S$ has a low-rank structure, in [37]. Denote the singular value decomposition
of $S$ as $S = UDV^*$, where $D = \operatorname{diag}\{d_1, \cdots, d_r\}$, $U = (u_1, \cdots, u_r)$, $V =
(v_1, \cdots, v_r)$, and where $u_i \in \mathbb{R}^p$, $v_i \in \mathbb{R}^n$ are orthonormal vectors and $r$ is a fixed
constant. We also assume $d_1 > d_2 > \cdots > d_r > 0$. We are interested in the regime
$c_n := n/p$, $\lim_{n \to \infty} c_n = c \in (0, \infty)$. We now give a heuristic description of our
results in the rank-one case; the details can be found in Chapter 2. We denote by
$\mu_1 \ge \cdots \ge \mu_K$ the eigenvalues of $\tilde{S}\tilde{S}^*$, where $\tilde{S} = S + X$ is the observed matrix,
$K = \min\{n, p\}$, and by $\tilde{u}_i, \tilde{v}_i$ the singular vectors of $\tilde{S}$. We prove that when
$d_1 > c^{-1/4}$, $\mu_1 \to p(d_1)$, where $p(d_1)$ is defined through
\[ p(d) = \frac{(d^2 + 1)(d^2 + c^{-1})}{d^2}. \]
When $d_1 > c^{-1/4}$, the largest eigenvalue $\mu_1$ will detach from the bulk and become
an outlier around its classical location $p(d_1)$. We would expect this to happen on
a scale of $n^{-1/3}$. This can be understood in the following way: increasing $d$ beyond
the critical value $c^{-1/4}$, we expect $\mu_1$ to become an outlier, whose location $p(d)$
lies at a distance greater than $O(n^{-2/3})$ from $\lambda_+$. By the mean value theorem,
the phase transition will take place on the scale where $|d_1 - c^{-1/4}| \ge O(n^{-1/3})$.
Furthermore, we also prove that $\mu_1 = p(d_1) + O\big(n^{-1/2}(d_1 - c^{-1/4})^{1/2}\big)$. Below this
scale, we would expect the spectrum of $\tilde{S}\tilde{S}^*$ to stick to that of $XX^*$. In particular,
the largest eigenvalue $\mu_1$ still has the Tracy-Widom distribution on the scale $n^{-2/3}$,
which reads as $\mu_1 = \lambda_+ + O(n^{-2/3})$, $\lambda_+ = (1 + c^{-1/2})^2$.
For the singular vectors, when $d_1 > c^{-1/4}$, we have $\langle u_1, \tilde{u}_1 \rangle^2 \to a_1(d_1)$,
$\langle v_1, \tilde{v}_1 \rangle^2 \to a_2(d_1)$, where $a_1(d_1), a_2(d_1)$ are deterministic functions of $d_1$. We
further prove that if $d_1 > c^{-1/4} + n^{-1/3}$, we have
\[ \langle u_1, \tilde{u}_1 \rangle^2 = a_1(d_1) + O(n^{-1/2}), \qquad \langle v_1, \tilde{v}_1 \rangle^2 = a_2(d_1) + O(n^{-1/2}). \]
Below this scale, we prove that
\[ \langle u_1, \tilde{u}_1 \rangle^2 = O(n^{-1}), \qquad \langle v_1, \tilde{v}_1 \rangle^2 = O(n^{-1}). \]
Finally, we point out that in the working paper [8], we prove that in the supercritical
case when $d_1 > c^{-1/4}$, $\langle u_1, \tilde{u}_1 \rangle^2$ is asymptotically normally distributed if
the singular vector has no component of order $O(1)$.
We also consider two statistical applications. Under the assumption that the $u_i, v_i$ are
sparse, we provide an algorithm to consistently estimate $S$ from $\tilde{S}$. In the general
situation, we provide a rotation-invariant estimator, which performs better
than simply using the singular value decomposition.
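As an illustration (a numerical sketch, not from the thesis, and only schematically following Definition 1.1.3), the outlier location $p(d)$ in the supercritical regime can be checked on a hypothetical rank-one signal-plus-noise matrix $\tilde{S} = d\,uv^* + X$ with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 600, 1200                 # aspect ratio c = n/p = 2
c, d = n / p, 1.5                # d exceeds the critical value c**(-1/4) ~ 0.84

# hypothetical rank-one instance: S_tilde = d * u v^T + X, X with i.i.d. N(0, 1/n) entries
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = rng.standard_normal((p, n)) / np.sqrt(n)
S_tilde = d * np.outer(u, v) + X

mu1 = np.linalg.svd(S_tilde, compute_uv=False)[0] ** 2   # top eigenvalue of S_tilde S_tilde^*
p_d = (d**2 + 1) * (d**2 + 1 / c) / d**2                 # classical outlier location p(d)
lam_plus = (1 + c**-0.5) ** 2                            # right edge of the bulk

print(mu1, p_d, lam_plus)        # mu1 sits near p(d), well detached from the bulk edge
```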
(4). We extend the famous spiked sample covariance matrix model to a more general
model containing more bulk components and outliers in [36]. To extend the bulk
model, we add a finite number $r$ of spikes to the spectrum of $\Sigma_b$ satisfying
Assumption 1.1.11. Denote the spectral decomposition of $\Sigma_b$ as
\[ \Sigma_b = \sum_{i=1}^{p} \sigma_i^b v_i v_i^*, \qquad D_b = \operatorname{diag}\{\sigma_1^b, \cdots, \sigma_p^b\}. \]
Denote by $\mathcal{I} \subset \{1, 2, \cdots, p\}$ the collection of the indices of the $r$
outliers, where $\mathcal{I} := \{o_1, \cdots, o_r\}$. Now we define
\[ \Sigma_g = \sum_{i=1}^{p} \sigma_i^g v_i v_i^*, \qquad \text{where } \sigma_i^g = \begin{cases} \sigma_i^b(1 + d_i), & i \in \mathcal{I}; \\ \sigma_i^b, & \text{otherwise}, \end{cases} \qquad d_i > 0. \]
We also assume that the $d_i$ are arranged in decreasing order. Therefore, we can write
\[ \Sigma_g = \Sigma_b(1 + V\mathcal{D}V^*) = (1 + V\mathcal{D}V^*)\Sigma_b, \]
where $V = (v_1, \cdots, v_p)$ and $\mathcal{D} = \operatorname{diag}(\mathsf{d}_i)$ is a $p \times p$ diagonal matrix with
$\mathsf{d}_i = d_i$ for $i \in \mathcal{I}$ and zero otherwise. Then our new model can be written as
$Q_g = \Sigma_g^{1/2} X X^* \Sigma_g^{1/2}$.
As there exist $m$ bulk components, for convenience we relabel the indices of the
eigenvalues of $Q_g$ as $\mu_{i,j}$, which stands for the $j$-th eigenvalue of the $i$-th bulk
component. Similarly, we relabel $d_{i,j}$, $\sigma^g_{i,j}$, $\sigma^b_{i,j}$. Recalling the definitions
related to $f$ in Lemma 1.1.10, we assume that the $r$ outliers are associated with $t$ bulk
components, each with $r_i$, $i = 1, 2, \cdots, t$, outliers satisfying $\sum_{i=1}^{t} r_i = r$. Using
the convention $x_0 = \infty$, we denote the subset $\mathcal{O}^+ \subset \mathcal{O}$ by $\mathcal{O}^+ = \bigcup_{i=1}^{t} \mathcal{O}^+_i$,
where $\mathcal{O}^+_i$ is defined as
\[ \mathcal{O}^+_i = \Big\{ \sigma^g_{i,j} : x_{2i-1} + N^{-1/3+\epsilon_0} \le -\frac{1}{\sigma^g_{i,j}} < x_{2(i-1)} - c_0 \Big\}, \]
where $\epsilon_0 > 0$ is some small constant and $0 < c_0 < \min_i \frac{x_{2(i-1)} - x_{2i-1}}{2}$. We further
denote $r^+_i := |\mathcal{O}^+_i|$ and the index sets associated with $\mathcal{O}^+_i, \mathcal{O}^+$ by $\mathcal{I}^+_i, \mathcal{I}^+$, where
\[ \mathcal{I}^+_i := \{(i,j) : \sigma^g_{i,j} \in \mathcal{O}^+_i\}, \qquad \mathcal{I}^+ := \bigcup_{i=1}^{t} \mathcal{I}^+_i. \]
We can relabel $\mathcal{I}$ in a similar fashion. We prove that for $i = 1, 2, \cdots, t$ and $j =
1, 2, \cdots, r^+_i$, there exists some constant $C > 1$ such that when $N$ is large enough, with
probability $1 - N^{-D_1}$, we have
\[ \Big|\mu_{i,j} - f\Big(-\frac{1}{\sigma^g_{i,j}}\Big)\Big| \le n^{-1/2 + C\epsilon_0}\Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}. \]
Moreover, for $i = 1, 2, \cdots, t$ and $j = r^+_i + 1, \cdots, r_i$, we have
\[ |\mu_{i,j} - f(x_{2i-1})| \le n^{-2/3 + C\epsilon_0}. \]
Similar results hold for the angle between the eigenvectors of the sample covariance
matrices and the population covariance matrices, where the limit is
$\frac{1}{\sigma^g_{i,j}} \frac{f'(-1/\sigma^g_{i,j})}{f(-1/\sigma^g_{i,j})}$. Examples
and statistical applications are considered to verify our results.
Figure 1.1: An example of the general model. The spectrum of the population covariance
matrix contains three bulk components, and there are three, two and one spikes associated
with the first, second and third bulk components, respectively.
Random matrix theory and time series analysis. We develop a methodology
to estimate the underlying high dimensional covariance and precision
matrices of a locally stationary time series [42], assuming that only one observation is
available. Consider the one dimensional non-stationary time series $\{x_i\}_{i=1}^n$; the starting
point of our methodology is the idea of the Cholesky decomposition [91]. Let $\hat{x}_i$ be the
best linear predictor of $x_i$ based on its predecessors $x_{i-1}, \cdots, x_1$, i.e.
\[ \hat{x}_i = \sum_{j=1}^{i-1} \phi_{ij} x_{i-j}, \quad i = 2, \cdots, n. \]
Denote $\phi_i = (\phi_{i1}, \cdots, \phi_{i,i-1})^*$, where we use $*$ to stand for the transpose. Then we
have $\phi_i = \Gamma_i^{-1}\gamma_i$, where $\Gamma_i$ and $\gamma_i$ are defined as
$\Gamma_i = \operatorname{Cov}(\mathbf{x}_{i-1}, \mathbf{x}_{i-1})$, $\gamma_i = \operatorname{Cov}(\mathbf{x}_{i-1}, x_i)$,
with $\mathbf{x}_{i-1} = (x_{i-1}, \cdots, x_1)$. Let $\epsilon_i = x_i - \hat{x}_i$ be the prediction error, with
variance $\sigma_i^2$. Therefore, we can write
\[ x_i = \sum_{j=1}^{i-1} \phi_{ij} x_{i-j} + \epsilon_i, \quad i = 2, \cdots, n. \]
As $x_i$ is centered, we have $x_1 = \epsilon_1$; as a consequence, we can write $\Phi\Gamma\Phi^* = D$, where
the diagonal matrix $D = \operatorname{diag}\{\sigma_1^2, \cdots, \sigma_n^2\}$ and $\Phi$ is a lower triangular matrix having
ones on its diagonal and $-\phi_{ij}$ at its $(i, i-j)$-th element for $j < i$. We need to estimate
the coefficients $\phi_{ij}$ and the variances of the $\epsilon_i$. Under mild smoothness conditions,
$\phi_{ij}$ can be well approximated by $\phi_j(\tfrac{i}{n})$, where $\phi_j(t)$ is a smooth function defined
on $[0,1]$. Hence, it is natural to employ the idea of sieve estimation [31], where $\phi_j(t)$
can be estimated using some given basis functions. For the variances $\sigma_i^2$, due to the
smoothness assumption, they can also be estimated using the method of sieves. An advantage
of the Cholesky decomposition is that the precision matrix can also be easily (in fact,
numerically more easily) estimated.
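The identity $\Phi\Gamma\Phi^* = D$ can be checked directly on a toy covariance matrix; the following sketch (not from the thesis) uses a stationary AR(1) covariance purely for illustration:

```python
import numpy as np

n, rho = 8, 0.6
# toy covariance: stationary AR(1), Gamma_{jk} = rho^{|j-k|} (illustration only)
idx = np.arange(n)
Gamma = rho ** np.abs(idx[:, None] - idx[None, :])

Phi = np.eye(n)
for i in range(2, n + 1):                    # 1-based time index i
    # best linear predictor coefficients: phi_i = Gamma_i^{-1} gamma_i
    past = np.arange(i - 2, -1, -1)          # 0-based indices of x_{i-1}, ..., x_1
    phi = np.linalg.solve(Gamma[np.ix_(past, past)], Gamma[past, i - 1])
    Phi[i - 1, past] = -phi                  # -phi_{ij} at entry (i, i-j)

D = Phi @ Gamma @ Phi.T                      # Phi Gamma Phi^* = D, a diagonal matrix
assert np.allclose(D, np.diag(np.diag(D)), atol=1e-10)
print(np.diag(D))                            # prediction error variances: 1, then 1 - rho^2
```

For this Markovian toy example, only the lag-one coefficient is nonzero and every prediction error variance after the first equals $1 - \rho^2$; for a genuinely non-stationary series the same construction applies entry by entry.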
As byproducts, we can use the estimators of $\phi_{ij}$ to infer the structure of the covariance
and precision matrices of the non-stationary time series. In the first paper [42], we
consider two concrete hypothesis testing problems: one is to test whether $\{x_i\}_{i=1}^n$ is a
white noise process, and the other is to test whether its precision matrix is banded. In
the second paper [43], we test the stationarity of the correlation structure of the time series.
Chapter 2
Random matrices in high
dimensional statistics
In this chapter, we provide detailed proofs and computations on the eigen-structure of some
random matrix models and discuss their statistical applications, which were sketched as the
first part of our contributions in Section 1.4. We will list the detailed results and the key
proofs here. For a complete discussion, we refer to our papers [7, 8, 35, 36, 37, 38, 40, 41].
2.1 Universality of sample covariance matrices
2.1.1 Edge universality of sample covariance matrices
Sample covariance matrices with general populations. We consider the M1×M1
sample covariance matrix Q1 := TX(TX)∗, where T is a deterministic M1 ×M2 matrix
and X is a random M2 × N matrix. We assume X = (xij) has entries xij = N−1/2qij,
1 ≤ i ≤M2 and 1 ≤ j ≤ N , where qij are i.i.d. random variables satisfying
\[ \mathbb{E} q_{11} = 0, \qquad \mathbb{E}|q_{11}|^2 = 1. \tag{2.1.1} \]
In this subsection, we regard N as the fundamental (large) parameter and M1,2 ≡ M1,2(N)
as depending on N. We define M := min{M1, M2} and the aspect ratio dN := N/M.
Moreover, we assume that
dN → d ∈ (0,∞), as N →∞. (2.1.2)
For simplicity of notations, we will almost always abbreviate dN as d in this paper. We
denote the eigenvalues of Q1 in decreasing order by λ1(Q1) ≥ . . . ≥ λM1(Q1). We will
also need the N × N matrix Q2 := (TX)∗TX and denote its eigenvalues by λ1(Q2) ≥
. . . ≥ λN(Q2). Since Q1 and Q2 share the same nonzero eigenvalues, we will for simplicity
write λj, 1 ≤ j ≤ min{N, M1}, to denote the j-th eigenvalue of both Q1 and Q2 without
causing any confusion.
We assume that $T^*T$ is diagonal. In other words, $T$ has a singular value decomposition
$T = UD$, where $U$ is an $M_1 \times M_1$ unitary matrix and $D$ is an $M_1 \times M_2$ rectangular
diagonal matrix. Then it is equivalent to study the eigenvalues of $DX(DX)^*$. When
$M_1 \le M_2$ (i.e. $M = M_1$), we can write $D = (\bar{D}, 0)$, where $\bar{D}$ is an $M \times M$ diagonal
matrix such that $\bar{D}_{11} \ge \ldots \ge \bar{D}_{MM}$. Hence we have $DX = \bar{D}\bar{X}$, where $\bar{X}$ is the
upper $M \times N$ block of $X$ with i.i.d. entries $x_{ij}$, $1 \le i \le M$ and $1 \le j \le N$. On the
other hand, when $M_1 \ge M_2$ (i.e. $M = M_2$), we can write $D = \binom{\bar{D}}{0}$, where $\bar{D}$ is an
$M \times M$ diagonal matrix as above. Then $DX = \binom{\bar{D}X}{0}$, which shares the same nonzero
singular values with $\bar{D}X$. The above discussion shows that we can make the following
stronger assumption on $T$:
\[ M_1 = M_2 = M, \quad \text{and} \quad T \equiv D = \operatorname{diag}\big(\sigma_1^{1/2}, \sigma_2^{1/2}, \ldots, \sigma_M^{1/2}\big), \tag{2.1.3} \]
where
σ1 ≥ σ2 ≥ . . . ≥ σM ≥ 0.
Under the above assumption, the population covariance matrix of Q1 is defined as
Σ := EQ1 = D2 = diag (σ1, σ2, . . . , σM) . (2.1.4)
We denote the empirical spectral density of $\Sigma$ by
\[ \pi_N := \frac{1}{M} \sum_{i=1}^{M} \delta_{\sigma_i}. \tag{2.1.5} \]
We assume that there exists a small constant τ > 0 such that
\[ \sigma_1 \le \tau^{-1} \quad \text{and} \quad \pi_N([0, \tau]) \le 1 - \tau \quad \text{for all } N. \tag{2.1.6} \]
Note the first condition means that the operator norm of Σ is bounded by τ−1, and the
second condition means that the spectrum of Σ cannot concentrate at zero.
For definiteness, in this subsection we will focus on the real case, i.e. the random
variable q11 is real. However, we remark that our proof can be applied to the complex case
after minor modifications if we assume in addition that Re q11 and Im q11 are independent
centered random variables with variance 1/2.
We summarize our basic assumptions here for future reference.
Assumption 2.1.1. We assume that X is an M × N random matrix with real i.i.d.
entries satisfying (2.1.1) and (2.1.2). We assume that T is an M × M deterministic
diagonal matrix satisfying (2.1.3) and (2.1.6).
Deformed Marchenko-Pastur law. In this part, we will study the eigenvalue statis-
tics of Q1,2 through their Green functions or resolvents.
Definition 2.1.2 (Green functions). For z = E + iη ∈ C+, where C+ is the upper half
complex plane, we define the Green functions for Q1,2 as
G1(z) := (DXX∗D∗ − z)−1 , G2(z) := (X∗D∗DX − z)−1 . (2.1.7)
We denote the empirical spectral densities (ESD) of $Q_{1,2}$ as
\[ \rho_1^{(N)} := \frac{1}{M} \sum_{i=1}^{M} \delta_{\lambda_i(Q_1)}, \qquad \rho_2^{(N)} := \frac{1}{N} \sum_{i=1}^{N} \delta_{\lambda_i(Q_2)}. \]
Then the Stieltjes transforms of $\rho_{1,2}$ are given by
\[ m_1^{(N)}(z) := \int \frac{1}{x - z}\, \rho_1^{(N)}(dx) = \frac{1}{M} \operatorname{Tr} G_1(z), \qquad m_2^{(N)}(z) := \int \frac{1}{x - z}\, \rho_2^{(N)}(dx) = \frac{1}{N} \operatorname{Tr} G_2(z). \]
Throughout the rest of this subsection, we omit the super-index N from our notations.
Remark 2.1.3. Since the nonzero eigenvalues of Q1 and Q2 are identical, and Q1 has
M −N more (or N −M less) zero eigenvalues, we have
\[ \rho_1 = d\,\rho_2 + (1 - d)\,\delta_0, \tag{2.1.8} \]
and
\[ m_1(z) = -\frac{1 - d}{z} + d\, m_2(z). \tag{2.1.9} \]
In the case $D = I_{M \times M}$, it is well known that the ESD $\rho_2$ of $X^*X$ converges weakly
to the Marchenko-Pastur (MP) law [83]:
\[ \rho_{MP}(x)\,dx := \frac{1}{2\pi} \frac{\sqrt{[(\lambda_+ - x)(x - \lambda_-)]_+}}{x}\, dx, \tag{2.1.10} \]
where $\lambda_\pm = (1 \pm d^{-1/2})^2$. Moreover, $m_2(z)$ converges to the Stieltjes transform
$m_{MP}(z)$ of $\rho_{MP}$, which can be computed explicitly as
\[ m_{MP}(z) = \frac{d^{-1} - 1 - z + i\sqrt{(\lambda_+ - z)(z - \lambda_-)}}{2z}, \quad z \in \mathbb{C}_+. \tag{2.1.11} \]
Moreover, one can verify that $m_{MP}(z)$ satisfies the self-consistent equation [9, 98]
\[ \frac{1}{m_{MP}(z)} = -z + d^{-1} \frac{1}{1 + m_{MP}(z)}, \qquad \operatorname{Im} m_{MP}(z) \ge 0 \ \text{for } z \in \mathbb{C}_+. \tag{2.1.12} \]
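As a quick numerical sanity check (not part of the thesis), one can verify that the explicit formula (2.1.11), with the square-root branch chosen so that $\operatorname{Im} m_{MP} \ge 0$, indeed solves the self-consistent equation (2.1.12):

```python
import cmath

d = 2.0                                  # aspect ratio d = N/M
lam_p = (1 + d**-0.5) ** 2               # lambda_+
lam_m = (1 - d**-0.5) ** 2               # lambda_-

def m_mp(z):
    # explicit formula (2.1.11); the square-root branch is fixed by Im m >= 0
    s = cmath.sqrt((lam_p - z) * (z - lam_m))
    m = (1 / d - 1 - z + 1j * s) / (2 * z)
    if m.imag < 0:                       # pick the root lying in the upper half plane
        m = (1 / d - 1 - z - 1j * s) / (2 * z)
    return m

for z in (0.5 + 0.1j, 2.0 + 0.01j, 4.0 + 1.0j):
    m = m_mp(z)
    residual = 1 / m - (-z + (1 / d) / (1 + m))   # self-consistent equation (2.1.12)
    assert m.imag >= 0 and abs(residual) < 1e-10
```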
Using (2.1.8) and (2.1.9), it is easy to get the expressions for ρ1c, the asymptotic eigen-
value density of Q1, and m1c, the Stieltjes transform of ρ1c.
If $D$ is not the identity but the ESD $\pi_N$ in (2.1.5) converges weakly to some $\pi$, then it
was shown that the empirical eigenvalue distribution of $Q_2$ still converges in probability
to a deterministic distribution $\rho_{2c}$, referred to as the deformed Marchenko-Pastur
law below. It can be described through its Stieltjes transform
\[ m_{2c}(z) := \int_{\mathbb{R}} \frac{\rho_{2c}(dx)}{x - z}, \qquad z = E + i\eta \in \mathbb{C}_+. \]
For any given probability measure $\pi$ compactly supported on $\mathbb{R}_+$, we define $m_{2c}$ as the
unique solution to the self-consistent equation [98]
\[ \frac{1}{m_{2c}(z)} = -z + d^{-1} \int \frac{x}{1 + m_{2c}(z)x}\, \pi(dx), \tag{2.1.13} \]
where the branch-cut is chosen such that $\operatorname{Im} m_{2c}(z) \ge 0$ for $z \in \mathbb{C}_+$. It is well known
that the functional equation (2.1.13) has a unique solution that is uniformly bounded on
$\mathbb{C}_+$ under the assumptions (2.1.2) and (2.1.6). Letting $\eta \downarrow 0$, we can recover the
asymptotic eigenvalue density $\rho_{2c}$ with the inversion formula
\[ \rho_{2c}(E) = \lim_{\eta \downarrow 0} \frac{1}{\pi} \operatorname{Im} m_{2c}(E + i\eta). \tag{2.1.14} \]
The measure ρ2c is sometimes called the multiplicative free convolution of π and the MP
law. Again with (2.1.8) and (2.1.9), we can easily obtain m1c and ρ1c(z).
Similar to (2.1.13), for any finite $N$ we define $m_{2c}^{(N)}$ as the unique solution to the
self-consistent equation
\[ \frac{1}{m_{2c}^{(N)}(z)} = -z + d_N^{-1} \int \frac{x}{1 + m_{2c}^{(N)}(z)x}\, \pi_N(dx), \tag{2.1.15} \]
and define $\rho_{2c}^{(N)}$ through the inverse formula as in (2.1.14). Then we define $m_{1c}^{(N)}$
and $\rho_{1c}^{(N)}$ using (2.1.8) and (2.1.9). In the rest of this paper, we will always omit the
super-index $N$ from our notations. The properties of $m_{1c,2c}$ and $\rho_{1c,2c}$ have been
studied extensively. Here we collect some basic results that will be used in our proof. In
particular, we shall define the rightmost edge (i.e. the soft edge) of $\rho_{1c,2c}$.
Corresponding to the equation in (2.1.15), we define the function
\[ f(m) := -\frac{1}{m} + d_N^{-1} \int \frac{x}{1 + mx}\, \pi_N(dx). \tag{2.1.16} \]
Then $m_{2c}(z)$ can be characterized as the unique solution to the equation $z = f(m)$ with
$\operatorname{Im} m \ge 0$.
Lemma 2.1.4 (Support of the deformed MP law). The densities $\rho_{1c}$ and $\rho_{2c}$ have the
same support on $\mathbb{R}_+$, which is a union of connected components:
\[ \operatorname{supp} \rho_{1,2c} \cap (0, \infty) = \bigcup_{k=1}^{p} [a_{2k}, a_{2k-1}] \cap (0, \infty), \tag{2.1.17} \]
where $p \in \mathbb{N}$ depends only on $\pi_N$. Here the $a_k$ are characterized as follows: there
exists a real sequence $\{b_k\}_{k=1}^{2p}$ such that $(x, m) = (a_k, b_k)$ are the real solutions to
the equations
\[ x = f(m) \quad \text{and} \quad f'(m) = 0. \tag{2.1.18} \]
Moreover, we have $b_1 \in (-\sigma_1^{-1}, 0)$. Finally, under assumptions (2.1.2) and (2.1.6), we
have $a_1 \le C$ for some positive constant $C$.
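The characterization (2.1.18) is easy to use numerically. As a sketch (not from the thesis), for the identity $\Sigma$, i.e. $\pi_N = \delta_1$, solving $f'(m) = 0$ by bisection and evaluating $f$ recovers the Marchenko-Pastur edges $\lambda_\pm$; the critical point $b_1$ indeed lies in $(-\sigma_1^{-1}, 0) = (-1, 0)$, consistently with the lemma:

```python
d = 2.0            # identity Sigma, so pi_N = delta_1 and f(m) = -1/m + d^{-1}/(1+m)

def f(m):
    return -1 / m + (1 / d) / (1 + m)

def fp(m):         # f'(m) = 1/m^2 - d^{-1}/(1+m)^2
    return 1 / m**2 - (1 / d) / (1 + m) ** 2

def bisect(lo, hi, tol=1e-12):
    # bisection for f'(m) = 0 on an interval with a sign change
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fp(lo) * fp(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

b1 = bisect(-0.9, -0.5)     # critical point in (-sigma_1^{-1}, 0), as in the lemma
b2 = bisect(-4.0, -3.0)     # second critical point, below -1
edges = sorted([f(b1), f(b2)])
print(edges)                # recovers the MP edges [(1 - d**-0.5)**2, (1 + d**-0.5)**2]
```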
It is easy to observe that m2c(ak) = bk according to the definition of f . We shall call
ak the edges of the deformed MP law ρ2c. In particular we will focus on the rightmost
edge λr := a1. To establish our result, we need the following extra assumption.
Assumption 2.1.5. For σ1 defined in (2.1.3), we assume that there exists a small con-
stant τ > 0 such that
\[ |1 + m_{2c}(\lambda_r)\sigma_1| \ge \tau, \quad \text{for all } N. \tag{2.1.19} \]
Remark 2.1.6. The above assumption guarantees a regular square-root behavior of the
spectral density ρ2c near λr (see Lemma 2.1.15 below), which is used in proving the local
deformed MP law at the soft edge. Note that $f(m)$ has singularities at $m = -\sigma_i^{-1}$ for
nonzero σi, so the condition (2.1.19) simply rules out the singularity of f at m2c(λr).
Main result. The main result of this paper is the following theorem. It establishes the
necessary and sufficient condition for the edge universality of the deformed covariance
matrix Q2 at the soft edge λr. We define the following tail condition for the entries of
X:
\[ \lim_{s \to \infty} s^4\, \mathbb{P}(|q_{11}| \ge s) = 0. \tag{2.1.20} \]
Theorem 2.1.7. Let $Q_2 = X^*T^*TX$ be an $N \times N$ sample covariance matrix with $X$
and $T$ satisfying Assumptions 2.1.1 and 2.1.5. Let $\lambda_1$ be the largest eigenvalue of $Q_2$.
• Sufficient condition: If the tail condition (2.1.20) holds, then we have
\[ \lim_{N \to \infty} \mathbb{P}\big(N^{2/3}(\lambda_1 - \lambda_r) \le s\big) = \lim_{N \to \infty} \mathbb{P}^G\big(N^{2/3}(\lambda_1 - \lambda_r) \le s\big), \tag{2.1.21} \]
for all $s \in \mathbb{R}$, where $\mathbb{P}^G$ denotes the law for $X$ with i.i.d. Gaussian entries.
• Necessary condition: If the condition (2.1.20) does not hold for $X$, then for
any fixed $s > \lambda_r$, we have
\[ \limsup_{N \to \infty} \mathbb{P}(\lambda_1 \ge s) > 0. \tag{2.1.22} \]
Remark 2.1.8. In [79], it was proved that there exists $\gamma_0 \equiv \gamma_0(N)$, depending only on
$\pi_N$ and the aspect ratio $d_N$, such that
\[ \lim_{N \to \infty} \mathbb{P}^G\big(\gamma_0 N^{2/3}(\lambda_1 - \lambda_r) \le s\big) = F_1(s) \]
for all $s \in \mathbb{R}$, where $F_1$ is the type-1 Tracy-Widom distribution. The scaling factor
$\gamma_0$ is given by [69]
\[ \frac{1}{\gamma_0^3} = \frac{1}{d} \int \left(\frac{x}{1 + m_{2c}(\lambda_r)x}\right)^3 \pi_N(dx) - \frac{1}{m_{2c}(\lambda_r)^3}, \]
and Assumption 2.1.5 assures that $\gamma_0 \sim 1$ for all $N$. Hence (2.1.21) and (2.1.22) together
show that the distribution of the rescaled largest eigenvalue of $Q_2$ converges to the Tracy-
Widom distribution if and only if the condition (2.1.20) holds.
Remark 2.1.9. The universality result (2.1.21) can be extended to the joint distribution
of the $k$ largest eigenvalues for any fixed $k$:
\[ \lim_{N \to \infty} \mathbb{P}\Big(\big(N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big) = \lim_{N \to \infty} \mathbb{P}^G\Big(\big(N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big), \tag{2.1.23} \]
for all $s_1, s_2, \ldots, s_k \in \mathbb{R}$. Let $H^{GOE}$ be an $N \times N$ random matrix belonging to
the Gaussian orthogonal ensemble. The joint distribution of the $k$ largest eigenvalues of
$H^{GOE}$, $\mu_1^{GOE} \ge \ldots \ge \mu_k^{GOE}$, can be written in terms of the Airy kernel for any
fixed $k$, and
\[ \lim_{N \to \infty} \mathbb{P}^G\Big(\big(\gamma_0 N^{2/3}(\lambda_i - \lambda_r) \le s_i\big)_{1 \le i \le k}\Big) = \lim_{N \to \infty} \mathbb{P}\Big(\big(N^{2/3}(\mu_i^{GOE} - 2) \le s_i\big)_{1 \le i \le k}\Big), \]
for all $s_1, s_2, \ldots, s_k \in \mathbb{R}$. Hence (2.1.23) gives a complete description of the finite-
dimensional correlation functions of the largest eigenvalues of $Q_2$.
Notations. Following the notations in [49, 50], we will use the following definition to
characterize events of high probability.
Definition 2.1.10 (High probability event). Define
\[ \varphi := (\log N)^{\log \log N}. \tag{2.1.24} \]
We say that an $N$-dependent event $\Omega$ holds with $\xi$-high probability if there exist
constants $c, C > 0$ independent of $N$ such that
\[ \mathbb{P}(\Omega) \ge 1 - N^C \exp(-c\varphi^\xi) \tag{2.1.25} \]
for all sufficiently large $N$. For simplicity, in the case $\xi = 1$ we just say high probability.
Note that if (2.1.25) holds, then $\mathbb{P}(\Omega) \ge 1 - \exp(-c'\varphi^\xi)$ for any constant
$0 \le c' < c$.
Definition 2.1.11 (Bounded support condition). A family of $M \times N$ matrices $X = (x_{ij})$
is said to satisfy the bounded support condition with $q \equiv q(N)$ if
\[ \mathbb{P}\Big( \max_{1 \le i \le M,\, 1 \le j \le N} |x_{ij}| \le q \Big) \ge 1 - e^{-N^c}, \tag{2.1.26} \]
for some $c > 0$. Here $q \equiv q(N)$ depends on $N$ and usually satisfies
\[ N^{-1/2} \log N \le q \le N^{-\phi} \]
for some small constant $\phi > 0$. Whenever (2.1.26) holds, we say that $X$ has support $q$.
Remark 2.1.12. Note that the Gaussian distribution satisfies the condition (2.1.26) with
$q < N^{-\phi}$ for any $\phi < 1/2$. We also remark that if (2.1.26) holds, then the event
$\{|x_{ij}| \le q,\ \forall\, 1 \le i \le M,\ 1 \le j \le N\}$ holds with $\xi$-high probability for any fixed
$\xi > 0$ according to Definition 2.1.10. For this reason, the bad event $\{|x_{ij}| > q$ for some
$i, j\}$ is negligible, and we will not consider the case where it happens throughout the proof.
Next we introduce a convenient self-adjoint linearization trick, which has been proved
to be useful in studying the local laws of random matrices of the $A^*A$ type. We define
the following $(N+M) \times (N+M)$ block matrix, which is a linear function of $X$.
Definition 2.1.13 (Linearizing block matrix). For $z \in \mathbb{C}_+$, we define the
$(N+M) \times (N+M)$ block matrix
\[ H \equiv H(X) := \begin{pmatrix} 0 & DX \\ (DX)^* & 0 \end{pmatrix}, \tag{2.1.27} \]
and its Green function
\[ G \equiv G(X, z) := \begin{pmatrix} -I_{M \times M} & DX \\ (DX)^* & -zI_{N \times N} \end{pmatrix}^{-1}. \tag{2.1.28} \]
Definition 2.1.14 (Index sets). We define the index sets
\[ \mathcal{I}_1 := \{1, \ldots, M\}, \qquad \mathcal{I}_2 := \{M+1, \ldots, M+N\}, \qquad \mathcal{I} := \mathcal{I}_1 \cup \mathcal{I}_2. \]
Then we label the indices of the matrices according to
\[ X = (X_{i\mu} : i \in \mathcal{I}_1,\ \mu \in \mathcal{I}_2) \quad \text{and} \quad D = \operatorname{diag}(D_{ii} : i \in \mathcal{I}_1). \]
In the rest of this paper, whenever referring to the entries of $H$ and $G$, we will consistently
use the Latin letters $i, j \in \mathcal{I}_1$, Greek letters $\mu, \nu \in \mathcal{I}_2$, and $a, b \in \mathcal{I}$. For
$1 \le i \le \min\{N, M\}$ and $M+1 \le \mu \le M + \min\{N, M\}$, we introduce the notations
$\bar{i} := i + M \in \mathcal{I}_2$ and $\bar{\mu} := \mu - M \in \mathcal{I}_1$. For any $\mathcal{I} \times \mathcal{I}$ matrix $A$, we define
the following $2 \times 2$ submatrices
\[ A_{[ij]} = \begin{pmatrix} A_{ij} & A_{i\bar{j}} \\ A_{\bar{i}j} & A_{\bar{i}\bar{j}} \end{pmatrix}, \quad 1 \le i, j \le \min\{N, M\}. \tag{2.1.29} \]
We shall call $A_{[ij]}$ a diagonal group if $i = j$, and an off-diagonal group otherwise.
It is easy to verify that the eigenvalues $\lambda_1(H) \ge \ldots \ge \lambda_{M+N}(H)$ of $H$ are related
to the ones of $Q_2$ through
\[ \lambda_i(H) = -\lambda_{N+M-i+1}(H) = \sqrt{\lambda_i(Q_2)}, \quad 1 \le i \le N \wedge M, \tag{2.1.30} \]
and
\[ \lambda_i(H) = 0, \quad N \wedge M + 1 \le i \le N \vee M, \]
where we used the notations $N \wedge M := \min\{N, M\}$ and $N \vee M := \max\{N, M\}$.
Furthermore, by the Schur complement formula, we can verify that
\[ G = \begin{pmatrix} z(DXX^*D^* - z)^{-1} & (DXX^*D^* - z)^{-1}DX \\ X^*D^*(DXX^*D^* - z)^{-1} & (X^*D^*DX - z)^{-1} \end{pmatrix} = \begin{pmatrix} zG_1 & G_1DX \\ X^*D^*G_1 & G_2 \end{pmatrix} = \begin{pmatrix} zG_1 & DXG_2 \\ G_2X^*D^* & G_2 \end{pmatrix}. \tag{2.1.31} \]
Thus a control of $G$ directly yields a control of the resolvents $G_{1,2}$ defined in (2.1.7).
By (2.1.31), we immediately get that
\[ m_1 = \frac{1}{Mz} \sum_{i \in \mathcal{I}_1} G_{ii}, \qquad m_2 = \frac{1}{N} \sum_{\mu \in \mathcal{I}_2} G_{\mu\mu}. \]
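Both the spectral correspondence (2.1.30) and the Schur complement identity (2.1.31) can be verified on a small random instance (a toy check, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 6
D = np.diag(rng.uniform(0.5, 2.0, M))                # toy diagonal D
Y = D @ (rng.standard_normal((M, N)) / np.sqrt(N))   # Y = DX

# linearizing block matrix (2.1.27): nonzero eigenvalues are +/- singular values of DX
H = np.block([[np.zeros((M, M)), Y], [Y.T, np.zeros((N, N))]])
eig_H = np.sort(np.linalg.eigvalsh(H))
sv = np.sort(np.linalg.svd(Y, compute_uv=False))
assert np.allclose(eig_H[-M:], sv) and np.allclose(eig_H[:M], -sv[::-1])

# Schur complement identity (2.1.31): the lower-right block of G is G2
z = 0.3 + 0.7j
G = np.linalg.inv(np.block([[-np.eye(M), Y], [Y.T, -z * np.eye(N)]]))
G2 = np.linalg.inv(Y.T @ Y - z * np.eye(N))
assert np.allclose(G[M:, M:], G2)
```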
Next we introduce the spectral decomposition of $G$. Let
\[ DX = \sum_{k=1}^{N \wedge M} \sqrt{\lambda_k}\, \xi_k \zeta_k^* \]
be a singular value decomposition of $DX$, where
\[ \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{N \wedge M} \ge 0 = \lambda_{N \wedge M + 1} = \ldots = \lambda_{N \vee M}, \]
and $\{\xi_k\}_{k=1}^{M}$ and $\{\zeta_k\}_{k=1}^{N}$ are orthonormal bases of $\mathbb{R}^{\mathcal{I}_1}$ and $\mathbb{R}^{\mathcal{I}_2}$,
respectively. Then using (2.1.31), we can get that for $i, j \in \mathcal{I}_1$ and $\mu, \nu \in \mathcal{I}_2$,
\[ G_{ij} = \sum_{k=1}^{M} \frac{z\, \xi_k(i)\xi_k^*(j)}{\lambda_k - z}, \qquad G_{\mu\nu} = \sum_{k=1}^{N} \frac{\zeta_k(\mu)\zeta_k^*(\nu)}{\lambda_k - z}, \tag{2.1.32} \]
\[ G_{i\mu} = \sum_{k=1}^{N \wedge M} \frac{\sqrt{\lambda_k}\, \xi_k(i)\zeta_k^*(\mu)}{\lambda_k - z}, \qquad G_{\mu i} = \sum_{k=1}^{N \wedge M} \frac{\sqrt{\lambda_k}\, \zeta_k(\mu)\xi_k^*(i)}{\lambda_k - z}. \tag{2.1.33} \]
Main tools. For a small constant $c_0 > 0$ and large constants $C_0, C_1 > 0$, we define a
domain of the spectral parameter $z = E + i\eta$ as
\[ S(c_0, C_0, C_1) := \Big\{ z = E + i\eta : \lambda_r - c_0 \le E \le C_0\lambda_r,\ \frac{\varphi^{C_1}}{N} \le \eta \le 1 \Big\}. \tag{2.1.34} \]
We define the distance to the rightmost edge as
\[ \kappa \equiv \kappa_E := |E - \lambda_r|, \quad \text{for } z = E + i\eta. \tag{2.1.35} \]
Then we have the following lemma, which summarizes some basic properties of m2c and
ρ2c.
Lemma 2.1.15. There exists a sufficiently small constant $c > 0$ such that
\[ \rho_{2c}(x) \sim \sqrt{\lambda_r - x}, \quad \text{for all } x \in [\lambda_r - 2c, \lambda_r]. \tag{2.1.36} \]
The Stieltjes transform $m_{2c}$ satisfies
\[ |m_{2c}(z)| \sim 1 \tag{2.1.37} \]
and
\[ \operatorname{Im} m_{2c}(z) \sim \begin{cases} \eta/\sqrt{\kappa + \eta}, & E \ge \lambda_r, \\ \sqrt{\kappa + \eta}, & E \le \lambda_r, \end{cases} \tag{2.1.38} \]
for $z = E + i\eta \in S(c, C_0, -\infty)$.
Remark 2.1.16. Recall that ak are the edges of the spectral density ρ2c; see (2.1.17).
Hence ρ2c(ak) = 0, and we must have ak < λr − 2c for 2 ≤ k ≤ 2p. In particular,
S(c0, C0, C1) is away from all the other edges if we choose c0 ≤ c.
Definition 2.1.17 (Classical locations of eigenvalues). The classical location $\gamma_j$ of the
$j$-th eigenvalue of $Q_2$ is defined as
\[ \gamma_j := \sup_x \Big\{ \int_x^{+\infty} \rho_{2c}(x)\,dx > \frac{j-1}{N} \Big\}. \tag{2.1.39} \]
In particular, we have $\gamma_1 = \lambda_r$.
Remark 2.1.18. If $\gamma_j$ lies in the bulk of $\rho_{2c}$, then by the positivity of $\rho_{2c}$ we can
define $\gamma_j$ through the equation
\[ \int_{\gamma_j}^{+\infty} \rho_{2c}(x)\,dx = \frac{j-1}{N}. \]
We can also define the classical location of the $j$-th eigenvalue of $Q_1$ by changing $\rho_{2c}$
to $\rho_{1c}$ and $(j-1)/N$ to $(j-1)/M$ in (2.1.39). By (2.1.8), this gives the same location
as $\gamma_j$ for $j \le N \wedge M$.
Definition 2.1.19 (Deterministic limit of $G$). We define the deterministic limit $\Pi$ of the
Green function $G$ in (2.1.28) as
\[ \Pi(z) := \begin{pmatrix} -(1 + m_{2c}(z)\Sigma)^{-1} & 0 \\ 0 & m_{2c}(z) I_{N \times N} \end{pmatrix}, \tag{2.1.40} \]
where $\Sigma$ is defined in (2.1.4).
In the rest of this section, we present some results that will be used in the proof of
Theorem 2.1.7. Their proofs will be given in subsequent sections.
Lemma 2.1.20 (Local deformed MP law). Suppose Assumptions 2.1.1 and 2.1.5 hold,
and suppose $X$ satisfies the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some
constant $\phi > 0$. Fix $C_0 > 0$ and let $c_1 > 0$ be a sufficiently small constant. Then
there exist constants $C_1 > 0$ and $\xi_1 \ge 3$ such that the following events hold with
$\xi_1$-high probability:
\[ \bigcap_{z \in S(2c_1, C_0, C_1)} \Big\{ |m_2(z) - m_{2c}(z)| \le \varphi^{C_1} \Big( \min\Big\{ q, \frac{q^2}{\sqrt{\kappa + \eta}} \Big\} + \frac{1}{N\eta} \Big) \Big\}, \tag{2.1.41} \]
\[ \bigcap_{z \in S(2c_1, C_0, C_1)} \Big\{ \max_{a, b \in \mathcal{I}} |G_{ab}(z) - \Pi_{ab}(z)| \le \varphi^{C_1} \Big( q + \sqrt{\frac{\operatorname{Im} m_{2c}(z)}{N\eta}} + \frac{1}{N\eta} \Big) \Big\}, \tag{2.1.42} \]
\[ \Big\{ \|H\|^2 \le \lambda_r + \varphi^{C_1}\big(q^2 + N^{-2/3}\big) \Big\}. \tag{2.1.43} \]
The estimates in (2.1.41) and (2.1.42) are usually referred to as the averaged local
law and entrywise local law, respectively. For completeness, we will give a concise proof
in the end that fits into our setting. The local laws (2.1.41) and (2.1.42) can be used
to derive some important properties of the eigenvectors and eigenvalues of the random
matrices. For instance, they lead to the following results about the delocalization of
eigenvectors and the rigidity of eigenvalues. Note that (2.1.44) gives an almost optimal
estimate on the flatness of the singular vectors of DX, while (2.1.45) gives some quite
precise information on the locations of the singular values of DX.
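One observable consequence of the averaged local law (2.1.41) is that $m_2(z)$ is essentially deterministic once $\eta \gg N^{-1}$: two independent samples must produce nearly the same value. A minimal numerical sketch follows (Gaussian null case with $D = I$; the ensemble, sizes, and spectral parameter are our choices, not the thesis's):

```python
import numpy as np

def m2(X, z):
    """m_2(z) = N^{-1} Tr (X^*X - z)^{-1}, the Stieltjes transform appearing in (2.1.41)."""
    evals = np.linalg.eigvalsh(X.T @ X)
    return np.mean(1.0 / (evals - z))

rng = np.random.default_rng(0)
M, N = 300, 600
z = 2.0 + 0.05j                      # bulk energy, eta = 0.05 >> 1/N
m_a = m2(rng.standard_normal((M, N)) / np.sqrt(N), z)
m_b = m2(rng.standard_normal((M, N)) / np.sqrt(N), z)
spread = abs(m_a - m_b)              # expected to be O(1/(N*eta)), here about 0.03
```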
Lemma 2.1.21. Suppose the events (2.1.41) and (2.1.42) hold with $\xi_1$-high probability. Then there exists a constant $C_1' > 0$ such that the following events hold with $\xi_1$-high probability:

(1) Delocalization:
\[
\bigcap_{k :\, \lambda_r - c_1 \le \gamma_k \le \lambda_r} \left\{ \max_i |\xi_k(i)|^2 + \max_\mu |\zeta_k(\mu)|^2 \le \frac{\varphi^{C_1'}}{N} \right\}; \tag{2.1.44}
\]

(2) Rigidity of eigenvalues: if $q \le N^{-\phi}$ for some constant $\phi > 1/3$,
\[
\bigcap_{j :\, \lambda_r - c_1 \le \gamma_j \le \lambda_r} \left\{ |\lambda_j - \gamma_j| \le \varphi^{C_1'}\left( j^{-1/3} N^{-2/3} + q^2 \right) \right\}, \tag{2.1.45}
\]
where λj is the j-th eigenvalue of (DX)∗DX and γj is defined in (2.1.39).
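Both phenomena are easy to observe in simulation. The sketch below (Gaussian entries, $D = I$, sizes of our choosing) checks that the top eigenvalue sits near the Marchenko–Pastur edge and that the top singular vector is flat up to logarithmic factors, in the spirit of (2.1.44) and (2.1.45):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200, 400                           # aspect ratio M/N = 1/2 (our choice)
X = rng.standard_normal((M, N)) / np.sqrt(N)
evals, evecs = np.linalg.eigh(X @ X.T)    # nonzero spectrum of X^*X

# Rigidity: lambda_1 sits within O(N^{-2/3}) of the MP edge (1 + sqrt(M/N))^2.
edge = (1 + np.sqrt(M / N)) ** 2
gap = abs(evals[-1] - edge)

# Delocalization: each entry of the top singular vector carries about 1/M of
# the mass, up to logarithmic factors.
flatness = M * np.max(np.abs(evecs[:, -1]) ** 2)
```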
With Lemma 2.1.20, Lemma 2.1.21 and a standard Green function comparison method,
one can prove the following edge universality result when the support q is small.
Lemma 2.1.22. Let $X^W$ and $X^V$ be two sample covariance matrices satisfying the assumptions in Lemma 2.1.20. Moreover, suppose $q \le \varphi^C N^{-1/2}$ for some constant $C > 0$. Then there exist constants $\varepsilon, \delta > 0$ such that, for any $s \in \mathbb{R}$, we have
\[
\mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^W\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.46}
\]
where PV and PW denote the laws of XV and XW , respectively.
For any matrix $X$ satisfying Assumption 2.1.1 and the tail condition (2.1.135), we can construct a matrix $X_1$ that approximates $X$ with probability $1 - o(1)$, satisfies Assumption 2.1.1 and the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some small $\phi > 0$, and obeys
\[
\mathbb{E}|x_{ij}|^3 \le B N^{-3/2}, \qquad \mathbb{E}|x_{ij}|^4 \le B(\log N) N^{-2}, \tag{2.1.47}
\]
for some constant $B > 0$. We will need the following local law, eigenvalue rigidity, and edge universality results for covariance matrices with large support satisfying condition (2.1.47).
Theorem 2.1.23 (Rigidity of eigenvalues: large support case). Suppose Assumptions 2.1.1 and 2.1.5 hold. Suppose $X$ satisfies the bounded support condition (2.1.26) with $q \le N^{-\phi}$ for some constant $\phi > 0$, together with condition (2.1.47). Fix the constants $c_1$, $C_0$, $C_1$, and $\xi_1$ as given in Lemma 2.1.20. Then there exists a constant $C_2 > 0$, depending only on $c_1$, $C_1$, $B$ and $\phi$, such that with high probability we have
\[
\max_{z \in S(c_1, C_0, C_2)} |m_2(z) - m_{2c}(z)| \le \frac{\varphi^{C_2}}{N\eta}, \tag{2.1.48}
\]
for sufficiently large $N$. Moreover, (2.1.48) implies that for some constant $C > 0$, the following events hold with high probability:
\[
\bigcap_{j :\, \lambda_r - c_1 \le \gamma_j \le \lambda_r} \left\{ |\lambda_j - \gamma_j| \le \varphi^{C} j^{-1/3} N^{-2/3} \right\} \tag{2.1.49}
\]
and
\[
\sup_{E \ge \lambda_r - c_1} |n(E) - n_c(E)| \le \frac{\varphi^{C}}{N}, \tag{2.1.50}
\]
where
\[
n(E) := \frac{1}{N} \#\{\lambda_j \ge E\}, \qquad n_c(E) := \int_E^{+\infty} \rho_{2c}(x)\,\mathrm{d}x. \tag{2.1.51}
\]
Theorem 2.1.24. Let $X^W$ and $X^V$ be two i.i.d. sample covariance matrices satisfying the assumptions in Theorem 2.1.23. Then there exist constants $\varepsilon, \delta > 0$ such that, for any $s \in \mathbb{R}$, we have
\[
\mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^W\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^V\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.52}
\]
where $\mathbb{P}^V$ and $\mathbb{P}^W$ denote the laws of $X^V$ and $X^W$, respectively.
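Edge universality as in (2.1.52) can be probed by Monte Carlo: the largest eigenvalue should have nearly the same law for two entry distributions with matching low moments. A rough sketch (Gaussian versus symmetric Bernoulli entries; sizes, seeds, and trial counts are our choices, and this is an illustration rather than a proof-grade check):

```python
import numpy as np

def top_eval(rng, M, N, bernoulli):
    """Largest eigenvalue of X^*X for Gaussian or (+/-1) Bernoulli entries of variance 1/N."""
    if bernoulli:
        X = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(N)
    else:
        X = rng.standard_normal((M, N)) / np.sqrt(N)
    return np.linalg.eigvalsh(X @ X.T)[-1]

rng = np.random.default_rng(2)
M, N, trials = 100, 200, 200
lam_g = np.array([top_eval(rng, M, N, False) for _ in range(trials)])
lam_b = np.array([top_eval(rng, M, N, True) for _ in range(trials)])
shift = abs(lam_g.mean() - lam_b.mean())   # universality: should be small
```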
Lemma 2.1.25 (Bounds on $G_{ij}$: large support case). Let $X$ be a sample covariance matrix satisfying the assumptions in Theorem 2.1.23. Then for any $0 < c < 1$ and $z \in S(c_1, C_0, C_2) \cap \{z = E + i\eta : \eta \ge N^{-1+c}\}$, we have the following weak bound:
\[
\mathbb{E}|G_{ab}(z)|^2 \le \varphi^{C_3} \left( \frac{\operatorname{Im} m_{2c}(z)}{N\eta} + \frac{1}{(N\eta)^2} \right), \qquad a \ne b, \tag{2.1.53}
\]
for some constant $C_3 > 0$.
In proving Theorem 2.1.23, Theorem 2.1.24 and Lemma 2.1.25, we will make use of the results in Lemmas 2.1.20–2.1.22 for covariance matrices with small support. In fact, given any matrix $X$ satisfying the assumptions in Theorem 2.1.23, we can construct a matrix $\widetilde{X}$ having the same first four moments as $X$ but with smaller support $q = O(N^{-1/2}\log N)$.

Lemma 2.1.26. Suppose $X$ satisfies the assumptions in Theorem 2.1.23. Then there exists another matrix $\widetilde{X} = (\widetilde{x}_{ij})$ such that $\widetilde{X}$ satisfies the bounded support condition (2.1.26) with $q = O(N^{-1/2}\log N)$, and the first four moments of the entries of $X$ and $\widetilde{X}$ match, i.e.
\[
\mathbb{E}\widetilde{x}_{ij}^{\,k} = \mathbb{E}x_{ij}^k, \qquad k = 1, 2, 3, 4. \tag{2.1.54}
\]

From Lemmas 2.1.20–2.1.22, we see that Theorem 2.1.23, Theorem 2.1.24 and Lemma 2.1.25 hold for $\widetilde{X}$. Then, due to (2.1.54), we expect that $X$ has "similar properties" to $\widetilde{X}$, so that these results also hold for $X$. This will be proved with a Green function comparison
Proof of the main result. In this part, we prove Theorem 2.1.7 using the results above.
We begin by proving the necessity part.
Proof of the necessity. Assume that $\lim_{s\to\infty} s^4\, \mathbb{P}(|q_{11}| \ge s) \ne 0$. Then we can find a constant $0 < c_0 < 1/2$ and a sequence $r_n$ such that $r_n \to \infty$ as $n \to \infty$ and
\[
\mathbb{P}(|q_{ij}| \ge r_n) \ge c_0 r_n^{-4}. \tag{2.1.55}
\]
Fix any $s > \lambda_r$. We denote $L := \lfloor \tau M \rfloor$, $I := \sqrt{\tau^{-1} s}$, and define the event
\[
\Gamma_N := \left\{ \text{there exist } i \text{ and } j,\ 1 \le i \le L,\ 1 \le j \le N, \text{ such that } |x_{ij}| \ge I \right\}.
\]
We first show that $\lambda_1(Q_2) \ge s$ when $\Gamma_N$ holds. Suppose $|x_{ij}| \ge I$ for some $1 \le i \le L$ and $1 \le j \le N$. Let $u \in \mathbb{R}^N$ be such that $u(k) = \delta_{kj}$. By assumption (2.1.6), we have $\sigma_i \ge \tau$ for $i \le L$. Hence
\[
\lambda_1(Q_2) \ge \langle u, (DX)^*(DX) u \rangle = \sum_{k=1}^{M} \sigma_k x_{kj}^2 \ge \sigma_i x_{ij}^2 \ge \tau \left( \sqrt{\tau^{-1} s} \right)^2 = s.
\]
Now we choose $N \in \{\lfloor (r_n/I)^2 \rfloor : n \in \mathbb{N}\}$. With the choice $N = \lfloor (r_n/I)^2 \rfloor$, we have
\[
1 - \mathbb{P}(\Gamma_N) = \left( 1 - \mathbb{P}(|x_{11}| \ge I) \right)^{NL} \le \left( 1 - \mathbb{P}(|q_{11}| \ge r_n) \right)^{NL} \le \left( 1 - c_0 r_n^{-4} \right)^{NL} \le \left( 1 - c_1 N^{-2} \right)^{c_2 N^2}, \tag{2.1.56}
\]
for some constant $c_1 > 0$ depending on $c_0$ and $I$, and some constant $c_2 > 0$ depending on $\tau$ and $d$. Since $(1 - c_1 N^{-2})^{c_2 N^2} \le c_3$ for some constant $0 < c_3 < 1$ independent of $N$, the above inequality shows that $\mathbb{P}(\Gamma_N) \ge 1 - c_3 > 0$. This shows that $\limsup_{N\to\infty} \mathbb{P}(\Gamma_N) > 0$
and concludes the proof.
Proof of the sufficiency. Given the matrix $X$ satisfying Assumption 2.1.1 and the tail condition (2.1.135), we introduce a cutoff on its matrix entries at the level $N^{-\epsilon}$. For any fixed $\epsilon > 0$, define
\[
\alpha_N := \mathbb{P}\left( |q_{11}| > N^{1/2-\epsilon} \right), \qquad \beta_N := \mathbb{E}\left[ \mathbf{1}\left( |q_{11}| > N^{1/2-\epsilon} \right) q_{11} \right].
\]
By (2.1.135) and integration by parts, we get that for any $\delta > 0$ and large enough $N$,
\[
\alpha_N \le \delta N^{-2+4\epsilon}, \qquad |\beta_N| \le \delta N^{-3/2+3\epsilon}. \tag{2.1.57}
\]
Let $\rho(x)$ be the distribution density of $q_{11}$. Then we define independent random variables $q^s_{ij}$, $q^l_{ij}$, $c_{ij}$, $1 \le i \le M$ and $1 \le j \le N$, in the following ways:

• $q^s_{ij}$ has distribution density $\rho_s(x)$, where
\[
\rho_s(x) = \mathbf{1}\left( \left| x - \frac{\beta_N}{1-\alpha_N} \right| \le N^{1/2-\epsilon} \right) \frac{\rho\left( x - \frac{\beta_N}{1-\alpha_N} \right)}{1-\alpha_N}; \tag{2.1.58}
\]

• $q^l_{ij}$ has distribution density $\rho_l(x)$, where
\[
\rho_l(x) = \mathbf{1}\left( \left| x - \frac{\beta_N}{1-\alpha_N} \right| > N^{1/2-\epsilon} \right) \frac{\rho\left( x - \frac{\beta_N}{1-\alpha_N} \right)}{\alpha_N}; \tag{2.1.59}
\]

• $c_{ij}$ is a Bernoulli $0$–$1$ random variable with $\mathbb{P}(c_{ij} = 1) = \alpha_N$ and $\mathbb{P}(c_{ij} = 0) = 1 - \alpha_N$.
Let $X^s$, $X^l$ and $X^c$ be random matrices with $X^s_{ij} = N^{-1/2} q^s_{ij}$, $X^l_{ij} = N^{-1/2} q^l_{ij}$ and $X^c_{ij} = c_{ij}$. By (2.1.58), (2.1.59) and the fact that $X^c_{ij}$ is Bernoulli, it is easy to check that for independent $X^s$, $X^l$ and $X^c$,
\[
X_{ij} \overset{d}{=} X^s_{ij}\left( 1 - X^c_{ij} \right) + X^l_{ij} X^c_{ij} - \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N}, \tag{2.1.60}
\]
where by (2.1.57), we have
\[
\left| \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N} \right| \le 2\delta N^{-2+3\epsilon}.
\]
Therefore, if we define the $M \times N$ matrix $Y = (Y_{ij})$ by
\[
Y_{ij} = \frac{1}{\sqrt{N}} \frac{\beta_N}{1-\alpha_N} \quad \text{for all } i \text{ and } j,
\]
then $\|Y\| \le c N^{-1+3\epsilon}$ for some constant $c > 0$. In the proof below, one will see that $\|D(X+Y)\| = \lambda_1^{1/2}\left( (X+Y)^* D^* D (X+Y) \right) = O(1)$ with probability $1 - o(1)$, where $\lambda_1(\cdot)$ denotes the largest eigenvalue of the random matrix. Then it is easy to verify that with probability $1 - o(1)$,
\[
\left| \lambda_1\left( (X+Y)^* D^* D (X+Y) \right) - \lambda_1\left( X^* D^* D X \right) \right| = O\left( N^{-1+3\epsilon} \right). \tag{2.1.61}
\]
Thus the deterministic part in (2.1.60) is negligible under the scaling $N^{2/3}$.
By (2.1.135) and integration by parts, it is easy to check that
\[
\mathbb{E} q^s_{11} = 0, \quad \mathbb{E}|q^s_{11}|^2 = 1 - O(N^{-1+2\epsilon}), \quad \mathbb{E}|q^s_{11}|^3 = O(1), \quad \mathbb{E}|q^s_{11}|^4 = O(\log N). \tag{2.1.62}
\]
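In words, $q^s$ in (2.1.58) is $q$ conditioned on a bounded window and then recentred, which is why the first moment vanishes exactly. A quick Monte-Carlo sanity check of the first two lines of (2.1.62); the Student-$t$ stand-in for the law of $q_{11}$ and the fixed cutoff level playing the role of $N^{1/2-\epsilon}$ are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 1_000_000, 5.0                 # sample size; T stands in for N^{1/2-eps}
q = rng.standard_t(df=9, size=n) / np.sqrt(9 / 7)   # unit variance, all moments finite

alpha = np.mean(np.abs(q) > T)                      # empirical alpha_N
beta = np.mean(np.where(np.abs(q) > T, q, 0.0))     # empirical beta_N
shift = beta / (1 - alpha)
q_small = q[np.abs(q) <= T] + shift   # the small part q^s, recentred as in (2.1.58)
```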
We note that $X_1 := (\mathbb{E}|q^s_{ij}|^2)^{-1/2} X^s$ is a matrix that satisfies the assumptions for $X$ in Theorem 2.1.24. Together with the estimate for $\mathbb{E}|q^s_{ij}|^2$ in (2.1.62), we conclude that there exist constants $\varepsilon, \delta > 0$ such that for any $s \in \mathbb{R}$,
\[
\mathbb{P}^G\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s - N^{-\varepsilon} \right) - N^{-\delta} \le \mathbb{P}^s\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s \right) \le \mathbb{P}^G\!\left( N^{2/3}(\lambda_1 - \lambda_r) \le s + N^{-\varepsilon} \right) + N^{-\delta}, \tag{2.1.63}
\]
where $\mathbb{P}^s$ denotes the law of $X^s$ and $\mathbb{P}^G$ denotes the law of a Gaussian covariance matrix.
Now we write the first two terms on the right-hand side of (2.1.60) as
\[
X^s_{ij}(1 - X^c_{ij}) + X^l_{ij} X^c_{ij} = X^s_{ij} + R_{ij} X^c_{ij},
\]
where $R_{ij} := X^l_{ij} - X^s_{ij}$. It remains to show that the effect of the $R_{ij} X^c_{ij}$ terms on $\lambda_1$ is negligible. We denote the corresponding matrix by $R^c := (R_{ij} X^c_{ij})$. Note that $X^c_{ij}$ is independent of $X^s_{ij}$ and $R_{ij}$.
We first introduce a cutoff on the matrix $X^c$ as $\widetilde{X}^c := \mathbf{1}_A X^c$, where
\[
A := \left\{ \#\{(i,j) : X^c_{ij} = 1\} \le N^{5\epsilon} \right\} \cap \left\{ X^c_{ij} = X^c_{kl} = 1 \Rightarrow \{i,j\} = \{k,l\} \text{ or } \{i,j\} \cap \{k,l\} = \emptyset \right\}.
\]
If we regard the matrix $X^c$ as a sequence $(X^c_i)_{i=1}^{MN}$ of $MN$ i.i.d. Bernoulli random variables, it is easy to obtain from the large deviation formula that
\[
\mathbb{P}\left( \sum_{i=1}^{MN} X^c_i \le N^{5\epsilon} \right) \ge 1 - \exp(-N^{\epsilon}), \tag{2.1.64}
\]
for sufficiently large $N$. Suppose the number $m$ of nonzero elements in $X^c$ is given, with $m \le N^{5\epsilon}$. Then it is easy to check that
\[
\mathbb{P}\left( \exists\ i = k, j \ne l \ \text{or}\ i \ne k, j = l \ \text{such that}\ X^c_{ij} = X^c_{kl} = 1 \,\middle|\, \sum_{i=1}^{MN} X^c_i = m \right) = O(m^2 N^{-1}). \tag{2.1.65}
\]
Combining the estimates (2.1.64) and (2.1.65), we get
\[
\mathbb{P}(A) \ge 1 - O(N^{-1+10\epsilon}). \tag{2.1.66}
\]
On the other hand, by condition (2.1.135), we have
\[
\mathbb{P}\left( |R_{ij}| \ge \omega \right) \le \mathbb{P}\left( |q_{ij}| \ge \frac{\omega}{2} N^{1/2} \right) = o(N^{-2}), \tag{2.1.67}
\]
for any fixed constant $\omega > 0$. Hence if we introduce the matrix
\[
E := \mathbf{1}\left( A \cap \left\{ \max_{i,j} |R_{ij}| \le \omega \right\} \right) R^c,
\]
then
\[
\mathbb{P}(E = R^c) = 1 - o(1) \tag{2.1.68}
\]
by (2.1.66) and (2.1.67). Thus we only need to study the largest eigenvalue of $(X^s + E)^* D^* D (X^s + E)$, where $\max_{i,j} |E_{ij}| \le \omega$ and the rank of $E$ is less than $N^{5\epsilon}$. In fact, it suffices to prove that
\[
\mathbb{P}\left( \left| \lambda^s_1 - \lambda^E_1 \right| \le N^{-3/4} \right) = 1 - o(1), \tag{2.1.69}
\]
where $\lambda^s_1 := \lambda_1\left( (X^s)^* D^* D X^s \right)$ and $\lambda^E_1 := \lambda_1\left( (X^s + E)^* D^* D (X^s + E) \right)$. The estimate (2.1.69), combined with (2.1.61), (2.1.63) and (2.1.68), concludes (2.1.21).
Now we prove (2.1.69). Note that $X^c$ is independent of $X^s$, so the positions of the nonzero elements of $E$ are independent of $X^s$. Without loss of generality, we assume that the $m$ nonzero entries of $DE$ are
\[
e_{11}, e_{22}, \cdots, e_{mm}, \qquad m \le N^{5\epsilon}. \tag{2.1.70}
\]
For other choices of the positions of the nonzero entries the proof is exactly the same; we make this assumption only to simplify the notation. By (2.1.6) and the definition of $E$, we have $|e_{ii}| \le \tau^{-1}\omega$ for $1 \le i \le m$.
We define the matrices
\[
H^s :=
\begin{pmatrix}
0 & DX^s \\
(DX^s)^* & 0
\end{pmatrix}
\quad \text{and} \quad
H^E := H^s + P, \qquad P :=
\begin{pmatrix}
0 & DE \\
(DE)^* & 0
\end{pmatrix}.
\]
Then we have the eigendecomposition $P = V P_D V^*$, where $P_D$ is a $2m \times 2m$ diagonal matrix,
\[
P_D = \operatorname{diag}\left( e_{11}, \ldots, e_{mm}, -e_{11}, \ldots, -e_{mm} \right),
\]
and $V$ is an $(M+N) \times 2m$ matrix such that
\[
V_{ab} =
\begin{cases}
\delta_{a,i}/\sqrt{2} + \delta_{a,(M+i)}/\sqrt{2}, & b = i,\ i \le m, \\
\delta_{a,i}/\sqrt{2} - \delta_{a,(M+i)}/\sqrt{2}, & b = i + m,\ i \le m, \\
0, & b \ge 2m+1.
\end{cases}
\]
With the identity
\[
\det
\begin{pmatrix}
-I_{M\times M} & DX \\
(DX)^* & -z I_{N\times N}
\end{pmatrix}
= \det(-I_{M\times M}) \det\left( X^* D^* D X - z I_{N\times N} \right),
\]
we find that if $\mu \notin \sigma\left( (DX^s)^* D X^s \right)$, then $\mu$ is an eigenvalue of $Q^\gamma := (X^s + \gamma E)^* D^* D (X^s + \gamma E)$ if and only if
\[
\det\left( V^* G^s(\mu) V + (\gamma P_D)^{-1} \right) = 0, \tag{2.1.71}
\]
where
\[
G^s(\mu) := \left[ H^s -
\begin{pmatrix}
I_{M\times M} & 0 \\
0 & \mu I_{N\times N}
\end{pmatrix}
\right]^{-1}.
\]
Define $R^\gamma := V^* G^s V + (\gamma P_D)^{-1}$ for $0 < \gamma \le 1$. It has the following $2 \times 2$ blocks (recall the definition (2.1.29)): for $1 \le i, j \le m$,
\[
\begin{pmatrix}
R^\gamma_{i,j} & R^\gamma_{i,j+m} \\
R^\gamma_{i+m,j} & R^\gamma_{i+m,j+m}
\end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
1 & 1 \\
1 & -1
\end{pmatrix}
G^{[ij]}
\begin{pmatrix}
1 & 1 \\
1 & -1
\end{pmatrix}
+ \delta_{ij}
\begin{pmatrix}
(\gamma e_{ii})^{-1} & 0 \\
0 & -(\gamma e_{ii})^{-1}
\end{pmatrix}. \tag{2.1.72}
\]
Now let $\mu := \lambda^s_1 \pm N^{-3/4}$. We claim that
\[
\mathbb{P}\left( \det R^\gamma(\mu) \ne 0 \ \text{for all } 0 < \gamma \le 1 \right) = 1 - o(1). \tag{2.1.73}
\]
If (2.1.73) holds, then $\mu$ is not an eigenvalue of $Q^\gamma$ with probability $1 - o(1)$. Denoting the largest eigenvalue of $Q^\gamma$ by $\lambda^\gamma_1$, $0 < \gamma \le 1$, and defining $\lambda^0_1 := \lim_{\gamma \to 0} \lambda^\gamma_1$, we have $\lambda^0_1 = \lambda^s_1$ and $\lambda^1_1 = \lambda^E_1$ by definition. By the continuity of $\lambda^\gamma_1$ with respect to $\gamma$ and the fact that $\lambda^0_1 \in (\lambda^s_1 - N^{-3/4}, \lambda^s_1 + N^{-3/4})$, we find that
\[
\lambda^E_1 = \lambda^1_1 \in \left( \lambda^s_1 - N^{-3/4},\ \lambda^s_1 + N^{-3/4} \right)
\]
with probability $1 - o(1)$, i.e., we have proved (2.1.69).
Finally, we prove the claim (2.1.73). Choose $z = \lambda_r + i N^{-2/3}$ and note that $H^s$ has support $N^{-\epsilon}$. Then by (2.1.42) and (2.1.38), we have with high probability
\[
\max_a \left| G^s_{aa}(z) - \Pi_{aa}(\lambda_r) \right| \le N^{-\epsilon/2}, \tag{2.1.74}
\]
where we also used Assumption 2.1.5 and
\[
|m_{2c}(z) - m_{2c}(\lambda_r)| \sim |z - \lambda_r|^{1/2},
\]
which follows from (2.1.36). For the off-diagonal terms, we use (2.1.53), (2.1.38) and the Markov inequality to conclude that
\[
\max_{a \ne b \in \{1,\ldots,m\} \cup \{M+1,\ldots,M+m\}} |G^s_{ab}(z)| \le N^{-1/6} \tag{2.1.75}
\]
holds with probability $1 - o(N^{-1/6})$. We can extend (2.1.63) to the finite-dimensional correlation functions of the largest eigenvalues. Since the largest eigenvalues in the Gaussian case are separated on the scale $\sim N^{-2/3}$, we conclude that
\[
\mathbb{P}\left( \min_i \left| \lambda_i\left( (X^s)^* X^s \right) - \mu \right| \ge N^{-3/4} \right) \ge 1 - o(1). \tag{2.1.76}
\]
On the other hand, the rigidity result (2.1.49) gives that, with high probability,
\[
|\mu - \lambda_r| \le \varphi^C N^{-2/3} + N^{-3/4}. \tag{2.1.77}
\]
Using (2.1.44), (2.1.76), (2.1.77) and the rigidity estimate (2.1.49), we can get that with probability $1 - o(1)$,
\[
\max_{a,b} \left| G^s_{ab}(z) - G^s_{ab}(\mu) \right| < N^{-1/4+\epsilon}. \tag{2.1.78}
\]
For instance, for $\alpha, \beta \in I_2$, small $c > 0$ and large enough $C > 0$, we have with probability $1 - o(1)$ that
\[
\begin{aligned}
|G_{\alpha\beta}(z) - G_{\alpha\beta}(\mu)| &\le \sum_k |\zeta_k(\alpha)\zeta^*_k(\beta)| \left| \frac{1}{\lambda_k - z} - \frac{1}{\lambda_k - \mu} \right| \\
&\le \frac{C}{N^{2/3}} \sum_{\gamma_k \le \lambda_r - c} |\zeta_k(\alpha)\zeta^*_k(\beta)| + \frac{\varphi^C}{N^{5/3}} \sum_{\gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le \frac{C}{N^{2/3}} + \frac{\varphi^C}{N^{5/3}} \sum_{1 \le k \le \varphi^C} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} + \frac{\varphi^C}{N^{5/3}} \sum_{k > \varphi^C,\ \gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le \frac{C}{N^{2/3}} + \frac{\varphi^{2C}}{N^{1/4}} + \frac{\varphi^C}{N^{2/3}} \cdot \frac{1}{N} \sum_{k > \varphi^C,\ \gamma_k > \lambda_r - c} \frac{1}{|\lambda_k - z||\lambda_k - \mu|} \\
&\le N^{-1/4+\epsilon},
\end{aligned}
\]
where in the first step we used (2.1.32); in the second step, (2.1.44) and the bound $|\lambda_k - z||\lambda_k - \mu| \gtrsim 1$ for $\gamma_k \le \lambda_r - c$, which follows from (2.1.49); in the third step, the Cauchy–Schwarz inequality; in the fourth step, (2.1.76); and in the last step, the rigidity estimate (2.1.49). For all the other
choices of a and b, we can prove the estimate (2.1.78) in a similar way. Now by (2.1.78),
we see that (2.1.74) and (2.1.75) still hold if we replace $z$ by $\mu = \lambda^s_1 \pm N^{-3/4}$ and double the right-hand sides. Then using $\max_i |e_{ii}| \le \tau^{-1}\omega$ and (2.1.72), we get that for any $0 < \gamma \le 1$,
\[
\min_{1 \le i \le m,\ \gamma} \left\{ |R^\gamma_{ii}|, |R^\gamma_{i+m,i+m}| \right\} \ge \tau\omega^{-1} - \frac{1}{2}\left| \Pi_{ii}(\lambda_r) + m_{2c}(\lambda_r) \right| - O(N^{-\epsilon/2}),
\]
\[
\max_{1 \le i \le m,\ \gamma} \left\{ |R^\gamma_{i,i+m}|, |R^\gamma_{i+m,i}| \right\} \le \frac{1}{2}\left| \Pi_{ii}(\lambda_r) - m_{2c}(\lambda_r) \right| + O(N^{-\epsilon/2}),
\]
and
\[
\max_{1 \le i \ne j \le m,\ \gamma} \left( |R^\gamma_{i,j}| + |R^\gamma_{i+m,j}| + |R^\gamma_{i,j+m}| + |R^\gamma_{i+m,j+m}| \right) = O(N^{-1/6}),
\]
hold with probability 1− o(1). Thus Rγ is diagonally dominant with probability 1− o(1)
(provided that ω is chosen to be sufficiently small). This proves the claim (2.1.73), which
further gives (2.1.69) and completes the proof.
Proofs of the local laws. We first collect some tools that will be used in the proofs. For simplicity, we denote $Y := DX$.
Definition 2.1.27 (Minors). For $T \subseteq I$, we define the minor $H^{(T)} := (H_{ab} : a, b \in I \setminus T)$ obtained by removing all rows and columns of $H$ indexed by $a \in T$. Note that we keep the names of the indices of $H$ when defining $H^{(T)}$, i.e. $(H^{(T)})_{ab} = \mathbf{1}\{a, b \notin T\} H_{ab}$. Correspondingly, we define the Green function
\[
G^{(T)} := \left( H^{(T)} \right)^{-1} =
\begin{pmatrix}
z G_1^{(T)} & G_1^{(T)} Y^{(T)} \\
\left( Y^{(T)} \right)^* G_1^{(T)} & G_2^{(T)}
\end{pmatrix}
=
\begin{pmatrix}
z G_1^{(T)} & Y^{(T)} G_2^{(T)} \\
G_2^{(T)} \left( Y^{(T)} \right)^* & G_2^{(T)}
\end{pmatrix},
\]
and the partial traces
\[
m_1^{(T)} := \frac{1}{M} \operatorname{Tr} G_1^{(T)} = \frac{1}{Mz} \sum_{i \notin T} G_{ii}^{(T)}, \qquad
m_2^{(T)} := \frac{1}{N} \operatorname{Tr} G_2^{(T)} = \frac{1}{N} \sum_{\mu \notin T} G_{\mu\mu}^{(T)},
\]
where we adopt the convention that $G^{(T)}_{ab} = 0$ if $a \in T$ or $b \in T$. We will abbreviate $(\{a\}) \equiv (a)$, $(\{a, b\}) \equiv (ab)$, and
\[
\sum_{a \notin T} \equiv \sum_a^{(T)}, \qquad \sum_{a, b \notin T} \equiv \sum_{a,b}^{(T)}.
\]
Lemma 2.1.28 (Resolvent identities).

(i) For $i \in I_1$ and $\mu \in I_2$, we have
\[
\frac{1}{G_{ii}} = -1 - \left( Y G^{(i)} Y^* \right)_{ii}, \qquad
\frac{1}{G_{\mu\mu}} = -z - \left( Y^* G^{(\mu)} Y \right)_{\mu\mu}. \tag{2.1.79}
\]

(ii) For $i \ne j \in I_1$ and $\mu \ne \nu \in I_2$, we have
\[
G_{ij} = G_{ii} G^{(i)}_{jj} \left( Y G^{(ij)} Y^* \right)_{ij}, \qquad
G_{\mu\nu} = G_{\mu\mu} G^{(\mu)}_{\nu\nu} \left( Y^* G^{(\mu\nu)} Y \right)_{\mu\nu}. \tag{2.1.80}
\]
For $i \in I_1$ and $\mu \in I_2$, we have
\[
G_{i\mu} = G_{ii} G^{(i)}_{\mu\mu} \left( -Y_{i\mu} + \left( Y G^{(i\mu)} Y \right)_{i\mu} \right), \qquad
G_{\mu i} = G_{\mu\mu} G^{(\mu)}_{ii} \left( -Y^*_{\mu i} + \left( Y^* G^{(\mu i)} Y^* \right)_{\mu i} \right). \tag{2.1.81}
\]

(iii) For $a \in I$ and $b, c \in I \setminus \{a\}$,
\[
G^{(a)}_{bc} = G_{bc} - \frac{G_{ba} G_{ac}}{G_{aa}}, \qquad
\frac{1}{G_{bb}} = \frac{1}{G^{(a)}_{bb}} - \frac{G_{ba} G_{ab}}{G_{bb}\, G^{(a)}_{bb}\, G_{aa}}. \tag{2.1.82}
\]

(iv) All of the above identities hold for $G^{(T)}$ instead of $G$ for $T \subset I$.
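The identities in (iii) are pure Schur-complement algebra and hold exactly for the resolvent of any Hermitian matrix, not only for the linearized Green function used here. A direct numerical check on a Wigner-type toy matrix (our own construction, standing in for the thesis's $H$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, z = 50, 1.0 + 0.1j
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)          # symmetric toy matrix
G = np.linalg.inv(H - z * np.eye(n))

a = 0
keep = np.arange(1, n)                  # remove row and column a = 0
G_minor = np.linalg.inv(H[np.ix_(keep, keep)] - z * np.eye(n - 1))

# First identity of (2.1.82): G^{(a)}_{bc} = G_{bc} - G_{ba} G_{ac} / G_{aa}.
b, c = 3, 7
lhs = G_minor[b - 1, c - 1]             # minor indices shift by one past the removed row
rhs = G[b, c] - G[b, a] * G[a, c] / G[a, a]
```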
Lemma 2.1.29. Fix constants $c_0, C_0, C_1 > 0$. The following estimates hold uniformly for all $z \in S(c_0, C_0, C_1)$:
\[
\|G\| \le C\eta^{-1}, \qquad \|\partial_z G\| \le C\eta^{-2}. \tag{2.1.83}
\]
Furthermore, we have the following identities:
\[
\sum_{\mu \in I_2} |G_{\nu\mu}|^2 = \sum_{\mu \in I_2} |G_{\mu\nu}|^2 = \frac{\operatorname{Im} G_{\nu\nu}}{\eta}, \tag{2.1.84}
\]
\[
\sum_{i \in I_1} |G_{ji}|^2 = \sum_{i \in I_1} |G_{ij}|^2 = \frac{|z|^2}{\eta} \operatorname{Im}\left( \frac{G_{jj}}{z} \right), \tag{2.1.85}
\]
\[
\sum_{i \in I_1} |G_{\mu i}|^2 = \sum_{i \in I_1} |G_{i\mu}|^2 = G_{\mu\mu} + \frac{\bar z}{\eta} \operatorname{Im} G_{\mu\mu}, \tag{2.1.86}
\]
\[
\sum_{\mu \in I_2} |G_{i\mu}|^2 = \sum_{\mu \in I_2} |G_{\mu i}|^2 = \frac{G_{ii}}{z} + \frac{\bar z}{\eta} \operatorname{Im}\left( \frac{G_{ii}}{z} \right). \tag{2.1.87}
\]
All of the above estimates remain true for $G^{(T)}$ instead of $G$ for any $T \subseteq I$.
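The identity (2.1.84) is an instance of the Ward identity $\sum_b |G_{ab}|^2 = \operatorname{Im} G_{aa}/\eta$, which is exact for the resolvent of any Hermitian matrix. A quick check on a toy symmetric matrix (our construction, not the linearized $H$ of the text):

```python
import numpy as np

rng = np.random.default_rng(6)
n, eta = 60, 0.2
z = 0.5 + 1j * eta
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)
G = np.linalg.inv(H - z * np.eye(n))

# Ward identity: sum_b |G_{ab}|^2 = Im G_{aa} / eta, the algebra behind (2.1.84).
a = 4
lhs = np.sum(np.abs(G[a, :]) ** 2)
rhs = G[a, a].imag / eta
```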
Lemma 2.1.30. Fix constants $c_0, C_0, C_1 > 0$. For any $T \subseteq I$, the following bounds hold uniformly in $z \in S(c_0, C_0, C_1)$:
\[
\left| m_2 - m_2^{(T)} \right| \le \frac{2|T|}{N\eta}, \tag{2.1.88}
\]
and
\[
\left| \frac{1}{N} \sum_{i=1}^{M} \sigma_i \left( G^{(T)}_{ii} - G_{ii} \right) \right| \le \frac{C|T|}{N\eta}, \tag{2.1.89}
\]
where $C > 0$ is a constant depending only on $\tau$.
Proof. For $\mu \in I_2$, we have
\[
\left| m_2 - m_2^{(\mu)} \right| = \frac{1}{N} \left| \sum_{\nu \in I_2} \frac{G_{\nu\mu} G_{\mu\nu}}{G_{\mu\mu}} \right| \le \frac{1}{N |G_{\mu\mu}|} \sum_{\nu \in I_2} |G_{\nu\mu}|^2 = \frac{\operatorname{Im} G_{\mu\mu}}{N\eta |G_{\mu\mu}|} \le \frac{1}{N\eta},
\]
where in the first step we used (2.1.82), and in the second and third steps we used the identity (2.1.84). Similarly, using (2.1.82) and (2.1.87) we get
\[
\left| m_2 - m_2^{(i)} \right| = \frac{1}{N} \left| \sum_{\nu \in I_2} \frac{G_{\nu i} G_{i\nu}}{G_{ii}} \right| \le \frac{1}{N |G_{ii}|} \left| \frac{G_{ii}}{z} + \frac{\bar z}{\eta} \operatorname{Im}\left( \frac{G_{ii}}{z} \right) \right| \le \frac{2}{N\eta}.
\]
Then we can prove (2.1.88) by induction on the indices in $T$. The proof of (2.1.89) is similar, except that one also needs the assumption (2.1.6).
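The bound (2.1.88) is deterministic. For the plain resolvent of a Hermitian matrix the analogous statement, that removing one row and column changes the normalized trace by at most $\eta^{-1}/n$, can be verified directly (toy setup of our choosing):

```python
import numpy as np

rng = np.random.default_rng(8)
n, eta = 80, 0.1
z = 0.3 + 1j * eta
H = rng.standard_normal((n, n))
H = (H + H.T) / np.sqrt(2 * n)
G = np.linalg.inv(H - z * np.eye(n))
keep = np.arange(1, n)                   # remove index a = 0
G_minor = np.linalg.inv(H[np.ix_(keep, keep)] - z * np.eye(n - 1))

m = np.trace(G) / n
m_minor = np.trace(G_minor) / n          # same 1/n normalization, as in Definition 2.1.27
diff = abs(m - m_minor)                  # interlacing-type bound: at most 1/(n*eta)
```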
Lemma 2.1.31. Let $(x_i)$, $(y_i)$ be independent families of centered and independent random variables, and $(A_i)$, $(B_{ij})$ be families of deterministic complex numbers. Suppose the entries $x_i$ and $y_j$ have variance at most $N^{-1}$ and satisfy the bounded support condition (2.1.26) with $q \le N^{-\epsilon}$ for some constant $\epsilon > 0$. Then for any fixed $\xi > 0$, the following bounds hold with $\xi$-high probability:
\[
\left| \sum_i A_i x_i \right| \le \varphi^{\xi} \left[ q \max_i |A_i| + \frac{1}{\sqrt{N}} \left( \sum_i |A_i|^2 \right)^{1/2} \right], \tag{2.1.90}
\]
\[
\left| \sum_{i,j} x_i B_{ij} y_j \right| \le \varphi^{2\xi} \left[ q^2 B_d + q B_o + \frac{1}{N} \left( \sum_{i \ne j} |B_{ij}|^2 \right)^{1/2} \right], \tag{2.1.91}
\]
\[
\left| \sum_i x_i B_{ii} x_i - \sum_i (\mathbb{E}|x_i|^2) B_{ii} \right| \le \varphi^{\xi} q B_d, \tag{2.1.92}
\]
\[
\left| \sum_{i \ne j} x_i B_{ij} x_j \right| \le \varphi^{2\xi} \left[ q B_o + \frac{1}{N} \left( \sum_{i \ne j} |B_{ij}|^2 \right)^{1/2} \right], \tag{2.1.93}
\]
where
\[
B_d := \max_i |B_{ii}|, \qquad B_o := \max_{i \ne j} |B_{ij}|.
\]
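The content of (2.1.90) is that a linear form in bounded-support variables is dominated by the larger of $q \max_i |A_i|$ and the CLT scale $N^{-1/2}\|A\|_2$. A simulation sketch: we realize the bounded support condition with a sparse sign variable of variance exactly $N^{-1}$, and replace the $\varphi^{\xi}$ factor by a generous constant; both choices are ours.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000
q = N ** (-0.45)                 # support level, q <= N^{-eps}
p = 1.0 / (N * q * q)            # chosen so that Var(x_i) = 1/N while |x_i| <= q
x = q * rng.binomial(1, p, N) * rng.choice([-1.0, 1.0], size=N)
A = rng.standard_normal(N)       # deterministic coefficients (frozen once drawn)

lhs = abs(np.sum(A * x))
# The bound (2.1.90), with the log factor replaced by a generous constant 10:
rhs = 10 * (q * np.max(np.abs(A)) + np.sqrt(np.sum(A ** 2) / N))
```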
Finally, we have the following lemma, which is a consequence of Assumption 2.1.5.

Lemma 2.1.32. There exist constants $c_0, \tau' > 0$ such that
\[
|1 + m_{2c}(z)\sigma_k| \ge \tau', \tag{2.1.94}
\]
for all $z \in S(c_0, C_0, C_1)$ and $1 \le k \le M$.
Proof. By Assumption 2.1.5 and the fact that $m_{2c}(\lambda_r) \in (-\sigma_1^{-1}, 0)$, we have
\[
|1 + m_{2c}(\lambda_r)\sigma_k| \ge \tau, \qquad 1 \le k \le M.
\]
Applying (2.1.36) to the Stieltjes transform
\[
m_{2c}(z) := \int_{\mathbb{R}} \frac{\rho_{2c}(x)}{x - z}\,\mathrm{d}x, \tag{2.1.95}
\]
one can verify that $m_{2c}(z) - m_{2c}(\lambda_r) \sim \sqrt{z - \lambda_r}$ for $z$ close to $\lambda_r$. Hence if $\kappa + \eta \le 2c_0$ for some sufficiently small constant $c_0 > 0$, we have
\[
|1 + m_{2c}(z)\sigma_k| \ge \tau/2.
\]
Next we consider the case with $E - \lambda_r \ge c_0$ and $\eta \le c_1$ for some constant $c_1 > 0$. In fact, for $\eta = 0$ and $E \ge \lambda_r + c_0$, $m_{2c}(E)$ is real, and it is easy to verify that $m_{2c}'(E) \ge 0$ using the formula (2.1.95). Hence we have
\[
|1 + \sigma_k m_{2c}(E)| \ge |1 + \sigma_k m_{2c}(\lambda_r + c_0)| \ge \tau/2, \qquad \text{for } E \ge \lambda_r + c_0.
\]
Using (2.1.95) again, we can get that
\[
\left| \frac{\mathrm{d} m_{2c}(z)}{\mathrm{d} z} \right| \le c_0^{-2}, \qquad \text{for } E \ge \lambda_r + c_0.
\]
So if $c_1$ is sufficiently small, we have
\[
|1 + \sigma_k m_{2c}(E + i\eta)| \ge \frac{1}{2} |1 + \sigma_k m_{2c}(E)| \ge \tau/4
\]
for $E \ge \lambda_r + c_0$ and $\eta \le c_1$. Finally, it remains to consider the case with $\eta \ge c_1$. If $\sigma_k \le |2m_{2c}(z)|^{-1}$, then we have $|1 + \sigma_k m_{2c}(z)| \ge 1/2$. Otherwise, we have $\operatorname{Im} m_{2c}(z) \sim 1$ by (2.1.38). Together with (2.1.37), we get that
\[
|1 + \sigma_k m_{2c}(z)| \ge \sigma_k \operatorname{Im} m_{2c}(z) \ge \frac{\operatorname{Im} m_{2c}(z)}{2|m_{2c}(z)|} \ge \tau'
\]
for some constant $\tau' > 0$.
Our goal is to prove that $G$ is close to $\Pi$ in the sense of entrywise and averaged local laws. Hence it is convenient to introduce the following random control parameters.

Definition 2.1.33 (Control parameters). We define the entrywise and averaged errors
\[
\Lambda := \max_{a,b \in I} |(G - \Pi)_{ab}|, \qquad \Lambda_o := \max_{a \ne b \in I} |G_{ab}|, \qquad \theta := |m_2 - m_{2c}|. \tag{2.1.96}
\]
Moreover, we define the random control parameter
\[
\Psi_\theta := \sqrt{\frac{\operatorname{Im} m_{2c} + \theta}{N\eta}} + \frac{1}{N\eta}, \tag{2.1.97}
\]
and the deterministic control parameter
\[
\Psi := \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} + \frac{1}{N\eta}. \tag{2.1.98}
\]
We introduce the $Z$ variables
\[
Z^{(T)}_a := (1 - \mathbb{E}_a)\left( G^{(T)}_{aa} \right)^{-1}, \qquad a \notin T,
\]
where $\mathbb{E}_a[\cdot] := \mathbb{E}[\,\cdot \mid H^{(a)}]$, i.e. it is the partial expectation over the randomness of the $a$-th row and column of $H$. By (2.1.79), we have
\[
Z_i = (\mathbb{E}_i - 1)\left( Y G^{(i)} Y^* \right)_{ii} = \sigma_i \sum_{\mu,\nu \in I_2} G^{(i)}_{\mu\nu} \left( \frac{1}{N}\delta_{\mu\nu} - X_{i\mu} X_{i\nu} \right), \tag{2.1.99}
\]
and
\[
Z_\mu = (\mathbb{E}_\mu - 1)\left( Y^* G^{(\mu)} Y \right)_{\mu\mu} = \sum_{i,j \in I_1} \sqrt{\sigma_i \sigma_j}\, G^{(\mu)}_{ij} \left( \frac{1}{N}\delta_{ij} - X_{i\mu} X_{j\mu} \right). \tag{2.1.100}
\]
The following lemma plays a key role in the proof of the local laws.

Lemma 2.1.34. Let $c_0 > 0$ be a sufficiently small constant and fix $C_0, C_1, \xi > 0$. Define the $z$-dependent event $\Xi(z) := \{\Lambda(z) \le (\log N)^{-1}\}$. Then there exists a constant $C > 0$ such that the following estimates hold for all $a \in I$ and $z \in S(c_0, C_0, C_1)$ with $\xi$-high probability:
\[
\mathbf{1}(\Xi)\left( \Lambda_o + |Z_a| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right), \tag{2.1.101}
\]
\[
\mathbf{1}(\eta \ge 1)\left( \Lambda_o + |Z_a| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right). \tag{2.1.102}
\]
Proof. Applying the large deviation Lemma 2.1.31 to $Z_i$ in (2.1.99), we get that on $\Xi$,
\[
|Z_i| \le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\mu,\nu} \left| G^{(i)}_{\mu\nu} \right|^2 \right)^{1/2} \right]
= C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\mu} \frac{\operatorname{Im} G^{(i)}_{\mu\mu}}{\eta} \right)^{1/2} \right]
= C\varphi^{2\xi} \left[ q + \sqrt{\frac{\operatorname{Im} m^{(i)}_2}{N\eta}} \right] \tag{2.1.103}
\]
holds with $\xi$-high probability, where we used (2.1.6), (2.1.84) and the fact that $\max_{a,b} |G_{ab}| = O(1)$ on the event $\Xi$. Now by (2.1.96), (2.1.97) and the bound (2.1.88), we have that
\[
\sqrt{\frac{\operatorname{Im} m^{(i)}_2}{N\eta}} = \sqrt{\frac{\operatorname{Im} m_{2c} + \operatorname{Im}(m^{(i)}_2 - m_2) + \operatorname{Im}(m_2 - m_{2c})}{N\eta}} \le C\Psi_\theta. \tag{2.1.104}
\]
Together with (2.1.103), we conclude that
\[
\mathbf{1}(\Xi)|Z_i| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with $\xi$-high probability. Similarly, we can prove the same estimate for $\mathbf{1}(\Xi)|Z_\mu|$. In the proof, we also need to use (2.1.9) and
\[
\operatorname{Im}\left( -d - \frac{1}{z} \right) = O(\eta) = O(\operatorname{Im} m_{2c}(z)).
\]
If $\eta \ge 1$, we always have $\max_{a,b} |G_{ab}| = O(1)$ by (2.1.83). Then, repeating the above proof, we obtain that
\[
\mathbf{1}(\eta \ge 1)|Z_a| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with $\xi$-high probability. Similarly, using (2.1.80) and Lemmas 2.1.29–2.1.31, we can prove that with $\xi$-high probability,
\[
\mathbf{1}(\Xi)\left( |G_{ij}| + |G_{\mu\nu}| \right) \le C\varphi^{2\xi}\left( q + \Psi_\theta \right) \tag{2.1.105}
\]
holds uniformly for $i \ne j$ and $\mu \ne \nu$. It remains to prove the bound for $G_{i\mu}$ and $G_{\mu i}$. Using (2.1.81), the bounded support condition (2.1.26) for $X_{i\mu}$, the bound $\max_{a,b} |G_{ab}| = O(1)$
on $\Xi$, Lemma 2.1.29 and Lemma 2.1.31, we get that with $\xi$-high probability,
\[
\begin{aligned}
|G_{i\mu}| &\le C \left[ q + \left| \sum_{j,\nu}^{(i\mu)} X_{i\nu} G^{(i\mu)}_{\nu j} X_{j\mu} \right| \right]
\le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{j,\nu}^{(i\mu)} \left| G^{(i\mu)}_{\nu j} \right|^2 \right)^{1/2} \right] \\
&\le C\varphi^{2\xi} \left[ q + \frac{1}{N} \left( \sum_{\nu}^{(\mu)} \left( G^{(i\mu)}_{\nu\nu} + \frac{\bar z}{\eta} \operatorname{Im} G^{(i\mu)}_{\nu\nu} \right) \right)^{1/2} \right]
\le C\varphi^{2\xi} \left[ q + \sqrt{\frac{|m^{(i\mu)}_2|}{N}} + \sqrt{\frac{\operatorname{Im} m^{(i\mu)}_2}{N\eta}} \right].
\end{aligned} \tag{2.1.106}
\]
As in (2.1.104), we can show that
\[
\sqrt{\frac{\operatorname{Im} m^{(i\mu)}_2}{N\eta}} = O(\Psi_\theta). \tag{2.1.107}
\]
For the other term, we have
\[
\sqrt{\frac{|m^{(i\mu)}_2|}{N}} \le \sqrt{\frac{|m_{2c}| + |m^{(i\mu)}_2 - m_2| + |m_2 - m_{2c}|}{N}} \le C \left( \frac{1}{N\sqrt{\eta}} + \sqrt{\frac{\theta}{N}} + \sqrt{\frac{|m_{2c}|}{N}} \right) \le C\Psi_\theta, \tag{2.1.108}
\]
where we used (2.1.88), and that
\[
\frac{|m_{2c}|}{N} = O\left( \frac{\operatorname{Im} m_{2c}}{N\eta} \right),
\]
since $|m_{2c}| = O(1)$ and $\operatorname{Im} m_{2c} \gtrsim \eta$ by Lemma 2.1.15. From (2.1.106), (2.1.107) and (2.1.108), we obtain that
\[
\mathbf{1}(\Xi)|G_{i\mu}| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right)
\]
with ξ-high probability. Together with (2.1.105), we get the estimate in (2.1.101) for
Λo. Finally, the estimate (2.1.102) for Λo can be proved in a similar way with the bound
1(η ≥ 1) maxa,b |Gab| = O(1).
Our proof of the local law starts with an analysis of the self-consistent equation.
Recall that m2c(z) is the solution to the equation z = f(m) for f defined in (2.1.16).
Lemma 2.1.35. Let $c_0 > 0$ be sufficiently small. Fix $C_0 > 0$, $\xi \ge 3$ and $C_1 \ge 8\xi$. Then there exists $C > 0$ such that the following estimates hold uniformly in $z \in S(c_0, C_0, C_1)$ with $\xi$-high probability:
\[
\mathbf{1}(\eta \ge 1)\left| z - f(m_2) \right| \le C\varphi^{2\xi}\left( q + N^{-1/2} \right), \tag{2.1.109}
\]
\[
\mathbf{1}(\Xi)\left| z - f(m_2) \right| \le C\varphi^{2\xi}\left( q + \Psi_\theta \right), \tag{2.1.110}
\]
where $\Xi$ is as given in Lemma 2.1.34. Moreover, we have the finer estimate
\[
\mathbf{1}(\Xi)\left( z - f(m_2) \right) = \mathbf{1}(\Xi)\left( [Z]_1 + [Z]_2 \right) + O\left( \varphi^{4\xi}\left( q^2 + \Psi^2_\theta \right) \right) \tag{2.1.111}
\]
with $\xi$-high probability, where
\[
[Z]_1 := \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{(1 + m_2\sigma_i)^2} Z_i, \qquad [Z]_2 := \frac{1}{N} \sum_{\mu \in I_2} Z_\mu. \tag{2.1.112}
\]
Proof. We first prove (2.1.111), from which (2.1.110) follows due to (2.1.101) and (2.1.94). By (2.1.79), (2.1.99) and (2.1.100), we have
\[
\frac{1}{G_{ii}} = -1 - \frac{\sigma_i}{N} \sum_{\mu \in I_2} G^{(i)}_{\mu\mu} + Z_i = -1 - \sigma_i m_2 + \varepsilon_i, \tag{2.1.113}
\]
and
\[
\frac{1}{G_{\mu\mu}} = -z - \frac{1}{N} \sum_{i \in I_1} \sigma_i G^{(\mu)}_{ii} + Z_\mu = -z - \frac{1}{N} \sum_{i \in I_1} \sigma_i G_{ii} + \varepsilon_\mu, \tag{2.1.114}
\]
where
\[
\varepsilon_i := Z_i + \sigma_i\left( m_2 - m^{(i)}_2 \right) \quad \text{and} \quad \varepsilon_\mu := Z_\mu + \frac{1}{N} \sum_{i \in I_1} \sigma_i\left( G_{ii} - G^{(\mu)}_{ii} \right).
\]
By (2.1.88), (2.1.89) and (2.1.101), we have, for all $i$ and $\mu$,
\[
\mathbf{1}(\Xi)\left( |\varepsilon_i| + |\varepsilon_\mu| \right) \le C\varphi^{2\xi}(q + \Psi_\theta) \tag{2.1.115}
\]
with $\xi$-high probability. Then using (2.1.114), we get that for any $\mu$ and $\nu$,
\[
\mathbf{1}(\Xi)(G_{\mu\mu} - G_{\nu\nu}) = \mathbf{1}(\Xi)\, G_{\mu\mu} G_{\nu\nu} (\varepsilon_\nu - \varepsilon_\mu) = O\left( \varphi^{2\xi}(q + \Psi_\theta) \right) \tag{2.1.116}
\]
with $\xi$-high probability. This implies that
\[
\mathbf{1}(\Xi)|G_{\mu\mu} - m_2| \le C\varphi^{2\xi}(q + \Psi_\theta), \qquad \mu \in I_2, \tag{2.1.117}
\]
with $\xi$-high probability.
Now we plug (2.1.113) into (2.1.114) and take the average $N^{-1}\sum_\mu$. Note that we can write
\[
\frac{1}{G_{\mu\mu}} = \frac{1}{m_2} - \frac{1}{m_2^2}(G_{\mu\mu} - m_2) + \frac{1}{m_2^2} \frac{(G_{\mu\mu} - m_2)^2}{G_{\mu\mu}}.
\]
After taking the average, the second term on the right-hand side vanishes, and the third term contributes a factor $O(\varphi^{4\xi}(q + \Psi_\theta)^2)$ by (2.1.117). On the other hand, using (2.1.82) and (2.1.101) we get that
\[
\mathbf{1}(\Xi)\left| \frac{1}{N} \sum_{i \in I_1} \sigma_i\left( G^{(\mu)}_{ii} - G_{ii} \right) \right| \le \mathbf{1}(\Xi) \frac{1}{N} \sum_{i \in I_1} \sigma_i \left| \frac{G_{i\mu} G_{\mu i}}{G_{\mu\mu}} \right| \le C\varphi^{4\xi}(q + \Psi_\theta)^2,
\]
and
\[
\mathbf{1}(\Xi)|m_2 - m^{(i)}_2| \le \mathbf{1}(\Xi) \frac{1}{N} \sum_{\mu \in I_2} \left| \frac{G_{\mu i} G_{i\mu}}{G_{ii}} \right| \le C\varphi^{4\xi}(q + \Psi_\theta)^2,
\]
with $\xi$-high probability. Hence the average of (2.1.114) gives
\[
\mathbf{1}(\Xi)\frac{1}{m_2} = \mathbf{1}(\Xi)\left[ -z + \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{1 + \sigma_i m_2 - Z_i + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right)} + [Z]_2 \right] + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right),
\]
with $\xi$-high probability. Finally, using (2.1.94) and the definition of $\Xi$, we can expand the fractions in the sum to get that
\[
\mathbf{1}(\Xi)\left[ z + \frac{1}{m_2} - \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{1 + \sigma_i m_2} \right] = \mathbf{1}(\Xi)\left( [Z]_1 + [Z]_2 \right) + O\left( \varphi^{4\xi}(q + \Psi_\theta)^2 \right).
\]
This concludes (2.1.111).
Then we prove (2.1.109). Using the bound $\mathbf{1}(\eta \ge 1)\max_{a,b}|G_{ab}| = O(1)$, it is easy to get that $|m_2| = O(1)$ and $\theta = O(1)$. Thus we have $\mathbf{1}(\eta \ge 1)\Psi_\theta = O(N^{-1/2})$, and (2.1.115) gives
\[
\mathbf{1}(\eta \ge 1)\left( |\varepsilon_i| + |\varepsilon_\mu| \right) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.118}
\]
with $\xi$-high probability. First, we claim that for $\eta \ge 1$,
\[
|m_2| \ge \operatorname{Im} m_2 \ge c \quad \text{with } \xi\text{-high probability}, \tag{2.1.119}
\]
for some constant $c > 0$. By the spectral decomposition (2.1.32), we have
\[
\operatorname{Im} G_{ii} = \operatorname{Im} \sum_{k=1}^{M} \frac{z|\xi_k(i)|^2}{\lambda_k - z} = \sum_{k=1}^{M} |\xi_k(i)|^2 \operatorname{Im}\left( -1 + \frac{\lambda_k}{\lambda_k - z} \right) \ge 0.
\]
Then by (2.1.114), $G^{-1}_{\mu\mu}$ is of order $O(1)$ and has imaginary part $\le -\eta + O\left( \varphi^{2\xi}(q + N^{-1/2}) \right)$ with $\xi$-high probability. This implies that $\operatorname{Im} G_{\mu\mu} \gtrsim \eta$ with $\xi$-high probability, which concludes (2.1.119). Next, we claim that
\[
|1 + \sigma_i m_2| \ge c' \quad \text{with } \xi\text{-high probability}, \tag{2.1.120}
\]
for some constant $c' > 0$. In fact, if $\sigma_i \le |2m_2|^{-1}$, we trivially have $|1 + \sigma_i m_2| \ge 1/2$. Otherwise, we have $\sigma_i \gtrsim 1$ (since $|m_2| = O(1)$), which gives
\[
|1 + \sigma_i m_2| \ge \sigma_i \operatorname{Im} m_2 \ge c'.
\]
Finally, with (2.1.118), (2.1.119) and (2.1.120), we can repeat the previous arguments to get (2.1.109).
The following lemma gives the stability of the equation $z = f(m)$. Roughly speaking, it states that if $z' - f(m_2(z'))$ is small for all $z'$ with $\operatorname{Im} z' \ge \operatorname{Im} z$ on a vertical lattice above $z$, then $m_2(z) - m_{2c}(z)$ is small. For an arbitrary $z \in S(c_0, C_0, C_1)$, we define the discrete set
\[
L(z) := \{z\} \cup \left\{ z' \in S(c_0, C_0, C_1) : \operatorname{Re} z' = \operatorname{Re} z,\ \operatorname{Im} z' \in [\operatorname{Im} z, 1] \cap (N^{-10}\mathbb{N}) \right\}.
\]
Thus, if $\operatorname{Im} z \ge 1$, then $L(z) = \{z\}$; if $\operatorname{Im} z < 1$, then $L(z)$ is a one-dimensional lattice with spacing $N^{-10}$, plus the point $z$. Obviously, we have $|L(z)| \le N^{10}$.

Lemma 2.1.36. The self-consistent equation $z - f(m) = 0$ is stable on $S(c_0, C_0, C_1)$ in the following sense. Suppose the $z$-dependent function $\delta$ satisfies $N^{-2} \le \delta(z) \le (\log N)^{-1}$ for $z \in S(c_0, C_0, C_1)$ and that $\delta$ is Lipschitz continuous with Lipschitz constant $\le N^2$. Suppose moreover that for each fixed $E$, the function $\eta \mapsto \delta(E + i\eta)$ is non-increasing for $\eta > 0$. Suppose that $u_2 : S(c_0, C_0, C_1) \to \mathbb{C}$ is the Stieltjes transform of a probability measure. Let $z \in S(c_0, C_0, C_1)$ and suppose that for all $z' \in L(z)$ we have
\[
|z' - f(u_2(z'))| \le \delta(z'). \tag{2.1.121}
\]
Then we have
\[
|u_2(z) - m_{2c}(z)| \le \frac{C\delta}{\sqrt{\kappa + \eta + \delta}}, \tag{2.1.122}
\]
for some constant $C > 0$ independent of $z$ and $N$, where $\kappa$ is defined in (2.1.35).
Note that by Lemma 2.1.36 and (2.1.109), we immediately get that
\[
\mathbf{1}(\eta \ge 1)\,\theta(z) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.123}
\]
with $\xi$-high probability. From (2.1.102), we obtain the off-diagonal estimate
\[
\mathbf{1}(\eta \ge 1)\,\Lambda_o(z) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.124}
\]
with $\xi$-high probability. Using (2.1.117), (2.1.113) and (2.1.123), we get that
\[
\mathbf{1}(\eta \ge 1)\left( |G_{ii} - \Pi_{ii}| + |G_{\mu\mu} - m_{2c}| \right) \le C\varphi^{2\xi}\left( q + N^{-1/2} \right) \tag{2.1.125}
\]
with $\xi$-high probability, which gives the diagonal estimate. These bounds can be easily generalized to the case $\eta \ge c$ for any fixed $c > 0$. Comparing with (2.1.42), one can see that the bounds (2.1.124) and (2.1.125) are optimal in the regime $\eta \ge c$. It now remains to deal with the small-$\eta$ case (in particular, the local case $\eta \ll 1$). We first prove the following weak bound.
Lemma 2.1.37. Let $c_0 > 0$ be sufficiently small. Fix $C_0 > 0$, $\xi \ge 3$ and $C_1 \ge 8\xi$. Then there exists $C > 0$ such that, with $\xi$-high probability,
\[
\Lambda(z) \le C\varphi^{2\xi}\left( \sqrt{q} + (N\eta)^{-1/3} \right) \tag{2.1.126}
\]
holds uniformly in $z \in S(c_0, C_0, C_1)$.
To get the stronger local laws of Lemma 2.1.20, we need stronger bounds on $[Z]_1$ and $[Z]_2$ in (2.1.111). They follow from the following fluctuation averaging lemma.

Lemma 2.1.38. Fix a constant $\xi > 0$. Suppose $q \le \varphi^{-5\xi}$, and suppose that there exists $S \subseteq S(c_0, C_0, L)$ with $L \ge 18\xi$ such that with $\xi$-high probability,
\[
\Lambda(z) \le \gamma(z) \quad \text{for } z \in S, \tag{2.1.127}
\]
where $\gamma$ is a deterministic function satisfying $\gamma(z) \le \varphi^{-\xi}$. Then we have that, with $(\xi - \tau_N)$-high probability,
\[
|[Z]_1(z)| + |[Z]_2(z)| \le \varphi^{18\xi} \left( q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}(z) + \gamma(z)}{N\eta} \right) \tag{2.1.128}
\]
for $z \in S$, where $\tau_N := 2/\log\log N$.
Proof. We suppose that the event $\Xi$ holds. The bound for $[Z]_2$ is proved in Lemma 4.1 of [50]. The bound for $[Z]_1$ can be proved in a similar way, except that the coefficients $\sigma_i/(1 + m_2\sigma_i)^2$ are random and depend on $i$. This can be dealt with by writing, for any $i \in I_1$,
\[
m_2 = m^{(i)}_2 + \frac{1}{N} \sum_{\mu \in I_2} \frac{G_{\mu i} G_{i\mu}}{G_{ii}} = m^{(i)}_2 + O(\Lambda_o^2),
\]
where by Lemma 2.1.34 we have, with $\xi$-high probability,
\[
\Lambda_o^2 \le C\varphi^{4\xi}\left( q^2 + \Psi^2_\theta \right) \le C\varphi^{4\xi}\left( q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}(z) + \gamma(z)}{N\eta} \right).
\]
Then we write
\[
\begin{aligned}
[Z]_1 &= \frac{1}{N} \sum_{i \in I_1} \frac{\sigma_i}{\left( 1 + m^{(i)}_2 \sigma_i \right)^2} Z_i + O(\Lambda_o^2)
= \frac{1}{N} \sum_{i \in I_1} (1 - \mathbb{E}_i)\left[ \frac{\sigma_i}{\left( 1 + m^{(i)}_2 \sigma_i \right)^2} G^{-1}_{ii} \right] + O(\Lambda_o^2) \\
&= \frac{1}{N} \sum_{i \in I_1} (1 - \mathbb{E}_i)\left[ \frac{\sigma_i}{\left( 1 + m_2 \sigma_i \right)^2} G^{-1}_{ii} \right] + O(\Lambda_o^2). \tag{2.1.129}
\end{aligned}
\]
The method used to bound the first term in (2.1.129) is a slight modification of the one in [50]. Finally, one can use the fact that the event $\Xi$ holds with $\xi$-high probability, by Lemma 2.1.37, to conclude the proof.
Proof of (2.1.41) and (2.1.42). Fix $c_0, C_0 > 0$, $\xi > 3$ and set
\[
L := 120\xi, \qquad \widetilde{\xi} := 2/\log 2 + \xi.
\]
Hence we have $\widetilde{\xi} \le 2\xi$ and $L \ge 60\widetilde{\xi}$. Then to prove (2.1.42), it suffices to prove that
\[
\bigcap_{z \in S(c_0, C_0, L)} \left\{ \Lambda(z) \le C\varphi^{20\xi} \left( q + \sqrt{\frac{\operatorname{Im} m_{2c}(z)}{N\eta}} + \frac{1}{N\eta} \right) \right\} \tag{2.1.130}
\]
holds with $\xi$-high probability.
By Lemma 2.1.37, the event $\Xi$ holds with $\xi$-high probability. Then, together with Lemma 2.1.38 and (2.1.111), we get that with $(\xi - \tau_N)$-high probability,
\[
\begin{aligned}
|z - f(m_2)| &\le C\varphi^{18\xi} \left[ q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c} + C\varphi^{2\xi}\left( \sqrt{q} + (N\eta)^{-1/3} \right)}{N\eta} \right] \\
&\le C \left[ \varphi^{20\xi} \left( q^2 + \frac{1}{(N\eta)^{4/3}} \right) + \varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta} \right],
\end{aligned}
\]
where we used Young's inequality for the $\sqrt{q}/(N\eta)$ term. Now applying Lemma 2.1.36, we get that with $(\xi - \tau_N)$-high probability,
\[
\theta \le C\varphi^{10\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + C\varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta\sqrt{\kappa + \eta}} \le C\varphi^{18\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right),
\]
where we used (2.1.38) in the second step. Then using Lemma 2.1.34, (2.1.113) and
(2.1.117), it is easy to obtain that
\[
\Lambda \le C\varphi^{2\xi}(q + \Psi_\theta) + \theta \le C\varphi^{18\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + C\varphi^{2\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{2/3}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}}
\]
uniformly in $z \in S(c_0, C_0, L)$ with $(\xi - \tau_N)$-high probability, which is a better bound than the one in (2.1.126). We can repeat this process $M$ times, where each iteration yields a stronger bound on $\Lambda$ which holds with a slightly smaller probability. More specifically, suppose that after $k$ iterations we have the bound
\[
\Lambda \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{1-\tau}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \tag{2.1.131}
\]
uniformly in $z \in S(c_0, C_0, L)$ with $\xi'$-high probability. Then by Lemma 2.1.38 and (2.1.111), we have with $(\xi' - \tau_N)$-high probability,
\[
\begin{aligned}
|z - f(m_2)| &\le C\varphi^{18\xi} \left[ q^2 + \frac{1}{(N\eta)^2} + \frac{\operatorname{Im} m_{2c}}{N\eta} + \frac{\varphi^{20\xi}}{N\eta} \left( q + \frac{1}{(N\eta)^{1-\tau}} \right) + \frac{\varphi^{3\xi}}{N\eta} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \right] \\
&\le C \left[ \varphi^{38\xi} \left( q^2 + \frac{1}{(N\eta)^{2-\tau}} \right) + \varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta} \right].
\end{aligned}
\]
Then using Lemma 2.1.36, we get that with $(\xi' - \tau_N)$-high probability,
\[
\theta \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + C\varphi^{18\xi} \frac{\operatorname{Im} m_{2c}}{N\eta\sqrt{\kappa + \eta}} \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right).
\]
Again, with Lemma 2.1.34, (2.1.113) and (2.1.117), we obtain that
\[
\Lambda \le C\varphi^{2\xi}(q + \Psi_\theta) + \theta \le C\varphi^{19\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + C\varphi^{2\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}} \le \varphi^{20\xi} \left( q + \frac{1}{(N\eta)^{1-\tau/2}} \right) + \varphi^{3\xi} \sqrt{\frac{\operatorname{Im} m_{2c}}{N\eta}}, \tag{2.1.132}
\]
uniformly in z ∈ S(c₀, C₀, L) with (ξ′ − τ_N)-high probability. Comparing with (2.1.131), we see that the power of (Nη)⁻¹ is increased from 1 − τ to 1 − τ/2; moreover, there is no extra constant C appearing on the right-hand side of (2.1.132). Thus after M iterations, we get
\[
\Lambda \le \varphi^{20\xi}\bigg(q+\frac{1}{(N\eta)^{1-(1/2)^{M-1}/3}}\bigg)+\varphi^{3\xi}\sqrt{\frac{\operatorname{Im}m_{c}}{N\eta}}, \tag{2.1.133}
\]
uniformly in z ∈ S(c₀, C₀, L) with (ξ − Mτ_N)-high probability. Taking M = ⌊log log N / log 2⌋ such that
\[
\xi - M\tau_N \ge \tilde\xi, \qquad (N\eta)^{(1/2)^{M-1}/3} \le (N\eta)^{4/(3\log N)} \le C,
\]
we can then conclude (2.1.130) and hence (2.1.42). Finally, to prove (2.1.41), we only need to plug (2.1.130) into Lemma 2.1.38 and then apply Lemma 2.1.36.
Proof of (2.1.43). The bound in (2.1.43) follows from a standard application of the local
laws (2.1.41) and (2.1.42). The proof is exactly the same as the one for Lemma 4.4 in
[50]. We omit the details here.
2.1.2 Universality of singular vector distribution
Sample covariance matrices with a general class of populations. We first introduce some notation. Throughout the paper, we will use
\[
r = \lim_{N\to\infty} r_N = \lim_{N\to\infty}\frac{N}{M}. \tag{2.1.134}
\]
Let X = (x_{ij}) be an M × N data matrix with centered entries x_{ij} = N^{−1/2}q_{ij}, 1 ≤ i ≤ M, 1 ≤ j ≤ N, where the q_{ij} are i.i.d. random variables with unit variance such that for every p ∈ ℕ there exists a constant C_p with
\[
\mathbb E|q_{11}|^p \le C_p. \tag{2.1.135}
\]
We consider the sample covariance matrix Q = TXX*T*, where T is a deterministic matrix such that T*T is a positive diagonal matrix. Using the QR factorization [63, Theorem 5.2.1], we find that T = UΣ^{1/2}, where U is an orthogonal matrix and Σ is a positive diagonal matrix. Denote Y = Σ^{1/2}X and write the singular value decomposition of Y as
\[
Y = \sum_{k=1}^{N\wedge M}\sqrt{\lambda_k}\,\xi_k\zeta_k^*,
\]
where λ_k, k = 1, 2, ···, N ∧ M, are the nontrivial eigenvalues of Q, and {ξ_k}_{k=1}^M and {ζ_k}_{k=1}^N are orthonormal bases of ℝ^M and ℝ^N respectively. First of all, we observe that
\[
X^*T^*TX = Y^*Y = Z\Lambda_N Z^*,
\]
where the columns of Z are ζ₁, ···, ζ_N and Λ_N is a diagonal matrix with entries λ₁, ···, λ_N. As a consequence, U does not influence the right singular vectors of Y. Next, we have
\[
TXX^*T^* = UYY^*U^* = U\Xi\Lambda_M\Xi^*U^*,
\]
where the columns of Ξ are ξ_k, k = 1, 2, ···, M, and Λ_M is a diagonal matrix containing λ₁, ···, λ_M. Using the fact that the product of orthogonal matrices is again orthogonal, we conclude that the left singular vectors of TX are ξ̃_k := Uξ_k. Hence, each component of ξ̃_k is a linear combination of the components of ξ_k. For instance, we have
\[
\tilde\xi_k(i)\tilde\xi_k(j) = \sum_{p_1=1}^{M}\sum_{p_2=1}^{M} U_{ip_1}U_{jp_2}\,\xi_k(p_1)\xi_k(p_2).
\]
By the delocalization result (see Lemma 2.1.62) and the dominated convergence theorem, we only need to consider the universality of the entries of ξ_k. The above discussion shows that we can make the following assumption on T:
\[
T \equiv \Sigma^{1/2} = \operatorname{diag}\{\sigma_1^{1/2},\cdots,\sigma_M^{1/2}\}, \quad \text{with } \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_M > 0. \tag{2.1.136}
\]
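The cancellation of the orthogonal factor U in X*T*TX is easy to check numerically; the following is a minimal sketch, where the dimensions and the diagonal population Σ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 8
X = rng.standard_normal((M, N)) / np.sqrt(N)      # entries x_ij = q_ij / sqrt(N)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=M))    # positive diagonal population
U, _ = np.linalg.qr(rng.standard_normal((M, M)))  # a random orthogonal factor
T = U @ np.sqrt(Sigma)                            # T = U Sigma^{1/2}

# The Gram matrix determining the right singular vectors is the same for TX and Y:
G_TX = X.T @ (T.T @ T) @ X    # X^* T^* T X
G_Y = X.T @ Sigma @ X         # Y^* Y with Y = Sigma^{1/2} X
print(np.allclose(G_TX, G_Y)) # True: U drops out of the right singular vectors
```

Since T*T = Σ^{1/2}U^T U Σ^{1/2} = Σ, the two Gram matrices agree exactly, so the right singular vectors coincide while the left ones are merely rotated by U.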
We denote the empirical spectral distribution of Σ by
\[
\pi := \frac{1}{M}\sum_{i=1}^{M}\delta_{\sigma_i}. \tag{2.1.137}
\]
Suppose that there exists some small positive constant τ such that
\[
\tau < \sigma_M \le \sigma_1 \le \tau^{-1}, \qquad \tau \le r \le \tau^{-1}, \qquad \pi([0,\tau]) \le 1-\tau. \tag{2.1.138}
\]
For definiteness, in this paper we focus on the real case, i.e. all the entries xij are real.
However, it is clear that our results and proofs can be applied to the complex case after
minor modifications if we assume in addition that Re xij and Im xij are independent
centered random variables with the same variance. To avoid repetition, we summarize
the basic assumptions for future reference.
Assumption 2.1.39. We assume X is an M × N matrix with centered i.i.d entries
satisfying (2.3.1) and (2.1.135). We also assume that T is a deterministic M × M
matrix satisfying (2.1.136) and (2.3.4).
From now on, we will always use Y = Σ^{1/2}X and its singular value decomposition
\[
Y = \sum_{k=1}^{N\wedge M}\sqrt{\lambda_k}\,\xi_k\zeta_k^*, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{M\wedge N}.
\]
Deformed Marcenko-Pastur law. In this subsection we discuss the empirical spectral distribution of X*T*TX, essentially following the discussion of [74, Section 2.2]. It is well known that if π is a compactly supported probability measure on ℝ and r_N > 0, then for any z ∈ ℂ₊ there is a unique m ≡ m_N(z) ∈ ℂ₊ satisfying
\[
\frac{1}{m} = -z + \frac{1}{r_N}\int\frac{x}{1+mx}\,\pi(dx). \tag{2.1.139}
\]
In this paper, we define the deterministic function m ≡ m(z) as the unique solution of (2.3.5) with π defined in (2.3.3). We denote by ρ the probability measure associated with m (i.e. m is the Stieltjes transform of ρ) and call it the asymptotic density of X*T*TX. Our assumption (2.3.4) implies that the spectrum of Σ cannot concentrate at zero; this ensures that π is a compactly supported probability measure. Therefore, m and ρ are well defined.
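The self-consistent equation (2.1.139) can be solved numerically: its right-hand side defines a self-map of ℂ₊, so a damped fixed-point iteration converges for Im z bounded away from zero. A sketch with illustrative test values of z, r_N and π (not taken from the text):

```python
import numpy as np

def solve_m(z, sigmas, r_N, iters=1000):
    """Damped fixed point for 1/m = -z + (1/r_N) * int x/(1+m x) d pi(x),
    where pi is the empirical measure of the array `sigmas`."""
    m = 1j                                        # start in the upper half plane
    for _ in range(iters):
        integral = np.mean(sigmas / (1.0 + m * sigmas))
        m = 0.5 * m + 0.5 / (-z + integral / r_N) # averaging step damps oscillation
    return m

sigmas = np.ones(50)   # pi = delta_1 recovers the classical Marchenko-Pastur law
r_N = 0.5              # illustrative aspect ratio N/M
z = 5.0 + 1.0j
m = solve_m(z, sigmas, r_N)
residual = abs(1.0 / m - (-z + np.mean(sigmas / (1.0 + m * sigmas)) / r_N))
print(m.imag > 0, residual < 1e-8)
```

The density ρ can then be recovered as π⁻¹ Im m(E + iη) for small η > 0.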
For z ∈ ℂ₊, m ≡ m(z) can be characterized as the unique solution of the equation
\[
z = f(m), \quad \operatorname{Im} m \ge 0, \qquad \text{where } f(x) := -\frac{1}{x} + \frac{1}{r_N}\sum_{i=1}^{M}\frac{\pi(\sigma_i)}{x+\sigma_i^{-1}}. \tag{2.1.140}
\]
The behaviour of ρ can be entirely understood through the analysis of f. We summarize the elementary properties of ρ in the following lemma.
Lemma 2.1.40. Denote ℝ̄ = ℝ ∪ {∞}. Then f defined in (2.3.7) is smooth on the M + 1 open intervals of ℝ̄ defined through
\[
I_1 := (-\sigma_1^{-1}, 0), \qquad I_i := (-\sigma_i^{-1}, -\sigma_{i-1}^{-1}),\ i = 2,\cdots,M, \qquad I_0 := \bar{\mathbb R}\setminus\textstyle\bigcup_{i=1}^{M} I_i.
\]
We also introduce a multiset 𝒞 ⊂ ℝ̄ containing the critical points of f, with the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case r_N = 1, ∞ is a nondegenerate critical point. With the above notation, we have:
• |𝒞 ∩ I₀| = |𝒞 ∩ I₁| = 1 and |𝒞 ∩ I_i| ∈ {0, 2} for i = 2, ···, M. Therefore |𝒞| = 2p, where for convenience we denote by x₁ ≥ x₂ ≥ ··· ≥ x_{2p−1} the 2p − 1 critical points in I₁ ∪ ··· ∪ I_M and by x_{2p} the unique critical point in I₀.
• Denote a_k := f(x_k); then a₁ ≥ ··· ≥ a_{2p}. Moreover, x_k = m(a_k), with the convention m(0) := ∞ for r_N = 1. Furthermore, for k = 1, ···, 2p, there exists a constant C such that 0 ≤ a_k ≤ C.
• supp ρ ∩ (0,∞) = (∪_{k=1}^{p}[a_{2k}, a_{2k−1}]) ∩ (0,∞).
With the above definitions and properties, we now introduce the key regularity assumption on Σ.
Assumption 2.1.41. Fix τ > 0. We say that
(i) the edges a_k, k = 1, ···, 2p, are regular if
\[
a_k \ge \tau, \qquad \min_{l\ne k}|a_k-a_l| \ge \tau, \qquad \min_i|x_k+\sigma_i^{-1}| \ge \tau; \tag{2.1.141}
\]
(ii) the bulk components k = 1, ···, p are regular if for any fixed τ′ > 0 there exists a constant c ≡ c_{τ,τ′} such that the density of ρ in [a_{2k}+τ′, a_{2k−1}−τ′] is bounded from below by c.
Remark 2.1.42. The second condition in (2.3.9) states that the gap in the spectrum of ρ adjacent to a_k remains well separated when N is sufficiently large, and the third condition ensures a square-root behaviour of ρ in a small neighbourhood of a_k. To be specific, consider the right edge of the k-th bulk component: by (A.12) of [74], there exists some small constant c > 0 such that ρ has the square-root behaviour
\[
\rho(x) \sim \sqrt{a_{2k-1}-x}, \qquad x \in [a_{2k-1}-c,\ a_{2k-1}]. \tag{2.1.142}
\]
As a consequence, edge regularity rules out outliers. The bulk regularity imposes a lower bound on the density of eigenvalues away from the edges.
Main results. This subsection presents the main results of this paper. We first introduce some notation. Recall that the nontrivial classical eigenvalue locations γ₁ ≥ γ₂ ≥ ··· ≥ γ_{M∧N} of Q are defined through
\[
\int_{\gamma_i}^{\infty} d\rho = \frac{i-\frac12}{N}.
\]
By Lemma 2.3.2, there are p bulk components in the spectrum of ρ. For k = 1, ···, p, we define the classical number of eigenvalues of the k-th bulk component through \(N_k := N\int_{a_{2k}}^{a_{2k-1}} d\rho\). When p ≥ 1, we relabel the λ_i and γ_i separately for each bulk component k = 1, ···, p by introducing
\[
\lambda_{k,i} := \lambda_{i+\sum_{l<k}N_l}, \qquad \gamma_{k,i} := \gamma_{i+\sum_{l<k}N_l} \in (a_{2k},\ a_{2k-1}). \tag{2.1.143}
\]
Equivalently, we can characterize γ_{k,i} through
\[
\int_{\gamma_{k,i}}^{a_{2k-1}} d\rho = \frac{i-\frac12}{N}. \tag{2.1.144}
\]
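The classical locations can be computed numerically by inverting the tail integral of ρ. A sketch for the single-bulk square case Σ = I, M = N (where ρ is the Marchenko-Pastur density on [0, 4]; the dimension and grid are illustrative) compares them with sampled eigenvalues:

```python
import numpy as np

N = 400
rng = np.random.default_rng(2)
X = rng.standard_normal((N, N)) / np.sqrt(N)
lam = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # eigenvalues, decreasing

# Marchenko-Pastur density for the square case, supported on [0, 4]
x = np.linspace(1e-6, 4.0, 200_000)
rho = np.sqrt(np.maximum(4.0 / x - 1.0, 0.0)) / (2.0 * np.pi)
dx = x[1] - x[0]
tail = np.cumsum(rho[::-1])[::-1] * dx             # tail[j] ~ int_{x[j]}^4 rho

# gamma_i solves int_{gamma_i}^4 rho = (i - 1/2)/N, cf. (2.1.144)
targets = (np.arange(1, N + 1) - 0.5) / N
gammas = np.interp(targets, tail[::-1], x[::-1])   # invert the decreasing tail
print(np.max(np.abs(lam - gammas)))                # small: eigenvalue rigidity
```

The maximal deviation is on the N^{-2/3} edge scale, illustrating the rigidity estimates quoted later in this section.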
In the present paper, we will use the following assumption for the technical purpose of applying the anisotropic local law.
Assumption 2.1.43. For k = 1, 2, ···, p and i = 1, 2, ···, N_k, we have γ_{k,i} ≥ τ for some constant τ > 0.
We define the index sets I₁ := {1, ..., M}, I₂ := {M+1, ..., M+N}, I := I₁ ∪ I₂. We will consistently use Latin letters i, j ∈ I₁, Greek letters μ, ν ∈ I₂, and s, t ∈ I. We label the indices of the matrix according to X = (X_{iμ} : i ∈ I₁, μ ∈ I₂). Similarly, we label the entries of ξ_k ∈ ℝ^{I₁} and ζ_k ∈ ℝ^{I₂}. In the k-th bulk component, k = 1, 2, ···, p, we rewrite the index α′ of λ_{α′} as
\[
\alpha' := l + \sum_{t<k}N_t, \quad \text{when } \alpha' - \sum_{t<k}N_t < \sum_{t\le k}N_t - \alpha', \tag{2.1.145}
\]
\[
\alpha' := -l + 1 + \sum_{t\le k}N_t, \quad \text{when } \alpha' - \sum_{t<k}N_t > \sum_{t\le k}N_t - \alpha'. \tag{2.1.146}
\]
In this paper, we will always say that l is associated with α′. Note that α′ is the index of λ_{k,l} before the relabeling of (2.1.143), and the two cases correspond to the right and left edges respectively. Our main result on the distribution of the components of the singular vectors near the edges is the following theorem. For any positive integers m, k, a function θ : ℝ^m → ℝ and x = (x₁, ···, x_m) ∈ ℝ^m, we denote
\[
\partial^{(k)}\theta(x) = \frac{\partial^k\theta(x)}{\partial x_1^{k_1}\partial x_2^{k_2}\cdots\partial x_m^{k_m}}, \qquad \sum_{i=1}^{m}k_i = k,\quad k_1,k_2,\cdots,k_m \ge 0, \tag{2.1.147}
\]
and write ||x||₂ for its ℓ²-norm. Denote Q^G := Σ^{1/2}X^G X^{G*}Σ^{1/2}, where X^G is GOE and Σ satisfies (2.1.136) and (2.3.4).
Theorem 2.1.44. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and let 𝔼_G, 𝔼_V denote the expectations with respect to X^G, X^V. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l ≤ N_k^δ, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ² that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.148}
\]
Theorem 2.1.45. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39. Consider the k₁-th, ···, k_n-th bulk components, k₁, ···, k_n ∈ {1, 2, ···, p}, n ≤ p, and l_{k_i} defined in (2.1.145) or (2.1.146) and associated with α′_{k_i} in the k_i-th bulk component, i = 1, 2, ···, n. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l_{k_i} ≤ N_{k_i}^δ, i = 1, 2, ···, n, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{2n} that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.149}
\]
Remark 2.1.46. The results in Theorems 2.1.44 and 2.1.45 can easily be extended to a general form containing more entries of the singular vectors, using a general form of the Green function comparison argument. For example, to extend Theorem 2.1.44, we consider the k-th bulk component and choose any positive integer β. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i₁, j₁, ···, i_β, j_β ∈ I₁ and μ₁, ν₁, ···, μ_β, ν_β ∈ I₂, and for the corresponding l_i defined in (2.1.145) or (2.1.146), i = 1, 2, ···, β, there exists some 0 < δ < 1 with 0 < max_{1≤i≤β} l_i ≤ N_k^δ such that
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_1}(i_1)\xi_{\alpha'_1}(j_1),\ N\zeta_{\alpha'_1}(\mu_1)\zeta_{\alpha'_1}(\nu_1),\ \cdots,\ N\xi_{\alpha'_\beta}(i_\beta)\xi_{\alpha'_\beta}(j_\beta),\ N\zeta_{\alpha'_\beta}(\mu_\beta)\zeta_{\alpha'_\beta}(\nu_\beta)\big) = 0, \tag{2.1.150}
\]
where θ is a smooth function on ℝ^{2β} satisfying |∂^{(k)}θ(x)| ≤ C(1+||x||₂)^C, k = 1, 2, 3, with some constant C > 0. Similarly, we can extend Theorem 2.1.45 to contain more entries of the singular vectors.
Recalling (2.1.143), denote ϖ_k := (|f″(x_k)|/2)^{1/3}, k = 1, 2, ···, 2p. For any positive integer h, we define
\[
q_{2k-1,h} := \frac{N^{2/3}}{\varpi_{2k-1}}\big(\lambda_{k,h}-a_{2k-1}\big), \qquad q_{2k,h} := -\frac{N^{2/3}}{\varpi_{2k}}\big(\lambda_{k,N_k-h+1}-a_{2k}\big).
\]
Consider a smooth function θ on ℝ whose third derivative θ^{(3)} satisfies |θ^{(3)}(x)| ≤ C(1+|x|)^C for some constant C > 0. Then we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta(q_{k,h}) = 0. \tag{2.1.151}
\]
Together with Theorem 2.1.44, we have the following corollary. Denote t = 2k − 1 if α′ is given by (2.1.145) and t = 2k if α′ is given by (2.1.146).
Corollary 2.1.47. Under the assumptions of Theorem 2.1.44, for some positive integer h, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t,h},\ N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0, \tag{2.1.152}
\]
where θ is a smooth function on ℝ³ satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.153}
\]
Corollary 2.1.47 can be extended to a general form involving several bulk components. Denote t_i = 2k_i − 1 if α′_{k_i} is given by (2.1.145) and t_i = 2k_i if α′_{k_i} is given by (2.1.146).
Corollary 2.1.48. Under the assumptions of Theorem 2.1.45, for some positive integer h, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t_1,h},\ N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ q_{t_n,h},\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{3n} satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3, \text{ with some constant } C > 0. \tag{2.1.154}
\]
Remark 2.1.49. (i) Similarly to (2.1.150), the results in Corollaries 2.1.47 and 2.1.48 can easily be extended to a general form containing more entries of the singular vectors. For example, to extend Corollary 2.1.47, we can choose any positive integers β and h₁, ···, h_β. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i₁, j₁, ···, i_β, j_β ∈ I₁ and μ₁, ν₁, ···, μ_β, ν_β ∈ I₂, and for the corresponding l_i defined in (2.1.145) or (2.1.146), i = 1, 2, ···, β, there exists some 0 < δ < 1 with max_{1≤i≤β} l_i ≤ N_k^δ such that
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(q_{t_1,h_1},\ N\xi_{\alpha'_1}(i_1)\xi_{\alpha'_1}(j_1),\ N\zeta_{\alpha'_1}(\mu_1)\zeta_{\alpha'_1}(\nu_1),\ \cdots,\ q_{t_\beta,h_\beta},\ N\xi_{\alpha'_\beta}(i_\beta)\xi_{\alpha'_\beta}(j_\beta),\ N\zeta_{\alpha'_\beta}(\mu_\beta)\zeta_{\alpha'_\beta}(\nu_\beta)\big) = 0,
\]
where θ is a smooth function on ℝ^{3β} satisfying |∂^{(k)}θ(x)| ≤ C(1+||x||₂)^C, k = 1, 2, 3, for some constant C.
(ii) Theorems 2.1.44 and 2.1.45 and Corollaries 2.1.47 and 2.1.48 still hold true in the complex case, where the moment matching condition is replaced by
\[
\mathbb E_G\,x_{ij}^l\bar x_{ij}^u = \mathbb E_V\,x_{ij}^l\bar x_{ij}^u, \qquad 0 \le l+u \le 2. \tag{2.1.155}
\]
In the bulks, similar results hold under the stronger assumption that the first four moments of the matrix entries match those of the Gaussian ensembles.
Theorem 2.1.50. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and assume that the third and fourth moments of X^V agree with those of X^G. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a small δ ∈ (0,1) such that, when δN_k ≤ l ≤ (1−δ)N_k, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ² that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0. \tag{2.1.156}
\]
Theorem 2.1.51. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and assume that the third and fourth moments of X^V agree with those of X^G. Consider the k₁-th, ···, k_n-th bulk components, k₁, ···, k_n ∈ {1, 2, ···, p}, n ≤ p, and l_{k_i} defined in (2.1.145) or (2.1.146) and associated with the k_i-th bulk component, i = 1, 2, ···, n. Under Assumptions 2.1.41 and 2.1.43, for any choice of indices i, j ∈ I₁, μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when δN_{k_i} ≤ l_{k_i} ≤ (1−δ)N_{k_i}, i = 1, 2, ···, n, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\xi_{\alpha'_{k_1}}(i)\xi_{\alpha'_{k_1}}(j),\ N\zeta_{\alpha'_{k_1}}(\mu)\zeta_{\alpha'_{k_1}}(\nu),\ \cdots,\ N\xi_{\alpha'_{k_n}}(i)\xi_{\alpha'_{k_n}}(j),\ N\zeta_{\alpha'_{k_n}}(\mu)\zeta_{\alpha'_{k_n}}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ^{2n} that satisfies
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0. \tag{2.1.157}
\]
Remark 2.1.52. (i) Similarly to Corollaries 2.1.47 and 2.1.48 and (i) of Remark 2.1.49, we can extend the results to the joint distribution containing singular values. We take the extension of Theorem 2.1.50 as an example. By (ii) of Assumption 2.1.41, in the bulk we have
\[
\int_{\lambda_{\alpha'}}^{\gamma_{\alpha'}} d\rho = \frac 1N + o(N^{-1}).
\]
Using a similar Dyson Brownian motion argument, combined with Theorem 2.1.50, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(p_{\alpha'},\ N\xi_{\alpha'}(i)\xi_{\alpha'}(j),\ N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0, \tag{2.1.158}
\]
where p_{α′} is defined as
\[
p_{\alpha'} := \rho(\gamma_{\alpha'})\,N(\lambda_{\alpha'}-\gamma_{\alpha'}),
\]
and θ is a smooth function on ℝ³ satisfying
\[
|\partial^{(k)}\theta(x)| \le C(1+||x||_2)^C, \quad k = 1,2,3,4,5, \text{ with some constant } C > 0.
\]
(ii) Theorems 2.1.50 and 2.1.51 still hold true in the complex case, where the moment matching condition is replaced by
\[
\mathbb E_G\,x_{ij}^l\bar x_{ij}^u = \mathbb E_V\,x_{ij}^l\bar x_{ij}^u, \qquad 0 \le l+u \le 4. \tag{2.1.159}
\]
Applications to statistics. In this subsection, we give a few remarks on possible applications to statistics. It is notable that, in general, the distribution of the singular vectors of the sample covariance matrix Q = TXX*T* is unknown, even in the Gaussian case. However, when T is a scalar matrix (i.e. T = cI, c > 0), Bourgade and Yau [114, Appendix C] have shown that the entries of the singular vectors are asymptotically normally distributed. Hence, our universality results imply that under Assumptions 2.1.39, 2.1.41 and 2.1.43, when T is conformal (i.e. T*T = cI, c > 0), the entries of the right singular vectors are asymptotically normally distributed. Therefore, this can be used to test the null hypothesis
\[
H_0:\ T \text{ is a conformal matrix.} \tag{2.1.160}
\]
The statistical testing problem (2.1.160) contains a rich class of hypothesis tests. For instance, when T = I it reduces to the sphericity test, and when c = 1 it reduces to testing whether the covariance matrix of X is orthogonal [113].
To illustrate how our results can be used to test (2.1.160), we take c = 1 in the following discussion. Under H₀, the QR factorization of T reads T = UI, so the right singular vectors of TX are the same as those of X, namely ζ_k, k = 1, 2, ···, N. Using [114, Corollary 1.3], we find that for i, k = 1, 2, ···, N,
\[
\sqrt N\,\zeta_k(i) \to \mathcal N, \tag{2.1.161}
\]
where 𝒩 is a standard Gaussian random variable. In detail, we can take the following steps to test whether H₀ holds true:
1) Randomly choose two index sets R₁, R₂ ⊂ {1, 2, ···, N} with |R_i| = O(1), i = 1, 2.
2) Use the bootstrap to resample the columns of Q and obtain a sequence of M × N matrices Q_j, j = 1, 2, ···, K.
3) Extract ζ_k^j(i), k ∈ R₁, i ∈ R₂, from Q_j, j = 1, 2, ···, K. Use a classical normality test, for instance the Shapiro-Wilk test, to check whether (2.1.161) holds true for all the above samples. Record in A the number of samples that are not rejected by the normality test.
4) Given some pre-chosen significance level α, reject H₀ if A/(|R₁||R₂|) < 1 − α.
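The four steps above can be sketched as follows. The dimensions, index sets, number of resamples K, and the hand-rolled Jarque-Bera statistic (standing in here for the Shapiro-Wilk test of step 3) are all illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K, alpha = 50, 100, 40, 0.05
X = rng.standard_normal((M, N)) / np.sqrt(N)      # data under H0 with T = I

def normality_not_rejected(v, crit=5.99):         # Jarque-Bera, chi2(2) 5% cutoff
    v = np.asarray(v); n = len(v)
    z = (v - v.mean()) / v.std()
    S, Kex = np.mean(z**3), np.mean(z**4) - 3.0   # skewness, excess kurtosis
    return n / 6.0 * (S**2 + Kex**2 / 4.0) < crit

R1, R2 = [2, 5], [3, 7]                           # step 1: small index sets
samples = {(k, i): [] for k in R1 for i in R2}
for _ in range(K):                                # step 2: bootstrap the columns
    Xb = X[:, rng.integers(0, N, size=N)]
    Vt = np.linalg.svd(Xb, full_matrices=False)[2]
    for k in R1:
        for i in R2:                              # step 3: collect sqrt(N)*zeta_k(i)
            samples[(k, i)].append(np.sqrt(N) * Vt[k, i])

A = sum(normality_not_rejected(v) for v in samples.values())
reject_H0 = A / (len(R1) * len(R2)) < 1 - alpha   # step 4
```

In practice `scipy.stats.shapiro` could replace the moment-based statistic, matching the test named in step 3.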
The other important piece of information provided by our results is that the singular vectors are completely delocalized. Consider the low-rank matrix denoising problem
\[
\tilde S = TX + S,
\]
where S is a deterministic low-rank matrix. Consider the rank-one case and assume that the left singular vector u of S is sparse. Using the complete delocalization result, it can be shown that the first left singular vector of S̃ has the same sparse structure as that of u. Thus, to estimate the singular vectors of S, we only need to perform a singular value decomposition on a block submatrix of S̃.
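This observation can be sketched in a rank-one spiked model; the signal strength, sparsity level and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, s = 200, 400, 10
u = np.zeros(M); u[:s] = 1.0 / np.sqrt(s)          # sparse left singular vector
v = rng.standard_normal(N); v /= np.linalg.norm(v)
S = 5.0 * np.outer(u, v)                           # rank-one signal, strength 5
S_obs = rng.standard_normal((M, N)) / np.sqrt(N) + S  # noisy observation

u_hat = np.linalg.svd(S_obs)[0][:, 0]              # top left singular vector
energy_on_support = float(np.sum(u_hat[:s] ** 2))  # mass on the true support
print(energy_on_support)                           # close to 1
```

With a signal well above the noise edge, the top singular vector of the observation concentrates its energy on the support of u, so restricting the SVD to the corresponding block loses little.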
Notations and tools. In this part, we introduce some notation and tools which will be used in this paper. Throughout the paper, we will always use ε₁ for a small constant and D₁ for a large constant. Recall that the ESD of an n × n symmetric matrix H is defined as
\[
F_H^{(n)}(\lambda) := \frac 1n\sum_{i=1}^{n}\mathbf 1_{\{\lambda_i(H)\le\lambda\}}.
\]
For some small constant τ > 0, we define the typical domain for z = E + iη as
\[
D(\tau) = \{z \in \mathbb C_+ : |E| \le \tau^{-1},\ N^{-1+\tau} \le \eta \le \tau^{-1}\}. \tag{2.1.162}
\]
Definition 2.1.53 (Stieltjes transform). Recall that the Green functions for YY* and Y*Y are defined as
\[
G_1(z) := (YY^*-z)^{-1}, \qquad G_2(z) := (Y^*Y-z)^{-1}, \qquad z = E+i\eta \in \mathbb C_+. \tag{2.1.163}
\]
The Stieltjes transform of the ESD of Y*Y is given by
\[
m_2(z) \equiv m_2^{(N)}(z) := \int\frac{1}{x-z}\,dF_{Y^*Y}^{(N)}(x) = \frac 1N\sum_{i=1}^{N}(G_2)_{ii}(z) = \frac 1N\operatorname{Tr}G_2(z). \tag{2.1.164}
\]
Similarly, we can also define m₁(z) ≡ m₁^{(M)}(z) := M⁻¹ Tr G₁(z).
Definition 2.1.54. For z ∈ ℂ₊, we define the (N+M) × (N+M) self-adjoint matrix
\[
H \equiv H(X,\Sigma) := \begin{pmatrix}-zI & z^{1/2}Y\\ z^{1/2}Y^* & -zI\end{pmatrix}, \tag{2.1.165}
\]
and
\[
G \equiv G(X,z) := H^{-1}. \tag{2.1.166}
\]
By Schur's complement, it is easy to check that
\[
G = \begin{pmatrix}G_1(z) & z^{-1/2}G_1(z)Y\\ z^{-1/2}Y^*G_1(z) & z^{-1}Y^*G_1(z)Y-z^{-1}I\end{pmatrix} = \begin{pmatrix}z^{-1}YG_2(z)Y^*-z^{-1}I & z^{-1/2}YG_2(z)\\ z^{-1/2}G_2(z)Y^* & G_2(z)\end{pmatrix}, \tag{2.1.167}
\]
for G_{1,2} defined in (2.2.6). Thus a control of G directly yields a control of (YY* − z)⁻¹ and (Y*Y − z)⁻¹. Moreover, we have
\[
m_1(z) = \frac 1M\sum_{i\in I_1}G_{ii}, \qquad m_2(z) = \frac 1N\sum_{\mu\in I_2}G_{\mu\mu}.
\]
Recall that \(Y = \sum_{k=1}^{M\wedge N}\sqrt{\lambda_k}\,\xi_k\zeta_k^*\), ξ_k ∈ ℝ^{I₁}, ζ_k ∈ ℝ^{I₂}. By (2.2.48), we have
\[
G(z) = \sum_{k=1}^{M\wedge N}\frac{1}{\lambda_k-z}\begin{pmatrix}\xi_k\xi_k^* & z^{-1/2}\sqrt{\lambda_k}\,\xi_k\zeta_k^*\\ z^{-1/2}\sqrt{\lambda_k}\,\zeta_k\xi_k^* & \zeta_k\zeta_k^*\end{pmatrix}. \tag{2.1.168}
\]
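The block identities (2.1.167) can be verified numerically on a small instance; the dimensions and the spectral parameter z below are arbitrary, and z^{1/2} is taken as the principal branch:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 6
Y = rng.standard_normal((M, N)) / np.sqrt(N)
z = 1.5 + 0.3j
sz = np.sqrt(z)                                   # principal branch of z^{1/2}

H = np.block([[-z * np.eye(M), sz * Y],
              [sz * Y.T, -z * np.eye(N)]])
G = np.linalg.inv(H)

G1 = np.linalg.inv(Y @ Y.T - z * np.eye(M))
G2 = np.linalg.inv(Y.T @ Y - z * np.eye(N))
ok = (np.allclose(G[:M, :M], G1) and np.allclose(G[M:, M:], G2)
      and np.allclose(G[:M, M:], Y @ G2 / sz))    # z^{-1/2} Y G_2 block
print(ok)                                         # True
```

This mirrors the Schur complement computation: the top-left block of H⁻¹ is (YY* − z)⁻¹ and the bottom-right block is (Y*Y − z)⁻¹.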
Denote
\[
\Psi(z) := \sqrt{\frac{\operatorname{Im}m(z)}{N\eta}}+\frac{1}{N\eta}, \qquad \Sigma_o := \begin{pmatrix}\Sigma & 0\\ 0 & I\end{pmatrix}, \qquad \widetilde\Sigma := \begin{pmatrix}z^{-1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.169}
\]
Definition 2.1.55. For z ∈ ℂ₊, we define the I × I matrix
\[
\Pi(z) := \begin{pmatrix}-z^{-1}(1+m(z)\Sigma)^{-1} & 0\\ 0 & m(z)\end{pmatrix}. \tag{2.1.170}
\]
We will see later from Lemma 2.3.27 that G(z) converges to Π(z) in probability.
Remark 2.1.56. In [74, Definition 3.2], the linearizing block matrix is defined as
\[
H_o := \begin{pmatrix}-\Sigma^{-1} & X\\ X^* & -zI\end{pmatrix}. \tag{2.1.171}
\]
It is easy to check the following relation between (2.1.165) and (2.1.171):
\[
H = \begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}H_o\begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.172}
\]
In [74, Definition 3.3], the deterministic limit of H_o^{-1} is
\[
\Pi_o(z) = \begin{pmatrix}-\Sigma(1+m(z)\Sigma)^{-1} & 0\\ 0 & m(z)\end{pmatrix}. \tag{2.1.173}
\]
Therefore, by (2.1.172), we get a similar relation between (2.2.64) and (2.1.173):
\[
\Pi(z) = \begin{pmatrix}z^{-1/2}\Sigma^{-1/2} & 0\\ 0 & I\end{pmatrix}\Pi_o(z)\begin{pmatrix}z^{-1/2}\Sigma^{-1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.174}
\]
Definition 2.1.57. We introduce the notation X^{(T)} for the M × (N − |T|) minor of X obtained by deleting the i-th columns of X for i ∈ T. For convenience, ({i}) will be abbreviated to (i). We keep the original indices of X for X^{(T)}, that is, X^{(T)}_{ij} = 1(j ∉ T)X_{ij}. We will denote
\[
Y^{(T)} = \Sigma^{1/2}X^{(T)}, \qquad G_1^{(T)} = \big(Y^{(T)}Y^{(T)*}-zI\big)^{-1}, \qquad G_2^{(T)} = \big(Y^{(T)*}Y^{(T)}-zI\big)^{-1}. \tag{2.1.175}
\]
Consequently, m₁^{(T)}(z) = M⁻¹ Tr G₁^{(T)}(z) and m₂^{(T)}(z) = N⁻¹ Tr G₂^{(T)}(z).
Our key ingredient is the anisotropic local law derived by Knowles and Yin in [74].
Lemma 2.1.58. Fix τ > 0 and assume (2.3.1), (2.1.135) and (2.3.4) hold. Moreover, suppose that every edge k = 1, ···, 2p satisfies a_k ≥ τ and every bulk component k = 1, ···, p is regular in the sense of Assumption 2.1.41. Then for all z ∈ D(τ) and any unit vectors u, v ∈ ℝ^{M+N}, there exist a small constant ε₁ > 0 and a large constant D₁ > 0 such that, when N is large enough, with probability 1 − N^{−D₁} we have
\[
\big|\big\langle u,\ \widetilde\Sigma^{-1}\big(G(z)-\Pi(z)\big)\widetilde\Sigma^{-1}v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \tag{2.1.176}
\]
and
\[
|m_2(z)-m(z)| \le N^{\varepsilon_1}\Psi(z). \tag{2.1.177}
\]
Proof. (2.1.177) is already proved in (3.11) of [74]; we only need to prove (2.2.63). By (2.1.172), we have
\[
G_o(z) = \begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}G(z)\begin{pmatrix}z^{1/2}\Sigma^{1/2} & 0\\ 0 & I\end{pmatrix}. \tag{2.1.178}
\]
By [74, Theorem 3.6], with probability 1 − N^{−D₁} we have
\[
\big|\big\langle u,\ \Sigma_o^{-1}\big(G_o(z)-\Pi_o(z)\big)\Sigma_o^{-1}v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z). \tag{2.1.179}
\]
Therefore, by (2.1.174), (2.1.178) and (2.1.179), we conclude the proof.
It is easy to derive the following corollary from Lemma 2.3.27.
Corollary 2.1.59. Under the assumptions of Lemma 2.3.27, with probability 1 − N^{−D₁} we have
\[
\big|\big\langle v,\ (G_2(z)-m(z))v\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \qquad \big|\big\langle u,\ \big(G_1(z)+z^{-1}(1+m(z)\Sigma)^{-1}\big)u\big\rangle\big| \le N^{\varepsilon_1}\Psi(z), \tag{2.1.180}
\]
where v, u are unit vectors in ℝ^N and ℝ^M respectively.
We use the following lemma, which can be found in [74, Theorem 3.12], to characterize the rigidity of eigenvalues within each bulk component.
Lemma 2.1.60. Fix τ > 0 and assume (2.3.1), (2.1.135) and (2.3.4) hold. Moreover, suppose that every edge k = 1, ···, 2p satisfies a_k ≥ τ and every bulk component k = 1, ···, p is regular in the sense of Assumption 2.1.41. Recall that N_k is the number of eigenvalues within each bulk. Then for i = 1, ···, N_k satisfying γ_{k,i} ≥ τ and k = 1, ···, p, with probability 1 − N^{−D₁} we have
\[
|\lambda_{k,i}-\gamma_{k,i}| \le \big(i\wedge(N_k+1-i)\big)^{-1/3}N^{-2/3+\varepsilon_1}. \tag{2.1.181}
\]
Within the bulk, we have a stronger result. For small τ′ > 0, denote
\[
D_k^b := \{z \in D(\tau) : E \in [a_{2k}+\tau',\ a_{2k-1}-\tau']\}, \qquad k = 1, 2, \cdots, p, \tag{2.1.182}
\]
as the bulk spectral domain; then [74, Theorem 3.15] gives the following result.
Lemma 2.1.61. Fix τ, τ′ > 0, assume (2.3.1), (2.1.135) and (2.3.4) hold, and assume the bulk component k = 1, ···, p is regular in the sense of (ii) of Assumption 2.1.41. Then for all i = 1, ···, N_k satisfying γ_{k,i} ∈ [a_{2k}+τ′, a_{2k−1}−τ′], the bounds (2.2.63) and (2.1.177) hold uniformly for all z ∈ D_k^b, and with probability 1 − N^{−D₁},
\[
|\lambda_{k,i}-\gamma_{k,i}| \le N^{-1+\varepsilon_1}. \tag{2.1.183}
\]
As discussed in [74, Remark 3.13], Lemmas 2.3.27 and 2.2.22 imply the complete delocalization of the singular vectors.
Lemma 2.1.62. Fix τ > 0. Under the assumptions of Lemma 2.3.27, for any i, μ such that γ_i, γ_μ ≥ τ, with probability 1 − N^{−D₁} we have
\[
\max_{i,s_1}|\xi_i(s_1)|^2 + \max_{\mu,s_2}|\zeta_\mu(s_2)|^2 \le N^{-1+\varepsilon_1}. \tag{2.1.184}
\]
Proof. By (2.1.180), with probability 1 − N^{−D₁} we have max{Im G_{ii}(z), Im G_{μμ}(z)} = O(1). Choosing z₀ = E + iη₀ with η₀ = N^{−1+ε₁} and using the spectral decomposition (2.2.49), we find that
\[
\sum_{k=1}^{N\wedge M}\frac{\eta_0}{(E-\lambda_k)^2+\eta_0^2}\,|\xi_k(i)|^2 = \operatorname{Im}G_{ii}(z_0) = O(1), \tag{2.1.185}
\]
\[
\sum_{k=1}^{N\wedge M}\frac{\eta_0}{(E-\lambda_k)^2+\eta_0^2}\,|\zeta_k(\mu)|^2 = \operatorname{Im}G_{\mu\mu}(z_0) = O(1), \tag{2.1.186}
\]
hold with probability 1 − N^{−D₁}. Choosing E = λ_k in (2.1.185) and (2.1.186) finishes the proof.
Singular vectors near the edges. In this section, we prove the universality of the distributions of the edge singular vectors, Theorems 2.1.44 and 2.1.45, as well as of the joint distribution of singular values and singular vectors, Corollaries 2.1.47 and 2.1.48. The main identities on which we rely are
\[
\mathcal G_{ij} = \sum_{\beta=1}^{M\wedge N}\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,\xi_\beta(i)\xi_\beta(j), \qquad \mathcal G_{\mu\nu} = \sum_{\beta=1}^{M\wedge N}\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,\zeta_\beta(\mu)\zeta_\beta(\nu), \tag{2.1.187}
\]
where 𝒢_{ij}, 𝒢_{μν} are defined as
\[
\mathcal G_{ij} := \frac{1}{2i}\big(G_{ij}(z)-G_{ij}(\bar z)\big), \qquad \mathcal G_{\mu\nu} := \frac{1}{2i}\big(G_{\mu\nu}(z)-G_{\mu\nu}(\bar z)\big). \tag{2.1.188}
\]
Due to similarity, we focus our proof on the right singular vectors. The proofs rely on three main steps: (i) writing Nζ_β(μ)ζ_β(ν) as an integral of 𝒢_{μν} over a random interval of size O(N^ε η), where ε > 0 is a small constant and η = N^{−2/3−ε₀}, with ε₀ > 0 to be chosen later; (ii) replacing the sharp characteristic function obtained from step (i) with a smooth cutoff function q in terms of the Green function; (iii) using the Green function comparison argument to compare the distribution of the singular vectors between the ensembles X^G and X^V.
We will follow the proof strategy of [71, Section 3] with slight modifications of the details. Specifically, the choices of the random interval in step (i) and of the smooth function q in step (ii) differ because we have more than one bulk component, and the Green function comparison argument is also slightly different because we use the linearization matrix (2.2.49).
We mainly focus on a single bulk component: we first prove the singular vector distribution and then extend the results to singular values. The results involving several bulk components follow after minor modifications. We first prove the following result for the right singular vectors.
Lemma 2.1.63. Let Q^V = Σ^{1/2}X^V X^{V*}Σ^{1/2} satisfy Assumption 2.1.39, and let 𝔼_G, 𝔼_V denote the expectations with respect to X^G, X^V. Consider the k-th bulk component, k = 1, 2, ···, p, and l defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices μ, ν ∈ I₂, there exists a δ ∈ (0,1) such that, when l ≤ N_k^δ, we have
\[
\lim_{N\to\infty}\,[\mathbb E_V-\mathbb E_G]\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = 0,
\]
where θ is a smooth function on ℝ that satisfies
\[
|\theta^{(3)}(x)| \le C_1(1+|x|)^{C_1}, \quad x \in \mathbb R, \text{ with some constant } C_1 > 0. \tag{2.1.189}
\]
Near the edges, by (2.1.181) and (2.1.184), with probability 1 − N^{−D₁} we have
\[
|\lambda_{\alpha'}-\gamma_{\alpha'}| \le N^{-2/3+\varepsilon_1}, \qquad \max_{\mu,s_2}|\zeta_\mu(s_2)|^2 \le N^{-1+\varepsilon_1}. \tag{2.1.190}
\]
Hence, throughout the proofs of this section, we always use the scale parameter
\[
\eta = N^{-2/3-\varepsilon_0}, \qquad \text{where } \varepsilon_0 > \varepsilon_1 \text{ is a small constant.} \tag{2.1.191}
\]
Proof of Lemma 2.1.63. In a first step, we express the singular vector entries as an integral of Green functions over a random interval, which is recorded in the following lemma.
Lemma 2.1.64. Under the assumptions of Lemma 2.1.63, there exists a small constant 0 < δ < 1 such that
\[
\lim_{N\to\infty}\max_{l\le N_k^\delta}\max_{\mu,\nu}\bigg|\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) - \mathbb E_V\,\theta\bigg[\frac N\pi\int_I\mathcal G_{\mu\nu}(z)\,\mathcal X(E)\,dE\bigg]\bigg| = 0, \tag{2.1.192}
\]
where I is defined as
\[
I := [a_{2k-1}-N^{-2/3+\varepsilon},\ a_{2k-1}+N^{-2/3+\varepsilon}] \tag{2.1.193}
\]
when (2.1.145) holds, and as
\[
I := [a_{2k}-N^{-2/3+\varepsilon},\ a_{2k}+N^{-2/3+\varepsilon}] \tag{2.1.194}
\]
when (2.1.146) holds, with ε satisfying, for C₁ defined in (2.1.189),
\[
2(C_1+1)(\delta+\varepsilon_1) < \varepsilon < c\,\varepsilon_0, \qquad c > 0 \text{ a constant much smaller than } 1. \tag{2.1.195}
\]
The function 𝒳(E) is defined as
\[
\mathcal X(E) := \mathbf 1\big(\lambda_{\alpha'+1} < E_- \le \lambda_{\alpha'}\big), \tag{2.1.196}
\]
where E_± := E ± N^ε η. The conclusion holds true if we replace X^V with X^G.
Proof. We first observe that
\[
\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu) = \frac\eta\pi\int_{\mathbb R}\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE.
\]
Choose a, b such that
\[
a := \min\{\lambda_{\alpha'}-N^\varepsilon\eta,\ \lambda_{\alpha'+1}+N^\varepsilon\eta\}, \qquad b := \lambda_{\alpha'}+N^\varepsilon\eta. \tag{2.1.197}
\]
We also record the elementary inequality (see the equation above (6.10) of [53]): for some constant C > 0,
\[
\int_x^{\infty}\frac{\eta}{\pi(y^2+\eta^2)}\,dy \le \frac{C\eta}{x+\eta}, \qquad x > 0. \tag{2.1.198}
\]
By (2.1.190), (2.1.197) and (2.1.198), with probability 1 − N^{−D₁} we have
\[
\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu) = \frac\eta\pi\int_a^b\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE + O(N^{-1-\varepsilon+\varepsilon_1}). \tag{2.1.199}
\]
By (2.1.189), (2.1.190), (2.1.195), (2.1.199) and the mean value theorem, we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_a^b\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE\bigg) + o(1). \tag{2.1.200}
\]
Denote λ_t^± := λ_t ± N^ε η, t = α′, α′+1. By (2.1.197), we have
\[
\int_a^b dE = \int_{\lambda_{\alpha'+1}^+}^{\lambda_{\alpha'}^+} dE + \mathbf 1\big(\lambda_{\alpha'+1}^+ > \lambda_{\alpha'}^-\big)\int_{\lambda_{\alpha'}^-}^{\lambda_{\alpha'+1}^+} dE.
\]
By (2.1.189), (2.1.190), (2.1.200) and the mean value theorem, we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_{\lambda_{\alpha'+1}^+}^{\lambda_{\alpha'}^+}\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,dE\bigg) + o(1), \tag{2.1.201}
\]
where we use (2.1.181) and (2.1.195). Without loss of generality, we may consider the case when (2.1.145) holds true. By (2.1.190) and (2.1.195), we observe that with probability 1 − N^{−D₁} we have λ_{α′}^+ ≤ a_{2k−1} + N^{−2/3+ε} and λ_{α′+1}^+ ≥ a_{2k−1} − N^{−2/3+ε}. By (2.1.181) and the choice of I in (2.1.193), we have
\[
\mathbb E_V\,\theta\big(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\big) = \mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_I\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,\mathcal X(E)\,dE\bigg) + o(1). \tag{2.1.202}
\]
Recalling (2.1.187), we can split the summation as
\[
\frac 1\eta\,\mathcal G_{\mu\nu}(z) = \sum_{\beta\ne\alpha'}\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2} + \frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}. \tag{2.1.203}
\]
Denote A := {β ≠ α′ : λ_β is not in the k-th bulk component}. By (2.1.190), with probability 1 − N^{−D₁} we have
\[
\bigg|\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2}\,dE\bigg| \le \frac{N^{\varepsilon_1}}\pi\bigg(\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE + \sum_{\beta\in A^c}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE\bigg). \tag{2.1.204}
\]
By Assumption 2.1.41, with probability 1 − N^{−D₁} we have
\[
\frac{N^{\varepsilon_1}}\pi\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{\varepsilon_1}\sum_{\beta\in A}N^{-4/3-\varepsilon_0+\varepsilon}. \tag{2.1.205}
\]
Denote
\[
l(\beta) := \beta - \sum_{t<k}N_t. \tag{2.1.206}
\]
By (2.1.190), with probability 1 − N^{−D₁}, for some small constant 0 < δ < 1 we have
\[
\frac{N^{\varepsilon_1}}\pi\sum_{\beta\in A^c}\int_I\frac{\eta}{(E-\lambda_\beta)^2+\eta^2}\,dE \le N^{\varepsilon_1+\delta} + \frac 1\pi\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE. \tag{2.1.207}
\]
By Assumption 2.1.41, (2.1.142) and (2.1.181), it is easy to check that (see (3.12) of [71])
\[
(E-\lambda_\beta)^2 \ge c\Big(\frac{l(\beta)}N\Big)^{4/3}, \qquad c > 0 \text{ some constant.} \tag{2.1.208}
\]
By (2.1.208), with probability 1 − N^{−D₁} we have
\[
\frac 1\pi\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{\varepsilon_1-\varepsilon_0+\varepsilon}\int_{N^{\delta-1}}^{N}\frac{1}{x^{4/3}}\,dx \le N^{-\delta/3+\varepsilon_1-\varepsilon_0+\varepsilon}.
\]
Recalling (2.1.195), we can restrict ε₁ − ε₀ + ε < 0; with probability 1 − N^{−D₁} this yields
\[
\sum_{\beta\in A^c;\,l(\beta)\ge N_k^\delta}\int_I\frac{N^{\varepsilon_1}\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{-\delta/3}. \tag{2.1.209}
\]
By (2.1.204), (2.1.205), (2.1.207) and (2.1.209), with probability 1 − N^{−D₁} we have
\[
\bigg|\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{\zeta_\beta(\mu)\zeta_\beta(\nu)}{(E-\lambda_\beta)^2+\eta^2}\,dE\bigg| \le N^{\delta+2\varepsilon_1}. \tag{2.1.210}
\]
By (2.1.189), (2.1.190), (2.1.203), (2.1.210) and the mean value theorem, we have
\[
\bigg|\mathbb E_V\,\theta\bigg(\frac{N\eta}\pi\int_I\frac{\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)}{(E-\lambda_{\alpha'})^2+\eta^2}\,\mathcal X(E)\,dE\bigg) - \mathbb E_V\,\theta\bigg(\frac N\pi\int_I\mathcal G_{\mu\nu}(E+i\eta)\,\mathcal X(E)\,dE\bigg)\bigg| \le N^{C_1(\delta+2\varepsilon_1)}\,\mathbb E_V\sum_{\beta\ne\alpha'}\frac{N\eta}\pi\int_I\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\,\mathcal X(E)\,dE, \tag{2.1.211}
\]
where C₁ is defined in (2.1.189). To finish the proof, it suffices to estimate the right-hand side of (2.1.211). Similarly to (2.1.205), we have
\[
\sum_{\beta\in A}\int_I\frac{\eta}{\eta^2+(E-\lambda_\beta)^2}\,dE \le N^{-1/3-\varepsilon_0+\varepsilon}. \tag{2.1.212}
\]
Choose a small constant 0 < δ1 < 1, repeat the estimation of (2.1.209), we have
∑β∈Ac; l(β)≥Nδ1
k
∫I
η
η2 + (E − λβ)2dE ≤ N−δ1/3+ε−ε0 . (2.1.213)
Recall (2.1.145) and restrict $\epsilon>2((C_1+1)\epsilon_1+\delta_1+C_1\delta)$. By (2.1.190) and (2.1.198), we have
\[
\sum_{\beta\in A^c;\,l\le l(\beta)\le N_k^{\delta_1}}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_I\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE
\le \mathbb{E}^V\int_{\lambda_{\alpha'+1}+N^{\epsilon}\eta}^{\infty}\frac{N^{\delta_1+\epsilon_1}\eta}{(E-\lambda_{\alpha'+1})^2+\eta^2}\,dE
\le N^{-\epsilon+\epsilon_1+\delta_1},
\tag{2.1.214}
\]
where we use the fact that $\beta\in A^c$ and $l<l(\beta)\le N_k^{\delta_1}$ implies $\lambda_\beta\le\lambda_{\alpha'+1}$. It remains to estimate the summation of the terms with $\beta\in A^c$ and $l(\beta)<l$. For a given constant $\epsilon'$ satisfying
\[
\frac{1}{2}\left(\epsilon_0+3\epsilon+2(C_1+1)\epsilon_1+(C_1+1)\delta\right)<\epsilon'<\epsilon_0,
\tag{2.1.215}
\]
we partition $I=I_1\cup I_2$ with $I_1\cap I_2=\emptyset$ by denoting
\[
I_1 := \{E\in I : \exists\,\beta\in A^c,\ l(\beta)<l,\ |E-\lambda_\beta|\le N^{\epsilon'}\eta\}.
\tag{2.1.216}
\]
By (2.1.190) and (2.1.216), we have
\[
\sum_{\beta\in A^c;\,l(\beta)<l}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_{I_2}\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE \le N^{-2\epsilon'+\epsilon_0+\epsilon+\epsilon_1+\delta}.
\tag{2.1.217}
\]
It is easy to check that on $I_1$, when $\lambda_{\alpha'+1}\le\lambda_{\alpha'}<\lambda_\beta$, we have
\[
\frac{1}{(E-\lambda_\beta)^2+\eta^2}\,\mathbf{1}(E_-\le\lambda_{\alpha'}) \le \frac{N^{2\epsilon}}{(\lambda_{\alpha'+1}-\lambda_{\alpha'})^2+\eta^2}.
\tag{2.1.218}
\]
By (2.1.190) and (2.1.218), we have
\[
\sum_{\beta\in A^c;\,l(\beta)\le l}\frac{N\eta}{\pi}\,\mathbb{E}^V\int_{I_1}\frac{|\zeta_\beta(\mu)\zeta_\beta(\nu)|}{(E-\lambda_\beta)^2+\eta^2}\mathcal{X}(E)\,dE
\le \mathbb{E}^V\int_{I_1}\frac{N^{\delta+\epsilon_1+2\epsilon-2/3}\eta}{(\lambda_{\alpha'+1}-\lambda_{\alpha'})^2+\eta^2}\,dE
\le N^{\delta+\epsilon_1+3\epsilon-D_1+2/3+\epsilon_0}+N^{-2\epsilon'+\epsilon_0+\epsilon_1+\delta+3\epsilon}.
\tag{2.1.219}
\]
By (2.1.212), (2.1.213), (2.1.214), (2.1.215) and (2.1.219), we conclude the estimate of the right-hand side of (2.1.211), and hence the proof. It is clear that our proof still applies when we replace $X_V$ with $X_G$.
In a second step, we write the sharp indicator function of (2.1.196) as some smooth function $q$ of $G_{\mu\nu}$. To be consistent with the proof of Lemma 2.1.64, we consider the bulk edge $a_{2k-1}$. Denote
\[
\vartheta_\eta(x) := \frac{\eta}{\pi(x^2+\eta^2)} = \frac{1}{\pi}\operatorname{Im}\frac{1}{x-i\eta}.
\tag{2.1.220}
\]
We define a smooth cutoff function $q\equiv q_{\alpha'}:\mathbb{R}\to\mathbb{R}_+$ satisfying
\[
q(x)=1 \ \text{if } |x-l|\le\tfrac{1}{3}; \qquad q(x)=0 \ \text{if } |x-l|\ge\tfrac{2}{3},
\tag{2.1.221}
\]
where $l$ is defined in (2.1.145). We also denote $Q_1=Y^*Y$.
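The identity in (2.1.220), which presents the Poisson kernel $\vartheta_\eta$ as the imaginary part of a resolvent-type factor, can be verified numerically; the following sketch (illustrative only, with arbitrary scale $\eta$) also confirms that $\vartheta_\eta$ is an approximate delta function of unit mass:

```python
import numpy as np

# Check of the identity (2.1.220): eta/(pi*(x^2+eta^2)) = (1/pi)*Im 1/(x - i*eta),
# and the kernel integrates to (almost exactly) one.
eta = 0.05
x = np.linspace(-50.0, 50.0, 400001)
theta = eta / (np.pi * (x ** 2 + eta ** 2))
im_form = np.imag(1.0 / (x - 1j * eta)) / np.pi
assert np.allclose(theta, im_form)
dx = x[1] - x[0]
mass = theta.sum() * dx
print(mass)  # close to 1; the Lorentzian tails carry the tiny remainder
```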
Lemma 2.1.65. For $\epsilon$ defined in (2.1.195), denote
\[
\mathcal{X}_E(x) := \mathbf{1}(E_-\le x\le E_U),
\tag{2.1.222}
\]
where $E_U := a_{2k-1}+2N^{-2/3+\epsilon}$. Let $\eta := N^{-2/3-9\epsilon_0}$, where $\epsilon_0$ is defined in (2.1.191). We have
\[
\lim_{N\to\infty}\max_{l\le N_k^{\delta}}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr)-\mathbb{E}^V\theta\!\left(\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right)\right| = 0,
\tag{2.1.223}
\]
where $I$ is defined in (2.1.193) and $*$ is the convolution operator.
Proof. For any $E_1<E_2$, denote the number of eigenvalues of $Q_1$ in $[E_1,E_2]$ by
\[
\mathcal{N}(E_1,E_2) := \#\{j : E_1\le\lambda_j\le E_2\}.
\tag{2.1.224}
\]
Recalling (2.1.193) and (2.1.196), it is easy to check that with $1-N^{-D_1}$ probability, we have
\[
N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE = N\int_I G_{\mu\nu}(z)\,\mathbf{1}\bigl(\mathcal{N}(E_-,E_U)=l\bigr)\,dE = N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}\mathcal{X}_E(Q_1)\bigr]\,dE,
\tag{2.1.225}
\]
where for the second equality we use (2.1.181) and Assumption 2.1.41. We use the following lemma to estimate (2.1.224) by its delta approximation smoothed on the scale $\eta$.
Lemma 2.1.66. For $t=N^{-2/3-3\epsilon_0}$, there exists some constant $C$ such that, with $1-N^{-D_1}$ probability, for any $E$ satisfying
\[
|E_--a_{2k-1}| \le \tfrac{3}{2}N^{-2/3+\epsilon},
\tag{2.1.226}
\]
we have
\[
\bigl|\operatorname{Tr}\mathcal{X}_E(Q_1)-\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr| \le C\bigl(N^{-2\epsilon_0}+\mathcal{N}(E_--t,E_-+t)\bigr).
\tag{2.1.227}
\]
By (A.7) of [74], for any $z\in D(\tau)$ defined in (2.2.42), we have
\[
\operatorname{Im}m(z) \sim \begin{cases}\eta/\sqrt{\kappa+\eta}, & E\notin\operatorname{supp}(\rho),\\[2pt] \sqrt{\kappa+\eta}, & E\in\operatorname{supp}(\rho),\end{cases}
\tag{2.1.228}
\]
where $\kappa:=|E-a_{2k-1}|$. When $\mu=\nu$, with $1-N^{-D_1}$ probability, we have
\[
\sup_{E\in I}|G_{\mu\mu}(E+i\eta)| = \sup_{E\in I}|\operatorname{Im}G_{\mu\mu}(z)| \le \sup_{E\in I}\bigl(|\operatorname{Im}(G_{\mu\mu}(z)-m(z))|+|\operatorname{Im}m(z)|\bigr) \le N^{-1/3+\epsilon_0+2\epsilon},
\]
where we use (2.1.180) and (2.1.228). When $\mu\neq\nu$, we use the identity
\[
G_{\mu\nu} = \eta\sum_{k=M+1}^{M+N}G_{\mu k}G_{\nu k}.
\]
By (2.1.180) and (2.1.228), with $1-N^{-D_1}$ probability, we have $\sup_{E\in I}|G_{\mu\nu}(z)|\le N^{-1/3+\epsilon_0+2\epsilon}$. Therefore, for $E\in I$, with $1-N^{-D_1}$ probability, we have
\[
\sup_{E\in I}|G_{\mu\nu}(E+i\eta)| \le N^{-1/3+3\epsilon_0/2}.
\tag{2.1.229}
\]
Recalling (2.2.17), by (2.1.225), (2.1.227), (2.1.229) and the smoothness of $q$, with $1-N^{-D_1}$ probability, we have
\[
\left|N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE-N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right|
\le CN\sum_{l(\beta)\le N_k^{\delta}}\int_I|G_{\mu\nu}(z)|\,\mathbf{1}(|E_--\lambda_\beta|\le t)\,dE+N^{-\epsilon_0/4}
\le CN^{1+\delta}|t|\sup_{z\in I}|G_{\mu\nu}(z)|+N^{-\epsilon_0/4}.
\tag{2.1.230}
\]
By (2.1.229) and (2.1.230), we have
\[
\left|N\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE-N\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right| \le CN^{-\epsilon_0/2+\delta}+N^{-\epsilon_0/4}.
\]
Using a similar discussion to (2.1.204), by (2.1.189) and (2.1.195), we finish the proof.
In the final step, we use the Green function comparison argument to prove the following lemma.

Lemma 2.1.67. Under the assumptions of Lemma 2.1.65, we have
\[
\lim_{N\to\infty}\max_{\mu,\nu}\,(\mathbb{E}^V-\mathbb{E}^G)\,\theta\!\left(\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right) = 0.
\]
Once Lemma 2.1.67 is proved, the proof of Lemma 2.1.63 follows from Lemma 2.1.65.
Green function comparison argument. In this part, we will prove Lemma 2.1.67 using the Green function comparison argument. At the end of this section, we will discuss how to extend Lemma 2.1.63 to Theorem 2.1.44 and Theorem 2.1.45. By the orthonormality of $\xi,\zeta$ and (2.2.49), we have
\[
G_{ij} = \eta\sum_{k=1}^{M}G_{ik}G_{jk}, \qquad G_{\mu\nu} = \eta\sum_{k=M+1}^{M+N}G_{\mu k}G_{\nu k}.
\tag{2.1.231}
\]
By (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
|G_{\mu\mu}| = O(1), \qquad |G_{\mu\nu}| \le N^{-1/3+2\epsilon_0} \ \ (\mu\neq\nu).
\tag{2.1.232}
\]
We first drop all the diagonal terms in (2.2.82).
Lemma 2.1.68. Recall $E_U=a_{2k-1}+2N^{-2/3+\epsilon}$ and $\eta=N^{-2/3-9\epsilon_0}$. We have
\[
\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]\,dE\right]-\mathbb{E}^V\theta\!\left[\int_I x(E)\,q(y(E))\,dE\right] = o(1),
\tag{2.1.233}
\]
where we denote $X_{\mu\nu,k}:=G_{\mu k}G_{\nu k}$ and
\[
x(E) := \frac{N\eta}{\pi}\sum_{\substack{k=M+1\\ k\neq\mu,\nu}}^{M+N}X_{\mu\nu,k}(E+i\eta), \qquad
y(E) := \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_k\sum_{\beta\neq k}X_{\beta\beta,k}(w+i\eta)\,dw.
\tag{2.1.234}
\]
The conclusion holds true if we replace $X_V$ with $X_G$.
Proof. We first observe that by (2.1.232), with $1-N^{-D_1}$ probability, we have
\[
|x(E)| \le N^{2/3+3\epsilon_0},
\tag{2.1.235}
\]
which implies that
\[
\int_I|x(E)|\,dE \le N^{4\epsilon_0}.
\tag{2.1.236}
\]
By (2.2.82) and (2.1.232), with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{N}{\pi}G_{\mu\nu}(E+i\eta)-x(E)\right| = \frac{N\eta}{\pi}\bigl|G_{\mu\mu}G_{\nu\mu}+G_{\mu\nu}G_{\nu\nu}\bigr| \le N\eta\bigl(\mathbf{1}(\mu=\nu)+N^{-1/3+2\epsilon_0}\mathbf{1}(\mu\neq\nu)\bigr).
\tag{2.1.237}
\]
By equations (5.11) and (6.42) of [40], we have
\[
\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1) = \frac{N}{\pi}\int_{E_-}^{E_U}\operatorname{Im}m_2(w+i\eta)\,dw, \qquad
\sum_{\mu\nu}|G_{\mu\nu}(w+i\eta)|^2 = \frac{N\operatorname{Im}m_2(w+i\eta)}{\eta}.
\tag{2.1.238}
\]
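The second identity in (2.1.238) is a Ward-type identity for resolvents. A quick numerical check on a generic Hermitian matrix (illustrative only; the thesis applies it to the blocks of the linearized matrix) confirms the exact relation:

```python
import numpy as np

# Ward-type identity: for Hermitian A and G(z) = (A - z)^{-1}, z = E + i*eta,
# sum_{jk} |G_jk|^2 = Im Tr G / eta.  This holds exactly, for any z off the axis.
rng = np.random.default_rng(0)
n = 60
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
z = 0.3 + 0.01j
G = np.linalg.inv(A - z * np.eye(n))
lhs = (np.abs(G) ** 2).sum()
rhs = np.trace(G).imag / z.imag
assert np.isclose(lhs, rhs)
```

The identity follows from $\sum_{jk}|G_{jk}|^2 = \operatorname{Tr} GG^* = \sum_i |\lambda_i - z|^{-2}$ and $\operatorname{Im}\operatorname{Tr}G = \eta\sum_i|\lambda_i-z|^{-2}$.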
Therefore, we have
\[
\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)-y(E) = \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_{\beta=M+1}^{M+N}|G_{\beta\beta}|^2\,dw.
\tag{2.1.239}
\]
By (2.1.239), the mean value theorem and the fact that $q$ is smooth enough, we have
\[
\bigl|q\bigl[\operatorname{Tr}(\mathcal{X}_E*\vartheta_\eta)(Q_1)\bigr]-q[y(E)]\bigr| \le N^{-1/3-7\epsilon_0}.
\tag{2.1.240}
\]
Therefore, by the mean value theorem, (2.1.189), (2.1.195), (2.1.235), (2.1.236), (2.1.237) and (2.1.240), we can conclude our proof.

To prove Lemma 2.1.67, by (2.1.233), it suffices to prove
\[
[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q(y(E))\,dE\right] = o(1).
\tag{2.1.241}
\]
For the rest, we will use the Green function comparison argument to prove (2.1.241), where we follow the basic approach of [40, Section 6] and [72, Section 3.1]. Define a bijective ordering map $\Phi$ on the index set,
\[
\Phi : \{(i,\mu_1) : 1\le i\le M,\ M+1\le\mu_1\le M+N\} \to \{1,\ldots,\gamma_{\max}=MN\}.
\]
Recall that we relabel $X_V=((X_V)_{i\mu_1},\ i\in\mathcal{I}_1,\ \mu_1\in\mathcal{I}_2)$, and similarly for $X_G$. For any $1\le\gamma\le\gamma_{\max}$, we define the matrix $X^{\gamma}=(x^{\gamma}_{i\mu_1})$ such that $x^{\gamma}_{i\mu_1}=X^G_{i\mu_1}$ if $\Phi(i,\mu_1)>\gamma$, and $x^{\gamma}_{i\mu_1}=X^V_{i\mu_1}$ otherwise. Note that $X^0=X_G$ and $X^{\gamma_{\max}}=X_V$. With the above definitions, we have
\[
[\mathbb{E}^G-\mathbb{E}^V]\,\theta\!\left[\int_I x(E)q(y(E))\,dE\right] = \sum_{\gamma=1}^{\gamma_{\max}}[\mathbb{E}_{\gamma-1}-\mathbb{E}_{\gamma}]\,\theta\!\left[\int_I x(E)q(y(E))\,dE\right].
\]
For simplicity, we rewrite the above equation as
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\!\left[\theta\!\left(\int_I x^{\gamma-1}q(y^{\gamma-1})\,dE\right)-\theta\!\left(\int_I x^{\gamma}q(y^{\gamma})\,dE\right)\right].
\tag{2.1.242}
\]
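The telescoping identity (2.1.242) is an exact algebraic fact about entrywise replacement; the following toy sketch (illustrative only, far from the thesis' matrix setting) shows the mechanism with a vector whose coordinates are swapped one at a time:

```python
import numpy as np

# Toy Lindeberg-type telescoping: replacing entries one at a time interpolates
# between f(x^G) and f(x^V); the total difference is the sum of one-swap steps.
rng = np.random.default_rng(1)
n = 8
xG = rng.standard_normal(n)
xV = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # matches N(0,1) in mean and variance
f = lambda v: np.sin(v.sum() / np.sqrt(n))

def interpolate(gamma):
    # first gamma coordinates already replaced, the rest still Gaussian
    return np.concatenate([xV[:gamma], xG[gamma:]])

telescoped = sum(f(interpolate(g)) - f(interpolate(g - 1)) for g in range(1, n + 1))
assert np.isclose(telescoped, f(xV) - f(xG))
```

In the comparison argument, each one-swap difference is then bounded by Taylor expansion, using that the low moments of the two ensembles agree.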
The key step of the Green function comparison argument is the Lindeberg replacement strategy. We focus on the indices $\mu,\nu\in\mathcal{I}_2$; the other cases follow similarly. Denote $Y_\gamma:=\Sigma^{1/2}X^{\gamma}$ and
\[
H_\gamma := \begin{pmatrix}0 & z^{1/2}Y_\gamma\\ z^{1/2}Y_\gamma^* & 0\end{pmatrix}, \qquad
G_\gamma := \begin{pmatrix}-zI & z^{1/2}Y_\gamma\\ z^{1/2}Y_\gamma^* & -zI\end{pmatrix}^{-1}.
\tag{2.1.243}
\]
As $\Sigma$ is diagonal, for each fixed $\gamma$, $H_\gamma$ and $H_{\gamma-1}$ differ only in the $(i,\mu_1)$ and $(\mu_1,i)$ entries, where $\Phi(i,\mu_1)=\gamma$. Then we define the $(N+M)\times(N+M)$ matrices $V$ and $W$ by
\[
V_{ab} = z^{1/2}\bigl(\mathbf{1}_{(a,b)=(i,\mu_1)}+\mathbf{1}_{(a,b)=(\mu_1,i)}\bigr)\sqrt{\sigma_i}\,X^G_{i\mu_1}, \qquad
W_{ab} = z^{1/2}\bigl(\mathbf{1}_{(a,b)=(i,\mu_1)}+\mathbf{1}_{(a,b)=(\mu_1,i)}\bigr)\sqrt{\sigma_i}\,X^V_{i\mu_1},
\]
so that $H_\gamma$ and $H_{\gamma-1}$ can be written as
\[
H_{\gamma-1} = O+V, \qquad H_\gamma = O+W,
\]
for some $(N+M)\times(N+M)$ matrix $O$ satisfying $O_{i\mu_1}=O_{\mu_1 i}=0$, where $O$ is independent of $V$ and $W$. Denote
\[
S := (H_{\gamma-1}-z)^{-1}, \qquad R := (O-z)^{-1}, \qquad T := (H_\gamma-z)^{-1}.
\tag{2.1.244}
\]
With the above definitions, we can write
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\!\left[\theta\!\left(\int_I x^Sq(y^S)\,dE\right)-\theta\!\left(\int_I x^Tq(y^T)\,dE\right)\right].
\tag{2.1.245}
\]
The comparison argument is based on the following resolvent expansion:
\[
S = R-RVR+(RV)^2R-(RV)^3R+(RV)^4S.
\tag{2.1.246}
\]
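The expansion (2.1.246) is an exact finite identity (obtained by iterating $S = R - RVS$ four times), not an asymptotic series; this can be confirmed numerically on small matrices (illustrative sketch, with an arbitrary two-entry perturbation mimicking $V$):

```python
import numpy as np

# Exactness of the resolvent expansion: with S = (O + V - z)^{-1}, R = (O - z)^{-1},
# S = R - RVR + (RV)^2 R - (RV)^3 R + (RV)^4 S  holds as an identity.
rng = np.random.default_rng(2)
n = 10
O = rng.standard_normal((n, n)); O = (O + O.T) / 2
V = np.zeros((n, n)); V[1, 7] = V[7, 1] = 0.3   # single symmetric entry pair, as in the text
z = 0.2 + 0.05j
I = np.eye(n)
R = np.linalg.inv(O - z * I)
S = np.linalg.inv(O + V - z * I)
RV = R @ V
expansion = R - RV @ R + RV @ RV @ R - RV @ RV @ RV @ R + RV @ RV @ RV @ RV @ S
assert np.allclose(expansion, S)
```

Because $V$ carries a single (small) matrix entry, each extra factor of $RV$ contributes an extra power of the entry, which is what makes the moment-matching Taylor comparison work.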
For any integer $m>0$, by (6.11) of [40], we have
\[
([RV]^mR)_{ab} = \sum_{(a_i,b_i)\in\{(i,\mu_1),(\mu_1,i)\}:\,1\le i\le m} z^{m/2}\sigma_i^{m/2}\bigl(X^G_{i\mu_1}\bigr)^m R_{aa_1}R_{b_1a_2}\cdots R_{b_mb},
\tag{2.1.247}
\]
\[
([RV]^mS)_{ab} = \sum_{(a_i,b_i)\in\{(i,\mu_1),(\mu_1,i)\}:\,1\le i\le m} z^{m/2}\sigma_i^{m/2}\bigl(X^G_{i\mu_1}\bigr)^m R_{aa_1}R_{b_1a_2}\cdots S_{b_mb}.
\tag{2.1.248}
\]
Denote
\[
\Delta X_{\mu\nu,k} := S_{\mu k}S_{\nu k}-R_{\mu k}R_{\nu k}.
\tag{2.1.249}
\]
In [72], the discussion relies on a crucial parameter (see (3.32) of [72]), which counts the maximum number of diagonal resolvent elements in $\Delta X_{\mu\nu,k}$. We will follow this strategy but use a different counting parameter, and furthermore use (2.1.247) and (2.1.248) as our key ingredients. Our discussion is slightly easier due to the loss of a free index (i.e. $i\neq\mu_1$).

Inserting (2.1.246) into (2.1.249), by (2.1.247) and (2.1.248), we find that there exists a random variable $A_1$ which depends on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$. Taking the partial expectation with respect to the $(i,\mu_1)$-th entry of $X_G$ (recall that the entries are i.i.d.), by (2.1.135), we have the following result.
Lemma 2.1.69. Recall (2.2.62) and denote by $\mathbb{E}_\gamma$ the partial expectation with respect to $X^G_{i\mu_1}$. There exists some constant $C>0$ such that, with $1-N^{-D_1}$ probability, we have
\[
|\mathbb{E}_\gamma\Delta X_{\mu\nu,k}-A_1| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}, \qquad M+1\le k\neq\mu,\nu\le M+N,
\tag{2.1.250}
\]
where $s$ counts the maximum number of resolvent elements in $\Delta X_{\mu\nu,k}$ involving the index $\mu_1$ and is defined as
\[
s := \mathbf{1}\bigl((\{\mu,\nu\}\cap\{\mu_1\}\neq\emptyset)\cup(k=\mu_1)\bigr).
\tag{2.1.251}
\]
Proof. Inserting (2.1.246) into (2.1.249), the terms in the expansion containing $X^G_{i\mu_1}$ and $(X^G_{i\mu_1})^2$ will be included in $A_1$; we only consider the terms containing $(X^G_{i\mu_1})^m$, $m\ge3$. We consider $m=3$ and discuss the following terms:
\[
R_{\mu k}[(RV)^3R]_{\nu k}, \qquad [RVR]_{\mu k}[(RV)^2R]_{\nu k}.
\]
By (2.1.247), we have
\[
R_{\mu k}[(RV)^3R]_{\nu k} = R_{\mu k}\sum\sigma_i^{3/2}\bigl(X^G_{i\mu_1}\bigr)^3z^{3/2}R_{\nu a_1}R_{b_1a_2}R_{b_2a_3}R_{b_3k}.
\tag{2.1.252}
\]
In the worst scenario, $R_{b_1a_2}$ and $R_{b_2a_3}$ are diagonal entries of $R$.
Similarly, we have
\[
[RVR]_{\mu k}[(RV)^2R]_{\nu k} = \left(\sum z^{1/2}\sigma_i^{1/2}X^G_{i\mu_1}R_{\mu a_1}R_{b_1k}\right)\left(\sum\sigma_i\bigl(X^G_{i\mu_1}\bigr)^2zR_{\nu a_1}R_{b_1a_2}R_{b_2k}\right),
\tag{2.1.253}
\]
and the worst scenario is the case when $R_{b_1a_2}$ is a diagonal term. As $\mu,\nu\neq i$ always holds and there are only finitely many terms in the summation, by (2.1.135) and (2.1.232), for some constant $C$, we have
\[
\mathbb{E}_\gamma\bigl|R_{\mu k}[(RV)^3R]_{\nu k}\bigr| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}.
\]
Similarly, we have
\[
\mathbb{E}_\gamma\bigl|[RVR]_{\mu k}[(RV)^2R]_{\nu k}\bigr| \le N^{-3/2+C\epsilon_0}\Psi(z)^{3-s}.
\]
The other cases $4\le m\le8$ can be handled similarly. Hence, we conclude our proof.
Lemma 2.1.67 follows from the following result. Recalling (2.1.234), denote
\[
\Delta x(E) := x^S(E)-x^R(E), \qquad \Delta y(E) := y^S(E)-y^R(E).
\]
Lemma 2.1.70. For any fixed $\mu,\nu,\gamma$, there exists a random variable $A$, which depends on the randomness only through $O$ and the first two moments of $X_G$, such that
\[
\mathbb{E}\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\mathbb{E}\theta\!\left[\int_I x^Rq(y^R)\,dE\right] = A+o(N^{-2+t}),
\tag{2.1.254}
\]
where $t:=|\{\mu,\nu\}\cap\{\mu_1\}|\in\{0,1\}$ counts whether one of $\mu,\nu$ equals $\mu_1$.
Before proving Lemma 2.1.70, we first show how Lemma 2.1.70 implies Lemma 2.1.67.

Proof of Lemma 2.1.67. It is easy to check that Lemma 2.1.70 still holds true when we replace $S$ with $T$. Note that in (2.1.245), there are $O(N)$ terms with $t=1$ and $O(N^2)$ terms with $t=0$. By (2.1.254), we have
\[
\mathbb{E}\!\left[\theta\!\left(\int_I x^Gq(y^G)\,dE\right)-\theta\!\left(\int_I x^Vq(y^V)\,dE\right)\right] = o(1),
\]
where we use the assumption that the first two moments of $X_V$ are the same as those of $X_G$. Combined with (2.1.233), we conclude the proof.
Finally, we follow the approach of [72, Lemma 3.6] to finish the proof of Lemma 2.1.70. A key observation is that when $s=0$ we have a smaller bound, but the total number of such terms is $O(N)$ for $x(E)$ and $O(N^2)$ for $y(E)$; when $s=1$ we have a larger bound, but the number of such terms is $O(1)$. We therefore need to analyze the terms with $s=0$ and $s=1$ separately.
Proof of Lemma 2.1.70. Conditioning on the variable $s=0,1$, we introduce the following decomposition:
\[
x_s(E) := \frac{N\eta}{\pi}\sum_{\substack{k=M+1\\ k\neq\mu,\nu}}^{M+N}X_{\mu\nu,k}(E+i\eta)\,\mathbf{1}\Bigl(s=\mathbf{1}\bigl((\{\mu,\nu\}\cap\{\mu_1\}\neq\emptyset)\cup(k=\mu_1)\bigr)\Bigr),
\]
\[
y_s(E) := \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_k\sum_{\beta\neq k}X_{\beta\beta,k}(w+i\eta)\,\mathbf{1}\Bigl(s=\mathbf{1}\bigl((\beta=\mu_1)\cup(k=\mu_1)\bigr)\Bigr)\,dw.
\]
$\Delta x_s,\Delta y_s$ can be defined in the same fashion. Similar to the discussion of (2.1.250), for any $E$-dependent variable $f\equiv f(E)$ independent of the $(i,\mu_1)$-th entry of $X_G$, there exist two random variables $A_2,A_3$, which depend on the randomness only through $O$, $f$ and the first two moments of $X^G_{i\mu_1}$, such that for any event $\Omega$, with $1-N^{-D_1}$ probability, we have
\[
\left|\int_I\mathbb{E}_\gamma\Delta x_s(E)f(E)\,dE-A_2\right|\mathbf{1}(\Omega) \le \|f\mathbf{1}(\Omega)\|_\infty N^{-11/6+C\epsilon_0}N^{-2s/3+t},
\tag{2.1.255}
\]
\[
|\mathbb{E}_\gamma\Delta y_s(E)-A_3| \le N^{-11/6+C\epsilon_0}N^{-2s/3}.
\tag{2.1.256}
\]
In our application, $f$ is usually a function of the entries of $R$ (recall that $R$ is independent of $V$). Next, we use
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right] = \theta\!\left[\int_I(x^R+\Delta x_0+\Delta x_1)\,q(y^R+\Delta y_0+\Delta y_1)\,dE\right].
\tag{2.1.257}
\]
By (2.1.246), (2.1.247) and (2.1.248), it is easy to check that, with $1-N^{-D_1}$ probability, we have
\[
\int_I|\Delta x_s(E)|\,dE \le N^{-5/6+C\epsilon_0}N^{-2s/3+t}, \qquad |\Delta y_s(E)| \le N^{-5/6+C\epsilon_0}N^{-2s/3},
\tag{2.1.258}
\]
\[
\int_I|x(E)|\,dE \le N^{C\epsilon_0}, \qquad |y(E)| \le N^{C\epsilon_0}.
\tag{2.1.259}
\]
By (2.1.257) and (2.1.258), with $1-N^{-D_1}$ probability, we have
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right] = \theta\!\left[\int_I x^S\bigl(q(y^R)+q'(y^R)(\Delta y_0+\Delta y_1)+q''(y^R)(\Delta y_0)^2\bigr)\,dE\right]+o(N^{-2}).
\]
Similarly, we have (see (3.44) of [71])
\[
\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\theta\!\left[\int_I x^Rq(y^R)\,dE\right] = \theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl((\Delta x_0+\Delta x_1)q(y^R)+x^Rq'(y^R)(\Delta y_0+\Delta y_1)+\Delta x_0q'(y^R)\Delta y_0+x^Rq''(y^R)(\Delta y_0)^2\Bigr)\,dE\right]
+\frac{1}{2}\theta''\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\bigl(\Delta x_0q(y^R)+x^Rq'(y^R)\Delta y_0\bigr)\,dE\right]^2+o(N^{-2+t}).
\tag{2.1.260}
\]
Now we deal with the individual terms on the right-hand side of (2.1.260). First, we consider the terms containing $\Delta x_1,\Delta y_1$. Similar to (2.1.250), we can find a random variable $A_4$, which depends on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$, such that with $1-N^{-D_1}$ probability,
\[
\left|\mathbb{E}_\gamma\int_I\bigl(\Delta x_1q(y^R)+x^Rq'(y^R)\Delta y_1\bigr)\,dE-A_4\right| = o(N^{-2+t}).
\]
Hence, we only need to focus on $\Delta x_0,\Delta y_0$. We first observe that
\[
\Delta x_0(E) = \mathbf{1}(t=0)\,\frac{N\eta}{\pi}\sum_{k\neq\mu,\nu,\mu_1}\Delta X_{\mu\nu,k}(z), \qquad
\Delta y_0(E) = \frac{\eta}{\pi}\int_{E_-}^{E_U}\sum_{k\neq\mu_1}\sum_{\beta\neq k,\mu_1}\Delta X_{\beta\beta,k}(w+i\eta)\,dw.
\]
Denote by $\Delta x_0^{(k)}(E)$ the summation of the terms in $\Delta x_0(E)$ containing $k$ factors of $X^G_{i\mu_1}$. By (2.1.232), (2.1.246) and (2.1.247), it is easy to check that with $1-N^{-D_1}$ probability,
\[
|\Delta x_0^{(3)}| \le N^{-7/6+C\epsilon_0}, \qquad |\Delta y_0^{(3)}| \le N^{-11/6+C\epsilon_0}.
\tag{2.1.261}
\]
We now decompose $\Delta X_{\mu\nu,k}$ into three parts indexed by the number of factors of $X^G_{i\mu_1}$ they contain. By (2.1.232), (2.1.247), (2.1.248) and (2.1.261), with $1-N^{-D_1}$ probability, we have
\[
\Delta X_{\mu\nu,k} = \Delta X^{(1)}_{\mu\nu,k}+\Delta X^{(2)}_{\mu\nu,k}+\Delta X^{(3)}_{\mu\nu,k}+O(N^{-3+C\epsilon_0}),
\tag{2.1.262}
\]
\[
\Delta x_0 = \Delta x_0^{(1)}+\Delta x_0^{(2)}+\Delta x_0^{(3)}+O(N^{-5/3+C\epsilon_0}),
\tag{2.1.263}
\]
\[
\Delta y_0 = \Delta y_0^{(1)}+\Delta y_0^{(2)}+\Delta y_0^{(3)}+O(N^{-7/3+C\epsilon_0}).
\tag{2.1.264}
\]
Inserting (2.1.263) and (2.1.264) into (2.1.260), similarly to the discussion of (2.1.250), we can find a random variable $A_5$ depending on the randomness only through $O$ and the first two moments of $X^G_{i\mu_1}$, such that with $1-N^{-D_1}$ probability,
\[
\mathbb{E}_\gamma\theta\!\left[\int_I x^Sq(y^S)\,dE\right]-\mathbb{E}_\gamma\theta\!\left[\int_I x^Rq(y^R)\,dE\right]
= \mathbb{E}_\gamma\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl(\Delta x_0^{(3)}q(y^R)+x^Rq'(y^R)\Delta y_0^{(3)}\Bigr)\,dE\right]+A_4+A_5+o(N^{-2+t}).
\tag{2.1.265}
\]
Lemma 2.1.70 will be proved if we can show
\[
\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Bigl(\Delta x_0^{(3)}q(y^R)+x^Rq'(y^R)\Delta y_0^{(3)}\Bigr)\,dE\right] = o(N^{-2}).
\tag{2.1.266}
\]
Due to the similarity, we shall only prove
\[
\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Delta x_0^{(3)}q(y^R)\,dE\right] = o(N^{-2});
\]
the other term follows similarly. By (2.1.189) and (2.1.259), with $1-N^{-D_1}$ probability, we have $|B^R| := \bigl|\theta'[\int_I x^Rq(y^R)\,dE]\bigr| \le N^{C\epsilon_0}$. Similar to (2.1.252), $\Delta x_0^{(3)}$ is a finite sum of terms of the form
\[
\mathbf{1}(t=0)\,N\eta\sum_{k\neq\mu,\nu,\mu_1}R_{\mu k}\,\sigma_i^{3/2}\bigl(X^G_{i\mu_1}\bigr)^3z^{3/2}R_{\nu a_1}R_{b_1a_2}R_{b_2a_3}R_{b_3k}.
\tag{2.1.267}
\]
Inserting (2.1.267) into $\int_I\Delta x_0^{(3)}q(y^R)\,dE$, for some constant $C>0$, we have
\[
\left|\mathbb{E}\theta'\!\left[\int_I x^Rq(y^R)\,dE\right]\left[\int_I\Delta x_0^{(3)}q(y^R)\,dE\right]\right| \le N^{-5/6+C\epsilon_0}\max_{k\neq\mu,\nu,\mu_1}\sup_{E\in I}\bigl|\mathbb{E}B^RR_{\mu k}R_{\nu\mu_1}R_{ik}q(y^R)\bigr|+o(N^{-2}).
\tag{2.1.268}
\]
Again by (2.1.246), (2.1.247) and (2.1.248), it is easy to check that with $1-N^{-D_1}$ probability, for some constant $C>0$, we have
\[
\bigl|R_{\mu k}R_{\nu\mu_1}R_{ik}B^Rq(y^R)-S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)\bigr| \le N^{-4/3+C\epsilon_0}.
\]
Therefore, if we can show
\[
\bigl|\mathbb{E}S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)\bigr| \le N^{-4/3+C\epsilon_0},
\tag{2.1.269}
\]
then by (2.1.268), we finish proving (2.1.266). It remains to prove (2.1.269). Recalling Definition 2.1.57 and (2.1.243), by [40, Lemma A.2], we have the following resolvent identities:
\[
S^{(\mu_1)}_{\mu\nu} = S_{\mu\nu}-\frac{S_{\mu\mu_1}S_{\mu_1\nu}}{S_{\mu_1\mu_1}}, \qquad \mu,\nu\neq\mu_1,
\tag{2.1.270}
\]
\[
S_{\mu\nu} = zS_{\mu\mu}S^{(\mu)}_{\nu\nu}\bigl(Y_{\gamma-1}^*S^{(\mu\nu)}Y_{\gamma-1}\bigr)_{\mu\nu}, \qquad \mu\neq\nu.
\tag{2.1.271}
\]
By (2.1.247), (2.1.248) and (2.1.270), it is easy to check that (see (3.72) of [72])
\[
\bigl|S_{\mu k}S_{\nu\mu_1}S_{ik}B^Sq(y^S)-S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr)\bigr| \le N^{-4/3+C\epsilon_0}.
\tag{2.1.272}
\]
Moreover, by (3.73) of [72], we have
\[
S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr) = \bigl(S_{\mu k}S_{ik}B^Sq(y^S)\bigr)^{(\mu_1)}S_{\nu\mu_1}.
\tag{2.1.273}
\]
As $t=0$, by (2.1.271), we have
\[
S_{\nu\mu_1} = zm(z)S^{(\nu)}_{\mu_1\mu_1}\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}+z\bigl(S_{\nu\nu}-m(z)\bigr)S^{(\nu)}_{\mu_1\mu_1}\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}.
\tag{2.1.274}
\]
The conditional expectation $\mathbb{E}_\gamma$ applied to the first term of (2.1.274) vanishes; hence its contribution to the expectation of (2.1.273) vanishes as well. By (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
|S_{\nu\nu}-m(z)| \le N^{-1/3+C\epsilon_0}.
\tag{2.1.275}
\]
By the large deviation bound, with $1-N^{-D_1}$ probability, we have
\[
\left|\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}\right| \le N^{\epsilon_1}\frac{\bigl(\sum_{p,q}|S^{(\nu\mu_1)}_{pq}|^2\bigr)^{1/2}}{N}.
\tag{2.1.276}
\]
By (2.1.180) and (2.1.276), with $1-N^{-D_1}$ probability, we have
\[
\left|\sum_{p,q}S^{(\nu\mu_1)}_{pq}(Y_{\gamma-1}^*)_{\nu p}(Y_{\gamma-1})_{q\mu_1}\right| \le N^{-1/3+C\epsilon_0}.
\tag{2.1.277}
\]
Therefore, inserting (2.1.275) and (2.1.277) into (2.1.273), by (2.1.180), we have
\[
\bigl|\mathbb{E}S^{(\mu_1)}_{\mu k}S_{\nu\mu_1}S^{(\mu_1)}_{ik}(B^S)^{(\mu_1)}q\bigl((y^S)^{(\mu_1)}\bigr)\bigr| \le N^{-4/3+C\epsilon_0}.
\]
Combined with (2.1.272), we conclude our proof.

It is clear that our proof can be extended to the left singular vectors. For the proof of Theorem 2.1.44, the only difference is to use the mean value theorem in $\mathbb{R}^2$ whenever needed. Moreover, for the proof of Theorem 2.1.45, we need to use the $n$ intervals defined by
\[
I_i := [a_{2k_i-1}-N^{-2/3+\epsilon},\ a_{2k_i-1}+N^{-2/3+\epsilon}], \qquad i=1,2,\cdots,n.
\]
Singular vectors in the bulks In this section, we will prove the bulk universality Theorems 2.1.50 and 2.1.51. Our key ingredients, Lemmas 2.3.27 and 2.1.62 and Corollary 2.1.59, are proved for $N^{-1+\tau}\le\eta\le\tau^{-1}$ (recall (2.2.42)). In the bulks, by Lemma 2.1.61, the eigenvalue spacing is of order $N^{-1}$. The following lemma extends the above controls for a small spectral scale all the way down to the real axis. The proof relies on Corollary 2.1.59, and the details can be found in [72, Lemma 5.1].

Lemma 2.1.71. Recall (2.1.182). For $z\in D_b^k$ with $0<\eta\le\tau^{-1}$, when $N$ is large enough, with $1-N^{-D_1}$ probability, we have
\[
\max_{\mu,\nu}|G_{\mu\nu}-\delta_{\mu\nu}m(z)| \le N^{\epsilon_1}\Psi(z).
\tag{2.1.278}
\]
Once Lemma 2.1.71 is established, Lemmas 2.1.61 and 2.1.62 follow. Next, we follow the basic proof strategy of Theorem 2.1.44 but use a different spectral window size. Again, we will only provide the proof of the following Lemma 2.1.72, which establishes the universality of the distribution of $\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)$, in detail. Till the end of this section, we always use the scale parameter
\[
\eta = N^{-1-\epsilon_0}, \qquad \text{where } \epsilon_0>\epsilon_1 \text{ is a small constant}.
\tag{2.1.279}
\]
Therefore, the following bounds hold with $1-N^{-D_1}$ probability:
\[
\max_\mu|G_{\mu\mu}(z)| \le N^{2\epsilon_0}, \qquad \max_{\mu\neq\nu}|G_{\mu\nu}(z)| \le N^{2\epsilon_0}, \qquad \max_{\mu,s}|\zeta_\mu(s)|^2 \le N^{-1+\epsilon_0}.
\tag{2.1.280}
\]
The following lemma states the bulk universality for $\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)$.

Lemma 2.1.72. Let $Q_V=\Sigma^{1/2}X_VX_V^*\Sigma^{1/2}$ satisfy Assumption 2.1.39, assume that the third and fourth moments of $X_V$ agree with those of $X_G$, and consider the $k$-th bulk component, $k=1,2,\cdots,p$, with $l$ defined in (2.1.145) or (2.1.146). Under Assumptions 2.1.41 and 2.1.43, for any choice of indices $\mu,\nu\in\mathcal{I}_2$, there exists a small $\delta\in(0,1)$ such that when $\delta N_k\le l\le(1-\delta)N_k$, we have
\[
\lim_{N\to\infty}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr) = 0,
\]
where $\theta$ is a smooth function on $\mathbb{R}$ that satisfies
\[
|\theta^{(5)}(x)| \le C_1(1+|x|)^{C_1}, \quad \text{for some constant } C_1>0.
\tag{2.1.281}
\]
Proof. The proof strategy is very similar to that of Lemma 2.1.63. Our first step is an analogue of Lemma 2.1.64; the proof is quite similar (actually easier, as the window size is much smaller), so we omit further details.
Lemma 2.1.73. Under the assumptions of Lemma 2.1.72, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\bigl(N\zeta_{\alpha'}(\mu)\zeta_{\alpha'}(\nu)\bigr)-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE\right]\right| = 0,
\tag{2.1.282}
\]
where $\mathcal{X}(E)$ is defined in (2.1.196) and, for $\epsilon$ satisfying (2.1.195), $I$ is denoted as
\[
I := [\gamma_{\alpha'}-N^{-1+\epsilon},\ \gamma_{\alpha'}+N^{-1+\epsilon}].
\tag{2.1.283}
\]
Next, we will express the indicator function in (2.1.282) using Green functions. Recall (2.1.222); a key observation there is that the size of $[E_-,E_U]$ is of order $N^{-2/3}$ due to (2.1.191). As we now use (2.1.279) and (2.1.283) in the bulks, the size here is of order 1, so we cannot use the delta approximation to estimate $\mathcal{X}(E)$. Instead, we will use the Helffer-Sjöstrand functional calculus, which has been used many times when the window size $\eta$ takes the form (2.1.279).

For any $0<E_1,E_2\le\tau^{-1}$, let $f(\lambda)\equiv f_{E_1,E_2,\eta_d}(\lambda)$ be the characteristic function of $[E_1,E_2]$ smoothed on the scale
\[
\eta_d := N^{-1-d\epsilon_0}, \qquad d>2;
\tag{2.1.284}
\]
that is, $f=1$ when $\lambda\in[E_1,E_2]$, $f=0$ when $\lambda\in\mathbb{R}\setminus[E_1-\eta_d,E_2+\eta_d]$, and
\[
|f'| \le C\eta_d^{-1}, \qquad |f''| \le C\eta_d^{-2},
\tag{2.1.285}
\]
for some constant $C>0$. Denote $f_E\equiv f_{E_-,E_U,\eta_d}$. We have
\[
f_E(\lambda) = \frac{1}{2\pi}\int_{\mathbb{R}^2}\frac{i\sigma f_E''(e)\chi(\sigma)+if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)}{\lambda-e-i\sigma}\,de\,d\sigma,
\tag{2.1.286}
\]
where $\chi(y)$ is a smooth cutoff function supported in $[-1,1]$, with $\chi(y)=1$ for $|y|\le\frac{1}{2}$ and bounded derivatives. Using a similar argument to Lemma 2.1.65, we have the following result.
Lemma 2.1.74. Recall the smooth cutoff function $q$ defined in (2.2.17). Under the assumptions of Lemma 2.1.73, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}\left|\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\mathcal{X}(E)\,dE\right]-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(z)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right]\right| = 0.
\tag{2.1.287}
\]
Proof. It is easy to check that with $1-N^{-D_1}$ probability, (2.1.225) still holds true. Therefore, it remains to prove the following:
\[
\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}\mathcal{X}_E(Q_1)\bigr)\,dE\right]-\mathbb{E}^V\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = o(1).
\tag{2.1.288}
\]
We first observe that for any $x\in\mathbb{R}$, we have
\[
|\mathcal{X}_E(x)-f_E(x)| = \begin{cases}0, & x\in[E_-,E_U]\cup(-\infty,E_--\eta_d)\cup(E_U+\eta_d,+\infty);\\[2pt] |f_E(x)|, & x\in[E_--\eta_d,E_-)\cup(E_U,E_U+\eta_d].\end{cases}
\]
Therefore, we have
\[
\bigl|\operatorname{Tr}\mathcal{X}_E(Q_1)-\operatorname{Tr}f_E(Q_1)\bigr| \le \max_x|f_E(x)|\Bigl(\mathcal{N}(E_--\eta_d,E_-)+\mathcal{N}(E_U,E_U+\eta_d)\Bigr).
\]
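This comparison of sharp and smoothed eigenvalue counts can be illustrated numerically (a sketch only; the piecewise-linear smoothing below is an arbitrary stand-in for the smooth $f_E$, and the scales are toy choices):

```python
import numpy as np

# Replacing the sharp indicator of [E1, E2] by a version smoothed on scale eta_d
# changes the eigenvalue count only by the number of eigenvalues in the two
# boundary strips [E1 - eta_d, E1) and (E2, E2 + eta_d].
rng = np.random.default_rng(3)
n = 200
X = rng.standard_normal((n, n)) / np.sqrt(n)
evals = np.linalg.eigvalsh(X @ X.T)
E1, E2, eta_d = 0.5, 2.0, 0.05

def f_smooth(lam):
    # equals 1 on [E1, E2], 0 outside [E1 - eta_d, E2 + eta_d], linear in between
    return np.clip(np.minimum(lam - (E1 - eta_d), (E2 + eta_d) - lam) / eta_d, 0.0, 1.0)

sharp = np.sum((evals >= E1) & (evals <= E2))
smooth = f_smooth(evals).sum()
boundary = np.sum((evals >= E1 - eta_d) & (evals < E1)) + np.sum((evals > E2) & (evals <= E2 + eta_d))
assert abs(smooth - sharp) <= boundary + 1e-9
```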
By Lemma 2.1.61, the definition of $\eta_d$ and a similar argument to (2.1.230), we can finish the proof of (2.1.288).

Finally, we apply the Green function comparison argument, following the basic approach of [72, Section 5]. The key difference is that we will use (2.1.279) and (2.1.280).
Lemma 2.1.75. Under the assumptions of Lemma 2.1.74, there exists $0<\delta<1$ such that
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\frac{N}{\pi}\int_I G_{\mu\nu}(E+i\eta)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = 0.
\tag{2.1.289}
\]
Proof. Recalling (2.1.286), by (2.3.52), we have
\[
\operatorname{Tr}f_E(Q_1) = \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(i\sigma f_E''(e)\chi(\sigma)+if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma.
\tag{2.1.290}
\]
Denote $\eta_d := N^{-1-(d+1)\epsilon_0}$; we can decompose the right-hand side of (2.1.290) as
\[
\operatorname{Tr}f_E(Q_1) = \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma
+\frac{iN}{2\pi}\int_{|\sigma|>\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma
+\frac{iN}{2\pi}\int_{-\eta_d}^{\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma.
\]
By (2.1.280) and (2.1.285), for some constant $C>0$, with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{iN}{2\pi}\int_{-\eta_d}^{\eta_d}\sigma\chi(\sigma)\int f_E''(e)m_2(e+i\sigma)\,de\,d\sigma\right| \le N^{-C\epsilon_0}.
\tag{2.1.291}
\]
Recall (2.2.82) and (2.1.234); similarly to Lemma 2.1.68, we first drop the diagonal terms. By (2.1.278), with $1-N^{-D_1}$ probability, we have (recall (2.1.237))
\[
\int_I\left|\frac{N}{\pi}G_{\mu\nu}(E+i\eta)-x(E)\right|dE \le N^{-1+C\epsilon_0},
\]
for some constant $C>0$. Hence, by the mean value theorem, we only need to prove
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(\operatorname{Tr}f_E(Q_1)\bigr)\,dE\right] = 0.
\tag{2.1.292}
\]
Furthermore, by Taylor expansion, (2.1.291) and the definition of $\chi$, it suffices to prove
\[
\lim_{N\to\infty}\max_{\delta N_k\le l\le(1-\delta)N_k}\max_{\mu,\nu}[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(y(E)+\underline{y}(E)\bigr)\,dE\right] = 0,
\tag{2.1.293}
\]
where
\[
y(E) := \frac{N}{2\pi}\int_{\mathbb{R}^2}i\sigma f_E''(e)\chi(\sigma)m_2(e+i\sigma)\mathbf{1}(|\sigma|\ge\eta_d)\,de\,d\sigma,
\tag{2.1.294}
\]
\[
\underline{y}(E) := \frac{N}{2\pi}\int_{\mathbb{R}^2}\bigl(if_E(e)\chi'(\sigma)-\sigma f_E'(e)\chi'(\sigma)\bigr)m_2(e+i\sigma)\,de\,d\sigma.
\tag{2.1.295}
\]
Next, we will use the Green function comparison argument to prove (2.1.293). In the proof of Lemma 2.1.67, we used the resolvent expansion up to order 4. However, due to the larger bounds in (2.1.280), we will use the following expansion:
\[
S = R-RVR+(RV)^2R-(RV)^3R+(RV)^4R-(RV)^5S.
\tag{2.1.296}
\]
Recalling (2.1.244) and (2.1.245), we have
\[
[\mathbb{E}^V-\mathbb{E}^G]\,\theta\!\left[\int_I x(E)\,q\bigl(y(E)+\underline{y}(E)\bigr)\,dE\right] = \sum_{\gamma=1}^{\gamma_{\max}}\mathbb{E}\left(\theta\!\left[\int_I x^Sq\bigl(y^S+\underline{y}^S\bigr)\,dE\right]-\theta\!\left[\int_I x^Tq\bigl(y^T+\underline{y}^T\bigr)\,dE\right]\right).
\tag{2.1.297}
\]
We still use the notation $\Delta x(E):=x^S(E)-x^R(E)$. We first deal with $x(E)$. Denote by $\Delta x^{(k)}(E)$ the summation of the terms in $\Delta x(E)$ containing $k$ factors of $X^G_{i\mu_1}$. Similarly to the discussion of Lemma 2.1.69, recalling (2.1.249), by (2.1.135) and (2.1.280), with $1-N^{-D_1}$ probability, we have
\[
|\Delta x^{(5)}(E)| \le N^{-3/2+C\epsilon_0}, \qquad M+1\le k\neq\mu,\nu\le M+N.
\]
This yields
\[
\Delta x(E) = \sum_{p=1}^{4}\Delta x^{(p)}(E)+O(N^{-3/2+C\epsilon_0}).
\tag{2.1.298}
\]
Denote
\[
\Delta\underline{y}(E) := \underline{y}^S(E)-\underline{y}^R(E), \qquad \Delta m_2 := m_2^S-m_2^R = \frac{1}{N}\sum_{\mu=M+1}^{M+N}(S_{\mu\mu}-R_{\mu\mu}).
\]
We first deal with (2.1.295). By the definition of $\chi$, we may restrict to $\frac{1}{2}\le|\sigma|\le1$; hence, by (2.1.180), with $1-N^{-D_1}$ probability, we have
\[
\max_\mu|G_{\mu\mu}| \le N^{\epsilon_1}, \qquad \max_{\mu\neq\nu}|G_{\mu\nu}| \le N^{-1/2+\epsilon_1}.
\tag{2.1.299}
\]
By (2.1.247), (2.1.248), (2.1.296) and (2.1.299), with $1-N^{-D_1}$ probability, we have $|\Delta m_2^{(5)}|\le N^{-7/2+9\epsilon_1}$. This yields the following decomposition:
\[
\Delta\underline{y}(E) = \sum_{p=1}^{4}\Delta\underline{y}^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.300}
\]
Next, we will control (2.1.294). Denote $\Delta y(E):=y^S(E)-y^R(E)$. By (2.1.247), (2.1.248) and (2.1.278), with $1-N^{-D_1}$ probability, we have
\[
|\Delta m_2^{(5)}| \le N^{-5/2+C\epsilon_0}.
\tag{2.1.301}
\]
In order to estimate $\Delta y(E)$, we integrate (2.1.294) by parts, first in $e$ and then in $\sigma$; by (5.24) of [71], with $1-N^{-D_1}$ probability, we have
\[
\left|\frac{N}{2\pi}\int_{\mathbb{R}^2}i\sigma f_E''(e)\chi(\sigma)\Delta m_2^{(5)}(e+i\sigma)\mathbf{1}(|\sigma|\ge\eta_d)\,de\,d\sigma\right|
\le CN\left|\int f_E'(e)\,\eta_d\,\Delta m_2^{(5)}(e+i\eta_d)\,de\right|
+CN\left|\int f_E'(e)\,de\int_{\eta_d}^{\infty}\chi'(\sigma)\sigma\,\Delta m_2^{(5)}(e+i\sigma)\,d\sigma\right|
+CN\left|\int f_E'(e)\,de\int_{\eta_d}^{\infty}\chi(\sigma)\,\Delta m_2^{(5)}(e+i\sigma)\,d\sigma\right|.
\tag{2.1.302}
\]
By (2.1.301) and (2.1.302), with $1-N^{-D_1}$ probability, we have the following decomposition:
\[
\Delta y(E) = \sum_{p=1}^{4}\Delta y^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.303}
\]
Similarly to the discussion of (2.1.298), (2.1.300) and (2.1.303), it is easy to check that with $1-N^{-D_1}$ probability, we have
\[
\int_I|\Delta x^{(p)}(E)|\,dE \le N^{-p/2+C\epsilon_0}, \qquad |\Delta y^{(p)}(E)| \le N^{-p/2+C\epsilon_0}, \qquad |\Delta\underline{y}^{(p)}(E)| \le N^{-p/2+C\epsilon_0},
\tag{2.1.304}
\]
where $p=1,2,3,4$ and $C>0$ is some constant. Furthermore, by (2.1.278), with $1-N^{-D_1}$ probability, we have
\[
\int_I|x(E)|\,dE \le N^{C\epsilon_0}.
\tag{2.1.305}
\]
Due to the similarity of (2.1.300) and (2.1.303), we denote $\hat{y}:=y+\underline{y}$, and then we have
\[
\Delta\hat{y} = \sum_{p=1}^{4}\Delta\hat{y}^{(p)}(E)+O(N^{-5/2+C\epsilon_0}).
\tag{2.1.306}
\]
By (2.1.304), (2.1.306) and Taylor expansion, we have
\[
q(\hat{y}^S) = q(\hat{y}^R)+q'(\hat{y}^R)\left(\sum_{p=1}^{4}\Delta\hat{y}^{(p)}(E)\right)+\frac{1}{2}q''(\hat{y}^R)\left(\sum_{p=1}^{3}\Delta\hat{y}^{(p)}(E)\right)^2
+\frac{1}{6}q^{(3)}(\hat{y}^R)\left(\sum_{p=1}^{2}\Delta\hat{y}^{(p)}(E)\right)^3+\frac{1}{24}q^{(4)}(\hat{y}^R)\left(\Delta\hat{y}^{(1)}(E)\right)^4+o(N^{-2}).
\tag{2.1.307}
\]
By (2.1.281), we have
\[
\theta\!\left[\int_I x^Sq(\hat{y}^S)\,dE\right]-\theta\!\left[\int_I x^Rq(\hat{y}^R)\,dE\right] = \sum_{s=1}^{4}\frac{1}{s!}\theta^{(s)}\!\left(\int_I x^Rq(\hat{y}^R)\,dE\right)\left[\int_I x^Sq(\hat{y}^S)\,dE-\int_I x^Rq(\hat{y}^R)\,dE\right]^s+o(N^{-2}).
\tag{2.1.308}
\]
Inserting $x^S=x^R+\sum_{p=1}^{4}\Delta x^{(p)}$ and (2.1.307) into (2.1.308), using the partial expectation argument, by (2.1.281), (2.1.304) and (2.1.305), we find that there exists a random variable $B$ that depends on the randomness only through $O$ and the first four moments of $X^G_{i\mu_1}$, such that
\[
\mathbb{E}\theta\!\left[\int_I x^Sq(\hat{y}^S)\,dE\right]-\mathbb{E}\theta\!\left[\int_I x^Rq(\hat{y}^R)\,dE\right] = B+o(N^{-2}).
\tag{2.1.309}
\]
Hence, combined with (2.1.297), we prove (2.1.293), which implies (2.1.289). This finishes our proof.
2.2 Eigen-structure of the model of matrix denoising
Consider that we observe a noisy $M\times N$ data matrix $\tilde{S}$, where
\[
\tilde{S} = X+S.
\tag{2.2.1}
\]
In model (2.2.1), the deterministic matrix $S$ is known as the signal matrix and $X$ as the noise matrix. In the classic framework, under the assumption that $M$ is much smaller than $N$, the truncated singular value decomposition (TSVD) is the default technique; see for example [63]. This method recovers $S$ with an estimator $\hat{S}=\sum_{i=1}^{m}\mu_iu_iv_i^*$, where $m<\min\{M,N\}$ denotes the truncation level and $\mu_i,u_i,v_i$, $i=1,2,\cdots,m$, are the singular values, left singular vectors and right singular vectors of $\tilde{S}$, respectively. We usually need to provide a threshold $\gamma$ to choose $m$, keeping only the singular values with $\mu_i\ge\gamma$. Two popular methods are soft thresholding [44] and hard thresholding [61].
In recent years, the advance of technology has led to the observation of massive-scale data, where the dimension of the variables is comparable to the length of the observation. In this situation, the TSVD loses its validity. To address this problem, in the present section, we consider the matrix denoising problem (2.2.1) assuming that $M$ is comparable to $N$, and estimate $S$ in the following two regimes:

Regime (1). $S$ is of low rank and we have prior information that its singular vectors are sparse;

Regime (2). $S$ is of low rank and we have no prior information on the singular vectors.

In regime (1), $S$ is called a simultaneously low-rank and sparse matrix. This type of matrix has been heavily used in biology; a typical example is from the study of gene expression data [90]. Yang, Ma and Buja [112] also consider such a problem, but from a quite different perspective: they do not take the local behavior of singular values and vectors into consideration, and instead use an adaptive thresholding method to recover $S$ in (2.2.1). In regime (2), it is almost hopeless to completely recover $S$, as we have little information; we are interested in the best we can do in this case. A natural (and probably necessary) assumption is rotation invariance [24], as the only information we know about the singular vectors is orthonormality. It is notable that, in this case, our result coincides with the results proposed by Gavish and Donoho [62], who consider the estimator from another perspective and restrict the estimator to be conservative.
Our methodologies rely on investigating the local properties of singular values and vectors. We study the convergent limits and rates for the singular values and vectors of high dimensional rectangular matrices, assuming $M$ is comparable to $N$. In this section, we consider the problem (2.2.1) and assume that $X=(x_{ij})$ is an $M\times N$ matrix with i.i.d. centered entries $x_{ij}=N^{-1/2}q_{ij}$, where $q_{ij}$ has unit variance and there exists a constant $C$ such that, for some $p\in\mathbb{N}$ large enough, $q_{ij}$ satisfies the moment condition
\[
\mathbb{E}|q_{ij}|^p \le C.
\tag{2.2.2}
\]
We denote the SVD of $S$ as $S=UDV^*$, where $D=\operatorname{diag}\{d_1,\cdots,d_r\}$, $U=(u_1,\cdots,u_r)$, $V=(v_1,\cdots,v_r)$, and where $u_i\in\mathbb{R}^M$, $v_i\in\mathbb{R}^N$ are orthonormal vectors and $r$ is a fixed constant. We also assume $d_1>d_2>\cdots>d_r>0$. Then (2.2.1) can be written as
\[
\tilde{S} = X+UDV^*.
\tag{2.2.3}
\]
Throughout, we are interested in the following setup: for some large constant $C>0$, we have
\[
c\equiv c_N := \frac{N}{M}, \qquad C^{-1}\le c\le C.
\tag{2.2.4}
\]
It is well known that for the noise matrix $X$, the spectrum of $XX^*$ satisfies the celebrated Marchenko-Pastur (MP) law and the largest eigenvalue satisfies the Tracy-Widom (TW) distribution. Specifically, denoting by $\lambda_i:=\lambda_i(XX^*)$, $i=1,2,\cdots,K$, $K=\min\{M,N\}$, the eigenvalues of $XX^*$ in decreasing order, we have that
\[
\lambda_1 = \lambda_++O(N^{-2/3}), \qquad \lambda_+ = (1+c^{-1/2})^2,
\tag{2.2.5}
\]
holds with high probability. Furthermore, denoting by $\xi_i,\zeta_i$ the singular vectors of $X$, for some large constant $C>0$, with high probability, we have [35]
\[
\max_k|\xi_i(k)|^2+|\zeta_i(k)|^2 = O(N^{-1}), \qquad i\le C.
\]
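The edge behavior in (2.2.5) is easy to observe in simulation (a Monte-Carlo sketch with arbitrary dimensions, for illustration only):

```python
import numpy as np

# The top eigenvalue of XX^* for X with i.i.d. N(0, 1/N) entries concentrates
# near lambda_+ = (1 + c^{-1/2})^2 with c = N/M, up to O(N^{-2/3}) fluctuations.
rng = np.random.default_rng(5)
M, N = 300, 900
c = N / M
X = rng.standard_normal((M, N)) / np.sqrt(N)
lam1 = np.linalg.eigvalsh(X @ X.T).max()
lam_plus = (1 + c ** (-0.5)) ** 2
print(lam1, lam_plus)  # lam1 = lam_plus + O(N^{-2/3})
```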
To sketch the behavior of $\tilde{S}$, we consider the case $r=1$ in (2.2.3). Assuming that the distribution of the entries of $X$ is bi-unitarily invariant, Benaych-Georges and Nadakuditi [13] established the convergent limits using free probability theory. Denoting by $\mu_i:=\lambda_i(\tilde{S}\tilde{S}^*)$, $i=1,2,\cdots,K$, the eigenvalues of $\tilde{S}\tilde{S}^*$, they proved that when $d>c^{-1/4}$, $\mu_1$ detaches from the spectrum of the MP law and becomes an outlier, while when $d<c^{-1/4}$, $\mu_1$ converges to $\lambda_+$ and sticks to the spectrum of the MP law. For the singular vectors, denote by $\tilde{u}_i,\tilde{v}_i$ the left and right singular vectors of $\tilde{S}$, $i=1,2,\cdots,K$. They proved that when $d>c^{-1/4}$, $\tilde{u}_1,\tilde{v}_1$ concentrate on cones with axes parallel to $u_1,v_1$ respectively, and the apertures of the cones converge to deterministic limits; when $d<c^{-1/4}$, $\tilde{u}_1,\tilde{v}_1$ are asymptotically perpendicular to $u_1,v_1$ respectively.
Our computation and proof rely on the isotropic local MP law [16, 72] and the
anisotropic local law [74]. These results say that the eigenvalue distribution of the sample
covariance matrix XX∗ is close to the MP law, down to the spectral scales containing
slightly more than one eigenvalue. These local laws are formulated using the Green
functions,
G1(z) := (XX∗ − z)−1, G2(z) := (X∗X − z)−1, z = E + iη ∈ C+. (2.2.6)
To illustrate our results and ideas, we give an overview of the local behavior of the singular values and vectors of S̃ and how they can be used to recover the signal matrix S in (2.2.1). As we have seen from [35, 40, 74, 110], the self-adjoint linearization technique is quite useful in dealing with rectangular matrices. Hence, in a first step, we denote

H̃ := [[0, z^{1/2} S̃], [z^{1/2} S̃^*, 0]] = [[0, z^{1/2} X], [z^{1/2} X^*, 0]] + [[0, z^{1/2} UDV^*], [z^{1/2} VDU^*, 0]] = H + 𝒰 𝒟 𝒰^*, (2.2.7)

where 𝒟, 𝒰 are defined as

𝒟 := [[0, z^{1/2} D], [z^{1/2} D, 0]], 𝒰 := [[U, 0], [0, V]]. (2.2.8)
(2.2.7) is a very convenient expression. On one hand, the eigenvalues of S̃S̃^* can be uniquely characterized by the eigenvalues of H̃. On the other hand, the Green functions of XX^* and X^*X are contained in that of H (see (2.2.48)).
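The eigenvalue correspondence behind the linearization can be seen in a small sketch (mine, not from the thesis; evaluated at z = 1 so that H is real symmetric): the nonzero eigenvalues of [[0, S], [S^*, 0]] come in pairs ±σ_i(S), so the spectrum of SS^* is encoded in that of the linearized matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 8
S = rng.standard_normal((M, N))    # any M x N matrix stands in for the data matrix

# self-adjoint linearization evaluated at z = 1:
# H = [[0, S], [S^*, 0]] has eigenvalues {+-sigma_i(S)} plus N - M zeros,
# so the eigenvalues of SS^* are exactly the squared positive eigenvalues of H
H = np.block([[np.zeros((M, M)), S],
              [S.T, np.zeros((N, N))]])

ev = np.linalg.eigvalsh(H)                          # ascending
sv = np.sort(np.linalg.svd(S, compute_uv=False))    # ascending singular values
top = ev[-M:]                                       # the M positive eigenvalues of H
```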
Next we give a heuristic description of our results. We always use λ_1 ≥ · · · ≥ λ_K, K = min{M, N}, for the non-trivial eigenvalues of XX^* and denote µ_1 ≥ · · · ≥ µ_K as the eigenvalues of S̃S̃^*. We also denote ξ_i, ζ_i, i = 1, · · · , K, as the singular vectors of X and ũ_i, ṽ_i as those of S̃. And we denote G̃(z) as the Green function of H̃ and G(z) as that of H. Consider r = 1 in (2.2.3); by a standard perturbation discussion (see Lemma 2.2.17), we find that µ_1 satisfies the equation det(𝒰^* G(µ_1) 𝒰 + 𝒟^{-1}) = 0. Using the anisotropic local law in [74], we find that (see Lemma 2.3.27) G has a deterministic limit Π when N is large enough. Heuristically, the convergent limit of µ_1 is determined by the equation det(𝒰^* Π(z) 𝒰 + 𝒟^{-1}) = 0. An elementary calculation shows that, when d > c^{-1/4}, µ_1 → p(d), where p(d) is defined as

p(d) = (d^2 + 1)(d^2 + c^{-1}) / d^2. (2.2.9)
When d > c^{-1/4}, the largest eigenvalue µ_1 detaches from the bulk and becomes an outlier around its classical location p(d). We expect this transition to happen on the scale N^{-1/3}. This can be understood in the following way: increasing d beyond the critical value c^{-1/4}, we expect µ_1 to become an outlier, whose location p(d) lies at a distance greater than O(N^{-2/3}) from λ_+. By the mean value theorem, the phase transition will take place on the scale where

|d − c^{-1/4}| ≥ O(N^{-1/3}). (2.2.10)

When (2.2.10) holds, we also prove that

µ_1 = p(d) + O(N^{-1/2}(d − c^{-1/4})^{1/2}). (2.2.11)

Below this scale, we expect the spectrum of S̃S̃^* to stick to that of XX^*. In particular, the largest eigenvalue µ_1 still has the Tracy-Widom distribution on the scale N^{-2/3}, which reads as

µ_1 = λ_+ + O(N^{-2/3}). (2.2.12)
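The two regimes (2.2.11) and (2.2.12) can be observed directly in a rank-one simulation (a sketch of mine; the dimensions, seed, and signal directions are arbitrary choices): above the threshold c^{-1/4} the largest eigenvalue sits near p(d), below it the eigenvalue sticks to λ_+.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 300, 600
c = N / M
X = rng.standard_normal((M, N)) / np.sqrt(N)

u = np.zeros(M); u[0] = 1.0        # arbitrary unit signal directions
v = np.zeros(N); v[0] = 1.0
lam_plus = (1 + c ** -0.5) ** 2

def p(d):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def mu1(d):
    # largest eigenvalue of the rank-one deformed model X + d u v^*
    St = X + d * np.outer(u, v)
    return np.linalg.eigvalsh(St @ St.T)[-1]

mu_super = mu1(3.0)    # d well above the critical value c^{-1/4} ~ 0.84
mu_sub = mu1(0.3)      # d below the critical value
```

With these parameters, `mu_super` is close to p(3.0) while `mu_sub` is close to λ_+, matching the dichotomy described above.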
For the singular vectors, when d > c^{-1/4}, we have ⟨ũ_1, u_1⟩^2 → a_1(d) and ⟨ṽ_1, v_1⟩^2 → a_2(d), where a_1(d), a_2(d) are deterministic functions of d defined in (2.2.27). For the local behavior, we will use an integral representation of Green functions (see (2.2.83)). However, when r > 1, if d_i ≈ d_j, i ≠ j, we expect ũ_i (ṽ_i) and ũ_j (ṽ_j) to lie in the same eigenspace, and then we cannot distinguish the singular vectors. Therefore, in this paper, we assume that for i ≠ j there exists some ε_0 > 0 such that d_i, d_j satisfy the following condition

|p(d_i) − p(d_j)| ≥ N^{-1/2+ε_0}(d_i − c^{-1/4})^{1/2}. (2.2.13)

(2.2.13) is referred to as the non-overlapping condition [17, 72]; it ensures that the eigenspaces corresponding to different d_i, i = 1, · · · , r, can be well separated. This can be understood in the following way: when d_i, d_j > c^{-1/4}, the corresponding eigenvalues µ_i, µ_j of S̃S̃^* converge to p(d_i) and p(d_j) respectively. Hence, (2.2.13) ensures that the singular vectors can be distinguished individually. Under the assumption that the d_i's are well-separated and satisfy (2.2.10), we prove that

⟨ũ_1, u_1⟩^2 = a_1(d) + O(N^{-1/2}), ⟨ṽ_1, v_1⟩^2 = a_2(d) + O(N^{-1/2}). (2.2.14)

Below the scale of (2.2.10), we prove that

⟨ũ_1, u_1⟩^2 = O(N^{-1}), ⟨ṽ_1, v_1⟩^2 = O(N^{-1}). (2.2.15)
Armed with (2.2.11), (2.2.12), (2.2.14) and (2.2.15), we can address the matrix denoising problem (2.2.3) under two different regimes. In the first regime, we assume a sparse structure of the singular vectors: when d > c^{-1/4}, we expect ũ_1, ṽ_1 to be sparse as well, so that S̃ inherits a sparse structure. Therefore, by suitably choosing a submatrix of S̃ and doing SVD for the submatrix, we can get an estimator for the singular vectors. Our novelty is to truncate singular values and vectors simultaneously. For the estimation of the singular values, we can invert (2.2.11) to get an estimator for d. For the singular vectors, based on (2.2.15), the truncation level should be much larger than N^{-1/2}, and we will use the K-means clustering algorithm to choose such a level. However, when d < c^{-1/4}, nothing can be estimated according to (2.2.12) and (2.2.15).

In the second regime, as we have no prior information whatsoever on the true eigenbasis of S, the only possibility is to use the eigenbasis of S̃. This is equivalent to the assumption of rotation invariance. We will propose a consistent rotation invariant estimator (RIE) Ξ(S̃), which satisfies the following condition,

Ω_1 Ξ(S̃) Ω_2 = Ξ(Ω_1 S̃ Ω_2), (2.2.16)

where Ω_1, Ω_2 are orthogonal (rotation) matrices in R^M, R^N respectively.
Sparse estimation In the present application, we study the denoising problem (2.2.1), where S is sparse in the sense that its nonzero entries are assumed to be confined to a block. We assume that u_i, v_i are sparse and introduce the following definition to describe the sparsity precisely.

Definition 2.2.1. A vector ν ∈ R^N is a sparse vector if there exists a subset N_* ⊂ {1, 2, · · · , N} with |N_*| = O(1) such that

|ν(i)| = O(1) for i ∈ N_*, and |ν(i)| = O(N^{-1/2}) otherwise.
Denote

q := argmin{1 ≤ i ≤ K : µ_i ≤ λ_+ + N^{-2/3+τ}}, (2.2.17)

where τ > 0 is a small constant and λ_+ is defined in (2.2.5). That is, q is the index of the first extremal non-outlier eigenvalue. As p(d) is increasing in d on (c^{-1/4}, ∞) with p(c^{-1/4}) = λ_+, we conclude that there exist q − 1 outliers and a phase transition happens after µ_q.

With the above notation, we provide the stepwise SVD Algorithm 1 to recover S in (2.2.1). As u_i, v_i are sparse, we need to find a submatrix of S̃ by a suitable truncation.
Algorithm 1 Stepwise SVD

1: Do SVD for S̃ = Σ_{i=1}^K √µ_i ũ_i ṽ_i^*, and initialize S̃_1 = S̃ = Σ_i t_i^1 u_i^1 (v_i^1)^*.
2: while 1 ≤ j < q do
3:   d̂_j = p^{-1}((t_1^j)^2), where p^{-1}(x) is the inverse of the function defined in (2.2.9).
4:   Use two thresholds α_u^j / √M, α_v^j / √N, and denote

       I_j := {1 ≤ k ≤ M : |u_1^j(k)| ≥ α_u^j / √M}, J_j := {1 ≤ k ≤ N : |v_1^j(k)| ≥ α_v^j / √N}. (2.2.18)

5:   Do SVD for the block matrix S̃_b = S̃_j[I_j, J_j] = Σ_i ρ_i u_i^{j,b} (v_i^{j,b})^*.
6:   Assume I_j = {k_1, · · · , k_m}; construct û_j by letting

       û_j(k) = u_1^{j,b}(ℓ) if k = k_ℓ ∈ I_j, and û_j(k) = 0 otherwise.

     Similarly, we construct v̂_j.
7:   Let S̃_{j+1} = S̃_j − d̂_j û_j v̂_j^* and do SVD for S̃_{j+1} = Σ_i t_i^{j+1} u_i^{j+1} (v_i^{j+1})^*.
8: end while
9: Denote Ŝ = Σ_{k=1}^{q−1} d̂_k û_k v̂_k^* as our estimator.
Algorithm 1 provides a way to recover S stepwise. We first estimate d_1, u_1, v_1 by d̂_1, û_1, v̂_1, then d_2, u_2, v_2 by analyzing S̃ − d̂_1 û_1 v̂_1^*. In each step, we only need to look at the largest singular value and its associated singular vectors. It is notable that we drop all the singular values of S̃ that lie below the level λ_+ + N^{-2/3+τ}; this shrinkage of the singular values can be written as

d̂_i = 1(µ_i > λ_+ + N^{-2/3+τ}) p^{-1}(µ_i). (2.2.19)
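The shrinkage rule (2.2.19) is easy to implement once p^{-1} is available in closed form: writing y = d^2, (2.2.9) becomes x = y + 1 + c^{-1} + c^{-1}/y, a quadratic in y. The sketch below (my own; function names and the value of τ are illustrative) inverts p on the outlier branch and applies the hard threshold.

```python
import numpy as np

def p(d, c):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def p_inv(x, c):
    # invert p on d > c^{-1/4}: with y = d^2, (2.2.9) reads
    # x = y + 1 + 1/c + 1/(c*y), a quadratic in y; take the larger root
    b = x - 1 - 1 / c
    y = (b + np.sqrt(b ** 2 - 4 / c)) / 2
    return np.sqrt(y)

def shrink(mu, c, N, tau=0.1):
    # hard shrinkage (2.2.19): invert p only for eigenvalues exceeding the
    # bulk edge lambda_+ by at least N^{-2/3+tau}; set the rest to zero
    lam_plus = (1 + c ** -0.5) ** 2
    mu = np.asarray(mu, dtype=float)
    d_hat = np.zeros_like(mu)
    keep = mu > lam_plus + N ** (-2 / 3 + tau)
    d_hat[keep] = p_inv(mu[keep], c)
    return d_hat
```

For instance, with c = 2 one has p(2) = 5.625, and `shrink([5.625, 2.0], 2.0, 400)` recovers d = 2 from the first eigenvalue while discarding the second, which lies below the threshold.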
Our methodology relies on truncating singular values and vectors simultaneously. As illustrated in (2.2.18), the thresholds α_u^j and α_v^j play the key role in recovering the sparse structure of the singular vectors. It will be proved in Section 2.2 that any threshold satisfying (2.2.18) works when N is sufficiently large. In the finite sample framework (when N is not very large), we employ the K-means algorithm to stabilize the recovery of the sparse structure of S. The reason is that the entries of the singular vectors ũ_i, ṽ_i can be well classified into two categories (see Lemma 2.2.10). Denote by C_u^j, C_v^j the index sets obtained from the K-means algorithm [60], where they satisfy

min_{k ∈ C_u^j} |u_1^j(k)| ≫ 1/√M, min_{k ∈ C_v^j} |v_1^j(k)| ≫ 1/√N. (2.2.20)
We now replace (2.2.18) with the following step:

• Do K-means clustering to partition the entries of u_1^j, v_1^j into two classes, and set

I_j := {1 ≤ k ≤ M : k ∈ C_u^j}, J_j := {1 ≤ k ≤ N : k ∈ C_v^j}, (2.2.21)

where C_u^j, C_v^j satisfy (2.2.20).
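The clustering step only needs a one-dimensional 2-means on entry magnitudes. A minimal self-contained stand-in (mine, not the thesis implementation, which uses a K-means package; the example vector is synthetic) is:

```python
import numpy as np

def support_by_2means(w, iters=50):
    """A minimal 1-D 2-means on entry magnitudes, standing in for the
    K-means truncation step: split the entries of a singular-vector
    estimate into a 'large' cluster (the recovered support) and a
    'small' O(N^{-1/2}) cluster; returns the indices of the large one."""
    a = np.abs(w)
    lo, hi = a.min(), a.max()              # initialize the two centers
    for _ in range(iters):
        big = np.abs(a - hi) < np.abs(a - lo)
        lo, hi = a[~big].mean(), a[big].mean()
    return np.flatnonzero(big)

rng = np.random.default_rng(3)
N = 200
v = rng.uniform(-1, 1, N) / np.sqrt(N)     # O(N^{-1/2}) background entries
v[[3, 17, 42]] = [0.7, -0.6, 0.5]          # a few O(1) entries: the support
idx = support_by_2means(v)
```

Because the two magnitude scales are well separated, the large cluster is exactly the planted support {3, 17, 42}, which is the dichotomy that Lemma 2.2.10 guarantees for the singular vectors.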
We use Table 2.1 to compare the results of three algorithms: our stepwise SVD (SWSVD), the sparse SVD (SSVD) proposed by [112], and the truncated SVD (TSVD). For the implementation of SSVD, we use the ssvd package in R, which is contributed by the first author of [112]. From Table 2.1, we find that our method outperforms both SSVD and TSVD in all cases. Furthermore, the standard deviation is small, which implies that our estimation is quite stable.
Rotation invariant estimation This subsection is devoted to recovering S in (2.2.1) assuming that no prior information about S is available. In this regime, we consider rotation invariant estimators (RIE) satisfying (2.2.16). We conclude that any RIE shares the same singular vectors as S̃. To construct the optimal estimator, we use the Frobenius norm as our loss function. Denoting Ŝ = Ξ(S̃), we have

||Ŝ − S||_2^2 = Tr (Ŝ − S)(Ŝ − S)^*. (2.2.22)
                     M = 300                        M = 500
        Sparsity  L2 error norm  Std     Sparsity  L2 error norm  Std
SWSVD   0.05      0.043          0.175   0.05      0.045          0.189
        0.1       0.614          0.178   0.1       0.6            0.16
        0.2       0.822          0.126   0.2       0.825          0.137
        0.45      1.1            0.114   0.45      1.09           0.09
SSVD    0.05      4.01           0.002   0.05      4.01           0.002
        0.1       4.01           0.004   0.1       4.02           0.002
        0.2       4.04           0.004   0.2       4.03           0.004
        0.45      4.06           0.005   0.45      4.08           0.004
TSVD    0.05      53.9           6.872   0.05      53.75          6.63
        0.1       53.72          6.63    0.1       53.38          6.71
        0.2       52.33          7.01    0.2       52.2           6.65
        0.45      51.043         2.49    0.45      52.4           4.3

Table 2.1: Comparison of the algorithms. We choose r = 2, c = 2, d_1 = 7, d_2 = 4 in (2.2.3). The noise matrix X is Gaussian. In the table, sparsity is defined as the ratio of the number of non-zero entries to the length of the vector, and we assume that u_i, v_i, i = 1, 2, have the same sparsity. The smallest error norm in each case is attained by SWSVD.
Therefore, the form of the RIE can be written as

Ŝ = argmin_{H ∈ M(Ũ, Ṽ)} ||H − S||_2, (2.2.23)

where M(Ũ, Ṽ) is the class of M×N matrices whose left singular vectors are Ũ and right singular vectors are Ṽ. Suppose Ŝ = Σ_{k=1}^K η_k ũ_k ṽ_k^*, and denote µ_{k1 k} := ⟨u_{k1}, ũ_k⟩, ν_{k1 k} := ⟨v_{k1}, ṽ_k⟩. Then, by an elementary computation, we find

||Ŝ − S||_2^2 = Σ_{k=1}^r (d_k^2 + η_k^2) − 2 Σ_{k=1}^r d_k η_k µ_{kk} ν_{kk} + Σ_{k=r+1}^K η_k^2
        − 2 Σ_{k1 ≠ k2}^r d_{k1} η_{k2} µ_{k1 k2} ν_{k1 k2} − 2 Σ_{k1=r+1}^K Σ_{k2=1}^r η_{k1} d_{k2} µ_{k2 k1} ν_{k2 k1}. (2.2.24)
Therefore, Ŝ is optimal if

η_k = ⟨ũ_k, S ṽ_k⟩ = Σ_{k1=1}^r d_{k1} µ_{k1 k} ν_{k1 k}, k = 1, · · · , K. (2.2.25)

In the present paper, we use the following estimator for η_k and will prove its consistency in Section 2.2. Recalling (2.2.17), the estimator is defined as

η̂_k = d̂_k a_1(d̂_k) a_2(d̂_k) for k ≤ q − 1, and η̂_k = 0 for k ≥ q, (2.2.26)

where d̂_k = p^{-1}(µ_k) and a_1(x), a_2(x) are defined as

a_1(x) = (x^4 − c^{-1}) / (x^2 (x^2 + c^{-1})), a_2(x) = (x^4 − c^{-1}) / (x^2 (x^2 + 1)). (2.2.27)
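Putting (2.2.26)-(2.2.27) together with the inversion of (2.2.9) gives a short, fully explicit estimator; the sketch below (mine; function names and the τ in the detection threshold are illustrative choices) also records the sanity check that a_1, a_2 → 1 as d → ∞, so that strong signals are essentially not shrunk.

```python
import numpy as np

def p_inv(x, c):
    # inverse of the outlier-location map (2.2.9) on d > c^{-1/4}
    b = x - 1 - 1 / c
    return np.sqrt((b + np.sqrt(b ** 2 - 4 / c)) / 2)

def a1(x, c):
    # limiting left singular-vector overlap (2.2.27)
    return (x ** 4 - 1 / c) / (x ** 2 * (x ** 2 + 1 / c))

def a2(x, c):
    # limiting right singular-vector overlap (2.2.27)
    return (x ** 4 - 1 / c) / (x ** 2 * (x ** 2 + 1))

def eta_hat(mu, c, N, tau=0.1):
    # RIE singular values (2.2.26): eta_k = d_k a1(d_k) a2(d_k) for the
    # detected outliers and 0 otherwise; since a1, a2 -> 1 as d -> infinity,
    # eta_hat -> d and strong signals are left almost untouched
    lam_plus = (1 + c ** -0.5) ** 2
    mu = np.asarray(mu, dtype=float)
    out = np.zeros_like(mu)
    keep = mu > lam_plus + N ** (-2 / 3 + tau)
    d = p_inv(mu[keep], c)
    out[keep] = d * a1(d, c) * a2(d, c)
    return out
```

For c = 2 and an outlier at µ = p(2) = 5.625, this returns η̂ = 2 · (15.5/18) · (15.5/20) ≈ 1.335, a nontrivial shrinkage of d = 2.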
Figure 2.1 shows two examples of the estimation of η_k; from the graphs, we find that our estimator η̂_k is quite accurate.

Figure 2.1: RIE. We choose r = 1 and M = 300 for (2.2.3). We estimate η_1 using the estimator (2.2.26) for c = 0.5, 2 with different values of d. The entries of X are Gaussian random variables and the singular vectors satisfy the exponential distribution with rate 1. [Two panels, c = 0.5 and c = 2: estimated versus true value of η_1 as a function of d.]
Figure 2.2 records the relative improvement in average loss (RIAL) compared to TSVD, where the RIAL is defined as

RIAL(N) = 1 − E||Ŝ − S||_2 / E||Ŝ_N − S||_2, (2.2.28)

and where Ŝ_N is the TSVD estimator and Ŝ the RIE. We conclude from the figure that our method provides better estimation compared to the TSVD. Similar results have been shown for the estimation of covariance matrices.

Figure 2.2: RIE compared to TSVD. We choose r = 1, d = 4, c = 2 in (2.2.1). X is a random Gaussian matrix and the entries of the singular vectors satisfy the exponential distribution with rate 1. We perform 1000 Monte-Carlo simulations for each M to simulate the RIAL defined in (2.2.28). The red line indicates the increasing trend as M increases. [Single panel: RIAL versus dimension (50 to 300), increasing from about 0.75 towards 0.95.]
Remark 2.2.2. In [62], Donoho and Gavish obtain similar results from the perspective of optimal shrinkage. However, they need two more assumptions: (1) they drop the last two error terms in (2.2.24) by assuming they are small enough (see Lemma 4 in their paper); (2) their estimators are assumed to be conservative, in that they set η̂_k = 0 for k ≥ q. However, we find that the estimator defined in (2.2.26) can still be consistent even without these assumptions.
Main results In this section, we give the main results of this paper. Throughout the paper, we always use ε_1 for a small constant and D_1 for a large constant. Denote R := {1, 2, · · · , r} and define the subset O of R by

O := {i : d_i ≥ c^{-1/4} + N^{-1/3+ε_0}}, where ε_0 > ε_1 is a small constant, (2.2.29)

and

k_+ := |O|. (2.2.30)
Remark 2.2.3. Our results can be extended to a more general domain by setting O′ := {i : d_i ≥ c^{-1/4} + N^{-1/3}}. The proofs still hold true with some minor changes, except that we need to discuss the case when d_i ∈ (c^{-1/4} + N^{-1/3}, c^{-1/4} + N^{-1/3+ε_0}). We will not pursue this generalization.
For any subset A ⊂ O, we define the projections onto the left and right singular subspaces of S̃ by

P_l := Σ_{i∈A} ũ_i ũ_i^*, P_r := Σ_{j∈A} ṽ_j ṽ_j^*. (2.2.31)

As discussed in (2.2.13), we need the following non-overlapping condition, which was first introduced in [17].

Definition 2.2.4. For i = 1, 2, · · · , M, the non-overlapping condition reads

ν_i(A) ≥ (d_i − c^{-1/4})^{-1/2} N^{-1/2+ε_0}, (2.2.32)

where ε_0 is defined in (2.2.29) and ν_i(A) is defined by

ν_i(A) := min_{j∉A} |d_i − d_j| if i ∈ A, and ν_i(A) := min_{j∈A} |d_i − d_j| if i ∉ A. (2.2.33)
With the above preparation, we state our main results on the singular values of S̃.

Theorem 2.2.5. For i = 1, 2, · · · , k_+, where k_+ is defined in (2.2.30), there exists some large constant C > 1 with Cε_1 < ε_0 such that, when N is large enough, with probability 1 − N^{-D_1} we have
|µi − p(di)| ≤ N−1/2+Cε0(di − c−1/4)1/2, (2.2.34)
where p(di) is defined in (2.2.9). Moreover, for j = k+ + 1, · · · , r, we have
|µj − λ+| ≤ N−2/3+Cε0 , (2.2.35)
where λ+ is defined in (2.2.5).
The above theorem gives the precise locations of the outlier singular values and of the extremal non-outlier singular values. The outliers lie around their classical locations p(d_i), and the non-outliers lie around λ_+. In fact, (2.2.35) can be extended to a more general framework: instead of considering λ_+, we can locate µ_j around the eigenvalues of XX^*, which is the phenomenon of eigenvalue sticking. The results on the singular vectors are given by the following theorem.
Theorem 2.2.6. For all i, j = 1, 2, · · · , r, there exists some constant C > 0 such that, under the assumption (2.3.20), with probability 1 − N^{-D_1}, when N is large enough, we have

|⟨u_i, P_l u_j⟩ − δ_ij 1(i ∈ A) a_1(d_i)| ≤ N^{ε_1} R(i, j, A, N), (2.2.36)

|⟨v_i, P_r v_j⟩ − δ_ij 1(i ∈ A) a_2(d_i)| ≤ N^{ε_1} R(i, j, A, N), (2.2.37)

where a_1(x), a_2(x) are defined in (2.2.27) and R(i, j, A, N) is defined as

R(i, j, A, N) := N^{-1/2} [ 1(i ∈ A, j ∈ A) / ((d_i − c^{-1/4})^{1/2} + (d_j − c^{-1/4})^{1/2}) + 1(i ∈ A, j ∉ A) (d_i − c^{-1/4})^{1/2} / |d_i − d_j| + 1(i ∉ A, j ∈ A) (d_j − c^{-1/4})^{1/2} / |d_i − d_j| ] + N^{-1} ( 1/ν_i + 1(i ∈ A)/|d_i − c^{-1/4}| ) ( 1/ν_j + 1(j ∈ A)/|d_j − c^{-1/4}| ).
Moreover, fix a small constant τ > 0. For k_+ + 1 ≤ j ≤ (1 − τ)K, denote κ_j^d := N^{-2/3}(j ∧ (K + 1 − j))^{2/3}. Then we have

⟨u_i, ũ_j⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)), i = 1, 2, · · · , r, (2.2.38)

and

⟨v_i, ṽ_j⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)), i = 1, 2, · · · , r. (2.2.39)

Furthermore, if c ≠ 1, then (2.2.38) and (2.2.39) hold for all j = k_+ + 1, · · · , M.
Remark 2.2.7. The assumption j ≤ (1 − τ)K ensures that µ_j ≥ δ for some constant δ > 0. When c ≠ 1, this is guaranteed, as we will see from Lemma 2.2.22, by µ_j ≥ (1 − c^{-1/2})^2/2. We need µ_j ≥ δ for the technical purpose of applying the local laws.
Next we give some examples to illustrate our results. We assume that c ≠ 1.

Example 2.2.8. (1) Consider the right singular vectors and let A = {i}. We have

|⟨ṽ_i, v_i⟩^2 − a_2(d_i)| ≤ N^{ε_1} [ 1/(N^{1/2}(d_i − c^{-1/4})^{1/2}) + 1/(N ν_i^2 (d_i − c^{-1/4})^2) ].

This implies that the cone concentration of the singular vector holds if i ∈ O and the non-overlapping condition (2.3.20) holds. Furthermore, if d_i is well-separated from both the critical point c^{-1/4} and the other outliers, the error bound is of order 1/√N.

(2) Let A = {i} and take 1 ≤ j ≠ i ≤ r. We have

⟨ṽ_i, v_j⟩^2 ≤ N^{ε_1} / (N(d_i − d_j)^2).

Hence, if |d_i − d_j| = O(1), then ṽ_i is completely delocalized in any direction orthogonal to v_i.

(3) If i ∈ O, j ∉ O, then we have

⟨ṽ_j, v_i⟩^2 ≤ N^{Cε_0} / (N((d_i − c^{-1/4})^2 + κ_j^d)).

Hence, when |d_i − c^{-1/4}| = O(1) or κ_j^d = O(1), ṽ_j is completely delocalized in the direction of v_i. The first case reads as: µ_i is an outlier; the second case as: µ_j is in the bulk of the spectrum of S̃S̃^*.
We now conclude with the consistency of our estimators.

Theorem 2.2.9. For the matrix denoising model (2.2.3), we have:

(1) In Regime (1), with the prior information that u_i, v_i are sparse in the sense of Definition 2.2.1 and that d_i > c^{-1/4} + δ, |d_i − d_j| ≥ δ, i ≠ j, i, j = 1, 2, · · · , k_+, where δ > 0 is a small constant, there exists some C > 0 such that, with probability 1 − o(1), the estimator Ŝ obtained from Algorithm 1 satisfies

||Ŝ − S||_2 ≤ N^{-1/2+Cε_0} + ( Σ_{i=k_++1}^r d_i^2 )^{1/2}.

(2) In Regime (2), recall the rotation invariant estimator defined in (2.2.26). There exist some large constant C > 0 and small constant τ > 0 such that, with probability 1 − N^{-D_1}, we have η̂_k → η_k, k = 1, 2, · · · , K. Furthermore, for 1 ≤ k ≤ (1 − τ)K, we have

|η̂_k − η_k| ≤ 1(k ≤ k_+) N^{-1/2+Cε_0} + 1(k > k_+) N^{-1+Cε_0}. (2.2.40)

Moreover, when c ≠ 1, (2.2.40) holds for all k = 1, · · · , K.
Proof. We first prove (1). Denote S_1 = Σ_{i=1}^{k_+} d_i u_i v_i^* and S_2 = Σ_{i=k_++1}^r d_i u_i v_i^*. We have

||Ŝ − S||_2 ≤ ||Ŝ − S_1||_2 + ( Σ_{i=k_++1}^r d_i^2 )^{1/2}.

It is easy to check that

||Ŝ − S_1||_2^2 ≤ 2 Σ_{i=1}^{k_+} (d̂_i − d_i)^2 + 2 Tr(RR^*), (2.2.41)

where R is defined as R := Σ_{i=1}^{k_+} d_i û_i v̂_i^* − Σ_{i=1}^{k_+} d_i u_i v_i^*. The first term on the right-hand side of (2.2.41) is bounded by N^{-1+Cε_0} using (2.2.34). For the second term, we only need to control Tr((v̂_i − v_i)(v̂_i − v_i)^*) and Tr((û_i − u_i)(û_i − u_i)^*) by the Cauchy-Schwarz inequality. Due to similarity, we only give the proof for the right singular vectors.

Under the sparsity assumption, the non-zero entries of S are confined to a block matrix S_b of some fixed dimension m × n. Denote S̃_b := S_b + X_b. If our algorithm correctly chooses the positions of the non-zero entries of u_i, v_i (i.e. S_b) with probability 1 − o(1), we can conclude our proof using the fact (see [99, Lemma 4.3])

Ṽ_b = V_b + O(||X_b^* X_b + S_b^* X_b + X_b^* S_b||_2),

where Ṽ_b, V_b are the right singular vectors of S̃_b, S_b respectively. Therefore, under the assumption that x_ij has variance 1/N, we have that with probability 1 − o(1), Ṽ_b = V_b + O(N^{-1/2+Cε_0}). This concludes our proof.

It remains to show that (2.2.18) can correctly find the positions of the non-zero entries (i.e. S_b) with probability 1 − o(1), which is summarized in the following lemma.
Lemma 2.2.10. For i = 1, 2, · · · , k_+, denote J_i as the index set of the non-zero entries of v_i. For some constant C > 0, there exists some δ ∈ (Cε_0, 1/2) such that, with probability 1 − o(1), we have

|ṽ_i(k)| ≥ N^{-1/2+δ}, k ∈ J_i; |ṽ_i(k)| ≤ N^{-1/2+Cε_0}, k ∈ J_i^c ∩ {1, · · · , N}.

By Lemma 2.2.10, with probability 1 − o(1) we have max_{k1 ∉ J_i} |ṽ_i(k_1)| ≪ min_{k2 ∈ J_i} |ṽ_i(k_2)|, which implies that Algorithm 1 can correctly recover the sparse structure of the singular vectors. Next we prove (2). The consistency of η̂_k is an immediate consequence of [13, Theorem 2.9]. For the convergence rate, recall (2.2.25); we have

η_k = Σ_{k1=1}^{k_+} d_{k1} µ_{k1 k} ν_{k1 k} + Σ_{k1=k_++1}^r d_{k1} µ_{k1 k} ν_{k1 k}.

Hence, the proof follows from (2.2.34), (2.2.36), (2.2.37), (2.2.38) and (2.2.39).
Notations and basic tools. In this part, we introduce some notation and tools which will be used in this paper. Recall that the empirical spectral distribution (ESD) of an n × n symmetric matrix H is defined as

F_H^{(n)}(λ) := (1/n) Σ_{i=1}^n 1{λ_i(H) ≤ λ}.

We define the typical domain for z = E + iη by

D(τ) ≡ D(τ, N) := {z ∈ C_+ : τ ≤ E ≤ τ^{-1}, N^{-1+τ} ≤ η ≤ τ^{-1}}, (2.2.42)

where τ > 0 is a small constant. Recalling (2.2.4), we assume that τ < c_N < τ^{-1}.

Definition 2.2.11. The Stieltjes transform of the ESD of X^*X is given by

m_2(z) ≡ m_2^{(N)}(z) := ∫ (x − z)^{-1} dF_{X^*X}^{(N)}(x) = (1/N) Σ_{i=1}^N (G_2)_{ii}(z) = (1/N) Tr G_2(z),
where G_2(z) is defined in (2.2.6). Similarly, we define m_1(z) := M^{-1} Tr G_1(z). Denote by m_1c(z) := lim_{N→∞} m_1(z) and m_2c(z) := lim_{N→∞} m_2(z) the Stieltjes transforms of the limiting spectral distributions. Using the identity m_1(z) = −(1 − c_N)/z + c_N m_2(z), we have

m_1c(z) = (c − 1)/z + c m_2c(z). (2.2.43)
Definition 2.2.12. For X satisfying (2.2.2), under the assumption (2.2.4), the ESD of XX^* converges weakly to the Marchenko-Pastur (MP) law as N → ∞:

µ(A) = (1 − c) 1{0 ∈ A} + ν(A) if c < 1, and µ(A) = ν(A) if c ≥ 1,

where dν(x) = ρ_1c(x) dx and

ρ_1c(x) dx = (c/2π) (√((λ_+ − x)(x − λ_−)) / x) dx, λ_± = (1 ± c^{-1/2})^2. (2.2.44)

The Stieltjes transform of the MP law, m_1c(z), has the closed-form expression

m_1c(z) = (1 − c^{-1} − z + i√((λ_+ − z)(z − λ_−))) / (2zc^{-1}). (2.2.45)
Remark 2.2.13. From (2.2.43), we have that m_2(z) converges to m_2c(z) as N → ∞, where

m_2c(z) = (c^{-1} − 1)/z + c^{-1} m_1c(z) = (c^{-1} − 1 − z + i√((λ_+ − z)(z − λ_−))) / (2z). (2.2.46)

It is notable that

−z^{-1}(1 + m_2c(z))^{-1} = m_1c(z). (2.2.47)
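The closed forms (2.2.45)-(2.2.46) and the identity (2.2.47) can be verified numerically against the empirical Stieltjes transform (a sketch of mine; the dimensions, the spectral parameter, and the tolerance are arbitrary choices, and the square-root branch is fixed by requiring a positive imaginary part in the upper half plane):

```python
import numpy as np

def m_pair(z, c):
    # closed forms (2.2.45)-(2.2.46); pick the square-root branch with
    # Im m1c(z) > 0 for z in the upper half plane
    lam_p = (1 + c ** -0.5) ** 2
    lam_m = (1 - c ** -0.5) ** 2
    s = 1j * np.sqrt((lam_p - z) * (z - lam_m))
    for sign in (1.0, -1.0):
        m1 = (1 - 1 / c - z + sign * s) / (2 * z / c)
        if m1.imag > 0:
            return m1, (1 / c - 1 - z + sign * s) / (2 * z)

rng = np.random.default_rng(4)
M, N = 500, 1000
c, z = N / M, 1.5 + 0.1j
X = rng.standard_normal((M, N)) / np.sqrt(N)

m1c, m2c = m_pair(z, c)
m1_emp = np.trace(np.linalg.inv(X @ X.T - z * np.eye(M))) / M   # M^{-1} Tr G1(z)
```

The empirical transform `m1_emp` agrees with `m1c` up to an error of order (Nη)^{-1}, and (2.2.47) holds exactly for the chosen branch.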
Recall (2.2.7) and G(z) = (H − z)^{-1}. By Schur's complement [74], it is easy to check that

G(z) = [[G_1(z), z^{-1/2} G_1(z) X], [z^{-1/2} X^* G_1(z), G_2(z)]], (2.2.48)

for G_{1,2} defined in (2.2.6). Denote the index sets I_1 := {1, . . . , M}, I_2 := {M + 1, . . . , M + N}, I := I_1 ∪ I_2. Then we have

m_1(z) = (1/M) Σ_{i ∈ I_1} G_ii, m_2(z) = (1/N) Σ_{µ ∈ I_2} G_µµ.
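The block structure (2.2.48) is a finite-dimensional identity and can be checked directly (a small sketch of mine; dimensions and the spectral parameter are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 4, 6
X = rng.standard_normal((M, N)) / np.sqrt(N)
z = 0.5 + 0.3j
sz = np.sqrt(z)

# linearized matrix H(z) = [[0, z^{1/2} X], [z^{1/2} X^*, 0]]
H = np.block([[np.zeros((M, M)), sz * X],
              [sz * X.T, np.zeros((N, N))]])
G = np.linalg.inv(H - z * np.eye(M + N))       # Green function of H

G1 = np.linalg.inv(X @ X.T - z * np.eye(M))    # (XX^* - z)^{-1}
G2 = np.linalg.inv(X.T @ X - z * np.eye(N))    # (X^*X - z)^{-1}
```

The top-left block of `G` is `G1`, the bottom-right block is `G2`, and the off-diagonal block is z^{-1/2} G_1 X, exactly as in (2.2.48).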
Similarly, we denote G̃(z) = (H̃ − z)^{-1}, where H̃ is defined in (2.2.7). Next we introduce the spectral decomposition of G̃(z). By (2.2.48), we have

G̃(z) = Σ_{k=1}^K (µ_k − z)^{-1} [[ũ_k ũ_k^*, z^{-1/2} √µ_k ũ_k ṽ_k^*], [z^{-1/2} √µ_k ṽ_k ũ_k^*, ṽ_k ṽ_k^*]]. (2.2.49)
As we have seen in (2.2.9), the function p(d) plays a key role in describing the convergent limits of the outlier singular values of S̃. An elementary computation yields that p(d) attains its global minimum at d = c^{-1/4} with p(c^{-1/4}) = λ_+, and

p′(x) ∼ (x − c^{-1/4}). (2.2.50)

To precisely locate the outlier singular values of S̃, we need to analyze

T^s(x) := Π_{i=1}^s (x m_1c(x) m_2c(x) − d_i^{-2}). (2.2.51)

By (2.2.45) and (2.2.46), when x ≥ λ_+, we have

x m_1c(x) m_2c(x) = (x − (1 + c^{-1}) − √((x + c^{-1} − 1)^2 − 4c^{-1}x)) / (2c^{-1}). (2.2.52)
Next we collect preliminary properties of T^s(x) in the following lemma.

Lemma 2.2.14. Suppose d_1 > d_2 > · · · > d_s > c^{-1/4}. Then T^s(x) = 0 has exactly s solutions, given by p_i := p(d_i), i = 1, 2, · · · , s; that is,

T^s(p_i) = 0. (2.2.53)

Furthermore, denoting

T(x) := x m_1c(x) m_2c(x), (2.2.54)

T(x) is a strictly monotone decreasing function for x > λ_+.
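Lemma 2.2.14 can be checked numerically from the closed form (2.2.52) (my sketch; the test values of d and c are arbitrary): T(p(d)) = d^{-2} for every d > c^{-1/4}, T(λ_+) = c^{1/2} (consistent with (2.2.56) below), and T is decreasing beyond the edge.

```python
import numpy as np

def p(d, c):
    # classical outlier location (2.2.9)
    return (d ** 2 + 1) * (d ** 2 + 1 / c) / d ** 2

def T(x, c):
    # T(x) = x m1c(x) m2c(x) for real x >= lambda_+, formula (2.2.52)
    disc = (x + 1 / c - 1) ** 2 - 4 * x / c
    return (x - (1 + 1 / c) - np.sqrt(disc)) / (2 / c)
```

For example, with c = 2 and d = 2 one has p(d) = 5.625 and T(5.625) = 1/4 = d^{-2}.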
For z ∈ D(τ) defined in (2.2.42), denote

κ := |E − λ_+|. (2.2.55)

By (2.2.52), it is easy to check that

T(z) − c^{1/2} = (z − λ_+ − i√((λ_+ − z)(z − λ_−))) / (2c^{-1}). (2.2.56)

The following lemma summarizes the basic properties of m_2c(z) and T(z); the estimates follow from elementary calculations with (2.2.52) and (2.2.56).

Lemma 2.2.15. For any z ∈ D(τ) defined in (2.2.42), we have

|T(z)| ∼ |m_2c(z)| ∼ 1, |c^{1/2} − T(z)| ∼ |1 − m_2c^2(z)| ∼ √(κ + η),

and, for some small δ > 0,

Im T(z) ∼ Im m_2c(z) ∼ √(κ + η) if E ∈ [λ_+ − δ, λ_+], and ∼ η/√(κ + η) if E > λ_+,

as well as

|Re T(z) − c^{1/2}| ∼ η/√(κ + η) + κ if E ∈ [λ_+ − δ, λ_+], and ∼ √(κ + η) if E > λ_+. (2.2.57)
The next lemma provides a local estimate on the derivative of T(x) on the real axis.

Lemma 2.2.16. For d > c^{-1/4}, denote I_d := [x_−(d), x_+(d)], x_±(d) := p(d) ± N^{-1/2+ε_0}(d − c^{-1/4})^{1/2}, where ε_0 is defined in (2.2.29). Then for all x ∈ I_d, we have

T′(x) ∼ (d − c^{-1/4})^{-1}. (2.2.58)
The following perturbation identity plays a key role in our proof, as it naturally allows us to incorporate the Green functions into a deterministic equation.

Lemma 2.2.17. Recall (2.2.7). Assume µ ∈ R \ σ(H) and det 𝒟 ≠ 0. Then µ ∈ σ(H̃) if and only if

det(𝒰^* G(µ) 𝒰 + 𝒟^{-1}) = 0. (2.2.59)
The following lemma establishes the connection between the Green functions of H̃ and H defined in (2.2.7).

Lemma 2.2.18. For z ∈ C_+, we have

G̃(z) = G(z) − G(z) 𝒰 (𝒟^{-1} + 𝒰^* G(z) 𝒰)^{-1} 𝒰^* G(z), (2.2.60)

and

𝒰^* G̃(z) 𝒰 = 𝒟^{-1} − 𝒟^{-1} (𝒟^{-1} + 𝒰^* G(z) 𝒰)^{-1} 𝒟^{-1}. (2.2.61)
One of the key ingredients of our computation is the local laws. Denote

Ψ(z) := √(Im m_2c(z)/(Nη)) + 1/(Nη), Σ := [[z^{-1/2} I, 0], [0, I]], (2.2.62)

and let m(z) ≡ m_N(z) be the unique solution of the equation

f(m(z)) = z, Im m(z) ≥ 0, f(x) = −1/x + (1/c_N) · 1/(x + 1).

Recalling (2.2.48), the following lemma shows that G(z) converges to a deterministic matrix Π(z) with high probability.

Lemma 2.2.19. Fix τ > ε_1. Then for all z ∈ D(τ), with probability 1 − N^{-D_1}, for any deterministic unit vectors u, v ∈ R^{M+N}, we have

|⟨u, Σ^{-1}(G(z) − Π(z))Σ^{-1} v⟩| ≤ N^{ε_1} Ψ(z), |m_2(z) − m(z)| ≤ N^{ε_1}/(Nη), (2.2.63)

where Π(z) is defined as

Π(z) := [[−z^{-1}(1 + m(z))^{-1} I, 0], [0, m(z) I]]. (2.2.64)
It is notable that, in general, m(z) depends on N, and Lemma 2.2.15 also holds for m(z). However, in our computation we can replace m(z) with m_2c(z) due to the following local MP law.

Lemma 2.2.20. Fix τ > ε_1. Then for all z ∈ D(τ), with probability 1 − N^{-D_1}, we have

|m_2(z) − m_2c(z)| ≤ N^{ε_1} Ψ(z).
Beyond the support of the limiting spectrum of the MP law, we have stronger results all the way down to the real axis. More precisely, define the region

D(τ, ε_1) := {z ∈ C_+ : λ_+ + N^{-2/3+ε_1} ≤ E ≤ τ^{-1}, 0 < η ≤ τ^{-1}}, (2.2.65)

on which we have the following stronger control.

Lemma 2.2.21. For z ∈ D(τ, ε_1), with probability 1 − N^{-D_1}, we have

|⟨u, G_2(z) v⟩ − m_2c(z)⟨u, v⟩| ≤ N^{-1/2+ε_1}(κ + η)^{-1/4}

for all unit vectors u, v ∈ R^N. A similar result holds for G_1(z), m_1c(z). Furthermore, for any deterministic vectors u, v ∈ R^{M+N}, we have

|⟨u, Σ^{-1}(G(z) − Π(z))Σ^{-1} v⟩| ≤ N^{-1/2+ε_1}(κ + η)^{-1/4}. (2.2.66)
Denote the non-trivial classical eigenvalue locations γ_1 ≥ γ_2 ≥ · · · ≥ γ_K of XX^* by ∫_{γ_i}^∞ dρ_1c = i/N, where ρ_1c is defined in (2.2.44). A consequence of Lemma 2.3.27 is the rigidity of eigenvalues.

Lemma 2.2.22. Fix any small τ ∈ (0, 1). For 1 ≤ i ≤ (1 − τ)K, with probability 1 − N^{-D_1}, we have

|λ_i − γ_i| ≤ N^{-2/3+ε_1} (i ∧ (K + 1 − i))^{-1/3}.

Furthermore, if c ≠ 1, the above estimate holds for all i = 1, 2, · · · , K.

Using Lemma 2.2.22, we find that κ_j^d defined in (2.2.38) is a deterministic version of κ_j^µ := |µ_j − λ_+|.
Proofs of Theorems 2.2.5 and 2.2.6 In this part, we focus on the singular values of S̃ and prove Theorem 2.2.5. A key deviation from the existing proofs is that our matrix 𝒟 defined in (2.2.8) is not diagonal: to analyze (2.2.59), it no longer suffices to deal with the diagonal elements, and we need to control the whole matrix. We will make use of the following interlacing theorem for the singular values of rectangular matrices [104].

Lemma 2.2.23. For any M × N matrices A, B, denote σ_i(A) as the i-th largest singular value of A. Then we have

σ_{i+j−1}(A + B) ≤ σ_i(A) + σ_j(B), 1 ≤ i, j, i + j − 1 ≤ K.
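This Weyl-type inequality is simple to verify on random instances (a sketch of mine; dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 7, 9
A = rng.standard_normal((M, N))
B = rng.standard_normal((M, N))

sA = np.linalg.svd(A, compute_uv=False)        # descending order
sB = np.linalg.svd(B, compute_uv=False)
sAB = np.linalg.svd(A + B, compute_uv=False)

K = min(M, N)
# Weyl-type inequality: sigma_{i+j-1}(A+B) <= sigma_i(A) + sigma_j(B)
holds = all(sAB[i + j - 2] <= sA[i - 1] + sB[j - 1] + 1e-10
            for i in range(1, K + 1) for j in range(1, K + 1)
            if i + j - 1 <= K)
```

Applied with B of rank r, the inequality shows that adding a finite-rank perturbation can push at most r singular values of A out of position, which is how it is used below.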
The proof relies on two main steps: (i) for a fixed configuration independent of N, establish two permissible regions, Γ(d) consisting of k_+ components and I_0, such that the outliers of S̃S̃^* lie in Γ(d), each component of Γ(d) contains precisely one eigenvalue, and the r − k_+ non-outliers lie in I_0; (ii) a continuity argument extending the result of (i) to arbitrary N-dependent D.

The following 2r × 2r matrix plays the key role in our analysis:

M_r(x) := 𝒰^* G(x) 𝒰 + 𝒟^{-1}. (2.2.67)

By Lemma 2.2.17, x ∈ σ(S̃S̃^*) if and only if det M_r(x) = 0. Using Lemmas 2.2.20 and 2.2.21, we find that x^{-r} T^r(x) ≈ det M_r(x), where T^r(x) is defined in (2.2.51). As T^r(x) behaves differently in Γ(d) and I_0, we will use different strategies to prove (2.2.34) and (2.2.35).
Proof of Theorem 2.2.5. Denote k_0 := r − k_+ and write

d = (d_1, · · · , d_r) = (d^0, d^+), d^σ = (d^σ_1, · · · , d^σ_{k_σ}), σ = 0, +,

where we adopt the convention

d^0_{k_0} ≤ · · · ≤ d^0_1 ≤ c^{-1/4} < d^+_{k_+} ≤ · · · ≤ d^+_1, k_0 + k_+ = r.

Next we define the sets

D^+(ε_0) := {d^+ : c^{-1/4} + N^{-1/3+ε_0} ≤ d^+_i ≤ τ^{-1}, i = 1, · · · , k_+}, (2.2.68)

D^0(ε_0) := {d^0 : 0 < d^0_i < c^{-1/4} + N^{-1/3+ε_0}, i = 1, · · · , k_0}, (2.2.69)

and the set of allowed d's, D(ε_0) := {(d^0, d^+) : d^σ ∈ D^σ(ε_0), σ = +, 0}.
Denote the following sequence of intervals:

I^+_i(d) := [p(d^+_i) − N^{-1/2+ε_3}(d^+_i − c^{-1/4})^{1/2}, p(d^+_i) + N^{-1/2+ε_3}(d^+_i − c^{-1/4})^{1/2}], (2.2.70)

where ε_3 satisfies the following condition:

Cε_1 < ε_3 < (1/4)ε_0, C > 2 is some large constant. (2.2.71)

For d ∈ D(ε_0), we denote Γ(d) := ∪_{i=1}^{k_+} I^+_i(d) and I_0 := [λ_+ − N^{-2/3+C′ε_0}, λ_+ + N^{-2/3+C′ε_0}], where C′ satisfies 2 < C′ < 4.
As a first step, we show that Γ(d) is a permissible region which keeps track of the outlier eigenvalues of S̃S̃^*, and that the remaining eigenvalues corresponding to D^0(ε_0) lie in I_0. In this step we fix a configuration d(0) ≡ d that is independent of N.

Lemma 2.2.24. For any d ∈ D(ε_0), with probability 1 − N^{-D_1}, we have

σ_+(S̃S̃^*) ⊂ Γ(d), (2.2.72)

where σ_+(S̃S̃^*) is the set of the outlier eigenvalues of S̃S̃^* associated with D^+(ε_0). Moreover, each interval I^+_i(d) contains precisely one eigenvalue of S̃S̃^*, i = 1, 2, · · · , k_+. Furthermore, we have

σ_o(S̃S̃^*) ⊂ I_0, (2.2.73)

where σ_o(S̃S̃^*) is the set of the non-outlier eigenvalues corresponding to D^0(ε_0).
Proof. First, it is easy to check that Γ(d) ∩ I_0 = ∅ using (2.2.50) and the fact that C′ > 2. Denote S_b := p(d^+_{k_+}) − N^{-1/2+ε_3}(d^+_{k_+} − c^{-1/4})^{1/2}. In order to prove (2.2.72), we first consider the case x > S_b. It is notable that x ∉ σ(XX^*) by Lemma 2.2.22, (2.2.50) and (2.2.71). Recalling (2.2.64) and (2.2.67), using the fact that r is bounded and Lemma 2.2.21, with probability 1 − N^{-D_1} we have

M_r(x) = 𝒰^* Π(x) 𝒰 + 𝒟^{-1} + O(N^{-1/2+ε_1} κ^{-1/4}). (2.2.74)

It is well-known that if λ ∈ σ(A + B) then dist(λ, σ(A)) ≤ ||B||; therefore, we have µ_i(S̃S̃^*) ≤ τ^{-1}, i = 1, · · · , K, for τ > 0 defined in (2.2.42). Recalling (2.2.51), by (2.2.50), (2.2.58) and (2.2.71), with probability 1 − N^{-D_1} we have

|T^r(x)| ≥ N^{-1/2+(C−1)ε_1} κ^{-1/4}, if x ∈ [S_b, τ^{-1}] \ Γ(d). (2.2.75)

Using the formula

det [[x I_r, diag(α_1, · · · , α_r)], [diag(α_1, · · · , α_r), y I_r]] = Π_{i=1}^r (xy − α_i^2),

Lemma 2.2.20, (2.2.47) and (2.2.74), we conclude that

det(𝒟^{-1} + 𝒰^* Π(x) 𝒰) = x^{-r} T^r(x) + O(N^{-1/2+ε_1} κ^{-1/4}). (2.2.76)
By (2.2.75) and (2.2.76), we conclude that \(M_r(x)\) is non-singular when \(x \in [S_b, \tau^{-1}] \setminus \Gamma(\mathbf{d})\). Next we use Rouché's theorem to show that, inside the permissible region, each interval \(I_i^+(\mathbf{d})\) contains precisely one eigenvalue of \(SS^*\). Let \(i \in \{1, \cdots, k_+\}\) and pick a small \(N\)-independent counterclockwise (positively oriented) contour \(\mathcal{C} \subset \mathbb{C} \setminus [(1-c^{-1/2})^2, (1+c^{-1/2})^2]\) that encloses \(p(d_i^+)\) but no other \(p(d_j^+)\), \(j \ne i\). For large enough \(N\), define \(f(z) := \det(M_r(z))\), \(g(z) := \det(T_r(z))\). By the definition of the determinant, the functions \(g, f\) are holomorphic on and inside \(\mathcal{C}\), and \(g(z)\) has precisely one zero \(z = p(d_i^+)\) inside \(\mathcal{C}\). On \(\mathcal{C}\), it is easy to check that
\[
\min_{z \in \mathcal{C}} |g(z)| \ge c > 0, \quad |g(z) - f(z)| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},
\]
where we use (2.2.74) and Lemma 2.2.20. Hence, \(f(z)\) has only one zero in \(I_i^+(\mathbf{d})\) by Rouché's theorem. This concludes the proof of (2.2.72) using Lemma 2.2.17. In order to prove (2.2.73), using the following fact: for any two \(M \times N\) rectangular matrices \(A, B\), we have \(\sigma_i(A+B) \ge \sigma_i(A) + \sigma_K(B)\), \(i = 1, \cdots, K\), together with Lemma 2.2.22, we find that
\[
\mu_i \ge \lambda_+ - N^{-2/3+C'\varepsilon_0}, \quad i = k_+ + 1, \cdots, r. \tag{2.2.77}
\]
For the non-outliers, we may assume that \(S_b > \lambda_+ + N^{-2/3+C'\varepsilon_0}\); otherwise the proof is already done. Now assume \(x \notin I_0\); by (2.2.72) and (2.2.77), we only need to discuss the case \(x \in (\lambda_+ + N^{-2/3+C'\varepsilon_0}, S_b)\). In this case, we prove that \(M_r(x)\) is non-singular by comparing it with \(M_r(z)\), where \(z = x + iN^{-2/3-\varepsilon_4}\) and \(\varepsilon_4 < \varepsilon_1\) is some small positive constant. Denote the spectral decomposition of \(G(z)\) as
\[
G(z) = \sum_{\alpha} \frac{1}{\lambda_\alpha - z}\,\mathbf{g}_\alpha \mathbf{g}_\alpha^*, \quad \mathbf{g}_\alpha \in \mathbb{R}^{M+N}.
\]
Denote by \(u_i\), \(i = 1, \cdots, 2r\), the \(i\)-th column of \(U\) defined in (2.2.8) and abbreviate \(u_i^*G(z)u_j\) as \(G_{u_iu_j}(z)\). Setting \(\eta := N^{-2/3-\varepsilon_4}\), using the spectral decomposition and the fact that \(x > \lambda_+ + N^{-2/3+C'\varepsilon_0}\), we have
\[
|G_{u_iu_j}(x) - G_{u_iu_j}(x+i\eta)| \le \mathrm{Im}\, G_{u_iu_i}(x+i\eta) + \mathrm{Im}\, G_{u_ju_j}(x+i\eta).
\]
Therefore, by Lemmas 2.2.20 and 2.2.21, with probability \(1 - N^{-D_1}\) we have
\[
M_r(x) = M_r(z) + O\Big(N^{\varepsilon_1}\Big(\mathrm{Im}\, m_{2c}(z) + \sqrt{\frac{\mathrm{Im}\, m_{2c}(z)}{N\eta}}\Big)\Big).
\]
Using Lemma 2.2.15 and a discussion similar to (2.2.75), we have
\[
M_r(x) = T_r(z) + O\big(N^{-1/3}(N^{-C'\varepsilon_0/4} + N^{\varepsilon_1 - C'\varepsilon_0/4})\big).
\]
By Lemmas 2.2.15 and 2.2.20, we find that \(|T_r(z)| \ge N^{-1/3+C'\varepsilon_0/2}\), where we use the assumption that \(x > \lambda_+ + N^{-2/3+C'\varepsilon_0}\). Therefore, \(M_r(x)\) is non-singular, as we have assumed \(2 < C' < 4\). This concludes the proof of (2.2.73).
In the second step, we extend the proof to any configuration \(\mathbf{d}(1)\) depending on \(N\) via a continuity argument. This is done by a bootstrap argument, choosing a continuous path connecting \(\mathbf{d}(0)\) and \(\mathbf{d}(1)\). We record it as the following lemma.

Lemma 2.2.25. For any \(N\)-dependent configuration \(\mathbf{d}(1) \in \mathcal{D}(\varepsilon_0)\), (2.2.34) and (2.2.35) hold true.
Singular vectors. In this section, we focus on the local behavior of the singular vectors. We first deal with the outlier singular vectors and then the non-outlier ones. Due to similarity, we only prove (2.2.37) and (2.2.39); (2.2.36) and (2.2.38) can be handled similarly.

Proof of (2.2.37). Note that, by Lemma 2.2.21 and Theorem 2.2.5, for \(i \in \mathcal{O}\) there exists a constant \(C > 0\) such that, for \(N\) large enough, with probability \(1 - N^{-D_1}\) we can choose an event \(\Xi\) such that for all \(z \in D(\tau, \varepsilon_1)\) defined in (2.2.65),
\[
\mathbf{1}(\Xi)\,|(V^*G_2(z)V)_{ij} - m_{2c}(z)\delta_{ij}| \le (\kappa + \eta)^{-1/4}N^{-1/2+C\varepsilon_1}. \tag{2.2.78}
\]
Next we restrict our discussion to the event \(\Xi\). Recall (2.2.33); for \(A \subset \mathcal{O}\), we define for each \(i \in A\) the radius
\[
\rho_i := \frac{\nu_i \wedge (d_i - c^{-1/4})}{2}. \tag{2.2.79}
\]
Under the assumption of (2.3.20), we have
\[
\rho_i \ge \frac{1}{2}(d_i - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_0}. \tag{2.2.80}
\]
We define the contour \(\Gamma := \partial\Upsilon\) as the boundary of the union of discs \(\Upsilon := \cup_{i \in A} B_{\rho_i}(d_i)\), where \(B_\rho(d)\) is the open disc of radius \(\rho\) around \(d\). We summarize the basic properties of \(\Upsilon\) in the following lemma.
Lemma 2.2.26. Recall (2.2.9) and (2.2.65); we have \(p(\Upsilon) \subset D(\tau, \varepsilon_1)\). Moreover, each outlier \(\{\mu_i\}_{i \in A}\) lies in \(p(\Upsilon)\), and all the other eigenvalues of \(SS^*\) lie in the complement of \(p(\Upsilon)\).
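The contour-integral representation used in this proof (the spectral projector written as a resolvent integral, as in (2.2.82) below) can be illustrated on a toy example. The sketch uses an arbitrary small symmetric matrix in place of the model and discretizes the circle with the trapezoidal rule; it is an illustration of the identity, not of the random matrix argument:

```python
import numpy as np

# Toy illustration of P = -(1/2*pi*i) * contour_integral (A - z)^{-1} dz over a
# circle enclosing exactly one eigenvalue: the integral recovers the projector.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
A = (A + A.T) / 2
vals, vecs = np.linalg.eigh(A)

k = 3                                     # target eigenvalue index (arbitrary)
gap = np.min(np.abs(np.delete(vals, k) - vals[k]))
radius = gap / 2                          # circle enclosing only vals[k]

theta = np.linspace(0, 2 * np.pi, 2001)[:-1]
z = vals[k] + radius * np.exp(1j * theta)
dz = 1j * radius * np.exp(1j * theta) * (2 * np.pi / len(theta))

P = np.zeros((6, 6), dtype=complex)
for zj, dzj in zip(z, dz):
    P += -np.linalg.inv(A - zj * np.eye(6)) * dzj / (2j * np.pi)

proj = np.outer(vecs[:, k], vecs[:, k])   # rank-one spectral projector
assert np.linalg.norm(P.real - proj) < 1e-8
assert np.linalg.norm(P.imag) < 1e-8
```

Because the integrand is analytic on the circle, the periodic trapezoidal rule converges extremely fast, which is why a plain Riemann-type sum suffices here.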
Armed with the above results, we now start the proof of the outlier singular vectors.
Our starting point is an integral representation of the singular vectors. By (2.2.48), we have
\[
\bar v_i^* G_2 \bar v_j = \bar v_i^* G \bar v_j, \tag{2.2.81}
\]
where \(\bar v_i \in \mathbb{R}^{M+N}\) is the natural embedding of \(v_i\) with \(\bar v_i = (0, v_i)^*\). Recall (2.3.29); using the spectral decomposition of \(G_2(z)\), Lemma 2.2.26 and Cauchy's integral formula, we have
\[
P_r = -\frac{1}{2\pi i}\oint_{p(\Gamma)} G_2(z)\,dz = -\frac{1}{2\pi i}\oint_{\Gamma} G_2(p(\zeta))p'(\zeta)\,d\zeta. \tag{2.2.82}
\]
By Lemma 2.2.18, Cauchy's integral formula, (2.2.81) and (2.2.82), we have
\[
\langle \bar v_i, P_r \bar v_j\rangle = \frac{1}{2d_id_j\pi i}\oint_{p(\Gamma)} \big(D^{-1} + U^*G(z)U\big)^{-1}_{\bar i\bar j}\,\frac{dz}{z}, \tag{2.2.83}
\]
where \(\bar i, \bar j\) are defined as \(\bar i := r + i\), \(\bar j := r + j\). Recall (2.2.64); as \(D^{-1} + U^*\Pi(z)U\) is of finite dimension, by Lemmas 2.2.20, 2.2.21, (2.2.47) and (2.2.78), we may take \(\Pi(z)\) as
\[
\Pi(z) := \begin{pmatrix} m_{1c}(z) & 0 \\ 0 & m_{2c}(z) \end{pmatrix}.
\]
Next we decompose \(D^{-1} + U^*G(z)U\) as
\[
D^{-1} + U^*G(z)U = D^{-1} + U^*\Pi(z)U - \Delta(z), \quad \Delta(z) = U^*\Pi(z)U - U^*G(z)U. \tag{2.2.84}
\]
Note that \(\Delta(z)\) can be controlled by Lemmas 2.2.20 and 2.2.21. Using the resolvent expansion to first order on (2.2.84), we have
\[
\langle \bar v_i, P_r \bar v_j\rangle = \frac{1}{d_id_j}\big(S^{(0)} + S^{(1)} + S^{(2)}\big), \tag{2.2.85}
\]
where
\[
S^{(0)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big(\frac{1}{D^{-1} + U^*\Pi(z)U}\Big)_{\bar i\bar j}\frac{dz}{z},
\]
\[
S^{(1)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big[\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*\Pi(z)U}\Big]_{\bar i\bar j}\frac{dz}{z},
\]
\[
S^{(2)} := \frac{1}{2\pi i}\oint_{p(\Gamma)} \Big[\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*\Pi(z)U}\Delta(z)\frac{1}{D^{-1} + U^*G(z)U}\Big]_{\bar i\bar j}\frac{dz}{z}.
\]
By an elementary computation, we have
\[
\big(D^{-1} + U^*\Pi(z)U\big)^{-1}_{ij} =
\begin{cases}
\dfrac{\delta_{ij}\,zm_{2c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & 1 \le i, j \le r;\\[1.2ex]
\dfrac{\delta_{ij}\,zm_{1c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & r \le i, j \le 2r;\\[1.2ex]
\dfrac{\delta_{\bar i j}\,(-1)^{i+j}z^{1/2}d_i^{-1}}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}}, & 1 \le i \le r,\ r \le j \le 2r;\\[1.2ex]
\dfrac{\delta_{i\bar j}\,(-1)^{i+j}z^{1/2}d_j^{-1}}{zm_{1c}(z)m_{2c}(z) - d_j^{-2}}, & r \le i \le 2r,\ 1 \le j \le r.
\end{cases} \tag{2.2.86}
\]
Using the fact that \(p_i\,m_{1c}(p_i)m_{2c}(p_i) = \frac{1}{d_i^2}\) and the residue theorem, we have
\[
S^{(0)} = \delta_{ij}\frac{m_{2c}(p_i)}{T'(p_i)} = \delta_{ij}\frac{d_i^4 - c^{-1}}{d_i^2 + 1}. \tag{2.2.87}
\]
Next we control the term \(S^{(1)}\). Applying (2.2.86) to \(S^{(1)}\), we have
\[
S^{(1)} = \frac{1}{2\pi i}\oint_{p(\Gamma)} \frac{f(z)}{(zm_{1c}(z)m_{2c}(z) - d_i^{-2})(zm_{1c}(z)m_{2c}(z) - d_j^{-2})}\,dz, \tag{2.2.88}
\]
where \(f(z) = f_1(z) + f_2(z)\) and \(f_{1,2}(z)\) are defined as
\[
f_1(z) := m_{2c}(z)\big[zm_{2c}(z)\Delta(z)_{\bar i\bar j} + (-1)^{i+\bar i}z^{1/2}d_i^{-1}\Delta(z)_{i\bar j}\big],
\]
\[
f_2(z) := d_j^{-1}\big[(-1)^{j+\bar j}z^{1/2}m_{2c}(z)\Delta(z)_{\bar i j} + (-1)^{i+j+\bar i+\bar j}d_i^{-1}\Delta(z)_{ij}\big].
\]
We now use the change of variables as in (2.2.82) and rewrite \(S^{(1)}\) as
\[
S^{(1)} = \frac{1}{2\pi i}\oint_{\Gamma} \frac{f(p(\zeta))\,p'(\zeta)}{(\zeta^{-2} - d_i^{-2})(\zeta^{-2} - d_j^{-2})}\,d\zeta = d_i^2d_j^2\,\frac{1}{2\pi i}\oint_{\Gamma} \frac{f(p(\zeta))\zeta^4}{(d_i^2 - \zeta^2)(d_j^2 - \zeta^2)}\,p'(\zeta)\,d\zeta,
\]
where we use the fact that \(p(\zeta)m_{1c}(p(\zeta))m_{2c}(p(\zeta)) = \zeta^{-2}\). By (2.2.50), Lemmas 2.2.15 and 2.2.21, we conclude that
\[
|f(p(\zeta))p'(\zeta)\zeta^4| \le (\zeta - c^{-1/4})^{1/2}N^{-1/2+\varepsilon_1}. \tag{2.2.89}
\]
Denote
\[
f_{ij}(\zeta) = \frac{f(p(\zeta))p'(\zeta)\zeta^4}{(d_i + \zeta)(d_j + \zeta)}.
\]
As \(f_{ij}\) is holomorphic inside the contour \(\Gamma\), by Cauchy's differentiation formula we have
\[
f'_{ij}(\zeta) = \frac{1}{2\pi i}\oint_{\mathcal{C}} \frac{f_{ij}(\xi)}{(\xi - \zeta)^2}\,d\xi, \tag{2.2.90}
\]
where the contour \(\mathcal{C}\) is the circle of radius \(\frac{|\zeta - c^{-1/4}|}{2}\) centered at \(\zeta\). Hence, by (2.2.50), (2.2.89), (2.2.90) and the residue theorem, we have
\[
|f'_{ij}(\zeta)| \le (\zeta - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_1}. \tag{2.2.91}
\]
In order to estimate \(S^{(1)}\), we consider the following three cases: (i) \(i, j \in A\); (ii) \(i \in A, j \notin A\) (or \(i \notin A, j \in A\)); (iii) \(i, j \notin A\). By the residue theorem, \(S^{(1)} = 0\) in case (iii). Hence, we only need to consider cases (i) and (ii). In case (i), when \(i \ne j\), by the residue theorem and (2.2.91) we have
\[
|S^{(1)}| = d_i^2d_j^2\left|\frac{f_{ij}(d_i) - f_{ij}(d_j)}{d_i - d_j}\right| \le \frac{d_i^2d_j^2}{|d_i - d_j|}\left|\int_{d_i}^{d_j} |f'_{ij}(t)|\,dt\right| \le \frac{d_i^2d_j^2\,N^{-1/2+\varepsilon_1}}{(d_i - c^{-1/4})^{1/2} + (d_j - c^{-1/4})^{1/2}}.
\]
When \(i = j\), by the residue theorem we have \(|S^{(1)}| \le d_i^4(d_i - c^{-1/4})^{-1/2}N^{-1/2+\varepsilon_1}\). In case (ii), when \(i \in A, j \notin A\), by the residue theorem and (2.2.78) we have
\[
|S^{(1)}| = \left|\frac{d_i^2d_j^2\,f_{ij}(d_i)}{d_i - d_j}\right| \le \frac{d_i^2d_j^2(d_i - c^{-1/4})^{1/2}}{|d_i - d_j|}N^{-1/2+\varepsilon_1}.
\]
We get similar results when \(i \notin A, j \in A\). Putting all the cases together, we find that
\[
|S^{(1)}| \le N^{-1/2+\varepsilon_1}\Big[\mathbf{1}(i \in A, j \in A)\frac{d_i^2d_j^2}{(d_i - c^{-1/4})^{1/2} + (d_j - c^{-1/4})^{1/2}} + \mathbf{1}(i \in A, j \notin A)\frac{d_i^2d_j^2(d_i - c^{-1/4})^{1/2}}{|d_i - d_j|} + \mathbf{1}(i \notin A, j \in A)\frac{d_i^2d_j^2(d_j - c^{-1/4})^{1/2}}{|d_i - d_j|}\Big]. \tag{2.2.92}
\]
Finally, we need to estimate \(S^{(2)}\). Here the residue calculations cannot be applied directly, as \(U^*G(z)U\) is not necessarily diagonal and a relation comparable to \(p(\zeta)m_{1c}(p(\zeta))m_{2c}(p(\zeta)) = \zeta^{-2}\) does not exist. Instead, we need to choose the contour \(\Gamma\) precisely. We record the result as the following lemma.
Lemma 2.2.27. When \(N\) is large enough, with probability \(1 - N^{-D_1}\), for some constant \(C > 0\) we have
\[
|S^{(2)}| \le CN^{-1+2\varepsilon_1}\Big(\frac{1}{\nu_i} + \frac{\mathbf{1}(i \in A)}{|d_i - c^{-1/4}|}\Big)\Big(\frac{1}{\nu_j} + \frac{\mathbf{1}(j \in A)}{|d_j - c^{-1/4}|}\Big). \tag{2.2.93}
\]
Therefore, plugging (2.2.87), (2.2.92) and (2.2.93) into (2.2.85), we conclude the proof of (2.2.37). Before concluding this subsection, we briefly discuss the proof of (2.2.36). By Lemma 2.2.18 and Cauchy's integral formula, we have
\[
\langle \bar u_i, P_l \bar u_j\rangle = \frac{1}{2d_id_j\pi i}\oint_{p(\Gamma)} \big(D^{-1} + U^*G(z)U\big)^{-1}_{ij}\,\frac{dz}{z}.
\]
Then we can use a discussion similar to (2.2.85), computing the convergent limit from \(S^{(0)}\) and controlling the bounds for \(S^{(1)}\) and \(S^{(2)}\). We remark that the convergent limit is different because we use \((D^{-1} + U^*\Pi(z)U)^{-1}_{ij}\), \(r \le i, j \le 2r\), in (2.2.86), which results in
\[
S^{(0)} = \delta_{ij}\frac{m_{1c}(p_i)}{T'(p_i)} = \delta_{ij}\frac{d_i^4 - c^{-1}}{d_i^2 + c^{-1}}.
\]
This concludes the proof of (2.2.36).
For the non-outliers, the proof strategy for the outlier singular vectors no longer works, as we cannot use the residue theorem. We instead use a spectral decomposition.

Proof of (2.2.39). Denote
\[
z = \mu_j + i\eta, \tag{2.2.94}
\]
where \(\eta\) is defined as the smallest solution of
\[
\mathrm{Im}\, m_{2c}(z) = N^{-1+6\varepsilon_1}\eta^{-1}. \tag{2.2.95}
\]
As we assume \(j \le (1-\tau)K\) or \(c \ne 1\), we conclude that \(|z|\) has a constant lower bound. Therefore, by Lemmas 2.3.27, 2.2.20 and 2.2.21, with probability \(1 - N^{-D_1}\) we have
\[
|\langle u, \Sigma^{-1}(G(z) - \Pi(z))\Sigma^{-1}v\rangle| \le \frac{N^{4\varepsilon_1}}{N\eta}. \tag{2.2.96}
\]
Recall (2.2.55); abbreviating \(\kappa = |\mu_j - \lambda_+|\), by Lemma 2.2.15 and (2.2.35) we find that
\[
\eta \sim
\begin{cases}
\dfrac{N^{6\varepsilon_1}}{N\sqrt{\kappa} + N^{2/3+2\varepsilon_1}}, & \text{if } \mu_j \le \lambda_+ + N^{-2/3+4\varepsilon_1},\\[1.2ex]
N^{-1/2+3\varepsilon_1}\kappa^{1/4}, & \text{if } \mu_j \ge \lambda_+ + N^{-2/3+4\varepsilon_1}.
\end{cases} \tag{2.2.97}
\]
For \(z\) defined in (2.2.94), by the spectral decomposition we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \eta\,\langle \bar v_i, \mathrm{Im}\, G_2(z)\bar v_i\rangle = \eta\,\langle \bar v_i, \mathrm{Im}\, G(z)\bar v_i\rangle, \tag{2.2.98}
\]
where \(\tilde v_j\) denotes the sample singular vector associated with \(\mu_j\) and \(\bar v_i \in \mathbb{R}^{M+N}\) is the natural embedding of \(v_i\). By Lemma 2.2.18, we have
\[
\langle \bar v_i, G(z)\bar v_i\rangle = -\frac{1}{zd_i^2}\big(D^{-1} + U^*G(z)U\big)^{-1}_{\bar i\bar i}.
\]
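The first inequality in (2.2.98) above is a deterministic consequence of the spectral decomposition: for any symmetric \(H\) with eigenpair \((\mu_j, u_j)\), any unit vector \(v\) and any \(\eta > 0\), one has \(\langle v, u_j\rangle^2 \le \eta\,\mathrm{Im}\,\langle v, (H - \mu_j - i\eta)^{-1}v\rangle\). A minimal numeric sketch (toy matrix, unrelated to the model of this section):

```python
import numpy as np

# Spectral-decomposition bound: <v,u_j>^2 <= eta * Im <v, (H - z)^{-1} v>
# at z = mu_j + i*eta, since Im 1/(mu_k - z) = eta / ((mu_k - mu_j)^2 + eta^2).
rng = np.random.default_rng(2)
n = 20
H = rng.normal(size=(n, n)); H = (H + H.T) / 2
mu, U = np.linalg.eigh(H)
v = rng.normal(size=n); v /= np.linalg.norm(v)

eta = 1e-3
for j in range(n):
    z = mu[j] + 1j * eta
    G_vv = v @ np.linalg.solve(H - z * np.eye(n), v)
    assert (v @ U[:, j])**2 <= eta * G_vv.imag + 1e-12
```

The inequality is exact (no randomness is involved), since the \(k = j\) term of \(\mathrm{Im}\,\langle v, Gv\rangle\) alone already equals \(\langle v, u_j\rangle^2/\eta\).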
Similar to (2.2.85), using a simple resolvent expansion and (2.2.86), we have
\[
\langle \bar v_i, G(z)\bar v_i\rangle = -\frac{1}{zd_i^2}\Big[\frac{zm_{2c}(z)}{zm_{1c}(z)m_{2c}(z) - d_i^{-2}} + \frac{zf(z)}{(zm_{1c}(z)m_{2c}(z) - d_i^{-2})^2} + \Big(\big[(D^{-1} + U^*\Pi(z)U)^{-1}\Delta(z)\big]^2\big(D^{-1} + U^*G(z)U\big)^{-1}\Big)_{\bar i\bar i}\Big], \tag{2.2.99}
\]
where \(f(z)\) is defined in (2.2.88). To estimate the right-hand side of (2.2.99), we use the following error estimate:
\[
\min_j |d_j^{-2} - T(z)| \ge \mathrm{Im}\, T(z) \sim \mathrm{Im}\, m_{2c}(z) = \frac{N^{6\varepsilon_1}}{N\eta} \gg \frac{N^{4\varepsilon_1}}{N\eta} \ge |\Delta(z)|,
\]
where we use (2.2.96) and Lemma 2.2.20. By a similar resolvent expansion, there exists some constant \(C > 0\) such that
\[
\left\|\frac{1}{D^{-1} + U^*G(z)U}\right\| \le \frac{C}{\mathrm{Im}\, m_{2c}(z)} = CN^{1-6\varepsilon_1}\eta.
\]
We therefore get from (2.2.99), the definition of \(f\) and (2.2.96) that
\[
\langle \bar v_i, G(z)\bar v_i\rangle = \frac{m_{2c}(z)}{1 - d_i^2T(z)} + O\Big(\frac{d_i^2}{|1 - d_i^2T(z)|^2}\,\frac{N^{4\varepsilon_1}}{N\eta}\Big). \tag{2.2.100}
\]
By (2.2.98), we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{\eta}{|1 - d_i^2T(z)|^2}\Big[\mathrm{Im}\, m_{2c}(z)\big(1 - d_i^2c^{1/2} + \mathrm{Re}(d_i^2c^{1/2} - d_i^2T(z))\big) + d_i^2\,\mathrm{Re}\, m_{2c}(z)\,\mathrm{Im}\, T(z) + \frac{Cd_i^2N^{4\varepsilon_1}}{N\eta}\Big]. \tag{2.2.101}
\]
By (2.2.57), (2.2.95) and (2.2.97), we have
\[
\mathrm{Im}\, m_{2c}(z)\big[(1 - d_i^2c^{1/2}) + \mathrm{Re}(d_i^2c^{1/2} - d_i^2T(z))\big] \le \frac{CN^{6\varepsilon_1}}{N\eta}\Big(|d_i - c^{-1/4}| + \max\Big\{\sqrt{\kappa + \eta},\ \frac{\eta}{\sqrt{\kappa + \eta}}\Big\} + \kappa\Big).
\]
For the other term, by Lemma 2.2.15 we have \(|\mathrm{Re}\, m_{2c}(z)\,\mathrm{Im}\, T(z)| \sim \mathrm{Im}\, m_{2c}(z)\). Putting all these estimates together, we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{CN^{6\varepsilon_1}}{N|1 - d_i^2T(z)|^2}.
\]
It remains to estimate \(1 - d_i^2T(z)\). We summarize the estimate in the following lemma.
Lemma 2.2.28. Recall (2.2.44); for all \(\mu_j \in [\lambda_-, \lambda_+ + N^{-2/3+C\varepsilon_0}]\), there exists a constant \(\delta > 0\) such that
\[
|1 - d_i^2T(z)| \ge \delta d_i^2\big(|d_i^{-2} - c^{1/2}| + \mathrm{Im}\, T(z)\big).
\]
Therefore, we have
\[
\langle v_i, \tilde v_j\rangle^2 \le \frac{N^{C\varepsilon_0}}{N\big((d_i - c^{-1/4})^2 + \kappa_j^d\big)}, \quad \kappa_j^d := N^{-2/3}(j \wedge (K + 1 - j))^{2/3},
\]
where we use the fact that \(\mathrm{Im}\, T(z) \ge c\sqrt{\kappa_j^d}\). This concludes the proof of (2.2.39). For the proof of (2.2.38), we use the spectral decomposition
\[
\langle u_i, \tilde u_j\rangle^2 \le \eta\,\langle \bar u_i, \mathrm{Im}\, G_1(z)\bar u_i\rangle = \eta\,\langle \bar u_i, \mathrm{Im}\, G(z)\bar u_i\rangle,
\]
and
\[
\langle \bar u_i, G(z)\bar u_i\rangle = -\frac{1}{zd_i^2}\big(D^{-1} + U^*G(z)U\big)^{-1}_{ii}.
\]
Then, by a resolvent expansion similar to (2.2.99) and controlling the terms using Lemmas 2.2.15, 2.3.27, 2.2.20 and 2.2.21, we can conclude the proof.
2.3 Eigen-structure of sample covariance matrix of
general form
Covariance matrices play an important role in high dimensional data analysis and find applications in many scientific endeavors, ranging from functional magnetic resonance imaging and the analysis of gene expression arrays to risk management and portfolio allocation. Furthermore, a large collection of statistical methods, including principal component analysis, discriminant analysis, clustering analysis, and regression analysis, require knowledge of the covariance structure. Estimating a high dimensional covariance matrix has thus become a fundamental problem in high dimensional statistics.
The starting point of covariance matrix estimation is the sample covariance matrix,
which is a consistent estimator when the dimension of the data is fixed. In the high
dimensional regime, even though the sample covariance matrix itself is a poor estima-
tor, it can still provide lots of information about the eigen-structure of the population
covariance matrix. In many cases, the population covariance matrices can be effectively
estimated using the information from sample covariance matrices. Two main types of covariance matrices have been studied in the literature. One is the covariance matrix whose eigenvalues all lie in the bulk of its spectrum. The null case is when the entries of the data matrix are i.i.d., where the spectrum of the sample covariance matrix satisfies the celebrated Marchenko-Pastur (MP) law. For data matrices with correlated entries, the spectrum satisfies the deformed MP law and has been well studied in the literature. In the deformed MP law, several bulk components are allowed (recall that the spectrum of the MP law has only one bulk component). The other line of effort is to add a few outliers (i.e., eigenvalues detached from the bulk) to the spectrum of the MP law, which yields the spiked covariance matrix.
In the present paper, we study the local asymptotics of the empirical eigen-structure of sample covariance matrices of general form. We add a finite number of outliers to the spectrum of the deformed MP law. Hence, our framework can be viewed as, on the one hand, an extension of the spiked model that allows multiple bulk components and, on the other hand, an extension of covariance matrices with deformed MP law obtained by adding a finite number of spikes. It is a unified framework for covariance matrices of general form containing all the models discussed above.
Local deformed Marchenko-Pastur law. It is well-known that the empirical eigenvalue density of sample covariance matrices with independent entries converges to the celebrated Marchenko-Pastur (MP) law. In the case when the population covariance matrices have a general structure, it has been shown that the empirical eigenvalue density still converges to a deterministic limit, which is called the deformed MP law. The local deformed MP law is an immediate consequence of the anisotropic local law. Denote
\[
c \equiv c_N := \frac{N}{M} \in (0, \infty), \tag{2.3.1}
\]
and let \(X = (x_{ij})\) be an \(M \times N\) data matrix with centered entries \(x_{ij} = N^{-1/2}q_{ij}\), \(1 \le i \le M\) and \(1 \le j \le N\), where the \(q_{ij}\) are i.i.d. random variables with unit variance such that, for all \(p \in \mathbb{N}\), there exists a constant \(C_p\) with \(\mathbb{E}|q_{11}|^p \le C_p\).
The MP law and its variants are best formulated using the Stieltjes transform (see Definition 2.3.24). Denote \(H = XX^*\) and its Green function by
\[
G_I(z) = (H - z)^{-1}, \quad z = E + i\eta \in \mathbb{C}_+.
\]
The local MP law can be informally written as
\[
\frac{1}{M}\mathrm{Tr}\, G_I(z) = m_{MP}(z) + O\Big(\sqrt{\frac{\mathrm{Im}\, m_{MP}(z)}{N\eta}} + \frac{1}{N\eta}\Big),
\]
where \(m_{MP}\) is the Stieltjes transform of the MP law. Note that \(m_{MP}(z)\) is independent of \(N\). For general sample covariance matrices without outliers, we adapt the model in [74] and write
\[
Q_b = \Sigma_b^{1/2}XX^*\Sigma_b^{1/2}, \tag{2.3.2}
\]
where \(\Sigma_b\) is a positive definite matrix satisfying some regularity conditions. We call \(Q_b\) the bulk model in this paper. Denote the eigenvalues of \(\Sigma_b\) by \(\sigma_1^b \ge \sigma_2^b \ge \cdots \ge \sigma_M^b > 0\), and the empirical spectral distribution (ESD) of \(\Sigma_b\) by
\[
\pi_b(A) := \frac{1}{M}\sum_{i=1}^M \mathbf{1}_{\sigma_i^b \in A}. \tag{2.3.3}
\]
We assume that there exists some small positive constant \(\tau\) such that
\[
\tau < \sigma_M^b \le \sigma_1^b \le \tau^{-1}, \quad \tau \le c \le \tau^{-1}, \quad \pi_b([0, \tau]) \le 1 - \tau. \tag{2.3.4}
\]
Next we discuss the asymptotic density of \(Q_b^1 := X^*\Sigma_bX\). Assuming that \(\pi_b \Rightarrow \pi_b^\infty\) weakly, it is well-known that if \(\pi_b^\infty\) is a compactly supported probability measure on \(\mathbb{R}\) and \(c > 0\), then for each \(z \in \mathbb{C}_+\) there is a unique \(m_D \equiv m_{\Sigma_b}(z) \in \mathbb{C}_+\) satisfying
\[
\frac{1}{m_D} = -z + \frac{1}{c}\int \frac{x}{1 + m_Dx}\,\pi_b^\infty(dx). \tag{2.3.5}
\]
We denote by \(\rho_D\) the probability measure associated with \(m_D\) (i.e., \(m_D\) is the Stieltjes transform of \(\rho_D\)) and call it the asymptotic density of \(Q_b^1\). Our assumption (2.3.4) implies that the spectrum of \(\Sigma_b\) cannot be concentrated at zero; thus it ensures that \(\pi_b^\infty\) is a compactly supported probability measure. Therefore, \(m_D\) and \(\rho_D\) are well-defined. The behaviour of \(\rho_D\) can be entirely understood through the analysis of the function \(f_D\):
\[
z = f_D(m_D), \quad \mathrm{Im}\, m_D \ge 0, \quad \text{where } f_D(x) := -\frac{1}{x} + \frac{1}{c}\int \frac{\lambda}{1 + x\lambda}\,\pi_b^\infty(d\lambda). \tag{2.3.6}
\]
In practical applications, the limiting form (2.3.5) is usually not available and we are interested in the large \(N\) case. We now define the deterministic function \(m \equiv m_{\Sigma_b,N}(z)\) as the unique solution of
\[
z = f(m), \quad \mathrm{Im}\, m \ge 0, \quad \text{where } f(x) := -\frac{1}{x} + \frac{1}{c_N}\sum_{i=1}^M \frac{\pi_b(\sigma_i^b)}{x + (\sigma_i^b)^{-1}}. \tag{2.3.7}
\]
Similarly, we denote by \(\rho\) the probability measure associated with \(m(z)\). The local deformed MP law can be informally written as
\[
\frac{1}{N}\mathrm{Tr}\, G(z) = m(z) + O\Big(\sqrt{\frac{\mathrm{Im}\, m(z)}{N\eta}} + \frac{1}{N\eta}\Big), \tag{2.3.8}
\]
where \(G(z)\) is the Green function of \(Q_b^1\). Note that \(m(z)\) depends on \(N\) in general.
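The self-consistent equation (2.3.7) can be solved by a simple fixed-point iteration for any \(z \in \mathbb{C}_+\), and the solution can be compared with the empirical Stieltjes transform in (2.3.8). The sketch below (a two-atom \(\Sigma_b\) with eigenvalues 1 and 4, \(c_N = 2\) and one arbitrary spectral parameter \(z\); all of these choices are illustrative) is a sanity check, not the rigorous local law:

```python
import numpy as np

# Solve 1/m = -z + (1/c_N) * (1/M) * sum_i sigma_i / (1 + m sigma_i), which is
# equivalent to z = f(m) in (2.3.7), and compare with (1/N) Tr G(z) in (2.3.8).
M, N = 400, 800                                # arbitrary sizes, c_N = N/M = 2
c = N / M
sigma = np.concatenate([np.ones(M // 2), 4.0 * np.ones(M // 2)])  # two-atom pi_b

z = 2.0 + 0.5j                                 # arbitrary spectral parameter
m = 1j                                         # start in the upper half plane
for _ in range(500):
    m = 1.0 / (-z + (1.0 / c) * np.mean(sigma / (1.0 + m * sigma)))
res = abs(1.0 / m + z - (1.0 / c) * np.mean(sigma / (1.0 + m * sigma)))

# Monte Carlo comparison with the Green function of Q_b^1 = X^* Sigma_b X
rng = np.random.default_rng(3)
X = rng.normal(size=(M, N)) / np.sqrt(N)
evals = np.linalg.eigvalsh((X.T * sigma) @ X)  # spectrum of X^T Sigma_b X
m_emp = np.mean(1.0 / (evals - z))

assert res < 1e-8                              # m solves the fixed point
assert abs(m_emp - m) < 0.02                   # O(1/(N*eta)) agreement
```

The iteration \(m \mapsto 1/(-z + c^{-1}\int x(1+mx)^{-1}d\pi_b)\) is the standard stable way to solve such self-consistent equations in the upper half plane.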
Remark 2.3.1. In the literature, there are no results on the control of \(m - m_D\), as it depends on the convergence rate of \(\pi_b \Rightarrow \pi_b^\infty\). We believe that under mild assumptions we can replace \(m(z)\) with \(m_D(z)\). We will not pursue this generalization in this paper.
General covariance matrices. This subsection is devoted to defining the general covariance matrices. We start with a discussion of the spectrum of the bulk model \(Q_b\) defined in (2.3.2). We first summarize the properties of \(f\) defined in (2.3.7); they can be found in [74, Lemmas 2.4, 2.5 and 2.6].
Lemma 2.3.2. Denote \(\bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\}\); then \(f\) defined in (2.3.7) is smooth on the \(M + 1\) open intervals of \(\bar{\mathbb{R}}\) defined through
\[
I_1 := (-(\sigma_1^b)^{-1}, 0), \quad I_i := (-(\sigma_i^b)^{-1}, -(\sigma_{i-1}^b)^{-1}),\ i = 2, \cdots, M, \quad I_0 := \bar{\mathbb{R}} \setminus \cup_{i=1}^M I_i.
\]
We also introduce a multiset \(\mathcal{C} \subset \bar{\mathbb{R}}\) containing the critical points of \(f\), using the convention that a nondegenerate critical point is counted once and a degenerate critical point is counted twice. In the case \(c_N = 1\), \(\infty\) is a nondegenerate critical point. With the above notations, we have:

• (Critical points): \(|\mathcal{C} \cap I_0| = |\mathcal{C} \cap I_1| = 1\) and \(|\mathcal{C} \cap I_i| \in \{0, 2\}\) for \(i = 2, \cdots, M\). Therefore, \(|\mathcal{C}| = 2p\), where, for convenience, we denote by \(x_1 \ge x_2 \ge \cdots \ge x_{2p-1}\) the \(2p - 1\) critical points in \(I_1 \cup \cdots \cup I_M\) and by \(x_{2p}\) the unique critical point in \(I_0\).

• (Ordering): Denote \(a_k := f(x_k)\); we have \(a_1 \ge \cdots \ge a_{2p}\). Moreover, we have \(x_k = m(a_k)\), adopting the convention \(m(0) := \infty\) for \(c_N = 1\). Furthermore, for \(k = 1, \cdots, 2p\), there exists a constant \(C\) such that \(0 \le a_k \le C\).

• (Structure of \(\rho\)): \(\mathrm{supp}\, \rho \cap (0, \infty) = \big(\cup_{k=1}^p [a_{2k}, a_{2k-1}]\big) \cap (0, \infty)\).
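The critical-point structure in Lemma 2.3.2 can be explored numerically by scanning \(f'\) for sign changes on each interval \(I_i\). A minimal sketch for a two-atom \(\pi_b\) (eigenvalues 1 and 4 with equal weight and \(c_N = 2\), an arbitrary illustrative choice; in this instance \(I_1\) carries one critical point and \(I_2\) none, so the two bulks merge and \(p = 1\)):

```python
import numpy as np

# f(x) = -1/x + (1/c) * sum_i w_i / (x + 1/sigma_i) for a two-atom pi_b;
# critical points are the sign changes of f' on the intervals of Lemma 2.3.2.
c = 2.0
weights = np.array([0.5, 0.5])          # pi_b({1}) = pi_b({4}) = 1/2
sigmas = np.array([1.0, 4.0])

def f_prime(x):
    return 1.0 / x**2 - (1.0 / c) * np.sum(
        weights / (x[:, None] + 1.0 / sigmas) ** 2, axis=1)

def count_sign_changes(lo, hi):
    x = np.linspace(lo, hi, 20000)
    s = np.sign(f_prime(x))
    return int(np.sum(s[:-1] != s[1:]))

# I_1 = (-1/sigma_1, 0) = (-1/4, 0): exactly one critical point.
n1 = count_sign_changes(-0.249, -0.001)
# I_2 = (-1, -1/4): zero or two critical points (here zero: merged bulks).
n2 = count_sign_changes(-0.999, -0.251)
assert n1 == 1 and n2 == 0
```

A grid scan like this is of course only heuristic; it illustrates the counting statement of the lemma rather than proving it.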
We impose the following regularity conditions on \(\Sigma_b\), which are proposed in [74, Definition 2.7]. Roughly speaking, the regularity condition rules out outliers from the spectrum of \(Q_b\).
Assumption 2.3.3. Fix \(\tau > 0\). We assume the following.
(i) The edges \(a_k\), \(k = 1, \cdots, 2p\), are regular in the sense that
\[
a_k \ge \tau, \quad \min_{l \ne k}|a_k - a_l| \ge \tau, \quad \min_i |x_k + (\sigma_i^b)^{-1}| \ge \tau. \tag{2.3.9}
\]
(ii) The bulk components \(k = 1, \cdots, p\) are regular in the sense that for any fixed \(\tau' > 0\) there exists a constant \(\nu \equiv \nu_{\tau,\tau'}\) such that the density of \(\rho\) in \([a_{2k} + \tau', a_{2k-1} - \tau']\) is bounded from below by \(\nu\).
Remark 2.3.4. The second condition in (2.3.9) states that the gap in the spectrum of \(\rho\) adjacent to \(a_k\) remains well separated when \(N\) is sufficiently large. The third condition ensures a square root behaviour of \(\rho\) in a small neighborhood of \(a_k\); as a consequence, it rules out outliers. The bulk regularity imposes a lower bound on the density of eigenvalues away from the edges.
To extend the bulk model, we now add a finite number \(r\) of spikes to the spectrum of \(\Sigma_b\). Denote the spectral decomposition of \(\Sigma_b\) as
\[
\Sigma_b = \sum_{i=1}^M \sigma_i^b v_iv_i^*, \quad D_b = \mathrm{diag}\{\sigma_1^b, \cdots, \sigma_M^b\}.
\]
Denote by \(\mathcal{I} \subset \{1, 2, \cdots, M\}\) the collection of the indices of the \(r\) outliers, where
\[
\mathcal{I} := \{o_1, \cdots, o_r\} \subset \{1, 2, \cdots, M\}. \tag{2.3.10}
\]
Now we define
\[
\Sigma_g = \sum_{i=1}^M \sigma_i^g v_iv_i^*, \quad \text{where } \sigma_i^g =
\begin{cases}
\sigma_i^b(1 + d_i), & i \in \mathcal{I};\\
\sigma_i^b, & \text{otherwise},
\end{cases} \quad d_i > 0. \tag{2.3.11}
\]
We also assume that the \(d_i\) are arranged in decreasing order. We further define
\[
\mathcal{O} := \{\sigma_i^g, i \in \mathcal{I}\}. \tag{2.3.12}
\]
Therefore, we can write
\[
\Sigma_g = \Sigma_b(1 + \mathbf{V}\mathbf{D}\mathbf{V}^*) = (1 + \mathbf{V}\mathbf{D}\mathbf{V}^*)\Sigma_b, \tag{2.3.13}
\]
where \(\mathbf{V} = (v_1, \cdots, v_M)\) and \(\mathbf{D} = (d_i)\) is an \(M \times M\) diagonal matrix with \(i\)-th entry \(d_i\) for \(i \in \mathcal{I}\) and zero otherwise. As \(\mathbf{D}\) is not invertible, we write
\[
\mathbf{V}\mathbf{D}\mathbf{V}^* = \sum_{i \in \mathcal{I}} d_iv_iv_i^* = \mathbf{V}_o\mathbf{D}_o\mathbf{V}_o^*, \tag{2.3.14}
\]
where \(\mathbf{V}_o\) is an \(M \times r\) matrix containing \(v_i\), \(i \in \mathcal{I}\), and \(\mathbf{D}_o\) is an \(r \times r\) diagonal matrix with entries \(d_i\), \(i \in \mathcal{I}\). Then our model can be written as
\[
Q_g = \Sigma_g^{1/2}XX^*\Sigma_g^{1/2}. \tag{2.3.15}
\]
We call it the general model. Denote \(K = \min\{M, N\}\), and let \(\mu_1 \ge \mu_2 \ge \cdots \ge \mu_K > 0\) be the nontrivial eigenvalues of \(Q_g\) and \(u_i\), \(i = 1, 2, \cdots, M\), the eigenvectors. We also use \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K > 0\) to denote the nontrivial eigenvalues of \(Q_b\) and \(u_i^b\) the eigenvectors of \(Q_b\). As there exist \(p\) bulk components, for convenience we relabel the indices of the eigenvalues of \(Q_g\) using \(\mu_{i,j}\), which stands for the \(j\)-th eigenvalue of the \(i\)-th bulk component. Similarly, we relabel \(d_{i,j}, \lambda_{i,j}, \sigma_{i,j}^g, \sigma_{i,j}^b, v_{i,j}, u_{i,j}\) and \(u_{i,j}^b\).
We further assume that the \(r\) outliers are associated with the \(p\) bulk components, each with \(r_i\), \(i = 1, 2, \cdots, p\), outliers satisfying \(\sum_{i=1}^p r_i = r\). Using the convention \(x_0 = \infty\), we denote the subset \(\mathcal{O}^+ \subset \mathcal{O}\) by \(\mathcal{O}^+ = \cup_{i=1}^p \mathcal{O}_i^+\), where \(\mathcal{O}_i^+\) is defined as
\[
\mathcal{O}_i^+ = \Big\{\sigma_{i,j}^g : x_{2i-1} + N^{-1/3+\varepsilon_0} \le -\frac{1}{\sigma_{i,j}^g} < x_{2(i-1)} - c_0\Big\}, \tag{2.3.16}
\]
where \(\varepsilon_0 > 0\) is some small constant and \(0 < c_0 < \min_i \frac{x_{2(i-1)} - x_{2i-1}}{2}\). We further denote \(r_i^+ := |\mathcal{O}_i^+|\) and the index sets associated with \(\mathcal{O}_i^+, \mathcal{O}^+\) by \(\mathcal{I}_i^+, \mathcal{I}^+\), where
\[
\mathcal{I}_i^+ := \{(i, j) : \sigma_{i,j}^g \in \mathcal{O}_i^+\}, \quad \mathcal{I}^+ := \bigcup_{i=1}^p \mathcal{I}_i^+. \tag{2.3.17}
\]
We can relabel \(\mathcal{I}\) in a similar fashion.
Remark 2.3.5. Our results can be extended to a more general domain by setting
\[
\mathcal{O}_i^+ = \Big\{\sigma_{i,j}^g : x_{2i-1} + N^{-1/3} \le -\frac{1}{\sigma_{i,j}^g} < x_{2(i-1)} - c_0\Big\}.
\]
The proofs still hold true with some minor changes, except that we would need to discuss the case \(x_{2i-1} + N^{-1/3} \le -\frac{1}{\sigma_{i,j}^g} \le x_{2i-1} + N^{-1/3+\varepsilon_0}\). We will not pursue this generalization.
For definiteness, we introduce the following assumption.
Assumption 2.3.6. For all \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i\), we have
\[
f(x_{2i-1}) \le f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) \le f(x_{2(i-1)}), \quad f(x_0) = \infty. \tag{2.3.18}
\]
Furthermore, we assume that
\[
\Big|f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) - f(x_{2(i-1)})\Big| - \Big|f\Big(-\frac{1}{\sigma_{i,j}^g}\Big) - f(x_{2i-1})\Big| \ge \tau, \quad i = 2, \cdots, p, \tag{2.3.19}
\]
where \(\tau > 0\) is some constant.
Roughly speaking, Assumption 2.3.6 ensures that the outliers always sit to the right of each bulk component. When the outliers are on the left (i.e., when (2.3.19) is reversed), we can get similar results.
To avoid repetition, we summarize the assumptions for future reference.
Assumption 2.3.7. We assume that (2.3.1), (2.3.4), (2.3.11) and Assumptions 2.3.3 and 2.3.6 hold true.
We note that [25] considers a similar model, but with spikes only on the right of the spectrum.
Main results. We first introduce the following non-overlapping condition. Roughly
speaking, it ensures that the eigenvalues of Qg are well separated so that we can identify
the eigen-structure.
Assumption 2.3.8 (Non-overlapping condition). For \(A \subset \mathcal{O}^+\), \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i^+\), we assume that
\[
\nu_{i,j}(A) \ge \Big(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big)^{-1/2}N^{-1/2+\varepsilon_0}, \tag{2.3.20}
\]
where \(\varepsilon_0\) is defined in (2.3.16) and \(\nu_{i,j}\) is defined as
\[
\nu_{i,j} \equiv \nu_{i,j}(A) :=
\begin{cases}
\min_{\sigma_{i_1,j_1}^g \notin A}\Big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\Big|, & \text{if } \sigma_{i,j}^g \in A,\\[1.2ex]
\min_{\sigma_{i_1,j_1}^g \in A}\Big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\Big|, & \text{if } \sigma_{i,j}^g \notin A.
\end{cases} \tag{2.3.21}
\]
Remark 2.3.9. In this paper, we compute the convergent limits of the outlier eigenvectors
under Assumption 2.3.8. However, with extra work, we can show that the results still
hold true by removing this assumption. We will not pursue this generalization.
We now state the main results. Throughout the rest of this paper, we always use \(D_1\) as a generic large constant and \(\varepsilon_1 < \varepsilon_0\) as a small constant.
Theorem 2.3.10 (Outlier eigenvalues). Under Assumption 2.3.7, for \(i = 1, 2, \cdots, p\), \(j = 1, 2, \cdots, r_i^+\), there exists some constant \(C > 1\) such that, when \(N\) is large enough, with probability \(1 - N^{-D_1}\) we have
\[
\Big|\mu_{i,j} - f\Big(-\frac{1}{\sigma_{i,j}^g}\Big)\Big| \le N^{-1/2+C\varepsilon_0}\Big(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big)^{1/2}. \tag{2.3.22}
\]
Moreover, for \(i = 1, 2, \cdots, p\), \(j = r_i^+ + 1, \cdots, r_i\), we have
\[
|\mu_{i,j} - f(x_{2i-1})| \le N^{-2/3+C\varepsilon_0}. \tag{2.3.23}
\]
The above theorem gives the precise locations of the outlier and the extremal non-outlier eigenvalues. The outliers locate around their classical locations \(f(-1/\sigma_{i,j}^g)\), and the non-outliers locate around the right edge of the bulk component. However, (2.3.23) can be easily extended to a more general framework. Instead of considering the bulk edge, we can locate \(\mu_{i,j}\) around the eigenvalues of \(Q_b\), which is the phenomenon of eigenvalue sticking. We denote the classical eigenvalue locations in the bulk by \(\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K\), where \(N\int_{\gamma_i}^\infty d\rho = i - \frac{1}{2}\). We also denote the classical number of eigenvalues in the \(i\)-th bulk component by \(N_i := N\int_{a_{2i}}^{a_{2i-1}} d\rho\). Furthermore, for \(i = 1, 2, \cdots, p\) and \(j = 1, 2, \cdots, N_i\), we denote
\[
\lambda_{i,j} := \lambda_{j + \sum_{l<i}N_l}, \quad \gamma_{i,j} := \gamma_{j + \sum_{l<i}N_l} \in (a_{2i}, a_{2i-1}). \tag{2.3.24}
\]
Note that \(\gamma_{i,j}\) can also be characterized through \(N\int_{\gamma_{i,j}}^{a_{2i-1}} d\rho = j - \frac{1}{2}\).
Theorem 2.3.11 (Eigenvalue sticking). Under Assumption 2.3.7, for \(i = 1, 2, \cdots, p\), denote
\[
\alpha_+^i := \min_{1 \le j \le N_i}\Big|-\frac{1}{\sigma_{i,j}^g} - x_{2i-1}\Big|. \tag{2.3.25}
\]
With probability \(1 - N^{-D_1}\), when \(\alpha_+^i \ge N^{-1/3+2\varepsilon_1}\),
\[
|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le \frac{N^{2\varepsilon_1}}{N\alpha_+^i}. \tag{2.3.26}
\]
Remark 2.3.12. We remark that when \(\alpha_+^i < N^{-1/3+2\varepsilon_1}\), it can be shown that (2.3.26) still holds true. However, in this case the eigenvalue rigidity (see Lemma 2.3.34) gives the sharp bound
\[
|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le N^{-2/3+\varepsilon_1}(j \wedge (N_i + 1 - j))^{-1/3}.
\]
Furthermore, for some small constant \(\tau' > 0\), if \(\gamma_{i,j} \in [a_{2i} + \tau', a_{2i-1} - \tau']\), we have \(|\mu_{i,j+r_i^+} - \lambda_{i,j}| \le N^{-1+\varepsilon_1}\). We will see later from Lemma 2.3.34 that when \(\alpha_+^i = O(1)\), the sticking bound \(N^{-1}\) is much smaller than the typical gap \(N^{-2/3}j^{-1/3}\) near the edges.
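Eigenvalue sticking is also visible in simulation: building \(Q_g\) and \(Q_b\) from the same \(X\), the non-outlier eigenvalues of \(Q_g\) track those of \(Q_b\) to within the local eigenvalue spacing. A minimal sketch (an illustrative toy setting with \(\sigma_i^b \equiv 1\), one spike and arbitrary sizes):

```python
import numpy as np

# Sticking check: with one spike (so r_i^+ = 1), mu_{j+1}(Q_g) stays within the
# local eigenvalue spacing of lambda_j(Q_b) when both are built from the same X.
M, N = 500, 1000                              # arbitrary sizes, c_N = 2
rng = np.random.default_rng(5)
X = rng.normal(size=(M, N)) / np.sqrt(N)

Qb = X @ X.T                                  # bulk model, Sigma_b = I
sg = np.ones(M); sg[0] = 5.0                  # one spiked population eigenvalue
Y = np.sqrt(sg)[:, None] * X                  # Sigma_g^{1/2} X
Qg = Y @ Y.T

lam = np.sort(np.linalg.eigvalsh(Qb))[::-1]   # lambda_1 >= lambda_2 >= ...
mu = np.sort(np.linalg.eigvalsh(Qg))[::-1]    # mu_1 >= mu_2 >= ...

# mu_{j+1} sticks to lambda_j near the edge (here j = 1, ..., 50)
stick = np.max(np.abs(mu[1:51] - lam[:50]))
assert stick < 0.05
```

Interlacing alone already forces \(\mu_{j+1} \in [\lambda_{j+1}, \lambda_j]\) here, so this sketch mainly visualizes how much tighter than the spacing the sticking actually is.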
Theorems 2.3.10 and 2.3.11 can be used to estimate the spectrum of the general model. For the bulk model, El Karoui [70] consistently estimated the spectrum by solving a linear programming problem whose objective function involves (2.3.7); later on, Kong and Valiant [75] considered the problem by using the information from samples and provided sharp convergence rates for the estimation. However, neither of the above two methods can be applied to estimate the general model, as both of them rely on information from the deformed MP law, which will "ignore" the finite number of outliers. For the general model, the spiked part can be estimated using (2.3.22), while the bulk part can be estimated using the methods from [70, 75] thanks to the eigenvalue sticking property.
Next, we introduce the results on the eigenvectors. Denote
\[
u_{i,j} := \frac{1}{\sigma_{i,j}^g}\,\frac{f'(-1/\sigma_{i,j}^g)}{f(-1/\sigma_{i,j}^g)}. \tag{2.3.27}
\]
Theorem 2.3.13 (Outlier eigenvectors). For \(1 \le i_1, i_2 \le p\), \(1 \le j_1 \le r_{i_1}^+\), \(1 \le j_2 \le r_{i_2}^+\), under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\) we have
\[
\big|\langle u_{i_1,j_1}, v_{i_2,j_2}\rangle^2 - \mathbf{1}(i_1 = i_2, j_1 = j_2)\,u_{i_1,j_1}\big| \le N^{\varepsilon_1}R(i_1, j_1, i_2, j_2, N), \tag{2.3.28}
\]
where \(R(i_1, j_1, i_2, j_2, N)\) is defined as
\[
R(i_1, j_1, i_2, j_2, N) := \mathbf{1}(i_1 = i_2, j_1 = j_2)\frac{1}{\sqrt{N}}\Big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\Big)^{-1/2} + N^{-1}\Bigg(\frac{1}{\nu_{i_2,j_2}^2} + \frac{\mathbf{1}(i_1 = i_2, j_1 = j_2)}{\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_1-1}\big)^2}\Bigg).
\]
More generally, we consider the spectral projections and the generalized components. Denote
\[
P_A := \sum_{(i,j) \in A} u_{i,j}u_{i,j}^*, \quad A \subset \mathcal{I}^+. \tag{2.3.29}
\]
For a vector \(\mathbf{w} \in \mathbb{R}^M\), we define \(w_{i,j} := \langle v_{i,j}, \mathbf{w}\rangle\).

Corollary 2.3.14. For \(A \subset \mathcal{I}^+\) and any deterministic vector \(\mathbf{w} \in \mathbb{R}^M\), define
\[
\langle \mathbf{w}, Z_A\mathbf{w}\rangle := \sum_{(i,j) \in A} u_{i,j}w_{i,j}^2.
\]
Under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\), when \(N\) is large enough, we have
\[
|\langle \mathbf{w}, P_A\mathbf{w}\rangle - \langle \mathbf{w}, Z_A\mathbf{w}\rangle| \le N^{\varepsilon_1}R(\mathbf{w}, A), \tag{2.3.30}
\]
where \(R(\mathbf{w}, A) := \sum\sum w_{i_1,j_1}w_{i_2,j_2}R(i_1, j_1, i_2, j_2, A)\) and \(R(i_1, j_1, i_2, j_2, A)\) is defined as
\[
N^{-1/2}\Bigg[\frac{\mathbf{1}\big((i_1,j_1), (i_2,j_2) \in A\big)}{\big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\big)^{1/4}\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}\big)^{1/4}} + \mathbf{1}\big((i_1,j_1) \in A, (i_2,j_2) \notin A\big)\frac{\big(-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}\big)^{1/2}}{\big|-\frac{1}{\sigma_{i_1,j_1}^g} + \frac{1}{\sigma_{i_2,j_2}^g}\big|}
+ \mathbf{1}\big((i_1,j_1) \notin A, (i_2,j_2) \in A\big)\frac{\big(-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}\big)^{1/2}}{\big|-\frac{1}{\sigma_{i_1,j_1}^g} + \frac{1}{\sigma_{i_2,j_2}^g}\big|}\Bigg]
+ N^{-1}\Bigg(\frac{1}{\nu_{i_1,j_1}} + \frac{\mathbf{1}\big((i_1,j_1) \in A\big)}{-\frac{1}{\sigma_{i_1,j_1}^g} - x_{2i_1-1}}\Bigg)\Bigg(\frac{1}{\nu_{i_2,j_2}} + \frac{\mathbf{1}\big((i_2,j_2) \in A\big)}{-\frac{1}{\sigma_{i_2,j_2}^g} - x_{2i_2-1}}\Bigg).
\]
Theorem 2.3.15 (Non-outlier eigenvectors). For \((k, i) \in \mathcal{I}^+\) and \((l, j) \in \mathcal{I} \setminus \mathcal{I}^+\), under Assumptions 2.3.7 and 2.3.8, with probability \(1 - N^{-D_1}\) we have
\[
\langle v_{k,i}, u_{l,j}\rangle^2 \le \frac{N^{6\varepsilon_1}}{N\big(\kappa_{l,j}^d + ((\sigma_{k,i}^g)^{-1} + x_{2k-1})^2\big)}, \tag{2.3.31}
\]
where \(\kappa_{l,j}^d := (j \wedge (N_l + 1 - j))^{2/3}N^{-2/3}\).
Corollary 2.3.16. For \((l, j) \in \mathcal{I} \setminus \mathcal{I}^+\), under Assumptions 2.3.7 and 2.3.8, for \(\mathbf{w} \in \mathbb{R}^M\), with probability \(1 - N^{-D_1}\), when \(N\) is large enough, we have
\[
\langle \mathbf{w}, u_{l,j}\rangle^2 \le \sum_{(k,i)} \frac{Cw_{k,i}^2N^{6\varepsilon_1}}{N\big(\kappa_{l,j}^d + ((\sigma_{k,i}^g)^{-1} + x_{2k-1})^2\big)}. \tag{2.3.32}
\]
Before concluding this part, we give a few examples to illustrate our results on the sample eigenvectors.

Example 2.3.17. (i) Let \(A = \{(i, j)\} \subset \mathcal{I}^+\), \(\mathbf{w} = v_{i,j}\) and \(-\frac{1}{\sigma_{i,j}^g} - x_{2i-1} \ge \tau > 0\); then for some constant \(C > 0\) we have
\[
|\langle u_{i,j}, v_{i,j}\rangle^2 - u_{i,j}| \le N^{-1/2+C\varepsilon_1}.
\]
If we take \(A = \{(i, j)\}\) and \(\mathbf{w} = v_{i_1,j_1}\) with \((i_1, j_1) \ne (i, j)\), and if \(\big|-\frac{1}{\sigma_{i,j}^g} + \frac{1}{\sigma_{i_1,j_1}^g}\big| \ge \tau\), we then have
\[
|\langle u_{i,j}, v_{i_1,j_1}\rangle|^2 \le N^{-1+C\varepsilon_1}.
\]
In particular, if \(\sigma_i^b = 1\), \(i = 1, 2, \cdots, M\), our results coincide with [17, Examples 2.13 and 2.14].

(ii) Take \(\mathbf{w} = v_{k,i}\) and \(u_{l,j}\) as in Theorem 2.3.15. Assume that \(\big|\frac{1}{\sigma_{k,i}^g} + x_{2k-1}\big| \ge \tau\) and \(\kappa_{l,j}^d = O(1)\); then we have
\[
\langle v_{k,i}, u_{l,j}\rangle^2 \le N^{-1+C\varepsilon_1}.
\]
Examples. We consider a few examples to explain our results in detail. We first provide two types of conditions on \(\Sigma_b\) that verify Assumption 2.3.3. They can be found in [74, Examples 2.8 and 2.9].
Condition 2.3.18. We suppose that \(n\) is fixed and that there are only \(n\) distinct eigenvalues of \(\Sigma_b\). We further assume that \(\sigma_1^b, \cdots, \sigma_n^b\) and \(N\pi_b(\sigma_1^b), \cdots, N\pi_b(\sigma_n^b)\) all converge in \((0, \infty)\) as \(N \to \infty\). We also assume that the critical points of \(\lim_N f\) are non-degenerate, and that \(\lim_N a_i > \lim_N a_{i+1}\) for \(i = 1, 2, \cdots, 2p - 1\).
Condition 2.3.19. We suppose that \(c \ne 1\), that \(\pi_b\) is supported in some interval \([a, b] \subset (0, \infty)\), and that \(\pi_b\) converges weakly to some measure \(\pi_b^\infty\) that is absolutely continuous and whose density satisfies \(\tau \le d\pi_b^\infty(E)/dE \le \tau^{-1}\) for \(E \in [a, b]\). In this case, \(p = 1\).
In all the examples, we only derive the results for the eigenvalues and leave the discussion and interpretation of the eigenvectors to the reader. We first provide two examples satisfying Condition 2.3.18.

Example 2.3.20 (BBP transition). We suppose that \(r = 1\) and let \(\sigma_i^b = 1\), \(i = 1, 2, \cdots, M\). We assume that \(c > 1\). In this case, we can instead use \(f \equiv f_D\) defined in (2.3.6), where we have
\[
f(x) = -\frac{1}{x} + \frac{1}{c(x + 1)}.
\]
It can be easily checked that the critical points of f(x) are −√c√c−1
, −√c√c+1
, which implies
that p = 1. By (2.3.22), the convergent limit of the largest eigenvalue is
µ = f(− 1
d+ 1) = 1 + d+ c−1(1 + d−1),
and the phase transition happens when
− 1
1 + d> −
√c√
c+ 1⇒ d > c−1/2.
And the local convergence result reads as
∣∣∣∣µ− f(− 1
d+ 1)
∣∣∣∣ ≤ N−1/2+Cε0(d− c−1/2)1/2.
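As a quick numerical illustration (a sketch only, assuming standard Gaussian entries and the normalization of this section, with $X$ of size $M \times N$ and $M = N/c$), the largest sample eigenvalue can be compared with the predicted limit $1 + d + c^{-1}(1 + d^{-1})$:

```python
import numpy as np

def predicted_outlier(d, c):
    # mu = f(-1/(1+d)) = 1 + d + c^{-1}(1 + d^{-1}), valid for d > c^{-1/2}
    return 1.0 + d + (1.0 + 1.0 / d) / c

def largest_eigenvalue(d, c, N, seed=0):
    # Spiked sample covariance Q_g = Sigma^{1/2} X X^* Sigma^{1/2},
    # with Sigma = diag(1 + d, 1, ..., 1) and M = N / c.
    rng = np.random.default_rng(seed)
    M = int(N / c)
    sigma = np.ones(M)
    sigma[0] = 1.0 + d
    X = rng.standard_normal((M, N)) / np.sqrt(N)
    Y = np.sqrt(sigma)[:, None] * X
    return np.linalg.eigvalsh(Y @ Y.T)[-1]

mu_hat = largest_eigenvalue(d=2.0, c=2.0, N=2000)
mu_pred = predicted_outlier(d=2.0, c=2.0)   # 3.75 for d = 2, c = 2
```

Below the threshold $d \le c^{-1/2}$ the largest eigenvalue instead sticks to the bulk edge $f(-\sqrt c/(\sqrt c + 1))$.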
Example 2.3.21 (Spiked model with multiple bulk components). Consider the $M \times N$ sample covariance matrix with population covariance matrix defined by
$$\Sigma_g = \operatorname{diag}\big(35, \underbrace{18, \cdots, 18}_{M/2-1 \text{ times}}, 4, \underbrace{1, \cdots, 1}_{M/2-1 \text{ times}}\big). \qquad (2.3.33)$$
We assume $c = 2$, and then we have
$$f(x) = -\frac{1}{x} + \frac{1}{4}\Big(\frac{1}{x + 1/18} + \frac{1}{x+1}\Big).$$
Furthermore, $f$ has four critical points, approximately $-2.3926$, $-0.62575$, $-0.11133$ and $-0.037035$; here $p = 2$. Since
$$f\Big(-\frac{1}{35}\Big) = 44.522 > f(-0.037035) = 40.759, \qquad f\Big(-\frac14\Big) = 3.0476 > f(-0.62575) = 1.827,$$
we find that there are two outliers, $f(-\frac{1}{35})$ and $f(-\frac14)$. Similarly, we can derive the local convergence results.
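The critical points quoted above can be recovered numerically. A minimal sketch (illustration only): scan for sign changes of $f'$ on grids between the poles at $0$, $-1/18$ and $-1$, then refine by bisection.

```python
import numpy as np

def f(x):
    # f for Example 2.3.21 with c = 2
    return -1.0 / x + 0.25 * (1.0 / (x + 1.0 / 18.0) + 1.0 / (x + 1.0))

def fprime(x):
    return 1.0 / x**2 - 0.25 * (1.0 / (x + 1.0 / 18.0)**2 + 1.0 / (x + 1.0)**2)

def sign_change_roots(grid):
    # Bisection on every sign change of f' between consecutive grid points.
    fp = fprime(grid)
    roots = []
    for k in np.flatnonzero(fp[:-1] * fp[1:] < 0):
        lo, hi = grid[k], grid[k + 1]
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if fprime(lo) * fprime(mid) <= 0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

# Search between the poles of f at -1, -1/18 and 0.
crit = sorted(
    sign_change_roots(np.linspace(-5.0, -1.001, 40001))
    + sign_change_roots(np.linspace(-0.999, -1.0 / 18.0 - 1e-4, 40001))
    + sign_change_roots(np.linspace(-1.0 / 18.0 + 1e-4, -1e-4, 40001))
)
```

The four roots reproduce the critical points above, and evaluating $f$ at $-1/35$ and $-1/4$ confirms the two outliers.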
Next we provide two examples satisfying Condition 2.3.19, where there exists only one bulk component.

Example 2.3.22 (Spiked model with uniformly distributed eigenvalues). Consider an $M \times N$ sample covariance matrix with population covariance matrix defined by
$$\Sigma_g = \operatorname{diag}(8, 2.9975, 1.995, \cdots, 1.005, 1.0025).$$
The limiting spectral distribution of $\Sigma_b$ is the uniform distribution on the interval $[1,3]$. Let $c = 2$; we can use $f \equiv f_D$ defined in (2.3.6), where
$$f(x) = -\frac{1}{2x} - \frac{1}{4x^2}\log\frac{3x+1}{x+1},$$
with critical points approximately $-2.0051405$ and $-0.2513025$. Therefore the left and right edges are
$$f(-2.0051405) \approx 0.1494, \qquad f(-0.2513025) \approx 6.3941,$$
and the outlier is $f(-\frac18) \approx 9.3836 > 6.3941$.
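The quoted numbers are easy to reproduce. The sketch below evaluates $f$ at the critical points and at the outlier location, then runs a small simulation; the equispaced grid on $[1,3]$ is an assumption standing in for the elided population eigenvalues (one concrete realization of the uniform limit).

```python
import numpy as np

def f(x):
    # f for Example 2.3.22 (uniform limit on [1, 3], c = 2)
    return -1.0 / (2.0 * x) - np.log((3.0 * x + 1.0) / (x + 1.0)) / (4.0 * x**2)

left_edge = f(-2.0051405)    # ~ 0.1494
right_edge = f(-0.2513025)   # ~ 6.3941
outlier = f(-1.0 / 8.0)      # ~ 9.3836

# Small simulation: spike 8 plus an equispaced grid on [1, 3].
rng = np.random.default_rng(1)
M, N = 500, 1000             # c = N / M = 2
sigma = np.concatenate([[8.0], 1.0 + 2.0 * (np.arange(M - 1) + 0.5) / (M - 1)])
X = rng.standard_normal((M, N)) / np.sqrt(N)
Y = np.sqrt(sigma)[:, None] * X
evals = np.linalg.eigvalsh(Y @ Y.T)
```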
Example 2.3.23 (Spiked Toeplitz matrix). Suppose that we observe an $M \times N$ sample covariance matrix with a Toeplitz population covariance matrix whose $(i,j)$-th entry is $0.4^{|i-j|}$ and whose spike is located at $10$. We choose $c = 2$, and $f$ can be written as
$$f(x) = -\frac{1}{2x} - \frac{1}{3.8092\,x^2}\log\Big(\frac{2.332x+1}{0.4286x+1}\Big),$$
where the interval $[0.4286, 2.332]$ is approximately the support of the population eigenvalues. The critical points are approximately $-0.333552$ and $-3.61753$. Therefore the left and right edges are
$$f(-3.61753) \approx 0.0859, \qquad f(-0.333552) \approx 4.3852,$$
and the outlier is $f(-\frac{1}{10}) \approx 10.8221 > 4.3852$.
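The support of the population eigenvalues follows from the symbol of the Toeplitz matrix, $g(\theta) = \sum_k 0.4^{|k|}e^{ik\theta} = 0.84/(1.16 - 0.8\cos\theta)$, whose extremes are $0.84/1.96 \approx 0.4286$ and $0.84/0.36 \approx 2.3333$. A short check (illustration only; the constants in $f$ are the rounded values quoted above):

```python
import numpy as np

M = 400
idx = np.arange(M)
T = 0.4 ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz entries 0.4^{|i-j|}
evals = np.linalg.eigvalsh(T)

# Extremes of the symbol g(theta) = 0.84 / (1.16 - 0.8 cos(theta)).
g_min, g_max = 0.84 / 1.96, 0.84 / 0.36

def f(x):
    # f with the rounded constants quoted in the example
    return (-1.0 / (2.0 * x)
            - np.log((2.332 * x + 1.0) / (0.4286 * x + 1.0)) / (3.8092 * x**2))
```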
Statistical applications. Now we use some concrete examples to explain how our results can be applied in high dimensional statistics.

Optimal shrinkage of eigenvalues. Donoho, Gavish and Johnstone [45] propose a framework to compute the optimal shrinkage of eigenvalues for a spiked covariance matrix model, where the outliers are assumed to lie on the right of the unique bulk component and $\sigma^b_i = 1$, $i = 1,2,\cdots,M$. They shrink the outlier eigenvalues using some nonlinear function and keep the bulk eigenvalues as ones via the rank-aware shrinkage rule. To extend their results, suppose that we want to estimate $\Sigma_g$ using $\widehat\Sigma_g := \sum_{i=1}^M \widehat\sigma^g_i u_i u_i^*$ under the rank-aware rule, where
$$\widehat\Sigma_g := \sum_{i \in \mathcal I_+} \beta(\mu_i)u_i u_i^* + \sum_{i \notin \mathcal I_+} \widehat\sigma^g_i u_i u_i^*, \qquad (2.3.34)$$
where $\widehat\sigma^g_i = \widehat\sigma^b_i$, $i \notin \mathcal I_+$, can be efficiently estimated using the spectrum estimation method, and $\beta$ is some nonlinearity. Our task is to find the optimal $\beta$.

The main conclusion is that the optimal $\beta$ depends strongly on the loss function. An advantage of the rank-aware rule (2.3.34) is that a rich class of loss functions decomposes well, so the problem of finding $\beta$ reduces to optimizing a low dimensional loss function. Theorems 2.3.10 and 2.3.13 can be used to improve their results. In detail, we need to modify the computation procedure:

1. Estimate the bulk spectrum and derive the form of $f$.

2. Calculate $l(\mu) = -\frac{1}{f^{-1}(\mu)}$.

3. Calculate $c(\mu) = \frac{1}{l(\mu)}\frac{f'(-1/l(\mu))}{f(-1/l(\mu))}$.

4. Calculate $s(\mu) = s(l(\mu))$ using $s(l) = \sqrt{1 - c^2(l)}$.

As the optimal $\beta$ depends only on $l(\mu)$, $c(\mu)$ and $s(\mu)$, we can substitute $l(\mu)$, $c(\mu)$ and $s(\mu)$ into the formulas of $\beta$ under the 26 loss functions. For instance, we list 10 different loss functions and their corresponding optimal $\beta$ in Table 2.2. It is notable that, as we will see from (2.3.41), the derivative of $f$ can be written as a sum over the $\sigma^b_i$, which reduces the computational burden. In Figure 2.5, we show the shrinkers computed from different loss functions for Examples 2.3.21 and 2.3.22.
Oracle estimation under Frobenius norm. If we have no prior information whatsoever on the true eigenbasis of the covariance matrix, the natural choice is to use the sample eigenvectors. We therefore seek a diagonal matrix $D$ minimizing $L(\Sigma_g, UDU^*)$, where $L(\cdot,\cdot)$ is some loss function; our estimator is then $\widehat\Sigma_g = UDU^*$. Taking the Frobenius norm as our loss function, we have
$$\|\Sigma_g - \widehat\Sigma_g\|_F^2 = \big(\|\Sigma_g\|_F^2 - \|\operatorname{diag}(U^*\Sigma_g U)\|_F^2\big) + \|D - \operatorname{diag}(U^*\Sigma_g U)\|_F^2. \qquad (2.3.35)$$
The first part of the right-hand side of (2.3.35) does not depend on $D$, hence we should take $D = \operatorname{diag}(U^*\Sigma_g U)$; the resulting estimator is called the oracle estimator because it involves the unknown $\Sigma_g$. The oracle estimator is a better choice than simply using the sample covariance matrix $\mathcal Q_g$. Unlike the rank-aware approach in the previous example, we consider shrinkage of all the eigenvalues, where
$$d_i \equiv \beta(\mu_i) := u_i^* \Sigma_g u_i, \qquad i = 1,2,\cdots,M. \qquad (2.3.36)$$
We assume that $\mathcal I = \mathcal I_+$ and rewrite (2.3.36) as
$$d_i = \sum_{j \in \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 + \sum_{j \notin \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2. \qquad (2.3.37)$$
For $i \in \mathcal I_+$, by (2.3.28), the first part of the right-hand side of (2.3.37) satisfies
$$\sum_{j \in \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 \to \frac{f'(-1/\sigma^g_i)}{f(-1/\sigma^g_i)}. \qquad (2.3.38)$$
For the second part of (2.3.37), we have
$$\sum_{j \notin \mathcal I_+} \sigma^g_j \langle v_j, u_i\rangle^2 \to \frac{1}{N}\frac{(\sigma^g_i)^2}{f(-1/\sigma^g_i)}\sum_{j=1}^M \frac{(\sigma^b_j)^2}{(\sigma^g_i - \sigma^b_j)^2}. \qquad (2.3.39)$$
Meanwhile, inserting $-\frac{1}{\sigma^g_i}$ into (2.3.7), we get
$$f\Big(-\frac{1}{\sigma^g_i}\Big) = \sigma^g_i + \frac{1}{N}\sum_{j=1}^M \frac{1}{-(\sigma^g_i)^{-1} + (\sigma^b_j)^{-1}}. \qquad (2.3.40)$$
Differentiating both sides of (2.3.40) with respect to $\sigma^g_i$, we get
$$f'\Big(-\frac{1}{\sigma^g_i}\Big)(\sigma^g_i)^{-2} = 1 - \frac{1}{N}\sum_{j=1}^M \frac{(\sigma^b_j)^2}{(\sigma^g_i - \sigma^b_j)^2}. \qquad (2.3.41)$$
Therefore, by (2.3.37), (2.3.38), (2.3.39) and (2.3.41), when $i \in \mathcal I_+$ we have
$$d_i \to \frac{(\sigma^g_i)^2}{f(-1/\sigma^g_i)}. \qquad (2.3.42)$$
When $i \notin \mathcal I_+$, by Theorem 2.3.11, we can use the estimator derived in [77], where $d_i$ satisfies
$$d_i \to \frac{1}{\mu_i |\lim_{\eta \to 0} m_D(\mu_i + i\eta)|^2}. \qquad (2.3.43)$$
Under a mild assumption on the convergence rate of $\pi^b \to \pi^b_\infty$, we can replace $m_D(\mu_i + i\eta)$ with $m(\mu_i + i\eta)$; then we have $\big|\frac{1}{N}\operatorname{Tr} G(\mu_i + i\eta) - m(\mu_i + i\eta)\big| \le \frac{1}{N\eta}$. Therefore, the oracle estimator can be written as $\widehat\Sigma_g = UDU^*$, where $D = \operatorname{diag}(d_1, \cdots, d_M)$ and the $d_i$, $i = 1,2,\cdots,M$, are defined as
$$d_i := \frac{1}{\mu_i |m(\mu_i + iN^{-1/2})|^2}, \qquad m(\mu_i + iN^{-1/2}) = \frac{1}{N}\sum_{k=1}^N \frac{1}{\mu_k - \mu_i - iN^{-1/2}}. \qquad (2.3.44)$$
Note that the estimator (2.3.44) can be regarded as a nonlinear shrinkage of the sample eigenvalues. This can be understood as follows: denote $m_1(z) = \frac{1}{M}\operatorname{Tr} G_1(z)$, where $G_1(z)$ is the Green function of $\mathcal Q_b$. It is easy to check that
$$m(z) = \frac{c_N^{-1} - 1 + c_N^{-1} z m_1(z)}{z}.$$
As a consequence, we can rewrite
$$d_i = \frac{\mu_i}{|1 - c_N^{-1} - c_N^{-1}\mu_i m_1(\mu_i + iN^{-1/2})|^2}.$$
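A direct implementation of (2.3.44) is straightforward. In the sketch below (an illustration only, with standard Gaussian entries) the companion spectrum consists of the eigenvalues of $\mathcal Q_g$ padded with $N - M$ zeros. As a sanity check, when $\Sigma_g = I$ the oracle values $d_i = u_i^*\Sigma_g u_i$ equal one, so the estimated $d_i$ should be close to one for bulk eigenvalues:

```python
import numpy as np

def oracle_shrinkers(mu_g, N):
    """Estimator (2.3.44): d_i = 1 / (mu_i |m(mu_i + i N^{-1/2})|^2).
    mu_g: nonzero sample eigenvalues of Q_g (length M <= N)."""
    mu_all = np.concatenate([mu_g, np.zeros(N - len(mu_g))])  # companion spectrum
    z = mu_g + 1j * N ** (-0.5)
    m = np.mean(1.0 / (mu_all[None, :] - z[:, None]), axis=1)
    return 1.0 / (mu_g * np.abs(m) ** 2)

rng = np.random.default_rng(3)
M, N = 200, 400
X = rng.standard_normal((M, N)) / np.sqrt(N)
mu_g = np.linalg.eigvalsh(X @ X.T)          # Sigma_g = I here
d = oracle_shrinkers(mu_g, N)
bulk = d[M // 4: 3 * M // 4]                # stay away from the spectral edges
```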
Factor model based estimation. Factor models have been used heavily in empirical finance. In these applications, financial stocks share the same market risks and hence their returns can be highly correlated. The cross-sectional units are modeled using a few common factors [57]:
$$Y_{it} = b_i^* f_t + u_{it}, \qquad (2.3.45)$$
where $Y_{it}$ is the return of the $i$-th stock at time $t$, $b_i$ is a vector of factor loadings, $f_t$ is a $K \times 1$ vector of latent common factors and $u_{it}$ is the idiosyncratic component, which is uncorrelated with $f_t$. In matrix form, (2.3.45) reads $Y_t = Bf_t + u_t$. For identifiability, we impose the following constraints [58]: $\operatorname{Cov}(f_t) = I_K$ and the columns of $B$ are orthogonal. As a consequence, the population covariance matrix can be written as
$$\Sigma = BB^* + \Sigma_u. \qquad (2.3.46)$$
The model (2.3.46) can be written in our general covariance matrix model (2.3.13) by letting $\Sigma_b = \Sigma_u$ and $BB^* = VDV^*\Sigma_b$. The estimator can be computed via the least squares optimization
$$\arg\min_{B, F} \|Y - BF^*\|_F^2, \qquad N^{-1}F^*F = I_K, \ B^*B \text{ diagonal}. \qquad (2.3.47)$$
The least squares estimator for $B$ is $\widehat\Lambda = N^{-1}Y\widehat F$, where the columns of $\widehat F$ satisfy that $N^{-1/2}\widehat F_k$ is the eigenvector corresponding to the $k$-th largest eigenvalue of $Y^*Y$, $k = 1, \cdots, K$. Under some mild conditions, Fan, Liao and Mincheva [58] showed that $BB^*$ corresponds to the spiked part whereas $\Sigma_u$ corresponds to the bulk. In most applications, $\Sigma_u$ is assumed to have some sparse structure. Hence, the estimator can be written as
$$\widehat\Sigma = \widehat\Lambda_K \widehat\Lambda_K^* + \widehat\Sigma_u,$$
where $\widehat\Lambda_K$ collects the columns corresponding to the factors and $\widehat\Sigma_u$ is estimated using some thresholding method applied to the residuals.
Figure 2.3: Estimation loss using the factor model (Frobenius loss against dimension $M$). We simulate the estimation error under the Frobenius norm for Example 2.3.21, where the blue line stands for the sample covariance matrix estimator, red dots for our Multi-POET estimator and magenta dots for the POET estimator. We find that using information from the population covariance matrix improves the inference results.
Preliminaries. This section introduces the tools for our proofs. We start with some notation and definitions. For fixed small constants $\tau, \tau' > 0$, we define the domains
$$S \equiv S(\tau, N) := \{z \in \mathbb C_+ : |z| \ge \tau, \ |E| \le \tau^{-1}, \ N^{-1+\tau} \le \eta \le \tau^{-1}\}, \qquad (2.3.48)$$
$$S^e_i \equiv S^e_i(\tau', \tau, N) := \{z \in S : E \in [a_i - \tau', a_i + \tau']\}, \quad i = 1,2,\cdots,2p, \qquad (2.3.49)$$
$$S^b_i \equiv S^b_i(\tau', \tau, N) := \{z \in S : E \in [a_{2i} + \tau', a_{2i-1} - \tau']\}, \quad i = 1,2,\cdots,p, \qquad (2.3.50)$$
$$S^o \equiv S^o(\tau, \tau', N) := \{z \in S : \operatorname{dist}(E, \operatorname{Supp}(\rho)) \ge \tau'\}. \qquad (2.3.51)$$
It is notable that $S = \bigcup_{i=1}^{2p} S^e_i \cup \bigcup_{i=1}^p S^b_i \cup S^o$. Recall that the empirical spectral distribution (ESD) of an $N \times N$ symmetric matrix $H$ is defined as
$$F_H^{(N)}(\lambda) := \frac{1}{N}\sum_{i=1}^N \mathbf 1_{\{\lambda_i(H) \le \lambda\}}.$$

Definition 2.3.24 (Stieltjes transform). The Stieltjes transform of the ESD of $\mathcal Q_{1b}$ is given by
$$m(z) \equiv m^{(N)}(z) := \int \frac{1}{x - z}\, dF^{(N)}_{\mathcal Q_{1b}}(x) = \frac{1}{N}\operatorname{Tr} G(z), \qquad (2.3.52)$$
where we recall that $G(z)$ is the Green function of $\mathcal Q_{1b}$.
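The equality in (2.3.52) between the integral form and the normalized trace of the Green function is immediate to check numerically (a generic symmetric matrix stands in for $\mathcal Q_{1b}$ here):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N))
H = (A + A.T) / np.sqrt(2 * N)                # generic symmetric matrix
z = 0.3 + 0.1j                                # spectral parameter in C_+

G = np.linalg.inv(H - z * np.eye(N))          # Green function G(z) = (H - z)^{-1}
m_trace = np.trace(G) / N                     # (1/N) Tr G(z)
lam = np.linalg.eigvalsh(H)
m_esd = np.mean(1.0 / (lam - z))              # int (x - z)^{-1} dF_H(x)
```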
We further denote
$$\kappa_i(x) := |x - f(x_{2i-1})|, \qquad \kappa^d_{i,j} := N^{-2/3}\big(j \wedge (N_i + 1 - j)\big)^{2/3}. \qquad (2.3.53)$$
We will see from Lemma 2.3.34 that $\kappa^d_{i,j}$ is a deterministic version of $\kappa_i(\mu_{i,j})$. The following lemma determines the locations of the outlier eigenvalues of $\mathcal Q_g$. Denote by $G_b(z)$ the Green function of $\mathcal Q_b$ and by $\sigma(\mathcal Q_g)$ the spectrum of $\mathcal Q_g$.
Lemma 2.3.25. $\mu \in \sigma(\mathcal Q_g) \setminus \sigma(\mathcal Q_b)$ if and only if
$$\det\big((1 + \mu G_b(\mu))VDV^* + I\big) = 0. \qquad (2.3.54)$$
Recall (2.3.14); as $D$ is not invertible, we instead use the following corollary.

Corollary 2.3.26. $\mu \in \sigma(\mathcal Q_g) \setminus \sigma(\mathcal Q_b)$ if and only if
$$\det\big(D_o^{-1} + I + \mu V_o^* G_b(\mu) V_o\big) = 0. \qquad (2.3.55)$$
Next we introduce the local deformed MP law for $\mathcal Q_b$. Denote
$$\Psi(z) := \sqrt{\frac{\operatorname{Im} m(z)}{N\eta}} + \frac{1}{N\eta}.$$
Lemma 2.3.27. Fix $\tau > 0$. For the sample covariance matrix $\mathcal Q_b$ defined in (2.3.2) satisfying (2.3.4), suppose that Assumption 2.3.3 holds. Then for any unit vectors $u, v \in \mathbb R^M$, with probability $1 - N^{-D_1}$ we have, uniformly in $z \in S$,
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{\varepsilon_1}\Psi(z).$$
Furthermore, outside the spectrum, when $z \in S^e_o$, where $S^e_o$ is defined as
$$S^e_o := \{z \in S : \operatorname{dist}(E, \operatorname{Supp}(\rho)) \ge N^{-2/3+\tau}, \ |z| \le \tau^{-1}\}, \qquad (2.3.56)$$
we have, uniformly in $z \in S^e_o$,
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{\varepsilon_1}\sqrt{\frac{\operatorname{Im} m(z)}{N\eta}}.$$
The following lemma summarizes the results when $z$ is restricted to the real axis.

Lemma 2.3.28. For $z \in S^e_o \cap \mathbb R$ and any unit vectors $u, v \in \mathbb R^M$, with probability $1 - N^{-D_1}$ we have
$$\Big|\Big\langle u, \Sigma_b^{-1/2}\Big(G_b(z) + \frac{1}{z(1 + m(z)\Sigma_b)}\Big)\Sigma_b^{-1/2} v\Big\rangle\Big| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},$$
where $\kappa := \min_{1 \le i \le p} |E - f(x_{2i-1})|$.
Next we extend Weyl's interlacing theorem to our setting. We first discuss the case $r = 1$; the general case is a corollary. Denote by $G_g(z)$ the Green function of $\mathcal Q_g$.

Lemma 2.3.29. Let $r = 1$ in (2.3.11) and assume that the outlier is associated with the $k$-th bulk component, $k = 1,2,\cdots,p$. Recalling (2.3.24), define $s_k = \sum_{i=1}^{k-1} N_i$. Under Assumption 2.3.6, we have
$$\lambda_{s_k} \ge \mu_{s_k+1} \ge \lambda_{s_k+1} \ge \cdots \ge \lambda_{s_k+N_k}. \qquad (2.3.57)$$
It is easy to deduce the following corollary for the rank $r$ case.

Corollary 2.3.30. For the rank $r$ model defined in (2.3.11), we have
$$\mu_{s_k+i} \in [\lambda_{k,i+r_k}, \lambda_{k,i-r_k}], \qquad 1 \le i \le N_k,$$
where we use the convention $\lambda_{k,i-r_k} := +\infty$ when $i - r_k < 1$.

Under Assumptions 2.3.3 and 2.3.6, the convention can instead be taken as $\lambda_{k,i-r_k} = f(x_{2k-2})$ when $i - r_k < 1$. The following lemma establishes the connection between the Green functions of $\mathcal Q_b$ and $\mathcal Q_g$; it provides the key expression for analyzing the eigenvectors.
Lemma 2.3.31.
$$V_o^* G_g(z) V_o = \frac{1}{z}\bigg[D_o^{-1} - \frac{(1+D_o)^{1/2}}{D_o}\big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big)^{-1}\frac{(1+D_o)^{1/2}}{D_o}\bigg]. \qquad (2.3.58)$$
Lemma 2.3.32. Denote
$$\mathcal B := \Big\{m \in \mathbb R : m \neq 0, \ -\frac{1}{m} \notin \operatorname{Supp}(\pi^b_\infty)\Big\};$$
then we have
$$x \notin \operatorname{Supp}(\rho_D) \iff m_D(x) \in \mathcal B \text{ and } f_D'(m_D) > 0, \qquad (2.3.59)$$
$$f_D(m_D(z)) = z, \qquad m_D(f_D(z)) = z. \qquad (2.3.60)$$
Similar results hold for $f$, $m$ and $\rho$.
Next we collect the properties of $m(z)$ in the following lemma; its proof can be found in [74, Lemmas A.4 and A.5].

Lemma 2.3.33. For $z \in S$, we have
$$\operatorname{Im} m(z) \sim \begin{cases} \sqrt{\kappa + \eta}, & E \in \operatorname{Supp}(\rho); \\ \dfrac{\eta}{\sqrt{\kappa+\eta}}, & E \notin \operatorname{Supp}(\rho), \end{cases} \qquad (2.3.61)$$
and
$$\min_i |m(z) + (\sigma^b_i)^{-1}| \ge \tau, \qquad (2.3.62)$$
where $\tau > 0$ is some constant. Furthermore, if $z \in S^e_i$, we have
$$|m(z) - x_i| \sim \sqrt{\kappa + \eta}. \qquad (2.3.63)$$
We conclude this section by listing two important consequences of Lemma 2.3.27: eigenvalue rigidity and edge universality.

Lemma 2.3.34. Recall (2.3.24). Fix $\tau > 0$; under Assumption 2.3.3 and (2.3.4), for all $i = 1,\cdots,p$ and $j = 1,\cdots,N_i$ satisfying $\gamma_{i,j} \ge \tau$, with probability $1 - N^{-D_1}$ we have
$$|\lambda_{i,j} - \gamma_{i,j}| \le N^{-2/3+\varepsilon_1}\big(j \wedge (N_i + 1 - j)\big)^{-1/3}. \qquad (2.3.64)$$
Furthermore, let $\hat i := \lfloor (i+1)/2 \rfloor$ be the bulk component to which the edge $i$ belongs. For $0 < \tau' < \tau$ and $j = 1,\cdots,N_{\hat i}$ satisfying $\gamma_{\hat i,j} \in [a_i - \tau', a_i + \tau']$, with probability $1 - N^{-D_1}$ we have
$$|\lambda_{\hat i,j} - \gamma_{\hat i,j}| \le N^{-2/3+\varepsilon_1}\big(j \wedge (N_{\hat i} + 1 - j)\big)^{-1/3},$$
and for all $j = 1,2,\cdots,N_i$ satisfying $\gamma_{i,j} \in [a_{2i} + \tau', a_{2i-1} - \tau']$, we have
$$|\lambda_{i,j} - \gamma_{i,j}| \le \frac{N^{\varepsilon_1}}{N}.$$
For $i = 1,\cdots,2p$, define $\varpi_i := (|f''(x_i)|/2)^{1/3}$, and for any fixed $l \in \mathbb N$ and bulk component $i = 1,2,\cdots,p$, define
$$q_{2i-1,l} := \frac{N^{2/3}}{\varpi_{2i-1}}\big(\lambda_{i,1} - a_{2i-1}, \cdots, \lambda_{i,l} - a_{2i-1}\big),$$
$$q_{2i,l} := -\frac{N^{2/3}}{\varpi_{2i}}\big(\lambda_{i,N_i} - a_{2i}, \cdots, \lambda_{i,N_i-l+1} - a_{2i}\big).$$
Then for any fixed continuous bounded function $h \in C_b(\mathbb R^l)$, there exists $b(h,\pi) \equiv b_N(h,\pi)$, depending only on $\pi$, such that $\lim_{N\to\infty}\big(\mathbb E h(q_{i,l}) - b(h,\pi)\big) = 0$.
Eigenvalues. In this section we study the local asymptotics of the eigenvalues of $\mathcal Q_g$ and prove Theorems 2.3.10 and 2.3.11. We first treat the outlier and extremal non-outlier eigenvalues, then the bulk eigenvalues. We always use $C, C_1, C_2$ to denote generic large constants whose values may change from one line to the next.

Outlier eigenvalues. The outlier eigenvalues of $\mathcal Q_g$ are completely characterized by (2.3.55). The proof relies on two main steps: (i) recalling (2.3.13), for a configuration $D$ independent of $N$, establish two permissible regions, $\Gamma(D)$, consisting of $r_+$ components, and $I_0$, such that the outliers of $\mathcal Q_g$ lie in $\Gamma(D)$, each component contains exactly one eigenvalue, and the $r_b$ non-outliers lie in $I_0$; (ii) use a continuity argument to extend the result of (i) to an arbitrary $N$-dependent $D$. We will prove the results by contradiction.
We first observe that for $o_i \in \mathcal I$ (recall (2.3.10)),
$$-\frac{1}{1 + \sigma^b_{o_i} m\big(f(-(\sigma^g_{o_i})^{-1})\big)} = -d_{o_i}^{-1} - 1. \qquad (2.3.65)$$
We also observe that for some small constant $\nu > 0$, when $x \in [x_{2i-1} - \nu, x_{2i-1} + \nu]$, $i = 1,2,\cdots,p$, we have
$$f'(x) = O(|x - x_{2i-1}|), \qquad f(x) - f(x_{2i-1}) = O(|x - x_{2i-1}|^2). \qquad (2.3.66)$$
Proof of Theorem 2.3.10. We first deal with an $N$-independent configuration $D \equiv D(0)$. For $(i,j) \in \mathcal I_+$, denote $I_{i,j} \equiv I_{i,j}(D) = [I^-_{i,j}, I^+_{i,j}]$, where
$$I^\pm_{i,j} := f\Big(-\frac{1}{\sigma^g_{i,j}}\Big) \pm \Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}N^{-1/2+C_1\varepsilon_1}, \qquad (2.3.67)$$
and for $k = 1,\cdots,p$, define $I_k \equiv I_k(D)$ by
$$I_k := [f(x_{2k}) - N^{-2/3+C_2\varepsilon_1}, \ f(x_{2k-1}) + N^{-2/3+C_2\varepsilon_1}],$$
where $C_1, C_2 > 0$ will be specified later. We will show that with probability $1 - N^{-D_1}$, the complement of
$$\mathbf I := \Big(\bigcup_{i,j} I_{i,j}\Big) \cup \Big(\bigcup_k I_k\Big) = S_1 \cup S_2, \qquad (2.3.68)$$
contains no eigenvalue of $\mathcal Q_g$, where $S_1 := \bigcup_{i,j} I_{i,j}$ and $S_2 := \bigcup_k I_k$. This is summarized in the following lemma.

Lemma 2.3.35. Denote by $\sigma_+$ the set of outlier eigenvalues of $\mathcal Q_g$ associated with $\mathcal I_+$; then we have
$$\sigma_+ \subset S_1. \qquad (2.3.69)$$
Moreover, each interval $I_{i,j}$ contains precisely one eigenvalue of $\mathcal Q_g$. Furthermore, we have
$$\sigma_b \subset S_2, \qquad (2.3.70)$$
where $\sigma_b$ is the set of extremal non-outliers associated with $\mathcal I \setminus \mathcal I_+$.
Proof. We first assume that
$$2 < C_2 < C_1, \qquad C_1^2\varepsilon_1 < \varepsilon_0. \qquad (2.3.71)$$
Then, by (2.3.59), (2.3.66) and Lemma 2.3.34, with probability $1 - N^{-D_1}$ we have $S_1 \cap S_2 = \emptyset$. For $k = 1,2,\cdots,p$, denote
$$L^+_k := f\Big(-\frac{1}{\sigma^g_{k,r^+_k}}\Big) - N^{-1/2+C_1\varepsilon_1}\Big(-\frac{1}{\sigma^g_{k,r^+_k}} - x_{2k-1}\Big)^{1/2}.$$
We only prove the claim for the first bulk component; all the others follow by induction. For any $x > L^+_1$ it is easy to check, using (2.3.71) and Lemma 2.3.34, that $x \notin \sigma(\mathcal Q_b)$. Under the assumption (2.3.4), with probability $1 - N^{-D_1}$ we have $\mu_1(\mathcal Q_g) \le C$ for some large constant $C$. Recalling (2.3.10), by Lemma 2.3.28, with probability $1 - N^{-D_1}$ we have
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) = \prod_{i=1}^r\Big(d_{o_i}^{-1} + 1 - \frac{1}{1 + m(x)\sigma^b_{o_i}}\Big) + O\big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4}\big),$$
where we use the fact that $r$ is finite. First, suppose that $x \in (L^+_1, C) \setminus \bigcup_j I_{1,j}$. By (2.3.65), (2.3.66), Lemma 2.3.33 and the inverse function theorem, it is easy to conclude that
$$\prod_{i=1}^r\Big(d_{o_i}^{-1} + 1 - \frac{1}{1 + m(x)\sigma^b_{o_i}}\Big) \ge N^{-1/2+(C_1-1)\varepsilon_1}\kappa_x^{-1/4}. \qquad (2.3.72)$$
Hence $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular provided $C_1 > 2$, which implies that $x$ is not an eigenvalue of $\mathcal Q_g$ by Corollary 2.3.26. Secondly, we consider the extremal non-outlier eigenvalues. Using Corollary 2.3.30, with probability $1 - N^{-D_1}$ we find that
$$\mu_{1,j} \ge f(x_1) - N^{-2/3+C_3\varepsilon_1}, \qquad j = r^+_1, \cdots, r_1,$$
where we use Lemma 2.3.34 and $C_3 > 0$ is some constant. We may assume that $L^+_1 \ge f(x_1) + N^{-2/3+C_3\varepsilon_1}$; otherwise the proof is already done. When $x \notin I_1$, i.e. $x \in (f(x_1) + N^{-2/3+C_3\varepsilon_1}, L^+_1)$, a control analogous to (2.3.72) is not available, and we need a different strategy. Denote
$$z := x + iN^{-2/3-\varepsilon_2}, \qquad 0 < \varepsilon_2 < \varepsilon_1.$$
As $r$ is finite, it is easily checked by the spectral decomposition (see the proof of [35, Lemma 5.3]) that there exists some constant $C > 0$ such that
$$\|xV_o^* G_b(x)V_o - zV_o^* G_b(z)V_o\| \le C\max_i \operatorname{Im}\langle v_i, G_b(z)v_i\rangle. \qquad (2.3.73)$$
By Lemma 2.3.27, with probability $1 - N^{-D_1}$ we have
$$xV_o^* G_b(x)V_o = -\frac{1}{1 + m(z)D_o^b} + O\big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4}\big).$$
As a consequence,
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) = O\Big(N^{-1/2+\varepsilon_1}\kappa_x^{-1/4} + \max_{1 \le j \le r^+_1}\big|m(z) + (\sigma^g_{1,j})^{-1}\big|\Big).$$
The claim then follows from
$$\max_{1 \le j \le r^+_1}\big|m(z) + (\sigma^g_{1,j})^{-1}\big| = \max_{1 \le j \le r^+_1}\big|m(z) - x_1 + x_1 + (\sigma^g_{1,j})^{-1}\big| \ge N^{-1/3+C_3\varepsilon_1},$$
where we use Lemma 2.3.33. Hence $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular provided $C_3 > 1$.
Similarly, for all $k = 1,2,\cdots,p$, we can show by induction that when $x \in (L^+_k, C) \setminus S_1$ and $x \notin S_2$, the matrix $D_o^{-1} + 1 + xV_o^* G_b(x)V_o$ is regular. Lemma 2.3.35 will be proved once we show that each interval $I_{i,j}$ contains precisely one eigenvalue of $\mathcal Q_g$. Let $(i,j) \in \mathcal I_+$ and pick a small $N$-independent counterclockwise (positively oriented) contour $\mathcal C \subset \mathbb C \setminus \bigcup_{k=1}^p [a_{2k}, a_{2k-1}]$ that encloses $f(-\frac{1}{\sigma^g_{i,j}})$ but no other point. For large enough $N$, define
$$F(z) := \det\big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big), \qquad G(z) := \det\Big(D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}\Big).$$
$F(z)$ and $G(z)$ are holomorphic on and inside $\mathcal C$, and $G(z)$ has precisely $r_+$ zeros $f(-\frac{1}{\sigma^g_{i,j}})$ inside $\mathcal C$. On $\mathcal C$, it is easy to check that for some constant $\delta > 0$,
$$\min_{z \in \mathcal C}|G(z)| \ge \delta > 0, \qquad |G(z) - F(z)| \le N^{-1/2+\varepsilon_1}\kappa^{-1/4},$$
where we use Lemma 2.3.27. The claim hence follows from Rouché's theorem.
In a second step, we extend the proof to any configuration $D(1)$ depending on $N$ via a continuity (bootstrap) argument along a continuous path. We first deal with (2.3.22). As $r$ is finite, we can choose a path $(D(t) : 0 \le t \le 1)$ connecting $D(0)$ and $D(1)$ with the following properties:

(i) For all $t \in [0,1]$, recalling (2.3.14) and (2.3.16), for $(i,j) \in \mathcal I_+$ we have $\sigma^g_{i,j}(t) \in O^+_i$, $i = 1,2,\cdots,p$, $j = 1,2,\cdots,r^+_i$, where $\sigma^g_{i,j}(t) = (1 + d_{i,j}(t))\sigma^b_{i,j}$.

(ii) For the $i$-th bulk component, $i = 1,2,\cdots,p$, if $I_{i,j_1}(D(1)) \cap I_{i,j_2}(D(1)) = \emptyset$ for a pair $1 \le j_1 < j_2 \le r^+_i$, then $I_{i,j_1}(D(t)) \cap I_{i,j_2}(D(t)) = \emptyset$ for all $t \in [0,1]$.

Recalling (2.3.15), denote $\mathcal Q_g(t) := \Sigma_g^{1/2}(t)XX^*\Sigma_g^{1/2}(t)$, where $\Sigma_g(t) = (1 + VD(t)V^*)\Sigma_b$. As the mapping $t \to \mathcal Q_g(t)$ is continuous, $\mu_{i,j}(t)$ is continuous in $t \in [0,1]$ for all $(i,j)$, where the $\mu_{i,j}(t)$ are the eigenvalues of $\mathcal Q_g(t)$. Moreover, by Lemma 2.3.35, we have
$$\sigma_+(\mathcal Q_g(t)) \subset S_1(t), \qquad \forall\, t \in [0,1]. \qquad (2.3.74)$$
We focus on the $i$-th bulk component. In the case when the $r^+_i$ intervals are disjoint, we have
$$\mu_{i,j}(t) \in I_{i,j}(D(t)), \qquad t \in [0,1],$$
where we use property (ii) of the continuous path, (2.3.74) and the continuity of $\mu_{i,j}(t)$. In particular, this holds for $D(1)$. Now consider the case when the intervals are not disjoint. Recalling (2.3.17), denote by $\mathcal B$ the partition of $\mathcal I^+_i$ induced by the equivalence relation
$$j_1 \equiv j_2 \ \text{ if } \ I_{i,j_1}(D(1)) \cap I_{i,j_2}(D(1)) \neq \emptyset.$$
We can therefore decompose $\mathcal B = \cup_j B_j$; it is notable that each $B_j$ contains a sequence of consecutive integers. Choose any $s \in B_j$; without loss of generality, assume $s$ is not the smallest element of $B_j$. Since the intervals are not disjoint, for some constant $C > 0$ we have
$$-\frac{1}{\sigma^g_{i,s-1}} + \frac{1}{\sigma^g_{i,s}} \le 2N^{-1/2+C\varepsilon_1}\Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{-1/2},$$
where we use the fact that $f''(x) \ge 0$ when $x$ is close to the right edge of each bulk component, together with (2.3.66) and (2.3.67). This yields that
$$\Big(-\frac{1}{\sigma^g_{i,s-1}} - x_{2i-1}\Big)^{1/2} \le \Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{1/2}\Bigg(1 + \frac{-\frac{1}{\sigma^g_{i,s-1}} + \frac{1}{\sigma^g_{i,s}}}{-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}}\Bigg) \le \Big(-\frac{1}{\sigma^g_{i,s}} - x_{2i-1}\Big)^{1/2}(1 + o(1)).$$
Therefore, by repeating the process for the remaining $s \in B_j$, we find
$$\operatorname{diam}\Big(\bigcup_{s \in B_j} I_{i,s}(D(1))\Big) \le CN^{-1/2+C\varepsilon_0}\min_{s \in B_j}\Big(-\frac{1}{\sigma^g_{i,s}(1)} - x_{2i-1}\Big)^{1/2}(1 + o(1)),$$
where we use the fact that $r = O(1)$. This immediately yields that
$$\Big|\mu_{i,j}(1) - f\Big(-\frac{1}{\sigma^g_{i,j}(1)}\Big)\Big| \le N^{-1/2+C\varepsilon_0}\Big(-\frac{1}{\sigma^g_{i,j}(1)} - x_{2i-1}\Big)^{1/2},$$
for some constant $C > 0$. This completes the proof of (2.3.22). Finally, we deal with the extremal non-outlier eigenvalues (2.3.23). By the continuity of $\mu_{i,j}(t)$ and Lemma 2.3.35, we have
$$\sigma_0(\mathcal Q_g(t)) \subset S_2(t), \qquad t \in [0,1]. \qquad (2.3.75)$$
In particular this holds for $D(1)$. The proof then follows from Corollary 2.3.26 and Lemma 2.3.34.
Eigenvalue sticking. In this subsection we prove the eigenvalue sticking property of $\mathcal Q_g$. Similar to the proof of (2.3.23), it contains three main steps: (i) establishing a forbidden region which, with high probability, contains no eigenvalues of $\mathcal Q_g$; (ii) a counting argument for the eigenvalues when the forbidden region does not depend on $N$; and (iii) a continuity argument extending the result of (ii) to an arbitrary $N$-dependent region.

Proof of Theorem 2.3.11. We start with step (i). For definiteness, we focus on the $i$-th bulk component. Define
$$\eta := N^{-1+3\varepsilon_1}(\alpha^i_+)^{-1};$$
note that $\eta \le N^{-2/3+\varepsilon_1}$. We first show that any $x$ satisfying
$$x \in [f(x_{2i-1}) - C_1, \ f(x_{2i-1}) + N^{-2/3+2\varepsilon_1}], \qquad \operatorname{dist}(x, \sigma(\mathcal Q_b)) \ge \eta, \qquad (2.3.76)$$
is not an eigenvalue of $\mathcal Q_g$, where $0 < C_1 < f(x_{2i-1}) - f(x_{2i})$ is some constant. The discussion is similar to that of (2.3.23). We observe that $2|\lambda_i - x| \ge \sqrt{(\lambda_i - x)^2 + \eta^2}$ by (2.3.76). Denote $z := x + i\eta$; by Lemma 2.3.27 we conclude that
$$\max_{i \in \mathcal I} \operatorname{Im}\langle v_i, G_b(z)v_i\rangle \le \max\Big\{\sqrt{\kappa + \eta}, \ \frac{\eta}{\sqrt{\kappa+\eta}}, \ \eta\Big\},$$
where we use (2.3.61), (2.3.62) and (2.3.76). Hence, by Lemma 2.3.27 and (2.3.73), with probability $1 - N^{-D_1}$,
$$D_o^{-1} + 1 + xV_o^* G_b(x)V_o = D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b} + O\Big(N^{\varepsilon_1}\Psi(z) + \max\Big\{\sqrt{\kappa+\eta}, \frac{\eta}{\sqrt{\kappa+\eta}}, \eta\Big\}\Big).$$
Furthermore, by (2.3.62) and (2.3.63), we have
$$\det\Big(D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}\Big) \ge C\big|\alpha^i_+ - \sqrt{\kappa + \eta}\big|.$$
Therefore, when $\kappa \le N^{-2\varepsilon_1}(\alpha^i_+)^2$, with probability $1 - N^{-D_1}$ we have
$$\det\big(D_o^{-1} + 1 + xV_o^* G_b(x)V_o\big) \ge O\big(\alpha^i_+ - N^{-\varepsilon_1}\alpha^i_+\big). \qquad (2.3.77)$$
This implies that $x$ is not an eigenvalue of $\mathcal Q_g$ by Corollary 2.3.26. Similarly, by Theorem 2.3.10 and Lemma 2.3.34, we conclude that for $j \le N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, the set
$$\Big\{x \in [\lambda_{i,j-r_i-1}, \ f(x_{2i-1}) + N^{-2/3+2\varepsilon_1}] : \operatorname{dist}(x, \sigma(\mathcal Q_b)) > N^{-1+3\varepsilon_1}(\alpha^i_+)^{-1}\Big\} \qquad (2.3.78)$$
contains no eigenvalue of $\mathcal Q_g$. Next we use a standard counting argument to locate the eigenvalues of $\mathcal Q_g$ in terms of those of $\mathcal Q_b$ for any fixed configuration $D \equiv D(0)$; we summarize it in the following lemma.

Lemma 2.3.36. For $j \le N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, with probability $1 - N^{-D_1}$ we have
$$|\mu_{i,j+r^+_i} - \lambda_{i,j}| \le \frac{N^{2\varepsilon_1}}{N\alpha^i_+}. \qquad (2.3.79)$$
For the case $j > N_i^{1-3\varepsilon_1}(\alpha^i_+)^3$, using Corollary 2.3.30 and (2.3.64), we find that
$$|\mu_{i,j+r^+_i} - \lambda_{i,j}| \le N^{-2/3+\varepsilon_1}j^{-1/3} \le \frac{N^{2\varepsilon_1}}{N\alpha^i_+}. \qquad (2.3.80)$$
In a third step, we extend the proof to any configuration $D(1)$ depending on $N$ by using the continuity argument, Corollary 2.3.30, Lemma 2.3.34, (2.3.78) and (2.3.79). We summarize this in the following lemma, whose proof is similar to the second step of Theorem 2.3.10.

Lemma 2.3.37. Lemma 2.3.36 and (2.3.80) hold true for any $N$-dependent configuration $D(1)$.
Outlier eigenvectors. For $(i,j) \in \mathcal I_+$, we define the contour $\gamma_{i,j} := \partial\Upsilon_{i,j}$ as the boundary of the disk $B_{\rho_{i,j}}(-\frac{1}{\sigma^g_{i,j}})$, where $\rho_{i,j}$ is defined as (recall (2.3.21))
$$\rho_{i,j} := \frac{\nu_{i,j} \wedge \big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\big)}{2}, \qquad i = 1,2,\cdots,p, \ j = 1,2,\cdots,r^+_i.$$
Under Assumption 2.3.8, it is easy to check that
$$\rho_{i,j} \ge \frac12\Big(-\frac{1}{\sigma^g_{i,j}} - x_{2i-1}\Big)^{1/2}N^{-1/2+\varepsilon_0}. \qquad (2.3.81)$$
We further define
$$\Gamma_{i,j} := f(\gamma_{i,j}). \qquad (2.3.82)$$
We summarize the basic properties of the contour in the following lemma.

Lemma 2.3.38. Recall (2.3.56). For $(i,j) \in \mathcal I_+$, we have $\Gamma_{i,j} \subset S^e_o$. Furthermore, each outlier $\mu_{i,j}$ lies in $\bigcup_{\mathcal I_+}\Gamma_{i,j}$, and all the other eigenvalues lie in the complement of $\bigcup_{\mathcal I_+}\Gamma_{i,j}$.
Proof of Theorem 2.3.13. We first prove the following proposition, in which we assume that all the outliers lie on the right of the first bulk component; the general case is an easy corollary.

Proposition 2.3.39. Theorem 2.3.13 holds true when all the outliers are on the right of the first bulk component.

Proof. When all the outliers are on the right of the first bulk component, we have $\mathcal I_+ = \{1,2,\cdots,r_+\}$. By Lemmas 2.3.27 and 2.3.33, we can choose an event $\Xi$ of probability $1 - N^{-D_1}$ such that, for $N$ large enough and all $z \in S^e_o$,
$$\mathbf 1(\Xi)\Big\|-\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o\Big\| \le (\kappa + \eta)^{-1/4}N^{-1/2+\varepsilon_1}. \qquad (2.3.83)$$
For $i,j \in \mathcal I_+$, by the spectral decomposition, Cauchy's integral formula and Theorem 2.3.10, we have
$$\langle u_i, v_j\rangle^2 = -\frac{1}{2\pi i}\oint_{\Gamma_i} \langle v_j, G_g(z)v_j\rangle\, dz = -\frac{1}{2\pi i}\oint_{\gamma_i} \langle v_j, G_g(f(\zeta))v_j\rangle f'(\zeta)\, d\zeta. \qquad (2.3.84)$$
Furthermore, by (2.3.58) and Cauchy's integral theorem, we can write
$$\langle u_i, v_j\rangle^2 = -\frac{1}{2\pi i}\oint_{\Gamma_i} [V_o^* G_g(z)V_o]_{jj}\, dz = \frac{1+d_j}{d_j^2}\,\frac{1}{2\pi i}\oint_{\Gamma_i} \big(D_o^{-1} + 1 + zV_o^* G_b(z)V_o\big)^{-1}_{jj}\, \frac{dz}{z}. \qquad (2.3.85)$$
Now we introduce the decomposition
$$D_o^{-1} + 1 + zV_o^* G_b(z)V_o = D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b} - \Delta(z), \qquad \Delta(z) := -\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o. \qquad (2.3.86)$$
It is notable that $\Delta(z)$ is well controlled by (2.3.83). Applying a second order resolvent expansion to (2.3.86) and using (2.3.85), we obtain the decomposition
$$\langle u_i, v_j\rangle^2 = \frac{1+d_j}{d_j^2}(s_1 + s_2 + s_3),$$
where $s_1, s_2, s_3$ are defined as
$$s_1 := \frac{1}{2\pi i}\oint_{\Gamma_i} \frac{1}{d_j^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_j}}\, \frac{dz}{z},$$
$$s_2 := \frac{1}{2\pi i}\oint_{\Gamma_i} \Bigg(\frac{1}{d_j^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_j}}\Bigg)^2 (\Delta(z))_{jj}\, \frac{dz}{z},$$
$$s_3 := \frac{1}{2\pi i}\oint_{\Gamma_i} \Bigg(\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Bigg)_{jj} \frac{dz}{z}.$$
First of all, the convergent limit is characterized by $s_1$: by the residue theorem and (2.3.60), we have
$$\frac{1+d_j}{d_j^2}\, s_1 = \frac{1}{\sigma^g_j}\,\frac{1}{2\pi i}\oint_{\gamma_i} \frac{f'(\zeta)}{f(\zeta)}\,\frac{1 + \zeta\sigma^b_j}{\zeta + (\sigma^g_j)^{-1}}\, d\zeta = \delta_{ij}\,\frac{1}{\sigma^g_i}\,\frac{f'(-1/\sigma^g_i)}{f(-1/\sigma^g_i)}.$$
Next we control $s_2$ and $s_3$. For $s_2$, we rewrite it as
$$s_2 = \frac{d_j^2}{(\sigma^g_j)^2\, 2\pi i}\oint_{\gamma_i} \frac{h_{jj}(\zeta)}{(\zeta + \frac{1}{\sigma^g_j})^2}\, d\zeta, \qquad h_{jj}(\zeta) := (1 + \zeta\sigma^b_j)^2(\Delta(f(\zeta)))_{jj}\,\frac{f'(\zeta)}{f(\zeta)}.$$
As $h_{jj}(\zeta)$ is holomorphic inside the contour $\gamma_i$, by (2.3.66) and (2.3.83) we conclude that with probability $1 - N^{-D_1}$,
$$|h_{jj}(\zeta)| \le |\zeta - x_1|^{1/2}N^{-1/2+\varepsilon_1}. \qquad (2.3.87)$$
By Cauchy's differentiation formula, we have
$$h'_{jj}(\zeta) = \frac{1}{2\pi i}\oint_{\mathcal C} \frac{h_{jj}(\xi)}{(\xi - \zeta)^2}\, d\xi, \qquad (2.3.88)$$
where $\mathcal C$ is the circle of radius $\frac{|\zeta - x_1|}{2}$ centered at $\zeta$. Hence, by (2.3.87), (2.3.88) and the residue theorem, with probability $1 - N^{-D_1}$ we have
$$|h'_{jj}(\zeta)| \le |\zeta - x_1|^{-1/2}N^{-1/2+\varepsilon_1}. \qquad (2.3.89)$$
When $i = j$, by the residue theorem and (2.3.89), we have
$$|s_2| = \bigg|\frac{d_i^2}{(\sigma^g_i)^2}\, h'_{ii}\Big(-\frac{1}{\sigma^g_i}\Big)\bigg| \le \frac{d_i^2}{(\sigma^g_i)^2}\Big(-\frac{1}{\sigma^g_i} - x_1\Big)^{-1/2}N^{-1/2+\varepsilon_1}.$$
When $i \neq j$, by Assumption 2.3.8 and the residue theorem, we have $|s_2| = 0$. Finally, we estimate $s_3$. Here the residue calculation is not available, and we need to choose a precise contour. We summarize the estimate in the following lemma, whose proof can be found in [17, Section 5.1].

Lemma 2.3.40. When $N$ is large enough, there exists some constant $C > 0$ such that, with probability $1 - N^{-D_1}$,
$$|s_3| \le C\,\frac{d_j^2}{(\sigma^g_j)^2}\,N^{-1+2\varepsilon_1}\bigg(\frac{1}{\nu_j^2} + \frac{\mathbf 1(i=j)}{(-\frac{1}{\sigma^g_j} - x_1)^2}\bigg).$$
This concludes the proof.

The proof of Theorem 2.3.13 is the same as that of Proposition 2.3.39, except that we need to change the indices for each bulk component.
Non-outlier eigenvectors. For the non-outlier eigenvectors, the residue calculation is not available.

Proof of Theorem 2.3.15. We first suppose that all the outliers are on the right of the first bulk component and focus on the $l$-th bulk component; the proof of the general case is similar. For simplicity, we write $\mu_j$ for $\mu_{l,j}$ and $u_j$ for $u_{l,j}$. Throughout this subsection we use the spectral parameter $z = \mu_j + i\eta$, where $\eta$ is the unique smallest solution of
$$\operatorname{Im} m(z) = N^{-1+6\varepsilon_1}\eta^{-1}. \qquad (2.3.90)$$
As a consequence, with probability $1 - N^{-D_1}$, Lemma 2.3.27 reads
$$\Big\|-\frac{1}{1 + m(z)D_o^b} - zV_o^* G_b(z)V_o\Big\| \le \frac{N^{4\varepsilon_1}}{N\eta}. \qquad (2.3.91)$$
Using the spectral decomposition, we have
$$\langle v_i, u_j\rangle^2 \le \eta\, v_i^* \operatorname{Im} G_g(z)\, v_i. \qquad (2.3.92)$$
Recalling (2.3.86), by (2.3.58) and a simple resolvent expansion we get
$$\langle v_i, G_g(z)v_i\rangle = \frac{1}{z}\Bigg[\frac{1}{d_i} - \frac{1+d_i}{d_i^2}\Bigg(\frac{1}{d_i^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_i}} + \Bigg(\frac{1}{d_i^{-1} + 1 - \frac{1}{1 + m(z)\sigma^b_i}}\Bigg)^2(\Delta(z))_{ii}$$
$$+ \Bigg(\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 - \frac{1}{1 + m(z)D_o^b}}\,\Delta(z)\,\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Bigg)_{ii}\Bigg)\Bigg]. \qquad (2.3.93)$$
We first observe that
$$\min_i |m(z) + (\sigma^g_i)^{-1}| \ge \operatorname{Im} m(z) \gg \|\Delta(z)\|,$$
where we use (2.3.91). This yields that
$$\Big\|\frac{1}{D_o^{-1} + 1 + zV_o^* G_b(z)V_o}\Big\| \le \frac{C}{\operatorname{Im} m(z)}.$$
Therefore we get from (2.3.93) that
$$z\langle v_i, G_g(z)v_i\rangle = \frac{1}{d_i\sigma^b_i}\,\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} + O\Bigg(\frac{1}{\sigma^g_i\sigma^b_i}\,\frac{|1 + m(z)\sigma^b_i|^2}{|m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N\eta}\Bigg).$$
Hence, combining with (2.3.92), we have
$$\langle v_i, u_j\rangle^2 \le \eta \operatorname{Im} \frac{1 + m(z)\sigma^b_i}{z(m(z) + (\sigma^g_i)^{-1})} + O\Bigg(\frac{|1 + m(z)\sigma^b_i|^2}{|z||m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N}\Bigg)$$
$$\le \Bigg[\frac{\eta^2}{|z|^2}\operatorname{Re}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} + \frac{\mu_j\eta}{|z|^2}\operatorname{Im}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}}\Bigg] + O\Bigg(\frac{|1 + m(z)\sigma^b_i|^2}{|z||m(z) + (\sigma^g_i)^{-1}|^2}\,\frac{N^{4\varepsilon_1}}{N}\Bigg). \qquad (2.3.94)$$
Under Assumption 2.3.3, by Corollary 2.3.30 and Lemma 2.3.34, we have $|\mu_j| \ge \tau$, where $\tau > 0$ is some constant. On the one hand,
$$\frac{\eta^2}{|z|^2}\operatorname{Re}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} \le \frac{CN^{C\varepsilon_1}}{N|m(z) + (\sigma^g_i)^{-1}|^2},$$
where $C > 0$ is some large constant. Similarly,
$$\frac{\mu_j\eta}{|z|^2}\operatorname{Im}\frac{1 + m(z)\sigma^b_i}{m(z) + (\sigma^g_i)^{-1}} \le \frac{C\eta}{|m(z) + (\sigma^g_i)^{-1}|^2}\operatorname{Im} m(z).$$
Therefore, we conclude from (2.3.62), (2.3.94) and the definition (2.3.90) that
$$\langle v_i, u_j\rangle^2 \le \frac{N^{6\varepsilon_1}}{N|m(z) + (\sigma^g_i)^{-1}|^2}.$$
It is easy to check that
$$m(z) + (\sigma^g_i)^{-1} = m(z) - x_1 + x_1 + \frac{1}{\sigma^g_i} \sim \sqrt{\kappa + \eta} + \Big|\frac{1}{\sigma^g_i} + x_1\Big|,$$
where we use (2.3.63). This yields that
$$\big|\operatorname{Re}\big(m(z) + (\sigma^g_i)^{-1}\big)\big| = O\Big(\sqrt{\kappa + \eta} + \Big|\frac{1}{\sigma^g_i} + x_1\Big|\Big). \qquad (2.3.95)$$
We conclude that
$$|m(z) + (\sigma^g_i)^{-1}|^2 \ge C\big(\kappa^d_{l,j} + ((\sigma^g_i)^{-1} + x_1)^2\big).$$
This concludes our proof.
Proof of Lemma 2.3.25. Using the identity $\det(1+XY) = \det(1+YX)$, $\mu$ is an eigenvalue of $Q_g$ if and only if
\begin{align*}
0 &= \det(\Sigma_g^{1/2}XX^*\Sigma_g^{1/2}-\mu) = \det(X^*\Sigma_g X-\mu) = \det(XX^*\Sigma_b(1+VDV^*)-\mu) \\
&= \det(\Sigma_b^{1/2}(1+VDV^*)XX^*\Sigma_b^{1/2}-\mu) = \det(Q_b-\mu)\det(G_b(\mu)VDV^*Q_b+1) \\
&= \det(Q_b-\mu)\det(Q_bG_b(\mu)VDV^*+1).
\end{align*}
Using the Woodbury matrix identity
\[
(A+SBT)^{-1} = A^{-1} - A^{-1}S(B^{-1}+TA^{-1}S)^{-1}TA^{-1}, \tag{2.3.96}
\]
and
\[
Q_bG_b(\mu) = Q_b(Q_b-\mu)^{-1} = (1-Q_b^{-1}\mu)^{-1},
\]
we find that
\[
0 = \det\big((1-Q_b^{-1}\mu)^{-1}VDV^*+1\big) = \det\big((1+\mu G_b(\mu))VDV^*+1\big).
\]
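The two algebraic facts driving this proof, Sylvester's determinant identity $\det(1+XY)=\det(1+YX)$ and the Woodbury identity (2.3.96), are easy to sanity-check numerically. A minimal sketch with random matrices (the dimensions and the diagonal shifts used for conditioning are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2

# Sylvester's determinant identity: det(I + X Y) = det(I + Y X).
X = rng.standard_normal((p, r))
Y = rng.standard_normal((r, p))
lhs = np.linalg.det(np.eye(p) + X @ Y)
rhs = np.linalg.det(np.eye(r) + Y @ X)

# Woodbury: (A + S B T)^{-1} = A^{-1} - A^{-1} S (B^{-1} + T A^{-1} S)^{-1} T A^{-1}.
A = rng.standard_normal((p, p)) + 5 * np.eye(p)   # shift keeps A well conditioned
B = rng.standard_normal((r, r)) + 3 * np.eye(r)
S = rng.standard_normal((p, r))
T = rng.standard_normal((r, p))
Ainv = np.linalg.inv(A)
lhs2 = np.linalg.inv(A + S @ B @ T)
rhs2 = Ainv - Ainv @ S @ np.linalg.inv(np.linalg.inv(B) + T @ Ainv @ S) @ T @ Ainv

print(abs(lhs - rhs))                 # numerically zero
print(np.max(np.abs(lhs2 - rhs2)))   # numerically zero
```

Note that the Woodbury step in the proof only needs the identity for the low-rank ($r \ll p$) case, which is exactly where it pays off computationally: the inner inverse is $r \times r$.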
Proof of Corollary 2.3.26. By (2.3.14), (2.3.54) and the identity $\det(1+XY) = \det(1+YX)$, we have
\[
\det\big((1+\mu G_b(\mu))VDV^*+1\big) = 0 \iff \det\big(V_o^*(1+\mu G_b(\mu))V_oD_o+1\big) = 0;
\]
the proof follows from the fact that $D_o$ is invertible.
Proof of Lemma 2.3.29. We first write
\[
(1+VDV^*)^{1/2}G_g(z)(1+VDV^*)^{1/2} = \Big(\Sigma_b^{1/2}XX^*\Sigma_b^{1/2} - z(1+VDV^*)^{-1}\Big)^{-1}
= G_b(z) - G_b(z)V_o\frac{z}{D_o^{-1}+1+zV_o^*G_b(z)V_o}V_o^*G_b(z),
\]
where in the second equality we use (2.3.96). For simplicity, we now omit the indices for $v$, $d$, and in this case $V_o = v$. Denoting $G_{g,b}^{vv}(z) := \langle v, G_{g,b}(z)v\rangle$, we have
\[
G_g^{vv}(z) = \frac{1}{d+1}G_b^{vv}(z) - \frac{1}{d+1}\big(G_b^{vv}(z)\big)^2\frac{z}{d^{-1}+1+zG_b^{vv}(z)},
\]
which implies that
\[
\frac{1}{G_b^{vv}(z)} + \frac{z}{d^{-1}+1} = \frac{1}{d+1}\,\frac{1}{G_g^{vv}(z)}.
\]
Writing this in spectral decomposition yields that
\[
\Big(\sum_i \frac{\langle v, u_i^b\rangle^2}{\lambda_i - z}\Big)^{-1} = \frac{1}{d+1}\Big(\sum_i \frac{\langle v, u_i\rangle^2}{\mu_i - z}\Big)^{-1} - \frac{z}{d^{-1}+1}. \tag{2.3.97}
\]
It is notable that the left-hand side of (2.3.97) defines a function of z ∈ (0,∞) with M−1
singularities and M zeros, which is smooth and decreasing away from the singularities.
Moreover, its zeros are the eigenvalues of Qb. Similar results hold for Qg. Hence, if z is
an eigenvalue of Qg, we should have
\[
\Big(\sum_i \frac{\langle v, u_i^b\rangle^2}{\lambda_i - z}\Big)^{-1} = -\frac{z}{d^{-1}+1}.
\]
We can then conclude our proof using the monotone decreasing property of the left-hand
side of (2.3.97).
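The characterization above can be verified numerically on a small example: take a rank-one multiplicative perturbation $Q_g = A^{1/2}Q_bA^{1/2}$ with $A = 1 + d\,vv^*$, and check that every eigenvalue $\mu$ of $Q_g$ solves the secular equation $\big(\sum_i \langle v,u_i^b\rangle^2/(\lambda_i-\mu)\big)^{-1} = -\mu/(d^{-1}+1)$. A minimal sketch (the matrix size and the perturbation strength $d$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 5, 2.0

# Random positive definite Q_b with eigenpairs (lambda_i, u_i^b).
G = rng.standard_normal((M, M))
Qb = G @ G.T + np.eye(M)
lam, Ub = np.linalg.eigh(Qb)

# Rank-one multiplicative perturbation: A^{1/2} = I + (sqrt(1+d) - 1) v v^*.
v = rng.standard_normal(M)
v /= np.linalg.norm(v)
A_half = np.eye(M) + (np.sqrt(1 + d) - 1) * np.outer(v, v)
Qg = A_half @ Qb @ A_half
mu = np.linalg.eigvalsh(Qg)

# Each eigenvalue mu of Q_g satisfies
#   (sum_i <v, u_i^b>^2 / (lambda_i - mu))^{-1} = -mu / (d^{-1} + 1).
weights = (Ub.T @ v) ** 2
residual = [1.0 / np.sum(weights / (lam - m)) + m / (1.0 / d + 1.0) for m in mu]
print(np.max(np.abs(residual)))   # numerically zero
```

This is precisely the monotonicity argument of the proof in computational form: between consecutive $\lambda_i$ the left-hand side is decreasing, so each interval carries exactly one solution $\mu$.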
Proof of Lemma 2.3.31. We first observe that
\begin{align*}
\Sigma_b^{-1/2}\Sigma_g^{1/2}G_g(z)\Sigma_g^{1/2}\Sigma_b^{-1/2} &= \Sigma_b^{-1/2}(XX^* - z\Sigma_g^{-1})^{-1}\Sigma_b^{-1/2} \\
&= \big(Q_b - z + z - z\Sigma_b^{1/2}\Sigma_g^{-1}\Sigma_b^{1/2}\big)^{-1} \\
&= \big(G_b^{-1}(z) + zV_oD_o(1+D_o)^{-1}V_o^*\big)^{-1},
\end{align*}
where in the last step we use the fact that
\[
\Sigma_b^{1/2}\Sigma_g^{-1}\Sigma_b^{1/2} - 1 = -VD(1+D)^{-1}V^* = -V_oD_o(1+D_o)^{-1}V_o^*.
\]
We now again use the Woodbury matrix identity (2.3.96) to get
\[
\Sigma_b^{-1/2}\Sigma_g^{1/2}G_g(z)\Sigma_g^{1/2}\Sigma_b^{-1/2} = G_b(z) - zG_b(z)V_o\big(D_o^{-1}+1+zV_o^*G_b(z)V_o\big)^{-1}V_o^*G_b(z).
\]
Multiplying by $V_o$ on both sides of the equation and using the identity
\[
A - A(A+B)^{-1}A = B - B(A+B)^{-1}B,
\]
we can conclude our proof.
Proof of Corollary 2.3.14. Without loss of generality, we mainly focus on the case when all the outliers are to the right of the first bulk component. Recalling (2.3.27) and (2.3.29), similarly to the proof of Proposition 2.3.39, with probability $1-N^{-D_1}$, for $i, j \in I$, we have
\[
\langle v_i, P_A v_j\rangle = \delta_{ij}\mathbf{1}(i\in A)u_i + N^{\epsilon_1}R(i,j,A,N), \tag{2.3.98}
\]
where $R(i,j,A,N)$ is defined as
\begin{align*}
R(i,j,A,N) := {}& N^{-1/2}\Bigg[\mathbf{1}(i,j\in A)\Big({-\frac{1}{\sigma_i^g}}-x_1\Big)^{-1/4}\Big({-\frac{1}{\sigma_j^g}}-x_1\Big)^{-1/4} + \mathbf{1}(i\in A, j\notin A)\frac{\big({-\frac{1}{\sigma_i^g}}-x_1\big)^{1/2}}{\big|{-\frac{1}{\sigma_i^g}}+\frac{1}{\sigma_j^g}\big|} \\
&+ \mathbf{1}(i\notin A, j\in A)\frac{\big({-\frac{1}{\sigma_j^g}}-x_1\big)^{1/2}}{\big|{-\frac{1}{\sigma_i^g}}+\frac{1}{\sigma_j^g}\big|}\Bigg] + N^{-1}\Bigg[\Big(\frac{1}{\nu_i}+\frac{\mathbf{1}(i\in A)}{-\frac{1}{\sigma_i^g}-x_1}\Big)\Big(\frac{1}{\nu_j}+\frac{\mathbf{1}(j\in A)}{-\frac{1}{\sigma_j^g}-x_1}\Big)\Bigg].
\end{align*}
For the general case when $i, j = 1, 2, \cdots, M$, we denote $\widetilde{I} := I \cup \{i, j\}$ and consider
\[
\widetilde{\Sigma}_g := (1 + \widetilde{V}_o\widetilde{D}_o\widetilde{V}_o^*)\Sigma_b, \quad \widetilde{V}_o := [v_k]_{k\in\widetilde{I}}, \quad \widetilde{D}_o := \operatorname{diag}(\widetilde{d}_k)_{k\in\widetilde{I}}, \tag{2.3.99}
\]
where $\widetilde{d}_k := d_k$ for $k \in I$ and $\widetilde{d}_k = \epsilon$ for $k \in \widetilde{I}\setminus I$, with $\epsilon > 0$ small enough. Since $|\widetilde{I}| \leq r+2$ is finite, (2.3.98) can be applied to (2.3.99). By continuity, taking the limit $\epsilon \to 0$, we conclude that (2.3.98) holds true for all $i, j = 1, 2, \cdots, M$. For the proof of (2.3.30), as $w = \sum_{j=1}^M w_jv_j$, we have
\[
\langle w, P_A w\rangle = \sum_{i=1}^M\sum_{j=1}^M w_iw_j\langle v_i, P_A v_j\rangle.
\]
Then the proof follows from (2.3.98).
Proof of Corollary 2.3.16. For the proof of (2.3.32), as $w = \sum_{j=1}^M w_jv_j$, we have
\[
\langle w, u_j\rangle = \sum_{k=1}^M w_k\langle v_k, u_j\rangle.
\]
The rest of the proof is similar to that of Corollary 2.3.14, using an elementary inequality.
Simulation studies. We provide the simulation details and results. We use Figure 2.4 to show the simulation results of the examples. We can see that there exist some outlier eigenvalues in our general model (in red), while the bulk eigenvalues stick to those of the underlying bulk model (in blue). For the statistical application of optimal shrinkage of eigenvalues, we focus our discussion on the sample covariance matrices from Examples 2.3.21 and 2.3.22. For any two matrices, a true matrix $A$ and an estimate $B$, we denote by $L(A,B)$ the loss between $A$ and $B$. For completeness, we list 10 loss functions and their optimal shrinkers in Table 2.2. In each of the examples, we use Figure 2.5 to show the simulation results of the optimal shrinkers under the 10 different loss functions. For the oracle estimation, we use Figure 2.6 to compare our estimation method with the QuEST method. Finally, Figure 2.7 shows how our results can be employed in the factor model to improve the estimation.
[Figure: four panels showing the simulated eigenvalue spectra.]

Figure 2.4: Spectrum of the examples. We simulate the spectrum of four 400 × 800 sample covariance matrices with different population covariance matrices.
Frobenius matrix norm           Shrinker                          Statistical measure    Shrinker
||A - B||_F                     lc^2 + s^2                        Stein loss             l/(c^2 + ls^2)
||A^{-1} - B^{-1}||_F           l/(c^2 + ls^2)                    Entropy loss           lc^2 + s^2
||A^{-1}B - I||_F               (lc^2 + l^2s^2)/(c^2 + l^2s^2)    Divergence loss        sqrt((l^2c^2 + ls^2)/(c^2 + ls^2))
||B^{-1}A - I||_F               (l^2c^2 + s^2)/(lc^2 + s^2)       Matusita affinity      ((1 + c^2)l + s^2)/(1 + c^2 + ls^2)
||A^{-1/2}BA^{-1/2} - I||_F     1 + (l - 1)c^2/(c^2 + ls^2)^2     Frechet discrepancy    (sqrt(l)c^2 + s^2)^2

Table 2.2: 10 different loss functions and their optimal shrinkers.
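To illustrate how a shrinker from Table 2.2 is applied to an observed outlier eigenvalue, here is a minimal sketch. It uses the standard spiked-model relations between an outlier sample eigenvalue $\lambda$, the population spike $l$, and the squared cosine $c^2$ (with $s^2 = 1-c^2$) for aspect ratio $\gamma = p/n$; the normalization here is the textbook one and may differ from the conventions used elsewhere in this chapter.

```python
import numpy as np

def spike_from_eigenvalue(lam, gamma):
    """Invert lam = l * (1 + gamma / (l - 1)) for the population spike l,
    valid for outliers lam > (1 + sqrt(gamma))^2."""
    t = lam + 1 - gamma
    return (t + np.sqrt(t ** 2 - 4 * lam)) / 2

def cosine2(l, gamma):
    """Squared cosine c^2 between sample and population spike eigenvectors."""
    return (1 - gamma / (l - 1) ** 2) / (1 + gamma / (l - 1))

def shrink_frobenius(lam, gamma):
    """Optimal shrinker for the loss ||A - B||_F in Table 2.2: l c^2 + s^2."""
    if lam <= (1 + np.sqrt(gamma)) ** 2:      # bulk eigenvalue: shrink to 1
        return 1.0
    l = spike_from_eigenvalue(lam, gamma)
    c2 = cosine2(l, gamma)
    return l * c2 + (1 - c2)

gamma = 0.5                                   # aspect ratio p/n
lam = 5 * (1 + gamma / 4)                     # outlier produced by a spike l = 5
print(shrink_frobenius(lam, gamma))           # noticeably smaller than lam = 5.625
```

Swapping in a different row of Table 2.2 only changes the last line of `shrink_frobenius`; the inversion $\lambda \mapsto l$ and the cosine $c^2$ are shared by all ten shrinkers.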
[Figure: four panels plotting the shrunken eigenvalue β(λ) against the empirical eigenvalue λ; panel titles: Model two: Frobenius norm discrepancies (c = 2); Model two: statistical discrepancies (c = 2); Model three: Frobenius norm discrepancies (c = 2); Model three: statistical discrepancies (c = 2).]

Figure 2.5: Optimal shrinkers under different loss functions. We simulate the optimal shrinkers using Examples 2.3.21 and 2.3.22. Model two corresponds to Example 2.3.21 and model three to Example 2.3.22. F1 to F5 correspond to the Frobenius matrix norms in Table 2.2; C1 stands for the empirical eigenvalue and C2 for the true eigenvalue.
[Figure: four panels plotting the oracle estimator against the eigenvalue λ.]

Figure 2.6: Estimation of the oracle estimator. We simulate the estimation of the oracle estimators under the Frobenius norm, where the blue line stands for the true estimator, red dots for our estimation and magenta dots for the estimation using QuEST [78]. The first panel (top left), the second panel (top right), the third panel (bottom left) and the fourth panel (bottom right) correspond to Examples 2.3.20, 2.3.21, 2.3.22 and 2.3.23 respectively, where the entries of X are standard normal random variables. In Example 2.3.20, the spike is located at 6.
[Figure: four panels plotting the Frobenius loss against the dimension M.]

Figure 2.7: Estimation error using POET with information from random matrix theory. We simulate the estimation error under the Frobenius norm, where the blue line stands for the sample covariance matrix estimation and red dots for our Multi-POET estimation. We find that using information from the sample covariance matrices can help us improve the inference results. The first panel (top left), the second panel (top right), the third panel (bottom left) and the fourth panel (bottom right) correspond to Examples 2.3.20, 2.3.21, 2.3.22 and 2.3.23 respectively, where the entries of X are standard normal random variables. In Example 2.3.20, the spike is located at 6.
Chapter 3
Random matrices in non-stationary time series analysis
In this chapter, we provide detailed proofs and computations for the study of non-stationary time series, which is the second part of our contribution in Section 1.4. For a complete discussion, we refer to our papers [42, 43].
3.1 Locally stationary time series and physical dependence measure
Definition 3.1.1. Let $\eta_i'$ be an i.i.d. copy of $\eta_i$. Assuming that for some $q > 0$, $\|x_i\|_q < \infty$, for $j \geq 0$ we define the physical dependence measure by
\[
\delta(j,q) := \sup_{t\in[0,1]}\max_i \big\|G(t,\mathcal{F}_i) - G(t,\mathcal{F}_{i,j})\big\|_q, \tag{3.1.1}
\]
where $\mathcal{F}_{i,j} := (\mathcal{F}_{i-j-1}, \eta_{i-j}', \eta_{i-j+1}, \cdots, \eta_i)$.
The measure $\delta(j,q)$ quantifies the change in the output of the filter $G$ when the innovation $j$ steps back is replaced by an i.i.d. copy. If the change is small, then we
have short-range dependence. It is notable that $\delta(j,q)$ is determined by the data generating mechanism and can often be computed explicitly.
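To illustrate how easily $\delta(j,q)$ can be computed, here is a minimal Monte Carlo sketch. It assumes, purely for concreteness, the time-varying MA(1) filter $G(t,\mathcal{F}_i) = 0.6\cos(2\pi t)\eta_{i-1} + \eta_i$ with standard normal innovations (the model used in the simulation study later in this chapter); in closed form $\delta(1,2) = \sup_t|0.6\cos(2\pi t)|\cdot\|\eta-\eta'\|_2 = 0.6\sqrt{2}$ and $\delta(j,2)=0$ for $j \geq 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
t_grid = np.linspace(0, 1, 21)

def delta_hat(j, q=2):
    """Monte Carlo estimate of delta(j, q) for the illustrative filter
    G(t, F_i) = 0.6*cos(2*pi*t)*eta_{i-1} + eta_i."""
    eta_prev = rng.standard_normal(N)      # eta_{i-1}
    eta_cur = rng.standard_normal(N)       # eta_i
    eta_copy = rng.standard_normal(N)      # i.i.d. copy eta'_{i-j}
    best = 0.0
    for t in t_grid:
        a = 0.6 * np.cos(2 * np.pi * t)
        g = a * eta_prev + eta_cur
        # coupled output: the innovation j steps back replaced by its copy
        g_j = a * (eta_copy if j == 1 else eta_prev) + (eta_copy if j == 0 else eta_cur)
        best = max(best, float(np.mean(np.abs(g - g_j) ** q) ** (1.0 / q)))
    return best

print(delta_hat(1))   # close to 0.6 * sqrt(2) ~ 0.85
print(delta_hat(2))   # the filter has memory one, so this is exactly 0
```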
In this chapter, we impose the following assumptions on the physical dependence measure to control the temporal dependence of the non-stationary time series.
Assumption 3.1.2. There exist a large constant $\tau > 0$ and $q > 4$ such that
\[
\delta(j,q) \leq j^{-\tau}, \quad j \geq 1. \tag{3.1.2}
\]
Furthermore, $G$ satisfies the property of stochastic Lipschitz continuity,
\[
\|G(t_1,\mathcal{F}_i) - G(t_2,\mathcal{F}_i)\|_q \leq C|t_1-t_2|, \tag{3.1.3}
\]
for any $t_1, t_2 \in [0,1]$, and we also assume that
\[
\sup_t\max_i \|G(t,\mathcal{F}_i)\|_q < \infty. \tag{3.1.4}
\]
(3.1.2) indicates that the time series has short-range dependence. (3.1.3) implies that $G(\cdot,\cdot)$ changes smoothly over time and ensures local stationarity. Furthermore, for each fixed $t \in [0,1]$, denote
\[
\gamma(t,j) = \operatorname{Cov}\big(G(t,\mathcal{F}_0), G(t,\mathcal{F}_j)\big); \tag{3.1.5}
\]
then (3.1.3) and (3.1.4) imply that $\gamma(t,j)$ is Lipschitz continuous in $t$. Furthermore, we need the following mild assumption on the smoothness of $\gamma(t,j)$.
Assumption 3.1.3. For any $j \geq 0$, we assume that $\gamma(t,j) \in C^p([0,1])$, where $p > 0$ is some integer and $C^p([0,1])$ is the space of continuous functions on $[0,1]$ with $p$ continuous derivatives.
Many important consequences can be derived from Assumptions 3.1.2 and 3.1.3. We list the most useful ones. The first is the following control on $\gamma(t,j)$.
Lemma 3.1.4. Under Assumptions 3.1.2 and 3.1.3, there exists some constant $C > 0$ such that
\[
\sup_t |\gamma(t,j)| < Cj^{-\tau}, \quad j \geq 1.
\]
Another important conclusion is that the coefficients decay polynomially. Hence, when $i > b$ is large, where $b = n^{2/\tau}$, we only need to consider an autoregressive fit of order $b$ instead of $i-1$.
Lemma 3.1.5. Under Assumption 3.1.2 and letting $b = n^{2/\tau}$, there exists some constant $C > 0$ such that
\[
\sup_{i>b}|\phi_{ij}| \leq
\begin{cases}
\max\{n^{-4+5/\tau}, Cj^{-\tau}\}, & i \geq b^2; \\
\max\{n^{-2+3/\tau}, Cj^{-\tau}\}, & b < i < b^2.
\end{cases} \tag{3.1.6}
\]
Furthermore, when $i > b$, denote $\phi_i^b = (\phi_{i1}, \cdots, \phi_{ib})$ and $\widetilde{\phi}_i^b = (\Gamma_i^b)^{-1}\gamma_i^b$ with entries $(\widetilde{\phi}_{i1}, \cdots, \widetilde{\phi}_{ib})$, where $\Gamma_i^b = \operatorname{Cov}(x_{i-1}^b, x_{i-1}^b)$, $\gamma_i^b = \mathbb{E}(x_{i-1}^bx_i)$, $x_{i-1}^b = (x_{i-1}, \cdots, x_{i-b})$; we have
\[
\sup_i \big\|\phi_i^b - \widetilde{\phi}_i^b\big\| \leq Cn^{-2+1/\tau}.
\]
Finally, define $\phi^b(\frac{i}{n}) := (\phi_1(\frac{i}{n}), \cdots, \phi_b(\frac{i}{n}))$ by
\[
\phi^b\Big(\frac{i}{n}\Big) = (\bar{\Gamma}_i^b)^{-1}\bar{\gamma}_i^b, \tag{3.1.7}
\]
where $\bar{\Gamma}_i^b$ and $\bar{\gamma}_i^b$ are defined as
\[
\bar{\Gamma}_i^b = \operatorname{Cov}(\bar{x}_{i-1}^b, \bar{x}_{i-1}^b), \quad \bar{\gamma}_i^b = \operatorname{Cov}(\bar{x}_{i-1}^b, x_i),
\]
with $\bar{x}_{i-1,k}^b = G(\frac{i}{n},\mathcal{F}_{i-k})$, $k = 1, 2, \cdots, b$. The following lemma shows that $\phi_i^b$ can be well approximated by $\phi^b(\frac{i}{n})$ when $i > b$.
Lemma 3.1.6. Under Assumption 3.1.2, there exists some constant $C > 0$ such that
\[
\sup_{i>b}\Big|\phi_{ij} - \phi_j\Big(\frac{i}{n}\Big)\Big| \leq Cn^{-1+2/\tau}, \quad \text{for all } j \leq b.
\]
Until the end of this chapter, we will always use $b = n^{2/\tau}$. Before concluding this section, we summarize the basic properties of $\epsilon_i$.
Lemma 3.1.7. First of all, we have $\sup_i \sigma_i^2 < \infty$. Furthermore, denoting the physical dependence measure of $\epsilon_i$ by $\delta_\epsilon(j,q)$, there exists some constant $C > 0$ such that
\[
\delta_\epsilon(j,q) \leq Cj^{-\tau}, \quad j \geq 1.
\]
3.2 Estimation of covariance and precision matrices
By Lemmas 3.1.5 and 3.1.6, it suffices to estimate $\phi_{ij}$ for $i \leq b$, $\phi_j(\frac{i}{n})$ for $i > b \geq j$, and the variances of the residuals. When $i > b$, by (3.1.4) and Lemma 3.1.5, it is easy to check that with probability $1-o(1)$,
\[
\sup_i\Big|\sum_{j=b+1}^{i-1}\phi_{ij}x_{i-j}\Big| = o(n^{-1}).
\]
Therefore, we now write
\[
x_i = \sum_{j=1}^b \phi_{ij}x_{i-j} + \epsilon_i, \quad i = b+1, \cdots, n. \tag{3.2.1}
\]
Time-varying coefficients for $i > b$. We first estimate the time-varying coefficients $\phi_j(\frac{i}{n})$ for $i > b$ using the method of sieves [10, 31, 32]. We first observe the following fact.

Lemma 3.2.1. Under Assumptions 3.1.2 and 3.1.3, for any $j$, we have $\phi_j(t) \in C^p([0,1])$.
Based on the above lemma, we use
\[
\theta_j\Big(\frac{i}{n}\Big) := \sum_{k=1}^c a_{jk}\alpha_k\Big(\frac{i}{n}\Big), \quad j \leq b, \tag{3.2.2}
\]
to approximate $\phi_j(\frac{i}{n})$, where $\{\alpha_k(\frac{i}{n})\}$ is a set of pre-chosen orthogonal bases on $[0,1]$ and $c \equiv c(n)$ stands for the number of basis functions, which will be specified later. We impose the following regularity condition on the regressors and the basis functions.
Assumption 3.2.2. For any $k = 1, 2, \cdots, b$, define $\Sigma^k(t) \in \mathbb{R}^{k\times k}$ via $\Sigma_{ij}^k(t) = \gamma(t,|i-j|)$. We assume that the eigenvalues of
\[
\int_0^1 \Sigma^k(t)\otimes\big(b(t)b^*(t)\big)\,dt
\]
are bounded above and away from zero by a constant $\kappa > 0$, where $b(t) = (\alpha_1(t), \cdots, \alpha_c(t))^* \in \mathbb{R}^c$.
Lemma 3.2.3. Denote the $L^\infty$-norm approximation error with respect to the Lebesgue measure by
\[
\rho := \sup_{t\in[0,1]}|\phi_j(t) - \theta_j(t)|.
\]
We then have $\rho = O(c^{-p})$ for orthogonal polynomials, trigonometric polynomials, spline series when $r \geq p+1$, and orthogonal wavelets when $m > p$.
Using the Cholesky decomposition (3.2.1), we can now write
\[
x_i = \sum_{j=1}^b\sum_{k=1}^c a_{jk}z_{kj}\Big(\frac{i}{n}\Big) + \epsilon_i, \quad i = b+1, \cdots, n, \tag{3.2.3}
\]
with $z_{kj}(\frac{i}{n})$ defined as
\[
z_{kj}\Big(\frac{i}{n}\Big) := \alpha_k\Big(\frac{i}{n}\Big)x_{i-j}.
\]
We can use the ordinary least squares (OLS) method to estimate the coefficients $a_{jk}$.
Denote the vector $\beta \in \mathbb{R}^{bc}$ with $\beta_s = a_{j_s,k_s}$, where $j_s = \lfloor s/c\rfloor + 1$, $k_s = s - \lfloor s/c\rfloor\times c$. Similarly, we define $y_i \in \mathbb{R}^{bc}$ by letting $y_{is} = z_{k_s,j_s}(\frac{i}{n})$. Furthermore, we denote by $Y^*$ the $bc\times(n-b)$ rectangular matrix whose columns are $y_i$, $i = b+1, \cdots, n$. We also define $x \in \mathbb{R}^{n-b}$ containing $x_{b+1}, \cdots, x_n$. Hence, the OLS estimator for $\beta$ can be written as
\[
\widehat{\beta} = (Y^*Y)^{-1}Y^*x.
\]
Moreover, denote $x_i = (x_{i-1}, \cdots, x_{i-b})^* \in \mathbb{R}^b$ and $X = (x_{b+1}, \cdots, x_n) \in \mathbb{R}^{b\times(n-b)}$. Denote by $E_i \in \mathbb{R}^{(n-b)\times(n-b)}$ the matrix satisfying $(E_i)_{st} = 1$ when $s = t = i-b$ and $0$ otherwise. As a consequence, we can write
\[
Y^* = \sum_{i=b+1}^n\Big(X\otimes b\Big(\frac{i}{n}\Big)\Big)E_i, \tag{3.2.4}
\]
where $\otimes$ stands for the Kronecker product. It is well known that the OLS estimator satisfies
\[
\widehat{\beta} = \beta + \Big(\frac{Y^*Y}{n}\Big)^{-1}\frac{Y^*\epsilon}{n}, \tag{3.2.5}
\]
where $\epsilon \in \mathbb{R}^{n-b}$ contains $\epsilon_{b+1}, \cdots, \epsilon_n$. We decompose $\beta$ into $b$ blocks by writing $\beta = (\beta_1^*, \cdots, \beta_b^*)^*$, where each $\beta_i \in \mathbb{R}^c$. Similarly, we can decompose $\widehat{\beta}$. Therefore, our sieve estimator can be written as $\widehat{\phi}_j(\frac{i}{n}) = \widehat{\beta}_j^*b(\frac{i}{n})$, and it satisfies
\[
\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big) = (\widehat{\beta}_j - \beta_j)^*b\Big(\frac{i}{n}\Big). \tag{3.2.6}
\]
By carefully choosing $c = n^{\alpha_1}$, we show that the $\widehat{\phi}_j(\frac{i}{n})$ are consistent estimators of $\phi_j(\frac{i}{n})$ uniformly in $i$ for all $j \leq b$. Denoting $\zeta_c := \sup_i \|b(\frac{i}{n})\|$, we have

Theorem 3.2.4. Under Assumption 3.1.2, for some sufficiently small positive constants $\alpha_2, \alpha_3 > 0$ satisfying
\[
2/\tau + \alpha_1 - \alpha_3 < 0, \quad 4/\tau + 2(\alpha_1 + \alpha_3 - 1) < 0, \tag{3.2.7}
\]
\[
\zeta_cn^{-1/2+\alpha_2} = o(1), \tag{3.2.8}
\]
with probability $P_1 := 1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, for some constant $C > 0$, we have
\[
\sup_{i>b,\,j\leq b}\Big|\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big| \leq E_1,
\]
where $E_1$ is defined as
\[
E_1 := \max\{C\zeta_cn^{-1/2+\alpha_2}, Cn^{-p\alpha_1}\}.
\]
Here we recall that $p$ is the order of smoothness of $\gamma(t,j)$ defined in Assumption 3.1.3.
Remark 3.2.5. (i) The first constraint of (3.2.7) ensures that $\frac{1}{n}Y^*Y$ is regular, so that its smallest eigenvalue can be bounded away from zero.
(ii) $\zeta_c = O(\sqrt{c})$ for trigonometric polynomials, spline series and orthogonal wavelets, and $\zeta_c = O(c)$ for orthogonal polynomials.
(iii) We can easily obtain a consistent estimator by suitably choosing the above parameters. For instance, when $\tau = 10$, we can choose $\alpha_1 = 1/8$, $\alpha_2 = 1/4$, $\alpha_3 = 1/2$ for the Fourier basis and the orthogonal wavelet basis.
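To make the sieve regression (3.2.3)-(3.2.6) concrete, here is a minimal sketch. It fits a time-varying AR(1) model $x_i = \phi_1(\frac{i}{n})x_{i-1} + \epsilon_i$ with $\phi_1(t) = 0.6\cos(2\pi t)$ (the model used in the simulation study below), using a Fourier basis with $c = 5$ functions; the sample size and basis size are illustrative choices, not tuned parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
n, c = 5000, 5                         # sample size and number of basis functions

# Simulate a time-varying AR(1): x_i = phi_1(i/n) x_{i-1} + eps_i.
phi1 = lambda t: 0.6 * np.cos(2 * np.pi * t)
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi1(i / n) * x[i - 1] + rng.standard_normal()

def basis(t):
    """Orthonormal Fourier basis alpha_1, ..., alpha_c on [0, 1]."""
    cols = [np.ones_like(t)]
    for k in range(1, (c - 1) // 2 + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * k * t),
                 np.sqrt(2) * np.sin(2 * np.pi * k * t)]
    return np.column_stack(cols)[:, :c]

# Regress x_i on the regressors z_k(i/n) = alpha_k(i/n) * x_{i-1} (here b = 1).
i_idx = np.arange(1, n)
Y = basis(i_idx / n) * x[i_idx - 1][:, None]
beta_hat, *_ = np.linalg.lstsq(Y, x[i_idx], rcond=None)

# The sieve estimate phi_hat(t) = sum_k beta_hat_k alpha_k(t), as in (3.2.6).
grid = np.linspace(0, 1, 101)
phi_hat = basis(grid) @ beta_hat
sup_err = float(np.max(np.abs(phi_hat - phi1(grid))))
print(f"sup-norm error of the sieve estimate: {sup_err:.3f}")
```

Since the true coefficient function lies exactly in the span of the Fourier basis, the bias term $n^{-p\alpha_1}$ vanishes here and only the stochastic term $\zeta_c n^{-1/2+\alpha_2}$ remains, which is what the small sup-norm error reflects.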
Time-varying coefficients for $i \leq b$. It is notable that, by Lemma 3.1.6, when $i, j$ are not very large we cannot use the above estimators. In this situation, for each $\phi_{ij}$, $j < i \leq b$, we estimate the coefficients of the best linear predictor after smoothing by the sieve method. For instance, in order to estimate $\phi_{21}$, we rely on the following regression equations
\[
x_k = \phi_{k1}x_{k-1} + \xi_{2k}, \quad k = 2, 3, \cdots, n,
\]
where $\xi_{22} = \epsilon_2$. Due to the local stationarity assumption, informally we can say that there exists a continuous function $f_{21}$ such that $\phi_{21} \approx f_{21}(\frac{2}{n})$, where $f_{21}$ can be efficiently estimated using the sieve method. We now make this idea rigorous. For each fixed
$i \leq b$, to estimate $\phi_i$ we make use of the following equations,
\[
x_k = \sum_{j=1}^{i-1}\lambda_{kj}x_{k-j} + \xi_{ik}, \quad k = i, i+1, \cdots, n, \tag{3.2.9}
\]
where $\lambda_k^i = (\lambda_{k1}, \cdots, \lambda_{k,i-1})$ are the coefficients of the best linear prediction using the $i-1$ predecessors and $\xi_{ii} = \epsilon_i$. Note that $\lambda_i^i = \phi_i$. Using the Yule-Walker equation, we find
\[
\lambda_k^i = (\Gamma_i^k)^{-1}\gamma_i^k,
\]
where $\Gamma_i^k = \operatorname{Cov}(x_i^k, x_i^k)$, $\gamma_i^k = \operatorname{Cov}(x_i^k, x_k)$ and $x_i^k = (x_{k-1}, \cdots, x_{k-i+1})$. Due to Assumption 3.1.3, we can define $f_i^k = (f_1^i(\frac{k}{n}), \cdots, f_{i-1}^i(\frac{k}{n}))$ by
\[
f_i^k = (\bar{\Gamma}_i^k)^{-1}\bar{\gamma}_i^k, \tag{3.2.10}
\]
with $\bar{\Gamma}_i^k, \bar{\gamma}_i^k$ defined by
\[
\bar{\Gamma}_i^k = \operatorname{Cov}(\bar{x}_i^k, \bar{x}_i^k), \quad \bar{\gamma}_i^k = \operatorname{Cov}(\bar{x}_i^k, x_k),
\]
where $\bar{x}_{i,j}^k = G(\frac{k}{n},\mathcal{F}_{k-j})$. In order to estimate $\phi_{ij}$ via $f_j^i(\frac{i}{n})$, we will make use of the following lemma.
Lemma 3.2.6. Under Assumption 3.1.2, for each fixed $i \leq b$ and for any $j \leq i-1$, the $f_j^i(t)$ are $C^p$ functions on $[0,1]$. Furthermore, we have
\[
\Big|\phi_{ij} - f_j^i\Big(\frac{i}{n}\Big)\Big| \leq \max\{n^{-1+2/\tau}, n^{-p\alpha_1}\}, \quad j < i \leq b. \tag{3.2.11}
\]
Therefore, by Lemma 3.2.3, the remaining work is to estimate the functions $f_j^i(\frac{i}{n})$, $j < i \leq b$, using the sieve approximation
\[
f_j^i\Big(\frac{i}{n}\Big) = \sum_{k=1}^c d_{jk}\alpha_k\Big(\frac{i}{n}\Big).
\]
We denote the OLS estimate by $\widehat{f}_j^i(\frac{i}{n}) = \sum_{k=1}^c \widehat{d}_{jk}\alpha_k(\frac{i}{n})$.
Theorem 3.2.7. Under Assumption 3.1.2, for some sufficiently small positive constants $\alpha_2, \alpha_3 > 0$ satisfying (3.2.7) and (3.2.8), for some constant $C > 0$, with probability $1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, we have
\[
\sup_{i\leq b,\,j<i}\Big|\widehat{f}_j^i\Big(\frac{i}{n}\Big) - f_j^i\Big(\frac{i}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-1+2/\tau}, Cn^{-p\alpha_1}\}.
\]
Sieve estimation for noise variances. We discuss the cases $i > b$ and $i \leq b$ separately. For $i > b$, denote $\epsilon_i^b = x_i - \sum_{j=1}^b \phi_{ij}x_{i-j}$ and $(\sigma_i^b)^2 = \mathbb{E}(\epsilon_i^b)^2$. $\sigma_i$ can be well approximated by $\sigma_i^b$, by the following lemma.

Lemma 3.2.8. For $i > b$ and some constant $C > 0$, we have
\[
\sup_{i>b}\big|\sigma_i^2 - (\sigma_i^b)^2\big| \leq Cn^{-2+2/\tau}.
\]
Furthermore, denoting $g(\frac{i}{n}) = \mathbb{E}\Big(x_i - \sum_{j=1}^b \phi_{ij}G\big(\frac{i}{n},\mathcal{F}_{i-j}\big)\Big)^2$, we then have
\[
\sup_{i>b}\Big|(\sigma_i^b)^2 - g\Big(\frac{i}{n}\Big)\Big| \leq Cn^{-1+4/\tau}.
\]
Finally, $g \in C^p([0,1])$.
Denote $r_i^b = (\epsilon_i^b)^2$; it is notable that $r_i^b$ cannot be observed directly. Denote $\widehat{r}_i^b = \widehat{\epsilon}_i^{\,2}$, where
\[
\widehat{\epsilon}_i = x_i - \sum_{j=1}^b\sum_{k=1}^c \widehat{a}_{jk}z_{kj}\Big(\frac{i}{n}\Big), \quad i = b+1, \cdots, n. \tag{3.2.12}
\]
By Theorem 3.2.4, we conclude that with probability $P_1$, for some constant $C > 0$, we have
\[
\sup_{i>b}|\widehat{r}_i^b - r_i^b| \leq Cn^{4/\tau}(n^{-1+2/\tau} + E_1). \tag{3.2.13}
\]
Denoting the centered random variables $\omega_i^b = r_i^b - (\sigma_i^b)^2$, by Lemma 3.2.8 and (3.2.13), with probability $P_1$ we can write
\[
\widehat{r}_i^b = g\Big(\frac{i}{n}\Big) + \omega_i^b + O(n^{-1+6/\tau} + n^{4/\tau}E_1). \tag{3.2.14}
\]
Invoking Lemma 3.2.3, for convenience we can therefore write our regression equation as
\[
\widehat{r}_i^b = \sum_{k=1}^c d_k\alpha_k\Big(\frac{i}{n}\Big) + \omega_i^b, \quad i = b+1, \cdots, n. \tag{3.2.15}
\]
Similarly to Lemma 3.1.7, we can show that the physical dependence measure of $\omega_i^b$ is also of polynomial decay. Therefore, the OLS estimator for $\alpha = (d_1, \cdots, d_c)^*$ can be written as
\[
\widehat{\alpha} = (W^*W)^{-1}W^*r,
\]
where $W^*$ is a $c\times(n-b)$ matrix whose $i$-th column is $(\alpha_1(\frac{i+b}{n}), \cdots, \alpha_c(\frac{i+b}{n}))^*$, $i = 1, 2, \cdots, n-b$, and $r \in \mathbb{R}^{n-b}$ contains $\widehat{r}_{b+1}^b, \cdots, \widehat{r}_n^b$. Furthermore, by the property of OLS, we have
\[
\widehat{\alpha} = \alpha + \Big(\frac{W^*W}{n}\Big)^{-1}\frac{W^*\omega}{n}, \tag{3.2.16}
\]
where $\omega = (\omega_{b+1}, \cdots, \omega_n)^*$. Correspondingly, we have the following consistency result.

Theorem 3.2.9. Under Assumptions 3.1.2 and 3.1.3, for some sufficiently small constant $\alpha_2 > 0$, with probability $P_2 := 1 - n^{1-q(1/2+\alpha_2)}$, for some constant $C > 0$, we have
\[
\sup_{i_0>b}\Big|\widehat{g}\Big(\frac{i_0}{n}\Big) - g\Big(\frac{i_0}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\}. \tag{3.2.17}
\]
Finally, we study the estimation of $\sigma_i^2$, $i = 1, 2, \cdots, b$, which admits a similar discussion. Recalling $\xi_{ik}$ defined in (3.2.9), denote $(\sigma_k^i(\xi))^2 = \mathbb{E}(\xi_{ik})^2$. Using a discussion similar to Lemma 3.2.8, we can find a smooth function $g_i$ such that $\sup_{i\leq b}|(\sigma_i^i(\xi))^2 - g_i(\frac{i}{n})| \leq O(n^{-1+4/\tau})$; in particular, we can use $g_i(\frac{i}{n})$ to estimate $\sigma_i^2$. When $i = 1$, we need to estimate the variance function of $x_i$.
The remaining work is to estimate $g_i(t)$ using the sieve method, similarly to (3.2.15), for $i \leq b$, where we replace the residual with $\widehat{r}_{ik}$, $k = i, \cdots, n$. Here $\widehat{r}_{ik}$ is defined as
\[
\widehat{r}_{ik} := \Big(x_k - \sum_{j=1}^{i-1}\widehat{f}_j^i\Big(\frac{k}{n}\Big)x_{k-j}\Big)^2, \quad k = i, i+1, \cdots, n. \tag{3.2.18}
\]
Then for $i_0 \leq b$, we can estimate $g_{i_0}(\frac{i_0}{n})$ similarly, except that the dimension of $W^*$ is $c\times(n+1-i)$. The results can be summarized in the following theorem.

Theorem 3.2.10. Under Assumptions 3.1.2 and 3.1.3, with probability $P_2$, for some constant $C > 0$, we have
\[
\sup_{i_0\leq b}\Big|\widehat{g}_{i_0}\Big(\frac{i_0}{n}\Big) - g_{i_0}\Big(\frac{i_0}{n}\Big)\Big| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\}.
\]
Estimation of covariance and precision matrices. It is natural to choose
\[
\widehat{\Gamma} := \widehat{\Phi}^{-1}\widehat{D}(\widehat{\Phi}^{-1})^*
\]
as our estimator. Here $\widehat{\Phi}$ is a lower triangular matrix whose diagonal entries are all ones. We now control the estimation error between $\widehat{\Gamma}$ and $\Gamma$.

We first observe that, since $\det(\Phi\Phi^*) = \det(\widehat{\Phi}\widehat{\Phi}^*) = 1$, combining with Assumption 3.1.2, there exist some constants $C_1, C_2 > 0$ such that
\[
C_1 \leq \lambda_{\min}(\Phi\Phi^*) \leq \lambda_{\max}(\Phi\Phi^*) \leq C_2.
\]
Similar results hold for $\widehat{\Phi}\widehat{\Phi}^*$.
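A minimal sketch of the Cholesky-based construction, assuming a stationary AR(1) model with known coefficient so that the factors are explicit and the result can be checked against the closed-form covariance $a^{|i-j|}/(1-a^2)$; in practice the entries of the factors are filled with the sieve estimates above:

```python
import numpy as np

def ar1_cholesky_factors(a, p):
    """Unit-lower-triangular Phi and diagonal D for a stationary AR(1)
    x_i = a x_{i-1} + eps_i with Var(eps_i) = 1: row i of Phi encodes the
    regression of x_i on its predecessors, D holds the residual variances."""
    Phi = np.eye(p)
    for i in range(1, p):
        Phi[i, i - 1] = -a            # x_i - a x_{i-1} = eps_i
    D = np.ones(p)
    D[0] = 1.0 / (1.0 - a ** 2)       # Var(x_1) for the stationary start
    return Phi, D

a, p = 0.5, 6
Phi, D = ar1_cholesky_factors(a, p)

# Covariance and precision estimators from the Cholesky decomposition:
#   Gamma = Phi^{-1} D (Phi^{-1})^*,  Gamma^{-1} = Phi^* D^{-1} Phi.
Phi_inv = np.linalg.inv(Phi)
Gamma = Phi_inv @ np.diag(D) @ Phi_inv.T
Gamma_inv = Phi.T @ np.diag(1.0 / D) @ Phi

# Check against the closed-form AR(1) covariance a^{|i-j|} / (1 - a^2).
idx = np.arange(p)
Gamma_true = a ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - a ** 2)
print(np.max(np.abs(Gamma - Gamma_true)))             # numerically zero
print(np.max(np.abs(Gamma @ Gamma_inv - np.eye(p))))  # numerically zero
```

The same two lines that build `Gamma` and `Gamma_inv` from `Phi` and `D` are all that is needed once the time-varying coefficients and residual variances have been estimated; no separate matrix inversion of the covariance itself is required for the precision matrix.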
Proposition 3.2.11. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, for $\alpha_1, \alpha_2$ and $\alpha_3$ defined in (3.2.7) and (3.2.8), with probability $1 - P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$, for some constant $C > 0$, we have
\[
\big\|\widehat{\Gamma} - \Gamma\big\| \leq Cn^{4/\tau}\max\{n^{-1+4/\tau}, n^{-p\alpha_1}, \zeta_c^2n^{-1+2\alpha_2}, \zeta_cn^{-1/2+\alpha_2}\}, \tag{3.2.19}
\]
where $P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$ is defined as
\[
\max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}. \tag{3.2.20}
\]
Proof. Using the fact that for any two compatible matrices $A, B$, $AB$ and $BA$ have the same non-zero eigenvalues, we conclude that
\[
\big\|\widehat{\Gamma} - \Gamma\big\| \leq C_1^{-1/2}\|\mathcal{E}\|,
\]
where $\mathcal{E}$ has the following decomposition
\begin{align*}
\mathcal{E} &= \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 \\
&= \big[D - \widehat{D}\big] + \big[D\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)^*\Phi^* + \Phi\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)D\big] + \big[\Phi\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)D\big(\Phi^{-1} - \widehat{\Phi}^{-1}\big)^*\Phi^*\big].
\end{align*}
Denote $B := D(\Phi^{-1})^*(\Phi - \widehat{\Phi})^*(\widehat{\Phi}^{-1})^*\Phi^*$; we therefore have $\|\mathcal{E}_2\| \leq 2\|B\|$. We further denote $R_\Phi := \Phi - \widehat{\Phi}$, and first observe that $(R_\Phi)_{ij} = 0$ for $i \leq j$. Then by Lemmas 3.1.5 and 3.1.6 and Theorems 3.2.4 and 3.2.7, for $i \leq b$ or $j \leq b \leq i$, with probability $1 - \max\{O(n^{1-q(1/2+\alpha_2)}), O(n^{4/\tau+2\alpha_1+2\alpha_3-2})\}$, we have $(R_\Phi)_{ij} = O(\max\{\zeta_cn^{-1/2+\alpha_2}, n^{-p\alpha_1}\})$, and for $i > b, j > b$, $|(R_\Phi)_{ij}| \leq j^{-\tau}$. This implies that with the above probability, we have
\[
\lambda_{\max}\big((\Phi - \widehat{\Phi})(\Phi - \widehat{\Phi})^*\big) \leq \max\{C\zeta_c^2n^{-1+2\alpha_2+4/\tau}, Cn^{-2p\alpha_1+4/\tau}\},
\]
where we use Gershgorin's circle theorem. As a consequence, by submultiplicativity, for some constant $C > 0$, we have that
\[
\|\mathcal{E}_2\| \leq \max\{C\zeta_cn^{-1/2+\alpha_2}, Cn^{-p\alpha_1}\}.
\]
Similarly, we can show that
\[
\|\mathcal{E}_3\| \leq \max\{C\zeta_c^2n^{-1+2\alpha_2+4/\tau}, Cn^{-2p\alpha_1+4/\tau}\}.
\]
By (3.2.14) and Theorems 3.2.9 and 3.2.10, with probability $P_1$,
\[
\|\mathcal{E}_1\| \leq C\max\{\zeta_cn^{-1/2+\alpha_2+4/\tau}, n^{-p\alpha_1+4/\tau}, n^{-1+8/\tau}\}.
\]
Hence, we have finished our proof.
An advantage of the Cholesky decomposition is that we can easily estimate the precision matrix using the following estimator
\[
\widehat{\Gamma}^{-1} := \widehat{\Phi}^*\widehat{D}^{-1}\widehat{\Phi}.
\]
Similarly to Proposition 3.2.11, we have the following result for the precision matrix.

Corollary 3.2.12. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, for $\alpha_1, \alpha_2$ and $\alpha_3$ defined in (3.2.7) and (3.2.8), with probability $1 - P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$, for some constant $C > 0$, we have
\[
\big\|\widehat{\Gamma}^{-1} - \Gamma^{-1}\big\| \leq Cn^{4/\tau}\max\{n^{-1+4/\tau}, n^{-p\alpha_1}, \zeta_c^2n^{-1+2\alpha_2}, \zeta_cn^{-1/2+\alpha_2}\},
\]
with $P(\tau,\alpha_1,\alpha_2,\alpha_3,n)$ defined in (3.2.20).
In this subsection, we show by simulations the finite sample performance of our estimation of covariance and precision matrices. We investigate the following four non-stationary models:
• Non-stationary MA(1) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-1} + \epsilon_i,
\]
where the $\epsilon_i$ are i.i.d. $\mathcal{N}(0,1)$ random variables.
• Non-stationary MA(2) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-1} + 0.3\sin\Big(\frac{2\pi i}{n}\Big)\epsilon_{i-2} + \epsilon_i.
\]
• Time-varying AR(1) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)x_{i-1} + \epsilon_i.
\]
• Time-varying AR(2) process
\[
x_i = 0.6\cos\Big(\frac{2\pi i}{n}\Big)x_{i-1} + 0.3\sin\Big(\frac{2\pi i}{n}\Big)x_{i-2} + \epsilon_i.
\]
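For the non-stationary MA(1) model above, the true covariance matrix is banded: $\operatorname{Var}(x_i) = 1 + 0.36\cos^2(2\pi i/n)$, $\operatorname{Cov}(x_i, x_{i+1}) = 0.6\cos(2\pi(i+1)/n)$, and zero beyond lag one. A minimal sketch constructing it and cross-checking by Monte Carlo (the dimension and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
i = np.arange(1, n + 1)
a = 0.6 * np.cos(2 * np.pi * i / n)      # coefficients a_i = 0.6 cos(2*pi*i/n)

# True covariance of x_i = a_i eps_{i-1} + eps_i with eps_i i.i.d. N(0, 1):
# Var(x_i) = 1 + a_i^2, Cov(x_i, x_{i+1}) = a_{i+1}, zero beyond lag one.
Gamma = np.diag(1.0 + a ** 2) + np.diag(a[1:], k=1) + np.diag(a[1:], k=-1)

# Monte Carlo cross-check via the sample covariance over independent replications.
eps = rng.standard_normal((reps, n + 1))
x = a * eps[:, :-1] + eps[:, 1:]         # each row is one replication of x_1, ..., x_n
S = (x.T @ x) / reps                     # the process is mean zero, so no centering
print(np.linalg.norm(S - Gamma, 2))      # operator-norm error, shrinks like reps^{-1/2}
```

The AR models admit similar exact computations by solving the time-varying Yule-Walker recursions, which is how the error columns in the tables below can be evaluated against a known target.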
It is easy to compute the true covariance matrices of the above models. In the following simulations, we report the average estimation errors in terms of the operator norm, together with their standard deviations, based on 1000 repetitions. We compare the results for two different types of bases, the Fourier basis and the Daubechies orthogonal wavelet basis [33]. We also record the estimation errors obtained by using the simple sample covariance matrices.
We observe from Table 3.1 that our covariance matrix estimators are in general better than simply using the sample covariance matrices. Due to the consistency of our estimators, they become more accurate as $n$ grows. Furthermore, as we can see from the estimation of the AR(1) and AR(2) processes, our estimators can still be quite accurate even when the underlying covariance matrix is not very sparse, whereas the simple sample covariance matrix deteriorates due to the curse of dimensionality. Similarly, as we can see from Table 3.2, our estimators of the precision matrices are also reasonably accurate. We remark that most of the simple precision matrices are singular due to correlation. In this sense, our methodology provides a natural way to estimate the precision matrix of a non-stationary time series.
Model    Method               n=200          n=500          n=800
MA(1)    Fourier basis        1.71 (0.77)    1.52 (0.6)     1.58 (0.56)
         Wavelet basis        1.91 (0.87)    1.85 (0.86)    1.65 (0.68)
         Sample estimation    2.53 (0.005)   2.55 (0.002)   2.55 (0.002)
MA(2)    Fourier basis        2.12 (0.88)    2.04 (0.72)    1.79 (0.69)
         Wavelet basis        2.21 (1)       2.12 (0.98)    1.73 (0.83)
         Sample estimation    2.78 (0.005)   2.78 (0.002)   2.8 (0.002)
AR(1)    Fourier basis        1.78 (0.98)    1.67 (0.77)    1.53 (0.76)
         Wavelet basis        1.89 (1.1)     1.78 (0.99)    1.65 (0.92)
         Sample estimation    5.79 (0.03)    6.06 (0.01)    6.81 (0.01)
AR(2)    Fourier basis        3.1 (1.2)      2.95 (1.03)    2.67 (0.87)
         Wavelet basis        3.5 (1.05)     3.3 (1)        3.02 (0.95)
         Sample estimation    8.3 (0.04)     8.8 (0.018)    8.97 (0.01)

Table 3.1: Operator norm error for the estimation of covariance matrices.
Our simulations are necessarily limited, because we need the true covariance and precision matrices to be known for comparison. In these cases, we choose either the Fourier basis or the orthogonal wavelet basis for our sieve estimation. However, as we can see from Tables 3.1 and 3.2, when $n$ is quite large the differences between the bases can be ignored. We therefore suggest the use of the orthogonal wavelet basis in practice. As a final remark, in the finite sample case one should choose $b$ and $c$ using data-driven model selection methods, for instance AIC and BIC.
Model    Method           n=200          n=500          n=800
MA(1)    Fourier basis    3.3 (0.26)     2.98 (0.27)    2.57 (0.28)
         Wavelet basis    3.35 (0.7)     3.16 (0.53)    3.08 (0.44)
MA(2)    Fourier basis    3.8 (0.22)     3.66 (0.22)    3.43 (0.21)
         Wavelet basis    3.85 (0.45)    3.72 (0.45)    3.52 (0.43)
AR(1)    Fourier basis    0.85 (0.39)    0.6 (0.22)     0.37 (0.19)
         Wavelet basis    1.1 (0.68)     0.97 (0.48)    0.77 (0.46)
AR(2)    Fourier basis    1.32 (0.55)    1.03 (0.32)    0.79 (0.29)
         Wavelet basis    1.57 (0.78)    1.38 (0.67)    1 (0.52)

Table 3.2: Operator norm error for estimation of precision matrices.
3.3 Inference of covariance and precision matrices
Another advantage of our methodology is that we can test the structure of the covariance and precision matrices using some simple statistics in terms of the entries of $\Phi$. On one hand, when $i > b$, denote by $B_j(\frac{i}{n}) \in \mathbb{R}^{bc}$ the vector with $b$ blocks whose $j$-th block is the basis $b(\frac{i}{n})$ and which is zero otherwise. Therefore, for any fixed $i > b$, $j \leq b$, we have
\[
\Big(\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big)^2 = B_j^*\Big(\frac{i}{n}\Big)(\widehat{\beta} - \beta)(\widehat{\beta} - \beta)^*B_j\Big(\frac{i}{n}\Big) + O(n^{-2p\alpha_1}).
\]
As can be seen from the above equation, the order of smoothness and the number of basis functions are important to our asymptotics. In this section, we assume that $p$ is large enough that $p\alpha_1 > 1$.
As a consequence, recalling (3.2.5), it is easy to see that for $\Sigma$ we have
\[
\Big(\widehat{\phi}_j\Big(\frac{i}{n}\Big) - \phi_j\Big(\frac{i}{n}\Big)\Big)^2 \to B_j^*\Big(\frac{i}{n}\Big)\Sigma^{-1}\frac{Y^*\epsilon}{n}\frac{\epsilon^*Y}{n}\Sigma^{-1}B_j\Big(\frac{i}{n}\Big) \quad \text{in probability}. \tag{3.3.1}
\]
Similarly, for any fixed $i \leq b$, $j \leq i-1$, denote by $B_{ij}(\frac{i}{n}) \in \mathbb{R}^{(i-1)c}$ the vector with $i-1$ blocks whose $j$-th block is $b(\frac{i}{n})$ and which is zero otherwise; we then have
\[
\Big(\widehat{f}_j^i\Big(\frac{i}{n}\Big) - f_j^i\Big(\frac{i}{n}\Big)\Big)^2 = B_{ij}^*\Big(\frac{i}{n}\Big)(\widehat{d}_i - d_i)(\widehat{d}_i - d_i)^*B_{ij}\Big(\frac{i}{n}\Big), \tag{3.3.2}
\]
where $d_i$ and $\widehat{d}_i$ denote the sieve coefficient vector and its OLS estimate, respectively. Note that the above equation differs from (3.3.1) in the sense that $\beta$ is of the same dimension $bc$ for all $i > b$, but $d_i$ is of dimension $(i-1)c$ for $i \leq b$.
Hypothesis testing and test statistics. In this subsection, we focus on two fundamental tests in time series analysis. One of the targets in time series analysis is to test whether the observed samples come from a white noise process, where the null hypothesis is
\[
\mathbf{H}_0^1: x_i \text{ is a white noise process}.
\]
Under $\mathbf{H}_0^1$, recalling (3.1.7) and (3.2.10), the $\phi_j(\frac{i}{n})$, $f_j^i(\frac{i}{n})$ are all zero. Therefore, our estimates $\widehat{\phi}_j(\frac{i}{n})$, $\widehat{f}_j^i(\frac{i}{n})$ should be small for all $i, j$. We therefore use the following statistic to test $\mathbf{H}_0^1$:
\[
T_1 = \sum_{j=1}^b\int_0^1\widehat{\phi}_j^2(t)\,dt + \sum_{i=2}^b\sum_{j=1}^{i-1}\int_0^1\big(\widehat{f}_j^i(t)\big)^2\,dt.
\]
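Since each estimated function is a finite linear combination of the basis functions, the integrals in $T_1$ reduce to sums of squared sieve coefficients whenever the basis is orthonormal on $[0,1]$: $\int_0^1\big(\sum_k a_k\alpha_k(t)\big)^2\,dt = \sum_k a_k^2$. A quick numerical sanity check of this reduction (with an assumed Fourier basis and arbitrary coefficient values):

```python
import numpy as np

def basis(t, c=5):
    """Orthonormal Fourier basis alpha_1, ..., alpha_c on [0, 1]."""
    cols = [np.ones_like(t)]
    for k in range(1, (c - 1) // 2 + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * k * t),
                 np.sqrt(2) * np.sin(2 * np.pi * k * t)]
    return np.column_stack(cols)[:, :c]

a = np.array([0.3, -1.2, 0.5, 0.0, 0.7])   # arbitrary illustrative sieve coefficients
t = np.linspace(0.0, 1.0, 100_001)
phi_sq = (basis(t) @ a) ** 2

# Trapezoid rule for int_0^1 phi(t)^2 dt versus the coefficient identity sum_k a_k^2.
dt = t[1] - t[0]
integral = dt * (np.sum(phi_sq) - 0.5 * (phi_sq[0] + phi_sq[-1]))
print(integral, np.sum(a ** 2))            # the two values agree to high accuracy
```

In practice this means $T_1$ can be evaluated directly from the estimated coefficient vectors, without any numerical quadrature.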
In the analysis of time series, it is also important to test the bandedness of the precision matrix. In our setup, the Cholesky decomposition provides a convenient way to test bandedness. For any given $k_0 \equiv k_0(n) \ll b$, we are interested in testing the following hypothesis
\[
\mathbf{H}_0^2: \text{the precision matrix of } x_i \text{ is } k_0\text{-banded}.
\]
Recall $\Gamma^{-1} = \Phi^*D^{-1}\Phi$. As $\Gamma^{-1}$ is strictly positive definite, the Cholesky decomposition is unique. Therefore, we conclude that $\Phi$ is also $k_0$-banded. Furthermore, under $\mathbf{H}_0^2$, recalling (3.1.7) and (3.2.10), we have $\phi_j(\frac{i}{n}) = 0$ for $j > k_0$. Therefore, it is natural for us to use
the following statistic
\[
T_2 = \sum_{j=k_0+1}^b\int_0^1\widehat{\phi}_j^2(t)\,dt + \sum_{i=k_0+2}^b\sum_{j=k_0+1}^{i-1}\int_0^1\big(\widehat{f}_j^i(t)\big)^2\,dt.
\]
It is notable that both test statistics $T_1, T_2$ can be written as summations of quadratic forms under the null hypothesis. For $T_1$ under $\mathbf{H}_0^1$ and $T_2$ under $\mathbf{H}_0^2$, we have
\[
\widehat{\phi}_j^2(t) = \big(\widehat{\phi}_j(t) - \phi_j(t)\big)^2, \quad \big(\widehat{f}_j^i(t)\big)^2 = \big(\widehat{f}_j^i(t) - f_j^i(t)\big)^2.
\]
Therefore, all the above quantities can be computed using (3.3.1) and (3.3.2), which are
quadratic forms of a high dimensional locally stationary time series.
High dimensional Gaussian approximation. As we have seen in the previous subsection, both test statistics involve high dimensional quadratic forms. The distribution of a quadratic form of Gaussian vectors can be easily computed using Lindeberg's central limit theorem. We need to discuss the Gaussian approximation of quadratic forms for general distributions. We mainly focus on the discussion for $i > b$ and point out the differences for $i \leq b$ at the end of this subsection.
Using (3.2.4) and the basic properties of the Kronecker product, we find that
\[
Y^*\epsilon = \sum_{i=b+1}^n(X\epsilon_i)\otimes b\Big(\frac{i}{n}\Big),
\]
where $\epsilon_i \in \mathbb{R}^{n-b}$ satisfies $\epsilon_{is} = \epsilon_i$ when $s = i-b$ and zero otherwise. Denote $q_{ij}^* = B_j^*(\frac{i}{n})\Sigma^{-1} \in \mathbb{R}^{bc}$. We now rewrite $q_{ij}^*$ as a summation of Kronecker products by
\[
q_{ij}^* = \sum_{k=1}^b e_k^*\otimes q_{ijk}^*,
\]
where e_k is the standard basis vector in R^b and q_{ijk} is the k-th block of q_{ij}, of
size c. As a consequence, we can write
\[
\frac{q_{ij}^* Y^* \varepsilon}{n} = \frac{1}{n} \sum_{k=b+1}^{n} h_k^* q_{ij}^k, \qquad (3.3.3)
\]
where we denote h_k = ε_k x_k and q_{ij}^k ∈ R^b is given by (q_{ij}^k)_s = q_{ijs}^* b(k/n).
Hence, by (3.3.1) and Slutsky's theorem, it suffices to find the distribution of
\[
\frac{1}{n^2} \sum_{k_1=b+1}^{n} \sum_{k_2=b+1}^{n}
h_{k_1}^* q_{ij}^{k_1} \big(q_{ij}^{k_2}\big)^* h_{k_2}. \qquad (3.3.4)
\]
In order to derive the central limit theorems, we now write (3.3.4) as a quadratic form.
To do this, we define H ∈ R^b by letting
\[
H_s = \frac{1}{\sqrt{n}} \sum_{k=b+1}^{n} h_k(s)\, q_{ij}^k(s), \qquad (3.3.5)
\]
where h_k(s) and q_{ij}^k(s) stand for the s-th entries of the respective vectors. Hence, we
can rewrite (3.3.4) as (1/n) H^* E H, where E is the b × b matrix with all entries equal
to one. Therefore, it suffices to derive the distribution of the above quadratic form.
Under the assumption that h_k, k = b+1, …, n, are i.i.d., Xu, Zhang and Wu [111]
derived the L^2 asymptotics of H^*H, showing that it is normally distributed after proper
scaling. In our setting, the h_k's are correlated, so we need to extend their results to
allow for a dependence structure. For the dependent case, Zhang and Cheng [117] derived
the asymptotics of the maximal entry of H for locally stationary time series. For the
rest of this subsection, we employ the above ideas to derive the L^2 asymptotics under
the physical dependence measure for high dimensional locally stationary time series.
To derive the distribution of the quadratic form, we first look at Gaussian vectors. We
now write (1/n) H^* E H as
\[
\frac{1}{n^2}\, h^* Q_{ij}\, h, \qquad (3.3.6)
\]
where h ∈ R^{(n-b)b} is a vector of n - b blocks whose k-th block is h_k, and Q_{ij} is an
(n-b)b × (n-b)b block matrix with blocks of size b × b, whose (k_1, k_2)-th block is
q_{ij}^{k_1} (q_{ij}^{k_2})^*. It is notable that Q_{ij} is a rank-one symmetric matrix and
E(h_k) = 0. We denote by g = (g_{b+1}, …, g_n)^* ∼ N(0, Ω) a Gaussian vector which is
independent of h and preserves its covariance structure. Due to non-stationarity, it is
reasonable to assume that Ω is a full-rank matrix. This implies that Q_{ij} Ω is also of
rank one. As a consequence, we conclude that
\[
g^* Q_{ij}\, g \equiv \lambda_{ij} w_{ij}, \qquad (3.3.7)
\]
where ≡ means that the two sides have the same distribution, λ_{ij} is the nonzero
eigenvalue of Q_{ij} Ω, and w_{ij} is a standard chi-squared random variable with one
degree of freedom.
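A quick Monte Carlo check of (3.3.7), with toy dimensions and a synthetic covariance Ω of our own choosing: for a rank-one symmetric Q and g ∼ N(0, Ω), the quadratic form g^*Qg should match λ·χ²₁ with λ = tr(QΩ) in mean and variance.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6                                   # toy dimension
t = rng.normal(size=d)
Q = np.outer(t, t)                      # rank-one symmetric matrix
A = rng.normal(size=(d, d))
Omega = A @ A.T + d * np.eye(d)         # full-rank covariance

lam = np.trace(Q @ Omega)               # the nonzero eigenvalue of Q @ Omega
g = rng.multivariate_normal(np.zeros(d), Omega, size=200_000)
samples = np.einsum('ij,jk,ik->i', g, Q, g)   # g' Q g for each draw

# lambda * chi^2_1 has mean lambda and variance 2 * lambda^2
assert abs(samples.mean() / lam - 1.0) < 0.05
assert abs(samples.var() / (2 * lam**2) - 1.0) < 0.05
```

The check works because g^*Qg = (t^*g)², and t^*g is Gaussian with variance t^*Ωt = tr(QΩ).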
Next, we will show that the above conclusion holds for general locally stationary time
series, for which we have two main issues to address. The first issue is the dependence
structure. To handle this, by choosing a smooth function as in equation (7.1), we use the
technique of M-dependent sequences and a leave-one-block-out argument. The second issue
is to prove universality for distributions beyond the Gaussian. We employ Stein's method
to compare h and g continuously. Similar ideas have been used in proving universality in
random matrix theory. It is notable that we use the continuous version of Stein's method
because of the dependence; for the independent case, the discrete version of Stein's
method was used.
For i > b, we have the following result on the high dimensional Gaussian approximation.
Recalling (3.3.5), it is notable that
\[
\sup_{s,k} \big| q_{ij}^k(s) \big| \le c^{1/2} \zeta_c,
\]
by the Cauchy-Schwarz inequality. Denote
\[
Z := \frac{1}{c^{1/2} \zeta_c} H = \frac{1}{\sqrt{n}} \sum_{k=b+1}^{n} z_k, \qquad (3.3.8)
\]
where z_k := w_k/(c^{1/2} ζ_c) ∈ R^b with w_k(s) = h_k(s) q_{ij}^k(s). We also denote
U = (1/√n) Σ_{k=b+1}^{n} u_k, where (u_{b+1}, …, u_n) is a centered Gaussian random
vector preserving the covariance structure of (z_{b+1}, …, z_n). Our task is to control
the following Kolmogorov distance
\[
\rho_{ij} := \sup_{x \in R}
\big| P(R^z_{ij} \le x) - P(R^u_{ij} \le x) \big|, \qquad (3.3.9)
\]
with the definitions
\[
R^z_{ij} = Z^* E Z, \qquad R^u_{ij} = U^* E U.
\]
Theorem 3.3.1. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have
\[
\lim_{n \to \infty} \sup_{i > b,\, j \le b} \rho_{ij} = O(l(n)),
\]
where l(n) is defined as
\[
l(n) = \max\Big\{ M_x^{-1},\; n^{-\epsilon},\; \psi^{-1/2},\;
\psi^{1/2} b^{5/4} n^{-1/4} M^{-\tau/2 + 1/2},\;
M^2 \psi^2 b^3 M_x n^{-1},\; \psi b^2 M_x^{-1},
\]
\[
M^5 b^4 M_x^3 \psi^3 n^{-2},\; \psi^2 b^4 M^3 M_x n^{-1},\;
\psi n^{-1/2+\epsilon} \big( M_x^{-5/6} + M^{1/2} M_x^{-3} \big)
\sqrt{M_x \log b} \Big\},
\]
where M_x, ψ, M → ∞ and ε ∈ (0, 1) is some constant.

It can be easily checked that l(n) → 0 by suitably choosing the parameters, for instance
\[
\psi = M = M_x = b^{1/8}.
\]
A similar discussion holds for i ≤ b, except that the dimension of Y_i^* is (i-1)c,
varying with i. Denote by Σ_i ∈ R^{(i-1)c × (i-1)c} the convergent limit of Y_i^* Y_i / n.
We further denote
\[
Z_i := \frac{1}{c^{1/2} \zeta_c} H_i = \frac{1}{\sqrt{n}} \sum_{k=i}^{n} z_k^i,
\]
where z_k^i := w_k^i/(c^{1/2} ζ_c) ∈ R^{i-1} with w_k^i(s) = h_k^i(s) p_{ij}^k(s). Here
h_k^i = ξ_k^i x_k^i ∈ R^{i-1} and p_{ij}^k(s) = p_{ijs}^* b(k/n) with
p_{ij}^* = B_{ij}^*(i/n) Σ_i^{-1}. Similarly, we can prove the Gaussian approximation
results; we omit the details here.
Theorem 3.3.2. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have
\[
\lim_{n \to \infty} \sup_{j < i \le b} \rho_{ij} = O(l(n)),
\]
where l(n) is defined as in Theorem 3.3.1, with M_x, ψ, M → ∞ and ε ∈ (0, 1) some
constant.
Asymptotics of test statistics. With the above preparation, we now derive the
distributions of the test statistics T_1 and T_2 defined in Section 3.3. For T_1, under
H_0^1, we have
\[
T_1 = \frac{1}{n^2} \int_0^1 \Big[ h^* \sum_{j=1}^{b} Q_j(t)\, h
+ \sum_{i=2}^{b} (h^i)^* \sum_{j=1}^{i-1} Q_{ij}(t)\, h^i \Big] dt, \qquad (3.3.10)
\]
where Q_j(t) ∈ R^{(n-b)b × (n-b)b} is an extension of Q_{ij} obtained by letting
Q_j(i/n) = Q_{ij}, h^i ∈ R^{(n-i+1)(i-1)}, and Q_{ij}(t) ∈ R^{(n-i+1)(i-1) × (n-i+1)(i-1)}
is defined similarly to Q_j(t) using the vector B_{ij}(i/n).

Recall that each Q_j(t) is a rank-one matrix, which we can write as Q_j(t) = t_j(t) t_j^*(t).
Here t_j(t) ∈ R^{(n-b)b} is a vector of n - b blocks with the k-th block being q_j^k(t).
As Σ is positive definite, for each fixed t, the vectors t_j, j = 1, …, b, are linearly
independent. Hence, Σ_{j=1}^{b} Q_j(t) is a rank-b symmetric matrix. For the Gaussian
case, using Lindeberg's central limit theorem, h^* Σ_{j=1}^{b} Q_j(t) h is normally
distributed.
Remark 3.3.3. For each fixed t, denote
\[
Q^b(t) = \sum_{j=1}^{b} Q_j(t),
\]
and write its spectral decomposition as Q^b(t) = Σ_{k=1}^{b} λ_k^b(t) u_k^b(t)(u_k^b(t))^*.
Recalling equation (3.3.7), Theorem 3.3.1 has established the Gaussian approximation for
the quadratic form with a rank-one matrix. Since Q^b(t) is a rank-b matrix with b ≪ n,
we can extend our results to the form h^* Q^b(t) h by modifying equation (3.3.4). In
detail, we can write
\[
\sum_{j=1}^{b} \lambda_j^b(t)\, h^* u_j^b(t) \big(u_j^b(t)\big)^* h
= \sum_{j=1}^{b} \lambda_j^b(t) \sum_{k_1=b+1}^{n} \sum_{k_2=b+1}^{n}
h_{k_1}^* u_j^{k_1}(t) \big(u_j^{k_2}(t)\big)^* h_{k_2}. \qquad (3.3.11)
\]
As we have seen from the previous discussion, the key step is to construct an analogue
of equation (3.3.5). We can then rewrite (3.3.11) as
\[
\frac{1}{n} H_b^*\, E^b\, H_b,
\]
where E^b ∈ R^{b² × b²} is a b × b diagonal block matrix with blocks of size b × b, the
j-th diagonal block being λ_j^b(t) E. Here E ∈ R^{b × b} is the matrix with all entries
equal to one, and H_b is a block vector whose j-th block casts the term
Σ_{k_1=b+1}^{n} Σ_{k_2=b+1}^{n} h_{k_1}^* u_j^{k_1}(t)(u_j^{k_2}(t))^* h_{k_2} into the
form of (3.3.5). As b² ≪ n, it is easy to check that the proof of Theorem 3.3.1 still
holds with some minor changes.
For the second term on the right-hand side of (3.3.10), it can be written in the form
\[
h^* Q(t)\, h,
\]
where h is a vector of length Σ_{i=2}^{b} (n-i+1)(i-1) containing all the vectors h^i,
and Q(t) is a (Σ_{i=2}^{b} (n-i+1)(i-1)) × (Σ_{i=2}^{b} (n-i+1)(i-1)) block diagonal
matrix whose i-th diagonal block is Q_{ij}(t). As the covariance matrix of h is regular,
we conclude that this term is still normally distributed. We see from the above discussion
that T_1 is normally distributed with some complicated covariance structure. However,
due to Assumption 3.1.2, it is easy to check that when the first part of (3.3.10) is
small, the second term is also small. Therefore, we instead use the following statistic
\[
T_1^* = \frac{1}{n^2} \int_0^1 (\Sigma_h z_h)^*
\Big( \sum_{j=1}^{b} Q_j(t) \Big) (\Sigma_h z_h)\, dt, \qquad (3.3.12)
\]
where Σ_h is the covariance matrix of h.
We denote the rank-one matrix Ω_k^b(t) = u_k^b(t)(u_k^b(t))^* Ω, with unique non-trivial
eigenvalue μ_k^b(t). For any 1 ≤ k_1, k_2 ≤ b, we also denote
\[
\Omega_{k_1, k_2}^b(t)
= \frac{u_{k_1}^b (u_{k_2}^b)^* + u_{k_2}^b (u_{k_1}^b)^*}{2}\, \Omega,
\]
with unique non-trivial eigenvalue μ_{k_1, k_2}^b(t). By Assumption 3.1.3 and the
smoothness of the basis functions, we conclude that λ_k^b(t) and μ_k^b(t) are smooth
functions on [0, 1]; this is because the characteristic polynomials are smooth functions
of the coefficients. For any t ∈ [0, 1], we define
\[
f_1(t) := \lim_{b \to \infty} \frac{1}{b} \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t),
\]
i.e. the pointwise convergent limit. As λ_k^b and μ_k^b are smooth functions, f_1(t) is
also a smooth function on [0, 1]. We can therefore conclude the following result.
Lemma 3.3.4. There exist continuous functions f_1, f_2 such that, uniformly for
t ∈ [0, 1], we have
\[
\frac{1}{b} \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t) \to f_1(t), \qquad (3.3.13)
\]
\[
\frac{1}{b^2} \Bigg[ \sum_{k_1=1}^{b} \sum_{k_2=1}^{b}
\lambda_{k_1}^b(t)\, \lambda_{k_2}^b(t)
\Big( 2 \big(\mu_{k_1,k_2}^b(t)\big)^2 + \mu_{k_1,k_2}^b(t) \Big)
- \Big( \sum_{k=1}^{b} \lambda_k^b(t)\, \mu_k^b(t) \Big)^2 \Bigg]
\to f_2(t). \qquad (3.3.14)
\]
We can analyze T_2 in the same way and use the following statistic in its place:
\[
T_2^* = \frac{1}{n^2} \int_0^1 (\Sigma_h z_h)^*
\Big( \sum_{j=k_0+1}^{b} Q_j(t) \Big) (\Sigma_h z_h)\, dt. \qquad (3.3.15)
\]
The asymptotics of T_1^* and T_2^* are summarized in the following proposition.
Proposition 3.3.5. Under Assumptions 3.1.2, 3.1.3 and 3.2.2, we have:

(1) Under H_0^1,
\[
T_1^* \Rightarrow N(\mu_1, \sigma_1^2),
\]
where μ_1 and σ_1 are defined as
\[
\mu_1 = \frac{b}{n^2} \int_0^1 f_1(t)\, dt, \qquad
\sigma_1^2 = \frac{b^2}{n^4} \int_0^1 f_2(s)(1-s)\, ds,
\]
with f_1(t) and f_2(t) defined in (3.3.13) and (3.3.14).

(2) Under H_0^2,
\[
T_2^* \Rightarrow N(\mu_2, \sigma_2^2),
\]
where μ_2 and σ_2 are defined as
\[
\mu_2 = \frac{b}{n^2} \int_0^1 f_3(t)\, dt, \qquad
\sigma_2^2 = \frac{b^2}{n^4} \int_0^1 f_4(s)(1-s)\, ds,
\]
with f_3(t) and f_4(t) defined analogously to (3.3.13) and (3.3.14), replacing b by
b - k_0.
Estimation of long-run covariance matrices. From (3.3.12) and (3.3.15), we find
that the key to accurate testing is estimating the long-run covariance matrix Σ_h of h.
We follow the construction of the Nadaraya-Watson (NW) type estimator of [120], where h
is assumed to be of fixed dimension. Under Assumption 3.1.2, we show that this estimator
is still consistent in our setup for h ∈ R^{(n-b)b}, albeit with a worse convergence
rate. Note that the covariance matrix of h is an (n-b) × (n-b) block matrix. We first
consider the diagonal part, where each block Λ_k is the covariance matrix of h_k,
k = b+1, …, n. Denote
\[
\Lambda(t) = \mathrm{Cov}\big(h_b(t), h_b(t)\big),
\]
where h_b(t) = (G_1(t, F_b), …, G_1(t, F_1)). Here G_1 is defined such that
ε_i x_{i-1} = G_1(i/n, F_i) for i > b. The following lemma shows that Λ_k can be well
estimated by Λ(k/n) for k > b.
Lemma 3.3.6. Under Assumptions 3.1.2 and 3.1.3, we have
\[
\sup_{k > b} \Big\| \Lambda\Big(\frac{k}{n}\Big) - \Lambda_k \Big\|
= O\big(n^{-1 + 4/\tau}\big).
\]
Next we consider the upper off-diagonal blocks. For any b < k ≤ n - b + 1 and j > b + k,
we have, for some constant C > 0,
\[
\| \Lambda_{kj} \| \le C (j - b)^{-\tau + 1}, \qquad (3.3.16)
\]
by a discussion similar to that of Lemma 3.1.4 together with Gershgorin's circle theorem.
As a consequence, we only need to estimate the blocks Λ_{kj} for k < j ≤ k + b. For each
fixed j, we denote
\[
\Lambda_j(t) = \mathrm{Cov}\big(h_b(t), h_{b+j}(t)\big),
\]
where h_{b+j}(t) = (G_1(t, F_{b+j}), …, G_1(t, F_{j+1})). Similar to Lemma 3.3.6, we have
\[
\Big\| \Lambda_j\Big(\frac{k}{n}\Big) - \Lambda_{kj} \Big\|
= O\big(n^{-1 + 4/\tau}\big).
\]
Next, we will estimate Λ(t) and Λ_j(t), 1 ≤ j ≤ b, using Nadaraya-Watson type
estimators. Denote
\[
\Psi_k = \sum_{j=0}^{m} h_{k+j}, \qquad
\Delta_k = \frac{\Psi_k \Psi_k^*}{m+1}, \qquad b+1 \le k \le n - m,
\]
where m → ∞ and m/n → 0. Let h_n be the bandwidth and γ_n = h_n + (m+1)/n. For
t ∈ I = [γ_n, 1 - γ_n] ⊂ (0, 1), we define
\[
\widehat{\Lambda}(t) = \sum_{k=b+1}^{n-m} W(t, k)\, \Delta_k,
\quad \text{where} \quad
W(t, k) = \frac{K_{h_n}\big(\frac{k}{n} - t\big)}
{\sum_{j=b+1}^{n-m} K_{h_n}\big(\frac{j}{n} - t\big)},
\]
and K_{h_n}(·) is a smooth symmetric density function on R supported on [-1, 1].
Similarly, for each fixed j ≤ b, we define
\[
\Delta_{kj} = \frac{\Psi_k \Psi_{k+j}^*}{m+1}, \qquad b+1 \le k \le n - j - m,
\]
and denote
\[
\widehat{\Lambda}_j(t) = \sum_{k=b+1}^{n-j-m} W_j(t, k)\, \Delta_{kj},
\quad \text{where} \quad
W_j(t, k) = \frac{K_{h_n}\big(\frac{k}{n} - t\big)}
{\sum_{l=b+1}^{n-m-j} K_{h_n}\big(\frac{l}{n} - t\big)}.
\]
Finally, we define \widehat{\Sigma}_h as the long-run covariance matrix estimator by
setting its blocks as
\[
(\widehat{\Sigma}_h)_{kk} = \widehat{\Lambda}\Big(\frac{b+k}{n}\Big), \qquad
(\widehat{\Sigma}_h)_{kj} = \widehat{\Lambda}_{j-k}\Big(\frac{k+b}{n}\Big),
\qquad (3.3.17)
\]
and zero otherwise, where k = 1, 2, …, n - b and k < j ≤ k + b. To avoid abusing
notation, we write \widehat{\Lambda}_0(t) \equiv \widehat{\Lambda}(t).
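As an illustration, here is a minimal scalar sketch of the NW-type construction above (block sums Ψ_k, products Δ_k, kernel weights). The Epanechnikov kernel, the toy variance function σ(t) and all sizes are our own choices for the sketch, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, bw = 20_000, 10, 0.1              # sample size, block length m, bandwidth h_n

# Toy locally stationary sequence: h_k = sigma(k/n) * e_k with smooth sigma
sigma = lambda t: 1.0 + 0.5 * np.sin(np.pi * t)
h = sigma(np.arange(1, n + 1) / n) * rng.normal(size=n)

# Block sums Psi_k and products Delta_k, as in the text (scalar case)
Psi = np.convolve(h, np.ones(m + 1), mode='valid')   # length n - m
Delta = Psi**2 / (m + 1)

def K(u):                                # Epanechnikov kernel on [-1, 1] (our choice)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def Lambda_hat(t):                       # NW-weighted average of the Delta_k
    k = np.arange(n - m)
    w = K((k / n - t) / bw)
    return (w * Delta).sum() / w.sum()

# At interior points the estimate should track the local variance sigma(t)^2
for t in (0.3, 0.5, 0.7):
    assert abs(Lambda_hat(t) - sigma(t)**2) < 0.5
```

In this independent toy example the long-run variance equals the local variance; in the dependent setting of the text, the block sums Ψ_k are what capture the extra autocovariance terms.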
Theorem 3.3.7. Under Assumptions 3.1.2 and 3.1.3, let m → ∞, m/n → 0, h_n → 0 and
n h_n → ∞. Then for j = 0, 1, 2, …, b, we have
\[
\sup_{t \in I} \Big\| \widehat{\Lambda}_j(t) - \Lambda_j(t) \Big\|
= O\Bigg( b \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} \Big) \Bigg). \qquad (3.3.18)
\]
As a consequence, we have
\[
\Big\| \widehat{\Sigma}_h - \Sigma_h \Big\|
= O\Bigg( b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} \Big) \Bigg). \qquad (3.3.19)
\]
In practice, the true ε_i is unknown and we have to use \widehat{\epsilon}_i defined in
(3.2.12). We then define
\[
\widetilde{\Lambda}(t) = \sum_{k=b+1}^{n-m} W(t, k)\, \widetilde{\Delta}_k, \qquad
\widetilde{\Lambda}_j(t) = \sum_{k=b+1}^{n-j-m} W_j(t, k)\, \widetilde{\Delta}_{kj},
\]
where \widetilde{\Delta}_k (respectively \widetilde{\Delta}_{kj}) is defined as Δ_k
(respectively Δ_{kj}) with h_k therein replaced by
\widetilde{h}_k := x_k^b\, \widehat{\epsilon}_k. Similarly, we can define the estimator
\widetilde{\Sigma}_h. The analogue of Theorem 3.3.7 is the following result.
Theorem 3.3.8. Under the assumptions of Theorem 3.3.7, we have
\[
\sup_{t \in I} \Big\| \widetilde{\Lambda}_j(t) - \Lambda_j(t) \Big\|
= O\Bigg( b \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big) \Bigg),
\]
where θ_n is defined as
\[
\theta_n = b \big( n^{-1} + b c^{-p} + b \zeta_c (bc/n)^{1/2} \big)\, m^{-1}.
\]
As a consequence, we have
\[
\Big\| \widetilde{\Sigma}_h - \Sigma_h \Big\|
= O\Bigg( b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big) \Bigg).
\]
By Proposition 3.3.5 and Theorems 3.3.7 and 3.3.8, we now propose the following practical
procedure to test H_0^1 (the test of H_0^2 is similar):

1. For j = 1, 2, …, b and i = b+1, …, n, estimate Σ^{-1} using n(Y^* Y)^{-1} and
calculate Q_{ij} according to the definition in (3.3.6).

2. Estimate the long-run covariance matrix using (3.3.17) from the samples
{h_k}_{k=b+1}^{n}.

3. Generate B (say 2000) i.i.d. copies of Gaussian random vectors z_i, i = 1, 2, …, B.
Here z_i ∼ N(0, I), where I is the identity matrix of dimension (n-b)b. For each
k = 1, 2, …, B, calculate the following Riemann sum
\[
T_k^1 = \frac{1}{n^2} \sum_{j=1}^{b} \sum_{i=b+1}^{n}
(\widehat{\Sigma}_h z_k)^* Q_{ij} (\widehat{\Sigma}_h z_k).
\]

4. Let T^1_{(1)} ≤ T^1_{(2)} ≤ … ≤ T^1_{(B)} be the order statistics of T_k^1,
k = 1, 2, …, B. Reject H_0^1 at level α if T_1^* > T^1_{(⌊B(1-α)⌋)}, where ⌊x⌋ stands
for the largest integer smaller than or equal to x. Let
B^* = max{k : T^1_{(k)} ≤ T_1^*}; the p-value is then 1 - B^*/B.
By Proposition 3.3.5, we find that T_1 (respectively T_2) converges at the rate n^{1-b}
under H_0^1 (respectively H_0^2). Therefore, using Theorem 3.3.8, we find that both of
our test statistics T_1 and T_2 have asymptotic power 1 under the following two
alternatives, respectively:
\[
H_a^1 : \inf_i \big| \mathrm{Cov}(x_i, x_{i+1}) \big|
\ge b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big),
\]
\[
H_a^2 : \inf_i \big| \mathrm{Cov}(x_i, x_{i+k_0+1}) \big|
\ge b^2 \Big( \frac{1}{m} + h_n^2 + \frac{b}{n}
+ \frac{1}{\sqrt{n h_n}} + \theta_n \Big).
\]
In this subsection, we design simulations to study the finite sample performance of the
tests of white noise and of bandedness of precision matrices, using the procedure
described above. At the nominal levels 0.01, 0.05 and 0.1, the simulated Type I error
rates are listed below for the null hypotheses H_0^1 and H_0^2, based on 1000
simulations, where for H_0^2 we use the AR(2) model (i.e. k_0 = 2). From Tables 3.3 and
3.4, we see that the performance of our proposed tests is reasonably accurate for both
the Fourier basis and the Daubechies wavelet basis.
Next we consider the statistical power of our tests under some given alternatives. For
the test of white noise, we choose the four examples considered above as our
alternatives.

                            n = 200   n = 500   n = 800
  α = 0.01   Fourier Basis   0.01      0.01      0.009
             Wavelet Basis   0.01      0.01      0.01
  α = 0.05   Fourier Basis   0.055     0.049     0.048
             Wavelet Basis   0.051     0.048     0.046
  α = 0.1    Fourier Basis   0.106     0.096     0.091
             Wavelet Basis   0.091     0.089     0.087

Table 3.3: Simulated type I error rates under H_0^1.

                            n = 200   n = 500   n = 800
  α = 0.01   Fourier Basis   0.009     0.011     0.009
             Wavelet Basis   0.011     0.01      0.01
  α = 0.05   Fourier Basis   0.05      0.05      0.048
             Wavelet Basis   0.051     0.05      0.05
  α = 0.1    Fourier Basis   0.098     0.096     0.1
             Wavelet Basis   0.091     0.102     0.092

Table 3.4: Simulated type I error rates under H_0^2 for k_0 = 2.

For the test of bandedness of the precision matrices, under the null hypothesis we choose
k_0 = 2 and consider the following two types of alternatives:
• Time-varying AR(3) process
\[
x_i = 0.6 \cos\Big(\frac{2\pi i}{n}\Big) x_{i-1}
+ 0.3 \sin\Big(\frac{2\pi i}{n}\Big) x_{i-2}
+ \delta \sin\Big(\frac{2\pi i}{n}\Big) x_{i-3} + \epsilon_i,
\]
where the ε_i are i.i.d. standard normal random variables and δ ∈ (0, 0.3). It can be
easily checked that this is a locally stationary process.

• Non-stationary MA(3) process
\[
x_i = 0.6 \cos\Big(\frac{2\pi i}{n}\Big) \epsilon_{i-1}
+ 0.3 \sin\Big(\frac{2\pi i}{n}\Big) \epsilon_{i-2}
+ \frac{i}{n}\, \epsilon_{i-3} + \epsilon_i.
\]
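For reference, the two alternative processes above can be generated directly from their recursions; a minimal simulation sketch (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def tv_ar3(n, delta, rng):
    """Time-varying AR(3) alternative from the text."""
    eps = rng.normal(size=n)
    x = np.zeros(n)
    for i in range(n):
        t = 2 * np.pi * (i + 1) / n
        x[i] = (0.6 * np.cos(t) * (x[i - 1] if i >= 1 else 0.0)
                + 0.3 * np.sin(t) * (x[i - 2] if i >= 2 else 0.0)
                + delta * np.sin(t) * (x[i - 3] if i >= 3 else 0.0)
                + eps[i])
    return x

def ns_ma3(n, rng):
    """Non-stationary MA(3) alternative from the text."""
    eps = rng.normal(size=n + 3)         # eps[i + 2] plays the role of eps_i
    i = np.arange(1, n + 1)
    t = 2 * np.pi * i / n
    return (0.6 * np.cos(t) * eps[2:-1]  # eps_{i-1}
            + 0.3 * np.sin(t) * eps[1:-2]  # eps_{i-2}
            + (i / n) * eps[0:-3]          # eps_{i-3}
            + eps[3:])                     # eps_i

x = tv_ar3(1000, 0.2, rng)
y = ns_ma3(1000, rng)
assert x.shape == y.shape == (1000,)
```

The time-varying AR coefficients stay inside the stability region uniformly in i/n, which is what makes the AR(3) process locally stationary.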
In all of our simulations, we choose the orthogonal wavelet basis as our sieve basis
functions. Figures 3.1 and 3.2 show that our testing procedure is quite robust and has
strong statistical power for both tests.
[Figure: statistical power (0.70 to 1.00) plotted against sample size (50 to 300) for
the AR(1), AR(2), MA(1) and MA(2) alternatives.]

Figure 3.1: Statistical power of white noise testing under nominal level 0.05.
[Figure: statistical power (0.70 to 1.00) plotted against sample size (50 to 300) for
the AR(3) and MA(3) alternatives.]

Figure 3.2: Statistical power of bandedness testing under nominal level 0.05. For the
AR(3) process we choose δ = 0.2.
Finally, we simulate the statistical power for various choices of δ in the AR(3)
process, for the sample sizes n = 200 and n = 300 respectively, in Figure 3.3; we find
that our method is quite robust.
[Figure: statistical power (0 to 1) plotted against the thresholding value δ (0 to 0.3)
for n = 200 and n = 300.]

Figure 3.3: Statistical power of bandedness testing under nominal level 0.05 for
different choices of δ.
Bibliography
[1] Arka Adhikari and Ziliang Che. The edge universality of correlated matrices. arXiv:
1712.04889, 2018.
[2] Oskari Ajanki, Laszlo Erdos, and Torben Kruger. Local spectral statistics of Gaus-
sian matrices with correlated entries. Journal of Statistical Physics, 163:280–302,
2016.
[3] Oskari Ajanki, Laszlo Erdos, and Torben Kruger. Stability of the matrix Dyson
equation and random matrices with correlations. Probability Theory and Related
Fields (to appear), 2016.
[4] Zhidong Bai and Jack Silverstein. Spectral analysis of large dimensional random
matrices. Springer Series in Statistics, Springer, 2nd edition, 2010.
[5] Zhidong Bai and Jianfeng Yao. Central limit theorems for eigenvalues in a spiked
population model. Annales de l’Institut Henri Poincare - Probabilites et Statis-
tiques, 44:447–474, 2008.
[6] Jinho Baik, Gerard Ben Arous, and Sandrine Peche. Phase transition of the largest
eigenvalue for nonnull complex sample covariance matrices. The Annals of Proba-
bility, 33:1643–1697, 2005.
[7] Zhigang Bao and Xiucai Ding. Tracy-Widom limits for sample covariance matrices
with spikes of moderately large rank. In Progress, 2018.
[8] Zhigang Bao, Xiucai Ding, and Ke Wang. Singular subspace inference. In progress,
2018.
[9] Zhigang Bao, Guangming Pan, and Wang Zhou. Universality for the largest eigen-
value of sample covariance matrices with general population. The Annals of Statis-
tics, 43:382–421, 2015.
[10] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Kengo Kato.
Some new asymptotic theory for least squares series: Pointwise and uniform re-
sults. Journal of Econometrics, 186:345–366, 2015.
[11] Florent Benaych-Georges and Antti Knowles. Lectures on the local semicircle law
for Wigner matrices. arXiv: 1601.04055, 2016.
[12] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigen-
vectors of finite, low rank perturbations of large random matrices. Advances in
Mathematics, 227:494–521, 2011.
[13] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vec-
tors of low rank perturbations of large rectangular random matrices. Journal of
Multivariate Analysis, 111:120–135, 2012.
[14] Pavel Bleher and Arno Kuijlaars. Random matrices with external source and mul-
tiple orthogonal polynomials. International Mathematics Research Notices, 3:109–
129, 2004.
[15] Pavel Bleher and Arno Kuijlaars. Integral representations for multiple Hermite and
multiple Laguerre polynomials. Annales de l’Institut Fourier, 55:2001–2014, 2005.
[16] Alex Bloemendal, Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin.
Isotropic local laws for sample covariance and generalized Wigner matrices. Elec-
tronic Journal of Probability, 19:1–53, 2014.
[17] Alex Bloemendal, Antti Knowles, Horng-Tzer Yau, and Jun Yin. On the principal
components of sample covariance matrices. Probability Theory and Related Fields,
164:459–552, 2016.
[18] Alex Bloemendal and Balint Virag. Limits of spiked random matrices I. Probability
Theory and Related Fields, 156:795–825, 2013.
[19] Alex Bloemendal and Balint Virag. Limits of spiked random matrices II. The
Annals of Probability, 44:2726–2769, 2016.
[20] Alexei Borodin. Biorthogonal ensembles. Nuclear Physics B, 3:704–732, 1998.
[21] Alexei Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
[22] Alexei Borodin, Patrik Ferrari, Michael Prahofer, and Tomohiro Sasamoto.
Fluctuation properties of the TASEP with periodic initial configuration. Journal
of Statistical Physics, 129:1055–1080, 2007.
[23] Gaetan Borot. An introduction to random matrix theory. arXiv:1710.10792, 2017.
[24] Joel Bun, Romain Allez, Jean-Philippe Bouchaud, and Marc Potters. Rotational
invariant estimator for general noisy matrices. IEEE Transactions on Information
Theory, 62:7475–7490, 2016.
[25] Joel Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation
matrices: Tools from random matrix theory. Physics Reports, 666:7475–7490, 2017.
[26] Mireille Capitaine and Catherine Donati-Martin. Spectrum of deformed random
matrices and free probability. arXiv: 1607.05560, 2016.
[27] Mireille Capitaine, Catherine Donati-Martin, and Delphine Feral. The largest
eigenvalues of finite rank deformation of large Wigner matrices: Convergence and
nonuniversality of the fluctuations. The Annals of Probability, 37:1–47, 2009.
[28] Sourav Chatterjee. A generalization of the Lindeberg principle. The Annals of
Probability, 34:2061–2076, 2006.
[29] Sourav Chatterjee. Superconcentration and Related Topics. Springer, 2014.
[30] Ziliang Che. Universality of random matrices with correlated entries. Electronic
Journal of Probability, 22:1–38, 2017.
[31] Xiaohong Chen. Large Sample Sieve Estimation of Semi-nonparametric Models.
Chapter 76 in Handbook of Econometrics, Vol. 6B, James J. Heckman and Edward
E. Leamer, 2007.
[32] Xiaohong Chen and Timothy Christensen. Optimal uniform convergence rates
and asymptotic normality for series estimators under weak dependence and weak
conditions. Journal of Econometrics, 188:447–465, 2015.
[33] Ingrid Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied
Mathematics, 1992.
[34] Percy Deift. Orthogonal Polynomials and Random Matrices: A Riemann-Hilbert
Approach. Courant Lecture Notes, American Mathematical Society and the
Courant Institute of Mathematical Sciences at New York University, 1999.
[35] Xiucai Ding. Singular vector distribution of sample covariance matrices. arXiv:
1611.01837, 2016.
[36] Xiucai Ding. Asymptotics of empirical eigen-structure for high dimensional sample
covariance matrices of general form. arXiv: 1708.06296, 2017.
[37] Xiucai Ding. High dimensional deformed rectangular matrices with applications in
matrix denoising. arXiv: 1702.06975, 2017.
[38] Xiucai Ding, Weihao Kong, and Gregory Valiant. Norm consistent oracle estimators
for high dimensional covariance matrices of general form. Preprint, 2017.
[39] Xiucai Ding and Jeremy Quastel. Multi-matrix model and generalization of Airy
process. In progress, 2018.
[40] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge univer-
sality at the largest singular values of covariance matrices. The Annals of Applied
Probability (in press), 2016.
[41] Xiucai Ding and Fan Yang. Necessary and sufficient condition for edge universality
for a general class of sample covariance matrices. In progress, 2018.
[42] Xiucai Ding and Zhou Zhou. Estimation and inference for covariance and precision
matrices of non-stationary time series. Preprint, 2018.
[43] Xiucai Ding and Zhou Zhou. On the stationarity testing for the correlation of time
series. Preprint, 2018.
[44] David Donoho. De-noising by soft-thresholding. IEEE Transactions on Information
Theory, 41:613–627, 1995.
[45] David Donoho, Matan Gavish, and Iain Johnstone. Optimal shrinkage of eigenval-
ues in the spiked covariance model. The Annals of Statistics (to appear), 2013.
[46] R. Brent Dozier and Jack W. Silverstein. Analysis of the limiting spectral dis-
tribution of large dimensional information-plus-noise type matrices. Journal of
Multivariate Analysis, 98:1099–1122, 2007.
[47] Laszlo Erdos. Universality of Wigner random matrices: a survey of recent results.
Russian Mathematical Surveys, 66:507, 2011.
[48] Laszlo Erdos. Lecture Notes on the Matrix Dyson Equation and its Applications
for Random Matrices. IAS/Park City Mathematics Program, 2017.
[49] Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of
Erdos-Renyi graphs II: Eigenvalue spacing and the extreme eigenvalues. Commu-
nications in Mathematical Physics, 314:587–640, 2012.
[50] Laszlo Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of
Erdos-Renyi graphs I: Local semicircle law. The Annals of Probability, 41:2279–
2375, 2013.
[51] Laszlo Erdos and Horng-Tzer Yau. A Dynamical Approach to Random Matrix
Theory. Courant Lecture Notes, American Mathematical Society and the Courant
Institute of Mathematical Sciences at New York University, 2017.
[52] Laszlo Erdos, Horng-Tzer Yau, and Jun Yin. Rigidity of eigenvalues of generalized
Wigner matrices. Advances in Mathematics, 229:1435–1515, 2012.
[53] Laszlo Erdos, Horng-Tzer Yau, and Jun Yin. Rigidity of eigenvalues of generalized
Wigner matrices. Advances in mathematics, 229:1435–1515, 2012.
[54] Bertrand Eynard. Eigenvalue distribution of large random matrices, from one
matrix to several coupled matrices. Nuclear Physics B, 506:633–664, 1997.
[55] Bertrand Eynard, Taro Kimura, and Sylvain Ribault. Random matrices. arXiv:
1510.04430, 2015.
[56] Bertrand Eynard and Madan Mehta. Matrices coupled in a chain: I. eigenvalue
correlations. Journal of Physics A: Mathematical and General, 31:44–49, 1998.
[57] Jianqing Fan, Yuan Liao, and Martina Mincheva. High-dimensional covariance
matrix estimation in approximate factor models. The Annals of Statistics, 39:3320–
3356, 2011.
[58] Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by
thresholding principal orthogonal complements. Journal of the Royal Statistical
Society: Series B, 75:603–680, 2013.
[59] Patrik Ferrari, Michael Praehofer, and Herbert Spohn. Stochastic growth in one
dimension and Gaussian multi-matrix models. arXiv:03010053, 2003.
[60] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An
Introduction to Statistical Learning. Springer Texts in Statistics, 2013.
[61] Matan Gavish and David Donoho. The optimal hard threshold for singular values
is 4/√3. IEEE Transactions on Information Theory, 60:5040–5053, 2014.
[62] Matan Gavish and David Donoho. Optimal shrinkage of singular values. IEEE
Transactions on Information Theory, 63:2137–2152, 2017.
[63] G.H Golub and C. Van Loan. Matrix Computation. Johns Hopkins University
Press, 4th edition, 2013.
[64] Kurt Johansson. Shape fluctuations and random matrices. Communications in
Mathematical Physics, 209:437–476, 2000.
[65] Kurt Johansson. Discrete polynuclear growth and determinantal processes. Com-
munications in Mathematical Physics, 242:277–329, 2003.
[66] Kurt Johansson. Random matrices and determinantal processes. arXiv:0510038,
2005.
[67] Iain Johnstone. On the distribution of the largest eigenvalue in principal compo-
nents analysis. The Annals of Statistics, 29:295–327, 2001.
[68] Iain Johnstone. Multivariate analysis and Jacobi ensembles: Largest eigenvalue,
Tracy-Widom limits and rates of convergence. The Annals of Statistics, 36:2638–
2716, 2008.
[69] Noureddine El Karoui. Tracy-Widom limit for the largest eigenvalue of a large
class of complex sample covariance matrices. The Annals of Probability, 35:663–
714, 2007.
[70] Noureddine El Karoui. Spectrum estimation for large dimensional covariance ma-
trices using random matrix theory. The Annals of Statistics, 36:2757–2790, 2008.
[71] Antti Knowles and Jun Yin. Eigenvector distribution of Wigner matrices. Proba-
bility Theory and Related Fields, 155:543–582, 2013.
[72] Antti Knowles and Jun Yin. The isotropic semicircle law and deformation of Wigner
matrices. Communications on Pure and Applied Mathematics, 66:1663–1749, 2013.
[73] Antti Knowles and Jun Yin. The outliers of a deformed Wigner matrix. The Annals
of Probability, 42:1980–2031, 2014.
[74] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability
Theory and Related Fields, 169:257–352, 2017.
[75] Weihao Kong and Gregory Valiant. Spectrum estimation from samples. The Annals
of Statistics, 45:2352–2367, 2017.
[76] Arno Kuijlaars. Random matrices with external source and multiple orthogonal
polynomials. Proceedings of the International Congress of Mathematicians, pages
1417–1432, 2010.
[77] Olivier Ledoit and Sandrine Peche. Eigenvectors of some large sample covariance
matrix ensembles. Probability Theory and Related Fields, 151:233–264, 2011.
[78] Olivier Ledoit and Michael Wolf. Numerical implementation of the QuEST func-
tion. Computational Statistics & Data Analysis, 115:199–223, 2017.
[79] Ji Oon Lee and Kevin Schnelli. Edge universality for deformed Wigner matrices.
Reviews in Mathematical Physics, 27:1550018, 2015.
[80] Ji Oon Lee and Kevin Schnelli. Tracy-Widom distribution for the largest eigenvalue
of real sample covariance matrices with general population. The Annals of Applied
Probability, 26:3786–3839, 2016.
[81] Ji Oon Lee and Jun Yin. A necessary and sufficient condition for edge universality
of Wigner matrices. Duke Mathematical Journal, 163:117–173, 2014.
[82] Haoyang Liu, Alexander Aue, and Debashis Paul. On the Marcenko-Pastur law for
linear time series. The Annals of Statistics, 43:675–712, 2015.
[83] V.A. Marcenko and L.A. Pastur. Distribution of eigenvalues for some sets of random matrices.
Sbornik: Mathematics, 1:457–483, 1967.
[84] Konstantin Matetski, Jeremy Quastel, and Daniel Remenik. The KPZ fixed point.
arXiv:1701.00018, 2017.
[85] Madan Mehta. Random matrices. Elsevier Academic Press, 3rd edition, 2004.
[86] Alexandru Nica and Roland Speicher. Lectures on the Combinatorics of Free Prob-
ability. London Mathematical Society Lecture Note Series, Cambridge University
Press, 2006.
[87] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked
covariance model. Statistica Sinica, 17:1617–1642, 2007.
[88] Debashis Paul and Jack Silverstein. No eigenvalues outside the support of the
limiting empirical spectral distribution of a separable covariance matrix. Journal
of Multivariate Analysis, 100:37–57, 2009.
[89] Natesh Pillai and Jun Yin. Universality of covariance matrices. The Annals of
Applied Probability, 24:935–1001, 2014.
[90] Beatriz Pontes, Raul Giraldez, and Jesus S. Aguilar-Ruiz. Biclustering on expression
data: A review. Journal of Biomedical Informatics, 57:163–180, 2015.
[91] Mohsen Pourahmadi. Joint mean-covariance models with applications to longitu-
dinal data: unconstrained parameterisation. Biometrika, 3:677–690, 1999.
[92] Michael Prahofer and Herbert Spohn. Scale invariance of the PNG droplet and the
Airy process. Journal of Statistical Physics, 108:1071–1106, 2002.
[93] Jeremy Quastel and Daniel Remenik. Airy processes and variational problems.
Topics in Percolative and Disordered Systems, pages 121–171, 2014.
[94] Jeremy Quastel and Herbert Spohn. The one-dimensional KPZ equation and its
universality class. Journal of Statistical Physics, 160:965–984, 2015.
[95] Adrian Rollin. Stein's method in high dimensions with applications. Annales de
l’Institut Henri Poincare, Probabilites et Statistiques, 49:529–549, 2011.
[96] Nathan Ross. Fundamentals of Stein's method. Probability Surveys, 8:210–293,
2011.
[97] Tomohiro Sasamoto. Spatial correlations of the 1D KPZ surface on a flat substrate.
Journal of Physics A: Mathematical and General, 38:L549–L556, 2005.
[98] Jack W. Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution
of large dimensional random matrices. Journal of Multivariate Analysis, 54:295–
309, 1995.
[99] Defeng Sun and Jie Sun. Strong semismoothness of eigenvalues of symmetric matri-
ces and its application to inverse eigenvalue problems. SIAM Journal on Numerical
Analysis, 40:2352–2367, 2003.
[100] Gabor Szego. Orthogonal Polynomials. American Mathematical Society Colloquium
Publications, Vol. XXIII, 1939.
[101] Kazumasa Takeuchi and Masaki Sano. Universal fluctuations of growing interfaces:
Evidence in turbulent liquid crystals. Physical Review Letters, 104:230601, 2010.
[102] Terence Tao and Van Vu. Random matrices: Universality of local eigenvalue statis-
tics. Acta Mathematica, 206:127–204, 2011.
[103] Terence Tao, Van Vu, and Manjunath Krishnapur. Random matrices: Universality
of ESDs and the circular law. The Annals of Probability, 38:2023–2065, 2010.
[104] Terence Tao. Topics in Random Matrix Theory. American Mathematical Society,
2012.
[105] Craig Tracy and Harold Widom. Differential equations for Dyson processes. Com-
munications in Mathematical Physics, 252:7–41, 2004.
[106] Eugene Wigner. Characteristic vectors of bordered matrices with infinite dimensions.
Annals of Mathematics, 62:548–564, 1955.
[107] Eugene Wigner. On the distributions of the roots of certain symmetric matrices.
Annals of Mathematics, 67:325–327, 1958.
[108] John Wishart. The generalised product moment distribution in samples from a
normal multivariate population. Biometrika, 20A:32–52, 1928.
[109] Wei Biao Wu. Nonlinear system theory: Another look at dependence. Proceedings of
the National Academy of Sciences of the United States of America, 102:14150–14154,
2005.
[110] Haokai Xi, Fan Yang, and Jun Yin. Local circular law for the product of a deter-
ministic matrix with a random matrix. Electronic Journal of Probability, 22:1–77,
2017.
[111] Mengyu Xu, Danna Zhang, and Wei Biao Wu. L2 asymptotics for high-dimensional
data. arXiv:1405.7244, 2015.
[112] Dan Yang, Zongming Ma, and Andreas Buja. Rate optimal denoising of simulta-
neously sparse and low rank matrices. The Journal of Machine Learning Research,
17:1–27, 2016.
[113] Jianfeng Yao, Shurong Zheng, and Zhidong Bai. Large Sample Covariance Matrices
and High-Dimensional Data. Cambridge University Press, 2015.
[114] Paul Bourgade and Horng-Tzer Yau. The eigenvector moment flow and local quan-
tum unique ergodicity. Communications in Mathematical Physics, 350:231–278,
2017.
[115] Bo Zhang, Guangming Pan, and Jiti Gao. CLT for largest eigenvalues and unit
root tests for high-dimensional nonstationary time series. The Annals of Statistics
(to appear), 2016.
[116] Lixin Zhang. Spectral analysis of large dimensional random matrices. Ph.D. thesis,
National University of Singapore, 2006.
[117] Xianyang Zhang and Guang Cheng. Gaussian approximation for high dimensional
vector under physical dependence. Bernoulli (to appear), 2017.
[118] Zhou Zhou. Heteroscedasticity and autocorrelation robust structural change detec-
tion. Journal of the American Statistical Association, 108:726–740, 2013.
[119] Zhou Zhou and Wei Biao Wu. Local linear quantile estimation for non-stationary
time series. The Annals of Statistics, 37:2696–2729, 2009.
[120] Zhou Zhou and Wei Biao Wu. Simultaneous inference of linear models with time
varying coefficients. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 72:513–531, 2010.