
Source: mazgbao.people.ust.hk/TW law for Spearman28.pdf


Tracy-Widom limit for Spearman’s rho

Zhigang Bao∗

Hong Kong University of Science and Technology

[email protected]

In this paper, we study the Spearman rank correlation matrix, which is a random matrix model from non-parametric statistics. We focus on the high-dimensional scenario where n is proportional to p. In the null case, we show that the Tracy-Widom law holds for the largest eigenvalues of the Spearman rank correlation matrix. The proof is based on a general strategy for the universality of covariance-type matrices from Pillai and Yin [17].

Date: December 18, 2017
Keywords: largest eigenvalue, non-parametric statistics, Spearman's ρ, Tracy-Widom law, random matrices

1. Introduction.

1.1. Matrix model and main results. Let w = (w1, . . . , wp) be a p-dimensional random vector with independent but not necessarily identically distributed components. We further assume that the wi's are all continuous random variables. Let wj = (w1j, . . . , wpj)′, j ∈ J1, nK, be n i.i.d. samples of w. Hereafter we use the notation Ja, bK := [a, b] ∩ Z. We then call W = (wij)p,n the data matrix. In this paper, we focus on the setting where n and p are comparably large, i.e.,

p = p(n), cn := p/n → c ∈ (0, ∞), as n → ∞, (1.1)

for some positive constant c.

∗The author is partially supported by Hong Kong RGC grant ECS 26301517.

We then construct the corresponding Spearman rank correlation matrix from the data matrix W as follows. For each fixed i ∈ J1, pK, we can rank the n samples wi1, . . . , win according to their size. Let qij be the rank of wij among wi1, . . . , win. Observe that for each i ∈ J1, pK, the random vector (qi1, . . . , qin) is a random permutation uniformly distributed on Sn. Here Sn is the symmetric group of the set {1, 2, . . . , n}. Next, we normalize the qij's as

yij := √(12/(n² − 1)) (qij − (n + 1)/2), (i, j) ∈ J1, pK × J1, nK.

We further set Y = (yij)p,n as the matrix of ranks. Observe that, by the assumption on the independence of the components of w, the p rows of Y are i.i.d. random vectors. It is also easy to check that for any i ∈ J1, pK,

Eyij = 0, Eyij² = 1, Eyij yik = −1/(n − 1), ∀ j ≠ k. (1.2)

We then do the scaling

X = (1/√n) Y, (1.3)

and denote by xi and yi the i-th rows of X and Y, respectively. The Spearman rank correlation matrix is defined as

S ≡ Sn := XX′ = (1/n) Y Y′. (1.4)

Observe that the matrix entry Sab is the Spearman rank correlation coefficient of the ranks of the samples of wa and those of wb. Hence, the matrix S is a natural multivariate extension of the Spearman rank correlation coefficient.

Since Marchenko and Pastur discovered the global spectral distribution (MP law) in their seminal work [15], a vast literature has been devoted to the spectral properties of large-dimensional sample covariance matrices and their variants. In particular, for the largest eigenvalue, Johnstone [12] proved the Tracy-Widom law (TW law) in the null case, i.e., when the population covariance matrix is Ip. The TW law was then shown to be universal for sample covariance matrices in the null case, even under more general distribution assumptions; see [18, 17]. Later on, the Tracy-Widom law was further extended to more general population assumptions; see [5, 14, 13]. In [4, 16], it was also shown that the TW law holds for the sample correlation matrix in the null case.

Although many spectral statistics of sample covariance matrices and correlation matrices turn out to be extremely useful for various statistical inference problems, these two matrix models are both parametric. Consequently, certain parametric assumptions, such as moment conditions, are needed for limiting theorems on the spectral statistics. For instance, for the TW law for covariance matrices, we refer to [8] for a necessary moment assumption. Moreover, limiting results such as the TW law are very often used to test the independence of the components of the population random vector w. Mathematically, this idea is valid only for Gaussian vectors: for general distributions, the covariance matrix only contains information on correlation rather than dependence. For these reasons, it is very natural to consider the limiting spectral properties of non-parametric random matrix models. Among others, the Spearman rank correlation and Kendall rank correlation matrices are probably the most important and natural ones. However, in contrast to the parametric models, the study of these multivariate non-parametric models in the high-dimensional setting is much more limited, and so far there are only a handful of results in this direction. The global spectral distributions of the Spearman rank correlation matrix and the Kendall rank correlation matrix have been derived in [1] and [2], respectively. A CLT for the linear eigenvalue statistics of the Spearman rank correlation matrix has been established in [6]. On the local scale, we recently proved the TW law for Kendall rank correlation matrices in [3]. It is the first TW law for a non-parametric matrix model, and also the first TW law for high-dimensional U-statistics. In this paper, our aim is to derive the companion result (TW law) for the Spearman rank correlation matrix.

Before stating our main results, we first recall the global spectral property of S from [1]. Let λ1(S) ≥ . . . ≥ λp(S) be the p ordered eigenvalues of S. We denote the empirical spectral distribution (ESD) of S by

Fn := (1/p) ∑_{i=1}^p δλi(S).

In [1], it is proved that Fn is asymptotically given by the standard MP law. More specifically, we have the following theorem.

Theorem 1.1 (Theorem 2.2 of [1]). Under the assumption (1.1), we have that almost surely Fn converges weakly to Fc, whose density is given by

ρc(x) = (1/(2πc)) · (√((d+,c − x)(x − d−,c)) / x) · 1(d−,c ≤ x ≤ d+,c),

where

d±,c = (1 ± √c)².

In case c > 1, Fc in addition has a singular part: a point mass (1 − c⁻¹)δ0.
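Theorem 1.1 can be illustrated by a quick Monte Carlo experiment (our sketch, not from the paper): for independent rows, the spectrum of S should essentially fill the MP support [d−,c, d+,c], and its first moment is exactly 1 since the diagonal of S is identically 1.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 400                       # c_n = p/n = 1/2
W = rng.standard_normal((p, n))
q = W.argsort(axis=1).argsort(axis=1) + 1
Y = np.sqrt(12.0 / (n * n - 1.0)) * (q - (n + 1) / 2.0)
S = (Y @ Y.T) / n

c = p / n
d_minus, d_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
lam = np.linalg.eigvalsh(S)           # ascending eigenvalues

mean_is_one = np.isclose(lam.mean(), 1.0)          # Tr S / p = 1 exactly
# Up to edge fluctuations of order n^{-2/3}, the spectrum sits in the
# MP support (d_-, d_+); we allow a generous slack.
inside_support = d_minus - 0.3 < lam.min() and lam.max() < d_plus + 0.3
```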

Further, replacing c by cn, we denote by ρcn, Fcn, d±,cn the analogues of ρc, Fc, d±,c, respectively.

To state our main results, we further denote by Q := (1/n)XX′ a Wishart matrix, where X is a p × n data matrix with i.i.d. N(0, 1) entries. Let λi(Q) be the i-th largest eigenvalue of Q.

Our main result is the following theorem.


Theorem 1.2 (Edge universality of the Spearman rank correlation matrix). Suppose that the assumption (1.1) holds. There exist positive constants ε and δ such that for any s ∈ R,

P(n^{2/3}(λ1(Q) − d+,cn) ≤ s − n^{−ε}) − n^{−δ} ≤ P(n^{2/3}(λ1(S) − d+,cn) ≤ s) ≤ P(n^{2/3}(λ1(Q) − d+,cn) ≤ s + n^{−ε}) + n^{−δ}

holds when n is sufficiently large.

Remark 1.3. The above result can be generalized to the joint distribution of the first few eigenvalues. More specifically, there exist positive constants ε and δ such that for any fixed positive integer k and any s1, . . . , sk ∈ R,

P(n^{2/3}(λ1(Q) − d+,cn) ≤ s1 − n^{−ε}, . . . , n^{2/3}(λk(Q) − d+,cn) ≤ sk − n^{−ε}) − n^{−δ}
≤ P(n^{2/3}(λ1(S) − d+,cn) ≤ s1, . . . , n^{2/3}(λk(S) − d+,cn) ≤ sk)
≤ P(n^{2/3}(λ1(Q) − d+,cn) ≤ s1 + n^{−ε}, . . . , n^{2/3}(λk(Q) − d+,cn) ≤ sk + n^{−ε}) + n^{−δ}

holds when n is sufficiently large. We refer to Remark 1.4 of [17] for a similar extension for the sample covariance matrix. The extension here can be proved in the same way.

From Theorem 1.2, we have the following corollary on the largest eigenvalue.

Corollary 1.4 (Tracy-Widom law for λ1(S)). Under the assumptions of Theorem 1.2, we have

n^{2/3} cn^{1/6} (d+,cn)^{−2/3} (λ1(S) − d+,cn) =⇒ TW1,

where TW1 denotes the type-1 Tracy-Widom distribution.
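As an illustration of Corollary 1.4 (our numerical sketch, not part of the paper), one can form the rescaled statistic above and observe that, already for moderate n, its samples are of order one and centered near the TW1 mean, which is approximately −1.21:

```python
import numpy as np

def rescaled_lambda1(p, n, rng):
    """One sample of n^{2/3} c_n^{1/6} (d_{+,c_n})^{-2/3} (lambda_1(S) - d_{+,c_n})."""
    W = rng.standard_normal((p, n))
    q = W.argsort(axis=1).argsort(axis=1) + 1
    Y = np.sqrt(12.0 / (n * n - 1.0)) * (q - (n + 1) / 2.0)
    lam1 = np.linalg.eigvalsh((Y @ Y.T) / n)[-1]      # largest eigenvalue
    cn = p / n
    d_plus = (1 + np.sqrt(cn)) ** 2
    return n ** (2 / 3) * cn ** (1 / 6) * d_plus ** (-2 / 3) * (lam1 - d_plus)

rng = np.random.default_rng(2)
samples = [rescaled_lambda1(100, 200, rng) for _ in range(20)]
center = sum(samples) / len(samples)   # O(1), roughly near -1.21 under TW1
```

The scaling constant cn^{1/6} (d+,cn)^{−2/3} equals the reciprocal of Johnstone's (1 + √c)(1 + 1/√c)^{1/3}, so this matches the classical Wishart normalization.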

1.2. Proof strategy. The proof of Theorem 1.2 will be carried out with the aid of a general strategy of Pillai and Yin [17] for covariance-type matrices, which is itself an adaptation of the method originally developed in [11] by Erdős, Yau and Yin for Wigner matrices. Roughly speaking, to prove the TW law for the largest eigenvalue, one first needs to prove a local law for the spectral distribution, which controls the location of the eigenvalues on an optimal local scale. Second, with the aid of the local law, one performs a Green function comparison between the matrix of interest and a certain reference matrix ensemble whose edge spectral behavior is already known. In [17], an extended criterion for the local law for covariance-type matrices with independent columns (or rows) was given; see Theorem 3.6 of [17]. It allows one to relax, to a certain extent, the independence assumption on the entries within one column (or row), as long as some large deviation estimates hold for certain linear and quadratic forms of each column (or row) of the data matrix; see Lemma 3.4 of [17]. This general criterion was then used in [16] and [4] to establish the edge universality of sample correlation matrices.

For the Spearman rank correlation matrix S = XX′ defined in (1.4), our main task is to show a large deviation estimate (cf. Proposition 2.1) for each row of the matrix X. Once Proposition 2.1 is established, we can use the criterion in Theorem 3.6 of [17] to conclude the local law for S. It turns out that the Green function comparison part can be done similarly to that in [16] for the Pearson correlation matrix, by choosing an appropriate reference matrix ensemble. The reference matrix chosen in this work turns out to be the traditional sample covariance matrix, which is centered by the sample mean and normalized by n − 1.

1.3. Notation and organization.

1.3.1. Notation. We need the following definition of high-probability estimates from [9].

Definition 1.5. Let X ≡ X(N) and Y ≡ Y(N) be two sequences of nonnegative random variables. We say that Y stochastically dominates X if, for all (small) ε > 0 and (large) D > 0,

P(X(N) > N^ε Y(N)) ≤ N^{−D}, (1.5)

for sufficiently large N ≥ N0(ε, D), and we write X ≺ Y or X = O≺(Y). When X(N) and Y(N) depend on a parameter v ∈ V (typically an index label or a spectral parameter), then X(v) ≺ Y(v), uniformly in v ∈ V, means that the threshold N0(ε, D) can be chosen independently of v.

We use the symbols O(·) and o(·) for the standard big-O and little-o notation. We use c and C to denote strictly positive constants that do not depend on N; their values may change from line to line. For any matrix A, we denote by ‖A‖ its operator norm, while for any vector a, we use ‖a‖ to denote its 2-norm. The matrix entries of A are denoted by Aij. In addition, we use double brackets to denote index sets, i.e., for n1, n2 ∈ R, Jn1, n2K := [n1, n2] ∩ Z. We also use 1 to represent the all-one vector, whose dimension may vary from one appearance to another.

1.3.2. Organization. The paper is organized as follows. In Section 2, we will prove some large deviation estimates for certain linear and quadratic forms of the xi's, and then briefly state the proof of the local law for S based on Theorem 3.6 of [17]. In Section 3, we perform the Green function comparison and then prove our main results.

2. Local law of S. In this section, our final goal is to prove a strong local law for the matrix S: Proposition 2.3. To this end, we shall first establish some large deviation estimates for certain linear and quadratic forms of the xi's, which are the rows of the matrix X defined in (1.3). Proposition 2.3 will then follow from these large deviation estimates and Theorem 3.6 of [17].

2.1. Large deviation estimates for xi. Let

si = √(12/(n(n² − 1))) (i − (n + 1)/2), i ∈ J1, nK.

We set the vector

s := (s1, . . . , sn). (2.1)

Let x be a random permutation of s, uniformly distributed over all permutations of the entries of s. We can then regard x1, . . . , xp as i.i.d. copies of x. Further, we let ξ = (ξ1, . . . , ξn) be a random vector with i.i.d. components uniformly distributed on the set s, i.e., P(ξj = si) = 1/n, i ∈ J1, nK, j ∈ J1, nK. Let ξi = (ξij)_{j=1}^n, i ∈ J1, pK, be i.i.d. copies of ξ. Note that the xij's are also identically distributed as ξj, but the xij's are correlated. We further set

x̃ := ξΣ^{1/2}, x̃i := ξiΣ^{1/2}, (2.2)

where

Σ = (n/(n − 1)) In − (1/(n − 1)) 1′1. (2.3)

Here 1 represents the n-dimensional all-one (row) vector. Let X̃ and Ξ be the p × n matrices with x̃i and ξi as the i-th rows, respectively. We further denote

S̃ := X̃X̃′ = ΞΣΞ′. (2.4)

Observe that S̃ is the classical sample covariance matrix from statistical theory, centered by the sample mean and normalized by n − 1, although in random matrix theory the simplified model ΞΞ′ is considered more often. Since ΞΣΞ′ is just a rank-one perturbation of (n/(n − 1))ΞΞ′, it is known that almost surely the empirical spectral distribution of S̃ also converges weakly to Fc.

Below is a collection of large deviation estimates on the xi's, and also on the x̃i's.
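The covariance structure behind (2.2)-(2.4) can be verified directly. The following checks (ours, not part of the paper) confirm that Σ annihilates the all-one vector, acts as n/(n − 1) on its orthogonal complement, and equals n times the covariance matrix of a uniformly permuted row of s:

```python
import numpy as np

n = 60
Sigma = n / (n - 1) * np.eye(n) - np.ones((n, n)) / (n - 1)

# Sigma kills the all-one direction (matching the row-sum-zero constraint
# of the rank rows) and acts as n/(n-1) on its orthogonal complement.
kills_ones = np.allclose(Sigma @ np.ones(n), 0.0)
evals = np.sort(np.linalg.eigvalsh(Sigma))
spectrum_ok = np.isclose(evals[0], 0.0) and np.allclose(evals[1:], n / (n - 1))

# Empirically, n * Cov(x) = Sigma for a uniformly permuted row x of s.
i = np.arange(1, n + 1)
s = np.sqrt(12.0 / (n * (n * n - 1.0))) * (i - (n + 1) / 2.0)
rng = np.random.default_rng(3)
samples = np.array([rng.permutation(s) for _ in range(50000)])
cov_matches = np.allclose(n * np.cov(samples, rowvar=False, bias=True),
                          Sigma, atol=0.03)
```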

Proposition 2.1. Let the xi's be defined via (1.3). For any deterministic vector a = (aj) ∈ Cⁿ and matrix B = (bij) ∈ Cⁿˣⁿ, we have

|xi a′| ≺ √(‖a‖²/n), (2.5)

|xi B x′i − (1/n) Tr BΣ| ≺ (1/n) √(Tr|B|²). (2.6)

The same inequalities hold if we replace xi by x̃i (cf. (2.2)).


Proof of Proposition 2.1. The proof relies on a martingale concentration argument. We start with (2.5). Recall the random vector x = (x1, . . . , xn), which is a uniformly random permutation of s (cf. (2.1)). Since the xi's are i.i.d. copies of x, for brevity it suffices to prove all the results in Proposition 2.1 with xi replaced by x.

We first construct a martingale difference sequence. Define the filtration

F0 := {∅, Ω}, Fℓ := σ{x1, . . . , xℓ}, ℓ ∈ J1, nK, (2.7)

and set

Mℓ = ∑_{j=1}^n aj (E(xj|Fℓ) − E(xj|Fℓ−1)). (2.8)

It is clear that, conditioning on Fℓ, xj is uniformly distributed on the set s \ {x1, . . . , xℓ} for all j ∈ Jℓ+1, nK. Hence, we have

E(xj|Fℓ) = −(1/(n − ℓ)) ∑_{k=1}^ℓ xk, j ∈ Jℓ+1, nK, (2.9)

where we used the fact that ∑_{i=1}^n xi = ∑_{i=1}^n si = 0. From the definition in (2.8), it is easy to check that Mn = 0, since Fn = Fn−1. In addition, we have

Mℓ = aℓ xℓ + ∑_{j=ℓ+1}^n aj E(xj|Fℓ) − ∑_{j=ℓ}^n aj E(xj|Fℓ−1)
= (aℓ − (1/(n − ℓ)) ∑_{j=ℓ+1}^n aj) (xℓ + (1/(n − ℓ + 1)) ∑_{k=1}^{ℓ−1} xk), ℓ ∈ J1, n − 1K. (2.10)

Using the bound |xj| = O(1/√n) and the fact

|∑_{k=1}^{ℓ−1} xk| = |∑_{k=ℓ}^n xk| ≤ (C/√n) min{ℓ − 1, n − ℓ + 1}, (2.11)

we can get simply from (2.10) that

|Mℓ| ≤ (C/√n) ((1/(n − ℓ)) ∑_{j=ℓ+1}^n |aj| + |aℓ|). (2.12)

Applying the Burkholder inequality, we have for any fixed integer q ≥ 2

E|∑_{ℓ=1}^n Mℓ|^q ≤ (Cq)^{3q/2} E(∑_{ℓ=1}^n Mℓ²)^{q/2}. (2.13)


From (2.12), we have

∑_{ℓ=1}^n Mℓ² = ∑_{ℓ=1}^{n−1} Mℓ² ≤ (C/n) ∑_{ℓ=1}^{n−1} ((1/(n − ℓ)²) (∑_{j=ℓ+1}^n |aj|)² + aℓ²)

≤ (C/n) ∑_{ℓ=1}^{n−1} ((1/(n − ℓ)) ∑_{j=ℓ+1}^n aj² + aℓ²) ≤ (C log n / n) ∑_{j=1}^n aj². (2.14)

Plugging (2.14) into (2.13) and using Markov's inequality, we can conclude (2.5).

Next, we prove (2.6). It suffices to show the following two bounds:

|∑_{j=1}^n bjj (xij² − 1/n)| ≺ (1/n) √(∑_j bjj²), (2.15)

|∑_{j≠k} bjk xij xik + (1/(n(n − 1))) ∑_{j≠k} bjk| ≺ (1/n) √(∑_{j≠k} bjk²). (2.16)

For (2.15), again, we construct a sequence of martingale differences as

Nℓ = ∑_{j=1}^n bjj (E(xj²|Fℓ) − E(xj²|Fℓ−1)). (2.17)

We have Nn = 0 since Fn = Fn−1. Again, given {x1, . . . , xℓ−1}, we recall the fact that xj is uniformly distributed on s \ {x1, . . . , xℓ−1} for all j ≥ ℓ. Moreover, since ∑_{j=1}^n sj² = 1, we have

E(xj²|Fℓ) = (1/(n − ℓ)) (1 − ∑_{k=1}^ℓ xk²), ∀ j ≥ ℓ + 1. (2.18)

Applying (2.18), we obtain

Nℓ = bℓℓ xℓ² + ∑_{j=ℓ+1}^n bjj E(xj²|Fℓ) − ∑_{j=ℓ}^n bjj E(xj²|Fℓ−1)
= (bℓℓ − (1/(n − ℓ)) ∑_{j=ℓ+1}^n bjj) (xℓ² − (1/(n − ℓ + 1)) (1 − ∑_{k=1}^{ℓ−1} xk²)).

Using the fact xk = O(1/√n), we have

|Nℓ| ≤ (C/n) ((1/(n − ℓ)) ∑_{j=ℓ+1}^n |bjj| + |bℓℓ|).


The remaining proof of (2.15) is nearly the same as that for (2.5); we thus omit the details.

Next, we prove (2.16). It suffices to estimate half of the quadratic form. Recall the filtration defined in (2.7). We further set

Lℓ := ∑_{i<j} bij (E(xixj|Fℓ) − E(xixj|Fℓ−1)) = Lℓ1 + Lℓ2 + Lℓ3 + Lℓ4, (2.19)

where

Lℓ1 := ∑_{i=1}^{ℓ−1} biℓ xi (xℓ − E(xℓ|Fℓ−1)),

Lℓ2 := ∑_{j=ℓ+1}^n bℓj (E(xj|Fℓ) xℓ − E(xjxℓ|Fℓ−1)),

Lℓ3 := ∑_{i=1}^{ℓ−1} ∑_{j=ℓ+1}^n bij xi (E(xj|Fℓ) − E(xj|Fℓ−1)),

Lℓ4 := ∑_{i=ℓ+1}^n ∑_{j=i+1}^n bij (E(xixj|Fℓ) − E(xixj|Fℓ−1)). (2.20)

First, using (2.5) we can improve (2.11) to

|∑_{i=1}^{ℓ−1} xi| = |∑_{i=ℓ}^n xi| ≺ min{√((ℓ − 1)/n), √((n − ℓ + 1)/n)}. (2.21)

Hence, in light of (2.9), we have

|E(xj|Fℓ−1)| ≺ 1/√(n(n − ℓ + 1)). (2.22)

Moreover, for i, j ≥ ℓ and i ≠ j, averaging over ordered pairs of distinct unrevealed entries gives

E(xixj|Fℓ−1) = −(1/((n − ℓ + 1)(n − ℓ))) ((1 − ∑_{k=1}^{ℓ−1} xk²) − (∑_{k=1}^{ℓ−1} xk)²). (2.23)

Observe that

|1 − ∑_{i=1}^{ℓ−1} xi²| ≤ C (n − ℓ + 1)/n. (2.24)

Combining (2.21), (2.23) and (2.24), we obtain for ℓ ∈ J1, n − 1K

|E(xixj|Fℓ−1)| ≺ 1/(n(n − ℓ)), (2.25)


and for ℓ = n we simply use the bound |xixj| = O(1/n).

Let q ≥ 2 be any given integer. Using the Burkholder inequality again, we have

E|∑_ℓ Lℓ|^q ≤ (Cq)^{3q/2} E(∑_ℓ Lℓ²)^{q/2}. (2.26)

Then, applying the generalized Minkowski inequality, we obtain

(E(∑_ℓ Lℓ²)^{q/2})^{2/q} ≤ ∑_ℓ (E|Lℓ|^q)^{2/q}. (2.27)

Hence, it suffices to estimate E|Lℓa|^q for a = 1, 2, 3, 4. For E|Lℓ1|^q, using the bound |xi| = O(1/√n), we have

E|Lℓ1|^q = E(|xℓ − E(xℓ|Fℓ−1)|^q |∑_{i=1}^{ℓ−1} biℓ xi|^q) ≤ (C/n^{q/2}) E|∑_{i=1}^{ℓ−1} biℓ xi|^q ≺ (∑_{i=1}^{ℓ−1} biℓ²)^{q/2} / n^q, (2.28)

where the last step follows from (2.5).

Next, we estimate E|Lℓ2|^q. Plugging the bounds (2.22), (2.25) and |xi| = O(1/√n) into the definition in (2.20), we have

|Lℓ2| ≺ (∑_{j=ℓ+1}^n |bℓj|) / (n√(n − ℓ + 1)) ≤ √(∑_{j=ℓ+1}^n bℓj²) / n.

Consequently, we have

E|Lℓ2|^q ≺ (∑_{j=ℓ+1}^n bℓj²)^{q/2} / n^q. (2.29)

Next, we estimate E|Lℓ3|^q. From (2.9) and (2.11), we have

|E(xj|Fℓ) − E(xj|Fℓ−1)| = (1/(n − ℓ)) |(1/(n − ℓ + 1)) ∑_{i=1}^{ℓ−1} xi + xℓ| ≺ 1/((n − ℓ)√n), ℓ ∈ J1, n − 1K,

and we also have the trivial fact E(xj|Fn) − E(xj|Fn−1) = 0. Therefore, by (2.5), we have

|Lℓ3| ≺ (1/((n − ℓ)√n)) ∑_{j=ℓ+1}^n |∑_{i=1}^{ℓ−1} bij xi| ≺ (1/((n − ℓ)n)) ∑_{j=ℓ+1}^n √(∑_{i=1}^{ℓ−1} bij²) ≺ (1/(n√(n − ℓ))) √(∑_{i=1}^{ℓ−1} ∑_{j=ℓ+1}^n bij²).


Hence, we have

E|Lℓ3|^q ≺ (∑_{i=1}^{ℓ−1} ∑_{j=ℓ+1}^n bij²)^{q/2} / (n^q (n − ℓ)^{q/2}). (2.30)

Next, we estimate E|Lℓ4|^q. From (2.23), we have for i, j > ℓ

|E(xixj|Fℓ) − E(xixj|Fℓ−1)| = |(2/((n − ℓ − 1)(n − ℓ)(n − ℓ + 1))) ((∑_{k=1}^{ℓ−1} xk)² − (1 − ∑_{k=1}^{ℓ−1} xk²)) + (2/((n − ℓ)(n − ℓ − 1))) xℓ (xℓ + ∑_{k=1}^{ℓ−1} xk)| ≺ 1/(n(n − ℓ)^{3/2}).

Therefore, we have

|Lℓ4| ≺ (1/(n(n − ℓ)^{3/2})) ∑_{i=ℓ+1}^n ∑_{j=i+1}^n |bij| ≤ (C/(n√(n − ℓ))) √(∑_{i=ℓ+1}^n ∑_{j=i+1}^n bij²).

Hence, we also have

E|Lℓ4|^q ≺ (∑_{i=ℓ+1}^n ∑_{j=i+1}^n bij²)^{q/2} / (n^q (n − ℓ)^{q/2}). (2.31)

Using the generalized Minkowski inequality again, we can conclude from (2.28), (2.29), (2.30) and (2.31) that

(E|Lℓ|^q)^{2/q} = (E|∑_{a=1}^4 Lℓa|^q)^{2/q} ≤ 4 (E(∑_{a=1}^4 Lℓa²)^{q/2})^{2/q} ≤ 4 ∑_{a=1}^4 (E|Lℓa|^q)^{2/q}

≺ (1/n²) (∑_{i=1}^{ℓ−1} biℓ² + ∑_{j=ℓ+1}^n bℓj² + (1/(n − ℓ)) ∑_{j=ℓ+1}^n ∑_{i=1}^{j−1} bij²). (2.32)

Recall (2.27). We see from (2.32) that

(E(∑_ℓ Lℓ²)^{q/2})^{2/q} ≤ ∑_ℓ (E|Lℓ|^q)^{2/q} ≺ (1/n²) ∑_{i<j} bij², (2.33)

where we used the fact

∑_ℓ ∑_{j=ℓ+1}^n ∑_{i=1}^{j−1} (1/(n − ℓ)) bij² = ∑_j ∑_{i=1}^{j−1} (∑_{ℓ=1}^{j−1} 1/(n − ℓ)) bij² ≤ C log n ∑_j ∑_{i=1}^{j−1} bij² ≺ ∑_{i<j} bij².


Plugging (2.33) into (2.26), we have

E|∑_ℓ Lℓ|^q ≺ (Cq)^{3q/2} (∑_{i<j} bij²)^{q/2} / n^q.

Hence, we have

|∑_{i≠j} bij xi xj − E(∑_{i≠j} bij xi xj)| ≺ (1/n) √(∑_{i≠j} bij²). (2.34)

This, together with the fact E xixj = −1/(n(n − 1)) (cf. (1.2)), concludes the proof of (2.16).

Recall the definition of x̃ from (2.2), and observe that the entries of ξ are i.i.d. Then, using the large deviation estimates for linear and quadratic forms of i.i.d. random variables (cf. Corollary B.3 of [10], for instance), we see that both (2.5) and (2.6) still hold if we replace x by x̃.

This completes the proof of Proposition 2.1.
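Both bounds of Proposition 2.1 are easy to probe numerically. The sketch below (ours, not part of the paper) samples rows x as uniform permutations of s and compares the typical sizes of the linear form in (2.5) and the centered quadratic form in (2.6) with the claimed scales:

```python
import numpy as np

n, trials = 400, 200
i = np.arange(1, n + 1)
s = np.sqrt(12.0 / (n * (n * n - 1.0))) * (i - (n + 1) / 2.0)
Sigma = n / (n - 1) * np.eye(n) - np.ones((n, n)) / (n - 1)

rng = np.random.default_rng(5)
a = rng.standard_normal(n)               # fixed deterministic test vector
B = rng.standard_normal((n, n))          # fixed deterministic test matrix
tBSigma = np.trace(B @ Sigma)            # centering from (2.6): E[x B x'] = Tr(B Sigma)/n

lin, quad = [], []
for _ in range(trials):
    x = rng.permutation(s)               # one row x_i of X
    lin.append(abs(x @ a))
    quad.append(abs(x @ B @ x - tBSigma / n))

# (2.5) predicts |x a'| = O(||a||/sqrt(n)); (2.6) predicts the centered
# quadratic form is O(sqrt(Tr|B|^2)/n). The ratios should be O(1).
lin_scale = np.median(lin) / (np.linalg.norm(a) / np.sqrt(n))
quad_scale = np.median(quad) / (np.sqrt(np.trace(B.T @ B)) / n)
```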

2.2. Strong local law for S. Recall the notation Fcn for the distribution defined in Theorem 1.1 with c replaced by cn. In addition, we denote by γ1 ≥ γ2 ≥ . . . ≥ γ_{p∧n} the ordered p-quantiles of Fcn, i.e., γj is the smallest real number such that

∫_{−∞}^{γj} dFcn(x) = (p − j + 1)/p, j ∈ J1, n ∧ pK. (2.35)

We denote by m the Stieltjes transform of Fcn in the sequel. It is known that m : C⁺ → C⁺ satisfies the equation

m(z) = 1/(1 − cn − z − cn z m(z)). (2.36)

The following lemma on m(z) is elementary.

Lemma 2.2. For any z = E + iη ∈ D(ε) (with the domain D(ε) defined in (2.42) below), we have

|m(z)| ∼ 1, (2.37)

Im m(z) ∼ √(κ + η) if E ≤ d+,cn, and Im m(z) ∼ η/√(κ + η) if E ≥ d+,cn, (2.38)

where κ ≡ κ(E) := |E − d+,cn|.


Recall the matrix S̃ defined in (2.4). We introduce some intermediate matrices between S and S̃. Starting from X, we replace the xi's by the x̃i's one by one, and get the sequence of intermediate matrices

X = X0, X1, . . . , Xℓ, Xℓ+1, . . . , Xp−1, Xp = X̃. (2.39)

Correspondingly, we set

Sℓ = Xℓ Xℓ′, Gℓ ≡ Gℓ(z) := (Sℓ − z)⁻¹, mℓ(z) := (1/p) Tr Gℓ(z). (2.40)

For ℓ = 0, we simply write S0, G0, m0 as S, G, m̂. We further introduce the notations

Λd := max_k |Gkk − m|, Λo := max_{k≠ℓ} |Gkℓ|, Λ := |m̂ − m|. (2.41)

We then set the domain

D(ε) := {z = E + iη : (1/2) d+,c ≤ E ≤ 2 d+,c, n^{−1+ε} ≤ η ≤ 1}. (2.42)

We remark that the following proof works equally well on a larger domain with E ∈ [(1/2) d−,c, 2 d+,c] (say), which covers the whole spectrum in case c ≠ 1. In D(ε) we focus on a neighborhood of the right edge d+,c only, to avoid a discussion of the regime of c; the discussion restricted to the domain D(ε) is sufficient for the universality of the largest eigenvalues. We then further define the control parameter

Ψ ≡ Ψ(z) := √(Im m/(nη)) + 1/(nη).

We claim that the following local law holds.

Proposition 2.3. Under the assumption (1.1), the following bounds hold.

(i) (Entrywise local law) Λd(z) ≺ Ψ(z) and Λo(z) ≺ Ψ(z) hold uniformly on D(ε).

(ii) (Strong local law) Λ(z) ≺ 1/(nη) holds uniformly on D(ε).

(iii) (Rigidity at the right edge) For i ∈ J1, δpK with any sufficiently small constant δ ∈ (0, 1), we have

|λi(S) − γi| ≺ n^{−2/3} i^{−1/3}.

All of the above also hold if we replace S by Sℓ, for all ℓ ∈ J1, pK.


In the sequel, we will prove Proposition 2.3 based on the large deviation estimates in Proposition 2.1 and the general framework developed in [17].

Proof of Proposition 2.3. With the large deviation estimates in Proposition 2.1, the proof of Proposition 2.3 is nearly the same as that of Theorem 3.1 in [17]. The main difference is that here we state all the estimates with the notation ≺ (cf. Definition 1.5) instead of the more quantitative statements in [17]. More directly, we can regard Proposition 2.3 as a consequence of Proposition 2.1 and Theorem 3.6 in [17]. Nevertheless, in (2.6) we have the term (1/n) Tr BΣ, which is not exactly the same as the term (1/n) Tr B in Lemma 3.4 of [17] (set σ² = 1/n therein). In the sequel, we address this minor issue.

We first define the random control parameter

Π ≡ Π(z) := √((Im m(z) + |Λ(z)|)/(nη)) + 1/(nη).

Observe that since Im m(z) ≳ η, we always have Ψ(z), Π(z) ≳ n^{−1/2}. We denote by X^{(i)} the submatrix of X with xi deleted. Further, we denote S^{(i)} = X^{(i)}(X^{(i)})′ and G^{(i)} := (S^{(i)} − z)⁻¹. Denote

𝒮 := X′X, 𝒢(z) := (𝒮 − z)⁻¹, 𝒮^{(i)} := (X^{(i)})′X^{(i)}, 𝒢^{(i)}(z) := (𝒮^{(i)} − z)⁻¹. (2.43)

Observe that

Tr 𝒢(z) = Tr G(z) − (n − p)/z, Tr 𝒢^{(i)}(z) = Tr G^{(i)}(z) − (n − p + 1)/z. (2.44)

Let Gij be the (i, j)-th entry of G. Using the Schur complement, we see that

Gii = 1/(xi x′i − z − xi (X^{(i)})′ G^{(i)} X^{(i)} x′i).

The place where we need to use (2.6) is the following:

xi (X^{(i)})′G^{(i)}X^{(i)} x′i − (1/n) Tr (X^{(i)})′G^{(i)}X^{(i)} Σ = O≺((1/n) √(Tr |(X^{(i)})′G^{(i)}X^{(i)}|²)). (2.45)

The key observation is that

Tr (X^{(i)})′G^{(i)}X^{(i)} 1′1 = 1 (X^{(i)})′G^{(i)}X^{(i)} 1′ = 0, (2.46)

since xk 1′ = ∑_j xkj = 0. Moreover, we can write

(X^{(i)})′G^{(i)}X^{(i)} = 𝒮^{(i)}𝒢^{(i)} = In + z𝒢^{(i)}, (2.47)

where 𝒮^{(i)} and 𝒢^{(i)} are defined in (2.43). Hence, from (2.3), (2.46) and (2.47), we see that

(1/n) Tr (X^{(i)})′G^{(i)}X^{(i)} Σ = (1/(n − 1)) Tr (X^{(i)})′G^{(i)}X^{(i)} = (1/(n − 1)) Tr (In + z𝒢^{(i)}) = n/(n − 1) + (z/(n − 1)) Tr 𝒢^{(i)}.

Hence, we can write (2.45) as

xi (X^{(i)})′G^{(i)}X^{(i)} x′i = n/(n − 1) + (z/(n − 1)) Tr 𝒢^{(i)} + O≺((1/n) √(Tr |In + z𝒢^{(i)}|²))
= n/(n − 1) + (z/(n − 1)) Tr 𝒢^{(i)} + O≺(Π) = 1 + (z/n) Tr 𝒢^{(i)} + O≺(Π), (2.48)

where the second step follows from (2.44), the fact |Tr G − Tr G^{(i)}| ≺ 1/η, the fact |m̂| ≤ |m| + |Λ| ≤ C + |Λ|, and the fact Π(z) ≳ n^{−1/2}. Moreover, (2.46) also holds if we replace X by any intermediate matrix Xℓ defined in (2.39), since x̃i 1′ = ξi Σ^{1/2} 1′ = 0. Hence, the estimate (2.48) also holds if we replace X by any Xℓ. The remaining part of the proof is the same as the counterpart in [17]; we thus omit the details.

Hence, we conclude the proof of Proposition 2.3.

3. Edge universality for S. In this section, we prove the edge universality of S by Green function comparison.

3.1. Green function comparison. Recall the intermediate matrices defined in (2.39) and the notations introduced in (2.40). Our aim is to show the following lemma.

Lemma 3.1. Fix any γ ∈ J0, p − 1K. Let ε > 0 be any sufficiently small constant. Let E, E1, E2 ∈ R satisfy E1 < E2 and

|E|, |E1|, |E2| ≤ n^{−2/3+ε}, (3.1)

and set η0 = n^{−2/3−ε}. Let F : R → R be a smooth function satisfying

max_{x∈R} |F^{(ℓ)}(x)| (|x| + 1)^{−C} ≤ C, ℓ = 1, 2, 3, 4,

for some positive constant C. Then there exists a constant δ > 0 such that, for sufficiently large n, we have

|E F(nη0 Im mγ(d+,cn + E + iη0)) − E F(nη0 Im mγ+1(d+,cn + E + iη0))| ≺ n^{−1−δ}, (3.2)


and also

|E F(n ∫_{E1}^{E2} Im mγ(d+,cn + x + iη0) dx) − E F(n ∫_{E1}^{E2} Im mγ+1(d+,cn + x + iη0) dx)| ≺ n^{−1−δ}. (3.3)

Proof of Lemma 3.1. In the sequel, we only show the proof of (3.2); the proof of (3.3) can be done similarly. For brevity, throughout the proof we will simply write Cε, for any positive constant C independent of ε, as ε. In other words, we allow ε to vary from line to line, up to a factor C. Suppose that Xγ and Xγ+1 differ in the i-th row, for some i ∈ J1, pK.

We first define the matrix Xγ^{(i)} to be the submatrix of Xγ with the i-th row removed. Hence, Xγ^{(i)} = Xγ+1^{(i)}. Further, we set

Sγ^{(i)} := Xγ^{(i)}(Xγ^{(i)})′, Gγ^{(i)} := (Sγ^{(i)} − z)⁻¹, mγ^{(i)} := (1/p) Tr Gγ^{(i)},

𝒮γ^{(i)} := (Xγ^{(i)})′Xγ^{(i)}, 𝒢γ^{(i)} := (𝒮γ^{(i)} − z)⁻¹.

We now expand mγ(z) around mγ^{(i)}(z) as follows:

mγ = (1/p) Tr Gγ(z) = (1/p) Tr 𝒢γ(z) + (n − p)/(pz)
= (1/p) Tr (𝒢γ^{(i)} − 𝒢γ^{(i)} x′i xi 𝒢γ^{(i)} / (1 + xi 𝒢γ^{(i)} x′i)) + (n − p)/(pz)
= mγ^{(i)} − 1/(pz) − (1/p) xi (𝒢γ^{(i)})² x′i / (1 + xi 𝒢γ^{(i)} x′i) =: µγ^{(i)} − (1/p) xi (𝒢γ^{(i)})² x′i / (1 + xi 𝒢γ^{(i)} x′i).

We further denote

Ei := xi 𝒢γ^{(i)} x′i − (p/n) m(z).

We then do the following expansion:

(nη0/p) · xi (𝒢γ^{(i)})² x′i / (1 + xi 𝒢γ^{(i)} x′i) = δi1 + δi2 + δi3 + O≺(n^{−4/3}), (3.4)

where

δik := (nη0/p) · (1/(1 + (p/n) m(z))^k) · (−Ei)^{k−1} xi (𝒢γ^{(i)})² x′i, k = 1, 2, 3.

In (3.4), we also used the estimates

|Ei| ≺ n^{−1/3+ε}, |xi (𝒢γ^{(i)})² x′i| ≺ n^{1/3+ε}, (3.5)

which follow from Proposition 2.1, the fact Xγ^{(i)} 1′ = 0, Lemma 2.2 and Proposition 2.3. From (3.5), we also have

|δik| ≺ n^{−k/3+ε}, k = 1, 2, 3. (3.6)

Consequently, we have the expansion

F(nη0 Im mγ(z)) − F(nη0 Im µγ^{(i)}(z))
= −F^{(1)}(nη0 Im µγ^{(i)}(z)) (Im δi1 + Im δi2 + Im δi3)
+ F^{(2)}(nη0 Im µγ^{(i)}(z)) ((1/2)(Im δi1)² + Im δi1 Im δi2)
− F^{(3)}(nη0 Im µγ^{(i)}(z)) ((1/6)(Im δi1)³) + O≺(n^{−4/3+ε}),

where we used (3.6). Hence, to prove (3.2), it suffices to show that for all nonnegative integers a, b satisfying a ≥ 1 and a + b ≤ 3, the following holds:

η0^a |E (xi (𝒢γ^{(i)})² x′i)^a (xi 𝒢γ^{(i)} x′i)^b − E (x̃i (𝒢γ^{(i)})² x̃′i)^a (x̃i 𝒢γ^{(i)} x̃′i)^b| ≺ n^{−1−δ}. (3.7)

Observe that the LHS is 0 if a + b = 1 (i.e., a = 1, b = 0), due to the fact that the covariance structure of x̃i is the same as that of xi. For the cases a + b = 2, 3, we need the following technical lemma, which can be obtained via an elementary calculation.

Lemma 3.2. Let $x = (x_1, \ldots, x_n)$ and $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_n)$ be defined in Section 2.1. Then for any vector of indices $\mathbf{k} = (k_1, k_2, \ldots, k_{2d})$ with $d = 2, 3$, we have
\[
\Big| \mathbb{E}\Big( \prod_{i=1}^{2d} x_{k_i} \Big) - \mathbb{E}\Big( \prod_{i=1}^{2d} \tilde{x}_{k_i} \Big) \Big| \prec n^{-d - \lceil \frac{d_1(\mathbf{k})}{2} \rceil - 1}, \tag{3.8}
\]
where $d_1(\mathbf{k})$ represents the number of lone indices in $\mathbf{k}$ and $\lceil \frac{d_1(\mathbf{k})}{2} \rceil$ represents the smallest integer greater than or equal to $\frac{d_1(\mathbf{k})}{2}$.
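To illustrate the quantities in Lemma 3.2 (an illustrative example, not part of the proof): for $d = 2$ and $\mathbf{k} = (1, 1, 2, 3)$, the indices $2$ and $3$ each appear exactly once, so $d_1(\mathbf{k}) = 2$ and (3.8) gives

```latex
\Bigl| \mathbb{E}\bigl( x_1^2 x_2 x_3 \bigr)
     - \mathbb{E}\bigl( \tilde{x}_1^2 \tilde{x}_2 \tilde{x}_3 \bigr) \Bigr|
  \prec n^{-2 - \lceil 2/2 \rceil - 1} = n^{-4},
```

while for $\mathbf{k} = (1, 2, 3, 4)$ all four indices are lone, so $d_1(\mathbf{k}) = 4$ and the bound becomes $n^{-2-2-1} = n^{-5}$, which is exactly the factor appearing in (3.11) below.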

With the aid of Lemma 3.2, we proceed to the proof of (3.7). In the case of Pearson's sample correlation matrix in [16], the counterpart of (3.8) has a sharper bound $n^{-d-\max\{d_1(\mathbf{k}), 1\}}$; see Lemma 5.5 therein. Nevertheless, the bound in (3.8) is as good as that in [16] when $d_1(\mathbf{k}) \leq 3$. Consequently, in the sequel, we only need to check those terms with $d_1(\mathbf{k}) \geq 4$. The case of $d_1(\mathbf{k}) \leq 3$ can be handled in the same way as in [16].

We start with the case $a = 1$, $b = 1$. In this case, we have
\[
\eta_0 \mathbb{E}_i \big( x_i (\mathcal{G}^{(i)}_\gamma)^2 x_i' \big) \big( x_i \mathcal{G}^{(i)}_\gamma x_i' \big) = \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) \leq 3} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4}\, \mathbb{E}\, x_{k_1} x_{k_2} x_{k_3} x_{k_4}
\]
\[
+ \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) > 3} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4}\, \mathbb{E}\, x_{k_1} x_{k_2} x_{k_3} x_{k_4}. \tag{3.9}
\]


Apparently, (3.9) and (3.10) still hold if we replace $x$ by $\tilde{x}$. As mentioned above, we can use the argument in [16] to conclude
\[
\Big| \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) \leq 3} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} \big( \mathbb{E}\, x_{k_1} x_{k_2} x_{k_3} x_{k_4} - \mathbb{E}\, \tilde{x}_{k_1} \tilde{x}_{k_2} \tilde{x}_{k_3} \tilde{x}_{k_4} \big) \Big| \prec n^{-1-\delta}.
\]
For the second part in (3.9), we observe that, by the exchangeability of the components of $x$,
\[
\mathbf{1}\big( d_1(\mathbf{k}) > 3 \big)\, \mathbb{E}\, x_{k_1} x_{k_2} x_{k_3} x_{k_4} = \mathbf{1}\big( d_1(\mathbf{k}) > 3 \big)\, \mathbb{E}\, x_1 x_2 x_3 x_4. \tag{3.10}
\]

Using (3.10) and (3.8), we get
\[
\Big| \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) > 3} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} \big( \mathbb{E}\, x_{k_1} x_{k_2} x_{k_3} x_{k_4} - \mathbb{E}\, \tilde{x}_{k_1} \tilde{x}_{k_2} \tilde{x}_{k_3} \tilde{x}_{k_4} \big) \Big|
\]
\[
\prec n^{-5} \eta_0 \sum_{\mathbf{k}} \big| \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} \big|\, \big| (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} \big| \leq n^{-3} \eta_0 \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^4} \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2} \prec n^{-\frac{5}{3}+\varepsilon}, \tag{3.11}
\]

where in the second step we used the Cauchy-Schwarz inequality. The case of $a = 2$, $b = 0$ can be proved similarly. More specifically, we can again decompose the sum into two parts according to whether $d_1(\mathbf{k}) \leq 3$ or $d_1(\mathbf{k}) > 3$. In the first case, we can simply use the argument in [16] to conclude the estimate. For the part with $d_1(\mathbf{k}) > 3$, instead of the bound $n^{-\frac{5}{3}+\varepsilon}$ in (3.11), we will have $n^{-\frac{4}{3}+\varepsilon}$ in the case of $a = 2$, $b = 0$.
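The Cauchy-Schwarz step invoked here (and used again for the bounds on $I_1, \ldots, I_4$ below) is the following elementary entrywise estimate, recorded for convenience: for any $n \times n$ matrix $A$,

```latex
\sum_{k,\ell=1}^{n} |A_{k\ell}|
  \le n \Bigl( \sum_{k,\ell=1}^{n} |A_{k\ell}|^2 \Bigr)^{1/2}
  = n \bigl( \operatorname{Tr} A A^* \bigr)^{1/2}
  = n \bigl( \operatorname{Tr} |A|^2 \bigr)^{1/2}.
```

Applied with $A = (\mathcal{G}^{(i)}_\gamma)^2$ and $A = \mathcal{G}^{(i)}_\gamma$, this converts the entrywise sum in (3.11) into the trace bound with the prefactor $n^{-3}$.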

For the case of $a + b = 3$, we have
\[
\eta_0^a \mathbb{E}_i \big( x_i (\mathcal{G}^{(i)}_\gamma)^2 x_i' \big)^a \big( x_i \mathcal{G}^{(i)}_\gamma x_i' \big)^b = \eta_0^a \sum_{\mathbf{k}} \prod_{j=1}^{a} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_{2j-1} k_{2j}} \prod_{j=a+1}^{3} (\mathcal{G}^{(i)}_\gamma)_{k_{2j-1} k_{2j}}\, \mathbb{E}\Big( \prod_{j=1}^{6} x_{k_j} \Big).
\]
The above also holds if we replace $x$ by $\tilde{x}$. Again, we can decompose the sum over $\mathbf{k}$ into two parts according to whether $d_1(\mathbf{k}) \leq 3$ or $d_1(\mathbf{k}) > 3$. The estimate of the first part follows from the discussion in [16] again. Hence, it suffices to consider the cases $d_1(\mathbf{k}) = 6$ or $d_1(\mathbf{k}) = 4$. For the first case, by (3.8) we have

\[
\Big| \eta_0^a \sum_{\mathbf{k}: d_1(\mathbf{k}) = 6} \prod_{j=1}^{a} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_{2j-1} k_{2j}} \prod_{j=a+1}^{3} (\mathcal{G}^{(i)}_\gamma)_{k_{2j-1} k_{2j}} \Big( \mathbb{E}\Big( \prod_{j=1}^{6} x_{k_j} \Big) - \mathbb{E}\Big( \prod_{j=1}^{6} \tilde{x}_{k_j} \Big) \Big) \Big|
\]
\[
\prec n^{-7} \eta_0^a \Big| \sum_{\mathbf{k}: d_1(\mathbf{k}) = 6} \prod_{j=1}^{a} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_{2j-1} k_{2j}} \prod_{j=a+1}^{3} (\mathcal{G}^{(i)}_\gamma)_{k_{2j-1} k_{2j}} \Big|
\prec n^{-4} \eta_0^a \big( \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^4 \big)^{\frac{a}{2}} \big( \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2 \big)^{\frac{b}{2}} \prec n^{-2+\varepsilon}.
\]

When $a + b = 3$ and $d_1(\mathbf{k}) = 4$, we shall further decompose the discussion into three cases: $(a, b) = (1, 2), (2, 1)$ or $(3, 0)$. The discussions for all three cases are similar; we thus only present the details for the first case in the sequel. In this case, using (3.8) and the fact $d_1(\mathbf{k}) = 4$, we have

\[
\Big| \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) = 4} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} (\mathcal{G}^{(i)}_\gamma)_{k_5 k_6} \Big( \mathbb{E}\Big( \prod_{j=1}^{6} x_{k_j} \Big) - \mathbb{E}\Big( \prod_{j=1}^{6} \tilde{x}_{k_j} \Big) \Big) \Big|
\]
\[
\prec n^{-6} \eta_0 \Big| \sum_{\mathbf{k}: d_1(\mathbf{k}) = 4} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} (\mathcal{G}^{(i)}_\gamma)_{k_5 k_6} \Big| =: n^{-6} \eta_0 |I_1 + I_2 + I_3 + I_4|,
\]

where $I_1$ represents the sum of the terms with $k_1 = k_2$; $I_2$ represents the sum of the terms with $k_3 = k_4$ or $k_5 = k_6$; $I_3$ represents the sum of the terms with $\#(\{k_1, k_2\} \cap \{k_3, k_4, k_5, k_6\}) = 1$; and $I_4$ represents the sum of the terms with $\#(\{k_3, k_4\} \cap \{k_5, k_6\}) = 1$. Observe that

\[
|I_1| \leq \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2 \Big( \sum_{k,\ell} |(\mathcal{G}^{(i)}_\gamma)_{k\ell}| \Big)^2 \leq n^2 \big( \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2 \big)^2 \prec n^4 \eta_0^{-1},
\]
\[
|I_2| \leq \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma| \Big( \sum_{k,\ell} |(\mathcal{G}^{(i)}_\gamma)_{k\ell}| \Big) \Big( \sum_{k,\ell} |((\mathcal{G}^{(i)}_\gamma)^2)_{k\ell}| \Big) \leq n^2 \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma| \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2} \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^4} \prec n^4 \eta_0^{-\frac{3}{2}},
\]
\[
|I_3| \leq \Big( \sum_{k,\ell} |(\mathcal{G}^{(i)}_\gamma)_{k\ell}| \Big) \Big( \sum_{k,\ell} |((\mathcal{G}^{(i)}_\gamma)^3)_{k\ell}| \Big) \leq n^2 \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^2} \sqrt{\operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^6} \prec n^3 \eta_0^{-\frac{5}{2}},
\]
\[
|I_4| \leq \Big( \sum_{k,\ell} |((\mathcal{G}^{(i)}_\gamma)^2)_{k\ell}| \Big)^2 \leq n^2 \operatorname{Tr} |\mathcal{G}^{(i)}_\gamma|^4 \prec n^2 \eta_0^{-\frac{5}{2}}.
\]

Then it is easy to check that
\[
\Big| \eta_0 \sum_{\mathbf{k}: d_1(\mathbf{k}) = 4} \big( (\mathcal{G}^{(i)}_\gamma)^2 \big)_{k_1 k_2} (\mathcal{G}^{(i)}_\gamma)_{k_3 k_4} (\mathcal{G}^{(i)}_\gamma)_{k_5 k_6} \Big( \mathbb{E}\Big( \prod_{j=1}^{6} x_{k_j} \Big) - \mathbb{E}\Big( \prod_{j=1}^{6} \tilde{x}_{k_j} \Big) \Big) \Big| \prec n^{-1-\delta}.
\]
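Indeed, inserting the four bounds above and taking $\eta_0$ of order $n^{-2/3}$, as in this type of edge argument (the power counting below is stated under this assumption, which is consistent with the exponents already displayed):

```latex
n^{-6}\eta_0 |I_1| \prec n^{-2}, \qquad
n^{-6}\eta_0 |I_2| \prec n^{-2}\eta_0^{-\frac{1}{2}} = n^{-\frac{5}{3}},
\]
\[
n^{-6}\eta_0 |I_3| \prec n^{-3}\eta_0^{-\frac{3}{2}} = n^{-2}, \qquad
n^{-6}\eta_0 |I_4| \prec n^{-4}\eta_0^{-\frac{3}{2}} = n^{-3},
```

so every term is $\prec n^{-\frac{5}{3}}$, which is $\prec n^{-1-\delta}$ for any $\delta < \frac{2}{3}$.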

Similarly, one can check that the above estimate holds for the cases $(a, b) = (2, 1)$ or $(3, 0)$. This concludes the proof of (3.7) and hence completes the proof of Lemma 3.1.

Using Lemma 3.1, we can now prove Theorem 1.2 and Corollary 1.4.

Proof of Theorem 1.2. Similarly to the proof of Theorem 1.1 in [17], from Lemma 3.1, one can show that
\[
\mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(S) - d_{+,c_n}) \leq s - n^{-\varepsilon} \big) - n^{-\delta}
\leq \mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(\rho) - d_{+,c_n}) \leq s \big)
\leq \mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(S) - d_{+,c_n}) \leq s + n^{-\varepsilon} \big) + n^{-\delta}, \tag{3.12}
\]


where $S$ is defined in (2.4). It is known from Theorem 2.7 of [7] that the largest eigenvalues of $S$ differ from the corresponding ones of $\Xi\Xi'$ only by $O_\prec(n^{-1})$. This, together with the edge universality of the sample covariance matrix in [17], further implies

\[
\mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(Q) - d_{+,c_n}) \leq s - n^{-\varepsilon} \big) - n^{-\delta}
\leq \mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(S) - d_{+,c_n}) \leq s \big)
\leq \mathbb{P}\big( n^{\frac{2}{3}} (\lambda_1(Q) - d_{+,c_n}) \leq s + n^{-\varepsilon} \big) + n^{-\delta}, \tag{3.13}
\]

where $Q$ is the Wishart matrix in Theorem 1.2. Combining (3.12) with (3.13), we conclude the proof of Theorem 1.2.

Proof of Corollary 1.4. The conclusion follows directly from Theorem 1.2 and the Tracy-Widom limit for $\lambda_1(Q)$. This completes the proof.
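The conclusion of Theorem 1.2 is easy to probe numerically. The sketch below is an illustration only, not part of the paper: the sample sizes, the tolerance, and the rank-based construction of the Spearman matrix via NumPy are choices of this note. It checks that, in the null case, the largest eigenvalue of the $p \times p$ Spearman rank correlation matrix concentrates near the Marchenko-Pastur right edge $d_{+,c_n} = (1 + \sqrt{p/n})^2$, around which the Tracy-Widom fluctuations of order $n^{-2/3}$ take place.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400                          # dimension ratio c_n = p/n = 0.5
W = rng.standard_normal((p, n))          # null case: i.i.d. continuous entries

# Row-wise ranks (no ties almost surely for continuous data);
# a double argsort gives the rank of each entry within its row.
Q = W.argsort(axis=1).argsort(axis=1) + 1.0

# Spearman's rho matrix = Pearson correlation matrix of the rank rows.
rho = np.corrcoef(Q)

lam1 = np.linalg.eigvalsh(rho).max()     # largest eigenvalue
d_plus = (1.0 + np.sqrt(p / n)) ** 2     # Marchenko-Pastur right edge d_{+,c_n}

# lam1 should lie within a few multiples of n^{-2/3} of d_plus
print(round(lam1, 3), round(d_plus, 3))
```

For $p = 200$, $n = 400$ the edge is $d_{+,c_n} \approx 2.914$, and the deviation of $\lambda_1$ from it is of the order $n^{-2/3} \approx 0.02$, far smaller than the gap to the bulk; repeating over many draws of $W$ and rescaling by $n^{2/3}$ would produce a histogram approximating the Tracy-Widom law.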

References.

[1] Z. D. Bai, W. Zhou: Large sample covariance matrices without independence structures in columns. Statistica Sinica, 425-442 (2008).

[2] A. S. Bandeira, A. Lodhia, P. Rigollet: Marchenko-Pastur law for Kendall's tau. Electronic Communications in Probability, 22 (2017).

[3] Z. G. Bao: Tracy-Widom limit for Kendall's tau. arXiv:1712.00892.

[4] Z. G. Bao, G. M. Pan, W. Zhou: Tracy-Widom law for the extreme eigenvalues of sample correlation matrices. Electron. J. Probab. 17, No. 88, 1-32 (2012).

[5] Z. G. Bao, G. M. Pan, W. Zhou: Universality for the largest eigenvalue of sample covariance matrices with general population. Ann. Stat. 43(1), 382-421 (2015).

[6] Z. G. Bao, L.-C. Lin, G. M. Pan, W. Zhou: Spectral statistics of large dimensional Spearman's rank correlation matrix and its application. The Annals of Statistics, 43(6), 2588-2623 (2015).

[7] A. Bloemendal, A. Knowles, H.-T. Yau, J. Yin: On the principal components of sample covariance matrices. Probab. Theory and Related Fields, 164(1-2), 459-552 (2016).

[8] X. Ding, F. Yang: A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. arXiv:1607.06873.

[9] L. Erdos, A. Knowles, H.-T. Yau: Averaging fluctuations in resolvents of random band matrices. Ann. Henri Poincare 14, 1837-1926 (2013).

[10] L. Erdos, H.-T. Yau, J. Yin: Bulk universality for generalized Wigner matrices. Probability Theory and Related Fields, 1-67 (2012).

[11] L. Erdos, H.-T. Yau, J. Yin: Rigidity of eigenvalues of generalized Wigner matrices. Adv. Math. 229(3), 1435-1515 (2012).

[12] I. M. Johnstone: On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat., 295-327 (2001).

[13] A. Knowles, J. Yin: Anisotropic local laws for random matrices. arXiv:1410.3516.

[14] J. O. Lee, K. Schnelli: Tracy-Widom distribution for the largest eigenvalue of real sample covariance matrices with general population. Ann. Appl. Probab., 26(6), 3786-3839 (2016).

[15] V. A. Marchenko, L. A. Pastur: Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4), 457 (1967).

[16] N. S. Pillai, J. Yin: Edge universality of correlation matrices. The Annals of Statistics, 40(3), 1737-1763 (2012).


[17] N. S. Pillai, J. Yin: Universality of covariance matrices. Ann. Appl. Probab., 24(3), 935-1001 (2014).

[18] K. Wang: Random covariance matrices: universality of local statistics of eigenvalues up to the edge. Random Matrices: Theory and Applications, 1(01), 1150005 (2012).