43
Submitted to Bernoulli arXiv: arXiv:0000.0000 Interpoint Distance Based Two Sample Tests in High Dimension CHANGBO ZHU 1,* and XIAOFENG SHAO 1,** 1 Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA. E-mail: * [email protected]; ** [email protected] In this paper, we study a class of two sample test statistics based on inter-point distances in the high dimensional and low/medium sample size setting. Our test statistics include the well-known energy distance and maximum mean discrepancy with Gaussian and Laplacian kernels, and the critical values are obtained via permutations. We show that all these tests are inconsistent when the two high dimensional distributions correspond to the same marginal distributions but differ in other aspects of the distributions. The tests based on energy distance and maximum mean discrepancy mainly target the differences between marginal means and variances, whereas the test based on L 1 -distance can capture the difference in marginal distributions. Our theory sheds new light on the limitation of inter-point distance based tests, the impact of different distance metrics, and the behavior of permutation tests in high dimension. Some simulation results and a real data illustration are also presented to corroborate our theoretical findings. Keywords: Two Sample Test, High Dimensionality, Permutation Test, Power Analysis. 1. Introduction In many statistical and machine learning applications, we need inference about the two populations or distributions based on the data samples collected. For example, we need to compare the effectiveness of two newly developed drugs in clinical research, the higher educational level between two countries in a social study and the global warming effects on two regions in environmental science. Two sample hypothesis testing is a statistical procedure to deal with such problems. Formally speaking, having i.i.d. p-dimensional samples X 1 , ··· ,X n = d X F and Y 1 , ··· ,Y m = d Y G, we are interested in knowing whether the underlining distributions F and G which generate the two samples are the same, i.e. to test the following hypothesis, H 0 : F = G versus H A : F 6= G. The study of two-sample testing has a long history and dates back to Kolmogorov- Smirnov’s test [19, 27], where the empirical CDFs are compared using the sup-norm. Related work for univariate two-sample tests includes Cramer von-Mises criterion [9, 29] and Anderson-Darling test [3]. Extensions to comparison of multivariate distributions and also the k-sample problem can be found in [5, 6, 12, 17, 26] among others. Some 1 imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Interpoint Distance Based Two Sample Tests in High Dimension

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Submitted to BernoulliarXiv: arXiv:0000.0000

Interpoint Distance Based Two Sample Tests

in High Dimension

CHANGBO ZHU1,* and XIAOFENG SHAO1,**

1Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820,USA. E-mail: *[email protected]; **[email protected]

In this paper, we study a class of two sample test statistics based on inter-point distances in thehigh dimensional and low/medium sample size setting. Our test statistics include the well-knownenergy distance and maximum mean discrepancy with Gaussian and Laplacian kernels, and thecritical values are obtained via permutations. We show that all these tests are inconsistent whenthe two high dimensional distributions correspond to the same marginal distributions but differin other aspects of the distributions. The tests based on energy distance and maximum meandiscrepancy mainly target the differences between marginal means and variances, whereas thetest based on L1-distance can capture the difference in marginal distributions. Our theory shedsnew light on the limitation of inter-point distance based tests, the impact of different distancemetrics, and the behavior of permutation tests in high dimension. Some simulation results anda real data illustration are also presented to corroborate our theoretical findings.

Keywords: Two Sample Test, High Dimensionality, Permutation Test, Power Analysis.

1. Introduction

In many statistical and machine learning applications, we need inference about the twopopulations or distributions based on the data samples collected. For example, we needto compare the effectiveness of two newly developed drugs in clinical research, the highereducational level between two countries in a social study and the global warming effectson two regions in environmental science. Two sample hypothesis testing is a statisticalprocedure to deal with such problems. Formally speaking, having i.i.d. p-dimensionalsamples X1, · · · , Xn =d X ∼ F and Y1, · · · , Ym =d Y ∼ G, we are interested in knowingwhether the underlining distributions F and G which generate the two samples are thesame, i.e. to test the following hypothesis,

H0 : F = G versus HA : F 6= G.

The study of two-sample testing has a long history and dates back to Kolmogorov-Smirnov’s test [19, 27], where the empirical CDFs are compared using the sup-norm.Related work for univariate two-sample tests includes Cramer von-Mises criterion [9, 29]and Anderson-Darling test [3]. Extensions to comparison of multivariate distributionsand also the k-sample problem can be found in [5, 6, 12, 17, 26] among others. Some

1imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

2 C. Zhu and X. Shao

other interesting work focusing on the “trimmed” comparison of distributions can befound in [1, 2, 11, 23].

However, all the afore-mentioned work focuses on the fixed dimensional case. If thedimension exceeds the sample size or is allowed to grow, some of the above methods areexpected to fail. For example, the density-based methods suffer from the curse of highdimensionality in particular. In this paper, we study the two sample tests based on certaindissimilarity metrics that can be expressed as functions of the interpoint distances. Twoof the most popular high dimensional two-sample tests that fall into this category arebased on the Energy Distance (ED) [28] and the Maximum Mean Discrepancy (MMD)[13]. The former is based on the Euclidean distance between sample elements; while thelatter is a kernel based method and is basically a variant of ED with a user-specifiedkernel as distance metric. To be more specific, both ED and MMD take the followingform

EDk(F,G) = 2E[k(X,Y )]− E[k(X,X ′)]− E[k(Y, Y ′)], (1)

where k is a user-specified kernel, X ′, Y ′ are i.i.d copies of X,Y respectively. For instance,k can be chosen as

L2-norm (Euclidean distance) : k(X,Y ) = ‖X − Y ‖2 =√∑p

u=1(xu − yu)2,

Gaussian kernel : k(X,Y ) = exp(−‖X−Y ‖

22

2γ2p

),

Laplacian kernel : k(X,Y ) = exp(−‖X−Y ‖2γp

),

L1-norm : k(X,Y ) = ‖X − Y ‖1 =∑pu=1 |xu − yu|,

where X = (x1, · · · , xp)T , Y = (y1, · · · , yp)T and γp is a user-specified bandwidth pa-rameter. Then, the population version of ED is given by Equation (1) with k beingthe L2-norm and the population version of MMD multiplied by -1 is given by Equa-tion (1) with k being Gaussian or Laplacian kernel. When k is L2-norm, Gaussian orLaplacian kernel, EDk(F,G) enjoys the property that EDk(F,G) = 0⇔ F = G. In fact,EDk(F,G) = 0⇔ F = G holds as long as k is a strongly negative definite kernel [18]. EDand MMD based tests are both nonparametric without any assumption on the underly-ing distributions and can be implemented conveniently in practice using permutations.In this work, we aim to address the following questions:

1, Can EDk based permutation test maintain its power against all kinds of alternativesin the high dimensional setting?

2, What are the impact of different distance metrics?

To answer the above questions, we conduct rigorous theoretical analysis on the power ofEDk(F,G) based permutation test in the high dimensional low sample setting (HDLSS)[16] as well as high dimensional medium sample size setting (HDMSS) [4]. Naturally, wesay a test is consistent if its power goes to 1 under either HDLSS or HDMSS regime. Here,we study the power property of the permutation based tests because they are frequentlyimplemented for Energy Distance and its variants in real life applications.

Let X = (X1, X2, · · · , Xn)T , Y = (Y1, Y2, · · · , Ym)T , Z = (XT ,YT )T denote thesample matrices and EDk

n(Z) be a U-statistic based unbiased estimator of EDk(F,G).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 3

Our main results include: (i) Derivation of the limiting distribution of EDkn(ΓZ) under

both low and medium sample size setting, where Γ ∼ Uniform(Pn+m) and Pn+m is theset of permutation matrices of dimension (n+m)×(n+m). (ii) Based on the asymptoticresults, we formulate different local alternatives, under which the power behavior of EDk

n

based permutation tests are discussed in detail. (iii) Our theories are applied to existingkernels and statistics, for example

1, Under both HDLSS and HDMSS, EDk based permutation test w.r.t. L2-norm,Gaussian and Laplacian kernel are consistent if the sum of component-wise meanor variance differences are not so small, i.e, limp→∞

∑pu=1(E(xu)−E(yu))2/p 6= 0

or limp→∞ |∑pu=1(var(xu)−var(yu))/p| 6= 0. In addition, if the sum of component-

wise mean and variance differences are both of order o(√p/√nm), i.e.,

p∑u=1

(E(xu)− E(yu))2 = o

( √p

√nm

)and

∣∣∣∣∣p∑

u=1

(var(xu)− var(yu))

∣∣∣∣∣ = o

( √p

√nm

),

these tests suffer substantial power loss (the limits of their power are derived) underHDLSS and have trivial power (power no larger than the significance level) underHDMSS. Furthermore, under HDLSS, the afore-mentioned tests have trivial powerif additionally we have

∑pu,v=1(cov(xu, xv)− cov(yu, yv))

2 = o(p).

2, When k is chosen as L1-norm, EDk based permutation test experiences a powerdrop under HDLSS and trivial power under HDMSS if X,Y have the same uni-variate marginal distribution, i.e. xu =d yu for u = 1, 2, · · · , p. This phenomenonis consistent with the fact that EDk with L1-norm can characterise the discrepan-cies between the marginal univariate distributions. In addition, Under HDLSS, weshow that the L1-norm based test has trivial power when X and Y have the samebivariate marginal distribution, i.e., (xu, xv) =d (yu, yv), u, v = 1, · · · , p.

These findings are further corroborated in our simulation study. It is worth mentioningthat Chakraborty and Zhang [8] investigate the energy distance, maximum mean dis-crepancy, distance covariance and Hilbert-Schmidt Independence Criterion in the highdimensional setting. They propose a new class of metrics which can detect/measure theequality of low-dimensional marginal distributions and a computational efficient t-testis further proposed based on the new metric. By contrast, our focus is on kernel-basedpermutation test and their asymptotic power properties in the high dimensional setting.In the following we introduce some notation and define some frequently used operatorsfor later convenience.

1.1. Notation

Here, random data samples are denoted as, for each i = 1, 2, · · · , n, Xi =d X =(x1, · · · , xp)T ∈ Rp and for each j = 1, 2, · · · ,m, Yj =d Y = (y1, · · · , yp)T ∈ Rp.Next, let X = (X1, X2, · · · , Xn)T , Y = (Y1, Y2, · · · , Ym)T and Z = (XT ,YT )T =(Z1, Z2, · · · , Zn+m)T denote the random sample matrices. Furthermore, let Pn+m =

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

4 C. Zhu and X. Shao{Γ1,Γ2, · · · ,Γ(n+m)!

}be the group containing all permutation matrices of dimension

(n+m)× (n+m) and for each i, let πi be the permutation that corresponds to Γi via

ΓiZ =(Zπi(1), Zπi(2), · · · , Zπi(n+m)

)T.

For a random permutation matrix Γ ∼ uniform(Pn+m), we use π to represent its corre-sponding permutation. Next, given any function ϕ, ϕ(i) is used to denote its i-th orderderivative. Calligraphic letters (K,L,R,W,G) are used to denote self-defined operatorsthat act on random variables to produce random variables. For two random variablesW1,W2, the notation W1 =d W2 means that they have the same distribution. Under

HDLSS or HDMSS setting, we usep→,

d→ to denote convergence in probability, in dis-tribution respectively and utilize the order in probability notations such as stochasticboundedness Op (big O in probability) and convergence in probability op (small o inprobability).

2. Interpoint Distance Based Two Sample Tests

In this paper, we limit our attention to EDk(F,G), where k is a user specified dissimilaritymetric [24] of the following form

k(X,Y ) = ϕ

{1

p

p∑u=1

ψ(xu, yu)

}, (2)

where ψ ≥ 0 and ϕ has continuous second order derivative on (0,+∞). The reasonwe focus on EDk(F,G) of the above form is that the metric k encompasses many well-known distance metrics such as L2-norm, L1-norm, Gaussian and Laplacian kernel. Con-sequently, Energy Distance (ED) and Maximum Mean Discrepancy (MMD) are justspecial cases of EDk(F,G). We summarize the commonly used distance metrics in Table1. Following the literatures [13, 14], we consider the bandwidth parameter γ in Gaussianand Laplacian kernel as a fixed constant and note that its relationship with γp (intro-duced in Gaussian or Laplacian kernel on page 2) is given by γp =

√pγ. Notice that if

k is some well-known distance metrics such as L2-norm, Gaussian kernel (multiplied by-1) and Laplacian kernel (multiplied by -1), a nice property for EDk is that

EDk(F,G) ≥ 0 and EDk(F,G) = 0⇔ F = G. (3)

Here, it is just for the ease of presentation and notational simplicity that k is set tobe Gaussian or Laplacian kernel multiplied by -1. In fact, if k is a universal kernel(see Theorem 5 and Lemma 1 of [13]) or k is a strongly negative definite kernel (seeTheorem 1.9 [18]), Property (3) still holds. On the other hand, using ED1(F,G) to denoteEDk(F,G) when k is the L1-distance, we observe that ED1(F,G) =

∑pu=1 ED(Fu, Gu),

from which it easily follows that

ED1(F,G) ≥ 0 and ED1(F,G) = 0⇔ Fu = Gu for all u = 1, 2, · · · , p.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 5

ψ(x, y) ϕ(x) k EDk(F,G)

(x− y)2

√x L2-norm

Energy distance (ED)

Szekely and Rizzo [28]

−e− x

2γ2Gaussian kernel

(multiplied by -1)Maximum Mean Discrepancy (MMD)

Gretton et al. [13]−e−

√xγ

Laplacian kernel

(multiplied by -1)

|x− y| x L1-normUsed for some graph-based tests

Sarkar et al. [24]

Table 1. Correspondence between different choices of ψ,ϕ and existing distance metrics as well as twosample test statistics in the literature.

Notice that it is possible to have Fu = Gu for all u = 1, 2, · · · , p but F 6= G, under whichwe have ED1(F,G) = 0 while EDk > 0 if k is L2-norm, Gaussian kernel (multiplied by-1) or Laplacian kernel (multiplied by -1). Thus, L2-norm, Gaussian kernel or Laplaciankernel based test statistics have advantage over L1-norm based test statistic in the lowdimensional setting, but we will see later that the story is in a sense reversed under thehigh dimensional setting. Next, an unbiased estimator of EDk is given as

EDkn(Z) =

2

mn

n∑i=1

m∑j=1

k(Xi, Yj)

− 2

n(n− 1)

∑1≤i<j≤n

k(Xi, Xj)−2

m(m− 1)

∑1≤i<j≤m

k(Yi, Yj).

3. Power Analysis for Permutation Test

As permutation tests are commonly used for Energy Distance and kernel variants inpractice due to their implementational convenience and accurate size, we study theirasymptotic behavior under the high dimensional setting in this subsection. Since wehave i.i.d samples, after we permute the data, i.e., shuffle the rows of Z as ΓiZ by somepermutation matrix Γi, what really matters to the distribution of EDk

n(ΓiZ) is how manyX samples stay in the first n rows. Formally, let |A| be the cardinality of the set A andgiven a permutation matrix Γi with the corresponding permutation πi, set

N(Γi) = |{j ∈ {1, 2, · · · , n} : 1 ≤ j ≤ n, n+ 1 ≤ πi(j) ≤ n+m}| .

The integer n − N(Γi) actually counts the number of samples which belong to thefirst n rows of Z both before and after the permutation Γi. Notice that it is possiblethat N(Γi) = N(Γj) for different permutations Γi and Γj . The set Sw collects all the

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

6 C. Zhu and X. Shao

permutations Γi such that N(Γi) = w. Mathematically, fix 0 ≤ w ≤ min{n,m}, setSw = {Γi : N(Γi) = w, i = 1, 2, · · · , (n+m)!} , then

|Sw| =(m

w

)(n

n− w

)n!m!.

To differentiate from Γi, we use italic symbol Γw to represent an element in Sw. Intuitively,|Sw| is the number of permutations that would have n − w samples stay in the first nrows of Z after we apply the corresponding permutation. The above process is furtherillustrated in the following diagram.

Xi : n samples

Yj : m samples

Xi : n− w samplesYj : w samples

Xi : w samplesYj : m− w samples

⇑Z

⇑ΓZ

Γ ∈ Sw

For the inter-point distance based two sample tests, we can equivalently permute theweights on the pair-wise distances instead of permuting data points, i.e., for a fixedpermutation matrix Γs ∈ Pn+m that corresponds to πs, we can write EDk

n(ΓsZ) as

EDkn(ΓsZ) =

n+m∑i=2

i−1∑j=1

Πs,ijk(Zi, Zj), (4)

where Πs,ij is defined as

Πs,ij =

− 2n(n−1) , 1 ≤ πs(i), πs(j) ≤ n,− 2m(m−1) , n+ 1 ≤ πs(i), πs(j) ≤ n+m,

2mn , 1 ≤ πs(i) ≤ n, n+ 1 ≤ πs(j) ≤ n+m,2mn , n+ 1 ≤ πs(i) ≤ n+m, 1 ≤ πs(j) ≤ m.

To formally define the permutation test for EDkn(Z), let R denote the randomization

distribution of EDkn(Z), which is defined by

R(t) =1

(n+m)!

(n+m)!∑i=1

I{EDkn(ΓiZ)≤t} =1

(m+ n)!

min{n,m}∑w=0

∑Γ∈Sw

I{EDkn(ΓZ)≤t}.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 7

For any distribution F , let the (1 − α)-th quantile of F be denoted by QF,1−α. In par-

ticular, the (1− α)th quantile of R is QR,1−α, i.e.

QR,1−α = R−1(1− α) = inf{t : R(t) ≥ 1− α

}. (5)

Then, the level-α permutation test w.r.t. EDkn(Z) is defined as

Reject H0 if EDkn(Z) > QR,1−α.

In real life applications, (n+m)! might be large, we thus resort to an approximation ofQR,1−α. Let Γ1, · · · ,ΓS−1 be i.i.d and uniformly sampled from Pn+m and we approximatethe critical value by QR,1−α, where

R(t) :=1

S

(I{EDkn(Z)≤t} +

S−1∑i=1

I{EDkn(ΓiZ)≤t}

).

Under the null hypothesis, let the critical value c be chosen as c = QR,1−α or c = QR,1−α,

then we have P (EDkn(Z) > c) ≤ α, where the equality may not hold (i.e., the size may

not be equal to α) since our test is non-randomized. A randomization based test can beformulated but should make little difference in practice; see [21].

3.1. Local Alternatives

In this subsection, we define different local alternatives, under which the EDkn(Z) based

permutation test will be consistent, have a nontrivial power limit and exhibit trivialpower (power no larger than the significance level α) in the limit. To formally define thelocal alternative hypothesis, let the operator K be defined as

K(Zi, Zj) =1√p

p∑u=1

{ψ(ziu, zju)− E [ψ(ziu, zju)|ziu]

− E [ψ(ziu, zju)|zju] + E [ψ(ziu, zju)]}, (6)

It follows from Proposition 2.2.1 of [30] that E[K(Zi, Zj)K(Zi′ , Zj′)] = 0 if {i, j} 6={i′, j′}. Next, denote the average distance over components as

ψ(Zi, Zj) =1

p

p∑u=1

ψ(ziu, zju).

In addition, we need to assume the existence of some constants to properly define thelocal alternatives. These constants will also appear in the limiting distribution of our teststatistics.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

8 C. Zhu and X. Shao

Assumption 1. As p→∞, assume the existence of the limiting mean

ex = limp→∞

E[ψ(X,X ′)

], ey = lim

p→∞E[ψ(Y, Y ′)

]and exy = lim

p→∞E[ψ(X,Y )

]and also the limiting variances

vx = limp→∞

var [K(X,X ′)] , vy = limp→∞

var [K(Y, Y ′)] and vxy = limp→∞

var [K(X,Y )] .

Then, we are ready to define the consistency space HAc , under which the EDkn(Z)

implemented as permutation test can be shown to be consistent under both HDLSS andHDMSS settings.

HAc := {(F,G) | 2ϕ(exy) 6= ϕ(ex) + ϕ(ey)} .

We use Ac to denote the complement of any given set A and denote F = (F1, F2, · · · , Fp)and G = (G1, G2, · · · , Gp), where Fu, Gu, u = 1, 2, · · · , p are the marginal univariatedistributions. For commonly used kernels, we have the following table characterizingHAc and the proof is postponed to subsection A.1.

k HAc Characterization

L2-norm

HAc =

(F,G)

∣∣∣∣∣∣∑pu=1(E(xu)− E(yu))2 = o(p) and

|∑pu=1(var(xu)− var(yu))| = o(p)

c

Gaussian kernel

Laplacian kernel

L1-norm HAc = {(F,G) |∑pu=1 ED(Fu, Gu) = o(p)}c

Then, we present the space HAl , under which the normal limit of EDkn(Z) can be derived

under both HDLSS and HDMSS.

HAl :=

(F,G)

∣∣∣∣∣∣∣∣∣∣∣

exy = ex = ey,∣∣2E [ψ(X,Y )]− E

[ψ(X,X ′)

]− E

[ψ(Y, Y ′)

]∣∣ = o(√

1nmp ),

E[∣∣E [ψ(X,Y )|X

]− E

[ψ(X,X ′)|X

]∣∣] = o(√

1nmp ) and

E[∣∣E [ψ(X,Y )|Y

]− E

[ψ(Y, Y ′)|Y

]∣∣] = o(√

1nmp ).

.

Under HAl , a limit for the power of EDkn(Z) (implemented as permutation test) is derived

under HDLSS. On the other hand, its power is shown to be trivial (no larger than thesignificance level α) under HDMSS and HAl . Next, we provide sufficient conditions for

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 9

(F,G) ∈ HAl with respect to the following well known kernels.

k Sufficient conditions for HAl

L2-norm (F,G)

∣∣∣∣∣∣∑pu=1(E(xu)− E(yu))2 = o(

√pnm ) and

|∑pu=1(var(xu)− var(yu))| = o(

√pnm )

⊆ HAlGaussian kernel

Laplacian kernel

L1-norm {(F,G) |Fu = Gu, u = 1, 2, · · · , p} ⊆ HAl

Then, the set of distributions HAt is defined as

HAt := {(F,G) |(F,G) ∈ HAl , vxy = vx = vy } .

It can be shown that under HAt , the EDkn(Z) based permutation test has power no larger

than the significance level α for both HDLSS and HDMSS settings. Sufficient conditionsof being in HAt are provided in the following table.

k Sufficient conditions for HAt

L2-norm

(F,G)

∣∣∣∣∣∣∣∣∣∑pu=1(E(xu)− E(yu))2 = o(

√pnm ),

|∑pu=1(var(xu)− var(yu))| = o(

√pnm ) and∑p

u,v=1(cov(xu, xv)− cov(yu, yv))2 = o(p)

⊆ HAtGaussian kernel

Laplacian kernel

L1-norm{

(F,G)∣∣(xu, xv) =d (yu, yv), u, v = 1, · · · , p

}⊆ HAt

Comparing the three local alternatives, it follows from the definition of HAc , HAl , HAt

that HcAc⊇ HAl ⊇ HAt . We also want to remark that it holds for arbitrary function ϕ

and ψ that

{(F,G) |Fu = Gu, u = 1, 2, · · · , p} ⊆ HAl ,{(F,G)

∣∣(xu, xv) =d (yu, yv), u, v = 1, · · · , p}⊆ HAt .

3.2. High Dimensional Low Sample Size (HDLSS)

The analysis in this subsection is conducted under the high dimensional low sample sizesetting (HDLSS), i.e, n,m are fixed constants and we let p → ∞. Our final goal is tostudy the power of EDk

n(Z) based permutation test under various local alternatives. Tothis end, we need the following assumption. Recall the operator K is defined in (6).

Assumption 2. For fixed n,m, as p→∞, K(Xi, Yj)K(Xi1 , Xi2)K(Yj1 , Yj2)

i,j,i1<i2,j1<j2

d→

bijci1i2dj1j2

i,j,i1<i2,j1<j2

,

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

10 C. Zhu and X. Shao

where {bij , ci1i2 , dj1j2}i,j,i1<i2,j1<j2 are uncorrelated and jointly Gaussian with mean 0and variances var(bij) = vxy, var(ci1i2) = vx, var(dj1j2) = vy.

Remark 3.1. The above multi-dimensional CLT result is classical and can be derivedunder suitable moment and weak dependence assumptions on the components of X andY .

In the above assumption, it is due to the use of double centered distance K(Zi, Zj)that the asymptotic covariance matrix is diagonal. Then, to provide some insights, thefirst step of our power analysis is the Taylor expansion w.r.t ϕ up to the second order,i.e., for i 6= j

k(Zi, Zj) = ϕ (eij) + ϕ(1)(eij)L(Zi, Zj) +R2(Zi, Zj),

where L(Zi, Zj) := ψ(Zi, Zj)− eij is an operator that acts on random variables,

eij =

ex, if 1 ≤ i, j ≤ n,ey, if n+ 1 ≤ i, j ≤ n+m,exy, otherwise.

and R2(Zi, Zj) is the remainder given by

R2(Zi, Zj) = L2(Zi, Zj)

∫ 1

0

∫ 1

0

uϕ(2) (eij + uvL(Zi, Zj)) dvdu.

In order to control the remainder term, we need assumptions about the decay rate ofE[L2(Zi, Zj)]. Thus, we set

α2x = E[L2(X,X ′)], α2

y = E[L2(Y, Y ′)] and α2xy = E[L2(X,Y )].

It then follows from Markov’s inequality that L(X,Y ) = Op(αxy), L(X,X ′) = Op(αx)and L(Y, Y ′) = Op(αy). Then, our next two assumptions are used to control the remain-der terms induced by taking the Taylor expansion.

Assumption 3. α2xy = o(1), α2

x = o(1) and α2y = o(1).

Assumption 4.√pα2

xy = o(1),√pα2

x = o(1) and√pα2

y = o(1).

Remark 3.2. To gain some insights into the above assumptions, a straightforwardcalculation yields

α2xy =

1

p2

p∑u,v=1

cov(ψ(xu, yu), ψ(xv, yv)) +

(∑pu=1E[ψ(xu, yu)]

p− exy

)2

.

Therefore, we have√pα2

xy = o(1) if the component-wise dependencies of both X and Yare not so strong. For illustration purpose, suppose X and Y are κ-dependent weak sta-tionary time series, i.e., xu ⊥ xv and yu ⊥ yv if |u−v| > κ. Then, if maxuE[ψ2(xu, yu)] <

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 11

∞, it is easy to see that α2xy = O (κ/p) and as a consequence, Assumption 4 is satis-

fied as long as κ/√p = o(1). In addition, it is indeed fairly straightforward to verify

the above result when the sequence {(xu, yu)}pu=1 is α-mixing with geometrically decayingcoefficients.

Remark 3.3. When ψ(x, y) = (x− y)2, some algebra shows that

p∑u,v=1

cov(ψ(xu, yu), ψ(xv, yv)) = var(XTX) + var(Y TY )

+ 4var(XTY )− 4cov(XTX,XTE[Y ])− 4cov(Y TY, Y TE[X]).

Thus, suppose∑pu=1E[ψ(xu, yu)]/p − exy = o(p−1/4) and if var(XTX), var(Y TY ),

var(XTY ), var(XTE[Y ]), var(E[X]TY ) all have order o(p1.5), we have√pα2

xy = o(1).

In the next theorem, we state the asymptotic behavior of EDkn(ΓwZ) for each fixed

permutation matrix Γw ∈ Sw. Here, we use the italic gamma Γw ∈ Sw to differentiatefrom Γi ∈ Pn+m.

Theorem 3.1. For fixed Γw ∈ Sw,

(i) Under Assumptions 1 and 3,

EDkn(ΓwZ)

p→ µn,w,

where µn,w is defined as

µn,w := µn(ΓwZ) = (2ϕ(exy)− ϕ(ex)− ϕ(ey))×{1−

(2m− 1

m(m− 1)+

2n− 1

n(n− 1)

)w +

(2

mn+

1

n(n− 1)+

1

m(m− 1)

)w2

}.

(ii) Under Assumptions 1, 2, 4 and local alternative HAl ,

√p(

EDkn(ΓwZ)− µn(ΓwZ)

)d→ N

(0, σ2

n,w

).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

12 C. Zhu and X. Shao

where σ2n,w is given as

σ2n,w :=σn(ΓwZ)

=

{4

nm− 4

(n+m

n2m2− n

n2(n− 1)2− m

m2(m− 1)2

)w

+ 4

(2

n2m2− 1

n2(n− 1)2− 1

m2(m− 1)2

)w2

}vxy[ϕ(1)(exy)]2

+

{2

n(n− 1)+ 2

(2n

n2m2− 2n− 1

n2(n− 1)2− 1

m2(m− 1)2

)w

− 2

(2

m2n2− 1

n2(n− 1)2− 1

m2(m− 1)2

)w2

}vx[ϕ(1)(ex)]2

+

{2

m(m− 1)+ 2

(2m

n2m2− 1

n2(n− 1)2− 2m− 1

m2(m− 1)2

)w

− 2

(2

n2m2− 1

m2(m− 1)2− 1

n2(n− 1)2

)w2

}vy[ϕ(1)(ey)]2.

We use W ∼ Hypergeometric(n+m,m, n) to denote that W follows the hypergeomet-ric distribution, which describes the probability of n draws from a union of two groups(one group has m elements, the other has n elements) such that w of them are chosenfrom the group of size m. To be precise, W has probability mass function

P (W = w) =

(mw

)(n

m−w)(

n+mn

) for w ∈ {0, 1, · · · ,min{n,m}} .

Then, the limiting distribution of EDkn(ΓZ) is derived in the following proposition.

Proposition 3.1. For Γ ∼ Uniform(Pn+m), which is independent of the data, let

W := N(Γ) ∼ Hypergeometric(n+m,m, n).

(i) Under Assumptions 1 and 3,

EDkn(ΓZ)

p→ µn,W .

(ii) Under Assumptions 1, 2, 4 and local alternative HAl ,

√p(

EDkn(ΓZ)− µn,W

)d→ N

(0, σ2

n,W

).

In the above proposition, N(0, σ2

n,W

)should be understood as a mixture of Gaussian

with probability distribution

P(N(0, σ2

n,W

)≤ a

)=

min{n,m}∑w=1

P (W = w)P (N(0, σ2

n,w

)≤ a).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 13

Next, let Γ0 corresponds to the identity permutation map, we present the power behaviorof EDk(Z) when the critical values are obtained via permutations.

Theorem 3.2. Under Assumption 1 and assume that 2ϕ(exy) ≥ ϕ(ex) + ϕ(ey).

1, [Consistency] Suppose Assumption 3 hold.

(i) If the critical value is chosen as QR,1−α. Let n,m be large enough such that

n!m!/(n + m)! < 1 − α if m 6= n and 2(n!)2/(2n)! < 1 − α if m = n. Then,we have

limp→∞

PHAc

(EDk

n(Z) > QR,1−α

)= 1,

which means that the asymptotic power of EDk based permutation test is 1 asp goes to infinity.

(ii) If the critical value is chosen as QR,1−α.Then, we have

limp→∞

PHAc

(EDk

n(Z) > QR,1−α

)≥

{1− S−1

bαScn!m!

(n+m)! , if n 6= m,

1− S−1bαSc

2(n!)2

(n+m)! , if n = m.

2, [Power Limit] Suppose Assumptions 2 and 4 hold.

(i) If the critical value is chosen as QR,1−α. Then, we have

limp→∞

PHAl

(EDk

n(Z) > QR,1−α

)= P (V (Γ0) > QT ,1−α),

where

T (t) :=1

(n+m)!

(n+m)!∑i=1

I{V (Γi)≤t}

and

V (Γs) =n∑i=1

m∑j=1

Πs,ijbij −∑

1≤i<j≤n

Πs,ijcij −∑

1≤i<j≤m

Πs,ijdij .

(ii) If the critical value is chosen as QR,1−α.

limp→∞

PHAl

(EDk

n(Z) > QR,1−α

)= P

(V (Γ0) > QT ,1−α

),

where

T :=1

S

(I{V (Γ0)≤t} +

S−1∑i=1

I{V (Γi)≤t}

).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

14 C. Zhu and X. Shao

3, [Trivial Power] Suppose Assumptions 2 and 4 hold. Then, we have

limp→∞

PHAt

(EDk

n(Z) > c)≤ α where c = QR,1−α or c = QR,1−α,

which means that the asymptotic power of EDk based permutation test is no morethan the level α when p goes to infinity.

Remark 3.4. The above theorem and discussions in subsection 3.1 indicate that

1, L1-norm can be more advantageous than L2-norm, Gaussian kernel and Laplaciankernel when the dimension is high, since L1-distance leads to high power providedthat the summation of discrepancies between marginal univariate distributions isnot so small, while L2-norm, Gaussian kernel and Laplacian kernel would resultin power loss when the total of marginal univariate mean and variance differencesbetween X and Y is of order o(

√p). Notice that the distributions of X and Y

can differ in other aspects of the marginal distribution even if they have the samemarginal univariate mean and variance.

2, All the tests under examination are only capable of detecting the discrepancies ofmarginal distributions. If the two high dimensional distributions F 6= G, but Fu =Gu for u = 1, 2, · · · , p, then none of them have consistent power.

3.3. High Dimensional Medium Sample Size (HDMSS)

In this subsection, the theories are developed under the high dimensional medium samplesize setting (HDMSS), i.e., as p → ∞, n := n(p) → ∞ at a slower rate compared to pand n/m = ρ, where ρ ∈ (0,∞) is a fixed constant. Though the proofs are quite different,most results and phenomena under the HDLSS setting have their similar counterpartsunder the HDMSS setting. Now that we have n,m growing to infinity, we need a strongerversion of Assumptions 3 and 4.

Assumption 5. nmα2xy = o(1), n2α2

x = o(1) and m2α2y = o(1).

Assumption 6.√nmpα2

xy = o(1), n√pα2

x = o(1) and m√pα2

y = o(1).

Remark 3.5. Following Remark 3.2, for κ-dependent stationary time series, α2xy =

O(κ/p). Thus, Assumptions 5 and 6 both require that nmκ = o(p).

To derive the asymptotic distribution under the HDMSS, we note that the leadingterm of EDk

n(Z) is a martingale, the following assumption is used to ensure the condi-tional Lindeberg condition and the requirements on the conditional variance in classicmartingale central limit theorem.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 15

Assumption 7. For any Λ1,Λ2,Λ3,Λ4 ∈ {X,Y }, suppose

E[K4(Λ1,Λ

′2)]

= o(n2),

E[K2(Λ1,Λ

′′3)K2(Λ′2,Λ

′′3)]

= o(n),

E [K(Λ1,Λ′′3)K(Λ1,Λ

′′′4 )K(Λ′2,Λ

′′′4 )K(Λ′2,Λ

′′3)] = o(1),

where (Λ′1,Λ′2,Λ′3,Λ′4), (Λ′′1 ,Λ

′′2 ,Λ

′′3 ,Λ

′′4) and (Λ′′′1 ,Λ

′′′2 ,Λ

′′′3 ,Λ

′′′4 ) are independent copies of

(Λ1,Λ2,Λ3,Λ4).

Remark 3.6. For any function ϕ and ψ, suppose X and Y are κ dependent sequences,i.e., xu ⊥ xv and yu ⊥ yv if |u− v| > κ. If there exists some constant C > 0 such that

max

{supuE[ψ4(xu, yu)

], supuE[ψ4(xu, x

′u)], supuE[ψ4(yu, y

′u)]}≤ C.

Then, for notational convenience, let

φxy,u = ψ(xu, yu)− E [ψ(xu, yju)|xu]− E [ψ(xu, yu)|yu] + E [ψ(xu, yu)] ,

we see that supuE[φ4xy,u] ≤ 44C and thus E[K4(X,Y )] can be bounded as following

E[K4(X,Y )

]=

1

p2

p∑s=1

s+3κ∑t,u,v=s−3κ

E[φxy,sφxy,tφxy,uφxy,v] = O

(κ3

p

).

and similar results can be shown for E[K4(X,X ′)] and E[K4(Y, Y ′)]. Thus, Assumption7 is satisfied if κ3/p = o(1).

Let Φ denote the cdf of N(0, 1). We shall show that EDk(ΓwZ) converges uniformlywith respect to w under the HDMSS setting.

Theorem 3.3. For w = 0, 1, 2, · · · ,min{n,m}, fix Γw ∈ Sw,

(i) Under Assumptions 1 and 5,

supw

∣∣∣EDkn(ΓwZ)− µn,w

∣∣∣ = op(1).

where µn,w is the same as that in Theorem 3.1.(ii) Under Assumptions 1, 5, 6, 7 and local alternative HAl ,

supw

∣∣∣∣∣∣P(√

nmp(

EDkn(ΓwZ)− µn,w

)≤ a

)− Φ

a√nmσ2

n,w

∣∣∣∣∣∣ = o(1)

where a is a fixed constant and σ2n,w is the same as in Theorem 3.1.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

16 C. Zhu and X. Shao

Then, the following theorem states the asymptotic distribution of EDkn(ΓZ).

Theorem 3.4. For Γ ∼ Uniform(Pn+m), which is independent of the data,

(i) Under Assumptions 1 and 5,

EDkn(ΓZ)

p→ 0.

(ii) Under Assumptions 1, 5, 6, 7 and local alternative HAl ,

√nmp

(EDk

n(ΓZ)

EDkn(Γ′Z)

)d→ N

(0,

(σ2 00 σ2

))where Γ′ is an independent copy of Γ and σ2 is the asymptotic variance defined as

σ2 := 4vxy[ϕ(1)(exy)]2 + 2ρvx[ϕ(1)(ex)]2 +2

ρvy[ϕ(1)(ey)]2.

We need the limiting distribution of (EDkn(ΓZ),EDk

n(Γ′Z)) to show that the varianceof randomization distribution go to 0, from which it follows that the randomizationdistribution converges in probability to the limit of its mean. Furthermore, we can showthat the critical values are concentrating on some constants.

Corollary 3.1. Let Γ1, · · · ,ΓS be i.i.d and uniformly sampled from Pn+m.

(i) Under Assumptions 1 and 5, as n ∧m ∧ p ∧ S →∞,

R(t)p→ I{t≥0} and R(t)

p→ I{t≥0}.

Consequently, we have QR,1−αp→ 0 and QR,1−α

p→ 0.

(ii) Under Assumptions 1, 5, 6, 7 and local alternative HAl , as n ∧m ∧ p ∧ S →∞,

1

(n+m)!

(n+m)!∑i=1

I{√nmpEDkn(ΓiZ)≤t}p→ Φ(t/σ),

1

S

S∑i=1

I{√nmpEDkn(ΓiZ)≤t}p→ Φ(t/σ)

Consequently, we have√nmpQR,1−α

p→ σQΦ,1−α and√nmpQR,1−α

p→ σQΦ,1−α,

where σ2 is defined in Theorem 3.4.

The power behavior of EDkn(Z) w.r.t permutation test under the HDMSS is stated in

the following theorem.

Theorem 3.5. Suppose Assumption 1 is true and assume that 2ϕ(exy) ≥ ϕ(ex)+ϕ(ey).For any c ∈ {QR,1−α, QR,1−α}, the following holds.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 17

1, [Consistency] Under Assumption 5, we have

limp→∞

PHAc

(EDk

n(Z) > c)

= 1,

which means that the asymptotic power of EDk based permutation test is 1 asp ∧ n ∧m→∞.

2, [Trivial Power] Under Assumptions 5, 6 and 7, we have

limp→∞

PHAl

(EDk

n(Z) > c)≤ α,

Thus, we have the asymptotic power of EDk based permutation test is no more thanthe level α when p ∧ n ∧m→∞.

Comparing with Theorem 3.2, the EDkn(Z) based permutation test have trivial power

under HAl and the HDMSS setting. This is due to the interesting facts that nmσ2n,W

converges in probability to σ2, which is also the limit of nmσ2n,0 as n→∞ and

cov(

EDkn(ΓZ),EDk

n(Γ′Z))→ 0 as n→∞,

which ensures that the randomization distribution converges in probability to its meanlimit.

4. Numerical Studies

In this section, we consider several examples to demonstrate the finite sample perfor-mance of EDk based permutation test for different distance metrics. In our numericalcomparison, we include the tests of Li [22] (denoted as JL) and Biswas and Ghosh [7](denoted as BG) as these two were shown to have higher power over others in Li [22].The critical values of JL test are determined by its asymptotic distribution, whereas BGtest is also implemented as a permutation test.

4.1. Performance on simulated data

In all our simulations, we set α = 0.05 and perform 1000 Monte Carlo replications with300 permutations for each test. The first example is adopted from the simulation settingof [22] to study the size accuracy.

Example 4.1. Generate samples as

X = (V 1/2RV 1/2)1/2Z1,

Y = (V 1/2RV 1/2)1/2Z2,

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

18 C. Zhu and X. Shao

where R = (rij)pi,j=1, rij = ρ|i−j| and ρ = 0.5 or 0.8; V is a diagonal matrix with

V1/2ii = 1 or uniformly drawn from (1,5). Z1, Z2 are i.i.d copies of Z with

Z = (z1, z2, · · · , zp︸ ︷︷ ︸iid∼N(0,1)

) or Z = ( z1, z2, · · · , zp︸ ︷︷ ︸iid∼Exponential(1)

)− 1p.

In Example 4.1, X and Y follow the same distribution and we consider cases thatn = m = 50 or n = 70,m = 30. From Table 2, we can see that all the tests have quiteaccurate size. To compare the power, we first use an example from [22], which include

Table 2. Size comparison from Example 4.1 for p = 500

ρ V1/2ii n m EDL2-norm EDGaussian EDLaplacian EDL1-norm BG JL

Norm

al

0.5 1 50 50 0.06 0.06 0.058 0.059 0.053 0.0530.5 1 70 30 0.07 0.07 0.068 0.073 0.047 0.0570.5 Un(1,5) 50 50 0.052 0.052 0.05 0.051 0.056 0.0570.5 Un(1,5) 70 30 0.059 0.059 0.061 0.05 0.049 0.0450.8 1 50 50 0.053 0.053 0.052 0.059 0.054 0.0550.8 1 70 30 0.045 0.046 0.046 0.05 0.052 0.0550.8 Un(1,5) 50 50 0.045 0.045 0.049 0.048 0.054 0.0540.8 Un(1,5) 70 30 0.05 0.05 0.049 0.046 0.051 0.051

Exp

on

enti

al

0.5 1 50 50 0.06 0.06 0.058 0.059 0.053 0.0530.5 1 70 30 0.063 0.063 0.063 0.058 0.048 0.0530.5 Un(1,5) 50 50 0.057 0.057 0.058 0.055 0.049 0.060.5 Un(1,5) 70 30 0.056 0.056 0.06 0.058 0.059 0.0580.8 1 50 50 0.054 0.054 0.051 0.047 0.065 0.0620.8 1 70 30 0.061 0.061 0.062 0.065 0.057 0.060.8 Un(1,5) 50 50 0.051 0.05 0.052 0.046 0.045 0.0570.8 Un(1,5) 70 30 0.062 0.062 0.062 0.062 0.06 0.064

the situation when X and Y only differ in their means or only differ in their covariancematrices or differ in both, where β ∈ [0, 1] is the percentage of the p components thatdiffer in their distributions.

Example 4.2. Let R, V, Z1, Z2 be defined the same as in Example 4.1 and we chooseρ = 0.5 here. Generate samples as

(i)

X = (V 1/2RV 1/2)1/2Z1,

Y = (0.125× 1βp,0(1−β)p) + (V 1/2RV 1/2)1/2Z2.

(ii) Let V ∗ be a diagonal matrix with V∗1/2ii = 1.05 for i = 1, 2, · · · , βp and V

∗1/2ii = 1

for i = βp+ 1, · · · , βp.

X = (V 1/2RV 1/2)1/2Z1,

Y = (V ∗1/2RV ∗1/2)1/2Z2.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 19

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Pow

er

Example 4.2 (i)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Example 4.2 (ii)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Example 4.2 (iii)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Pow

er

Example 4.2 (i)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Example 4.2 (ii)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Example 4.2 (iii)

EDL2-norm EDGaussian EDLaplacian EDL1-norm BG JL

Figure 1: Power comparison for example 4.2 and n = 70, m = 30, p = 500, where in thetop 3 figures Z1, Z2 are generated from normal distribution and in the bottom 3 figures,Z1, Z2 are generated from exponential distribution.

(iii) Let V∗1/2ii = 1.04 for i = 1, 2, · · · , βp and V

∗1/2ii = 1 for i = βp+ 1, · · · , βp.

X = (V 1/2RV 1/2)1/2Z1,

Y = (0.1× 1βp,0(1−β)p) + (V ∗1/2RV ∗1/2)1/2Z2.

From Figure 1, we can see that (1) when there is a small difference in the means,EDk-based tests and JL perform similarly, while BG barely show any power. (2) whenthere is a small difference in the scales, JL and BG are consistent and EDk-based testshave very little power. Similar phenomenon by Li [22] were also observed, i.e., EDk basedpermutation test is not sensitive to small scale differences and the method proposed byLi [22] and Biswas and Ghosh [7] have dominant power in this case. Note that there isa tuning parameter involved in JL test and its choice could have a big impact on thesize and power; results not shown. (3) when there are differences for both the means andscales, all the tests performs comparably.

Next, Example 4.3 examines the situation when X and Y have the same marginalunivariate mean and variance, but different marginal univariate distributions.

Example 4.3. Generate samples as

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

20 C. Zhu and X. Shao

(i) Let Rademacher(0.5) be the Rademacher distribution with success probability 0.5,e.g. P (yiu = −1) = P (yiu = 1) = 0.5.

X = (x1, ..., xp)iid∼ N(0, 1),

Y = ( y1, y2, · · · , yβp︸ ︷︷ ︸iid∼Rademacher(0.5)

, yβp+1, yβp+2 · · · , yp︸ ︷︷ ︸iid∼N(0,1)

).

(ii)

X = (x1, ..., xp)iid∼ N(0, 1),

Y = ( y1, y2, · · · , yβp︸ ︷︷ ︸iid∼Uniform(−

√3,√

3)

, yβp+1, yβp+2 · · · , yp︸ ︷︷ ︸iid∼N(0,1)

).

From Figure 2, we see that only EDL1-norm based permutation test has power growingas β elevates (p fixed) or p increases (β fixed). This phenomenon matches with ourtheories, which indicate that L2-norm, Gaussian and Laplacian kernel can detect only

marginal mean and variance differences. For EDL1-norm based permutation test, the poweris growing more rapidly for Example 4.3 (i) than Example 4.3 (ii), which might suggestthat L1-distance is more sensitive for the difference between continuous and discretedistributions. It is also apparent that the JL and BG tests show little power in thisexample. The next example examines the case where X and Y have the same marginalunivariate distributions.

Example 4.4. Generate samples as

(i) Let (y′1, y′2, · · · , y′βp/2)

iid∼ Bernoulli(0.5)

X = (x1, ..., xp)iid∼ Bernoulli(0.5),

Y =(y′1, I{y′1=1}, y

′2, I{y′2=1} · · · , y′βp/2, I{y′βp/2=1}, y1, y2, · · · , y(1−β)p︸ ︷︷ ︸

iid∼Bernoulli(0.5)

).

(ii) Let (y′1, y′2, · · · , y′βp/3)

iid∼ Bernoulli(0.5) and (y′′1 , y′′2 , · · · , y′′βp/3)

iid∼ Bernoulli(0.5)

X = (x1, ..., xp)iid∼ Bernoulli(0.5),

Y =(y′1, y

′′1 , I{y′1=y′′1 }, · · · , y

′βp/3, y

′′βp/3, I{y′βp/3=y′′

βp/3}, y1, y2, · · · , y(1−β)p︸ ︷︷ ︸

iid∼Bernoulli(0.5)

).

Notice that in Example 4.4 (i) X, Y have the same marginal univariate distribution,but different marginal bivariate distributions and in Example 4.4 (ii) X, Y have the same

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 21

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Pow

er

Example 4.3 (i)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

β

Example 4.3 (ii)

EDL2-norm

EDGaussian

EDLaplacian

EDL1-norm

BG

JL

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

p

Pow

er

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

p

Figure 2: Power comparison for Example 4.3 and n = 70, m = 30. For the top two figures,the dimension p is equal to 500 and we plot the power as β ranges from 0 to 1. For thebottom two figures, β is fixed to be 0.5 and the power is plotted with respect to p.

marginal bivariate distribution, but different joint distribution. Theorem 3.2 (ii) andTheorem 3.5 (ii) both provide insights that L2-norm, L1-norm, Gaussian or Laplaciankernel based tests all suffer substantial power loss under Example 4.4 (i). On the otherhand, Theorem 3.2 (iii) suggests us that since Example 4.4 (ii) belong to class HAt , allthese tests have trivial power. The simulation results of Example 4.4 are in Figure 3 andthey again corroborate our theoretical findings.

4.2. Performance on real data

We also compare the power of the above tests on the following real data sets.

• Strawberry data: this data set contains the spectrographs of fruit purees. Thereare totally two classes: one is strawberry purees (authentic samples) and the otherone is non-strawberry purees (adulterated strawberries and other fruits). Each datapoint is of length 235.

• SmallKitchenAppliances data: this data sets contains records of the electricity usageof some kitchen appliances. We only use classes Kettle and Microwave. Each data

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

22 C. Zhu and X. Shao

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

β

Pow

er

Example 4.4 (i)

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

β

Example 4.4 (ii)

EDL2-norm

EDGaussian

EDLaplacian

EDL1-norm

BG

JL

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

p

Pow

er

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

p

Figure 3: Power comparison for Example 4.4 and n = 70, m = 30. For the top two figures,the dimension p is equal to 500 and we plot the power as β ranges from 0 to 1. For thebottom two figures, β is fixed to be 1 and the power is plotted with respect to p.

point has readings taken every 2 minutes over 24 hours.• Earthquakes data: this data set is from Northern California Earthquake Data Cen-

ter and has classes of positive and negative major earthquake events. There are 368negative and 93 positive cases and each data point is of length 512.

All the above data sets are downloaded from UCR Time Series Classification Archive[10] (https://www.cs.ucr.edu/~eamonn/time_series_data_2018/) and a glance ofthese data sets is provided in Figure 4. For each of the three data sets, the data pointshave two classes and we want to compare the underlining distributions of the two classes.Following the procedures of [7] and [24], for each m = n ∈ {10, 20, 30, 40, 50, 60}, we ran-domly sample n points from each class and test whether the two distributions are thesame using the afore-mentioned tests. The same procedure is repeated 1000 times tocalculate the power.

The experimental results for these data sets are shown in Figure 5, from which wesee that all the tests have very high power for the Strawberry data with relatively lowsample size. As for the SmallKitchenAppliances and Earthquakes data sets, the L1-normbased test demonstrates superior power compared to other tests. It is also worth notingthat BG and JL barely exhibit any power for the Earthquakes data.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 23

Str

aw

ber

ryClass 1

Class 2

Kit

chen

Ap

pli

an

ces

Eart

hqu

akes

Figure 4: A glance of the data in Section 4.2, where we plot one point from each of thetwo classes for each data set.

5. Discussions&Conclusion

In this paper, we study the two-sample hypothesis testing problem in a high dimen-sion and low/medium sample size setting. Our focus is on the interpoint distance basedpermutation tests, such as those based on Energy Distance (ED) and Maximum MeanDiscrepancy (MMD). Our theory demonstrates that all these tests under examinationare unable to detect the difference between two high dimensional distributions beyondunivariate marginal distributions. In particular, the ED test with L2-norm and MMDwith Gaussian or Laplacian kernels suffer substantial power loss under the HDLSS andhave trivial power under the HDMSS when the average of component-wise mean and vari-ance discrepancies between two distributions are both asymptotically zero at the rate ofo(1/√nmp). Thus these tests mainly target mean and variance differences in marginal

distributions. By contrast, if we use L1-norm in ED test, then the non-negligible differ-ence in marginal univariate distributions, as quantified by cumulative energy distance ofmarginal distributions, can be detected with high power. Thus the theory suggests that

1), The ED with L2-norm, and MMD with Gaussian and Laplacian kernels are of thesame category, as they all depend on the interpoint distance as measured by Euclideandistance, which leads to undesirable power limitation.

2), Although in a low dimensional setting the use of L1-norm in ED is not preferreddue to the fact that it does not completely characterize the difference between two dis-tributions since ED1(F,G) = 0 does not necessarily imply F = G, it seems to have some

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

24 C. Zhu and X. Shao

10 20 30 40 50 600

0.2

0.4

0.6

0.8

1

n = m

Pow

er

Strawberry

10 20 30 40 50 600

0.2

0.4

0.6

0.8

1

n = m

SmallKitchenAppliances

10 20 30 40 50 600

0.2

0.4

0.6

0.8

1

n = m

Earthquakes

EDL2-norm EDGaussian EDLaplacian EDL1 BG JL

Figure 5: Power comparison for real data examples in Section 4.2.

advantage over the ED with L2-norm and MMD with Gaussian and Laplacian kernelsin the high dimensional setting, as shown in both theory and numerical studies. Inter-estingly, for each fixed p, it is shown in [25] that a class of methods based on kernelembedding have better power if using L1-norm to differentiate the expectations of an-alytic kernels evaluated at some well-chosen distribution locations instead of L2-norm,since these features are dense due to the use of analytic functions.

3), As shown in our simulations and data illustration, the existing interpoint distancetest by [22] and [7] also suffer from low power when the two distributions have the samemarginal mean and variances but different marginal distributions. So in this sense, theyare also inferior to the ED test with L1-norm.

4), The difference in marginal distributions of two high dimensional distributions canbe interpreted as the main effect of the distribution differences. It is a standard statisticalpractice to test for the nullity of main effects first, before proceeding to the higher-orderinteractions. Thus we advocate the use of L1-norm based test to test for the presence ofmain differences in two high dimensional distributions.

To conclude the paper, we shall mention a few future directions. First, we are hold-ing the bandwidth parameter in Gaussian and Laplacian kernels fixed for theoreticalconvenience, and it would be interesting to relax this restriction by allowing it to bedata-dependent. Second, there might be some intrinsic difficulty of capturing all kinds ofdifferences in two high dimensional distributions with limited sample sizes, so it seemsnatural to ask whether it is possible to detect any difference beyond marginal univariatedistributions. If possible, what would be the form of the new tests? We leave these topicsfor future investigation.

Appendix A: Technical Details

A.1. Proof of Sufficient Conditions for Local Alternatives

When ψ(x, y) = (x − y)2, ϕ is strictly concave, strictly increasing on (0,+∞) (e.g. L2-norm, Gaussian kernel myltiplied by -1 and Laplacian kernel multiplied by -1), we first

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 25

note that 2exy − ex − ey = 2 limp→∞∑pu=1(E(xu)− E(yu))2/p ≥ 0 and

ϕ(exy)− ϕ(ex) + ϕ(ey)

2≥ ϕ(exy)− ϕ

(ex + ey

2

)≥ 0,

where the equality holds iff exy = ex = ey. Also, some algebra shows that

exy = ex + limp→∞

1

p

p∑u=1

(E(xu)− E(yu))2 + limp→∞

1

p

p∑u=1

(var(yu)− var(xu))

= ey + limp→∞

1

p

p∑u=1

(E(xu)− E(yu))2 + limp→∞

1

p

p∑u=1

(var(xu)− var(yu)).

Thus, in summary we have

2ϕ(exy) = ϕ(ex) + ϕ(ey)⇔ exy = ex = ey

⇔p∑

u=1

(E(xu)− E(yu))2 = o(p) and

∣∣∣∣∣p∑

u=1

(var(xu)− var(yu))

∣∣∣∣∣ = o(p).

This proves the result for HAc characterization. Next, for sufficient conditions of HAl , ifwe have

p∑u=1

(E(xu)− E(yu))2 = o( √

p√nm

)and

∣∣∣∣ p∑u=1

(var(xu)− var(yu))

∣∣∣∣ = o( √

p√nm

),

then it holds that exy = ex = ey and

E[∣∣E [ψ(X,Y )|X

]− E

[ψ(X,X ′)|X

]∣∣]≤ 2

p

√√√√ p∑u=1

E(x2u)

p∑u=1

(E(xu)− E(yu))2 +1

p

∣∣∣∣∣p∑

u=1

(var(yu)− var(xu))

∣∣∣∣∣+

1

p

√√√√ p∑u=1

(E(xu) + E(yu))2

p∑u=1

(E(xu)− E(yu))2 = o

(1

√nmp

).

For HAt , a straight forward calculation shows that

ψ(x, y)− E [ψ(x, y)|x]− E [ψ(x, y)|y] + E [ψ(x, y)] = −2(x− E(x))(y − E(y))

and vxy =∑pu,v=1 4cov(xu, xv)cov(yu, yv)/p. Thus, from Cauchy–Schwarz inequality, we

havep∑

u,v=1

(cov(xu, xv)− cov(yu, yv))2 = o(p)⇒ vxy = vx = vy.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

26 C. Zhu and X. Shao

When ψ(x, y) = |x− y|, ϕ(x) = x, the results follow from the following equality.

2ϕ(exy)− ϕ(ex)− ϕ(ey)

=2 limp→∞

1

p

p∑u=1

E [|x1u − y1u|]− limp→∞

1

p

p∑u=1

E [|x1u − x2u|]− limp→∞

1

p

p∑u=1

E [|y1u − y2u|]

= limp→∞

1

p

p∑u=1

(2E [|x1u − y1u|]− E [|x1u − x2u|]− E [|y1u − y2u|])

= limp→∞

1

p

p∑u=1

ED(Fu, Gu).

A.2. Proof of Theorem 3.1

Proof. (i) Taking a first order Taylor expansion w.r.t ϕ gives

k(Zi, Zj) = ϕ (eij) +R1(Zi, Zj),

where R1(Zi, Zj) is an operator that acts on random variables

R1(Zi, Zj) = L(Zi, Zj)

∫ 1

0

ϕ(1) (eij + vL(Zi, Zj)) dv

For each fixed permutation matrix Γw ∈ Sw

EDkn(ΓwZ) =

n+m∑i=2

i−1∑j=1

Πw,ijϕ (eij)︸ ︷︷ ︸:=µn(ΓwZ)

+

n+m∑i=2

i−1∑j=1

Πw,ijR1(Zi, Zj)︸ ︷︷ ︸:=R1(ΓwZ)

, (7)

where µn(ΓwZ) is the asymptotic mean for the permuted data and equals

2

mn

((w2 + (n− w)(m− w)

)ϕ(exy) + (n− w)wϕ(ex) + (m− w)wϕ(ey)

)− 1

n(n− 1)(2w(n− w)ϕ(exy) + (n− w)(n− w − 1)ϕ(ex) + w(w − 1)ϕ(ey))

− 1

m(m− 1)(2w(m− w)ϕ(exy) + w(w − 1)ϕ(ex) + (m− w)(m− w − 1)ϕ(ey)) .

Then, after re-arranging the terms according to the powers of w, we have

µn(ΓwZ) = (2ϕ(exy)− ϕ(ex)− ϕ(ey)) f(w),

where f(w) is a second order polynomial with respect to w

f(w) = 1−(

2m− 1

m(m− 1)+

2n− 1

n(n− 1)

)w +

(2

mn+

1

n(n− 1)+

1

m(m− 1)

)w2.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 27

For the remainder term R1(ΓwZ), notice that L(Zi, Zj)p→ 0 for any 1 ≤ i < j < n+m.

By the continuous mapping theorem, we know∫ 1

0

ϕ(1)(eij + vL(Zi, Zj))dvp→∫ 1

0

ϕ(1)(eij)dv.

Thus, it holds that R1(Zi, Zj) �p L(Zi, Zj) and R1(ΓwZ) = Op(αxy +αx +αy) = op(1).(ii) Taking a second order Taylor expansion w.r.t ϕ gives

k(Zi, Zj) = ϕ (eij) + ϕ(1) (eij)K(Zi, Zj) +W(Zi, Zj)√

p+R2(Zi, Zj),

where R2 and W are defined as

R2(Zi, Zj) = L2(Zi, Zj)

∫ 1

0

∫ 1

0

uϕ(2) (eij + uvL(Zi, Zj)) dvdu,

W(Zi, Zj) =1√p

p∑u=1

(E[ψ(ziu, zju)|ziu] + E[ψ(ziu, zju)|zju]− E[ψ(ziu, zju)]− eij) .

Accordingly, we can decompose the sample energy distance as

√p(

EDkn(ΓwZ)− µn(ΓwZ)

)=

n+m∑i=2

i−1∑j=1

Πw,ijϕ(1) (eij)K(Zi, Zj)︸ ︷︷ ︸

:=L(ΓwZ)

+

n+m∑i=2

i−1∑j=1

Πw,ijϕ(1)(eij)W(Zi, Zj)︸ ︷︷ ︸

:=Rl(ΓwZ)

+√p

n+m∑i=2

i−1∑j=1

Πw,ijR2(Zi, Zj)︸ ︷︷ ︸:=R2(ΓwZ)

. (8)

For the leading term L(ΓwZ), notice that under Assumption 2, (K(Zi, Zj))i<j convergesjointly to a multivariate normal with mean 0 and a diagonal covariance matrix. Thus,

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

28 C. Zhu and X. Shao

given a permutation matrix Γw, we are able to obtain L(ΓwZ)d→ N(0, σ2

n(ΓwZ)), where

σ2n(ΓwZ) =

4

n2(n− 1)2

{(n− w)(n− w − 1)

2vx[ϕ(1)(ex)]2

+w(w − 1)

2vy[ϕ(1)(ey)]2 + (n− w)wvxy[ϕ(1)(exy)]2

}+

4

m2(m− 1)2

{w(w − 1)

2vx[ϕ(1)(ex)]2

+(m− w)(m− w − 1)

2vy[ϕ(1)(ey)]2 + w(m− w)vxy[ϕ(1)(exy)]2

}+

4

n2m2

{(n− w)wvx[ϕ(1)(ex)]2

+ w(m− w)vy[ϕ(1)(ey)]2 + ((n− w)(m− w) + w2)vxy[ϕ(1)(exy)]2}.

By collecting terms with respect to vxy, vx, vy, we obtain

σ2n,w =

{4

nm− 4

(n+m

n2m2− n

n2(n− 1)2− m

m2(m− 1)2

)w

+ 4

(2

n2m2− 1

n2(n− 1)2− 1

m2(m− 1)2

)w2

}vxy[ϕ(1)(exy)]2

+

{2

n(n− 1)+ 2

(2n

n2m2− 2n− 1

n2(n− 1)2− 1

m2(m− 1)2

)w

− 2

(2

m2n2− 1

n2(n− 1)2− 1

m2(m− 1)2

)w2

}vx[ϕ(1)(ex)]2

+

{2

m(m− 1)+ 2

(2m

n2m2− 1

n2(n− 1)2− 2m− 1

m2(m− 1)2

)w

− 2

(2

n2m2− 1

m2(m− 1)2− 1

n2(n− 1)2

)w2

}vy[ϕ(1)(ey)]2.

We then conclude the result by showing that the remainder terms are negligible.Rl(ΓwZ) =op(1) is proved in lemma A.3. For the R2(ΓwZ) term, it can be shown similarly thatR2(Zi, Zj) �p L2(Zi, Zj) = Op(α

2xy+α2

x+α2y), which implies thatR2(ΓwZ) = Op(

√p(α2

xy+α2x + α2

y)) = op(1) under Assumption 4.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 29

A.3. Proof of Proposition 3.1

Proof. (i) From Equation 7, we obtain

EDkn(ΓZ) =

n+m∑i=2

i−1∑j=1

Πijϕ (eij)︸ ︷︷ ︸:=µn,W

+op(1),

where Πij corresponds to Γ and W = N(Γ) ∼ Hypergeometric(m+ n,m, n).(ii) It follows from Equation 8, Lemma A.3 and the proof of Theorem 3.1 that

√p(

EDkn(ΓZ)− µn(ΓZ)

)=n+m∑i=2

i−1∑j=1

Πijϕ(1) (eij)K(Zi, Zj)︸ ︷︷ ︸

:=L(ΓZ)

+op(1).

Then, under Assumption 2, it is not hard to see that L(ΓZ)d→ N(0, σ2

n,W ), whereW = N(Γ) ∼ Hypergeometric(m+ n,m, n). This concludes the proposition.

A.4. Proof of Theorem 3.2

1, For any a ∈ R(n+m)!, we define the α-th quantile of the set {a1, · · · , a(n+m)!} as

Q1−α{a1, · · · , a(n+m)!

}= min

ai :1

(n+m)!

(n+m)!∑i=1

I{ai≤t} ≥ 1− α

.

Then, we can view Q1−α as a continuous function on R(n+m)!.

(i) By Theorem 3.1, for any fixed Γi ∈ Pn+m, we have EDkn(ΓiZ)

p→ µn(ΓiZ). Thecontinuous mapping theorem implies

Q1−α

{EDk

n(Γ1Z), · · · ,EDkn(Γ(n+m)!Z)

}p→ Q1−α

{µn(Γ1Z), · · · , µn(Γ(n+m)!Z)

}.

Then, it follows from the definition of µn(·) in Theorem 3.1 that

µn(Z) = µn(Γ0Z) = max{µn(Γ1Z), · · · , µn(Γ(n+m)!Z)

}.

Notice that{n!m!

(n+m)! < 1− α if m 6= n,2(n!)2

(2n)! < 1− α if m = n,implies

{ |S0|(n+m)! < 1− α if m 6= n,|S0|+|Smin{n,m}|

(n+m)! < 1− α if m = n,

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

30 C. Zhu and X. Shao

and so µn(Γ0Z) > Q1−α{µn(Γ1Z), · · · , µn(Γ(n+m)!Z)

}. Thus, as p→∞, we conclude

P(

EDkn(Z) > Q1−α

{EDk

n(Γ1Z), · · · ,EDkn(Γ(n+m)!Z)

})→ P

(µn(Γ0Z) > Q1−α

{µn(Γ1Z), · · · , µn(Γ(n+m)!Z)

})= 1.

(ii) For any random permutation matrix Γs, by Proposition 3.1, we have EDkn(ΓsZ)

p→µn,Ws

, where Ws = N(Γs) ∼ Hypergeometric(n + m,m, n). Then, the continuous map-ping theorem implies that

P(

EDkn(Z) > Q1−α

{EDk

n(Z),EDkn(Γ1Z), · · · ,EDk

n(ΓS−1Z)})

→ P(µn,0 > Q1−α

{µn,0, µn,W1

, · · · , µn,WS−1

}).

Since µn,0 = maxw µn,w, in order to have µn,0 = Q1−α{µn,0, µn,W1

, · · · , µn,WS−1

}, at

least bαSc+ 1 elements of{µn,0, µn,W1 , · · · , µn,WS−1

}should be equal to µn,0. Thus, we

get

P(µn,0 > Q1−α

{µn,0, µn,W1

, · · · , µn,WS−1

})= 1− P

(µn,0 = Q1−α

{µn,0, µn,W1

, · · · , µn,WS−1

})≥

{1− S−1

bαScn!m!

(n+m)! , if n 6= m,

1− S−1bαSc

2(n!)2

(n+m)! , if n = m.

2, (i) Since µn(ΓuZ) = 0 for all u = 1, 2, · · · , (n + m)! under HAl , Assumption 2implies that

√pEDk

n(ΓuZ)d→

n∑i=1

m∑j=1

Πu,ijbij −∑

1≤i<j≤n

Πu,ijcij −∑

1≤i<j≤m

Πu,ijdij ,

where Πu,ij corresponds to Γu. Then, the continuous mapping theorem entails

P(

EDkn(Z) > Q1−α

{EDk

n(Γ1Z), · · · ,EDkn(Γ(n+m)!Z)

})= P

(√pEDk

n(Z) > Q1−α

{√pEDk

n(Γ1Z), · · · ,√pEDkn(Γ(n+m)!Z)

})→ P (V (Γ0) > QT ,1−α).

(ii) Conditioned on Γ1,Γ2, · · · ,ΓS−1, the result can be shown similarly with part(i).Then, since the number of permutations is fixed and finite, the unconditioned versionfollows straightforwardly.

3, (i) By construction, we have

1

(n+m)!

(n+m)!∑u=1

I{V (Γu)>Q1−α{V (Γ1),··· ,V (Γ(n+m)!)}} ≤ α.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 31

If ϕ′(exy) = ϕ′(ex) = ϕ′(ey) and vxy = vx = vy, then V (Γu) =d V (Γ0) for any u =1, 2, · · · , (n+m)! and so(V (Γu), Q1−α

{V (Γ1), · · · , V (Γ(n+m)!)

})=d(V (Γ0), Q1−α

{V (Γ1), · · · , V (Γ(n+m)!)

}).

Thus, we have

P(V (Γ0) > Q1−α

{V (Γ1), · · · , V (Γ(n+m)!)

})=

E

1

(n+m)!

(n+m)!∑u=1

I{V (Γu)>Q1−α{V (Γ1),··· ,V (Γ(n+m)!)}}

≤ α.(ii) The proof follows similarly from part (i) by observing that for any s = 1, 2, · · · , S

(V (Γs), Q1−α {V (Γ0), V (Γ1), · · · , V (ΓS−1)})=d (V (Γ0), Q1−α {V (Γ0), V (Γ1), · · · , V (ΓS−1)}) .

A.5. Proof of Theorem 3.3

(i) Recall that for a fixed permutation matrix Γw ∈ Sw that corresponds to Πw,ij ,

EDkn(ΓwZ) =

n+m∑i=2

i−1∑j=1

Πw,ijϕ (eij)︸ ︷︷ ︸:=µn(ΓwZ)

+

n+m∑i=2

i−1∑j=1

Πw,ijR1(Zi, Zj)︸ ︷︷ ︸:=R1(ΓwZ)

.

Under the HDMSS setting, part (i) follows from Lemma A.1.

Lemma A.1. Under Assumption 5, supΓ |R1(ΓZ)| = op(1).

Proof. Consider the events BXY, BX, BY and their complements BcXY, BcX, B

cY, where

BXY =

{min

1≤s≤n,1≤t≤mL(Xs, Yt) ≤ −

1

2exy or max

1≤s≤n,1≤t≤mL(Xs, Yt) ≥

1

2exy

},

BX =

{min

1≤s6=t≤nL(Xs, Xt) ≤ −

1

2ex or max

1≤s6=t≤nL(Xs, Xt) ≥

1

2ex

},

BY =

{min

1≤s 6=t≤mL(Ys, Yt) ≤ −

1

2ey or max

1≤s6=t≤mL(Ys, Yt) ≥

1

2ey

}.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

32 C. Zhu and X. Shao

Then, under assumption 5, as n ∧m ∧ p→∞

P (BXY) = P

⋃1≤s≤n,1≤t≤m

{L(Xs, Yt) ≤ −

1

2exy or L(Xs, Yt) ≥

1

2exy

}≤

∑1≤s≤n,1≤t≤m

P

(|L(Xs, Yt)| ≥

1

2exy

)

≤ nmP(|L(X,Y )| ≥ 1

2exy

)≤

4nmE[L(X,Y )2

]e2xy

= o(1).

Similarly, we can show that P (BX) = o(1) and P (BY) = o(1). Conditioned on eventBcXYB

cXB

cY, we have eij ≤ eij + vL(Zi, Zj) ≤ 3eij/2 for any 0 ≤ v ≤ 1. Suppose

ϕ(1)(·) is a continuous function on (0,+∞), we know there exist a constant C such that|ϕ(1)(eij + vL(Zi, Zj))| ≤ C and consequently, we have

|R1(Zi, Zj)| ≤ C ′|L(Zi, Zj)|,

where C ′ is a constant depends only on ϕ, exy, ex and ey. Let Πij corresponds to Γ,

supΓ

∣∣∣∣∣∣n+m∑i=2

i−1∑j=1

ΠijR1(Zi, Zj)

∣∣∣∣∣∣≤ sup

Γ

∣∣∣∣∣∣n+m∑i=2

i−1∑j=1

ΠijR1(Zi, Zj){IBcXY

IBcXIBcY + IBXY+ IBX

+ IBY

}∣∣∣∣∣∣≤ sup

Γ

n+m∑i=2

i−1∑j=1

|ΠijR1(Zi, Zj)|IBcXYIBcXIBcY + op(1)

≤n+m∑i=2

i−1∑j=1

C ′′

nm|R1(Zi, Zj)|IBcXY

IBcXIBcY + op(1),

where C ′′ is a constant depends only on ρ. Then, for any ε > 0, by Markov’s inequality

P

1

mn

n+m∑i=2

i−1∑j=1

|R1(Zi, Zj)|IBcXYIBcXIBcY > ε

≤ 1

εmn

n+m∑i=2

i−1∑j=1

E[|R1(Zi, Zj)|IBcXYIBcXIBcY ] ≤ C ′′′ (αxy + αx + αy)

ε,

where C ′′′ is a constant depends only on ρ, ϕ, exy, ex and ey.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 33

(ii) Similar to the HDLSS setting, we consider the following decomposition

√nmpEDk

n(ΓwZ) =√nmpµn(ΓwZ) +

√nm

n+m∑i=2

i−1∑j=1

Πw,ijϕ(1) (eij)K(Zi, Zj)︸ ︷︷ ︸

:=√nmL(ΓwZ)

+√nm

n+m∑i=2

i−1∑j=1

Πw,ijϕ(1)(eij)W(Zi, Zj)︸ ︷︷ ︸

:=√nmRl(ΓwZ)

+√nmp

n+m∑i=2

i−1∑j=1

Πw,ijR2(Zi, Zj)︸ ︷︷ ︸:=√nmR2(ΓwZ)

.

Next, for any −∞ < a < ∞ and ε > 0, using the inequality P (X ≤ a) ≤ P (Y ≤a+ ε) + P (|X − Y | > ε), we can show that

P (√nmL(ΓwZ) ≤ a− ε)− P (|

√nmRl(ΓwZ) +

√nmR2(ΓwZ)| > ε)

≤ P (√nmp[EDk

n(ΓwZ)− µn(ΓwZ)] ≤ a)

≤ P (√nmL(ΓwZ) ≤ a+ ε) + P (|

√nmRl(ΓwZ) +

√nmR2(ΓwZ)| > ε).

Then, some algebra shows that

supw

∣∣∣P (√nmp[EDk

n(ΓwZ)− µn(ΓwZ)] ≤ a)− Φ(a/√nmσ2

n(ΓwZ))∣∣∣

≤ supw

∣∣∣P (√nmL(ΓwZ) ≤ a− ε)− Φ

((a− ε)/

√nmσ2

n(ΓwZ))∣∣∣

+ supw

∣∣∣Φ(a/√nmσ2n(ΓwZ)

)− Φ

((a− ε)/

√nmσ2

n(ΓwZ))∣∣∣

+ supw

∣∣∣P (√nmL(ΓwZ) ≤ a+ ε)− Φ

((a+ ε)/

√nmσ2

n(ΓwZ))∣∣∣

+ supw

∣∣∣Φ(a/√nmσ2n(ΓwZ)

)− Φ

((a+ ε)/

√nmσ2

n(ΓwZ))∣∣∣

+2 supwP (|√nmRl(ΓwZ) +

√nmR2(ΓwZ)| > ε).

Next, by Lemma A.2, A.3 and A.4. the right hand side can be made arbitrarily small byfirst choose ε small enough, then n,m, p large enough.

Lemma A.2. Under Assumption 5 and 6, supΓ |√nmR2(ΓZ)| = op(1).

Proof. The proof is similar with Lemma A.1 by observing that conditioned on eventBcXYB

cXB

cY, it holds for some constant C that |R2(Zi, Zj)| ≤ C

∣∣L2(Zi, Zj)∣∣ .

Lemma A.3. Under HAl , supΓ∈Pn+m|√nmRl(ΓZ)| = op(1).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

34 C. Zhu and X. Shao

Proof. For any fixed permutaiton marix Γ ∈ Pn+m, we have

√nmRl(ΓZ) =

√nmp

n+m∑i=2

i−1∑j=1

Πijϕ(1) (eij)

(E[ψ(Zi, Zj)|Zi

]+ E

[ψ(Zi, Zj)|Zj

])−√nmp

n+m∑i=2

i−1∑j=1

Πijϕ(1) (eij)

(E[ψ(Zi, Zj)

]+ eij

).

Let w = N(Γ), similar to the computation of µn,w, we obtain

√nmp

n+m∑i=2

i−1∑j=1

Πijϕ(1) (eij)E

[ψ(Zi, Zj)

]=√nmp

{2ϕ(1) (exy)E

[ψ(X,Y )

]− ϕ(1) (ex)E

[ψ(X,X ′)

]− ϕ(1) (ey)E

[ψ(Y, Y ′)

]}f(w),

where the right hand side is of order op(1) under HAl . Let π corresponds to Γ, then foreach 1 ≤ i ≤ n such that 1 ≤ π(i) ≤ n, it follows from the definition of Πij that

1

4

j 6=i∑1≤j≤n+m

Πijϕ(1) (eij)E

[ψ(Xi, Zj)|Xi

]=− (n− w − 1)

n(n− 1)ϕ(1) (ex)E

[ψ(Xi, X)|Xi

]− w

n(n− 1)ϕ(1) (exy)E

[ψ(Xi, Y )|Xi

]+

w

nmϕ(1) (ex)E

[ψ(Xi, X)|Xi

]+m− wnm

ϕ(1) (exy)E[ψ(Xi, Y )|Xi

]=

(1

n− w

nm− w

n(n− 1)

){ϕ(1) (exy)E

[ψ(Xi, Y )|Xi

]− ϕ(1)(ex)E

[ψ(Xi, X)|Xi

]},

which entails

supΓ

∣∣∣∣∣∣14j 6=i∑

1≤j≤n+m

Πijϕ(1) (eij)E

[ψ(Xi, Zj)|Xi

]∣∣∣∣∣∣ ≤C√nm

∣∣∣ϕ(1) (exy)E[ψ(Xi, Y )|Xi

]− ϕ(1)(ex)E

[ψ(Xi, X)|Xi

]∣∣∣ ,where C is a constant that only depends on ρ. Using the same approach, the above boundcan be shown to hold for each 1 ≤ i ≤ n such that n + 1 ≤ π(i) ≤ n + m. Similarly, wecan show that for each n+ 1 ≤ i ≤ n+m,

supΓ

∣∣∣∣∣∣14j 6=i∑

1≤j≤n+m

Πijϕ(1)(eij)E

[ψ(Yi, Zj)|Yi

]∣∣∣∣∣∣≤ C√

nm

∣∣∣ϕ(1)(exy)E[ψ(Yi, X)|Yi

]− ϕ(1)(ey)E

[ψ(Yi, Y )|Yi

]∣∣∣ .imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 35

Consequently, the following bound holds

supΓ

∣∣∣∣∣∣√nmpn+m∑i=2

i−1∑j=1

Πijϕ(1)(eij)

(E[ψ(Zi, Zj)|Zi

]+ E

[ψ(Zi, Zj)|Zj

])∣∣∣∣∣∣≤ C ′√p

n∑i=1

∣∣∣ϕ(1)(exy)E[ψ(Xi, Y )|Xi

]− ϕ(1)(ex)E

[ψ(Xi, X)|Xi

]∣∣∣+ C ′

√p

n+m∑i=n+1

∣∣∣ϕ(1)(exy)E[ψ(Yi, X)|Yi

]− ϕ(1)(ey)E

[ψ(Yi, Y )|Yi

]∣∣∣ ,where C ′ is a constant. Finally, an application of Markov’s inequality shows that theright hand side is of order op(1) under HAl .

Lemma A.4. Under Assumptions 1. Let Γ1,Γ2 ∈ Pn+m. Then, for any constantsa1, a2, b, we have

supΓ1,Γ2

∣∣∣∣P (a1

√nmL(Γ1Z) + a2

√nmL(Γ2Z) ≤ b)− Φ

(b

ηa1,a2(Γ1,Γ2)

)∣∣∣∣≤ C

{max

Λ1,Λ2∈{X,Y }E[K4(Λ1,Λ

′2)]/n2

+ maxΛ1,Λ2,Λ3∈{X,Y }

E[K2(Λ1,Λ

′′3)K2(Λ′2,Λ

′′3)]/n

+ maxΛ1,Λ2,Λ3,Λ4∈{X,Y }

E [K(Λ1,Λ′′3)K(Λ1,Λ

′′′4 )K(Λ′2,Λ

′′′4 )K(Λ′2,Λ

′′3)]

+ (vx − var[K(X,X ′)])2 + (vy − var[K(Y, Y ′)])2 + (vxy − var[K(X,Y )])2

}1/5

,

where C is a constant depend on ϕ, ρ, ex, ey and exy only; ηa1,a2(Γ1,Γ2) is defined as

[ηa1,a2(Γ1,Γ2)]2 = nm

n+m∑i=2

i−1∑j=1

(a1Π1,ij + a2Π2,ij)2[ϕ(1)(eij)]

2vij ,

where Π1,ij ,Π2,ij correspond to Γ1,Γ2 respectively and

vij =

vx, if 1 ≤ i, j ≤ n,vy, if n+ 1 ≤ i, j ≤ n+m,vxy, otherwise.

Proof. Notice that

√nm (a1L(Γ1Z) + a2L(Γ2Z)) =

√nm

n+m∑i=2

i−1∑j=1

(a1Π1,ij + a2Π2,ij)ϕ(1)(eij)K(Zi, Zj).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

36 C. Zhu and X. Shao

Then, for notational convenience, set

H(Zi, Zj) =√nm(a1Π1,ij + a2Π2,ij)ϕ

(1)(eij)K(Zi, Zj),

and Sn+m,l =∑li=2 ξn+m,i, where ξn+m,i =

∑i−1j=1H(Zi, Zj). Next, let

Fn+m,l = σ(Z1, Z2 · · · , Zl)

be the σ-algebra generated by Z1, · · · , Zl, we have {Sn+m,l,Fn+m,l, 1 ≤ l ≤ n + m} isa martingale array and thus we can apply the Berry-Esseen type bound for martingalesequences [Theorem 1 of [15]]. By setting m = 0 and δ = 1 in Theorem 1 of [15], wecompute

n+m∑i=2

E[ξ2n+m,i

], var

[n+m∑i=2

E[ξ2n+m,i

∣∣Fn+m,i−1

]]and

n+m∑i=2

E[ξ4n+m,i

]Firstly, due to the property of double centering, E[H(Zi, Zj)H(Zi′ , Zj′)] 6= 0 only when{i, j} = {i′, j′}. Then, let ηa1,a2(Γ1,Γ2) be in Theorem 1 of [15]

η2a1,a2(Γ1,Γ2)

nm:= lim

p→∞

∑n+mi=2 E

[ξ2n+m,i

]nm

=

n+m∑i=2

i−1∑j=1

(a1Π1,ij + a2Π2,ij)2[ϕ(1)(eij)]

2vij .

Then, to calculate the variance, notice that

var

[n+m∑i=2

E[ξ2n+m,i

∣∣Fn+m,i−1

]]=

n+m∑i1,i2=2

i1−1∑j1,j2=1

i2−1∑j3,j4=1

Θ(i1, i2; j1, j2, j3, j4),

where Θ(i1, i2; j1, j2, j3, j4) is defined as

Θ(i1, i2; j1, j2, j3, j4) =

cov [E [H(Zi1 , Zj1)H(Zi1 , Zj2)|Zj1 , Zj2 ] , E [H(Zi2 , Zj3)H(Zi2 , Zj4)|Zj3 , Zj4 ]] .

Next, for any 1 ≤ j1, j2 ≤ n+m, Λ ∈ {X,Y }, denote

GΛ(Zj1 , Zj2) = E[K(Λ, Zj1)K(Λ, Zj2)|Zj1 , Zj2 ].

To bound each Θ(i1, i2; j1, j2, j3, j4), we need to study the covariance between GΛ1(Zj1 , Zj2)

and GΛ2(Zj′1 , Zj′2).

Lemma A.5. Then, for any 1 ≤ j1, j2, j′1, j′2 ≤ n+m, Λ1,Λ2 ∈ {X,Y }, we have

cov[GΛ1

(Zj1 , Zj2),GΛ2(Zj′1 , Zj′2)

]

=

E[K2(Λ1, Zj1)K2(Λ′2, Zj1)

]− E

[K2(Λ1, Zj1)

]E[K2(Λ2, Zj1)

], j1 = j2 = j′1 = j′2;

E [K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj2)K(Λ′2, Zj1)] , j1 = j′1 6= j2 = j′2;

E [K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj2)K(Λ′2, Zj1)] , j1 = j′2 6= j2 = j′1;

0, otherwise.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 37

Proof. If j1 = j′2 6= j2 = j′1,

E[GΛ1(Zj1 , Zj2)GΛ2(Zj′1 , Zj′2)

]=E

[E [K(Λ1, Zj1)K(Λ1, Zj2)|Zj1 , Zj2 ]E

[K(Λ′2, Zj′1)K(Λ′2, Zj′2)

∣∣Zj1 , Zj2]]=E [E [K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj2)K(Λ′2, Zj1)|Zj1 , Zj2 ]]

=E [K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj2)K(Λ′2, Zj1)]

It can be shown similarly for cases j1 = j2 = j3 = j4 and j1 = j′1 6= j2 = j′2. Next, weshow that for other cases, the covariance is 0. We take j1 = j′1, j1 6= j2, j1 6= j′2, j2 6= j′2as an example

E[GΛ1(Zj1 , Zj2)GΛ2(Zj′1 , Zj′2)

]=E

[E [K(Λ1, Zj1)K(Λ1, Zj2)|Zj1 , Zj2 ]E

[K(Λ′2, Zj1)K(Λ′2, Zj′2)

∣∣Zj1 , Zj′2]]=E

[E[K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj1)K(Λ′2, Zj′2)

∣∣Zj1 , Zj2 , Zj′2]]=E

[K(Λ1, Zj1)K(Λ1, Zj2)K(Λ′2, Zj1)K(Λ′2, Zj′2)

]=E

[K(Λ1, Zj1)K(Λ′2, Zj1)E [K(Λ1, Zj2)|Λ1,Λ

′2, Zj1 ]E

[K(Λ′2, Zj′2)

∣∣Λ1,Λ′2, Zj1

]]=0.

Next, we can bound var[∑n+m

i=2 E[ξ2n+m,i

∣∣Fn+m,i−1

] ]as

n+m∑i1,i2=1

i1−1∑j1,j2=1

i2−1∑j3,j4=1

Θ(i1, i2; j1, j2, j3, j4)

=

n+m∑i=1

{i−1∑j=1

Θ(i, i; j, j, j, j) + 2∑

1≤j1 6=j2≤i−1

Θ(i, i; j1, j2, j1, j2)

}

+ 2∑

1≤i1<i2≤n+m

{i1−1∑j=1

Θ(i1, i2; j, j, j, j) + 2∑

1≤j1 6=j2≤i1−1

Θ(i1, i2; j1, j2, j1, j2)

}

=O(

maxΛ1,Λ2,Λ3∈{X,Y }

E[K2(Λ1,Λ

′′3)K2(Λ′2,Λ

′′3)]/n

+ maxΛ1,Λ2,Λ3,Λ4∈{X,Y }

E [K(Λ1,Λ′′3)K(Λ1,Λ

′′′4 )K(Λ′2,Λ

′′′4 )K(Λ′2,Λ

′′3)])

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

38 C. Zhu and X. Shao

Finally, to find the upper bound of∑n+mi=2 E

(ξ4n+m,i

),

n+m∑i=2

E(ξ4n+m,i

)=

n+m∑i=2

i−1∑j1,j2,j3,j4=1

E [H(Zi, Zj1)H(Zi, Zj2)H(Zi, Zj3)H(Zi, Zj4)]

=

n+m∑i=2

i−1∑j=1

E[H4(Zi, Zj)

]+ 6

n+m∑i=2

∑1≤j1<j2≤i−1

E[H2(Zi, Zj1)H2(Zi, Zj2)

]=O

(max

Λ1,Λ2∈{X,Y }E[K4(Λ1,Λ

′2)]/n2

)+O

(max

Λ1,Λ2,Λ3∈{X,Y }E[K2(Λ1,Λ

′′3)K2(Λ′2,Λ

′′3)]/n

).

Combining the above bounds, the lemma is a consequence of Theorem 1 in [15].

A.6. Proof of Theorem 3.4

(i) For a random permutation matrix Γ ∼ Uniform(Pn+m), it follows from Lemma A.1that

EDkn(ΓZ) = µn(ΓZ) +R1(ΓZ) = (2ϕ (exy)− ϕ (ex)− ϕ (ey)) f(W ) + op(1)

where W = N(Γ) ∼ Hypergeometric(m + n,m, n). From the normal limit of hypergeo-metric distribution [20], we know that

W√nm

p→√ρ

1 + ρ.

Next, some algebra shows that f(W )p→ 0 and so the result is proved.

(ii) Recall that we can decompose the sample energy distance as

√nmpEDk

n(ΓZ) =√nmpµn(ΓZ) +

√nm

n+m∑i=2

i−1∑j=1

Πijϕ(1) (eij)K(Zi, Zj)︸ ︷︷ ︸

:=√nmL(ΓZ)

+√nm

n+m∑i=2

i−1∑j=1

Πijϕ(1)(eij)W(Zi, Zj)︸ ︷︷ ︸

:=√nmRl(ΓZ)

+√nmp

n+m∑i=2

i−1∑j=1

ΠijR2(Zi, Zj)︸ ︷︷ ︸:=√nmR2(ΓZ)

.

The result is a consequence of Lemma A.3, A.2 and A.6.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 39

Lemma A.6. Under Assumptions 1 and 7,

√nm

(L(ΓZ)L(Γ′Z)

)d→ N

(0,

(σ2 00 σ2

))where Γ′ is independent copy of Γ and σ2 is the asymptotic variance defined as

σ2 := 4vxy[ϕ(1)(exy)]2 + 2ρvx[ϕ(1)(ex)]2 +2

ρvy[ϕ(1)(ey)]2.

Proof. We apply the Cramer-Wold device. For any constants a1, a2, we have

η2a1,a2(Γ,Γ′) = nm

n+m∑i=2

i−1∑j=1

(a1Πij + a2Π′ij)

2[ϕ(1)(eij)]2vij

= nm

n+m∑i=2

i−1∑j=1

(a2

1Π2ij + a2

2(Π′ij)2 + 2a1a2ΠijΠ

′ij

)[ϕ(1)(eij)]

2vij .

Notice that for {i1, j1} ∩ {i2, j2} = ∅, it can be shown that E[Πi1j1Πi2j2 ] = O(1/n5).Then, denote cij = 2a1a2[ϕ(1)(eij)]

2vij , we have

E

nm n+m∑

i=2

i−1∑j=1

cijΠijΠ′ij

2 =n2m2

n+m∑i=2

i−1∑j1=1

i−1∑j2=1

cij1cij2E2[Πij1Πij2 ]

+ 2n2m2∑

2≤i1<i2≤n+m

i1−1∑j=1

ci1jci2jE2[Πi1jΠi2j ]

+ n2m2∑

2≤i1 6=i2≤n+m

∑j1 6=j2

ci1j1ci2j2E2[Πi1j1Πi2j2 ]

=O(1/n).

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

40 C. Zhu and X. Shao

In addition, let W = N(Γ), we obtain

σ2n(ΓZ) :=

n+m∑i=2

i−1∑j=1

Π2ij [ϕ

(1)(eij)]2vij

=

{4

nm− 4

(n+m

n2m2− n

n2(n− 1)2− m

m2(m− 1)2

)W

+ 4

(2

n2m2− 1

n2(n− 1)2− 1

m2(m− 1)2

)W 2

}vxy[ϕ(1)(exy)]2

+

{2

n(n− 1)+ 2

(2n

n2m2− 2n− 1

n2(n− 1)2− 1

m2(m− 1)2

)W

− 2

(2

m2n2− 1

n2(n− 1)2− 1

m2(m− 1)2

)W 2

}vx[ϕ(1)(ex)]2

+

{2

m(m− 1)+ 2

(2m

n2m2− 1

n2(n− 1)2− 2m− 1

m2(m− 1)2

)W

− 2

(2

n2m2− 1

m2(m− 1)2− 1

n2(n− 1)2

)W 2

}vy[ϕ(1)(ey)]2.

Since W/√nm

p→ √ρ/(1 + ρ), some algebra shows that

σ2n(ΓZ) :=

n+m∑i=2

i−1∑j=1

Π2ij [ϕ

(1)(eij)]2vij

p→ σ2,

which entails that η2a1,a2(Γ,Γ′)

p→ a21σ

2 + a22σ

2. Since |Φ(·)| ≤ 1, we have

E

[∣∣∣∣∣Φ(

b

ηa1,a2(Γ,Γ′)

)− Φ

(b√

a21σ

2 + a22σ

2

)∣∣∣∣∣]→ 0.

Next, by a simple triangle inequality∣∣∣∣∣P (a1

√nmL(ΓZ) + a2

√nmL(Γ′Z) ≤ b

)− Φ

(b√

a21σ

2 + a22σ

2

)∣∣∣∣∣ ≤∣∣∣∣∣∣P (a1

√nmL(ΓZ) + a2

√nmL(Γ′Z) ≤ b

)− Φ

b√η2a1,a2(Γ,Γ′)

∣∣∣∣∣∣+

∣∣∣∣∣∣Φ b√

η2a1,a2(Γ,Γ′)

− Φ

(b√

a21σ

2 + a22σ

2

)∣∣∣∣∣∣ .Taking expectation with respect to Γ,Γ′ on both sides, then it follows from Lemma A.4

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 41

and Assumption 7 that∣∣∣∣∣P (a1

√nmL(ΓZ) + a2

√nmL(Γ′Z) ≤ b

)− Φ

(b√

a21σ

2 + a22σ

2

)∣∣∣∣∣ ≤E

∣∣∣∣∣∣P (a1

√nmL(ΓZ) + a2

√nmL(Γ′Z) ≤ b

)− Φ

b√η2a1,a2(Γ,Γ′)

∣∣∣∣∣∣

+ E

∣∣∣∣∣∣Φ b√

η2a1,a2(Γ,Γ′)

− Φ

(b√

a21σ

2 + a22σ

2

)∣∣∣∣∣∣ = o(1).

A.7. Proof of Corollary 3.1

By using Theorem 15.2.3 of [21], the result is a consequence of Theorem 3.4.

A.8. Proof of Theorem 3.5

(i) By Corollary 3.1 and Theorem 3.3

Power = PHAc

(EDk

n(Z) > c)→ P (2ϕ(exy)− ϕ(ex)− ϕ(ey) > 0) = 1.

(ii) By Corollary 3.1 and Theorem 3.3

Power = PHAl

(EDk

n(Z) > c)

= PHAl

(√nmpEDk

n(Z) >√nmpc

)→ P

(N(0, σ2) > σQΦ,1−α

)= α.

Acknowledgements

We would like to thank Dr. Jun Li for providing the code used in [22]. We are also gratefulto the three reviewers for their very helpful comments. The partial support from a USNSF grant is gratefully acknowledged.

References

[1] Alvarez-Esteban, P. C., Del Barrio, E., Cuesta-Albertos, J. A. and Matran, C. [2008],‘Trimmed comparison of distributions’, Journal of the American Statistical Associ-ation 103(482), 697–704.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

42 C. Zhu and X. Shao

[2] Alvarez-Esteban, P. C., Del Barrio, E., Cuesta-Albertos, J. A., Matran, C. et al.[2012], ‘Similarity of samples and trimming’, Bernoulli 18(2), 606–634.

[3] Anderson, T. W. and Darling, D. A. [1952], ‘Asymptotic theory of certain ”goodnessof fit” criteria based on stochastic processes’, The Annals of Mathematical Statistics23(2), 193–212.

[4] Aoshima, M., Shen, D., Shen, H., Yata, K., Zhou, Y.-H. and Marron, J. [2018], ‘Asurvey of high dimension low sample size asymptotics’, Australian & New ZealandJournal of Statistics 60(1), 4–19.

[5] Bickel, P. J. [1969], ‘A distribution free version of the Smirnov two sample test inthe p-variate case’, The Annals of Mathematical Statistics 40(1), 1–23.

[6] Bickel, P. J. and Breiman, L. [1983], ‘Sums of functions of nearest neighbor distances,moment bounds, limit theorems and a goodness of fit test’, The Annals of Probability11(1), 185–214.

[7] Biswas, M. and Ghosh, A. K. [2014], ‘A nonparametric two-sample test applicableto high dimensional data’, Journal of Multivariate Analysis 123, 160–171.

[8] Chakraborty, S. and Zhang, X. [2019], ‘A new framework for distance and kernel-based metrics in high dimensions’, arXiv preprint arXiv:1909.13469 .

[9] Cramer, H. [1928], ‘On the composition of elementary errors: First paper: Mathe-matical deductions’, Scandinavian Actuarial Journal 1928(1), 13–74.

[10] Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S.,Ratanamahatana, C. A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A. andBatista, G. [2018], ‘The ucr time series classification archive’. https://www.cs.

ucr.edu/~eamonn/time_series_data_2018/.[11] Freitag, G., Czado, C. and Munk, A. [2007], ‘A nonparametric test for similarity of

marginals with applications to the assessment of population bioequivalence’, Journalof Statistical Planning and Inference 137(3), 697–711.

[12] Friedman, J. H. and Rafsky, L. C. [1979], ‘Multivariate generalizations of the wald-wolfowitz and smirnov two-sample tests’, The Annals of Statistics 7(4), 697–717.

[13] Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B. and Smola, A. [2012], ‘Akernel two-sample test’, Journal of Machine Learning Research 13(Mar), 723–773.

[14] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Scholkopf, B. and Smola, A. J.[2008], A kernel statistical test of independence, in ‘Advances in Neural InformationProcessing Systems’, pp. 585–592.

[15] Hall, P. and Heyde, C. C. [1981], ‘Rates of convergence in the martingale centrallimit theorem’, The Annals of Probability 9(3), 395–404.

[16] Hall, P., Marron, J. S. and Neeman, A. [2005], ‘Geometric representation of highdimension, low sample size data’, Journal of the Royal Statistical Society: Series B(Statistical Methodology) 67(3), 427–444.

[17] Henze, N. [1988], ‘A multivariate two-sample test based on the number of nearestneighbor type coincidences’, The Annals of Statistics 16(2), 772–783.

[18] Klebanov, L. B., Benes, V. and Saxl, I. [2005], N-distances and Their Applications,Charles University in Prague, the Karolinum Press.

[19] Kolmogorov, A. N. [1933], Sulla determinazione empirica di una legge di dis-tribuzione, NA.

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020

Two Sample Tests in High Dimension 43

[20] Lahiri, S. N., Chatterjee, A. and Maiti, T. [2006], ‘A sub-gaussian berry-esseentheorem for the hypergeometric distribution’, arXiv preprint math/0602276 .

[21] Lehmann, E. L. and Romano, J. P. [2006], Testing Statistical Hypotheses, SpringerScience & Business Media.

[22] Li, J. [2018], ‘Asymptotic normality of interpoint distances for high-dimensionaldata with applications to the two-sample problem’, Biometrika 105, 529–546.

[23] Munk, A. and Czado, C. [1998], ‘Nonparametric validation of similar distributionsand assessment of goodness of fit’, Journal of the Royal Statistical Society: Series B(Statistical Methodology) 60(1), 223–241.

[24] Sarkar, S., Biswas, R. and Ghosh, A. K. [2018], ‘On high-dimensional modificationsof some graph-based two-sample tests’, arXiv preprint arXiv:1806.02138 .

[25] Scetbon, M. and Varoquaux, G. [2019], Comparing distributions: l1 geometry im-proves kernel two-sample testing, in ‘Advances in Neural Information ProcessingSystems 32’, Curran Associates, Inc., pp. 12327–12337.

[26] Schilling, M. F. [1986], ‘Multivariate two-sample tests based on nearest neighbors’,Journal of the American Statistical Association 81(395), 799–806.

[27] Smirnov, N. [1948], ‘Table for estimating the goodness of fit of empirical distribu-tions’, The Annals of Mathematical Statistics 19(2), 279–281.

[28] Szekely, G. J. and Rizzo, M. L. [2004], ‘Testing for equal distributions in high di-mension’, InterStat 5, 1–6.

[29] Von Mises, R. [1928], ‘Statistik und wahrheit’, Julius Springer .[30] Zhu, C., Yao, S., Zhang, X. and Shao, X. [2019], ‘Distance-based and rkhs-based

dependence metrics in high dimension’, arXiv preprint arXiv:1902.03291 .

imsart-bj ver. 2014/10/16 file: TwoSampleBernoulli_Revise4.tex date: June 10, 2020