TWO GRAPH-BASED TESTS
FOR HIGH-DIMENSIONAL INFERENCE
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Hao Chen
June 2013
This dissertation is online at: http://purl.stanford.edu/vm961zz5360
© 2013 by Hao Chen. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License (http://creativecommons.org/licenses/by-nc/3.0/us/).
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
David Siegmund, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Nancy Zhang, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jerry Friedman
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
With modern science there is a growing emphasis on multivariate, complex data
types. Some of these data are high dimensional. Others, such as survey preference,
network, and tree data, cannot be characterized easily with standard models on Eu-
clidean spaces. This dissertation details the investigation in this new setting of two
classic statistical problems: change-point detection and two-sample comparison of
categorical data.
Change-point models are widely used in various fields for detecting lack of homo-
geneity in a sequence of observations. In many applications, the dimension of the
observations in the sequence can be very high, even much larger than the length of
the sequence. Testing the homogeneity of such sequences is a challenging but important problem. Existing approaches are limited in many ways. We propose a new
non-parametric approach that can be applied to data in high dimension, and even to
non-Euclidean object data, as long as an informative similarity measure on the sample
space can be defined. The approach consists of graph-based two-sample tests adapted to the
scan-statistic setting. Graph-based two-sample tests are tests based on graphs connecting observations by similarity [Friedman and Rafsky, 1979, Rosenbaum, 2005]. We
show that this new approach is powerful in high dimensions compared to parametric
approaches. We also derive accurate analytic p-value approximations for very general
situations, which lead to easy off-the-shelf homogeneity testing for large multivariate
data sets. This approach has been applied on two data sets: The determination of
authorship of a classic novel, and the detection of change in a social network over
time.
Two-sample comparison of categorical data is a classic problem in statistics. In
many modern applications, the number of categories can be quite large, even com-
parable to the sample size, causing existing methods to have low power. When the
number of categories is large, there is often underlying structure on the sample space
that can be exploited. We propose a general non-parametric approach that makes
use of similarity information on the space of categories in two-sample tests. Our ap-
proach addresses a shortcoming of existing graph-based two-sample tests by no longer
requiring uniqueness of the underlying graph, thus allowing ties in the distance ma-
trix defining the graph. We found two types of statistics that are both powerful and
fast to compute. We show that their permutation null distributions are asymptoti-
cally normal and that their p-value approximations under typical settings are quite
accurate, facilitating the application of this approach.
Acknowledgements
I would like to thank my advisor, Professor Nancy Zhang, for her guidance and friend-
ship throughout my Ph.D. studies, and my co-advisor, Professor David Siegmund, for
his guidance and encouragement. Both of them broadened my perspectives and in-
spired me in many ways. From critical thinking to effective communication, there
are so many things I learned, which are impossible to summarize. Besides academic
assistance, they care about me when I encounter difficulties in life. I thank them for
being such great advisors and friends.
I would like to thank Professors Jerome Friedman, Wing Wong, and Hua Tang
for serving on my committees and for providing valuable feedback. I am especially
grateful to Professor Jerome Friedman for reading this thesis, discussing with me
on related problems, and providing helpful suggestions and insights. In addition, I
would like to thank Professors Susan Holmes, Emmanuel Candes, Art Owen, Richard
Olshen, Bradley Efron, Persi Diaconis, and Tze Lai for their support in my doctoral
years. Moreover, I appreciate the encouragement and help from Professor Mark Low
at University of Pennsylvania.
My Ph.D. life would not be so colorful and memorable without the companionship
of friends from the Sequoia Hall. Especially, I thank Zhen Zhu, Pei He, Luo Lu,
Gourab Mukherjee, Noah Simon, Zongming Ma, Yi Liu, Yao Xie, Li Ma, Linxi Liu,
and Jian Li for all of the wonderful times we’ve had together.
Last but not least, I want to thank my parents for their everlasting love and care.
I dedicate this thesis to them as a simple expression of gratitude for their love and
support.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 A Review of Graph-Based Two-Sample Tests
1.2 Overview
1.2.1 Change-Point Detection
1.2.2 Two-Sample Comparison of Categorical Data
1.3 Remarks
I Graph-Based Change-Point Detection
2 Change-Point Problems
2.1 Background and Challenges
2.2 Review of Change-Point Problems with Multivariate Observations
3 A Graph-Based Framework
3.1 Problem Formulation
3.2 Test Statistics
3.2.1 Single Change-Point Alternative
3.2.2 Changed Interval Alternative
4 Analytic Approximations to Significance Levels
4.1 Quantity of Interest
4.2 Properties of the Processes
4.2.1 Limiting Distributions
4.2.2 Covariance Function
4.3 Asymptotic Approximations
4.4 Skewness Correction
4.4.1 Derivation of (4.18) and (4.20)
4.4.2 Explicit Expressions for Skewness
4.5 Numerical Studies
4.5.1 Single Change-Point Alternative
4.5.2 Changed Interval Alternative
5 Assessment of the Method
5.1 Numeric Power Studies
5.2 Results on Real Data Examples
5.2.1 Friendship Network
5.2.2 Authorship Debate
5.3 Discussion
II Graph-Based Tests for Two-Sample Comparisons of Categorical Data
6 Introduction
6.1 Background and Challenges
6.2 Implicit Information and Its Role in Improving the Tests
6.3 Notation
7 Graph-Based Test Statistics
7.1 The Test Statistics Based on MST
7.1.1 R_a^MST
7.1.2 R_u^MST
7.2 The Test Statistic Based on MDP
7.2.1 R_a^MDP
7.3 A Numerical Study
7.4 Computational Issues of R_a^MST and R_u^MST
7.5 A Fast Method Generalized from R_a^MST
8 Examples
8.1 Preference Ranking
8.2 Haplotype Association
8.3 Binary Clinical Features
9 Permutation Distributions
9.1 R_{C_0}
9.2 T_{C_0}
9.3 Checking the p-Values Under Normal Approximations
10 Discussion
A Existing Theorems Used in Proofs
A.1 Stein's Method
B Supporting Materials for Part I
B.1 Proofs for the Limiting Distributions
B.2 Effect of Skewness
C Supporting Materials for Part II
C.1 Computation Issues for R_a^MST and R_u^MST
C.1.1 Theoretical Justifications
C.2 Proofs for Permutation Distributions
C.2.1 Proof of Lemma 9.1.1
C.2.2 Proof of Theorem 9.1.3
C.2.3 Proof of Lemma 9.2.1
List of Tables
4.1 Critical values for the single change-point scan statistic based on MST at 0.05 significance level. n = 1000.
4.2 Critical values for the single change-point scan statistic based on MST at 0.01 significance level. n = 1000.
4.3 Critical values for the single change-point scan statistic based on MDP. n = 1000.
4.4 Critical values for the single change-point scan statistic based on NNG at 0.05 significance level. n = 1000.
4.5 Critical values for the single change-point scan statistic based on NNG at 0.01 significance level. n = 1000.
4.6 Critical values for the changed interval scan statistic based on MST at 0.05 significance level. n = 1000.
4.7 Critical values for the changed interval scan statistic based on MST at 0.01 significance level. n = 1000.
4.8 Critical values for the changed interval scan statistic based on MDP. n = 1000.
4.9 Critical values for the changed interval scan statistic based on NNG at 0.05 significance level. n = 1000.
4.10 Critical values for the changed interval scan statistic based on NNG at 0.01 significance level. n = 1000.
5.1 Number of simulated sequences (out of 100) with significance less than 5%.
5.2 p-values for the tests. In each cell, the first value is calculated from 10,000 permutations and the second value is calculated from the analytic approximation after skewness correction.
5.3 p-values for the tests using only data from the first 350 chapters. Numbers in each cell have the same meaning as in Table 5.2.
6.1 Basic Notations.
7.1 The power of six tests – four graph-based tests based on R_a^MST, R_u^MST, R_a^MDP, R_u^NNG, the deviance test (LR), and Pearson's Chi-square test – under two significance levels (α = 0.01, 0.05) and different simulation settings.
7.2 The number of categories, K, and the number of MSTs on categories, M, as haplotype length increases for the haplotype association problem in Section 8.2. All categories are assumed to be non-empty.
7.3 Computational time for R_a^MST and R_u^MST. M is the number of MSTs on categories.
8.1 The power of five tests – three graph-based tests based on R_u^MST, R_{C-uMST}, R_{C-uNNG}, and two Chi-square tests – under two significance levels (α = 0.01, 0.05) and different simulation settings.
8.2 p-values for the KCS data set.
List of Figures
3.1 The MST, MDP, and NNG graphs on an example two-dimensional data set. 20 points were drawn from N(0, I_2) (shown as triangles) and 20 points were drawn from N((2, 2)', I_2) (shown as circles).
3.2 The computation of R_G(t) for nine different values of t. The data is a sequence of length n = 40, with the first 20 points drawn from N(0, I_2) and the second 20 points drawn from N((2, 2)', I_2). The similarity graph G shown in the plots is the MST on Euclidean distance. Each t divides the observations into two groups, one group for observations before and at t (shown as triangles) and the other group for observations after t (shown as circles). Edges that connect observations from the two different groups (i.e., edges connecting a triangle and a circle) are bold in the graph. Notice that G does not change as t changes, but the group identities of some observations change, causing R_G(t) to change.
3.3 The profile of R_G(t) and Z_G(t) against t for the same data set as in Figure 3.2 (a change-point at 20).
3.4 The profile of R_G(t) and Z_G(t) against t on a sequence of points all randomly drawn from N(0, I_2).
5.1 Results of graph-based scans of the MIT phone call network. The top row shows results from using the number of different edges as the dissimilarity measure and the bottom row shows results from using the normalized number of different edges. The three columns show three different ways of constructing the graph: MST, MDP, and NNG from left to right. In each plot, Z(t) is plotted along t. The estimated change-point is shown in the caption above the plot. The two vertical lines show n0 and n1; we basically excluded the first 5% and the last 5% of the points. The horizontal lines show critical values at 0.05 and 0.01 significance levels, with the solid lines showing critical values computed from 10,000 permutations and the dashed lines showing those computed from the analytic approximation after skewness correction.
5.2 Results of graph-based scans of chapter-wise word usage frequencies of Tirant lo Blanch. The first row shows results from the word length data and the second row shows results from the context-free word frequency data. The three columns show scans based on three different graphs: MST, MDP, and NNG from left to right. The content in each plot is the same as in Figure 5.1. In the caption for each plot, the estimated change-point is shown in the form A/B, where A is the index of the change-point within the 425 chapters used for analysis, and B is the chapter number in the novel.
5.3 Results from the first 350 chapters. The setting of the figure is the same as in Figure 5.2.
7.1 Illustration of MST, MDP, and NNG on six points. Notice that only one of the two possible MSTs on the six points and one of the two possible NNGs on the six points are shown.
7.2 Embedding the MST on categories on the subjects. This figure only shows 3 out of 15552 possible embeddings.
7.3 Power versus type I error for tests based on R_a^MST, R_u^MST, R_a^MDP, the likelihood ratio (deviance), and R_u^NNG under different simulation settings.
8.1 C-uMST and C-uNNG constructed on a typical data set generated under parameters ζ0 = 1234 and θ = 5 with n_a = n_b = 20. Spearman's distance is used in both the generating model and for constructing the graph. Each node is labeled by the ranking it represents, followed by the number of subjects from groups a and b with that ranking in parentheses.
8.2 Power versus type I error for five different tests in the preference ranking example with θ = 5 and n_a = n_b = 20. One of two distance measures (Kendall's or Spearman's distance) was used for the generating model and for constructing the graph. The title of each plot denotes the choice of distance: the first letter denotes the distance used in the generating model ("K" is Kendall's and "S" is Spearman's distance); the second letter denotes the distance used in constructing the graph. For instance, "KS" in the bottom left panel means that Kendall's distance is used in the generating model, but Spearman's distance is used in constructing the graph.
8.3 Power versus type I error plots for the five tests in the haplotype example. The length of the haplotype is 11, but only 4 positions are informative.
9.1 Boxplots of the differences between p-values calculated from the normal approximation and from 10,000 permutations.
B.1 The three quantities, γ_G(t), θ_{b,G}(t), and S_G(t), from left to right, for an MDP graph. n = 1000, b = 3.
B.2 The three quantities, γ_G(t), θ_{b,G}(t), and S_G(t), from left to right, for an MST graph constructed using Euclidean distance on a sequence of n = 1000 observations drawn iid from N(0, I_100). b = 3.
B.3 The integrand before (left) and after (right) extrapolation. The integrand can only be directly calculated in the middle part (t ∈ [248, 752]); the outer part is obtained by extending using the boundary tangent.
Chapter 1
Introduction
Statistics is a rapidly growing field where new challenges arise before old ones are fully
understood. As advanced technologies in various fields produce data with increasing
dimensionality and volume, many fundamental problems have to be re-addressed to
meet the demands from modern data. This dissertation focuses on two of these
problems: Change-point detection and two-sample comparison of categorical data.
This chapter gives a brief review of graph-based two-sample tests, which are the
building block of our approach to both problems in high dimension, and a brief
overview of the two problems.
1.1 A Review of Graph-Based Two-Sample Tests
Let y1, . . . , yn and yn+1, . . . , yN be two samples from distributions F0 and F1,
respectively. We consider testing the null hypothesis that the two distributions are
the same, F0 = F1, against the omnibus alternative, F0 ≠ F1.
By graph-based two-sample test, we refer to tests that are based on graphs with the
observations y1, . . . ,yN as nodes. Friedman and Rafsky [1979] proposed the first
graph-based two-sample test as a generalization of the Wald-Wolfowitz runs test to
multivariate settings. Their test is based on a minimum spanning tree (MST) on the
observations yi, i = 1, . . . , N, which is a spanning tree connecting all observations
that minimizes the sum of distances across edges. The Friedman-Rafsky test is based
on the number of edges connecting observations across different groups:
RG = ∑_{(i,j)∈G} I_{gi ≠ gj},    (1.1)
where G is the MST, and gi is an indicator function for yi belonging to the first
sample. The sum RG is compared to its null distribution obtained by permuting the
sample labels – randomly picking n observations out of the N observations to be the
first sample – and the null hypothesis is rejected if RG is relatively low. The rationale
is that, if the two samples come from different distributions, observations from the
same sample should be closer to each other, and thus edges in the MST should be
more likely to connect observations within a group. Friedman and Rafsky showed
that, while this test has low power in low dimensions, it has comparable power to
likelihood ratio tests in a numerical study of normal data in > 20 dimensions, and
higher power when the normality assumption is violated.
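To make the procedure concrete, here is a minimal sketch of the Friedman-Rafsky test. This is my own illustration, not code from the thesis: it builds the MST on the pooled sample, counts the cross-sample edges RG, and compares against a label-permutation null. The function name, toy data, and permutation count are all arbitrary choices.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def friedman_rafsky(y, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value: a small R_G suggests F0 != F1."""
    d = squareform(pdist(y))                 # pairwise Euclidean distances
    mst = minimum_spanning_tree(d).tocoo()   # MST as a sparse edge list
    edges = list(zip(mst.row, mst.col))
    rng = np.random.default_rng(seed)

    def r_g(lab):                            # edges joining the two samples
        return sum(lab[i] != lab[j] for i, j in edges)

    r_obs = r_g(labels)
    null = [r_g(rng.permutation(labels)) for _ in range(n_perm)]
    p = (1 + sum(r <= r_obs for r in null)) / (1 + n_perm)
    return r_obs, p

rng = np.random.default_rng(1)
y = np.vstack([rng.normal(0, 1, (20, 25)),    # sample 1: N(0, I_25)
               rng.normal(1, 1, (20, 25))])   # sample 2: mean-shifted
labels = np.array([0] * 20 + [1] * 20)
r_obs, p = friedman_rafsky(y, labels)
print(r_obs, p)   # few cross-sample edges, small p-value
```

Rejecting for small RG mirrors (1.1): when the two distributions differ, MST edges tend to stay within a sample.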
Another graph-based two-sample method, the cross-match test, was proposed
by Rosenbaum [2005]. This test is based on minimum distance non-bipartite pair-
ing (MDP), which divides the N observations into N/2 (assuming N is even) non-
overlapping pairs in such a way as to minimize the total of N/2 distances between
pairs. For odd N , Rosenbaum suggested creating a pseudo data point that has dis-
tance 0 with all other observations, and later discarding the pair containing this
pseudo point. The sum (1.1) is computed with G set to the MDP, and the same
rationale is adopted: the null hypothesis is rejected if RG is relatively low compared
to that calculated under permutations. Note that the topology of the MDP, a
perfect matching, is the same for any data set with N observations. This fact makes
the test based on MDP truly distribution-free under the null hypothesis.
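The pairing itself can be illustrated by brute force. This is my own sketch with hypothetical helper names, feasible only for a handful of points; practical implementations use polynomial-time non-bipartite matching algorithms instead.

```python
import numpy as np

def min_nonbipartite_pairing(d):
    """Brute-force minimum-distance pairing of an even number of points."""
    def pairings(items):
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for k in range(len(rest)):                 # pair `first` with each candidate
            for tail in pairings(rest[:k] + rest[k + 1:]):
                yield [(first, rest[k])] + tail

    return min(pairings(list(range(d.shape[0]))),
               key=lambda m: sum(d[i, j] for i, j in m))

def cross_match_count(d, labels):
    """R_G of (1.1) with G the MDP: pairs that join the two samples."""
    return sum(labels[i] != labels[j] for i, j in min_nonbipartite_pairing(d))

# Two well-separated samples on the line: the minimum pairing never crosses,
# so the cross-match count is 0, evidence against F0 = F1.
y = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
d = np.abs(y[:, None] - y[None, :])
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(cross_match_count(d, labels))  # 0
```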
1.2 Overview
This section gives a brief overview of the challenges of the two problems in high
dimension and how we approached them. The details will be presented in Part I and
Part II of the dissertation, respectively.
1.2.1 Change-Point Detection
Change-point models are widely used in various fields for detecting lack of homogene-
ity in an ordered sequence of observations. There is a rich literature on theory and
application of change-point models when the observations are real or integer valued
scalars. However, in many applications, the observations are multivariate. Examples
can be found close to our everyday lives. Many classic works in literature, such as Tirant lo
Blanch, a Catalan romance, and Dream of the Red Chamber, a Chinese masterpiece,
have debates on whether there is a change of author mid-way through the novel.
In the digital era, an objective approach to these debates is to statistically test for
abrupt changes in writing style, which can be reflected by word usage. Similar prob-
lems arise in genomic sequence analysis in biology, where it is often of interest to find
regions of the genome with different DNA-word compositions. In both settings, each
observation in the sequence is a vector of word counts over a dictionary. Multivariate
change-point models are also useful for detecting abrupt shifts in network connectiv-
ity for either social networks or gene-/protein- interaction networks, and for detecting
abrupt events in image data from climatology, neuroscience, and surveillance tapes.
In these applications, the dimension of the observations in the sequence can be
very high, even larger than the number of observations. Testing the homogeneity
of such sequences is a challenging but important problem. Existing approaches for
change-point detection in a sequence of multivariate observations are limited in many
ways. Most methods are based on parametric models that are highly context specific.
These parametric approaches for multivariate data cannot be applied in very high
dimensions, unless strong assumptions are made to avoid the estimation of the large
number of nuisance parameters that are a by-product of increasing dimension. Non-
parametric methods have also been proposed, but they do not generalize well to high
dimension.
We propose a new non-parametric approach that can be applied to data in high
dimension, and even to non-Euclidean object data, as long as an informative similarity
measure on the sample space can be defined. Briefly, the approach is graph-based two-
sample tests adapted to the scan-statistic setting. We show that this new approach
is powerful in high dimensions compared to parametric approaches. We also derive
accurate analytic p-value approximations for very general situations, which lead to
easy off-the-shelf homogeneity testing for large multivariate data sets.
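A stripped-down illustration of the scan idea follows. This is my own construction, not the thesis's implementation: the actual method standardizes RG(t) into ZG(t) using exact permutation moments, whereas here we only compare RG(t) with its permutation mean E[RG(t)] = |G| * 2(t+1)(n-t-1)/(n(n-1)) for a split after the (t+1)-th observation, and locate the largest deficit.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def scan_change_point(y, n0, n1):
    """Scan candidate change-points t in [n0, n1) on one fixed MST."""
    n = len(y)
    d = squareform(pdist(y))
    mst = minimum_spanning_tree(d).tocoo()
    edges = list(zip(mst.row, mst.col))
    best_t, best_deficit = None, -np.inf
    for t in range(n0, n1):
        # R_G(t): MST edges joining {0..t} to {t+1..n-1}
        r = sum((i <= t) != (j <= t) for i, j in edges)
        # crude surrogate for Z_G(t): deficit below the permutation mean
        e = len(edges) * 2 * (t + 1) * (n - t - 1) / (n * (n - 1))
        if e - r > best_deficit:
            best_t, best_deficit = t, e - r
    return best_t

rng = np.random.default_rng(2)
y = np.vstack([rng.normal(0, 1, (50, 100)),   # F0 up to index 49
               rng.normal(1, 1, (50, 100))])  # F1 afterwards
print(scan_change_point(y, 5, 95))  # lands near the true change at 49
```

As in Figure 3.2, the graph G is built once on the whole sequence; only the split point t moves.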
1.2.2 Two-Sample Comparison of Categorical Data
Two-sample comparison of categorical data is a classic problem in statistics. The
standard procedure is to assume that each sample is drawn from a multinomial dis-
tribution, and the comparison becomes a test of whether the two samples come from
the same multinomial distribution. Classical methods, such as Pearson's Chi-square
test and the deviance test, work well when we observe each category a large
number of times. At least, the region in the contingency table where the two groups
truly differ needs to be adequately sampled for existing tests to achieve good power.
However, in many applications, the number of categories is comparable to the sample
size, causing existing methods to have low power.
When the number of categories is large, there is often underlying structure on
the sample space that can be exploited. For example, in genetics, a haplotype is a
combination of alleles at adjacent loci on a chromosome that is transmitted together.
A common problem of genetic association studies is to compare haplotype counts
between treatment and control groups. Each haplotype can be represented as a fixed-
length binary vector. Haplotypes longer than 10 loci are often of interest in
genetics, leading to more than 1,000 possible combinations. However, the number of subjects
in association studies is often only in the thousands or even hundreds. In this example,
haplotypes can be related through Hamming distance or other more sophisticated
measures. Another example is ranking data from surveys and psychometric research,
where two group comparisons are common. The number of possible rankings is often
large compared to the number of subjects. In this example, rankings can be related
through Kendall’s or Spearman’s distance.
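For concreteness, the distances just mentioned have simple textbook forms. These are my own sketches of the standard definitions; the thesis may use variants.

```python
from itertools import combinations

def hamming(h1, h2):
    """Number of loci at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def kendall(r1, r2):
    """Kendall's distance: number of item pairs ranked in opposite order."""
    return sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
               for i, j in combinations(range(len(r1)), 2))

def spearman(r1, r2):
    """Spearman's distance: sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(r1, r2))

print(hamming("0110100110", "0100110110"))  # 2: haplotypes differ at 2 loci
print(kendall([1, 2, 3, 4], [1, 3, 2, 4]))  # 1: a single adjacent swap
print(spearman([1, 2, 3, 4], [1, 3, 2, 4])) # 2: squared rank shifts 0+1+1+0
```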
We propose general non-parametric approaches that make use of similarity infor-
mation on the space of categories in two-sample tests. As we see in Section 1.1, both
existing graph-based two-sample tests require a unique underlying graph G on the
observations. When the similarity matrix on observations is filled with ties, which is
a major characteristic for categorical data, neither MST nor MDP can be constructed
uniquely. We explored different ways to construct the graph and the statistic under
the categorical setting and found two types of statistics that are both powerful and
fast to compute. We show that their permutation null distributions are asymptotically
normal under mild conditions and that their p-value approximations are quite
accurate under typical settings, facilitating the application of the new approaches to
real problems.
1.3 Remarks
The two problems are separate problems except that we approach both through graph-
based tests. Given the review of graph-based two-sample tests in Section 1.1, Part I
and Part II are “independent” and can be read in either order.
The notation is consistent within each part, and I have tried to make it as consistent
as possible across the two parts. However, some symbols are defined twice. Their
meanings are clear within context, but carrying a definition from one part to the other
requires caution.
Throughout, I always use G to denote the similarity graph in a generic sense, as
well as the set of edges in the graph when the vertex set is implicitly obvious. | · | is
used to denote the size of a set, so |G| is the number of edges in G. If not otherwise
specified, the probabilities are all under permutation.
Some commonly used notations are also defined here. φ(·) and Φ(·) are the density
and cumulative distribution of the standard normal distribution. For any event x, Ix
is the indicator function that takes value 1 if x is true, and 0 otherwise. For any real
value x, [x] denotes the largest integer ≤ x.
Part I
Graph-Based Change-Point
Detection
Chapter 2
Change-Point Problems
2.1 Background and Challenges
Change-point models are widely used in various fields for detecting lack of homogene-
ity in a sequence of observations ordered based on an index, such as time. In the
typical formulation, the observations yi, i = 1, 2, . . . , n are assumed to have distri-
bution F0 for i ≤ τ and possibly a different distribution F1 for i > τ . The parameter
τ is referred to as the change-point. We consider the case where the total length of
the sequence n is fixed. There is rich literature on theory and applications of this
model when yi are real or integer valued scalars. For example, in a well known study
of the annual flow volume of the Nile River at the city of Aswan, Egypt, from 1871
to 1970, each yi is a continuous measurement of the annual discharge from the river
[Cobb, 1978], and the goal is to detect shifts in flow volume. If the distribution of yi
were assumed to be normal, score- or likelihood- based tests can be applied [James
et al., 1987]. Bayesian and non-parametric approaches have also been developed (see
Carlstein et al. [1994] for a survey).
Modern statistical applications are faced with data of increasing richness and
dimension. High throughput measurement schemes and digitization in many scientific
fields have produced data sequences yi : i = 1, 2, . . . , n, where each yi is a high
dimensional vector or even a non-Euclidean data object. The dimension of each
observation can be larger than the length of the sequence. Testing the homogeneity
of such sequences and estimating the locations of change-points if the sequence is not
homogeneous are challenging but important problems. Following are some motivating
examples:
Network evolution: Data on networks have become increasingly common. For ex-
ample, e-mail, phone, and on-line chat records can be used to construct a net-
work of social interactions among individuals [Kossinets and Watts, 2006, Eagle
et al., 2009]. High throughput biological experiments have led to the ubiquitous
study of protein- or gene- interaction networks. A large part of these studies is
characterizing how the network evolves through time. Here, the observation at
each time point is a graphical encoding of the network. In a longitudinal study,
one might ask whether there is an abrupt shift in network connectivity at any
point in time.
Image analysis: Image data collected through time appears in diverse applications,
from video surveillance to climatology to neuroscience. The detection of abrupt
events, such as security breaches, storms, or brain activity, can be formulated as
a change-point problem. Here, the observation at each time point is the digital
encoding of an image.
Text or sequence analysis: Many classic works in both western and eastern liter-
ature have ongoing authorship debates. For example, the debate surrounding
both Tirant lo Blanch, a Catalan romance, and Dream of the Red Chamber,
a Chinese masterpiece, is whether there is a change of authorship mid-way
through the novel. In the digital era, an objective approach to these debates is
to statistically test for abrupt changes in writing style, which can be reflected
by word usage. Similar problems arise in genomic sequence analysis in biology,
where it is often of interest to find regions of the genome with different DNA-
word compositions (see, for example, Tsirigos and Rigoutsos [2005]). In both
settings, each observation in the sequence is a vector of word counts over a large
dictionary of words.
2.2 Review of Change-Point Problems with Multivariate Observations
There are several methods that can be used to detect change-point(s) in a sequence
of multivariate observations. Most methods are based on parametric models that are
highly context specific. For example, Zhang et al. [2010] and Siegmund et al. [2011]
studied the problem of detecting common shifts in mean in multivariate Gaussian
sequences with identity covariance. Srivastava and Worsley [1986] and James et al.
[1992] discussed general likelihood ratio tests for a change in multivariate normal
mean. Both tests require that the dimension of the observations be smaller than
the number of observations. Giron et al. [2005] assumed that the observations follow a multinomial distribution and developed a Bayesian approach. This survey is not exhaustive; in general, parametric approaches for multivariate data cannot be applied in very high dimensions unless strong assumptions are made to avoid estimating the large number of nuisance parameters that arise as a by-product of the increasing dimension.
In the nonparametric domain, Desobry et al. [2005] and Harchaoui et al. [2009]
used kernel-based methods. A common drawback of kernel-based methods is that they rely heavily on the choice of kernel functions and parameters, and the problem becomes more severe when they are applied to high-dimensional data. Lung-Yut-Fong et al. [2011] used a non-parametric approach based on marginal rank statistics, which also requires that the number of observations be larger than the dimension of the data. Although these tests can be quite useful in low dimensions, they are impractical when the data reside in a high-dimensional sample space.
Chapter 3
A Graph-Based Framework for
Change-Point Detection
3.1 Problem Formulation
We start with a formal formulation of the problem. Consider a sequence of indepen-
dent observations yi, i = 1, . . . , n. Let F0 and F1 be two probability measures on
the sample space, e.g., Rd. We do not assume that F0 and F1 are known; they can be arbitrary, but they need to differ on a set of non-zero measure. We are
concerned with testing the null hypothesis,
H0 : yi ∼ F0, i = 1, . . . , n, (3.1)
against the alternative,
Ha : ∃ 1 ≤ τ < n,  yi ∼ { F1, i > τ;  F0, otherwise }.   (3.2)
Under the alternative hypothesis, there exists a time point τ where the distribution
of the observations changes abruptly from F0 to F1. (The index can represent some other meaningful ordering; for simplicity, we refer to the ordering as time unless otherwise specified.) A related alternative, which we will also study, is that of a changed interval:

Ha : ∃ 1 ≤ τ1 < τ2 ≤ n,  yi ∼ { F1, i = τ1 + 1, . . . , τ2;  F0, otherwise }.   (3.3)
Under the second alternative, yi changes from F0 to F1 and then back to F0 at some
later time.
In both the single change-point and the changed interval alternatives, the obser-
vations are divided into two distinct groups. Usually, we are interested in the case
that both groups have a minimum number of observations: 1 < n0 ≤ τ ≤ n1 < n
for the single change-point scenario and 1 < l0 ≤ τ2 − τ1 ≤ l1 < n for the changed
interval scenario, where n0, n1, l0, l1 are prespecified values. Sometimes, domain knowledge gives us good choices for these values. We may also have some constraints on the locations of τ1 and τ2.
3.2 Test Statistics
We do not impose any restrictions on the sample space or the distributions of the yi's. Our approach requires only that the similarity between observations can be represented by a graph, with edges in the graph connecting observations that are "close" in some sense. Recall from Section 1.1 that the rationale behind graph-based two-sample tests is that, when the two distributions differ, observations from the same distribution are more likely to be connected in the graph G. The same rationale underlies our proposed method for the change-point setting: if there is a change-point, observations from the same distribution are more likely to be connected. To give a sense of what the similarity graph usually looks like, Figure 3.1 presents the minimum spanning tree (MST), minimum distance pairing (MDP), and nearest neighbor graph (NNG) on 40 points in R2, where 20 points are randomly drawn from N(0, I2) and 20 points from N((2, 2)′, I2).
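Two of these similarity graphs are easy to build with standard tools. The following is a minimal sketch (not code from this dissertation; it assumes NumPy and SciPy are available) that constructs the MST and the NNG on data like that of Figure 3.1; the MDP, which requires a minimum-weight perfect matching, is omitted for brevity.

```python
# Sketch: building the MST and NNG similarity graphs on 40 points in R^2,
# 20 drawn from N(0, I2) and 20 from N((2,2)', I2), as in Figure 3.1.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(2.0, 1.0, size=(20, 2))])
D = squareform(pdist(y))                 # pairwise Euclidean distances

# MST: a spanning tree on all 40 points, hence n - 1 = 39 edges.
mst = minimum_spanning_tree(D).tocoo()
mst_edges = list(zip(mst.row.tolist(), mst.col.tolist()))

# NNG: join each point to its nearest neighbor (duplicate pairs merged).
D_off = D.copy()
np.fill_diagonal(D_off, np.inf)
nn = D_off.argmin(axis=1)
nng_edges = sorted({tuple(sorted((i, int(j)))) for i, j in enumerate(nn)})

print(len(mst_edges), len(nng_edges))    # the MST always has 39 edges here
```

Either edge list can then play the role of G in the statistics defined below.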
3.2.1 Single Change-Point Alternative
Here, we consider testing the null (3.1) versus the single change-point alternative
(3.2). Let G be the similarity graph on the observations.

Figure 3.1: The MST, MDP, and NNG graphs on an example two-dimensional data set. 20 points were drawn from N(0, I2) (shown as triangles) and 20 points were drawn from N((2, 2)′, I2) (shown as circles).

Any time t divides the observations into two groups: those observed at or before t and those observed after t. The number of edges connecting points from the two different groups at time t is defined as

RG(t) = Σ_{(i,j)∈G} I(gi(t) ≠ gj(t)),   gi(t) = I(i > t).
Here, gi(t) indicates whether yi is observed after t. As in the Friedman-Rafsky and Rosenbaum tests, small values of RG(t) are evidence against the null. To standardize RG(t) so that it is comparable across t, let

ZG(t) = −( RG(t) − E[RG(t)] ) / √( Var[RG(t)] ).   (3.4)

In the standardization, we also invert the sign, so that large values of ZG(t) are evidence against the null. The expectation and variance above are defined under the permutation null, which places probability 1/n! on each of the n! permutations of {yi : i = 1, . . . , n}. That is, the times of observing the yi's are permuted, so i is the permutation random variable, and gi(t) is the indicator of observing yi after t under permutation. Since the graph G is determined by the values of the yi's, its structure does not change under permutation.
Remark 3.2.1. It would be clearer to use π(i) to denote the observed time for yi after permutation, as we do in Appendix B.1. However, when there is little ambiguity, i is used to avoid cumbersome notation.
Lemma 3.2.2 below gives analytic formulas for E[RG(t)] and Var[RG(t)]. Before
we state the lemma, we introduce the notation Gi for the subgraph of G consisting of all edges incident to node i. Since the vertex set of Gi is implied (node i together with all nodes connected to it in G), Gi is also used to denote the set of edges in Gi, and |Gi| is the number of edges in Gi. Note that |Gi| equals the degree of node i.
Lemma 3.2.2. Under the permutation null, the expectation and variance of RG(t)
are
E(RG(t)) = p1(t)|G|,
Var(RG(t)) = p2(t)|G| + ( (1/2)p1(t) − p2(t) ) Σi |Gi|² + ( p2(t) − p1²(t) ) |G|²,

where

p1(t) = 2t(n−t) / ( n(n−1) ),
p2(t) = 4t(t−1)(n−t)(n−t−1) / ( n(n−1)(n−2)(n−3) ).
Proof. Notice that the indices i, j are random under permutation. The formula for the expectation is immediate,

E(RG(t)) = Σ_{(i,j)∈G} P(gi(t) ≠ gj(t)) = p1(t)|G|,

because there are 2t(n − t) ways to place i and j on the two sides of t among all n(n − 1) ways.

For the second moment,

E(R²G(t)) = Σ_{(i,j),(k,l)∈G} P(gi(t) ≠ gj(t), gk(t) ≠ gl(t)).
By examining the different ways of placing i, j, k, l, we have

P(gi(t) ≠ gj(t), gk(t) ≠ gl(t)) =
  2t(n−t) / ( n(n−1) ) = p1(t)   if i = k, j = l or i = l, j = k;
  t(n−t) / ( n(n−1) ) = (1/2)p1(t)   if i = k, j ≠ l; i = l, j ≠ k; j = k, i ≠ l; or j = l, i ≠ k;
  4t(t−1)(n−t)(n−t−1) / ( n(n−1)(n−2)(n−3) ) = p2(t)   if i, j, k, l are all different.
So

E(R²G(t)) = Σ_{(i,j)∈G} p1(t) + Σ_{(i,j),(i,k)∈G, j≠k} (1/2)p1(t) + Σ_{(i,j),(k,l)∈G, i,j,k,l all different} p2(t)
  = p2(t)|G| + ( (1/2)p1(t) − p2(t) ) Σi |Gi|² + p2(t)|G|².

Var(RG(t)) follows from E(R²G(t)) − E²(RG(t)).
Remark 3.2.3. The expectation and variance of RG(t) under the permutation null depend only on t, n, and two characteristics of the graph: the number of edges, |G|, and the sum of squared node degrees, Σ^n_{i=1} |Gi|².
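Lemma 3.2.2 translates directly into code. The following sketch is ours, not the dissertation's (the helper name rg_moments is made up for illustration); it computes E[RG(t)] and Var[RG(t)] from an edge list, and for small n the output can be checked against exact enumeration over all equally likely placements of the first t observations.

```python
# Sketch: E[R_G(t)] and Var[R_G(t)] under the permutation null (Lemma 3.2.2).
# `edges` is a list of (i, j) pairs with 0-based node labels.
def rg_moments(edges, n, t):
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    num_edges = len(edges)                      # |G|
    sum_deg2 = sum(d * d for d in deg)          # sum_i |G_i|^2
    p1 = 2 * t * (n - t) / (n * (n - 1))
    p2 = (4 * t * (t - 1) * (n - t) * (n - t - 1)
          / (n * (n - 1) * (n - 2) * (n - 3)))
    mean = p1 * num_edges
    var = (p2 * num_edges + (0.5 * p1 - p2) * sum_deg2
           + (p2 - p1 ** 2) * num_edges ** 2)
    return mean, var

# A path graph on n = 6 nodes with t = 3 gives mean 3 and variance 1.2.
print(rg_moments([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)], 6, 3))  # ~ (3.0, 1.2)
```

Only |G| and Σ|Gi|² enter the formulas, exactly as Remark 3.2.3 states.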
Figure 3.2 illustrates the computation of RG(t) on a small artificial data set of
length n = 40 with the first 20 points drawn from N (0, I2) and the second 20 points
drawn from N ((2, 2)′, I2), so there is a true change-point at τ = 20. The similarity
graph is the MST constructed using Euclidean distance. Figure 3.3 shows plots of
the RG(t) and ZG(t) processes from the same data set. We see that ZG(t) peaks at the true change-point, 20. For contrast, we randomly generated another sequence of length 40 with all points drawn from N(0, I2). The RG(t) and ZG(t) processes calculated from this data set are shown in Figure 3.4. For the data set with no change-point, ZG(t) behaves almost randomly, and its maximum is much smaller (around 1, compared to around 4 in Figure 3.3).
Figure 3.2: The computation of RG(t) for nine different values of t. The data is a sequence of length n = 40, with the first 20 points drawn from N(0, I2) and the second 20 points drawn from N((2, 2)′, I2). The similarity graph G shown in the plots is the MST on Euclidean distance. Each t divides the observations into two groups, one for observations at or before t (shown as triangles) and the other for observations after t (shown as circles). Edges that connect observations from the two different groups (i.e., edges connecting a triangle and a circle) are bold in the graph. Notice that G does not change as t changes, but the group identities of some observations change, causing RG(t) to change.
Figure 3.3: The profiles of RG(t) and ZG(t) against t for the same data set as in Figure 3.2 (a change-point at 20).
Figure 3.4: The profiles of RG(t) and ZG(t) against t for a sequence of points all randomly drawn from N(0, I2).
We use the scan statistic to test H0 versus Ha:

max_{n0 ≤ t ≤ n1} ZG(t),   (3.5)

where n0 and n1 are the pre-specified constraints on the range of τ described in Section 3.1. The null hypothesis is rejected if the maximum is greater than some threshold. How to determine this threshold is discussed in Chapter 4.
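As an illustration of (3.5), the following self-contained sketch (ours, not the dissertation's code; the function name z_scan is made up) computes ZG(t) over a range of t using the closed-form moments of Lemma 3.2.2 and returns the arg-max as the change-point estimate.

```python
# Sketch: the scan statistic max_{n0 <= t <= n1} Z_G(t) of (3.5).
# Observations are indexed by time 1..n; node i (0-based) is observed at
# time i + 1, so g_i(t) = 1 iff i + 1 > t.
import math

def z_scan(edges, n, n0, n1):
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    num_edges = len(edges)
    sum_deg2 = sum(d * d for d in deg)
    z = {}
    for t in range(n0, n1 + 1):
        p1 = 2 * t * (n - t) / (n * (n - 1))
        p2 = (4 * t * (t - 1) * (n - t) * (n - t - 1)
              / (n * (n - 1) * (n - 2) * (n - 3)))
        mean = p1 * num_edges
        var = (p2 * num_edges + (0.5 * p1 - p2) * sum_deg2
               + (p2 - p1 ** 2) * num_edges ** 2)
        r = sum(((i + 1 > t) != (j + 1 > t)) for i, j in edges)
        z[t] = -(r - mean) / math.sqrt(var)
    tau_hat = max(z, key=z.get)
    return tau_hat, z[tau_hat]

# Toy sequence of n = 20 with a change at tau = 10: two path graphs joined
# by a single bridge edge, mimicking a similarity graph whose two halves
# are tightly connected within themselves.
edges = ([(i, i + 1) for i in range(9)]          # nodes 0..9
         + [(i, i + 1) for i in range(10, 19)]   # nodes 10..19
         + [(9, 10)])                            # bridge across the change
print(z_scan(edges, 20, 2, 18))                  # arg-max lands at t = 10
```

The example is artificial but shows the mechanics: only one edge crosses the true split, so ZG(t) is largest there.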
3.2.2 Changed Interval Alternative
In this section, we consider testing H0 versus the changed interval alternative (3.3).
Similar to the single change-point case, any specific alternative (t1, t2) divides the data into two groups, one group containing all points observed during (t1, t2], and the other group containing the points observed outside this interval. Then, the number of edges in G connecting data points from different groups is

RG(t1, t2) = Σ_{(i,j)∈G} I(gi(t1, t2) ≠ gj(t1, t2)),   gi(t1, t2) = I(t1 < i ≤ t2).
We standardize RG(t1, t2) as before,

ZG(t1, t2) = −( RG(t1, t2) − E[RG(t1, t2)] ) / √( Var[RG(t1, t2)] ).
Lemma 3.2.4 below gives explicit expressions for E(RG(t1, t2)) and Var(RG(t1, t2))
under the permutation null. The scan statistic involves a maximization over t1 and
t2,

max_{1 ≤ t1 < t2 ≤ n, l0 ≤ t2−t1 ≤ l1} ZG(t1, t2),   (3.6)

where l0 and l1 are constraints on the window size. For example, we can set l1 = n − l0 so that only alternatives where the number of observations in each group is larger than l0 are considered.
Lemma 3.2.4. Under the permutation null, the expectation and variance of RG(t1, t2)
are

E(RG(t1, t2)) = p1(t2 − t1)|G|,
Var(RG(t1, t2)) = p2(t2 − t1)|G| + ( (1/2)p1(t2 − t1) − p2(t2 − t1) ) Σi |Gi|² + ( p2(t2 − t1) − p1²(t2 − t1) ) |G|²,

where p1(·) and p2(·) are defined in Lemma 3.2.2.

The proof of this lemma is very similar to that of Lemma 3.2.2 and is omitted here.
Remark 3.2.5. We can also constrain t1 and t2 to a prefixed region using domain knowledge. The procedure is similar, and the p-value approximations in Chapter 4 carry over with some minor modifications.
Chapter 4
Analytic Approximations to
Significance Levels
4.1 Quantity of Interest
How large do the values of the scan statistics (3.5) and (3.6) need to be to constitute sufficient evidence against the null hypothesis of homogeneity? We are concerned with the tail distribution of the scan statistics under H0, that is,

P( max_{n0 ≤ t ≤ n1} ZG(t) > b )   (4.1)
for the single change-point alternative, and

P( max_{1 ≤ t1 < t2 ≤ n, l0 ≤ t2−t1 ≤ l1} ZG(t1, t2) > b )   (4.2)

for the changed interval alternative. (Since the constraint 1 ≤ t1 < t2 ≤ n is implicit, we omit it in the rest of this chapter for simplicity.) The probabilities in (4.1) and (4.2) are defined under the permutation distribution, where the order of the yi's is permuted. Under the null hypothesis, the observations are iid, so the scan statistic calculated under any permutation of time follows the same distribution.
Therefore, we can directly sample from the permutation distribution to approximate the two probabilities (4.1) and (4.2). This is affordable if the number of observations n is not large. When n is large, permutation becomes computationally prohibitive, especially for (4.2), where each scan is of order O(n²) if l1 − l0 ∼ O(n). Even though parallel computing makes permutation more feasible, it requires many computing units. In addition, pure permutation gives little insight into how different factors enter the probabilities. Analytic expressions for the probabilities make the approach much easier to carry out and provide a better understanding of it.
If we treat {ZG(t)} and {ZG(t1, t2)} as families of tests, the probabilities (4.1) and (4.2) are family-wise error rates. However, the tests are dependent, since they are all based on the same sequence, and each test statistic, ZG(t) or ZG(t1, t2), has a complicated distribution because it is calculated under permutation. Therefore, it is impossible to obtain exact analytic expressions for the two probabilities for finite n.
In the rest of this chapter, we give analytic approximations to the two probabilities. We first show that, under some mild conditions on G, {ZG([nu]) : 0 < u < 1} converges to a Gaussian process and {ZG([nu], [nv]) : 0 < u < v < 1} converges to a Gaussian random field as n → ∞ (Section 4.2). We then derive analytic approximations to the two probabilities under these conditions on G (Section 4.3). To obtain better approximations for small n and for more general graphs, we refine the approximations by correcting the skewness of the marginal distributions (Section 4.4). All these approximations are checked by numerical studies under different scenarios (Section 4.5).
4.2 Properties of the Processes
In this section, we study the random process {ZG([nu]) : 0 < u < 1} and the random field {ZG([nu], [nv]) : 0 < u < v < 1} as n → ∞. Their limiting distributions are stated in Section 4.2.1, and the covariance function of the random process {ZG([nu]) : 0 < u < 1} is given in Section 4.2.2.
4.2.1 Limiting Distributions
This section gives the limiting distributions of {ZG([nu]) : 0 < u < 1} and {ZG([nu], [nv]) : 0 < u < v < 1}, derived using Stein's method. We first introduce some notation. For an edge e = (e−, e+), where e− < e+ are the indices of the nodes connected by the edge e, let

Ae = Ge− ∪ Ge+   (4.3)

be the set of edges connecting to either node e− or node e+, and

Be = ∪{Ae′ : e′ ∈ Ae}   (4.4)

be the set of edges connecting to the nodes in Ge− and Ge+.
Theorem 4.2.1. If Σ_{e∈G} |Ae||Be| ∼ o(n^{3/2}) and |G| ∼ O(n) as n → ∞, then under the permutation null,

1. {ZG([nu]) : 0 < u < 1} converges to a Gaussian process, which we denote by {Z*G(u) : 0 < u < 1};

2. {ZG([nu], [nv]) : 0 < u < v < 1} converges to a two-dimensional Gaussian random field, which we denote by {Z*G(u, v) : 0 < u < v < 1}.
Remark 4.2.2. The condition on the graph restricts the "hubs" of the graph, that is, nodes with large degrees. The condition requires that the maximum degree of the graph not be of order |G|^{3/4} or higher. To help in understanding the condition, it can be simplified to a stronger version requiring that the maximum degree of the graph be o(|G|^{1/6}).
The proof for Theorem 4.2.1 is in Appendix B.1.
4.2.2 Covariance Function
The covariance function of the Gaussian process {Z*G(u) : 0 < u < 1} is stated in the next lemma. Define

ρ*G(u, v) ≜ cov( Z*G(u), Z*G(v) ).   (4.5)
Lemma 4.2.3.

ρ*G(u, v) = [ 2(u∧v)²(1 − (u∨v))²|G| + (u∧v)(1 − (u∨v))(1−2u)(1−2v) Σi |Gi|² ] / ( σ*G(u) σ*G(v) ),   (4.6)

where

σ*G(u) = √( 2u²(1−u)²|G| + u(1−u)(1−2u)² Σi |Gi|² ).
Proof. First observe that ρ*G(u, u) = 1, which is consistent with (4.6). Because of the interchangeability of u and v in the definition of ρ*G(u, v), it is enough to show that, for u < v,

ρ*G(u, v) = [ 2u²(1−v)²|G| + u(1−v)(1−2u)(1−2v) Σi |Gi|² ] / ( σ*G(u) σ*G(v) ).   (4.7)

Let ρG,n(u, v) ≜ cov( ZG([nu]), ZG([nv]) ); then ρ*G(u, v) = lim_{n→∞} ρG,n(u, v). Let s = [nu] and t = [nv]; then s < t, lim_{n→∞} s/n = u, and lim_{n→∞} t/n = v. Since

cov(ZG(s), ZG(t)) = [ E(RG(s)RG(t)) − E(RG(s))E(RG(t)) ] / √( Var(RG(s)) Var(RG(t)) ),

and the expressions for E(RG(s)), E(RG(t)), Var(RG(s)), and Var(RG(t)) can be found in Lemma 3.2.2, we only need to work out

E(RG(s)RG(t)) = Σ_{(i,j),(k,l)∈G} P(gi(s) ≠ gj(s), gk(t) ≠ gl(t)).
By examining the different ways of placing i, j, k, l, we have

P(gi(s) ≠ gj(s), gk(t) ≠ gl(t)) =
  2s(n−t) / ( n(n−1) ) := q1(s, t)   if i = k, j = l or i = l, j = k;
  s(n−t)(n+2t−2s−2) / ( n(n−1)(n−2) ) := q2(s, t)   if i = k, j ≠ l; i = l, j ≠ k; j = k, i ≠ l; or j = l, i ≠ k;
  4s(n−t)[ (s−1)(n−s−1) + (t−s)(n−s−2) ] / ( n(n−1)(n−2)(n−3) ) := q3(s, t)   if i, j, k, l are all different.
Then

E(RG(s)RG(t)) = Σ_{(i,j),(k,l)∈G} P(gi(s) ≠ gj(s), gk(t) ≠ gl(t))
  = Σ_{(i,j)∈G} q1(s, t) + Σ_{(i,j),(i,k)∈G, j≠k} q2(s, t) + Σ_{(i,j),(k,l)∈G, i,j,k,l all different} q3(s, t)
  = ( q1(s, t) − 2q2(s, t) + q3(s, t) )|G| + ( q2(s, t) − q3(s, t) ) Σ^n_{i=1} |Gi|² + q3(s, t)|G|².

So

lim_{n→∞} E(RG(s)RG(t)) = 4u²(1−v)²|G| + u(1−v)(1−2u)(1−2v) Σ^n_{i=1} |Gi|² + 4uv(1−u)(1−v)|G|².
Together with

lim_{n→∞} E(RG(s)) = 2u(1−u)|G|,
lim_{n→∞} Var(RG(s)) = 4u²(1−u)²|G| + u(1−u)(1−2u)² Σ^n_{i=1} |Gi|²,

and the analogous limits for RG(t), we obtain (4.7).
Remark 4.2.4. ρ*G(u, v), u ≤ v, is partially differentiable in u whenever u ≠ v. Viewing v as fixed and writing f_v(δ) = ρ*G(v − δ, v), the left and right derivatives of f_v at 0 are well defined to any order. We denote the k-th left and right derivatives by f^{(k)}_{v,−}(0) and f^{(k)}_{v,+}(0), respectively. It is not hard to check that f′_{v,−}(0) = −f′_{v,+}(0).
4.3 Asymptotic Approximations
This section studies the asymptotic behavior of the two probabilities (4.1) and (4.2).
We need the function ν(x) defined by

ν(x) = 2x^{−2} exp{ −2 Σ_{m=1}^{∞} m^{−1} Φ( −(1/2) x m^{1/2} ) },  x > 0.   (4.8)

This function is closely related to the Laplace transform of the overshoot over the boundary of a random walk. A simple approximation given in Siegmund and Yakir [2007] is sufficient for numerical purposes:

ν(x) ≈ (2/x)( Φ(x/2) − 0.5 ) / ( (x/2)Φ(x/2) + φ(x/2) ).   (4.9)
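For reference, (4.9) is a one-liner in code. The sketch below is our transcription (not the dissertation's code), using the standard normal Φ and φ from the math module's error function.

```python
# Sketch: the Siegmund-Yakir approximation (4.9) to nu(x).
from math import erf, exp, pi, sqrt

def Phi(x):                      # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):                      # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def nu(x):
    return ((2.0 / x) * (Phi(x / 2.0) - 0.5)
            / ((x / 2.0) * Phi(x / 2.0) + phi(x / 2.0)))
```

As x → 0+, ν(x) → 1, and ν decreases toward 0 as x grows; these limits give a quick sanity check on the transcription.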
The following proposition is the foundation for obtaining analytic approximations to
the probabilities.
Proposition 4.3.1. Assume that n0 → ∞, n1 → ∞, b → ∞, and n → ∞ in such a way that, for some 0 < x0 < x1 < 1 and b0 > 0,

ni/n → xi (i = 0, 1) and b/√n → b0.

Then as n → ∞,

P( max_{n0≤t≤n1} Z*G(t/n) > b ) ∼ b φ(b) ∫_{x0}^{x1} h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) dx,   (4.10)

P( max_{n0≤t2−t1≤n1} Z*G(t1/n, t2/n) > b ) ∼ b³ φ(b) ∫_{x0}^{x1} ( h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) )² (1 − x) dx,   (4.11)

where

h*_{r0,r1}(x) = 1/( 2x(1−x) ) + 2/( 4x(1−x) + (1−2x)²(r1/r0 − 4r0) ),

with r0 ≜ lim_{n→∞} |G|/n and r1 ≜ lim_{n→∞} Σi |Gi|²/n.
Proof. We first show the single change-point case. We adopt Woodroofe's method [Woodroofe, 1976, 1978] of conditioning on the first cross-over:

P( max_{n0≤t≤n1} Z*G(t/n) > b ) = Σ_{n0≤t≤n1} ∫_0^∞ P( Z*G(t/n) ∈ b + dx ) P( max_{n0≤s<t} Z*G(s/n) < b | Z*G(t/n) = b + dx ).   (4.12)
By a change of measure and rearranging terms, we have

P( max_{n0≤t≤n1} Z*G(t/n) > b ) = ( φ(b)/b ) Σ_{n0≤t≤n1} ∫_0^∞ e^{−x − x²/(2b²)} P( max_{n0≤s<t} b( Z*G(s/n) − Z*G(t/n) ) < −x | Z*G(t/n) = b + x/b ) dx.
Since b → ∞, if x ∼ o(b²), then x²/(2b²) is negligible compared to x and x/b is negligible compared to b, while for x of order b² or larger the integrand is negligible, so

P( max_{n0≤t≤n1} Z*G(t/n) > b ) ≈ ( φ(b)/b ) Σ_{n0≤t≤n1} ∫_0^∞ e^{−x} P( max_{n0≤s<t} b( Z*G(s/n) − Z*G(t/n) ) < −x | Z*G(t/n) = b ) dx.
Notice that for u < v,

b( Z*G(u) − Z*G(v) ) | ( Z*G(v) = b ) ∼ N( (ρ*G(u, v) − 1)b², (1 − ρ*G(u, v)²)b² ).

Let δ = v − u. By Taylor expansion, we have

ρ*G(u, v) = 1 + f′_{v,−}(0)δ + f′′_{v,−}(0)δ²/2 + O(δ³),
ρ*G(u, v)² = 1 + 2f′_{v,−}(0)δ + ( (f′_{v,−}(0))² + f′′_{v,−}(0) )δ² + O(δ³).

So for δ ∼ O(n^{−1}),

b( Z*G(u) − Z*G(v) ) | ( Z*G(v) = b ) ∼ N( −f′_{v,−}(0)|δ|b², 2f′_{v,−}(0)|δ|b² ).
One can show that, for b = b0√n and n → ∞,

lim_{k→∞} limsup_{n→∞} Σ_{|i−t|>k} P( Z*G(i/n) > b | Z*G(t/n) = b + dx ) = 0.

Let W^(t)_m be a random walk with W^(t)_1 ∼ N( µ^(t), (σ^(t))² ), where µ^(t) = (1/n) f′_{t/n,−}(0) b² and (σ^(t))² = 2µ^(t). Then

P( max_{n0≤s<t} b( Z*G(s/n) − Z*G(t/n) ) < −x | Z*G(t/n) = b ) ∼ P( max_{n0≤s<t} ( −W^(t)_{t−s} ) < −x ) ∼ P( min_{m≥1} W^(t)_m > x ).
Together with the fact that

∫_0^∞ exp( −2µx/σ² ) P( min_{m≥1} Wm > x ) dx = µ ν(2µ/σ)

for a random walk with W1 ∼ N(µ, σ²) (see Siegmund [1992]), we have

lim_{n→∞} P( max_{n0≤t≤n1} Z*G(t/n) > b ) ≈ lim_{n→∞} ( φ(b)/b ) Σ_{n0≤t≤n1} b0² f′_{t/n,−}(0) ν( b0 √(2f′_{t/n,−}(0)) ).
For f′_{t/n,−}(0), we take the derivative of ρ*G(u, v); after some tedious calculation, we have

f′_{v,−}(0) = 1/( 2v(1−v) ) + 2/( 4v(1−v) + (1−2v)²( Σi |Gi|²/|G| − 4|G|/n ) ).   (4.13)

Putting everything together, we have

lim_{n→∞} P( max_{n0≤t≤n1} Z*G(t/n) > b ) ≈ lim_{n→∞} ( φ(b)/b ) Σ_{n0≤t≤n1} b0² h*_{r0,r1}(t/n) ν( b0 √(2h*_{r0,r1}(t/n)) )
  = ( φ(b)/b ) ∫_{x0}^{x1} b0² h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) n dx
  = b φ(b) ∫_{x0}^{x1} h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) dx.
Now we show the changed interval case, following the method of Siegmund [1988, 1992]. We omit most of the technical details, which follow these two papers, given that

ρG,(u1,u2)(δ1, δ2) ≜ cov( Z*G(u1 − δ1, u2 − δ2), Z*G(u1, u2) )

is differentiable, with the derivative continuous except at δ1 = 0 and at δ2 = 0. A key intermediate form is

P( max_{n0≤t2−t1≤n1} Z*G(t1/n, t2/n) > b ) ≈ ( φ(b)/b ) Σ_{n0≤t2−t1≤n1} C1(t1, t2)b² C2(t1, t2)b² × ν( √(2C1(t1, t2)b²) ) ν( √(2C2(t1, t2)b²) ),   (4.14)
where C1 and C2 are the partial derivatives

C1(nu1, nu2) ≡ (1/n) ∂−ρG,(u1,u2)(δ1, 0)/∂δ1 |_{δ1=0} = −(1/n) ∂+ρG,(u1,u2)(δ1, 0)/∂δ1 |_{δ1=0},
C2(nu1, nu2) ≡ −(1/n) ∂+ρG,(u1,u2)(0, δ2)/∂δ2 |_{δ2=0}.
Under the permutation null, the processes derived by perturbing the left and right end points,

Z*G( (t1 + k)/n, t2/n ), k = . . . , −2, −1, 0, 1, 2, . . . ,

and

Z*G( t1/n, (t2 − k)/n ), k = . . . , −2, −1, 0, 1, 2, . . . ,

are identical in distribution to the process

Z*G( (t2 − t1 − k)/n ), k = . . . , −2, −1, 0, 1, 2, . . . .

Thus, the partial derivatives are equal to the derivative in the one change-point scenario,

C1(t1, t2) = C2(t1, t2) = (1/n) f′_{u2−u1,−}(0).
Substituting (1/n) f′_{u2−u1,−}(0) for C1(t1, t2) and C2(t1, t2) in (4.14), and noting that the double summation converges to an integral as n → ∞, yields (4.11).
Therefore, when Σ_{e∈G} |Ae||Be| ∼ o(n^{3/2}) and |G| ∼ O(n), we approximate (4.1) and (4.2) by

P( max_{n0≤t≤n1} ZG(t) > b ) ∼ b φ(b) ∫_{x0}^{x1} h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) dx,   (4.15)

P( max_{n0≤t2−t1≤n1} ZG(t1, t2) > b ) ∼ b³ φ(b) ∫_{x0}^{x1} ( h*_{r0,r1}(x) ν( b0 √(2h*_{r0,r1}(x)) ) )² (1 − x) dx.   (4.16)
Remark 4.3.2. In practice, when we use (4.15) and (4.16) to approximate the probabilities, we use the form of h*_{r0,r1}(x) before taking the limit:

hG(n, x) = (n−1)[ h1(n, x)|G| + h2(n, x) Σ^n_{i=1}|Gi|² − h3(n, x)|G|² ] / ( 2x(1−x)[ h4(n, x)|G| + h5(n, x) Σ^n_{i=1}|Gi|² − h6(n, x)|G|² ] ),   (4.17)

where

h1(n, x) = 4n(n−1)( −2nx² + 2nx − 1 ),
h2(n, x) = n[ n(n+1)(1−2x)² − 2(n−1) ],
h3(n, x) = 4n[ n(1−2x)² − 1 ],
h4(n, x) = 4n(n−1)(nx−1)(n−nx−1),
h5(n, x) = n(n−1)[ n²(1−2x)² − n + 2 ],
h6(n, x) = 4n[ n²(1−2x)² − 2n(1−3x+3x²) + 1 ].
4.4 Skewness Correction
Convergence of ZG(t) to normality is slow if t/n is close to 0 or 1. Also, the MST and NNG constructed on high-dimensional data can be dominated by hubs under standard distance measures, such as L2 or L1. We see this to be true, for example, under the L2 distance when the dimension is 100 in the simulations of Section 4.5. In our simulations we have noticed that if the underlying graph is free of hubs, then the statistic ZG(t) is right-skewed, and the approximations (4.15) and (4.16) underestimate the true tail probabilities. If the graph is dominated by hubs, then the statistic is left-skewed, and the approximations overestimate the tail probabilities. The effects of skewness are explored in more detail in Appendix B.2.
By incorporating the skewness of the marginal distributions of the two processes, the two tail probabilities can be approximated by

P( max_{n0≤t≤n1} ZG(t) > b ) ≈ b φ(b) ∫_{x0}^{x1} SG(nx) hG(n, x) ν( √(2b0² hG(n, x)) ) dx,   (4.18)

where

SG(t) = exp( (1/2)(b − θb,G(t))² + (1/6)γG(t)θb,G(t)³ ) / √( 1 + γG(t)θb,G(t) ),   (4.19)

with γG(t) = E[Z³G(t)] and θb,G(t) = ( −1 + √(1 + 2γG(t)b) ) / γG(t).
Similarly,

P( max_{n0≤t2−t1≤n1} ZG(t1, t2) > b ) ≈ ( φ(b)/b ) Σ_{n0≤t2−t1≤n1} SG(t1, t2) ( b0² hG(n, (t2−t1)/n) ν( b0 √(2hG(n, (t2−t1)/n)) ) )²,   (4.20)

where

SG(t1, t2) = exp( (1/2)(b − θb,G(t1, t2))² + (1/6)γG(t1, t2)θb,G(t1, t2)³ ) / √( 1 + γG(t1, t2)θb,G(t1, t2) ),   (4.21)

with γG(t1, t2) = E[Z³G(t1, t2)] and θb,G(t1, t2) = ( −1 + √(1 + 2γG(t1, t2)b) ) / γG(t1, t2).
We next show how (4.18) and (4.20) are derived and give explicit expressions for
γG(t) and γG(t1, t2).
4.4.1 Derivation of (4.18) and (4.20)
Correcting skewness in the change-point setting was first carried out in Tu et al. [1999] and later modified in Tang and Siegmund [2001]. Both apply a universal third-moment correction. In our problem, the extent of the skewness of ZG(t) depends on how close t is to the ends, so the correction needs to differ across t. Taking the single change-point case as an example, we consider the third moment in calculating the marginal probability P(ZG(t) ∈ b + dx/b), and we show how to incorporate the third moment through cumulant generating functions; the same treatment applies to the changed interval case.

In the derivation below we suppress the dependence on the graph G and the time parameter t. Consider the probability measure dQθ = e^{θZ−ψ(θ)} dP, where ψ(θ) = log E_P(e^{θZ}). Choose θb such that ψ′(θb) = E_{Qθb}(Z) = b. Then,

P(Z ∈ b + dx/b) = E_P( 1{Z ∈ b + dx/b} ) ≈ e^{−θb(b + x/b) + ψ(θb)} Q_{θb}(Z ∈ b + dx/b).   (4.22)
Since under Qθb the variable Z is centered at b with variance ψ′′(θb), Qθb(Z ∈ b + dx/b) can be approximated by the normal density,

Qθb(Z ∈ b + dx/b) ≈ ( 1/√(2πψ′′(θb)) ) exp( −x²/(2b²ψ′′(θb)) ) ≈ 1/√(2πψ′′(θb)).   (4.23)

The second approximation above is accurate for x/b → 0.
To obtain ψ(θb) and ψ′′(θb), we use Taylor expansions, noting that ψ(0) = ψ′(0) = 0, ψ′′(0) = 1, and ψ′′′(0) = E_P(Z³) ≜ γ:

ψ(θ) ≈ ψ(0) + ψ′(0)θ + ψ′′(0)θ²/2 + ψ′′′(0)θ³/6 = (θ²/2)( 1 + γθ/3 ),   (4.24)
ψ′′(θ) ≈ ψ′′(0) + ψ′′′(0)θ = 1 + γθ.   (4.25)
Combining (4.22), (4.23), (4.24), and (4.25) gives

P(Z ∈ b + dx/b) ≈ ( 1/√(2π(1 + γθb)) ) exp( −θb b − xθb/b + θb²(1 + γθb/3)/2 ).   (4.26)

For an approximation of θb, we solve ψ′(θb) = b by approximating ψ′ up to second order,

b = ψ′(θb) ≈ ψ′(0) + ψ′′(0)θb + ψ′′′(0)θb²/2 = θb + (1/2)γθb²,   (4.27)
yielding

θb ≈ ( −1 + √(1 + 2γb) ) / γ.   (4.28)

Note that in the limit γ → 0, θb = b. (4.18) follows by using (4.26) in (4.12) in the proof of Proposition 4.3.1 and approximating the θb x/b term in the exponent by x.
Remark 4.4.1. The term θb(t) is an approximation to the solution of ψ′t(θ) = b, where ψt(θ) is the cumulant generating function of Z(t). By a third-order Taylor approximation to ψt(θ), we have (ψ′t)^{−1}(b) ≈ ( −1 + √(1 + 2γ(t)b) ) / γ(t). When the marginal distribution is left-skewed, γ(t) can be too small for 1 + 2γ(t)b to be positive. This does not mean that the solution to ψ′t(θ) = b does not exist, but rather that higher moments are needed to obtain a good approximation. Here, we apply a simple heuristic fix to this problem: since 1 + 2γ(t)b < 0 usually happens when t/n is close to 0 or 1, within this problematic region θb(t) can be extrapolated using its values outside the region. The details of the extrapolation method are given in Appendix B.2.
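The quantities θb of (4.28) and the correction factor S of (4.19) are simple to compute once γ is known. The following sketch is ours (the function names are made up for illustration); it includes the degenerate γ → 0 case and flags the 1 + 2γb ≤ 0 region that Remark 4.4.1 handles by extrapolation.

```python
# Sketch: theta_b of (4.28) and the skewness-correction factor S of (4.19).
# gamma = E[Z_G^3(t)] must be supplied (e.g., via Lemma 4.4.2).
import math

def theta_b(b, gamma):
    if abs(gamma) < 1e-12:
        return b                          # gamma -> 0 recovers theta_b = b
    disc = 1.0 + 2.0 * gamma * b
    if disc <= 0.0:                       # region handled by extrapolation
        return None
    return (-1.0 + math.sqrt(disc)) / gamma

def skew_correction(b, gamma):
    th = theta_b(b, gamma)
    if th is None:
        return None
    return (math.exp(0.5 * (b - th) ** 2 + gamma * th ** 3 / 6.0)
            / math.sqrt(1.0 + gamma * th))
```

With γ = 0 the factor is exactly 1, recovering the uncorrected approximation; for instance, a right skew of γ = 0.1 at b = 3 gives θb ≈ 2.65 and a correction factor of about 1.29, inflating the tail estimate as expected.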
4.4.2 Explicit Expressions for Skewness
We have

E(Z³G(t)) = [ (E RG(t))³ + 3 E(RG(t)) Var(RG(t)) − E(R³G(t)) ] / ( Var(RG(t)) )^{3/2},
E(Z³G(t1, t2)) = [ (E RG(t1, t2))³ + 3 E(RG(t1, t2)) Var(RG(t1, t2)) − E(R³G(t1, t2)) ] / ( Var(RG(t1, t2)) )^{3/2}.

The explicit expressions for E(RG(t)), Var(RG(t)), E(RG(t1, t2)), and Var(RG(t1, t2)) are given in Lemmas 3.2.2 and 3.2.4. The explicit expressions for E(R³G(t)) and E(R³G(t1, t2)) are given in the following lemma.
Lemma 4.4.2.

E(R³G(t)) = p1(t)|G| + (3/2)p1(t) Σi |Gi|(|Gi|−1)
  + 3p2(t) ( |G|(|G|−1) + (1/2) Σi |Gi|(|Gi|−1)(|G|−|Gi|) )
  − 3p2(t) ( Σi |Gi|(|Gi|−1) + Σ_{(i,j)∈G} (|Gi|−1)(|Gj|−1) )
  + p3(t) Σi |Gi|(|Gi|−1)(|Gi|−2)
  + p4(t) ( |G|(|G|−1)(|G|−2) + 6 Σ_{(i,j)∈G} (|Gi|−1)(|Gj|−1) )
  − 2p4(t) Σ_{(i,j)∈G} |{k : (i,k), (j,k) ∈ G}|
  − p4(t) Σi |Gi|(|Gi|−1)(3|G|−2|Gi|−2).

The functions p1(t) and p2(t) are given in Lemma 3.2.2, and

p3(t) := t(n−t)( (n−t−1)(n−t−2) + (t−1)(t−2) ) / ( n(n−1)(n−2)(n−3) ),
p4(t) := 8t(t−1)(t−2)(n−t)(n−t−1)(n−t−2) / ( n(n−1)(n−2)(n−3)(n−4)(n−5) ).

Also,

E(R³G(t1, t2)) = E(R³G(t2 − t1)).   (4.29)
The terms in E(R³G(t)) can be rearranged and written in other forms. The expansion shown here makes it easier to trace the origin of each term in the context of the proof.
Proof. For the un-centered process RG(t),

E(R³G(t)) = Σ_{(i,j),(k,l),(u,v)∈G} P(gi(t) ≠ gj(t), gk(t) ≠ gl(t), gu(t) ≠ gv(t)).
There are in total eight different configurations for three edges chosen from the graph, and we derive P3 ≜ P(gi(t) ≠ gj(t), gk(t) ≠ gl(t), gu(t) ≠ gv(t)) separately for each of them.

1) The three edges are the same edge:
P3 = P(gi(t) ≠ gj(t)) = 2t(n−t) / ( n(n−1) ).

2) Two edges are the same and share one node with the third edge:
P3 = P(gi(t) ≠ gj(t), gi(t) ≠ gk(t)) = t(n−t) / ( n(n−1) ).

3) Two edges are the same and share no node with the third edge:
P3 = P(gi(t) ≠ gj(t), gk(t) ≠ gl(t)) = 4t(t−1)(n−t)(n−t−1) / ( n(n−1)(n−2)(n−3) ).

4) The three edges share one common node and are otherwise disjoint (star-shaped):
P3 = P(gi(t) ≠ gj(t), gi(t) ≠ gk(t), gi(t) ≠ gl(t)) = t(n−t)( (n−t−1)(n−t−2) + (t−1)(t−2) ) / ( n(n−1)(n−2)(n−3) ).

5) One edge shares one node with a second edge and its other node with the third edge, with no node shared between the second and third edges (a linear chain):
P3 = P(gi(t) ≠ gj(t), gi(t) ≠ gk(t), gj(t) ≠ gl(t)) = 2t(t−1)(n−t)(n−t−1) / ( n(n−1)(n−2)(n−3) ).

6) The three edges form a triangle:
P3 = P(gi(t) ≠ gj(t), gj(t) ≠ gk(t), gk(t) ≠ gi(t)) = 0.

7) Two edges share one node and neither shares a node with the third edge:
P3 = P(gi(t) ≠ gj(t), gi(t) ≠ gk(t), gu(t) ≠ gv(t)) = 2t(t−1)(n−t)(n−t−1) / ( n(n−1)(n−2)(n−3) ).

8) No pair of the three edges shares any node:
P3 = P(gi(t) ≠ gj(t), gk(t) ≠ gl(t), gu(t) ≠ gv(t)) = 8t(t−1)(t−2)(n−t)(n−t−1)(n−t−2) / ( n(n−1)(n−2)(n−3)(n−4)(n−5) ).
The following lists the number of occurrences of each of the above cases:

1) $|G|$

2) $3\sum_i |G_i|(|G_i| - 1)$

3) $3|G|(|G| - 1) - 3\sum_i |G_i|(|G_i| - 1)$

4) $\sum_i |G_i|(|G_i| - 1)(|G_i| - 2)$

5) $6\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1) - 6\sum_{(i,j)\in G} |\{k : (i,k), (j,k) \in G\}|$

6) $2\sum_{(i,j)\in G} |\{k : (i,k), (j,k) \in G\}|$

7) $3\sum_i |G_i|(|G_i| - 1)(|G| - |G_i|) + 6\sum_{(i,j)\in G} |\{k : (i,k), (j,k) \in G\}| - 12\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1)$

8) $|G|(|G| - 1)(|G| - 2) + 6\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1) - 2\sum_{(i,j)\in G} |\{k : (i,k), (j,k) \in G\}| - \sum_i |G_i|(|G_i| - 1)(3|G| - 2|G_i| - 2)$
The lemma follows by summing, over the eight cases, the product of each probability and its number of occurrences.

It is not hard to observe that, for the changed interval statistic, these quantities depend only on the length of the interval, so $\mathbb{E}(R_G^3(t_1, t_2)) = \mathbb{E}(R_G^3(t_2 - t_1))$.
For MDP, only cases 1, 3, and 8 are possible, and the numbers of occurrences are:

Case 1: $|G| = n$

Case 3: $3|G|(|G| - 1) = 3n(n - 1)$

Case 8: $|G|(|G| - 1)(|G| - 2) = n(n - 1)(n - 2)$

By summing the probabilities of these three cases, we have a much simpler expression for $\mathbb{E}(R_G^3(t))$ for MDP:
$$\mathbb{E}(R_G^3(t)) = p_1(t)\,n + 3p_2(t)\,n(n-1) + p_4(t)\,n(n-1)(n-2) = \big(p_1(t) - 3p_2(t) + 2p_4(t)\big)\,n + 3\big(p_2(t) - p_4(t)\big)\,n^2 + p_4(t)\,n^3.$$
4.5 Numerical Studies
To check the analytic approximations to p-values, we compare the critical values
obtained from (4.15), (4.18), (4.16), and (4.20) to those obtained from permutation,
under various simulation settings. In each simulation, an i.i.d. sequence of length 1000 was generated from a given distribution $F_0$ on $\mathbb{R}^d$. The MST, MDP, and NNG were constructed on the data using Euclidean distance. For each graph, analytic and
permutation critical values were computed for both 0.05 and 0.01 p-value thresholds.
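As a concrete illustration of how permutation critical values can be obtained for a graph-based scan, the following Python sketch computes the permutation distribution of $\max_t Z(t)$. It is our own simplified stand-in, not the dissertation's code: it standardizes $R_G(t)$ using Monte Carlo estimates of the permutation mean and standard deviation (reusing the same permutations for standardization and for the null maximum), rather than the exact moment formulas derived earlier.

```python
import numpy as np

def scan_crossings(edges, order, ts):
    """R_G(t) for each t in ts: number of edges with exactly one endpoint
    among the first t observations under the given observation order."""
    pos = np.empty(len(order), dtype=int)
    pos[order] = np.arange(len(order))   # pos[node] = time index of that node
    a = pos[edges[:, 0]][:, None] < ts   # first endpoint observed before t?
    b = pos[edges[:, 1]][:, None] < ts   # second endpoint observed before t?
    return (a != b).sum(axis=0)

def perm_critical_value(edges, n, n0, alpha=0.05, B=2000, seed=0):
    """Approximate (1 - alpha) critical value of max_t Z(t) by permutation."""
    rng = np.random.default_rng(seed)
    ts = np.arange(n0, n - n0 + 1)       # search region: n0 <= t <= n - n0
    R = np.stack([scan_crossings(edges, rng.permutation(n), ts)
                  for _ in range(B)])
    mu, sd = R.mean(axis=0), R.std(axis=0)
    Z = (mu - R) / sd                    # a change-point *reduces* crossings
    return np.quantile(Z.max(axis=1), 1 - alpha)
```

For example, with `edges` set to a chain graph (the one-dimensional MST), `perm_critical_value(edges, n, n0)` plays the role of the "Per" columns in the tables below.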
4.5.1 Single Change-Point Alternative
Tables 4.1 - 4.5 show the results for the single change-point alternative. In the column
headers, “A1” denotes the critical values obtained assuming Gaussianity (4.15), “A2”
denotes the critical values obtained after correcting for skewness (4.18), and “Per”
denotes the critical values obtained by 10,000 permutations.
Six different choices for $F_0$ are shown: two different distributions (standard normal and exponential with mean 1), each in three different dimensions ($d = 1$, $10$, or $100$). For $d = 10$ or $100$, each element of the data vector is generated independently
from the given distribution. The analytic approximations depend also on constraints
on the region in which the change-point is searched. These are reflected in the choice
of n0 and n1 (l0 and l1 for the changed interval alternative). To make things simple,
we set n1 = n − n0, so that we only allow the case that both groups have at least
n0 observations. In general, the analytic approximations become less precise when
the minimum segment length decreases. This is mainly because the Gaussian ap-
proximation (and skewness correction) to the distribution of Z(t) degrades for small
samples.
Both the analytic and permutation p-values depend on certain characteristics of
the graph’s structure. The structures of MST (for d ≥ 2) and NNG depend on the
underlying data set, and thus the critical values vary by simulation run. In such
cases, we show results for 5 randomly simulated sequences. Two characteristics of
the graph are also shown for each simulated sequence: the sum of squared node degrees ($\sum_i |E_i|^2$) and the maximum node degree ($D$). These quantities give some
intuition on the size and density of hubs in the graph. Since the MST for any one-
dimensional data set is a chain, the critical values for MST-based scan do not change
with simulation run for each setting of the parameters.
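To illustrate how these two graph characteristics can be computed, here is a short sketch (our own illustration using SciPy, not the simulation code used in the study) that builds the Euclidean MST of a data set and returns $\sum_i |E_i|^2$ and $D$:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_degree_stats(y):
    """Build the Euclidean MST on the rows of y and return
    (sum of squared node degrees, maximum node degree)."""
    dist = squareform(pdist(y))                  # dense pairwise distance matrix
    mst = minimum_spanning_tree(dist).toarray()  # nonzero entries are MST edges
    adj = (mst + mst.T) > 0                      # symmetrize to an adjacency matrix
    deg = adj.sum(axis=1)                        # node degrees |E_i|
    return int((deg ** 2).sum()), int(deg.max())
```

For one-dimensional data the MST is a chain, so the maximum degree is 2, matching the $D$ column for $d = 1$ in the tables.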
The relevant structure of the MDP graph is the same for every data set: each node has degree one. Therefore, the critical values for the MDP-based scan depend only on $n$, $n_0$, and $n_1$ ($l_0$ and $l_1$ for the changed interval alternative), not on the dimension or the underlying distribution of the data. As emphasized in Rosenbaum [2005], it is a truly distribution-free method, which can be desirable in high dimensions.
We can see from the tables that the analytic approximations after skewness correction perform much better than the analytic approximations under the Gaussian assumption, especially as dimension increases. The accuracy of the skew-corrected
approximation does not degrade significantly with dimension. For the statistics based
on MDP, the skew-corrected approximations work quite well when the minimum win-
dow size is as small as 25 at 0.05 significance level, and 50 at 0.01 significance level.
For the statistics based on MST and NNG, the skew-corrected approximations remain
accurate for window sizes as small as 25 at both 0.05 and 0.01 significance levels.
There is not much difference between results for simulations based on normal and
those based on exponential distributions. The main factor influencing approximation
accuracy, other than the minimum window size, is the dimension (d). As dimension
increases, the graph becomes more "star-shaped," as reflected by the increase in both $\sum_i |E_i|^2$ and $D$. As shown in Section 4.4, the skewness and other higher-order moments of $Z(t)$ are polynomials in the node degrees. Thus the increase in the number and density of hubs makes the skewness correction important in high dimensions.
Table 4.1: Critical values for the single change-point scan statistic based on MST at 0.05 significance level. n = 1000.

Critical Values                                                Graph
       n0 = 100          n0 = 50           n0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
d = 1 2.98 3.05 3.04 3.08 3.22 3.23 3.14 3.39 3.49 4994 22.92 2.90 2.90 3.00 2.95 2.95 3.05 2.98 2.96 5430 8
N(0,1) 2.92 2.89 2.89 3.00 2.95 2.92 3.05 2.97 2.95 5438 7d = 10 2.92 2.90 2.87 3.00 2.95 2.94 3.05 2.98 2.96 5394 7
2.92 2.89 2.86 3.00 2.94 2.90 3.05 2.97 2.92 5534 82.92 2.89 2.89 3.00 2.95 2.92 3.05 2.97 2.95 5460 72.93 2.91 2.89 3.01 2.97 2.96 3.06 3.00 2.97 5064 7
Exp(1) 2.93 2.91 2.88 3.01 2.97 2.92 3.06 3.00 2.95 5082 7d = 10 2.93 2.91 2.91 3.01 2.98 2.97 3.06 3.01 3.00 5028 5
2.93 2.91 2.87 3.01 2.98 2.93 3.06 3.01 2.97 5028 62.93 2.91 2.88 3.01 2.96 2.92 3.06 2.98 2.94 5180 92.86 2.69 2.68 2.94 2.70 2.68 3.00 2.70 2.68 12454 38
N(0,1) 2.86 2.72 2.72 2.95 2.74 2.72 3.00 2.74 2.72 10904 38d = 100 2.86 2.70 2.66 2.94 2.71 2.66 3.00 2.71 2.66 11294 42
2.87 2.72 2.68 2.95 2.74 2.68 3.00 2.74 2.68 10690 402.86 2.69 2.65 2.94 2.70 2.65 3.00 2.70 2.65 11722 402.85 2.64 2.60 2.93 2.65 2.60 2.99 2.65 2.60 14706 56
Exp(1) 2.87 2.77 2.76 2.95 2.80 2.77 3.01 2.81 2.77 9608 25d = 100 2.84 2.62 2.53 2.93 2.62 2.53 2.99 2.62 2.53 15536 77
2.86 2.74 2.69 2.95 2.76 2.69 3.00 2.76 2.69 10890 302.86 2.72 2.66 2.94 2.73 2.66 3.00 2.73 2.66 12018 39
4.5.2 Changed Interval Alternative
Tables 4.6 - 4.10 show the results of p-value approximations for the changed interval
alternative. The notation and simulation settings are identical to those for the single
change-point alternative in Section 4.5, except that n0 is replaced by l0 for the smallest
window size. (l1 is set to n− l0.)
Table 4.2: Critical values for the single change-point scan statistic based on MST at 0.01 significance level. n = 1000.

Critical Values                                                Graph
       n0 = 100          n0 = 50           n0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
d = 1 3.52 3.62 3.67 3.60 3.81 3.85 3.65 4.05 4.31 4994 23.47 3.43 3.46 3.53 3.46 3.48 3.57 3.48 3.48 5430 8
N(0,1) 3.47 3.43 3.44 3.53 3.46 3.46 3.57 3.47 3.46 5438 7d = 10 3.47 3.43 3.44 3.53 3.46 3.47 3.58 3.48 3.48 5394 7
3.47 3.42 3.38 3.53 3.46 3.40 3.57 3.47 3.41 5534 83.47 3.43 3.44 3.53 3.46 3.46 3.57 3.47 3.46 5460 73.48 3.45 3.40 3.54 3.49 3.44 3.58 3.50 3.45 5064 7
Exp(1) 3.48 3.44 3.40 3.54 3.48 3.42 3.58 3.50 3.44 5082 7d = 10 3.48 3.45 3.47 3.54 3.49 3.49 3.58 3.51 3.52 5028 5
3.48 3.45 3.41 3.54 3.49 3.44 3.58 3.51 3.46 5028 63.48 3.44 3.49 3.54 3.47 3.53 3.58 3.48 3.54 5180 93.42 3.17 3.19 3.48 3.17 3.19 3.53 3.17 3.19 12454 38
N(0,1) 3.42 3.21 3.24 3.49 3.21 3.24 3.53 3.21 3.24 10904 38d = 100 3.42 3.19 3.17 3.49 3.19 3.17 3.53 3.19 3.17 11294 42
3.42 3.22 3.18 3.49 3.22 3.18 3.53 3.22 3.18 10690 403.42 3.18 3.21 3.49 3.18 3.21 3.53 3.18 3.21 11722 403.41 3.14 3.12 3.48 3.14 3.12 3.52 3.14 3.12 14706 56
Exp(1) 3.43 3.28 3.26 3.49 3.28 3.26 3.54 3.28 3.26 9608 25d = 100 3.41 3.15 3.10 3.48 3.15 3.10 3.52 3.15 3.10 15536 77
3.42 3.24 3.21 3.49 3.24 3.21 3.53 3.24 3.21 10890 303.42 3.22 3.13 3.48 3.22 3.13 3.53 3.22 3.13 12018 39
Table 4.3: Critical values for the single change-point scan statistic based on MDP. n = 1000.

significance level = 0.05

                     d = 1              d = 10             d = 100
n0    A1    A2     N(0,1)  Exp(1)     N(0,1)  Exp(1)     N(0,1)  Exp(1)
200   2.82  2.84   2.83    2.81       2.85    2.85       2.85    2.83
100   2.98  3.07   3.06    3.04       3.08    3.08       3.07    3.05
50    3.08  3.27   3.30    3.29       3.35    3.36       3.35    3.31
25    3.14  3.48   3.54    3.58       3.57    3.66       3.60    3.60

significance level = 0.01

                     d = 1              d = 10             d = 100
n0    A1    A2     N(0,1)  Exp(1)     N(0,1)  Exp(1)     N(0,1)  Exp(1)
200   3.38  3.43   3.39    3.38       3.44    3.46       3.45    3.44
100   3.52  3.66   3.66    3.64       3.67    3.75       3.67    3.59
50    3.60  3.90   3.99    3.99       3.94    4.05       3.95    3.99
25    3.65  4.21   4.61    4.65       4.78    4.72       4.59    4.81
From the tables, conclusions similar to those for the single change-point alternative can be drawn. The analytic approximation after skewness correction performs much better than the analytic approximation under the Gaussian assumption, especially as dimension increases. The accuracy of the skew-corrected approximation does not degrade significantly with dimension. It does well for the MST- and NNG-based tests when the smallest window size considered is as small as 25 at both 0.05 and 0.01 significance levels, and for the MDP-based test when the smallest window size is 50.
Table 4.4: Critical values for the single change-point scan statistic based on NNG at 0.05 significance level. n = 1000.

Critical Values                                                Graph
       n0 = 100          n0 = 50           n0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
2.96 2.98 2.95 3.04 3.07 3.03 3.10 3.13 3.08 2008 2N(0,1) 2.96 2.98 2.97 3.05 3.07 3.05 3.10 3.13 3.09 1972 2d = 1 2.96 2.98 3.01 3.04 3.07 3.10 3.10 3.12 3.13 2032 2
2.96 2.98 2.97 3.04 3.07 3.04 3.10 3.13 3.09 2008 22.96 2.99 3.01 3.05 3.08 3.10 3.10 3.13 3.13 1954 22.96 2.99 2.98 3.05 3.08 3.08 3.10 3.13 3.11 1948 2
Exp(1) 2.96 2.98 2.96 3.04 3.07 3.07 3.10 3.13 3.11 2038 2d = 1 2.96 2.98 2.96 3.04 3.08 3.05 3.10 3.13 3.09 2014 2
2.96 2.98 2.99 3.04 3.07 3.08 3.10 3.12 3.13 2008 22.96 2.98 2.99 3.04 3.08 3.08 3.10 3.13 3.13 2038 22.94 2.92 2.89 3.02 2.97 2.93 3.07 3.00 2.96 3370 6
N(0,1) 2.94 2.91 2.90 3.02 2.97 2.95 3.07 2.99 2.96 3502 6d = 10 2.94 2.91 2.89 3.01 2.96 2.95 3.06 2.98 2.96 3444 7
2.94 2.91 2.91 3.01 2.96 2.94 3.06 2.98 2.96 3436 62.94 2.91 2.88 3.02 2.97 2.93 3.07 2.99 2.94 3330 62.94 2.92 2.91 3.02 2.98 2.96 3.07 3.00 2.98 3144 5
Exp(1) 2.94 2.92 2.92 3.02 2.98 2.97 3.07 3.00 2.99 3096 6d = 10 2.94 2.92 2.92 3.02 2.98 2.98 3.07 3.01 3.01 3118 6
2.94 2.93 2.92 3.02 2.98 2.97 3.07 3.01 2.99 3114 52.94 2.92 2.91 3.02 2.98 2.98 3.07 3.01 3.00 3152 62.87 2.65 2.62 2.95 2.65 2.62 3.00 2.65 2.62 9382 52
N(0,1) 2.87 2.73 2.70 2.95 2.75 2.71 3.01 2.76 2.71 8466 24d = 100 2.88 2.76 2.72 2.96 2.78 2.72 3.01 2.79 2.72 7756 20
2.86 2.59 2.56 2.94 2.59 2.56 3.00 2.59 2.56 11092 682.87 2.68 2.64 2.95 2.69 2.64 3.00 2.69 2.64 9538 382.86 2.71 2.70 2.95 2.72 2.70 3.00 2.73 2.70 10222 34
Exp(1) 2.86 2.72 2.68 2.95 2.74 2.69 3.00 2.74 2.69 10390 37d = 100 2.86 2.70 2.64 2.94 2.71 2.64 3.00 2.71 2.64 11574 35
2.87 2.74 2.72 2.95 2.76 2.73 3.01 2.77 2.73 8782 222.87 2.73 2.68 2.95 2.74 2.68 3.01 2.74 2.68 8622 41
Table 4.5: Critical values for the single change-point scan statistic based on NNG at 0.01 significance level. n = 1000.

Critical Values                                                Graph
       n0 = 100          n0 = 50           n0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
3.50 3.53 3.53 3.57 3.61 3.59 3.61 3.65 3.63 2008 2N(0,1) 3.50 3.54 3.52 3.57 3.61 3.63 3.61 3.65 3.66 1972 2d = 1 3.50 3.53 3.58 3.57 3.61 3.66 3.61 3.65 3.71 2032 2
3.50 3.53 3.56 3.57 3.61 3.63 3.61 3.65 3.68 2008 23.50 3.54 3.53 3.57 3.62 3.64 3.61 3.66 3.65 1954 23.50 3.54 3.50 3.57 3.62 3.61 3.61 3.66 3.64 1948 2
Exp(1) 3.50 3.53 3.57 3.57 3.61 3.63 3.61 3.65 3.65 2038 2d = 1 3.50 3.54 3.52 3.57 3.61 3.63 3.61 3.66 3.66 2014 2
3.50 3.53 3.60 3.57 3.61 3.66 3.61 3.65 3.71 2008 23.50 3.54 3.54 3.57 3.62 3.58 3.61 3.66 3.66 2038 23.48 3.45 3.46 3.55 3.49 3.48 3.59 3.50 3.49 3370 6
N(0,1) 3.48 3.44 3.47 3.54 3.48 3.48 3.59 3.49 3.48 3502 6d = 10 3.48 3.44 3.42 3.54 3.47 3.45 3.58 3.48 3.46 3444 7
3.48 3.44 3.43 3.54 3.47 3.46 3.59 3.48 3.47 3436 63.48 3.44 3.44 3.55 3.48 3.48 3.59 3.49 3.48 3330 63.49 3.45 3.46 3.55 3.49 3.51 3.59 3.50 3.51 3144 5
Exp(1) 3.49 3.45 3.48 3.55 3.49 3.52 3.59 3.50 3.52 3096 6d = 10 3.49 3.46 3.48 3.55 3.49 3.54 3.59 3.51 3.57 3118 6
3.49 3.46 3.41 3.55 3.50 3.46 3.59 3.51 3.46 3114 53.49 3.46 3.49 3.55 3.49 3.52 3.59 3.51 3.53 3152 63.42 3.13 3.07 3.49 3.13 3.07 3.54 3.13 3.07 9382 52
N(0,1) 3.43 3.21 3.19 3.50 3.21 3.19 3.54 3.21 3.19 8466 24d = 100 3.44 3.25 3.23 3.50 3.25 3.23 3.54 3.25 3.23 7756 20
3.42 3.09 3.08 3.48 3.09 3.08 3.53 3.09 3.08 11092 683.42 3.16 3.16 3.49 3.16 3.16 3.54 3.16 3.16 9538 383.42 3.20 3.19 3.49 3.20 3.19 3.53 3.20 3.19 10222 34
Exp(1) 3.42 3.22 3.21 3.49 3.22 3.21 3.53 3.22 3.21 10390 37d = 100 3.42 3.18 3.17 3.48 3.18 3.17 3.53 3.18 3.17 11574 35
3.43 3.23 3.23 3.49 3.23 3.23 3.54 3.23 3.23 8782 223.43 3.22 3.24 3.50 3.22 3.24 3.54 3.22 3.24 8622 41
Table 4.6: Critical values for the changed interval scan statistic based on MST at 0.05 significance level. n = 1000.

Critical Values                                                Graph
       l0 = 100          l0 = 50           l0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
d = 1 4.08 4.29 4.24 4.22 4.76 4.73 4.33 5.44 5.77 4994 23.97 3.89 3.84 4.07 3.92 3.89 4.16 3.93 3.89 5454 8
N(0,1) 3.97 3.91 3.81 4.07 3.95 3.85 4.16 3.97 3.87 5400 7d = 10 3.97 3.90 3.81 4.07 3.93 3.90 4.16 3.94 3.91 5448 8
3.97 3.90 3.91 4.07 3.94 3.93 4.16 3.95 3.94 5440 73.97 3.89 3.82 4.07 3.91 3.85 4.15 3.93 3.85 5524 83.99 3.93 3.86 4.09 3.97 3.92 4.17 3.99 3.95 5042 8
Exp(1) 3.99 3.93 3.84 4.09 3.96 3.90 4.17 4.00 3.92 5040 6d = 10 3.99 3.93 3.85 4.09 3.97 3.91 4.17 4.00 3.93 5106 6
3.99 3.93 3.82 4.09 3.97 3.87 4.17 3.99 3.91 5042 63.99 3.91 3.94 4.08 3.95 3.98 4.17 3.97 3.98 5126 83.87 3.51 3.52 3.98 3.51 3.52 4.09 3.51 3.52 11600 40
N(0,1) 3.86 3.49 3.55 3.98 3.49 3.55 4.08 3.49 3.55 13346 64d = 100 3.88 3.57 3.66 3.99 3.57 3.66 4.09 3.57 3.66 10422 34
3.88 3.57 3.58 3.99 3.57 3.58 4.09 3.57 3.58 10804 433.88 3.56 3.58 3.99 3.56 3.58 4.09 3.56 3.58 10862 363.88 3.63 3.59 3.99 3.63 3.59 4.09 3.63 3.59 10384 24
Exp(1) 3.87 3.58 3.49 3.98 3.58 3.49 4.09 3.58 3.49 11922 33d = 100 3.88 3.60 3.63 3.99 3.60 3.63 4.09 3.60 3.63 11194 34
3.89 3.63 3.55 4.00 3.63 3.55 4.10 3.63 3.55 9680 273.88 3.62 3.60 3.99 3.62 3.60 4.09 3.62 3.60 10468 29
Table 4.7: Critical values for the changed interval scan statistic based on MST at 0.01 significance level. n = 1000.

Critical Values                                                Graph
       l0 = 100          l0 = 50           l0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
d = 1 4.51 4.78 4.73 4.63 5.31 5.30 4.72 6.08 6.65 4994 24.42 4.32 4.31 4.50 4.33 4.33 4.58 4.33 4.33 5454 8
N(0,1) 4.42 4.34 4.22 4.51 4.36 4.25 4.58 4.37 4.25 5400 7d = 10 4.42 4.33 4.20 4.50 4.51 4.25 4.58 4.35 4.29 5448 8
4.42 4.34 4.36 4.50 4.32 4.36 4.58 4.36 4.36 5440 74.42 4.32 4.31 4.50 4.33 4.32 4.57 4.33 4.32 5524 84.43 4.36 4.36 4.52 4.39 4.36 4.59 4.39 4.36 5042 8
Exp(1) 4.43 4.36 4.30 4.52 4.39 4.36 4.59 4.40 4.36 5040 6d = 10 4.43 4.36 4.32 4.52 4.39 4.38 4.59 4.40 4.44 5106 6
4.43 4.36 4.27 4.52 4.39 4.33 4.59 4.39 4.33 5042 64.43 4.35 4.35 4.52 4.37 4.35 4.59 4.37 4.35 5126 84.34 3.99 4.28 4.43 3.99 4.28 4.52 3.99 4.28 11600 40
N(0,1) 4.33 3.98 3.95 4.42 3.98 3.95 4.51 3.98 3.95 13346 64d = 100 4.34 4.04 4.12 4.44 4.04 4.12 4.52 4.04 4.12 10422 34
4.34 4.05 4.22 4.43 4.05 4.22 4.52 4.05 4.22 10804 434.34 4.03 4.00 4.43 4.03 4.00 4.52 4.03 4.00 10862 364.34 4.10 3.95 4.44 4.10 3.95 4.52 4.10 3.95 10384 24
Exp(1) 4.33 4.05 3.87 4.43 4.05 3.87 4.52 4.05 3.87 11922 33d = 100 4.34 4.08 4.14 4.43 4.08 4.14 4.52 4.08 4.14 11194 34
4.35 4.10 3.86 4.44 4.10 3.86 4.53 4.10 3.86 9680 274.34 4.08 4.10 4.44 4.08 4.10 4.52 4.08 4.10 10468 29
Table 4.8: Critical values for the changed interval scan statistic based on MDP. n = 1000.

significance level = 0.05

                     d = 1              d = 10             d = 100
l0    A1    A2     N(0,1)  Exp(1)     N(0,1)  Exp(1)     N(0,1)  Exp(1)
100   4.08  4.38   4.39    4.46       4.30    4.29       4.32    4.32
50    4.22  4.97   5.03    5.12       5.10    4.87       5.19    4.99
25    4.33  5.81   6.31    6.32       6.14    6.12       6.60    6.35

significance level = 0.01

                     d = 1              d = 10             d = 100
l0    A1    A2     N(0,1)  Exp(1)     N(0,1)  Exp(1)     N(0,1)  Exp(1)
100   4.51  4.90   4.91    5.13       4.93    4.92       5.01    4.91
50    4.63  5.58   5.63    5.94       5.64    5.48       6.13    5.63
25    4.72  6.52   6.91    6.91       6.91    6.91       7.12    6.91
Table 4.9: Critical values for the changed interval scan statistic based on NNG at 0.05 significance level. n = 1000.

Critical Values                                                Graph
       l0 = 100          l0 = 50           l0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
4.04 4.10 4.07 4.15 4.23 4.20 4.23 4.31 4.30 2026 2N(0,1) 4.04 4.10 4.09 4.15 4.24 4.18 4.23 4.31 4.24 1942 2d = 1 4.04 4.10 4.11 4.15 4.24 4.23 4.23 4.31 4.35 1948 2
4.04 4.10 3.96 4.15 4.23 4.11 4.23 4.31 4.25 2038 24.04 4.10 4.04 4.15 4.24 4.17 4.23 4.31 4.31 1960 24.04 4.10 4.00 4.15 4.23 4.14 4.23 4.31 4.24 2086 2
Exp(1) 4.04 4.10 4.08 4.15 4.23 4.20 4.23 4.31 4.24 1990 2d = 1 4.04 4.10 4.00 4.15 4.24 4.15 4.23 4.32 4.27 2014 2
4.04 4.10 4.01 4.15 4.23 4.20 4.23 4.31 4.34 2080 24.04 4.10 4.04 4.15 4.23 4.18 4.23 4.31 4.27 2008 23.99 3.92 3.82 4.09 3.96 3.88 4.18 3.97 3.90 3558 6
N(0,1) 3.99 3.91 3.86 4.09 3.94 3.86 4.18 3.95 3.88 3508 6d = 10 4.00 3.92 3.86 4.10 3.96 3.93 4.18 3.97 3.93 3394 6
3.99 3.91 3.81 4.09 3.94 3.86 4.18 3.95 3.90 3418 63.99 3.91 3.88 4.09 3.94 3.88 4.18 3.95 3.88 3450 64.00 3.94 3.85 4.10 3.98 3.91 4.18 3.99 3.91 3306 6
Exp(1) 4.01 3.95 3.91 4.11 4.00 3.98 4.19 4.02 3.99 3118 5d = 10 4.00 3.94 3.89 4.10 3.98 3.93 4.19 4.00 3.94 3018 5
4.00 3.95 3.90 4.11 3.99 3.93 4.19 4.01 3.93 3014 54.01 3.96 3.95 4.11 4.01 3.97 4.19 4.03 3.99 3092 53.89 3.55 3.48 4.00 3.55 3.48 4.10 3.55 3.48 8240 30
N(0,1) 3.88 3.50 3.49 3.99 3.50 3.49 4.09 3.50 3.49 9360 33d = 100 3.90 3.61 3.60 4.00 3.61 3.60 4.10 3.61 3.60 8482 18
3.88 3.51 3.48 3.99 3.51 3.48 4.09 3.51 3.48 9154 403.88 3.50 3.44 3.99 3.50 3.44 4.09 3.50 3.44 9392 393.88 3.54 3.47 3.99 3.54 3.47 4.09 3.54 3.47 10406 45
Exp(1) 3.88 3.55 3.55 3.99 3.55 3.55 4.09 3.55 3.55 10504 44d = 100 3.88 3.54 3.61 3.99 3.54 3.61 4.09 3.54 3.61 10106 32
3.90 3.64 3.53 4.00 3.63 3.53 4.10 3.63 3.53 8666 223.90 3.58 3.57 4.00 3.58 3.57 4.10 3.58 3.57 8274 28
Table 4.10: Critical values for the changed interval scan statistic based on NNG at 0.01 significance level. n = 1000.

Critical Values                                                Graph
       l0 = 100          l0 = 50           l0 = 25
       A1   A2   Per     A1   A2   Per     A1   A2   Per     ∑|Ei|²   D
4.48 4.55 4.58 4.57 4.67 4.65 4.64 4.73 4.65 2026 2N(0,1) 4.48 4.56 4.53 4.57 4.68 4.71 4.64 4.74 4.79 1942 2d = 1 4.48 4.56 4.56 4.57 4.68 4.72 4.64 4.74 4.83 1948 2
4.48 4.55 4.45 4.57 4.67 4.68 4.64 4.74 4.69 2038 24.48 4.56 4.56 4.57 4.68 4.66 4.64 4.74 4.82 1960 24.48 4.55 4.49 4.57 4.67 4.62 4.64 4.74 4.68 2086 2
Exp(1) 4.48 4.55 4.49 4.57 4.67 4.57 4.64 4.73 4.57 1990 2d = 1 4.48 4.56 4.49 4.57 4.68 4.59 4.64 4.75 4.60 2014 2
4.48 4.55 4.61 4.57 4.67 4.65 4.64 4.74 4.76 2080 24.48 4.55 4.60 4.57 4.67 4.65 4.64 4.73 4.78 2008 24.44 4.35 4.20 4.52 4.39 4.25 4.60 4.37 4.25 3558 6
N(0,1) 4.44 4.34 4.34 4.52 4.35 4.38 4.59 4.35 4.38 3508 6d = 10 4.44 4.35 4.28 4.52 4.36 4.33 4.60 4.37 4.33 3394 6
4.44 4.34 4.30 4.52 4.36 4.30 4.59 4.35 4.30 3418 64.44 4.34 4.22 4.52 4.35 4.22 4.59 4.35 4.22 3450 64.44 4.37 4.31 4.53 4.43 4.39 4.60 4.39 4.39 3306 6
Exp(1) 4.45 4.38 4.39 4.53 4.42 4.50 4.60 4.42 4.50 3118 5d = 10 4.45 4.37 4.31 4.53 4.72 4.33 4.60 4.39 4.38 3018 5
4.45 4.38 4.42 4.53 4.35 4.45 4.60 4.41 4.45 3014 54.45 4.39 4.46 4.53 4.43 4.47 4.61 4.43 4.47 3092 54.35 4.02 3.91 4.44 4.02 3.91 4.53 4.02 3.91 8240 30
N(0,1) 4.34 3.97 3.82 4.44 3.97 3.82 4.52 3.97 3.82 9360 33d = 100 4.36 4.07 3.94 4.45 4.07 3.94 4.53 4.07 3.94 8482 18
4.35 3.99 4.06 4.44 3.99 4.06 4.52 3.99 4.06 9154 404.34 3.98 3.83 4.44 3.98 3.83 4.52 3.98 3.83 9392 394.34 4.02 3.87 4.43 4.02 3.87 4.52 4.02 3.87 10406 45
Exp(1) 4.34 4.03 3.99 4.43 4.03 3.99 4.52 4.03 3.99 10504 44d = 100 4.34 4.02 4.22 4.43 4.02 4.22 4.52 4.02 4.22 10106 32
4.35 4.10 3.95 4.44 4.10 3.95 4.53 4.10 3.95 8666 224.36 4.05 4.02 4.45 4.05 4.02 4.53 4.05 4.02 8274 28
Chapter 5
Assessment of the Method
5.1 Numeric Power Studies
We used simulations to compare the power of the graph-based scan statistics to
parametric approaches. In the first simulation set-up, we generated a sequence of 200
observations from the following model:
$$y_t \sim \begin{cases} N(0, I_d), & t = 1, \ldots, 100; \\ N(\mu, \Sigma), & t = 101, \ldots, 200. \end{cases}$$
As before, d is the dimension of each observation. There is a change-point at 100.
The mean µ of the second half of the data is shifted from 0 by amount ∆ in Euclidean
distance. We considered cases where the covariance matrix remains constant (Σ = Id),
as well as cases where the covariance matrix also changes. When the covariance
matrix changes, we set Σ to a diagonal matrix with Σ[1, 1] = d1/3 and Σ[i, i] = 1 for
i = 2, . . . , d. We chose ∆ for each value of d so that most methods have moderate
power.
Hotelling’s T2 is a parametric test designed specifically for detecting a change
in multivariate normal mean when there is no change in variance. When there is a
change in both mean and variance, the generalized likelihood ratio test (GLR) can be
used. We compare the graph-based scan statistics to scan statistics based on these
CHAPTER 5. ASSESSMENT OF THE METHOD 48
two existing methods. For any candidate change-point $t$, Hotelling's $T^2$ is
$$T^2(t) = \frac{t(n-t)}{n}\,(\bar{y}_t - \bar{y}_t^*)^T \hat{\Sigma}^{-1} (\bar{y}_t - \bar{y}_t^*),$$
where
$$\bar{y}_t = \frac{1}{t}\sum_{i=1}^{t} y_i, \qquad \bar{y}_t^* = \frac{1}{n-t}\sum_{i=t+1}^{n} y_i,$$
$$\hat{\Sigma} = \frac{1}{n-2}\left[\sum_{i=1}^{t} (y_i - \bar{y}_t)(y_i - \bar{y}_t)^T + \sum_{i=t+1}^{n} (y_i - \bar{y}_t^*)(y_i - \bar{y}_t^*)^T\right].$$
The GLR is
$$\mathrm{GLR}(t) = n \log|\hat{\Sigma}_n| - t \log|\hat{\Sigma}_t| - (n-t) \log|\hat{\Sigma}_t^*|,$$
where
$$\hat{\Sigma}_t = \frac{1}{t}\sum_{i=1}^{t} (y_i - \bar{y}_t)(y_i - \bar{y}_t)^T, \qquad \hat{\Sigma}_t^* = \frac{1}{n-t}\sum_{i=t+1}^{n} (y_i - \bar{y}_t^*)(y_i - \bar{y}_t^*)^T.$$
Some constraints apply to $T^2(t)$ and $\mathrm{GLR}(t)$. For $T^2$, the number of observations $n$ must exceed the dimension of the data $d$ so that $\hat{\Sigma}$ can be inverted. For GLR, both $t$ and $n - t$ must exceed the dimension of the data so that the determinants of $\hat{\Sigma}_t$ and $\hat{\Sigma}_t^*$ are not zero. Thus, when $d \leq 20$, we set $n_0 = d + 10$ and $n_1 = n - n_0$. When $d > 20$, we set $n_0 = 50$ and $n_1 = 150$. (An exception for GLR is that when $d = 50$, $n_0$ and $n_1$ are set to 60 and 140, respectively, so that the test statistic can be calculated.)
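The two parametric statistics follow directly from the formulas above. The sketch below (our own illustration, not the simulation code used in the study) evaluates $T^2(t)$ and $\mathrm{GLR}(t)$ for a single candidate $t$; a scan then maximizes these over the admissible range of $t$:

```python
import numpy as np

def hotelling_t2(y, t):
    """T^2(t) with the pooled covariance estimate (divisor n - 2)."""
    n, d = y.shape
    m1, m2 = y[:t].mean(axis=0), y[t:].mean(axis=0)
    r1, r2 = y[:t] - m1, y[t:] - m2
    S = (r1.T @ r1 + r2.T @ r2) / (n - 2)   # pooled covariance estimate
    diff = m1 - m2
    return t * (n - t) / n * diff @ np.linalg.solve(S, diff)

def glr(y, t):
    """GLR(t) with maximum-likelihood covariance estimates."""
    n, d = y.shape

    def log_det_mle_cov(x):
        r = x - x.mean(axis=0)
        return np.linalg.slogdet(r.T @ r / len(x))[1]

    return (n * log_det_mle_cov(y)
            - t * log_det_mle_cov(y[:t])
            - (n - t) * log_det_mle_cov(y[t:]))
```

Both functions require the segment lengths to exceed $d$, which is exactly the constraint that motivates the choices of $n_0$ and $n_1$ above.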
Table 5.1 shows the power comparisons. Scan statistics based on the three ways
of constructing the graph – MST, MDP and NNG – using Euclidean distance are
compared to scan statistics based on maximization of T 2(t) and GLR(t). First, com-
pare the graph-based methods to Hotelling’s T 2: When the variance does not change,
T 2 outperforms all other methods in low to moderate dimension (d < 175). This is
expected, as T 2 was designed specifically for this scenario. Remarkably, graph-based
methods surpass T 2 at its own game when dimension is high (d = 175). Now, consider
the case where the variance also changes. By assuming an incorrect alternative, the
power of T 2 is quickly surpassed by graph-based methods, for d as low as 5.
Comparing the graph-based methods to the GLR-based scan statistic, we see a similar pattern: when dimension is low (d = 1, 5, 10), GLR-based scans dominate in power when both the mean and variance change. Graph-based methods exceed GLR in power as d increases, already performing much better by d = 20, which is considered quite moderate in today's applications. The low power of GLR at even moderate
dimension is due to its requirement that the covariance matrix be estimated for both
segments.
We also considered a case where the normality assumption is violated by gener-
ating data from the log-normal distribution (Σ = Id). Then, graph-based methods
outperform T 2 by d = 75, and GLR by d = 5.
Comparing among the graph-based scan statistics, we see that MST and NNG
have comparable power, and dominate MDP in all situations. An explanation is that,
of these three ways of constructing graphs, the MDP retains the least information
from the data, having half as many edges as the other two graphs. The fact that MST
and NNG have similar power in all scenarios suggests that the graph-based method
is not very sensitive to the method of graph construction.
Table 5.1: Number of simulated sequences (out of 100) with significance less than 5%.

Normal data, Σ = I
d      1     5     10    20    50    75    100   125   150   175
∆      0.5   0.65  0.8   0.8   1     1.2   1.2   1.4   1.6   2
T²     81    85    98    76    82    90    74    72    67    46
GLR    72    51    28    8     8     -     -     -     -     -
MST    14    20    18    15    29    45    34    42    47    73
MDP    5     7     12    8     19    23    15    18    27    43
NNG    8     16    16    16    33    49    37    46    48    77

Normal data, Σ is diagonal with Σ[1,1] = d^{1/3}, Σ[i,i] = 1, i = 2, …, d
d      1     5     10    20
∆      0.5   0.4   0.1   0.2
T²     80    18    8     3
GLR    76    80    67    31
MST    8     27    35    65
MDP    9     18    22    30
NNG    10    17    34    59

Log-normal data, Σ = I
d      1     5     10    20    50    75    100
∆      0.7   0.9   1     1     1.2   1.4   1.4
T²     83    77    79    58    60    43    29
GLR    28    21    18    12    7     -     -
MST    18    35    47    28    31    62    71
MDP    7     15    27    14    27    22    25
NNG    19    34    39    28    34    58    74
5.2 Results on Real Data Examples
5.2.1 Friendship Network
The MIT Media Laboratory conducted a study following 90 subjects, consisting of
students and staff at the university, using mobile phones with pre-installed software
recording call logs from July 2004 to June 2005 [Eagle et al., 2009]. In this analysis,
we extract the information on the caller, callee and time for every call that was made
during the study period. The question of interest is whether phone call patterns
changed during this time, which may reflect a change in relationship among these
subjects. We bin the calls by day and, for each day, construct a network with the
90 subjects as nodes and a link between two subjects if they had at least one call
on that day. We encode the network of each day by an adjacency matrix, with 1 for
element [i, j] if there is an edge between subject i and subject j, and 0 otherwise.
Thus, the processed data are adjacency matrices, one for each day from 2004/7/20
to 2005/6/14.
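The preprocessing step can be sketched as follows (illustrative code with hypothetical variable names, not the pipeline actually used in the study):

```python
import numpy as np

def daily_adjacency(calls, n_subjects, n_days):
    """One binary adjacency matrix per day; entry [i, j] is 1 if
    subjects i and j had at least one call on that day."""
    A = np.zeros((n_days, n_subjects, n_subjects), dtype=int)
    for caller, callee, day in calls:
        if caller != callee:                       # ignore self-calls
            A[day, caller, callee] = 1
            A[day, callee, caller] = 1             # calls are undirected here
    return A
```

Here `calls` is assumed to be a list of `(caller, callee, day)` records extracted from the call logs.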
We show results for graphs constructed using two different dissimilarity measures.
Let $A_i$ be the $90 \times 90$ adjacency matrix on day $i$, and let $v_i$ denote the vector form of $A_i$. The dissimilarities are:

(1) the number of different edges: $\|v_i - v_j\|_1$, which equals $\|v_i - v_j\|_2^2$ since the entries are binary;

(2) the number of different edges, normalized by the geometric mean of the total for each day: $\|v_i - v_j\|_1 / \sqrt{\|v_i\|_1 \|v_j\|_1}$.
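Both dissimilarities are simple to compute from the adjacency matrices. A sketch (our own, working on the upper triangle of each matrix, which equals the full vectorized form up to a factor of two):

```python
import numpy as np

def edge_difference(A1, A2):
    """Dissimilarity (1): number of edges present in exactly one network."""
    iu = np.triu_indices_from(A1, k=1)   # upper triangle, excluding diagonal
    return int(np.abs(A1[iu] - A2[iu]).sum())

def normalized_edge_difference(A1, A2):
    """Dissimilarity (2): edge difference divided by the geometric mean
    of the two daily edge counts."""
    iu = np.triu_indices_from(A1, k=1)
    v1, v2 = A1[iu], A2[iu]
    return np.abs(v1 - v2).sum() / np.sqrt(v1.sum() * v2.sum())
```

The resulting day-by-day dissimilarity matrix is then used to construct the MST, MDP, or NNG over the days.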
Results based on different dissimilarities and different ways of constructing the
graph are shown in Figure 5.1. We see that statistics based on MST and NNG give
similar results under both dissimilarities. The statistic based on MDP is not infor-
mative here, possibly because MDP is not dense enough to capture the information
in the data. Based on the scans using MST and NNG, a change-point occurred at
around 2005/1/9, which turns out to be the winter break for that year at MIT. The
p-values for the scan based on MST and NNG under both dissimilarity measures are
all < 0.0001, by both 10,000 permutations and analytic approximation (4.18). Per-
haps a change in courses after winter break changed the social organization among
the subjects.
Figure 5.1: Results of graph-based scans of the MIT phone call network. Top row shows results from using the number of different edges as the dissimilarity measure and bottom row shows results from using the normalized number of different edges. The three columns show three different ways of constructing the graph: MST, MDP, and NNG from left to right. In each plot, Z(t) is plotted along t. The estimated change-point is shown in the caption above the plot. The two vertical lines show n0 and n1; we basically excluded the first 5% and the last 5% of the points. The horizontal lines show critical values at 0.05 and 0.01 significance levels, with the solid lines showing critical values computed from 10,000 permutations and the dashed lines showing those computed from the analytic approximation after skewness correction.
5.2.2 Authorship Debate
Tirant lo Blanch, a chivalric novel published in 1490, is considered to be one of the best-known medieval works of literature in Catalan. The dedicatory letter at the beginning of the book states,
... So that no one else can be blamed if any faults are found in this work,
I, Joanot Martorell, knight, take sole responsibility for it, as I have carried
out the task singlehandedly...
However, the colophon at the end of the book states something different,
... by the magnificent and virtuous knight, Sir Joanot Martorell, who
because of his death, could only finish writing three parts of it. The
fourth part, which is the end of the book, was written by the illustrious
knight Sir Marti Joan de Galba. If faults are found in that part, let them
be attributed to his ignorance...
This inconsistency has sparked a debate, still ongoing, about the authorship of Tirant lo Blanch. Opinions have mainly fallen into two camps, one favoring single authorship by Joanot Martorell and the other favoring a change of author somewhere between chapters 350 and 400. (There are in total 487 chapters in the book.) One objective way to settle this debate is through the statistical analysis
of word usage, which reflects the unique style of writing of different people.
Giron et al. [2005] analyzed two sets of word usage statistics extracted from the
book. The first, which we call the word length data set, categorizes the words in
each chapter by their length, with a single category for all words with length greater
than nine letters. Thus, this data set represents each chapter by a vector of length
10. The second, which we call the context-free word frequency data set, counts the
occurrence of the 25 most frequent context-free words in each chapter. Giron et al.
[2005] analyzed the two data sets using a Bayesian multinomial change-point model
and a Bayesian clustering method, and concluded in favor of the change of author
hypothesis, with the estimated change-point between chapters 371 and 382.
Here, we apply the graph-based change-point method to the two data sets, treating
each chapter as a time-point. There are in total 487 chapters, and we use the 425
chapters that have more than 200 words. For both data sets, we normalized the count
vector for each chapter by dividing by the total number of words in the chapter. Thus,
our data is a sequence of 425 normalized frequencies, of dimension 10 for the word
length data and dimension 25 for the context-free word frequency data. The L2 norm
is used to construct the MST, MDP and NNG graphs representing similarity between
chapters. Z(t) and the estimated change-points, computed for each type of graph,
are shown in Figure 5.2. Test results using the three different graphs and the two
data sets support the change of author hypothesis, with the estimated change-point
around chapter 360, which is consistent with the view that there is a change of author
somewhere between chapters 350 and 400. The p-values are shown in Table 5.2.
Table 5.2: p-values for the tests. In each cell, the first value is calculated from 10,000permutations and the second value is calculated from the analytic approximationafter skewness correction.
data                           MST             MDP             NNG
word length                    0.0000/0.0000   0.0041/0.0018   0.0000/0.0000
context-free word frequency    0.0000/0.0000   0.0000/0.0000   0.0000/0.0000
To check the robustness of our analysis, we also applied the scan on data for
the first 350 chapters to see if it rejects the null there. Opinions seem to be quite
uniform that the first 350 chapters were all written by Joanot Martorell. The results
are shown in Figure 5.3. The word length data does not reject the null for the 350
chapters at 0.05 significance level. However, the context-free word frequency data
supports a change-point, although different graphs favor different locations for the
change-point. The p-values of the tests are shown in Table 5.3. These results suggest
that word length may be more robust than context-free word frequency in reflecting
writing styles.
Table 5.3: p-values for the tests only using data from the first 350 chapters. Numbers
in each cell have the same meaning as in Table 5.2.

data                          MST             MDP             NNG
word length                   0.0538/0.0562   0.1061/0.1040   0.3086/0.3527
context-free word frequency   0.0000/0.0000   0.0019/0.0009   0.0000/0.0000
Figure 5.2: Results of graph-based scans of chapter-wise word usage frequencies of Tirant lo Blanch. The first row shows results from the word length data and the second row shows results from the context-free word frequency data. The three columns show scans based on three different graphs: MST, MDP, and NNG from left to right. The content in each plot is the same as in Figure 5.1. In the caption for each plot, the estimated change-point is shown in the form A/B, where A is the index of the change-point within the 425 chapters used for analysis, and B is the chapter number in the novel.
Figure 5.3: Results from the first 350 chapters. The setting of the figure is the same as in Figure 5.2.
5.3 Discussion
The new nonparametric method for change-point detection can be applied to high
dimensional and non-Euclidean data. The method requires only the existence of a
dissimilarity measure on the sample space. In applications, the choice of a good
dissimilarity measure is critical, and domain knowledge should be used to design a
measure that is sensitive to the signal of interest. The approach we propose decouples
this application-specific choice of dissimilarity measure from the formal test for a
change-point. Graph-based scan statistics are easy to compute, and the analytic
p-value approximations are generally applicable.
We have shown that the p-value approximations are quite accurate. Our simula-
tions were for a data sequence of length n = 1000. The accuracy of the approximations
depends on n0 (l0 for the changed interval alternative) and not so much on n. Accuracy
also depends on the structure of the graph. When the graph is highly star-shaped,
which is common for high-dimensional data when the Euclidean distance is used in
constructing the graph, the skewness correction is critical for the approximations to
be accurate. For extremely star-shaped graphs, we imagine that adjusting for kurtosis
and higher-order moments might also be helpful. The strategy would be similar
to skewness correction, but more technically complicated. We do not compute these
higher-order terms in this paper, but if needed they can be computed in a similar
fashion to the skewness term with the aid of symbolic computation software.
The main reason that higher order corrections are necessary in high dimensions
is the increase in size and density of hubs in the graph, as shown in Section 4.5.
If hubs dominate the topology of the graph, perturbation of any hub can change
the topology drastically. Furthermore, R(t), which does not take into account the
interactions between edges, loses all information regarding the higher-order structure.
Under such circumstances, the particular graph would not be useful for separating
F1 from F0, and we would suggest exploring other dissimilarity measures and graph
construction methods.
Compared to parametric approaches, the graph-based approach requires far fewer
assumptions, but also makes less use of the data. Although this leads to loss of
power in low dimensions if the data indeed follow the parametric model, it leads to
robustness and wider applicability. An important observation is that the graph-based
approach has desirable power, compared to standard parametric tests, in moderate
and high dimensions. For high dimensional data, it is often hard to predict the
direction and nature of the change. Without such prior knowledge, parametric models
would require the estimation of many parameters, most of which would be unrelated
to the change. For example, Hotelling's $T^2$ statistic requires the estimation of
a large covariance matrix. If, by prior knowledge or data pre-processing, we can
circumvent the covariance estimation, then Hotelling's $T^2$ would be preferable when
the data satisfy its assumptions, namely normality with no change of variance. Otherwise,
graph-based approaches gain an increasing advantage over Hotelling's $T^2$ as $d$ increases,
even in the problem for which Hotelling's $T^2$ was explicitly designed.
We mainly explored three different ways of constructing the underlying graph
given a dissimilarity measure. From the numerical results and the analysis of the MIT
cell phone network, we see that scans based on MST and NNG perform similarly, while
scans based on MDP have lower power. We suspect this is due to the fact that MDP
is the least dense graph and utilizes the least amount of information from the original
data set. In this regard, one may try denser graphs which retain more information
from the data than the MST and NNG. One may even consider assigning weights to
the edges. As in all problems, building more assumptions into the statistic leads to
improved power if the assumptions are true, but sacrifices robustness.
If more than one change-point or changed interval is of interest, the graph-based
scan can be applied recursively in a procedure known as binary or circular
binary segmentation [Vostrikova, 1981, Olshen et al., 2004].
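As a hedged sketch of that recursion (not the dissertation's implementation), binary segmentation can wrap any single-change-point scan that returns a p-value and an estimated change-point; the `scan` interface below is hypothetical:

```python
def binary_segmentation(lo, hi, scan, alpha=0.05, min_len=10):
    """Recursively apply a single-change-point scan to the segment [lo, hi).

    `scan(lo, hi)` is a hypothetical interface returning (p_value, tau);
    any scan of that shape works, including a graph-based Z(t) scan.
    """
    if hi - lo < min_len:
        return []
    p, tau = scan(lo, hi)
    if p > alpha:
        return []
    return (binary_segmentation(lo, tau, scan, alpha, min_len)
            + [tau]
            + binary_segmentation(tau, hi, scan, alpha, min_len))

# Toy demonstration with a fake scan that "detects" change-points at 100 and 300.
truth = [100, 300]

def toy_scan(lo, hi):
    cands = [t for t in truth if lo < t < hi]
    if not cands:
        return 1.0, (lo + hi) // 2     # nothing left to find in this segment
    return 0.001, cands[len(cands) // 2]

found = binary_segmentation(0, 400, toy_scan)   # -> [100, 300]
```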
Part II
Graph-Based Tests for Two-Sample
Comparisons of Categorical Data
Chapter 6
Introduction
6.1 Background and Challenges
Testing whether two data samples are drawn from the same distribution is a funda-
mental problem in statistics. For low-dimensional Euclidean data, there are many
approaches, both parametric and non-parametric, to this problem. When the data
are categorical, the existing approaches are much more limited. The standard proce-
dure is to assume that each sample is drawn from a multinomial distribution, and the
comparison becomes a test of whether the two samples come from the same multino-
mial distribution. Classical methods, such as Pearson's Chi-square test and the
deviance test, work well when we observe each category a large number of times. At
least, the region in the contingency table where the two groups truly differ needs to
be adequately sampled for existing tests to achieve good power. However, in many
modern applications, the number of possible categories is comparable to or even larger
than the sample size. Following are some examples:
Preference rankings: Survey data in marketing or psychometric research often
come in the form of preference rankings. Subjects may be asked to rate wine
(rank from best to worst tasting), pictures (choose 3 most familiar out of 5),
or insurance plans (identify the most and least desirable). See Diaconis [1988]
and Critchlow [1985] for more detailed examples on ranked and partially ranked
data. It is a common problem to compare two groups of subjects to see if there
CHAPTER 6. INTRODUCTION 61
is any between-group difference in preference. The number of possible full rank-
ings is the factorial of the number of objects being rated, and the number of
possible rankings is higher if some subjects only partially rank the objects.
Haplotype association: In genetics, a haplotype is a combination of alleles at adja-
cent loci on a chromosome that is transmitted together. A common problem of
genetic association studies is to compare haplotype counts between treatment
and control groups (e.g. see Zaykin et al. [2002] and Furihata et al. [2006]).
Each haplotype can be represented as a fixed-length binary vector. The number
of possible haplotypes is exponential in the number of loci. Haplotypes longer
than 10 loci are often of interest in genetics, leading to more than 1,000 possible
combinations. However, the number of subjects in association studies is often
only in the thousands or even hundreds, and the counts for most haplotypes are
small.
Sequence or document comparisons: In the modern age of digitized texts, it is
often of interest to compare the word composition in two different documents.
A similar problem is the comparison of DNA or protein sequences, which plays
a large role in bioinformatics [Lippert et al., 2002]. The number of possible
words in these applications can be very large, while the counts for most words
are small. For recent interest in this problem see Perry and Beiko [2010], Bush
and Lahn [2006] and Rajan et al. [2007] for examples.
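To make the haplotype example concrete, here is a minimal illustration (not tied to any particular study) of representing haplotypes as fixed-length binary vectors and computing pairwise Hamming distances between the categories:

```python
import numpy as np
from itertools import product

loci = 5
# All 2**loci possible haplotypes, each a fixed-length binary vector (the categories).
haps = np.array(list(product([0, 1], repeat=loci)))
K = len(haps)                                         # 2**5 = 32 categories
# Pairwise Hamming distance: number of loci at which two haplotypes differ.
d = (haps[:, None, :] != haps[None, :, :]).sum(axis=2)
```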
Classical Chi-square tests have low power in the above scenarios due to sparsity
of the contingency table and high dimensionality of the parameter space. For exact
tests, it is possible to generalize the concept to the setting of more than two categories,
but this is computationally challenging [Mehta and Patel, 1983] and not efficient due
to the existence in high dimensions of many equivalent tables, which are tables that
have the same probability as the one observed.
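For concreteness, the two classical statistics for a two-way table can be written down directly. This is a generic sketch of Pearson's X^2 and the deviance G^2, not a method proposed in this dissertation:

```python
import numpy as np

def pearson_and_deviance(table):
    """Pearson X^2 and deviance G^2 for a two-way contingency table.

    With many near-empty cells their chi-square reference distribution
    (df = K - 1 for a 2 x K table) becomes unreliable, which is the
    sparsity problem discussed above.
    """
    t = np.asarray(table, dtype=float)
    expected = t.sum(1, keepdims=True) * t.sum(0, keepdims=True) / t.sum()
    x2 = ((t - expected) ** 2 / expected).sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(t > 0, t * np.log(t / expected), 0.0)
    g2 = 2 * terms.sum()
    return float(x2), float(g2)

x2_eq, g2_eq = pearson_and_deviance([[10, 10], [10, 10]])    # identical groups: both 0
x2_sep, g2_sep = pearson_and_deviance([[20, 0], [0, 20]])    # fully separated groups
```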
6.2 Implicit Information and Its Role in Improving the Tests
When the number of categories is very large, there is often underlying similarity
between different categories that can be exploited. For example, rankings can be
related through Kendall’s or Spearman’s distance. Hamming distance or other more
sophisticated measures can be used to compare haplotypes and fixed-length words
in DNA sequences. In document comparison, words are not all equally related:
some are synonyms of others, and some are more likely to be used together.
Such similarity information between categories can be exploited to improve
the power of two-sample tests.
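For instance, Kendall's distance between two rankings counts the pairs of items ranked in opposite order; a minimal sketch:

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Kendall's tau distance: the number of item pairs ranked in opposite
    order by the two rank vectors (r[i] is the rank item i receives)."""
    return sum(
        (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
        for i, j in combinations(range(len(r1)), 2)
    )
```

The distance is 0 for identical rankings and attains its maximum, the total number of pairs, for fully reversed rankings.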
We assume that a distance matrix has been given on the set of categories, and
adopt a graph-based approach proposed by Friedman and Rafsky [1979] and Rosen-
baum [2005], where a graph is constructed on all subjects so that subjects more similar
in value are connected by an edge. Friedman and Rafsky’s test is based on a minimum
spanning tree (MST), and Rosenbaum’s test is based on minimum distance pairing
(MDP). The test statistic in both cases is the number of edges connecting subjects
from different groups. The underlying rationale is that, if the two groups come from the
same distribution, subjects from the same group should be as distant from each
other as subjects from different groups. More details of these tests are given
in Section 1.1. Both tests, however, require uniqueness of the underlying graphs.
When the distance matrix on subjects is filled with ties, which is characteristic of
categorical data, neither approach can be directly applied.
Ties in the distance matrix lead to ambiguity in constructing the MST or MDP,
and the number of possible graphs increases rapidly with the number of ties. Some
efforts were made to address this problem. In the analysis of a partially ranked data
set with 38 subjects in 23 categories, Critchlow [1985] tried both the graph obtained
from the union of all MSTs (uMST), and the graph obtained from the union of all
nearest neighbor graphs (uNNG). Nettleton and Banerjee [2001] also used uNNG on
a binary clinical feature data set with 64 subjects in 63 categories. In general, nearest
neighbor graphs do not work well for categorical data; see Section 7.3. In this paper,
Critchlow’s method using the uMST is studied in more detail and a computationally
tractable form for categorical data is given. A different statistic, based on averaging
over all optimal graphs of a certain kind, is also proposed and analyzed.
6.3 Notation
We start by introducing our notation. The different categories are indexed by
1, 2, . . . , K. The naming of the categories is arbitrary; that is, category 1 is not
necessarily closer in distance to category 2 than to category 3. The two groups are
labeled a and b. The data is given in the form of a two-way contingency table (Table
6.1). Without loss of generality, we assume that each category has at least one subject
over the two groups. That is, categories with no observation in either group can be
omitted from the analysis without loss of information.
Table 6.1: Basic Notation.

            1      2      . . .   K      Total
Group a     na1    na2    . . .   naK    na
Group b     nb1    nb2    . . .   nbK    nb
Total       m1     m2     . . .   mK     N

\[
m_k = n_{ak} + n_{bk}, \quad k = 1, \dots, K;
\]
\[
n_a = \sum_{k=1}^{K} n_{ak}, \qquad n_b = \sum_{k=1}^{K} n_{bk}, \qquad N = n_a + n_b = \sum_{k=1}^{K} m_k.
\]
Sometimes, we refer to the individual subjects themselves, which we denote by
$Y_1, \dots, Y_N$. Thus, each $Y_i$ takes value in $\{1, \dots, K\}$ and has a group label
\[
g_i = \begin{cases} a, & \text{if } Y_i \text{ belongs to group } a; \\ b, & \text{if } Y_i \text{ belongs to group } b. \end{cases} \tag{6.1}
\]
We assume that a distance matrix $\{d(i, j) : i, j = 1, \dots, K\}$ has been given on the
set of possible categories, with d(i, j) small if categories i and j are similar. Possible
ways of defining the distance matrix are shown for various examples in Section 6.1.
As in Part I, we use $G$ to denote both the graph and its set of edges, and $G_i$ to denote
the subgraph including all edges that connect to node $i$, as well as its set of edges.
Sometimes, the name of the graph is not as simple as $G$, and we then use $E_{G_i}$ to denote
the set of edges in $G$ that contain node $i$, to avoid ambiguity. In addition, we use $V_{G_i}$
to denote the set of nodes in $G$ that are connected to node $i$ by an edge, and $E_{G_i,2}$
to denote the set of edges in $G$ that contain at least one node in $V_{G_i}$.
Following is a list of abbreviations for different types of graphs and test statistics:
MST: Minimum Spanning Tree,
MDP: Minimum Distance Pairing,
NNG: Nearest Neighbor Graph,
uMST: The graph obtained by taking the union of all MSTs,
uNNG: The graph obtained by taking the union of all NNGs, equivalent to the graph
connecting each point to all of its nearest neighbors,
RG: the test statistic on the graph G,

RaMST: the test statistic averaging over all test statistics computed on each of the
MSTs,

RaMDP: the test statistic averaging over all test statistics computed on each of the
MDPs.
Chapter 7
Graph-Based Test Statistics
Both Friedman and Rafsky’s test and Rosenbaum’s test assume uniqueness of the
type of graph used. For categorical data, ties appear in the distance matrix whenever
a category has multiple counts. Even sparse contingency tables have quite a few cells
containing more than one subject. The number of possible graphs grows rapidly with
the number of ties. Thus, Friedman and Rafsky's and Rosenbaum's methods cannot
be directly applied to categorical data. For categorical data, distances are often based
on qualitative measures, and thus while their relative ranking may be trustworthy,
their absolute scale is not. Hence, we do not consider methods based directly on the
distance matrix. While there are many ways to construct a graph based on a distance
matrix, we limit our study to MST, MDP and NNG, which are representative. Figure
7.1 illustrates the three different types of graphs on a simple example containing six
points. These six points take on six distinct values.
One natural solution, when the optimizing graph is not unique, is to average
the test statistic over all graphs of the given kind. In this section, we consider the
statistic based on averaging the sum (1.1) over all MSTs (RaMST). Another solution to
non-uniqueness is to take the union over all optimizing graphs, such as the statistic
based on the uMST (RuMST). RaMST and RuMST are analytically tractable and intuitively
appealing, and their derivations are shown in Section 7.1. For comparison, we also
consider the statistic based on averaging (1.1) over all MDPs, RaMDP, and the statistic
based on uNNG, RuNNG. Computation of RaMDP, described in Section 7.2.1, is often
intractable. Computation of uNNG is instantaneous. In Section 7.3, we study by
CHAPTER 7. GRAPH-BASED TEST STATISTICS 66
Figure 7.1: Illustration of MST, MDP, and NNG on six points. Notice that only one of the two possible MSTs on the six points and one of the two possible NNGs on the six points are shown.
simulation the performance of four graph-based statistics, RaMST, RuMST, RaMDP, RuNNG,
comparing them to each other and to Chi-square tests. Our results show that tests
based on minimum spanning trees have the best power, and the intuition for this is
explained. The statistics based on the uMDP and on averaging over all NNGs are not
included in the comparison: judging from the performance of RaMDP and RuNNG in
Section 7.3, they do not have the potential for high power, and they are not
instantaneous to compute.
Computation of RaMST and RuMST is described in more detail in Section 7.4. When
the number of MSTs on categories is large, which is common for categorical data,
computation for RaMST can be very costly. We generalize the statistic based on RaMST
to a similar but simpler form in Section 7.5.
7.1 The Test Statistics Based on MST
7.1.1 RaMST
First, we define more notation. For each $k = 1, \dots, K$, let $C_k \subset \{1, \dots, N\}$ be the
set of subjects that belong to category $k$. From Table 6.1, $|C_k| = m_k$. Let $T_k$ be the set of all
spanning trees on $C_k$. Since the distance between any two subjects in $C_k$ is zero, any
CHAPTER 7. GRAPH-BASED TEST STATISTICS 67
Figure 7.2: Embedding the MST on categories on the subjects. This figure only shows 3 out of 15552 possible embeddings.
spanning tree of $C_k$ is an MST of $C_k$. Let $T_0^*$ be the set of all MSTs on the categories.
We can embed each tree in $T_0^*$ as a graph on the subjects by randomly picking one
subject in $C_k$ to represent category $k$, for $k = 1, \dots, K$. For each $\tau_0^* \in T_0^*$, there are
\[
\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \tag{7.1}
\]
different embeddings. For example, Figure 7.2 shows 3 of the $15552 \; (= 2 \cdot 3^3 \cdot 1 \cdot 4^2 \cdot 3^2 \cdot 2)$
possible embeddings for an MST on six categories containing 2, 3, 1, 4, 3 and 2 subjects.
Let $T_0$ be the set of all graphs obtained from embedding a tree from $T_0^*$ on the subjects.
Then
\[
|T_0| = \sum_{\tau_0^* \in T_0^*} \left( \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \right). \tag{7.2}
\]
Let $T$ be the set of all MSTs on the $N$ subjects. Then any member of $T$ can be
represented as a union of a graph from $T_0$ and a graph from each of $\{T_k : k =
1, \dots, K\}$, and vice versa. Thus,
\[
T = \left\{ \tau_0 \cup \Big( \bigcup_{k=1}^{K} \tau_k \Big) : \tau_0 \in T_0,\ \tau_k \in T_k,\ k = 1, \dots, K \right\},
\]
with
\[
|T| = |T_0| \prod_{k=1}^{K} S_{m_k}, \tag{7.3}
\]
where $S_m = m^{m-2}$ is the number of spanning trees on $m$ points by Cayley's formula.
Then the test statistic based on averaging over all MSTs on the subjects can be defined as
\[
R_{aMST} \overset{\Delta}{=} |T|^{-1} \sum_{\tau \in T} R_\tau, \tag{7.4}
\]
where $R_\tau$ is (1.1) with $G = \tau$. The following theorem gives a computationally
tractable form for $R_{aMST}$ in terms of the cell counts of the contingency table and
the set of possible MSTs on categories.
Theorem 7.1.1. The test statistic based on averaging over all MSTs on subjects is
\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + |T_0|^{-1} \sum_{\tau_0^* \in T_0^*} \left[ \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v} \right]. \tag{7.5}
\]
Proof.
\[
\begin{aligned}
R_{aMST} &= |T|^{-1} \sum_{\tau \in T} R_\tau \\
&= |T|^{-1} \sum_{\tau_0 \in T_0} \sum_{\tau_1 \in T_1} \cdots \sum_{\tau_K \in T_K} \left[ R_{\tau_0} + R_{\tau_1} + \cdots + R_{\tau_K} \right] \\
&= |T_0|^{-1} \sum_{\tau_0 \in T_0} R_{\tau_0} + \sum_{k=1}^{K} \Big[ \sum_{\tau_k \in T_k} R_{\tau_k} / S_{m_k} \Big]. \tag{7.6}
\end{aligned}
\]
First consider the quantity $\sum_{\tau_k \in T_k} R_{\tau_k} / S_{m_k}$. Since all pairs of subjects in a given
category have the same distance ($= 0$), the edge between them should appear in
the same number of trees. There are in total $m_k(m_k - 1)/2$ possible pairs and each
spanning tree for $C_k$ has $m_k - 1$ edges. Hence, the edge between each pair of subjects
in $C_k$ appears in exactly
\[
\frac{S_{m_k}(m_k - 1)}{m_k(m_k - 1)/2} = \frac{2 S_{m_k}}{m_k}
\]
trees. Thus,
\[
\frac{\sum_{\tau_k \in T_k} R_{\tau_k}}{S_{m_k}} = \frac{\sum_{i,j \in C_k : i < j} I_{g_i \neq g_j} \cdot 2 S_{m_k}/m_k}{S_{m_k}} = \frac{2 n_{ak} n_{bk}}{m_k}. \tag{7.7}
\]
Next consider the summation over $T_0$. For any $i \in C_u$, $j \in C_v$, if $(u,v) \in \tau_0^*$, then the
edge $(i,j)$ appears in
\[
\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \Big/ (m_u m_v)
\]
elements of $T_0$, since any of the $m_u m_v$ possible edges connecting categories $u$ and $v$
appears in an equal number of graphs in $T_0$. Thus,
\[
\begin{aligned}
\sum_{\tau_0 \in T_0} R_{\tau_0} &= \sum_{\tau_0^* \in T_0^*} \sum_{(u,v) \in \tau_0^*} \frac{\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|}}{m_u m_v} \sum_{i \in C_u} \sum_{j \in C_v} I_{g_i \neq g_j} \\
&= \sum_{\tau_0^* \in T_0^*} \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}. \tag{7.8}
\end{aligned}
\]
Combining (7.6), (7.7) and (7.8) gives (7.5).
The following corollaries show that RaMST has a much simpler form if there is a
unique MST on categories, or if the total number of subjects in each category is the
same.
Corollary 7.1.2. When $|T_0^*| = 1$, then
\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}, \tag{7.9}
\]
where $\tau_0^*$ is the unique MST on categories.
Corollary 7.1.3. When $m_k \equiv m$, $k = 1, \dots, K$,
\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m} + |T_0^*|^{-1} \sum_{\tau_0^* \in T_0^*} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m^2}. \tag{7.10}
\]
The form (7.9) of the statistic is especially intuitive. For each category $k$, we call
the term $2 n_{ak} n_{bk} / m_k$ the mixing potential of the category. The mixing potential is
maximized if $n_{ak} = n_{bk} = m_k/2$, that is, when the subjects in category $k$ are evenly
divided between groups $a$ and $b$; it is minimized when the category contains subjects
from only one group. A mixing potential for each edge $(u,v)$ can also be defined as
$(n_{au} n_{bv} + n_{av} n_{bu})/(m_u m_v)$. The edge-wise mixing potential is maximized when the
edge connects a category containing only group $a$ subjects with a category containing
only group $b$ subjects; it is minimized when both categories contain subjects from only
one group. Thus, mixing potentials over categories and over edges between categories
measure the similarity between the two groups. Corollary 7.1.2 shows that, when the
MST on categories is unique, the test statistic $R_{aMST}$ reduces to the sum of mixing
potentials over nodes and edges of the MST on categories. The similarity information
on the categories is explicitly incorporated into the test through the sum of mixing
potentials over the edges between categories.
In testing, the sums (7.5), (7.9) and (7.10) must be compared to their permutation
distributions. A generalized statistic that we propose later in Section 7.5 is based
directly on (7.9).
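As an illustrative sketch (not code from the dissertation), (7.9) translates directly into a few lines once a unique MST on the categories is supplied as an edge list; the toy example below is hypothetical:

```python
import numpy as np

def r_amst_unique(na, nb, tree_edges):
    """R_aMST when the MST on categories is unique (equation 7.9):
    node-wise mixing potentials 2*n_ak*n_bk/m_k plus edge-wise mixing
    potentials (n_au*n_bv + n_av*n_bu)/(m_u*m_v) over the MST's edges."""
    na, nb = np.asarray(na, float), np.asarray(nb, float)
    m = na + nb                      # every category is assumed non-empty
    node_term = float((2 * na * nb / m).sum())
    edge_term = sum((na[u] * nb[v] + na[v] * nb[u]) / (m[u] * m[v])
                    for u, v in tree_edges)
    return node_term + edge_term

# Three categories on a path 0-1-2: category 0 all group a, category 1
# all group b, category 2 evenly mixed.
value = r_amst_unique([2, 0, 1], [0, 2, 1], [(0, 1), (1, 2)])   # -> 2.5
```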
7.1.2 RuMST
Following the notation from the previous section, let $\mathcal{M}_0^*$ denote the set of edges
appearing in at least one MST on categories. That is,
\[
\mathcal{M}_0^* = \{ (u,v) \in \tau_0^* : \tau_0^* \in T_0^* \}.
\]
In other words, M∗0 is the uMST with the categories as nodes. When there is only
one MST on categories, τ ∗0 , then M∗0 = τ ∗0 ; when there are multiple MSTs on cate-
gories, which is common for categorical data, obtaining M∗0 is not straightforward.
Computation ofM∗0 is discussed in Section 7.4. The following theorem describes the
analytic form of RuMST given M∗0.
Theorem 7.1.4. The test statistic based on the uMST is
\[
R_{uMST} = \sum_{k=1}^{K} n_{ak} n_{bk} + \sum_{(u,v) \in \mathcal{M}_0^*} (n_{au} n_{bv} + n_{av} n_{bu}). \tag{7.11}
\]
Proof. Within each category, every pair of subjects is connected, which gives the first
term of (7.11). If categories u and v are connected in any τ ∗0 ∈ T ∗0 , then each point
in category u is connected to every point in category v, giving the second term of
(7.11). If categories u and v are not connected in any τ ∗0 ∈ T ∗0 , no edge will appear
between categories u and v in uMST.
Remark 7.1.5. Both $R_{uMST}$ and $R_{aMST}$ are derived from sums of $I_{g_i \neq g_j}$ over edges
of the uMST on subjects. The main difference between them is that $R_{uMST}$ treats
all of the edges equally, while $R_{aMST}$ assigns each edge a weight proportional to the
number of MSTs on subjects in which the edge appears. Comparing (7.11) to (7.9),
the denominators in (7.9) are omitted in (7.11). Each edge within category $k$ appears in
$2|T|/m_k$ MSTs, while each edge between categories $u$ and $v$ appears in $|T|/(m_u m_v)$ MSTs.
Therefore, $R_{uMST}$ puts more weight on between-category edges than on within-category
edges.
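Likewise, a hedged sketch of (7.11): given the uMST edge set on categories, the statistic is a plain sum (the toy counts and edge list below are hypothetical):

```python
def r_umst(na, nb, umst_edges):
    """R_uMST (equation 7.11): within-category term n_ak * n_bk for every
    category, plus n_au * n_bv + n_av * n_bu for every uMST edge (u, v)."""
    within = sum(a * b for a, b in zip(na, nb))
    between = sum(na[u] * nb[v] + na[v] * nb[u] for u, v in umst_edges)
    return within + between

# Toy example: three categories on a path 0-1-2.
value = r_umst([2, 0, 1], [0, 2, 1], [(0, 1), (1, 2)])   # -> 7
```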
7.2 The Test Statistic Based on MDP
7.2.1 RaMDP
We first assume N , the total number of observations, to be even. Let K0 be the
number of categories containing an odd number of subjects. Since $N$ is even, $K_0$
is even ($K_0$ can be 0). Without loss of generality, let categories $1, \dots, K_0$ be the
categories containing an odd number of subjects, and categories $K_0 + 1, \dots, K$ be the
categories containing an even number of subjects. More notation is defined below.

• $A = \{x = (x_1, \dots, x_{K_0})^T : x_i \in \{a, b\},\ i = 1, \dots, K_0\}$: all possible combinations
of group identities of the subjects, with one taken from each of the categories containing
an odd number of subjects.

• $R_0(n_a, n_b)$: the number of edges connecting subjects from different groups, averaged
over all perfect pairings of $n_a$ points from group $a$ and $n_b$ points from
group $b$ in the same category, with $n_a + n_b$ even.

• $R_x$, $x \in A$: the number of edges connecting subjects from different groups, averaged
over all MDPs on categories $1, \dots, K_0$.
Assumption 7.2.1. If a category has an even number of subjects, the subjects are
paired within the category.
Assumption 7.2.1 is usually true for MDPs on subjects for categorical data. It is
stated explicitly to avoid the complicated scenario in which the triangle inequality
becomes an equality in the distance metric for some triple of categories.
Proposition 7.2.1. Under Assumption 7.2.1, the test statistic based on averaging
(1.1) over all MDPs is
\[
R_{aMDP} = \sum_{k=K_0+1}^{K} R_0(n_{ak}, n_{bk}) + \frac{1}{\prod_{k=1}^{K_0} m_k} \sum_{x \in A} \prod_{i=1}^{K_0} n_{x_i i} \left[ R_x + \sum_{j=1}^{K_0} R_0(n_{x_j j} - 1,\ n_{x_j^c j}) \right], \tag{7.12}
\]
where
\[
x_i^c = \begin{cases} b & \text{if } x_i = a, \\ a & \text{if } x_i = b, \end{cases}
\]
\[
R_0(n_a, n_b) = \sum_{i \in S} i \binom{n_a}{i} \binom{n_b}{i} i!\, (n_a - i - 1)!!\, (n_b - i - 1)!! \,/\, (n_a + n_b - 1)!! \tag{7.13}
\]
with
\[
S = \begin{cases} \{0, 2, \dots, n_a \wedge n_b\} & \text{if } n_a \text{ and } n_b \text{ are both even}, \\ \{1, 3, \dots, n_a \wedge n_b\} & \text{if } n_a \text{ and } n_b \text{ are both odd}, \end{cases}
\]
and
\[
R_x = |\Omega^*|^{-1} \sum_{\omega^* \in \Omega^*} \sum_{(i,j) \in \omega^*} I_{x_i \neq x_j}, \tag{7.14}
\]
where $\omega^*$ is an MDP on categories $1, \dots, K_0$, and $\Omega^*$ is the set of all such $\omega^*$'s.
Proof. First consider the simpler case: one category with $n_a$ subjects from group $a$
and $n_b$ subjects from group $b$, with $n_a + n_b$ even. Since all subjects are in the same
category, any perfect pairing is an MDP. There are in total $(n_a + n_b - 1)!!$ different
perfect pairings.

When both $n_a$ and $n_b$ are even, the possible numbers of edges connecting different
groups are $0, 2, \dots, n_a \wedge n_b$. Among all the $(n_a + n_b - 1)!!$ perfect pairings, the number
of perfect pairings having $i \in \{0, 2, \dots, n_a \wedge n_b\}$ edges connecting different groups is
\[
\binom{n_a}{i} \binom{n_b}{i} i!\, (n_a - i - 1)!!\, (n_b - i - 1)!!.
\]
When both $n_a$ and $n_b$ are odd, the possible numbers of edges connecting different
groups are $1, 3, \dots, n_a \wedge n_b$, and the number of perfect pairings having
$i \in \{1, 3, \dots, n_a \wedge n_b\}$ such edges is given by the same expression.
(7.13) follows immediately.

Under Assumption 7.2.1, an MDP on all subjects is an MDP on categories
$1, \dots, K_0$ (an $\omega^*$) embedded on the subjects, similar to the MST case, with all other
subjects paired within each category, so (7.12) follows naturally.
Remark 7.2.2. If N , the total number of observations, is odd, we can add a pseudo
category with one subject, whose distance to any other category is 0. All derivations
are the same, except that the edge containing the pseudo category is discarded from
the MDP on categories in later steps.
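Equation (7.13) is straightforward to evaluate with double factorials; the sketch below uses the convention $(-1)!! = 0!! = 1$ and is an illustration, not the dissertation's code. As a sanity check, for $n_a = n_b = 2$ the three perfect pairings contain 0, 2 and 2 between-group edges, so $R_0(2,2) = 4/3$.

```python
from math import comb, factorial

def dfact(n):
    """Double factorial, with the convention (-1)!! = 0!! = 1."""
    return 1 if n <= 0 else n * dfact(n - 2)

def r0(na, nb):
    """Average between-group edge count over all perfect pairings of na
    group-a and nb group-b subjects in one category (equation 7.13).
    The index set S starts at 0 or 1 according to the shared parity."""
    assert (na + nb) % 2 == 0
    total = sum(
        i * comb(na, i) * comb(nb, i) * factorial(i)
        * dfact(na - i - 1) * dfact(nb - i - 1)
        for i in range(na % 2, min(na, nb) + 1, 2)
    )
    return total / dfact(na + nb - 1)
```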
7.3 A Numerical Study
In this section, the power of the four tests based on RaMST, RuMST, RaMDP and RuNNG
is studied and compared to that of Pearson's Chi-square and deviance tests on simulated
data sets. In each simulation, 30 points are randomly sampled from two different
distributions – N(0, 1) vs N(1, 1), N(0, 1) vs N(0, 4), N(0, 1) vs N(1, 4), and U(0, 5)
vs U(1, 6). The combined sample of 60 points is then discretized into 12 bins of equal
width. The value 12 is chosen so that the average number of data points in each
category is 5, mimicking the low cell count scenario. The bins are ranked by their
start positions, and the distance between two categories is defined as the difference
in their ranks. The p-values for all tests are calculated through 1,000 permutation
samples for each simulation run, and the power is obtained from 1,000 simulation runs.
In Figure 7.3, power is plotted versus type I error for each test and each simulation
setting. Pearson’s Chi-square and deviance tests give very similar results, so only the
results for the deviance test are shown. The deviance test is denoted by “LR” since
it is based on the log-likelihood ratio. Power for all tests at the two most commonly
used significance levels – 0.01 and 0.05 – are listed in Table 7.1.
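The permutation p-values used throughout can be sketched generically; the `stat` interface below is hypothetical, standing in for any of the graph-based statistics, and for Friedman and Rafsky-type statistics a small between-group edge count indicates separation, so the lower tail is reported:

```python
import numpy as np

def perm_pvalue(stat, labels, n_perm=1000, seed=0):
    """Permutation p-value for a two-sample statistic of the group labels.

    `stat` is any function of a label vector (hypothetical interface);
    the observed value is compared with its permutation null distribution,
    rejecting for small values of the statistic.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = stat(labels)
    hits = sum(stat(rng.permutation(labels)) <= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0
```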
First, compare RaMST, RaMDP, and RuNNG. RaMST is always significantly more powerful
than RaMDP, which in turn is always more powerful than RuNNG. This result is intuitive
from the definition of the different graphs. Since the MST must span the entire data
set, K − 1 out of its N − 1 edges are forced to connect points between categories.
For MDP, if a category has an even number of subjects, the subjects in that category
would be paired amongst themselves; between-category edges are only possible for
those categories having an odd number of subjects. For uNNG, as long as a category
has more than one subject, the subjects in that category would not be connected to
subjects from other categories. Therefore, tests based on MST make the most use
of the similarity information among categories, while the test based on RuNNG makes
the least use of this information. The simulation results show a positive correlation
between using similarity information and the power of the test.
Now, we compare the test based on RuMST to RaMST. As discussed in Remark
7.1.5, the two statistics use the same set of edges but with different weighting. In
simulation, the two statistics perform similarly under the three scenarios that compare
two Normal distributions, while RuMST has very little power, even much lower than
RaMDP and the deviance test, for the comparison of two Uniform distributions with
different supports. When comparing two Normal distributions, the similarity between
two categories is closely related to the difference of the ranks of the categories. That
is, the further apart the ranks of the two categories, the less similar. However, when
comparing two Uniform distributions with different supports – [0,5] vs [1,6] – only
the ranks at the two ends are informative while the middle ranks are not. Since RuMST
puts more weight on between-category edges compared to RaMST, its power would be
lower if the similarity measure among categories is not informative.
Note that of all the graph-based tests, only the test based on RaMST consistently
outperforms the deviance test.
Figure 7.3: Power versus type I error for tests based on RaMST, RuMST, RaMDP, the likelihood ratio (deviance), and RuNNG under different simulation settings.
This simulation study is limited and only uses ranked data. We chose this study
design for its interpretability. Though simple, the results are informative and show the
advantage of averaged MST over averaged MDP and uNNG for categorical data. Also,
Table 7.1: The power of six tests – four graph-based tests based on RaMST, RuMST,
RaMDP, RuNNG, the deviance test (LR) and Pearson's Chi-square test – at two
significance levels (α = 0.01, 0.05) under different simulation settings.

                       aMST    uMST    aMDP    uNNG    LR      Pearson
N(0,1) vs N(1,1)
  α = 0.01             0.523   0.495   0.428   0.234   0.355   0.346
  α = 0.05             0.762   0.740   0.679   0.492   0.605   0.605
N(0,1) vs N(0,4)
  α = 0.01             0.304   0.321   0.233   0.133   0.165   0.164
  α = 0.05             0.558   0.585   0.482   0.382   0.394   0.396
N(0,1) vs N(1,4)
  α = 0.01             0.560   0.600   0.434   0.291   0.352   0.345
  α = 0.05             0.804   0.824   0.722   0.569   0.632   0.626
U(0,5) vs U(1,6)
  α = 0.01             0.354   0.218   0.310   0.155   0.283   0.251
  α = 0.05             0.665   0.486   0.607   0.383   0.600   0.552
averaged MST is better than uMST when the similarity measure used to construct
the graph is not effective. On the other hand, if the similarity measure is effective,
the test based on uMST is comparable to, and sometimes better than, the test based
on averaged MST. Hence, the rest of this part focuses on the two tests based on
RaMST and RuMST.
7.4 Computational Issues of RaMST and RuMST

The analytic forms for RaMST and RuMST, (7.5) and (7.11), require enumeration of all
MSTs on categories for RaMST, and enumeration of all edges in $M^*_0$ for RuMST. Let
$M = |\mathcal{T}^*_0|$ be the number of MSTs on categories. If the distance matrix between
categories is continuous-valued, then usually M = 1. Even when the distance matrix
is integer-valued, M is small enough to be manageable for many problems. However,
for problems that exhibit certain symmetries, enumeration of the set of all MSTs on
categories is not computationally feasible. For example, Table 7.2 lists the values of
M for the haplotype association problem in Section 8.2, assuming that there are no
empty categories. In this problem, M is computed using the Matrix-Tree Theorem,
Length of haplotype (l)    K     M
2                          4     4
3                          8     384
4                          16    42467328
5                          32    2.078 × 10^19
6                          64    1.66 × 10^45

Table 7.2: The number of categories, K, and the number of MSTs on categories, M, as haplotype length increases for the haplotype association problem in Section 8.2. All categories are assumed to be non-empty.
yielding the formula
$$M = 2^{2^l - l - 1} \prod_{i=2}^{l} \exp\left\{\binom{l}{i} \log i\right\} = 2^{2^l - l - 1} \prod_{i=2}^{l} i^{\binom{l}{i}},$$
where l is the haplotype length. We can see from the formula that M increases
rapidly with l. For example, when the length of the haplotype is 6, which is a reasonably
short length in genetic studies, there are only 64 possible categories, but M is already
larger than $10^{45}$. One may argue that in this case, (7.5) may be further simplified using
the symmetry over categories, so that enumeration of $\mathcal{T}^*_0$ is not necessary. This is true
if all categories are non-empty, but if one or more of the categories are empty, the
symmetry breaks, while M would still be too large for enumeration.
Table 7.3 summarizes the computation time for RaMST and RuMST in terms of K
and M. Consider first the listing of all edges in the uMST on categories, $M^*_0$, which is
required for RuMST. This task can be completed in $O(K^2)$ time through an algorithm
proposed by Eppstein [1995]. Details of the algorithm, as well as its theoretical
justification, are in Appendix C.1. $O(K^2)$ time is usually affordable since K is no larger
than the sample size. Thus, RuMST is computationally feasible for any problem. On the
other hand, RaMST requires the enumeration of all MSTs on categories, not just their
edges, and thus adds O(M) computation time to the algorithm. For the haplotype
example, this makes RaMST computationally infeasible. Thus, in the next section, we
propose a statistic that is motivated by RaMST but is computationally tractable
for all problems.
         Task                                  Computation time
RaMST    Enumerating all MSTs on categories    O(K^2 + M)
RuMST    Listing edges in uMST on categories   O(K^2)

Table 7.3: Computational time for RaMST and RuMST. M is the number of MSTs on categories.
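Eppstein's exact method is deferred to Appendix C.1. For intuition, the same edge set can also be obtained from a standard characterization: an edge of weight w lies in some MST if and only if its endpoints are in different components of the subgraph formed by all strictly lighter edges. Below is a minimal Python sketch of that characterization (our own illustration, using a Kruskal-style pass over weight classes, not Eppstein's algorithm and not the dissertation's code):

```python
# Sketch: list the edges of the uMST (union of all MSTs) of a weighted
# graph. An edge of weight w is in some MST iff its endpoints are in
# different components of the subgraph of strictly lighter edges.
from itertools import groupby

def umst_edges(n_nodes, edges):
    """edges: list of (u, v, weight); returns edges lying in some MST."""
    parent = list(range(n_nodes))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    result = []
    edges = sorted(edges, key=lambda e: e[2])
    for _, group in groupby(edges, key=lambda e: e[2]):
        group = list(group)
        # Test every edge of this weight against the components formed
        # by strictly lighter edges, before merging any of them.
        for u, v, w in group:
            if find(u) != find(v):
                result.append((u, v, w))
        for u, v, _ in group:         # now merge the whole weight class
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return result
```

For a triangle of three equal-weight edges plus one heavier pendant edge, every triangle edge lies in some MST, so the uMST is the whole graph.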
7.5 A Fast Method Generalized from RaMST
Corollary 7.1.2 gives a simple and intuitive form of RaMST when there is a unique MST
on categories. In that special case, RaMST is the sum of mixing potentials computed
within each category and mixing potentials computed between categories that are
connected by an edge of the MST $\tau^*_0$. Evidence against the null increases if this sum
of mixing potentials is small, as compared to random permutation. In (7.9), the MST
$\tau^*_0$ serves as an enumeration of the pairs of categories that are highly similar. There
is nothing sacred about the choice of the MST for this role. The intuitive interpretation
of (7.9) remains if we replace $\tau^*_0$ by any other graph $C_0$ that represents proximity
between categories.
Up to this point, we have assumed that a distance matrix on categories is used
to represent the similarity between categories. We now bypass the distance matrix
and assume that similarity is directly represented by a graph $C_0$ with the categories
as nodes. Our goal is to incorporate the proximity information encoded by the graph
into the two-group comparison. We propose the following statistic, which we call $R_{C_0}$,
obtained by substituting $C_0$ for $\tau^*_0$ in (7.9):
$$R_{C_0} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + \sum_{(u,v)\in C_0} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}. \qquad (7.15)$$
The above statistic has an interpretation similar to that of RaMST: Consider all $C_0$-spanning
graphs, that is, graphs on subjects in which every pair of subjects is connected
by a path if they are in the same category or in two categories that are
connected by a path in $C_0$. A minimum distance $C_0$-spanning graph connects
subjects within each category by a spanning tree and connects exactly one pair of subjects
between each pair of categories that share an edge in $C_0$. $R_{C_0}$ is the average of the sum
(1.1) over all minimum distance $C_0$-spanning graphs.
If $C_0$ is given, computing $R_{C_0}$ requires only $O(K + |C_0|)$ time. If $C_0$ is not given,
its choice can often be guided by domain knowledge. In the examples below,
our choices for $C_0$ include the uMST on categories, which we denote by C-uMST
(the same as $M^*_0$), and the uNNG on categories, which we denote by C-uNNG. Since
C-uMST and C-uNNG can both be computed in $O(K^2)$ time, RC−uMST and RC−uNNG
require only $O(K^2)$ computation time for any problem.
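For concreteness, (7.15) can be computed in a few lines once the per-category group counts and the edges of $C_0$ are available. A minimal sketch (our own code; the names `na`, `nb`, and `edges` are not from the dissertation):

```python
def r_c0(na, nb, edges):
    """R_{C0} of (7.15). na[k], nb[k]: group-a and group-b counts in
    category k; edges: list of (u, v) pairs forming the graph C0."""
    m = [a + b for a, b in zip(na, nb)]   # category totals m_k
    within = sum(2 * na[k] * nb[k] / m[k]
                 for k in range(len(m)) if m[k] > 0)
    between = sum((na[u] * nb[v] + na[v] * nb[u]) / (m[u] * m[v])
                  for u, v in edges)
    return within + between
```

For instance, with counts na = (2, 0), nb = (1, 3) and a single edge between the two categories, the within-category term is 4/3 and the between-category term is 2/3, so $R_{C_0}$ = 2.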
Chapter 8
Examples
In this chapter, the applications of RC−uMST, RC−uNNG and RuMST are illustrated on
several examples, both real and simulated. In the simulated examples, their power is
compared to that of Chi-square tests. The p-values for all tests are calculated from 1,000
permutation samples in each run, and the power is calculated from 1,000 simulation
runs.
8.1 Preference Ranking
Consider comparing two groups of subjects on their rankings of four objects. Let Ξ
be the set of all permutations of {1, 2, 3, 4}. Data are simulated under the
following model: subjects from group a have no preference among the four objects,
and so their ranking is uniformly drawn from Ξ. The rankings of subjects from group
b are generated from the distribution
$$P_\theta(\zeta) = \frac{1}{\psi(\theta)} \exp\{-\theta\, d(\zeta, \zeta_0)\}, \qquad \zeta, \zeta_0 \in \Xi,\ \theta \in \mathbb{R}, \qquad (8.1)$$
where d(·, ·) is a distance function and ψ(θ) is a normalizing constant. This probability
model, first considered by Mallows [1957] with Kendall's or Spearman's distance,
favors rankings that are similar to a modal ranking ζ0 when θ > 0. See Diaconis [1988]
for further discussion. The larger the value of θ, the more the rankings in
group b cluster around the mode ζ0. We experimented with both Kendall's and Spearman's
distance and various values for θ. We assumed that the true distance function used
to generate the data is either known and used to construct the graph, or unknown,
in which case an incorrect distance is used.
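The Mallows draw for group b can be sketched by direct enumeration, which is feasible here because |Ξ| = 4! = 24. In the sketch below (our own code, with our own function names), Kendall's distance is computed as the number of discordant object pairs, and the normalizer ψ(θ) cancels in the sampling weights:

```python
import itertools
import math
import random

def kendall_distance(x, y):
    """Number of object pairs ranked in opposite order by rankings x, y."""
    n = len(x)
    return sum((x[i] - x[j]) * (y[i] - y[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def mallows_sample(zeta0, theta, size, rng=None):
    """Draw rankings from P_theta(zeta) ∝ exp(-theta * d(zeta, zeta0)),
    i.e., model (8.1) with Kendall's distance."""
    rng = rng or random.Random()
    xi = list(itertools.permutations(range(1, len(zeta0) + 1)))  # the set Ξ
    weights = [math.exp(-theta * kendall_distance(z, zeta0)) for z in xi]
    return rng.choices(xi, weights=weights, k=size)
```

With θ = 5, the draws concentrate tightly on ζ0; with θ = 0, the draws are uniform on Ξ, matching group a.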
Figure 8.1 shows C-uMST and C-uNNG formed on a typical data set generated
under θ = 5 with na = nb = 20. Spearman’s distance is used in both the generating
model and for constructing the graph. In this particular example, C-uMST contains
all edges in C-uNNG with three extra edges, shown in thinner lines. The reason this
happens is that no category is as close to category “3241” as category “3142”, and no
category is as close to category “3142” as category “3241”. An MST on categories,
however, needs more edges to form a spanning tree. It is clear that in this case, there
are three MSTs on categories, each obtained by adding one of the three thinner
edges to the C-uNNG. In most simulation runs, C-uMST and C-uNNG are the same,
while in those runs where they differ, C-uNNG is always a subset of C-uMST.
Figure 8.2 shows the power versus type I error for θ = 5, na = nb = 20 under
different combinations of using Kendall’s or Spearman’s distance for the generating
model and for constructing the graph; and Table 8.1 lists the power under two most
commonly used significance levels – 0.01 and 0.05. We see that even when a wrong
distance is used, the graph-based tests still have significantly higher power than the
Chi-square tests. For this simulation setting, RuMST is the most powerful among the
three graph-based tests; RC−uMST and RC−uNNG perform similarly, with RC−uMST a little
better in all cases, implying that the extra edges in C-uMST do give additional useful
information.
8.2 Haplotype Association
In this example, we consider a disease model where the probability for disease depends
on the haplotype at four single nucleotide polymorphisms (SNPs). We encode the allele
at each SNP as 0 or 1, and so the haplotype can be represented as a binary string.
We assume that the disease probability depends on the number of positions at which
the subject’s haplotype agrees with a target haplotype:
P (Disease) = 0.3 + 0.1× (Number of positions in agreement).
Figure 8.1: C-uMST and C-uNNG constructed on a typical data set generated under parameters ζ0 = 1234 and θ = 5 with na = nb = 20. Spearman's distance is used in both the generating model and for constructing the graph. Each node is labeled by the ranking it represents, followed by the number of subjects from groups a and b with that ranking in parentheses.
               uMST   C-uMST  C-uNNG  Pearson  LR
KK  α = 0.01   0.566  0.413   0.397   0.214    0.206
    α = 0.05   0.784  0.660   0.648   0.450    0.439
KS  α = 0.01   0.567  0.395   0.385   0.221    0.209
    α = 0.05   0.784  0.649   0.631   0.455    0.437
SS  α = 0.01   0.588  0.491   0.478   0.247    0.240
    α = 0.05   0.807  0.715   0.703   0.485    0.480
SK  α = 0.01   0.607  0.495   0.486   0.253    0.248
    α = 0.05   0.811  0.729   0.715   0.494    0.481

Table 8.1: The power of five tests – three graph-based tests based on RuMST, RC−uMST, RC−uNNG and two Chi-square tests – under two significance levels (α = 0.01, 0.05) and different simulation settings.
Figure 8.2: Power versus type I error for five different tests in the preference ranking example with θ = 5 and na = nb = 20. One of two distance measures (Kendall's or Spearman's distance) was used for the generating model and for constructing the graph. The title of each plot denotes the choice of distances: the first letter denotes the distance used in the generating model (“K” for Kendall's and “S” for Spearman's distance), and the second letter denotes the distance used in constructing the graph. For instance, “KS” in the bottom left panel means that Kendall's distance is used in the generating model, but Spearman's distance is used in constructing the graph.
Thus, the probability of disease can take the values 0.3, 0.4, 0.5, 0.6 or 0.7, depending
on whether there are 0, 1, 2, 3 or 4 positions in agreement. To make the problem
harder, we assume that seven non-informative SNPs are analyzed together with the
four informative SNPs, and that which and how many of the 11 SNPs are informative
is unknown in the analysis. Thus the data actually consists of haplotypes of length 11.
There are 211 = 2, 048 possible categories. In each simulation, 1,000 haplotypes with
length 11 are generated uniformly from all possible haplotypes. Each subject with a
given haplotype is assigned as “patient” or “normal” according to the disease model.
Since only 1,000 subjects are simulated in each run, not all of the 2,048 categories
are represented. The number of non-empty categories in each run ranged from 755
to 823, with an average of 791 in the 1000 simulation runs. The Hamming distance
is used to construct the graph. Figure 8.3 shows the power versus type I error plots
for the five tests. It is clear that, by incorporating the information in the graph, tests
based on RuMST, RC−uMST and RC−uNNG all have much higher power than the Pearson’s
Chi-square and deviance tests. Among the three graph-based tests, the one based on
RuMST works a little better than the ones based on RC−uMST and RC−uNNG.
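One simulation run of this design can be sketched as follows (a schematic reimplementation with our own function and parameter names, not the dissertation's code):

```python
import random

def simulate_haplotype_run(n=1000, length=11, informative=(0, 1, 2, 3),
                           target=(1, 1, 1, 1), rng=None):
    """One run of the disease-model simulation: haplotypes are drawn
    uniformly, and P(disease) = 0.3 + 0.1 * (number of informative
    positions that agree with the target haplotype)."""
    rng = rng or random.Random()
    groups = {"patient": [], "normal": []}
    for _ in range(n):
        hap = tuple(rng.randint(0, 1) for _ in range(length))
        agree = sum(hap[p] == t for p, t in zip(informative, target))
        label = "patient" if rng.random() < 0.3 + 0.1 * agree else "normal"
        groups[label].append(hap)
    return groups

def hamming(x, y):
    """Hamming distance, used to build the graph on categories."""
    return sum(a != b for a, b in zip(x, y))
```

With 1,000 subjects drawn from 2^11 = 2,048 equally likely haplotypes, the expected number of non-empty categories is 2048(1 − (1 − 1/2048)^1000) ≈ 791, consistent with the range reported above.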
8.3 Binary Clinical Features
This example comes from Anderson et al. [1972] and Nettleton and Banerjee [2001].
Data on the presence or absence of 17 clinical features of the eye ailment keratoconjunctivitis
sicca (KCS) are given for two groups of patients. A question asked
by Nettleton and Banerjee was whether the two groups of patients share a common
distribution with respect to these clinical features. The sizes of the groups are 40 and
24. It turned out that only two subjects had the same outcome for the 17 clinical
features, so there are in total 63 distinct categories. The Hamming distance is used to
construct the graph, and p-values are calculated through 10,000 permutation samples
and shown in Table 8.2. Nettleton and Banerjee’s method is based on the uNNG on
subjects. As discussed before and confirmed by simulation studies in Section 7.3, the
uNNG on subjects has lower power than MST based tests when many categories have
more than one subject. This is not a problem in this data because only one category
has more than one subject. We see that RuMST, RC−uMST and RC−uNNG all detected the
CHAPTER 8. EXAMPLES 85
Figure 8.3: The power versus type I error plots for the five tests for the haplotypeexample. The length of the haplotype is 11, but only 4 positions informative.
CHAPTER 8. EXAMPLES 86
difference between the two groups of patients, while the Chi-square tests did not.
Table 8.2: p-values for the KCS data set.

RuMST    RC−uMST   RC−uNNG   Nettleton and Banerjee's   Pearson   LR
0.0011   0.0010    0.0006    0.0007                     0.5200    0.5200
Chapter 9
Permutation Distributions of the Test Statistics
Based on the results in Sections 7.3-7.5, we focus on RC−uMST and RuMST. In this chapter,
we consider the permutation distributions of these two statistics in their generalized
forms. That is, we consider $R_{C_0}$ and $T_{C_0}$, the latter defined as
$$T_{C_0} = \sum_{u=1}^{K} n_{au} n_{bu} + \sum_{(u,v)\in C_0} (n_{au} n_{bv} + n_{bu} n_{av}). \qquad (9.1)$$
TC−uMST is equivalent to RuMST.
We define two quantities that will be used to characterize the permutation distributions:
$$\lambda := \max_u |E^{C_0}_u|, \text{ the maximum node degree in } C_0, \qquad (9.2)$$
$$\beta := \max_u m_u, \text{ the maximum total count for a category.} \qquad (9.3)$$
By permutation distribution, we are referring to the distribution of the statistic under
random uniform permutation of the group labels. This is used as the null distribution
to assess statistical significance. We use PP, EP and VarP to denote the probability,
expectation and variance under the permutation null.
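Operationally, the permutation null is sampled by re-randomizing the group labels. A generic sketch (our own code; `statistic` stands for any of the statistics above, and small values are treated as evidence against the null, as discussed in Section 7.5):

```python
import random

def permutation_pvalue(labels, statistic, n_perm=1000, rng=None):
    """Approximate P(statistic <= observed) under uniform relabeling.

    Small values of these statistics indicate departure from the null,
    so the p-value is the lower-tail probability; +1 in numerator and
    denominator keeps the estimate strictly positive.
    """
    rng = rng or random.Random()
    observed = statistic(labels)
    labels = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if statistic(labels) <= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

For a statistic that is already at its minimum on the observed labeling, the returned p-value is small, as expected.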
9.1 RC0
The following lemma states that the first two moments of RC0 under the permutation
null can be computed instantaneously using basic summary statistics of the graph
and cell counts of the contingency table.
Lemma 9.1.1. The mean and variance of $R_{C_0}$ under the permutation null are
$$\mathrm{E}_P[R_{C_0}] = 2p_1(N - K + |C_0|), \qquad (9.4)$$
$$\begin{aligned}
\mathrm{Var}_P[R_{C_0}] &= 4(p_1 - p_2)\Big(N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u|^2}{4m_u} - \sum_{u=1}^{K} \frac{|E_u|}{m_u}\Big) \\
&\quad + (6p_2 - 4p_1)\Big(K - \sum_{u=1}^{K} \frac{1}{m_u}\Big) + p_2 \sum_{(u,v)\in C_0} \frac{1}{m_u m_v} + (N - K + |C_0|)^2 (p_2 - 4p_1^2), \qquad (9.5)
\end{aligned}$$
where
$$p_1 = \frac{n_a n_b}{N(N-1)}, \qquad p_2 = \frac{4 n_a(n_a - 1)\, n_b(n_b - 1)}{N(N-1)(N-2)(N-3)}. \qquad (9.6)$$
Proof of Lemma 9.1.1 is given in Appendix C.2.1.
Proof of Lemma 9.1.1 is given in Appendix C.2.1.
Remark 9.1.2. As $N \to \infty$ with $n_a/N \to \gamma \in (0,1)$, we have $p_2 - 4p_1^2 \to 0$, so the last term of (9.5) is negligible and
$$\mathrm{Var}_P[R_{C_0}] \approx 4(p_1 - p_2)\Big(N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u|^2}{4m_u} - \sum_{u=1}^{K} \frac{|E_u|}{m_u}\Big) + (6p_2 - 4p_1)\Big(K - \sum_{u=1}^{K} \frac{1}{m_u}\Big) + p_2 \sum_{(u,v)\in C_0} \frac{1}{m_u m_v}.$$
Furthermore, if $\gamma = 0.5$, then $p_1, p_2 \to 1/4$ and
$$\mathrm{Var}_P[R_{C_0}] \approx \frac{1}{2}\Big(K - \sum_{u=1}^{K} \frac{1}{m_u}\Big) + \frac{1}{4}\sum_{(u,v)\in C_0} \frac{1}{m_u m_v}.$$
We next give sufficient conditions guaranteeing the convergence to normality of
RC0 after standardization by its mean and variance.
Condition 1.
$$\sum_{u=1}^{K} m_u\big(m_u + |E^{C_0}_u|\big)\Big(m_u + \sum_{v\in V_u} m_v + |E^{C_0}_{u,2}|\Big) \sim o(K^{3/2}),$$
$$\sum_{(u,v)\in C_0} \big(m_u + m_v + |E^{C_0}_u| + |E^{C_0}_v|\big)\Big(m_u + m_v + \sum_{w\in(V_u\cup V_v)} m_w + |E^{C_0}_{u,2}| + |E^{C_0}_{v,2}|\Big) \sim o(K^{3/2}).$$
Condition 1 constrains the size of “hubs” in the graph: The node degrees in C0
and the number of observations in each category must not get too large. It can be
simplified to stronger conditions that are easier to comprehend. For example, the
following implies Condition 1:
Condition 1′′. $\beta^6\lambda^2$ and $\lambda^8$ are both $o(K)$.
The second condition is usually trivial:
Condition 2. $N$, $|C_0|$, and $\sum_{(u,v)\in C_0} \frac{1}{m_u m_v}$ are all $O(K)$.
The asymptotic distribution of the standardized form of RC0 is given in the fol-
lowing theorem.
Theorem 9.1.3. Assume that Conditions 1 and 2 hold. Under the permutation null,
the standardized statistic
$$\frac{R_{C_0} - \mathrm{E}_P[R_{C_0}]}{\sqrt{\mathrm{Var}_P[R_{C_0}]}}$$
converges in distribution to $N(0,1)$ as $K \to \infty$ with $n_a/N$ bounded away from 0
and 1.
The proof of Theorem 9.1.3 is given in Appendix C.2.2.
Theorem 9.1.3 can be applied to any type of graph, allowing for repeated observations
of each node. Since the statistics in Friedman and Rafsky [1979] and Rosenbaum
[2005] do not allow ties, their asymptotic normality results are restricted to the
case where each node is observed only once. To compare Theorem 9.1.3 to its counterpart
in these two papers, we let $G = C_0$ and assume that $m_u \equiv 1$. Thus,
$$N = K, \qquad \sum_{(u,v)\in C_0} \frac{1}{m_u m_v} = |C_0| = |G|.$$
Condition 2 requires that $|G| \sim O(K)$, and Condition 1 simplifies to
$$\sum_{u=1}^{K} |E^G_u||E^G_{u,2}| \sim o(K^{3/2}), \qquad \sum_{(u,v)\in G} \big(|E^G_u| + |E^G_v|\big)\big(|E^G_{u,2}| + |E^G_{v,2}|\big) \sim o(K^{3/2}).$$
Hence, Theorem 9.1.3 implies the asymptotic normality result in Rosenbaum
[2005], since $|E^G_u| \equiv 1$, $|E^G_{u,2}| \equiv 1$, and $|G| = K/2$ for the MDP. Friedman and Rafsky proved a
more general condition for asymptotic normality of sums (1.1) after standardization:
for sparse graphs where $|G| \sim O(K)$, the number of edge pairs that share a common
node must be $O(K)$. Condition 1 is neither stronger nor weaker than Friedman and
Rafsky's condition. For example, consider a graph with one node of degree $K^{1/2}$
and all other nodes of degree 1; this graph satisfies Friedman and Rafsky's condition
but not Condition 1, since $\sum_{(u,v)\in G}(|E^G_u| + |E^G_v|)(|E^G_{u,2}| + |E^G_{v,2}|) \sim O(K^{3/2})$. On the other hand, a
graph with $\sqrt{K}$ nodes of degree $K^{0.3}$ and all other nodes of degree 1 would
satisfy Condition 1 but not Friedman and Rafsky's condition.
9.2 TC0
The following lemma is the counterpart for $T_{C_0}$ of Lemma 9.1.1. Its proof is given
in Appendix C.2.3.
Lemma 9.2.1. The mean and variance of $T_{C_0}$ under the permutation null are
$$\mathrm{E}_P[T_{C_0}] = \Big(\sum_{u=1}^{K} m_u(m_u - 1) + 2\sum_{(u,v)\in C_0} m_u m_v\Big) p_1, \qquad (9.7)$$
$$\begin{aligned}
\mathrm{Var}_P[T_{C_0}] &= (p_1 - p_2) \sum_{u=1}^{K} m_u\Big(m_u + \sum_{v\in V_u} m_v - 1\Big)\Big(m_u + \sum_{v\in V_u} m_v - 2\Big) \\
&\quad + \Big(p_1 - \frac{p_2}{2}\Big)\Big(\sum_{u=1}^{K} m_u(m_u - 1) + 2\sum_{(u,v)\in C_0} m_u m_v\Big) \\
&\quad + \frac{p_2 - 4p_1^2}{4}\Big(\sum_{u=1}^{K} m_u(m_u - 1) + 2\sum_{(u,v)\in C_0} m_u m_v\Big)^2, \qquad (9.8)
\end{aligned}$$
where $p_1$ and $p_2$ are given in (9.6).
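Lemma 9.2.1 can be verified numerically by exhaustive enumeration of group assignments on a toy configuration. The sketch below (our own code) implements the statistic (9.1) and the permutation moments, with the squared term in the variance carrying a factor of 1/4, which the exact enumeration in the usage example confirms:

```python
def t_c0(na, nb, edges):
    """T_{C0} of (9.1); na[u], nb[u] are per-category group counts."""
    return (sum(a * b for a, b in zip(na, nb))
            + sum(na[u] * nb[v] + nb[u] * na[v] for u, v in edges))

def t_c0_moments(m, edges, n_a, n_b):
    """Permutation mean and variance of T_{C0} (Lemma 9.2.1)."""
    N = n_a + n_b
    p1 = n_a * n_b / (N * (N - 1))
    p2 = (4 * n_a * (n_a - 1) * n_b * (n_b - 1)
          / (N * (N - 1) * (N - 2) * (N - 3)))
    nbh = {u: [] for u in range(len(m))}       # neighbors V_u in C0
    for u, v in edges:
        nbh[u].append(v)
        nbh[v].append(u)
    bracket = (sum(mu * (mu - 1) for mu in m)
               + 2 * sum(m[u] * m[v] for u, v in edges))
    deg = [m[u] + sum(m[v] for v in nbh[u]) for u in range(len(m))]
    term1 = sum(m[u] * (deg[u] - 1) * (deg[u] - 2) for u in range(len(m)))
    var = ((p1 - p2) * term1 + (p1 - p2 / 2) * bracket
           + (p2 - 4 * p1 ** 2) * bracket ** 2 / 4)
    return bracket * p1, var
```

For m = (2, 2, 2) with the single edge (0, 1) and $n_a = n_b = 3$, the lemma gives mean 4.2 and variance 0.96, in agreement with brute force over all 20 labelings.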
The next theorem gives a sufficient condition for asymptotic normality of TC0
under the permutation null.
Theorem 9.2.2. If $\sum_{u=1}^K m_u\big(m_u + \sum_{v\in V_u} m_v\big)^2 \sim O(N)$, then under the permutation
null distribution, the standardized statistic
$$\frac{T_{C_0} - \mathrm{E}_P[T_{C_0}]}{\sqrt{\mathrm{Var}_P[T_{C_0}]}},$$
where $\mathrm{E}_P[T_{C_0}]$ and $\mathrm{Var}_P[T_{C_0}]$ are given in (9.7) and (9.8), converges in distribution
to $N(0,1)$ as $N \to \infty$ with $n_a/N$ bounded away from 0 and 1.
Proof. Let $G$ be the graph on subjects induced by $C_0$: two subjects are connected if
they are in the same category or in two categories joined by an edge in $C_0$. Then as
long as $\sum_{i=1}^N |G_i|(|G_i| - 1) \sim O(N)$, asymptotic normality follows from Friedman
and Rafsky [1979]'s result. Notice that if $i$ is in category $u$, then $|G_i| = (m_u - 1) + \sum_{v\in V_u} m_v$.
9.3 Checking the p-Values Under Normal Approximations
We now check the normal approximations to the p-values of the three graph-based
statistics – RC−uMST, RC−uNNG and RuMST – through simulation. We adopt the setting
of the haplotype example. In each simulation run, N haplotypes with length l are
generated uniformly from all possible haplotypes with length l. They are assigned to
either group with equal probability. Hence, the two groups have the same distribution.
For each simulation run, we calculate the difference between theoretical p-values from
the normal approximation and the permutation p-values from 10,000 permutations
for the three statistics. We consider different sparsity settings by varying l, which
controls the number of categories, and N . Under each setting, 100 simulation runs
are done, with boxplots of the differences between the theoretical and permutation p-values
shown in Figure 9.1. We increased l from 6 to 10, and thus the number of possible
categories considered grows from 64 to 1024. The sample size N varies from 100 to
1000. This spectrum of values is reasonable for a genetic association study.
Simulation results under this setting show that the normal approximation is
better for RC−uMST and RC−uNNG than for RuMST. The accuracy of the normal approximation
improves for all statistics as l and N increase. For RC−uMST and RC−uNNG, when the
number of possible categories is larger than 256 and the number of observations is
larger than 200, the p-value from the normal approximation is quite accurate, while
for RuMST the number of observations needs to be larger than 500 to achieve similar
accuracy. For RuMST, when the number of possible categories is larger than the number
of observations, the p-value calculated from the normal approximation is negatively
biased, and thus less conservative. The bias is less severe for RC−uMST and RC−uNNG,
but is still problematic when the number of possible categories is 1024 and the number
of observations is only 100. A skewness correction can be applied to make the theoretical
p-values more accurate, but when N is this small, it is easier to simply do the permutation
directly.
Figure 9.1: Boxplots of the differences between p-values calculated from the normal approximation and from 10,000 permutations.
Chapter 10
Discussion
We have described a new approach for comparing two categorical samples, which
is appealing when the contingency table is sparsely populated. Sparse contingency
tables are common in many modern applications where the number of categories,
K, is large compared to sample size. In such situations, the different categories can
usually be related to each other in a systematic way. The new approach utilizes
a graphical encoding of the similarity between categories to improve the power of
two-sample comparison. We showed, through simulations and real examples, that
utilizing graphical information improves the power over deviance and Pearson’s Chi-
square tests. The proposed statistics are shown to be asymptotically normal after
standardization, under assumptions that limit the hub size and density of the graph.
This allows instantaneous type I error control for large data sets.
The power of the new approach depends on the choice of an informative similarity
measure between categories. This part of the analysis should rely on domain knowledge
that is specific to each application. For ranking data from surveys, one can
start with the standard distance measures used in Section 8.1. When the number of
categories is large, drawing relationships between the categories is a necessary, and often
default, step in analyzing the data.
Both RC−uMST and RuMST work well when the similarity information is effective,
with RuMST usually having better power. However, when the similarity measure is
not as informative, RuMST can have very low power, even compared to Chi-square
tests. In our simulation studies derived from the haplotype problem, the
normal approximation is more accurate for RC−uMST than for RuMST. For RuMST, p-values
obtained from normal approximation are lower than actual p-values in extremely
sparse situations. All p-value approximations work well when the sample size is
comparable to the number of categories.
Generalization of this approach to multi-sample comparison is straightforward by
letting gi take K ′ distinct values, where K ′ is the number of groups.
Appendix A
Existing Theorems Used in Proofs
A.1 Stein’s Method
Here, we state one form of the Stein’s method we use. Consider sums of the form
W =∑
i∈J ξi, where J is an index set and ξ are random variables with E[ξi] = 0, and
E[W 2] = 1. The following assumption restricts the dependence between ξi : i ∈ J .
Assumption A.1.1. [Chen and Shao, 2005, p. 17] For each $i \in \mathcal{J}$ there exist
$S_i \subset T_i \subset \mathcal{J}$ such that $\xi_i$ is independent of $\xi_{S_i^c}$ and $\xi_{S_i}$ is independent of $\xi_{T_i^c}$.
Theorem A.1.1. [Chen and Shao, 2005, Theorem 3.4] Under Assumption A.1.1, we
have
$$\sup_{h\in \mathrm{Lip}(1)} |Eh(W) - Eh(Z)| \le \delta,$$
where $\mathrm{Lip}(1)$ is the set of all uniformly Lipschitz functions, $Z$ has the $N(0,1)$ distribution, and
$$\delta = 2\sum_{i\in\mathcal{J}} \big(E|\xi_i\eta_i\theta_i| + |E(\xi_i\eta_i)|\,E|\theta_i|\big) + \sum_{i\in\mathcal{J}} E|\xi_i\eta_i^2|, \qquad (A.1)$$
with $\eta_i = \sum_{j\in S_i} \xi_j$ and $\theta_i = \sum_{j\in T_i} \xi_j$, where $S_i$ and $T_i$ are defined in Assumption
A.1.1.
Appendix B
Supporting Materials for Part I
B.1 Proofs for the Limiting Distributions
We here prove that $\{Z_G([nu]) : 0 < u < 1\}$ converges to a Gaussian process. The
proof of the convergence of $\{Z_G([nu], [nv]) : 0 < u < v < 1\}$ to a two-dimensional
Gaussian random field can be done in the same manner, but with a more careful
treatment of the indices.
To prove that $\{Z_G([nu]) : 0 < u < 1\}$ converges to a Gaussian process, we only need
to show that $(Z_G([nu_1]), Z_G([nu_2]), \ldots, Z_G([nu_K]))$ is multivariate Gaussian for any
$0 < u_1 < u_2 < \cdots < u_K < 1$ and fixed $K$. For simplicity, let $t_k = [nu_k]$, $k = 1, \ldots, K$.
To prove that $(Z_G(t_1), Z_G(t_2), \ldots, Z_G(t_K))$ is multivariate Gaussian, we take one
step back. Under the permutation distribution, we permute the order of the observations.
Let $\pi(i)$ be the observed time of $y_i$ after permutation; then $(\pi(1), \ldots, \pi(n))$ is a
permutation of $1, \ldots, n$. The permutation distribution can be generated in two steps:
1) for each $i$, $\pi(i)$ is sampled uniformly from 1 to $n$; 2) only those samples in which each
value in $1, \ldots, n$ is sampled exactly once are retained. It is easy to see that each
permutation has the same probability of occurring after the two steps.
We call the distribution resulting from performing only the first step the bootstrap
distribution, and we use $P_B$, $E_B$ and $\mathrm{Var}_B$ to denote the probability, expectation and
variance, respectively. ($P$, $E$ and $\mathrm{Var}$ without the subscript $B$ denote the
corresponding quantities under the permutation null.) Let
$$Z^B_G(t) = -\frac{R_G(t) - \mathrm{E}_B(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}};$$
$$n_B(t) = \sum_{i=1}^n I_{\pi(i)\le t}, \qquad X^B(t) = \frac{n_B(t) - t}{\sqrt{t(1 - t/n)}}.$$
Then, following an argument similar to that in the proof of Lemma 3.2.2 but with the
permutation distribution replaced by the bootstrap distribution, we have
$$\mathrm{E}_B(R_G(t)) = p^B_1(t)|G|,$$
$$\mathrm{Var}_B(R_G(t)) = p^B_2(t)|G| + \Big(\frac{1}{2}p^B_1(t) - p^B_2(t)\Big)\sum_i |G_i|^2,$$
where
$$p^B_1(t) = \frac{2t(n-t)}{n^2}, \qquad p^B_2(t) = \frac{4t^2(n-t)^2}{n^4}.$$
We prove the following two lemmas.
Lemma B.1.1. If $\sum_{e\in G} |A_e||B_e| \sim o(|G|^{3/2})$ and $|G| \sim O(n)$, then under the bootstrap
null, for $0 < u_1 < u_2 < \cdots < u_K < 1$,
$$(Z^B_G(t_1), Z^B_G(t_2), \ldots, Z^B_G(t_K), X^B(t_1), X^B(t_2), \ldots, X^B(t_K))$$
converges to a non-degenerate multivariate Gaussian distribution.
Lemma B.1.2. When $|G| \sim o(n^2)$ and $t \sim O(n)$, as $|G| \to \infty$ we have
1. $\dfrac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))} \to 1$;
2. $\dfrac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} \to 0$.
Since $(Z^B_G(t_1), \ldots, Z^B_G(t_K) \mid X^B(t_1) = 0, \ldots, X^B(t_K) = 0)$ under
the bootstrap null has the same distribution as $(Z_G(t_1), \ldots, Z_G(t_K))$ under
the permutation null, and
$$\frac{R_G(t) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}(R_G(t))}} = \sqrt{\frac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))}}\left(\frac{R_G(t) - \mathrm{E}_B(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} + \frac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}}\right),$$
Lemma B.1.1 and Lemma B.1.2 give the result that $(Z_G([nu_1]), Z_G([nu_2]), \ldots, Z_G([nu_K]))$ is multivariate Gaussian.
We next prove the two lemmas.
Proof for Lemma B.1.1. We only need to show that
(a) $\mathrm{Var}_B\big(\sum_{k=1}^K (a_k Z^B_G(t_k) + b_k X^B(t_k))\big)$ is bounded away from 0 whenever $\sum_k (a_k^2 + b_k^2) \neq 0$;
(b) $\sum_{k=1}^K (a_k Z^B_G(t_k) + b_k X^B(t_k))$ is asymptotically normally distributed for any fixed $a_k, b_k$.
Let $\sigma_B(t_k) = \sqrt{\mathrm{Var}_B(R_G(t_k))}$. Following arguments similar to those in Section 4.2.2 but
with the permutation distribution replaced by the bootstrap distribution, we have
$$\mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2)) = \frac{4u_1^2(1-u_2)^2|G| + u_1(1-u_2)(1-2u_1)(1-2u_2)\sum_i |G_i|^2}{\sigma_B(t_1)\sigma_B(t_2)},$$
and $|\mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2))|$ is strictly bounded away from 1 when $u_1 \neq u_2$.
Notice that
$$R_G(t)\, n_B(t) = \sum_{(i,j)\in G} \big(I_{\pi(i)\le t,\, \pi(j)>t} + I_{\pi(i)>t,\, \pi(j)\le t}\big) \sum_{l=1}^n I_{\pi(l)\le t}$$
$$= \sum_{(i,j)\in G} \Big(I_{\pi(i)\le t,\, \pi(j)>t} + \sum_{l\ne i,j} I_{\pi(i)\le t,\, \pi(j)>t,\, \pi(l)\le t} + I_{\pi(i)>t,\, \pi(j)\le t} + \sum_{l\ne i,j} I_{\pi(i)>t,\, \pi(j)\le t,\, \pi(l)\le t}\Big),$$
so $\mathrm{E}_B(R_G(t)\, n_B(t)) = |G|\,2u(1-u)(un - 2u + 1)$, and then
$$\mathrm{cov}_B(Z^B_G(t), X^B(t)) = \frac{2u(1-u)(1-2u)|G|}{\sigma_B(t)\sqrt{nu(1-u)}}.$$
If $u = 1/2$, then $\mathrm{cov}_B(Z^B_G(t), X^B(t)) = 0$, which is strictly bounded away from 1. For
$u \neq 1/2$,
$$\mathrm{cov}_B(Z^B_G(t), X^B(t)) = \frac{1}{\sqrt{\dfrac{n\sum_i |G_i|^2}{4|G|^2} + \dfrac{nu(1-u)}{|G|(1-2u)^2}}}.$$
Since $n\sum_i |G_i|^2 \ge 4|G|^2$ by Cauchy-Schwarz and $|G| \sim O(n)$, we have
$\mathrm{cov}_B(Z^B_G(t), X^B(t)) \ge 0$ and strictly bounded away from 1.
Given that $|\mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2))|$ is strictly bounded away from 1
when $u_1 \neq u_2$, and that $\mathrm{cov}_B(Z^B_G(t), X^B(t)) \ge 0$ and is strictly bounded away from 1 when $t \sim
O(n)$, (a) follows from elementary arguments.
Let $\sigma_0^2 := \mathrm{Var}_B\big(\sum_{k=1}^K (a_k Z^B_G(t_k) + b_k X^B(t_k))\big)$; then $\sigma_0 \sim O(1)$ when $\sum_k (a_k^2 + b_k^2) \neq 0$.
We prove (b) using Stein's method; in particular, the version of Stein's method
stated in Appendix A.1 is used. We adopt the same notation, with the index set
$\mathcal{J} = G \cup \{1, \ldots, n\}$. Let
$$\xi_{e,k} = \frac{I_{g_{\pi(e-)}(t_k) \neq g_{\pi(e+)}(t_k)} - p^B_1(t_k)}{\sigma_B(t_k)}.$$
Since $I_{g_{\pi(e-)}(t_k) \neq g_{\pi(e+)}(t_k)} \in \{0, 1\}$ and $p^B_1(t_k) \in (0, 1)$, we have
$$|\xi_{e,k}| \le \frac{1}{\sigma_B(t_k)}.$$
Let
$$\xi_{i,k} = \frac{I_{\pi(i)\le t_k} - u_k}{\sqrt{nu_k(1-u_k)}}.$$
Similarly, we have
$$|\xi_{i,k}| \le \frac{1}{\sqrt{nu_k(1-u_k)}}.$$
Let $\xi_e = \sum_k a_k \xi_{e,k}/\sigma_0$ and $\xi_i = \sum_k b_k \xi_{i,k}/\sigma_0$; then $W = \sum_{j\in\mathcal{J}} \xi_j = \sum_k (a_k Z^B_G(t_k) + b_k X^B(t_k))/\sigma_0$, with $\mathrm{E}_B(W) = 0$ and $\mathrm{E}_B(W^2) = 1$. Let $a = \max(\max_k |a_k|, \max_k |b_k|)$ and
$\sigma = \min(\min_k \sigma_B(t_k), \min_k \sqrt{nu_k(1-u_k)})$; then $\sigma$ is at least of order $\sqrt{n}$, and
$$|\xi_j| \le \frac{aK}{\sigma\sigma_0}, \qquad \forall j \in \mathcal{J}.$$
For $e \in G$, let
$$S_e = A_e \cup \{e-, e+\}, \qquad T_e = B_e \cup \{\text{nodes in } A_e\},$$
where $A_e$ and $B_e$ are defined in (4.3) and (4.4). Then $S_e$ and $T_e$ satisfy Assumption A.1.1.
For $i = 1, \ldots, n$, let
$$S_i = G_i, \qquad T_i = G_{i,2} \cup \{\text{nodes in } G_i\},$$
where $G_{i,2}$ is the subgraph of $G$ consisting of all edges that connect to edges in $G_i$. Then $S_i$ and $T_i$
satisfy Assumption A.1.1.
We have $|S_e| = |A_e| + 2$, $|T_e| = |B_e| + |A_e| + 1$, $|S_i| = |G_i|$, and $|T_i| = |G_{i,2}| + |G_i| + 1$.
By Theorem A.1.1, we have $\sup_{h\in \mathrm{Lip}(1)} |Eh(W) - Eh(Z)| \le \delta$ for $Z \sim N(0,1)$,
where
$$\begin{aligned}
\delta &= 2\sum_{j\in\mathcal{J}} \big(E|\xi_j\eta_j\theta_j| + |E(\xi_j\eta_j)|\,E|\theta_j|\big) + \sum_{j\in\mathcal{J}} E|\xi_j\eta_j^2| \\
&= 2\sum_{e\in G} \big(E|\xi_e\eta_e\theta_e| + |E(\xi_e\eta_e)|\,E|\theta_e|\big) + \sum_{e\in G} E|\xi_e\eta_e^2| \\
&\quad + 2\sum_{i=1}^n \big(E|\xi_i\eta_i\theta_i| + |E(\xi_i\eta_i)|\,E|\theta_i|\big) + \sum_{i=1}^n E|\xi_i\eta_i^2| \\
&\le \frac{a^3K^3}{\sigma^3\sigma_0^3}\Big(\sum_{e\in G} 5(|A_e|+2)(|B_e|+|A_e|+1) + \sum_{i=1}^n 5|G_i|(|G_{i,2}|+|G_i|+1)\Big) \\
&\le \frac{a^3K^3}{\sigma^3\sigma_0^3}\Big(45\sum_{e\in G} |A_e||B_e| + 15\sum_{i=1}^n |G_i||G_{i,2}|\Big).
\end{aligned}$$
Since $\sigma$ is at least of order $\sqrt{n}$, $\sigma_0 \sim O(1)$, and $|G| \sim O(n)$, when $\sum_{e\in G} |A_e||B_e| \sim o(n^{3/2})$
and $\sum_{i=1}^n |G_i||G_{i,2}| \sim o(n^{3/2})$, we have $\delta \to 0$ as $n \to \infty$.
Also observe that if $e = (i, j)$, then $G_i, G_j \subseteq A_e$ and $G_{i,2}, G_{j,2} \subseteq B_e$. For each node
$i$, we can pick an edge $e$ that connects to $i$, and then $|G_i||G_{i,2}| \le |A_e||B_e|$.
Each edge can be picked at most twice, since an edge connects only two
nodes; therefore,
$$\sum_{i=1}^n |G_i||G_{i,2}| \le 2\sum_{e\in G} |A_e||B_e|.$$
So $\sum_{e\in G} |A_e||B_e| \sim o(n^{3/2})$ ensures $\sum_{i=1}^n |G_i||G_{i,2}| \sim o(n^{3/2})$.
Proof for Lemma B.1.2. Let $u = \lim_{n\to\infty} t/n$; then
$$\lim_{n\to\infty} p_1(t) = \lim_{n\to\infty} p^B_1(t) = 2u(1-u), \qquad \lim_{n\to\infty} p_2(t) = \lim_{n\to\infty} p^B_2(t) = 4u^2(1-u)^2,$$
$$\lim_{n\to\infty} \mathrm{Var}(R_G(t)) = \lim_{n\to\infty} \mathrm{Var}_B(R_G(t)) = 4u^2(1-u)^2|G| + u(1-u)(1-2u)^2\sum_i |G_i|^2.$$
So
$$\frac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))} \to 1.$$
Since
$$\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t)) = (p^B_1(t) - p_1(t))|G| = -\frac{2t(n-t)}{n^3}|G|,$$
we have
$$\lim_{n\to\infty} \frac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} = -\lim_{n\to\infty} \frac{2u(1-u)|G|/n}{\sqrt{4u^2(1-u)^2|G| + u(1-u)(1-2u)^2\sum_i |G_i|^2}}$$
$$= -\lim_{n\to\infty} \frac{2u(1-u)}{\sqrt{4u^2(1-u)^2 n^2/|G| + u(1-u)(1-2u)^2\, n^2 \sum_i |G_i|^2/|G|^2}},$$
which is 0 when $|G| \sim o(n^2)$.
B.2 Effect of Skewness
To gain a better understanding of the role of skewness, we explore the following
quantities involved in the p-value approximations:
• $\gamma_G(t) \triangleq E[Z_G^3(t)]$,
• $\theta_{b,G}(t) \triangleq \dfrac{-1 + \sqrt{1 + 2\gamma_G(t)b}}{\gamma_G(t)}$,
• $S_G(t) \triangleq \dfrac{1}{\sqrt{1 + \gamma_G(t)\theta_{b,G}(t)}} \exp\Big(\dfrac{1}{2}(b - \theta_{b,G}(t))^2 + \dfrac{\gamma_G(t)\theta_{b,G}(t)^3}{6}\Big)$.
Figure B.1 shows the three quantities versus t for the single change-point scan statistic
on an MDP graph with n = 1000 and b = 3. Since the structure of the MDP is always the
same and does not depend on the distribution of yi, Figure B.1 is representative of all
that γ is always larger than 0, indicating right skewness. When γ = 0, θb = b; when
γ > 0, θb < b. When ZG(t) is right-skewed, the analytic approximation of the p-value
assuming Gaussianity is smaller than the actual p-value, so the skewness correction
should increase the p-value approximation. This is indeed true as SG(t) is U-shaped
with a minimum of 1.
Each node in the MDP has degree 1. The shapes of γG(t) and θb,G(t) for ZG(t) computed on graphs with very few hubs are similar to their shapes for ZG(t) computed on the MDP. For example, for data in low dimensions (< 5), scans based on the MST and the NNG constructed from Euclidean distance have skewness properties similar to those described above. However, as the dimension of the data increases, the MST and NNG constructed from Euclidean distance tend to become dominated by hubs, and the distribution of ZG(t) becomes left-skewed. For a left-skewed distribution, γ ≤ 0, θb ≥ b, and S ≤ 1. One problem for left-skewed distributions is that if γ is smaller than −1/(2b), the current approximation does not yield a real-valued solution for θb. This issue is discussed in Remark 4.4.1, and here we provide a heuristic solution based on an extrapolation procedure.
We illustrate the procedure with an MST constructed, using Euclidean distance, on simulated 100-dimensional data. From Figure B.2, we see that θb,G(t) and SG(t) are defined only in the middle region. In this case, the integrand

SG(nu) hG(n, x) ν(√(2b0² hG(n, x)))

is extrapolated directly to the edge regions using the boundary tangent at each side. If the extrapolation is negative, it is set to zero. Figure B.3 illustrates the integrand before and after extrapolation.
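A minimal sketch of this boundary-tangent extrapolation (pure Python; the function name is ours): the integrand values f are known only at grid indices lo..hi, and each side is extended linearly with the slope at that boundary, clipping negative values to zero.

```python
def extrapolate_tangent(t, f, lo, hi):
    """Extend f, defined only on t[lo..hi], to the whole grid using the
    boundary tangent on each side; negative extrapolations are set to 0."""
    out = list(f)
    # left boundary tangent from the first two defined points
    slope_l = (f[lo + 1] - f[lo]) / (t[lo + 1] - t[lo])
    for i in range(lo):
        out[i] = max(f[lo] + slope_l * (t[i] - t[lo]), 0.0)
    # right boundary tangent from the last two defined points
    slope_r = (f[hi] - f[hi - 1]) / (t[hi] - t[hi - 1])
    for i in range(hi + 1, len(t)):
        out[i] = max(f[hi] + slope_r * (t[i] - t[hi]), 0.0)
    return out
```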
Figure B.1: The three quantities, γG(t), θb,G(t) and SG(t) from left to right, for an MDP graph. n = 1000, b = 3.

Figure B.2: The three quantities, γG(t), θb,G(t) and SG(t) from left to right, for an MST graph constructed using Euclidean distance on a sequence of n = 1000 observations drawn iid from N(0, I100). b = 3.
Figure B.3: The integrand before (left) and after (right) extrapolation. The integrand can only be calculated directly in the middle part (t ∈ [248, 752]); the outer parts are obtained by extending with the boundary tangents.
Appendix C
Supporting Materials for Part II
C.1 Computation Issues for RaMST and RuMST
The main tasks in computing RaMST and RuMST are to enumerate all MSTs on the categories (for RaMST) and to list the edges in M∗0 (for RuMST). All other tasks can be finished in O(K) time.
Let G be the complete graph on the K categories, so |G| = K(K − 1)/2. Eppstein [1995] proposed a graph operation called the sliding transformation which, when applied to G, produces an equivalent graph whose spanning trees correspond one-for-one with the MSTs on the categories. The enumeration of all spanning trees, without having to optimize for total distance, is relatively straightforward. Thus, we adopted the following computational approach: use Eppstein's method to construct the equivalent graph of G, enumerate all spanning trees of the equivalent graph, then transform back to get the set of MSTs on G. The sliding transformation constructs the equivalent graph in O(|G| + K log K) = O(K²) time. To perform the sliding transformation, an initial MST is needed. Prim's algorithm can be used to obtain this initial MST in O(K²) time, so it does not increase the overall time complexity. The theoretical justification of this algorithm can be found in Eppstein [1995] and in Section C.1.1, which completes many of the proofs of Eppstein [1995].
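For concreteness, a sketch of the classical array-based O(K²) Prim's algorithm on a complete graph given by a K × K weight matrix (function name hypothetical); it returns the edge list of one MST, which can serve as the initial MST for the sliding transformation.

```python
def prim_mst(weights):
    """O(K^2) Prim's algorithm; `weights` is a symmetric K x K matrix.
    Returns K - 1 edges (parent, child) of a minimum spanning tree."""
    K = len(weights)
    in_tree = [False] * K
    best = [float("inf")] * K   # cheapest known cost of joining the tree
    parent = [-1] * K
    best[0] = 0.0
    tree = []
    for _ in range(K):
        # pick the non-tree node with the cheapest connection
        u = min((v for v in range(K) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            tree.append((parent[u], u))
        for v in range(K):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v], parent[v] = weights[u][v], u
    return tree
```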
After removing any loops formed during the sliding transformations, each remaining edge appears in at least one spanning tree of the equivalent graph, and thus in at least one MST on G. This gives the list of edges in the uMST on G,
APPENDIX C. SUPPORTING MATERIALS FOR PART II 107
and thus RuMST can be calculated in O(K²) time.
For enumerating all spanning trees of the equivalent graph, we use the algorithm proposed by Shioura and Tamura [1995], which requires O(K + |G| + M) = O(K² + M) computation time, where M is the number of spanning trees; this was proven to be optimal in time complexity. Shioura and Tamura's algorithm starts from a spanning tree formed by depth-first search, then replaces one edge at a time using cycle structures in the graph, traversing the space of all spanning trees of the graph. Hence, computing RaMST takes O(K² + M) time.
C.1.1 Theoretical Justifications
This section proves the four lemmas stated (but not completely proved) in Eppstein [1995], so as to draw the conclusion that applying the sliding transformation produces an equivalent graph such that the MSTs of the original graph correspond one-for-one with the spanning trees of the equivalent graph. Let G be the original graph. We begin
with the definition of sliding transformation from Eppstein [1995]:
Let edges e = (u, v) and f = (v, w) share a common vertex v, and suppose
w(e) < w(f). We define the result of sliding edge f along edge e as the
graph G′ formed by deleting f from G and inserting in its place an edge
f ′ = (u,w) with the same weight.
and the definition of equivalent graph EG:
Let T0 be some particular minimum spanning tree in G, and choose some
vertex to be the root of T0. Then we form EG by repeatedly performing
sliding operations that slide an edge f = (v, w) along an edge e = (u, v)
as long as e is in T0 and u is closer to the root of T0 than is v.
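The sliding operation itself is tiny. A sketch on a dict-based edge-weight map (this representation and the function name are ours, not from Eppstein [1995]):

```python
def slide(weights, e, f):
    """Slide edge f = (v, w) along edge e = (u, v), replacing f by
    f' = (u, w) with the same weight; requires w(e) < w(f).
    `weights` maps frozenset({x, y}) -> edge weight."""
    (u, v), (v2, w) = e, f
    assert v == v2, "e and f must share the vertex v"
    assert weights[frozenset(e)] < weights[frozenset(f)], "requires w(e) < w(f)"
    weights[frozenset((u, w))] = weights.pop(frozenset(f))
    return weights
```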
We next state the four lemmas in Eppstein [1995] and give their proofs.
Lemma C.1.1. Let G′ be formed from G by any sequence of sliding operations.
Then each set of edges giving a minimum spanning tree of G corresponds to a unique
minimum spanning tree in G′ and vice versa.
Proof. We only need to show this for one sliding operation. Let G be the graph before
the sliding and G′ be the graph after sliding. This sliding operation is performed on
edge f as described in the definition of the sliding transformation.
Let Jm be the set of MSTs of G, and define

Jm0 = {T ∈ Jm : f ∉ T},
Jmf = {T ∈ Jm : f ∈ T, e ∉ T},
Jmef = {T ∈ Jm : e, f ∈ T}.

Let Jn be the set of spanning trees of G, with Jn0, Jnf, Jnef defined similarly. J′m, J′m0, J′mf, J′mef, J′n, J′n0, J′nf, J′nef are the counterparts for G′, with the conditions on f replaced by conditions on f′. What we need to show is that there is a one-for-one correspondence between the elements of Jm and J′m. Observe that {Jm0, Jmf, Jmef} is a partition of Jm, and {J′m0, J′mf, J′mef} is a partition of J′m. We next show the one-for-one correspondence for each of the three subsets.
Observe that Jn0 = J′n0, Jm0 ⊆ Jn0, and J′m0 ⊆ J′n0. Therefore, for all Tm0 ∈ Jm0 and T′n0 ∈ J′n0,

∑_{i∈Tm0} w(i) ≤ ∑_{i∈T′n0} w(i).

Hence J′m0 ⊆ Jm0. With a similar argument, we have Jm0 ⊆ J′m0. Hence, Jm0 = J′m0.
For any Tmf ∈ Jmf, since f = (v, w) ∈ Tmf and e = (u, v) ∉ Tmf, there are two possibilities for how u, v, w are connected in Tmf:

(1) −u− · · · −v−w− · · ·
(2) −u− · · · −w−v− · · ·

In the second scenario, deleting f = (v, w) and adding e = (u, v) would still give a tree but with a smaller weight sum (doing this in the first scenario no longer gives a tree), so Tmf can only be of the first form. Let T̃mf be the graph obtained from Tmf by deleting f = (v, w) and adding f′ = (u, w). Since Tmf is of the first form, the resulting T̃mf is still a tree. Since w(f′) = w(f), we have

∑_{i∈T̃mf} w(i) = ∑_{i∈Tmf} w(i). (C.1)
For any T′mf ∈ J′mf, since f′ = (u, w) ∈ T′mf and e = (u, v) ∉ T′mf, there are two possibilities for how u, v, w are connected in T′mf:

(1) −v− · · · −u−w− · · ·
(2) −v− · · · −w−u− · · ·

In the second scenario, deleting f′ = (u, w) and adding e = (u, v) would still give a tree but with a smaller weight sum (doing this in the first scenario no longer gives a tree), so T′mf can only be of the first form. Let T̃′mf be the graph obtained from T′mf by deleting f′ = (u, w) and adding f = (v, w). Since T′mf is of the first form, the resulting T̃′mf is still a tree. Since w(f′) = w(f), we have

∑_{i∈T̃′mf} w(i) = ∑_{i∈T′mf} w(i). (C.2)
Let J̃mf be the set of trees T̃mf resulting from Tmf ∈ Jmf by deleting f and adding f′, and let J̃′mf be the set of trees T̃′mf resulting from T′mf ∈ J′mf by deleting f′ and adding f. It is easy to observe that J̃mf ⊆ J′nf and J̃′mf ⊆ Jnf. Hence

∑_{i∈T′mf} w(i) ≤ ∑_{i∈T̃mf} w(i), (C.3)
∑_{i∈Tmf} w(i) ≤ ∑_{i∈T̃′mf} w(i). (C.4)

(C.1), (C.2), (C.3), and (C.4) lead to

∑_{i∈Tmf} w(i) = ∑_{i∈T̃mf} w(i) = ∑_{i∈T′mf} w(i) = ∑_{i∈T̃′mf} w(i).

Therefore J̃′mf ⊆ Jmf and J̃mf ⊆ J′mf. Since there is a one-for-one correspondence between J̃′mf and J′mf, and between J̃mf and Jmf, we have J′mf = J̃mf and Jmf = J̃′mf. Hence, there is a one-for-one correspondence between Jmf and J′mf.
For any Tmef ∈ Jmef, let T̃mef be the graph resulting from Tmef by deleting f and adding f′. Then T̃mef is still a tree, and

∑_{i∈T̃mef} w(i) = ∑_{i∈Tmef} w(i). (C.5)

Let J̃mef be the set of the trees T̃mef; then, since e, f′ ∈ T̃mef, we have J̃mef ⊆ J′nef. Thus, for any T′mef ∈ J′mef,

∑_{i∈T′mef} w(i) ≤ ∑_{i∈T̃mef} w(i). (C.6)

For any T′mef ∈ J′mef, let T̃′mef be the graph resulting from T′mef by deleting f′ and adding f. Then T̃′mef is still a tree, and

∑_{i∈T̃′mef} w(i) = ∑_{i∈T′mef} w(i). (C.7)

Let J̃′mef be the set of the trees T̃′mef; then J̃′mef ⊆ Jnef. For any Tmef ∈ Jmef, we have

∑_{i∈Tmef} w(i) ≤ ∑_{i∈T̃′mef} w(i). (C.8)

(C.5), (C.6), (C.7), and (C.8) lead to

∑_{i∈Tmef} w(i) = ∑_{i∈T̃mef} w(i) = ∑_{i∈T′mef} w(i) = ∑_{i∈T̃′mef} w(i).

Therefore, J̃mef ⊆ J′mef and J̃′mef ⊆ Jmef. Since there is a one-for-one correspondence between J̃′mef and J′mef, and between J̃mef and Jmef, we have J′mef = J̃mef and Jmef = J̃′mef. Hence, there is a one-for-one correspondence between Jmef and J′mef.
Lemma C.1.2. If we are given a graph G and a rooted minimum spanning tree T0,
then the graph EG described above is well-defined and does not depend on the order
in which sliding transformations are performed.
Proof. Let the root of T0 be O. For simplicity, we use the same letter to denote an edge before and after a sliding transformation, although one of the nodes of the edge is changed. The set of edges forming T0 after sliding transformations still forms a tree, and by Lemma C.1.1, this tree is still a minimum spanning tree in the resulting graph. For simplicity, we still call this tree T0. Also, we call the type of sliding transformation performed in the definition of EG a T0-sliding transformation. Observe that if edge e = (e−, e+) is in T0, with e− closer to O, then no T0-sliding transformation changes e+. That is, e− can be changed to some other node, while e+ is always the same node. Therefore, we can view e+ as fixed for any e in T0. Let g = (u, w) be an edge in the original graph G; we discuss its destination in EG.
Case I: g ∈ T0.
Without loss of generality, let u be closer to O than w is. In T0, let the edges connecting O to u be e1, . . . , en, with ei = (e−i, e+i) such that e−i is the node closer to O. So e−1 = O and e+n = u. Let

mg = max{k ≥ 0 : w(ek) ≥ w(g)}.

If w(ek) < w(g) for all k = 1, . . . , n, then mg = 0. Then, no matter in which order the sliding transformations are performed, g will connect (e+mg, w) in EG (with e+0 ≜ O). This is true based on the following facts:

i) A T0-sliding transformation of any edge other than g and the ei, i = 1, . . . , n, does not change the path connecting O to u.

ii) T0-sliding transformations of the ei, i = 1, . . . , n, will not move g− closer to O than e+mg.

iii) T0-sliding transformations of g will eventually move g− to e+mg.

Since e+mg and g+ = w are fixed under any T0-sliding transformation, the position of g in EG is fixed regardless of the order of the T0-sliding transformations.
Case II: g ∉ T0. In T0, let the edges connecting O to u be e1, . . . , en, and the edges connecting O to w be f1, . . . , fm; the sets {e1, . . . , en} and {f1, . . . , fm} may overlap. Let mgu and mgw be defined similarly to mg:

mgu = max{k ≥ 0 : w(ek) ≥ w(g)},
mgw = max{k ≥ 0 : w(fk) ≥ w(g)}.

Then, by a similar argument as above, no matter in which order the sliding transformations are performed, g will connect e+mgu and f+mgw in EG. Since e+mgu and f+mgw are fixed under any T0-sliding transformation, the position of g in EG is fixed regardless of the order of the T0-sliding transformations.
Lemma C.1.3. A tree T in EG is minimum iff, for each weight w, n(w, T) = n(w, T0), where n(w, T) is the number of edges in T having weight w.

Proof. The sufficiency of the condition is obvious: if the condition holds, then ∑_{i∈T} w(i) = ∑_{i∈T0} w(i), so by definition T is minimum. The necessity of the condition follows from the following stronger lemma.
Lemma C.1.4. For any w and any tree T in EG, n(w, T ) = n(w, T0).
Proof. Assume the edges in T0 have weights w1, . . . , wm with w1 > w2 > · · · > wm, and n(wi, T0) = ni, i = 1, . . . , m. Any edge with weight w′ > w1 is not in T0. According to the proof of Lemma C.1.2, since its weight is larger than that of any edge in T0, such an edge would connect O and O in EG, i.e., form a loop at O. Therefore, this edge does not appear in any tree in EG.

To remove ambiguity, let T0^EG be the tree in EG consisting of all the edges of T0. By the proof of Lemma C.1.2, any edge of weight w1 will be moved by T0-sliding transformations until it either runs into other edges of the same weight or reaches O. Therefore, the edges of weight w1 in T0^EG form a subtree containing the root O and n1 other nodes. We call this subtree T0^(1). By a similar argument, the edges in T0^EG with weights w1 and w2 form a subtree, which we call T0^(2). This can be continued, and T0^(m) is T0^EG.
Let the nodes other than O in T0^(1) be v11, . . . , v1n1, the nodes in T0^(2) but not in T0^(1) be v21, . . . , v2n2, and so on. We denote the set of nodes in T0^(i) but not in T0^(i−1) by V(i) = {vi1, . . . , vini}, i = 1, . . . , m. To make this complete, we let T0^(0) be the node O and define V(0) = {O}.

For any edge g = (u, w) in EG, there exist i, j ∈ {0, 1, . . . , m} such that u ∈ V(i) and w ∈ V(j) (i and j can be equal). Then w(g) ≤ w_{i∨j}, because otherwise a T0-sliding transformation could be further performed on this edge.

For any tree T in EG, let E_{T,i} = {e ∈ T : e− ∈ V(0) ∪ V(1) ∪ · · · ∪ V(i), e+ ∈ V(i)}. Then every edge in T belongs to one of the E_{T,i}, i = 1, . . . , m. Let n_{T,i} = |E_{T,i}|. Since T is a tree, we have

∑_{i=1}^{m} n_{T,i} = ∑_{i=1}^{m} ni. (C.9)
Also, since T is a tree, we have

∑_{i=1}^{k} n_{T,i} ≤ |{O} ∪ V(1) ∪ · · · ∪ V(k)| − 1 = ∑_{i=1}^{k} ni. (C.10)

Since any edge in E_{T,i} has weight no larger than wi, we have

∑_{i∈T} w(i) ≤ ∑_{i=1}^{m} n_{T,i} wi. (C.11)

By the proof of Lemma C.1.1, we know that T0^EG is a minimum spanning tree in EG. So

∑_{i=1}^{m} ni wi ≤ ∑_{i∈T} w(i). (C.12)

(C.11) and (C.12) give

∑_{i=1}^{m} ni wi ≤ ∑_{i=1}^{m} n_{T,i} wi. (C.13)

Together with (C.10) and w1 > w2 > · · · > wm, we have

ni = n_{T,i}, i = 1, . . . , m,

and every edge in E_{T,i} has weight wi.
C.2 Proofs for Lemmas and Theorems in Permutation Distributions
C.2.1 Proof of Lemma 9.1.1
Proof. Define

RA = ∑_{u=1}^{K} (1/mu) ∑_{i,j∈Cu} I_{gi≠gj},

and

RB = ∑_{(u,v)∈C0} (1/(mu mv)) ∑_{i∈Cu, j∈Cv} I_{gi≠gj}.

We have

EP[RC0] = EP[RA] + EP[RB]
= ∑_{u=1}^{K} (1/mu) ∑_{i,j∈Cu} PP(gi ≠ gj) + ∑_{(u,v)∈C0} (1/(mu mv)) ∑_{i∈Cu, j∈Cv} PP(gi ≠ gj).

Since

PP(gi ≠ gj) = 0 if i = j, and 2na nb/(N(N − 1)) if i ≠ j,

we have

EP[RC0] = ∑_{u=1}^{K} (1/mu) mu(mu − 1) · 2na nb/(N(N − 1)) + ∑_{(u,v)∈C0} (1/(mu mv)) mu mv · 2na nb/(N(N − 1))
= (N − K + |C0|) · 2na nb/(N(N − 1)).
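The pair probability PP(gi ≠ gj) = 2na nb/(N(N − 1)) used above can be checked by exact enumeration on a tiny example (function name hypothetical):

```python
from fractions import Fraction
from itertools import permutations

def perm_prob_diff(na, nb):
    """Exact P(g_i != g_j) for one fixed pair of distinct positions under a
    uniform permutation of na labels 'a' and nb labels 'b'."""
    labels = ["a"] * na + ["b"] * nb
    arrangements = set(permutations(labels))   # distinct label vectors
    hits = sum(1 for g in arrangements if g[0] != g[1])
    return Fraction(hits, len(arrangements))
```

For na = 2, nb = 3 this returns 3/5 = 2·2·3/(5·4), matching the formula.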
Now, to compute the second moment, first note that

EP[R²C0] = EP[R²A] + EP[R²B] + 2EP[RA RB].
Expanding the right-hand side above,

EP[R²A] = ∑_{u,v=1}^{K} (1/(mu mv)) ∑_{i,j∈Cu, k,l∈Cv} PP(gi ≠ gj, gk ≠ gl),

EP[R²B] = ∑_{(u,v)∈C0} (1/(m²u m²v)) ∑_{i,k∈Cu, j,l∈Cv} PP(gi ≠ gj, gk ≠ gl)
+ 2 ∑_{{(u,v),(w,y)}⊂C0} (1/(mu mv mw my)) ∑_{i∈Cu, j∈Cv, k∈Cw, l∈Cy} PP(gi ≠ gj, gk ≠ gl),

EP[RA RB] = ∑_{u=1}^{K} ∑_{(v,w)∈C0} (1/(mu mv mw)) ∑_{i,j∈Cu, k∈Cv, l∈Cw} PP(gi ≠ gj, gk ≠ gl).
Since

PP(gi ≠ gj, gk ≠ gl) =
0, if i = j and/or k = l;
2na nb/(N(N − 1)) = 2p1, if i = k, j = l, i ≠ j, or i = l, j = k, i ≠ j;
na nb/(N(N − 1)) = p1, if i = k, j ≠ i, l; or i = l, j ≠ i, k; or j = k, i ≠ j, l; or j = l, i ≠ j, k;
4na(na − 1)nb(nb − 1)/(N(N − 1)(N − 2)(N − 3)) = p2, if i, j, k, l are all different,

we have
EP[R²A] = ∑_{u=1}^{K} (1/m²u) ∑_{i,j,k,l∈Cu} PP(gi ≠ gj, gk ≠ gl)
+ ∑_{u=1}^{K} ∑_{v≠u} (1/(mu mv)) ∑_{i,j∈Cu, k,l∈Cv} PP(gi ≠ gj, gk ≠ gl)
= ∑_{u=1}^{K} (1/m²u) [2mu(mu − 1)(2p1) + 4mu(mu − 1)(mu − 2)p1]
+ ∑_{u=1}^{K} (1/m²u) [mu(mu − 1)(mu − 2)(mu − 3)p2]
+ ∑_{u=1}^{K} ∑_{v≠u} (1/(mu mv)) mu(mu − 1)mv(mv − 1)p2
= 4(N − 2K + ∑_{u=1}^{K} 1/mu) p1 + (N − K − 4)(N − K)p2 + 6(K − ∑_{u=1}^{K} 1/mu) p2,
EP[R²B] = ∑_{(u,v)∈C0} (1/(m²u m²v)) ∑_{i,k∈Cu, j,l∈Cv} PP(gi ≠ gj, gk ≠ gl)
+ ∑_{(u,v),(u,w)∈C0, v≠w} (1/(m²u mv mw)) ∑_{i,k∈Cu, j∈Cv, l∈Cw} PP(gi ≠ gj, gk ≠ gl)
+ ∑_{(u,v),(w,y)∈C0, u,v,w,y all different} (1/(mu mv mw my)) ∑_{i∈Cu, j∈Cv, k∈Cw, l∈Cy} PP(gi ≠ gj, gk ≠ gl)
= ∑_{(u,v)∈C0} (1/(m²u m²v)) [mu mv(2p1) + mu mv(mu + mv − 2)p1]
+ ∑_{(u,v)∈C0} (1/(m²u m²v)) [mu(mu − 1)mv(mv − 1)p2]
+ ∑_{(u,v),(u,w)∈C0, v≠w} (1/(m²u mv mw)) [mu mv mw p1 + mu(mu − 1)mv mw p2]
+ ∑_{(u,v),(w,y)∈C0, u,v,w,y all different} (1/(mu mv mw my)) mu mv mw my p2
= ∑_{(u,v)∈C0} (1/(mu mv)) [(mu + mv)p1 + (mu − 1)(mv − 1)p2]
+ ∑_{(u,v),(u,w)∈C0, v≠w} (1/mu) [p1 + (mu − 1)p2]
+ 2|{{(u, v), (w, y)} ⊂ C0 : u, v, w, y all different}| p2
= ∑_{u=1}^{K} (|EC0u|²/mu)(p1 − p2) + |C0|² p2 + ∑_{(u,v)∈C0} (1/(mu mv)) p2,
EP[RA RB] = ∑_{u=1}^{K} ∑_{(u,v)∈EC0u} (1/(m²u mv)) ∑_{i,j,k∈Cu, l∈Cv} PP(gi ≠ gj, gk ≠ gl)
+ ∑_{u=1}^{K} ∑_{(v,w)∈C0\EC0u} (1/(mu mv mw)) ∑_{i,j∈Cu} ∑_{k∈Cv, l∈Cw} PP(gi ≠ gj, gk ≠ gl)
= ∑_{u=1}^{K} ∑_{(u,v)∈EC0u} (1/(m²u mv)) [2mu(mu − 1)mv p1 + mu(mu − 1)(mu − 2)mv p2]
+ ∑_{u=1}^{K} ∑_{(v,w)∈C0\EC0u} (1/(mu mv mw)) mu(mu − 1)mv mw p2
= |C0|(N − K)p2 + 2(p1 − p2)(2|C0| − ∑_{u=1}^{K} |EC0u|/mu).

VarP[RC0] follows by combining the above to compute EP[R²C0] and then subtracting EP[RC0]².
C.2.2 Proof of Theorem 9.1.3
To prove Theorem 9.1.3, we first prove a simpler result: asymptotic normality of the statistic under the bootstrap null, defined as the distribution obtained by sampling the group labels from the observed vector of group labels with replacement. Let PB, EB and VarB denote, respectively, the probability, expectation and variance under the bootstrap null.

Lemma C.2.1. Assuming Condition 1, under the bootstrap null distribution, the standardized statistic

(RC0 − EB[RC0]) / √VarB[RC0]

converges in distribution to N(0, 1) as K → ∞, where EB[RC0] and VarB[RC0] are given below:

EB[RC0] = (N − K + |C0|) 2p3, (C.14)

VarB[RC0] = 4(p3 − p4)(N − K + 2|C0| + ∑_{u=1}^{K} |EC0u|²/(4mu) − ∑_{u=1}^{K} |EC0u|/mu) (C.15)
+ (6p4 − 4p3)(K − ∑_{u=1}^{K} 1/mu) + p4 ∑_{(u,v)∈C0} 1/(mu mv),

where

p3 = na nb/N², p4 = 4n²a n²b/N⁴ = 4p3². (C.16)
Proof. The mean and variance of RC0 under the bootstrap null, (C.14) and (C.15), can be obtained following steps similar to the proof of Lemma 9.1.1, noting that, under the bootstrap null,

PB(gi ≠ gj) = 0 if i = j, and 2na nb/N² = 2p3 if i ≠ j,

and

PB(gi ≠ gj, gk ≠ gl) =
0, if i = j and/or k = l;
2na nb/N² = 2p3, if i = k, j = l, i ≠ j, or i = l, j = k, i ≠ j;
na nb/N² = p3, if i = k, j ≠ i, l; or i = l, j ≠ i, k; or j = k, i ≠ j, l; or j = l, i ≠ j, k;
4n²a n²b/N⁴ = p4, if i, j, k, l are all different.
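Under the bootstrap null the labels are iid with P(a) = na/N, so the analogous pair probability 2p3 = 2na nb/N² can be checked by direct enumeration (function name hypothetical):

```python
from fractions import Fraction
from itertools import product

def boot_prob_diff(na, nb):
    """Exact P_B(g_i != g_j), i != j, when each label is drawn independently
    with P(a) = na/N and P(b) = nb/N (the bootstrap null)."""
    N = na + nb
    p = {"a": Fraction(na, N), "b": Fraction(nb, N)}
    return sum(p[gi] * p[gj] for gi, gj in product("ab", repeat=2) if gi != gj)
```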
To prove asymptotic normality, we rely on Stein's method (Theorem A.1.1). We first define some more notation. For any node u of C0, let

Ru = 2nau nbu/mu, du = EB[Ru] = 2(mu − 1)p3,

where p3 is defined in (C.16). Similarly, for any edge (u, v) of C0, let

Ruv = (nau nbv + nav nbu)/(mu mv), duv = EB[Ruv] = 2p3.

Let σ²B = VarB[RC0], and let ξu, ξuv be the standardized quantities

ξu = (Ru − du)/σB, (C.17)
ξuv = (Ruv − duv)/σB. (C.18)
Finally, we define the index sets for ξu and ξuv:

J1 = {1, . . . , K},
J2 = {uv : u < v such that (u, v) ∈ C0},

and let J = J1 ∪ J2. Since RC0 = ∑_{u=1}^{K} Ru + ∑_{(u,v)∈C0} Ruv, the standardized statistic is

W := ∑_{i∈J} ξi = ∑_{u∈J1} (Ru − du)/σB + ∑_{uv∈J2} (Ruv − duv)/σB = (RC0 − EB[RC0])/σB.
Our notation follows that of Theorem A.1.1 and Assumption A.1.1. For u ∈ J1, let

Su = {u} ∪ {uv, vu : (u, v) ∈ C0},
Tu = Su ∪ {v, vw, wv : (u, v), (v, w) ∈ C0}.

For uv ∈ J2, let

Suv = {uv, u, v} ∪ {uw, wu : (u, w) ∈ C0} ∪ {vw, wv : (v, w) ∈ C0},
Tuv = Suv ∪ {w, wy, yw : (u, w), (w, y) ∈ C0} ∪ {w, wy, yw : (v, w), (w, y) ∈ C0}.

Su, Tu, Suv, Tuv defined in this way satisfy Assumption A.1.1.

Since Ru ∈ [0, mu/2], p3 ∈ [0, 1/4], and Ruv ∈ [0, 1], we have du ∈ [0, (mu − 1)/2] and duv ∈ [0, 1/2],
and therefore |ξu| ≤ mu/(2σB), |ξuv| ≤ 1/σB. Hence,

∑_{j∈Su} |ξj| ≤ (1/σB)(mu + |EC0u|), u ∈ J1,
∑_{j∈Tu} |ξj| ≤ (1/σB)(mu + ∑_{v∈Vu} mv + |EC0u,2|), u ∈ J1,
∑_{j∈Suv} |ξj| ≤ (1/σB)(mu + mv + |EC0u| + |EC0v|), uv ∈ J2,
∑_{j∈Tuv} |ξj| ≤ (1/σB)(mu + mv + ∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|), uv ∈ J2.
As in Theorem A.1.1, let ηi = ∑_{j∈Si} ξj and θi = ∑_{j∈Ti} ξj. Then

EB|ξi ηi θi| = EB|ξi ∑_{j∈Si} ξj ∑_{k∈Ti} ξk| ≤ EB[|ξi| ∑_{j∈Si} |ξj| ∑_{k∈Ti} |ξk|],
|EB(ξi ηi)| EB|θi| ≤ EB|ξi ∑_{j∈Si} ξj| EB|∑_{j∈Ti} ξj| ≤ EB[|ξi| ∑_{j∈Si} |ξj|] EB[∑_{j∈Ti} |ξj|],
EB|ξi η²i| = EB|ξi ∑_{j∈Si} ∑_{k∈Si} ξj ξk| ≤ EB[|ξi| ∑_{j∈Si} |ξj| ∑_{k∈Si} |ξk|].
Thus, for i = u ∈ J1, the terms EB|ξiηiθi|, |EB(ξiηi)|EB|θi|, and EB|ξiη²i| are all bounded by

(1/σB³) mu(mu + |EC0u|)(mu + ∑_{v∈Vu} mv + |EC0u,2|),

and for i = uv ∈ J2, the terms EB|ξiηiθi|, |EB(ξiηi)|EB|θi|, and EB|ξiη²i| are all bounded by

(1/σB³)(mu + mv + |EC0u| + |EC0v|)(mu + mv + ∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|).
Hence,

δ ≤ (5/σB³) (∑_{u=1}^{K} mu(mu + |EC0u|)(mu + ∑_{v∈Vu} mv + |EC0u,2|)
+ ∑_{(u,v)∈C0} (mu + mv + |EC0u| + |EC0v|)(mu + mv + ∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|)).

Since σB is of order √K or higher, under Condition 1, δ → 0 as K → ∞.
Proof of Theorem 9.1.3. To show the asymptotic normality of the standardized statistic under the permutation null, we only need to show that (RC0, nBa) converges to a non-degenerate bivariate Gaussian distribution under the bootstrap null, where nBa is the number of observations that belong to group a in the bootstrap sample. Asymptotic normality of RC0 under the permutation null then follows from the fact that its distribution equals the conditional distribution of RC0 given nBa = na. The standardized bivariate vector is

((RC0 − EB[RC0]) / √VarB[RC0], (nBa − N pa)/σ0)

with pa = na/N and σ²0 = N pa(1 − pa). By the Cramér-Wold device, we only need to show that

a1 (RC0 − EB[RC0]) / √VarB[RC0] + a2 (nBa − N pa)/σ0

is asymptotically Gaussian under the bootstrap null for all a1, a2 ∈ R with a1 a2 ≠ 0.
Let ξi, i ∈ J, be defined in the same way as in the proof of Lemma C.2.1. Let J3 = {|J| + 1, . . . , |J| + K}. For i ∈ J3, let

ξi = (nai′ − pa mi′)/σ0, i′ = i − |J|.

We use Theorem A.1.1 to show the asymptotic Gaussianity of ∑_{i∈J} a1 ξi + ∑_{i∈J3} a2 ξi. We need to redefine the neighborhood sets to satisfy Assumption A.1.1.
For u ∈ J1,

Su = {u, u + |J|} ∪ {uv, vu : (u, v) ∈ C0},
Tu = Su ∪ {v, v + |J|, vw, wv : (u, v), (v, w) ∈ C0}.

For uv ∈ J2,

Suv = {uv, u, v, u + |J|, v + |J|} ∪ {uw, wu : (u, w) ∈ C0} ∪ {vw, wv : (v, w) ∈ C0},
Tuv = Suv ∪ {w, w + |J|, wy, yw : (u, w), (w, y) ∈ C0} ∪ {w, w + |J|, wy, yw : (v, w), (w, y) ∈ C0}.

And for u ∈ J3,

Su = {u, u′} ∪ {u′v, vu′ : (u′, v) ∈ C0}, u′ = u − |J|,
Tu = Su ∪ {v, v + |J|, vw, wv : (u′, v), (v, w) ∈ C0}.
From the proof of Lemma C.2.1, we have

|ξu| ≤ mu/(2σB) for all u ∈ J1; |ξuv| ≤ 1/σB for all uv ∈ J2.

For u ∈ J3,

|ξu| ≤ mu′/σ0, u′ = u − |J|.
Let σ = min(σB, σ0); then

∑_{j∈Su} |ξj| ≤ (1/σ)(2mu + |EC0u|), u ∈ J1 ∪ J3,
∑_{j∈Tu} |ξj| ≤ (1/σ)(2mu + 2∑_{v∈Vu} mv + |EC0u,2|), u ∈ J1 ∪ J3,
∑_{j∈Suv} |ξj| ≤ (1/σ)(2mu + 2mv + |EC0u| + |EC0v|), uv ∈ J2,
∑_{j∈Tuv} |ξj| ≤ (1/σ)(2mu + 2mv + 2∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|), uv ∈ J2.
Thus, for i = u ∈ J1 ∪ J3, the terms EB|ξiηiθi|, |EB(ξiηi)|EB|θi|, and EB|ξiη²i| are all bounded by

(1/σ³) mu(2mu + |EC0u|)(2mu + 2∑_{v∈Vu} mv + |EC0u,2|),

and for i = uv ∈ J2, the terms EB|ξiηiθi|, |EB(ξiηi)|EB|θi|, and EB|ξiη²i| are all bounded by

(1/σ³)(2mu + 2mv + |EC0u| + |EC0v|)(2mu + 2mv + 2∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|).
Define W_{a1,a2} = ∑_{i∈J} a1 ξi + ∑_{i∈J3} a2 ξi. The value of δ in Theorem A.1.1 has the form

δ = (1/√EB[W²_{a1,a2}]) (2∑_{i∈J} (EB|a1ξiηiθi| + |EB(a1ξiηi)|EB|θi|) + ∑_{i∈J} EB|a1ξiη²i|
+ 2∑_{i∈J3} (EB|a2ξiηiθi| + |EB(a2ξiηi)|EB|θi|) + ∑_{i∈J3} EB|a2ξiη²i|),

where ηi = ∑_{j∈Si} ξj(a1 I_{j∈J} + a2 I_{j∈J3}) and θi = ∑_{j∈Ti} ξj(a1 I_{j∈J} + a2 I_{j∈J3}).
Let a = max(|a1|, |a2|). We have

EB|a1ξiηiθi|, EB|a2ξiηiθi| ≤ a³ EB|ξi ∑_{j∈Si} ξj ∑_{k∈Ti} ξk| ≤ a³ EB[|ξi| ∑_{j∈Si} |ξj| ∑_{k∈Ti} |ξk|],
|EB(a1ξiηi)|EB|θi|, |EB(a2ξiηi)|EB|θi| ≤ a³ EB|ξi ∑_{j∈Si} ξj| EB|∑_{j∈Ti} ξj| ≤ a³ EB[|ξi| ∑_{j∈Si} |ξj|] EB[∑_{j∈Ti} |ξj|],
EB|a1ξiη²i|, EB|a2ξiη²i| ≤ a³ EB|ξi ∑_{j∈Si} ∑_{k∈Si} ξjξk| ≤ a³ EB[|ξi| ∑_{j∈Si} |ξj| ∑_{k∈Si} |ξk|].
Thus,

δ ≤ (40a³ / (σ³ √EB[W²_{a1,a2}])) (∑_{u=1}^{K} mu(mu + |EC0u|)(mu + ∑_{v∈Vu} mv + |EC0u,2|)
+ ∑_{(u,v)∈C0} (mu + mv + |EC0u| + |EC0v|)(mu + mv + ∑_{w∈Vu∪Vv} mw + |EC0u,2| + |EC0v,2|)).

Since σ²B is at least of order K and σ²0 is of order N, σ² is at least of order K by Condition 2. If EB[W²_{a1,a2}] is uniformly bounded away from 0 for any a1 a2 ≠ 0, then under Condition 1, δ → 0 as K → ∞.

We next show that, under Condition 2, EB[W²_{a1,a2}] is uniformly bounded away from 0 for any a1 a2 ≠ 0.
Let W1 = ∑_{i∈J} ξi and W2 = ∑_{i∈J3} ξi; then

EB[W²_{a1,a2}] = a²1 EB[W²1] + a²2 EB[W²2] + 2a1a2 EB[W1W2]
= a²1 + a²2 + 2a1a2 EB[W1W2].

Thus, we only need to show that the absolute correlation between W1 and W2 is uniformly bounded away from 1. Notice that, in the theorem, we require na/N to be bounded away from 0 and 1, so pa and pb are both bounded away from 0 and 1.
Correlation between RC0 and nBa: Observe that

RC0 nBa = (∑_{u=1}^{K} (1/mu) ∑_{i,j∈Cu} I_{gi≠gj} + ∑_{(u,v)∈C0} (1/(mu mv)) ∑_{i∈Cu, j∈Cv} I_{gi≠gj}) ∑_{x=1}^{N} I_{gx=a}
= ∑_{u=1}^{K} (1/mu) ∑_{i,j∈Cu} (I_{gi≠gj} ∑_{x=1}^{N} I_{gx=a}) + ∑_{(u,v)∈C0} (1/(mu mv)) ∑_{i∈Cu, j∈Cv} (I_{gi≠gj} ∑_{x=1}^{N} I_{gx=a}).

For any i ≠ j,

EB[I_{gi≠gj} ∑_{x=1}^{N} I_{gx=a}] = EB[I_{gi≠gj, gi=a} + I_{gi≠gj, gj=a} + ∑_{x≠i,j} I_{gi≠gj, gx=a}]
= PB(gi = a, gj = b) + PB(gi = b, gj = a) + ∑_{x≠i,j} PB(gi ≠ gj, gx = a)
= pa pb + pa pb + 2pa pb pa(N − 2) = 2pa pb(N pa + 1 − 2pa).

Hence

EB[RC0 nBa] = (N − K + |C0|) 2pa pb(N pa + 1 − 2pa).

Since EB[RC0] = (N − K + |C0|) 2pa pb and EB[nBa] = N pa, we have

covB(RC0, nBa) = (N − K + |C0|) 2pa pb(1 − 2pa). (C.19)
If pa = 1/2, then covB(RC0, nBa) = 0. Since VarB[RC0] and VarB[nBa] = N pa pb are positive, corrB(RC0, nBa) = 0, which is clearly bounded away from 1. We consider pa ≠ 1/2 in the following.
VarB[RC0] = 4pa pb(1 − 4pa pb)(N − K + 2|C0| + ∑_{u=1}^{K} |EC0u|²/(4mu) − ∑_{u=1}^{K} |EC0u|/mu)
+ 4pa pb(6pa pb − 1)(K − ∑_{u=1}^{K} 1/mu) + 4p²a p²b ∑_{(u,v)∈C0} 1/(mu mv)
= 4pa pb(1 − 4pa pb)(N − 2K + 2|C0| + ∑_{u=1}^{K} (|EC0u|/2 − 1)²/mu)
+ 8p²a p²b(K − ∑_{u=1}^{K} 1/mu) + 4p²a p²b ∑_{(u,v)∈C0} 1/(mu mv).
Since

N ∑_{u=1}^{K} (|EC0u|/2 − 1)²/mu = (∑_{u=1}^{K} mu)(∑_{u=1}^{K} (|EC0u|/2 − 1)²/mu)
≥ (∑_{u=1}^{K} √mu · √((|EC0u|/2 − 1)²/mu))²
= (∑_{u=1}^{K} ||EC0u|/2 − 1|)²
≥ (∑_{u=1}^{K} (|EC0u|/2 − 1))² = (|C0| − K)²,

we have

VarB[RC0] VarB[nBa] ≥ 4p²a p²b(1 − 4pa pb)(N − K + |C0|)² + 4p³a p³b N ∑_{(u,v)∈C0} 1/(mu mv).
Hence,

|corrB(RC0, nBa)| ≤ 1 / √(1 + pa pb N ∑_{(u,v)∈C0} (1/(mu mv)) / ((1 − 4pa pb)(N − K + |C0|)²)).

When N, |C0|, and ∑_{(u,v)∈C0} 1/(mu mv) ∼ O(K), |corrB(RC0, nBa)| is bounded by a value smaller than 1.
C.2.3 Proof of Lemma 9.2.1
Let G be the uMST on subjects, and EGi = {(i, j) : (i, j) ∈ G}. Then, for a subject i in category u, |EGi| = mu + ∑_{v∈Vu} mv − 1, and |G| = ∑_{u=1}^{K} mu(mu − 1)/2 + ∑_{(u,v)∈C0} mu mv. Since EP[TC0] = |G| · 2p1, the first result follows.

Now, we compute the second moment:

EP[T²C0] = ∑_{(i,j),(k,l)∈G} PP(gi ≠ gj, gk ≠ gl)
= ∑_{(i,j)∈G} PP(gi ≠ gj) + ∑_{(i,j),(i,k)∈G, j≠k} PP(gi ≠ gj, gi ≠ gk)
+ ∑_{(i,j),(k,l)∈G, i,j,k,l all different} PP(gi ≠ gj, gk ≠ gl)
= |G| · 2p1 + ∑_{i=1}^{N} |EGi|(|EGi| − 1) p1 + (|G|² − |G| − ∑_{i=1}^{N} |EGi|(|EGi| − 1)) p2
= (p1 − p2) ∑_{u=1}^{K} mu(mu + ∑_{v∈Vu} mv − 1)(mu + ∑_{v∈Vu} mv − 2)
+ (p1 − p2/2)(∑_{u=1}^{K} mu(mu − 1) + 2 ∑_{(u,v)∈C0} mu mv)
+ p2 (∑_{u=1}^{K} mu(mu − 1)/2 + ∑_{(u,v)∈C0} mu mv)².

VarP[TC0] follows from EP[T²C0] − EP[TC0]².
Bibliography
J.A. Anderson, K. Whaley, J. Williamson, and W.W. Buchanan. A statistical aid to
the diagnosis of keratoconjunctivitis sicca. QJM, 41(2):175, 1972.
Eliot C Bush and Bruce T Lahn. The evolution of word composition in metazoan
promoter sequence. PLoS computational biology, 2(11):e150, 2006.
E.G. Carlstein, H.G. Muller, and D. Siegmund. Change-point problems, volume 23. Institute of Mathematical Statistics, 1994.
L.H.Y. Chen and Q.M. Shao. Stein’s method for normal approximation. An intro-
duction to Stein’s method, 4:1–59, 2005.
G.W. Cobb. The problem of the Nile: conditional solution to a changepoint problem. Biometrika, 65(2):243–251, 1978.
D.E. Critchlow. Metric methods for analyzing partially ranked data, volume 34.
Springer, 1985.
F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm.
Signal Processing, IEEE Transactions on, 53(8):2961–2974, 2005.
P. Diaconis. Group representations in probability and statistics. Lecture Notes-
Monograph Series, 11, 1988.
N. Eagle, A.S. Pentland, and D. Lazer. Inferring friendship network structure by
using mobile phone data. Proceedings of the National Academy of Sciences, 106
(36):15274–15278, 2009.
BIBLIOGRAPHY 129
D. Eppstein. Representing all minimum spanning trees with applications to counting
and generation. Citeseer, 1995.
J.H. Friedman and L.C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697–717, 1979.
S. Furihata, T. Ito, and N. Kamatani. Test of association between haplotypes and
phenotypes in case–control studies: Examination of validity of the application of
an algorithm for samples from cohort or clinical trials to case–control samples using
simulated and real data. Genetics, 174(3):1505–1516, 2006.
J. Giron, J. Ginebra, and A. Riba. Bayesian analysis of a multinomial sequence and
homogeneity of literary style. The American Statistician, 59(1):19–30, 2005.
Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. 2009.
B. James, K.L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74
(1):71, 1987.
B. James, K.L. James, and D. Siegmund. Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate Gaussian process. Statistica Sinica, 2(1):69–90, 1992.
G. Kossinets and D.J. Watts. Empirical analysis of an evolving social network. Sci-
ence, 311(5757):88–90, 2006.
R.A. Lippert, H. Huang, and M.S. Waterman. Distributional regimes for the number
of k-word matches between two random sequences. Proceedings of the National
Academy of Sciences, 99(22):13980, 2002.
A. Lung-Yut-Fong, C. Levy-Leduc, and O. Cappe. Homogeneity and change-point detection tests for multivariate data using rank statistics. arXiv preprint arXiv:1107.1971, 2011.
C.L. Mallows. Non-null ranking models. i. Biometrika, 44(1/2):114–130, 1957.
C.R. Mehta and N.R. Patel. A network algorithm for performing Fisher's exact test in r × c contingency tables. Journal of the American Statistical Association, 78(382):427–434, 1983.
D. Nettleton and T. Banerjee. Testing the equality of distributions of random vectors
with categorical components. Computational statistics & data analysis, 37(2):195–
208, 2001.
A.B. Olshen, E.S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572, 2004.
Scott C Perry and Robert G Beiko. Distinguishing microbial genome fragments based
on their composition: evolutionary and comparative genomic perspectives. Genome
biology and evolution, 2:117, 2010.
Issaac Rajan, Sarang Aravamuthan, and Sharmila S Mande. Identification of compo-
sitionally distinct regions in genomes using the centroid method. Bioinformatics,
23(20):2672–2677, 2007.
P.R. Rosenbaum. An exact distribution-free test comparing two multivariate dis-
tributions based on adjacency. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 67(4):515–530, 2005.
Akiyoshi Shioura and Akihisa Tamura. Efficiently scanning all spanning trees of an
undirected graph. Journal of the Operations Research Society of Japan, 38(3):331–
344, 1995. ISSN 04534514. URL http://ci.nii.ac.jp/naid/110001184429/en/.
D. Siegmund. Approximate tail probabilities for the maxima of some random fields.
The Annals of Probability, pages 487–501, 1988.
D. Siegmund and B. Yakir. The statistics of gene mapping. Springer, 2007.
D. Siegmund, B. Yakir, and N.R. Zhang. Detecting simultaneous variant intervals in
aligned sequences. The Annals of Applied Statistics, 5(2A):645–668, 2011.
D.O. Siegmund. Tail approximations for maxima of random fields. In Probability the-
ory: proceedings of the 1989 Singapore probability Conference held at the National
University of Singapore, June 8-16, 1989, page 147. Walter de Gruyter, 1992.
MS Srivastava and K.J. Worsley. Likelihood ratio tests for a change in the multivariate
normal mean. Journal of the American Statistical Association, pages 199–204, 1986.
H.K. Tang and D. Siegmund. Mapping quantitative trait loci in oligogenic models.
Biostatistics, 2(2):147–162, 2001.
A. Tsirigos and I. Rigoutsos. A new computational method for the detection of
horizontal gene transfer events. Nucleic acids research, 33(3):922–933, 2005.
I-Ping Tu, David Siegmund, et al. The maximum of a function of a Markov chain and application to linkage analysis. Advances in Applied Probability, 31(2):510–531, 1999.
L.J. Vostrikova. Detecting disorder in multidimensional random processes. In Soviet
Mathematics Doklady, volume 24, pages 55–59, 1981.
M. Woodroofe. Frequentist properties of bayesian sequential tests. Biometrika, 63
(1):101–110, 1976.
M. Woodroofe. Large deviations of likelihood ratio statistics with applications to
sequential testing. The Annals of Statistics, pages 72–84, 1978.
D.V. Zaykin, P.H. Westfall, S.S. Young, M.A. Karnoub, M.J. Wagner, and M.G. Ehm.
Testing association of statistically inferred haplotypes with discrete and continuous
traits in samples of unrelated individuals. Human heredity, 53(2):79–91, 2002.
N.R. Zhang, D.O. Siegmund, H. Ji, and J.Z. Li. Detecting simultaneous changepoints
in multiple sequences. Biometrika, 97(3):631–645, 2010.