FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS
By
Xin Qi
A DISSERTATION
Submitted to Michigan State University
in partial fulfillment of the requirements for the degree of
Statistics - Doctor of Philosophy
2014
ABSTRACT
FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS
By
Xin Qi
As a branch of statistics, functional data analysis deals with data in the form of curves,
functions, surfaces, or other objects varying over a continuum such as time or spatial
location. The topic has drawn growing attention in recent years because, in many real
problems, samples are collected as curves and other functional observations, which calls for
statistical models and methodologies that perform inference on such data appropriately. In
this dissertation, I introduce several functional data analysis problems arising from real
applications and develop methodologies that provide powerful tools to address the associated
statistical questions.
TABLE OF CONTENTS
LIST OF TABLES ........................................................................................................................v
LIST OF FIGURES .................................................................................................................... vi
Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data ...1
1.1 Introduction ...........................................................................................................................1
1.2 Spline Mixed Model ..............................................................................................................4
1.3 Bayesian Hypothesis Testing on Covariance Ordering .........................................................6
1.3.1 Model Selection and Hypothesis Testing ..................................................................6
1.3.2 New Prior on Σ0 with Ordering Representation ........................................8
1.3.3 Bayesian Estimation ...................................................................................................9
1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan .................................10
1.4.1 Bayesian Inference Results ......................................................................................12
1.4.2 Discussion ................................................................................................................15
Chapter 2 Joinpoint Detection using L1-Penalized Spline Method ........................................17
2.1 Introduction .........................................................................................................................17
2.2 L1-Penalized Regression Spline Model ...............................................................19
2.3 Consistency of the LASSO Type Estimator ........................................................................21
2.4 Simulation Study .................................................................................................................25
2.5 A Case Study: National Cancer Incidence Rate Analysis ...................................................33
2.6 Conclusion and Discussion .................................................................................................37
Chapter 3 Nonparametric Bayesian Clustering of Functional Data ......................................38
3.1 Introduction .........................................................................................................................38
3.2 Spline Mixed Model for Clustering Multiple Curves .........................................................41
3.2.1 Model Specification ......................................................................................................41
3.2.2 Derivation of Functional Dirichlet Process and Other Prior .......................................43
3.3 Theory of Bayesian Inference ..............................................................................................46
3.3.1 Propriety of the Posterior Distribution .........................................................................46
3.3.2 Gibbs Updating Steps ...................................................................................................46
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clustering of Sites ....................52
3.5 Numerical Analysis: Simulation Study and Real Data Example .........................................54
3.5.1 An Alternative Less-informative Prior Choice for ..................................................55
3.5.2 Simulations and Comparison with An Existing Approach ...........................................56
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S. ................................................57
3.6 Conclusion and Discussion .................................................................................................59
APPENDICES .............................................................................................................................64
Appendix A Proof of Theorems in Chapter 1 ...........................................................................65
Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 .....74
Appendix C Maximum Likelihood Estimation in Chapter 1 ....................................................76
Appendix D Proof of Theorems and Lemmas in Chapter 2 ......................................................78
Appendix E Proof of Theorems in Chapter 3 ............................................................................83
BIBLIOGRAPHY .......................................................................................................................86
LIST OF TABLES
Table 1.1: Bayes Factor for Significant Orderings at L00 ............................................................15
Table 1.2: Bayes Factor for Significant Orderings at L02 ............................................................15
Table 2.1: A Summary of Four Methods to Compare with the PTB ............................................28
Table 2.2: Simulation Result for , ........................................................29
Table 2.3: Simulation Result for , ......................................................30
Table 2.4: Simulation Result for , ......................................................31
Table 2.5: Simulation Result for , ......................................................33
Table 2.6: Simulation Result for , ...................................................34
Table 2.7: Simulation Result for , .................................................36
Table 2.8: Joinpoint Detection for Real Data Analysis ................................................................36
Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach .............56
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach .....56
Table 3.3: The Distribution of Number of Clusters ..................................................................58
Table 3.4: Estimated Cluster Configuration .................................................................................59
LIST OF FIGURES
Figure 1.1: An Example of Biodiversity Data ................................................................................3
Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32
Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................32
Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999) .........................................35
Figure 2.4: Fitted Incidence Rates (1973 - 1999) .........................................................................35
Figure 3.1: Dendrogram for the Clustering Estimation ................................................................61
Figure 3.2: Estimated Cluster Configuration ................................................................................62
Figure 3.3: Overall Trend by Clusters ..........................................................................................63
Chapter 1
Bayesian Inference for Covariance
Ordering in Multivariate Functional
Data
1.1 Introduction
Biodiversity, often defined as the variety of life at all levels of organization [9], has mostly
been studied at the species level. It is closely related to human lives and human activities
such as agriculture, health, business and industry. In recent years, unsustainable ecological
practices such as habitat destruction, agricultural over-harvesting, and pollution have been
damaging species diversity. According to the International Union for Conservation of Nature
(IUCN) [19], the world's main authority on the conservation status of species, through 2012,
19,817 out of 63,837 assessed species were threatened with extinction. The Convention on
Biological Diversity (CBD) was established to promote the conservation of biodiversity and
the sustainable use of its components. Statistical analysis of biodiversity measurements is
essential and valuable for conservationists to understand the nature of biodiversity. With
species indices and habitat variability assessments, conservationists can identify areas of
ecological importance that should be protected.
Audio measurements in suburban or rural areas can be used to study the impact of
human activities on other wildlife. The audio measurements, taken at regular time intervals,
are converted to energy readings on the frequency domain, and these readings are then
divided into several frequency bands. Each frequency band corresponds to a different
category of wildlife according to the sounds they create. By investigating how the energy
readings in each band change over time, we can study the interaction between different
groups of wildlife. For example, Figure 1.1 (a) shows the energy readings of several
frequency bands at one suburban location in the Upper Peninsula of Michigan on June 15,
2010, while Figure 1.1 (b) gives a similar plot for a rural location. Note that one of the
frequency bands, in black, is related to the sounds created by human activity, such as the
noise from automobile traffic on highways. We can see that the energy readings of the other
frequency bands (in red, green and blue) behave differently depending on the level of the
energy readings from human activity. Also, the patterns of such interaction vary across
locations, as in (a) and (b). This type of data can be viewed as a set of functional curves,
in which each curve corresponds to the energy readings over time for one frequency band.
To investigate the interaction between curves (species), we introduce spline mixed models.
Off-diagonal entries of the covariance matrix of the random effects can be interpreted as
correlations between different frequency bands (i.e., different species), so we can compare
off-diagonal entries to study the sensitivity of species in one frequency band to species in
another. This can be done by Bayesian hypothesis testing on the covariance ordering.
However, the entries of the covariance matrix lie in a constrained parameter space due to
the positive definiteness criterion, and different orderings of covariances may have
unbalanced prior probabilities, which is not desirable.
Figure 1.1: An Example of Biodiversity Data

In this chapter, we investigate the relationship and interaction between species by conducting Bayesian hypothesis testing on the covariance ordering of random effects in a spline
mixed model for multivariate functional data. We develop a default prior for the covariance
matrix which yields balanced prior probabilities on the covariance orderings after an
appropriate re-parametrization. With the proposed default prior, we fit a spline mixed model
using energy readings from two locations in the Upper Peninsula of Michigan. Our data
analysis reveals significant differences in the relationships among groups of species between
the two locations.
We state our spline mixed model in section 1.2 and introduce Bayesian hypothesis testing
as well as the construction of a default prior for the covariance matrix of random effects in
section 1.3. In section 1.4, we apply our method to the energy reading data from Michigan
to compare and discuss the activities of different species.
1.2 Spline Mixed Model
To investigate the interaction between multiple curves (functions) over time, which is of
interest in the biodiversity application, we consider a spline mixed model. Suppose that there
are d curves, {f_j(τ)}_{j=1}^d, on τ ∈ [0, T]. We observe these curves at n discrete time
points, 0 < τ1 < τ2 < · · · < τn < T, during the time period [0, T]. In the audio measurement
application, each curve corresponds to one frequency band and the time period is one day.
The data were collected over several days, and we treat the days as repeated measurements,
since it is reasonable to assume that the daily activity patterns of animals, including humans,
do not differ much across days. Let y^{(i)}_{jt} be the observed value of the j-th curve at
time τt on the i-th day, for j = 1, · · · , d, t = 1, · · · , n and i = 1, · · · , M.
For functional data, it is common to approximate a function using basis functions. In
this regard, we adopt an approximation of functions by spline basis functions. That is,

f(τ) ≈ Σ_{m=0}^{q} (τ^m / m!) β_m + Σ_{m=1}^{k} ((τ − ξ_m)_+^q / q!) γ_m,

where a_+ = max{a, 0}. This spline approximation uses order-q spline polynomial functions
with knot points ξ1, · · · , ξk. With the data structure described above, we introduce the
following spline mixed model in matrix form for the i-th day. Let
y^{(i)} = (y^{(i)}_{11}, y^{(i)}_{12}, . . . , y^{(i)}_{1n}, . . . , y^{(i)}_{d1}, y^{(i)}_{d2}, . . . , y^{(i)}_{dn})^T
be the set of observed curves for the i-th day, for i = 1, . . . , M. Then, we have
y(i) = (Id ⊗X)β(i) + (Id ⊗ Z)γ(i) + ε(i), (1.1)
where β^{(i)} = (β^{(i)}_{10}, · · · , β^{(i)}_{1q}, · · · , β^{(i)}_{d0}, · · · , β^{(i)}_{dq})^T are fixed effects,
γ^{(i)} = (γ^{(i)}_{11}, · · · , γ^{(i)}_{1k}, · · · , γ^{(i)}_{d1}, · · · , γ^{(i)}_{dk})^T are random effects, and
ε^{(i)} = (ε^{(i)}_{11}, ε^{(i)}_{12}, · · · , ε^{(i)}_{1n}, · · · , ε^{(i)}_{d1}, ε^{(i)}_{d2}, · · · , ε^{(i)}_{dn})^T are error terms.
The corresponding design matrices for the fixed effects and random effects are

X = [ 1   τ1   · · ·   τ1^q / q!
      1   τ2   · · ·   τ2^q / q!
      ⋮
      1   τn   · · ·   τn^q / q! ],

Z = (1/q!) [ (τ1 − ξ1)_+^q   (τ1 − ξ2)_+^q   · · ·   (τ1 − ξk)_+^q
             (τ2 − ξ1)_+^q   (τ2 − ξ2)_+^q   · · ·   (τ2 − ξk)_+^q
             ⋮
             (τn − ξ1)_+^q   (τn − ξ2)_+^q   · · ·   (τn − ξk)_+^q ].
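As a concrete illustration, the two design matrices can be assembled directly from their definitions. The sketch below is not code from the dissertation; the time grid and knot placement are hypothetical stand-ins.

```python
import numpy as np
from math import factorial

def design_matrices(tau, q, knots):
    """X: polynomial columns tau^m / m!, m = 0..q;
    Z: truncated-power columns (tau - xi)_+^q / q!, one per knot xi."""
    X = np.column_stack([tau**m / factorial(m) for m in range(q + 1)])
    Z = np.column_stack([np.maximum(tau - xi, 0.0)**q / factorial(q)
                         for xi in knots])
    return X, Z

# Hypothetical grid: 48 half-hourly points over a day, 8 equally spaced knots
tau = np.linspace(0.0, 23.5, 48)
knots = np.linspace(2.5, 21.5, 8)      # illustrative knot placement
X, Z = design_matrices(tau, q=1, knots=knots)
print(X.shape, Z.shape)  # (48, 2) (48, 8)
```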
We also assume that ε^{(i)} ∼ N(0, Σ), where Σ = D ⊗ Σϕ with D = diag(σ_1^2, σ_2^2, . . . , σ_d^2)
and Σϕ = (ϕ_{uv})_{n×n}, where

ϕ_{uv} = 1 if u = v,  ϕ if |u − v| = 1,  0 otherwise.

Here σ_j^2 allows each curve to have its own variability, and Σϕ is a temporal covariance
matrix that captures possible temporal dependence within each curve.

For the covariance structure of the random effects, we assume γ^{(i)} ∼ N(0, Σ0 ⊗ I_k) with
Σ0 = (σ_{jl})_{d×d}, so that cov(γ^{(i)}_j, γ^{(i)}_l) = σ_{jl} I_k for j, l = 1, . . . , d, and σ_{jl} models
the interaction between the j-th and l-th curves.
Let y = (y^{(1)}, y^{(2)}, . . . , y^{(M)})^T be the collection of all data points. Then, the matrix
form of the entire model is

y = Xβ + Zγ + ε, (1.2)

where X = I_M ⊗ I_d ⊗ X and Z = I_M ⊗ I_d ⊗ Z. Note that γ ∼ N(0, G) with G = I_M ⊗ Σ0 ⊗ I_k.
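The Kronecker structure of Σ and G can be formed directly with numpy; the dimensions and parameter values below are hypothetical stand-ins, not those of the data analysis.

```python
import numpy as np

# Illustrative dimensions and parameter values (stand-ins only)
M, d, n, k = 3, 2, 5, 4
phi = 0.2
sigma2 = np.array([1.0, 2.0])                       # sigma_j^2, j = 1..d

# Tridiagonal temporal covariance Sigma_phi and one-day error covariance
Sigma_phi = np.eye(n) + phi * (np.eye(n, k=1) + np.eye(n, k=-1))
D = np.diag(sigma2)
Sigma = np.kron(D, Sigma_phi)                       # Sigma = D (x) Sigma_phi

# Random-effect covariance G = I_M (x) Sigma0 (x) I_k
Sigma0 = np.array([[1.0, 0.3],
                   [0.3, 0.5]])
G = np.kron(np.eye(M), np.kron(Sigma0, np.eye(k)))
print(Sigma.shape, G.shape)  # (10, 10) (24, 24)
```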
1.3 Bayesian Hypothesis Testing on Covariance Ordering
To investigate the interaction between curves using the spline mixed model introduced in the
previous section, we consider orderings of the covariance matrix of the random effects. In
particular, we are interested in a Bayesian test on a specific ordering within a row, e.g.,
σ12 > σ13 > · · · > σ1d > 0.
1.3.1 Model Selection and Hypothesis Testing
Let θ = Σ0 be the parameter of interest. We assume that the parameter space Θ is
partitioned into subsets Θ_i such that Θ = ∪_{i=1}^{d!} Θ_i, where the Θ_i correspond to the
d! possible ordering restrictions of {σ12, σ13, . . . , σ1d, 0}. Note that each subspace Θ_i
represents a model M_i with the corresponding ordering. To perform model selection, we
consider testing the hypothesis H_i : θ ∈ Θ_i versus H_j : θ ∈ Θ_j, where i, j = 1, . . . , d!,
i ≠ j.
Given the priors πi = π(Θi) and πj = π(Θj), suppose the posterior probabilities are
νi = π(Θi | y) and νj = π(Θj | y). Then the Bayes factor is defined as

B = (posterior odds ratio) / (prior odds ratio) = (νi/νj) / (πi/πj)
  = [ ∫_{Θi} π(θi) f(y | θi) dθi / ∫_{Θj} π(θj) f(y | θj) dθj ] / (πi/πj). (1.3)
When the Bayes factor becomes large enough, we tend to propose Hi.
To compute the Bayes factor, we need an appropriate prior on Σ0. The prior should yield
a proper posterior distribution, and the ratio πi/πj should be easy to compute. However,
common priors for a covariance matrix, such as Jeffreys' prior and the inverse Wishart
prior, do not satisfy these criteria. This leads us to develop a default prior that yields a
proper posterior distribution and is simple to compute with.
First, we show that Jeffreys' reference prior leads to an improper posterior, so that it is
not suitable for our test.

Theorem 1. Assume d ≥ 4 and Mk > 2. The improper priors π(Σ0) ∝ |Σ0|^{−p} with
(1) p = (d + 1)/2, (2) p = d/2 − 1 and (3) p = d give rise to improper posteriors.

The proof of the theorem is provided in Appendix A.

By Theorem 1, we cannot calculate νi = π(Θi | y) and νj = π(Θj | y), which indicates
the inappropriateness of Jeffreys' reference prior for the proposed spline mixed models.
When an inverse Wishart prior is considered, it is clear that the posterior distribution is
proper. However, the inverse Wishart prior is not feasible for Σ0 in our testing problem,
since it has no explicit ordering representation. For instance, assume d = 4 and consider
testing the hypotheses H1 : σ12 > σ13 > σ14 > 0 vs. H2 : σ12 > 0 > σ13 > σ14. Note that in
this case, Θ1 = {σ12 > σ13 > σ14 > 0} and Θ2 = {σ12 > 0 > σ13 > σ14}. Then, for the inverse
Wishart prior Σ0 ∼ IW(S0, m), where S0 is a d × d positive definite matrix and m > d − 1
is a constant, it is difficult to calculate π1 = P(Σ0 ∈ Θ1) and π2 = P(Σ0 ∈ Θ2). Also, there
is no guarantee of equal prior probabilities, so prior subjectivity can affect the Bayes
factor. In the next section, we introduce a new prior on the covariance matrix which solves
both issues. That is, the new prior guarantees propriety of the posterior distribution and
equal prior probabilities, so that we do not need to calculate πi and πj in the Bayes factor
due to cancellation in πi/πj.
1.3.2 New Prior on Σ0 with Ordering Representation
We propose a partially improper prior on the covariance matrix via the Cholesky
decomposition. For simplicity, we assume Σϕ = I_n. As we see later in section 1.4, the real
data support this simple assumption.
Let Σ0 = LL^T be the Cholesky decomposition of Σ0, where

L = [ ℓ11   0
      ℓ1    L11 ]

is a lower triangular matrix. Then we have the following representation for Σ0:

Σ0 = [ σ11     σ1*
       σ1*^T   Σ11 ]
   = [ ℓ11    0   ] [ ℓ11   ℓ1^T
       ℓ1     L11 ] [ 0     L11^T ]
   = [ ℓ11^2     ℓ11 ℓ1^T
       ℓ11 ℓ1    ℓ1 ℓ1^T + L11 L11^T ],

where ℓ11 > 0 and ℓ1 ∈ R^{d−1}.
This re-parametrization separates out the constrained vector of parameters (ℓ1) from the
unconstrained parameters (ℓ11 and L11). Note that σ1* = (σ12, σ13, . . . , σ1d) = ℓ11 ℓ1^T.
Since ℓ11 > 0, each subset Θ_i of the parameter space corresponds to an ordering of the
entries of ℓ1, for i = 1, 2, . . . , d!. The following theorem guarantees a proper posterior
distribution when the flat prior on ℓ1 is used.
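The correspondence between orderings of ℓ1 and orderings of σ1* can be checked numerically; the covariance matrix below is a hypothetical example, not one from the data.

```python
import numpy as np

# Hypothetical 4x4 positive definite covariance matrix
Sigma0 = np.array([[ 2.0, 0.8, 0.5, -0.3],
                   [ 0.8, 1.5, 0.4,  0.2],
                   [ 0.5, 0.4, 1.2,  0.1],
                   [-0.3, 0.2, 0.1,  1.0]])
L = np.linalg.cholesky(Sigma0)          # lower triangular, Sigma0 = L L^T
l11, l1 = L[0, 0], L[1:, 0]

# sigma_{1*} = l11 * l1, so with l11 > 0 the two orderings coincide
sigma_1star = Sigma0[0, 1:]
assert np.allclose(l11 * l1, sigma_1star)
assert np.array_equal(np.argsort(l1), np.argsort(sigma_1star))
```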
Theorem 2. Assume n > q + 1 + k and Mk ≥ d − 1. Let Σ0 = LL^T, where

L = [ ℓ11   0
      ℓ1    L11 ].

Suppose that the priors satisfy

(1) π(β) ∝ 1,
(2) π(D) = Π_{j=1}^{d} π(σ_j^2), where π(σ_j^2) ∝ (1/σ_j^2)^{α_j + 1}, α_j > 0, for all j = 1, . . . , d,
(3) π(L) = π(ℓ11) π(ℓ1) π(L11), where π(ℓ1) ∝ 1, and π(ℓ11) and π(L11) are proper.

Then the posterior is proper.
The proof of the theorem is provided in Appendix A.
Now we reconsider the testing example H1 : σ12 > σ13 > σ14 > 0 vs. H2 : σ12 > 0 >
σ13 > σ14. Theorem 2 guarantees proper posteriors, so that ν1 = π(Θ1|y) and ν2 = π(Θ2|y)
are well defined. Furthermore, when the flat prior π(ℓ1) ∝ 1 is used, the relationship
between σ1* and ℓ1 implies π(Θ1) = π(Θ2) = 1/d! once 0 is included in the ordering of the
covariance parameters. Hence, in this situation, the priors cancel out and the Bayes factor
reduces to the posterior ratio B = νi/νj, which can be calculated from posterior samples
straightforwardly.
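Given posterior draws, the posterior probability of each ordering is just the fraction of draws satisfying it, and the Bayes factor is their ratio. The sketch below uses simulated stand-in draws, not draws from the actual posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "posterior draws" of (sigma_12, sigma_13, sigma_14); in practice
# these come from the Gibbs sampler, here they are simulated for illustration
draws = rng.multivariate_normal([0.13, 0.11, 0.06], 0.001 * np.eye(3), size=5000)
s12, s13, s14 = draws.T

# Posterior probability of an ordering = fraction of draws satisfying it
nu_1 = np.mean((s12 > s13) & (s13 > s14) & (s14 > 0))  # sigma12>sigma13>sigma14>0
nu_2 = np.mean((s13 > s12) & (s12 > s14) & (s14 > 0))  # sigma13>sigma12>sigma14>0

# Equal prior probabilities cancel, so the Bayes factor is the posterior ratio
B = nu_1 / nu_2
print(nu_1, nu_2, B)
```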
1.3.3 Bayesian Estimation
To obtain posterior samples of the parameters for calculating the Bayes factor, we use a
Markov chain Monte Carlo (MCMC) algorithm, in particular, the Gibbs sampling algorithm.
To derive the posterior distribution, we consider the following priors for the parameters β,
σ_j^2 and Σ0. We assume π(β) ∝ 1 and σ_j^2 ∼ IG(a_j, b_j), where IG(a, b) denotes the
inverse Gamma distribution with density π(x) ∝ x^{−(a+1)} e^{−1/(bx)}. For the prior of Σ0,
we use the prior on L developed in the previous section. That is, π(L) = π(ℓ11) π(ℓ1) π(L11)
with π(ℓ1) ∝ 1, ℓ11 ∼ IG(a0, b0) and L11 L11^T ∼ IW(Ω11, d − 1). Then, by Theorem 2,
the posterior distribution is proper and the MCMC algorithm is valid with the proposed
prior.

In this section, we derive the conditional posterior distributions of ℓ11, ℓ1 and L11. The
conditional posterior distributions of the other parameters are given in Appendix B.
First, we introduce the matrix S of quadratic forms of the random effects. That is,

S = Σ_{i=1}^{M} Σ_{l=1}^{k} g_l^{(i)T} g_l^{(i)}, with g_l^{(i)} = (γ^{(i)}_{1l}, γ^{(i)}_{2l}, . . . , γ^{(i)}_{dl}), for l = 1, . . . , k and i = 1, . . . , M.
Then, we can obtain the conditional posterior distributions of ℓ1, ℓ11^2 and L11 L11^T:

ℓ1 | γ, ℓ11, L11, . . . ∼ N(ℓ1*, V*_{ℓ1}),
ℓ11^2 | γ, ℓ1, L11, . . . ∼ IG( (Mk − d + 1)/2 + α0, (s11/2 + 1/β0)^{−1} ),
L11 L11^T | γ, ℓ11, ℓ1 ∼ IW( Mk − d − 1, S* + I_{d−1} ),

where ℓ1* = (ℓ11/s11) s1, V*_{ℓ1} = (ℓ11^2/s11)(L11 L11^T) and S* = S11 − (1/s11) s1 s1^T for

S = [ s11   s1^T
      s1    S11 ].
With the above conditional posterior distributions, we can obtain posterior samples by the
Gibbs sampling algorithm.
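One pass of these conditional draws can be sketched as follows, with hypothetical stand-ins for S and the hyperparameters, using scipy's inverse-gamma and inverse-Wishart samplers (in the real sampler, S is recomputed from the current draw of γ at every iteration).

```python
import numpy as np
from scipy.stats import invgamma, invwishart

rng = np.random.default_rng(1)
d, M, k = 4, 15, 8                      # dimensions as in the data analysis
alpha0, beta0 = 2.0, 1.0                # hypothetical hyperparameters

# Stand-in for S (d x d), the summed outer products of random-effect rows
A = rng.standard_normal((M * k, d))
S = A.T @ A
s11, s1, S11 = S[0, 0], S[1:, 0], S[1:, 1:]

# l11^2 | ... ~ IG((Mk - d + 1)/2 + alpha0, (s11/2 + 1/beta0)^(-1));
# for the density pi(x) ~ x^(-(a+1)) e^(-1/(bx)), scipy's scale equals 1/b
l11_sq = invgamma.rvs((M * k - d + 1) / 2 + alpha0,
                      scale=s11 / 2 + 1.0 / beta0, random_state=rng)

# L11 L11^T | ... ~ IW(Mk - d - 1, S* + I_{d-1}), S* = S11 - s1 s1^T / s11
S_star = S11 - np.outer(s1, s1) / s11
L11L11T = invwishart.rvs(df=M * k - d - 1, scale=S_star + np.eye(d - 1),
                         random_state=rng)

# l1 | ... ~ N(l1*, V*), l1* = (l11/s11) s1, V* = (l11^2/s11) L11 L11^T
l11 = np.sqrt(l11_sq)
l1 = rng.multivariate_normal(l11 / s11 * s1, l11_sq / s11 * L11L11T)
print(l11_sq > 0, l1.shape)  # True (3,)
```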
1.4 Analysis of Audio Frequency Data of Wildlife Species
in Michigan
In this section, we apply our methodology to audio frequency data of wildlife species in
Michigan.
The wildlife sounds at two locations, L00 and L02, in the Upper Peninsula of Michigan,
U.S., were recorded in 1-minute intervals every 30 minutes, from 12 a.m. to 11:30 p.m. each
day, from June 1, 2010 to June 15, 2010. L00 is located in the Crawford Bay Lane area,
Cheboygan, MI, close to main city roads, while L02 is at Crane Nest Narrows, Cheboygan,
MI, a relatively pristine area. The audio data were then converted into frequency data and
divided into ten frequency bands which correspond to different groups of species (human,
bird type 1, bird type 2, etc.). More information about the source of the data can be found
at the Remote Environmental Assessment Laboratory at Michigan State University
(real.msu.edu). The behavior of the energy readings in each frequency band represents the
activities of different types of species, and we can observe how other species' activities
interact with human activity through how the energy reading curves change over the time
of day.
Let {P^{(i)}_j(t) ≥ 0 : τt ∈ [0, T]; j = 1, . . . , D; i = 1, . . . , M} be the energy readings
for the D frequency bands obtained from the recorded audio measurements. Since the energy
readings are scaled, P^{(i)}_j(t) is the proportion for the j-th frequency band at time τt on
the i-th day. Thus, a natural constraint on P^{(i)}_j(t) is Σ_{j=1}^{D} P^{(i)}_j(t) = 1 for
t = 1, · · · , n and i = 1, . . . , M. For fixed i and t, {P^{(i)}_j(t)}_{j=1}^{D} can be viewed as
compositional data. The additive log-ratio transformation is widely applied to compositional
data, and we adopt it for P^{(i)}_j(t) as well. We define the additive log-ratio transformation
as

y^{(i)}_{jt} = log( P^{(i)}_j(t) / P^{(i)}_D(t) ), j = 1, . . . , D − 1.
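A minimal sketch of the additive log-ratio transform (the composition below is a hypothetical example):

```python
import numpy as np

def alr(P):
    """Additive log-ratio transform of compositional rows:
    y_j = log(P_j / P_D), dropping the last (reference) component."""
    P = np.asarray(P, dtype=float)
    return np.log(P[..., :-1] / P[..., -1:])

# Hypothetical composition over D = 3 frequency bands at one time point
p = np.array([0.5, 0.3, 0.2])
y = alr(p)
print(np.round(y, 4))  # [0.9163 0.4055], i.e. log(0.5/0.2), log(0.3/0.2)
```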
Assume that the set of time points is T = {τ1, τ2, . . . , τn} with 0 < τ1 < τ2 < · · · < τn < T,
and let d = D − 1. Let {y^{(i)}_{jt} : t = 1, . . . , n; j = 1, . . . , d; i = 1, . . . , M} be the
response variables for the proposed spline mixed model.
Since the last four frequency bands contribute relatively little to the audio intensity
throughout the period, we keep the first six frequency bands and combine the last four into
a single component. Thus, d = 6. The number of time points is n = 48 and the number of
days is M = 15. We then apply the spline mixed model introduced in section 1.2. An
intuitive choice for the order of the polynomial for the fixed effects is q = 1. We set up
k = 8 equally spaced time points as knots for the spline functions.
For the model (1.2) in section 1.2, we start by allowing some level of time dependence in Σϕ
through the parameter ϕ. To investigate the time dependence, we obtain the maximum
likelihood estimates (MLE) of β, γ and ϕ by treating γ as fixed effect coefficients for
simplicity. The derivation of the estimating equations for the MLE is provided in Appendix
C. The MLE of ϕ is −0.038 for location L00 and 0.024 for location L02. In both locations
the correlation between adjacent time points is weak, which suggests assuming ϕ = 0 for the
rest of the analysis.
1.4.1 Bayesian Inference Results
We run the MCMC algorithm via Gibbs sampling for N = 5000 iterations with three chains.
To check the convergence of the chains, we carry out Gelman-Rubin diagnostics by
calculating the potential scale reduction factor [10]. All three chains converged after 2000
iterations. Our estimation is based on the posterior mean of the last 1500 posterior samples,
i.e., 500 posterior samples from each chain. The upper 4 × 4 submatrices of the estimated
covariance matrices of the random effects are shown below, giving the MLE and the Bayesian
estimate for each of the locations L00 and L02.
For location L00,
Σ0 (MLE) =

[ 0.2307  0.1384  0.1140  0.0611
  0.1384  0.1872  0.1546  0.0826
  0.1140  0.1546  0.1703  0.1056
  0.0611  0.0826  0.1056  0.1114 ],

Σ0 (Bayesian) =

[ 0.2646  0.1334  0.1079  0.0605
  0.1334  0.2119  0.1599  0.0830
  0.1079  0.1599  0.1906  0.1143
  0.0605  0.0830  0.1143  0.1481 ].
For location L02,
Σ2 (MLE) =

[  0.0768   0.0075  −0.0326  −0.0365
   0.0075   0.0758   0.0686   0.0400
  −0.0326   0.0686   0.1500   0.1209
  −0.0365   0.0400   0.1209   0.1671 ],

Σ2 (Bayesian) =

[  0.1118   0.0287  −0.0201  −0.0271
   0.0287   0.1073   0.0795   0.0361
  −0.0201   0.0795   0.1657   0.1173
  −0.0271   0.0361   0.1173   0.1865 ].
We can see that, overall, the Bayesian estimates for both locations are very close to the
corresponding MLEs. They have the same signs for all entries of the covariance matrices,
which supports our Bayesian method from the likelihood perspective. Moreover, unlike at
location L00, the Bayesian estimate at L02 differs slightly from the MLE in that it gives a
larger variance σ_j^2 for each component and negative correlations (σ13 and σ14) of smaller
magnitude. Finally, the opposite signs of the corresponding entries at the two locations
suggest a difference between the locations in the overall relationship (over time points)
between band 1 and either band 3 or band 4. However, to further support this finding, we
have to perform a more detailed posterior sample analysis.
The Bayesian method allows us to obtain posterior samples of σ_{jl}, from which we can
investigate the interaction between different types of species (i.e., different frequency bands)
as well as compare the results from the two locations. In particular, we are interested in the
relationship between frequency band 1 and frequency bands 2 to 4. Frequency band 1
corresponds to the sounds from human activity, while frequency bands 2, 3 and 4 represent
three different categories of wildlife, including birds. So, we compute the posterior
probabilities of all orderings of {σ12, σ13, σ14, 0}; 0 is included to determine whether each
σ_{jl} is positive or negative. There are 4! = 24 possible orderings. We list the three highest posterior probabilities
among 24 possible orderings for each location.
For location L00,
P(Θ1^{(0)} = {σ12 > σ13 > σ14 > 0} | data) = P(ℓ12 > ℓ13 > ℓ14 > 0 | data) = 0.702,
P(Θ2^{(0)} = {σ13 > σ12 > σ14 > 0} | data) = P(ℓ13 > ℓ12 > ℓ14 > 0 | data) = 0.133,
P(Θ3^{(0)} = {σ12 > σ14 > σ13 > 0} | data) = P(ℓ12 > ℓ14 > ℓ13 > 0 | data) = 0.083.
For location L02,
P(Θ1^{(2)} = {σ12 > 0 > σ13 > σ14} | data) = P(ℓ12 > 0 > ℓ13 > ℓ14 | data) = 0.493,
P(Θ2^{(2)} = {σ12 > σ13 > 0 > σ14} | data) = P(ℓ12 > ℓ13 > 0 > ℓ14 | data) = 0.222,
P(Θ3^{(2)} = {σ12 > 0 > σ14 > σ13} | data) = P(ℓ12 > 0 > ℓ14 > ℓ13 | data) = 0.149.
Bayes factors computed from the posterior probabilities can be used to select a particular
ordering for each location. Thus, we consider testing the following hypotheses
Hi : θ^{(L)} ∈ Θi^{(L)} vs. Hj : θ^{(L)} ∈ Θj^{(L)},

where i, j = 1, 2, 3 index the orderings with high posterior probabilities and L = 0, 2
indicates the location of the data. Since π(Θi^{(L)}) = π(Θj^{(L)}) for all i, j = 1, 2, 3 and
L = 0, 2, the Bayes factor is just a ratio of posterior probabilities, i.e., Bij^{(L)} = νi^{(L)}/νj^{(L)}.
Tables 1.1 and 1.2 show the Bayes factors Bij^{(L)} at locations L00 and L02, respectively.
Denom. \ Num.          σ12 > σ13 > σ14 > 0   σ13 > σ12 > σ14 > 0   σ12 > σ14 > σ13 > 0
σ12 > σ13 > σ14 > 0             1                   0.19                  0.12
σ13 > σ12 > σ14 > 0           5.28                    1                   0.63
σ12 > σ14 > σ13 > 0           8.42                  1.60                    1

Table 1.1: Bayes Factor for Significant Orderings at L00

Denom. \ Num.          σ12 > 0 > σ13 > σ14   σ12 > σ13 > 0 > σ14   σ12 > 0 > σ14 > σ13
σ12 > 0 > σ13 > σ14             1                   0.45                  0.30
σ12 > σ13 > 0 > σ14           2.22                    1                   0.67
σ12 > 0 > σ14 > σ13           3.32                  1.50                    1

Table 1.2: Bayes Factor for Significant Orderings at L02

From the above analysis, we found significant differences in the patterns of the audio
frequency bands over time at the two locations, L00 and L02. At L00, all three bands 2, 3
and 4 were synchronous with human activity, with band 4 at the lowest level and band 2 at
the highest. On the other hand, at the pristine location L02, band 4 was most sensitive to
human activity, followed by band 3, while band 2 remained synchronous with human activity.
1.4.2 Discussion
In this chapter, we investigated audio measurement data from a real ecological setting.
Our primary objective was to extract information reflecting the relationship and interaction
between the variables of interest, and then to quantify the synchronicity or discrepancy of
such interactions across locations. By treating the observations as compositional data
varying over time, we are able to view them as functional curves. We proposed a spline
mixed model and interpreted the random effect coefficients of the spline polynomials as a
measure of interaction between curves over time. By specifying the covariance components
of the random effects, we transformed our questions into Bayesian inference on the orderings
of covariance matrices. To solve this problem, we developed a new non-informative prior on
covariance matrices with orderings which possesses posterior propriety as well as simple
implementation for testing. Finally, we applied our model with the new prior to the
biodiversity data from Michigan, and our Bayesian analysis successfully identified the main
characteristics of the interactions between species.

Some further extensions of the current work may be worthwhile. For instance, it would
be challenging but interesting to generalize the proposed prior to covariance matrices with
a wider class of orderings.
Chapter 2
Joinpoint Detection using
L1-Penalized Spline Method
2.1 Introduction
Cancer is one of the major causes of death in the United States. According to a global and
regional mortality study of 235 causes of death for 20 age groups published in The Lancet,
cancer claimed 8.0 million lives in 2010, 15.1% of all deaths worldwide, with large increases
in deaths from trachea, bronchus, and lung cancers [15]. Statistical analyses of cancer
incidence have gained increasing attention in recent years given this challenging public
health issue. Many such analyses investigate a single cancer rate curve varying over time. A
major interest is detecting changes in the trend of annual cancer incidence rates. This
concern can be viewed as a joinpoint identification problem based on a certain structure of
regression models.
Several methods have been developed for the joinpoint problem recently, including [14],
[21], [11] and [4]. In general, the joinpoint regression model can be defined in the following
way. Let y be an observed response at time point x ∈ S = {x1, x2, · · · , xn} with x1 ≤ x2 ≤
· · · ≤ xn. Then we assume
y = β0 + β1x + δ1(x − τ1)+ + · · · + δk(x − τk)+ + ε,    (2.1)
where τ1 < τ2 < · · · < τk are the unknown joinpoints, ε represents the error, and a₊ = a·1{a > 0}.
The main objective of joinpoint detection is to discover the correct number of joinpoints (k)
as well as their locations (the τj's).
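To make the model concrete, the following Python sketch (with illustrative slope and joinpoint values, not taken from the text) evaluates the piecewise-linear mean of model (2.1) built from truncated-line bases:

```python
import numpy as np

def joinpoint_mean(x, beta0, beta1, deltas, taus):
    """Mean of model (2.1): a line with slope changes delta_k at the taus."""
    mu = beta0 + beta1 * x
    for d, tau in zip(deltas, taus):
        mu = mu + d * np.maximum(x - tau, 0.0)   # truncated-line basis (x - tau)_+
    return mu

# hypothetical example: two joinpoints at 8 and 18 on a grid of 27 time points
x = np.arange(1, 28, dtype=float)
mu = joinpoint_mean(x, beta0=5.0, beta1=0.01, deltas=[0.02, -0.02], taus=[8, 18])
```

The slope is β1 before τ1, β1 + δ1 between τ1 and τ2, and β1 + δ1 + δ2 afterwards.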
One classical permutation test based (PTB) approach, proposed by Kim, et al. [14]
in 2000, has been adopted by the National Cancer Institute (NCI) as an analytical tool
(Joinpoint Software) in their Surveillance Research Program. It generally applies a grid
search technique to fit a regression model under homoscedastic and uncorrelated errors. The
identification of significant joinpoints is conducted by multiple pairwise permutation tests
which compare two best regression models fitted with different sets of presumed joinpoints.
From a statistical perspective, it involves only basic estimation and testing for piecewise
linear regression models and is thus well motivated and easy to understand. However, due to
the nature of the permutation test and the grid search technique, the PTB approach requires
heavy computation when either the maximum possible number of joinpoints (k) or the sample
size (n) is relatively large, typically when k > 5 or n > 50. This limitation makes joinpoint
detection on large datasets infeasible with the PTB.
In this chapter we propose a new method to detect the joinpoints on a discrete time grid
by introducing an ℓ1-penalized regression spline model. Our model specifies spline bases at
the time knots as the covariates and is thus able to capture potential joinpoints as a
subset of statistically significant slope changes. Meanwhile, by adding a penalty term, we
allow for shrinkage of the spline model so that insignificant slope changes are eliminated.
The rest of this chapter is organized in the following way. We define the penalized
regression spline model in section 2.2 and validate some theoretical properties of our estimates
in section 2.3. In section 2.4, we carry out a simulation study of our method and compare
the performance with the PTB method. Section 2.5 provides a case study of real data by
applying our approach to the cancer incidence rates for the overall U.S. from 1973 to 1999.
Note that a comparison with the results from the Joinpoint Software will also be presented.
We finally conclude with a discussion of possible extensions as well as potential future
work in section 2.6.
2.2 L1-Penalized Regression Spline Model
In this section we introduce the penalized regression spline model. Assume we observe data
{y_i}_{i=1}^n independently at time points {x_i}_{i=1}^n with x1 ≤ x2 ≤ · · · ≤ xn. We define a
penalized regression model with first-order spline bases as follows.
y = β0 + β1x + β2(x − τ2)+ + · · · + βp−1(x − τp)+ + ε,    (2.2)
where ε ∼ N(0, σ2) and τ2 < · · · < τp are (p − 1) time knots in the entire time interval. It
specifies p covariates as in the linear model.
For the annual cancer incidence rate problem, it is appropriate and straightforward to
assume the set of candidate joinpoints is identical to the set of observed covariates (years),
that is, p = n − 1 and τj = xj for j = 2, · · · , n − 1. Without loss of generality, we
assume the time points are equally spaced without ties. Furthermore, to guarantee some
conditions for the theoretical justification of our model, we rescale the observed time
points along with the spline bases so that xj = j/n, j = 1, 2, · · · , n. Introducing a matrix
form representation of our model, we have
y = Xβ + ε, ε ∼ N(0, σ2In), (2.3)
where y = (y1, y2, · · · , yn)T , β = (β0, β1, · · · , βn−1)T , ε = (ε1, ε2, · · · , εn)T and
X =

    | 1   x1   (x1 − x2)+   · · ·   (x1 − xn−1)+ |
    | ⋮    ⋮        ⋮          ⋱          ⋮      |
    | 1   xn   (xn − x2)+   · · ·   (xn − xn−1)+ |    (2.4)
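A minimal construction of the design matrix (2.4) with the rescaled knots xj = j/n might look as follows (a sketch of the layout displayed above, with the intercept, the linear term, and one truncated-line column per interior knot):

```python
import numpy as np

def spline_design(n):
    """Design matrix (2.4): columns are 1, x, and (x - x_j)_+ for j = 2..n-1,
    evaluated on the rescaled grid x_i = i/n."""
    x = np.arange(1, n + 1) / n
    cols = [np.ones(n), x]
    for j in range(2, n):                       # interior knots x_2, ..., x_{n-1}
        cols.append(np.maximum(x - x[j - 1], 0.0))
    return np.column_stack(cols)                # shape (n, n) under this design

X = spline_design(27)
```

Note that the first row has zeros in all truncated-line columns, since x1 is below every knot.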
Note that the PTB method works only on a small subset of time points as a candidate set
for joinpoints when performing the estimation or testing via grid search. On the other hand,
our penalizing method allows for the entire set of candidates and is motivated in a natural,
data-driven way. Note that in the model we exclude the first year as well as the last year,
since it is not reasonable to view them as possible joinpoints without further information.
When the sample size n is large, the usual least squares estimate of β may contain a large
number of nonzero components and fail to be consistent. This suggests applying one of the
most commonly used penalty structures, the Least Absolute Shrinkage and Selection Operator
(LASSO), introduced by Tibshirani [20], which amounts to minimizing the ℓ1-penalized negative
log-likelihood. Because of the singularity of the ℓ1 norm at the origin, LASSO is a powerful
tool for model selection. By using it, we keep only the candidates with nonzero estimated
coefficients, which can automatically be interpreted as significant slope changes and hence
as solid evidence of joinpoints. In our study, we treat the intercept (β0) and the first
slope (β1) as the baseline trend of the curves. It follows that they are not supposed to
characterize joinpoints and thus are not penalized like the other truncated spline
coefficients. However, it is still feasible in practice to restrict their absolute magnitude
by a pre-specified constant. So we consider a constrained parameter space
B = {β : |β0| + |β1| ≤ M}, where M is a positive constant independent of n. Our LASSO
type estimator is given by

β̂ = argmin_{β∈B} { ‖y − Xβ‖²₂/n + λ Σ_{j=2}^{n−1} |βj| }.    (2.5)
The parameter λ above is called the tuning (shrinkage) parameter; it essentially controls
how strongly the β's are penalized. In practice, λ is often pre-specified or obtained by a
data-driven validation procedure. Several algorithms for LASSO, including the well-known
Least Angle Regression (LAR) [5], have been developed in recent years. Most of them
automatically assume penalization of the full set of coefficients. However, since in our
model the baseline slope β1 is not penalized, we have to use an implementation which allows
penalization of only a partial set of parameters. The coordinate descent algorithm proposed
by Friedman, Hastie and Tibshirani [8] meets our demands by solving the regularization
paths as we desire.
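As a rough illustration of such a partial-penalty scheme (a didactic re-implementation of the idea, not the coordinate descent code of [8]), one can cyclically soft-threshold only the penalized coordinates and give the unpenalized ones a plain least-squares update:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def partial_lasso(X, y, lam, penalize, n_iter=500):
    """Cyclic coordinate descent for (1/n)||y - X b||^2 + lam * sum |b_j|,
    where the sum runs only over coordinates with penalize[j] = True."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ b                               # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]              # partial residual excluding j
            rho = X[:, j] @ r
            if penalize[j]:
                b[j] = soft(rho, n * lam / 2.0) / col_sq[j]
            else:
                b[j] = rho / col_sq[j]          # unpenalized: least-squares update
            r = r - X[:, j] * b[j]
    return b

# toy check: the response depends only on the two unpenalized columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] + 2.0 * X[:, 1]
b = partial_lasso(X, y, lam=5.0, penalize=[False, False, True, True])
```

With a sufficiently large λ the penalized coefficients are driven exactly to zero while the unpenalized baseline terms are fitted freely.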
2.3 Consistency of the LASSO Type Estimator
As we mentioned in the previous section, the LASSO method features computational
feasibility. Meanwhile, under certain assumptions on the regression model, it also provides
statistical accuracy: the LASSO estimator has been proved consistent in a series of notable
works, with some basic results available in [3]. We will provide a proof of consistency of
the LASSO type estimator in (2.5) following their approach. From now on we define
Σ = X^T X/n and denote its j-th diagonal entry by σ²_j, i.e. σ²_j = Σ_jj. Meanwhile, given
the random error ε, we define the event J_{λ0} = {ω : max_{1≤j≤n} 2|ε^T X^(j)|/n ≤ λ0},
where X^(j) denotes the j-th column of the
design matrix X. We first give a lemma showing that the event J_{λ0} has high probability.
Lemma 1. Assume σ²_j ≤ K for j = 1, 2, · · · , n and some K > 0. Then for all t > 0 and
λ0 = 2Kσ√((t² + 2 log n)/n), we have

P(J_{λ0}) ≥ 1 − 2 exp(−t²/2).    (2.6)
The proofs of Lemma 1 and the other propositions, lemmas and theorems are provided in the
Appendix. In particular, taking t² ≥ 4 log n, we obtain

P( max_{1≤j≤n} 2|ε^T X^(j)|/n > λ0 ) ≤ 2e^{−2 log n} = 2/n²    (2.7)

for λ0 ≥ 2√6 Kσ√(log n / n).
In practice we may need to replace the unknown variance parameter σ² by a reasonable
estimator σ̂². The following lemma shows that we can choose the sample second raw moment
as a candidate for σ̂².

Lemma 2. Let r = ‖Xβ‖²₂/σ² be the signal-to-noise ratio of the model (2.2). Consider the
estimator σ̂² = Y^T Y/n for σ². Given any 0 < α < 1, we have

P(σ̂² < (1 − α)σ²) ≤ min{ 2(n + 2r)/(α²n²), (12(n + 2r)² + 48(n + 4r))/(α⁴n⁴), 1 }.    (2.8)
The previous lemma states that the σ̂² proposed above cannot be much smaller than σ².
Now, combining the two lemmas above, we conclude the consistency of our estimator β̂.
Theorem 3 (Consistency of the LASSO Type Estimator). Suppose σ²_j ≤ K for j = 1, 2, · · · , n
and some K > 0. Let σ̂² = Y^T Y/n and λ = 12Kσ̂√(log n / n). Consider the model (2.2)
with the estimator β̂ in (2.5). If

lim_{n→∞} ‖β‖₁ / √(n / log n) = 0  and  r_n = ‖Xβ‖²₂/σ² ≤ R < ∞ for some R > 0,

then we have ‖X(β̂ − β)‖²₂/n → 0 almost surely as n → ∞, i.e.

P( lim_{n→∞} ‖X(β̂ − β)‖²₂/n = 0 ) = 1.    (2.9)
It is easy to verify that the matrix Σ in model (2.2) satisfies σ²_j ≤ K for all j with K = 1.
Furthermore, note that the proof of Theorem 3 remains valid when the number of covariates
p is greater than the number of time points n, although they are identical under our design.
That is, our model (2.2) can be extended to allow more intermediate time knots as joinpoint
candidates.
The main theorem above establishes the consistency of our LASSO estimator in terms of
prediction error. A stronger conclusion for this type of estimator has also been studied
in [3], where a more refined ℓ1 consistency result for β̂ is provided under some additional
compatibility conditions on the design matrix X. An alternative restricted eigenvalue
condition is given in [2]. Given an index set S ⊂ {1, 2, · · · , n}, denote
βS = (β_{0,S}, β_{1,S}, · · · , β_{n−1,S})^T where β_{j,S} = βj·1{j ∈ S} for j = 1, 2, · · · , n − 1.
The compatibility condition is stated as follows.
Assumption 1 (Compatibility Condition). Let Σ = X^T X/n be the Gram matrix. Given
the true index set S0 with cardinality s0 = |S0|, there exists φn > 0 such that for all β
satisfying ‖β_{S0ᶜ}‖₁ ≤ 3‖β_{S0}‖₁, it holds that

‖β_{S0}‖²₁ ≤ (s0/φ²n)(β^T Σβ)  and  s0λ²/φ²n = o(1).    (2.10)
Unfortunately, we point out that under our design with λ = O(√(log n / n)), the matrix Σ
violates the compatibility condition.
Proposition 1. The eigenvalues of the design matrix in (2.4) are

ν1 = (n + 2 − √(n² + 4)) / (2n),  ν2 = ν3 = · · · = νn−1 = 1/n,  νn = (n + 2 + √(n² + 4)) / (2n).    (2.11)
From the claim above we see that at least (n − 1) of the eigenvalues of X are no greater
than n⁻¹. Therefore, by the properties of eigenvalues, we conclude that the smallest
eigenvalue of Σ = X^T X/n, denoted by ν1(Σ), is at most n⁻³. Given the simple facts that
‖β_{S0}‖²₁ ≤ s0‖β‖²₂ and ν1(Σ)‖β‖²₂ ≤ β^T Σβ for all β, it follows that φ²n ≤ n⁻³, hence
λ²s0/φ²n ≥ O(n² log n), which contradicts the assumption that s0λ²/φ²n = o(1) in (2.10).
In other words, unless we choose a λ extremely close to 0, we will not be able to meet the
compatibility condition.
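Two consequences of Proposition 1 are easy to verify numerically: the eigenvalues of X sum to 2, and their product is (1/n)^{n−1}, since ν1·νn = ((n+2)² − (n²+4))/(4n²) = 1/n. A small sketch:

```python
import numpy as np

def spline_design(n):
    """Design matrix (2.4) on the rescaled grid x_i = i/n."""
    x = np.arange(1, n + 1) / n
    return np.column_stack([np.ones(n), x] +
                           [np.maximum(x - x[j], 0.0) for j in range(1, n - 1)])

n = 27
X = spline_design(n)
tr = np.trace(X)                      # equals the eigenvalue sum, 2 by (2.11)
sign, logdet = np.linalg.slogdet(X)   # log|det X| = -(n-1) log n by (2.11)
```

The tiny determinant n^{−(n−1)} is exactly the ill-conditioning that makes the compatibility constant φ²n collapse.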
As discussed above, the violation of compatibility for Σ implies that we cannot use a
similar technique as before to prove the ℓ1 consistency result for β̂. This is not
surprising, since the design matrix X consists of rescaled time points xj = j/n,
j = 1, 2, · · · , n, so that we may need conditions such as ‖β‖₂ = O(n) to have
r = ‖Xβ‖²₂/σ² < ∞. Also, it is natural to assume s0 < ∞. Then, given S0 = S, ‖β‖₁ = O(n).
Hence the usual ℓ1 consistency for β̂ may not be a reasonable criterion under our settings.
On the other hand, if we define Ŝ = {2 ≤ j ≤ n − 1 : β̂j ≠ 0} as our selected set of
joinpoints, one may desire the following oracle properties:

P(Ŝ = S0) → 1  or  P(Ŝ ⊃ S0) → 1  as n → ∞.    (2.12)

These indicate that the estimated effective joinpoints will at least cover the true set S0
when n is large. In our case, we may seek a weaker oracle property. Note that
Xβ = (X1β, X2β, · · · , Xnβ)^T, where the Xj are the row vectors of X. Let
S∗ = {2 ≤ j ≤ n − 1 : Xjβ ≠ 0} and Ŝ∗ = {2 ≤ j ≤ n − 1 : Xjβ̂ ≠ 0}. Then one may show that

P(Ŝ∗ = S∗) → 1  or  P(Ŝ∗ ⊃ S∗) → 1  as n → ∞.    (2.13)
Note that here Xj (j ≥ 2) has nonzero elements only in its first j entries. After all, it
is possible that, although the compatibility assumption (a sufficient condition) fails under
the design X, the numerical results still show excellent performance in variable selection.
2.4 Simulation Study
In this section, we examine the performance of joinpoint detection based on the ℓ1-penalized
regression (LASSO) method and on the PTB method as implemented in the Joinpoint Software,
via several simulations.
To be consistent with the real data analysis in the next section, we first consider n = 27
and x ∈ S = {1, 2, · · · , 27}. The true model with two joinpoints can be written as follows:
y = β0 + β1x + δ1(x − τ1)+ + δ2(x − τ2)+ + ε,    (2.14)
where the errors ε are again assumed to be independent Gaussian N(0, σ²). To further specify
different models based on the magnitude of the slope changes, the positions of the joinpoints
and the noise level, we let β0 = 5, (τ1, τ2) = (8, 18) or (18, 23), and σ² = 0.0001 or 0.001.
Meanwhile, as given in [14], the slopes β1, δ1 and δ2 are determined by the annual percentage
changes APC = (APC1, APC2, APC3) in the following way: β1 = log(1 + 0.01 APC1),
δ1 = log(1 + 0.01 APC2) − log(1 + 0.01 APC1) and δ2 = log(1 + 0.01 APC3) − log(1 + 0.01 APC2).
In our simulations we set APC = (1, 3, 1) or (4, −1, 2), which leads to different trends in
the data. All combinations of the true parameters hence give us eight different scenarios.
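The APC-to-slope conversion just described can be sketched as a small helper:

```python
import numpy as np

def apc_to_slopes(apc):
    """Map annual percentage changes (APC_1, ..., APC_m) to the baseline
    slope beta1 and the slope changes delta_k of model (2.14)."""
    g = np.log1p(0.01 * np.asarray(apc, dtype=float))   # log(1 + 0.01*APC_k)
    beta1 = g[0]
    deltas = np.diff(g)                                  # delta_k = g[k] - g[k-1]
    return beta1, deltas

# one of the simulation scenarios from the text
b1, (d1, d2) = apc_to_slopes((1, 3, 1))
```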
In the LASSO spline model, we specify the design matrix X in the form of (2.4) with
xj = j/27, j = 1, 2, · · · , 27. That is, we rescale the observed time points without losing
information.
One of the major concerns when fitting the generated datasets to our model is selecting
appropriate tuning (penalty) parameters λ, since they directly control the sparsity, i.e.,
the effective number of joinpoints. Some popular rules, such as cross validation (CV) [12]
and the Bayesian information criterion (BIC) [22], are widely used to handle this issue. As
a model validation technique, cross validation partitions a sample of data into a training
set and a testing set, performs the analysis on the training set, and validates the analysis
on the testing set. It has shown good predictive accuracy in much of the literature. We
apply 5-fold cross validation in the simulations: the original observations are randomly
partitioned into five subsamples of nearly equal size; a single subsample is retained as the
validation data for testing the model while the remaining four are used as training data;
and the process is repeated five times, with each of the five subsamples used exactly once
as the validation data. We then simply select the tuning parameter λ with minimum mean
squared error (MSE). However, sometimes we may prefer a more parsimonious model with fewer
covariates or, in our case, not too many estimated
joinpoints. Hastie, Tibshirani and Friedman suggest a 'one standard error rule' [8], which
chooses the largest λ value within one standard error of the cross-validation minimizer λ̂,
i.e. λ̃ = sup{λ : MSE(λ) ≤ MSE(λ̂) + SE(λ̂)}, where MSE(λ) and SE(λ) denote the mean squared
error and standard error under λ. We will provide results under both the original cross
validation approach (L-CV) and the one standard error rule (L-SE).
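The one standard error rule itself can be sketched as a small helper, given a grid of λ values with their cross-validated errors and standard errors (illustrative arrays below, not the actual simulation output):

```python
import numpy as np

def one_se_lambda(lams, mse, se):
    """'One standard error' rule: the largest lambda whose CV error is within
    one standard error of the minimum CV error. Arrays are aligned over the
    lambda grid."""
    lams, mse, se = map(np.asarray, (lams, mse, se))
    i = np.argmin(mse)                       # index of the CV minimizer
    ok = mse <= mse[i] + se[i]               # candidates within one SE
    return lams[ok].max()

# hypothetical grid: minimum at lambda = 0.2, but 0.3 is still within one SE
lam_1se = one_se_lambda([0.1, 0.2, 0.3, 0.4],
                        [1.00, 0.90, 0.95, 1.20],
                        [0.10, 0.10, 0.10, 0.10])
```

Choosing the larger λ trades a little predictive accuracy for a sparser set of joinpoints.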
As an alternative model for comparison purposes, we also apply the PTB method implemented
in the Joinpoint Software. Since the PTB approach relies on a grid search when fitting the
model, it is inevitably time-consuming even for a moderate number of candidate joinpoints
when the sample size is relatively large (usually when n ≥ 40). In particular, when the
maximum number of joinpoints is greater than 4, the computation becomes quite slow. As
advised by the user manual of the software, we set
the maximum number of joinpoints to 4. Since our penalized spline model has no such
limitation and processes the same data much faster, as described in section 2.2, we claim
that it demonstrates more flexibility in modeling various types of real data effectively.
Moreover, one of the settings of the PTB implementation imposes a minimum distance (3 by
default) between two consecutive estimated joinpoints. Therefore, to make a fair comparison,
we also consider a modification of both the L-CV and L-SE methods. First, we call a
sequentially ordered subset of the originally selected components a 'cluster' if it is the
largest set consisting of two or more consecutive nonzero coefficients with the same sign,
either all positive or all negative, i.e. a set either of the form
B+(s, t) := {(βs, βs+1, · · · , βt−1, βt) : βs−1 ≤ 0; βs, · · · , βt > 0; βt+1 ≤ 0} or of the form
B−(s, t) := {(βs, βs+1, · · · , βt−1, βt) : βs−1 ≥ 0; βs, · · · , βt < 0; βt+1 ≥ 0} for some
2 ≤ s < t ≤ n − 2. Every other selected component not belonging to a 'cluster' is called a
'singleton'. Our modification then picks, as the joinpoints, the single component from each
'cluster' that minimizes the mean squared error, together with all the 'singletons'. We
denote these modified L-CV and L-SE methods by L-CV-M and L-SE-M, respectively. To summarize
all the methods we compare with the PTB, we present the following 2 × 2 chart.
Tuning Parameter \ Refinement            Original LASSO    Modified LASSO
λ1 via Cross-validation                  L-CV              L-CV-M
λ2 by the 'One Standard Error' Rule      L-SE              L-SE-M

Table 2.1: A Summary of Four Methods to Compare with the PTB
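The 'cluster'/'singleton' classification underlying L-CV-M and L-SE-M can be sketched as follows (the final within-cluster pick by minimum mean squared error, which requires refitting, is omitted from this sketch):

```python
import numpy as np

def clusters_and_singletons(beta):
    """Scan the estimated knot coefficients: maximal runs of two or more
    consecutive nonzero coefficients with a common sign are 'clusters';
    any other nonzero coefficient is a 'singleton'."""
    groups, run = [], []
    for j, b in enumerate(beta):
        if b != 0 and (not run or np.sign(b) == np.sign(beta[run[-1]])):
            run.append(j)                    # extend the current same-sign run
        else:
            if run:
                groups.append(run)           # close the finished run
            run = [j] if b != 0 else []
    if run:
        groups.append(run)
    clusters = [g for g in groups if len(g) >= 2]
    singletons = [g[0] for g in groups if len(g) == 1]
    return clusters, singletons

# toy coefficient vector: indices 1-2 form a positive cluster; 4, 6, 7 are singletons
clusters, singles = clusters_and_singletons([0.0, 0.5, 0.3, 0.0, -0.2, 0.0, 0.1, -0.4])
```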
We simulate the data 1000 times for each of the eight scenarios and fit them by the L-CV,
L-CV-M, L-SE, L-SE-M and PTB methods. The performance of joinpoint detection is characterized
by three quantities: (1) the average number of correctly identified joinpoints, #CJ; (2) the
average number of incorrectly identified joinpoints, #IJ; and (3) the average mean squared
error of the predictions, PMSE = ‖X(β̂ − β)‖²₂/n. The results are listed in Tables 2.2
and 2.3.
From the results we see that, in general, the ℓ1-penalized regression spline model with the
original cross validation procedure (L-CV) provides better detection of the true joinpoints
than PTB or L-SE. However, its relatively high #IJ values indicate that it usually introduces
more joinpoints than the other two methods. The penalized model with cross validation under
the one standard error rule (L-SE) performs as well as or better than PTB in detecting the
true joinpoints most of the time, and it brings down the #IJ value compared to L-CV due to
heavier penalization of the parameters. Meanwhile, the modified approach L-CV-M usually
attains higher #CJ values than L-SE and lower #IJ values than L-CV, thus giving a more
balanced and robust estimate between L-CV and L-SE. The L-SE-M reduces the number of
effective joinpoints dramatically but is more likely to miss the true joinpoints
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   PTB      1.377  1.284  0.266
                    L-CV     1.648  3.852  0.940
                    L-CV-M   1.436  1.659  0.277
                    L-SE     1.592  3.831  1.093
                    L-SE-M   1.392  1.691  0.419
        (4, −1, 2)  PTB      1.897  0.236  0.200
                    L-CV     1.942  2.982  0.993
                    L-CV-M   1.909  0.694  0.208
                    L-SE     1.933  2.996  1.019
                    L-SE-M   1.900  0.714  0.211
0.0010  (1, 3, 1)   PTB      0.306  3.047  3.956
                    L-CV     0.696  4.956  3.679
                    L-CV-M   0.448  3.648  3.711
                    L-SE     0.283  3.375  5.673
                    L-SE-M   0.215  2.754  5.663
        (4, −1, 2)  PTB      0.960  2.106  2.754
                    L-CV     1.391  5.121  3.351
                    L-CV-M   1.032  3.367  2.913
                    L-SE     1.171  4.485  5.523
                    L-SE-M   0.966  2.952  3.247

Table 2.2: Simulation Result for n = 27, (τ1, τ2) = (8, 18)
compared to the other methods. As for computational efficiency, as we expected, the LASSO
models showed notable superiority over the PTB thanks to the coordinate descent algorithm
mentioned in section 2.2.
To further investigate the performance of our LASSO method under larger sample sizes, we
design additional groups of simulations with n = 54 and 135, corresponding to 2 and 5 times
the original sample size of responses. We keep the settings for the APC values and σ² the
same as in the previous simulations. However, as the time line extends, it is reasonable to
simply maintain the relative locations of the true joinpoints, for example,
(τ1, τ2) = (16, 36) or (36, 46) respectively for n = 54. Also note that since the PTB
method would be significantly time-
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   PTB      1.118  1.667  0.339
                    L-CV     1.290  3.309  0.444
                    L-CV-M   1.093  1.913  0.343
                    L-SE     1.010  3.113  0.766
                    L-SE-M   0.749  2.190  0.571
        (4, −1, 2)  PTB      1.748  0.561  0.232
                    L-CV     1.714  3.314  0.551
                    L-CV-M   1.575  1.023  0.353
                    L-SE     1.435  3.429  0.945
                    L-SE-M   1.259  1.470  0.585
0.0010  (1, 3, 1)   PTB      0.242  2.637  2.498
                    L-CV     0.623  3.703  2.533
                    L-CV-M   0.510  2.793  2.659
                    L-SE     0.272  2.586  4.112
                    L-SE-M   0.231  2.308  3.867
        (4, −1, 2)  PTB      0.448  2.368  3.275
                    L-CV     1.134  3.897  2.856
                    L-CV-M   0.774  2.646  2.981
                    L-SE     0.899  2.608  5.467
                    L-SE-M   0.604  2.098  3.258

Table 2.3: Simulation Result for n = 27, (τ1, τ2) = (18, 23)
consuming under large sample sizes, we decided not to include it in this comparison. The
results are shown in Tables 2.4 through 2.7.
It is clear that as the sample size n increases, the LASSO model tends to select more
candidate components as joinpoints, which are then more likely to cover the true joinpoints.
Meanwhile, the modified methods L-CV-M and L-SE-M reduce #IJ dramatically by eliminating
the redundant information within 'clusters'. In general, both of our methods L-CV-M and
L-SE-M give acceptable estimates.

On the other hand, note that as the sample size n increases, it is more worthwhile to
study the LASSO model with the asymptotic tuning parameter λ ∼ O(σ√(log(n)/n)) than the
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.729  6.621  0.954
                    L-CV-M   1.445  2.250  0.163
                    L-SE     1.729  6.626  0.959
                    L-SE-M   1.443  2.256  0.163
        (4, −1, 2)  L-CV     1.883  5.227  1.004
                    L-CV-M   1.696  1.434  0.274
                    L-SE     1.883  5.220  1.010
                    L-SE-M   1.696  1.433  0.274
0.0010  (1, 3, 1)   L-CV     1.458  8.487  2.089
                    L-CV-M   0.685  4.663  1.612
                    L-SE     1.329  7.406  3.417
                    L-SE-M   0.680  4.107  1.645
        (4, −1, 2)  L-CV     1.615  6.827  2.250
                    L-CV-M   1.065  4.041  1.878
                    L-SE     1.546  6.136  3.386
                    L-SE-M   1.097  3.578  1.775

Table 2.4: Simulation Result for n = 54, (τ1, τ2) = (16, 36)
usual one selected by cross-validation techniques. Thus we also use λ0 = 4√6 σ√(log(n)/n)
as specified in section 2.3 to verify the asymptotic behavior of the predicted mean squared
error (PMSE). Some results under σ² = 0.0010 are shown in Figures 2.1 and 2.2. To save
computation time, we only take n = 27 × 5k for k = 3, · · · , 15 for demonstration.
We can see a clear decreasing trend in the PMSE, which meets our expectation, since we
have shown that the prediction error should be close to 0 once the sample size is large
enough.
Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (1, 3, 1), σ² = 0.0010

Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (4, −1, 2), σ² = 0.0010
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.593  5.239  0.416
                    L-CV-M   1.391  1.830  0.151
                    L-SE     1.518  5.307  0.466
                    L-SE-M   1.324  1.945  0.163
        (4, −1, 2)  L-CV     1.833  6.497  0.532
                    L-CV-M   0.828  2.876  1.537
                    L-SE     1.763  6.479  0.618
                    L-SE-M   0.810  2.896  1.511
0.0010  (1, 3, 1)   L-CV     1.011  6.176  1.650
                    L-CV-M   0.635  3.951  1.544
                    L-SE     0.669  4.262  3.489
                    L-SE-M   0.412  3.186  2.329
        (4, −1, 2)  L-CV     1.545  7.744  1.739
                    L-CV-M   0.748  3.894  2.035
                    L-SE     1.185  5.753  3.488
                    L-SE-M   0.652  3.240  2.460

Table 2.5: Simulation Result for n = 54, (τ1, τ2) = (36, 46)
2.5 A Case Study: National Cancer Incidence Rate
Analysis
As a real application of the penalized regression model, in this section we employ our
method to analyze the cancer rates data.

First, Figure 2.3 shows the age-adjusted incidence rates of colon and rectal cancer for the
overall population in the U.S. from 1973 to 1999. These data were reported in the SEER
Cancer Statistics Review [17] utilizing the Joinpoint Software. Some clear trends and
changes over time are easy to see.
We again apply the PTB and LASSO methods to the cancer rate data. As indicated in the
previous section, when applying the PTB method via the Joinpoint Software, we set the
maximum number of potential joinpoints to four. We use the modified LASSO methods L-CV-M
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.300  16.08  1.584
                    L-CV-M   0.661  5.663  0.137
                    L-SE     1.300  16.08  1.584
                    L-SE-M   0.661  5.663  0.137
        (4, −1, 2)  L-CV     1.518  13.79  1.438
                    L-CV-M   0.611  4.736  6.280
                    L-SE     1.518  13.79  1.441
                    L-SE-M   0.611  4.736  6.280
0.0010  (1, 3, 1)   L-CV     1.479  18.35  1.783
                    L-CV-M   0.688  6.461  0.827
                    L-SE     1.483  17.75  2.083
                    L-SE-M   0.692  6.257  0.815
        (4, −1, 2)  L-CV     1.495  15.92  1.939
                    L-CV-M   0.598  6.120  6.211
                    L-SE     1.492  15.43  2.231
                    L-SE-M   0.578  6.01   5.760

Table 2.6: Simulation Result for n = 135, (τ1, τ2) = (40, 90)
and L-SE-M for comparison purposes. The fitting results are given in Table 2.8.
In Table 2.8, PTB (#2) denotes the original unrestricted PTB method, while PTB (#4) is
implemented under a pre-specified number of four joinpoints for comparison purposes only.
From the results we can see that the LASSO type penalized regression methods tend to be
more aggressive than the PTB when selecting joinpoints. We have no strong reason to suspect
an over-fitting issue for this model, since the tuning parameters λ are well chosen by the
cross validation procedures. Moreover, we see that L-SE-M, the most heavily penalized of
our approaches, gives the same number of joinpoints as the original PTB (#2), while L-CV-M
selects more joinpoints than any other method. It is thus clear that our method provides
more flexibility in determining the total number of joinpoints, because we are always able
to control this important quantity through a series of appropriate tuning parameters.
Finally, we show the plot of the fitted models in Figure 2.4, where we see that both
approaches successfully pick out
Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999)
Figure 2.4: Fitted Incidence Rates (1973 - 1999)
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.897  11.89  0.551
                    L-CV-M   0.676  3.848  0.211
                    L-SE     1.895  11.88  0.554
                    L-SE-M   0.675  3.850  0.211
        (4, −1, 2)  L-CV     1.550  16.72  0.905
                    L-CV-M   0.271  4.815  15.05
                    L-SE     1.553  16.69  0.925
                    L-SE-M   0.272  4.804  15.07
0.0010  (1, 3, 1)   L-CV     1.736  15.41  1.060
                    L-CV-M   0.465  5.504  0.831
                    L-SE     1.518  13.85  1.903
                    L-SE-M   0.468  4.790  0.816
        (4, −1, 2)  L-CV     1.575  20.53  1.411
                    L-CV-M   0.247  5.937  13.68
                    L-SE     1.513  17.93  2.266
                    L-SE-M   0.237  5.333  14.26

Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115)
Method     # of Joinpoints   Joinpoint Details (Year)
PTB (#2)   2                 1986, 1995
PTB (#4)   4                 1977, 1982, 1985, 1995
L-CV-M     7                 1976, 1980, 1982, 1985, 1990, 1993, 1995
L-SE-M     2                 1985, 1996

Table 2.8: Joinpoint Detection for Real Data Analysis
the peak value around year 1985 as well as the bottom value at year 1995, and our approach
offers a quite close fit to the incidence rate curve.
Overall, we are satisfied with the performance of our penalized regression model regarding
both coverage and accuracy of the joinpoint detection. However, one should note that we
assumed homoscedasticity and independence of the errors in our model for simplicity.
Further approaches involving more complicated but realistic error structures are therefore
always worthwhile.
2.6 Conclusion and Discussion
To summarize this chapter, we proposed a new method to solve the joinpoint detection
problem on discrete time points using an ℓ1-type penalized regression spline model (LASSO).
We proved a basic version of consistency of the LASSO joinpoint estimator and compared it
with another commonly used approach, the PTB, via simulations and real data analysis.
Based on the results, we conclude that our method performs well: it showed superiority over
the PTB in terms of both efficiency and accuracy, and it is easy to understand as well as
to implement. In practice, one may obtain an optimal estimate of the joinpoints by
carefully selecting a set of appropriate tuning parameters. Moreover, the problem of
exploring stronger consistency properties of our estimator remains open and is worthwhile
for further investigation.
Chapter 3
Nonparametric Bayesian Clustering of
Functional Data
3.1 Introduction
We start this chapter as a continuation of our study of the mortality and incidence rates
of cancers. However, unlike the previous chapter, we first consider a more general scenario
in which multiple curves are available for comparison and analysis at the same time. For
example, if cancer data can be obtained at the state level in the U.S., then one may
overlook potential relationships or interactions among curves from different states by
simply fitting them individually. In fact, even for the same joinpoint detection question
as in Chapter 2, on the set of multiple functional simulated data appearing in section
3.5.2, our ℓ1-penalized method from Chapter 2 gives #CJ = 0.531 and #IJ = 1.449, while the
model proposed by Dass et al. (2014), which is similar to the one we present in section
3.2, leads to #CJ = 0.816 and #IJ = 0.551, respectively. Thus it is expected that with more
information on the curves, such as their group memberships, we are likely to obtain better
estimates of their trends. Motivated by this comparison, we now focus on grouping together
the curves that have similar patterns.
That is, our main concern is whether they form groups with similar overall trends but
significant variation between groups. For instance, one may suspect that the mortality
rates for all 48 states of the U.S. indicate certain geographical similarities. Hence an
appropriate model is required to capture the characteristics of the patterns of the curves
and to classify them into groups, or clusters, as the primary objective.
In general, the clustering problem briefly introduced above can be conveniently formulated
in the following longitudinal way. Let fs(t) be the true function at time t ∈ [0, T] for
the s-th individual, s = 1, · · · , N. It is usual and natural to assume each function fs is
smooth, so that it can be approximated by polynomial functions of any degree. A natural
and heuristic approach to the clustering problem comes from the classification perspective,
where MacQueen (1967) and Hartigan and Wong (1978) introduced the famous K-means algorithm
to partition data into groups. Each curve is first viewed as a discretely observed vector;
the algorithm, which aims to minimize the within-group sum of squares, is then applied
directly to the multiple vectors, and cluster membership is determined as a result of the
classification.
Overall, the entire process is straightforward and easy to implement. But there are also
clear disadvantages to such an approach. For example, it requires the number of groups or
clusters to be specified at the beginning and fixed throughout the procedure, although this
number is unknown in most real studies. Moreover, this clustering method treats each
individual curve as an independent outcome and thus overlooks potential correlations or
interactions between curves, which may not be a negligible factor when recovering the true
clusters. Later, more sophisticated model-based methods were developed to address these
issues. James and Sugar (2003) proposed a mixed effects model that is more powerful for
clustering sparsely observed functional data. In their approach, B-splines are considered
to capture the individual curves, and random spline coefficients with a mixture normal
distribution are used to cluster them. It handles missing values well and provides
reasonable estimates and confidence intervals for them. However, like the K-means
algorithm, the number of clusters in the model has
also to be pre-determined by independent criteria. Meanwhile, the knot points are the same
across all clusters, which reduces the flexibility of the model as well as its ability to
process a group of curves with varying connecting points. Abraham et al. (2003) combined a
B-spline basis with the K-means procedure on fixed effects to study this problem. Some
other model-based approaches include Banfield and Raftery (1993), Serban and Wasserman
(2005), Heard et al. (2006) and Ray and Mallick (2006).
We develop our model as a powerful tool for classification purposes by applying a classical nonparametric Bayesian approach, the Dirichlet Process (DP) methodology [7], in an innovative way to cluster the trends of different curves as polynomial components. Our method utilizes this representation of smooth curves to obtain relevant information such as connection and fluctuation. In particular, unlike most of the other methods mentioned in the previous paragraph, our proposed approach does not require an initial estimate of the number of clusters; instead, it provides a reasonable estimate along with an empirical distribution. It also allows curves to be characterized by different numbers and locations of knot points, which makes it more capable of fitting individual curves well. Furthermore, it performs the clustering based on a good summary of the shape and trend of the curves and thus demonstrates the similarity of patterns between curves within the same cluster.
The rest of this chapter is organized as follows. We present our functional spline mixed
model and prior specifications for Bayesian inference in section 3.2. Some prior components
are taken to be improper, and therefore, propriety of the posterior becomes a concern.
Section 3.3 establishes the propriety of the posterior distribution under some mild conditions.
Section 3.4 provides details of the Bayesian inference, particularly, performance measures of
clustering configuration for the comparison. At the same time, we also explore a new choice
of a more non-informative prior on the concentration parameter. Section 3.5 shows results from numerical studies, including simulated data as well as the real lung cancer data. Finally, we summarize our work and give some discussion of the proposed model and future research.
3.2 Spline Mixed Model for Clustering Multiple Curves
3.2.1 Model Specification
Suppose that we observe functions \{f_s(t)\}_{s=1}^{N} at t = t_1, t_2, \ldots, t_n. Let \xi = (\xi_1, \ldots, \xi_K) be the knot points for the spline basis functions, where K is the number of knots, \xi_1 > t_1 and \xi_K < t_n. Let \xi_0 = t_1 and \xi_{K+1} = t_n. The spline basis functions of order q are then given by

h_m(x) = x^{m-1}, \quad m = 1, \ldots, q,

and

h_{q+l}(x) = (x - \xi_l)_+^{q-1}, \quad l = 1, \ldots, K.

In practice, the most widely used orders are q = 1, 2, 4, where q = 1 corresponds to a constant function, q = 2 corresponds to a linear function and q = 4 corresponds to a cubic spline function. Then, we assume that f_s(t) can be approximated as follows for s = 1, 2, \ldots, N:

f_s(t) \approx \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t),

where h_m^{(s)}(t) depends on the s-th individual through its knot points \xi^{(s)} = (\xi_1^{(s)}, \ldots, \xi_{K_s}^{(s)}).
Let Y_{st} be the observed value of f_s(t) at time t for t = t_1, \ldots, t_n. Then, the observed values are modeled using the spline representation

Y_{st} = \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t) + e_{st},

where e_{st} is an error that represents variability from various sources. We decompose the error as e_{st} = u_s + \varepsilon_{st}, and consider u_s a random effect that captures variability between curves and \varepsilon_{st} an error that reflects variability across time for the s-th function. Thus the model can be written as

Y_{st} = \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t) + u_s + \varepsilon_{st}

with u_s \sim N(0, \tau^2) and \varepsilon_{st} \sim N(0, \sigma_s^2). Let Y_s = (Y_{st_1}, \ldots, Y_{st_n})^T, Y = (Y_1^T, \ldots, Y_N^T)^T, \varepsilon_s = (\varepsilon_{st_1}, \ldots, \varepsilon_{st_n})^T, \varepsilon = (\varepsilon_1^T, \ldots, \varepsilon_N^T)^T and u = (u_1, \ldots, u_N)^T. Then, we can write the model in matrix form for the s-th individual as

Y_s = X^{(s)} \beta^{(s)} + u_s 1_n + \varepsilon_s,

where X^{(s)} = (x_{ij})_{n \times (q+K_s)} with x_{ij} = h_j(t_i) for i = 1, \ldots, n and j = 1, \ldots, q+K_s, and \beta^{(s)} = (\beta_1^{(s)}, \ldots, \beta_{q+K_s}^{(s)})^T. For all s = 1, \ldots, N together, we have

Y = X\beta + u \otimes 1_n + \varepsilon, \qquad (3.1)

with X = \mathrm{diag}(X^{(1)}, \ldots, X^{(N)}) and \beta = (\beta^{(1)T}, \ldots, \beta^{(N)T})^T, where \otimes denotes the Kronecker product.
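To make the construction concrete, the design matrix X^{(s)} in (3.1) can be assembled directly from the truncated power basis defined above. The sketch below is an illustrative helper of our own, not code from this dissertation. The coefficient vectors used later in the simulation study have length K_r + 1, which corresponds to order q = 1; for that case we read (x - \xi_l)_+^0 as the indicator 1\{x > \xi_l\}, so the basis consists of step functions.

```python
import numpy as np

def spline_design(t, knots, q):
    """Truncated power basis of order q at time points t:
    h_m(x) = x^(m-1) for m = 1,...,q, then h_{q+l}(x) = (x - xi_l)_+^(q-1)
    for each knot xi_l; for q = 1 the knot columns are the step
    indicators 1{x > xi_l}.  Returns the n x (q + K) matrix X^(s)."""
    t = np.asarray(t, dtype=float)
    cols = [t ** (m - 1) for m in range(1, q + 1)]
    for xi in knots:
        cols.append((t > xi).astype(float) if q == 1
                    else np.clip(t - xi, 0.0, None) ** (q - 1))
    return np.column_stack(cols)

# One simulated curve from Y_s = X^(s) beta^(s) + u_s 1_n + eps_s,
# using the first cluster of the simulation study (n = 38, knots (8, 30))
rng = np.random.default_rng(0)
t = np.arange(1, 39)
X = spline_design(t, knots=[8, 30], q=1)
beta = np.array([0.0667, 0.0500, -0.2000])
u_s = rng.normal(0.0, np.sqrt(1 / 30))   # between-curve effect, tau^2 = 1/30
y = X @ beta + u_s + rng.normal(0.0, np.sqrt(1 / 30), size=t.size)
```

With q = 1 each coefficient \beta_m^{(s)} simply shifts the level of the curve after the corresponding knot, which is what the DP prior later clusters across sites.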
3.2.2 Derivation of the Functional Dirichlet Process and Other Priors
We are interested in clustering functions based on their shapes. Clustering based on the coefficients of basis functions has been proposed by other authors, in particular using coefficients of spline basis functions, as in [1] and [16]. However, in those approaches the locations and the number of knots are pre-determined before clustering is considered. Different groups of functions may need different numbers of knots and knot locations to capture the shapes of those functions, so models should take such flexibility into account. To model clustering
of N sites with respect to their knot-point locations and corresponding magnitudes, we
apply the Dirichlet Process (DP) priors to the functions fs(t) with an appropriate centering
measure so that the number of knots and locations of knots are random and estimated during
inferential procedures.
To utilize the DP prior, we consider the function \theta_s(t) = f_s(t) and let \Theta be the space of all possible \theta_s(t). A functional DP on the space of all distributions on \Theta has two hyperparameters that characterize it, written DP \equiv DP(\alpha_0 G_0): \alpha_0 is called the precision parameter and G_0 is the baseline (or centering) distribution on \Theta. It is well known that a randomly generated distribution F from DP(\alpha_0 G_0) is almost surely discrete and admits the representation [18]

F = \sum_{i=1}^{\infty} \omega_i \delta_{\theta_i}, \qquad (3.2)

where \delta_z denotes a point mass at z, \omega_1 = \eta_1 and \omega_i = \eta_i \prod_{k=1}^{i-1} (1 - \eta_k) for i = 2, 3, \ldots, with \eta_i i.i.d. Beta(1, \alpha_0) and \theta_1, \theta_2, \ldots i.i.d. from the distribution G_0.
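A truncated version of the representation (3.2) can be simulated directly. The helper below is our own sketch (not dissertation code): it draws the stick-breaking fractions \eta_i \sim Beta(1, \alpha_0), truncates the sum at a fixed number of atoms, and closes off the last stick so the weights sum to one.

```python
import numpy as np

def stick_breaking(alpha0, n_atoms, base_sampler, rng):
    """Truncated stick-breaking draw F = sum_i w_i * delta_{theta_i}
    from DP(alpha0 * G0): eta_i ~ Beta(1, alpha0), w_1 = eta_1,
    w_i = eta_i * prod_{k<i} (1 - eta_k)."""
    eta = rng.beta(1.0, alpha0, size=n_atoms)
    eta[-1] = 1.0  # close off the stick so the truncated weights sum to one
    w = eta * np.concatenate(([1.0], np.cumprod(1.0 - eta[:-1])))
    atoms = [base_sampler(rng) for _ in range(n_atoms)]
    return w, atoms
```

A small \alpha_0 pushes most of the mass onto the first few atoms, which is why \alpha_0 controls the number of distinct clusters in the sequel.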
To specify G_0 on \Theta, it is convenient to consider a hierarchical structure. The randomness of G_0 is specified by assuming randomness in K, the number of knots; \xi, the locations of the knots; \beta, the spline coefficients; and u, the random-effect component.

(1) First, we specify the distribution of K. We assume that each time interval between knot points has at least w units, and consider a truncated Poisson distribution for K. That is,

p(k) = P(K = k) = \frac{e^{-\lambda} \lambda^k / k!}{\sum_{l=0}^{k^*} e^{-\lambda} \lambda^l / l!} = \frac{\lambda^k / k!}{\sum_{l=0}^{k^*} \lambda^l / l!}

for k = 0, 1, \ldots, k^*, where k^* = [(n-1)/w - 1] to ensure n_0 \equiv n - 1 - (K+1)w > 0.

(2) Given K = k, we assume that the interval lengths (n_1, \ldots, n_{k+1}), where n_l \geq 0 and \sum_{l=1}^{k+1} n_l = n_0, satisfy

(n_1, n_2, \ldots, n_{k+1}) \sim \mathrm{Multinomial}\left(n_0, \frac{1}{k+1}, \ldots, \frac{1}{k+1}\right).

Then n_l + w becomes the length of the time interval between knot points for l = 1, \ldots, k+1. That is, \xi_l is defined recursively by \xi_0 = 1 and \xi_l = \xi_{l-1} + n_l + w for l = 1, \ldots, k.

(3) Given K = k and \xi = (\xi_1, \ldots, \xi_k), generate \beta_1, \ldots, \beta_{q+k} i.i.d. from a density \pi_0 on \mathbb{R}.

By steps (1)-(3), we set \theta(t) = f(t) = \sum_{m=1}^{q+K} \beta_m h_m(t) and \theta = (f(t_1), \ldots, f(t_n))^T = X\beta. From the hierarchical specification above, it follows that the infinitesimal measure for G_0 is given by

G_0(d\theta) = p(k) \left( \frac{\Gamma(n_0 + 1)}{\prod_{l=1}^{k+1} \Gamma(n_l + 1)} \left( \frac{1}{k+1} \right)^{n_0} \right) \prod_{m=1}^{q+k} \pi_0(\beta_m) \, d\beta_m. \qquad (3.3)

In our analysis, we take \pi_0(\beta) \propto 1.
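Steps (1) and (2) of this hierarchical specification can be sampled directly: draw K from the truncated Poisson, then draw the interval lengths from the multinomial and accumulate them into knot locations. The following is an illustrative sketch of ours under those assumptions (function name and arguments are not from the dissertation).

```python
import numpy as np

def sample_knot_config(lam, n, w, rng):
    """Draw K from Poisson(lam) truncated to {0,...,k*} with
    k* = floor((n-1)/w - 1), then draw interval lengths
    (n_1,...,n_{K+1}) ~ Multinomial(n0, uniform) with n0 = n-1-(K+1)w,
    and set the knots recursively: xi_0 = 1, xi_l = xi_{l-1} + n_l + w."""
    k_star = int((n - 1) / w - 1)
    ks = np.arange(k_star + 1)
    # log of lam^k / k! for each k, normalized over the truncated support
    logp = ks * np.log(lam) - np.array(
        [np.sum(np.log(np.arange(1, k + 1))) for k in ks])
    p = np.exp(logp - logp.max())
    p /= p.sum()
    K = int(rng.choice(ks, p=p))
    n0 = n - 1 - (K + 1) * w
    lengths = rng.multinomial(n0, np.full(K + 1, 1.0 / (K + 1)))
    xi = 1 + np.cumsum(lengths[:K] + w)   # knot locations xi_1, ..., xi_K
    return K, xi
```

With n = 38 and w = 7 this gives k* = 4, and consecutive knots are always at least w time points apart.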
Finally, to complete the hierarchical model, we specify priors for \alpha_0, \lambda, \tau^2 and \sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2): \pi_1(\lambda) = Ga(a_\lambda, b_\lambda), \pi_2(\tau^2) = IG(a_\tau, b_\tau) and \pi_3(\sigma_s^2) = IG(a_\sigma, b_\sigma), where Ga(a, b) denotes the Gamma distribution with density \pi(x) \propto x^{a-1} e^{-x/b}, and IG(a, b) has already been defined in Chapter 1. For the precision parameter \alpha_0, we also choose a Ga(a_\alpha, b_\alpha) prior in order to guarantee a simple form for the posterior sampling distributions. However, as we show in section 3.5 when performing the numerical analysis, this prior specification may be overly subjective under some circumstances, so we discuss an alternative choice of prior for \alpha_0 there. In general, the hyper-parameters of the Gamma and inverse Gamma distributions are chosen to give large variances so that the impact of the prior input is minimal. The specific choices of the hyper-parameters for the various Gamma and inverse Gamma distributions are given in section 3.5.
As mentioned in section 3.1, it is notable that Dass et al. (2014) proposed a mixed effects model with a similar DP prior to identify change points in multiple curves. Although both our model and theirs feature a nonparametric Bayesian approach, we feel it worthwhile to point out the major differences between the models and their objectives. In the approach of Dass et al. (2014), piecewise disconnected linear segments were used to model the change points of functional data; they appear in the model as a set of slopes and intercepts, and such a design is not intended for clustering. Our model, on the other hand, features a B-spline basis designed to capture and cluster the shape and trend of continuous curves. Even if the order of the B-splines is chosen to be 2, which corresponds to connected linear functions, the interpretation of the coefficients and other estimated parameters will clearly differ because of the difference in the objectives and nature of the two models.
3.3 Theory of Bayesian Inference
This section presents the theoretical foundations of the nonparametric Bayesian inference for our clustering spline mixed model, including the propriety of the posterior distribution and the Gibbs sampling scheme for generating posterior samples.
3.3.1 Propriety of the Posterior Distribution
Let S \equiv \{1, 2, \ldots, N\} be the set of all N sites. Denote a partition of S by c = \cup_{r=1}^{d} C_r, and define \mathcal{P} to be the set of all partitions of S. We consider probability distributions \pi(c) on \mathcal{P}:

\pi(c) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)} \, \alpha_0^d \prod_{r=1}^{d} (|C_r| - 1)!, \qquad (3.4)

where |C_r| is the number of elements in C_r.
Theorem 4. Let \sigma^2 = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2), let \theta = (\theta_1, \theta_2, \ldots, \theta_N) be the collection of functions on the N sites with \theta_s = (\beta^{(s)}, K_s, T_s), and let \phi = (u, \tau^2) with support \Phi = \mathbb{R}^N \times (0, \infty). Given model (3.1) and the specification of priors in section 3.2.2, if K_s < \min\{a_\sigma + (n-1)/2, \, [(n-1)/w]\} for all s and some integer w > 0, then the posterior distribution is proper.

The proof of the theorem is provided in the Appendix.
3.3.2 Gibbs Updating Steps
Let S \equiv \{1, 2, \ldots, N\} be the set of all N individuals. Denote a partition of S by c = \cup_{r=1}^{d} C_r, and define \mathcal{P} to be the set of all partitions of S. Note that a randomly generated distribution F from DP(\alpha_0 G_0) admits the Sethuraman representation (Sethuraman, 1994). By integrating out the random measure F, the equivalence between the DP prior (for any general space \Theta) and the distribution it induces on \mathcal{P} is well known. Moreover, the probability distribution on \mathcal{P} has the explicit form

\pi(c) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)} \, \alpha_0^d \prod_{r=1}^{d} (|C_r| - 1)!, \qquad (3.5)

where |C_r| is the number of elements in C_r. Define \theta = (\theta_1, \ldots, \theta_N) and \theta_{-s} = (\theta_1, \theta_2, \ldots, \theta_{s-1}, \theta_{s+1}, \ldots, \theta_N), the collection of all \theta-components except \theta_s. Let c_{-s} be the partition of S \setminus \{s\} determined by \theta_{-s}; the identical components of \theta_{-s} uniquely determine c_{-s}. Suppose that c_{-s} = \cup_{r=1}^{N^*} E_r. The conditional distribution based on (3.5) is \pi(c \,|\, c_{-s}) = \alpha_0 / (\alpha_0 + N - 1) if c = \{s\} \cup c_{-s}, and \pi(c \,|\, c_{-s}) = N_r / (\alpha_0 + N - 1) if c = (E_1, \ldots, E_r \cup \{s\}, \ldots, E_{N^*}), where N_r denotes the number of elements in E_r.
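The conditional probabilities \pi(c \,|\, c_{-s}) are the familiar Chinese-restaurant-process weights: proportional to N_r for joining an existing cluster E_r and to \alpha_0 for opening a new one, with the common factor 1/(\alpha_0 + N - 1) canceling. A small helper of our own making this explicit:

```python
import numpy as np

def crp_weights(cluster_sizes, alpha0):
    """Normalized conditional probabilities pi(c | c_{-s}) induced by the DP:
    index 0 is the probability of opening a new cluster (weight alpha0),
    index r >= 1 the probability of joining E_r (weight N_r)."""
    w = np.array([alpha0] + list(cluster_sizes), dtype=float)
    return w / w.sum()
```

For example, with existing clusters of sizes 3 and 5 and \alpha_0 = 2, the site opens a new cluster with probability 0.2 and joins the larger cluster with probability 0.5.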
(1) Update \theta_s: The posterior of \theta_s conditional on \theta_{-s} is given by

\pi(\theta_s \,|\, \theta_{-s}, Y, u) = \frac{q_{s,0} \, G_0^*(d\theta_s) + \sum_{r=1}^{N^*} q_{s,r} \, \delta_{\theta^{(r)}}}{q_{s,0} + \sum_{r=1}^{N^*} q_{s,r}}, \qquad (3.6)

where

q_{s,0} = \left[ \int_{\Theta} \pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s) \right] \pi(c \,|\, c_{-s}) \quad \text{and} \quad q_{s,r} = \pi(Y_s \,|\, \theta^{(r)}, u_s) \, \pi(c \,|\, c_{-s}), \qquad (3.7)

respectively, for c = \{s\} \cup c_{-s} and c = (E_1, E_2, \ldots, E_r \cup \{s\}, \ldots, E_{N^*}) with r = 1, 2, \ldots, N^*. Here

G_0^*(d\theta_s) = \frac{\pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s)}{\int_{\Theta} \pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s)} \qquad (3.8)

is the posterior distribution of \theta_s given that a new cluster is formed by the s-th individual. The explicit expression for q_{s,0}, after canceling 1/(\alpha_0 + N - 1), is

q_{s,0} = \alpha_0 \sum_{k=0}^{k^*} \sum_{(n_1, \ldots, n_{k+1})} \exp\left[ H(n_1, \ldots, n_{k+1}) \right] \frac{n_0!}{n_1! \cdots n_{k+1}!} \left( \frac{1}{k+1} \right)^{n_0} p(k),

where

H(n_1, \ldots, n_{k+1}) = \log \int_{\mathbb{R}^{q+K_s}} \int_0^{\infty} \pi(Y_s \,|\, \beta^{(s)}, u_s, \sigma_s^2) \, \pi(\sigma_s^2) \, d\sigma_s^2 \, d\beta^{(s)}
= -\frac{n - q - K_s}{2} \log(2\pi) - \log \Gamma(a_\sigma) - a_\sigma \log b_\sigma + \log \Gamma\left( \frac{2a_\sigma + n - q - K_s}{2} \right)
\quad - \left( \frac{2a_\sigma + n - q - K_s}{2} \right) \log\left( b_\sigma^{-1} + \tfrac{1}{2} Y_{s*}^T P^{(s)} Y_{s*} \right) - \frac{1}{2} \log \left| X^{(s)T} X^{(s)} \right|

with Y_{s*} = Y_s - u_s 1_n and P^{(s)} = I - X^{(s)} (X^{(s)T} X^{(s)})^{-1} X^{(s)T}. Similarly, q_{s,r} is given by

q_{s,r} = N_r \, \frac{1}{(2\pi)^{n/2}} \, \frac{\Gamma(a_\sigma + n/2)}{\Gamma(a_\sigma)} \, \frac{b_*^{(a_\sigma + n/2)}}{b_\sigma^{a_\sigma}},

where b_*^{-1} = b_\sigma^{-1} + \tfrac{1}{2} (Y_{s*} - X_r \beta_r)^T (Y_{s*} - X_r \beta_r) and X_r \beta_r \equiv \theta^{(r)} for the cluster C_r. Note that \theta_{s'} \equiv \theta^{(r)} if s' \in C_r.
Expression (3.6) explicitly demonstrates the clustering capability of the functional DP prior. The current value of \theta_s can be set to one of the distinct \theta^{(r)} functions with probability \sum_{r=1}^{N^*} q_{s,r} / (q_{s,0} + \sum_{r=1}^{N^*} q_{s,r}); this positive probability is what allows sites to cluster in terms of \theta_s, as mentioned. Expression (3.6) also allows a new \theta_s to be generated from the posterior distribution G_0^*.
(2) Update \sigma^2: The update of \sigma_s^2 is carried out once \theta_s is obtained via (3.6). Regardless of whether \theta_s is a new value or an existing \theta^{(r)}, for each individual s = 1, \ldots, N, the conditional posterior distribution of \sigma_s^2 given the other parameters is

\pi(\sigma_s^2 \,|\, \cdots) = IG(a, b), \quad a = a_\sigma + \frac{n}{2}, \quad b^{-1} = b_\sigma^{-1} + \tfrac{1}{2} (Y_{s*} - X^{(s)} \beta^{(s)})^T (Y_{s*} - X^{(s)} \beta^{(s)}).
Note that \theta uniquely determines the collection of parameters (\beta, K, T), where \beta is the set of spline coefficients, K = (K_1, \ldots, K_N) is the set of numbers of knots and T = (\xi^{(1)}, \ldots, \xi^{(N)}) is the set of knot locations. Indeed, since \theta contains several identical components, the corresponding components of (\beta, K, T) are also identical to each other. In what follows, we present the updating steps for the d distinct components of (\beta, K, T), namely (\beta_r, K_r, T_r) for r = 1, 2, \ldots, d. Let \cup_{r=1}^{d} C_r be the partition of \{1, 2, \ldots, N\} at the current update of the Gibbs sampler (thus, d is the number of distinct clusters).
(3) Update (\beta_r, K_r, T_r): We first update K_r from its posterior marginal, then update T_r \,|\, K_r, and finally \beta_r \,|\, T_r, K_r from their respective conditional distributions. The posterior marginal probability of K_r = k is proportional to

p(k) \sum_{(n_1, \ldots, n_{k+1})} \nu(n_1, \ldots, n_{k+1}), \qquad (3.9)

with

\nu(n_1, \ldots, n_{k+1}) = \exp\{H(n_1, \ldots, n_{k+1})\} \, \frac{\Gamma(n_0 + 1)}{\prod_{l=1}^{k+1} \Gamma(n_l + 1)} \left( \frac{1}{k+1} \right)^{n_0}, \qquad (3.10)

where

H(n_1, \ldots, n_{k+1}) = \log \int_{\mathbb{R}^{q+k}} \prod_{s \in C_r} \pi(Y_s \,|\, X^{(s)}, \beta, \sigma_s^2, u_s) \, d\beta
= -\frac{n|C_r| - q - k}{2} \log(2\pi) - \frac{n}{2} \sum_{s \in C_r} \log \sigma_s^2 - \frac{1}{2} \log \left| \sum_{s \in C_r} \frac{1}{\sigma_s^2} X^{(s)T} X^{(s)} \right| - \frac{1}{2} \sum_{s \in C_r} \sigma_s^{-2} Y_{s*}^T Y_{s*}
\quad + \frac{1}{2} \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} Y_{s*} \right)^T \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} X^{(s)} \right)^{-1} \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} Y_{s*} \right).

Note that X^{(s)} \equiv X_r for all s \in C_r.
Note that X(s) ≡ Xr for all s ∈ Cr.
The summation in (3.9) is over all non-negative integers n_1, n_2, \ldots, n_{k+1} such that \sum_{l=1}^{k+1} n_l = n_0 \equiv n - 1 - (k+1)w. Obtaining the posterior probability of K_r = k requires evaluating (3.10) for each value of k \geq 0. This could potentially require a significant amount of computational time and drastically reduce the efficiency of the Gibbs chain, but it did not in our application, thanks to the closed-form expressions for H under the flat prior \pi_0.
To update T_r given K_r = k, note that this is equivalent to updating (n_1, \ldots, n_{k+1}) with probabilities p(n_1, \ldots, n_{k+1}) \propto \nu(n_1, n_2, \ldots, n_{k+1}). This is carried out by exhaustively listing all such combinations and numerically computing the corresponding probabilities. The update of \beta_r given T_r and K_r = k is based on the conditional distribution N(\mu_{\beta_r}, \Sigma_{\beta_r}) with

\Sigma_{\beta_r} = \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} X^{(s)} \right)^{-1} = \left( X_r^T X_r \right)^{-1} \left( \sum_{s \in C_r} \sigma_s^{-2} \right)^{-1} \qquad (3.11)

and

\mu_{\beta_r} = \Sigma_{\beta_r} \, X_r^T \sum_{s \in C_r} \sigma_s^{-2} Y_{s*}, \qquad (3.12)

with the q+k components of \beta_r generated independently of each other from their respective component densities.
(4) Update \lambda: \lambda is updated using

\pi(\lambda \,|\, \cdots) \propto \pi_1(\lambda) \prod_{s=1}^{N} p(K_s) \propto \frac{\lambda^{a_\lambda^* - 1} e^{-\lambda / b_\lambda}}{\left( \sum_{l=0}^{k^*} \lambda^l / l! \right)^N}, \qquad (3.13)

with a_\lambda^* = a_\lambda + \sum_{r=1}^{d} |C_r| k_r, where k_r is the number of knot points corresponding to \theta^{(r)} in cluster C_r.
(5) Update \alpha_0: To update \alpha_0 under the prior \pi(\alpha_0) \propto \alpha_0^{a_\alpha - 1} e^{-\alpha_0 / b_\alpha}, we utilize the two-step procedure of Escobar and West [6]. At the b-th iteration: (1) draw \kappa from the Beta distribution Be(\alpha_0^{(b-1)} + 1, N); and (2) draw \alpha_0^{(b)} from the mixture of two Gamma distributions

\pi_\kappa \, Ga\left( a_\alpha + N^*, \, (1/b_\alpha - \log(\kappa))^{-1} \right) + (1 - \pi_\kappa) \, Ga\left( a_\alpha + N^* - 1, \, (1/b_\alpha - \log(\kappa))^{-1} \right),

where N^* is the current number of clusters and the membership probability satisfies

\pi_\kappa = \frac{a_\alpha + N^* - 1}{N (1/b_\alpha - \log(\kappa))}.
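The two-step update can be sketched as follows. Two reading choices in this sketch are ours: we treat the displayed ratio as the mixing odds \pi_\kappa / (1 - \pi_\kappa), which is the form given in Escobar and West (1995), and we use the scale parameterization of the Gamma draw, consistent with the rate 1/b_\alpha - \log(\kappa). Function and argument names are also ours.

```python
import numpy as np

def update_alpha0(alpha0, n_clusters, N, a_alpha, b_alpha, rng):
    """Escobar-West two-step update for the DP precision alpha0 under a
    Gamma(a_alpha, scale=b_alpha) prior."""
    kappa = rng.beta(alpha0 + 1.0, N)
    rate = 1.0 / b_alpha - np.log(kappa)       # rate of both Gamma components
    odds = (a_alpha + n_clusters - 1.0) / (N * rate)
    pi_kappa = odds / (1.0 + odds)             # displayed ratio read as odds
    shape = (a_alpha + n_clusters if rng.random() < pi_kappa
             else a_alpha + n_clusters - 1.0)
    return rng.gamma(shape, 1.0 / rate)        # scale = 1 / rate
```

Since \kappa \in (0, 1), the rate 1/b_\alpha - \log(\kappa) is always positive, so the draw is well defined.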
(6) Update u: We have

\pi(u \,|\, \cdots) \propto \pi(Y \,|\, u, \cdots) \, \pi(u \,|\, \tau^2)
\propto \exp\left\{ -\frac{1}{2} \left[ \sum_{s=1}^{N} \left( u_s^2 \, 1_n^T 1_n - 2 u_s 1_n^T W_s \right) / \sigma_s^2 + \frac{(u - u_0)^T (u - u_0)}{\tau^2} \right] \right\},

where W_s = Y_s - X^{(s)} \beta^{(s)}. Notice that 1_n^T 1_n = n. The conditional posterior distribution of u is N(\mu_u, \Sigma_u) with

\Sigma_u = \left( n \, \mathrm{Diag}(\sigma_1^{-2}, \ldots, \sigma_N^{-2}) + \tau^{-2} I_N \right)^{-1} \quad \text{and} \quad \mu_u = \Sigma_u \left( W + \tau^{-2} I_N u_0 \right),

where W = (1_n^T W_1 / \sigma_1^2, \ldots, 1_n^T W_N / \sigma_N^2)^T.
(7) Update \tau^2: The posterior distribution of \tau^2 is given by

\pi(\tau^2 \,|\, \cdots) \propto \pi(u \,|\, \tau^2) \, \pi(\tau^2) \sim IG(a_\tau^*, b_\tau^*)

with a_\tau^* = \frac{N}{2} + a_\tau and b_\tau^* = \left( \frac{1}{b_\tau} + \frac{1}{2} u^T u \right)^{-1}.
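Step (7) is a routine conjugate draw. A minimal sketch of ours, realizing the inverse Gamma draw as the reciprocal of a Gamma draw with shape a_\tau^* and scale b_\tau^*:

```python
import numpy as np

def update_tau2(u, a_tau, b_tau, rng):
    """Conjugate draw tau^2 | u ~ IG(N/2 + a_tau, (1/b_tau + u'u/2)^(-1)):
    if 1/tau^2 ~ Gamma(a*, scale=b*), then tau^2 ~ IG(a*, b*)."""
    u = np.asarray(u, dtype=float)
    a_star = a_tau + u.size / 2.0
    b_star_inv = 1.0 / b_tau + 0.5 * (u @ u)
    return 1.0 / rng.gamma(a_star, 1.0 / b_star_inv)
```

With a nearly flat prior (large b_\tau), the posterior mean of \tau^2 is close to u^T u / (N/2 + a_\tau - 1), so the draw tracks the empirical variability of the random effects.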
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clusterings of Sites
Given the validity and details of the Gibbs updating scheme for all the parameters based on the prior components elicited in section 3.2.2, we are able to perform Bayesian inference on their posterior samples, especially the estimated cluster information \theta = (\theta_1, \ldots, \theta_N). Suppose that we have M Gibbs samples \{\theta^{(m)}\}_{m=1}^{M} for \theta after convergence is established. It can be challenging to obtain results for the clustering configuration directly from these samples: because the clustering information (the number of clusters and the specific assignment of each site) changes at each Gibbs iteration, classical posterior analysis via summary statistics is neither straightforward nor easy to interpret. We therefore introduce a 2 \times 2 cross-classification table to measure the deviation between two cluster configurations, say C and C^*. The table has entries \{n_{ij}\}, 1 \leq i, j \leq 2, where n_{11} is the number of pairs of sites, out of all n_{++} = N(N-1)/2 pairs, that are in the same cluster in C as well as in C^*, while n_{22} is the number of pairs of sites that are in the same cluster in neither C nor C^*. n_{12} is the number of pairs that are in the same cluster in C but not in C^*. Similarly, n_{21} is the number of pairs that are not in the same cluster in C but are in the same cluster in C^*.
If C and C^* have identical configurations, we have n_{11} = n_{+1} and n_{22} = n_{+2}. Here n_{+1} is the number of pairs that are in the same cluster in C^*, whereas n_{+2} is the number of pairs that are not in the same cluster in C^*; thus n_{+1} and n_{+2} depend only on C^*, not on C. We then consider two measures, the sensitivity S_1 = n_{11}/n_{+1} and the specificity S_2 = n_{22}/n_{+2}. The interpretation of S_1 is that it is the proportion of pairs that are also in the same cluster in C given that they are in the same cluster in C^*. Similarly, S_2 is the proportion of pairs that are also not in the same cluster in C given that they are not in the same cluster in C^*. Clearly S_1 and S_2 take values between 0 and 1, with the ideal case (S_1, S_2) = (1, 1). Also note that when C refines C^*, that is, every cluster of C is contained in some cluster of C^*, the number of clusters in C is at least that of C^*; in this case n_{12} = 0, which yields S_2 = 1 but S_1 \leq 1. Conversely, when C^* refines C, we have S_1 = 1 and S_2 \leq 1. Thus, deviations from the point (1, 1), or from the lines y = 1 and x = 1, indicate the nature of the deviations of the clustering C from the clustering C^*.
For simulation studies, (S_1, S_2) can be used to measure deviations between a Gibbs cluster configuration and the true one. Suppose that from the Gibbs output we have cluster configurations \{C_m\}_{m=1}^{M}. For the m-th Gibbs cluster configuration, we calculate (S_1(m), S_2(m)) with C = C_m and C^* = C_0, the true cluster configuration, which is known in a simulation. A plot of (S_1(m), S_2(m)) indicates how dispersed the Gibbs clusterings are with respect to the true configuration: points concentrated close to (1, 1) indicate that the deviations of C_m from C_0 are small. Points (S_1(m), S_2(m)) far from (1, 1) indicate two different types of deviation based on their proximity to the lines x = 1 or y = 1: proximity to x = 1 indicates that, among the pairs that are in the same cluster in C_0, more pairs are also in the same cluster in C_m, whereas proximity to y = 1 indicates that, among the pairs not in the same cluster in C_0, more pairs are also not in the same cluster in C_m. In the 2 \times 2 cross-classification table there are two degrees of freedom: the quantities n_{+1} and n_{+2} are fixed by the true cluster configuration, so two free parameters, say n_{11} and n_{22}, determine all entries of the table completely. Thus, the scatter plot of \{(S_1(m), S_2(m)) : m = 1, 2, \ldots, M\} gives a complete picture of the variability. As a numerical measure of variability, we consider
SS = \frac{1}{M} \sum_{m=1}^{M} \left( 2 - S_1(m) - S_2(m) \right),

with smaller SS values indicating better performance.
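The quantities (S_1, S_2) and SS can be computed directly from label vectors by scanning all N(N-1)/2 site pairs; an illustrative helper of ours (names are not from the dissertation):

```python
import numpy as np
from itertools import combinations

def cluster_agreement(labels, true_labels):
    """Pairwise 2x2 cross-classification of two cluster configurations:
    returns (S1, S2) = (n11/n+1, n22/n+2) over all N(N-1)/2 pairs."""
    n11 = n12 = n21 = n22 = 0
    for i, j in combinations(range(len(labels)), 2):
        same_c = labels[i] == labels[j]
        same_true = true_labels[i] == true_labels[j]
        if same_c and same_true:
            n11 += 1
        elif same_c:
            n12 += 1
        elif same_true:
            n21 += 1
        else:
            n22 += 1
    return n11 / (n11 + n21), n22 / (n12 + n22)

def ss_measure(configs, true_labels):
    """SS = (1/M) * sum_m (2 - S1(m) - S2(m)); smaller is better."""
    devs = [2.0 - sum(cluster_agreement(c, true_labels)) for c in configs]
    return float(np.mean(devs))
```

For example, comparing the labeling [0, 1, 1, 1] against the truth [0, 0, 1, 1] gives (S_1, S_2) = (0.5, 0.5), and identical configurations give (1, 1) with SS = 0.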
3.5 Numerical Analysis: Simulation Study and Real Data Example
In this section, we apply the proposed method to numerical studies, including simulations under various settings as well as a real-data analysis of the lung cancer data in the U.S. We simulate data at n = 38 time points for N = 49 sites based on the model. We partition the sites into 6 groups (clusters) C_r, r = 1, 2, \ldots, 6. Within each cluster, the curves share common information in the form of (\beta^{(r)}, K_r, T^{(r)}). As stated in the previous section, we collect the posterior samples via an MCMC procedure with a Gibbs sampling scheme at each iteration. The true parameters are specified as follows.

r | |C_r| | C_r (Site #) | T^{(r)} | \beta^{(r)}
1 | 9 | 2, 4, 11, 25, 27, 36, 43, 46, 49 | (8, 30) | (0.0667, 0.0500, -0.2000)
2 | 9 | 12, 13, 14, 21, 22, 26, 33, 40, 48 | (7, 11, 20) | (-0.0667, 0.0300, 0.1667, -0.3333)
3 | 8 | 3, 5, 15, 17, 24, 30, 35, 42 | (13, 25) | (-0.1333, -0.1333, 0.1667)
4 | 7 | 1, 9, 10, 23, 32, 39, 41 | (17, 21) | (-0.0333, -0.0167, 0.1667)
5 | 7 | 6, 18, 20, 28, 31, 38, 44 | (18, 20) | (-0.0667, 0.0200, -0.1333)
6 | 9 | 7, 8, 16, 19, 29, 34, 37, 45, 47 | (18, 20) | (0.0167, 0.0167, 0.3333)

For the true values of the between-curve variability parameter \tau^2 and the within-curve variances \sigma_s^2, we consider the following settings: \tau^2 = 1/30 or 2/30, and \sigma_s^2 \sim U(1/30, 2/30) or U(2/30, 4/30). These specifications lead to 4 scenarios.
3.5.1 An Alternative Less-informative Prior Choice for α0
Recall that when specifying the priors for the parameters of our model in section 3.2.2, we raised a question about the choice of prior for the precision parameter \alpha_0, which directly controls the tendency to create new clusters, that is, to increase the number of clusters. Under a Gamma prior for \alpha_0, the corresponding posterior can conveniently be sampled from a mixture of two Gamma distributions [6]. However, in practice, to keep the splitting of existing clusters within a reasonable range, we may need quite small values of \alpha_0, which leads to a highly subjective Gamma prior. For instance, in our simulation we would require \alpha_0 \sim Ga(a_\alpha, b_\alpha) with E(\alpha_0) = a_\alpha b_\alpha \approx 10^{-8} and Var(\alpha_0) \approx 10^2, hence an extremely small a_\alpha and a large b_\alpha. To resolve this issue, we introduce another prior, \alpha_0 \sim LN(\mu_\alpha, \sigma_\alpha^2), where LN(\mu, \sigma^2) denotes the log-normal distribution with mean parameter \mu and variance parameter \sigma^2. The intuition behind this re-parameterization is that it is more appropriate to put a relatively non-informative Gaussian prior on the log scale: \log(\alpha_0) \sim N(\mu_\alpha, \sigma_\alpha^2). For our simulation we use \mu_\alpha = -40 and \sigma_\alpha = 7. Note that the mean and variance of \alpha_0 are then E(\alpha_0) = e^{\mu_\alpha + \sigma_\alpha^2/2} \approx 1.86 \times 10^{-7} and Var(\alpha_0) = (e^{\sigma_\alpha^2} - 1) e^{2\mu_\alpha + \sigma_\alpha^2} \approx 6.57 \times 10^{7}. Hence the log-normal prior for \alpha_0 is much less subjective than the original Gamma prior.

To complete the posterior sampling scheme under \pi(\alpha_0) = LN(\mu_\alpha, \sigma_\alpha^2), we simply apply a grid-search method over (\mu_\alpha - 3\sigma_\alpha, \mu_\alpha + 3\sigma_\alpha), which covers over 99% of the prior mass of \log(\alpha_0).
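One way to realize this grid search is to evaluate the unnormalized log posterior of \log(\alpha_0) on an equally spaced grid and sample from the normalized weights. In the sketch below (our own), `loglik` is a placeholder for the \alpha_0-dependent part of the full conditional, e.g. the \alpha_0 factors of \pi(c); it is not a function from the dissertation.

```python
import numpy as np

def sample_log_alpha0(loglik, mu, sigma, rng, n_grid=200):
    """Grid-based draw of alpha0 under log(alpha0) ~ N(mu, sigma^2),
    restricted to (mu - 3*sigma, mu + 3*sigma), about 99.7% of the
    prior mass of log(alpha0)."""
    grid = np.linspace(mu - 3 * sigma, mu + 3 * sigma, n_grid)
    logpost = (np.array([loglik(np.exp(g)) for g in grid])
               - 0.5 * ((grid - mu) / sigma) ** 2)
    p = np.exp(logpost - logpost.max())  # stabilize before normalizing
    p /= p.sum()
    return np.exp(rng.choice(grid, p=p))
```

Subtracting the maximum before exponentiating keeps the weights numerically stable even though \alpha_0 itself lives on the scale of e^{-40}.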
3.5.2 Simulations and Comparison with an Existing Approach
We generate 10 sets of data under each scenario specified at the beginning of section 3.5 and run two MCMC chains for 5,000 iterations each to ensure convergence. For comparison, we also apply the functional clustering methodology proposed by James and Sugar (2003) to the same simulated datasets. The results are listed in Tables 3.1 and 3.2.

\tau^2 | \sigma^2 | S_1 (Sensitivity) | S_2 (Specificity) | SS (Overall) | Med. # | Mode #
1/30 | U(1/30, 2/30) | 0.8813 | 0.9844 | 0.1343 | 7 | 6
1/30 | U(2/30, 4/30) | 0.6466 | 0.9827 | 0.3707 | 6.5 | 6
2/30 | U(1/30, 2/30) | 0.8898 | 0.9843 | 0.1260 | 6 | 6
2/30 | U(2/30, 4/30) | 0.6443 | 0.9827 | 0.3730 | 6.5 | 6

Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach

\tau^2 | \sigma^2 | S_1 (Sensitivity) | S_2 (Specificity) | SS (Overall) | True #
1/30 | U(1/30, 2/30) | 0.7107 | 0.9066 | 0.3827 | 6
1/30 | U(2/30, 4/30) | 0.7017 | 0.9028 | 0.3955 | 6
2/30 | U(1/30, 2/30) | 0.6792 | 0.9137 | 0.4071 | 6
2/30 | U(2/30, 4/30) | 0.6978 | 0.9152 | 0.3870 | 6

Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar’s Approach
From the results of our approach in Table 3.1, we obtain excellent performance in specificity (S_2) as well as in the median and mode of the estimated number of clusters. Meanwhile, the sensitivity (S_1) is mainly affected by relatively large \sigma_s^2, the variability of individual curves within a cluster. On the other hand, since James and Sugar's model requires a pre-specified number of clusters, as introduced in section 3.1, we simply plug in the true number of clusters, 6, when applying their model, to avoid an extra selection step. We also point out that for their approach the diagnostic measures (specificity, sensitivity, etc.) are calculated from the predicted cluster memberships, rather than by averaging over posterior samples as in our case. The comparison makes clear that our method leads to better specificity (S_2) in general and higher sensitivity (S_1) for smaller \sigma_s^2, although their approach appears slightly more robust under large individual variability.
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S.
Besides the simulation study given above, we apply our proposed model to the lung cancer mortality rate curves from 48 states and Washington, D.C. In the data analysis, the minimum number of time points in each interval is set to w = 7, which leads to a maximum number of knot points of k^* = 4. The hyper-parameters are then specified as follows. The shape and scale parameters of the inverse Gamma priors on \sigma_s^2 and \tau^2 are a_\sigma = a_\tau = 2 and b_\sigma = b_\tau = 100, respectively, so that the prior variances are large enough to be considered effectively infinite. The shape and scale parameters of the Gamma prior on \lambda are a_\lambda = 0.004 and b_\lambda = 500, respectively. For the prior on the precision parameter \alpha_0, as discussed in section 3.5.1, we use the log-normal distribution with \mu_\alpha = -100 and \sigma_\alpha = 11. Recall that the precision parameter \alpha_0 controls the number of clusters while \lambda controls the number of knot points.
We run two MCMC chains with Gibbs sampling for 20,000 iterations each. Convergence is attained after 15,000 iterations according to the Gelman-Rubin diagnostic introduced in Chapter 1, section 1.4.1. We collect the posterior sample by taking every fifth draw from the last 5,000 iterations of both chains; this thinning provides a better-mixed posterior sample with a total size of 2,000. The posterior distribution of the number of clusters of states is shown in Table 3.3.

P(d = 2) | P(d = 3) | P(d = 4) | P(d = 5) | P(d = 6) | P(d = 7) | P(d = 8)
0.093 | 0.217 | 0.298 | 0.236 | 0.110 | 0.038 | 0.008

Table 3.3: The Distribution of the Number of Clusters d
It is clear that d = 4 gives the estimated number of clusters with the highest probability. To provide a cluster configuration of all the states, we apply an agglomerative clustering algorithm with the dissimilarity distance defined as

\mathrm{dist}(s_1, s_2) = 1 - \frac{1}{M} \sum_{m=1}^{M} \mathrm{dist}_m(s_1, s_2),

where \mathrm{dist}_m(s_1, s_2) = 1 if s_1 and s_2 fall into the same cluster in the m-th iteration, and \mathrm{dist}_m(s_1, s_2) = 0 otherwise. It is natural to set the threshold for the maximum number of clusters in the algorithm to d, the posterior mode of the number of clusters. The main algorithm
can be summarized as follows.
1. Create an individual cluster for each site.
2. Among all current clusters, select the two clusters whose elements have the smallest dissimilarity distance.
3. Merge the two selected clusters into a new cluster, replacing the two old ones.
4. Repeat steps 2 and 3 until the desired threshold for the number of clusters is reached.
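The dissimilarity matrix and the merging loop above can be sketched as follows. The text does not specify the linkage used once clusters contain several sites, so we assume single linkage (the minimum pairwise dissimilarity between members); function names are ours.

```python
import numpy as np

def coclustering_dissimilarity(configs):
    """dist(s1, s2) = 1 - (1/M) * #{iterations m with s1, s2 co-clustered},
    from an M x N array of posterior cluster labels."""
    configs = np.asarray(configs)
    M, N = configs.shape
    same = np.zeros((N, N))
    for c in configs:
        same += (c[:, None] == c[None, :])
    return 1.0 - same / M

def agglomerate(dist, d_max):
    """Agglomerative merging under single linkage: repeatedly merge the
    pair of clusters with the smallest between-cluster dissimilarity
    until only d_max clusters remain."""
    clusters = [{i} for i in range(dist.shape[0])]
    while len(clusters) > d_max:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

Setting d_max to the posterior mode of d (here d = 4) reproduces the thresholding described in the text.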
The estimated cluster configuration is shown in Table 3.4, along with a dendrogram in Figure 3.1 that illustrates the progress of the agglomerative clustering algorithm introduced above. Moreover, Figure 3.2 shows the data from individual sites grouped by cluster configuration, while Figure 3.3 shows the averaged curve (black line) over all sites within each cluster.

Cluster | # of Sites | Cluster Details (State)
(a) | 8 | CO, LA, MA, ME, OH, OR, PA, WY
(b) | 21 | AL, AR, GA, IA, ID, IN, KS, KY, MN, MS, MO, MT, NC, ND, NE, OK, SC, SD, TN, WI, WV
(c) | 7 | CA, DC, FL, MD, NJ, NY, UT
(d) | 13 | AZ, CT, DE, IL, MI, NH, NM, NV, RI, TX, VA, VT, WA

Table 3.4: Estimated Cluster Configuration
Cluster (a) has one knot point, around year 1991; the log-scaled rates for these states maintained a steady trend after that. Cluster (b) also has one knot point, but unlike cluster (a) it occurred a little earlier, around year 1989. Cluster (c) has the same knot point at year 1991 as cluster (a) but demonstrates a clear decreasing trend after that. As for cluster (d), it represents states with two potential knot points, detected around years 1982 and 1992: the log rates of these states increased more slowly after 1982 and dropped after 1992.
3.6 Conclusion and Discussion
In this chapter, we proposed a mixed spline model that performs concurrent clustering of multiple curves based on their patterns and trends. The clustering is induced by a Dirichlet process prior on the space of step functions over the time points. The model was verified by a series of simulations and then applied to age-adjusted cancer mortality rates for all the states in the U.S. to find state-wise clusters with similar trends. Given the results of the real data analysis, our model provides a reasonable and meaningful cluster configuration and captures the group-wise features of the curves. As a potential further investigation, one can extend the piecewise linear model (corresponding to q = 2) to higher-order polynomial models, such as cubic spline functions, to summarize more information about the patterns of the curves; the criteria and interpretation of the clustering may then have to be re-established.
Figure 3.1: Dendrogram for the Clustering Estimation
Figure 3.2: Estimated Cluster Configuration
Figure 3.3: Overall Trend by Clusters
APPENDICES
Appendix A
Proof of Theorems in Chapter 1
Proof of Theorem 1. Note that the likelihood function given D, \Sigma_0, \beta, \gamma is

l(y; D, \Sigma_0, \beta, \gamma) = \frac{1}{(2\pi)^{\frac{1}{2}Mdn} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(y - X\beta - Z\gamma)' \Sigma^{-1} (y - X\beta - Z\gamma)} \cdot \frac{1}{(2\pi)^{\frac{1}{2}Mdk} |G|^{\frac{1}{2}}} e^{-\frac{1}{2} \gamma' G^{-1} \gamma},

where \Sigma = I_M \otimes D \otimes I_n and G = I_M \otimes \Sigma_0 \otimes I_k. Let l(y; D, \Sigma_0) be the likelihood with \beta and \gamma integrated out. To show that the posterior is improper, it is enough to show

\int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Sigma_0) \, dD \, d\Sigma_0 = \infty. \qquad (A.1)

Using the eigenvalue decomposition \Sigma_0 = Q' \Psi Q, we can write \pi(\Sigma_0) \, d\Sigma_0 = \pi(\Psi) \, d\Psi \cdot \pi(Q) \, dQ with

\pi(\Psi) = \frac{1}{\prod_{j=1}^{d} \psi_j^{p}} \prod_{i<j} (\psi_i - \psi_j)

by Yang and Berger [23]. Then, we have

\int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Sigma_0) \, dD \, d\Sigma_0 = \int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Psi) \, \pi(Q) \, dD \, d\Psi \, dQ.

To show (A.1), it is enough to find subsets S_D and S_\Psi of the domains of integration for D and \Psi on which the integral is unbounded. That is,

\int_{S_D \times S_\Psi} l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Psi) \, dD \, d\Psi = \infty.
Let S_D = {σ_j^2 ≥ ε, j = 1, \ldots, d} for a fixed ε > 0 and S_Ψ = {ψ : ψ_1 > \cdots > ψ_d > 0}, where
ψ = (ψ_1, \ldots, ψ_d) is the set of eigenvalues of Σ0, and let S* = S_D × S_Ψ. To proceed further, we
denote the ordering of two m × m symmetric matrices A and B with respect to positive
definiteness by A ⪯ B if v'Av ≤ v'Bv for any v ∈ R^m. We also say B ⪰ A if A ⪯ B.
Note that we have
\[
l(y; D, \Sigma_0) = \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}\,|\Sigma_0|^{\frac{1}{2}Mk}\,|Z'PZ+G^{-1}|^{\frac{1}{2}}}\; e^{-\frac{1}{2}y'Wy},
\]
where W = P − PZ(Z'PZ + G^{-1})^{-1}Z'P with P = I_M ⊗ D^{-1} ⊗ (I_n − X(X'X)^{-1}X'). For
simplicity, define P_X = X(X'X)^{-1}X' and P_H = I_n − P_X so that P = I_M ⊗ D^{-1} ⊗ P_H. Let
0 < λ_1 < λ_2 < \cdots < λ_d be the eigenvalues of Z'P_H Z. Then, we have
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}}
= \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes Z'P_H Z + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\ge \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes \lambda_d I_k + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\ge \frac{1}{|\lambda_d \varepsilon^{-1}\Sigma_0 + I|^{\frac{Mk}{2}}}
= \prod_{j=1}^{d} \frac{1}{(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}
\]
on S_D, where the first inequality holds since λ_d I_k ⪰ Z'P_H Z. Also, e^{-\frac{1}{2}y'Wy} ≥ e^{-\frac{1}{2}y'Py}
since PZ(Z'PZ + G^{-1})^{-1}Z'P ⪰ 0.
Thus, on S_D × S_Ψ,
\[
l(y; D, \Psi, Q)\,\pi(\Psi) \ge \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}}\; e^{-\frac{1}{2}y'Py}\; \frac{\prod_{i<j}(\psi_i-\psi_j)}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}.
\]
Since
\[
\int_{S_D} \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}}\; e^{-\frac{1}{2}y'Py}\,\pi(D)\, dD > 0,
\]
we are going to show
\[
\int_{S_\Psi} \frac{\prod_{i<j}(\psi_i-\psi_j)}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi = \infty. \tag{A.2}
\]
The first term in the integrand in (A.2), corresponding to the finite expansion of ∏_{i<j}(ψ_i − ψ_j),
is
\[
\frac{\psi_1^{d-1}\psi_2^{d-2}\cdots\psi_{d-1}^{1}\psi_d^{0}}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}. \tag{A.3}
\]
First consider case (1), p = \frac{d+1}{2}, corresponding to the Jeffreys prior. Since d ≥ 4, let j* = ⌈\frac{d-1}{2}⌉ > 1, where ⌈x⌉ is the smallest integer greater than or equal to x. Then we have
\frac{d-1}{2} − j > 0 for 1 ≤ j ≤ j* − 1 and \frac{d-1}{2} − j ≤ 0 for j* ≤ j ≤ d. Consider A = {ψ : ψ_1 ≥
1, ψ_2 ≥ 1, \ldots, ψ_{j*−1} ≥ 1} ∩ S_Ψ. Then
\[
\begin{aligned}
\int_{S_\Psi} \frac{\psi_1^{d-1}\psi_2^{d-2}\cdots\psi_{d-1}^{1}\psi_d^{0}}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi
&\ge \int_{A} \prod_{j=1}^{d} \psi_j^{\frac{d-1}{2}-j}\, \frac{1}{(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \\
&\ge \int_{A} \frac{\prod_{j\ge j^*}\psi_j^{\frac{d-1}{2}-j}}{\prod_{j\ge j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\; \frac{1}{\prod_{j<j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \\
&\ge \int_{A} \Big(\frac{1}{\psi_{j^*}}\Big)^{\sum_{j\ge j^*}\left|\frac{d-1}{2}-j\right|} \frac{1}{\prod_{j\ge j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\; \frac{1}{\prod_{j<j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \tag{A.4}
\end{aligned}
\]
since ψ_{j*} > ψ_{j*+1} > \cdots > ψ_d > 0.
For p > 1 and b > 0, we have
\[
\int_0^t \frac{1}{(1+bs)^p}\, ds = \frac{1}{b(p-1)}\left[1 - \frac{1}{(1+bt)^{p-1}}\right]. \tag{A.5}
\]
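The identity (A.5) is elementary but load-bearing for the iterated integrations that follow, so a quick numerical sanity check may be useful; the values of p, b, and t below are arbitrary illustrative choices, not quantities from the proof.

```python
import math

def lhs(p, b, t, m=200_000):
    # left side of (A.5) by the composite midpoint rule
    h = t / m
    return h * sum((1.0 + b * (i + 0.5) * h) ** (-p) for i in range(m))

def rhs(p, b, t):
    # closed form on the right side of (A.5)
    return (1.0 - (1.0 + b * t) ** (-(p - 1))) / (b * (p - 1))

p, b, t = 2.5, 1.3, 4.0
assert abs(lhs(p, b, t) - rhs(p, b, t)) < 1e-6
```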
Using (A.5), we integrate out (A.4) with respect to ψ_d, \ldots, ψ_{j*} on A ∩ S_Ψ. First,
\[
\int_0^{\psi_{d-1}} \frac{1}{(1+\psi_d\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_d = \frac{1}{\lambda_d\varepsilon^{-1}\big(\frac{Mk}{2}-1\big)}\left[1 - \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right].
\]
Thus, using (A.5) again,
\[
\begin{aligned}
&\int_0^{\psi_{d-2}} \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}} \int_0^{\psi_{d-1}} \frac{1}{(1+\psi_d\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_d\, d\psi_{d-1} \\
&\quad= \int_0^{\psi_{d-2}} \frac{1}{\lambda_d\varepsilon^{-1}\big(\frac{Mk}{2}-1\big)}\left[1 - \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right] \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_{d-1} \\
&\quad= \frac{1}{\lambda_d^2\varepsilon^{-2}\big(\frac{Mk}{2}-1\big)^2}\left[1 - \frac{1}{(1+\psi_{d-2}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right] - \frac{1}{\lambda_d^2\varepsilon^{-2}\big(\frac{Mk}{2}-1\big)(Mk-2)}\left[1 - \frac{1}{(1+\psi_{d-2}\lambda_d\varepsilon^{-1})^{Mk-2}}\right].
\end{aligned}
\]
After integrating (A.4) with respect to ψ_d, \ldots, ψ_{j*+1}, we have
\[
\int_0^{\psi_{j^*-1}} P\!\left(\frac{1}{\sqrt{1+\psi_{j^*}\lambda_d\varepsilon^{-1}}}\right) \cdot \Big(\frac{1}{\psi_{j^*}}\Big)^{\sum_{j\ge j^*}\left|\frac{d-1}{2}-j\right|}\, d\psi_{j^*}, \tag{A.6}
\]
where P(·) is a polynomial of degree (Mk − 2)(d − j*) + Mk. Note that we assume \frac{Mk}{2} > 1.
However, the integrand above is not integrable at 0 since ∑_{j≥j*}|\frac{d-1}{2} − j| ≥ 1.
Hence we have shown that the posterior is improper. Similarly to case (1), by setting j* = ⌈\frac{d-2}{2}⌉ ≥ 1, we can prove case (2). For case (3), consider the corresponding multiple integral
on B = {ψ : 1 > ψ_1 > ψ_2 > \cdots > ψ_d > 0}. Then it is easy to verify that the first integral of
(A.3) with respect to ψ_d is infinite.
Proof of Theorem 2. Let l(y; D, L) be the likelihood with β and γ integrated out and Σ0
expressed in terms of L. To show that the posterior distribution is proper, we need to verify
\[
\int\!\!\int l(y; D, L)\,\pi(D)\,\pi(L)\, dD\, dL < \infty.
\]
Given the prior models for D and L, we can express
\[
\begin{aligned}
\int l(y; D, L)\,\pi(D)\,\pi(L)\, dD\, dL
&= \int l(y; D, L)\left[\prod_{j=1}^{d}\pi(\sigma_j^2)\right]\pi(\ell_{11})\,\pi(\ell_1)\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11} \\
&= C \int \frac{|\Sigma_0|^{-\frac{Mk}{2}}\, e^{-\frac{1}{2}y'Wy}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11},
\end{aligned}
\]
where C is a constant independent of D and L.
First, we find W_0 that does not depend on L and satisfies W_0 ⪯ W, so that e^{-\frac{1}{2}y'Wy} ≤
e^{-\frac{1}{2}y'W_0y}. Let W_0 = P − PZ(Z'PZ)^{-1}Z'P. Since (Z'PZ + G^{-1})^{-1} ⪯ (Z'PZ)^{-1} =
I_M ⊗ D ⊗ (Z'P_H Z)^{-1}, we have W = P − PZ(Z'PZ + G^{-1})^{-1}Z'P ⪰ P − PZ(Z'PZ)^{-1}Z'P = W_0.
Thus, e^{-\frac{1}{2}y'Wy} ≤ e^{-\frac{1}{2}y'W_0y}.
Let P_0 = P_H − P_H Z(Z'P_H Z)^{-1}Z'P_H. Then we can write W_0 = I_M ⊗ D^{-1} ⊗ P_0, so that
\[
y'W_0y = \sum_{j=1}^{d} \frac{c_j}{\sigma_j^2}, \qquad \text{where } c_j = \sum_{i=1}^{M} y_j^{(i)\prime} P_0\, y_j^{(i)} \text{ with } y_j^{(i)} = (y_{j1}^{(i)}, \cdots, y_{jn}^{(i)})'
\]
for j = 1, \ldots, d. Now, we claim that c_j > 0 for all j = 1, \ldots, d with probability 1. Note that
P_X, P_H and P_0 are idempotent, P_H is orthogonal to P_X = I − P_H, and P_0 is orthogonal
to P_H Z(Z'P_H Z)^{-1}Z'P_H. Thus, rank(P_H) = rank(I) − rank(P_X) = n − (q+1) since
rank(P_X) = rank(X) = q + 1, and rank(P_0) = rank(P_H) − rank(P_H Z(Z'P_H Z)^{-1}Z'P_H). Since
we assume n > q + 1 + k, rank(Z'P_H Z) = min{rank(Z), rank(P_H)} = k and rank(P_H Z) =
min{rank(Z), rank(P_H)} = k, so that rank(P_H Z(Z'P_H Z)^{-1}Z'P_H) = k. Therefore, rank(P_0) =
n − (q+1) − k > 0. Since P_0 is idempotent and non-degenerate, the claim follows.
Thus, we have
\[
e^{-\frac{1}{2}y'Wy} \le e^{-\frac{1}{2}y'W_0y} = \prod_{j=1}^{d} e^{-c_j/(2\sigma_j^2)}, \tag{A.7}
\]
where c_j > 0 for all j = 1, \ldots, d.
Now, we want to bound |Σ0|^{-\frac{Mk}{2}}/|Z'PZ+G^{-1}|^{\frac{1}{2}} above by a simpler expression:
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}}
= \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes Z'P_H Z + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\le \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes E^{-1} \otimes I_k + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
= \frac{|\Sigma_0^{-1}|^{\frac{Mk}{2}}}{|E^{-1}+\Sigma_0^{-1}|^{\frac{Mk}{2}}}
= \frac{|E|^{\frac{Mk}{2}}}{|E+\Sigma_0|^{\frac{Mk}{2}}},
\]
where E = λ_1^{-1}D and λ_1 is the smallest eigenvalue of Z'P_H Z. The inequality follows from
I_M ⊗ D^{-1} ⊗ Z'P_H Z ⪰ I_M ⊗ D^{-1} ⊗ λ_1 I_k = I_M ⊗ E^{-1} ⊗ I_k.
We further simplify the last expression. Let
\[
E = \begin{pmatrix} e_{11} & 0 \\ 0 & E_{11} \end{pmatrix}.
\]
By the block-wise expansion for determinants,
\[
\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A|\,|D - CA^{-1}B|,
\]
we have
\[
|E+\Sigma_0| = (e_{11}+\ell_{11}^2)\left|\ell_1\ell_1' + L_{11}L_{11}' + E_{11} - \frac{\ell_1\ell_1'\,\ell_{11}^2}{e_{11}+\ell_{11}^2}\right| = (e_{11}+\ell_{11}^2)\left|\frac{e_{11}\,\ell_1\ell_1'}{e_{11}+\ell_{11}^2} + L_{11}L_{11}' + E_{11}\right|.
\]
By the identity |I_p + UV| = |I_r + VU|, where U is a p × r matrix and V is an r × p matrix,
we have
\[
|E+\Sigma_0| = (e_{11}+\ell_{11}^2)\,|L_{11}L_{11}'+E_{11}|\left[1 + \ell_1'(L_{11}L_{11}'+E_{11})^{-1}\ell_1\,\frac{e_{11}}{e_{11}+\ell_{11}^2}\right] = (e_{11}+\ell_{11}^2)\,|L_{11}L_{11}'+E_{11}|\,(1+\ell_1'F\ell_1),
\]
where F = \frac{e_{11}}{e_{11}+\ell_{11}^2}(L_{11}L_{11}'+E_{11})^{-1}. Note that |E| = e_{11}|E_{11}| = \prod_{j=1}^{d} e_{jj} = \lambda_1^{-d}\prod_{j=1}^{d}\sigma_j^2.
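The determinant identity |I_p + UV| = |I_r + VU| invoked in this step can be checked numerically; the dimensions and random matrices below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, r_dim = 5, 2
U = rng.standard_normal((p_dim, r_dim))          # U: p x r
V = rng.standard_normal((r_dim, p_dim))          # V: r x p
det_lhs = np.linalg.det(np.eye(p_dim) + U @ V)   # |I_p + U V|
det_rhs = np.linalg.det(np.eye(r_dim) + V @ U)   # |I_r + V U|
assert abs(det_lhs - det_rhs) < 1e-8
```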
Thus,
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \le \frac{|E|^{\frac{Mk}{2}}}{|E+\Sigma_0|^{\frac{Mk}{2}}} = \frac{e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}\,(1+\ell_1'F\ell_1)^{\frac{Mk}{2}}}. \tag{A.8}
\]
Let R = {(\ell_{11}, \ell_1, L_{11}, \sigma_1^2, \ldots, \sigma_d^2) : \ell_{11} > 0;\ \ell_1 \in \mathbb{R}^{d-1};\ \sigma_j^2 > 0, j = 1, \ldots, d} and
R_0 = {(\ell_{11}, L_{11}, \sigma_1^2, \ldots, \sigma_d^2) : \ell_{11} > 0;\ \sigma_j^2 > 0, j = 1, \ldots, d}. Then, by (A.7) and (A.8), we
have
\[
\begin{aligned}
&\int_R \frac{|\Sigma_0|^{-\frac{Mk}{2}}\, e^{-\frac{1}{2}y'Wy}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1}\right]\pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11} \\
&\quad\le \int_{R_0} \left[\int_{\mathbb{R}^{d-1}} \frac{1}{(1+\ell_1'F\ell_1)^{\frac{Mk}{2}}}\, d\ell_1\right] \frac{e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \\
&\qquad\times \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11}. \tag{A.9}
\end{aligned}
\]
When Mk > 1, \int_{\mathbb{R}^{d-1}} (1+\ell_1'F\ell_1)^{-\frac{Mk}{2}}\, d\ell_1 = C_1 |F|^{-\frac{1}{2}}, where C_1 is a constant. Thus, (A.9)
becomes
\[
\begin{aligned}
&C_1 \int_{R_0} \frac{|F|^{-\frac{1}{2}}\, e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad= C_1 \int_{R_0} \Big(\frac{e_{11}}{e_{11}+\ell_{11}^2}\Big)^{\frac{Mk-(d-1)}{2}} \frac{|E_{11}|^{\frac{Mk}{2}}}{|L_{11}L_{11}'+E_{11}|^{\frac{Mk-1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad\le C_1 \int_{R_0} |E_{11}|^{\frac{1}{2}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad= C_2 \int_0^\infty \Big(\frac{1}{\sigma_1^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_1+1} e^{-\frac{c_1}{2\sigma_1^2}}\, d\sigma_1^2 \left[\prod_{j=2}^{d} \int_0^\infty \Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+\frac{1}{2}} e^{-\frac{c_j}{2\sigma_j^2}}\, d\sigma_j^2\right] \int_0^\infty \pi(\ell_{11})\, d\ell_{11} \int \pi(L_{11})\, dL_{11} \\
&\quad< \infty,
\end{aligned}
\]
where the first inequality holds since
\[
\Big(\frac{e_{11}}{e_{11}+\ell_{11}^2}\Big)^{\frac{Mk-(d-1)}{2}} \le 1, \qquad \frac{|E_{11}|^{\frac{Mk-1}{2}}}{|L_{11}L_{11}'+E_{11}|^{\frac{Mk-1}{2}}} \le 1
\]
for Mk ≥ d − 1, and the last inequality holds since π(ℓ_{11}) and π(L_{11}) are proper and
\int_0^\infty \theta^{-s} e^{-t/\theta}\, d\theta < \infty for s > 1, t > 0. C_1 and C_2 above are both finite constants. Hence we
conclude that the posterior is proper.
Appendix B
MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1
Consider the linear mixed model
\[
y = X\beta + Z\gamma + \varepsilon, \qquad \gamma \sim N(0, G), \quad \varepsilon \sim N(0, \Sigma),
\]
where G = I_M ⊗ Σ0 ⊗ I_k and Σ = I_M ⊗ D ⊗ I_n. For the Bayesian inference, we put priors
on the parameters as described in Section 1.3.3, and the following are the updating steps for
β, γ and σ_j^2 in the Gibbs sampling method.
Updating β
Let r^{(i)} = y^{(i)} − (I_d ⊗ Z)γ^{(i)} and \bar{r} = \frac{1}{M}\sum_{i=1}^{M} r^{(i)}. Given π(β) ∝ 1, the posterior distribution
of β is
\[
\beta \mid y, \gamma, D, \Sigma_0 \sim N(\hat\beta, V_\beta),
\]
where \hat\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\bar{r} = (I_d \otimes (X'X)^{-1}X')\,\bar{r} and V_\beta = (X'\Sigma^{-1}X)^{-1} = \frac{1}{M}\big(D \otimes (X'X)^{-1}\big).
Updating γ
Let u^{(i)} = y^{(i)} − (I_d ⊗ X)β, i = 1, \ldots, M. We can update the γ^{(i)} one by one independently.
Since γ^{(i)} ∼ N(0, Σ0 ⊗ I_k), the posterior distribution of γ^{(i)} is
\[
\gamma^{(i)} \mid y, \beta, D, \Sigma_0 \sim N(\hat\gamma^{(i)}, V_\gamma), \qquad i = 1, \ldots, M,
\]
where \hat\gamma^{(i)} = (D^{-1} \otimes Z'Z + \Sigma_0^{-1} \otimes I_k)^{-1}(D^{-1} \otimes Z')\,u^{(i)} and V_\gamma = (D^{-1} \otimes Z'Z + \Sigma_0^{-1} \otimes I_k)^{-1}.
Updating D = diag(σ_1^2, \ldots, σ_d^2)
We can update D by updating the σ_j^2 one by one independently. Let r = y − Xβ − Zγ. Given
IG(a_j, b_j) as the prior π(σ_j^2), the posterior distribution of σ_j^2 is
\[
\sigma_j^2 \mid y, \beta, \gamma, \Sigma_0 \sim IG\left(\frac{1}{2}Mn + a_j,\ \Big(\frac{1}{2}\sum_{i=1}^{M}\sum_{t=1}^{n} r_{jt}^{(i)2} + \frac{1}{b_j}\Big)^{-1}\right), \qquad j = 1, \ldots, d.
\]
The updating procedure for Σ0 via ℓ_{11}, ℓ_1 and L_{11} is provided in Section 1.3.3.
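The three updates above can be sketched in code. This is a minimal illustration of one Gibbs sweep under the Kronecker-structured model of this appendix; the dimensions, the designs X and Z, the placeholder data Y, and the IG hyperparameters a_j, b_j are all arbitrary stand-ins (in particular, Z is a random matrix rather than an actual spline basis), not the dissertation's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, n, k, q = 3, 2, 12, 4, 1            # replicates, curves, grid size, knots, poly degree
X = np.column_stack([np.linspace(1 / n, 1, n) ** j for j in range(q + 1)])  # n x (q+1)
Z = rng.standard_normal((n, k))            # stand-in for the spline random-effect design
Y = rng.standard_normal((M, d, n))         # placeholder data y_{jt}^{(i)}

beta, gamma = np.zeros(d * (q + 1)), np.zeros((M, d * k))
sig2, Sigma0 = np.ones(d), np.eye(d)
a_j, b_j = 2.0, 1.0                        # assumed IG hyperparameters

Xd, Zd = np.kron(np.eye(d), X), np.kron(np.eye(d), Z)
yvec = Y.reshape(M, d * n)                 # responses stacked j-major per replicate

# beta | rest ~ N(beta_hat, (1/M) D kron (X'X)^{-1}) under the flat prior pi(beta) ∝ 1
rbar = (yvec - gamma @ Zd.T).mean(axis=0)
beta_hat = np.kron(np.eye(d), np.linalg.solve(X.T @ X, X.T)) @ rbar
V_beta = np.kron(np.diag(sig2), np.linalg.inv(X.T @ X)) / M
beta = rng.multivariate_normal(beta_hat, V_beta)

# gamma^{(i)} | rest, drawn independently across replicates i
D_inv, S0_inv = np.diag(1.0 / sig2), np.linalg.inv(Sigma0)
V_gamma = np.linalg.inv(np.kron(D_inv, Z.T @ Z) + np.kron(S0_inv, np.eye(k)))
for i in range(M):
    u = yvec[i] - Xd @ beta
    gamma[i] = rng.multivariate_normal(V_gamma @ (np.kron(D_inv, Z.T) @ u), V_gamma)

# sigma_j^2 | rest ~ IG(Mn/2 + a_j, (sum of squared residuals / 2 + 1/b_j)^{-1}),
# drawn as the reciprocal of a Gamma variate
resid = (yvec - beta @ Xd.T - gamma @ Zd.T).reshape(M, d, n)
for j in range(d):
    rate = 0.5 * np.sum(resid[:, j, :] ** 2) + 1.0 / b_j
    sig2[j] = 1.0 / rng.gamma(0.5 * M * n + a_j, 1.0 / rate)
```

In a full sampler these draws would be cycled, together with the Σ0 update of Section 1.3.3, until the chain mixes.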
Appendix C
Maximum Likelihood Estimation in Chapter 1
This part serves as a preliminary analysis of the parameters based on the likelihood approach.
Here, instead of the original assumption that the covariance matrix for the errors is Σ = I_M ⊗ D ⊗
Σ_φ, where D is a diagonal matrix, we consider a more general relationship between individual
covariates such that Σ = I_M ⊗ Σ_1 ⊗ Σ_φ, where Σ_1 can be any valid covariance matrix. Then
the likelihood function for the linear regression model in Section 1.4 is
\[
f(y \mid \beta, \gamma, \Sigma_1) = \frac{1}{(2\pi)^{\frac{1}{2}ndM}\,|\Sigma|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(y-X\beta-Z\gamma)'\Sigma^{-1}(y-X\beta-Z\gamma)},
\]
where Σ = I_M ⊗ Σ_1 ⊗ Σ_φ when we treat γ as fixed effects as well. The MLE equation for
β and γ is then
\[
\begin{pmatrix} X'\Sigma^{-1}X & X'\Sigma^{-1}Z \\ Z'\Sigma^{-1}X & Z'\Sigma^{-1}Z \end{pmatrix}\begin{pmatrix} \hat\beta \\ \hat\gamma \end{pmatrix} = \begin{pmatrix} X'\Sigma^{-1}y \\ Z'\Sigma^{-1}y \end{pmatrix}.
\]
Let \hat\beta, \hat\gamma denote the MLEs, e_{jt}^{(i)} = y_{jt}^{(i)} − X\hat\beta − Z\hat\gamma, U_i = (e_1^{(i)}, \ldots, e_d^{(i)}) and S_i = U_i'\Sigma_\varphi^{-1}U_i,
where i = 1, \ldots, M. Then we have
\[
f(y \mid \hat\beta, \hat\gamma, \Sigma_1, \Sigma_\varphi) \propto \frac{e^{-\frac{1}{2}\mathrm{tr}(\Sigma_1^{-1}S)}}{|\Sigma_1|^{\frac{nM}{2}}\,|\Sigma_\varphi|^{\frac{dM}{2}}},
\]
where S = \sum_{i=1}^{M} S_i. This is maximized at \hat\Sigma_1 = \frac{S}{nM}.
The log-likelihood function is
\[
L(y \mid \beta, \gamma, \Sigma_1, \Sigma_\varphi) \propto -\frac{M}{2}\,\mathrm{tr}\big(\Sigma_1^{-1}S(\varphi)\big) + \frac{nM}{2}\log|\Sigma_1^{-1}| + \frac{dM}{2}\log|\Sigma_\varphi^{-1}|,
\]
where S(\varphi) = \frac{S}{M}. It can be shown that |\Sigma_\varphi^{-1}| = 2^n \prod_{j=0}^{n-1}\big(1 + \varphi\cos(\tfrac{2j\pi}{n})\big). Therefore the MLE for φ satisfies
\[
\hat\varphi = \arg\max_{-1\le\varphi\le 1}\; \left\{-\frac{M}{2}\,\mathrm{tr}\big(\Sigma_1^{-1}S(\varphi)\big) + \frac{dM}{2}\log|\Sigma_\varphi^{-1}| + \frac{ndM}{2}\right\}.
\]
Following the procedure above, we are able to obtain the MLEs by updating the parameters
iteratively until convergence.
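The determinant formula for Σ_φ^{-1} can be verified numerically under the assumption, not spelled out above, that Σ_φ^{-1} is the circulant matrix with first row (2, φ, 0, …, 0, φ); its eigenvalues are then 2 + 2φ cos(2jπ/n), which yields the stated product form.

```python
import numpy as np

n, phi = 8, 0.4
P = np.zeros((n, n))                      # assumed circulant precision Sigma_phi^{-1}
for i in range(n):
    P[i, i] = 2.0
    P[i, (i + 1) % n] = phi
    P[i, (i - 1) % n] = phi

det_direct = np.linalg.det(P)
det_formula = 2.0 ** n * np.prod(1.0 + phi * np.cos(2 * np.pi * np.arange(n) / n))
assert abs(det_direct - det_formula) < 1e-8 * det_formula
```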
Appendix D
Proof of Theorems and Lemmas in Chapter 2
Proof of Lemma 1. First note that V_j = \varepsilon'X^{(j)}/(\sqrt{n}\,\sigma\sigma_j) ∼ N(0, 1). Examining the tail
probability of Gaussian random variables leads to
\[
P\left(\max_{1\le j\le n}|V_j| > \sqrt{t^2+2\log n}\right) \le 2nP\left(V_1 > \sqrt{t^2+2\log n}\right) \le 2n\exp\left[-\frac{t^2+2\log n}{2}\right] = 2\exp(-t^2/2). \tag{D.1--D.2}
\]
Then it follows that
\[
P\left(\max_{1\le j\le n}\frac{2|\varepsilon'X^{(j)}|}{n} > \lambda_0\right) = P\left(\frac{2\sigma}{\sqrt{n}}\max_{1\le j\le n}\sigma_j|V_j| > \lambda_0\right) \le P\left(\max_{1\le j\le n}|V_j| > \sqrt{t^2+2\log n}\right) \le 2\exp(-t^2/2). \tag{D.3--D.5}
\]
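A Monte Carlo illustration of the maximal inequality just proved; the V_j are simulated as iid N(0, 1) purely for illustration (the union bound itself does not require independence), and the values of n and t are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 200, 1.5, 20_000
thresh = np.sqrt(t ** 2 + 2 * np.log(n))
V = rng.standard_normal((reps, n))                    # V_j simulated as iid N(0, 1)
empirical = np.mean(np.abs(V).max(axis=1) > thresh)   # P(max_j |V_j| > sqrt(t^2 + 2 log n))
bound = 2 * np.exp(-t ** 2 / 2)                       # the stated upper bound 2 exp(-t^2/2)
assert empirical <= bound
```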
Proof of Lemma 2. Let W = n\hat\sigma^2/\sigma^2 = Y'Y/\sigma^2. Since Y ∼ N(Xβ, σ^2 I), we claim
that W ∼ χ^2_n(r), i.e., a noncentral chi-square distribution with noncentrality parameter r.
It is known that
\[
E(W) = n + r, \tag{D.6}
\]
\[
E[W - E(W)]^2 = 2(n+2r), \tag{D.7}
\]
\[
E[W - E(W)]^4 = 12(n+2r)^2 + 48(n+4r). \tag{D.8}
\]
Given k > 0, by Chebyshev's inequality,
\[
P\big(|\hat\sigma^2 - (1+r/n)\sigma^2| > \alpha\sigma^2\big) = P\big(|W - E(W)| > n\alpha\big) \le \frac{E|W-E(W)|^k}{\alpha^k n^k}. \tag{D.9}
\]
Therefore,
\[
P\big(\hat\sigma^2 < (1-\alpha)\sigma^2\big) \le P\big(|\hat\sigma^2 - (1+r/n)\sigma^2| > \alpha\sigma^2\big) \le \frac{E|W-E(W)|^k}{\alpha^k n^k}. \tag{D.10}
\]
Finally, by taking k = 2 and k = 4 respectively, we complete our proof.
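The first two noncentral chi-square moments in (D.6)–(D.7) can be checked by simulation; n, r and the replication count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, reps = 10, 4.0, 400_000
mu = np.zeros(n)
mu[0] = np.sqrt(r)                                    # noncentrality: sum_i mu_i^2 = r
W = ((rng.standard_normal((reps, n)) + mu) ** 2).sum(axis=1)   # W ~ chi^2_n(r)
assert abs(W.mean() - (n + r)) < 0.05                 # E(W) = n + r
assert abs(W.var() - 2 * (n + 2 * r)) < 0.5           # Var(W) = 2(n + 2r)
```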
Proof of Theorem 3. Note that under (2.5),
\[
\|Y - X\hat\beta\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\hat\beta_j| \le \|Y - X\beta\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.11}
\]
By expanding the \ell_2 terms, this simply gives
\[
\|X(\hat\beta-\beta)\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\hat\beta_j| \le 2\varepsilon'X(\hat\beta-\beta)/n + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.12}
\]
For any vector v ∈ R^p we define v_{01} = (v_0, v_1, 0, \cdots, 0)' and v_{2+} = (0, 0, v_2, \cdots, v_{n-1})'.
Since v = v_{01} + v_{2+}, (D.12) becomes
\[
\|X(\hat\beta-\beta)\|_2^2/n \le 2\varepsilon'X(\hat\beta-\beta)_{01}/n + 2\varepsilon'X(\hat\beta-\beta)_{2+}/n - \lambda\sum_{j=2}^{n-1}|\hat\beta_j| + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.13}
\]
Set \lambda_0 = 6K\sigma\sqrt{\frac{\log n}{n}}. On J_{\lambda_0} = \{\max_{1\le j\le n} 2|\varepsilon'X^{(j)}|/n \le \lambda_0\},
\[
2\varepsilon'X(\hat\beta-\beta)_{2+}/n \le \lambda_0\sum_{j=2}^{n-1}|\hat\beta_j - \beta_j|. \tag{D.14}
\]
Meanwhile, note that \|\beta_{01}\|_1 = |\beta_0| + |\beta_1| \le M by assumption. We have
\[
2\varepsilon'X(\hat\beta-\beta)_{01}/n \le \lambda_0\|\hat\beta_{01}\|_1 + \lambda_0\|\beta_{01}\|_1 \le \lambda_0 M + \lambda_0(|\hat\beta_0| + |\hat\beta_1|). \tag{D.15}
\]
Now define S_n = \{\hat\sigma : \hat\sigma \ge \sigma/2\}. Then on S_n we have \lambda \ge \lambda_0. Combining (D.12) through
(D.15) and noting the facts that \lim_{n\to\infty}\lambda = 0 and \lim_{n\to\infty}\lambda\|\beta\|_1 = 0, we claim that on
J_{\lambda_0} \cap S_n,
\[
\|X(\hat\beta-\beta)\|_2^2/n \le \lambda M + 2\lambda\|\beta\|_1 \to 0 \quad \text{as } n\to\infty. \tag{D.16}
\]
It remains to show that J_{\lambda_0} \cap S_n has high probability when n is large. In fact, we have
\[
P\big(J_{\lambda_0}^c \cup S_n^c\big) \le P(J_{\lambda_0}^c) + P(S_n^c). \tag{D.17}
\]
By Lemma 1, P(J_{\lambda_0}^c) \le 2/n^2. Taking \alpha = 3/4 in Lemma 2, we get
\[
P(S_n^c) \le \frac{(27+12)\cdot 4^4}{3^4\cdot n^2} \le \frac{128}{n^2} \tag{D.18}
\]
when n \ge \max(4r, 8). Hence we claim that P\big(J_{\lambda_0}^c \cup S_n^c\big) \le 130/n^2 when n \ge \max(4r, 8).
Since
\[
\sum_{n=1}^{\infty} P\big(J_{\lambda_0}^c \cup S_n^c\big) \le (4r+8) + 130\sum_{n=1}^{\infty}\frac{1}{n^2} < \infty, \tag{D.19}
\]
by the Borel--Cantelli lemma,
\[
P\big(J_{\lambda_0}^c \cup S_n^c \ \text{i.o.}\big) = 0. \tag{D.20}
\]
It follows that
\[
P\big(\exists N > 0 \text{ s.t. } J_{\lambda_0} \cap S_n \text{ holds when } n > N\big) = 1. \tag{D.21}
\]
Therefore, by (D.16),
\[
P\Big(\lim_{n\to\infty}\|X(\hat\beta-\beta)\|_2^2/n = 0\Big) = 1. \tag{D.22}
\]
Proof of Proposition 1. Recall that x_j = j/n, j = 1, 2, \cdots, n. According to (2.4), we
have
\[
X - \nu I = \begin{pmatrix}
1-\nu & \frac{1}{n} & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{2}{n}-\nu & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{3}{n} & \frac{1}{n}-\nu & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & \frac{n-1}{n} & \frac{n-3}{n} & \frac{n-4}{n} & \cdots & \frac{1}{n}-\nu & 0 \\
1 & 1 & \frac{n-2}{n} & \frac{n-3}{n} & \cdots & \frac{2}{n} & \frac{1}{n}-\nu
\end{pmatrix}. \tag{D.23}
\]
It is easy to see that ν = 2/n is not a solution to \det(X − \nu I) = 0. So by multiplying the
second row of (D.23) by −1/(2 − nν) and adding it to the first row, we obtain the lower
triangular matrix
\[
R(X - \nu I) = \begin{pmatrix}
\frac{n\nu^2-(n+2)\nu+1}{2-n\nu} & 0 & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{2}{n}-\nu & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{3}{n} & \frac{1}{n}-\nu & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & \frac{n-1}{n} & \frac{n-3}{n} & \frac{n-4}{n} & \cdots & \frac{1}{n}-\nu & 0 \\
1 & 1 & \frac{n-2}{n} & \frac{n-3}{n} & \cdots & \frac{2}{n} & \frac{1}{n}-\nu
\end{pmatrix}, \tag{D.24}
\]
where R denotes the row operation described above. Then it is clear that
\[
\det(X - \nu I) = \det(R(X - \nu I)) = \frac{n\nu^2-(n+2)\nu+1}{n}\left(\frac{1}{n}-\nu\right)^{n-2}. \tag{D.25}
\]
It directly follows that
\[
\nu_1 = \frac{n+2-\sqrt{n^2+4}}{2n}, \qquad \nu_2 = \nu_3 = \cdots = \nu_{n-1} = \frac{1}{n}, \qquad \nu_n = \frac{n+2+\sqrt{n^2+4}}{2n} \tag{D.26}
\]
are the roots of \det(X − \nu I) = 0.
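The two nontrivial roots in (D.26) can be verified directly against the quadratic factor of (D.25); n below is an arbitrary example size.

```python
import math

n = 50                                     # arbitrary example size
disc = math.sqrt(n ** 2 + 4)
v1 = (n + 2 - disc) / (2 * n)              # nu_1 from (D.26)
vn = (n + 2 + disc) / (2 * n)              # nu_n from (D.26)
for v in (v1, vn):
    # both satisfy the quadratic factor n v^2 - (n + 2) v + 1 = 0
    assert abs(n * v ** 2 - (n + 2) * v + 1) < 1e-12
assert abs(v1 * vn - 1 / n) < 1e-12        # Vieta: product of roots = 1/n
assert abs(v1 + vn - (n + 2) / n) < 1e-12  # Vieta: sum of roots = (n + 2)/n
```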
Appendix E
Proof of Theorems in Chapter 3
Proof of Theorem 4. Given site s, the likelihood function is
\[
f(Y_s \mid b, u_s, \sigma_s^2) = \frac{1}{(2\pi\sigma_s^2)^{\frac{n}{2}}}\exp\left(-\frac{1}{2\sigma_s^2}(Y_s - X^{(s)}b^{(s)} - u_s\mathbf{1}_n)^T(Y_s - X^{(s)}b^{(s)} - u_s\mathbf{1}_n)\right). \tag{E.1}
\]
To validate the propriety of the posterior, we need to show
\[
\sum_{c\in\mathcal{P}} \int_{(0,\infty)} \int_{\mathbb{R}^N} \int_{\Theta^d} \int_{(0,\infty)^N} \left[\prod_{s=1}^{N} f(Y_s \mid b, u_s, \sigma_s^2)\,\pi(u_s \mid \tau^2)\,\pi(\sigma_s^2)\right] \pi(c)\,\pi(\tau^2)\, d\sigma^2\, G_0(d\theta)\, du\, d\tau^2 < \infty. \tag{E.2}
\]
Integrating out σ_s^2, we have
\[
f_s(Y_s, b, u_s) = \int_{(0,\infty)} f(Y_s \mid b, u_s, \sigma_s^2)\,\pi(\sigma_s^2)\, d\sigma_s^2 = (2\pi)^{-\frac{n}{2}}\,\frac{\Gamma(a_\sigma + \frac{n}{2})}{\Gamma(a_\sigma)\,b_\sigma^{a_\sigma}}\left(\frac{1}{2}e_s^T e_s + \frac{1}{b_\sigma}\right)^{-(a_\sigma+\frac{n}{2})}, \tag{E.3}
\]
where e_s = Y_s − X^{(s)}b^{(s)} − u_s\mathbf{1}_n.
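The inverse-gamma marginalization in (E.3) can be checked by one-dimensional quadrature, assuming the IG(a_σ, b_σ) density b_σ^{-a_σ}/Γ(a_σ) · (σ²)^{-(a_σ+1)} e^{-1/(b_σσ²)} used throughout this appendix; n, a_σ, b_σ and the residual sum of squares below are arbitrary illustrative values.

```python
import math

n, a, b = 5, 2.0, 1.5                    # arbitrary n, a_sigma, b_sigma
ee = 3.7                                 # illustrative value of e_s' e_s

def integrand(s):
    # Gaussian likelihood times the assumed inverse-gamma prior, as a function of s = sigma^2
    lik = (2 * math.pi * s) ** (-n / 2) * math.exp(-ee / (2 * s))
    prior = b ** (-a) / math.gamma(a) * s ** (-(a + 1)) * math.exp(-1.0 / (b * s))
    return lik * prior

# midpoint rule in x = log(s) over a window wide enough for both tails
m, lo, hi = 60_000, -12.0, 12.0
h = (hi - lo) / m
num = h * sum(integrand(math.exp(lo + (i + 0.5) * h)) * math.exp(lo + (i + 0.5) * h)
              for i in range(m))
closed = ((2 * math.pi) ** (-n / 2) * math.gamma(a + n / 2)
          / (math.gamma(a) * b ** a) * (0.5 * ee + 1.0 / b) ** -(a + n / 2))
assert abs(num - closed) < 1e-6 * closed
```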
Then the integrand in (E.2) becomes
\[
\left[\prod_{s=1}^{N} f_s(Y_s, b, u_s)\,\pi(u_s \mid \tau^2)\right]\pi(c)\,\pi(\tau^2), \tag{E.4}
\]
which is equivalent to the clustered form
\[
(2\pi)^{-\frac{Nn}{2}}\left(\frac{\Gamma(a_\sigma+\frac{n}{2})}{\Gamma(a_\sigma)\,b_\sigma^{a_\sigma}}\right)^N \prod_{r=1}^{d}\prod_{s\in C_r}\left(\frac{1}{2}e_{r,s}^T e_{r,s} + \frac{1}{b_\sigma}\right)^{-(a_\sigma+\frac{n}{2})}, \tag{E.5}
\]
where e_{r,s} = Y_s − X_r b_r − u_s\mathbf{1}_n.
Let I_{r,s} = \big(\frac{1}{2}e_{r,s}^T e_{r,s} + \frac{1}{b_\sigma}\big)^{-(a_\sigma+\frac{n}{2})}. Integrating with respect to G_0(dθ), given cluster r we
need to evaluate
\[
\prod_{r=1}^{d}\int_{\mathbb{R}^{q+k_r}} \left[\prod_{s\in C_r} I_{r,s}\right] db_r \le \prod_{r=1}^{d}\int_{\mathbb{R}^{q+k_r}} b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)}\, I_{r,r_1}\, db_r \tag{E.6}
\]
since I_{r,s} \le b_\sigma^{a_\sigma+\frac{n}{2}} for any r and s, where r_1 \in C_r is a fixed location. Note that I_{r,r_1} =
\big(\frac{1}{2}e_{r,r_1}^T e_{r,r_1} + \frac{1}{b_\sigma}\big)^{-(a_\sigma+\frac{n}{2})}, as a function of b_r, is proportional to a (q + k_r)-dimensional
multivariate t-distribution with parameters (\mu_r, \Sigma_r, \phi_r), where
\[
\mu_r = (X_r^T X_r)^{-1}X_r^T(Y_{r_1} - u_{r_1}\mathbf{1}_n) = (X_r^T X_r)^{-1}X_r^T v_{r_1}, \tag{E.7}
\]
\[
\Sigma_r = \frac{\frac{2}{b_\sigma} + v_{r_1}^T\big[I - X_r(X_r^T X_r)^{-1}X_r^T\big]v_{r_1}}{\phi_r}\,(X_r^T X_r)^{-1}, \tag{E.8}
\]
\[
\phi_r = 2a_\sigma + n - (q + k_r). \tag{E.9}
\]
Then the r-th integral on the RHS of (E.6) can be shown to be
\[
\int_{\mathbb{R}^{q+k_r}} b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)}\, I_{r,r_1}\, db_r = A_r\, b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)} \cdot \frac{\big(\frac{2}{b_\sigma} + v_{r_1}^T P_r v_{r_1}\big)^{\frac{q+k_r}{2}}}{\big(\frac{1}{b_\sigma} + \frac{1}{2}v_{r_1}^T P_r v_{r_1}\big)^{a_\sigma+\frac{n}{2}}} \tag{E.10}
\]
\[
\le A_r\, b_\sigma^{(a_\sigma+\frac{n}{2})|C_r| - \frac{q+k_r}{2}}\, 2^{\frac{1}{2}(q+k_r)}, \tag{E.11}
\]
where P_r = I - X_r(X_r^T X_r)^{-1}X_r^T and
\[
A_r = \frac{\Gamma\big(a_\sigma + \frac{n-(q+k_r)}{2}\big)}{\Gamma\big(a_\sigma + \frac{n}{2}\big)}\,\frac{\pi^{\frac{1}{2}(q+k_r)}}{|X_r^T X_r|^{\frac{1}{2}}}. \tag{E.12}
\]
Therefore it remains to verify that
\[
\int_{\mathbb{R}^N}\int_{(0,\infty)} \left[\prod_{s=1}^{N}\pi(u_s \mid \tau^2)\right]\pi(\tau^2)\, d\tau^2\, du < \infty. \tag{E.13}
\]
Recall that \pi(u_s \mid \tau^2) = \frac{1}{\sqrt{2\pi}\,\tau}e^{-\frac{u_s^2}{2\tau^2}} and \pi(\tau^2) = \frac{b_\tau^{-a_\tau}}{\Gamma(a_\tau)}\big(\frac{1}{\tau^2}\big)^{a_\tau+1}e^{-\frac{1}{b_\tau\tau^2}}. Integrating out τ^2
leads to the new integrand
\[
(2\pi)^{-\frac{N}{2}}\,\frac{\Gamma(a_\tau+\frac{N}{2})}{\Gamma(a_\tau)\,b_\tau^{a_\tau}}\left(\frac{1}{b_\tau} + \frac{1}{2}\sum_{s=1}^{N}u_s^2\right)^{-(a_\tau+\frac{N}{2})}, \tag{E.14}
\]
which, as a function of u, is proportional to the density of an N-dimensional t-distribution
with parameters (\mathbf{0}_N, (a_\tau b_\tau)^{-1}I_N, 2a_\tau). Hence the last integration with respect to u is finite.
Note that the final summation is over finitely many possible partitions of S. Therefore we
complete our verification.
BIBLIOGRAPHY
[1] Abraham, C., Cornillon, P.A., Matzner-Løber, E. and Molinari, N. (2003), Unsupervised Curve Clustering using B-Splines. Scandinavian Journal of Statistics 30: 581-595.
[2] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics 37(4): 1705-1732.
[3] Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer: New York, USA.
[4] Dass, S., Lim, C., Maiti, T. and Zhang, Z. (2014), Clustering Curves based on Change Point Analysis: A Nonparametric Bayesian Approach. Accepted by Statistica Sinica.
[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least Angle Regression. The Annals of Statistics 32(2): 407-499.
[6] Escobar, M.D. and West, M. (1995), Bayesian Density Estimation and Inference using Mixtures. Journal of the American Statistical Association 90(430): 577-588.
[7] Ferguson, T.S. (1973), A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2): 209-230.
[8] Friedman, J., Hastie, T. and Tibshirani, R. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1): 1-22.
[9] Gaston, K.J. (1996), Biodiversity: A Biology of Numbers and Difference. Oxford, UK.
[10] Gelman, A. and Rubin, D.B. (1992), Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7: 457-472.
[11] Ghosh, P., Basu, S. and Tiwari, R.C. (2009), Bayesian Analysis of Cancer Rates from SEER Program using Parametric and Semiparametric Joinpoint Regression Models. Journal of the American Statistical Association 104(486): 439-452.
[12] Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning. Springer: New York, USA.
[13] Jeffreys, H. (1961), Theory of Probability. Oxford Univ. Press: Oxford, UK.
[14] Kim, H.-J., Fay, M.P., Feuer, E.J. and Midthune, D.N. (2000), Permutation Tests for Joinpoint Regression with Applications to Cancer Rates. Statistics in Medicine 19: 335-351.
[15] Lozano, R. et al. (2012), Global and Regional Mortality from 235 Causes of Death for 20 Age Groups in 1990 and 2010: A Systematic Analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2095-2128.
[16] Ramsay, J.O. and Silverman, B.W. (1997), Functional Data Analysis. Springer: New York, USA.
[17] Ries, L.A.G., Harkins, D., Krapcho, M., Mariotto, A., Miller, B.A., Feuer, E.J., Clegg, L., Eisner, M.P., Horner, M.J., Howlader, N., Hayat, M., Hankey, B.F. and Edwards, B.K. (eds.) (2006), SEER Cancer Statistics Review, 1975-2003. National Cancer Institute: Bethesda, MD, USA.
[18] Sethuraman, J. (1994), A Constructive Definition of Dirichlet Priors. Statistica Sinica 4: 639-650.
[19] The IUCN Red List of Threatened Species (2012). http://www.iucnredlist.org/.
[20] Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1): 267-288.
[21] Tiwari, R.C., Cronin, K.A., Davis, W., Feuer, E.J., Yu, B. and Chib, S. (2005), Bayesian Model Selection for Join Point Regression with Application to Age-adjusted Cancer Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(5): 919-939.
[22] Wang, H., Li, B. and Leng, C. (2009), Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71: 671-683.
[23] Yang, R. and Berger, J. (1994), Estimation of a Covariance Matrix Using the Reference Prior. The Annals of Statistics 22(3): 1195-1211.