FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS
By
Xin Qi
A DISSERTATION
Submitted to Michigan State University
in partial fulfillment of the requirements for the degree of
Statistics - Doctor of Philosophy
2014
ABSTRACT
FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS
By
Xin Qi
As a branch of statistics, functional data analysis deals with data in the form of curves,
functions, surfaces, or other objects varying over a continuum such as time or spatial
location. The topic has drawn growing attention in recent years because, in many real
problems, samples are collected as curves and other functional observations, which calls for
statistical models and methodologies that perform inference on such data appropriately. In
this dissertation, I introduce several functional data analysis problems arising from real
applications and develop methodologies that provide powerful tools to address the associated
statistical questions.
TABLE OF CONTENTS
LIST OF TABLES ........................................................................................................................v
LIST OF FIGURES .................................................................................................................... vi
Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data ...1
1.1 Introduction ...........................................................................................................................1
1.2 Spline Mixed Model ..............................................................................................................4
1.3 Bayesian Hypothesis Testing on Covariance Ordering .........................................................6
1.3.1 Model Selection and Hypothesis Testing ..................................................................6
1.3.2 New Prior on Σ0 with Ordering Representation ........................................8
1.3.3 Bayesian Estimation ...................................................................................................9
1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan .................................10
1.4.1 Bayesian Inference Results ......................................................................................12
1.4.2 Discussion ................................................................................................................15
Chapter 2 Joinpoint Detection using L1-Penalized Spline Method ........................................17
2.1 Introduction .........................................................................................................................17
2.2 L1-Penalized Regression Spline Model ...............................................................19
2.3 Consistency of the LASSO Type Estimator ........................................................................21
2.4 Simulation Study .................................................................................................................25
2.5 A Case Study: National Cancer Incidence Rate Analysis ...................................................33
2.6 Conclusion and Discussion .................................................................................................37
Chapter 3 Nonparametric Bayesian Clustering of Functional Data ......................................38
3.1 Introduction .........................................................................................................................38
3.2 Spline Mixed Model for Clustering Multiple Curves .........................................................41
3.2.1 Model Specification ......................................................................................................41
3.2.2 Derivation of Functional Dirichlet Process and Other Prior .......................................43
3.3 Theory of Bayesian Inference ..............................................................................................46
3.3.1 Propriety of the Posterior Distribution .........................................................................46
3.3.2 Gibbs Updating Steps ...................................................................................................46
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clustering of Sites ....................52
3.5 Numerical Analysis: Simulation Study and Real Data Example .........................................54
3.5.1 An Alternative Less-informative Prior Choice for ..................................................55
3.5.2 Simulations and Comparison with An Existing Approach ...........................................56
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S. ................................................57
3.6 Conclusion and Discussion .................................................................................................59
APPENDICES .............................................................................................................................64
Appendix A Proof of Theorems in Chapter 1 ...........................................................................65
Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 .....74
Appendix C Maximum Likelihood Estimation in Chapter 1 ....................................................76
Appendix D Proof of Theorems and Lemmas in Chapter 2 ......................................................78
Appendix E Proof of Theorems in Chapter 3 ............................................................................83
BIBLIOGRAPHY .......................................................................................................................86
LIST OF TABLES
Table 1.1: Bayes Factor for Significant Orderings at L00 ............................................................15
Table 1.2: Bayes Factor for Significant Orderings at L02 ............................................................15
Table 2.1: A Summary of Four Methods to Compare with the PTB ............................................28
Table 2.2: Simulation Result for , ........................................................29
Table 2.3: Simulation Result for , ......................................................30
Table 2.4: Simulation Result for , ......................................................31
Table 2.5: Simulation Result for , ......................................................33
Table 2.6: Simulation Result for , ...................................................34
Table 2.7: Simulation Result for , .................................................36
Table 2.8: Joinpoint Detection for Real Data Analysis ................................................................36
Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach .............56
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach .....56
Table 3.3: The Distribution of Number of Clusters ..................................................................58
Table 3.4: Estimated Cluster Configuration .................................................................................59
LIST OF FIGURES
Figure 1.1: An Example of Biodiversity Data ................................................................................3
Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32
Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................32
Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999) .........................................35
Figure 2.4: Fitted Incidence Rates (1973 - 1999) .........................................................................35
Figure 3.1: Dendrogram for the Clustering Estimation ................................................................61
Figure 3.2: Estimated Cluster Configuration ................................................................................62
Figure 3.3: Overall Trend by Clusters ..........................................................................................63
Chapter 1
Bayesian Inference for Covariance
Ordering in Multivariate Functional
Data
1.1 Introduction
Biodiversity, often defined as the variety of life at all levels of organization [9], has mostly
been studied at the species level. It is closely related to human lives and human activities
such as agriculture, health, business and industry. In recent years, unsustainable ecological
practices such as habitat destruction, agricultural over-harvesting, and pollution have been
damaging species diversity. According to the International Union for Conservation of Nature
(IUCN) [19], the world's main authority on the conservation status of species, through 2012,
19,817 out of 63,837 assessed species were threatened with extinction. The Convention on
Biological Diversity (CBD) was established to promote the conservation of biodiversity and
the sustainable use of its components. Statistical analysis of biodiversity measurements is
essential and valuable for conservationists to understand the nature of biodiversity. With
species indices and habitat variability assessments, conservationists can identify areas of
ecological importance that should be protected.
Audio measurements in suburban or rural areas can be used to study the impact of
human activities on other wildlife. The audio measurements, taken at regular time intervals,
are converted to energy readings on the frequency domain, and these readings are then
divided into several frequency bands. Each frequency band corresponds to a different
category of wildlife according to the sounds they create. By investigating how the energy
readings in each band change over time, we can study the interaction between different
groups of wildlife. For example, Figure 1.1 (a) shows the energy readings of several
frequency bands at one suburban location in the Upper Peninsula of Michigan on June 15,
2010, while Figure 1.1 (b) gives a similar plot for a rural location. Note that one of the
frequency bands, in black, is related to the sounds created by human activity, such as the
noise from automobile traffic on highways. We can see that the energy readings of the other
frequency bands (in red, green and blue) behave differently depending on the level of the
energy readings from human activity. Also, the patterns of such interaction vary across
locations, as in (a) and (b). This type of data can be viewed as a set of functional curves,
in which each curve corresponds to the energy readings over time for one frequency band.
To investigate the interaction between curves (species), we introduce spline mixed models.
Off-diagonal entries of the covariance matrix of the random effects can be interpreted as
correlations between different frequency bands (i.e., different species), so we can compare
off-diagonal entries to study the sensitivity of species in one frequency band to species in
another. This can be done by Bayesian hypothesis testing on the covariance ordering.
However, the entries of the covariance matrix lie in a constrained parameter space due to
the positive definiteness criterion, and different orderings of covariances may have
unbalanced prior probabilities, which is not desirable.
Figure 1.1: An Example of Biodiversity Data

In this chapter, we investigate the relationship and interaction between species by conducting Bayesian hypothesis testing on the covariance ordering of random effects in a spline
mixed model for multivariate functional data. We develop a default prior for the covariance
matrix which yields balanced prior probabilities on the covariance orderings after an
appropriate re-parametrization. With the proposed default prior, we fit a spline mixed model
using energy readings from two locations in the Upper Peninsula of Michigan. Our data
analysis reveals significant differences in the relationships among groups of species between
the two locations.
We state our spline mixed model in section 1.2 and introduce Bayesian hypothesis testing
as well as the construction of a default prior for the covariance matrix of random effects in
section 1.3. In section 1.4, we apply our method to the energy reading data from Michigan
to compare and discuss the activities of different species.
1.2 Spline Mixed Model
To investigate the interaction between multiple curves (functions) over time, which is of
interest in the biodiversity application, we consider a spline mixed model. Suppose that there
are d curves, {f_j(τ)}_{j=1}^d, on τ ∈ [0, T]. We observe these curves at n discrete time
points, 0 < τ1 < τ2 < · · · < τn < T, during the time period [0, T]. In the audio measurement
application, each curve corresponds to one frequency band and the time period is one day.
The data were collected over several days, and we treat the days as repeated measurements,
since it is reasonable to assume that the daily activity patterns of animals, including humans,
do not differ much across days. Let y^{(i)}_{jt} be the observed value of the j-th curve at
time τt on the i-th day, for j = 1, · · · , d, t = 1, · · · , n and i = 1, · · · , M.
For functional data, it is common to approximate a function using basis functions. In
this regard, we adopt an approximation of functions by spline basis functions. That is,

f(τ) ≈ Σ_{m=0}^{q} (τ^m / m!) β_m + Σ_{m=1}^{k} ((τ − ξ_m)_+^q / q!) γ_m,

where a_+ = max{a, 0}. This spline approximation uses order-q spline polynomial functions
with knot points ξ1, · · · , ξk. With the data structure described above, we introduce the
following spline mixed model in matrix form for the i-th day. Let
y^{(i)} = (y^{(i)}_{11}, y^{(i)}_{12}, . . . , y^{(i)}_{1n}, . . . , y^{(i)}_{d1}, y^{(i)}_{d2}, . . . , y^{(i)}_{dn})^T
be the set of observed curves for the i-th day, for i = 1, . . . , M. Then, we have
y(i) = (Id ⊗X)β(i) + (Id ⊗ Z)γ(i) + ε(i), (1.1)
where β^{(i)} = (β^{(i)}_{10}, · · · , β^{(i)}_{1q}, · · · , β^{(i)}_{d0}, · · · , β^{(i)}_{dq})^T are fixed effects,
γ^{(i)} = (γ^{(i)}_{11}, · · · , γ^{(i)}_{1k}, · · · , γ^{(i)}_{d1}, · · · , γ^{(i)}_{dk})^T are random effects, and
ε^{(i)} = (ε^{(i)}_{11}, ε^{(i)}_{12}, · · · , ε^{(i)}_{1n}, · · · , ε^{(i)}_{d1}, ε^{(i)}_{d2}, · · · , ε^{(i)}_{dn})^T are error terms.
The corresponding design matrices for the fixed effects and random effects are

X = [ 1   τ1   · · ·   τ1^q / q!
      1   τ2   · · ·   τ2^q / q!
      ⋮
      1   τn   · · ·   τn^q / q! ],

Z = (1/q!) [ (τ1 − ξ1)_+^q   (τ1 − ξ2)_+^q   · · ·   (τ1 − ξk)_+^q
             (τ2 − ξ1)_+^q   (τ2 − ξ2)_+^q   · · ·   (τ2 − ξk)_+^q
             ⋮
             (τn − ξ1)_+^q   (τn − ξ2)_+^q   · · ·   (τn − ξk)_+^q ].
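As a concrete illustration, the two design matrices can be assembled directly from their definitions. The sketch below is not code from the dissertation; the time grid and knot placement are hypothetical stand-ins.

```python
import numpy as np
from math import factorial

def design_matrices(tau, q, knots):
    """X: polynomial columns tau^m / m!, m = 0..q;
    Z: truncated-power columns (tau - xi)_+^q / q!, one per knot xi."""
    X = np.column_stack([tau**m / factorial(m) for m in range(q + 1)])
    Z = np.column_stack([np.maximum(tau - xi, 0.0)**q / factorial(q)
                         for xi in knots])
    return X, Z

# Hypothetical grid: 48 half-hourly points over a day, 8 equally spaced knots
tau = np.linspace(0.0, 23.5, 48)
knots = np.linspace(2.5, 21.5, 8)      # illustrative knot placement
X, Z = design_matrices(tau, q=1, knots=knots)
print(X.shape, Z.shape)  # (48, 2) (48, 8)
```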
We also assume that ε^{(i)} ∼ N(0, Σ), where Σ = D ⊗ Σϕ with D = diag(σ_1^2, σ_2^2, . . . , σ_d^2)
and Σϕ = (ϕ_{uv})_{n×n}, where

ϕ_{uv} = 1 if u = v,  ϕ if |u − v| = 1,  0 otherwise.

Here σ_j^2 allows each curve to have its own variability, and Σϕ is a temporal covariance
matrix that captures possible temporal dependence within each curve.

For the covariance structure of the random effects, we assume γ^{(i)} ∼ N(0, Σ0 ⊗ I_k) with
Σ0 = (σ_{jl})_{d×d}, so that cov(γ^{(i)}_j, γ^{(i)}_l) = σ_{jl} I_k for j, l = 1, . . . , d, and σ_{jl} models
the interaction between the j-th and l-th curves.
Let y = (y^{(1)}, y^{(2)}, . . . , y^{(M)})^T be the collection of all data points. Then, the matrix
form of the entire model is

y = Xβ + Zγ + ε, (1.2)

where X = I_M ⊗ I_d ⊗ X and Z = I_M ⊗ I_d ⊗ Z. Note that γ ∼ N(0, G) with G = I_M ⊗ Σ0 ⊗ I_k.
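The Kronecker structure of Σ and G can be formed directly with numpy; the dimensions and parameter values below are hypothetical stand-ins, not those of the data analysis.

```python
import numpy as np

# Illustrative dimensions and parameter values (stand-ins only)
M, d, n, k = 3, 2, 5, 4
phi = 0.2
sigma2 = np.array([1.0, 2.0])                       # sigma_j^2, j = 1..d

# Tridiagonal temporal covariance Sigma_phi and one-day error covariance
Sigma_phi = np.eye(n) + phi * (np.eye(n, k=1) + np.eye(n, k=-1))
D = np.diag(sigma2)
Sigma = np.kron(D, Sigma_phi)                       # Sigma = D (x) Sigma_phi

# Random-effect covariance G = I_M (x) Sigma0 (x) I_k
Sigma0 = np.array([[1.0, 0.3],
                   [0.3, 0.5]])
G = np.kron(np.eye(M), np.kron(Sigma0, np.eye(k)))
print(Sigma.shape, G.shape)  # (10, 10) (24, 24)
```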
1.3 Bayesian Hypothesis Testing on Covariance Ordering
To investigate the interaction between curves using the spline mixed model introduced in the
previous section, we consider orderings of the covariance matrix of the random effects. In
particular, we are interested in a Bayesian test on a specific ordering within a row, e.g.,
σ12 > σ13 > · · · > σ1d > 0.
1.3.1 Model Selection and Hypothesis Testing
Let θ = Σ0 be the parameter of interest. We assume that the parameter space Θ is
partitioned into subsets Θ_i such that Θ = ∪_{i=1}^{d!} Θ_i, where the Θ_i correspond to the
d! possible ordering restrictions of {σ12, σ13, . . . , σ1d, 0}. Note that each subspace Θ_i
represents a model M_i with the corresponding ordering. To perform model selection, we
consider testing the hypothesis H_i : θ ∈ Θ_i versus H_j : θ ∈ Θ_j, where i, j = 1, . . . , d!,
i ≠ j.
Given the priors πi = π(Θi) and πj = π(Θj), suppose the posterior probabilities are
νi = π(Θi | y) and νj = π(Θj | y). Then the Bayes factor is defined as

B = (posterior odds ratio) / (prior odds ratio) = (νi/νj) / (πi/πj)
  = [ ∫_{Θi} π(θi) f(y | θi) dθi / ∫_{Θj} π(θj) f(y | θj) dθj ] / (πi/πj). (1.3)
When the Bayes factor becomes large enough, we tend to propose Hi.
To compute the Bayes factor, we need an appropriate prior on Σ0. The prior should yield
a proper posterior distribution, and the ratio πi/πj should be easy to compute. However,
common priors for a covariance matrix, such as Jeffreys' prior and the inverse Wishart
prior, do not satisfy these criteria. This leads us to develop a default prior that yields a
proper posterior distribution and is simple to compute with.
First, we show that Jeffreys' reference prior leads to an improper posterior, so that it is
not suitable for our test.

Theorem 1. Assume d ≥ 4 and Mk > 2. The improper priors π(Σ0) ∝ |Σ0|^{−p} with
(1) p = (d + 1)/2, (2) p = d/2 − 1 and (3) p = d give rise to improper posteriors.

The proof of the theorem is provided in Appendix A.

By Theorem 1, we cannot calculate νi = π(Θi | y) and νj = π(Θj | y), which indicates
the inappropriateness of Jeffreys' reference prior for the proposed spline mixed models.
When an inverse Wishart prior is considered, it is clear that the posterior distribution is
proper. However, the inverse Wishart prior is not feasible for Σ0 in our testing problem,
since it has no explicit ordering representation. For instance, assume d = 4 and consider
testing the hypotheses H1 : σ12 > σ13 > σ14 > 0 vs. H2 : σ12 > 0 > σ13 > σ14. Note that in
this case, Θ1 = {σ12 > σ13 > σ14 > 0} and Θ2 = {σ12 > 0 > σ13 > σ14}. Then, for the inverse
Wishart prior Σ0 ∼ IW(S0, m), where S0 is a d × d positive definite matrix and m > d − 1
is a constant, it is difficult to calculate π1 = P(Σ0 ∈ Θ1) and π2 = P(Σ0 ∈ Θ2). Also, there
is no guarantee of equal prior probabilities, so prior subjectivity can affect the Bayes
factor. In the next section, we introduce a new prior on the covariance matrix which solves
both issues. That is, the new prior guarantees propriety of the posterior distribution and
equal prior probabilities, so that we do not need to calculate πi and πj in the Bayes factor
due to cancellation in πi/πj.
1.3.2 New Prior on Σ0 with Ordering Representation
We propose a partially improper prior on the covariance matrix via the Cholesky
decomposition. For simplicity, we assume Σϕ = I_n. As we see later in section 1.4, the real
data support this simple assumption.
Let Σ0 = LL^T be the Cholesky decomposition of Σ0, where

L = [ ℓ11   0
      ℓ1    L11 ]

is a lower triangular matrix. Then we have the following representation for Σ0:

Σ0 = [ σ11     σ1*
       σ1*^T   Σ11 ]
   = [ ℓ11    0   ] [ ℓ11   ℓ1^T
       ℓ1     L11 ] [ 0     L11^T ]
   = [ ℓ11^2     ℓ11 ℓ1^T
       ℓ11 ℓ1    ℓ1 ℓ1^T + L11 L11^T ],

where ℓ11 > 0 and ℓ1 ∈ R^{d−1}.
This re-parametrization separates out the constrained vector of parameters (ℓ1) from the
unconstrained parameters (ℓ11 and L11). Note that σ1* = (σ12, σ13, . . . , σ1d) = ℓ11 ℓ1^T.
Since ℓ11 > 0, each subset Θ_i of the parameter space corresponds to an ordering of the
entries of ℓ1, for i = 1, 2, . . . , d!. The following theorem guarantees a proper posterior
distribution when the flat prior on ℓ1 is used.
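The correspondence between orderings of ℓ1 and orderings of σ1* can be checked numerically; the covariance matrix below is a hypothetical example, not one from the data.

```python
import numpy as np

# Hypothetical 4x4 positive definite covariance matrix
Sigma0 = np.array([[ 2.0, 0.8, 0.5, -0.3],
                   [ 0.8, 1.5, 0.4,  0.2],
                   [ 0.5, 0.4, 1.2,  0.1],
                   [-0.3, 0.2, 0.1,  1.0]])
L = np.linalg.cholesky(Sigma0)          # lower triangular, Sigma0 = L L^T
l11, l1 = L[0, 0], L[1:, 0]

# sigma_{1*} = l11 * l1, so with l11 > 0 the two orderings coincide
sigma_1star = Sigma0[0, 1:]
assert np.allclose(l11 * l1, sigma_1star)
assert np.array_equal(np.argsort(l1), np.argsort(sigma_1star))
```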
Theorem 2. Assume n > q + 1 + k and Mk ≥ d − 1. Let Σ0 = LL^T, where

L = [ ℓ11   0
      ℓ1    L11 ].

Suppose that the priors satisfy

(1) π(β) ∝ 1,
(2) π(D) = Π_{j=1}^{d} π(σ_j^2), where π(σ_j^2) ∝ (1/σ_j^2)^{α_j + 1}, α_j > 0, for all j = 1, . . . , d,
(3) π(L) = π(ℓ11) π(ℓ1) π(L11), where π(ℓ1) ∝ 1, and π(ℓ11) and π(L11) are proper.

Then the posterior is proper.
The proof of the theorem is provided in Appendix A.
Now we reconsider the testing example H1 : σ12 > σ13 > σ14 > 0 vs. H2 : σ12 > 0 >
σ13 > σ14. Theorem 2 guarantees proper posteriors, so that ν1 = π(Θ1|y) and ν2 = π(Θ2|y)
are well defined. Furthermore, when the flat prior π(ℓ1) ∝ 1 is used, the relationship
between σ1* and ℓ1 implies π(Θ1) = π(Θ2) = 1/d! once 0 is included in the ordering of the
covariance parameters. Hence, in this situation, the priors cancel out and the Bayes factor
reduces to the posterior ratio B = νi/νj, which can be calculated from posterior samples
straightforwardly.
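Given posterior draws, the posterior probability of each ordering is just the fraction of draws satisfying it, and the Bayes factor is their ratio. The sketch below uses simulated stand-in draws, not draws from the actual posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "posterior draws" of (sigma_12, sigma_13, sigma_14); in practice
# these come from the Gibbs sampler, here they are simulated for illustration
draws = rng.multivariate_normal([0.13, 0.11, 0.06], 0.001 * np.eye(3), size=5000)
s12, s13, s14 = draws.T

# Posterior probability of an ordering = fraction of draws satisfying it
nu_1 = np.mean((s12 > s13) & (s13 > s14) & (s14 > 0))  # sigma12>sigma13>sigma14>0
nu_2 = np.mean((s13 > s12) & (s12 > s14) & (s14 > 0))  # sigma13>sigma12>sigma14>0

# Equal prior probabilities cancel, so the Bayes factor is the posterior ratio
B = nu_1 / nu_2
print(nu_1, nu_2, B)
```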
1.3.3 Bayesian Estimation
To obtain posterior samples of the parameters for calculating the Bayes factor, we use a
Markov chain Monte Carlo (MCMC) algorithm, in particular, the Gibbs sampling algorithm.
To derive the posterior distribution, we consider the following priors for the parameters β,
σ_j^2 and Σ0. We assume π(β) ∝ 1 and σ_j^2 ∼ IG(a_j, b_j), where IG(a, b) denotes the
inverse Gamma distribution with density π(x) ∝ x^{−(a+1)} e^{−1/(bx)}. For the prior of Σ0,
we use the prior on L developed in the previous section. That is, π(L) = π(ℓ11) π(ℓ1) π(L11)
with π(ℓ1) ∝ 1, ℓ11 ∼ IG(a0, b0) and L11 L11^T ∼ IW(Ω11, d − 1). Then, by Theorem 2,
the posterior distribution is proper and the MCMC algorithm is valid with the proposed
prior.

In this section, we derive the conditional posterior distributions of ℓ11, ℓ1 and L11. The
conditional posterior distributions of the other parameters are given in Appendix B.
First, we introduce the matrix S of quadratic forms of the random effects. That is,

S = Σ_{i=1}^{M} Σ_{l=1}^{k} g_l^{(i)T} g_l^{(i)}, with g_l^{(i)} = (γ^{(i)}_{1l}, γ^{(i)}_{2l}, . . . , γ^{(i)}_{dl}), for l = 1, . . . , k and i = 1, . . . , M.
Then, we can obtain the conditional posterior distributions of ℓ1, ℓ11^2 and L11 L11^T:

ℓ1 | γ, ℓ11, L11, . . . ∼ N(ℓ1*, V*_{ℓ1}),
ℓ11^2 | γ, ℓ1, L11, . . . ∼ IG( (Mk − d + 1)/2 + α0, (s11/2 + 1/β0)^{−1} ),
L11 L11^T | γ, ℓ11, ℓ1 ∼ IW( Mk − d − 1, S* + I_{d−1} ),

where ℓ1* = (ℓ11/s11) s1, V*_{ℓ1} = (ℓ11^2/s11)(L11 L11^T) and S* = S11 − (1/s11) s1 s1^T for

S = [ s11   s1^T
      s1    S11 ].
With the above conditional posterior distributions, we can obtain posterior samples by the
Gibbs sampling algorithm.
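One pass of these conditional draws can be sketched as follows, with hypothetical stand-ins for S and the hyperparameters, using scipy's inverse-gamma and inverse-Wishart samplers (in the real sampler, S is recomputed from the current draw of γ at every iteration).

```python
import numpy as np
from scipy.stats import invgamma, invwishart

rng = np.random.default_rng(1)
d, M, k = 4, 15, 8                      # dimensions as in the data analysis
alpha0, beta0 = 2.0, 1.0                # hypothetical hyperparameters

# Stand-in for S (d x d), the summed outer products of random-effect rows
A = rng.standard_normal((M * k, d))
S = A.T @ A
s11, s1, S11 = S[0, 0], S[1:, 0], S[1:, 1:]

# l11^2 | ... ~ IG((Mk - d + 1)/2 + alpha0, (s11/2 + 1/beta0)^(-1));
# for the density pi(x) ~ x^(-(a+1)) e^(-1/(bx)), scipy's scale equals 1/b
l11_sq = invgamma.rvs((M * k - d + 1) / 2 + alpha0,
                      scale=s11 / 2 + 1.0 / beta0, random_state=rng)

# L11 L11^T | ... ~ IW(Mk - d - 1, S* + I_{d-1}), S* = S11 - s1 s1^T / s11
S_star = S11 - np.outer(s1, s1) / s11
L11L11T = invwishart.rvs(df=M * k - d - 1, scale=S_star + np.eye(d - 1),
                         random_state=rng)

# l1 | ... ~ N(l1*, V*), l1* = (l11/s11) s1, V* = (l11^2/s11) L11 L11^T
l11 = np.sqrt(l11_sq)
l1 = rng.multivariate_normal(l11 / s11 * s1, l11_sq / s11 * L11L11T)
print(l11_sq > 0, l1.shape)  # True (3,)
```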
1.4 Analysis of Audio Frequency Data of Wildlife Species
in Michigan
In this section, we apply our methodology to audio frequency data of wildlife species in
Michigan.
The wildlife sounds at two locations, L00 and L02, in the Upper Peninsula of Michigan,
U.S., were recorded in 1-minute intervals every 30 minutes, from 12 a.m. to 11:30 p.m. each
day, from June 1, 2010 to June 15, 2010. L00 is located in the Crawford Bay Lane area,
Cheboygan, MI, close to main city roads, while L02 is at Crane Nest Narrows, Cheboygan,
MI, a relatively pristine area. The audio data were then converted into frequency data and
divided into ten frequency bands which correspond to different groups of species (human,
bird type 1, bird type 2, etc.). More information about the source of the data can be found
at the Remote Environmental Assessment Laboratory at Michigan State University
(real.msu.edu). The behavior of the energy readings in each frequency band represents the
activities of different types of species, and we can observe how other species' activities
interact with human activity through how the energy reading curves change over the time
of day.
Let {P^{(i)}_j(t) ≥ 0 : τt ∈ [0, T]; j = 1, . . . , D; i = 1, . . . , M} be the energy readings
for the D frequency bands obtained from the recorded audio measurements. Since the energy
readings are scaled, P^{(i)}_j(t) is the proportion for the j-th frequency band at time τt on
the i-th day. Thus, a natural constraint on P^{(i)}_j(t) is Σ_{j=1}^{D} P^{(i)}_j(t) = 1 for
t = 1, · · · , n and i = 1, . . . , M. For fixed i and t, {P^{(i)}_j(t)}_{j=1}^{D} can be viewed as
compositional data. The additive log-ratio transformation is widely applied to compositional
data, and we adopt it for P^{(i)}_j(t) as well. We define the additive log-ratio transformation
as

y^{(i)}_{jt} = log( P^{(i)}_j(t) / P^{(i)}_D(t) ), j = 1, . . . , D − 1.
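A minimal sketch of the additive log-ratio transform (the composition below is a hypothetical example):

```python
import numpy as np

def alr(P):
    """Additive log-ratio transform of compositional rows:
    y_j = log(P_j / P_D), dropping the last (reference) component."""
    P = np.asarray(P, dtype=float)
    return np.log(P[..., :-1] / P[..., -1:])

# Hypothetical composition over D = 3 frequency bands at one time point
p = np.array([0.5, 0.3, 0.2])
y = alr(p)
print(np.round(y, 4))  # [0.9163 0.4055], i.e. log(0.5/0.2), log(0.3/0.2)
```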
Assume that the set of time points is T = {τ1, τ2, . . . , τn} with 0 < τ1 < τ2 < · · · < τn < T,
and let d = D − 1. Let {y^{(i)}_{jt} : t = 1, . . . , n; j = 1, . . . , d; i = 1, . . . , M} be the
response variables for the proposed spline mixed model.
Since the last four frequency bands contribute relatively little to the audio intensity
throughout the period, we keep the first six frequency bands and combine the last four into
a single component. Thus, d = 6. The number of time points is n = 48 and the number of
days is M = 15. We then apply the spline mixed model introduced in section 1.2. An
intuitive choice for the order of the polynomial for the fixed effects is q = 1. We set up
k = 8 equally spaced time points as knots for the spline functions.
For the model (1.2) in section 1.2, we start by allowing some level of time dependence in Σϕ
through the parameter ϕ. To investigate the time dependence, we obtain the maximum
likelihood estimates (MLE) of β, γ and ϕ by treating γ as fixed effect coefficients for
simplicity. The derivation of the estimating equations for the MLE is provided in Appendix
C. The MLE of ϕ is −0.038 for location L00 and 0.024 for location L02. In both locations
the correlation between adjacent time points is weak, which suggests assuming ϕ = 0 for the
rest of the analysis.
1.4.1 Bayesian Inference Results
We run the MCMC algorithm via Gibbs sampling for N = 5000 iterations with three chains.
To check the convergence of the chains, we carry out Gelman-Rubin diagnostics by
calculating the potential scale reduction factor [10]. All three chains converged after 2000
iterations. Our estimation is based on the posterior mean of the last 1500 posterior samples,
i.e., 500 posterior samples from each chain. The upper 4 × 4 submatrices of the estimated
covariance matrices of the random effects are shown below, giving the MLE and the Bayesian
estimate for each of the locations L00 and L02.
For location L00,
Σ0 (MLE) =

[ 0.2307  0.1384  0.1140  0.0611
  0.1384  0.1872  0.1546  0.0826
  0.1140  0.1546  0.1703  0.1056
  0.0611  0.0826  0.1056  0.1114 ],

Σ0 (Bayesian) =

[ 0.2646  0.1334  0.1079  0.0605
  0.1334  0.2119  0.1599  0.0830
  0.1079  0.1599  0.1906  0.1143
  0.0605  0.0830  0.1143  0.1481 ].
For location L02,
Σ2 (MLE) =

[  0.0768   0.0075  −0.0326  −0.0365
   0.0075   0.0758   0.0686   0.0400
  −0.0326   0.0686   0.1500   0.1209
  −0.0365   0.0400   0.1209   0.1671 ],

Σ2 (Bayesian) =

[  0.1118   0.0287  −0.0201  −0.0271
   0.0287   0.1073   0.0795   0.0361
  −0.0201   0.0795   0.1657   0.1173
  −0.0271   0.0361   0.1173   0.1865 ].
We can see that, overall, the Bayesian estimates for both locations are very close to the
corresponding MLEs. They have the same signs for all entries of the covariance matrices,
which supports our Bayesian method from the likelihood perspective. Moreover, unlike at
location L00, the Bayesian estimate at L02 differs slightly from the MLE in that it gives a
larger variance σ_j^2 for each component and negative correlations (σ13 and σ14) of smaller
magnitude. Finally, the opposite signs of the corresponding entries at the two locations
suggest a difference between the locations in the overall relationship (over time points)
between band 1 and either band 3 or band 4. However, to further support this finding, we
have to perform a more detailed posterior sample analysis.
The Bayesian method allows us to obtain posterior samples of σ_{jl}, from which we can
investigate the interaction between different types of species (i.e., different frequency bands)
as well as compare the results from the two locations. In particular, we are interested in the
relationship between frequency band 1 and frequency bands 2 to 4. Frequency band 1
corresponds to the sounds from human activity, while frequency bands 2, 3 and 4 represent
three different categories of wildlife, including birds. So, we compute the posterior
probabilities of all orderings of {σ12, σ13, σ14, 0}; 0 is included to determine whether each
σ_{jl} is positive or negative. There are 4! = 24 possible orderings. We list the three highest posterior probabilities
among 24 possible orderings for each location.
For location L00,
P(Θ1^{(0)} = {σ12 > σ13 > σ14 > 0} | data) = P(ℓ12 > ℓ13 > ℓ14 > 0 | data) = 0.702,
P(Θ2^{(0)} = {σ13 > σ12 > σ14 > 0} | data) = P(ℓ13 > ℓ12 > ℓ14 > 0 | data) = 0.133,
P(Θ3^{(0)} = {σ12 > σ14 > σ13 > 0} | data) = P(ℓ12 > ℓ14 > ℓ13 > 0 | data) = 0.083.
For location L02,
P(Θ1^{(2)} = {σ12 > 0 > σ13 > σ14} | data) = P(ℓ12 > 0 > ℓ13 > ℓ14 | data) = 0.493,
P(Θ2^{(2)} = {σ12 > σ13 > 0 > σ14} | data) = P(ℓ12 > ℓ13 > 0 > ℓ14 | data) = 0.222,
P(Θ3^{(2)} = {σ12 > 0 > σ14 > σ13} | data) = P(ℓ12 > 0 > ℓ14 > ℓ13 | data) = 0.149.
Bayes factors computed from the posterior probabilities can be used to select a particular
ordering for each location. Thus, we consider testing the following hypotheses
Hi : θ^{(L)} ∈ Θi^{(L)} vs. Hj : θ^{(L)} ∈ Θj^{(L)},

where i, j = 1, 2, 3 index the orderings with high posterior probabilities and L = 0, 2
indicates the location of the data. Since π(Θi^{(L)}) = π(Θj^{(L)}) for all i, j = 1, 2, 3 and
L = 0, 2, the Bayes factor is just a ratio of posterior probabilities, i.e., Bij^{(L)} = νi^{(L)}/νj^{(L)}.
Tables 1.1 and 1.2 show the Bayes factors Bij^{(L)} at locations L00 and L02, respectively.
Denom. \ Num.          σ12 > σ13 > σ14 > 0   σ13 > σ12 > σ14 > 0   σ12 > σ14 > σ13 > 0
σ12 > σ13 > σ14 > 0             1                   0.19                  0.12
σ13 > σ12 > σ14 > 0           5.28                    1                   0.63
σ12 > σ14 > σ13 > 0           8.42                  1.60                    1

Table 1.1: Bayes Factor for Significant Orderings at L00

Denom. \ Num.          σ12 > 0 > σ13 > σ14   σ12 > σ13 > 0 > σ14   σ12 > 0 > σ14 > σ13
σ12 > 0 > σ13 > σ14             1                   0.45                  0.30
σ12 > σ13 > 0 > σ14           2.22                    1                   0.67
σ12 > 0 > σ14 > σ13           3.32                  1.50                    1

Table 1.2: Bayes Factor for Significant Orderings at L02

From the above analysis, we found significant differences in the patterns of the audio
frequency bands over time at the two locations, L00 and L02. At L00, all three bands 2, 3
and 4 were synchronous with human activity, with band 4 at the lowest level and band 2 at
the highest. On the other hand, at the pristine location L02, band 4 was most sensitive to
human activity, followed by band 3, while band 2 remained synchronous with human activity.
1.4.2 Discussion
In this chapter, we investigated audio measurement data from a real ecological setting.
Our primary objective was to extract information reflecting the relationship and interaction
between the variables of interest, and then to quantify the synchronicity or discrepancy of
such interactions across locations. By treating the observations as compositional data
varying over time, we are able to view them as functional curves. We proposed a spline
mixed model and interpreted the random effect coefficients of the spline polynomials as a
measure of interaction between curves over time. By specifying the covariance components
of the random effects, we transformed our questions into Bayesian inference on the orderings
of covariance matrices. To solve this problem, we developed a new non-informative prior on
covariance matrices with orderings which possesses posterior propriety as well as simple
implementation for testing. Finally, we applied our model with the new prior to the
biodiversity data from Michigan, and our Bayesian analysis successfully identified the main
characteristics of the interactions between species.

Some further extensions of the current work may be worthwhile. For instance, it would
be challenging but interesting to generalize the proposed prior to covariance matrices with
a wider class of orderings.
Chapter 2
Joinpoint Detection using
L1-Penalized Spline Method
2.1 Introduction
Cancer is one of the major causes of death in the United States. According to a global and
regional mortality study of 235 causes of death for 20 age groups published in The Lancet,
cancer claimed 8.0 million lives in 2010, 15.1% of all deaths worldwide, with large increases
in deaths from trachea, bronchus, and lung cancers [15]. Statistical analyses of cancer
incidence have gained increasing attention in recent years given this challenging public
health issue. Many such analyses investigate a single cancer rate curve varying over time. A
major interest is detecting changes in the trend of annual cancer incidence rates. This
concern can be viewed as a joinpoint identification problem based on a certain structure of
regression models.
Several methods have been developed for the joinpoint problem recently, including [14],
[21], [11] and [4]. In general, the joinpoint regression model can be defined in the following
way. Let y be an observed response at time point x ∈ S = {x1, x2, · · · , xn} with x1 ≤ x2 ≤
· · · ≤ xn. Then we assume
y = β0 + β1x + δ1(x − τ1)+ + · · · + δk(x − τk)+ + ε,    (2.1)
where τ1 < τ2 < · · · < τk are the unknown joinpoints, ε represents the error, and a₊ = a·1{a > 0}.
The main objective of joinpoint detection is to discover the correct number of joinpoints (k)
as well as their locations (the τj's).
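To make the model concrete, the following Python sketch (with illustrative slope and joinpoint values, not taken from the text) evaluates the piecewise-linear mean of model (2.1) built from truncated-line bases:

```python
import numpy as np

def joinpoint_mean(x, beta0, beta1, deltas, taus):
    """Mean of model (2.1): a line with slope changes delta_k at the taus."""
    mu = beta0 + beta1 * x
    for d, tau in zip(deltas, taus):
        mu = mu + d * np.maximum(x - tau, 0.0)   # truncated-line basis (x - tau)_+
    return mu

# hypothetical example: two joinpoints at 8 and 18 on a grid of 27 time points
x = np.arange(1, 28, dtype=float)
mu = joinpoint_mean(x, beta0=5.0, beta1=0.01, deltas=[0.02, -0.02], taus=[8, 18])
```

The slope is β1 before τ1, β1 + δ1 between τ1 and τ2, and β1 + δ1 + δ2 afterwards.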
One classical permutation test based (PTB) approach, proposed by Kim, et al. [14]
in 2000, has been adopted by the National Cancer Institute (NCI) as an analytical tool
(Joinpoint Software) in their Surveillance Research Program. It generally applies a grid
search technique to fit a regression model under homoscedastic and uncorrelated errors. The
identification of significant joinpoints is conducted by multiple pairwise permutation tests
which compare two best regression models fitted with different sets of presumed joinpoints.
From a statistical perspective, it involves only basic estimation and testing for piecewise
linear regression models and is thus well motivated and easy to understand. However, due to
the nature of the permutation test and the grid search technique, the PTB approach requires
heavy computation when either the maximum possible number of joinpoints (k) or the sample
size (n) is relatively large, typically when k > 5 or n > 50. This limitation makes joinpoint
detection on large datasets infeasible with the PTB.
In this chapter we propose a new method to detect the joinpoints on a discrete time grid
by introducing an ℓ1-penalized regression spline model. Our model specifies spline bases at
the time knots as the covariates and is thus able to capture potential joinpoints as a
subset of statistically significant slope changes. Meanwhile, by adding a penalty term, we
allow for shrinkage of the spline model so that insignificant slope changes are eliminated.
The rest of this chapter is organized in the following way. We define the penalized
regression spline model in section 2.2 and validate some theoretical properties of our estimates
in section 2.3. In section 2.4, we carry out a simulation study of our method and compare
the performance with the PTB method. Section 2.5 provides a case study of real data by
applying our approach to the cancer incidence rates for the overall U.S. from 1973 to 1999.
Note that a comparison with the results from the Joinpoint Software will also be presented.
We finally conclude with a discussion of possible extensions as well as potential future
work in section 2.6.
2.2 L1-Penalized Regression Spline Model
In this section we introduce the penalized regression spline model. Assume we observe data
{y_i}_{i=1}^n independently at time points {x_i}_{i=1}^n with x1 ≤ x2 ≤ · · · ≤ xn. We define a
penalized regression model with first-order spline bases as follows.
y = β0 + β1x + β2(x − τ2)+ + · · · + βp−1(x − τp)+ + ε,    (2.2)
where ε ∼ N(0, σ2) and τ2 < · · · < τp are (p − 1) time knots in the entire time interval. It
specifies p covariates as in the linear model.
For the annual cancer incidence rate problem, it is appropriate and straightforward to
assume the set of candidate joinpoints is identical to the set of observed covariates (years),
that is, p = n − 1 and τj = xj for j = 2, · · · , n − 1. Without loss of generality, we
assume the time points are equally spaced without ties. Furthermore, to guarantee some
conditions for the theoretical justification of our model, we rescale the observed time
points along with the spline bases so that xj = j/n, j = 1, 2, · · · , n. Introducing a matrix
form representation of our model, we have
y = Xβ + ε, ε ∼ N(0, σ2In), (2.3)
where y = (y1, y2, · · · , yn)T , β = (β0, β1, · · · , βn−1)T , ε = (ε1, ε2, · · · , εn)T and
X =

    | 1   x1   (x1 − x2)+   · · ·   (x1 − xn−1)+ |
    | ⋮    ⋮        ⋮          ⋱          ⋮      |
    | 1   xn   (xn − x2)+   · · ·   (xn − xn−1)+ |    (2.4)
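A minimal construction of the design matrix (2.4) with the rescaled knots xj = j/n might look as follows (a sketch of the layout displayed above, with the intercept, the linear term, and one truncated-line column per interior knot):

```python
import numpy as np

def spline_design(n):
    """Design matrix (2.4): columns are 1, x, and (x - x_j)_+ for j = 2..n-1,
    evaluated on the rescaled grid x_i = i/n."""
    x = np.arange(1, n + 1) / n
    cols = [np.ones(n), x]
    for j in range(2, n):                       # interior knots x_2, ..., x_{n-1}
        cols.append(np.maximum(x - x[j - 1], 0.0))
    return np.column_stack(cols)                # shape (n, n) under this design

X = spline_design(27)
```

Note that the first row has zeros in all truncated-line columns, since x1 is below every knot.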
Note that the PTB method works only on a small subset of time points as a candidate set
for joinpoints when performing the estimation or testing via grid search. On the other hand,
our penalizing method allows for the entire set of candidates and is motivated in a natural,
data-driven way. Note that in the model we exclude the first year as well as the last year,
since it is not reasonable to view them as possible joinpoints without further information.
When the sample size n is large, the usual least squares estimate of β may contain a large
number of nonzero components and fail to be consistent. This suggests applying one of the
most commonly used penalty structures, the Least Absolute Shrinkage and Selection Operator
(LASSO), introduced by Tibshirani [20], which amounts to minimizing the ℓ1-penalized negative
log-likelihood. Because of the singularity of the ℓ1 norm at the origin, LASSO is a powerful
tool for model selection. By using it, we keep only the candidates with nonzero estimated
coefficients, which can automatically be interpreted as significant slope changes and hence
as solid evidence of joinpoints. In our study, we treat the intercept (β0) and the first
slope (β1) as the baseline trend of the curves. It follows that they are not supposed to
characterize joinpoints and thus are not penalized like the other truncated spline
coefficients. However, it is still feasible in practice to restrict their absolute magnitude
by a pre-specified constant. So we consider a constrained parameter space
B = {β : |β0| + |β1| ≤ M}, where M is a positive constant independent of n. Our LASSO
type estimator is given by

β̂ = argmin_{β∈B} { ‖y − Xβ‖²₂/n + λ Σ_{j=2}^{n−1} |βj| }.    (2.5)
The parameter λ above is called the tuning (shrinkage) parameter; it essentially controls
how strongly the β's are penalized. In practice, λ is often pre-specified or obtained by a
data-driven validation procedure. Several algorithms for LASSO, including the well-known
Least Angle Regression (LAR) [5], have been developed in recent years. Most of them
automatically assume penalization of the full set of coefficients. However, since in our
model the baseline slope β1 is not penalized, we have to use an implementation which allows
penalization of only a partial set of parameters. The coordinate descent algorithm proposed
by Friedman, Hastie and Tibshirani [8] meets our demands by solving the regularization
paths as we desire.
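As a rough illustration of such a partial-penalty scheme (a didactic re-implementation of the idea, not the coordinate descent code of [8]), one can cyclically soft-threshold only the penalized coordinates and give the unpenalized ones a plain least-squares update:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def partial_lasso(X, y, lam, penalize, n_iter=500):
    """Cyclic coordinate descent for (1/n)||y - X b||^2 + lam * sum |b_j|,
    where the sum runs only over coordinates with penalize[j] = True."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ b                               # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]              # partial residual excluding j
            rho = X[:, j] @ r
            if penalize[j]:
                b[j] = soft(rho, n * lam / 2.0) / col_sq[j]
            else:
                b[j] = rho / col_sq[j]          # unpenalized: least-squares update
            r = r - X[:, j] * b[j]
    return b

# toy check: the response depends only on the two unpenalized columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] + 2.0 * X[:, 1]
b = partial_lasso(X, y, lam=5.0, penalize=[False, False, True, True])
```

With a sufficiently large λ the penalized coefficients are driven exactly to zero while the unpenalized baseline terms are fitted freely.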
2.3 Consistency of the LASSO Type Estimator
As we mentioned in the previous section, the LASSO method features computational
feasibility. Meanwhile, under certain assumptions on the regression model, it also provides
statistical accuracy: the LASSO estimator has been proved consistent in a series of notable
works, with some basic results available in [3]. We will provide a proof of consistency of
the LASSO type estimator in (2.5) following their approach. From now on we define
Σ = X^T X/n and denote its j-th diagonal entry by σ²_j, i.e. σ²_j = Σ_jj. Meanwhile, given
the random error ε, we define the event J_{λ0} = {ω : max_{1≤j≤n} 2|ε^T X^(j)|/n ≤ λ0},
where X^(j) denotes the j-th column of the
design matrix X. We first give a lemma showing that the event J_{λ0} has high probability.
Lemma 1. Assume σ²_j ≤ K for j = 1, 2, · · · , n and some K > 0. Then for all t > 0 and
λ0 = 2Kσ√((t² + 2 log n)/n), we have

P(J_{λ0}) ≥ 1 − 2 exp(−t²/2).    (2.6)
The proofs of Lemma 1 and the other propositions, lemmas and theorems are provided in the
Appendix. In particular, taking t² ≥ 4 log n, we obtain

P( max_{1≤j≤n} 2|ε^T X^(j)|/n > λ0 ) ≤ 2e^{−2 log n} = 2/n²    (2.7)

for λ0 ≥ 2√6 Kσ√(log n / n).
In practice we may need to replace the unknown variance parameter σ² by a reasonable
estimator σ̂². The following lemma shows that we can choose the sample second raw moment
as a candidate for σ̂².

Lemma 2. Let r = ‖Xβ‖²₂/σ² be the signal-to-noise ratio of the model (2.2). Consider the
estimator σ̂² = Y^T Y/n for σ². Given any 0 < α < 1, we have

P(σ̂² < (1 − α)σ²) ≤ min{ 2(n + 2r)/(α²n²), (12(n + 2r)² + 48(n + 4r))/(α⁴n⁴), 1 }.    (2.8)
The previous lemma states that the σ̂² proposed above cannot be much smaller than σ².
Now, combining the two lemmas above, we conclude the consistency of our estimator β̂.
Theorem 3 (Consistency of the LASSO Type Estimator). Suppose σ²_j ≤ K for j = 1, 2, · · · , n
and some K > 0. Let σ̂² = Y^T Y/n and λ = 12Kσ̂√(log n / n). Consider the model (2.2)
with the estimator β̂ in (2.5). If

lim_{n→∞} ‖β‖₁ / √(n / log n) = 0  and  r_n = ‖Xβ‖²₂/σ² ≤ R < ∞ for some R > 0,

then we have ‖X(β̂ − β)‖²₂/n → 0 almost surely as n → ∞, i.e.

P( lim_{n→∞} ‖X(β̂ − β)‖²₂/n = 0 ) = 1.    (2.9)
It is easy to verify that the matrix Σ in model (2.2) satisfies σ²_j ≤ K for all j with K = 1.
Furthermore, note that the proof of Theorem 3 remains valid when the number of covariates
p is greater than the number of time points n, although they are identical under our design.
That is, our model (2.2) can be extended to allow more intermediate time knots as joinpoint
candidates.
The main theorem above establishes the consistency of our LASSO estimator in terms of
prediction error. A stronger conclusion for this type of estimator has also been studied
in [3], where a more refined ℓ1 consistency result for β̂ is provided under some additional
compatibility conditions on the design matrix X. An alternative restricted eigenvalue
condition is given in [2]. Given an index set S ⊂ {1, 2, · · · , n}, denote
βS = (β_{0,S}, β_{1,S}, · · · , β_{n−1,S})^T where β_{j,S} = βj·1{j ∈ S} for j = 1, 2, · · · , n − 1.
The compatibility condition is stated as follows.
Assumption 1 (Compatibility Condition). Let Σ = X^T X/n be the Gram matrix. Given
the true index set S0 with cardinality s0 = |S0|, there exists φn > 0 such that for all β
satisfying ‖β_{S0ᶜ}‖₁ ≤ 3‖β_{S0}‖₁, it holds that

‖β_{S0}‖²₁ ≤ (s0/φ²n)(β^T Σβ)  and  s0λ²/φ²n = o(1).    (2.10)
Unfortunately, we point out that under our design with λ = O(√(log n / n)), the matrix Σ
violates the compatibility condition.
Proposition 1. The eigenvalues of the design matrix in (2.4) are

ν1 = (n + 2 − √(n² + 4)) / (2n),  ν2 = ν3 = · · · = νn−1 = 1/n,  νn = (n + 2 + √(n² + 4)) / (2n).    (2.11)
From the claim above we see that at least (n − 1) of the eigenvalues of X are no greater
than n⁻¹. Therefore, by the properties of eigenvalues, we conclude that the smallest
eigenvalue of Σ = X^T X/n, denoted by ν1(Σ), is at most n⁻³. Given the simple facts that
‖β_{S0}‖²₁ ≤ s0‖β‖²₂ and ν1(Σ)‖β‖²₂ ≤ β^T Σβ for all β, it follows that φ²n ≤ n⁻³, hence
λ²s0/φ²n ≥ O(n² log n), which contradicts the assumption that s0λ²/φ²n = o(1) in (2.10).
In other words, unless we choose a λ extremely close to 0, we will not be able to meet the
compatibility condition.
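Two consequences of Proposition 1 are easy to verify numerically: the eigenvalues of X sum to 2, and their product is (1/n)^{n−1}, since ν1·νn = ((n+2)² − (n²+4))/(4n²) = 1/n. A small sketch:

```python
import numpy as np

def spline_design(n):
    """Design matrix (2.4) on the rescaled grid x_i = i/n."""
    x = np.arange(1, n + 1) / n
    return np.column_stack([np.ones(n), x] +
                           [np.maximum(x - x[j], 0.0) for j in range(1, n - 1)])

n = 27
X = spline_design(n)
tr = np.trace(X)                      # equals the eigenvalue sum, 2 by (2.11)
sign, logdet = np.linalg.slogdet(X)   # log|det X| = -(n-1) log n by (2.11)
```

The tiny determinant n^{−(n−1)} is exactly the ill-conditioning that makes the compatibility constant φ²n collapse.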
As discussed above, the violation of compatibility for Σ implies that we cannot use a
similar technique as before to prove the ℓ1 consistency result for β̂. This is not
surprising, since the design matrix X consists of rescaled time points xj = j/n,
j = 1, 2, · · · , n, so that we may need conditions such as ‖β‖₂ = O(n) to have
r = ‖Xβ‖²₂/σ² < ∞. Also, it is natural to assume s0 < ∞. Then, given S0 = S, ‖β‖₁ = O(n).
Hence the usual ℓ1 consistency for β̂ may not be a reasonable criterion under our settings.
On the other hand, if we define Ŝ = {2 ≤ j ≤ n − 1 : β̂j ≠ 0} as our selected set of
joinpoints, one may desire the following oracle properties:

P(Ŝ = S0) → 1  or  P(Ŝ ⊃ S0) → 1  as n → ∞.    (2.12)

These indicate that the estimated effective joinpoints will at least cover the true set S0
when n is large. In our case, we may seek a weaker oracle property. Note that
Xβ = (X1β, X2β, · · · , Xnβ)^T, where the Xj are the row vectors of X. Let
S∗ = {2 ≤ j ≤ n − 1 : Xjβ ≠ 0} and Ŝ∗ = {2 ≤ j ≤ n − 1 : Xjβ̂ ≠ 0}. Then one may show that

P(Ŝ∗ = S∗) → 1  or  P(Ŝ∗ ⊃ S∗) → 1  as n → ∞.    (2.13)
Note that here Xj (j ≥ 2) has nonzero elements only in its first j entries. After all, it
is possible that, although the compatibility assumption (a sufficient condition) fails under
the design X, the numerical results still show excellent performance in variable selection.
2.4 Simulation Study
In this section, we examine the performance of joinpoint detection based on the ℓ1-penalized
regression (LASSO) method and on the PTB method as implemented in the Joinpoint Software,
via several simulations.
To be consistent with the real data analysis in the next section, we first consider n = 27
and x ∈ S = {1, 2, · · · , 27}. The true model with two joinpoints can be written as follows:
y = β0 + β1x + δ1(x − τ1)+ + δ2(x − τ2)+ + ε,    (2.14)
where the errors ε are again assumed to be independent Gaussian N(0, σ²). To further specify
different models based on the magnitude of the slope changes, the positions of the joinpoints
and the noise level, we let β0 = 5, (τ1, τ2) = (8, 18) or (18, 23), and σ² = 0.0001 or 0.001.
Meanwhile, as given in [14], the slopes β1, δ1 and δ2 are determined by the annual percentage
changes APC = (APC1, APC2, APC3) in the following way: β1 = log(1 + 0.01 APC1),
δ1 = log(1 + 0.01 APC2) − log(1 + 0.01 APC1) and δ2 = log(1 + 0.01 APC3) − log(1 + 0.01 APC2).
In our simulations we set APC = (1, 3, 1) or (4, −1, 2), which leads to different trends in
the data. All combinations of the true parameters hence give us eight different scenarios.
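The APC-to-slope conversion just described can be sketched as a small helper:

```python
import numpy as np

def apc_to_slopes(apc):
    """Map annual percentage changes (APC_1, ..., APC_m) to the baseline
    slope beta1 and the slope changes delta_k of model (2.14)."""
    g = np.log1p(0.01 * np.asarray(apc, dtype=float))   # log(1 + 0.01*APC_k)
    beta1 = g[0]
    deltas = np.diff(g)                                  # delta_k = g[k] - g[k-1]
    return beta1, deltas

# one of the simulation scenarios from the text
b1, (d1, d2) = apc_to_slopes((1, 3, 1))
```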
In the LASSO spline model, we specify the design matrix X in the form of (2.4) with
xj = j/27, j = 1, 2, · · · , 27. That is, we rescale the observed time points without losing
information.
One of the major concerns when fitting the generated datasets to our model is selecting
appropriate tuning (penalty) parameters λ, since they directly control the sparsity, i.e.,
the effective number of joinpoints. Some popular rules, such as cross validation (CV) [12]
and the Bayesian information criterion (BIC) [22], are widely used to handle this issue. As
a model validation technique, cross validation partitions a sample of data into a training
set and a testing set, performs the analysis on the training set, and validates the analysis
on the testing set. It has shown good predictive accuracy in much of the literature. We
apply 5-fold cross validation in the simulations: the original observations are randomly
partitioned into five subsamples of nearly equal size; a single subsample is retained as the
validation data for testing the model while the remaining four are used as training data;
and the process is repeated five times, with each of the five subsamples used exactly once
as the validation data. We then simply select the tuning parameter λ with minimum mean
squared error (MSE). However, sometimes we may prefer a more parsimonious model with fewer
covariates or, in our case, not too many estimated
joinpoints. Hastie, Tibshirani and Friedman suggest a 'one standard error rule' [8], which
chooses the largest λ value within one standard error of the cross-validation minimizer λ̂,
i.e. λ̃ = sup{λ : MSE(λ) ≤ MSE(λ̂) + SE(λ̂)}, where MSE(λ) and SE(λ) denote the mean squared
error and standard error under λ. We will provide results under both the original cross
validation approach (L-CV) and the one standard error rule (L-SE).
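The one standard error rule itself can be sketched as a small helper, given a grid of λ values with their cross-validated errors and standard errors (illustrative arrays below, not the actual simulation output):

```python
import numpy as np

def one_se_lambda(lams, mse, se):
    """'One standard error' rule: the largest lambda whose CV error is within
    one standard error of the minimum CV error. Arrays are aligned over the
    lambda grid."""
    lams, mse, se = map(np.asarray, (lams, mse, se))
    i = np.argmin(mse)                       # index of the CV minimizer
    ok = mse <= mse[i] + se[i]               # candidates within one SE
    return lams[ok].max()

# hypothetical grid: minimum at lambda = 0.2, but 0.3 is still within one SE
lam_1se = one_se_lambda([0.1, 0.2, 0.3, 0.4],
                        [1.00, 0.90, 0.95, 1.20],
                        [0.10, 0.10, 0.10, 0.10])
```

Choosing the larger λ trades a little predictive accuracy for a sparser set of joinpoints.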
As an alternative model for comparison purposes, we also apply the PTB method implemented
in the Joinpoint Software. Since the PTB approach relies on a grid search when fitting the
model, it is inevitably time-consuming even for a moderate number of candidate joinpoints
when the sample size is relatively large (usually when n ≥ 40). In particular, when the
maximum number of joinpoints is greater than 4, the computation becomes quite slow. As
advised by the user manual of the software, we set
the maximum number of joinpoints to 4. Since our penalized spline model has no such
limitation and processes the same data much faster, as described in section 2.2, we claim
that it demonstrates more flexibility in modeling various types of real data effectively.
Moreover, one of the settings of the PTB implementation imposes a minimum distance (3 by
default) between two consecutive estimated joinpoints. Therefore, to make a fair comparison,
we also consider a modification of both the L-CV and L-SE methods. First, we call a
sequentially ordered subset of the originally selected components a 'cluster' if it is the
largest set consisting of two or more consecutive nonzero coefficients with the same sign,
either all positive or all negative, i.e. a set either of the form
B+(s, t) := {(βs, βs+1, · · · , βt−1, βt) : βs−1 ≤ 0; βs, · · · , βt > 0; βt+1 ≤ 0} or of the form
B−(s, t) := {(βs, βs+1, · · · , βt−1, βt) : βs−1 ≥ 0; βs, · · · , βt < 0; βt+1 ≥ 0} for some
2 ≤ s < t ≤ n − 2. Every other selected component not belonging to a 'cluster' is called a
'singleton'. Our modification then picks, as the joinpoints, the single component from each
'cluster' that minimizes the mean squared error, together with all the 'singletons'. We
denote these modified L-CV and L-SE methods by L-CV-M and L-SE-M, respectively. To summarize
all the methods we compare with the PTB, we present the following 2 × 2 chart.
Tuning Parameter \ Refinement            Original LASSO    Modified LASSO
λ1 via Cross-validation                  L-CV              L-CV-M
λ2 by the 'One Standard Error' Rule      L-SE              L-SE-M

Table 2.1: A Summary of Four Methods to Compare with the PTB
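The 'cluster'/'singleton' classification underlying L-CV-M and L-SE-M can be sketched as follows (the final within-cluster pick by minimum mean squared error, which requires refitting, is omitted from this sketch):

```python
import numpy as np

def clusters_and_singletons(beta):
    """Scan the estimated knot coefficients: maximal runs of two or more
    consecutive nonzero coefficients with a common sign are 'clusters';
    any other nonzero coefficient is a 'singleton'."""
    groups, run = [], []
    for j, b in enumerate(beta):
        if b != 0 and (not run or np.sign(b) == np.sign(beta[run[-1]])):
            run.append(j)                    # extend the current same-sign run
        else:
            if run:
                groups.append(run)           # close the finished run
            run = [j] if b != 0 else []
    if run:
        groups.append(run)
    clusters = [g for g in groups if len(g) >= 2]
    singletons = [g[0] for g in groups if len(g) == 1]
    return clusters, singletons

# toy coefficient vector: indices 1-2 form a positive cluster; 4, 6, 7 are singletons
clusters, singles = clusters_and_singletons([0.0, 0.5, 0.3, 0.0, -0.2, 0.0, 0.1, -0.4])
```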
We simulate the data 1000 times for each of the eight scenarios and fit them by the L-CV,
L-CV-M, L-SE, L-SE-M and PTB methods. The performance of joinpoint detection is characterized
by three quantities: (1) the average number of correctly identified joinpoints, #CJ; (2) the
average number of incorrectly identified joinpoints, #IJ; and (3) the average mean squared
error of the predictions, PMSE = ‖X(β̂ − β)‖²₂/n. The results are listed in Tables 2.2
and 2.3.
From the results we see that, in general, the ℓ1-penalized regression spline model with the
original cross validation procedure (L-CV) provides better detection of the true joinpoints
than PTB or L-SE. However, its relatively high #IJ values indicate that it usually introduces
more joinpoints than the other two methods. The penalized model with cross validation under
the one standard error rule (L-SE) performs as well as or better than PTB in detecting the
true joinpoints most of the time, and it brings down the #IJ value compared to L-CV due to
heavier penalization of the parameters. Meanwhile, the modified approach L-CV-M usually
attains higher #CJ values than L-SE and lower #IJ values than L-CV, thus giving a more
balanced and robust estimate between L-CV and L-SE. The L-SE-M reduces the number of
effective joinpoints dramatically but is more likely to miss the true joinpoints
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   PTB      1.377  1.284  0.266
                    L-CV     1.648  3.852  0.940
                    L-CV-M   1.436  1.659  0.277
                    L-SE     1.592  3.831  1.093
                    L-SE-M   1.392  1.691  0.419
        (4, −1, 2)  PTB      1.897  0.236  0.200
                    L-CV     1.942  2.982  0.993
                    L-CV-M   1.909  0.694  0.208
                    L-SE     1.933  2.996  1.019
                    L-SE-M   1.900  0.714  0.211
0.0010  (1, 3, 1)   PTB      0.306  3.047  3.956
                    L-CV     0.696  4.956  3.679
                    L-CV-M   0.448  3.648  3.711
                    L-SE     0.283  3.375  5.673
                    L-SE-M   0.215  2.754  5.663
        (4, −1, 2)  PTB      0.960  2.106  2.754
                    L-CV     1.391  5.121  3.351
                    L-CV-M   1.032  3.367  2.913
                    L-SE     1.171  4.485  5.523
                    L-SE-M   0.966  2.952  3.247

Table 2.2: Simulation Result for n = 27, (τ1, τ2) = (8, 18)
compared to the other methods. As for computational efficiency, as we expected, the LASSO
models showed notable superiority over the PTB thanks to the coordinate descent algorithm
mentioned in section 2.2.
To further investigate the performance of our LASSO method under larger sample sizes, we
design additional groups of simulations with n = 54 and 135, corresponding to 2 and 5 times
the original sample size of responses. We keep the settings for the APC values and σ² the
same as in the previous simulations. However, as the time line extends, it is reasonable to
simply maintain the relative locations of the true joinpoints, for example,
(τ1, τ2) = (16, 36) or (36, 46) respectively for n = 54. Also note that since the PTB
method would be significantly time-
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   PTB      1.118  1.667  0.339
                    L-CV     1.290  3.309  0.444
                    L-CV-M   1.093  1.913  0.343
                    L-SE     1.010  3.113  0.766
                    L-SE-M   0.749  2.190  0.571
        (4, −1, 2)  PTB      1.748  0.561  0.232
                    L-CV     1.714  3.314  0.551
                    L-CV-M   1.575  1.023  0.353
                    L-SE     1.435  3.429  0.945
                    L-SE-M   1.259  1.470  0.585
0.0010  (1, 3, 1)   PTB      0.242  2.637  2.498
                    L-CV     0.623  3.703  2.533
                    L-CV-M   0.510  2.793  2.659
                    L-SE     0.272  2.586  4.112
                    L-SE-M   0.231  2.308  3.867
        (4, −1, 2)  PTB      0.448  2.368  3.275
                    L-CV     1.134  3.897  2.856
                    L-CV-M   0.774  2.646  2.981
                    L-SE     0.899  2.608  5.467
                    L-SE-M   0.604  2.098  3.258

Table 2.3: Simulation Result for n = 27, (τ1, τ2) = (18, 23)
consuming under large sample sizes, we decided not to include it in this comparison. The
results are shown in Tables 2.4 through 2.7.
It is clear that as the sample size n increases, the LASSO model tends to select more
candidate components as joinpoints, which are then more likely to cover the true joinpoints.
Meanwhile, the modified methods L-CV-M and L-SE-M reduce #IJ dramatically by eliminating
the redundant information within 'clusters'. In general, both of our methods L-CV-M and
L-SE-M give acceptable estimates.

On the other hand, note that as the sample size n increases, it is more worthwhile to
study the LASSO model with the asymptotic tuning parameter λ ∼ O(σ√(log(n)/n)) than the
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.729  6.621  0.954
                    L-CV-M   1.445  2.250  0.163
                    L-SE     1.729  6.626  0.959
                    L-SE-M   1.443  2.256  0.163
        (4, −1, 2)  L-CV     1.883  5.227  1.004
                    L-CV-M   1.696  1.434  0.274
                    L-SE     1.883  5.220  1.010
                    L-SE-M   1.696  1.433  0.274
0.0010  (1, 3, 1)   L-CV     1.458  8.487  2.089
                    L-CV-M   0.685  4.663  1.612
                    L-SE     1.329  7.406  3.417
                    L-SE-M   0.680  4.107  1.645
        (4, −1, 2)  L-CV     1.615  6.827  2.250
                    L-CV-M   1.065  4.041  1.878
                    L-SE     1.546  6.136  3.386
                    L-SE-M   1.097  3.578  1.775

Table 2.4: Simulation Result for n = 54, (τ1, τ2) = (16, 36)
usual one selected by cross-validation techniques. Thus we also use λ0 = 4√6 σ√(log(n)/n)
as specified in section 2.3 to verify the asymptotic behavior of the predicted mean squared
error (PMSE). Some results under σ² = 0.0010 are shown in Figures 2.1 and 2.2. To save
computation time, we only take n = 27 × 5k for k = 3, · · · , 15 for demonstration.
We can see a clear decreasing trend in the PMSE, which meets our expectation, since we
have shown that the prediction error should be close to 0 once the sample size is large
enough.
Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (1, 3, 1), σ² = 0.0010

Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (4, −1, 2), σ² = 0.0010
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.593  5.239  0.416
                    L-CV-M   1.391  1.830  0.151
                    L-SE     1.518  5.307  0.466
                    L-SE-M   1.324  1.945  0.163
        (4, −1, 2)  L-CV     1.833  6.497  0.532
                    L-CV-M   0.828  2.876  1.537
                    L-SE     1.763  6.479  0.618
                    L-SE-M   0.810  2.896  1.511
0.0010  (1, 3, 1)   L-CV     1.011  6.176  1.650
                    L-CV-M   0.635  3.951  1.544
                    L-SE     0.669  4.262  3.489
                    L-SE-M   0.412  3.186  2.329
        (4, −1, 2)  L-CV     1.545  7.744  1.739
                    L-CV-M   0.748  3.894  2.035
                    L-SE     1.185  5.753  3.488
                    L-SE-M   0.652  3.240  2.460

Table 2.5: Simulation Result for n = 54, (τ1, τ2) = (36, 46)
2.5 A Case Study: National Cancer Incidence Rate
Analysis
As a real application of the penalized regression model, in this section we employ our
method to analyze the cancer rates data.

First, Figure 2.3 shows the age-adjusted incidence rates of colon and rectal cancer for the
overall population in the U.S. from 1973 to 1999. These data were reported in the SEER
Cancer Statistics Review [17] utilizing the Joinpoint Software. Some clear trends and
changes over time are easy to see.
We again apply the PTB and LASSO methods to the cancer rate data. As indicated in the
previous section, when applying the PTB method via the Joinpoint Software, we set the
maximum number of potential joinpoints to four. We use the modified LASSO methods L-CV-M
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.300  16.08  1.584
                    L-CV-M   0.661  5.663  0.137
                    L-SE     1.300  16.08  1.584
                    L-SE-M   0.661  5.663  0.137
        (4, −1, 2)  L-CV     1.518  13.79  1.438
                    L-CV-M   0.611  4.736  6.280
                    L-SE     1.518  13.79  1.441
                    L-SE-M   0.611  4.736  6.280
0.0010  (1, 3, 1)   L-CV     1.479  18.35  1.783
                    L-CV-M   0.688  6.461  0.827
                    L-SE     1.483  17.75  2.083
                    L-SE-M   0.692  6.257  0.815
        (4, −1, 2)  L-CV     1.495  15.92  1.939
                    L-CV-M   0.598  6.120  6.211
                    L-SE     1.492  15.43  2.231
                    L-SE-M   0.578  6.01   5.760

Table 2.6: Simulation Result for n = 135, (τ1, τ2) = (40, 90)
and L-SE-M for comparison purposes. The fitting results are given in Table 2.8.
In Table 2.8, PTB (#2) denotes the original unrestricted PTB method, while PTB (#4) is
implemented under a pre-specified number of four joinpoints for comparison purposes only.
From the results we can see that the LASSO type penalized regression methods tend to be
more aggressive than the PTB when selecting joinpoints. We have no strong reason to suspect
an over-fitting issue for this model, since the tuning parameters λ are well chosen by the
cross validation procedures. Moreover, we see that L-SE-M, the most heavily penalized of
our approaches, gives the same number of joinpoints as the original PTB (#2), while L-CV-M
selects more joinpoints than any other method. It is thus clear that our method provides
more flexibility in determining the total number of joinpoints, because we are always able
to control this important quantity through a series of appropriate tuning parameters.
Finally, we show the plot of the fitted models in Figure 2.4, where we see that both
approaches successfully pick out
Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999)
Figure 2.4: Fitted Incidence Rates (1973 - 1999)
σ²      APC         Method   #CJ    #IJ    PMSE(×10−4)
0.0001  (1, 3, 1)   L-CV     1.897  11.89  0.551
                    L-CV-M   0.676  3.848  0.211
                    L-SE     1.895  11.88  0.554
                    L-SE-M   0.675  3.850  0.211
        (4, −1, 2)  L-CV     1.550  16.72  0.905
                    L-CV-M   0.271  4.815  15.05
                    L-SE     1.553  16.69  0.925
                    L-SE-M   0.272  4.804  15.07
0.0010  (1, 3, 1)   L-CV     1.736  15.41  1.060
                    L-CV-M   0.465  5.504  0.831
                    L-SE     1.518  13.85  1.903
                    L-SE-M   0.468  4.790  0.816
        (4, −1, 2)  L-CV     1.575  20.53  1.411
                    L-CV-M   0.247  5.937  13.68
                    L-SE     1.513  17.93  2.266
                    L-SE-M   0.237  5.333  14.26

Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115)
Method     # of Joinpoints   Joinpoint Details (Year)
PTB (#2)   2                 1986, 1995
PTB (#4)   4                 1977, 1982, 1985, 1995
L-CV-M     7                 1976, 1980, 1982, 1985, 1990, 1993, 1995
L-SE-M     2                 1985, 1996

Table 2.8: Joinpoint Detection for Real Data Analysis
the peak value around year 1985 as well as the bottom value at year 1995, and our approach
offers a quite close fit to the incidence rate curve.
Overall, we are satisfied with the performance of our penalized regression model regarding
both coverage and accuracy of the joinpoint detection. However, one should note that we
assumed homoscedasticity and independence of the errors in our model for simplicity.
Further approaches involving more complicated but realistic error structures are therefore
always worthwhile.
2.6 Conclusion and Discussion
To summarize this chapter, we proposed a new method to solve the joinpoint detection
problem on discrete time points using an ℓ1-type penalized regression spline model (LASSO).
We proved a basic version of consistency of the LASSO joinpoint estimator and compared it
with another commonly used approach, the PTB, via simulations and real data analysis.
Based on the results, we conclude that our method performs well: it showed superiority over
the PTB in terms of both efficiency and accuracy, and it is easy to understand as well as
to implement. In practice, one may obtain an optimal estimate of the joinpoints by
carefully selecting a set of appropriate tuning parameters. Moreover, the problem of
exploring stronger consistency properties of our estimator remains open and is worthwhile
for further investigation.
Chapter 3
Nonparametric Bayesian Clustering of
Functional Data
3.1 Introduction
We start this chapter as a continuation of our study of the mortality and incidence rates
of cancers. However, unlike the previous chapter, we first consider a more general scenario
in which multiple curves are available for comparison and analysis at the same time. For
example, if cancer data can be obtained at the state level in the U.S., then one may
overlook potential relationships or interactions among curves from different states by
simply fitting them individually. In fact, even for the same joinpoint detection question
as in Chapter 2, on the set of multiple functional simulated data appearing in section
3.5.2, our ℓ1-penalized method from Chapter 2 gives #CJ = 0.531 and #IJ = 1.449, while the
model proposed by Dass et al. (2014), which is similar to the one we present in section
3.2, leads to #CJ = 0.816 and #IJ = 0.551, respectively. Thus it is expected that with more
information on the curves, such as their group memberships, we are likely to obtain better
estimates of their trends. Motivated by this comparison, we now focus on grouping together
the curves that have similar patterns.
That is, our main concern is whether they form groups with similar overall trends but
significant variation between groups. For instance, one may suspect that the mortality
rates for all 48 states of the U.S. indicate certain geographical similarities. Hence an
appropriate model is required to capture the characteristics of the patterns of the curves
and to classify them into groups, or clusters, as the primary objective.
In general, the clustering problem briefly introduced above can be conveniently formulated
in the following longitudinal way. Let fs(t) be the true function at time t ∈ [0, T] for
the s-th individual, s = 1, · · · , N. It is usual and natural to assume each function fs is
smooth, so that it can be approximated by polynomial functions of any degree. A natural
and heuristic approach to the clustering problem comes from the classification perspective,
where MacQueen (1967) and Hartigan and Wong (1978) introduced the famous K-means algorithm
to partition data into groups. Each curve is first viewed as a discretely observed vector;
the algorithm, which aims to minimize the within-group sum of squares, is then applied
directly to the multiple vectors, and cluster membership is determined as a result of the
classification.
Overall, the entire process is straightforward and easy to implement. But there are also
clear disadvantages to such an approach. For example, it requires the number of groups or
clusters to be specified at the beginning and fixed throughout the procedure, although this
number is unknown in most real studies. Moreover, this clustering method treats each
individual curve as an independent outcome and thus overlooks potential correlations or
interactions between curves, which may not be a negligible factor when recovering the true
clusters. Later, more sophisticated model-based methods were developed to address these
issues. James and Sugar (2003) proposed a mixed effects model that is more powerful for
clustering sparsely observed functional data. In their approach, B-splines are considered
to capture the individual curves, and random spline coefficients with a mixture normal
distribution are used to cluster them. It handles missing values well and provides
reasonable estimates and confidence intervals for them. However, like the K-means
algorithm, the number of clusters in the model has
also to be pre-determined by independent criteria. Meanwhile, the knot points are the same
across all clusters, which reduces the flexibility of the model as well as its ability to
process a group of curves with varying connecting points. Abraham et al. (2003) combined a
B-spline basis with the K-means procedure on fixed effects to study this problem. Some
other model-based approaches include Banfield and Raftery (1993), Serban and Wasserman
(2005), Heard et al. (2006) and Ray and Mallick (2006).
We develop our model as a powerful tool for classification purposes by applying a classical nonparametric Bayesian approach, the Dirichlet Process (DP) methodology [7], in an innovative way to cluster the trends of different curves as polynomial components. Our method utilizes this representation of smooth curves to obtain relevant information such as connection and fluctuation. In particular, unlike most of the other methods mentioned in the previous paragraph, our proposed approach does not require an initial estimate of the number of clusters; instead, it provides a reasonable estimate along with an empirical distribution. It also allows curves to be characterized by different numbers and locations of knot points, which makes it more capable of fitting individual curves well. Furthermore, it performs the clustering based on a good summary of the shape and trend of the curves and thus demonstrates the similarity of patterns between curves within the same cluster.
The rest of this chapter is organized as follows. We present our functional spline mixed
model and prior specifications for Bayesian inference in section 3.2. Some prior components
are taken to be improper, and therefore, propriety of the posterior becomes a concern.
Section 3.3 establishes the propriety of the posterior distribution under some mild conditions.
Section 3.4 provides details of the Bayesian inference, particularly, performance measures of
clustering configuration for the comparison. At the same time, we also explore a new choice
of a more non-informative prior on the concentration parameter. Section 3.5 shows results from numerical studies, including simulated data as well as the real lung cancer data. Finally, we summarize our work and give some discussion of the proposed model and future research.
3.2 Spline Mixed Model for Clustering Multiple Curves
3.2.1 Model Specification
Suppose that we observe functions \{f_s(t)\}_{s=1}^{N} at t = t_1, t_2, \ldots, t_n. Let \xi = (\xi_1, \ldots, \xi_K) be the knot points for the spline basis functions, where K is the number of knots, \xi_1 > t_1 and \xi_K < t_n. Let \xi_0 = t_1 and \xi_{K+1} = t_n. The spline basis functions of order q are then given by

h_m(x) = x^{m-1}, \quad m = 1, \ldots, q,

and

h_{q+l}(x) = (x - \xi_l)_+^{q-1}, \quad l = 1, \ldots, K.

In practice, the most widely used orders are q = 1, 2, 4, where q = 1 corresponds to a constant function, q = 2 corresponds to a linear function and q = 4 corresponds to a cubic spline function. Then, we assume that f_s(t) can be approximated as follows for s = 1, 2, \ldots, N:

f_s(t) \approx \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t),

where h_m^{(s)}(t) depends on the s-th individual through its knot points \xi^{(s)} = (\xi_1^{(s)}, \ldots, \xi_{K_s}^{(s)}).
Let Y_{st} be the observed value of f_s(t) at time t for t = t_1, \ldots, t_n. Then, the observed values are modeled using the spline representation

Y_{st} = \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t) + e_{st},

where e_{st} is an error that represents variability from various sources. We decompose the error as e_{st} = u_s + \varepsilon_{st}, and consider u_s a random effect that captures variability between curves and \varepsilon_{st} an error that reflects variability across time for the s-th function. Thus the model can be written as

Y_{st} = \sum_{m=1}^{q+K_s} \beta_m^{(s)} h_m^{(s)}(t) + u_s + \varepsilon_{st}

with u_s \sim N(0, \tau^2) and \varepsilon_{st} \sim N(0, \sigma_s^2). Let Y_s = (Y_{st_1}, \ldots, Y_{st_n})^T, Y = (Y_1^T, \ldots, Y_N^T)^T, \varepsilon_s = (\varepsilon_{st_1}, \ldots, \varepsilon_{st_n})^T, \varepsilon = (\varepsilon_1^T, \ldots, \varepsilon_N^T)^T and u = (u_1, \ldots, u_N)^T. Then, we can write the model in matrix form for the s-th individual as

Y_s = X^{(s)} \beta^{(s)} + u_s 1_n + \varepsilon_s,

where X^{(s)} = (x_{ij})_{n \times (q+K_s)} with x_{ij} = h_j(t_i) for i = 1, \ldots, n and j = 1, \ldots, q+K_s, and \beta^{(s)} = (\beta_1^{(s)}, \ldots, \beta_{q+K_s}^{(s)})^T. For all s = 1, \ldots, N together, we have

Y = X\beta + u \otimes 1_n + \varepsilon, \qquad (3.1)

with X = \mathrm{diag}(X^{(1)}, \ldots, X^{(N)}) and \beta = (\beta^{(1)T}, \ldots, \beta^{(N)T})^T, where \otimes denotes the Kronecker product.
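To make the construction concrete, the design matrix X^{(s)} in (3.1) can be assembled directly from the truncated power basis defined above. The sketch below is an illustrative helper of our own, not code from this dissertation. The coefficient vectors used later in the simulation study have length K_r + 1, which corresponds to order q = 1; for that case we read (x - \xi_l)_+^0 as the indicator 1\{x > \xi_l\}, so the basis consists of step functions.

```python
import numpy as np

def spline_design(t, knots, q):
    """Truncated power basis of order q at time points t:
    h_m(x) = x^(m-1) for m = 1,...,q, then h_{q+l}(x) = (x - xi_l)_+^(q-1)
    for each knot xi_l; for q = 1 the knot columns are the step
    indicators 1{x > xi_l}.  Returns the n x (q + K) matrix X^(s)."""
    t = np.asarray(t, dtype=float)
    cols = [t ** (m - 1) for m in range(1, q + 1)]
    for xi in knots:
        cols.append((t > xi).astype(float) if q == 1
                    else np.clip(t - xi, 0.0, None) ** (q - 1))
    return np.column_stack(cols)

# One simulated curve from Y_s = X^(s) beta^(s) + u_s 1_n + eps_s,
# using the first cluster of the simulation study (n = 38, knots (8, 30))
rng = np.random.default_rng(0)
t = np.arange(1, 39)
X = spline_design(t, knots=[8, 30], q=1)
beta = np.array([0.0667, 0.0500, -0.2000])
u_s = rng.normal(0.0, np.sqrt(1 / 30))   # between-curve effect, tau^2 = 1/30
y = X @ beta + u_s + rng.normal(0.0, np.sqrt(1 / 30), size=t.size)
```

With q = 1 each coefficient \beta_m^{(s)} simply shifts the level of the curve after the corresponding knot, which is what the DP prior later clusters across sites.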
3.2.2 Derivation of the Functional Dirichlet Process and Other Priors
We are interested in clustering functions based on their shapes. Clustering based on the coefficients of basis functions has been proposed by other authors, in particular using coefficients of spline basis functions, as in [1] and [16]. However, in those approaches the locations and the number of knots are pre-determined before clustering is considered. Different groups of functions may need different numbers of knots and knot locations to capture the shapes of those functions, so models should take such flexibility into account. To model clustering
of N sites with respect to their knot-point locations and corresponding magnitudes, we
apply the Dirichlet Process (DP) priors to the functions fs(t) with an appropriate centering
measure so that the number of knots and locations of knots are random and estimated during
inferential procedures.
To utilize the DP prior, we consider the function \theta_s(t) = f_s(t) and let \Theta be the space of all possible \theta_s(t). A functional DP on the space of all distributions on \Theta has two hyperparameters that characterize it, written DP \equiv DP(\alpha_0 G_0): \alpha_0 is called the precision parameter and G_0 is the baseline (or centering) distribution on \Theta. It is well known that a randomly generated distribution F from DP(\alpha_0 G_0) is almost surely discrete and admits the representation [18]

F = \sum_{i=1}^{\infty} \omega_i \delta_{\theta_i}, \qquad (3.2)

where \delta_z denotes a point mass at z, \omega_1 = \eta_1 and \omega_i = \eta_i \prod_{k=1}^{i-1} (1 - \eta_k) for i = 2, 3, \ldots, with \eta_i i.i.d. Beta(1, \alpha_0) and \theta_1, \theta_2, \ldots i.i.d. from the distribution G_0.
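A truncated version of the representation (3.2) can be simulated directly. The helper below is our own sketch (not dissertation code): it draws the stick-breaking fractions \eta_i \sim Beta(1, \alpha_0), truncates the sum at a fixed number of atoms, and closes off the last stick so the weights sum to one.

```python
import numpy as np

def stick_breaking(alpha0, n_atoms, base_sampler, rng):
    """Truncated stick-breaking draw F = sum_i w_i * delta_{theta_i}
    from DP(alpha0 * G0): eta_i ~ Beta(1, alpha0), w_1 = eta_1,
    w_i = eta_i * prod_{k<i} (1 - eta_k)."""
    eta = rng.beta(1.0, alpha0, size=n_atoms)
    eta[-1] = 1.0  # close off the stick so the truncated weights sum to one
    w = eta * np.concatenate(([1.0], np.cumprod(1.0 - eta[:-1])))
    atoms = [base_sampler(rng) for _ in range(n_atoms)]
    return w, atoms
```

A small \alpha_0 pushes most of the mass onto the first few atoms, which is why \alpha_0 controls the number of distinct clusters in the sequel.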
To specify G_0 on \Theta, it is convenient to consider a hierarchical structure. The randomness of G_0 is specified by assuming randomness in K, the number of knots; \xi, the locations of the knots; \beta, the spline coefficients; and u, the random-effect component.

(1) First, we specify the distribution of K. We assume that each time interval between knot points has at least w units, and consider a truncated Poisson distribution for K. That is,

p(k) = P(K = k) = \frac{e^{-\lambda} \lambda^k / k!}{\sum_{l=0}^{k^*} e^{-\lambda} \lambda^l / l!} = \frac{\lambda^k / k!}{\sum_{l=0}^{k^*} \lambda^l / l!}

for k = 0, 1, \ldots, k^*, where k^* = [(n-1)/w - 1] to ensure n_0 \equiv n - 1 - (K+1)w > 0.

(2) Given K = k, we assume that the interval lengths (n_1, \ldots, n_{k+1}), where n_l \geq 0 and \sum_{l=1}^{k+1} n_l = n_0, satisfy

(n_1, n_2, \ldots, n_{k+1}) \sim \mathrm{Multinomial}\left(n_0, \frac{1}{k+1}, \ldots, \frac{1}{k+1}\right).

Then n_l + w becomes the length of the time interval between knot points for l = 1, \ldots, k+1. That is, \xi_l is defined recursively by \xi_0 = 1 and \xi_l = \xi_{l-1} + n_l + w for l = 1, \ldots, k.

(3) Given K = k and \xi = (\xi_1, \ldots, \xi_k), generate \beta_1, \ldots, \beta_{q+k} i.i.d. from a density \pi_0 on \mathbb{R}.

By steps (1)-(3), we set \theta(t) = f(t) = \sum_{m=1}^{q+K} \beta_m h_m(t) and \theta = (f(t_1), \ldots, f(t_n))^T = X\beta. From the hierarchical specification above, it follows that the infinitesimal measure for G_0 is given by

G_0(d\theta) = p(k) \left( \frac{\Gamma(n_0 + 1)}{\prod_{l=1}^{k+1} \Gamma(n_l + 1)} \left( \frac{1}{k+1} \right)^{n_0} \right) \prod_{m=1}^{q+k} \pi_0(\beta_m) \, d\beta_m. \qquad (3.3)

In our analysis, we take \pi_0(\beta) \propto 1.
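Steps (1) and (2) of this hierarchical specification can be sampled directly: draw K from the truncated Poisson, then draw the interval lengths from the multinomial and accumulate them into knot locations. The following is an illustrative sketch of ours under those assumptions (function name and arguments are not from the dissertation).

```python
import numpy as np

def sample_knot_config(lam, n, w, rng):
    """Draw K from Poisson(lam) truncated to {0,...,k*} with
    k* = floor((n-1)/w - 1), then draw interval lengths
    (n_1,...,n_{K+1}) ~ Multinomial(n0, uniform) with n0 = n-1-(K+1)w,
    and set the knots recursively: xi_0 = 1, xi_l = xi_{l-1} + n_l + w."""
    k_star = int((n - 1) / w - 1)
    ks = np.arange(k_star + 1)
    # log of lam^k / k! for each k, normalized over the truncated support
    logp = ks * np.log(lam) - np.array(
        [np.sum(np.log(np.arange(1, k + 1))) for k in ks])
    p = np.exp(logp - logp.max())
    p /= p.sum()
    K = int(rng.choice(ks, p=p))
    n0 = n - 1 - (K + 1) * w
    lengths = rng.multinomial(n0, np.full(K + 1, 1.0 / (K + 1)))
    xi = 1 + np.cumsum(lengths[:K] + w)   # knot locations xi_1, ..., xi_K
    return K, xi
```

With n = 38 and w = 7 this gives k* = 4, and consecutive knots are always at least w time points apart.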
Finally, to complete the hierarchical model, we specify priors for \alpha_0, \lambda, \tau^2 and \sigma^2 = (\sigma_1^2, \ldots, \sigma_N^2): \pi_1(\lambda) = Ga(a_\lambda, b_\lambda), \pi_2(\tau^2) = IG(a_\tau, b_\tau) and \pi_3(\sigma_s^2) = IG(a_\sigma, b_\sigma), where Ga(a, b) denotes the Gamma distribution with density \pi(x) \propto x^{a-1} e^{-x/b}, and IG(a, b) has already been defined in Chapter 1. For the precision parameter \alpha_0, we also choose a Ga(a_\alpha, b_\alpha) prior in order to guarantee a simple form for the posterior sampling distributions. However, as we show in section 3.5 when performing the numerical analysis, this prior specification may be overly subjective under some circumstances, so we discuss an alternative choice of prior for \alpha_0 there. In general, the hyper-parameters of the Gamma and inverse Gamma distributions are chosen to give large variances so that the impact of the prior input is minimal. The specific choices of the hyper-parameters for the various Gamma and inverse Gamma distributions are given in section 3.5.
As mentioned in section 3.1, it is notable that Dass et al. (2014) proposed a mixed effects model with a similar DP prior to identify change points in multiple curves. Although both our model and theirs feature a nonparametric Bayesian approach, we feel it worthwhile to point out the major differences between the models and their objectives. In the approach of Dass et al. (2014), piecewise disconnected linear segments were used to model the change points of functional data; they appear in the model as a set of slopes and intercepts, and such a design is not intended for clustering. Our model, on the other hand, features a B-spline basis designed to capture and cluster the shape and trend of continuous curves. Even if the order of the B-splines is chosen to be 2, which corresponds to connected linear functions, the interpretation of the coefficients and other estimated parameters will clearly differ because of the difference in the objectives and nature of the two models.
3.3 Theory of Bayesian Inference
This section presents the theoretical foundations of the nonparametric Bayesian inference for our clustering spline mixed model, including the propriety of the posterior distribution and the Gibbs sampling scheme for generating posterior samples.
3.3.1 Propriety of the Posterior Distribution
Let S \equiv \{1, 2, \ldots, N\} be the set of all N sites. Denote a partition of S by c = \cup_{r=1}^{d} C_r, and define \mathcal{P} to be the set of all partitions of S. We consider probability distributions \pi(c) on \mathcal{P}:

\pi(c) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)} \, \alpha_0^d \prod_{r=1}^{d} (|C_r| - 1)!, \qquad (3.4)

where |C_r| is the number of elements in C_r.
Theorem 4. Let \sigma^2 = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2), let \theta = (\theta_1, \theta_2, \ldots, \theta_N) be the collection of functions on the N sites with \theta_s = (\beta^{(s)}, K_s, T_s), and let \phi = (u, \tau^2) with support \Phi = \mathbb{R}^N \times (0, \infty). Given model (3.1) and the specification of priors in section 3.2.2, if K_s < \min\{a_\sigma + (n-1)/2, \, [(n-1)/w]\} for all s and some integer w > 0, then the posterior distribution is proper.

The proof of the theorem is provided in the Appendix.
3.3.2 Gibbs Updating Steps
Let S \equiv \{1, 2, \ldots, N\} be the set of all N individuals. Denote a partition of S by c = \cup_{r=1}^{d} C_r, and define \mathcal{P} to be the set of all partitions of S. Note that a randomly generated distribution F from DP(\alpha_0 G_0) admits the Sethuraman representation (Sethuraman, 1994). By integrating out the random measure F, the equivalence between the DP prior (for any general space \Theta) and the distribution it induces on \mathcal{P} is well known. Moreover, the probability distribution on \mathcal{P} has the explicit form

\pi(c) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)} \, \alpha_0^d \prod_{r=1}^{d} (|C_r| - 1)!, \qquad (3.5)

where |C_r| is the number of elements in C_r. Define \theta = (\theta_1, \ldots, \theta_N) and \theta_{-s} = (\theta_1, \theta_2, \ldots, \theta_{s-1}, \theta_{s+1}, \ldots, \theta_N), the collection of all \theta-components except \theta_s. Let c_{-s} be the partition of S \setminus \{s\} determined by \theta_{-s}; the identical components of \theta_{-s} uniquely determine c_{-s}. Suppose that c_{-s} = \cup_{r=1}^{N^*} E_r. The conditional distribution based on (3.5) is \pi(c \,|\, c_{-s}) = \alpha_0 / (\alpha_0 + N - 1) if c = \{s\} \cup c_{-s}, and \pi(c \,|\, c_{-s}) = N_r / (\alpha_0 + N - 1) if c = (E_1, \ldots, E_r \cup \{s\}, \ldots, E_{N^*}), where N_r denotes the number of elements in E_r.
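The conditional probabilities \pi(c \,|\, c_{-s}) are the familiar Chinese-restaurant-process weights: proportional to N_r for joining an existing cluster E_r and to \alpha_0 for opening a new one, with the common factor 1/(\alpha_0 + N - 1) canceling. A small helper of our own making this explicit:

```python
import numpy as np

def crp_weights(cluster_sizes, alpha0):
    """Normalized conditional probabilities pi(c | c_{-s}) induced by the DP:
    index 0 is the probability of opening a new cluster (weight alpha0),
    index r >= 1 the probability of joining E_r (weight N_r)."""
    w = np.array([alpha0] + list(cluster_sizes), dtype=float)
    return w / w.sum()
```

For example, with existing clusters of sizes 3 and 5 and \alpha_0 = 2, the site opens a new cluster with probability 0.2 and joins the larger cluster with probability 0.5.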
(1) Update \theta_s: The posterior of \theta_s conditional on \theta_{-s} is given by

\pi(\theta_s \,|\, \theta_{-s}, Y, u) = \frac{q_{s,0} \, G_0^*(d\theta_s) + \sum_{r=1}^{N^*} q_{s,r} \, \delta_{\theta^{(r)}}}{q_{s,0} + \sum_{r=1}^{N^*} q_{s,r}}, \qquad (3.6)

where

q_{s,0} = \left[ \int_{\Theta} \pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s) \right] \pi(c \,|\, c_{-s}) \quad \text{and} \quad q_{s,r} = \pi(Y_s \,|\, \theta^{(r)}, u_s) \, \pi(c \,|\, c_{-s}), \qquad (3.7)

respectively, for c = \{s\} \cup c_{-s} and c = (E_1, E_2, \ldots, E_r \cup \{s\}, \ldots, E_{N^*}) with r = 1, 2, \ldots, N^*. Here

G_0^*(d\theta_s) = \frac{\pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s)}{\int_{\Theta} \pi(Y_s \,|\, \theta_s, u_s) \, G_0(d\theta_s)} \qquad (3.8)

is the posterior distribution of \theta_s given that a new cluster is formed by the s-th individual. The explicit expression for q_{s,0}, after canceling 1/(\alpha_0 + N - 1), is

q_{s,0} = \alpha_0 \sum_{k=0}^{k^*} \sum_{(n_1, \ldots, n_{k+1})} \exp\left[ H(n_1, \ldots, n_{k+1}) \right] \frac{n_0!}{n_1! \cdots n_{k+1}!} \left( \frac{1}{k+1} \right)^{n_0} p(k),

where

H(n_1, \ldots, n_{k+1}) = \log \int_{\mathbb{R}^{q+K_s}} \int_0^{\infty} \pi(Y_s \,|\, \beta^{(s)}, u_s, \sigma_s^2) \, \pi(\sigma_s^2) \, d\sigma_s^2 \, d\beta^{(s)}
= -\frac{n - q - K_s}{2} \log(2\pi) - \log \Gamma(a_\sigma) - a_\sigma \log b_\sigma + \log \Gamma\left( \frac{2a_\sigma + n - q - K_s}{2} \right)
\quad - \left( \frac{2a_\sigma + n - q - K_s}{2} \right) \log\left( b_\sigma^{-1} + \tfrac{1}{2} Y_{s*}^T P^{(s)} Y_{s*} \right) - \frac{1}{2} \log \left| X^{(s)T} X^{(s)} \right|

with Y_{s*} = Y_s - u_s 1_n and P^{(s)} = I - X^{(s)} (X^{(s)T} X^{(s)})^{-1} X^{(s)T}. Similarly, q_{s,r} is given by

q_{s,r} = N_r \, \frac{1}{(2\pi)^{n/2}} \, \frac{\Gamma(a_\sigma + n/2)}{\Gamma(a_\sigma)} \, \frac{b_*^{(a_\sigma + n/2)}}{b_\sigma^{a_\sigma}},

where b_*^{-1} = b_\sigma^{-1} + \tfrac{1}{2} (Y_{s*} - X_r \beta_r)^T (Y_{s*} - X_r \beta_r) and X_r \beta_r \equiv \theta^{(r)} for the cluster C_r. Note that \theta_{s'} \equiv \theta^{(r)} if s' \in C_r.
Expression (3.6) explicitly demonstrates the clustering capability of the functional DP prior. The current value of \theta_s can be set to one of the distinct \theta^{(r)} functions with probability \sum_{r=1}^{N^*} q_{s,r} / (q_{s,0} + \sum_{r=1}^{N^*} q_{s,r}); this positive probability is what allows sites to cluster in terms of \theta_s, as mentioned. Expression (3.6) also allows a new \theta_s to be generated from the posterior distribution G_0^*.
(2) Update \sigma^2: The update of \sigma_s^2 is carried out once \theta_s is obtained via (3.6). Regardless of whether \theta_s is a new value or an existing \theta^{(r)}, for each individual s = 1, \ldots, N, the conditional posterior distribution of \sigma_s^2 given the other parameters is

\pi(\sigma_s^2 \,|\, \cdots) = IG(a, b), \quad a = a_\sigma + \frac{n}{2}, \quad b^{-1} = b_\sigma^{-1} + \tfrac{1}{2} (Y_{s*} - X^{(s)} \beta^{(s)})^T (Y_{s*} - X^{(s)} \beta^{(s)}).
Note that \theta uniquely determines the collection of parameters (\beta, K, T), where \beta is the set of spline coefficients, K = (K_1, \ldots, K_N) is the set of numbers of knots and T = (\xi^{(1)}, \ldots, \xi^{(N)}) is the set of knot locations. Indeed, since \theta contains several identical components, the corresponding components of (\beta, K, T) are also identical to each other. In what follows, we present the updating steps for the d distinct components of (\beta, K, T), namely (\beta_r, K_r, T_r) for r = 1, 2, \ldots, d. Let \cup_{r=1}^{d} C_r be the partition of \{1, 2, \ldots, N\} at the current update of the Gibbs sampler (thus, d is the number of distinct clusters).
(3) Update (\beta_r, K_r, T_r): We first update K_r from its posterior marginal, then update T_r \,|\, K_r, and finally \beta_r \,|\, T_r, K_r from their respective conditional distributions. The posterior marginal probability of K_r = k is proportional to

p(k) \sum_{(n_1, \ldots, n_{k+1})} \nu(n_1, \ldots, n_{k+1}), \qquad (3.9)

with

\nu(n_1, \ldots, n_{k+1}) = \exp\{H(n_1, \ldots, n_{k+1})\} \, \frac{\Gamma(n_0 + 1)}{\prod_{l=1}^{k+1} \Gamma(n_l + 1)} \left( \frac{1}{k+1} \right)^{n_0}, \qquad (3.10)

where

H(n_1, \ldots, n_{k+1}) = \log \int_{\mathbb{R}^{q+k}} \prod_{s \in C_r} \pi(Y_s \,|\, X^{(s)}, \beta, \sigma_s^2, u_s) \, d\beta
= -\frac{n|C_r| - q - k}{2} \log(2\pi) - \frac{n}{2} \sum_{s \in C_r} \log \sigma_s^2 - \frac{1}{2} \log \left| \sum_{s \in C_r} \frac{1}{\sigma_s^2} X^{(s)T} X^{(s)} \right| - \frac{1}{2} \sum_{s \in C_r} \sigma_s^{-2} Y_{s*}^T Y_{s*}
\quad + \frac{1}{2} \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} Y_{s*} \right)^T \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} X^{(s)} \right)^{-1} \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} Y_{s*} \right).

Note that X^{(s)} \equiv X_r for all s \in C_r.
Note that X(s) ≡ Xr for all s ∈ Cr.
The summation in (3.9) is over all non-negative integers n_1, n_2, \ldots, n_{k+1} such that \sum_{l=1}^{k+1} n_l = n_0 \equiv n - 1 - (k+1)w. Obtaining the posterior probability of K_r = k requires evaluating (3.10) for each value of k \geq 0. This could potentially require a significant amount of computational time and drastically reduce the efficiency of the Gibbs chain, but it did not in our application, thanks to the closed-form expressions for H under the flat prior \pi_0.
To update T_r given K_r = k, note that this is equivalent to updating (n_1, \ldots, n_{k+1}) with probabilities p(n_1, \ldots, n_{k+1}) \propto \nu(n_1, n_2, \ldots, n_{k+1}). This is carried out by exhaustively listing all such combinations and numerically computing the corresponding probabilities. The update of \beta_r given T_r and K_r = k is based on the conditional distribution N(\mu_{\beta_r}, \Sigma_{\beta_r}) with

\Sigma_{\beta_r} = \left( \sum_{s \in C_r} \sigma_s^{-2} X^{(s)T} X^{(s)} \right)^{-1} = \left( X_r^T X_r \right)^{-1} \left( \sum_{s \in C_r} \sigma_s^{-2} \right)^{-1} \qquad (3.11)

and

\mu_{\beta_r} = \Sigma_{\beta_r} \, X_r^T \sum_{s \in C_r} \sigma_s^{-2} Y_{s*}, \qquad (3.12)

with the q+k components of \beta_r generated independently of each other from their respective component densities.
(4) Update \lambda: \lambda is updated using

\pi(\lambda \,|\, \cdots) \propto \pi_1(\lambda) \prod_{s=1}^{N} p(K_s) \propto \frac{\lambda^{a_\lambda^* - 1} e^{-\lambda / b_\lambda}}{\left( \sum_{l=0}^{k^*} \lambda^l / l! \right)^N}, \qquad (3.13)

with a_\lambda^* = a_\lambda + \sum_{r=1}^{d} |C_r| k_r, where k_r is the number of knot points corresponding to \theta^{(r)} in cluster C_r.
(5) Update \alpha_0: To update \alpha_0 under the prior \pi(\alpha_0) \propto \alpha_0^{a_\alpha - 1} e^{-\alpha_0 / b_\alpha}, we utilize the two-step procedure of Escobar and West [6]. At the b-th iteration: (1) draw \kappa from the Beta distribution Be(\alpha_0^{(b-1)} + 1, N); and (2) draw \alpha_0^{(b)} from the mixture of two Gamma distributions

\pi_\kappa \, Ga\left( a_\alpha + N^*, \, (1/b_\alpha - \log(\kappa))^{-1} \right) + (1 - \pi_\kappa) \, Ga\left( a_\alpha + N^* - 1, \, (1/b_\alpha - \log(\kappa))^{-1} \right),

where N^* is the current number of clusters and the membership probability satisfies

\pi_\kappa = \frac{a_\alpha + N^* - 1}{N (1/b_\alpha - \log(\kappa))}.
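The two-step update can be sketched as follows. Two reading choices in this sketch are ours: we treat the displayed ratio as the mixing odds \pi_\kappa / (1 - \pi_\kappa), which is the form given in Escobar and West (1995), and we use the scale parameterization of the Gamma draw, consistent with the rate 1/b_\alpha - \log(\kappa). Function and argument names are also ours.

```python
import numpy as np

def update_alpha0(alpha0, n_clusters, N, a_alpha, b_alpha, rng):
    """Escobar-West two-step update for the DP precision alpha0 under a
    Gamma(a_alpha, scale=b_alpha) prior."""
    kappa = rng.beta(alpha0 + 1.0, N)
    rate = 1.0 / b_alpha - np.log(kappa)       # rate of both Gamma components
    odds = (a_alpha + n_clusters - 1.0) / (N * rate)
    pi_kappa = odds / (1.0 + odds)             # displayed ratio read as odds
    shape = (a_alpha + n_clusters if rng.random() < pi_kappa
             else a_alpha + n_clusters - 1.0)
    return rng.gamma(shape, 1.0 / rate)        # scale = 1 / rate
```

Since \kappa \in (0, 1), the rate 1/b_\alpha - \log(\kappa) is always positive, so the draw is well defined.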
(6) Update u: We have

\pi(u \,|\, \cdots) \propto \pi(Y \,|\, u, \cdots) \, \pi(u \,|\, \tau^2)
\propto \exp\left\{ -\frac{1}{2} \left[ \sum_{s=1}^{N} \left( u_s^2 \, 1_n^T 1_n - 2 u_s 1_n^T W_s \right) / \sigma_s^2 + \frac{(u - u_0)^T (u - u_0)}{\tau^2} \right] \right\},

where W_s = Y_s - X^{(s)} \beta^{(s)}. Notice that 1_n^T 1_n = n. The conditional posterior distribution of u is N(\mu_u, \Sigma_u) with

\Sigma_u = \left( n \, \mathrm{Diag}(\sigma_1^{-2}, \ldots, \sigma_N^{-2}) + \tau^{-2} I_N \right)^{-1} \quad \text{and} \quad \mu_u = \Sigma_u \left( W + \tau^{-2} I_N u_0 \right),

where W = (1_n^T W_1 / \sigma_1^2, \ldots, 1_n^T W_N / \sigma_N^2)^T.
(7) Update \tau^2: The posterior distribution of \tau^2 is given by

\pi(\tau^2 \,|\, \cdots) \propto \pi(u \,|\, \tau^2) \, \pi(\tau^2) \sim IG(a_\tau^*, b_\tau^*)

with a_\tau^* = \frac{N}{2} + a_\tau and b_\tau^* = \left( \frac{1}{b_\tau} + \frac{1}{2} u^T u \right)^{-1}.
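Step (7) is a routine conjugate draw. A minimal sketch of ours, realizing the inverse Gamma draw as the reciprocal of a Gamma draw with shape a_\tau^* and scale b_\tau^*:

```python
import numpy as np

def update_tau2(u, a_tau, b_tau, rng):
    """Conjugate draw tau^2 | u ~ IG(N/2 + a_tau, (1/b_tau + u'u/2)^(-1)):
    if 1/tau^2 ~ Gamma(a*, scale=b*), then tau^2 ~ IG(a*, b*)."""
    u = np.asarray(u, dtype=float)
    a_star = a_tau + u.size / 2.0
    b_star_inv = 1.0 / b_tau + 0.5 * (u @ u)
    return 1.0 / rng.gamma(a_star, 1.0 / b_star_inv)
```

With a nearly flat prior (large b_\tau), the posterior mean of \tau^2 is close to u^T u / (N/2 + a_\tau - 1), so the draw tracks the empirical variability of the random effects.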
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clusterings of Sites
Given the validity and details of the Gibbs updating scheme for all the parameters based on the prior components elicited in section 3.2.2, we are able to perform Bayesian inference on their posterior samples, especially the estimated cluster information \theta = (\theta_1, \ldots, \theta_N). Suppose that we have M Gibbs samples \{\theta^{(m)}\}_{m=1}^{M} for \theta after convergence is established. It can be challenging to obtain results for the clustering configuration directly from these samples: because the clustering information (the number of clusters and the specific assignment of each site) changes at each Gibbs iteration, classical posterior analysis via summary statistics is neither straightforward nor easy to interpret. We therefore introduce a 2 \times 2 cross-classification table to measure the deviation between two cluster configurations, say C and C^*. The table has entries \{n_{ij}\}, 1 \leq i, j \leq 2, where n_{11} is the number of pairs of sites, out of all n_{++} = N(N-1)/2 pairs, that are in the same cluster in C as well as in C^*, while n_{22} is the number of pairs of sites that are in the same cluster in neither C nor C^*. n_{12} is the number of pairs that are in the same cluster in C but not in C^*. Similarly, n_{21} is the number of pairs that are not in the same cluster in C but are in the same cluster in C^*.
If C and C^* have identical configurations, we have n_{11} = n_{+1} and n_{22} = n_{+2}. Here n_{+1} is the number of pairs that are in the same cluster in C^*, whereas n_{+2} is the number of pairs that are not in the same cluster in C^*; thus n_{+1} and n_{+2} depend only on C^*, not on C. We then consider two measures, the sensitivity S_1 = n_{11}/n_{+1} and the specificity S_2 = n_{22}/n_{+2}. The interpretation of S_1 is that it is the proportion of pairs that are also in the same cluster in C given that they are in the same cluster in C^*. Similarly, S_2 is the proportion of pairs that are also not in the same cluster in C given that they are not in the same cluster in C^*. Clearly S_1 and S_2 take values between 0 and 1, with the ideal case (S_1, S_2) = (1, 1). Also note that when C refines C^*, that is, every cluster of C is contained in some cluster of C^*, the number of clusters in C is at least that of C^*; in this case n_{12} = 0, which yields S_2 = 1 but S_1 \leq 1. Conversely, when C^* refines C, we have S_1 = 1 and S_2 \leq 1. Thus, deviations from the point (1, 1), or from the lines y = 1 and x = 1, indicate the nature of the deviations of the clustering C from the clustering C^*.
For simulation studies, (S_1, S_2) can be used to measure deviations between a Gibbs cluster configuration and the true one. Suppose that from the Gibbs output we have cluster configurations \{C_m\}_{m=1}^{M}. For the m-th Gibbs cluster configuration, we calculate (S_1(m), S_2(m)) with C = C_m and C^* = C_0, the true cluster configuration, which is known in a simulation. A plot of (S_1(m), S_2(m)) indicates how dispersed the Gibbs clusterings are with respect to the true configuration: points concentrated close to (1, 1) indicate that the deviations of C_m from C_0 are small. Points (S_1(m), S_2(m)) far from (1, 1) indicate two different types of deviation based on their proximity to the lines x = 1 or y = 1: proximity to x = 1 indicates that, among the pairs that are in the same cluster in C_0, more pairs are also in the same cluster in C_m, whereas proximity to y = 1 indicates that, among the pairs not in the same cluster in C_0, more pairs are also not in the same cluster in C_m. In the 2 \times 2 cross-classification table there are two degrees of freedom: the quantities n_{+1} and n_{+2} are fixed by the true cluster configuration, so two free parameters, say n_{11} and n_{22}, determine all entries of the table completely. Thus, the scatter plot of \{(S_1(m), S_2(m)) : m = 1, 2, \ldots, M\} gives a complete picture of the variability. As a numerical measure of variability, we consider
SS = \frac{1}{M} \sum_{m=1}^{M} \left( 2 - S_1(m) - S_2(m) \right),

with smaller SS values indicating better performance.
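The quantities (S_1, S_2) and SS can be computed directly from label vectors by scanning all N(N-1)/2 site pairs; an illustrative helper of ours (names are not from the dissertation):

```python
import numpy as np
from itertools import combinations

def cluster_agreement(labels, true_labels):
    """Pairwise 2x2 cross-classification of two cluster configurations:
    returns (S1, S2) = (n11/n+1, n22/n+2) over all N(N-1)/2 pairs."""
    n11 = n12 = n21 = n22 = 0
    for i, j in combinations(range(len(labels)), 2):
        same_c = labels[i] == labels[j]
        same_true = true_labels[i] == true_labels[j]
        if same_c and same_true:
            n11 += 1
        elif same_c:
            n12 += 1
        elif same_true:
            n21 += 1
        else:
            n22 += 1
    return n11 / (n11 + n21), n22 / (n12 + n22)

def ss_measure(configs, true_labels):
    """SS = (1/M) * sum_m (2 - S1(m) - S2(m)); smaller is better."""
    devs = [2.0 - sum(cluster_agreement(c, true_labels)) for c in configs]
    return float(np.mean(devs))
```

For example, comparing the labeling [0, 1, 1, 1] against the truth [0, 0, 1, 1] gives (S_1, S_2) = (0.5, 0.5), and identical configurations give (1, 1) with SS = 0.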
3.5 Numerical Analysis: Simulation Study and Real Data Example
In this section, we apply the proposed method to numerical studies, including simulations under various settings as well as a real-data analysis of the lung cancer data in the U.S. We simulate data at n = 38 time points for N = 49 sites based on the model. We partition the sites into 6 groups (clusters) C_r, r = 1, 2, \ldots, 6. Within each cluster, the curves share common information in the form of (\beta^{(r)}, K_r, T^{(r)}). As stated in the previous section, we collect the posterior samples via an MCMC procedure with a Gibbs sampling scheme at each iteration. The true parameters are specified as follows.

r | |C_r| | C_r (Site #) | T^{(r)} | \beta^{(r)}
1 | 9 | 2, 4, 11, 25, 27, 36, 43, 46, 49 | (8, 30) | (0.0667, 0.0500, -0.2000)
2 | 9 | 12, 13, 14, 21, 22, 26, 33, 40, 48 | (7, 11, 20) | (-0.0667, 0.0300, 0.1667, -0.3333)
3 | 8 | 3, 5, 15, 17, 24, 30, 35, 42 | (13, 25) | (-0.1333, -0.1333, 0.1667)
4 | 7 | 1, 9, 10, 23, 32, 39, 41 | (17, 21) | (-0.0333, -0.0167, 0.1667)
5 | 7 | 6, 18, 20, 28, 31, 38, 44 | (18, 20) | (-0.0667, 0.0200, -0.1333)
6 | 9 | 7, 8, 16, 19, 29, 34, 37, 45, 47 | (18, 20) | (0.0167, 0.0167, 0.3333)

For the true values of the between-curve variability parameter \tau^2 and the within-curve variances \sigma_s^2, we consider the following settings: \tau^2 = 1/30 or 2/30, and \sigma_s^2 \sim U(1/30, 2/30) or U(2/30, 4/30). These specifications lead to 4 scenarios.
3.5.1 An Alternative Less-informative Prior Choice for α0
Recall that when specifying the priors for the parameters of our model in section 3.2.2, we raised a question about the choice of prior for the precision parameter \alpha_0, which directly controls the tendency to create new clusters, that is, to increase the number of clusters. Under a Gamma prior for \alpha_0, the corresponding posterior can conveniently be sampled from a mixture of two Gamma distributions [6]. However, in practice, to keep the splitting of existing clusters within a reasonable range, we may need quite small values of \alpha_0, which leads to a highly subjective Gamma prior. For instance, in our simulation we would require \alpha_0 \sim Ga(a_\alpha, b_\alpha) with E(\alpha_0) = a_\alpha b_\alpha \approx 10^{-8} and Var(\alpha_0) \approx 10^2, hence an extremely small a_\alpha and a large b_\alpha. To resolve this issue, we introduce another prior, \alpha_0 \sim LN(\mu_\alpha, \sigma_\alpha^2), where LN(\mu, \sigma^2) denotes the log-normal distribution with mean parameter \mu and variance parameter \sigma^2. The intuition behind this re-parameterization is that it is more appropriate to put a relatively non-informative Gaussian prior on the log scale: \log(\alpha_0) \sim N(\mu_\alpha, \sigma_\alpha^2). For our simulation we use \mu_\alpha = -40 and \sigma_\alpha = 7. Note that the mean and variance of \alpha_0 are then E(\alpha_0) = e^{\mu_\alpha + \sigma_\alpha^2/2} \approx 1.86 \times 10^{-7} and Var(\alpha_0) = (e^{\sigma_\alpha^2} - 1) e^{2\mu_\alpha + \sigma_\alpha^2} \approx 6.57 \times 10^{7}. Hence the log-normal prior for \alpha_0 is much less subjective than the original Gamma prior.

To complete the posterior sampling scheme under \pi(\alpha_0) = LN(\mu_\alpha, \sigma_\alpha^2), we simply apply a grid-search method over (\mu_\alpha - 3\sigma_\alpha, \mu_\alpha + 3\sigma_\alpha), which covers over 99% of the prior mass of \log(\alpha_0).
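One way to realize this grid search is to evaluate the unnormalized log posterior of \log(\alpha_0) on an equally spaced grid and sample from the normalized weights. In the sketch below (our own), `loglik` is a placeholder for the \alpha_0-dependent part of the full conditional, e.g. the \alpha_0 factors of \pi(c); it is not a function from the dissertation.

```python
import numpy as np

def sample_log_alpha0(loglik, mu, sigma, rng, n_grid=200):
    """Grid-based draw of alpha0 under log(alpha0) ~ N(mu, sigma^2),
    restricted to (mu - 3*sigma, mu + 3*sigma), about 99.7% of the
    prior mass of log(alpha0)."""
    grid = np.linspace(mu - 3 * sigma, mu + 3 * sigma, n_grid)
    logpost = (np.array([loglik(np.exp(g)) for g in grid])
               - 0.5 * ((grid - mu) / sigma) ** 2)
    p = np.exp(logpost - logpost.max())  # stabilize before normalizing
    p /= p.sum()
    return np.exp(rng.choice(grid, p=p))
```

Subtracting the maximum before exponentiating keeps the weights numerically stable even though \alpha_0 itself lives on the scale of e^{-40}.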
3.5.2 Simulations and Comparison with an Existing Approach
We generate 10 sets of data under each scenario specified at the beginning of section 3.5 and run two MCMC chains for 5,000 iterations each to ensure convergence. For comparison, we also apply the functional clustering methodology proposed by James and Sugar (2003) to the same simulated datasets. The results are listed in Tables 3.1 and 3.2.

\tau^2 | \sigma^2 | S_1 (Sensitivity) | S_2 (Specificity) | SS (Overall) | Med. # | Mode #
1/30 | U(1/30, 2/30) | 0.8813 | 0.9844 | 0.1343 | 7 | 6
1/30 | U(2/30, 4/30) | 0.6466 | 0.9827 | 0.3707 | 6.5 | 6
2/30 | U(1/30, 2/30) | 0.8898 | 0.9843 | 0.1260 | 6 | 6
2/30 | U(2/30, 4/30) | 0.6443 | 0.9827 | 0.3730 | 6.5 | 6

Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach

\tau^2 | \sigma^2 | S_1 (Sensitivity) | S_2 (Specificity) | SS (Overall) | True #
1/30 | U(1/30, 2/30) | 0.7107 | 0.9066 | 0.3827 | 6
1/30 | U(2/30, 4/30) | 0.7017 | 0.9028 | 0.3955 | 6
2/30 | U(1/30, 2/30) | 0.6792 | 0.9137 | 0.4071 | 6
2/30 | U(2/30, 4/30) | 0.6978 | 0.9152 | 0.3870 | 6

Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar’s Approach
From the results of our approach in Table 3.1, we obtain excellent performance in specificity (S_2) as well as in the median and mode of the estimated number of clusters. Meanwhile, the sensitivity (S_1) is mainly affected by relatively large \sigma_s^2, the variability of individual curves within a cluster. On the other hand, since James and Sugar's model requires a pre-specified number of clusters, as introduced in section 3.1, we simply plug in the true number of clusters, 6, when applying their model, to avoid an extra selection step. We also point out that for their approach the diagnostic measures (specificity, sensitivity, etc.) are calculated from the predicted cluster memberships, rather than by averaging over posterior samples as in our case. The comparison makes clear that our method leads to better specificity (S_2) in general and higher sensitivity (S_1) for smaller \sigma_s^2, although their approach appears slightly more robust under large individual variability.
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S.
Besides the simulation study given above, we apply our proposed model to the lung cancer mortality rate curves from 48 states and Washington, D.C. In the data analysis, the minimum number of time points in each interval is set to w = 7, which leads to a maximum number of knot points of k^* = 4. The hyper-parameters are then specified as follows. The shape and scale parameters of the inverse Gamma priors on \sigma_s^2 and \tau^2 are a_\sigma = a_\tau = 2 and b_\sigma = b_\tau = 100, respectively, so that the prior variances are large enough to be considered effectively infinite. The shape and scale parameters of the Gamma prior on \lambda are a_\lambda = 0.004 and b_\lambda = 500, respectively. For the prior on the precision parameter \alpha_0, as discussed in section 3.5.1, we use the log-normal distribution with \mu_\alpha = -100 and \sigma_\alpha = 11. Recall that the precision parameter \alpha_0 controls the number of clusters while \lambda controls the number of knot points.
We run two MCMC chains with Gibbs sampling for 20,000 iterations each. Convergence is attained after 15,000 iterations according to the Gelman-Rubin diagnostic introduced in Chapter 1, section 1.4.1. We collect the posterior sample by taking every fifth draw from the last 5,000 iterations of both chains; this thinning provides a better-mixed posterior sample with a total size of 2,000. The posterior distribution of the number of clusters of states is shown in Table 3.3.

P(d = 2) | P(d = 3) | P(d = 4) | P(d = 5) | P(d = 6) | P(d = 7) | P(d = 8)
0.093 | 0.217 | 0.298 | 0.236 | 0.110 | 0.038 | 0.008

Table 3.3: The Distribution of the Number of Clusters d
It is clear that d = 4 gives the estimated number of clusters with the highest probability. To provide a cluster configuration of all the states, we apply an agglomerative clustering algorithm with the dissimilarity distance defined as

\mathrm{dist}(s_1, s_2) = 1 - \frac{1}{M} \sum_{m=1}^{M} \mathrm{dist}_m(s_1, s_2),

where \mathrm{dist}_m(s_1, s_2) = 1 if s_1 and s_2 fall into the same cluster in the m-th iteration, and \mathrm{dist}_m(s_1, s_2) = 0 otherwise. It is natural to set the threshold for the maximum number of clusters in the algorithm to d, the posterior mode of the number of clusters. The main algorithm
can be summarized as follows.
1. Create an individual cluster for each site.
2. Among all current clusters, select the two clusters whose elements have the smallest dissimilarity distance.
3. Merge the two selected clusters into a new cluster, replacing the two old ones.
4. Repeat steps 2 and 3 until the desired threshold for the number of clusters is reached.
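The dissimilarity matrix and the merging loop above can be sketched as follows. The text does not specify the linkage used once clusters contain several sites, so we assume single linkage (the minimum pairwise dissimilarity between members); function names are ours.

```python
import numpy as np

def coclustering_dissimilarity(configs):
    """dist(s1, s2) = 1 - (1/M) * #{iterations m with s1, s2 co-clustered},
    from an M x N array of posterior cluster labels."""
    configs = np.asarray(configs)
    M, N = configs.shape
    same = np.zeros((N, N))
    for c in configs:
        same += (c[:, None] == c[None, :])
    return 1.0 - same / M

def agglomerate(dist, d_max):
    """Agglomerative merging under single linkage: repeatedly merge the
    pair of clusters with the smallest between-cluster dissimilarity
    until only d_max clusters remain."""
    clusters = [{i} for i in range(dist.shape[0])]
    while len(clusters) > d_max:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

Setting d_max to the posterior mode of d (here d = 4) reproduces the thresholding described in the text.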
The estimated cluster configuration is shown in Table 3.4, along with a dendrogram in Figure 3.1 that illustrates the progress of the agglomerative clustering algorithm introduced above. Moreover, Figure 3.2 shows the data from individual sites grouped by cluster configuration, while Figure 3.3 shows the averaged curve (black line) over all sites within each cluster.

Cluster | # of Sites | Cluster Details (State)
(a) | 8 | CO, LA, MA, ME, OH, OR, PA, WY
(b) | 21 | AL, AR, GA, IA, ID, IN, KS, KY, MN, MS, MO, MT, NC, ND, NE, OK, SC, SD, TN, WI, WV
(c) | 7 | CA, DC, FL, MD, NJ, NY, UT
(d) | 13 | AZ, CT, DE, IL, MI, NH, NM, NV, RI, TX, VA, VT, WA

Table 3.4: Estimated Cluster Configuration
Cluster (a) has one knot point, around year 1991; the log-scaled rates for these states maintained a steady trend after that. Cluster (b) also has one knot point, but unlike cluster (a) it occurred a little earlier, around year 1989. Cluster (c) has the same knot point at year 1991 as cluster (a) but demonstrates a clear decreasing trend after that. As for cluster (d), it represents states with two potential knot points, detected around years 1982 and 1992: the log rates of these states increased more slowly after 1982 and dropped after 1992.
3.6 Conclusion and Discussion
In this chapter, we proposed a mixed spline model that performs concurrent clustering of multiple curves based on their patterns and trends. The clustering is induced by a Dirichlet process prior on the space of step functions over the time points. The model was verified by a series of simulations and then applied to age-adjusted cancer mortality rates for all the states in the U.S. to find state-wise clusters with similar trends. Given the results of the real data analysis, our model provides a reasonable and meaningful cluster configuration and captures the group-wise features of the curves. As a potential further investigation, one can extend the piecewise linear model (corresponding to q = 2) to higher-order polynomial models, such as cubic spline functions, to summarize more information about the patterns of the curves; the criteria and interpretation of the clustering may then have to be re-established.
Figure 3.1: Dendrogram for the Clustering Estimation
Figure 3.2: Estimated Cluster Configuration
Figure 3.3: Overall Trend by Clusters
APPENDICES
Appendix A
Proof of Theorems in Chapter 1
Proof of Theorem 1. Note that the likelihood function given D, \Sigma_0, \beta, \gamma is

l(y; D, \Sigma_0, \beta, \gamma) = \frac{1}{(2\pi)^{\frac{1}{2}Mdn} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(y - X\beta - Z\gamma)' \Sigma^{-1} (y - X\beta - Z\gamma)} \cdot \frac{1}{(2\pi)^{\frac{1}{2}Mdk} |G|^{\frac{1}{2}}} e^{-\frac{1}{2} \gamma' G^{-1} \gamma},

where \Sigma = I_M \otimes D \otimes I_n and G = I_M \otimes \Sigma_0 \otimes I_k. Let l(y; D, \Sigma_0) be the likelihood with \beta and \gamma integrated out. To show that the posterior is improper, it is enough to show

\int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Sigma_0) \, dD \, d\Sigma_0 = \infty. \qquad (A.1)

Using the eigenvalue decomposition \Sigma_0 = Q' \Psi Q, we can write \pi(\Sigma_0) \, d\Sigma_0 = \pi(\Psi) \, d\Psi \cdot \pi(Q) \, dQ with

\pi(\Psi) = \frac{1}{\prod_{j=1}^{d} \psi_j^{p}} \prod_{i<j} (\psi_i - \psi_j)

by Yang and Berger [23]. Then, we have

\int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Sigma_0) \, dD \, d\Sigma_0 = \int l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Psi) \, \pi(Q) \, dD \, d\Psi \, dQ.

To show (A.1), it is enough to find subsets S_D and S_\Psi of the domains of integration for D and \Psi on which the integral is unbounded. That is,

\int_{S_D \times S_\Psi} l(y; D, \Sigma_0) \, \pi(D) \, \pi(\Psi) \, dD \, d\Psi = \infty.
Let S_D = {σ_j^2 ≥ ε, j = 1, \ldots, d} for a fixed ε > 0 and S_Ψ = {ψ : ψ_1 > \cdots > ψ_d > 0}, where
ψ = (ψ_1, \ldots, ψ_d) is the set of eigenvalues of Σ0, and let S* = S_D × S_Ψ. To proceed further, we
denote the ordering of two m × m symmetric matrices A and B with respect to positive
definiteness by A ⪯ B if v'Av ≤ v'Bv for any v ∈ R^m. We also say B ⪰ A if A ⪯ B.
Note that we have
\[
l(y; D, \Sigma_0) = \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}\,|\Sigma_0|^{\frac{1}{2}Mk}\,|Z'PZ+G^{-1}|^{\frac{1}{2}}}\; e^{-\frac{1}{2}y'Wy},
\]
where W = P − PZ(Z'PZ + G^{-1})^{-1}Z'P with P = I_M ⊗ D^{-1} ⊗ (I_n − X(X'X)^{-1}X'). For
simplicity, define P_X = X(X'X)^{-1}X' and P_H = I_n − P_X so that P = I_M ⊗ D^{-1} ⊗ P_H. Let
0 < λ_1 < λ_2 < \cdots < λ_d be the eigenvalues of Z'P_H Z. Then, we have
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}}
= \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes Z'P_H Z + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\ge \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes \lambda_d I_k + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\ge \frac{1}{|\lambda_d \varepsilon^{-1}\Sigma_0 + I|^{\frac{Mk}{2}}}
= \prod_{j=1}^{d} \frac{1}{(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}
\]
on S_D, where the first inequality holds since λ_d I_k ⪰ Z'P_H Z. Also, e^{-\frac{1}{2}y'Wy} ≥ e^{-\frac{1}{2}y'Py}
since PZ(Z'PZ + G^{-1})^{-1}Z'P ⪰ 0.
Thus, on S_D × S_Ψ,
\[
l(y; D, \Psi, Q)\,\pi(\Psi) \ge \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}}\; e^{-\frac{1}{2}y'Py}\; \frac{\prod_{i<j}(\psi_i-\psi_j)}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}.
\]
Since
\[
\int_{S_D} \frac{(2\pi)^{-\frac{1}{2}(n-q-1)}\,|X'X|^{-\frac{1}{2}Md}}{|D|^{\frac{1}{2}M(n-q-1)}}\; e^{-\frac{1}{2}y'Py}\,\pi(D)\, dD > 0,
\]
we are going to show
\[
\int_{S_\Psi} \frac{\prod_{i<j}(\psi_i-\psi_j)}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi = \infty. \tag{A.2}
\]
The first term in the integrand in (A.2), corresponding to the finite expansion of ∏_{i<j}(ψ_i − ψ_j),
is
\[
\frac{\psi_1^{d-1}\psi_2^{d-2}\cdots\psi_{d-1}^{1}\psi_d^{0}}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}. \tag{A.3}
\]
First consider case (1), p = \frac{d+1}{2}, corresponding to the Jeffreys prior. Since d ≥ 4, let j* = ⌈\frac{d-1}{2}⌉ > 1, where ⌈x⌉ is the smallest integer greater than or equal to x. Then we have
\frac{d-1}{2} − j > 0 for 1 ≤ j ≤ j* − 1 and \frac{d-1}{2} − j ≤ 0 for j* ≤ j ≤ d. Consider A = {ψ : ψ_1 ≥
1, ψ_2 ≥ 1, \ldots, ψ_{j*−1} ≥ 1} ∩ S_Ψ. Then
\[
\begin{aligned}
\int_{S_\Psi} \frac{\psi_1^{d-1}\psi_2^{d-2}\cdots\psi_{d-1}^{1}\psi_d^{0}}{\prod_{j=1}^{d}\psi_j^{p}\,(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi
&\ge \int_{A} \prod_{j=1}^{d} \psi_j^{\frac{d-1}{2}-j}\, \frac{1}{(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \\
&\ge \int_{A} \frac{\prod_{j\ge j^*}\psi_j^{\frac{d-1}{2}-j}}{\prod_{j\ge j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\; \frac{1}{\prod_{j<j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \\
&\ge \int_{A} \Big(\frac{1}{\psi_{j^*}}\Big)^{\sum_{j\ge j^*}\left|\frac{d-1}{2}-j\right|} \frac{1}{\prod_{j\ge j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\; \frac{1}{\prod_{j<j^*}(1+\psi_j\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\Psi \tag{A.4}
\end{aligned}
\]
since ψ_{j*} > ψ_{j*+1} > \cdots > ψ_d > 0.
For p > 1 and b > 0, we have
\[
\int_0^t \frac{1}{(1+bs)^p}\, ds = \frac{1}{b(p-1)}\left[1 - \frac{1}{(1+bt)^{p-1}}\right]. \tag{A.5}
\]
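The identity (A.5) is elementary but load-bearing for the iterated integrations that follow, so a quick numerical sanity check may be useful; the values of p, b, and t below are arbitrary illustrative choices, not quantities from the proof.

```python
import math

def lhs(p, b, t, m=200_000):
    # left side of (A.5) by the composite midpoint rule
    h = t / m
    return h * sum((1.0 + b * (i + 0.5) * h) ** (-p) for i in range(m))

def rhs(p, b, t):
    # closed form on the right side of (A.5)
    return (1.0 - (1.0 + b * t) ** (-(p - 1))) / (b * (p - 1))

p, b, t = 2.5, 1.3, 4.0
assert abs(lhs(p, b, t) - rhs(p, b, t)) < 1e-6
```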
Using (A.5), we integrate out (A.4) with respect to ψ_d, \ldots, ψ_{j*} on A ∩ S_Ψ. First,
\[
\int_0^{\psi_{d-1}} \frac{1}{(1+\psi_d\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_d = \frac{1}{\lambda_d\varepsilon^{-1}\big(\frac{Mk}{2}-1\big)}\left[1 - \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right].
\]
Thus, using (A.5) again,
\[
\begin{aligned}
&\int_0^{\psi_{d-2}} \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}} \int_0^{\psi_{d-1}} \frac{1}{(1+\psi_d\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_d\, d\psi_{d-1} \\
&\quad= \int_0^{\psi_{d-2}} \frac{1}{\lambda_d\varepsilon^{-1}\big(\frac{Mk}{2}-1\big)}\left[1 - \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right] \frac{1}{(1+\psi_{d-1}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}}}\, d\psi_{d-1} \\
&\quad= \frac{1}{\lambda_d^2\varepsilon^{-2}\big(\frac{Mk}{2}-1\big)^2}\left[1 - \frac{1}{(1+\psi_{d-2}\lambda_d\varepsilon^{-1})^{\frac{Mk}{2}-1}}\right] - \frac{1}{\lambda_d^2\varepsilon^{-2}\big(\frac{Mk}{2}-1\big)(Mk-2)}\left[1 - \frac{1}{(1+\psi_{d-2}\lambda_d\varepsilon^{-1})^{Mk-2}}\right].
\end{aligned}
\]
After integrating (A.4) with respect to ψ_d, \ldots, ψ_{j*+1}, we have
\[
\int_0^{\psi_{j^*-1}} P\!\left(\frac{1}{\sqrt{1+\psi_{j^*}\lambda_d\varepsilon^{-1}}}\right) \cdot \Big(\frac{1}{\psi_{j^*}}\Big)^{\sum_{j\ge j^*}\left|\frac{d-1}{2}-j\right|}\, d\psi_{j^*}, \tag{A.6}
\]
where P(·) is a polynomial of degree (Mk − 2)(d − j*) + Mk. Note that we assume \frac{Mk}{2} > 1.
However, the integrand above is not integrable at 0 since ∑_{j≥j*}|\frac{d-1}{2} − j| ≥ 1.
Hence we have shown that the posterior is improper. Similarly to case (1), by setting j* = ⌈\frac{d-2}{2}⌉ ≥ 1, we can prove case (2). For case (3), consider the corresponding multiple integral
on B = {ψ : 1 > ψ_1 > ψ_2 > \cdots > ψ_d > 0}. Then it is easy to verify that the first integral of
(A.3) with respect to ψ_d is infinite.
Proof of Theorem 2. Let l(y; D, L) be the likelihood with β and γ integrated out and Σ0
expressed in terms of L. To show that the posterior distribution is proper, we need to verify
\[
\int\!\!\int l(y; D, L)\,\pi(D)\,\pi(L)\, dD\, dL < \infty.
\]
Given the prior models for D and L, we can express
\[
\begin{aligned}
\int l(y; D, L)\,\pi(D)\,\pi(L)\, dD\, dL
&= \int l(y; D, L)\left[\prod_{j=1}^{d}\pi(\sigma_j^2)\right]\pi(\ell_{11})\,\pi(\ell_1)\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11} \\
&= C \int \frac{|\Sigma_0|^{-\frac{Mk}{2}}\, e^{-\frac{1}{2}y'Wy}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11},
\end{aligned}
\]
where C is a constant independent of D and L.
First, we find W_0 that does not depend on L and satisfies W_0 ⪯ W, so that e^{-\frac{1}{2}y'Wy} ≤
e^{-\frac{1}{2}y'W_0y}. Let W_0 = P − PZ(Z'PZ)^{-1}Z'P. Since (Z'PZ + G^{-1})^{-1} ⪯ (Z'PZ)^{-1} =
I_M ⊗ D ⊗ (Z'P_H Z)^{-1}, we have W = P − PZ(Z'PZ + G^{-1})^{-1}Z'P ⪰ P − PZ(Z'PZ)^{-1}Z'P = W_0.
Thus, e^{-\frac{1}{2}y'Wy} ≤ e^{-\frac{1}{2}y'W_0y}.
Let P_0 = P_H − P_H Z(Z'P_H Z)^{-1}Z'P_H. Then we can write W_0 = I_M ⊗ D^{-1} ⊗ P_0, so that
\[
y'W_0y = \sum_{j=1}^{d} \frac{c_j}{\sigma_j^2}, \qquad \text{where } c_j = \sum_{i=1}^{M} y_j^{(i)\prime} P_0\, y_j^{(i)} \text{ with } y_j^{(i)} = (y_{j1}^{(i)}, \cdots, y_{jn}^{(i)})'
\]
for j = 1, \ldots, d. Now, we claim that c_j > 0 for all j = 1, \ldots, d with probability 1. Note that
P_X, P_H and P_0 are idempotent, P_H is orthogonal to P_X = I − P_H, and P_0 is orthogonal
to P_H Z(Z'P_H Z)^{-1}Z'P_H. Thus, rank(P_H) = rank(I) − rank(P_X) = n − (q+1) since
rank(P_X) = rank(X) = q + 1, and rank(P_0) = rank(P_H) − rank(P_H Z(Z'P_H Z)^{-1}Z'P_H). Since
we assume n > q + 1 + k, rank(Z'P_H Z) = min{rank(Z), rank(P_H)} = k and rank(P_H Z) =
min{rank(Z), rank(P_H)} = k, so that rank(P_H Z(Z'P_H Z)^{-1}Z'P_H) = k. Therefore, rank(P_0) =
n − (q+1) − k > 0. Since P_0 is idempotent and non-degenerate, the claim follows.
Thus, we have
\[
e^{-\frac{1}{2}y'Wy} \le e^{-\frac{1}{2}y'W_0y} = \prod_{j=1}^{d} e^{-c_j/(2\sigma_j^2)}, \tag{A.7}
\]
where c_j > 0 for all j = 1, \ldots, d.
Now, we want to bound |Σ0|^{-\frac{Mk}{2}}/|Z'PZ+G^{-1}|^{\frac{1}{2}} above by a simpler expression:
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}}
= \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes D^{-1} \otimes Z'P_H Z + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
\le \frac{|I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}{|I_M \otimes E^{-1} \otimes I_k + I_M \otimes \Sigma_0^{-1} \otimes I_k|^{\frac{1}{2}}}
= \frac{|\Sigma_0^{-1}|^{\frac{Mk}{2}}}{|E^{-1}+\Sigma_0^{-1}|^{\frac{Mk}{2}}}
= \frac{|E|^{\frac{Mk}{2}}}{|E+\Sigma_0|^{\frac{Mk}{2}}},
\]
where E = λ_1^{-1}D and λ_1 is the smallest eigenvalue of Z'P_H Z. The inequality follows from
I_M ⊗ D^{-1} ⊗ Z'P_H Z ⪰ I_M ⊗ D^{-1} ⊗ λ_1 I_k = I_M ⊗ E^{-1} ⊗ I_k.
We further simplify the last expression. Let
\[
E = \begin{pmatrix} e_{11} & 0 \\ 0 & E_{11} \end{pmatrix}.
\]
By the block-wise expansion for determinants,
\[
\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A|\,|D - CA^{-1}B|,
\]
we have
\[
|E+\Sigma_0| = (e_{11}+\ell_{11}^2)\left|\ell_1\ell_1' + L_{11}L_{11}' + E_{11} - \frac{\ell_1\ell_1'\,\ell_{11}^2}{e_{11}+\ell_{11}^2}\right| = (e_{11}+\ell_{11}^2)\left|\frac{e_{11}\,\ell_1\ell_1'}{e_{11}+\ell_{11}^2} + L_{11}L_{11}' + E_{11}\right|.
\]
By the identity |I_p + UV| = |I_r + VU|, where U is a p × r matrix and V is an r × p matrix,
we have
\[
|E+\Sigma_0| = (e_{11}+\ell_{11}^2)\,|L_{11}L_{11}'+E_{11}|\left[1 + \ell_1'(L_{11}L_{11}'+E_{11})^{-1}\ell_1\,\frac{e_{11}}{e_{11}+\ell_{11}^2}\right] = (e_{11}+\ell_{11}^2)\,|L_{11}L_{11}'+E_{11}|\,(1+\ell_1'F\ell_1),
\]
where F = \frac{e_{11}}{e_{11}+\ell_{11}^2}(L_{11}L_{11}'+E_{11})^{-1}. Note that |E| = e_{11}|E_{11}| = \prod_{j=1}^{d} e_{jj} = \lambda_1^{-d}\prod_{j=1}^{d}\sigma_j^2.
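The determinant identity |I_p + UV| = |I_r + VU| invoked in this step can be checked numerically; the dimensions and random matrices below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, r_dim = 5, 2
U = rng.standard_normal((p_dim, r_dim))          # U: p x r
V = rng.standard_normal((r_dim, p_dim))          # V: r x p
det_lhs = np.linalg.det(np.eye(p_dim) + U @ V)   # |I_p + U V|
det_rhs = np.linalg.det(np.eye(r_dim) + V @ U)   # |I_r + V U|
assert abs(det_lhs - det_rhs) < 1e-8
```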
Thus,
\[
\frac{|\Sigma_0|^{-\frac{Mk}{2}}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \le \frac{|E|^{\frac{Mk}{2}}}{|E+\Sigma_0|^{\frac{Mk}{2}}} = \frac{e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}\,(1+\ell_1'F\ell_1)^{\frac{Mk}{2}}}. \tag{A.8}
\]
Let R = {(\ell_{11}, \ell_1, L_{11}, \sigma_1^2, \ldots, \sigma_d^2) : \ell_{11} > 0;\ \ell_1 \in \mathbb{R}^{d-1};\ \sigma_j^2 > 0, j = 1, \ldots, d} and
R_0 = {(\ell_{11}, L_{11}, \sigma_1^2, \ldots, \sigma_d^2) : \ell_{11} > 0;\ \sigma_j^2 > 0, j = 1, \ldots, d}. Then, by (A.7) and (A.8), we
have
\[
\begin{aligned}
&\int_R \frac{|\Sigma_0|^{-\frac{Mk}{2}}\, e^{-\frac{1}{2}y'Wy}}{|Z'PZ+G^{-1}|^{\frac{1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1}\right]\pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, d\ell_1\, dL_{11} \\
&\quad\le \int_{R_0} \left[\int_{\mathbb{R}^{d-1}} \frac{1}{(1+\ell_1'F\ell_1)^{\frac{Mk}{2}}}\, d\ell_1\right] \frac{e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \\
&\qquad\times \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11}. \tag{A.9}
\end{aligned}
\]
When Mk > 1, \int_{\mathbb{R}^{d-1}} (1+\ell_1'F\ell_1)^{-\frac{Mk}{2}}\, d\ell_1 = C_1 |F|^{-\frac{1}{2}}, where C_1 is a constant. Thus, (A.9)
becomes
\[
\begin{aligned}
&C_1 \int_{R_0} \frac{|F|^{-\frac{1}{2}}\, e_{11}^{\frac{Mk}{2}}\,|E_{11}|^{\frac{Mk}{2}}}{(e_{11}+\ell_{11}^2)^{\frac{Mk}{2}}\,|L_{11}L_{11}'+E_{11}|^{\frac{Mk}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad= C_1 \int_{R_0} \Big(\frac{e_{11}}{e_{11}+\ell_{11}^2}\Big)^{\frac{Mk-(d-1)}{2}} \frac{|E_{11}|^{\frac{Mk}{2}}}{|L_{11}L_{11}'+E_{11}|^{\frac{Mk-1}{2}}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad\le C_1 \int_{R_0} |E_{11}|^{\frac{1}{2}} \left[\prod_{j=1}^{d}\Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+1} e^{-\frac{c_j}{2\sigma_j^2}}\right] \pi(\ell_{11})\,\pi(L_{11})\, d\sigma_1^2\cdots d\sigma_d^2\, d\ell_{11}\, dL_{11} \\
&\quad= C_2 \int_0^\infty \Big(\frac{1}{\sigma_1^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_1+1} e^{-\frac{c_1}{2\sigma_1^2}}\, d\sigma_1^2 \left[\prod_{j=2}^{d} \int_0^\infty \Big(\frac{1}{\sigma_j^2}\Big)^{\frac{M(n-q-1)}{2}+\alpha_j+\frac{1}{2}} e^{-\frac{c_j}{2\sigma_j^2}}\, d\sigma_j^2\right] \int_0^\infty \pi(\ell_{11})\, d\ell_{11} \int \pi(L_{11})\, dL_{11} \\
&\quad< \infty,
\end{aligned}
\]
where the first inequality holds since
\[
\Big(\frac{e_{11}}{e_{11}+\ell_{11}^2}\Big)^{\frac{Mk-(d-1)}{2}} \le 1, \qquad \frac{|E_{11}|^{\frac{Mk-1}{2}}}{|L_{11}L_{11}'+E_{11}|^{\frac{Mk-1}{2}}} \le 1
\]
for Mk ≥ d − 1, and the last inequality holds since π(ℓ_{11}) and π(L_{11}) are proper and
\int_0^\infty \theta^{-s} e^{-t/\theta}\, d\theta < \infty for s > 1, t > 0. C_1 and C_2 above are both finite constants. Hence we
conclude that the posterior is proper.
Appendix B
MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1
Consider the linear mixed model
\[
y = X\beta + Z\gamma + \varepsilon, \qquad \gamma \sim N(0, G), \quad \varepsilon \sim N(0, \Sigma),
\]
where G = I_M ⊗ Σ0 ⊗ I_k and Σ = I_M ⊗ D ⊗ I_n. For the Bayesian inference, we put priors
on the parameters as described in Section 1.3.3, and the following are the updating steps for
β, γ and σ_j^2 in the Gibbs sampling method.
Updating β
Let r^{(i)} = y^{(i)} − (I_d ⊗ Z)γ^{(i)} and \bar{r} = \frac{1}{M}\sum_{i=1}^{M} r^{(i)}. Given π(β) ∝ 1, the posterior distribution
of β is
\[
\beta \mid y, \gamma, D, \Sigma_0 \sim N(\hat\beta, V_\beta),
\]
where \hat\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\bar{r} = (I_d \otimes (X'X)^{-1}X')\,\bar{r} and V_\beta = (X'\Sigma^{-1}X)^{-1} = \frac{1}{M}\big(D \otimes (X'X)^{-1}\big).
Updating γ
Let u^{(i)} = y^{(i)} − (I_d ⊗ X)β, i = 1, \ldots, M. We can update the γ^{(i)} one by one independently.
Since γ^{(i)} ∼ N(0, Σ0 ⊗ I_k), the posterior distribution of γ^{(i)} is
\[
\gamma^{(i)} \mid y, \beta, D, \Sigma_0 \sim N(\hat\gamma^{(i)}, V_\gamma), \qquad i = 1, \ldots, M,
\]
where \hat\gamma^{(i)} = (D^{-1} \otimes Z'Z + \Sigma_0^{-1} \otimes I_k)^{-1}(D^{-1} \otimes Z')\,u^{(i)} and V_\gamma = (D^{-1} \otimes Z'Z + \Sigma_0^{-1} \otimes I_k)^{-1}.
Updating D = diag(σ_1^2, \ldots, σ_d^2)
We can update D by updating the σ_j^2 one by one independently. Let r = y − Xβ − Zγ. Given
IG(a_j, b_j) as the prior π(σ_j^2), the posterior distribution of σ_j^2 is
\[
\sigma_j^2 \mid y, \beta, \gamma, \Sigma_0 \sim IG\left(\frac{1}{2}Mn + a_j,\ \Big(\frac{1}{2}\sum_{i=1}^{M}\sum_{t=1}^{n} r_{jt}^{(i)2} + \frac{1}{b_j}\Big)^{-1}\right), \qquad j = 1, \ldots, d.
\]
The updating procedure for Σ0 via ℓ_{11}, ℓ_1 and L_{11} is provided in Section 1.3.3.
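The three updates above can be sketched in code. This is a minimal illustration of one Gibbs sweep under the Kronecker-structured model of this appendix; the dimensions, the designs X and Z, the placeholder data Y, and the IG hyperparameters a_j, b_j are all arbitrary stand-ins (in particular, Z is a random matrix rather than an actual spline basis), not the dissertation's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, n, k, q = 3, 2, 12, 4, 1            # replicates, curves, grid size, knots, poly degree
X = np.column_stack([np.linspace(1 / n, 1, n) ** j for j in range(q + 1)])  # n x (q+1)
Z = rng.standard_normal((n, k))            # stand-in for the spline random-effect design
Y = rng.standard_normal((M, d, n))         # placeholder data y_{jt}^{(i)}

beta, gamma = np.zeros(d * (q + 1)), np.zeros((M, d * k))
sig2, Sigma0 = np.ones(d), np.eye(d)
a_j, b_j = 2.0, 1.0                        # assumed IG hyperparameters

Xd, Zd = np.kron(np.eye(d), X), np.kron(np.eye(d), Z)
yvec = Y.reshape(M, d * n)                 # responses stacked j-major per replicate

# beta | rest ~ N(beta_hat, (1/M) D kron (X'X)^{-1}) under the flat prior pi(beta) ∝ 1
rbar = (yvec - gamma @ Zd.T).mean(axis=0)
beta_hat = np.kron(np.eye(d), np.linalg.solve(X.T @ X, X.T)) @ rbar
V_beta = np.kron(np.diag(sig2), np.linalg.inv(X.T @ X)) / M
beta = rng.multivariate_normal(beta_hat, V_beta)

# gamma^{(i)} | rest, drawn independently across replicates i
D_inv, S0_inv = np.diag(1.0 / sig2), np.linalg.inv(Sigma0)
V_gamma = np.linalg.inv(np.kron(D_inv, Z.T @ Z) + np.kron(S0_inv, np.eye(k)))
for i in range(M):
    u = yvec[i] - Xd @ beta
    gamma[i] = rng.multivariate_normal(V_gamma @ (np.kron(D_inv, Z.T) @ u), V_gamma)

# sigma_j^2 | rest ~ IG(Mn/2 + a_j, (sum of squared residuals / 2 + 1/b_j)^{-1}),
# drawn as the reciprocal of a Gamma variate
resid = (yvec - beta @ Xd.T - gamma @ Zd.T).reshape(M, d, n)
for j in range(d):
    rate = 0.5 * np.sum(resid[:, j, :] ** 2) + 1.0 / b_j
    sig2[j] = 1.0 / rng.gamma(0.5 * M * n + a_j, 1.0 / rate)
```

In a full sampler these draws would be cycled, together with the Σ0 update of Section 1.3.3, until the chain mixes.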
Appendix C
Maximum Likelihood Estimation in Chapter 1
This part serves as a preliminary analysis of the parameters based on the likelihood approach.
Here, instead of the original assumption that the covariance matrix for the errors is Σ = I_M ⊗ D ⊗
Σ_φ, where D is a diagonal matrix, we consider a more general relationship between individual
covariates such that Σ = I_M ⊗ Σ_1 ⊗ Σ_φ, where Σ_1 can be any valid covariance matrix. Then
the likelihood function for the linear regression model in Section 1.4 is
\[
f(y \mid \beta, \gamma, \Sigma_1) = \frac{1}{(2\pi)^{\frac{1}{2}ndM}\,|\Sigma|^{\frac{1}{2}}}\, e^{-\frac{1}{2}(y-X\beta-Z\gamma)'\Sigma^{-1}(y-X\beta-Z\gamma)},
\]
where Σ = I_M ⊗ Σ_1 ⊗ Σ_φ when we treat γ as fixed effects as well. The MLE equation for
β and γ is then
\[
\begin{pmatrix} X'\Sigma^{-1}X & X'\Sigma^{-1}Z \\ Z'\Sigma^{-1}X & Z'\Sigma^{-1}Z \end{pmatrix}\begin{pmatrix} \hat\beta \\ \hat\gamma \end{pmatrix} = \begin{pmatrix} X'\Sigma^{-1}y \\ Z'\Sigma^{-1}y \end{pmatrix}.
\]
Let \hat\beta, \hat\gamma denote the MLEs, e_{jt}^{(i)} = y_{jt}^{(i)} − X\hat\beta − Z\hat\gamma, U_i = (e_1^{(i)}, \ldots, e_d^{(i)}) and S_i = U_i'\Sigma_\varphi^{-1}U_i,
where i = 1, \ldots, M. Then we have
\[
f(y \mid \hat\beta, \hat\gamma, \Sigma_1, \Sigma_\varphi) \propto \frac{e^{-\frac{1}{2}\mathrm{tr}(\Sigma_1^{-1}S)}}{|\Sigma_1|^{\frac{nM}{2}}\,|\Sigma_\varphi|^{\frac{dM}{2}}},
\]
where S = \sum_{i=1}^{M} S_i. This is maximized at \hat\Sigma_1 = \frac{S}{nM}.
The log-likelihood function is
\[
L(y \mid \beta, \gamma, \Sigma_1, \Sigma_\varphi) \propto -\frac{M}{2}\,\mathrm{tr}\big(\Sigma_1^{-1}S(\varphi)\big) + \frac{nM}{2}\log|\Sigma_1^{-1}| + \frac{dM}{2}\log|\Sigma_\varphi^{-1}|,
\]
where S(\varphi) = \frac{S}{M}. It can be shown that |\Sigma_\varphi^{-1}| = 2^n \prod_{j=0}^{n-1}\big(1 + \varphi\cos(\tfrac{2j\pi}{n})\big). Therefore the MLE for φ satisfies
\[
\hat\varphi = \arg\max_{-1\le\varphi\le 1}\; \left\{-\frac{M}{2}\,\mathrm{tr}\big(\Sigma_1^{-1}S(\varphi)\big) + \frac{dM}{2}\log|\Sigma_\varphi^{-1}| + \frac{ndM}{2}\right\}.
\]
Following the procedure above, we are able to obtain the MLEs by updating the parameters
iteratively until convergence.
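The determinant formula for Σ_φ^{-1} can be verified numerically under the assumption, not spelled out above, that Σ_φ^{-1} is the circulant matrix with first row (2, φ, 0, …, 0, φ); its eigenvalues are then 2 + 2φ cos(2jπ/n), which yields the stated product form.

```python
import numpy as np

n, phi = 8, 0.4
P = np.zeros((n, n))                      # assumed circulant precision Sigma_phi^{-1}
for i in range(n):
    P[i, i] = 2.0
    P[i, (i + 1) % n] = phi
    P[i, (i - 1) % n] = phi

det_direct = np.linalg.det(P)
det_formula = 2.0 ** n * np.prod(1.0 + phi * np.cos(2 * np.pi * np.arange(n) / n))
assert abs(det_direct - det_formula) < 1e-8 * det_formula
```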
Appendix D
Proof of Theorems and Lemmas in Chapter 2
Proof of Lemma 1. First note that V_j = \varepsilon'X^{(j)}/(\sqrt{n}\,\sigma\sigma_j) ∼ N(0, 1). Examining the tail
probability of Gaussian random variables leads to
\[
P\left(\max_{1\le j\le n}|V_j| > \sqrt{t^2+2\log n}\right) \le 2nP\left(V_1 > \sqrt{t^2+2\log n}\right) \le 2n\exp\left[-\frac{t^2+2\log n}{2}\right] = 2\exp(-t^2/2). \tag{D.1--D.2}
\]
Then it follows that
\[
P\left(\max_{1\le j\le n}\frac{2|\varepsilon'X^{(j)}|}{n} > \lambda_0\right) = P\left(\frac{2\sigma}{\sqrt{n}}\max_{1\le j\le n}\sigma_j|V_j| > \lambda_0\right) \le P\left(\max_{1\le j\le n}|V_j| > \sqrt{t^2+2\log n}\right) \le 2\exp(-t^2/2). \tag{D.3--D.5}
\]
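A Monte Carlo illustration of the maximal inequality just proved; the V_j are simulated as iid N(0, 1) purely for illustration (the union bound itself does not require independence), and the values of n and t are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 200, 1.5, 20_000
thresh = np.sqrt(t ** 2 + 2 * np.log(n))
V = rng.standard_normal((reps, n))                    # V_j simulated as iid N(0, 1)
empirical = np.mean(np.abs(V).max(axis=1) > thresh)   # P(max_j |V_j| > sqrt(t^2 + 2 log n))
bound = 2 * np.exp(-t ** 2 / 2)                       # the stated upper bound 2 exp(-t^2/2)
assert empirical <= bound
```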
Proof of Lemma 2. Let W = n\hat\sigma^2/\sigma^2 = Y'Y/\sigma^2. Since Y ∼ N(Xβ, σ^2 I), we claim
that W ∼ χ^2_n(r), i.e., a noncentral chi-square distribution with noncentrality parameter r.
It is known that
\[
E(W) = n + r, \tag{D.6}
\]
\[
E[W - E(W)]^2 = 2(n+2r), \tag{D.7}
\]
\[
E[W - E(W)]^4 = 12(n+2r)^2 + 48(n+4r). \tag{D.8}
\]
Given k > 0, by Chebyshev's inequality,
\[
P\big(|\hat\sigma^2 - (1+r/n)\sigma^2| > \alpha\sigma^2\big) = P\big(|W - E(W)| > n\alpha\big) \le \frac{E|W-E(W)|^k}{\alpha^k n^k}. \tag{D.9}
\]
Therefore,
\[
P\big(\hat\sigma^2 < (1-\alpha)\sigma^2\big) \le P\big(|\hat\sigma^2 - (1+r/n)\sigma^2| > \alpha\sigma^2\big) \le \frac{E|W-E(W)|^k}{\alpha^k n^k}. \tag{D.10}
\]
Finally, by taking k = 2 and k = 4 respectively, we complete our proof.
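The first two noncentral chi-square moments in (D.6)–(D.7) can be checked by simulation; n, r and the replication count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, reps = 10, 4.0, 400_000
mu = np.zeros(n)
mu[0] = np.sqrt(r)                                    # noncentrality: sum_i mu_i^2 = r
W = ((rng.standard_normal((reps, n)) + mu) ** 2).sum(axis=1)   # W ~ chi^2_n(r)
assert abs(W.mean() - (n + r)) < 0.05                 # E(W) = n + r
assert abs(W.var() - 2 * (n + 2 * r)) < 0.5           # Var(W) = 2(n + 2r)
```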
Proof of Theorem 3. Note that under (2.5),
\[
\|Y - X\hat\beta\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\hat\beta_j| \le \|Y - X\beta\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.11}
\]
By expanding the \ell_2 terms, this simply gives
\[
\|X(\hat\beta-\beta)\|_2^2/n + \lambda\sum_{j=2}^{n-1}|\hat\beta_j| \le 2\varepsilon'X(\hat\beta-\beta)/n + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.12}
\]
For any vector v ∈ R^p we define v_{01} = (v_0, v_1, 0, \cdots, 0)' and v_{2+} = (0, 0, v_2, \cdots, v_{n-1})'.
Since v = v_{01} + v_{2+}, (D.12) becomes
\[
\|X(\hat\beta-\beta)\|_2^2/n \le 2\varepsilon'X(\hat\beta-\beta)_{01}/n + 2\varepsilon'X(\hat\beta-\beta)_{2+}/n - \lambda\sum_{j=2}^{n-1}|\hat\beta_j| + \lambda\sum_{j=2}^{n-1}|\beta_j|. \tag{D.13}
\]
Set \lambda_0 = 6K\sigma\sqrt{\frac{\log n}{n}}. On J_{\lambda_0} = \{\max_{1\le j\le n} 2|\varepsilon'X^{(j)}|/n \le \lambda_0\},
\[
2\varepsilon'X(\hat\beta-\beta)_{2+}/n \le \lambda_0\sum_{j=2}^{n-1}|\hat\beta_j - \beta_j|. \tag{D.14}
\]
Meanwhile, note that \|\beta_{01}\|_1 = |\beta_0| + |\beta_1| \le M by assumption. We have
\[
2\varepsilon'X(\hat\beta-\beta)_{01}/n \le \lambda_0\|\hat\beta_{01}\|_1 + \lambda_0\|\beta_{01}\|_1 \le \lambda_0 M + \lambda_0(|\hat\beta_0| + |\hat\beta_1|). \tag{D.15}
\]
Now define S_n = \{\hat\sigma : \hat\sigma \ge \sigma/2\}. Then on S_n we have \lambda \ge \lambda_0. Combining (D.12) through
(D.15) and noting the facts that \lim_{n\to\infty}\lambda = 0 and \lim_{n\to\infty}\lambda\|\beta\|_1 = 0, we claim that on
J_{\lambda_0} \cap S_n,
\[
\|X(\hat\beta-\beta)\|_2^2/n \le \lambda M + 2\lambda\|\beta\|_1 \to 0 \quad \text{as } n\to\infty. \tag{D.16}
\]
It remains to show that J_{\lambda_0} \cap S_n has high probability when n is large. In fact, we have
\[
P\big(J_{\lambda_0}^c \cup S_n^c\big) \le P(J_{\lambda_0}^c) + P(S_n^c). \tag{D.17}
\]
By Lemma 1, P(J_{\lambda_0}^c) \le 2/n^2. Taking \alpha = 3/4 in Lemma 2, we get
\[
P(S_n^c) \le \frac{(27+12)\cdot 4^4}{3^4\cdot n^2} \le \frac{128}{n^2} \tag{D.18}
\]
when n \ge \max(4r, 8). Hence we claim that P\big(J_{\lambda_0}^c \cup S_n^c\big) \le 130/n^2 when n \ge \max(4r, 8).
Since
\[
\sum_{n=1}^{\infty} P\big(J_{\lambda_0}^c \cup S_n^c\big) \le (4r+8) + 130\sum_{n=1}^{\infty}\frac{1}{n^2} < \infty, \tag{D.19}
\]
by the Borel--Cantelli lemma,
\[
P\big(J_{\lambda_0}^c \cup S_n^c \ \text{i.o.}\big) = 0. \tag{D.20}
\]
It follows that
\[
P\big(\exists N > 0 \text{ s.t. } J_{\lambda_0} \cap S_n \text{ holds when } n > N\big) = 1. \tag{D.21}
\]
Therefore, by (D.16),
\[
P\Big(\lim_{n\to\infty}\|X(\hat\beta-\beta)\|_2^2/n = 0\Big) = 1. \tag{D.22}
\]
Proof of Proposition 1. Recall that x_j = j/n, j = 1, 2, \cdots, n. According to (2.4), we
have
\[
X - \nu I = \begin{pmatrix}
1-\nu & \frac{1}{n} & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{2}{n}-\nu & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{3}{n} & \frac{1}{n}-\nu & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & \frac{n-1}{n} & \frac{n-3}{n} & \frac{n-4}{n} & \cdots & \frac{1}{n}-\nu & 0 \\
1 & 1 & \frac{n-2}{n} & \frac{n-3}{n} & \cdots & \frac{2}{n} & \frac{1}{n}-\nu
\end{pmatrix}. \tag{D.23}
\]
It is easy to see that ν = 2/n is not a solution to \det(X − \nu I) = 0. So by multiplying the
second row of (D.23) by −1/(2 − nν) and adding it to the first row, we obtain the lower
triangular matrix
\[
R(X - \nu I) = \begin{pmatrix}
\frac{n\nu^2-(n+2)\nu+1}{2-n\nu} & 0 & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{2}{n}-\nu & 0 & 0 & \cdots & 0 & 0 \\
1 & \frac{3}{n} & \frac{1}{n}-\nu & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & \frac{n-1}{n} & \frac{n-3}{n} & \frac{n-4}{n} & \cdots & \frac{1}{n}-\nu & 0 \\
1 & 1 & \frac{n-2}{n} & \frac{n-3}{n} & \cdots & \frac{2}{n} & \frac{1}{n}-\nu
\end{pmatrix}, \tag{D.24}
\]
where R denotes the row operation described above. Then it is clear that
\[
\det(X - \nu I) = \det(R(X - \nu I)) = \frac{n\nu^2-(n+2)\nu+1}{n}\left(\frac{1}{n}-\nu\right)^{n-2}. \tag{D.25}
\]
It directly follows that
\[
\nu_1 = \frac{n+2-\sqrt{n^2+4}}{2n}, \qquad \nu_2 = \nu_3 = \cdots = \nu_{n-1} = \frac{1}{n}, \qquad \nu_n = \frac{n+2+\sqrt{n^2+4}}{2n} \tag{D.26}
\]
are the roots of \det(X − \nu I) = 0.
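The two nontrivial roots in (D.26) can be verified directly against the quadratic factor of (D.25); n below is an arbitrary example size.

```python
import math

n = 50                                     # arbitrary example size
disc = math.sqrt(n ** 2 + 4)
v1 = (n + 2 - disc) / (2 * n)              # nu_1 from (D.26)
vn = (n + 2 + disc) / (2 * n)              # nu_n from (D.26)
for v in (v1, vn):
    # both satisfy the quadratic factor n v^2 - (n + 2) v + 1 = 0
    assert abs(n * v ** 2 - (n + 2) * v + 1) < 1e-12
assert abs(v1 * vn - 1 / n) < 1e-12        # Vieta: product of roots = 1/n
assert abs(v1 + vn - (n + 2) / n) < 1e-12  # Vieta: sum of roots = (n + 2)/n
```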
Appendix E
Proof of Theorems in Chapter 3
Proof of Theorem 4. Given site s, the likelihood function is
\[
f(Y_s \mid b, u_s, \sigma_s^2) = \frac{1}{(2\pi\sigma_s^2)^{\frac{n}{2}}}\exp\left(-\frac{1}{2\sigma_s^2}(Y_s - X^{(s)}b^{(s)} - u_s\mathbf{1}_n)^T(Y_s - X^{(s)}b^{(s)} - u_s\mathbf{1}_n)\right). \tag{E.1}
\]
To validate the propriety of the posterior, we need to show
\[
\sum_{c\in\mathcal{P}} \int_{(0,\infty)} \int_{\mathbb{R}^N} \int_{\Theta^d} \int_{(0,\infty)^N} \left[\prod_{s=1}^{N} f(Y_s \mid b, u_s, \sigma_s^2)\,\pi(u_s \mid \tau^2)\,\pi(\sigma_s^2)\right] \pi(c)\,\pi(\tau^2)\, d\sigma^2\, G_0(d\theta)\, du\, d\tau^2 < \infty. \tag{E.2}
\]
Integrating out σ_s^2, we have
\[
f_s(Y_s, b, u_s) = \int_{(0,\infty)} f(Y_s \mid b, u_s, \sigma_s^2)\,\pi(\sigma_s^2)\, d\sigma_s^2 = (2\pi)^{-\frac{n}{2}}\,\frac{\Gamma(a_\sigma + \frac{n}{2})}{\Gamma(a_\sigma)\,b_\sigma^{a_\sigma}}\left(\frac{1}{2}e_s^T e_s + \frac{1}{b_\sigma}\right)^{-(a_\sigma+\frac{n}{2})}, \tag{E.3}
\]
where e_s = Y_s − X^{(s)}b^{(s)} − u_s\mathbf{1}_n.
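The inverse-gamma marginalization in (E.3) can be checked by one-dimensional quadrature, assuming the IG(a_σ, b_σ) density b_σ^{-a_σ}/Γ(a_σ) · (σ²)^{-(a_σ+1)} e^{-1/(b_σσ²)} used throughout this appendix; n, a_σ, b_σ and the residual sum of squares below are arbitrary illustrative values.

```python
import math

n, a, b = 5, 2.0, 1.5                    # arbitrary n, a_sigma, b_sigma
ee = 3.7                                 # illustrative value of e_s' e_s

def integrand(s):
    # Gaussian likelihood times the assumed inverse-gamma prior, as a function of s = sigma^2
    lik = (2 * math.pi * s) ** (-n / 2) * math.exp(-ee / (2 * s))
    prior = b ** (-a) / math.gamma(a) * s ** (-(a + 1)) * math.exp(-1.0 / (b * s))
    return lik * prior

# midpoint rule in x = log(s) over a window wide enough for both tails
m, lo, hi = 60_000, -12.0, 12.0
h = (hi - lo) / m
num = h * sum(integrand(math.exp(lo + (i + 0.5) * h)) * math.exp(lo + (i + 0.5) * h)
              for i in range(m))
closed = ((2 * math.pi) ** (-n / 2) * math.gamma(a + n / 2)
          / (math.gamma(a) * b ** a) * (0.5 * ee + 1.0 / b) ** -(a + n / 2))
assert abs(num - closed) < 1e-6 * closed
```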
Then the integrand in (E.2) becomes
\[
\left[\prod_{s=1}^{N} f_s(Y_s, b, u_s)\,\pi(u_s \mid \tau^2)\right]\pi(c)\,\pi(\tau^2), \tag{E.4}
\]
which is equivalent to the clustered form
\[
(2\pi)^{-\frac{Nn}{2}}\left(\frac{\Gamma(a_\sigma+\frac{n}{2})}{\Gamma(a_\sigma)\,b_\sigma^{a_\sigma}}\right)^N \prod_{r=1}^{d}\prod_{s\in C_r}\left(\frac{1}{2}e_{r,s}^T e_{r,s} + \frac{1}{b_\sigma}\right)^{-(a_\sigma+\frac{n}{2})}, \tag{E.5}
\]
where e_{r,s} = Y_s − X_r b_r − u_s\mathbf{1}_n.
Let I_{r,s} = \big(\frac{1}{2}e_{r,s}^T e_{r,s} + \frac{1}{b_\sigma}\big)^{-(a_\sigma+\frac{n}{2})}. Integrating with respect to G_0(dθ), given cluster r we
need to evaluate
\[
\prod_{r=1}^{d}\int_{\mathbb{R}^{q+k_r}} \left[\prod_{s\in C_r} I_{r,s}\right] db_r \le \prod_{r=1}^{d}\int_{\mathbb{R}^{q+k_r}} b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)}\, I_{r,r_1}\, db_r \tag{E.6}
\]
since I_{r,s} \le b_\sigma^{a_\sigma+\frac{n}{2}} for any r and s, where r_1 \in C_r is a fixed location. Note that I_{r,r_1} =
\big(\frac{1}{2}e_{r,r_1}^T e_{r,r_1} + \frac{1}{b_\sigma}\big)^{-(a_\sigma+\frac{n}{2})}, as a function of b_r, is proportional to a (q + k_r)-dimensional
multivariate t-distribution with parameters (\mu_r, \Sigma_r, \phi_r), where
\[
\mu_r = (X_r^T X_r)^{-1}X_r^T(Y_{r_1} - u_{r_1}\mathbf{1}_n) = (X_r^T X_r)^{-1}X_r^T v_{r_1}, \tag{E.7}
\]
\[
\Sigma_r = \frac{\frac{2}{b_\sigma} + v_{r_1}^T\big[I - X_r(X_r^T X_r)^{-1}X_r^T\big]v_{r_1}}{\phi_r}\,(X_r^T X_r)^{-1}, \tag{E.8}
\]
\[
\phi_r = 2a_\sigma + n - (q + k_r). \tag{E.9}
\]
Then the r-th integral on the RHS of (E.6) can be shown to be
\[
\int_{\mathbb{R}^{q+k_r}} b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)}\, I_{r,r_1}\, db_r = A_r\, b_\sigma^{(a_\sigma+\frac{n}{2})(|C_r|-1)} \cdot \frac{\big(\frac{2}{b_\sigma} + v_{r_1}^T P_r v_{r_1}\big)^{\frac{q+k_r}{2}}}{\big(\frac{1}{b_\sigma} + \frac{1}{2}v_{r_1}^T P_r v_{r_1}\big)^{a_\sigma+\frac{n}{2}}} \tag{E.10}
\]
\[
\le A_r\, b_\sigma^{(a_\sigma+\frac{n}{2})|C_r| - \frac{q+k_r}{2}}\, 2^{\frac{1}{2}(q+k_r)}, \tag{E.11}
\]
where P_r = I - X_r(X_r^T X_r)^{-1}X_r^T and
\[
A_r = \frac{\Gamma\big(a_\sigma + \frac{n-(q+k_r)}{2}\big)}{\Gamma\big(a_\sigma + \frac{n}{2}\big)}\,\frac{\pi^{\frac{1}{2}(q+k_r)}}{|X_r^T X_r|^{\frac{1}{2}}}. \tag{E.12}
\]
Therefore it remains to verify that
\[
\int_{\mathbb{R}^N}\int_{(0,\infty)} \left[\prod_{s=1}^{N}\pi(u_s \mid \tau^2)\right]\pi(\tau^2)\, d\tau^2\, du < \infty. \tag{E.13}
\]
Recall that \pi(u_s \mid \tau^2) = \frac{1}{\sqrt{2\pi}\,\tau}e^{-\frac{u_s^2}{2\tau^2}} and \pi(\tau^2) = \frac{b_\tau^{-a_\tau}}{\Gamma(a_\tau)}\big(\frac{1}{\tau^2}\big)^{a_\tau+1}e^{-\frac{1}{b_\tau\tau^2}}. Integrating out τ^2
leads to the new integrand
\[
(2\pi)^{-\frac{N}{2}}\,\frac{\Gamma(a_\tau+\frac{N}{2})}{\Gamma(a_\tau)\,b_\tau^{a_\tau}}\left(\frac{1}{b_\tau} + \frac{1}{2}\sum_{s=1}^{N}u_s^2\right)^{-(a_\tau+\frac{N}{2})}, \tag{E.14}
\]
which, as a function of u, is proportional to the density of an N-dimensional t-distribution
with parameters (\mathbf{0}_N, (a_\tau b_\tau)^{-1}I_N, 2a_\tau). Hence the last integration with respect to u is finite.
Note that the final summation is over finitely many possible partitions of S. Therefore we
complete our verification.
BIBLIOGRAPHY
[1] Abraham, C., Cornillon, P.A., Matzner-Løber, E. and Molinari, N. (2003), Unsupervised Curve Clustering using B-Splines. Scandinavian Journal of Statistics 30: 581-595.
[2] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics 37(4): 1705-1732.
[3] Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer: New York, USA.
[4] Dass, S., Lim, C., Maiti, T. and Zhang, Z. (2014), Clustering Curves based on Change Point Analysis: A Nonparametric Bayesian Approach. Accepted by Statistica Sinica.
[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least Angle Regression. The Annals of Statistics 32(2): 407-499.
[6] Escobar, M.D. and West, M. (1995), Bayesian Density Estimation and Inference using Mixtures. Journal of the American Statistical Association 90(430): 577-588.
[7] Ferguson, T.S. (1973), A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2): 209-230.
[8] Friedman, J., Hastie, T. and Tibshirani, R. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1): 1-22.
[9] Gaston, K.J. (1996), Biodiversity: A Biology of Numbers and Difference. Oxford, UK.
[10] Gelman, A. and Rubin, D.B. (1992), Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7: 457-472.
[11] Ghosh, P., Basu, S. and Tiwari, R.C. (2009), Bayesian Analysis of Cancer Rates from SEER Program using Parametric and Semiparametric Joinpoint Regression Models. Journal of the American Statistical Association 104(486): 439-452.
[12] Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning. Springer: New York, USA.
[13] Jeffreys, H. (1961), Theory of Probability. Oxford Univ. Press: Oxford, UK.
[14] Kim, H.-J., Fay, M.P., Feuer, E.J. and Midthune, D.N. (2000), Permutation Tests for Joinpoint Regression with Applications to Cancer Rates. Statistics in Medicine 19: 335-351.
[15] Lozano, R. et al. (2012), Global and Regional Mortality from 235 Causes of Death for 20 Age Groups in 1990 and 2010: A Systematic Analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2095-2128.
[16] Ramsay, J.O. and Silverman, B.W. (1997), Functional Data Analysis. Springer: New York, USA.
[17] Ries, L.A.G., Harkins, D., Krapcho, M., Mariotto, A., Miller, B.A., Feuer, E.J., Clegg, L., Eisner, M.P., Horner, M.J., Howlader, N., Hayat, M., Hankey, B.F. and Edwards, B.K. (eds.) (2006), SEER Cancer Statistics Review, 1975-2003. National Cancer Institute: Bethesda, MD, USA.
[18] Sethuraman, J. (1994), A Constructive Definition of Dirichlet Priors. Statistica Sinica 4: 639-650.
[19] The IUCN Red List of Threatened Species (2012). http://www.iucnredlist.org/.
[20] Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1): 267-288.
[21] Tiwari, R.C., Cronin, K.A., Davis, W., Feuer, E.J., Yu, B. and Chib, S. (2005), Bayesian Model Selection for Join Point Regression with Application to Age-adjusted Cancer Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(5): 919-939.
[22] Wang, H., Li, B. and Leng, C. (2009), Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71: 671-683.
[23] Yang, R. and Berger, J. (1994), Estimation of a Covariance Matrix Using the Reference Prior. The Annals of Statistics 22(3): 1195-1211.