Clustering Microarray Time-series Data using Expectation Maximization and Multiple Profile Alignment

Numanul Subhani∗, Luis Rueda∗, Alioune Ngom∗, Conrad J. Burden†
∗School of Computer Science, 5115 Lambton Tower, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
{hoque4,lrueda,angom}@uwindsor.ca
†Centre for Bioinformation Science, Mathematical Sciences Institute and John Curtin School of Medical Research, The Australian National University, Canberra, ACT 0200, Australia
Abstract—A common problem in biology is to partition a set of experimental data into clusters in such a way that the data points within the same cluster are highly similar while data points in different clusters are very different. In this direction, clustering microarray time-series data via pairwise alignment of piece-wise linear profiles has been introduced recently. We propose an EM clustering approach based on a multiple alignment of natural cubic spline representations of gene expression profiles. The multiple alignment is achieved by minimizing the sum of integrated squared errors over a time-interval, defined on a set of profiles. Preliminary experiments on a well-known data set of 221 pre-clustered Saccharomyces cerevisiae gene expression profiles yield encouraging results, with 83.26% accuracy.
Index Terms—Microarrays, Time-Series Data, Gene Expression Profiles, Profile Alignment, Cubic Spline, Clustering
I. INTRODUCTION
Genes with similar expression profiles are expected to be functionally related or co-regulated; hence, an important process in functional genomic studies is clustering microarray time-series data [1]. A Bayesian approach in [7], a partitional clustering based on k-means in [8], and a Euclidean distance approach in [9] have been proposed for clustering time-series gene expression profiles; the authors of [9] applied self-organizing maps (SOMs) to visualize and interpret gene temporal expression profile patterns. A hidden phase
model was used for clustering time-series data to define
the parameters of a mixture of normal distributions in a
Bayesian-like manner that are estimated by using expectation
maximization (EM) [3]. Also, the methods proposed in [4],
[10] are based on correlation measures. A method that uses
jack-knife correlation, with or without seeded candidate profiles, was also proposed for clustering time-series microarray data [10]; its resulting clusters depend upon the initially chosen template genes, so important genes may be missed. A regression-based method was
proposed in [6] to address the challenges in clustering short
time-series expression data.
Analyzing gene temporal expression profile data that are
non-uniformly sampled and can contain missing values has
been studied in [2]. Clustering temporal gene expression
profiles was studied by identifying homogeneous clusters of
genes in [5]. The shapes of the curves were considered instead
of the absolute expression ratios. Fuzzy clustering of gene
temporal profiles, where the similarities between co-expressed
genes are computed based on the rate of change of the
expression ratios across time, has been studied in [11]. In [12],
the idea of order-restricted inference levels across time has
been applied to select and cluster genes, where the estimation
makes use of known inequalities among parameters. In this
approach, two gene expression profiles fall into the same
cluster, if they show similar profiles in terms of directions
of the changes of expression ratios, regardless of how big or
small the changes are. In [13], pairs of profiles represented by piece-wise linear functions are aligned so as to minimize the integrated squared area between the profiles. An
agglomerative clustering method, combined with an area-based
distance measure between two aligned profiles, was used to
cluster microarray time-series data.
Using natural cubic spline interpolations, we re-formulated
in [14] the pairwise gene expression profile alignment problem
of [13] in terms of integrals of arbitrary functions, then gener-
alized the pairwise alignment formulae of [13] from the case of
piece-wise linear profiles to profiles which are any continuous
integrable functions on a finite interval. Next, we extended the
concept of pairwise alignment to multiple expression profile
alignment, where profiles from a given set are aligned in such
a way that the sum of integrated squared errors, over a time-
interval, defined on the set is minimized. Finally, we combined
k-means clustering with our multiple alignment approach to
cluster microarray time-series data. We achieved an accuracy
of 79.64% when clustering 221 pre-clustered Saccharomyces
cerevisiae time-course gene expression profiles.
In this paper, we combine EM clustering with our multiple
alignment approach of [14] to cluster microarray time-series
data.
II. PAIRWISE EXPRESSION PROFILE ALIGNMENT
Clustering time-series expression data with unequal time intervals differs from the general clustering problem: exchanging two or more time points yields quite different results, even though such an exchange may not be biologically meaningful at all. The length of each interval is taken into account by analyzing the area between two expression profiles, joined by the corresponding measurements at subsequent time points. This is equivalent to considering the sum or average of squared errors between the infinitely many points on the two curves. This analysis can be easily achieved by computing the underlying integral, which is resolved analytically in advance, thereby avoiding expensive computations during the clustering process.
Given two profiles, x(t) and y(t) (either piece-wise linear or continuously integrable functions), where y(t) is to be aligned to x(t), the basic idea of alignment is to vertically shift y(t) towards x(t) in such a way that the integrated squared error between the two profiles is minimal. Let ŷ(t) be the result of shifting y(t). Here, the error is defined in terms of the areas between x(t) and ŷ(t) in the interval [0, T]. Functions x(t) and ŷ(t) may cross each other many times, but we want the sum of all the areas where x(t) is above ŷ(t), minus the sum of those areas where ŷ(t) is above x(t), to be minimal (see Fig. 1). Let a denote the amount of vertical shifting of y(t). Then, we want to find the value a_min of a that minimizes the integrated squared error between x(t) and y(t) − a. Once we obtain a_min, the alignment process consists of performing the shift on y(t) as ŷ(t) = y(t) − a_min.
The pairwise alignment results of [13] generalize from the case of piece-wise linear profiles to profiles which are any integrable functions on a finite interval. Suppose we have two profiles, x(t) and y(t), defined on the time-interval [0, T]. The alignment process consists of finding the value a that minimizes

$$f_a(x, y) = \int_0^T \left[ x(t) - [y(t) - a] \right]^2 dt. \qquad (1)$$

Differentiating yields

$$\frac{d}{da} f_a(x, y) = 2 \int_0^T [x(t) + a - y(t)]\, dt = 2 \int_0^T [x(t) - y(t)]\, dt + 2aT. \qquad (2)$$

Setting $\frac{d}{da} f_a(x, y) = 0$ and solving for a gives

$$a_{\min} = -\frac{1}{T} \int_0^T [x(t) - y(t)]\, dt, \qquad (3)$$

and since $\frac{d^2}{da^2} f_a(x, y) = 2T > 0$, $a_{\min}$ is a minimum. The integrated error between x(t) and the shifted profile ŷ(t) = y(t) − a_min is then

$$\int_0^T [x(t) - \hat{y}(t)]\, dt = \int_0^T [x(t) - y(t)]\, dt + a_{\min} T = 0. \qquad (4)$$
In terms of Fig. 1, this means that the sum of all the areas where x(t) is above ŷ(t), minus the sum of those areas where ŷ(t) is above x(t), is zero.
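As a sketch of this computation, the shift a_min of Eq. (3) and the zero integrated error of Eq. (4) can be checked numerically for piece-wise linear profiles; all time points and profile values below are illustrative, not from the paper's data set:

```python
import numpy as np

# Hypothetical piece-wise linear profiles; values are illustrative only.
t = np.array([0.0, 0.5, 1.0, 1.5])      # time points, T = 1.5
x = np.array([1.0, 3.0, 2.0, 4.0])      # profile x(t)
y = np.array([2.5, 4.0, 3.5, 5.0])      # profile y(t), to be aligned to x(t)
T = t[-1] - t[0]

def integral(v, t):
    """Exact integral of a piece-wise linear profile (trapezoid over the knots)."""
    return float(np.sum((v[1:] + v[:-1]) * np.diff(t)) / 2.0)

# Eq. (3): a_min = -(1/T) * integral over [0, T] of [x(t) - y(t)]
a_min = -integral(x - y, t) / T

# Eq. (4): the shifted profile y_hat = y - a_min leaves zero integrated error
y_hat = y - a_min
err = integral(x - y_hat, t)
```

Here y(t) lies above x(t) everywhere, so a_min is positive and the shift moves y(t) down towards x(t).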
Given an original profile x(t) = [e_1, e_2, …, e_n] (with n expression values taken at n time-points t_1, t_2, …, t_n), we use natural cubic spline interpolation, with n knots (t_1, e_1), …, (t_n, e_n), to represent x(t) as a continuously integrable function

$$x(t) = \begin{cases} x_1(t) & \text{if } t_1 \le t \le t_2 \\ \vdots & \\ x_{n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (5)$$

where $x_j(t) = x_{j3}(t - t_j)^3 + x_{j2}(t - t_j)^2 + x_{j1}(t - t_j)^1 + x_{j0}(t - t_j)^0$ interpolates x(t) in the interval $[t_j, t_{j+1}]$, with spline coefficients $x_{jk} \in \mathbb{R}$, for $1 \le j \le n-1$ and $0 \le k \le 3$.

For practical purposes, given the coefficients $x_{jk} \in \mathbb{R}$ associated with $x(t) = [e_1, e_2, \ldots, e_n] \in \mathbb{R}^n$, we need only transform x(t) into a new space as $x(t) = [x_{13}, x_{12}, x_{11}, x_{10}, \ldots, x_{j3}, x_{j2}, x_{j1}, x_{j0}, \ldots, x_{(n-1)3}, x_{(n-1)2}, x_{(n-1)1}, x_{(n-1)0}] \in \mathbb{R}^{4(n-1)}$. We can add or subtract polynomials given their coefficients, and the polynomials are continuously differentiable. This yields an analytical solution for $a_{\min}$ in Eq. (3) as

$$a_{\min} = -\frac{1}{T} \sum_{j=1}^{n-1} \int_{t_j}^{t_{j+1}} [x_j(t) - y_j(t)]\, dt = -\frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} (x_{jk} - y_{jk}) \frac{(t_{j+1} - t_j)^{k+1}}{k+1}. \qquad (6)$$
Fig. 1(b) shows a pairwise alignment of the two initial profiles in Fig. 1(a), after applying the vertical shift ŷ(t) ← y(t) − a_min. The two aligned profiles cross each other many times, but the integrated error, Eq. (4), is zero.
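Assuming the spline coefficients are available (here obtained via SciPy's natural cubic splines; all knot values are illustrative), the closed form of Eq. (6) can be sketched and cross-checked against direct integration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative profiles interpolated with natural cubic splines
t = np.array([0.0, 0.5, 1.0, 1.5])
x = CubicSpline(t, [1.0, 3.0, 2.0, 4.0], bc_type='natural')
y = CubicSpline(t, [2.5, 4.0, 3.5, 5.0], bc_type='natural')
T = t[-1] - t[0]
dt = np.diff(t)                          # interval lengths t_{j+1} - t_j

# Eq. (6): a_min = -(1/T) * sum_j sum_k (x_jk - y_jk) * dt_j^(k+1) / (k+1).
# SciPy stores coefficients highest power first: c[m, j] multiplies
# (t - t_j)^(3-m), so the power-k coefficient on interval j is c[3-k, j].
a_min = 0.0
for j in range(len(dt)):
    for k in range(4):
        a_min += (x.c[3 - k, j] - y.c[3 - k, j]) * dt[j] ** (k + 1) / (k + 1)
a_min *= -1.0 / T

# Cross-check against the direct integral form of Eq. (3)
a_ref = -(x.integrate(t[0], t[-1]) - y.integrate(t[0], t[-1])) / T
```

Since the spline integral is a polynomial integral, the coefficient sum and the direct integration agree to machine precision.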
In particular, from Eq. (4), the horizontal t-axis will bisect
a profile x(t) into two halves with equal areas, when x(t) is
aligned to the t-axis. In the next section, we use this property
of Eq. (4) to define the multiple alignment of a set of profiles.
III. MULTIPLE EXPRESSION PROFILE ALIGNMENT
Given a set X = {x_1(t), …, x_s(t)}, we want to align the profiles such that the integrated squared error between any two vertically shifted profiles is minimal. Thus, for any x_i(t) and x_j(t), we want to find the values of a_i and a_j that minimize

$$f_{a_i,a_j}(x_i, x_j) = \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)]^2\, dt = \int_0^T \big[[x_i(t) - a_i] - [x_j(t) - a_j]\big]^2\, dt \qquad (7)$$
[Figure 1: two panels plotting expression ratio against time in hours, showing the profiles x and y before and after alignment.]

Fig. 1. (a) Unaligned profiles x(t) and y(t). (b) Aligned profiles x(t) and ŷ(t), after applying ŷ(t) ← y(t) − a_min.
where both xi(t) and xj(t) are shifted vertically by an
amount ai and aj , respectively, in possibly different directions,
whereas in the pairwise alignment of Eq. (1), profile y(t) is
shifted towards the fixed profile x(t). The multiple alignment
process consists then of finding the values of a1, . . . , as that
minimize

$$F_{a_1, \ldots, a_s}(x_1, \ldots, x_s) = \sum_{1 \le i < j \le s} f_{a_i, a_j}(x_i, x_j). \qquad (8)$$
The following result is important since it allows us to apply
multiple alignment as a pre-processing step, and iterative clus-
tering as a second step. This implies a substantial improvement
on computing efficiency and provides independence of the
clustering algorithm. The proof of this theorem and other
related results can be found in [14].
Theorem 1 (Multiple Alignment [14]): Given a fixed profile z(t) and a set of profiles X = {x_1(t), …, x_s(t)}, there always exists a multiple alignment X̂ = {x̂_1(t), …, x̂_s(t)} such that

$$\hat{x}_i(t) = x_i(t) - a_{\min_i}, \qquad a_{\min_i} = -\frac{1}{T} \int_0^T [z(t) - x_i(t)]\, dt, \qquad (9)$$

and, in particular, for the profile z(t) = 0, defined by the horizontal t-axis, we have

$$\hat{x}_i(t) = x_i(t) - a_{\min_i}, \quad \text{where} \quad a_{\min_i} = \frac{1}{T} \int_0^T x_i(t)\, dt. \qquad (10)$$
We use the multiple alignment of Eq. (10) in all subsequent discussions. Using spline interpolations, each profile x_i(t), 1 ≤ i ≤ s, is a continuous integrable profile

$$x_i(t) = \begin{cases} x_{i,1}(t) & \text{if } t_1 \le t \le t_2 \\ \vdots & \\ x_{i,n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (11)$$

where $x_{i,j}(t) = x_{ij3}(t - t_j)^3 + x_{ij2}(t - t_j)^2 + x_{ij1}(t - t_j)^1 + x_{ij0}(t - t_j)^0$ represents x_i(t) in the interval $[t_j, t_{j+1}]$, with spline coefficients $x_{ijk}$ for $1 \le i \le s$, $1 \le j \le n-1$ and $0 \le k \le 3$. Thus the analytical solution for $a_{\min_i}$ in Eq. (10) is

$$a_{\min_i} = \frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} x_{ijk} \frac{(t_{j+1} - t_j)^{k+1}}{k+1}. \qquad (12)$$
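A minimal sketch of the alignment of Eqs. (10) and (12), assuming SciPy's natural cubic splines and illustrative profile values: after shifting, each profile should integrate to zero over [0, T].

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 0.5, 1.0, 1.5])       # illustrative time points, T = 1.5
T = t[-1] - t[0]
data = [[1.0, 3.0, 2.0, 4.0],
        [2.5, 4.0, 3.5, 5.0],
        [0.5, 1.5, 1.0, 2.0]]            # illustrative expression profiles

profiles = [CubicSpline(t, e, bc_type='natural') for e in data]

# Eq. (10): a_min_i = (1/T) * integral of x_i(t) over [0, T]  (alignment to z(t) = 0)
shifts = [cs.integrate(t[0], t[-1]) / T for cs in profiles]

# Each shifted profile x_i(t) - a_min_i integrates to zero over [0, T]
residuals = [cs.integrate(t[0], t[-1]) - a * T for cs, a in zip(profiles, shifts)]
```

The shifts can equivalently be accumulated from the spline coefficients via Eq. (12); `integrate` is used here because SciPy evaluates the same polynomial integral exactly.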
IV. CENTROID OF A CLUSTER
The centroid of a cluster is the average point in the multi-
dimensional space, that is, its coordinates are the arithmetic
mean for each dimension separately over all the points in the
cluster. In a sense, it is the center of gravity for the respective
cluster. The distance between two clusters is determined as the distance between their centroids.
Given a set of profiles X = {x_1(t), …, x_s(t)}, we wish to find a representative centroid profile μ(t) that represents X well. An obvious choice is the function that minimizes

$$\Delta[\mu] = \sum_{i=1}^{s} d(x_i, \mu), \qquad (13)$$

where Δ plays the role of the within-cluster scatter defined in [13], and the distance between two profiles, x(t) and y(t), was defined in [14] as

$$d(x, y) = \frac{1}{T} \int_0^T [x(t) - y(t)]^2\, dt. \qquad (14)$$
The distance d(·, ·) as defined in Eq. (14) is unchanged by an additive shift x(t) → x(t) − a in either of its arguments, and hence it is order-preserving; that is, d(u, v) ≤ d(x, y) if and only if d(û, v̂) ≤ d(x̂, ŷ). Therefore, we have
$$\Delta[\mu] = \sum_{i=1}^{s} d(\hat{x}_i, \mu) = \frac{1}{T} \int_0^T \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)]^2\, dt, \qquad (15)$$

where X̂ = {x̂_1(t), …, x̂_s(t)} is the multiple alignment of Eq. (10). This is a functional of μ; that is, a mapping from the set of real-valued functions defined on [0, T] to the set of real numbers. To minimize with respect to μ we set the functional derivative to zero¹. This functional is of the form
$$F[\phi] = \int L(\phi(t))\, dt, \qquad (16)$$

for some function L, for which the functional derivative is simply $\frac{\delta F[\phi]}{\delta \phi(t)} = \frac{dL(\phi(t))}{d\phi(t)}$. In our case, we have

$$\frac{\delta \Delta[\mu]}{\delta \mu(t)} = -\frac{2}{T} \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)] = -\frac{2}{T} \left( \sum_{i=1}^{s} \hat{x}_i(t) - s\mu(t) \right). \qquad (17)$$

Setting $\frac{\delta \Delta[\mu]}{\delta \mu(t)} = 0$ gives

$$\mu(t) = \frac{1}{s} \sum_{i=1}^{s} \hat{x}_i(t). \qquad (18)$$
With the spline coefficients $x_{ijk}$ of each x_i(t) interpolated as in Eq. (11), the analytical solution for μ(t) in Eq. (18) is

$$\mu_j(t) = \frac{1}{s} \sum_{i=1}^{s} \left( \left[ \sum_{k=0}^{3} x_{ijk} (t - t_j)^k \right] - a_{\min_i} \right), \qquad (19)$$

in each interval $[t_j, t_{j+1}]$. Eq. (18) applies to aligned profiles, while Eq. (19) can apply to unaligned profiles.
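The centroid of Eq. (18) can be sketched as follows (illustrative data; natural-spline interpolation is linear in the knot values, so the mean of the aligned splines is the spline through the mean aligned knots):

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 0.5, 1.0, 1.5])
T = t[-1] - t[0]
data = np.array([[1.0, 3.0, 2.0, 4.0],
                 [2.5, 4.0, 3.5, 5.0],
                 [0.5, 1.5, 1.0, 2.0]])   # three illustrative profiles

# Multiple-align each profile to the t-axis (Eq. (10))
splines = [CubicSpline(t, row, bc_type='natural') for row in data]
shifts = np.array([cs.integrate(t[0], t[-1]) / T for cs in splines])
aligned_knots = data - shifts[:, None]

# Eq. (18): the centroid is the pointwise mean of the aligned profiles;
# by linearity of the interpolation it is the spline through the mean knots.
centroid = CubicSpline(t, aligned_knots.mean(axis=0), bc_type='natural')

# Since each aligned profile integrates to zero over [0, T], so does the centroid
resid = centroid.integrate(t[0], t[-1])
```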
V. EM CLUSTERING VIA MULTIPLE ALIGNMENT
In [14], we devised a clustering approach, k-MCMA, where
we combined the multiple alignment of Eq. (10) and the k-
means clustering method with a distance function based on the
pairwise alignment of Eq. (3). In this paper, we use the EM
clustering algorithm instead and combine with the alignment
methods.
EM is used for clustering in the context of mixture models [16]. The goal of EM clustering is to estimate the means and standard deviations of each cluster so as to maximize the likelihood of the observed data. In other words, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters. A mixture of Gaussians is a set of k probability distributions, where each distribution represents a cluster. Starting with an initial approximation of the cluster parameters, the algorithm iteratively performs two steps: first, the expectation step computes the values expected for the cluster probabilities, and second, the maximization step computes the distributions and their likelihood. It iterates until the log-likelihood reaches a (possibly local) maximum. The algorithm is similar to k-means in the sense that the centers of the natural clusters in the data are re-computed until a desired convergence is achieved.
We want to partition a set of s profiles, D = {x_1(t), …, x_s(t)}, into k disjoint clusters C_1, …, C_k, 1 ≤ k ≤ s, such that: (i) C_i ≠ ∅, i = 1, …, k; (ii) ⋃_{i=1}^{k} C_i = D; (iii) C_i ∩ C_j = ∅, i, j = 1, …, k and i ≠ j. Let D be the complete-data space drawn independently from the mixture density:

E-step: $$p(x \mid \theta) = \sum_{i=1}^{k} p(x \mid C_i, \theta_i) P(C_i), \qquad (20)$$

where the parameter $\theta = [\theta_1, \ldots, \theta_k]^t$ is fixed but unknown, and $P(C_i)$ is the known prior probability of class $C_i$. The aim is to maximize the likelihood:

M-step: $$p(D \mid \theta) = \prod_{e=1}^{s} p(x_e \mid \theta). \qquad (21)$$

To maximize the likelihood function, the log-likelihood is used; the component densities are normal distributions, $p(x_e \mid C_i, \theta_i) \sim N(\mu_i, \Sigma_i)$, where $\theta_i = [\mu_i, \Sigma_i]^t$; $\mu_i$ and $\Sigma_i$ are the means and covariances of the classes, respectively. Both steps iterate until the log-likelihood reaches a maximum.

¹For a functional F[φ], the functional derivative is defined as $\frac{\delta F[\phi]}{\delta \phi(t)} = \lim_{\varepsilon \to 0} \frac{F[\phi + \varepsilon \delta_t] - F[\phi]}{\varepsilon}$, where $\delta_t(\tau) = \delta(\tau - t)$ is the Dirac delta function centered at t.
Thus, EM assigns profiles to multiple clusters, as in fuzzy clustering. Unlike in k-means, each profile is then assigned to the cluster with the maximum posterior probability.
Algorithm 1 EMMA: EM Clustering with Multiple Alignment
Require: Set of profiles, D = {x_1(t), …, x_s(t)}, and desired number of clusters, k
Ensure: Clusters C_{μ1}, …, C_{μk}
1: Apply natural cubic spline interpolation on x_i(t) ∈ D, for 1 ≤ i ≤ s (see Section II)
2: Multiple-align the transformed D to obtain D̂ = {x̂_1(t), …, x̂_s(t)}, using Eq. (10)
3: Initialize centroids μ_i(t), for 1 ≤ i ≤ k
4: Compute the initial log-likelihood (see Eq. (21))
5: repeat
6:   E-step: p(x | θ) = Σ_{i=1}^{k} p(x | C_{μi}, θ_i) P(C_{μi})
7:   Assign x̂_j(t) to the cluster C_{μi} with maximum log-likelihood, for 1 ≤ j ≤ s and 1 ≤ i ≤ k
8:   M-step: p(D̂ | θ) = Π_{e=1}^{s} p(x̂_e | θ)
9: until the log-likelihood reaches its maximum
10: return Clusters C_{μ1}, …, C_{μk}
In EMMA (see Algorithm 1), we first multiple-align the set
of profiles D, using Eq. (10), and then cluster the multiple-
aligned D with EM. Recall that the process of Eq. (10) is to
pairwise align each profile with the t-axis. The k centroids
can be initialized randomly in step (3) of EMMA, or by any
initialization approach. However, to obtain better clustering
results with EMMA, it is necessary to start with near-optimal
centroids; thus, we applied the k-MCMA algorithm of [14] to
generate the k initial centroids in step (3) (we have also tried
different initialization methods but with less success).
For distance-based clustering algorithms such as the k-
means method, it has been shown that any arbitrary distance
function (including Euclidean distance) can be used in the
cluster assignment step of the algorithm, when the profiles are
multiple-aligned first [14]. Moreover, Theorem 2 (whose proof can be found in [14]) shows that, for distance-based methods, there is no need to multiple-align C_{μi} to update its centroid μ_i(t).

Theorem 2 ([14]): Let μ(t) be the centroid of a cluster of m multiple-aligned profiles. Then μ̂(t) = μ(t).

By Theorem 2, there is also no need to multiple-align a cluster C_{μi} in step (6) of EMMA for updating its centroid μ_i(t). Likewise, any arbitrary distance function can be used in step (6) for computing the centroids. Thus, Theorem 2 makes EMMA run much faster than applying EM directly on the non-aligned data set D. If a profile x_i(t) is assigned to a cluster C_{μi} by EM on D, its shifted profile x̂_i(t) will be assigned to cluster C_{μi} by EMMA (EM on D̂). This can be easily shown by the fact that multiple alignment is order-preserving and distance-preserving, as pointed out in Section 4 of [14] (and briefly in Section IV). EMMA is not a distance-based clustering method; nevertheless, the quantities p(x|θ), p(x|C_{μi}, θ_i), P(C_{μi}), and p(D|θ) are also preserved when the distances are preserved. In other words, we can multiple-align once and obtain the same EM clustering results, provided that we initialize the means in the same manner. This opens the door for applications of multiple alignment methods to any (distance-based) clustering method.
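A rough sketch of the EMMA pipeline on synthetic data, with scikit-learn's GaussianMixture standing in for the E/M iteration of Algorithm 1 and its default k-means initialization replacing the k-MCMA initialization; the two profile groups and all parameter values below are invented for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.5, 8)             # illustrative time points
T = t[-1] - t[0]

# Two synthetic groups of noisy profiles (invented data, not the yeast set)
group_a = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((20, t.size))
group_b = np.cos(2 * np.pi * t) + 0.1 * rng.standard_normal((20, t.size))
data = np.vstack([group_a, group_b])

# Steps 1-2 of EMMA: spline-interpolate and multiple-align to the t-axis (Eq. (10))
splines = [CubicSpline(t, row, bc_type='natural') for row in data]
shifts = np.array([cs.integrate(t[0], t[-1]) / T for cs in splines])
aligned = data - shifts[:, None]

# Steps 3-9: EM on a Gaussian mixture over the aligned profiles
gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
labels = gmm.fit_predict(aligned)
```

Because alignment is distance-preserving, clustering the aligned vectors here plays the role of clustering D̂ in Algorithm 1.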
VI. COMPUTATIONAL EXPERIMENTS
The pre-clustered budding yeast data set of [1]² contains time-series gene expression profiles from the complete characterization of mRNA transcript levels during the yeast cell cycle.
These experiments measured the expression levels of the 6,220
yeast genes during the cell cycle at seventeen different points,
from 0 to 160 minutes, at every 10-minute time-interval. From
those gene profiles, 221 profiles were analyzed. We normalized each expression profile as in [1]; that is, we divided each transcript level of a profile by the mean value of that profile.
The data set contains five known clusters called phases:
Early G1 phase (32 genes), Late G1 phase (84 genes), S phase
(46 genes), G2 phase (28 genes) and M phase (31 genes); the
phases are visualized in Fig. 2(b). Setting k = 5, we applied
EMMA on the data set to see if EMMA is able to find these
phases as accurately as possible. Once the clusters have been
found, to compare the EMMA clustering with the pre-clustered
dataset of [1], the next step is to label the clusters, where the
labels are the “phases” in the pre-clustered dataset.
To measure the performance of the EMMA method, we assigned each EMMA cluster to a yeast phase using the Hungarian algorithm [18]. The Hungarian method is a combinatorial optimization algorithm which solves the assignment problem in polynomial time. Our phase assignment problem and the complete discussion of the solution can be found in [14]. In Fig. 2, the cluster and the phase of each of the five selected pairs, found by the Hungarian algorithm, are shown at the same level; e.g., cluster C5 of EMMA in Fig. 2(a) is assigned to the Late G1 phase of [1] by our phase assignment approach, and hence they are at the same level in the figure. The same procedure was applied using k-MCMA of [14], and the results are shown in Fig. 2(c).

²http://genomics.stanford.edu/yeast cell cycle/cellcycle.html
The five clusters found by EMMA are shown in Fig. 2(a) and those found by k-MCMA are shown in Fig. 2(c), while the corresponding phases of [1] after the phase assignment are shown in Fig. 2(b). The horizontal axis represents the time-points in minutes and the vertical axis represents the expression values. The dashed black lines are the cluster centroids learned by EMMA (Fig. 2(a)) and the known phase centroids of the yeast data (Fig. 2(b)). In the figure, each cluster and phase were multiple-aligned using Eq. (10) to enhance visualization.
Fig. 2 clearly shows a high degree of similarity between
the EMMA clusters and the yeast phases. Visually, each
EMMA cluster on the left is very similar to exactly one of
the yeast phases, which we show at the same level on the
right. Also visually, it even “seems” that the EMMA clusters are more accurate than the yeast phases and the k-MCMA clusters, which suggests that EMMA can also correct manual phase assignment errors, if any.
An objective measure for comparing the EMMA and k-MCMA clusters with the yeast phases was computed as follows. For each EMMA or k-MCMA cluster C_{μc} (1 ≤ c ≤ k = 5), we find the shortest distance between each profile x_i(t), 1 ≤ i ≤ |C_{μc}|, and all five phase centroids ν_j(t), 1 ≤ j ≤ k = 5, using Eq. (16) of [14]. Profile x_i(t) is assigned the correct label (i.e., the phase label of P_{νj}) whenever x_i(t) ∈ P_{νj} and (C_{μc}, P_{νj}) ∈ S, the set of selected cluster-phase pairs; otherwise, x_i(t) is assigned the incorrect label, if cluster C_{μc} was not paired with phase P_{νj} by our pair-assignment method. The percentage of correct assignments over the 221 profiles was used as our measure of accuracy, resulting in 83.26% for EMMA and 79.64% for k-MCMA. That is,

$$\text{Acc} = \frac{\sum_{c=1}^{k} \sum_{i=1}^{|C_{\mu_c}|} E\left(c, \arg\min_{1 \le j \le k} d(x_i, \nu_j)\right)}{221}, \qquad (22)$$

where E(a, b) returns 1 when a = b, and zero otherwise.
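The Hungarian cluster-to-phase assignment and an Eq. (22)-style accuracy can be sketched as follows; here the assignment is solved on a cluster/phase contingency table (a simplification of the paper's centroid-distance formulation), and all labels below are invented for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Invented toy labels: known phase and found cluster for seven profiles
true_phase = np.array([0, 0, 1, 1, 2, 2, 2])
cluster    = np.array([1, 1, 0, 0, 2, 2, 0])
k = 3

# Contingency table: overlap[c, p] counts profiles of phase p in cluster c
overlap = np.zeros((k, k), dtype=int)
for c, p in zip(cluster, true_phase):
    overlap[c, p] += 1

# Hungarian algorithm: maximize total overlap = minimize its negation
rows, cols = linear_sum_assignment(-overlap)
phase_of_cluster = dict(zip(rows, cols))

# Accuracy: fraction of profiles whose cluster maps to their true phase
acc = np.mean([phase_of_cluster[c] == p for c, p in zip(cluster, true_phase)])
```

In this toy example the assignment pairs each cluster with its dominant phase and one profile is mislabeled, giving an accuracy of 6/7.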
This criterion is reasonable, as EMMA and k-MCMA are unsupervised learning approaches that do not know the phases beforehand, and hence the aim is to “discover” the phases. In [1], the five phases were determined using biological information, including genomic and phenotypic features observed in the yeast cell cycle experiments. EMMA's accuracy of 83.26% is quite high considering that it is an unsupervised learning method.
VII. CONCLUSION
We proposed EMMA, a method that combines EM and
multiple alignment of gene expression profiles to cluster
microarray time-series data. The profiles are represented as natural cubic spline functions, which allows comparing profiles whose expression measurements were not taken at the same time-
intervals. Multiple alignment is based on minimizing the sum
of integrated squared errors over a time-interval, defined on
[Figure 2: three panels plotting expression ratio against time in minutes, with clusters C1-C5 and the phases M, Early G1, S, G2 and Late G1 labeled.]

Fig. 2. (a) EMMA clusters, (b) Yeast phases [1] and (c) k-MCMA clusters, with centroids shown.
a set of profiles. EMMA was able to find the five yeast cell-cycle phases of [1] with an accuracy of 83.26%, whereas k-MCMA's accuracy was 79.64%. EMMA can also be used
to correct manual phase assignment errors. In the future, we
plan to study other distance-based clustering approaches using
our multiple alignment method. It will be also interesting to
study the effectiveness of any such clustering methods in dose-
response microarray data sets. Cluster validity indices based on
multiple alignment are currently being investigated. We argue
that in real applications data can be very noisy, and the use of
cubic spline interpolation could lead to some problems. The
use of splines has the advantage of being tractable, however.
We also plan to study interpolation methods that incorporate
noise. Currently, we are also carrying out experiments with
larger data sets.
Acknowledgements: This research has been partially
funded by Canadian NSERC Grants #RGPIN228117-2006 and
#RGPIN261360-2009, and CFI grant #9263.
REFERENCES

[1] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Lockhart, R. Davis: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2(1) (1998) 65–73
[2] Z. Bar-Joseph, G. Gerber, T. Jaakkola, D. Gifford, I. Simon: Continuous representations of time series gene expression data. Journal of Computational Biology 10(3–4) (2003)
[3] L. Brehelin: Clustering gene expression series with prior knowledge. Lecture Notes in Computer Science 3692 (2005) 27–28
[4] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, I. Herskowitz: The transcriptional program of sporulation in budding yeast. Science 282 (1998) 699–705
[5] S. Déjean, P. Martin, A. Baccini, P. Besse: Clustering time-series gene expression data using smoothing spline derivatives. EURASIP Journal on Bioinformatics and Systems Biology 70561 (2007)
[6] J. Ernst, G. Nau, Z. Bar-Joseph: Clustering short time series gene expression data. Bioinformatics 21(suppl. 1) (2005) i159–i168
[7] M. Ramoni, P. Sebastiani, I. Kohane: Cluster analysis of gene expression dynamics. Proc. Natl Acad. Sci. USA 99 (2002)
[8] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, G. Church: Systematic determination of genetic network architecture. Nature Genetics 22 (1999) 281–285
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, T. Golub: Interpreting patterns of gene expression with SOMs: Methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA 96 (1999)
[10] L. Heyer, S. Kruglyak, S. Yooseph: Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9 (1999) 1106–1115
[11] C. Moller-Levet, F. Klawonn, K. Cho, O. Wolkenhauer: Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets and Systems 152(1) (2005) 49–66
[12] S. Peddada, E. Lobenhofer, L. Li, C. Afshari, C. Weinberg, D. Umbach: Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19(7) (2003) 834–841
[13] L. Rueda, A. Bari, A. Ngom: Clustering time-series gene expression data with unequal time intervals. Springer Transactions on Computational Systems Biology X, LNBI 5410 (2008) 100–123
[14] N. Subhani, A. Ngom, L. Rueda, C. Burden: Microarray time-series data clustering via multiple alignment of gene expression profiles. Springer Transactions on Pattern Recognition in Bioinformatics, accepted (2009)
[15] R. Xu, D. Wunsch: Clustering. Wiley-IEEE Press (2008)
[16] A. Dempster, N. Laird, D. Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39 (1977) 1–38
[17] V. Roth, J. Laub, M. Kawanabe, J. Buhmann: Optimal cluster preserving embedding of nonmetric proximity data. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12) (2003) 1540–1551
[18] H. Kuhn: The Hungarian method for the assignment problem. Naval Research Logistics 52(1) (2005) 7–21