Clustering Microarray Time-series Data using Expectation Maximization and Multiple Profile Alignment

Numanul Subhani∗, Luis Rueda∗, Alioune Ngom∗, Conrad J. Burden†

∗School of Computer Science, 5115 Lambton Tower, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
{hoque4,lrueda,angom}@uwindsor.ca

†Centre for Bioinformation Science, Mathematical Sciences Institute and John Curtin School of Medical Research, The Australian National University, Canberra, ACT 0200, Australia
[email protected]

Abstract—A common problem in biology is to partition a set of experimental data into clusters in such a way that the data points within the same cluster are highly similar while data points in different clusters are very different. In this direction, clustering microarray time-series data via pairwise alignment of piece-wise linear profiles has been recently introduced. We propose an EM clustering approach based on a multiple alignment of natural cubic spline representations of gene expression profiles. The multiple alignment is achieved by minimizing the sum of integrated squared errors over a time-interval, defined on a set of profiles. Preliminary experiments on a well-known data set of 221 pre-clustered Saccharomyces cerevisiae gene expression profiles yield encouraging results with 83.26% accuracy.

Index Terms—Microarrays, Time-Series Data, Gene Expression Profiles, Profile Alignment, Cubic Spline, Clustering

I. INTRODUCTION

Genes with similar expression profiles are expected to be functionally related or co-regulated, and an important task in functional genomic studies is therefore clustering microarray time-series data [1]. A Bayesian approach in [7], a partitional clustering based on k-means in [8], and a Euclidean distance approach in [9] have been proposed for clustering time-series gene expression profiles; the authors of [9] applied self-organizing maps (SOMs) to visualize and interpret the temporal gene expression patterns. A hidden phase model was used for clustering time-series data to define the parameters of a mixture of normal distributions in a Bayesian-like manner, estimated by expectation maximization (EM) [3]. The methods proposed in [4], [10] are based on correlation measures. A method that uses jack-knife correlation, with or without seeded candidate profiles, was also proposed for clustering time-series microarray data [10]; there, the resulting clusters depend on the initially chosen template genes, so important genes may be missed. A regression-based method was proposed in [6] to address the challenges of clustering short time-series expression data.

Analyzing gene temporal expression profiles that are non-uniformly sampled and may contain missing values has been studied in [2]. Clustering temporal gene expression profiles by identifying homogeneous clusters of genes was studied in [5], where the shapes of the curves were considered instead of the absolute expression ratios. Fuzzy clustering of gene temporal profiles, in which the similarities between co-expressed genes are computed from the rate of change of the expression ratios across time, has been studied in [11]. In [12], the idea of order-restricted inference across time was applied to select and cluster genes, where the estimation makes use of known inequalities among parameters; in this approach, two gene expression profiles fall into the same cluster if they show similar directions of change of the expression ratios, regardless of how large or small the changes are. In [13], pairs of profiles represented by piece-wise linear functions are aligned so as to minimize the integrated squared area between the profiles, and an agglomerative clustering method, combined with an area-based distance measure between two aligned profiles, is used to cluster microarray time-series data.

Using natural cubic spline interpolations, we re-formulated in [14] the pairwise gene expression profile alignment problem of [13] in terms of integrals of arbitrary functions, and generalized the pairwise alignment formulae of [13] from piece-wise linear profiles to profiles that are any continuous integrable functions on a finite interval. We then extended the concept of pairwise alignment to multiple expression profile alignment, where the profiles of a given set are aligned in such a way that the sum of integrated squared errors over a time-interval, defined on the set, is minimized. Finally, we combined k-means clustering with our multiple alignment approach to cluster microarray time-series data, achieving an accuracy of 79.64% when clustering 221 pre-clustered Saccharomyces cerevisiae time-course gene expression profiles.

In this paper, we combine EM clustering with our multiple alignment approach of [14] to cluster microarray time-series data.

II. PAIRWISE EXPRESSION PROFILE ALIGNMENT

Clustering time-series expression data with unequal time intervals is different from the general clustering problem, because exchanging two or more time points can deliver quite different results, while such an exchange may not be biologically meaningful at all. The length of each interval is taken into account by analyzing the area between two expression profiles, each joined by its corresponding measurements at subsequent time points. This is equivalent to considering the sum, or average, of squared errors between the infinitely many points on the two curves. The analysis can be carried out by computing the underlying integral, which is resolved analytically in advance, thereby avoiding expensive computations during the clustering process.

Given two profiles, x(t) and y(t) (either piece-wise linear or continuously integrable functions), where y(t) is to be aligned to x(t), the basic idea of alignment is to vertically shift y(t) towards x(t) in such a way that the integrated squared error between the two profiles is minimal. Let ŷ(t) be the result of shifting y(t). Here, the error is defined in terms of the areas between x(t) and ŷ(t) in the interval [0, T]. Functions x(t) and ŷ(t) may cross each other many times, but we want the sum of all the areas where x(t) is above ŷ(t), minus the sum of those areas where ŷ(t) is above x(t), to be minimal (see Fig. 1). Let a denote the amount of vertical shifting of y(t). Then, we want to find the value a_min of a that minimizes the integrated squared error between x(t) and y(t) − a. Once we obtain a_min, the alignment process consists of performing the shift on y(t) as ŷ(t) = y(t) − a_min.

The pairwise alignment results of [13] generalize from the case of piece-wise linear profiles to profiles which are any integrable functions on a finite interval. Suppose we have two profiles, x(t) and y(t), defined on the time-interval [0, T]. The alignment process consists of finding the value a that minimizes

$$ f_a(x, y) = \int_0^T \big[x(t) - [y(t) - a]\big]^2 \, dt. \qquad (1) $$

Differentiating yields

$$ \frac{d}{da} f_a(x, y) = 2 \int_0^T [x(t) + a - y(t)] \, dt = 2 \int_0^T [x(t) - y(t)] \, dt + 2aT. \qquad (2) $$

Setting $\frac{d}{da} f_a(x, y) = 0$ and solving for a gives

$$ a_{\min} = -\frac{1}{T} \int_0^T [x(t) - y(t)] \, dt, \qquad (3) $$

and since $\frac{d^2}{da^2} f_a(x, y) = 2T > 0$, a_min is indeed a minimum. The integrated error between x(t) and the shifted ŷ(t) = y(t) − a_min is then

$$ \int_0^T [x(t) - \hat{y}(t)] \, dt = \int_0^T [x(t) - y(t)] \, dt + a_{\min} T = 0. \qquad (4) $$

In terms of Fig. 1, this means that the sum of all the areas where x(t) is above ŷ(t), minus the sum of those areas where ŷ(t) is above x(t), is zero.
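As a concrete illustration of Eqs. (1)–(4), the following minimal sketch (not part of the original paper) computes a_min for two profiles sampled on a common grid over [0, T], using trapezoidal quadrature in place of the exact integral; the closed-form spline solution is given in Eq. (6) below.

```python
import numpy as np

def align_pairwise(t, x, y):
    """Shift y towards x, Eq. (3): a_min = -(1/T) * integral of [x(t) - y(t)] dt."""
    T = t[-1] - t[0]
    a_min = -np.trapz(x - y, t) / T
    return y - a_min, a_min            # aligned profile y_hat(t) = y(t) - a_min

# toy profiles on [0, 1.5]
t = np.linspace(0.0, 1.5, 200)
x = 2.0 + np.sin(2 * np.pi * t)
y = 4.0 + np.cos(2 * np.pi * t)
y_hat, a_min = align_pairwise(t, x, y)

# Eq. (4): the signed integrated error between x and the shifted y is zero
print(np.trapz(x - y_hat, t))          # ~0, up to quadrature error
```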

Given an original profile x(t) = [e_1, e_2, . . . , e_n] (with n expression values taken at n time-points t_1, t_2, . . . , t_n), we use natural cubic spline interpolation, with n knots (t_1, e_1), . . . , (t_n, e_n), to represent x(t) as a continuously integrable function

$$ x(t) = \begin{cases} x_1(t) & \text{if } t_1 \le t \le t_2 \\ \;\;\vdots & \\ x_{n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (5) $$

where $x_j(t) = x_{j3}(t - t_j)^3 + x_{j2}(t - t_j)^2 + x_{j1}(t - t_j) + x_{j0}$ interpolates x(t) in the interval [t_j, t_{j+1}], with spline coefficients x_{jk} ∈ ℝ, for 1 ≤ j ≤ n − 1 and 0 ≤ k ≤ 3.

For practical purposes, given the coefficients x_{jk} ∈ ℝ associated with x(t) = [e_1, e_2, . . . , e_n] ∈ ℝ^n, we need only transform x(t) into a new space as x(t) = [x_{13}, x_{12}, x_{11}, x_{10}, . . . , x_{j3}, x_{j2}, x_{j1}, x_{j0}, . . . , x_{(n−1)3}, x_{(n−1)2}, x_{(n−1)1}, x_{(n−1)0}] ∈ ℝ^{4(n−1)}. We can add or subtract polynomials given their coefficients, and the polynomials are continuously differentiable. This yields an analytical solution for a_min in Eq. (3) as

$$ a_{\min} = -\frac{1}{T} \sum_{j=1}^{n-1} \int_{t_j}^{t_{j+1}} [x_j(t) - y_j(t)] \, dt = -\frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} (x_{jk} - y_{jk}) \frac{(t_{j+1} - t_j)^{k+1}}{k + 1}. \qquad (6) $$
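A minimal sketch of this coefficient-based computation follows; it assumes SciPy's natural cubic spline (bc_type='natural'), whose per-interval coefficients are stored from the cubic term down to the constant term, and it is an illustration rather than the authors' code.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def natural_spline_coeffs(t, e):
    """Natural cubic spline of a profile; row j holds [x_j3, x_j2, x_j1, x_j0] of Eq. (5)."""
    return CubicSpline(t, e, bc_type='natural').c.T      # shape (n-1, 4)

def a_min_analytic(t, cx, cy):
    """Closed form of Eq. (6) from the spline coefficients of x(t) and y(t)."""
    T, h = t[-1] - t[0], np.diff(t)                       # h[j] = t_{j+1} - t_j
    total = 0.0
    for j in range(len(h)):
        for k in range(4):
            # the coefficient of (t - t_j)^k sits in column 3 - k of SciPy's ordering
            total += (cx[j, 3 - k] - cy[j, 3 - k]) * h[j] ** (k + 1) / (k + 1)
    return -total / T

# toy profiles with n = 7 time points; y is a vertically shifted copy of x
t = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5])
x = np.array([2.1, 2.9, 3.4, 3.0, 2.2, 1.8, 2.0])
y = x + 1.7
print(a_min_analytic(t, natural_spline_coeffs(t, x), natural_spline_coeffs(t, y)))  # ~1.7
```

Here y(t) − a_min recovers x(t) exactly, since the two toy profiles differ only by a vertical offset.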

Fig. 1(b) shows a pairwise alignment of the two initial profiles in Fig. 1(a), after applying the vertical shift ŷ(t) ← y(t) − a_min. The two aligned profiles cross each other many times, but the integrated error, Eq. (4), is zero.

In particular, from Eq. (4), the horizontal t-axis bisects a profile x(t) into two halves with equal areas when x(t) is aligned to the t-axis. In the next section, we use this property of Eq. (4) to define the multiple alignment of a set of profiles.

III. MULTIPLE EXPRESSION PROFILE ALIGNMENT

Given a set X = {x_1(t), . . . , x_s(t)}, we want to align the profiles such that the integrated squared error between any two vertically shifted profiles is minimal. Thus, for any x_i(t) and x_j(t), we want to find the values of a_i and a_j that minimize

$$ f_{a_i, a_j}(x_i, x_j) = \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)]^2 \, dt = \int_0^T \big[[x_i(t) - a_i] - [x_j(t) - a_j]\big]^2 \, dt, \qquad (7) $$

Fig. 1. (a) Unaligned profiles x(t) and y(t). (b) Aligned profiles x(t) and ŷ(t), after applying ŷ(t) ← y(t) − a_min. (Both panels plot expression ratio against time in hours.)

where both x_i(t) and x_j(t) are shifted vertically by amounts a_i and a_j, respectively, possibly in different directions, whereas in the pairwise alignment of Eq. (1), profile y(t) is shifted towards the fixed profile x(t). The multiple alignment process then consists of finding the values of a_1, . . . , a_s that minimize

$$ F_{a_1, \ldots, a_s}(x_1, \ldots, x_s) = \sum_{1 \le i < j \le s} f_{a_i, a_j}(x_i, x_j). \qquad (8) $$

The following result is important since it allows us to apply multiple alignment as a pre-processing step, and iterative clustering as a second step. This yields a substantial improvement in computational efficiency and makes the procedure independent of the clustering algorithm. The proof of this theorem and other related results can be found in [14].

Theorem 1 (Multiple Alignment [14]): Given a fixed profile z(t) and a set of profiles X = {x_1(t), . . . , x_s(t)}, there always exists a multiple alignment X̂ = {x̂_1(t), . . . , x̂_s(t)} such that

$$ \hat{x}_i(t) = x_i(t) - a_{\min_i}, \qquad a_{\min_i} = -\frac{1}{T} \int_0^T [z(t) - x_i(t)] \, dt, \qquad (9) $$

and, in particular, for the profile z(t) = 0, defined by the horizontal t-axis, we have

$$ \hat{x}_i(t) = x_i(t) - a_{\min_i}, \qquad \text{where } a_{\min_i} = \frac{1}{T} \int_0^T x_i(t) \, dt. \qquad (10) $$

We use the multiple alignment of Eq. (10) in all subsequent discussions. Using spline interpolations, each profile x_i(t), 1 ≤ i ≤ s, is a continuous integrable profile

$$ x_i(t) = \begin{cases} x_{i,1}(t) & \text{if } t_1 \le t \le t_2 \\ \;\;\vdots & \\ x_{i,n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (11) $$

where $x_{i,j}(t) = x_{ij3}(t - t_j)^3 + x_{ij2}(t - t_j)^2 + x_{ij1}(t - t_j) + x_{ij0}$ represents x_i(t) in the interval [t_j, t_{j+1}], with spline coefficients x_{ijk} for 1 ≤ i ≤ s, 1 ≤ j ≤ n − 1 and 0 ≤ k ≤ 3. Thus, the analytical solution for a_min_i in Eq. (10) is

$$ a_{\min_i} = \frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} x_{ijk} \frac{(t_{j+1} - t_j)^{k+1}}{k + 1}. \qquad (12) $$
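The sketch below illustrates the multiple alignment of Eq. (10) on a set of sampled profiles, approximating each a_min_i numerically (the closed form of Eq. (12) gives the same shifts when spline coefficients are available); it is an illustrative example, not the authors' implementation.

```python
import numpy as np

def multiple_align(t, profiles):
    """Eq. (10): x_hat_i(t) = x_i(t) - a_min_i, with a_min_i = (1/T) * integral of x_i(t) dt."""
    T = t[-1] - t[0]
    profiles = np.asarray(profiles, dtype=float)          # shape (s, n)
    a_min = np.trapz(profiles, t, axis=1) / T             # one shift per profile
    return profiles - a_min[:, None], a_min

# toy set of s = 3 profiles sharing a shape but with different vertical offsets
t = np.linspace(0.0, 1.5, 7)
X = np.vstack([2.0 + np.sin(t), 5.0 + np.sin(t), -1.0 + np.sin(t)])
X_hat, shifts = multiple_align(t, X)

# after alignment every profile has zero integrated area over [0, T] (Theorem 1 with z(t) = 0)
print(np.trapz(X_hat, t, axis=1))                         # ~[0, 0, 0]
```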

IV. CENTROID OF A CLUSTER

The centroid of a cluster is the average point in the multi-dimensional space; that is, its coordinates are the arithmetic mean for each dimension, taken separately over all the points in the cluster. In a sense, it is the center of gravity of the respective cluster. The distance between two clusters is determined as the difference between their centroids.

Given a set of profiles X = {x_1(t), . . . , x_s(t)}, we wish to find a representative centroid profile μ(t) that represents X well. An obvious choice is the function that minimizes

$$ \Delta[\mu] = \sum_{i=1}^{s} d(x_i, \mu), \qquad (13) $$

where Δ plays the role of the within-cluster scatter defined in [13], and the distance between two profiles, x(t) and y(t), was defined in [14] as

$$ d(x, y) = \frac{1}{T} \int_0^T [x(t) - y(t)]^2 \, dt. \qquad (14) $$
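A short numerical sketch of the distance in Eq. (14) follows, again assuming profiles sampled on a common grid and trapezoidal quadrature for the integral (an illustration, not the authors' code).

```python
import numpy as np

def profile_distance(t, x, y):
    """Eq. (14): d(x, y) = (1/T) * integral over [0, T] of [x(t) - y(t)]^2 dt."""
    T = t[-1] - t[0]
    return np.trapz((x - y) ** 2, t) / T

t = np.linspace(0.0, 1.5, 200)
x = 2.0 + np.sin(2 * np.pi * t)
y = 3.5 + np.sin(2 * np.pi * t)
print(profile_distance(t, x, y))       # 1.5**2 = 2.25 for a pure vertical offset of 1.5
```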

The distance d(·, ·) defined in Eq. (14) is unchanged by an additive shift x(t) → x(t) − a in either of its arguments, and hence it is order-preserving; that is, d(u, v) ≤ d(x, y) if and only if d(û, v̂) ≤ d(x̂, ŷ). Therefore, we have

$$ \Delta[\mu] = \sum_{i=1}^{s} d(\hat{x}_i, \mu) = \frac{1}{T} \int_0^T \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)]^2 \, dt, \qquad (15) $$

where X̂ = {x̂_1(t), . . . , x̂_s(t)} is the multiple alignment of Eq. (10). This is a functional of μ; that is, a mapping from the set of real-valued functions defined on [0, T] to the set of real numbers. To minimize with respect to μ, we set the functional derivative to zero. (For a functional F[φ], the functional derivative is defined as $\frac{\delta F[\varphi]}{\delta \varphi(t)} = \lim_{\varepsilon \to 0} \frac{F[\varphi + \varepsilon \delta_t] - F[\varphi]}{\varepsilon}$, where δ_t(τ) = δ(τ − t) is the Dirac delta function centered at t.) This functional is of the form

$$ F[\varphi] = \int L(\varphi(t)) \, dt, \qquad (16) $$

for some function L, for which the functional derivative is simply $\frac{\delta F[\varphi]}{\delta \varphi(t)} = \frac{dL(\varphi(t))}{d\varphi(t)}$. In our case, we have

$$ \frac{\delta \Delta[\mu]}{\delta \mu(t)} = -\frac{2}{T} \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)] = -\frac{2}{T} \left( \sum_{i=1}^{s} \hat{x}_i(t) - s\,\mu(t) \right). \qquad (17) $$

Setting $\frac{\delta \Delta[\mu]}{\delta \mu(t)} = 0$ gives

$$ \mu(t) = \frac{1}{s} \sum_{i=1}^{s} \hat{x}_i(t). \qquad (18) $$

With the spline coefficients x_{ijk} of each x_i(t) interpolated as in Eq. (11), the analytical solution for μ(t) in Eq. (18) is

$$ \mu_j(t) = \frac{1}{s} \sum_{i=1}^{s} \left( \left[ \sum_{k=0}^{3} x_{ijk} (t - t_j)^k \right] - a_{\min_i} \right), \qquad (19) $$

in each interval [t_j, t_{j+1}]. Eq. (18) applies to aligned profiles, while Eq. (19) can be applied to unaligned profiles.

V. EM CLUSTERING VIA MULTIPLE ALIGNMENT

In [14], we devised a clustering approach, k-MCMA, in which we combined the multiple alignment of Eq. (10) with the k-means clustering method, using a distance function based on the pairwise alignment of Eq. (3). In this paper, we use the EM clustering algorithm instead and combine it with the alignment methods.

EM is used for clustering in the context of mixture models [16]. The goal of EM clustering is to estimate the means and standard deviations of each cluster so as to maximize the likelihood of the observed data. In other words, the EM algorithm attempts to approximate the observed distribution of values by a mixture of different distributions in different clusters. A mixture of Gaussians is a set of k probability distributions, where each distribution represents a cluster. Starting from an initial approximation of the cluster parameters, the algorithm iteratively performs two steps: the expectation step computes the expected values of the cluster probabilities, and the maximization step computes the distribution parameters and their likelihood. It iterates until the log-likelihood reaches a (possibly local) maximum. The algorithm is similar to k-means in the sense that the centers of the natural clusters in the data are re-computed until a desired convergence is achieved.

We want to partition a set of s profiles, D = {x_1(t), . . . , x_s(t)}, into k disjoint clusters C_1, . . . , C_k, 1 ≤ k ≤ s, such that (i) C_i ≠ ∅, i = 1, . . . , k; (ii) C_1 ∪ · · · ∪ C_k = D; and (iii) C_i ∩ C_j = ∅, i, j = 1, . . . , k and i ≠ j. Let D be the complete-data space drawn independently from the mixture density

$$ \text{E-step:} \quad p(x \mid \theta) = \sum_{i=1}^{k} p(x \mid C_i, \theta_i) \, P(C_i), \qquad (20) $$

where the parameter θ = [θ_1, . . . , θ_k]^t is fixed but unknown, and P(C_i) is the known prior probability of class C_i. The aim is to maximize the likelihood

$$ \text{M-step:} \quad p(D \mid \theta) = \prod_{e=1}^{s} p(x_e \mid \theta). \qquad (21) $$

To maximize the likelihood function, the log-likelihood is used with normally distributed component densities, p(x_e | C_i, θ_i) ∼ N(μ_i, Σ_i), where θ_i = [μ_i, Σ_i]^t; μ_i and Σ_i are the means and covariances of the classes, respectively. Both steps iterate until the log-likelihood reaches a maximum. Thus, EM assigns profiles to multiple clusters, as in fuzzy clustering; unlike in k-means, however, each profile is finally assigned to the cluster with the maximum posterior probability.

Algorithm 1 EMMA: EM Clustering with Multiple Alignment
Require: Set of profiles, D = {x_1(t), . . . , x_s(t)}, and desired number of clusters, k
Ensure: Clusters C_{μ1}, . . . , C_{μk}
1: Apply natural cubic spline interpolation to x_i(t) ∈ D, for 1 ≤ i ≤ s (see Section II)
2: Multiple-align the transformed D to obtain D̂ = {x̂_1(t), . . . , x̂_s(t)}, using Eq. (10)
3: Initialize centroids μ_i(t), for 1 ≤ i ≤ k
4: Compute the initial log-likelihood (see Eq. (21))
5: repeat
6:    E-step: p(x | θ) = Σ_{i=1}^{k} p(x | C_{μi}, θ_i) P(C_{μi})
7:    Assign x̂_j(t) to the cluster C_{μi} with maximum log-likelihood, for 1 ≤ j ≤ s and 1 ≤ i ≤ k
8:    M-step: p(D̂ | θ) = Π_{e=1}^{s} p(x̂_e | θ)
9: until the log-likelihood reaches its maximum
10: return Clusters C_{μ1}, . . . , C_{μk}

In EMMA (see Algorithm 1), we first multiple-align the set of profiles D, using Eq. (10), and then cluster the multiple-aligned D̂ with EM. Recall that the process of Eq. (10) pairwise-aligns each profile with the t-axis. The k centroids can be initialized randomly in step (3) of EMMA, or by any other initialization approach. However, to obtain better clustering results with EMMA, it is necessary to start with near-optimal centroids; thus, we applied the k-MCMA algorithm of [14] to generate the k initial centroids in step (3) (we also tried other initialization methods, but with less success).
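To make the workflow concrete, the sketch below shows one possible realization of the EMMA loop; it is not the authors' implementation. It treats each aligned profile as a feature vector of its sampled expression values, uses scikit-learn's GaussianMixture as the EM engine, and seeds the Gaussian means with k-means centroids as a simple stand-in for the k-MCMA initialization described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def multiple_align(t, profiles):
    """Eq. (10): subtract each profile's mean value over [0, T]."""
    T = t[-1] - t[0]
    a_min = np.trapz(profiles, t, axis=1) / T
    return profiles - a_min[:, None]

def emma(t, profiles, k, random_state=0):
    """EM clustering of multiply aligned profiles (a sketch in the spirit of Algorithm 1)."""
    X_hat = multiple_align(t, np.asarray(profiles, dtype=float))
    # near-optimal initial means (here: k-means centroids instead of k-MCMA)
    init_means = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit(X_hat).cluster_centers_
    gmm = GaussianMixture(n_components=k, covariance_type='diag',
                          means_init=init_means, random_state=random_state)
    labels = gmm.fit_predict(X_hat)          # E- and M-steps iterate until convergence
    return labels, gmm

# toy data: three groups of profiles with distinct shapes and arbitrary vertical offsets
rng = np.random.default_rng(0)
t = np.linspace(0, 160, 17)                  # 17 time points, as in the yeast data set
shapes = [np.sin(2 * np.pi * t / 160), np.cos(2 * np.pi * t / 160), t / 160.0]
profiles = np.vstack([s + rng.normal(0, 0.1, t.size) + rng.uniform(-5, 5)
                      for s in shapes for _ in range(20)])
labels, _ = emma(t, profiles, k=3)
print(labels)                                # 60 cluster labels in {0, 1, 2}
```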

For distance-based clustering algorithms such as the k-means method, it has been shown that any arbitrary distance function (including the Euclidean distance) can be used in the cluster assignment step of the algorithm when the profiles are multiple-aligned first [14]. Moreover, Theorem 2 (whose proof can be found in [14]) shows that, for distance-based methods, there is no need to multiple-align C_{μi} in order to update its centroid μ_i(t).

Theorem 2 ([14]): Let μ(t) be the centroid of a cluster of m multiple-aligned profiles. Then μ̂(t) = μ(t).

By Theorem 2, there is also no need to multiple-align a cluster C_{μi} in step (6) of EMMA when updating its centroid μ_i(t). Likewise, any arbitrary distance function can be used in step (6) for computing the centroids. Thus, Theorem 2 makes EMMA run much faster than applying EM directly to the non-aligned data set D. If a profile x_i(t) is assigned to a cluster C_{μi} by EM on D, its shifted profile x̂_i(t) will be assigned to cluster C_{μi} by EMMA (EM on D̂). This follows from the fact that multiple alignment is order-preserving and distance-preserving, as pointed out in Section 4 of [14] (and briefly in Section IV). EMMA is not a distance-based clustering method; nevertheless, the quantities p(x|θ), p(x|C_{μi}, θ_i), P(C_{μi}), and p(D|θ) are also preserved when the distances are preserved. In other words, we can multiple-align once and obtain the same EM clustering results, provided that the means are initialized in the same manner. This opens the door to applications of multiple alignment methods to any (distance-based) clustering method.

VI. COMPUTATIONAL EXPERIMENTS

The pre-clustered budding yeast data set of [1] (available at http://genomics.stanford.edu/yeast_cell_cycle/cellcycle.html) contains time-series gene expression profiles from a complete characterization of mRNA transcript levels during the yeast cell cycle. These experiments measured the expression levels of the 6,220 yeast genes during the cell cycle at seventeen time points, from 0 to 160 minutes, at 10-minute intervals. From those gene profiles, 221 profiles were analyzed. We normalized each expression profile as in [1]; that is, we divided each transcript level of a profile by the mean value of that profile.

The data set contains five known clusters, called phases: Early G1 phase (32 genes), Late G1 phase (84 genes), S phase (46 genes), G2 phase (28 genes), and M phase (31 genes); the phases are visualized in Fig. 2(b). Setting k = 5, we applied EMMA to the data set to see whether EMMA is able to find these phases as accurately as possible. Once the clusters have been found, the next step in comparing the EMMA clustering with the pre-clustered data set of [1] is to label the clusters, where the labels are the "phases" of the pre-clustered data set.

To measure the performance of the EMMA method, we assigned each EMMA cluster to a yeast phase using the Hungarian algorithm [18]. The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. Our phase assignment problem and a complete discussion of its solution can be found in [14]. In Fig. 2, the cluster and the phase of each of the five selected pairs found by the Hungarian algorithm are shown at the same level; e.g., cluster C5 of EMMA in Fig. 2(a) is assigned to the Late G1 phase of [1] by our phase assignment approach, and hence the two appear at the same level in the figure. The same procedure was applied to the k-MCMA clusters of [14]; the results are shown in Fig. 2(c).

The five clusters found by EMMA are shown in Fig. 2(a) and those found by k-MCMA in Fig. 2(c), while the corresponding phases of [1] after the phase assignment are shown in Fig. 2(b). The horizontal axis represents the time points in minutes and the vertical axis represents the expression values. The dashed black lines are the cluster centroids learned by EMMA (Fig. 2(a)) and the known phase centroids of the yeast data (Fig. 2(b)). In the figure, each cluster and phase was multiple-aligned using Eq. (10) to enhance visualization.

Fig. 2. (a) EMMA clusters, (b) yeast phases [1], and (c) k-MCMA clusters, with centroids shown.

Fig. 2 clearly shows a high degree of similarity between the EMMA clusters and the yeast phases. Visually, each EMMA cluster on the left is very similar to exactly one of the yeast phases, shown at the same level on the right. Also, visually, it even "seems" that the EMMA clusters are more accurate than the yeast phases and the k-MCMA clusters, which suggests that EMMA may also correct manual phase assignment errors, if any.

An objective measure for comparing the EMMA and k-MCMA clusters with the yeast phases was computed as follows. For each EMMA or k-MCMA cluster C_{μc} (1 ≤ c ≤ k = 5), we find the shortest distance between each profile x_i(t), 1 ≤ i ≤ |C_{μc}|, and all five phase centroids ν_j(t), 1 ≤ j ≤ k = 5, using Eq. (16) of [14]. Profile x_i(t) is assigned the correct label (i.e., the phase label of P_{νj}) whenever x_i(t) ∈ P_{νj} and (C_{μc}, P_{νj}) ∈ S, the set of selected cluster-phase pairs; otherwise, x_i(t) is assigned the incorrect label, i.e., whenever cluster C_{μc} was not paired with phase P_{νj} by our pair-assignment method. The percentage of correct assignments over the 221 profiles was used as our measure of accuracy, resulting in 83.26% for EMMA and 79.64% for k-MCMA. That is,

$$ \text{Acc} = \frac{\sum_{c=1}^{k} \sum_{i=1}^{|C_{\mu_c}|} E\!\left(c, \arg\min_{1 \le j \le k} d(x_i, \nu_j)\right)}{221}, \qquad (22) $$

where E(a, b) returns 1 when a = b, and zero otherwise.
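As an illustration of this evaluation step, the sketch below pairs clusters with phases using SciPy's Hungarian solver (linear_sum_assignment) over centroid-to-centroid distances and then scores the assignments in the spirit of Eq. (22). It assumes profiles given as sampled vectors (already multiply aligned), integer cluster labels from the clustering step, and integer phase labels for the pre-clustered data; it simplifies the full procedure of [14].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def profile_distance(t, x, y):
    """Eq. (14): mean integrated squared error between two profiles."""
    T = t[-1] - t[0]
    return np.trapz((x - y) ** 2, t) / T

def accuracy(t, profiles, cluster_labels, phase_labels, k):
    """Pair clusters with phases (Hungarian algorithm on centroid distances), then
    count profiles whose nearest phase centroid is the phase paired with their cluster."""
    profiles = np.asarray(profiles, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    phase_labels = np.asarray(phase_labels)
    c_cent = np.vstack([profiles[cluster_labels == c].mean(axis=0) for c in range(k)])
    p_cent = np.vstack([profiles[phase_labels == p].mean(axis=0) for p in range(k)])
    cost = np.array([[profile_distance(t, c_cent[c], p_cent[p])
                      for p in range(k)] for c in range(k)])
    rows, cols = linear_sum_assignment(cost)        # cluster rows[c] paired with phase cols[c]
    pairing = dict(zip(rows, cols))
    correct = sum(
        int(np.argmin([profile_distance(t, x, p_cent[p]) for p in range(k)]) == pairing[c])
        for x, c in zip(profiles, cluster_labels))
    return correct / len(profiles)
```

With cluster labels such as those from the EMMA sketch above and the known phase labels of [1], this returns the fraction of profiles that receive the correct phase label.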

This criterion is reasonable, as EMMA (like k-MCMA) is an unsupervised learning approach that does not know the phases beforehand; hence the aim is to "discover" the phases. In [1], the five phases were determined using biological information, including genomic and phenotypic features observed in the yeast cell-cycle experiments. EMMA's accuracy of 83.26% is quite high considering that it is an unsupervised learning method.

VII. CONCLUSION

We proposed EMMA, a method that combines EM clustering and multiple alignment of gene expression profiles to cluster microarray time-series data. The profiles are represented as natural cubic spline functions, which makes it possible to compare profiles whose expression measurements were not taken at the same time intervals. Multiple alignment is based on minimizing the sum of integrated squared errors over a time-interval, defined on a set of profiles.


EMMA was able to find the five yeast cell-cycle phases of [1] with an accuracy of 83.26%, whereas the k-MCMA accuracy was 79.64%. EMMA can also be used to correct manual phase assignment errors. In the future, we plan to study other distance-based clustering approaches in combination with our multiple alignment method. It will also be interesting to study the effectiveness of such clustering methods on dose-response microarray data sets. Cluster validity indices based on multiple alignment are currently being investigated. We note that in real applications the data can be very noisy, and the use of cubic spline interpolation could then lead to some problems; the use of splines has the advantage of being tractable, however. We also plan to study interpolation methods that incorporate noise. Currently, we are also carrying out experiments with larger data sets.

Acknowledgements: This research has been partially funded by Canadian NSERC Grants #RGPIN228117-2006 and #RGPIN261360-2009, and CFI grant #9263.

REFERENCES

[1] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Lockhart, R. Davis: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2(1) (1998) 65–73
[2] Z. Bar-Joseph, G. Gerber, T. Jaakkola, D. Gifford, I. Simon: Continuous representations of time series gene expression data. Journal of Computational Biology 10(3-4) (2003)
[3] L. Brehelin: Clustering gene expression series with prior knowledge. Lecture Notes in Computer Science 3692 (2005) 27–28
[4] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. Brown, I. Herskowitz: The transcriptional program of sporulation in budding yeast. Science 282 (1998) 699–705
[5] S. Déjean, P. Martin, A. Baccini, P. Besse: Clustering time-series gene expression data using smoothing spline derivatives. EURASIP Journal on Bioinformatics and Systems Biology, Article ID 70561 (2007)
[6] J. Ernst, G. Nau, Z. Bar-Joseph: Clustering short time series gene expression data. Bioinformatics 21(suppl. 1) (2005) i159–i168
[7] M. Ramoni, P. Sebastiani, I. Kohane: Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA 99 (2002)
[8] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, G. Church: Systematic determination of genetic network architecture. Nature Genetics 22 (1999) 281–285
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, T. Golub: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96 (1999)
[10] L. Heyer, S. Kruglyak, S. Yooseph: Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9 (1999) 1106–1115
[11] C. Moller-Levet, F. Klawonn, K. Cho, O. Wolkenhauer: Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets and Systems 152(1) (2005) 49–66
[12] S. Peddada, E. Lobenhofer, L. Li, C. Afshari, C. Weinberg, D. Umbach: Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19(7) (2003) 834–841
[13] L. Rueda, A. Bari, A. Ngom: Clustering time-series gene expression data with unequal time intervals. Springer Transactions on Computational Systems Biology X, LNBI 5410 (2008) 100–123
[14] N. Subhani, A. Ngom, L. Rueda, C. Burden: Microarray time-series data clustering via multiple alignment of gene expression profiles. Springer Transactions on Pattern Recognition in Bioinformatics, accepted (2009)
[15] R. Xu, D. Wunsch: Clustering. Wiley-IEEE Press (2008)
[16] A. Dempster, N. Laird, D. Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39 (1977) 1–38
[17] V. Roth, J. Laub, M. Kawanabe, J. Buhmann: Optimal cluster preserving embedding of nonmetric proximity data. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12) (2003) 1540–1551
[18] H. Kuhn: The Hungarian method for the assignment problem. Naval Research Logistics 52(1) (2005) 7–21