Continuous Representations of Time Gene Expression Data Ziv Bar-Joseph, Georg Gerber, David K. Gifford MIT Laboratory for Computer Science J. Comput. Biol.,10,341-356,

Continuous Representations of Time Gene Expression Data

Ziv Bar-Joseph, Georg Gerber, David K. Gifford

MIT Laboratory for Computer Science

J. Comput. Biol.,10,341-356, 2003

Outline

• Splines
• Estimating Unobserved Expression Values and Time Points
• Model Based Clustering Algorithm for Temporal Data
• Aligning Temporal Data
• Results

Splines

• The word “spline” comes from the shipbuilding industry

Splines

• Splines are piecewise polynomials with boundary continuity and smoothness constraints.

• The typical way to represent a piecewise cubic curve:

$$y(t) = \sum_{i=1}^{n} C_i S_i(t), \quad t_{\min} \le t \le t_{\max}$$

– $S_i(t)$ are polynomials, $C_i$ are the coefficients
– $l = 1 \ldots 4$, $j = 0 \ldots n/4 - 1$; the $x_j$'s denote the break-points of the piecewise polynomials

Splines

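– We have $n/4$ cubic polynomials:

$$p_j(t) = \sum_{l=1}^{4} C_{4j+l}\, t^{l-1}$$

– $n$ equations are required
– Interpolating splines:

$$p_j(x_j) = D_j$$
$$p_j(x_j) = p_{j-1}(x_j)$$
$$p'_j(x_j) = p'_{j-1}(x_j)$$
$$p''_j(x_j) = p''_{j-1}(x_j)$$

$$p_0(x_0) = D_0 \quad \text{and} \quad p_{n/4-1}(x_{n/4}) = D_{n/4}$$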
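As a minimal sketch of the representation above, the following evaluates a piecewise cubic whose piece $j$ is $p_j(t) = \sum_{l=1}^{4} C_{4j+l} t^{l-1}$. The function name, the 0-based coefficient layout, and the two-piece example are assumptions made for illustration; the example coefficients were chosen so that value, first derivative, and second derivative agree at the break-point $x_1 = 1$.

```python
from bisect import bisect_right

def eval_piecewise_cubic(coeffs, breaks, t):
    """Evaluate p_j(t) = sum_{l=1..4} C_{4j+l} * t**(l-1) on the piece containing t.

    coeffs : flat list of n = 4 * (number of pieces) coefficients (0-based here)
    breaks : increasing break-points x_0 < x_1 < ... < x_{n/4}
    """
    pieces = len(breaks) - 1
    if len(coeffs) != 4 * pieces:
        raise ValueError("need four coefficients per piece")
    if not breaks[0] <= t <= breaks[-1]:
        raise ValueError("t outside [t_min, t_max]")
    # Piece j covers [x_j, x_{j+1}); the right end-point goes to the last piece.
    j = min(bisect_right(breaks, t) - 1, pieces - 1)
    c = coeffs[4 * j: 4 * j + 4]
    # Horner's rule for c[0] + c[1]*t + c[2]*t^2 + c[3]*t^3
    return ((c[3] * t + c[2]) * t + c[1]) * t + c[0]

# Hypothetical example: p_0(t) = t^3 on [0, 1], p_1(t) = 3t^2 - 3t + 1 on [1, 2]
coeffs = [0, 0, 0, 1,   1, -3, 3, 0]
breaks = [0.0, 1.0, 2.0]
print(eval_piecewise_cubic(coeffs, breaks, 0.5))
print(eval_piecewise_cubic(coeffs, breaks, 1.5))
```

In practice the coefficients would come from solving the $n$ interpolation and continuity equations rather than being written by hand.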

Splines

• B-spline
– Expressed in terms of a set of normalized basis functions
• Application to fitting curves to gene expression time-series data
– The B-spline basis makes it convenient to obtain approximating or smoothing splines
– Fewer basis coefficients than there are observed data points
– Avoids overfitting

Splines

• The basis coefficients $C_i$:
– Interpreted geometrically as control points
– The vertices of a polygon that control the shape of the spline but are not interpolated by the curve
– The curve lies entirely within the convex hull of this controlling polygon.
– Each vertex exerts only a local influence on the curve.

Splines

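• B-splines
– Within any knot interval $[x_i, x_{i+1}]$, $S(t)$ is a polynomial of degree $k-1$
– $S(t)$ has continuous derivatives of order $1, 2, \ldots, k-2$
– For a given value of $k$:

$$\sum_{i=1}^{n} b_{i,k}(t) = 1$$

– Over the valid range of $t$, $b_{i,k} \ge 0$, and each $b_{i,k}$ has exactly one maximum; except for $k = 1, 2$, every $b_{i,k}$ is a continuous, smooth curve.

[Figure: basis functions $b_{i,1}$, $b_{i,2}$, $b_{i,3}$ plotted as $y$ against $t$ over the knots $x_i, x_{i+1}, x_{i+2}, x_{i+3}$]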
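The properties of the basis functions $b_{i,k}$ (non-negativity and summing to one over the valid range of $t$) can be checked numerically. The sketch below uses the standard Cox-de Boor recursion, which is a common way to define B-spline basis functions but is not spelled out on the slides; the function name, the 0-based indexing, and the knot vector 0, 1, ..., 7 are assumptions chosen for illustration.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for b_{i,k}(t): order k, degree k-1, i is 0-based."""
    if k == 1:
        # Order-1 basis: indicator of the half-open knot interval
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    right = 0.0
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

knots = [0, 1, 2, 3, 4, 5, 6, 7]
k = 4                               # cubic B-spline
n = len(knots) - k                  # number of basis functions
values = [bspline_basis(i, k, 3.5, knots) for i in range(n)]
print(values, sum(values))          # all values non-negative, sum close to 1
```

With these knots the four cubic basis functions only all overlap on $3 \le t < 4$, so that is the range on which the sum equals one.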

Splines

• A uniform knot vector is one in which the entries are evenly spaced
– i.e. $x = (0, 1, 2, 3, 4, 5, 6, 7)^T$
– The basis functions will be translates of each other, i.e. $b_{i,k}(t) = b_{i+1,k}(t+1) = b_{i-1,k}(t-1)$
– For a periodic cubic B-spline ($k = 4$), the equation specifying the curve:

$$y(t) = \sum_{i=1}^{n} C_i\, b_{i,4}(t) \quad \text{for } x_4 \le t \le x_{n+1}$$

B-splines

– The B-spline will only be defined in the shaded region $3 \le t \le 4$
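For the uniform knot vector 0, 1, ..., 7 and $k = 4$, the curve on $3 \le t \le 4$ is a weighted sum of four control points, and the weights have a well-known closed form in the local parameter $u = t - 3$. The sketch below uses that closed form; the function names and the control-point values are hypothetical.

```python
def uniform_cubic_weights(u):
    """Values of the four overlapping uniform cubic B-spline basis functions
    at local parameter u in [0, 1] (u = t - 3 for the knots 0, 1, ..., 7)."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )

def curve(control, t):
    """y(t) = sum_i C_i b_{i,4}(t) on 3 <= t <= 4 for four control points."""
    if not 3 <= t <= 4:
        raise ValueError("with knots 0..7 the cubic curve is defined only on [3, 4]")
    return sum(c * w for c, w in zip(control, uniform_cubic_weights(t - 3)))

control = [1.0, 3.0, 2.0, 5.0]      # hypothetical control points C_1..C_4
print(curve(control, 3.0), curve(control, 3.5), curve(control, 4.0))
```

Because the weights are non-negative and sum to one, every value of the curve lies between the smallest and largest control point, which is the one-dimensional version of the convex hull property.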

Estimating Unobserved Expression Values and Time Points

• To obtain a continuous time formulation, use cubic B-splines
– Get the value of the splines at a set of control points in the time series.
• Re-sample the curve to estimate expression values at any time point.
• Spline functions are not fit for each gene individually
– Because of noise and missing values
– This would lead to over-fitting
• Instead, constrain the spline coefficients of co-expressed genes to have the same covariance matrix
– Use other genes in the same class to estimate the missing values of a specific gene.


• A probabilistic model of time series expression data
– Assume a set of genes are grouped together
• Using prior biological knowledge
• Or using a clustering algorithm
– For a gene $i$ in class $j$, $Y_i(t)$ is the observed value at time $t$:

$$Y_i(t) = s(t)\,(\mu_j + \gamma_i) + \epsilon_i$$

– $\mu_j$: the average value of the spline coefficients for genes in class $j$
– $\gamma_i$: the gene specific variation coefficients, a normally distributed vector with mean zero and the class spline control points covariance matrix $\Gamma_j$
– $\epsilon_i$: random noise term that is normally distributed with mean 0 and variance $\sigma^2$
– $q$: the number of spline control points used
– $s(t)$: the vector of spline basis functions evaluated at time $t$, dimension 1 by $q$
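To make the generative model $Y_i(t) = s(t)(\mu_j + \gamma_i) + \epsilon_i$ concrete, the sketch below draws one gene's observations from it. Everything here is illustrative: the function name and the example numbers are made up, the basis rows are supplied directly rather than computed from a B-spline basis, and $\Gamma_j$ is assumed diagonal purely to keep the sampling code short (the model itself allows a full covariance matrix).

```python
import random

def simulate_profile(basis_rows, mu, gamma_sd, sigma, rng):
    """Draw one gene's observations from Y_i(t) = s(t) (mu_j + gamma_i) + eps_i.

    basis_rows : list of s(t) vectors (each of length q), one per time point
    mu         : class mean spline coefficients mu_j (length q)
    gamma_sd   : per-coefficient standard deviations of gamma_i
                 (ASSUMPTION: Gamma_j is diagonal in this sketch)
    sigma      : standard deviation of the noise term eps_i
    rng        : a random.Random instance
    """
    q = len(mu)
    gamma = [rng.gauss(0.0, gamma_sd[l]) for l in range(q)]   # gene-specific offset
    coeffs = [mu[l] + gamma[l] for l in range(q)]
    return [
        sum(s[l] * coeffs[l] for l in range(q)) + rng.gauss(0.0, sigma)
        for s in basis_rows
    ]

# Hypothetical basis values at three time points, with q = 2
basis_rows = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
mu = [2.0, 4.0]
print(simulate_profile(basis_rows, mu, [0.5, 0.5], 0.1, random.Random(0)))
```

Setting `gamma_sd` and `sigma` to zero recovers the class mean curve $s(t)\mu_j$ exactly, which is a convenient sanity check.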


• To learn the parameters of this model ($\mu$, $\Gamma$, and $\sigma^2$)
– Use the observed values, and maximize the likelihood of the input data
– We can resample gene $i$ at any time $t'$:

$$\hat{Y}_i(t') = s(t')\,(\mu_j + \gamma_i)$$

– In vector form, for all observations of gene $i$:

$$Y_i = S_i\,(\mu_j + \gamma_i) + \epsilon_i$$

– $Y_i$: the vector of observed values for gene $i$
– $m_i$: the total number of expression values for gene $i$ in our dataset
– $S_i$: the spline basis functions evaluated at the times at which values for gene $i$ were observed, dimension $m_i$ by $q$


– Decompose the probability:
• If the $\gamma$ values were observed, decompose the probability:
– The combined covariance matrix for a gene in class $j$:

$$\Sigma_j = \sigma^2 I + S\,\Gamma_j S^T$$


– Use EM
• E step: find the best estimate for $\gamma$ using the values we have for $\sigma^2$, $\mu$, and $\Gamma$.
• M step: maximize $p(Y, \gamma \mid \mu, \Gamma, \sigma^2)$.
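As a worked sketch of what the E step computes, assume the model $Y_i = S_i(\mu_j + \gamma_i) + \epsilon_i$ with $\gamma_i \sim N(0, \Gamma_j)$ and $\epsilon_i \sim N(0, \sigma^2 I)$. The identities below are the standard results for a linear Gaussian model applied to that setup; they are not copied from the slides, and the authors' exact update equations may be written differently.

```latex
% Marginal distribution of the observations (gamma_i integrated out)
Y_i \sim N\!\left(S_i \mu_j,\; \sigma^2 I + S_i \Gamma_j S_i^{T}\right)

% Posterior mean of the gene-specific coefficients given the observations
\hat{\gamma}_i = E[\gamma_i \mid Y_i]
  = \left(\sigma^2 \Gamma_j^{-1} + S_i^{T} S_i\right)^{-1} S_i^{T} \left(Y_i - S_i \mu_j\right)
```

The second line shows how the shared covariance $\Gamma_j$ shrinks a gene's own fit toward the class mean: with few or noisy observations the $\sigma^2 \Gamma_j^{-1}$ term dominates and $\hat{\gamma}_i$ stays close to zero.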

Model Based Clustering Algorithm for Temporal Data

• A new clustering algorithm that simultaneously solves the parameter estimation and class assignment problems
– First, we select a class $j$ for gene $i$ uniformly at random.
– EM algorithm
• E step: treat the class assignments as the missing variables; estimate, for each gene $i$ and class $j$, the probability $P(j \mid i)$ that $i$ belongs to class $j$
• M step: maximize the parameters for each class with respect to the class probability $P(j \mid i)$ as computed in the E step

Model Based Clustering Algorithm for Temporal Data

– When the algorithm converges, assign gene $i$ to the class $j$ such that

$$P(j \mid i) = \max_{1 \le k \le c} P(k \mid i)$$
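The class probabilities $P(j \mid i)$ and the final hard assignment can be sketched as follows. This assumes the per-class log-likelihoods of a gene's profile have already been computed from the spline model, and that $P(j \mid i)$ is obtained by Bayes' rule with class priors; both the function names and the numbers are hypothetical.

```python
import math

def class_posteriors(log_lik, priors):
    """P(j | i) proportional to priors[j] * exp(log_lik[j]), computed stably."""
    scores = [math.log(p) + ll for p, ll in zip(priors, log_lik)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]   # subtract the max to avoid underflow
    total = sum(weights)
    return [w / total for w in weights]

def assign_class(posteriors):
    """Hard assignment after convergence: the (0-based) class k maximising P(k | i)."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

# Hypothetical log-likelihoods of one gene's profile under c = 3 classes
log_lik = [-12.0, -10.0, -15.0]
priors = [1 / 3, 1 / 3, 1 / 3]
post = class_posteriors(log_lik, priors)
print(post, assign_class(post))
```

During EM the soft probabilities in `post` weight each gene's contribution to every class in the M step; only after convergence is the argmax taken.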

Aligning Temporal Data

• Assume we have two sets of time-series gene expression profiles
– Splines for reference: $g_i^1(s)$, where $s_{\min} \le s \le s_{\max}$
– Splines in the set to be warped: $g_i^2(t)$, where $t_{\min} \le t \le t_{\max}$
• A mapping $T(s) = t$
– Linear transformation: $T(s) = (s - b)/a$


• The error of the alignment:
– The averaged squared distance between the two curves:

$$e_i^2 = \frac{\int_{\alpha}^{\beta} \left[ g_i^2(T(s)) - g_i^1(s) \right]^2 ds}{\beta - \alpha}$$

where $\alpha = \max\{s_{\min}, T^{-1}(t_{\min})\}$ and $\beta = \min\{s_{\max}, T^{-1}(t_{\max})\}$
– Takes into account the degree of overlap between the curves.
• Find parameters $a$ and $b$ that minimize $e_i^2$
• The error for a set of genes $S$ of size $n$:

$$E_S = \sum_{i=1}^{n} w_i e_i^2$$

– The $w_i$'s are weighting coefficients that sum to one
– One way of formulating this is to require that the product $w_i e_i^2$ be the same for all genes
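The alignment error can be approximated numerically. The sketch below assumes the linear warp $T(s) = (s - b)/a$ with $a > 0$, so that $T^{-1}(t) = at + b$, integrates the squared difference over the overlap with a trapezoid rule, and computes weights that make $w_i e_i^2$ equal across genes. The function names, the trapezoid rule, and the example curves are choices made for illustration, not the authors' implementation.

```python
def alignment_error(g1, g2, a, b, s_range, t_range, steps=1000):
    """Averaged squared distance e_i^2 between g1(s) and the warped g2(T(s)).

    T(s) = (s - b) / a maps reference time s to warped time t, so
    T^{-1}(t) = a * t + b.  Assumes a > 0.
    """
    s_min, s_max = s_range
    t_min, t_max = t_range
    alpha = max(s_min, a * t_min + b)      # start of the overlap, in s
    beta = min(s_max, a * t_max + b)       # end of the overlap, in s
    if beta <= alpha:
        raise ValueError("curves do not overlap under this warp")
    h = (beta - alpha) / steps

    def sq(s):
        # Clamp guards against rounding pushing T(s) just outside [t_min, t_max]
        t = min(max((s - b) / a, t_min), t_max)
        return (g2(t) - g1(s)) ** 2

    # Composite trapezoid rule over [alpha, beta]
    inner = sum(sq(alpha + m * h) for m in range(1, steps))
    integral = h * (0.5 * sq(alpha) + inner + 0.5 * sq(beta))
    return integral / (beta - alpha)

def equalizing_weights(sq_errors):
    """Weights w_i summing to one with w_i * e_i^2 equal for all genes.
    Requires every e_i^2 to be strictly positive."""
    inv = [1.0 / e for e in sq_errors]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical curves: g2 is g1 observed on a shifted and stretched time axis
g1 = lambda s: s ** 2
g2 = lambda t: (2 * t + 1) ** 2
print(alignment_error(g1, g2, 2.0, 1.0, (0.0, 4.0), (0.0, 1.5)))   # close to zero
print(equalizing_weights([1.0, 4.0]))
```

In a full alignment, `alignment_error` would be evaluated for every gene inside an optimizer over $a$ and $b$, with the weighted sum $E_S$ as the objective.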

Results

• 800 genes in Saccharomyces cerevisiae with five groups
• Unobserved data estimation

Results

• Clustering
– Explore the effect of non-uniform sampling
• Two synthetic curves:

