Continuous Representations of Time-Series Gene Expression Data
Ziv Bar-Joseph, Georg Gerber, David K. Gifford
MIT Laboratory for Computer Science
J. Comput. Biol., 10, 341-356, 2003
Outline
• Splines
• Estimating Unobserved Expression Values and Time Points
• Model Based Clustering Algorithm for Temporal Data
• Aligning Temporal Data
• Results
Splines
• The word “spline” comes from the shipbuilding industry
Splines
• Splines are piecewise polynomials with boundary continuity and smoothness constraints.
• The typical way to represent a piecewise cubic curve:
$$y(t) = \sum_{i=1}^{n} C_i S_i(t), \qquad t_{\min} \le t \le t_{\max}$$
– $S_i(t)$ are polynomials, $C_i$ are the coefficients
– The $x_j$'s denote the break-points of the piecewise polynomials, $l = 1 \ldots 4$, $j = 0 \ldots n/4 - 1$
Splines
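– We have $n/4$ cubic polynomials:
$$p_j(t) = \sum_{l=1}^{4} C_{4j+l}\, t^{l-1}$$
– $n$ equations are required. At each internal break-point $x_j$:
$$p_j(x_j) = D_j$$
$$p_j(x_j) = p_{j-1}(x_j)$$
$$p'_j(x_j) = p'_{j-1}(x_j)$$
$$p''_j(x_j) = p''_{j-1}(x_j)$$
– At the end-points: $p_0(x_0) = D_0$ and $p_{n/4-1}(x_{n/4}) = D_{n/4}$
– Interpolating splines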
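A minimal sketch of how the piecewise representation above could be evaluated, assuming the coefficients are stored in one flat list so that piece $j$ owns $C_{4j+1} \ldots C_{4j+4}$. The function name `eval_piecewise_cubic` and the example coefficients are illustrative only and do not come from the paper.

```python
from bisect import bisect_right


def eval_piecewise_cubic(coeffs, breaks, t):
    """Evaluate the piece p_j(t) = sum_{l=1..4} C_{4j+l} * t**(l-1) that contains t.

    coeffs: flat list of n coefficients (C_1..C_n, stored 0-based).
    breaks: break-points x_0 < x_1 < ... < x_{n/4}.
    """
    pieces = len(coeffs) // 4
    if len(coeffs) != 4 * pieces or len(breaks) != pieces + 1:
        raise ValueError("need 4 coefficients per piece and pieces + 1 break-points")
    if not breaks[0] <= t <= breaks[-1]:
        raise ValueError("t lies outside [x_0, x_{n/4}]")
    # Piece j satisfies x_j <= t < x_{j+1}; the last piece also owns the right end-point.
    j = min(bisect_right(breaks, t) - 1, pieces - 1)
    c = coeffs[4 * j: 4 * j + 4]
    # Horner form of c[0] + c[1]*t + c[2]*t**2 + c[3]*t**3.
    return ((c[3] * t + c[2]) * t + c[1]) * t + c[0]


# Two pieces (n = 8): p_0(t) = t^3 on [0, 1] and p_1(t) = 3t^2 - 3t + 1 on [1, 2].
# Value, first and second derivative agree at the break-point x_1 = 1.
coeffs = [0.0, 0.0, 0.0, 1.0, 1.0, -3.0, 3.0, 0.0]
breaks = [0.0, 1.0, 2.0]
print(eval_piecewise_cubic(coeffs, breaks, 0.5))
print(eval_piecewise_cubic(coeffs, breaks, 1.5))
```

The example pieces were chosen by hand so that the continuity constraints hold at $x_1$; solving for such coefficients from the $D_j$ values is the linear system described on the slide and is not shown here.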
Splines
• B-spline
– In terms of a set of normalized basis functions
• The application of fitting curves to gene expression time-series data
– Convenient with the B-spline basis to obtain approximating or smoothing splines
– Fewer basis coefficients than there are observed data points
– Avoid overfitting
Splines
• The basis coefficients $C_i$:
– Interpreted geometrically as control points
– The vertices of a polygon that control the shape of the spline but are not interpolated by the curve
– The curve lies entirely within the convex hull of this controlling polygon.
– Each vertex exerts only a local influence on the curve.
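Splines
– Within any interval between knots $x_i$, $S(t)$ is a polynomial of degree $k-1$
– $S(t)$ has continuous derivatives of order $1, 2, \ldots, k-2$
– For a given value of $k$:
$$\sum_{i=1}^{n} b_{i,k}(t) = 1$$
– Within the valid range of $t$, $b_{i,k} \ge 0$, and each $b_{i,k}$ has exactly one maximum; except for $k = 1, 2$, every $b_{i,k}$ is a continuous, smooth curve.
[Figure: basis functions $b_{i,1}$, $b_{i,2}$, $b_{i,3}$ plotted as $y$ against $t$ over the knots $x_i, x_{i+1}, x_{i+2}, x_{i+3}$]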
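The properties listed above can be checked numerically. The sketch below assumes the standard Cox-de Boor recursion for the basis functions, which the slides do not spell out, and uses an illustrative uniform knot vector $(0, \ldots, 7)$. The name `bspline_basis` and the 0-based index `i` are choices made for this example.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for b_{i,k}(t); k is the order (degree k - 1), i is 0-based."""
    if k == 1:
        # Order 1: indicator function of the half-open knot span [x_i, x_{i+1}).
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d_left = knots[i + k - 1] - knots[i]
    if d_left > 0:
        left = (t - knots[i]) / d_left * bspline_basis(i, k - 1, t, knots)
    d_right = knots[i + k] - knots[i + 1]
    if d_right > 0:
        right = (knots[i + k] - t) / d_right * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


knots = [0, 1, 2, 3, 4, 5, 6, 7]   # illustrative uniform knot vector
k = 4                              # cubic B-splines
n = len(knots) - k                 # number of basis functions (4 here)

values = [bspline_basis(i, k, 3.5, knots) for i in range(n)]
total = sum(values)                # partition of unity inside the valid range
print(values, total)
```

Inside the valid range the four values are non-negative and add up to one, which is what makes the curve a convex combination of its control points.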
Splines
• A uniform knot vector is one in which the entries are evenly spaced
– i.e. $x = (0, 1, 2, 3, 4, 5, 6, 7)^T$
– The basis functions will be translates of each other, i.e. $b_{i,k}(t) = b_{i+1,k}(t+1) = b_{i-1,k}(t-1)$
– For a periodic cubic B-spline ($k=4$), the equation specifying the curve:
$$y(t) = \sum_{i=1}^{n} C_i\, b_{i,4}(t) \quad \text{for } x_4 \le t \le x_{n+1}$$
B-splines
– The B-spline will only be defined in the shaded region $3 \le t \le 4$
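For the uniform knot vector $(0, \ldots, 7)$ and $k = 4$ there are four basis functions, and all four overlap only on $3 \le t \le 4$. A minimal sketch of evaluating the curve there, assuming the well-known closed-form blending polynomials of the uniform cubic B-spline; the control point values and the function names are made up for illustration.

```python
def uniform_cubic_weights(u):
    """Values of the four cubic basis functions on one knot span, with u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )


def curve(t, control):
    """y(t) = sum_i C_i * b_{i,4}(t) for knots (0, ..., 7) and four control points."""
    if len(control) != 4:
        raise ValueError("this sketch handles exactly four control points")
    if not 3 <= t <= 4:
        raise ValueError("with knots 0..7 and k = 4 the curve is only defined on [3, 4]")
    return sum(c * w for c, w in zip(control, uniform_cubic_weights(t - 3)))


control = [1.0, 3.0, 2.0, 5.0]     # hypothetical control points C_1..C_4
y_mid = curve(3.5, control)
print(y_mid)
```

Because the weights are non-negative and sum to one, `y_mid` always lies between the smallest and largest control point, which is the convex hull property mentioned earlier.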
Estimating Unobserved Expression Values and Time Points
• To obtain a continuous time formulation, use cubic B-splines
– Getting the value of the splines at a set of control points in the time-series.
• Re-sample the curve to estimate expression values at any time-point.
• Spline functions are not fit for each gene individually
– due to noise and missing values
– this would lead to over-fitting
• Instead, constrain the spline coefficients of co-expressed genes to have the same covariance matrix
– Use other genes in the same class to estimate the missing values of a specific gene.
Estimating Unobserved Expression Values and Time Points
• A probabilistic model of time series expression data
– Assume a set of genes are grouped together
• Using prior biological knowledge
• Using a clustering algorithm
– For a gene $i$ in class $j$, $Y_i(t)$ is the observed value for $i$ at time $t$:
$$Y_i(t) = s(t)(\mu_j + \gamma_i) + \epsilon_i$$
Estimating Unobserved Expression Values and Time Points
– $\mu_j$: the average value of the spline coefficients for genes in class $j$
– $\gamma_i$: the gene specific variation coefficients, a normally distributed vector with mean zero and the class spline control points covariance matrix $\Gamma_j$
– $\epsilon_i$: random noise term that is normally distributed with mean 0 and variance $\sigma^2$
– $s(t)$: the vector of spline basis functions evaluated at time $t$, dimension $q$ by 1, where $q$ is the number of spline control points used
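To make the roles of $\mu_j$, $\gamma_i$ and $\epsilon_i$ concrete, here is a small simulation sketch of the model $Y_i(t) = s(t)(\mu_j + \gamma_i) + \epsilon_i$. It assumes a diagonal $\Gamma_j$ purely to stay within the standard library (the model itself allows a full covariance matrix), and the basis rows, class mean and standard deviations are invented for illustration.

```python
import random


def simulate_gene(basis_rows, mu_j, gamma_sd, sigma, rng):
    """Draw one gene's expression values from Y_i(t) = s(t)(mu_j + gamma_i) + eps_i.

    basis_rows: one length-q list s(t) per time point (assumed precomputed).
    gamma_sd:   per-coefficient standard deviations, i.e. a DIAGONAL Gamma_j
                (a simplification made only for this sketch).
    sigma:      standard deviation of the noise term eps_i.
    """
    q = len(mu_j)
    gamma_i = [rng.gauss(0.0, gamma_sd[c]) for c in range(q)]   # gene-specific variation
    coeffs = [mu_j[c] + gamma_i[c] for c in range(q)]           # this gene's spline coefficients
    values = []
    for s_t in basis_rows:
        mean = sum(s_t[c] * coeffs[c] for c in range(q))        # s(t)(mu_j + gamma_i)
        values.append(mean + rng.gauss(0.0, sigma))             # add noise eps_i
    return values, gamma_i


# Uniform cubic B-spline basis rows at three time points (q = 4).
basis_rows = [
    [1 / 6, 4 / 6, 1 / 6, 0.0],
    [0.125 / 6, 2.875 / 6, 2.875 / 6, 0.125 / 6],
    [0.0, 1 / 6, 4 / 6, 1 / 6],
]
mu_j = [0.0, 1.0, 2.0, 1.0]                                     # hypothetical class mean
values, gamma_i = simulate_gene(basis_rows, mu_j, [0.3] * 4, 0.1, random.Random(0))
print(values)
```

Genes in the same class share `mu_j` and `gamma_sd` but draw their own `gamma_i`, which is how information from co-expressed genes can constrain a sparsely sampled one.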
Estimating Unobserved Expression Values and Time Points
• To learn the parameters of this model ($\mu$, $\Gamma$, and $\sigma$)
– Use the observed values, and maximize the likelihood of the input data
– For the observed values of gene $i$:
$$Y_i = S_i(\mu_j + \gamma_i) + \epsilon_i$$
– $Y_i$: the vector of observed values for gene $i$
– $m_i$: the total number of expression values for gene $i$ in our dataset
– $S_i$: the spline basis functions evaluated at the times at which values for gene $i$ were observed, dimension $m_i$ by $q$
– We can resample gene $i$ at any time $t'$:
$$\hat{Y}_i(t') = s(t')(\mu_j + \gamma_i)$$
Estimating Unobserved Expression Values and Time Points
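– Decompose the probability:
• If the $\gamma$ values were observed, decompose the probability:
– The combined covariance matrix for a gene in class $j$, where $S$ is the spline basis matrix of that gene:
$$\Sigma_j = \sigma^2 I + S\, \Gamma_j\, S^T$$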
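A small sketch of how the combined covariance $\Sigma_j = \sigma^2 I + S \Gamma_j S^T$ could be assembled with plain lists. The matrices below are toy values chosen for easy hand-checking and the helper name `combined_covariance` is hypothetical.

```python
def combined_covariance(S, Gamma, sigma2):
    """Return Sigma = sigma2 * I + S Gamma S^T for an m-by-q S and a q-by-q Gamma."""
    m, q = len(S), len(Gamma)
    # SG = S Gamma, an m-by-q matrix.
    SG = [[sum(S[r][a] * Gamma[a][b] for a in range(q)) for b in range(q)]
          for r in range(m)]
    # Sigma = SG S^T, an m-by-m matrix.
    Sigma = [[sum(SG[r][b] * S[c][b] for b in range(q)) for c in range(m)]
             for r in range(m)]
    # Add the independent noise variance on the diagonal.
    for r in range(m):
        Sigma[r][r] += sigma2
    return Sigma


S = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]    # toy basis matrix, m = 3 observations, q = 2
Gamma = [[1.0, 0.5], [0.5, 2.0]]            # toy class covariance of the coefficients
Sigma = combined_covariance(S, Gamma, 0.1)
for row in Sigma:
    print(row)
```

The off-diagonal entries come only from $S \Gamma_j S^T$, so two observations of the same gene are correlated through the shared spline coefficients, while the noise contributes only to the diagonal.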
Estimating Unobserved Expression Values and Time Points
– Use EM
• E step: find the best estimate for $\gamma$ using the values we have for $\sigma^2$, $\Gamma$, and $\mu$.
• M step: maximize $p(Y, \gamma \mid \mu, \Gamma, \sigma^2)$.
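The slide does not give the E-step formula. One natural reading, assumed here, is the standard Gaussian conditional mean $\hat{\gamma}_i = (\sigma^2 \Gamma_j^{-1} + S_i^T S_i)^{-1} S_i^T (Y_i - S_i \mu_j)$. The sketch below implements that formula with a small hand-written linear solver; `solve`, `e_step_gamma` and the argument `Gamma_inv` (the inverse of $\Gamma_j$, supplied by the caller) are names introduced for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                       # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def e_step_gamma(Y, S, mu, Gamma_inv, sigma2):
    """Posterior mean of gamma_i given Y_i and the current mu_j, Gamma_j, sigma^2."""
    m, q = len(S), len(mu)
    # Residual Y_i - S_i mu_j.
    resid = [Y[r] - sum(S[r][c] * mu[c] for c in range(q)) for r in range(m)]
    # A = sigma^2 * Gamma^{-1} + S^T S.
    A = [[sigma2 * Gamma_inv[a][b] + sum(S[r][a] * S[r][b] for r in range(m))
          for b in range(q)] for a in range(q)]
    # rhs = S^T (Y_i - S_i mu_j).
    rhs = [sum(S[r][a] * resid[r] for r in range(m)) for a in range(q)]
    return solve(A, rhs)


# Scalar toy case: two observations of 3.0, prior variance 1, noise variance 1.
gamma_hat = e_step_gamma([3.0, 3.0], [[1.0], [1.0]], [0.0], [[1.0]], 1.0)
print(gamma_hat)
```

In the toy case the estimate is pulled from the raw residual 3.0 toward the prior mean of zero, which is the shrinkage that keeps per-gene fits from over-fitting noisy data.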
Model Based Clustering Algorithm for Temporal Data
• A new clustering algorithm that simultaneously solves the parameter estimation and class assignment problems
– First, we select a class $j$ for gene $i$ uniformly at random.
– Treat the class assignments as the missing variables
– EM algorithm
• E step: for each gene $i$ and class $j$, estimate the probability $P(j|i)$ that $i$ belongs to class $j$
• M step: maximize the parameters for each class $j$ with respect to the class probability $P(j|i)$ as computed in the E step
Model Based Clustering Algorithm for Temporal Data
– When the algorithm converges, assign gene $i$ to the class $j$ such that
$$P(j|i) = \max_{1 \le k \le c} P(k|i)$$
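A sketch of the class-probability computation and the final hard assignment. It assumes each class supplies a log-likelihood for the gene's observations and a class prior; neither quantity is written out on the slides, so `log_lik` and `priors` are placeholders for whatever the fitted model provides.

```python
import math


def class_posteriors(log_lik, priors):
    """P(j|i) proportional to priors[j] * exp(log_lik[j]), computed stably in log space."""
    scores = [math.log(p) + ll for p, ll in zip(priors, log_lik)]
    top = max(scores)                          # subtract the max to avoid underflow
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def assign_class(posteriors):
    """Index k of the class with the largest P(k|i), used once EM has converged."""
    return max(range(len(posteriors)), key=lambda k: posteriors[k])


log_lik = [-10.0, -8.0, -12.0]                 # hypothetical per-class log-likelihoods
priors = [1 / 3, 1 / 3, 1 / 3]                 # hypothetical uniform class priors
post = class_posteriors(log_lik, priors)
print(post, assign_class(post))
```

During EM the soft probabilities `post` weight each gene's contribution to every class in the M step; only after convergence is the arg-max used to report a single class per gene.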
Aligning Temporal Data
• Assume we have two sets of time-series gene expression profiles
– Splines for reference: $g_i^1(s)$, where $s_{\min} \le s \le s_{\max}$
– Splines in the set to be warped: $g_i^2(t)$, where $t_{\min} \le t \le t_{\max}$
• A mapping $T(s) = t$
– Linear transformation: $T(s) = (s - b)/a$
Aligning Temporal Data
• The error of the alignment:
– Averaged squared distance between the two curves:
$$e_i^2 = \frac{\int_{\alpha}^{\beta} \left[ g_i^2(T(s)) - g_i^1(s) \right]^2 ds}{\beta - \alpha}$$
– The limits $\alpha = \max\{s_{\min}, T^{-1}(t_{\min})\}$ and $\beta = \min\{s_{\max}, T^{-1}(t_{\max})\}$ take into account the degree of overlap between the curves.
• Find parameters $a$ and $b$ that minimize $e_i^2$
• The error for a set of genes $S$ of size $n$:
$$E_S = \sum_{i=1}^{n} w_i e_i^2$$
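A numerical sketch of the alignment error for one gene, assuming the linear warp $T(s) = (s - b)/a$ with $a > 0$, so that $T^{-1}(t) = at + b$, and approximating the integral with a composite trapezoid rule. The curves are passed in as ordinary Python callables, and the function name and the `steps` parameter are choices made for this example.

```python
import math


def alignment_error(g1, g2, a, b, s_range, t_range, steps=1000):
    """Averaged squared distance between g1(s) and g2(T(s)) over their overlap."""
    if a <= 0:
        raise ValueError("this sketch assumes a > 0 so that T is increasing")
    s_min, s_max = s_range
    t_min, t_max = t_range
    warp = lambda s: (s - b) / a               # T(s)
    unwarp = lambda t: a * t + b               # T^{-1}(t)
    alpha = max(s_min, unwarp(t_min))          # start of the overlap on the s axis
    beta = min(s_max, unwarp(t_max))           # end of the overlap on the s axis
    if beta <= alpha:
        raise ValueError("the warped curves do not overlap")
    h = (beta - alpha) / steps
    total = 0.0
    for k in range(steps + 1):
        s = alpha + k * h
        d = g2(warp(s)) - g1(s)
        weight = 0.5 if k in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * d * d
    return total * h / (beta - alpha)


# The second curve is the first one replayed twice as fast and shifted,
# so the warp a = 2, b = 1 should align them almost exactly.
reference = math.sin
warped = lambda t: math.sin(2 * t + 1)
good = alignment_error(reference, warped, 2.0, 1.0, (0.0, 6.0), (0.0, 2.0))
bad = alignment_error(reference, warped, 1.0, 0.0, (0.0, 6.0), (0.0, 2.0))
print(good, bad)
```

Minimizing this quantity over `a` and `b` would need an optimizer on top, which is left out; the sketch only shows how one candidate warp is scored.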
Aligning Temporal Data
– The $w_i$'s are weighting coefficients that sum to one
– One way of formulating this is to require that the product $w_i e_i^2$ be the same for all genes
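The two conditions above determine the weights: if $w_i e_i^2$ is the same constant for every gene and the weights sum to one, then $w_i$ must be proportional to $1/e_i^2$. A short sketch, with hypothetical error values:

```python
def equal_product_weights(errors_sq):
    """Weights w_i that sum to one and make w_i * e_i^2 identical for all genes."""
    if any(e <= 0 for e in errors_sq):
        raise ValueError("every e_i^2 must be positive")
    inverse = [1.0 / e for e in errors_sq]
    total = sum(inverse)
    return [v / total for v in inverse]


errors_sq = [1.0, 2.0, 4.0]                    # hypothetical e_i^2 values for three genes
weights = equal_product_weights(errors_sq)
products = [w * e for w, e in zip(weights, errors_sq)]
print(weights, products)
```

Genes that align poorly receive small weights, so a few badly matching genes cannot dominate the set error $E_S$.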
Results
• 800 genes in Saccharomyces cerevisiae with five groups
• Unobserved data estimation
Results
• Clustering
– Explore the effect of non-uniform sampling
• Two synthetic curves: