A comparative approach for gene network inference using time-series gene expression data

Guillaume Bourque* and David Sankoff

*Centre de Recherches Mathématiques, Université de Montréal

October 2003

DNA Microarrays

http://www.sri.com/pharmdisc/cancer_biology/laderoute.html

• Experiment design

• Noise reduction

• Normalization

• …

• Data analysis

Gene Expression Data

Beyond Clustering…

[Figure: time series → ? → gene network on genes x1–x4, with edges labeled + or −]

Comparative Framework

[Figure: gene networks for Species A, Species B and Species C]

Harder Problem?

• This new problem seems more ambitious and harder to solve.

• BUT, we will show that, for closely related species (samples), the comparative framework can actually improve the quality of the solutions recovered.

• The repetitive nature of the data can be used to sort through some of the noise and some of the ambiguity.

Outline

• Gene network model

• Single network inference

– Algorithm

– Simulations

• Multiple networks inference

– Algorithm

– Simulations

• Conclusions

Gene Network Model

• We use linear differential equations to model the gene trajectories (Chen et al. ‘99, D’haeseleer et al. ‘99):

dx_i(t)/dt = a_{i,0} + a_{i,1} x_1(t) + a_{i,2} x_2(t) + … + a_{i,m} x_m(t),   for genes i = 1, …, m

• Several reasons for that choice:

– Takes advantage of the continuous aspect of the data.

– Allows for feed-back loops.

– Low number of parameters implies that we are less likely to overfit the data.

– Sufficient to model complex interactions between genes.

Small Network Example

dx_1(t)/dt = 0.491 - 0.248 x_1(t)
dx_2(t)/dt = -0.473 x_3(t) + 0.374 x_4(t)
dx_3(t)/dt = -0.427 + 0.376 x_1(t) - 0.241 x_3(t)
dx_4(t)/dt = 0.435 x_1(t) - 0.315 x_3(t) - 0.437 x_4(t)
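To make the small example concrete, the sketch below integrates the four equations numerically. It is only an illustration: the initial state, the time horizon, the step size and the choice of a fourth-order Runge-Kutta integrator are assumptions, not details given in the slides.

```python
import numpy as np

# Constant terms a_{i,0} and interaction matrix (a_{i,j}) of the four-gene example.
# The constant for x1 is taken from the equations on this slide (0.491).
A0 = np.array([0.491, 0.0, -0.427, 0.0])
A = np.array([
    [-0.248, 0.0,  0.0,    0.0],    # dx1/dt
    [ 0.0,   0.0, -0.473,  0.374],  # dx2/dt
    [ 0.376, 0.0, -0.241,  0.0],    # dx3/dt
    [ 0.435, 0.0, -0.315, -0.437],  # dx4/dt
])

def derivative(x):
    """Right-hand side of the linear model: a_{i,0} + sum_j a_{i,j} x_j."""
    return A0 + A @ x

def simulate(x0, t_end, dt=0.1):
    """Integrate the model with classical RK4; returns an array of shape (steps + 1, genes)."""
    steps = int(round(t_end / dt))
    traj = np.empty((steps + 1, len(x0)))
    traj[0] = x0
    for k in range(steps):
        x = traj[k]
        k1 = derivative(x)
        k2 = derivative(x + 0.5 * dt * k1)
        k3 = derivative(x + 0.5 * dt * k2)
        k4 = derivative(x + dt * k3)
        traj[k + 1] = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return traj

# Hypothetical initial expression levels (all genes start at 1.0).
trajectory = simulate(np.ones(4), t_end=40.0)
```

Each row of `trajectory` is one time point and each column one gene, so a few sampled rows play the role of the time-series data used in the rest of the talk.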

[Figure: network diagram on genes x1–x4 with edges labeled + or −, annotated in turn to show an interaction coefficient and a constant coefficient]

Problem Revisited

      a_{i,0}   a_{i,1}   a_{i,2}   a_{i,3}   a_{i,4}
x1     .431     -.248      0         0         0
x2     0         0         0        -.473      .374
x3    -.427      .376      0        -.241      0
x4     0         .435      0        -.315     -.437

Given the time-series data, can we find the interaction coefficients?
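One naive way to attack this question is sketched below: estimate each derivative by finite differences and regress it on all expression levels at once. The function name `fit_linear_model`, the central-difference scheme and the evenly spaced sampling are assumptions made for illustration; this is not the stepwise procedure of the talk, and with few time points it is underdetermined.

```python
import numpy as np

def fit_linear_model(traj, dt):
    """Ordinary least-squares fit of dx_i/dt = a_{i,0} + sum_j a_{i,j} x_j.

    traj: array of shape (time points, genes), sampled every dt time units.
    Returns an array of shape (genes, genes + 1); row i holds
    a_{i,0}, a_{i,1}, ..., one row per gene.
    """
    # Central differences approximate the derivative at the interior time points.
    dxdt = (traj[2:] - traj[:-2]) / (2.0 * dt)
    # Design matrix: a column of ones for the constant term, then all gene levels.
    X = np.hstack([np.ones((len(dxdt), 1)), traj[1:-1]])
    coef, *_ = np.linalg.lstsq(X, dxdt, rcond=None)
    return coef.T
```

With many noiseless time points this fit returns values close to the true coefficients, but every coefficient comes out non-zero in general, which is why sparsity has to be enforced separately.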

Linear Differential Equations

• Even under the simplest linear model, there are m(m+1) unknown parameters to estimate:

– m(m-1) directional effects

– m self effects

– m constant effects

• Number of data points is mn and we typically have that n << m (few time-points).

• To avoid overfitting, extra constraints must be incorporated into the model such as:

– Smoothness of the equations (D’haeseleer et al. ‘99)

– Sparseness of the network, i.e. few non-null interaction coefficients (Yeung et al. ‘02, De Hoon et al. ‘02)
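The counting argument can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the number of time points in the usage line is an arbitrary illustration rather than a figure from the talk.

```python
def count_parameters(m):
    """Unknown parameters of the linear model for m genes."""
    directional = m * (m - 1)  # effect of each gene on every other gene
    self_effects = m           # effect of each gene on itself
    constants = m              # one constant term per gene
    return directional + self_effects + constants  # equals m * (m + 1)

def data_points(m, n):
    """Expression measurements available for m genes at n time points."""
    return m * n

# Illustration: 10 genes measured at a hypothetical 5 time points.
unknowns, observations = count_parameters(10), data_points(10, 5)
```

For these illustrative values there are 110 unknowns and only 50 measurements, so the unconstrained model cannot be determined from the data alone.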

Algorithm for Network Inference

• To recover the interaction coefficients, we use stepwise multiple linear regression.

• Why?

– This procedure finds coefficients that significantly improve the fit in the regression. It limits the number of non-zero coefficients (i.e. it finds sparse networks), a feature we were seeking.

– It is highly flexible and provides p-value scores which can be interpreted easily.

Partial F Test

• The procedure finds the interaction coefficients iteratively for each gene xi.

• A partial F test is constructed to compare the total squared error of the predicted gene trajectory with and without a specific subset of coefficients.

• If the p-value obtained from the test falls below a certain cutoff, the subset of coefficients is significant and is added; a subset already in the model is removed when its p-value rises above the removal cutoff.

• The procedure iterates until no more subsets of coefficients are either added or removed.
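The following is a minimal sketch of the idea for one target gene, not the authors' implementation. It is forward-only (no removal step), adds one regulator at a time, and thresholds the partial F statistic directly instead of converting it to a p-value; the names `partial_f`, `forward_select` and `f_enter` are hypothetical. A p-value could be obtained from the F distribution, for example with `scipy.stats.f.sf(F, q, n - p_full)`.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """Partial F statistic for the columns present in X_full but not in X_reduced."""
    n = len(y)
    p_full = X_full.shape[1]
    q = p_full - X_reduced.shape[1]  # number of coefficients being tested
    rss_reduced, rss_full = rss(X_reduced, y), rss(X_full, y)
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p_full))

def forward_select(X, y, f_enter=4.0):
    """Greedily add the column of X with the largest partial F above f_enter.

    X: candidate regulators, shape (samples, genes); y: derivative of the target gene.
    The constant term is always kept. Returns the sorted indices of selected columns.
    """
    n, p = X.shape
    ones = np.ones((n, 1))
    chosen = []
    while True:
        base = np.hstack([ones, X[:, chosen]])
        best_f, best_j = f_enter, None
        for j in range(p):
            if j in chosen:
                continue
            full = np.hstack([base, X[:, [j]]])
            if n - full.shape[1] <= 0:  # no residual degrees of freedom left
                continue
            f_stat = partial_f(base, full, y)
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_j is None:
            return sorted(chosen)
        chosen.append(best_j)
```

Running `forward_select` once per gene, with that gene's estimated derivative as `y`, yields one sparse row of the coefficient table at a time.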

Simulations

• Difficult to find coefficients that will produce realistic gene trajectories.

• We select coefficients such that the resulting trajectories satisfy 3 conditions:

– They are bounded

– The correlation of any pair is not too high

– They are not too stable

• We added Gaussian noise to model errors.
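The selection criteria and the noise step could be sketched as follows. The slides give no numerical thresholds, so `bound`, `max_corr` and `min_std` are hypothetical, and reading "not too stable" as "each trajectory varies by more than a minimum standard deviation" is an assumption.

```python
import numpy as np

def acceptable(traj, bound=10.0, max_corr=0.95, min_std=0.05):
    """Check the three conditions on simulated trajectories (time points x genes)."""
    # Condition 3 (assumed reading): no trajectory is essentially flat.
    if not np.all(traj.std(axis=0) > min_std):
        return False
    # Condition 1: all values stay within a fixed bound.
    if not np.all(np.abs(traj) < bound):
        return False
    # Condition 2: no pair of genes is too strongly correlated.
    corr = np.corrcoef(traj.T)
    off_diagonal = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
    return bool(np.all(off_diagonal < max_corr))

def add_noise(traj, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to every measurement."""
    return traj + rng.normal(0.0, sigma, size=traj.shape)
```

In a simulation loop, candidate coefficient sets would be drawn repeatedly, kept only when `acceptable` returns `True`, and then passed through `add_noise` before inference.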

Gaussian Noise

[Figure: regression procedure applied to the noisy trajectories]

Network Inference

      a_{i,0}   a_{i,1}   a_{i,2}   a_{i,3}   a_{i,4}
x1     .431     -.248      0         0         0
x2     0         0         0        -.473      .374
x3    -.427      .376      0        -.241      0
x4     0         .435      0        -.315     -.437

The procedure perfectly recovers this network with 4 genes and 10 interaction coefficients.

[Figure: recovered network on genes x1–x4 with edges labeled + or −]

10 Genes

The procedure also perfectly recovers this network with 10 genes and 22 interaction coefficients.

Multiple Networks

[Figure: gene networks for Species A, Species B and Species C]

Types of Problems

• Multiple networks related by a graph or a tree can arise from various situations:

– Different species

– Different developmental stages

– Different tissues

• The goal is now not only to maximize the fit (with as few interactions as possible) but also to minimize an evolutionary cost on the graph of the networks.

Evolutionary Cost

[Figure: tree of networks with a set of predicted regulators at each vertex: {1, 2}, {1, 2, 3}, {1}, {1, 3}, {1, 2, 3}; evolutionary events are marked]

Evolutionary cost = 3
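One way to read this cost is as the number of regulators gained or lost along the edges of the tree. The sketch below uses that reading. Both the counting rule and the tree topology are assumptions (the figure's layout is not recoverable from the text); the topology was chosen because it reproduces the stated cost of 3 with the five regulator sets shown. The vertex names are hypothetical.

```python
def evolutionary_cost(edges, regulators):
    """Count regulator gains and losses over all edges of the graph.

    edges: list of (vertex, vertex) pairs; regulators: dict mapping each
    vertex to its set of predicted regulators. Each regulator present at
    one end of an edge but not the other counts as one evolutionary event.
    """
    return sum(len(regulators[u] ^ regulators[v]) for u, v in edges)

# Assumed topology: the root has a leaf child and an inner child with two leaves.
regulators = {
    "root": {1, 2},
    "inner": {1, 2, 3},
    "leaf_a": {1},
    "leaf_b": {1, 3},
    "leaf_c": {1, 2, 3},
}
edges = [("root", "leaf_a"), ("root", "inner"), ("inner", "leaf_b"), ("inner", "leaf_c")]

cost = evolutionary_cost(edges, regulators)  # 1 + 1 + 1 + 0 = 3
```

Under this reading, moving a regulator in or out of the set at one vertex changes the cost only on the edges touching that vertex, which keeps the update cheap when many candidate moves are scored.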

Multiple Network Inference

• The stepwise regression algorithm is modified to add/remove subsets of regulators directly on the edges of the graph.

• Partial F tests are computed on the vertices affected by this change to evaluate the change in fit.

• The p-values obtained are then modified based on the change in evolutionary cost.

• The p-values are finally combined into a scoring function using a Kolmogorov-Smirnov Test.

• The algorithm iteratively adds/removes the best scoring move when above/below a certain threshold.

Simulation Example

Simulation Results

Conclusions

• The comparative framework actually simplifies the inference process, especially for instances of the problem with more genes, more noise or fewer time-points.

• The procedure could also be used for the revision of gene networks.

• Possibility of exploring different evolutionary models.

• We need to try the procedure on real data.
