
Inference of Gene Regulatory Networks from Large Scale Gene Expression Data

Marina Takane*

School of Computer Science, McGill University, Montréal

A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Master of Science.

August 2003

* © Marina Takane 2003


Acknowledgements

Special thanks to Mike Hallett, Sebastien Loisel, Ted Perkins, Doina Precup, and Yoshio Takane.


Abstract

With the advent of the age of genomics, an increasing number of genes have been identified and their functions documented. However, not as much is known of specific regulatory relations among genes (e.g. gene A up-regulates gene B). At the same time, there is an increasing number of large-scale gene expression datasets, in which the mRNA transcript levels of tens of thousands of genes are measured at a number of time points, or under a number of different conditions. A number of studies have proposed to find gene regulatory networks from such datasets. Our method is a modification of the continuous-time neural network method of Wahde & Hertz [25, 26]. The genetic algorithm used to update weights was replaced with Levenberg-Marquardt optimization. We tested our method on artificial data as well as Spellman's yeast cell cycle data [22]. Results indicated that this method was able to detect salient regulatory relations between genes.


Résumé

With the arrival of genomics, an increasing number of genes have been identified and their functions documented. However, specific regulatory relations (for example, the regulatory effect of gene A on gene B) are less well known. At the same time, we have an ever larger quantity of large-scale gene expression data, in which the evolution of tens of thousands of genes is sampled at several time points or under different conditions. Studies have proposed to infer genetic regulatory networks from these data. Our method is a modification of the continuous-time neural networks of Wahde and Hertz [25, 26]. The genetic algorithm used to update the weights is replaced by the Levenberg-Marquardt method. We tested our method on artificial data as well as Spellman's yeast cell cycle data. Our results indicate that this method is able to detect salient regulatory relations between genes.


Contents

1 Introduction
2 Problem Definition
3 Modelling
4 Related Work
   4.1 Linear Models
   4.2 Recurrent Neural Networks
5 Method
   5.1 The Levenberg-Marquardt Method
   5.2 Training
6 Experiments on Artificial Data
7 Experiments on Biological Data
   7.1 The Spellman paper
   7.2 Experimental Setup
       7.2.1 Filtering
       7.2.2 Using only relevant weights
   7.3 Results
8 Discussion
References


1 Introduction

DNA is commonly considered to be the “blueprint of life”. DNA contains the instructions to make proteins, which play a vital role in all cellular processes. Since the discovery of the double-helix structure of DNA in 1953 by Watson and Crick, genomics research has become an increasingly important field. The primary objectives of the Human Genome Project, started in 1990, included sequencing the entire human genome and cataloguing all human genes. Initial sequencing was completed in 2000, and new genes are still being documented. Currently, genomics research offers one of the best hopes for understanding and developing treatments for diseases. However, much is still unknown. We do not yet have a complete understanding of even the simplest organisms. In particular, knowledge of how specific genes are regulated is still fragmentary.

The central dogma of molecular biology states that genes are transcribed into mRNA, which in turn is translated into proteins. The DNA is the “master copy” of the information; all cells in an organism contain more or less the same DNA. Cells with the same DNA (say, a neuron and a heart cell in the same organism) exhibit different behaviours because the genes are expressed differently. The simplest explanation for this is that the genes are regulated differently in different cell types. Thus, if we want to understand what makes a heart cell behave the way it does, we need to understand how genes are regulated. However, traditional wet-lab techniques for elucidating regulatory relations are tedious, and it can take many years to understand the regulation of a single gene.

Proteins are known as the “machinery of the cell”. There are many kinds of proteins serving different functions, such as hormone receptors, enzymes, and transcription factors. Transcription factors bind to cis-regulatory elements in the promoter region of a gene to regulate transcription. Since transcription factors are themselves the products of transcription and translation, this completes a feedback loop that allows genes to control each others' expression levels. When we say “gene A regulates gene B,” we mean that mRNA transcripts of gene A are translated into transcription factors that control transcription of gene B.

The past several years have seen a proliferation of large-scale gene expression datasets, in which the mRNA transcript levels of tens of thousands of genes are measured at a number of time points, or under a number of different conditions. These datasets contain too much data to make sense of by “eye-balling”. This being the case, computers are the natural tool for making sense of this data. Researchers have proposed various computational methods for inferring gene regulatory networks from such large-scale gene expression datasets. In this thesis, we provide an overview of some of these methods. In addition, we modify one of these methods, and test how well it works on artificial and biological data.

2 Problem Definition

As described above, genes are able to regulate one another's expression levels via proteins called transcription factors. We will call the set of genes that regulate transcription of a specific gene its regulators. The network of regulatory relations among genes throughout the genome is called a genetic network. We formalize the notion as follows:

Definition 1 A genetic network is a directed graph G = (V, A), where each gene is represented by a vertex v ∈ V, and an arc (u, v) ∈ A exists if gene u is a regulator of gene v.

Thus the regulators of v are its parents in G. Note that this definition of genetic network does not specify the nature of the regulatory relations among genes. A more precise definition could include weights on the arcs, or a circuit for each gene detailing exactly how the regulators govern transcription.
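To make Definition 1 concrete, a genetic network can be stored directly as an adjacency structure. The short sketch below is our own illustration (the gene names and arcs are invented), not an example from the thesis.

    # Sketch of Definition 1: a genetic network as a directed graph G = (V, A),
    # stored as a mapping from each gene to the genes it regulates.
    # Gene names and arcs are invented purely for illustration.
    genetic_network = {
        "geneA": ["geneB", "geneC"],   # geneA is a regulator of geneB and geneC
        "geneB": ["geneC"],
        "geneC": [],
    }

    def regulators(network, gene):
        """Return the parents of `gene` in G, i.e. its regulators."""
        return [u for u, targets in network.items() if gene in targets]

    print(regulators(genetic_network, "geneC"))   # ['geneA', 'geneB']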

Suppose we have measurements of concentration for gene products in a cell. We would like to infer regulatory relations among the genes from such data. Gene expression datasets often consist of a time series, or of expression levels measured under different conditions such as an added environmental stimulus, or over-expression or deletion of one or more genes. In either case, the number of measurements (usually dozens) is much less than the number of genes whose expression levels are being measured (tens of thousands). This is the problem of dimensionality: the number of measurements is much less than the number of variables. Hence, although computational methods are used to make sense of the large amounts of expression data being collected, there is ironically too little data to gain a complete picture of gene regulation. One would need at least as many measurements as the number of genes, and quite possibly several times more. Unfortunately, we are still far from reaching this point.

Genetic network inference is a hard problem. Even if we had all the necessary data, it is not all that clear how gene regulation manifests itself in gene expression. Clusters of similarly expressed genes may be co-regulated, i.e. regulated by the same mechanisms; however, this is not the same as identifying the regulating genes. It is possible to identify possible regulators through systematic deletion and over-expression experiments. If the expression levels for a set of genes go down consistently as a result of deleting a certain gene A, then we may hypothesize that A is an up-regulator for these genes. Similar conclusions may be drawn for a down-regulator. However, we would like to infer at least some of these relations without doing exhaustive deletion experiments. In fact, one motivation for modelling gene regulation computationally is to find possible regulatory relations which can then be verified or further explored in wet-lab experiments.

Another difficulty arises when attempting to verify the accuracy of a model. In many modelling problems, the solutions (or a subset thereof) are known; thus the accuracy of a model can be verified by comparing model predictions to the real solutions. However, for this problem, there are no well-characterized genetic networks for any organisms. Indeed, the set of regulators is not known for most genes; the Endo 16 gene is one notable exception [31, 32]. Since we don't have a known “solution” to compare model predictions to, it is difficult to say how well a particular model works. Often, the best one can do is to offer some biological evidence to support predictions made by the model.

3 Modelling

The quantity and quality of the data available is one of the first things to consider when modelling a gene regulatory system. Important factors to consider include the number of independent measurements, what variables in the system are measured, the time scale for time-series data, the accuracy of the data, and prior knowledge of the system under consideration. Variables measured may include mRNA levels, protein levels, concentrations of metabolites, as well as spatial considerations such as localization of cellular products [5]. In general, large scale data collection methods are not able to capture details such as localization within a cell. At least one gene regulatory network study [16, 19] has taken spatial protein concentrations into consideration, although the data collected were not large scale and their method of incorporating localization was highly specific to the system. At present, it is still difficult to obtain large scale measurements of protein concentrations. The most widespread method, 2D-PAGE, separates proteins into “spots” in a gel based on isoelectric point and molecular weight. Protein concentration is proportional to the intensity of each spot. However, protein spots must be identified individually, and there is a possibility that more than one protein species is present in a single spot. 2D gels are also difficult to reproduce. A number of databases housing 2D gel data exist, including the SWISS-2DPAGE database.

Far easier to measure on a large scale are mRNA levels. Microarrays exploit the hybridization property of DNA and RNA to measure mRNA transcript levels. Microarray chips are spotted with DNA probes representing genes, fragments of genes, or ESTs (Expressed Sequence Tags). When fluorescently tagged mRNA is washed over the chip, it hybridizes to the DNA probes, and the amount of fluorescence in each spot is proportional to the mRNA levels. Oligonucleotide arrays, such as Affymetrix GeneChips, have probes synthesized directly on the surface of glass chips. GeneChips have over 100,000 probes per chip representing tens of thousands of genes. Cho et al. [3] published a yeast cell cycle oligonucleotide microarray dataset. Another microarray technique uses strands of cDNA instead of oligonucleotides on glass chips. Probes cannot be placed as densely on the glass chips, but they can be tailored to specific needs since they can be made by individual researchers. A well-known yeast cell cycle dataset was published by Spellman et al. [22] using cDNA arrays.

From a biological standpoint, it would be better to use protein levels rather than mRNA levels, since proteins are the ones actually involved in cellular processes, while mRNA is simply an intermediate state between DNA and protein. However, as outlined above, mRNA levels are much easier to measure, and on a larger scale, than proteins. Several studies have examined how well mRNA levels correlate with protein levels [8, 9]. Results indicate that correlation is poor (r = 0.36 [9]) to moderate (r = 0.76 [8]). There are a number of reasons why this would be so. Both mRNA and proteins are subject to degradation, so it is possible to have a situation where protein levels are low even if mRNA levels are high. Also, splice variants of the same gene can give rise to different proteins, and proteins themselves are subject to post-translational modifications. Despite the reported disparity between protein and mRNA levels, mRNA levels provide an important source of data. The assumption is that although correlation may be poor, genes with high mRNA levels are nonetheless more likely to have higher protein levels, and vice versa. However, these things should be taken into consideration when constructing a model of gene regulation.

Noise is another factor. Microarray data is known to be quite noisy, and repeated measurements must be made to be confident of the data. Apart from the unavoidable machine and human errors that result in noise, there are several sources of error particular to microarrays. First, some mRNA strands may hybridize to the wrong probe on the chip, a process called cross-hybridization that can occur when probes are sufficiently similar. Also, some mRNA strands may self-hybridize, also leading to distortion of the expression levels of those genes. In addition, organs or parts of organs are often used to obtain the mRNA. This means that the resulting data are the sum of gene expression across all cells in the tissue sample. Ideally we would like to have the expression level of genes from just one cell, but this is not yet possible. Another source of noise comes from different samples being used for replicates, since different samples will have different expression levels for many genes.

In the case of time-series data, the size of the time scale should also be considered. For instance, it would be unreasonable to expect to model the details of transcription if gene expression levels are measured days apart. We will deal primarily with time-series data in this thesis.

All this leads to the conclusion that the data currently available are fairly limited in a number of ways. Although expression levels of many genes can be measured simultaneously, not enough of these measurements can be taken; this is the core of the dimensionality problem. The data obtained are noisy and do not necessarily reflect protein levels. Thus it would be unreasonable to expect to find the complete genetic regulatory network based on this data. In practice, we will be satisfied with finding a few significant regulatory relations between genes.

One approach to dealing with noisy data is to discretize it into a small number of states. Boolean networks take this approach, categorizing expression levels as either ‘on’ or ‘off’. Although this makes the problem more tractable, many regulatory mechanisms that rely on graded expression levels cannot be modelled. Intermediate expression levels between ‘on’ and ‘off’ may be important to model certain genes. In addition, the dynamics of Boolean networks may not be sufficiently complex to model gene regulation. Boolean differential models may avoid these problems.

We posit that gene expression is inherently a continuous object, and that current data is sufficient to treat it as such. Differential equations are a natural choice as they are commonly used to model biochemical systems. In particular, chemical reactions at equilibrium can be written as a system of differential equations. Modelling gene regulation at this level of biochemical detail would require knowing the concentrations of transcription factors and the rate constants for the binding reaction for each gene. Small systems such as the lysis-lysogeny switch in λ-phage can be modelled in this way [2]; however, at present it would be impossible on a genome-wide scale. In any case, transcription factors have not been fully characterized for most genes, and neither have their binding strengths. Modelling at this detailed level will become more important once we have more knowledge, and could help elucidate detailed mechanisms of gene regulation. At present, however, we are still at the stage of finding regulatory relations among genes. For this purpose, we will use a coarse-grained level of modelling [19, 5, 25]. A fundamental assumption of this approach is that regulatory relations among genes can be described in a meaningful way without taking into account the biochemical details of regulation. One consequence of coarse-graining is that if we have a system where gene A regulates gene B and gene B regulates gene C, the model may predict the indirect regulatory relation between gene A and C in addition to the direct ones. With current data, it is unlikely that such indirect relations could be distinguished from the direct ones.

One additional assumption we make is that stochastic effects in the cell can be safely ignored. Again, this is in the interest of simplicity of the model, and because the data are insufficient to support modelling these effects.

4 Related Work

A large number of methods to infer regulatory relations among genes can be found in the literature. These methods originate from diverse fields such as statistics, machine learning, and data-mining. The methods proposed include Boolean networks [1, 14], linear models [4, 23], neural networks [16, 25, 5], and Bayesian networks [7]. Here, we shall restrict our discussion to so-called additive regulation models, in which the change in each variable V_i is assumed to be a function of the weighted sum of all variables [5]:

    \Delta V_i = g\Big( \sum_j w_{ij} V_j + b_i \Big).    (1)

The variables V_i represent gene expression levels and are (non-negative) continuous functions of time. The weights w_ij describe the regulatory relations between genes. A significantly positive w_ij indicates up-regulation of gene i by gene j, while a significantly negative w_ij indicates down-regulation. Note that the above equation does not take into account combined effects of two or more genes on gene i, although these could be added quite easily as follows:

    \Delta V_i = g\Big( b_i + \sum_j w_{ij} V_j + \sum_{j,k} w_{ijk} V_j V_k + \cdots \Big).    (2)

However, the currently available data are not sufficient to determine the weights w_ijk of higher order terms. One advantage of this model is that extra terms representing higher order effects or external inputs can be added if they are known to play an important role in the system being analysed [4, 27, 25]. Additive regulation models include linear models such as those described by Someren et al. [23] and D'haeseleer et al. [4, 5], as well as neural network models such as in Mjolsness et al. [16], Weaver et al. [27], D'haeseleer [5], and Wahde & Hertz [25, 26].
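As a concrete illustration of Equation (1), the sketch below (ours; the weights, biases, and expression values are arbitrary) computes one additive-regulation update for a three-gene network, taking g to be the sigmoid transfer function used later in the text.

    # Sketch of the additive regulation model (1): the change in each gene is a
    # function g of a weighted sum of all expression levels plus a bias term.
    # All numbers below are arbitrary illustrations.
    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))        # sigmoid transfer function

    W = np.array([[0.0,  2.0, -1.0],           # w_ij: effect of gene j on gene i
                  [1.5,  0.0,  0.0],
                  [0.0, -2.0,  0.0]])
    b = np.array([-0.5, 0.0, 0.5])             # bias terms b_i
    V = np.array([0.2, 0.7, 0.4])              # current expression levels V_j

    delta_V = g(W @ V + b)                     # Equation (1)
    print(delta_V)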

4.1 Linear Models

Gene regulation is generally considered to be a non-linear problem; however, linear models can provide basic insight into how some genes are regulated. The simplicity of the linear model is at once its greatest asset and worst limitation: although the parameters can be computed quickly and efficiently using linear algebra, it cannot adequately model important non-linear gene interactions. However, D'haeseleer [5] maintains that it is indispensable as an exploratory tool. As well, Someren et al. [23] state that the available data do not support reliable estimation of parameters for more complex models.

The linear model typically has the following form as a difference equation:

    V_i(t + \Delta t) = \sum_j w_{ij} V_j(t) + b_i.    (3)

The corresponding differential equation formulation as \Delta t \to 0 is

    \frac{dV_i}{dt} = \sum_j w_{ij} V_j(t) + b_i.    (4)

The matrix-vector equation equivalent to (3) is

    \vec{V}(t + \Delta t) = W \vec{V}(t) + \vec{b},    (5)

where \vec{V}(t + \Delta t), \vec{V}(t), and \vec{b} are vectorized versions of V_i(t + \Delta t), V_i(t), and b_i, respectively. The matrix W is N × N, where N is the number of genes in the system. We solve the system for W using linear regression: an equation (5) exists for each pair of consecutive time points. We can combine all these equations into a single matrix equation as follows:

    \mathbf{V}^* = \mathbf{W} \mathbf{V},    (6)

where the \vec{b} vector is incorporated into the \mathbf{W} matrix, and \mathbf{V} is augmented with a row of ones at the bottom. Matrices \mathbf{V}^*, \mathbf{V}, and \mathbf{W} have dimensions N × (T − 1), (N + 1) × (T − 1), and N × (N + 1) respectively, where T is the number of time points:

    \mathbf{V}^* = \begin{pmatrix} V_1(2) & \cdots & V_1(T) \\ \vdots & & \vdots \\ V_N(2) & \cdots & V_N(T) \end{pmatrix}, \qquad
    \mathbf{V} = \begin{pmatrix} V_1(1) & \cdots & V_1(T-1) \\ \vdots & & \vdots \\ V_N(1) & \cdots & V_N(T-1) \\ 1 & \cdots & 1 \end{pmatrix}, \qquad
    \mathbf{W} = [\, W \;\; \vec{b} \,].

The matrix equation (6) can be solved exactly if \mathbf{V} is invertible. In general, \mathbf{V} is not square, and therefore not invertible. However, we can find the least squares solution to (6) [28, 5]:

    \mathbf{W}_{lsq} = \mathbf{V}^* \mathbf{V}^T (\mathbf{V} \mathbf{V}^T)^{-1},    (7)

provided that \mathbf{V} \mathbf{V}^T is invertible.

Thus, given an N × T matrix V of time-series gene expression data, it is straightforward to solve for the parameters W and \vec{b}. The only difficulty arises in ensuring that the system (6) is not under-determined. This comes back to the problem of dimensionality: we need the number of parameters to be less than the number of independent data points, i.e. N^2 + N < N(T − 1), or N < T − 2. Someren et al. [23] and D'haeseleer [4, 5] handle this problem in different ways. Someren et al. cluster genes with similar expression profiles before fitting the linear model. Their method makes sense since having two or more genes with very similar expression profiles causes redundancy in the weight matrix, resulting in many possible equivalent weight matrices. Clustering these genes and combining their weights removes this source of ambiguity. The authors use hierarchical clustering with Pearson correlation as their distance metric. One disadvantage is that cluster boundaries, and consequently the number of clusters, must be determined empirically. Their method was tested on artificial data, as well as yeast cell cycle data from Spellman et al. [22]. Results indicated that some interesting interactions were captured. However, some difficulties were noted in interpreting the results. For instance, two signals that had large weight values for all genes could naively be interpreted to play an important role as regulators for many genes. Closer examination showed that these two signals effectively cancelled one another out: since they summed almost to a constant, their combined effect was similar to the constant bias term b_i.
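The least-squares solution (7) is a one-line computation in practice. The sketch below is ours, uses random numbers in place of a real expression matrix, assumes T is large enough relative to N, and uses a pseudo-inverse in place of the explicit inverse of V V^T.

    # Sketch of fitting the linear model (3) by least squares, Equation (7):
    # W_lsq = V* V^T (V V^T)^{-1}, with the bias vector b folded into W via a
    # row of ones appended to V. Random data stand in for real measurements.
    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 4, 12                                # genes, time points
    X = rng.random((N, T))                      # expression matrix, rows = genes

    V_star = X[:, 1:]                           # V*: time points 2..T
    V = np.vstack([X[:, :-1],                   # V : time points 1..T-1
                   np.ones((1, T - 1))])        # row of ones absorbs b

    W_aug = V_star @ V.T @ np.linalg.pinv(V @ V.T)   # least-squares solution
    W, b = W_aug[:, :-1], W_aug[:, -1]          # split augmented matrix into W and b
    print(W.shape, b.shape)                     # (4, 4) (4,)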

D'haeseleer's approach [4, 5] is to interpolate continuously between time points finely enough that the number of time points exceeds the number of genes. Although this solves the immediate problem of determining the parameters, it highlights one of the difficulties of working with time-series data: the data points are not independent. The interpolation process adds time points to the dataset, but these points are not independent. Only new time course data starting from different initial conditions is considered independent. In addition, there is a possibility that important dynamics are missed in between time points, particularly where time intervals are large. In this case, interpolating may lead to mistaken conclusions, or to false confidence in such results. D'haeseleer applies his method to rat central nervous system data [29] consisting of three time courses. Having multiple independent time courses allows for better parameter estimation [25, 26]. To be able to model all three time courses in the same framework, two extra terms were added to the modelling equation (3), representing tissue type and kainate level. Thus there were an additional 2N parameters to estimate related to these extra terms. The author remarks that most weights turn out to be close to zero, indicating that the gene regulation matrix W is sparse. A small set of genes were identified as regulators, and Monte Carlo analysis was performed to find which parameters were robust to noise.

4.2 Recurrent Neural Networks

Differential equation models of dynamical systems commonly have the following form:

{rate of change in variable} = {rate of production} − {rate of decay} (8)

The right-hand side of Equation (4) seen previously can be considered as the rate of production. In its current form, however, V_i(t) can take negative values or become unbounded, properties that are biologically implausible. Applying a non-linear transfer function such as a sigmoid function g(z) = (1 + e^{−z})^{−1} solves these problems, and has a similar effect to a biological dose-response curve [5]. A decay term can be added to the equation as well, to get

    \frac{d\vec{V}}{dt} = g(W\vec{V} + \vec{b}) - D\vec{V},    (9)

where D is an N × N diagonal matrix of non-negative decay parameters. Note that this equation is in the form outlined in (8). Equation (9) is essentially equivalent to the one used for continuous-valued neural networks [11]:

    \tau \frac{d\vec{V}}{dt} = g(W\vec{V} + \vec{b}) - \vec{V},    (10)

where τ is a diagonal matrix of time constants τ_i > 0. The τ matrix is invertible since τ_i ≠ 0, and τ^{−1} is itself a diagonal matrix with non-zero entries. Thus, by left-multiplying both sides of (10) by τ^{−1}, we can isolate d\vec{V}/dt on the left-hand side. Equation (10) is equivalent to Equation (9), since the effect of multiplying g(W\vec{V} + \vec{b}) by τ^{−1} can be absorbed into the W and \vec{b} terms inside the sigmoid, and the effect on the −\vec{V} term can be considered the same as the decay parameters D.

The sigmoid function g(z) = (1 + e^{−z})^{−1} is a monotonically increasing function defined on z ∈ (−∞, +∞). The effect of g() is to squash z to a value between 0 and 1. Suppose τ_i > 0 for all i, and let z_i = \sum_j w_{ij} V_j + b_i. For a single gene, equation (10) can be written as

    \frac{dV_i}{dt} = \frac{1}{\tau_i} \big( g(z_i) - V_i \big).    (11)

As stated above, g(z_i) ∈ (0, 1), so (g(z_i) − V_i) ∈ (−V_i, 1 − V_i). Supposing the initial value of V_i is in (0, 1), we have (g(z_i) − V_i) ∈ (−1, 1), or dV_i/dt ∈ (−1/τ_i, 1/τ_i). In other words, if V_i(t) ∈ (0, 1), dV_i/dt can be either positive or negative, and thus V_i may be increasing or decreasing. However, at the lower boundary V_i = 0, we have (g(z_i) − V_i) ∈ (0, 1), or dV_i/dt > 0. This means that if we ever have V_i(t) = 0 for some t, then dV_i/dt|_t > 0, so V_i can never go below 0. Likewise, if V_i = 1, we have (g(z_i) − V_i) ∈ (−1, 0), or dV_i/dt < 0. Hence V_i must be decreasing at this point, so V_i never goes above 1. We have shown that V_i is bounded between 0 and 1, provided that its initial value is between 0 and 1, and that τ_i > 0. A negative τ_i allows the possibility of an unbounded V_i. For this reason, τ_i should always be positive. A negative τ_i would also change the decay term into a production term. In what follows, we always assume that expression values have all been normalised between 0 and 1.
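The boundedness argument above can be checked numerically. The sketch below is ours: it integrates Equation (10) with scipy's general-purpose ODE solver for an arbitrary three-gene network (the weights, biases, and time constants are invented) and confirms that a trajectory started in (0, 1) stays in [0, 1].

    # Sketch: integrate the continuous-time recurrent network of Equation (10),
    #   tau_i dV_i/dt = g(sum_j w_ij V_j + b_i) - V_i,
    # for an arbitrary 3-gene network, to illustrate boundedness in [0, 1].
    import numpy as np
    from scipy.integrate import solve_ivp

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = np.array([[0.0,  5.0, -5.0],
                  [4.0,  0.0,  0.0],
                  [0.0, -6.0,  0.0]])           # invented weights
    b = np.array([0.5, -1.0, 1.0])
    tau = np.array([2.0, 1.0, 3.0])             # all positive, as required

    def rhs(t, V):
        return (g(W @ V + b) - V) / tau         # Equation (10) rearranged for dV/dt

    sol = solve_ivp(rhs, (0.0, 50.0), [0.6, 0.1, 0.1])
    print(sol.y.min(), sol.y.max())             # remains within [0, 1]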

Hence, the solutions V_i of Equation (10) demonstrate several properties desirable for modelling gene expression time series. For τ_i > 0, V_i is bounded, non-negative and always takes values in the interval [0, 1]. In addition, for certain values of the parameters w_ij, b_i, and τ_i, V_i is known to converge to a stable limit cycle [10]. This is desirable when modelling processes such as the cell cycle that involve genes with periodic expression levels. Another possibility is that the expression level V_i converges to a stable fixed point,

    V_i = g\Big( \sum_j w_{ij} V_j + b_i \Big).    (12)

Chaotic trajectories are also possible [10].

Mjolsness et al. [16] introduce a theoretical framework for modelling biological phenomena that occur in a cell. They describe how their framework can be applied to gene regulation in the blastoderm of Drosophila melanogaster. In the blastoderm stage of development, the regulatory network of the genes (N = 5) involved in eve stripe formation can effectively be modelled in isolation. Protein concentrations are used instead of gene expression levels. The small number of genes involved makes it possible to collect datasets of protein levels. Their model is a hybrid system of a continuous time neural network with discrete time state transitions intended to model phenomena such as mitosis, axon sprouting, and growth. The continuous portion is modelled by the following differential equation:

    \frac{d\vec{V}^{(k)}}{dt} = R\, g(W \vec{V}^{(k)} + \vec{b}) + D(n)\big[ (\vec{V}^{(k-1)} - \vec{V}^{(k)}) + (\vec{V}^{(k+1)} - \vec{V}^{(k)}) \big] - \lambda \vec{V}^{(k)}.    (13)

This is similar to Equation (9). One notable difference is that more than one cell nucleus is modelled at a time, and therefore we have a system of equations, indexed by k, one for each nucleus. Diffusion of gene products between neighbouring nuclei is also taken into account, as the second term on the right-hand side. The N × N diagonal matrix D(n) contains the diffusion parameter for each gene, which depends on the number n of cell divisions that have already taken place. The nuclei are lined up along one axis of the embryo, and thus can be indexed by a single number k, i.e. k − 1 and k + 1 indicate the two nuclei adjacent to k. The N × N diagonal matrix R contains the maximum rates of synthesis for each gene as the diagonal entries. The regulation matrix W is assumed to be the same for all nuclei since they contain the same genetic material. The discrete portion of the model represents mitosis. Mitosis is modelled by a periodic suspension of the differential equation and re-initialization of the variables \vec{V}^{(k)}. The hybrid nature of this dynamical system makes it difficult to use some of the more traditional weight update methods used in the neural network community [5].

This model of the Drosophila blastoderm is put into practice by Reinitz & Sharp [19]. Euler's method is used to solve the differential equation (13); error is defined as follows:

    E = \sum_{i,k,t} \big( V_i^{(k)}(t) - X_i^{(k)}(t) \big)^2 + \text{(penalty terms)},    (14)

where X_i^{(k)} are the data points, V_i^{(k)} are the approximations to the data given by the model, and the penalty terms prevent parameters from exceeding pre-defined boundaries. Note that the error surface is not continuous due to the periodic re-initialization of variables and the penalty terms. Thus, gradient descent methods cannot be applied readily. Reinitz & Sharp use simulated annealing to optimize parameter values. Optimization proceeds as follows: parameters are initialized to random values. At each iteration of the optimization, Equation (13) is solved to find protein concentrations V^{(k)}, and the error E is computed based on these values. One or more parameter values are changed according to some move generation strategy, and the new error is computed based on the updated parameter values. If the new error is smaller than the previous error, then the new parameter values are retained. Otherwise, if the new error is greater than the previous one, the new parameter values are retained with some probability. The greater the increase in error, the less likely it is that the “uphill” step will be accepted. Also, the probability of taking such a step decreases over the course of training. Allowing these uphill steps helps to prevent the optimization from getting stuck in local minima; however, it makes convergence very slow. In this case, it took up to a week for one run to converge (in 1994). To be confident that the algorithm has found a global minimum, multiple runs should yield the same solution.

Expression of the gene eve is uniformly distributed spatially at the beginning of cleavage cycle 14, but is segmented into five distinct stripes by the end of the cycle. These are called eve stripes. The genes involved in regulating this process are Kr, kni, gt, hb, and eve, as well as the protein bcd. The dataset was acquired by visual examination of fluorescently tagged photomicrographs of Drosophila embryos. Protein levels were obtained by estimating staining intensity. Data were obtained for 32 nuclei at 3 to 4 time points for both wild-type and mutant eve− strains. The authors reported that parameter values from the simulated annealing runs were nearly identical, and predicted expression levels were within 1% of the data. Qualitative features of model predictions were consistent with the data as well. Five distinct stripes of eve expression were formed in the correct order of emergence. The authors analyze the particulars of the results in depth and discuss agreement with other Drosophila blastoderm studies. One definite limiting factor in this study was the qualitative nature of the data, as well as the small number of time points. However, they achieve considerable success modelling eve stripe formation in the Drosophila embryo.

Weaver et al. [27] use a discrete time version of Equation (9) without a decay term. This is equivalent to Equation (3), only with the sigmoid transfer function applied to the right-hand side:

    V_i(t + 1) = g\Big( \sum_j w_{ij} V_j(t) + b_i \Big).    (15)

The method for reverse-engineering the weight matrix W and bias vector \vec{b} given time-series data of expression levels is similar to that for linear models. The only difference is that the inverse of the sigmoid must be applied as the first step:

    g^{-1}(y) = -\ln\Big( \frac{1}{y} - 1 \Big).

The rest of the process of solving for W and \vec{b} is the same as outlined in Section 4.1. The regulation matrix is expected to be sparse, so the authors experiment with zeroing out weights with small absolute value. Euclidean error (sum squared error) is recomputed each time a weight is deleted, and the weights that give the smallest error are retained. The method was tested on artificial data, but not on biological data.
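As a sketch of this reverse-engineering step (our illustration, not Weaver et al.'s code), the inverse sigmoid is applied to the targets and the same least-squares solve as in Section 4.1 is then reused; the data are random placeholders clipped away from 0 and 1 so that g^{-1} is defined.

    # Sketch of the Weaver et al. idea: apply g^{-1}(y) = -ln(1/y - 1) to the
    # targets, then solve the linear system exactly as in Section 4.1.
    import numpy as np

    def g_inv(y):
        return -np.log(1.0 / y - 1.0)

    rng = np.random.default_rng(1)
    N, T = 4, 12
    X = np.clip(rng.random((N, T)), 1e-3, 1 - 1e-3)   # keep values strictly in (0, 1)

    Y_star = g_inv(X[:, 1:])                    # transformed targets g^{-1}(V_i(t+1))
    V = np.vstack([X[:, :-1], np.ones((1, T - 1))])
    W_aug = Y_star @ V.T @ np.linalg.pinv(V @ V.T)    # same least-squares solve as before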

D'haeseleer [5] uses the following equation (equivalent to Equation (10)) to model gene expression levels:

    \frac{d\vec{V}}{dt} = A\, g(W\vec{V} + \vec{b}) - D\vec{V}.

Weights are updated using a continuous version of Back-Propagation Through Time (BPTT) [18], and experiments on artificial data were performed. Error is defined as

    E = \sum_{i,t} \big( V_i(t) - X_i(t) \big)^2,    (16)

where X_i(t) are the data points, and V_i(t) are the network approximations. Like Weaver et al., D'haeseleer makes the reasonable assumption that the weight matrix is sparse. Methods such as weight decay, weight elimination, and pruning were applied to weed out unimportant weights. Weight decay adds a penalty term to the error function (16) that penalizes large weights:

    E^* = E + \alpha \sum_{ij} w_{ij}^2,    (17)

where α is the decay rate. This prevents weights from becoming larger than they need to be, and tends to penalize larger weights more than smaller ones. Weight elimination also adds a penalty term to the error function, but tends to penalize smaller weights as opposed to larger weights:

    E^* = E + \gamma \sum_{ij} \frac{w_{ij}^2}{w_{th}^2 + w_{ij}^2},    (18)

where γ is the weight elimination rate, and w_th is a weight threshold. Pruning computes a saliency measure for each weight to gauge how important it is, then removes connections with low saliency scores. Early stopping was carried out to prevent overfitting. Experiments on artificial data indicated that weight elimination performed better than weight decay, and that a combination of the two worked well. Experiments on biological data were not performed.
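For concreteness, the penalties (17) and (18) can be written as small functions; the sketch below is ours, and the values of α, γ, and w_th are arbitrary placeholders that would normally be tuned.

    # Sketch of the weight decay (17) and weight elimination (18) penalty terms
    # added to the sum-squared error E. Hyper-parameter values are illustrative.
    import numpy as np

    def weight_decay_penalty(W, alpha=0.01):
        # alpha * sum_ij w_ij^2 : penalizes large weights the most
        return alpha * np.sum(W ** 2)

    def weight_elimination_penalty(W, gamma=0.01, w_th=0.1):
        # gamma * sum_ij w_ij^2 / (w_th^2 + w_ij^2) : saturates for large weights,
        # so it mainly pushes small weights toward zero
        return gamma * np.sum(W ** 2 / (w_th ** 2 + W ** 2))

    W = np.array([[0.0, 2.0], [-0.3, 0.05]])    # example weight matrix
    E = 0.42                                    # stand-in for the data error (16)
    E_star = E + weight_decay_penalty(W) + weight_elimination_penalty(W)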

Wahde & Hertz [25, 26] model gene expression with Equation (10), and use a genetic algorithm to update weights. The genetic algorithm works roughly as follows: each member of a population is a candidate solution and encodes the parameters w_ij, b_i, and τ_i. For each candidate solution, the differential equations (10) are solved for V_i, using the first time point V_i(t = 0) from the data as the initial value. The fitness of the solution is calculated as

    f = \frac{1}{1 + \frac{1}{K} \sum_k \delta_k^2},    (19)

where k enumerates the K = N(T − 1) data points (i.e. all data points except at t = 0) used to compute fitness, δ_k = (V_k − X_k)/σ, V_k is obtained from the above integration, X_k is the corresponding data point, and σ is a tolerance parameter. Like Someren et al. [23], Wahde & Hertz use clusters of genes as the variables V_i, rather than individual genes. Their method was first applied to artificial data. Unsurprisingly, the authors found that having data from more than one time course was more important than having many points in each time course. In addition, they ran experiments on the rat CNS data [26]. Taking advantage of the clusters identified by Wen et al. in their study [29], regulatory relations are inferred among these clusters. The authors remark that the relations are biologically plausible.

Wessels et al. [30, 24] published a study comparing several genetic network inference methods, including those of Weaver et al. [27], Someren et al. [23], D'haeseleer et al. [4], and Wahde & Hertz [25]. They based their evaluation on a number of criteria, including robustness to noise, inferential power, and predictive power. One of the conclusions they drew was that the Wahde & Hertz model required much more time to converge compared to the other methods, often taking over 15 minutes while the other methods took less than 5 seconds each. This is hardly surprising since the optimization routine in Wahde & Hertz is an iterative method, whereas the others perform a linear regression. On some datasets, the Wahde & Hertz model failed to converge.

We claim that a weight update method faster than a genetic algorithm should be used to speed convergence. Various forms of gradient descent are the most commonly used weight update methods for neural networks. Rather than making random mutations as in the genetic algorithm, it would be better to use information about the gradient of the error surface to go in the right direction. This is the basis of our modifications, described in the next section.

5 Method

We follow basically the same method as Wahde & Hertz [25]. Equation (10) is used to model the dynamics of gene expression over time. The main difference is that the genetic algorithm is replaced with a second-order optimization algorithm called the Levenberg-Marquardt method. To do this, we need to define an error function that is continuous and differentiable with respect to the parameters. Suppose we have an N × T matrix X of gene expression levels for N genes at T time points, normalized between 0 and 1. If the network predictions for these data points are given by V_i(t), then the mean squared error is

    E = \frac{1}{NT} \sum_{i,t} \big( V(i, t) - X(i, t) \big)^2, \quad i = 1, \ldots, N, \; t = 1, \ldots, T.    (20)

The partial derivatives of the error function [18, 10] with respect to the parameters w_ij and τ_i are

    \frac{\partial E}{\partial w_{ij}} = \frac{2}{NT\tau_i} \sum_t Y_i(t)\, g'(h_i(t))\, V_j(t)\, dt,    (21)

where h_i(t) = \sum_j w_{ij} V_j(t), and Y_i(t) is the solution to the dynamical equation

    \frac{dY_i}{dt} = \frac{1}{\tau_i} Y_i - \sum_j \frac{1}{\tau_j} w_{ji}\, g'(h_j)\, Y_j - E_i(t),    (22)

with the boundary condition Y_i(T) = 0 for all i, where E_i(t) = V_i(t) − X_i(t). The derivative of E with respect to the time constants τ_i is given by

    \frac{\partial E}{\partial \tau_i} = -\frac{2}{NT\tau_i} \sum_t Y_i(t) \frac{dV_i}{dt}\, dt.    (23)

Note that the derivatives with respect to the b_i's are similar to those for the w_ij's.

Regular gradient descent is a first-order method. Derivatives of the error with respect to each parameter are computed, then parameters are updated in the direction of the negative gradient,

    w(k + 1) = w(k) - \eta \frac{\partial E}{\partial w},    (24)

where η is the learning rate. A second-order method uses information about the curvature of the error surface. However, the exact second-order derivatives (the Hessian) of the error are difficult to compute, so many methods use approximations. The Levenberg-Marquardt method is one of them. It is a commonly used method for solving medium-scale non-linear least squares problems (i.e. problems that have a few hundred parameters or less). It is known to perform well in practice, faster than regular gradient descent or gradient descent with momentum.

5.1 The Levenberg-Marquardt Method

The global extremum of a quadratic function can be found exactly in a single step. The error function in general is not quadratic; however, it resembles a quadratic in the neighbourhood of a local minimum. This can be exploited to jump closer to the local minimum when this quadratic approximation applies. This is the idea behind Levenberg-Marquardt. But first we will go through the details of the quadratic approximation. The derivation here is adapted from [20]. A note about the notation for this section: matrices are written as capitalized bold letters, as in A, vectors as lower-case bold letters, as in b, and scalars as plain lower-case letters, as in c.

Let f(w) be the network approximations of the data points y. Previously, we have always considered the network approximations as functions of time, but here we are interested in how they vary as a function of the parameters w. Note that we include all the parameters W, b, and τ in the vector w. Suppose that f(w) is a linear function of w,

    f(w) = k + w^T J,    (25)

where w is an n × 1 vector, J is n × m, and k is 1 × m. This gives a 1 × m vector f(w). Note that m = NT is the number of data points, and n = N^2 + 2N is the number of parameters. Then the error is

    E(w) = \frac{1}{m} (f(w) - y)(f(w) - y)^T.    (26)

This error is the same as that given in Equation (20), only in vector form. This error function is quadratic in w:

    E(w) = \frac{1}{m} (w^T J + k - y)(w^T J + k - y)^T
         = \frac{1}{m} \big[ w^T J J^T w + 2(k - y) J^T w + (k - y)(k - y)^T \big]
         = \frac{1}{m} \big( w^T C w + 2 b w + a \big),    (27)

where C = J J^T, b = (k − y) J^T, and a = (k − y)(k − y)^T. Then the gradient of E can be found by taking the linear terms of

    E(w + \Delta w) - E(w) = \frac{1}{m} \big[ (w + \Delta w)^T C (w + \Delta w) + 2 b (w + \Delta w) + a \big] - \frac{1}{m} \big[ w^T C w + 2 b w + a \big]
                           = \frac{1}{m} \big( 2 \Delta w^T C w + 2 \Delta w^T b + \Delta w^T C \Delta w \big).

The first two terms are linear in Δw, so the gradient is

    \nabla E = \frac{2}{m} (C w + b).    (29)

Solving ∇E = 0 to find the weights at the minimum point, we get

    w_{min} = -C^{-1} b.    (30)

Note that this result is exactly analogous to the scalar case.

Suppose we are in the neighbourhood of a local minimum. We can take a Taylor expansion of f(w) about the current weights w_0:

    f(w) = f(w_0) + (w - w_0)^T \nabla f(w_0).    (31)

This has exactly the same form as Equation (25). So we can write down the error and error gradient in terms of f(w_0), (w − w_0), and ∇f(w_0) directly, based on (27) and (29):

    E(w) = \frac{1}{m} \big[ (w - w_0)^T \nabla f(w_0) \nabla f(w_0)^T (w - w_0) + 2 (f(w_0) - y) \nabla f(w_0)^T (w - w_0) + (f(w_0) - y)(f(w_0) - y)^T \big],    (32)

    \nabla E = \frac{2}{m} \big[ \nabla f(w_0) \nabla f(w_0)^T (w - w_0) + \nabla f(w_0) (f(w_0) - y)^T \big]
             = \frac{2}{m} \big[ H (w - w_0) + d \big],    (33)

where d = \nabla f(w_0) (f(w_0) - y)^T, and H = \nabla f(w_0) \nabla f(w_0)^T. Setting ∇E = 0 and solving for w, we get

    w_{min} = w_0 - H^{-1} d.    (34)


Note that d is exactly the error gradient. H is an approximation to the Hessian, the matrix of second-order partial derivatives. H is exact if and only if f is linear.

Now, this quadratic approximation of the error will not be valid at all points, only near local minima. Even then, the update rule

    w_{i+1} = w_i - H^{-1} d    (35)

will not give the exact weights at the minimum. However, convergence using this update rule will be significantly faster than using regular gradient descent (24). Levenberg's idea [12] was to blend the two methods, so that (35) is used when we are close to a minimum, and (24) is used the rest of the time:

    w_{i+1} = w_i - (H + \lambda I)^{-1} d,    (36)

where λ is the blending factor and I is the identity matrix. As λ → 0, (36) approaches the quadratic update rule (35). For large values of λ, H + λI is dominated by the λ term, and the weight update rule approaches

    w_{i+1} = w_i - \frac{1}{\lambda} d,    (37)

which is regular gradient descent. The value of λ is updated as follows: suppose weight values are updated by the above rule (36), and a new error is computed. If the new error is bigger than the previous error, then the quadratic approximation of the error is not working well. The weights are reset to their previous values, and λ is increased by some significant factor (e.g. 10) so the next update will be closer to pure gradient descent. Otherwise, if the new error is smaller than the previous error, we assume that the quadratic approximation is working well, and that we are approaching a minimum. The new weights are retained and λ is decreased by some significant factor (e.g. 10) to make more use of the second-order information in H.

Marquardt further improved this method to take curvature information into account in the step size for pure gradient descent [15]. One of the classic problems with gradient descent is the “error valley” problem, in which there is a long, narrow valley in the error surface. The component of the gradient along the base of the valley is small compared to that along the walls of the valley, so gradient descent will tend to take big steps down the walls of the valley, but inch along its base. The problem is that the distance that needs to be travelled along the base is much longer than the distance down the walls, so convergence is slow. Marquardt proposed to alleviate this problem by taking bigger steps in the directions where the gradient is smaller. The identity matrix in (36) is replaced with the diagonal of the approximate Hessian H:

    w_{i+1} = w_i - (H + \lambda\, \mathrm{diag}(H))^{-1} d.    (38)

Levenberg-Marquardt is a heuristic method that works well in practice on medium-scale problems. For large-scale problems with thousands of parameters, the matrix inversion becomes too expensive, so other methods such as conjugate gradients are used. To avoid the pitfalls of coding a matrix inversion routine, we used the lsqnonlin function in Matlab. This is a non-linear least squares optimization function that uses Levenberg-Marquardt as its medium-scale routine.
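The λ-adaptation heuristic described above can be sketched in a few lines. The code below is our generic illustration of the update (38) on a tiny one-parameter curve-fitting problem; it is not the internals of Matlab's lsqnonlin or the training code used in this thesis.

    # Sketch of a generic Levenberg-Marquardt loop, Equations (36)-(38), on a toy
    # problem: fit y = exp(-a t) for a. Illustrates the lambda-adaptation rule.
    import numpy as np

    t = np.linspace(0.0, 5.0, 20)
    y = np.exp(-0.7 * t)                        # synthetic data, true a = 0.7

    def residuals(w):
        return np.exp(-w[0] * t) - y            # f(w) - y

    def jacobian(w):
        return (-t * np.exp(-w[0] * t)).reshape(-1, 1)

    w, lam = np.array([0.1]), 1e-3
    for _ in range(100):
        r, J = residuals(w), jacobian(w)
        H = J.T @ J                             # approximate Hessian
        d = J.T @ r                             # error gradient direction
        step = np.linalg.solve(H + lam * np.diag(np.diag(H)), d)   # Equation (38)
        w_new = w - step
        if np.sum(residuals(w_new) ** 2) < np.sum(r ** 2):
            w, lam = w_new, lam / 10.0          # success: trust the quadratic model more
        else:
            lam *= 10.0                         # failure: move toward gradient descent
    print(w)                                    # close to 0.7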

5.2 Training

Training proceeded as follows: Let X be the N × T matrix of time-series expression data, where X(i, t) ∈ (0, 1). Parameters w_ij and b_i were initialized to random values in the interval (−0.5, 0.5). Note that although initial values of w_ij and b_i were between −0.5 and 0.5, no maximum and minimum values are imposed on these parameters during training. Parameters τ_i were initialized to 0.5. The τ_i's were restricted to positive values for the entire duration of training. Given parameter values for w_ij, b_i, and τ_i, and initial values \vec{V}(t = 0), Equation (10) was solved for the expression levels \vec{V} as functions of t. We used the Matlab routine ode15s, an adaptive time-step solver designed particularly for stiff systems, to solve the differential equation. Evaluating V_i(t) at the same time points as the data gives network approximations of the data points. Using these points, the error function (20) was evaluated, and the partial derivatives of the error function with respect to each parameter (21), (23) were computed. Parameters were then updated by Levenberg-Marquardt, and the error recomputed. This process was repeated until there was no significant reduction of error.

Dimensionality requires that the number of data points exceed the number of parameters. We have N^2 + 2N = N(N + 2) parameters to optimize, and N(T − 1) useful data points (the point t = 0 is excluded since it is used as the initial value in the integration of the above equations). In other words, we need N < T − 3. Note, however, that this is a bare minimum constraint, and we would actually prefer to have many times more data points than parameters.
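A condensed sketch of this training loop using open-source stand-ins for the Matlab routines: scipy's solve_ivp (with the stiff-capable LSODA method) in place of ode15s, and least_squares with method='lm' in place of lsqnonlin. The data matrix here is a random placeholder, and the positivity constraint on τ is handled by optimizing log τ, which is a simplification of ours rather than the procedure described above.

    # Sketch of the training procedure with scipy stand-ins for ode15s/lsqnonlin.
    # X is a random placeholder for a normalized N x T expression matrix, and
    # tau > 0 is enforced by optimizing log(tau) (our simplification).
    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    N, T = 3, 10
    times = np.arange(T, dtype=float)
    X = rng.random((N, T))                      # placeholder expression data in (0, 1)

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def unpack(p):
        W = p[:N * N].reshape(N, N)
        b = p[N * N:N * N + N]
        tau = np.exp(p[N * N + N:])             # log-parameterization keeps tau positive
        return W, b, tau

    def residuals(p):
        W, b, tau = unpack(p)
        rhs = lambda t, V: (g(W @ V + b) - V) / tau          # Equation (10)
        sol = solve_ivp(rhs, (times[0], times[-1]), X[:, 0],
                        t_eval=times, method='LSODA')
        return (sol.y - X).ravel()              # residuals at the measured time points

    p0 = np.concatenate([rng.uniform(-0.5, 0.5, N * N + N),  # W, b in (-0.5, 0.5)
                         np.log(0.5) * np.ones(N)])          # tau initialized to 0.5
    fit = least_squares(residuals, p0, method='lm', max_nfev=200)
    W_fit, b_fit, tau_fit = unpack(fit.x)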

6 Experiments on Artificial Data

We tested our method on the same artificial data used by Wahde & Hertz [25]. The parameters used to generate the data are shown in Table 1, in conjunction with initial values

    \vec{V}(0) = (0.6, 0.1, 0.1, 0.1)^T.

                  W                     b       τ
       20.0    5.0     0.0     0.0     0.0    10.0
       25.0   -5.0   -17.0     0.0    -5.0     5.0
        0.0   10.0    20.0   -20.0    -5.0     5.0
        0.0    0.0    10.0    -5.0    -5.0    15.0

Table 1: Parameter values used to generate artificial data.

Expression curves were generated for four genes at ten time points. Note that this fulfills the data requirements outlined above. Two hundred trials were run on a dual-processor AMD 2.1 GHz machine.

The mean error over 200 trials was 2.45 × 10^{-5} (SD = 1.37 × 10^{-4}), and ranged between 3.95 × 10^{-6} and 1.96 × 10^{-3}. The mean time for convergence was 26.6 seconds (SD = 15.2). The maximum learning time was 72.7 seconds (1 minute 12.7 seconds), and the minimum was 7.2 seconds. The results of a typical trial are shown in Figure 1. Note that the network approximation coincides almost exactly with the data curve. Mean parameter values and standard deviations are summarized in Table 2. For comparison, the results from Wahde & Hertz are shown in Table 3. One noticeable difference is in the magnitude of the parameters, and the size of the standard deviations. Our results have much smaller average parameter values compared to Wahde & Hertz. The likely reason for this is that parameter values were initially between −0.5 and 0.5, and thus were biased toward smaller values. Our standard deviations are also much smaller, an indication of better determination of the parameters.


[Figure 1: plot of expression level versus time (0 to 30) for the four artificial genes; image not reproduced here.]

Figure 1: Data curves are shown alongside network predictions. Data points are shown as circles (◦) and network approximations are shown as crosses (×).

                            W                                      b             τ
     0.37 (0.65)    0.02 (0.14)   -0.30 (0.33)    3.76 (0.64)    0.81 (0.32)   4.79 (0.78)
     3.19 (1.00)   -0.63 (1.47)  -10.94 (1.88)    5.38 (0.65)    3.78 (1.73)   4.80 (0.41)
    -3.37 (1.78)    7.94 (1.07)    5.69 (0.70)    3.13 (0.73)   -4.25 (0.95)   4.55 (0.14)
     0.41 (0.69)    0.70 (0.16)    0.65 (0.39)    3.02 (0.47)   -1.68 (0.43)   7.84 (1.00)

Table 2: Results of trials on artificial data. The mean parameter values are shown, with standard deviations in parentheses.

                       W                               b              τ
     16 (11)      7.5 (17)    1.8 (16)    12 (15)      2.0 (5.6)     9.5 (1.1)
     18 (7.7)    -13 (5.8)   -8.3 (11)    -9.6 (14)    4.8 (4.3)     5.0 (0.64)
    -10 (12)      13 (8.1)    14 (11)      9.7 (15)   -3.3 (5.8)     6.3 (0.97)
     -0.1 (16)    6.2 (16)    13 (13)      5.0 (17)   -0.88 (5.4)   17 (1.9)

Table 3: Results from Wahde & Hertz [25]. The mean parameter values are shown, with standard deviations in parentheses. These numbers are based on 50 trials.


We tested for parameter values that were significantly non-zero. The Jarque-Bera test for normality was first applied to each parameter. Based on the results of this test, a t-test with α = 0.01 was performed for parameters that had a normal distribution, and a Wilcoxon signed rank test was performed at α = 0.01 for those that had a non-normal distribution. All but four parameters had a non-normal distribution, and all parameters (normally distributed or not) except one were assessed to be significantly non-zero. The null hypothesis µ = 0 could not be rejected only for the weight w_12 (p = 0.087). For all other parameters, the null hypothesis (µ = 0 for normally distributed parameters, median = 0 for non-normally distributed ones) was rejected with p < 8.0 × 10^{-9}.
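The significance testing described above maps directly onto scipy.stats. The sketch below is ours and uses a synthetic sample in place of the 200 fitted values of one parameter.

    # Sketch of the significance testing on a fitted parameter: Jarque-Bera for
    # normality, then a one-sample t-test (normal case) or a Wilcoxon signed-rank
    # test (non-normal case) against zero, both at alpha = 0.01.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    w_samples = rng.normal(0.4, 0.6, size=200)  # synthetic stand-in for one weight

    jb_stat, jb_p = stats.jarque_bera(w_samples)
    if jb_p > 0.01:                             # no evidence against normality
        test_stat, p = stats.ttest_1samp(w_samples, popmean=0.0)
    else:
        test_stat, p = stats.wilcoxon(w_samples)       # tests median = 0
    print(f"normality p = {jb_p:.3g}, non-zero test p = {p:.3g}")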

One important issue to consider is that the learned parameter values are quite dissimilar to the original values in Table 1. It is plausible that there are many solutions that fit the data well, which brings up the question of how to interpret results. If there are so many possible solutions, how can we tell which one is correct? The short answer is that there is no good way to tell. The key to resolving this problem is to have more data. Wahde & Hertz found that the more time courses the network was trained on, the better it performed and the smaller the standard deviations. Additional independent data would narrow down the number of solutions. In addition, with only a single time course to train on, we run the risk of over-fitting to the data. Indeed, there is a good chance that the network predictions in Figure 1 are overfitted. One of the most common methods for preventing overfitting in machine learning is cross-validation: a subset of the available data (called the test set) is withheld from the training set, and error is evaluated on this test set. However, points from the same time-series dataset are not independent, so this method cannot be applied if only a single time course is available for training.

7 Experiments on Biological Data

No genetic network inference method is worth much unless it can be applied to real biological data. Wahde and Hertz [25, 26] apply their method to rat central nervous system data, collected by Wen et al. [29]. Although it would have been interesting to apply our method to the same data and to compare results, we decided to use Spellman's [22] yeast cell cycle data instead for several reasons. First, the yeast cell cycle website (http://genome-www.stanford.edu/cellcycle/), where the Spellman data is available for download, has many useful resources such as a searchable database for all genes in the dataset and links to the Saccharomyces genome database. The Wen dataset is also available publicly, but does not have the advantage of these additional resources. Secondly, in the Wen study, expression levels are measured at nine time points non-uniformly spread throughout the course of CNS development, from rat embryo to adult. With such large time intervals, the continuity assumption for expression levels becomes dubious. Finite difference methods for solving ODEs require that the step size be small enough to capture the phenomena being modelled. With time intervals on the order of days, if not weeks in some cases, it is inevitable that expression levels would vary significantly between recorded time points and that significant swings in expression level are missed. In contrast, the Spellman dataset consists of four time courses varying in length from 14 to 24 time points, spaced at intervals between 7 and 30 minutes. With shorter time intervals, there is a better chance of modelling gene expression levels accurately.

Both the Wen and Spellman studies do some cluster analysis on their respective datasets. Wen et al. identify five basic patterns of expression that they believe represent different phases of CNS development. Although they examine functional relationships among genes within clusters, they do not discuss regulatory relations between clusters or how the clusters are regulated at all. Thus, it would be difficult to evaluate the accuracy of a genetic network inference method applied to this data without in-depth knowledge of the biological mechanisms underlying CNS processes. Indeed, Wahde and Hertz predict some regulatory relations that may exist among the clusters, but do not assess the biological plausibility of these predictions in much depth. Spellman et al. identify nine clusters of genes, and identify possible regulators for these clusters by searching for transcription factor binding sites in the promoter regions of the genes. Our experimental design utilizes this information so we can validate our results to some extent.

One advantage of the Wen data is that measurements are taken in triplicate, and thus are more reliable. The Spellman data consists of a single measurement per gene per time point, and is therefore more susceptible to noise. This needs to be taken into account when we use this data to test our method. Numerous other genetic network inference papers [7, 6, 13] have used this dataset to test their methods.

7.1 The Spellman paper

The eukaryotic cell division cycle consists of four phases: G1, S, G2, and M. The two major steps in cell division are DNA replication (S phase) and mitosis (M phase). These two steps are separated by the gap phases G1 and G2. A gene whose expression level varies periodically with the cell cycle can be considered to be cell cycle regulated. However, not all such genes are functionally involved in mechanisms of the cell cycle, nor is it possible to say with certainty that all genes involved in the cell cycle necessarily display periodic behaviour [3]. The cell cycle has been studied extensively by molecular biologists and is relatively well understood.

Spellman et al. [22] identify approximately 800 putative cell cycle regulated genes in Saccharomyces cerevisiae based on analysis of microarray time-series data. They devise a scoring method, based on Fourier analysis and Pearson correlation, to assess whether a gene is cell cycle regulated. They selected a score threshold, exceeded by 91% of known cell cycle genes, and classified all genes that scored above this threshold as cell cycle regulated. The resulting 800 or so genes (out of approximately 6200) were clustered hierarchically. Spellman et al. then identified and analyzed nine functionally related clusters of genes.
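
As a rough illustration only (this is not Spellman et al.'s actual scoring function, and the period, sampling interval, and reference profiles below are made-up values), a periodicity score in this spirit can combine the magnitude of a Fourier component at an assumed cell-cycle period with the best Pearson correlation against profiles of known cell cycle regulated genes:

    import numpy as np

    def toy_cell_cycle_score(expr, times, period, reference_profiles):
        """Toy score: Fourier magnitude at the assumed cell-cycle frequency plus the
        best Pearson correlation with known cell-cycle gene profiles."""
        omega = 2.0 * np.pi / period
        fourier_mag = np.abs(np.sum(expr * np.exp(-1j * omega * times)))
        best_corr = max(np.corrcoef(expr, ref)[0, 1] for ref in reference_profiles)
        return fourier_mag + best_corr

    # Made-up example: 18 samples at 7-minute spacing, assumed 66-minute cycle.
    times = np.arange(18) * 7.0
    gene = np.sin(2 * np.pi * times / 66.0) + 0.1 * np.random.default_rng(1).normal(size=18)
    refs = [np.sin(2 * np.pi * times / 66.0), np.cos(2 * np.pi * times / 66.0)]
    print(toy_cell_cycle_score(gene, times, 66.0, refs))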

The microarray hybridization data used consist mainly of four independent time-series datasets, which we refer to as α-factor, cdc15, cdc28, and elutriation. The cdc28 dataset originally comes from Cho et al. [3]. Each time series measured mRNA transcript levels for the same set of genes, but was synchronised using a different method. Yeast cultures must be synchronised so that all cells are at the same point in the cell cycle before transcript levels are measured. Significant synchrony was achieved for one to three cell cycles, depending on the method. The number of time points in each time series varied from 14 to 24. In addition to the four time courses, Spellman et al. examined the response of genes to the cyclins Cln3p and Clb2p, two known cell cycle regulators. More than half of the 800 putative cell cycle regulated genes responded to at least one of these cyclins.

The genes were clustered using the hierarchical clustering method described by Eisen et al. [6], and nine clusters were identified empirically. The genes in each cluster were functionally related and substantially co-regulated, based on analysis of the promoter regions. For each cluster, Spellman et al. identified known and hypothesized binding sites for possible regulators, and described any significant regulatory effects of Cln3p and Clb2p. In addition, they discussed which genes take part in major functions of the cell cycle, such as DNA replication, budding, glycosylation, and nuclear division.

7.2 Experimental Setup

We do not use all of Spellman's data to test our method. We confine ourselves to a single time series, and select a small number of genes on which to run the neural network; recall that dimensionality restricts the number of genes that can be modelled simultaneously. We chose to use the α-factor time course, since it had 18 uniformly spaced time points, more than the other datasets. The more difficult question was which genes to select. One possibility was to take the average expression of each cluster and to infer possible regulatory relations between the clusters. This is basically the method adopted by Wahde & Hertz [25, 26]. The problem is that Spellman et al. do not describe the nature of regulatory relations between clusters, nor do we know if such relations even exist. Thus, we cannot validate our predictions at all.

Since Spellman et al. describe the regulators for each cluster, we decided to use this as the basis for our experiments. For each of three clusters that had relatively well-characterized regulators, the training set was taken to be the expression levels of the regulating genes in addition to the average expression level of the cluster. Three sets of experiments were performed, one for each of the clusters CLN2, CLB2, and SIC1. The total number of signals in a training set never exceeded six, which is safely under the maximum number dictated by the dimensionality restriction. The one significant change in methodology from Section 5 was that the effects of the regulator genes on one another were disregarded, since we were only interested in how the regulators affected the expression level of the cluster.
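
A minimal sketch of how one such training set could be assembled is given below. The variable and function names are ours, not the thesis's; `expr` is assumed to hold one row per gene and one column per time point.

    import numpy as np

    def build_training_set(expr, gene_index, cluster_genes, regulator_genes):
        """Stack the regulators' expression profiles together with the average
        profile of the cluster (the signal whose regulation we try to learn)."""
        target = expr[[gene_index[g] for g in cluster_genes]].mean(axis=0)
        regulators = expr[[gene_index[g] for g in regulator_genes]]
        return np.vstack([regulators, target])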

A number of preprocessing steps were necessary. Expression levels were recordedas log ratios in the Spellman study, ranging between −∞ and ∞, so the data weretransformed linearly to lie between 0 and 1. To ensure that the transformed expressionlevels were never exactly equal to 0 or to 1, a fudge factor of 0.05 was added to thescaling value. A value of 0 or 1 is difficult for the network to learn due to the asymptoticnature of the logistic sigmoid. In addition, there were a number of missing values, whichwere filled by linear interpolation. There are better ways of dealing with missing values;however, in this case, the exact interpolated value is of less consequence because of thefiltering process we subsequently applied to reduce noise in the data. We describe thefiltering process in the next section.
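
A sketch of these preprocessing steps is given below, assuming missing values are marked as NaN; the exact rescaling used in the thesis (in particular the precise role of the 0.05 fudge factor) may differ from this interpretation.

    import numpy as np

    def preprocess(signal, fudge=0.05):
        """Fill missing values by linear interpolation, then rescale linearly so
        that all values lie strictly inside (0, 1), padding the range by `fudge`
        on each side so no value sits exactly at 0 or 1."""
        x = np.asarray(signal, dtype=float)
        idx = np.arange(len(x))
        missing = np.isnan(x)
        x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
        lo, hi = x.min(), x.max()
        return (x - lo + fudge * (hi - lo)) / ((hi - lo) * (1.0 + 2.0 * fudge))

    print(preprocess([0.2, np.nan, -1.3, 0.7, np.nan, 2.0]))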

7.2.1 Filtering

To fit the language of signal processing, we will refer to gene expression curves as signals. A signal $f(t)$ is just some function of time. For convenience, let $f \in L^2(0, 2\pi)$, which means that
$$\int_0^{2\pi} (f(t))^2 \, dt < \infty.$$

The idea behind Fourier series is to write $f$ as a superposition of basic waves called harmonics. In this case, the harmonics are
$$e_k(t) = \frac{e^{ikt}}{\sqrt{2\pi}}.$$

These functions are chosen for two reasons. First, they form an orthonormal basis of the Hilbert space $L^2(0, 2\pi)$ [21], yielding the Fourier series formula
$$f(t) = \sum_{k=-\infty}^{\infty} \hat{f}(k)\, e_k(t) \qquad (39)$$
where
$$\hat{f}(k) = \int_0^{2\pi} f(t)\, \overline{e_k(t)}\, dt = \int_0^{2\pi} f(t)\, \frac{e^{-ikt}}{\sqrt{2\pi}}\, dt. \qquad (40)$$

Note that the larger the value of $|k|$, the higher the frequency of the harmonic function. The second reason is that these basic waves are empirically well suited to physical problems [17], and the Fourier coefficients represent physically important quantities.

Suppose we have a signal $x(t)$, and that noise $n(t)$ is added to the signal when it is measured. Let $\tilde{x}(t)$ be the resulting noisy signal. One of the fundamental problems of signal processing is to recover the original signal $x(t)$ from the noisy signal $\tilde{x}(t)$. In the absence of better information, it is safe to guess that $x(t)$ has small high-frequency Fourier coefficients. For instance, if $x(t)$ is infinitely differentiable, then the Fourier coefficients $\hat{x}(k)$ go to 0 faster than any power of $1/k$ [21]. Similarly, if $x(t)$ is $(n+1)$ times differentiable, then $\hat{x}(k)$ goes to 0 faster than $(1/k)^n$. Noise is unlikely to be differentiable, which forces it to have larger high-frequency Fourier coefficients. Hence, if we truncate the Fourier series at a suitably chosen medium frequency, most of the information removed will come from the noise component. Thus, a reasonable filtered version of $\tilde{x}(t)$ is
$$\sum_{k=-K_c}^{K_c} \hat{\tilde{x}}(k)\, e_k(t).$$

We can write this as a convolution [21], $\tilde{x} * h$, where $h \in L^2$ is the function such that
$$\hat{h}(k) = \begin{cases} 1, & |k| \le K_c \\ 0, & |k| > K_c. \end{cases}$$
Convolving $h$ with a signal is called the sinc filter, and $h$ is given by
$$h(t) = \frac{K_c}{\pi}\, \mathrm{sinc}\!\left(\frac{K_c}{\pi}\, t\right) = \frac{\sin(K_c t)}{\pi t}.$$

We do not have expression level measurements $\tilde{x}(t)$ for all time points, only at a finite number of sample points. We can approximate $\hat{\tilde{x}}(k)$ with a Riemann sum:
$$\hat{\tilde{x}}(k) = \int_0^{2\pi} \tilde{x}(t)\, \overline{e_k(t)}\, dt \approx \frac{2\pi}{n} \sum_{j=0}^{n-1} \tilde{x}(2\pi j/n)\, \overline{e_k(2\pi j/n)} = \frac{\sqrt{2\pi}}{n} \sum_{j=0}^{n-1} \tilde{x}(2\pi j/n)\, e^{-2\pi ijk/n} \qquad (41)$$
where $n$ is the number of sample points. It just happens that this Riemann sum coincides with the discrete Fourier transform on $\mathbb{Z}/n\mathbb{Z}$. Because of this coincidence, and based on the Fourier theory for $\mathbb{Z}/n\mathbb{Z}$ (which we do not discuss here), we know that we do in fact have the exact reconstruction formula
$$\tilde{x}(2\pi l/n) = \sum_{k=0}^{n-1} \hat{\tilde{x}}(k)\, e_k(2\pi l/n) \qquad (42)$$
where $\hat{\tilde{x}}(k)$ is calculated according to Equation (41). Note that $\hat{\tilde{x}}(k + an) = \hat{\tilde{x}}(k)$ and that $e_k(2\pi l/n) = e_{k+an}(2\pi l/n)$ for any $a \in \mathbb{Z}$. If $k_1, k_2, \ldots, k_n \in \mathbb{Z}$ are such that the cosets $k_j + n\mathbb{Z}$, $j = 1, \ldots, n$, are exactly the $n$ elements of $\mathbb{Z}/n\mathbb{Z}$, then
$$\tilde{x}(2\pi l/n) = \sum_{j=1}^{n} \hat{\tilde{x}}(k_j)\, e_{k_j}(2\pi l/n).$$

This is called the aliasing problem: each discrete frequency corresponds to many continuous frequencies. The question is how to identify the continuous frequencies with the discrete ones in view of this problem. A heuristic that has proven useful in practice is to identify the discrete frequency $k$ with the continuous frequency $k'$ such that $k' \equiv k \pmod{n}$ and $|k'|$ is as small as possible. That is, the discrete frequencies are interpreted to be the lowest possible frequencies they can be. This agrees with the notion discussed above that the most pertinent information is stored in the low frequency Fourier coefficients. The final formula for the filtered signal is
$$x_{\mathrm{filt}}(t) = \sum_{k=-K_c}^{K_c} \hat{\tilde{x}}(k)\, e_k(t) \qquad (43)$$
where $\hat{\tilde{x}}(k)$ is given by (41) (extended to negative $k$ by the aliasing convention above), and $K_c$ is the cutoff frequency of the sinc filter.

The expression level measurements $\tilde{x}(t)$, $t = 0, \ldots, T-1$, of a gene are not exactly periodic. Although the expression level of a cell cycle gene is expected to be periodic in nature, it is quite likely that the time series data do not represent a whole number of cell cycles, and that the initial and final expression levels are very different. Extending the signal by a couple of points on both ends of the time series so that it is continuously differentiable alleviates the problem of "ringing", which occurs when filtering a signal with a jump discontinuity. Ringing appears as inconsistencies in the first and last few time points in the smoothed signal, as the first points exert undue influence on the last points and vice versa. By interpolating smoothly between the last point and the first point, we hope to contain any ringing to the points in the extension, which we subsequently remove. Figure 2 depicts the original signals and the filtered signals.

Figure 2: The gene expression levels for the CLB2 experiment, before and after filtering. Filtered signals are bold.

The cutoff frequency $K_c$ was largely decided by trial and error. About two thirds of the Fourier coefficients were retained. We strove to eliminate most of the high frequency fluctuations while still retaining the underlying shape of the curve.
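
The filtering step can be sketched as follows. The padding scheme here is a simplified stand-in (linear interpolation between the last and first points), and `keep_fraction` of roughly two thirds mirrors the remark above about retaining about two thirds of the coefficients; the thesis's exact implementation may differ.

    import numpy as np

    def lowpass_filter(samples, keep_fraction=2.0 / 3.0, n_pad=2):
        """Extend the series so it wraps around smoothly (to limit ringing),
        truncate its discrete Fourier series at a cutoff, then drop the padding."""
        x = np.asarray(samples, dtype=float)
        pad = np.linspace(x[-1], x[0], n_pad + 2)[1:-1]   # points between last and first
        extended = np.concatenate([x, pad])
        coeffs = np.fft.rfft(extended)
        cutoff = int(np.ceil(keep_fraction * len(coeffs)))
        coeffs[cutoff:] = 0.0                             # discard high frequencies
        smoothed = np.fft.irfft(coeffs, n=len(extended))
        return smoothed[:len(x)]                          # remove the extension

    t = np.linspace(0.0, 2.0 * np.pi, 18, endpoint=False)
    noisy = np.sin(t) + 0.2 * np.random.default_rng(2).normal(size=18)
    print(lowpass_filter(noisy))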

7.2.2 Using only relevant weights

The clusters CLN2, CLB2, and SIC1 are sets of genes with similar expression patterns. Table 4 lists the genes in each cluster. In our experiment, we take the average expression level of the genes in each cluster, and treat this as a single signal. For each cluster, we also have a set of genes identified by Spellman et al. as possible regulators. Each of these genes is treated by itself as a signal. The purpose of the experiment is to find how the regulators influence the expression level of the cluster, the "target" signal. Since we do not care to know how the regulators influence one another, we do not need to optimize the weights corresponding to these effects. We retain only those weights that describe the effect of each signal on the target. The differential equation (10) becomes the following:
$$\frac{dV_c}{dt} = \frac{1}{\tau}\left(-V_c + g\!\left(\sum_{i \in \mathrm{reg}} w_i X_i + w_c V_c + b\right)\right)$$
where $V_c$ is the expression level of the cluster, the $X_i$ are the expression levels of the regulators, and the $w_i$ and $w_c$ are the weights. In addition, $\tau$ is the time constant, $g(\cdot)$ is the logistic sigmoid, and $b$ is the bias term, as before. Note that $V_c$ and the $X_i$ are scalar functions of time, $\tau$ and $b$ are scalars instead of vectors, and $\vec{w}$ is a vector instead of a matrix.

Cluster   Genes in cluster

CLN2c     EST1, YBR070C, SMC1, CAC2, CLN1, RNR3, YJL181W, POL12, YBR089W, RAD53, YCL022C, YCL024W, PRI2, SPT21, CLB6, RHC18, YFR027W, SPH1, CDC45, YDL011C, YDL010W, ASF2, SWE1, SMC3, CDC21, YDL163W, RFA2, RNR1, RAD27, CDC9, RNH35, YPL267W, YNL300W, KIM2, SRO4, CSI2, CLN2, YOX1, POL30, MCD1, YLR183C, HCM1, YPL208W, YGR221C, BNI4, YPR174C, DPB2, STB1

CLB2c     NUM1, YCL063W, YCL062W, SHE2, BUD8, YCL012W, BUD3, YCL013W, YPL141C, KIP2, IQG1, YPR156C, BUD4, TEM1, YNL058C, CHS2, MYO1, YJL051W, YIL158W, YML033W, YML034W, MOB1, HST3, ACE2, CDC20, CYK2, YML119W, YLR190W, SWI5, ALK1, CLB2, CDC5, CLB1, YLR084C

SIC1c     YDR055W, YNL078W, YNR067C, EGT2, PCL9, YOR263C, YNL046W, YOR264W, CHS1, FAA3, TEC1, HSP150, PIR1, PIR3, ASH1, YPL158C, YKL116C, YDL117W, SIC1, YGR086C, YIL104C, RME1, YBR158W, YHR143W, CTS1, YGL028C, PRY3

Table 4: Genes in each cluster.

Cluster   Regulator              Mode

CLN2c     CLN3                   ++
          CLB2                   --
          MBF (Swi6p, Mbp1p)
          SBF (Swi6p, Swi4p)

CLB2c     CLN3                   -
          CLB2                   ++
          Mcm1p

SIC1c     CLN3                   (-)
          Swi5p
          Ace2p

Table 5: Regulators for each cluster. '+' indicates up-regulation, '-' indicates down-regulation; '++' and '--' indicate strong up- and down-regulation, respectively.

Given values for $\tau$ and $\vec{w}$, we wish to solve the above differential equation for values of $V_c$. In order to use an adaptive time step method such as ode15s in Matlab, we need to be able to evaluate $X_i$ at arbitrary values of $t$. This presents a slight problem, since we only have values of $X_i$ at specified time points from the time series data. Luckily, when we filter the signals using the method outlined in the previous section, we get a truncated Fourier series representation of each signal, which is defined for arbitrary values of $t$. This allows us to solve the differential equation for $V_c$ with a high degree of accuracy.
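
A sketch of this forward simulation is given below, using SciPy's solve_ivp (a stiff BDF solver standing in for Matlab's ode15s) and evaluating each regulator from its truncated Fourier series. The argument names and the coefficient convention are ours, not the thesis's.

    import numpy as np
    from scipy.integrate import solve_ivp

    def simulate_cluster(t_span, v0, tau, w_reg, w_c, b, reg_coeffs, period):
        """Integrate dVc/dt = (1/tau) * (-Vc + g(sum_i w_i X_i(t) + w_c Vc + b)).
        reg_coeffs[i, k] are complex coefficients c_{i,k} assumed to satisfy
        X_i(t) ~= Re(sum_k c_{i,k} * exp(2j*pi*k*t/period))."""
        g = lambda u: 1.0 / (1.0 + np.exp(-u))            # logistic sigmoid

        def regulators(t):
            ks = np.arange(reg_coeffs.shape[1])
            return np.real(reg_coeffs @ np.exp(2j * np.pi * ks * t / period))

        def rhs(t, v):
            drive = np.dot(w_reg, regulators(t)) + w_c * v[0] + b
            return (-v + g(drive)) / tau

        return solve_ivp(rhs, t_span, [v0], method="BDF", dense_output=True)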

The simplification above speeds up the learning process, since we only need to optimize $N + 2$ parameters instead of $N^2 + 2N$ (for example, with $N = 6$ signals, 8 parameters instead of 48). Since there is less redundancy in the parameters, we can also expect more consistent results. Trials resulting in an error of less than 0.01 that took less than 15 minutes to converge were deemed acceptable.

7.3 Results

We discuss the results for each of the clusters CLN2, CLB2, and SIC1 in turn. The transcription factors named by Spellman et al. as regulators are listed in Table 5. To distinguish the cluster CLN2 from the gene CLN2, we refer to the cluster as CLN2c, and similarly for the other clusters.

The CLN2c cluster contains 48 genes. From the Spellman study, CLN3 is known to be a strong inducer of this cluster, while CLB2 tends to have a repressing effect. The two binding factors MBF and SBF are both heterodimers. It is not clear whether anything meaningful can be inferred from the expression levels of the corresponding genes, but they are included for the sake of completeness. One possible reason why this may not yield worthwhile results is the documented lack of correlation between gene and protein expression levels [8, 9]. Also, the stoichiometry of the binding is an important factor: there may be plenty of protein A present, but it cannot bind (and therefore affect transcription) unless it encounters a copy of protein B. In this case, the expression level of the gene does not reflect how much of an effect it has on transcription. It would be better to have the levels of the binding factors themselves, but such data do not exist. Thus it would not be surprising if the results for the genes SWI6, MBP1, and SWI4 do not amount to much.

Figure 3 depicts the network approximation of the expression of the CLN2c cluster. The network approximates the data curve rather well. The mean error over 200 trials was $5.53 \times 10^{-4}$ (SD $= 4.28 \times 10^{-4}$), indicating that almost all trials terminated with very low error. The maximum error over all trials was 0.0027, still well below the acceptable 0.01 level. The mean convergence time per trial was 31.8 seconds (SD = 45.5). The minimum time taken was 5.2 seconds, while the maximum time taken was 540.0 seconds (9 minutes), safely under the allowed 15 minutes. The means and standard deviations of the parameters are given in Table 6. CLN3 has a large positive weight, indicating that it has an up-regulating effect on the CLN2c cluster, while CLB2 has a negative weight, indicating that it has a down-regulating effect. This is consistent with the effects of these genes reported by Spellman, summarized in Table 5. The weights indicate that SWI6 has a repressive effect, while SWI4 seems to induce expression of CLN2c. Also, CLN2c itself seems to have an auto-inducing effect. However, we do not have the necessary information to verify these predictions. Again, the predictions regarding SWI6, MBP1, and SWI4 should be taken with a grain of salt, since gene expression levels do not necessarily reflect levels of the binding factors that directly influence transcription.

Figure 3: Gene expression levels and network approximation for CLN2c. The network approximation is the curve marked by crosses; the curve it is approximating is shown in bold.

Parameter   Mean (SD)       Gene

W           -3.27 (1.43)    SWI6
            -0.32 (0.48)    MBP1
             2.35 (1.32)    SWI4
             5.34 (2.58)    CLN3
            -2.27 (0.87)    CLB2
             2.43 (1.13)    CLN2c
b           -2.02 (1.19)
τ            2.36 (0.69)

Table 6: Results of trials on CLN2c data. The mean parameter values are shown, with standard deviations in parentheses. The gene corresponding to each weight is shown in the third column.

Parameters were tested to see whether they were significantly non-zero, as for the artificial data. Again, depending on the results of the Jarque-Bera test, either the t-test with null hypothesis µ = 0 or the Wilcoxon signed rank test with null hypothesis median = 0 was applied. In all cases, the null hypothesis was rejected with $p < 1.43 \times 10^{-15}$; that is, the average (mean or median, as the case may be) of each parameter was considered to be significantly non-zero.

The CLB2c cluster contains 34 genes. The gene CLB2 is known to be a strong inducer of CLB2c, while CLN3 tends to repress expression. In addition, MCM1 is known to be a regulator for this cluster. The Spellman study did not state if MCM1 was an inducer or repressor of the cluster. If we get consistent results, then it may be possible to guess if MCM1 up-regulates or down-regulates CLB2c. However, it is also possible that there are regulators that we have not taken into account here, which could make the results unreliable.

Figure 4 depicts the network approximation of the expression of the CLB2c cluster. The network approximates the data curve rather well. The mean error over the 200 trials was $3.27 \times 10^{-4}$ (SD $= 1.7 \times 10^{-3}$). The maximum error was 0.015, slightly above the acceptable 0.01 threshold. The mean time for convergence was 19.5 seconds (SD = 25.3). The minimum time taken for a trial was 0.81 seconds, and the maximum time was 221.8 seconds (3 minutes, 41.8 seconds). Mean parameter values and standard deviations are given in Table 7. CLB2 has a positive mean weight, indicating that it up-regulates the CLB2c cluster. This is consistent with the information in Table 5. However, CLN3 is expected to have a negative weight, but has a small positive one. This may be because the effect of CLN3 on the cluster CLB2c is not strong. It is possible that CLN3 was not repressing transcription of CLB2c when this dataset was collected. The weight for MCM1 is slightly negative, an indication that it has a repressive effect on CLB2c. Again, we do not have the information to verify this independently. All parameters had non-normal distributions, and all had significantly non-zero medians, $p < 8.33 \times 10^{-27}$.

Figure 4: Gene expression levels and network approximation for CLB2c. The network approximation is the curve marked by crosses; the curve it is approximating is shown in bold.

Parameter   Mean (SD)       Gene

W           -0.43 (0.19)    MCM1
             1.47 (0.24)    CLB2
             0.74 (0.23)    CLN3
            -0.41 (0.38)    CLB2c
b           -0.65 (0.15)
τ            0.05 (0.08)

Table 7: Results of trials on CLB2c data. The mean parameter values are shown, with standard deviations in parentheses. The gene corresponding to each weight is shown in the third column.

The SIC1c cluster contains 27 genes. CLN3 has a mildly repressive effect on this cluster, while CLB2 does not seem to affect it. Swi5p and Ace2p are known to be regulators, but we do not know how they affect the expression of the cluster.

Figure 5 depicts the network approximation of the expression of the SIC1c cluster. The approximation here is not as good as for the other two clusters. The mean error was $4.17 \times 10^{-3}$ (SD $= 4.83 \times 10^{-3}$). The maximum error was 0.021, which is above the acceptable threshold of 0.01. Twenty-one trials out of 200 had error over 0.01. The mean time for convergence was 36.4 seconds (SD = 43.9). Finishing times ranged between 1.3 and 407.6 seconds (6 minutes, 47.6 seconds). Means and standard deviations for the parameter values are summarized in Table 8. The results do not show a down-regulating effect of CLN3 on the cluster SIC1c. In fact, the weight for CLN3 is positive, although its standard deviation is quite large. Many of the parameters had large standard deviations, indicating that not all trials converged to the same solution. It is possible that some trials converged to different local minima on the error surface. Despite the relatively large standard deviations, the median of each parameter was significantly non-zero with $p < 6.62 \times 10^{-5}$ (all parameters had non-normal distributions).

Figure 5: Gene expression levels and network approximation for SIC1c. The network approximation is the curve marked by crosses; the curve it is approximating is shown in bold.


Parameter   Mean (SD)       Gene

W            1.18 (4.28)    SWI5
            -2.70 (2.96)    ACE2
             5.04 (4.69)    CLN3
             2.61 (1.75)    CLB2
             3.10 (2.08)    SIC1c
b           -4.48 (3.75)
τ            1.90 (1.23)

Table 8: Results of trials on SIC1c data. The mean parameter values are shown, with standard deviations in parentheses. The gene corresponding to each weight is shown in the third column.

8 Discussion

We achieved three notable things in this thesis. First, the Levenberg-Marquardt algorithm was used to optimize the parameters of a continuous-time recurrent neural network that models gene regulatory networks. All trials on artificial and biological data converged in under ten minutes, although most trials converged in far less time. We did not directly compare the speed of our method relative to that of Wahde & Hertz [25]. Second, noise-plagued biological time-series data were filtered with a sinc filter; this was our method of dealing with noise in the data. Third, we tested our method on a subset of the biological data that allowed us to verify some of the predictions made by the model. This is particularly advantageous considering how difficult it is to validate model predictions when we do not know the "true" regulatory network. It also deals with the dimensionality problem, since the subset of genes that we use is small enough relative to the number of time points.
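
For illustration, the optimization step has roughly the following shape. This is a hedged sketch rather than the thesis code: `simulate(theta, times)` stands in for a forward model such as the earlier simulation sketch, and SciPy's Levenberg-Marquardt least-squares routine stands in for the Matlab implementation actually used.

    import numpy as np
    from scipy.optimize import least_squares

    def fit_parameters(theta0, times, target, simulate):
        """Fit the network parameters by minimizing the squared difference between
        the measured target signal and the simulated one (Levenberg-Marquardt)."""
        def residuals(theta):
            return simulate(theta, times) - target
        return least_squares(residuals, np.asarray(theta0, dtype=float), method="lm")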

However, there are still many questions to be answered. One of the biggest problems is interpreting the weights W. Given a matrix X of gene expression levels, there seem to be many possible parameter values that give very low error. How can we tell which one is "correct"? The simple answer is that we cannot tell based on only one set of data. If we had multiple independent time courses to train on, the field of possible solutions would narrow quite considerably. This brings up another important question: how to incorporate different datasets into one training set. D'haeseleer [4, 5] and Wahde & Hertz [25, 26] add condition-dependent parameters to model differences inherent to each dataset, but assume that the underlying weight matrix (W) remains the same. However, this is hard to do unless we know which condition-dependent factors are important, and unless some data are available to model them. For instance, the yeast cell cycle data in Spellman et al. [22] likely have artifacts due to the synchronization method, but we cannot model them without knowing what they are.

We expect a regulatory matrix to be sparse, but our results on artificial data had many significantly non-zero entries. However, as noted by van Someren et al. [23], it is dangerous to look at each weight in isolation. Interactions among parameters are likely to exist, and these are not taken into consideration when weights are analyzed in isolation from one another. Nonetheless, this is what we have done. The rationale is that some information can be gained by looking at each weight by itself. However, multivariate analysis would probably be informative and quite appropriate in this case. It would also allow us to investigate interactions between weights.

It is also dangerous to assume that weights with a bigger absolute value are more important [5]. Since the weights are multiplied by the gene expression levels, a large weight could simply be compensating for very low expression levels of that gene. That being the case, how can we determine which weights are important and which are not? One possible solution is to implement weight elimination and pruning of weights, as D'haeseleer did in his experiments on artificial data [5]. If we prune or otherwise change an important weight, the network is likely to suffer an increase in error. Thus, these methods could help determine which weights are important.
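
A minimal sketch of such a pruning-style check, again assuming a hypothetical forward model `simulate(theta, times)`: zero out one weight at a time and record how much the error grows.

    import numpy as np

    def weight_importance(theta, weight_indices, times, target, simulate):
        """Error increase caused by zeroing each weight; a large increase suggests
        the weight matters, regardless of its absolute value."""
        def sse(th):
            return float(np.sum((simulate(th, times) - target) ** 2))
        baseline = sse(theta)
        scores = {}
        for i in weight_indices:
            pruned = np.array(theta, dtype=float)
            pruned[i] = 0.0
            scores[i] = sse(pruned) - baseline
        return scores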

Overfitting is another problem. Since we only have a single time course on which to train, the parameters are likely overfitted to the data. This means that the parameters have been fitted to the data so closely that the model does not generalize well. The standard machine learning method to prevent overfitting is cross-validation. The dataset is divided into training and test sets. Parameters are optimized on the training set, but we retain the parameters that minimize error on the test set. We cannot perform cross-validation in our case because the data points of a single time series are not independent. However, once we are able to incorporate different datasets into one training set, and sufficiently many datasets are available, cross-validation should be a useful tool.

Considering the noise in the data and all the possible sources of error, it is not surprising that the weaker regulatory relations between CLN3, CLB2, and the clusters CLN2c, CLB2c, and SIC1c were not found. In fact, it is quite an achievement that the three strong regulatory relations were predicted correctly by the model. It would be interesting to see whether any of the other predictions made by the neural network were accurate, perhaps by searching through more yeast cell-cycle literature or by conducting experiments. It is plausible that by learning more about the yeast cell cycle, we can model the system more closely and get better results. For instance, a complete list of regulators, and their effects on transcription, for the genes in these clusters would allow us to better evaluate how good the current model is, and may suggest ways to improve performance. Data on protein levels would also allow better modelling of the system.

In conclusion, our modelling efforts were rather successful, despite certain unresolved issues concerning the analysis and interpretation of the weights. Some of these problems can be addressed with more and better data. Progress in data collection will result in more datasets, with more replicates at more time points, at a lower cost. There is still a lot of work to be done before we can have real confidence in predictions made by the model, and we still have a long way to go before we can infer regulatory networks for entire genomes. What we have here is just the beginning.

References

[1] Tatsuya Akutsu, Satoru Miyano, and Satoru Kuhara. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In R. Altman, K. Dunker, L. Hunter, K. Lauderdale, and T. Klein, editors, Pacific Symposium on Biocomputing, volume 4, pages 17–28, Hawaii, January 1999. World Scientific Publishing.

[2] Adam Arkin, John Ross, and Harley H. McAdams. Stochastic kinetic analysis of developmental pathway bifurcation in phage λ-infected Escherichia coli cells. Genetics, 149:1633–1648, August 1998.

[3] Raymond J. Cho, Michael J. Campbell, Elizabeth A. Winzeler, Lars Steinmetz, Andrew Conway, Lisa Wodicka, Tyra G. Wolfsberg, Andrei E. Gabrielian, David Landsman, David J. Lockhart, and Ronald W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2:65–73, July 1998.

[4] P. D'haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. In Pacific Symposium on Biocomputing, volume 4, pages 41–52, Hawaii, January 1999. World Scientific Publishing.

[5] Patrik D'haeseleer. Reconstructing Gene Networks from Large Scale Gene Expression Data. Ph.D. dissertation, University of New Mexico, Albuquerque, New Mexico, December 2000.

[6] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95:14863–14868, December 1998.

[7] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3/4):601–620, August 2000.

[8] B. Futcher, G.I. Latter, P. Monardo, C.S. McLaughlin, and J.I. Garrels. A sampling of the yeast proteome. Molecular and Cellular Biology, 19(11):7357–7368, November 1999.

[9] Steven P. Gygi, Yvan Rochon, B. Robert Franza, and Ruedi Aebersold. Correlation between protein and mRNA abundance in yeast. Molecular and Cellular Biology, 19(3):1720–1730, March 1999.

[10] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. A Lecture Notes Volume in the Santa Fe Studies in the Science of Complexity. Addison-Wesley, Redwood City, California, 1991.

[11] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81:3088–3092, 1984.

[12] K. Levenberg. A method for the solution of certain problems in least squares. Quarterly of Applied Mathematics, 2:164–168, 1944.

[13] Ker-Chau Li, Ming Yan, and Chinsheng Yuan. A simple statistical model for depicting the cdc15-synchronized yeast cell-cycle regulated gene expression data. Statistica Sinica, 12:141–158, 2002.

[14] Shoudan Liang, Stefanie Fuhrman, and Roland Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In R. Altman, K. Dunker, L. Hunter, K. Lauderdale, and T. Klein, editors, Pacific Symposium on Biocomputing, volume 3, pages 18–29, Hawaii, January 1998. World Scientific Publishing.

[15] D. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics, 11:431–441, 1963.

[16] Eric Mjolsness, David H. Sharp, and John Reinitz. A connectionist model of development. Journal of Theoretical Biology, 152:429–453, 1991.

[17] A.V. Oppenheim, A.S. Willsky, and H. Nawab. Signals and Systems. Signal Processing Series. Prentice Hall, 2nd edition, 1997.

[18] Barak A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263–269, 1989.

[19] John Reinitz and David H. Sharp. Mechanism of eve stripe formation. Mechanisms of Development, 49:133–158, 1995.

[20] Sam T. Roweis. Levenberg-Marquardt optimization. Unpublished notes.


[21] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 3rd edition, 1986.

[22] Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, December 1998.

[23] E.P. van Someren, L.F.A. Wessels, and M.J.T. Reinders. Linear modeling of genetic networks from experimental data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 355–366, La Jolla, California, August 2000. American Association for Artificial Intelligence.

[24] E.P. van Someren, L.F.A. Wessels, and M.J.T. Reinders. Genetic network models: A comparative study. In Proceedings of SPIE, Micro-arrays: Optical Technologies and Informatics, volume 4266, pages 236–247, San Jose, California, January 2001. American Association for Artificial Intelligence.

[25] Mattias Wahde and John Hertz. Coarse-grained reverse engineering of genetic regulatory networks. BioSystems, 55:129–136, 2000.

[26] Mattias Wahde and John Hertz. Modeling genetic regulatory dynamics in neural development. Journal of Computational Biology, 8(4):429–442, September 2001.

[27] D.C. Weaver, C.T. Workman, and G.D. Stormo. Modeling regulatory networks with weight matrices. In Pacific Symposium on Biocomputing, volume 4, pages 112–123, Hawaii, January 1999. World Scientific Publishing.

[28] Eric W. Weisstein. Moore-Penrose matrix inverse. Eric Weisstein's World of Mathematics. http://mathworld.wolfram.com/Moore-PenroseMatrixInverse.html.

[29] Xiling Wen, Stefanie Fuhrman, George S. Michaels, Daniel B. Carr, Susan Smith, Jeremy L. Barker, and Roland Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Sciences, 95(1):334–339, January 1998.

[30] L.F.A. Wessels, E.P. van Someren, and M.J.T. Reinders. A comparison of genetic network models. In Pacific Symposium on Biocomputing, volume 6, pages 508–519, Hawaii, January 2001. World Scientific Publishing.

[31] C.H. Yuh, H. Bolouri, and E.H. Davidson. Genomic cis-regulatory logic: Experimental and computational analysis of a sea urchin gene. Science, 279:1896–1902, 1998.

[32] C.H. Yuh, H. Bolouri, and E.H. Davidson. Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development, 128:617–629, 2001.
