
    In: Information Theory: New Research

Editors: P. Deloumeaux et al., pp. 137-184

    ISBN: 978-1-62100-325-0

© 2011 Nova Science Publishers, Inc.

    Chapter 4

THE ROLE OF INFORMATION THEORY IN GENE REGULATORY NETWORK INFERENCE

Enrique Hernández-Lemus and Claudia Rangel-Escareño

Computational Genomics Department, National Institute of Genomic Medicine, Mexico. E-mail address: [email protected]

    Abstract

One important problem in contemporary computational biology is that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, GRN) from partial knowledge, as given for example by means of gene expression analysis experiments. Since only highly noisy data are available, doing this represents a challenge to common probabilistic modeling approaches. However, a variety of algorithms rooted in information theory and maximum entropy methods have been developed, and they have coped with the problem successfully (to a certain degree). Mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, Kullback-Leibler divergence and information-based similarity are some of these. Another approach to modeling gene regulatory networks combines information theory and machine learning techniques. Monte Carlo methods and variational methods can also be used to measure data information content. Hidden Markov models (HMM) use time series data to represent information of a state sequence about the past through a discrete random variable called the hidden state. Similarly, stochastic linear dynamical systems represent information about the past, but through a real-valued hidden state vector. Common to these models is the fact that, conditioned on the hidden state vector, the past, present and future observations are statistically independent. State-space models, also known as Linear Dynamical Systems (LDS) or Kalman filter models, are a subclass of dynamic Bayesian networks used for modeling time series data. Expressing time series models in state-space form allows for unobserved components, an important factor when modeling gene expression data. Unobserved variables can model biological effects that are not taken into account by the observables. They could model the effects of genes that have not been included in the experiment, levels of regulatory proteins or possible effects of mRNA degradation. Work presented here shows the use of these models to reverse engineer regulatory networks from high-throughput data sources such as microarray gene expression profiling. In this review we will also describe the basic theoretical foundations common to such methods and will briefly outline their virtues and limitations.

Keywords: information theory, network inference, probabilistic modeling

    1. Introduction

A common situation in several emerging fields of science and technology, such as bioinformatics and computational biology, high energy physics and astronomy, to name a few, is that researchers are confronted with datasets having thousands of variables, large noise levels, non-linear statistical dependencies and a very reduced sampling universe. The detection of functional and structural relationships in the data under such circumstances is always a major challenge. In particular, the construction of dynamic maps of gene interactions (so-called genetic regulatory networks) relies on understanding the interplay between thousands of genes. Several issues arise in the analysis of data related to gene function: the measurement processes generate highly noisy signals, and there are far more variables involved (number of genes and interactions among them) than experimental samples. Another source of complexity is the highly nonlinear character of the underlying biochemical dynamics.

Hence, two important milestones in the analysis of genomic regulation are variable selection (also called feature selection) and network inference. The former is a machine learning topic whose goal is to select, from amongst thousands of input variables, those that lead to the best predictive model. Feature selection methods applied to genomic data allow, for instance, the improvement of molecular diagnosis and prognosis in complex diseases (such as cancer) by identifying a set of features or variables (called a molecular signature) that best represent the phenomenon. Network inference, in turn, consists in representing the (in general non-linear) set of statistical dependencies between the variables of a set (which can be the whole input dataset or a feature-selected subset of it) by means of a graph. When applied to genomic expression data (e.g. from microarray experiments), network inference is able to reverse-engineer the transcriptional gene regulatory network (GRN) of the related cell. Knowledge of this GRN would allow, for instance, the discovery of new drug targets to cure diseases.

Information theory (IT) has proved to be a powerful theoretical foundation for developing algorithms and computational techniques that deal both with feature selection and with network inference problems applied to real data. There are, however, goals and challenges involved in the application of IT to genomic analysis. The applied algorithms should return intelligible models (i.e. they must be understandable), they must rely on little a priori knowledge, deal with thousands of variables and detect non-linear dependencies, all of this starting from tens (or at most a few hundreds) of highly noisy samples. As we will show in this chapter, IT has provided approaches to deal with these problems. Some of these approaches are based on machine learning techniques, basically by modeling a target function connecting the variables of a system. Here, the output or target variable is the one to be predicted and the input variables are the predictors.

As a means to produce intelligible models we perform feature-selection procedures. The goal of these procedures is to select, among a set of variables, the inputs which lead to the best predictive model. In the vast majority of cases, feature selection is a preprocessing step prior to the actual machine learning stage. This is a somewhat critical part of the whole inference process. On the one hand, variable or feature elimination can lead to information losses. On the other, feature selection is a means to improve the accuracy of a model, its generalizability and its intelligibility, and at the same time to decrease the computational burden of the training and inference stages. Computational methods for feature selection usually consist of a search algorithm that explores different combinations of variables, supplemented with a measure of performance (or score) for these combinations. There are several ways to accomplish this task; in our opinion, the best benchmarking options for the GRN inference scenario are the use of sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient, and also provide an easy means to communicate the results to non-specialists (e.g. molecular biologists, geneticists and physicians).

GRNs are graph-theoretical constructs that describe the integrated state of a cell (or, to be more precise, a small population of similar cells) under certain biological conditions at a given time. GRNs are means for identifying gene interactions from experimental data through the use of theoretical models and computational analysis. The inference of such an interaction connectivity network involves the solution of an inverse problem (a deconvolution) that aims to uncover the interactions from the properties and dynamics of observable behavior in the form of, for example, RNA transcription levels in a characteristic gene expression profile. A growing number of deconvolution methods (also called reverse engineering methods) have been proposed in the past [6, 62]. Their goal is to provide a well-defined representation of the cellular network topology from the transcriptional interactions as revealed by gene expression measurements, which are then treated as samples from a joint probability distribution. The goal of deconvolution methods is the discovery of GRNs based on statistical dependencies within this joint distribution [13]. One major shortcoming is that, surprisingly, there is still no conceptual agreement as to what the dependencies are within these multivariate settings, nor about the role of noise and stochastic dynamics in the problem. The special case of conditional statistical dependence has gained, however, a certain place as a useful criterion in most biomedical applications. The central aim is to find a way to decompose the Statistical Dependency Matrix (SDM), that is, the deviation of a joint probability distribution from the product of its marginals, into a series of well defined contributions coming from interactions of several orders of complexity. IT is therefore the right setting to do so. Typical means to reach this goal consist in the quantification of the new information content that arises when we look at the full joint probability distribution compared to a series of successive independence approximations.

In GRNs each variable of the dataset is represented by a node (or vertex) in the graph. There is a link joining two variable-nodes if these variables exhibit a particular form of dependency (which particular form depends explicitly on the inference method chosen). Some genes can produce a protein (or other biomolecules, such as a microRNA) that is able to activate or repress the production of another gene's protein. There is thus a presence of circuits coded in the DNA of a cell. A useful way to represent these circuits is a graph where the nodes represent the genes and the links or arcs are the interactions between them. Here we will be dealing with reverse engineering methods for GRNs using whole-genome gene expression data as input. This problem is very general and useful in contemporary research in computational molecular biology; however, it is a question that remains open to date, due to its combinatorial nature and the poor information content of the data. Validation of networks against available real-life data will thus be an important stage in the discovery of reliable GRNs.

As we have seen, there are two major shortcomings related to the feature selection and network inference procedures: i) non-linearity and ii) the large number of variables. IT methods are often efficient techniques to deal with issues i) and ii) [52, 22, 21, 38, 26]. It can be seen that most of these methods rely on some form of mutual information metric. Mutual information (MI) is an information-theoretic measure of dependency which is model independent and has been used to define (and quantify) relevance, redundancy and interaction in such large noisy datasets. MI has the enormous advantage that it captures non-linear dependencies [38, 26]. Finally, MI is rather fast to compute, hence it can be calculated a large number of times in a still reasonable amount of time, an explicit requirement in whole-genome transcription analysis.

2. Information Theoretical Measures and Probability Measures

We will introduce here the essential notions of IT that will be used, like entropy, mutual information and other measures. In order to do so, let $X$ and $Y$ denote two discrete random variables having the following features:

- finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively;
- a joint probability mass distribution $p(X, Y)$;
- marginal probability mass distributions $p(X)$ and $p(Y)$.

Let also $\tilde{X}$ and $\tilde{Y}$ denote two additional discrete random variables defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively; the associated probability mass distributions will be $\tilde{p}(X)$ and $\tilde{p}(Y)$, with their joint probability mass distribution $\tilde{p}(X, Y)$ defined on $J$, the joint probability sampling space, $J = \mathcal{X} \times \mathcal{Y}$. For particular realizations, we have $p(x) = P(X = x)$ and $p(y) = P(Y = y)$.

Following Shannon [58], for every discrete probability distribution $X$ it is possible to define the information theoretical entropy $H$ of such a distribution as follows:

$$H = -K_s \sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (1)$$

Here $H$ is called the Shannon-Weaver entropy, $K_s$ is a constant used to determine the units in which entropy is measured, and $p(x)$ is the probability mass for the state of the random variable given by $X = x$. Entropy was originally developed to serve as a measure of the amount of uncertainty associated with the value of $X$, hence relating the predictability of an outcome with the probability distribution.

The Kullback-Leibler divergence, $KL[\cdot\,;\cdot]$, is a non-commutative measure of the difference between two discrete probability distributions [33]:

$$KL\left[ p(Y); \tilde{p}(Y) \right] = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{\tilde{p}(y)} \qquad (2)$$

The joint Kullback-Leibler divergence between two probability mass distributions $p(X, Y)$ and $\tilde{p}(X, Y)$ is given by:

$$KL\left[ p(X, Y); \tilde{p}(X, Y) \right] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x, y)}{\tilde{p}(x, y)} \qquad (3)$$

In a similar way, it is possible to define the conditional Kullback-Leibler divergence between $p(Y|X)$ and $\tilde{p}(Y|X)$ as follows:

$$KL\left[ p(Y|X); \tilde{p}(Y|X) \right] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\tilde{p}(y|x)} \qquad (4)$$

Equation 4 means that the conditional Kullback-Leibler divergence can also be defined as the expected value of the Kullback-Leibler divergence of the conditional probability mass functions, averaged over the conditioning random variable.

Recalling equation 2, we notice that it can be rephrased as follows:


$$KL\left[ p(Y); \tilde{p}(Y) \right] = \sum_{y \in \mathcal{Y}} p(y) \log p(y) - \sum_{y \in \mathcal{Y}} p(y) \log \tilde{p}(y) \qquad (5)$$

We can see that the first term on the right hand side of equation 5 is precisely the negative of the entropy $H(Y)$ as given by equation 1. Shannon's entropy depends on the distribution $p(Y)$ and, as Shannon himself showed [58], it is maximum for a uniform distribution $u(Y)$: $H[u(Y)] = \log |\mathcal{Y}|$. If we replace $\tilde{p}(y)$ by $u(Y)$ in equation 5 we get:

$$H[p(Y)] = \log |\mathcal{Y}| - KL\left[ p(Y); u(Y) \right] \qquad (6)$$

As we can see, equation 6 states that the entropy of a random variable $Y$ is the logarithm of the size of the support set minus the Kullback-Leibler divergence between the probability distribution of $Y$ and the uniform distribution over the same domain $\mathcal{Y}$. Thus, the closer the probability distribution is to a uniform distribution, the higher the entropy. Hence, entropy measures the randomness and unpredictability of a distribution.
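As a simple illustration of these two measures, the following minimal R sketch (ours, with toy distributions; $K_s = 1$ and natural logarithms assumed) computes the entropy of equation 1 and the Kullback-Leibler divergence of equation 2, and verifies the decomposition of equation 6:

    # Shannon entropy (equation 1, Ks = 1, natural log) and Kullback-Leibler
    # divergence (equation 2) for discrete probability vectors.
    entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
    kl      <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

    p <- c(0.7, 0.2, 0.1)   # a toy, rather predictable distribution
    u <- rep(1/3, 3)        # the uniform distribution on the same alphabet

    entropy(p)              # lower than the maximum log(3)
    log(3) - kl(p, u)       # equals entropy(p), as stated by equation 6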

Now, let us consider a pair of discrete random variables $(Y, X)$ with a Joint Probability Distribution (JPD) $p(Y, X)$. For these random variables the joint entropy $H(Y, X)$ is given in terms of the JPD as:

$$H(Y, X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y, x) \qquad (7)$$

We notice that the maximal joint entropy is attained under independence of the random variables $Y$ and $X$, that is, when the JPD factorizes, $p(Y, X) = p(Y)\,p(X)$; in this case the entropy of the JPD is just the sum of the respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:

$$H(Y, X) \leq H(Y) + H(X) \qquad (8)$$

Equality only holds if $X$ and $Y$ are statistically independent.

Also, given a Conditional Probability Distribution (CPD), the corresponding conditional entropy of $Y$ given $X$ can be defined as:

$$H(Y|X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y|x) \qquad (9)$$


Conditional entropies are useful to measure the uncertainty of a random variable once another one (the conditioner) is known. It can be proved [12] that:

$$H(Y, X) = H(X) + H(Y|X) \leq H(Y) + H(X) \qquad (10)$$

Or, in other words:

$$H(Y|X) \leq H(Y) \qquad (11)$$

Equality only holds when $X$ and $Y$ are statistically independent. Expression 11 is extremely useful in the inference/prediction scenario: if $Y$ is a target variable and $X$ is a predictor, adding variables can only decrease the uncertainty on the target $Y$. This will turn out to be almost essential for IT methods of GRN inference.

Entropy reduction by conditioning can be accounted for formally if we consider a measure called the mutual information, $I(Y, X)$, which is a symmetrical measure (i.e. $I(Y, X) = I(X, Y)$) written as:

$$I(Y, X) = H(Y) - H(Y|X) \quad \text{or} \quad I(X, Y) = H(X) - H(X|Y) \qquad (12)$$

If we resort to Shannon's definition of entropy (equation 1) [58] and substitute it into equation 12 we get:

$$I(Y, X) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (13)$$

Mutual information can thus be written as the Kullback-Leibler divergence between the JPD and the product distribution:

$$I(Y, X) = KL\left[ p(X, Y); p(X)\,p(Y) \right] \qquad (14)$$

Mutual information is also given by the Kullback-Leibler divergence between the conditional distribution $p(X|Y)$ and the marginal distribution $p(X)$, averaged over the conditioning variable:

$$I(Y, X) = KL\left[ p(X|Y); p(X) \right] \qquad (15)$$

Mutual information and Kullback-Leibler divergences are two of the most widely used IT measures to solve the GRN inference problem.

A comprehensive catalogue of algorithms to calculate diverse information theoretical measures has been developed for [R], the statistical scientific computing environment [27].
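For instance, assuming such a package is installed, a session of this kind computes discrete estimates of the measures above (shown here with the infotheo package, a representative implementation of these estimators; discretization is needed because expression profiles are continuous):

    # Discrete IT estimators on continuous expression-like profiles
    # (illustrative; infotheo's functions as documented).
    library(infotheo)
    x <- rnorm(100); y <- 0.8 * x + rnorm(100, sd = 0.5)
    d <- discretize(data.frame(x, y))  # equal-frequency binning
    entropy(d$x)                       # Shannon entropy of the binned profile
    mutinformation(d$x, d$y)           # mutual information between profiles
    condentropy(d$y, d$x)              # conditional entropy H(Y|X)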


3. Methods in Regulatory Network Inference

The deconvolution of a GRN can be based on a maximum entropy optimization of the JPD of gene-gene interactions as given by gene expression experimental data, implemented as follows [26]. The JPD for the stationary expression of all genes, $P(\{g_i\})$, $i = 1, \dots, N$, may be written as [38]:

$$P(\{g_i\}) = \frac{1}{Z} \exp\left( -H_{gen} \right) \qquad (16)$$

$$H_{gen} = \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{i,j}(g_i, g_j) + \sum_{i,j,k}^{N} \phi_{i,j,k}(g_i, g_j, g_k) + \dots \qquad (17)$$

Here $N$ is the number of genes, $Z$ is a normalization factor (the partition function), and the $\phi$'s are interaction potentials. A truncation procedure in equation 17 is used to define an approximate hamiltonian $H_p$ that aims to describe the statistical properties of the system. A set of variables (genes) interacts if and only if the potential $\phi$ between such a set of variables is non-zero; the relative contribution of $\phi$ is taken as proportional to the strength of the interaction within this set. Equation 17 does not define the potentials uniquely; thus, additional constraints should be provided in order to avoid ambiguity. A usual approach is to specify the $\phi$'s using maximum entropy (MaxEnt) approximations consistent with the available information on the system in the form of marginals. Information theory provides a set of useful criteria for setting up probability distribution functions (PDFs) on the basis of partial knowledge.

Figure 1. A set of genes $i$ interacts with another set of genes $k$ by means of a potential $\phi \neq 0$, and is non-interacting with another set of genes $j$, since the corresponding potential functional is equal to zero.

The MaxEnt estimate of a PDF is the least biased estimate possible given the information, i.e. the PDF that is maximally non-committal with regard to missing information [28]. It is not possible to constrain the system via the specification of all possible $N$-way potentials when $N$ is large, hence one has to approximate the interaction structure. According to the current genomics literature, sample sizes of order $10^2$ (the usual maximum available in most present-day studies) are generally sufficient to estimate 2-way marginals, whereas 3-way marginals (e.g. triplet interactions $\phi_{i,j,k}(g_i, g_j, g_k)$) require about an order of magnitude more samples, a sample size unattainable under present circumstances. This being the case, one is usually confronted with a 2-way hamiltonian of the form:

$$H_{approx} = \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{i,j}(g_i, g_j) \qquad (18)$$

Under that approximation, the reconstruction (or deconvolution) of the associated GRN consists in the inverse problem of determining the complete set of relevant 2-way interactions $\phi_{i,j}(g_i, g_j)$ consistent with the JPD (equations 16 and 17) that defines all known constrictions, e.g. the values of the stationary expression of genes $g_i$ as given by the set of $\phi_i(g_i)$'s, and non-committal with every other restriction in the form of a marginal. The modeling of a GRN depends on the description of the interactions in the form of several correlation functions. A great deal of work has been done within the framework of the Bayesian Network (BN) approach [51, 23]. BN models, both static and dynamic, have provided a better understanding of the problem in terms of solvability, noise reduction and algorithmic complexity. Since BNs are a form of the Directed Acyclic Graph (DAG) problem, there are several instances (e.g. feed-forward loops, feed-back cycles, etc.) in which the DAG formalism of BNs falls short. It has been noted [6] that BNs require a larger number of data points (samples) to infer the probability density distributions, whereas information theoretical approaches perform well for steady-state data and can be applied even when few experiments (compared to the number of genes) are available. A recently developed approach is the use of statistical and information theoretical models to describe the interactions [36].

If we consider a 2-way interaction hamiltonian, all gene pairs $i, j$ for which $\phi_{i,j} = 0$ are said to be non-interacting. This is true for genes that are statistically independent, $P(g_i, g_j) = P(g_i)\,P(g_j)$, but it also holds for genes that do not interact directly but are connected via other genes, i.e. $\phi_{i,j} = 0$ but $P(g_i, g_j) \neq P(g_i)\,P(g_j)$. Several metrics, such as the Pearson correlation, squared correlation and Spearman ranked coefficients over the sampling universe, have been used, but the performance of these methods is usually poor, as it suffers from a large number of false positive predictions.
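The distinction between direct and indirect statistical dependency can be made concrete with a small R sketch (an illustration of ours, not of the original references; the simulated chain and parameters are assumptions). For a transcriptional chain g1 → g2 → g3 with no direct g1-g3 coupling, plain correlation flags the indirect pair, whereas partial correlations obtained from the inverse covariance matrix (a Gaussian, pairwise MaxEnt-like view) are close to zero for it:

    # Simulated chain g1 -> g2 -> g3 (no direct g1-g3 interaction).
    set.seed(1)
    M  <- 200
    g1 <- rnorm(M)
    g2 <- 0.8 * g1 + rnorm(M, sd = 0.5)
    g3 <- 0.8 * g2 + rnorm(M, sd = 0.5)
    X  <- cbind(g1, g2, g3)

    round(cor(X), 2)   # the indirect g1-g3 pair shows sizable correlation

    # Partial correlations from the precision (inverse covariance) matrix.
    P <- solve(cov(X))
    pc <- -P / sqrt(diag(P) %o% diag(P))
    diag(pc) <- 1
    round(pc, 2)       # the (g1, g3) entry is near zero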

    3.1. Information Theoretical Methods

    3.1.1. Mutual Information

An information theoretical measure that has been used successfully to infer 2-way interactions in GRNs is mutual information (MI) [38, 37, 3, 4]. MI for a pair of random variables $\xi$ and $\eta$ is defined as $I(\xi, \eta) = H(\xi) + H(\eta) - H(\xi, \eta)$. Here $H$ is the information theoretical entropy (Shannon's entropy), $H(x) = -\langle \log p(x_i) \rangle = -\sum_i p(x_i) \log p(x_i)$. MI measures the degree of statistical dependency between two random variables. From the definition one can see that $I(\xi, \eta) = 0$ if and only if $\xi$ and $\eta$ are statistically independent. Estimating MI between gene expression profiles under the high throughput experimental setups typical of today's research in the field is a computational and theoretical challenge of considerable magnitude. One possible approximation is the use of estimators. Under a Gaussian kernel approximation [60], the JPD of a 2-way measurement $X_i = (x_i, y_i)$, $i = 1, 2, \dots, M$, is estimated as [38]:

$$f(X) = \frac{1}{M} \sum_i \frac{1}{h^2}\, G\!\left( \frac{X - X_i}{h} \right) \qquad (19)$$

$G$ is the bivariate standard normal density and $h$ is the associated kernel width [38]. The mutual information can then be evaluated as follows:

$$I(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{f(x_i, y_i)}{f(x_i)\, f(y_i)} \qquad (20)$$

Hence, two genes with expression profiles $g_i$ and $g_j$ for which $I(g_i, g_j) \neq 0$ are said to interact with each other with a strength $I(g_i, g_j) \propto \phi(g_i, g_j)$, whereas two genes for which $I(g_i, g_j)$ is zero are declared non-directly interacting, to within the given approximations. Since MI is reparametrization invariant, one usually calculates the normalized mutual information; in this case $I(g_i, g_j) \in [0, 1]$ for all $i, j$.
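A direct transcription of the estimator of equations 19-20 into R reads as follows (a minimal sketch; the fixed bandwidth h and the toy profiles are assumptions of ours, and practical implementations such as [38] select h more carefully):

    # Gaussian-kernel MI estimator, equations 19-20 (illustrative sketch).
    kernel_mi <- function(x, y, h = 0.25) {
      M   <- length(x)
      fxy <- sapply(seq_len(M), function(i)
        mean(dnorm((x - x[i]) / h) * dnorm((y - y[i]) / h)) / h^2)
      fx  <- sapply(seq_len(M), function(i) mean(dnorm((x - x[i]) / h)) / h)
      fy  <- sapply(seq_len(M), function(i) mean(dnorm((y - y[i]) / h)) / h)
      mean(log(fxy / (fx * fy)))    # equation 20
    }

    set.seed(2)
    x <- rnorm(100); y <- 0.7 * x + rnorm(100, sd = 0.5)
    kernel_mi(x, y)                 # clearly positive for dependent profiles
    kernel_mi(x, rnorm(100))        # near zero for independent profiles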

Figure 2. Panel (i) shows a bivariate interaction between gene A and genes B and C; panel (ii) shows an indirect interaction of gene A on gene C mediated by gene B; panel (iii) depicts two independent interactions, between genes A and B and between genes A and C.

A highly customizable set of algorithms for mutual information inference of gene regulatory networks has been implemented in the [R]/Bioconductor scheme [43, 42] and is called minet. The inference proceeds in two steps. First, the Mutual Information Matrix (MIM) is computed, a square matrix whose $MIM_{ij}$ term is the mutual information between genes $x_i$ and $x_j$. Secondly, an inference algorithm takes the MIM matrix as input and attributes a score to each edge connecting a pair of nodes. Different entropy estimators are implemented in this package, as well as different inference methods, namely aracne, clr and mrnet; finally, the package integrates accuracy assessment tools, like PR-curves and ROC-curves, to compare the inferred network with a reference one [41]. The approach used there is also based on techniques from information theory; it is called the maximum relevance/minimum redundancy algorithm (MRMR) [17], and it is a highly effective information-theoretic technique for feature (or variable) selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the highest mutual information with the target.
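A typical minet session looks as follows (a usage sketch under the assumption that the package is installed and that expr is an expression matrix with samples in rows and genes in columns; function names follow the package documentation):

    # Two-step minet workflow: MIM first, then an inference algorithm.
    library(minet)
    mim <- build.mim(expr, estimator = "mi.empirical", disc = "equalfreq")
    net.mrnet  <- mrnet(mim)    # MRMR-based inference (see below)
    net.aracne <- aracne(mim)   # DPI-based pruning (Section 3.1.3)
    net.clr    <- clr(mim)      # context likelihood of relatedness
    # With a reference network 'refnet', PR/ROC assessment:
    # tbl <- validate(net.mrnet, refnet); auc.pr(tbl); auc.roc(tbl)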

MRNET [41] extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The MRMR method [17], used in conjunction with a best-first search strategy for performing filter selection in supervised learning problems, works as follows. Consider a supervised learning task where the output is denoted by $Y$ and $V$ is the set of input variables. MRMR ranks the set $V$ of inputs according to a score that is the difference between the mutual information with the output variable $Y$ (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). Hence direct interactions should be well ranked, whereas indirect interactions should be badly ranked by the method. A greedy search algorithm starts by selecting the variable $X_i$ that shows the highest mutual information with the target $Y$. The next selected variable $X_j$ will be the one with a high information $I(X_j; Y)$ to the target and at the same time a low information $I(X_j; X_i)$ to the previously selected variable. In the following steps, given a set $S$ of selected variables, the criterion updates $S$ by choosing the variable that maximizes the score. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. The MRMR criterion is therefore an optimal pairwise approximation (a proxy) of the conditional mutual information between any gene $X_j$ and $Y$ given the set $S$ of selected variables, $I(X_j; Y|S)$. MRNET (and minet also) works by repeating this MRMR algorithm for every target gene (or, in any case, for every gene on which to search for de novo transcriptional interactions).
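The greedy scoring step just described can be sketched in a few lines of R (illustrative only; mim is an assumed precomputed mutual information matrix and tgt the index of the target gene):

    # MRMR greedy ranking: relevance to the target minus average redundancy.
    mrmr_rank <- function(mim, tgt, k = 5) {
      candidates <- setdiff(seq_len(ncol(mim)), tgt)
      selected   <- integer(0)
      for (step in seq_len(k)) {
        score <- sapply(candidates, function(j) {
          redundancy <- if (length(selected) == 0) 0 else mean(mim[j, selected])
          mim[j, tgt] - redundancy
        })
        best       <- candidates[which.max(score)]
        selected   <- c(selected, best)
        candidates <- setdiff(candidates, best)
      }
      selected    # ranked neighbor candidates for gene 'tgt'
    }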

MRNET reverse engineers networks by means of a forward selection strategy that aims to identify a maximally-independent set of neighbors for every variable. A known limitation of algorithms based on forward selection, however, is that the quality of the selected subset strongly depends on the first variable selected (dependence on initial conditions). A modified version called mrnetb has been presented [43], an improved version of MRNET that overcomes this shortcoming by using backward selection followed by a sequential replacement; it can be implemented with about the same computational burden as the original forward selection strategy. The optimization problem of MRNET is a form of binary quadratic optimization, for which backward elimination combined with a sequential search is known to perform well. Backward elimination starts with a set containing all the variables and then selects the variable $X_i$ whose removal induces the highest increase of the objective function. The procedure is enhanced by an iterative sequential replacement which, at each step, swaps the status of a selected and a non-selected variable such that the largest increase in the objective function is achieved. The sequential replacement is stopped when no further improvement is met [43]. Forward selection, backward elimination, and sequential replacement all have an algorithmic complexity of $O(n^2)$, so that the network built by backward elimination followed by sequential replacement has the same asymptotic computational cost as the one based on a forward selection strategy alone.

As one may further notice, the inference of GRNs by means of such high performance IT methods is hindered by large computational complexity. The limiting step in these approaches is the time-consuming computation of the MI matrix. A method has been proposed by Qiu and colleagues [53] to reduce this computation time. It is based on the application of spectral analysis to re-order the genes, so that genes that share regulatory relationships are more likely to be placed close to each other. Then, using a sliding window approach with appropriate window size and step size, the MI for the genes within the sliding window is computed, and the remainder is assumed to be zero. Qiu's method does not incur performance loss in regions of high precision and low recall, while the computational time is significantly lowered. The essence of Qiu's method is as follows. To determine the new gene ordering, a Laplacian matrix is derived from the correlation matrix of the gene expression data, assuming the correlation matrix provides an adequate approximation to the adjacency matrix for this purpose; then the Fiedler vector [11] is computed, which is the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix. Since the Fiedler vector is smooth with respect to the connectivity described by the Laplacian matrix, the elements of the Fiedler vector are then sorted to obtain the desired gene ordering. The computational complexity of obtaining the gene ordering is negligible compared to the computation of the MI matrix. The reduction in computational complexity is the result of computing only the diagonal band of the reshuffled MI matrix. Because the remaining entries of the MI matrix are set to zero, there is a potential loss of reconstruction accuracy, although due to the Fiedler minimization [53] this effect is not expected to be significant. In fact, according to a benchmark of the method [53], in the high-precision low-recall regime applying the sliding window causes no performance loss; in some cases, applying the sliding window yields slightly better performance. In the low-precision regime, however, the windowed version has lower recall, but this regime is dismissed, because there one is not able to distinguish biologically meaningful links from false positive ones.
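The spectral re-ordering step at the heart of Qiu's method can be sketched as follows (an illustration; expr is an assumed samples-by-genes matrix, and using the absolute correlation as proxy adjacency is a simplifying choice made here):

    # Fiedler-vector gene re-ordering prior to windowed MI computation.
    A <- abs(cor(expr))          # correlation matrix as proxy adjacency
    diag(A) <- 0
    L <- diag(rowSums(A)) - A    # graph Laplacian

    eig <- eigen(L, symmetric = TRUE)   # eigenvalues in decreasing order
    fiedler <- eig$vectors[, ncol(eig$vectors) - 1]  # 2nd smallest eigenvalue
    ordering <- order(fiedler)   # genes sorted along the Fiedler vector

    expr <- expr[, ordering]     # MI is then computed in a sliding window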

    3.1.2. Markov Random Fields

A Markov random field is an $n$-dimensional random process defined on a discrete lattice. Usually the lattice is a regular 2-dimensional grid in the plane, either finite or infinite. Assuming that $X_n$ is a Markov chain taking values in a finite set,

$$P(X_n = x_n \,|\, X_k = x_k,\ k \neq n) = P(X_n = x_n \,|\, X_{n-1} = x_{n-1},\ X_{n+1} = x_{n+1}) \qquad (21)$$

Hence, the full conditional distribution of $X_n$ depends only on the neighbors $X_{n-1}$ and $X_{n+1}$. In the 2-D setting, $S = \{1, 2, \dots, N\} \times \{1, 2, \dots, N\}$ is the set of $N^2$ points, called sites or states. The aforementioned construction defines a conditional Markov random field [32].

Markov random field (MRF) models have been applied in several scenarios within the computational molecular biology setting, for instance with regard to functional prediction of proteins in protein-protein interaction networks [14, 15, 35], in the discovery of molecular pathways from protein interaction and gene expression data [56], and in general network-based analysis of genomic data [63, 64]. In the case of reverse engineering methods for network inference, an MRF model can be stated as follows [63].

An arbitrary state assignment for a gene set will be denoted by $x = (x_1, x_2, \dots, x_p)$; here $x_i$ is the expression state (either equally or differentially expressed, 0 or 1 respectively) of gene $i$. Let $x$ be the true but unknown gene expression state. We can interpret this as a particular realization of a random vector $X = (X_1, X_2, \dots, X_p)$, where $X_i$ assigns an expression state to gene $i$. Let $y_i$ stand for the experimentally observed mRNA expression level of gene $i$ and $y$ the corresponding vector, which here is interpreted as a particular realization of a random vector $Y = (Y_1, Y_2, \dots, Y_p)$. $Y_i$ itself is a vector, $y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,m}, y_{i,m+1}, y_{i,m+2}, \dots, y_{i,m+n})$; this vector contains $m$ replicates under one condition and $n$ replicates under the other condition. The joint distribution of $Y$ can be given in terms of an MRF; to write down this joint probability we need to know the conditional dependence/independence structure. Information theory can then be useful to determine such conditional dependencies from the distributions. One way to do that is by means of the so-called Iterated Conditional Modes (ICM) algorithm [63], but other IT-based alternatives can also be used.

Conditional dependencies are not the only application of IT and MRFs in transcriptional network inference. To study functional robustness in GRNs, Emmert-Streib and Dehmer [20] modeled the information processing within the network as a first order Markov chain and studied the influence of single gene perturbations on the global, asymptotic communication among genes. Differences were accounted for by an information theoretic measure that allowed them to predict genes that are fragile with respect to single gene knockouts. The information theoretic measure used to capture the asymptotic behavior of information processing evaluates the deviation of the unperturbed (or normal, $n$) state from the perturbed ($p$) state caused by the perturbation of gene $k$. The relative entropy or Kullback-Leibler (KL) divergence was used to quantify this deviation:

$$KL_{i,k} = KL\left[ p^{p,i,k};\ p^{n,i} \right] = \sum_m p^{p,i,k}(m) \log \frac{p^{p,i,k}(m)}{p^{n,i}(m)} \qquad (22)$$

In equation 22 the stationary distributions $p^{p,i,k}$ and $p^{n,i}$ are given by:

$$p^{p,i,k} = \lim_{t \to \infty} T_k^t\, p_i^0 \qquad (23)$$

$$p^{n,i} = \lim_{t \to \infty} T^t\, p_i^0 \qquad (24)$$

The Markov chain given by $T_k$ corresponds to the process obtained by perturbing gene $k$ in the network. By means of this Markov chain model, supplemented with an information theoretical KL measure, Emmert-Streib and Dehmer [20] were able to study the asymptotic behavior of the transcriptional regulatory network of yeast regarding information propagation under the influence of single gene perturbations. Hence not only static network properties (such as structure) of transcriptional regulation networks, but also dynamic features (such as robustness), can be analyzed from the standpoint of IT. The study concludes that knocked-out genes destroy some communication paths and, hence, can still have a strong impact on the information processing within the cell. It seems reasonable to assume that the further away the knockout gene is from the starting gene (say, in Dijkstra distance [16]) the smaller the impact will be. This is strong evidence that information processing on a systems level depends crucially on the information processing in a local environment of the gene that sends the information.
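The asymptotic comparison of equations 22-24 is easy to reproduce on a toy chain (a sketch of ours; the 4-state column-stochastic matrices and the chosen perturbation are assumptions, not the yeast network of [20]):

    # Stationary distributions by power iteration and their KL divergence.
    stationary <- function(Tmat, p0, steps = 500) {
      p <- p0
      for (t in seq_len(steps)) p <- as.vector(Tmat %*% p)   # T^t p0
      p
    }
    kl <- function(p, q) sum(p * log(p / q))

    Tn <- matrix(c(0.7, 0.1, 0.1, 0.1,
                   0.2, 0.6, 0.1, 0.1,
                   0.1, 0.2, 0.6, 0.1,
                   0.1, 0.1, 0.2, 0.6), 4, 4)  # normal chain (columns sum to 1)
    Tk <- Tn; Tk[, 2] <- rep(0.25, 4)          # perturb "gene" 2

    p0 <- c(1, 0, 0, 0)
    kl(stationary(Tk, p0), stationary(Tn, p0)) # KL_{i,k} of equation 22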

From the perspective of information processing, the connection between asymptotic information change and local network structure, as represented by node degrees, is interesting because it indicates that a local subgraph may be sufficient to study information processing in the overall network. This finding is particularly relevant because it would allow a reduction of the computational complexity (and of the computational burden) that arises when studying large genomes on a systems scale.

    3.1.3. Data Processing Inequality

In engineering and information theory, the data processing inequality (DPI) is a simple but useful theorem which states that no matter what processing you do on some data, you cannot get more information (in the sense of Shannon [58]) out of a set of data than was there to begin with. In a sense, it provides a bound on how much can be accomplished with signal processing [12]. More quantitatively, consider two random variables, $X$ and $Y$, whose mutual information is $I(X, Y)$. Now consider a third random variable, $Z$, that is a (probabilistic) function of $Y$ only. The "only" qualifier means $P_{Z|XY}(z|x, y) = P_{Z|Y}(z|y)$, which in turn implies that $P_{X|YZ}(x|y, z) = P_{X|Y}(x|y)$, as is easy to show using Bayes' theorem. The DPI states that $Z$ cannot have more information about $X$ than $Y$ has about $X$; that is, $I(X; Z) \leq I(X; Y)$. This inequality, which again is a property that Shannon's information should have, can be proved thus: $I(X; Z) = H(X) - H(X|Z) \leq H(X) - H(X|Y, Z) = H(X) - H(X|Y) = I(X; Y)$. The inequality follows because conditioning on an extra variable (in this case $Y$ as well as $Z$) can only decrease entropy, and the second-to-last equality follows because $P_{X|YZ}(x|y, z) = P_{X|Y}(x|y)$. This same principle is applicable both to engineering control systems and to biological signal processing such as that present in GRNs [38, 57].

In reference [38] the DPI states that if genes $g_1$ and $g_3$ interact only through a third gene, $g_2$, then $I(g_1, g_3) \leq \min\left[ I(g_1, g_2);\ I(g_2, g_3) \right]$. Hence the least of the three MIs can come from indirect interactions only, so that the proposed algorithm (ARACNe) examines each gene triplet for which all three MIs are greater than a threshold $I_0$ and removes the edge with the smallest value. The DPI is thus useful to efficiently quantify the dependencies among a large number of genes. The ARACNe algorithm eliminates those statistical dependencies that might be of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will very likely have non-linearly correlated expression profiles, which may result in high MI, and would otherwise be selected as candidate interacting genes. Given a transcription factor, application of the DPI will generate predictions about other genes that may be its direct transcriptional targets or its upstream transcriptional regulators [39, 25].
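A minimal sketch of DPI-based pruning in the spirit of ARACNe follows (illustrative, not the reference implementation; mim is an assumed symmetric MI matrix already thresholded at $I_0$, and eps is a tolerance against MI estimation noise):

    # Remove the weakest edge of every fully connected gene triplet.
    dpi_prune <- function(mim, eps = 0) {
      net <- mim
      n <- ncol(mim)
      for (i in 1:(n - 2)) for (j in (i + 1):(n - 1)) for (k in (j + 1):n) {
        trio <- c(mim[i, j], mim[i, k], mim[j, k])
        if (all(trio > 0)) {
          w <- which.min(trio)
          if (trio[w] < min(trio[-w]) - eps) {     # strictly the weakest
            if (w == 1) net[i, j] <- net[j, i] <- 0
            if (w == 2) net[i, k] <- net[k, i] <- 0
            if (w == 3) net[j, k] <- net[k, j] <- 0
          }
        }
      }
      net
    }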

The use of the DPI may result not only in a better assessment of the results but also in a significant reduction of the computational burden associated with network inference. Zola et al. [67] presented a parallel method integrating mutual information, the data processing inequality, and statistical testing to detect significant dependencies between genes, and to efficiently exploit the parallelism inherent in such computations. They developed a method to carry out permutation testing for assessing the statistical significance of interactions, while reducing its computational complexity by a factor of $O(n^2)$, where $n$ is the number of genes. They solved the problem of inference (usually consuming thousands of computation hours) at the whole-genome network level by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster [67].

    3.1.4. Minimum Description Length

One of the major drawbacks of information theoretic models for inferring GRNs is that of setting up a threshold which defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem [10, 19]. The description length used by the MDL principle is the sum of the model length and the data encoding length. A user-specified fine tuning parameter is used as a control mechanism between model and data encoding, but it is difficult to find the optimal parameter. A new inference algorithm has been proposed which incorporates mutual information (MI), conditional mutual information (CMI) [defined in terms of the associated conditional entropies] and the predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data.


It is also noticeable that the MDL principle helps to achieve a good trade-off between network model complexity and the accuracy of data fitting, since, given a network and a dataset, the MDL principle evaluates simultaneously the goodness of fit of the network and the data. Intuitively, the more complicated the network is, the better the data will be fitted. However, models which are over-fitted relative to the actual system are very often selected, which gives rise to numerous errors. MDL aims to achieve a good trade-off between model complexity and fitness to the data. A general criterion is thus obtained for constructing the network so as to contain only direct interactions. The convergence of the proposed MDL-based network inference algorithms can be assessed by the recovery of the topology of some artificial networks and through the error rate plots obtained through extensive simulations on datasets produced by synthetic networks [66].
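Schematically, the two-part trade-off can be written down in a few lines (an assumed toy encoding for illustration only, not the specific code lengths of [10, 19]): the candidate network minimizing the sum of model bits and data bits is selected, with no free threshold parameter.

    # Two-part description length of a candidate network (toy encoding).
    description_length <- function(net, residual_ss, n_obs) {
      k <- sum(net != 0) / 2                         # number of edges
      model_bits <- k * log2(choose(ncol(net), 2))   # bits to name each edge
      data_bits  <- (n_obs / 2) * log2(residual_ss / n_obs)  # Gaussian fit
      model_bits + data_bits                         # minimize over candidates
    }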

3.1.5. Kullback-Leibler Divergence

The Kullback-Leibler divergence [33] (as well as its symmetrized version, the Jensen-Shannon measure) is, as it turns out, a very commonly used information measure in GRN inference and other problems in computational molecular biology, either as the unique measure [45, 44] or used in conjunction with other indicators, such as spectral metrics [29], Markov fields [20], minimum description lengths [19], Bayesian networks [50, 31, 46, 48] and multivariate analysis [40].

However, by far the most general use of the KL divergence within the GRN information setting is in playing the role of the multi-information. It is known [40] that for two variables, $X_1$ and $X_2$, independence is well defined via decomposition of the bivariate JPD, $P(X_1, X_2) = P(X_1)\,P(X_2)$, and the mutual information $I(X_1; X_2) = \left\langle \log_2 \frac{P(X_1, X_2)}{P(X_1)\,P(X_2)} \right\rangle$ is the only measure of dependence [58]. Along the same lines, the total interaction (i.e. the deviation from independence) in a multivariate JPD, $P(X_i)$, $i = 1, \dots, N$, can be measured by the multi-information as follows:

$$I(X_1; X_2; \dots; X_N) = KL\left[ P(X_1, X_2, \dots, X_N);\ \tilde{P} \right] = KL\left[ P(X_1, X_2, \dots, X_N);\ \prod_i P(X_i) \right] \qquad (29)$$

Here $P(X_1, X_2, \dots, X_N)$ is the full JPD and $\tilde{P} = \prod_i P(X_i)$ is the probability distribution approximated under the independence assumption. Since $\tilde{P}$ is the maximum entropy (MaxEnt) distribution [28] that has the same univariate marginals as $P$, only without statistical dependencies among the variables, the multi-information is given by the KL divergence between the JPD and its MaxEnt approximation with univariate marginal constraints. This KL divergence measures the gain in information from knowing the complete JPD against assuming total independence. In a similar fashion, MaxEnt distributions consistent with various multivariate marginals of the JPD introduce no statistical interactions apart from the corresponding marginals. By comparing the JPD to its MaxEnt approximations under various marginal constraints, we expect to separate dependencies included in the low-order statistics from those not present in them [40].

Assuming that we have an $N$-variable GRN and that we know a set of marginal distributions of all variable subsets (of size $k \geq 1$), one can ask what is the JPD $P^{(k)}$ that captures all multivariate interactions prescribed by these marginals, but introduces no additional dependencies. This is of course equivalent to searching for the minimum $I(X_1; X_2; \dots; X_N)$ or, conversely, the maximum entropy $H(X_1, X_2, \dots, X_N)$, turning our inference problem into a MaxEnt problem:

$$P^{(k)} \leftarrow \arg\max_{P, \{\lambda\}} \left[ H(P) - \sum_{M} \lambda_M \left( P^{(k)}_M - P_M \right) \right] \qquad (30)$$

where $M$ runs over the set of constrained marginals.
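For a small discrete system, the multi-information of equation 29 can be computed directly (a sketch; the random 2x2x2 joint array is a toy example of ours):

    # Multi-information: KL between the JPD and the product of its marginals.
    set.seed(3)
    P <- array(runif(8), dim = c(2, 2, 2)); P <- P / sum(P)

    p1 <- apply(P, 1, sum); p2 <- apply(P, 2, sum); p3 <- apply(P, 3, sum)
    Ptilde <- outer(outer(p1, p2), p3)   # MaxEnt/independence approximation

    sum(P * log2(P / Ptilde))            # equation 29, in bits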

    3.1.6. Information Based Similarity

A promising approach consists in considering that the interactivity of the system is based on communication channels (either real or abstract) for the bio-signals. Thus, information theory could play a useful role in identifying entropic measures between pairs $\{g_i, g_j\}$ of genes within the sampling universe as potential interactions $\phi_{i,j}$. IT can also provide means to test for the MaxEnt distribution, by considering, for example, the Kullback-Leibler (KL) divergence (in the sense of multi-information) or the Connected Information as criteria of iterative convergence to the MaxEnt PDF, in the same sense that the cumulative distribution leads to the specification of usual PDFs [61].

One possible approach that we propose below is based on the quantification of the so-called Information-Based Similarity index (IBS) [65], initially developed to work out the complex structure generated by human heart beat time series. Nevertheless, IBS has proved to be a very powerful tool in the comparison of the dynamics of highly nonlinear processes. Within the present context [26], the symbolic sequence represents the expression values of a single gene (say the $k$-th gene) along the sampling universe (of size $M$), as given by a vector $\sigma = g_k = (g_{k_1}, g_{k_2}, \dots, g_{k_M})$. Let us consider a series $\sigma$ that could well represent a gene expression vector. It is possible to classify each pair of successive points into one of the following binary states $B_n$: if $(\sigma_{n+1} - \sigma_n) < 0$ then $B_n = 0$; otherwise $B_n = 1$. This procedure maps the $M$-step real-valued time series $\sigma(i)$ into an $(M-1)$-step binary-valued series $B(i)$. It is now possible to define a binary sequence of length $m$ (called an $m$-bit word). Each of the $m$-bit words $w_k$ represents a unique pattern in a given time series. For every unitary time shift, the algorithm collects a different set $W$ of $m$-bit words over the whole time series, $W = \{w_1, w_2, \dots, w_n\}$. It is expected that the frequency of occurrence of these $m$-bit words will reflect the underlying dynamics of the original (real-valued) time series. We then seek to write down a probability distribution function in the rank-frequency representation (RF-PDF). This RF-PDF represents the statistical hierarchy of symbolic words of the original series [65]. Two given symbolic sequences are said to have similarity if they give rise to similar probability distribution functions.

Following the same order of ideas, Yang and collaborators [65] defined a measure of similarity (akin to statistical equivalence) between two series by plotting the rank number of every $m$-bit word in the first series against the rank of the same $m$-bit word in the second series. Of course, since the series are supposed to be finite, the $m$-bit words are not equally likely to appear. The method introduces the likelihood of each word by defining a weighted distance $\Delta_m$ between two given symbolic sequences $\sigma_1$ and $\sigma_2$ as follows:

$$\Delta_m(\sigma_1, \sigma_2) = \frac{1}{2^m - 1} \sum_{k=1}^{2^m} \left| R_1(w_k) - R_2(w_k) \right| F(w_k) \qquad (31)$$

$F(w_k)$ is the normalized likelihood of the $m$-bit word $k$, weighted by its Shannon entropy, i.e.:

$$F(w_k) = \frac{1}{Z} \left[ -p_1(w_k) \log p_1(w_k) - p_2(w_k) \log p_2(w_k) \right] \qquad (32)$$

$p_i(w_k)$ and $R_i(w_k)$ represent the probability and rank of a given word $w_k$ in the $i$-th series. The normalization factor in equation 32 is the total Shannon entropy of the ensemble and is calculated as $Z = \sum_k \left[ -p_1(w_k) \log p_1(w_k) - p_2(w_k) \log p_2(w_k) \right]$. $\Delta_m(\sigma_1, \sigma_2)$ is called the Information-Based Similarity index (IBS) between series $\sigma_1$ and $\sigma_2$ (e.g. expression vectors $g_1$ and $g_2$ for genes 1 and 2, respectively). One notices that $\Delta_m(\sigma_1, \sigma_2) \in [0, 1]$ for all $\sigma_1, \sigma_2, m$; in fact, one is able to consider $\Delta_m(\sigma_1, \sigma_2)$ as a probability measure. If $\Delta_m(\sigma_1, \sigma_2) \to 1$ the series are absolutely dissimilar, whereas in the opposite case ($\Delta_m(\sigma_1, \sigma_2) \to 0$) the two series become equivalent (in the statistical sense). One can then approximate the value of the interaction potentials $\phi(g_i, g_j)$ as follows. If one considers interaction as given by correlation or information flow, one notices that high values of $\Delta_m$ imply stronger dissimilarity, hence lower correlation; since $\Delta_m$ is a probability measure, one can define the complementary measure $\bar{\Delta}_m = 1 - \Delta_m$ and then approximate $\phi(g_i, g_j) \approx \bar{\Delta}_m(g_i, g_j)$.
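The whole IBS pipeline, from binarization to equation 31, fits in a short R sketch (illustrative; the word length m = 3 and natural logarithms are choices of ours, not prescriptions of [65]):

    # Rank-frequency statistics of m-bit words for one series.
    word_ranks <- function(series, m = 3) {
      B <- as.integer(diff(series) >= 0)        # binary increment series
      idx <- seq_len(length(B) - m + 1)
      words <- sapply(idx, function(i) paste(B[i:(i + m - 1)], collapse = ""))
      lev <- apply(expand.grid(rep(list(0:1), m)), 1, paste, collapse = "")
      p <- as.numeric(table(factor(words, levels = lev)))
      p <- p / sum(p)
      list(p = p, rank = rank(-p, ties.method = "first"))
    }

    # IBS distance between two series, equations 31-32.
    ibs <- function(s1, s2, m = 3) {
      a <- word_ranks(s1, m); b <- word_ranks(s2, m)
      h <- -a$p * log(pmax(a$p, 1e-12)) - b$p * log(pmax(b$p, 1e-12))
      Fw <- h / sum(h)                              # equation 32
      sum(abs(a$rank - b$rank) * Fw) / (2^m - 1)    # equation 31
    }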

    4. Bayesian and Machine Learning Methods

Systems biology aims to understand biological processes in living systems by developing mathematical models which are capable of integrating both experimental and theoretical knowledge, and it works both ways: given a pre-specified mathematical framework, the behavior of a set of genes in a specific GRN can be simulated under a variety of biological conditions and used to test hypotheses. But also, given a particular pre-specified mathematical framework, the observation of gene behavior under specific conditions may be used to infer the underlying GRN. Generally speaking, the reconstruction of a GRN based on experimental data is known as a reverse engineering approach.

In the context of information theory combined with systems biology, there are two well known information extraction approaches, characterized as top-down and bottom-up; both have been used to infer GRNs from high-throughput data sources such as microarray gene expression measurements. A top-down approach mainly breaks down a system in order to gain insights into it. On the other hand, bottom-up approaches seek to construct synthetic gene networks.

The simplest network in an information theory approach is the correlation network. This is an undirected graph with edges that are weighted by correlation coefficients. It is simple, computationally manageable and has small data requirements. The drawback is that these models are static and do not infer the causality of gene regulation.
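As a baseline for what follows, a correlation network takes only a couple of lines of R (a sketch; expr is an assumed samples-by-genes matrix and the 0.6 threshold is arbitrary):

    # Undirected correlation network with correlation-weighted edges.
    W <- cor(expr)
    A <- (abs(W) > 0.6) & upper.tri(W)   # thresholded adjacency
    e <- which(A, arr.ind = TRUE)        # gene index pairs (i, j)
    data.frame(from   = colnames(expr)[e[, 1]],
               to     = colnames(expr)[e[, 2]],
               weight = W[e])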


4.1. Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical network model, described by a directed acyclic graph (DAG). In the model each node represents a random variable and edges define conditional independence relations between these random variables. These relationships, e.g. gene-gene interactions, can be seen in a directed graph without cycles. Without cycles means a gene may have no direct or indirect interaction with itself. In order to reverse engineer a gene network using this approach, one would need to find the directed acyclic graph that best describes the gene expression data. This particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network.

    4.2. Dynamic Bayesian Networks

    Bayesian networks that model sequences of variables are called dynamic

    Bayesian networks (DBNs). Murphy and Mian [47] first introduced the use of

    DBNs to model gene expression time series data. The benefits of DBNs include

    the ability to handle latent variables and missing data (such as transcription

    factor protein concentrations, which may have an effect on mRNA steady state

    levels) and to model stochasticity. Friedman et al. [23] explored experimental

    applications to microarray data analysis. Dynamic Bayesian networks may also

    use continuous measurements rather than discrete. Feedback loops can also

    be unfolded with respect to time, by explicitly modeling the influence of gene

g_1 at time t_1 on another gene g_2 at time t_2, where t_2 > t_1. An appropriate model for gene expression microarray data belongs to the class of linear state-space models, widely used in estimation and control problems arising in system

    modeling. These models consist of a state variable that is either unobserved

    or partially observed, an observable that evolves in a linear relation to the state

    variable, and a structural specification which is a set of parameters in the linear

    and distributional relationships between state variables, observables, and noise

    terms.

    4.3. State-Space Models

    State-Space models, also known as Linear Dynamical Systems (LDS), are

    a subclass of dynamic Bayesian networks. A state space model is a mathemati-

    cal model for a process that accepts inputs which are the drivers of the process

    and generates outputs that are interpreted as observable manifestations of what


is going on inside the process, and how this internal behavior is affected by the inputs. These models are suitable for modeling time series data where we have

    a series of observations related to a series of unobserved variables changing

    over time. Time series models in state-space representation can be thought of as

unobserved-component models. The state vector represents those unobserved,

    or hidden or missing variables and their dynamics over time are governed by

    a state transition equation. In the very general setting of a state-space model,

    the state vector determines the future evolution of the dynamic system, given

    future time paths of all of the variables affecting the system. The variables are

    not restricted, they can be either discrete with a countable number of possible

    values or continuous with an associated density curve. For example, modeling

    gene expression data assumes continuous variables and requires the inclusion

    of hidden states. Hidden variables could model the effects of genes that have

not been included in the experiment; they could also model levels of regulatory

    proteins as well as possible effects of mRNA or protein degradation. One goal

    is to infer the characteristics and properties of the unobserved variables based

    on the observations. In linear state-space models, a sequence of p-dimensional

real-valued observation vectors {y_1, ..., y_T} is modeled by assuming that at each time step y_t was generated from a K-dimensional real-valued hidden (i.e. unobserved) state variable x_t, and that the sequence of x's is governed by a first-order Markov process. This type of model is shown pictorially in Figure 3.

A linear-Gaussian state-space model of the time series {y_t} is specified by the matrices A and C, called system matrices, and is described by a pair of equations:

x_{t+1} = A x_t + w_t    (33)

y_t = C x_t + v_t    (34)

    These two equations represent the most basic form of a state-space model.

The vector x_t ∈ R^K is called the state vector at time t. The state equation (33) shows how this vector evolves with time. A is the dynamic or transition state matrix, and its eigenvalues are important in determining the way the data behave. The observation equation (34) specifies the relationship between the observed data and this newly introduced vector x_t. C describes the relation between state and observation, and w_t and v_t are zero-mean random noise vectors.

In the most general case the noise vectors could be mutually correlated,

    although serially uncorrelated. In the particular Linear Gaussian case they are


    Figure 3. State-Space model.

mutually independent and independent of the initial state value x_0. Assuming that the initial state x_0 is fixed or Gaussian distributed, and that the noise vectors are jointly Gaussian, the state and output of the system are also Gaussian. That is, all future hidden states x_t and observations y_t generated from those hidden states will be Gaussian distributed.
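A minimal simulation of the pair (33)-(34) under the linear-Gaussian assumptions makes this generative reading explicit; all dimensions and matrix values below are toy choices.

import numpy as np

rng = np.random.default_rng(0)
K, p, T = 2, 4, 100                      # hidden dim, observed dim, time steps
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # state transition matrix (toy values)
C = rng.normal(size=(p, K))              # state-to-observation matrix
Q = 0.1 * np.eye(K)                      # covariance of the state noise w_t
R = 0.5 * np.eye(p)                      # covariance of the observation noise v_t

x = np.zeros(K)                          # fixed initial state x_0
X, Y = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(K), Q)   # eq. (33)
    y = C @ x + rng.multivariate_normal(np.zeros(p), R)   # eq. (34)
    X.append(x)
    Y.append(y)
X, Y = np.array(X), np.array(Y)          # hidden and observed trajectories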

This model has been studied extensively in the state-space literature. Brockwell

    and Davis [7] develop the state-space model described by (33) and (34) as

    well as the associated Kalman filter recursions and apply these in representing

    ARMA (autoregressive moving average) and ARIMA (autoregressive integrated

    moving average) processes. The Kalman filter recursions define recursive esti-

mators for the state vector x_t, given observations up to the present time t. Stoffer and Shumway [59] present a similar development and apply it to representing

    ARMAX (autoregressive-moving average with exogenous terms) models. Stof-

    fer and Shumway also develop the recursive smoother, which gives estimators

of the state variable x_t given observations prior to and after time t, and develop state-space models that include exogenous inputs in the state equation, observa-

    tion equation, or both. State-space models can be written in different ways. The

structure of the model used in this work includes exogenous variables in both

    equations and its derivation is detailed in the next section.
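As a sketch of the filter recursions mentioned above, the following function computes the filtered estimates E[x_t | y_1, ..., y_t] for the basic model (33)-(34); this is the textbook predict/update cycle under assumed noise covariances Q and R, not the exact implementation referenced here.

import numpy as np

def kalman_filter(Y, A, C, Q, R, x0, P0):
    # Forward pass of the Kalman filter for x_{t+1} = A x_t + w_t,
    # y_t = C x_t + v_t; returns the filtered state estimates.
    x, P = x0, P0
    filtered = []
    for y in Y:
        x_pred = A @ x                            # predict the next state
        P_pred = A @ P @ A.T + Q
        S = C @ P_pred @ C.T + R                  # innovation covariance
        K_gain = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
        x = x_pred + K_gain @ (y - C @ x_pred)    # update with observation y
        P = P_pred - K_gain @ C @ P_pred
        filtered.append(x)
    return np.array(filtered)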


4.4. LDS Model for Gene Expression

Fluorescent intensities are measures of gene expression levels. Values of

    some of these variables influence the values of others through the regulatory

    proteins they express, including the possibility that the expression of a gene at

    one time point may, in various circumstances, influence the expression of that

    same gene at a later time point.

To model the effects of the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables, we modify the structure of the LDS model with inputs as follows. We let the observation y_t^{(i)} = g_t^{(i)}, the expression level of gene i at time point t, and the inputs h_t = g_t and u_t = g_{t-1}, to give the model shown in Figure 4.

    Figure 4. Bayesian network representation of the model for gene expression.

This model is described by the following equations:

x_{t+1} = A x_t + B g_t + w_t    (35)

g_t = C x_t + D g_{t-1} + v_t    (36)


Model Assumptions

The vector u_t ∈ R^{p_u} is the exogenous input observation vector, and h_t ∈ R^{p_h} represents the exogenous influence on the hidden states. As before, the state and observation vectors x_t and y_t have dimensions K and p, respectively. A is the state transition matrix, B is the input-to-state matrix in the state transition equation, C is the state-to-observation matrix, and D is the input-to-observation matrix. The state and observation noise vectors, w_t and v_t respectively, are random vectors, serially independent and identically distributed, and also independent of the initial values of x and y and independent of one another.

Remarks

The system matrices A, B, C, D are taken to be constant in this work, but they may also vary over time, in which case it is appropriate to add a subscript indicating this. When the sequence {x_1, w_1, ..., w_T} is independent, the distribution of x_{t+1} | x_t, ..., x_1 is the same as the distribution of x_{t+1} | x_t; hence the state vector x_t evolves with a first-order Markov property, with A as the transition matrix. The noise vectors can also be viewed as hidden variables. Here the matrix D in the observation equation captures gene-gene expression level influences at consecutive time points, whilst the matrix C captures the influence of the hidden variables on gene expression level at each time point. Matrix B models the influence of gene expression values from previous time points on the hidden states, and A is the state transition matrix. However, our interest focuses on CB + D, which captures not only the direct gene-to-gene interactions but also the gene-to-gene interactions mediated through the hidden states over time. This is the matrix we will concentrate the analysis on, since it captures all of the information related to gene-gene interaction over time.
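Once estimates of the system matrices are in hand, assembling this interaction matrix is a single product and sum; the random matrices below are placeholders standing in for fitted estimates, and the reading of the entries is only indicative.

import numpy as np

K, p = 3, 5                       # hidden and observed dimensions (toy sizes)
C = np.random.randn(p, K)         # state-to-observation estimate
B = np.random.randn(K, p)         # input-to-state estimate
D = np.random.randn(p, p)         # input-to-observation estimate

interaction = C @ B + D           # gene-gene interaction matrix over time
# a large |interaction[i, j]| suggests that gene j at time t-1
# influences gene i at time t, directly or through the hidden states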

    5. Constrained LDS

    Mathematically speaking, the idea of adding constraints to the model is ba-

    sically to reduce the number of parameters to estimate. Narrowing down the


range of parameters to estimate by adding extra constraints reduces dimensionality, which can considerably simplify the search for the parameters that best describe the model. Throughout constrained modeling, diagnostics should be run to make sure the model still fits well once the constraints are taken into account. How precisely to include these forms of information in the inference process is not a straightforward task; this, however, is the true art of modeling.

    From the biological point of view, the current application to gene expression

    data is already complex. Data generation, low-level analyses and classification

    are known to be crucial in getting gene expression levels. Different algorithms

can lead to different sets of genes. Hence, biological knowledge mining should be present in any machine learning approach. In this sense, any knowledge about gene behavior and regulatory interactions is helpful. If this additional information can be included and modeled, estimation becomes more realistic, not only due to the reduction of parameters but also due to a more biologically grounded approach.

    Given either a-priori or new hypothesized information leading to a set of

    plausible models, the LDS model is re-trained based on this knowledge about

    the parameters. The a-priori information would be supplied by past experiments

    or biological knowledge, while the new hypothesized information is obtained

from the bootstrap analysis.

    5.1. Model Definition

    Two competing motivations must be kept in mind when defining a model:

fidelity and tractability. The model's fidelity describes how closely it corresponds to reality. On the other hand, the model's tractability focuses on the ease

    with which it can be mathematically described as well as analyzed and validated

    statistically based on observations and measurements. It is understandable that

    increasing one (either fidelity or tractability) is usually done at the expense of

    the other. Consequently, the ideal model should be developed in close cooper-

    ation between the science governing the application and feasible mathematical

    and statistical methods. One common assumption that aids tractability is that

    model errors are normally (or Gaussian) distributed. Indeed, a large number

    of existing algorithms and methods of statistical inference are based on jointly

    Gaussian observables. Though rarely satisfied exactly in practice, this assump-

    tion is often justified because it makes the analysis of the model tractable and

    the resulting statistical inferences are robust in the sense of being insensitive to


small departures from normality. The model definition used in this work adopts the Gaussian assumption only insofar as it makes the analysis of the models more straightforward and tractable. However, for statistical inference and validation of the model, no essential use of the Gaussian assumption is made. Instead, more general methods such as bootstrapping are employed.

    5.2. Structural Specification

    We will concentrate here on incorporating a-priori information, and for this,

    the emphasis is on constraining elements in the matrix D. The reason for this is

    simple: D describes the direct gene-to-gene interactions over time, and there-

fore seems the most suitable place to incorporate a-priori information. Recall that the gene regulatory network is constructed from the estimate of CB + D,

    and thus has incorporated in it also the influence of hidden variables (e.g. the

    influence of missing genes / proteins, etc.). Thus, the hypothesized form of this

DAG entails that some elements of the matrix CB + D are zero. The idea now

    would be to impose those constraints on CB + D and re-estimate the model

    structural parameters under these constraints and verify that the model still fits

    the data well. Imposing constraints reduces the dimensionality of the unknown-

    parameter space, and thus creates a new estimation problem (one for which the

    remaining unconstrained parameters can be estimated more precisely). Because

    of this, solving this new estimation problem (and performing diagnostics) could

    expose shortcomings in how well the constrained model describes the data, or

    could expose other parts of the model structure that were obscured because of

    the larger number of parameters to estimate in the unconstrained model.

    5.3. Estimation

    With the structural specification known, the objective is to estimate, in a

    least-squares sense, the unknown or unobserved state variables from the avail-

    able observations. The so-called Kalman filter solves this problem, and vari-

    ations of the filter give interpolation, extrapolation, and smoothing estimators

    of the state variables (see the book by Aoki [5], for example). The resulting

estimators are optimal in the sense of least-squares, given that one is restricted

    to consideration of estimators that are linear functions of the observables. Their

    derivation can be accomplished in generality by casting the problem in the con-

    text of approximation in a Hilbert space of random variables possessing finite


    second order moments. This reduces the problem to one of computing projec-tions onto the subspaces spanned by the observables, but the derivations and

    machinery of that theoretical approach are tedious. However, in the special

    case when the states and observables are jointly Gaussian, the least squares es-

    timators of state are given by conditional expectations (conditioned on the ob-

    servables) which are in turn linear functions of the observables. Moreover, the

    conditional expectation operator has all the essential properties of the subspace

    projection operator in the Hilbert space context. As a consequence, the shorter

    and more elegant analysis of the problem in the Gaussian context leads to ex-

    actly the same estimators of the state variables as the more general Hilbert space

    context. Thus, in terms of formulating the state estimators, there is no loss of

    generality in assuming Gaussian joint distributions.

    Regarding the estimation of the structural parameters, in the absence of as-

    sumptions regarding the joint distributions of the state variables and observables

    or any other pertinent information, a weighted least-squares approach would be

    reasonable and justified. If the assumption is made that the state variables and

    observables are jointly Gaussian, then the method of maximum likelihood leads

    to parameter estimators that are essentially equivalent to those yielded by the

    weighted least-squares approach. Thus, again there is no loss of generality in

    making the Gaussian assumption for constructing estimators of structural pa-

    rameters.

    5.4. Derivation

    To model the effects of the influence of the expression of one gene at a

    previous time point on another gene and its associated hidden variables, we

    consider the state-space model

x_{t+1} = A x_t + B y_t + w_t    (37)

y_t = C x_t + D y_{t-1} + v_t    (38)

The column vector x is the state vector of hidden variables for the system, u is the input observation vector, and C is the state-to-observation matrix, which captures the influence of the hidden variables on gene expression level at each time point.

The matrix D describes the gene-to-gene interaction at consecutive time points. From this matrix we obtain the Bayesian network representation of the


−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]    (39)

where

S_yy = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} y_t^{(j)′},   S_yx = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} x_t^{(j)′},   S_yu = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} u_t^{(j)′},

S_xu = Σ_{j=1}^{N} Σ_{t=1}^{T} x_t^{(j)} u_t^{(j)′},   S_uu = Σ_{j=1}^{N} Σ_{t=1}^{T} u_t^{(j)} u_t^{(j)′},   P = Σ_{j=1}^{N} Σ_{t=1}^{T} E[x_t x_t′ | y_1, ..., y_T]

Taking partial derivatives of (39) and setting them equal to zero, we solve for C, D and R. In other words, we find the unconstrained estimators that minimize the objective function (39).

D = (S_yu − S_yx P^{-1} S_xu)(S_uu − S′_xu P^{-1} S_xu)^{-1}    (40)

C = (S_yx − D S′_xu) P^{-1}    (41)

R = (1/NT)(S_yy − C S′_yx − D S′_yu)    (42)
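A direct NumPy transcription of (40)-(42) might read as follows; it is a sketch that assumes the sufficient statistics defined above have already been accumulated over the N replicates and T time points.

import numpy as np

def unconstrained_estimates(Syy, Syx, Syu, Sxu, Suu, P, N, T):
    # Closed-form unconstrained estimators (40)-(42)
    Pinv = np.linalg.inv(P)
    D = (Syu - Syx @ Pinv @ Sxu) @ np.linalg.inv(Suu - Sxu.T @ Pinv @ Sxu)  # (40)
    C = (Syx - D @ Sxu.T) @ Pinv                                            # (41)
    R = (Syy - C @ Syx.T - D @ Syu.T) / (N * T)                             # (42)
    return C, D, R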

To obtain the constrained estimators (D_cons, C_cons, and R_cons) we need to solve the following:


Constrained Minimization Problem

Minimize

−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]

subject to the constraint DF − G = 0.

Solution: We introduce the Lagrange multipliers method to minimize the likelihood function (39) subject to the constraint DF − G = 0. Let us define the real-valued vector of Lagrange multipliers λ = (λ_1, λ_2, ..., λ_n)′. The likelihood function and the constraints associated with it define our objective function as:

M(C, D, R, λ) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + tr[λ′(DF − G)]    (43)

Necessary conditions for a minimum of M(C, D, R, λ) are that the elements in C, D, R, and λ be chosen to give

∂M/∂C = 0,   ∂M/∂D = 0,   and   ∂M/∂λ = constraints = 0

The third expression implies that a minimum for M is also a minimum for the likelihood function (39).

∂M/∂C = ∂/∂C tr[R^{-1}(−S_yx C′ − C S′_yx + C P C′ + C S_xu D′ + D S′_xu C′)]
= 2R^{-1}(C_cons P + D_cons S′_xu − S_yx) = 0    (44)

∂M/∂D = ∂/∂D { tr[R^{-1}(−S_yu D′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + tr[λ′(DF − G)] }
= 2R^{-1}(C_cons S_xu + D_cons S_uu − S_yu) + λF′ = 0    (45)

∂M/∂λ = D_cons F − G = 0    (46)

From (44) and (45) we get the constrained estimators for C and D:

C_cons = (S_yx − D_cons S′_xu) P^{-1}    (47)

D_cons = (S_yu − C_cons S_xu − ½ R_cons λF′) S_uu^{-1}

Using the expressions (40) and (41) for the unconstrained estimators, we get the constrained D matrix

D_cons = D − ½ R_cons λF′ (S_uu − S′_xu P^{-1} S_xu)^{-1}

Substituting this back into (46) and solving for λ gives:

½ R_cons λ = (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1}

Putting the expression above back into the equation for D_cons and solving, we finally obtain the constrained estimators for C and D in terms of the unconstrained ones.

D_cons = D − (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} F′(S_uu − S′_xu P^{-1} S_xu)^{-1}

C_cons = C + (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} F′(S_uu − S′_xu P^{-1} S_xu)^{-1} S′_xu P^{-1}

Similarly, the constrained covariance matrix R_cons is obtained by differentiating with respect to R and solving:

∂M/∂R = NT R_cons^{-1} − R_cons^{-1}(S_yy − S_yx C′_cons − S_yu D′_cons − C_cons S′_yx + C_cons P C′_cons + C_cons S_xu D′_cons − D_cons S′_yu + D_cons S′_xu C′_cons + D_cons S_uu D′_cons) R_cons^{-1} = 0    (48)


leads to

R_cons = R + (1/NT)(−S_yu + C_cons S_xu + D_cons S_uu) D′_cons
= R + (1/NT)(−½ R_cons λF′) D′_cons    (49)
= R − (1/NT)(DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} G′    (50)

Unfortunately, these constraints cannot be implemented directly in the model used for this research: selecting the matrices F and G that zero out particular elements of D becomes difficult as the size of the matrix increases. However, by re-writing the constrained problem using the vec operator we can easily handle any matrix size.

5.5. Vec Formulation

The vec operator vectorizes a matrix by stacking its columns. That is, suppose we want to vectorize a 2x2 matrix M:

M = [ m_11  m_12 ; m_21  m_22 ],   vec(M) = (m_11, m_21, m_12, m_22)′

The Kronecker product of two matrices plays an important role when using the vec operator. There are important relationships that will be used in the development of the constrained minimization problem in vec formulation.

Definition: The Kronecker product of two matrices A and B, where A is m x n and B is p x q, is defined as

A ⊗ B = [ A_11 B  A_12 B  ...  A_1n B ; A_21 B  A_22 B  ...  A_2n B ; ... ; A_m1 B  A_m2 B  ...  A_mn B ]

which is an mp x nq matrix.


Important Operator Relationships

vec(AXB) = (B′ ⊗ A) vec(X)    (51)

(AC ⊗ BD) = (A ⊗ B)(C ⊗ D)    (52)

(A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}    (53)

d(x′Ax)/dx = x′(A + A′)    (54)
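Identity (51) is easy to verify numerically; note that the column-stacking vec corresponds to Fortran-order flattening in NumPy.

import numpy as np

# Numerical check of (51): vec(AXB) = (B' kron A) vec(X)
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

vec = lambda M: M.flatten('F')          # column-stacking vec operator
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))            # True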

To show the application of the vec operator in the constraint setting, let us look at the following example.

EXAMPLE:

Let us consider a 2x2 matrix D and suppose we want to constrain it to be diagonal. Select the matrices F and G to be

D = [ d_11  d_12 ; d_21  d_22 ],   F = [ 0 1 0 0 ; 0 0 1 0 ],   G = (0, 0)′

Then, applying the constraint F vec(D) = G, we get that the elements d_21 and d_12 are zero and the matrix D becomes

D = [ d_11  0 ; 0  d_22 ]
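The construction generalizes mechanically: the sketch below builds F and G to zero out any chosen entries of an n x n matrix D, using the same column-major vec ordering (the helper name is ours, for illustration only).

import numpy as np

def zero_constraints(n, zero_entries):
    # Build F and G so that F vec(D) = G forces the listed (i, j)
    # entries of an n x n matrix D to zero (0-based indices,
    # column-major vec ordering: d_ij sits at position j*n + i).
    F = np.zeros((len(zero_entries), n * n))
    for row, (i, j) in enumerate(zero_entries):
        F[row, j * n + i] = 1.0
    G = np.zeros(len(zero_entries))
    return F, G

# the 2x2 diagonal example from the text: zero out d21 and d12
F, G = zero_constraints(2, [(1, 0), (0, 1)])
print(F)   # [[0. 1. 0. 0.]
           #  [0. 0. 1. 0.]]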

In general, for any n x n matrix D we can find matrices F and G and solve the constrained minimization problem using the vec formulation as follows:

Constrained Minimization Problem 2

Minimize

−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]

subject to the constraint F vec(D) − G = 0.

Solution: We introduce the Lagrange multipliers method to minimize the objective function

M(C, D, R, λ) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + λ′(F vec(D) − G)    (55)

subject to the constraint F vec(D) − G = 0.

∂M/∂C = ∂/∂C tr[R^{-1}(−S_yx C′ − C S′_yx + C P C′ + C S_xu D′ + D S′_xu C′)]
= 2R^{-1}(C_cons P + D_cons S′_xu − S_yx) = 0    (56)

∂M/∂vec(D) = −2 vec(R_cons^{-1} S_yu) + 2 vec(R_cons^{-1} C_cons S_xu) + 2 vec(R_cons^{-1} D_cons S_uu) + F′λ = 0    (57)

∂M/∂λ = F vec(D_cons) − G = 0    (58)

∂M/∂R = NT R_cons^{-1} − R_cons^{-1}(S_yy − S_yx C′_cons − S_yu D′_cons − C_cons S′_yx + C_cons P C′_cons + C_cons S_xu D′_cons − D_cons S′_yu + D_cons S′_xu C′_cons + D_cons S_uu D′_cons) R_cons^{-1} = 0    (59)

From (57) and the following expressions

vec(R_cons^{-1} D_cons S_uu) = (S_uu ⊗ R_cons^{-1}) vec(D_cons)

vec(R_cons^{-1} C_cons S_xu) = (S′_xu ⊗ R_cons^{-1}) vec(C_cons)

∂/∂vec(D) [λ′ F vec(D)] = F′λ

vec(C_cons) = vec(S_yx P^{-1}) − (P^{-1} S_xu ⊗ I) vec(D_cons)


we have that

vec(D_cons) = vec(D) − ½ [(S_uu − S′_xu P^{-1} S_xu)^{-1} ⊗ R_cons] F′λ

We still need to work out the value of λ. Substituting the expression above into (58) and solving for λ gives:

λ = [½ F ((S_uu − S′_xu P^{-1} S_xu)^{-1} ⊗ R_cons) F′]^{-1} (F vec(D) − G)    (60)

Now, putting this expression for λ back into the equation for vec(D_cons) above, we obtain

vec(D_cons) = vec(D) − V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (61)

where

V = (S_uu − S′_xu P^{-1} S_xu) ⊗ R_cons^{-1}
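In code, update (61) is a metric projection of vec(D) onto the constraint set F vec(D) = G; the sketch below assumes the current R_cons and the sufficient statistics are already in hand.

import numpy as np

def constrain_vecD(D, F, G, Suu, Sxu, P, R_cons):
    # Projection (61): vec(D_cons) = vec(D) - V^-1 F' [F V^-1 F']^-1 (F vec(D) - G)
    vecD = D.flatten('F')                          # column-stacking vec
    Pinv = np.linalg.inv(P)
    V = np.kron(Suu - Sxu.T @ Pinv @ Sxu, np.linalg.inv(R_cons))
    Vinv = np.linalg.inv(V)
    W = Vinv @ F.T @ np.linalg.inv(F @ Vinv @ F.T)
    vecD_cons = vecD - W @ (F @ vecD - G)
    return vecD_cons.reshape(D.shape, order='F')   # back to matrix form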

Finally, from (59) we obtain the expression for R_cons implicitly, in the form R_cons = R + f(R_cons), for which we will need to iterate, reshaping the matrix D_cons at each iteration:

R_cons = R + (1/NT)(−S_yu + C_cons S_xu + D_cons S_uu) D′_cons    (62)

5.6. Constraints Implementation - EM Procedure

In order to apply the EM algorithm, we require initial values of the state and covariance, as well as the parameters, which are initialized using linear regression. The EM procedure then operates as follows:

E-step

Given the initial estimators x_0, P_0 and initial estimators of A, B, C, D, Q and R, use the Kalman filter equations to compute the estimates for x_t^+ and P_t.

M-step

Re-estimate the unconstrained A, B, C, D, Q, and R using the values for x_t^+ and P_t in the formulas for a, b, c, d, e, and P. (Here is where we add the constraints F vec(D) − G = 0.)

ALGORITHM:

1. Start with the unconstrained estimates of C, D, and R, equations (40)-(42).

2. The vec expressions for the constrained C_cons and D_cons are in fact functions of R_cons, which is in turn a function of the unconstrained R and the previous R_cons, and has to be calculated by iteration. That is,

vec(D_cons) = vec(D) − V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (63)

vec(C_cons) = vec(C) + (P^{-1} S_xu ⊗ I) V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (64)

where V(R_cons(r)) is as in (61), and

R_cons = R + f(R_cons), with R_cons(0) = R and

R_cons(r+1) = R + f(R_cons(r)), r = 0, 1, 2, ..., until ||R_cons(r+1) − R_cons(r)|| < tol.

Hence,

R_c = R_cons(r+1),   C_c = C_c(R_c(r+1)),   and D_c = D_c(R_c(r+1))

3. Now, in the iteration process,

R_c(r+1) = (1/NT)[a − C_c(R_c(r)) b − D_c(R_c(r)) c + (C_c(R_c(r)) d + D_c(R_c(r)) e − c) D_c(R_c(r+1))]

so, for each iteration r, we need to reshape vec(D_c) and vec(C_c), put them back into matrix form, and compute a new R_c(r+1). Continue this until convergence; once we have the final R_c, put it back one more time to find vec(D_c) and vec(C_c) and reshape them.


4. Then D_c and C_c are the matrices that go back to the E-step, to be used (along with the other parameters) to find an updated and more accurate estimate of x_t^+ and P_t.
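Putting the pieces together, the constrained EM loop has the following shape. This is only a control-flow skeleton: kalman_smoother, sufficient_stats, unconstrained_estimates and update_R are hypothetical helpers standing for the E-step smoother, the S-statistics, the closed forms (40)-(42) and the implicit relation (62), while constrain_vecD is the projection sketched in Section 5.5.

import numpy as np

def constrained_em(Y, U, F, G, params, n_iter=50, tol=1e-6):
    # params: dict holding A, B, C, D, Q, R, x0, P0, initialized
    # (e.g. by linear regression, as described above)
    for _ in range(n_iter):
        # E-step: smoothed states and sufficient statistics under params
        s = sufficient_stats(kalman_smoother(Y, U, params))
        # M-step: unconstrained closed forms (40)-(42)
        C, D, R = unconstrained_estimates(s)
        # inner fixed-point loop: R_cons = R + f(R_cons), reshaping
        # vec(D_cons) back into matrix form at every pass
        R_cons = R
        while True:
            D_cons = constrain_vecD(D, F, G, s['Suu'], s['Sxu'], s['P'], R_cons)
            R_new = update_R(s, C, D_cons, R)       # implicit relation (62)
            if np.linalg.norm(R_new - R_cons) < tol:
                break
            R_cons = R_new
        params.update(C=C, D=D_cons, R=R_cons)      # feed back into the E-step
    return params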

    6. Conclusions

    Information theory as such, is concerned with the quantification, analysis

    and forecasting of information processing in systems under incomplete and/or

    noisy data acquisition. As we discussed in this chapter, the problem of the in-

    ference and analysis of gene regulatory networks from experimental data on

    gene expression at a genome wide scale, is closely related with the foundational

    tenets of information theory. In fact, given the current biological understand-ing of gene regulation as an extremely complex signal processing phenomena,

    information theoretical tools and concepts result a natural choice for the task

    of inference/analysis of such GRNs. We presented several instances in which

    information theory, either on its own, or combined with probabilistic graphical

    models, Bayesian statistics and machine-learning techniques have been used in

    the inference and assessment of GRNs.

    Purely information theoretical approaches are based on complex graph ren-

    derings (i.e. both cyclic and acyclic probabilistic models are allowed) and are

    able to describe the system using either continuous or discrete probability den-

    sity functions. The means for dealing with incomplete or noisy data is by quan-

tifying interactions that are usually valued by means of statistical dependence measures such as mutual information and Kullback-Leibler divergences, either

    on a marginal or conditional setting. The use of minimum description length

    as a measure of algorithmic complexity, of the data processing inequality to

discriminate between direct and indirect interactions, and of Shannon's signal

    processing theorems to establish thresholds or bounds of confidence, is usually

    supplemented with optimization based on maximum entropy (MaxEnt) tech-

    niques.

On the other hand, Bayesian/machine-learning implementations of information theoretical models are usually based on directed acyclic graphs (DAGs); these also allow either discrete or continuous probability distribution functions