
    In: Information Theory: New Research

Editors: P. Deloumeaux et al., pp. 137-184

    ISBN: 978-1-62100-325-0

© 2011 Nova Science Publishers, Inc.

    Chapter 4

THE ROLE OF INFORMATION THEORY IN GENE REGULATORY NETWORK INFERENCE

Enrique Hernández-Lemus and Claudia Rangel-Escareño

Computational Genomics Department, National Institute of Genomic Medicine, Mexico. E-mail address: [email protected]

    Abstract

One important problem in contemporary computational biology is that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, GRN) from partial knowledge, as given for example by means of gene expression analysis experiments. Since only highly noisy data are available, doing this represents a challenge to common probabilistic modeling approaches. However, a variety of algorithms rooted in information theory and maximum entropy methods have been developed, and they have coped with the problem successfully (to a certain degree). Mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, Kullback-Leibler divergence and information-based similarity are some of these. Another approach to modeling gene regulatory networks combines information theory and machine learning techniques. Monte Carlo methods and variational methods can also be used to measure data information content. Hidden Markov models (HMM) use time series data to represent information of a state sequence about the past through a discrete random variable called the hidden state. Similarly, stochastic linear dynamical systems represent information about the past, but through a real-valued hidden state vector. Common to these models is the fact that, conditioned on the hidden state vector, the past, present and future observations are statistically independent. State-space models, also known as Linear Dynamical Systems (LDS) or Kalman filter models, are a subclass of dynamic Bayesian networks used for modeling time series data. Expressing time series models in state-space form allows for unobserved components, an important factor when modeling gene expression data. Unobserved variables can model biological effects that are not taken into account by the observables. They could model the effects of genes that have not been included in the experiment, levels of regulatory proteins or possible effects of mRNA degradation. Work presented here shows the use of these models to reverse engineer regulatory networks from high-throughput data sources such as microarray gene expression profiling. In this review we will also describe the basic theoretical foundations common to such methods and will briefly outline their virtues and limitations.

Keywords: information theory, network inference, probabilistic modeling

    1. Introduction

A common situation in several emerging fields of science and technology, such as bioinformatics and computational biology, high energy physics and astronomy, to name a few, is that researchers are confronted with datasets having thousands of variables, large noise levels, non-linear statistical dependencies and a very reduced sampling universe. The detection of functional and structural relationships in the data under such circumstances is always a major challenge. In particular, the construction of dynamic maps of gene interactions (so-called genetic regulatory networks) relies on understanding the interplay between thousands of genes. Several issues arise in the analysis of data related to gene function: the measurement processes generate highly noisy signals, and there are far more variables involved (number of genes and interactions among them) than experimental samples. Another source of complexity is the highly nonlinear character of the underlying biochemical dynamics.

Hence, two important milestones in the analysis of genomic regulation are variable selection (also called feature selection) and network inference. The former is a machine learning topic whose goal is to select, from amongst thousands of input variables, those that lead to the best predictive model. Feature selection methods applied to genomic data allow, for instance, the improvement of molecular diagnosis and prognosis in complex diseases (such as cancer) by identifying a set of features or variables (called a molecular signature) that best represent the phenomenon. Network inference, in turn, consists in representing the (in general non-linear) set of statistical dependencies between the variables of a set (which can be the whole input dataset or a feature-selected subset of it) by means of a graph. When applied to genomic expression data (e.g. from microarray experiments), network inference is able to reverse-engineer the transcriptional gene regulatory network (GRN) of the related cell. Knowledge of this GRN would allow, for instance, the discovery of new drug targets to cure diseases.

Information theory (IT) has proved to be a powerful theoretical foundation for developing algorithms and computational techniques that deal both with feature selection and with network inference problems applied to real data. There are, however, goals and challenges involved in the application of IT to genomic analysis. The applied algorithms should return intelligible models (i.e. they must be understandable), they must rely on little a priori knowledge, deal with thousands of variables and detect non-linear dependencies, all of this starting from tens (or at most a few hundreds) of highly noisy samples. As we will show in this chapter, IT has provided approaches to deal with these problems. Some of these approaches are based on machine learning techniques, basically by modeling a target function connecting the variables of a system. Here, the output or target variable is the one to be predicted and the input variables are the predictors.

As a means to produce intelligible models we perform feature-selection procedures. The goal of these procedures is to select, among a set of variables, the inputs which lead to the best predictive model. In the vast majority of cases, feature selection is a preprocessing step prior to the actual machine learning stage. This is a somewhat critical part of the whole inference process. On the one hand, variable or feature elimination can lead to information losses. On the other, feature selection is a means to improve the accuracy of a model, its generalizability and its intelligibility, and at the same time to decrease the computational burden of the training and inference stages. Computational methods for feature selection usually consist of a search algorithm that explores different combinations of variables, supplemented with a measure of performance (or score) for these combinations. There are several ways to accomplish this task; in our opinion, the best benchmarking options for the GRN inference scenario are the use of sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient, and also provide an easy means to communicate the results to non-specialists (e.g. molecular biologists, geneticists and physicians).

GRNs are graph-theoretical constructs that describe the integrated state of a cell (or, to be more precise, a small population of similar cells) under certain biological conditions at a given time. GRNs are means for identifying gene interactions from experimental data through the use of theoretical models and computational analysis. The inference of such an interaction connectivity network involves the solution of an inverse problem (a deconvolution) that aims to uncover the interactions from the properties and dynamics of observable behavior in the form of, for example, RNA transcription levels in a characteristic gene expression profile. A growing number of deconvolution methods (also called reverse engineering methods) have been proposed in the past [6, 62]. Their goal is to provide a well-defined representation of the cellular network topology from the transcriptional interactions as revealed by gene expression measurements, which are then treated as samples from a joint probability distribution. The goal of deconvolution methods is the discovery of GRNs based on statistical dependencies within this joint distribution [13]. One major shortcoming is that, surprisingly, there is still no conceptual agreement as to what the dependencies are within these multivariate settings, nor about the role of noise and stochastic dynamics in the problem. The special case of conditional statistical dependence has gained, however, a certain place as a useful criterion in most biomedical applications. The central aim is to find a way to decompose the Statistical Dependency Matrix (SDM), that is, the deviation of a joint probability distribution from the product of its marginals, into a series of well defined contributions coming from interactions of several orders of complexity. IT is therefore the right setting to do so. Typical means to reach this goal consist in the quantification of the new information content that arises when we look at the full joint probability distribution compared to a series of successive independence approximations.

In GRNs each variable of the dataset is represented by a node (or vertex) in the graph. There is a link joining two variable-nodes if these variables exhibit a particular form of dependency (which particular form depends explicitly on the inference method chosen). Some genes can produce a protein (or other biomolecules, such as a microRNA) that is able to activate or repress the production of another gene's protein. There is thus a presence of circuits coded in the DNA of a cell. A useful way to represent these circuits is a graph where the nodes represent the genes and the links or arcs are the interactions between them. Here we will be dealing with reverse engineering methods for GRNs using whole-genome gene expression data as input. This problem is very general and useful in contemporary research in computational molecular biology; however, it is a question that remains open to date, due to its combinatorial nature and the poor information content of the data. Validation of networks against available real-life data will thus be an important stage in the discovery of reliable GRNs.

As we have seen, there are two major shortcomings related to the feature selection and network inference procedures: i) non-linearity and ii) the large number of variables. IT methods are often efficient techniques to deal with issues i) and ii) [52, 22, 21, 38, 26]. It can be seen that most of these methods rely on some form of mutual information metric. Mutual information (MI) is an information-theoretic measure of dependency which is model independent and has been used to define (and quantify) relevance, redundancy and interaction in such large noisy datasets. MI has the enormous advantage that it captures non-linear dependencies [38, 26]. Finally, MI is rather fast to compute, hence it can be calculated a large number of times in a still reasonable amount of time, an explicit requirement in whole-genome transcription analysis.

2. Information Theoretical Measures and Probability Measures

We will introduce here the essential notions of IT that will be used, like entropy, mutual information and other measures. In order to do so, let $X$ and $Y$ denote two discrete random variables having the following features:

- finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively;
- a joint probability mass distribution $p(X, Y)$;
- marginal probability mass distributions $p(X)$ and $p(Y)$.

Let also $\tilde{X}$ and $\tilde{Y}$ denote two additional discrete random variables defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively; the associated probability mass distributions will be $\tilde{p}(X)$ and $\tilde{p}(Y)$, with their joint probability mass distribution $\tilde{p}(X, Y)$ defined on $J$, the joint probability sampling space, $J = \mathcal{X} \times \mathcal{Y}$. For particular realizations, we have $p(x) = P(X = x)$ and $p(y) = P(Y = y)$.

Following Shannon [58], for every discrete probability distribution $X$ it is possible to define the information theoretical entropy $H$ of such a distribution as follows:

$$H = -K_s \sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (1)$$

Here $H$ is called the Shannon-Weaver entropy, $K_s$ is a constant used to determine the units in which entropy is measured, and $p(x)$ is the probability mass for the state of the random variable given by $X = x$. Entropy was originally developed to serve as a measure of the amount of uncertainty associated with the value of $X$, hence relating the predictability of an outcome with the probability distribution.

The Kullback-Leibler divergence, $KL[\cdot\,;\cdot]$, is a non-commutative measure of the difference between two discrete probability distributions [33]:

$$KL\left[ p(Y); \tilde{p}(Y) \right] = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{\tilde{p}(y)} \qquad (2)$$

The joint Kullback-Leibler divergence between two probability mass distributions $p(X, Y)$ and $\tilde{p}(X, Y)$ is given by:

$$KL\left[ p(X, Y); \tilde{p}(X, Y) \right] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x, y)}{\tilde{p}(x, y)} \qquad (3)$$

In a similar way, it is possible to define the conditional Kullback-Leibler divergence between $p(Y|X)$ and $\tilde{p}(Y|X)$ as follows:

$$KL\left[ p(Y|X); \tilde{p}(Y|X) \right] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\tilde{p}(y|x)} \qquad (4)$$

Equation 4 means that the conditional Kullback-Leibler divergence can also be defined as the expected value of the Kullback-Leibler divergence of the conditional probability mass functions, averaged over the conditioning random variable.

Recalling equation 2, we notice that it can be rephrased as follows:


$$KL\left[ p(Y); \tilde{p}(Y) \right] = \sum_{y \in \mathcal{Y}} p(y) \log p(y) - \sum_{y \in \mathcal{Y}} p(y) \log \tilde{p}(y) \qquad (5)$$

We can see that the first term on the right hand side of equation 5 is precisely the negative of the entropy $H(Y)$ as given by equation 1. Shannon's entropy depends on the distribution $p(Y)$ and, as Shannon himself showed [58], it is maximum for a uniform distribution $u(Y)$: $H[u(Y)] = \log |\mathcal{Y}|$. If we replace $\tilde{p}(y)$ by $u(Y)$ in equation 5 we get:

$$H[p(Y)] = \log |\mathcal{Y}| - KL\left[ p(Y); u(Y) \right] \qquad (6)$$

As we can see, equation 6 states that the entropy of a random variable $Y$ is the logarithm of the size of the support set minus the Kullback-Leibler divergence between the probability distribution of $Y$ and the uniform distribution over the same domain $\mathcal{Y}$. Thus, the closer the probability distribution is to a uniform distribution, the higher the entropy. Hence, entropy measures the randomness and unpredictability of a distribution.
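As a simple illustration of these two measures, the following minimal R sketch (ours, with toy distributions; $K_s = 1$ and natural logarithms assumed) computes the entropy of equation 1 and the Kullback-Leibler divergence of equation 2, and verifies the decomposition of equation 6:

    # Shannon entropy (equation 1, Ks = 1, natural log) and Kullback-Leibler
    # divergence (equation 2) for discrete probability vectors.
    entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
    kl      <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

    p <- c(0.7, 0.2, 0.1)   # a toy, rather predictable distribution
    u <- rep(1/3, 3)        # the uniform distribution on the same alphabet

    entropy(p)              # lower than the maximum log(3)
    log(3) - kl(p, u)       # equals entropy(p), as stated by equation 6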

Now, let us consider a pair of discrete random variables $(Y, X)$ with a Joint Probability Distribution (JPD) $p(Y, X)$. For these random variables the joint entropy $H(Y, X)$ is given in terms of the JPD as:

$$H(Y, X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y, x) \qquad (7)$$

We notice that the maximal joint entropy is attained under independence of the random variables $Y$ and $X$, that is, when the JPD factorizes, $p(Y, X) = p(Y)\,p(X)$; in this case the entropy of the JPD is just the sum of the respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:

$$H(Y, X) \leq H(Y) + H(X) \qquad (8)$$

Equality only holds if $X$ and $Y$ are statistically independent.

Also, given a Conditional Probability Distribution (CPD), the corresponding conditional entropy of $Y$ given $X$ can be defined as:

$$H(Y|X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y|x) \qquad (9)$$


Conditional entropies are useful to measure the uncertainty of a random variable once another one (the conditioner) is known. It can be proved [12] that:

$$H(Y, X) = H(X) + H(Y|X) \leq H(Y) + H(X) \qquad (10)$$

Or, in other words:

$$H(Y|X) \leq H(Y) \qquad (11)$$

Equality only holds when $X$ and $Y$ are statistically independent. Expression 11 is extremely useful in the inference/prediction scenario: if $Y$ is a target variable and $X$ is a predictor, adding variables can only decrease the uncertainty on the target $Y$. This will turn out to be almost essential for IT methods of GRN inference.

Entropy reduction by conditioning can be accounted for formally if we consider a measure called the mutual information, $I(Y, X)$, which is a symmetrical measure (i.e. $I(Y, X) = I(X, Y)$) written as:

$$I(Y, X) = H(Y) - H(Y|X) \quad \text{or} \quad I(X, Y) = H(X) - H(X|Y) \qquad (12)$$

If we resort to Shannon's definition of entropy (equation 1) [58] and substitute it into equation 12 we get:

$$I(Y, X) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (13)$$

Mutual information can thus be written as the Kullback-Leibler divergence between the JPD and the product distribution:

$$I(Y, X) = KL\left[ p(X, Y); p(X)\,p(Y) \right] \qquad (14)$$

Mutual information is also given by the Kullback-Leibler divergence between the conditional distribution $p(X|Y)$ and the marginal distribution $p(X)$, averaged over the conditioning variable:

$$I(Y, X) = KL\left[ p(X|Y); p(X) \right] \qquad (15)$$

Mutual information and Kullback-Leibler divergences are two of the most widely used IT measures to solve the GRN inference problem.

A comprehensive catalogue of algorithms to calculate diverse information theoretical measures has been developed for [R], the statistical scientific computing environment [27].
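For instance, assuming such a package is installed, a session of this kind computes discrete estimates of the measures above (shown here with the infotheo package, a representative implementation of these estimators; discretization is needed because expression profiles are continuous):

    # Discrete IT estimators on continuous expression-like profiles
    # (illustrative; infotheo's functions as documented).
    library(infotheo)
    x <- rnorm(100); y <- 0.8 * x + rnorm(100, sd = 0.5)
    d <- discretize(data.frame(x, y))  # equal-frequency binning
    entropy(d$x)                       # Shannon entropy of the binned profile
    mutinformation(d$x, d$y)           # mutual information between profiles
    condentropy(d$y, d$x)              # conditional entropy H(Y|X)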


3. Methods in Regulatory Network Inference

The deconvolution of a GRN can be based on a maximum entropy optimization of the JPD of gene-gene interactions as given by gene expression experimental data, implemented as follows [26]. The JPD for the stationary expression of all genes, $P(\{g_i\})$, $i = 1, \dots, N$, may be written as [38]:

$$P(\{g_i\}) = \frac{1}{Z} \exp\left( -H_{gen} \right) \qquad (16)$$

$$H_{gen} = \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{i,j}(g_i, g_j) + \sum_{i,j,k}^{N} \phi_{i,j,k}(g_i, g_j, g_k) + \dots \qquad (17)$$

Here $N$ is the number of genes, $Z$ is a normalization factor (the partition function), and the $\phi$'s are interaction potentials. A truncation procedure in equation 17 is used to define an approximate hamiltonian $H_p$ that aims to describe the statistical properties of the system. A set of variables (genes) interacts if and only if the potential $\phi$ between such a set of variables is non-zero; the relative contribution of $\phi$ is taken as proportional to the strength of the interaction within this set. Equation 17 does not define the potentials uniquely; thus, additional constraints should be provided in order to avoid ambiguity. A usual approach is to specify the $\phi$'s using maximum entropy (MaxEnt) approximations consistent with the available information on the system in the form of marginals. Information theory provides a set of useful criteria for setting up probability distribution functions (PDFs) on the basis of partial knowledge.

Figure 1. A set of genes $i$ interacts with another set of genes $k$ by means of a potential $\phi \neq 0$, and is non-interacting with another set of genes $j$, since the corresponding potential functional is equal to zero.

The MaxEnt estimate of a PDF is the least biased estimate possible given the information, i.e. the PDF that is maximally non-committal with regard to missing information [28]. It is not possible to constrain the system via the specification of all possible $N$-way potentials when $N$ is large, hence one has to approximate the interaction structure. According to the current genomics literature, sample sizes of order $10^2$ (the usual maximum available in most present-day studies) are generally sufficient to estimate 2-way marginals, whereas 3-way marginals (e.g. triplet interactions $\phi_{i,j,k}(g_i, g_j, g_k)$) require about an order of magnitude more samples, a sample size unattainable under present circumstances. This being the case, one is usually confronted with a 2-way hamiltonian of the form:

$$H_{approx} = \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{i,j}(g_i, g_j) \qquad (18)$$

Under that approximation, the reconstruction (or deconvolution) of the associated GRN consists in the inverse problem of determining the complete set of relevant 2-way interactions $\phi_{i,j}(g_i, g_j)$ consistent with the JPD (equations 16 and 17) that defines all known constrictions, e.g. the values of the stationary expression of genes $g_i$ as given by the set of $\phi_i(g_i)$'s, and non-committal with every other restriction in the form of a marginal. The modeling of a GRN depends on the description of the interactions in the form of several correlation functions. A great deal of work has been done within the framework of the Bayesian Network (BN) approach [51, 23]. BN models, both static and dynamic, have provided a better understanding of the problem in terms of solvability, noise reduction and algorithmic complexity. Since BNs are a form of the Directed Acyclic Graph (DAG) problem, there are several instances (e.g. feed-forward loops, feed-back cycles, etc.) in which the DAG formalism of BNs falls short. It has been noted [6] that BNs require a larger number of data points (samples) to infer the probability density distributions, whereas information theoretical approaches perform well for steady-state data and can be applied even when few experiments (compared to the number of genes) are available. A recently developed approach is the use of statistical and information theoretical models to describe the interactions [36].

If we consider a 2-way interaction hamiltonian, all gene pairs $i, j$ for which $\phi_{i,j} = 0$ are said to be non-interacting. This is true for genes that are statistically independent, $P(g_i, g_j) = P(g_i)\,P(g_j)$, but it also holds for genes that do not interact directly but are connected via other genes, i.e. $\phi_{i,j} = 0$ but $P(g_i, g_j) \neq P(g_i)\,P(g_j)$. Several metrics, such as the Pearson correlation, squared correlation and Spearman ranked coefficients over the sampling universe, have been used, but the performance of these methods is usually poor, as it suffers from a large number of false positive predictions.
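The distinction between direct and indirect statistical dependency can be made concrete with a small R sketch (an illustration of ours, not of the original references; the simulated chain and parameters are assumptions). For a transcriptional chain g1 → g2 → g3 with no direct g1-g3 coupling, plain correlation flags the indirect pair, whereas partial correlations obtained from the inverse covariance matrix (a Gaussian, pairwise MaxEnt-like view) are close to zero for it:

    # Simulated chain g1 -> g2 -> g3 (no direct g1-g3 interaction).
    set.seed(1)
    M  <- 200
    g1 <- rnorm(M)
    g2 <- 0.8 * g1 + rnorm(M, sd = 0.5)
    g3 <- 0.8 * g2 + rnorm(M, sd = 0.5)
    X  <- cbind(g1, g2, g3)

    round(cor(X), 2)   # the indirect g1-g3 pair shows sizable correlation

    # Partial correlations from the precision (inverse covariance) matrix.
    P <- solve(cov(X))
    pc <- -P / sqrt(diag(P) %o% diag(P))
    diag(pc) <- 1
    round(pc, 2)       # the (g1, g3) entry is near zero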

    3.1. Information Theoretical Methods

    3.1.1. Mutual Information

An information theoretical measure that has been used successfully to infer 2-way interactions in GRNs is mutual information (MI) [38, 37, 3, 4]. MI for a pair of random variables $\xi$ and $\eta$ is defined as $I(\xi, \eta) = H(\xi) + H(\eta) - H(\xi, \eta)$. Here $H$ is the information theoretical entropy (Shannon's entropy), $H(x) = -\langle \log p(x_i) \rangle = -\sum_i p(x_i) \log p(x_i)$. MI measures the degree of statistical dependency between two random variables. From the definition one can see that $I(\xi, \eta) = 0$ if and only if $\xi$ and $\eta$ are statistically independent. Estimating MI between gene expression profiles under the high throughput experimental setups typical of today's research in the field is a computational and theoretical challenge of considerable magnitude. One possible approximation is the use of estimators. Under a Gaussian kernel approximation [60], the JPD of a 2-way measurement $X_i = (x_i, y_i)$, $i = 1, 2, \dots, M$, is estimated as [38]:

$$f(X) = \frac{1}{M} \sum_i \frac{1}{h^2}\, G\!\left( \frac{X - X_i}{h} \right) \qquad (19)$$

$G$ is the bivariate standard normal density and $h$ is the associated kernel width [38]. The mutual information can then be evaluated as follows:

$$I(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{f(x_i, y_i)}{f(x_i)\, f(y_i)} \qquad (20)$$

Hence, two genes with expression profiles $g_i$ and $g_j$ for which $I(g_i, g_j) \neq 0$ are said to interact with each other with a strength $I(g_i, g_j) \propto \phi(g_i, g_j)$, whereas two genes for which $I(g_i, g_j)$ is zero are declared non-directly interacting, to within the given approximations. Since MI is reparametrization invariant, one usually calculates the normalized mutual information; in this case $I(g_i, g_j) \in [0, 1]$ for all $i, j$.
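A direct transcription of the estimator of equations 19-20 into R reads as follows (a minimal sketch; the fixed bandwidth h and the toy profiles are assumptions of ours, and practical implementations such as [38] select h more carefully):

    # Gaussian-kernel MI estimator, equations 19-20 (illustrative sketch).
    kernel_mi <- function(x, y, h = 0.25) {
      M   <- length(x)
      fxy <- sapply(seq_len(M), function(i)
        mean(dnorm((x - x[i]) / h) * dnorm((y - y[i]) / h)) / h^2)
      fx  <- sapply(seq_len(M), function(i) mean(dnorm((x - x[i]) / h)) / h)
      fy  <- sapply(seq_len(M), function(i) mean(dnorm((y - y[i]) / h)) / h)
      mean(log(fxy / (fx * fy)))    # equation 20
    }

    set.seed(2)
    x <- rnorm(100); y <- 0.7 * x + rnorm(100, sd = 0.5)
    kernel_mi(x, y)                 # clearly positive for dependent profiles
    kernel_mi(x, rnorm(100))        # near zero for independent profiles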

Figure 2. Panel (i) shows a bivariate interaction between gene A and genes B and C; panel (ii) shows an indirect interaction of gene A on gene C mediated by gene B; panel (iii) depicts two independent interactions, between genes A and B and between genes A and C.

A highly customizable set of algorithms for mutual information inference of gene regulatory networks has been implemented in the [R]/Bioconductor scheme [43, 42] and is called minet. The inference proceeds in two steps. First, the Mutual Information Matrix (MIM) is computed, a square matrix whose $MIM_{ij}$ term is the mutual information between genes $x_i$ and $x_j$. Secondly, an inference algorithm takes the MIM matrix as input and attributes a score to each edge connecting a pair of nodes. Different entropy estimators are implemented in this package, as well as different inference methods, namely aracne, clr and mrnet; finally, the package integrates accuracy assessment tools, like PR-curves and ROC-curves, to compare the inferred network with a reference one [41]. The approach used there is also based on techniques from information theory; it is called the maximum relevance/minimum redundancy algorithm (MRMR) [17], and it is a highly effective information-theoretic technique for feature (or variable) selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the highest mutual information with the target.
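A typical minet session looks as follows (a usage sketch under the assumption that the package is installed and that expr is an expression matrix with samples in rows and genes in columns; function names follow the package documentation):

    # Two-step minet workflow: MIM first, then an inference algorithm.
    library(minet)
    mim <- build.mim(expr, estimator = "mi.empirical", disc = "equalfreq")
    net.mrnet  <- mrnet(mim)    # MRMR-based inference (see below)
    net.aracne <- aracne(mim)   # DPI-based pruning (Section 3.1.3)
    net.clr    <- clr(mim)      # context likelihood of relatedness
    # With a reference network 'refnet', PR/ROC assessment:
    # tbl <- validate(net.mrnet, refnet); auc.pr(tbl); auc.roc(tbl)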

MRNET [41] extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The MRMR method [17], used in conjunction with a best-first search strategy for performing filter selection in supervised learning problems, works as follows. Consider a supervised learning task where the output is denoted by $Y$ and $V$ is the set of input variables. MRMR ranks the set $V$ of inputs according to a score that is the difference between the mutual information with the output variable $Y$ (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). Hence direct interactions should be well ranked, whereas indirect interactions should be badly ranked by the method. A greedy search algorithm starts by selecting the variable $X_i$ that shows the highest mutual information with the target $Y$. The next selected variable $X_j$ will be the one with a high information $I(X_j; Y)$ to the target and at the same time a low information $I(X_j; X_i)$ to the previously selected variable. In the following steps, given a set $S$ of selected variables, the criterion updates $S$ by choosing the variable that maximizes the score. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. The MRMR criterion is therefore an optimal pairwise approximation (a proxy) of the conditional mutual information between any gene $X_j$ and $Y$ given the set $S$ of selected variables, $I(X_j; Y|S)$. MRNET (and minet also) works by repeating this MRMR algorithm for every target gene (or, in any case, for every gene on which to search for de novo transcriptional interactions).
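The greedy scoring step just described can be sketched in a few lines of R (illustrative only; mim is an assumed precomputed mutual information matrix and tgt the index of the target gene):

    # MRMR greedy ranking: relevance to the target minus average redundancy.
    mrmr_rank <- function(mim, tgt, k = 5) {
      candidates <- setdiff(seq_len(ncol(mim)), tgt)
      selected   <- integer(0)
      for (step in seq_len(k)) {
        score <- sapply(candidates, function(j) {
          redundancy <- if (length(selected) == 0) 0 else mean(mim[j, selected])
          mim[j, tgt] - redundancy
        })
        best       <- candidates[which.max(score)]
        selected   <- c(selected, best)
        candidates <- setdiff(candidates, best)
      }
      selected    # ranked neighbor candidates for gene 'tgt'
    }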

MRNET reverse engineers networks by means of a forward selection strategy that aims to identify a maximally-independent set of neighbors for every variable. A known limitation of algorithms based on forward selection, however, is that the quality of the selected subset strongly depends on the first variable selected (dependence on initial conditions). A modified version called mrnetb has been presented [43], an improved version of MRNET that overcomes this shortcoming by using backward selection followed by a sequential replacement; it can be implemented with about the same computational burden as the original forward selection strategy. The optimization problem of MRNET is a form of binary quadratic optimization, for which backward elimination combined with a sequential search is known to perform well. Backward elimination starts with a set containing all the variables and then selects the variable $X_i$ whose removal induces the highest increase of the objective function. The procedure is enhanced by an iterative sequential replacement which, at each step, swaps the status of a selected and a non-selected variable such that the largest increase in the objective function is achieved. The sequential replacement is stopped when no further improvement is met [43]. Forward selection, backward elimination, and sequential replacement all have an algorithmic complexity of $O(n^2)$, so that the network built by backward elimination followed by sequential replacement has the same asymptotic computational cost as the one based on a forward selection strategy alone.

As one may further notice, the inference of GRNs by means of such high performance IT methods is hindered by large computational complexity. The limiting step in these approaches is the time-consuming computation of the MI matrix. A method has been proposed by Qiu and colleagues [53] to reduce this computation time. It is based on the application of spectral analysis to re-order the genes, so that genes that share regulatory relationships are more likely to be placed close to each other. Then, using a sliding window approach with appropriate window size and step size, the MI for the genes within the sliding window is computed, and the remainder is assumed to be zero. Qiu's method does not incur performance loss in regions of high precision and low recall, while the computational time is significantly lowered. The essence of Qiu's method is as follows. To determine the new gene ordering, a Laplacian matrix is derived from the correlation matrix of the gene expression data, assuming the correlation matrix provides an adequate approximation to the adjacency matrix for this purpose; then the Fiedler vector [11] is computed, which is the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix. Since the Fiedler vector is smooth with respect to the connectivity described by the Laplacian matrix, the elements of the Fiedler vector are then sorted to obtain the desired gene ordering. The computational complexity of obtaining the gene ordering is negligible compared to the computation of the MI matrix. The reduction in computational complexity is the result of computing only the diagonal band of the reshuffled MI matrix. Because the remaining entries of the MI matrix are set to zero, there is a potential loss of reconstruction accuracy, although due to the Fiedler minimization [53] this effect is not expected to be significant. In fact, according to a benchmark of the method [53], in the high-precision low-recall regime applying the sliding window causes no performance loss; in some cases, applying the sliding window yields slightly better performance. In the low-precision regime, however, the windowed version has lower recall, but this regime is dismissed, because there one is not able to distinguish biologically meaningful links from false positive ones.
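The spectral re-ordering step at the heart of Qiu's method can be sketched as follows (an illustration; expr is an assumed samples-by-genes matrix, and using the absolute correlation as proxy adjacency is a simplifying choice made here):

    # Fiedler-vector gene re-ordering prior to windowed MI computation.
    A <- abs(cor(expr))          # correlation matrix as proxy adjacency
    diag(A) <- 0
    L <- diag(rowSums(A)) - A    # graph Laplacian

    eig <- eigen(L, symmetric = TRUE)   # eigenvalues in decreasing order
    fiedler <- eig$vectors[, ncol(eig$vectors) - 1]  # 2nd smallest eigenvalue
    ordering <- order(fiedler)   # genes sorted along the Fiedler vector

    expr <- expr[, ordering]     # MI is then computed in a sliding window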

    3.1.2. Markov Random Fields

A Markov random field is an $n$-dimensional random process defined on a discrete lattice. Usually the lattice is a regular 2-dimensional grid in the plane, either finite or infinite. Assuming that $X_n$ is a Markov chain taking values in a finite set,

$$P(X_n = x_n \,|\, X_k = x_k,\ k \neq n) = P(X_n = x_n \,|\, X_{n-1} = x_{n-1},\ X_{n+1} = x_{n+1}) \qquad (21)$$

Hence, the full conditional distribution of $X_n$ depends only on the neighbors $X_{n-1}$ and $X_{n+1}$. In the 2-D setting, $S = \{1, 2, \dots, N\} \times \{1, 2, \dots, N\}$ is the set of $N^2$ points, called sites or states. The aforementioned construction defines a conditional Markov random field [32].

Markov random field (MRF) models have been applied in several scenarios within the computational molecular biology setting, for instance with regard to functional prediction of proteins in protein-protein interaction networks [14, 15, 35], in the discovery of molecular pathways from protein interaction and gene expression data [56], and in general network-based analysis of genomic data [63, 64]. In the case of reverse engineering methods for network inference, an MRF model can be stated as follows [63].

An arbitrary state assignment for a gene set will be denoted by $x = (x_1, x_2, \dots, x_p)$; here $x_i$ is the expression state (either equally or differentially expressed, 0 or 1 respectively) of gene $i$. Let $x$ be the true but unknown gene expression state. We can interpret this as a particular realization of a random vector $X = (X_1, X_2, \dots, X_p)$, where $X_i$ assigns an expression state to gene $i$. Let $y_i$ stand for the experimentally observed mRNA expression level of gene $i$ and $y$ the corresponding vector, which here is interpreted as a particular realization of a random vector $Y = (Y_1, Y_2, \dots, Y_p)$. $Y_i$ itself is a vector, $y_i = (y_{i,1}, y_{i,2}, \dots, y_{i,m}, y_{i,m+1}, y_{i,m+2}, \dots, y_{i,m+n})$; this vector contains $m$ replicates under one condition and $n$ replicates under the other condition. The joint distribution of $Y$ can be given in terms of an MRF; to write down this joint probability we need to know the conditional dependence/independence structure. Information theory can then be useful to determine such conditional dependencies from the distributions. One way to do that is by means of the so-called Iterated Conditional Modes (ICM) algorithm [63], but other IT-based alternatives can also be used.

Conditional dependencies are not the only application of IT and MRFs in transcriptional network inference. To study functional robustness in GRNs, Emmert-Streib and Dehmer [20] modeled the information processing within the network as a first order Markov chain and studied the influence of single gene perturbations on the global, asymptotic communication among genes. Differences were accounted for by an information theoretic measure that allowed them to predict genes that are fragile with respect to single gene knockouts. The information theoretic measure used to capture the asymptotic behavior of information processing evaluates the deviation of the unperturbed (or normal, $n$) state from the perturbed ($p$) state caused by the perturbation of gene $k$. The relative entropy or Kullback-Leibler (KL) divergence was used to quantify this deviation:

$$KL_{i,k} = KL\left[ p^{p,i,k};\ p^{n,i} \right] = \sum_m p^{p,i,k}(m) \log \frac{p^{p,i,k}(m)}{p^{n,i}(m)} \qquad (22)$$

In equation 22 the stationary distributions $p^{p,i,k}$ and $p^{n,i}$ are given by:

$$p^{p,i,k} = \lim_{t \to \infty} T_k^t\, p_i^0 \qquad (23)$$

$$p^{n,i} = \lim_{t \to \infty} T^t\, p_i^0 \qquad (24)$$

The Markov chain given by $T_k$ corresponds to the process obtained by perturbing gene $k$ in the network. By means of this Markov chain model, supplemented with an information theoretical KL measure, Emmert-Streib and Dehmer [20] were able to study the asymptotic behavior of the transcriptional regulatory network of yeast regarding information propagation under the influence of single gene perturbations. Hence not only static network properties (such as structure) of transcriptional regulation networks, but also dynamic features (such as robustness), can be analyzed from the standpoint of IT. The study concludes that knocked-out genes destroy some communication paths and, hence, can still have a strong impact on the information processing within the cell. It seems reasonable to assume that the further away the knockout gene is from the starting gene (say, in Dijkstra distance [16]) the smaller the impact will be. This is strong evidence that information processing on a systems level depends crucially on the information processing in a local environment of the gene that sends the information.
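The asymptotic comparison of equations 22-24 is easy to reproduce on a toy chain (a sketch of ours; the 4-state column-stochastic matrices and the chosen perturbation are assumptions, not the yeast network of [20]):

    # Stationary distributions by power iteration and their KL divergence.
    stationary <- function(Tmat, p0, steps = 500) {
      p <- p0
      for (t in seq_len(steps)) p <- as.vector(Tmat %*% p)   # T^t p0
      p
    }
    kl <- function(p, q) sum(p * log(p / q))

    Tn <- matrix(c(0.7, 0.1, 0.1, 0.1,
                   0.2, 0.6, 0.1, 0.1,
                   0.1, 0.2, 0.6, 0.1,
                   0.1, 0.1, 0.2, 0.6), 4, 4)  # normal chain (columns sum to 1)
    Tk <- Tn; Tk[, 2] <- rep(0.25, 4)          # perturb "gene" 2

    p0 <- c(1, 0, 0, 0)
    kl(stationary(Tk, p0), stationary(Tn, p0)) # KL_{i,k} of equation 22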

From the perspective of information processing, the connection between asymptotic information change and local network structure, as represented by node degrees, is interesting because it indicates that a local subgraph may be sufficient to study information processing in the overall network. This finding is particularly relevant because it would allow a reduction of the computational complexity (and of the computational burden) that arises when studying large genomes on a systems scale.

    3.1.3. Data Processing Inequality

In engineering and information theory, the data processing inequality (DPI) is a simple but useful theorem which states that no matter what processing you do on some data, you cannot get more information (in the sense of Shannon [58]) out of a set of data than was there to begin with. In a sense, it provides a bound on how much can be accomplished with signal processing [12]. More quantitatively, consider two random variables, $X$ and $Y$, whose mutual information is $I(X, Y)$. Now consider a third random variable, $Z$, that is a (probabilistic) function of $Y$ only. The "only" qualifier means $P_{Z|XY}(z|x, y) = P_{Z|Y}(z|y)$, which in turn implies that $P_{X|YZ}(x|y, z) = P_{X|Y}(x|y)$, as is easy to show using Bayes' theorem. The DPI states that $Z$ cannot have more information about $X$ than $Y$ has about $X$; that is, $I(X; Z) \leq I(X; Y)$. This inequality, which again is a property that Shannon's information should have, can be proved thus: $I(X; Z) = H(X) - H(X|Z) \leq H(X) - H(X|Y, Z) = H(X) - H(X|Y) = I(X; Y)$. The inequality follows because conditioning on an extra variable (in this case $Y$ as well as $Z$) can only decrease entropy, and the second-to-last equality follows because $P_{X|YZ}(x|y, z) = P_{X|Y}(x|y)$. This same principle is applicable both to engineering control systems and to biological signal processing such as that present in GRNs [38, 57].

In reference [38] the DPI states that if genes $g_1$ and $g_3$ interact only through a third gene, $g_2$, then $I(g_1, g_3) \leq \min\left[ I(g_1, g_2);\ I(g_2, g_3) \right]$. Hence the least of the three MIs can come from indirect interactions only, so that the proposed algorithm (ARACNe) examines each gene triplet for which all three MIs are greater than a threshold $I_0$ and removes the edge with the smallest value. The DPI is thus useful to efficiently quantify the dependencies among a large number of genes. The ARACNe algorithm eliminates those statistical dependencies that might be of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will very likely have non-linearly correlated expression profiles, which may result in high MI, and would otherwise be selected as candidate interacting genes. Given a transcription factor, application of the DPI will generate predictions about other genes that may be its direct transcriptional targets or its upstream transcriptional regulators [39, 25].
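A minimal sketch of DPI-based pruning in the spirit of ARACNe follows (illustrative, not the reference implementation; mim is an assumed symmetric MI matrix already thresholded at $I_0$, and eps is a tolerance against MI estimation noise):

    # Remove the weakest edge of every fully connected gene triplet.
    dpi_prune <- function(mim, eps = 0) {
      net <- mim
      n <- ncol(mim)
      for (i in 1:(n - 2)) for (j in (i + 1):(n - 1)) for (k in (j + 1):n) {
        trio <- c(mim[i, j], mim[i, k], mim[j, k])
        if (all(trio > 0)) {
          w <- which.min(trio)
          if (trio[w] < min(trio[-w]) - eps) {     # strictly the weakest
            if (w == 1) net[i, j] <- net[j, i] <- 0
            if (w == 2) net[i, k] <- net[k, i] <- 0
            if (w == 3) net[j, k] <- net[k, j] <- 0
          }
        }
      }
      net
    }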

The use of the DPI may result not only in a better assessment of the results but also in a significant reduction of the computational burden associated with network inference. Zola et al. [67] presented a parallel method integrating mutual information, the data processing inequality, and statistical testing to detect significant dependencies between genes, and to efficiently exploit the parallelism inherent in such computations. They developed a method to carry out permutation testing for assessing the statistical significance of interactions, while reducing its computational complexity by a factor of $O(n^2)$, where $n$ is the number of genes. They solved the problem of inference (usually consuming thousands of computation hours) at the whole-genome network level by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster [67].

    3.1.4. Minimum Description Length

One of the major drawbacks of information theoretic models for inferring GRNs is that of setting up a threshold which defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem [10, 19]. The description length used by the MDL principle is the sum of the model length and the data encoding length. A user-specified fine tuning parameter is used as a control mechanism between model and data encoding, but it is difficult to find the optimal parameter. A new inference algorithm has been proposed which incorporates mutual information (MI), conditional mutual information (CMI) [defined in terms of the associated conditional entropies] and the predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data.


It is also noticeable that the MDL principle helps to achieve a good trade-off between network model complexity and the accuracy of data fitting, since, given a network and a dataset, the MDL principle evaluates simultaneously the goodness of fit of the network and the data. Intuitively, the more complicated the network is, the better the data will be fitted. However, models which are over-fitted relative to the actual system are very often selected, which gives rise to numerous errors. MDL aims to achieve a good trade-off between model complexity and fitness to the data. A general criterion is thus obtained for constructing the network so as to contain only direct interactions. The convergence of the proposed MDL-based network inference algorithms can be assessed by the recovery of the topology of some artificial networks and through the error rate plots obtained through extensive simulations on datasets produced by synthetic networks [66].
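Schematically, the two-part trade-off can be written down in a few lines (an assumed toy encoding for illustration only, not the specific code lengths of [10, 19]): the candidate network minimizing the sum of model bits and data bits is selected, with no free threshold parameter.

    # Two-part description length of a candidate network (toy encoding).
    description_length <- function(net, residual_ss, n_obs) {
      k <- sum(net != 0) / 2                         # number of edges
      model_bits <- k * log2(choose(ncol(net), 2))   # bits to name each edge
      data_bits  <- (n_obs / 2) * log2(residual_ss / n_obs)  # Gaussian fit
      model_bits + data_bits                         # minimize over candidates
    }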

3.1.5. Kullback-Leibler Divergence

The Kullback-Leibler divergence [33] (as well as its symmetrized version, the Jensen-Shannon measure) is, as it turns out, a very commonly used information measure in GRN inference and other problems in computational molecular biology, either as the unique measure [45, 44] or used in conjunction with other indicators, such as spectral metrics [29], Markov fields [20], minimum description lengths [19], Bayesian networks [50, 31, 46, 48] and multivariate analysis [40].

However, by far the most general use of the KL divergence within the GRN information setting is in playing the role of the multi-information. It is known [40] that for two variables, $X_1$ and $X_2$, independence is well defined via decomposition of the bivariate JPD, $P(X_1, X_2) = P(X_1)\,P(X_2)$, and the mutual information $I(X_1; X_2) = \left\langle \log_2 \frac{P(X_1, X_2)}{P(X_1)\,P(X_2)} \right\rangle$ is the only measure of dependence [58]. Along the same lines, the total interaction (i.e. the deviation from independence) in a multivariate JPD, $P(X_i)$, $i = 1, \dots, N$, can be measured by the multi-information as follows:

$$I(X_1; X_2; \dots; X_N) = KL\left[ P(X_1, X_2, \dots, X_N);\ \tilde{P} \right] = KL\left[ P(X_1, X_2, \dots, X_N);\ \prod_i P(X_i) \right] \qquad (29)$$

Here $P(X_1, X_2, \dots, X_N)$ is the full JPD and $\tilde{P} = \prod_i P(X_i)$ is the probability distribution approximated under the independence assumption. Since $\tilde{P}$ is the maximum entropy (MaxEnt) distribution [28] that has the same univariate marginals as $P$, only without statistical dependencies among the variables, the multi-information is given by the KL divergence between the JPD and its MaxEnt approximation with univariate marginal constraints. This KL divergence measures the gain in information from knowing the complete JPD against assuming total independence. In a similar fashion, MaxEnt distributions consistent with various multivariate marginals of the JPD introduce no statistical interactions apart from the corresponding marginals. By comparing the JPD to its MaxEnt approximations under various marginal constraints, we expect to separate dependencies included in the low-order statistics from those not present in them [40].

Assuming that we have an $N$-variable GRN and that we know a set of marginal distributions of all variable subsets (of size $k \geq 1$), one can ask what is the JPD $P^{(k)}$ that captures all multivariate interactions prescribed by these marginals, but introduces no additional dependencies. This is of course equivalent to searching for the minimum $I(X_1; X_2; \dots; X_N)$ or, conversely, the maximum entropy $H(X_1, X_2, \dots, X_N)$, turning our inference problem into a MaxEnt problem:

$$P^{(k)} \leftarrow \arg\max_{P, \{\lambda\}} \left[ H(P) - \sum_{M} \lambda_M \left( P^{(k)}_M - P_M \right) \right] \qquad (30)$$

where $M$ runs over the set of constrained marginals.
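For a small discrete system, the multi-information of equation 29 can be computed directly (a sketch; the random 2x2x2 joint array is a toy example of ours):

    # Multi-information: KL between the JPD and the product of its marginals.
    set.seed(3)
    P <- array(runif(8), dim = c(2, 2, 2)); P <- P / sum(P)

    p1 <- apply(P, 1, sum); p2 <- apply(P, 2, sum); p3 <- apply(P, 3, sum)
    Ptilde <- outer(outer(p1, p2), p3)   # MaxEnt/independence approximation

    sum(P * log2(P / Ptilde))            # equation 29, in bits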

    3.1.6. Information Based Similarity

A promising approach consists in considering that the interactivity of the system is based on communication channels (either real or abstract) for the bio-signals. Thus, information theory could play a useful role in identifying entropic measures between pairs $\{g_i, g_j\}$ of genes within the sampling universe as potential interactions $\phi_{i,j}$. IT can also provide means to test for the MaxEnt distribution, by considering, for example, the Kullback-Leibler (KL) divergence (in the sense of multi-information) or the Connected Information as criteria of iterative convergence to the MaxEnt PDF, in the same sense that the cumulative distribution leads to the specification of usual PDFs [61].

One possible approach that we propose below is based on the quantification of the so-called Information-Based Similarity index (IBS) [65], initially developed to work out the complex structure generated by human heart beat time series. Nevertheless, IBS has proved to be a very powerful tool in the comparison of the dynamics of highly nonlinear processes. Within the present context [26], the symbolic sequence represents the expression values of a single gene (say the $k$-th gene) along the sampling universe (of size $M$), as given by a vector $\sigma = g_k = (g_{k_1}, g_{k_2}, \dots, g_{k_M})$. Let us consider a series $\sigma$ that could well represent a gene expression vector. It is possible to classify each pair of successive points into one of the following binary states $B_n$: if $(\sigma_{n+1} - \sigma_n) < 0$ then $B_n = 0$; otherwise $B_n = 1$. This procedure maps the $M$-step real-valued time series $\sigma(i)$ into an $(M-1)$-step binary-valued series $B(i)$. It is now possible to define a binary sequence of length $m$ (called an $m$-bit word). Each of the $m$-bit words $w_k$ represents a unique pattern in a given time series. For every unitary time shift, the algorithm collects a different set $W$ of $m$-bit words over the whole time series, $W = \{w_1, w_2, \dots, w_n\}$. It is expected that the frequency of occurrence of these $m$-bit words will reflect the underlying dynamics of the original (real-valued) time series. We then seek to write down a probability distribution function in the rank-frequency representation (RF-PDF). This RF-PDF represents the statistical hierarchy of symbolic words of the original series [65]. Two given symbolic sequences are said to have similarity if they give rise to similar probability distribution functions.

Following the same order of ideas, Yang and collaborators [65] defined a measure of similarity (akin to statistical equivalence) between two series by plotting the rank number of every $m$-bit word in the first series against the rank of the same $m$-bit word in the second series. Of course, since the series are supposed to be finite, the $m$-bit words are not equally likely to appear. The method introduces the likelihood of each word by defining a weighted distance $\Delta_m$ between two given symbolic sequences $\sigma_1$ and $\sigma_2$ as follows:

$$\Delta_m(\sigma_1, \sigma_2) = \frac{1}{2^m - 1} \sum_{k=1}^{2^m} \left| R_1(w_k) - R_2(w_k) \right| F(w_k) \qquad (31)$$

$F(w_k)$ is the normalized likelihood of the $m$-bit word $k$, weighted by its Shannon entropy, i.e.:

$$F(w_k) = \frac{1}{Z} \left[ -p_1(w_k) \log p_1(w_k) - p_2(w_k) \log p_2(w_k) \right] \qquad (32)$$

$p_i(w_k)$ and $R_i(w_k)$ represent the probability and rank of a given word $w_k$ in the $i$-th series. The normalization factor in equation 32 is the total Shannon entropy of the ensemble and is calculated as $Z = \sum_k \left[ -p_1(w_k) \log p_1(w_k) - p_2(w_k) \log p_2(w_k) \right]$. $\Delta_m(\sigma_1, \sigma_2)$ is called the Information-Based Similarity index (IBS) between series $\sigma_1$ and $\sigma_2$ (e.g. expression vectors $g_1$ and $g_2$ for genes 1 and 2, respectively). One notices that $\Delta_m(\sigma_1, \sigma_2) \in [0, 1]$ for all $\sigma_1, \sigma_2, m$; in fact, one is able to consider $\Delta_m(\sigma_1, \sigma_2)$ as a probability measure. If $\Delta_m(\sigma_1, \sigma_2) \to 1$ the series are absolutely dissimilar, whereas in the opposite case ($\Delta_m(\sigma_1, \sigma_2) \to 0$) the two series become equivalent (in the statistical sense). One can then approximate the value of the interaction potentials $\phi(g_i, g_j)$ as follows. If one considers interaction as given by correlation or information flow, one notices that high values of $\Delta_m$ imply stronger dissimilarity, hence lower correlation; since $\Delta_m$ is a probability measure, one can define the complementary measure $\bar{\Delta}_m = 1 - \Delta_m$ and then approximate $\phi(g_i, g_j) \approx \bar{\Delta}_m(g_i, g_j)$.
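The whole IBS pipeline, from binarization to equation 31, fits in a short R sketch (illustrative; the word length m = 3 and natural logarithms are choices of ours, not prescriptions of [65]):

    # Rank-frequency statistics of m-bit words for one series.
    word_ranks <- function(series, m = 3) {
      B <- as.integer(diff(series) >= 0)        # binary increment series
      idx <- seq_len(length(B) - m + 1)
      words <- sapply(idx, function(i) paste(B[i:(i + m - 1)], collapse = ""))
      lev <- apply(expand.grid(rep(list(0:1), m)), 1, paste, collapse = "")
      p <- as.numeric(table(factor(words, levels = lev)))
      p <- p / sum(p)
      list(p = p, rank = rank(-p, ties.method = "first"))
    }

    # IBS distance between two series, equations 31-32.
    ibs <- function(s1, s2, m = 3) {
      a <- word_ranks(s1, m); b <- word_ranks(s2, m)
      h <- -a$p * log(pmax(a$p, 1e-12)) - b$p * log(pmax(b$p, 1e-12))
      Fw <- h / sum(h)                              # equation 32
      sum(abs(a$rank - b$rank) * Fw) / (2^m - 1)    # equation 31
    }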

    4. Bayesian and Machine Learning Methods

Systems biology aims to understand biological processes in living systems by developing mathematical models which are capable of integrating both experimental and theoretical knowledge, and it works both ways: given a pre-specified mathematical framework, the behavior of a set of genes in a specific GRN can be simulated under a variety of biological conditions and used to test hypotheses. But also, given a particular pre-specified mathematical framework, the observation of gene behavior under specific conditions may be used to infer the underlying GRN. Generally speaking, the reconstruction of a GRN based on experimental data is known as a reverse engineering approach.

In the context of information theory combined with systems biology, there are two well known information extraction approaches, characterized as top-down and bottom-up; both have been used to infer GRNs from high-throughput data sources such as microarray gene expression measurements. A top-down approach mainly breaks down a system in order to gain insights into it. On the other hand, bottom-up approaches seek to construct synthetic gene networks.

The simplest network in an information theory approach is the correlation network. This is an undirected graph with edges that are weighted by correlation coefficients. It is simple, computationally manageable and has small data requirements. The drawback is that these models are static and do not infer the causality of gene regulation.
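As a baseline for what follows, a correlation network takes only a couple of lines of R (a sketch; expr is an assumed samples-by-genes matrix and the 0.6 threshold is arbitrary):

    # Undirected correlation network with correlation-weighted edges.
    W <- cor(expr)
    A <- (abs(W) > 0.6) & upper.tri(W)   # thresholded adjacency
    e <- which(A, arr.ind = TRUE)        # gene index pairs (i, j)
    data.frame(from   = colnames(expr)[e[, 1]],
               to     = colnames(expr)[e[, 2]],
               weight = W[e])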


4.1. Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical network model, described by a directed acyclic graph (DAG). In the model each node represents a random variable and edges define conditional independence relations between these random variables. These relationships, e.g. gene-gene interactions, can be seen in a directed graph without cycles. Without cycles means a gene may have no direct or indirect interaction with itself. In order to reverse engineer a gene network using this approach, one would need to find the directed acyclic graph that best describes the gene expression data. This particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network.

    4.2. Dynamic Bayesian Networks

    Bayesian networks that model sequences of variables are called dynamic

    Bayesian networks (DBNs). Murphy and Mian [47] first introduced the use of

    DBNs to model gene expression time series data. The benefits of DBNs include

    the ability to handle latent variables and missing data (such as transcription

    factor protein concentrations, which may have an effect on mRNA steady state

    levels) and to model stochasticity. Friedman et al. [23] explored experimental

    applications to microarray data analysis. Dynamic Bayesian networks may also

    use continuous measurements rather than discrete. Feedback loops can also

    be unfolded with respect to time, by explicitly modeling the influence of gene

g_1 at time t_1 on another gene g_2 at time t_2, where t_2 > t_1. An appropriate model for gene expression microarray data belongs to the class of linear state-space models, widely used in estimation and control problems arising in system

    modeling. These models consist of a state variable that is either unobserved

    or partially observed, an observable that evolves in a linear relation to the state

    variable, and a structural specification which is a set of parameters in the linear

    and distributional relationships between state variables, observables, and noise

    terms.

    4.3. State-Space Models

    State-Space models, also known as Linear Dynamical Systems (LDS), are

    a subclass of dynamic Bayesian networks. A state space model is a mathemati-

    cal model for a process that accepts inputs which are the drivers of the process

    and generates outputs that are interpreted as observable manifestations of what


is going on inside the process, and how this internal behavior is affected by the inputs. These models are suitable for modeling time series data where we have

    a series of observations related to a series of unobserved variables changing

    over time. Time series models in state-space representation can be thought of as

unobserved-component models. The state vector represents those unobserved,

    or hidden or missing variables and their dynamics over time are governed by

    a state transition equation. In the very general setting of a state-space model,

    the state vector determines the future evolution of the dynamic system, given

    future time paths of all of the variables affecting the system. The variables are

    not restricted, they can be either discrete with a countable number of possible

    values or continuous with an associated density curve. For example, modeling

    gene expression data assumes continuous variables and requires the inclusion

    of hidden states. Hidden variables could model the effects of genes that have

not been included in the experiment; they could also model levels of regulatory

    proteins as well as possible effects of mRNA or protein degradation. One goal

    is to infer the characteristics and properties of the unobserved variables based

    on the observations. In linear state-space models, a sequence of p-dimensional

real-valued observation vectors {y_1, ..., y_T} is modeled by assuming that at each time step y_t was generated from a K-dimensional real-valued hidden (i.e. unobserved) state variable x_t, and that the sequence of x's is governed by a first-order Markov process. This type of model is shown pictorially in Figure 3.

A linear-Gaussian state-space model of the time series {y_t} is specified by the matrices A and C, called system matrices, and is described by a pair of equations:

x_{t+1} = A x_t + w_t    (33)

y_t = C x_t + v_t    (34)

    These two equations represent the most basic form of a state-space model.

The vector x_t ∈ R^K is called the state vector at time t. The state equation (33) shows how this vector evolves with time. A is the dynamic or transition state matrix, and its eigenvalues are important in determining the way the data behave. The observation equation (34) specifies the relationship between the observed data and this newly introduced vector x_t. C describes the relation between state and observation, and w_t and v_t are zero-mean random noise vectors.

In the most general case the noise vectors could be mutually correlated,

    although serially uncorrelated. In the particular Linear Gaussian case they are


    Figure 3. State-Space model.

mutually independent and independent of the initial state value x_0. Assuming that the initial state x_0 is fixed or Gaussian distributed, and that the noise vectors are jointly Gaussian, the state and output of the system are also Gaussian. That is, all future hidden states x_t and observations y_t generated from those hidden states will be Gaussian distributed.
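A minimal simulation of the pair (33)-(34) under the linear-Gaussian assumptions makes this generative reading explicit; all dimensions and matrix values below are toy choices.

import numpy as np

rng = np.random.default_rng(0)
K, p, T = 2, 4, 100                      # hidden dim, observed dim, time steps
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # state transition matrix (toy values)
C = rng.normal(size=(p, K))              # state-to-observation matrix
Q = 0.1 * np.eye(K)                      # covariance of the state noise w_t
R = 0.5 * np.eye(p)                      # covariance of the observation noise v_t

x = np.zeros(K)                          # fixed initial state x_0
X, Y = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(K), Q)   # eq. (33)
    y = C @ x + rng.multivariate_normal(np.zeros(p), R)   # eq. (34)
    X.append(x)
    Y.append(y)
X, Y = np.array(X), np.array(Y)          # hidden and observed trajectories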

This model has been studied extensively in the state-space literature. Brockwell

    and Davis [7] develop the state-space model described by (33) and (34) as

    well as the associated Kalman filter recursions and apply these in representing

    ARMA (autoregressive moving average) and ARIMA (autoregressive integrated

    moving average) processes. The Kalman filter recursions define recursive esti-

mators for the state vector x_t, given observations up to the present time t. Stoffer and Shumway [59] present a similar development and apply it to representing

    ARMAX (autoregressive-moving average with exogenous terms) models. Stof-

    fer and Shumway also develop the recursive smoother, which gives estimators

of the state variable x_t given observations prior to and after time t, and develop state-space models that include exogenous inputs in the state equation, observa-

    tion equation, or both. State-space models can be written in different ways. The

structure of the model used in this work includes exogenous variables in both

    equations and its derivation is detailed in the next section.
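As a sketch of the filter recursions mentioned above, the following function computes the filtered estimates E[x_t | y_1, ..., y_t] for the basic model (33)-(34); this is the textbook predict/update cycle under assumed noise covariances Q and R, not the exact implementation referenced here.

import numpy as np

def kalman_filter(Y, A, C, Q, R, x0, P0):
    # Forward pass of the Kalman filter for x_{t+1} = A x_t + w_t,
    # y_t = C x_t + v_t; returns the filtered state estimates.
    x, P = x0, P0
    filtered = []
    for y in Y:
        x_pred = A @ x                            # predict the next state
        P_pred = A @ P @ A.T + Q
        S = C @ P_pred @ C.T + R                  # innovation covariance
        K_gain = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
        x = x_pred + K_gain @ (y - C @ x_pred)    # update with observation y
        P = P_pred - K_gain @ C @ P_pred
        filtered.append(x)
    return np.array(filtered)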


4.4. LDS Model for Gene Expression

Fluorescent intensities are measures of gene expression levels. Values of

    some of these variables influence the values of others through the regulatory

    proteins they express, including the possibility that the expression of a gene at

    one time point may, in various circumstances, influence the expression of that

    same gene at a later time point.

To model the effects of the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables, we modify the structure of the LDS model with inputs as follows. We let the observation y_t^{(i)} = g_t^{(i)}, the expression level of gene i at time point t, and the inputs h_t = g_t and u_t = g_{t-1}, to give the model shown in Figure 4.

    Figure 4. Bayesian network representation of the model for gene expression.

This model is described by the following equations:

x_{t+1} = A x_t + B g_t + w_t    (35)

g_t = C x_t + D g_{t-1} + v_t    (36)


Model Assumptions

The vector u_t ∈ R^{p_u} is the exogenous input observation vector, and h_t ∈ R^{p_h} represents the exogenous influence on the hidden states. As before, the state and observation vectors x_t and y_t have dimensions K and p, respectively. A is the state transition matrix, B is the input-to-state matrix in the state transition equation, C is the state-to-observation matrix, and D is the input-to-observation matrix. The state and observation noise vectors, w_t and v_t respectively, are random vectors, serially independent and identically distributed, and also independent of the initial values of x and y and independent of one another.

Remarks

The system matrices A, B, C, D are taken to be constant in this work, but they may also vary over time, in which case it is appropriate to add a subscript indicating this. When the sequence {x_1, w_1, ..., w_T} is independent, the distribution of x_{t+1} | x_t, ..., x_1 is the same as the distribution of x_{t+1} | x_t; hence the state vector x_t evolves with a first-order Markov property, with A as the transition matrix. The noise vectors can also be viewed as hidden variables. Here the matrix D in the observation equation captures gene-gene expression level influences at consecutive time points, whilst the matrix C captures the influence of the hidden variables on gene expression level at each time point. Matrix B models the influence of gene expression values from previous time points on the hidden states, and A is the state transition matrix. However, our interest focuses on CB + D, which captures not only the direct gene-to-gene interactions but also the gene-to-gene interactions mediated through the hidden states over time. This is the matrix we will concentrate the analysis on, since it captures all of the information related to gene-gene interaction over time.
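Once estimates of the system matrices are in hand, assembling this interaction matrix is a single product and sum; the random matrices below are placeholders standing in for fitted estimates, and the reading of the entries is only indicative.

import numpy as np

K, p = 3, 5                       # hidden and observed dimensions (toy sizes)
C = np.random.randn(p, K)         # state-to-observation estimate
B = np.random.randn(K, p)         # input-to-state estimate
D = np.random.randn(p, p)         # input-to-observation estimate

interaction = C @ B + D           # gene-gene interaction matrix over time
# a large |interaction[i, j]| suggests that gene j at time t-1
# influences gene i at time t, directly or through the hidden states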

    5. Constrained LDS

    Mathematically speaking, the idea of adding constraints to the model is ba-

    sically to reduce the number of parameters to estimate. Narrowing down the


range of parameters to estimate by adding extra constraints reduces dimensionality, which can considerably simplify the search for the parameters that best describe the model. Throughout constrained modeling, diagnostics should be run to make sure the model still fits well once the constraints are taken into account. How precisely to include these forms of information in the inference process is not a straightforward task; this, however, is the true art of modeling.

    From the biological point of view, the current application to gene expression

    data is already complex. Data generation, low-level analyses and classification

    are known to be crucial in getting gene expression levels. Different algorithms

can lead to different sets of genes. Hence, biological knowledge mining should be present in any machine learning approach. In this sense, any knowledge about gene behavior and regulatory interactions is helpful. If this additional information can be included and modeled, estimation becomes more realistic, not only due to the reduction of parameters but also due to a more biologically grounded approach.

    Given either a-priori or new hypothesized information leading to a set of

    plausible models, the LDS model is re-trained based on this knowledge about

    the parameters. The a-priori information would be supplied by past experiments

    or biological knowledge, while the new hypothesized information is obtained

from the bootstrap analysis.

    5.1. Model Definition

    Two competing motivations must be kept in mind when defining a model:

fidelity and tractability. The model's fidelity describes how closely it corresponds to reality. On the other hand, the model's tractability focuses on the ease

    with which it can be mathematically described as well as analyzed and validated

    statistically based on observations and measurements. It is understandable that

    increasing one (either fidelity or tractability) is usually done at the expense of

    the other. Consequently, the ideal model should be developed in close cooper-

    ation between the science governing the application and feasible mathematical

    and statistical methods. One common assumption that aids tractability is that

    model errors are normally (or Gaussian) distributed. Indeed, a large number

    of existing algorithms and methods of statistical inference are based on jointly

    Gaussian observables. Though rarely satisfied exactly in practice, this assump-

    tion is often justified because it makes the analysis of the model tractable and

    the resulting statistical inferences are robust in the sense of being insensitive to


small departures from normality. The model definition used in this work adopts the Gaussian assumption only insofar as it makes the analysis of the models more straightforward and tractable. However, for statistical inference and validation of the model, no essential use of the Gaussian assumption is made. Instead, more general methods such as bootstrapping are employed.

    5.2. Structural Specification

    We will concentrate here on incorporating a-priori information, and for this,

    the emphasis is on constraining elements in the matrix D. The reason for this is

    simple: D describes the direct gene-to-gene interactions over time, and there-

fore seems the most suitable place to incorporate a-priori information. Recall that the gene regulatory network is constructed from the estimate of CB + D,

    and thus has incorporated in it also the influence of hidden variables (e.g. the

    influence of missing genes / proteins, etc.). Thus, the hypothesized form of this

DAG entails that some elements of the matrix CB + D are zero. The idea now

    would be to impose those constraints on CB + D and re-estimate the model

    structural parameters under these constraints and verify that the model still fits

    the data well. Imposing constraints reduces the dimensionality of the unknown-

    parameter space, and thus creates a new estimation problem (one for which the

    remaining unconstrained parameters can be estimated more precisely). Because

    of this, solving this new estimation problem (and performing diagnostics) could

    expose shortcomings in how well the constrained model describes the data, or

    could expose other parts of the model structure that were obscured because of

    the larger number of parameters to estimate in the unconstrained model.

    5.3. Estimation

    With the structural specification known, the objective is to estimate, in a

    least-squares sense, the unknown or unobserved state variables from the avail-

    able observations. The so-called Kalman filter solves this problem, and vari-

    ations of the filter give interpolation, extrapolation, and smoothing estimators

    of the state variables (see the book by Aoki [5], for example). The resulting

estimators are optimal in the sense of least-squares, given that one is restricted

    to consideration of estimators that are linear functions of the observables. Their

    derivation can be accomplished in generality by casting the problem in the con-

    text of approximation in a Hilbert space of random variables possessing finite


    second order moments. This reduces the problem to one of computing projec-tions onto the subspaces spanned by the observables, but the derivations and

    machinery of that theoretical approach are tedious. However, in the special

    case when the states and observables are jointly Gaussian, the least squares es-

    timators of state are given by conditional expectations (conditioned on the ob-

    servables) which are in turn linear functions of the observables. Moreover, the

    conditional expectation operator has all the essential properties of the subspace

    projection operator in the Hilbert space context. As a consequence, the shorter

    and more elegant analysis of the problem in the Gaussian context leads to ex-

    actly the same estimators of the state variables as the more general Hilbert space

    context. Thus, in terms of formulating the state estimators, there is no loss of

    generality in assuming Gaussian joint distributions.

    Regarding the estimation of the structural parameters, in the absence of as-

    sumptions regarding the joint distributions of the state variables and observables

    or any other pertinent information, a weighted least-squares approach would be

    reasonable and justified. If the assumption is made that the state variables and

    observables are jointly Gaussian, then the method of maximum likelihood leads

    to parameter estimators that are essentially equivalent to those yielded by the

    weighted least-squares approach. Thus, again there is no loss of generality in

    making the Gaussian assumption for constructing estimators of structural pa-

    rameters.

    5.4. Derivation

    To model the effects of the influence of the expression of one gene at a

    previous time point on another gene and its associated hidden variables, we

    consider the state-space model

x_{t+1} = A x_t + B y_t + w_t    (37)

y_t = C x_t + D y_{t-1} + v_t    (38)

The column vector x is the state vector of hidden variables for the system, u is the input observation vector, and C is the state-to-observation matrix, which captures the influence of the hidden variables on gene expression level at each time point.

The matrix D describes the gene-to-gene interaction at consecutive time points. From this matrix we obtain the Bayesian network representation of the


−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]    (39)

where

S_yy = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} y_t^{(j)′},   S_yx = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} x_t^{(j)′},   S_yu = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} u_t^{(j)′},

S_xu = Σ_{j=1}^{N} Σ_{t=1}^{T} x_t^{(j)} u_t^{(j)′},   S_uu = Σ_{j=1}^{N} Σ_{t=1}^{T} u_t^{(j)} u_t^{(j)′},   P = Σ_{j=1}^{N} Σ_{t=1}^{T} E[x_t x_t′ | y_1, ..., y_T]

Taking partial derivatives of (39) and setting them equal to zero, we solve for C, D and R. In other words, we find the unconstrained estimators that minimize the objective function (39).

D = (S_yu − S_yx P^{-1} S_xu)(S_uu − S′_xu P^{-1} S_xu)^{-1}    (40)

C = (S_yx − D S′_xu) P^{-1}    (41)

R = (1/NT)(S_yy − C S′_yx − D S′_yu)    (42)
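A direct NumPy transcription of (40)-(42) might read as follows; it is a sketch that assumes the sufficient statistics defined above have already been accumulated over the N replicates and T time points.

import numpy as np

def unconstrained_estimates(Syy, Syx, Syu, Sxu, Suu, P, N, T):
    # Closed-form unconstrained estimators (40)-(42)
    Pinv = np.linalg.inv(P)
    D = (Syu - Syx @ Pinv @ Sxu) @ np.linalg.inv(Suu - Sxu.T @ Pinv @ Sxu)  # (40)
    C = (Syx - D @ Sxu.T) @ Pinv                                            # (41)
    R = (Syy - C @ Syx.T - D @ Syu.T) / (N * T)                             # (42)
    return C, D, R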

To obtain the constrained estimators (D_cons, C_cons, and R_cons) we need to solve the following:


Constrained Minimization Problem

Minimize

−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]

subject to the constraint DF − G = 0.

Solution: We introduce the Lagrange multipliers method to minimize the likelihood function (39) subject to the constraint DF − G = 0. Let us define the real-valued vector of Lagrange multipliers λ = (λ_1, λ_2, ..., λ_n)′. The likelihood function and the constraints associated with it define our objective function as:

M(C, D, R, λ) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + tr[λ′(DF − G)]    (43)

Necessary conditions for a minimum of M(C, D, R, λ) are that the elements in C, D, R, and λ be chosen to give

∂M/∂C = 0,   ∂M/∂D = 0,   and   ∂M/∂λ = constraints = 0

The third expression implies that a minimum for M is also a minimum for the likelihood function (39).

∂M/∂C = ∂/∂C tr[R^{-1}(−S_yx C′ − C S′_yx + C P C′ + C S_xu D′ + D S′_xu C′)]
= 2R^{-1}(C_cons P + D_cons S′_xu − S_yx) = 0    (44)

∂M/∂D = ∂/∂D { tr[R^{-1}(−S_yu D′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + tr[λ′(DF − G)] }
= 2R^{-1}(C_cons S_xu + D_cons S_uu − S_yu) + λF′ = 0    (45)

∂M/∂λ = D_cons F − G = 0    (46)

From (44) and (45) we get the constrained estimators for C and D:

C_cons = (S_yx − D_cons S′_xu) P^{-1}    (47)

D_cons = (S_yu − C_cons S_xu − ½ R_cons λF′) S_uu^{-1}

Using the expressions (40) and (41) for the unconstrained estimators, we get the constrained D matrix

D_cons = D − ½ R_cons λF′ (S_uu − S′_xu P^{-1} S_xu)^{-1}

Substituting this back into (46) and solving for λ gives:

½ R_cons λ = (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1}

Putting the expression above back into the equation for D_cons and solving, we finally obtain the constrained estimators for C and D in terms of the unconstrained ones.

D_cons = D − (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} F′(S_uu − S′_xu P^{-1} S_xu)^{-1}

C_cons = C + (DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} F′(S_uu − S′_xu P^{-1} S_xu)^{-1} S′_xu P^{-1}

Similarly, the constrained covariance matrix R_cons is obtained by differentiating with respect to R and solving:

∂M/∂R = NT R_cons^{-1} − R_cons^{-1}(S_yy − S_yx C′_cons − S_yu D′_cons − C_cons S′_yx + C_cons P C′_cons + C_cons S_xu D′_cons − D_cons S′_yu + D_cons S′_xu C′_cons + D_cons S_uu D′_cons) R_cons^{-1} = 0    (48)


leads to

R_cons = R + (1/NT)(−S_yu + C_cons S_xu + D_cons S_uu) D′_cons
= R + (1/NT)(−½ R_cons λF′) D′_cons    (49)
= R − (1/NT)(DF − G)[F′(S_uu − S′_xu P^{-1} S_xu)^{-1} F]^{-1} G′    (50)

Unfortunately, these constraints cannot be implemented directly in the model used for this research: selecting the matrices F and G that zero out particular elements of D becomes difficult as the size of the matrix increases. However, by re-writing the constrained problem using the vec operator we can easily handle any matrix size.

5.5. Vec Formulation

The vec operator vectorizes a matrix by stacking its columns. That is, suppose we want to vectorize a 2x2 matrix M:

M = [ m_11  m_12 ; m_21  m_22 ],   vec(M) = (m_11, m_21, m_12, m_22)′

The Kronecker product of two matrices plays an important role when using the vec operator. There are important relationships that will be used in the development of the constrained minimization problem in vec formulation.

Definition: The Kronecker product of two matrices A and B, where A is m x n and B is p x q, is defined as

A ⊗ B = [ A_11 B  A_12 B  ...  A_1n B ; A_21 B  A_22 B  ...  A_2n B ; ... ; A_m1 B  A_m2 B  ...  A_mn B ]

which is an mp x nq matrix.


Important Operator Relationships

vec(AXB) = (B′ ⊗ A) vec(X)    (51)

(AC ⊗ BD) = (A ⊗ B)(C ⊗ D)    (52)

(A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}    (53)

d(x′Ax)/dx = x′(A + A′)    (54)
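Identity (51) is easy to verify numerically; note that the column-stacking vec corresponds to Fortran-order flattening in NumPy.

import numpy as np

# Numerical check of (51): vec(AXB) = (B' kron A) vec(X)
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

vec = lambda M: M.flatten('F')          # column-stacking vec operator
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))            # True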

To show the application of the vec operator in the constraint setting, let us look at the following example.

EXAMPLE:

Let us consider a 2x2 matrix D and suppose we want to constrain it to be diagonal. Select the matrices F and G to be

D = [ d_11  d_12 ; d_21  d_22 ],   F = [ 0 1 0 0 ; 0 0 1 0 ],   G = (0, 0)′

Then, applying the constraint F vec(D) = G, we get that the elements d_21 and d_12 are zero and the matrix D becomes

D = [ d_11  0 ; 0  d_22 ]
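The construction generalizes mechanically: the sketch below builds F and G to zero out any chosen entries of an n x n matrix D, using the same column-major vec ordering (the helper name is ours, for illustration only).

import numpy as np

def zero_constraints(n, zero_entries):
    # Build F and G so that F vec(D) = G forces the listed (i, j)
    # entries of an n x n matrix D to zero (0-based indices,
    # column-major vec ordering: d_ij sits at position j*n + i).
    F = np.zeros((len(zero_entries), n * n))
    for row, (i, j) in enumerate(zero_entries):
        F[row, j * n + i] = 1.0
    G = np.zeros(len(zero_entries))
    return F, G

# the 2x2 diagonal example from the text: zero out d21 and d12
F, G = zero_constraints(2, [(1, 0), (0, 1)])
print(F)   # [[0. 1. 0. 0.]
           #  [0. 0. 1. 0.]]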

In general, for any n x n matrix D we can find matrices F and G and solve the constrained minimization problem using the vec formulation as follows:

Constrained Minimization Problem 2

Minimize

−2L(C, D, R) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)]

subject to the constraint F vec(D) − G = 0.

Solution: We introduce the Lagrange multipliers method to minimize the objective function

M(C, D, R, λ) = NT log|R| + tr[R^{-1}(S_yy − S_yx C′ − S_yu D′ − C S′_yx + C P C′ + C S_xu D′ − D S′_yu + D S′_xu C′ + D S_uu D′)] + λ′(F vec(D) − G)    (55)

subject to the constraint F vec(D) − G = 0.

∂M/∂C = ∂/∂C tr[R^{-1}(−S_yx C′ − C S′_yx + C P C′ + C S_xu D′ + D S′_xu C′)]
= 2R^{-1}(C_cons P + D_cons S′_xu − S_yx) = 0    (56)

∂M/∂vec(D) = −2 vec(R_cons^{-1} S_yu) + 2 vec(R_cons^{-1} C_cons S_xu) + 2 vec(R_cons^{-1} D_cons S_uu) + F′λ = 0    (57)

∂M/∂λ = F vec(D_cons) − G = 0    (58)

∂M/∂R = NT R_cons^{-1} − R_cons^{-1}(S_yy − S_yx C′_cons − S_yu D′_cons − C_cons S′_yx + C_cons P C′_cons + C_cons S_xu D′_cons − D_cons S′_yu + D_cons S′_xu C′_cons + D_cons S_uu D′_cons) R_cons^{-1} = 0    (59)

From (57) and the following expressions

vec(R_cons^{-1} D_cons S_uu) = (S_uu ⊗ R_cons^{-1}) vec(D_cons)

vec(R_cons^{-1} C_cons S_xu) = (S′_xu ⊗ R_cons^{-1}) vec(C_cons)

∂/∂vec(D) [λ′ F vec(D)] = F′λ

vec(C_cons) = vec(S_yx P^{-1}) − (P^{-1} S_xu ⊗ I) vec(D_cons)


we have that

vec(D_cons) = vec(D) − ½ [(S_uu − S′_xu P^{-1} S_xu)^{-1} ⊗ R_cons] F′λ

We still need to work out the value of λ. Substituting the expression above into (58) and solving for λ gives:

λ = [½ F ((S_uu − S′_xu P^{-1} S_xu)^{-1} ⊗ R_cons) F′]^{-1} (F vec(D) − G)    (60)

Now, putting this expression for λ back into the equation for vec(D_cons) above, we obtain

vec(D_cons) = vec(D) − V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (61)

where

V = (S_uu − S′_xu P^{-1} S_xu) ⊗ R_cons^{-1}
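In code, update (61) is a metric projection of vec(D) onto the constraint set F vec(D) = G; the sketch below assumes the current R_cons and the sufficient statistics are already in hand.

import numpy as np

def constrain_vecD(D, F, G, Suu, Sxu, P, R_cons):
    # Projection (61): vec(D_cons) = vec(D) - V^-1 F' [F V^-1 F']^-1 (F vec(D) - G)
    vecD = D.flatten('F')                          # column-stacking vec
    Pinv = np.linalg.inv(P)
    V = np.kron(Suu - Sxu.T @ Pinv @ Sxu, np.linalg.inv(R_cons))
    Vinv = np.linalg.inv(V)
    W = Vinv @ F.T @ np.linalg.inv(F @ Vinv @ F.T)
    vecD_cons = vecD - W @ (F @ vecD - G)
    return vecD_cons.reshape(D.shape, order='F')   # back to matrix form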

Finally, from (59) we obtain the expression for R_cons implicitly, in the form R_cons = R + f(R_cons), for which we will need to iterate, reshaping the matrix D_cons at each iteration:

R_cons = R + (1/NT)(−S_yu + C_cons S_xu + D_cons S_uu) D′_cons    (62)

5.6. Constraints Implementation - EM Procedure

In order to apply the EM algorithm, we require initial values of the state and covariance, as well as the parameters, which are initialized using linear regression. The EM procedure then operates as follows:

E-step

Given the initial estimators x_0, P_0 and initial estimators of A, B, C, D, Q and R, use the Kalman filter equations to compute the estimates for x_t^+ and P_t.

M-step

Re-estimate the unconstrained A, B, C, D, Q, and R using the values for x_t^+ and P_t in the formulas for a, b, c, d, e, and P. (Here is where we add the constraints F vec(D) − G = 0.)

ALGORITHM:

1. Start with the unconstrained estimates of C, D, and R, equations (40)-(42).

2. The vec expressions for the constrained C_cons and D_cons are in fact functions of R_cons, which is in turn a function of the unconstrained R and the previous R_cons, and has to be calculated by iteration. That is,

vec(D_cons) = vec(D) − V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (63)

vec(C_cons) = vec(C) + (P^{-1} S_xu ⊗ I) V^{-1} F′ [F V^{-1} F′]^{-1} (F vec(D) − G)    (64)

where V(R_cons(r)) is as in (61), and

R_cons = R + f(R_cons), with R_cons(0) = R and

R_cons(r+1) = R + f(R_cons(r)), r = 0, 1, 2, ..., until ||R_cons(r+1) − R_cons(r)|| < tol.

Hence,

R_c = R_cons(r+1),   C_c = C_c(R_c(r+1)),   and D_c = D_c(R_c(r+1))

3. Now, in the iteration process,

R_c(r+1) = (1/NT)[a − C_c(R_c(r)) b − D_c(R_c(r)) c + (C_c(R_c(r)) d + D_c(R_c(r)) e − c) D_c(R_c(r+1))]

so, for each iteration r, we need to reshape vec(D_c) and vec(C_c), put them back into matrix form, and compute a new R_c(r+1). Continue this until convergence; once we have the final R_c, put it back one more time to find vec(D_c) and vec(C_c) and reshape them.


4. Then D_c and C_c are the matrices that go back to the E-step, to be used (along with the other parameters) to find an updated and more accurate estimate of x_t^+ and P_t.
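Putting the pieces together, the constrained EM loop has the following shape. This is only a control-flow skeleton: kalman_smoother, sufficient_stats, unconstrained_estimates and update_R are hypothetical helpers standing for the E-step smoother, the S-statistics, the closed forms (40)-(42) and the implicit relation (62), while constrain_vecD is the projection sketched in Section 5.5.

import numpy as np

def constrained_em(Y, U, F, G, params, n_iter=50, tol=1e-6):
    # params: dict holding A, B, C, D, Q, R, x0, P0, initialized
    # (e.g. by linear regression, as described above)
    for _ in range(n_iter):
        # E-step: smoothed states and sufficient statistics under params
        s = sufficient_stats(kalman_smoother(Y, U, params))
        # M-step: unconstrained closed forms (40)-(42)
        C, D, R = unconstrained_estimates(s)
        # inner fixed-point loop: R_cons = R + f(R_cons), reshaping
        # vec(D_cons) back into matrix form at every pass
        R_cons = R
        while True:
            D_cons = constrain_vecD(D, F, G, s['Suu'], s['Sxu'], s['P'], R_cons)
            R_new = update_R(s, C, D_cons, R)       # implicit relation (62)
            if np.linalg.norm(R_new - R_cons) < tol:
                break
            R_cons = R_new
        params.update(C=C, D=D_cons, R=R_cons)      # feed back into the E-step
    return params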

    6. Conclusions

    Information theory as such, is concerned with the quantification, analysis

    and forecasting of information processing in systems under incomplete and/or

    noisy data acquisition. As we discussed in this chapter, the problem of the in-

    ference and analysis of gene regulatory networks from experimental data on

    gene expression at a genome wide scale, is closely related with the foundational

    tenets of information theory. In fact, given the current biological understand-ing of gene regulation as an extremely complex signal processing phenomena,

    information theoretical tools and concepts result a natural choice for the task

    of inference/analysis of such GRNs. We presented several instances in which

    information theory, either on its own, or combined with probabilistic graphical

    models, Bayesian statistics and machine-learning techniques have been used in

    the inference and assessment of GRNs.

    Purely information theoretical approaches are based on complex graph ren-

    derings (i.e. both cyclic and acyclic probabilistic models are allowed) and are

    able to describe the system using either continuous or discrete probability den-

    sity functions. The means for dealing with incomplete or noisy data is by quan-

tifying interactions that are usually valued by means of statistical dependence measures such as mutual information and Kullback-Leibler divergences, either

    on a marginal or conditional setting. The use of minimum description length

    as a measure of algorithmic complexity, of the data processing inequality to

discriminate between direct and indirect interactions, and of Shannon's signal

    processing theorems to establish thresholds or bounds of confidence, is usually

    supplemented with optimization based on maximum entropy (MaxEnt) tech-

    niques.

On the other hand, Bayesian/machine-learning implementations of information theoretical models are usually based on directed acyclic graphs (DAGs); these also allow either discrete or continuous probability distribution functions