Learning probabilistic networks of condition-specific response:
Digging deep in yeast stationary phase
Sushmita Roy∗, Terran Lane∗, and Margaret Werner-Washburne+
∗Department of Computer Science, University of New Mexico; +Department of Biology, University of New Mexico
Abstract
Condition-specific networks are functional networks of genes describing molecular behavior un-
der different conditions such as environmental stresses, cell types, or tissues. These networks
frequently comprise parts that are unique to each condition, and parts that are shared among
related conditions. Existing approaches for learning condition-specific networks typically iden-
tify either only differences or similarities across conditions. Most of these approaches first learn
networks per condition independently, and then identify similarities and differences in a post-
learning step. Such approaches have not exploited the shared information across conditions
during network learning.
We describe an approach for learning condition-specific networks that simultaneously identi-
fies the shared and unique subgraphs during network learning, rather than as a post-processing
step. Our approach learns networks across condition sets, shares data from conditions, and leads
to high quality networks capturing biologically meaningful information.
On simulated data from two conditions, our approach outperformed an existing approach
of learning networks per condition independently, especially on small training datasets. We
further applied our approach to microarray data from two yeast stationary-phase cell popu-
lations, quiescent and non-quiescent. Our approach identified several functional interactions
that suggest respiration-related processes are shared across the two conditions. We also iden-
tified interactions specific to each population including regulation of epigenetic expression in
the quiescent population, consistent with known characteristics of these cells. Finally, we found
several high confidence cases of combinatorial interaction among single gene deletions that can
be experimentally tested using double gene knock-outs, and contribute to our understanding of
differentiated cell populations in yeast stationary phase.
1 Introduction
Although the DNA for an organism is relatively constant, every organism on earth has the po-
tential to respond to different environmental stimuli or to differentiate into distinct cell-types or
tissues. Different environmental conditions, cell-types or tissues can be considered as different in-
stantiations of a global variable, the condition variable, which induces condition-specific responses.
These condition-specific responses typically require global changes at the transcript, protein and
metabolic levels and are of interest as they provide insight into how organisms function at a systems
level. Condition-specific networks describe functional interactions among genes and other macro-
molecules under different conditions, providing a systemic view of condition-specific behavior in
organisms.
Analysis of condition-specific responses has been one of the principal goals of molecular biology,
and several approaches have been developed to capture condition-specific responses at different
levels of granularity. The most common approach is the identification of differentially expressed
genes in a condition of interest using genome-wide measurements of gene, and often protein, expression [20]. More recent approaches are based on bi-clustering, which cluster genes and conditions
simultaneously [5,7,9,29], and identify sets of genes that are co-regulated in sets of conditions. How-
ever, these approaches do not provide fine-grained interaction structure that explains the condition-
specific response of genes. More advanced approaches additionally identify transcription modules
(set of transcription factors regulating a set of target genes) that are co-expressed in a condition-
specific manner [11,13,26,31], but these too do not provide detailed interaction information among
genes for each condition.
In this paper, we describe a novel approach, Network Inference with Pooling Data (NIPD), for
condition-specific response analysis that emphasizes the fine-grained interaction patterns among
genes under different conditions. The main conceptual contribution of our approach is to learn
networks for any subset of conditions. This subsumes existing approaches that find either only
patterns that are specific to each condition, or only patterns that are shared across conditions.
To make this clear, let us consider a simple example of two environmental starvation conditions:
Carbon and Nitrogen starvation. Using our approach we can simultaneously find patterns that are
specific only to Carbon starvation, only to Nitrogen starvation, and those that are shared across
these two conditions. From the methodological stand-point our work is similar to Bayesian multi-
nets [10], which we extend by allowing data to be pooled across conditions and learning networks
for any subset of conditions.
NIPD is based on the framework of probabilistic graphical models (PGMs), where edges rep-
resent pairwise and higher-order statistical dependencies among genes. Similar to existing PGM
learning algorithms, NIPD infers networks by iteratively scoring candidate networks and selecting
the network with the highest score [12]. However, NIPD uses a novel score that evaluates candidate
networks with respect to data from any subset of conditions, pooling data for subsets with more
than one condition. This subset score and search strategy of NIPD incorporate and exploit the
shared information across the conditions during structure learning, rather than as a post-processing
step. As a result, we are able to identify sub-networks not only specific to one condition, but to mul-
tiple conditions simultaneously, which allows us to build a more holistic picture of condition-specific
response.
The data pooling aspect of NIPD makes more data available for estimating parameters for
higher-order interactions, i.e., interactions among more than two genes. This enables NIPD to
robustly estimate higher-order interactions, which are more difficult to estimate due to the high
number of parameters relative to pairwise dependencies.
By formulating NIPD in the framework of PGMs we have additional benefits: (a) PGMs are
generative models of the data, providing a system-wide description of the condition-specific behavior
as a probabilistic network, (b) the probabilistic component naturally handles noise in the data,
(c) the graph structure captures condition-specific behavior at the level of gene-gene interactions,
rather than coarse clusters of genes, (d) the PGM framework can be easily extended to more
complex situations where the condition variable itself may be a random variable that must be
inferred during network learning. We implement NIPD with undirected, probabilistic graphical
models [14]. However, the NIPD framework is applicable to directed graphs as well.
We are not the first to propose networks for capturing condition-specific behavior [24, 34].
Several network-based approaches have been developed for capturing condition-specific behavior
such as disease-specific subgraphs in cancer [8], stress response networks in yeast [21], or networks
across different species [4,28]. However, these approaches are not probabilistic in nature, often rely
on the network being known, and are restricted to pairwise co-expression relationships rather than
general statistical dependencies. Other approaches such as differential dependency networks [34],
and mixture of subgraphs [24], construct probabilistic models, but focus on differences rather
than both differences and similarities. The majority of these approaches infer a network for each
condition separately, and then compare the networks from different conditions to identify the edges
capturing condition-specific behavior.
We compared NIPD against an existing approach for learning networks from the conditions
independently. We refer to this approach as INDEP, which represents a general class of existing
algorithms that learn networks per condition independently. On simulated data from networks
with known ground truth, NIPD inferred networks with higher quality than did INDEP, especially
on small training datasets. We also applied our approach to microarray data from two yeast
(Saccharomyces cerevisiae) cell types, quiescent and non-quiescent, isolated from glucose-starved,
stationary phase cultures [2]. Networks learned by NIPD were associated with many more Gene Ontology biological processes [3], or were enriched in targets of known transcription factors (TFs)
[17], than networks learned by INDEP. Many of the TFs were involved in stress response, which
is consistent with the fact that the populations are under starvation stress. NIPD also identified
many more shared edges, which represent biologically meaningful dependencies, than did the INDEP approach. This suggests that by pooling data from multiple conditions, we are able not only to
capture shared structures better, but also to infer networks with higher overall quality.
2 Results
The goal of our experiments was threefold: (a) to examine the quality of condition-specific networks inferred by our approach that combines data from different conditions (NIPD) versus an independent learner (INDEP), (b) to evaluate the algorithmic performance (measured by network structure quality) as a function of training data size, and (c) to analyze how two different cell populations behave, at the network level, in response to the same starvation stress. We address (a) and (b)
on simulated data from networks with known topology, giving us ground truth to directly validate
the inferred networks. We address (c) on microarray data from two yeast cell populations isolated
from glucose-starved stationary phase cultures [2].
2.1 NIPD had superior performance on networks with known ground truth
We simulated data from two sets of networks, each set with two networks, one network per condition. In the first, HIGHSIM, the networks for the two conditions shared a larger portion (60%) of the edges; in the second, LOWSIM, the networks shared a smaller portion (20%) of the edges.
We compared the networks inferred by NIPD to those inferred by INDEP by assessing the match
between true and inferred node neighborhoods (See Supplementary Methods). Briefly, the data were
split into q partitions, where q ∈ {2, 4, 6, 8, 10}, and networks learned for each partition. The size of
the training data decreased with increasing q. We first evaluated overall network structure quality
by obtaining the number of nodes on which one approach was significantly better (t-test p-value < 0.05) in capturing its neighborhood as a function of q. On LOWSIM, NIPD was significantly
better for smaller amounts of training data. On HIGHSIM, NIPD performed significantly better
than INDEP for all training data sizes (Fig 1).
Next, we evaluated how well the shared edges were captured as a function of decreasing amounts
of training data (Supplementary Fig 1). NIPD captured shared edges better than INDEP on
LOWSIM as the amounts of training data decreased. NIPD was better than INDEP on HIGHSIM
regardless of the size of the training data.
Our results show that when the underlying networks corresponding to the different conditions
share a lot of structure, NIPD has a significantly greater advantage than INDEP, which does not do
any pooling. Furthermore, as training data size decreases, NIPD is better than INDEP for learning
both overall and shared structures, independent of the extent of sharing in the true networks.
2.2 Application to yeast quiescence
We applied NIPD to microarray data from two yeast cell populations, quiescent (QUIESCENT)
and non-quiescent (NON-QUIESCENT), isolated from glucose starvation-induced stationary phase
cultures [2]. The two cell populations are in the same media but have differentiated physiologically
and morphologically, suggesting that each population is responding differently. We learned networks
using NIPD and INDEP, treating each cell population as a condition. Because each array in the dataset was obtained from a single gene deletion mutant, the networks were constrained such that genes with deletion mutants connected to the remaining genes (this does not yield a bipartite graph, because the genes with deletion mutants are also allowed to connect to each other).
The inferred networks from both methods were evaluated using information from Gene Ontology
(GO) process, GO Slim [3] and transcriptional regulatory networks [17]. Gene Ontology is a
hierarchically structured ontology of terms used to annotate genes. GO Slim is a collapsed, single-level view of the complete GO terms, providing high-level information about the processes, functions and cellular locations involving a set of genes. Finally, we analyzed combinations of genes with deletions that were in the neighborhood of other non-deletion genes.
2.2.1 NIPD identified more biologically meaningful dependencies
To determine if one network was more biologically meaningful than the other, we examined the networks based on Gene Ontology (GO) slim categories (process, function and location), transcription factor binding data, and GO process annotations, referred to as GOSLIM, TFNET and GOPROC, respectively (Fig 2). Network quality was determined by the number of GOSLIM categories (or TFNET or
GOPROC) with better coverage than random networks (See Methods). The two approaches were equivalent on GOSLIM, with INDEP outperforming NIPD in QUIESCENT and NIPD outperforming INDEP in NON-QUIESCENT. On TFNET categories from NON-QUIESCENT, NIPD outperformed INDEP by a larger margin than it was itself outperformed. NIPD was consistently better than INDEP on GOPROC categories.
The networks learned by NIPD had many more edges than the networks learned by INDEP
(Supplementary Table 1). To estimate the proportion of the edges capturing biologically meaningful
relationships, we computed semantic similarity of genes connected by the edges [16]. Although both
INDEP and NIPD had significantly better semantic similarity than random networks, INDEP
degraded in p-value for QUIESCENT at the highest value of semantic similarity (Fig 3). NIPD-inferred networks had many more edges with high semantic similarity than INDEP, while keeping
the proportion of edges satisfying a particular semantic similarity threshold close to INDEP. This
suggests that NIPD identifies more dependencies that are biologically relevant than INDEP without
suffering in precision.
2.2.2 NIPD identified more shared edges representing common starvation response
We performed a more fine-grained analysis of the inferred networks by considering each gene and
its immediate neighborhood and tested whether these gene neighborhoods were enriched in GO
biological processes, or in the target set of transcription factors (TFs) (See Methods). Using a false
discovery rate (FDR) cutoff of 0.05, we identified many more subgraphs in the networks inferred
by NIPD than by INDEP to be enriched in a GO process or in targets of TFs (Figs 4, 5). NIPD
identified more processes and larger subgraphs in both populations (oxidative phosphorylation,
protein folding, fatty acid metabolism, ammonium transport) than did INDEP.
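The enrichment machinery itself is deferred to the Methods; a common construction for such neighborhood-enrichment tests, sketched here under the assumption (ours, not the paper's) of a hypergeometric null with Benjamini-Hochberg FDR control, is:

```python
from math import comb

def hypergeom_pval(k, K, n, N):
    """P(X >= k): probability that a neighborhood of n genes, drawn from N
    genes of which K carry the annotation, contains at least k annotated genes."""
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / comb(N, n)

def bh_significant(pvals, fdr=0.05):
    """Benjamini-Hochberg procedure: indices of tests significant at the given
    false discovery rate (the paper uses an FDR cutoff of 0.05)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    last = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr * rank / m:
            last = rank                 # largest rank passing the BH threshold
    return set(order[:last])
```

One p-value would be computed per (neighborhood, annotation) pair, and `bh_significant` applied to the pooled list.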
NIPD identified subgraphs involved in aerobic respiration and oxidative phosphorylation that were enriched in targets of HAP4, a global activator of respiration genes. The presence of HAP4 targets
in both cell populations makes sense because both populations are experiencing glucose starvation
and must switch to respiration for deriving energy. We also found the TFs MSN2, MSN4, and HSF1 regulating subgraphs involved in protein folding. These TFs activate stress responses and
are known to activate genes involved in heat, oxidative and starvation stress. We also found
targets of SIP4 in both populations. SIP4 is a transcriptional activator of gluconeogenesis [32],
expressed highly in glucose repressed cells [15], and therefore would be expected to be present in
both quiescent and non-quiescent cells. In contrast, the only shared regulatory connection found
by INDEP was HAP4. We conclude that the NIPD approach identified more networks that were
biologically relevant and informative about glucose starvation response than did INDEP.
2.2.3 Wiring differences in NIPD-inferred networks exhibit population-specific star-
vation response
NIPD identified several processes associated exclusively with quiescent cells. This included regu-
latory processes (regulation of epigenetic gene expression, and regulation of nucleobase, nucleoside
and nucleic acid metabolism) and metabolic processes (pentose phosphate shunt). These were
novel predictions that highlight differences between these cells based on network wiring. INDEP
identified only one population-specific GO process (response to reactive oxygen species in NON-
QUIESCENT). An INDEP identified subgraph specific to quiescent (protein de-ubiquitination), was
actually a subset of the NIPD-identified subgraph involved in epigenetic gene expression regulation,
indicating that NIPD subsumed most of the information captured by INDEP.
NIPD QUIESCENT networks contained subgraphs enriched exclusively in targets of SKO1 and AZF1. Both are zinc finger TFs: the AZF1 protein is expressed highly under non-fermentable carbon sources [27], and SKO1 regulates low affinity glucose transporters [30]; both findings are consistent with the condition experienced by these cells. Unlike NIPD, which identified SIP4 to
be associated with both populations, INDEP identified SIP4 only in QUIESCENT. However, as
we describe in the previous section, it is more likely that SIP4 is involved in both QUIESCENT
and NON-QUIESCENT populations. INDEP also found the TFs YAP7 and AFT2 exclusively in
QUIESCENT and NON-QUIESCENT, respectively. YAP7 is involved in general stress response
and would be expected to have targets in both QUIESCENT and NON-QUIESCENT. AFT2 is
required under oxidative stress and is consistent with the over-abundance of reactive oxygen species
in NON-QUIESCENT population [1].
NIPD also identified wiring differences in the subgraphs involved in shared processes. For ex-
ample in addition to HAP4, NIPD identified HAP2 as an important TF in QUIESCENT. The
presence of both HAP2 and HAP4 makes biological sense because they are both part of the
HAP2/HAP3/HAP4/HAP5 complex required for activation of respiratory genes. The presence
of both HAP2 and HAP4 in QUIESCENT, but not NON-QUIESCENT, suggests that the QUIESCENT population may be better equipped for respiration and long-term survival in stationary
phase.
Overall, the NIPD inferred networks captured key differences and similarities in metabolic and
regulatory processes, which are consistent with existing information about these cell populations
[1,2], and also include novel findings that can provide new insight into starvation response in yeast.
2.2.4 NIPD identified several knock-out combinations
The microarrays used in this study measured expression profile of single gene deletions that were
previously identified to be highly expressed at the mRNA level in stationary phase. We constrained
the inferred networks to identify neighborhoods of genes comprising only the genes with deletion
mutants, allowing us to identify combinations of such deletion mutants and their targets. Such com-
binations can be validated in the laboratory to verify cross-talk between pathways. We found that
NIPD-inferred networks contained significantly more deletion combinations compared to random
networks for both the quiescent and non-quiescent populations (p-value < 3E-10, Supplementary
Tables 3, 4, 5), which was not the case for the INDEP-identified networks (Supplementary Tables 6,
7).
A more stringent analysis of the knock-out combinations using GO process semantic similar-
ity identified several double knock-out and target gene candidates (Supplementary Table 2). We
also found more deletion combinations in NON-QUIESCENT compared to QUIESCENT. This is
consistent with the identification of many more mutants affecting non-quiescent than quiescent
cells [2]. In QUIESCENT, we found three genes that were all likely down-stream targets of a COX7-QCR8 double knock-out, all involved in the cytochrome-c oxidase complex of the mitochondrial inner membrane. Other deletion mutant combinations were involved in mitochondrial ATP synthesis and ion transport. Many of these genes have been shown to be required for quiescent and non-quiescent cell function, viability and survival [2, 18]. In NON-QUIESCENT, we found
several knock-out combinations involved in oxidative phosphorylation, aerobic respiration, etc., including a novel combination, YMR31 and QCR8, connected to TPS2. All three genes are found in
the mitochondria, which play a critical and complex role in starved cells, but the exact mechanisms
are not well-understood. Experimental analysis of this triplet can provide new insights into the role
of mitochondria in glucose-starved cells. In summary, these results demonstrated another benefit
of data pooling in NIPD: learning more complex, combinatorial relationships among genes.
3 Discussion
Inference and analysis of cellular networks has been one of the cornerstones of systems biology.
We have developed a network learning approach, Network Inference with Pooling Data (NIPD), to
capture a systemic view of condition-specific response. NIPD is based on probabilistic graphical
models and infers the functional wiring among genes involved in condition-specific response. The
crux of our approach is to learn networks for any subset of conditions capturing fine-grained gene
interaction patterns not only in individual conditions but in any combination of conditions. This
allows NIPD to robustly identify both shared and unique components of condition-specific cellular
networks. In comparison to an approach that learns networks independently (INDEP), NIPD
(a) pools data across different conditions, enabling better exploitation of the shared information
between conditions, (b) learns better overall network structures in the face of decreasing amounts
of training data, and (c) learns structures with many more biologically meaningful dependencies.
Small training data sets, which are especially common for biological data, present significant
challenges for any network learning approach. In particular, approaches such as INDEP may learn
drastically different networks due to small data perturbations leading to differences that are not
biologically meaningful. NIPD is more resilient to small perturbations because by pooling data
from different conditions during network learning, NIPD effectively has more data for estimating
parameters for the shared parts of the network.
Another challenge in the analysis of condition-specific networks is to extract patterns that
are shared across conditions. Approaches such as INDEP that learn networks for each condition
independently, and then compare the networks, are more likely to learn different networks, making it difficult to identify the similarities across conditions. Application of both NIPD and INDEP
approaches to microarray data from two yeast populations showed that many of the subgraphs that
would be considered specific to each population by INDEP, were actually shared biological processes
that must be activated in both populations irrespective of their morphological and physiological
differences.
One of the strengths of NIPD in comparison with INDEP was its ability to identify pairs of gene
deletions and downstream targets using data from individual gene deletions. Amazingly, several
of these gene deletions are already known to have a phenotypic effect on stationary phase cultures
and often on quiescent or non-quiescent cells (Supplementary Table 2) [2,18]. These predictions are
therefore good candidates for future experiments using double deletion mutants, and are a drastic
reduction of the space of possible combinations of sixty-nine single gene deletions. Identification of
population-specific malfunctions in signaling pathways via experimental analysis of these multiple
deletions can provide new insight into aging and cancer studies using yeast stationary phase as a
model system.
The NIPD approach establishes ground-work for important future enhancements, including the
ability to efficiently learn networks from many conditions. The probabilistic framework of NIPD can
be easily extended to automatically infer the condition variable to make NIPD widely applicable to
datasets with uncertainty about the conditions. The NIPD approach can also integrate novel types
of high-throughput data including RNA-Seq [33] and ChIP-Seq [25]. These extensions will allow
us to systematically identify the parts, and the wiring among them that determine stage-specific,
tissue-specific and disease specific behavior in whole organisms.
4 Methods
4.1 Independent learning of condition-specific networks: INDEP
Existing approaches for learning condition-specific networks [4, 21, 28] can be considered as special cases of a general independent learning approach, INDEP, where networks for each condition
are learned independently and then compared to identify network parts unique or shared across
conditions.
Let {D1, · · · ,Dk} denote k datasets from k conditions. In the INDEP approach, each network
Gc, 1 ≤ c ≤ k, is learned independently using data from Dc only. Our implementation of the
INDEP framework considered each Gc as an undirected probabilistic graphical model, or a Markov
random field (MRF) [14], which, like Bayesian networks, can capture higher-order dependencies, but additionally captures cyclic dependencies. We use a pseudo-likelihood framework with an MDL penalty to learn the structure of the MRF [6]. The pseudo-likelihood score for a network $G_c$ describing data $D_c$ is

$$\mathrm{PLL}(G_c) = \sum_{i=1}^{N} \mathrm{PLLV}(X_i, M_{ci}, c),$$

where $X_1, \cdots, X_N$ are the random variables (one for each gene) encoding the expression values of the genes. $\mathrm{PLLV}$ is $X_i$'s contribution to the overall pseudo-likelihood and is defined, including a minimum description length (MDL) penalty, as

$$\mathrm{PLLV}(X_i, M_{ci}, c) = \sum_{d=1}^{|D_c|} \log P(X_i = x_{di} \mid M_{ci} = m_{cdi}) + \frac{|\theta_{ci}| \log(|D_c|)}{2}.$$

Here $M_{ci}$ is the Markov blanket (MB) of $X_i$ in condition $c$, and $x_{di}$ and $m_{cdi}$ are the assignments to $X_i$ and $M_{ci}$, respectively, from the $d$th data point. $\theta_{ci}$ are the parameters of the conditional distribution $P(X_i \mid M_{ci})$. We assume the conditional distributions to be conditional Gaussians. The structure learning algorithm for each graph is described in [22].
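As a concrete illustration, the per-variable score can be sketched with a conditional Gaussian fit by least squares. This is a minimal sketch, not the authors' implementation: the function names, the least-squares estimator, and the sign convention (we subtract the MDL penalty so that larger scores are better) are our own choices.

```python
import numpy as np

def pllv(x_i, mb):
    """X_i's pseudo-log-likelihood contribution given its Markov blanket,
    with an MDL penalty. Sketch: P(X_i | M_ci) is modeled as a linear
    conditional Gaussian fit by least squares (an illustrative assumption)."""
    n = len(x_i)
    # Design matrix with intercept; empty Markov blanket -> intercept only.
    X = np.column_stack([np.ones(n), mb]) if mb.size else np.ones((n, 1))
    w, *_ = np.linalg.lstsq(X, x_i, rcond=None)   # ML regression weights
    resid = x_i - X @ w
    var = max(float(resid @ resid) / n, 1e-12)    # ML noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * var) + 1)
    n_params = X.shape[1] + 1                     # |theta_ci|: weights + variance
    return loglik - 0.5 * n_params * np.log(n)    # MDL-penalized score

def pll(data, markov_blankets):
    """PLL(G_c): sum of per-variable contributions for one condition's data
    matrix (samples x genes); markov_blankets[i] lists gene i's neighbors."""
    return sum(pllv(data[:, i], data[:, markov_blankets[i]])
               for i in range(data.shape[1]))
```

A gene that is well explained by its Markov blanket scores higher than one modeled with no neighbors, which is the signal the structure search exploits.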
4.2 Network Inference with Pooling Data: NIPD
The NIPD approach that we present extends the INDEP approach by incorporating shared infor-
mation across conditions during structure learning. In this framework, we do not learn networks
for each condition c separately. Instead, we devise a score for each edge addition that considers
networks for any subset of the conditions. Let C denote the set of k conditions. For a non-singleton
set, E ⊆ C, we pool the data from all conditions e ∈ E and evaluate the overall score improve-
ment on adding an edge to networks for all e ∈ E. To learn {G1, · · · ,Gk} for the k conditions
simultaneously, we maximize the following MDL-based score:
$$S(G_1, \cdots, G_k) = P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k)\, P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) + \text{MDL Penalty} \qquad (1)$$

Here $\theta_1, \cdots, \theta_k$ are the maximum likelihood parameters for the $k$ graphs. We assume $P(D_c \mid \theta_1, \cdots, \theta_k) = P(D_c \mid \theta_c)$. That is, if we know the parameters $\theta_c$, the likelihood of the data from condition $c$, $D_c$, given $\theta_c$ can be estimated independently. Thus, $P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k) = \prod_{c=1}^{k} P(D_c \mid \theta_c)$. Because our networks are MRFs, we use the pseudo-likelihood $\mathrm{PLL}(D_c)$. We expand the complete condition-specific parameter set $\theta_c$ to $\{\theta_{c1}, \cdots, \theta_{cN}\}$, which is the set of parameters of each variable $X_i$, $1 \leq i \leq N$, in condition $c$. Using the parameter modularity assumption for each variable, we have:

$$P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) = \prod_{i=1}^{N} P(\theta_{1i}, \cdots, \theta_{ki} \mid M_{1i}, \cdots, M_{ki}) \qquad (2)$$
Note the parameters of conditional probabilities of individual random variables are independent, but
the parameters per variable are not independent across conditions. To enforce dependency among
the θci, we make Mci depend on all the neighbors of Xi in condition c and all sets of conditions
that include c. To convey the intuition behind this idea, let us consider the two condition case
C = {A,B}. A variable Xj can be in Xi’s MB in condition A, either if it is connected to Xi only
in condition A, or if it is connected to Xi in both conditions A and B. Let M∗Ai be the set of
variables that are connected to Xi only in condition A but not in both A and B. Similarly, let
M∗{A,B}i denote the set of variables that are connected to Xi in both A and B conditions. Hence,
$M_{Ai} = M^*_{Ai} \cup M^*_{\{A,B\}i}$. More generally, for any $c \in C$,

$$M_{ci} = \bigcup_{E \in \mathrm{powerset}(C)\,:\,c \in E} M^*_{Ei},$$

where $M^*_{Ei}$ denotes the neighbors of $X_i$ only in condition set $E$. To incorporate this dependency in the structure score, we need to define $P(X_i \mid M_{ci})$ such that it takes into account all subsets $E$ with $c \in E$. We assume that the MBs $M^*_{Ei}$ independently influence $X_i$. This allows us to write $P(X_i \mid M_{ci})$ as a product:

$$P(X_i \mid M_{ci}) \propto \prod_{E \in \mathrm{powerset}(C)\,:\,c \in E} P(X_i \mid M^*_{Ei}).$$

To learn the $k$ graphs, we exhaustively enumerate over condition sets $E$, and estimate parameters $\theta_{Ei}$ by pooling the data for all non-singleton $E$.
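The union defining each condition's Markov blanket can be sketched directly. Here `mb_star`, a map from (condition set, gene) to the condition-set-specific neighbors $M^*_{Ei}$, is a hypothetical data structure introduced only for illustration.

```python
from itertools import combinations

def condition_subsets(conditions):
    """All non-empty subsets E of the condition set C, as frozensets."""
    return [frozenset(s) for r in range(1, len(conditions) + 1)
            for s in combinations(conditions, r)]

def markov_blanket(c, i, mb_star):
    """M_ci: the union of M*_Ei over every condition set E containing c.
    mb_star maps (frozenset E, gene i) -> neighbors specific to E."""
    neighborhoods = [neigh for (E, j), neigh in mb_star.items()
                     if j == i and c in E]
    return set().union(*neighborhoods)

# Two-condition example from the text, C = {A, B}:
mb_star = {(frozenset('A'), 'x1'): {'x2'},    # edge present only in A
           (frozenset('AB'), 'x1'): {'x3'}}   # edge shared by A and B
```

With this toy `mb_star`, condition A's blanket for `x1` is `{'x2', 'x3'}` while B's is `{'x3'}`, mirroring $M_{Ai} = M^*_{Ai} \cup M^*_{\{A,B\}i}$.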
Our structure learning algorithm maintains a conditional distribution for every variable $X_i$, for every set $E \in \mathrm{powerset}(C)$. We consider the addition of an edge $\{X_i, X_j\}$ in every set $E$. This addition will affect the conditionals of $X_i$ and $X_j$ in all conditions $e \in E$. Because the MBs per condition set independently influence the conditional, the pseudo-likelihood $\mathrm{PLLV}(X_i, M_{ei}, e)$ decomposes as $\sum_{E\,:\,e \in E} \mathrm{PLLV}(X_i, M^*_{Ei}, e)$ (Supplementary information). The net score improvement of adding
an edge {Xi, Xj} to a condition set E is given by:
$$\Delta\mathrm{Score}_{\{X_i,X_j\},E} = \sum_{e \in E} \sum_{d=1}^{|D_e|} \Big[\, \mathrm{PLLV}(X_i, M_{ei} \cup \{X_j\}, e) - \mathrm{PLLV}(X_i, M_{ei}, e) + \mathrm{PLLV}(X_j, M_{ej} \cup \{X_i\}, e) - \mathrm{PLLV}(X_j, M_{ej}, e) \,\Big] \qquad (3)$$
Because of the decomposability of $\mathrm{PLLV}(X_i \mid M_{ei})$, all terms other than those involving the Markov blanket variables in condition set $E$ remain unchanged, producing the score improvement:

$$\Delta\mathrm{Score}_{\{X_i,X_j\},E} = \mathrm{PLLV}(X_i \mid M^*_{Ei} \cup \{X_j\}) - \mathrm{PLLV}(X_i \mid M^*_{Ei})$$
This score decomposability allows us to efficiently learn networks over condition sets. Our structure
learning algorithm is described in more detail in Supplementary material.
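The greedy search this decomposable score enables might look like the following sketch. `delta_score` stands in for the score improvement of Equation 3 and is an assumed callback, not the paper's implementation.

```python
from itertools import combinations

def learn_networks(conditions, genes, delta_score):
    """Greedy edge addition over condition sets: at each step, score every
    candidate edge in every non-empty subset E of conditions, and commit the
    single best-scoring addition until no edge improves the score.
    delta_score(edge, E, mb_star) is assumed to return the (decomposable)
    NIPD-style score improvement of adding `edge` in condition set E."""
    subsets = [frozenset(s) for r in range(1, len(conditions) + 1)
               for s in combinations(conditions, r)]
    # mb_star[E][g]: neighbors of gene g specific to condition set E.
    mb_star = {E: {g: set() for g in genes} for E in subsets}
    while True:
        best, best_gain = None, 0.0
        for E in subsets:
            for xi, xj in combinations(genes, 2):
                if xj in mb_star[E][xi]:
                    continue                       # edge already present in E
                gain = delta_score((xi, xj), E, mb_star)
                if gain > best_gain:
                    best, best_gain = ((xi, xj), E), gain
        if best is None:                           # no addition improves the score
            return mb_star
        (xi, xj), E = best
        mb_star[E][xi].add(xj)
        mb_star[E][xj].add(xi)
```

Since the score is local to an edge's endpoints, each candidate evaluation touches only two conditionals, which is what makes searching over all condition subsets tractable for small $k$.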
4.3 Simulated data description and analysis
We generated simulated datasets using two sets of networks of known structure, HIGHSIM and
LOWSIM. All networks had the same number of nodes n = 68 and were obtained from the E. coli
regulatory network [23]. We used the INDEP model for generating the eight simulated datasets.
The parameters of the INDEP model were initialized using random partitions of an initial dataset
generated from a differential-equation based regulatory network simulator [19].
4.4 Microarray data description
Each microarray measures the expression of all yeast genes in response to genetic deletions from
quiescent (85) and non-quiescent (93) populations [2], with 69 common to both populations. The
arrays had biological replicates producing 170 and 186 measurements per gene in the quiescent
and non-quiescent populations, respectively. We filtered the microarray data to exclude genes with
> 80% missing values, resulting in 3,012 genes. We constrained the network structures such that a
gene connected to only the 69 genes with deletion mutants and no gene had more than 8 neighbors.
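The filtering and structural constraints described above can be sketched as follows; the gene identifiers and the representation of missing values are illustrative assumptions, not from the paper.

```python
def filter_genes(expression, missing_frac=0.8):
    """Drop genes with more than missing_frac missing values (None here),
    as in the text's > 80% missing-value filter.
    expression: dict mapping gene name -> list of measurements."""
    return {g: vals for g, vals in expression.items()
            if sum(v is None for v in vals) <= missing_frac * len(vals)}

def edge_allowed(gi, gj, deletion_genes, neighbors, max_degree=8):
    """Structural constraints from the text: an edge must involve at least
    one gene with a deletion mutant, and no gene may exceed 8 neighbors."""
    if gi not in deletion_genes and gj not in deletion_genes:
        return False                    # edge must touch a deletion-mutant gene
    return len(neighbors[gi]) < max_degree and len(neighbors[gj]) < max_degree
```

Checks like `edge_allowed` would be applied before scoring a candidate edge, pruning the search space to neighborhoods of the 69 deletion-mutant genes.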
4.5 Validation of network edges using coverage of annotation categories
The coverage of an annotation category A is defined as the harmonic mean of a precision and
recall. Let $L$ denote the complete list of genes used for network learning, and let $L_A \subseteq L$ denote the genes annotated with $A$. Let $l_A$ denote the number of edges in our learned network between two genes $g_i$ and $g_j$ such that $g_i \in L_A$ and $g_j \in L_A$. Let $t_A$ be the total number of edges that are connected to genes in $L_A$ (note $t_A \geq l_A$). Let $s_A$ denote the total number of edges that could exist among the genes in $L_A$, which is $\binom{|L_A|}{2}$ if $|L_A| < 8$ and $|L_A| \cdot 8$ if $|L_A| \geq 8$. Precision for category $A$ is defined as $p_A = l_A / t_A$ and recall is defined as $r_A = l_A / s_A$. These are used to define the coverage of category $A$, $\frac{2 p_A r_A}{p_A + r_A}$. We compute this coverage score for all categories using each inferred network, and compare
the score against an expected coverage from random networks with the same degree distribution.
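As a concrete illustration, the coverage score can be computed as follows; the function name and the default degree bound are our own, following the definitions above.

```python
from math import comb

def coverage(l_a, t_a, n_a, max_deg=8):
    """Coverage of an annotation category A: harmonic mean of
    precision and recall.

    l_a: edges whose two endpoints are both annotated with A
    t_a: edges with at least one endpoint in L_A (t_a >= l_a)
    n_a: |L_A|, the number of genes annotated with A
    max_deg: degree bound used for s_A when |L_A| is large
    """
    s_a = comb(n_a, 2) if n_a < max_deg else n_a * max_deg
    if l_a == 0 or t_a == 0 or s_a == 0:
        return 0.0
    p, r = l_a / t_a, l_a / s_a
    return 2 * p * r / (p + r)
```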
To compare NIPD against INDEP, assume we are comparing the inferred quiescent networks. Let A_INDEP and A_NIPD denote the categories covered better than random in the INDEP and NIPD quiescent networks, respectively. To determine how much better INDEP is than NIPD, we count the categories in A_INDEP ∪ A_NIPD on which INDEP has better coverage than NIPD. We similarly assess how much better NIPD is than INDEP. We repeat this procedure for the non-quiescent networks. We also compared the semantic similarity of edges in inferred and random networks [16] (Supplementary material).
4.6 Evaluation of gene deletion combinations
We identified combinations of genes with deletion mutants from Markov blankets comprising more than one of these deletion genes. We evaluated each algorithm's ability to capture gene deletion combinations by comparing the number of such combinations against random networks with the same number of edges. This random network model provided a rough significance assessment of the number of inferred knock-out combinations (Supplementary Table 3). We then performed a more stringent analysis based on semantic similarity, using the sub-network spanning only the genes with deletion combinations. We generated random networks with the same degree distributions as this sub-network and computed the semantic similarity of each gene with the set of deletion genes connected to it, in the inferred and random networks. We then selected genes with significantly higher semantic similarity than in random networks (z-test, p-value < 0.05).
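The selection step can be sketched as follows; the helper names are ours, and we assume a one-sided upper-tail z-test against the random-network background.

```python
import math

def z_test_pvalue(observed, rand_mean, rand_std):
    """One-sided p-value: probability of a value at least as large as
    `observed` under a normal null with the random-network mean/std."""
    z = (observed - rand_mean) / rand_std
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)

def significant_genes(sim, background, alpha=0.05):
    """Genes whose semantic similarity to their connected deletion
    genes is significantly above the random background.
    sim: gene -> observed similarity; background: gene -> (mean, std)."""
    return [g for g, s in sim.items()
            if z_test_pvalue(s, *background[g]) < alpha]
```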
5 Acknowledgements
This work is supported by grants from NIMH (1R01MH076282-03) and NSF (IIS-0705681) to
T.L., from NIH (GM-67593) and NSF (MCB0734918) to M.W.W. and from HHMI-NIH/NIBIB
(56005678).
[Figure 1 graphic: four panels (HIGHSIM NET1, HIGHSIM NET2, LOWSIM NET1, LOWSIM NET2); x-axis: size of training data, y-axis: # of variables; curves for NIPD and INDEP.]
Figure 1: Number of variables (y-axis) on which one method was significantly better than the other as a function of the size of the training data (x-axis). Left is for the two networks (HIGHSIM) that share 60% of their edges and right is for the two networks (LOWSIM) that share 20% of their edges. The top and bottom graphs are for networks from the individual conditions.
[Figure 2 graphic: three bar charts (GOSLIM, TFNET, GOPROC), each with QUIESCENT and NON-QUIESCENT groups showing # of categories for INDEP>NIPD and NIPD>INDEP.]
Figure 2: Network quality comparison based on coverage of GOSlim (GOSLIM), targets of transcription factors (TFNET) and GO process (GOPROC). Each bar represents the number of categories on which INDEP had better coverage than NIPD (INDEP>NIPD) or NIPD had better coverage than INDEP (NIPD>INDEP).
References
[1] C. Allen, S. Buttner, A. D. Aragon, J. A. Thomas, O. Meirelles, J. E. Jaetao, D. Benn,
S. W. Ruby, M. Veenhuis, F. Madeo, and M. Werner-Washburne. Isolation of quiescent and
nonquiescent cells from yeast stationary-phase cultures. J Cell Biol, 174(1):89–100, July 2006.
[2] Anthony D. Aragon, Angelina L. Rodriguez, Osorio Meirelles, Sushmita Roy, George S. David-
son, Chris Allen, Ray Joe, Phillip Tapia, Don Benn, and Margaret Werner-Washburne. Charac-
terization of differentiated quiescent and non-quiescent cells in yeast stationary-phase cultures.
Molecular Biology of the Cell, 2008.
[3] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis,
[Figure 3 graphic: two panels (QUIESCENT, NON-QUIESCENT); x-axis: log(# of Edges), y-axis: Semantic Similarity; curves for RAND (NIPD), NIPD, RAND (INDEP), INDEP.]
Figure 3: Network quality comparison based on semantic similarity. The dashed lines represent the background distribution generated from random networks and the solid lines represent the distribution of the semantic similarity in the inferred networks.
[Figure 4 graphic: subgraphs of the NIPD-inferred quiescent network, labeled with GO processes (e.g. ethanol metabolic process, oxidative phosphorylation, aerobic respiration, protein folding, pentose-phosphate shunt, fatty acid metabolic process) and TFs (e.g. HAP2, HAP4, SIP4, AZF1, SKO1, HSF1, MSN2, MSN4).]
Figure 4: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the quiescent population. The text below each subgraph indicates the process. The diamonds represent the TFs. A TF is connected to a subgraph if the subgraph is enriched for targets of the TF. The circular nodes represent the genes in the network and color represents the extent of differential expression, red: up-regulated, green: down-regulated.
[Figure 5 graphic: subgraphs of the NIPD-inferred non-quiescent network, labeled with GO processes (e.g. ion transport, oxidative phosphorylation, protein folding, aerobic respiration, carnitine metabolic process) and TFs (e.g. HAP4, MSN2, MSN4, HSF1, SIP4).]
Figure 5: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the non-quiescent population. Legend is as in Figure 4.
K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,
S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene
ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet,
25(1):25–29, May 2000.
[4] S. Bergmann, J. Ihmels, and N. Barkai. Similarities and differences in genome-wide expression
data of six organisms. PLoS Biol, 2(1), January 2004.
[5] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis
of large-scale gene expression data. Physical review. E, Statistical, nonlinear, and soft matter
physics, 67(3 Pt 1), March 2003.
[6] Julian Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64(3):616–618, December 1977.
[7] Richard Bonneau, David J Reiss, Paul Shannon, Marc Facciotti, Leroy Hood, Nitin S Baliga,
and Vesteinn Thorsson. The inferelator: an algorithm for learning parsimonious regulatory
networks from systems-biology data sets de novo. Genome Biology, 2006.
[8] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker. Network-based
classification of breast cancer metastasis. Mol Syst Biol, 3, October 2007.
[9] Karthik Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in
computational biology. PLoS Comput Biol, 4(7):e1000029+, July 2008.
[10] Dan Geiger and David Heckerman. Advances in probabilistic reasoning. In Proceedings of
the seventh conference (1991) on Uncertainty in artificial intelligence, pages 118–126, San
Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.
[11] Christopher T. Harbison, D. Benjamin Gordon, Tong Ihn Lee, Nicola J. Rinaldi, Kenzie D.
Macisaac, Timothy W. Danford, Nancy M. Hannett, Jean-Bosco Tagne, David B. Reynolds,
Jane Yoo, Ezra G. Jennings, Julia Zeitlinger, Dmitry K. Pokholok, Manolis Kellis, P. Alex
Rolfe, Ken T. Takusagawa, Eric S. Lander, David K. Gifford, Ernest Fraenkel, and Richard A.
Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 2004.
[12] David Heckerman. A Tutorial on Learning Bayesian Networks. Technical Report MSR-TR-
95-06, Microsoft research, March 1995.
[13] Hyunsoo Kim, William Hu, and Yuval Kluger. Unraveling condition specific gene transcrip-
tional regulatory networks in saccharomyces cerevisiae. BMC Bioinformatics, 2006.
[14] Steffen L. Lauritzen. Graphical Models. Oxford Statistical Science Series. Oxford University
Press, New York, USA, July 1996.
[15] P. Lesage, X. Yang, and M. Carlson. Yeast snf1 protein kinase interacts with sip4, a c6 zinc
cluster transcriptional activator: a new role for snf1 in the glucose response. Molecular and
cellular biology, 16(5):1921–1928, May 1996.
[16] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity
measures across the gene ontology: the relationship between sequence and annotation. Bioin-
formatics, 19(10):1275–1283, July 2003.
[17] Kenzie Macisaac, Ting Wang, D. Benjamin Gordon, David Gifford, Gary Stormo, and Ernest
Fraenkel. An improved map of conserved regulatory sites for saccharomyces cerevisiae. BMC
Bioinformatics, 7(1):113+, March 2006.
[18] M. Juanita Martinez, Sushmita Roy, Amanda B. Archuletta, Peter D. Wentzell, Sonia S.
Anna-Arriola, Angelina L. Rodriguez, Anthony D. Aragon, Gabriel A. Quinones, Chris Allen,
and Margaret Werner-Washburne. Genomic analysis of stationary-phase and exit in saccha-
romyces cerevisiae: Gene expression and identification of novel essential genes. Mol. Biol. Cell,
15(12):5295–5305, December 2004.
[19] Pedro Mendes, Wei Sha, and Keying Ye. Artificial gene networks for objective comparison of
analysis algorithms. Bioinformatics, 19:122–129, 2003.
[20] Wei Pan. A comparative review of statistical methods for discovering differentially expressed
genes in replicated microarray experiments. Bioinformatics, 18(4):546–554, April 2002.
[21] Oleg Rokhlenko, Ydo Wexler, and Zohar Yakhini. Similarities and differences of gene expression in yeast stress conditions. Bioinformatics, 23(2):e184–e190, January 2007.
[22] Sushmita Roy, Terran Lane, and Margaret Werner-Washburne. Learning structurally consis-
tent undirected probabilistic graphical models. In ICML, page 114, 2009.
[23] Heladia Salgado, Socorro Gama-Castro, Martin Peralta-Gil, Edgar Diaz-Peredo, Fabiola
Sanchez-Solano, Alberto Santos-Zavaleta, Irma Martinez-Flores, Veronica Jimenez-Jacinto,
Cesar Bonavides-Martinez, Juan Segura-Salazar, Agustino Martinez-Antonio, and Julio
Collado-Vides. Regulondb (version 5.0): Escherichia coli k-12 transcriptional regulatory net-
work, operon organization, and growth conditions. Nucleic Acids Research, 34:D394, 2006.
[24] Guido Sanguinetti, Josselin Noirel, and Phillip C. Wright. Mmg: a probabilistic tool to identify
submodules of metabolic pathways. Bioinformatics, 24(8):1078–1084, April 2008.
[25] Dominic Schmidt, Michael D. Wilson, Christiana Spyrou, Gordon D. Brown, James Hadfield, and Duncan T. Odom. ChIP-seq: Using high-throughput sequencing to discover protein-DNA interactions. Methods, 48(3):240–248, July 2009.
[26] Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module
networks. Journal of Machine Learning Research, 6:557–588, April 2005.
[27] T. Stein, J. Kricke, D. Becher, and T. Lisowsky. Azf1p is a nuclear-localized zinc-finger protein
that is preferentially expressed under non-fermentative growth conditions in saccharomyces
cerevisiae. Current genetics, 34(4):287–296, October 1998.
[28] Joshua M. Stuart, Eran Segal, Daphne Koller, and Stuart K. Kim. A gene-coexpression network
for global discovery of conserved genetic modules. Science, 302(5643):249–255, October 2003.
[29] Amos Tanay, Roded Sharan, Martin Kupiec, and Ron Shamir. Revealing modularity and
organization in the yeast molecular network by integrated analysis of highly heterogeneous
genomewide data. Proceedings of the National Academy of Sciences of the United States of
America, 101(9):2981–2986, March 2004.
[30] Lidia Tomas-Cobos, Laura Casadome, Gloria Mas, Pascual Sanz, and Francesc Posas. Expres-
sion of the hxt1 low affinity glucose transporter requires the coordinated activities of the hog
and glucose signalling pathways. The Journal of biological chemistry, 279(21):22010–22019,
May 2004.
[31] D. P. Tuck, H. M. Kluger, and Y. Kluger. Characterizing disease states from topological
properties of transcriptional regulatory networks. BMC Bioinformatics, 7, 2006.
[32] O. Vincent and M. Carlson. Sip4, a snf1 kinase-dependent transcriptional activator, binds to
the carbon source-responsive element of gluconeogenic genes. The EMBO journal, 17(23):7002–
7008, December 1998.
[33] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, January 2009.
[34] Bai Zhang, Huai Li, Rebecca B. Riggins, Ming Zhan, Jianhua Xuan, Zhen Zhang, Eric P.
Hoffman, Robert Clarke, and Yue Wang. Differential dependency network analysis to identify
condition-specific topological changes in biological networks. Bioinformatics, pages btn660+,
December 2008.
Appendix
1 Generation and analysis of simulated data
We first obtained a sub-network of n = 68 nodes, G1, from the E. coli regulatory network [23]. We
then generated two networks, G2 and G3, by flipping 20% and 60% of G1’s edges, respectively.
{G1,G2} comprised networks in HIGHSIM and {G1,G3} comprised networks in LOWSIM. For
each pair of networks, we generated initial datasets using a differential equation-based gene regu-
latory network simulator [19]. We then split the data into two parts, learned two INDEP models
for each partition, and generated data from these models. We repeated this procedure four times
producing eight sets of simulated data with different parameters but the same network topology.
It was possible to generate all eight sets from the regulatory network simulator by perturbing the
kinetic constants, but our current data generation procedure was faster.
We compared the structure of the networks inferred by INDEP and NIPD using a per-variable neighborhood comparison. Assume we are comparing the INDEP-inferred networks against the true networks in HIGHSIM. We compare each of the true networks, {G1, G2}, one at a time. Let G1^INDEP and G2^INDEP be the two networks inferred by INDEP using datasets from HIGHSIM. For each variable, Xi, we compare Xi's neighborhood in G1 to its inferred neighborhoods in both G1^INDEP and G2^INDEP to obtain match scores F_i1^INDEP and F_i2^INDEP, respectively. INDEP's match of Xi's neighborhood in G1 is the max of F_i1^INDEP and F_i2^INDEP. We obtain a match score for different folds of the data. Similarly, we obtain a match score for NIPD for all variables from different folds of the data. We then obtain the number of variables on which NIPD has a significantly higher match score than INDEP as a function of training data size. We repeat this procedure for all eight datasets for HIGHSIM to obtain the average number of variables on which NIPD is better than INDEP. We repeat this procedure for G2, and then for NIPD.
2 Semantic similarity-based validation
We use the definition of semantic similarity from Lord et al. [16]. Semantic similarity between
two annotation terms is defined as a function of the maximal amount of information present in a
common ancestor of the terms. For GO terms the information is inversely proportional to the
number of genes that are annotated with a term, that is a very specific term with few genes has
more information than a broader term that has many more genes annotated with it. The functional
similarity between two genes is given by the average semantic similarity of sets of GO process terms
associated with the genes. Let gi and gj be two genes connected by an edge in our inferred network.
Let Ti and Tj be the sets of GO process terms associated with gi and gj, respectively. The average semantic similarity over all pairs of terms is

sim(gi, gj) = (1 / (|Ti| * |Tj|)) * Σ_{tp∈Ti, tq∈Tj} semsim(tp, tq)

Semantic similarity is semsim(tp, tq) = −log(min_{a∈Ppq} pa), where Ppq is the set of common ancestors of the terms tp and tq in the GO process "is-a" hierarchy. −log(pa) is the amount of information associated with a term a, and pa is the probability of the term, defined as the ratio of the number of genes annotated with the term a to the total number of genes with a GO process assignment.
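The definitions above translate directly into code. The toy ancestor map and term probabilities in the usage below are illustrative, and we assume each term is included in its own ancestor set.

```python
import math

def semsim(tp, tq, ancestors, p_term):
    """semsim(tp, tq) = -log(min over common ancestors a of p_a),
    where p_a is the fraction of annotated genes carrying term a."""
    common = ancestors[tp] & ancestors[tq]
    if not common:
        return 0.0
    return -math.log(min(p_term[a] for a in common))

def gene_sim(Ti, Tj, ancestors, p_term):
    """sim(gi, gj): average pairwise semantic similarity between the
    GO process term sets Ti and Tj of two genes."""
    if not Ti or not Tj:
        return 0.0
    total = sum(semsim(tp, tq, ancestors, p_term) for tp in Ti for tq in Tj)
    return total / (len(Ti) * len(Tj))
```

For example, two terms whose most informative common ancestor is the root (pa = 1) score 0, while terms sharing a specific ancestor score higher.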
We used semantic similarity for global validation of the inferred edges and also for assessing
the strength of association between combinations of single gene knock-outs and a target gene.
In both cases, we generated random networks with the same degree distributions as the inferred
networks and estimated a background semantic similarity distribution. For assessing the strength
of association between a gene, gi and the set of knock-out genes that are connected to it, Ki, we
had to compare the similarity of a gene with a set of genes. We assumed GO process terms for
the set Ki to be the union of all terms associated with the genes, gj ∈ Ki. We then computed the
semantic similarity between the term set associated with gene gi and the union of terms associated
with Ki.
3 Structure learning algorithm of NIPD in detail
Our score for structure learning is based on the pseudo-likelihood of the data given the model, and requires us to compute the conditional probability distribution of each variable in a condition c. We require that the parameters of these conditional distributions be dependent, so that we can pool the data from the different conditions to estimate the parameters. The conditional distribution, P(Xi|Mci), in condition c is defined as a product:

P(Xi = xid | Mci = mcid) ∝ ∏_{E ∈ powerset(C) : c ∈ E} P(Xi = xid | M*_Ei = m*_Eid),   (4)

where d is the data point index and M*_E is the Markov blanket (MB) of Xi exclusive to condition set E. The proportionality term can be eliminated using the normalization term 1/Z_cid. In our conditional Gaussian case, 1/Z_1id = N(µ_1id | µ_3id, σ²_1i + σ²_3i), where σ²_3i is the variance from the condition set {1, 2} and µ_1id = w^T_1i m*_1id is the mean of the conditional Gaussian using the dth data point in condition 1. Thus, 1/Z_1id is the probability of µ_1id under a Gaussian distribution with mean estimated from the pooled data. To make the product in Eq 4 a valid conditional distribution, we need to subtract out the normalization term. However, working with the unnormalized form gives us three benefits. First, and most important, it enables our score to be a decomposable sum on taking logarithms. Second, the normalization term behaves as a smoothing term for a condition-specific mean, µ_1id, preferring network structures with means µ_1id closer to the shared mean µ_3id. Third, avoiding the computation of Z_cid for each data point gives us some runtime benefits.
Our structure learning algorithm begins with k empty graphs and proposes edge additions for all variables, for all subsets of the condition set C. The while loop iteratively makes edge modifications until the score no longer improves. The outermost for loop (Steps 4-17) iterates over variables Xi to identify new candidate MB variables Xj in a condition set E. We iterate over all candidate MB variables Xj (Steps 5-15) and condition sets E (Steps 6-14) and compute the score improvement for each pair {Xj, E} (Step 13). In Steps 7-9 we check whether a variable Xj is already present in the MB for any subset or superset D of E; if so, we do not include it as a candidate. If the current condition set has more than one condition, data from these conditions is pooled and parameters for the new distribution P(Xi|M*_Ei) are estimated using the pooled dataset (Steps 10-12). A candidate move for a variable Xi is the pair {X'j, E'} with the maximal score improvement over all variables and condition sets (Step 16). After all candidate moves have been identified, we attempt the moves in order of decreasing score improvement (Step 18). Each move adds the edge {Xi, X'j} in condition set E'. However, if either Xi or X'j was already updated in a previous move, we skip the move. Because not all candidate moves are made, sorting by decreasing score improvement enables the moves with the highest score improvements to be attempted first. The algorithm converges when no edge addition improves the score of the k graphs.
Algorithm 1 NIPD
1: Input: random variable set X = {X1, ..., X|X|}; set of conditions C; datasets of RV joint assignments {D1, ..., D|C|}; maximum neighborhood size kmax
2: Output: inferred graphs G1, ..., G|C|
3: while Score(G1, ..., G|C|) does not stabilize do
4:   for Xi ∈ X do {/* Propose moves */}
5:     for Xj ∈ (X \ {Xi}) do
6:       for E ∈ powerset(C) do
7:         if Xj ∈ M*_iD, s.t. either D ⊂ E or E ⊂ D then
8:           Skip Xj.
9:         end if
10:        if |E| > 1 then
11:          Estimate parameters for the new conditional P(Xi | M*_Ei ∪ {Xj}) using the pooled dataset DE obtained by merging all De s.t. e ∈ E.
12:        end if
13:        Compute ∆Score_{Xi,Xj},E.
14:      end for
15:    end for
16:    Store {Xi, X'j, E'} as the candidate move for Xi, where {X'j, E'} = argmax_{j,E} ∆Score_{Xi,Xj},E.
17:  end for
18:  Make candidate moves {Xi, X'j, E'} in order of decreasing score improvement. /* Attempt moves to modify graph structures */
19: end while
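The move-proposal sweep (Steps 4-17) can be sketched as follows. This is a simplified illustration: `score_fn` stands in for ∆Score, the data structures are our own, and parameter estimation (Steps 10-12) is not shown.

```python
from itertools import combinations

def condition_subsets(C):
    """All non-empty subsets of the condition set C."""
    return [frozenset(s) for r in range(1, len(C) + 1)
            for s in combinations(C, r)]

def propose_moves(variables, C, mb, score_fn):
    """One proposal sweep of Algorithm 1: for each Xi, find the
    (Xj, E) pair with maximal score improvement, skipping Xj if it is
    already an MB member in a subset or superset of E.
    mb[i] maps each condition subset (frozenset) to Xi's exclusive MB."""
    moves = []
    for i in variables:
        best = None
        for j in variables:
            if j == i:
                continue
            for E in condition_subsets(C):
                if any((D <= E or E <= D) and j in mb[i][D]
                       for D in condition_subsets(C)):
                    continue  # Steps 7-9: skip redundant candidates
                gain = score_fn(i, j, E)
                if best is None or gain > best[0]:
                    best = (gain, i, j, E)
        if best is not None and best[0] > 0:
            moves.append(best)
    # Step 18: attempt moves in decreasing order of score improvement
    return sorted(moves, key=lambda m: m[0], reverse=True)
```

Sorting the proposals by gain before attempting them mirrors the greedy move-ordering described above.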
[Supplementary Figure 1 graphic: two panels (HIGHSIM, LOWSIM) comparing NIPD and INDEP on shared edges as a function of size of training data.]
Figure 1: Shared edges in the HIGHSIM and LOWSIM networks
METHOD   POPULATION      EDGE-CNT   SHARED EDGE-CNT
NIPD     QUIESCENT       378        271
         NON-QUIESCENT   402
INDEP    QUIESCENT       171        25
         NON-QUIESCENT   200
Table 1: Structure of the inferred networks using INDEP and NIPD.
[Supplementary Figure 2 graphic: subgraphs of the INDEP-inferred quiescent network, labeled with GO processes (e.g. oxidative phosphorylation, aerobic respiration, carnitine metabolic process, protein deubiquitination, fatty acid beta-oxidation) and TFs (e.g. HAP4, HAP1, YAP7).]
Figure 2: GO processes and TF targets for subgraphs from the INDEP-inferred networks using the quiescent population. The text below each subgraph indicates the process. The diamonds represent the TFs. A TF is connected to a subgraph if the subgraph is enriched for targets of the TF. The circular nodes represent the genes in the network and color represents the extent of differential expression, red: up-regulated, green: down-regulated.
[Supplementary Figure 3 graphic: subgraphs of the INDEP-inferred non-quiescent network, labeled with GO processes (e.g. aerobic respiration, ATP synthesis coupled proton transport, protein folding, response to reactive oxygen species) and TFs (e.g. HSF1, HAP4, SIP4, MSN2, MSN4, AFT2).]
Figure 3: GO processes and TF targets for subgraphs from the INDEP-inferred networks using the non-quiescent population. Legend is as in Figure 2.
Deletion combinations   Downstream effect    Process
QUIESCENT
COX7*, QCR8             COX13, QCR6, QCR9    ATP synthesis coupled electron transport
ADY2*, CTA1*            ATO3                 ion transport
ETR1*, ACS1             AYR1                 carboxylic acid metabolic process
NON-QUIESCENT
ATP12, SDH2*            ATP16                oxidative phosphorylation
YMR31, QCR8             TPS2                 trehalose biosynthesis, mitochondrial function
ADY2*, YAL054C          ATO3                 ion transport
NQM1, QCR8              COX13                aerobic respiration
COX7*, QCR8             QCR9                 electron transport
SIP18, YGR236C          AZR1                 response to stimulus
ETR1*, ADH2             LSC2                 energy derivation by oxidation

Table 2: Knock-out combinations identified by NIPD in the quiescent and non-quiescent populations. Genes marked with * have been shown to have a phenotype in stationary phase or in quiescent and non-quiescent cells [2, 18].
METHOD   POPULATION      INFERRED-NW   RAND-MEAN   RAND-STD   Z-PVAL
NIPD     QUIESCENT       47            25.51       4.1839     2.80E-07
         NON-QUIESCENT   76            28.91       4.8076     1.18E-22
INDEP    QUIESCENT       7             5.71        2.1569     0.5498
         NON-QUIESCENT   10            7.7         2.2042     0.2967

Table 3: Statistical significance of the number of deletion gene combinations in the networks inferred by NIPD and INDEP. The INFERRED-NW column is the number of combinations in the inferred networks, and the subsequent columns give the mean, standard deviation, and z-test p-value from random networks.
Deletion combination Downstream targetYCR010C,YHR138C YBR050CYKL217W,YMR136W YBR090CYDL218W,YGR088W YBR117CYCR021C,YOR374W YBR149WYGL121C,YLR312C YDL137WYIL136W,YOR374W YDL215CYAL054C,YDR504C YDR059CYDL223C,YDR262W YDR077WYCR021C,YLL026W YDR171WYDL222C,YDR262W YDR204WYDL222C,YDR262W YDR216WYDL223C,YDR262W YDR293CYCR010C,YDR256C YDR384CYBR230C,YIL101C YEL060CYHR138C,YMR096W YER020WYJL166W,YMR256C YFR033CYDR504C,YPL230W YGL036WYDR070C,YMR107W YGL054CYBR212W,YHR139C YGL156WYJL166W,YMR256C YGL191WYBR230C,YLR377C YGR079WYDR256C,YMR303C YGR180CYJL166W,YMR256C YGR183CYGR236C,YHR138C YGR235CYBR072W,YMR250W YGR248WYDL204W,YDL218W,YFL014W YGR256WYAL054C,YBR026C YIL124WYDL204W,YMR169C YJL016WYBL078C,YMR170C YJR049CYFL030W,YJR121W YJR077CYGR088W,YLR377C YKL141WYKR097W,YLL041C YKL182WYCR010C,YDR256C YKR009CYDR504C,YLL041C YLL019CYBR230C,YLR312C YLR311CYMR303C,YOR374W YML042WYDL218W,YHR139C YMR118CYDL204W,YMR175W YMR174CYBR212W,YHR139C YNL115CYIL136W,YKL217W YNL116WYDR262W,YLR312C YNL190WYDL222C,YKR097W YNL194CYDL223C,YHR139C YNL195CYHR096C,YOR317W YOL081WYBR026C,YCR021C YOL147CYDL168W,YDR070C,YLR327C YOR031WYCR010C,YHR138C YPL147W
Table 4: The deletion combinations identified by NIPD on the quiescent population.
Deletion combination          Downstream target
YAL054C,YBR067C               YAL041W
YHR139C,YMR170C               YAL062W
YDR070C,YGR236C               YBL001C
YBL049W,YIL136W               YBL064C
YIL136W,YKR097W               YBL099W
YBR230C,YHR138C               YBR050C
YKL217W,YMR107W               YBR090C
YDR256C,YOR374W               YBR149W
YBR214W,YNL055C               YBR286W
YJR121W,YLL041C               YDL004W
YBR056W,YBR067C               YDL046W
YCR005C,YLR327C               YDL110C
YFL014W,YMR096W               YDL124W
YJR121W,YNL055C               YDL126C
YGL121C,YHR138C               YDL137W
YBR212W,YOR374W               YDL215C
YHR097C,YMR191W               YDL234C
YDR504C,YKL217W               YDR059C
YFR049W,YJL166W               YDR074W
YDR262W,YLL041C               YDR077W
YBR212W,YLR377C               YDR111C
YFL030W,YJR121W               YDR148C
YCR021C,YLL026W               YDR171W
YDL222C,YDR262W               YDR204W
YDR256C,YLR312C               YDR264C
YDL223C,YDR262W               YDR293C
YAL054C,YCR010C               YDR384C
YKL109W,YPL134C               YDR497C
YHR096C,YLR312C               YDR505C
YDR070C,YGL121C               YDR513W
YIL136W,YNL015W,YOR374W       YEL034W
YBR230C,YMR096W               YEL060C
YBR230C,YCR010C               YER121W
YLR312C,YMR096W               YGL010W
YAL054C,YBR026C               YGL080W
YBR056W,YJL066C               YGL173C
YGR043C,YJL166W               YGL191W
YGR043C,YLR327C               YGR008C
YAR035W,YMR191W               YGR032W
YJL166W,YMR256C               YGR183C
YIR038C,YMR170C,YMR175W       YGR201C
YGR236C,YMR175W               YGR224W
YDR070C,YHR138C               YGR235C
YBR026C,YMR303C               YGR244C
YDL204W,YMR169C,YMR250W       YGR248W
YDL218W,YFL014W               YGR256W
YCR005C,YDL204W               YHR009C
YCR005C,YHR139C               YHR162W
YGR236C,YHR096C,YMR107W       YIL057C
YIL101C,YIL160C               YIR016W
YDL168W,YDL204W               YJL016W
YFL030W,YJR121W               YJR077C
YBR230C,YNL055C               YKL148C
YGR043C,YGR088W               YKL150W
YAL054C,YMR303C               YKL187C
YCR010C,YDR256C,YMR303C       YKR009C
YAL054C,YKL217W               YKR046C
YHR096C,YLL041C               YLL019C
YCR021C,YER150W               YLR216C
YBR214W,YGL121C               YLR356W
YIL160C,YMR303C               YML042W
YBR056W,YBR067C               YMR031C
YDR070C,YLR327C,YPL230W       YMR110C
YBL078C,YMR170C               YMR114C
YDL218W,YMR175W               YMR118C
YDR504C,YMR191W               YMR201C
YBR230C,YKL217W               YNL104C
YIL136W,YKL217W               YNL116W
YDL168W,YPL230W               YNL144C
YDR262W,YLR312C               YNL190W
YDL222C,YJL066C               YNL194C
YBL048W,YIL101C               YOL084W
YBR026C,YCR021C               YOL147C
YDL168W,YGR236C,YLR327C       YOR031W
YDR504C,YLR395C               YOR052C
YDL223C,YMR136W               YPL265W

Table 5: Deletion combinations identified by NIPD on the non-quiescent population.
Deletion combination    Downstream target
YBR067C,YMR250W         YAL061W
YCR010C,YMR303C         YDR384C
YGR236C,YJL166W         YER063W
YHR139C,YKR097W         YKL182W
YIL160C,YKL217W         YKR009C
YBR067C,YDL223C         YNL195C
YBR026C,YMR303C         YOL147C

Table 6: Deletion combinations identified by INDEP on the quiescent population.
Deletion combination          Downstream target
YCR010C,YOR374W               YDR384C
YMR169C,YMR170C               YGR201C
YIL101C,YIL160C               YIR016W
YAL054C,YDR256C,YMR303C       YKR009C
YHR096C,YLR377C               YLL019C
YDR256C,YGR088W               YML092C
YHR139C,YMR175W               YMR118C
YAR035W,YBR230C               YNL104C
YGL121C,YOR317W               YOL081W
YBL078C,YDL168W               YOR031W

Table 7: Deletion combinations identified by INDEP on the non-quiescent population.