Modeling Regulatory Networks
John Griffin
CS 374
Stanford
Fall 2004
2
Two primary articles
“Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Segal, Shapira, Regev, Pe’er, Botstein, Koller, Friedman [SSR]
Nature Genetics, June 2003
“Probabilistic discovery of overlapping cellular processes and their regulation,” Battle, Segal, Koller [BSK]
RECOMB ’04 conference, March 2004
Second article builds on and extends first
3
Purposes of this research
Predict which genes work together as “modules” for different biological functions, and find which genes regulate each module under which conditions
Daphne Koller quoted in popular press article (supplementary reference 1): “What we’re doing is developing a suite of computational tools that take reams of data and automatically extract a picture of what’s happening in the cell….It tells you where to look for good biology.”
4
Outline
Microarray technology radically improves gene expression data volume & precision
Key terms
Bayesian networks overview
SSR article
BSK article
References
5
Background and motivation: “Observing the living genome,” 1999 article (supplementary reference 2)
“…DNA microarrays have the advantage of being comprehensive, inexpensive (in the case of printed DNA microarrays) and easy to use; an entire genome can be surveyed in a single hybridization experiment. Surveying the variation in abundance of each gene’s transcripts across an arbitrary series of samples is simply a matter of measuring the differential hybridization to a DNA microarray of fluorescently labeled cDNAs prepared from a series of mRNA samples.”
6
“Observing the living genome,” 1999 article (supplementary reference 2)
7
Typical microarray layout
Applications:
Identification of gene sequence
Determination of expression level (abundance) of genes
Each cell in the array shows the expression level of a particular gene. A single microarray slide is exposed to a particular experimental condition. Fluorescence indicates expression level.
2003: Affymetrix sells arrays with DNA for 30k-50k human genes, which encode all known human proteins
8
Radical improvements
“DNA microarray approaches to identifying differentially expressed genes are fundamentally different from the traditional methods. Most importantly, they are systematic. Previous genome-wide approaches produced ‘lists’ of differentially expressed genes, or, in some cases, semiquantitative counts of the relative frequency with which specific transcripts were encountered in sequencing cDNAs isolated under a given condition. The qualitative or semiquantitative nature of the results, and the labor-intensive methodology, prevented the assembly of coherent pictures of the ‘patterns’ in which each gene is expressed, or of the characteristic patterns of gene expression in each cell, tissue or process.”
10
Key terms
Bayes’ Rule, Bayesian networks
cis-regulatory motif
module
module group
cDNA
regulator
expression profile
node
regulation program / regulation tree
11
Key terms, 1 of 4
Bayes’ Rule, Bayesian networks: to be explained
cis-regulatory motif: A short DNA sequence (roughly 6 to 12 bases) that can bind an “activator” or “repressor” protein. Illustrated at right as activator/repressor binding sites.
12
Key terms, 2 of 4
Module: set of genes that participate in a coherent biological process
Module group: set of modules that all share at least one cis-regulatory motif
cDNA: single-stranded DNA that is complementary to messenger RNA, or DNA that has been synthesized from messenger RNA by reverse transcriptase. This is what binds to the ordered array of DNA strands on microarrays.
13
Key terms, 3 of 4
regulator: a gene that encodes a protein whose concentration regulates the expression of other genes
expression profile: expression levels (transcript abundances) of a set of genes under given experimental conditions
14
Key terms, 4 of 4
node: locus in a regulation program/tree. Ovals in diagram at left.
regulation program / regulation tree: Upper part of diagram. A representation of the different modes of regulation of genes within a module. There are 3 modes: 1) unregulated, 2) more transcription due to “upregulation” of an activator gene, 3) less transcription due to “upregulation” of a repressor gene. See diagram, slide 11. The arrow in an oval can point up for upregulation or down for downregulation.
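A regulation tree with these three modes can be sketched as a small decision function. This is only an illustration: the nesting order (the repressor is checked only once the activator is up), the 0.5 threshold, and the per-context mean values are assumptions made for the example, not values from the SSR article.

```python
# Hypothetical mean expression of the module's genes in each context.
CONTEXT_MEAN = {"unregulated": 0.0, "induced": 1.5, "repressed": -1.0}

def is_upregulated(level, threshold=0.5):
    """Treat a regulator as upregulated when its level exceeds a threshold (assumed)."""
    return level > threshold

def regulation_context(activator_up, repressor_up):
    """Walk the two-node tree: activator node first, then repressor node."""
    if not activator_up:
        return "unregulated"   # mode 1: no activator, basal expression
    if repressor_up:
        return "repressed"     # mode 3: repressor upregulated, less transcription
    return "induced"           # mode 2: activator upregulated, more transcription

def expected_module_expression(activator_level, repressor_level):
    """Return the context reached and the (hypothetical) mean expression in it."""
    ctx = regulation_context(is_upregulated(activator_level),
                             is_upregulated(repressor_level))
    return ctx, CONTEXT_MEAN[ctx]

print(expected_module_expression(1.0, 0.0))
```

Each leaf of the tree corresponds to one expression “context”; in a real regulation program the leaves would carry distributions learned from data rather than fixed numbers.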
16
Bayesian networks (BN) in brief
Graphs in which nodes represent random variables
(Lack of) arcs represent conditional independence assumptions
Present & absent arcs provide a compact representation of joint probability distributions
BNs have a complicated notion of independence, which takes into account the directionality of the arcs
17
Bayes’ Rule
Can rearrange the conditional probability formula to get P(A|B) P(B) = P(A,B), but by symmetry we can also get P(B|A) P(A) = P(A,B). It follows that:
P(A|B) = P(B|A) P(A) / P(B)
The power of Bayes' rule is that in many situations where we want to compute P(A|B) it turns out that it is difficult to do so directly, yet we might have direct information about P(B|A). Bayes' rule enables us to compute P(A|B) in terms of P(B|A).
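A minimal numeric sketch of the rule, with P(B) expanded by the law of total probability. The input probabilities are hypothetical and chosen only for illustration.

```python
def posterior(p_b_given_a, p_b_given_not_a, p_a):
    """Return P(A|B) using Bayes' rule.

    P(B) is expanded as P(B|A) P(A) + P(B|not A) P(not A).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: P(B|A) = 0.7, P(B|not A) = 0.01, P(A) = 0.4
p_a_given_b = posterior(0.7, 0.01, 0.4)
print(round(p_a_given_b, 3))
```

Here the quantity that is easy to state is P(B|A); the function turns it into the quantity we actually want, P(A|B).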
18
Simple Bayesian network example, from “Bayesian Networks Without Tears” article (supplementary reference 4): P(hear your dog bark as you get home) = P(hb) = ?
19
Need prior probabilities for root nodes and, for non-root nodes, conditional probabilities covering all possible values of the parent nodes
20
Major benefit of BN
We can compute P(hb) from the conditional probabilities of hb given its parent node, together with the probability of that parent. We don’t need to condition hb directly on all the ancestors between hb and the root nodes.
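The computation of P(hb) can be sketched as follows. The slide image with the actual numbers is not reproduced in this text, so the probability values below are illustrative assumptions, and the node names other than family-out and hear-bark (bowel-problem, dog-out) are assumed from the dog-barking example in the Charniak article.

```python
from itertools import product

# Illustrative values (assumed), standing in for the numbers on slide 19.
P_FO = 0.15                                   # P(family-out)
P_BP = 0.01                                   # P(bowel-problem)
P_DO = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.30}   # P(dog-out | fo, bp)
P_HB = {True: 0.70, False: 0.01}              # P(hear-bark | dog-out)

def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p

def prob_dog_out():
    """Marginalize dog-out over its two root parents."""
    return sum(bern(P_FO, fo) * bern(P_BP, bp) * P_DO[(fo, bp)]
               for fo, bp in product([True, False], repeat=2))

def prob_hear_bark():
    """P(hb) needs only P(hb | dog-out) and the parent's probability P(dog-out)."""
    p_do = prob_dog_out()
    return P_HB[True] * p_do + P_HB[False] * (1.0 - p_do)

p_do = prob_dog_out()
p_hb = prob_hear_bark()
print(round(p_do, 5), round(p_hb, 5))
```

Note that hear-bark never looks at family-out or bowel-problem directly; their influence reaches it only through dog-out.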
21
This BN benefit hugely reduces # of numbers and computations needed for large networks, e.g. hundreds or thousands of genes
SSR article: many separate Bayesian networks generated based on gene expression data. Here one activator and one repressor form basic BN, with 3 corresponding expression “contexts” shown at bottom.
22
Independence assumptions
Source of savings in # of values needed
From our simple example: are ‘family-out’ and ‘hear-bark’ independent, i.e. P(hb|fo)=P(hb)? Intuition might say they are not independent…
23
Independence assumptions
…but in fact they can be assumed to be conditionally independent if some conditions are met.
Conditions are symbolized by presence/absence and direction of arrows between nodes.
Knowing whether the dog is or is not in the house is all that is needed to know the probability of hearing a bark. Once that is known, the family being in or out adds no further information: hear-bark is conditionally independent of family-out given the dog’s location. This kind of independence assumption is what allows savings in how many numbers must be specified for probabilities.
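The distinction between plain and conditional independence can be checked by brute-force enumeration of the joint distribution. As before, the probability values and the node names bowel-problem, light-on and dog-out are assumptions for illustration, not figures taken from the slides.

```python
from itertools import product

# Illustrative conditional probability tables (assumed values).
P_FO, P_BP = 0.15, 0.01
P_LO = {True: 0.60, False: 0.05}              # P(light-on | family-out)
P_DO = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.30}   # P(dog-out | fo, bp)
P_HB = {True: 0.70, False: 0.01}              # P(hear-bark | dog-out)
NAMES = ("fo", "bp", "lo", "do", "hb")

def bern(p, value):
    return p if value else 1.0 - p

def joint(fo, bp, lo, do, hb):
    """Joint probability as the product of each node's factor given its parents."""
    return (bern(P_FO, fo) * bern(P_BP, bp) * bern(P_LO[fo], lo)
            * bern(P_DO[(fo, bp)], do) * bern(P_HB[do], hb))

def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(**world)
    return total

def cond(target, **given):
    """P(target | given), where target is a dict of variable values."""
    return prob(**target, **given) / prob(**given)

p_hb = prob(hb=True)
p_hb_given_fo = cond({"hb": True}, fo=True)
p_hb_given_do = cond({"hb": True}, do=True)
p_hb_given_do_fo = cond({"hb": True}, do=True, fo=True)
print(round(p_hb, 4), round(p_hb_given_fo, 4))
print(round(p_hb_given_do, 4), round(p_hb_given_do_fo, 4))
```

With these numbers P(hb|fo) differs from P(hb), so the two variables are not independent outright, while P(hb|do,fo) equals P(hb|do), which is the conditional independence the arrows encode.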
24
Order of reduction of required numbers
Complete specification of the probability distribution of n binary random variables needs 2^n − 1 joint probabilities (jp). So for our example, with n = 5, 31 jp would be needed. But BN independence assumptions can reduce this to just 10 prior and conditional probabilities (listed on slide 19).
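The two counts can be reproduced with a few lines. The parent lists below assume the five-node structure of the dog-barking example (the node names other than family-out and hear-bark are assumptions); each binary node needs one number per combination of its parents' values.

```python
# Assumed structure of the five-node example: node -> list of parents.
PARENTS = {
    "family-out": [],
    "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}

def full_joint_size(n_vars):
    """Numbers needed to list the full joint of n binary variables."""
    return 2 ** n_vars - 1

def bn_size(parents):
    """Numbers needed by the BN: 2^(number of parents) per binary node."""
    return sum(2 ** len(ps) for ps in parents.values())

print(full_joint_size(len(PARENTS)), bn_size(PARENTS))
```

The gap widens quickly: the full joint grows exponentially in the number of variables, while the BN count grows with the number of nodes as long as each node has few parents.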
25
Evaluating Bayesian networks
Exact evaluation (inference) is NP-hard in general!
26
Where do the numerical estimates of probability come from?
Can be provided, or at least initialized, by expert opinion
Can be learned by the system
Both SSR and BSK articles lay out basics and some details of iterative algorithms for finding probability numbers.
27
Bayesian networks applied to diverse applications
“Computerized tongue diagnosis based on Bayesian networks”: devising expert system for Chinese medical method (supplementary reference 3)
28
For good entry-level BN tutorial: see supplementary references 4 and 5
30
Aims of SSR article
Bayesian network-based algorithms are applied to gene expression data to generate good testable hypotheses.
31
Results of SSR article
Expression data set, from other researchers circa 2000, is for genes of yeast subjected to various kinds of stress
Compiled list of 466 candidate regulators
Applied analysis to 2355 genes in all 173 arrays of yeast data set
This gave automatic inference of 50 modules of genes
All modules were analyzed with external data sources to check functional coherence of gene products and validity of regulatory program
Three novel hypotheses suggested by the method were tested in bio lab and found to be accurate
32
Results of SSR article
2 examples of the 50 modules inferred by SSR methods:
Respiration: mostly genes encoding respiration proteins or glucose-metabolism proteins. One primary regulator is predicted, Hap4, which is known from past experiments to play an activation role in respiration. Secondary regulators affect Hap4 expression.
Nitrogen catabolite repression: 29 genes tied to the process by which yeast uses the best available nitrogen source. Key regulator suggested is Gat1, due to 26 of 29 genes having the Gat1 regulatory motif in their upstream regions.
33
Results of SSR article
Evaluating module content and regulation programs
All 50 modules were tested to see if proteins coded in the same module had related functions
Scored modules on how many genes are noted in current bio databases as being related to the predicted function (diagram, next slide)
31 of 50 modules had coherence >50%; only 4 had coherence <30%.
34
Results of SSR article
Colored boxes indicate that known experimental evidence validates the predicted regulatory role of a regulator (named in one of the ‘Reg’ columns) in a given module (each row of the table).
M, C and G column headers and different colors of boxes represent different sorts of experimental evidence that validate the model’s prediction.
C(%): functional coherence of module, from literature mentions of module genes.
#G: number of genes in module
35
Results of SSR article
To find global relationships between modules, a graph (next 2 slides) was made showing modules & their motifs. Motifs were found within the 500 base pairs upstream from each gene.
Observations from this graph: modules with related biological functions often shared at least one motif, & sometimes shared one or more regulator genes.
36
Module relationships, 1 of 2
37
Module relationships, 2 of 2
38
Additional tests of predictions
Inferred regulator models were evaluated by comparing known functions of predicted regulators with their predicted regulation functions
Three previously untested hypotheses suggested by the model were tested with experiments comparing wild-type expression with deletion-type expression under the conditions hypothesized (e.g. heat shock and hypo-osmotic shift). A “paired-t test” showed that all three regulators do have roles in the hypothesized conditions.
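For readers unfamiliar with the test, a paired t statistic can be sketched as below. The expression values are made up for illustration and are not data from the SSR experiments; the sketch also stops at the t statistic and degrees of freedom rather than computing a p-value.

```python
import math
import statistics

def paired_t(wild_type, deletion):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    diffs = [w - d for w, d in zip(wild_type, deletion)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.mean(diffs) / std_err, n - 1

# Hypothetical expression of three target genes under one condition.
wild = [5.0, 6.0, 8.0]
deleted = [4.0, 4.0, 5.0]
t_stat, dof = paired_t(wild, deleted)
print(round(t_stat, 3), dof)
```

Pairing matters because each gene is compared with itself across the two strains, so gene-to-gene differences in baseline expression cancel out.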
39
Results summary
The method is able to accurately predict functions for regulators, their targets and experimental conditions under which regulation occurs.
40
Model-building method
Three stages, illustrated on next slide:
Preprocessing
Module networks procedure
Post-processing
41
42
Preprocessing
Candidate regulators are chosen from among known and suspected transcription factors and signal transduction molecules. Informed choice of candidate regulators makes the algorithm workable; without selectivity, bad results are likely.
43
Module network procedure
Genes are partitioned into modules and regulation program is sought for each module to explain gene expression in module.
44
Post-processing
“Enrichment” of annotations for predicted modules is sought in the literature; enrichment of regulatory motifs is sought within 500 base pairs upstream from genes
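The slides do not say which statistic is used to score enrichment. A common choice, assumed here purely for illustration, is the hypergeometric tail probability: the chance of seeing at least k annotated genes in a module of size n when K of the N genes overall carry the annotation. The numbers in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, annotated, module_size, overlap):
    """P(X >= overlap) for a hypergeometric draw of module_size genes."""
    max_k = min(annotated, module_size)
    tail = sum(comb(annotated, k) * comb(total_genes - annotated, module_size - k)
               for k in range(overlap, max_k + 1))
    return tail / comb(total_genes, module_size)

# Hypothetical toy case: 10 genes, 4 annotated, module of 3 contains 2 annotated.
p = enrichment_p_value(total_genes=10, annotated=4, module_size=3, overlap=2)
print(round(p, 4))
```

A small p-value means the module holds more annotated genes than a random gene set of the same size would be expected to hold.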
45
What does a BN look like here?
Need to specify two things to describe a BN:
Graph topology (structure)
Parameters of each conditional probability distribution
Possible to learn both from data
Learning structure is much harder than learning parameters
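Parameter learning for a fixed structure is the easy half, and for fully observed binary data it reduces to counting. The sketch below estimates one conditional probability table by maximum likelihood; the variable names and the eight observations are hypothetical, and it does not attempt structure learning.

```python
from collections import Counter

def estimate_cpt(samples, child, parent):
    """Estimate P(child = True | parent) from fully observed samples by counting."""
    counts = Counter((s[parent], s[child]) for s in samples)
    cpt = {}
    for parent_value in (True, False):
        total = counts[(parent_value, True)] + counts[(parent_value, False)]
        # None marks a parent value that never occurs in the data.
        cpt[parent_value] = counts[(parent_value, True)] / total if total else None
    return cpt

# Hypothetical observations of a parent (dog_out) and its child (hear_bark).
samples = (
    [{"dog_out": True, "hear_bark": True}] * 3
    + [{"dog_out": True, "hear_bark": False}] * 1
    + [{"dog_out": False, "hear_bark": False}] * 4
)
cpt = estimate_cpt(samples, child="hear_bark", parent="dog_out")
print(cpt)
```

Structure learning is harder because the space of candidate graphs must be searched, with a parameter fit like this one nested inside every candidate evaluation.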
46
Regulator programs: more complex Bayesian networks, made along lines of earlier simple example
Simple generic example seen earlier…
47
…and real example: respiration & carbon regulation module (continued next slide)
48
Colored entries in columns on right show genes with enriched literature annotations for that column’s module (probabilities of overall enrichments are at top of columns, previous slide)
50
BSK article overview
Authors: Battle, Segal, Koller
All from Stanford CS department
Proposes a “novel probabilistic model of gene regulation for the task of identifying overlapping biological processes and the regulatory mechanism controlling their activation.”
Detailed discussion of their COPR algorithm along with experimental methods and results
Builds on and extends work of Article 1; in particular, allows genes to simultaneously belong to more than one biological “process”
51
From BSK abstract:
“…A key feature of our approach is that we allow genes to participate in multiple processes, thus providing a more biologically plausible model for the process of gene regulation. We present an algorithm to learn this model automatically from data, using only genome-wide measurements of gene expression as input. We compare our results to those obtained by other approaches, and show significant benefits can be gained by modeling both the organization of genes into overlapping cellular processes and the regulatory programs of these processes. Moreover, our method successfully grouped genes known to function together, recovered many regulatory relationships that are known in the literature, and suggested novel hypotheses regarding the regulatory role of previously uncharacterized proteins.”
52
COPR model approach
COPR = Coregulated Overlapping Processes model
Combines continuous and discrete scoring
Allows genes to participate in multiple biological processes
53
COPR model
Genes, arrays, expression measurements, regulators and biological processes are the elements represented
First component of COPR model represents gene expression and its decomposition into activity level of processes. Formula details to follow…
Second component of COPR model is the regulatory model.
54
First COPR component
Genes, arrays, expression measurements, regulators and biological processes are the elements represented
G={g1, g2…gn} genes
Attributes: g.M1...g.Mj, where g.Mp is Boolean showing whether g is part of process p (discrete variable). g.M is the array of g.M1..g.Mj
A={a1,a2…ak} arrays
Attributes: a.C1…a.Cj, where a.Cp shows degree to which process p is active in array a (continuous variable)
E={e11…enk} expressions, one for each gene in each array
Attributes: e.Gene, e.Array, e.Level: the respective gene, array and level of expression of that gene in that array
55
First COPR component, cont’d.
Expression level of gene g in array a is assumed to be a sum of g’s expression levels in each of the processes in which it participates, where g’s expression level in process p is the activity of the process a.Cp.
Assumed: “the expression of gene g in array a is normally distributed with a mean that is equal to the sum, over processes p in which g participates, of the activity level of p:”
e.Level ~ N( Σ_p g.Mp · a.Cp , σ_a² )   (σ_a² = variance of array a)
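The assumption can be made concrete with a small sketch that computes the mean and the Gaussian log-likelihood of one measurement. The function names and the numbers are hypothetical; only the form of the model (a sum of activity levels over the processes the gene belongs to, with a per-array variance) comes from the quoted description.

```python
import math

def expected_level(membership, activity):
    """Sum of a.Cp over the processes p for which g.Mp is True."""
    return sum(c for m, c in zip(membership, activity) if m)

def log_likelihood(level, membership, activity, variance):
    """Log density of e.Level under a normal with the mean above and array variance."""
    mean = expected_level(membership, activity)
    return (-0.5 * math.log(2 * math.pi * variance)
            - (level - mean) ** 2 / (2 * variance))

# Hypothetical gene in processes 1 and 3, and one array's activity levels.
g_M = [True, False, True]
a_C = [1.2, -0.5, 0.3]
mean = expected_level(g_M, a_C)
logp = log_likelihood(1.5, g_M, a_C, variance=0.25)
print(round(mean, 3), round(logp, 4))
```

Because memberships are Boolean, a gene that belongs to several processes simply adds their activity levels, which is how the model lets processes overlap.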
56
Second COPR component
Regulatory model: Assumed that genes in the same process are coregulated, and therefore share the same regulatory mechanism. So a regulation program is defined for each process (Mp).
New attribute is added for A={a1,a2…ak} arrays
Attribute: a.R1..a.Rn where a.Rp shows expression level of regulator Rp in array a
57
COPR model with 2 processes & 3 regulators
58
Learning a COPR model from microarray data
Consists of learning a probabilistic model from partially observed data
Arrays provide complete e.Level data for each gene, and complete a.R values for each regulator
From this data, need to group genes into processes, estimate process activity levels for each experiment and learn regulatory control programs governing each process
Tool: a “hard-assignment” variant of the structural EM (SEM) algorithm
59
SEM (structural EM) algorithm summary
Initialized by using a “standard expression clustering technique” to choose assignments of genes to processes
Next an initial set of activity levels is found with least squares method
Initial result is assignments of values to variables G.M (Booleans showing whether genes belong to processes) and A.C (continuous variables showing process activity levels in arrays)
60
SEM summary, continued
Next, iterations of two different steps alternate, giving new values for G.M and A.C, until those values stabilize.
Step 1 (M-step): find the regression tree for each process p that maximizes the Bayesian score, which is the posterior probability of the regression tree given the gene expression data
Step 2 (E-step): find the most likely joint assignments to G.M and A.C. This is difficult because it involves both discrete and continuous variables; an approximate rather than an exact algorithm is used to avoid a number of operations exponential in the number of processes
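To see where the exponential cost comes from, consider a deliberately simplified version of the E-step that holds the activity levels A.C fixed and searches exhaustively over one gene's membership vector. This is not the approximate algorithm used in the BSK article; it is a brute-force illustration under the assumption of equal variance in every array, in which case the most likely assignment is the one with the smallest sum of squared errors. The data are hypothetical.

```python
from itertools import product

def best_membership(levels, activities):
    """Exhaustively pick the membership vector that best explains one gene.

    levels[a]        : observed expression of the gene in array a
    activities[a][p] : activity a.Cp of process p in array a (held fixed)
    Tries all 2^j vectors for j processes, hence the exponential cost.
    """
    n_proc = len(activities[0])
    best, best_sse = None, float("inf")
    for m in product([0, 1], repeat=n_proc):
        sse = sum((lev - sum(mp * c for mp, c in zip(m, act))) ** 2
                  for lev, act in zip(levels, activities))
        if sse < best_sse:
            best, best_sse = m, sse
    return best, best_sse

# Hypothetical data: 3 arrays, 2 processes; the gene truly belongs to both.
activities = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
levels = [3.0, -0.5, 2.0]
membership, sse = best_membership(levels, activities)
print(membership, sse)
```

With j processes the loop visits 2^j candidate vectors per gene, which is the blow-up the approximate E-step is designed to avoid.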
61
Experimental results
Statistical validations of results from yeast stress data
COPR learned processes whose genes are more enriched with literature annotations, motifs and transcription factor targets than those learned by the earlier learning model (“Module Networks”) made by the same authors
Results comparisons on next 3 slides
62
Literature gene annotations
63
Presence of known motifs
64
Known transcription factor targets
65
BSK conclusion & future work ideas
“In many cases, our COPR model’s predictions are remarkably coherent: A process associated with a certain cellular function is often predicted to be active in precisely the conditions where that function plays a role. Such coherent results involving uncharacterized genes or regulators can suggest novel biological hypotheses that can be tested in the lab.”
Could build on COPR work by:
Integrating additional data sources to improve discovery of regulators
Applying method to human gene expression data
67
References
Main articles: links on CS 374 web site
1. “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N, Nature Genetics, June 2003. http://robotics.stanford.edu/~erans/module_nets/
2. “Probabilistic discovery of overlapping cellular processes and their regulation,” Battle A, Segal E, Koller D, RECOMB ’04 conference, March 2004
Supplementary references
1. “Computer scientists develop tool for mining genomic data,” Graduating Engineer & Computer Careers, Vol. 27 No. 1, Fall 2004:13
2. “Observing the living genome,” Ferea T, Brown P, Current Opinion in Genetics & Development, 1999, 9:715–722
3. “Computerized Tongue Diagnosis Based on Bayesian Networks,” Pang B, Zhang D, Li N, Wang K, IEEE Transactions on Biomedical Engineering, Vol. 51 No. 10, October 2004:1803-1810
4. “Bayesian Networks Without Tears,” Charniak E, American Association for Artificial Intelligence, AI Magazine, Winter 1991: 50-63. http://www.kddresearch.org/Resources/Papers/Intro/notears.pdf
5. “A Brief Introduction to Graphical Models and Bayesian Networks,” Murphy K, 1998, http://www.ai.mit.edu/~murphyk/Bayes/bnintro.html