Modeling Regulatory Networks John Griffin CS 374 Stanford Fall 2004

2

Two primary articles

“Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Segal, Shapira, Regev, Pe’er, Botstein, Koller, Friedman [SSR]

Nature Genetics, June 2003

“Probabilistic discovery of overlapping cellular processes and their regulation,” Battle, Segal, Koller [BSK]

RECOMB ’04 conference, March 2004

Second article builds on and extends first

3

Purposes of this research

Predict which genes work together as “modules” for different biological functions, and find which genes regulate each module under which conditions

Daphne Koller quoted in popular press article (supplementary reference 1): “What we’re doing is developing a suite of computational tools that take reams of data and automatically extract a picture of what’s happening in the cell….It tells you where to look for good biology.”

4

Outline

Microarray technology radically improves gene expression data volume & precision

Key terms

Bayesian networks overview

SSR article

BSK article

References

5

Background and motivation: “Observing the living genome,” 1999 article (supplementary reference 2)

“…DNA microarrays have the advantage of being comprehensive, inexpensive (in the case of printed DNA microarrays) and easy to use; an entire genome can be surveyed in a single hybridization experiment. Surveying the variation in abundance of each gene’s transcripts across an arbitrary series of samples is simply a matter of measuring the differential hybridization to a DNA microarray of fluorescently labeled cDNAs prepared from a series of mRNA samples.”

6

“Observing the living genome,” 1999 article (supplementary reference 2)

7

Typical microarray layout

Applications

Identification of gene sequence

Determination of expression level (abundance) of genes

Each cell in the array shows the expression level of a particular gene. A single microarray slide gets exposed to a particular experimental condition. Fluorescence indicates expression level.

2003: Affymetrix selling arrays with DNA for 30k-50k human genes – encode all known human proteins

8

Radical improvements

“DNA microarray approaches to identifying differentially expressed genes are fundamentally different from the traditional methods. Most importantly, they are systematic. Previous genome-wide approaches produced ‘lists’ of differentially expressed genes, or, in some cases, semiquantitative counts of the relative frequency with which specific transcripts were encountered in sequencing cDNAs isolated under a given condition. The qualitative or semiquantitative nature of the results, and the labor-intensive methodology, prevented the assembly of coherent pictures of the ‘patterns’ in which each gene is expressed, or of the characteristic patterns of gene expression in each cell, tissue or process.”

9

Outline

Microarray technology radically improves gene expression data volume & precision

Key terms

Bayesian networks overview

SSR article

BSK article

References

10

Key terms

Bayes’ Rule, Bayesian networks

cis-regulatory motif

module

module group

cDNA

regulator

expression profile

node

regulation program / regulation tree

11

Key terms, 1 of 4

Bayes’ Rule, Bayesian networks: to be explained

cis-regulatory motif: A short series of DNA bases (roughly 6 to 12) that can bind to an “activator” or “repressor” protein. Illustrated at right as activator/repressor binding sites.

12

Key terms, 2 of 4

Module: set of genes that participate in a coherent biological process

Module group: set of modules that all share at least one cis-regulatory motif

cDNA: single-stranded DNA that is complementary to messenger RNA, or DNA that has been synthesized from messenger RNA by reverse transcriptase. This is what binds to the ordered array of DNA strands on microarrays

13

Key terms, 3 of 4

regulator: a gene that encodes a protein whose concentration regulates the expression of other genes

expression profile: expression levels (transcript concentrations) of various genes under given experimental conditions

14

Key terms, 4 of 4

node: locus in a regulation program/tree. Ovals in diagram at left.

regulation program / regulation tree: Upper part of diagram. A representation of different modes of regulation of genes within a module. 3 types of modes: 1) unregulated, 2) more transcription due to activator gene “upregulation”, 3) less transcription due to repressor gene “upregulation.” See diagram, slide 11. Arrow in oval can point up for upregulation, down for downregulation.
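A regulation tree of this kind can be sketched as a small decision tree. The following is only an illustration, assuming one activator and one repressor: the class names, the 0.5 thresholds, and the ordering of the tests are hypothetical and are not taken from the SSR article.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """A context: one mode of regulation for the genes in a module."""
    context: str


@dataclass
class Node:
    """An internal node that asks whether one regulator is upregulated."""
    regulator: str
    threshold: float  # hypothetical cut-off on the regulator's expression
    if_up: Union["Node", Leaf]
    if_not_up: Union["Node", Leaf]


def find_context(tree, regulator_levels):
    """Walk from the root to a leaf using the regulators' expression levels."""
    while isinstance(tree, Node):
        is_up = regulator_levels[tree.regulator] > tree.threshold
        tree = tree.if_up if is_up else tree.if_not_up
    return tree.context


# Hypothetical program: test the activator first, then the repressor.
program = Node(
    "activator", 0.5,
    if_up=Node(
        "repressor", 0.5,
        if_up=Leaf("less transcription"),
        if_not_up=Leaf("more transcription"),
    ),
    if_not_up=Leaf("unregulated"),
)

print(find_context(program, {"activator": 1.2, "repressor": 0.1}))
```

Each leaf corresponds to one of the three modes listed above; a learned program would choose the regulators and split points from the expression data.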

15



16

Bayesian networks (BN) in brief

Graphs in which nodes represent random variables

(Lack of) Arcs represent conditional independence assumptions

Present & absent arcs provide compact representation of joint probability distributions

BNs have complicated notion of independence, which takes into account the directionality of the arcs

17

Bayes’ Rule

Can rearrange the conditional probability formula to get P(A|B) P(B) = P(A,B), but by symmetry we can also get: P(B|A) P(A) = P(A,B). It follows that:

P(A|B) = P(B|A) P(A) / P(B)

The power of Bayes' rule is that in many situations where we want to compute P(A|B) it turns out that it is difficult to do so directly, yet we might have direct information about P(B|A). Bayes' rule enables us to compute P(A|B) in terms of P(B|A).
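As a small worked example of this inversion, the sketch below computes P(A|B) from P(B|A), P(A) and P(B|not A), expanding P(B) by the law of total probability. The function name and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) computed from P(B|A), P(A) and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    # Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: P(B|A) = 0.9, P(A) = 0.1, P(B|not A) = 0.2
posterior = bayes_posterior(p_b_given_a=0.9, p_a=0.1, p_b_given_not_a=0.2)
print(round(posterior, 4))
```

With these inputs P(B) = 0.09 + 0.18 = 0.27, so P(A|B) = 0.09 / 0.27, or one third.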

18

Simple Bayesian network example, from “Bayesian Networks Without Tears” article (supplementary reference 4): P(hear your dog bark as you get home) = P(hb) = ?

19

Need prior P for root nodes and conditional Ps, that consider all possible values of parent nodes, for nonroot nodes

20

Major benefit of BN

We can know P(hb) based only on the conditional probabilities of hb and its parent node. We don’t need to know/include all the ancestor probabilities between hb and the root nodes.
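A minimal sketch of that calculation follows. It assumes the parent of hear-bark is a binary "dog-out" node, and all three probabilities are illustrative placeholders rather than the values from the slide's figure.

```python
def prob_hear_bark(p_dog_out, p_bark_given_out, p_bark_given_in):
    """Marginalize over the single parent node to get P(hb)."""
    # P(hb) = P(hb|do) P(do) + P(hb|not do) P(not do)
    return p_bark_given_out * p_dog_out + p_bark_given_in * (1.0 - p_dog_out)


# Placeholder numbers (assumptions, not taken from the source figure).
p_hb = prob_hear_bark(p_dog_out=0.4, p_bark_given_out=0.7, p_bark_given_in=0.01)
print(round(p_hb, 4))
```

Only the parent's marginal and the two conditional entries for hear-bark appear in the sum; nothing further up the graph is needed once P(do) is known.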

21

This BN benefit hugely reduces # of numbers and computations needed for large networks, e.g. hundreds or thousands of genes

SSR article: many separate Bayesian networks generated based on gene expression data. Here one activator and one repressor form basic BN, with 3 corresponding expression “contexts” shown at bottom.

22

Independence assumptions

Source of savings in # of values needed

From our simple example: are ‘family-out’ and ‘hear-bark’ independent, i.e. P(hb|fo)=P(hb)? Intuition might say they are not independent…

23

Independence assumptions

…but in fact they can be assumed to be independent if some conditions are met.

Conditions are symbolized by presence/absence and direction of arrows between nodes.

Knowing whether the dog is or is not in the house is all that is needed to know the probability of hearing a bark, so once the dog's location is known, hearing a bark is independent of whether the family is in or out. This kind of (conditional) independence assumption is what allows savings in how many numbers must be specified for probabilities.
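The distinction can be checked by brute-force enumeration on a reduced three-node chain, family-out → dog-out → hear-bark. This is a sketch under stated assumptions: the chain omits the other nodes of the example, and every probability below is a made-up placeholder.

```python
from itertools import product

# Placeholder parameters for the chain fo -> do -> hb (assumptions).
P_FO = 0.15
P_DO_GIVEN_FO = {True: 0.9, False: 0.3}
P_HB_GIVEN_DO = {True: 0.7, False: 0.01}


def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


# Joint distribution built from the BN factorization.
joint = {
    (fo, do, hb): bern(P_FO, fo)
    * bern(P_DO_GIVEN_FO[fo], do)
    * bern(P_HB_GIVEN_DO[do], hb)
    for fo, do, hb in product([True, False], repeat=3)
}


def prob(event):
    """Sum the joint over all outcomes where `event` holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))


def conditional(event, given):
    """P(event | given) by direct enumeration."""
    both = prob(lambda fo, do, hb: event(fo, do, hb) and given(fo, do, hb))
    return both / prob(given)


hb_given_do_and_fo = conditional(lambda fo, do, hb: hb, lambda fo, do, hb: do and fo)
hb_given_do_not_fo = conditional(lambda fo, do, hb: hb, lambda fo, do, hb: do and not fo)
hb_given_fo = conditional(lambda fo, do, hb: hb, lambda fo, do, hb: fo)
hb_marginal = prob(lambda fo, do, hb: hb)

print(hb_given_do_and_fo, hb_given_do_not_fo)  # equal: fo adds nothing once do is known
print(hb_given_fo, hb_marginal)                # differ: fo and hb are dependent marginally
```

The first pair agrees, while the second pair differs, which matches the intuition that family-out and hear-bark are related unless the dog's location is already known.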

24

Order of reduction of required numbers

Complete specification of probability distribution of n binary random variables needs 2^n – 1 joint probabilities (jp). So for our example (n = 5), 31 jp would be needed. But BN independence assumptions can reduce this to just 10 probabilities (listed on slide 19).
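The two counts can be reproduced with a short sketch. The graph structure below (two root nodes, light-on and hear-bark with one parent each, dog-out with two parents) is an assumption recalled from the “Bayesian Networks Without Tears” example, since the slide's figure is not reproduced here; the node names are likewise assumed.

```python
def full_joint_size(n_vars):
    """Independent numbers needed for a full joint over n binary variables."""
    return 2 ** n_vars - 1


def bn_parameter_count(parents):
    """One number per combination of parent values, for each binary node."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Assumed structure of the five-node example network.
dog_network = {
    "family-out": [],
    "bowel-problem": [],
    "light-on": ["family-out"],
    "dog-out": ["family-out", "bowel-problem"],
    "hear-bark": ["dog-out"],
}

print(full_joint_size(len(dog_network)))  # 31
print(bn_parameter_count(dog_network))    # 10
```

The count is 1 + 1 + 2 + 4 + 2 = 10: root nodes need one prior each, and every other node needs one conditional probability per setting of its parents.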

25

Evaluating Bayesian networks

Generally NP hard!

26

Where do the numerical estimates of probability come from?

Can be provided, or at least initialized, by expert opinion

Can be learned by system

Both SSR and BSK articles lay out basics and some details of iterative algorithms for finding probability numbers.

27

Bayesian networks applied to diverse applications

“Computerized tongue diagnosis based on Bayesian networks”: devising expert system for Chinese medical method (supplementary reference 3)

28

For good entry-level BN tutorial: see supplementary references 4 and 5

29



30

Aims of SSR article

Bayesian network-based algorithms are applied to gene expression data to generate good testable hypotheses.

31

Results of SSR article

Expression data set, from other researchers circa 2000, is for genes of yeast subjected to various kinds of stress

Compiled list of 466 candidate regulators

Applied analysis to 2355 genes in all 173 arrays of yeast data set

This gave automatic inference of 50 modules of genes

All modules were analyzed with external data sources to check functional coherence of gene products and validity of regulatory program

Three novel hypotheses suggested by method were tested in bio lab and found to be accurate

32

Results of SSR article

2 examples of 50 modules inferred by SSR methods:

Respiration – mostly genes encoding respiration proteins or glucose-metabolism proteins. One primary regulator predicted – Hap4 – which is known from past experiments to play activation role in respiration. Secondary regulators affect Hap4 expression.

Nitrogen catabolite repression – 29 genes tied to process by which yeast uses best available nitrogen source. Key regulator suggested is Gat1, due to 26 of 29 genes having Gat1 regulatory motif in their upstream regions.

33

Results of SSR article

Evaluating module content and regulation programs

All 50 modules were tested to see if proteins coded in same module had related functions

Scored modules on how many genes are noted in current bio databases as being related to the predicted function – diagram, next slide

31 of 50 modules had coherence >50%; only 4 had coherence <30%.

34

Results of SSR article

Colored boxes indicate that known experimental evidence validates the predicted regulatory role of a regulator (named in one of the ‘Reg’ columns) in a given module (each row of the table).

M, C and G column headers and different colors of boxes represent different sorts of experimental evidence that validate the model’s prediction.

C(%): functional coherence of module, from literature mentions of module genes.

#G: number of genes in module

35

Results of SSR article

To find global relationships between modules, graph (next 2 slides) made showing modules & their motifs. Motifs were found within the 500 base pairs upstream from each gene.

Observations from this graph: modules with related biological functions often shared at least one motif, & sometimes shared one or more regulator genes.

36

Module relationships, 1 of 2

37

Module relationships, 2 of 2

38

Additional tests of predictions

Inferred regulator models were evaluated by comparing known functions of predicted regulators with their predicted regulation functions

Three previously untested hypotheses suggested by the model were tested with experiments comparing wild-type expression with deletion-type expression under the conditions hypothesized (e.g. heat shock and hypo-osmotic shift). A paired t-test showed that all three regulators do have roles in the hypothesized conditions.
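For readers unfamiliar with the test, a paired t statistic on matched wild-type and deletion measurements can be sketched as follows. The expression values are invented purely for illustration and have no connection to the SSR experiments; the function name is also hypothetical.

```python
import math
import statistics


def paired_t_statistic(wild_type, deletion):
    """Return (t, degrees of freedom) for matched pairs of measurements."""
    diffs = [w - d for w, d in zip(wild_type, deletion)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up expression levels for the same four genes in both strains.
wild = [2.0, 3.0, 4.0, 5.0]
mutant = [1.0, 1.5, 3.5, 3.0]
t_value, dof = paired_t_statistic(wild, mutant)
print(round(t_value, 3), dof)
```

The statistic is then compared against a t distribution with the returned degrees of freedom to obtain a p-value.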

39

Results summary

The method is able to accurately predict functions for regulators, their targets and experimental conditions under which regulation occurs.

40

Model-building method

Three stages, illustrated on next slide:

Preprocessing

Module networks procedure

Post-processing

41

42

Preprocessing

Candidate regulators are chosen from among known and suspected transcription factors and signal transduction molecules. Informed choice of candidate regulators makes algorithm workable – without selectivity, bad results are likely.

43

Module network procedure

Genes are partitioned into modules and regulation program is sought for each module to explain gene expression in module.

44

Post-processing

“Enrichment” of annotations for predicted modules is sought in literature; enrichment of regulatory motifs is sought within 500 base pairs upstream from genes

45

What does a BN look like here?

Need to specify two things to describe a BN

Graph topology (structure)

Parameters of each conditional probability distribution

Possible to learn both from data

Learning structure is much harder than learning parameters

46

Regulator programs: more complex Bayesian networks, made along lines of earlier simple example

Simple generic example seen earlier…

47

…and real example: respiration & carbon regulation module (continued next slide)

48

Colored entries in columns on right show genes with enriched literature annotations for that column’s module (probabilities of overall enrichments are at top of columns, previous slide)

49



50

BSK article overview

Authors: Battle, Segal, Koller

All from Stanford CS department

Proposes a “novel probabilistic model of gene regulation for the task of identifying overlapping biological processes and the regulatory mechanism controlling their activation.”

Detailed discussion of their COPR algorithm along with experimental methods and results

Builds on and extends work of the SSR article; in particular, allows genes to simultaneously belong to more than one biological “process”

51

From BSK abstract:

“…A key feature of our approach is that we allow genes to participate in multiple processes, thus providing a more biologically plausible model for the process of gene regulation. We present an algorithm to learn this model automatically from data, using only genome-wide measurements of gene expression as input. We compare our results to those obtained by other approaches, and show significant benefits can be gained by modeling both the organization of genes into overlapping cellular processes and the regulatory programs of these processes. Moreover, our method successfully grouped genes known to function together, recovered many regulatory relationships that are known in the literature, and suggested novel hypotheses regarding the regulatory role of previously uncharacterized proteins.”

52

COPR model approach

COPR = Coregulated Overlapping Processes model

Combines continuous and discrete scoring

Allows genes to participate in multiple biological processes

53

COPR model

Genes, arrays, expression measurements, regulators and biological processes are the elements represented

First component of COPR model represents gene expression and its decomposition into activity level of processes. Formula details to follow…

Second component of COPR model is the regulatory model.

54

First COPR component

Genes, arrays, expression measurements, regulators and biological processes are the elements represented

G={g1, g2…gn} genes

Attributes: g.M1...g.Mj, where g.Mp is Boolean showing whether g is part of process p (discrete variable). g.M is array of g.M1...g.Mj

A={a1,a2…ak} arrays

Attributes: a.C1…a.Cj, where a.Cp shows degree to which process p is active in array a (continuous variable)

E={e11…enk} expressions, one for each gene in each array

Attributes: e.Gene, e.Array, e.Level – the respective gene, array and level of expression of that gene in that array

55

First COPR component, cont’d.

Expression level of gene g in array a is assumed to be a sum of g’s expression levels in each of the processes in which it participates, where g’s expression level in process p is the activity of the process a.Cp.

Assumed: “the expression of gene g in array a is normally distributed with a mean that is equal to the sum, over processes p in which g participates, of the activity level of p:”

e.Level ~ N( Σ_p g.Mp · a.Cp , σ_a² )   (σ_a² = variance of array a)
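The assumption above can be written out as a few lines of code. This is a minimal sketch: the function names are hypothetical, the membership and activity values are invented, and the variance is set to 1.0 only for convenience.

```python
import math


def expected_level(membership, activity):
    """Mean expression: sum of a.Cp over the processes p where g.Mp is True."""
    return sum(a for m, a in zip(membership, activity) if m)


def log_likelihood(level, membership, activity, variance):
    """Log density of e.Level under a normal with the mean above."""
    mean = expected_level(membership, activity)
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (level - mean) ** 2 / (2.0 * variance))


# Invented example: gene g belongs to processes 1 and 3 of three.
g_M = [True, False, True]
a_C = [1.5, -0.4, 0.7]
mean = expected_level(g_M, a_C)
print(mean)
print(log_likelihood(2.0, g_M, a_C, variance=1.0))
```

Because membership is Boolean, a gene in two processes simply has the two activity levels added, which is what lets the model represent overlapping processes.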

56

Second COPR component

Regulatory model: Assumed that genes in the same process are coregulated, and therefore share the same regulatory mechanism. So a regulation program is defined for each process (Mp).

New attribute is added for A={a1,a2…ak} arrays

Attribute: a.R1..a.Rn where a.Rp shows expression level of regulator Rp in array a

57

COPR model with 2 processes & 3 regulators

58

Learning a COPR model from microarray data

Consists of learning a probabilistic model from partially observed data

Arrays provide complete e.Level data for each gene, and complete a.R values for each regulator

From this data, need to group genes into processes, estimate process activity levels for each experiment and learn regulatory control programs governing each process

Tool: a “hard-assignment” variant of the structural EM (SEM) algorithm

59

SEM (structural Expectation-Maximization) algorithm summary

Initialized by using a “standard expression clustering technique” to choose assignments of genes to processes

Next an initial set of activity levels is found with least squares method

Initial result is assignments of values to variables G.M (Booleans showing whether genes belong to processes) and A.C (continuous variables showing process activity levels in arrays)
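The least-squares step for a single array can be sketched as below: with the gene-to-process assignments held fixed, it solves the normal equations for that array's activity levels. This is an illustration only, not the BSK implementation; the helper names and the toy data are hypothetical, and the code assumes the membership matrix has full column rank.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares_activity(membership, levels):
    """Activity levels minimizing sum_g (level_g - sum_p M_gp * C_p)^2."""
    n_proc = len(membership[0])
    # Normal equations: (M^T M) C = M^T e
    mtm = [[sum(row[p] * row[q] for row in membership) for q in range(n_proc)]
           for p in range(n_proc)]
    mte = [sum(row[p] * e for row, e in zip(membership, levels))
           for p in range(n_proc)]
    return solve_linear(mtm, mte)


# Toy data: four genes, two processes; the last gene is in both.
G_M = [[1, 0], [1, 0], [0, 1], [1, 1]]
levels = [2.0, 2.0, -1.0, 1.0]
a_C = least_squares_activity(G_M, levels)
print(a_C)
```

In this toy case the data fit exactly, so the recovered activities are 2.0 and -1.0; with noisy data the same call returns the best compromise in the squared-error sense.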

60

SEM summary, continued

Next, iterations of two different steps alternate, giving new values for G.M and A.C, until those values stabilize.

Step 1 (M-step): find regression tree for each process p that maximizes Bayesian score, which is the posterior probability of the regression tree given the gene expression data

Step 2 (E-step): find the most likely joint assignments to G.M and A.C. Difficult; involves both discrete and continuous variables; an approximate rather than an exact algorithm is used to avoid a # of operations exponential in the number of processes
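To see where the exponential cost comes from, the sketch below finds the best membership vector for one gene by trying every Boolean combination, with the activity levels held fixed. It is not the approximate algorithm from the BSK article: it ignores the regulation programs, assumes equal variance in every array (so that maximizing the Gaussian likelihood reduces to minimizing squared error), and uses invented data and a hypothetical function name.

```python
from itertools import product


def best_membership(levels, activities):
    """Exhaustively search all 2^j membership vectors for one gene.

    levels[a]        : expression of the gene in array a
    activities[a][p] : activity a.Cp of process p in array a
    """
    n_proc = len(activities[0])
    best, best_err = None, float("inf")
    for candidate in product([False, True], repeat=n_proc):
        err = 0.0
        for level, act in zip(levels, activities):
            predicted = sum(c for m, c in zip(candidate, act) if m)
            err += (level - predicted) ** 2
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err


# Invented data: three arrays, two processes.
A_C = [[2.0, -1.0], [0.5, 1.5], [-1.0, 0.0]]
gene_levels = [1.0, 2.0, -1.0]
membership, err = best_membership(gene_levels, A_C)
print(membership, err)
```

The loop visits 2^j candidates for j processes, which is manageable for two processes but not for dozens; that growth is the reason an approximate search is preferred.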

61

Experimental results

Statistical validations of results from yeast stress data

COPR learned processes whose genes are more enriched with literature annotations, motifs and transcription factor targets than those learned by earlier learning model (“Module Networks”) made by same authors

Results comparisons on next 3 slides

62

Literature gene annotations

63

Presence of known motifs

64

Known transcription factor targets

65

BSK conclusion & future work ideas

“In many cases, our COPR model’s predictions are remarkably coherent: A process associated with a certain cellular function is often predicted to be active in precisely the conditions where that function plays a role. Such coherent results involving uncharacterized genes or regulators can suggest novel biological hypotheses that can be tested in the lab.”

Could build on COPR work by:

Integrating additional data sources to improve discovery of regulators

Applying method to human gene expression data

66

Outline

Microarray technology radically improves gene expression data volume & precision

Key terms

Bayesian networks overview

SSR article

BSK article

References

67

References

Main articles: links on CS 374 web site

1. “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N, Nature Genetics, June 2003
http://robotics.stanford.edu/~erans/module_nets/

2. “Probabilistic discovery of overlapping cellular processes and their regulation,” Battle A, Segal E, Koller D, RECOMB ’04 conference, March 2004

Supplementary references

1. “Computer scientists develop tool for mining genomic data,” Graduating Engineer & Computer Careers, Vol. 27 No. 1, Fall 2004: 13

2. “Observing the living genome,” Ferea T, Brown P, Current Opinion in Genetics & Development, 1999, 9:715–722

3. “Computerized Tongue Diagnosis Based on Bayesian Networks,” Pang B, Zhang D, Li N, Wang K, IEEE Transactions on Biomedical Engineering, Vol. 51 No. 10, October 2004: 1803-1810

4. “Bayesian Networks Without Tears,” Charniak E, American Association for Artificial Intelligence, AI Magazine, Winter 1991: 50-63
http://www.kddresearch.org/Resources/Papers/Intro/notears.pdf

5. “A Brief Introduction to Graphical Models and Bayesian Networks,” Murphy K, 1998, http://www.ai.mit.edu/~murphyk/Bayes/bnintro.html.
