Microarrays, Expression, and Regulatory Networks Thanks to Prof. Mehmet Koyuturk, Case Western Reserve University.

Page 1: Microarrays, Expression, and Regulatory Networks

Microarrays, Expression, and Regulatory Networks

Thanks to Prof. Mehmet Koyuturk, Case Western Reserve University.

Page 2: Microarrays, Expression, and Regulatory Networks

Central Dogma

3. DNA Microarrays

A functional protein (or, sometimes, RNA) that is coded by a particular gene is often called the product of that gene.

Page 3: Microarrays, Expression, and Regulatory Networks

Gene Expression

Gene expression is the process of synthesizing a functional gene product (protein or RNA) from a gene, i.e., a segment of DNA that specifies heritable information.
In a multicellular organism, all cells contain identical genomes, but different genes are expressed in different types of cells.
Regulation of gene expression
  Development
  Response to environmental signals

Page 4: Microarrays, Expression, and Regulatory Networks

Studying Genome-wide Expression

Which genes are expressed, and at what levels, is linked to the phenotype of a cell.
  The concentration of mRNA corresponding to each gene in the genome provides a measure of gene expression
Gene expression is regulated at various stages, so mRNA concentration is not necessarily a perfect indicator of the activity of a gene (in terms of the activity of its product).
  Splice variants
  Post-translational modification
  Proteomics

Page 5: Microarrays, Expression, and Regulatory Networks

Research Questions

Knowledge of genome-wide expression makes it possible to study fundamental questions related to gene expression:
  What genes are expressed in what cell types (e.g., different tissues)?
  How does gene expression change over time (e.g., during the cell cycle)?
  How does a certain type of disease influence the expression of one or more genes to alter phenotype?
  How do the expression levels of different groups of genes change under different conditions?
  How do genes regulate each other’s expression?

Page 6: Microarrays, Expression, and Regulatory Networks

DNA Microarray Technology

Measure, in bulk (for thousands of genes), the amount of mRNA corresponding to each gene in a given cell sample.
The major tool in transcriptomics
  It is possible to measure the expression of a large number of genes (often, the entire genome) in a single sample
  Makes it possible to compare the expression levels of several genes in one sample
Large-scale application of traditional techniques
  Hybridization-based methods

Page 7: Microarrays, Expression, and Regulatory Networks

Hybridization

The process of joining two complementary strands of DNA, or one each of DNA and RNA, to form a double-stranded molecule.
Key idea in microarray technology

Page 8: Microarrays, Expression, and Regulatory Networks

What is a DNA Microarray?

A DNA microarray is a slide onto which a regular pattern of spots is deposited.
  Each spot contains many copies of a specified single-stranded DNA sequence (i.e., multiple biologically identical sequences)
  All sequences are chemically bonded to the surface of the slide
  There is a different DNA sequence at each spot (i.e., the sequences at different spots are biologically different)
Spots are small.
  Thousands of spots fit on a single slide a few centimeters across

Page 9: Microarrays, Expression, and Regulatory Networks

How do DNA Microarrays work?

The DNA sequences in the spots act as probes that hybridize with complementary sequences.
  If a complementary sequence exists in the sample, it will hybridize to the corresponding DNA sequence in a spot
Solutions extracted from tissue samples contain large numbers of mRNAs of many different types, namely those that happen to be present in the cells at the time of the experiment.
  The amount of mRNA hybridized in a spot provides an estimate of the concentration of that mRNA in the sample
  Each spot targets a different type of mRNA (gene)

Page 10: Microarrays, Expression, and Regulatory Networks

Measuring mRNA Concentration

The spots in which hybridization takes place can be visualized using fluorescence techniques.
  Sequences in the sample are fluorescently labeled
  The DNA that hybridizes is visually identifiable as glowing spots on the array
  Spots that have nothing hybridized are not visible
The intensity of fluorescence at each spot is proportional to the amount of the corresponding type of mRNA in the sample.
DNA microarrays can detect the presence of sequences corresponding to all spots simultaneously.

Page 11: Microarrays, Expression, and Regulatory Networks

Types of Microarrays

1. Oligonucleotide arrays
2. cDNA arrays

Page 12: Microarrays, Expression, and Regulatory Networks

Oligonucleotide Arrays

Oligo: just a few, scanty
Oligonucleotide arrays use short DNA sequences in the spots.
  Usually 25 nucleotides
  Several spots correspond to one gene
Oligonucleotides are synthesized in situ, based on the sequences.
  One base at a time
Sometimes called a chip
  Commercially manufactured by Affymetrix

Page 13: Microarrays, Expression, and Regulatory Networks

Production of Oligonucleotide Arrays

Photolithography and combinatorial chemistry

Page 14: Microarrays, Expression, and Regulatory Networks

Hybridization Specificity

Each oligonucleotide should hybridize to a specific gene in the organism.
  Short sequences => cross-hybridization is more probable
  An organism has many genes with related sequences
Perfect match/mismatch (PM/MM) probe strategy
  Two spots for each oligonucleotide
  PM: matches the target sequence exactly
  MM: differs only at the base in the middle of the sequence
  Assumption: non-specific binding is identical for PM and MM probes
  PM - MM provides a measure of specific hybridization

Page 15: Microarrays, Expression, and Regulatory Networks

Use of Oligonucleotide Arrays


Page 16: Microarrays, Expression, and Regulatory Networks

cDNA Arrays

A cDNA is a DNA strand synthesized using a reverse transcriptase enzyme, which makes a DNA sequence that is complementary to an RNA template.
  Reverse of what happens in transcription
  It is possible to synthesize cDNAs from the mRNAs present in cells
  There are cDNA libraries that contain sequences of genes known to be expressed in particular cell types
Use cDNA sequences as probe sequences on microarrays.
  Knowledge of the sequence is not necessary
  Experimentally identifying a set of suitable cDNAs is sufficient

Page 17: Microarrays, Expression, and Regulatory Networks

cDNA Arrays

cDNAs are quite long: 500-2000 bases.
  Hybridization is much more specific
  A cDNA contains a large fraction of a gene sequence, but not necessarily the entire gene
  Generally, one spot is adequate to recognize a single gene
The process of array manufacture is less reproducible.
  It is not easy to control the amount of DNA at each spot
  It is not usually possible to compare absolute intensities of spots from different slides
  Use two samples on one array!

Page 18: Microarrays, Expression, and Regulatory Networks

Two-color Hybridization

Two samples
  One test sample
  One control (reference) sample
Prepare RNA extracts from each sample separately.
Make cDNA from each sample using nucleotides labeled with a different color.
  Reference sample: green (Cy3)
  Test sample: red (Cy5)
Mix the labeled populations and let the mixture hybridize with the array.
  cDNAs from the different samples should bind to a spot in proportion to their concentrations

Page 19: Microarrays, Expression, and Regulatory Networks

Two-Color Hybridization

Red spot: the gene is expressed significantly more in the test sample
Green spot: the gene is expressed significantly more in the reference sample
Yellow spot: the gene has about the same expression level in both samples
The ratio of red intensity to green intensity provides a measure of relative expression.

Page 20: Microarrays, Expression, and Regulatory Networks

Use of cDNA Arrays


Page 21: Microarrays, Expression, and Regulatory Networks

Use of Two-Color Hybridization

Compare cell/tissue samples
  Cells before and after an experimental perturbation
  Successive times during a temporally staged process
  Between stages of differentiation
  Mutant cell vs. wild type
How do we compare multiple samples?
  Time-course experiments

Page 22: Microarrays, Expression, and Regulatory Networks

Comparing Multiple Samples

Choose a single reference sample.
  Need not be related to the samples being examined
  Time-course experiments: initial sample
Since the concentration of each mRNA in the reference sample is fixed, the expression relative to the reference sample provides a fair comparison between all other samples.
The reference sample should provide a hybridization signal for each gene (it should have non-zero mRNA concentration).
  Approximation to the ideal reference sample: an equal mixture of material from all samples

Page 23: Microarrays, Expression, and Regulatory Networks

Oligonucleotide vs. cDNA Arrays

cDNA arrays do not require probe design.
cDNA provides higher specificity due to the longer sequences of targets.
  However, cDNA may contain repetitive sequences that are often found in various genes
  Techniques like PM/MM enhance the specificity of oligonucleotides
cDNA arrays are more useful on a global level.
  Screening steady-state mRNA expression levels
Oligonucleotide arrays are more useful when more precise analysis is required.
  SNPs

Page 24: Microarrays, Expression, and Regulatory Networks

Relative Expression

$R_i$: red intensity (test sample)
$G_i$: green intensity (reference sample)
Intensity ratio: $T_i = R_i / G_i$
  If $T_i > 1$, gene $i$ is up-regulated in the test sample
  If $T_i < 1$, gene $i$ is down-regulated in the test sample
  Eliminates spot-to-spot variability to a certain extent

Page 25: Microarrays, Expression, and Regulatory Networks

Channel Normalization

There are millions of individual mRNA molecules in one sample.
  It can be assumed that the average mass of each molecule is approximately the same
  It can be assumed that the arrayed elements represent a random sampling of the genes in the organism
We use two samples of equal mass, so the total hybridization intensities should be the same. The normalization factor and the normalized intensity ratio are:

$$N_t = \frac{\sum_{i=1}^{n} R_i}{\sum_{i=1}^{n} G_i}, \qquad T_i = \frac{R_i}{N_t\, G_i}$$
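The channel normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `normalize_channels` and the toy intensities are assumptions, and the ratio is scaled by $N_t = \sum_i R_i / \sum_i G_i$ as described.

```python
def normalize_channels(red, green):
    """Return normalized ratios T_i = R_i / (N_t * G_i)."""
    # N_t: total red intensity divided by total green intensity
    n_t = sum(red) / sum(green)
    return [r / (n_t * g) for r, g in zip(red, green)]

# Hypothetical intensities: the red channel is uniformly twice as bright,
# which the normalization treats as a channel effect rather than biology.
red = [200.0, 400.0, 600.0]
green = [100.0, 200.0, 300.0]
ratios = normalize_channels(red, green)
```

After normalization, every ratio in this toy example is 1, i.e., no gene is reported as changed.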

Page 26: Microarrays, Expression, and Regulatory Networks

log Ratio

$M_i = \log_2(R_i / G_i)$
  The log transformation brings the distribution closer to a normal distribution
  $M_i = 1$ => gene $i$’s expression level is doubled
  $M_i = -1$ => gene $i$’s expression level is halved
  $M_i = 0$ => gene $i$’s expression level is unchanged

Page 27: Microarrays, Expression, and Regulatory Networks

Average Intensity of a Spot

$$A_i = \frac{1}{2}\log_2(R_i G_i) = \log_2\sqrt{R_i G_i}$$

Log-scaled geometric average of the intensities for the test and reference samples
A measure of the overall expression of a gene

Page 28: Microarrays, Expression, and Regulatory Networks

Ratio/Intensity Plot

x axis: overall expression of a gene ($A_i$)
y axis: change in expression of a gene across samples ($M_i$)
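As a small sketch of how the two plot coordinates are obtained, the snippet below computes the log ratio $M_i = \log_2(R_i/G_i)$ and the average log intensity $A_i = \frac{1}{2}\log_2(R_i G_i)$ for each spot. The function name `ma_values` is an assumption made for illustration.

```python
import math

def ma_values(red, green):
    """Return the lists (M, A) for paired red/green spot intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # change in expression
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # overall expression
    return m, a

# A spot with R = 8 and G = 2: a four-fold increase, so M = 2; A = log2(sqrt(16)) = 2.
m, a = ma_values([8.0], [2.0])
```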

Page 29: Microarrays, Expression, and Regulatory Networks

Mean Relative Intensity

For a gene whose expression level has not changed, we expect $R_i / G_i = 1$, so that $M_i = 0$.
  Most genes should have an unchanged expression level
  In our example, most points are below the horizontal axis
  This is likely to be because of a systematic bias, rather than suggesting that most genes are down-regulated in the experiment
Dye bias
  The efficiency of labeling in the two DNA populations may be different
  Binding between DNA and probe may be affected by the dye in a systematic way
  The efficiency of detecting the fluorescent signal may be different

Page 30: Microarrays, Expression, and Regulatory Networks

Array Normalization

Used to minimize systematic variations in the measured gene expression levels of the two samples hybridized to the array, and to allow comparison of gene expression levels across multiple slides.
Main assumption: after log transformation, the distribution of relative intensity values approaches a normal distribution.

Page 31: Microarrays, Expression, and Regulatory Networks

Housekeeping Genes

Normalize using housekeeping genes.
  A housekeeping gene is one that is assumed to be expressed at a constant level that does not change between reference and test samples
  Shift the data so that $M_i = 0$ for housekeeping genes
  It is not easy to find genes whose expression will surely remain unchanged

Page 32: Microarrays, Expression, and Regulatory Networks

Global Normalization

Subtract the mean relative intensity over all spots from every spot, so that the mean becomes zero:

$$\hat{M}_i = M_i - \bar{M}, \qquad \bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i$$

All these methods are global in the sense that they only change the position of the cloud of points in the M/A plot, not its shape.
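A minimal sketch of this kind of global shift is given below. The function name `global_normalize` and the optional `reference_idx` argument are assumptions for illustration: with no argument the mean over all spots is subtracted (global normalization), and with a list of spot indices the mean over those spots is subtracted instead, which corresponds to the housekeeping-gene variant.

```python
def global_normalize(m, reference_idx=None):
    """Shift log ratios so that the reference spots have mean zero."""
    # Default reference set: all spots (global normalization).
    idx = range(len(m)) if reference_idx is None else reference_idx
    idx = list(idx)
    shift = sum(m[i] for i in idx) / len(idx)
    return [x - shift for x in m]

log_ratios = [-1.0, 0.0, -2.0]
centered = global_normalize(log_ratios)                        # subtracts the mean, -1.0
by_housekeeping = global_normalize(log_ratios, reference_idx=[2])  # spot 2 assumed housekeeping
```

In both cases only the position of the point cloud changes; differences between spots are preserved.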

Page 33: Microarrays, Expression, and Regulatory Networks

Self-Normalization

Dye-flip experiments
  Another way of eliminating dye bias
  Perform a second experiment in which the red and green labeling of the samples is reversed
  Subtract the $M_i$ values of the two experiments from each other
  The result will be twice the unbiased $M_i$ value, since the term that corresponds to the bias cancels out
The normalized value of each spot depends only on the measured intensity ratios for that spot.
  Bias is assumed to be independent across spots
  Bias is assumed to be reproducible between arrays
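To make the cancellation concrete, assume (as a simplifying model, not something stated on the slide) that the forward experiment measures $M_i + b_i$ and the dye-flipped experiment measures $-M_i + b_i$, where $b_i$ is the spot's dye bias. Half the difference then recovers $M_i$. The function name `dye_flip_normalize` is hypothetical.

```python
def dye_flip_normalize(m_forward, m_flipped):
    """Recover unbiased log ratios from a forward and a dye-flipped array."""
    # (M + b) - (-M + b) = 2M, so halve the difference.
    return [(f - r) / 2.0 for f, r in zip(m_forward, m_flipped)]

true_m = [1.0, -0.5]
bias = [0.3, 0.3]
forward = [m + b for m, b in zip(true_m, bias)]
flipped = [-m + b for m, b in zip(true_m, bias)]
recovered = dye_flip_normalize(forward, flipped)
```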

Page 34: Microarrays, Expression, and Regulatory Networks

Intensity-Dependent Bias

Bias may depend on the average intensity of a spot.
In our example, there is an upward trend in the $M_i$ values for higher values of $A_i$.
Whether a gene is (in a global sense) up- or down-regulated should not depend on its average expression level.
  The fluorescence detector may be saturated at high intensity

Page 35: Microarrays, Expression, and Regulatory Networks

LOWESS

LOcally WEighted Scatterplot Smoothing
  Fit a smooth curve $m(A)$ through the data points
  This is an estimate of the bias as a function of average intensity
  Correct the values as

$$\hat{M}_i = M_i - m(A_i)$$

The shift depends on the average intensity of the spot, but the function that determines the shift is global.
  Neither global nor self-normalization
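The correction $\hat{M}_i = M_i - m(A_i)$ can be sketched with a simplified smoother. The code below is an illustrative approximation of LOWESS, assuming a tricube-weighted local linear fit with no robustness iterations; the names `lowess_fit`, `lowess_normalize`, and the parameter `frac` are choices made here, not taken from the slides. A production analysis would more likely rely on an established implementation.

```python
def lowess_fit(a, m, frac=0.5):
    """Estimate m(A_i) at every spot by tricube-weighted local linear regression."""
    n = len(a)
    k = max(2, int(round(frac * n)))           # number of neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12              # bandwidth: distance to the k-th nearest spot
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted intensity-dependent bias from each log ratio."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]

# Toy data with a purely intensity-dependent bias, M = 0.2 * A - 1.
a_vals = [float(i) for i in range(6, 16)]
m_vals = [0.2 * x - 1.0 for x in a_vals]
corrected = lowess_normalize(a_vals, m_vals)
```

Because the toy bias is exactly linear in $A$, the local fits reproduce it and the corrected values are essentially zero.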

Page 36: Microarrays, Expression, and Regulatory Networks

Normalization by LOWESS


Page 37: Microarrays, Expression, and Regulatory Networks

Gene Normalization

Array normalization makes arrays cross-comparable.
Two genes that are identically expressed in terms of Cy5 intensities may still end up having different log ratios.
Solution: center the expression values for each gene so that each gene has a mean (or median) expression value of 0.
Example (on blackboard)

Page 38: Microarrays, Expression, and Regulatory Networks

Gene Expression Matrix

[Figure: gene expression matrix, with genes along one axis and samples along the other]

Now we are ready to analyze our data!

Page 39: Microarrays, Expression, and Regulatory Networks

4. Gene Expression Data Analysis

Page 40: Microarrays, Expression, and Regulatory Networks

Analyzing Gene Expression Data

Clustering
  How are genes related in terms of their expression under different conditions?
Differential gene expression
  Which genes are affected by a change in condition, tissue, or disease?
Classification (supervised analysis)
  Given the expression profile of a gene, can we assign a function?
  Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?
Regulatory network inference
  How do genes regulate each other’s expression to orchestrate cellular function?

Page 41: Microarrays, Expression, and Regulatory Networks

Clustering

Group similar items together.
Clustering genes based on their expression profiles
  We can measure the expression of multiple genes in multiple samples
  Genes that are functionally related should have similar expression profiles
Gene expression profile
  A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample
Clustering of multi-dimensional real-valued data is a well-studied problem.

Page 42: Microarrays, Expression, and Regulatory Networks

Motivating Example

Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)

Page 43: Microarrays, Expression, and Regulatory Networks

Applications of Clustering

Functional annotation
  If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function
Identification of regulatory motifs
  If a group of genes is co-regulated, then it is likely that their regulation is modulated by similar transcription factors. By looking for common elements in the neighborhood of the coding sequences of the genes in a cluster, we can identify regulatory motifs and their locations (promoters)
Modular analysis

Page 44: Microarrays, Expression, and Regulatory Networks

Gene Expression Matrix

The matrix has $m$ rows (genes) and $n$ columns (samples).
Generally, $m \gg n$
  $m = O(10^3)$
  $n = O(10^1)$
Each row is an $n$-dimensional vector
  Expression profile

$$E = [e_{ij}], \quad 1 \le i \le m,\ 1 \le j \le n$$

$$e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$$

Page 45: Microarrays, Expression, and Regulatory Networks

Proximity Measures

How do we decide which genes are similar to each other?

Euclidean distance

$$\mathrm{Euclidean}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$$

Manhattan distance

$$\mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$$

Page 46: Microarrays, Expression, and Regulatory Networks

Distance

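Minkowski distance
  General version of Euclidean, Manhattan, etc.
  $p$ is a parameter

$$\mathrm{Minkowski}(e_i, e_j) = \|e_i - e_j\|_p = \left(\sum_{k=1}^{n} |e_{ik} - e_{jk}|^p\right)^{1/p}$$

In the limit $p \to \infty$, this becomes the maximum coordinate difference:

$$\|e_i - e_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}|$$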
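The distance family above can be illustrated with a short sketch. The function names `minkowski` and `chebyshev` are chosen here for illustration; $p = 1$ gives the Manhattan distance, $p = 2$ the Euclidean distance, and the maximum coordinate difference is the limiting case as $p$ grows.

```python
def minkowski(x, y, p):
    """Minkowski distance with parameter p between two expression profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

e_i = [0.0, 0.0]
e_j = [3.0, 4.0]
manhattan = minkowski(e_i, e_j, 1)   # 3 + 4
euclidean = minkowski(e_i, e_j, 2)   # sqrt(9 + 16)
largest = chebyshev(e_i, e_j)        # max(3, 4)
```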

Page 47: Microarrays, Expression, and Regulatory Networks

Normalization

If we want to measure the distance between directions rather than absolute magnitudes, it may be necessary to standardize the mean and variance of the expression levels for each gene:

$$\mu_i = \frac{1}{n}\sum_{k=1}^{n} e_{ik}, \qquad \sigma_i^2 = \frac{1}{n}\sum_{k=1}^{n} (e_{ik} - \mu_i)^2$$

$$e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}, \qquad e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T$$

Page 48: Microarrays, Expression, and Regulatory Networks

Correlation

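The similarity between the variation of two random variables
A vector is treated as a sampling of a random variable
Covariance (with $\mu_i$ the mean and $\sigma_i^2$ the variance of profile $e_i$)

$$\mathrm{Cov}[e_i, e_j] = \frac{1}{n}\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$$

$$\mathrm{Var}[e_i] = \mathrm{Cov}[e_i, e_i] = \sigma_i^2$$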

Page 49: Microarrays, Expression, and Regulatory Networks

Pearson Correlation Coefficient

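Pearson correlation coefficient (with $\mu_i$ the mean and $\sigma_i$ the standard deviation of profile $e_i$)

$$\mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}} = \frac{\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{n\,\sigma_i\,\sigma_j}$$

Pearson correlation is equal to the cosine of the angle between the normalized expression profiles (equivalently, their inner product divided by $n$).

Pearson correlation is normalized:

$$-1 \le \mathrm{Pearson}(e_i, e_j) \le 1$$

$$\mathrm{Pearson}(e_i, e_j) = \mathrm{Pearson}(e'_i, e'_j)$$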

Page 50: Microarrays, Expression, and Regulatory Networks

Euclidian Distance & Correlation

Euclidean distance (between normalized profiles) and Pearson correlation coefficient are closely related:

$$\mathrm{Euclidean}(e'_i, e'_j) = \sqrt{2n\,(1 - \mathrm{Pearson}(e_i, e_j))}$$

These are the two most commonly used proximity measures in gene expression data analysis.

Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles.
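The relation between the Euclidean distance of standardized profiles and the Pearson correlation can be checked numerically. The sketch below assumes the population convention (division by $n$) for the variance, which is what makes the factor $2n$ come out; the function names and the toy profiles are illustrative only, and a constant profile (zero variance) is not handled.

```python
import math

def standardize(e):
    """Shift a profile to mean 0 and scale it to variance 1 (dividing by n)."""
    n = len(e)
    mu = sum(e) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in e) / n)
    return [(x - mu) / sigma for x in e]

def pearson(x, y):
    """Pearson correlation as the inner product of standardized profiles over n."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
r = pearson(x, y)
lhs = euclidean(standardize(x), standardize(y))
rhs = math.sqrt(2 * len(x) * (1 - r))
```

For these toy profiles `lhs` and `rhs` agree up to rounding, so ranking gene pairs by normalized Euclidean distance or by Pearson correlation gives the same order.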

Page 51: Microarrays, Expression, and Regulatory Networks

Other Measures of correlation

Pearson is vulnerable to outliers.
  If two genes both have very high expression in a single sample, that sample might dominate and make the two expression profiles appear highly correlated
  Jackknife correlation: estimate $n$ correlations by leaving each dimension (sample) out in turn, and take the minimum among them
Pearson is not robust for non-Gaussian distributions.
  Spearman’s rank-order correlation coefficient: rank the expression levels and replace each expression level with its rank
  More robust against outliers
  A lot of information is lost
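Both alternatives can be sketched on top of a plain Pearson function. The code below is a minimal illustration with assumed function names; ties in the Spearman ranking are given average ranks, which is a common convention but not something the slide specifies.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:]) for k in range(len(x)))

def ranks(v):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on ranks instead of raw expression levels."""
    return pearson(ranks(x), ranks(y))

# Two hypothetical profiles that agree only in one extreme sample (the last one).
g1 = [1.0, 2.0, 1.5, 1.2, 50.0]
g2 = [2.0, 1.0, 1.8, 1.1, 60.0]
```

Here `pearson(g1, g2)` is close to 1 because of the single extreme sample, while `jackknife_correlation(g1, g2)` is negative, exposing that the agreement rests on one sample.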

Page 52: Microarrays, Expression, and Regulatory Networks

Clustering Methods

Hierarchical clustering
  Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster
  Higher branches correspond to coarser clusters
Partitioning
  Partition genes into several groups so that similar genes will be in the same partition

Page 53: Microarrays, Expression, and Regulatory Networks

Hierarchical clustering

Direction of clustering
  Bottom-up (agglomerative): start from individual genes, join them into groups until only one group is left
  Top-down (divisive): start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene
Agglomerative clustering is computationally less expensive.
  Why?
Hierarchical clustering methods are greedy.
  Once a decision is made, it cannot be undone

Page 54: Microarrays, Expression, and Regulatory Networks

Agglomerative clustering

Start with $m$ clusters: each cluster contains one gene.
At each step, choose the two clusters that are closest (or most correlated) and merge them.
How do we evaluate the distance between two clusters?
  Single linkage: if two clusters contain two very close genes, then the clusters are close to each other (here $\delta_{ij}$ is the distance between the profiles of genes $i$ and $j$)

$$\delta(C_k, C_l) = \min_{i \in C_k,\ j \in C_l} \delta_{ij}$$

Page 55: Microarrays, Expression, and Regulatory Networks

Agglomerative Clustering

Complete linkage: two clusters are close to each other only if all genes inside them are close to each other

$$\delta(C_k, C_l) = \max_{i \in C_k,\ j \in C_l} \delta_{ij}$$

Group average: two clusters are close to each other if their genes are close on average (informally, if their centers are close to each other)

$$\delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|}\sum_{i \in C_k}\sum_{j \in C_l} \delta_{ij}$$
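The agglomerative procedure with the three linkage rules can be sketched as follows. This is a deliberately naive illustration (it rescans all cluster pairs at every step) with assumed names; `dist` is a precomputed matrix of pairwise gene distances $\delta_{ij}$.

```python
def cluster_distance(ck, cl, dist, linkage="single"):
    """Distance between two clusters (lists of gene indices) under a linkage rule."""
    pair = [dist[i][j] for i in ck for j in cl]
    if linkage == "single":
        return min(pair)
    if linkage == "complete":
        return max(pair)
    return sum(pair) / len(pair)          # group average

def agglomerate(dist, linkage="single"):
    """Merge the two closest clusters repeatedly; return the list of merges."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # greedy: a merge is never undone
        del clusters[b]
    return merges

# Four hypothetical genes placed on a line at 0, 1, 5, and 6.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
single = agglomerate(dist, "single")
```

The sequence of merges, read from first to last, is the dendrogram from the leaves up; only the height of the final merge differs between the three linkage rules in this example.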

Page 56: Microarrays, Expression, and Regulatory Networks

Divisive Clustering

Recursive bipartitioning
  Find an “optimal” partitioning of the genes into two clusters
  Recursively work on each partition
  Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems
May be computationally expensive
  The problem is “global”
  At every level of the tree, we have to work on all of the genes
  If the tree is imbalanced, there might be as many as $m$ levels
With a reasonable stopping criterion, it may be considered a partition-based clustering method as well.

Page 57: Microarrays, Expression, and Regulatory Networks

Partition Based Clustering

Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters.
Easily interpretable
  Especially for large datasets (as compared to hierarchical clustering)

Page 58: Microarrays, Expression, and Regulatory Networks

Number of Clusters

Clustering is “unsupervised”, so generally we do not have prior knowledge of how many clusters underlie the data.
  It is very difficult to partition data into an “unknown” number of clusters
Most algorithms assume that $K$ (the number of clusters) is known.
Try different values of $K$ and find the one that results in the best clustering.
  Very expensive

Page 59: Microarrays, Expression, and Regulatory Networks

Overlapping vs. Disjoint Clusters

Genes do not have a single function.
  Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts
  Can we allow a gene to be included in more than one cluster?
Allowing overlaps between clusters poses additional challenges.
  To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters)

Page 60: Microarrays, Expression, and Regulatory Networks

Fuzzy Clustering

Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster.
  Difficult interpretation
Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values.
Hierarchical clustering is also “fuzzy” in some sense.
Continuous relaxation might alleviate computational complexity as well.

Page 61: Microarrays, Expression, and Regulatory Networks

K-Means Clustering

The most famous clustering algorithm
Given $K$, find $K$ disjoint clusters such that the total intracluster variation is minimized.

Cluster mean:

$$\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} e_i$$

Intracluster variation (with $\delta$ the distance between a profile and the cluster mean):

$$V_k = \sum_{i \in C_k} \delta(e_i, \mu_k)$$

Total intracluster variation:

$$V = \sum_{k=1}^{K} V_k$$

Page 62: Microarrays, Expression, and Regulatory Networks

K-Means Algorithm

K-means is an iterative algorithm that alternately updates cluster assignments and cluster centers, each based on the other, until no improvement is possible.
1. Choose $K$ expression profiles randomly and designate each of them as the center of one of the $K$ clusters
2. Assign each gene to a cluster
   2.1. Each gene is assigned to the cluster whose center is closest to its profile
3. Redetermine the cluster centers
4. If any gene was moved, go back to Step 2; otherwise stop
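The four steps can be sketched directly. This is a minimal illustration under stated assumptions: squared Euclidean distance is used to find the closest center, the `seed` and `max_iter` parameters are additions for reproducibility and safety, and an empty cluster simply keeps its previous center.

```python
import random

def kmeans(profiles, k, seed=0, max_iter=100):
    """Return (assignment, centers) for a list of expression profiles."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]       # Step 1
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        moved = False
        for i, p in enumerate(profiles):                       # Step 2
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            if best != assignment[i]:
                assignment[i], moved = best, True
        if not moved:                                          # Step 4: nothing moved, stop
            break
        for c in range(k):                                     # Step 3
            members = [profiles[i] for i in range(len(profiles)) if assignment[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Two well-separated hypothetical groups of profiles.
profiles = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assignment, centers = kmeans(profiles, k=2)
```

The result depends on the random initial centers in general, which is why practical analyses usually run the algorithm several times with different seeds.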

Page 63: Microarrays, Expression, and Regulatory Networks

Sample Run of K-Means


Page 64: Microarrays, Expression, and Regulatory Networks

Self Organizing Maps

Just like K-means, we have $K$ clusters, but this time they are organized into a map.
  Often a 2D grid
  We want to organize clusters so that similar clusters will be in proximity in the map
  A way of visualizing in low-dimensional (2D) space
Just like K-means, each cluster is associated with a weight vector.
  It was the cluster center in K-means
Each weight vector is first initialized randomly to some gene’s expression profile.

Page 65: Microarrays, Expression, and Regulatory Networks

SOM Algorithm

At each step, a gene is selected at random.
The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with the closest weight vector becomes the winner.
The weight vectors of the winner and of its neighbors (according to the 2D mapping) are adjusted to represent the gene’s expression profile better:

$$w_k(t+1) = w_k(t) - \alpha(t)\,\theta(C_k, C_j)\,(w_k(t) - e_i)$$

  $C_j$ is the winner cluster for gene $i$ at time $t$
  $\alpha$ is a decreasing function of time, and $\theta$ is the neighborhood function
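A single SOM update step might look like the sketch below. The Gaussian form of the neighborhood function $\theta$, the `radius` parameter, and the function name `som_step` are assumptions made for illustration; the slide only states that $\theta$ depends on the map positions of the winner and the cluster being updated.

```python
import math

def som_step(weights, grid, profile, alpha, radius):
    """Apply one SOM update in place and return the index of the winner cluster.

    weights[k] is the weight vector of cluster k, grid[k] its (row, col) map position.
    """
    # Winner: the cluster whose weight vector is closest to the gene's profile.
    j = min(range(len(weights)),
            key=lambda k: sum((w - e) ** 2 for w, e in zip(weights[k], profile)))
    for k in range(len(weights)):
        grid_d2 = (grid[k][0] - grid[j][0]) ** 2 + (grid[k][1] - grid[j][1]) ** 2
        theta = math.exp(-grid_d2 / (2.0 * radius ** 2))   # assumed Gaussian neighborhood
        # w_k(t+1) = w_k(t) - alpha * theta * (w_k(t) - e_i)
        weights[k] = [w - alpha * theta * (w - e) for w, e in zip(weights[k], profile)]
    return j

weights = [[0.0, 0.0], [10.0, 10.0]]
grid = [(0, 0), (0, 1)]
winner = som_step(weights, grid, profile=[1.0, 1.0], alpha=0.5, radius=1.0)
```

The winner moves halfway toward the profile, and its map neighbor moves by a smaller fraction; in a full run `alpha` (and typically `radius`) would shrink over time.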

Page 66: Microarrays, Expression, and Regulatory Networks

Sample SOM Output


Page 67: Microarrays, Expression, and Regulatory Networks

Gene Co-expression Network

Nodes represent genes.
Weighted edges between nodes represent the proximity (correlation) between the genes’ expression profiles.
This is, in effect, a way of predicting interactions between genes.

Page 68: Microarrays, Expression, and Regulatory Networks

Graph Theoretical Clustering

Partition the graph into heavy subgraphs.
  Maximize the total weight (number of edges) inside a cluster
  Minimize the total weight (number of edges) between clusters
Heuristic algorithms
  CLICK: recursive min-cut
  CAST: iterative improvement, one by one for each cluster
Loss of information?

Page 69: Microarrays, Expression, and Regulatory Networks

Model Based Clustering

Generating model
  Each cluster is associated with a distribution (that generates expression profiles for the associated genes) specified by model parameters
  The probability that a gene belongs to a cluster is specified by hidden parameters
Expectation Maximization (EM) algorithm
  Start with a guess of the model parameters
  E-step: compute the expected values of the hidden parameters based on the model parameters
  M-step: based on the hidden parameters, estimate the model parameters to maximize the likelihood of observing the data at hand
  Iterate
K-means is a special case.

Page 70: Microarrays, Expression, and Regulatory Networks

Evaluation of Clusters

In general, we want to maximize intra-cluster similarity while minimizing inter-cluster similarity.
Homogeneity, separation
  Based on the proximity metric
Reference partition
  Information on “true clusters” that comes from a different source (apart from the expression data)
  Molecular annotation (e.g., Gene Ontology)
  Jaccard coefficient, sensitivity, specificity
Cluster annotation
  Processes that are significantly enriched in a cluster

Page 71: Microarrays, Expression, and Regulatory Networks

Homogeneity & Separation

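Heterogeneity (or homogeneity in the reverse direction)
  How similar are the genes in one cluster?

$$H(C_k) = \frac{2}{|C_k|\,(|C_k| - 1)}\sum_{i, j \in C_k,\ i < j} \delta_{ij}$$

Separation
  How dissimilar are different clusters?

$$S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|}\sum_{i \in C_k}\sum_{j \in C_l} \delta_{ij}$$

Good clustering: low heterogeneity (i.e., high homogeneity) and high separation, since both quantities are averages of the pairwise distances $\delta_{ij}$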

Page 72: Microarrays, Expression, and Regulatory Networks

Overall Quality

Overall heterogeneity

$$H = \frac{1}{m}\sum_{C_k} |C_k|\, H(C_k)$$

Overall separation

$$S = \frac{1}{\sum_{C_k \ne C_l} |C_k|\,|C_l|}\sum_{C_k \ne C_l} |C_k|\,|C_l|\, S(C_k, C_l)$$

How do these change with respect to the number of clusters?
  Can we optimize these values to choose the best number of clusters?
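The overall scores can be computed from a distance matrix as sketched below. The sketch assumes that the heterogeneity of a cluster is the average pairwise distance among its genes and that the separation of two clusters is the average distance between their genes, with both overall scores weighted by cluster sizes; all names are illustrative, and a singleton cluster is given heterogeneity 0 by convention.

```python
def heterogeneity(cluster, dist):
    """Average pairwise distance between the genes of one cluster."""
    pairs = [dist[i][j] for a, i in enumerate(cluster) for j in cluster[a + 1:]]
    return sum(pairs) / len(pairs) if pairs else 0.0

def separation(ck, cl, dist):
    """Average distance between genes of two different clusters."""
    return sum(dist[i][j] for i in ck for j in cl) / (len(ck) * len(cl))

def overall_quality(clusters, dist):
    """Size-weighted overall heterogeneity H and overall separation S."""
    m = sum(len(c) for c in clusters)
    h = sum(len(c) * heterogeneity(c, dist) for c in clusters) / m
    num = den = 0.0
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            w = len(clusters[a]) * len(clusters[b])
            num += w * separation(clusters[a], clusters[b], dist)
            den += w
    return h, (num / den if den else 0.0)

# Four hypothetical genes on a line at 0, 1, 5, and 6.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
good = overall_quality([[0, 1], [2, 3]], dist)
bad = overall_quality([[0, 2], [1, 3]], dist)
```

The natural grouping gives a low `H` and a high `S`, while the mixed grouping gives the opposite pattern.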

Page 73: Microarrays, Expression, and Regulatory Networks

Bayesian Information Criterion

A statistical criterion for evaluating a model
  Penalizes model complexity (the number of free parameters to be estimated)
$k$ is the number of free parameters in the model, which increases with the number of clusters.
RSS is the “total error” in the model.
Trade off the number of clusters against the optimization function to choose the best number of clusters.

Page 74: Microarrays, Expression, and Regulatory Networks

Reference Partitioning

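If there is information about “ground truth” from an independent source, we can compare our clustering to such a reference partitioning.
Pairwise assessment
  Let $C_{ij} = 1$ if gene $i$ and gene $j$ are assigned to the same cluster by the clustering algorithm, 0 otherwise
  Let $R_{ij} = 1$ if gene $i$ and gene $j$ are in the same cluster according to the reference partition, 0 otherwise
Count the gene pairs in each of the four possible situations, writing the reference value first:

$$n_{11} = |\{(i,j) : R_{ij} = 1,\ C_{ij} = 1\}|, \qquad n_{00} = |\{(i,j) : R_{ij} = 0,\ C_{ij} = 0\}|$$

$$n_{10} = |\{(i,j) : R_{ij} = 1,\ C_{ij} = 0\}|, \qquad n_{01} = |\{(i,j) : R_{ij} = 0,\ C_{ij} = 1\}|$$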

Page 75: Microarrays, Expression, and Regulatory Networks

Comparing Partitions

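Here $n_{ab}$ is the number of gene pairs with $R_{ij} = a$ in the reference partition and $C_{ij} = b$ in the clustering.

Rand index (symmetric)

$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{01} + n_{10}}$$

Jaccard coefficient (sparse)

$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{01} + n_{10}}$$

Minkowski measure (sparse)

$$\mathrm{Minkowski} = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}$$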
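The three measures can be computed from two cluster-label lists as sketched below. The sketch assumes the convention that $n_{ab}$ counts gene pairs with reference value $a$ and clustering value $b$, so that the Minkowski measure is normalized by the number of pairs that are together in the reference; function names and the toy labels are illustrative, and the degenerate case of a reference with no co-clustered pairs is not handled.

```python
import math
from itertools import combinations

def pair_counts(clustering, reference):
    """Count gene pairs by (together in reference?, together in clustering?)."""
    n = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for i, j in combinations(range(len(clustering)), 2):
        r = int(reference[i] == reference[j])
        c = int(clustering[i] == clustering[j])
        n[(r, c)] += 1
    return n

def rand_index(n):
    return (n[1, 1] + n[0, 0]) / sum(n.values())

def jaccard(n):
    return n[1, 1] / (n[1, 1] + n[1, 0] + n[0, 1])

def minkowski_measure(n):
    return math.sqrt((n[0, 1] + n[1, 0]) / (n[1, 1] + n[1, 0]))

# Cluster labels for four hypothetical genes.
clustering = [0, 0, 1, 1]
reference = [0, 0, 0, 1]
n = pair_counts(clustering, reference)
```

Rand and Jaccard are similarities (1 means perfect agreement), whereas the Minkowski measure is a distance (0 means perfect agreement).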

Page 76: Microarrays, Expression, and Regulatory Networks

Cluster Annotation

Clustering results in groups of genes that are co-expressed (or co-regulated).
  For each group, can we tell something about the biological phenomenon that underlies our observation (their co-expression)?
We have partial knowledge of the function of many individual genes.
  Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families)
Taking a statistical approach, we can assign a function to each group of genes.
  A function that is popular in a cluster is associated with that cluster

Page 77: Microarrays, Expression, and Regulatory Networks

Gene Ontology

4. Gene Expression Data Analysis

77

Ontology: Study of being (e.g., conceptualization)

Gene Ontology is an attempt to develop a standardized library of cellular function

Unified view of life: Processes, structures, and functions recur in diverse organisms

Three concepts of Gene Ontology

Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)

Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)

Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)

Page 78: Microarrays, Expression, and Regulatory Networks

Hierarchy in Gene Ontology

4. Gene Expression Data Analysis

78

Gene Ontology is hierarchical

A process might have subprocesses: Seed maturation is part of seed development

A process might be described at different levels of detail: Seed dormancy is a(n example of) seed maturation

The same holds for function and component

Gene Ontology terms are related to each other via “is a” and “part of” relationships

If process A is part of process B, then A is B’s child (B is A’s parent); B involves A

If function C is a function D, then C is D’s child; C is a more detailed specification of D

Page 79: Microarrays, Expression, and Regulatory Networks

4. Gene Expression Data Analysis

79

Page 80: Microarrays, Expression, and Regulatory Networks

GO Hierarchy is a DAG

4. Gene Expression Data Analysis

80

Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG)

A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)

Page 81: Microarrays, Expression, and Regulatory Networks

Annotation

4. Gene Expression Data Analysis

81

GO-based annotation assigns GO terms to a gene

A gene might have multiple functions and can be involved in multiple processes

Multiple genes might be associated with the same function; multiple genes take part in a process

True-path rule

If a gene is annotated with a term, then it is also annotated with that term’s parents (and consequently, all of its ancestors)

How does the number of genes associated with each term change as we go down the GO DAG?
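The true-path rule can be sketched as an upward closure over the DAG. The term names, the parent links and the gene names below are hypothetical toy data, not real GO content.

```python
from collections import Counter

def propagate_annotations(direct, parents):
    """Apply the true-path rule: a gene annotated with a term is also
    annotated with every ancestor of that term in the DAG."""
    def add_ancestors(term, seen):
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                add_ancestors(p, seen)

    full = {}
    for gene, terms in direct.items():
        closure = set(terms)
        for t in terms:
            add_ancestors(t, closure)
        full[gene] = closure
    return full

# Toy DAG (child -> list of parents); a term may have several parents.
parents = {
    "seed dormancy": ["seed maturation"],
    "seed maturation": ["seed development"],
    "seed development": ["development", "reproduction"],
}
direct = {
    "gene1": {"seed dormancy"},
    "gene2": {"seed development"},
    "gene3": {"development"},
}
full = propagate_annotations(direct, parents)
genes_per_term = Counter(t for terms in full.values() for t in terms)
```

In this toy example the number of genes per term shrinks as the terms become more specific, which is the pattern the question above points at.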

Page 82: Microarrays, Expression, and Regulatory Networks

GO Annotation of Gene Clusters

4. Gene Expression Data Analysis

82

There are |C| genes in a cluster C

|T| genes are associated with GO term t

|C ∩ T| genes are in C and are associated with t

What is the association between cluster C and term t?

If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t?

What is the probability of this observation?

Statistical significance based on the hypergeometric distribution

Page 83: Microarrays, Expression, and Regulatory Networks

Hypergeometric Distribution

4. Gene Expression Data Analysis

83

We have n items, m of which are good

If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?

$$p = P[K \ge k] = \sum_{i=k}^{\min(m,r)} \frac{\binom{m}{i}\binom{n-m}{r-i}}{\binom{n}{r}}$$

n is the number of genes in the organism

m = |T|, r = |C|, k = |C ∩ T|

The lower p is, the more likely it is that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)
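The tail probability can be computed directly with exact binomial coefficients. This is a minimal sketch; the function name and the toy numbers are hypothetical.

```python
from math import comb

def enrichment_pvalue(n, m, r, k):
    """P[K >= k] when r items are drawn without replacement from n items,
    m of which are 'good' (hypergeometric upper tail)."""
    total = comb(n, r)
    tail = sum(comb(m, i) * comb(n - m, r - i) for i in range(k, min(m, r) + 1))
    return tail / total

# Toy numbers: 10 genes, 4 annotated with the term, cluster of 3 genes,
# 2 of which carry the term.
p = enrichment_pvalue(n=10, m=4, r=3, k=2)
```

For genome-scale n, a statistics library's survival function may be more convenient than summing exact integers.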

Page 84: Microarrays, Expression, and Regulatory Networks

GO Hierarchy & Cluster Annotation

4. Gene Expression Data Analysis

84

How specific (general) is the annotation we attach to a cluster?

If a cluster is larger, then it might correspond to a more general process

Some processes might be over-represented in the study set

How do we find the best location of a cluster in the GO hierarchy?

Parent-child annotation

Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster

The gene space is defined as the set of genes that are associated with t’s parents

Page 85: Microarrays, Expression, and Regulatory Networks

Parent-Child Annotation

4. Gene Expression Data Analysis

85

Page 86: Microarrays, Expression, and Regulatory Networks

Multiple Hypotheses Testing

4. Gene Expression Data Analysis

86

The p-value for a single term provides an estimate of the probability of having at least the observed number of genes attached to that particular term by chance

We have many terms; even if the likelihood of enrichment is small for a particular term, it might be very probable that some term will be enriched as much as observed in the cluster

We have to account for all hypotheses being tested simultaneously

Bonferroni correction: Apply the union bound, i.e., add up the p-value over all hypotheses tested (equivalently, multiply each p-value by the number of tests)

Which terms should we consider while correcting for multiple hypotheses for a single term?
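A minimal sketch of the Bonferroni correction, assuming every term in the dictionary counts as one tested hypothesis. The term names and p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Multiply each p-value by the number of tests (capped at 1) and
    report which terms stay below alpha."""
    n_tests = len(pvalues)
    corrected = {t: min(1.0, p * n_tests) for t, p in pvalues.items()}
    significant = {t for t, p in corrected.items() if p <= alpha}
    return corrected, significant

# Hypothetical raw enrichment p-values for four terms.
raw = {"cell cycle": 0.0004, "metabolism": 0.03, "binding": 0.2, "nucleus": 0.6}
corrected, significant = bonferroni(raw)
```

Which terms go into the dictionary is exactly the open question above: counting fewer hypotheses makes the correction less severe.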

Page 87: Microarrays, Expression, and Regulatory Networks

Representativity of Terms

4. Gene Expression Data Analysis

87

How well does a significantly enriched term represent a cluster?

How many of the genes in the cluster are attached to the term?

How many of the genes attached to the term are in the cluster?

For a term t that is significantly enriched in cluster C

Specificity: |C ∩ T|/|C|, a.k.a. precision

Sensitivity: |C ∩ T|/|T|, a.k.a. recall

Page 88: Microarrays, Expression, and Regulatory Networks

Biclustering

4. Gene Expression Data Analysis

88

A particular process might be active only in certain conditions

A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples

They might behave almost independently under other conditions

Page 89: Microarrays, Expression, and Regulatory Networks

Clustering vs. Biclustering

4. Gene Expression Data Analysis

89

Clustering is a global approach

Each gene is a point in the space defined by all samples

How about points that are clustered in a subspace?

Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides the best clustering, and vice versa

a.k.a. co-clustering, subspace clustering…

This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which groups are more evident

Page 90: Microarrays, Expression, and Regulatory Networks

Biclustering Applications

4. Gene Expression Data Analysis

90

Sample/tissue classification for diagnosis

The samples with leukemia show specific characteristics for a subset of genes

Identification of co-regulated genes

Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)

Functional annotation

Biological processes and functional classes are overlapping

Different sets of samples reveal different functional relationships

Page 91: Microarrays, Expression, and Regulatory Networks

Biclustering Principles

4. Gene Expression Data Analysis

91

A cluster of genes is defined with respect to a cluster of samples and vice versa

The clusters are not necessarily exclusive or exhaustive

A gene/condition may belong to more than one cluster

A gene/condition may not belong to any cluster at all

Biclusters are not “perfect”

Noise

Statistical inference becomes particularly important

Page 92: Microarrays, Expression, and Regulatory Networks

Biclustering Formulation

4. Gene Expression Data Analysis

92

Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J

General idea: A bicluster is a “good” one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)

The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)

Page 93: Microarrays, Expression, and Regulatory Networks

Coherence of a Submatrix

4. Gene Expression Data Analysis

93

Page 94: Microarrays, Expression, and Regulatory Networks

Distribution of Biclusters

4. Gene Expression Data Analysis

94

Page 95: Microarrays, Expression, and Regulatory Networks

Bipartite Graph Model

4. Gene Expression Data Analysis

95

Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs

With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs

Page 96: Microarrays, Expression, and Regulatory Networks

Row, Column, Matrix Means

4. Gene Expression Data Analysis

96

Page 97: Microarrays, Expression, and Regulatory Networks

Objective Function

4. Gene Expression Data Analysis

97

Low-variance (constant) bicluster Ideal bicluster: Minimize bicluster variance

Low-rank (constant row, constant column, coherent values) bicluster Ideal constant row: Ideal constant column: General rank-one bicluster: Define residue for each value: Minimize mean squared residue
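As a sketch of the last objective, the code below assumes the standard Cheng and Church definition of the residue, r_ij = a_ij - a_iJ - a_Ij + a_IJ, where a_iJ, a_Ij and a_IJ are the row, column and bicluster means. This notation and the example matrix are assumptions made for illustration.

```python
def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix of A defined by rows x cols."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    total = 0.0
    for i in rows:
        for j in cols:
            residue = A[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (len(rows) * len(cols))

# Rows 0-2 differ only by an additive shift (coherent values);
# row 3 does not follow the pattern.
A = [
    [1, 2, 3],
    [4, 5, 6],
    [11, 12, 13],
    [9, 1, 5],
]
coherent = mean_squared_residue(A, rows=[0, 1, 2], cols=[0, 1, 2])
with_outlier = mean_squared_residue(A, rows=[0, 1, 2, 3], cols=[0, 1, 2])
```

A perfectly additive bicluster has zero residue; adding a row that breaks the pattern raises the score.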

Page 98: Microarrays, Expression, and Regulatory Networks

Missing Values

4. Gene Expression Data Analysis

98

Not all expression levels are available for each gene/sample pair

A solution is to replace missing values (random values, gene mean, sample mean, regression)

Generalize the definitions of row, column, and bicluster means to handle missing values implicitly

Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column

Page 99: Microarrays, Expression, and Regulatory Networks

Overlapping Biclusters

4. Gene Expression Data Analysis

99

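The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters

Plaid model: $a_{ij} = \sum_{k} \theta_{ijk}\, \rho_{ik}\, \kappa_{jk}$

$\theta_{ijk}$: contribution of bicluster k to the expression value of the ith gene in the jth sample

$\rho_{ik}$ and $\kappa_{jk}$ (generally binary) specify the membership of row i and column j in the kth bicluster, respectively

Minimize $\sum_{i,j} \left(a_{ij} - \sum_{k} \theta_{ijk}\, \rho_{ik}\, \kappa_{jk}\right)^2$

$\theta_{ijk}$ is defined to reflect “bicluster type”: $\mu_k$, $\mu_k + \alpha_{ik}$, $\mu_k + \beta_{jk}$, $\mu_k + \alpha_{ik} + \beta_{jk}$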

Page 100: Microarrays, Expression, and Regulatory Networks

Discrete Coherence

4. Gene Expression Data Analysis

100

A bicluster is defined to be one with a coherent ordering of the values on rows and/or columns (as compared to the values themselves)

Order-preserving submatrix (OPSM)

A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing

Gene expression motifs (xMOTIFs)

The expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions

An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples
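The OPSM definition can be checked directly: if a column ordering that makes every row strictly increasing exists, it must be the one that sorts the first row. A minimal sketch with a hypothetical function name and toy matrices:

```python
def is_order_preserving(submatrix):
    """True if some column ordering makes every row strictly increasing."""
    if not submatrix:
        return True
    first = submatrix[0]
    # The only candidate ordering is the one that sorts the first row.
    order = sorted(range(len(first)), key=lambda j: first[j])
    for row in submatrix:
        values = [row[j] for j in order]
        if any(a >= b for a, b in zip(values, values[1:])):
            return False
    return True

# All three rows rank the columns the same way (col 0 < col 2 < col 1),
# even though the values themselves are on very different scales.
opsm = [[1, 5, 3], [10, 40, 20], [0.2, 0.9, 0.5]]
not_opsm = [[1, 5, 3], [3, 2, 1]]
```

This only tests a given submatrix; finding a large order-preserving submatrix inside the full matrix is the hard part.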

Page 101: Microarrays, Expression, and Regulatory Networks

Binary Biclusters

4. Gene Expression Data Analysis

101

Quantize the gene expression matrix to binary values

SAMBA: A 1 corresponds to a significant change in the expression value

PROXIMUS: A 1 means that the gene is “expressed” in the corresponding sample

A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect

Bipartite graph model: Bicliques, heavy subgraphs

It is possible to statistically quantify the density of a submatrix

Log-likelihood:

p-value:

Page 102: Microarrays, Expression, and Regulatory Networks

Biclustering Algorithms

4. Gene Expression Data Analysis

102

Enumeration: Go for it!

Greedy algorithms: Make a locally optimal choice at every step

Divide and conquer: Solve the problem recursively

Alternating iterative heuristics: Fix one dimension, solve for the other, alternate iteratively

Model-based parameter estimation: e.g., EM algorithm

Page 103: Microarrays, Expression, and Regulatory Networks

Enumerating Biclusters

4. Gene Expression Data Analysis

103

m rows, n columns in the matrix

$2^m \times 2^n$ possible biclusters in total

Not doable in realistic amounts of time

Is it really necessary?

Put some restriction on the size of biclusters

SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph

Key assumption is sparsity: Nodes of the bipartite graph have bounded degree

Find K heavy bipartite subgraphs (biclusters) with bounded-degree enumeration

Refine them to optimize overlap, and add/remove nodes that improve bicluster quality

Page 104: Microarrays, Expression, and Regulatory Networks

Greedy Algorithms

4. Gene Expression Data Analysis

104

Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function

Generally, quite fast

How to choose initial biclusters?

How to jump over bad local optima? (Global awareness, hill-climbing)

Optimization function: mean-squared residue

Node deletion: Start with a large bicluster, keep removing genes/samples that contribute most to total residue

Node addition: Start with a small bicluster, keep adding genes/samples that contribute least to total residue

Repeat these alternately to improve global awareness
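The node-deletion step can be sketched as follows. It assumes the Cheng and Church style residue (value minus row mean minus column mean plus bicluster mean) and removes one row or column at a time. The threshold `delta`, the `min_size` guard and the matrix are hypothetical.

```python
def squared_residues(A, rows, cols):
    """Squared residue of every entry of the bicluster rows x cols."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    return {(i, j): (A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i in rows for j in cols}

def node_deletion(A, delta, min_size=2):
    """Start from the whole matrix and greedily drop the row or column that
    contributes most to the residue until the mean squared residue <= delta."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while True:
        sq = squared_residues(A, rows, cols)
        msr = sum(sq.values()) / (len(rows) * len(cols))
        if msr <= delta:
            break
        row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        can_drop_row = len(rows) > min_size
        can_drop_col = len(cols) > min_size
        if not (can_drop_row or can_drop_col):
            break  # cannot shrink any further
        if can_drop_row and (not can_drop_col
                             or row_score[worst_row] >= col_score[worst_col]):
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols, msr

# Rows 0-2 are shifted copies of one another; row 3 is incoherent.
A = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [5, 6, 7, 8],
    [9, 0, 7, 1],
]
rows, cols, msr = node_deletion(A, delta=0.01)
```

Being greedy, the procedure can stop in a local optimum, which is why alternating deletion with addition is suggested above.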

Page 105: Microarrays, Expression, and Regulatory Networks

Finding All Biclusters

4. Gene Expression Data Analysis

105

If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again

Masking discovered biclusters: Fill the bicluster with random values

First identify disjoint biclusters, then grow them to capture overlaps

Flexible Overlapped Biclustering (FLOC)

Generate K initial biclusters

Make decisions from the gene/sample perspective (as compared to the bicluster perspective): Choose the best (maximum gain) action for each gene

Page 106: Microarrays, Expression, and Regulatory Networks

Generalizing K-Means to Biclustering

4. Gene Expression Data Analysis

106

Assume K gene clusters, L sample clusters

Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and clusters of samples, and each pair of gene and sample clusters defines a bicluster

R: m × K gene clustering matrix, C: n × L sample clustering matrix

R(i,k) = 1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)

Minimize total residue:

Page 107: Microarrays, Expression, and Regulatory Networks

KL-Means Algorithm

4. Gene Expression Data Analysis

107

We can show that Batch iteration

Given R, compute (mxl matrix) serves as a prototype for column

clusters For each column, find the column of that is

closest to that column, update the corresponding entry of C accordingly

Once C is fixed, repeat the same for rows to compute R from

Converges to a local minimum of the objective function

Page 108: Microarrays, Expression, and Regulatory Networks

OPSM Algorithm

Recall that an order preserving submatrix (OPSM) is one such that all rows have their entries in the same order

Growing partial models

Fix the extremes first

The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order

Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements, and keep the best ones

Expand these to obtain (2,1) models, then (2,2), until we have (s/2, s/2) models, s being the number of columns in the target bicluster

4. Gene Expression Data Analysis

108

Page 109: Microarrays, Expression, and Regulatory Networks

Divide and Conquer Algorithms

Block clustering (a.k.a. direct clustering)

Recursive bipartitioning

Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized

Do the same for columns

Pick the row or column that results in minimum intra-cluster variances, split the matrix into two based on that row or column

Continue splitting recursively

One problem is that once two rows/columns go to different biclusters, they can never come together

Gap statistics: Find a large number of biclusters, then recombine

4. Gene Expression Data Analysis

109

Page 110: Microarrays, Expression, and Regulatory Networks

Binormalization

Normalize the matrix on both dimensions

Independent scaling of rows and columns

Here, R and C are diagonal matrices that contain row and column means, respectively

Bistochastization

Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant

Repeat independent scaling of rows and columns until stability is reached

The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean
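Bistochastization can be sketched as alternating row and column scaling. The version below assumes a matrix with strictly positive entries and fixes the row sums at 1, so the column sums become (number of rows)/(number of columns). The function name, tolerance and example are hypothetical.

```python
def bistochastize(A, iterations=1000, tol=1e-9):
    """Alternately rescale rows and columns of a strictly positive matrix
    until every row sums to 1 and every column sums to n_rows / n_cols."""
    B = [[float(x) for x in row] for row in A]
    n_rows, n_cols = len(B), len(B[0])
    col_target = n_rows / n_cols
    for _ in range(iterations):
        # Scale each row to sum to 1.
        for row in B:
            s = sum(row)
            for j in range(n_cols):
                row[j] /= s
        # Scale each column to sum to the column target.
        col_sums = [sum(B[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            for j in range(n_cols):
                B[i][j] *= col_target / col_sums[j]
        # Stop once the rows are still (almost) normalized after column scaling.
        if all(abs(sum(row) - 1.0) < tol for row in B):
            break
    return B

B = bistochastize([[1, 2, 3], [4, 5, 6]])
row_sums = [sum(row) for row in B]
col_sums = [sum(B[i][j] for i in range(2)) for j in range(3)]
```

Expression matrices with zeros or negative (log-ratio) values would need a transformation first; that step is outside this sketch.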

4. Gene Expression Data Analysis

110

Page 111: Microarrays, Expression, and Regulatory Networks

Spectral Biclustering

Singular value decomposition

The nonzero eigenvalues of the matrices $A^TA$ and $AA^T$ (say, $\sigma^2$) are the same

Each $\sigma$ is called a singular value of A, and the corresponding eigenvectors of $AA^T$ and $A^TA$ are called the left and right singular vectors

If $\sigma_1$ is the largest singular value of A, with $A^TAv_1 = \sigma_1^2 v_1$ and $AA^Tu_1 = \sigma_1^2 u_1$, then $\sigma_1u_1v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma uv^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all pairs of vectors with unit norm)

Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, and similar columns have similar values on v

Split the matrix based on u and v
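The leading singular triplet can be approximated with power iteration. This is a dependency-free sketch, assuming A is nonzero and the starting vector is not orthogonal to the top right singular vector. In practice a library SVD routine would be used; the block matrix is a toy example.

```python
import math

def top_singular_triplet(A, iterations=200):
    """Approximate (sigma_1, u_1, v_1) of A by alternating u = Av, v = A^T u."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

# Two blocks: rows 0-1 with columns 0-1 (strong), rows 2-3 with columns 2-3 (weak).
A = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
sigma, u, v = top_singular_triplet(A)
```

Here the entries of u and v are large on the strong block and near zero elsewhere, so thresholding them splits the rows and columns.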

4. Gene Expression Data Analysis

111

Page 112: Microarrays, Expression, and Regulatory Networks

6. Gene Regulatory Networks

Page 113: Microarrays, Expression, and Regulatory Networks

Regulation of Gene Expression

6. Gene Regulatory Networks

113

Transcriptional Regulation of telomerase protein component gene hTERT

Page 114: Microarrays, Expression, and Regulatory Networks

Genetic Regulation & Cellular Signaling

6. Gene Regulatory Networks

114

Page 115: Microarrays, Expression, and Regulatory Networks

Organization of Genetic Regulation

6. Gene Regulatory Networks

115

Figure legend: Gene, up-regulation, down-regulation, negative ligand-independent repression at chromatin level

Genetic network that controls flowering time in A. thaliana (Blazquez et al., EMBO Reports, 2001)

Page 116: Microarrays, Expression, and Regulatory Networks

Gene Regulatory Networks

Transcriptional regulatory networks

Nodes with outgoing edges are limited to transcription factors

Can be reconstructed by identifying regulatory motifs (through clustering of gene expression & sequence analysis) and finding transcription factors that bind to the corresponding promoters (through structural/sequence analysis)

6. Gene Regulatory Networks

116

Page 117: Microarrays, Expression, and Regulatory Networks

Gene Regulatory Networks

Gene expression networks

General model of genetic regulation

Identify the regulatory effects of genes on each other, independent of the underlying regulatory mechanism

Can be inferred from correlations in gene expression data, time-series gene expression data, and/or gene knock-out experiments

6. Gene Regulatory Networks

117

Observation Inference

Page 118: Microarrays, Expression, and Regulatory Networks

Boolean Network Model

6. Gene Regulatory Networks

118

Binary model: a gene has only two states

ON (1): The gene is expressed

OFF (0): The gene is not expressed

Each gene’s next state is determined by a boolean function of the current states of a subset of other genes

A boolean network is specified by two sets

Set of nodes (genes), where the state of each gene is 0 or 1

Collection of boolean functions, one per gene

Page 119: Microarrays, Expression, and Regulatory Networks

Logic Diagram

6. Gene Regulatory Networks

119

Cell cycle regulation

Retinoblastoma (Rb) inhibits DNA synthesis

Cyclin Dependent Kinase 2 (cdk2) & cyclin E inactivate Rb to release cell into S phase

Up-regulated by CAK complex and down-regulated by p21/WAF1

p53

Page 120: Microarrays, Expression, and Regulatory Networks

Wiring Diagram

6. Gene Regulatory Networks

120

Page 121: Microarrays, Expression, and Regulatory Networks

Dynamics of Boolean Networks

Gene activity profile (GAP)

Collection of the states of individual genes in the genome (network)

The number of possible GAPs is $2^n$

The system ultimately transitions into attractor states

Steady state (point) attractors

Dynamic attractors: state cycle

Each transient state is associated with an attractor (basins of attraction)

In practice, only a small number of GAPs correspond to attractors

What is the biological meaning of an attractor?
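For a small network, attractors and their basins can be found by following every state until it repeats. The sketch below assumes synchronous updates, and the three update rules are hypothetical.

```python
from itertools import product

def step(state, functions):
    """Synchronous update: every gene reads the current state."""
    return tuple(f(state) for f in functions)

def find_attractors(functions):
    """Enumerate all 2^n states; return the attractors (as lists of states)
    and a map from each state to the index of the attractor it flows into."""
    n = len(functions)
    basin, attractors = {}, []
    for start in product((0, 1), repeat=n):
        path, seen = [], {}
        s = start
        while s not in seen and s not in basin:
            seen[s] = len(path)
            path.append(s)
            s = step(s, functions)
        if s in basin:                  # ran into an already explored state
            idx = basin[s]
        else:                           # closed a new cycle
            attractors.append(path[seen[s]:])
            idx = len(attractors) - 1
        for t in path:
            basin[t] = idx
    return attractors, basin

# Hypothetical rules: x0' = x1, x1' = x0, x2' = x0 AND NOT x1.
functions = [
    lambda s: s[1],
    lambda s: s[0],
    lambda s: int(s[0] and not s[1]),
]
attractors, basin = find_attractors(functions)
```

This toy network has two point attractors and one state cycle of length two. Exhaustive enumeration is only feasible for small n, since the state space has $2^n$ states.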

6. Gene Regulatory Networks

121

Page 122: Microarrays, Expression, and Regulatory Networks

State Space of Boolean Networks

Equate cellular states with attractors

Attractor states are stable under small perturbations

Most perturbations cause the network to flow back to the attractor

Some genes are more important, and changing their activation can cause the system to transition to a different attractor

6. Gene Regulatory Networks

122

This slide is taken from the presentation by I. Shmulevich

Page 123: Microarrays, Expression, and Regulatory Networks

Identification of Boolean Networks

We have the “truth table” available

Binarize time-series gene expression data

REVEAL

Use mutual information to derive logical rules that determine each variable

If the mutual information between a set of variables and the target variable is equal to the entropy of the target variable, then that set of variables completely determines the target variable

For each variable, consider functions consisting of 1 variable, then 2, then 3, …, then i, …, until one is found

Once the minimum set of variables that determine a variable is found, we can infer the function from the truth table

In general, the in-degrees of genes in the network are small
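The mutual-information test can be sketched as follows. This is a simplified illustration in the spirit of REVEAL, not the published implementation. It assumes a complete, noise-free list of (current state, next state) pairs, and the three-gene rules that generate the data are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(samples):
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def find_inputs(transitions, target, max_inputs=3, eps=1e-9):
    """Smallest gene set whose mutual information with the next state of
    `target` equals the entropy of that next state, plus its truth table.
    Note: a target with constant output is trivially 'explained' by any gene."""
    n = len(transitions[0][0])
    outputs = [nxt[target] for _, nxt in transitions]
    h_out = entropy(outputs)
    for size in range(1, max_inputs + 1):
        for genes in combinations(range(n), size):
            inputs = [tuple(cur[g] for g in genes) for cur, _ in transitions]
            joint = list(zip(inputs, outputs))
            mi = entropy(inputs) + h_out - entropy(joint)  # I(X;Y)
            if abs(mi - h_out) < eps:
                return genes, dict(joint)
    return None, None

# Hypothetical rules used to generate the table:
# x0' = x1, x1' = x0, x2' = x0 AND NOT x1.
def rule(s):
    return (s[1], s[0], int(s[0] and not s[1]))

transitions = [(s, rule(s)) for s in product((0, 1), repeat=3)]
genes0, table0 = find_inputs(transitions, target=0)
genes2, table2 = find_inputs(transitions, target=2)
```

Because candidate sets are tried in order of size, the search stays cheap as long as in-degrees are small.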

6. Gene Regulatory Networks

123

Page 124: Microarrays, Expression, and Regulatory Networks

REVEAL

6. Gene Regulatory Networks

124

Page 125: Microarrays, Expression, and Regulatory Networks

Limitations of Boolean Networks

The effect of intermediate gene expression levels is ignored

It is assumed that the transitions between states are synchronous

A model incorporates only a partial description of a physical system

Noise

Effects of other factors

One may wish to model an open system

A particular external condition may alter the parameters of the system

Boolean networks are inherently deterministic

6. Gene Regulatory Networks

125

Page 126: Microarrays, Expression, and Regulatory Networks

Probabilistic Models

Stochasticity can account for

Noise

Variability in the biological system

Aspects of the system that are not captured by the model

Random variables include

Observed attributes: Expression level of a particular gene in a particular sample

Hidden attributes: The boolean function assigned to a gene?

6. Gene Regulatory Networks

126

Page 127: Microarrays, Expression, and Regulatory Networks

Probabilistic Boolean Networks

Each gene is associated with multiple boolean functions

Each function is associated with a probability

Can characterize the stochastic behavior of the system

6. Gene Regulatory Networks

127

Page 128: Microarrays, Expression, and Regulatory Networks

Bayesian Networks

A Bayesian network is a representation of a joint probability distribution

A Bayesian network B = (G, Θ) is specified by two components

A directed acyclic graph G, in which directed edges represent the conditional dependence between expression levels of genes (represented by nodes of the graph)

A function Θ that specifies the conditional distribution of the expression level of each gene, given the expression levels of its parents

Gene A is gene B’s parent if there is a directed edge from A to B

P(B | Pa(B)) = Θ(B, Pa(B))

6. Gene Regulatory Networks

128

Page 129: Microarrays, Expression, and Regulatory Networks

Conditional Independence

In a Bayesian network, if there is no directed edge between two genes, then these genes are conditionally independent given a suitable set of other genes (each gene is independent of its non-descendants given its parents)

The probability of observing a cellular state (configuration of expression levels) can be decomposed into product form
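The product form referred to here is, in its standard statement (with $X_1, \dots, X_n$ denoting the expression levels of the n genes, a notation assumed for this sketch):

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)
```

For example, for a hypothetical chain A → B → C, this gives P(A, B, C) = P(A) P(B | A) P(C | B), so C is independent of A once B is known.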

6. Gene Regulatory Networks

129

Page 130: Microarrays, Expression, and Regulatory Networks

Variables in Bayesian Network

Discrete variables

Again, genes’ expression levels are modeled as ON and OFF (or more discrete levels)

If a gene has k parents in the network, then the conditional distribution is characterized by $r^k$ parameters (r is the number of discrete levels)

Continuous variables

Real-valued expression levels

We have to specify multivariate continuous distribution functions

Hybrid networks

6. Gene Regulatory Networks

130

Page 131: Microarrays, Expression, and Regulatory Networks

Equivalence Classes of Bayesian Nets

Observe that each network structure implies a set of independence assumptions

More than one graph can imply exactly the same set of independencies (e.g., X->Y and Y->X)

Such graphs are said to be equivalent

By looking at observations of a distribution, we cannot distinguish between equivalent graphs

An equivalence class can be uniquely represented by a partially directed graph (some edges are undirected)

6. Gene Regulatory Networks

131

Page 132: Microarrays, Expression, and Regulatory Networks

Learning Bayesian Networks

Given a training set D = {x1, x2, …, xm} of m independent instances of the n random variables, find an equivalence class of networks B = (G, Θ) that best matches D

x’s are the gene expression profiles

Based on Bayes’ formula, the posterior probability of a network given the data can be evaluated as

$$S(G : D) = \log P(G \mid D) = \log P(D \mid G) + \log P(G) + C$$

where C is a constant (independent of G) and

$$P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$

is the marginal likelihood that averages the probability of the data over all possible parameter assignments to G

6. Gene Regulatory Networks

132

Page 133: Microarrays, Expression, and Regulatory Networks

Learning Algorithms

The Bayes score S(G : D) depends on the particular choice of priors P(G) and P(Θ | G)

The priors can be chosen to be

structure equivalent, so that equivalent networks will have the same score

decomposable, so that the score can be represented as the superposition of contributions of each gene

The problem becomes finding the optimal structure (G)

We can estimate the gain associated with the addition, removal, and reversal of an edge

Then, we can use greedy-like heuristics (e.g., hill climbing)

6. Gene Regulatory Networks

133

Page 134: Microarrays, Expression, and Regulatory Networks

Causal Patterns

Bayesian networks model dependencies between multiple measurements

How about the mechanism that generated these measurements?

Causal network model: Flow of causality

Model not only the distribution of observations, but also the effect of interventions

If gene X codes for a transcription factor of gene Y, manipulating X will affect Y, but not vice versa

But in Bayesian networks, X->Y and Y->X are equivalent

Intervention experiments (as compared to passive observation): Knock X out, then measure Y

6. Gene Regulatory Networks

134

Page 135: Microarrays, Expression, and Regulatory Networks

Dynamic Bayesian Networks

Dependencies do not uncover temporal relationships

Gene expression varies over time

Dynamic Bayesian Networks model the dependency between a gene’s expression level at time t and the expression levels of its parent genes at time t-1

6. Gene Regulatory Networks

135

Page 136: Microarrays, Expression, and Regulatory Networks

Topology of Biological Networks

Page 137: Microarrays, Expression, and Regulatory Networks

Topological Characteristics of Networks

Local characteristics

Subgraphs, motifs

Clustering

Global characteristics

Degree distribution

Reachability

Hierarchy, assortativity

Topology & Function

Robustness: Degree distribution, reachability, hierarchy

Modularity: Motifs, clustering, hierarchy

Dynamics: Do general topological properties determine behavior?

8. Topology of Biological Networks

137

Page 138: Microarrays, Expression, and Regulatory Networks

Real-World Networks

Biological networks at different scales: Population, tissue, cell

Cellular networks: Metabolic pathways, transcriptional networks, protein-protein interactions

Other networks: Internet, social networks (Erdős number, Kevin Bacon network, friendship), electronic circuits, parallel computers

8. Topology of Biological Networks

138

Page 139: Microarrays, Expression, and Regulatory Networks

Understanding Networks

Are there commonalities in the topological characteristics of different networks?

Turns out to be yes

Do these have anything to do with the function, origin, and growth of these networks?

How about differences?

8. Topology of Biological Networks

139

Internet vs. S. cerevisiae PPI network

Page 140: Microarrays, Expression, and Regulatory Networks

Graphs vs. Networks

A network is a “functional” structure, in which nodes and links are “active”

Information flow

Underlying dynamics

A graph is an abstraction of a network (or, in general, pairwise relationships between entities)

The two terms are commonly used interchangeably

8. Topology of Biological Networks

140

Page 141: Microarrays, Expression, and Regulatory Networks

Modeling Networks

Graph representation of metabolic pathways (a)

Edges may represent substrate-product relationships between metabolites (b), or producer-consumer relationships between enzymes

We can drop “common” metabolites (c)

8. Topology of Biological Networks

141

Page 142: Microarrays, Expression, and Regulatory Networks

Directed vs. Undirected Graphs

An edge (or link) is directed if there is a specified directionality (such as cause and effect) in the relationship between the two objects represented by the nodes

Metabolic pathways are directed, because many reactions are irreversible

Protein-protein interactions are generally undirected, because in most cases all we know is that they bind to each other

The semantics of topological properties (and/or motifs) may be different for directed and undirected graphs

Example: Cycle

8. Topology of Biological Networks

142

Page 143: Microarrays, Expression, and Regulatory Networks

Connectivity

Degree of a node in the network

How many links does a node have to other nodes?

Social networks: How “social” or “active” is a person?

Protein-protein interactions: How “sticky” or “functional” is a protein?

Directed graphs

In-degree and out-degree

Internet: How popular is a website?

Metabolic pathways: How many reactions use a metabolite as substrate?

8. Topology of Biological Networks

143

Page 144: Microarrays, Expression, and Regulatory Networks

Degree Distribution

Define P(k) as the probability (relative frequency) that a selected node has exactly k links

N(k) = Number of nodes with degree k

P(k) = N(k)/N, where N is the total number of nodes

Degree is a local property, degree distribution is a global property
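A minimal sketch of P(k) = N(k)/N for an undirected graph given as an edge list. It assumes every node appears in at least one edge, so isolated nodes are not counted. The node names are made up.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    counts = Counter(degree.values())          # N(k)
    return {k: counts[k] / n_nodes for k in sorted(counts)}

# Hypothetical toy network: A has degree 3, D has degree 2, B, C, E degree 1.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
P = degree_distribution(edges)
```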

8. Topology of Biological Networks

144

Average degree distribution of metabolic networks of 43 organisms

Page 145: Microarrays, Expression, and Regulatory Networks

Reachability

Path: Sequence of nodes linked to each other that connects two specified nodes

Shortest path: The path between two nodes that contains the minimum number of edges (length)

Quantifies the reachability between two molecules; the length of the shortest path is a.k.a. distance

Network’s overall navigability

Diameter: Maximum distance in a network

Mean path length: Average distance in a network
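Distances in an unweighted graph can be computed with breadth-first search. The sketch below assumes a connected, undirected graph stored as an adjacency dictionary. The node names are hypothetical.

```python
from collections import deque

def distances_from(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_mean_path_length(adj):
    """Maximum and average distance over all ordered pairs of distinct nodes."""
    lengths = []
    for s in adj:
        for t, d in distances_from(adj, s).items():
            if t != s:
                lengths.append(d)
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical linear pathway A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
diameter, mean_length = diameter_and_mean_path_length(adj)
```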

8. Topology of Biological Networks

145

Page 146: Microarrays, Expression, and Regulatory Networks

Small World Effect

First identified on social networks

People were sent letters and asked to forward the letter to a specified target person if they knew that person personally; if not, they were supposed to send it to a friend who would be more likely to know the target

Result: “Six degrees of separation” (average)

Most natural networks demonstrate the small world phenomenon

Neural networks, WWW

Metabolism

Paths of three or four reactions can link most metabolite pairs

Local perturbations in metabolite concentrations can reach the whole network very quickly

Page 147: Microarrays, Expression, and Regulatory Networks

Clustering

A network is clustered if we can say that: if A and B are connected and B and C are connected, then it is likely that A and C are connected

Clustering coefficient: the fraction of observed triangles among all possible triangles around a node

$C_I = \frac{2 n_I}{k(k-1)}$, where k is the degree of node I, and $n_I$ is the number of pairs of neighbors of I that are connected to each other

Distribution of clustering coefficients

The function C(k): average clustering coefficient of nodes with degree k

Diameter and average degree depend on the total number of nodes, but P(k) and C(k) do not
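The clustering coefficient can be sketched as the number of connected neighbor pairs divided by the k(k-1)/2 possible pairs. The convention of returning 0 for nodes with fewer than two neighbors, the function names, and the toy network are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no possible triangles around this node
    connected_pairs = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * connected_pairs / (k * (k - 1))

def clustering_by_degree(adj):
    """C(k): average clustering coefficient of the nodes that have degree k."""
    by_degree = defaultdict(list)
    for node in adj:
        by_degree[len(adj[node])].append(clustering_coefficient(adj, node))
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_degree.items())}

# Hypothetical toy network: triangle A-B-C plus a pendant node D attached to A.
toy = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
c_of_k = clustering_by_degree(toy)
```

In this toy network only one of the three neighbor pairs of A is connected, so its coefficient is 1/3, while B and C each sit in a complete triangle.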

Page 148: Microarrays, Expression, and Regulatory Networks

Random Graphs

Known as Erdős-Rényi graphs

Mark N nodes, draw an edge between any pair of nodes with fixed probability p

Mean path length is proportional to log(N)

Degree distribution peaks around the average degree; clustering coefficient does not depend on degree
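The construction described above can be sketched directly: visit every pair of nodes once and keep the edge with probability p. The function name `erdos_renyi`, the seed, and the parameter values are illustrative assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Undirected random graph: each of the n(n-1)/2 node pairs is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical parameters; the expected degree of every node is p * (n - 1) = 9.95.
random_graph = erdos_renyi(200, 0.05, seed=0)
mean_degree = sum(len(nb) for nb in random_graph.values()) / len(random_graph)
```

Because every pair is treated identically, node degrees concentrate around the average, which is why the degree distribution of such a graph has a single peak.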

Random network, degree distribution, clustering coefficient distribution

Page 149: Microarrays, Expression, and Regulatory Networks

Scale-Free Networks

The degree distribution follows a power law

$P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent

In other words, the number of nodes with degree k is inversely proportional to a power of k

Many low-degree nodes, a few hubs

Mean path length is proportional to log(log(N))

Terminology: “scale-free” refers to the absence of a typical node in the network

Scale-free network, degree distribution, clustering coefficient distribution

Page 150: Microarrays, Expression, and Regulatory Networks

Scale-Free Networks in Nature

Degree exponents: metabolic network 2.26, actor collaboration 2.3, WWW 2.1, power grid 4

Page 151: Microarrays, Expression, and Regulatory Networks

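Mathematical Model for Power Law

There are y nodes of degree x, where $\log y = \alpha - \gamma \log x$, i.e., $y = e^{\alpha} / x^{\gamma}$, with $\gamma$ the degree exponent

The maximum degree in the graph is $e^{\alpha/\gamma}$

The number of vertices, n, is: $n = \sum_{x=1}^{e^{\alpha/\gamma}} \frac{e^{\alpha}}{x^{\gamma}} \approx \zeta(\gamma)\, e^{\alpha}$, where $\zeta(\gamma) = \sum_{x=1}^{\infty} x^{-\gamma}$ is the Riemann zeta function

The number of edges, E, is: $E = \frac{1}{2} \sum_{x=1}^{e^{\alpha/\gamma}} x \cdot \frac{e^{\alpha}}{x^{\gamma}} \approx \frac{1}{2}\, \zeta(\gamma - 1)\, e^{\alpha}$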

Page 152: Microarrays, Expression, and Regulatory Networks

Role of Degree Exponent

The smaller the value of $\gamma$, the more important the hubs are

If $\gamma > 3$, then the hubs are not relevant

For $2 < \gamma < 3$, there is a hierarchy of hubs, with the most connected hub being in contact with a small fraction of all nodes

For $\gamma = 2$, a star-like network emerges, with the largest hub being in contact with a large fraction of nodes

Scale-free networks are generally interesting for $2 < \gamma < 3$

Unusual properties emerge for this regime

This is the range that is observed in most biological (as well as non-biological) networks

Page 153: Microarrays, Expression, and Regulatory Networks

Hierarchical Networks

“General” scale-free networks still do not capture one observed property of cellular networks

Hierarchical networks are clusters of clusters of clusters of… connected through local hubs, less local hubs, …, global hubs

Figure: Average clustering coefficient distribution for the metabolic networks of 43 organisms, $C(k) \sim 1/k$

Page 154: Microarrays, Expression, and Regulatory Networks

Hubs in Cellular Networks

PPI networks: kinases form the core of the network

Genetic regulation

Most transcription factors regulate a few genes; a few general transcription factors interact with many genes

Recall that the gene expression matrix of the yeast cell cycle contains a few strong principal components

However, the incoming degree distribution is better approximated by an exponential function

Most genes are regulated by only one to three transcription factors

Page 155: Microarrays, Expression, and Regulatory Networks

Growth Models

How do these networks gain these properties?

Preferential attachment

At each time point, a node is added to the network and connected to an existing node with probability proportional to the current degree of that node

1st order:

2nd order:

This growth model generates scale-free networks with degree exponent 3
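Preferential attachment can be simulated with a list in which every node appears once per incident edge, so that a uniform draw from the list is a degree-proportional draw. This is a sketch under assumptions: the parameter `m` (links per new node), the starting clique, and the function name are illustrative choices, not taken from the slides.

```python
import random

def preferential_attachment(n, m=2, seed=None):
    """Grow a graph to n nodes; each new node links to m distinct existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed starting point: a clique of m + 1 nodes, so every node has degree > 0.
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}
    # Each node appears in `stubs` once per incident edge.
    stubs = [v for v in adj for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))  # degree-proportional choice
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj

grown = preferential_attachment(500, m=2, seed=0)
grown_degrees = [len(nb) for nb in grown.values()]
```

In a run like this, a handful of early nodes typically end up with degrees far above the mean, which is the hub structure the growth model is meant to explain.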

Page 156: Microarrays, Expression, and Regulatory Networks

Duplication/Divergence

Gene duplications are considered one of the driving forces of molecular evolution

When a gene is duplicated, the corresponding protein has two copies, so an additional node with the same neighbors is added to the network

Proteins with already high degree are likely to have their neighbors duplicated => preferential attachment
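One duplication step can be sketched as follows. If the duplicated node is chosen uniformly at random, a protein with k neighbors gains a link with probability k/N, which is where the preferential attachment effect comes from. The `retain` parameter (probability that the copy keeps each interaction, a simple stand-in for divergence) and the integer node labels are assumptions made for this example.

```python
import random

def duplicate_node(adj, rng, retain=1.0):
    """Duplicate a uniformly chosen node; the copy keeps each of the original's
    neighbors with probability `retain` (1.0 = exact copy, no divergence)."""
    original = rng.choice(sorted(adj))
    copy = max(adj) + 1  # assumes integer node labels
    adj[copy] = set()
    for neighbor in sorted(adj[original]):
        if rng.random() < retain:
            adj[copy].add(neighbor)
            adj[neighbor].add(copy)
    return original, copy

rng = random.Random(1)
# Hypothetical toy PPI network with integer-labeled proteins.
net = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
degree_before = {v: len(nb) for v, nb in net.items()}
original, copy = duplicate_node(net, rng)
```

With `retain=1.0` the new node has exactly the neighbors of the original, and each of those neighbors gains one link.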

Page 157: Microarrays, Expression, and Regulatory Networks

Network Evolution & Topology

The scale-free model predicts that the nodes that appeared early in the history of the network are the most connected nodes

Remnants of the RNA world, such as coenzyme A, NAD, and GTP, are among the most connected substrates in the metabolic network

Elements of the most ancient metabolic pathways, such as glycolysis and the tricarboxylic acid cycle

In PPI networks, cross-genome comparisons indicate that, on average, there is a positive correlation between the evolutionary age of a protein and the number of links it has

Page 158: Microarrays, Expression, and Regulatory Networks

Assortativity

Social networks are generally assortative

People who know many people also know each other

Productive authors do write papers together

Most cellular networks are observed to be disassortative

In general, hubs avoid linking directly to each other

Metabolic pathways, PPI networks, as well as WWW

Function of disassortativity? Selective value of disassortativity? Evolution of disassortativity? Do existing models generate disassortativity?
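One common way to quantify assortativity, used here as an assumption since the slides do not name a measure, is the Pearson correlation between the degrees found at the two ends of each edge. Positive values indicate an assortative network and negative values a disassortative one. The function name and toy networks are illustrative.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var > 0 else float("nan")  # undefined if all degrees are equal

# Hypothetical toy networks: a star (one hub, five leaves) and a four-node path.
star_edges = [("hub", leaf) for leaf in "abcde"]
path_edges = [(0, 1), (1, 2), (2, 3)]
star_r = degree_assortativity(star_edges)
path_r = degree_assortativity(path_edges)
```

The star is the extreme disassortative case: every edge joins the highest-degree node to a lowest-degree node.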

Page 159: Microarrays, Expression, and Regulatory Networks

Modules & Clustering

Modularity: groups of physically or functionally linked molecules that work together to perform a (relatively) distinct function

Friend groups in social networks, labs in co-authorship networks

Protein complexes

Temporally co-regulated groups of genes

High clustering in cellular networks

Average clustering coefficient is independent of network size for metabolic pathways

For an arbitrary scale-free network, average clustering coefficient decreases with network size

PPI and DDI networks also have high clustering coefficients

Page 160: Microarrays, Expression, and Regulatory Networks

Subgraphs as Elementary Units

Subgraphs capture specific patterns of interconnections that characterize a given network at the local level

Not all subgraphs are equally significant

The abundance of squares and the absence of triangles can tell us something fundamental about the architecture of the square lattice

Figure labels (square lattice): triangles “Does not exist at all!”, squares “Abundant!”

Page 161: Microarrays, Expression, and Regulatory Networks

Network Motifs

A network motif is a subgraph (in topological terms, i.e., ignoring the identity of nodes) that occurs much more frequently in the network of interest than in a random network that has similar global properties
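To make the definition concrete, the sketch below counts one candidate motif, the feed-forward loop (arcs x->y, y->z, and x->z), in a directed network and compares the observed count with counts from randomized networks. The choice of the feed-forward loop, the z-score as the comparison statistic, and all names and numbers are assumptions for illustration; the slides do not prescribe them.

```python
def count_feed_forward_loops(arcs):
    """Count ordered node triples (x, y, z) with arcs x->y, y->z and x->z."""
    successors = {}
    for source, target in arcs:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())
    count = 0
    for x in successors:
        for y in successors[x]:
            if y == x:
                continue
            for z in successors[y]:
                if z != x and z != y and z in successors[x]:
                    count += 1
    return count

def z_score(observed, random_counts):
    """How many standard deviations the observed count lies above the random mean."""
    mean = sum(random_counts) / len(random_counts)
    var = sum((c - mean) ** 2 for c in random_counts) / len(random_counts)
    return (observed - mean) / var ** 0.5 if var > 0 else float("nan")

# Hypothetical toy regulatory networks.
ffl = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
cycle = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
example_z = z_score(5, [1, 2, 3])  # made-up counts, only to show the call
```

The random counts would come from an ensemble of randomized networks with similar global properties, such as the same degree distribution.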

Page 162: Microarrays, Expression, and Regulatory Networks

Generating Random Graphs

Random graphs are used to assess the statistical significance of the frequency of a motif

As degree distribution is the key characteristic of scale-free networks, generally the graphs are randomized to preserve degree distribution

Simulation: edge switching algorithm
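A minimal sketch of edge switching for an undirected simple graph: pick two edges (a, b) and (c, d), and rewire them to (a, d) and (c, b) unless that would create a self-loop or a duplicate edge. Every node keeps its degree. The attempt limit, the random orientation of the second edge, and the function name are implementation assumptions.

```python
import random

def edge_switch(edges, n_swaps, seed=None):
    """Randomize an undirected simple graph while preserving every node's degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:  # assumed cap to guarantee termination
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both possible rewirings
        if a == d or c == b:
            continue  # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Hypothetical toy network: a ring of ten nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
shuffled = edge_switch(ring, n_swaps=20, seed=0)
```

Counting a motif in many such shuffled copies gives the random baseline against which the observed count is judged.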

Page 163: Microarrays, Expression, and Regulatory Networks

Analytical Methods

Arbitrary degree distribution

The probability of existence of an edge between u and v is defined as $P(u, v) = \frac{d_u d_v}{\sum_{w} d_w}$, where $d_u$ and $d_v$ are specified “expected degrees” of u and v

Observe that $E[D_u] = d_u$, where $D_u$ is the corresponding random variable

However, in order for P to be a well-defined probability function, we must have $d_{\max}^2 \le \sum_{w} d_w$, where $d_{\max} = \max_u d_u$

In general, this is not the case for PPI networks

These models are generally useful for multigraphs rather than simple graphs, because of dependencies
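The following sketch assumes the edge probability has the standard expected-degree form, P(u, v) = d_u d_v / (sum of all expected degrees), with self-pairs included so that the expected degree of u comes out to exactly d_u. That form, the function names, and the degree tables are assumptions for illustration.

```python
def edge_probability(expected_degree, u, v):
    """Assumed form: P(u, v) = d_u * d_v / sum_w d_w."""
    return expected_degree[u] * expected_degree[v] / sum(expected_degree.values())

def is_well_defined(expected_degree):
    """P is a probability for every pair only if (max degree)^2 <= sum of degrees."""
    return max(expected_degree.values()) ** 2 <= sum(expected_degree.values())

# Hypothetical expected degrees: one hub-like node versus a regular network.
hub_like = {"A": 3, "B": 2, "C": 2, "D": 1}
regular = {"A": 2, "B": 2, "C": 2, "D": 2}

# Summing P(B, v) over all v (including v = B) recovers the expected degree of B.
expected_b = sum(edge_probability(hub_like, "B", v) for v in hub_like)
```

In `hub_like` the largest expected degree is 3 while the degrees sum to 8, so 9/8 exceeds 1 and the condition fails. Networks with strong hubs run into the same problem at scale.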

Page 164: Microarrays, Expression, and Regulatory Networks

Common Network Motifs

Page 165: Microarrays, Expression, and Regulatory Networks

PPI and Transcription

Integrate protein-protein interactions and transcriptional regulation

Motifs might reveal how these two types of interaction work together for regulation of cellular processes

Possible interaction patterns between a pair of proteins

Red directed arrows represent transcriptional regulation

Black bidirectional arrows represent protein-protein interaction

Page 166: Microarrays, Expression, and Regulatory Networks

Motifs in Integrated TRI-PPI Network

Page 167: Microarrays, Expression, and Regulatory Networks

Conservation of Motifs

Is there a “selective value” of motifs?

If motifs are conserved, then one might expect that proteins that are parts of motifs will also be conserved

Page 168: Microarrays, Expression, and Regulatory Networks

Motif Constituents Conserved Together

Page 169: Microarrays, Expression, and Regulatory Networks

Motif Clusters

Motifs generally tend to form clusters: hierarchical modularity

On the left, 209 bi-fan motifs in the E. coli TRN are shown together; shared edges are in blue, others in red

Page 170: Microarrays, Expression, and Regulatory Networks

Topological Robustness

Scale-free networks are robust to random attacks

It is not easy to disconnect the network via random node deletions

In random (Erdős-Rényi) graphs, the network falls apart when the number of accidental node failures reaches a certain threshold

Scale-free networks do not have such a threshold: even if 80% of randomly selected nodes fail, the remaining 20% are still connected

Attack vulnerability: dependence on hubs

If a key hub fails, the network turns into a collection of small isolated node clusters
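The contrast between random failure and targeted attack can be explored with a small simulation that removes a set of nodes and measures the largest remaining connected component. This is a sketch: the function names, the removal fractions, and the star-shaped toy network are assumptions chosen to make the hub dependence obvious.

```python
import random

def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting the `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def random_failure(adj, fraction, seed=None):
    """Nodes lost to accidental failure: a uniformly random subset."""
    rng = random.Random(seed)
    return frozenset(rng.sample(sorted(adj), int(fraction * len(adj))))

def hub_attack(adj, fraction):
    """Nodes lost to a targeted attack: the highest-degree nodes first."""
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return frozenset(by_degree[: int(fraction * len(adj))])

# Hypothetical toy network: one hub connected to nine leaves.
star = {"hub": {f"leaf{i}" for i in range(9)}}
star.update({f"leaf{i}": {"hub"} for i in range(9)})
after_attack = largest_component(star, hub_attack(star, 0.1))
after_leaf_loss = largest_component(star, {"leaf0"})
```

Losing one leaf barely changes the network, while losing the hub leaves only isolated nodes.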

Page 171: Microarrays, Expression, and Regulatory Networks

Robustness, Lethality, & Redundancy

Lethal proteins

Only about 10% of nodes with fewer than 5 interactions are essential

This rate is 60% for proteins with more than 15 interactions

Redundancy: only 18.7% of S. cerevisiae proteins are lethal when deleted individually

Evolution of robustness

Highly connected yeast genes have a smaller evolutionary distance to their orthologs in C. elegans

The structure of important proteins is subject to more selective pressure

Page 172: Microarrays, Expression, and Regulatory Networks

Functional and Dynamical Robustness

Nodes have different biological functions

Network topology is not the sole indicator of lethality

Experimentally identified protein complexes tend to be composed of uniformly essential or non-essential proteins

Dispensability of the whole complex determines the importance of its subunits
