Predicting gene function from gene expression trends, protein features and cis-regulatory information – A rough set modeling approach

T. R. Hvidsten
Assistant professor in Bioinformatics
Umeå Plant Science Centre, Umeå University, Sweden


Torgeir R. Hvidsten 2008.12.15 2

Hybridization


Microarray


Regulatory logics

•  Massive readouts of cell content in terms of RNA molecules (transcriptomics), proteins (proteomics) and the products of metabolic processes (metabolomics) can be explained by the regulatory logics hard-wired in the DNA sequence

•  Regulation is organized in modules of genes often participating in the same biological process

Gene regulation

[Figure: gene structure in yeast, showing the regulatory region (enhancer, silencer, response elements, promoter) with its binding sites, and the transcription region.]

•  Gene expression is regulated by regulatory proteins (transcription factors)

•  Transcription factors depend on recognizing sequence motifs (binding sites) in order to affect the expression of genes

•  Transcription factors combine to respond to a large number of stress factors (e.g. heat shock) with a large number of expression outcomes




Yeast regulatory modules

From: Qi Y, Ge H. Modularity and Dynamics of Cellular Networks. PLoS Computational Biology 2(12): e174, 2006.


The machine learning strategy …

… iteratively uses experiments to provide representative examples and computational models to provide experimentalists with new, testable hypotheses

•  Nearest neighbor predictors
   –  evolutionary link
   –  need few examples

•  Model inducers
   –  more powerful
   –  interpretable models

[Figure legend: Example: experimentally determined; Unknown]


Data representation

[Figure labels: Truth; Data representation 1; Data representation 2]


Predicting biological process from gene expression time profiles

Papers:

I.  T. R. Hvidsten, A. Lægreid and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using gene ontology, Bioinformatics 19(9): 1116-23, 2003.

II.  A. Lægreid, T. R. Hvidsten, H. Midelfart, J. Komorowski and A. K. Sandvik. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns, Genome Research, 13(5): 965-979, 2003.


Hierarchical clustering

Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999


Gene Ontology

Ordered controlled vocabulary organized in a taxonomy for describing the molecular role of gene products

•  Molecular function: the tasks performed by individual gene products

•  Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions

•  Cellular component: subcellular structures, locations, and macromolecular complexes


Biological processes from GO

Energy pathways; DNA metabolism; Amino acid and derivative metabolism; Protein targeting

Lipid metabolism; Transport; Ion homeostasis; Intracellular traffic

Cell death; Cell motility; Stress response; Organelle organization and biogenesis

Oncogenesis; Cell cycle; Cell adhesion; Cell surface receptor linked signal transduction

Intracellular signaling cascade; Developmental processes; Blood coagulation; Circulation


Gene Ontology vs. expression clustering


Methodology

1. Annotation

Gene   0HR    15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR   Process
g1     0.00   -0.47  -3.32  -0.81  0.11   -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2     0.00   0.66   0.07   0.20   0.29   -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3     0.00   0.14   -0.04  0.00   -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4     0.00   -0.04  0.00   -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5     0.00   0.28   0.37   0.11   -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...

[Figure: fragment of the Gene Ontology under "Process", with the nodes Transport, Defense response, Cell cycle control and Positive control of cell proliferation annotated with genes g2, g2, g3, and g4 and g5.]

2. Extracting features for learning

[Figure: expression profile plotted over 0 to 24 hours (expression axis from -2 to 1.5).]

3. Inducing minimal decision rules using rough sets

0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)

4. The function of uncharacterized genes is predicted using the rules
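Step 2 (extracting features for learning) can be illustrated with a small Python sketch. It is only one plausible reading of the slide: the function names `trend_label` and `extract_trends`, the change threshold of 0.3, the minimum window of three time points and the use of time-point indices instead of hours are all assumptions made for illustration, not the published procedure.

```python
from typing import Dict, List, Tuple


def trend_label(values: List[float], threshold: float = 0.3) -> str:
    """Label a window by the net change between its first and last value.

    The threshold is a hypothetical choice, not taken from the slides.
    """
    change = values[-1] - values[0]
    if change > threshold:
        return "Increasing"
    if change < -threshold:
        return "Decreasing"
    return "Constant"


def extract_trends(profile: List[float], min_len: int = 3,
                   threshold: float = 0.3) -> Dict[Tuple[int, int], str]:
    """Map every window of at least min_len consecutive time points to a label.

    Keys are (start_index, end_index) pairs into the profile.
    """
    features = {}
    n = len(profile)
    for start in range(n):
        for end in range(start + min_len - 1, n):
            features[(start, end)] = trend_label(profile[start:end + 1], threshold)
    return features


# Gene g4 from the annotation table on this slide (12 time points).
g4 = [0.00, -0.04, 0.00, -0.23, -0.25, -0.47,
      -0.60, -0.56, -1.09, -0.71, -0.76, -0.62]
features = extract_trends(g4)
print(features[(0, 2)], features[(2, 8)])
```

Each gene then becomes one row of a decision table whose columns are time windows and whose values are the discrete labels that rules such as the one in step 3 refer to.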


Rough set

•  A mathematical theory for viewing data in terms of sets of indiscernible objects (equivalence classes)

•  A rough set X
   –  lower approximation: $\underline{A}X$
   –  upper approximation: $\overline{A}X$
   –  $\overline{A}X - \underline{A}X \neq \emptyset$

•  A crisp set
   –  $\overline{A}X - \underline{A}X = \emptyset$
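The two approximations can be computed directly from the equivalence classes. The sketch below is a minimal illustration on a made-up decision table; the gene names, window labels and the helper names `equivalence_classes` and `approximations` are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Group objects that are indiscernible on the given attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return (lower, upper) approximation of the set `target`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(objects, attributes):
        if cls <= target:   # class lies entirely inside the target set
            lower |= cls
        if cls & target:    # class overlaps the target set
            upper |= cls
    return lower, upper


# Hypothetical genes described by two discretized time windows.
genes = {
    "g1": {"0-4": "Increasing", "6-10": "Decreasing"},
    "g2": {"0-4": "Increasing", "6-10": "Decreasing"},
    "g3": {"0-4": "Constant", "6-10": "Decreasing"},
    "g4": {"0-4": "Constant", "6-10": "Increasing"},
}
target = {"g1", "g3"}  # hypothetical: genes annotated with one process
lower, upper = approximations(genes, ["0-4", "6-10"], target)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Here g1 and g2 are indiscernible but only g1 is in the target set, so the boundary is non-empty and the set is rough with respect to these two attributes.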


Rule Induction

•  IF-part (antecedent, premise): the minimal set of discrete changes in expression needed to uphold the discriminatory power of the full data set

•  THEN-part (consequent): all functions of genes described by the premise-side

•  We want rules that describe the expression profiles of several genes with one or a few functions

–  accuracy: the fraction of genes matching the IF-part that are annotated with the process in the THEN-part

–  coverage: the fraction of genes annotated with the process in the THEN-part that matches the IF-part

IF 0 - 4(Constant) AND 0 - 10(Increasing)

THEN GO(prot. met. and mod.) OR GO(mesoderm develop.) OR GO(prot. biosynt.)
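The two quality measures can be written compactly. The notation is introduced here for illustration and does not appear on the slides: $G_{\mathrm{IF}}$ is the set of genes matching the IF-part of a rule, and $G_p$ is the set of genes annotated with a process $p$ from its THEN-part.

```latex
\[
\mathrm{accuracy}(p) = \frac{|G_{\mathrm{IF}} \cap G_p|}{|G_{\mathrm{IF}}|},
\qquad
\mathrm{coverage}(p) = \frac{|G_{\mathrm{IF}} \cap G_p|}{|G_p|}
\]
```

A rule with a THEN-part listing several processes has one accuracy and one coverage value per process.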


Genetic algorithm for reduct computation

•  Individuals in the population are subsets of time intervals: 00110 … (absent, absent, present, present, absent, …)

•  Fitness: A + B
   –  A: Fraction of genes with different function that can be discerned from the function of interest
   –  B: Number of time intervals that are absent

•  Typically, approximate solutions are obtained, i.e. A < 1.0
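A fitness function of this shape could look like the sketch below. Several details are assumptions: a gene with a different function counts as discerned when its labels on the selected intervals differ from those of every gene with the function of interest, B is normalized to a fraction, and B is down-weighted by a hypothetical `weight` parameter so that discernibility dominates. The data are invented.

```python
def fitness(mask, genes, labels, target, weight=0.1):
    """Score a subset of time intervals encoded as a 0/1 list.

    genes:  dict gene -> tuple of discrete labels, one per interval
    labels: dict gene -> annotated function
    target: the function of interest
    """
    selected = [i for i, bit in enumerate(mask) if bit]

    def project(gene):
        return tuple(genes[gene][i] for i in selected)

    positives = [g for g in genes if labels[g] == target]
    negatives = [g for g in genes if labels[g] != target]
    positive_patterns = {project(g) for g in positives}

    # A: fraction of genes with a different function that can be discerned.
    discerned = sum(1 for g in negatives if project(g) not in positive_patterns)
    a = discerned / len(negatives) if negatives else 1.0

    # B: absent intervals, normalized here to a fraction (an assumption).
    b = mask.count(0) / len(mask)
    return a + weight * b


genes = {
    "g1": ("Inc", "Dec", "Con"),
    "g2": ("Inc", "Inc", "Con"),
    "g3": ("Con", "Dec", "Con"),
    "g4": ("Inc", "Dec", "Inc"),
}
labels = {"g1": "P", "g2": "Q", "g3": "Q", "g4": "P"}

for mask in ([1, 1, 1], [1, 1, 0], [1, 0, 0]):
    print(mask, round(fitness(mask, genes, labels, "P"), 3))
```

In this toy table the mask `[1, 1, 0]` scores highest: it keeps full discernibility while dropping one interval, which is the trade-off a reduct search is after.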


Rule example

Rule:
0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Covered genes:
M35296 J02783 D13748 X05130
X60957 D13748

[Figure: expression profiles of the covered genes over 0 to 24 hours.]


Classification

[Figure: the expression profile of gene X60957 (0 to 24 hours) is matched against the full set of induced IF … THEN … rules. One of the matching rules is shown below; it contributes +4, +1 and +1 votes.]

IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Process                                Votes
protein metabolism and modification    6
mesoderm development                   3
proteolysis and peptidolysis           2
transcription                          1
protein biosynthesis                   1
vision                                 1
…

Votes are normalized and processes with vote fractions higher than a selection threshold are chosen as predictions
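The voting step can be sketched as follows. The rule representation (a dictionary of conditions plus a dictionary of votes per process), the function name `classify` and the second and third rules are hypothetical; the first rule and its +4, +1, +1 votes are taken from the slide, with the assignment of votes to processes assumed to follow the order of the THEN-part.

```python
from collections import Counter


def classify(profile_features, rules, threshold=0.2):
    """Sum the votes of all matching rules and keep processes whose
    normalized vote fraction is higher than the selection threshold."""
    votes = Counter()
    for conditions, process_votes in rules:
        if all(profile_features.get(k) == v for k, v in conditions.items()):
            votes.update(process_votes)
    total = sum(votes.values())
    if total == 0:
        return {}
    return {p: v / total for p, v in votes.items() if v / total > threshold}


rules = [
    # Rule from the slide.
    ({"0-4": "Constant", "0-10": "Increasing"},
     {"protein metabolism and modification": 4,
      "mesoderm development": 1,
      "protein biosynthesis": 1}),
    # Hypothetical additional rules.
    ({"6-10": "Increasing"},
     {"protein metabolism and modification": 2,
      "mesoderm development": 2}),
    ({"0-4": "Decreasing"},
     {"transcription": 3}),
]

# Hypothetical discretized profile of an unannotated gene.
unknown = {"0-4": "Constant", "0-10": "Increasing", "6-10": "Increasing"}
predictions = classify(unknown, rules, threshold=0.2)
print(predictions)
```

Raising the threshold keeps only the strongest process, while lowering it admits more predictions per gene.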


Threshold selection

[Figure: the test-set genes g1 to g13 placed along an axis showing the fraction of votes for "protein biosynthesis" (up to 1), each marked either as a gene with the function "protein biosynthesis" or as a gene with a different function. Two cut-offs are marked, Threshold 1 and Threshold 2; one yields sensitivity = 2/3 and specificity = 1, the other sensitivity = 1 and specificity = 2/3.]

sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)
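The trade-off between the two thresholds can be reproduced on a small invented test set. The vote fractions below are hypothetical and chosen only so that the two cut-offs give the same sensitivity and specificity values as quoted on the slide.

```python
def sensitivity_specificity(scores, positives, threshold):
    """Genes with a vote fraction higher than the threshold are predicted positive."""
    tp = sum(1 for g, s in scores.items() if s > threshold and g in positives)
    fn = sum(1 for g, s in scores.items() if s <= threshold and g in positives)
    fp = sum(1 for g, s in scores.items() if s > threshold and g not in positives)
    tn = sum(1 for g, s in scores.items() if s <= threshold and g not in positives)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical fractions of votes for "protein biosynthesis".
scores = {"g1": 0.9, "g2": 0.7, "g3": 0.5, "g4": 0.4, "g5": 0.2, "g6": 0.1}
positives = {"g1", "g2", "g4"}  # genes that truly have the function

strict = sensitivity_specificity(scores, positives, 0.6)
lenient = sensitivity_specificity(scores, positives, 0.3)
print(strict, lenient)
```

The strict threshold misses one true gene but makes no false alarms; the lenient one finds all true genes at the cost of one false alarm.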


ROC analysis and classifier evaluation

[Figure: ROC plot of sensitivity (y-axis, 0 to 1) against 1 – specificity, the false alarm rate (x-axis, 0 to 1), with reference curves for no discrimination and perfect discrimination; AUC is the area under the curve.]

•  ROC: The receiver operating characteristic curve results from plotting sensitivity against 1 – specificity for all possible thresholds

   –  sensitivity: TP/(TP+FN)
   –  specificity: TN/(TN+FP)

•  AUC: Area under the ROC curve
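The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive gene receives a higher score than a randomly chosen negative gene (ties counting one half). The sketch below uses that pairwise formulation; the scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical vote fractions for genes with and without the function.
with_function = [0.9, 0.7, 0.4]
without_function = [0.5, 0.2, 0.1]
area = auc(with_function, without_function)
print(round(area, 3))
```

An AUC of 0.5 corresponds to the no-discrimination line and 1.0 to perfect discrimination.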


ROC analysis and classifier evaluation

[Figure: ROC plot of sensitivity against 1 - specificity (both from 0 to 1) with three curves A, B and C, shown between the no-discrimination and perfect-discrimination references.]

•  Which ROC curve is better?

•  A dominates B and C and clearly has a higher AUC

•  B and C have approximately the same AUC

•  B is better for some thresholds, C for others


Cross validation estimates*

Threshold independent:

PROCESS                                 AUC    SE     P-VALUE
Ion homeostasis                         1.00   0.00   0.008
Protein targeting                       0.99   0.03   0.000
Blood coagulation                       0.96   0.08   0.000
DNA metabolism                          0.94   0.09   0.000
Intracellular signaling cascade         0.94   0.06   0.000
Cell cycle                              0.93   0.04   0.000
Energy pathways                         0.93   0.12   0.004
Oncogenesis                             0.92   0.11   0.000
Circulation                             0.91   0.11   0.001
Cell death                              0.90   0.10   0.000
Developmental processes                 0.90   0.07   0.000
Defense (immune) response               0.88   0.05   0.000
Transcription                           0.88   0.11   0.002
Cell adhesion                           0.87   0.09   0.002
Stress response                         0.86   0.15   0.002
Protein metabolism and modification     0.85   0.10   0.000
Cell motility                           0.84   0.11   0.000
Cell surface rec linked signal transd   0.82   0.15   0.005
Lipid metabolism                        0.81   0.14   0.000
Cell organization and biogenesis        0.79   0.11   0.000
Cell proliferation                      0.79   0.06   0.002
Transport                               0.79   0.17   0.001
Amino acid and derivative metabolism    0.69   0.06   0.288
AVERAGE                                 0.88   0.09

Selecting a selection threshold (over all classes):

Coverage/recall = TP/(TP+FN)
Precision = TP/(TP+FP)

Coverage: 84%   Precision: 50%
Coverage: 71%   Precision: 60%
Coverage: 39%   Precision: 90%

* V. R. Iyer et al. The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999.


Reclassifying annotated genes

12 of 24 "false positive" predictions for oncogenesis were "missing annotations" found through literature search

Symbol | Gene name | Molecular function | Comment | Reference (PMID)
CCNG1 | Cyclin G1 | CDK kinase regulator | p53 target | 11327114
CDKN1C | cyclin-dependent kinase inhibitor 1C | cyclin-dependent protein kinase inhibitor | tumor suppressor | 7729684
CAT | catalase | oxidoreductase | tumor progression | 8513880,
ALDH3A2 | aldehyde dehydrogenase 10 | aldehyde dehydrogenase | tumor progression | 92393980
ADD3 | adducin 3 (gamma) | membrane-cytoskeleton-associated protein | tumor progression | 9607561
TFDP2 | transcription factor Dp-2 (E2F dimerization partner 2) | transcription co-factor | cell cycle regulation | 7784053
ATRX | alpha thalassemia/mental retardation syndrome | DNA helicase | transcription & DNA repair | 10362365
EPS15 | epidermal growth factor receptor pathway substrate 15 | kinase substrate | growth regulation | 93361014
EGR1 | early growth response 1 | transcription factor | tumor suppressor | 9109500
NR4A2 | nuclear receptor subfam 4, group A, m2 (Nurr1, Not) | ligand-dependent nuclear receptor | proto-oncogene | 9592180
NR4A3 | nuclear receptor subfam 4, group A, m 3 (Nor1) | ligand-dependent nuclear receptor | proto-oncogene | 9592180
COPEB | core promoter element binding protein | transcription factor | proto-oncogene | 9268646


Prediction of uncharacterized genes: 2008

•  96 (of 211) "uncharacterized genes" in 2003 have some relevant annotation in 2008

•  ~1/3 have correct predictions by our method

•  Notes:
   –  We predict 23 biological processes (random guessing = 1/23)
   –  Not trivial to map old annotations to new GO tree
   –  Also noted in the original Science paper: there is a huge overrepresentation of down-regulated genes among 2003-uncharacterized genes: not a representative training set

•  Example: Gene MRE11A correctly predicted to DNA metabolic process, cell cycle and cellular component organization


Predicting biological process from gene expression time profiles – including protein features

Papers:

I.  K. Wabnik, T. R. Hvidsten, A. Kedzierska, J. Van Leene, G. De Jaeger, G. T. S. Beemster, J. Komorowski and M. T. R. Kuiper. Gene expression trends and protein features effectively complement each other in gene function prediction, Accepted to Bioinformatics, 2008.


Results

•  Protein features: derived directly from the amino acid sequence*

•  Synergy from combining expression and protein sequence features

•  Expression data increases performance for relevant biological processes

•  Extension: Pick the most significant/certain predictions from each dataset from many diverse sets

[Figure: precision plotted against coverage (recall).]

* U. Lichtenberg, T. Jensen, L. Jensen and S. Brunak. Protein feature based identification of cell cycle regulated proteins in yeast. J Mol Biol, 329, 149-170, 2003.


Rule interpretation

E.g.: IF 1.03-1.05(decrease) AND 1.07-1.1(increase) AND sequence length ([*,-0.52) -[0.16,*] THEN leaf morphogenesis


Discovering regulatory logics

•  T. R. Hvidsten, B. Wilczynski, A. Kryshtafovych, J. Tiuryn, J. Komorowski and K. Fidelis. Discovering regulatory binding site modules using rule-based learning. Genome Research 15: 856-66, 2005.

•  B. Wilczynski, T. R. Hvidsten, A. Kryshtafovych, J. Tiuryn, J. Komorowski, K. Fidelis. Using Local Gene Expression Similarities to Discover Regulatory Binding Site Modules, BMC Bioinformatics 7: 505, 2006.


Combinatorial regulation

•  Many studies have used gene expression data to search for overrepresented sequence motifs in co-expressed genes

•  Pilpel et al. (2001) found that genes sharing pairs of binding sites are significantly more likely to be co-expressed than genes with only single binding sites in common

•  Expression coherence score (EC)

Pilpel, Y., P. Sudarsanam, and G.M. Church. 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29: 153-159.
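An expression coherence score can be sketched as the fraction of gene pairs in a set whose normalized expression profiles lie closer together than a distance cut-off. This follows the general idea attributed to Pilpel et al. (2001), but the details here are assumptions: the cut-off `max_distance` is passed in as a free parameter rather than derived from the data, and the profiles are invented.

```python
import math
from itertools import combinations


def normalize(profile):
    """Scale a profile to zero mean and unit variance (assumes it is not constant)."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / len(profile))
    return [(x - mean) / sd for x in profile]


def expression_coherence(profiles, max_distance):
    """Fraction of gene pairs with Euclidean distance below max_distance."""
    normalized = [normalize(p) for p in profiles]
    pairs = list(combinations(normalized, 2))
    close = sum(1 for a, b in pairs if math.dist(a, b) < max_distance)
    return close / len(pairs)


# Hypothetical profiles: two genes rise together, one falls.
profiles = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0, 0.0],
]
ec = expression_coherence(profiles, max_distance=1.0)
print(round(ec, 3))
```

Only one of the three gene pairs is coherent in this toy set; a set of genes sharing a functional pair of binding sites would be expected to score higher than a set sharing a single site.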


Rule Induction

•  IF-part (antecedent, premise): the minimal set of binding sites needed to uphold the discriminatory power of the full data set

•  THEN-part (consequent): an expression profile

•  We want rules that describe a combination of binding sites common to genes with similar expression
   –  accuracy: the fraction of genes matching the IF-part that have similar expression to the profile in the THEN-part
   –  coverage: the fraction of genes with similar expression to the profile in the THEN-part that matches the IF-part

IF binding site i AND binding site j

THEN particular expression profile


Data and Method

•  a database of 43 known and 313 putative yeast binding site motifs

•  expression profiles for yeast genes measured under six different conditions: cell cycle and five stress conditions (sporulation, diauxic shift, heat and cold shock, pheromone and DNA-damaging agents)

Binding sites per gene, with the decision "Similar expression to RPL18A?":

Gene     HAP234  RAP1  PHO  SWI5  ECB  MCM1'  …   Similar expression to RPL18A?
RPL18A   0       1     0    1     0    1      …   yes
RPS18A   0       1     0    1     0    1      …   yes
RPL16B   1       1     0    1     0    1      …   yes
RPL26A   0       1     0    1     0    1      …   yes
RPS24A   0       1     0    1     0    1      …   yes
RPL30    0       1     0    1     0    1      …   yes
RPL14A   0       1     0    1     0    1      …   yes
SST2     0       1     0    1     0    1      …   no
DRS2     0       0     0    1     0    1      …   no
GIT1     0       1     0    1     0    0      …   no
CLN3     0       0     0    1     1    1      …   no
RPO21    1       0     0    0     1    1      …   no
BIT89    0       1     0    0     0    1      …   no
…        …       …     …    …     …    …      …   …

[Figure: workflow. Rule learning yields a rule of the form IF RAP1 AND SWI5 AND MCM1' THEN [expression profile plot]; the rules then pass through filtering and evaluation against Gene Ontology and binding data, and the procedure moves to the next gene.]
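The rule from this slide can be checked against the binding site matrix with a few lines of Python. The column order (HAP234, RAP1, PHO, SWI5, ECB, MCM1') and the alignment of the yes/no column with the gene rows are read off the extracted slide and should be treated as assumptions; the variable names are illustrative.

```python
SITES = ["HAP234", "RAP1", "PHO", "SWI5", "ECB", "MCM1'"]

# gene -> (binding site indicators, similar expression to RPL18A?)
TABLE = {
    "RPL18A": ([0, 1, 0, 1, 0, 1], True),
    "RPS18A": ([0, 1, 0, 1, 0, 1], True),
    "RPL16B": ([1, 1, 0, 1, 0, 1], True),
    "RPL26A": ([0, 1, 0, 1, 0, 1], True),
    "RPS24A": ([0, 1, 0, 1, 0, 1], True),
    "RPL30": ([0, 1, 0, 1, 0, 1], True),
    "RPL14A": ([0, 1, 0, 1, 0, 1], True),
    "SST2": ([0, 1, 0, 1, 0, 1], False),
    "DRS2": ([0, 0, 0, 1, 0, 1], False),
    "GIT1": ([0, 1, 0, 1, 0, 0], False),
    "CLN3": ([0, 0, 0, 1, 1, 1], False),
    "RPO21": ([1, 0, 0, 0, 1, 1], False),
    "BIT89": ([0, 1, 0, 0, 0, 1], False),
}


def matches(row, rule):
    """True if every binding site named in the rule is present in the row."""
    return all(row[SITES.index(site)] == 1 for site in rule)


rule = ["RAP1", "SWI5", "MCM1'"]
matched = [g for g, (row, _) in TABLE.items() if matches(row, rule)]
similar = [g for g, (_, is_similar) in TABLE.items() if is_similar]
hits = [g for g in matched if TABLE[g][1]]

accuracy = len(hits) / len(matched)   # matched genes that are co-expressed
coverage = len(hits) / len(similar)   # co-expressed genes that are matched
print(matched, accuracy, coverage)
```

On these 13 rows the rule matches eight genes, of which SST2 is the only one without similar expression, which agrees with SST2 appearing in parentheses in the module example that follows.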


An example of a binding site module

IF RAP1 AND SWI5 AND MCM1' THEN [expression profile plot]

a) Cell cycle: RPL30, RPL18A, RPL26A, RPL14A, RPS18A, (SST2), RPL16B

b) Sporulation: RPL30, RPL18A, RPL14A, (RPS18A), RPL16B, RPL26A

c) Diauxic shift: RPL30, RPL18A, RPL14A, RPS18A, SST2, RPL16B, RPL26A

d) Heat and cold shock: RPS24A, RPL26A, RPL14A, RPS18A, RPL16B

e) DNA-damaging agents: RPL30, RPL18A, RPL26A, RPL14A, RPS18A, (SST2), RPL16B

[Figure: one expression plot per panel a) to e), showing the profiles of the listed genes.]

The rule was (re-)discovered in five of the six expression data sets. The central gene in the expression cluster is underlined. Genes with differing expression profiles are in parentheses.


Biological significance

The transcription factor RAP1 targets two different sets of genes:
•  genes that encode ribosomal proteins and that have an extremely high expression in rapidly growing yeast cells
•  genes encoding several nonribosomal proteins

→ Combinatorial regulatory mechanism to separate these activities: RAP1 specifically targets ribosomal proteins in growing yeast cells by requiring the presence of cell cycle regulators MCM1 and SWI5.

Literature support:
•  Gray and Fassler (1993): RAP1 forms a complex with MCM1
•  Lydall et al. (1991): MCM1 and SWI5 are responsible for the cell-cycle-restricted transcription of SWI5

Gasch, A.P. and Eisen, M.B. 2002. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. 3: 0059.0051-0059.0022.
Lydall, D., Ammerer, G., and Nasmyth, K. 1991. A new role for MCM1 in yeast: Cell cycle regulation of SWI5 transcription. Genes & Dev. 5: 2405-2419.


Biological significance cont: Gene Ontology

IF RAP1 AND SWI5 AND MCM1' THEN <similar expression>

Gene symbol | Biological process | Molecular function | Cellular component
RPL16B | protein biosynthesis | RNA binding, structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), large ribosomal subunit
RPL26A | protein biosynthesis | RNA binding, structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), large ribosomal subunit
RPS18A | protein biosynthesis | structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), eukaryotic 43S pre-initiation complex, eukaryotic 48S initiation complex, small ribosomal subunit
RPL30 | protein biosynthesis, rRNA processing, mRNA splicing, regulation of translation | structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), cytoplasm, large ribosomal subunit
RPL18A | protein biosynthesis | structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), large ribosomal subunit
RPL14A | protein biosynthesis | RNA binding, structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), large ribosomal subunit
SST2 | signal transduction, adaptation to pheromone during conjugation with cellular fusion | GTPase activator activity | plasma membrane
RPS24A | protein biosynthesis | structural constituent of ribosome | cytosolic ribosome (sensu Eukarya), eukaryotic 43S pre-initiation complex, eukaryotic 48S initiation complex, small ribosomal subunit
P-VALUE | 2.35E-04 (protein biosynthesis) | 2.36E-06 (structural constituent of ribosome) | 5.66E-07 (cytosolic ribosome)


Statistical evaluation – Gene Ontology

•  Fraction of significant rules (Bonferroni corrected P < 0.01)

•  Comparison to: genes with similar expression, genes with common motifs and randomly sampled genes

Gene Ontology evaluation (significant fractions, P < 0.01). The rule columns give the fraction of significant rules with its P-value in parentheses; each randomized-test column lists three fractions in the order similar expression | common motifs | random.

Expression data        Rules (P-value)                                Randomized tests
                       Molecular      Biological     Cellular         Molecular            Biological           Cellular
                       function       process        component        function             process              component
Cell cycle             0.31 (0.000)   0.46 (0.000)   0.41 (0.000)     0.05 | 0.04 | 0.01   0.13 | 0.18 | 0.03   0.03 | 0.04 | 0.00
Sporulation            0.26 (0.000)   0.54 (0.000)   0.44 (0.000)     0.08 | 0.04 | 0.01   0.19 | 0.17 | 0.02   0.05 | 0.03 | 0.00
Diauxic shift          0.30 (0.000)   0.43 (0.000)   0.44 (0.000)     0.04 | 0.05 | 0.02   0.11 | 0.17 | 0.03   0.02 | 0.03 | 0.00
Heat and cold shock    0.54 (0.000)   0.64 (0.006)   0.60 (0.000)     0.24 | 0.06 | 0.03   0.46 | 0.24 | 0.05   0.17 | 0.04 | 0.01
Pheromone              0.51 (0.000)   0.67 (0.000)   0.60 (0.000)     0.10 | 0.05 | 0.01   0.25 | 0.16 | 0.02   0.08 | 0.03 | 0.00
DNA-damaging agents    0.39 (0.000)   0.64 (0.000)   0.61 (0.000)     0.09 | 0.05 | 0.01   0.19 | 0.17 | 0.03   0.07 | 0.04 | 0.00
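The slides do not say which test produces the P-values for Gene Ontology annotations. A common choice for this kind of over-representation question is the hypergeometric tail probability, and the sketch below assumes that test together with a plain Bonferroni correction; the function names and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favorable / comb(population, sample)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 20 genes in total, 5 annotated with a term,
# a rule covering 4 genes of which 3 carry the term, 10 terms tested.
p = hypergeom_tail(population=20, annotated=5, sample=4, hits=3)
p_corrected = bonferroni(p, n_tests=10)
print(round(p, 4), round(p_corrected, 4))
```

A rule would count as significant in the table above if its corrected P-value for some GO term fell below 0.01.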


Genome-wide location analysis

Lee et al. (2002) experimentally measured the probability of 106 yeast transcription factors binding to gene promoters

Rows: genes. Columns: transcription factors.

Gene      ABF1   ACE2   ADR1   ARG80  ARG81  ARO80  ASH1   AZF1   BAS1   CAD1   CBF1   HIR1   …
YAL001C   0.580  0.730  0.490  0.240  0.050  0.300  0.600  0.250  0.061  1.000  1.000  0.005  …
YAL002W   1.000  0.760  0.560  0.150  0.079  0.160  1.000  0.012  0.620  1.000  1.000  0.220  …
YAL015C   1.000  0.210  0.130  0.250  0.190  0.340  1.000  0.410  0.410  1.000  1.000  0.400  …
YAL016W   0.002  0.920  0.170  0.480  0.340  0.320  0.690  0.150  0.580  0.310  0.540  0.630  …
YAL017W   0.970  0.048  0.003  0.980  0.840  0.970  0.940  0.610  0.870  0.210  0.340  0.440  …
YAL018C   0.970  0.048  0.003  0.980  0.840  0.970  0.940  0.610  0.870  0.210  0.340  0.440  …
YAL019W   0.600  0.360  0.011  0.900  0.830  0.700  0.170  0.940  0.650  0.380  0.051  0.590  …
YAL020C   0.600  0.360  0.011  0.900  0.830  0.700  0.170  0.940  0.650  0.380  0.051  0.590  …
YAL021C   0.410  0.340  0.160  0.400  0.720  0.410  0.590  0.960  0.840  0.630  0.017  0.170  …
YAL022C   0.810  0.190  0.001  0.970  0.870  0.350  0.530  0.790  0.930  0.500  0.085  0.047  …
YAL023C   0.000  0.930  0.380  0.430  0.770  0.280  0.330  0.840  0.840  0.400  0.041  0.730  …
…         …      …      …      …      …      …      …      …      …      …      …      …      …

Lee et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.


Statistical evaluation – TF binding

•  Fraction of significant rules (Bonferroni corrected P < 0.01)

•  Comparison to: genes with similar expression, genes with common motifs and randomly sampled genes

Binding data evaluation (significant fractions, P < 0.01):

Expression data        Expression similarity   No. rules     Rules          Randomized tests
                       threshold               unique/all    (P-value)      Similar expression   Common motifs   Random
Cell cycle             0.250                   39/109        0.54 (0.000)   0.11                 0.17            0.02
Sporulation            0.250                   45/81         0.13 (0.708)   0.09                 0.18            0.02
Diauxic shift          0.200                   150/428       0.29 (0.000)   0.06                 0.18            0.02
Heat and cold shock    0.125                   52/123        0.52 (0.000)   0.18                 0.18            0.02
Pheromone              0.150                   53/91         0.39 (0.001)   0.14                 0.17            0.02
DNA-damaging agents    0.200                   59/116        0.35 (0.000)   0.10                 0.17            0.02


Module complexity

The distribution of the number of binding sites in the discovered modules

[Figure: bar chart of the number of rules (y-axis, 0 to 140) against the number of binding sites (x-axis, 1 to 6). Bar labels as extracted: 3, 74, 122, 61, 182.]


Restricting expression similarity to time-windows

IF REB1 AND SWI5 AND SCB

⎯⎯⎯ →⎯− 90109 .]~[

B. Wilczynski, T. R. Hvidsten, A. Kryshtafovych, J. Tiuryn, J. Komorowski, K. Fidelis. Using Local Gene Expression Similarities to Discover Regulatory Binding Site Modules, BMC Bioinformatics 7: 505, 2006.

•  Not identified using global expression similarity

•  P-value for the ChIP-Chip data: $4 \cdot 10^{-6}$

•  Corresponds to the M/G1 phase boundary, which is the active time of the SWI5 factor


Explanatory power

•  ChIP-Chip data

•  Cell cycle expression data

•  Number of genes explained by significant rules

•  Comparison to other studies


Model-based detection of periodic expression: using biological knowledge

C. R. Andersson*, T. R. Hvidsten*, A. Isaksson, M. G. Gustafsson, J. Komorowski. Revealing cell cycle control by combining model-based detection of periodic expression with cis-regulatory descriptors, Accepted to BMC Systems Biology, 1: 45, 2007. * Contributed equally

Synchronization

•  To recover mRNA-levels, cultures must be synchronized.
•  Synchronization halts cells at a particular point in the cell cycle.
•  Typically the mating system or temperature-sensitive mutants are used (cdc28, cdc15).
•  Spellman (1998): S. cerevisiae under various synchronizations (alpha-factor, ts cdc15, cdc28 and elutriation)

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9:3273-97.

Synchronization cont.

• Synchronization reveals periodic expression

• Periodic expression is related to the cell cycle machinery

Detecting periodically expressed genes

[Figure: three panels of expression against time, labelled Model 1, Model 0 and Signal, with T marked on the time axis]

Prior knowledge: Period time of the cell cycle in yeast

Which model is most similar to the signal? → Probability (periodic expression)
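
The model comparison on this slide can be illustrated with a minimal sketch. It is not the method of the cited paper: the sinusoid for Model 1, the constant for Model 0, the BIC-based weighting with equal priors, and the function name `periodic_probability` are all assumptions made for illustration. Only the idea of using a known period T and asking which model resembles the signal comes from the slide.

```python
import numpy as np

def periodic_probability(times, expression, period):
    """Approximate P(periodic) for one gene by comparing two least-squares fits.

    Model 1: constant plus a sinusoid with the known cell-cycle period.
    Model 0: constant only.
    The BIC weighting with equal priors is an illustrative assumption.
    """
    t = np.asarray(times, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = y.size
    omega = 2.0 * np.pi / period

    def bic(design):
        # Least-squares fit, then BIC = n*log(RSS/n) + k*log(n).
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = max(float(np.sum((y - design @ coef) ** 2)), 1e-12)  # guard log(0)
        return n * np.log(rss / n) + design.shape[1] * np.log(n)

    model0 = np.ones((n, 1))
    model1 = np.column_stack([np.ones(n), np.cos(omega * t), np.sin(omega * t)])
    # Equal-prior weight of Model 1: 1 / (1 + exp((BIC1 - BIC0) / 2)).
    half_diff = np.clip(0.5 * (bic(model1) - bic(model0)), -700.0, 700.0)
    return float(1.0 / (1.0 + np.exp(half_diff)))
```

Fixing the period in advance is what makes this a use of prior knowledge: only amplitude and phase are fitted, so a signal that oscillates at some unrelated frequency gains little from Model 1.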

Conditional periodicity

Model-based analysis of gene regulation

•  Rather than relying on clustering, apply a model to describe the gene expression
   –  Prior knowledge:
      •  Period time of the cell cycle
      •  Periodic expression is related to the cell cycle machinery

•  Advantage:
   –  Interpretable hypotheses
   –  More specific hypotheses

Periodic classes

Classification with rejection:
–  P(Periodic) > 0.90: Periodic
–  P(Periodic) < 0.10: Not periodic

Class  α-factor      cdc28         cdc15         No. genes
000    Not periodic  Not periodic  Not periodic  1173
001    Not periodic  Not periodic  Periodic      55
010    Not periodic  Periodic      Not periodic  140
100    Periodic      Not periodic  Not periodic  4
011    Not periodic  Periodic      Periodic      127
101    Periodic      Not periodic  Periodic      11
110    Periodic      Periodic      Not periodic  115
111    Periodic      Periodic      Periodic      19
Sum                                              1644
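
The classification with rejection above can be sketched as follows. The thresholds 0.90 and 0.10 and the digit order (α-factor, cdc28, cdc15) are taken from the slide; the function name `periodic_class` and the choice to drop a gene when any one of the three calls is ambiguous are assumptions.

```python
def periodic_class(p_alpha, p_cdc28, p_cdc15, upper=0.90, lower=0.10):
    """Return a three-digit class code such as "101", or None if rejected.

    Digits follow the column order alpha-factor, cdc28, cdc15.
    A gene is rejected when any probability lies between the thresholds
    (assumed behaviour; the slide only gives the two thresholds).
    """
    code = ""
    for p in (p_alpha, p_cdc28, p_cdc15):
        if p > upper:
            code += "1"      # confidently periodic in this experiment
        elif p < lower:
            code += "0"      # confidently not periodic
        else:
            return None      # ambiguous call: reject the gene
    return code
```

For example, `periodic_class(0.95, 0.02, 0.99)` gives `"101"`, while `periodic_class(0.95, 0.50, 0.99)` gives `None`.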

Method

Cis-regulatory descriptors

Transcription factor binding and sequence motifs co-occurring in promoter regions

–  Overrepresented p < 0.05
–  Three best pairs for each factor
–  Three best pairs for each motif

Known associations

TF    P-value    Motif
YOX1  2.25e-06   MCM1
      3.83e-05   ECB
      0.0034     m_organization_of_cell wall_orfnum2SD_n6
UME6  5.70e-114  m_meiosis_orfnum2SD_n3
      1.75e-70   Ume6(URS1)
      1.74e-17   m_glyoxylate_cycle_orfnum2SD_n11
ABF1  6.5e-214   ABF1
      2.98e-09   Ume6(URS1)
      7.8e-09    RPN4
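
One way to score the co-occurrence of a bound factor and a motif is a hypergeometric tail probability, sketched below. The slides state only that pairs are overrepresented at p < 0.05 and that the three best pairs are kept per factor and per motif; the choice of the hypergeometric test, the union of the two selections, and the names `hypergeom_pvalue` and `best_pairs` are assumptions.

```python
from math import comb

def hypergeom_pvalue(n_genes, n_tf, n_motif, n_both):
    """P(overlap >= n_both) when n_motif promoters are drawn from n_genes,
    of which n_tf are bound by the factor (assumed test, see lead-in)."""
    upper = min(n_tf, n_motif)
    tail = sum(comb(n_tf, k) * comb(n_genes - n_tf, n_motif - k)
               for k in range(n_both, upper + 1))
    return tail / comb(n_genes, n_motif)

def best_pairs(pvalues, top=3, alpha=0.05):
    """pvalues maps (factor, motif) -> p-value.

    Keeps significant pairs that rank among the `top` best for their
    factor or for their motif (union of both selections, an assumption).
    """
    significant = {pair: p for pair, p in pvalues.items() if p < alpha}
    keep = set()
    for position in (0, 1):  # 0: group by factor, 1: group by motif
        groups = {}
        for pair, p in significant.items():
            groups.setdefault(pair[position], []).append((p, pair))
        for members in groups.values():
            keep.update(pair for _, pair in sorted(members)[:top])
    return keep
```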

Cis-regulatory descriptors cont

• 1459 cis-regulatory descriptors in total

• Remember: 1644 genes

Motif  P-value   TF
SCB    3.67e-08  AZF1
       1.02e-05  UME6
       3.31e-05  SWI4
SFF    1.14e-10  FKH2
       9.26e-10  FKH1
       1.92e-05  HIR1

Results

•  Pairs of cis-regulatory descriptors are better at explaining the periodic classes than single descriptors.
•  Pairs are specific to different periodic classes, indicating that the synchronization methods induce different perturbations that initially activate different regulatory mechanisms, visible in the first two periods of the cell cycle.

Observed/Expected (P-value) for single descriptors and for pairs of descriptors (all genes | within classes)

Class  Single           Pairs, all genes  Pairs, within classes  No. genes
001    1.1 (0.35)       7.3 (5e-10)       0.86 (0.71)            55
010    0.77 (0.9)       0.95 (0.91)       0.31 (1.0)             140
100    0.85 (0.84)      0.36 (1.0)        0.0 (1.0)              127
011    9.1 (1.3e-11)    19.3 (<1e-20)     18.18 (1.02e-12)       4
101    4.42 (1.55e-9)   19.1 (2.5e-12)    12.9 (1.3e-14)         11
110    2.2 (1.3e-10)    1.6 (1.44e-10)    1.81 (0.0013)          115
111    3.7 (4.79e-13)   17.1 (<1e-20)     6.7 (3.8e-14)          19

Results cont.

•  Combinations of regulatory descriptors suggest that the periodic classes are regulated in an additive fashion.

Periodic in all classes:
Periodic class 111: MCM1-SFF’ AND FKH1-SFF’ AND FKH2-SFF’ AND FKH2-SFF

Periodic in two classes:
Periodic class 110: MCM1-SFF’ AND FKH1-SFF’ AND FKH2-SFF
Periodic class 110: FKH1-SFF’ AND FKH2-SFF’

Periodic in one class:
Periodic class 010: MCM1-SFF’ AND FKH1-SFF’
Periodic class 100: MCM1-SFF’ AND FKH2-SFF

Results cont.

•  Replacing clustering-based classification of dynamic gene expression patterns with model-based classification is advantageous for discovering the mechanisms underlying cellular control processes.

Explanatory power

•  The point (0.034, 0.73) on the curve is associated with the 145 rules with p-value lower than 0.000195.

•  These rules include 19 of the 26 known phase-specific regulators (73%) and 18 other regulators (3.4%).

•  Furthermore, they describe 24% of the genes in the periodic classes.
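
A point on such a curve can be computed as sketched below, assuming the x-coordinate is the fraction of other regulators appearing in the selected rules and the y-coordinate is the fraction of known phase-specific regulators (19/26 ≈ 0.73 on the slide). The function name `curve_point` and the rule representation are hypothetical.

```python
def curve_point(rules, known_regulators, other_regulators, threshold):
    """Return (x, y) for the rules whose p-value is below `threshold`.

    rules: iterable of (p_value, regulators_in_rule) pairs.
    x: fraction of other regulators that appear in the selected rules.
    y: fraction of known regulators that appear in the selected rules.
    This reading of the axes is an assumption based on the slide text.
    """
    selected = set()
    for p_value, regulators in rules:
        if p_value < threshold:
            selected |= set(regulators)
    x = len(selected & set(other_regulators)) / len(other_regulators)
    y = len(selected & set(known_regulators)) / len(known_regulators)
    return x, y
```

Sweeping `threshold` from strict to lenient traces the whole curve; a strict threshold that already recovers most known regulators while admitting few others indicates that the significant rules are specific.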

Examples of "interactions"

Ellipses/rectangles: transcription factors/sequence motifs
Green/Blue: Cell cycle related transcription factors/sequence motifs
Red: "interactions" between transcription factors

Future challenges: towards regulatory networks

Modules combined into networks

•  Modules reduce the number of nodes in the network
•  The regulatory mechanisms predicted for each module can further restrict the potential regulators
•  The expression of predicted transcription factors can be used to link modules together (e.g. using Bayesian networks)

Preliminary results: trees (Populus/aspen)

Three platforms to explain regulation in trees:
1.  Transcriptomics: Genes and Transcription Factors
2.  Metabolomics: Metabolites
3.  Proteomics: Proteins

Machine learning issues

•  Representation
   –  Expression trends over time intervals
   –  Expression similarity as decision classes

•  Data integration
   –  Expression data – protein sequence features
   –  Sequence motifs – expression data
   –  Sequence motifs – binding data

•  Knowledge integration
   –  Gene Ontology
   –  Periodic expression targets the cell cycle machinery

Acknowledgements

•  C. R. Andersson
•  K. Fidelis
•  M. G. Gustafsson
•  A. Isaksson
•  J. Komorowski
•  A. Lægreid
•  A. Kryshtafovych
•  M. T. R. Kuiper
•  A. Sandvik
•  K. Wabnik
•  B. Wilczynski