Computational Methodology for Microbial and Metagenomic Characterization using Large Scale Functional Genomic Data Integration

Curtis Huttenhower

03-08-10

Harvard School of Public Health

Department of Biostatistics


Outline

1. Network models of functional data

2. Network models of microbes

3. Network models of microbiomes


Meta-analysis for unsupervised functional data integration

Following up with round-robin and semi-supervised evaluations

Huttenhower 2006; Hibbs 2007

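[Equations defining z, z′ and z″, including what appears to be a ½ log term; not recoverable from the extraction.]

[Diagram: inputs joined by + and =]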
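The equations on this slide did not survive extraction. As a rough illustration only, the sketch below uses Fisher's z-transform, a standard way to put correlations from different datasets on a comparable scale before averaging them. Whether this matches the slide's exact formulas is an assumption, and the function names (`fisher_z`, `standardize`, `integrate`) are hypothetical.

```python
import numpy as np

def fisher_z(rho, eps=1e-6):
    """Fisher's z-transform: 0.5 * log((1 + rho) / (1 - rho))."""
    # Clip so that correlations of exactly +/-1 do not map to infinity.
    rho = np.clip(rho, -1 + eps, 1 - eps)
    return np.arctanh(rho)

def standardize(z):
    """Center and scale one dataset's z-scores (assumed normalization)."""
    return (z - z.mean()) / z.std()

def integrate(correlation_matrices):
    """Average standardized z-scores over datasets, one value per gene pair.

    Each input is a symmetric gene-by-gene correlation matrix over the same
    genes. The result is ordered like np.triu_indices(n, k=1).
    """
    per_dataset = []
    for rho in correlation_matrices:
        upper = np.triu_indices_from(rho, k=1)
        per_dataset.append(standardize(fisher_z(rho[upper])))
    return np.mean(per_dataset, axis=0)

# Hypothetical toy input: two small correlation matrices over three genes.
a = np.array([[1.0, 0.2, 0.8], [0.2, 1.0, -0.4], [0.8, -0.4, 1.0]])
b = np.array([[1.0, 0.1, 0.6], [0.1, 1.0, -0.2], [0.6, -0.2, 1.0]])
combined = integrate([a, b])
```

Simple averaging treats every dataset equally, which is one reading of "unsupervised"; a weighted scheme would need information the slide does not give.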


Functional network prediction from diverse microbial data

486 bacterial expression experiments

876 raw datasets

310 postprocessed datasets

304 normalized coexpression networks in 27 species

Integrated functional interaction networks in 15 species

307 bacterial interaction experiments

154796 raw interactions

114786 postprocessed interactions


Functional maps for cross-species knowledge transfer

Following up with unsupervised and partially anchored network alignment

Huttenhower 2008; Huttenhower 2009


Functional maps for functional metagenomics

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

Mapping organisms into phyla

=

GOS 4441599.3, Hypersaline Lagoon, Ecuador


Functional maps for functional metagenomics

Nodes: Process cohesiveness in obesity (Very Downregulated, Baseline (no change), Very Upregulated)

Edges: Process association in obesity (More Coregulated, Less Coregulated, Baseline (no change))

Summarizes information from ~10M metagenomic reads and ~500 genome-scale microbial experiments.


Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
• Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3hrs.


Thanks!

NIGMS

http://function.princeton.edu/hefalmp

http://huttenhower.sph.harvard.edu/sleipnir

Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Chris Park, Ana Pop, Aaron Wong

Hilary Coller, Erin Haley

Jacques Izard

Wendy Garrett

Sarah Fortune, Tracy Rosebrock


Functional mapping: Functional associations between processes

Edges: Associations between processes (Very Strong, Moderately Strong)

Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)

Borders: Data coverage of processes (Well Covered, Sparsely Covered)

Information mapped from ~100 E. coli experiments


Meta-analysis for unsupervised functional data integration

Following up with round-robin and semi-supervised evaluations

Evangelou 2007

Huttenhower 2006; Hibbs 2007

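[Equations defining z, z′ and z″, including what appears to be a ½ log term; not recoverable from the extraction.]

[Diagram: inputs joined by + and =]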


Functional mapping: mining integrated networks

Predicted relationships between genes (High Confidence to Low Confidence)

The strength of these relationships indicates how cohesive a process is.

Chemotaxis


Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Predicted relationships between genes (High Confidence to Low Confidence)

Chemotaxis


Functional maps for cross-species knowledge transfer

[Figure: network of genes G1 to G17 grouped into O1 to O9]

O1: G1, G2, G3
O2: G4
O3: G6
…

ECG1, ECG2
BSG1
ECG3, BSG2
…
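One way to read the gene-to-group listing above is as a lookup table for moving a gene-level network into a shared group-level space. The sketch below is a minimal illustration under that assumption: `project_to_groups` is a hypothetical helper, averaging is an assumed aggregation rule, and the edge weights are made up. Only the group memberships O1: G1, G2, G3; O2: G4; O3: G6 come from the slide.

```python
from collections import defaultdict

def project_to_groups(edges, gene_to_group):
    """Average gene-level edge weights into group-level edge weights.

    edges: dict mapping (gene_a, gene_b) to a weight.
    gene_to_group: dict mapping a gene name to its group name.
    Edges with an unmapped gene, or inside a single group, are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (gene_a, gene_b), weight in edges.items():
        group_a = gene_to_group.get(gene_a)
        group_b = gene_to_group.get(gene_b)
        if group_a is None or group_b is None or group_a == group_b:
            continue
        key = tuple(sorted((group_a, group_b)))
        sums[key] += weight
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Group memberships as listed on the slide.
gene_to_group = {"G1": "O1", "G2": "O1", "G3": "O1", "G4": "O2", "G6": "O3"}

# Hypothetical edge weights, for illustration only.
edges = {
    ("G1", "G4"): 0.8,
    ("G2", "G4"): 0.4,
    ("G4", "G6"): 0.5,
    ("G1", "G2"): 0.9,  # both genes in O1, so skipped
    ("G5", "G6"): 0.7,  # G5 has no group here, so skipped
}
group_edges = project_to_groups(edges, gene_to_group)
```

Once several organisms are expressed in the same group space, their group-level edges can be compared or combined directly, which is the point of a cross-species map.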


Functional network prediction from diverse microbial data

486 bacterial expression experiments

876 raw datasets

310 postprocessed datasets

304 normalized coexpression networks in 27 species

Integrated functional interaction networks in 15 species

307 bacterial interaction experiments

154796 raw interactions

114786 postprocessed interactions

[Plot: E. coli integration; ← Precision ↑, Recall ↓]


Functional maps for functional metagenomics

GOS 4441599.3, Hypersaline Lagoon, Ecuador

[Figure: KEGG Pathways against Organisms, with organism groups labeled Pathogens and Env.]

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

Mapping organisms into phyla

=


Functional maps for cross-species knowledge transfer

← Precision ↑, Recall ↓

Following up with unsupervised and partially anchored network alignment


Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?


Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data


Functional Mapping: Scoring Functional Associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
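The four measures can be sketched on a weighted network as follows. The slide does not say how each measure is computed, so this example assumes each one is a mean edge weight over the relevant gene pairs; the helper names are hypothetical, and each gene set needs at least two genes so that it has within-set edges.

```python
from itertools import combinations, product
import numpy as np

def pairs_between(a, b):
    """Distinct unordered gene pairs with one gene in a and one in b."""
    return {tuple(sorted((i, j))) for i, j in product(a, b) if i != j}

def pairs_within(a):
    """All unordered gene pairs inside a single set."""
    return set(combinations(sorted(a), 2))

def mean_weight(w, pairs):
    """Mean edge weight over a set of (i, j) pairs with i < j."""
    return float(np.mean([w[i, j] for i, j in pairs]))

def functional_association(w, g1, g2):
    """FA = (between / background) * (baseline / within).

    w is a symmetric gene-by-gene weight matrix; g1 and g2 are sets of
    row indices. Using means for all four terms is an assumption.
    """
    genes = range(w.shape[0])
    between = mean_weight(w, pairs_between(g1, g2))
    within = mean_weight(w, pairs_within(g1) | pairs_within(g2))
    background = mean_weight(w, pairs_between(set(g1) | set(g2), genes))
    baseline = mean_weight(w, pairs_within(genes))
    return (between / background) * (baseline / within)

# Hypothetical 6-gene network: weak edges everywhere except between the sets.
w = np.full((6, 6), 0.1)
for i, j in product((0, 1), (2, 3)):
    w[i, j] = w[j, i] = 0.9
score = functional_association(w, {0, 1}, {2, 3})
```

In a network where every edge has the same weight, all four terms are equal and the score is 1, which is consistent with a baseline of "no association".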


Functional Mapping: Bootstrap p-values

• Scoring functional associations is great… how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?

Empirically!

[Figure: Histograms of FAs for random sets, with # Genes = 1, 5, 10, 50 on each axis]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1.

Standard deviation is asymptotic in the sizes of both gene sets.

Maps FA scores to p-values for any gene sets and underlying graph.

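[Figure: Null distribution σs for one graph, plotted against |G1| and |G2| on logarithmic axes]

$$\hat{\mu}_{FA}(G_i, G_j) = 1$$

[Equation for $\hat{\sigma}_{FA}(G_i, G_j)$ in terms of $|G_i|$, $|G_j|$ and graph-specific terms $A(G)$, $B(G)$, $C(G)$; exact form not recoverable from the extraction]

$$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$$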
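The empirical procedure can be sketched as: score many random gene-set pairs of the sizes of interest, estimate the mean and standard deviation of those scores, and read a p-value off the normal upper tail. The sketch below estimates σ directly by sampling rather than from the slide's closed-form fit, which did not survive extraction. All function names are hypothetical, drawing disjoint random sets is an assumption, and `make_toy_score` is a simplified stand-in for the FA score.

```python
import math
import numpy as np

def null_parameters(score, n_genes, size1, size2, n_samples=1000, seed=0):
    """Estimate mean and standard deviation of scores for random gene sets."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        shuffled = rng.permutation(n_genes)
        g1 = set(shuffled[:size1].tolist())
        g2 = set(shuffled[size1:size1 + size2].tolist())
        scores.append(score(g1, g2))
    return float(np.mean(scores)), float(np.std(scores, ddof=1))

def p_value(x, mu, sigma):
    """Upper-tail normal probability, 1 - Phi((x - mu) / sigma)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

def make_toy_score(w):
    """Stand-in score: mean between-set weight over the network baseline."""
    baseline = w[np.triu_indices_from(w, k=1)].mean()
    def score(g1, g2):
        between = np.mean([w[i, j] for i in g1 for j in g2])
        return float(between / baseline)
    return score

# Hypothetical random network of 40 genes.
rng = np.random.default_rng(1)
w = rng.random((40, 40))
w = (w + w.T) / 2
mu, sigma = null_parameters(make_toy_score(w), 40, 5, 5, n_samples=500)
p = p_value(1.3, mu, sigma)  # p-value for an observed score of 1.3
```

Repeating `null_parameters` over a grid of set sizes gives the kind of σ surface described above, which could then be fitted so that new gene sets do not need fresh sampling.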


Microbial Communities and Functional Metagenomics

• Metagenomics: data analysis from environmental samples
  – Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
  – Must include datasets from multiple organisms
• What questions can we answer?
  – What pathways/processes are present/over/under-enriched in a newly sequenced microbe/community?
  – What’s shared within community X? What’s different? What’s unique?
  – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  – Current functional methods annotate ~50% of synthetic data, <5% of environmental data

[Figure: gene network with nodes PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

With Jacques Izard, Wendy Garrett


Data Integration for Microbial Communities

[Figure: gene network with nodes PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA, shown twice]

~350 available expression datasets

~25 species

Weskamp et al 2004

Flannick et al 2006

Kanehisa et al 2008

Tatusov et al 1997

• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology


Functional Maps for Functional Metagenomics


Validating Orthology-Based Functional Mapping

Does unweighted data integration predict functional relationships?

What is the effect of “projecting” through an orthologous space?

[Figure: four panels plotting log(Precision/Random) against Recall, evaluated on KEGG and GO, comparing unsupervised integration with individual datasets]
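The plot axes suggest an evaluation in which precision is reported relative to what random guessing would achieve. A minimal sketch of that calculation follows. The base of the logarithm is not given in the source, so base 2 is an assumption, as are the function name and the toy scores and labels.

```python
import math

def precision_over_random_curve(scores, labels):
    """Return (recall, log2(precision / random precision)) points.

    scores: predicted confidence per gene pair (higher is more confident).
    labels: 1 if the pair is related in the gold standard, else 0.
    A point is recorded each time a true positive is reached in rank order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    # Precision expected from a random ranking: the fraction of positives.
    random_precision = total_positives / len(labels)
    curve = []
    true_positives = 0
    for rank, index in enumerate(order, start=1):
        if labels[index]:
            true_positives += 1
            precision = true_positives / rank
            recall = true_positives / total_positives
            curve.append((recall, math.log2(precision / random_precision)))
    return curve

# Hypothetical predictions against a gold standard such as KEGG or GO.
curve = precision_over_random_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0 on the vertical axis means the predictions are no better than random at that recall; positive values mean enrichment for known relationships.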


Validating Orthology-Based Functional Mapping

[Figure: yeast genes YG1 to YG17 partitioned into subsets]

Holdout set, uncharacterized “genome”

Random subsets, characterized “genomes”

Validating Orthology-Based Functional Mapping

Can subsets of the yeast genome predict a held-out subset’s functional maps?

Can subsets of the yeast genome predict a held-out subset’s interactome?

[Figure: panels labeled KEGG and GO for each question; values shown: 0.68, 0.48, 0.39, 0.25, 0.30, 0.37, 0.27, 0.39, 0.43, 0.40 (assignment of values to panels not recoverable)]

What have we learned?

• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case