Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower

01-06-1011

Harvard School of Public Health

Department of Biostatistics

2

What tools enable biological discoveries?

Our job is to create computational microscopes:

To ask and answer specific biomedical questions using

millions of experimental results

3

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Modeling microbial communities for public health

4

A computational definition of functional genomics

Genomic data, Prior knowledge

Data → Function

Function → Function

Gene → Gene

Gene → Function

5

A framework for functional genomics

[Figure: a matrix of 100Ms of gene pairs (columns) by 1Ks of datasets (rows), with values labeled High/Low Similarity and High/Low Correlation. Example rows:

0.9 0.7 … 0.1 0.2 … 0.8

+ - … - - … +

0.8 0.5 … 0.05 0.1 … 0.6

Labeled gene pairs: G1-G2 (+), G4-G9 (+), G3-G6 (-), G7-G8 (-), G2-G5 (?)

Per-dataset frequency histograms: High Correlation vs. Low Correlation; Let. vs. Not let.; Similar vs. Dissim.]

P(G2-G5|Data) = 0.85
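The computation on this slide (per-dataset frequency histograms for + and - gene pairs, combined into a probability such as P(G2-G5|Data)) can be sketched as a naive Bayes classifier. The slide does not name the classifier, so the naive Bayes model, the "high"/"low" bins, the 0.5 prior, the add-one smoothing, and every function and variable name below are assumptions made for illustration.

```python
from collections import Counter

def train(gold, datasets):
    """For each dataset, count how often each binned value occurs
    among related (+) and unrelated (-) labeled gene pairs."""
    tables = []
    for data in datasets:
        counts = {True: Counter(), False: Counter()}
        for pair, related in gold.items():
            if pair in data:
                counts[related][data[pair]] += 1
        tables.append(counts)
    return tables

def posterior(pair, datasets, tables, prior=0.5):
    """P(related | data), treating datasets as conditionally independent
    (the naive Bayes assumption). Add-one smoothing avoids zero likelihoods."""
    weight = {True: prior, False: 1.0 - prior}
    for data, counts in zip(datasets, tables):
        if pair not in data:
            continue  # a dataset without this pair contributes no evidence
        value = data[pair]
        bins = set(counts[True]) | set(counts[False]) | {value}
        for related in (True, False):
            total = sum(counts[related].values())
            weight[related] *= (counts[related][value] + 1) / (total + len(bins))
    return weight[True] / (weight[True] + weight[False])

# Toy labels taken from the slide; the binned dataset values are made up.
gold = {("G1", "G2"): True, ("G4", "G9"): True,
        ("G3", "G6"): False, ("G7", "G8"): False}
correlation = {("G1", "G2"): "high", ("G4", "G9"): "high", ("G3", "G6"): "low",
               ("G7", "G8"): "low", ("G2", "G5"): "high"}
similarity = {("G1", "G2"): "high", ("G4", "G9"): "low", ("G3", "G6"): "low",
              ("G7", "G8"): "low", ("G2", "G5"): "high"}

datasets = [correlation, similarity]
tables = train(gold, datasets)
p = posterior(("G2", "G5"), datasets, tables)
print(round(p, 3))
```

Because each dataset only contributes a pair of likelihoods, the same loop extends to thousands of datasets without retraining the others, which is one plausible reason for choosing a model of this shape at this scale.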

6

Functional network prediction and analysis

Global interaction network

Carbon metabolism network, Extracellular signaling network, Gut community network

Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

HEFalMp

7

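Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$$z' = \frac{1}{2}\log\frac{1+\rho}{1-\rho}, \qquad z'' = \frac{z' - \mu}{\sigma}$$

Simple regression: All datasets are equally accurate

$$y_{e,i} = \mu_e + \varepsilon_{e,i}$$

Random effects: Variation within and among datasets and interactions

$$y_{e,i} = \mu_e + \delta_{e,i} + \varepsilon_{e,i}$$

$$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$$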
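A minimal sketch of the random-effects idea for a single gene pair: correlations from several datasets are Fisher z-transformed and then averaged with weights 1/(s² + τ²), normalized to sum to one. The DerSimonian-Laird estimate of the between-dataset variance τ², the 1/(n − 3) sampling variance for each z score, and the example dataset sizes are assumptions; the slide does not say which estimator was used.

```python
import math

def fisher_z(rho):
    """Fisher's z-transform of a correlation coefficient."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))

def random_effects(y, s2):
    """Random-effects mean for one gene pair.

    y  : per-dataset effect sizes (here, z scores)
    s2 : per-dataset within-dataset variances
    Returns (mu_hat, tau2), with tau2 estimated by DerSimonian-Laird.
    """
    w = [1.0 / v for v in s2]
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q measures heterogeneity among datasets.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    # Weights shrink toward equality as between-dataset variance grows.
    w_star = [1.0 / (v + tau2) for v in s2]
    mu_hat = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu_hat, tau2

rhos = [0.9, 0.7, 0.1]   # correlations of one gene pair in three datasets
sizes = [20, 50, 100]    # hypothetical numbers of conditions per dataset
y = [fisher_z(r) for r in rhos]
s2 = [1.0 / (n - 3) for n in sizes]  # sampling variance of a Fisher z score
mu_hat, tau2 = random_effects(y, s2)
print(mu_hat, tau2)
```

When τ² is estimated as zero the result reduces to the fixed-effects (inverse-variance) mean, so the two models on the slide differ only when datasets genuinely disagree.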

8

Meta-analysis for unsupervised functional data integration

Evangelou 2007; Huttenhower 2006; Hibbs 2007

$$z' = \frac{1}{2}\log\frac{1+\rho}{1-\rho}, \qquad z'' = \frac{z' - \mu}{\sigma}$$

9

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

10

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

X?

11

Outline

1. Data mining: Integrating very large genomic data compendia

2. Metagenomics: Modeling microbial communities for public health

12

What to do with your metagenome?

(x10^10)

Diagnostic or prognostic biomarker for host disease

Public health tool monitoring population health and interactions

Comprehensive snapshot of microbial ecology and evolution

Reservoir of gene and protein functional information

Who’s there?

What are they doing?

What do functional genomic data tell us about microbiomes?

What can our microbiomes tell us about us?*

*Using terabases of sequence and thousands of experimental results

13

The Human Microbiome Project

2007 - ongoing

• 300 “normal” adults, 18-40

• 16S rDNA + WGS

• 5 sites/18 samples + blood

    • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth

    • Skin: ears, inner elbows

    • Nasal cavity

    • Gut: stool

    • Vagina: introitus, mid, fornix

• Reference genomes (~200+800)

All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic resistant infection…

Hamady, 2009

Kolenbrander, 2010

14

HMP Organisms: Everyone and everywhere is different

[Figure: organisms (taxa) by body sites + individuals; body-site labels: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue]

Every microbiome is surprisingly different

Most organisms are rare in most places

Even common organisms vary tremendously in abundance among individuals

Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants

There are few, if any, organismal biotypes in health

15

HMP: Metabolic reconstruction

WGS reads

Pathways/modules

Genes (KOs)

Pathways (KEGGs)

Functional seq.: KEGG + MetaCYC, CAZy, TCDB, VFDB, MEROPS…

BLAST → Genes

$$c(g) = \frac{1}{|g|} \sum_{r} \frac{\sum_{a \in a(r),\ g(a) = g} (1 - p_a)}{\sum_{a \in a(r)} (1 - p_a)}$$

Genes → Pathways: MinPath (Ye 2009)

? Taxonomic limitation: Rem. paths in taxa < ave.

Smoothing: Witten-Bell

$$c(g) = \begin{cases} \dfrac{T N}{(V - T)(N + T)} & \text{if } c(g) = 0 \\[2ex] c(g)\,\dfrac{N}{N + T} & \text{otherwise} \end{cases}$$

Gap filling: c(g) = max( c(g), median )

Xipe: Distinguish zero/low (Rodriguez-Mueller in review)

300 subjects, 1-3 visits/subject, ~6 body sites/visit

10-200M reads/sample, 100bp reads

BLAST
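The smoothing and gap-filling steps named on this slide can be sketched as follows, with N the total count, T the number of observed gene families, and V the number of all gene families. Taking the median over the genes of one pathway, leaving counts unchanged when nothing is unseen, and the placeholder identifiers K1 to K4 are assumptions for illustration, not details given on the slide.

```python
from statistics import median

def witten_bell(counts):
    """Witten-Bell smoothed counts.

    counts maps every gene family (including unobserved ones, with 0)
    to its raw count. Observed counts are discounted by N / (N + T) and
    the freed mass is shared equally among the V - T unobserved families.
    """
    n = sum(counts.values())                      # N: total observations
    t = sum(1 for c in counts.values() if c > 0)  # T: observed families
    v = len(counts)                               # V: all families
    if n == 0 or t == v:
        # Nothing to redistribute (assumption: leave counts unchanged).
        return {g: float(c) for g, c in counts.items()}
    seen_scale = n / (n + t)
    unseen = t * n / ((v - t) * (n + t))
    return {g: (c * seen_scale if c > 0 else unseen) for g, c in counts.items()}

def fill_gaps(counts, pathway_genes):
    """Raise each gene in a pathway to at least the pathway's median count."""
    m = median(counts[g] for g in pathway_genes)
    return {g: max(counts[g], m) for g in pathway_genes}

raw = {"K1": 8, "K2": 2, "K3": 0, "K4": 0}
smoothed = witten_bell(raw)
filled = fill_gaps({"K1": 8, "K2": 2, "K3": 0}, ["K1", "K2", "K3"])
print(smoothed)
print(filled)
```

Smoothing in this form conserves the total count N, so downstream relative abundances remain comparable across samples.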

16

HMP: Metabolic reconstruction

Pathway coverage Pathway abundance

17

HMP: Metabolic reconstruction

Pathway abundance

[Figure: pathways (rows) by samples (columns)]

18

HMP: Metabolic reconstruction

Pathway coverage

[Figure: pathways (rows) by samples (columns), with groups labeled Aerobic body sites, Gastrointestinal body sites, and All body sites (“core”)]

19

Metagenomic biomarker discovery

Gene expression, SNP genotypes

Healthy/IBD, BMI, Diet

Taxa & pathways

Batch effects? Population structure?

Niches & Phylogeny

Test for correlates

Multiple hypothesis correction

Feature selection

p >> n

Confounds/stratification/environment

Cross-validate

Biological story?

Independent sample

Intervention/perturbation
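When far more features than samples are tested (p >> n), the multiple hypothesis correction step matters a great deal. The slide does not say which correction is meant; the sketch below uses the Benjamini-Hochberg false discovery rate procedure as one common choice, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four taxa or pathways tested against a phenotype.
pvalues = [0.01, 0.04, 0.03, 0.005]
qvalues = benjamini_hochberg(pvalues)
significant = [i for i, q in enumerate(qvalues) if q <= 0.05]
print(qvalues, significant)
```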

20

LEfSe: Metagenomic class comparison and explanation

LEfSe

http://huttenhower.sph.harvard.edu/lefse

Nicola Segata

LDA + Effect Size

21

LEfSe: The TRUC murine colitis microbiota

With Wendy Garrett

22

MetaHIT: The gut microbiome and IBD

With Ramnik Xavier, Joshua Korzenik

Qin 2010

124 subjects: 99 healthy, 21 UC + 4 CD

ReBLASTed against KEGG since published data obfuscate read counts

WGS reads

Taxa: Phymm (Brady 2009)

Genes (KOs)

Pathways (KEGGs)

Pathways/modules

23

MetaHIT: Taxonomic CD biomarkers

Firmicutes

Enterobacteriaceae

Up in CD

Down in CD

UC

24

MetaHIT: Functional CD biomarkers

Motility Transporters Sugar metabolism

Down in CD

Up in CD

Subset of enriched modules in CD patients

Subset of enriched pathways in CD patients

Growth/replication

25

MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome

Transporters

Growth/replication

Motility

Sugar metabolism

Down in CD

Up in CD

Inferred metabolites

Enzyme families

26

Outline

1. Data mining: Integrating very large genomic data compendia

• Network framework for scalable data integration

• HEFalMp: human data integration

• Meta-analysis for unsupervised functional network integration

2. Metagenomics: Modeling microbial communities for public health

• HMP: microbiome in health, 18 body sites in 300 subjects

• HUMAnN: metagenomic metabolic and functional pathway reconstruction

• LEfSe: biologically relevant community differences

27

Thanks!

Jacques Izard, Wendy Garrett

Pinaki Sarder, Nicola Segata

Levi Waldron, Larisa Miropolsky

http://huttenhower.sph.harvard.edu

Interested? We’re recruiting students and postdocs!

Human Microbiome Project

HMP Metabolic Reconstruction

George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor

Sahar Abubucker, Yuzhen Ye

Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng

Mathangi Thiagarajan, Brandi Cantarel

Maria Rivera, Barbara Methe

Bill Klimke, Daniel Haft

Ramnik Xavier, Dirk Gevers

Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi

Sarah Fortune

http://huttenhower.sph.harvard.edu/sleipnir

29

Functional network prediction from diverse microbial data

486 bacterial expression experiments

876 raw datasets

310 postprocessed datasets

304 normalized coexpression networks in 27 species

Integrated functional interaction networks in 15 species

307 bacterial interaction experiments

154796 raw interactions

114786 postprocessed interactions

E. coli Integration

← Precision ↑, Recall ↓

30

Predicting gene function

Cell cycle genes

Predicted relationships between genes

High Confidence

Low Confidence

31

Predicting gene function

Predicted relationships between genes

High Confidence

Low Confidence

Cell cycle genes

32

Cell cycle genes

Predicting gene function

Predicted relationships between genes

High Confidence

Low Confidence

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
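One way to turn "how likely a gene is to specifically participate" into a number is to compare a gene's average edge weight to the process genes with its average edge weight to every gene. This ratio, the toy edge weights, and the gene names A to D are assumptions for illustration; the slides do not give the scoring formula.

```python
def edge(network, a, b):
    """Edge weight between two genes; absent edges count as 0."""
    return network.get(frozenset((a, b)), 0.0)

def process_score(network, gene, process_genes, all_genes):
    """Mean connectivity to the process divided by mean connectivity overall.

    Scores above 1 mean the gene is connected to the process more strongly
    than to the genome in general, which is what "specifically" asks for.
    """
    inside = [edge(network, gene, g) for g in process_genes if g != gene]
    overall = [edge(network, gene, g) for g in all_genes if g != gene]
    mean_in = sum(inside) / len(inside)
    mean_all = sum(overall) / len(overall)
    return mean_in / mean_all if mean_all > 0 else 0.0

# Hypothetical network: A and B are annotated cell cycle genes.
weights = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
           ("A", "D"): 0.1, ("B", "D"): 0.1, ("C", "D"): 0.4}
network = {frozenset(pair): w for pair, w in weights.items()}
genes = ["A", "B", "C", "D"]
cell_cycle = ["A", "B"]

score_c = process_score(network, "C", cell_cycle, genes)
score_d = process_score(network, "D", cell_cycle, genes)
print(score_c, score_d)
```

Dividing by the genome-wide mean keeps highly connected hub genes from being predicted for every process.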

33

Comprehensive validation of computational predictions

With David Hess, Amy Caudy

Genomic data, Prior knowledge

Computational Predictions of Gene Function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory Experiments: Petite frequency, Growth curves, Confocal microscopy

New known functions for correctly predicted genes

Retraining

34

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations

135 Under-annotations

82 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

35

Evaluating the performance of computational predictions

Genes involved in mitochondrion organization and biogenesis

106 Original GO Annotations

95 Under-annotations

40 Confirmed Under-annotations

80 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

36

Functional mapping: mining integrated networks

Predicted relationships between genes

High Confidence

Low Confidence

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

37

Functional mapping: mining integrated networks

Predicted relationships between genes

High Confidence

Low Confidence

Chemotaxis

38

Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Predicted relationships between genes

High Confidence

Low Confidence

Chemotaxis
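The two readings of edge strength on these slides (within one process: cohesiveness; between two processes: association) can be sketched as average edge weights divided by a genome-wide baseline. The ratio form, the baseline definition, and the edge weights below are assumptions for illustration.

```python
from itertools import combinations

def mean_weight(network, pairs):
    """Average edge weight over gene pairs; absent edges count as 0."""
    pairs = list(pairs)
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def baseline(network, genome):
    """Average edge weight over all gene pairs (the genomic background)."""
    return mean_weight(network, combinations(sorted(genome), 2))

def cohesiveness(network, process, genome):
    """Mean within-process edge weight relative to the genomic background."""
    within = mean_weight(network, combinations(sorted(process), 2))
    return within / baseline(network, genome)

def association(network, proc_a, proc_b, genome):
    """Mean edge weight between two processes relative to the background."""
    between = mean_weight(
        network, ((a, b) for a in proc_a for b in proc_b if a != b))
    return between / baseline(network, genome)

chemotaxis = {"cheA", "cheW"}
flagellar = {"fliC", "flgE"}
genome = chemotaxis | flagellar | {"lacZ"}

# Made-up confidences: strong within processes, moderate between them.
network = {frozenset(("cheA", "cheW")): 0.9, frozenset(("fliC", "flgE")): 0.8}
for a in chemotaxis:
    for b in flagellar:
        network[frozenset((a, b))] = 0.6
for g in chemotaxis | flagellar:
    network[frozenset((g, "lacZ"))] = 0.1

print(cohesiveness(network, chemotaxis, genome))
print(association(network, chemotaxis, flagellar, genome))
```

A value near 1 corresponds to the baseline (genomic background) in the slide legends, while values well above 1 indicate a very cohesive process or a strong association.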

39

Functional mapping: Associations among processes

Edges: Associations between processes

Very Strong

Moderately Strong

Hydrogen Transport

Electron Transport

Cellular Respiration

Protein Processing

Peptide Metabolism

Cell Redox Homeostasis

Aldehyde Metabolism

Energy Reserve Metabolism

Vacuolar Protein Catabolism

Negative Regulation of Protein Metabolism

Organelle Fusion

Protein Depolymerization

Organelle Inheritance

40

Functional mapping: Associations among processes

Edges: Associations between processes

Very Strong

Moderately Strong

Borders: Data coverage of processes

Well Covered

Sparsely Covered

Hydrogen Transport

Electron Transport

Cellular Respiration

Protein Processing

Peptide Metabolism

Cell Redox Homeostasis

Aldehyde Metabolism

Energy Reserve Metabolism

Vacuolar Protein Catabolism

Negative Regulation of Protein Metabolism

Organelle Fusion

Protein Depolymerization

Organelle Inheritance

41

Functional mapping: Associations among processes

Edges: Associations between processes

Very Strong

Moderately Strong

Nodes: Cohesiveness of processes

Below Baseline

Baseline (genomic background)

Very Cohesive

Borders: Data coverage of processes

Well Covered

Sparsely Covered

Hydrogen Transport

Electron Transport

Cellular Respiration

Protein Processing

Peptide Metabolism

Cell Redox Homeostasis

Aldehyde Metabolism

Energy Reserve Metabolism

Vacuolar Protein Catabolism

Negative Regulation of Protein Metabolism

Organelle Fusion

Protein Depolymerization

Organelle Inheritance

42

Functional mapping: Associations among processes

Edges: Associations between processes

Very Strong

Moderately Strong

Nodes: Cohesiveness of processes

Below Baseline

Baseline (genomic background)

Very Cohesive

Borders: Data coverage of processes

Well Covered

Sparsely Covered