Scalable data mining for functional genomics and metagenomics. Curtis Huttenhower, 12-02-10, Harvard School of Public Health, Department of Biostatistics


Page 1: Scalable data mining for functional genomics and metagenomics

Scalable data mining for functional genomics and metagenomics

Curtis Huttenhower

12-02-10, Harvard School of Public Health, Department of Biostatistics

Page 2

What tools enable biological discoveries?

Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results.

Page 3

Outline

1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia

Page 4

What’s metagenomics?

• Total collection of microorganisms within a community (also microbial community or microbiota)
• Total genomic potential of a microbial community
• Total biomolecular repertoire of a microbial community
• Study of uncultured microorganisms from the environment, which can include humans or other living hosts

Page 5

What to do with your metagenome?

(x10^10)

• Diagnostic or prognostic biomarker for host disease
• Public health tool monitoring population health and interactions
• Comprehensive snapshot of microbial ecology and evolution
• Reservoir of gene and protein functional information

Who’s there?

What are they doing?

What do functional genomic data tell us about microbiomes?

What can our microbiomes tell us about us?*

*Using terabases of sequence and thousands of experimental results

Page 6

The Human Microbiome Project

2007 - ongoing

• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
  • Skin: ears, inner elbows
  • Nasal cavity
  • Gut: stool
  • Vagina: introitus, mid, fornix
• Reference genomes (~200+800)

All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, antibiotic resistant infection…

Hamady, 2009

Kolenbrander, 2010

Page 7

Information provided by metagenomic assays

[Diagram labels: Microbiome data; 16S reads; WGS reads; Binning; Clustering; Taxa; Orthologous clusters; Pathways/modules; Functional roles; Pathway activity; Genomic data (Reference genomes); Functional data (Experimental models)]

Page 8

HMP: Data features

[Diagram labels: 16S reads; Orthologous clusters; Pathways/modules; Taxa; Genes (KOs); Pathways (KEGGs)]

Page 9

HMP Organisms: Everyone and everywhere is different

[Heatmap: rows are organisms (taxa), columns are body sites + individuals. Body sites: ear, gut, nose, mouth, vagina, arm, mucosa, palate, gingiva, tonsils, saliva, sub. plaq., sup. plaq., throat, tongue]

• Every microbiome is surprisingly different
• Most organisms are rare in most places
• Even common organisms vary tremendously in abundance among individuals
• Aerobicity, interaction with the immune system, and extracellular medium appear to be major determinants
• There are few, if any, organismal biotypes in health

Page 10

HMP: Metabolic reconstruction

[Diagram labels: WGS reads; Pathways/modules; Genes (KOs); Pathways (KEGGs)]

Functional seq.: KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS…

BLAST → Genes

[Equation garbled in extraction: gene abundance c(g), built from weights (1 - p) over reads r and alignments a and scaled by 1/|g|]

Genes → Pathways: MinPath (Ye 2009)

Smoothing: Witten-Bell

c(g) = \begin{cases}
  \dfrac{T N}{(V - T)(N + T)} & c(g) = 0 \\
  \dfrac{N \, c(g)}{N + T} & \text{otherwise}
\end{cases}

Gap filling

c(g) = max( c(g), median )

300 subjects, 1-3 visits/subject, ~6 body sites/visit

10-200M reads/sample, 100bp reads

BLAST

Taxonomic limitation: Rem. paths in taxa < ave.

Xipe: Distinguish zero/low (Rodriguez-Mueller in review)
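The two count adjustments named on this slide, Witten-Bell smoothing and gap filling, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not HUMAnN's actual code: the function names and the dictionary input are hypothetical, N, T, and V are read as total count, number of observed genes, and number of genes considered, and the smoothing is the textbook Witten-Bell estimator rescaled to counts.

```python
from statistics import median


def witten_bell(counts):
    """Witten-Bell smoothing in count space (assumed form).

    Observed genes are discounted by N / (N + T); the freed mass is
    shared equally among the V - T genes with zero count.
    """
    n = sum(counts.values())                      # N: total observed count
    t = sum(1 for c in counts.values() if c > 0)  # T: genes seen at least once
    v = len(counts)                               # V: all genes considered
    if t == 0 or t == v:
        # Nothing observed, or nothing unobserved: leave counts unchanged
        # (a choice made for this sketch).
        return dict(counts)
    unseen = t * n / ((v - t) * (n + t))
    return {g: (n * c / (n + t) if c > 0 else unseen)
            for g, c in counts.items()}


def gap_fill(counts):
    """Gap filling: c(g) = max(c(g), median) over the genes of one pathway."""
    med = median(counts.values())
    return {g: max(c, med) for g, c in counts.items()}


# Hypothetical KO counts for a single pathway.
pathway = {"K1": 6.0, "K2": 0.0, "K3": 2.0, "K4": 0.0}
smoothed = witten_bell(pathway)
filled = gap_fill(pathway)
```

In this form the smoothing preserves the total count, so it only redistributes mass from observed genes to unobserved ones.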

Page 11

HMP: Metabolic reconstruction

[Figure panels: Pathway coverage; Pathway abundance]

Page 12

HUMAnN: Evaluation on synthetic metagenomes

[Figure panels: High complexity, staggered, ≤90% identity; LC, stg.]

Page 13

HMP: Metabolic reconstruction

[Heatmap: Pathway abundance; rows are pathways, columns are samples]

Page 14

HMP: Metabolic reconstruction

[Heatmap: Pathway coverage; rows are pathways, columns are samples. Highlighted groups: Aerobic body sites; Gastrointestinal body sites; All body sites (“core”)]

Page 15

HMP: MetaCyc Coverage + Abundance

Page 16

HMP: Metabolism, host-microbiome interactions, and microbial taxa

>3200 gene families differential in the mucosa

>1500 upregulated outside the mucosa and not in any Actinobacterial genome

[Figure labels: 16S; WGS]

Page 17

Outline

1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia

Page 18

~2000

AML/ALL, Survival, Mutation

Gene expression

Batch effects

Functional modules

Page 19

~2005

Healthy/Diabetes, BMI, M/F

SNP genotypes

Population structure

LD

Page 20

2010

Healthy/IBD, Temperature, Location

Taxa & Orthologs

???

Niches & Phylogeny

• Test for correlates
• Multiple hypothesis correction
• Feature selection
• p >> n
• Confounds/stratification/environment
• Cross-validate
• Biological story?
• Independent sample
• Intervention/perturbation
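As one concrete instance of the multiple hypothesis correction step listed above, here is a short Python sketch of the Benjamini-Hochberg false discovery rate adjustment. The slide does not say which correction is used, so the choice of Benjamini-Hochberg is an assumption made for illustration, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per tested feature (taxon or ortholog).
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q <= 0.05 for q in qvalues]
```

With many more features than samples (p >> n), a correction of this kind is applied once across all tested features rather than separately per feature group.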

Page 21

LEfSe: Metagenomic class comparison and explanation

LEfSe: LDA + Effect Size

Coming soon to a URL near you!

Nicola Segata

Page 22

LEfSe: Evaluation on synthetic data

Page 23

LEfSe: The TRUC murine colitis microbiota

With Wendy Garrett

Page 24

MetaHIT: The gut microbiome and IBD

With Ramnik Xavier, Joshua Korzenik

124 subjects: 99 healthy, 21 UC + 4 CD

ReBLASTed against KEGG since published data obfuscates read counts

Qin 2010

[Diagram labels: WGS reads; Pathways/modules; Taxa (Phymm, Brady 2009); Genes (KOs); Pathways (KEGGs)]

Page 25

MetaHIT: Taxonomic CD biomarkers

[Figure labels: Firmicutes; Enterobacteriaceae; Up in CD; Down in CD; UC]

Page 26

MetaHIT: Functional CD biomarkers

Subset of enriched modules in CD patients

Subset of enriched pathways in CD patients

[Figure labels: Motility; Transporters; Sugar metabolism; Growth/replication; Down in CD; Up in CD]

Page 27

MetaHIT: Enzymes and metabolites over/under-enriched in the CD microbiome

[Figure labels: Transporters; Growth/replication; Motility; Sugar metabolism; Down in CD; Up in CD; Inferred metabolites; Enzyme families]

Page 28

Outline

1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia

Page 29

A computational definition of functional genomics

[Diagram labels: Genomic data; Prior knowledge]

Data → Function

Function → Function

Gene → Gene

Gene → Function

Page 30

A framework for functional genomics

[Diagram: 1Ks of datasets by 100Ms of gene pairs. Labeled gene pairs: G1-G2 (+), G4-G9 (+), …, G3-G6 (-), G7-G8 (-), …; unlabeled pair: G2-G5 (?). Example dataset rows: 0.9 0.7 … 0.1 0.2 … 0.8 and 0.8 0.5 … 0.05 0.1 … 0.6, with labels + - … - - … +. Per-dataset frequency histograms with bins: High Correlation / Low Correlation; High Similarity / Low Similarity; Let. / Not let.; Similar / Dissim.]

P(G2-G5|Data) = 0.85
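The integration idea on this slide, where labeled gene pairs give per-dataset bin frequencies that are then combined into a posterior such as P(G2-G5|Data), can be sketched as a naive Bayes classifier. This is a simplified illustration and not the talk's implementation: the two-bin discretization, the add-one smoothing, the conditional-independence assumption, and every name below are assumptions.

```python
from collections import Counter

BINS = ("high", "low")  # hypothetical discretization of a dataset's scores


def learn_likelihoods(labeled_pairs):
    """Estimate P(bin | label) for one dataset from labeled gene pairs.

    labeled_pairs: iterable of (bin, is_positive) tuples.
    Add-one smoothing keeps unseen bins from getting probability zero.
    """
    labeled_pairs = list(labeled_pairs)
    table = {}
    for label in (True, False):
        seen = Counter(b for b, y in labeled_pairs if y == label)
        total = sum(seen.values()) + len(BINS)
        table[label] = {b: (seen[b] + 1) / total for b in BINS}
    return table


def posterior(prior, likelihoods, evidence):
    """P(positive | evidence), treating datasets as conditionally independent.

    likelihoods: {dataset_name: table from learn_likelihoods}
    evidence:    {dataset_name: observed bin for the query gene pair}
    """
    p_yes, p_no = prior, 1.0 - prior
    for name, observed_bin in evidence.items():
        p_yes *= likelihoods[name][True][observed_bin]
        p_no *= likelihoods[name][False][observed_bin]
    return p_yes / (p_yes + p_no)


# Hypothetical training data for one expression dataset.
expr = learn_likelihoods([
    ("high", True), ("high", True), ("low", True),
    ("low", False), ("low", False), ("high", False),
])
p_related = posterior(0.5, {"expr": expr}, {"expr": "high"})
```

Because each dataset contributes one multiplicative factor, adding a dataset means learning one more small table, which is what lets the approach scale to thousands of datasets.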

Page 31

Functional network prediction and analysis

HEFalMp

Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases

[Figure panels: Global interaction network; Carbon metabolism network; Extracellular signaling network; Gut community network]

Page 32

Meta-analysis for unsupervised functional data integration

Evangelou 2007

Huttenhower 2006, Hibbs 2007

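z'' = \frac{1}{2} \log \frac{1 + \rho'}{1 - \rho'}

Simple regression: All datasets are equally accurate

Random effects: Variation within and among datasets and interactions

[Model equations for y_{e,i} garbled in extraction]

\hat{\mu}_e = \sum_i w^*_{e,i} \, y_{e,i}

w^*_{e,i} = \frac{1}{s_{e,i}^2 + \hat{\tau}_e^2}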
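The contrast between treating all datasets as equally accurate and a random-effects weighting can be illustrated with a small Python sketch. It is written under assumptions rather than taken from the talk: correlations are put on a common scale with the Fisher z-transform, each dataset's value is weighted by the inverse of its variance, and a between-dataset variance term (set by hand here, not estimated) is added for the random-effects case. The function names and the normalization by the sum of weights are choices made for this example.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def combine(values, variances, between_var=0.0):
    """Inverse-variance weighted mean of per-dataset values.

    between_var = 0 weights by within-dataset variance only; a positive
    between_var adds variation among datasets, as in a random-effects model.
    """
    weights = [1.0 / (s2 + between_var) for s2 in variances]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical correlations for one gene pair across three datasets,
# with hypothetical within-dataset variances.
zs = [fisher_z(r) for r in (0.8, 0.5, 0.1)]
within = [0.05, 0.10, 0.40]
fixed = combine(zs, within)
random_effects = combine(zs, within, between_var=0.2)
```

A larger between-dataset variance pulls the weights toward equality, so no single precise dataset dominates the combined score.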

Page 33

Meta-analysis for unsupervised functional data integration

Evangelou 2007

Huttenhower 2006, Hibbs 2007

z'' = \frac{1}{2} \log \frac{1 + \rho'}{1 - \rho'}

Page 34

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

Page 35

Unsupervised data integration: TB virulence and ESX-1 secretion

With Sarah Fortune

Graphle http://huttenhower.sph.harvard.edu/graphle/

[Figure annotation: X?]

Page 36

Sleipnir: Software for scalable functional genomics

Massive datasets require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3hrs.

Page 37

Outline

1. Metagenomics: Network models of microbial communities
2. Microbial biomarkers: Metagenomics in public health
3. Data mining: Integrating very large genomic data compendia

• Metagenomics: structure and function of microbial communities
• HMP: microbiome in health, 18 body sites in 300 subjects
• HUMAnN: metagenomic metabolic and functional pathway reconstruction

• Network framework for scalable data integration
• HEFalMp: human data integration
• Meta-analysis for unsupervised functional network integration

• LEfSe: biologically relevant community differences
• Iron and sugar transport as key players in the IBD microbiota

• Sleipnir: software for scalable genomic data mining

Page 38

Thanks!

Jacques Izard, Wendy Garrett

Pinaki Sarder, Nicola Segata

Levi Waldron, Larisa Miropolsky

http://huttenhower.sph.harvard.edu

Interested? We’re recruiting students and postdocs!

Human Microbiome Project

HMP Metabolic Reconstruction

George Weinstock, Jennifer Wortman, Owen White, Makedonka Mitreva, Erica Sodergren, Vivien Bonazzi, Jane Peterson, Lita Proctor

Sahar Abubucker, Yuzhen Ye

Beltran Rodriguez-Mueller, Jeremy Zucker, Qiandong Zeng

Mathangi Thiagarajan, Brandi Cantarel

Maria Rivera, Barbara Methe

Bill Klimke, Daniel Haft

Ramnik Xavier, Dirk Gevers

Bruce Birren, Mark Daly, Doyle Ward, Eric Alm, Ashlee Earl, Lisa Cosimi

Sarah Fortune

http://huttenhower.sph.harvard.edu/sleipnir
