CAMERA: A Metagenomics Resource for Microbial Ecology. Saul A. Kravitz, J. Craig Venter Institute, Rockville, Maryland, USA. KNAW Colloquium, May 29, 2008

CAMERA Presentation at KNAW ICoMM Colloquium May 2008


CAMERA Presentation by Saul Kravitz at KNAW ICoMM Colloquium May 2008 in Amsterdam, Netherlands. See http://camera.calit2.net


Page 1: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA: A Metagenomics Resource for Microbial Ecology

Saul A. Kravitz

J. Craig Venter Institute

Rockville, Maryland USA

KNAW Colloquium

May 29, 2008

Page 2: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Goals

• Introduce you to CAMERA

• Encourage you to use CAMERA

• What can CAMERA do for you?

Page 3: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Presentation Outline

• Introduction to Metagenomics

• Global Ocean Sampling (GOS) Expedition

• CAMERA Capabilities and Features

- Compute Resources

- Data Resources

- Tools Resources

• Looking Forward

Page 4: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Metagenomic Questions

• Within an environment

- What biological functions are present (absent)?

- What organisms are present (absent)?

• Compare data from (dis)similar environments

- What are the fundamental rules of microbial ecology?

• Adapting to environmental conditions?

- How?

- Evidence and mechanisms for lateral transfer

• Search for novel proteins and protein families

- And diversity within known families

Page 5: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Genomics vs Metagenomics

• Genomics – ‘Old School’

- Study of a single organism's genome

- Genome sequence determined using shotgun sequencing and assembly

- >1300 microbes sequenced, first in 1995

- DNA usually obtained from pure cultures (<1%)

• Metagenomics

- Application of genome sequencing methods to environmental samples (no culturing)

- Environmental shotgun sequencing is the most widely used approach

- Environmental metadata provides key context

Page 6: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Complexity of Microbial Communities

• Simple (e.g., AMD, gutless worm)

- Few species present (<10)

- Diverse

Variations on standard genomics techniques

• Complex (e.g., Soil or Marine)

- Many species present (>10, often >1000)

- Many closely related

New techniques

Page 7: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Global Ocean Sampling Expedition

Page 8: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Global Ocean Sampling (GOS)

• 178 Total Sampling Locations

- Phase 1: 7.7M reads, >6M proteins (3/07)

- Phase 2-IO: 2.2M reads (3/08)

- Phase 2: ~10M reads (future)

• Diverse Environments

- Open ocean, estuary, embayment, upwelling, fringing reef, atoll…

[Map of sampling routes; date labels: 4/04, 3/07, 3/08]

Page 9: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS: Sequence Diversity in the Ocean

• Most sequence reads are unique

- Very limited assembly

- Most sequences not taxonomically anchored

- Relating shotgun data to reference genomes

- Annotation challenging

• New Techniques Needed

- Fragment Recruitment

- Extreme Assembly to find pan genomes

- Sample to Sample Comparisons

Rusch et al (PLoS 2007)

Page 10: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Comparison of Dominant Ribotypes

Page 11: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Comparison of Total Genomic Content

Page 12: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS Protein Analysis

Yooseph et al (PLoS 2007)

• Novel clustering process

- Sequence similarity based

- Predict proteins and group into related clusters

- Include GOS and all known proteins

• Findings

- GOS proteins cover ~all existing prokaryotic families

- GOS proteins expand diversity of known protein families

- ~10% of large clusters are novel

- Many are of viral origin

- No saturation in the rate of novel protein family discovery

Page 13: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Added Protein Family Diversity

[Figure: Rubisco homologs; legend: New Groups, GOS prokaryotes, Known eukaryotes, Known prokaryotes]

Yooseph et al (PLoS 2007)

Page 14: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

GOS Viral Analysis (Williamson et al, PLoS ONE 2008)

• Study of dsDNA viruses from shotgun data

- 155k viral proteins identified from 37 GOS I sites (~2.5%)

- 59% of viral sequences were bacteriophage

• Viral acquisition and retention of host metabolic genes is common and widespread

- Viruses have made these genes “their own”

- Clade tightly with viral genes

• Codistribution of P-SSM4-like cyanophage and the dominant ecotype of Prochlorococcus in GOS samples.

Page 15: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Viral acquisition of host genes: talC Gene

[Figure legend: GOS Viral, Public Viral, GOS Bacterial, Public Bacterial, Public Euk]

Page 16: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Reference Genomes

• Overview

- 150+ reference marine microbes (101 released)

- Scaffold for GOS

- Sequenced, assembled, autoannotated

• Isolation Metadata

- Incomplete

• Bottlenecks

- Availability of DNA

- Purity of DNA

• Status and Data

- https://research.venterinstitute.org/moore/

Page 17: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Motivations for CAMERA

• Significant investment in sequencing

- Only accessible to bioinformatics elite

- Diversity of user sophistication and needs

• Bioinformatics and Computation Challenges

- Assembly, annotation, comparative analysis, visualization

- Dedicated compute resources

• Importance of Metadata

- Metadata required for environmental analysis

- Need to drive standards

• Compliance with Convention on Biological Diversity

Page 18: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Convention on Biological Diversity

• Sample in territorial waters?

- Country granted certain rights by CBD

- Sampling agreements may contain restrictions

• CAMERA users must acknowledge potential restrictions on commercial data use

• CAMERA maintains mapping of country-of-origin for all data objects

Page 19: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA – http://camera.calit2.net

• “Convenient acronym for cumbersome name…”

- Henry Nichols, PLoS Biology

• Mission

- Enable Research in Marine Microbiology

• Debuted March 2007


Page 20: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Capabilities

• Compute Resources

- 512 node compute grid + 200 Tb storage

• Data and Metadata Resources

- Annotated Metagenomic and genomic data

• Tools Resources

- Scalable BLAST

- Fragment Recruitment

- Metagenomic Annotation

- Text Search

Page 21: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

512 Processors ~5 Teraflops

~ 200 Terabytes Storage

CAMERA Compute and Storage Complex at UCSD/Calit2

Source: Larry Smarr, Calit2

Page 22: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Metagenomic Data Volume by Project

Page 23: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Metagenomic Samples

Page 24: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Users

>2000 Registered Since March 2007

Page 25: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Data Collections

• Metagenomic Sequence Collection

- Reads and assemblies w/associated metadata

- CAMERA-computed annotation

• Protein Clusters

- Maintaining clusters from Yooseph et al (Yooseph and Li, ’08)

• Genomic Data

- Viral, Fungal, pico-Eukaryotes, Microbial

- Moore Marine Genomes with Metadata

• Non-redundant sequence Collection

- Genbank, Refseq, Uniprot/Swissprot, PDB etc

Page 26: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Standardizing Contextual Metadata

• Genomic Standards Consortium

- Led by Dawn Field, NIEeS

- Members from EU, UK, US

• Goals are to promote

- Standardization of genomic descriptions

- Exchange & Integration of genomic data

• Metadata standardization key enabler

- MIMS: Min Info for Metagenomic Sample

- GCDML: Standard format

Page 27: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Contextual Metadata Challenges

• Researchers Need to Collect and Submit

• Relevant metadata depends on study – MIMS

- Specification of minimum metadata

• Standardize Exchange Format - GCDML

- Comprehensive and extensible

- Leverages Existing Ontologies, Validatable

And…

- Easy for a scientist to use...

• Need ongoing software support for tools

Page 28: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Core Metadata by Project

• De facto Core

- Latitude and Longitude

- Collection date

- Habitat and Geographic Location

• Missing metadata =
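The de facto core listed above can be checked mechanically before a sample is submitted. The sketch below is illustrative only: the field names, the ISO date convention, and the function `core_metadata_problems` are assumptions for this example, not CAMERA's or MIMS's actual schema.

```python
from datetime import date
from typing import Dict, List

# Hypothetical field names for the de facto core; not an official schema.
CORE_FIELDS = (
    "latitude",
    "longitude",
    "collection_date",
    "habitat",
    "geographic_location",
)


def core_metadata_problems(sample: Dict[str, object]) -> List[str]:
    """List missing or implausible core metadata values for one sample."""
    problems = [
        f"missing {name}" for name in CORE_FIELDS if sample.get(name) in (None, "")
    ]
    # Coordinates are assumed to be decimal degrees.
    for name, limit in (("latitude", 90.0), ("longitude", 180.0)):
        value = sample.get(name)
        if isinstance(value, (int, float)) and abs(value) > limit:
            problems.append(f"{name} out of range")
    # Dates are assumed to be ISO formatted (YYYY-MM-DD).
    collected = sample.get("collection_date")
    if isinstance(collected, str) and collected:
        try:
            date.fromisoformat(collected)
        except ValueError:
            problems.append("collection_date is not an ISO date")
    return problems
```

A sample that passes returns an empty list, so the same function can drive either a submission form or a batch report of which projects lack core metadata.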

Page 29: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA Contextual Metadata

Page 30: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

CAMERA 1.3

http://camera.calit2.net

Page 31: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Scalable BLAST with Metadata

• Large searches permitted and encouraged

• 454 FLX run vs “All Metagenomic”

• Some larger tblastx jobs have run >20 hrs

• 10kbp BLASTN vs All Metagenomic – 1 min

• BLAST XML or Tabular Export

• Searches against NRAA

• BLAST XML output feeds MEGAN

• Searches against ‘All Metagenomic’

• GUI with metadata

• Tabular with metadata
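A tabular export with metadata amounts to joining each BLAST hit to the sample its subject read came from. The sketch below assumes the standard 12-column tabular BLAST layout; the lookup tables `subject_to_sample` and `sample_metadata`, and the function `annotate_hits`, are hypothetical and do not describe CAMERA's actual export format.

```python
import csv
import io
from typing import Dict, List

# Standard 12-column tabular BLAST layout; CAMERA's own export may differ.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def annotate_hits(
    blast_tsv: str,
    subject_to_sample: Dict[str, str],
    sample_metadata: Dict[str, Dict[str, object]],
) -> List[Dict[str, object]]:
    """Attach sample-level contextual metadata to each tabular BLAST hit."""
    reader = csv.DictReader(
        io.StringIO(blast_tsv), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    annotated = []
    for row in reader:
        # Hits whose subject read has no known sample are kept, marked "unknown".
        sample_id = subject_to_sample.get(row["sseqid"], "unknown")
        context = sample_metadata.get(sample_id, {})
        annotated.append({**row, "sample": sample_id, **context})
    return annotated
```

Each returned row keeps the BLAST fields as strings and adds the sample identifier plus whatever metadata keys are available, so hits can then be grouped by habitat or location.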

Page 32: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Scalable BLAST with Metadata

Page 33: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Integration of Metadata and Data

Page 34: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Browsing Large Data Collections: Fragment Recruitment Viewer

• Microbial Communities vs Reference Genomes

- Millions of sequence reads vs Thousands of genomes

• Definition: A read is recruited to a sequence if:

- End-to-end blastN alignment exists

• Rapid Hypothesis Generation and Exploration

- How do cultured and wildtype genomes differ?

- Insertions, deletions, translocations

- Correlation with environmental factors

• Export sequence and annotation

• Credits: Doug Rusch and Michael Press
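The recruitment definition above can be sketched as a filter over blastN hits. This is a minimal illustration, not the viewer's implementation: the `Hit` record, the `max_overhang` tolerance, and the function names are assumptions, and the slide does not state how strictly "end-to-end" is enforced.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    read_id: str
    genome_id: str
    percent_identity: float
    q_start: int  # 1-based alignment start on the read
    q_end: int    # 1-based alignment end on the read
    s_start: int  # 1-based alignment start on the reference genome


def is_end_to_end(hit: Hit, read_length: int, max_overhang: int = 0) -> bool:
    """True if the alignment spans the read, allowing max_overhang unaligned bases per end."""
    lo, hi = sorted((hit.q_start, hit.q_end))
    return lo - 1 <= max_overhang and read_length - hi <= max_overhang


def recruit(
    hits: Iterable[Hit], read_lengths: Dict[str, int], max_overhang: int = 0
) -> Dict[str, List[Hit]]:
    """Group end-to-end hits by reference genome."""
    recruited: Dict[str, List[Hit]] = {}
    for hit in hits:
        if is_end_to_end(hit, read_lengths[hit.read_id], max_overhang):
            recruited.setdefault(hit.genome_id, []).append(hit)
    return recruited
```

Plotting `s_start` against `percent_identity` for each recruited hit gives the genomic position versus sequence similarity layout that the viewer displays.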

Page 35: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Fragment Recruitment Viewer

[Figure: recruitment plot; y-axis: Sequence Similarity; x-axis: Genomic Position]

Doug Rusch, JCVI

Page 36: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

[Figure: recruitment plot; y-axis: Sequence Similarity; x-axis: Genomic Position; panels: Annotation, Geographic Legend]

Page 40: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Prochlorococcus marinus str. MIT 9312

• Coloring by geography

• 80-95% identity cloud

• = GOS Indian Ocean

• Regions with no coverage

• Where?

• Real?

Page 41: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Mate Status Highlights Differences

• Paired end (mate) sequencing

• Coloring by mate status

• Highlights cultured vs metagenomic differences

• Selective display of

- Mates by status

- Reads by sample

Page 42: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Mate Pairs Highlight Variation

Page 43: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

What Genes are Involved

Page 44: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

View by Sample

Page 45: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

View by Sample

Filter by mate status

Page 46: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Annotation of Environmental Shotgun Data

• Gene Finding

- Using Yooseph’s Protein Clusters, and/or

- Metagene

• Functional Assignment

- Variation of JCVI prok annotation pipeline*

- Leverages protein cluster annotation -- soon

• Quality Nearly Comparable to Prokaryotic Genomic Annotation

Page 47: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Protein Clusters as Gene Finder

• Identification and soft mask of ncRNAs

• Naïve identification of ORFs (60aa min)

• Add peptides to clusters incrementally

- Yooseph and Li, 2008

• Predicted Genes based on ORFs in

- Clusters of sufficient size

- Clusters that satisfy additional filters
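The naïve ORF step above can be sketched as a six-frame translation that keeps every stop-free stretch of at least 60 amino acids. Several choices here are assumptions: the standard genetic code is used, no start codon is required (reads are fragments), and ncRNA masking is omitted. The function names are illustrative, not those of the CAMERA pipeline.

```python
from typing import List

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def naive_orfs(read: str, min_aa: int = 60) -> List[str]:
    """Return stop-free peptide stretches of at least min_aa residues, all six frames."""
    read = read.upper()
    reverse_complement = read.translate(COMPLEMENT)[::-1]
    peptides = []
    for strand in (read, reverse_complement):
        for frame in range(3):
            # '*' marks a stop codon, so splitting on it yields stop-free stretches.
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_aa:
                    peptides.append(segment)
    return peptides
```

Because overlapping frames are all reported, this step over-predicts by design; the cluster size and filter criteria in the last bullet are what turn candidate ORFs into predicted genes.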

Page 48: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Protein Clusters: Advantages and Disadvantages

• Weaknesses

- Homology-based

- Stateful (also a strength)

- Less sensitive (for now)

• Strengths

- More specific

- Transitive Annotation

- Learns over time

- Easy to maintain
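The "stateful" and "learns over time" properties can be illustrated with a toy incremental clusterer. This is not the Yooseph and Li method: k-mer Jaccard similarity against a cluster's first member stands in for real sequence comparison, and the class name, threshold, and minimum cluster size are all assumed values.

```python
from typing import List, Set


def kmers(peptide: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-mers in a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusters:
    """Greedy clustering that keeps its state between batches of peptides."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.representatives: List[Set[str]] = []  # k-mer profile of each cluster's first member
        self.members: List[List[str]] = []

    def add(self, peptide: str) -> int:
        """Place a peptide in the first sufficiently similar cluster, else start a new one."""
        profile = kmers(peptide)
        for index, representative in enumerate(self.representatives):
            if jaccard(profile, representative) >= self.threshold:
                self.members[index].append(peptide)
                return index
        self.representatives.append(profile)
        self.members.append([peptide])
        return len(self.members) - 1

    def supported(self, min_size: int = 2) -> List[str]:
        """Peptides in clusters of sufficient size, i.e. candidate predicted genes."""
        return [p for group in self.members if len(group) >= min_size for p in group]
```

Because the clusters persist, a peptide that was unsupported in one batch can become supported when a later dataset adds a similar sequence, which is the sense in which the approach is both stateful and improves over time.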

Page 49: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Search for Dehalogenase

Page 50: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Browse Clusters

Page 51: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Near Future

• More extensive data collection

• Summary views of data sets by

- Annotation

- Samples

- Mate Status

- Taxonomy

- Habitat and other contextual metadata

• 16S datasets?

Page 52: CAMERA Presentation at KNAW ICoMM Colloquium May 2008

Credits

• JCVI CAMERA Team

- Leonid Kagan, Michael Press, Todd Safford, Cristian Goina, Qi Yang, Sean Murphy, Jeff Hoover, Tanja Davidsen, Ramana Madupu, Sree Nampally, Nikhat Zhafar, Prateek Kumar

- Doug Rusch, Shibu Yooseph, Aaron Halpern*, Granger Sutton, Shannon Williamson

- Marv Frazier and Bob Friedman

• Calit2 CAMERA Team

- Adam Brust, Michael Chiu, Brian Fox, Adam Dunne, Kayo Arima

- Larry Smarr and Paul Gilna

http://camera.calit2.net