Metagenomics: the theory of assembly (and not only) Mihai Pop

Source: biology.umd.edu/uploads/2/7/8/0/27804901/metagenomics_theory.pdf


Metagenomics: the theory of assembly (and not only)

Mihai Pop

Metagenomics

• 1-5% of all bacteria can be cultured
  – standard microbiology gives us a skewed view of the world
• Culture-free approaches
  – 16S rRNA sequencing
  – random sequencing of entire population
• 16S rRNA sequencing
  – tells us about relative diversity of organisms
  – no information about what these organisms do
  – 16S is a multi-copy gene, so true abundances are difficult to estimate
• Random sequencing
  – potential to explore entire genomic content
  – requires deep sequencing (expensive)
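To make the multi-copy point concrete, here is a minimal sketch of a copy-number correction. The taxon names and copy numbers are hypothetical, and the function name is ours; real per-genome 16S copy numbers are often unknown for uncultured organisms, which is exactly why this is difficult in practice.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Turn 16S read counts into relative cell abundances.

    read_counts:  taxon -> number of 16S reads assigned to it
    copy_numbers: taxon -> assumed number of 16S copies per genome
    """
    # Dividing by the copy number estimates cells rather than gene copies.
    per_cell = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(per_cell.values())
    return {t: v / total for t, v in per_cell.items()}


# Hypothetical community: taxon_A carries 7 copies of the 16S gene, taxon_B one.
counts = {"taxon_A": 70, "taxon_B": 10}
copies = {"taxon_A": 7, "taxon_B": 1}

raw = {t: c / sum(counts.values()) for t, c in counts.items()}
corrected = copy_number_corrected_abundance(counts, copies)
print(raw)
print(corrected)
```

In this toy case the raw read fractions suggest taxon_A dominates, while the corrected estimate says both taxa are equally abundant.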

Why do we care?

• Bacteria are everywhere in the environment
• They are not all evil
• Bacteria can be quite useful
  – energy
  – bio-remediation
  – drug development
• Our bodies contain 1 order of magnitude more bacterial cells than human cells
  – critical to infant development (immune system, GI-tract)
  – provide essential nutrients (vitamin K, B12, essential amino-acids)
  – help digest complex molecules (starches, plant material)
  – imbalances in normal bacterial populations correlate with disease

• Human microbiome project - nihroadmap.nih.gov/hmp/


So...what are we looking for (17th century)?

So...what are we looking for (21st century)?

>F4BT0V001CZSIM rank=0000138 x=1110.0 y=2700.0 length=57

ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACGTCTG

>F4BT0V001BBJQS rank=0000155 x=424.0 y=1826.0 length=47

ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAA

>F4BT0V001EDG35 rank=0000182 x=1676.0 y=2387.0 length=44

ACTGACTGCATGCTGCCTCCCGTAGGAGTCGCCGTCCTCGACNC

>F4BT0V001D2HQQ rank=0000196 x=1551.0 y=1984.0 length=42

ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCGTCCCTCGAC

>F4BT0V001CM392 rank=0000206 x=966.0 y=1240.0 length=82

AANCAGCTCTCATGCTCGCCCTGACTTGGCATGTGTTAAGCCTGTAGGCTAGCGTTCATCCCTGAGCCAGGATCAAACTCTG

>F4BT0V001EIMFX rank=0000250 x=1735.0 y=907.0 length=46

ACTGACTGCATGCTGCCTCCCGTAGGAGTGTCGCGCCATCAGACTG

>F4BT0V001ENDKR rank=0000262 x=1789.0 y=1513.0 length=56

GACACTGTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG

>F4BT0V001D91MI rank=0000288 x=1637.0 y=2088.0 length=56

ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG

>F4BT0V001D0Y5G rank=0000341 x=1534.0 y=866.0 length=75

GTCTGTGACATGCTGCCTCCCGTAGGAGTCTACACAAGTTGTGGCCCAGAACCACTGAGCCAGGATCAAACTCTG

>F4BT0V001EMLE1 rank=0000365 x=1780.0 y=1883.0 length=84

ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAATGCTGCATGCTGCTCCCTGAGCCAGGATCAAACTCTG

Same or different?

• Is it real or noise?
• Is it the same 'organism' that I've seen before?
• What does it do?

Leeuwenhoek asked the same questions

Same or different in the sequence world

• Is it real or noise?
  – sequencing error correction
  – detection of experimental artifacts (contamination, chimeras, etc.)
• Is it the same organism I've seen before?
  – database searches
• What does it do?
  – more database searches
• Same broad analysis for 16S and Whole (meta)Genome Sequencing
  – the devil is in the details

What is a species?

• Concept is generally ill defined and impossible to define by sequence alone

From: Eur J Clin Microbiol Infect Dis (2012) 31:899–904

Same vs. different, 16S vs. WGS?

[Figure panels: 16S; WGS]


Metagenome assembly


(meta)genome assembly is impossible


actually...it's all about information

Conservation of information

data in → Algorithm → data out

I(in) >= I(out)

genome → Sequencing → reads → Assembler → assembly

I(genome) >= I(reads) >= I(assembly)

Mycoplasma genitalium, 25 bp reads

Kingsford et al., BMC Bioinformatics 2010

Read length matters

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

Read = 3 "words" (<= length of repeat)

[Graph nodes: it was the; best; worst; of times; age of; wisdom; foolishness]

it was the, was the best, was the worst, was the age, the age of, ...

Read length matters

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

Read = 6 "words" (> length of repeat)

[Graph nodes: it was the; best; worst; of times; age of; wisdom; foolishness]

it was the best of times it, times it was the worst of, times it was the age of, was the age of wisdom it, ...

Read length matters

[Figure panels: k = 50; k = 1,000; k = 5,000]


Read length matters...

Nagarajan, Pop. J. Comp. Biol. 2009, Kingsford et al., BMC Bioinformatics 2010

• Reads (much) longer than repeats – assembly trivial

• Reads roughly equal to repeats – assembly computationally difficult (NP-hard)

• Reads shorter than repeats – assembly undetermined

Number of possible reconstructions exponential in # of repeats
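The word example from the earlier slides can be checked by brute force. The sketch below enumerates every text that uses each k-word read exactly as often as it occurs in the original. It is a toy under strong assumptions that real data do not satisfy: reads are error-free, every k-word window is sampled, and the first k-1 words are known. The function name `reconstructions` is ours.

```python
from collections import Counter, defaultdict

TEXT = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")


def reconstructions(text, k):
    """Return every distinct text consistent with the multiset of k-word reads."""
    words = text.split()
    reads = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
    remaining = Counter(reads)        # reads still to be placed, with multiplicity
    successors = defaultdict(set)     # (k-1)-word prefix -> possible next words
    for read in remaining:
        successors[read[:-1]].add(read[-1])

    found = set()

    def extend(node, path, used):
        if used == len(reads):        # every read placed: one full reconstruction
            found.add(" ".join(path))
            return
        for word in sorted(successors.get(node, ())):
            read = node + (word,)
            if remaining[read] > 0:
                remaining[read] -= 1
                path.append(word)
                extend(read[1:], path, used + 1)
                path.pop()            # backtrack and try the next branch
                remaining[read] += 1

    start = tuple(words[:k - 1])      # assumption: the first k-1 words are known
    extend(start, list(start), 0)
    return found


short_reads = reconstructions(TEXT, 3)
long_reads = reconstructions(TEXT, 6)
print(len(short_reads), len(long_reads))
```

With 3-word reads the three loops through "it was the" can be visited in any order, giving six equally valid texts; with 6-word reads only the original survives.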

What are repeats?

[Figure panels: Isolate genome; Metagenome]

In metagenomes repeats are approximately genome-sized

Haplotype phasing with unknown number of haplotypes


Metagenomic questions

• What is the relative abundance of organism X versus organisms Y and Z?

• What proportion of organisms of type X have pathogenicity island P?

• Is pathogenicity island P only found in organism X or also in organisms Y and Z?

E. coli ETEC, EPEC, EAEC, EHEC, ...
Shiga toxin in Shigella or E. coli, ...

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair

[Graph nodes: it was the; best; worst; of times; age of; wisdom; foolishness; epoch of; belief; incredulity; season of; light; darkness; spring of hope; winter of despair]

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair
stable non-descript it was the likeliest thing upon
peculiarity was that it was the faintness of solitude
quite sure that it was the prisoner
Paris as it was the episcopal mode among
but it was the old scared lost look
moreover it was the spot to which
that night it was the fourteenth of August

it was the       17
that it was       2
the age of        2
the epoch of      2
the season of     2
was the age       2
was the epoch     2
was the season    2
was the spot      1
...

But... coverage doesn't work in metagenomics, single cell genomics, etc.
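The counting idea on this slide, and the reason it fails in a mixture, can be sketched in a few lines. The threshold rule (`factor` times the expected count) and the second "genome" are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

TEXT = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")


def word_kmer_counts(text, k=3):
    """Count every k-word window in the text."""
    words = text.split()
    return Counter(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))


def overrepresented(counts, expected, factor=1.5):
    """Flag k-mers seen more often than factor * expected as likely repeats."""
    return sorted(kmer for kmer, c in counts.items() if c > factor * expected)


counts = word_kmer_counts(TEXT)
print(overrepresented(counts, expected=1))

# A second, repeat-free "genome" that is 4x more abundant in the sample
# (hypothetical): each of its unique k-mers is also seen four times.
abundant = Counter({kmer: 4 * c for kmer, c in
                    word_kmer_counts("call me ishmael some years ago").items()})
mixed = counts + abundant
print(mixed["it was the"], mixed["call me ishmael"])
```

In the single text, high counts single out the repeats. In the mixture, a unique phrase from the abundant source has the same count as a true 4-copy repeat, so the count alone cannot tell them apart.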

Lack of coverage leads to errors

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

[Graph nodes: it was the; best; worst; of times; age of; wisdom; foolishness]

it was the worst of times it, times it was the worst of, times it was the age of, was the age of foolishness it

it was the worst of times it was the age of foolishness

Assembly is impossible

• Long reads (even 10 kbp) insufficient as repeats are as long as genomes (100s of kbp to Mbp)
• Errors impossible to avoid in low coverage genomes
• Computationally, assembly is very very hard

WHY BOTHER?

Assembly as compression

Stool sample SRS049995
  – in: 11.2 Gbp
  – out: 174 Mbp + 20 Mbp (unassembled reads)

[Diagram labels: Reads; Assembly; ORF calling; Metaphyler (shown twice); functional profiling; pathway analysis; storage; etc.]

Liu et al. BMC Genomics 2011, Treangen et al. in preparation
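As a quick check of the compression claim, the stool sample numbers on this slide work out as follows. The arithmetic uses only the figures quoted above; the variable names are ours.

```python
GBP = 1_000_000_000
MBP = 1_000_000

input_bp = 11.2 * GBP        # raw reads going in
contig_bp = 174 * MBP        # assembled contigs coming out
unassembled_bp = 20 * MBP    # reads the assembler could not place

output_bp = contig_bp + unassembled_bp
ratio = input_bp / output_bp

print(f"output: {output_bp / MBP:.0f} Mbp, compression: {ratio:.1f}x")
```

Downstream steps such as ORF calling or taxonomic profiling then operate on roughly 194 Mbp instead of 11.2 Gbp, a reduction of a bit under 60-fold.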

Interesting genomes

• Most microbes are not easily cultured and only known by 16S rDNA signature
  – e.g. RDP grew from ~80,000 (v. 10.4) to 2.1 million (v. 10.28)
  – only ~10,000 sequences from type strains
  – only ~150,000 sequences from isolate genomes
  – metagenomic assembly is only way to get the rest
• Clinical studies reveal interesting 16S patterns

OTU: Gammaproteobacteria;Pasteurellales;Pasteurellaceae;???
association with diarrhea p = 10^-135, 13-fold increase in cases


Strain structure matters

Metagenomic assembly: technical issues

Main challenges in metagenomic assembly

● Difficult to find repeats
  – coverage vs. over-representation
  – within-genome vs. across-genome repeats
● High genomic variation
  – sequencing experiment has ~10^15 cells, i.e., each read comes from a different cell
  – phages, transposons, etc. affect only a fraction of the population even in 'homogeneous' strains

Some solutions

● Coverage-independent repeat detection/removal
  – Bambus 2 scaffolder (local coverage, graph theoretic arguments)
  – IDBA-UD (local coverage, mate-pair information)
● Polymorphisms in the community
  – Bambus 2 scaffolder – preserve variants that don't 'tangle' the graph
  – IDBA-UD – 'smooth' out variants
  – MetaVelvet – attempt to decompose the graph, then assemble haploid genomes based on coverage concordance

Note: none of these features have been fully evaluated in a realistic setting.

Aside: Repeat resolution with mate-pairs

● Mate-pairs: pairs of sequencing reads separated by an approximately known distance
● Commonly generated in (double barrelled) shotgun sequencing experiments
● Key idea: find (unique) path through assembly graph consistent with length of a mate-pair

Caveats

● Finding path consistent with mate-pair length – NP-hard

● Heuristic: Mate-pair "useful" if shortest path between ends is consistent with mate-pair length

● A mate-pair "disambiguates" a section of the graph if there is a unique shortest path consistent with the mate-pair length
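The heuristic above can be sketched on a toy contig graph. This is not the Bambus 2 implementation: the graph encoding, the tolerance parameter, and the three labels returned are assumptions made for illustration. The gap is measured as the total length of the contigs strictly between the two ends, and Dijkstra's algorithm also counts how many distinct paths reach the minimum.

```python
import heapq


def shortest_gap(graph, length, src, dst):
    """Return (gap, n_paths): the smallest total length of the contigs strictly
    between src and dst, and how many distinct paths achieve it."""
    dist = {src: 0}
    count = {src: 1}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v in graph.get(u, ()):
            nd = d + length[v]            # pay for every contig we enter
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                count[v] = count[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                count[v] += count[u]      # another equally short path
    if dst not in dist:
        return None, 0
    return dist[dst] - length[dst], count[dst]   # do not count dst itself


def classify_mate_pair(graph, length, src, dst, expected_gap, tolerance):
    gap, n_paths = shortest_gap(graph, length, src, dst)
    if gap is None or abs(gap - expected_gap) > tolerance:
        return "not useful"
    return "disambiguates" if n_paths == 1 else "useful"


# Toy graph: contigs A and B are joined through either X (300 bp) or Y (800 bp).
graph = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
length = {"A": 1000, "X": 300, "Y": 800, "B": 1000}

print(classify_mate_pair(graph, length, "A", "B", expected_gap=300, tolerance=100))
print(classify_mate_pair(graph, length, "A", "B", expected_gap=800, tolerance=100))
```

The second call shows the price of the heuristic: a path through Y would fit an 800 bp gap, but because only the shortest path is examined, that mate-pair is discarded rather than used.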


To infinity and beyond

Key insight

● Non-trivial nodes can only be resolved by mate-pairs that span them tightly

● Idea: pick the most useful mate-pairs

Tuning leads to better assemblies

[Figure: standard libraries vs. tuned libraries; panel annotations: average improvement 47.52% and average improvement 82.7%]

Likelihood based assembly

● Find string of letters that maximizes likelihood of reads
● Common 'trick' in speech/language processing
● Main approach for quasi-species assembly
  – Shorah
  – Vispa
● 16S reconstruction
  – Emirge
● Metagenomic assembly
  – Genovo
● Won't cover it here but keep your ears open
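To give a flavor of what "maximizes likelihood of reads" means, here is a deliberately simple scoring sketch. It is not the model used by Shorah, Vispa, Emirge, or Genovo. It assumes a uniform substitution error rate, no indels, and reads equally likely to start at any position of the candidate; the sequences are made up.

```python
import math


def read_likelihood(read, candidate, error_rate=0.01):
    """P(read | candidate), averaging over all ungapped placements."""
    n_pos = len(candidate) - len(read) + 1
    if n_pos <= 0:
        return 0.0
    total = 0.0
    for start in range(n_pos):
        p = 1.0
        for r, c in zip(read, candidate[start:start + len(read)]):
            # A match costs (1 - e); a mismatch costs e / 3 (three wrong bases).
            p *= (1 - error_rate) if r == c else error_rate / 3
        total += p
    return total / n_pos


def log_likelihood(reads, candidate, error_rate=0.01):
    """Sum of log-likelihoods; assumes no read is longer than the candidate."""
    return sum(math.log(read_likelihood(r, candidate, error_rate)) for r in reads)


truth = "GATTACAGCC"
reads = [truth[i:i + 5] for i in range(len(truth) - 4)]   # error-free 5-mers
candidates = [truth, "GATTGCAGCC"]                        # second has one wrong base

best = max(candidates, key=lambda c: log_likelihood(reads, c))
print(best)
```

A real likelihood-based assembler has to search the space of candidate strings rather than choose between two given ones, which is where most of the difficulty lies.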

Assembly is just a small piece of the puzzle

● Taxonomic assessment
● Gene Finding
● Motif/variant detection
● etc.
● The individual analyses can feed into each other
  – taxonomic assessment can help define assembly strategy
  – gene finding can highlight errors
  – etc.


Example: MetaCompass – comparative assembler

Liu et al. in preparation

Aside: taxonomic classification with MetaPhyler

● WGS sequences classified against marker gene database (rpoB, recA, etc.)
● Blast-based classifier
● Different classifier built for each gene and taxonomic level
● Classifier automatically adjusts for alignment length
● Works with both DNA and protein data

metaphyler.cbcb.umd.edu

metAMOS

● Integrated pipeline for metagenomic assembly (mothur/Qiime for WGS analysis)
  – assembly, scaffolding, gene finding, taxonomic profiling, ...
  – builds upon other open-source tools
  – modular pipeline design using Ruffus
● Specialized metagenomic analyses (through Bambus 2)
  – coverage-independent repeat detection
  – genomic variant detection

Koren et al. Bioinformatics 2011, Treangen et al. Genome Biology 2013


Does it work?


The nitty gritty details

Introduction

• Metagenomic assembly is still an active area of research
• No metagenomic "sequencher" yet available/possible
• Data-sets can be huge
  – typical HMP data-set - ~8 Gbp
  – download time - > 1 hour
  – uncompressing (bzip2) - ~25 minutes
  – stringent alignment to reference (e.g. assembly) - > 25 minutes
• Assemblers need a lot of memory (>> 4 GB)
  – 26 CPU, 68 GB RAM - $2/hour @ Amazon EC2
  – 48 CPU, 64 GB RAM - ~$30 K @ Dell
  – 256 GB RAM - ~$50 K @ Dell
  (compare to Illumina instrument price)

What you need for assembly

• Sequencing reads
  – fasta
  – fastq (horrible format...but a standard)
  – .sra – even worse than fastq but favored by NCBI
• Library information
  – not always easy to figure out (lab estimates way off)
  – script provided (compute_mates.pl) for estimating library size from alignments to 16S rRNA

Preprocessing

• Guiding principle: Garbage In, Garbage Out
• In general: by throwing away data, assembly can only improve
  – quality trimming
  – removal of reads that are too short
  – removal of technical duplicates (artifact of many sequencing technologies)
  – removal of contaminant sequences (e.g. human DNA)
• Don't be shocked if > 20% of the data are thrown out
• Converting files is often challenging
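A minimal sketch of the first three preprocessing steps is shown below. It assumes well-formed 4-line FASTQ records with Phred+33 qualities, trims only the 3' end, and treats identical trimmed sequences as duplicates, which is a crude stand-in for real technical-duplicate detection. The thresholds and function names are illustrative; contaminant removal would additionally need an alignment step that is not shown.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue
        seq = next(it).strip()
        next(it)                          # the '+' separator line
        qual = next(it).strip()
        yield header[1:], seq, qual


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(records, min_q=20, min_len=30):
    seen = set()
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) < min_len:            # too short after trimming
            continue
        if seq in seen:                   # exact duplicate of a kept read
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


fastq = """@r1
ACGTACGT
+
IIIIII##
@r2
ACGTAC
+
IIIIII
@r3
ACG
+
I##
@r4
TTTTTTT
+
IIIIIII
""".splitlines()

kept = preprocess(parse_fastq(fastq), min_q=20, min_len=5)
print([name for name, _, _ in kept])
```

In this toy input half of the reads are discarded: r2 duplicates the trimmed r1 and r3 is too short once its low-quality tail is removed.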

Checking results

• Contiguity is just part of the story
• N50 doesn't make sense
  – better measure: size to xx Mbp, number to xx Mbp
• Errors need to be taken into account
  – hard to do without a reference
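The two measures can be compared on toy contig lengths. Our reading of "size to xx Mbp, number to xx Mbp" is: sort contigs from largest to smallest and report the contig length, and the number of contigs, at which a fixed cumulative size is reached. That interpretation, the function names, and the numbers below are assumptions.

```python
def n50(lengths):
    """Length of the contig at which half of the total assembly size is reached."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def size_and_number_at(lengths, target_bp):
    """(contig length, contig count) at which the largest contigs sum to target_bp."""
    running = 0
    for i, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target_bp:
            return length, i
    return None                           # the assembly never reaches the target


contigs = [100, 80, 60, 40, 20]
with_fragments = contigs + [10] * 30      # same assembly plus many tiny contigs

print(n50(contigs), n50(with_fragments))
print(size_and_number_at(contigs, 200), size_and_number_at(with_fragments, 200))
```

N50 depends on the total assembly size, which in a metagenome is driven by how much low-abundance material happens to be assembled. Here it drops from 80 to 20 when small fragments are added, while the fixed-size measure is unchanged.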


A bit about validation

What are errors?

• Chimeric contigs/scaffolds (due to repeats or mixed organisms)
• Incorrect consensus calls
• Missing information
  – contigs/scaffolds broken up unnecessarily
  – missing variants
• Software errors
  – 15-50 bugs/1000 lines of code
  – Celera Assembler – 300,000 loc (at that rate, roughly 4,500-15,000 bugs)

Model-based testing...2

[Diagram labels: Unknown Genome; Magic (biological, biochemical, biophysical, signal processing, etc.); Reads; Assembler (computational magic); Assembly; Model of Magic (biological, biochemical, biophysical, signal processing, etc.); Same?]

Modeling approach...aside

• Originally proposed by Gene Myers (early-mid 90s)
• Used in several metagenomic assemblers
  – Genovo (Laserson et al.)
  – Vispa (Westbrooks et al.)
  – Shorah (Zagordi et al.)
• Idea of single number reflecting assembly quality

Genovo 'Score denovo':

$\sum_i \mathrm{SWScore}_i - 2 \cdot \mathrm{length}(\mathrm{contigs}) + 2 \cdot \mathrm{minOvl} \cdot \mathrm{num}(\mathrm{contigs})$

Note: need to know where every read is placed/could be placed!
(information rarely produced by 'modern' assemblers)

ALE: Clark, S. et al. Bioinformatics 29(4): 435-443.
LAP: Ghodsi, M., et al. BMC Res Notes 6(1): 334.
CGAL: Rahman, A. and L. Pachter (2013). Genome Biol 14(1): R8.
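The Genovo score on this slide is simple to evaluate once the per-read alignment scores are available. The sketch below reads length(contigs) as the total contig length and num(contigs) as the number of contigs; that reading, the function name, and the numbers are assumptions, and this is not Genovo's own code.

```python
def score_denovo(sw_scores, contig_lengths, min_overlap):
    """Sum of per-read alignment scores, penalized by total contig length and
    credited 2 * min_overlap for each contig."""
    return (sum(sw_scores)
            - 2 * sum(contig_lengths)
            + 2 * min_overlap * len(contig_lengths))


# Hypothetical toy values: three reads, two contigs, minimum overlap of 2.
score = score_denovo(sw_scores=[10, 12, 8], contig_lengths=[5, 7], min_overlap=2)
print(score)
```

The dependence on the per-read scores is what the note above refers to: the score cannot be computed unless the assembler reports where every read was placed.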


Probabilities correlate with reference validation

Data from Assemblathon 1 Earl et al. Genome Research 2011, Ghodsi et al., 2013.


Sub-sampling based validation


Used to tune assembly parameters

Practical modeling - assembly invariants

• Basic principles (modulo errors)
  – overlapping reads must agree
  – mate-pairs must be consistently placed in assembly
  – coverage must match statistical process that generated the data
  – all reads must be used
  – assembly must be as contiguous as possible

• These assumptions mostly break/should be relaxed in metagenomic data

• Maximum likelihood approach should still work (though harder to interpret)

Schatz et al, Genome Biology 2008
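One of the invariants, consistent mate-pair placement, is easy to check once read placements are known. The sketch below flags mate-pairs whose span in the assembly deviates from the library mean by more than a chosen number of standard deviations. The data layout, the cutoff of three standard deviations, and the names are illustrative assumptions; in metagenomic data a flagged pair may reflect a real variant rather than a mis-assembly.

```python
def inconsistent_mates(placements, mean, sd, n_sd=3.0):
    """Return names of mate-pairs whose span is implausible for the library.

    placements: name -> (start of forward read, end of reverse read) on one contig
    mean, sd:   library insert size mean and standard deviation
    """
    flagged = []
    for name, (start_fwd, end_rev) in placements.items():
        span = end_rev - start_fwd
        if abs(span - mean) > n_sd * sd:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical placements for a library with 3000 +/- 100 bp inserts.
placements = {"mp1": (100, 3100), "mp2": (500, 3450), "mp3": (200, 9200)}
print(inconsistent_mates(placements, mean=3000, sd=100))
```

A cluster of stretched or compressed mate-pairs over the same region is the usual signature of an expanded or collapsed repeat.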

Conclusions

• Genome assembly, in general, is well studied, and very hard
• Key lesson: Garbage In Garbage Out (data more important than algorithm)
• Metagenomics offers valuable problems and information (e.g., scaffolding with cross-sample correlations)
• Key: formalize and tackle sub-problems of interest to biologists
  – gene identification/clustering
  – comparative analysis wrt reference
  – reconstruct specific organism (rather than entire metagenome)

• Validation and standards are critical