Metagenomics assembly: the good, the bad, and the ugly

Mihai Pop



Shotgun sequencing

[Figure: shearing → sequencing → assembly → original DNA (hopefully)]


Why assembly?

Since    Technology                             Read length            Throughput/run        Throughput/hour  Cost/run
1977-    Sanger sequencing                      1000-2000 bp           400-500 kbp in 4 hr   100 kbp          $200
2005-    454 pyrosequencing                     400 bp                 500 Mbp in 4 hr       >100 Mbp         $17,000
2006-    Illumina/Solexa                        50-100 bp              250 Gbp in 7-10 days  1 Gbp            $27,000
2007-    ABI SOLiD                              35-50 bp               6-20 Gbp in 3 days    75-250 Mbp       $3-5,000
2012-    Pacific Biosciences (single molecule)  ~10-20 kbp, 15% error  3 Mbp in 3 hours      1-3 Mbp          $2,500
2014ish  Oxford Nanopore (single molecule)      ?                      ?                     ?                ?

Viruses ~100 kbp
Bacteria ~1-5 Mbp
Most Eukarya ~100s of Mbp
Human ~3 Gbp


A tale of assembly

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness


Assembling two cities

the age of foolishness

best of times it

it was the best

it was the best

it was the age

it was the worst

was the best of

was the best of

it was the age

the best of times

of times it was

times it was the

was the worst of

the worst of times

worst of times it

of times it was

times it was the

of times it was

it was the age

was the age of

it was the age

was the age of

the age of wisdom

age of wisdom it

of wisdom it was

wisdom it was the


Greedy algorithm
• Compute all pairwise overlaps
• *Pick best (e.g. in terms of alignment score) overlap
• Join corresponding reads
• Repeat from * until no more joins possible

Basis for many popular assemblers: phrap, TIGR Assembler, CAP, etc.
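The four steps above can be sketched on the word "reads" from the previous slides. This is a minimal illustration, not the method used by phrap, TIGR Assembler, or CAP: it assumes error-free reads, uses overlap length in words as a stand-in for an alignment score, and breaks ties by taking the first best pair it finds. The names `overlap`, `greedy_assemble`, and `windows` are made up for this example.

```python
def overlap(a, b):
    """Longest suffix of a (in words) that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads):
    """Repeatedly join the pair of sequences with the best overlap."""
    # Work on word tuples; identical reads carry no extra information here.
    contigs = [tuple(r.split()) for r in dict.fromkeys(reads)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):          # compute all pairwise overlaps
            for j, b in enumerate(contigs):
                if i != j and overlap(a, b) > best_len:
                    best_len, best_i, best_j = overlap(a, b), i, j
        if best_len == 0:                        # no more joins possible
            return [" ".join(c) for c in contigs]
        # Join the best pair, then repeat with the reduced set.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


def windows(text, size):
    """All consecutive reads of `size` words from a text."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


TEXT = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
contigs = greedy_assemble(windows(TEXT, 4))
```

On a text without repeated words the sketch recovers the input; on the Dickens reads the result depends on how ties between equally good overlaps are broken, which is where a greedy strategy can go wrong.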


Greedy approach gets 'stuck'

the age of foolishness

best of times it

it was the best

it was the best

it was the age

it was the worst

was the best of

was the best of

it was the age

the best of times

of times it was

times it was the

the worst of times

worst of times it

of times it was

of times it was

it was the age

it was the age

was the age of

the age of wisdom

age of wisdom it

of wisdom it was

wisdom it was the


Graph-based approaches

best of times it

it was the best

it was the age

it was the worst

was the best of

the best of times

times it was the

of times it was

was the worst of

the worst of times

worst of times it

of times it was


(meta)genome assembly is impossible

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

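[Figure: graph in which the repeated segments "it was the", "of times" and "age of" are collapsed and connect "best", "worst", "wisdom" and "foolishness"]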

it was the best of times it was the age of wisdom it was the age of foolishness it was the worst of times
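The two sentences above can be compared directly. The sketch below counts every window of k consecutive words in each sentence and checks whether the two sets of "reads" are identical. It assumes the text is circular (the last word wraps around to the first), which is an assumption made for this example and not something the slide states; `circular_windows` is a made-up helper name.

```python
from collections import Counter


def circular_windows(sentence, k):
    """Multiset of all k-word windows, treating the sentence as circular."""
    words = sentence.split()
    n = len(words)
    return Counter(tuple(words[(i + j) % n] for j in range(k)) for i in range(n))


ORIGINAL = ("it was the best of times it was the worst of times "
            "it was the age of wisdom it was the age of foolishness")
REARRANGED = ("it was the best of times it was the age of wisdom "
              "it was the age of foolishness it was the worst of times")

# With 4-word reads the two sentences produce exactly the same reads.
same_with_4_word_reads = circular_windows(ORIGINAL, 4) == circular_windows(REARRANGED, 4)
# With 5-word reads some reads span the repeat "it was the" and tell them apart.
same_with_5_word_reads = circular_windows(ORIGINAL, 5) == circular_windows(REARRANGED, 5)
```

Under these assumptions no assembler can prefer one sentence over the other from 4-word reads alone, because the input data are the same for both.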


Mycoplasma genitalium, 25 bp reads

Kingsford et al., BMC Bioinformatics 2010


Puzzle
• 13 pieces
– 6,227,020,800 ways to order them
– 8,192 ways to split them in two layers
– 3,538,944 ways to arrange them into 6 "rows" of two pieces each and one with three
– ...
• Why is it hard?
– Constraint: need to fit everything in 8 in³: a 0.5x8x2 prism
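Two of the counts above, and the box volume, follow from short arithmetic. The snippet assumes that "ways to order" means permutations of 13 distinct pieces and that "split in two layers" means an independent top/bottom choice per piece; the third count is not rederived here.

```python
import math

orderings = math.factorial(13)   # permutations of 13 pieces
layer_splits = 2 ** 13           # each piece goes in one of two layers
box_volume = 0.5 * 8 * 2         # cubic inches of the 0.5 x 8 x 2 prism
```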


Some algorithms
• Greedy
– take longest piece and put it in the box
– fit as much as possible on same row
– repeat with the remaining longest piece
– etc...
• Pick through each of the possible orderings of pieces
– put pieces in order in the box as they fit
– eventually you'll hit on the right order...


A simpler solution

Violate implicit constraint – pieces must stay intact


3072 different solutions (I think)

http://www.elversonpuzzle.com/

$11.95 + S&H


Read length matters

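[Figure: graph in which the repeated segments "it was the", "of times" and "age of" are collapsed and connect "best", "worst", "wisdom" and "foolishness"]

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness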

Read = 6 “words” (> length of repeat)

it was the best of times it, times it was the worst of, times it was the age of, was the age of wisdom it, ...


Read length matters

[Figure panels: k = 50, k = 1,000, k = 5,000]

Does anyone see the mistake in the picture?


Read length matters...

Nagarajan, Pop. J. Comp. Biol. 2009, Kingsford et al., BMC Bioinformatics 2010

• Reads (much) longer than repeats – assembly trivial

• Reads roughly equal to repeats – assembly computationally difficult (NP-hard)

• Reads shorter than repeats – assembly undetermined

Number of possible reconstructions exponential in # of repeats
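The effect of read length on the number of reconstructions can be checked on the Dickens sentence. The sketch below builds a word-level de Bruijn-style graph from every k-word read and enumerates all sentences that use each read exactly as often as it occurs. It assumes error-free reads, complete coverage, a linear text, and a known starting point; `reconstructions` is a made-up name, and brute-force enumeration like this is only practical for toy inputs.

```python
from collections import Counter


def reconstructions(text, k):
    """All distinct sentences consistent with the k-word reads of `text`."""
    words = tuple(text.split())
    # Each k-word read is an edge; its count is how often it must be used.
    edges = Counter(words[i:i + k] for i in range(len(words) - k + 1))
    found = set()

    def extend(path, remaining):
        if remaining == 0:                       # every read has been used
            found.add(" ".join(path))
            return
        node = path[len(path) - (k - 1):]        # last k-1 words of the path
        candidates = [e for e in edges if edges[e] > 0 and e[:-1] == node]
        for edge in candidates:
            edges[edge] -= 1
            extend(path + edge[-1:], remaining - 1)
            edges[edge] += 1                     # backtrack

    extend(words[:k - 1], sum(edges.values()))
    return found


TEXT = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
short_reads = reconstructions(TEXT, 4)   # reads shorter than the longest repeat
long_reads = reconstructions(TEXT, 6)    # reads longer than the longest repeat
```

Under these assumptions the 4-word reads are consistent with six different sentences, while the 6-word reads leave only the original.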


What are repeats?

Isolate genome

Metagenome

In metagenomes repeats are approximately genome-sized

Haplotype phasing with unknown number of haplotypes


Metagenomic questions

• What is the relative abundance of organism X versus organisms Y and Z?

• What proportion of organisms of type X have pathogenicity island P?

• Is pathogenicity island P only found in organism X or also in organisms Y and Z?

E. coli ETEC, EPEC, EAEC, EHEC, ...

Shiga toxin in Shigella or E. coli, ...


What happens at low coverage?

• Can you solve the puzzle if you remove one or a few pieces?

• What can you say about the "solution"?

• How about a genome? What happens if you have gaps in coverage?


Lack of coverage leads to errors

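[Figure: graph in which the repeated segments "it was the", "of times" and "age of" are collapsed and connect "best", "worst", "wisdom" and "foolishness"]

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness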

it was the worst of times it, times it was the worst of, times it was the age of, was the age of foolishness it

it was the worst of times it was the age of foolishness

The log puzzle solution is not consistent with the actual correct solution


Mate-pairs don't really help


Assembly is impossible
• Long reads (even 10 kbp) insufficient as repeats are as long as genomes (100s of kbp to Mbp)
• Errors impossible to avoid in low coverage genomes
• Mate-pairs don't help
• Computationally, assembly is very very hard

WHY BOTHER?


Assembly as compression

Stool sample SRS049995
– in: 11.2 Gbp
– out: 174 Mbp + 20 Mbp (unassembled reads)

[Figure: workflow with boxes Reads, Assembly, Metaphyler (twice), ORF calling, functional profiling, pathway analysis, storage, etc.]

Liu et al. BMC Genomics 2011, Treangen et al. Gen. Biol. 2014


Interesting genomes

• Most microbes are not easily cultured and only known by 16S rDNA signature
– e.g. RDP grew from ~80,000 (v. 10.4) to 2.1 million (v. 10.28)
– only ~10,000 sequences from type strains
– only ~150,000 sequences from isolate genomes
– metagenomic assembly is the only way to get the rest

• Clinical studies reveal interesting 16S patterns – what do the genomes do?


Strain structure matters


Genes are easy...


Metagenomic assembly: technical issues


Main challenges in metagenomic assembly
● Difficult to find repeats
– coverage vs. over-representation
– within-genome vs. across-genome repeats
● High genomic variation
– sequencing experiment has ~10^15 cells, i.e., each read comes from a different cell
– phages, transposons, etc. affect only a fraction of the population even in 'homogeneous' strains


K-mer size and why it matters

● overlaps versus errors or variants

– if k is too large, reads cannot be "stitched" together

● repeats

– if k is too small, unrelated reads get linked to each other
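Both failure modes can be seen on three short reads. In a de Bruijn graph, two reads become connected when they share a (k-1)-mer, so the sketch below simply tests for a shared (k-1)-mer at different values of k. The reads are invented for this example (`LEFT` and `RIGHT` truly overlap by 4 bases, `UNRELATED` does not overlap either), and `linked` is a made-up helper, not a function from any assembler.

```python
def kmers(read, size):
    """Set of all substrings of length `size` in a read."""
    return {read[i:i + size] for i in range(len(read) - size + 1)}


def linked(read_a, read_b, k):
    """True if the reads share a (k-1)-mer, i.e. a de Bruijn graph node."""
    return bool(kmers(read_a, k - 1) & kmers(read_b, k - 1))


LEFT = "ACGTTGCA"
RIGHT = "TGCAGGAT"       # overlaps LEFT by the 4 bases TGCA
UNRELATED = "GGCATTCC"   # shares only very short substrings with LEFT

stitched_at_k5 = linked(LEFT, RIGHT, 5)       # overlap is long enough
stitched_at_k6 = linked(LEFT, RIGHT, 6)       # k too large: overlap is missed
spurious_at_k3 = linked(LEFT, UNRELATED, 3)   # k too small: chance match
spurious_at_k5 = linked(LEFT, UNRELATED, 5)   # larger k removes the chance match
```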


Repeat-detection/removal
● Bambus 2 scaffolder (local coverage, graph theoretic arguments)
● IDBA-UD (local coverage, mate-pair information)


Polymorphisms in community
● IDBA-UD – 'smooth' out variants
● MetaVelvet – attempt to decompose graph, then assemble haploid genomes based on coverage concordance
● MaryGold – detects some polymorphisms
● Anvi'o


Reference-guided assembly

● The entire read is used to determine placement – repeats less of an issue

● MetaCompass (https://github.com/marbl/MetaCompass)


Likelihood based assembly
● Find string of letters that maximizes likelihood of reads
● Common 'trick' in speech/language processing
● Main approach for quasi-species assembly
– Shorah
– Vispa
● 16S reconstruction
– Emirge
● Metagenomic assembly
– Genovo
● Won't cover it here but keep your ears open
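As a rough illustration of "likelihood of reads", the sketch below scores a candidate sequence under a deliberately simple generative model: each read starts at a uniformly random position and each base is miscalled independently with a fixed probability. This model, the default `error_rate`, and the function names are assumptions made for this example; it is not the model used by Shorah, Vispa, Emirge, or Genovo.

```python
import math


def read_log_likelihood(read, genome, error_rate=0.01):
    """log P(read | genome) under uniform start positions and independent errors."""
    n_starts = len(genome) - len(read) + 1       # assumes read is shorter than genome
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for r, g in zip(read, genome[start:start + len(read)]):
            # A miscalled base becomes one of the 3 other letters with equal chance.
            p *= (1 - error_rate) if r == g else error_rate / 3
        total += p
    return math.log(total / n_starts)


def assembly_log_likelihood(reads, genome, error_rate=0.01):
    """Sum of per-read log-likelihoods; higher means a better fit to the reads."""
    return sum(read_log_likelihood(r, genome, error_rate) for r in reads)


READS = ["ACGTA", "GTACG", "ACGGT", "CGGTC"]     # invented reads
good = assembly_log_likelihood(READS, "ACGTACGGTC")
bad = assembly_log_likelihood(READS, "ACGTTCGGTC")   # one base changed
```

A likelihood-based assembler would search over candidate strings for the one with the highest score; the search itself is the hard part and is not shown.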


New information: correlation across samples

Quince – Concoct

Borenstein – Metagenomic deconvolution


HMP 16S vs MetaHit MGS


HMP 16S vs MetaHit MGS

Catabacter hongkongiensis

Christensenella minuta

Christensenellaceae


Assembly is just a small piece of the puzzle
● Taxonomic assessment
● Gene Finding
● Motif/variant detection
● etc.
● The individual analyses can feed into each other
– taxonomic assessment can help define assembly strategy
– gene finding can highlight errors
– etc.


metAMOS
● Integrated pipeline for metagenomic assembly (mothur/Qiime for WGS analysis)
– assembly, scaffolding, gene finding, taxonomic profiling, ...
– builds upon other open-source tools
– modular pipeline design using Ruffus
● Specialized metagenomic/specialized analyses (through Bambus 2)
– coverage-independent repeat detection
– genomic variant detection

Koren et al. Bioinformatics 2011, Treangen et al. Genome Biology 2013


A bit about validation


What are errors?
• Chimeric contigs/scaffolds (due to repeats or mixed organisms)
• Incorrect consensus calls
• Missing information
– contigs/scaffolds broken up unnecessarily
– missing variants
• Software errors
– 15-50 bugs/1000 lines of code
– Celera Assembler – 300,000 loc


Checking results
• Contiguity is just part of the story
• N50 doesn't make sense
– better measure: size to xx Mbp, number to xx Mbp
• Errors need to be taken into account
– hard to do without a reference
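The two kinds of contiguity measure can be put side by side. N50 is defined relative to the total size of each assembly, so two assemblies of the same sample with different total sizes are not measured against the same yardstick. The sketch below reads "size to xx Mbp, number to xx Mbp" as the length of the contig at which the cumulative length first reaches a fixed target, and the number of contigs needed to get there; that reading and the function names are assumptions for this example.

```python
def n50(lengths):
    """Contig length at which half of the total assembly size is reached."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def size_and_number_to(lengths, target_bp):
    """(contig length, contig count) at which `target_bp` is first reached."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target_bp:
            return length, count
    return None   # the assembly never reaches the target


example = [100, 80, 60, 40, 20]   # invented contig lengths, in bp
```

Because the target is fixed in advance, `size_and_number_to` gives comparable numbers across assemblies with different total sizes.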


Practical modeling - assembly invariants
• Basic principles (modulo errors)
– overlapping reads must agree
– mate-pairs must be consistently placed in assembly
– coverage must match statistical process that generated the data
– all reads must be used
– assembly must be as contiguous as possible
• These assumptions mostly break/should be relaxed in metagenomic data

Schatz et al, Genome Biology 2008
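One of these invariants, consistent mate-pair placement, is easy to express as a check. The sketch below assumes a paired-end library in which mates face each other (forward mate first, reverse mate second) and an insert size described by a mean and standard deviation; the `PlacedMate` record, the labels returned, and the 3-standard-deviation cutoff are all choices made for this example rather than taken from Schatz et al.

```python
from dataclasses import dataclass


@dataclass
class PlacedMate:
    contig: str     # contig the read was placed on
    start: int      # leftmost coordinate of the read on the contig
    forward: bool   # True if placed on the forward strand
    length: int     # read length in bp


def check_mate_pair(left, right, mean_insert, sd_insert, n_sd=3.0):
    """Classify one mate pair against the library's expected geometry."""
    if left.contig != right.contig:
        return "different_contigs"
    first, second = sorted((left, right), key=lambda m: m.start)
    if not (first.forward and not second.forward):
        return "misoriented"
    span = second.start + second.length - first.start   # implied insert size
    if abs(span - mean_insert) > n_sd * sd_insert:
        return "bad_distance"
    return "satisfied"


good_pair = check_mate_pair(PlacedMate("c1", 100, True, 50),
                            PlacedMate("c1", 550, False, 50), 500, 30)
stretched = check_mate_pair(PlacedMate("c1", 100, True, 50),
                            PlacedMate("c1", 1550, False, 50), 500, 30)
```

Clusters of "misoriented" or "bad_distance" pairs at the same location are the kind of signal a validation tool could flag; a single violating pair may just be a library artifact.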


Model-based testing

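[Figure: an unknown genome passes through "magic" (biological, biochemical, biophysical, signal processing, etc.) to produce reads; the assembler ("computational magic") turns the reads into an assembly; a model of the magic is applied to the assembly and the result is compared with the reads ("Same?")]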


Modeling approach...aside

• Originally proposed by Gene Myers (early-mid 90s)
• Used in several metagenomic assemblers
– Genovo (Laserson et al.)
– Vispa (Westbrooks et al.)
– Shorah (Zagordi et al.)
• Idea of single number reflecting assembly quality

Genovo 'Score denovo':

$\sum_i \mathrm{SWScore}_i - 2 \cdot \mathrm{length}(\mathrm{contigs}) + 2 \cdot \mathrm{minOvl} \cdot \mathrm{num}(\mathrm{contigs})$

Note: need to know where every read is placed/could be placed! (information rarely produced by 'modern' assemblers)

ALE: Clark, S. et al. Bioinformatics 29(4): 435-443.
LAP: Ghodsi, M., et al. BMC Res Notes 6(1): 334.
CGAL: Rahman, A. and L. Pachter (2013). Genome Biol 14(1): R8.
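The score on this slide is simple arithmetic once the per-read alignment scores are available. The sketch below reads `length(contigs)` as the total length of all contigs and `num(contigs)` as the number of contigs; that reading, the function name, and the example numbers are assumptions for illustration and not taken from the Genovo code.

```python
def score_denovo(sw_scores, contig_lengths, min_overlap):
    """Sum of read alignment scores, penalised by assembly length,
    with a per-contig correction term based on the minimum overlap."""
    return (sum(sw_scores)
            - 2 * sum(contig_lengths)
            + 2 * min_overlap * len(contig_lengths))


# Invented values: three reads, two contigs, minimum overlap of 20.
example_score = score_denovo([90, 95, 100], [200, 150], 20)
```

The first term is why the note on the slide matters: without knowing where each read aligns to the assembly, the Smith-Waterman scores cannot be computed.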


https://github.com/marbl/VALET


Misassembly found by VALET


Conclusions
• Genome assembly, in general, is well studied, and very hard
• Key lesson: Garbage In Garbage Out (data more important than algorithm)
• Metagenomics offers valuable problems and information (e.g. scaffolding with cross-sample correlations)
• Key: formalize and tackle sub-problems of interest to biologists
– gene identification/clustering
– comparative analysis wrt reference
– reconstruct specific organism (rather than entire metagenome)
• Validation and standards are critical
