Error correction and assembly complexity of single molecule sequencing reads. (2014) Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, Michael Schatz http://www.biorxiv.org/content/biorxiv/early/2014/06/18/006395.full.pdf Presentation by Jennifer Shelton 2014 Figures from: http://www.biorxiv.org/content/biorxiv/early/2014/06/18/006395.full.pdf and https://pag.confex.com/pag/xxii/recordingredirect.cgi/id/1368

ABJC: ECTools error correction journal club slides


DESCRIPTION

Journal Club slides for ECTools PacBio read correction http://bioinformaticsk-state.blogspot.com/2014/09/ectools-error-correction.html. Paper at http://www.biorxiv.org/content/biorxiv/early/2014/06/18/006395.full.pdf


Page 1: ABJC: ECTools error correction journal club slides

Error correction and assembly complexity of single molecule sequencing reads. (2014)

Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, Michael Schatz

http://www.biorxiv.org/content/biorxiv/early/2014/06/18/006395.full.pdf

Presentation by Jennifer Shelton 2014

Figures from: http://www.biorxiv.org/content/biorxiv/early/2014/06/18/006395.full.pdf and https://pag.confex.com/pag/xxii/recordingredirect.cgi/id/1368

Page 2: ABJC: ECTools error correction journal club slides

ECTools: error correction of PacBio sequencing reads using pre-assembled Illumina sequences

Abstract conclusion: We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones.

Page 3: ABJC: ECTools error correction journal club slides

PacBio read length distribution

~10,000 bp PacBio average read length

Page 4: ABJC: ECTools error correction journal club slides

Genome assembly summary

de novo genome assembly summary:

1) "Reads are compared to each other to construct an assembly graph"

2) "The graph is then simplified"

3) The "graph is then traversed to reconstruct larger sequences"

Goal: entire chromosomes = individual contigs

de novo genome hurdles:

low coverage: "contigs can end due to gaps of coverage or errors"

high coverage: "repeats longer than the read length lead to “branches” in the assembly graph, effectively ending the contigs" (unstated hurdle: errors lead to branches with high coverage assemblies as well)
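The three assembly steps above can be illustrated with a toy overlap-graph assembler. This is a minimal sketch, not the method of any particular assembler: the function names (`overlap_len`, `build_overlap_graph`, `remove_transitive_edges`, `traverse`), the example reads, and the `min_overlap` value are all assumptions made for illustration, and the reads are error-free and single-stranded.

```python
def overlap_len(a, b, min_overlap):
    """Longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_overlap)."""
    for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap):
    """Step 1: compare reads to each other; edge a -> b stores the overlap length."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap_len(a, b, min_overlap)
                if k:
                    graph[a][b] = k
    return graph


def remove_transitive_edges(graph):
    """Step 2 (naive simplification): drop a -> c when a -> b and b -> c already exist."""
    redundant = set()
    for a, outs in graph.items():
        for b in outs:
            for c in graph[b]:
                if c in outs:
                    redundant.add((a, c))
    return {a: {b: k for b, k in outs.items() if (a, b) not in redundant}
            for a, outs in graph.items()}


def traverse(graph):
    """Step 3: follow unambiguous edges and merge the reads along each path into a contig."""
    in_degree = {r: 0 for r in graph}
    for outs in graph.values():
        for b in outs:
            in_degree[b] += 1
    # An edge a -> b is unambiguous when a has one way out and b has one way in.
    next_read = {a: next(iter(outs)) for a, outs in graph.items()
                 if len(outs) == 1 and in_degree[next(iter(outs))] == 1}
    has_predecessor = set(next_read.values())
    contigs = []
    for start in graph:
        if start in has_predecessor:
            continue  # this read is reached from another read; it is not a contig start
        contig, current = start, start
        while current in next_read:
            nxt = next_read[current]
            contig += nxt[graph[current][nxt]:]  # append only the non-overlapping tail
            current = nxt
        contigs.append(contig)
    return contigs


# Hypothetical error-free reads sampled every 2 bp from "ATGCGTACCTGA".
reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
graph = remove_transitive_edges(build_overlap_graph(reads, min_overlap=2))
print(traverse(graph))
```

In this sketch a read with more than one outgoing edge after simplification is never extended, which is how a "branch" caused by a repeat (or by an error) ends a contig.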

Page 5: ABJC: ECTools error correction journal club slides

Repeats

Probably best to trust empirical evidence above an assembly model from 1988: "100x coverage of 100bp reads, the human genome should assemble into contigs hundreds of gigabases long, far beyond the length of the genome itself"
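To see why such a model predicts implausibly long contigs at high coverage, here is a small sketch that assumes the 1988 model referred to is the Lander-Waterman model, in which the expected contig ("island") length is L * ((e^(c*sigma) - 1) / c + (1 - sigma)) for read length L, coverage c, and sigma = 1 - T/L, where T is the minimum detectable overlap. The function name and the value T = 20 bp are assumptions for illustration; the predicted length depends strongly on T.

```python
import math


def expected_contig_length(read_len, coverage, min_overlap):
    """Expected contig length in bp under Lander-Waterman statistics.

    Assumes error-free reads sampled uniformly from a repeat-free genome.
    """
    sigma = 1.0 - min_overlap / read_len  # fraction of a read not needed for overlap detection
    return read_len * ((math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma))


# min_overlap=20 is an illustrative assumption, not a value taken from the slides.
for coverage in (1, 5, 10, 30, 100):
    print(coverage, expected_contig_length(read_len=100, coverage=coverage, min_overlap=20))
```

The prediction grows exponentially with coverage because the model contains no repeats and no errors, so nothing stops a contig except a gap in coverage.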

Page 6: ABJC: ECTools error correction journal club slides

Repeats in rice

* The authors ascribe the failure of this model, and assembly difficulty, to repeats.

Agreed, repeats are an issue; however, the organisms they report on have repeat lengths that should be largely resolved by Illumina (e.g. rice repeats tend to be around 100 bp, with comparatively few > 1,000 bp). Note that this comes from a graph that doesn't show repeats < 100 bp.

Rice does seem to be a great target for 10,000 bp PacBio reads.

Page 7: ABJC: ECTools error correction journal club slides

Repeats and read length

Page 8: ABJC: ECTools error correction journal club slides

Repeats in other high quality assemblies (used to model repeat complexity)

10,000 bp PacBio average read length

Page 9: ABJC: ECTools error correction journal club slides

Revised assembly model

old model new model

Page 10: ABJC: ECTools error correction journal club slides

Revised assembly model

C2 chemistry

C3 chemistry

Page 11: ABJC: ECTools error correction journal club slides

Revised assembly model

C2 chemistry

C3 chemistry

Note: this model assumes error-free reads rather than the 10-15% error rate of PacBio reads.

Therefore, to see similar results one must plan either to use Illumina reads to error correct back down to a 1-3% error rate, or to use 50-100x PacBio coverage + HGAP.

Page 12: ABJC: ECTools error correction journal club slides

Revised assembly model

I think the authors are plotting HGAP assemblies. Where possible I added outlined stars for the same organism from the ECTools results

Page 13: ABJC: ECTools error correction journal club slides

pacBioToCA splits reads at regions with low Illumina coverage

Page 14: ABJC: ECTools error correction journal club slides

ECTools

Celera to assemble unitigs

All reads are in unitigs (if no unambiguous assembly is possible then unitig = read)

Nucmer to align unitigs to PacBio (PB) reads

Select the set of unitigs that best covers the PB read

This becomes a backbone for error correction

Use show-snps and "correct" the PB read where it disagrees with the backbone

Trim PB read ends where no unitig aligns

Internal regions where no unitig aligns are kept; base quality in these regions is set to 15% error

Set base quality to 1% error in regions where a unitig aligns

Break and iterate if final base quality is below the "identity parameter"

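The correction, trimming, and quality-assignment steps above can be sketched on a single read as follows. This is a simplified illustration and not the ECTools code: `UnitigHit`, `correct_read`, `phred`, and `mean_identity` are hypothetical names, the example read and hits are made up, and only substitutions are handled (a real show-snps report also contains insertions and deletions). How ECTools computes the value it compares to the identity parameter is not stated in the slides; the mean of the assigned per-base accuracies is used here as an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class UnitigHit:
    """One unitig alignment on the PB read, in half-open read coordinates."""
    start: int
    end: int
    # read position -> base found in the unitig (substitutions only, for illustration)
    substitutions: dict = field(default_factory=dict)


def correct_read(read, hits, aligned_error=0.01, unaligned_error=0.15):
    """Return (corrected sequence, per-base error probabilities) for one PB read."""
    if not hits:
        return read, [unaligned_error] * len(read)
    bases = list(read)
    error = [unaligned_error] * len(read)  # default: uncorrected, raw PacBio error rate
    for hit in hits:
        for pos, base in hit.substitutions.items():
            bases[pos] = base  # "correct" the read where it disagrees with the backbone
        for pos in range(hit.start, hit.end):
            error[pos] = aligned_error  # covered by a unitig
    # Trim read ends where no unitig aligns; uncovered internal regions are kept.
    left = min(h.start for h in hits)
    right = max(h.end for h in hits)
    return "".join(bases[left:right]), error[left:right]


def phred(p):
    """Convert an error probability to a rounded Phred quality score."""
    return round(-10 * math.log10(p))


def mean_identity(error):
    """Estimated identity of the corrected read: 1 minus the mean per-base error."""
    return 1.0 - sum(error) / len(error)


# Hypothetical read with two unitig hits and an uncovered internal gap at positions 9-11.
read = "TTACGTCAGTTTGGCATT"
hits = [UnitigHit(2, 9, {6: "A"}), UnitigHit(12, 16)]
corrected, error = correct_read(read, hits)
print(corrected)
print([phred(p) for p in error])
print(mean_identity(error))
```

A read whose `mean_identity` falls below a user-chosen threshold would be the candidate for the "break and iterate" step.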
Page 15: ABJC: ECTools error correction journal club slides

Assembly results

Page 16: ABJC: ECTools error correction journal club slides

Choosing a program based on PacBio coverage

Page 17: ABJC: ECTools error correction journal club slides

Assembly expectations from PAG 2014