Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute

Page 1: Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute

Page 2:

Outline of the Talk:

Euler Path and Sequence Reconstruction
Euler Hash Table
Read Extension Using Base Qualities and Read Pairs
Repeat Junctions and Single Base Variation
Assembly Results
Future Work

Page 3:

Sequence Reconstruction - Hamiltonian path approach

S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

ATG AGG TGC TCC GTC GGT GCA CAG

Vertices: k-tuples from the spectrum shown in red (8);
Edges: overlapping k-tuples (7);
Path: visiting all vertices corresponding to the sequence.
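
The Hamiltonian path idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes the k-tuples in the spectrum are all distinct, and the function name `hamiltonian_reconstruct` is made up for the example. Each k-tuple is a vertex, an edge joins two k-tuples that overlap by k-1 bases, and a depth-first search looks for a path that visits every vertex once.

```python
def hamiltonian_reconstruct(spectrum):
    """Search for a path that visits every k-tuple exactly once.

    Assumes all k-tuples in the spectrum are distinct.
    """
    # Edge u -> v when the last k-1 bases of u equal the first k-1 bases of v.
    succ = {u: [v for v in spectrum if v != u and u[1:] == v[:-1]]
            for u in spectrum}

    def extend(path, used):
        if len(path) == len(spectrum):
            return path
        for v in succ[path[-1]]:
            if v not in used:
                found = extend(path + [v], used | {v})
                if found:
                    return found
        return None

    for start in spectrum:
        path = extend([start], {start})
        if path:
            # Spell the sequence: first k-tuple, then the last base of each next one.
            return path[0] + "".join(v[-1] for v in path[1:])
    return None


spectrum = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
print(hamiltonian_reconstruct(spectrum))  # ATGCAGGTCC
```

The search is exponential in the worst case, which is the usual argument for preferring the Euler path formulation on larger spectra.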

Page 4:

Sequence Reconstruction - Euler path approach

Vertices: correspond to (k-1)-tuples (7);
Edges: correspond to k-tuples from the spectrum (8);
Path: visiting all EDGES corresponding to the sequence.

[Figure: graph on the vertices AT, GT, CG, CA, GC, TG, GG]

ATGCGTGGCA    ATGGCGTGCA

ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
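
As a rough sketch of the Euler path approach, the k-tuples can be turned into edges between (k-1)-tuples and traversed with Hierholzer's algorithm. The function name `euler_reconstruct` is hypothetical, and the code assumes the spectrum actually admits an Euler path. For the spectrum on this slide the graph has more than one Euler path, so the result is one of the two sequences shown above; which one depends on the order in which edges are taken.

```python
from collections import defaultdict


def euler_reconstruct(spectrum):
    """Spell a sequence by walking every k-tuple (edge) exactly once."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]  # edge between two (k-1)-tuples
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1

    nodes = set(indeg) | set(outdeg)
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1),
                 next(iter(graph)))

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(n[-1] for n in path[1:])


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(euler_reconstruct(spectrum))
```

Unlike the Hamiltonian search, this runs in time linear in the number of k-tuples, but the ambiguity between the two reconstructions remains and has to be resolved with extra information.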

Page 5:

SSAHA Type Hash Table

S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

Each entry gives the sequence index, the offset, and a link to the next k-tuple.

E    k-tuple   S1        S2        S3        S4
7    ATG       1,1,28              3,1,28    4,1,28
8    ATC                 2,1,29
10   AGT                                     4,5,38
11   AGG       1,5,42    2,4,42    3,6,42
19   TAG                           3,5,11
24   TTC                                     4,7,32
28   TGC       1,2,45              3,2,46    4,2,45
29   TCA                 2,2,51
32   TCC       1,8,-1    2,7,-1    3,9,-1    4,8,-1
38   GTT                                     4,6,24
40   GTC       1,7,32    2,6,32    3,8,32
42   GGT       1,6,40    2,5,40    3,7,40
45   GCA       1,3,51                        4,3,51
46   GCT                           3,3,53
51   CAG       1,4,11    2,3,11              4,4,10
53   CTA                           3,4,19
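
A table of this kind can be built with a short script. The sketch below is an illustration only: the slide does not state how E is computed, but the E values in the table are consistent with a base-4 code (A=0, T=1, G=2, C=3) plus one, so that mapping is used here as an inferred assumption. The names `encode` and `build_table` are invented for the example.

```python
# Inferred from the E column of the table, not stated on the slide.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(kmer):
    """Base-4 value of the k-tuple, plus one."""
    e = 0
    for base in kmer:
        e = e * 4 + CODE[base]
    return e + 1


def build_table(seqs, k=3):
    """Map E -> (k-tuple, [(index, offset, link to next E or -1), ...])."""
    table = {}
    for index, seq in enumerate(seqs, start=1):
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for offset, word in enumerate(words, start=1):
            # words is 0-based, so words[offset] is the k-tuple after this one.
            link = encode(words[offset]) if offset < len(words) else -1
            entry = table.setdefault(encode(word), (word, []))
            entry[1].append((index, offset, link))
    return table


seqs = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
for e, (word, hits) in sorted(build_table(seqs).items()):
    print(e, word, hits)
```

Under this encoding the k-tuple at offset 4 of S3 is CTA, with E = 53, which is also the link stored on the GCT row.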

Page 6:

Point to the Next - Hash Table Links

S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

Each entry gives the sequence index, the offset, and a link to the next k-tuple.

E    k-tuple   S1        S2        S3        S4
7    ATG       1,1,28              3,1,28    4,1,28
8    ATC                 2,1,29
10   AGT                                     4,5,38
11   AGG       1,5,42    2,4,42    3,6,42
19   TAG                           3,5,11
24   TTC                                     4,7,32
28   TGC       1,2,45              3,2,46    4,2,45
29   TCA                 2,2,51
32   TCC       1,8,-1    2,7,-1    3,9,-1    4,8,-1
38   GTT                                     4,6,24
40   GTC       1,7,32    2,6,32    3,8,32
42   GGT       1,6,40    2,5,40    3,7,40
45   GCA       1,3,51                        4,3,51
46   GCT                           3,3,53
51   CAG       1,4,11    2,3,11              4,4,10
53   CTA                           3,4,19
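
To make the role of the links concrete, the following sketch walks the "link to the next" field for one sequence and spells the sequence back out. The rows are copied from the S1 column of the table; the dictionary layout and the function name `walk` are assumptions made for the example.

```python
# Rows for sequence index 1, copied from the table: E -> (k-tuple, offset, link).
ROWS_S1 = {
    7: ("ATG", 1, 28),
    28: ("TGC", 2, 45),
    45: ("GCA", 3, 51),
    51: ("CAG", 4, 11),
    11: ("AGG", 5, 42),
    42: ("GGT", 6, 40),
    40: ("GTC", 7, 32),
    32: ("TCC", 8, -1),
}


def walk(rows, start):
    """Follow links from the starting E until the -1 terminator."""
    word, _, link = rows[start]
    seq = word
    while link != -1:
        word, _, link = rows[link]
        seq += word[-1]  # each step contributes one new base
    return seq


print(walk(ROWS_S1, 7))  # ATGCAGGTCC
```

A single column is enough here because every k-tuple of S1 occurs once; a k-tuple that repeats within a sequence would need the offset to pick the right entry.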

Page 7:

[Figure: a sequence containing three repeat copies, the reads covering it, and the resulting repeat graph]

Page 8:

Assembly Strategy

Genome/Chromosome
Forward-reverse paired reads: 30-40 bp each, known distance ~500 bp
Extend Solexa reads to long reads of 1-2 Kb
Capillary reads assembler: Phrap/Phusion

Page 9:

Kmer Extension & Walk

Page 10:

Quality Filters on Junctions

Page 11:

True Repeat Junctions

Page 12:

All Low Base Quality Case

Page 13:

Repetitive Contig and Read Pairs

For each hit read in the contig, contig index and offset are stored.

[Figure labels: Depth, Insert length, Current read position, Contig start, Pair read position]
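
One way to picture how stored contig indices and offsets could be used with read pairs is a simple consistency check. This is a hypothetical sketch, not the procedure from the talk: the read names, the `placements` layout, the function name, and the 100 bp tolerance are all assumptions, and the 500 bp default only echoes the approximate pair distance quoted on the strategy slide.

```python
def mate_supports_placement(placements, read, mate,
                            insert_len=500, tolerance=100):
    """True if read and mate lie in the same contig about insert_len apart.

    placements maps a read name to (contig index, offset).
    """
    if read not in placements or mate not in placements:
        return False
    contig_a, offset_a = placements[read]
    contig_b, offset_b = placements[mate]
    if contig_a != contig_b:
        return False
    return abs(abs(offset_a - offset_b) - insert_len) <= tolerance


# Hypothetical placements: read name -> (contig index, offset).
placements = {
    "r1/1": (3, 1200), "r1/2": (3, 1690),
    "r2/1": (3, 1200), "r2/2": (7, 40),
}
print(mate_supports_placement(placements, "r1/1", "r1/2"))  # True
print(mate_supports_placement(placements, "r2/1", "r2/2"))  # False
```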

Page 14:

Read Pairs to Resolve Repeat Junctions

Page 15:

Page 16:

Handling of Repeat Junctions

Page 17:

Handling of Single Base Variations

Page 18:

S. suis P1/7 Solexa Assembly

Solexa reads:
Number of reads: 3,084,185
Finished genome size: 2,007,491 bp
Read length: 39 and 36 bp
Estimated read coverage: ~40X
Estimated Kmer coverage: 14X
Number of vector reads: ?

Assembly features - contig stats:
Total number of contigs: 362
Total bases of contigs: 1,938,732 bp
N50 contig size: 10,849
Largest contig: 33,388
Average contig size: 5,356
Contig coverage over the genome: ~97 %
Contig extension errors: 1
Mis-assembly errors: 3
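
The contig statistics quoted on the results slides can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length at which contigs of that size or longer hold at least half of the assembled bases); the slides do not say which variant was used, so treat that as an assumption. The toy lengths are made up; only the last line uses figures from this slide.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def average_size(total_bases, n_contigs):
    """Average contig size, rounded to the nearest base."""
    return round(total_bases / n_contigs)


toy_lengths = [2, 3, 4, 5, 6]          # made-up contig lengths
print(n50(toy_lengths))                # 5
print(average_size(1_938_732, 362))    # 5356, matching the slide
```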

Page 19:

S. suis P1/7 Shredded Read Assembly

Shredded reads:
Number of reads: 1,338,161
Finished genome size: 2,007,491 bp
Read length: 36
Estimated read coverage: 24X
Insert size: 500 bp

Assembly features:
                            Paired_Data    Not_Paired
Number of contigs:          35             317
Total assembled bases:      1.996 Mb       1.956 Mb
N50 contig size:            243,039        13,929
Largest contig:             474,070        33,460
Average contig size:        57,043         6,168
Contig coverage:            >99.0 %        >99.0 %
Contig extension errors:    0              0
Mis-assembly errors:        3              2
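
The estimated read coverage on this slide follows from the usual relation coverage = (number of reads x read length) / genome size. A minimal check with the slide's own numbers (the function name is only illustrative):

```python
def read_coverage(n_reads, read_len, genome_size):
    """Average number of times each base is covered by a read."""
    return n_reads * read_len / genome_size


cov = read_coverage(1_338_161, 36, 2_007_491)
print(f"{cov:.1f}X")  # 24.0X
```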

Page 20:

S. Typhi 6979 Solexa Assembly

Solexa reads:
Number of reads: 5,142,190
Finished genome size: 4,809,037 bp
Read length: 41
Estimated read coverage: ~15X

Assembly features - contig stats:
Total number of contigs: 3,126
Total bases of contigs: 4,633,241 bp
N50 contig size: 2,460
Largest contig: 15,325
Average contig size: 1,482
Contig coverage over the genome: ~97.5 %
Mis-assembly errors: 0

Page 21:

404ty Solexa v 454 Detected Indels: 2341

[Figure: number of indels (y-axis, 0 to 2000) plotted against indel size (x-axis, 1 to 28)]

Page 22:

S. Typhi CT18 Shredded Read Assembly

Solexa reads:
Number of reads: 4,808,788
Finished genome size: 4,809,037 bp
Read length: 40
Estimated read coverage: 40X

Assembly features - contig stats:
Total number of contigs: 65
Total bases of contigs: 4,800,992 bp
N50 contig size: 158,460
Largest contig: 489,849
Average contig size: 73,861
Contig coverage over the genome: ~99.0 %
Mis-assembly errors: 3

Page 23:

PF_3D7 Shredded Read Assembly

Solexa reads:
Number of reads: 11,630,428
Finished genome size: 23.5 Mb
Read length: 40
Estimated read coverage: 20X

Assembly features - contig stats:
Total number of contigs: 29,313
Total bases of contigs: 17.17 Mb
N50 contig size: 1,355
Largest contig: 14,136
Average contig size: 585
Contig coverage over the genome: ~72.8 %
Mis-assembly errors: ?

Page 24:

Clone Level Assembly with Shredded Error Free Reads

Genome/Chromosome
Shred reads with given coverage
Forward-reverse paired reads: ~40 bp each, known distance ~500 bp
Organize reads into small groups covering clone (200 kb)
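
Shredding a finished sequence into error-free paired reads can be sketched as follows. This is an illustration under stated assumptions, not the tool used for the talk: fragment starts are drawn uniformly at random, the fragment length is fixed at the insert size, and the second read is taken as the reverse complement of the fragment's far end. The names `shred_pairs` and `revcomp` and the `seed` parameter are invented for the example.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def shred_pairs(genome, coverage, read_len=40, insert=500, seed=0):
    """Sample forward-reverse read pairs to reach the requested coverage."""
    rng = random.Random(seed)
    # Each pair contributes 2 * read_len bases of coverage.
    n_pairs = int(coverage * len(genome) / (2 * read_len))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        forward = fragment[:read_len]
        reverse = revcomp(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


# Toy usage on a random 2 kb sequence (made-up data).
rng = random.Random(1)
toy_genome = "".join(rng.choice("ACGT") for _ in range(2000))
pairs = shred_pairs(toy_genome, coverage=4)
print(len(pairs), pairs[0])
```

Grouping the resulting reads by clone, as the slide describes, would be a separate step on top of this.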

Page 25:

Human Chromosome X

Shredded reads:
Number of reads: 156 million
Chromosome length: 156 Mb
Number of Clones: 774
Read length: 40
Estimated read coverage: 40X

Assembly features - contig stats:
Total number of contigs: 28,204
Total bases of contigs: 148 Mb
N50 contig size: 30,968
Largest contig: 173,157
Average contig size: 5,254

Page 26:

Zebrafish Chromosome 5

Shredded reads:
Number of reads: 70.2 million
Chromosome length: 70.3 Mb
Number of Clones: 351
Read length: 40
Estimated read coverage: 40X

Assembly features - contig stats:
Total number of contigs: 22,405
Total bases of contigs: 67.5 Mb
N50 contig size: 9,587
Largest contig: 70,757
Average contig size: 3,012

Page 27:

Plasmodium Chr14

Shredded reads:
Number of reads: 3.2 million
Chromosome length: 3.29 Mb
Number of Clones: 16
Read length: 40
Estimated read coverage: 40X

Assembly features - Original data:
Total number of contigs: 1,960
Total bases of contigs: 2.86 Mb
N50 contig size: 2,924
Largest contig: 18,366
Average contig size: 1,461

Assembly features - Replacing “TATATA…”:
Total number of contigs: 1,333
Total bases of contigs: 3.05 Mb
N50 contig size: 4,596
Largest contig: 23,345
Average contig size: 2,287

Page 28:

Acknowledgements:

Ian Goodhead and Chris Clee
James Bonfield
Yong Gu and Adam Spargo
Daniel Zerbino (EBI)
Tony Cox
Richard Durbin