Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities
Zemin Ning
The Wellcome Trust Sanger Institute
Outline of the Talk:
Euler Path and Sequence Reconstruction
Euler Hash Table
Read Extension Using Base Qualities and Read Pairs
Repeat Junctions and Single Base Variation
Assembly Results
Future Work
Sequence Reconstruction - Hamiltonian path approach
S=(ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
Spectrum: ATG AGG TGC TCC GTC GGT GCA CAG
Vertices: k-tuples from the spectrum shown in red (8); Edges: overlapping k-tuples (7); Path: visiting all vertices corresponding to the sequence.
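The Hamiltonian formulation can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the graph is small enough for a plain depth-first search; the function names (`spectrum`, `overlap_edges`, `hamiltonian_sequences`) are hypothetical and not taken from the talk.

```python
def spectrum(seq, k):
    """Distinct k-tuples of seq, in order of first occurrence."""
    seen = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer not in seen:
            seen.append(kmer)
    return seen


def overlap_edges(kmers):
    """Edges u -> v where the last k-1 bases of u equal the first k-1 of v."""
    return [(u, v) for u in kmers for v in kmers if u != v and u[1:] == v[:-1]]


def hamiltonian_sequences(kmers):
    """Spell every path that visits each vertex exactly once (plain DFS)."""
    succ = {}
    for u, v in overlap_edges(kmers):
        succ.setdefault(u, []).append(v)
    found = []

    def walk(path):
        if len(path) == len(kmers):
            # First k-tuple plus the last base of every following k-tuple.
            found.append(path[0] + "".join(v[-1] for v in path[1:]))
            return
        for v in succ.get(path[-1], []):
            if v not in path:
                walk(path + [v])

    for start in kmers:
        walk([start])
    return found


kmers = spectrum("ATGCAGGTCC", 3)
print(len(kmers), len(overlap_edges(kmers)))  # 8 7
print(hamiltonian_sequences(kmers))           # ['ATGCAGGTCC']
```

The exhaustive search is exponential in the number of vertices in general, which is the usual motivation for moving to the Euler path formulation.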
Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7); Edges: correspond to k-tuples from the spectrum (8); Path: visiting all EDGES corresponding to the sequence.
[Figure: graph on the vertices AT, GT, CG, CA, GC, TG, GG]
Sequences: ATGCGTGGCA, ATGGCGTGCA
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
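The Euler formulation can be sketched the same way: k-tuples become edges between (k-1)-tuples, and a path must use every edge once. The backtracking search below is an illustrative assumption (a production assembler would use a linear-time Euler path algorithm), and `euler_sequences` is a hypothetical name.

```python
from collections import Counter


def euler_sequences(kmers):
    """Spell every path that uses each k-tuple (edge) exactly once."""
    k = len(kmers[0])
    remaining = Counter(kmers)  # edge -> number of copies still unused
    found = set()

    def walk(seq, left):
        if left == 0:
            found.add(seq)
            return
        suffix = seq[-(k - 1):]  # current vertex, a (k-1)-tuple
        for base in "ACGT":
            edge = suffix + base
            if remaining[edge] > 0:
                remaining[edge] -= 1
                walk(seq + base, left - 1)
                remaining[edge] += 1

    for start in sorted(set(kmers)):
        remaining[start] -= 1
        walk(start, len(kmers) - 1)
        remaining[start] += 1
    return sorted(found)


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
vertices = {x[:-1] for x in kmers} | {x[1:] for x in kmers}
print(len(vertices), len(kmers))  # 7 8
print(euler_sequences(kmers))     # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both sequences on the slide come out of the same set of eight edges, which shows that the spectrum alone does not always determine the sequence.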
SSAHA Type Hash Table
S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)
E    k-tuple   Indices, offsets and links to the next
7    ATG       1,1,28   3,1,28   4,1,28
8    ATC       2,1,29
10   AGT       4,5,38
11   AGG       1,5,42   2,4,42   3,6,42
19   TAG       3,5,11
24   TTC       4,7,32
28   TGC       1,2,45   3,2,46   4,2,45
29   TCA       2,2,51
32   TCC       1,8,-1   2,7,-1   3,9,-1   4,8,-1
38   GTT       4,6,24
40   GTC       1,7,32   2,6,32   3,8,32
42   GGT       1,6,40   2,5,40   3,7,40
45   GCA       1,3,51   4,3,51
46   GCT       3,3,53
51   CAG       1,4,11   2,3,11   4,4,10
53   CTA       3,4,19
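A table of this kind can be rebuilt from the four example sequences. The sketch below rests on two assumptions inferred from the listed values rather than stated in the talk: the E value is a base-4 number with A=0, T=1, G=2, C=3, plus one, and each entry is (sequence index, offset, E value of the next k-tuple in the same sequence), with -1 marking the end. The names `encode` and `build_table` are hypothetical.

```python
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}  # assumed base ordering


def encode(kmer):
    """Base-4 value of the k-tuple, shifted by one so that AAA maps to 1."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value + 1


def build_table(reads, k=3):
    """Map E -> (k-tuple, [(read index, offset, E of next k-tuple or -1)])."""
    table = {}
    for index, read in enumerate(reads, start=1):
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for offset, kmer in enumerate(kmers, start=1):
            # offset is 1-based, so kmers[offset] is the following k-tuple.
            link = encode(kmers[offset]) if offset < len(kmers) else -1
            entry = table.setdefault(encode(kmer), (kmer, []))
            entry[1].append((index, offset, link))
    return dict(sorted(table.items()))


reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
for e_value, (kmer, hits) in build_table(reads).items():
    print(e_value, kmer, *(",".join(map(str, hit)) for hit in hits))
```

Following the third field from entry to entry walks along one sequence without going back to the reads, which is what makes the links useful for k-mer extension.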
Point to the Next - Hash Table Links
Sequence Repeat Graph
[Figure labels: Repeat (three copies), reads]
Assembly Strategy
Genome/Chromosome
Forward-reverse paired reads: 30-40 bp each, known distance ~500 bp
Extend Solexa reads to long reads of 1-2 Kb
Capillary reads assembler: Phrap/Phusion
Kmer Extension & Walk
Quality Filters on Junctions
True Repeat Junctions
All Low Base Quality Case
Repetitive Contig and Read Pairs
For each hit read in the contig, contig index and offset are stored.
[Figure labels: Depth, Insert length, Current read position, Contig start, Pair read position]
Read Pairs to Resolve Repeat Junctions
Handling of Repeat Junctions
Handling of Single Base Variations
S Suis P1/7 Solexa Assembly
Solexa reads:
  Number of reads: 3,084,185
  Finished genome size: 2,007,491 bp
  Read length: 39 and 36 bp
  Estimated read coverage: ~40X
  Estimated Kmer coverage: 14X
  Number of vector reads: ?
Assembly features - contig stats:
  Total number of contigs: 362
  Total bases of contigs: 1,938,732 bp
  N50 contig size: 10,849
  Largest contig: 33,388
  Averaged contig size: 5,356
  Contig coverage over the genome: ~97 %
  Contig extension errors: 1
  Mis-assembly errors: 3
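The contig statistics can be reproduced from a list of contig lengths. The sketch below assumes the conventional N50 definition (the length of the contig at which the running total, taken from the largest contig down, first reaches half of the assembled bases); the talk does not say which definition was used, and `n50` is a hypothetical helper.

```python
def n50(lengths):
    """Contig length at which the sorted running total reaches half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy example: 54 bases in total, half is reached after 10 + 9 + 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8

# The averaged contig size on the slide is total bases / number of contigs.
print(round(1_938_732 / 362))  # 5356
```

N50 is less sensitive than the average to a long tail of very short contigs, which is why both figures are usually reported together.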
S Suis P1/7 Shredded Read Assembly
Shredded reads:
  Number of reads: 1,338,161
  Finished genome size: 2,007,491 bp
  Read length: 36
  Estimated read coverage: 24X
  Insert size: 500 bp
Assembly features:
                            Paired_Data   Not_Paired
  Number of contigs:        35            317
  Total assembled bases:    1.996 Mb      1.956 Mb
  N50 contig size:          243,039       13,929
  Largest contig:           474,070       33,460
  Averaged contig size:     57,043        6,168
  Contig coverage:          >99.0 %       >99.0 %
  Contig extension errors:  0             0
  Mis-assembly errors:      3             2
STyphi 6979 Solexa Assembly
Solexa reads:
  Number of reads: 5,142,190
  Finished genome size: 4,809,037 bp
  Read length: 41
  Estimated read coverage: ~15X
Assembly features - contig stats:
  Total number of contigs: 3,126
  Total bases of contigs: 4,633,241 bp
  N50 contig size: 2,460
  Largest contig: 15,325
  Averaged contig size: 1,482
  Contig coverage over the genome: ~97.5 %
  Mis-assembly errors: 0
404ty Solexa v 454 Detected Indels: 2341
[Figure: number of indels (0 to 2000) plotted against indel size (1 to 28)]
STyphi CT18 Shredded Read Assembly
Solexa reads:
  Number of reads: 4,808,788
  Finished genome size: 4,809,037 bp
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 65
  Total bases of contigs: 4,800,992 bp
  N50 contig size: 158,460
  Largest contig: 489,849
  Averaged contig size: 73,861
  Contig coverage over the genome: ~99.0 %
  Mis-assembly errors: 3
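For the shredded data sets the quoted read coverage matches the usual estimate, number of reads times read length divided by genome size. The short check below assumes that formula; `read_coverage` is a hypothetical helper, not part of any tool named in the talk.

```python
def read_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_length / genome_size


# STyphi CT18 shredded reads: 4,808,788 reads of 40 bp over 4,809,037 bp.
print(round(read_coverage(4_808_788, 40, 4_809_037)))   # 40

# PF_3D7 shredded reads: 11,630,428 reads of 40 bp over 23.5 Mb.
print(round(read_coverage(11_630_428, 40, 23_500_000)))  # 20
```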
PF_3D7 Shredded Read Assembly
Solexa reads:
  Number of reads: 11,630,428
  Finished genome size: 23.5 Mb
  Read length: 40
  Estimated read coverage: 20X
Assembly features - contig stats:
  Total number of contigs: 29,313
  Total bases of contigs: 17.17 Mb
  N50 contig size: 1,355
  Largest contig: 14,136
  Averaged contig size: 585
  Contig coverage over the genome: ~72.8 %
  Mis-assembly errors: ?
Clone Level Assembly with Shredded Error Free Reads
Genome/Chromosome
Organize reads into small groups covering clone: 200 kb
Shred reads with given coverage
Forward-reverse paired reads: ~40 bp each, known distance ~500 bp
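Shredding a finished sequence into error-free forward-reverse pairs can be sketched as follows. Several details are assumptions made for illustration only: fragment starts are evenly spaced rather than random, the insert size is fixed at exactly 500 bp, and the reverse read is reported as the reverse complement of the fragment end. `shred_pairs` and `reverse_complement` are hypothetical names.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def shred_pairs(reference, read_len=40, insert=500, coverage=40):
    """Return (start, forward read, reverse read) tuples at the given coverage."""
    if len(reference) < insert:
        raise ValueError("reference is shorter than the insert size")
    # Each pair contributes 2 * read_len bases towards the coverage target.
    n_pairs = int(coverage * len(reference) // (2 * read_len))
    step = (len(reference) - insert) / max(n_pairs - 1, 1)
    pairs = []
    for i in range(n_pairs):
        start = round(i * step)
        fragment = reference[start:start + insert]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((start, forward, reverse))
    return pairs


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
pairs = shred_pairs(reference)
print(len(pairs) * 2)  # 2000 reads of 40 bp over 2,000 bp, i.e. 40X
```

With 40 bp reads at 40X the number of reads equals the number of bases in the target, which is consistent with the read counts quoted for the chromosome-level data sets.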
Human Chromosome X
Shredded reads:
  Number of reads: 156 million
  Chromosome length: 156 Mb
  Number of Clones: 774
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 28,204
  Total bases of contigs: 148 Mb
  N50 contig size: 30,968
  Largest contig: 173,157
  Averaged contig size: 5,254
Zebrafish Chromosome 5
Shredded reads:
  Number of reads: 70.2 million
  Chromosome length: 70.3 Mb
  Number of Clones: 351
  Read length: 40
  Estimated read coverage: 40X
Assembly features - contig stats:
  Total number of contigs: 22,405
  Total bases of contigs: 67.5 Mb
  N50 contig size: 9,587
  Largest contig: 70,757
  Averaged contig size: 3,012
Plasmodium Chr14
Shredded reads:
  Number of reads: 3.2 million
  Chromosome length: 3.29 Mb
  Number of Clones: 16
  Read length: 40
  Estimated read coverage: 40X
Assembly features - Original data:
  Total number of contigs: 1,960
  Total bases of contigs: 2.86 Mb
  N50 contig size: 2,924
  Largest contig: 18,366
  Averaged contig size: 1,461
Assembly features - Replacing “TATATA…”:
  Total number of contigs: 1,333
  Total bases of contigs: 3.05 Mb
  N50 contig size: 4,596
  Largest contig: 23,345
  Averaged contig size: 2,287
Acknowledgements:
Ian Goodhead and Chris Clee
James Bonfield
Yong Gu and Adam Spargo
Daniel Zerbino (EBI)
Tony Cox
Richard Durbin