Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
Outline
• Input Data
• Sequence read data
• Pipeline Review
• Un-processed data
• Assemblers
• Preliminary data: assembler comparison
• Visualization
• Future
Input Data
V. navarrensis    V. vulnificus
2423-01           2009V-1368
08-2462           06-2432
2541-90           08-2435
2756-81           08-2439
-                 07-2444
Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future
Vibrio navarrensis- 454
Sequence ID       2423-01               08-2462               2541-90               2756-81
Min. read length  21 bp                 25 bp                 19 bp                 28 bp
Max. read length  738 bp                573 bp                704 bp                704 bp
Avg. read length  423.27 (± 117.36 bp)  401.80 (± 117.12 bp)  416.23 (± 125.84 bp)  423.53 (± 117.19 bp)
Total reads       160,560               13,854                303,434               218,021
Coverage          15x                   1.23x                 28.06x                20.51x
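The slides do not say how the coverage figures were computed. A common estimate is total sequenced bases divided by genome size, sketched below. The genome size of 4.6 Mb is an assumption, borrowed from the -N 4600000 argument in the SOAP DeNovo command shown later in the deck; the function name is illustrative.

```python
def estimate_coverage(total_reads: int, avg_read_length_bp: float, genome_size_bp: int) -> float:
    """Estimate fold coverage as (total reads * average read length) / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_reads * avg_read_length_bp / genome_size_bp


# Assumption: genome size of 4.6 Mb (not stated on the read-data slides).
GENOME_SIZE_BP = 4_600_000

# V. navarrensis 2423-01, 454 reads: 160,560 reads averaging 423.27 bp.
coverage_2423_01 = estimate_coverage(160_560, 423.27, GENOME_SIZE_BP)
print(f"{coverage_2423_01:.2f}x")
```

Under that assumption the estimate for 2423-01 comes out just under 15x, in line with the value in the table.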
Vibrio vulnificus- 454
Sequence ID       2009V-1368            06-2432               08-2435               08-2439               07-2444
Min. read length  26 bp                 21 bp                 23 bp                 22 bp                 18 bp
Max. read length  593 bp                597 bp                723 bp                594 bp                736 bp
Avg. read length  416.05 (± 123.19 bp)  371.91 (± 112.13 bp)  416.98 (± 121.56 bp)  418.12 (± 120.88 bp)  368.78 (± 115.96 bp)
Total reads       191,280               786,944               352,726               173,538               777,228
Coverage          17x                   65x                   32x                   16x                   63x
Vibrio navarrensis- Illumina
Sequence ID       2423-01      08-2462      2541-90      2756-81
Min. read length  76 bp        76 bp        76 bp        76 bp
Max. read length  76 bp        76 bp        76 bp        76 bp
Avg. read length  76 bp        76 bp        76 bp        76 bp
Total reads       19,316,659   29,414,237   126,298,691  92,338,634
Coverage          326x         496x         250x         237x
Vibrio vulnificus- Illumina
Sequence ID       2009V-1368  06-2432     08-2435     08-2439     07-2444
Min. read length  76 bp       76 bp       76 bp       76 bp       76 bp
Max. read length  76 bp       76 bp       76 bp       76 bp       76 bp
Avg. read length  76 bp       76 bp       76 bp       76 bp       76 bp
Total reads       15,764,329  14,562,252  15,343,648  16,007,895  15,495,709
Coverage          ~250x       ~250x       ~250x       ~250x       ~250x
Pipeline: Revisited
[Flowchart, summarized by stage]
PRE-PROCESSING
Input: 454 raw reads, Illumina raw reads
Pre-processing: Fastqc, Prinseq, NGS QC
Output: 454 reads, Illumina reads
Statistical analysis: read stats
REFERENCE SELECTION
Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
Align Illumina against the reference: bwa
Compare mapping statistics: samstats
Output: reference genome
DENOVO ASSEMBLY
Illumina / 454 / Hybrid DeNovo assembly
• Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
• 454 DeNovo: Newbler, CABOG, SUTTA
• Hybrid DeNovo: Ray, MIRA
Output: contigs * 3, unmapped reads
Align Illumina reads against 454 contigs: Mac Vector, CLC wb
Output: contigs, unmapped reads
Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY
Illumina/(454?) reference based assembly: AMOScmp
Output: contigs, unmapped reads
Reference evaluation: DNA Diff, MUMmer
CONTIG MERGING
All possible combinations of the best 3
Minimus, MAIA
Parameter optimization
GENOME FINISHING
PAGIT, Mauve
Gap filling
Nucleotide identity: MUMmer
GRASS, built-in
Finished genome, scaffolds
GAGE
Draft/finished genome
LEGEND
Process, 454, Illumina, hybrid, Info., Chosen Ref., Assemblers
Vibrio vulnificus- 454
Metric 1368 2432 2435 2439 2444
Per Base Seq. Quality
Per Seq. Quality Score
Per Base Seq. Content
Per Base GC Content
Per Seq. GC Content
Per Base N Content
Seq. Length Dist.
Seq. Dup. Levels
Overrepresented Seqs.
Kmer Content
Vibrio navarrensis- 454; unprocessed data
Metric 2423-01 08-2462 2541-90 2756-81
Per Base Seq. Quality
Per Seq. Quality Score
Per Base Seq. Content
Per Base GC Content
Per Seq. GC Content
Per Base N Content
Seq. Length Dist.
Seq. Dup. Levels
Overrepresented Seqs.
Kmer Content
Vibrio vulnificus- Illumina; unprocessed data
Metric 2009V-1368 06-2432 08-2435 08-2439 07-2444
Per Base Seq. Quality
Per Seq. Quality Score
Per Base Seq. Content
Per Base GC Content
Per Seq. GC Content
Per Base N Content
Seq. Length Dist.
Seq. Dup. Levels
Overrepresented Seqs.
Kmer Content
Vibrio navarrensis- Illumina; unprocessed data
Metric 2423-01 08-2462 2541-90 2756-81
Per Base Seq. Quality
Per Seq. Quality Score
Per Base Seq. Content
Per Base GC Content
Per Seq. GC Content
Per Base N Content
Seq. Length Dist.
Seq. Dup. Levels
Overrepresented Seqs.
Kmer Content
Per base sequence quality
[Figure panels: vul_454_07-2444, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462]
Per base sequence content
[Figure panels: vul_454_06-2432, nav_454_08-2462, vul_ill_06-2432, nav_ill_06-2756-81]
Seq. duplicate levels
[Figure panels: vul_454_08-2435, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462]
Pre-processing stats: vul_ill_07-2444
Parameter        Value
Total sequences  15,343,648
Good sequences   9,775,116
Bad sequences    5,568,532
[Pie chart categories: Good reads, Exact repeats, Trim tail right, Min qual mean]
Pipeline: Revisited
[Flowchart repeated unchanged from the pipeline slide above]
Assemblers
Name                    Platform  Source file  Installation  Usage
Allpaths LG             Illumina
SOAP DeNovo             Illumina
Velvet                  Illumina
SUTTA                   Hybrid
Ray                     Hybrid
CLC Genomics Workbench  Hybrid
Newbler                 454
CABOG                   454
CLC Genomics
Word Size: Automatic Word Size. CLC bio's de novo assembly algorithm uses de Bruijn graphs. It builds a table of all sub-sequences of a certain length (called words) found in the reads.
Bubble Size: Automatic Bubble Size. A bubble is a bifurcation in the graph where a path splits into two branches that then merge back into one.
Minimum Contig Length: 200
Mismatch cost: 2. The cost of a mismatch between the read and the reference sequence.
Insertion cost: 3. The cost of an insertion in the read (causing a gap in the reference sequence).
Deletion cost: 3. The cost of having a gap in the read. The score for a match is always 1.
Length fraction: 0.5. The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read must match the reference sequence for the read to be included in the final mapping.
Similarity: 0.8. The minimum fraction of identity between the read and the reference sequence. To require, e.g., at least 90% identity for a read to be included in the final mapping, set this value to 0.9.
Update contigs based on mapped reads: the original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
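To make the mapping parameters concrete, here is a minimal sketch of how the costs and the two thresholds could be applied to one read alignment. It is an interpretation of the parameter descriptions above, not CLC's actual implementation. The `Alignment` class, the function names, and the exact definitions of "length fraction" and "similarity" are assumptions made for illustration.

```python
from dataclasses import dataclass

# Values taken from the slide; a match always scores 1.
MATCH_SCORE = 1
MISMATCH_COST = 2
INSERTION_COST = 3
DELETION_COST = 3


@dataclass
class Alignment:
    """Hypothetical summary of one read aligned to a reference."""
    read_length: int
    matches: int
    mismatches: int = 0
    insertions: int = 0  # bases in the read with no counterpart in the reference
    deletions: int = 0   # reference bases missing from the read


def alignment_score(aln: Alignment) -> int:
    """Score = matches minus the weighted costs of mismatches and gaps."""
    return (MATCH_SCORE * aln.matches
            - MISMATCH_COST * aln.mismatches
            - INSERTION_COST * aln.insertions
            - DELETION_COST * aln.deletions)


def keep_read(aln: Alignment, length_fraction: float = 0.5, similarity: float = 0.8) -> bool:
    """Assumed reading of the two filters: enough of the read is aligned,
    and enough of the aligned columns are identical."""
    aligned_read_bases = aln.matches + aln.mismatches + aln.insertions
    alignment_columns = aligned_read_bases + aln.deletions
    if aln.read_length <= 0 or alignment_columns == 0:
        return False
    long_enough = aligned_read_bases / aln.read_length >= length_fraction
    similar_enough = aln.matches / alignment_columns >= similarity
    return long_enough and similar_enough


# A 76 bp read with 60 matches, 2 mismatches and 1 insertion passes both filters.
good = Alignment(read_length=76, matches=60, mismatches=2, insertions=1)
print(alignment_score(good), keep_read(good))
```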
Velvet
De Bruijn graph assembler
Max k-mer length: 31; default: 29
Commands:
velveth directory k-mer -read_type -file_format filename
velvetg VAssemILL -exp_cov auto -cov_cutoff auto
-exp_cov auto: let the system infer the expected coverage of unique regions
-cov_cutoff auto: let the system infer the coverage cutoff for removing low-coverage nodes
Designed for very short reads (25-50 bp)
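De Bruijn graph assemblers such as CLC and Velvet start from a table of all k-length words found in the reads. The sketch below builds that table and the graph derived from it. It is a toy illustration of the general technique, not the code of either tool; the function names and the example reads are made up.

```python
from collections import Counter, defaultdict


def count_words(reads, k):
    """Build the table of all sub-sequences of length k (words) found in the reads."""
    words = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            words[read[start:start + k]] += 1
    return words


def de_bruijn_graph(words):
    """Link each word's (k-1)-base prefix to its (k-1)-base suffix."""
    graph = defaultdict(set)
    for word in words:
        graph[word[:-1]].add(word[1:])
    return graph


# Two made-up overlapping reads, k = 3.
reads = ["ACGTAC", "CGTACG"]
words = count_words(reads, 3)
graph = de_bruijn_graph(words)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one successor marks a branch point. When two branches from such a node rejoin further along, they form a bubble of the kind the Bubble Size parameter refers to.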
Newbler
De novo OLC assembler
Uses k-mer based hashing
Command: runAssembly [filename]
Designed for longer reads (454)
SOAP DeNovo2
Short-read de novo assembler
Designed for Illumina GAII short reads
Command:
SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N 4600000 -o test_OP 1>ass.log 2>ass.err
Parameters specified:
Insert_size: 0 (single-end reads)
Kmer_size: 23 (default)
asm_flag: both contigs and scaffolds
Assembler comparison- 454
vul_454_06-2432
Tool              N50      No. of contigs  Avg. contig length  No. of large contigs  Largest contig  Read usage %
CLC Genomics wb.  93,536   363             13,107              NA                    NA              99.32
Newbler           194,540  142             33,550              94                    777,156         98.9
nav_454_2541-90
Tool              N50      No. of contigs  Avg. contig length  No. of large contigs  Largest contig  Read usage %
CLC Genomics wb.  84,313   313             13,828              NA                    NA              98.53
Newbler           111,462  347             12,606              168                   218,091         97.88
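N50 is the headline metric in these comparisons: it is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length. A minimal sketch of the calculation follows. The contig lengths in the example are made up and are not taken from any assembly in the slides.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up contig lengths (bp): total 290, so N50 is reached at 80 + 70 = 150.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 together with a lower contig count generally indicates a more contiguous assembly, which is how the tables above are read.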
Assembler comparison- Illumina
nav_ill_2541-90
Tool             N50     No. of contigs  Avg. contig length  Read usage %  Largest contig  Median coverage depth
SOAP DeNovo      1,077   28,760          184                 NA            NA              NA
Velvet           17,408  1,402           3,072               99.26         58,246          92.09
CLC Genomics wb  56,628  291             14,766              99.36         193,565         NA
vul_ill_06-2432
Tool             N50     No. of contigs  Avg. contig length  Read usage %  Largest contig  Median coverage depth
SOAP DeNovo      1,094   26,773          207                 NA            NA              NA
Velvet           15,699  1,253           3,759               99.57         51,343          86.93
CLC Genomics wb  87,298  260             18,087              99.40         233,510         NA
Pipeline: Revisited
[Flowchart repeated from the pipeline slide above, with these changes: MIRA is no longer listed under Hybrid DeNovo, Taipan is no longer listed under Illumina DeNovo, and reference evaluation lists DNA Diff only]
Reference Genomes
V. vulnificus MO6-24/O V. vulnificus YJ016 V. vulnificus CMCP6
Reference vs. all contigs- 454
Each cell: aligned contigs % / aligned bases %
nav_454_2541-90
Tool/Reference              CMCP6    YJ016    MO6-24/O
CLC Genomics wb. (n = 313)  45 / 25  41 / 25  39 / 25
Newbler (n = 347)           59 / 25  58 / 25  43 / 24
vul_454_06-2432
Tool/Reference              CMCP6    YJ016    MO6-24/O
CLC Genomics wb.            NA / NA  NA / NA  NA / NA
Newbler (n = 142)           85 / 91  84 / 91  86 / 92
Reference vs. all contigs- Illumina
Each cell: aligned contigs % / aligned bases %
nav_ill_2541-90
Tool/Reference            CMCP6    YJ016    MO6-24/O
SOAP DeNovo (n = 28,760)  3 / 13   3 / 14   3 / 14
Velvet (n = 1,402)        20 / 23  20 / 23  20 / 23
vul_ill_06-2432
Tool/Reference            CMCP6    YJ016    MO6-24/O
SOAP DeNovo (n = 26,773)  18 / 76  18 / 76  18 / 76
Velvet (n = 1,253)        46 / 91  47 / 91  47 / 91
Visualization
Road ahead
Get all the tools working
Optimize tool parameters
Use Illumina reads to finish 454 contigs
Performance considerations for the tool
Questions???