Slide Deck from Josh's 2014 presentation at the Illumina user group meeting in RTP. Slides describe our experience with V3 and V4 chemistries on a very large cohort of exome sequenced samples.
V4 Sequencing Reagent Experience
Joshua Bridgers
Duke University
Center for Human Genome Variation
V3 vs V4 Chemistry
• V3
– 100bp x 100bp
– 12 days run time
– Requires loading of paired-end reagents
– ~300gb per flowcell
• V4
– Requires HiSeq 2500 or newer 2000
– 125bp x 125bp
– 6 day run time*
– Paired-end reagents are loaded at the start of the run
– ~600gb per flowcell
Throughput
2400gb / 12 day (V3 HiSeq 2000) ≈ 2400gb / 12 day (V4 HiSeq 2000/2500)
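The equal 2400gb / 12 day figures can be related to instrument counts with a little arithmetic. The sketch below is illustrative only: it assumes two flowcells per instrument and back-to-back runs, neither of which is stated on the slides, and it takes the per-flowcell yields and run times from the V3 vs V4 slide. The function name and parameters are hypothetical.

```python
import math

def instruments_needed(target_gb, gb_per_flowcell, run_days,
                       window_days=12, flowcells_per_instrument=2):
    """Instruments required to reach target_gb within window_days.

    Assumes (not from the slides) two flowcells per instrument and
    runs scheduled back to back with no idle time.
    """
    # Whole runs that fit in the window, e.g. two 6-day runs in 12 days.
    runs_in_window = window_days // run_days
    gb_per_instrument = gb_per_flowcell * flowcells_per_instrument * runs_in_window
    return math.ceil(target_gb / gb_per_instrument)

# Yields and run times as quoted for V3 and V4 chemistry.
v3_instruments = instruments_needed(2400, gb_per_flowcell=300, run_days=12)
v4_instruments = instruments_needed(2400, gb_per_flowcell=600, run_days=6)
print(v3_instruments, v4_instruments)
```

Under these assumptions the same 12-day output needs several V3 instruments but far fewer V4 instruments, which is one way to read the "≈" on the slide.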
2nd Generation Sequencing Advances
• V3 System Chemistry
– 300GB per flowcell
– 12 days to data
– Genome: $4700, Exome: $790
• V4 System Chemistry
– 600GB per flowcell
– 6 days to data
– Genome: $3000, Exome: $640
• X System Chemistry
– 1GB per patterned flowcell
– 3 days to data
– Genome: $1500, Exome: $500
Data Quality – Percent Q30 V3
[Plot: Percent Q30 (0.65 to 1) vs. Cluster Density K/mm2 (600 to 1300); series: V3 %q30R1, V3 %q30R2]
Data Quality – Percent Q30 V4
[Plot: Percent Q30 (0.65 to 1) vs. Cluster Density K/mm2 (600 to 1300); series: V4 %q30R1, V4 %q30R2]
Data Quality – Percent Q30
[Plot: Percent Q30 (0.65 to 1) vs. Cluster Density K/mm2 (600 to 1300); series: V3 %q30R1, V4 %q30R1, V3 %q30R2, V4 %q30R2]
Data Quality – Average Quality Score V3
[Plot: Quality Score (29 to 37) vs. Cluster Density K/mm2 (600 to 1300); series: V3 Avg. Qscore R1, V3 Avg. Qscore R2]
Data Quality – Average Quality Score V4
[Plot: Quality Score (29 to 37) vs. Cluster Density K/mm2 (600 to 1300); series: V4 Avg. Qscore R1, V4 Avg. Qscore R2]
Data Quality – Average Quality Score
[Plot: Quality Score (29 to 37) vs. Cluster Density K/mm2 (600 to 1300); series: V3 Avg. Qscore R1, V3 Avg. Qscore R2, V4 Avg. Qscore R1, V4 Avg. Qscore R2]
Data Volume and Processing
• Run folders
– .bcl files are now compressed
– V3 Run Folders: ~350GB/flowcell
– V4 Run Folders: ~500GB/flowcell
• Fastq generation cluster usage per flowcell
– V3: 121.5 minutes, 283GB max memory used
– V4: 184.9 minutes, 673GB max memory used
Bioinformatics Pipeline
• From Core-released Reads to Analysis-Ready Read Alignments
– Lane-level Alignment
– Indel Re-Alignment
– Base Quality Recalibration
– Merging & Sorting Alignments
– PCR Duplicate Removal
• Genotyping & Preliminary QC
– GATK Unified Genotyper
– GATK VQSR
– Coverage Depth
– Ti/Tv Ratio
– dbSNP Overlap
– Duplicate Read Pct.
– Aligned Read Pct.
– Gender Check
• Tools
– BWA - http://bio-bwa.sourceforge.net/
– SAMtools - http://samtools.sourceforge.net/
– Picard MarkDuplicates - http://picard.sourceforge.net/
– GATK - http://www.broadinstitute.org/gatk/
Test Sample Description
• Sequenced one trio on V3 and V4 Illumina chemistry
• 400bp size-selected exome capture
– V3 sequenced samples have higher overall coverage
Overall Metrics
• Percent Bases Covered 5x are similar despite coverage difference
• SNV hom/het ratio changed
• Indel hom/het ratio changed
• dbSNP Overlap, Ti/Tv similar
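For readers unfamiliar with the Ti/Tv metric, here is a minimal sketch of how a transition/transversion ratio can be computed from a list of SNVs. It is not the pipeline's own code; the function name and the (ref, alt) input format are assumptions made for illustration.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Assumes every pair is a valid single-nucleotide variant with ref != alt,
    so any pair that is not a transition is counted as a transversion.
    """
    snvs = [(ref.upper(), alt.upper()) for ref, alt in snvs]
    transitions = sum(1 for pair in snvs if pair in TRANSITIONS)
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions

# Toy input: two transitions (A>G, C>T) and one transversion (A>C).
example_ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

Comparing this ratio between the V3 and V4 call sets for the same sample is one quick check that the chemistry change did not shift the overall character of the calls.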
Variant Call Overlap
Sample    V3 only  V4 only  Shared
Sample 1  36.5k    13.6k    114.8k
Sample 2  34.5k    13.9k    118.1k
Sample 3  34.5k    13.2k    114.8k
Variant Call Overlap (Pass/Intermediate, Both 10x Covered)
Sample    V3 only  V4 only  Shared
Sample 1  8.2k     7.7k     104.1k
Sample 2  8.6k     7.5k     107.3k
Sample 3  7.4k     7.7k     103.2k
Variant Call Overlap (High Confidence SNV)
Sample    V3 only  V4 only  Shared
Sample 1  141      113      22553
Sample 2  118      90       22689
Sample 3  109      83       22257
Variant Call Overlap (High Confidence SNV)
Sample    V3 only (%)  V4 only (%)  Shared (%)
Sample 1  0.62%        0.50%        98.9%
Sample 2  0.52%        0.39%        99.1%
Sample 3  0.49%        0.37%        99.1%
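The percentages on this slide follow from the high confidence SNV counts on the previous one: each region of the Venn diagram is divided by the union of V3 and V4 calls. The sketch below reproduces that arithmetic. The pairing of the shared counts 22689 and 22257 with Samples 2 and 3 is an assumption (the layout was lost in extraction), chosen because it matches the reported percentages; the function name is hypothetical.

```python
def overlap_percentages(v3_only, v4_only, shared):
    """Percent of the union of calls that is V3-only, V4-only, and shared."""
    total = v3_only + v4_only + shared
    return tuple(round(100 * n / total, 2) for n in (v3_only, v4_only, shared))

# (V3 only, V4 only, shared) high confidence SNV counts per sample.
high_confidence_counts = {
    "Sample 1": (141, 113, 22553),
    "Sample 2": (118, 90, 22689),  # shared count assignment is assumed
    "Sample 3": (109, 83, 22257),  # shared count assignment is assumed
}

for sample, counts in high_confidence_counts.items():
    v3_pct, v4_pct, shared_pct = overlap_percentages(*counts)
    print(f"{sample}: V3 only {v3_pct}%, V4 only {v4_pct}%, shared {shared_pct}%")
```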
Homopolymer Runs
[Figure: V3 and V4 panels]
Variant Call Overlap (Low Complexity Regions)
Sample    V3 only  V4 only  Shared
Sample 1  42       24       22404
CCDS Coverage
• Analyzed 72 Caucasian unaffected adults for % coverage across a modified CCDS release 14
• Same cohort
• 34 V3 samples
• 38 V4 samples
• Gender unbiased
• All unaffected parents
• Overall coverage between 80-90x
CCDS Coverage
Depth         V3 Average  V4 Average
3x Coverage   97.61%      98.46%
10x Coverage  95.92%      96.71%
20x Coverage  92.93%      92.83%
• Overall greater coverage at 3x and 10x
• Similar coverage at 20x
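As a rough illustration of what the 3x, 10x, and 20x rows measure, the sketch below computes the percentage of target bases covered at or above each depth from a list of per-base depths. The input format and function name are assumptions; the slides do not say which tool produced the CCDS numbers.

```python
def percent_covered(depths, thresholds=(3, 10, 20)):
    """Percent of bases with depth >= each threshold.

    depths: per-base read depths over the target regions (hypothetical input,
    e.g. parsed from a per-base depth report).
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no bases supplied")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / len(depths)
        for t in thresholds
    }

# Toy example with four bases at depths 0, 5, 12, and 25.
toy_coverage = percent_covered([0, 5, 12, 25])
```

Averaging these per-sample percentages within the V3 and V4 groups would give figures comparable in form to the table above.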
Extended Coverage
[Figure: V3 and V4 panels]
Conclusion
• Sequencing throughput increased ~400%
– 71% temporary storage space usage
– 75% CPU hours for fastq conversion
– 120% maximum Vmem usage
• Higher average qscore at higher cluster densities
• Higher percent Q30 at higher cluster densities
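The resource percentages above appear to be V4 usage relative to V3 per unit of sequence produced. That reading is an assumption, but it approximately reproduces the slide's figures from the per-flowcell numbers quoted earlier (yield, run time, run folder size, fastq generation minutes, max memory). Fastq generation minutes are used here as a stand-in for CPU hours, which is also an assumption.

```python
# Per-flowcell figures as quoted on the earlier slides.
V3 = {"yield_gb": 300, "run_days": 12, "run_folder_gb": 350,
      "fastq_minutes": 121.5, "max_mem_gb": 283}
V4 = {"yield_gb": 600, "run_days": 6, "run_folder_gb": 500,
      "fastq_minutes": 184.9, "max_mem_gb": 673}

def per_gb_ratio(key):
    """V4 / V3 ratio of a resource, normalised by sequence yield."""
    return (V4[key] / V4["yield_gb"]) / (V3[key] / V3["yield_gb"])

# Yield per day of run time, V4 relative to V3.
throughput_ratio = (V4["yield_gb"] / V4["run_days"]) / (V3["yield_gb"] / V3["run_days"])
storage_ratio = per_gb_ratio("run_folder_gb")
cpu_ratio = per_gb_ratio("fastq_minutes")
memory_ratio = per_gb_ratio("max_mem_gb")

print(f"throughput x{throughput_ratio:.1f}, storage {storage_ratio:.0%}, "
      f"cpu {cpu_ratio:.0%}, memory {memory_ratio:.0%}")
```

The computed values land within a point or so of the slide's 71%, 75%, and 120%, so small differences are probably down to rounding or slightly different inputs.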
Conclusion
• High confidence variant calls largely unaffected
• Low complexity regions and indel calls can still be problematic
• Overall increased coverage of CCDS
Questions?
Acknowledgements
• CHGV
– Brian Krueger
– Slavé Petrovski
– Linda Hong
– Erin Campbell
• Illumina
– Adam Jerald
– Kenny Patridge
Kaizen (改善)
kai (改): “Change”
zen (善): “Good”
Cheaper sequencing, extended coverage, lower IT overhead
Data Quality – Percent Q30
• V3
– Greater degradation in quality as cycles increase
– Looser distribution
• V4
– Small drop in %q30 as cycles increase
– Tighter distribution
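For reference, percent Q30 is the share of base calls with a Phred quality of 30 or more (an estimated error rate of 1 in 1000 or better). A minimal sketch of the calculation on a single FASTQ quality string follows. It assumes Phred+33 encoding and is not how the instrument software reports the metric; the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def percent_q30(quality_line, offset=33):
    """Percent of bases in the read with Phred quality >= 30."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        raise ValueError("empty quality string")
    return 100.0 * sum(1 for q in scores if q >= 30) / len(scores)

def mean_quality(quality_line, offset=33):
    """Average Phred score of the read."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)

# 'I' decodes to Q40 and '#' to Q2, so half of this toy read is >= Q30.
toy_q30 = percent_q30("IIII####")
toy_mean = mean_quality("IIII####")
```

Computing these per cycle rather than per read would show the drop in %q30 as cycles increase that the bullets describe.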
Data Quality – Cluster Passed Filter
[Plot: Percent Pass Filter (65 to 100) vs. Cluster Density K/mm2 (600 to 1300); series: V3 Cluster PF, V4 Cluster PF]
New SNVHomo and IndelHomo
10x coverage
Sample                     #SNV Hom  #SNV Het  SNV Hom/Het Ratio  #Indel Hom  #Indel Het  Indel Hom/Het Ratio
SQC0243F77                 12012     118884    0.101039669        2102        18349       0.114556652
SQC0243F77_V4TEST          10181     101963    0.099849946        1833        14415       0.127159209
SQC0243F77 shared          9637      92697     0.103962372        1283        11216       0.114390157
SQC0243F77 missing         404       3345      0.12077728         404         3345        0.12077728
SQC0243F77_V4TEST missing  1036      11552     0.08968144         446         3617        0.123306608
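The ratio columns in this table are simply the homozygous count divided by the heterozygous count. A short sketch that recomputes them from the counts above (the function name is hypothetical):

```python
def hom_het_ratio(n_hom, n_het):
    """Homozygous / heterozygous call ratio."""
    if n_het == 0:
        raise ValueError("no heterozygous calls: ratio is undefined")
    return n_hom / n_het

# (hom, het) counts at 10x coverage, copied from the table.
snv_counts = {
    "SQC0243F77": (12012, 118884),
    "SQC0243F77_V4TEST": (10181, 101963),
}
indel_counts = {
    "SQC0243F77": (2102, 18349),
    "SQC0243F77_V4TEST": (1833, 14415),
}

snv_ratios = {name: hom_het_ratio(*c) for name, c in snv_counts.items()}
indel_ratios = {name: hom_het_ratio(*c) for name, c in indel_counts.items()}
```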
Additional Filters (SNV)
• Percent Alt Read 0.3 – 1
• GQ > 50
• SB < 60
• HaplotypeScore < 13
• MQ > 40
• QD > 2
• QUAL > 50
• RPRS > -6
• MQRS > -6
• NON_SYNONYMOUS_CODING | SYNONYMOUS_CODING | START_GAINED | START_LOST | STOP_LOST | STOP_GAINED | SPLICE_SITE_ACCEPTOR | SPLICE_SITE_DONOR | EXON
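The filter list above can be read as a single pass/fail predicate per variant. The sketch below expresses it that way. The dictionary keys, the function name, and the inclusive bounds on percent alt read are assumptions for illustration; the thresholds and effect names are taken from the slide.

```python
# Functional effects retained by the last filter on the slide.
CODING_EFFECTS = frozenset({
    "NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING", "START_GAINED",
    "START_LOST", "STOP_LOST", "STOP_GAINED",
    "SPLICE_SITE_ACCEPTOR", "SPLICE_SITE_DONOR", "EXON",
})

def passes_additional_filters(variant):
    """Return True if a variant record (a dict with hypothetical keys) passes every filter."""
    return (
        0.3 <= variant["pct_alt_read"] <= 1.0  # bounds assumed inclusive
        and variant["GQ"] > 50
        and variant["SB"] < 60
        and variant["HaplotypeScore"] < 13
        and variant["MQ"] > 40
        and variant["QD"] > 2
        and variant["QUAL"] > 50
        and variant["RPRS"] > -6
        and variant["MQRS"] > -6
        and variant["effect"] in CODING_EFFECTS
    )

# Made-up record that satisfies every threshold.
example_variant = {
    "pct_alt_read": 0.48, "GQ": 99, "SB": 0.5, "HaplotypeScore": 1.2,
    "MQ": 59.0, "QD": 14.3, "QUAL": 812.0, "RPRS": 0.4, "MQRS": -0.7,
    "effect": "NON_SYNONYMOUS_CODING",
}
example_passes = passes_additional_filters(example_variant)
```

Applying the same predicate to both the V3 and V4 call sets before intersecting them is what yields a high confidence comparison like the one shown on the overlap slides.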
Discarded Variants
• Sample 1
– 6.9k NON_SYNONYMOUS_CODING | SYNONYMOUS_CODING | START_GAINED | START_LOST | STOP_LOST | STOP_GAINED | SPLICE_SITE_ACCEPTOR | SPLICE_SITE_DONOR | EXON
– 25.6k 10X coverage for both samples
– 32.1k 10X coverage for one sample
– 43.0k Percent Alt Read 0.3 – 1
– 43.9k QUAL > 50
– 46.3k GQ > 50
– 52.5k HaplotypeScore < 13
– 56.8k Passed/Intermediate
– 59.1k MQ > 40
– 59.6k QD > 2
– 67.6k SB < 60
– 70.9k MQRS > -6
– 71.3k RPRS > -6
Homopolymer Runs
[Figure: V3 and V4 panels]