© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
informatics @ illumina
Dipesh Risal, Senior Product Manager, Informatics
March 24, 2011
From 250G to 600G on HiSeq 2000: How did we do it?
The Starting Point: Current System Performance
Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read Length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm2 | 0.878 | 2949 mm2 | 2x100 | 252.7 | 31.6 | 87%
From 250G to 600G on HiSeq 2000: 1. Increased Cluster Density
TruSeq v3 cBot kit: Higher Density
Cluster Density: 825k/mm2 (starting point: 488k/mm2)
– Non-overlapping clusters enable very high feature densities
– HCS/RTA software properly detects irregularly shaped, dense clusters
From 250G to 600G on HiSeq 2000: 2. Density-dependent bias is minimised
TruSeq v3 cBot kit: Reduced Bias
Current Cluster Amplification and Target Density (469k/mm2)
– GC-rich clusters well resolved and detected
– Good coverage at high GC
– GC dropout 2.2%
Current Cluster Amplification, High Density (964k/mm2)
– Some GC-rich clusters poorly resolved / not detected
– Reduced coverage at high GC
– GC dropout 8.2%
New Cluster Amplification, High Density (966k/mm2)
– Larger, brighter GC-rich clusters are well resolved and detected
– New chemistry equalises growth of AT- and GC-rich clusters
– Excellent coverage at high GC
– GC dropout 0.8%
Figure panels labelled High GC and Low GC accompany each condition.
From 250G to 600G on HiSeq 2000: 3. Increased yield of high quality data
Improved Image Analysis: Increases PF yield
New HCS/RTA ensures that more reads pass filter at any density
Cluster Density: 825k/mm2; Fraction Passing Filter: 0.904 (starting point: 0.878)
From 250G to 600G on HiSeq 2000: 4. Wider Flowcell Channels
– 50% increase in FC channel width
– 6 swaths per channel
– Same footprint
Cluster Density: 825k/mm2; Fraction Passing Filter: 0.904; Imageable Area (Two FC): 4424 mm2 (starting point: 2949 mm2)
From 250G to 600G on HiSeq 2000: Yield in excess of 600G
Cluster Density: 825k/mm2; Fraction Passing Filter: 0.904; Imageable Area (Two FC): 4424 mm2; Read Length: 2x100; Yield: 652.6 Gb (starting point: 252.7 Gb)
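The yield column appears to follow from the other columns. A minimal sketch, assuming yield is simply passing-filter clusters multiplied by bases read per cluster (the function name and this formula are assumptions, not something the slides state): it reproduces the 252.7 Gb starting point, and gives roughly 660 Gb for the new configuration, slightly above the 652.6 Gb shown, presumably because the slide inputs are rounded.

```python
def yield_gb(cluster_density_per_mm2, fraction_pf, area_mm2, bases_per_cluster):
    """Estimated yield in gigabases: PF clusters times bases read per cluster.

    Assumed relationship; the slides list the factors but not the formula.
    """
    pf_clusters = cluster_density_per_mm2 * fraction_pf * area_mm2
    return pf_clusters * bases_per_cluster / 1e9


# A 2x100 paired-end run reads 200 bases per cluster.
baseline = yield_gb(488_000, 0.878, 2949, 200)
upgraded = yield_gb(825_000, 0.904, 4424, 200)

print(f"starting point: {baseline:.1f} Gb")
print(f"new configuration: {upgraded:.1f} Gb")
# The wider channels account for the area change: 4424 / 2949 is about 1.5.
print(f"imageable area ratio: {4424 / 2949:.2f}")
```

Written this way, the three levers from the preceding slides (density, passing-filter fraction, imageable area) are visible as independent multipliers.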
From 250G to 600G on HiSeq 2000: 5. Improved sequencing chemistry
TruSeq v3 SBS kit: Highly Accurate at High Density
– New polymerase (EDP) improves incorporation efficiency at high density
– New Scan Reagent (SRE) reduces signal decay
Cluster Density: 825k/mm2; Fraction Passing Filter: 0.904; Imageable Area (Two FC): 4424 mm2; Read Length: 2x100; Yield: 652.6 Gb; Throughput: 60.4 Gb/day; %>Q30: 84%*
*Preliminary
From 250G to 600G on HiSeq 2000: Unprecedented System Performance
Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read Length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm2 | 0.878 | 2949 mm2 | 2x100 | 252.7 | 31.6 | 87%
825k/mm2 | 0.904 | 4424 mm2 | 2x100 | 652.6 | 60.4 | 84%
CASAVA 1.8: Accuracy in Alignment and Variant Calling
Available Q2, 2011
Figure: HCS/RTA primary analysis supplies reads (e.g. TAGGTTCAA); CASAVA 1.8 outputs are alignments, SNPs, indels, and counts.
Alignment Algorithm Enhancements: ELANDv2e
Repeat resolution
– First pass: align using a single seed
– Take reads that did not get matched or hit a repetitive sequence
– Second to fourth passes: align using overlapping seeds
– Report seeds that hit a non-repetitive sequence
– Increases sensitivity across repeat regions
Figure: finding seed hits in a 100bp read. Stage 0: single seed. Stage 1: multiple seeds.
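The multi-pass seeding idea can be illustrated with a toy exact-match index. This is a sketch of the general technique only, not ELANDv2e itself: the seed length, the step between overlapping seeds, exact matching without mismatches, and all function names are assumptions made for the example.

```python
from collections import defaultdict

SEED_LEN = 8   # hypothetical toy value; real seeds are longer
SEED_STEP = 4  # hypothetical spacing between overlapping seeds


def build_index(reference, k=SEED_LEN):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(index, read, offset, k=SEED_LEN):
    """Read start positions implied by the seed taken at `offset` in the read."""
    return [pos - offset for pos in index.get(read[offset:offset + k], [])]


def align_multiseed(index, read, k=SEED_LEN, step=SEED_STEP):
    """Return (read_start, seed_offset), or None if every seed is ambiguous."""
    # First pass: a single seed at the start of the read.
    hits = seed_hits(index, read, 0, k)
    if len(hits) == 1:
        return hits[0], 0
    # Later passes: overlapping seeds, for reads that were unmatched or repetitive.
    for offset in range(step, len(read) - k + 1, step):
        hits = seed_hits(index, read, offset, k)
        if len(hits) == 1:  # this seed lands in non-repetitive sequence
            return hits[0], offset
    return None


# Toy reference containing the same 10-base repeat twice.
reference = "ACGTACGTAC" + "TTGACCAGGCTAGG" + "ACGTACGTAC" + "CATTGGAACCTTGA"
index = build_index(reference)

unique_read = reference[10:30]   # starts in unique sequence
repeat_read = reference[24:44]   # starts inside the second repeat copy

unique_result = align_multiseed(index, unique_read)
repeat_result = align_multiseed(index, repeat_read)
print(unique_result, repeat_result)
```

The second read begins inside a repeat, so its first seed has two candidate positions; a later overlapping seed extends into unique sequence and resolves the placement.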
Coverage Histograms for Repeat Elements: NA19240, Chr 21
Figure: two histograms of % Ref Bases against Mapped Coverage, one for long repeats (LINE: L1, L2) and one for short repeats (SINE: Alu, MIR).
NA19240, 2x100bp, GAIIx+v1 chemistry, NCBI b36
Improved ELANDv2e Alignments on Human Data
ELANDv2e improves alignment over repeat regions compared to the previous version (ELANDv2)
Human 2x100 bp data
Alignment Algorithm Enhancements: ELANDv2e
Orphan aligner (1)
– Using mapped read as anchor, determine expected position of unmapped read (using insert size information)
– Perform local alignment in a 450 bp window (default) around expected position
– Score read pairs according to mismatches to the reference and their insert size
– Increases % aligned reads: generally 5-7% on 30x human data
– Improves indel detection
Figure: Read 2 has multiple mappings (shown in red). Estimate insert size, do local realignment using read 1 (green) as an anchor, then score read pairs.
1. Orphans are reads that do not map, but have read partners which map uniquely.
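The anchor-and-window idea can be sketched as follows. This is an illustrative simplification, not the CASAVA orphan aligner: it scores ungapped placements by mismatch count only, centres the 450 bp window on the expected position, and ignores strand, all of which are assumptions; the function names are hypothetical.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def rescue_orphan(reference, anchor_pos, orphan, insert_size, window=450):
    """Return (position, mismatches) of the best ungapped placement, or None.

    The expected start of the orphan is derived from the anchor position and
    the insert size; only a window around that position is searched.
    """
    expected = anchor_pos + insert_size - len(orphan)
    lo = max(0, expected - window // 2)
    hi = min(len(reference) - len(orphan), expected + window // 2)
    best = None
    for pos in range(lo, hi + 1):
        mismatches = hamming(reference[pos:pos + len(orphan)], orphan)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


# Toy data: a random 2,000-base reference and an orphan read with one mismatch.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
orphan = list(reference[900:950])
orphan[10] = "A" if orphan[10] != "A" else "C"
orphan = "".join(orphan)

placement = rescue_orphan(reference, anchor_pos=600, orphan=orphan, insert_size=350)
print(placement)
```

Restricting the search to a small window is what makes a sensitive local alignment affordable here; a production aligner would also allow gaps and fold the observed insert size into the pair score, as the slide describes.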
ELANDv2e Comparative Assessment
25 million simulated read pairs, error rate <2%, including SNPs and short indels
Reads correctly mapped if there is an overlap between the simulated position and the mapped read
Aligner | Correctly mapped | Incorrectly mapped
bowtie | 88.1% | 3.0%
bwa | 94.1% | 5.0%
ELANDv2e | 94.1% | 0.09%
Figure: stacked bars (0% to 100%) of correctly mapped, incorrectly mapped, and unmapped pairs for ELANDv2e, bwa, bowtie, and ELANDv2.
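The scoring criterion stated on this slide (a read counts as correctly mapped if the mapped read overlaps its simulated position) can be written down directly. A minimal sketch, assuming both positions are start coordinates on the same chromosome and a fixed read length; the function names and the demo data are hypothetical.

```python
from collections import Counter


def classify(simulated_start, mapped_start, read_len=100):
    """Label one read as 'correct', 'incorrect', or 'unmapped'."""
    if mapped_start is None:
        return "unmapped"
    # Two intervals of equal length overlap when their starts differ by
    # less than that length.
    overlaps = abs(mapped_start - simulated_start) < read_len
    return "correct" if overlaps else "incorrect"


def summarize(pairs, read_len=100):
    """Percentage of reads in each class for (simulated, mapped) start pairs."""
    counts = Counter(classify(s, m, read_len) for s, m in pairs)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical demo data: (simulated start, mapped start or None).
demo = [(1000, 1000), (5000, 5040), (9000, 12000), (700, None)]
summary = summarize(demo)
print(summary)
```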
Indel Detection Using Grouper
1. Cluster Orphan Reads
Positions of the 'singleton' reads and insert size information are used as a distance metric
Reference: TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA
Figure: aligned reads shown against the reference, with clusters of orphan reads.
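One simple way to cluster orphan reads by the positions of their mapped partners is to group positions that lie within some maximum gap of each other. This is only a sketch of position-based clustering; the gap threshold (here loosely tied to a typical insert size) and the function name are assumptions, and Grouper's actual metric is not given on the slide.

```python
def cluster_positions(positions, max_gap):
    """Group positions so that neighbours within a cluster are <= max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: extend the current cluster
        else:
            clusters.append([pos])     # too far: start a new cluster
    return clusters


# Hypothetical positions of the mapped partners of six orphan reads.
anchors = [10_150, 10_020, 10_310, 48_700, 48_905, 10_090]
clusters = cluster_positions(anchors, max_gap=300)
print(clusters)
```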
Indel Detection Using Grouper
2. Cluster Anomalous Reads
This step is new in CASAVA 1.8
– Anomalously short inserts = possible insertion
– Anomalously long inserts = possible deletion
Figure: aligned reads shown against the reference, with a cluster of anomalous reads (insert too short).
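The short-versus-long rule can be expressed as a threshold test on the observed insert size. A minimal sketch, assuming the library insert size is summarised by a mean and standard deviation and that "anomalous" means more than three standard deviations away; that cutoff and the function name are assumptions, not values from the slide.

```python
def classify_insert(observed, mean, sd, n_sd=3.0):
    """Flag a read pair by how its mapped insert size compares to the library."""
    if observed < mean - n_sd * sd:
        # Reads map closer together than expected: the sample carries extra
        # sequence between them that the reference lacks.
        return "possible insertion"
    if observed > mean + n_sd * sd:
        # Reads map further apart than expected: the sample is missing
        # sequence that the reference has.
        return "possible deletion"
    return "normal"


# Hypothetical library: mean insert 300 bp, standard deviation 20 bp.
for size in (300, 200, 420):
    print(size, classify_insert(size, mean=300, sd=20))
```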
Indel Detection Using Grouper
3. Merge Clusters from Same Event
Figure: merged cluster.
Indel Detection Using Grouper
4. Assemble Cluster into Contig
New Contig: TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT
Figure: the contig is assembled from the anomalous reads and orphan reads in the cluster.
5. Align Contig to Reference
Positions of associated 'singleton' reads used to narrow search to ~2,000 bp
Reference: TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA
New Contig, flanks matching the reference: TAGCATGCATGCATGCACGGACTTT
Potential New Insert: GATCGGTGTTTGTGGTGGG
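The contig and reference shown on this slide are enough to reproduce the insertion call with a brute-force search. This sketch is not Grouper's alignment method: it tries cutting every possible block out of the contig and reports the shortest block whose removal lets the remaining flanks match the reference exactly. The function name and the minimum flank length are assumptions.

```python
def find_insertion(reference, contig, min_flank=6):
    """Return (reference_offset, inserted_sequence), or None if nothing fits.

    Searches shortest insertions first, and the leftmost cut for each length.
    """
    n = len(contig)
    for ins_len in range(1, n - 2 * min_flank + 1):
        for split in range(min_flank, n - ins_len - min_flank + 1):
            # Contig with the candidate inserted block removed.
            collapsed = contig[:split] + contig[split + ins_len:]
            pos = reference.find(collapsed)
            if pos != -1:
                return pos + split, contig[split:split + ins_len]
    return None


# Sequences copied from the slide.
reference = "TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA"
contig = "TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT"

offset, inserted = find_insertion(reference, contig)
print(offset, inserted)
```

On the slide's data this recovers the 19-base insert shown as the potential new insert. A real implementation would use a gapped aligner restricted to the ~2,000 bp search region rather than exhaustive exact matching.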
Local Realignment
Candidate indels identified using
– Gapped alignment: ideal for small indels (1–20 bp)
– IndelFinder (Grouper algorithm)
– The above two are combined to identify candidate indels
– Significantly improved indel calling sensitivity over CASAVA 1.7
All reads intersecting each candidate indel are locally realigned
Alternate alignment probabilities are used to generate indel call qualities
Most likely alignment used for SNP calling
Local Realignment Improves Variant Detection
CASAVA 1.7
Figure rows: reference; ELAND alignments with del; ELAND alignments w/o del (x marks on the misaligned reads)
– Potential false-positive calls in misaligned region
– Lower sensitivity to true variants in the sample
CASAVA 1.8
Figure rows: reference; ELAND alignments with del; ELAND alignments realigned by variant caller; unmapped 'orphan' reads recovered by GROUPER
– GROUPER provides alignments for indels which are too large or complex for the read mapper
– Local read realignment and incorporation of GROUPER results improve sensitivity of both SNP and indel calls
SNP/Indel Genotyping
Joint SNP- and indel-calling process allows all variants to be based on a consistent set of read alignments.
New probabilistic SNP caller provides conventional Q-scores
– Q(SNP): probability that each site contains a SNP
– Q(max_gt): probability of the most likely genotype at that site
– Genotype-based model significantly reduces Mendelian inheritance errors
– SNP calling also takes into account the reference sequence (genomic prior) when calculating the Q-scores
Indel genotype quality model based on principles similar to those used for SNPs
– Q(indel), Q(max_gt)
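"Conventional Q-scores" are taken here to mean the Phred scale, where Q = -10 log10(probability that the call is wrong). That reading is an assumption, as are the function names; under it, the conversion in both directions is a short calculation.

```python
import math


def phred_q(p_error):
    """Phred-scaled quality for a given probability that the call is wrong."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Probability that the call is wrong for a given Phred-scaled quality."""
    return 10.0 ** (-q / 10.0)


# Q30, the threshold used in the %>Q30 column, is a 1-in-1,000 error chance.
print(round(phred_q(0.001), 6))
print(error_probability(20))
```

On this scale, Q(SNP) for a site would be computed from the posterior probability that the site does not contain a SNP, and Q(max_gt) from the probability that the most likely genotype is wrong.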
Improved SNP Calling Specificity (False Positive Test)
Yoruba trio datasets: NA18507, NA18506, NA18508
– 95 Gb raw data each, 2x100 bp, currently available chemistry
Violations of Mendelian inheritance indicate miscalled variants
Results for chromosome 20
Metric | CASAVA 1.7, current chemistry | CASAVA 1.8, current chemistry
Total sites ('N' excluded) | 59,505,520 | 59,505,520
Called in all 3 of trio | 58,188,449 (97.8%) | 58,988,964 (99.1%)
Mendelian conflicts | 5,072 | 162
Conflict rate (of SNP sites) | 3.78% | 0.13%
Conflict rate (of called sites) | 0.0087% | 0.00027%
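A Mendelian conflict is a site where the child's genotype cannot be formed by taking one allele from each parent. The sketch below shows that check for diploid genotypes written as two-letter strings, plus the arithmetic behind the "of called sites" conflict rates in the table; the genotype representation and function names are assumptions.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """True if the child genotype can be built from one allele of each parent."""
    possible = {"".join(sorted(a + b)) for a, b in product(mother, father)}
    return "".join(sorted(child)) in possible


def conflict_rate_percent(conflicts, called_sites):
    """Mendelian conflicts as a percentage of sites called in all three samples."""
    return 100.0 * conflicts / called_sites


print(mendelian_consistent("AG", "AA", "GG"))   # one allele from each parent
print(mendelian_consistent("GG", "AA", "AG"))   # the mother cannot supply a G

rate_17 = conflict_rate_percent(5_072, 58_188_449)   # CASAVA 1.7 column
rate_18 = conflict_rate_percent(162, 58_988_964)     # CASAVA 1.8 column
print(f"{rate_17:.4f}% vs {rate_18:.5f}%")
```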
Improved SNP Calling Sensitivity (False Negative Test)
Call set (NA19240) | # Called SNPs | # Called SNPs in NA19240&Yoruba (1) | # Called SNPs in NA19240&Yoruba near repeats (2)
CASAVA 1.7 | 4,340,848 | 230,026 (6.3%) | 152,890 (8.1%)
CASAVA 1.8 | 4,549,001 | 237,113 (3.4%) | 159,629 (4%)
1. NA19240&Yoruba: A subset of 245,469 SNPs were identified as being present in dbSNP and further validated by a capillary sequencing study in NA19240 and at least one other Yoruban individual (of NA18506, NA18507 and NA18508) reported in Nature 2008 453(7191):56-64.
2. NA19240&Yoruba near repeats: A subset of 166,343 in NA19240&Yoruba were located within 100bp of a known repeat element from the UCSC genome browser.
3. Numbers in parentheses are false negative rates.
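The false negative rates in parentheses follow from the validated set sizes given in the footnotes: the fraction of validated SNPs that the call set missed. A short check, with a hypothetical function name:

```python
def false_negative_rate_percent(called_validated, total_validated):
    """Percentage of validated SNPs that the call set failed to report."""
    return 100.0 * (total_validated - called_validated) / total_validated


VALIDATED = 245_469               # NA19240&Yoruba set (footnote 1)
VALIDATED_NEAR_REPEATS = 166_343  # subset near repeats (footnote 2)

fnr = {
    "CASAVA 1.7": false_negative_rate_percent(230_026, VALIDATED),
    "CASAVA 1.8": false_negative_rate_percent(237_113, VALIDATED),
    "CASAVA 1.7 near repeats": false_negative_rate_percent(152_890, VALIDATED_NEAR_REPEATS),
    "CASAVA 1.8 near repeats": false_negative_rate_percent(159_629, VALIDATED_NEAR_REPEATS),
}
for label, rate in fnr.items():
    print(f"{label}: {rate:.1f}%")
```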
New Chemistry and Software Enable Higher Data Quality
Chemistry: TruSeq v3 cBot Kit improves coverage in GC-rich regions
Software: HCS1.3/RTA.10 improves cluster detection at higher densities
Software: CASAVA 1.8 Repeat Resolution and Orphan Aligner improve coverage
Software: CASAVA 1.8 Probabilistic variant calling reduces false positives and false negatives
Chemistry: TruSeq v3 SBS Kit improves phasing and signal decay
Software: CASAVA 1.8 Local Realignment improves coverage
Software Ecosystem for Illumina Data
Application | Software
DNA Alignment | CASAVA, BWA, Bowtie
RNA Seq | TopHat/Bowtie/Cufflinks (novel isoform discovery, differential expression, counting); CASAVA (counting)
SNP Calling | CASAVA, GATK, SOAPsnp
Indel Detection | CASAVA, Breakdancer
de novo assembly | Velvet, Allpaths-LG, SOAPdenovo
Data Volumes: HiSeq2000 + v3 Chemistry + CASAVA 1.8 (600Gb output per run; Q2, 2011)
Data Volume | File Format | Size | Comment
Base Call / Quality Score Data | .bcl | 660 GB | Intermediate file format
Read-level Data | compressed fastq | 660 GB | FASTQ saving optional
Alignment Output & Archiving | .bam | 660 GB | BAM
Note: Export files are ~3.5 bytes per base and temp space needed is twice the size of export files. For a 600 Gbase run, ~5TB disk space is needed.
Note: Gb = Giga base; GB = Giga byte
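The two figures in the note (bytes per base and the temp-space multiplier) can be turned into sizes for a 600 Gbase run. This is a rough sketch using decimal terabytes; the function name is hypothetical, and how the slide combines these figures into its ~5 TB estimate is not stated.

```python
BYTES_PER_BASE_EXPORT = 3.5  # from the slide note
TEMP_MULTIPLIER = 2          # temp space is twice the size of export files


def export_and_temp_tb(gigabases):
    """Return (export size, temp space) in decimal terabytes for a run."""
    export_bytes = gigabases * 1e9 * BYTES_PER_BASE_EXPORT
    temp_bytes = TEMP_MULTIPLIER * export_bytes
    return export_bytes / 1e12, temp_bytes / 1e12


export_tb, temp_tb = export_and_temp_tb(600)
print(f"export files: {export_tb:.1f} TB, temp space: {temp_tb:.1f} TB")
```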
Tools to Get You Started with TopHat Suite (RNA-Seq)
TopHat User Guide
Annotated, ready-to-use references on iCom/ftp site
– CASAVA reference files
– Bowtie reference files
Coming Soon
– Graphical User Interface for TopHat Suite (AVC)
– Array/RNA-Seq Comparison Tech Note
– Array/RNA-Seq Comparison Tool
Reversing the Trend: Scaling output while reducing the data burden
Figure: Bases (Gb/Wk) and Bytes (Gbytes/day) by year, 2008 to 2011, on a 0 to 600 scale.
Why IlluminaCompute?
IlluminaCompute: Turnkey, Optimized Server Solution
For human-scale genomic data processing and analysis
Used by Illumina’s own sequencing service centers
Modeled around systems used at most genome centers
Components: scalable processing (CPUs); scalable storage; configuration, maintenance, and support
Turnkey: built, installed, and supported by Illumina
Performance scales with sequencing throughput
Cost-effective solution given performance and support
The Future: Addressing Three Key Needs
Workflow management simplifies sequence data analysis
Data Generation and Management; IT Infrastructure; Bioinformatics
Sample Prep & Instrument Control
Local HPC
Cloud
Analysis “AppStore”
Sequencing workflow engine
The Future: Simplified Workflows
Illumina Sequencing Workflow Manager
Simplifies and automates sequencing data generation and analysis
Customization, execution, and monitoring of sequencing workflows
Integrates open source analysis tools and LIMS systems