© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

informatics @ illumina
Dipesh Risal, Senior Product Manager, Informatics
March 24, 2011


Page 2

From 250G to 600G on HiSeq 2000
How did we do it?

The Starting Point: Current System Performance

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%

Page 3

From 250G to 600G on HiSeq 2000
1. Increased Cluster Density

TruSeq v3 cBot kit: Higher Density

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | | | | | |

Page 4

From 250G to 600G on HiSeq 2000
1. Increased Cluster Density

TruSeq v3 cBot kit: Higher Density

– Non-overlapping clusters enable very high feature densities
– HCS/RTA software properly detects irregularly shaped, dense clusters

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | | | | | |

Page 5

From 250G to 600G on HiSeq 2000
2. Density-dependent bias is minimised

TruSeq v3 cBot kit: Reduced Bias

Cluster Density: 488k/mm², 825k/mm²

Current Cluster Amplification and Target Density (469k/mm²)
– GC-rich clusters well resolved and detected
– Good Coverage at high GC
– GC dropout 2.2%

[Figure panels: High GC, Low GC]

Page 6

From 250G to 600G on HiSeq 2000
2. Density-dependent bias is minimised

TruSeq v3 cBot kit: Reduced Bias

Cluster Density: 488k/mm², 825k/mm²

Current Cluster Amplification, High Density (964k/mm²)
– Some GC-rich clusters poorly resolved / not detected
– Reduced Coverage at high GC
– GC dropout 8.2%

[Figure panels: High GC, Low GC]

Page 7

From 250G to 600G on HiSeq 2000
2. Density-dependent bias is minimised

TruSeq v3 cBot kit: Reduced Bias

Cluster Density: 488k/mm², 825k/mm²

New Cluster Amplification, High Density (966k/mm²)
– New chemistry equalises growth of AT and GC rich clusters
– Larger, brighter GC-rich clusters are well resolved and detected
– Excellent Coverage at high GC
– GC dropout 0.8%

[Figure panels: High GC, Low GC]

Page 8

From 250G to 600G on HiSeq 2000
3. Increased yield of high quality data

Improved Image Analysis: Increases PF yield

New HCS/RTA ensures that more reads pass filter at any density

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | | | | |

Page 9

From 250G to 600G on HiSeq 2000
4. Wider Flowcell Channels

– 50% increase in FC channel width
– 6 swaths per channel
– Same footprint

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | 4424 mm² | | | |

Page 10

From 250G to 600G on HiSeq 2000

Yield in excess of 600G

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | 4424 mm² | 2x100 | 652.6 | |

Page 11

From 250G to 600G on HiSeq 2000
5. Improved sequencing chemistry

TruSeq v2 SBS kit: Highly Accurate at High Density

New polymerase (EDP) improves incorporation efficiency at high density

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | 4424 mm² | 2x100 | 652.6 | 60.4 | 84%*

*Preliminary

Page 12

From 250G to 600G on HiSeq 2000
5. Improved sequencing chemistry

TruSeq v2 SBS kit: Highly Accurate at High Density

New Scan Reagent (SRE) reduces signal decay

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | 4424 mm² | 2x100 | 652.6 | 60.4 | 84%*

*Preliminary

Page 13

From 250G to 600G on HiSeq 2000

Unprecedented System Performance

Cluster Density | Fraction Passing Filter | Imageable Area (Two FC) | Read length | Yield (Gb) | Throughput (Gb/day) | %>Q30
488k/mm² | 0.878 | 2949 mm² | 2x100 | 252.7 | 31.6 | 87%
825k/mm² | 0.904 | 4424 mm² | 2x100 | 652.6 | 60.4 | 84%
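The columns of the performance table multiply together: yield is roughly cluster density × fraction passing filter × imageable area × bases read per cluster. The sketch below is an illustrative check, not Illumina's own calculation; the function name is hypothetical. With the quoted inputs it reproduces 252.7 Gb for the current system and gives about 659.9 Gb for the new configuration, roughly 1% above the quoted 652.6 Gb, presumably because the slide values are rounded or because of factors the slides do not show.

```python
def yield_gb(clusters_per_mm2, fraction_pf, area_mm2, bases_per_cluster):
    """Yield in Gb (1e9 bases) from density, PF fraction, imaged area and read length."""
    return clusters_per_mm2 * fraction_pf * area_mm2 * bases_per_cluster / 1e9


# A 2x100 run reads 200 bases per cluster.
current = yield_gb(488_000, 0.878, 2949, 200)
updated = yield_gb(825_000, 0.904, 4424, 200)

print(f"current system: {current:.1f} Gb")  # slide quotes 252.7 Gb
print(f"updated system: {updated:.1f} Gb")  # slide quotes 652.6 Gb

# Run duration implied by the quoted yield and daily throughput.
print(f"current run: {252.7 / 31.6:.1f} days")
print(f"updated run: {652.6 / 60.4:.1f} days")
```

The implied run durations (about 8 and about 11 days) follow only from dividing the quoted yield by the quoted throughput; the slides do not state run times directly.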

Page 14

CASAVA 1.8: Accuracy in Alignment and Variant Calling

[Figure labels: HCS/RTA Primary Analysis; TAGGTTCAA; Alignments, SNPs, Indels, Counts]

Available Q2, 2011

Page 15

Alignment Algorithm Enhancements: ELANDv2e

Repeat resolution
– First pass: align using a single seed
– Take reads that did not get matched or hit a repetitive sequence
– Second – fourth passes: Align using overlapping seeds
– Report seeds that hit a non-repetitive sequence
– Increases sensitivity across repeat regions

[Figure: Finding seed hits on a 100bp read. Stage 0: single seed. Stage 1: multiple seeds.]
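The multi-pass seeding idea can be sketched with a toy exact-match index. This is not ELANDv2e's implementation: the function names, the half-seed overlap step, and the rule "exactly one hit means non-repetitive" are assumptions made for illustration.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every seed_len-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def place_read(read, index, seed_len, max_passes=4):
    """Return (pass_number, read_start) for the first seed with a unique hit, else None.

    Pass 0 uses a single seed at the start of the read; later passes use
    overlapping seeds shifted by half a seed length (a hypothetical choice).
    """
    step = max(1, seed_len // 2)
    offsets = range(0, len(read) - seed_len + 1, step)
    for pass_number, offset in enumerate(offsets):
        if pass_number >= max_passes:
            break
        hits = index.get(read[offset:offset + seed_len], [])
        if len(hits) == 1:  # seed hits a non-repetitive sequence
            return pass_number, hits[0] - offset
    return None


reference = "ACACACACACGTTGCATC"
index = build_seed_index(reference, seed_len=4)
print(place_read("ACACGTTG", index, seed_len=4))
```

In this example the first seed (ACAC) occurs several times in the reference, so the read is only placed on a later pass, by an overlapping seed that hits a unique position.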

Page 16

Coverage Histograms for Repeat Elements – NA19240, Chr 21

[Figure: two histograms of % Ref Bases against Mapped Coverage. Panels: Long repeats (LINE: L1, L2); Short repeats (SINE: Alu, MIR).]

NA19240, 2x100bp, GAIIx+v1 chemistry, NCBI b36

Page 17

Improved ELANDv2e Alignments on Human Data

ELANDv2e improves alignment over repeat region compared to previous version (ELANDv2)

Human 2x100 bp data

Page 18

Alignment Algorithm Enhancements: ELANDv2e

Orphan aligner (see note 1)
– Using mapped read as anchor, determine expected position of unmapped read (using insert size information)
– Perform local alignment in a 450 bp window (default) around expected position
– Score read pairs according to mismatches to the reference and their insert size
– Increases % aligned reads: generally 5-7% on 30x human data
– Improves indel detection

[Figure: Read 2 has multiple mappings (shown in red). Steps: estimate insert size; do local realignment using read 1 (green) as an anchor; score read pairs.]

1. Orphans are reads that do not map, but have read partners which map uniquely.
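A minimal sketch of the anchoring step, under simplifying assumptions: the orphan is already in reference orientation, the insert size is measured from the anchor's leftmost base to the orphan's rightmost base, and an ungapped mismatch count stands in for true local alignment. The function name and the `max_mismatches` threshold are hypothetical; only the 450 bp default window comes from the slide.

```python
import random


def rescue_orphan(reference, orphan, anchor_pos, insert_size,
                  window=450, max_mismatches=2):
    """Return the best start position for the orphan near its expected location, or None."""
    expected = anchor_pos + insert_size - len(orphan)
    lo = max(0, expected - window // 2)
    hi = min(len(reference) - len(orphan), expected + window // 2)
    best = None
    for pos in range(lo, hi + 1):
        candidate = reference[pos:pos + len(orphan)]
        mismatches = sum(a != b for a, b in zip(orphan, candidate))
        if mismatches <= max_mismatches:
            # Prefer fewer mismatches, then the placement closest to the expected insert size.
            score = (mismatches, abs(pos - expected))
            if best is None or score < best[0]:
                best = (score, pos)
    return None if best is None else best[1]


# Toy data: a random reference and an orphan copied from position 700 with one mismatch.
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(2000))
bases = list(reference[700:750])
bases[10] = "A" if bases[10] != "A" else "C"
orphan = "".join(bases)

print(rescue_orphan(reference, orphan, anchor_pos=400, insert_size=340))
```

Restricting the search to a window around the expected position is what makes the rescue cheap: only a few hundred candidate placements are scored instead of the whole genome.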

Page 19

ELANDv2e Comparative Assessment

– 25 million simulated read pairs, error rate <2%, including SNPs and short indels
– Reads correctly mapped if there is an overlap between the simulated position and the mapped read

Result | bowtie | bwa | ELANDv2e
Correctly mapped | 88.1% | 94.1% | 94.1%
Incorrectly mapped | 3.0% | 5.0% | 0.09%

[Figure: stacked bars on a 0% to 100% axis for ELANDv2e, bwa, bowtie and ELANDv2, split into correctly mapped pairs, incorrectly mapped pairs and unmapped pairs.]
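The correctness criterion stated on this slide (any overlap between the simulated position and the mapped read) can be written as a short interval test. This sketch assumes inclusive (start, end) coordinates on the same chromosome; the function names are hypothetical and the benchmark's actual scripts are not part of the slides.

```python
from collections import Counter


def classify_pair(simulated, mapped):
    """Classify one read as 'correct', 'incorrect' or 'unmapped'.

    simulated and mapped are inclusive (start, end) tuples; mapped is None if unmapped.
    """
    if mapped is None:
        return "unmapped"
    sim_start, sim_end = simulated
    map_start, map_end = mapped
    overlaps = map_start <= sim_end and sim_start <= map_end
    return "correct" if overlaps else "incorrect"


def summarise(pairs):
    """Return the percentage of reads in each category."""
    counts = Counter(classify_pair(sim, mapped) for sim, mapped in pairs)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total
            for label in ("correct", "incorrect", "unmapped")}


toy_pairs = [
    ((100, 199), (150, 249)),    # overlaps the simulated position
    ((100, 199), (100, 199)),    # exact placement
    ((100, 199), (5000, 5099)),  # mapped elsewhere
    ((100, 199), None),          # not mapped
]
print(summarise(toy_pairs))
```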

Page 20

Indel Detection Using Grouper

1. Cluster Orphan Reads

Positions of the 'singleton' reads and insert size information used as a distance metric

[Figure: reference sequence with aligned reads and clusters of orphan reads]
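One simple way to group orphan reads by the positions of their mapped partners is single-linkage clustering along the reference. This is a sketch of the general idea only, not Grouper's algorithm; the `max_gap` threshold is a hypothetical stand-in for a distance derived from the insert size.

```python
def cluster_positions(positions, max_gap):
    """Group positions so that neighbours within max_gap fall in the same cluster."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Positions of the mapped 'singleton' partners of five orphan reads.
print(cluster_positions([1000, 1050, 5000, 1120, 5100], max_gap=200))
```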

Page 21

Indel Detection Using Grouper

2. Cluster Anomalous Reads

– This step is new in CASAVA 1.8
– Anomalously short inserts = possible insertion
– Anomalously long inserts = possible deletion

[Figure: reference sequence with aligned reads and a cluster of anomalous reads (insert too short)]
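The short/long rule can be sketched as a threshold test against the library's insert-size distribution. The cutoff of three standard deviations and the function name are assumptions for illustration; the slides do not say how CASAVA 1.8 defines "anomalous".

```python
def classify_insert(observed, mean, sd, n_sd=3.0):
    """Flag a read pair whose apparent insert size is far from the library mean."""
    if observed < mean - n_sd * sd:
        return "possible insertion"  # pair maps closer together than expected
    if observed > mean + n_sd * sd:
        return "possible deletion"   # pair maps further apart than expected
    return "normal"


for size in (150, 310, 450):
    print(size, classify_insert(size, mean=300, sd=30))
```

The direction of the rule follows from the geometry: extra sequence in the sample between the two reads makes them map closer together on the reference, and missing sequence makes them map further apart.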

Page 22

Indel Detection Using Grouper

3. Merge Clusters from Same Event

[Figure: Merged Cluster]

Page 23

Indel Detection Using Grouper

4. Assemble Cluster into Contig

[Figure: Anomalous Reads in Contig; Orphan Reads in Contig]

New Contig: TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT

5. Align Contig to Reference

Positions of associated 'singleton' used to narrow search to ~2,000 bp

Reference segment: TAGCATGCATGCATGCACGGACTTT
New Contig: TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT
Potential New Insert: GATCGGTGTTTGTGGTGGG

Page 24

Local Realignment

Candidate indels identified using
– Gapped alignment: ideal for small indels (1–20 bp)
– IndelFinder (Grouper algorithm)
– The above two are combined to identify candidate indels
– Significantly improved indel calling sensitivity over CASAVA 1.7

All intersecting reads to each candidate indel are locally realigned

Alternate alignment probabilities are used to generate indel call qualities

Most likely alignment used for SNP calling

Page 25

Local Realignment Improves Variant Detection

CASAVA 1.7
[Figure: reference; ELAND alignments with del; ELAND alignments w/o del]
– Potential false-positive calls in misaligned region
– Lower sensitivity to true variants in the sample

CASAVA 1.8
[Figure: reference; ELAND alignments with del; ELAND alignments realigned by variant caller; unmapped 'orphan' reads recovered by GROUPER]
– GROUPER provides alignments for indels which are too large or complex for the read mapper
– Local read realignment and incorporation of GROUPER results improve sensitivity of both SNP and indel calls

Page 26

SNP/Indel Genotyping

Joint SNP- and indel-calling process allows all variants to be based on a consistent set of read alignments.

New probabilistic SNP caller provides conventional Q-scores
– Q(SNP): probability that each site contains a SNP
– Q(max_gt): probability of the most likely genotype at that site
– Genotype-based model significantly reduces Mendelian inheritance errors
– SNP calling also takes into account the reference sequence (genomic prior) when calculating the Q-scores

Indel genotype quality model based on principles similar to those used for SNPs
– Q(indel), Q(max_gt)
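Conventional Q-scores are on the Phred scale, Q = -10 · log10(p), where p is the probability that the call is wrong, so Q30 corresponds to a 1-in-1000 chance of error. The sketch below shows only this standard conversion; treating Q(SNP) as the Phred-scaled probability that a called site is not a SNP is an assumption, since the slides do not give the exact definition.

```python
import math


def phred(p_error):
    """Phred-scaled quality for a given probability that the call is wrong."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Probability that the call is wrong for a given Phred-scaled quality."""
    return 10.0 ** (-q / 10.0)


for q in (20, 30, 40):
    print(f"Q{q}: error probability {error_probability(q):.4%}")
```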

Page 27

Improved SNP Calling Specificity (False Positive Test)

– Yoruba trio datasets: NA18507, NA18506, NA18508 (95 Gb raw data each, 2x100 bp, currently available chemistry)
– Violations of Mendelian inheritance indicate miscalled variants
– Results for chromosome 20

Measure | CASAVA 1.7, current chemistry | CASAVA 1.8, current chemistry
Total sites ('N' excluded) | 59,505,520 | 59,505,520
Called in all 3 of trio | 58,188,449 (97.8%) | 58,988,964 (99.1%)
Mendelian conflicts | 5,072 | 162
Conflict rate (SNP sites) | 3.78% | 0.13%
Conflict rate (called sites) | 0.0087% | 0.00027%
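A Mendelian conflict at an autosomal site means the child's genotype cannot be formed by taking one allele from each parent. The sketch below illustrates that test and recomputes the per-called-site conflict rates from the counts in the table; it assumes diploid genotypes written as two-letter strings, and the function name is hypothetical rather than part of CASAVA.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be built from one maternal and one paternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


print(is_mendelian_consistent("AG", mother="AG", father="AA"))  # consistent
print(is_mendelian_consistent("GG", mother="AG", father="AA"))  # conflict: father has no G

# Conflict rate per called site, from the counts in the table.
rate_17 = 100 * 5072 / 58_188_449
rate_18 = 100 * 162 / 58_988_964
print(f"CASAVA 1.7: {rate_17:.4f}%  CASAVA 1.8: {rate_18:.5f}%")
```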

Page 28

Improved SNP Calling Sensitivity (False Negative Test)

Call set (NA19240) | # Called SNPs | # Called SNPs in NA19240&Yoruba (note 1) | # Called SNPs in NA19240&Yoruba near repeats (note 2)
CASAVA 1.7 | 4,340,848 | 230,026 (6.3%) | 152,890 (8.1%)
CASAVA 1.8 | 4,549,001 | 237,113 (3.4%) | 159,629 (4%)

1. NA19240&Yoruba: A subset of 245,469 SNPs were identified as being present in dbSNP and further validated by a capillary sequencing study in NA19240 and at least one other Yoruban individual (of NA18506, NA18507 and NA18508) reported in Nature 2008 453(7191):56-64.
2. NA19240&Yoruba near repeats: A subset of 166,343 in NA19240&Yoruba were located within 100bp of a known repeat element from UCSC genome browser.
3. Numbers in parentheses are false negative rates.
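The false negative rates in parentheses follow from the validated-set sizes given in the notes: the rate is the share of validated SNPs that the call set missed. A short check of that arithmetic (the function name is hypothetical):

```python
def false_negative_rate(called, validated_total):
    """Fraction of validated SNPs that were not called."""
    return 1 - called / validated_total


validated = {"NA19240&Yoruba": 245_469, "near repeats": 166_343}
called = {
    "CASAVA 1.7": {"NA19240&Yoruba": 230_026, "near repeats": 152_890},
    "CASAVA 1.8": {"NA19240&Yoruba": 237_113, "near repeats": 159_629},
}

for version, counts in called.items():
    for subset, n_called in counts.items():
        fnr = false_negative_rate(n_called, validated[subset])
        print(f"{version}, {subset}: {100 * fnr:.1f}%")
```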

Page 29

New Chemistry and Software Enable Higher Data Quality

– Chemistry: TruSeq v3 cBOT Kit improves coverage in GC-rich regions
– Software: HCS1.3/RTA.10 improves cluster detection at higher densities
– Software: CASAVA 1.8 Repeat Resolution and Orphan Aligner improve coverage
– Software: CASAVA 1.8 Probabilistic variant calling reduces false positives and false negatives
– Chemistry: TruSeq v3 SBS Kit improves phasing and signal decay
– Software: CASAVA 1.8 Local Realignment improves coverage

Page 30

Software Ecosystem for Illumina Data

Application | Software
DNA Alignment | CASAVA, BWA, Bowtie
RNA Seq | TopHat/Bowtie/Cufflinks (novel isoform discovery, differential expression, counting); CASAVA (counting)
SNP Calling | CASAVA, GATK, SOAPsnp
Indel Detection | CASAVA, Breakdancer
de novo assembly | Velvet, Allpaths-LG, SOAPdenovo

Page 31

Data Volumes: HiSeq2000 + v3 Chemistry + CASAVA 1.8 (600Gb output per run; Q2, 2011)

Data Volume | File Format | Size | Comment
Base Call / Quality Score Data | .bcl | 660 GB | Intermediate file format
Read-level Data | compressed fastq | 660 GB | FASTQ saving optional
Alignment Output & Archiving | .bam | 660 GB | BAM

Note: Export files are ~3.5 bytes per base and temp space needed is twice the size of export files. For a 600 Gbase run, ~5 TB disk space is needed.

Note: Gb = Giga base; GB = Giga byte
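The two sizing rules in the note (about 3.5 bytes per base for export files, temp space twice the export size) can be evaluated for any run size. This sketch only applies those two rules; the slides do not spell out how the individual pieces combine into the quoted ~5 TB total, so the components are reported separately. It assumes 1 TB = 10^12 bytes, and the names are hypothetical.

```python
BYTES_PER_BASE_EXPORT = 3.5  # from the slide note
TEMP_FACTOR = 2              # temp space is twice the size of the export files


def export_and_temp_tb(bases):
    """Return (export size, temp space) in TB for a run with the given number of bases."""
    export_bytes = bases * BYTES_PER_BASE_EXPORT
    temp_bytes = TEMP_FACTOR * export_bytes
    return export_bytes / 1e12, temp_bytes / 1e12


export_tb, temp_tb = export_and_temp_tb(600e9)
print(f"export files: {export_tb:.1f} TB, temp space: {temp_tb:.1f} TB")
```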

Page 32

Tools to Get You Started with TopHat Suite (RNA-Seq)

TopHat User Guide

Annotated, Ready-to-use References on iCom/ftp site
– CASAVA Reference files
– Bowtie Reference files

Coming Soon
– Graphical User Interface for TopHat Suite (AVC)
– Array/RNA-Seq Comparison Tech Note
– Array/RNA-Seq Comparison Tool

Page 33

Reversing the Trend
Scaling output while reducing the data burden

[Figure: chart for 2008 to 2011 with a 0 to 600 axis. Series: Bases (Gb/Wk); Bytes (Gbytes/day).]

Page 34

Why IlluminaCompute?

IlluminaCompute: Turnkey, Optimized Server Solution
– For human-scale genomic data processing and analysis
– Used by Illumina’s own sequencing service centers
– Modeled around systems used at most genome centers

[Figure labels: scalable Processing (CPUs); scalable Storage; Configuration, maintenance and Support]

– Turnkey: built, installed, and supported by Illumina
– Performance scales with sequencing throughput
– Cost-effective solution given performance and support

Page 35

The Future: Addressing Three Key Needs
Workflow management simplifies sequence data analysis

Three key needs: DATA GENERATION AND MANAGEMENT; IT INFRASTRUCTURE; BIOINFORMATICS

[Figure labels: Sample Prep & Instrument Control; Local HPC; Cloud; Analysis “AppStore”; Sequencing workflow engine]

Page 36

The Future: Simplified Workflows
Illumina Sequencing Workflow Manager

– Simplifies and automates sequencing data generation and analysis
– Customization, execution, and monitoring of sequencing workflows
– Integrates open source analysis tools and LIMS systems