
© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

informatics @ illumina

Dipesh Risal, Senior Product Manager, Informatics

March 24, 2011

From 250G to 600G on HiSeq 2000: How did we do it?

The Starting Point: Current System Performance

Metric                      Current
Cluster Density             488k/mm2
Fraction Passing Filter     0.878
Imageable Area (Two FC)     2949 mm2
Read Length                 2x100
Yield (Gb)                  252.7
Throughput (Gb/day)         31.6
%>Q30                       87%

1. Increased Cluster Density

TruSeq v3 cBot kit: Higher Density

Cluster density: 825k/mm2 (baseline 488k/mm2)

– Non-overlapping clusters enable very high feature densities
– HCS/RTA software properly detects irregularly shaped, dense clusters

2. Density-dependent bias is minimised

TruSeq v3 cBot kit: Reduced Bias

Cluster density: 825k/mm2 (baseline 488k/mm2)

Cluster amplification    Density              GC-rich clusters                                 Coverage at high GC    GC dropout
Current                  Target (469k/mm2)    Well resolved and detected                       Good                   2.2%
Current                  High (964k/mm2)      Some poorly resolved / not detected              Reduced                8.2%
New                      High (966k/mm2)      Larger, brighter, well resolved and detected     Excellent              0.8%

New chemistry equalises growth of AT- and GC-rich clusters

[Figure: cluster images and coverage plots for each condition, with High GC and Low GC regions marked.]

3. Increased yield of high quality data

Improved Image Analysis: Increases PF yield

New HCS/RTA ensures that more reads pass filter at any density

New configuration so far: 825k/mm2 cluster density, 0.904 fraction passing filter (baseline 0.878)

4. Wider Flowcell Channels

– 50% increase in FC channel width
– 6 swaths per channel
– Same footprint

New configuration so far: 825k/mm2 cluster density, 0.904 fraction passing filter, 4424 mm2 imageable area (baseline 2949 mm2)

Yield in excess of 600G

New configuration so far: 825k/mm2 cluster density, 0.904 fraction passing filter, 4424 mm2 imageable area, 2x100 read length, 652.6 Gb yield (baseline 252.7 Gb)

5. Improved sequencing chemistry

TruSeq v2 SBS kit: Highly Accurate at High Density

– New polymerase (EDP) improves incorporation efficiency at high density
– New Scan Reagent (SRE) reduces signal decay

New configuration: 825k/mm2 cluster density, 0.904 fraction passing filter, 4424 mm2 imageable area, 2x100 read length, 652.6 Gb yield, 60.4 Gb/day throughput, 84%* >Q30

*Preliminary

Unprecedented System Performance

Metric                      Current       New
Cluster Density             488k/mm2      825k/mm2
Fraction Passing Filter     0.878         0.904
Imageable Area (Two FC)     2949 mm2      4424 mm2
Read Length                 2x100         2x100
Yield (Gb)                  252.7         652.6
Throughput (Gb/day)         31.6          60.4
%>Q30                       87%           84%
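The yield column can be approximated from the other columns. The sketch below assumes that yield is cluster density times imageable area times fraction passing filter times bases per cluster (200 for 2x100). That formula is not stated on the slides; it is an assumption that happens to reproduce the 252.7 Gb baseline. For the new configuration it gives roughly 660 Gb, slightly above the quoted 652.6 Gb, so the slide value presumably includes rounding or factors not shown. Function names are illustrative only.

```python
def yield_gb(density_k_per_mm2, area_mm2, fraction_pf, read_length, reads_per_cluster=2):
    """Estimated yield in gigabases (assumed model, not taken from the slides)."""
    clusters = density_k_per_mm2 * 1_000 * area_mm2        # raw clusters on both flow cells
    pf_clusters = clusters * fraction_pf                    # clusters passing filter
    bases = pf_clusters * read_length * reads_per_cluster   # 2x100 gives 200 bases per cluster
    return bases / 1e9


def run_days(yield_gigabases, throughput_gb_per_day):
    """Run duration implied by the yield and throughput columns."""
    return yield_gigabases / throughput_gb_per_day


baseline = yield_gb(488, 2949, 0.878, 100)
upgraded = yield_gb(825, 4424, 0.904, 100)

print(f"baseline yield estimate: {baseline:.1f} Gb")
print(f"upgraded yield estimate: {upgraded:.1f} Gb")
print(f"baseline run length:     {run_days(252.7, 31.6):.1f} days")
print(f"upgraded run length:     {run_days(652.6, 60.4):.1f} days")
```

Dividing quoted yield by quoted throughput suggests that the longer run time partly offsets the larger yield, which is why throughput grows less than yield.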

CASAVA 1.8: Accuracy in Alignment and Variant Calling

Available Q2, 2011

[Diagram: HCS/RTA primary analysis (base calls such as TAGGTTCAA) feeds CASAVA, which produces alignments, SNPs, indels, and counts.]

Alignment Algorithm Enhancements: ELANDv2e

Repeat resolution
– First pass: align using a single seed
– Take reads that did not get matched or hit a repetitive sequence
– Second to fourth passes: align using overlapping seeds
– Report seeds that hit a non-repetitive sequence
– Increases sensitivity across repeat regions

[Figure: finding seed hits on 100 bp reads. Stage 0: single seed. Stage 1: multiple seeds.]
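The multi-pass seeding idea can be illustrated with a toy example. This is only a sketch under simplifying assumptions: exact k-mer matching against a tiny in-memory index, no mismatches, and no reverse strand. It is not ELANDv2e's actual implementation, whose seed lengths and hashing are not described here. The names `build_index` and `place_read` and the parameters `k`, `step`, and `max_passes` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def place_read(read, index, k, step, max_passes=4):
    """Return (pass_number, read_start) for the first seed with a single hit.

    Pass 0 tries one seed at the start of the read. Later passes try seeds
    shifted by `step` bases; choosing step < k makes the seeds overlap.
    Returns None if every seed tried is missing or repetitive.
    """
    offsets = range(0, len(read) - k + 1, step)
    for pass_number, offset in zip(range(max_passes), offsets):
        hits = index.get(read[offset:offset + k], [])
        if len(hits) == 1:                       # seed hits a non-repetitive sequence
            return pass_number, hits[0] - offset
    return None


reference = "GATTACAGATTACACCGTTAGC"             # "GATTACA" occurs twice
index = build_index(reference, k=6)

repeat_read = reference[7:19]                    # starts inside the second repeat copy
unique_read = reference[12:22]                   # lies entirely in unique sequence

print(place_read(repeat_read, index, k=6, step=3))
print(place_read(unique_read, index, k=6, step=3))
```

The read that starts inside the repeat fails on its first seed, because that seed occurs twice in the reference, and is only placed once a later, overlapping seed reaches unique sequence.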

Coverage Histograms for Repeat Elements: NA19240, Chr 21

[Figure: two histograms of % reference bases against mapped coverage. Panels: long repeats (LINE: L1, L2) and short repeats (SINE: Alu, MIR).]

NA19240, 2x100 bp, GAIIx + v1 chemistry, NCBI b36

Improved ELANDv2e Alignments on Human Data

ELANDv2e improves alignment over repeat regions compared to the previous version (ELANDv2)

Human 2x100 bp data

Alignment Algorithm Enhancements: ELANDv2e

Orphan aligner (1)
– Using the mapped read as an anchor, determine the expected position of the unmapped read (using insert size information)
– Perform local alignment in a 450 bp window (default) around the expected position
– Score read pairs according to mismatches to the reference and their insert size
– Increases % aligned reads: generally 5-7% on 30x human data
– Improves indel detection

[Figure: read 2 has multiple mappings (shown in red). Steps shown: estimate insert size; do local realignment using read 1 (green) as an anchor; score read pairs.]

1. Orphans are reads that do not map, but have read partners which map uniquely.
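The anchor-and-window idea behind the orphan aligner can be sketched as follows. This is a toy model, not CASAVA's code: it swaps true local alignment for an ungapped mismatch count, ignores strand, and defines insert size as the distance from the anchor start to the far end of the rescued read. The function name `rescue_orphan` and the `max_mismatches` cut-off are assumptions made for illustration; only the 450 bp default window comes from the slide.

```python
import random


def rescue_orphan(reference, orphan, anchor_start, insert_size,
                  window=450, max_mismatches=2):
    """Search a window around the expected mate position for the best placement."""
    expected = anchor_start + insert_size - len(orphan)
    first = max(0, expected - window // 2)
    last = min(len(reference) - len(orphan), expected + window // 2)

    best = None
    for start in range(first, last + 1):
        candidate = reference[start:start + len(orphan)]
        mismatches = sum(a != b for a, b in zip(orphan, candidate))
        if mismatches > max_mismatches:
            continue
        # Score by mismatches first, then by deviation from the expected insert size.
        score = (mismatches, abs(start - expected))
        if best is None or score < best[0]:
            best = (score, start)

    if best is None:
        return None
    (mismatches, _), start = best
    return {"start": start, "mismatches": mismatches,
            "insert_size": start + len(orphan) - anchor_start}


random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(2000))

true_start = 700
bases = list(reference[true_start:true_start + 30])
bases[10] = "A" if bases[10] != "A" else "C"      # introduce exactly one mismatch
orphan = "".join(bases)

hit = rescue_orphan(reference, orphan, anchor_start=400, insert_size=320)
print(hit)
```

Restricting the search to a window keeps the more expensive alignment step local to the region implied by the uniquely mapped partner.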

ELANDv2e Comparative Assessment

25 million simulated read pairs, error rate <2%, including SNPs and short indels

Reads are counted as correctly mapped if the simulated position and the mapped read overlap

                      bowtie    bwa      ELANDv2e
Correctly mapped      88.1%     94.1%    94.1%
Incorrectly mapped    3.0%      5.0%     0.09%

[Figure: stacked bars from 0% to 100% showing correctly mapped, incorrectly mapped, and unmapped pairs for ELANDv2e, bwa, bowtie, and ELANDv2.]
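The correctness criterion used in this comparison (a read counts as correctly mapped when its mapped interval overlaps its simulated interval) can be expressed in a few lines. The data layout, function names, and the example reads below are hypothetical; the sketch also assumes the chromosome must match, which the slide does not state explicitly.

```python
def overlaps(sim_start, mapped_start, read_length):
    """True if the two half-open intervals of length read_length share a base."""
    return (sim_start < mapped_start + read_length
            and mapped_start < sim_start + read_length)


def classify(simulated, mapped, read_length=100):
    """Count correct, incorrect, and unmapped reads.

    simulated: read id -> (chromosome, true start)
    mapped:    read id -> (chromosome, reported start); missing ids are unmapped
    """
    counts = {"correct": 0, "incorrect": 0, "unmapped": 0}
    for read_id, (chrom, true_start) in simulated.items():
        hit = mapped.get(read_id)
        if hit is None:
            counts["unmapped"] += 1
        elif hit[0] == chrom and overlaps(true_start, hit[1], read_length):
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


simulated = {"r1": ("chr21", 1000), "r2": ("chr21", 5000), "r3": ("chr21", 9000)}
mapped = {"r1": ("chr21", 1040), "r2": ("chr21", 7000)}

result = classify(simulated, mapped)
print(result)
```

An overlap test tolerates small position shifts, for example from soft placement around indels, while still penalising reads placed elsewhere in the genome.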

Indel Detection Using Grouper

1. Cluster Orphan Reads

Positions of the 'singleton' reads and insert size information are used as a distance metric

[Figure: aligned reads against the reference, with clusters of orphan reads.]

Indel Detection Using Grouper

2. Cluster Anomalous Reads

– This step is new in CASAVA 1.8
– Anomalously short inserts = possible insertion
– Anomalously long inserts = possible deletion

[Figure: aligned reads against the reference, with a cluster of anomalous reads (insert too short).]
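The rule for anomalous inserts can be written as a small classifier. The three-standard-deviation threshold and the function name are assumptions for illustration; the actual cut-offs Grouper uses are not given here.

```python
def classify_insert(observed, expected_mean, expected_sd, n_sd=3.0):
    """Label a read pair by how its apparent insert size compares to the library.

    An insertion in the sample adds bases that are absent from the reference,
    so the two reads land closer together on the reference (short insert).
    A deletion removes reference bases from the fragment, so the reads land
    further apart (long insert).
    """
    if observed < expected_mean - n_sd * expected_sd:
        return "possible insertion"
    if observed > expected_mean + n_sd * expected_sd:
        return "possible deletion"
    return "normal"


for size in (150, 310, 450):
    print(size, classify_insert(size, expected_mean=300, expected_sd=30))
```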

Indel Detection Using Grouper

3. Merge Clusters from Same Event

[Figure: merged cluster.]

Indel Detection Using Grouper

4. Assemble Cluster into Contig

New contig: TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT
– Anomalous reads in contig
– Orphan reads in contig

5. Align Contig to Reference

Positions of associated 'singleton' reads are used to narrow the search to ~2,000 bp

[Figure: the new contig aligned to the reference. The flanking sequence TAGCATGCATGCATGCACGGACTTT matches the reference; the segment GATCGGTGTTTGTGGTGGG is the potential new insert.]

Reference and read sequence as extracted from the figure:
TTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCCGTAGCATGCATGCATGCACGGACTTTCGGGACTCTATCCGGCATCTATGGCTTTTCGCGTAGCATGCATGCGGCTTTTCGCGTAGA
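Using the sequences on this slide, the contig-to-reference step can be illustrated with a deliberately simple approach: match the contig against the reference window from both ends and treat the unmatched middle as the insertion. This prefix/suffix matching is a stand-in chosen for brevity, not Grouper's actual alignment method, and `find_insertion` is a hypothetical name.

```python
def find_insertion(contig, ref_window):
    """Return (reference offset, inserted sequence), or None if not a simple insertion.

    The contig must equal the reference window with one extra block of bases.
    When several placements are equivalent, the leftmost one is reported.
    """
    extra = len(contig) - len(ref_window)
    if extra <= 0:
        return None

    limit = len(ref_window)
    prefix = 0                                   # bases matching from the left
    while prefix < limit and contig[prefix] == ref_window[prefix]:
        prefix += 1
    suffix = 0                                   # bases matching from the right
    while suffix < limit and contig[-1 - suffix] == ref_window[-1 - suffix]:
        suffix += 1

    if prefix + suffix < limit:                  # flanks do not cover the window
        return None
    start = limit - suffix                       # leftmost valid insertion point
    return start, contig[start:start + extra]


contig = "TAGCATGCATGCATGCACGATCGGTGTTTGTGGTGGGGGACTTT"
ref_window = "TAGCATGCATGCATGCACGGACTTT"

print(find_insertion(contig, ref_window))
```

With the slide's contig and flanking reference sequence, the recovered segment is the same 19-base sequence the figure labels as the potential new insert.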

Local Realignment

Candidate indels identified using
– Gapped alignment: ideal for small indels (1–20 bp)
– IndelFinder (Grouper algorithm)
– The above two are combined to identify candidate indels
– Significantly improved indel calling sensitivity over CASAVA 1.7

All reads intersecting each candidate indel are locally realigned

Alternate alignment probabilities are used to generate indel call qualities

Most likely alignment used for SNP calling

Local Realignment Improves Variant Detection

CASAVA 1.7
[Figure: reference, ELAND alignments with the deletion, and ELAND alignments without the deletion; mismatches are marked with x.]
– Potential false-positive calls in misaligned region
– Lower sensitivity to true variants in the sample

CASAVA 1.8
[Figure: reference, ELAND alignments with the deletion, ELAND alignments realigned by the variant caller, and unmapped 'orphan' reads recovered by GROUPER.]
– GROUPER provides alignments for indels which are too large or complex for the read mapper
– Local read realignment and incorporation of GROUPER results improve sensitivity of both SNP and indel calls

SNP/Indel Genotyping

Joint SNP- and indel-calling process allows all variants to be based on a consistent set of read alignments.

New probabilistic SNP caller provides conventional Q-scores
– Q(SNP): probability that each site contains a SNP
– Q(max_gt): probability of the most likely genotype at that site
– Genotype-based model significantly reduces Mendelian inheritance errors
– SNP calling also takes into account the reference sequence (genomic prior) when calculating the Q-scores

Indel genotype quality model based on principles similar to those used for SNPs
– Q(indel), Q(max_gt)
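"Conventional Q-scores" are normally Phred-scaled, meaning Q = -10 log10(P), where P is the probability that the call is wrong. The sketch below shows that conversion in both directions. It assumes CASAVA's Q(SNP), Q(max_gt), and Q(indel) follow this standard scale, which the slides imply but do not spell out; the function names are illustrative.

```python
import math


def prob_to_q(p_wrong):
    """Phred-scaled quality for a given probability that the call is wrong."""
    return -10.0 * math.log10(p_wrong)


def q_to_prob(q):
    """Probability that the call is wrong for a given Phred-scaled quality."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30):
    print(f"Q{q}: probability of a wrong call = {q_to_prob(q):.3f}")
print(f"1 in 1,000 chance of a wrong call = Q{prob_to_q(0.001):.0f}")
```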

Improved SNP Calling Specificity (False Positive Test)

Yoruba trio datasets: NA18507, NA18506, NA18508
– 95 Gb raw data each, 2x100 bp, currently available chemistry

Violations of Mendelian inheritance indicate miscalled variants

Results for chromosome 20

                               CASAVA 1.7                  CASAVA 1.8
                               (current chemistry)         (current chemistry)
Total sites ('N' excluded)     59,505,520                  59,505,520
Called in all 3 of trio        58,188,449 (97.8%)          58,988,964 (99.1%)
Mendelian conflicts            5,072                       162
Conflict rate                  3.78% of SNP sites          0.13% of SNP sites
                               0.0087% of called sites     0.00027% of called sites
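A Mendelian conflict at a site means the child's genotype cannot be formed by taking one allele from each parent. The sketch below shows that check for an autosomal, diploid site (chromosome 20 qualifies) and recomputes the "of called sites" conflict rates from the counts in the table. Genotypes are written as two-character strings, and the function names are hypothetical.

```python
from itertools import product


def is_mendelian(child, mother, father):
    """True if the child genotype can arise from one allele of each parent."""
    return any(sorted(child) == sorted(m + f) for m, f in product(mother, father))


def conflict_rate_percent(conflicts, called_sites):
    """Mendelian conflicts as a percentage of sites called in all three samples."""
    return 100.0 * conflicts / called_sites


print(is_mendelian("AG", "AA", "GG"))   # consistent: A from mother, G from father
print(is_mendelian("GG", "AA", "AG"))   # conflict: mother has no G allele

rate_17 = conflict_rate_percent(5_072, 58_188_449)
rate_18 = conflict_rate_percent(162, 58_988_964)
print(f"CASAVA 1.7: {rate_17:.4f}% of called sites")
print(f"CASAVA 1.8: {rate_18:.5f}% of called sites")
```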

Improved SNP Calling Sensitivity (False Negative Test)

Call set (NA19240)    # Called SNPs    # Called SNPs in NA19240&Yoruba (1)    # Called SNPs in NA19240&Yoruba near repeats (2)
CASAVA 1.7            4,340,848        230,026 (6.3%)                         152,890 (8.1%)
CASAVA 1.8            4,549,001        237,113 (3.4%)                         159,629 (4%)

1. NA19240&Yoruba: A subset of 245,469 SNPs were identified as being present in dbSNP and further validated by a capillary sequencing study in NA19240 and at least one other Yoruban individual (of NA18506, NA18507 and NA18508) reported in Nature 2008 453(7191):56-64.
2. NA19240&Yoruba near repeats: A subset of 166,343 SNPs in NA19240&Yoruba were located within 100 bp of a known repeat element from the UCSC genome browser.
3. Numbers in parentheses are false negative rates.
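The false negative rates in parentheses follow directly from the counts and the validated set sizes given in the footnotes: the rate is the share of validated SNPs that the call set missed. A minimal sketch, with an illustrative function name:

```python
def false_negative_rate(called, truth_total):
    """Percentage of validated SNPs that were not called."""
    return 100.0 * (1 - called / truth_total)


validated = 245_469        # NA19240&Yoruba validated SNPs
near_repeats = 166_343     # subset within 100 bp of a known repeat element

rates = {
    "CASAVA 1.7, all validated": false_negative_rate(230_026, validated),
    "CASAVA 1.8, all validated": false_negative_rate(237_113, validated),
    "CASAVA 1.7, near repeats": false_negative_rate(152_890, near_repeats),
    "CASAVA 1.8, near repeats": false_negative_rate(159_629, near_repeats),
}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f}%")
```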

New Chemistry and Software Enable Higher Data Quality

Chemistry: TruSeq v3 cBot Kit improves coverage in GC-rich regions

Software: HCS1.3/RTA.10 improves cluster detection at higher densities

Software: CASAVA 1.8 Repeat Resolution and Orphan Aligner improve coverage

Software: CASAVA 1.8 probabilistic variant calling reduces false positives and false negatives

Chemistry: TruSeq v3 SBS Kit improves phasing and signal decay

Software: CASAVA 1.8 Local Realignment improves coverage

Software Ecosystem for Illumina Data

Application          Software
DNA Alignment        CASAVA, BWA, Bowtie
RNA-Seq              TopHat/Bowtie/Cufflinks (novel isoform discovery, differential expression, counting); CASAVA (counting)
SNP Calling          CASAVA, GATK, SOAPsnp
Indel Detection      CASAVA, Breakdancer
De novo assembly     Velvet, Allpaths-LG, SOAPdenovo

Data Volumes: HiSeq 2000 + v3 Chemistry + CASAVA 1.8 (600 Gb output per run; Q2, 2011)

Data Volume                        File Format         Size      Comment
Base Call / Quality Score Data     .bcl                660 GB    Intermediate file format
Read-level Data                    compressed fastq    660 GB    FASTQ saving optional
Alignment Output & Archiving       .bam                660 GB    BAM

Note: Export files are ~3.5 bytes per base, and the temp space needed is twice the size of the export files. For a 600 Gbase run, ~5 TB of disk space is needed.

Note: Gb = gigabase; GB = gigabyte

Tools to Get You Started with TopHat Suite (RNA-Seq)

TopHat User Guide

Annotated, ready-to-use references on iCom/ftp site
– CASAVA reference files
– Bowtie reference files

Coming Soon
– Graphical User Interface for TopHat Suite (AVC)
– Array/RNA-Seq Comparison Tech Note
– Array/RNA-Seq Comparison Tool

Reversing the Trend: Scaling output while reducing the data burden

[Figure: chart covering 2008 to 2011 on a 0 to 600 scale, with two series: Bases (Gb/Wk) and Bytes (Gbytes/day).]


Why IlluminaCompute?

IlluminaCompute: Turnkey, Optimized Server Solution

For human-scale genomic data processing and analysis

Used by Illumina’s own sequencing service centers

Modeled around systems used at most genome centers

[Diagram: scalable processing (CPUs); scalable storage; configuration, maintenance, and support.]

Turnkey: built, installed, and supported by Illumina

Performance scales with sequencing throughput

Cost-effective solution given performance and support

The Future: Addressing Three Key Needs

Workflow management simplifies sequence data analysis

Data Generation and Management; IT Infrastructure; Bioinformatics

Sample Prep & Instrument Control

Local HPC

Cloud

Analysis “AppStore”

Sequencing workflow engine

The Future: Simplified Workflows

Illumina Sequencing Workflow Manager

Simplifies and automates sequencing data generation and analysis

Customization, execution, and monitoring of sequencing workflows

Integrates open source analysis tools and LIMS systems
