  • Slide 1

Advancing metagenomics with Illumina sequencing technology
Anthony J. Cox
Computational Biology Group
Illumina Cambridge Ltd.
14th April 2014

© 2013 Illumina, Inc. All rights reserved. Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, NuPCR, SeqMonitor, Solexa, TruSeq, TruSight, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

  • Slide 2

Contents
Challenge: achieving a seamless end-to-end workflow for metagenomics
Case study: Eagle Creek Reservoir
16S workflow on MiSeq
Shotgun metagenomics on NextSeq
Challenge: efficient storage and access for metagenomic data

  • Slide 3

Expanded sequencing portfolio
System | Output | Reads | Read length
MiSeq | 15 Gb | 25M | 2x300
NextSeq | 120 Gb | 400M | 2x150
HiSeq 2500 | 1000 Gb | 4B | 2x125
HiSeq X Ten | 1800 Gb | 6B | 2x150
Diagram axes: decreasing price per Gb; increasing system output

  • Slide 4

Streamlined end-to-end solution
Stages: Sample Prep, Sequencing, Analysis, Integration
Industry's leading NGS instruments
Storage, Processing, Analysis & Collaboration
Suite of DNA, RNA & Targeted Solutions

  • Slide 5

Assessing seasonal blooms of Cyanobacteria (blue-green algae) in drinking water that can impact water quality.
Collaboration with Center for Earth and Environmental Science, IUPUI
49 reservoir samples collected in different months, at discrete depths.
Study combines 16S analysis on MiSeq with shotgun metagenomics on NextSeq
Case study: Eagle Creek reservoir, Indiana
By courtesy of: Nicolas Clercin (IUPUI), Rob Schmeider, Brian Steffy, Clotilde Teiling, Kameran Wong (Illumina)

  • Slide 6

MiSeq continuous performance improvements
Delivering on promise of 15 Gb+, 2x300 bp reads
[Chart: output in Gb (axis ticks 1, 10, 20) by quarter, 2Q11 to 4Q13. Annotations: faster chemistry; dual surface imaging; new v3 reagent kits, 150 & 600-cycle]
Since launch: 10x increase in output, 7x decrease in price per data point
*Prices reflect US List only

  • Slide 7

Workflow overview
16S rRNA sequencing was done on 27 of the samples
Primer pair sequences for V3 and V4 region create a simple 460 bp long amplicon.
Nextera XT indexing kit for 96 samples in parallel
100,000 reads per sample if using all 96 indexes.
Diagram labels: Comparative genomics; Phylogenetic classification; Genomic DNA extraction; Sample Prep; V3V4 region Amplification; Library Prep; MiSeq & Primary Analysis; Secondary Analysis
The Meta-G-Nome DNA Isolation Kit is used to isolate inhibitor-free, fosmid cloning-ready DNA from unculturable or difficult-to-culture microbial species present in environmental water, soil, or compost samples.

  • Slide 8

16S metagenomics on BaseSpace

  • Slide 9

Can run on-instrument using MiSeq Reporter or in cloud with BaseSpace
Both analysis pipelines use the same classification algorithm and taxonomic database.
The classification algorithm is a high performance implementation of the published RDP Naïve Bayesian Classifier (http://dx.doi.org/10.1128%2FAEM.00062-07)
The database is an Illumina-curated version of the GreenGenes Consortium 16S rRNA database. Redundant sequences and entries with missing or partial labels are removed.
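
The naïve Bayesian classification idea mentioned above can be illustrated with a minimal Python sketch. This is not Illumina's implementation and not the RDP code: it is a simplified word-based classifier in the same spirit, in which every reference taxon is described by the k-base words seen in its sequences and a read is assigned to the taxon under which its words are most probable. The published classifier works on 8-base words; the taxon labels, the reference sequences and the simple `(m + 0.5) / (M + 1)` smoothing below are illustrative assumptions.

```python
import math

K = 8  # word size; the published RDP classifier uses 8-base words


def kmers(seq, k):
    """Return the set of distinct k-base words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references, k):
    """references maps a taxon label to a list of reference sequences.

    For each taxon, store how many of its sequences contain each word,
    together with the number of sequences for that taxon.
    """
    model = {}
    for taxon, seqs in references.items():
        counts = {}
        for s in seqs:
            for w in kmers(s, k):
                counts[w] = counts.get(w, 0) + 1
        model[taxon] = (counts, len(seqs))
    return model


def classify(read, model, k):
    """Return (best_taxon, log_score) for a read under the naive Bayes model."""
    words = kmers(read, k)
    best_taxon, best_score = None, -math.inf
    for taxon, (counts, n_seqs) in model.items():
        # Words are treated as independent given the taxon (the "naive" step).
        # Smoothing is an assumption: (m + 0.5) / (M + 1), never zero.
        score = sum(
            math.log((counts.get(w, 0) + 0.5) / (n_seqs + 1)) for w in words
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon, best_score


# Hypothetical toy references; real databases hold full 16S rRNA sequences.
REFERENCES = {
    "GenusA": ["ACGTTGCAAGGCTTACCGATAGGCTTAACGGATCC"],
    "GenusB": ["TTGACCGGTATCGAATCCGGTTAGCATGCAAGTCC"],
}
MODEL = train(REFERENCES, K)

read = REFERENCES["GenusA"][0][5:30]  # a read drawn from the GenusA reference
print(classify(read, MODEL, K))
```

A production classifier would also estimate confidence in each assignment, which this sketch leaves out.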
Provides fast, high-accuracy species-level taxonomic classifications
Uses full length of Illumina paired-end reads
Outputs: PDF reports, raw data (CSV), interactive visualizations
Taxonomic classification

  • Slide 10

Examples of 16S workflow output
PCA plot of normalized relative abundance of samples
Clustering dendrogram

  • Slide 11

NextSeq innovations
Consumables
Load-and-go flowcell: high or medium output, ships dry
All-in-one reagent tray: RFID-tagged, ships frozen
All-in-one buffer tray: ships at room temperature
Chemistry
2-dye sequencing chemistry: comparable quality to 4-dye
Isothermal amplification: no chiller on instrument
Optimized reagent consumption
Optics
Solid state optics: leverages advances in consumer products
No alignment needed
Fluidics
Eliminated fluidic tubes: less dead volume, waste, contamination
Automatic post-run wash protocols: bleach step eliminates carry-over
Simultaneous chemistry & imaging: chemistry in one lane while imaging other pair

  • Slide 12

Shotgun metagenomics on NextSeq: workflow overview
Workflow stages: Sample Extraction; Library Prep; NextSeq Sequencing; Analysis
11 samples sequenced in 1 NextSeq run
400 million 2x150 bp read pairs generated in 29 hours
78.8% of bases exceeded Q30
Analysis done with MG-RAST

  • Slide 13

Seasonal variation in composition at bottom of lake
23rd May: Actinobacteria = 76%
25th July: Actinobacteria = 33%
23rd October: Actinobacteria = 79%
Ongoing challenge: what should be our data analysis pipeline for shotgun metagenomic data, e.g. on BaseSpace?
Several standalone apps for taxonomic classification
Seem to be fewer options for functional classification

  • Slide 14

HiSeq 1 terabase run (R&D data)
2 x 125 cycles
Per run you can do up to:
10 genomes
150 exomes
80 WT RNA samples
*Assumes 100 Gb, 30x genome; Nextera Rapid Capture Exome; 50M reads per RNA sample

  • Slide 15

Challenge: efficient storage and access for shotgun metagenomic data

Resequencing data (Human genome build ~160 Gbp, ~400 Gbyte FASTQ)
Format | Size
FASTQ (gzipped) | 150 Gbyte
BAM (40 Q-scores) | 120 Gbyte
BAM (8 Q-scores) | 82 Gbyte
BAM (consensus compressed) | 60 Gbyte
CRAM (consensus compressed) | 27 Gbyte
Relies heavily on known high-quality reference sequence

Resequencing data (Human genome build 145Gbp, ~160 Gbp, ~400 Gbyte FASTQ)
Format | Size
FASTQ (gzipped, 8 Q-scores) | 89 Gbyte
BWT compression (now) | 37 Gbyte
BWT compression (likely achievable) | 23 Gbyte
89 → 37 Gbyte: BWT/PPM for reads, simple binning of Q-scores (lossless)
Sort reads for better compression: save 4 Gbyte (Cox et al., 2012)
Discard uninformative Q-scores (reference free): save 10 Gbyte (Janin et al., 2012)

  • Slide 16

Trading compression for searchability

Resequencing data (Human genome build ~165 Gbp)
Format | Size
FASTQ (gzipped) | 152 Gbyte
BWT (searchable) | 105 Gbyte
NB: 40 Q-scores; both FASTQ and BWT would be smaller for 8 Q-scores

Breakdown of the searchable BWT:
Reads (BWT): 26 Gbyte
Q-scores (razip): 64 Gbyte
Read names (razip): 15 Gbyte

For a query sequence q, returns:
Full FASTQ record (sequence, Q-scores, read names) for all reads containing q, and full FASTQ record of their read pairs
Pipe search output directly to your favourite tool, e.g. Velvet

Applications:
In silico pull-down
Assembling breakpoints
Genotyping complex variants by tracking k-mers

Further info: beetl.github.io/BEETL/, Janin et al. (2014, submitted)

  • Slide 17

Thank you!
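
To illustrate why the Burrows-Wheeler transform (BWT) discussed on slides 15 and 16 supports both compression and search, here is a minimal Python sketch. It is not BEETL's algorithm, which builds the BWT of a large read collection in external memory. It is a toy version that treats the input as one short string, builds the BWT by sorting rotations, counts runs of identical symbols (fewer runs compress better), and counts occurrences of a query `q` by backward search. All function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    assert "$" not in text
    text += "$"  # unique terminator, sorts before the letters A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def run_count(s):
    """Number of maximal runs of identical symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def count_occurrences(b, q):
    """Backward search over the BWT string b: occurrences of q in the text."""
    # first[c] = number of symbols in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(b)):
        first[c] = total
        total += b.count(c)
    lo, hi = 0, len(b)  # current interval [lo, hi) of sorted rotations
    for c in reversed(q):
        if c not in first:
            return 0
        # Naive rank by slicing; a real index uses sampled rank tables.
        lo = first[c] + b[:lo].count(c)
        hi = first[c] + b[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


text = "ACGTACGTACGT"  # hypothetical, highly repetitive toy sequence
b = bwt(text)
print(b, run_count(b), "runs versus", run_count(text + "$"), "in the input")
print("occurrences of CGTA:", count_occurrences(b, "CGTA"))
```

The search only counts matches. Returning the full FASTQ records, as described on slide 16, would additionally require mapping each match back to its read, which this sketch does not attempt.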
  • Slide 18

Extra slides

  • Slide 19

Moleculo Technology Enables Synthetic Long Reads
Up to 10 Kb from Illumina short reads
Synthetic long reads, 8-10 kb
Enables fully phased genomes
Accurate de novo assembly of large, complex genomes
Available: Illumina services 2H13; kit format early 2014

  • Slide 20

BaseSpace: Plug and Play Genomic Cloud Solution
All you need is an internet connection

  • Slide 21

How Is BaseSpace Being Used World Wide?
Users & Growth
Bioinformatics Cloud Computing Service
October 2011: Illumina begins streaming MiSeq data to the cloud
December 2011: Illumina begins data sharing in the cloud
November 2012: Illumina begins streaming HiSeq data to the cloud
December 2012: over 20,000 instrument runs streamed to BaseSpace
April 2013: over 40,000 instrument runs streamed to BaseSpace
May 2013: BaseSpace commercial (supported) release
July 2013: general availability of BaseSpace to all HiSeq instruments
September 2013: over 60,000 instrument runs streamed to BaseSpace, and over 10,000 apps run