Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
NEXT GENERATION SEQUENCING AT THE FGCZ 1
newsletterTechnologies, Applications, and Access to Support
Next Generation Sequencing at the Functional Genomics Center Zurich
Next Generation Sequencing (NGS) has become one of the major technologies at the FGCZ. The impressive throughput and read lengths of the current high-end systems enable research groups to de novo sequence genomes of small to complex organisms. They allow for the analysis of genome alterations, gene expression and DNA modifications, as well as a constantly growing number of other applications.
To account for the vast diversity of life science research at the ETH Zurich and the University of Zurich, the FGCZ has continuously extended its portfolio of NGS technologies and applications. The emphasis of this expansion has not only b e e n o n c a p a c i t i e s b u t a l s o o n capabilities, leading to the establishment of a complete range of NGS technologies: Short read technologies (Life Technologies SOLiD and Illumina HiSeq) for the generation of very large numbers of short to medium length reads, long read technologies (Roche/454 GS FLX+) for the generation of medium to large numbers of long reads, plus the latest technologies focusing on the very rapid generation of sequencing data (Ion Torrent PGM) and very long reads and novel applications at the single molecule level (Pacif ic Biosciences RS).
2008 $$$2009 $ $2010 $ $ $2011 $ $ $ $ $ $2012 $$
The on site availability of these systems allows us to flexibly combine the different technologies within or between projects and opens up combinations of possibilities that ensure suitable analytical strategies tailored to the research project’s needs. As a consequence of this flexibility
and the number of technologies available, NGS projects need careful planning and consideration at multiple levels, including sample availability and quality, robustness of methods and protocols, processing and analysis of data, as well as financial and time constraints.
This complexity emphasizes the need for a close collaboration of ETH and UZH research groups with analytical and bioinformatics experts of the FGCZ, which is why access to the FGCZ NGS platform is provided through the User Lab. In parallel to the technology expansion, significant efforts have been undertaken in increasing the User Lab staff working in the technologies and bioinformatics sections. As a result, optimized analytical protocols and data analysis workflows have been established that lead to significantly shortened turnover times from sample generation to data interpretation.
Technologies and support modes at the FGCZ
In the following, we briefly describe the available technologies (from most recent to most established) including supported applications. More information on the setup of NGS and access via the FGCZ User Lab and User Lab Services can be found on the FGCZ website at www.fgcz.ch. For further or more specific ques t i ons , p l ease con tac t us a t [email protected]
OVERVIEW
1NGS AT THE FGCZTechnologies and organization
2NGS APPLICATIONS AND TECHNOLOGIESApplications in sequencing
3PACIFIC BIOSCIENCES RSSingle molecule sequencing
4ION TORRENT PGMFast turnover sequencing
5ILLUMINA HISEQ2000High throughput short reads
6LIFE TECHNOLOGIES SOLID 5500XLFlexible short read sequencing
7ROCHE/454 GS FLX+Reliable long read sequencing
8BIOINFORMATICS OF NGSAnalysis workflows and support
FGCZ NEWSLETTER FALL 2011
NEXT GENERATION SEQUENCING AT THE FGCZ 2
NGS@FGCZ Applications and Technologies De novo sequencing of genomes and metagenomes
Emphasis on longer reads, medium to high throughput* * * Roche GS FLX+* * * Pacific Biosciences RS* * Illumina HiSeq (+++ in combination with PacBio or GS FLX+)* * Ion Torrent PGM (for smaller genomes)
De novo sequencing of transcriptomes Emphasis on medium to longer reads, high to very high throughput* * * Illumina HiSeq * * * Roche GS FLX+* * Ion Torrent PGM
Genome-wide SNP discovery and variant detection Emphasis on high to ultra-high throughput, short to medium read lengths* * * Illumina HiSeq (max. throughput)* * * LifeTech SOLiD (max. flexibility)* * Ion Torrent PGM (max. speed)
Transcriptome analysis Emphasis on high to ultra-high throughput, short to medium read lengths* * * Illumina HiSeq (max. throughput)* * * LifeTech SOLiD (max. flexibility)* * Ion Torrent PGM (max. speed)
Small RNAs Emphasis on high to ultra-high throughput, short read lengths* * * Illumina HiSeq (max. throughput)* * * LifeTech SOLiD (max. flexibility)* * Ion Torrent PGM (max. speed)
ChIP-seq Emphasis on high to ultra-high throughput, short to medium read lengths* * * Illumina HiSeq (max. throughput)* * * LifeTech SOLiD (max. flexibility)* * Ion Torrent PGM (max. speed)
Amplicon sequencing Emphasis on longer reads, medium to high throughput* * * Roche GS FLX+* * * Pacific Biosciences RS* * Illumina HiSeq (+++ in combination with PacBio or GS FLX+)* * LifeTech SOLiD (+++ in combination with PacBio or GS FLX+)* * Ion Torrent PGM
DNA Methylation analysis Emphasis on flexibility and protocols* * * Pacific Biosciences RS * * * LifeTech SOLiD * * * Illumina HiSeq
The list of applications and suitable technologies is not exhaustive nor does it exclude the use of a specific analytical technology for applications mentioned.The purpose of the list is to provide an initial overview that is always refined based on individual needs during the setup phase of an FGCZ project. Technologies of equal suitability are listed in alphabetical order.The list of applications and suitable technologies is not exhaustive nor does it exclude the use of a specific analytical technology for applications mentioned.The purpose of the list is to provide an initial overview that is always refined based on individual needs during the setup phase of an FGCZ project. Technologies of equal suitability are listed in alphabetical order.
FGCZ NEWSLETTER FALL 2011
* MP: mate pair; PE: paired-end§ The actual amount depends on the library type (and the number of needed SMRT cells in the case of PacBio). For accurate information please refer to each technology section.
NGS@FGCZ Technology SpecificationsPlatforms Library type * Library
preparation Chemistry Run time Read length
(bp) Throughput (Gb per run)
Amount of input materials §
PACIFIC BIOSCIENCES RS
Frag PCR free Single Molecule Real Time Sequencing
24 hrs (90 min per SMRT cell)
1500 1.2 (16 SMRT cells)
1-10 µg DNA
ION TORRENT PGM
Frag emPCR Ion semiconductor sequencing
2 hrs 200 1 0.1- 5 µg DNA 0.2- 1 µg poly(A)RNA or rRNA -
depleted total RNA
ILLUMINA HISEQ2000
Frag, MP, PE
Solid phase Reversible Terminator
10 days 2x100 600 1 µg DNA, 0.1-4 µg total RNA 1 – 10 µg for small RNA 10 µg Single ChIP enriched DNA
LIFE TECHNOLOGIES SOLID 5500XL
Frag, MP, PE
emPCR Sequencing by Ligation
10 days 35-75 (Frag) 75+35 (PE) 2x60 (MP)
200 0.1-5 µg DNA 0.2-1 µg poly(A)RNA or rRNA - depleted total RNA
ROCHE/454 GS FLX+
Frag, MP emPCR Pyrosequencing
23 hrs 700 0.7 1 µg DNA 1-2 µg total RNA
!
NEXT GENERATION SEQUENCING AT THE FGCZ 3
Single Molecule Sequencing
Pacific Biosciences RS
The Pacific Biosciences RS system is a third generation sequencer that provides single molecule and real time sequencing technology (SMRT) based on an uninterrupted template-directed DNA-polymerase synthesis. The RS gives the longest read length achievable to date: the average read length per run is currently around 1.5 kb, with instances of over 10,000 base pairs, which facilitates mapp ing and assembly. SMRT technology bears the potential to retrieve k inet ic sequencing informat ion of individual molecules, a method that is fo reseen to a l l ow fo r the d i rec t identification of chemical modifications, such as methylation, at a single base r e s o l u t i o n . T h e P a c B i o R S , i n combination with massive high throughput sequencing (i.e. HiSeq 2000 or SOLiD 5500xl) provides an ideal solution to cover a wide range of applications.
TechnologyT h e S M RT D N A s e q u e n c i n g
technology is built upon three key innovations. The SMRT Cell, provides single molecule, real-time observation of individual fluorophores against a dense background of labeled nucleotides while maintaining a high signal-to-noise ratio. In contrast to using base-linked nucleotides, SMRT Sequencing uses phospholinked-nucleotides (the fluorescent dye is attached to the poly-phosphate chain of the nucleotide). Through such attachment, incorporat ion of a phosphol inked nucleotide by the DNA polymerase results in separation of the dye molecule from the nucleotide when the enzyme forms the phospho-diester bond between two
nucleotides. A Real Time Detection platform provides single molecule, real-time detection as well as flexibility in run configurations and applications.
DNA sequencing is performed on SMRT Cells, each containing thousands of zero-mode waveguides (ZMWs). A ZMW is a hole, tens of nanometers in diameter, fabricated in a 100 nm metal film deposited on a silicon dioxide substrate. Each ZMW becomes a nanophotonic visual izat ion chamber providing a detection volume of just 2*10-20 liters. At this volume, the activity of a single molecule can be detected amongst a background of thousands of labeled nucleotides. Within each chamber, a single DNA polymerase molecule is attached to the bottom surface such that it permanently resides within the detection volume. Phospholinked nucleotides, each type labeled with a different colored fluorophore, are then introduced into the reaction solution. As the DNA polymerase incorporates complementary nucleotides, each base is held within the detection volume for tens of milliseconds. During this time, the engaged fluorophore emits fluorescent light whose color corresponds to the base identity. Then, as part of the n a t u r a l i n c o r p o r a t i o n c y c l e , t h e polymerase cleaves the bond holding the fluorophore in place and the dye diffuses out of the detection volume. Following
incorporation, the signal immediately returns to baseline and the process repeats. The DNA polymerase continues incorporating an average of 1-2 bases per second producing a long chain of DNA in minutes.
ApplicationsSMRT technology is ideal for
applications such as de novo genome and transcriptome sequencing (especially i n c o m b i n a t i o n w i t h s h o r t r e a d sequencers). PacBio’s long reads make the assembly of the genomic structure much easier, enabling a comprehensive view of the genome. SMRT technology is a PCR free approach: standard PacBio library preparation and sequencing methods do not include any amplification step. Native DNA is directly sequenced without PCR bias and potentially allows the identification of chemical base modifications, such as methylation.
PerformanceThe PacBio RS can process 16
SMRT cells per run in approximately 24 hours (90 minutes per SMRT cell). Currently every cell produces up to 50‘000 reads, with average read length of 1.5 kb with few reads hitting the 10 kb. With the next chemistry release (Q1, 2012) the average read length will be doubled to 3 kb.
Input Requirements* 500 bp library: 1 µg high quality gDNA* 2000 bp library: 3 µg high quality gDNA* 8000 bp library: 8 µg high quality gDNA
PacBio RS at FGCZ The PacBio RS system is accessible
through the FGCZ User Lab. For the time being, no User Lab Services are offered but projects including all analytical steps are carried out in close collaboration between FGCZ experts and User Group researchers. More information: PacBioRS
FGCZ NEWSLETTER - NGS TECHNOLOGIES FALL 2011
APPLICATIONSAND MAIN USE
Backbone sequencing
De novo sequencing
Methylation studies
TECHNOLOGYIN SHORT
Third generation sequencer using Single Molecule Real Time Sequencing Technology (SMRT).
Very long reads up to 10 kb, average read length 1.5 kb. Samp le p repa ra t i on and sequencing are PCR free: sequence native DNA.
Sma l l genomes de -novo sequencing and potent ia l detection of DNA methylation at base resolution.
NEXT GENERATION SEQUENCING AT THE FGCZ 4
Personal Genome Machine
PGM by Ion Torrent
The Personal Genome Machine (PGM) Sequencer is a benchtop PostLight semiconductor-based platform that p e r f o r m s a s e q u e n c i n g r u n i n approximately two hours.
TechnologyThe PGM advances Next-Generation
sequencing to PostLight sequencing: the t ranslat ion of chemical sequence information directly into digital form. The sequencing technology underlying the PGM exploits a well-characterized biochemical process: When a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion (H+) is released as a byproduct. This hydrogen ion carries a charge which the PGM System’s ion sensor—essentially the world's smallest solid-state pH meter— can detect. As the sequencer floods the chip with one nucleotide after another, any nucleotide added to a DNA template will be detected as a voltage change, and the PGM System will call the base. If a nucleotide is not a match for a particular template, no voltage change will be detected and no base will be called for that template.
By eliminating the need for the optical system, the PGM provides sequencing that is simpler, faster, more cost effective, and more scalable than any other technology available.
A principal component of the PGM is the sequencing chip. This microprocessor chip incorporates an extremely dense array of >1 million micro-machined wells married to our proprietary ion sensor.
Each well contains a different DNA template, allowing massively parallel sequencing. Chips can scale in density for any application, from small, targeted experiments to large genomes.
ApplicationsThe PGM system enables many
possib le sequencing appl icat ions. Performing multiplex amplicon sequencing is possible in a very short time. The 200bp reads make sequencing long amplicons possible, resulting in more simple and affordable projects. It's perfect platform to obtain optimal results for microbial sequencing projects with extraordinary uniformity of coverage and long read lengths. The four to eight million reads produced by the 316 and 318 chips are ideal for applications such as small RNA sequencing, transcriptome sequencing, and ChIP-Seq.
PerformancePerformance of the system directly
depends on the sequencing chip used.
OutlookIon Torrent recently launched a long
read kit for the PGM sequencer with modal high-quality read lengths of 225 bases. Read lengths greater than 500 bases are feasible, as demonstrated by the generation of a perfect 525 bases read.
A novel paired-end sequencing (PES) method was developed for the PGM sequencer that produces paired reads, each 100 bases in length (2 x 100). We anticipate that applying recent improvements in read length to the Ion PES protocol should produce paired reads of 200 bases in length (2 x 200), with the possibility of paired reads up to 400 bases in length (2 x 400).
Input Requirements* DNA Seq - From 100 ng to 5 µg high-
quality RNA-free genomic DNA* RNA Seq - From 200 ng to 1 µ g
poly(A)RNA or rRNA-depleted total RNA
PGM at FGCZ The PGM system is accessible
through the FGCZ User Lab Services. More information: PGM
FGCZ NEWSLETTER - NGS TECHNOLOGIES FALL 2011
APPLICATIONSAND MAIN USE
Small genomes sequencing
Targeted re-sequencing (amplicon)
Low complexity transcriptomes and small RNA sequencing
TECHNOLOGYIN SHORT
Output depending on chip
> 10 Megabases > 100 Megabases
Read length up to 200 bases
Run Time < 2 Hours
Ion Semiconductor Sequencing Chip
Output Read Length Total Sequencing Time
314
> 10 Megabases Year 2011 >200 bp
Year 2012 > 400 bp
< 2 Hours
316 > 100 Megabases
318
> 1 Gigabase
Accuracy > 99.99% consensus accuracy and >99.5% raw accuracy !
NEXT GENERATION SEQUENCING AT THE FGCZ 5
High Throughput Short Read Sequencing
Illumina HiSeq 2000
The HiSeq2000 from Illumina is a high throughput sequencer that uses a massively paral le l sequencing-by-synthesis approach to generate billions of bases of high quality sequence data. To date, it delivers the industry’s highest throughput of quality filtered data at up to 55 Gb per day or 600 Gb per run with read lengths of 2X100 bp. The high throughput and read length of the system provides the coverage necessary for various next generation sequencing applications. .
TechnologyIllumina is different from the SOLiD
and 454 sequencers because it does not use beads for clonal amplification of f r a g m e n t s , i n s t e a d a f t e r l i b r a r y preparation, cluster or bridge PCR is performed directly on the flow cell.
For sequencing, Il lumina uses the s e q u e n c i n g b y s y n t h e s i s ( S B S ) technology. It is a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs a re p resen t du r i ng each sequencing cycle, natural competition minimizes incorporation bias.
ApplicationsSBS technology supports both single
read and paired-end libraries. A wide array of available sample preparation methods serve to enable diverse applications, including: - DNA Sequencing- Transcriptome Analysis (RNA-Seq)- SNP Discovery and Structural Variation Analysis- Cytogenetic Analysis (copy number variation analysis, CNV)- DNA-Protein Interaction Analysis by - Chromatin Immunoprecipitation (ChIP-Seq)- Methylation Analysis- Small RNA Discovery and Analysis
PerformanceThe HiSeq 2000 can process 2 flow
cells in parallel . Each flow cell is made of 8 lanes that can be run independently. The HiSeq 2000 can generate a maximum of 600 Gigabases of qual i ty - f i l tered sequence data per run with read lengths of up to 2 x 100 base pairs, and provide up to three billion single-end reads and up to six billion paired-end reads. A run takes about ten days.
Input Requirements* DNA Seq - From 1 µg high-quality RNA-
free genomic DNA* RNA Seq - From 100 ng to 4 µg column
purified, genomic DNA free total RNA* Small RNA Seq - From 1 to 10 µg of
Trizol purified total RNA* Chip Seq -From 10 ng Single ChIP
enriched DNA or input DNA
HiSeq2000 at FGCZ The HiSeq2000 system is accessible
through the FGCZ User Lab Services. More information: HiSeq2000
FGCZ NEWSLETTER - NGS TECHNOLOGIES FALL 2011
APPLICATIONSAND MAIN USE
DNA Sequencing
Gene Regulation Analysis
Sequencing-Based Transcriptome Analysis
SNP Discovery and Structural Variation Analysis
Cytogenetic Analysis, DNA-Protein Interaction (ChIP-Seq)
Sequencing-Based Methylation Analysis
Small RNA Discovery and Analysis
TECHNOLOGYIN SHORT
Sequence up to 16 independent lanes on 2 flow cells in one run
24 unique barcodes provided by Illumina for multiplexing for RNA and DNA libraries
Up to 384 independent samples per run
~ 1.5 billion high-quality reads per flow cell (~180 million/lane)
up to 600 Gb output per run
single-read (50 or 100 bp) or paired-end sequencing (2 x 50 bp; 2 x 100bp)
No. of Lanes No. of reads (Millions) Throughput (Gigabases) 2 X 100 sequencing
1 lane 188 37.5 8 lanes (1 flow cell) 1504 300
16 lanes (2 flow cells) 3008 600 !
NEXT GENERATION SEQUENCING AT THE FGCZ 6
APPLICATIONSAND MAIN USE
DNA Sequencing (Fragment, Paired End, Mate-Paired)
Whole Transcriptome Analysis
Small RNA Expression
SAGE - genome-wide expression
TECHNOLOGYIN SHORT
Sequencing by ligation, two bases at a time
Sequence up to 12 independent lanes on 2 flow chips in one run
96 unique barcodes
Up to 200 Gb output per run
Flexible read lengths
Paired-end sequencing (75 x 35 bp) is available
High Accuracy and Reference-free analysis with Exact Call Chemistry Module
Flexible Short Read Sequencing
Life Technologies SOLiD 5500xl
The SOLiD 5500xl is a highly accurate, massively paral lel next-generation sequencing platform that supports a wide range of applications. Sequencing lanes in the flow chip can be run independently with pay-per-use reagents. Multiplexing capability (96 barcodes) allows sequencing of multiple library types in a single run.
TechnologyThe major difference between SOLiD
sequencing and other high-throughput sequencing platforms is its unique sequencing by l igat ion instead of sequencing by synthesis chemistry.
The ligation is performed using specific, flourescently labeled octamer probes made up of 2 known bases at the 3 prime end followed by three degenerate bases then three universal bases. These probes are present simultaneously and compete for incorporation. After each ligation, the fluorescence signal is measured and then cleaved before another round of ligation takes place. A
reset phase allows a reduction in noise - a capping step that prevents dephasing. Sequential ligations of fluorescently labeled probes detect every combination of two adjacent bases. Multiple cycles of ligation, detection and cleavage are performed, with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and the t emp la te i s r ese t w i t h a p r ime r complementary to the n-1 position for a second round of ligation cycles. Five rounds of primer reset are completed for each sequence tag. Through the primer reset process, virtually every base is interrogated in two independent ligation reactions by two different primers. Up to 99.99 % accuracy is achieved with the Exact Call Chemistry Module by sequencing with an additional primer using a multi-base encoding scheme.
ApplicationsT h e s e q u e n c i n g b y l i g a t i o n
technology supports both single read and paired-end libraries. A wide array of sample preparation methods serve to enable diverse applications.- Fragment Sequencing (from 35 bp to 75 bp) also with Barcodes (currently 96) - Paired End (50-75)/35 bp (96 barcodes) - Mate-Paired 2 x 60 bp - Whole Transcriptome Analysis (96 barcodes) - Small RNA Expression (96 barcodes) - SAGE - genome-wide expression
PerformanceThe SOLiD 5500xl can process 2
flow chips in parallel. Each flow cell is made of six lanes that can be run independently. SOLiD 5500xl can to generate up to 198 gigabases of raw sequence data per run with read lengths of up to 75 X 35 base pairs. A run takes about ten days.
Input Requirements* DNA Fragment library - 500 ng of high
quality column purified genomic DNA * Mate Pair library - 5-20 µg of high
quality column purified genomic DNA* Targeted Enrichment (Sure Select) -
3 µg of high quality column purified genomic DNA
* Chip-Seq - 20 ng -1 µg DNA* RNA seq - 200 ng of poly(A) RNA or
200 ng of rRNA-depleted RNA* small RNA analysis - 5-10 µg of integral
high quality total RNA
SOLiD 5500xl at FGCZ The SOLiD 5500xl system is
accessible through the FGCZ User Lab Services.
More information: SOLiD5500xl
FGCZ NEWSLETTER - NGS TECHNOLOGIES FALL 2011
No. of Lanes No. of reads (Millions) Throughput (Gigabases) 75 X 35 sequencing
1 lane 150 16.5 6 lanes (1 flow chip) 900 99
12 lanes (2 flow chips) 1800 198 !
NEXT GENERATION SEQUENCING AT THE FGCZ 7
High Reliability Long Read Sequencing
Roche/454 GS FLX+
454 Sequencing uses a large- scale parallel pyrosequencing system, which is a bioluminescence method that relies on a single addition of a dNTP by a DNA polymerase.
TechnologyThe system depends on fixing
nebulised and adapter-ligated DNA fragments to small DNA-capture beads in a water-in-oil emulsion. The DNA fixed to these beads, ideally one DNA molecule per bead, is then singly clonally amplified by PCR. Each DNA-bound bead is placed into a ~29 µm well on a PicoTiterPlate (PTP), a titanium coated fiber optic slide. A mix of enzymes such as DNA polymerase, ATP sulfurylase, and luciferase are also packed into the well, the two latter facilitate light production.
Pyrosequencing
Emulsion PCR for signal amplification
The PTP is then mounted in a flow chamber, and individual dNTPs are flushed across the wells in a pre-determined sequential order. The lower surface of the PTP is directly attached to a high-resolution CCD camera, which allows
detection of the light generated from each PTP well undergoing the pyrosequencing reaction. Pyrosequencing, basically, measures the release of inorganic py rophospha te by p ropo r t i ona l l y converting it into light using a series of enzymatic reactions. The light signal generated by the enzymatic cascade is recorded as a series of peaks called a flowgram.
Applications454 sequencing enables:
- De novo whole genome sequencing- RNA analysis- Re-sequencing of whole genomes and
target DNA regions- Metagenomics
The GS FLX+ system allows to perform straightforward de novo assembly to decode previously uncharacterized genomes or transcriptomes, or re-sequence organisms with an available reference. Whole genome sequencing projects may use shotgun reads alone or in combination with mate paired reads to generate accurate draft assemblies of any organism. The clonal nature of the 454 Sequencing System renders this platform par t icu lar ly su i tab le for ampl icon sequencing. This application allows unambiguous allele resolution of variation in complex regions of the genome along with quantitative detection of variants present in less than 1 % of a mixture.
PerformanceThe GS FLX+ System features the
unique combination of long reads, exceptional accuracy and high-throughput, making the system well suited for varied genomic projects.
Latest improvements in sequencing chemistry, instrumentation and software have lead to an increased read length of up to 1000 bp. Thus, this technology a l lows a comprehens ive genome coverage and the exploration of the full range of genetic variation using long, high-quality reads with a very high consensus accuracy (99.995% to 99.997 %).
The GS FLX+ System is designed to use with both, the new long-read Sequencing Ki t XL+ and exist ing Sequencing Kit XLR70, each with their various features.
Input RequirementsFor shotgun fragment libraries: * 1 µg for average read length of 700 bp* 500 ng for a shotgun fragment library of
400 bp read-length * For mate pair libraries, with insert sizes
of 3 kb, 8 kb or 20 kb, we need a input of gDNA of 5 µg, 15 µg or 30 µg.
* For RNA analysis 1 to 2 µg of total RNA.
GS FLX+ at FGCZ The GS FLX+ system is accessible
through the FGCZ User Lab Services.More information: FLX+
FGCZ NEWSLETTER - NGS TECHNOLOGIES FALL 2011
APPLICATIONSAND MAIN ADVANTAGES
De novo sequencing
Re-sequencing
Targeted re-sequencing (Amplicon)
RNA analysis
Metagenomics
TECHNOLOGYIN SHORT
Over 1 Mio reads per run
Consensus accuracy greater than 99.99 %
For shotgun sequencing: Average read length of 700 bp Up to 1000 bp reads
Sequencing Kit New GS FLX Titanium XL+ GS FLX Titanium XLR70
Read Length
Up to 1'000 bp Up to 600 bp
Mode Read Length 700 bp 450 bp
Typical Throughput 700 Megabases 450 Megabases Reads per Run ~1 Million Shotgun ~1 Million Shotgun
~700'000 Amplicon Consensus Accuracy
99.997% 99.995%
Run Time 23 Hours 10 Hours Multiplexing Multiplex Identifiers (MIDs): 132 Gaskets: 2, 4, 8, 16 Regions !
NEXT GENERATION SEQUENCING AT THE FGCZ 8
Turning Sequencing Reads into Knowledge
NGS Bioinformatics
NGS technology and the resulting data are uniquely suited for answering a wide range of biological questions by overcoming limitations that so far existed using classical sequencing or alternative microarray approaches. Due to the massive amount of data involved, the management and analysis of the data requires dedicated software and high-performance and capacity computing resources. The FGCZ has developed and productively implemented data processing pipelines for standard data analysis workflows, which can be readily applied to analyze the sequencing data generated at the center. In addition to standard tools and workflows, the bioinformatics team provides access to its expertise and the optional development of customized solutions via the FGCZ User Lab. As for the experimental analysis part, the analysis and interpretation of the resulting data relies on the close interaction of FGCZ staff with the users that is essential to generate scientifically relevant results.
Standard Data Analysis SupportAs an in tegra l par t o f every
sequencing project, the FGCZ provides experimental design consultation during the setup phase of a project in conjunction with the discussion for selecting the most suitable analytical platform. The actual data analysis support following data production in a given project then significantly depends on the type of the study:
For projects with a reference genome, the standard support includes:- files of raw reads- read alignment to the reference(s)- initial secondary analysis (e.g. SNP
calling, expression quantitation, peak finding, and so on)
- QC Report
For de novo sequencing projects, the standard support includes:- files of raw reads- assembly of the raw reads with vendor
assembler using default parameters
The currently available standard data analysis workflows cover popular NGS applications, such as:- Re-sequencing, SNP discovery and variant detection- Transcriptome analysis and expression quantitation- Regulome analysis (small RNA, transcription factor binding, histone modification, methylation)
Standard data analysis and support efforts in the form of a limited number of consulting hours are part of the User Lab Services and are free of charge. Analysis pipelines and consulting for additional applications can be added, based on demand and available resources.
Customized Data Analysis ServiceA significant number of research
projects using the large flexibility and many options of NGS show a distinct need for customized support or even the development of new data analysis procedures or tools. Depending on the availability of resources and subject to project-specific agreements, the FGCZ collaborates on non-standard data analysis, like for example, the discovery and in silico verification of new mirRNAs.
Training and SupportIn addition to project-specific data
analysis, the FGCZ provides training and education at the conceptual and concrete software usage levels. Courses and tutorials are announced on the FGCZ Genomics Bioninformatics website.
FGCZ NEWSLETTER - NGS BIOINFORMATICS FALL 2011
DATA ANALYSIS WORKFLOWSIN SHORT
De novo assembly
Mapping to the reference genome
SNP and INDEL detection
Digital expression
ChIP-seq region detection
NEXT GENERATION SEQUENCING AT THE FGCZ 9
Example Workflows
F r o m R a w D a t a t o Knowledge
D e n o v o a s s e m b l y a n d a n n o t a t i o n o f g e n o m e s a n d transcriptomes
De novo assembly and annotation of genomes have high demands on customization of the analysis workflows. While the assembly using default parameters will frequently yield a satisfactory result, an iterative approach with manual inspection and adaptation of assembly parameters considerably improves the assembly outcome. The a n n o t a t i o n p r o c e s s h a s t w o components: structural annotation (identification of genomic elements) and functional annotation (linking biological information to the genomic elements). Both steps highly depend on the species sequenced and require close collaboration of the researchers with the FGCZ Bioinformaticians. Within such collaborations we provide sequence simi lar i ty and sequence domain searches to provide in silico predictions and functional annotations of genes. Additionally, we also bring our expertise into further analyses, such as in-depth analysis of gene families (e.g. secreted proteins, membrane proteins).
RNA-seq analysis For RNA-seq samples, we map the
reads to both the transcriptome and genome. For example, for human samples this will be the latest version of ENSEMBL transcripts and the hg19 assembly. Subsequently we use R/Bioconductor to load the reads from the BAM alignment files and generate expression counts for genes. With the R language DESeq package, we compute the significantly differentially expressed genes.
As an example for a recent project, RNA from four samples of human tissues were sequenced as paired-reads on a single slide of the SOLiD 5500xl. Following sequencing, the reads were mapped to the reference human genome (hg19), creating four sorted and indexed BAM files. Using the R/Bioconductor library Rsamtools, and Ensembl definitions of gene regions, the mapped reads were transformed into counts for each gene. The counts table was analyzed with DESeq test to reveal significantly differentially expressed genes. The results were uploaded into a MetaCore workspace and functionally analyzed for the enr ichment of pathways, gene networks and Gene Ontology terms. As an additional element, we searched the intergenic spaces for regions of previously unknown transcription and did in-depth analysis of splicing variants for the list of genes pre-selected by the users.
ChIP-seq analysis A frequent goal of ChIP-seq is to
identify active transcription factor binding sites or chromatin modifications. Here the standard workflow starts with mapping the reads of the IP and control samples to the corresponding reference genome using the default mapper (e.g. for SOLiD reads included in the L i f e S c o p e s o f t w a r e p a c k a g e ) . Depending on whether the study searches for short IP-enriched regions (e.g. transcription factor binding sites) or extended enriched regions (e.g. promoter acetylation), we select different peak-finding strategies. For short peaks, we use the MACS software and for extended regions we use SICER. Additionally, we provide a third fully customizable analysis approach using the chipseq package in R/Bioconductor. In all cases, we provide coverage plots for visual verification of identified peaks as well as interactive b rows ing us ing the In teg ra t i ve Genomics Viewer (IGV).
FGCZ NEWSLETTER - NGS BIOINFORMATICS FALL 2011
10
A c c e s s t o C o m p u t i n g a n d Software
Resources for Users
Access to and organization of dataAll NGS data is automatically stored in
the FGCZ B-Fabric system and accessible via web interface. Sample meta-information and project information are available through the same interface to all members of the respective User Lab Services project.
Access to Computing ResourcesWhile the high performance computing
infrastructure for processing and basic analysis of NGS data is used exclusively within the established pipelines and by FGCZ bioinformatics staff, the center provides users access to a dedicated high-performance linux computer at the FGCZ for NGS data analysis. This way, users have access to a wide range of NGS data analysis software tools and databases. Computationally-intensive data analysis tasks can be scheduled by the FGCZ team on the FGCZ computing cluster.
Bioinformatics Tools and DatabasesMore than one hundred bioinformatics
software packages and many standard life science databases are implemented and maintained at the FGCZ. Additional applications or databases can be hosted on request.
Frequently used packages are:- Open source software: Abyss, amos, BEDTools, BLAST, BLAT, Bowtie, bwa, consed, FASTQC, GATK, InterProScan, JIGSAW, MACS, MAQ, mira, Mosaik, mothur, phrap, phred, picard, PyroNoise2, Qiime, R/Bioconductor, Samtools, SHRiMP, SICER, SOAPdenovo, Tophat, velvet- Commercial software: CASAVA, CLCBio Genomics Workbench, CLCBio NGS Cell, LifeScope, 454 Software Package, tmap
Frequently used databases:- Standard sequence databases: NCBI nr, nt, est, cdd, SwissProt, PFAM, KEGG- Specialized databases: miRBase, 16S rDNA databases (RDP, GreenGenes).
FURTHER INFORMATIONSEQUENCING CONTACTVia eMail to [email protected] or [email protected]. FGCZ staff will initiate a meeting to discuss options and workflows, issue quotes, and advice on all aspects from study design to analysis to data interpretation.
BIOINFORMATICS SUPPORTBioinformatics consulting and support is an integral part of User Lab projects and User Lab Services. As resources are limited, discussion about the level of support and training offered to users is part of the project setup process.
USER LAB AND SERVICESMore information on the general setup of the FGCZ and its access modes can be found at www.fgcz.ch
FGCZ NEWSLETTER - NGS BIOINFORMATICS FALL 2011
Sources: Pictures of instruments, consumables and workflows have been retrieved from vendor sources. Text is in part based on informa>on materials of the vendors.
Disclaimer: The technologies men>oned may be well used for applica>ons men>oned or non-‐men>oned also in the absence of an FGCZ recommenda>on: due to the large number of protocols and applica>ons possible, the FGCZ can only support a limited number of protocols and methods per plaGorm and therefore limits the number of standard recommenda>ons. Before excluding op>ons, please consult with us.
SOFTWARE TOOLS AND ACCESS TO RESOURCES
Software available:
NGS reads mapping and assembly
Tag counting, quantification, profile pattern discovery
Sequence analysis and annotation
Databases available:
Primary sequence database
Protein sequence database
Genome database
Specialized Hardware available:
Dedicated high memory machine (24 processor, 146 Gb RAM) with direct user access
SGE cluster of 120 cores with restricted access