Illumina sequencing and virus identification from smallRNA data
Patricia Otten
30th May 2012
COST training school, Uppsala - Sweden
30 May 2012 COST training school - Uppsala 2
Overview
Fasteris
Illumina sequencing technology
- library preparation (smallRNAs)
- cluster generation
- sequencing
- trimming (adapter search) and QC control
Bioinformatics analyses (smallRNAs)
- introduction to smallRNAs
- expression analyses
- virus identification
Fasteris SA
Founded in 2003 in a “chalet” in Plan-les-Ouates by Laurent FARINELLI and Magne OSTERAS
Capillary sequencing
2004: Fasteris moves into rented labs in Plan-les-Ouates
Fasteris SA: Illumina sequencing
2007: Solexa 1G
2008: 2 GAIIx
2010: 2 HiSeqs. New offices
2011: +1 MiSeq
[Pictures: Solexa 1G, GAIIx, Illumina HiSeq]
Sequencing and bioinformatics service provider for private and academic labs.
Fasteris SA
Illumina sequencing technology is based on the concept of DNA colonies, invented in 1996 at GlaxoWellcome's Geneva Biomedical Research Institute.
Mayer P., Farinelli L. and Kawashima, E., 1997, Patent application WO 98/44151
Fasteris SA
Illumina sequencing
Library preparation: some DNA protocols
- Genomic: SNPs, Indels, de novo assembly
- Mate-Pairs: scaffolding
- Target enrichment: specific regions of genomic DNA (e.g. exome)
- ChIP-SEQ: DNA bound to proteins (e.g. transcription factors)
Library preparation: some RNA protocols
- mRNA-SEQ: expression of mRNAs (non-oriented)
- dir-mRNA-SEQ: expression of mRNAs (strand-specific)
- GEX: expression of mRNAs with a tag approach
- smallRNA: analysis of non-coding short RNAs
Library preparation: smallRNA protocol
- 3 ug total RNA
- selection of small RNAs of e.g. 20-30 nt, acrylamide gel purification
- single-stranded ligation of the 3' adapter
- single-stranded ligation of the 5' adapter
- reverse transcription, PCR, gel purification
[Figure labels: P5, P7, index]
Library preparation: Quality control
- Quantification of the material (optimal concentration 10 nM)
- Titration run
  • 40 sequences in FASTA format
  • 100'000 sequences in FASTQ format
- Wait for green light
Template structure
[Figure: template with P5, “insert” (ACGTCATG), index and P7; primer labels: Fwd primer, Index primer, (Rev primer)]
- 6 or 8 bases for index
- 24 available indexes from Illumina
- Over 150 Fasteris indexes available → possibility to sell 10% of lanes (8-15 mio reads)
Flowcell (v3)
Lawn of P5 and P7 primers covalently attached to the flowcell surface.
[Figure labels: P5, OH, OH, P7, DIOL; 3' end free; 5' end covalently attached; cleavage link]
Hi-Seq 2000 = 2 flowcells
1 flowcell = 8 lanes
1 lane = 2 surfaces
1 surface = 3 swaths
1 swath = 16 tiles
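The flowcell hierarchy listed on this slide multiplies out to a tile count per lane, per flowcell and per instrument. The short sketch below only does that arithmetic; the constant and function names are illustrative and do not come from any Illumina software.

```python
# Flowcell (v3) layout as listed on the slide.
FLOWCELLS_PER_HISEQ = 2
LANES_PER_FLOWCELL = 8
SURFACES_PER_LANE = 2
SWATHS_PER_SURFACE = 3
TILES_PER_SWATH = 16


def tiles_per_lane():
    """Tiles imaged for one lane (both surfaces)."""
    return SURFACES_PER_LANE * SWATHS_PER_SURFACE * TILES_PER_SWATH


def tiles_per_flowcell():
    """Tiles imaged for one flowcell (all lanes)."""
    return LANES_PER_FLOWCELL * tiles_per_lane()


def tiles_per_instrument():
    """Tiles imaged when both flowcells of a HiSeq 2000 are run."""
    return FLOWCELLS_PER_HISEQ * tiles_per_flowcell()


if __name__ == "__main__":
    print(tiles_per_lane(), tiles_per_flowcell(), tiles_per_instrument())
```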
Cluster generation: Immobilization
1) The templates randomly hybridize to the complementary primers.
2) Bst polymerase and nucleotides are added. The complementary strand is synthesized (5'-->3').
3) The two strands are dehybridized; the original strand is released and washed away.
Now, the templates are covalently bound to the flowcell.
[Figure: steps 1 to 3, strands labelled with their 5' and 3' ends]
Cluster generation: In situ amplification
4) The templates bend and hybridize with a nearby complementary primer.
5) Bst polymerase and nucleotides are added. The complementary strands are synthesized.
6) The newly synthesized strands are dehybridized. A new amplification cycle can start.
[Figure: steps 4 to 6, repeated 35x]
Cluster generation: Linearization and primer hybridization
7) After 35 cycles of amplification, the colonies are formed.
8) The P5 primers are cleaved and the attached strands are released. All templates now have the orientation P5-insert-P7-flowcell. Free 3' extremities are blocked with ddNTPs. The flowcell can now be mounted on the sequencer.
9) A spot is a colony (diam ~ 1 um). Each colony contains about 1000 copies of a single template molecule. Each lane of the flowcell contains about 120-180 mio colonies.
[Figure: steps 7 to 9; labels: P5, OH, OH, P7, DIOL]
Parallel, base by base sequencing
1) Reversible-terminator nucleotides labeled with fluorescent dyes are added; at each cycle, a single nucleotide is incorporated. Non-incorporated nucleotides are washed away.
2) The dyes are excited with two lasers; a camera scans the flowcell and captures images.
3) The dyes, along with the terminal 3' blocker, are chemically removed, allowing for the next cycle.
Basecalling
1) The images are processed in order to extract the intensities of each colony.
2) The intensities are interpreted into bases. A quality score (qScore) is assigned to each base.
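The slide does not define the qScore. Assuming it follows the usual Phred convention, Q = -10 * log10(p) where p is the estimated probability that the base call is wrong, the sketch below converts scores to error probabilities and decodes a FASTQ quality string. The ASCII offset of 33 is also an assumption (some older pipelines used 64), and the function names are illustrative.

```python
def phred_to_error_probability(q):
    """Error probability for a Phred-scaled quality score (assumed convention)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of integer scores.

    offset=33 is an assumption about the encoding of the file.
    """
    return [ord(char) - offset for char in qual]


if __name__ == "__main__":
    scores = decode_quality_string("IIII5555####")
    for q in sorted(set(scores), reverse=True):
        print(q, phred_to_error_probability(q))
```

Under that convention a qScore of 30 corresponds to one expected error in 1000 base calls.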
[Figure: Sequence analysis viewer]
Some numbers...
1x100 cycle run:
- time: 1 week
- RTA intensities: 1.5 TB
- CASAVA basecalling: ~ 200 GB (the sequence files)
2 HiSeqs with 2 flowcells: ~ 100 GB data each day
Adapter trimming
Trimming and QC control (smallRNAs)
Adapter search: 100% identity of the 5 bases of the adapter and at least 80% identity for the remaining part.
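One way to read the adapter search rule is sketched below. It assumes that "the 5 bases" means the first five bases of the 3' adapter, that the remaining part is compared position by position without gaps, and that an adapter truncated by the end of the read is judged only on the bases that are present. The adapter sequence, the function names and the parameters are hypothetical; this is not the actual Fasteris trimming code.

```python
def find_adapter(read, adapter, seed_len=5, min_identity=0.8):
    """Return the position where the adapter starts in the read, or None.

    The first seed_len bases of the adapter must match exactly; the rest
    must reach min_identity over the bases available in the read.
    """
    seed = adapter[:seed_len]
    start = read.find(seed)
    while start != -1:
        rest_read = read[start + seed_len:]
        rest_adapter = adapter[seed_len:seed_len + len(rest_read)]
        if not rest_adapter:
            # Only the seed fits before the end of the read.
            return start
        matches = sum(a == b for a, b in zip(rest_read, rest_adapter))
        if matches / len(rest_adapter) >= min_identity:
            return start
        # Try the next occurrence of the seed.
        start = read.find(seed, start + 1)
    return None


def trim(read, adapter):
    """Return the insert, i.e. the read with the adapter and what follows removed."""
    position = find_adapter(read, adapter)
    return read if position is None else read[:position]


if __name__ == "__main__":
    ADAPTER = "GATTACAGATTCCGTA"  # hypothetical adapter, for illustration only
    print(trim("TTTGGGCCCAAATTTGGGCCC" + ADAPTER, ADAPTER))
```

Reads in which no adapter is found are returned unchanged here; a real pipeline would decide separately whether to keep or discard them.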
QC control: specifications
Number of reads - 130 mio reads per lane, or 8-13 mio per 10% lane
Q30 - more than 85% of the reads with qScore > 30 (1x50 run)
Error rate - spiked PhiX must have error rate < 0.5% (1x50 run)
QC control: insert length profile
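An insert length profile of this kind can be computed from the trimmed inserts by counting how many have each length and expressing the counts as a percentage of all reads. The sketch below assumes the inserts are available as plain strings; the function name is illustrative.

```python
from collections import Counter


def insert_length_profile(inserts):
    """Map each insert length to the percentage of inserts with that length."""
    counts = Counter(len(sequence) for sequence in inserts)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}


if __name__ == "__main__":
    example = ["A" * 21] * 3 + ["A" * 24]
    for length, percent in insert_length_profile(example).items():
        print(length, percent)
```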
Introduction to smallRNAs
[Figure: classes of ncRNA with their sizes and functions]
Classes: miRNA, tRNA, rRNA, piRNA, snRNA, snoRNA, siRNA
Sizes shown: ~100 nts, >100 nts, 23-31 nts, 20-25 nts, 21-24 nts, 73-93 nts, ~150 nts
Functions shown:
- chemical modifications of other RNAs, mainly rRNAs, tRNAs and snRNAs
- RNA splicing, guides for telomere elongation
- translation
- dsRNA
- downregulation
- transposons silencing, germ line
- downregulation of genes
Conservation labels: poorly conserved, highly conserved
Expression analyses (smallRNAs)
Coverage of annotated miRNAs / siRNAs...
a) mapping to a genome with annotations or to a database of sequences (BWA)
   - PMRD: 10'102 entries, 127 species
   - miRBase: 18'226 entries, 32 species, coordinates
b) count of the number of inserts mapped to annotated regions or to each sequence (BEDTOOLS, SEQMONK)
c) normalization as RPM and data mining (R)
   - comparison between libraries
   - selection of miRNAs with differential expression
   - heatmaps
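The RPM normalization in step c) can be sketched as follows. RPM is read here as inserts per million, and the denominator is taken to be the total of the counts passed in. The slide does not say which total is used, so that choice is an assumption, as are the miRNA names in the example.

```python
def rpm(counts):
    """Normalize raw counts to reads per million (RPM).

    counts maps a feature name to its raw insert count in one library.
    The total of these counts is used as the library size (an assumption).
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count * 1e6 / total for name, count in counts.items()}


if __name__ == "__main__":
    library = {"miR-a": 50, "miR-b": 150}  # hypothetical counts
    print(rpm(library))
```

Normalizing each library this way makes counts comparable between libraries of different sequencing depth, which the comparison and heatmap steps rely on.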
Expression analyses (smallRNAs)
[Figure: quantification of a pre-miRNA; tracks: reference genome (chromosomes), aligned reads, annotations, counts]
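The quantification shown in this figure amounts to counting, for each annotation, the aligned reads that overlap it. The sketch below is a naive illustration of that idea, not how BEDTOOLS or SEQMONK work internally. It assumes 0-based, half-open coordinates and counts any overlap of at least one base; the feature name and coordinates are hypothetical.

```python
def count_reads_per_feature(reads, features):
    """Count aligned reads overlapping each annotated feature.

    reads:    list of (chromosome, start, end)
    features: list of (name, chromosome, start, end)
    Coordinates are assumed to be 0-based and half-open.
    """
    counts = {name: 0 for name, _chrom, _start, _end in features}
    for read_chrom, read_start, read_end in reads:
        for name, feat_chrom, feat_start, feat_end in features:
            same_chrom = read_chrom == feat_chrom
            if same_chrom and read_start < feat_end and feat_start < read_end:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    features = [("pre-miR-x", "chr1", 100, 180)]  # hypothetical annotation
    reads = [("chr1", 110, 131), ("chr1", 170, 192), ("chr2", 110, 131)]
    print(count_reads_per_feature(reads, features))
```

The nested loop is quadratic and only suitable for small examples; dedicated tools index the intervals to handle millions of reads.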
Introduction to smallRNAs
[Figure: siRNA pathway; labels: long dsRNA (replicating RNA virus), Dicer, siRNA duplex, unwind, siRNAs, RISC, target recognition (perfect match), target mRNA (AAAAAA) degradation]
In plants, insects and worms, siRNAs are used as an antiviral immune response.
Sequencing by siRNA: a novel generic tool for virus discovery
Kreuze et al. (2009) Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 388: 1-7
Virus identification
Samples and references
- libraries from infected samples or from tumoral samples (presence of a virus is suspected)
- reference genome available or not
- control library available or not
Strategies to get virus-specific contigs
- assemble inserts
- map inserts (sizes 20 to 25) on the reference genome and assemble unmapped inserts
- assemble the control sample; map inserts on the obtained contigs; assemble unmapped inserts
Identification
- blast the contigs on NCBI nr database
- use a database of viral sequences and compute the coverage of the virus (MUMMER package)
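The coverage of a virus can be understood as the fraction of its genome covered by at least one aligned contig. The sketch below computes that fraction from alignment intervals by merging overlaps. It does not use or reproduce the MUMMER package, and it assumes 0-based, half-open intervals on the virus sequence; the function name is illustrative.

```python
def covered_fraction(intervals, reference_length):
    """Fraction of a reference covered by at least one interval.

    intervals: list of (start, end) alignment positions on the reference,
    assumed 0-based and half-open. Overlapping intervals are merged first.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / reference_length


if __name__ == "__main__":
    contig_hits = [(0, 100), (50, 150), (300, 400)]  # hypothetical alignments
    print(covered_fraction(contig_hits, 1000))
```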
Virus identification
Run
- Instrument: HiSeq 2000
- run: 1x50
- lane: 20%
[Tables: Libraries; Quality control]
Virus identification
Adapter removal
[Figure: Insert size profile; axes: size [nts], % of all reads]
Virus identification
Analysis summary
Virus identification
Reference
Mapping (BWA)
- 2 mismatches
- 43.54% of Lib1 are mapped
- 55.84% of Lib2 are mapped
Virus identification
Selection of unmapped inserts (SAMTOOLS)
- Lib1: 5'132'190 unmapped inserts
- Lib2: 3'404'696 unmapped inserts
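The slide does not show the command used for this selection. In the SAM format, an unmapped read has bit 0x4 set in its FLAG field, which is what `samtools view -f 4` selects. The sketch below applies the same test to SAM text lines in plain Python, only to illustrate the criterion; the function name is illustrative.

```python
def unmapped_sequences(sam_lines):
    """Yield (read name, sequence) for SAM records flagged as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # 0x4: segment unmapped
            yield fields[0], fields[9]


if __name__ == "__main__":
    example = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t0\tchr1\t10\t37\t21M\t*\t0\t0\tACGTACGTACGTACGTACGTA\tIIIIIIIIIIIIIIIIIIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGGCCCCAAAATTTTG\tIIIIIIIIIIIIIIIIIIIII",
    ]
    for name, sequence in unmapped_sequences(example):
        print(name, sequence)
```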
De novo assembly
VELVET builds a hash table of all possible k-mers (sequences of 'k' bases) in the dataset and, through de Bruijn graph construction, builds de novo contigs.
Several values of 'k' are tested.
To evaluate the quality of the assemblies, the inserts are mapped on the obtained contigs and the percentage of mapped inserts is reported.
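The k-mer table and de Bruijn graph idea can be illustrated with a toy example. This is not VELVET's implementation: it ignores reverse complements, sequencing errors and coverage, and it simply follows nodes that have a single successor. All function names are illustrative.

```python
from collections import Counter, defaultdict


def count_kmers(sequences, k):
    """Count every k-mer (substring of k bases) found in the sequences."""
    counts = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def de_bruijn_graph(kmer_counts):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffix of the same k-mer."""
    graph = defaultdict(set)
    for kmer in kmer_counts:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while the current node has one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:
            break  # avoid looping forever on a cycle
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    inserts = ["ATGGCGT", "GCGTGCA"]  # two overlapping toy inserts
    graph = de_bruijn_graph(count_kmers(inserts, k=4))
    print(walk_unambiguous(graph, "ATG"))
```

Smaller values of k let short inserts overlap more easily but create more ambiguous branches, which is one reason several values are tested.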
Virus identification
Identification
The assemblies corresponding to a hash length (k) of 15 are aligned with MUMMER on a database of virus sequences. The results are merged into a single file.
Virus identification
Visualisation (IGV)
The assemblies are mapped to the virus sequence using BWA.
Take home message
The presence of abundant siRNAs in infected cells allows virus identification without any prior knowledge about the virus
High potential for developing new diagnostic approaches
Works for plants and insects. What about mammals?
Let's reproduce this analysis with a library from human infected cells during the hands-on!