Illumina sequencing and virus identification from smallRNA data Patricia Otten 30th May 2012 COST training school Uppsala - Sweden


Page 1

Illumina sequencing and virus identification from smallRNA data

Patricia Otten

30th May 2012

COST training school, Uppsala - Sweden

Page 2

Overview

Fasteris

Illumina sequencing technology
- library preparation (smallRNAs)
- cluster generation
- sequencing
- trimming (adapter search) and QC control

Bioinformatics analyses (smallRNAs)
- introduction to smallRNAs
- expression analyses
- virus identification

Page 3

Fasteris SA

Founded in 2003 in a “chalet” in Plan-les-Ouates by Laurent FARINELLI and Magne OSTERAS

Capillary sequencing

2004: Fasteris moves into rented labs in Plan-les-Ouates

Page 4

Fasteris SA: Illumina sequencing

- 2007: Solexa 1G
- 2008: 2 GAIIx
- 2010: 2 HiSeqs; new offices
- 2011: +1 MiSeq

[Photos: Solexa 1G, GAIIx, Illumina HiSeq]

Page 5

Sequencing and bioinformatics service provider for private and academic labs.

Fasteris SA

Page 6

Illumina sequencing technology is based on the concept of DNA colonies, invented in 1996 at GlaxoWellcome's Geneva Biomedical Research Institute

Mayer P., Farinelli L. and Kawashima, E., 1997, Patent application WO 98/44151

Fasteris SA


Page 8

Illumina sequencing

Library preparation: some DNA protocols

- Genomic: SNPs, indels, de novo assembly
- Mate-Pairs: scaffolding
- Target enrichment: specific regions of genomic DNA (e.g. exome)
- ChIP-SEQ: DNA bound to proteins (e.g. transcription factors)

Page 9

Illumina sequencing

Library preparation: some RNA protocols

- mRNA-SEQ: expression of mRNAs (non-oriented)
- dir-mRNA-SEQ: expression of mRNAs (strand-specific)
- GEX: expression of mRNAs with a tag approach
- smallRNA: analysis of non-coding short RNAs

Page 10

Illumina sequencing

Library preparation: smallRNA protocol

- 3 µg total RNA
- acrylamide gel purification: selection of small RNAs of e.g. 20-30 nt
- single-stranded ligation of the 3' adapter
- single-stranded ligation of the 5' adapter
- reverse transcription, PCR, gel purification

[Diagram labels: P5, P7, index]

Page 11

Illumina sequencing

Library preparation: Quality control

Quantification of the material (optimal concentration 10 nM)

Titration run
• 40 sequences in FASTA format
• 100'000 sequences in FASTQ format

Wait for green light

Page 12

Illumina sequencing

Template structure

[Diagram labels: P5, “insert”, index, P7, ACGTCATG; Fwd primer, Index primer, (Rev primer)]

- 6 or 8 bases for index
- 24 available indexes from Illumina
- Over 150 Fasteris indexes available → possibility to sell 10% of lanes (8-15 mio reads)
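To illustrate how the index read lets several libraries share one lane, here is a minimal sketch that assigns a read to a library by comparing its index read with a table of known indexes. The index sequences, the library names and the one-mismatch tolerance are assumptions made for this example; they are not Fasteris or Illumina settings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_library(index_read, index_table, max_mismatches=1):
    """Return the only library whose index lies within max_mismatches, else None."""
    hits = [
        library
        for library, index in index_table.items()
        if len(index) == len(index_read)
        and hamming(index, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical 6-base indexes for two libraries sharing one lane.
INDEX_TABLE = {"Lib1": "ACGTCA", "Lib2": "TGCAGT"}

if __name__ == "__main__":
    for read in ("ACGTCA", "ACGTCT", "GGGGGG"):
        print(read, assign_library(read, INDEX_TABLE))
```

Reads whose index matches no library, or more than one, are left unassigned rather than guessed.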


Page 14

Illumina sequencing

Flowcell (v3)

Lawn of P5 and P7 primers covalently attached to the flowcell surface.

[Diagram labels: P5, P7, OH, DIOL, 3' end free, 5' end covalently attached, cleavage link]

- HiSeq 2000 = 2 flowcells
- 1 flowcell = 8 lanes
- 1 lane = 2 surfaces
- 1 surface = 3 swaths
- 1 swath = 16 tiles
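The flowcell hierarchy above multiplies out to the number of tiles imaged per lane, per flowcell and per instrument. The short calculation below uses only the figures from the slide; the variable names are illustrative.

```python
# Flowcell (v3) layout as listed on the slide.
FLOWCELLS_PER_HISEQ = 2
LANES_PER_FLOWCELL = 8
SURFACES_PER_LANE = 2
SWATHS_PER_SURFACE = 3
TILES_PER_SWATH = 16

tiles_per_lane = SURFACES_PER_LANE * SWATHS_PER_SURFACE * TILES_PER_SWATH
tiles_per_flowcell = LANES_PER_FLOWCELL * tiles_per_lane
tiles_per_instrument = FLOWCELLS_PER_HISEQ * tiles_per_flowcell

print(tiles_per_lane, tiles_per_flowcell, tiles_per_instrument)  # 96 768 1536
```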

Page 15

Illumina sequencing

Cluster generation: Immobilization

1) The templates randomly hybridize to the complementary primers.
2) Bst polymerase and nucleotides are added. The complementary strand is synthesized (5'-->3').
3) The two strands are dehybridized; the original strand is released and washed away.

Now, the templates are covalently bound to the flowcell.

[Diagram: steps 1 to 3, with the 5' and 3' ends of each strand marked]

Page 16

Illumina sequencing

Cluster generation: In situ amplification

4) The templates bend and hybridize with a nearby complementary primer.
5) Bst polymerase and nucleotides are added. The complementary strands are synthesized.
6) The newly synthesized strands are dehybridized. A new amplification cycle can start.

The cycle is repeated 35x.

[Diagram: steps 4 to 6, with the 3' ends marked]

Page 17

Illumina sequencing

Cluster generation: Linearization and primer hybridization

7) After 35 cycles of amplification, the colonies are formed.
8) The P5 primers are cleaved and the attached strands are released. All templates now have the orientation P5-insert-P7-flowcell. Free 3' extremities are blocked with ddNTPs. The flowcell can now be mounted on the sequencer.
9) A spot is a colony (diam ~ 1 µm). Each colony contains about 1000 copies of a single template molecule. Each lane of the flowcell contains about 120-180 mio colonies.

[Diagram: steps 7 to 9; labels: P5, P7, OH, DIOL, 3' ends]


Page 19

Illumina sequencing

Parallel, base by base sequencing

1) Reversible-terminator nucleotides labeled with fluorescent dyes are added; at each cycle, a single nucleotide is incorporated. Non-incorporated nucleotides are washed away.
2) The dyes are excited with two lasers; a camera scans the flowcell and captures images.
3) The dyes, along with the terminal 3' blocker, are chemically removed, allowing the next cycle to start.

Page 20

Basecalling

1) The images are processed in order to extract the intensities of each colony.

2) The intensities are interpreted into bases. A quality score (qScore) is assigned to each base.

Illumina sequencing
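The qScore is a Phred-scaled value, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The sketch below decodes a FASTQ quality string and converts the scores back to error probabilities. The offset of 33 is an assumption (Sanger-style encoding); older Illumina pipelines used an offset of 64, so check which one applies to your files.

```python
def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer qScores.

    offset=33 is an assumption (Sanger-style encoding).
    """
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    scores = decode_qualities("II?5+")
    print(scores)
    print([round(error_probability(q), 4) for q in scores])
```

A qScore of 30 therefore corresponds to an estimated error probability of 1 in 1000.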

Page 21

Illumina sequencing

Sequence analysis viewer

Page 22

Illumina sequencing

Some numbers...

1x100 cycle run:
- time: 1 week
- RTA intensities: 1.5 TB
- CASAVA basecalling: ~ 200 GB (the sequence files)

2 HiSeqs with 2 flowcells: ~ 100 GB data each day


Page 24

Trimming and QC control (smallRNAs)

Adapter trimming

Adapter search: 100% identity over the first 5 bases of the adapter and at least 80% identity for the remaining part.
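The adapter search rule can be sketched as follows: scan the read for an exact match of the first 5 adapter bases, then accept the hit if at least 80% of the remaining overlapping bases agree. This is an illustration of the rule, not the Fasteris trimming tool; the adapter sequence is only an example, and accepting an adapter that is cut off by the end of the read is an assumption.

```python
def find_adapter(read, adapter, seed=5, min_identity=0.8):
    """Return the position where the adapter starts in the read, or None.

    A hit needs an exact match of the first `seed` adapter bases and at
    least `min_identity` identity over the rest of the overlap.
    """
    for start in range(len(read) - seed + 1):
        if read[start:start + seed] != adapter[:seed]:
            continue
        read_rest = read[start + seed:]
        adapter_rest = adapter[seed:seed + len(read_rest)]
        overlap = min(len(read_rest), len(adapter_rest))
        if overlap == 0:
            return start
        matches = sum(
            r == a for r, a in zip(read_rest[:overlap], adapter_rest[:overlap])
        )
        if matches / overlap >= min_identity:
            return start
    return None


def trim(read, adapter):
    """Return the insert, i.e. the read up to the adapter (or the whole read)."""
    position = find_adapter(read, adapter)
    return read if position is None else read[:position]


# Example adapter only; substitute the 3' adapter used for your library.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"

if __name__ == "__main__":
    read = "TCGGACCAGGCTTCATTCCCC" + ADAPTER + "AACTCCAG"
    print(trim(read, ADAPTER))
```

Reads in which no adapter is found are returned unchanged, so they can be counted separately in the QC report.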

Page 25

Trimming and QC control (smallRNAs)

QC control: specifications

- Number of reads: 130 mio reads per lane, or 8-13 mio per 10% lane
- Q30: more than 85% of the reads with qScore > 30 (1x50 run)
- Error rate: spiked PhiX must have error rate < 0.5% (1x50 run)

Page 26

Trimming and QC control (smallRNAs)

QC control: insert length profile
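An insert length profile is simply the distribution of insert sizes after adapter removal, expressed as a percentage of all inserts. A minimal sketch, using toy sequences in place of real trimmed reads:

```python
from collections import Counter


def length_profile(inserts):
    """Return {length: percentage of all inserts} sorted by length."""
    counts = Counter(len(seq) for seq in inserts)
    total = sum(counts.values())
    return {
        length: 100.0 * count / total
        for length, count in sorted(counts.items())
    }


if __name__ == "__main__":
    # Toy inserts; real input would be the adapter-trimmed reads.
    toy = ["A" * 21, "C" * 21, "G" * 22, "T" * 24]
    for length, percent in length_profile(toy).items():
        print(f"{length} nt\t{percent:.1f}%")
```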


Page 28

Introduction to smallRNAs

[Diagram: classes of ncRNA with their sizes and functions; the pairing between the items below was lost in extraction]

- Classes: miRNA, tRNA, rRNA, piRNA, snRNA, snoRNA, siRNA
- Sizes shown: ~100 nts, >100 nts, 23-31 nts, 20-25 nts, 21-24 nts, 73-93 nts, ~150 nts
- Functions shown: chemical modifications of other RNAs, mainly rRNAs, tRNAs and snRNAs; RNA splicing, guides for telomere elongation; translation; dsRNA; downregulation; transposon silencing, germ line, poorly conserved; highly conserved, downregulation of genes


Page 30

Expression analyses (smallRNAs)

Coverage of annotated miRNAs / siRNAs...

a) mapping to a genome with annotations or to a database of sequences (BWA)
   - PMRD: 10'102 entries, 127 species
   - miRBase: 18'226 entries, 32 species, coordinates
b) count of the number of inserts mapped to annotated regions or to each sequence (BEDTOOLS, SEQMONK)
c) normalization as RPM and data mining (R)
   - comparison between libraries
   - selection of miRNAs with differential expression
   - heatmaps
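RPM (reads per million) normalization divides each raw count by the library size and multiplies by one million, so that libraries of different depth can be compared. The slides do the data mining in R; the sketch below is in Python for illustration only. The miRNA names and counts are invented, and the choice of denominator (sum of counts, or number of mapped inserts) is an assumption to settle per project.

```python
def reads_per_million(counts, library_size=None):
    """Scale raw counts to reads per million (RPM).

    library_size defaults to the sum of the counts; using the number of
    mapped inserts instead is an equally common choice (assumption).
    """
    if library_size is None:
        library_size = sum(counts.values())
    if library_size == 0:
        return {name: 0.0 for name in counts}
    return {
        name: count * 1_000_000 / library_size
        for name, count in counts.items()
    }


if __name__ == "__main__":
    # Invented counts for two miRNAs in a library of 2 million mapped inserts.
    raw = {"miR166": 1500, "miR159": 500}
    print(reads_per_million(raw, library_size=2_000_000))
```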

Page 31

Expression analyses (smallRNAs)

[Genome browser screenshot: reference genome (chromosomes), annotations, aligned reads and counts over a pre-miRNA; quantification]


Page 33

Introduction to smallRNAs

[Diagram: long dsRNA (replicating RNA virus) → Dicer → siRNA duplex → unwind → siRNAs → RISC → target recognition (perfect match) → target mRNA degradation]

In plants, insects and worms, siRNAs are used as an antiviral immune response.

Sequencing by siRNA: a novel generic tool for virus discovery

Kreuze et al. (2009) Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 388: 1-7

Page 34

Virus identification

Samples and references
- libraries from infected samples or from tumoral samples (presence of a virus is suspected)
- reference genome available or not
- control library available or not

Strategies to get virus-specific contigs
- assemble inserts
- map inserts (sizes 20 to 25) on the reference genome and assemble unmapped inserts
- assemble the control sample; map inserts on the obtained contigs; assemble unmapped inserts

Identification
- blast the contigs on NCBI nr database
- use a database of viral sequences and compute the coverage of the virus (MUMMER package)
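Once contigs have been aligned to a viral sequence, the coverage of that virus is the fraction of its length spanned by at least one alignment. The sketch below merges overlapping alignment intervals and reports that fraction. The interval values are invented, and parsing the aligner output into (start, end) pairs is not shown.

```python
def covered_fraction(intervals, reference_length):
    """Fraction of a reference covered by the union of 1-based, inclusive intervals."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / reference_length


if __name__ == "__main__":
    # Invented alignments of three contigs on a 1000-nt viral sequence.
    print(covered_fraction([(1, 100), (51, 150), (301, 400)], 1000))
```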

Page 35

Virus identification

Run
- instrument: HiSeq 2000
- run: 1x50
- lane: 20%

Libraries

Quality control

Page 36

Virus identification

Adapter removal

[Plot: insert size profile; x-axis: size [nts]; y-axis: % of all reads]

Page 37

Virus identification

Analysis summary

Page 38

Virus identification

Reference

Mapping (BWA)
- 2 mismatches
- 43.54% of Lib1 are mapped
- 55.84% of Lib2 are mapped

Page 39

Virus identification

Selection of unmapped inserts (SAMTOOLS)
- Lib1: 5'132'190 unmapped inserts
- Lib2: 3'404'696 unmapped inserts
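In the SAM format, an unmapped read has bit 0x4 set in its FLAG field (column 2); with SAMTOOLS this is the filter `samtools view -f 4`. The pure-Python sketch below applies the same test to SAM text lines. The toy records are invented for the example.

```python
UNMAPPED_FLAG = 0x4  # SAM specification: "segment unmapped"


def unmapped_records(sam_lines):
    """Yield (read name, sequence) for alignments flagged as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & UNMAPPED_FLAG:
            yield fields[0], fields[9]


if __name__ == "__main__":
    # Two toy SAM records: one mapped (flag 0), one unmapped (flag 4).
    toy_sam = [
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t10\t37\t21M\t*\t0\t0\tTCGGACCAGGCTTCATTCCCC\t*",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTGACAGAAGAGAGTGAGCAC\t*",
    ]
    print(list(unmapped_records(toy_sam)))
```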

De novo assembly

VELVET builds a hash table of all possible k-mers (sequences of 'k' bases) in the dataset and, through de Bruijn graph construction, builds de novo contigs.

Several values of 'k' are tested.

To evaluate the quality of the assemblies, the inserts are mapped on the obtained contigs and the percentage of mapped inserts is reported.
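The two ideas named above, a hash table of k-mers and a de Bruijn graph, can be sketched in a few lines. This is a teaching illustration only, not VELVET's implementation: it ignores reverse complements, sequencing errors and the graph simplification steps a real assembler performs. The reads are invented.

```python
from collections import defaultdict


def kmer_counts(reads, k):
    """Hash table: k-mer -> number of occurrences across all reads."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)


def de_bruijn_edges(kmers):
    """Edges of a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]  # invented overlapping reads
    counts = kmer_counts(reads, k=4)
    print(counts)
    print(de_bruijn_edges(counts))
```

Contigs correspond to unambiguous paths through this graph; smaller values of 'k' tolerate lower coverage, while larger values resolve more repeats, which is why several values are tried.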

Page 40

Virus identification

Identification

The assemblies obtained with hash length (k-mer) 15 are aligned with MUMMER against a database of virus sequences; the results are merged into a single file.

Page 41

Virus identification

Visualisation (IGV)

The assemblies are mapped to the virus sequence using BWA.

Page 42

Take home message

The presence of abundant siRNAs in infected cells allows virus identification without any prior knowledge about the virus.

High potential for developing new diagnostic approaches.

Works for plants and insects. What about mammals?

Page 43

Let's reproduce this analysis with a library from human infected cells during the hands-on!