Data analysis pipelines for NGS applications
Sergi Beltran Agulló
VHIR-CNAG Course, 11th February 2015
BIOREPOSITORY LABORATORY SEQUENCING
QC ANALYSIS TRANSFER
LIMS
Full Traceability of CNAG’s Workflow ISO9001
[Figure: sequencing library structure. Labels: P5, P7, Index, SP read 1, SP read 2, DNA insert]
WG_BS_Seq
WG_Seq
mRNA_Seq
smallRNA_Seq
Target capture
ChIP_Seq
…and many more
Sequencing Platform (M. Gut)
cBot: automatic reagent dispenser
Flow cell: glass slide with a lawn of oligonucleotides
Flow cell: glass slide with a lawn of oligonucleotides and sequencing library
HiSeq2000 – the sequencer
Sequencing-by-synthesis (SBS)
[Figure: template strand and growing complementary strand, 5’ to 3’]
First base incorporated
Cycle 1: Add sequencing reagents
Remove unincorporated bases
Detect signal
Cycle 2-n: Add sequencing reagents and repeat
All four labelled nucleotides in one reaction
High accuracy
Base-by-base sequencing
No problems with homopolymer repeats
[Figure: scale bar, 100 microns]
Sequencing Output: FASTQ files
- Developed by the Wellcome Trust Sanger Institute
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
- Usually, each sequence (read) takes up 4 lines: identifier, bases, a "+" separator, and per-base qualities
- Sequence identifiers, descriptions and quality encoding can differ between files
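A minimal Python sketch of reading the four-line records described above. It assumes the Phred+33 quality offset (an assumption, since the quality encoding can differ), uncompressed text input, and records that are not wrapped over extra lines. The names `read_fastq` and `phred_scores` are illustrative and not part of any CNAG tool.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) from a FASTQ text handle.

    Assumes exactly 4 lines per record (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (offset 33 is assumed)."""
    return [ord(char) - offset for char in quality]

# The example record from the slide, read from an in-memory file.
example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), min(scores), max(scores))
```

With a different encoding (for example an offset of 64), only the `offset` argument would need to change.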
[Diagram labels: SEQUENCING; CNAG-LIMS; FASTQ; MAPPING QUALITY CTRL; MAP/BAM; VARIANT CALLING; STRUCTURAL VARIANTS; RNA-Seq QUANTIFICATION; ASSEMBLY & METAGENOMICS; BISULFITE; SHAPE-BASED 3D MODELLING OF RNA SEQUENCES; Hi-C TO 3D MODELS OF GENOMIC DOMAINS AND GENOMES]
Data Analysis Pipelines at CNAG
Aligning and merging fragments of DNA in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, only small pieces of between 20 and 1000 bases. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or gene transcripts (ESTs). (adapted from Wikipedia)
Adapted from Li-Jun Ma; Natalie D. Fedorova. Mycology, pages 9 - 24
Assembly: definition
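The idea of merging overlapping reads can be illustrated with a toy greedy merger in Python. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not the CNAG de novo assembly pipeline; real assemblers rely on overlap or de Bruijn graphs and error correction. The reads and the function names `overlap` and `greedy_assemble` are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three hypothetical overlapping reads from one short sequence.
contigs = greedy_assemble(["ATTAGACCTG", "AGACCTGCCG", "CCTGCCGGAA"])
print(contigs)
```

Reads that share no overlap of at least `min_len` bases are returned as separate contigs.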
CNAG de novo assembly pipeline
Assembly and Annotation Team (T. Alioto)
CNAG genome projects
Assembly: Metagenomics
http://wiki.biomine.skelleftea.se
Clinical Applications: Human Microbiome Project
Mapping to reference genome
[Figure: 100 bp reads aligned to the reference genome, shown over three slides with an increasing number of reads. Adapted from Wikipedia]
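As a rough illustration of what mapping means, the Python sketch below indexes the k-mers of a reference and places a read wherever a seed matches, accepting the position if few bases differ. It is a toy under stated assumptions (no indels, forward strand only, tiny made-up sequences), not the mapper used at CNAG; `build_index` and `map_read` are hypothetical names.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (start, mismatches) placements of the read (0-based starts)."""
    hits = []
    seen = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
    return sorted(hits)

# Made-up reference and reads for illustration.
reference = "ATTAGACCTGCCGGAATACGGTCA"
index = build_index(reference, k=4)
exact_hits = map_read("CCTGCCGGAA", reference, index, k=4)
mismatch_hits = map_read("CCTGCAGGAA", reference, index, k=4)
print(exact_hits, mismatch_hits)
```

A read that differs from the reference at one base is still placed, which is what later allows variants to be detected.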
Mapping: Exome sequence example
Alignments are stored in a BAM file, the binary version of the SAM (Sequence Alignment/Map) format
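For orientation, a SAM alignment line is tab-separated text with 11 mandatory fields. The Python sketch below splits one such line into named fields and decodes two flag bits. The example record is invented, and in practice BAM files are read with dedicated tools such as samtools rather than by hand.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a dict."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["flag"] & 0x4)   # flag bit: read unmapped
    record["is_reverse"] = bool(record["flag"] & 0x10)   # flag bit: reverse strand
    record["tags"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    return record

# Hypothetical alignment: a 10-base read on the reverse strand of chr6.
line = "read1\t16\tchr6\t1000\t60\t10M\t*\t0\t0\tCCTGCCGGAA\tIIIIIIIIII"
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["cigar"], rec["is_reverse"])
```

Positions in SAM are 1-based, which matters when comparing them with 0-based coordinates from other tools.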
Exome sequencing metrics
Variant Calling
Identification of genetic differences in comparison to a reference (strict definition)
- Designs: Pedigree, trio, group, somatic
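The strict definition above can be sketched as a naive pileup caller in Python: count the bases observed at each reference position and report any non-reference base seen often enough. The thresholds, sequences and the name `call_variants` are assumptions for illustration; this is not CNAG's variant calling pipeline, which uses statistical genotype models and handles indels.

```python
from collections import Counter

def call_variants(reference, alignments, min_depth=3, min_alt_fraction=0.2):
    """Naive SNV caller.

    `alignments` is a list of (0-based start, read) pairs without indels.
    Returns (1-based position, ref, alt, depth, alt_count) tuples.
    """
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for i, base in enumerate(read):
            pileup[start + i][base] += 1
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        ref = reference[pos]
        for alt, n in sorted(counts.items()):
            if alt != ref and n / depth >= min_alt_fraction:
                calls.append((pos + 1, ref, alt, depth, n))
    return calls

# Made-up data: two of four reads carry a T instead of A at position 5.
reference = "ACGTACGTAC"
alignments = [(0, "ACGTTCGT"), (1, "CGTACGTA"), (2, "GTTCGTAC"), (3, "TACGTAC")]
calls = call_variants(reference, alignments)
print(calls)
```

In a trio or somatic design, the same counting would be done per sample and the calls compared between samples.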
CNAG’s Variant Calling Pipeline
J. Camps, S. Derdak, S. Laurie, E. Serra, R. Tonda, JR Trotta, S Beltran
CNAG’s Variant Calling Pipeline: Sensitive and Precise
- NA12878 50x Whole Genome FASTQs from Illumina Platinum Genomes
analyzed with the pipeline: http://www.illumina.com/platinumgenomes/
- Results compared independently for SNPs and INDELs against the NIST reference set: Zook et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014 Mar;32(3):246-51.
- Results (on callable region):
S. Derdak, A. Kanterakis, S. Laurie, E. Serra, R. Tonda, S Beltran
Results: PDF Report
General (6): Chrom, Pos, Ref, Alt, RS, GMAF
Call Specific (5): Genotype, Genotype Quality, Depth, GT Probability Likelihood, Strand Bias
Functional Annotation (12): Gene Name, Coding/Non-coding, Transcript Biotype, Variant Effect, Variant Effect Impact, Functional Class, Codon Change, Amino Acid affected, Transcript ID, Transcript Length, Exon rank in transcript
Effect Prediction (12): SIFT Prediction & Score, Polyphen2 HDIV Prediction and score, Polyphen2 HVAR Prediction and score, Mutation Taster Prediction and score, PhyloP Score, GERP++ Score, SiPhy 29 Mammal Score, CADD Score
Control Populations (6): ESP6500 European-Americans, ESP6500 African-Americans, 1000GP-phase 1 Europeans, 1000GP-phase 1 Africans, 1000GP-phase 1 Asians, ExAC
Results: Relevant Fields in gVCF and Excel
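One way such fields are typically combined is to keep variants that are rare in control populations and have a potentially damaging effect. The Python sketch below shows the idea on invented records. The dictionary keys, the impact labels (`HIGH`, `MODERATE`, `LOW`) and the 1% frequency cut-off are all assumptions for illustration, not the actual column names or filters of the CNAG output.

```python
def is_candidate(variant, max_af=0.01, impacts=("HIGH", "MODERATE")):
    """True if the variant is rare in every control population that has data
    and its effect impact is one of the selected categories."""
    known = [af for af in variant["control_af"].values() if af is not None]
    rare = all(af <= max_af for af in known)
    return rare and variant["impact"] in impacts

# Hypothetical annotated variants (None means no frequency data available).
variants = [
    {"chrom": "6", "pos": 1000, "ref": "A", "alt": "T", "impact": "HIGH",
     "control_af": {"ESP6500_EA": 0.0002, "1000GP_EUR": None, "ExAC": 0.0001}},
    {"chrom": "6", "pos": 2000, "ref": "G", "alt": "C", "impact": "MODERATE",
     "control_af": {"ESP6500_EA": 0.12, "1000GP_EUR": 0.10, "ExAC": 0.11}},
    {"chrom": "6", "pos": 3000, "ref": "C", "alt": "T", "impact": "LOW",
     "control_af": {"ESP6500_EA": None, "1000GP_EUR": None, "ExAC": None}},
]
candidates = [v for v in variants if is_candidate(v)]
print([v["pos"] for v in candidates])
```

Treating missing frequencies as "not common" is a design choice here; a stricter filter could require data in at least one population.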
Results: Secondary data analysis
[Figure panels. Axes: mutation number (all chr) and inter-mutational distance; chromosomal position (Chr 6) and normalized copy number]
Examples in Rare Diseases
RD-Connect: Integration and Sharing
WP1: Coordination, Hanns Lochmüller (Newcastle and TREAT-NMD)
WP2: Patient registries, Domenica Taruscio (ISS and EPIRARE)
WP3: Biobanks, Lucia Monaco (Fondaz. Telethon & EuroBioBank)
WP4: Bioinformatics, Ivo Gut (CNAG Barcelona)
WP5: Unified platform, Christophe Béroud (INSERM Marseille)
WP6: Ethical/legal/social, Mats Hansson (Uppsala)
WP7: Impact/Innovation, Kate Bushby (Newcastle and EUCERD/ EJARD)
[Figure: Microarrays and RNA-seq. Nature Methods, 8 469-477 (2011)]
RNA-Seq Differential Expression
RNA-Seq Analysis Pipeline
A. Esteve, S. Heath
Summary
- NGS has multiple applications, usually with higher precision than microarrays.
- NGS has direct clinical applicability.
- Sequencing can greatly speed up research and diagnostics.
- Analysis is far from being standardized, but results are already very accurate.
- The CNAG offers full collaborations (from experiment design to user-friendly analysed results).