Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... ·...

Bioinformatica e analisi dei genomi

Anno 2016/2017

Pierpaolo Maisano Delsermail: maisanop@tcd.ie

Background

Cusco, Marzo 2009

• Laurea Triennale: Scienze Biologiche, Universita’ degli Studi di Ferrara, Dr. Silvia Fuselli;

• Laurea Specialistica: Scienze Biomolecolarie Cellulari, Universita’ degli Studi di Ferrara, Dr. Silvia Fuselli;

• PhD in Genetics, University of Leicester, prof. Mark A. Jobling;

• Post-doctoral fellow EPHE-MNHN, Paris, Dr.Stefano Mona.

• Research Fellow Trinity College, Dublin, Prof. Daniel Bradley

Muséum national d'Histoire naturelle - Paris

Trinity College Dublin - Ireland

Informazioni pratiche

• Teoria + pratica;

• Software and tools;

• Files;

• Slides on the website;

• Argomenti nuovi / argomenti gia’ trattati;

Informazioni pratiche

Cartella di lavoro (fastq file): /home/bioinfo_file/

File referenza , intevalli per il coverage, genoma per IGV): /home/bioinfo_file/reference/

Ricordatevi i percorsi dei file!!

pwd: mostra la vostra posizione

Programma

• next-generation sequencing (NGS)…come, quando, perche’?

• un esempio di gestione e analisi dati NGS:

• tipo di dato;• file e formati;• programmi;• interpretazione dei risultati;• stima dell’errore;• quando fermarsi?

• Applicazioni e/o progetti su diversi organismi.

capture: exome/custom/cancer

amplicon sequencing

whole genome

mapping to a reference genome

de-novoassembly

sequencing

unalignedreads QC

mapping refinement

mapping QCassembly QC

whole transcriptome

amplicon sequencing: fixed/custom

DNA-seq

RNA-seq

reads trimming

NGS: come, quando, perché?

Filtering

Validation

Domanda: quando?

Risposta: quando ha senso!

• Amplicone 400bp in 100 individui? → Sanger sequencing

• 50 ampliconi in 100 individui? → NGS + target capture

• Gene conversion, elementiripetuti, recombination breakpoints? → NGS + Sanger sequencing

Domanda: perche’?

Risposta: la vostra idea per un progetto!

NGS: come, quando, perché?

un esempio di gestione e analisi dati NGS

Nanopore minIon/gridIon

Pacific Bioscience (PacBio)

Ion torrent PGM/Proton

Roche 454

Illumina MiSeq/HiSeq

capture: exome/custom/cancer

amplicon sequencing

whole genome

mapping to a reference genome

de-novoassembly

sequencing

unalignedreads QC

mapping refinement

mapping QCassembly QC

whole transcriptome

amplicon sequencing: fixed/custom

DNA-seq

RNA-seq

reads trimming

Filtering

Validation

un esempio di gestione e analisi dati NGS

• progetto

• progetto:applicazione (whole genomes? Exomes? Target capture? Amplicon sequencing?)

• progetto:applicazione:scopo (SNPs, indels, repeated elements, CNVs…)

• progetto:applicazione:scopo:coverage (SNPs, indels, repeatedelements, CNVs…)

Project:

• Carcharodon carcharias - the great white shark;

Project:

• Carcharodon carcharias;

• Diploid organism;

• 82 chromosomes (41 pairs);

• Genome size ~5.2 Gb – not fully sequenced yet;

• Target capture experiment;

• Paired-end reads (250bp).

Project:

• Target capture experiment;

Meyerson M et al., 2010, NatRevGenetics

fragment ========================================fragment + adaptors ~~~========================================~~~SE read --------->PE reads R1---------> <---------R2unknown gap ..................................................

Single-end (SE) or paired-end (PE) sequencing.

raw reads (.fastq) 2. alignment to a reference genomeclose reference?

time limited?

distant reference?

stampy

aligned reads (.sam/.bam)

3. bam refinementduplicate removal

local realignment

base recalibration

picardGATK GATK

5. variant callingSNPs/indels

single/multi-sample

samtools

raw variants (.vcf)

ready-to-use variants (.vcf)

4. bam check visualizationduplicate metrics (picard)flagstat (samtools)coverage distribution (GATK)

6. variant filtering and validation

in silico vs in vitro validation

vcftools

variant score recalibration

big datasets

known SNPs/indels

1. Fastq quality control + trimming

Adapters ?Low quality bases?

samtoolsIGV/tablet

.fa/.fasta

.fastq

.sam (.sai)

.bam (.bai)

sequences

read data

mapped reads

mapped reads (binary)

variant information

@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG+8ACCFGFGGGCDGGGGCFGGGGGGGFGGGGGFEFFGGGFFEGG####9#::#####:9######::CD@FG#######:######,:99CF?#######::DBFDE#######4::DFG>#######+9A=D@F###88=+<FFFFGG#######++8@8;EEFG8>DG#######+6@DEFF#######*44D=,:##############*/**2:*#212/8C#######*.*2:/9############)-))##0#######,(##############0((,##########-((-

raw reads (.fastq)

Terminal: more cc_gn2_R1_trimmed.fastq head cc_gn2_R1_trimmed.fastq

raw reads (.fastq)

First mate in the pair (paired-end reads)

Run ID

flowcell ID

Quality values for each nucleotide

Instrument ID

raw reads (.fastq)

lane tile

coordinates of the cluster

Is the read filtered? No (N) or Yes (Y)

Control included? 0=No

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Lowest HighestASCII

33 1260.2......................26...31........41

Illumina 1.8+ Phred+33, raw reads typically (0, 41)

raw reads (.fastq)

Quality values for each nucleotide

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Lowest HighestASCII

33 1260.2......................26...31........41

Illumina 1.8+ Phred+33, raw reads typically (0, 41)

Phred-scale value:

Q = -10*log_10P → P = 10-Q/10

Phred Quality Score(Q)

Probability of incorrect base call

(P)Base call accuracy

10 1 in 10 90%20 1 in 100 99%30 1 in 1000 99.9%40 1 in 10000 99.99%50 1 in 100000 99.999%

raw reads (.fastq)

• Open cc_gn2_R2_trimmed.fastq

Terminal: more cc_gn2_R2_trimmed.fastq OR head cc_gn2_R2_trimmed.fastq

• Are cc_gn2_R1_trimmed.fastq and cc_gn2_R2_trimmed.fastq coming from two different lanes?

• What’s the difference between the two fastq files?

raw reads (.fastq)

cc_gn2_R2_trimmed.fastq

cc_gn2_R1_trimmed.fastq

@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG

@M00725:28:000000000-AJ72K:1:1101:11561:1002 2:N:0:1CCATTTCTNNNNNNNAGGACCTNNNNNNNAGCCCTNNNNNNNNNNNNNAGNATATGANNNNNNNTCTTATTNANCCANNNTCTAGNNNNNNNCTTTCCTNNNNNNNTCTCTGANNNNNNNNNNNNNNCCCTTCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTNTCTCNTNNNNNNNNNNNNAAAATCCNNNNNNNNNNNNNNCCACTAANNNNNNNNNNNNNNAAGAAATAACACACNNNNNNNACAAAAANNNNNNNACAACACNNNNNNNGCATAAANNNA

Same lane, different read mate in the pair!

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

1- Fastq quality control + trimming

Fastqc: quality control of the raw data coming out from the sequencer

• Evaluation of the quality of the generated data;

• Basic summary statistics of the raw data;

• Several modules to evaluate different features (i.e. adapters; base quality, etc…)

• Feedback (green, orange, red): do not fully rely on that, think what does it mean!!

Per base sequence quality: warning

95-99 bp 90-94 bp

What can we do to improve the quality at the end of the reads?

Read Trimming: removal of lower-quality 3' Ends with Low Quality Scores

Per sequence quality score: pass

Sequence length: pass

Adapters removal1- Fastq quality control + trimming

Failed

Warning

Adapters removal1- Fastq quality control + trimming

Overrepresented sequences

Removal of overrepresented sequences (PCR primers).

FASTQC references:

• Software website:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Manual:https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

Alignment : process of determining the most likelylocation within the genome for the observed DNA read

raw reads reference genome

2- Alignment to a reference genome

trade-off: speed vs sensitivity – the higher the accuracy the longer the alignment run

two classes of methods:

Burrows-Wheeler

• Fast• less robust at high divergence

with reference genome• e.g. bwa

Hashing

• slow (needs more memory)• robust at high divergence with

reference genome• e.g. stampy

the shorter the read the harder is to find its location in the genome

big amount of data: computationally challenging for memory and speed

BW: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transformHashing: https://en.wikipedia.org/wiki/Hash_table

raw reads reference genome

low MQ: the probability of mapping to different locations is high, but no perfect multiple matches

high MQ: a single match

MQ0: a perfect multiple match

What if there are several possible places to align your sequencing read?

This may be due to:- Repeated elements in the genome- Low complexity sequences- Reference errors and gaps

MQ is a phred-score of the quality of the alignment

Reference sequence

Element 1 Element 2

Sample_1

Reference sequence

Sample_1

1 copia

Reference sequence

Element 1 Element 2

Sample_1

Element 1

Perfect mul ple matches → MQ0Not a perfect match → Low MQ

Reference sequence

Element 1 Element 2

Sample_1

Element 1

Perfect mul ple matches → MQ0Not a perfect match → Low MQ

Reference sequence

Sample_1

2 copia

1 copia

Reference sequence

Element 1 Element 2

Sample_1

False heterozygous callCluster of heterozygotes

Reference sequence

Sample_1

1 copia

2 copia

1 copia

AluSg7

2- Alignment to a reference genome: mapping with bwa-mem

Three different algorithm:

1. BWA-backtrack: for illumina reads up to 100bp;

2. BWA-SW: long read support, split alignment;

3. BWA-MEM: long read support, split alignment, faster, more accurate

Fastq files are already trimmed → adapters removed

Split read:

Karacok E et al., 2012

• paired-end alignment;

• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;

• Option to mark shorter split hits as secondary (not supplementary).

bwa mem [options] [RefSeq] [fastq1] [fastq2] > cc_gn2_R12.sam

Type: bwa mem to check the options

bwa mem -M cc_ref.fa cc_gn2_R1_trimmed.fastq cc_gn2_R2_trimmed.fastq > cc_gn2_R12.sam

Reference genomeMark shorter split hits as secondary

Fastq 1

Fastq 2

2- Alignment to a reference genome: from sam to bam

Convert sam-to-bam:

samtools view .. .. .. input_sam .. input_bam

• Option to define that the input is a sam file;

• Option to have output in bam format;

• Option to define the output;

samtools view -Sb cc_gn2_R12.sam -o cc_gn2_R12.bam

2- Alignment to a reference genome: from sam to bam

sam-to-bamOutput in bam format

Output file (bam)

Input file (sam)

Input is a sam file

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

SAM/BAM format

SAM – sequence alignment mapBAM – binary alignment map

Standard formats for alignmentBAM is the binary version of SAM – reduced size, easier to store and to access but the full information is not readable by human eye

more cc_gn2_R12.bam

BAM format

more cc_gn2_R12.sam

SAM format

They consist of two parts:

1. Header: contains information about the sample

2. Alignment: contains location and qualities for all the reads

You can find a detailed explanation in the sam/bam format specification (http://samtools.sourceforge.net/SAMv1.pdf).

SAM format

Header

Alignment

Header contains:@HD – header line@SQ – Reference sequence dictionary, one per chromosome,

SN (reference sequence name) and LN (reference sequence length) @RG – Read group@PG – Program, ID (identifier)@CO – comment

Reference Sequence Name Reference Sequence Length

SAM format

Header

Alignment

Alignment contains one line per read, and each line contains 12 columns:

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60

300M = 1754780 309

CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAGTAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT

CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60

300M = 1754780 309

QNAME FLAG

bitwise FLAG

It is an integer, but it represents the sum of different values.

Open Firefox > google.co.uk > Type “bitwise flag broad”

There is a tool online which provides a quick “translation” (https://broadinstitute.github.io/picard/explain-flags.html)

M00725:28:000000000-AJ72K:1:1101:18215:1102 99QNAME FLAG

bitwise FLAG

It is an integer, but it represents the sum of different values.

There is a tool online which provides a quick “translation” (https://broadinstitute.github.io/picard/explain-flags.html)

M00725:28:000000000-AJ72K:1:1101:18215:1102 99QNAME FLAG

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60

300M = 1754780 309

QNAME FLAG RNAME POS MAPQ

CIGAR string

It is a compact representation of sequence alignment. It includes:• M – match or mismatch• I – insertion• D – deletion

read: ACTCA–TGCAGTref: ACTCAGTG––GTcigar 5M1D2M2I2M

read: ACGTCATG––––CAGTref: ACG–CATGCGGCAGTcigar 3M1I4M3D4M

So, what is the cigar line of…?

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60

300M = 1754780 309

M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60

300M = 1754780 309

CIGAR MPOS ISIZE

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

3- Bam refinement – before starting…

BAM missing a RG LINE…what is a RG LINE??

You can find a detailed explanation in the sam/bam format specification (http://samtools.sourceforge.net/SAMv1.pdf).

picard-tools AddOrReplaceReadGroupsINPUT=cc_gn2_R12.bam OUTPUT=cc_gn2_R12_rg.bam RGLB=cc_gn2 RGPL=Illumina RGPU=01 RGSM=shark

sort the bam (this adds the bam extension automatically!)It sorts alignments by coordinates

samtools sort cc_gn2_R12_rg.bam cc_gn2_R12_rg_sorted

samtools index cc_gn2_R12_rg_sorted.bam

Index the sorted bam file

BAM format

samtools view -H cc_gn2_R12_rg_sorted.bam

use samtools to check the header of the BAM

1. How many chromosomes are present in your header?2. Which version of the BAM is it?3. Is it sorted?

We can “read” the header of the BAM file…

@HD VN:1.4 SO:coordinate@SQ SN:cc_ref LN:1784076@RG ID:1 PU:01 LB:cc_gn2 SM:shark PL:Illumina

Yes, by coordinate

1 chromosome (“artificial”)

@HD VN:1.3 SO:coordinate@SQ SN:I LN:230218@SQ SN:II LN:813184@SQ SN:III LN:316620@SQ SN:IV LN:1531933@SQ SN:IX LN:439888@SQ SN:Mito LN:85779@SQ SN:V LN:576874@SQ SN:VI LN:270161@SQ SN:VII LN:1090940@SQ SN:VIII LN:562643@SQ SN:X LN:745751@SQ SN:XI LN:666816@SQ SN:XII LN:1078177@SQ SN:XIII LN:924431@SQ SN:XIV LN:784333@SQ SN:XV LN:1091291@SQ SN:XVI LN:948066@PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa mem -M

3- Bam refinement

Input: BAM

Three main steps:

1. Local realignment

2. Base quality recalibration

3. Duplicate removal

Output: BAM

3- Bam refinement – Local realignment

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCTACTTTCGGATGCTGATCGGGATGCTTTAGCTGA

TCGGATGCTGATCGGGATGCTTTAGCTGATGCTCTGATCGGGATGCTTTAGCTGATGCTGATGG

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCTACTTTCGGATGCTGATC____ATGCTTTAGCTGA

TCGGATGCTGATC____T_GCTTTAGCTGATGCTCTGATC____ATGCTTTAGCTGATGCTGATGG

TC_____T_GCTTTAGCTGATGCTGATGGGCTT

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCTACTTTCGGATGCTGATC____ATGCTTTAGCTGA

TCGGATGCTGATC____ATGCTTTAGCTGATGCTCTGATC____ATGCTTTAGCTGATGCTGATGG

TC____ATGCTTTAGCTGATGCTGATGGGCTT

Problem: Short indels in the sample relative to the reference sequence can pose difficulties for alignment programs. Indels occuring towards the ends of the reads are often not aligned correctly, introducing an excess of SNPs

It uses the full alignment context to determine whether the indel exists.

Two-step process:

1. RealignerTargetCreator: it determines the small suspicious intervals which are likely in need of realignment

2. IndelRealigner: it runs the realignment on those intervals

notes:- having a list of known indels helps

3- Bam refinement – Local realignment

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

Each base call has an associated base call quality (phred-scale).Rule of thumb: anything less than Q20 is not useful data.

The quality of a call depends on multiple factors (e.g. position in the read, sequence context).

In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymoprhisms).

It requires a catalogue of variable sites!

3- Bam refinement – Base Quality Recalibration

How does it work?

List of know variantsBAM files with variants

123 C/A BQ cov1 cov2 cov3…

145 G/A BQ cov1 cov2 cov3…

1298 G/T BQ cov1 cov2 cov3…

1345 C/T BQ cov1 cov2 cov3…

1789 C/G BQ cov1 cov2 cov3…

123 C/A

145 G/A

1345 C/T

BQ: base qualityCovariates: position in the read, sequencing cycle, dinucleotide, …

Considered as real variants Recalibrated BQ using different covariates

BQ: base qualityCovariates: position in the read, end of read with worse calls!

BAM files with variants

1298 G/T BQ cov1 cov2 cov3…

1789 C/G BQ cov1 cov2 cov3…

Considered as real variants

Recalibrated BQ using different covariates

High quality >>>>>>>>>>>>> Low Quality

BQ: base qualityCovariates: position in the read, end of read with worse calls!

BAM files with variants

1298 G/T BQ_1 cov1 cov2 cov3…

1789 C/G BQ_1 cov1 cov2 cov3…

Considered as real variants

Recalibrated BQ using different covariates

High quality >>>>>>>>>>>>> Low Quality

Covariate: cycle number

First cycle: higher qualityLast cycles: lower quality

We will not run the Base Quality Recalibration because of time and list of variants available.

Few more details:

It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).

You can find how to do it at:

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

PCR is used during library preparation.

This can result in duplicate DNA fragments in the final library prep.

3- Bam refinement – Duplicate removal

What we want: information from independent fragmentsWhat we do not want: copies of the same information coming from one fragment

Reference Sequence

PCR is used during library preparation.

This can result in duplicate DNA fragments in the final library prep.

What we want: information from independent fragmentsWhat we do not want: copies of the same information coming from one fragment

Reference Sequence

Possible heterozygote, SNP call

Ref call

• It can result in false SNPs calls.

• Duplicates may fake a high coverage thus giving high support to some variants.

• PCR-free protocols exist but require a large amount of DNA.

Why is it important to do it?

Number of duplicates varies according to the complexity of the library:

• whole genome experiments (<5%)

• custom enrichment ones (<30%)

It must be done after alignment and at the library level.

How does it work?

It identifies read-pairs where the outer ends map to the same position on the genome and removes all but one copy.

picard-tools MarkDuplicatesINPUT=cc_gn2_R12_rg_sorted.bam OUTPUT=cc_gn2_R12_rg_sorted_rmdup.bamMETRICS_FILE=dupl_metrics.txt

Duplicate removal

Input file

Module used

Output file

Metrics file

Duplicate removal

What do we have to do after each step???

Sort and Index the newly generated BAM file

sort the bam

samtools sort cc_gn2_R12_rg_sorted_rmdup.bam cc_gn2_R12_rg_sorted_rmdup_sorted

samtools index cc_gn2_R12_rg_sorted_rmdup_sorted.bam

Index the sorted bam file

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

3- Bam refinement – BAM check

BAM check gives us answers to several questions:

How many duplicates do I have? Is that reasonable for my experiment?

How many of my reads mapped back to the reference? How many of these are paired in mapping? How many pairs are mapped to different chromosomes?

How much average coverage do I have? Is the coverage evenly distributed along my region?

Duplicate removal

Alignment Stats

Coverage

3- Bam refinement – BAM check: duplicate removal

Open the file dupl_metrics.txt

More/cat/gedit

1) Check the % of duplicates;

0.202646 → ~20.3%

3- Bam refinement – BAM check: duplicate removal

## net.sf.picard.metrics.StringHeader# net.sf.picard.sam.MarkDuplicatesINPUT=[/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted.bam] OUTPUT=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/dupl_metrics.txt PROGRAM_RECORD_ID=MarkDuplicatesPROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false## net.sf.picard.metrics.StringHeader# Started on: Thu Nov 10 18:37:26 GMT 2016

## METRICS CLASS net.sf.picard.sam.DuplicationMetricsLIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS

UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATESPERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE

cc_gn2 12953 53727 379573 7026 8687 8687 0.202646

We have 20.3% of PCR duplicates in our experiment

Duplicate removal

Alignment Stats

Coverage

Run flagstat on the BAM file before and after BAM refinement, can you see any difference?

3- Bam refinement – BAM check: alignment stats

BEFORE:

AFTER:

samtools flagstat cc_gn2_R12_rg_sorted.bam

samtools flagstat cc_gn2_R12_rg_sorted_rmdup.bam

509594 + 0 in total (QC-passed reads + QC-failed reads)24400 + 0 duplicates130021 + 0 mapped (25.51%:-nan%)509594 + 0 paired in sequencing256086 + 0 read1253508 + 0 read2100420 + 0 properly paired (19.71%:-nan%)115180 + 0 with itself and mate mapped14841 + 0 singletons (2.91%:-nan%)0 + 0 with mate mapped to a different chr0 + 0 with mate mapped to a different chr (mapQ>=5

509594 + 0 in total (QC-passed reads + QC-failed reads)0 + 0 duplicates130021 + 0 mapped (25.51%:-nan%)509594 + 0 paired in sequencing256086 + 0 read1253508 + 0 read2100420 + 0 properly paired (19.71%:-nan%)115180 + 0 with itself and mate mapped14841 + 0 singletons (2.91%:-nan%)0 + 0 with mate mapped to a different chr0 + 0 with mate mapped to a different chr (mapQ>=5)

BEFORE

24400 + 0 duplicates

LIBRARY UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES PERCENT_DUPLICATIONcc_gn2 7026 8687 0.202646

(8687*2) + 7026 = 24400

Duplicate removal

Alignment Stats

Coverage

Depth of Coverage – The number of reads that spans a given DNA sequence of interest. This is commonly expressed in terms of “Yx” where “Y” is the number of reads and “x” is the unit reflecting the depth of coverage metric (i.e. 5x, 10x, 20x, 100x)

7x 9x11x

3- Bam refinement – BAM check: coverage estimation

java -jar /opt/GATK-3.5-0/GenomeAnalysisTK.jar-I cc_gn2_R12_rg_sorted_rmdup_sorted.bam-R cc_ref.fa-T DepthOfCoverage-o cc_gn2_R12_rg_sorted_rmdup_coverage-L coverage.intervals

Input file

Module used

List of intervals

GATK, coverage per base

Output

Reference sequence

GATK, coverage per base

-L coverage.intervals

List of intervals,

cc_ref:198200-213200

geditcc_gn2_R12_rg_sorted_rmdup_coverage.sample_summary

More/Cat/Head

sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15shark 64770 5.70 11 4 1 10.2Total 64770 5.70 N/A N/A N/A

sample_id total mean granular_third_quartile granular_mediangranular_first_quartile %_bases_above_15

shark 64770 5.70 11 4 1 10.2Total 64770 5.70 N/A N/A N/A

Per base coverage

gedit cc_gn2_R12_rg_sorted_rmdup_coverage

More/Cat/Head

Locus Total_Depth Average_Depth_sample Depth_for_sharkcc_ref:198243 0 0.00 0cc_ref:198244 0 0.00 0cc_ref:198245 0 0.00 0cc_ref:198246 0 0.00 0cc_ref:198247 0 0.00 0cc_ref:198248 0 0.00 0cc_ref:198249 0 0.00 0cc_ref:198250 0 0.00 0cc_ref:198251 0 0.00 0cc_ref:198252 0 0.00 0cc_ref:198253 1 1.00 1

Open R:

• Open a terminal;• Type R;

data<-read.table("cc_gn2_R12_rg_sorted_rmdup_coverage", sep="\t", header=T)

names(data)

old_col<-data$Locus

new_col<-gsub("cc_ref:","",as.character(old_col))

data["pos"]<-new_col

plot(data$pos, data$Depth_for_shark, type="l")

Position (bp)

Coverage

Why is important to check the coverage?

• To check how your experiment performed (one of the ways to assess the quality of your experiment);

• To understand how confident can you be with your data;

• To decide on filtering after variant calling;

• To look for structural variation.

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

4- BAM check: visualisation

Different tools to visualise aligned NGS data:

• IGV: https://www.broadinstitute.org/igv/;

• Tablet: https://ics.hutton.ac.uk/tablet/;

• …

• Samtools tview

Samtools tview:

• Basic visualization tool;

• Terminal based;

• No need of RAM/Java/extra packages;

samtools tviewcc_gn2_R12_rg_sorted_rmdup_sorted.bam -p cc_ref:1215261

Position (Chr:bp)

Input: bam file

4- BAM check: visualisation Reference sequencePosition

Consensus

• . : base matching positive strand;

• , : base matching negative strand;

• underlined: secondary or orphan;

• Uppercase letters: base matching positive strand;

• Lowercase letters: base matching negative strand;

Press “ . ”

Positive strand

Reverse strand

?: menu

q: exit

m: mapping qualityn: nucleotide

b: base quality

.: on/off dots

Secondary or orphanColours for quality

0 ≤ MQ ≤ 9

Press “ ? ”

Press “ q ”, “ m ”

MQ >=30

Press “ b ”

Press “ ? ”

BQ ≥ 30

20 ≤ BQ ≤ 29

10 ≤ BQ ≤ 19

0 ≤ BQ ≤ 9

Press “ n ”

What does it happen??

The four nucleotides are highlighted by four different colours, no more mapping or base

quality

1. Is there any polymorphic site between position 1,174,361 and 1,174,391?

2. Is there any polymorphic sites between position 1,233,201 and 1,233,221?

If so, state the reference and alterative allele, the average quality of the base (BQ), the average mapping quality of the read (MQ) and how would you call that site (i.e. homozygous reference, heterozygous, homozygous alternative).

1. Is there any polymorphic site between position 1,174,361 and 1,174,391?

No polymorphic sites between position 1,174,361 and 1,174,391.

2. Is there any polymorphic sites between position 1,233,201 and 1,233,221?

• Yes, there is one polymorphic sites

• Reference allele: G

• Alternative allele: T

• Call: possible homozygous alternative(T/T)

• Average BQ: BQ ≥ 30 (white)

• Average MQ: MQ ≥ 30 (white)

1. Is there any polymorphic site between position 4761 and 4771?No polymorphic sites between position 4761 and 4771

2. Is there any polymorphic sites between position1781 and 1791?Yes, there is one polymorphic sitesReference allele: GAlternative allele: TCall: possible homozygous alternative (T/T)Average BQ: BQ ≥ 30Average MQ: MQ ≥ 30

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

time limited?

distant reference?

stampy

local realignment

base recalibration

picardGATK GATK

single/multi-sample

samtools

raw variants (.vcf)

vcftools

big datasets

known SNPs/indels

samtoolsIGV/tablet

Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... ·...

Documents

INFORME BIOINFORMATICA

Genetica Taller Bioinformatica

Charla Bioinformatica

Tecnologias Emergentes - Bioinformatica

Bioinformatica I

Sequenziamento e analisi bioinformatica del genoma umano

Bioinformatica Proteinas

Bioinformatica Corso di Laurea Specialistica in Informatica Analisi della struttura dellRNA 27/04/2011

Bioinformatica GeBiX

Genetica bioinformatica

Presentacion Master Bioinformatica

Laboratorio Bioinformatica

Bioinformatica e analisi dei genomi - Unifedocente.unife.it/silvia.fuselli/dispense-corsi/bioinfo...5. variant calling SNPs/indels single/multi-sample samtools raw variants (.vcf)

Informatica e Bioinformatica - Corso di Biologia Molecolare 2didattica.cribi.unipd.it/bioinfouno/2010_2011/lezioni/Dispensa_I.pdf · Informatica e Bioinformatica Insegnamento di Bioinformatica

Bioinformatica Entrez Utilities

curs de bioinformatica

Bioinformatica Classificazione

14 - Analisi Immagini Biomedicali...Riconoscimento e recupero dell’informazione per bioinformatica Analisi Immagini Biomedicali Manuele Bicego Corso di Laurea in Bioinformatica Dipartimento

Lab Bioinformatica

Practica5 Bioinformatica