WTAC NGS Course, Hinxton, 12th April 2014
Lecture 2: Identification of SNPs, Indels, and structural variants
Thomas Keane, Sequence Variation Infrastructure Group, WTSI
Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf
Lecture 2: Identification of SNPs, Indels, and structural variants
  - VCF Format
  - SNP/indel Identification
  - Structural Variation
VCF: Variant Call Format
  - VCF is a standardised format for storing DNA polymorphism data
    - SNPs, insertions, deletions and structural variants
    - With rich annotations (e.g. context, predicted function, sequence data support)
  - Indexed for fast data retrieval of variants from a range of positions
  - Store variant information across many samples
  - Record meta-data about the site
    - dbSNP accession, filter status, validation status
  - Very flexible format
    - Arbitrary tags can be introduced to describe new types of variants
    - No two VCF files are necessarily the same
      - User extensible annotation fields supported
      - Same event can be expressed in multiple ways by including different numbers of reference bases
    - Recommendation on VCF format website to ensure consistency
VCF Format
  - Header section and a data section
  - Header
    - Arbitrary number of meta-data information lines
      - Starting with characters ##
    - Column definition line starts with single #
  - Mandatory columns
    - Chromosome (CHROM)
    - Position of the start of the variant (POS)
    - Unique identifiers of the variant (ID)
    - Reference allele (REF)
    - Comma separated list of alternate non-reference alleles (ALT)
    - Phred-scaled quality score (QUAL)
    - Site filtering information (FILTER)
    - User extensible annotation (INFO)
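As a rough illustration of the layout described above, the following Python sketch skips the `##` meta-data lines and the single `#` column line, then splits each data line into the eight mandatory columns. The names `VcfSite`, `parse_info`, `parse_site` and `read_vcf` are hypothetical helpers invented for this example, and the record in the usage lines is made up. It assumes tab-separated, uncompressed text and does no validation.

```python
from typing import NamedTuple, Optional


class VcfSite(NamedTuple):
    """The eight mandatory VCF columns for one data line (hypothetical container)."""
    chrom: str
    pos: int
    ids: list
    ref: str
    alts: list
    qual: Optional[float]
    filters: list
    info: dict


def parse_info(field):
    """Turn 'DP=10;DB' into {'DP': '10', 'DB': True}; flags have no '=value' part."""
    info = {}
    if field == ".":
        return info
    for entry in field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def parse_site(line):
    """Split one tab-separated data line into its mandatory columns."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return VcfSite(
        chrom=chrom,
        pos=int(pos),
        ids=[] if vid == "." else vid.split(";"),
        ref=ref,
        alts=[] if alt == "." else alt.split(","),   # ALT is comma separated
        qual=None if qual == "." else float(qual),   # '.' means missing
        filters=[] if flt == "." else flt.split(";"),
        info=parse_info(info),
    )


def read_vcf(lines):
    """Yield a VcfSite per data line, skipping '##' meta lines and the '#' column line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_site(line)


# Made-up record, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG,T\t50\tPASS\tDP=10;DB\n",
]
sites = list(read_vcf(example))
```

Keeping missing values as `None` or empty lists, rather than the literal `.`, makes later filtering less error-prone.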
Example VCF (SNPs/indels)
VCF Trivia 1
  - What version of the human reference genome was used?
  - What does the DB INFO tag stand for?
  - What does the ALT column contain?
  - At position 17330, what is the total depth? What is the depth for sample NA00002?
  - At position 17330, what is the genotype of NA00002?
  - Which position is a tri-allelic SNP site?
  - What sort of variant is at position 1234567? What is the genotype of NA00002?
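Questions like the per-sample depth and genotype ones above come down to pairing the FORMAT column with each sample column. The sketch below shows one way to do that. The helper names `sample_fields` and `genotype_alleles` are hypothetical, and the record and sample names (`sampleA`, `sampleB`) are invented for illustration, not taken from the slide.

```python
def sample_fields(column_line, data_line):
    """Return {sample name: {FORMAT key: value}} for one VCF data line."""
    header = column_line.lstrip("#").rstrip("\n").split("\t")
    cols = data_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")            # FORMAT column, e.g. GT:DP
    samples = {}
    for name, raw in zip(header[9:], cols[9:]):
        # zip() tolerates samples that omit trailing FORMAT fields.
        samples[name] = dict(zip(keys, raw.split(":")))
    return samples


def genotype_alleles(gt, ref, alts):
    """Translate a GT string such as '0/1' or '1|2' into allele sequences.

    Index 0 is REF, index 1.. are the ALT alleles in order; '.' is missing.
    """
    alleles = [ref] + list(alts)
    separator = "|" if "|" in gt else "/"   # '|' phased, '/' unphased
    return [None if i == "." else alleles[int(i)] for i in gt.split(separator)]


# Invented record with two hypothetical samples.
column_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB"
data_line = "1\t200\t.\tA\tG,T\t40\tPASS\tDP=9\tGT:DP\t0/1:4\t1|2:5"

per_sample = sample_fields(column_line, data_line)
sample_b_alleles = genotype_alleles(per_sample["sampleB"]["GT"], "A", ["G", "T"])
```

Note that the INFO `DP` describes the site as a whole, while the FORMAT `DP` is reported per sample.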
Functional Annotation
  - VCF can store arbitrary INFO tags per site
    - Genotype FORMAT tags
  - Use tags to describe
    - Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
    - Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
    - Presence of the variant in other large resequencing studies
  - Several tools for annotating a VCF
    - SnpEff: http://snpeff.sourceforge.net/
    - Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
    - FunSeq: http://funseq.gersteinlab.org/
Ensembl - VEP
  - "VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions."
  - Species must be included in either Ensembl or Ensembl Genomes
  - Sequence Ontology (SO) terms to describe genomic context
  - PubMed IDs for variants cited
  - Output only the most severe consequence per variation
  - Online or off-line mode
    - Off-line recommended for large numbers of variants (download relevant cache)
  - Human specific annotations
    - SIFT: predicts whether an amino acid substitution affects protein function
    - PolyPhen: predicts impact of an amino acid substitution on the structure of human proteins
    - 1000 Genomes frequencies: global or per population
VEP VCF
  - VEP INFO tag: ##INFO=
  - Example (one comma-separated entry per transcript):
CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
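The CSQ value above is a comma-separated list of per-transcript entries, each a `|`-separated list of fields. A minimal sketch of unpacking it follows. The field order is normally declared in the `##INFO` header line, which is not reproduced here, so the positions used below (allele, gene, transcript, consequence and gene symbol at indices 0, 1, 2, 4 and 16) are assumptions read off the example string. The helper names are hypothetical.

```python
def split_csq(info_value):
    """Split a CSQ value into one list of fields per transcript entry."""
    value = info_value[4:] if info_value.startswith("CSQ=") else info_value
    return [entry.split("|") for entry in value.split(",")]


def summarise(entries, allele=0, gene=1, transcript=2, consequence=4, symbol=16):
    """Pick a few fields by position; the positions are assumptions, see lead-in."""
    return [
        {
            "allele": fields[allele],
            "gene": fields[gene],
            "transcript": fields[transcript],
            "consequence": fields[consequence],
            "symbol": fields[symbol],
        }
        for fields in entries
    ]


# The CSQ string from the slide, joined back into a single value.
csq = (
    "CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant|"
    "|||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|"
    "34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||"
    "rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17"
)

entries = split_csq(csq)
summary = summarise(entries)
```

In real use the field names should be read from the header's description rather than hard-coded, since the set of fields depends on the options VEP was run with.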
More Information
  - VCF
    - http://bioinformatics.oxfordjournals.org/content/27/15/2156.full
    - http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
  - VCFTools
    - http://vcftools.sourceforge.net
  - GATK
    - http://www.broadinstitute.org/gatk/
    - http://www.broadinstitute.org/gatk/guide/article?id=1268
  - VCF Annotation
    - Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/index.html
    - SnpEff: http://snpeff.sourceforge.net/
    - Anntools: http://anntools.sourceforge.net/
SNP Identification
  - SNP: single nucleotide polymorphism
    - Examine the bases aligned to a position and look for differences
  - SNP discovery vs genotyping
    - Discovery: finding new variant sites
    - Genotyping: determining the genotype at a set of already known sites
  - Factors to consider when calling SNPs
    - Base call qualities of each supporting base
    - Proximity to
      - Small indel
      - Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
    - Mapping qualities of the reads supporting the SNP
      - Low mapping qualities indicate repetitive sequence
    - Read length
      - Possible to align reads with high confidence to a larger portion of the genome with longer reads
    - Paired reads
    - Sequencing depth
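To make the idea of examining the bases aligned to a position concrete, here is a deliberately naive Python sketch that applies three of the factors listed above (base quality, mapping quality and depth) to a single pileup column. It is not how production callers work: they use probabilistic genotype models. The function name and all thresholds are hypothetical values chosen for illustration.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def candidate_snp(ref_base, pileup, min_baseq=20, min_mapq=30,
                  min_depth=8, min_alt_fraction=0.2):
    """Return evidence for a candidate SNP at one position, or None.

    pileup: iterable of (base, base_quality, mapping_quality) tuples,
    one per read overlapping the position. Thresholds are illustrative.
    """
    ref_base = ref_base.upper()
    # Keep only confidently called bases from confidently mapped reads.
    usable = [
        base.upper()
        for base, baseq, mapq in pileup
        if baseq >= min_baseq and mapq >= min_mapq and base.upper() in BASES
    ]
    depth = len(usable)
    if depth < min_depth:
        return None                      # too little evidence either way
    alt_counts = Counter(base for base in usable if base != ref_base)
    if not alt_counts:
        return None                      # every usable base matches the reference
    alt, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth < min_alt_fraction:
        return None                      # more likely sequencing error
    return {"alt": alt, "depth": depth, "alt_count": alt_count,
            "alt_fraction": alt_count / depth}


# Six reference-supporting reads, four good alternate reads, five poor ones.
column = ([("A", 35, 60)] * 6) + ([("G", 35, 60)] * 4) + ([("G", 5, 60)] * 5)
evidence = candidate_snp("A", column)
```

The five low-quality `G` bases are ignored, so the alternate allele is supported by 4 of 10 usable bases rather than 9 of 15.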
Mouse SNP
Read Length vs. Uniqueness
Inaccessible Genome
Is this a real SNP?
Evaluating SNPs
  - Specificity vs sensitivity
    - False positives vs. false negatives
  - Desirable to have high sensitivity and specificity
  - Sensitivity
    - External sources of validation
  - Specificity
    - Test a random selection of SNPs by another technology, e.g. Sequenom, Sanger sequencing
  - Receiver operating characteristic (ROC) curves to investigate effects of varying parameters
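The two measures above can be written down directly once a set of assessed sites and a trusted truth set are available. The sketch below computes sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), then sweeps a quality threshold to produce points for a ROC curve. The function names and the toy data are invented for illustration. It assumes every assessed site is known to be either a true variant or not.

```python
def sensitivity_specificity(called, truth, assessed):
    """Compare a call set with a truth set over the sites that were assessed."""
    called = called & assessed
    truth = truth & assessed
    tp = len(called & truth)                 # true variants that were called
    fn = len(truth - called)                 # true variants that were missed
    fp = len(called - truth)                 # calls that are not real variants
    tn = len(assessed - truth - called)      # non-variant sites left uncalled
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    return sensitivity, specificity


def roc_points(scored_calls, truth, assessed, thresholds):
    """Return (threshold, false positive rate, sensitivity) per threshold.

    scored_calls maps site -> score (for example QUAL); a site counts as
    called when its score is at least the threshold.
    """
    points = []
    for threshold in thresholds:
        called = {site for site, score in scored_calls.items() if score >= threshold}
        sensitivity, specificity = sensitivity_specificity(called, truth, assessed)
        points.append((threshold, 1 - specificity, sensitivity))
    return points


# Toy data: ten assessed sites, four of them true variants.
assessed = set(range(1, 11))
truth = {1, 2, 3, 4}
scored_calls = {1: 50, 2: 30, 3: 10, 5: 40, 6: 5}

strict = sensitivity_specificity({1, 2, 5}, truth, assessed)
curve = roc_points(scored_calls, truth, assessed, thresholds=[0, 20])
```

Raising the threshold trades sensitivity for specificity, which is the behaviour a ROC curve makes visible.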
Known Systematic Biases
  - Many biases can be introduced in sample preparation, the sequencing process, computational alignment steps etc.
    - Can generate false positive SNPs/indels
  - Potential biases
    - Strand bias
    - End distance bias
    - Consistency across replicates/libraries
    - Variant distance bias
  - VCF Tools
    - Soft filter variants file for these biases
    - Variants kept in the file, just annotated with potential bias affecting the variant
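As an illustration of soft filtering, the sketch below tests one of the listed biases, strand bias, with a two-sided Fisher's exact test on the counts of reference and alternate bases seen on each strand, and adds a tag to the FILTER column instead of removing the record. This is a generic approach and not a description of how VCFtools or any particular caller implements it. The tag name `StrandBias`, the p-value cut-off and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(col1, row1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def soft_filter(filter_field, tag):
    """Add tag to a FILTER value; the variant itself stays in the file."""
    if filter_field in (".", "PASS"):
        return tag
    tags = filter_field.split(";")
    return filter_field if tag in tags else ";".join(tags + [tag])


def flag_strand_bias(filter_field, ref_fwd, ref_rev, alt_fwd, alt_rev,
                     max_p=0.01, tag="StrandBias"):
    """Return (possibly updated FILTER value, p-value) for one site."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return (soft_filter(filter_field, tag) if p < max_p else filter_field), p


# Alternate allele seen only on the forward strand: flagged.
biased_filter, biased_p = flag_strand_bias("PASS", 20, 20, 12, 0)
# Alternate allele balanced across strands: left alone.
balanced_filter, balanced_p = flag_strand_bias("PASS", 10, 10, 5, 5)
```

Because the record is only annotated, downstream users can still choose to include flagged sites.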
Strand Bias
End Distance Bias
Variant Distance Bias
Reproducibility
Future of Variant Calling?
  - Current approaches
    - Rely heavily on the supplied alignment
    - Largely site based, don't examine local haplotype
  - Local de novo assembly based variant callers
    - Call SNP, INDEL, MNP and small SV simultaneously
    - Can remove mapping artifacts
    - e.g. GATK HaplotypeCaller
Haplotype Based Calling - GATK
Genomic Structural Variation
  - Large DNA rearrangements (>100bp)
  - Frequent causes of disease
    - Referred to as genomic disorders
    - Mendelian diseases or complex traits such as behaviors
    - E.g. increase in gene dosage due to increase in copy number
    - Prevalent in cancer genomes
  - Many types of genomic structural variation (SV)
    - Insertions, deletions, copy number changes, inversions, translocations & complex events
  - Comparative genomic hybridization (CGH) traditionally used for copy number discovery
    - CNVs of 1-50 kb in size have been under-ascertained
  - Next-gen sequencing revolutionised field of SV discovery
    - Parallel sequencing of ends of large numbers of DNA fragments
    - Examine alignment distance of reads to discover presence of genomic rearrangements
    - Resolution down to ~100bp
Human Disease
  - Stankiewicz and Lupski (2010) Ann. Rev. Med.
Structural Variation
  - Several types of structural variations (SVs)
    - Large insertions/deletions
    - Inversions
    - Translocations
  - Read pair information used to detect these events
    - Paired end sequencing of either end of DNA fragment
    - Observe deviations from the expected fragment size
    - Presence/absence of mate pairs
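The read pair signals above can be sketched as a simple per-pair classifier. The Python below assumes a standard paired-end library in which the two mates of a concordant pair map to the same chromosome on opposite strands, and it approximates fragment size by the distance between the mates' leftmost positions. The class, the labels and the three standard deviation cut-off are hypothetical choices for illustration. Real SV callers cluster many such pairs before making a call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadPair:
    """Mapping summary of one read pair (hypothetical container)."""
    chrom1: str
    pos1: int
    strand1: str                 # '+' or '-'
    chrom2: Optional[str]        # None when the mate is unmapped
    pos2: Optional[int]
    strand2: Optional[str]


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by the kind of rearrangement it may support."""
    if pair.chrom2 is None:
        return "mate_unmapped"               # absence of the mate
    if pair.chrom1 != pair.chrom2:
        return "translocation_candidate"     # mates on different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion_candidate"         # unexpected relative orientation
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        # Mates map further apart on the reference than the fragment is long,
        # so sequence present in the reference is missing from the sample.
        return "deletion_candidate"
    if span < mean_insert - n_sd * sd_insert:
        # Mates map closer together than expected: extra sequence in the sample.
        return "insertion_candidate"
    return "concordant"


# Library with a 400bp mean fragment size and 30bp standard deviation.
labels = [
    classify_pair(ReadPair("1", 1000, "+", "1", 1410, "-"), 400, 30),
    classify_pair(ReadPair("1", 1000, "+", "1", 3000, "-"), 400, 30),
    classify_pair(ReadPair("1", 1000, "+", "1", 1100, "-"), 400, 30),
    classify_pair(ReadPair("1", 1000, "+", "1", 1400, "+"), 400, 30),
    classify_pair(ReadPair("1", 1000, "+", "7", 5000, "-"), 400, 30),
    classify_pair(ReadPair("1", 1000, "+", None, None, None), 400, 30),
]
```

The mean and standard deviation would normally be estimated from the library's own fragment size distribution.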
Structural Variation Types
Fragment Size QC
What is this?
What is this?
What is this?
Mobile Element Insertions
  - Transposons are segments of DNA that can move within the genome
    - A minimal genome: ability to replicate and change location
    - Relics of ancient viral infections
  - Dominate landscape of mammalian genomes
    - 38-45% of rodent and primate genomes
    - Genome size proportional to number of TEs
  - Class 1 (RNA intermediate) and 2 (DNA intermediate)
  - Potent genetic mutagens
    - Disrupt expression of genes
    - Genome reorganisation and evolution
    - Transduction of flanking sequence
  - Species specific families
    - Human: Alu, L1, SVA
    - Mouse: SINE, LINE, ERV
    - Many other families in other species
Human Mobile Elements
Mobile Element Insertions
Mouse Example - LookSeq
Human Alu - IGV
Detecting Mobile Element Insertions
  - Most algorithms for locating non-reference mobile elements operate in a similar manner
  - Goal: detect all read pairs where one end is flanking the insertion point and the mate is in the inserted sequence
  - Pseudo algorithm
    1. Read through BAM file and make list of all discordant read pairs
    2. Keep the pairs where one end is similar to your library of mobile elements
    3. Remove anchor reads with low mapping quality
    4. Cluster the anchor reads and examine breakpoint
    5. Filter out any clusters close to annotated elements of the same type
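The later steps of the pseudo algorithm above can be sketched in plain Python. This example assumes the discordant pairs have already been extracted from the BAM file and that each mate has already been compared with a mobile element library, so each input record carries the anchor read's position and the element family its mate matched. It is a simplified illustration, not the implementation used by RetroSeq or any other tool. The class name, thresholds and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchoredPair:
    """One discordant pair, described from the anchor read's side."""
    chrom: str          # chromosome of the anchor (uniquely mapped) read
    pos: int            # position of the anchor read
    mapq: int           # mapping quality of the anchor read
    mate_family: str    # element family the mate resembles, '' if none


def call_insertions(pairs, annotated, min_mapq=30, window=500,
                    min_support=3, exclusion=100):
    """Return (chrom, start, end, family, support) for candidate insertions.

    annotated: iterable of (chrom, start, end, family) reference elements.
    """
    # Keep pairs whose mate matched the library and whose anchor maps well.
    anchors = sorted(
        (p for p in pairs if p.mate_family and p.mapq >= min_mapq),
        key=lambda p: (p.mate_family, p.chrom, p.pos),
    )
    # Cluster neighbouring anchors of the same family on the same chromosome.
    clusters = []
    for anchor in anchors:
        last = clusters[-1][-1] if clusters else None
        if (last and last.mate_family == anchor.mate_family
                and last.chrom == anchor.chrom
                and anchor.pos - last.pos <= window):
            clusters[-1].append(anchor)
        else:
            clusters.append([anchor])
    calls = []
    for cluster in clusters:
        if len(cluster) < min_support:
            continue
        family, chrom = cluster[0].mate_family, cluster[0].chrom
        start, end = cluster[0].pos, cluster[-1].pos
        # Drop clusters next to a reference element of the same family.
        near_known = any(
            a_chrom == chrom and a_family == family
            and a_start - exclusion <= end and start <= a_end + exclusion
            for a_chrom, a_start, a_end, a_family in annotated
        )
        if not near_known:
            calls.append((chrom, start, end, family, len(cluster)))
    return calls


pairs = [
    AnchoredPair("chr1", 1000, 60, "Alu"),
    AnchoredPair("chr1", 1100, 60, "Alu"),
    AnchoredPair("chr1", 1200, 60, "Alu"),
    AnchoredPair("chr1", 1150, 5, "Alu"),     # low mapping quality anchor
    AnchoredPair("chr2", 5000, 60, "L1"),
    AnchoredPair("chr2", 5050, 60, "L1"),
    AnchoredPair("chr2", 5100, 60, "L1"),
    AnchoredPair("chr3", 900, 60, ""),        # mate matched no element
]
annotated = [("chr2", 5150, 6000, "L1")]
calls = call_insertions(pairs, annotated)
```

The chr2 cluster is discarded because it sits beside an annotated L1, which is the usual source of false non-reference calls.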
1000 Genomes CEU Trio
  - Typical human sample
    - ~900-1000 non-reference mobile elements
    - ~800 Alu elements, ~100 L1
  - Why are there 44 calls private to the child?
Mobile Element Software
  - RetroSeq: https://github.com/tk2/RetroSeq
  - VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
  - T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
  - Tea: http://compbio.med.harvard.edu/Tea/