Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
2014 -‐ BMMB 852D: Applied Bioinforma9cs
Week 11, Lecture 21
István Albert
Biochemistry and Molecular Biology and Bioinforma9cs Consul9ng Center
Penn State
Sequence data to genotypes
l A common sequencing workflow
Sequencing reads à Alignments à Variant calls
FASTQ SAM/BAM VCF
a list of short sequences
a list of short sequences
and where they are in the genome
a list of locations in the genome and
what the base is at each
Pileups • Pileup à show all bases at a given index
What are variant calls?
l Naive variant calling
- Check all the reads that cover base chr1:291 - Add up the bases at chr1:291 - e.g. 10 A's, 2 G's
· Is this an A/G heterozygous site or two sequencing errors?
l Actual variant callers - Es9mate likelihood of a variant site vs a sequencing error
· Sequencing error rate · Quality scores
Note: it is not always obvious what the underlying assump9ons of a snp caller are. Especially when used for genomes other than human/mouse. These are by far the most studied and customized for.
VCF: Variant Call Format
l Represent a list of loca9ons and the variant call at each
- Simple, right? Yes and no.
- Simple founda9on · Loca9on and base
- Complex “bonus features” · Indels, structural variants, etc. · Mul9ple samples · Haplotype phasing
Short intro on the samtools website
Full VCF reference in the pdf
VCF Poster: great reference
VCF: The simple part
l loca9on, reference base, your base
- CHROM/POS, REF, ALT
- a lot like wgsim's muta9ons.txt
VCF: the more complicated parts
Small VCF files are easiest to interpret in Excel
VCF: Mul9ple samples
l VCF can have a variable number of columns!
l Column headings are the sample names http://vcftools.sourceforge.net/VCF-poster.pdf
VCF review
l VCF can represent SNV calls
l and much, much more - Indels (G GC) - Mul9ple variants per site (in ALT column) - Mul9ple samples (SAMPLE columns)
l Check poster for quick overview - hfp://vcgools.sourceforge.net/VCF-‐poster.pdf
l Check full specifica9on for details - hfp://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-‐variant-‐call-‐format-‐version-‐41
VCF and BCF
• The same rela9onship as SAM and BAM formats
• BCF – binary, compressed VCF – much smaller but need to be operated on with bc.ools
DIY snp caller
SNP calling comparison
Homework 21 -‐ Generate alignments from a mutated genome (or use the prior results).
-‐ Visualize the expected muta9ons in IGV
-‐ Call SNPS with a snp caller of your choice.
-‐ Overlay your snp calls with those that IGV shows and those that you expect to see.
-‐ Evaluate/discuss the snps that IGV makes visible, those visible in the samtools output versus the real muta9ons