16
2014 BMMB 852D: Applied Bioinforma9cs Week 11, Lecture 21 István Albert Biochemistry and Molecular Biology and Bioinforma9cs Consul9ng Center Penn State

%Week%11,%Lecture%21% - personal.psu.edu

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: %Week%11,%Lecture%21% - personal.psu.edu

2014  -­‐  BMMB  852D:  Applied  Bioinforma9cs  

   Week  11,  Lecture  21  

István  Albert    

Biochemistry  and  Molecular  Biology    and  Bioinforma9cs  Consul9ng  Center  

 Penn  State  

Page 2: %Week%11,%Lecture%21% - personal.psu.edu

Sequence  data  to  genotypes  

l  A  common  sequencing  workflow  

 

Sequencing  reads    à      Alignments    à      Variant  calls  

             FASTQ                                              SAM/BAM                                      VCF  

a list of short sequences

a list of short sequences

and where they are in the genome

a list of locations in the genome and

what the base is at each

Page 3: %Week%11,%Lecture%21% - personal.psu.edu

Pileups      •  Pileup  à  show  all  bases  at  a  given  index  

 

Page 4: %Week%11,%Lecture%21% - personal.psu.edu

What  are  variant  calls?  

l  Naive  variant  calling  

-  Check  all  the  reads  that  cover  base  chr1:291  -  Add  up  the  bases  at  chr1:291  -  e.g.  10  A's,  2  G's  

·  Is  this  an  A/G  heterozygous  site  or  two  sequencing  errors?    

l  Actual  variant  callers  -  Es9mate  likelihood  of  a  variant  site  vs  a  sequencing  error  

·  Sequencing  error  rate  ·  Quality  scores  

Note:  it  is  not  always  obvious  what  the  underlying  assump9ons  of  a  snp  caller  are.  Especially  when  used  for  genomes  other  than  human/mouse.  These  are  by  far  the  most  studied  and  customized  for.  

 

Page 5: %Week%11,%Lecture%21% - personal.psu.edu

VCF:  Variant  Call  Format  

l  Represent  a  list  of  loca9ons  and  the  variant  call  at  each  

-  Simple,  right?  Yes  and  no.    

-  Simple  founda9on  ·  Loca9on  and  base    

-  Complex  “bonus  features”  ·  Indels,  structural  variants,  etc.  · Mul9ple  samples  ·  Haplotype  phasing  

Page 6: %Week%11,%Lecture%21% - personal.psu.edu

Short  intro  on  the  samtools  website  

Page 7: %Week%11,%Lecture%21% - personal.psu.edu

Full  VCF  reference  in  the  pdf  

Page 8: %Week%11,%Lecture%21% - personal.psu.edu

VCF  Poster:  great  reference  

Page 9: %Week%11,%Lecture%21% - personal.psu.edu

VCF:  The  simple  part  

l  loca9on,  reference  base,  your  base  

-  CHROM/POS,  REF,  ALT              

-  a  lot  like  wgsim's  muta9ons.txt  

Page 10: %Week%11,%Lecture%21% - personal.psu.edu

VCF:  the  more  complicated  parts  

Small  VCF  files  are  easiest  to  interpret  in  Excel  

Page 11: %Week%11,%Lecture%21% - personal.psu.edu

VCF:  Mul9ple  samples  

l  VCF  can  have  a  variable  number  of  columns!  

 

 

 

 

 

l  Column  headings  are  the  sample  names  http://vcftools.sourceforge.net/VCF-poster.pdf

Page 12: %Week%11,%Lecture%21% - personal.psu.edu

VCF  review  

l  VCF  can  represent  SNV  calls  

l  and  much,  much  more  -  Indels  (G          GC)  - Mul9ple  variants  per  site  (in  ALT  column)  - Mul9ple  samples  (SAMPLE  columns)  

l  Check  poster  for  quick  overview  -  hfp://vcgools.sourceforge.net/VCF-­‐poster.pdf  

l  Check  full  specifica9on  for  details  -  hfp://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-­‐variant-­‐call-­‐format-­‐version-­‐41  

Page 13: %Week%11,%Lecture%21% - personal.psu.edu

VCF  and  BCF  

•  The  same  rela9onship  as  SAM  and  BAM  formats  

•  BCF  –  binary,  compressed  VCF  –  much  smaller  but  need  to  be  operated  on  with  bc.ools  

 

Page 14: %Week%11,%Lecture%21% - personal.psu.edu

DIY  snp  caller  

Page 15: %Week%11,%Lecture%21% - personal.psu.edu

SNP  calling  comparison  

Page 16: %Week%11,%Lecture%21% - personal.psu.edu

Homework  21  -­‐  Generate  alignments  from  a  mutated  genome  (or  use  the  prior  results).  

 

-­‐  Visualize  the  expected  muta9ons  in  IGV  

 

-­‐  Call  SNPS  with  a  snp  caller  of  your  choice.    

 

-­‐  Overlay  your  snp  calls  with  those  that  IGV  shows  and  those  that  you  expect  to  see.  

 

-­‐  Evaluate/discuss  the  snps  that  IGV  makes  visible,  those  visible  in  the  samtools  output  versus  the  real  muta9ons