Snippy
Torsten Seemann
Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015
Rapid bacterial variant calling & core genome alignments
Background
(Far) south east England
Phyloflagomics
UK / Birmingham
Australia / Victoria
Canada / British Columbia
A new home
Centre for Applied Microbial Genomics
Microbiological Diagnostic Unit
∷ Oldest public health lab in Australia: established 1897 in Melbourne: large historical isolate collection back to 1950s
∷ National reference laboratory: Salmonella, Listeria, EHEC
∷ WHO regional reference lab: vaccine preventable invasive bacterial pathogens
New director
∷ Professor Ben Howden: clinician, microbiologist, pathologist: early adopter of genomics and bioinformatics: long term collaborator on MRSA/VRE w/ Tim Stinear
∷ Mandate: modernise service delivery: enhance research output and collaboration: nationally lead the conversion to WGS
Hardware
∷ Sequencers: NextSeq 500: 3 x MiSeq: PacBio RS II (arriving 22 May)
∷ Robots: Perkin Elmer (does not have a Twitter account): Colony picker
∷ Compute: 240 TB, 10 GigE, 3 x 72 core boxes
Variant calling
∷ Find DNA differences between genomes: variants to explain phenotype: validate your complemented mutant
∷ Two approaches: reference based (read alignment): reference-free (de novo assembly / k-mer based)
Types of variants
∷ Substitutions: single nucleotide polymorphism (snp) A➝C: multiple nucleotide polymorphism (mnp) AG➝TC
∷ Indels: insertion (ins) A➝AC : deletion (del) ACCG➝AG
∷ Complex: compound events AC➝T
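The variant types above can be told apart from the REF and ALT alleles alone. The sketch below is a simple heuristic written for illustration, not the logic freebayes or Snippy actually uses; the function name `classify` and its rules (an mnp requires every base to differ, an indel must be a single contiguous gain or loss) are assumptions.

```python
def classify(ref: str, alt: str) -> str:
    """Heuristically label a variant as snp, mnp, ins, del or complex."""
    if ref == alt:
        raise ValueError("REF and ALT are identical: not a variant")
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "snp"
        # Assumption: an mnp changes every base; otherwise call it complex.
        if all(r != a for r, a in zip(ref, alt)):
            return "mnp"
        return "complex"
    short, long_ = sorted((ref, alt), key=len)
    # Length of the shared prefix.
    p = 0
    while p < len(short) and short[p] == long_[p]:
        p += 1
    # Length of the shared suffix, not overlapping the prefix.
    s = 0
    while s < len(short) - p and short[-1 - s] == long_[-1 - s]:
        s += 1
    # A pure indel: the shorter allele is fully covered by prefix + suffix.
    if p + s == len(short):
        return "ins" if len(alt) > len(ref) else "del"
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(f"{ref} -> {alt}: {classify(ref, alt)}")
```

Run on the five examples from this slide, it reports snp, mnp, ins, del and complex in that order.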
My solution
Snippy
∷ Fast → snappy
∷ Finds variants → SNPs
∷ Australian → Skippy the bush kangaroo
Input
∷ FASTQ files: paired end, interleaved, or single-end
∷ Reference: FASTA or Genbank
∷ Output folder: self contained bundle of results
Inside the black box
∷ bwa mem - no clipping needed
∷ samtools - sorted, filtered BAM
∷ freebayes - split / GNU parallel / merge
∷ vcflib/vcftools - VCF filtering
∷ perl - glue
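The "split / GNU parallel / merge" step can be pictured as cutting the reference into regions that are called independently and then merged back in order. The sketch below only shows the splitting; the function name `split_regions`, the contig names and lengths, and the 1-based inclusive `name:start-end` region format are all assumptions for illustration, and the coordinate convention would need adapting to whichever caller consumes the regions.

```python
def split_regions(contig_lengths: dict[str, int], chunk_size: int) -> list[str]:
    """Cut each contig into consecutive regions of at most chunk_size bases.

    Regions are returned as 'name:start-end' strings using 1-based,
    inclusive coordinates (an assumption made for this sketch).
    """
    regions = []
    for name, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical reference: one chromosome and one small plasmid.
regions = split_regions({"chr": 250_000, "plas": 4_000}, chunk_size=100_000)
for region in regions:
    print(region)
```

Each region could then be handed to a separate worker, and the per-region results concatenated in the same order to give one merged call set.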
Outputs
∷ Read alignments: .bam / .bai
∷ Variants: .vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html
∷ Consensus: reference with all variants applied to it
∷ Genome alignment: reference with “-” (missing) and “N” low depth
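The consensus output is described as the reference with all variants applied to it. A minimal sketch of that idea follows, assuming non-overlapping variants given as (1-based position, REF, ALT) tuples; the function name `apply_variants` and the toy reference are hypothetical, and this is not Snippy's own implementation.

```python
def apply_variants(ref: str, variants: list[tuple[int, str, str]]) -> str:
    """Return ref with every (pos, ref_allele, alt_allele) variant applied.

    Positions are 1-based. Variants are applied from right to left so that
    indels do not shift the coordinates of variants still to be applied.
    Assumes the variants do not overlap.
    """
    seq = ref
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        end = start + len(ref_allele)
        if seq[start:end] != ref_allele:
            raise ValueError(f"REF allele {ref_allele!r} does not match at {pos}")
        seq = seq[:start] + alt_allele + seq[end:]
    return seq


reference = "GATTACCAGC"  # toy reference, not real data
variants = [(2, "A", "G"), (4, "TA", "T"), (8, "A", "AC")]  # snp, del, ins
print(apply_variants(reference, variants))  # GGTTCCACGC
```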
TAB output
CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 CT:0      CDS    +                       ECO_p012   rep   hypothetical protein
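The EVIDENCE column packs per-allele read counts into one field, so a little parsing is needed to use it. The sketch below assumes the file is tab-separated with space-separated `allele:count` pairs inside EVIDENCE; the helper names `parse_evidence` and `alt_fraction` are hypothetical, and only the first six columns of two rows from the table are reproduced.

```python
import csv
import io

# Two rows from the table, first six columns only (tab-separated assumed).
TAB_TEXT = (
    "CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\n"
    "chr\t5958\tsnp\tA\tG\tG:44 A:0\n"
    "chr\t35524\tsnp\tG\tT\tT:73 G:1 C:1\n"
)


def parse_evidence(field: str) -> dict[str, int]:
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in field.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


def alt_fraction(row: dict[str, str]) -> float:
    """Fraction of the listed reads that support the ALT allele."""
    counts = parse_evidence(row["EVIDENCE"])
    total = sum(counts.values())
    return counts.get(row["ALT"], 0) / total if total else 0.0


rows = list(csv.DictReader(io.StringIO(TAB_TEXT), delimiter="\t"))
for row in rows:
    print(row["CHROM"], row["POS"], row["TYPE"], round(alt_fraction(row), 3))
```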
Phylogenomics
Phylogenetics 101
∷ Choose some genes
∷ Sequence each gene from each isolate
∷ Align the protein sequences of each gene
∷ Back-align to nucleotide space
∷ Concatenate all the alignments
∷ Construct a distance matrix (many ways)
∷ Draw a tree (many ways)
∷ Make wild inferences from little data
Phylogenomics 101
∷ Assemble each genome
∷ Perform whole genome alignment : in nucleotide space, as don’t know what is coding: very computationally expensive: can’t parallelize as with individual genes
∷ Continue as for phylogenetics
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
∷ Ideally, feed this directly to a tree builder
∷ Properly model gaps, codons and ambiguity
∷ Hard!
Whole genome alignment
Core genome SNPs
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
Core sites are present in all genomes.
Core genome
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
Core SNPs = polymorphic sites in the core genome
Core SNPs
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs'        |   |  |          |
Unambiguous core SNPs
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
SNPs'        |   |  |          |
             a   t  a          a
             t   t  t          t
             a   c  a          g
             1   2  3          4
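The steps from alignment to allele sites can be sketched in a few lines. This is an illustration using the three-genome toy alignment from the slides, not Snippy's own code; the function names are hypothetical. A site is treated as core when no genome has a gap, as a SNP when more than one symbol is present, and as unambiguous when every symbol is A, C, G or T.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_sites(aln: dict[str, str]) -> list[int]:
    """0-based columns where no genome has a gap ('-')."""
    return [i for i, col in enumerate(zip(*aln.values())) if "-" not in col]


def core_snps(aln: dict[str, str]) -> list[int]:
    """Core columns containing more than one distinct symbol."""
    cols = list(zip(*aln.values()))
    return [i for i in core_sites(aln) if len(set(cols[i])) > 1]


def unambiguous_core_snps(aln: dict[str, str]) -> list[int]:
    """Core SNP columns made up only of A, C, G and T."""
    cols = list(zip(*aln.values()))
    return [i for i in core_snps(aln) if set(cols[i]) <= set("ACGT")]


def allele_sites(aln: dict[str, str]) -> dict[str, str]:
    """Reduce each genome to its bases at the unambiguous core SNP columns."""
    keep = unambiguous_core_snps(aln)
    return {name: "".join(seq[i] for i in keep) for name, seq in aln.items()}


print(len(core_sites(ALIGNMENT)))                       # 17
print([i + 1 for i in core_snps(ALIGNMENT)])            # [8, 12, 15, 23, 26]
print([i + 1 for i in unambiguous_core_snps(ALIGNMENT)])  # [8, 12, 15, 26]
print(allele_sites(ALIGNMENT))
# {'bug1': 'ATAA', 'bug2': 'TTTT', 'bug3': 'ACAG'}
```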
Allele sites
>bug1
ATAA
>bug2
TTTT
>bug3
ACAG
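One of the "many ways" to build a distance matrix from these allele sites is to count pairwise differences (Hamming distance). The sketch below does only that; it is not a claim about how the tree on the next slide was built, and the name `snp_distance` is hypothetical.

```python
from itertools import combinations

ALLELES = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length allele strings differ."""
    if len(a) != len(b):
        raise ValueError("allele strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


distances = {
    (p, q): snp_distance(ALLELES[p], ALLELES[q])
    for p, q in combinations(ALLELES, 2)
}
for (p, q), d in distances.items():
    print(p, q, d)
# bug1 bug2 3
# bug1 bug3 2
# bug2 bug3 4
```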
Alignment ⇢Tree
   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2
--- 1 SNP
The N±1 problem
Aligning to reference
∷ Why is whole genome alignment not used?: involves genome (mis)assembly: computationally difficult: expensive to add or remove isolates
∷ Short-cut: choose a single reference: align each isolate's reads to the reference: core, by definition, must include the reference
Read mapping considerations
∷ Choice of reference
∷ Too divergent?: reads may not align well: will get too many core genome SNPs
∷ One solution: Assemble one isolate and use as the reference
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core1 |||   ||||||||||| ||||||||||
SNPs1        |      |      ||  |
Remove taxon, different core (1)
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core2 | |   |||||||||       ||||||
SNPs2        |   |  |       |  |
Remove taxon, different core (2)
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core3 | |||||||||||||       ||||||
SNPs3            |             |
Remove taxon, different core (3)
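The three "remove taxon" slides can be reproduced by recomputing the core for each pair of genomes. A minimal sketch on the same toy alignment follows; the function name `core_and_snp_counts` is hypothetical. As in the SNPs rows of the slides, ambiguous symbols such as N and R still count as differences here.

```python
from itertools import combinations

ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_and_snp_counts(aln: dict[str, str]) -> tuple[int, int]:
    """Return (number of gap-free columns, number of those that vary)."""
    core = [col for col in zip(*aln.values()) if "-" not in col]
    snps = [col for col in core if len(set(col)) > 1]
    return len(core), len(snps)


print("all three", core_and_snp_counts(ALIGNMENT))  # (17, 5)
for kept in combinations(ALIGNMENT, 2):
    subset = {name: ALIGNMENT[name] for name in kept}
    print(kept, core_and_snp_counts(subset))
# ('bug1', 'bug2') (24, 5)
# ('bug1', 'bug3') (20, 2)
# ('bug2', 'bug3') (17, 5)
```

Dropping bug3 grows the core from 17 to 24 sites, while dropping bug2 leaves only 2 SNPs, so both the core and the SNP set depend on which taxa are included.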
Core genome alignments
∷ Core SNP alignments: can shift dramatically with taxa content: we are only using globally conserved sites: remember variation still exists outside “core”
∷ Snippy will keep the full alignments: quickly derive subsets on the fly: adding isolates can be done quickly too
Conclusion
Snippy summary
∷ The good: Fast, scales to 100 cores: Simple, clean interface and output
∷ The bad: Doesn't yet report full variant consequences using snpEff
∷ The ugly?: Written in Perl
Contact
∷ tseemann.github.io
∷ github.com/tseemann/snippy
∷ @torstenseemann