Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Snippy

Torsten Seemann

Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015

Rapid bacterial variant calling & core genome alignments

Background

(Far) south east England

Phyloflagomics

∷ UK / Birmingham

∷ Australia / Victoria

∷ Canada / British Columbia

A new home

Centre for Applied Microbial Genomics

Microbiological Diagnostic Unit

∷ Oldest public health lab in Australia: established 1897 in Melbourne: large historical isolate collection back to 1950s

∷ National reference laboratory: Salmonella, Listeria, EHEC

∷ WHO regional reference lab: vaccine preventable invasive bacterial pathogens

New director

∷ Professor Ben Howden: clinician, microbiologist, pathologist: early adopter of genomics and bioinformatics: long term collaborator on MRSA/VRE w/ Tim Stinear

∷ Mandate: modernise service delivery: enhance research output and collaboration: nationally lead the conversion to WGS

Hardware

∷ Sequencers: NextSeq 500: 3 x MiSeq: PacBio RS II (arriving 22 May)

∷ Robots: Perkin Elmer (does not have a Twitter account): Colony picker

∷ Compute: 240 TB, 10 GigE, 3 x 72 core boxes

Variant calling

Variant calling

∷ Find DNA differences between genomes: variants to explain phenotype: validate your complemented mutant

∷ Two approaches: reference based (read alignment): reference-free (de novo assembly / k-mer based)

Types of variants

∷ Substitutions: single nucleotide polymorphism (snp) A➝C: multiple nucleotide polymorphism (mnp) AG➝TC

∷ Indels: insertion (ins) A➝AC : deletion (del) ACCG➝AG

∷ Complex: compound events AC➝T
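The five variant types above can be told apart from the REF and ALT strings alone. The sketch below is one way to do it, assuming that shared leading and trailing bases are trimmed first; the function name `classify_variant` and the trimming rule are illustrative assumptions, not how Snippy or freebayes assign types.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a REF/ALT pair as snp, mnp, ins, del or complex (illustrative rule)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical")

    # Trim the bases shared at the start of both alleles.
    i = 0
    while i < min(len(ref), len(alt)) and ref[i] == alt[i]:
        i += 1
    ref, alt = ref[i:], alt[i:]

    # Trim the bases shared at the end of what is left.
    j = 0
    while j < min(len(ref), len(alt)) and ref[-1 - j] == alt[-1 - j]:
        j += 1
    if j:
        ref, alt = ref[:-j], alt[:-j]

    if not ref:
        return "ins"      # only extra bases remain in ALT
    if not alt:
        return "del"      # only lost bases remain in REF
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "snp"
        if all(r != a for r, a in zip(ref, alt)):
            return "mnp"  # every remaining position differs
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(ref, alt, classify_variant(ref, alt))
```

The loop runs the slide's own examples through the function in the order they are listed.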

My solution

Snippy

∷ Fast → snappy

∷ Finds variants → SNPs

∷ Australian → Skippy the bush kangaroo

Input

∷ FASTQ files: paired end, interleaved, or single-end

∷ Reference: FASTA or Genbank

∷ Output folder: self contained bundle of results

Inside the black box

∷ bwa mem - no clipping needed

∷ samtools - sorted, filtered BAM

∷ freebayes - split / GNU parallel / merge

∷ vcflib/vcftools - VCF filtering

∷ perl - glue
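The "split / GNU parallel / merge" step implies cutting the reference into regions that can be called independently. A minimal sketch of the splitting part follows; the function name `split_regions`, the chunk size, and the 0-based half-open coordinates are assumptions for illustration and are not taken from Snippy itself.

```python
def split_regions(contig_lengths: dict[str, int], chunk_size: int) -> list[str]:
    """Cut each contig into windows written as 'name:start-end'.

    Coordinates are 0-based and half-open here (an assumption; check the
    convention expected by whichever variant caller receives them).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical contig names and lengths, purely for demonstration.
regions = split_regions({"chr": 2500, "plas": 800}, chunk_size=1000)
print(regions)
```

Each region string could then be handed to one worker process, with the per-region results concatenated in order afterwards.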

Outputs

∷ Read alignments: .bam / .bai

∷ Variants: .vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html

∷ Consensus: reference with all variants applied to it

∷ Genome alignment: reference with “-” (missing) and “N” low depth
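The consensus output is described as the reference with all variants applied to it. A rough sketch of that idea, assuming 1-based positions and non-overlapping variants; `apply_variants` is a hypothetical helper, not Snippy's implementation.

```python
def apply_variants(reference: str, variants: list[tuple[int, str, str]]) -> str:
    """Return the reference with each (pos, ref, alt) applied; pos is 1-based."""
    seq = reference
    # Work right-to-left so earlier coordinates stay valid after indels.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match the reference at {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# Toy reference with one substitution (A to G at 2) and one deletion (TA to T at 4).
consensus = apply_variants("GATTACA", [(2, "A", "G"), (4, "TA", "T")])
print(consensus)
```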

TAB output

CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 CT:0      CDS    +                       ECO_p012   rep   hypothetical protein
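The EVIDENCE column packs per-allele read counts into one field. The sketch below reads two rows and computes the fraction of reads supporting ALT. It assumes the real file is tab-separated with a header row and that EVIDENCE holds space-separated `allele:count` pairs, as the slide suggests; `parse_evidence` is a hypothetical helper.

```python
import csv
import io


def parse_evidence(field: str) -> dict[str, int]:
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in field.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


# Two rows copied from the slide, reduced to the first six columns.
example = (
    "CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\n"
    "chr\t5958\tsnp\tA\tG\tG:44 A:0\n"
    "chr\t35524\tsnp\tG\tT\tT:73 G:1 C:1\n"
)

for row in csv.DictReader(io.StringIO(example), delimiter="\t"):
    evidence = parse_evidence(row["EVIDENCE"])
    alt_fraction = evidence[row["ALT"]] / sum(evidence.values())
    print(row["CHROM"], row["POS"], row["TYPE"], f"{alt_fraction:.2f}")
```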

Phylogenomics

Phylogenetics 101

∷ Choose some genes

∷ Sequence each gene from each isolate

∷ Align the protein sequences of each gene

∷ Back-align to nucleotide space

∷ Concatenate all the alignments

∷ Construct a distance matrix (many ways)

∷ Draw a tree (many ways)

∷ Make wild inferences from little data
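The "back-align to nucleotide space" step can be sketched as threading each gene's coding sequence through its gapped protein alignment, three bases per residue. The helper `back_align` is hypothetical and assumes the CDS has no stop codon and matches the protein exactly.

```python
def back_align(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence through a gapped protein alignment."""
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * residues:
        raise ValueError("CDS length must be 3 x the number of residues")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    # Each protein gap becomes a three-base gap; each residue takes the next codon.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


print(back_align("M-KL", "ATGAAACTG"))
```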

Phylogenomics 101

∷ Assemble each genome

∷ Perform whole genome alignment : in nucleotide space, as don’t know what is coding: very computationally expensive: can’t parallelize as with individual genes

∷ Continue as for phylogenetics

Whole genome alignment

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC

∷ Ideally, feed this directly to a tree builder

∷ Properly model gaps, codons and ambiguity

∷ Hard!

Core genome SNPs

Core genome

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||

Core sites are present in all genomes.
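The "core" row can be recomputed from the three aligned sequences: a column is core when no genome has a gap there. A small sketch using the slide's alignment; `core_mask` is a hypothetical name, and treating only "-" as missing is an assumption that matches the slide, where the "N" column still counts as core.

```python
alignment = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_mask(aln: dict[str, str], gap: str = "-") -> str:
    """Return '|' for columns present in every sequence and ' ' otherwise."""
    seqs = list(aln.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    return "".join(" " if gap in column else "|" for column in zip(*seqs))


mask = core_mask(alignment)
for name, seq in alignment.items():
    print(f"{name:<6}{seq}")
print(f"{'core':<6}{mask}")
print(mask.count("|"), "core sites")
```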

Core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |

Core SNPs = polymorphic sites in core genome

Unambiguous core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs’        |   |  |          |

Allele sites

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
SNPs’        |   |  |          |
             a   t  a          a
             t   t  t          t
             a   c  a          g
             1   2  3          4
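The reduction from whole genome alignment to allele sites can be expressed as three column filters: drop columns with gaps, drop invariant columns, drop columns with ambiguous bases. A sketch under those assumptions; `allele_sites` is a hypothetical helper and assumes upper-case input.

```python
alignment = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def allele_sites(aln: dict[str, str], gap: str = "-", bases: str = "ACGT") -> dict[str, str]:
    """Keep only unambiguous, polymorphic core columns for each sequence."""
    names = list(aln)
    kept = []
    for column in zip(*aln.values()):
        if gap in column:
            continue  # not core: at least one genome lacks this site
        if len(set(column)) == 1:
            continue  # core but invariant
        if any(base not in bases for base in column):
            continue  # ambiguous call such as N or R
        kept.append(column)
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


for name, seq in allele_sites(alignment).items():
    print(f">{name}")
    print(seq)
```

With the slide's alignment this leaves four sites per genome, written out in FASTA form.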

Alignment ⇢ Tree

>bug1
ATAA
>bug2
TTTT
>bug3
ACAG

   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2

--- 1 SNP
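One of the "many ways" to build a distance matrix from the allele sites is to count differing positions between each pair. This is a sketch of that simplest option only; it is not claimed to be the method behind the tree drawn on the slide, and `snp_distances` is a hypothetical name.

```python
from itertools import combinations

sites = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distances(aln: dict[str, str]) -> dict[tuple[str, str], int]:
    """Count mismatching positions for every pair of equal-length sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(aln[a], aln[b]))
        for a, b in combinations(aln, 2)
    }


for (a, b), distance in snp_distances(sites).items():
    print(a, b, distance)
```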

The N±1 problem

Aligning to reference

∷ Why is whole genome alignment not used?: involves genome (mis)assembly: computationally difficult: expensive to add or remove isolates

∷ Short-cut: choose a single reference: align each isolate’s reads to the reference: core, by definition, must include the reference

Read mapping considerations

∷ Choice of reference

∷ Too divergent?: reads may not align well: will get too many core genome SNPs

∷ One solution: Assemble one isolate and use as the reference

Remove taxon, different core (1)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core1 |||   ||||||||||| ||||||||||
SNPs1        |      |      ||  |

∷ bug3 removed

Remove taxon, different core (2)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core2 | |   |||||||||       ||||||
SNPs2        |   |  |       |  |

∷ bug1 removed

Remove taxon, different core (3)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core3 | |||||||||||||       ||||||
SNPs3            |             |

∷ bug2 removed


Core genome alignments

∷ Core SNP alignments: can shift dramatically with taxa content: we are only using globally conserved sites: remember variation still exists outside “core”

∷ Snippy will keep the full alignments: quickly derive subsets on the fly: adding isolates can be done quickly too
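Keeping the full alignment means the core can be recomputed for any subset of isolates. The sketch below does that for the three-genome example and shows how the core size and SNP count shift with taxa content. `core_summary` is a hypothetical helper, not Snippy's code, and any column with more than one distinct character is counted as a SNP, ambiguity codes included.

```python
alignment = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_summary(aln: dict[str, str], taxa: list[str], gap: str = "-") -> tuple[int, int]:
    """Return (core sites, polymorphic core sites) for the chosen taxa only."""
    columns = zip(*(aln[name] for name in taxa))
    core = [col for col in columns if gap not in col]
    snps = [col for col in core if len(set(col)) > 1]
    return len(core), len(snps)


subsets = [
    ["bug1", "bug2", "bug3"],
    ["bug1", "bug2"],
    ["bug2", "bug3"],
    ["bug1", "bug3"],
]
for taxa in subsets:
    core_size, snp_count = core_summary(alignment, taxa)
    print(", ".join(taxa), "core:", core_size, "SNPs:", snp_count)
```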

Conclusion

Snippy summary

∷ The good: Fast, scales to 100 cores: Simple, clean interface and output

∷ The bad: Doesn’t yet annotate full variant consequences with snpEff

∷ The ugly?: Written in Perl

Contact

∷ tseemann.github.io

∷ github.com/tseemann/snippy

∷ @torstenseemann

