Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Lyve-SET: an hqSNP pipeline for outbreak investigations
Lee Katz, Ph.D.
Bioinformatics scientist
Enteric Diseases Laboratory Branch (EDLB)
Centers for Disease Control and Prevention (CDC)
Sequencing, Finishing, and Analysis of the Future 2015
May 28, 2015
National Center for Emerging and Zoonotic Infectious Diseases
Division of Foodborne, Waterborne, and Environmental Diseases
Enteric Diseases Laboratory Branch (EDLB) at CDC • We study bacterial foodborne pathogens:
Listeria monocytogenes, E. coli, Salmonella, C. botulinum,…
• Perform routine surveillance
PFGE photo taken from http://www.cdc.gov/pulsenet/about/faq.html
Phylogeny taken from Jackson et al 2015, “Notes from the Field: Listeriosis Associated with Stone Fruit — United States, 2014” MMWR
Traditionally, tracked by PFGE (and other methods, e.g., 7-gene MLST)
But now, transitioning to whole-genome methods for routine surveillance
We need high-quality SNPs!
• High-quality SNPs (hqSNPs) can give a fine-resolution view of a cluster of genomes
• Useful for outbreak investigations
• Therefore, we created Lyve-SET for hqSNP-based phylogenies
Phylogenetically
related…?
Lyve-SET Lyve – Listeria, Yersinia, Vibrio, and Enterobacteriaceae reference lab
SET – Snp Extraction Tool
• Some details on Lyve-SET:
– For LINUX
– Extensive documentation
– Help options are embedded in each script
– --fast option, takes ¼ the normal time
– Easy to use
– Modular
• Used in labs at CDC: Foodborne Diseases Laboratory Branch, Clinical and Environmental Microbiology Branch, Respiratory Diseases Branch
• Listeria monocytogenes outbreak investigations since summer 2013
https://github.com/lskatz/lyve-SET
HIGH-QUALITY SNPS: ASSUMPTIONS IN OUTBREAK INVESTIGATIONS
1. Evolution approximates epidemiology
2. SNPs correlate well with overall evolutionary change
A survey of SNP vs recombination rates in bacteria: Vos M, Didelot X. 2008. A comparison of homologous recombination rates in bacteria and archaea. ISME J
3:199-208.
An animation of Lyve-SET (hqSNPs) 0. Pre-processing
a) phage discovery/masking
b) Manual identification of troublesome regions
c) Read cleaning (Poster – Wagner et al)
1. Mapping - SMALT a) 95% read identity
b) Unambiguous mapping
2. SNP calling - VarScan a) 75% consensus
b) 10x depth
3. Phylogeny inferring – RAxML v8 a) Removal of clustered SNPs
b) Ascertainment bias model
c) Maximum likelihood https://github.com/lskatz/lyve-SET/blob/master/docs/FAQ.md https://www.sanger.ac.uk/resources/software/smalt Koboldt, D., Zhang, Q., Larson, D., Shen, D., McLellan, M., Lin, L., Miller, C., Mardis, E., Ding, L., & Wilson, R. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing Genome
Research DOI: 10.1101/gr.129684.111 Stamatakis, A. (2014) RAxML version8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. doi:10.1093/bioinformatics/btu033
Reference genome
phage
2014C-
2014C-3
2014C-3
2014C-
2014C-
2014C-3
2014C-3
2014C-3
2014C-3
100
100
100
100
0.001
Manual identification Genome 1 SNP profile
Genome 2 SNP profile
Genome 4 SNP profile
Genome 3 SNP profile
Phylogeny
Comparing against other well regarded tools
• wgMLST o Applied Maths: http://www.applied-maths.com/applications/wgmlst
o International Listeria wgMLST Schema Development Consortium
• SNVPhyl o Petkau A, Keddy A, Slusky L, Mabon P, Bristow F, Matthews T, Adam J, Carriço JA, Katz LS, Reimer A, Knox N, Courtot M,
Graham M, Hsiao W, Brinkman F, Beiko RG, Van Domselaar G. Outbreak investigation with IRIDA’s SNVPhyl pipeline and GenGIS. Poster presented at: The 7th Meeting of the Global Microbial Identifier; September 11-12, 2014; York, UK
o https://github.com/apetkau/core-phylogenomics
• Snp-Pipeline v3.3 o Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E An evaluation of alternative
methods for constructing phylogenies from whole genome sequence data: A case study with Salmonella.
o http://snp-pipeline.readthedocs.org/en/latest
• kSNP2 o Gardner, S.N. and Hall, B.G. 2013. When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP
discovery and phylogenetics of hundreds of microbial genomes. PLoS ONE, 8(12):e81760.doi:10.1371/journal.pone.0081760
• Wombac v2.1 o https://github.com/tseemann/wombac
Quick comparison with well-regarded tools Lyve-SET --fast wgMLST SNVPhyl Snp-Pipeline kSNP Wombac
Phage masking x x
Manual masking x x
Read cleaning x x x x
HPC support SGE SGE SGE SGE,
Torque, etc
SGE, Torque
Customizable
thresholds x x x x x x
Considers
clustered SNPs x x x x
Finished product ML tree ML tree UPGMA,
alleles
ML tree SNP matrix ML, NJ
tree
ML tree
Approach Ref-
mapping
Ref-mapping MLST-
mapping
Ref-
mapping
Ref-mapping Asm-
free
Ref-
mapping
O/S 🐧 🐧 +🐧 🍎🐧 🐧 🐧 🐧
Availability Open Open © Open Open Open Open
Stone Fruit outbreak/Listeria • Summer 2014
• Contaminated stone fruit – peaches, nectarines, etc
• Two confirmed clinical cases, two related but sporadic cases, many environmental isolates
• Very good epidemiology; well characterized
http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6410a6.htm
http://mvd.dor.ga.gov/motor/plates/PlateSelection.aspx (Standard/Prestige link, downloaded May 20, 2015)
https://github.com/lskatz/lyve-SET/tree/master/testdata/listeria_monocytogenes_clade1 (listing of the dataset)
MOD1 LS994
MOD1 LS992
MOD1 LS985
MOD1 LS989
FDA00008143
MOD1 LS997
MOD1 LS996
MOD1 LS993
MOD1 LS1003
MOD1 LS1004
MOD1 LS1009
MOD1 LS998
MOD1 LS1010
MOD1 LS1011
MOD1 LS991
MOD1 LS1006
CFSAN023464
CFSAN023465
CFSAN023470
CFSAN023471
CFSAN023463
CFSAN023458
CFSAN023462
MOD1 LS982
CFSAN023469
PNUSAL000870
MOD1 LS980
FDA00008144
MOD1 LS988
CFSAN023468
MOD1 LS1008
MOD1 LS984
PNUSAL000730
PNUSAL000957
PNUSAL001024
MOD1 LS995
MOD1 LS999
MOD1 LS990
MOD1 LS978
MOD1 LS1002
CFSAN023467
MOD1 LS979
MOD1 LS1000
CFSAN023466
MOD1 LS1005
50
100
100
20
100
0.01
Comparison of the Listeria monocytogenes stone fruit outbreak trees.
• Both outbreak genomes cluster with the correct clades with 100% in all trees
• Most trees have almost the exact same topology with high confidence values for outbreak clades
MOD1 LS1004
MOD1 LS994
MOD1 LS993
MOD1 LS1010
MOD1 LS996
MOD1 LS998
MOD1 LS991
FDA00008143
MOD1 LS1006
MOD1 LS989
MOD1 LS992
MOD1 LS1003
MOD1 LS1009
MOD1 LS985
CFSAN023464
CFSAN023465
CFSAN023471
CFSAN023470
MOD1 LS1011
MOD1 LS997
MOD1 LS982
CFSAN023463.fasta.wgsim
CFSAN023463
CFSAN023462
CFSAN023458
MOD1 LS980
PNUSAL000870
MOD1 LS988
MOD1 LS1008
FDA00008144
CFSAN023468
MOD1 LS984
CFSAN023469
PNUSAL001024
MOD1 LS1005
MOD1 LS978
MOD1 LS995
MOD1 LS990
MOD1 LS999
CFSAN023467
MOD1 LS1002
CFSAN023466
MOD1 LS1000
MOD1 LS979
PNUSAL000957
PNUSAL000730
75
100
100
98
85
100
100
86
69
0.001
Lyve-SET kSNP
MOD1 LS992.fasta
MOD1 LS993.fasta
MOD1 LS1003.fasta
MOD1 LS991.fasta
MOD1 LS998.fasta
MOD1 LS1011.fasta
MOD1 LS997.fasta
MOD1 LS1010.fasta
MOD1 LS994.fasta
MOD1 LS1004.fasta
MOD1 LS1006.fasta
FDA00008143.fasta
MOD1 LS985.fasta
CFSAN023464.fasta
CFSAN023465.fasta
CFSAN023470.fasta
CFSAN023471.fasta
MOD1 LS1009.fasta
MOD1 LS989.fasta
MOD1 LS996.fasta
CFSAN023458.fasta
CFSAN023468.fasta
CFSAN023463.fasta
MOD1 LS988.fasta
MOD1 LS984.fasta
MOD1 LS982.fasta
FDA00008144.fasta
CFSAN023469.fasta
PNUSAL000870.fasta
MOD1 LS1008.fasta
CFSAN023462.fasta
MOD1 LS980.fasta
CFSAN023466.fasta
CFSAN023467.fasta
MOD1 LS1005.fasta
MOD1 LS978.fasta
MOD1 LS990.fasta
MOD1 LS979.fasta
MOD1 LS1000.fasta
MOD1 LS995.fasta
MOD1 LS999.fasta
MOD1 LS1002.fasta
PNUSAL001024.fasta
PNUSAL000957.fasta
PNUSAL000730.fasta
0.02
wgMLST
MOD1 LS1010
MOD1 LS993
MOD1 LS989
MOD1 LS1003
MOD1 LS996
MOD1 LS1009
MOD1 LS994
MOD1 LS1006
MOD1 LS997
MOD1 LS1011
MOD1 LS998
MOD1 LS992
MOD1 LS1004
MOD1 LS991
CFSAN023470
CFSAN023471
MOD1 LS985
CFSAN023464
CFSAN023465
FDA00008143
PNUSAL000870
FDA00008144
CFSAN023462
MOD1 LS988
MOD1 LS1008
CFSAN023458
MOD1 LS984
MOD1 LS980
CFSAN023469
CFSAN023468
MOD1 LS982
CFSAN023463
MOD1 LS978
CFSAN023467
CFSAN023466
MOD1 LS1002
MOD1 LS999
MOD1 LS1000
MOD1 LS990
MOD1 LS1005
PNUSAL001024
MOD1 LS995
MOD1 LS979
PNUSAL000957
PNUSAL000730
91
6899
83
100
100
100
0.001
Snp-Pipeline
MOD1 LS996
MOD1 LS1006
MOD1 LS1009
MOD1 LS1003
MOD1 LS1011
MOD1 LS1004
MOD1 LS1010
MOD1 LS997
MOD1 LS998
MOD1 LS993
MOD1 LS992
MOD1 LS985
CFSAN023465
CFSAN023464
CFSAN023470
CFSAN023471
MOD1 LS994
MOD1 LS989
MOD1 LS991
FDA00008143
MOD1 LS1008
MOD1 LS982
MOD1 LS988
MOD1 LS980
MOD1 LS984
FDA00008144
CFSAN023469
CFSAN023463
PNUSAL000870
CFSAN023462
CFSAN023468
CFSAN023458
MOD1 LS979
CFSAN023466
CFSAN023467
MOD1 LS978
MOD1 LS990
PNUSAL001024
MOD1 LS999
MOD1 LS1002
MOD1 LS995
MOD1 LS1000
MOD1 LS1005
PNUSAL000730
PNUSAL000957
93
69
15
77
66
86
93
63
59
75
75
95
88
92
27
75
76
88
80
53
96
90
74
76
90
86
56
76
83
87
90
87
87
73
75
98
97
100
0
91
0.05
MOD1 LS1003
MOD1 LS1004
MOD1 LS1006
MOD1 LS1009
MOD1 LS1010
MOD1 LS1011
MOD1 LS989
MOD1 LS992
MOD1 LS993
MOD1 LS994
MOD1 LS996
FDA00008143
MOD1 LS985
CFSAN023471
CFSAN023470
CFSAN023464
CFSAN023465
MOD1 LS998
MOD1 LS997
MOD1 LS991
CFSAN023469
Reference
PNUSAL000870
MOD1 LS988
MOD1 LS984
MOD1 LS1008
FDA00008144
CFSAN023468
CFSAN023463
CFSAN023458
CFSAN023462
MOD1 LS982
MOD1 LS980
MOD1 LS1002
MOD1 LS978
CFSAN023467
PNUSAL001024
MOD1 LS999
MOD1 LS995
MOD1 LS990
MOD1 LS979
MOD1 LS1005
CFSAN023466
MOD1 LS1000
PNUSAL000957
PNUSAL000730
0.02
Wombac
SNVPhyl --fast
Sprouts/E. coli • 2014
• 19 cases
• Raw clover sprouts
• Very good epidemiology; well characterized
http://www.cdc.gov/ecoli/2014/O121-05-14/index.html
https://github.com/lskatz/lyve-SET/tree/master/testdata/escherichia_coli (listing of the dataset)
Comparison of the E. coli sprouts outbreak trees.
• Both outbreak genomes cluster with the correct clade with 100% in all trees
• Most trees have almost the exact same topology with high confidence values for outbreak clades
Lyve-SET kSNP
Snp-Pipeline Wombac
SNVPhyl
--fast
2014C-3600.fastq.gz-reference
2014C-3599.fastq.gz-reference
2014C-3598.fastq.gz-reference
2014C-3857.fastq.gz-reference
2014C-3850.fastq.gz-reference
2014C-3907.fastq.gz-reference
2014C-3840.fastq.gz-reference
2014C-3655.fastq.gz-reference
2014C-3656.fastq.gz-reference70
70
69
100
0.02
2014C-3599.fasta
2014C-3600.fasta
2014C-3598.fasta
2014C-3850.fasta
2014C-3857.fasta
2014C-3907.fasta
2014C-3655.fasta
2014C-3656.fasta
2014C-3840.fasta
100
68
17
100
99
13
0.05
2014C-3598
2014C-3600
2014C-3599
2014C-3850
2014C-3857
2014C-3907
2014C-3655
2014C-3656
2014C-3840
100
100
100
100
0.001
2014C-3598
2014C-3600
2014C-3599
2014C-3840
2011C-3609
2014C-3907
2014C-3850
2014C-3857
2014C-3655
2014C-3656
99
100
99
98
99
0.1
2014C-3598
2014C-3600
2014C-3599
Reference
2014C-3850
2014C-3857
2014C-3907
2014C-3655
2014C-3656
2014C-3840
100
69
0
100
95
99
49
0.05
2014C-3599
2014C-3600
2014C-3598
2011C-3609.fasta.wgsim
2014C-3850
2014C-3857
2014C-3840
2014C-3656
2014C-3655
2014C-3907
100
100
100
100
78
0.002
Other advantages of Lyve-SET
• Closely developed alongside outbreak investigations
• Modular – UNIX philosophy: each script does one thing and does it well.
Can switch in and out new scripts as desired
• Integration with CG-Pipeline (downsampling, read-cleaning, read-metrics, etc)
• Easy-to-understand documentation
• Easy to install
• Users mailing list
• Actively maintained
Thank you to the WGS standards and analysis working group for their contributions on the standardized datasets
https://github.com/lskatz/CG-Pipeline
Future work
• Avoiding SNP noise • Cliff detection
• Soft-clipping of reads
• Annotation of SNPs
• Validation of SNPs with WGS standards/analysis working group
• Members from FDA, USDA,
NCBI, and CDC
• Manually validate less-confident
SNP calls
• Create gold standard datasets
All proposed improvements: https://github.com/lskatz/lyve-SET/issues
http://www.taverna.org.uk
https://github.com/ssadedin/bpipe
depth
“cliff”
T! A!
Conclusions
• Aids in epidemiological investigations
• Gives concordant results
• Epidemiologically focused
https://github.com/lskatz/lyve-SET
https://groups.google.com/forum/#!forum/lyve-set
Thank you!
• WGS standards working
group
• Many others in
EDLB/CDC
• Bioinformatics core at
PHAC
• My wife
Authors: Lee S. Katz, Darlene D. Wagner, Aaron Petkau, Cameron Sieffert, Heather Carleton, Shaun
Tyler, Gary Van Domselaar