

Page 1

Lyve-SET: an hqSNP pipeline for outbreak investigations

Lee Katz, Ph.D.

Bioinformatics scientist

Enteric Diseases Laboratory Branch (EDLB)

Centers for Disease Control and Prevention (CDC)

Sequencing, Finishing, and Analysis of the Future 2015

May 28, 2015

National Center for Emerging and Zoonotic Infectious Diseases

Division of Foodborne, Waterborne, and Environmental Diseases

Page 2

Enteric Diseases Laboratory Branch (EDLB) at CDC

• We study bacterial foodborne pathogens: Listeria monocytogenes, E. coli, Salmonella, C. botulinum, …

• Perform routine surveillance

PFGE photo taken from http://www.cdc.gov/pulsenet/about/faq.html

Phylogeny taken from Jackson et al 2015, “Notes from the Field: Listeriosis Associated with Stone Fruit — United States, 2014” MMWR

Traditionally, tracked by PFGE (and other methods, e.g., 7-gene MLST)

But now, transitioning to whole-genome methods for routine surveillance

Page 4

Lyve-SET

Lyve – Listeria, Yersinia, Vibrio, and Enterobacteriaceae reference lab

SET – Snp Extraction Tool

• Some details on Lyve-SET:

– For LINUX

– Extensive documentation

– Help options are embedded in each script

– --fast option, takes ¼ the normal time

– Easy to use

– Modular

• Used in labs at CDC: Foodborne Diseases Laboratory Branch, Clinical and Environmental Microbiology Branch, Respiratory Diseases Branch

• Listeria monocytogenes outbreak investigations since summer 2013

https://github.com/lskatz/lyve-SET

Page 5

HIGH-QUALITY SNPS: ASSUMPTIONS IN OUTBREAK INVESTIGATIONS

1. Evolution approximates epidemiology

2. SNPs correlate well with overall evolutionary change

A survey of SNP vs recombination rates in bacteria: Vos M, Didelot X. 2008. A comparison of homologous recombination rates in bacteria and archaea. ISME J 3:199-208.

Page 6

An animation of Lyve-SET (hqSNPs)

0. Pre-processing

a) Phage discovery/masking

b) Manual identification of troublesome regions

c) Read cleaning (Poster – Wagner et al)

1. Mapping – SMALT

a) 95% read identity

b) Unambiguous mapping

2. SNP calling – VarScan

a) 75% consensus

b) 10x depth

3. Phylogeny inferring – RAxML v8

a) Removal of clustered SNPs

b) Ascertainment bias model

c) Maximum likelihood

https://github.com/lskatz/lyve-SET/blob/master/docs/FAQ.md

https://www.sanger.ac.uk/resources/software/smalt

Koboldt, D., Zhang, Q., Larson, D., Shen, D., McLellan, M., Lin, L., Miller, C., Mardis, E., Ding, L., & Wilson, R. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research. DOI: 10.1101/gr.129684.111

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. doi:10.1093/bioinformatics/btu033
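The SNP-calling thresholds in step 2 (75% consensus, 10x depth) can be illustrated with a minimal Python sketch. This is not VarScan's or Lyve-SET's code: the function names `call_site` and `is_hq_snp`, and the idea of passing the piled-up read bases as a plain string, are assumptions made only for illustration.

```python
from collections import Counter

MIN_DEPTH = 10         # "10x depth" from the slide
MIN_CONSENSUS = 0.75   # "75% consensus" from the slide


def call_site(bases, min_depth=MIN_DEPTH, min_consensus=MIN_CONSENSUS):
    """Return the consensus base at one site, or 'N' if it is not high quality.

    `bases` is a hypothetical string holding the read bases that cover one
    reference position, for example "TTTTTTTTTTTA".
    """
    depth = len(bases)
    if depth < min_depth:
        return "N"  # too few reads to trust the call
    base, count = Counter(bases.upper()).most_common(1)[0]
    if count / depth < min_consensus:
        return "N"  # reads disagree too much
    return base


def is_hq_snp(bases, ref_base, **thresholds):
    """True when the site passes both thresholds and differs from the reference."""
    call = call_site(bases, **thresholds)
    return call != "N" and call != ref_base.upper()
```

With these defaults, a site covered by eleven T reads and one A read against a reference A would count as a high-quality SNP, while the same site with only nine reads would be masked as `N`.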

[Figure: animation frames showing a reference genome with phage and manually identified regions, SNP profiles for Genomes 1 to 4, and the resulting phylogeny.]
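Step 3a, removal of clustered SNPs, can be sketched as a simple spacing filter. Clustered SNPs are often treated as a possible sign of recombination or mapping artifacts, so one common approach is to discard any SNP that has a neighbor within a few base pairs. The slides do not state the distance Lyve-SET uses, so `min_spacing` below is a hypothetical default, and `remove_clustered_snps` is an illustrative name rather than a Lyve-SET function.

```python
def remove_clustered_snps(positions, min_spacing=5):
    """Keep only SNP positions with no neighbor closer than `min_spacing` bp.

    `positions` are SNP coordinates on a single contig. `min_spacing` is a
    hypothetical default, not a value taken from the slides.
    """
    ordered = sorted(set(positions))
    kept = []
    for i, pos in enumerate(ordered):
        near_prev = i > 0 and pos - ordered[i - 1] < min_spacing
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos < min_spacing
        if not (near_prev or near_next):
            kept.append(pos)
    return kept
```

Note that every member of a cluster is dropped, not just the later ones, so the filter does not depend on the order in which positions are supplied.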

Page 7

Comparing against other well-regarded tools

• wgMLST

o Applied Maths: http://www.applied-maths.com/applications/wgmlst

o International Listeria wgMLST Schema Development Consortium

• SNVPhyl

o Petkau A, Keddy A, Slusky L, Mabon P, Bristow F, Matthews T, Adam J, Carriço JA, Katz LS, Reimer A, Knox N, Courtot M, Graham M, Hsiao W, Brinkman F, Beiko RG, Van Domselaar G. Outbreak investigation with IRIDA’s SNVPhyl pipeline and GenGIS. Poster presented at: The 7th Meeting of the Global Microbial Identifier; September 11-12, 2014; York, UK

o https://github.com/apetkau/core-phylogenomics

• Snp-Pipeline v3.3

o Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: A case study with Salmonella.

o http://snp-pipeline.readthedocs.org/en/latest

• kSNP2

o Gardner, S.N. and Hall, B.G. 2013. When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes. PLoS ONE, 8(12):e81760. doi:10.1371/journal.pone.0081760

• Wombac v2.1

o https://github.com/tseemann/wombac

Page 8

Quick comparison with well-regarded tools

Tools, in column order: Lyve-SET | Lyve-SET --fast | wgMLST | SNVPhyl | Snp-Pipeline | kSNP | Wombac

(The first six rows mark only some of the tools; the column position of each mark is not preserved in this text.)

Phage masking: x x

Manual masking: x x

Read cleaning: x x x x

HPC support: SGE; SGE; SGE; SGE, Torque, etc; SGE, Torque

Customizable thresholds: x x x x x x

Considers clustered SNPs: x x x x

Finished product: ML tree | ML tree | UPGMA, alleles | ML tree | SNP matrix | ML, NJ tree | ML tree

Approach: Ref-mapping | Ref-mapping | MLST-mapping | Ref-mapping | Ref-mapping | Asm-free | Ref-mapping

O/S: 🐧 | 🐧 | +🐧 | 🍎🐧 | 🐧 | 🐧 | 🐧

Availability: Open | Open | © | Open | Open | Open | Open

Page 9

Stone Fruit outbreak/Listeria

• Summer 2014

• Contaminated stone fruit – peaches, nectarines, etc.

• Two confirmed clinical cases, two related but sporadic cases, many environmental isolates

• Very good epidemiology; well characterized

http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6410a6.htm

http://mvd.dor.ga.gov/motor/plates/PlateSelection.aspx (Standard/Prestige link, downloaded May 20, 2015)

https://github.com/lskatz/lyve-SET/tree/master/testdata/listeria_monocytogenes_clade1 (listing of the dataset)

Page 10

Comparison of the Listeria monocytogenes stone fruit outbreak trees.

• Both outbreak genomes cluster with the correct clades with 100% in all trees

• Most trees have almost the exact same topology with high confidence values for outbreak clades

[Figure: phylogenetic trees of the stone fruit outbreak dataset in panels labeled Lyve-SET, kSNP, wgMLST, Snp-Pipeline, Wombac, SNVPhyl, and --fast; tips carry isolate identifiers (MOD1 LS, CFSAN, FDA, and PNUSAL series) and branches carry confidence values.]
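One way to make "cluster with the correct clade" and "almost the exact same topology" concrete is to compare the clades of two trees. The sketch below is only an illustration under several assumptions: trees are written as nested Python tuples instead of Newick files, both trees are rooted the same way, confidence values are ignored, and the tip names (`out1`, `bg1`, and so on) are hypothetical placeholders, not isolates from the dataset.

```python
def leaves(node):
    """Return the set of tip names under a node (a name or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(tree):
    """Collect the tip set of every internal node of a rooted tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        found.add(leaves(node))
        for child in node:
            walk(child)

    walk(tree)
    return found


def forms_clade(tree, isolates):
    """True when exactly these isolates, and no others, share one internal node."""
    return frozenset(isolates) in clades(tree)


# Two hypothetical trees from two pipelines; the tip names are placeholders.
tree_a = ((("out1", "out2"), "out3"), ("bg1", "bg2"))
tree_b = (("out1", ("out2", "out3")), ("bg1", "bg2"))
outbreak = {"out1", "out2", "out3"}

both_agree = forms_clade(tree_a, outbreak) and forms_clade(tree_b, outbreak)
shared_clades = clades(tree_a) & clades(tree_b)
```

Here the two trees disagree on the branching order inside the outbreak group but agree that the outbreak isolates form a single clade, which is the property that matters most for the epidemiological question.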

Page 11

Sprouts/E. coli

• 2014

• 19 cases

• Raw clover sprouts

• Very good epidemiology; well characterized

http://www.cdc.gov/ecoli/2014/O121-05-14/index.html

https://github.com/lskatz/lyve-SET/tree/master/testdata/escherichia_coli (listing of the dataset)

Page 12

Comparison of the E. coli sprouts outbreak trees.

• Both outbreak genomes cluster with the correct clade with 100% in all trees

• Most trees have almost the exact same topology with high confidence values for outbreak clades

[Figure: phylogenetic trees of the E. coli sprouts outbreak dataset in panels labeled Lyve-SET, kSNP, Snp-Pipeline, Wombac, SNVPhyl, and --fast; tips carry isolate identifiers and branches carry confidence values.]

Page 13

Other advantages of Lyve-SET

• Closely developed alongside outbreak investigations

• Modular – UNIX philosophy: each script does one thing and does it well. Can switch in and out new scripts as desired

• Integration with CG-Pipeline (downsampling, read-cleaning, read-metrics, etc.)

• Easy-to-understand documentation

• Easy to install

• Users mailing list

• Actively maintained

Thank you to the WGS standards and analysis working group for their contributions on the standardized datasets

https://github.com/lskatz/CG-Pipeline

Page 14

Future work

• Avoiding SNP noise

• Cliff detection

• Soft-clipping of reads

• Annotation of SNPs

• Validation of SNPs with WGS standards/analysis working group

• Members from FDA, USDA, NCBI, and CDC

• Manually validate less-confident SNP calls

• Create gold standard datasets

All proposed improvements: https://github.com/lskatz/lyve-SET/issues

http://www.taverna.org.uk

https://github.com/ssadedin/bpipe

[Figure: read-depth plot illustrating a “cliff”.]
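The slides list cliff detection as future work and do not describe an algorithm. As a rough illustration of the idea only, a depth "cliff" can be thought of as a position where read depth changes abruptly between neighboring stretches of the genome. The sketch below is hypothetical throughout: the function name `find_cliffs`, the window size, and the ratio threshold are all assumptions, not Lyve-SET behavior.

```python
def find_cliffs(depths, window=3, min_ratio=4.0):
    """Return indices where mean depth changes sharply between adjacent windows.

    For each index i, the mean depth of the `window` positions before i is
    compared with the mean of the `window` positions starting at i. The index
    is reported when the larger mean is at least `min_ratio` times the smaller
    one. Both parameters are hypothetical defaults chosen for illustration.
    """
    cliffs = []
    for i in range(window, len(depths) - window + 1):
        left = sum(depths[i - window:i]) / window
        right = sum(depths[i:i + window]) / window
        low, high = sorted((left, right))
        # Treat very low depth as 1 so that zero coverage does not divide by zero
        # or make every tiny change look like a cliff.
        if high >= min_ratio * max(low, 1.0):
            cliffs.append(i)
    return cliffs
```

SNP calls near a reported index could then be masked or flagged for manual review, in the same spirit as the manual identification of troublesome regions in the pre-processing step.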

Page 15

Conclusions

• Aids in epidemiological investigations

• Gives concordant results

• Epidemiologically focused

https://github.com/lskatz/lyve-SET

https://groups.google.com/forum/#!forum/lyve-set

Page 16

Thank you!

• WGS standards working group

• Many others in EDLB/CDC

• Bioinformatics core at PHAC

• My wife

[email protected]

Authors: Lee S. Katz, Darlene D. Wagner, Aaron Petkau, Cameron Sieffert, Heather Carleton, Shaun Tyler, Gary Van Domselaar