Introduction to the CGE servers


Center for Genomic Epidemiology

Aim:

• To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates, including spatial-temporal analysis.

• To develop algorithms for rapid analyses of whole genome DNA-sequences, tools for analyses and extraction of information from the sequence data and internet/web-interfaces for using the tools in the global scientific and medical community.

Tools for species identification

Name of service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | SpeciesFinder | Online | Published Feb 2014, PMID: 24574292
KmerFinder | Species identification using overlapping 16mers | KmerFinder | Online | Published Jan 2014, PMID: 24172157
TaxonomyFinder | Taxonomy identification using functional protein domains | TaxonomyFinder | | Published in PMID: 24574292 + Oksana's PhD thesis
Reads2Type | Species identification on client computer | Reads2Type | Online | Published Feb 2014, PMID: 24574292

Benchmarking of Methods for Bacterial Species Identification

PMID: 24574292

Training data: 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)

Evaluation data

NCBI draft genomes
• 695 isolates from species that overlap with training set (151 species)

SRA draft genomes
• 10,407 sets of short reads from Illumina (168 species)
• 10,407 draft genomes from Illumina data (168 species)

16S rRNA

• 16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al, Int. J. Syst. Bacteriol., 1977)

• Tremendous amounts of 16S rRNA sequence data are available in databases

Concerns:
• Low resolution
• Some genomes contain several copies of the 16S rRNA gene with inter-gene variation
• The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome

Reference database
• 16S rRNA genes are isolated from genomes in training data using RNAmmer (Lagesen, NAR, 2007).

Method
• Input genomes are BLASTed against 16S rRNA genes in reference database.
• Best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
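A minimal sketch of how such a best-hit selection could look. How the criteria are actually combined is not stated here, so the ranking order below (coverage, then identity, then bitscore, then fewest mismatches and gaps) is an assumption, and the `Hit` record and species names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment against a reference 16S rRNA gene (hypothetical record)."""
    species: str
    coverage: float   # % of the reference gene covered by the alignment
    identity: float   # % identity of the alignment
    bitscore: float
    mismatches: int
    gaps: int


def best_hit(hits):
    """Return the top-ranked hit, or None if there are no hits.

    Assumed ranking: higher coverage, identity and bitscore are better;
    fewer mismatches and gaps break any remaining ties.
    """
    if not hits:
        return None
    return max(
        hits,
        key=lambda h: (h.coverage, h.identity, h.bitscore, -h.mismatches, -h.gaps),
    )


# Toy hits, invented for illustration.
hits = [
    Hit("Species A", coverage=100.0, identity=99.1, bitscore=2700.0, mismatches=13, gaps=0),
    Hit("Species B", coverage=100.0, identity=99.8, bitscore=2790.0, mismatches=3, gaps=0),
    Hit("Species C", coverage=82.0, identity=100.0, bitscore=2300.0, mismatches=0, gaps=0),
]
print(best_hit(hits).species)
```

A lexicographic tuple keeps the sketch simple; a weighted score would be an equally plausible reading of "a combination".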

CGE implementation of 16S species identification

SpeciesFinder

KmerFinder

• Genomes in the training data are chopped into 16mers:

ATGACGTATGATTGATGACGTAGTAGTCC

• Immune system inspired downsampling
• Only 16mers with a specific prefix are kept

[Figure: MHC-I, 9mer]

Unknown isolate, unique 16mers:

ATGAATGTGTGAGTGA
ATGACTGTGCCCCTGA
ATGAAAAAAAAAAAA

16mer database:

ATGAATGTGTGAGTGA: CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)
ATGACTGTGCCCCTGA: CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)

Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
Acinetobacter baumannii | CP000521 | 1
Acinetobacter baumannii | CP002521 | 1
Buchnera aphidicola | CP002301 | 1
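The lookup illustrated on this slide can be sketched as follows. This is a simplified illustration, not the KmerFinder implementation: the prefix `"ATGA"` is an assumption taken from the example 16mers shown here, the function names are hypothetical, and the toy database copies the two database entries from the slide.

```python
from collections import Counter


def prefixed_kmers(sequence, k=16, prefix="ATGA"):
    """Return the unique overlapping k-mers of `sequence` that start with `prefix`."""
    sequence = sequence.upper()
    return {
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if sequence.startswith(prefix, i)
    }


def count_hits(query_kmers, database):
    """Count, per reference genome, how many query k-mers are found in the database."""
    hits = Counter()
    for kmer in query_kmers:
        for accession in database.get(kmer, ()):
            hits[accession] += 1
    return hits


# 16mer database entries as listed on the slide (16mer -> genomes containing it).
DATABASE = {
    "ATGAATGTGTGAGTGA": ["CP001921", "CP000521", "CP002522"],
    "ATGACTGTGCCCCTGA": ["CP001921", "CP002301"],
}

# Unique k-mers of the unknown isolate, copied from the slide.
query = {"ATGAATGTGTGAGTGA", "ATGACTGTGCCCCTGA", "ATGAAAAAAAAAAAA"}

hits = count_hits(query, DATABASE)
for accession, n_hits in hits.most_common():
    print(accession, n_hits)
```

The genome with the most k-mer hits is reported as the best match; the third query k-mer is absent from the database and contributes nothing.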

KmerFinder is very robust: it only needs one 16mer!

Desulfovibrio piger GOR1, SRR097356

>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCACGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT

N50 = 110
Total no. of bp: 210

Prediction:

Species | Match | No. of Kmer hits
Flavobacterium psychrophilum | AM398681 | 1

TaxonomyFinder

Reads2Type

• Reads2Type pushes the analysis to the user's computer; the server provides the 50-mer database

• SuffixTree: efficient data structure for string matching

• Narrow Down Approach:
– Reads2Type compares 50-mers of combined marker genes against raw reads
– Shared Probes vs Unique Probe

• Definition: Quick & dirty taxonomy identification of single isolates

• 50-mer marker gene DB
– 16S rRNA: Training data genomes, RNAmmer (other)
– ITS: Training data (Mycobacterium)
– GyrB: Training data (Enterobacteriaceae)
– Resulting database ~5 MB
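One way to read the narrow-down idea is sketched below, under stated assumptions: probes shared by several species are checked first to shrink the candidate set, and a unique probe then settles the species. Plain substring search stands in for the suffix tree, the probes are short toy strings rather than real 50-mers, and all names are hypothetical.

```python
def narrow_down(reads, probes):
    """Return the set of species still compatible with the probes found in the reads.

    `probes` maps a probe sequence to the set of species that carry it.
    Shared probes (many species) are tried before unique probes (one species).
    """
    candidates = None
    ordered = sorted(probes.items(), key=lambda item: len(item[1]), reverse=True)
    for probe, species in ordered:
        if not any(probe in read for read in reads):
            continue  # probe not observed in any read
        candidates = set(species) if candidates is None else candidates & set(species)
        if len(candidates) <= 1:
            break  # narrowed down to a single species (or a contradiction)
    return candidates or set()


# Toy probe database and reads, invented for illustration.
probes = {
    "ACGTACGTAC": {"species_A", "species_B"},  # shared probe
    "TTGACCGGTA": {"species_A"},               # unique probe
    "GGCATTCAGA": {"species_B"},               # unique probe
}
reads = ["CCACGTACGTACGG", "AATTGACCGGTATT"]

print(narrow_down(reads, probes))
```

Scanning every read for every probe is fine for a toy example; an index such as a suffix tree is what makes the same search practical on real read sets.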

rMLST

CGE implementation

• For each genome in the training data, the 53 ribosomal genes were extracted.

• Genomes in the evaluation sets were aligned using BLAT to each gene collection (only hits with at least 95% identity and 95% coverage were considered as a potential match).

• The closest match among the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.
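The filtering and selection steps above can be sketched as follows. The 95% thresholds come from the text; how the criteria are combined across genes is not specified, so ranking by number of passing genes and then by summed bitscore is an assumption. The hit records and genome names are toy data.

```python
from collections import defaultdict

MIN_IDENTITY = 95.0  # % identity threshold stated in the text
MIN_COVERAGE = 95.0  # % coverage threshold stated in the text


def closest_genome(hits):
    """Return the training genome with the best support across genes, or None.

    Assumed ranking: most genes passing both thresholds, then highest summed bitscore.
    """
    totals = defaultdict(lambda: [0, 0.0])  # genome -> [passing genes, summed bitscore]
    for hit in hits:
        if hit["identity"] >= MIN_IDENTITY and hit["coverage"] >= MIN_COVERAGE:
            totals[hit["genome"]][0] += 1
            totals[hit["genome"]][1] += hit["bitscore"]
    if not totals:
        return None
    return max(totals, key=lambda genome: tuple(totals[genome]))


# Toy alignment results, invented for illustration.
hits = [
    {"genome": "genome_1", "gene": "rpsA", "identity": 99.0, "coverage": 100.0, "bitscore": 900.0},
    {"genome": "genome_1", "gene": "rplB", "identity": 96.0, "coverage": 97.0, "bitscore": 700.0},
    {"genome": "genome_2", "gene": "rpsA", "identity": 99.5, "coverage": 100.0, "bitscore": 950.0},
    # Fails the identity threshold, so it is not considered a potential match.
    {"genome": "genome_2", "gene": "rplB", "identity": 94.0, "coverage": 100.0, "bitscore": 650.0},
]
print(closest_genome(hits))
```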

Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.

Results

(16S rRNA)

Overlap in predictions

Isolates in the NCBI drafts set for which all four methods predict a species different from the annotated one.
* NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.

Speed

Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**

* Estimation based on draft genomes
** Estimation based on short reads

Summary of taxonomy benchmark study

• KmerFinder had the highest accuracy and was the fastest method.

• SpeciesFinder (16S rRNA-based) had the lowest accuracy.

• Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.

Tools for further typing

Name of service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
MLST | Multilocus sequence typing | MLST | Published Apr 2012, PMID: 22238442
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID: 24777092
pMLST | pMLST of plasmids in Enterobacteriaceae | pMLST | Published Apr 2014, PMID: 24777092

Multilocus Sequence Typing (MLST)

First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)

The nucleotide sequences of internal regions of approximately 7 housekeeping genes are determined by PCR followed by Sanger sequencing

Each distinct allele is assigned an arbitrary number

The unique combination of alleles is the sequence type (ST)
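A minimal sketch of the final lookup step: once each locus has an allele number, the ordered combination is looked up in a profile table to obtain the ST. The locus names, allele numbers and ST numbers below are made up for illustration and do not come from any real MLST scheme.

```python
# Hypothetical 7-locus scheme; real schemes use named housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele combination -> sequence type (ST).
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 2, 1, 1, 1, 1): 2,
    (3, 1, 2, 4, 1, 1, 2): 3,
}


def sequence_type(alleles):
    """Return the ST for a dict of locus -> allele number, or None if the profile is unknown."""
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


isolate = {"locus1": 1, "locus2": 1, "locus3": 2, "locus4": 1,
           "locus5": 1, "locus6": 1, "locus7": 1}
print(sequence_type(isolate))
```

An unknown combination returns `None`, which in practice would correspond to a novel or incomplete profile.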

Using WGS data for MLST

www.cbs.dtu.dk/services/MLST

Input data types:
Assembled genome
454, single end reads
454, paired end reads
Illumina, single end reads
Illumina, paired end reads
Ion Torrent
SOLiD, single end reads
SOLiD, mate pair reads

MLST schemes:
Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales, Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes, Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis

Extended Output


aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
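A small sketch of how an output line like the one above could be parsed and checked. The regular expression is written against this one example line only, and the rule that anything short of a full-length, gap-free, 100% identical match counts as imperfect is an assumption about why the warning is raised, not documented behaviour of the service.

```python
import re

LINE = ("aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, "
        "aro_122 is the best match for aro")

PATTERN = re.compile(r"Identity: ([\d.]+)%, HSP/Length: (\d+)/(\d+), Gaps: (\d+)")


def parse_match(line):
    """Extract identity, HSP length, allele length and gap count from one output line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like an extended-output match line")
    identity, hsp, length, gaps = match.groups()
    return {"identity": float(identity), "hsp": int(hsp),
            "length": int(length), "gaps": int(gaps)}


def is_perfect(fields):
    """Assumed criterion: 100% identity over the full allele length with no gaps."""
    return (fields["identity"] == 100.0
            and fields["hsp"] == fields["length"]
            and fields["gaps"] == 0)


fields = parse_match(LINE)
coverage = 100.0 * fields["hsp"] / fields["length"]
print(is_perfect(fields), round(coverage, 1))
```

In this example the identity is 100% but the alignment spans only 349 of the 498 allele positions, so the match is not full length.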

What is the MLST web-service used for?

PlasmidFinder and pMLST

The PlasmidFinder database contains replicons, not entire plasmids.

Tools for phenotyping

Name of service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | ResFinder | Published Nov 2012, PMID: 22782487
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID: 24574290
MyDbFinder | Identification of genes from the user's own database | MyDbFinder | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID: 24204795

ResFinder

[Workflow figure. Input: NGS reads (Illumina, Ion Torrent, 454, ...) via an assembly pipeline, or Sanger / FASTA sequences. Analysis: ResFinder (BLAST). Output: resistance gene profile (list of genes, accession numbers, theoretical resistance phenotype).]

200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)

ResFinder settings: 98% identity, 60% length coverage

Phenotypic tests, 3,051 in total
• 482 Resistant
• 2,569 Susceptible

=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests

23 discrepancies -> 16, typically in relation to spectinomycin in E. coli
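The comparison described on this slide can be sketched in two steps: keep only gene hits that pass the identity and length-coverage thresholds, then compute the percentage of tests where the predicted and phenotypic calls agree. The thresholds are the ones quoted above; the gene names, hit records and R/S calls are toy values, and this is not the ResFinder implementation.

```python
MIN_IDENTITY = 98.0  # % identity threshold quoted on the slide
MIN_COVERAGE = 60.0  # % length coverage threshold quoted on the slide


def resistance_genes(hits):
    """Return the sorted names of genes whose hits pass both thresholds."""
    return sorted({
        hit["gene"]
        for hit in hits
        if hit["identity"] >= MIN_IDENTITY and hit["coverage"] >= MIN_COVERAGE
    })


def agreement(predicted, observed):
    """Percentage of tests where the predicted call equals the phenotypic call."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty call lists of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy hits with hypothetical gene names.
hits = [
    {"gene": "geneA", "identity": 99.7, "coverage": 100.0},
    {"gene": "geneB", "identity": 97.5, "coverage": 100.0},  # identity too low
    {"gene": "geneC", "identity": 100.0, "coverage": 45.0},  # coverage too low
]
print(resistance_genes(hits))

# Toy calls: "R" = resistant, "S" = susceptible.
print(agreement(["R", "S", "S", "R"], ["R", "S", "R", "R"]))
```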

Alternatives to ResFinder

Unpublished or uncategorized

Name of service | Description | URL (https://cge.cbs.dtu.dk/services/) | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | PanFunPro | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | Online | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | Restriction-ModificationFinder | Online | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | HostPhinder | Online, but under development | Not yet published
MetaVirFinder | Identification of virus in metagenomic data | MetaVirFinder | Online, but under development | Not yet published
MGmapper | Identifies the content of metagenomic samples | MGmapper | Online, but under development | Not yet published

Tools for phylogeny

Name of service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID: 23281601
CSIPhylogeny | Creation of phylogenetic trees based on SNPs | CSIPhylogeny | Online | Planned
NDtree | Creation of phylogenetic trees | NDtree | Online | Published in Feb 2014, PMID: 24505344

Web-service usage

Type of data uploaded to MLST web-service

454, single reads
454, paired-end
Ion Torrent
Illumina, single reads
Illumina, paired-end reads
Assembled draft genomes