So I have sequenced my organism … what do I do now? Nick Loman

ECCMID 2015 - So I have sequenced my genome ... what now?

Page 1: ECCMID 2015 - So I have sequenced my genome ... what now?

So I have sequenced my organism … what do I do now?

Nick Loman

Page 2: ECCMID 2015 - So I have sequenced my genome ... what now?
Page 3: ECCMID 2015 - So I have sequenced my genome ... what now?

Oh dear

Page 4: ECCMID 2015 - So I have sequenced my genome ... what now?

Sequence some more

Page 5: ECCMID 2015 - So I have sequenced my genome ... what now?

Sensible

Page 6: ECCMID 2015 - So I have sequenced my genome ... what now?

Useful things

Page 7: ECCMID 2015 - So I have sequenced my genome ... what now?

Whole-genome sequencing: utility in clinical microbiology

• Diagnostics

– Species, subspecies, strain identification

– In silico antibiogram

– In silico virulence profile

• Surveillance

– Typing (including backwards compatibility with MLST and serotype)

– What strains and resistance elements are lurking in my hospital/community?

• Forensic epidemiology

– Is there an outbreak?

– Who gave what to who?

Page 8: ECCMID 2015 - So I have sequenced my genome ... what now?

Common types of sequencing

• Paired-end Illumina (typically 150 – 300 bases)

• Single-end Ion Torrent (typically 300-400 bases)

– Can be treated more or less the same

• Pacific Biosciences or Oxford Nanopore

– Requires special handling, not covered today

Page 9: ECCMID 2015 - So I have sequenced my genome ... what now?

Quality Control: Questions to Ask

• Did my sequencing work?

• What are the fragment lengths?

• Is my sample what I think it is?

• Is my sample contaminated?

Read QC: FastQC, Qualimap, Kraken, BLAST

Adaptor/quality trimming: Trimmomatic

Species ID: BLAST, Metaphlan, MOCAT

Sample QC: Blobology

Page 10: ECCMID 2015 - So I have sequenced my genome ... what now?

Did my sequencing work?

• FastQC:

Page 11: ECCMID 2015 - So I have sequenced my genome ... what now?

What coverage do I have?

• SNP calling: >10x (>15x better)

• De novo assembly: >30x (50x probably better)

• Absolutely no benefit above about 100x for standard applications: the extra depth slows everything down and takes more disk space

• (BTW, FASTQ files are probably a waste of space)
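The depth thresholds above can be checked with simple arithmetic: mean coverage is the total number of sequenced bases divided by the genome size. The sketch below is only illustrative; the function names and the example run (2,000,000 reads of 150 bases on a 5 Mb genome) are assumptions, not figures from the talk.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def coverage_advice(depth: float) -> str:
    """Map a mean depth onto the rough thresholds quoted on the slide."""
    if depth > 100:
        return "more than needed: consider downsampling"
    if depth > 30:
        return "enough for de novo assembly and SNP calling"
    if depth > 10:
        return "enough for SNP calling only"
    return "too low: sequence some more"


# Hypothetical run: 2,000,000 reads of 150 bases on a 5 Mb genome.
depth = mean_coverage(2_000_000, 150, 5_000_000)
print(f"{depth:.0f}x: {coverage_advice(depth)}")
```

Note that this is an average: real coverage is uneven, so regions well below the mean are to be expected.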

Page 12: ECCMID 2015 - So I have sequenced my genome ... what now?

What are the fragment lengths?

• Qualimap (or just BWA)

• Bad: fragment length < read length

• OK: fragment length > read length

• Good: fragment length > 2x read length

You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
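The rule of thumb above can be written down as a small check. This is a minimal sketch: the function names are hypothetical, and the case where fragment length exactly equals read length is not covered by the slide, so it is treated as "ok" here by assumption.

```python
def classify_library(fragment_length: int, read_length: int) -> str:
    """Classify a library using the bad / ok / good rule of thumb."""
    if fragment_length < read_length:
        return "bad"    # reads run off the end of the fragment into adaptor
    if fragment_length > 2 * read_length:
        return "good"   # paired reads do not overlap and span extra sequence
    return "ok"         # assumption: equality is treated as ok


def repeat_is_risky(repeat_length: int, fragment_length: int) -> bool:
    """A repeat longer than the fragment cannot be spanned by a read pair,
    however deep the coverage is."""
    return repeat_length > fragment_length


# Hypothetical example: 250-base reads, 400-base fragments, a 1,500-base repeat.
print(classify_library(400, 250))
print(repeat_is_risky(1500, 400))
```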

Page 13: ECCMID 2015 - So I have sequenced my genome ... what now?

Repetitive regions

This is important because repeat-containing regions are often the most interesting parts of the genome! Think:

• Insertion elements

• Transposons

• Plasmids

• Ribosomal RNA

REPEAT: You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage

Page 14: ECCMID 2015 - So I have sequenced my genome ... what now?

Do not trust the computer

Bioinformatics software will do its best to look like it is dealing with repeats in a rational way, but it is in fact plotting aggressively to ruin your analysis without telling you.

Computers are just like that!

If repeats are important to your analysis, you need an alternative sequencing strategy: long mate-pairs, long reads (Pacific Biosciences or Oxford Nanopore). Don’t drive yourself mad making short reads do what they can’t.

Page 15: ECCMID 2015 - So I have sequenced my genome ... what now?

Adaptor trim reads

• With Nextera libraries, failing to adaptor trim will KILL your assemblies.

• Particularly important when mean fragment length < read length.

• Many trimmers available: I like to use Trimmomatic

• Quality trimming is not important with modern tools (BWA and SPAdes)

For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
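To see why short fragments are the problem: when the fragment is shorter than the read, the sequencer reads through the insert into the adaptor, and that adaptor sequence ends up at the 3' end of the read. The sketch below illustrates the idea with exact string matching only. It is not how Trimmomatic works internally (real trimmers tolerate mismatches and use paired-end information), and the adaptor constant is the commonly cited Nextera transposase read-through sequence, included here as an assumption to check against your own library kit.

```python
# Assumption: commonly cited Nextera transposase read-through sequence.
NEXTERA_ADAPTOR = "CTGTCTCTTATACACATCT"


def trim_adaptor(read: str, adaptor: str = NEXTERA_ADAPTOR, min_overlap: int = 5) -> str:
    """Remove adaptor read-through from the 3' end of a read (exact matches only)."""
    # Case 1: the whole adaptor is present somewhere in the read.
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adaptor), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    return read  # no adaptor found


insert = "ACGTACGTAC"  # hypothetical short fragment
print(trim_adaptor(insert + NEXTERA_ADAPTOR + "GGG"))
print(trim_adaptor(insert + NEXTERA_ADAPTOR[:8]))
```

Both calls should recover the original insert. In practice, use an established trimmer rather than a hand-rolled one.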

Page 16: ECCMID 2015 - So I have sequenced my genome ... what now?

Is my sample what I think it is?

• BLASTing a few random reads is usually a very efficient quality control check, and also helps identify a reference genome

• Kraken or Metaphlan can give rapid organism report
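Picking "a few random reads" for BLAST can be done in one pass over a FASTQ file with reservoir sampling, which avoids loading the whole file into memory. The following is a minimal sketch assuming a well-formed, uncompressed, four-lines-per-record FASTQ; the function names and the toy records are made up for illustration.

```python
import io
import random
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) from a 4-line-per-record FASTQ stream."""
    lines = iter(handle)
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(lines).strip()
        next(lines)  # '+' separator line
        next(lines)  # quality line (not needed for BLAST)
        yield header.strip()[1:], seq


def sample_reads(handle: Iterable[str], k: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Reservoir-sample k reads uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[Tuple[str, str]] = []
    for i, record in enumerate(parse_fastq(handle)):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def to_fasta(records: List[Tuple[str, str]]) -> str:
    """Format sampled reads as FASTA, ready to paste into BLAST."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy FASTQ with five made-up reads.
toy = "".join(f"@read{i}\nACGT{'A' * i}\n+\n{'I' * (4 + i)}\n" for i in range(5))
print(to_fasta(sample_reads(io.StringIO(toy), k=2)))
```

For a real file, pass an open file handle instead of the `io.StringIO` object.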

Page 17: ECCMID 2015 - So I have sequenced my genome ... what now?

Species identification

• Methods:

– 16S rDNA extraction (typically following de novo assembly and annotation) and BLAST

– Taxon-defining genes (e.g. Metaphlan)

– Phylogenetic approach (e.g. MOCAT, Phylosift)

For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/

Page 18: ECCMID 2015 - So I have sequenced my genome ... what now?

Isolate genome

Sequence reads

Other samples on sequencing run

Contamination

Unsequenced regions

Page 19: ECCMID 2015 - So I have sequenced my genome ... what now?
Page 20: ECCMID 2015 - So I have sequenced my genome ... what now?

Sources of contamination

• Accidental multiple colony picks or mixed liquid culture

– Same or different organism

– E.g. Achromobacter & Pseudomonas aeruginosa in CF

• Reagent contamination (DNA extractions)

• Sequencer “carry-over” (0.2%?)

• PhiX control sequence <- don’t be this guy

• Barcode “cross-over” (bad pipetting technique or contaminated reagents)

Page 21: ECCMID 2015 - So I have sequenced my genome ... what now?
Page 22: ECCMID 2015 - So I have sequenced my genome ... what now?

Blobology

Contamination

Page 24: ECCMID 2015 - So I have sequenced my genome ... what now?

Reference-based or de novo?

Page 25: ECCMID 2015 - So I have sequenced my genome ... what now?

Reference-based or de novo?

• Reference-based

– Implies ALIGNMENT to reference

– Implies you HAVE a reference

– Allows exquisitely sensitive and specific SNP calling (forensic SNP calling to single mutation precision)

– Important for looking at CHAINS OF TRANSMISSION

– Can only call in parts of the genome COMMON between your SAMPLES and REFERENCE: the CORE

Page 26: ECCMID 2015 - So I have sequenced my genome ... what now?

Reference-based or de novo?

• De novo

– Implies de novo assembly

– Does NOT require a reference

– Gives access to the entire PAN-genome

– E.g. unexpected antibiotic resistance genes, virulence factors

– Can give misleading results in REPEAT sequences

– Not suitable for very fine-resolution SNP analysis

Page 27: ECCMID 2015 - So I have sequenced my genome ... what now?

In practice

• Most people will want to do both.

• And if you have no reference, you can use a draft de novo assembly AS your reference

– But exercise caution

Page 28: ECCMID 2015 - So I have sequenced my genome ... what now?

Reference-based approach

Read QC: FastQC, Qualimap, Kraken, BLAST

Adaptor/quality trimming: Trimmomatic

Species ID: BLAST, Metaphlan, MOCAT

Sample QC: Blobology

Alignment: BWA

Variant calling: Samtools/VarScan, GATK

SNP extraction & filter: custom script, snippy, snpEff, breseq

Recombination filtering: Gubbins, ClonalFrameML

Tree building: FastTree, RAxML

MLST/Antibiogram: SRST2

Page 29: ECCMID 2015 - So I have sequenced my genome ... what now?

Analysis choice highly species dependent: not one size fits all!

• What is the mode and tempo of evolution?

• Monomorphic organisms:

– Characterised by vertical pattern of inheritance

– Isolates differ by few mutations

• Highly recombinogenic organisms

– Mutations dominated by recombination

– May have vast differences in gene content, gene order

– “Clonal frame” may be obscured or absent

Page 30: ECCMID 2015 - So I have sequenced my genome ... what now?

Different species require different analysis strategies

Variation

M. tuberculosis

S. aureus

B. anthracis

E. coli

P. aeruginosa

N. meningitidis

S. pneumoniae

Clonal population structure

Branching phylogenies

Open pan-genome

Horizontal gene transfer

Salmonella

High rates of recombination

Phylogenetic networks

Page 31: ECCMID 2015 - So I have sequenced my genome ... what now?

Tips for picking a reference

• The higher the quality, the better (aim for pre-NGS Sanger genomes, e.g. <2001)

• Ideally single contig, no gaps

• Canonical strains have the most portable and most widely cited gene references, e.g. TB H37Rv, PAO1, E. coli K-12 etc.

• For SNP calling specificity: more closely related is better

Page 32: ECCMID 2015 - So I have sequenced my genome ... what now?

The core genome

• The core genome used to call SNPs will reduce as more genomes are added

• Particularly noticeable in species with highly plastic genomes: E. coli

• Has significance for forensic applications
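The shrinking core can be illustrated with set intersections: the core is whatever every genome shares, so each added genome can only keep it the same size or make it smaller. This is a toy sketch at the gene level with made-up gene sets; in a reference-based SNP analysis the same logic applies to the reference positions covered in every sample.

```python
from typing import List, Set


def core_genome_sizes(gene_sets: List[Set[str]]) -> List[int]:
    """Return the core size after each successive genome is added."""
    if not gene_sets:
        return []
    core = set(gene_sets[0])
    sizes = [len(core)]
    for genes in gene_sets[1:]:
        core &= genes  # keep only genes present in every genome so far
        sizes.append(len(core))
    return sizes


# Hypothetical gene content for three isolates.
genomes = [
    {"gyrA", "rpoB", "recA", "blaCTX-M"},
    {"gyrA", "rpoB", "recA", "stx2"},
    {"gyrA", "rpoB", "mcr-1"},
]
print(core_genome_sizes(genomes))  # [4, 3, 2]
```

One consequence for forensic work: a SNP distance between two isolates can change when unrelated genomes are added to the same analysis, because fewer positions remain callable.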

Page 33: ECCMID 2015 - So I have sequenced my genome ... what now?

Is my reference good enough?

• Assess core genome size

– Harvest will do this for you

• Or look at samtools flagstat (?)

• Between-sample SNP calling efficiency goes down with reference divergence

• Luxury option: get a Pacific Biosciences complete reference done for each “clone” in your dataset (for some definition of clone)

Page 34: ECCMID 2015 - So I have sequenced my genome ... what now?

Effect of closer reference on P. aeruginosa genotyping

Reference           SNPs   Indels   Mapped
PAO1 reference      23     4        77%
PacBio reference    40     5        97%

Quick, Loman et al. BMJ Open 2014

Page 35: ECCMID 2015 - So I have sequenced my genome ... what now?

SNP filtering

• A highly specific SNP dataset (few false-positive calls) is vital for effective phylogenetic reconstructions and outbreak tracing

• Most SNP calling errors come from

– A) misalignment (reads from sequence present in the sample but not in the reference align to the wrong place)

– B) copy number variation (2 copies in sample, 1 copy in reference)

• NOT from sequencing error (at least with Illumina: other platforms do have systematic errors)

Page 36: ECCMID 2015 - So I have sequenced my genome ... what now?

SNP filtering (2)

• Allele frequency filter is the most effective SNP filter

– AF > 0.9 (90%) works very well empirically

• Strand filter also very useful to prevent SNPs around structural variations

• Filtering for low coverage not that helpful:

– (1/1000 error at Q30)^3 for a minimum coverage of 3 = 0.000000001 chance of an error per position = < 1 error per genome

• Avoid SNPs at ends of contigs as these may be mismapping
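The filters above can be sketched in a few lines. The record layout, the names and the specific strand rule (require at least one variant-supporting read on each strand) are assumptions made for illustration; real variant callers expose these counts in their own formats. The second part reproduces the low-coverage arithmetic, assuming independent errors at Phred Q30 and a hypothetical 5 Mb genome.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    pos: int
    ref_fwd: int  # reference-supporting reads, forward strand
    ref_rev: int  # reference-supporting reads, reverse strand
    alt_fwd: int  # variant-supporting reads, forward strand
    alt_rev: int  # variant-supporting reads, reverse strand

    @property
    def allele_frequency(self) -> float:
        depth = self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev
        return (self.alt_fwd + self.alt_rev) / depth if depth else 0.0


def passes_filters(v: Variant, min_af: float = 0.9, min_alt_per_strand: int = 1) -> bool:
    """Allele frequency filter plus a simple both-strands filter."""
    return (
        v.allele_frequency > min_af
        and v.alt_fwd >= min_alt_per_strand
        and v.alt_rev >= min_alt_per_strand
    )


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def expected_false_snps(genome_size: int, q: float = 30, min_cov: int = 3) -> float:
    """Expected positions where all min_cov reads are wrong, assuming independence."""
    return genome_size * phred_to_error(q) ** min_cov


calls = [
    Variant(100, 0, 0, 10, 12),   # clean homozygous call
    Variant(200, 5, 5, 10, 10),   # AF about 0.67: likely misalignment or copy number
    Variant(300, 0, 0, 20, 0),    # variant seen on one strand only
]
print([v.pos for v in calls if passes_filters(v)])
print(expected_false_snps(5_000_000))  # roughly 0.005 for a 5 Mb genome
```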

Page 37: ECCMID 2015 - So I have sequenced my genome ... what now?

Detecting recombination

• Simple algorithms rely on SNP density, more complex ones assess impact on “clonal frame”

Normal SNP density Recombining region
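The simplest density-based idea can be sketched by counting SNPs in fixed windows along the reference and flagging windows with unusually many. This is only an illustration of the principle: the window size and threshold are arbitrary assumptions, and dedicated tools such as Gubbins or ClonalFrameML use considerably more elaborate models.

```python
from typing import List, Tuple


def dense_windows(
    snp_positions: List[int],
    genome_length: int,
    window: int = 1000,
    max_snps: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for windows holding more than max_snps SNPs.

    Positions are 1-based and must lie within 1..genome_length.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        counts[(pos - 1) // window] += 1
    return [
        (i * window + 1, min((i + 1) * window, genome_length), n)
        for i, n in enumerate(counts)
        if n > max_snps
    ]


# Hypothetical SNP positions on a 10 kb reference: a cluster around 5.2-5.9 kb.
snps = [150, 5200, 5210, 5300, 5450, 5900, 9000]
print(dense_windows(snps, genome_length=10_000))  # [(5001, 6000, 5)]
```

SNPs inside flagged windows would then be masked before tree building, so that imported sequence does not inflate branch lengths.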

Page 38: ECCMID 2015 - So I have sequenced my genome ... what now?

Impact of recombination filtering

Page 39: ECCMID 2015 - So I have sequenced my genome ... what now?

De novo approach

• Interrogate the accessory genome

– Novel genes

• Some important applications take contigs rather than reads as primary input

• SNP calling with de novo assembly is fundamentally less reliable due to lack of allele frequency information; but fine for broad-scale clustering

Page 40: ECCMID 2015 - So I have sequenced my genome ... what now?

Reference-based approach

Read QC: FastQC, Qualimap

Adaptor/quality trimming: Trimmomatic

Species ID: BLAST, Metaphlan, MOCAT

Sample QC: Blobology, Kraken, BLAST

Alignment: BWA

Variant calling: Samtools/VarScan, GATK

SNP extraction & filter: custom script, snippy

Recombination filtering: Gubbins, ClonalFrameML

Tree building: FastTree, RAxML

MLST/Antibiogram: SRST2

De novo approach

Assembly: Velvet, SPAdes

Annotation: Prokka

Tree building: Harvest

Population genomics: BIGSdb, PHYLOViZ

Pan-genome: LS-BSR

MLST/Antibiogram: mlst, Abricate

Page 41: ECCMID 2015 - So I have sequenced my genome ... what now?

Concluding thoughts

1. Don’t trust your sequencing data (or others’) – sense-check and validate each step

2. Make extensive use of visualisation tools to do this

3. There’s more than one way to do any one task

Page 42: ECCMID 2015 - So I have sequenced my genome ... what now?

CLoud Infrastructure for Microbial Bioinformatics (CLIMB)

• MRC funded project to develop Cloud Infrastructure for microbial bioinformatics

• £4M of hardware, capable of supporting >1000 individual virtual servers

• Amazon/Google cloud for Academics

Page 43: ECCMID 2015 - So I have sequenced my genome ... what now?

Meet-The-Expert

• Meet-The-Expert: Joao Carrico and I

• Tomorrow (Monday)

• 07:45 (really)

• Hall M

• Session ME11 What bioinformatics tools do I use for whole-genome sequence (WGS)-based bacterial diagnostics and typing?

Page 44: ECCMID 2015 - So I have sequenced my genome ... what now?

Acknowledgements

• Twitter comments:

– Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey