Pathogenomics, from patient to bioinformatician torsten seemann - uni melb - tue 1 may 2012

Pathogenomics

Dieter Bulach Torsten Seemann

From patient to bioinformatician

M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012

The "rules"

● Conversation, not lecture

● Ask questions at any time

● Activities and quizzes are interspersed. These have a yellow background, like this slide.

● Please turn your phones to silent.

● Let's start!

Overview

● Medical issue
  ○ sample collection from patient
  ○ sample preparation
● Genome sequencing
  ○ experimental design
● Bioinformatics
  ○ identify SNPs - read mapping, to reference
    ■ phylogenomic tree
  ○ identify novel DNA - de novo assembly, no reference
    ■ annotation
● Biological interpretation

What is a pathogen?

● infectious agent or "germ"

● microbe that causes disease in its host
  ○ organism
    ■ virus, bacterium, fungus, protozoan, parasite
    ■ most are harmless or even beneficial
  ○ host
    ■ human, animal, plant
  ○ transmission
    ■ any "hole" in you - inhalation, ingestion, wound
  ○ virulence
    ■ how bad, how quick, mortality

What type of pathogen are these?

● HIV
● Golden Staph
● Glandular Fever
● Malaria
● Powdery mildew
● Black death (Plague)

How do they work?

● Adhesion
  ○ bind to host cell surface - interferes with normal processes
● Colonization
  ○ take over parts of the body - upsets processes
● Invasion
  ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression
  ○ for example, produce proteins to bind to antibodies
● Toxins
  ○ proteins/metabolites that are poisonous to the host

Patient scenario

● Hospital patient with indwelling catheter
  ○ risk of pathogens entering the bloodstream
  ○ this is not normal, and is called septicemia
  ○ sepsis is the whole-body inflammatory response to it
● Need to defeat the pathogen
  ○ most likely bacterial in this case
  ○ need to identify the bacterium and its characteristics
    ■ antibiotic resistance profile, e.g. MRSA, VRE
    ■ might even want to know where it came from

Sample collection

● Take patient blood
  ○ send to pathology
● Centrifuge
  ○ slow spin to remove human cells
  ○ fast spin to pellet bacterial cells
● Streak onto agar media
  ○ first emulsify the pellet to make it spreadable
  ○ grow for 24 hours, likely to be monoculture

Traditional Microbiology

● Phenotype based:
  ○ look at cells under microscope
  ○ Gram staining - cell walls
  ○ biochemical tests
  ==> identification of the bacterium
    ■ genus and species
● PCR based testing:
  ○ 16S ribosomal RNA
    ■ common to all bacteria, differs slightly per strain
    ■ identify genus >90%, species >65%, unknown 10%
  ○ Multi-locus sequence typing (MLST)
    ■ sequence ~8 conserved genes
    ■ each strain/genus has its own MLST pattern
  ==> faster but limited - need prior knowledge

WGS for diagnostics

● Whole Genome Sequencing
  ○ fast and no prerequisite knowledge about the pathogen
  ○ Microbiologist won't be superseded!!
    ■ just different tools
    ■ sequence data set: will still do all the 'tests' to identify and profile

Purify DNA

● DNA extraction kit
  ○ lyse cells and digest (proteinase K)
  ○ centrifuge to remove cell debris
  ○ pass lysate through column
    ■ DNA sticks to a DNA binding matrix
  ○ wash bound DNA
  ○ lower salt concentration - release bound DNA
  ○ precipitate: dubiously familiar stringy white pellet
    ■ salt and ethanol
● Extract DNA from strawberries at home!
  ○ detergent - breaks cells (octoploid genome)
  ○ strainer/pantyhose - remove particulate matter
  ○ salt - aids DNA precipitation
  ○ alcohol - precipitates DNA, keeps rest in solution

Library preparation

● Enough DNA?
  ○ each technology requires different amounts
● Library type
  ○ shotgun, short paired, or long paired reads?
  ○ different construction methods, e.g. circularization
● Size selection
  ○ nebulization, sonication, enzymatic methods
  ○ run on gel + scalpel, or fancier methods
● Amplification
  ○ lots of DNA: use PCR methods, e.g. emulsion PCR
  ○ little DNA: multiple displacement amplification
    ■ random hexamers, high fidelity polymerase
    ■ whole genome amplification for single-cell seq

High throughput DNA-Seq

● Lots of technologies on the market
  ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs
  ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
  ○ Illumina is currently the best choice
  ○ Most mature technology
  ○ Produces direct "base space", i.e. A,G,T,C
  ○ Easiest data to work with

Current technology

Method       Length (now)  Length (soon)  Paired end?   Mate pairs?     Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)  Yes (→3kb)      +++++    ++++   base
454          500           900            No            Yes (→8kb)      +++      ++     flow
SOLiD        75            75             No            Yes (~4kb)      +++      +++++  colour
Ion Torrent  100           200            Testing       Testing (~4kb)  ++       +++    flow
PacBio       2000          6000+          No            No              +        +      base?

Read types

● Single end, "shotgun"
  ○ ===>---------
  ○ sequence from one end of a fragment
● Paired end
  ○ ==>--------<==
  ○ sequence from both ends of the same fragment
  ○ space between mates is the "insert size" (< 800bp)
● Mate pair
  ○ ==>--~~~~--<==
  ○ sequence both ends of a pseudo-fragment
  ○ this allows us to use longer insert sizes (> 800bp)

Read "spaces"

● Example read
  ○ ACTGGGTCC
● Base space
  ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space
  ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
  ○ mis-count when n > 3 (homopolymers)
● Colour space
  ○ get di-base encoding: T:X,X,X,X,X,X,X
  ○ theoretically useful, but messy overall
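The flow-space view of the example read can be reproduced with a few lines of Python. This is only a sketch: it collapses runs of identical bases, whereas a real flowgram cycles through a fixed nucleotide order and also records empty flows. The function name to_flow_space is made up for illustration.

```python
from itertools import groupby

def to_flow_space(read):
    """Collapse a base-space read into (base, run length) pairs.

    Simplification: real flow-based instruments cycle through a fixed
    nucleotide order and also report zero-length flows.
    """
    return [(base, len(list(run))) for base, run in groupby(read)]

flows = to_flow_space("ACTGGGTCC")
print(", ".join(f"{base}*{n}" for base, n in flows))
```

The run lengths are exactly what the instrument has to estimate from signal intensity, which is why long homopolymers are where the mis-counts happen.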

Read filtering

● Sequencing is a multi-step process
  ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing
  ○ usually excess sequence, can afford to discard
● Why filter?
  ○ reduce size of data set to deal with
    ■ need less RAM and CPU
  ○ improve average read quality
    ■ better results

What to filter on

● Phred base qualities
  ○ Q<20 still means >1% error!
● ambiguous bases, i.e. "N"
  ○ these should have low Q scores anyway
● reads that are too short
  ○ too ambiguous to map, too short to assemble
● widowed reads
  ○ reads that, after filtering, no longer have a mate
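The filtering criteria above could be sketched roughly as follows. The thresholds (mean Q of 20, minimum length of 30), the use of mean rather than per-base quality, and the representation of a read as a (sequence, list of Phred scores) tuple are all assumptions made for illustration, not settings from the lecture.

```python
def phred_to_error(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def passes_filter(seq, quals, min_mean_q=20, min_len=30):
    """Keep a read only if it is long enough, has no N, and has decent mean quality."""
    if len(seq) < min_len:
        return False
    if "N" in seq:
        return False
    return sum(quals) / len(quals) >= min_mean_q

def filter_pairs(pairs, **thresholds):
    """Split read pairs into surviving pairs and widowed reads (mate was discarded)."""
    kept, widowed = [], []
    for read1, read2 in pairs:
        ok1 = passes_filter(*read1, **thresholds)
        ok2 = passes_filter(*read2, **thresholds)
        if ok1 and ok2:
            kept.append((read1, read2))
        elif ok1:
            widowed.append(read1)
        elif ok2:
            widowed.append(read2)
    return kept, widowed

# Toy example with a short minimum length so the reads fit on one line
good = ("ACGTACGT", [30] * 8)
bad = ("ACGNACGT", [30] * 8)
kept, widowed = filter_pairs([(good, good), (good, bad)], min_len=8)
print(len(kept), len(widowed))
```

Whether widowed reads are kept as single-end reads or dropped depends on what the downstream mapper or assembler can use.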

Sequenced it, now what?

● How is it different?
  ○ compare to known closely related "reference" strain
● Types of differences
  ○ deleted DNA - in reference, not in ours
  ○ duplicated DNA - extra copies in ours
  ○ novel DNA - in ours, not in reference
  ○ SNPs - single nucleotide polymorphisms (1bp subst)
  ○ indels - short insertions or deletions (usually 1bp)
    ■ sometimes indels fall under "SNPs" banner
  ○ structural variation - rearrangements, inversions
    ■ small scale, large scale

Read mapping - large scale

[Figure: read coverage along the reference, with regions labelled Deleted, x3, x2 and Conserved (x1 coverage)]

Read mapping - medium scale

Read mapping - small scale

[Figure: reads aligned against the reference; labels: Reference sequence, Depth, Errors]
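Depth is simply the number of mapped reads covering each reference position. The sketch below assumes alignments are already available as hypothetical (start, length) pairs in 0-based reference coordinates; a real pipeline would read them from the mapper's output. Regions with zero depth are candidates for DNA that is deleted in the sample.

```python
def depth_per_position(ref_len, alignments):
    """Count how many reads cover each reference position.

    alignments: list of (start, length) tuples, 0-based (assumed input format).
    """
    depth = [0] * ref_len
    for start, length in alignments:
        for pos in range(start, min(start + length, ref_len)):
            depth[pos] += 1
    return depth

def zero_coverage_regions(depth):
    """Return half-open (start, end) intervals where no read maps."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos
        elif d != 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

depth = depth_per_position(10, [(0, 4), (2, 4), (8, 2)])
print(depth)
print(zero_coverage_regions(depth))
```

Regions at roughly two or three times the typical depth would, by the same logic, hint at duplicated DNA.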

Are we seeing everything?

● Hmm, some of our reads didn't map
  ○ sequencing artifacts (some)
  ○ contamination (maybe - RA sneezed into sample?)
  ○ DNA in our sample but not in reference (yes)
    ■ need to de novo assemble
● Other comparisons more difficult
  ○ structural change, rearrangements
    ■ read length & insert size are limiting factors
  ○ read mapping is not the answer to every question
    ■ particularly with non-model organisms

De novo genome assembly

● De novo
  ○ Latin - "from the beginning", "afresh", "anew"
  ○ Without reference to any other genomes
● "Genome assembly is impossible."
  ○ A/Prof. Mihai Pop - leading assembly researcher!

Assembling bacteria

● Genomes
  ○ DNA, single organism, ~1 sequence
● Transcriptomes
  ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes
  ○ DNA, mix of organisms, >10 sequences
  ○ e.g. human gut microbiome, oral cavity
● Meta-transcriptomes
  ○ RNA (cDNA), mix of organisms, >40000 sequences!

Types of assemblers

● Greedy
  ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus
  ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown")
  ○ break reads into k-mers, overlap is 100% identity over k-1 bases
● String graph
  ○ represents all that is inferable from the reads
  ○ encompasses OLC and DBGs
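To make the de Bruijn idea concrete, here is a minimal sketch. It uses one common formulation in which nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix; real assemblers also track k-mer counts, reverse complements and error correction, none of which is shown. The function name de_bruijn is illustrative.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

graph = de_bruijn(["ACGTAC", "GTACGG"], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

No read-versus-read comparison is needed: overlaps fall out implicitly because identical (k-1)-mers from different reads land on the same node.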

Assembly algorithm

● Find all overlaps between all reads
  ○ naively this is O(N^2) for N reads, but good heuristics exist
  ○ parameters are: min. overlap, min. %identity
● Build a graph from these overlaps
  ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph
  ○ because real-world reads have errors
● Trace a single path through the graph
  ○ Read off the consensus of bases as you go
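The first two steps can be sketched naively as below. This version compares every ordered pair of reads (the O(N^2) case) and only accepts exact suffix-prefix matches, i.e. it assumes min. %identity = 100%; graph simplification and path tracing are not shown. The function names are illustrative.

```python
from itertools import permutations

def overlap(a, b, min_overlap):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_overlap bases exists.
    min_overlap is assumed to be >= 1.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_overlap):
    """All-vs-all overlaps: edge (a, b) -> overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_overlap)
        if n:
            graph[(a, b)] = n
    return graph

reads = ["ACGTAC", "TACGGA", "GGATTT"]
for (a, b), n in overlap_graph(reads, min_overlap=3).items():
    print(f"{a} -> {b} (overlap {n})")
```

Raising min_overlap removes spurious edges but also loses true overlaps at low coverage, which is one reason this parameter matters in practice.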

The tyranny of repeats

● Assembler would output 7 contig sequences
  ○ path is broken at ambiguous decision points
  ○ read/pair length limits ability to resolve repeats

How many contigs will be produced?

More complex graph

Contigs Connections

Reality bites

Shared vertices are repeats.

Scaffolding

● Use paired reads to join contigs
  ○ reads with their mates in different contigs in a consistent manner suggests adjacency
● A difficult constraint problem
  ○ distance between mates ("insert size") variable
  ○ repeats cause ambiguous mate placement
  ○ some assemblers do it, separate scaffolders exist
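The core evidence for scaffolding is a count of read pairs whose two mates land in different contigs. The sketch below only does that counting; it assumes each pair has already been placed as a hypothetical (contig of read 1, contig of read 2) tuple, and it ignores orientation, insert size and repeats, which are exactly what make the real problem hard. The min_links threshold is an arbitrary illustrative value.

```python
from collections import Counter

def contig_links(pair_placements, min_links=2):
    """Count read pairs linking two different contigs.

    pair_placements: list of (contig_of_read1, contig_of_read2) tuples.
    Returns {(contig_a, contig_b): link_count} for pairs with enough support.
    """
    links = Counter()
    for c1, c2 in pair_placements:
        if c1 != c2:
            links[frozenset((c1, c2))] += 1
    return {tuple(sorted(pair)): n for pair, n in links.items() if n >= min_links}

placements = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
print(contig_links(placements))
```

Requiring several independent links is a simple guard against a single chimeric or mis-mapped pair joining unrelated contigs.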

Contig ordering

● Optical maps
  ○ wet lab method, real experimental evidence
  ○ chromosome sized restriction site map
● Align to reference genome
  ○ fit contigs as well as possible against known reference
  ○ some contigs will fit if split (DNA rearrangement)
  ○ expect orphan contigs (novel DNA)

Genome closure

● Finished genome
  ○ one contig per replicon in original sample
  ○ bacterial chromosomes/plasmids usually circular
● Labour intensive
  ○ design primers around gaps, PCR, Sanger
  ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother?
  ○ no close reference exists
  ○ ensures you didn't miss anything
  ○ understand whole genome architecture
  ○ simplifies all downstream analysis

Annotation

● Annotation is the process of identifying important features in a genome
  ○ gene - protein product, promoter, signal sequences
    ■ ~1000 per Mbp in bacteria, coding dense
  ○ tRNA - transfer RNA
    ■ ~30 per bacterium cover all codons (wobble base)
  ○ rRNA - ribosomal RNA locus
    ■ 1 to 9 per bacterium, fast vs slow growers
  ○ And many more...
    ■ small RNAs, ncRNA, binding sites, tx factors

Annotating proteins

● Homology vs. Similarity
  ○ homology means shared evolutionary ancestry, which usually implies similar biological function
  ○ we use sequence similarity as a proxy for homology
  ○ works well for most situations
● Sequence alignment methods
  ○ "Exact" - Needleman-Wunsch, Smith-Waterman
  ○ "Approx" - BLAST, FASTA, and many others!
  ○ Database is sequences: nr, RefSeq, UniProt
● Sequence profile methods
  ○ Build an HMM (model) of aligned sequence families
  ○ HMMER - scan profiles against query protein seq.
  ○ Database is profiles: Pfam, TIGRfams, FigFam
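As a rough illustration of an "exact" method, the sketch below computes a Smith-Waterman local alignment score by dynamic programming. The scoring scheme (match +2, mismatch -1, linear gap -2) is an assumption chosen for readability; protein searches would normally use a substitution matrix such as BLOSUM62 with affine gap penalties, and would also trace back the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The cost is proportional to the product of the two sequence lengths, which is why heuristic tools such as BLAST are used for database-scale searches.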

Curation

● Automatic annotation
  ○ more quality databases and models now
  ○ but still flawed
● Manual curation
  ○ Essential for a quality annotation
  ○ Find pseudo, missing, bogus, and broken genes
  ○ Discover mis-assemblies
  ○ Correct mis-annotated protein families
  ○ Fix incorrect start codons
    ■ Bacteria have 3-5 start codons, not just AUG (M)
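The start codon point can be illustrated with a toy open reading frame (ORF) scanner. This sketch assumes the three most commonly used bacterial start codons (ATG, GTG, TTG in DNA terms), scans the forward strand only, and always takes the first start in a frame, which is exactly the kind of automatic choice that manual curation sometimes has to correct. Real gene finders use statistical models rather than this simple rule.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed set of common bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return half-open (start, end) coordinates of forward-strand ORFs.

    min_codons counts the start and stop codons as well.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

dna = "CCATGAAATTTTAGCC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Scanning the reverse complement in the same way would cover the other strand.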

Practical Exercise

Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc

Install: Artemis and MEGA

Keep this URL open in a tab.

Wait for further instructions.

http://dna.med.monash.edu.au/~torsten/tmp/msc/

Annotation

● Start Artemis
  ○ open Hendra1994.fa
● What is it?
  ○ Hendra virus - 18 kb viral genome
  ○ single-stranded negative-sense RNA (not DNA!!)
  ○ has 6 protein coding regions ("genes")
● Task
  ○ find these genes using Artemis
  ○ use similarity searching to assign a name to the gene

The official annotation

● In Artemis
  ○ download and open Hendra1994.gbk
● Task
  ○ compare to your annotations
● What did you find?
  ○ methionine (M) start codon (ATG)

DNA vs Protein Similarity

● Examine relationships between Paramyxoviridae
  ○ includes Hendravirus already in Artemis
● Open BLAST: http://blast.ncbi.nlm.nih.gov/
● For the Hendravirus:
  ○ use blastn to search nr database for sequences related to the L gene (DNA)
  ○ use blastp to search nr database for sequences related to the L protein (amino acids)
  ○ Any observations??

Phylogeny

● Start MEGA
  ○ download L_para.fas
  ○ multifasta with "L" proteins from 37 similar viruses
● Task:
  ○ Load L_para.fas
  ○ Align sequences (using MUSCLE)
  ○ Infer tree (minimum evolution method)
  ○ Examine relationships

Viral strain comparison

● Hendravirus
  ○ 11 complete genome sequences
  ○ different hosts (bats, horses) and times (1994 onwards)
● Task
  ○ Load hendra11.meg into MEGA
  ○ multiple alignment already done - examine SNPs
  ○ What is the impact of the nucleotide differences?
    ■ look at one SNP in detail
    ■ use Artemis to see if the SNP is in a gene
    ■ does the SNP change the encoded protein?
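The last question can also be checked programmatically once the SNP is known to lie inside a gene. The sketch below assumes the coding sequence is given as DNA on the coding strand, starting in frame at the start codon, and uses the standard genetic code; the function name snp_effect and the toy sequence are illustrative only and are not taken from the Hendra data.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def snp_effect(cds, pos, alt):
    """Classify a single-base substitution within an in-frame coding sequence.

    cds: coding-strand DNA starting at the start codon (assumed input).
    pos: 0-based position of the SNP within cds.
    alt: the alternative base.
    Returns (reference amino acid, altered amino acid, effect).
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return ref_aa, alt_aa, effect

cds = "ATGGCTTAA"                 # toy sequence: Met-Ala-stop
print(snp_effect(cds, 5, "C"))    # third codon position of GCT
print(snp_effect(cds, 3, "A"))    # first codon position of GCT
```

Third-position changes are often synonymous because of redundancy in the genetic code, which is worth keeping in mind when comparing nucleotide and protein level differences between strains.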
