Pathogenomics
Dieter Bulach Torsten Seemann
From patient to bioinformatician
M.Sc(Bioinf) - University of Melbourne - Tue 1 May 2012
The "rules"
● Conversation, not lecture
● Ask questions at any time
● Activities and quizzes are interspersed. These have a yellow background, like this slide.
● Please turn your phones to silent.
● Let's start!
Overview
● Medical issue
  ○ sample collection from patient
  ○ sample preparation
● Genome sequencing
  ○ experimental design
● Bioinformatics
  ○ identify SNPs - read mapping, to reference
    ■ phylogenomic tree
  ○ identify novel DNA - de novo assembly, no reference
    ■ annotation
● Biological interpretation
What is a pathogen?
● infectious agent or "germ"
● microbe that causes disease in its host
  ○ organism
    ■ virus, bacterium, fungus, protozoa, parasite
    ■ most are harmless or even beneficial
  ○ host
    ■ human, animal, plant
  ○ transmission
    ■ any "hole" in you - inhalation, ingestion, wound
  ○ virulence
    ■ how bad, how quick, mortality
What type of pathogen are these?
● HIV
● Golden Staph
● Glandular Fever
● Malaria
● Powdery mildew
● Black death (Plague)
How do they work?
● Adhesion
  ○ bind to host cell surface - interferes with normal processes
● Colonization
  ○ take over parts of the body - upsets processes
● Invasion
  ○ produce proteins to disrupt host cells, allow entry
● Immunosuppression
  ○ for example, produce proteins to bind to antibodies
● Toxins
  ○ proteins/metabolites that are poisonous to the host
Patient scenario
● Hospital patient with indwelling catheter
  ○ risk of pathogens entering the bloodstream
  ○ this is not normal, and is called septicemia
  ○ sepsis is the whole body inflammatory response to it
● Need to defeat the pathogen
  ○ most likely bacterial in this case
  ○ need to identify the bacterium and its characteristics
    ■ antibiotic resistance profile e.g. MRSA, VRE
    ■ might even want to know where it came from
Sample collection
● Take patient blood
  ○ send to pathology
● Centrifuge
  ○ slow spin to remove human cells
  ○ fast spin to pellet bacterial cells
● Streak onto agar media
  ○ first emulsify the pellet to make it spreadable
  ○ grow for 24 hours, likely to be monoculture
Traditional Microbiology
● Phenotype based:
  ○ look at cells under microscope
  ○ Gram staining - cell walls
  ○ biochemical tests
  ==> identification of the bacterium
    ■ genus and species
● PCR based testing:
  ○ 16S ribosomal RNA
    ■ common to all bacteria, differs slightly per strain
    ■ identify genus >90%, species >65%, unknown 10%
  ○ Multi-locus sequence typing (MLST)
    ■ sequence ~8 conserved genes
    ■ each strain/genus has its own MLST pattern
  ==> faster but limited - need prior knowledge
WGS for diagnostics
● Whole Genome Sequencing
  ○ fast and no prerequisite knowledge about the pathogen
  ○ Microbiologist won't be superseded!!
    ■ just different tools
    ■ sequence data set: will still do all the 'tests' to identify and profile
Purify DNA
● DNA extraction kit
  ○ lyse cells and digest (proteinase K)
  ○ centrifuge to remove cell debris
  ○ pass lysate through column
    ■ DNA sticks to a DNA binding matrix
  ○ wash bound DNA
  ○ lower salt concentration - release bound DNA
  ○ precipitate: dubiously familiar stringy white pellet
    ■ salt and ethanol
● Extract DNA from strawberries at home!
  ○ detergent - breaks cells (octoploid genome)
  ○ strainer/pantyhose - remove particulate matter
  ○ salt - aids DNA precipitation
  ○ alcohol - precipitates DNA, keeps rest in solution
Library preparation
● Enough DNA?
  ○ each technology requires different amounts
● Library type
  ○ shotgun, short paired, or long paired reads?
  ○ different construction methods e.g. circularization
● Size selection
  ○ nebulization, sonication, enzymatic methods
  ○ run on gel + scalpel, or fancier methods
● Amplification
  ○ lots of DNA: use PCR methods e.g. emulsion PCR
  ○ little DNA: multiple displacement amplification
    ■ random hexamers, high fidelity polymerase
    ■ whole genome amplification for single-cell seq
High throughput DNA-Seq
● Lots of technologies on the market
  ○ 454, Illumina, SOLiD, Ion Torrent, PacBio
● Each has its ups and downs
  ○ speed, yield, length, price, quality, labour, reliability
● Technology trend
  ○ Illumina is currently the best choice
  ○ Most mature technology
  ○ Produces direct "base space" i.e. A,G,T,C
  ○ Easiest data to work with
Current technology
Method       Length (now)  Length (soon)  Paired end?   Mate pairs?     Quality  Yield  "Space"
Illumina     150           250            Yes (→800bp)  Yes (→3kb)      +++++    ++++   base
454          500           900            No            Yes (→8kb)      +++      ++     flow
SOLiD        75            75             No            Yes (~4kb)      +++      +++++  colour
Ion Torrent  100           200            Testing       Testing (~4kb)  ++       +++    flow
PacBio       2000          6000+          No            No              +        +      base?
Read types
● Single end, "shotgun"
  ○ ===>---------
  ○ sequence from one end of a fragment
● Paired end
  ○ ==>--------<==
  ○ sequence from both ends of the same fragment
  ○ space between mates is the "insert size" (< 800bp)
● Mate pair
  ○ ==>--~~~~--<==
  ○ sequence both ends of a pseudo-fragment
  ○ this allows us to use longer insert sizes (> 800bp)
Read "spaces"
● Example read
  ○ ACTGGGTCC
● Base space
  ○ get native bases: A,C,T,G,G,G,T,C,C
● Flow space
  ○ get base flows: A*1, C*1, T*1, G*3, T*1, C*2
  ○ mis-count when n > 3 (homopolymers)
● Colour space
  ○ get di-base encoding: T:X,X,X,X,X,X,X
  ○ theoretically useful, but messy overall
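To make the three "spaces" concrete, here is a minimal sketch that re-encodes the example read. It is an illustration under stated assumptions, not how any instrument works internally: flow space is shown as a simple run-length encoding (real flow data depends on the instrument's nucleotide flow order), and colour space uses the commonly described SOLiD scheme where each colour is the XOR of the 2-bit codes of two adjacent bases, starting from an assumed primer base T. The function names are made up for this example.

```python
from itertools import groupby

# 2-bit base codes; the colour for two adjacent bases is the XOR of their codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_flow_space(read):
    """Run-length encode a read: one (base, count) pair per homopolymer run."""
    return [(base, len(list(run))) for base, run in groupby(read)]


def to_colour_space(read, primer_base="T"):
    """Encode a read as the primer base followed by one colour per adjacent base pair."""
    seq = primer_base + read
    colours = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return primer_base + "".join(str(c) for c in colours)


read = "ACTGGGTCC"
print(to_flow_space(read))    # homopolymer runs, e.g. ('G', 3)
print(to_colour_space(read))  # primer base plus di-base colours
```

The run of three Gs becomes a single flow with count 3, which is exactly where flow-space instruments tend to mis-count.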
Read filtering
● Sequencing is a multi-step process
  ○ most steps are biological - so there will be errors!
● Bacterial genome sequencing
  ○ usually excess sequence, can afford to discard
● Why filter?
  ○ reduce size of data set to deal with
    ■ need less RAM and CPU
  ○ improve average read quality
    ■ better results
What to filter on
● Phred base qualities
  ○ Q<20 still means >1% error!
● ambiguous bases i.e. "N"
  ○ these should have low Q scores anyway
● reads that are too short
  ○ too ambiguous to map, too short to assemble
● widowed reads
  ○ reads that, after filtering, no longer have a mate
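The filters above can be sketched in a few lines. This is an illustrative sketch only: the thresholds (`min_q=20`, `min_len=30`), the function names, and the representation of a read as a `(sequence, qualities)` tuple are assumptions for the example, and dedicated read-trimming tools do considerably more. The one fixed fact is the Phred definition, where a quality Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_to_error(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def keep_read(seq, quals, min_q=20, min_len=30):
    """Return True if a read passes the length, ambiguity and quality filters.

    min_q and min_len are illustrative thresholds, not recommendations.
    """
    if len(seq) < min_len:
        return False              # too short to map or assemble
    if "N" in seq.upper():
        return False              # contains ambiguous bases
    return min(quals) >= min_q    # every base must reach the quality threshold


def filter_pairs(pairs, **thresholds):
    """Split read pairs into surviving pairs and widowed reads (mate was discarded)."""
    paired, widowed = [], []
    for r1, r2 in pairs:
        ok1 = keep_read(*r1, **thresholds)
        ok2 = keep_read(*r2, **thresholds)
        if ok1 and ok2:
            paired.append((r1, r2))
        elif ok1:
            widowed.append(r1)
        elif ok2:
            widowed.append(r2)
    return paired, widowed
```

Whether widowed reads are kept as single-end reads or dropped is a design choice that depends on the downstream mapper or assembler.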
Sequenced it, now what?
● How is it different?
  ○ compare to known closely related "reference" strain
● Types of differences
  ○ deleted DNA - in reference, not in ours
  ○ duplicated DNA - extra copies in ours
  ○ novel DNA - in ours, not in reference
  ○ SNPs - single nucleotide polymorphisms (1bp subst)
  ○ indels - short insertions or deletions (usually 1bp)
    ■ sometimes indels fall under "SNPs" banner
  ○ structural variation - rearrangements, inversions
    ■ small scale, large scale
Read mapping - large scale
[Figure: read coverage along the reference, with regions labelled Deleted, Conserved, and x1, x2, x3 coverage]
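At large scale, read mapping comes down to counting how many reads cover each reference position: no coverage suggests deleted DNA, and multiples of the typical depth suggest duplicated DNA. The sketch below computes a per-base depth profile from alignment intervals. The input format (0-based `(start, end)` pairs with `end` exclusive and within the reference) and the function names are assumptions for illustration; real pipelines read these from alignment files.

```python
def depth_profile(ref_len, alignments):
    """Per-base read depth from (start, end) alignments; 0-based, end exclusive."""
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering here
        diff[end] -= 1     # and stops covering here
    depth, running = [], 0
    for change in diff[:ref_len]:
        running += change
        depth.append(running)
    return depth


def uncovered_regions(depth):
    """Return (start, end) intervals with zero depth: candidate deleted DNA."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos
        elif d > 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```

Zero depth alone is only a hint: unmappable or repetitive regions of the reference can also show no coverage.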
Read mapping - medium scale
Read mapping - small scale
[Figure: reads aligned against the reference sequence, showing depth and errors]
Are we seeing everything?
● Hmm, some of our reads didn't map
  ○ sequencing artifacts (some)
  ○ contamination (maybe - RA sneezed into sample?)
  ○ DNA in our sample but not in reference (yes)
    ■ need to de novo assemble
● Other comparisons more difficult
  ○ structural change, rearrangements
    ■ read length & insert size are limiting factors
  ○ read mapping is not the answer to every question
    ■ particularly with non-model organisms
● De novo
  ○ Latin - "from the beginning", "afresh", "anew"
  ○ Without reference to any other genomes
● "Genome assembly is impossible."
  ○ A/Prof. Mihai Pop - leading assembly researcher!
De novo genome assembly
Assembling bacteria
● Genomes
  ○ DNA, single organism, ~1 sequence
● Transcriptomes
  ○ RNA (cDNA), single organism, ~4000 sequences
● Meta-genomes
  ○ DNA, mix of organisms, >10 sequences
  ○ e.g. human gut microbiome, oral cavity
● Meta-transcriptomes
  ○ RNA (cDNA), mix of organisms, >40000 sequences!
Types of assemblers
● Greedy
  ○ find two best matching reads, join them, iterate
● Overlap-Layout-Consensus
  ○ collate all overlaps into a graph and find a path
● de Bruijn graph (pronounced "brown")
  ○ break reads into k-mers, overlap is 100% identity over k-1 bases
● String graph
  ○ represents all that is inferable from the reads
  ○ encompasses OLC and DBGs
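As a small illustration of the de Bruijn idea, the sketch below breaks reads into k-mers and links two k-mers whenever the last k-1 bases of one exactly match the first k-1 bases of the other. It follows the formulation on the slide (k-mers as nodes); many assemblers instead use (k-1)-mers as nodes and k-mers as edges. The function names are hypothetical, and there is no error correction or graph simplification here.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each k-mer to the k-mers that overlap it exactly by k-1 bases."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    by_prefix = defaultdict(set)
    for kmer in nodes:
        by_prefix[kmer[:-1]].add(kmer)
    # An edge a -> b exists when suffix(a, k-1) == prefix(b, k-1).
    return {kmer: sorted(by_prefix.get(kmer[1:], ())) for kmer in nodes}


graph = de_bruijn_graph(["ACTGG", "CTGGA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

Note that no read-against-read comparison is needed: overlaps fall out of the shared k-mers, which is why this approach scales to very large numbers of short reads.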
Assembly algorithm
● Find all overlaps between all reads
  ○ naively this is O(N^2) for N reads, but good heuristics exist
  ○ parameters are: min. overlap, min %identity
● Build a graph from these overlaps
  ○ nodes/arcs <=> reads/overlaps <=> vertices/edges
● Simplify the graph
  ○ because real-world reads have errors
● Trace a single path through the graph
  ○ Read off the consensus of bases as you go
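The steps above can be sketched naively as follows. This is a toy version under clear assumptions: overlaps must match exactly (so there is no "min %identity" parameter and no graph simplification), the all-against-all comparison is the O(N^2) approach with no heuristics, and the path through the graph is supplied by hand rather than found automatically. All function names are made up for the example.

```python
def overlap(a, b, min_overlap):
    """Length of the longest suffix of a that exactly matches a prefix of b (0 if too short)."""
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def overlap_graph(reads, min_overlap):
    """All-against-all comparison: edges (i, j) -> overlap length between reads i and j."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                size = overlap(a, b, min_overlap)
                if size:
                    graph[(i, j)] = size
    return graph


def merge_path(reads, path, graph):
    """Read off the sequence along a path of read indices, skipping overlapped bases."""
    contig = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        contig += reads[nxt][graph[(prev, nxt)]:]
    return contig


reads = ["ACTGGG", "GGGTCC", "TCCAAA"]
graph = overlap_graph(reads, min_overlap=3)
print(merge_path(reads, [0, 1, 2], graph))
```

With error-free reads and no repeats the path is unambiguous; the next slides show why real genomes are harder.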
The tyranny of repeats
● Assembler would output 7 contig sequences
  ○ path is broken at ambiguous decision points
  ○ read/pair length limits ability to resolve repeats
How many contigs will be produced?
More complex graph
[Figure: assembly graph showing contigs and the connections between them]
Reality bites
Shared vertices are repeats.
Scaffolding
● Use paired reads to join contigs
  ○ reads with their mates in different contigs in a consistent manner suggests adjacency
● A difficult constraint problem
  ○ distance between mates ("insert size") variable
  ○ repeats cause ambiguous mate placement
  ○ some assemblers do it, separate scaffolders exist
Contig ordering
● Optical maps
  ○ wet lab method, real experimental evidence
  ○ chromosome sized restriction site map
● Align to reference genome
  ○ fit contigs as well as possible against known reference
  ○ some contigs will fit if split (DNA rearrangement)
  ○ expect orphan contigs (novel DNA)
Genome closure
● Finished genome
  ○ one contig per replicon in original sample
  ○ bacterial chromosomes/plasmids usually circular
● Labour intensive
  ○ design primers around gaps, PCR, Sanger
  ○ Fosmid/BAC libraries for larger inconsistencies
● Why bother?
  ○ no close reference exists
  ○ ensures you didn't miss anything
  ○ understand whole genome architecture
  ○ simplifies all downstream analysis
Annotation
● Annotation is the process of identifying important features in a genome
  ○ gene - protein product, promoter, signal sequences
    ■ ~1000 per Mbp in bacteria, coding dense
  ○ tRNA - transfer RNA
    ■ ~30 per bacterium cover all codons (wobble base)
  ○ rRNA - ribosomal RNA locus
    ■ 1 to 9 per bacterium, fast vs slow growers
  ○ And many more...
    ■ small RNAs, ncRNA, binding sites, tx factors
Annotating proteins
● Homology vs. Similarity
  ○ homology means shared evolutionary ancestry, which often implies similar biological function
  ○ we use sequence similarity as a proxy for homology
  ○ works well for most situations
● Sequence alignment methods
  ○ "Exact" - Needleman-Wunsch, Smith-Waterman
  ○ "Approx" - BLAST, FASTA, and many others!
  ○ Database is sequences: nr, RefSeq, UniProt
● Sequence profile methods
  ○ Build an HMM (model) of aligned sequence families
  ○ HMMER - scan profiles against query protein seq.
  ○ Database is profiles: Pfam, TIGRfams, FigFam
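To show what an "exact" alignment method does, here is a minimal sketch of Needleman-Wunsch global alignment that returns only the optimal score. The scoring scheme (match +1, mismatch -1, linear gap penalty -2) is an assumption chosen for illustration; protein alignment in practice uses substitution matrices such as BLOSUM62 and usually affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b by dynamic programming."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACT"))
```

The table has len(a) x len(b) cells, which is why exact methods are too slow for searching whole databases and heuristics like BLAST are used instead.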
Curation
● Automatic annotation
  ○ more quality databases and models now
  ○ but still flawed
● Manual curation
  ○ Essential for a quality annotation
  ○ Find pseudo, missing, bogus, and broken genes
  ○ Discover mis-assemblies
  ○ Correct mis-annotated protein families
  ○ Fix incorrect start codons
    ■ Bacteria have 3-5 start codons, not just AUG (M)
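The start codon point is easy to see with a toy open reading frame (ORF) scanner. The sketch below accepts ATG, GTG and TTG as starts, which are the most commonly used bacterial start codons; that set, the `min_codons` cut-off, and the function name are assumptions for illustration. It scans the forward strand only, whereas real gene finders scan both strands and use statistical models rather than simple start/stop rules.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set: the most common bacterial starts
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Yield (start, end, frame) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts codons from the start codon up to, but excluding, the stop.
    """
    dna = dna.upper()
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in START_CODONS:
                start = pos                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield (start, pos + 3, frame)
                start = None                     # look for the next ORF in this frame
```

An annotator that only looks for ATG would miss or truncate any gene that starts with GTG or TTG, which is one of the errors manual curation catches.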
Practical Exercise
Go to URL: dna.med.monash.edu.au/~torsten/tmp/msc
Install: Artemis and MEGA
Keep this URL open in a tab.
Wait for further instructions.
Annotation
● Start Artemis
  ○ open Hendra1994.fa
● What is it?
  ○ Hendra virus - 18 kb viral genome
  ○ single-stranded negative-sense RNA (not DNA!!)
  ○ has 6 protein coding regions ("genes")
● Task
  ○ find these genes using Artemis
  ○ use similarity searching to assign a name to the gene
The official annotation
● In Artemis
  ○ download and open Hendra1994.gbk
● Task
  ○ compare to your annotations
● What did you find?
  ○ methionine (M) start codon (ATG)
DNA vs Protein Similarity
● Examine relationships between Paramyxoviridae
  ○ includes Hendravirus already in Artemis
● Open BLAST: http://blast.ncbi.nlm.nih.gov/
● For the Hendravirus:
  ○ use blastn to search nr database for sequences related to the L gene (DNA)
  ○ use blastp to search nr database for sequences related to the L protein (amino acids)
  ○ Any observations??
Phylogeny
● Start MEGA
  ○ download L_para.fas
  ○ multifasta with "L" proteins from 37 similar viruses
● Task:
  ○ Load L_para.fas
  ○ Align sequences (using MUSCLE)
  ○ Infer tree (minimum evolution method)
  ○ Examine relationships
Viral strain comparison
● Hendravirus
  ○ 11 complete genome sequences
  ○ different hosts (bats, horses) and times (1994 onwards)
● Task
  ○ Load hendra11.meg into MEGA
  ○ multiple alignment already done - examine SNPs
  ○ What is the impact of the nucleotide differences?
    ■ look at one SNP in detail
    ■ use Artemis to see if the SNP is in a gene
    ■ does the SNP change the encoded protein?
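The last question in the task can also be answered programmatically. The sketch below classifies a single-base substitution inside a coding sequence using the standard genetic code. It assumes the sequence is given on the coding strand, in frame from position 0, and uses 0-based positions; the function name and the three-way classification labels are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def snp_effect(cds, pos, alt):
    """Classify a substitution at 0-based position pos of an in-frame coding sequence."""
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = pos - codon_start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"        # protein unchanged
    elif alt_aa == "*":
        kind = "nonsense"          # introduces a premature stop codon
    else:
        kind = "non-synonymous"    # amino acid changes
    return ref_aa, alt_aa, kind


cds = "ATGGCTTAT"  # toy sequence: Met-Ala-Tyr
print(snp_effect(cds, 5, "C"))
print(snp_effect(cds, 3, "A"))
print(snp_effect(cds, 8, "A"))
```

Changes at the third codon position are often synonymous, which is worth keeping in mind when comparing SNP counts with protein-level differences.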