[Figure: tree with Leptothorax gredosi, Leptothorax racovitzae and Camponotus herculeanus; branch support values between 0.58 and 1.00]
Thomas Bayes 1702-1761
Bayesian inference
Computational phylogenetics
CSC 10.-12.12.2006
Mikko Kolkkala
How to read a tree?
[Figure: the same tree of ant taxa shown in several successive drawings (tip labels such as Temcur312, Lepgre334, Camher84 and Odosp526); a support value of 100 is marked in the later drawings]
Bayesian inference
Phylogenetic applications only very recently ("Why?" We'll return to that…)
Controversial philosophy
Subjective probability concept; degrees of belief measured as probabilities
A learning process
Prior and posterior probabilities
Spam filters
Subjective! Quack!
$$p(\Theta \mid D) = \frac{p(D \mid \Theta)\, p(\Theta)}{p(D)}$$
p = probability
D = data
Θ = model/hypothesis/parameters
| = read: "provided that"
Conditional probability: "|"
p( a six | loaded die ) = 1/2
An example
Suppose we have ten identical-looking dice: nine ordinary and one loaded so that a six appears with probability 1/2. Let's pick one die at random. The probability of it being loaded is (of course) 1/10 (= prior).
Next, we roll the die once and get a six. What is the probability that we have picked the loaded die now?
$$p(\text{loaded die} \mid \text{a six}) = \frac{p(\text{a six} \mid \text{loaded die}) \cdot p(\text{loaded die})}{p(\text{a six})} = \frac{1/2 \cdot 1/10}{1/2 \cdot 1/10 + 1/6 \cdot 9/10} = \frac{1}{4} \quad (= \text{posterior})$$
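The dice calculation can be checked with a few lines of Python. This is only a sketch: the function name `posterior` and the variable names are made up for illustration, and exact fractions are used so the result is not blurred by rounding.

```python
from fractions import Fraction


def posterior(prior, likelihood, likelihood_alt):
    """Bayes' theorem for a hypothesis H versus its complement.

    prior          = p(H)
    likelihood     = p(data | H)
    likelihood_alt = p(data | not H)
    """
    evidence = likelihood * prior + likelihood_alt * (1 - prior)  # p(data)
    return likelihood * prior / evidence


p_loaded = Fraction(1, 10)     # prior: one loaded die among ten
p_six_loaded = Fraction(1, 2)  # p(a six | loaded die)
p_six_fair = Fraction(1, 6)    # p(a six | ordinary die)

after_one_six = posterior(p_loaded, p_six_loaded, p_six_fair)
print(after_one_six)  # 1/4

# "A learning process": the posterior becomes the prior for the next roll.
after_two_sixes = posterior(after_one_six, p_six_loaded, p_six_fair)
print(after_two_sixes)  # 1/2
```

Rolling a second six with the same die raises the probability from 1/4 to 1/2, which is the sense in which Bayesian inference is a learning process.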
An exercise: a reliable test?
Test for a rare disease (prevalence 0.1 %):
Disease: positive result with probability 0.99
No disease: positive result with probability 0.05
Given a positive test result, what is the probability that the individual tested does not have the disease?
Answer: 0.98 (http://en.wikipedia.org/wiki/Bayesian_inference)
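The answer to the exercise follows from the same formula. A minimal sketch, with hypothetical function and variable names:

```python
def p_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """p(disease | positive test) from Bayes' theorem."""
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_positive


p_sick = p_disease_given_positive(prevalence=0.001,
                                  sensitivity=0.99,
                                  false_positive_rate=0.05)
p_healthy = 1 - p_sick  # p(no disease | positive test)
print(round(p_healthy, 2))  # 0.98
```

Because the disease is rare, the few true positives are swamped by false positives from the large healthy population.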
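$$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model}) \cdot p(\text{model})}{p(\text{data})}$$
"loaded die" = model
"a six" = data
$$p(\Theta \mid D) = \frac{p(D \mid \Theta)\, p(\Theta)}{p(D)}$$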
From dice to biology:
Data: DNA alignment
Models: nucleotide substitution models, tree shape and branch lengths
$$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model}) \cdot p(\text{model})}{p(\text{data})}$$
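$$f(\theta \mid X) = \frac{p(\theta)\, l(X \mid \theta)}{\int p(\theta)\, l(X \mid \theta)\, d\theta}$$
Posterior distribution: f(θ | X)
Prior distribution: p(θ)
Likelihood function: l(X | θ)
$$p(\Theta \mid D) = \frac{p(D \mid \Theta)\, p(\Theta)}{p(D)}$$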
If this Bayesian thing is so excellent, why hasn't it been used in phylogenetic analyses?
No-one can solve the equations!
Numerical solutions possible - but only with powerful computers
MCMC = Markov Chain Monte Carlo
Parameters
• Tree topology
• Branch lengths
• Probabilities for nucleotide substitutions
"Exploring the tree space"
[Figure: probability plotted over parameter space. © Fredrik Ronquist]
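The idea of a chain exploring a probability landscape can be sketched with a plain Metropolis sampler on a one-dimensional toy target. This is not how MrBayes proposes trees; the two-peaked `log_target`, the step size and the seed are all assumptions chosen only to make the example small.

```python
import math
import random


def log_target(x):
    """Unnormalised log density of a hypothetical two-peaked landscape."""
    return math.log(0.5 * math.exp(-0.5 * (x + 2) ** 2)
                    + 0.5 * math.exp(-0.5 * (x - 2) ** 2))


def metropolis(log_p, start, n_steps, step_size, rng):
    """Random-walk Metropolis: propose a nearby state, accept it with
    probability min(1, p(proposal) / p(current))."""
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        log_ratio = log_p(proposal) - log_p(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)  # the current state is recorded every step
    return samples


rng = random.Random(1)
samples = metropolis(log_target, start=0.0, n_steps=20000,
                     step_size=1.0, rng=rng)
kept = samples[5000:]  # discard the early part of the chain
print(sum(kept) / len(kept))
```

Only ratios of the target density are needed, so the normalising constant p(D), the part nobody can solve analytically, never has to be computed.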
Metropolis-Coupled Markov Chain Monte Carlo
MCMCMC = (MC)³
"Heated chains"
"Flattened" parameter landscape
Swap of states
[Figures: (MC)³ illustrations. © John Huelsenbeck]
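The swap step between a cold and a heated chain can be written down compactly. The sketch below assumes the incremental heating scheme β = 1/(1 + T·i) with T = 0.2; treat both the scheme and the value as assumptions, and the function names as hypothetical. A heated chain samples from the posterior raised to the power β, which flattens the landscape.

```python
import math


def heat(chain_index, temp=0.2):
    """Power beta applied to the posterior of chain i (0 = cold chain).
    Incremental heating scheme assumed for illustration."""
    return 1.0 / (1.0 + temp * chain_index)


def swap_acceptance(log_post_i, log_post_j, beta_i, beta_j):
    """Probability of accepting a swap of states between chains i and j.

    The ratio [p(j)^beta_i * p(i)^beta_j] / [p(i)^beta_i * p(j)^beta_j]
    simplifies on the log scale to (beta_i - beta_j) * (log p(j) - log p(i)).
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return math.exp(min(0.0, log_ratio))


cold, hot = heat(0), heat(1)
# The heated chain has found a better state: the swap is always accepted.
print(swap_acceptance(-105.0, -100.0, cold, hot))  # 1.0
# The heated chain sits in a worse state: the swap is accepted only sometimes.
print(swap_acceptance(-100.0, -105.0, cold, hot))
```

Only the cold chain is sampled; the heated chains act as scouts that cross valleys more easily and hand good states over through swaps.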
Probability values (posterior probabilities of clades) directly
No need for bootstrapping
Standard models
JC, F81, K80, HKY85, K81, TrN, TIM, TVM, SYM, GTR
Substitution types: 1-6
Nucleotide frequencies: equal / estimated from the data
Invariable sites: no / estimate ("+I")
Evolutionary rate: equal / Γ-distributed ("+G")
etc.
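JC (Jukes-Cantor)
Rate matrix: all substitution rates equal (a)
      A  C  G  T
  A   .  a  a  a
  C   a  .  a  a
  G   a  a  .  a
  T   a  a  a  .
$$\pi_A = \pi_C = \pi_G = \pi_T = 1/4$$
$$D = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\, p\right)$$
GTR (General time-reversible model)
[Figure: plot residue; the value 0.75 is marked]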
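The Jukes-Cantor correction D = -3/4 ln(1 - 4/3 p) is easy to try out. In this sketch p is the observed proportion of differing sites between two aligned sequences; the helper names and the two short sequences are invented for illustration.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jc_distance(p):
    """Jukes-Cantor distance; only defined while p stays below 3/4."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")  # 2 differences in 10 sites
print(p)               # 0.2
print(jc_distance(p))  # somewhat larger than p: hidden changes are added back
```

As p approaches 0.75 the corrected distance grows without bound, because two random sequences with equal base frequencies are expected to differ at three sites out of four.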
Characters independent? No way.
Time reversible: G→C = C→G?
RNA genes
Coding regions
SSR models (site-specific rates)
Different evolutionary rates for 1st, 2nd and 3rd codon positions
Problematic (see Buckley et al. 2001, Syst. Biol. 50: 67-86)
But how do we choose the model?
Well, nobody said it would be easy.
How many parameters does it take to fit an elephant?
“What do you consider the largest map that would be really useful?"
"About six inches to the mile."
"Only six inches! […]
We actually made a map of the country, on the scale of a mile to the mile!"
(Lewis Carroll 1893)
Choosing a model
AIC (Akaike information criterion)
AICc (corrected Akaike information criterion, for small samples)
BIC (Bayesian information criterion)
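The three criteria all trade the maximised log-likelihood against the number of free parameters k (and, for AICc and BIC, the sample size n). A small sketch using the standard formulas; the log-likelihoods, parameter counts and n below are invented numbers, not results from any data set.

```python
import math


def aic(log_l, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_l


def aicc(log_l, k, n):
    """AIC with the small-sample correction term."""
    return aic(log_l, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_l, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_l


# Hypothetical fits: model name -> (log-likelihood, number of parameters)
fits = {"JC": (-2105.0, 9), "HKY85+G": (-2050.0, 14), "GTR+I+G": (-2046.0, 19)}
n_sites = 500  # hypothetical alignment length used as the sample size

for name, (log_l, k) in fits.items():
    print(name, aic(log_l, k), round(aicc(log_l, k, n_sites), 1),
          round(bic(log_l, k, n_sites), 1))

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
print(best_by_aic)  # HKY85+G
```

In this made-up example the richest model has the best likelihood, yet the middle model wins: the extra parameters of GTR+I+G do not buy enough improvement. Lower values are better for all three criteria.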
Programs:
Modeltest (bad)
FindModel (plop!)
MrAic
?
Redelings, B. D. & Suchard, M. A. 2005: Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54: 401-418
Lunter, G. et al. 2005: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6: 83
Most commonly used program: MrBayes
Future? Alignment and phylogeny co-estimation
BAli-Phy (Redelings & Suchard 2005)
Beast (Lunter et al. 2005)
Travelling salesman
Find the shortest route through cities (another NP-complete problem)

Cities (N)    Routes (N!)
10            10! = 3 628 800
69            69! = 1.7 × 10^98
24 987        24 987! = ?

How about studying them all? At a rate of a million routes per second, 69 cities would take 5 × 10^84 years.
Sweden: 24 987 cities, world record, 84.8 CPU years
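The numbers on this slide can be reproduced with Python's exact integer arithmetic. A sketch, assuming a 365-day year; the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_enumerate(n_cities, routes_per_second=1_000_000):
    """Years needed to inspect all N! orderings at the given rate."""
    return math.factorial(n_cities) / routes_per_second / SECONDS_PER_YEAR


print(math.factorial(10))               # 3628800
print(f"{math.factorial(69):.1e}")      # about 1.7e+98
print(f"{years_to_enumerate(69):.1e}")  # about 5e+84 years

# 24 987! is too large to print usefully; its number of decimal digits
# can be estimated from the log-gamma function instead.
digits = math.floor(math.lgamma(24987 + 1) / math.log(10)) + 1
print(digits)
```

Exhaustive search is hopeless long before realistic problem sizes are reached, which is why tree space, like route space, has to be sampled rather than enumerated.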
Acknowledgements
Fredrik Ronquist
John Huelsenbeck
Wife and Mom
MrBayes
Huelsenbeck, J. & Ronquist, F. 2001: Bioinformatics 17: 754-755 (2005: v. 3.1)
- Command-line interface
- UNIX, Macintosh and PC platforms
Homepage
Manual
Wiki, FAQ
Mailing list (archives)
MrBayes Running the analysis
All you have to do:
Type
execute filename.nex *
at the MrBayes > prompt and press enter
* Replace filename.nex with your nexus-file containing MrBayes commands (type full path if the file is not in the same folder as MrBayes program).
#nexus
begin data;
    dimensions ntax=6 nchar=20;
    format datatype=dna;
    matrix
    Otus1 aaaaaaaaaaaaaaaaaaaa
    Otus2 aaaaaaaaaaaaaaaaaaaa
    Otus3 aaaaaaaaaaaaaaaaaaaa
    Otus4 cccccccccccccccccccc
    Otus5 gggggggggggggggggggg
    Otus6 tttttttttttttttttttt
    ;
end;
begin mrbayes;
    mcmcp ngen=100000 samplefreq=100;
    mcmc;
end;
MrBayes – an example nexus file
A real thing:
MrBayes After the run
Summarize the parameter values, type: sump burnin=
Summarize the trees, type: sumt burnin=
With a proper burnin value
[Figure: burn-in. © Fredrik Ronquist]
Burnin discards the initial samples taken before the analysis reached convergence (burnin=2500 if you have run a million generations, sampled every 100th of them, and want to discard the first 25%).
Note: you have to run "enough" generations
- Check the plot generated by sump; there should be no obvious trends
- The standard deviation of split frequencies should be less than 0.01
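The burnin value is counted in samples, not in generations, which is where the 2500 comes from. A tiny sketch of the arithmetic; the function name is hypothetical and the extra sample taken at generation 0 is ignored.

```python
def burnin_samples(ngen, samplefreq, fraction):
    """Number of sampled states to discard for a given burn-in fraction."""
    n_samples = ngen // samplefreq  # how many states were written to file
    return int(n_samples * fraction)


print(burnin_samples(ngen=1_000_000, samplefreq=100, fraction=0.25))  # 2500
print(burnin_samples(ngen=100_000, samplefreq=100, fraction=0.25))    # 250
```

The second call corresponds to the run length used in the example nexus file (ngen=100000, samplefreq=100).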
MrBayes Models
Restriction: can handle only 24 substitution models
Command, for example: lset nst=6 rates=invgamma
Confused? Try typing: help lset
Priors, command: prset
Defaults (try help prset) should work fine for most analyses
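When the same analysis is repeated with several model settings, it can be convenient to generate the MrBayes block from a script. The helper below is a hypothetical convenience, not part of MrBayes; it only strings together commands that appear on these slides (lset, mcmcp, mcmc, sump, sumt), and the accepted values it checks are a deliberately small subset.

```python
def mrbayes_block(nst=6, rates="invgamma", ngen=100000, samplefreq=100,
                  burnin_fraction=0.25):
    """Return the text of a 'begin mrbayes; ... end;' block."""
    if nst not in (1, 2, 6):
        raise ValueError("nst is expected to be 1, 2 or 6")
    if rates not in ("equal", "gamma", "propinv", "invgamma"):
        raise ValueError("unexpected rates setting: " + rates)
    burnin = int(ngen // samplefreq * burnin_fraction)  # counted in samples
    lines = [
        "begin mrbayes;",
        f"    lset nst={nst} rates={rates};",
        f"    mcmcp ngen={ngen} samplefreq={samplefreq};",
        "    mcmc;",
        f"    sump burnin={burnin};",
        f"    sumt burnin={burnin};",
        "end;",
    ]
    return "\n".join(lines)


block = mrbayes_block()
print(block)
```

The printed block can be appended to a nexus file after the data block and then run with the execute command.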
Cladistic parsimony
Prefer the tree with the fewest evolutionary steps; only parsimony-informative sites count
Otus1 aaaaaaaaaaaaaaaaaaaa
Otus2 aaaaaaaaaaaaaaaaaaaa
Otus3 aaaaaaaaaaaaaaaaaaaa
Otus4 cccccccccccccccccccc
Otus5 gggggggggggggggggggg
Otus6 tttttttttttttttttttt
[Figure: tree with tips Otus1, Otus2, Otus3, Otus4, Otus5 and Otus6]
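Why this little data set is awkward for parsimony can be checked by counting parsimony-informative sites. A site is informative when at least two different states each occur in at least two taxa. The sketch below applies that definition to the six sequences above; gaps and ambiguity codes are not handled, and the function name is made up.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states occur in at least two taxa each."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


alignment = {
    "Otus1": "a" * 20,
    "Otus2": "a" * 20,
    "Otus3": "a" * 20,
    "Otus4": "c" * 20,
    "Otus5": "g" * 20,
    "Otus6": "t" * 20,
}

columns = list(zip(*alignment.values()))  # one tuple of states per site
n_informative = sum(is_parsimony_informative(col) for col in columns)
print(n_informative)  # 0
```

Every site has "a" in three taxa and a unique state in each of the other three, so none of the 20 sites is parsimony-informative and parsimony has nothing to choose between trees with.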
Fain and Houde 2004: Evolution 58: 2558-2573
Exercises:
1. Study program defaults with the help command (e.g. lset and prset)
2. Run the program with a few arbitrary sequences (e.g. palikka.nex)
- Try the sump and sumt commands with different burnin values
- Study the files made by the program: where is the tree?
3. Run the program with some real data (e.g. your own or birds.txt)
- Align the sequences
- Put them into a nexus file
- Try to find out how to select the JC, K2P and GTR models, with and without gamma-distributed rate variation, and with and without correction for invariable sites
- Try the model suggested by FindModel (AIC criterion)