Finding good models of molecular evolution in phylogenetics
Rob Lanfear
Australian National University,
National Evolutionary Synthesis Centre, USA
Acknowledgements
Simon Ho
Stephane Guindon
Brett Calcott
Alexis Stamatakis
[Figure: a DNA sequence alignment and the three components of a substitution model]
Rate Matrix: rates of change between A, C, G and T
Base Frequencies: πA + πC + πG + πT = 1
Site Rates: + I + G
[Figure: the six substitution rates between A, C, G and T, labelled a to f]
JC: a=b=c=d=e=f; πA=πC=πG=πT; no I or G; 0 free parameters
GTR+I+G: a, b, c, d, e, f; πA, πC, πG, πT; I, G; 10 free parameters
GTR: a, b, c, d, e, f; πA, πC, πG, πT; no I or G; 8 free parameters
HKY: a=c=d=f, b=e; πA, πC, πG, πT; no I or G; 4 free parameters
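The parameter counts above follow from a simple rule, sketched here in Python. The function name and its arguments are illustrative assumptions, not part of any phylogenetics package: rates are relative, so one rate class is fixed; four base frequencies that sum to 1 leave three free; +I and +G add one parameter each.

```python
def free_parameters(rate_classes, equal_base_freqs,
                    invariant_sites=False, gamma=False):
    """Count the free parameters of a nucleotide substitution model.

    rate_classes: number of distinct values among the six rates a..f
    equal_base_freqs: True if the four base frequencies are fixed equal
    """
    k = rate_classes - 1                 # rates are relative: one class is fixed
    k += 0 if equal_base_freqs else 3    # frequencies sum to 1, so 3 are free
    k += int(invariant_sites)            # +I: proportion of invariant sites
    k += int(gamma)                      # +G: gamma shape parameter
    return k


models = {
    "JC": free_parameters(1, True),
    "HKY": free_parameters(2, False),
    "GTR": free_parameters(6, False),
    "GTR+I+G": free_parameters(6, False, invariant_sites=True, gamma=True),
}
```

The resulting counts match the four models listed above.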
Model Selection
Compare all models.
1. Calculate the likelihood of each model
2. Penalise models with more parameters, e.g. Bayesian Information Criterion (BIC)
3. Use the model with the smallest BIC score
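The three steps can be sketched in a few lines of Python, using the standard definition BIC = -2 lnL + k ln(n), where k is the number of free parameters and n the number of sites. The log-likelihoods below are made-up numbers for illustration only, and the sketch assumes k counts model parameters only (in practice branch lengths are usually counted too, which adds the same amount to every model on the same tree).

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def best_model(candidates, n_sites):
    """Return the name of the lowest-BIC model and all scores.

    candidates: dict mapping model name -> (log_likelihood, n_params)
    """
    scores = {name: bic(lnl, k, n_sites)
              for name, (lnl, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical log-likelihoods for a 1,000-site alignment (not real data).
candidates = {
    "JC": (-5200.0, 0),
    "HKY": (-5100.0, 4),
    "GTR": (-5095.0, 8),
    "GTR+I+G": (-5050.0, 10),
}
best, scores = best_model(candidates, n_sites=1000)
```

With these invented values the likelihood gain of GTR+I+G outweighs its penalty, which mirrors the pattern described on the next slide.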
The Problem
Almost always select GTR+I+G (the most complex model)
“like an overweight man shopping in the women's petites department”
Gatesy J, Trends Ecol Evol 2007, 22:509-10
Partitioning
[Figure: one GTR+I+G model applied to the whole alignment]
GTR+I+G: a, b, c, d, e, f; πA, πC, πG, πT; I, G; 10 free parameters
[Figure: the alignment split into two subsets, each with its own GTR+I+G model]
GTR+I+G: a, b, c, d, e, f; πA, πC, πG, πT; I, G; 10 free parameters
GTR+I+G: a, b, c, d, e, f; πA, πC, πG, πT; I, G; 10 free parameters
[Figure: alignment of five species (Spp1 to Spp5) spanning Gene 1, Gene 2 and Gene 3, shown with 9, 6 and 2 subsets]
A Solution
Compare all possible partitioning schemes.
1. Calculate the likelihood of each scheme
2. Penalise schemes with more parameters, e.g. Bayesian Information Criterion (BIC)
3. Use the scheme with the smallest BIC score
The Problem
Many ways to partition a dataset
Many models (HKY, GTR) for each subset
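To give a sense of scale: the number of ways to group n data blocks into subsets is the Bell number B(n). The sketch below computes it with the Bell triangle; the function name is an illustrative choice. It counts groupings only, so choosing a model for each subset multiplies the search space further.

```python
def bell_number(n):
    """Number of ways to partition a set of n items (Bell number B(n))."""
    row = [1]                        # first row of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# Three genes can be grouped in 5 ways; nine blocks already give 21,147.
small = [bell_number(n) for n in (3, 9)]

# For a dataset with 87 data blocks the count is far too large to enumerate.
digits_for_87_blocks = len(str(bell_number(87)))
```

This is why comparing all possible schemes exhaustively is only feasible for small numbers of data blocks.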
PartitionFinder
www.robertlanfear.com/partitionfinder
15,404 sites from whole mitochondrial genomes
87 data blocks
8,000 unit improvement in BIC score
Future directions
1. Genome scale analyses (finished)
2. Cloud computing (started)
3. GUI (planned)
4. Better algorithms (planned)