Intraspecific phylogenetics
and networks
David Posada
University of Vigo, Spain
EPSCoR Workshop on Phylogenetics
University of Hawaii
1-3 August 2005, O’ahu, Hawaii
Phylogenies and genealogies
Phylogeny
Phylogeny: hierarchical relationships among organisms
- Caused by speciation/extinction (organismal level)
- Caused by gene duplication (genetic level)
- Descent with modification is a hierarchy-producing process.
Relationships among species are hierarchical.
-> Phylogenetics
Tokogeny
Tokogeny: non-hierarchical relationships among organisms
- Caused by sexual reproduction (organismal level)
- Caused by recombination (genetic level)
Relationships among individuals within sexual species are non-hierarchical.
Relationships among recombining genes are non-hierarchical.
-> Classic population genetics
Genealogies (I)-(VII)
Genealogy
A genealogy is a representation of the
history of a sample of genes, independent
of the process of mutation.
In reality we can only estimate those branches marked by mutations; that is, haplotype trees.
Haplotype trees
We cannot infer that A1 is closer to A4 than to A2, for example.
Gene tree
Haplotype tree
Population trees: migration
Haplotype trees and population trees can have different histories due to migration or lineage sorting.
Population trees: lineage
sorting
Haplotype trees and population trees can also have different histories due to lineage sorting.
Intraspecific data
Intraspecific data show some particularities [1]:
1. Low divergence.
2. Ancestral haplotypes are very likely to persist in the population.
3. Real polytomies or multifurcations.
4. Recombination generates homoplasy.
Reticulation
Traditional phylogenetics
Network methods can provide a useful alternative to standard phylogenetic methods (ML, MP, NJ) for estimating intraspecific phylogenies.
Phylogenetic networks
Phylogenetic network methods are able to
display multifurcations and extant internal
nodes, and explicitly represent phylogenetic
conflict through reticulation.
Network methods (I)
Pyramids [2]
- Hierarchical clustering framework.
- Clades can overlap.
- Reticulations are allowed among tips.
- Implemented in PYRAMIDS.
Reticulograms [3]
- Adds reticulations to an existing binary tree.
- Fit of network to the data (least squares).
- Implemented in T-REX.
Statistical geometry [3,4]
- Average quartet geometry.
- Implemented in STATGEOM and GEOMETRY.
Split decomposition [5]
- Weighted consensus of non-worst quartet splits.
- Implemented in SPLITSTREE, JSPLITS and SPECTRONET.
Network methods (II)
NeighborNet [6]
- Hybrid between split decomposition and NJ.
- Agglomerates pairs of pairs that share a node into a circular split system.
- Implemented in SPLITSTREE and JSPLITS.
Median networks [7]
- Constructed by adding median vectors (consensus of triplets).
- Can be reduced or pruned afterwards using frequency information (reduced median, RM).
- Not implemented.
Median-joining networks [8,9]
- Refinement of RM for multistate characters and large data sets.
- Median vectors for optimal triplets are added to a minimum spanning network (MSN).
- Implemented in NETWORK.
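As a rough illustration of what a median vector is, the sketch below takes the per-site majority state of three aligned haplotypes. The function name `median_vector` and the tie-handling rule are assumptions made for this example; it is not the procedure used by NETWORK or by any published median-network program.

```python
from collections import Counter


def median_vector(a, b, c):
    """Per-site majority consensus of three aligned haplotypes (a sketch)."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("haplotypes must be aligned to the same length")
    median = []
    for states in zip(a, b, c):
        state, count = Counter(states).most_common(1)[0]
        # With three different states there is no majority. Keeping the state
        # of the first haplotype is an arbitrary choice made for this sketch.
        median.append(state if count >= 2 else states[0])
    return "".join(median)


if __name__ == "__main__":
    # Binary characters: every site has a well-defined majority.
    print(median_vector("0011", "0101", "1001"))
```

For binary data the majority always exists, so the median of a triplet is unique; with multistate characters a site where all three haplotypes differ has no unique median, which is why the sketch needs an explicit (and here arbitrary) rule.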
Network methods (III)
Statistical parsimony [10]
- Local parsimony connections are made until a network or a set of networks is constructed.
- Implemented in TCS.
Molecular variance parsimony [11]
- The optimal network is the MST or MSN that minimizes a set of population statistics.
- These statistics are based on haplotype frequency, distance, and geographic distribution.
- Implemented in ARLEQUIN.
Netting [12]
- Closest haplotypes are successively joined.
- In case of homoplasy, a new dimension is added to the network.
- Not implemented.
Network methods (IV)
Likelihood [13]
- For directed graphs.
- Needs a good heuristic search.
- Likelihood implemented in PAL, but search not implemented.
Comparison of network methods
Method                       | Category             | Software           | Speed     | Input data | Model of evolution | Statistical assessment
Pyramids                     | Distance             | Pyramids           | Fast      | Distances  | Yes                | No
Statistical geometry         | Distance, Invariants | Geometry, Statgeom | Fast      | Multistate | Yes                | Yes
Split decomposition          | Distance, Parsimony  | SplitsTree         | Fast      | Multistate | Yes                | Yes
Median networks              | Distance             | No                 | Slow      | Binary     | No                 | No
Median-joining networks      | Distance             | Network            | Very fast | Multistate | No                 | No
Statistical parsimony        | Distance             | TCS                | Fast      | Multistate | No                 | Yes
NeighborNet                  | Distance             | SplitsTree         | Very fast | Multistate | Yes                | Yes
Molecular variance parsimony | Distance             | Arlequin           | Fast      | Multistate | Yes                | Yes
Netting                      | Distance             | No                 | Slow      | Multistate | No                 | No
Likelihood network           | Likelihood           | PAL                | Slow      | Multistate | Yes                | Yes
Reticulate phylogeny         | Least squares        | No                 | Slow      | Distances* | Yes                | Yes
Reticulogram                 | Least squares        | T-rex              | Fast      | Distances  | No                 | Yes
* distances estimated from gene frequency data
Network methods can give
different results
Network performance
A few empirical studies compare the
networks inferred by different methods.
The absolute or relative performance of
network methods has never been evaluated
through computer simulations.
Statistical parsimony (I)
Calculate the pairwise haplotype distance matrix.
Estimate the parsimony connection limit, defined as the maximum number of differences between two haplotypes that ensures, with a probability >= 0.95, that no superimposed mutations have occurred.
Do not consider connections above this limit.
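$$\hat{P}_j = \prod_{i=1}^{j} (1 - \hat{q}_i)$$
$$\hat{q}_i = \frac{\int_0^1 q_1 L(j,m)\,dq_1}{\int_0^1 L(j,m)\,dq_1}$$
$$L(j,m) = (2q_1)^{j-1} (1 - q_1)^{2m+1} \left[1 - \frac{q_1}{br}\right] \times \left[2 - \frac{q_1(br+1)}{br}\right]^{j-1} \left\{1 - 2q_1\left[1 - \frac{q_1}{br}\right]\right\}$$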
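The first steps of the procedure can be sketched in a few lines of Python. This is a minimal sketch, not the TCS implementation: distances are plain counts of differing sites, and the per-step estimates of q_i are taken as given inputs (the list `q_hat` is hypothetical; computing it would require evaluating the integrals on this slide). The function names are made up for the example.

```python
def hamming(a, b):
    """Number of sites at which two aligned haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(haplotypes):
    """Symmetric matrix of pairwise differences between haplotypes."""
    n = len(haplotypes)
    return [[hamming(haplotypes[i], haplotypes[j]) for j in range(n)]
            for i in range(n)]


def connection_limit(q_hat, threshold=0.95):
    """Largest j such that P_j = prod_{i=1..j} (1 - q_i) >= threshold.

    q_hat[i - 1] is an assumed, precomputed estimate of q_i.
    """
    limit = 0
    p_j = 1.0
    for j, q in enumerate(q_hat, start=1):
        p_j *= 1.0 - q
        if p_j < threshold:
            break  # P_j can only decrease as j grows
        limit = j
    return limit


if __name__ == "__main__":
    haplotypes = ["AAAA", "AAAT", "AATT"]
    print(distance_matrix(haplotypes))
    print(connection_limit([0.01, 0.02, 0.03]))
```

With the illustrative values above, P_1 = 0.99 and P_2 = 0.9702 pass the 0.95 threshold while P_3 = 0.9411 does not, so connections of more than two steps would not be considered.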
Statistical parsimony (II)
Find, for each haplotype, its minimum connection(s) to other haplotypes.
Make the minimum connections, and add
missing haplotypes (represented as zeroes
or small circles). Never make a connection
that implies that a network distance is
smaller than the observed distance.
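Finding the minimum connections might look like the sketch below, which assumes a precomputed distance matrix and connection limit. The function name and the dictionary return format are choices made for this example, and the step of inserting missing haplotypes is left out.

```python
def minimum_connections(dist, limit):
    """For each haplotype, list the haplotypes at its smallest distance.

    Connections longer than `limit` steps are ignored, so a haplotype
    with no neighbor within the limit gets an empty list.
    """
    connections = {}
    for i, row in enumerate(dist):
        candidates = [(d, j) for j, d in enumerate(row)
                      if j != i and d <= limit]
        if not candidates:
            connections[i] = []
            continue
        best = min(d for d, _ in candidates)
        connections[i] = [j for d, j in candidates if d == best]
    return connections


if __name__ == "__main__":
    dist = [[0, 1, 2, 5],
            [1, 0, 1, 4],
            [2, 1, 0, 3],
            [5, 4, 3, 0]]
    print(minimum_connections(dist, limit=3))
```

A connection of d steps between two sampled haplotypes implies d - 1 missing intermediates, which is where the zeroes or small circles in the drawn network come from.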
Statistical parsimony (III)
If there are several options to make that
connection, do it in a way that agrees as
much as possible with the distances in the
observed distance matrix. So we minimize
$$W = \sum_{i=0}^{h-1} \sum_{j=i+1}^{h} (N_{ij} - D_{ij})\, P_{D_{ij}}$$
h = number of haplotypes
N_ij = minimum network distance between haplotypes i and j
D_ij = observed distance between haplotypes i and j
P_Dij = probability of no superimposed mutations between two haplotypes differing by D_ij steps
We will select the connecting alternative
with smallest W.
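A minimal sketch of the W criterion follows, assuming the network and observed distances are given as square matrices and the probabilities P_D as a dictionary keyed by distance (all three inputs, and the function name, are hypothetical). The code sums over every unordered pair of haplotypes with zero-based indices, which is how the double sum on the slide is read here.

```python
def network_disagreement(network_dist, observed_dist, p_parsimony):
    """W = sum over pairs i < j of (N_ij - D_ij) * P_{D_ij}."""
    h = len(observed_dist)
    w = 0.0
    for i in range(h):
        for j in range(i + 1, h):
            d = observed_dist[i][j]
            w += (network_dist[i][j] - d) * p_parsimony[d]
    return w


if __name__ == "__main__":
    observed = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
    p_parsimony = {1: 0.99, 2: 0.97}  # illustrative values only
    option_a = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # matches observed distances
    option_b = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]  # stretches the 0-2 path
    for name, net in (("A", option_a), ("B", option_b)):
        print(name, network_disagreement(net, observed, p_parsimony))
```

Option A reproduces the observed distances exactly and scores W = 0, so it would be preferred over option B, whose longer path between the first and third haplotypes is penalized in proportion to P_2.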
Coalescent theory results
Some results from neutral coalescent theory related to the frequency and distribution of haplotypes are relevant for constructing and interpreting intraspecific phylogenies.
There is a direct relationship between the frequency and the age of a haplotype [14,15]. Haplotypes that have persisted for a long time in the population will tend to show higher frequencies than more recent haplotypes, and new haplotypes will arise from old ones [16]. Also, young haplotypes will tend to stay in the population where they first appeared [17].
We can establish several explicit predictions:
Neutral coalescent theory
predictions
1. Older haplotypes tend to have higher
frequency
2. Older haplotypes tend to have wider
geographic distribution
3. Older haplotypes tend to be interior in the
network
4. Older haplotypes tend to be more
connected in the network
5. Singletons tend to connect to non-
singletons
6. Singletons tend to connect to haplotypes
in the same population
Rooting
Rooting networks is difficult.
Assuming neutrality, we can assign outgroup weights to haplotypes in the sample [18] using their absolute frequency (f_i) and the sum of k neighbor frequencies (v_j).
Solving loops (I)
Neutral coalescent predictions can be used
to solve ambiguities or loops
Solving loops (II)
We can define some objective functions in
the statistical parsimony framework:
- Minimize network disagreement
$$W = \sum_{i=0}^{h-1} \sum_{j=i+1}^{h} (N_{ij} - D_{ij})\, P_{D_{ij}}$$
h = number of haplotypes
N_ij = minimum network distance between haplotypes i and j
D_ij = observed distance between haplotypes i and j
P_Dij = probability of no superimposed mutations between two haplotypes differing by D_ij steps
- Maximize topological/frequency criterion
$$T = \sum_{i=0}^{h} \left[ t\,(1 - p_i) + (1 - t)\,p_i \right], \qquad t = \begin{cases} 1 & \text{if haplotype } i \text{ is a tip} \\ 0 & \text{if haplotype } i \text{ is interior} \end{cases}$$
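The topological/frequency criterion can be sketched as below. The slide does not define p_i; the example assumes it is the relative frequency of haplotype i in the sample, and both the function name and the input format are hypothetical.

```python
def topology_frequency_score(p, is_tip):
    """T = sum over haplotypes of (1 - p_i) for tips and p_i for interior nodes.

    p      : assumed relative frequencies of the haplotypes
    is_tip : True where the haplotype is a tip in the candidate network
    """
    if len(p) != len(is_tip):
        raise ValueError("p and is_tip must have the same length")
    return sum((1.0 - p_i) if tip else p_i for p_i, tip in zip(p, is_tip))


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]
    # Two ways of breaking a loop: the common haplotype ends up interior (A)
    # or ends up as a tip (B).
    score_a = topology_frequency_score(p, [False, True, True])
    score_b = topology_frequency_score(p, [True, False, True])
    print(score_a, score_b)
```

Under this reading, the resolution that places the frequent haplotype in the interior scores higher, in line with the prediction that older, more frequent haplotypes tend to be interior in the network.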
Solving loops (III)
Solving loops (IV)
Network applications
Detect recombination [19,20]
Species delimitation [21,22]
Speciation modes [23]
Population history (NCA) [24]
Genotype/phenotype association [25]
Higher-level phylogenetics [26]
Take home
- Network methods can be very appropriate for the representation of intraspecific phylogenies.
- Their current performance is unknown. At least in the absence of recombination they should work well.
- Several programs exist that implement these methods.
Phylogenetic network software
(I)
STATGEOM [27] implements statistical geometry. Data: DNA, RNA, protein, binary data. Distribution: C code. URL: http://www.zbit.uni-tuebingen.de/pas/kay_en.htm
SPLITSTREE [28] implements split decomposition. Data: DNA, RNA. Executables: Windows, Unix, Mac. URL: http://www.mathematik.uni-bielefeld.de/~huson/phylogenetics/splitstree.html
JSPLITS is a new version of the SplitsTree program written in Java. It runs on any platform that provides a Java runtime environment. URL: http://www-ab.informatik.uni-tuebingen.de/software/jsplits/welcome_en.html
SPECTRONET implements median networks and other tools. Data: DNA. Executables: Windows. URL: http://awcmee.massey.ac.nz/spectronet/
Phylogenetic network software
(II)
NETWORK [29] implements median-joining networks. Data: DNA, RFLPs. Executables: Windows, DOS. URL: http://www.fluxus-engineering.com/sharenet.htm
ARLEQUIN [30] implements molecular variance parsimony. Data: DNA, RNA, microsatellites. Executables: Mac, Windows and Unix. URL: http://lgb.unige.ch/arlequin/
TCS [31] implements statistical parsimony. Data: DNA, RNA, distances. Executables: Mac and Windows. URL: http://inbio.byu.edu/Faculty/kac/crandall_lab/tcs.htm and http://darwin.uvigo.es/
PAL [32] calculates the likelihood of a network. Data: network. Executables: Mac and Windows. URL: http://www.cebl.auckland.ac.nz/pal-project/
References
1. Posada, D. & Crandall, K.A. Intraspecific gene
genealogies: trees grafting into networks. Trends in Ecology and Evolution 16, 37-45 (2001).
2. Diday, E. & Bertrand, P. An extension of hierarchical
clustering: the pyramidal representation. in Pattern recognition in practice (eds. Gelsema, E.S. & Kanal, L.N.)
411-424 (North-Holland, Amsterdam, 1986).
3. Eigen, M., Winkler-Oswatitsch, R. & Dress, A. Statistical
geometry in sequence space: a method of quantitative
sequence analysis. Proceedings of the National Academy
of Sciences, U.S.A. 85, 5917 (1988).
4. Nieselt-Struwe, K. Graphs in sequence spaces: a review
of statistical geometry. Biophysical Chemistry 66, 111-
131 (1997).
5. Bandelt, H.-J. & Dress, A.W.M. Split decomposition: a
new and useful approach to phylogenetic analysis of
distance data. Molecular Phylogenetics and Evolution 1,
242-252 (1992).
6. Bryant, D. & Moulton, V. NeighborNet: an agglomerative
method for the construction of planar phylogenetic
networks. in Workshop in Algorithms for Bioinformatics
(2002).
7. Bandelt, H.-J., Macaulay, V. & Richards, M. Median
networks: Speedy construction and greedy reduction,
one simulation, and two case studies from human
mtDNA. Molecular Phylogenetics and Evolution 16, 8-28
(2000).
8. Bandelt, H.-J., Forster, P. & Röhl, A. Median-joining
networks for inferring intraspecific phylogenies.
Molecular Biology and Evolution 16, 37 (1999).
9. Foulds, L.R., Hendy, M.D. & Penny, D. A graph theoretic
approach to the development of minimal phylogenetic
trees. Journal of Molecular Evolution 13, 127-149
(1979).
10. Templeton, A.R., Crandall, K.A. & Sing, C.F. A cladistic
analysis of phenotypic associations with haplotypes
inferred from restriction endonuclease mapping and DNA
sequence data. III. Cladogram estimation. Genetics 132,
619-633 (1992).
11. Excoffier, L. & Smouse, P.E. Using allele frequencies and
geographic subdivision to reconstruct gene trees within
a species: molecular variance parsimony. Genetics 136,
343-359 (1994).
12. Fitch, W.M. Networks and viral evolution. Journal of Molecular Evolution 44, S65-S75 (1997).
13. Strimmer, K. & Moulton, V. Likelihood analysis of
phylogenetic networks using directed graphical models.
Molecular Biology and Evolution 17, 875-881 (2000).
14. Watterson, G.A. & Guess, H.A. Is the most frequent allele
the oldest? Theoretical Population Biology 11, 141-160
(1977).
15. Donnelly, P. & Tavaré, S. The ages of alleles and a
coalescent. Advances in Applied Probability 18, 1-19
(1986).
16. Excoffier, L. & Langaney, A. Origin and differentiation of
human mitochondrial DNA. American Journal of Human Genetics 44, 73-85 (1989).
17. Watterson, G.A. The genetic divergence of two
populations. Theoretical Population Biology 27, 298-317
(1985).
18. Castelloe, J. & Templeton, A.R. Root probabilities for
intraspecific gene trees under neutral coalescent theory.
Molecular Phylogenetics and Evolution 3, 102-113
(1994).
19. Holmes, E.C., Urwin, R. & Maiden, M.C.J. The influence of
recombination on the population structure and evolution
of the human pathogen Neisseria meningitidis. Molecular Biology and Evolution 16, 741-749 (1999).
20. Templeton, A.R. et al. Recombinational and mutational
hotspots within the human lipoprotein lipase gene.
American Journal of Human Genetics 66, 69-83 (2000).
21. Templeton, A.R. Using phylogeographic analyses of gene
trees to test species status and processes. Molecular Ecology 10, 779-91 (2001).
22. Shaw, K.L. A nested analysis of song groups and species
boundaries in the Hawaiian cricket genus Laupala.
Molecular Phylogenetics and Evolution 11, 332-341
(1999).
23. Barraclough, T.G. & Vogler, A.P. Detecting the
geographical pattern of speciation from species-level
phylogenies. The American Naturalist 155, 419-434
(2000).
24. Gómez-Zurita, J., Petitpierre, E. & Juan, C. Nested
cladistic analysis, phylogeography and speciation in the
Timarcha goettingensis complex (Coleoptera,
Chrysomelidae). Molecular Ecology 9, 557-560 (2000).
25. Sing, C.F., Haviland, M.B., Zerba, K.E. & Templeton, A.R.
Application of cladistics to the analysis of genotype-
phenotype relationships. European Journal of Epidemiology 8, 3-9 (1992).
26. Crandall, K.A. Intraspecific cladogram estimation:
Accuracy at higher levels of divergence. Systematic Biology 43, 222-235 (1994).
27. Nieselt-Struwe, K. STATGEOM. 1.0 edn (Department of
Physics, University of Auckland, Auckland, New Zealand,
2000).
28. Huson, D.H. SplitsTree: analyzing and visualizing
evolutionary data. Bioinformatics 14, 68-73 (1998).
29. Röhl, A. Network. A program package for phylogenetic
networks. (Mathematisches Seminar, Universität
Hamburg, Hamburg, Germany, 1997).
30. Schneider, S., Roessli, D. & Excoffier, L. Arlequin: A
software for population genetic data analysis. 2.000 edn
(Genetics and Biometry Lab, Dept. of Anthropology,
University of Geneva, Geneva, 2000).
31. Clement, M., Posada, D. & Crandall, K.A. TCS: a computer
program to estimate gene genealogies. Molecular Ecology 9, 1657-9 (2000).
32. Drummond, A. & Strimmer, K. PAL: an object-oriented
programming library for molecular evolution and
phylogenetics. Bioinformatics 17, 662-663 (2001).