Data: how much is needed?
more sequence or more individuals, to combine or not?
14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)
Schedule
The trivial truth
◦ All extant species
◦ The whole genome
Impractical? Well, then
◦ As many species as possible
◦ As much data as possible
How much data?
Finite constraints on resources (time, money)
◦ Know your group – which taxa are the most relevant for your study?
◦ Know what gene sequences are available from previous studies
Choosing taxa or data
The days of single gene datasets are over
Mitochondrial and chloroplast DNA have been popular because they are easy to amplify and sequence
It is worth increasing the number of nuclear genes
One should aim for at least 3 genes, preferably more (maybe 10?)
Number of genes
It is now possible to increase the number of genes being sequenced significantly
Whole genome analyses will allow us to understand:
◦ Intron-exon boundary dynamics
◦ Gene duplication-deletion dynamics
◦ Gene transfer dynamics
Soon we will have a good understanding of the regions of the genome that are most suitable for systematics
Phylogenomics
Sometimes not all genes amplify from all samples
◦ Should these samples be discarded?
Increased taxon sampling, despite missing data, increases resolution
All possible data should be used!
Missing data?
Can separate independent data sets be combined for analysis?
How can we assess the possibility of conflict between different data?
What does the potential conflict then mean?
To combine or not to combine?
For instance
◦ Different genes may have different phylogenetic signal (different history?)
What is the problem?
If both genes have equally strong signal
Possible effects on results
If one gene has a stronger signal than the other
Possible effects on results
Never combine
Combine sometimes
Always combine
Schools of thought
The different data sets may represent different evolutionary histories (e.g. different selection pressures)
Big data sets dominate small data sets
When analyzed separately, the different data sets can be tests of each other's phylogenetic hypotheses
Never combine!
Consensus trees of separate analyses
[Figure: tree from data set A + tree from data set B = their consensus, for taxa A to H]
My own experience:
Would be fantastic to get genealogical histories of individual genes
But!
◦ Single genes are generally short, 1000-2000 bases
◦ Lots of homoplasy
◦ Unreliable phylogenies
Problems with the approach
If the data sets are congruent, combine them
If the data sets are incongruent, don’t combine them
One can use the ILD test to decide whether data sets are incongruent
Well, sometimes you can combine...
If there is no conflict between data sets:
◦ The length of the most parsimonious tree from the combined data [L(x+y)] is equal to the sum of the lengths of the MP trees from the separately analyzed data [L(x) + L(y)]
Dxy = L(x+y) - (L(x) + L(y))
Dxy = 0
(Farris et al 1994)
ILD (Incongruence Length Difference)
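The ILD value is plain arithmetic on three tree lengths. The sketch below assumes the lengths of the most parsimonious trees have already been obtained from a parsimony program; the function name and the example lengths are illustrative only and do not come from any real data set.

```python
def ild(length_combined, length_x, length_y):
    """Incongruence length difference: Dxy = L(x+y) - (L(x) + L(y))."""
    return length_combined - (length_x + length_y)


# Made-up tree lengths, for illustration only
print(ild(250, 120, 130))  # 0: combining adds no extra steps
print(ild(262, 120, 130))  # 12: combining adds 12 extra steps of homoplasy
```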
Combining the data sets leads to increased homoplasy
But is it statistically significant?
Can be tested with the Mann-Whitney U test, where the null hypothesis is that the data sets are combinable
If Dxy > 0
Original: Data set x, Data set y
Combine data: Data sets x + y
Sample randomly to get equally large data sets: Data set p, Data set q
Search for MP trees and calculate Dpq values
Repeat many times (e.g. 1000), which gives us a distribution for the value of D
Compare whether Dxy differs from the random distribution at P < 0.05
However:
◦ The ILD test is sensitive to the relative sizes of the compared data sets and to the evolutionary history of the different data sets
For the randomly generated data sets
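The resampling procedure can be sketched roughly as follows. This is only an outline under stated assumptions: `mp_length` is a hypothetical, caller-supplied function that returns the length of the most parsimonious tree for a list of alignment columns (in practice a parsimony program does this), and the p-value here is a simple permutation proportion rather than the Mann-Whitney U test named in the slides.

```python
import random


def ild_null_distribution(cols_x, cols_y, mp_length, replicates=1000, seed=0):
    """D values for random partitions p, q with the same sizes as x and y.

    cols_x, cols_y: lists of alignment columns (one string per character).
    mp_length: hypothetical caller-supplied function giving the length of
               the most parsimonious tree for a list of columns.
    """
    rng = random.Random(seed)
    pooled = list(cols_x) + list(cols_y)
    length_combined = mp_length(pooled)  # identical for every replicate
    null = []
    for _ in range(replicates):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        p, q = shuffled[:len(cols_x)], shuffled[len(cols_x):]
        null.append(length_combined - (mp_length(p) + mp_length(q)))
    return null


def permutation_p_value(d_observed, null):
    """Share of random partitions whose D is at least as large as Dxy."""
    return (sum(d >= d_observed for d in null) + 1) / (len(null) + 1)
```

With 1000 replicates, the observed Dxy would be compared against the 1000 values in `null`; the tree search hidden inside `mp_length` dominates the running time.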
But what if the conflict is only partial?
Combining all available data leads to more resolved trees = the combined data has higher explanatory power
“Hidden support” can only be detected through combined analysis
Conflicts at different nodes can only be discovered in a combined analysis framework
The effects of combined analysis can be investigated using indices related to Bremer support
Always combine!
Partitioned Bremer Support (PBS)
◦ Baker & DeSalle 1997: Syst Biol 46:654
Partition Congruence Index (PCI)
◦ Brower 2006: Cladistics 22:378
Hidden Bremer Support (HBS)
◦ Gatesy et al 1999: Cladistics 15:271
Indices related to Bremer Support
The different data partitions in a data set contribute to the Bremer support in an additive way
For each node:
◦ A negative Partitioned Bremer support value indicates conflict
◦ A positive Partitioned Bremer support value indicates congruence
PBS (Partitioned Bremer Support)
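Because the partition values are additive, a node's Bremer support can be recovered by summing its PBS values, and the sign of each value flags conflict or congruence. A minimal sketch, assuming the PBS values per partition have already been computed elsewhere; the partition names and numbers below are hypothetical.

```python
def summarize_pbs(pbs_by_partition):
    """Return (Bremer support for the node, names of conflicting partitions)."""
    total = sum(pbs_by_partition.values())
    conflicting = sorted(name for name, value in pbs_by_partition.items()
                         if value < 0)
    return total, conflicting


# Hypothetical PBS values for one node, three partitions
node = {"gene_a": 5, "gene_b": -2, "gene_c": 4}
print(summarize_pbs(node))  # (7, ['gene_b'])
```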
PBS in practice
[Figure: tree nodes annotated with Bremer support and partitioned Bremer support values for four partitions: Morphology, COI, EF1a, Wgl]
PCI tells us about the magnitude of conflict between data partitions in a combined analysis
PCI is always equal to or less than BS for a given branch
PCI = BS when there is no conflict
PCI is negative when BS is low because of strong conflicts between data partitions
Partition Congruence Index
Brower 2006: Cladistics 22:378-386
Underlying phylogenetic signal can be confounded by homoplasy in separate analyses
Combining datasets can bring out this signal, as homoplasy is largely random noise
Can be measured using HBS and Partitioned HBS
Hidden support
Hidden support can be defined as increased support for the node of interest in the simultaneous analysis of all data partitions relative to the sum of support for that node in the separate analyses of each partition
Hidden support
For a particular combined data set and a particular node, HBS is the difference between BS for that node in the combined analysis and the sum of BS values for that node from each data partition
Measuring hidden support
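The definition of HBS translates directly into a subtraction. The sketch below assumes the Bremer support values for the node have already been obtained from the combined analysis and from each separate analysis; the numbers are hypothetical.

```python
def hidden_bremer_support(bs_combined, bs_separate):
    """HBS = BS in the combined analysis - sum of BS from each partition."""
    return bs_combined - sum(bs_separate)


# Hypothetical values: BS of 10 in the combined analysis,
# and 3, 2 and 1 in three separate analyses
print(hidden_bremer_support(10, [3, 2, 1]))  # 4
```

A positive result means the combined analysis supports the node more strongly than the separate analyses do together.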
With a small dataset, it is probably always best to combine everything
With large datasets (10 or 20 gene regions?) one can find sets of congruent genes and combine them
But!
◦ Is there a biological reason for incongruence, or is it just a property of the data?
So, what to do?
Problems inherent in molecular data
Niklas Wahlberg
Saturation
Bias in nucleotide composition
Orthology vs paralogy
Lineage sorting
Lateral Gene Transfer
What are the problems?
Saturation
Saturation is due to multiple changes at the same site subsequent to lineage splitting
Models of evolution attempt to infer the missing information through correcting for “multiple hits”
Most data will contain some fast evolving sites which are potentially saturated (e.g. in protein-coding genes, often codon position 3)
In severe cases the data becomes essentially random and all information about relationships can be lost
Saturation in sequence data
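As an illustration of a "multiple hits" correction, the sketch below computes the observed proportion of differing sites (p-distance) and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). Jukes-Cantor is chosen here only because it is the simplest such model; the function names are illustrative.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        return math.inf  # fully saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("AGCGAG", "GCGGAC")
print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the observed one, and it grows without bound as p approaches 0.75, the value expected for random sequences.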
[Figure: multiple changes at a single site (hidden changes), comparing Seq 1 and Seq 2 by number of changes]
Ancestor: GGCGCG
Seq 1: AGCGAG
Seq 2: GCGGAC
Saturation
[Figure: pairwise distance calculated from sequences plotted against time since divergence]
Homoplasy is a problem with molecular data
Elevated rates of molecular evolution in unrelated lineages
Sparse taxon sampling leading to long branches
Saturation and long branch attraction
The classical long-branch attraction example
Based on one gene 18S
Nardi et al. 2003: Science 299: 1887-1889
Taxon sampling is important
For divergent taxa with few extant species, this can be a problem
More data from different sources
◦ Could be that molecular data are not able to resolve the position of some taxa
◦ Morphological data!
Is saturation a problem?
Biased base composition
Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?
Biased base compositions?
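A quick first check for compositional bias is to compare %GC across the sequences in an alignment. A minimal sketch, assuming DNA sequences as plain strings; gaps and ambiguity codes are ignored, which is a choice made here for simplicity.

```python
def gc_percent(seq):
    """Percentage of G + C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


print(gc_percent("GGCC"))  # 100.0
print(gc_percent("ATGC"))  # 50.0
```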
% Guanine + Cytosine in 16S rRNA genes

                           %GC all sites   variable sites   parsimony sites
Thermophiles:
  Thermotoga maritima            62              72               73
  Thermus thermophilus           64              72               70
  Aquifex pyrophilus             65              73               71
Mesophiles:
  Deinococcus radiodurans        55              52               48
  Bacillus subtilis              55              50               38
A case study in phylogenetic analysis: Deinococcus and Thermus
Deinococcus are radiation resistant bacteria
Thermus are thermophilic bacteria
BUT:
◦ Both have the same very unusual cell wall based upon ornithine
◦ Both have the same menaquinones (Mk 9)
◦ Both have the same unusual polar lipids
Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus
An appropriate method can correct for GC bias
[Figure: three trees for Aquifex, Thermotoga, Deinococcus, Bacillus and Thermus: parsimony tree, Jukes & Cantor tree, LogDet tree]
Orthology and paralogy
Are the sequences being generated from different species the same (homologous)?
Gene duplication
◦ duplicate gene degenerates
◦ duplicate gene acquires new function
A problem that is particularly acute now, as we search for new genes
Orthology or paralogy?
ORTHOLOGY
Orthology: gene trees and species trees
[Figure: gene phylogeny (a, b, c) beside the organism phylogeny (A, B, C)]
Darwin’s theory reinterpreted homology as common ancestry.
ATCGGCCACTTTCGCGATCA
ATAGGCCACTTTCGCGATCA ATCGGCCACTTTCGCGATCG
ATAGGCCACTTTCGCGATTA ATCGGCCACTTTCGTGATCG
ATAGGGCAGTTTCGCGATTA ATCGGCCACGTTCGTGATCG
ATAGGGCAGTTTTGCGATTA ATCGGCCACGTTCGCGATCG
ATAGGGCAGTTTCGCGATTA ATCGGCCACCTTCGCGATCG
ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG
ACCGGCCACCTTCGCGATCG ATAGGGCAGTCTCGCGATTA
Ancestral sequence
Homologous sequences
Orthologs arise by speciation
ATCGGCCACTTTCGCGATCA
ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG
Sequence in ancestral organism
Orthologous sequences
Speciation event
Modern species A Modern species B
Orthologs are “evolutionary counterparts” – Koonin (2001)
Paralogs arise by duplications
ATCGGCCACTTTCGCGATCA
ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG
Sequence in ancestral organism
Paralogous sequences
Duplication event
Modern duplicate A Modern duplicate B
An evolutionary tale…
Duplication of A in worm
Duplication of A in human
Sonnhammer & Koonin (2002) TIGs 18: 619-620
The yeast gene is orthologous to all worm and human genes, which are all co-orthologous to the yeast gene
Evolutionary Relationships
Sonnhammer & Koonin (2002) TIGs 18: 619-620
all genes in the HA* set are co-orthologous to all genes in the WA* set
Evolutionary Relationships
Sonnhammer & Koonin (2002) TIGs 18: 619-620
The genes HA* are hence ‘inparalogs’ to each other when comparing human to worm.
Evolutionary Relationships
Sonnhammer & Koonin (2002) TIGs 18: 619-620
duplication speciation
By contrast, the genes HB and HA* are ‘outparalogs’ when comparing human with worm
Evolutionary Relationships
Sonnhammer & Koonin (2002) TIGs 18: 619-620
HB and HA*, and WB and WA* are inparalogs when comparing with yeast, because the animal–yeast split pre-dates the HA*–HB duplication
duplication
speciation
Evolutionary Relationships
Sonnhammer & Koonin (2002) TIGs 18: 619-620
PARALOGY
[Figure: after a gene duplication, two gene phylogenies (copies a1, b1, c1 and a2, b2, c2) beside the organism phylogeny (A, B, C); sampling the starred copies a1, b2 and c1 gives a misleading tree]
Paralogy can produce misleading trees
Ancient gene duplications can be used to root the tree of life
Ancestral Elongation Factor Gene
Gene Duplication Prior To Split Into 3 Domains Of Life
EF-Tu/ 1-alpha
EF-2/ EF-G
Sequences from one paralogue can be used to root a tree formed using sequences from the other and vice versa
EF-Tu/1-alpha + EF-2/EF-G = paralogues of each other
Lineage sorting
Gene trees may not be the same as species trees
Extant populations may retain ancestral polymorphisms
Species level phylogenies should never rely on a single sampled individual per species
Lineage sorting
Implicit assumption in many studies using mtDNA
The mode of speciation can now be studied using DNA sequences
Theoretical studies predict that DNA lineages pass through several phases in a species
Are species monophyletic?
[Figure: species A and B descending through time from an ancestral gene pool]
The assumption: monophyly
Paraphyly can occur when one population in a set of locally panmictic populations speciates
Polyphyly occurs when a highly polymorphic population is subdivided
Can be highly informative of the history of divergence
The presence of poly- and paraphyletic lineages
[Figure: gene lineages of species A and B through time from an ancestral gene pool, showing paraphyly]
Paraphyly
[Figure: gene lineages of species A and B through time, showing polyphyly]
Polyphyly
[Figure: tree of individual Phyciodes specimens (tharos, cocyta, batesii, pulchella, mylitta, pallida, orseis, phaon, picta, vesta, pallescens) labeled by specimen code and locality, with support values on the branches]
An empirical example:
Phyciodes butterflies
Wahlberg et al. 2003. Syst Ent 28:257-273
Paraphyly of a species can be due to incomplete lineage sorting and/or secondary gene flow
G = generations, starting with ten unrelated females at G = 0
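How quickly maternal lineages are lost can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the model behind the figure: it assumes a constant population of ten females, non-overlapping generations, and each daughter drawing her mother at random from the previous generation.

```python
import random


def surviving_lineages(n_females=10, generations=20, seed=1):
    """Number of founder lineages still present at each generation G."""
    rng = random.Random(seed)
    lineage = list(range(n_females))  # founder id of each female at G = 0
    counts = [len(set(lineage))]
    for _ in range(generations):
        # each female of the next generation inherits her mother's lineage
        lineage = [rng.choice(lineage) for _ in range(n_females)]
        counts.append(len(set(lineage)))
    return counts


print(surviving_lineages())
```

The counts start at ten and can only stay the same or fall, because a lost lineage never reappears; given enough generations a single founder lineage remains.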
Lateral gene transfer
Widespread in single-celled organisms
◦ Even between distantly related lineages
In multi-celled organisms more a problem in closely related species
◦ hybridization
Lateral Gene Transfer
Is the Tree of Life really a Web of Life?
Lateral Gene Transfer
These “problems” are highly interesting phenomena in themselves!
When the different factors are taken into account, they can be informative about evolutionary history
“When in doubt, get more data” - Brooks and McLennan 2002
Problems inherent in molecular data?