More sequence or more individuals, to combine or not?

Data: how much is needed?


14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)

Schedule


The trivial truth
◦ All extant species
◦ The whole genome

Impractical? Well, then
◦ As many species as possible
◦ As much data as possible

How much data?

Finite constraints on resources (time, money)
◦ Know your group – which taxa are the most relevant for your study?
◦ Know what gene sequences are available from previous studies

Choosing taxa or data

The days of single gene datasets are over

Mitochondrial and chloroplast DNA have been popular because they are easy to amplify and sequence

It is worth increasing the number of nuclear genes

One should aim for at least 3 genes, preferably more (maybe 10?)

Number of genes

It is now possible to increase the number of genes being sequenced significantly

Whole genome analyses will allow us to understand:
◦ Intron-exon boundary dynamics
◦ Gene duplication-deletion dynamics
◦ Gene transfer dynamics

Soon we will have a good understanding of the regions of the genome that are most suitable for systematics

Phylogenomics

Sometimes not all genes amplify from all samples
◦ Should these samples be discarded?

Increased taxon sampling, despite missing data, increases resolution

All possible data should be used!

Missing data?

Can separate independent data sets be combined for analysis?

How can we assess the possibility of conflict between different data?

What does the potential conflict then mean?

To combine or not to combine?

For instance
◦ Different genes may have different phylogenetic signal (different history?)

What is the problem?

If both genes have equally strong signal

Possible effects on results

If one gene has a stronger signal than the other




Never combine
Combine sometimes
Always combine

Schools of thought

The different data sets may represent different evolutionary histories (e.g. different selection pressures)

Big data sets dominate small data sets

When analyzed separately, the different data sets can be tests of each other's phylogenetic hypotheses

Never combine!

Consensus trees of separate analyses

[Figure: tree from data set A + tree from data set B = their consensus; taxa A to H]
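As a rough illustration of what a consensus of separate analyses keeps, the sketch below computes the clades shared by two rooted trees (a strict consensus). The nested-tuple tree format, the function names and the two toy trees are assumptions made for this example; they are not the trees from the slide.

```python
def clades(tree):
    """Return (leaves, clades) for a rooted tree written as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the group of all tips below this node
    return leaves, found


def strict_consensus_clades(tree_a, tree_b):
    """Clades present in both trees; everything else collapses in the consensus."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same taxa")
    return clades_a & clades_b


# Hypothetical trees from two data sets that disagree about A, B and C.
tree_a = ((("A", "B"), "C"), "D")
tree_b = ((("A", "C"), "B"), "D")
shared = strict_consensus_clades(tree_a, tree_b)
```

Here only the groups {A, B, C} and {A, B, C, D} survive, so the consensus is less resolved than either input tree.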

My own experience:

Would be fantastic to get genealogical histories of individual genes

But!
◦ Single genes are generally short (1000-2000 bases)
◦ Lots of homoplasy
◦ Unreliable phylogenies

Problems with the approach

If the data sets are congruent, combine them

If the data sets are incongruent, don’t combine them

One can use the ILD test to decide whether data sets are incongruent

Well, sometimes you can combine...

If there is no conflict between data sets:
◦ The length of the most parsimonious tree from the combined data [L(x+y)] is equal to the sum of the lengths of the MP trees from the separately analyzed data [L(x) + L(y)]

Dxy = L(x+y) - (L(x) + L(y))
Dxy = 0

(Farris et al 1994)

ILD (Incongruence Length Difference)

Combining the data sets leads to increased homoplasy

But is it statistically significant?

Can be tested with the Mann-Whitney U test, where the null hypothesis is that the data sets are combinable

If Dxy > 0

[Figure: Original: data set x, data set y. Combine data: data sets x + y. Sample randomly to get data sets as large as the originals: data set p, data set q]

Search for MP trees and calculate Dpq values

Repeat many times (e.g. 1000), which gives us a distribution for the value of D

Compare whether Dxy differs from the random distribution at P < 0.05

However:

◦ ILD-test is sensitive to relative sizes of compared data sets and to the evolutionary history of the different data sets

For the randomly generated data sets
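The resampling procedure can be sketched in a few lines for a toy case. This is a minimal sketch under stated assumptions, not the published implementation: it is restricted to four taxa so that the MP tree can be found by checking all three topologies with Fitch parsimony, each site is a four-character string (one state per taxon), and the P value is taken as the simple proportion of random partitions with D at least as large as the observed one. All function names and the toy data are hypothetical.

```python
import random

# The three unrooted four-taxon topologies, written as rooted nested tuples of taxon indices.
QUARTETS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def fitch_site(tree, column):
    """Fitch parsimony for one site: return (possible states, number of changes)."""
    if isinstance(tree, int):
        return {column[tree]}, 0
    left, n_left = fitch_site(tree[0], column)
    right, n_right = fitch_site(tree[1], column)
    common = left & right
    if common:
        return common, n_left + n_right
    return left | right, n_left + n_right + 1


def mp_length(columns):
    """Length of the most parsimonious four-taxon tree for a list of sites."""
    return min(sum(fitch_site(t, c)[1] for c in columns) for t in QUARTETS)


def ild(x, y):
    """Dxy = L(x+y) - (L(x) + L(y))."""
    return mp_length(x + y) - (mp_length(x) + mp_length(y))


def ild_test(x, y, replicates=1000, seed=0):
    """Compare Dxy with D values from random partitions of the pooled sites."""
    rng = random.Random(seed)
    observed = ild(x, y)
    pooled = x + y
    hits = 0
    for _ in range(replicates):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        p, q = shuffled[:len(x)], shuffled[len(x):]  # same sizes as x and y
        if ild(p, q) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


# Toy data: x supports (taxon 0 + taxon 1), y supports (taxon 0 + taxon 2).
x = ["AACC"] * 10
y = ["ACAC"] * 10
observed, p_value = ild_test(x, y, replicates=200)
```

With these deliberately conflicting toy data sets, each separate MP tree has 10 steps while the combined MP tree has 30, so Dxy = 10; two copies of the same data set give Dxy = 0.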

But what if the conflict is only partial?

Combining all available data leads to more resolved trees = the combined data has higher explanatory power

”Hidden support” can only be detected through combined analysis

Conflicts at different nodes can only be discovered in a combined analysis framework

The effects of combined analysis can be investigated using indices related to Bremer support

Always combine!

Partitioned Bremer Support (PBS)
◦ Baker & DeSalle 1997: Syst Biol 46:654

Partition Congruence Index (PCI)
◦ Brower 2006: Cladistics 22:378

Hidden Bremer Support (HBS)
◦ Gatesy et al 1999: Cladistics 15:271

Indices related to Bremer Support

The different data partitions in a data set contribute to the Bremer support in an additive way

For each node:
◦ A negative Partitioned Bremer support value indicates conflict
◦ A positive Partitioned Bremer support value indicates congruence

PBS (Partitioned Bremer Support)
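The additivity can be checked with simple arithmetic. In the sketch below, the PBS of a partition at a node is taken as that partition's length on the shortest combined-data tree lacking the node minus its length on the combined MP tree. The partition names and all tree lengths are hypothetical numbers chosen only for illustration.

```python
def partitioned_bremer_support(lengths_mp, lengths_without_node):
    """PBS per partition: steps on the best tree lacking the node minus steps on the MP tree."""
    return {name: lengths_without_node[name] - lengths_mp[name] for name in lengths_mp}


# Hypothetical per-partition lengths (in steps) on the combined MP tree.
mp_tree = {"gene1": 100, "gene2": 200}

# Hypothetical lengths on the best trees lacking node 1 and node 2, respectively.
node1 = partitioned_bremer_support(mp_tree, {"gene1": 103, "gene2": 204})
node2 = partitioned_bremer_support(mp_tree, {"gene1": 94, "gene2": 213})

# The PBS values of a node sum to its Bremer support.
bremer1 = sum(node1.values())
bremer2 = sum(node2.values())

# Partitions with negative PBS conflict with the node.
conflicting = [name for name, value in node2.items() if value < 0]
```

Both nodes end up with the same Bremer support, but only the second hides a conflict: one partition contributes -6 and the other +13.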

PBS in practice

[Figure: example tree with Bremer support values (7, 7) and partitioned Bremer support values (3, 4; -6, 13) at nodes. Data partitions: Morphology, COI, EF1a, Wgl]

Tells us about the magnitude of conflict between data partitions in a combined analysis

PCI is always equal to or less than BS for a given branch

PCI = BS when there is no conflict

PCI is negative when there is low BS because of strong conflicts between data partitions

Partition Congruence Index

Brower 2006: Cladistics 22:378-386

Underlying phylogenetic signal can be confounded by homoplasy in separate analyses

Combining datasets can bring out this signal, as homoplasy is largely random noise

Can be measured using HBS and Partitioned HBS

Hidden support

Hidden support can be defined as increased support for the node of interest in the simultaneous analysis of all data partitions relative to the sum of support for that node in the separate analyses of each partition

Hidden support

For a particular combined data set and a particular node, HBS is the difference between BS for that node in the combined analysis and the sum of BS values for that node from each data partition

Measuring hidden support
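The definition translates directly into a subtraction. The function name and the example support values below are assumptions for illustration only.

```python
def hidden_bremer_support(bs_combined, bs_separate):
    """HBS = BS in the combined analysis minus the sum of BS from the separate analyses."""
    return bs_combined - sum(bs_separate)


# Hypothetical node: BS of 7 in the combined analysis,
# BS of 2 and 1 for the same node in two separate analyses.
hbs = hidden_bremer_support(7, [2, 1])
```

In this made-up case, 4 of the 7 steps of support only appear when the partitions are analyzed together.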

With a small dataset, it is probably always best to combine everything

With large datasets (10 or 20 gene regions?) one can find sets of congruent genes and combine them

But!
◦ Is there a biological reason for incongruence, or is it just a property of the data?

So, what to do?

Problems inherent in molecular data

Niklas Wahlberg

Saturation
Bias in nucleotide composition
Orthology vs paralogy
Lineage sorting
Lateral Gene Transfer

What are the problems?

Saturation

Saturation is due to multiple changes at the same site subsequent to lineage splitting

Models of evolution attempt to infer the missing information through correcting for “multiple hits”

Most data will contain some fast evolving sites which are potentially saturated (e.g. in proteins often position 3)

In severe cases the data becomes essentially random and all information about relationships can be lost
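One simple correction for multiple hits is the Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below uses it only as an illustration of the idea; the function names are made up for this example, and the two short sequences are the Seq 1 and Seq 2 strings from the hidden-changes slide.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined (infinite) once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("AGCGAG", "GCGGAC")  # 4 of 6 sites differ
d = jukes_cantor(p)                 # estimated changes per site, larger than p
```

As p approaches 0.75, the value expected for random sequences under this model, the corrected distance grows without bound, which is the sense in which saturated data lose their information.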

Saturation in sequence data

Multiple changes at a single site - hidden changes

[Figure: number of changes at one site. Seq 1: C → A (1 change). Seq 2: C → G → T → A (3 changes)]

Ancest GGCGCG
Seq 1  AGCGAG
Seq 2  GCGGAC

Saturation

[Figure: pairwise distance calculated from sequences (y-axis) against time since divergence (x-axis)]

Homoplasy is a problem with molecular data

Elevated rates of molecular evolution in unrelated lineages

Sparse taxon sampling leading to long branches

Saturation and long branch attraction

The classical long-branch attraction example

Based on one gene 18S

Nardi et al. 2003: Science 299: 1887-1889

Taxon sampling is important
◦ For divergent taxa with few extant species, this can be a problem

More data from different sources
◦ Could be that molecular data are not able to resolve the position of some taxa
◦ Morphological data!

Is saturation a problem?

Biased base composition

Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?

Biased base compositions?

% Guanine + Cytosine in 16S rRNA genes

                           %GC
                           all sites   variable sites   parsimony sites
Thermophiles:
Thermotoga maritima        62          72               73
Thermus thermophilus       64          72               70
Aquifex pyrophilus         65          73               71
Mesophiles:
Deinococcus radiodurans    55          52               48
Bacillus subtilis          55          50               38
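A quick way to look for this kind of bias in your own alignment is to compute %GC per sequence. The sketch below is a minimal version that ignores gaps and ambiguity codes; the function name and the toy sequences are hypothetical, not the 16S data from the table.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    bases = [base for base in sequence.upper() if base in "ACGTU"]
    if not bases:
        raise ValueError("no unambiguous bases in sequence")
    return 100 * sum(base in "GC" for base in bases) / len(bases)


# Toy aligned sequences; gaps ("-") and "N" are skipped.
alignment = {
    "taxon_gc_rich": "GCGCGGCCAT",
    "taxon_at_rich": "ATATAAGC-N",
}
composition = {name: gc_percent(seq) for name, seq in alignment.items()}
```

Large differences between taxa, especially at the variable sites, are a warning that unrelated taxa with similar composition may be drawn together.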

A case study in phylogenetic analysis: Deinococcus and Thermus

Deinococcus are radiation resistant bacteria

Thermus are thermophilic bacteria

BUT:
◦ Both have the same very unusual cell wall based upon ornithine
◦ Both have the same menaquinones (Mk 9)
◦ Both have the same unusual polar lipids

Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

An appropriate method can correct for GC bias

[Figure: three trees of Aquifex, Thermotoga, Deinococcus, Bacillus and Thermus: Parsimony tree, Jukes & Cantor tree, LogDet tree]

Orthology and paralogy

Are the sequences being generated from different species the same (homologous)?

Gene duplication
◦ duplicate gene degenerates
◦ duplicate gene acquires new function

A problem that is particularly acute currently as we search for new genes

Orthology or paralogy?

ORTHOLOGY

Orthology: gene trees and species trees

[Figure: gene phylogeny (a, b, c) shown alongside the organism phylogeny (A, B, C)]

Darwin’s theory reinterpreted homology as common ancestry.

ATCGGCCACTTTCGCGATCA

ATAGGCCACTTTCGCGATCA ATCGGCCACTTTCGCGATCG

ATAGGCCACTTTCGCGATTA ATCGGCCACTTTCGTGATCG

ATAGGGCAGTTTCGCGATTA ATCGGCCACGTTCGTGATCG

ATAGGGCAGTTTTGCGATTA ATCGGCCACGTTCGCGATCG

ATAGGGCAGTTTCGCGATTA ATCGGCCACCTTCGCGATCG

ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

ACCGGCCACCTTCGCGATCG ATAGGGCAGTCTCGCGATTA

Ancestral sequence

Homologous sequences

Orthologs arise by speciation



Sequence in ancestral organism

Orthologous sequences

Speciation event

Modern species A Modern species B

Orthologs are “evolutionary counterparts” – Koonin (2001)

Paralogs arise by duplications



Sequence in ancestral organism

Paralogous sequences

Duplication event

Modern duplicate A Modern duplicate B

An evolutionary tale…

Duplication of A in worm

Duplication of A in human

Sonnhammer & Koonin (2002) Trends Genet 18: 619-620

The yeast gene is orthologous to all worm and human genes, which are all co-orthologous to the yeast gene

Evolutionary Relationships


All genes in the HA* set are co-orthologous to all genes in the WA* set



The genes HA* are hence ‘inparalogs’ to each other when comparing human to worm.



[Figure legend: duplication, speciation]

By contrast, the genes HB and HA* are ‘outparalogs’ when comparing human with worm



HB and HA*, and WB and WA* are inparalogs when comparing with yeast, because the animal–yeast split pre-dates the HA*–HB duplication

[Figure legend: duplication, speciation]

Evolutionary Relationships


PARALOGY

[Figure: gene phylogenies of the paralogs a1*, b1, c1* and a2, b2*, c2, which arose by gene duplication, shown beside the organism phylogeny (A, B, C). Misleading tree: A, B, C placed using a1, b2, c1]

Paralogy can produce misleading trees

Ancient gene duplications can be used to root the tree of life

Ancestral Elongation Factor Gene

Gene Duplication Prior To Split Into 3 Domains Of Life

EF-Tu/ 1-alpha

EF-2/ EF-G

Sequences from one paralogue can be used to root a tree formed using sequences from the other and vice versa

[Figure: EF-Tu/ 1-alpha and EF-2/ EF-G are paralogues of each other]

Lineage sorting

Gene trees may not be the same as species trees

Extant populations may retain ancestral polymorphisms

Species-level phylogenies should never rely on a single sampled individual per species

Lineage sorting

Implicit assumption in many studies using mtDNA

The mode of speciation can now be studied using DNA sequences

Theoretical studies predict that DNA lineages pass through several phases in a species

Are species monophyletic?

[Figure: lineages A and B descending through time from an ancestral gene pool]

The assumption: monophyly

Paraphyly can occur when one population in a set of locally panmictic populations speciates

Polyphyly occurs when a highly polymorphic population is subdivided

Can be highly informative of the history of divergence

The presence of poly- and paraphyletic lineages
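Whether the sampled individuals of a species are monophyletic on a gene tree can be checked mechanically once several individuals per species are included. The sketch below assumes a rooted gene tree written as nested tuples of tip names; the tree, the tip labels and the function names are hypothetical.

```python
def tip_groups(tree):
    """Return (tips, groups), where groups holds the tip set below every node and every tip."""
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, {tips}
    tips, groups = frozenset(), set()
    for child in tree:
        child_tips, child_groups = tip_groups(child)
        tips |= child_tips
        groups |= child_groups
    groups.add(tips)
    return tips, groups


def is_monophyletic(tree, individuals):
    """True if the given individuals form exactly one clade of the rooted tree."""
    return frozenset(individuals) in tip_groups(tree)[1]


# Hypothetical mtDNA gene tree: species A is nested inside species B.
gene_tree = ((("A1", "A2"), "B1"), "B2")
a_monophyletic = is_monophyletic(gene_tree, ["A1", "A2"])
b_monophyletic = is_monophyletic(gene_tree, ["B1", "B2"])
```

In this toy tree species A is monophyletic while species B is paraphyletic with respect to A, a pattern that would be invisible with one individual per species.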

[Figure: lineages A and B descending through time from an ancestral gene pool, showing paraphyly]

Paraphyly

[Figure: lineages A and B through time, showing polyphyly]

Polyphyly

[Figure: tree of Phyciodes specimens with node support values (51-100). Taxa shown: tharos, cocyta, batesii, pulchella, mylitta, pallida, orseis, phaon, picta, vesta and pallescens, each specimen labelled with subspecies, sample code and locality]

An empirical example:

Phyciodes butterflies

Wahlberg et al. 2003. Syst Ent 28:257-273

Paraphyly of a species can be due to incomplete lineage sorting and/or secondary gene flow

G = generations, starting with ten unrelated females at G = 0
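The sorting of maternally inherited lineages over generations can be imitated with a very small simulation. This is a sketch under simplifying assumptions that are not stated in the slide: constant population size, non-overlapping generations, and each daughter drawing her mother at random. The function name and parameter values are hypothetical.

```python
import random


def lineage_sorting(n_females=10, generations=50, seed=1):
    """Count surviving founding mtDNA lineages in each generation (G = 0, 1, ...)."""
    rng = random.Random(seed)
    population = list(range(n_females))  # at G = 0 every female is her own lineage
    surviving = [len(set(population))]
    for _ in range(generations):
        # Each female of the next generation inherits the lineage of a random mother.
        population = [rng.choice(population) for _ in range(n_females)]
        surviving.append(len(set(population)))
    return surviving


counts = lineage_sorting()
```

The number of surviving lineages can only stay the same or drop, so given enough generations a single founding lineage remains; until then the population still carries ancestral polymorphism.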

Lateral gene transfer

Widely spread in single celled organisms
◦ Even between distantly related lineages

In multi-celled organisms more a problem in closely related species
◦ hybridization

Lateral Gene Transfer

Is the Tree of Life really a Web of Life?

Lateral Gene Transfer

These ”problems” are highly interesting phenomena in themselves!

When the different factors are taken into account, they can be informative about evolutionary history

”When in doubt, get more data” - Brooks and McLennan 2002

Problems inherent in molecular data?
