Phylogenetics

Phylogenetic Trees

[Figure: a phylogenetic tree drawn along a time axis, labelled ROOT, NODE, BRANCH, Operational Taxonomic Unit (OTU) and Hypothetical Taxonomic Unit]

Information

• Branching order (topology)

– Relative closeness of different taxa

• Branch length

– Amount of divergence

Rooted and unrooted trees

[Figure: the same five OTUs (A, B, C, D, E) drawn as a ROOTED tree and as an UNROOTED tree]

[Figure: the possible trees for 3 OTUs (A, B, C) and 4 OTUs (A, B, C, D), UNROOTED versus ROOTED; … 15 rooted trees of 4 OTUs]
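The count of 15 rooted trees for 4 OTUs follows from a standard result for strictly bifurcating trees: n OTUs give (2n - 5)!! unrooted and (2n - 3)!! rooted topologies. A minimal sketch, with function names chosen here for illustration:

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_otus):
    """Number of unrooted bifurcating topologies for n_otus >= 3."""
    return double_factorial(2 * n_otus - 5)


def n_rooted_trees(n_otus):
    """Number of rooted bifurcating topologies for n_otus >= 2."""
    return double_factorial(2 * n_otus - 3)


for n in (3, 4, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

The numbers grow so quickly that checking every tree is only feasible for small numbers of taxa.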

Monophyletic & Paraphyletic

[Figure: tree of Mammals, Turtles and tortoises, Snakes and lizards, Crocodiles, and Birds, with the traditional group REPTILES marked]

• Monophyletic

– Natural clade: a common ancestor and all of the taxa derived from it

• Paraphyletic

– Taxonomic group whose most recent common ancestor is also the ancestor of a taxon left out of the group (e.g. REPTILES without Birds)
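As a rough illustration of the two definitions, a group is monophyletic exactly when its members are all the leaves under some node of the tree. The sketch below stores a tree as nested tuples; the branching order used for the five taxa is an assumption made for this example, not something taken from the slide.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some node of the tree has exactly `group` as its set of leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Assumed topology, for illustration only.
amniotes = ("Mammals", ("Turtles", ("Lizards", ("Crocodiles", "Birds"))))
reptiles = {"Turtles", "Lizards", "Crocodiles"}

print(is_monophyletic(amniotes, reptiles))              # reptiles without birds
print(is_monophyletic(amniotes, reptiles | {"Birds"}))  # reptiles plus birds
```

With this topology the first call reports that the traditional reptiles are not a clade, because birds share their most recent common ancestor.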

Reconstruct phylogeny from molecular data

[Figure: five aligned sequences (each shown as ACTGTTACCGA) at the tips of a tree whose shape is unknown (?)]

Types of phylogenetic analysis methods

• Phenetic: trees are constructed based on observed characteristics, not on evolutionary history

– Distance methods

• Cladistic: trees are constructed based on fitting observed characteristics to some model of evolutionary history

– Parsimony and Maximum Likelihood methods

Methods of Tree reconstruction

• Distance

• Maximum Parsimony

• Maximum Likelihood

• Bayesian

Phylogeny Estimation: Traditional and Bayesian Approaches

Nature Reviews Genetics (2003) 4:275

Genetic distance

• Distance from one sequence to another

• Hamming Distance

– Count number of differences

• Multiple hits

– number of events is greater than number of differences

– Estimate number of events

• Infer tree from genetic distance using Neighbour-joining (NJ) method
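The distance bullets above can be sketched in a few lines. The Hamming distance counts differing positions; the multiple-hits correction shown is the Jukes-Cantor formula, which is an assumption of this sketch since the slides do not say which correction is used. The two sequences are rows a and b of the informative-sites alignment later in these slides.

```python
import math


def hamming(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def jukes_cantor(p):
    """Estimated substitutions per site from the observed proportion p of differences.

    Assumes the Jukes-Cantor model (equal base frequencies, equal rates).
    """
    if not 0 <= p < 0.75:
        raise ValueError("correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


seq_a = "AAGAGTTCA"
seq_b = "AGCCGTTCT"
differences = hamming(seq_a, seq_b)
p = differences / len(seq_a)
print(differences, p, jukes_cantor(p))
```

The corrected value is larger than the raw proportion, reflecting sites that changed more than once.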

UPGMA is shown for illustrative purposes; neighbour-joining is the preferred method.

• The algorithm in the text means: find the closest distance between two sequences, cluster those; then find the next closest distance, cluster those; as sequences are added to existing clusters find the average distance between existing clusters

• Work through the notation!

• UPGMA assumes a molecular clock mechanism of evolution
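The clustering procedure described above can be sketched as follows. This is a minimal illustration that returns only the nested grouping, not branch lengths; the distance matrix is invented for the example, with taxon B on a deliberately long branch.

```python
def upgma(names, d):
    """Cluster by repeatedly merging the closest pair and averaging distances.

    names: list of labels; d: symmetric matrix (list of lists) of distances.
    Returns the grouping as a nested, Newick-like string.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (label, size)
    dist = {(i, j): d[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), _ = min(dist.items(), key=lambda item: item[1])
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            # Size-weighted mean = average over all original sequence pairs.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented distances from a tree in which A pairs with B and C with D,
# but B has evolved faster than the others.
names = ["A", "B", "C", "D"]
d = [[0, 5, 4, 5],
     [5, 0, 7, 8],
     [4, 7, 0, 5],
     [5, 8, 5, 0]]
print(upgma(names, d))
```

Because the smallest raw distance is between A and C, UPGMA groups them first even though the distances were generated from a tree pairing A with B; this is the kind of error the clock assumption can cause.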

• Neighbor-joining: corrects for UPGMA’s assumption of the same rate of evolution for each branch by modifying the distance matrix to reflect different rates of change.

• The net difference between sequence i and all other sequences is

• $r_i = \sum_{k} d_{ik}$

• The rate-corrected distance matrix is then

• $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$, where n is the number of sequences

• Join the two sequences i and j whose $M_{ij}$ is minimal; then calculate the distance from this new node k to every other sequence m using

• $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$

• Again correct for rates and join nodes.
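One round of the neighbour-joining update above can be sketched as below. The function name and the example matrix are illustrative assumptions; the matrix is invented so that A and B are true neighbours even though A and C have the smallest raw distance.

```python
def nj_step(names, d):
    """Perform one neighbour-joining step on a symmetric distance matrix.

    Returns the new label list and the reduced distance matrix, with the
    joined node placed first.
    """
    n = len(names)
    r = [sum(row) for row in d]  # net divergence of each sequence
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            m_ij = d[i][j] - (r[i] + r[j]) / (n - 2)  # rate-corrected distance
            if best is None or m_ij < best[0]:
                best = (m_ij, i, j)
    _, i, j = best
    keep = [m for m in range(n) if m not in (i, j)]
    new_names = [f"({names[i]},{names[j]})"] + [names[m] for m in keep]
    # Distance from the new node to every remaining sequence m.
    first_row = [0.0] + [(d[i][m] + d[j][m] - d[i][j]) / 2 for m in keep]
    new_d = [first_row]
    for pos, m in enumerate(keep):
        new_d.append([first_row[pos + 1]] + [d[m][x] for x in keep])
    return new_names, new_d


names = ["A", "B", "C", "D"]
d = [[0, 5, 4, 5],
     [5, 0, 7, 8],
     [4, 7, 0, 5],
     [5, 8, 5, 0]]
new_names, new_d = nj_step(names, d)
print(new_names)
print(new_d)
```

Repeating the step on the reduced matrix until three nodes remain gives the full unrooted tree; branch-length estimation is left out of this sketch.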

Maximum Parsimony (MP)

• Find topology requiring smallest number of evolutionary changes

• Consider each position (site) in the sequence alignment independently

• Not all sites are informative

• Informative

– Favours one topology over others

Informative sites

a. A A G A G T T C A

b. A G C C G T T C T

c. A G A T A T C C A

d. A G A G A T C C T

[Figure: the three possible unrooted trees for the four sequences a, b, c and d]
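For the four sequences above, the informative sites can be found mechanically: a site is informative when at least two different nucleotides each appear in at least two sequences. The sketch below also tallies which of the three four-taxon topologies each informative site favours; the function names are chosen here for illustration.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 1-based positions where at least two states each occur twice or more."""
    sites = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(position)
    return sites


def topology_support(alignment):
    """Count informative sites favouring each unrooted tree of sequences a, b, c, d."""
    support = {"(a,b),(c,d)": 0, "(a,c),(b,d)": 0, "(a,d),(b,c)": 0}
    for a, b, c, d in zip(*alignment):
        if a == b and c == d and a != c:
            support["(a,b),(c,d)"] += 1
        elif a == c and b == d and a != b:
            support["(a,c),(b,d)"] += 1
        elif a == d and b == c and a != b:
            support["(a,d),(b,c)"] += 1
    return support


alignment = ["AAGAGTTCA",  # a
             "AGCCGTTCT",  # b
             "AGATATCCA",  # c
             "AGAGATCCT"]  # d
print(informative_sites(alignment))
print(topology_support(alignment))
```

The topology supported by the most informative sites is the most parsimonious one for this alignment.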

Maximum Likelihood (ML)

• Likelihood L of a tree is the probability of observing the data given the tree: L = P(data|tree)

• Find the tree with the highest L value

• Results depend on model of nucleotide substitution

• Computationally time-consuming

• Actually, all the other methods discussed implicitly use a simple model of evolution similar to the typical model made explicit in maximum likelihood:

• All sites selectively neutral

• All mutate independently, forward and reverse rates equal, given by $\mu$

• Also assume discrete generations and sites change independently

• Given this model, can calculate probability that a site with initial nucleotide i will change to nucleotide j within time t:

• $P^{t}_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$, where $\delta_{ij} = 1$ if i = j and $\delta_{ij} = 0$ otherwise, and where $g_j$ is the equilibrium frequency of nucleotide j

• The likelihood that some site is in state i at the kth node of a tree is $L_i^{(k)}$

• The likelihoods for all states for each site for each node are calculated separately; the product of the likelihoods for each site gives the overall likelihood for the observed data

• Different tree topologies are searched to find the highest overall likelihood

• Maximum likelihood is often regarded as the “gold standard” for phylogenetic analysis; but because of its computational intensity it can only be used for select data, and only after much initial fine-tuning of many sequence-alignment parameters

• Often used to distinguish between several already generated trees
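The substitution model and the per-site product described above can be sketched for the simplest possible "tree", two sequences joined by a single branch. The rate and equilibrium frequencies are passed in as `mu` and `freq`; the equal frequencies and the rate of 1.0 used below are assumptions for illustration, and a real analysis would sum over ancestral states at every internal node.

```python
import math

BASES = "ACGT"


def p_change(i, j, t, mu, freq):
    """Probability that a site in state i is in state j after time t."""
    no_event = math.exp(-mu * t)
    delta = 1.0 if i == j else 0.0
    return delta * no_event + (1.0 - no_event) * freq[j]


def pair_log_likelihood(seq1, seq2, t, mu, freq):
    """Log-likelihood of two aligned sequences separated by total time t.

    Sites are treated as independent, so per-site likelihoods multiply
    (their logs add).
    """
    return sum(math.log(freq[a] * p_change(a, b, t, mu, freq))
               for a, b in zip(seq1, seq2))


freq = {base: 0.25 for base in BASES}  # assumed equal equilibrium frequencies
seq_a = "AAGAGTTCA"
seq_b = "AGCCGTTCT"
for t in (0.1, 0.5, 1.0, 2.0):
    print(t, pair_log_likelihood(seq_a, seq_b, t, mu=1.0, freq=freq))
```

Scanning over t and keeping the value with the highest log-likelihood is the one-parameter version of the tree search described above.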

Bayesian (B) Phylogeny Estimation

• Searches for best trees consistent with both model and data

• Incorporates prior knowledge (prior probability)

• B maximises probability of tree given data and model

• Searches for best set of trees
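The quantity being maximised can be written out with Bayes' theorem. The notation below is a standard form chosen for this note rather than copied from the slides; the likelihood term is the same L = P(data|tree) used by maximum likelihood, and the prior P(tree) is where prior knowledge enters.

```latex
P(\text{tree} \mid \text{data}, \text{model})
  = \frac{P(\text{data} \mid \text{tree}, \text{model}) \, P(\text{tree})}
         {P(\text{data} \mid \text{model})}
```

The denominator sums over all possible trees, which is why the posterior is usually approximated by sampling a set of trees rather than computed exactly.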

Comparison of methods

How much information are they using?

• MP, ML, B use actual DNA whereas NJ summarises information into distance matrix

• BUT, not all sites are used by MP (“informative” sites only)

How can the nature of the data affect the methods?

• NJ better for recent divergences

• MP works well for a high number of informative sites

Comparison of methods

How do they cope with lots of sequences?

• MP requires comparison of all possible trees

– Not possible for large number of taxa

• ML is computationally intensive and very slow for large number of taxa

• NJ efficient for large number of taxa

Anything else?

• ML requires explicit assumptions about rate and pattern of substitution (model)

– ML may perform poorly if model is incorrect

• ML or B may get stuck on local maxima

Outgroup rooting of unrooted trees

• Outgroup – related sequence that definitely diverged earlier (paleontological evidence)

[Figure: an unrooted tree of human, mouse and rat, and the same tree rooted by adding chicken as the outgroup]

Rate (r) of evolution

• K = number of substitutions per site

• T = time since divergence

• r = K/(2T), because the K substitutions accumulated along both lineages, each of duration T

• Rate is expressed as substitutions per site per year

[Figure: Species A and Species B joined at their common ancestor, with T marking the time since divergence]

Estimating species divergence times

• fossil evidence shows that T1 = 310 mya

• What is T2 ?

• Only need to have sequences and information on one divergence time

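[Figure: tree of Rat (A), Human (B) and Chicken (C), with T2 marking the rat-human divergence and T1 the earlier divergence of chicken]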
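The calculation implied above has two steps: calibrate the rate from the dated divergence (T1), then apply that rate to the undated one (T2). In the sketch below only T1 = 310 mya comes from the slides; the K values are invented purely to show the arithmetic, and a constant rate across lineages is assumed.

```python
def substitution_rate(k, t):
    """Rate r = K / (2T): substitutions per site per year."""
    return k / (2.0 * t)


def divergence_time(k, r):
    """Rearranged form T = K / (2r), valid only if the rate is constant."""
    return k / (2.0 * r)


T1_YEARS = 310e6          # fossil calibration from the slide
K_MAMMAL_CHICKEN = 0.62   # invented: substitutions per site, mammal vs chicken
K_RAT_HUMAN = 0.16        # invented: substitutions per site, rat vs human

rate = substitution_rate(K_MAMMAL_CHICKEN, T1_YEARS)
t2_years = divergence_time(K_RAT_HUMAN, rate)
print(rate, t2_years)
```

In practice K would first be estimated from the aligned sequences with a correction for multiple hits.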

True tree and inferred tree

• There is only one true tree of species relationships

• Inferred tree may not be correct

1. Some genes may not be representative

2. Tree inference method may have produced an incorrect tree

– e.g. parsimony method: may get several equally parsimonious results

How credible is the tree?

• The tree is a hypothesis of the true relationship

• Need some measure of the support for that hypothesis

• Note: Bayesian methods simultaneously estimate tree and measures of uncertainty for each branch

Standard Error of branches

[Figure: tree of Human, Chimp, Gorilla and Orangutan]

Bootstrap

• The bootstrap: randomly sample all positions (columns in an alignment) with replacement, meaning some columns can be repeated, but conserving the number of positions; build a large dataset of these randomized samples

• Then use your method (distance, parsimony, likelihood) to generate another tree

• Do this a thousand or so times

• Note that if the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally

• The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature
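The resampling procedure described above can be sketched as follows. The "tree-building method" here is a deliberately tiny stand-in (a four-sequence parsimony vote) so that the example stays self-contained; any distance, parsimony or likelihood method could take its place. The seed and function names are arbitrary choices for this sketch.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def groups_a_with_b(alignment):
    """Toy method: True if more sites favour (a,b),(c,d) than either alternative."""
    ab = ac = ad = 0
    for a, b, c, d in zip(*alignment):
        if a == b and c == d and a != c:
            ab += 1
        elif a == c and b == d and a != b:
            ac += 1
        elif a == d and b == c and a != b:
            ad += 1
    return ab > max(ac, ad)


def bootstrap_support(alignment, feature, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `feature` is recovered."""
    rng = random.Random(seed)
    hits = sum(feature(bootstrap_columns(alignment, rng)) for _ in range(replicates))
    return hits / replicates


alignment = ["AAGAGTTCA", "AGCCGTTCT", "AGATATCCA", "AGAGATCCT"]
print(groups_a_with_b(alignment))
print(bootstrap_support(alignment, groups_a_with_b))
```

With only nine sites, and three of them informative, the support value is expected to be well below 100%, which illustrates how the bootstrap exposes weakly supported groupings.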

Applications of phylogenetics

• Detection of orthology and paralogy

• Estimation of divergence times

• Reconstruction of ancient proteins

• Identifying residues important to selection

• Detecting recombination points

• Identifying mutations likely to be associated with disease

• Determining the identity of new pathogens

The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature.

Charles Darwin

The Tree of Life

• Traditional classification of life into five kingdoms

– Bacteria (inc. cyanobacteria)

– Protista (inc. ciliates, flagellates, amoebae)

– Fungi

– Plantae

– Animalia

Archaebacteria

• Carl Woese and colleagues

• Study relationships by comparing rRNAs

• Methanogens were expected to group with other bacteria

• BUT, found to be equally distant from bacteria and eukaryotes

• Made new taxon - Archaebacteria

• Includes many extremophiles

– thermophiles

– hyperthermophiles

– halophiles (salt dependent)

The Tree of Life

Where is the root of the Tree of Life?

• No possible outgroup (by definition)

• Iwabe et al. (1989)

• Examined phylogenetic tree of pairs of genes that exist in all organisms

– derived from gene duplication that predates lineage divergences

[Figure: Gene A duplicates into Gene A1 and Gene A2 before lineages 1, 2 and 3 diverge, so each gene copy carries its own tree of the three lineages]

• Homologous elongation factor genes EF-Tu and EF-G present in all prokaryotes and eukaryotes

• Both genes show the same topology

[Figure: combined tree of EF-Tu and EF-G, each gene's subtree containing Archaea, Eucarya and Bacteria]

Changing view of The Tree of Life … (Gaucher et al., 2010)

• based on morphological characteristics (Chatton, 1925)

• based on DNA sequence analysis (Woese & Fox, 1977)

• based on ancient gene duplication

• based on phylogenies of hundreds of genes

• based on membrane architecture & gene indels

Most modern view …

Phylogeny of humans and apes

• Darwin – Gorilla and Chimpanzee our closest relatives and human evolutionary origins in Africa

• Many people preferred anthropocentric idea that humans were special

[Figure: Traditional view of the tree of Human, Chimp, Gorilla, Orangutan and Gibbon]

So what is the evidence?

• Serological precipitation (Goodman 1962) – human (H), gorilla (G) and chimpanzee (C) constitute a natural clade; orangutans and gibbons diverged earlier

• However, the relationships among H, G and C remained unclear

• Most DNA sequence data support ((H,C),G)

• Some genes show different relationship

[Figure: tree of Human, Chimp, Gorilla, Orangutan and Gibbon]

Conservation biology – the dusky seaside sparrow

• Last one died June 1987 (DisneyWorld)

• Discovered 1872

• Ammodramus maritimus nigrescens

• Geographically confined to small salt marsh in Florida

• 2000 individuals in 1900

• 6 individuals (all male) in 1980

• Conservation program

– artificial breeding

Conservation genetics

• Mating of remaining males with females from closest subspecies available

• Female hybrids of first generation then “back-crossed” to original males

• Continue as long as original males live

• Which subspecies should the females be taken from?

• 8 other A. maritimus subspecies

• Geographically dispersed along coast

• Artificial breeding with Scott’s seaside sparrow (A. m. peninsulae)

• Chosen based on morphological and behavioural similarities

• Was this the best choice?

[Figure: tree of A. maritimus subspecies divided into Atlantic Coast and Gulf Coast groups, with nigrescens and peninsulae marked]

Whoops!

• Two subspecies diverged about 250,000 – 500,000 years ago

• A. m. nigrescens almost indistinguishable molecularly from other Atlantic Coast subspecies

• Any Atlantic Coast subspecies would have been a better choice

• Created a new species instead of saving old

• Dusky seaside sparrow officially declared extinct in 1990

Origin of angiosperms

• Flowering plants: carpel-enclosed ovules and seed

• Fossils

– began to radiate mid-Cretaceous (~115 mya)

– Dominant land plants 90 mya

• 275,000 species described

Origin of angiosperms

• Probably arose from gymnosperm-like ancestor up to 370-380 mya

• Gymnosperm = “naked seed” (e.g. conifers)

• Long time span of possible origin

• Why no fossils?

– Didn’t exist prior to Cretaceous?

– Lived in habitats not conducive to fossilisation?

Monocot and Dicot divergence

• Monocotyledons

• Dicotyledons

• Two major classes of angiosperm

• Date of their divergence gives minimum estimate for age of angiosperms

• Phylogenetic analysis of DNA sequences

Monocot – Dicot divergence

• Initial estimate of 300-320 mya (Martin et al. 1989)

– Glyceraldehyde-3-phosphate dehydrogenase from plants, animals and fungi

• Implied origin close (within 100 myr) to the time of origin of earliest land plants

– seems too ancient

– implies all vascular plants arose within 100 myr

• Alternative study (Wolfe et al., 1989)

• Calibrated molecular clock with maize-wheat divergence (50-70 mya)

• Monocot-dicot divergence estimated as 200 mya

• Existed long before prominence in paleoflora

Cetaceans

• Link to ungulates (hoofed mammals) suggested by comparative anatomy

• Early protein and mtDNA phylogenetic studies indicated that Cetaceans are closely related to Artiodactyls

[Figure: tree of Artiodactyls: Cow, Deer, Hippo, Pig, Peccary and Camel]

• Graur and Higgins (1994)

• Protein and DNA sequence from several cetaceans and from three suborders of artiodactyls

• Showed cetaceans are within artiodactyls

• Confirmed by analysis of distribution of SINE elements

Cetartiodactyls
