PHYLOGENETIC TREES Introduction to Computational Biology CIS 786 With Dr. Barry Cohen Tuesday, May 7, 2001 Paul Wood Yanchun Song Chaowei Sun


Introduction

Paul Wood

Yanchun Song

Chaowei Sun

What is a Phylogenetic Tree?

• Phylogenetic trees represent the similarity or dissimilarity among individuals, both existing and extinct, across a set of characteristics or features.

• The similarity of molecular and physical systems provides compelling evidence that all life on earth arose from a common ancestor.

Carl R. Woese, Interpreting the universal phylogenetic tree, Proc. Natl. Acad. Sci. USA, Vol. 97, Issue 15, 8392-8396, July 18, 2000. http://www.pnas.org/cgi/content/full/97/15/8392

• Shall I _____ thee to a summer's day?
– W. Shakespeare, Sonnet 18

• There is a _____ between Homer and Hesiod, between Æschylus and Euripides…
– P. Shelley, Prometheus Unbound

• Life all around me… All in the loom, and oh what _____! Woodlands, meadows,…
– E. L. Masters, Spoon River Anthology

• If the foolish call them “flowers” / Need the wiser tell? // If the savants “_____” them / It is just as well.
– E. Dickinson, Part 1: Life, XCIV

SIMILARITY

PATTERNS

Why do we study Phylogenetic Trees?

COMPARE

CLASSIFY

…because humans need to….fill in blanks…

…and understand in our own language…

What are some applications of “phylogenetic” trees?

Computational Linguistics
• Manning, Christopher D. and Heinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Massachusetts, 1999. http://www.aclweb.org/archive/fsnlp-ch1.pdf

Archaeological Statistics
• Archaeological Statistics: Brief Bibliography. http://ad.trafficmp.com/tmpad/banner/itrack.asp?rv=3.0&id=16&nojs=1

Broad Historical and Technical Overview
• Discriminant Analysis and Clustering, Panel on Discriminant Analysis, Classification, and Clustering, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences, Commission on Physical Sciences, Mathematics, and Resources, National Research Council, National Academy Press, Washington, D.C., 1988. http://www.ulib.org/webRoot/Books/National_Academy_Press_Books/discrim_analysis/discr001.htm

Phylogenetic trees are used to study the locations, migrations, lives, health & cultures of populations.

[Figure labels: Velda, Helena, Tara, Katrina, Ursula, Xenia, Jasmine]

http://www.oxfordancestors.com/daughters.html

Phylogenetic trees are used to study physical & genetic variability and the evolution of species.

http://www.oxfordancestors.com/daughters.html

Which areas of the genome provide mutation data to create phylogenetic trees?

• Y-Chromosome

• Mitochondrial Control Region

• Autosomes

How do we get data for computational biology?

STEP 1: Eukaryotic biochemical protocol is… kind of like washing greasy dishes!

[Diagram labels, in slide order: concentration gradient; homogenize; + detergent (sodium dodecyl sulphate, SDS); + phenol; genetic material and insoluble protein in the phenol step; remove upper phase; + cesium chloride; spin 40 hrs @ 40,000 RPM; RNA and Cs bands; low, medium, and high weight.]

How do we get sequence data?

STEP 2: Cut up DNA using one of “two” methods…
2a: Sanger (Dideoxy)
2b: Maxam-Gilbert

STEP 3: Label fragments using one of “two” methods…
3a, 3b: [Diagram labels: EtOH; restriction enzymes; 32Phosphate; ~4 reactions; gel electrophoresis; autoradiography; fluorescent dye; ~4 reactions; gel electrophoresis; fluorescence spectroscopy.]

What is the rate of evolutionary change… or… how many mutations can we expect?

• Estimates vary depending upon assessment method and location within the genome

• “…134 independent mtDNA lineages spanning 327 generations found ~2.5 mutations per site per million years.”

– A high observed substitution rate in the human mitochondrial DNA control region. Parsons TJ, Muniec DS, Sullivan K, Woodyatt N, Alliston-Greiner R, Wilson MR, Berry DL, Holland KA, Weedn VW, Gill P, Holland MM. Nat Genet 1997 Apr; 15(4):363-8. Armed Forces DNA Identification Laboratory, Armed Forces Institute of Pathology, Rockville, Maryland 20850, USA. http://www.mhrc.net/mitochondria.htm

– M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt. (1978) A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure, M. O. Dayhoff (Ed.). National Biomedical Research Foundation, Vol. 5, Suppl. 3, chapter 22, 345-352.

What do sequence data and input files typically look like?

PHYLIP INPUT FILE (SEQUENCE)

263 282
1 AY053096 cacgggagct …variable region... 282
2 AY053097 cacgggagct …variable region... 282
3 AY053098 cacgggagct …variable region... 282
.
263

MEGA INPUT FILE (SEQUENCE)

!Domain=Data property=Coding CodonStart=1;
#W._Pygmy_(1)_{African} TTC TTT CAT GGG
#W._Pygmy_(6)_{African} ... ... ... ...
#Kung_(7)_{African} ... .C. ... ... .T.
#Kung_(9)_{African} ... ... ... ... ...
#Kung_(10)_{African} ... ... ... ... ...
#Kung_(13)_{African} ... ... .G. ... ...

DISTANCE MATRIX

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

What are some of the major classifications of algorithms & software applications?

Count of Software Applications by Type and Platform

Type                               Unix/Source Code   DOS   Windows   Mac   VMS
General-purpose                           6             5       5       3     3
Parsimony                                12            12      13       5     3
Distance matrix                          27            21      20      15     4
Compute distances                        22            16      17      14     6
Maximum likelihood                       23             5      13      14     5
Quartets methods                          7             5       0       4     1
Artificial Intelligence                   1             0       0       0     0
Invariants                                2             2       2       2     2
Tree rearrangement                        4             2       3       5     1
Recombination                             9             2       1       2     0
Bootstrapping and other measures         16            15       9      11     2
Clocks, dating, and stratigraphy         10             2       6      10     0

PHYLIP, PAUP & MEGA are represented across most categories. PHYLIP is the most widely distributed and used. PAUP is most frequently cited in publications. MEGA has a nice GUI and is user friendly. http://evolution.genetics.washington.edu/phylip/software.html

Yanchun Song

Two Types of Data

• Distance-based:
– The input is a matrix of distances between the species (e.g., the alignment score between them or the fraction of residues they agree on).

• Character-based:
– Examine each character (e.g., a base in a specific position in the DNA) separately.

Pairwise Distance

• Model of Jukes and Cantor
– Each base in the DNA sequence has an equal chance of mutating, and when it does, it is replaced by some other nucleotide uniformly.

• Distance $d_{ij}$:
– The fraction $f$ of sites $u$ where residues $x^i_u$ and $x^j_u$ differ (presupposing an alignment of the two sequences).

T. H. Jukes and C. Cantor, Mammalian Protein Metabolism, Chapter Evolution of protein molecules, pages 21-132, Academic Press, New York, 1969
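The fraction f described above can be sketched in a few lines of Python. This is a minimal illustration, assuming two pre-aligned sequences of equal length with no special handling of gaps. The function names are hypothetical. The second function applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4f/3), which the slide does not state explicitly and is included here only as a commonly used follow-up step.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(f: float) -> float:
    """Standard Jukes-Cantor correction (an addition, not from the slide)."""
    if f >= 0.75:
        raise ValueError("correction is undefined for f >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


# Toy sequences (made up for illustration): they differ at 2 of 10 sites.
f = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor_distance(f)
```

The corrected distance is always at least as large as f, because it accounts for multiple substitutions at the same site.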

How to Make a Tree?

• Clustering methods:
– UPGMA
– Neighbor-joining

• Parsimony

Clustering Method: UPGMA

• UPGMA: Unweighted Pair Group Method with Arithmetic Mean

• Distance $D_{i,j}$ between two clusters of species $C_i$ and $C_j$:

$$D_{i,j} = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

where $d(p, q)$ is the distance function between species, $n_i = |C_i|$ and $n_j = |C_j|$.

http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html

Algorithm

• Initialization:
– Initialize n clusters with the given species, one species per cluster.
– Set the size of each cluster: $n_i \leftarrow 1$.
– Assign a leaf for each species.

• Iteration:
– Find the i and j with minimal $D_{i,j}$.
– Create a new cluster (ij), which has $n_{(ij)} = n_i + n_j$ members.
– Connect i and j to the new node (ij), placed at height $D_{i,j}/2$ (so a leaf joined directly to (ij) gets branch length $D_{i,j}/2$).
– Compute the distance from (ij) to all other clusters as a weighted average of the distances from its components:

$$D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$$

– Replace the columns and rows of clusters i and j in D with cluster (ij), with $D_{(ij),k}$ computed as above.

• Termination:
– Stop when only one cluster is left.
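The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the dictionary keyed by `frozenset` pairs are all assumptions made for this example. Ties between equally close pairs are broken by iteration order, so the merge order may differ from a hand-worked example while still producing an equivalent tree.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster species with UPGMA.

    names: list of species labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (tree, merges): tree is a nested tuple of labels, and merges
    lists (left, right, node_height) for each merge, in order.
    """
    clusters = {name: 1 for name in names}  # cluster -> size n_i
    d = dict(dist)                          # working copy of the matrix
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with minimal D_ij.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        ni, nj = clusters[i], clusters[j]
        new = (i, j)
        # Size-weighted average distance from the new cluster to the others.
        for k in clusters:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
                ) / (ni + nj)
        del clusters[i], clusters[j]
        clusters[new] = ni + nj
        merges.append((i, j, dij / 2))      # new node sits at height D_ij / 2
    (tree,) = clusters
    return tree, merges


# Lower-triangular distance matrix for six species A..F.
names = ["A", "B", "C", "D", "E", "F"]
lower = [[], [2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
dist = {
    frozenset((names[r], names[c])): value
    for r, row in enumerate(lower)
    for c, value in enumerate(row)
}
tree, merges = upgma(names, dist)
```

Storing distances under `frozenset` keys avoids having to keep the matrix symmetric by hand, at the cost of rescanning all pairs on every iteration.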

UPGMA Example

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

http://www.icp.ucl.ac.be/~opperd/private/upgma.html

UPGMA Example (cont’d)

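Starting matrix:

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

After merging A and B:

     A,B  C   D   E
C    4
D    6    6
E    6    6   4
F    8    8   8   8

$$D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$$

$D_{(A,B),C} = (D_{A,C} + D_{B,C}) / 2 = 4$
$D_{(A,B),D} = (D_{A,D} + D_{B,D}) / 2 = 6$
$D_{(A,B),E} = (D_{A,E} + D_{B,E}) / 2 = 6$
$D_{(A,B),F} = (D_{A,F} + D_{B,F}) / 2 = 8$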


UPGMA Example (cont’d)

After merging D and E:

      A,B  C   D,E
C     4
D,E   6    6
F     8    8   8

After merging A,B and C:

      AB,C  D,E
D,E   6
F     8     8

After merging AB,C and D,E:

      ABC,DE
F     8


Additivity

• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.

Additivity

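[Figure: leaves i and j joined at internal node k; m is another node of the tree.]

$$D_{im} = D_{ik} + D_{km}$$
$$D_{jm} = D_{jk} + D_{km}$$
$$D_{ij} = D_{ik} + D_{jk}$$

Adding the first two equations and subtracting the third gives:

$$D_{km} = \frac{D_{im} + D_{jm} - D_{ij}}{2}$$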
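As a quick numerical check of the additivity relation, the sketch below builds leaf-to-leaf distances from made-up edge lengths and recovers the distance from the internal node k to m. The edge lengths and the function name are hypothetical, chosen only for illustration.

```python
def internal_to_node_distance(d_im, d_jm, d_ij):
    """D_km = (D_im + D_jm - D_ij) / 2, valid when edge lengths are additive."""
    return (d_im + d_jm - d_ij) / 2


# Hypothetical edge lengths: i-k = 2, j-k = 3, k-m = 4.
d_ik, d_jk, d_km = 2, 3, 4

# Path sums on the tree give the pairwise distances.
d_im = d_ik + d_km
d_jm = d_jk + d_km
d_ij = d_ik + d_jk

recovered = internal_to_node_distance(d_im, d_jm, d_ij)
```

Here `recovered` equals the original `d_km`, which is the property neighbor-joining relies on when it replaces i and j with their parent node.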

The idea of Neighbor-joining

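• Distance of i from the rest of the tree:

$$u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2}$$

• To find neighboring nodes i and j, minimize over all pairs:

$$\min_{i,j} \left( D_{i,j} - (u_i + u_j) \right)$$

• Branch lengths from i and j to the new node (ij):

$$D_{i,(ij)} = \tfrac{1}{2} \left( D_{i,j} + u_i - u_j \right)$$
$$D_{j,(ij)} = \tfrac{1}{2} \left( D_{i,j} + u_j - u_i \right)$$

[Figure: example tree with leaves i, m, j, n and internal nodes k, l; branch lengths 0.1, 0.1, 0.1, 0.4, 0.4.]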

R. Durbin, et al, Additivity and neighbour-joining, Biological Sequence Analysis, p. 169-173, Cambridge Univ. Press, 1999.

Algorithm: Neighbor-Joining

• Initialization:
– Define T to be the set of leaf nodes, one for each given sequence, and put n = |T|.

• Iteration:
– For each species i, compute $u_i = \sum_{k \neq i} D_{i,k} / (n - 2)$.
– Choose a pair i, j in T for which $D_{i,j} - (u_i + u_j)$ is minimal.
– Join i and j to a new cluster k = (ij). Calculate the branch lengths from i and j to the new node k as: $D_{i,k} = \tfrac{1}{2}(D_{i,j} + u_i - u_j)$, $D_{j,k} = \tfrac{1}{2}(D_{i,j} + u_j - u_i)$.
– Compute the distances between k and each other cluster: $D_{k,m} = \tfrac{1}{2}(D_{i,m} + D_{j,m} - D_{i,j})$, for each $m \in T$ other than i and j.
– Remove i and j from T and add k.

• Termination:
– When T consists of only two nodes i and j, connect the remaining nodes by a branch of length $D_{i,j}$.
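The neighbor-joining steps above translate fairly directly into Python. The sketch below is a minimal illustration under stated assumptions: the function name, the edge-list output format, and the four-leaf example (leaves A, B, C, D with made-up branch lengths 0.1 and 0.4 and an internal edge of 0.1) are hypothetical and not taken from the slides. In that example A and B are the closest pair, yet A's true neighbor is C and B's is D.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining.

    names: leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance.
    Returns a list of edges (node, other_node, branch_length); internal
    nodes are represented as tuples of the two nodes they join.
    """
    d = dict(dist)
    nodes = list(names)                     # the set T
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # u_i: scaled distance of i from the rest of the tree.
        u = {
            i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
            for i in nodes
        }
        # Pick the pair minimizing D_ij - (u_i + u_j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]],
        )
        dij = d[frozenset((i, j))]
        k = (i, j)                          # new internal node
        edges.append((i, k, 0.5 * (dij + u[i] - u[j])))
        edges.append((j, k, 0.5 * (dij + u[j] - u[i])))
        # Distances from the new node to every remaining node m.
        for m in nodes:
            if m != i and m != j:
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        nodes = [m for m in nodes if m != i and m != j] + [k]
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Hypothetical additive distances: A and C are neighbors, B and D are neighbors.
pairs = {("A", "B"): 0.3, ("A", "C"): 0.5, ("A", "D"): 0.6,
         ("B", "C"): 0.6, ("B", "D"): 0.5, ("C", "D"): 0.9}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(["A", "B", "C", "D"], dist)
```

Because the example distances are additive, the recovered leaf branches match the lengths used to build them, even though simple closest-pair clustering would have joined A and B first.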

Chaowei Sun

MEGA 2

• Molecular Evolutionary Genetics Analysis

• Provides tools for exploring and analyzing DNA and protein sequences from evolutionary perspectives

History of MEGA

• MEGA 1
– DOS-based

• MEGA 2
– User-friendly interface
– Windows
– Macintosh
– Sun Workstation
– Linux

Input

• Character Sequence
– DNA/RNA
– Protein

• Distance Matrix

• Import data from other formats: PHYLIP, XML, etc.

Character Sequence

Distance Matrix

Methods and Algorithms

• Methods for constructing phylogenetic trees from molecular data:

1. UPGMA Method

2. Neighbor-Joining (NJ) Method

3. Minimum Evolution (ME) Method

4. Maximum Parsimony (MP) Method

Unweighted Pair Group Method with Arithmetic Mean - UPGMA

• Assumes a constant rate of evolution

• Sequential clustering method

• Produces a rooted tree

• Edge lengths represent time as measured by a molecular clock

Neighbor-Joining - NJ

• No assumption of a constant rate of evolution

• Sequentially finds pairs of neighbors that may minimize the total length of the tree

• Produces an unrooted tree

• Root: placed at the midpoint of the longest route connecting two taxa in the tree

Minimum Evolution - ME

• Finds a topology with the smallest sum of branch lengths

• Time-consuming: the sum of branch lengths has to be evaluated for every topology
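To give a sense of why evaluating every topology is time-consuming, the sketch below counts the unrooted bifurcating tree topologies for n taxa using the standard formula (2n - 5)!! = 3 × 5 × … × (2n - 5). The formula is well known but not stated on the slide, and the function name is hypothetical.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


counts = {n: count_unrooted_topologies(n) for n in (4, 5, 6, 10)}
```

For ten taxa there are already more than two million topologies, which is why exhaustive search is only practical for small datasets.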

Maximum Parsimony - MP

• Finds a topology that requires the smallest number of changes (substitutions)

• For each topology, sums up the total number of substitutions
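One common way to count the substitutions a given topology requires is Fitch's algorithm, sketched below. The slides do not name a specific counting method, so this is an assumed technique for illustration. The function names, the nested-tuple tree format, and the toy alignment are all hypothetical.

```python
def fitch_score(tree, states):
    """Minimum substitutions at one site on a rooted binary tree (Fitch).

    tree:   nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping leaf name -> character state at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left = visit(node[0])
        right = visit(node[1])
        common = left & right
        if common:
            return common
        changes += 1                 # disjoint child sets force one change
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch scores over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[site] for name, seq in alignment.items()})
        for site in range(length)
    )


# Toy alignment (made up): A and B are identical, C and D are similar.
alignment = {"A": "AAC", "B": "AAC", "C": "GTC", "D": "GTA"}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), alignment)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), alignment)
```

Comparing the two scores shows the selection step: the topology grouping A with B needs fewer substitutions than the one grouping A with C, so parsimony prefers it.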

Output - UPGMA

Branch lengths are equal (constant rate)

Unrooted Tree - NJ

Root

Output - NJ

Branch length is proportional to distance

Output - ME

Comparison

Computational Method

Data type     Optimality criterion    Clustering algorithm
Characters    Parsimony               (none)
Distance      Minimum Evolution       UPGMA, Neighbor-Joining

Comparison – Cont’d

• UPGMA, Neighbor-Joining
- Fast, O(n²); suitable for large datasets
- Depend upon the order in which we add sequences to the tree

• Minimum Evolution, Maximum Parsimony
- Time-consuming, NP-complete
- Use an explicit function relating the trees to the data

Thank you and enjoy the finals…
