
Inferring phylogenetic trees: Distance methods

Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington


One-minute responses

• Thank you for this lecture. It was very interesting.
• I think I’m starting to program like a pro.
• I wish to hear more on how we can understand better the evolutionary relationships among species, preferably among distinct human populations.
• I think I enjoyed today’s lecture. More especially the class problems!
• 70% of the course has been understood by me.
• Tell us more about interpretations.
• Python part was easy to follow today.
• Python part was very easy to follow. I did not have any problem for the first time.
• The lecture was well understood.
• The Python part was not so easy for me, but OK.
• I appreciate the revision every day, it is very helpful.
• Can we learn how to have better output from Python (form / appearance)?
• Can we work at this stage on real human genetic data?

Outline

• Parsimony
• Distance methods
  – Computing distances
  – Finding the tree
• Maximum likelihood

Revision

• What is the input to a phylogenetic inference problem?
  – A multiple alignment of DNA or protein sequences.
• What is the output?
  – A binary tree showing the inferred evolutionary relationships.
• For what types of phylogenetic inference problems is maximum parsimony the right approach?
  – Small numbers of input sequences.
  – Closely related sequences.
• What are the two computational problems that must be solved in a maximum parsimony approach?
  – Enumerating all possible tree topologies.
  – Evaluating the parsimony score for a given topology.

Revision

• Evaluate the parsimony score of the given tree with respect to the first column of the given alignment.

Scer RTGH
Skud RTGV
Sbay RVGV
Smik SVGH
Spom STIL
Svin RLGH

[Figure: tree over the leaves Sbay, Skud, Scer, Svin, Spom and Smik, with every node labeled by its column-1 residue (R or S). Score = 1]
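The score of a single column can be checked with Fitch's small-parsimony algorithm. The sketch below is not from the slides: the function name `fitch` and the nested-tuple tree are assumptions made for illustration. In particular, the topology is hypothetical. It simply pairs the leaves in the order they are listed above, because the exact shape of the slide's tree cannot be recovered from the text.

```python
# Alignment from the slide.
alignment = {
    "Scer": "RTGH", "Skud": "RTGV", "Sbay": "RVGV",
    "Smik": "SVGH", "Spom": "STIL", "Svin": "RLGH",
}

# ASSUMED topology (hypothetical): leaves paired in the order listed above.
tree = ((("Sbay", "Skud"), ("Scer", "Svin")), ("Spom", "Smik"))


def fitch(node, states):
    """Return (candidate state set, number of changes) for the subtree at node."""
    if isinstance(node, str):
        # A leaf: its state is fixed and costs nothing.
        return set([states[node]]), 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


column = 0  # first column of the alignment
states = dict((name, seq[column]) for name, seq in alignment.items())
root_set, score = fitch(tree, states)
print("Column %d parsimony score: %d" % (column + 1, score))
```

Under this assumed topology the first column needs a single change, in agreement with the score on the slide. Other columns may score differently here than on the slides, since the tree shape is a guess.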

Revision

• Repeat, but use the second column of the alignment.

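Scer RTGH
Skud RTGV
Sbay RVGV
Smik SVGH
Spom STIL
Svin RLGH

[Figure: tree over the leaves Sbay, Skud, Scer, Smik, Spom and Svin, with every node labeled by its column-2 residue (T, V or L) and the two changes marked X. Score = 2]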

Selecting a method

• Choose a set of related sequences.
• Obtain a multiple sequence alignment.
• Is there strong sequence similarity?
  – Yes: maximum parsimony methods.
  – No: go to the next question.
• Is there clearly recognizable sequence similarity?
  – Yes: distance methods.
  – No: maximum likelihood methods.

Distance methods

Multiple sequence alignment → Pairwise distance matrix → Phylogenetic tree

Calculating distance

[Figure: the ancestral sequence diverges along two branches, of lengths X and Y, into species 1 and species 2.]

Ancestral: ACTGAACGTAACGC
Species 1: ACTGTAGGAATCGC
Species 2: AATGAAAGAATCGC

The distance between species 1 and 2 is the sum of X and Y.

True evolutionary history

[Figure: the ancestral sequence ACTGAACGTAACGC evolving into species 1 and species 2, with individual sites annotated by the kind of change they underwent.]

• Single substitution
• Multiple substitutions
• Coincidental substitutions
• Parallel substitutions
• Convergent substitution
• Back substitution

Jukes-Cantor model

• Assume the same probability of change at all positions and all times.
• $d_{AB}$ is the proportion of changed sites in the alignment.
• $K_{AB}$ is the expected number of changes per position.

$$K_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,d_{AB}\right)$$

Derivation at http://en.wikipedia.org/wiki/Models_of_DNA_evolution
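To get a feel for the correction, the formula can be evaluated over a range of observed proportions. This is a minimal sketch; the function name `jukes_cantor` and the choice to raise an error outside the valid range are assumptions, not something specified on the slides.

```python
import math


def jukes_cantor(d):
    """Expected changes per site, given the observed proportion d of changed sites."""
    # The logarithm is only defined while 1 - (4/3) d > 0, i.e. d < 0.75.
    if not 0.0 <= d < 0.75:
        raise ValueError("the Jukes-Cantor correction needs 0 <= d < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


for d in (0.01, 0.15, 0.30, 0.50, 0.70):
    print("d = %.2f  ->  K = %.3f" % (d, jukes_cantor(d)))
```

For small d the corrected distance is close to d itself. As d approaches 0.75 the estimate grows without bound, which reflects the multiple, parallel and back substitutions that the observed differences cannot show.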

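Jukes-Cantor model

[Figure: the species 1 and species 2 sequences from the evolutionary-history example.]

3 observed changes in 20 sites, so $d_{AB} = 3/20 = 0.15$:

$$K_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\cdot\frac{3}{20}\right) = -0.75\,\ln 0.8 \approx 0.17$$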

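Computing Jukes-Cantor distances

Species 1: ACGTGATCGGTGA
Species 2: ACTTGATGCCTAG
Species 3: A-TTACGTAATGG
Species 4: A-TTGATGGCGTA

Proportion of changed sites: [empty 4 × 4 table over species 1 to 4]

Pairwise distances: [empty 4 × 4 table over species 1 to 4]

$$K_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,d_{AB}\right)$$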


Proportion of changed sites

       1     2     3     4
1         6/12  8/12  5/12
2               7/12  4/12
3                     9/12
4

Pairwise distances

       1     2     3     4
1            ?
2
3
4

$$K_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,d_{AB}\right)$$


Proportion of changed sites

       1     2     3     4
1         6/12  8/12  5/12
2               7/12  4/12
3                     9/12
4

Pairwise distances

       1     2     3     4
1         0.82
2
3
4

From this matrix, we calculate the tree.

$$K_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,d_{AB}\right)$$
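The remaining entries of both tables can be filled in programmatically. The sketch below is an illustration under two assumptions that the slides do not state explicitly. First, any alignment column containing a gap is skipped for every pair, which is what reproduces the denominators of 12. Second, a pair with a proportion of 0.75 or more, where the logarithm is undefined, is reported as an infinite distance. The helper names are hypothetical.

```python
import math

# Alignment from the slide, keyed by species number.
alignment = {
    1: "ACGTGATCGGTGA",
    2: "ACTTGATGCCTAG",
    3: "A-TTACGTAATGG",
    4: "A-TTGATGGCGTA",
}


def ungapped_columns(seqs):
    """Indices of the columns in which no sequence has a gap."""
    return [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]


def changed_sites(a, b, columns):
    """Number of the given columns at which sequences a and b differ."""
    return sum(1 for i in columns if a[i] != b[i])


def jukes_cantor(d):
    """Jukes-Cantor distance; ASSUMPTION: report infinity when d >= 0.75."""
    if d >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


columns = ungapped_columns(list(alignment.values()))
counts = {}
distances = {}
for a in sorted(alignment):
    for b in sorted(alignment):
        if a < b:
            n = changed_sites(alignment[a], alignment[b], columns)
            counts[(a, b)] = n
            distances[(a, b)] = jukes_cantor(n / float(len(columns)))
            print("%d vs %d: %2d/%d changed, K = %.2f"
                  % (a, b, n, len(columns), distances[(a, b)]))
```

The counts agree with the proportions in the table above, and the distance for species 1 and 2 comes out at 0.82. Species 3 and 4 differ at 9 of 12 sites, exactly the 0.75 limit of the model, so that entry has no finite Jukes-Cantor estimate.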

Other models

• Jukes-Cantor
  – The simplest possible model
• Kimura
  – 2 parameters
  – Differentiates between transitions and transversions.
• F84, HKY
  – 5 parameters
  – Allows arbitrary base frequencies.
• Tamura-Nei
  – 6 parameters
  – Combination of F84 and HKY.
• General time-reversible model
  – 12 parameters
  – Only assumes Pr(x→y) = Pr(y→x)
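As an illustration of what "differentiates between transitions and transversions" means in practice, the sketch below computes the Kimura two-parameter distance, $K = -\frac{1}{2}\ln\left((1 - 2P - Q)\sqrt{1 - 2Q}\right)$, where P and Q are the observed proportions of transitions and transversions. The slides do not give this formula; it is the standard form of Kimura's model. The function name `k2p_distance` is an assumption, and the example reuses species 1 and 2 from the "Calculating distance" slide.

```python
import math

PURINES = set("AG")  # A and G; C and T are the pyrimidines


def k2p_distance(seq1, seq2):
    """Return (transitions, transversions, distance) for two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A<->G and C<->T stay within one chemical class: a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = float(len(seq1))
    p, q = transitions / n, transversions / n
    distance = -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
    return transitions, transversions, distance


ts, tv, k = k2p_distance("ACTGTAGGAATCGC", "AATGAAAGAATCGC")
print("transitions = %d, transversions = %d, K = %.3f" % (ts, tv, k))
```

When transversions are exactly twice as frequent as transitions, as happens in this small example, the two-parameter distance coincides with the Jukes-Cantor distance for the same proportion of changed sites.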

Distance methods

Multiple sequence alignment → Pairwise distance matrix → Phylogenetic tree

Methods for going from the distance matrix to the tree:

• Fitch-Margoliash
• Neighbor-joining
• UPGMA

UPGMA

• Unweighted pair group method with arithmetic mean.
• Also known as agglomerative hierarchical clustering.
• Basic idea: iteratively connect the two most closely related sequences.

UPGMA

      Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer     0    31    40    32    29   323   253
Spar    31     0    26    37    30   300   229
Smik    40    26     0    25    35   290   219
Sbay    32    37    25     0    30   298   227
Skud    29    30    35    30     0   316   243
Scas   323   300   290   298   316     0    95
Sklu   253   229   219   227   243    95     0

UPGMA

• Find the smallest off-diagonal element in the matrix (here 25, between Smik and Sbay).

      Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer     0    31    40    32    29   323   253
Spar    31     0    26    37    30   300   229
Smik    40    26     0    25    35   290   219
Sbay    32    37    25     0    30   298   227
Skud    29    30    35    30     0   316   243
Scas   323   300   290   298   316     0    95
Sklu   253   229   219   227   243    95     0

UPGMA

• Compute the average of the two corresponding rows and columns (here Smik and Sbay).

      Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer     0    31    40    32    29   323   253
Spar    31     0    26    37    30   300   229
Smik    40    26     0    25    35   290   219
Sbay    32    37    25     0    30   298   227
Skud    29    30    35    30     0   316   243
Scas   323   300   290   298   316     0    95
Sklu   253   229   219   227   243    95     0

UPGMA


      Scer  Spar  Smik  Skud  Scas  Sklu
Scer     0    31    36    29   323   253
Spar    31     0  31.5    30   300   229
Smik    36  31.5     0  32.5   294   223
Sbay  (merged into Smik)
Skud    29    30  32.5     0   316   243
Scas   323   300   294   316     0    95
Sklu   253   229   223   243    95     0

UPGMA

               Scer      Spar Smik-Sbay      Skud      Scas      Sklu
Scer              0        31        36        29       323       253
Spar             31         0      31.5        30       300       229
Smik-Sbay        36      31.5         0      32.5       294       223
Skud             29        30      32.5         0       316       243
Scas            323       300       294       316         0        95
Sklu            253       229       223       243        95         0

• Each merger creates a subtree.

[Figure: subtree joining Smik and Sbay.]

Perform the next merger


               Scer      Spar Smik-Sbay      Skud      Scas      Sklu
Scer              0        31        36        29       323       253
Spar             31         0      31.5        30       300       229
Smik-Sbay        36      31.5         0      32.5       294       223
Skud             29        30      32.5         0       316       243
Scas            323       300       294       316         0        95
Sklu            253       229       223       243        95         0

[Figure: subtree joining Smik and Sbay.]



               Spar Smik-Sbay Skud-Scer      Scas      Sklu
Spar              0      31.5      30.5       300       229
Smik-Sbay      31.5         0     34.25       294       223
Skud-Scer      30.5     34.25         0     319.5       248
Scas            300       294     319.5         0        95
Sklu            229       223       248        95         0

[Figure: two subtrees, one joining Smik and Sbay and one joining Skud and Scer.]


What is next?
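One way to answer this question, and to check the two mergers already performed, is to run the procedure to completion. The sketch below is an illustration rather than code from the lecture: the function name `upgma` and the representation of clusters as tuples of names are assumptions. It weights each average by cluster size, which is the usual UPGMA update. While the two merged clusters have equal sizes, as in the first two mergers, this is the same as the plain average used on the slides.

```python
names = ["Scer", "Spar", "Smik", "Sbay", "Skud", "Scas", "Sklu"]
matrix = [
    [0, 31, 40, 32, 29, 323, 253],
    [31, 0, 26, 37, 30, 300, 229],
    [40, 26, 0, 25, 35, 290, 219],
    [32, 37, 25, 0, 30, 298, 227],
    [29, 30, 35, 30, 0, 316, 243],
    [323, 300, 290, 298, 316, 0, 95],
    [253, 229, 219, 227, 243, 95, 0],
]


def upgma(names, matrix):
    """Return the mergers, in order, as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in names]  # a cluster is a tuple of leaf names
    dist = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist[frozenset([clusters[i], clusters[j]])] = float(matrix[i][j])
    mergers = []
    while len(clusters) > 1:
        # Find the smallest off-diagonal element.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist[frozenset([clusters[i], clusters[j]])]
                if best is None or d < best[0]:
                    best = (d, clusters[i], clusters[j])
        d, a, b = best
        mergers.append((a, b, d))
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        # Distance to the new cluster: size-weighted average of the old rows.
        for c in clusters:
            d_ac = dist[frozenset([a, c])]
            d_bc = dist[frozenset([b, c])]
            dist[frozenset([merged, c])] = (
                (len(a) * d_ac + len(b) * d_bc) / float(len(a) + len(b)))
        clusters.append(merged)
    return mergers


mergers = upgma(names, matrix)
for a, b, d in mergers:
    print("merge %s with %s at distance %g" % ("-".join(a), "-".join(b), d))
```

With the matrix from the slides, the first two mergers reproduce Smik with Sbay (25) and Scer with Skud (29). The third joins Spar to the Scer-Skud cluster, whose distance of 30.5 is the smallest off-diagonal element in the last matrix shown.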

Formatting with %

• Insert % between a string and a tuple to get formatted output.
• Use %s for strings, %d for integers, and %f or %g for floats.
• Use %f for a fixed number of decimal places, %e for exponent, %g for either.
  – %g rounds to specified number of digits of precision
  – %g uses either fixed or exponential notation, depending on the value
• Use leading numbers to specify width.
  – Replace with * to provide width as an input.

Full details at http://docs.python.org/2/library/string.html
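The rules above can be tried out on a few values. The variable names and sample values below are made up for illustration.

```python
name, length, freq = "crp", 105, 0.352381

basic = "%s %d %f" % (name, length, freq)  # string, integer, float
fixed = "%.2f" % freq                      # two decimal places
expo = "%e" % 12345.678                    # always exponential notation
general_big = "%g" % 12345.678             # %g chooses fixed notation here
general_small = "%g" % 0.0000012345678     # and exponential notation here
padded = "%8s|%5d|" % (name, length)       # leading numbers set the field width
starred = "%*s|" % (8, name)               # * takes the width from the tuple

for text in (basic, fixed, expo, general_big, general_small, padded, starred):
    print(text)
```

The `|` characters are only there to make the padding visible. Both `padded` and `starred` right-align `crp` in a field of eight characters.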

Problem #1

• Write a program that reads sequences from a given file and prints, in aligned columns, the sequence ID, length and frequency of each letter. You may assume that each sequence is no more than 100,000 characters.

> ./compute-seq-stats.py sample-dna.txt
Read 11 sequences from sample-dna.txt.
ce1cg  77 A=0.17 C=0.12 G=0.31 T=0.40
ara    87 A=0.34 C=0.23 G=0.18 T=0.24
bglr1  61 A=0.41 C=0.13 G=0.07 T=0.39
crp   105 A=0.35 C=0.20 G=0.22 T=0.23
cya    72 A=0.24 C=0.19 G=0.21 T=0.36
deop2 102 A=0.29 C=0.11 G=0.25 T=0.34
gale   73 A=0.30 C=0.23 G=0.12 T=0.34
ilv   105 A=0.22 C=0.26 G=0.17 T=0.35
lac    86 A=0.22 C=0.22 G=0.22 T=0.34
male   54 A=0.31 C=0.24 G=0.28 T=0.17
malk   65 A=0.26 C=0.15 G=0.37 T=0.22

• Version 1: Use the alphabet ACGT and a fixed width for the sequence ID.

• Version 2: Adjust the field width of the sequence ID based on the longest sequence ID.

• Version 3: Use the alphabet of the given sequences. Print fields in alphabetical order.

• Version 4: Add a header line to your output file.
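As a starting point for the formatting part of this problem, the sketch below builds the aligned output lines from sequences that are already in memory. It is not the course solution. The function name `seq_stats_lines` and the two example records are hypothetical, and reading the input file is left out because its format is not described here.

```python
def seq_stats_lines(records, alphabet="ACGT"):
    """Format (sequence ID, sequence) pairs as aligned lines of statistics."""
    # Field width for the ID column: the longest sequence ID.
    width = max(len(seq_id) for seq_id, _ in records)
    lines = []
    for seq_id, seq in records:
        freqs = " ".join(
            "%s=%.2f" % (letter, seq.count(letter) / float(len(seq)))
            for letter in alphabet)
        # %-*s left-justifies the ID in a field whose width is passed in.
        lines.append("%-*s %4d %s" % (width, seq_id, len(seq), freqs))
    return lines


# Hypothetical records, for illustration only.
records = [("seqA", "ACGTACGTAA"), ("longer_id", "GGGC")]
lines = seq_stats_lines(records)
for line in lines:
    print(line)
```

Passing a different `alphabet`, for example the sorted set of letters found in the sequences, covers the case where the alphabet is taken from the data.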

> ./compute-seq-stats-4.py ribosomal.txt
Read 13 sequences from ribosomal.txt.
Longest sequence ID = 32.
20 letters in alphabet.
Alphabet=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'].
Sequence                         Len     A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
gi|457875803|ref|XP_004224433.1| 108 0.111 0.009 0.009 0.065 0.009 0.028 0.019 0.074 0.194 0.083 0.019 0.046 0.028 0.046 0.028 0.093 0.037 0.065 0.009 0.028
gi|351065825|emb|CCD61804.1      117 0.077 0.009 0.043 0.051 0.009 0.085 0.026 0.026 0.205 0.077 0.017 0.017 0.051 0.026 0.034 0.051 0.043 0.111 0.009 0.034
gi|459660330|gb|EMH75739.1       137 0.146 0.015 0.044 0.044 0.007 0.051 0.015 0.044 0.234 0.066 0.022 0.022 0.066 0.007 0.015 0.058 0.051 0.073 0.015 0.007
gi|449802221|pdb|3ZEY|U          113 0.097 0.018 0.035 0.035 0.018 0.071 0.009 0.044 0.186 0.080 0.044 0.027 0.044 0.035 0.062 0.062 0.053 0.053 0.009 0.018
gi|198419437|ref|XP_002130703.1  112 0.062 0.000 0.027 0.045 0.009 0.071 0.009 0.062 0.179 0.098 0.009 0.036 0.045 0.062 0.054 0.080 0.054 0.054 0.009 0.036
gi|17542024|ref|NP_500895.1      117 0.077 0.009 0.043 0.051 0.009 0.085 0.026 0.026 0.205 0.077 0.017 0.017 0.051 0.026 0.034 0.051 0.043 0.111 0.009 0.034
gi|187129228|ref|NP_001119663.1  116 0.034 0.009 0.043 0.052 0.009 0.078 0.017 0.034 0.216 0.095 0.009 0.017 0.043 0.069 0.043 0.078 0.043 0.078 0.009 0.026
gi|359807542|ref|NP_001241406.1  108 0.102 0.000 0.037 0.028 0.009 0.056 0.009 0.056 0.167 0.074 0.028 0.037 0.065 0.056 0.065 0.102 0.046 0.028 0.009 0.028
gi|351725913|ref|NP_001236341.1  108 0.093 0.000 0.037 0.028 0.009 0.065 0.009 0.056 0.167 0.074 0.037 0.037 0.065 0.046 0.065 0.102 0.046 0.028 0.009 0.028
gi|52346074|ref|NP_001005084.1   125 0.088 0.008 0.072 0.040 0.008 0.072 0.008 0.032 0.216 0.096 0.008 0.048 0.048 0.016 0.048 0.056 0.040 0.064 0.008 0.024
gi|41387126|ref|NP_957109.1      124 0.089 0.000 0.065 0.048 0.008 0.065 0.008 0.032 0.218 0.097 0.008 0.040 0.048 0.024 0.048 0.056 0.040 0.065 0.008 0.032
gi|6323365|ref|NP_013437.1       108 0.139 0.000 0.037 0.046 0.000 0.046 0.028 0.074 0.167 0.083 0.019 0.000 0.037 0.046 0.065 0.093 0.019 0.056 0.009 0.037
gi|6321464|ref|NP_011541.1       108 0.130 0.000 0.037 0.046 0.000 0.046 0.028 0.074 0.167 0.083 0.019 0.000 0.037 0.046 0.065 0.093 0.028 0.056 0.009 0.037