CS262 Lecture 9, Win07, Batzoglou

Phylogeny Tree Reconstruction

[Figure: example phylogenetic tree over leaves numbered 1 to 5]

Phylogenetic Trees

• Nodes: species

• Edges: time of independent evolution

• Edge length represents evolution time
   AKA genetic distance
   Not necessarily chronological time

Parsimony – direct method not using distances

• One of the most popular methods:

GIVEN multiple alignment
FIND tree & history of substitutions explaining alignment

Idea:

Find the tree that explains the observed sequences with a minimal number of substitutions

Two computational subproblems:

1. Find the parsimony cost of a given tree (easy)

2. Search through all tree topologies (hard)

Example: Parsimony cost of one column

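[Figure: a four-leaf tree for one column with leaf characters A, B, A, A and leaf sets {A}, {B}, {A}, {A}. The internal-node sets are {A, B} (empty intersection, so C += 1), {A}, and {A}. Final cost C = 1.]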

Parsimony Scoring

Given a tree, and an alignment column u

Label internal nodes to minimize the number of required substitutions

Initialization:

Set cost C = 0; node $k = 2N - 1$ (the root)

Iteration:

If k is a leaf, set $R_k = \{ x_k[u] \}$ // $R_k$ is simply the character of the kth species

If k is not a leaf,

Let i, j be the daughter nodes; compute $R_i$ and $R_j$ first

Set $R_k = R_i \cap R_j$ if the intersection is nonempty

Set $R_k = R_i \cup R_j$, and C += 1, if the intersection is empty

Termination:

Minimal cost of tree for column u = C
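
The scoring pass can be sketched in Python as a post-order walk. This is a minimal sketch, not the lecture's own code: the nested-tuple tree encoding, the species names, and the function name `fitch_cost` are assumptions made for illustration.

```python
def fitch_cost(tree, column):
    """Return (cost, root_set) for one alignment column.

    tree:   nested 2-tuples; a leaf is a species name (hypothetical encoding).
    column: dict mapping species name -> character in this column.
    """
    cost = 0

    def visit(node):
        nonlocal cost
        if not isinstance(node, tuple):      # leaf: R_k = {x_k[u]}
            return frozenset(column[node])
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                           # nonempty intersection
            return common
        cost += 1                            # empty: take the union, count one substitution
        return left | right

    root_set = visit(tree)
    return cost, root_set


# One column with characters A, B, A, A on a four-leaf tree.
tree = ((("s1", "s2"), "s3"), "s4")
print(fitch_cost(tree, {"s1": "A", "s2": "B", "s3": "A", "s4": "A"}))
```

Each node is visited once, so scoring one column takes time linear in the number of nodes.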

Example

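[Figure: parsimony scoring on an eight-leaf tree. Each leaf carries character A or B with the matching singleton set; the seven internal-node sets shown are {A}, {A}, {A}, {A, B}, {A, B}, {B}, {B}.]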

Traceback to find ancestral nucleotides

Traceback:

1. Choose an arbitrary nucleotide from $R_{2N-1}$ for the root

2. Having chosen nucleotide r for parent k,

If $r \in R_i$, choose r for daughter i

Else, choose an arbitrary nucleotide from $R_i$

Easy to see that this traceback produces some assignment of cost C
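
A hedged sketch of the traceback in Python follows. It recomputes the node sets bottom-up, then assigns characters top-down; the `(set, left, right)` node layout and the helper names `fitch_sets`, `traceback`, and `substitutions` are assumptions made for illustration.

```python
import random


def fitch_sets(tree, column):
    """Bottom-up pass: return nested (set, left, right); leaves have no children."""
    if not isinstance(tree, tuple):
        return (frozenset(column[tree]), None, None)
    left, right = fitch_sets(tree[0], column), fitch_sets(tree[1], column)
    common = left[0] & right[0]
    return (common or (left[0] | right[0]), left, right)


def traceback(node, parent_char=None, rng=random):
    """Top-down pass: keep the parent's character when the node's set allows it."""
    chars, left, right = node
    if parent_char in chars:
        char = parent_char
    else:
        char = rng.choice(sorted(chars))     # arbitrary choice from the set
    if left is None:
        return (char, None, None)
    return (char, traceback(left, char, rng), traceback(right, char, rng))


def substitutions(labelled):
    """Count edges whose two endpoints carry different characters."""
    char, left, right = labelled
    if left is None:
        return 0
    return sum((child[0] != char) + substitutions(child) for child in (left, right))


tree = ((("s1", "s2"), "s3"), "s4")
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "B"}
print(substitutions(traceback(fitch_sets(tree, column))))
```

Whichever arbitrary choices are made, the number of substitutions equals the cost C found by the scoring pass.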

Example

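[Figure: a four-leaf tree with leaf characters A, B, A, B and leaf sets {A}, {B}, {A}, {B}; the internal-node sets are {A, B}, {A, B}, and {A}. Three labelings of the internal nodes are drawn, each with two substitutions (marked x). The figure captions them as either "Admissible with Traceback" or "Still optimal, but inadmissible with Traceback".]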

Multiple Sequence Alignments

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

Orthology and Paralogy

[Figure: gene tree with leaves HB Human, WB Worm, HA1 Human, HA2 Human, Yeast, WA Worm]

Orthologs: Derived by speciation

Paralogs: Everything else

Orthology, Paralogy, Inparalogs, Outparalogs

Definition

• Given N sequences $x_1, x_2, \ldots, x_N$: insert gaps (-) in each sequence $x_i$, such that
   • All sequences have the same length L
   • Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many sequences

• Multiple alignments reveal elements that are conserved among a class of organisms and are therefore likely important in their common biology

• The patterns of conservation can help us infer the function of the element

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment: keep two rows and drop every column in which both have a gap

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
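
The induced alignments and a sum-of-pairs score can be computed with a few lines of Python. This is a sketch under assumptions: the match, mismatch, and gap values are hypothetical linear costs chosen for illustration, and the function names are not from the lecture.

```python
from itertools import combinations


def induced_pair(row_a, row_b):
    """Project a multiple alignment onto two rows, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score one pairwise alignment column by column (hypothetical costs)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if "-" in (a, b):
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b)) for a, b in combinations(rows, 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```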

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: tree over Human, Mouse, Chicken, Duck]

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
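
A small Python sketch of the weighted sum of pairs. The pair weights and the letter-scoring function here are hypothetical; in practice the weights would be derived from the tree, which this sketch does not attempt.

```python
from itertools import combinations


def letter_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical per-column pair score; a gap-gap column contributes nothing."""
    if a == "-" and b == "-":
        return 0
    if "-" in (a, b):
        return gap
    return match if a == b else mismatch


def weighted_sop(rows, weights):
    """S(m) = sum over k < l of w_kl * s(row k, row l)."""
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        pair = sum(letter_score(a, b) for a, b in zip(rows[k], rows[l]))
        total += weights[(k, l)] * pair
    return total


rows = ["AC", "AG", "-C"]
weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.5}   # assumed weights
print(weighted_sop(rows, weights))
```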

A Profile Representation

• Given a multiple alignment $M = m_1 \ldots m_n$
   Replace each column $m_i$ with profile entry $p_i$
      • Frequency of each letter in column $m_i$
      • # gaps
      • Optional: # gap openings, extensions, closings

Can think of this as a “likelihood” of each letter in each position

-AGGCTATCACCTG
TAG-CTACCA---G
CAG-CTACCA---G
CAG-CTATCAC-GG
CAG-CTATCGC-GG

    1   2   3   4   5   6   7   8   9   10  11  12  13  14
A       1                   1           .8
C   .6              1           .4  1       .6  .2
G           1   .2                      .2          .4  1
T   .2                  1       .6                  .2
-   .2          .8                          .4  .8  .4
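
Building such a profile is a per-column frequency count. A minimal Python sketch, where `build_profile` is an illustrative name and gaps are treated as a fifth symbol:

```python
from collections import Counter


def build_profile(rows):
    """Return one {symbol: frequency} dict per alignment column."""
    n = len(rows)
    return [{ch: count / n for ch, count in Counter(col).items()}
            for col in zip(*rows)]


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
profile = build_profile(rows)
print(profile[0])    # first column
print(profile[9])    # tenth column
```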

Multiple Sequence Alignments

Algorithms

Multidimensional DP

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$

(sum of column scores)

$F(i_1, i_2, \ldots, i_N)$: Optimal alignment up to $(i_1, \ldots, i_N)$

$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \left( F(\text{nbr}) + S(\text{nbr}) \right)$

Multidimensional DP

• Example: in 3D (three sequences):

• 7 neighbors/cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\
F(i-1, j-1, k) + S(x_i, x_j, -) \\
F(i-1, j, k-1) + S(x_i, -, x_k) \\
F(i-1, j, k) + S(x_i, -, -) \\
F(i, j-1, k-1) + S(-, x_j, x_k) \\
F(i, j-1, k) + S(-, x_j, -) \\
F(i, j, k-1) + S(-, -, x_k)
\end{cases}
$$
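
The three-sequence recurrence can be sketched directly in Python. This is an illustrative sketch only: the column score is assumed to be a sum of pairs with hypothetical match, mismatch, and gap values, and only the optimal score is returned, not the alignment.

```python
from itertools import combinations, product


def column_score(chars, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column (hypothetical costs, gap-gap = 0)."""
    total = 0
    for a, b in combinations(chars, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def align3_score(x, y, z):
    """Optimal score of aligning three sequences by 3D dynamic programming."""
    seqs = (x, y, z)
    # The 7 neighbors: every nonzero 0/1 step vector.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so predecessors come first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            chars = [s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)]
            cand = F[prev] + column_score(chars)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]


print(align3_score("AC", "AC", "AC"))
```

The table has one cell per index triple and each cell inspects 7 predecessors, which is why this approach does not scale beyond a handful of sequences.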

Multidimensional DP

Running Time:

1. Size of matrix: $L^N$;

Where L = length of each sequence

N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$

• How do gap states generalize?

• VERY badly!
   Require $2^N - 1$ states, one per combination of gapped/ungapped sequences
   Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$

[Figure: the seven states for three sequences: X, Y, Z, XY, XZ, YZ, XYZ]

Progressive Alignment

• When evolutionary tree is known:

   Align closest first, in the order of the tree
   In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{result}$

   Weighted version:
      Tree edges have weights, proportional to the divergence in that edge
      New profile is a weighted average of two old profiles

[Figure: guide tree over sequences x, y, z, w with intermediate profiles $p_{xy}$, $p_{zw}$ and final profile $p_{xyzw}$]

Example

Profile: (A, C, G, T, -)
$p_x$ = (0.8, 0.2, 0, 0, 0)
$p_y$ = (0.6, 0, 0, 0, 0.4)

$s(p_x, p_y) = 0.8 \cdot 0.6 \cdot s(A, A) + 0.2 \cdot 0.6 \cdot s(C, A) + 0.8 \cdot 0.4 \cdot s(A, -) + 0.2 \cdot 0.4 \cdot s(C, -)$

Result: $p_{xy}$ = (0.7, 0.1, 0, 0, 0.2)

$s(p_x, -) = 0.8 \cdot 1.0 \cdot s(A, -) + 0.2 \cdot 1.0 \cdot s(C, -)$

Result: $p_{x-}$ = (0.4, 0.1, 0, 0, 0.5)
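
The same arithmetic in Python, as a sketch. The letter scores in `s` are hypothetical values picked for illustration, and the equal weights in `merge_profiles` correspond to the unweighted case; the names are not from the lecture.

```python
ALPHABET = ("A", "C", "G", "T", "-")
GAP_COLUMN = (0.0, 0.0, 0.0, 0.0, 1.0)


def s(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Hypothetical letter-pair score; gap against gap scores 0."""
    if a == "-" and b == "-":
        return 0.0
    if "-" in (a, b):
        return gap
    return match if a == b else mismatch


def profile_score(px, py):
    """Expected pair score: sum over letter pairs of px[a] * py[b] * s(a, b)."""
    return sum(fa * fb * s(a, b)
               for a, fa in zip(ALPHABET, px)
               for b, fb in zip(ALPHABET, py))


def merge_profiles(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile entries (equal weights by default)."""
    return tuple(wx * fa + wy * fb for fa, fb in zip(px, py))


px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
print(profile_score(px, py), merge_profiles(px, py))
print(profile_score(px, GAP_COLUMN), merge_profiles(px, GAP_COLUMN))
```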

Progressive Alignment

• When evolutionary tree is unknown:

   Perform all pairwise alignments
   Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
   Construct a tree (UPGMA / Neighbor Joining / Other methods)
   Align on the tree

[Figure: sequences x, y, z, w with an unknown tree]
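
The tree-construction step can be illustrated with a compact UPGMA sketch in Python. The distance values are made up for the example, and the dictionary-of-frozensets representation is an assumption chosen to keep the code short; the resulting nested tuple gives the order in which to align.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree as nested tuples.

    names: list of sequence labels.
    dist:  {frozenset({a, b}): distance} for every pair of labels.
    """
    clusters = {name: 1 for name in names}        # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


names = ["x", "y", "z", "w"]
dist = {frozenset(p): 10.0 for p in combinations(names, 2)}
dist[frozenset(("x", "y"))] = 2.0
dist[frozenset(("z", "w"))] = 4.0
print(upgma(names, dist))
```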

Heuristics to improve alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT     Frozen!

z: GAACTG
w: GTACTG     Now clear correct y = GA-CTT

Iterative Refinement

Algorithm (Barton-Sternberg):

1. For j = 1 to N,
   Remove $x_j$, and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$

2. Repeat step 1 until convergence

[Figure: sequences x, y, z; the projection of x and z is held fixed while y is allowed to vary]
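
A simplified Python sketch of this remove-and-realign loop. It is not the Barton-Sternberg profile alignment itself: as a stated simplification, the removed sequence is re-placed by brute force over gap positions, the other rows stay fixed, the alignment length is kept, and the objective is a plain count of matching letters.

```python
from itertools import combinations


def matches(row, others):
    """Number of identical, non-gap letter pairs between row and the fixed rows."""
    return sum(a == b and a != "-" for other in others for a, b in zip(row, other))


def realign_one(seq, others):
    """Best gapped version of seq against the fixed rows (brute force)."""
    length = len(others[0])
    best = None
    for gap_pos in combinations(range(length), length - len(seq)):
        letters = iter(seq)
        row = "".join("-" if i in gap_pos else next(letters) for i in range(length))
        if best is None or matches(row, others) > matches(best, others):
            best = row
    return best


def refine(rows, max_rounds=10):
    """Repeat: remove each row, realign it, keep the change only if it improves."""
    rows = list(rows)
    for _ in range(max_rounds):
        changed = False
        for j in range(len(rows)):
            others = rows[:j] + rows[j + 1:]
            new_row = realign_one(rows[j].replace("-", ""), others)
            if matches(new_row, others) > matches(rows[j], others):
                rows[j], changed = new_row, True
        if not changed:
            break
    return rows


print(refine(["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]))
```

The brute-force placement is exponential in the number of gaps, so it only serves to show the control flow; a real implementation would realign the removed sequence to the profile of the rest with dynamic programming.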

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA     + 3 matches
z: GAACTGA
w: GTACTGA

Iterative Refinement

Example not handled well:

x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z: GAACTGA
w: GTACTGA

Realigning any single $y_i$ changes nothing

Consistency

[Figure: sequences x, y, z with positions $x_i$, $y_j$, $y_{j'}$, and $z_k$ marked]

Consistency

Basic method for applying consistency

• Compute all pairs of alignments xy, xz, yz, …

• When aligning x, y during progressive alignment,

   For each $(x_i, y_j)$, let $s(x_i, y_j) = \text{function\_of}(x_i, y_j, a_{xz}, a_{yz})$
   Align x and y with DP using the modified s(.,.) function

Real-world protein aligners

• MUSCLE
   High throughput
   One of the best in accuracy

• ProbCons
   High accuracy
   Reasonable speed

MUSCLE at a glance

1. Fast measurement of all pairwise distances between sequences
   • $D_{DRAFT}(x, y)$ defined in terms of # common k-mers (k~3); $O(N^2 L \log L)$ time

2. Build tree $T_{DRAFT}$ based on those distances, with UPGMA

3. Progressive alignment over $T_{DRAFT}$, resulting in multiple alignment $M_{DRAFT}$

   • Only perform alignment steps for the parts of the tree that have changed

4. Measure new Kimura-based distances D(x, y) based on $M_{DRAFT}$

5. Build tree T based on D

6. Progressive alignment over T, to build M

7. Iterative refinement; for many rounds, do:
   • Tree Partitioning: Split M on one branch and realign the two resulting profiles
   • If new alignment M’ has better sum-of-pairs score than previous one, accept
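
Step 1 can be illustrated with a k-mer counting sketch in Python. The distance formula used here (one minus the fraction of shared k-mers) is a simplified assumption for illustration and is not claimed to be MUSCLE's exact definition; sequences are assumed to be at least k letters long.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """Simplified alignment-free distance: 1 - (shared k-mers / possible k-mers)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())          # multiset intersection
    possible = min(len(x), len(y)) - k + 1
    return 1.0 - shared / possible


print(kmer_distance("ACGTACGT", "ACGTACGT"))
print(kmer_distance("ACGTACGT", "ACGTTCGT"))
```

No alignment is computed, which is what makes this first pass fast enough to run on all pairs.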

PROBCONS at a glance

1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM

2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency

   • $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_k M_{xz}(i, k) \, M_{yz}(j, k)$; $M'_{xy} = \mathrm{Avg}_z(M_{xz} M_{zy})$

3. Compute for every pair x, y, the maximum expected accuracy alignment
   • $A_{xy}$: alignment that maximizes $\sum_{\text{aligned } (i, j) \in A} M'_{xy}(i, j)$
   • Define $E(x, y) = \sum_{\text{aligned } (i, j) \in A_{xy}} M'_{xy}(i, j)$

4. Build tree T with hierarchical clustering using similarity measure E(x, y)

5. Progressive alignment on T to maximize E(.,.)

6. Iterative refinement; for many rounds, do:
   • Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles
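
The consistency re-estimation in step 2 is an average of matrix products. A pure-Python sketch under stated assumptions: the posterior of a sequence with itself is taken to be the identity matrix, the matrix for the reversed pair is the transpose, and the data layout and function names are illustrative rather than ProbCons's own.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency(post, lengths, x, y):
    """Return M'_xy = average over all sequences z of M_xz * M_zy.

    post:    {(a, b): matrix of shape len(a) x len(b)} for one order of each pair.
    lengths: {sequence name: length}.
    """
    def get(a, b):
        if a == b:                                   # assumed: M_aa is the identity
            n = lengths[a]
            return [[float(i == j) for j in range(n)] for i in range(n)]
        if (a, b) in post:
            return post[(a, b)]
        return [list(col) for col in zip(*post[(b, a)])]   # transpose

    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        prod = matmul(get(x, z), get(z, y))
        for i, row in enumerate(prod):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[value / len(lengths) for value in row] for row in total]


lengths = {"x": 1, "y": 1, "z": 1}
post = {("x", "y"): [[0.5]], ("x", "z"): [[1.0]], ("y", "z"): [[1.0]]}
print(consistency(post, lengths, "x", "y"))
```

In the toy input, x and y are only weakly paired directly, but both pair strongly with z, so the re-estimated value for the x-y pair rises above 0.5.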

Some Resources

Genome Resources

Annotation and alignment genome browser at UCSC
http://genome.ucsc.edu/cgi-bin/hgGateway

Specialized VISTA alignment browser at LBNL
http://pipeline.lbl.gov/cgi-bin/gateway2

ABC: Nice Stanford tool for browsing alignments
http://encode.stanford.edu/~asimenos/ABC/

Protein Multiple Aligners

http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable

http://probcons.stanford.edu/ PROBCONS – most accurate
