Generalized Tree Alignment: The Deferred Path Heuristic

1

Stinus [email protected]

2

Overview:

What is a phylogeny?

The Generalized Tree Alignment problem

Sequence Graphs and their algorithms

The Deferred Path Heuristic

3

Phylogeny:

Describes an evolutionary model
- Common ancestor
- Mutations happen all the time
  - Insertions, deletions, substitutions, translocations, inversions, duplications …

Most mutations happen during DNA replication
- Corrected by cell mechanisms

Mutations accumulate → new species diverge

Only mutations in sex cells are inherited (obviously)

4

Phylogeny:

Phylogenetic inference:

Given n sequences, build a phylogenetic tree T

Most methods base T on a multiple alignment

Likewise: Multiple alignments are often based on guide trees

Can we solve both problems at the same time?

5

Phylogeny:

Describes the evolutionary relationship between species

Notice the root

6

Phylogeny:

... or within a single taxon (here, human enterovirus 71)

7

The Problem:

Given n sequences s1,…,sn …

Multiple Alignment: Make an arrangement A of the sequences by inserting gaps such that homologous bases are placed in the same column

Phylogenetic Inference: Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in the internal nodes, describing their evolutionary connection

8

Generalized Tree Alignment:

Combines the two. The problem we want to solve is:

Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)

Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences

Placing the root is not trivial and is best left to biologists.

9

The given problem is proven to be MAX SNP-hard (Wang and Jiang, 1994)

→ No polynomial-time approximation scheme exists unless P = NP.

Exact solutions to NP-hard problems are intractable → The best we can hope for is a heuristic

The given algorithm runs in time O(n2.ln)
• n: The number of sequences
• l: Their maximum length.

10

Sequence graphs (Hein, 1989):

Recall pairwise alignment. Traceback "spells" possible optimal alignments:

11

Sequence graphs:

Make a graph with alignment columns as edge labels
→ represents all optimal alignments

We will get back to that shortly …

Right now, we want to represent sequences

Let us introduce sequence graphs.

For instance, s = ACTGTA is represented by:

12

Sequence graphs:

More formally:

• Directed, acyclic graph.

• Edge labels l from alphabet Σ. Here, Σ={A,C,G,T,-}

• Source s: The unique node with no incoming edges

• Sink t: The unique node with no outgoing edges.

• Each path from s to t spells a sequence.
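The definition above can be sketched in Python. This is a minimal illustration, not the representation used in the talk: the class name `SequenceGraph`, its methods, the integer node numbering, and the choice to drop gap labels when spelling a path are all assumptions made for the example.

```python
class SequenceGraph:
    """Hypothetical sketch: a DAG with labelled edges, one source, one sink."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.edges = {}  # node -> list of (label, next_node)

    def add_edge(self, u, label, v):
        self.edges.setdefault(u, []).append((label, v))

    @classmethod
    def from_sequence(cls, seq):
        # A single sequence becomes a linear graph with nodes 0..len(seq).
        g = cls(0, len(seq))
        for i, ch in enumerate(seq):
            g.add_edge(i, ch, i + 1)
        return g

    def spelled(self):
        # Enumerate every source-to-sink path and collect the sequence it spells.
        # Assumption: gap labels '-' are dropped from the spelled sequence.
        out = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                out.add(prefix.replace("-", ""))
                continue
            for label, nxt in self.edges.get(node, ()):
                stack.append((nxt, prefix + label))
        return out
```

For example, building the linear graph for ACTGTA and then adding a second edge labelled C between nodes 2 and 3 gives a graph that spells both ACTGTA and ACCGTA. Enumerating all paths is exponential in general and is only meant to make the definition concrete.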

13

Sequence graphs:

Represents a set of sequences given by all paths from s to t:

14

Sequence graphs:

Any single sequence can be represented by a linear sequence graph

Any set of k sequences can be represented by making k paths from s to t

A given sequence s’ can be represented by more than one path

We can now represent sequences – but can we align them?

15

Aligning sequence graphs:

Dynamic programming algorithm inspired by basic pairwise alignment

Pairwise Alignment:
• Given two sequences p and q
• Move one letter in p and move through q, finding the optimal "partial alignments"

Sequence Graphs:• Given two sequence graphs G1 and G2

• We can have many outgoing edges to choose from

16

Aligning sequence graphs:

Fill in a |V1| × |V2| score matrix

For each pair of nodes i from G1 and j from G2: Should we:

• Align the two characters we got by following e1 into i and e2 into j?

• Stay in G1 and only move in G2?

• Stay in G2 and only move in G1?

• Or have we already found a better path into i and j?
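The four cases above can be sketched as a dynamic program over node pairs. This is a rough illustration under stated assumptions, not the scoring scheme from the talk: the function names `linear_graph` and `align_cost`, the unit mismatch and gap costs, and the requirement that nodes are integers numbered in topological order (source 0, sink n-1) are all hypothetical choices for the example.

```python
def linear_graph(seq):
    """Edges (u, label, v) of the linear graph for seq; nodes are 0..len(seq)."""
    return [(i, ch, i + 1) for i, ch in enumerate(seq)], len(seq) + 1


def align_cost(edges1, n1, edges2, n2, mismatch=1, gap=1):
    """Minimum cost of aligning some path of G1 with some path of G2.

    Assumes every edge (u, label, v) has u < v, so increasing node
    numbers give a topological order.
    """
    INF = float("inf")
    into1 = [[] for _ in range(n1)]  # incoming edges per node of G1
    into2 = [[] for _ in range(n2)]  # incoming edges per node of G2
    for u, a, v in edges1:
        into1[v].append((u, a))
    for w, b, v in edges2:
        into2[v].append((w, b))

    def pair_cost(a, b):
        if a == b:
            return 0
        return gap if "-" in (a, b) else mismatch

    def skip_cost(a):
        return 0 if a == "-" else gap

    D = [[INF] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            best = D[i][j]  # keep a better path already found
            for u, a in into1[i]:
                for w, b in into2[j]:  # align labels of e1 into i and e2 into j
                    best = min(best, D[u][w] + pair_cost(a, b))
            for u, a in into1[i]:  # move in G1 only
                best = min(best, D[u][j] + skip_cost(a))
            for w, b in into2[j]:  # move in G2 only
                best = min(best, D[i][w] + skip_cost(b))
            D[i][j] = best
    return D[n1 - 1][n2 - 1]
```

On two linear graphs this reduces to ordinary edit distance. The sketch returns only the optimal score; recording which predecessors achieve the minimum in each cell is what a traceback would need.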

17

Optimal Alignment Graphs:

Now we need a way to remember the optimal alignments

Recall graphs from before:
• Directed, acyclic graphs
• Nodes s and t defined as before
• Edge labels of the form [la,lb] where la,lb ∊ Σ

Backtrack through the matrix and consider each possible combination of edges.

18

Optimal Alignment Graphs:

An example of an OAG:

This one represents the alignments:

We denote such a graph A*

We have to convert the OAGs back to SGs

19

Optimal Alignment Graphs:

The conversion from OAG to SG is done by considering the edge labels:

If la = lb: Make a single edge in the SG with label la

If la ≠ lb: Make two edges in the SG: One with label la and one with label lb
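The two rules above can be sketched in a few lines of Python. The function name `oag_to_sg` and the edge encoding (tuples of start node, label or label pair, end node) are assumptions made for this illustration.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, (la, lb), v) into SG edges (u, label, v).

    A set removes the duplicate when la == lb, so equal labels yield a
    single edge and differing labels yield two parallel edges.
    """
    sg = set()
    for u, (la, lb), v in oag_edges:
        sg.add((u, la, v))
        sg.add((u, lb, v))
    return sorted(sg)
```

For instance, an OAG edge labelled [C,-] between two nodes becomes two parallel SG edges, one labelled C and one labelled with a gap.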

The graph from before turns into the SG:

20

Summing up Sequence Graphs:

Final graph represents all sequences giving an optimal alignment between G1 and G2

We can:
• Represent a set of sequences by a sequence graph
• Align two such graphs producing a new SG

We can now get on with the main algorithm

21

The basic idea:
• Start by comparing all sequences
  – Find a closest pair.
• Represent all sequences giving the optimal solution
  – Defer the choice of a single sequence
• Repeat, but this time include the set of sequences
• In the end: Choose a single sequence and backtrack

This shows a need for:
- A compact representation of many sequences
- An algorithm for aligning sets of sequences

22

The Deferred Path Heuristic:

Similar to Kruskal's algorithm for finding MSTs:

From sequences s1,…,sn, initialize n SGs G1,…,Gn.

Until only two SGs remain:

• Align all pairs and choose a closest pair Gi and Gj

• Create A*(Gi,Gj) and convert A* into a SG Gk.

• Replace Gi and Gj with Gk

Note that we remember all candidate sequences
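The shape of this merge loop can be sketched as follows. To keep the example short and runnable, plain sets of candidate sequences stand in for sequence graphs, the distance between two sets is the smallest edit distance between any two members, and merging is set union. These stand-ins, and the names `edit_distance`, `set_distance` and `deferred_merge`, are assumptions for illustration only; the heuristic itself aligns SGs and builds A*.

```python
from itertools import combinations


def edit_distance(p, q):
    """Unit-cost edit distance between two strings."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def set_distance(A, B):
    """Stand-in for aligning two SGs: best distance over all candidate pairs."""
    return min(edit_distance(p, q) for p in A for q in B)


def deferred_merge(seqs, distance=set_distance, merge=lambda a, b: a | b):
    """Greedily join closest pairs until two groups remain.

    Returns the unrooted tree as a pair of nested tuples of leaf indices.
    """
    # id -> (candidate set, subtree); leaves are their own index
    items = {i: (frozenset([s]), i) for i, s in enumerate(seqs)}
    next_id = len(seqs)
    while len(items) > 2:
        i, j = min(
            combinations(sorted(items), 2),
            key=lambda p: distance(items[p[0]][0], items[p[1]][0]),
        )
        (gi, ti), (gj, tj) = items.pop(i), items.pop(j)
        # Keep every candidate: the choice of a single sequence is deferred.
        items[next_id] = (merge(gi, gj), (ti, tj))
        next_id += 1
    (_, ta), (_, tb) = items.values()
    return (ta, tb)
```

Passing `distance` and `merge` as parameters leaves room to substitute a real SG alignment and an A*-to-SG conversion without changing the loop.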

23

The Deferred Path Heuristic:

When only two SGs Gi and Gj remain:
• Align them and connect them in T
• Choose some optimal alignment
  – This gives si and sj at the roots of the two subtrees.
• Backtrack through the subtrees
  – At each step: Align sk to the underlying SGs.
  – Choose some optimal alignment

24

The Deferred Path Heuristic:

We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:
