Generalized Tree Alignment: The Deferred Path Heuristic
Stinus Lindgreen
2
Overview:
• What is a phylogeny?
• The Generalized Tree Alignment problem
• Sequence Graphs and their algorithms
• The Deferred Path Heuristic
3
Phylogeny:
Describes evolutionary model
- Common ancestor
- Mutations happen all the time
- Insertions, deletions, substitutions, translocations, inversions, duplications …
Most mutations happen in DNA replication
- Corrected by cell mechanisms
Mutations accumulate → new species diverge
Only mutations in sex cells are inherited (obviously)
4
Phylogeny:
Phylogenetic inference:
Given n sequences, build a phylogenetic tree T
Most methods base T on a multiple alignment
Likewise: Multiple alignments are often based on guide trees
Can we solve both problems at the same time?
5
Phylogeny:
Describes the evolutionary relationship between species
Notice the root
6
Phylogeny:
... or within a single taxon (here, human enterovirus 71)
7
The Problem:
Given n sequences s1,…,sn …
Multiple Alignment: Arrange the sequences into an alignment A by inserting gaps such that homologous bases are put in the same column
Phylogenetic Inference: Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in internal nodes describing their evolutionary connection
8
Generalized Tree Alignment:
Combines the two. The problem we want to solve is:
Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)
Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences
Placing the root is not trivial and is best left to biologists.
9
The problem has been proven MAX SNP-hard (Wang and Jiang, 1994)
→ No polynomial-time approximation scheme exists unless P = NP.
Exact solutions to NP-hard problems are intractable → In practice we rely on a heuristic
The given algorithm runs in time O(n² · ln)
• n: The number of sequences
• l: Their maximum length
10
Sequence graphs (Hein, 1989):
Recall pairwise alignment. Traceback “spells” possible optimal alignments:
11
Sequence graphs:
Make graph with alignment columns as edge labels
→ represents all optimal alignments
We will get back to that shortly …
Right now, we want to represent sequences
Let us introduce sequence graphs.
For instance, s = ACTGTA is represented by:
12
Sequence graphs:
More formally:
• Directed, acyclic graph.
• Edge labels l from alphabet Σ. Here, Σ={A,C,G,T,-}
• Source s: The unique node with no incoming edges
• Sink t: The unique node with no outgoing edges.
• Each path from s to t spells a sequence.
13
Sequence graphs:
Represents a set of sequences given by all paths from s to t:
14
Sequence graphs:
Any single sequence can be represented by a linear sequence graph
Any set of k sequences can be represented by making k paths from s to t
A given sequence s’ can be represented by more than one path
We can now represent sequences – but can we align them?
15
Aligning sequence graphs:
Dynamic programming algorithm inspired by basic pairwise alignment
Pairwise Alignment:
• Given two sequences p and q
• Move one letter in p and move through q finding the optimal “partial alignments”
Sequence Graphs:
• Given two sequence graphs G1 and G2
• We can have many outgoing edges to choose from
16
Aligning sequence graphs:
Fill in a |V1| × |V2| score matrix
For each pair of nodes i from G1 and j from G2, should we:
• Align the two characters we got by following e1 into i and e2 into j?
• Stay in G1 and only move in G2?
• Stay in G2 and only move in G1?
• Or have we already found a better path into i and j?
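The four choices above can be sketched as a dynamic program over node pairs. This is a hedged illustration under several assumptions that are not in the slides: each graph is given as `(node_count, edges)` with edges `(u, v, label)`, nodes are numbered in topological order (so `u < v`, source 0, sink last), and the score is a cost to minimize with unit mismatch and gap costs. The names `cost`, `linear`, and `align_graphs` are hypothetical.

```python
import math


def cost(a, b, mismatch=1, gap=1):
    """Cost of one alignment column; '-' denotes a gap (assumed unit costs)."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def linear(seq):
    """Linear sequence graph for one sequence: (node count, edge list)."""
    return len(seq) + 1, [(i, i + 1, ch) for i, ch in enumerate(seq)]


def align_graphs(g1, g2):
    """Minimum alignment cost between any path in g1 and any path in g2."""
    (n1, edges1), (n2, edges2) = g1, g2
    # Incoming edges per node, so each cell can look back along e1 and e2.
    into1 = [[] for _ in range(n1)]
    for u, v, a in edges1:
        into1[v].append((u, a))
    into2 = [[] for _ in range(n2)]
    for w, v, b in edges2:
        into2[v].append((w, b))

    D = [[math.inf] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            best = D[i][j]  # "have we already found a better path?"
            for u, a in into1[i]:
                # Stay in G2 and only move in G1.
                best = min(best, D[u][j] + cost(a, "-"))
                for w, b in into2[j]:
                    # Align the characters on e1 into i and e2 into j.
                    best = min(best, D[u][w] + cost(a, b))
            for w, b in into2[j]:
                # Stay in G1 and only move in G2.
                best = min(best, D[i][w] + cost("-", b))
            D[i][j] = best
    return D[n1 - 1][n2 - 1]
```

With two linear graphs this reduces to ordinary pairwise alignment; the only change for general graphs is that a node may have several incoming edges to try.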
17
Optimal Alignment Graphs:
Now we need a way to remember the optimal alignments
Recall graphs from before:
• Directed, acyclic graphs
• Nodes s and t defined as before
• Edge labels of the form [la,lb] where la,lb ∊ Σ
Backtrack through the matrix and consider each possible combination of edges.
18
Optimal Alignment Graphs:
An example of an optimal alignment graph (OAG):
This one represents the alignments:
We denote such a graph A*
We have to convert the OAGs back to sequence graphs (SGs)
19
Optimal Alignment Graphs:
This is done easily by considering the edge labels:
If la = lb: Make a single edge in the SG with label la
If la ≠ lb : Make two edges in the SG: One with label la and one with label lb
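The two rules above can be sketched as a small function. It assumes, purely for illustration, that an OAG is given as a list of edges `(u, v, (la, lb))`; the function name `oag_to_sg` and the removal of duplicate edges are additions of this example, not part of the slides.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, v, (la, lb)) into SG edges (u, v, label)."""
    sg_edges = []
    seen = set()
    for u, v, (la, lb) in oag_edges:
        # la == lb: a single edge labelled la; otherwise one edge per label.
        labels = (la,) if la == lb else (la, lb)
        for label in labels:
            edge = (u, v, label)
            if edge not in seen:  # assumption: parallel duplicates are merged
                seen.add(edge)
                sg_edges.append(edge)
    return sg_edges
```

For instance, an OAG edge labelled [C,G] yields two parallel SG edges labelled C and G, while an edge labelled [A,A] yields a single edge labelled A.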
The graph from before turns into the SG:
20
Summing up Sequence Graphs:
Final graph represents all sequences giving an optimal alignment between G1 and G2
We can:
• Represent a set of sequences by a sequence graph
• Align two such graphs producing a new SG
We can now get on with the main algorithm
21
The basic idea:
• Start by comparing all sequences
– Find a closest pair.
• Represent all sequences giving the optimal solution
– Defer the choice of a single sequence
• Repeat, but this time include the set of sequences
• In the end: Choose a single sequence and backtrack
This shows a need for:
- A compact representation of many sequences
- An algorithm for aligning sets of sequences
22
The Deferred Path Heuristic:
Similar to Kruskal’s algorithm for finding minimum spanning trees (MSTs):
From sequences s1,…,sn, initialize n SGs G1,…,Gn.
Until only two SGs remain:
• Align all pairs and choose a closest pair Gi and Gj
• Create A*(Gi,Gj) and convert A* into an SG Gk.
• Replace Gi and Gj with Gk
Note that we remember all candidate sequences
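The merge loop above can be sketched with the graph operations passed in as functions. This is only an outline under stated assumptions: `deferred_path_merge`, `distance`, and `merge` are hypothetical names, and the toy stand-ins below replace sequence graphs with plain sets of candidate strings (distance is the smallest edit distance between members, merge is set union), which is far simpler than aligning graphs and building A*.

```python
from itertools import combinations


def deferred_path_merge(graphs, distance, merge):
    """Merge a closest pair until two remain; returns [(subtree, graph), ...]."""
    clusters = [(i, g) for i, g in enumerate(graphs)]  # leaf i holds graph i
    while len(clusters) > 2:
        # Align all pairs and choose a closest pair Gi, Gj.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: distance(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (ta, ga), (tb, gb) = clusters[a], clusters[b]
        # Replace Gi and Gj with Gk, remembering all candidate sequences.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(((ta, tb), merge(ga, gb)))
    return clusters


def edit_distance(p, q):
    """Standard unit-cost edit distance between two strings."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def toy_distance(g1, g2):
    """Toy stand-in: closest pair of candidate strings in the two sets."""
    return min(edit_distance(p, q) for p in g1 for q in g2)


def toy_merge(g1, g2):
    """Toy stand-in: keep every candidate from both sets."""
    return g1 | g2
```

Calling `deferred_path_merge([{"ACGT"}, {"ACGA"}, {"TTTT"}, {"TTTA"}], toy_distance, toy_merge)` groups leaves 0 and 1 into one subtree and leaves 2 and 3 into the other, leaving two clusters for the final step.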
23
The Deferred Path Heuristic:
When only two SGs Gi and Gj remain:
• Align them and connect them in T
• Choose some optimal alignment
– This gives si and sj in the root of the two subtrees.
• Backtrack through the subtrees
– At each step: Align sk to the underlying SGs.
– Choose some optimal alignment
24
The Deferred Path Heuristic:
We defer our choice of actual sequences until the last moment, thereby enlarging our solution space: