1
Chaining Algorithms Simplified
Mohamed Ibrahim Abouelhoda
University of Ulm and Cairo University
2007
2
..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG..
The Genome
Figure labels: Gene, Gene
- The total DNA content in a cell
- A string over an alphabet of 4 characters {A, C, G, T}
- Encodes the information necessary for the existence and reproduction of an organism
3
Computational Comparative Genomics
Figure: the genome of organism A compared with the genome of organism B.
What is it about?
- Basic science: understanding how genomes function, organize, replicate, and evolve
- Industry and healthcare: increasing organism productivity and finding drugs
Objectives:
In-silico identification of regions of similarity and difference among two or multiple genomes
- Similar (conserved) regions: similar (conserved) function necessary for the organism
- Different regions: traits unique to one organism
4
Sequence Alignment
Traditional solutions:
Global Sequence Alignment
S1: TACAATCAA
S2: TCACTCAC
Aligned:
T_ACAATCAA
TCAC_ _TCAC
Local Sequence Alignment
S1: TCACAA
S2: CAAATCA
Sequence alignment is not suitable for comparing genomic sequences: dynamic programming algorithms take $O(N^k)$ time (k = number of genomes, N = average genome length).
5
The Anchor-based Strategy
Composed of three phases:
1. Computation of fragments (similar regions) among genomes
2. Computation of an optimal global chain or chains of colinear non-overlapping fragments
3. Detailed alignment of the regions between the fragments of the chain
A fragment (exact fragments, e.g., maximal exact matches):
Genome 1: GACCGCGCA
Genome 2: CACCGCGCT
The shared substring ACCGCGC is flanked by different characters on both sides.
The fragments can be computed using an index data structure (Abouelhoda, Kurtz, Ohlebusch, 2004).
6
Fragment Representation
Box-line Representation and Geometric Representation
Geometric Representation: each fragment is represented by a hyper-rectangle in kD space; each axis corresponds to one sequence.
Figure: box-line representation and geometric representation of fragments between S1 = TACAATCAA and S2 = TCACTCAC.
8
The Anchor-based Strategy
Figure: anchors connecting the first genome G1 and the second genome G2; the regions between the anchors are aligned in detail, for example:
.. TCATA_TCAA..
..TCACAATCAA..
11
The Global Chaining Problem
Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments whose total score is maximum over all chains.
score(C) = ∑i fi.weight - ∑i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting fi to fi-1.
The weight of a fragment is, for example, its length.
Figure: fragments shared among the first genome G1, the second genome G2, and the third genome G3.
13
Notions
• A fragment fi is represented as a hyper-rectangle in a k-dimensional space.
• A fragment fi is identified with its start and end points: start(fi) and end(fi).
• We add two imaginary fragments with weight zero: the origin O and the terminus t.
• Any two consecutive fragments fi and fi+1 in the chain must be colinear and non-overlapping, written fi << fi+1: end(fi).xr < start(fi+1).xr for all r, 0 < r <= k.
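The notions above can be sketched in a few lines of Python. This is only an illustration: the `Fragment` class, the `precedes` and `is_chain` helpers, and the example coordinates are hypothetical names and values, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[int, ...]  # one coordinate per genome (k entries)


@dataclass(frozen=True)
class Fragment:
    start: Point   # corner of the hyper-rectangle closest to the origin O
    end: Point     # opposite corner
    weight: float  # e.g. the fragment length


def precedes(f: Fragment, g: Fragment) -> bool:
    """True if f << g, i.e. end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f.end, g.start))


def is_chain(fragments: Sequence[Fragment]) -> bool:
    """True if every pair of consecutive fragments is colinear and non-overlapping."""
    return all(precedes(a, b) for a, b in zip(fragments, fragments[1:]))


# Three made-up fragments between two genomes (k = 2).
a = Fragment((1, 1), (3, 3), 3)
b = Fragment((5, 6), (7, 8), 3)
c = Fragment((2, 9), (4, 11), 3)
```

Here `a` and `b` can follow each other in a chain, while `a` and `c` cannot, because `c` starts before `a` ends in the first genome.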
14
Types of Gap Costs
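The gap costs g can be described geometrically. For two fragments f' << f:
L1 metric:
$g_1(f', f) = d_1(\mathrm{start}(f), \mathrm{end}(f')) = \sum_{i=1}^{k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$
L∞ metric:
$g_\infty(f', f) = d_\infty(\mathrm{start}(f), \mathrm{end}(f')) = \max_{1 \le i \le k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$
Figure: two fragments ACC separated by the characters XX in one sequence and YYY in the other, shown as alignments and in the x-y plane. In the L1 metric the characters between the fragments are aligned to gaps only (ACC YYY _ _ ACC over ACC _ _ _ XX ACC), giving $g_1(f', f) = 5$. In the L∞ metric characters are replaced where possible (ACC YYY ACC over ACC _ XX ACC), giving $g_\infty(f', f) = 3$. Further alignments on the slide (with XX, YYY, and ZZ between the fragments) are annotated with $g_1(f', f) = 7$ and $g_\infty(f', f) = 3$.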
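A minimal sketch of the two gap costs, assuming points are given as tuples with one coordinate per sequence. The function names and the coordinates are illustrative; the coordinates are chosen so that the per-axis differences equal the gap lengths 2 (XX) and 3 (YYY) of the slide's example.

```python
from typing import Tuple

Point = Tuple[int, ...]


def gap_l1(end_prev: Point, start_next: Point) -> int:
    """L1 gap cost: sum of the per-axis distances between the two points."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_linf(end_prev: Point, start_next: Point) -> int:
    """L-infinity gap cost: largest per-axis distance between the two points."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))


# Hypothetical coordinates: the fragments are 2 positions apart on one axis
# and 3 positions apart on the other.
end_of_first = (3, 3)
start_of_second = (5, 6)
cost_l1 = gap_l1(end_of_first, start_of_second)
cost_linf = gap_linf(end_of_first, start_of_second)
```

With these coordinates `cost_l1` is 5 and `cost_linf` is 3, matching the values annotated on the slide.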
15
A Graph-based Solution
The score of a chain C is
score(C) = ∑i [fi.weight - g(fi, fi-1)]
An optimal chain is a chain of maximum score. If the fragments are the vertices of a graph with an edge from fi to fj whenever fi << fj, a highest-scoring path in the graph is an optimal chain.
The maximum score can be computed by the recurrence
fj.score = fj.weight + max{fi.score - g(fi, fj) : fi << fj}
A graph-based solution takes O(n^2) time.
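The quadratic recurrence can be sketched directly. The code below is an illustration under stated assumptions, not the author's implementation: fragments are hypothetical `(start, end, weight)` tuples, the gap cost is the L1 distance between end and start points, and every fragment lies strictly between the imaginary origin O and terminus t.

```python
from typing import List, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start, end, weight); hypothetical layout


def precedes(f: Frag, g: Frag) -> bool:
    """f << g: end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f[1], g[0]))


def gap_l1(f: Frag, g: Frag) -> float:
    """L1 distance between end(f) and start(g), used here as the gap cost."""
    return sum(abs(s - e) for e, s in zip(f[1], g[0]))


def global_chain(frags: List[Frag], origin: Point, terminus: Point,
                 gap=gap_l1) -> Tuple[float, List[Frag]]:
    """O(n^2) evaluation of f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j)}."""
    # O and t are zero-weight point fragments; sorting by start point puts
    # every possible predecessor of a fragment before it.
    nodes = [(origin, origin, 0.0)] + sorted(frags) + [(terminus, terminus, 0.0)]
    n = len(nodes)
    score = [float("-inf")] * n
    prev = [-1] * n
    score[0] = 0.0
    for j in range(1, n):
        best, best_i = float("-inf"), -1
        for i in range(j):
            if precedes(nodes[i], nodes[j]):
                cand = score[i] - gap(nodes[i], nodes[j])
                if cand > best:
                    best, best_i = cand, i
        score[j] = nodes[j][2] + best
        prev[j] = best_i
    # Trace back from t to O, skipping the two imaginary fragments.
    chain, j = [], prev[n - 1]
    while j > 0:
        chain.append(nodes[j])
        j = prev[j]
    return score[n - 1], chain[::-1]


a = ((1, 1), (3, 3), 3)
b = ((5, 6), (7, 8), 3)
c = ((2, 9), (4, 11), 3)
total, chain = global_chain([a, b, c], origin=(0, 0), terminus=(12, 12))
```

For this made-up input the optimal chain is O, a, b, t: the weights sum to 6 and the three gaps cost 2 + 5 + 9 = 16, so the score is -10.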
16
Sparse Dynamic Programming
Chaining algorithms are sparse dynamic programming.
The maximum score can be computed by the recurrence
fj.score = fj.weight + max{fi.score - g(fi, fj) : fi << fj}
where
fi << fj: end(fi).xr < start(fj).xr for all r, 0 < r <= k
g(fi, fj) is the gap cost of connecting fi to fj
Figure: dynamic programming matrix with row index i and column index j, with the sequence A C G T C C G C A T along one side and a second sequence along the other.
D. Eppstein, R. Giancarlo, Z. Galil, and G.F. Italiano, 1992
- The string characters are not given, only positions.
- In extreme cases, you can enumerate all matches and consider everything else as gaps; sparse dynamic programming (chaining) is then used to compute the alignment directly.
- Selecting the gap cost function is critical.
18
A Geometric-based Solution
The max function in the recurrence
fj.score = fj.weight + max{fi.score - g(fi, fj) : fi << fj}
can be replaced by a range maximum query (RMQ):
fj.score = fj.weight + RMQ{O, start(fj)}
RMQ (Range Maximum Query): retrieves the fragment fi whose end point lies in the hyper-rectangle bounded by O and start(fj) such that fi.score - g(fi, fj) is maximum; this maximum value is what the query contributes to fj.score.
If the gap cost is zero, an RMQ returns the end point of the fragment fi such that
$f_i.\mathrm{score} = \sum_{0 \le r \le i} f_r.\mathrm{weight}$
is maximum.
If all the fragments have the same weight (length) and there is no gap cost, we are solving the LCS (longest common subsequence) problem.
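The connection to LCS can be made concrete with a small sketch. It assumes that every pair of matching positions is treated as a unit-weight point fragment and that gap costs are zero; the function name is hypothetical, and the inner loop is the plain max of the recurrence (an RMQ would replace it in an efficient version).

```python
def lcs_length_by_chaining(s: str, t: str) -> int:
    """Length of a longest common subsequence, computed by chaining matches."""
    # Every matching pair of positions (i, j) is a point fragment of weight 1.
    # The pairs are generated with increasing i, so predecessors come first.
    matches = [(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b]
    score = []
    for idx, (i, j) in enumerate(matches):
        best = 0  # chain that starts at the origin O
        for p in range(idx):
            pi, pj = matches[p]
            # Colinear and non-overlapping: strictly smaller in both dimensions.
            if pi < i and pj < j and score[p] > best:
                best = score[p]
        score.append(1 + best)
    return max(score, default=0)


example = lcs_length_by_chaining("ABCBDAB", "BDCABA")
```

For the classic textbook pair above the result is 4.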
19
The Algorithm without gap cost
Line-sweep algorithm
1. Sort the start and end points of the fragments w.r.t. x1.
2. If a start point of a fragment, say fj, is scanned, apply RMQ(O, (start(fj).x1, start(fj).x2, …, start(fj).xk)) to the set of active end points and update the score of the end point of fragment fj.
3. Otherwise, add the end point to the set of active end points (already scanned end points).
Because of the sorting step, the dimension of the RMQ can be reduced to k-1: we can use RMQ(O, (start(fj).x2, …, start(fj).xk)).
For comparing two sequences, the RMQ dimension is 1, and we can use priority queues to find an optimal fragment in O(log log m) time. However, the complexity is dominated by the sorting unless the fragments are computed in order, and the priority queue is complicated to implement.
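The line sweep for two sequences and zero gap cost can be sketched as follows. Note the substitution: instead of the priority queue mentioned above, this sketch answers the one-dimensional RMQ with a Fenwick (binary indexed) tree over the x2 coordinates, which costs O(log n) per operation and is simpler to write. The tuple layout and all names are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple

Frag = Tuple[Tuple[int, int], Tuple[int, int], float]  # (start, end, weight)


def chain_line_sweep(frags: List[Frag]) -> Tuple[float, List[int]]:
    """Best chain score and fragment indices for k = 2 and zero gap cost."""
    if not frags:
        return 0.0, []
    ys = sorted({end[1] for _, end, _ in frags})  # compressed x2 coordinates
    tree = [(0.0, -1)] * (len(ys) + 1)            # Fenwick tree of (score, index)

    def query(count: int) -> Tuple[float, int]:
        """Best active end point among the `count` smallest x2 values."""
        best = (0.0, -1)                          # (0, -1) stands for the origin O
        while count > 0:
            best = max(best, tree[count])
            count -= count & -count
        return best

    def activate(pos: int, item: Tuple[float, int]) -> None:
        """Insert an end point at 1-based position `pos`."""
        while pos <= len(ys):
            tree[pos] = max(tree[pos], item)
            pos += pos & -pos

    # Sweep over x1; at equal x1, start points (kind 0) come before end
    # points (kind 1) so that the comparison in x1 stays strict.
    events = []
    for idx, (start, end, _) in enumerate(frags):
        events.append((start[0], 0, idx))
        events.append((end[0], 1, idx))
    events.sort()

    score = [0.0] * len(frags)
    prev = [-1] * len(frags)
    for _, kind, idx in events:
        start, end, weight = frags[idx]
        if kind == 0:   # start point: RMQ over active end points with x2 < start.x2
            best, prev[idx] = query(bisect_left(ys, start[1]))
            score[idx] = weight + best
        else:           # end point: activate it with the fragment's score
            activate(bisect_left(ys, end[1]) + 1, (score[idx], idx))

    last = max(range(len(frags)), key=score.__getitem__)
    total, chain = score[last], []
    while last != -1:
        chain.append(last)
        last = prev[last]
    return total, chain[::-1]


fragments = [((1, 1), (3, 3), 3), ((5, 6), (7, 8), 3),
             ((2, 9), (4, 11), 3), ((8, 2), (9, 4), 2)]
total, chain = chain_line_sweep(fragments)
```

For these made-up fragments the best chain consists of fragments 0 and 1 with total score 6. Sorting the events dominates the running time, in line with the remark above.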
20
The Complexity of the Algorithm
The algorithm complexity depends on the data structure supporting RMQ.
Semi-dynamic data structure:
- Constructed for all points at once
- Points are not inserted/deleted, rather activated/inactivated
- More space, all fragments remain in memory
- Easier to implement
- Works for off-line chaining
Dynamic data structure:
- Constructed point by point
- Points are explicitly inserted, deleted
- Less space, because some covered fragments can be deleted
- Very difficult to implement
- Works for on-line chaining
21
The Complexity of the Algorithm
RMQ using semi-dynamic range tree
D is implemented as a range tree
1. supported by fractional cascading (Willard, 1985).
2. enhanced with priority queues (van Emde Boas, 1977; Johnson, 1982).
For n fragments and dimension d, the RMQ and activation take:
$O(n \log^{d-1} n \log\log n)$ time and $O(n \log^{d-1} n)$ space
Since d = k-1 > 1, the complexity of the algorithm is
$O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space
For k = 2, the total complexity is O(n log n) time and O(n) space.
22
The Complexity of the Algorithm
RMQ using semi-dynamic kd-tree (Bentley, 1990; Lee and Wong, 1977)
For n fragments and dimension d > 1, the RMQ and activation take:
$O(d\,n^{2-\frac{1}{d}})$ time and O(n) space
Since d = k-1 > 1, the complexity of the algorithm is
$O((k-1)\,n^{2-\frac{1}{k-1}})$ time and O(n) space
For k = 2, the total complexity is O(n log n) time and O(n) space.
The running time can be sped up in practice using some programming tricks.
24
kd-trees vs. Range Trees
Legend of the comparison table: d stands for dimension, C stands for construction, and Q stands for query and activation time.
For four E. coli strains, the range tree did not fit in memory; the estimated space consumption is 7.1 Gb.
25
Including Gap Costs
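Recall the recurrence
fj.score = fj.weight + max{fi.score - g(fi, fj) : fi << fj}
and its RMQ form
fj.score = fj.weight + RMQ{O, start(fj)}
The gap cost should be included in the RMQ; otherwise the algorithm would be quadratic.
Figure: two fragments ACC connected across a gap (XX in one sequence, YY in the other) in the geometric representation.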
26
Including Gap Costs in L1
We define the geometric cost of a fragment f as follows:
gc(f) = d1(t, end(f))
where d1(t, end(f)) is the distance in the L1 metric between t and end(f).
For two fragments f1 << f and f2 << f:
f1.score - g(f1, f) > f2.score - g(f2, f)
iff
f1.score - gc(f1) > f2.score - gc(f2)
gc(f) is a constant that can be precomputed and attached to the fragment's weight. We activate a fragment with f.score - gc(f) instead of f.score.
The gap cost can therefore be included at no extra cost: the complexity is the same as for the algorithm with no gap cost.
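The equivalence can be checked with a short derivation. It is sketched here under two assumptions: g is the L1 gap cost $g_1(f', f) = d_1(\mathrm{start}(f), \mathrm{end}(f'))$, and $\mathrm{end}(f') < \mathrm{start}(f) \le t$ holds in every coordinate, so that L1 distances along the way from end(f') to t add up.

```latex
\begin{align*}
gc(f') &= d_1(t, \mathrm{end}(f')) \\
       &= d_1(t, \mathrm{start}(f)) + d_1(\mathrm{start}(f), \mathrm{end}(f')) \\
       &= d_1(t, \mathrm{start}(f)) + g_1(f', f), \\[4pt]
f'.\mathrm{score} - g_1(f', f)
       &= \bigl(f'.\mathrm{score} - gc(f')\bigr) + d_1(t, \mathrm{start}(f)).
\end{align*}
```

The last term does not depend on f', so ranking the candidate predecessors of f by f'.score - g(f', f) gives the same order as ranking them by f'.score - gc(f').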
27
The Local Chaining Problem
Given n weighted fragments from k genomes, a chain C of colinear non-overlapping fragments has score:
score(C) = ∑i fi.weight - ∑i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting fi to fi-1.
A local chain C is called optimal if its score is maximum over all chains.
The weight of a fragment is, for example, its length or its statistical significance.
Figure: local chains of fragments among the first genome G1, the second genome G2, and the third genome G3.
29
Geometric Solution
The recurrence
fj.score = fj.weight + max{0, fi.score - g(fi, fj) : fi << fj}
can be written as
fj.score = fj.weight + RMQ{O, start(fj)}
But we have to check the value returned by the query. Let f' = RMQ{O, start(fj)}:
if f'.score - g(f', fj) >= 0
then
Connect f' to fj
else
Start a new chain, starting with fj (fj.score = fj.weight)
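The connect-or-restart decision can be sketched with the plain quadratic recurrence; an RMQ would replace the inner loop in an efficient version. The tuple layout, the L1 gap cost, and the example fragments are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start, end, weight); hypothetical layout


def precedes(f: Frag, g: Frag) -> bool:
    """f << g: end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f[1], g[0]))


def gap_l1(f: Frag, g: Frag) -> float:
    """L1 distance between end(f) and start(g), used here as the gap cost."""
    return sum(abs(s - e) for e, s in zip(f[1], g[0]))


def local_chain(frags: List[Frag], gap=gap_l1) -> Tuple[float, List[int]]:
    """Best local chain: f_j.score = f_j.weight + max{0, f_i.score - g(f_i, f_j)}."""
    if not frags:
        return 0.0, []
    order = sorted(range(len(frags)), key=lambda i: frags[i][0])
    score = [0.0] * len(frags)
    prev = [-1] * len(frags)
    for pos, j in enumerate(order):
        best, best_i = 0.0, -1  # 0 means: start a new chain with f_j
        for i in order[:pos]:
            if precedes(frags[i], frags[j]):
                cand = score[i] - gap(frags[i], frags[j])
                # Connect only if this beats starting anew; a tie with 0
                # yields the same score either way.
                if cand > best:
                    best, best_i = cand, i
        score[j] = frags[j][2] + best
        prev[j] = best_i
    last = max(range(len(frags)), key=score.__getitem__)
    total, chain = score[last], []
    while last != -1:
        chain.append(last)
        last = prev[last]
    return total, chain[::-1]


fragments = [((1, 1), (6, 6), 6), ((8, 9), (13, 14), 6), ((40, 20), (44, 24), 5)]
total, chain = local_chain(fragments)
```

In this made-up example the first two fragments are close enough to be connected (6 + 6 - 5 = 7), while the third is too far from both and starts a chain of its own with score 5.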
30
Comparing two bacterial genomes
The two genomes:
1- C. trachomatis (1.2 Mbp)
2- C. pneumoniae (1.2 Mbp)
Fragments of the type maximal exact matches of minimum length 12. Total number of fragments: 288,899.
Figure: dot plot of the fragments, with C. pneumoniae on one axis and C. trachomatis on the other. Red points: forward fragments. Green points: reverse fragments.
31
Comparing two bacterial genomes
CoCoNUT is fast: it takes minutes to compute fragments and local chains, a task that took hours with previous methods.
The two genomes:
1- C. trachomatis (1.2 Mbp)
2- C. pneumoniae (1.2 Mbp)
Fragments of the type maximal multiple exact matches of minimum length 12. Total number of fragments: 288,899.
Figure: the computed chains, with C. pneumoniae on one axis and C. trachomatis on the other; the termini of replication are marked.
32
Conclusions
Chaining algorithms are efficient for comparative genomics.
More variations are needed for real applications in biology, e.g., limiting the range search and considering overlaps.
CoCoNUT is a system for comparative genomics containing several variations of the chaining algorithms.
Global and local chaining are analogous to global and local sequence alignment.
The kd-tree is superior to the range tree in practice.
33
More on Chaining Algorithms
[1] E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. Handbook of Computational Molecular Biology (Chapter 20), 2005, in press.
[2] M. I. Abouelhoda. Algorithms and a Software System for Comparative Genome Analysis. PhD Thesis, Ulm University, 2005.