Chaining Algorithms Simplified

Mohamed Ibrahim Abouelhoda

University of Ulm and Cairo University

2007

2

The Genome

..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG..

[Figure: the genome, with two genes marked]

- The total DNA content in a cell
- A string over an alphabet of 4 characters {A, C, G, T}
- Encodes the necessary information for the existence and reproduction of an organism

3

Computational Comparative Genomics

What about?

[Figure: genome of organism A compared with genome of organism B]

Basic science: Understanding how genomes function, organize, replicate, and evolve

Industry and Healthcare: Increasing organism productivity and finding drugs

Objectives:

In-silico identification of regions of similarity and difference among two or multiple genomes

- Similar (conserved) regions suggest a similar (conserved) function necessary for the organism
- Different regions suggest traits unique to one organism

4

Traditional solutions

Sequence Alignment

Global Sequence Alignment

S1: TACAATCAA
S2: TCACTCAC

T_ACAATCAA
TCAC__TCAC

Local Sequence Alignment

S1: TCACAA
S2: CAAATCA

Sequence alignment is not suitable for comparing genomic sequences: dynamic programming algorithms take $O(N^k)$ time (k = number of genomes, N = average genome length).

5

The Anchor-based Strategy

Composed of three phases:

1. Computation of fragments (similar regions) among genomes

2. Computation of an optimal global chain or chains of colinear non-overlapping fragments

3. Detailed alignment of the regions between the fragments of the chain

A fragment: exact fragments (e.g., maximal exact matches)

Genome 1: GACCGCGCA
Genome 2: CACCGCGCT

The shared substring ACCGCGC is flanked by different characters on both sides, so it cannot be extended.

The fragments can be computed using an index data structure (Abouelhoda, Kurtz, Ohlebusch, 2004).
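To make the notion of a maximal exact match concrete, here is a minimal sketch that enumerates them by brute force. It is not the index-based method cited on the slide; the function name `maximal_exact_matches`, the `min_len` parameter, and the 0-based positions are assumptions made for this illustration, and the quadratic running time makes it suitable for toy inputs only.

```python
def maximal_exact_matches(s1, s2, min_len=1):
    """Return (start1, start2, length) for every maximal exact match.

    Brute-force illustration: a match is reported only if it cannot be
    extended to the left or to the right. Positions are 0-based.
    """
    matches = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Not left-maximal: the same match starts one position earlier.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend to the right while the characters agree.
            length = 0
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches
```

For the two strings on the slide, `maximal_exact_matches("GACCGCGCA", "CACCGCGCT", min_len=5)` reports the single fragment ACCGCGC.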

6

Fragment Representation

Box-line representation: each fragment is drawn as a pair of boxes, one per sequence, connected by a line.

Geometric representation: each fragment is represented by a hyper-rectangle in k-dimensional space; each axis corresponds to one sequence.

[Figure: box-line representation (left) and geometric representation (right) of the fragments between S1 = TACAATCAA and S2 = TCACTCAC]

7

[Figure: fragments between the first genome G1 and the second genome G2; the chained fragments serve as anchors, e.g. ..TCATA_TCAA.. aligned with ..TCACAATCAA..]

10

Chaining Algorithms

- The Global Chaining Problem
- The Local Chaining Problem

11

The Global Chaining Problem

Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments such that its total score is maximum over all chains.

$\mathrm{score}(C) = \sum_i f_i.\mathit{weight} - \sum_i g(f_i, f_{i-1})$

where $g(f_i, f_{i-1})$ is the gap cost of connecting $f_i$ to $f_{i-1}$.

The weight of a fragment is, for example, its length.

[Figure: a chain of fragments across the first genome G1, the second genome G2, and the third genome G3]


13

Notions

• A fragment $f_i$ is represented as a hyper-rectangle in a k-dimensional space.

• A fragment $f_i$ is identified with its start and end points: $\mathrm{start}(f_i)$ and $\mathrm{end}(f_i)$.

• We add two imaginary fragments O and t (the origin and the terminus) with weight zero.

• Any two fragments $f_i$ and $f_{i+1}$ in the chain must be colinear and non-overlapping: $f_i \ll f_{i+1}$, i.e., $\mathrm{end}(f_i).x_r < \mathrm{start}(f_{i+1}).x_r$ for all $r$, $1 \le r \le k$.
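The precedence relation can be written as a one-line check. The sketch below assumes a hypothetical `Fragment` record with `start`, `end`, and `weight` fields; these names are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment record: one coordinate per genome."""
    start: Tuple[int, ...]
    end: Tuple[int, ...]
    weight: float


def precedes(f: Fragment, g: Fragment) -> bool:
    """True if f << g: the end of f lies strictly before the start of g
    in every one of the k dimensions."""
    return all(e < s for e, s in zip(f.end, g.start))
```

Two fragments that cross each other in any dimension fail the check in both directions, so they can never appear in the same chain.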

14

Types of Gap Costs

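The gap cost g can be described geometrically. For two fragments $f' \ll f$:

L1 metric: $g_1(f', f) = d_1(\mathrm{start}(f), \mathrm{end}(f')) = \sum_{i=1}^{k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$

L∞ metric: $g_\infty(f', f) = d_\infty(\mathrm{start}(f), \mathrm{end}(f')) = \max_{1 \le i \le k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$

Example with two sequences (gap regions XX and YYY): $g_1(f', f) = 5$, $g_\infty(f', f) = 3$

L1:
ACC YYY _ _ ACC
ACC _ _ _ XX ACC

L∞:
ACC YYY ACC
ACC _ XX ACC

Example with three sequences (gap regions XX, YYY, and ZZ): $g_1(f', f) = 7$, $g_\infty(f', f) = 3$

L1:
ACC XX _ _ _ _ _ ACC
ACC _ _ YYY _ _ ACC
ACC _ _ _ _ _ ZZ ACC

L∞:
ACC _XX ACC
ACC YYY ACC
ACC _ ZZ ACC

[Figure: the two fragments ACC in the plane, with axes x and y]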
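The two gap costs differ only in how the per-dimension distances are combined. A minimal sketch follows; the function names are hypothetical, and points are assumed to be tuples with one coordinate per genome.

```python
def gap_cost_l1(end_prev, start_next):
    """L1 gap cost: sum of the coordinate differences between the end
    point of the preceding fragment and the start point of the next."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_cost_linf(end_prev, start_next):
    """L-infinity gap cost: the largest coordinate difference."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))
```

With made-up points whose coordinate differences are 2 and 3, the L1 cost is 5 and the L∞ cost is 3; adding a third dimension with difference 2 raises the L1 cost to 7 and leaves the L∞ cost at 3.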

15

A Graph-based Solution

The score of a chain C is

$\mathrm{score}(C) = \sum_i [f_i.\mathit{weight} - g(f_i, f_{i-1})]$

An optimal chain is a chain of maximum score. A highest-scoring path in the graph is an optimal chain.

The maximum score can be computed by the recurrence

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{f_i.\mathit{score} - g(f_i, f_j) : f_i \ll f_j\}$

A graph-based solution takes $O(n^2)$ time.
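The recurrence can be evaluated directly by comparing every fragment with every earlier one. The sketch below is one way to do that, under stated assumptions: fragments are `(start, end, weight)` triples with strictly positive coordinates, the imaginary fragment O sits at the all-zero point, the caller supplies the terminus `t`, and `gap_cost(end_prev, start_next)` is a user-provided function (zero if omitted). The name `global_chain` is hypothetical.

```python
def global_chain(fragments, terminus, gap_cost=None):
    """Quadratic-time global chaining; returns (score, chain indices)."""
    if gap_cost is None:
        gap_cost = lambda end_prev, start_next: 0
    origin = (0,) * len(terminus)
    # Imaginary fragments O and t with weight zero frame the real ones.
    frags = [(origin, origin, 0)] + list(fragments) + [(terminus, terminus, 0)]
    n = len(frags)
    # Any predecessor ends earlier in x1, so this order is safe.
    order = sorted(range(n), key=lambda i: frags[i][1][0])
    score = [float("-inf")] * n
    pred = [None] * n
    score[0] = 0
    done = []
    for j in order:
        start_j, _, weight_j = frags[j]
        for i in done:
            end_i = frags[i][1]
            if all(e < s for e, s in zip(end_i, start_j)):  # f_i << f_j
                cand = weight_j + score[i] - gap_cost(end_i, start_j)
                if cand > score[j]:
                    score[j], pred[j] = cand, i
        done.append(j)
    # Walk back from t, skipping the imaginary fragments.
    chain = []
    j = pred[n - 1]
    while j is not None and j != 0:
        chain.append(j - 1)
        j = pred[j]
    chain.reverse()
    return score[n - 1], chain
```

The inner loop over all previously processed fragments is what makes the running time quadratic; the geometric solution replaces exactly this loop with a range query.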

16

Sparse Dynamic Programming


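Chaining algorithms are sparse dynamic programming:

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{f_i.\mathit{score} - g(f_i, f_j) : f_i \ll f_j\}$

where

- $f_i \ll f_j$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_j).x_r$ for all $r$, $1 \le r \le k$
- $g(f_i, f_j)$ is the gap cost of connecting $f_i$ to $f_j$

[Figure: dynamic programming matrix of two sequences, indexed by positions i and j]

D. Eppstein, R. Giancarlo, Z. Galil, and G.F. Italiano, 1992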

- The string characters are not given, only positions.
- In extreme cases, you can enumerate all matches and consider everything else as gaps; sparse dynamic programming (chaining) is then used to compute the alignment directly.
- Selecting the gap cost function is critical.

18

A Geometric-based Solution

The max function in the recurrence

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{f_i.\mathit{score} - g(f_i, f_j) : f_i \ll f_j\}$

can be replaced by a range maximum query (RMQ):

$f_j.\mathit{score} = f_j.\mathit{weight} + \mathrm{RMQ}\{O, \mathrm{start}(f_j)\}$

RMQ (Range Maximum Query): retrieves the fragment $f_i$ whose end point lies in the hyper-rectangle bounded by $O$ and $\mathrm{start}(f_j)$ such that $f_i.\mathit{score} - g(f_i, f_j)$ is maximum.

If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.\mathit{score} = \sum_{r=0}^{i} f_r.\mathit{weight}$ is maximum.

If all the fragments have the same weight (length) and there is no gap cost, we are solving the LCS problem.

19

The Algorithm without gap cost

Line-sweep algorithm

1. Sort the start and end points of the fragments w.r.t. $x_1$.

2. If a start point of a fragment, say $f_j$, is scanned, apply $\mathrm{RMQ}(O, (\mathrm{start}(f_j).x_1, \mathrm{start}(f_j).x_2, \ldots, \mathrm{start}(f_j).x_k))$ to the set of active end points and update the score of the end point of fragment $f_j$.

3. Otherwise, add the end point to the set of active end points (already scanned end points).

Because of the sorting step, the dimension of the RMQ can be reduced to k-1: we can use $\mathrm{RMQ}(O, (\mathrm{start}(f_j).x_2, \ldots, \mathrm{start}(f_j).x_k))$.

For comparing two sequences, the RMQ dimension is 1, and we can use a priority queue to find an optimal fragment in O(log log m) time. However, the complexity is dominated by the sorting unless the fragments are computed in order, and the priority queue is complicated to implement.
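For two sequences and no gap cost, the line sweep can be sketched in a few lines. Note the substitution: instead of the O(log log m) priority queue mentioned above, this sketch answers the one-dimensional RMQ with a Fenwick tree (binary indexed tree) over the ranks of the end-point x2 coordinates, which costs O(log n) per operation. The function name `chain_line_sweep` and the `(start, end, weight)` input layout are assumptions, and fragment weights are assumed to be positive.

```python
import bisect


def chain_line_sweep(fragments):
    """Chain 2-D fragments without gap cost; returns (score, chain indices)."""
    if not fragments:
        return 0, []
    n = len(fragments)
    ys = sorted({end[1] for _, end, _ in fragments})
    # tree[p] holds the best (score, fragment index) in its Fenwick range;
    # index -1 stands for the imaginary origin fragment O.
    tree = [(0, -1)] * (len(ys) + 1)

    def activate(pos, value):
        while pos < len(tree):
            if value > tree[pos]:
                tree[pos] = value
            pos += pos & -pos

    def rmq(pos):
        best = (0, -1)
        while pos > 0:
            if tree[pos] > best:
                best = tree[pos]
            pos -= pos & -pos
        return best

    events = []
    for idx, (start, end, _) in enumerate(fragments):
        events.append((start[0], 0, idx))  # start points sort first on ties
        events.append((end[0], 1, idx))
    events.sort()

    score, pred = [0] * n, [-1] * n
    for _, kind, idx in events:
        start, end, weight = fragments[idx]
        if kind == 0:
            # Active end points with x2 strictly below start.x2.
            best, who = rmq(bisect.bisect_left(ys, start[1]))
            score[idx], pred[idx] = weight + best, who
        else:
            activate(bisect.bisect_left(ys, end[1]) + 1, (score[idx], idx))

    last = max(range(n), key=lambda i: score[i])
    chain, i = [], last
    while i != -1:
        chain.append(i)
        i = pred[i]
    chain.reverse()
    return score[last], chain
```

Processing start points before end points at equal x1 keeps the comparison strict, so a fragment is never chained to one that ends in the same column.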

20

The Complexity of the Algorithm

The algorithm complexity depends on the data structure supporting RMQ

Semi-dynamic data structure

- Constructed for all points at once
- Points are not inserted/deleted, rather activated/inactivated
- More space, all fragments remain in memory
- Easier to implement
- Works for off-line chaining

Dynamic data structure

- Constructed point by point
- Points are explicitly inserted and deleted
- Less space, because some covered fragments can be deleted
- Very difficult to implement
- Works for on-line chaining

21


RMQ using semi-dynamic range tree

The data structure D is implemented as a range tree

1. supported by fractional cascading.

2. enhanced with priority queues.

For n fragments and dimension d, the RMQ and activation take $O(n \log^{d-1} n \log\log n)$ time and $O(n \log^{d-1} n)$ space.

Since d = k-1 > 1, the complexity of the algorithm is $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.

For k = 2, the total complexity is $O(n \log n)$ time and $O(n)$ space.

Willard, 1985; van Emde Boas, 1977; Johnson, 1982

22


RMQ using semi-dynamic kd-tree

For n fragments and dimension d > 1, the RMQ and activation take $O(d\,n^{2-\frac{1}{d}})$ time and $O(n)$ space.

Since d = k-1 > 1, the complexity of the algorithm is $O((k-1)\,n^{2-\frac{1}{k-1}})$ time and $O(n)$ space.

For k = 2, the total complexity is $O(n \log n)$ time and $O(n)$ space.

The running time can be sped up in practice using some programming tricks.

Bentley, 1990; Lee and Wong, 1977

23

kd-trees

24

kd-trees vs. Range Trees

d stands for dimension

C stands for construction

Q stands for query and activation time

For four E. coli strains, the range tree did not fit in memory; its estimated space consumption is 7.1 Gb.

25

Including Gap Costs

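Recall the recurrence

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{f_i.\mathit{score} - g(f_i, f_j) : f_i \ll f_j\}$

The gap cost should be included in the RMQ; otherwise the algorithm would be quadratic.

[Figure: two fragments and the gap between them]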

26

Including Gap Costs in L1

We define the geometric cost of a fragment f as follows:

$gc(f) = d_1(t, \mathrm{end}(f))$

where $d_1(t, \mathrm{end}(f))$ is the distance in the L1 metric between t and end(f).

For two fragments $f_1 \ll f$ and $f_2 \ll f$:

$f_1.\mathit{score} - g(f_1, f) > f_2.\mathit{score} - g(f_2, f)$

iff

$f_1.\mathit{score} - gc(f_1) > f_2.\mathit{score} - gc(f_2)$

- gc(f) is a constant that can be precomputed and attached to the fragment's weight.
- We activate a fragment with f.score - gc(f) instead of f.score.
- The gap cost can therefore be included at no extra cost: the complexity is the same as for the algorithm without gap cost.
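The equivalence rests on a short calculation, sketched here under two assumptions: the gap cost is the L1 distance between end(f_1) and start(f), and end(f_1).x_r < start(f).x_r ≤ t.x_r holds in every dimension r, so all absolute values can be dropped.

```latex
\begin{align*}
gc(f_1) &= d_1(t, \mathrm{end}(f_1))
         = \sum_{r=1}^{k} \bigl(t.x_r - \mathrm{end}(f_1).x_r\bigr) \\
        &= \sum_{r=1}^{k} \bigl(t.x_r - \mathrm{start}(f).x_r\bigr)
         + \sum_{r=1}^{k} \bigl(\mathrm{start}(f).x_r - \mathrm{end}(f_1).x_r\bigr) \\
        &= d_1(t, \mathrm{start}(f)) + g(f_1, f), \\
f_1.\mathit{score} - g(f_1, f)
        &= \bigl(f_1.\mathit{score} - gc(f_1)\bigr) + d_1(t, \mathrm{start}(f)).
\end{align*}
```

The term d_1(t, start(f)) is the same for every candidate predecessor of f, so comparing candidates by score minus geometric cost gives the same winner as comparing them by score minus gap cost.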

27

The Local Chaining Problem

Given n weighted fragments from k genomes, a chain C of colinear non-overlapping fragments has score:

$\mathrm{score}(C) = \sum_i f_i.\mathit{weight} - \sum_i g(f_i, f_{i-1})$

where $g(f_i, f_{i-1})$ is the gap cost of connecting $f_i$ to $f_{i-1}$.

A local chain C is called optimal if its score is maximum over all chains.

The weight of a fragment is, for example, its length or its statistical significance.

[Figure: local chains of fragments across the first genome G1, the second genome G2, and the third genome G3]


29

Geometric Solution

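The recurrence

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{0, f_i.\mathit{score} - g(f_i, f_j) : f_i \ll f_j\}$

can be written as

$f_j.\mathit{score} = f_j.\mathit{weight} + \max\{0, \mathrm{RMQ}\{O, \mathrm{start}(f_j)\}\}$

But we have to check, with $f' = \mathrm{RMQ}\{O, \mathrm{start}(f_j)\}$:

if $f'.\mathit{score} - g(f', f_j) \ge 0$

then

Connect f' to fj, so that $f_j.\mathit{score} = f_j.\mathit{weight} + f'.\mathit{score} - g(f', f_j)$

else

Start a new chain, starting with fj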
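The connect-or-restart decision can be illustrated with a quadratic-time sketch that scans all predecessors instead of issuing an RMQ. The name `local_chain`, the `(start, end, weight)` input layout, and the `gap_cost(end_prev, start_next)` callback are assumptions for this example; the input is assumed to be non-empty.

```python
def local_chain(fragments, gap_cost):
    """Quadratic-time local chaining; returns (best score, chain indices)."""
    n = len(fragments)
    # Any predecessor ends earlier in x1, so this order is safe.
    order = sorted(range(n), key=lambda i: fragments[i][1][0])
    score = [0] * n
    pred = [None] * n
    done = []
    for j in order:
        start_j, _, weight_j = fragments[j]
        # The 0 in max{0, ...}: by default fragment j starts a new chain.
        best_gain, best_pred = 0, None
        for i in done:
            end_i = fragments[i][1]
            if all(e < s for e, s in zip(end_i, start_j)):  # f_i << f_j
                gain = score[i] - gap_cost(end_i, start_j)
                if gain >= best_gain:
                    best_gain, best_pred = gain, i  # connect f_i to f_j
        score[j] = weight_j + best_gain
        pred[j] = best_pred
        done.append(j)
    # The optimal local chain ends at the highest-scoring fragment.
    last = max(range(n), key=lambda i: score[i])
    chain, j = [], last
    while j is not None:
        chain.append(j)
        j = pred[j]
    chain.reverse()
    return score[last], chain
```

With an L1 gap cost, two nearby fragment pairs separated by a long gap end up in two separate local chains, because bridging the gap would contribute a negative amount.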

30

Comparing two bacterial genomes

The two genomes:

1. C. trachomatis (1.2 Mbp)
2. C. pneumoniae (1.2 Mbp)

Fragments: maximal exact matches of minimum length 12. Total number of fragments: 288,899.

[Figure: dot plot of the fragments; axes: C. trachomatis and C. pneumoniae. Red points: forward fragments. Green points: reverse fragments.]

31

Comparing two bacterial genomes

CoCoNUT is fast: it takes minutes to compute fragments and local chains, a task that took hours with previous methods.

[Figure: local chains between C. trachomatis (1.2 Mbp) and C. pneumoniae (1.2 Mbp), computed from maximal multiple exact matches of minimum length 12 (288,899 fragments in total); the termini of replication are marked.]

32

Conclusions

Chaining Algorithms are efficient for comparative genomics

More variations are needed for real applications in biology, e.g., limiting the range search and considering overlaps

CoCoNUT is a system for comparative genomics containing several variants of the chaining algorithms

Global and local chaining are analogous to global and local sequence alignment

The kd-tree is superior to the range tree in practice

33

More on Chaining Algorithms

[1] E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. Handbook of Computational Molecular Biology (Chapter 20), 2005, in press.

[2] M. I. Abouelhoda. Algorithms and a Software System for Comparative Genome Analysis. PhD Thesis, Ulm University, 2005.

34

Thanks for your attention
