A Sub-quadratic Sequence Alignment Algorithm

Page 1: A Sub-quadratic Sequence  Alignment Algorithm

A Sub-quadratic Sequence Alignment Algorithm

Page 2: A Sub-quadratic Sequence  Alignment Algorithm

Global alignment

[Figure: alignment graph for S = aacgacga, T = ctacgaga]

Complexity: O(n^2)

V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }

where δ is the scoring function and "-" denotes a gap.
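The recurrence for V(i,j) can be sketched as a plain quadratic dynamic program. This is only an illustration: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions, not something the slides fix.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the V table row by row and return the global alignment score V(n, m).

    The scoring values are illustrative assumptions; any scoring function
    can be substituted for the match/mismatch/gap constants.
    """
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + gap
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            v[i][j] = max(
                v[i - 1][j - 1] + pair,  # align S[i] with T[j]
                v[i - 1][j] + gap,       # align S[i] with a gap
                v[i][j - 1] + gap,       # align a gap with T[j]
            )
    return v[n][m]


if __name__ == "__main__":
    print(global_alignment_score("aacgacga", "ctacgaga"))
```

Every cell is visited once and does constant work, which is where the O(n^2) bound comes from.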

Page 3: A Sub-quadratic Sequence  Alignment Algorithm

FOUR RUSSIAN ALGORITHM

Page 4: A Sub-quadratic Sequence  Alignment Algorithm
Page 5: A Sub-quadratic Sequence  Alignment Algorithm

UNRESTRICTED SCORING FUNCTION

Page 6: A Sub-quadratic Sequence  Alignment Algorithm

Main idea: Compress the sequences

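• S = aacgacga
• T = ctacgaga

LZ-78: Divide the sequence into distinct words

  S:  a | ac | g | acg | a       (words 1 to 4, then a leftover "a")
  T:  c | t | a | cg | ag | a    (words 1 to 5, then a leftover "a")

[Figure: the trie for S (nodes 0 to 4) and the trie for T (nodes 0 to 5)]

The number of distinct words: O(n/log n)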

Page 7: A Sub-quadratic Sequence  Alignment Algorithm

Main idea

[Figure: the alignment grid of S and T cut into blocks by their LZ-78 words, with blocks such as 3/4, 3/2, 5/4 and 5/2 labelled, next to the trie for S and the trie for T]

• Compute the alignment score in each block
• Propagate the scores between the adjacent blocks

Page 8: A Sub-quadratic Sequence  Alignment Algorithm

Main idea

• Compress the sequence into words
• Pre-compute the score for each block
• Do alignment between blocks

• Note:
  – Replace normal characters by words
  – Operate on blocks

Page 9: A Sub-quadratic Sequence  Alignment Algorithm

COMPRESS THE SEQUENCE: LZ-78

Page 10: A Sub-quadratic Sequence  Alignment Algorithm

LZ-78

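• S = aacgacga
• T = ctacgaga

LZ-78: Divide the sequence into distinct words

  S:  a | ac | g | acg | a       (words 1 to 4, then a leftover "a")
  T:  c | t | a | cg | ag | a    (words 1 to 5, then a leftover "a")

[Figure: the trie for S (nodes 0 to 4) and the trie for T (nodes 0 to 5)]

The number of distinct words: O(n/log n)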
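The LZ-78 division into words can be sketched with a trie stored as nested dictionaries. The function name `lz78_parse` and the dictionary representation are illustrative choices, not taken from the slides; the leftover at the end of the input is returned as a final, possibly repeated, word.

```python
def lz78_parse(text):
    """Split text into LZ-78 words: each new word is the longest
    previously seen word extended by one character."""
    trie = {}          # root node; each node maps a character to a child node
    words = []
    node = trie
    start = 0          # index where the current word begins
    for pos, ch in enumerate(text):
        if ch in node:
            node = node[ch]            # keep walking down an existing word
        else:
            node[ch] = {}              # new trie node = new distinct word
            words.append(text[start:pos + 1])
            node = trie
            start = pos + 1
    if start < len(text):
        # The input ended in the middle of a known word.
        words.append(text[start:])
    return words


if __name__ == "__main__":
    print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
    print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

Each distinct word adds exactly one node to the trie, so the number of trie nodes (besides the root) equals the number of distinct words.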

Page 11: A Sub-quadratic Sequence  Alignment Algorithm

LZ-78

• Theorem (Lempel and Ziv):
  – Constant alphabet sequence S
  – The maximal number of distinct phrases in S is O(n/log n).

• Tighter upper bound: O(hn/log n)
  – h is the entropy factor, a real number, 0 < h ≤ 1
  – Entropy is small when the sequence is repetitive

Page 12: A Sub-quadratic Sequence  Alignment Algorithm

COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK

Page 13: A Sub-quadratic Sequence  Alignment Algorithm

Compute the alignment score in each block

[Figure: the alignment grid of S and T cut into blocks by their LZ-78 words, with blocks such as 3/4, 3/2, 5/4 and 5/2 labelled]

Page 14: A Sub-quadratic Sequence  Alignment Algorithm

• Given
  – Input border: I
  – Block

• Compute
  – Output border: O

[Figure: block G with input border I and output border O]

Page 15: A Sub-quadratic Sequence  Alignment Algorithm

Matrices

• I[i]: the input border value
• DIST[i,j]: weight of the optimal path
  – From entry i of the input border
  – To entry j of its output border

• OUT[i,j]: merges the information from input row I and DIST
  – OUT[i,j] = I[i] + DIST[i,j]

• O[j] = max{OUT[i,j] for i=1..n}

[Figure: block G with input border I and output border O]

Page 16: A Sub-quadratic Sequence  Alignment Algorithm

DIST and OUT matrix example

[Figure: block G with input border I and output border O]

Block – sub-sequences “acg”, “ag”

I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3

DIST matrix (△ marks an entry with no path):

          0    1    2    3    4    5
  I0      0   -1   -2   -3    △    △
  I1     -1   -1   -2   -1   -3    △
  I2     -2    0    0    1   -1   -3
  I3      △   -2   -2    0   -2   -2
  I4      △    △   -2    0   -1   -1
  I5      △    △    △   -2   -1    0

OUT matrix (OUT[i,j] = I[i] + DIST[i,j]):

          0    1    2    3    4    5
  I0=1    1    0   -1   -2   -∞   -∞
  I1=2    1    1    0    1   -1   -∞
  I2=3    1    3    3    4    2    0
  I3=2  -12    0    0    2    0    0
  I4=1  -13  -13   -1    1    0    0
  I5=3  -14  -14  -14    1    2    3

O (maximum of each column of OUT):

         O0   O1   O2   O3   O4   O5
          1    3    3    4    2    3
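The step from I and DIST to O in this example can be reproduced with a direct, quadratic sketch. The names `DIST`, `INPUT_BORDER` and `output_border` are illustrative, and the △ entries are modelled here as negative infinity, which is an assumption that leaves the column maxima unchanged.

```python
NEG_INF = float("-inf")
D = NEG_INF  # stands in for the slide's "no path" entries

# DIST for the block with sub-sequences "acg" and "ag", rows I0..I5.
DIST = [
    [0, -1, -2, -3, D, D],
    [-1, -1, -2, -1, -3, D],
    [-2, 0, 0, 1, -1, -3],
    [D, -2, -2, 0, -2, -2],
    [D, D, -2, 0, -1, -1],
    [D, D, D, -2, -1, 0],
]
INPUT_BORDER = [1, 2, 3, 2, 1, 3]  # I0..I5


def output_border(input_border, dist):
    """O[j] = max over i of (I[i] + DIST[i][j]), computed the direct way."""
    rows, cols = len(dist), len(dist[0])
    return [
        max(input_border[i] + dist[i][j] for i in range(rows))
        for j in range(cols)
    ]


if __name__ == "__main__":
    print(output_border(INPUT_BORDER, DIST))  # [1, 3, 3, 4, 2, 3]
```

This touches every entry of the matrix once, so it costs time proportional to the number of entries.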

Page 17: A Sub-quadratic Sequence  Alignment Algorithm

• For each block, given two sub-sequences S1, S2

• Compute (from scratch) DIST in Θ(n*m) time
• Given I and DIST, compute OUT in Θ(n*m) time
• Given OUT[i,j], compute O in Θ(m*n) time
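A from-scratch computation of DIST for one block can be sketched by running a small dynamic program from every input border entry. Several things here are assumptions inferred from the example values rather than stated in the slides: the scoring (match +1, mismatch -1, gap -1), the border ordering (input border runs up the left column and then along the top row, output border runs along the bottom row and then up the right column), and the name `block_dist`. This simple version repeats the dynamic program once per border entry, so it is not tuned to any particular time bound.

```python
NEG_INF = float("-inf")
GAP = -1  # assumed gap score


def pair_score(x, y):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def block_dist(horizontal, vertical):
    """DIST[i][j] = best path weight from input border entry i to output
    border entry j inside the block (NEG_INF when no path exists)."""
    c, r = len(horizontal), len(vertical)
    # Input border: bottom-left corner up to top-left, then along the top row.
    inputs = [(row, 0) for row in range(r, -1, -1)]
    inputs += [(0, col) for col in range(1, c + 1)]
    # Output border: along the bottom row, then up the right column.
    outputs = [(r, col) for col in range(c + 1)]
    outputs += [(row, c) for row in range(r - 1, -1, -1)]
    dist = []
    for r0, c0 in inputs:
        best = [[NEG_INF] * (c + 1) for _ in range(r + 1)]
        best[r0][c0] = 0
        for i in range(r0, r + 1):
            for j in range(c0, c + 1):
                if i == r0 and j == c0:
                    continue
                options = []
                if i > r0:
                    options.append(best[i - 1][j] + GAP)
                if j > c0:
                    options.append(best[i][j - 1] + GAP)
                if i > r0 and j > c0:
                    options.append(best[i - 1][j - 1]
                                   + pair_score(vertical[i - 1], horizontal[j - 1]))
                best[i][j] = max(options)
        dist.append([best[ro][co] for ro, co in outputs])
    return dist


if __name__ == "__main__":
    for row in block_dist("acg", "ag"):
        print(row)
```

Under these assumptions, `block_dist("acg", "ag")` reproduces the DIST matrix of the example block, with negative infinity wherever the slide shows △.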

Page 18: A Sub-quadratic Sequence  Alignment Algorithm

Review
• Compress the sequence
• Pre-compute DIST[i,j] for each block
• Compute border values of each block

• Remaining questions
  – How to compute DIST[i,j] efficiently?
  – How to compute O[j] from I[i] and DIST[i,j] efficiently?

[Figure: the block grid for S and T, with blocks such as 4/4, 5/4 and 5/3 labelled]

Page 19: A Sub-quadratic Sequence  Alignment Algorithm

COMPUTE O[J] EFFICIENTLY

Page 20: A Sub-quadratic Sequence  Alignment Algorithm

Compute O[j] efficiently

• For each block of two sub-sequences S1, S2
• Given
  – I[i]
  – DIST[i,j]

• Compute
  – O[j]

Page 21: A Sub-quadratic Sequence  Alignment Algorithm

DIST and OUT matrix example

[Figure: block G with input border I and output border O]

Block – sub-sequences “acg”, “ag”

I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3

DIST matrix (△ marks an entry with no path):

          0    1    2    3    4    5
  I0      0   -1   -2   -3    △    △
  I1     -1   -1   -2   -1   -3    △
  I2     -2    0    0    1   -1   -3
  I3      △   -2   -2    0   -2   -2
  I4      △    △   -2    0   -1   -1
  I5      △    △    △   -2   -1    0

OUT matrix (OUT[i,j] = I[i] + DIST[i,j]):

          0    1    2    3    4    5
  I0=1    1    0   -1   -2   -∞   -∞
  I1=2    1    1    0    1   -1   -∞
  I2=3    1    3    3    4    2    0
  I3=2  -12    0    0    2    0    0
  I4=1  -13  -13   -1    1    0    0
  I5=3  -14  -14  -14    1    2    3

O (maximum of each column of OUT):

         O0   O1   O2   O3   O4   O5
          1    3    3    4    2    3

Page 22: A Sub-quadratic Sequence  Alignment Algorithm

Compute O without explicit OUT

[Figure: block G with input border I and output border O]

Block – sub-sequences “acg”, “ag”

I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3

DIST matrix:

          0    1    2    3    4    5
  I0      0   -1   -2   -3    △    △
  I1     -1   -1   -2   -1   -3    △
  I2     -2    0    0    1   -1   -3
  I3      △   -2   -2    0   -2   -2
  I4      △    △   -2    0   -1   -1
  I5      △    △    △   -2   -1    0

O, obtained from I and DIST with SMAWK:

         O0   O1   O2   O3   O4   O5
          1    3    3    4    2    3

Page 23: A Sub-quadratic Sequence  Alignment Algorithm

• Given DIST[i,j] and I[i], we can compute O[j] in O(n+m)
  – Without creating OUT[i,j]

• How? Why?

Page 24: A Sub-quadratic Sequence  Alignment Algorithm

Why?

• Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.

• Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
  1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d.
  2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d.
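The concave condition can be tested by brute force on a small matrix. The sketch below reads the condition as "whenever M[a,c] ≤ M[b,c], then also M[a,d] ≤ M[b,d], for all a<b and c<d"; this reading is an assumption. The function name and the sample matrix (the OUT matrix of the “acg”, “ag” block, with negative infinity in the upper right corner) are used for illustration only.

```python
NEG_INF = float("-inf")


def satisfies_concave_condition(matrix):
    """Check M[a][c] <= M[b][c]  =>  M[a][d] <= M[b][d] for all a<b, c<d."""
    rows, cols = len(matrix), len(matrix[0])
    for a in range(rows):
        for b in range(a + 1, rows):
            for c in range(cols):
                if matrix[a][c] <= matrix[b][c]:
                    # Row b is at least as good at column c, so it must
                    # stay at least as good at every later column d.
                    for d in range(c + 1, cols):
                        if matrix[a][d] > matrix[b][d]:
                            return False
    return True


# OUT matrix of the example block, rows I0..I5, columns 0..5.
OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]

if __name__ == "__main__":
    print(satisfies_concave_condition(OUT))               # True
    print(satisfies_concave_condition([[0, 1], [1, 0]]))  # False
```

The check is only a brute-force test for small examples; it is far too slow to be part of the alignment algorithm itself.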

Page 25: A Sub-quadratic Sequence  Alignment Algorithm

How?

• Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
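SMAWK itself is longer than fits here. The sketch below is a simpler divide-and-conquer stand-in, not SMAWK: it relies on the same monotonicity (the lowest row holding a column's maximum never moves up as the column index grows) and evaluates roughly O((n+m) log m) entries instead of a linear number. OUT is never stored; `out_entry` computes each value on demand. The names, the use of `None` for △, and the padding constants `N` and `K` (picked so the padded values match the example's -12, -13, -14) are assumptions.

```python
NEG_INF = float("-inf")
X = None  # stands in for the slide's "no path" entries

DIST = [
    [0, -1, -2, -3, X, X],
    [-1, -1, -2, -1, -3, X],
    [-2, 0, 0, 1, -1, -3],
    [X, -2, -2, 0, -2, -2],
    [X, X, -2, 0, -1, -1],
    [X, X, X, -2, -1, 0],
]
INPUT_BORDER = [1, 2, 3, 2, 1, 3]
N, K = 10, 1  # assumed padding constants, chosen to reproduce -12, -13, -14


def out_entry(i, j):
    """OUT[i][j] computed lazily, with padding for the undefined corners."""
    d = DIST[i][j]
    if d is not None:
        return INPUT_BORDER[i] + d
    if j > i:
        return NEG_INF           # upper right corner
    return -(N + i - 1) * K      # lower left corner


def column_maxima(num_rows, num_cols, entry):
    """Column maxima of an implicit matrix by divide and conquer on columns."""
    result = [None] * num_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        best_row, best_val = row_lo, entry(row_lo, mid)
        for row in range(row_lo + 1, row_hi + 1):
            val = entry(row, mid)
            if val >= best_val:  # ties go to the lowest row
                best_row, best_val = row, val
        result[mid] = best_val
        solve(col_lo, mid - 1, row_lo, best_row)
        solve(mid + 1, col_hi, best_row, row_hi)

    solve(0, num_cols - 1, 0, num_rows - 1)
    return result


if __name__ == "__main__":
    print(column_maxima(6, 6, out_entry))  # [1, 3, 3, 4, 2, 3]
```

Breaking ties toward the lowest row matters: with the non-strict concave condition it is the lowest maximal row, not the highest, that is guaranteed to move monotonically.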

Page 26: A Sub-quadratic Sequence  Alignment Algorithm

• Why is DIST[i,j] totally monotone?

[Figure: block G with entries a < b on the input border I and entries c < d on the output border O]

The concave condition

If b-c is better than a-c, then b-d is better than a-d.

Page 27: A Sub-quadratic Sequence  Alignment Algorithm

Other problem

• Rectangle problem of DIST: the △ entries are undefined, so DIST is not a full rectangle of values
• Set upper right corner of OUT to -∞
• Set lower left corner of OUT to -(n+i-1)*k
• Preserve the totally monotone property of OUT

          0    1    2    3    4    5
  I0      0   -1   -2   -3    △    △
  I1     -1   -1   -2   -1   -3    △
  I2     -2    0    0    1   -1   -3
  I3      △   -2   -2    0   -2   -2
  I4      △    △   -2    0   -1   -1
  I5      △    △    △   -2   -1    0

Page 28: A Sub-quadratic Sequence  Alignment Algorithm

COMPUTE DIST[I,J] EFFICIENTLY

Page 29: A Sub-quadratic Sequence  Alignment Algorithm

Compute DIST[i,j] for block(5/4)

[Figure: the alignment grid of S and T cut into blocks by their LZ-78 words, with blocks such as 3/4, 3/2, 5/4 and 5/2 labelled, next to the trie for S and the trie for T]

Page 30: A Sub-quadratic Sequence  Alignment Algorithm

[Figure: the block for “acg” and “ag” with input border entries I0 to I5 and output border entry O3 marked]

DIST matrix (rows I5 down to I0, columns 5 down to 0):

  I5 = 3    0   -1   -2    Δ    Δ    Δ
  I4 = 1   -1   -1    0   -2    Δ    Δ
  I3 = 2   -2   -2    0   -2   -2    Δ
  I2 = 3   -3   -1    1    0    0   -2
  I1 = 2    Δ   -3   -1   -2   -1   -1
  I0 = 1    Δ    Δ   -3   -2   -1    0

Page 31: A Sub-quadratic Sequence  Alignment Algorithm
Page 32: A Sub-quadratic Sequence  Alignment Algorithm
Page 33: A Sub-quadratic Sequence  Alignment Algorithm
Page 34: A Sub-quadratic Sequence  Alignment Algorithm

Page 35: A Sub-quadratic Sequence  Alignment Algorithm

• Only column m in DIST[i,j] is new

• DIST block can be updated in O(m+n)

Page 36: A Sub-quadratic Sequence  Alignment Algorithm

MAINTAINING DIRECT ACCESS TO DIST TABLE

Page 37: A Sub-quadratic Sequence  Alignment Algorithm

[Figure: S = aacgacga and T = ctacgaga with the trie for S, the trie for T and the block grid, annotated with the DIST column (-3, -1, 1, 0, 0, -2)]

Page 38: A Sub-quadratic Sequence  Alignment Algorithm

[Figure: S = aacgacga and T = ctacgaga with the trie for S, the trie for T and the block grid, annotated with further DIST columns in addition to (-3, -1, 1, 0, 0, -2)]

Page 39: A Sub-quadratic Sequence  Alignment Algorithm

DIST

[Figure: S = aacgacga and T = ctacgaga with the trie for S, the trie for T and the block grid, annotated with the DIST columns]

Page 40: A Sub-quadratic Sequence  Alignment Algorithm
Page 41: A Sub-quadratic Sequence  Alignment Algorithm

Complexity

• Assume |S| = |T| = n
• Number of words in S, T = O(hn/log n)
• Number of blocks in alignment graph: O(h^2 n^2/(log n)^2)
• For each block
  – Update new DIST block: O(t), where t = size of the border
  – Create direct access table: O(t)

• Propagating I/O across blocks
  – SMAWK: O(t)

• Sum of the sizes of all borders is O(hn^2/log n)
• Total complexity: O(hn^2/log n)
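The bound on the sum of the border sizes follows from a short calculation. As a sketch, write s_1, ..., s_p for the words of S and t_1, ..., t_q for the words of T (this notation is introduced here for illustration), and take the border of block (i, j) to have size proportional to |s_i| + |t_j|:

```latex
\sum_{i=1}^{p} \sum_{j=1}^{q} \bigl( |s_i| + |t_j| \bigr)
  = q \sum_{i=1}^{p} |s_i| + p \sum_{j=1}^{q} |t_j|
  = (p + q)\, n
  = O\!\left( \frac{h n^2}{\log n} \right),
\qquad \text{since } p, q = O\!\left( \frac{h n}{\log n} \right)
\text{ and } \sum_i |s_i| = \sum_j |t_j| = n .
```

Because the work per block is linear in its border size t, the same quantity bounds the total running time.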

Page 42: A Sub-quadratic Sequence  Alignment Algorithm

Other extensions

• Trace
• Reducing the space complexity for discrete scoring
• Local alignment

Page 43: A Sub-quadratic Sequence  Alignment Algorithm

References

• Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688

• Some pictures from 葉恆青