Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366.

Washington, DC. March 2003.

BMI 731 - Winter'04

Overview

• Applications of queries

• Background on queries

• Current problem

• Solutions and our solution

• Comparison experiments and results

• Future work


Queries in general

• We need a metric distance function
– To measure the (dis)similarity between objects

• Dynamic programming algorithm
– O(|string1| * |string2|) time and space
• i.e. O(n^2), where n is the length of the strings
– Especially bad for genetic sequence queries, where the sequences are long


2 kinds of queries

ε-range queries
– Retrieve all objects whose similarity to the query exceeds a certain degree


2 kinds of queries

k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
• No domain knowledge necessary

Ex: 4-NN


2 kinds of queries

ε-range queries
• Require domain knowledge
– Data distribution & distance definition

ε too small: none returned


2 kinds of queries

ε-range queries

ε too large: all returned
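The two query types can be sketched as plain linear scans. This is only an illustration: the function names, the `eps` parameter and the toy numeric distance are assumptions made here, not part of the original slides.

```python
def range_query(db, query, eps, dist):
    """Return every object whose distance to the query is at most eps."""
    return [obj for obj in db if dist(obj, query) <= eps]


def knn_query(db, query, k, dist):
    """Return the k objects closest to the query (ties keep input order)."""
    return sorted(db, key=lambda obj: dist(obj, query))[:k]


def toy_dist(a, b):
    """Toy distance used only for this illustration (hypothetical)."""
    return abs(a - b)


db = [1, 4, 9, 16, 25]
print(range_query(db, 10, 0.5, toy_dist))   # threshold too small: nothing returned
print(range_query(db, 10, 100, toy_dist))   # threshold too large: everything returned
print(knn_query(db, 10, 2, toy_dist))       # always exactly k answers
```

The range query needs a threshold that suits the data distribution, while the k-NN query needs only k.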


Measuring similarity

• We need a metric distance function
– To measure the (dis)similarity between objects

• Edit Distance (ED)
– Three kinds of operations
• Insert (I), delete (D), replace (R)
– ACTTAGC to AATGATAG

    A C T - - T A G C
      R   I I       D      ED = 4
    A A T G A T A G -

– Dynamic programming algorithm
– O(mn) time and space
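A minimal sketch of the dynamic programming algorithm for edit distance, assuming unit cost for insert, delete and replace (the function name is illustrative):

```python
def edit_distance(s, t):
    """Edit distance with unit-cost insert, delete and replace: O(mn) time and space."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete s[i-1]
                dp[i][j - 1] + 1,         # insert t[j-1]
                dp[i - 1][j - 1] + cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4, as in the example above
```

The full (m+1) x (n+1) table is what makes the cost quadratic for long sequences.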


DPA


String/Genome Data

• Asks for the substrings in the database most similar to the given string.
• BLAST has ε-range queries
– Naïve search (linear scan)
– Scalability problems

• How to Handle Size
– Partial information rather than the whole database
• Approximate the string data (compress)
– May fit in memory
– May be used for indexing, clustering


How to Handle Size

• 3 approaches to make use of compressed data

1. Prune irrelevant data; I/O only for non-pruned entries, whose exact values are then calculated (especially ε-range queries)

2. Get approximate answers, virtually no I/O (I/O only for answers) (especially k-NN queries)

3. Approximate pruning for ε-range queries


Overview


• Current problem

• Transformation and Indexing


• Future work


Big Picture
General approach step by step

• Transform (large) string data into (hopefully smaller) multi-dimensional vectors

• Develop a distance function df in the vector space to approximate string similarity

• Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)

• Implement one of the three approaches mentioned (query)


Preprocessing

[Diagram: (1) Windowing: the string database is cut into overlapping windows. (2) Transformation into vector space: the windows become multidimensional vectors. (3) Indexing: the vectors are indexed with respect to some distance function.]


[Diagram: query processing. (1) Transformation: the query sequence is mapped to a vector. (2a) Approximate query (k-NN or ε-range) on the index of vectors: done. The vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives. (2b) Exact query (k-NN or ε-range) on the index of vectors: using the index yields a candidate set, and processing continues with step 3. (3) Refine: I/O for the strings represented by the candidate vectors; calculate ED for each of them and remove the false positives.]
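The filter-and-refine path for a range query can be sketched as follows. This sketch makes several assumptions: it uses the 1-tuple frequency vector and frequency distance that the deck introduces later as the lower bound, it replaces the multi-dimensional index with a plain scan over the vectors, and all function names are illustrative.

```python
def fv1(s):
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]


def fd1(u, v):
    """Frequency distance on 1-tuple frequency vectors (a lower bound on edit distance)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def range_query_filter_refine(db, query, eps):
    """Filter with the cheap lower bound, then refine with the exact distance."""
    q = fv1(query)
    # Filter: a string whose lower bound already exceeds eps cannot qualify.
    candidates = [s for s in db if fd1(fv1(s), q) <= eps]
    # Refine: compute the exact edit distance only for the candidates.
    hits = [s for s in candidates if edit_distance(s, query) <= eps]
    return hits, len(candidates)


db = ["AACCGG", "CCGGTT", "GGTTAC", "TTACGT", "ACGTAC", "GTACGT"]
hits, n_candidates = range_query_filter_refine(db, "AACCGT", 2)
print(hits, n_candidates)
```

Because the filter distance never exceeds the edit distance, the filter step cannot discard a true answer; it only leaves false positives for the refine step to remove.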


1st Step: Partitioning into Overlapping Windows

• AACCGGTTACGTACGT…

e.g. window size W = 6

e.g. shift amount = 2
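A small sketch of the windowing step, assuming that a trailing fragment shorter than the window size is simply dropped (the slides do not say how the tail is handled) and using only the 16 characters shown above:

```python
def overlapping_windows(sequence, w, shift):
    """Cut a sequence into windows of length w whose start positions are shift apart."""
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, shift)]


windows = overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2)
print(windows)
# ['AACCGG', 'CCGGTT', 'GGTTAC', 'TTACGT', 'ACGTAC', 'GTACGT']
```

With a window of 6 and a shift of 2, consecutive windows share 4 characters.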


2nd Step: Mapping Windows into Vector Space

• Choose a tuple size k

• Associate an integer with each of the 4^k k-tuples

• The frequencies of those k-tuples form the vector

• If k = 2, there are 4^k = 16 k-tuples
• AA, AC, AG, AT,
• CA, CC, CG, CT
• TA, TC, TG, TT
• GA, GC, GG, GT


Example Mapping

• The integers assigned
• AA=0, AC=1, AG=2, AT=3,
• CA=4, CC=5, CG=6, CT=7
• TA=8, TC=9, TG=10, TT=11
• GA=12, GC=13, GG=14, GT=15

• Assume window AACCGG

• AA, AC, CC, CG, GG all occur once

• 1100011000000010 is the matching vector (ones at positions 0, 1, 5, 6 and 14).
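The mapping can be sketched for k = 2 as below. The tuple order follows the integer assignment listed above (first letter in the order A, C, T, G; second letter in the order A, C, G, T); the names `TUPLES`, `INDEX` and `fv2` are illustrative.

```python
# 2-tuples in the order of the integer assignment above: AA=0, AC=1, ..., GT=15.
TUPLES = [a + b for a in "ACTG" for b in "ACGT"]
INDEX = {t: i for i, t in enumerate(TUPLES)}


def fv2(window):
    """Count how often each 2-tuple occurs in the window."""
    vec = [0] * len(TUPLES)
    for i in range(len(window) - 1):
        vec[INDEX[window[i:i + 2]]] += 1
    return vec


vec = fv2("AACCGG")
print("".join(str(c) for c in vec))
```

A window of length W contains W - 1 overlapping 2-tuples, so the entries of the vector always sum to W - 1.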


Different transformations & Distance Functions

• Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T): FV1
– 2 → 16 (frequencies of 2-tuples): FV2


Different transformations & Distance Functions 2

• WVn transformation
– Split the string into halves x, y
– Compute the FVn of x and y: FVx, FVy
– Concatenate their sum and difference: [FVx + FVy, FVx - FVy]

• Wavelet 1 on an example
– TCACTTAG
– 1st: divide into halves & find the FV1 transformation
• x: TCAC → 1 2 0 1
• y: TTAG → 1 0 1 2
– 2nd: add and subtract
• 2 2 1 3 0 2 –1 –1: WV1

• Same operations on 2-tuples: WV2
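The WV1 transformation on the example can be sketched as follows. The function names are illustrative, and giving the second half the extra character for odd-length strings is an assumption made here; the slides only show an even-length example.

```python
ALPHABET = "ACGT"


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in ALPHABET]


def wv1(s):
    """Concatenate the sum and the difference of the two halves' frequency vectors."""
    half = len(s) // 2
    fx, fy = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(fx, fy)]
    diffs = [a - b for a, b in zip(fx, fy)]
    return sums + diffs


print(wv1("TCACTTAG"))  # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first half of the result is the plain frequency vector of the whole string; the second half records how the two halves differ.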


Distance Functions on the Vector Spaces

• All of them are proven lower bounds on edit distance

• FD1 distance on FV1

• FD2 distance on FV2

• WD1 distance on WV1

• WD2 distance on WV2


Frequency Distance FDn

Algorithm

FDn(n-gram frequencies u, v)

• posDist := negDist := 0
• For all dimensions ui, vi
– If ui > vi then posDist += ui - vi
– else negDist += vi - ui

• Return max(posDist, negDist)/n

Example (n=1)

• u: ACTTAGC → 2,2,1,2    v: AATGATAG → 4,0,2,2
– 2-4 < 0: negDist += |2-4|
– 2-0 > 0: posDist += |2-0|
– 1-2 < 0: negDist += |1-2|
– 2-2 = 0
• posDist: 2, negDist: 3
• FD1 is 3
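A minimal sketch of the frequency distance, accumulating the positive and negative differences the way the worked example does (the function name and the `n` default are illustrative):

```python
def frequency_distance(u, v, n=1):
    """max(total surplus of u over v, total surplus of v over u), divided by n."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos_dist, neg_dist) / n


u = [2, 2, 1, 2]  # ACTTAGC: counts of A, C, G, T
v = [4, 0, 2, 2]  # AATGATAG
print(frequency_distance(u, v))  # 3.0
```

The result is symmetric in u and v, since swapping them only swaps the two accumulated sums.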


FDn: Why a lower bound?

• On the example
– Need to increase A by 2 and G by 1: 3 operations
– Need to decrease C by 2: 2 operations

• We may "increase + decrease" in a single operation if we can replace (back to slide #8)

• So in the best case the edit distance is only FD1

• But that may not be the case; you may need more operations because of mismatched locations…

• The division by n is because a change in one character updates the frequencies of n n-grams.


Wavelet Distance WDn

Algorithm

WDn(n-gram frequency wavelets u, v)

• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := (posDist-negDist)/2
• If m < d
– Return d/n
• else
– Return (d + (m-d)/2)/n

Example (n=1)

• u: ACTC TAGC → halves 1201, 1111 → wavelet 2 3 1 2 0 1 –1 0
• v: AATG ATAG → halves 2011, 2011 → wavelet 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4, d: 0
• (0 + 4/2)/1
• Return 2
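The steps above can be transcribed literally as a sketch. The slide does not say whether the difference posDist - negDist is meant as an absolute value, so none is taken here; that choice, like the function name, is an assumption of this sketch.

```python
def wavelet_distance(u, v, n=1):
    """Literal transcription of the WDn steps on two wavelet vectors."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    m = min(pos_dist, neg_dist)
    d = (pos_dist - neg_dist) / 2  # taken as written on the slide, no absolute value
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = [2, 3, 1, 2, 0, 1, -1, 0]  # wavelet of ACTC TAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # wavelet of AATG ATAG
print(wavelet_distance(u, v))  # 2.0
```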


WDn Why lower bound?

• Assume a string transformed into wavelet

[a1,…a, b1,…b]

• Largest change: posDist += 3 and negDist -= 1, or vice versa
– So use this change whenever posDist ≠ negDist


Overview


• Current problem

• Transformation and Indexing


• Future work


Experiment Design

• Implemented the transformations & distance functions

• Evaluated their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries, experimentally, on real genetic data

• Ran queries with different parameters
– Varying string size W and shift amount
– Some containing an exact match, some not
– For ε-range queries, different ε values
– For k-NN queries, different k values


K-nearest efficiency

[Figure: average of the edit distances of the k nearest (y-axis, 0 to 90) against k for the k-nearest neighbor query (x-axis, 5 to 25). Series: EditDist, Freq, Freq2, MaxFreq, Wav, Wav2.]


Error Rates Compared

[Figure: percentage error (y-axis, 0% to 160%) against k (x-axis, 5 to 25). Series: (Freq-Edit)/Edit, (Freq2-Edit)/Edit, (MaxFreq-Edit)/Edit, (Wav-Edit)/Edit, (Wav2-Edit)/Edit.]


Sorted Graphs

• To show why our distance functions perform so well in k-NN

• Imitate what our k-NN approximation does, and graph the result
– It sorts the data values in increasing order and takes the k nearest ones
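The sort-and-take-k approximation can be sketched as follows, assuming FD1 on 1-tuple frequency vectors as the approximating distance; the function names and the small database are illustrative.

```python
def fv1(s):
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]


def fd1(u, v):
    """Frequency distance on 1-tuple frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def approximate_knn(db, query, k):
    """Sort by the cheap vector distance and keep the first k; no edit distance is computed."""
    q = fv1(query)
    return sorted(db, key=lambda s: fd1(fv1(s), q))[:k]


db = ["CCGGTT", "GGTTAC", "AACCGG", "TTACGT", "ACGTAC", "GTACGT"]
print(approximate_knn(db, "AACCGG", 2))
```

The answer is only as good as the ordering induced by the approximating distance, which is what the sorted graphs examine.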


Edit Distances and Matching FD1 Distances sorted by FD1

[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by FD1 (x-axis). Series: ED, FD1. Markers at the 20 nearest and the 50 nearest.]

Edit Distances and Matching WD2 sorted by WD2

[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by WD2 (x-axis). Series: ED, WD2. Markers at the 20 nearest and the 50 nearest.]


Nature of the distance functions

• WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to edit distance is much lower than that of the others, which is what you would like from a distance function


wav2

[Figure: distance value (y-axis, 0 to 140) per string. Series: EditDist, WaveletDist2.]


Freq

[Figure: distance, edit and freq (y-axis, 0 to 140), for the strings sorted by edit distance to the query (x-axis). Series: EditDist, FreqDist.]


Results

• Tested the parameters obtained from these random experiments on real data.

• Then also did the parameter extraction using real data.


Comparison of index structures


Future Work

• Check the applicability of these methods to other kinds of sequence data
– Text
– Image search

• Implement the index structure in the standalone program and evaluate its performance
