Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.

Page 1: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Effective Indexing and Filtering for Similarity Search in Large

Biosequence Databases

O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366.

Washington, DC. March 2003.

Page 2: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

BMI 731 - Winter'04 2

Overview

• Applications of queries

• Background on queries

• Current problem

• Solutions and our solution

• Comparison experiments and results

• Future work

Page 3: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Queries in general

• We need a metric distance function
– To measure the (dis)similarity between objects

• Dynamic programming algorithm
– O(|string1| * |string2|) time and space
• i.e. O(n^2), where n is the length of the strings
– Especially bad for genetic sequence queries, where the sequences are long

Page 4: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

2 kinds of queries

ε-range queries
– Retrieve all objects that are similar to the query beyond a certain degree

Page 5: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

2 kinds of queries

k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
• No domain knowledge necessary

Ex: 4-NN

Page 6: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

2 kinds of queries

ε-range queries
• Require domain knowledge
– Data distribution & distance definition

ε too small: none returned

Page 7: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

2 kinds of queries

ε-range queries

ε too large: all returned
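
The two query types can be sketched with a plain linear scan. This is only an illustration under assumptions: the function names `range_query` and `knn_query`, the toy database, and the Hamming distance used as a stand-in metric are all hypothetical and not taken from the slides.

```python
import heapq


def hamming(a, b):
    """Toy stand-in metric (assumption): mismatches between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def range_query(db, query, dist, eps):
    """Return every object whose distance to the query is at most eps."""
    return [obj for obj in db if dist(query, obj) <= eps]


def knn_query(db, query, dist, k):
    """Return the k objects closest to the query (ties keep database order)."""
    return heapq.nsmallest(k, db, key=lambda obj: dist(query, obj))


db = ["AACC", "AACG", "TTGG", "ACCC"]
print(range_query(db, "AAAA", hamming, 0))  # [] : eps too small, none returned
print(range_query(db, "AAAA", hamming, 4))  # eps too large, all four returned
print(knn_query(db, "AACC", hamming, 2))    # ['AACC', 'AACG']
```

The k-NN call needs no threshold, which is why it requires no knowledge of the data distribution, whereas a useful eps has to be chosen with the distance scale in mind.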

Page 8: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Measuring similarity

• We need a metric distance function
– To measure the (dis)similarity between objects

• Edit Distance (ED)
– Three kinds of operations
• Insert, delete, replace

– ACTTAGC to AATGATAG

    A C T - - T A G C
      R   I I       D      ED = 4
    A A T G A T A G -

– Dynamic programming algorithm
– O(mn) time and space
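
The dynamic-programming computation mentioned above can be sketched as follows. This is the textbook Levenshtein table with unit costs for insert, delete and replace; the function name `edit_distance` is an illustrative choice, and the slides do not show their own implementation.

```python
def edit_distance(s, t):
    """Edit distance between s and t using an O(mn) time and space table."""
    m, n = len(s), len(t)
    # dist[i][j] holds the edit distance between s[:i] and t[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # replace or match
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4, as in the alignment above
```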

Page 9: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

DPA

Page 10: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

String/Genome Data

• Asks for the substrings in the database most similar to the given string.
• BLAST has ε-range queries
– Naïve search (linear scan)
– Scalability problems

• How to Handle Size
– Partial information rather than the whole database
• Approximate the string data (compress)
– May fit in memory
– May be used for indexing, clustering

Page 11: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

How to Handle Size

• 3 approaches to make use of compressed data

1. Prune irrelevant data; do I/O and calculate exact values only for the non-pruned entries (especially ε-range queries)

2. Get approximate answers with virtually no I/O (I/O only for answers) (especially k-NN queries)

3. Approximate pruning for ε-range queries

Page 12: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Overview

• Background on queries

• Current problem

• Transformation and Indexing

• Comparison experiments and results

• Future work

Page 13: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Big Picture

General approach, step by step

• Transform (large) string data into (hopefully smaller) multi-dimensional vectors
• Develop a distance function df in vector space to approximate the string similarity
• Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)
• Implement one of the three approaches mentioned (query)

Page 14: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Preprocessing

[Diagram: (1) Windowing: the string database is cut into overlapping windows. (2) Transformation into vector space: the windows become multidimensional vectors. (3) Indexing: the vectors are indexed with respect to some distance function.]

Page 15: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Using the index

[Diagram: (1) Transformation: the query sequence is transformed into a vector. (2a) Approximate query (k-NN or ε-range) on the index of vectors: done; the vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives. (2b) Exact query (k-NN or ε-range) on the index of vectors: returns a candidate set; continued on the next slide.]

Page 16: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Using the index (continued)

[Diagram: (3) Refine: do I/O for the strings represented by the vectors in the candidate set, calculate ED for each of them, and remove the false positives.]
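
The filter-and-refine flow in these diagrams can be sketched as follows. This is a minimal sketch under stated assumptions: a plain list stands in for the index of vectors, and the filter is a simple character-count lower bound on the edit distance (the deck introduces bounds of this kind later as FD1). The names `count_lower_bound` and `filter_and_refine` are hypothetical.

```python
from collections import Counter


def edit_distance(s, t):
    """Edit distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # replace or match
        prev = curr
    return prev[-1]


def count_lower_bound(s, t):
    """Lower bound on the edit distance from character counts alone."""
    cs, ct = Counter(s), Counter(t)
    symbols = set(cs) | set(ct)
    surplus = sum(max(cs[c] - ct[c], 0) for c in symbols)
    deficit = sum(max(ct[c] - cs[c], 0) for c in symbols)
    return max(surplus, deficit)


def filter_and_refine(db, query, eps):
    """Range query: cheap filter first, exact edit distance only on survivors."""
    # Filter: a lower bound above eps proves the true distance is above eps.
    candidates = [s for s in db if count_lower_bound(query, s) <= eps]
    # Refine: compute the exact distance to remove false positives.
    answers = [s for s in candidates if edit_distance(query, s) <= eps]
    return candidates, answers


db = ["ACTTAGC", "CATTAGC", "ACTTAGG", "AATGATAG", "GGGGGGGG"]
candidates, answers = filter_and_refine(db, "ACTTAGC", eps=1)
print(candidates)  # ['ACTTAGC', 'CATTAGC', 'ACTTAGG']
print(answers)     # ['ACTTAGC', 'ACTTAGG']
```

Here CATTAGC passes the filter because it has the same character counts as the query, but its edit distance is 2, so the refine step discards it as a false positive. Because the filter is a lower bound, no true answer is ever pruned.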

Page 17: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

1st Step: Partitioning into Overlapping Windows

• AACCGGTTACGTACGT…

• AACCGGTTACGTACGT…

• AACCGGTTACGTACGT…

e.g. window size W = 6

e.g. shift amount = 2
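
The windowing step can be sketched in a few lines. The function name `overlapping_windows` and its parameter names are illustrative assumptions; the slide only gives the window size W and the shift amount.

```python
def overlapping_windows(sequence, window_size, shift):
    """Cut a sequence into windows of window_size, starting every shift characters."""
    last_start = len(sequence) - window_size
    return [sequence[i:i + window_size] for i in range(0, last_start + 1, shift)]


print(overlapping_windows("AACCGGTTACGTACGT", window_size=6, shift=2))
# ['AACCGG', 'CCGGTT', 'GGTTAC', 'TTACGT', 'ACGTAC', 'GTACGT']
```

A smaller shift produces more windows (and so more vectors to index), while a larger shift risks missing the alignment of a query against the database string.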

Page 18: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

2nd Step: Mapping Windows into Vector Space

• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k = 2, there are 4^k = 16 k-tuples
• AA, AC, AG, AT
• CA, CC, CG, CT
• TA, TC, TG, TT
• GA, GC, GG, GT

Page 19: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Example Mapping

• The integers assigned
• AA=0, AC=1, AG=2, AT=3
• CA=4, CC=5, CG=6, CT=7
• TA=8, TC=9, TG=10, TT=11
• GA=12, GC=13, GG=14, GT=15

• Assume window AACCGG
• AA, AC, CC, CG, GG all occur once
• 1100011000000010 is the matching vector under this assignment (ones at positions 0, 1, 5, 6 and 14).
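
The mapping from a window to its 2-tuple frequency vector can be sketched as follows. The tuple order is copied from the integer assignment listed on this slide; the names `SLIDE_ORDER` and `frequency_vector` are hypothetical. Note that with GG = 14 the nonzero positions for AACCGG are 0, 1, 5, 6 and 14.

```python
# 2-tuple order exactly as assigned on the slide (AA=0 ... GT=15).
SLIDE_ORDER = [
    "AA", "AC", "AG", "AT",
    "CA", "CC", "CG", "CT",
    "TA", "TC", "TG", "TT",
    "GA", "GC", "GG", "GT",
]
TUPLE_INDEX = {tup: i for i, tup in enumerate(SLIDE_ORDER)}


def frequency_vector(window, index=TUPLE_INDEX, k=2):
    """Count every overlapping k-tuple of the window into its assigned slot."""
    vec = [0] * len(index)
    for i in range(len(window) - k + 1):
        vec[index[window[i:i + k]]] += 1
    return vec


vec = frequency_vector("AACCGG")
print("".join(map(str, vec)))  # 1100011000000010
```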

Page 20: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Different transformations & Distance Functions

• Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T) → FV1
– 2 → 16 (frequencies of 2-tuples) → FV2

Page 21: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Different transformations & Distance Functions 2

• WVn transformation
– Split the string into halves x, y
– Compute the FVn of x and y → FVx, FVy
– Concatenate their sum and difference: [FVx + FVy, FVx − FVy]

• Wavelet 1 on an example
– TCACTTAG
– 1st: divide into halves & find the FV1 transformation
• x: TCAC → 1 2 0 1
• y: TTAG → 1 0 1 2
– 2nd: add and subtract
• 2 2 1 3 0 2 –1 –1 → WV1

• Same operations on 2-tuples → WV2
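
The WV1 example can be reproduced with a short sketch. Assumptions: the FV1 components are ordered A, C, G, T (which matches the numbers in the example), and an odd-length string is split at the floor of its midpoint, a case the slide does not cover. The names `fv1` and `wv1` are illustrative.

```python
from collections import Counter


def fv1(s, alphabet="ACGT"):
    """Frequency vector of single characters, in A, C, G, T order."""
    counts = Counter(s)
    return [counts[c] for c in alphabet]


def wv1(s):
    """Concatenate the sum and the difference of the two halves' FV1 vectors."""
    half = len(s) // 2  # assumption: odd lengths split at the floor
    x, y = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return sums + diffs


print(fv1("TCAC"), fv1("TTAG"))  # [1, 2, 0, 1] [1, 0, 1, 2]
print(wv1("TCACTTAG"))           # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first half of the output is just the FV1 of the whole string; the second half adds a coarse hint about where in the string each character occurs.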

Page 22: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Distance Functions on the Vector Spaces

• All of them are proven lower bounds on the edit distance

• FD1 distance on FV1

• FD2 distance on FV2

• WD1 distance on WV1

• WD2 distance on WV2

Page 23: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Frequency Distance FDn

Algorithm

FDn(n-gram frequency vectors u, v)
• posDist := negDist := 0
• for all dimensions ui, vi
– if ui > vi then posDist += ui − vi
– else negDist += vi − ui
• Return max(posDist, negDist)/n

Example (n = 1)

• u: ACTTAGC → 2,2,1,2   v: AATGATAG → 4,0,2,2
– 2−4 < 0 → negDist += |2−4|
– 2−0 > 0 → posDist += |2−0|
– 1−2 < 0 → negDist += |1−2|
– 2−2 = 0
• posDist: 2, negDist: 3
• FD1 is 3
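
The FDn pseudocode translates almost line by line into the sketch below. It accumulates the positive and negative differences over all dimensions, as the worked example does; the function name `frequency_distance` is an illustrative choice.

```python
def frequency_distance(u, v, n=1):
    """FDn between two n-gram frequency vectors u and v of equal length."""
    pos_dist = 0  # total amount by which u exceeds v
    neg_dist = 0  # total amount by which v exceeds u
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi
        else:
            neg_dist += vi - ui
    return max(pos_dist, neg_dist) / n


u = [2, 2, 1, 2]  # ACTTAGC:  counts of A, C, G, T
v = [4, 0, 2, 2]  # AATGATAG: counts of A, C, G, T
print(frequency_distance(u, v, n=1))  # 3.0
```

The result, 3, is below the edit distance of 4 for the same pair of strings, consistent with FD1 being a lower bound.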

Page 24: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

FDn: Why a lower bound?

• On the example
– Need to increase A by 2 and G by 1 → 3
– Need to decrease C by 2
• We may "increase + decrease" in a single operation if we can replace (back to slide #8)
• So in the best case the edit distance is only FD1
• But this may not be the case; you may need more operations because of mismatched locations…
• The division by n is because a change in one character can update the frequencies of up to n n-grams.

Page 25: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Wavelet Distance WDn

Algorithm

WDn(n-gram frequency wavelets u, v)
• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := (posDist − negDist)/2
• if m < d
– Return d/n
• else
– Return (d + (m − d)/2)/n

Example (n = 1)

• u: ACTC TAGC → 1201 1111 → 2 3 1 2 0 1 –1 0
• v: AATG ATAG → 2011 2011 → 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4, d: 0
• (0 + 4/2)/1
• Return 2
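
The WDn pseudocode can be transcribed as the sketch below. One assumption is labeled in the code: the difference posDist − negDist is taken in absolute value so that d is never negative, which the slide does not state explicitly. The function name `wavelet_distance` is illustrative, and the sketch follows the slide's steps without claiming to be the authors' implementation.

```python
def wavelet_distance(u, v, n=1):
    """WDn between two n-gram frequency wavelet vectors, following the slide."""
    pos_dist = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg_dist = sum(vi - ui for ui, vi in zip(u, v) if ui < vi)
    m = min(pos_dist, neg_dist)
    d = abs(pos_dist - neg_dist) / 2  # assumption: absolute value
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = [2, 3, 1, 2, 0, 1, -1, 0]  # WV1 of ACTC TAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # WV1 of AATG ATAG
print(wavelet_distance(u, v, n=1))  # 2.0
```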

Page 26: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

WDn: Why a lower bound?

• Assume a string transformed into wavelet

[a1,…a, b1,…b]

• Largest change: posDist += 3, negDist −= 1, or vice versa
– So use this change whenever posDist <> negDist

Page 27: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Overview

• Background on queries

• Current problem

• Transformation and Indexing

• Comparison experiments and results

• Future work

Page 28: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Experiment Design

• Implemented the transformations & distance functions
• Evaluated their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries, experimentally, on real genetic data
• Ran queries with different parameters
– Varying string size W and shift amount
– Some containing an exact match, some not
– For ε-range queries, different ε values
– For k-NN queries, different k values

Page 29: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

K-nearest efficiency

[Chart: average of edit distances of the k nearest (y-axis, 0 to 90) versus k for the k-nearest neighbor query (x-axis, 5 to 25). Series: EditDist, Freq, Freq2, MaxFreq, Wav, Wav2.]

Page 30: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Error Rates Compared

[Chart: percentage error (y-axis, 0% to 160%) versus k (x-axis, 5 to 25). Series: (Freq-Edit)/Edit, (Freq2-Edit)/Edit, (MaxFreq-Edit)/Edit, (Wav-Edit)/Edit, (Wav2-Edit)/Edit.]

Page 31: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Sorted Graphs

• To show why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
– It sorts the data values in increasing order and takes the k nearest ones
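
The approximation described here (sort by the cheap distance, keep the first k) can be sketched as below. Assumptions: the cheap distance is a character-count distance in the spirit of FD1, the database is a plain list rather than an index, and the names `fd1` and `approximate_knn` are hypothetical.

```python
from collections import Counter


def fd1(s, t, alphabet="ACGT"):
    """Character-count distance between two strings (n = 1)."""
    cs, ct = Counter(s), Counter(t)
    pos = sum(max(cs[c] - ct[c], 0) for c in alphabet)
    neg = sum(max(ct[c] - cs[c], 0) for c in alphabet)
    return max(pos, neg)


def approximate_knn(db, query, k):
    """Rank by the cheap distance only and return the first k strings."""
    return sorted(db, key=lambda s: fd1(query, s))[:k]


db = ["AATGATAG", "ACTTAGG", "GGGGGGGG", "ACTTAGC"]
print(approximate_knn(db, "ACTTAGC", k=2))  # ['ACTTAGC', 'ACTTAGG']
```

No edit distance is computed at all, so the answer is only as good as the agreement between the cheap ranking and the edit-distance ranking, which is what the sorted graphs are meant to show.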

Page 32: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Edit Distances and Matching FD1 Distances sorted by FD1

[Chart: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by FD1 (x-axis). Series: ED, FD1. Markers: 20 nearest, 50 nearest.]

Page 33: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Edit Distances and Matching WD2 sorted by WD2

[Chart: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by WD2 (x-axis). Series: ED, WD2. Markers: 20 nearest, 50 nearest.]

Page 34: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Nature of the distance functions

• WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function

Page 35: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

wav2

[Chart: distance value (y-axis, 0 to 140) per string (x-axis, 1 to 666). Series: EditDist, WaveletDist2.]

Page 36: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Freq

[Chart: distance, edit and freq (y-axis, 0 to 140) per string sorted by edit distance to the query (x-axis, 1 to 666). Series: EditDist, FreqDist.]

Page 37: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Results

• Tested the parameters obtained from these random experiments on real data.
• Then also did the parameter extraction using real data.

Page 38: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Comparison of index structures

Page 39: Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Future Work

• Check the applicability of these methods to other kinds of sequence data.
– Text
– Image search
• Implement the index structure in the standalone program and evaluate its performance