Accelerating Boyer Moore Searches on Binary Texts
Shmuel Tomi Klein, Miri Kopel Ben-Nissan
Bar Ilan University, ISRAEL
Outline
Background and motivation
Boyer Moore algorithm
New binary variant
Analysis
Experiments
Summary

Background and motivation
Important application of Automata:
PATTERN MATCHING
KMP, BDM, BM
Boyer & Moore
this-is-a-sample-text---
pattern
Match backwards!
Boyer – Moore Algorithm
Mismatch – case 1: delta1
b does not occur in x
[Figure: text y contains b followed by the matched part u; pattern x contains a followed by u, with a ≠ b; x contains no b, so x is shifted past b.]
Mismatch – case 2: delta1
b occurs in x
[Figure: text y contains b followed by u; pattern x contains a followed by u; the part of x to the right of its rightmost b contains no b; x is shifted to align that b with the b in y.]
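Cases 1 and 2 can both be served by one table lookup. The following is a minimal sketch in Python, assuming the text-pointer convention of the original algorithm: delta1 of a character is the distance from its rightmost occurrence to the end of the pattern, and m (the pattern length) for characters that do not occur at all. The names `delta1_table` and `delta1` are illustrative and do not come from the slides.

```python
def delta1_table(pat: str) -> dict[str, int]:
    """Distance from the rightmost occurrence of each character to the pattern end."""
    m = len(pat)
    # Later occurrences overwrite earlier ones, so the rightmost occurrence wins.
    return {c: m - 1 - i for i, c in enumerate(pat)}


def delta1(table: dict[str, int], m: int, c: str) -> int:
    """Case 2 if c occurs in the pattern, case 1 (jump by m) otherwise."""
    return table.get(c, m)


table = delta1_table("example")
print(table)                  # values for a, e, l, m, p, x
print(delta1(table, 7, "s"))  # 's' is not in the pattern: jump by m
```

For the pattern example this gives a: 4, e: 0, l: 1, m: 3, p: 2, x: 5, and 7 for every other character.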
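Mismatch – case 3: delta2
u reoccurs in x preceded by c ≠ a
[Figure: text y contains b followed by u; pattern x contains a followed by u and, further left, c followed by u; x is shifted to align the occurrence of u preceded by c with u in y.]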
Mismatch – case 4: delta2
Only a suffix v of u reoccurs in x
[Figure: text y contains b followed by u; pattern x contains a followed by u and starts with v; x is shifted to align its prefix v with the suffix v of u in y.]
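Cases 3 and 4 can be checked with a brute-force table builder. This is a sketch under stated assumptions, not the construction used by the authors: it tries every shift s in increasing order, accepts the first one that agrees with the matched suffix u and is not preceded by the same character a, and ignores pattern positions that fall off the left end (which covers case 4). The returned value is the increase of the text pointer, that is, the shift plus the number of characters already matched. The name `delta2_table` is illustrative.

```python
def delta2_table(pat: str) -> list[int]:
    """Text-pointer increase after a mismatch at each (0-based) pattern position j."""
    m = len(pat)
    table = []
    for j in range(m):
        for s in range(1, m + 1):
            # The shifted pattern must agree with the matched suffix pat[j+1:]
            # wherever the two still overlap.
            suffix_ok = all(pat[t - s] == pat[t] for t in range(max(j + 1, s), m))
            # Case 3: the reoccurrence must be preceded by a different character.
            # Case 4: if that position falls off the pattern, nothing to check.
            preceded_ok = j - s < 0 or pat[j - s] != pat[j]
            if suffix_ok and preceded_ok:
                break
        table.append(s + (m - 1 - j))
    return table


print(delta2_table("example"))
```

The quadratic search is only meant to make the definition concrete; linear-time constructions exist but are harder to read.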
Boyer – Moore Example
Pattern: example
delta1:
character   a   e   l   m   p   x   rest
delta1      4   0   1   3   2   5   7
delta2:
j           1   2   3   4   5   6   7
Pat[j]      e   x   a   m   p   l   e
delta2[j]  12  11  10   9   8   7   1
Search (the highlighted text character of each step is named on the right):
here is a simple example
example                    mismatch at s
       example             mismatch at p
         example           mismatch at i
               example     mismatch at p
                 example   match
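The example can be replayed in a few lines. The sketch below hard-codes the delta1 and delta2 values shown on the slide for the pattern example (so it only works for that pattern) and records where the pattern is aligned at each step. The function name `bm_search` and the returned list of alignments are illustrative additions.

```python
PAT = "example"
DELTA1 = {"a": 4, "e": 0, "l": 1, "m": 3, "p": 2, "x": 5}  # values from the slide
DELTA1_REST = 7                                             # any other character
DELTA2 = [12, 11, 10, 9, 8, 7, 1]                           # values from the slide


def bm_search(text: str) -> tuple[int, list[int]]:
    """Return (start of first match or -1, start of every alignment tried)."""
    m = len(PAT)
    i = m - 1                       # text pointer, starts under the last pattern char
    alignments = []
    while i < len(text):
        alignments.append(i - m + 1)
        j = m - 1
        while j >= 0 and text[i] == PAT[j]:   # match backwards
            i -= 1
            j -= 1
        if j < 0:
            return i + 1, alignments
        # Both tables give an increase of the text pointer; take the larger one.
        i += max(DELTA1.get(text[i], DELTA1_REST), DELTA2[j])
    return -1, alignments


print(bm_search("here is a simple example"))
```

On the text of the slide the pattern is aligned at positions 0, 7, 9, 15 and 17, and the match is reported at position 17.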
Problems of Binary Boyer & Moore
Binary text: delta1 useless
0100101101011101000100110101001
1101100
Character text: most work by delta1
this-is-a-sample-text---
pattern
Bit-level processing
Need for Binary Boyer & Moore
Compressed Matching
Given the encoded text E(T) and a pattern P, look for E(P) in E(T)
rather than for P in the decoded text D(E(T))
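A toy illustration of compressed matching, as a sketch only. The prefix code `CODE` is hypothetical and stands in for the encoder E; the slides do not give one. The sketch also makes one point the slide does not state: an occurrence of E(P) in E(T) that does not start on a codeword boundary is not an occurrence of P, so hits are filtered against the codeword start offsets here.

```python
# Hypothetical prefix code standing in for E (not taken from the slides).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(s: str) -> str:
    return "".join(CODE[ch] for ch in s)


def all_occurrences(bits: str, pat: str) -> list[int]:
    """All (possibly overlapping) start offsets of pat in bits."""
    hits, pos = [], bits.find(pat)
    while pos != -1:
        hits.append(pos)
        pos = bits.find(pat, pos + 1)
    return hits


def compressed_find(text: str, pattern: str) -> list[int]:
    """Character positions of pattern in text, found by searching E(P) in E(T)."""
    starts = {}                      # bit offset of each codeword -> character index
    offset = 0
    for idx, ch in enumerate(text):
        starts[offset] = idx
        offset += len(CODE[ch])
    hits = all_occurrences(encode(text), encode(pattern))
    return [starts[h] for h in hits if h in starts]   # keep codeword-aligned hits


print(encode("cabbad"), encode("ba"))
print(all_occurrences(encode("cabbad"), encode("ba")))
print(compressed_find("cabbad", "ba"))
```

Here E(ba) occurs twice in E(cabbad), but only one occurrence starts on a codeword boundary; that one corresponds to the real occurrence of ba.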
Suggested Solution:
BBBMM: Blocked Binary Boyer Moore Matching
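BBBMM
[Figure: text and pattern divided into blocks of k bits, with labels k, sh and sl; text block Text[i] compared with pattern block Pat[sh, j]; sample strings ffghabdgttiocbsbgghj and 0110001001101010.]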
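The idea of comparing whole blocks Text[i] against precomputed pattern blocks Pat[sh, j] can be sketched as follows. This is not the BBBMM algorithm: it uses no delta tables and simply tries every bit offset, and the use of a mask for partially filled blocks is an assumption made for this sketch. The names `pattern_blocks` and `find_bits` are illustrative.

```python
def pattern_blocks(pat: str, sh: int, k: int = 8) -> list[tuple[int, int]]:
    """Blocks of the pattern when it starts sh bits into a k-bit block.

    Each entry is (value, mask); the mask marks the bits that really
    belong to the pattern, the remaining bits are padding.
    """
    bits = "0" * sh + pat
    mask = "0" * sh + "1" * len(pat)
    pad = (-len(bits)) % k
    bits += "0" * pad
    mask += "0" * pad
    return [(int(bits[p:p + k], 2), int(mask[p:p + k], 2))
            for p in range(0, len(bits), k)]


def find_bits(text: str, pat: str, k: int = 8) -> int:
    """First bit offset of pat in text (both strings of '0'/'1'), or -1."""
    assert len(text) % k == 0, "sketch assumes the text is a whole number of blocks"
    text_blocks = [int(text[p:p + k], 2) for p in range(0, len(text), k)]
    pat_tab = [pattern_blocks(pat, sh, k) for sh in range(k)]   # plays the role of Pat[sh, j]
    for pos in range(len(text) - len(pat) + 1):
        i, sh = divmod(pos, k)          # block index and bit offset inside the block
        blocks = pat_tab[sh]
        # Compare block by block, from the last block backwards.
        if all((text_blocks[i + j] & mask) == value
               for j, (value, mask) in reversed(list(enumerate(blocks)))):
            return pos
    return -1


text = "0110001001101010"               # binary sample string from the slide
print(find_bits(text, "1001101"))
```

Each comparison handles k bits at once, which is where the blocked variant saves work compared with a bit-by-bit comparison.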
More information in binary case
ASCII
BINARY
BBBMM
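BBBMM
extended delta1
[Figure: text T with blocks i – 1, i, i + 1 above pattern P; bit strings 101, 101, 101, 100 and 01.]
[Formulas garbled in extraction: bounds on sl in terms of k, bounds on B in terms of a power of 2, and the table entry delta1[sl, B] in terms of m.]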
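BBBMM
Total size of delta1 tables: [formula garbled in extraction: a sum over sl involving powers of 2 and k]
If too large, use limit value K [relation between k and K garbled in extraction]
[Figure: T above P, with lengths sl, k and K marked.]
Size of delta1 tables reduced to [expression garbled in extraction: a power of 2 involving K]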
BBBMM
Original delta1: increase of text pointer
BBBMM delta1: shift size
Mismatch not in last block
Correct[sh,j]
[Figure: T above P.]
delta2
delta2[m - matchlen]
[Figure: T above P.]
j           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Pat[j]      1   0   1   0   1   0   1   0   1   1   1   0   1   1   0   1
delta2[j]  13  13  13  13  13  13  13  13  13  13  13   3   7  15   2   1
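The delta2 row of this table can be reproduced by brute force. The sketch assumes the following reading, which is ours rather than stated on the slide: after a mismatch at position j, the text is known to hold the matched suffix Pat[j+1..m] preceded by the complement of Pat[j] (in a binary alphabet a mismatching bit is fully determined), and delta2[j] is the smallest shift of the pattern that agrees with all of these known bits. The name `binary_delta2` is illustrative.

```python
def binary_delta2(pat: str) -> list[int]:
    """Shift size after a mismatch at position j (1-based), for a '0'/'1' pattern."""
    m = len(pat)
    table = []
    for j in range(1, m + 1):
        flipped = "1" if pat[j - 1] == "0" else "0"
        known = flipped + pat[j:]          # text bits under pattern positions j..m
        for s in range(1, m + 1):
            # After shifting by s, pattern position t - s lies under text position t.
            # Only positions still covered by the pattern (t - s >= 1) are checked.
            if all(pat[t - s - 1] == known[t - j]
                   for t in range(max(j, s + 1), m + 1)):
                break
        table.append(s)
    return table


print(binary_delta2("1010101011101101"))
```

Under this reading the values are shift sizes, not increases of the text pointer, which agrees with the convention the slides state for the BBBMM tables.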
Analysis
Assumption: random input
Reasonable for compressed text
Expected # comparisons till mismatch:
Bit-wise: $\sum_{j=1}^{m} \frac{j}{2^{j}} < 2$
Blocked: [formula garbled in extraction: a sum over t with an upper limit in terms of m/k, involving k and sl]
Analysis
Expected # bits shifted after mismatch:
Bit-wise: M
Blocked: M'
$E(M) = \sum_{j=1}^{m} 2^{-j} \min(2^{j}, m) \approx \log m$
[Relation between M and M' garbled in extraction.]
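The two bit-wise expressions are easy to evaluate numerically. The sketch below assumes the readings sum of j / 2^j for the expected number of comparisons and sum of 2^(-j) min(2^j, m) for the expected shift; both are reconstructions from a damaged slide, so the code checks the arithmetic of those readings and not the authors' derivation. The function names are illustrative.

```python
import math


def expected_comparisons_bitwise(m: int) -> float:
    """sum_{j=1..m} j / 2^j, the assumed expected number of bit comparisons."""
    return sum(j / 2**j for j in range(1, m + 1))


def expected_shift_bitwise(m: int) -> float:
    """sum_{j=1..m} 2^(-j) * min(2^j, m), the assumed expected shift in bits."""
    return sum(min(2**j, m) / 2**j for j in range(1, m + 1))


for m in (10, 100, 500):
    print(m,
          round(expected_comparisons_bitwise(m), 4),
          round(expected_shift_bitwise(m), 4),
          round(math.log2(m), 4))
```

With these readings the number of comparisons stays below 2 for every m, and the expected shift exceeds log2 m by less than 2 bits, so it grows logarithmically in the pattern length.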
Experiments
Texts: English Bible (2.5MB), World Factbook (1.5MB)
Text: Huffman encoded
Patterns: Random substrings of lengths 10 to 500
k = 8
Experiments: Average # comparisons between shifts
[Figure: average # comparisons between shifts versus length of pattern; curves: Bit-wise, Blocked.]
Experiments: Average size of shifts
[Figure: average size of shifts versus length of pattern; curves: Bit-wise, Blocked.]
Experiments: Average # comparisons for 1000 bits
[Figure: average # comparisons for 1000 bits versus length of pattern; curves: Bit-wise, Blocked, BDM.]
Experiments: Time to locate first occurrence (ms)
[Figure: time to locate first occurrence (ms) versus length of pattern; curves: Bit-wise, Blocked, BDM, Turbo-BDM.]
Summary
Blocked variant of BM
Faster than alternatives, overhead 1-10 K
Extensions: ASCII, words instead of characters
Thank you!