Boyer-Moore string search algorithm

Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997)

Original: Robert S. Boyer, J Strother Moore (1977)

Presented by: Vladimir Zoubritsky

Agenda

• Problem Statement

• Bad character rule

• Boyer-Moore-Horspool algorithm

• Good Suffix Rule

• Preprocessing

• Analysis

Problem Statement

Given pattern P(1..n) and text T(1..m) defined over alphabet Σ, find one or all occurrences of P in T.

The Boyer-Moore algorithm (1977) provides an efficient solution. It runs in linear time in the worst case and in sub-linear time in most practical cases.

Right to left matching idea

• Other known algorithms, e.g. Brute Force, match the pattern from left to right.

• Algorithm: Align P with index k of T. Start matching from k+n-1, and if all letters match, report occurrence.

• By itself matching from right to left is similar to Brute Force in the running time.

• Based on the suffix we can decide to skip over ranges of characters.
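The right-to-left comparison on its own can be sketched as follows. This is a minimal illustration with no skipping yet, so it behaves like Brute Force; the function name is hypothetical and positions are reported 1-based to match the P(1..n), T(1..m) notation.

```python
def naive_right_to_left(P, T):
    """Return the 1-based start positions where P occurs in T."""
    n, m = len(P), len(T)
    occurrences = []
    for k in range(m - n + 1):          # 0-based alignment of P's left end
        i = n - 1                       # start comparing at the right end of P
        while i >= 0 and P[i] == T[k + i]:
            i -= 1
        if i < 0:                       # every character of P matched
            occurrences.append(k + 1)   # report 1-based, as in the slides
    return occurrences
```

Every alignment is still tried, shifting by one each time; the rules that follow exist to replace that fixed shift with a larger one.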

Algorithm Skeleton

1) Align P with the beginning of T and match from right to left.

2) If the whole of P was matched, report an occurrence.

3) Otherwise shift P by the larger of the bad character shift and the good suffix shift.

Conditional correctness: If the two shifts never go beyond an occurrence of P in T, the algorithm will report all occurrences.

Bad Character rule

Definition: For each character x, let R(x) be the position of the right-most occurrence of character x in P. R(x) is defined to be zero if x does not occur in P.

Bad character shift

Definition: Suppose that for a particular alignment of P against T, the rightmost n-i characters of P match their counterparts in T, but the character P(i) mismatches its counterpart at position k of T. If the right-most position of the character T(k) in P is j, with j < i, then shift P so that character j of P is below character k of T; otherwise shift by 1.

The shift would be max[1, i-R(T(k))].
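The table R and the shift formula can be sketched in Python as below. This is an illustration under stated assumptions: the function names are made up here, and a dict stands in for a table indexed by the whole alphabet, so characters that are absent from P are simply missing and read as zero.

```python
def rightmost_positions(P):
    """R[x] = 1-based position of the right-most occurrence of x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos                  # later occurrences overwrite earlier ones
    return R


def bad_character_shift(R, i, text_char):
    """Shift when P(i) mismatches text_char, i.e. max[1, i - R(T(k))]."""
    return max(1, i - R.get(text_char, 0))   # absent characters count as R = 0
```

For P = "abcab" the table is a: 4, b: 5, c: 3. A mismatch at i = 5 against 'c' gives a shift of 2, a mismatch at i = 3 against 'b' falls back to 1, and a mismatch at i = 5 against a character not in P gives the full shift of 5.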

Bad character shift

Simple case: The character T(k) aligned with P(n) does not appear in P, so P is shifted by n (to start after k).

Bad character shift

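General case: Shift by i – R(x), where x = T(k) is the mismatched text character. Correctness is easy to prove: any smaller shift would place under T(k) a character of P to the right of position R(x), and none of those equals x.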

Boyer-Moore-Horspool algorithm

• Described by Horspool in 1980.

• Basic idea: use Boyer Moore algorithm, but only use the bad character shift rule.

• Worst case running time in degenerate cases may be O(nm).

• Best case is sub-linear: O(m/n).

Boyer-Moore-Horspool worst case

• A pair of pattern and text could be constructed to have a shift of 1 each time (same as Brute Force).

Boyer-Moore-Horspool best case

• In the case when the text characters aligned with the last character of the pattern do not appear in the pattern, each shift is n steps.
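The worst and best cases above can be checked with a small sketch that searches using only the bad character rule and counts character comparisons. It follows the description on these slides rather than any particular published formulation; the function name is hypothetical, the pattern is assumed non-empty, and the shift of one after a full match is an assumption made here because the bad character rule does not cover that case.

```python
def bad_character_only_search(P, T):
    """Return (occurrences, comparisons); occurrences are 1-based starts."""
    n, m = len(P), len(T)
    R = {ch: pos for pos, ch in enumerate(P, start=1)}   # right-most positions
    occurrences, comparisons = [], 0
    k = n                          # 1-based text index aligned with P(n)
    while k <= m:
        i, h = n, k                # i walks P, h walks T, both right to left
        while i > 0:
            comparisons += 1
            if P[i - 1] != T[h - 1]:
                break
            i -= 1
            h -= 1
        if i == 0:
            occurrences.append(k - n + 1)
            k += 1                 # assumption: shift by one after a full match
        else:
            k += max(1, i - R.get(T[h - 1], 0))
    return occurrences, comparisons


# Worst case: every alignment matches n-1 characters, then shifts by 1.
print(bad_character_only_search("baa", "a" * 10))   # ([], 24)
# Best case: one comparison per alignment, then a shift of n.
print(bad_character_only_search("abc", "x" * 12))   # ([], 4)
```

With n = 3 and m = 10 the first call performs n(m-n+1) = 24 comparisons, while the second performs m/n = 4 for m = 12.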

Boyer-Moore-Horspool Time

Preprocessing: Scanning the pattern is done in O(n) time, using O(|Σ|) space.

Worst case: O(nm). Best case: O(m/n). Average time: an average number of comparisons for the general case of Boyer-Moore-Horspool was established [Baeza-Yates 1990].

The bad character rule alone is not strong enough to guarantee linear time (see worst case above).

Good Suffix Rule

Definition: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left. Then find, if it exists, the rightmost copy t' of t in P, such that t' is not a suffix of P, and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right, so that substring t' in P is below substring t in T.

Good suffix rule (cont'd)

If t' does not exist, then shift the left end of P past the left end of t in T by the least amount, so that a prefix of the shifted P matches a suffix of t in T. If no such shift is possible then shift P by n places to the right.
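To make the two parts of the rule concrete, here is a deliberately slow sketch that computes the good suffix shift straight from the definition, without any preprocessing. The function name is hypothetical. It takes i as the 1-based start of the matched suffix t = P[i..n], so the mismatch is at P(i-1). A copy of t at the very start of P has no character to its left; this sketch leaves that case to the second part of the rule, which yields the same shift.

```python
def good_suffix_shift_naive(P, i):
    """Shift when P[i..n] matched and P(i-1) mismatched (1-based, 2 <= i <= n)."""
    n = len(P)
    t = P[i - 1:]                        # the matched suffix t
    bad = P[i - 2]                       # P(i-1), the character that mismatched
    # Part 1: right-most copy t' of t ending at j < n whose left neighbour
    # differs from P(i-1).
    for j in range(n - 1, len(t) - 1, -1):    # j = 1-based right end of candidate
        start = j - len(t)                    # 0-based start of the candidate
        if P[start:j] == t and start > 0 and P[start - 1] != bad:
            return n - j
    # Part 2: longest prefix of P that equals a suffix of t.
    for length in range(len(t), 0, -1):
        if P[:length] == t[len(t) - length:]:
            return n - length
    return n                             # neither part applies: shift by n
```

For P = "cabdabdab" and i = 8, t is "ab" and the mismatched character is 'd'. The copy ending at position 6 is preceded by 'd' and is rejected; the copy ending at position 3 is preceded by 'c', so the shift is 9 - 3 = 6.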

Correctness of the good-suffix shift

• Recall: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left.

• If there is only one occurrence of t in P, then any alignment with the left end of P aligned before the left end of t will not yield a match.

• If we align t with a previous copy t' of t in P, and the character before t' is equal to the character before t in P, this alignment will fail the same way.

Preprocessing of P

• Originally published preprocessing algorithm was complex and erroneous. An updated version was complex still.

• We will use a simpler version based on the Z algorithm.

• We want the preprocessing to compute values for functions L'(i) and l'(i), defined later.

Preprocessing of P (cont'd)

• An intermediate value we will require is N_j(P).

• N_j(P) is defined as the length of the longest suffix of P[1..j] which is also a suffix of P.

• Recall that Z_i(S) is the length of the longest substring of S that starts at i and is also a prefix of S.

• We can compute values for N_j(P) by running the Z-algorithm on the reverse of P: N_j(P) = Z_{n-j+1}(P^r), where P^r is the reverse of P.
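The N values can be sketched as a standard linear-time Z-algorithm followed by a reversal. The function names are illustrative, Z is 0-based as is usual in Python, and the returned N list carries an unused entry at index 0 so that N[j] lines up with the 1-based N_j(P).

```python
def z_array(S):
    """Z[i] = length of the longest substring of S starting at i that is a prefix of S."""
    n = len(S)
    Z = [0] * n
    if n == 0:
        return Z
    Z[0] = n
    left = right = 0                  # current Z-box is S[left:right]
    for i in range(1, n):
        if i < right:                 # reuse what the Z-box already tells us
            Z[i] = min(right - i, Z[i - left])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1                 # extend by explicit comparison
        if i + Z[i] > right:
            left, right = i, i + Z[i]
    return Z


def n_values(P):
    """N[j] for j = 1..n (index 0 unused): longest suffix of P[1..j] that is a suffix of P."""
    Z = z_array(P[::-1])
    return [0] + Z[::-1]              # N[j] = Z[n - j] on the reversed pattern


print(n_values("cabdabdab"))          # [0, 0, 0, 2, 0, 0, 5, 0, 0, 9]
```

In the example, N_3 = 2 because "cab" and P share the suffix "ab", and N_6 = 5 because "cabdab" and P share the suffix "abdab".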

Preprocessing of P: calculating L'(i)

• L'(i) gives the right-end position of the right-most copy of P[i..n] which is not a suffix of P and is preceded by a character different from P(i-1).

• L'(i) is zero if no such position exists.

• Using N_j(P), we can define L'(i) as the largest j < n so that N_j(P) = |P[i..n]| = n-i+1.

• L'(i) can be accumulated in linear time from the values of N_j(P).

Preprocessing of P: calculating l'(i)

• l'(i) is the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists, and zero otherwise.

• We can also define l'(i) in terms of N_j(P): l'(i) is the largest j ≤ |t| = n-i+1 so that N_j(P) = j, where t = P[i..n].

• In a similar way, l'(i) can be accumulated in linear time from the N_j(P) values.
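Both accumulations can be sketched as single passes over the N values. The function names are made up for illustration; N is assumed to be a list with an unused entry at index 0, so that N[j] is N_j(P), and the results use the same 1-based layout.

```python
def big_l_prime(N):
    """L'[i] = largest j < n with N[j] == n - i + 1, or 0 if there is none."""
    n = len(N) - 1
    L = [0] * (n + 1)
    for j in range(1, n):             # increasing j, so the largest j wins
        if N[j] > 0:
            L[n - N[j] + 1] = j
    return L


def small_l_prime(N):
    """l'[i] = largest j <= n - i + 1 with N[j] == j, or 0 if there is none."""
    n = len(N) - 1
    l = [0] * (n + 1)
    longest = 0
    for i in range(n, 0, -1):         # the suffix P[i..n] has length n - i + 1
        j = n - i + 1
        if N[j] == j:                 # P[1..j] is also a suffix of P
            longest = j
        l[i] = longest
    return l


# N values for P = "cabdabdab", worked out by hand (index 0 unused).
N = [0, 0, 0, 2, 0, 0, 5, 0, 0, 9]
print(big_l_prime(N))     # L'(5) = 6 and L'(8) = 3, all other entries 0
print(small_l_prime(N))   # only l'(1) = 9 is non-zero
```

Each function visits every index once, which is where the linear time bound comes from.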

Using the preprocessing results

• First part of the good suffix rule says we should find a copy of t which is preceded by a different character, i.e. using a non-zero value of L'(i). The shift is then n - L'(i).

• The second part looks at the least amount for a prefix of P to match a suffix of t, i.e. using the value of l'(i). The shift is then n - l'(i), which is n when l'(i) is zero.
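Putting the pieces together, a complete search might look like the sketch below. It is an illustration, not the presenter's code: the function name is hypothetical, and it bundles a compact copy of the preprocessing so that it runs on its own. Two details are not stated on these slides and are taken from the usual textbook formulation: after a full match P is shifted by n - l'(2), and when the right-most character of P mismatches the good suffix shift is 1.

```python
def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of non-empty P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []

    # Bad character table: R[x] = right-most 1-based position of x in P.
    R = {ch: pos for pos, ch in enumerate(P, start=1)}

    # N[j] via the Z-algorithm on the reverse of P (N[0] unused).
    S = P[::-1]
    Z = [0] * n
    Z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            Z[i] = min(right - i, Z[i - left])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > right:
            left, right = i, i + Z[i]
    N = [0] + Z[::-1]

    # L'[i] and l'[i], 1-based; index n + 1 is a zero sentinel.
    L = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
    l = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            longest = n - i + 1
        l[i] = longest

    occurrences = []
    k = n                                  # text position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:                         # full match
            occurrences.append(k - n + 1)
            k += n - l[2]                  # assumption: textbook shift after a match
        else:                              # mismatch at P(i); matched suffix is P[i+1..n]
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:
                good = 1                   # nothing matched, so no suffix to reuse
            elif L[i + 1] > 0:
                good = n - L[i + 1]
            else:
                good = n - l[i + 1]
            k += max(bad, good)
    return occurrences
```

For example, boyer_moore("abcab", "xabxabcab") mismatches at P(3) after matching "ab"; both rules give a shift of 3, and the next alignment is the occurrence at position 5.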

Boyer-Moore Time

• Using the linear time implementation of the Z algorithm, the preprocessing takes O(n) time and O(n) space.

• The original Boyer-Moore algorithm had cases when P appears in T which resulted in O(nm) time, before a few simple modifications [Galil 1979].

• A tight bound of 3m comparisons was established for Boyer-Moore running time [Cole 1991].

• An average case analysis is proposed, but remains difficult to simplify into a simple expression as in BMH [Tsai 2005].

• For other, “Boyer-Moore-like” algorithms the following time bounds were established:

Bound   Reference

14m     [Galil 1979]

2m      [Apostolico et al. 1986]

3m/2    [Colussi et al. 1990]

4m/3    [Colussi et al. 1990]

Experimental Analysis

On average, for sufficiently large alphabets (8 characters), Boyer-Moore-Horspool has a fast running time and a sub-linear number of character comparisons.

On average and in the worst case, Boyer-Moore is faster than “Boyer-Moore-like” algorithms.

Data from Michailidis and Margaritis [2001]

Questions?
