
Vol. 42 No. 3 IPSJ Journal Mar. 2001

IPSJ 40th Anniversary Award Paper

Speeding Up String Pattern Matching by Text Compression:

The Dawn of a New Era

Masayuki Takeda,† Yusuke Shibata,†,☆ Tetsuya Matsumoto,†

Takuya Kida,† Ayumi Shinohara,† Shuichi Fukamachi,††

Takeshi Shinohara†† and Setsuo Arikawa†

This paper describes our recent studies on string pattern matching in compressed texts, mainly from practical viewpoints. The aim is to speed up the string pattern matching task, in comparison with an ordinary search over the original texts. We have successfully developed (1) an AC type algorithm for searching in Huffman encoded files, and (2) a KMP type algorithm and (3) a BM type algorithm for searching in files compressed by the so-called byte pair encoding (BPE). Each of the algorithms reduces the search time at nearly the same rate as the compression ratio. Surprisingly, the BM type algorithm runs over BPE compressed files about 1.2–3.0 times faster than the exact match routines of the software package agrep, which is known as the fastest pattern matching tool.

1. Introduction

String pattern matching is one of the most fundamental operations in string processing. The problem is to find all occurrences of a given pattern in a given text. It becomes more important to find a pattern in text files efficiently as files become large. Recently, the compressed pattern matching problem, in which the aim is to find pattern occurrences in compressed text without decompression, has attracted special attention. It has been extensively studied for various compression methods by many researchers in the last decade, from both theoretical and practical viewpoints. See Table 1.

One important goal of studies on this problem is to perform a faster search in compressed files in comparison with a decompression followed by an ordinary search (Goal 1). Although the prices of storage devices are coming down year by year, the data to be stored increase even more rapidly, and we still tend to store them in a compressed form. Typical examples of such situations are mobile devices such as notebook computers and personal digital assistants (PDAs), where a user is often eager to pack in any available information up to the capacity limit. Goal 1 is indeed attractive for this reason.

A more ambitious goal is to perform a faster

† Department of Informatics, Kyushu University
☆ Presently with NTT Comware
†† Department of Artificial Intelligence, Kyushu Institute of Technology

search in compressed files in comparison with an ordinary search in the original files (Goal 2). In this case, the aim of compression is not only to reduce the disk storage requirement but also to speed up the string searching task. Let t_d, t_s, and t_c be the CPU times for a decompression, for searching in uncompressed files, and for searching in compressed files, respectively. Goal 1 aims for t_d + t_s > t_c, while Goal 2 aims for t_s > t_c. Thus, Goal 2 is more difficult to achieve than Goal 1.

The searching time is the sum of the file I/O time and the CPU time for pattern matching. In this paper, we focus on the reduction of the CPU time. Of course, text compression reduces the file I/O time at the same rate as the compression ratio, but it may increase the CPU time. When the data transfer is slow, e.g., in network environments, the CPU time is negligible compared with the file I/O time. In this case, the elapsed time for searching in compressed files would be shorter than that for searching in the original files☆☆. On the other hand, this is not necessarily true when the data transfer is relatively fast, in such situations as a workstation with local disk storage or a notebook personal computer. If we achieve Goal 2, however, we can perform a faster search in compressed files in elapsed time than an ordinary search in the original files even in such a situation.

☆☆ Even a naive method of decompression followed by a search can be faster than an ordinary search in the original files.


Table 1 Compressed pattern matching.

Compression method        Compressed pattern matching algorithms
Run-length                Eilam-Tzoreff and Vishkin10)
Run-length (two dim.)     Amir, et al.6); Amir and Benson2),3); Amir, et al.5)
LZ77                      Farach and Thorup11); Gasieniec, et al.14)
LZ78/LZW                  Amir, et al.4); Kida, et al.19); Navarro and Raffinot28); Navarro and Tarhio29); Karkkainen, et al.16)
Straight-line programs    Karpinski, et al.17); Miyazaki, et al.25); Hirao, et al.15)
Huffman                   Fukamachi, et al.12); Miyazaki, et al.24)
Finite state encoding     Takeda35)
Word based encoding       Moura, et al.27)
Pattern substitution      Manber22); Shibata, et al.32)
Collage systems           Kida, et al.18); Shibata, et al.33); Matsumoto, et al.23)

Let n and N denote the compressed text length and the original text length, respectively. Theoretically, the best compression achieves n = √N for the Lempel-Ziv-Welch (LZW) encoding37), and n = log N for LZ77 39). Thus an O(n) time algorithm for searching directly in compressed text is considered to be better than a simple O(N) time algorithm for searching in the original text☆. However, in practice n is linearly proportional to N for real text files. For this reason, an elaborate O(n) time algorithm for searching in compressed text is often slower than a simple O(N) time algorithm running on the original text. For example, as shown in Refs. 19), 28), searching in LZW compressed files is slower than searching in the original files, although it is faster than a decompression followed by an ordinary search.

☆ The O(n) time algorithm requires an extra O(r) time in order to report all pattern occurrences, where r is the number of them, and r could be linear in N. But we here ignore the O(r) factor.

In order to achieve the above two goals, especially Goal 2, we have to choose an appropriate compression method, paying attention to the constant factors hidden behind the O-notation. Thus, we shall re-estimate the performances of the existing compression methods in the light of a new criterion: the efficiency of compressed pattern matching. A compression method which has received little attention under the traditional criteria (i.e., the compression ratio and the compression/decompression time) may well be estimated good under the new criterion.

In this paper we describe our recent studies on this research theme. Text compression methods fall into two classes: character-wise compression and dictionary-based compression. For the former class, we focused on the Huffman encoding and the higher-order Huffman encoding, and developed run-time efficient algorithms for searching files compressed by these methods. The one dealing with Huffman encoded files runs faster than the Aho-Corasick (AC)1) algorithm over the original files at the same rate as the compression ratio.

For the latter class, we introduced the collage system, a unifying framework which abstracts various dictionary-based compression methods. We showed two general pattern matching algorithms for text strings in terms of collage systems, which simulate the Knuth-Morris-Pratt (KMP)20) and the Boyer-Moore (BM)8) algorithms, the most important algorithms for searching in uncompressed files. Within the framework of collage systems, we re-estimated the existing dictionary-based methods from the viewpoint of compressed pattern matching, and concluded that the byte-pair encoding (BPE)13) is most suitable for our purpose. We specialized the algorithms for dealing with BPE compressed files. Experimental results show that each of the obtained algorithms runs faster than the one being simulated. The searching time is reduced at nearly the same rate as the compression ratio. Surprisingly, the one simulating the move of the BM algorithm is about 1.2–3.0 times faster than the exact match routines of the software package agrep38), which is known as the fastest pattern matching tool. Text compression thus accelerates string matching.

It should be mentioned that Manber22) proposed a simple compression scheme that accelerates string matching. The approach is rather straightforward: encode a given pattern and apply any search routine in order to find the encoded pattern within compressed files. The reductions of space and search time are not very good compared with BPE.
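The pattern-substitution idea can be sketched as follows. The pair table, the sample text, and the greedy encoder are all hypothetical illustrations of ours, not Manber's actual scheme; a real scheme must choose the pairs so that every occurrence of the pattern is encoded identically.

```python
# Sketch of searching via pattern substitution: frequent character pairs are
# replaced by single unused byte values, the pattern is encoded the same way,
# and any ordinary search routine then runs on the compressed text.
# Hypothetical pair table.  Note that no pair here has a second character
# that could start another pair, and no pair ends in 't', so the encoding of
# an occurrence of "thread" never depends on the characters around it.
PAIRS = {"th": "\x80", "he": "\x81", "in": "\x82"}

def substitute(s):
    """Greedy left-to-right pair substitution."""
    out, i = [], 0
    while i < len(s):
        if s[i:i+2] in PAIRS:
            out.append(PAIRS[s[i:i+2]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

compressed = substitute("the thin thread")
encoded_pattern = substitute("thread")
hits = compressed.count(encoded_pattern)  # plain substring search still works
```

The compressed text is shorter than the original, yet the search routine itself is completely ordinary; this is the sense in which the approach is "straightforward".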


It should also be emphasized that Moura, et al.27) proposed a compression scheme that uses a word-based Huffman encoding with a byte-oriented code. They presented an algorithm which runs twice as fast as agrep. However, the compression method is not applicable to such texts as DNA sequences, which cannot be segmented into words. For the same reason, it cannot be used for natural language texts written in Japanese, in which there are no blank symbols between words.

2. Preliminaries

Let Σ be a finite set of character symbols, called an alphabet. Denote by Σ∗ the set of strings over Σ. Strings x, y, and z are said to be a prefix, factor, and suffix of the string u = xyz, respectively. The length of a string u is denoted by |u|. The empty string is denoted by ε, that is, |ε| = 0. The ith symbol of a string u is denoted by u[i] for 1 ≤ i ≤ |u|, and the factor of a string u that begins at position i and ends at position j is denoted by u[i : j] for 1 ≤ i ≤ j ≤ |u|. For convenience, let u[i : j] = ε for i > j. Let u be a string in Σ∗, and let i be a non-negative integer. Denote by [i]u (resp. u[i]) the string obtained by removing the length i prefix (resp. suffix) from u.
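As a concrete illustration of ours (not the paper's), the notation above transcribes directly into code, emulating the paper's 1-based indexing:

```python
# Illustration of the notation of Section 2, with 1-based indexing as in the
# paper.  The helper names are ours.
def factor(u, i, j):
    """u[i : j] -- the factor of u from position i to position j (1-based)."""
    return u[i-1:j] if i <= j else ""   # u[i : j] = empty string for i > j

def prefix_removed(u, i):
    """[i]u -- u with its length-i prefix removed."""
    return u[i:]

def suffix_removed(u, i):
    """u[i] -- u with its length-i suffix removed."""
    return u[:len(u) - i]

u = "pattern"
assert factor(u, 1, 3) == "pat"        # a prefix of u
assert factor(u, 5, 7) == "ern"        # a suffix of u
assert factor(u, 3, 2) == ""           # i > j gives the empty string
assert prefix_removed(u, 3) == "tern"  # [3]u
assert suffix_removed(u, 3) == "patt"  # u[3]
```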

3. For Character-wise Compression Methods

Text compression methods fall into two categories: character-wise compression and dictionary-based compression. In this section, we discuss the problem of compressed pattern matching in which text files are compressed using character-wise compression methods, such as the Huffman encoding.

3.1 Pattern Matching in Huffman Encoded Text

Pattern matching in Huffman encoded text seems not so difficult. A naive solution would be to encode a given pattern and to apply any favorite string search routine to find the encoded pattern within the compressed text. Searching with this approach, however, is very slow for the following reasons: (1) extra work for 'synchronization' is needed, i.e., the starting bit of each encoded symbol must be determined; (2) bit-wise processing is rather slow. These problems can be solved as stated in the sequel.

3.1.1 Synchronization

As shown in Ref. 12), the first problem can

Fig. 1 Huffman tree and DFA accepting the set of codewords: (a) Huffman tree over Σ = {A, B, C, D, E}; (b) DFA. (Diagrams omitted.)

be overcome by incorporating a deterministic finite-state automaton (DFA) that accepts the set of codewords into the pattern-matching machines (PMMs, in short) of the AC algorithm. Assume that Σ = {A, B, C, D, E} and text strings over Σ are encoded according to the Huffman tree of Fig. 1 (a). Figure 1 (b) is a DFA accepting the set of codewords, which is obtained directly from the Huffman tree. The smallest DFA accepting the same language can be built in only linear time with respect to the size of the Huffman tree, by using the minimization technique31).

Suppose that we are given two patterns EC and CD. The PMM for finding the encoded patterns 1001 and 00101 within a text compressed by the Huffman code is shown in Fig. 2 (a). For comparison, we also show the PMM without the DFA for synchronization in Fig. 2 (b), which may lead to false detections of patterns.

By using the PMM with the DFA for synchronization, we can process a Huffman encoded text bit-by-bit without any extra work for determining the starting bit of each encoded symbol. The size of the PMM, namely, the number of states in it, is approximately m · R + ℓ, where m is the total length of the patterns, R is the average codeword length, and ℓ is the size of the DFA for synchronization.

3.1.2 Avoiding Bit-wise Processing

The second problem can be avoided by converting the PMM into a new one so that it runs in a byte-by-byte manner, as shown in Ref. 24). Although the machine size is larger than that of the usual PMM, the two-dimensional array implementation can be adopted. The new PMM runs on a Huffman encoded file faster than the usual PMM running on the original file, as will be shown in Section 6. The searching time is


Fig. 2 PMMs for searching in Huffman encoded text: (a) PMM with DFA for synchronization; (b) PMM without DFA for synchronization. The circles denote the states. The solid and the broken arrows represent the goto and the failure functions, respectively. The strings adjacent to the states are the outputs from them. The thick line circles correspond to the starting bits of encoded symbols. (Diagrams omitted.)
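To make the synchronization idea concrete, here is a small sketch of ours. The codeword table is a hypothetical prefix code, not the code of Fig. 1, and a single-pattern KMP automaton stands in for the AC-style PMM; the point it shows is the same: the matcher is driven bit-by-bit through the set of codewords, so a match can never be triggered from the middle of a codeword.

```python
# Bit-by-bit pattern matching in Huffman-encoded text with synchronization.
# CODE is a hypothetical prefix-free table; the `buf` lookup plays the role
# of the DFA accepting the set of codewords, and a KMP automaton over the
# decoded symbols plays the role of the PMM.
CODE = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}

def encode(text):
    return "".join(CODE[c] for c in text)

def kmp_table(p):
    f, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = f[k-1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def search_bits(bits, pattern):
    """Yield 1-based end positions (in symbols) of pattern occurrences."""
    decode = {v: k for k, v in CODE.items()}
    f = kmp_table(pattern)
    state, buf, pos = 0, "", 0
    for b in bits:
        buf += b
        if buf in decode:            # codeword boundary reached: synchronized
            c, buf = decode[buf], ""
            pos += 1
            while state and pattern[state] != c:
                state = f[state-1]
            if pattern[state] == c:
                state += 1
            if state == len(pattern):
                yield pos
                state = f[state-1]

occ = list(search_bits(encode("ABECDEC"), "EC"))
```

Because every codeword boundary is recognized explicitly, the bit patterns of the encoded pattern occurring across codeword boundaries cannot cause false detections.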

reduced at nearly the same rate as the compression ratio.

3.2 Searching in Files Compressed by Finite-state Encoder

Since the compression ratio of the Huffman encoding is not very good, we shall use the finite-state model, in which the probability distribution is conditioned by the current state, often referred to as the context. The compression based on this model can be formalized as a finite-state encoder (FSE, in short). Figure 3 shows an example of an FSE with two states, where Σ = {A, B, C, D} and the code alphabet is ∆ = {0, 1}. The code-trees for the two states 1 and 2 are shown in Fig. 4.

3.2.1 Compression Using FSE

An FSE is formally a 6-tuple 〈Q, q0, Σ, ∆, δ, λ〉, where Q is a finite set of states; q0 in Q is the initial state; Σ is a source alphabet; ∆ is a code alphabet; δ : Q × Σ → Q is a state-transition function; and λ : Q × Σ → ∆∗ is a coding function which satisfies the condition that, for any q ∈ Q and any a, b ∈ Σ, if λ(q, a) is a prefix of λ(q, b), then a = b. Define for a state q ∈ Q the function ϕq : Σ → ∆∗ by ϕq(a) = λ(q, a) (a ∈ Σ). Then, the above condition implies that ϕq is a one-to-one mapping and the set of

Fig. 3 Finite-state encoder with two states. (Diagram omitted.)

Fig. 4 Code-trees for state 1 and for state 2. (Diagrams omitted.)

codewords Codeword_q = {ϕq(a) | a ∈ Σ} has the prefix property for any state q.

Extend δ into the function from Q × Σ∗ to Q by

  δ(q, ε) = q,
  δ(q, xa) = δ(δ(q, x), a),


Fig. 5 PMM for searching in FSE compressed files. (Diagram omitted.)
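Since the tables of Fig. 3 are not reproducible from this copy, the following sketch of ours uses a hypothetical two-state encoder with the same structure: delta gives the next state, lam the emitted codeword, and the encoding of a text is the concatenation of codewords along the state transitions, exactly as the extension of λ below prescribes.

```python
# A hypothetical two-state finite-state encoder (not the exact tables of
# Fig. 3).  delta[q][a] is the next state and lam[q][a] the codeword emitted
# when symbol a is read in state q.  Within each state the codeword set is
# prefix-free, as the definition of an FSE requires.
delta = {1: {"A": 2, "B": 1, "C": 1, "D": 2},
         2: {"A": 1, "B": 1, "C": 1, "D": 2}}
lam   = {1: {"A": "00", "B": "01", "C": "10", "D": "11"},
         2: {"A": "0",  "B": "10", "C": "110", "D": "111"}}

def fse_encode(text, q0=1):
    """lambda(q0, T): concatenate codewords while making state transitions."""
    q, out = q0, []
    for a in text:
        out.append(lam[q][a])
        q = delta[q][a]
    return "".join(out)
```

For example, fse_encode("DBA") starts in state 1, emits 11 for D moving to state 2, emits 10 for B moving back to state 1, and emits 00 for A, giving 111000; the same symbol can receive different codewords depending on the state, which is the whole point of the finite-state model.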

and then extend λ into the function from Q × Σ∗ to ∆∗ by

  λ(q, ε) = ε,
  λ(q, xa) = λ(q, x) · λ(δ(q, x), a),

where q ∈ Q, x ∈ Σ∗, and a ∈ Σ. The encoding of a text T ∈ Σ∗ by an FSE is then defined to be the string λ(q0, T ). For example, the FSE of Fig. 3 takes as input the text DBBDCAB, makes the state transitions 1 → 2 → 1 → 1 → 2 → 1 → 2 → 1, and emits as output the string λ(1, DBBDCAB) = 11 10 01 11 110 00 10.

3.2.2 PMMs for Searching in FSE Compressed Text

Suppose that we are given two patterns AB and BD. In state 1 we shall search for the strings λ(1, AB) = 0010 and λ(1, BD) = 0111, and in state 2 for the strings λ(2, AB) = 11101 and λ(2, BD) = 10 11. As shown in Ref. 35), the PMM for searching in FSE compressed text can be constructed from the code-trees of Fig. 4 and from the tries representing the encodings of the patterns for the two states. See Fig. 5.

The size of the PMM is linearly proportional to the number of states of the underlying FSE. Therefore, we cannot use a large FSE. The problem is how to build a reasonably small FSE which has a relatively high compression ratio. Miyazaki26) proposed an approach to reduce the size of an FSE, based on alphabet indexing. He showed a local search method which finds not

Fig. 6 DFA accepting Japanese texts (for EUC), with byte-range edge labels [00-7F], [80-FF], and [00-FF]. Although the first and second 8 bits of each 16-bit codeword are in fact in [A1-FE], this DFA is enough for a correct Japanese text. (Diagram omitted.)

the best but a 'good' alphabet indexing which yields a relatively high compression ratio.

3.3 Searching in Japanese Text Files

Since an 8-bit code such as the ASCII code cannot represent the many characters used in Japanese texts, a 16-bit code is used to represent such characters. Thus, a text file written in Japanese is a mixture of 8-bit codewords and 16-bit codewords. An automata-oriented approach to string pattern matching in Japanese text seems unrealistic because the alphabet size is very large. However, we can build a PMM which runs on a Japanese text in a byte-by-byte manner.

In both the Extended-Unix-Code (EUC) and the Shifted-JIS (SJIS), the first 8 bits of each of the 16-bit codewords are not identical to any of the 8-bit codewords. Thus the set of codewords satisfies the prefix property. Assuming the code alphabet is ∆ = {00, 01, . . . , FF} (i.e., the set of byte codes), a DFA accepting the set of codewords is as shown in Fig. 6. By incorporating this DFA into PMMs, we can process Japanese text files in a byte-by-byte manner. The synchronization technique mentioned in Section 3.1.1 is thus applicable to any code satisfying the prefix property.

4. For Dictionary-based Methods

In this section we discuss the problem of compressed pattern matching for dictionary-based methods. In a dictionary-based compression, a text string is described by a pair of a dictionary and a sequence of tokens, each of which represents a phrase defined in the dictionary. We introduced in Ref. 18) a unifying framework, named collage system, which abstracts various dictionary-based compression methods, such as the Lempel-Ziv family, SEQUITUR30), Re-Pair21), and the static dictionary methods. In Ref. 18) we presented a general compressed


pattern matching algorithm for texts described in terms of collage systems. Consequently, any of the compression methods covered by the framework has a compressed pattern matching algorithm as an instance. The algorithm essentially simulates the move of the KMP algorithm. On the other hand, we also presented in Ref. 33) a BM type algorithm for collage systems.

4.1 Collage System

A collage system is a pair 〈D, S〉 defined as follows: D is a sequence of assignments X1 = expr1; X2 = expr2; · · · ; Xn = exprn, where each Xk is a variable (or a token) and exprk is any of the form:

  a         for a ∈ Σ ∪ {ε},                  (primitive assignment)
  Xi Xj     for i, j < k,                     (concatenation)
  [j]Xi     for i < k and an integer j,       (prefix truncation)
  Xi[j]     for i < k and an integer j,       (suffix truncation)
  (Xi)^j    for i < k and an integer j.       (j times repetition)

Each variable represents a string obtained by

evaluating the expression as it implies. We identify a variable Xi with the string represented by Xi in the sequel. The size of D is the number n of assignments and is denoted by ||D||. The syntax tree of a variable X in D, denoted by T(X), is defined inductively as follows. The root node of T(X) is labeled by X and has:

  no subtree,                    if X = a ∈ Σ ∪ {ε};
  two subtrees T(Y) and T(Z),    if X = Y Z;
  one subtree T(Y),              if X = (Y)^i, [i]Y, or Y[i].
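As a concrete (made-up) illustration of ours, a collage system can be represented by tagged tuples and evaluated recursively according to the five forms above:

```python
# Evaluating a collage system (sketch): D maps each variable to a tagged
# tuple for one of the five assignment forms; evaluate() derives the string
# a variable represents.  The dictionary spelling "abracadabra" is made up.
def evaluate(D, x, memo=None):
    memo = {} if memo is None else memo
    if x not in memo:
        tag = D[x][0]
        if tag == "prim":                 # a in Sigma, or the empty string
            memo[x] = D[x][1]
        elif tag == "cat":                # Xi Xj
            memo[x] = evaluate(D, D[x][1], memo) + evaluate(D, D[x][2], memo)
        elif tag == "ptrunc":             # [j]Xi: drop the length-j prefix
            memo[x] = evaluate(D, D[x][1], memo)[D[x][2]:]
        elif tag == "strunc":             # Xi[j]: drop the length-j suffix
            s = evaluate(D, D[x][1], memo)
            memo[x] = s[:len(s) - D[x][2]]
        else:                             # "rep": (Xi)^j
            memo[x] = evaluate(D, D[x][1], memo) * D[x][2]
    return memo[x]

D = {1: ("prim", "a"), 2: ("prim", "b"), 3: ("prim", "r"),
     4: ("prim", "c"), 5: ("prim", "d"),
     6: ("cat", 1, 2),        # ab
     7: ("cat", 3, 1),        # ra
     8: ("cat", 6, 7),        # abra
     9: ("cat", 4, 1),        # ca
     10: ("cat", 5, 8)}       # dabra
S = [8, 9, 10]
text = "".join(evaluate(D, x) for x in S)   # the original text
```

The memo table keeps the evaluation linear in the total output size even when variables are shared, which mirrors how the dictionary factors out repeated phrases.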

Define the height of a variable X to be the height of the syntax tree T(X). The height of D is defined by height(D) = max{height(X) | X in D}. It expresses the maximum dependency of the variables in D.

On the other hand, S = Xi1, Xi2, . . . , Xik is a sequence of variables defined in D. We denote by |S| the number k of variables in S. The collage system represents the string obtained by concatenating the strings Xi1, Xi2, . . . , Xik. Essentially, we can convert any collage system 〈D, S〉 into one where S consists of a single variable, by adding a series of concatenation operations into D. This fact may suggest that S is unnecessary. However, by separating the dictionary D, which only defines phrases, from S, which is intended as a sequence of phrases, we can capture a variety of compression methods naturally. Both D and S can be encoded in various ways. The compression ratios therefore depend upon the encoding sizes of D and S rather than upon ||D|| and |S|.

A collage system is said to be truncation-free if D contains no truncation operation. A collage system is said to be regular if D contains neither repetition nor truncation operations.

4.2 KMP Type Algorithm for Collage Systems

Let us denote by t.u the phrase represented by a token t. The problem is:

Given: a pattern π = π[1 : m] and a collage system 〈D, S〉 with S = S[1 : n].
Find: all locations at which π occurs within the original text S[1].u · S[2].u · · · S[n].u.

In Ref. 18) we presented a KMP type algorithm solving this problem. Figure 7 gives an overview of the algorithm, which processes S token-by-token. The algorithm simulates the

rithm solving this problem. Figure 7 gives anoverview of the algorithm, which processes Stoken-by-token. The algorithm simulates themove of the KMP automaton running on theoriginal text, by using two functions JumpKMP

and OutputKMP, both take as input a state anda token. The former is used to substitute justone state transition for the consecutive statetransitions of the KMP automaton caused byeach of the phrases, and the latter is used toreport all pattern occurrences found during thestate transitions. Thus the definitions of thetwo functions are as follows.

  JumpKMP(q, t) = δ(q, t.u),
  OutputKMP(q, t) = { |v| : v is a non-empty prefix of t.u such that δ(q, v) is the final state },

where δ is the state transition function of the KMP automaton.

This idea is essentially based on the algorithm for searching in LZW compressed text due to Amir, et al.4), which finds only the leftmost pattern occurrence. The extension to find all pattern occurrences was achieved by Kida, et al.19), together with an extension to the multiple pattern problem.

Let ||D|| and height(D) respectively denote the number of assignments in D and the maximum dependency in D. Let r be the number of all occurrences of π in the text. The time and space complexities of the algorithm are as follows.

Lemma 1 (Kida, et al.18)) The function JumpKMP can be realized in O(||D|| · height(D) +


Input: Pattern π and collage system consisting of D and S = S[1 : n].
Output: All occurrences of π in the original text.
begin
  /* Preprocessing */
  Compute the functions JumpKMP and OutputKMP from the pattern π and the dictionary D;
  /* Main routine */
  state := 0; ℓ := 0;
  for i := 1 to n do begin
    for each d ∈ OutputKMP(state, S[i]) do
      Report a pattern occurrence that ends at position ℓ + d;
    state := JumpKMP(state, S[i]); ℓ := ℓ + |S[i].u|
  end
end.

Fig. 7 KMP type algorithm for searching in a collage system.

m^2) time using O(||D|| + m^2) space, so that it responds in constant time. For a truncation-free collage system, the time complexity becomes O(||D|| + m^2).

Lemma 2 (Kida, et al.18)) The procedure to enumerate the set OutputKMP(q, t) can be realized in O(||D|| · height(D) + m^2) time using O(||D|| + m^2) space, so that it responds in O(height(t) + ℓ) time, where ℓ is the size of the set. For a truncation-free collage system, it can be realized in O(||D|| + m^2) time and space, so that it runs in O(ℓ) time.

Theorem 1 (Kida, et al.18)) The problem of compressed pattern matching can be solved in O((||D|| + n) · height(D) + m^2 + r) time using O(||D|| + m^2) space. For a truncation-free collage system, the time complexity becomes O(||D|| + n + m^2 + r).

The idea can also be applied to compressed

pattern matching for other compression methods that are not covered by collage systems. For instance, we developed in Ref. 34) an algorithm, based on a similar idea, for searching in texts compressed using antidictionaries9).

4.3 BM Type Algorithm for Collage Systems

We first briefly sketch the BM algorithm, and then show a BM type algorithm for searching in collage systems.

4.3.1 BM Algorithm on Uncompressed Text

The BM algorithm performs the character comparisons in the right-to-left direction, and slides the pattern to the right using the so-called shift function when a mismatch occurs.

Input: Pattern π and text T = T [1 : N ].
Output: All occurrences of π in T .
begin
  /* Preprocessing */
  Compute the functions g and σ from the pattern π;
  /* Main routine */
  T [0] := $;  /* $ never occurs in the pattern */
  i := m;
  while i ≤ N do begin
    state := 0; ℓ := 0;
    while g(state, T [i − ℓ]) is defined do begin
      state := g(state, T [i − ℓ]); ℓ := ℓ + 1
    end;
    if state = m then report a pattern occurrence;
    i := i + σ(state, T [i − ℓ])
  end
end.

Fig. 8 BM algorithm on uncompressed text.
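The algorithm of Fig. 8 can be transcribed almost literally, as in the following sketch of ours. Here rightmost_occ, as defined in the sequel, is computed by brute force on each shift instead of being precomputed into a σ table, and the sentinel T[0] = $ is emulated; the single counter state doubles as ℓ, since for a single pattern they coincide.

```python
# A direct transcription of the BM variant of Fig. 8.  g is the partial
# automaton of the reversed pattern (state j = length-j suffix matched);
# the shift is sigma(j, a) = rightmost_occ(a . pi[m-j+1 : m]).
def bm_search(pi, T):
    m = len(pi)

    def rightmost_occ(w):
        # min l > 0 with pi[m-l-|w|+1 : m-l] = w, or pi[1 : m-l] a suffix of w
        for l in range(1, m + 1):
            if m - l - len(w) >= 0 and pi[m-l-len(w):m-l] == w:
                return l
            if w.endswith(pi[:m-l]):
                return l
        return m  # unreachable: l = m gives the empty prefix, always a suffix

    occs = []
    i = m  # 1-based position of the right end of the pattern window
    while i <= len(T):
        state = 0  # number of characters matched right-to-left
        while state < m and i - state > 0 and T[i-1-state] == pi[m-1-state]:
            state += 1
        if state == m:
            occs.append(i)  # occurrence ending at position i (1-based)
        a = T[i-1-state] if i - 1 - state >= 0 else "$"  # "$" emulates T[0]
        i += rightmost_occ(a + pi[m-state:])
    return occs
```

Since l = m always satisfies the second condition (the empty prefix is a suffix of every string), the shift is always at least 1, so the while-loop terminates.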

The algorithm for searching in a text T [1 : N ] is shown in Fig. 8. Note that the function g is the state transition function of the (partial) automaton that accepts the reversed pattern, in which state j represents the length j suffix of the pattern (0 ≤ j ≤ m).

Although there are many variations of the shift function, they are basically designed to shift the pattern to the right so as to align a text substring with its rightmost occurrence within the pattern. Let

  rightmost_occ(w) = min{ ℓ > 0 | π[m − ℓ − |w| + 1 : m − ℓ] = w, or π[1 : m − ℓ] is a suffix of w }.

The following definition, given by Uratani andTakeda36) (for multiple pattern case), is the onewhich utilizes all information gathered in oneexecution of the inner-while-loop in the algo-

Page 8: Speeding Up String Pattern Matching by Text Compression ...ayumi/papers/IPSJ40.pdf · Vol.42 No.3 IPSJJournal Mar. 2001 IPSJ40thAnniversaryAwardPaper Speeding Up String Pattern Matching

Vol. 42 No. 3 Speeding Up String Pattern Matching by Text Compression 377

Input: Pattern π and collage system consisting of D and S = S[1 : n].Output: All occurrences of π in the original text.begin/* Preprocessing */

Compute the functions JumpBM, OutputBM and Occ from the pattern πand the dictionary D;

/* Main routine */focus := an appropriate value;while focus ≤ n do begin

Step 1: Report all pattern occurrences that are contained in the phrase S[focus].uby using Occ;

Step 2: Find all pattern occurrences that end within the phrase S[focus].uby using JumpBM and OutputBM;

Step 3: Compute a possible shift ∆ based on information gathered in Step 2;focus := focus +∆

endend.

Fig. 9 Overview of BM type compressed pattern matching algorithm.

Compressed text

Original text

Pattern occurrences

focus

Fig. 10 Pattern occurrences.

rithm of Fig. 8.σ(j, a) = rightmost occ(a ·π[m−j+1 : m]).

The two-dimensional array realization of thisfunction requires O(|Σ| ·m) memory, but it be-comes realistic due to recent progress in com-puter technology. Moreover, the array can beshared with the goto function g. This saves notonly memory requirement but also the numberof table references.4.3.2 Algorithm for Collage SystemNow, we show a BM type algorithm for

searching in collage systems. Figure 9 gives an overview of our algorithm. For each iteration of the while-loop, we report in Step 1 all the pattern occurrences that are contained in the phrase represented by the token we focus on, determine in Step 2 the pattern occurrences that end within the phrase, and then shift our focus to the right by the amount ∆ obtained in Step 3. Let us call the token we focus on the focused token, and the phrase it represents the focused phrase.

For Step 1, we shall compute during the preprocessing, for every token t, the set Occ(t) of all pattern occurrences contained in the phrase t.u. The time and space complexities of this computation are as follows.

Lemma 3 (Kida, et al. 18)) We can build

in O(height(D) · ||D|| + m^2) time, using O(||D|| + m^2) space, a data structure by which the enumeration of the set Occ(t) is performed in O(height(t) + ℓ) time, where ℓ = |Occ(t)|. For a truncation-free collage system, it can be built in O(||D|| + m^2) time and space, and the enumeration requires only O(ℓ) time.

In the following we discuss how to realize Step 2 and Step 3. Figure 10 illustrates the pattern occurrences that end within the focused phrase. A candidate for a pattern occurrence is a non-empty prefix of the focused phrase that is also a proper suffix of the pattern. There may be more than one candidate to be checked. One naive method is to check all of them independently, but here we take another approach. We shall


procedure Find_pattern_occurrences(focus : integer);
begin
  if JumpBM(0, S[focus]) is undefined then return;
  state := JumpBM(0, S[focus]); d := state; ℓ := 1;
  repeat
    while JumpBM(state, S[focus − ℓ]) is defined do begin
      state := JumpBM(state, S[focus − ℓ]); ℓ := ℓ + 1
    end;
    if OutputBM(state, S[focus − ℓ]) = true then report a pattern occurrence;
    d := d − (state − f(state)); state := f(state)
  until d ≤ 0
end;

Fig. 11 Finding pattern occurrences in Step 2.
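For concreteness, the procedure of Fig. 11 can be transliterated into Python roughly as follows. This is a sketch: `jump`, `output`, and `f` stand for JumpBM, OutputBM, and the failure function, passed in as callables, with `None` playing the role of "undefined"; the left-boundary guards are our addition:

```python
def find_pattern_occurrences(S, focus, jump, output, f):
    """Collect the token indices at which pattern occurrences ending
    within the focused phrase are detected, following Fig. 11."""
    found = []
    state = jump(0, S[focus])
    if state is None:                     # JumpBM(0, S[focus]) undefined
        return found
    d, l = state, 1
    while True:                           # repeat ... until d <= 0
        while focus - l >= 0 and jump(state, S[focus - l]) is not None:
            state = jump(state, S[focus - l])
            l += 1
        if focus - l >= 0 and output(state, S[focus - l]):
            found.append(focus - l)       # a pattern occurrence is reported
        d -= state - f(state)
        state = f(state)
        if d <= 0:
            return found
```

Since f(state) < state by definition, d strictly decreases and the loop terminates.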

start with the longest one. For the case of uncompressed text, we can do it by using the partial automaton for the reversed pattern stated in Section 4.3.1. When a mismatch occurs, we change the state by using the failure function and try to proceed in the left direction. The process is repeated until the pattern no longer overlaps the focused phrase. In order to perform such processing over compressed text, we use the two functions JumpBM and OutputBM defined in the sequel.

Let lpps(w) denote the longest prefix of a string w that is also a proper suffix of the pattern π. Extend the function g to the domain {0, . . . , m} × Σ* by g(j, aw) = g(g(j, w), a) if g(j, w) is defined, and otherwise g(j, aw) is undefined, where w ∈ Σ* and a ∈ Σ. Let f(j) be the largest integer k (k < j) such that the length-k suffix of the pattern is a prefix of the length-j suffix of the pattern. Note that f is the same as the failure function of the KMP automaton. Define the functions JumpBM and OutputBM by

    JumpBM(j, t) =
        g(j, t.u),     if j ≠ 0;
        |lpps(t.u)|,   if j = 0 and lpps(t.u) ≠ ε;
        undefined,     otherwise.

    OutputBM(j, t) =
        true,   if g(j, w) = m for some proper suffix w of t.u;
        false,  otherwise.
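The auxiliary functions lpps and f can be sketched as follows (our own naive Python rendering; the paper's f is computed here as the KMP failure function of the reversed pattern, which matches the definition above, since state j of the automaton represents the length-j suffix):

```python
def lpps(pattern: str, w: str) -> str:
    """Longest prefix of w that is also a proper suffix of the pattern."""
    for k in range(min(len(w), len(pattern) - 1), 0, -1):
        if pattern.endswith(w[:k]):
            return w[:k]
    return ""

def failure(pattern: str) -> list:
    """f[j]: largest k < j such that the length-k suffix of the pattern
    is a prefix of the length-j suffix (KMP failure on the reversal)."""
    rev = pattern[::-1]
    m = len(rev)
    f, k = [0] * (m + 1), 0
    for j in range(1, m):
        while k > 0 and rev[j] != rev[k]:
            k = f[k]
        if rev[j] == rev[k]:
            k += 1
        f[j + 1] = k
    return f
```

For π = "abab", the length-1 suffix "b" is a prefix of the length-3 suffix "bab", so f(3) = 1.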

The procedure for Step 2 is shown in Fig. 11.

We now discuss how to compute the possible shift ∆ of the focus. Let

    Shift(j, t) = rightmost occ(t.u · π[m − j + 1 : m]).

Assume that, starting at the token S[focus], we encounter a mismatch against a token t in state j. Find the minimum integer k > 0 such that

    Shift(0, S[focus]) ≤ Σ_{i=1..k} |S[focus + i].u|,    (1)

or

    Shift(j, t) ≤ Σ_{i=0..k} |S[focus + i].u| − |lpps(S[focus].u)|.    (2)
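The search for the minimum k in Eqs. (1) and (2) can be sketched as a running-sum scan over the phrase lengths. The function name, the list-of-lengths input, and the handling of running off the token sequence are our assumptions:

```python
def next_focus_shift(phrase_lens, focus, shift0, shiftj, lpps_len):
    """Smallest k > 0 satisfying Eq. (1) or Eq. (2), given
    phrase_lens[i] = |S[i].u|, shift0 = Shift(0, S[focus]),
    shiftj = Shift(j, t), and lpps_len = |lpps(S[focus].u)|."""
    sum1 = 0                      # sum of |S[focus+i].u| for i = 1..k
    sum2 = phrase_lens[focus]     # sum of |S[focus+i].u| for i = 0..k
    k = 0
    while True:
        k += 1
        if focus + k >= len(phrase_lens):
            return k              # the pattern is shifted past the text end
        sum1 += phrase_lens[focus + k]
        sum2 += phrase_lens[focus + k]
        if shift0 <= sum1 or shiftj <= sum2 - lpps_len:
            return k
```

With phrase lengths [3, 2, 4, 5], focus 0, and Shift(0, S[focus]) = 5, the focus advances by k = 2 tokens, since 2 + 4 ≥ 5.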

Note that the shift due to Eq. (1) is possible independently of the result of the procedure of Fig. 11. When returning at the first if-then statement of the procedure in Fig. 11, we can shift the focus by the amount due to Eq. (1). Otherwise, we shift the focus by the amount due to both Eq. (1) and Eq. (2), for j = state and t = S[focus − ℓ] just after the execution of the while-loop at the first iteration of the repeat-until loop.

Lemma 4 (Shibata, et al. 33)) The functions JumpBM, OutputBM, and Shift can be built in O(height(D) · ||D|| + m^2) time and O(||D|| + m^2) space, so that they answer in O(1) time. The factor height(D) can be dropped if the collage system is truncation-free.

Theorem 2 (Shibata, et al. 33)) The algorithm of Fig. 9 runs in O(height(D) · (||D|| + n) + n · m + m^2 + r) time, using O(||D|| + m^2) space. For a truncation-free collage system, the time complexity becomes O(||D|| + n · m + m^2 + r).

4.4 Practical Aspects

Theorem 1 suggests that a compression

method described as a collage system with no truncation might be suitable for the speed-up of pattern matching. In fact, the collage systems for LZ77 have truncation, and LZ77 is not suitable, as shown in Ref. 28). On the other hand, the collage systems for LZW are truncation-free. However, searching in LZW compressed files is rather slow in comparison with searching in the original files, as shown in Refs. 19), 28). We see two reasons. One is that in LZW the dictionary D is not encoded explicitly: it is incrementally re-built from S. The preprocessing of D is therefore merged into the main routine (see Fig. 7 again). The other reason is as follows. Although JumpKMP can be realized using only O(||D|| + m) space so that it answers in constant time, the constant factor is relatively large. The two-dimensional array implementation of JumpKMP would improve this, but it requires O(||D|| · m) space, which is unrealistic because ||D|| is linear with respect to n in the case of LZW.

From the above observations, the desirable properties for compressed pattern matching can be summarized as follows.
• The dictionary D contains no truncation.
• The dictionary D is encoded separately from the sequence S.
• The size of D is small enough.
• The tokens of S are encoded using a fixed-length code.

The compression scheme called byte pair encoding (BPE) 13) is the one which satisfies all of these properties. It is regarded as a simplified version of the compression method called Re-Pair 21). The basic operation of the compression is to substitute a single character which does not appear in the text for a pair of two consecutive characters which appears frequently in the text. This operation is repeated until either all characters are used up or no pair of two consecutive characters appears frequently. Thus, ||D|| ≤ 256. In the next section, we show compressed pattern matching algorithms for BPE compressed files, which are very fast in practice.
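As an illustration of the BPE scheme just described, a minimal and deliberately naive compressor and decompressor might look as follows. The names `bpe_compress`/`bpe_decompress`, the whole-text pair counting, and the frequency threshold of 2 are our assumptions, not Gage's exact routine:

```python
from collections import Counter

def bpe_compress(text: bytes):
    """Repeatedly replace the most frequent pair of adjacent symbols
    with a byte value that does not occur in the text."""
    data = list(text)
    unused = [b for b in range(256) if b not in set(data)]
    rules = {}                              # new byte -> (left, right)
    while unused:
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:                        # no pair appears frequently enough
            break
        new = unused.pop()
        rules[new] = pair
        out, i = [], 0
        while i < len(data):                # rewrite the sequence in one pass
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(new); i += 2
            else:
                out.append(data[i]); i += 1
        data = out
    return bytes(data), rules

def bpe_decompress(data: bytes, rules) -> bytes:
    def expand(b):
        if b in rules:
            left, right = rules[b]
            return expand(left) + expand(right)
        return bytes([b])
    return b"".join(expand(b) for b in data)
```

Since each rule maps one fresh byte to a pair, the dictionary holds at most 256 entries, matching ||D|| ≤ 256 above.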

5. Searching in BPE Compressed Files

We specialized the KMP type and the BM type algorithms, presented in the last section, to deal only with BPE compressed files.

5.1 KMP (AC) Type Algorithm

We can take the two-dimensional array realization of JumpKMP, which is realistic since the array size is only 256 · (m + 1).

Lemma 5 (Shibata, et al. 32)) For a regular collage system, the two-dimensional tables storing JumpKMP and OutputKMP can be built in O(||D|| · m) time and space.

The searching time is reduced at almost the same rate as the compression ratio, as will be seen in Section 6. BPE compresses Japanese texts at nearly the same ratio as English texts, and therefore the method can be applied to searching in BPE compressed Japanese texts. This is a big advantage of this method. Moreover, the algorithm can be extended to the multipattern searching problem so that it simulates the AC algorithm.

5.2 BM Type Algorithm

Lemma 6 (Shibata, et al. 33)) For a regular collage system, the tables storing JumpBM, OutputBM, and Shift can be built in O(||D|| · m) time and space.

Using the BM type algorithm presented in the previous section, we cannot shift the pattern without knowing the total length of the phrases corresponding to the skipped tokens. This slows down the algorithm in practice. To do without such information, we assume that the skipped phrases are all of length C, the maximum phrase length in D, and divide the shift value by C. The value of C is crucial in this approach.

We estimated the change of compression ratios depending on C. The text files we used are:

Medline. A clinically-oriented subset of Medline, consisting of 348,566 references. The file size is 60.3 Mbyte and the entropy☆ is 4.9647.

Genbank. The file consisting only of accession numbers and nucleotide sequences taken from a data set in Genbank. The file size is 17.1 Mbyte and the entropy is 2.6018.

Table 2 shows the compression ratios of these texts for BPE, together with those for the Huffman encoding, gzip, and compress, where the last two are well-known compression tools based on LZ77 and LZW, respectively. Remark that the change of the compression ratio depending on C is non-monotonic. The reason for this is that the BPE compression routine we used builds the dictionary D in a greedy manner only from the first block of a text file. It is observed that we can restrict C with no great sacrifice of compression ratio. Thus we decided to use the BPE compressed file of Medline for C = 3, and

☆ By entropy is meant the order-0 entropy, namely the classical Shannon entropy.
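The order-0 entropy quoted here is simply the Shannon entropy of the byte distribution, which can be sketched as:

```python
import math
from collections import Counter

def order0_entropy(data: bytes) -> float:
    """Order-0 (Shannon) entropy of the byte distribution, in bits per byte."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())
```

A file with two equally frequent bytes, for instance, has an order-0 entropy of exactly 1.0 bit per byte.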


Table 2 Compression ratios (%).

          Huffman   BPE: C=3   C=4    C=5    C=6    C=7    C=8    unlimit.   compress   gzip
Medline   62.41     59.44      58.46  58.44  58.53  58.47  58.58  59.07      42.34      33.35
Genbank   33.37     36.93      32.84  32.63  32.63  32.34  32.28  32.50      26.80      23.15

[Figure omitted: plots of search time (sec) against pattern length (5–30) for KMP, uncompress+KMP, gunzip+KMP, unBPE+KMP, AC on LZW, Shift-Or on LZW, AC on Huffman, AC on BPE, BM on BPE, UT, and agrep.]

(a) Medline    (b) Genbank

Fig. 12 Running times (in CPU time).

that of Genbank for C = 4, in the experiment in the next section. For the BPE compression, we observed that putting a restriction on C makes no great sacrifice of compression ratio, even for C = 3, 4.

6. Experimental Results

We estimated the performances of the following programs:

(A) Decompression followed by ordinary search. We tested this approach with the KMP algorithm for the compression methods gzip, compress, and BPE. We did not combine the decompression programs and the KMP search program using the Unix 'pipe' because it is slow. Instead, we embedded the KMP routine in the decompression programs, so that the KMP automaton processes the decoded characters 'on the fly'. The programs are abbreviated as gunzip+KMP, uncompress+KMP, and unBPE+KMP, respectively.

(B) Ordinary search in the original text. KMP, UT (the Uratani-Takeda variant 36) of BM), and agrep.

(C) Compressed pattern matching. AC on LZW 19), Shift-Or on LZW 19), AC on Huffman 24), AC on BPE 32), and BM on BPE 33).


The automata in KMP, UT, AC on Huffman, AC on BPE, and BM on BPE were realized as two-dimensional arrays of size ℓ × 256, where ℓ is the number of states. The texts used are the Medline and Genbank files mentioned in Section 5, and the patterns searched for are text substrings randomly gathered from them. Our experiment was carried out on an AlphaStation XP1000 with an Alpha 21264 processor at 667 MHz running the Tru64 UNIX operating system V4.0F. Figure 12 shows the running times (CPU time). We excluded the preprocessing times since they are negligible compared with the running times. We observed the following facts.

• The differences between the running times of KMP and the three programs of (A) correspond to the decompression times. Decompression tasks for LZ77 and LZW are thus time-consuming compared with the pattern matching task. Even when we use the BM algorithm instead of KMP, approach (A) is still slow for LZ77 and LZW.

• BM on BPE is faster than all the others. In particular, it runs about 1.2 times faster than agrep for Medline, and about 3 times faster for Genbank.

7. Concluding Remarks

The aim of compression was traditionally to save the space requirement of data files. In exchange for the saved space, we must bear extra time for expanding the files when processing them. For this reason, we did not want to compress data files when the available space was sufficient. However, we now have a new reason to compress data files: the compression is regarded as a means of speeding up the exact string matching task, which we call Goal 2. The results presented in this paper have thus raised the curtain on a new era in the history of string pattern matching and data compression.

The work by Manber 22) (1994) is recognized as the first attempt at Goal 2. However, an attempt was made by our research group in Ref. 12) as early as 1992. In that work, we incorporated the Huffman tree into the pattern matching machine for synchronization, as described in Section 3. This idea was obtained as a generalization of the one used for bytewise processing of Japanese text files, which are mixtures of one-byte and two-byte codes. A naive method of processing such a text file would be to convert the text into a sequence of 16-bit integers, each of which represents one character, and then process it by an ordinary search routine. Notice that the naive method is similar to the decompression-then-search method, while the bytewise processing resembles compressed pattern matching. The requirement to efficiently process Japanese text files forced us to distinguish a text from its representation. Most interestingly, we proposed in 1984, in Ref. 7), an efficient realization of the pattern matching machine which processes a text file 4 bits at a time, in order to reduce the size of the state-transition table to 1/8. Thus, before the 1990s we had almost finished preparing for studies on compressed pattern matching.

String pattern matching is simple, but it is the basis of other string processing tasks. To speed up more complicated string processing, such as approximate string matching or regular expression matching, by using text compression will be our future work.

References

1) Aho, A.V. and Corasick, M.: Efficient string matching: An aid to bibliographic search, Comm. ACM, Vol.18, No.6, pp.333–340 (1975).

2) Amir, A. and Benson, G.: Efficient two-dimensional compressed matching, Proc. Data Compression Conference, p.279 (1992).

3) Amir, A. and Benson, G.: Two-dimensional periodicity and its application, Proc. 3rd Ann. ACM-SIAM Symp. on Discrete Algorithms, pp.440–452 (1992).

4) Amir, A., Benson, G. and Farach, M.: Let sleeping files lie: Pattern matching in Z-compressed files, J. Computer and System Sciences, Vol.52, pp.299–307 (1996).

5) Amir, A., Benson, G. and Farach, M.: Optimal two-dimensional compressed matching, J. Algorithms, Vol.24, No.2, pp.354–379 (1997).

6) Amir, A., Landau, G.M. and Vishkin, U.: Efficient pattern matching with scaling, J. Algorithms, Vol.13, No.1, pp.2–32 (1992).

7) Arikawa, S. and Shinohara, T.: A run-time efficient realization of Aho-Corasick pattern matching machines, New Generation Computing, Vol.2, No.2, pp.171–186 (1984).

8) Boyer, R.S. and Moore, J.S.: A fast string searching algorithm, Comm. ACM, Vol.20, No.10, pp.62–72 (1977).

9) Crochemore, M., Mignosi, F., Restivo, A. and Salemi, S.: Text compression using antidictionaries, Proc. 26th International Colloquium on Automata, Languages and Programming, pp.261–270, Springer-Verlag (1999).

10) Eilam-Tzoreff, T. and Vishkin, U.: Matching patterns in strings subject to multi-linear transformations, Theoretical Computer Science, Vol.60, No.3, pp.231–254 (1988).

11) Farach, M. and Thorup, M.: String-matching in Lempel-Ziv compressed strings, Algorithmica, Vol.20, No.4, pp.388–404 (1998). (Previous version in: STOC'95).

12) Fukamachi, S., Shinohara, T. and Takeda, M.: String pattern matching for compressed data using variable length codes (in Japanese), Proc. Symposium on Informatics 1992, pp.95–103 (1992).

13) Gage, P.: A new algorithm for data compression, The C Users Journal, Vol.12, No.2 (1994).

14) Gasieniec, L., Karpinski, M., Plandowski, W. and Rytter, W.: Efficient algorithms for Lempel-Ziv encoding, Proc. 4th Scandinavian Workshop on Algorithm Theory, pp.392–403, Springer-Verlag (1996).

15) Hirao, M., Shinohara, A., Takeda, M. and Arikawa, S.: Fully compressed pattern matching algorithm for balanced straight-line programs, Proc. 7th International Symp. on String Processing and Information Retrieval, pp.132–138, IEEE Computer Society (2000).

16) Karkkainen, J., Navarro, G. and Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text, Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, pp.195–209, Springer-Verlag (2000).

17) Karpinski, M., Rytter, W. and Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions, Nordic Journal of Computing, Vol.4, pp.172–186 (1997).

18) Kida, T., Shibata, Y., Takeda, M., Shinohara, A. and Arikawa, S.: A unifying framework for compressed pattern matching, Proc. 6th International Symp. on String Processing and Information Retrieval, pp.89–96, IEEE Computer Society (1999).

19) Kida, T., Takeda, M., Shinohara, A., Miyazaki, M. and Arikawa, S.: Multiple pattern matching in LZW compressed text, J. Discrete Algorithms (to appear). (Previous versions in: DCC'98 and CPM'99).

20) Knuth, D.E., Morris, J.H. and Pratt, V.R.: Fast pattern matching in strings, SIAM J. Comput., Vol.6, No.2, pp.323–350 (1977).

21) Larsson, N.J. and Moffat, A.: Offline dictionary-based compression, Proc. Data Compression Conference '99, pp.296–305, IEEE Computer Society (1999).

22) Manber, U.: A text compression scheme that allows fast searching directly in the compressed file, ACM Trans. Information Systems, Vol.15, No.2, pp.124–136 (1997). (Previous version in: CPM'94).

23) Matsumoto, T., Kida, T., Takeda, M., Shinohara, A. and Arikawa, S.: Bit-parallel approach to approximate string matching in compressed texts, Proc. 7th International Symp. on String Processing and Information Retrieval, pp.221–228, IEEE Computer Society (2000).

24) Miyazaki, M., Fukamachi, S., Takeda, M. and Shinohara, T.: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPS Japan, Vol.39, No.9, pp.2638–2648 (1998).

25) Miyazaki, M., Shinohara, A. and Takeda, M.: An improved pattern matching algorithm for strings in terms of straight-line programs, J. Discrete Algorithms (to appear). (Previous version in: CPM'97).

26) Miyazaki, T.: Studies on speed-up of string pattern matching by text compression using statistical model (in Japanese), Master's Thesis, Kyushu Institute of Technology (1996).

27) Moura, E., Navarro, G., Ziviani, N. and Baeza-Yates, R.: Fast and flexible word searching on compressed text, ACM Trans. Information Systems (2000). (Previous versions in: SIGIR'98 and SPIRE'98).

28) Navarro, G. and Raffinot, M.: A general practical approach to pattern matching over Ziv-Lempel compressed text, Proc. 10th Ann. Symp. on Combinatorial Pattern Matching, pp.14–36, Springer-Verlag (1999).

29) Navarro, G. and Tarhio, J.: Boyer-Moore string matching over Ziv-Lempel compressed text, Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, pp.166–180, Springer-Verlag (2000).

30) Nevill-Manning, C.G., Witten, I.H. and Maulsby, D.L.: Compression by induction of hierarchical grammars, Proc. Data Compression Conference '94, pp.244–253, IEEE Press (1994).

31) Revuz, D.: Minimisation of acyclic deterministic automata in linear time, Theoretical Computer Science, Vol.92, No.1, pp.181–189 (1992).

32) Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T. and Arikawa, S.: Speeding up pattern matching by text compression, Proc. 4th Italian Conference on Algorithms and Complexity, pp.306–315, Springer-Verlag (2000).

33) Shibata, Y., Matsumoto, T., Takeda, M., Shinohara, A. and Arikawa, S.: A Boyer-Moore type algorithm for compressed pattern matching, Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, pp.181–194, Springer-Verlag (2000).

34) Shibata, Y., Takeda, M., Shinohara, A. and Arikawa, S.: Pattern matching in text compressed by using antidictionaries, J. Discrete Algorithms (to appear). (Previous version in: CPM'99).

35) Takeda, M.: Pattern matching machine for text compressed using finite state model, Technical Report DOI-TR-CS-142, Department of Informatics, Kyushu University (Oct. 1997).

36) Uratani, N. and Takeda, M.: A fast string-searching algorithm for multiple patterns, Information Processing & Management, Vol.29, No.6, pp.775–791 (1993).

37) Welch, T.A.: A technique for high performance data compression, IEEE Comput., Vol.17, pp.8–19 (1984).

38) Wu, S. and Manber, U.: Agrep – a fast approximate pattern-matching tool, Usenix Winter 1992 Technical Conference, pp.153–162 (1992).

39) Ziv, J. and Lempel, A.: A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, Vol.23, No.3, pp.337–349 (1977).

(Received June 30, 2000)
(Accepted September 27, 2000)

Masayuki Takeda was born in 1964. He is an Associate Professor in the Department of Informatics at Kyushu University. He received his B.S. in 1987 in Mathematics, his M.S. in 1989 in Information Systems, and his Dr.Eng. degree in 1996, all from Kyushu University. His present research interests include pattern matching algorithms, text compression, discovery science, and information retrieval.

Yusuke Shibata was born in 1975. He received his B.E. from Kyushu Institute of Technology in 1998 and his M.E. from Kyushu University in 2000. His research interests include pattern matching algorithms and text compression. Presently, he works at NTT Comware.

Tetsuya Matsumoto was born in 1977. He is a master's student in the Department of Informatics at Kyushu University. He received his B.S. in Physics from Kyushu University in 1999. His research interests include pattern matching algorithms and text compression.

Takuya Kida was born in 1974. He is a doctoral student in the Department of Informatics at Kyushu University. He obtained his B.S. in 1997 in Physics and his M.A. in 1999, both from Kyushu University. His research interests include pattern matching algorithms and text compression.

Ayumi Shinohara was born in 1965. He is an Associate Professor in the Department of Informatics at Kyushu University. He obtained his B.S. in 1988 in Mathematics, his M.S. in 1990 in Information Systems, and his Dr.Sci. degree in 1994, all from Kyushu University. His current research interests include discovery science, bioinformatics, and pattern matching algorithms.

Shuichi Fukamachi was born in 1967. He is a researcher in the Department of Artificial Intelligence at Kyushu Institute of Technology. He received his B.E. and M.E. from Kyushu Institute of Technology in 1991 and 1993, respectively. His research interests include pattern matching algorithms and information retrieval.

Takeshi Shinohara was born in 1955. He is a Professor in the Department of Artificial Intelligence at Kyushu Institute of Technology. He obtained his B.S. in Mathematics from Kyoto University in 1980, and his Dr.Sci. degree from Kyushu University in 1986. His research interests are in computational/algorithmic learning theory, information retrieval, and approximate retrieval of multimedia data.


Setsuo Arikawa was born in 1941. He is a Professor in the Department of Informatics at Kyushu University and the Director of the University Library at Kyushu University. He received his B.S. in 1964, his M.S. in 1966, and his Dr.Sci. degree in 1969, all in Mathematics, from Kyushu University. His research interests include discovery science, algorithmic learning theory, logic and inference/reasoning in AI, pattern matching algorithms, and library science.