Exact String Matching Algorithms: A Survey

Mehreen Ali, Hina Naz Khan, Shumaila Sayyab, Nadeem Iftikhar
Department of Bio-Science
Mohammad Ali Jinnah University, Islamabad-Pakistan


Introduction

• Exact string matching algorithms find all occurrences of a given string pattern in a given text of finite length.

• Exact string matching algorithms are used extensively in:

1. most operating systems,
2. text editors,
3. internet-related searches,
4. high-performance computing,
5. nucleotide or amino acid sequence searches in genome or protein databases.

Abstract

The exact string matching problem has remained a prominent area of research throughout the history of computer science. Exact string matching is fundamental to database and text processing applications. To date, several algorithms have been proposed to solve this problem. This paper provides a survey of the available exact string matching algorithms, along with their classification and evaluation based on certain important benchmarks.

General Behavior

• The general behavior consists of aligning the string pattern against the text and then comparing the two according to the given algorithm.

• Each such alignment is referred to as a text window, and each comparison process is known as an attempt. This behavior is termed the sliding window mechanism.

• After a match or a mismatch, the next alignment of the string pattern and the text is checked, until the text ends.

• During the pre-processing phase, a matrix, table or Finite State Automaton is computed from the given string pattern, to be used during the searching phase.

Benchmarks

• The benchmarks used to evaluate the algorithms are:

1. Time Complexity (tc)
2. Space Complexity (sc)
3. Pre-processing Time (pt)
4. Character Comparisons (cc) (average or worst case)

• Big-O notation is used to express the time and space complexities.

Exact String Matching Algorithms

Brute Force Algorithm

• A basic and very simple algorithm.

• It has no pre-processing phase.

• The character comparisons within a window can be done in any order.
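
The sliding window behavior and the character-comparison benchmark can be illustrated with a minimal Python sketch of the brute force approach. The function name and the returned counters are illustrative assumptions, not part of the survey.

```python
def brute_force_search(pattern, text):
    """Try every alignment of pattern against text (no pre-processing).

    Returns (matches, attempts, comparisons), where matches is the list of
    starting positions, attempts counts the text windows examined, and
    comparisons counts the individual character comparisons.
    """
    m, n = len(pattern), len(text)
    matches = []
    attempts = 0
    comparisons = 0
    for start in range(n - m + 1):      # each start is one text window
        attempts += 1
        j = 0
        while j < m:                    # compare left to right
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:                      # every character matched
            matches.append(start)
    return matches, attempts, comparisons
```

For example, `brute_force_search("aba", "ababa")` examines three windows, makes seven character comparisons, and reports matches at positions 0 and 2.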

Classification

All other algorithms can be classified into four categories, depending on the order in which the character comparisons are made:

1. From Left To Right
2. From Right To Left
3. In a Specific Order
4. In Any Order

From Left To Right

Deterministic Finite Automaton Algorithm

• Computes the transition table for the pattern in the pre-processing phase.

• Needs extra space to store the table and extra time to build and consult it.
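
A minimal sketch of this idea in Python follows. It assumes a non-empty pattern and takes the alphabet to be the set of characters seen in the pattern and the text; the function names and the restart-state construction are illustrative choices, not taken from the survey.

```python
def build_dfa(pattern, alphabet):
    """Transition table: dfa[state][char] -> next state.

    State j means "the last j text characters match pattern[:j]".
    Assumes a non-empty pattern whose characters are all in alphabet.
    """
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached after reading pattern[1:j]
    for j in range(1, m + 1):
        for c in alphabet:              # mismatch: behave like the restart state
            dfa[j][c] = dfa[restart][c]
        if j < m:
            dfa[j][pattern[j]] = j + 1  # match: advance one state
            restart = dfa[restart][pattern[j]]
    return dfa


def dfa_search(pattern, text):
    """Return all starting positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    dfa = build_dfa(pattern, set(pattern) | set(text))
    matches = []
    state = 0
    for i, c in enumerate(text):        # one table lookup per text character
        state = dfa[state][c]
        if state == m:
            matches.append(i - m + 1)
    return matches
```

The table has one row per pattern position and one column per alphabet symbol, which is where the extra space goes; in return, the search inspects each text character exactly once.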

Karp-Rabin Algorithm

• Avoids a full character-by-character check at each position of the text by comparing hash values first, which makes it very effective for multiple pattern matching.

• A hashing function is used.
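
As a rough illustration, the hashing step can be sketched in Python with a polynomial rolling hash. The base and modulus below are arbitrary assumptions, and the function name is hypothetical.

```python
def karp_rabin_search(pattern, text, base=256, modulus=1_000_000_007):
    """Return all starting positions of pattern in text using a rolling hash."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)    # weight of the window's leading character
    p_hash = t_hash = 0
    for j in range(m):                  # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[j])) % modulus
        t_hash = (t_hash * base + ord(text[j])) % modulus
    matches = []
    for start in range(n - m + 1):
        # Compare characters only when the hashes agree (rules out collisions).
        if p_hash == t_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:               # roll the hash one position to the right
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches
```

Updating the hash takes constant time per shift, so the pattern is compared character by character only at positions where the hash values coincide.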

Shift Or Algorithm

• Uses bitwise techniques.

• Works efficiently if the pattern length does not exceed the memory-word size of the machine.

• The searching phase then runs in O(n) time, which is lower than that of the Brute Force algorithm.
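
The bitwise technique can be sketched as follows. This is an illustrative Python version with hypothetical names; Python integers are unbounded, so the word-size restriction does not show up here, whereas a C implementation would keep the state in a single machine word.

```python
def shift_or_search(pattern, text):
    """Return all starting positions of pattern in text using Shift-Or.

    Bit j of the state is 0 exactly when pattern[:j + 1] matches the text
    ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    masks = {}                          # masks[c] has bit j cleared iff pattern[j] == c
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << j)
    state = all_ones
    matches = []
    for i, c in enumerate(text):
        # Shift in a 0 (empty prefix always matches), then OR with the mask.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0: # top bit clear: whole pattern matched
            matches.append(i - m + 1)
    return matches
```

Each text character costs one shift and one OR, regardless of the pattern length.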

Morris-Pratt Algorithm

• Builds on the Brute Force algorithm.

• The shifts are longer, which increases the speed of the search.

• Keeps a record of the text already matched with the pattern.
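
One way to keep that record is a border table computed from the pattern, as in the following Python sketch. The names `border_table` and `morris_pratt_search` are assumptions made for illustration.

```python
def border_table(pattern):
    """border[j] = length of the longest proper border of pattern[:j].

    A border is a string that is both a prefix and a suffix; border[0] = -1.
    """
    m = len(pattern)
    border = [-1] * (m + 1)
    k = -1
    for j in range(m):
        while k >= 0 and pattern[k] != pattern[j]:
            k = border[k]
        k += 1
        border[j + 1] = k
    return border


def morris_pratt_search(pattern, text):
    """Return all starting positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    border = border_table(pattern)
    matches = []
    j = 0                               # number of pattern characters matched so far
    for i, c in enumerate(text):
        while j >= 0 and pattern[j] != c:
            j = border[j]               # shift the pattern, keep the matched border
        j += 1
        if j == m:
            matches.append(i - m + 1)
            j = border[j]               # continue, allowing overlapping matches
    return matches
```

The text index never moves backwards, so the characters already matched are never compared again from scratch.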

Knuth-Morris-Pratt Algorithm

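• Refines the Morris-Pratt algorithm.

• Increases the speed by avoiding shifts that would repeat a comparison already known to fail.

• Has the same asymptotic time and space complexity as Morris-Pratt, O(n+m) and O(m), but a smaller delay per text character.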
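
A sketch of the strengthened shift table in Python is given below. The function names are hypothetical, and the table layout (one entry per pattern position plus one, with -1 as a sentinel) is an assumption chosen for compactness.

```python
def kmp_next_table(pattern):
    """Strengthened border table.

    nxt[i] is the pattern position to resume from after a mismatch at
    position i; candidates that carry the same character as pattern[i]
    are skipped, since they would fail in the same way.
    """
    m = len(pattern)
    nxt = [0] * (m + 1)
    i, j = 0, -1
    nxt[0] = -1
    while i < m:
        while j > -1 and pattern[i] != pattern[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and pattern[i] == pattern[j]:
            nxt[i] = nxt[j]             # same character: reuse the shorter fallback
        else:
            nxt[i] = j
    return nxt


def kmp_search(pattern, text):
    """Return all starting positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    nxt = kmp_next_table(pattern)
    matches = []
    j = 0
    for i, c in enumerate(text):
        while j > -1 and pattern[j] != c:
            j = nxt[j]
        j += 1
        if j == m:
            matches.append(i - m + 1)
            j = nxt[j]
    return matches
```

The search loop has the same shape as a border-table search; only the table differs.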

Simon Algorithm

• Derived from the Deterministic Finite Automaton algorithm.

• The number of backward edges is reduced, but the searching phase is similar.

• The searching time does not depend on the alphabet size.

Apostolico-Crochemore Algorithm

• A refinement of the Knuth-Morris-Pratt algorithm.

• Decreases the number of failed attempts and thus saves time.

• Reduces the number of character comparisons and the space complexity.

Not So Naïve Algorithm

• Follows the searching behavior of the Apostolico-Crochemore algorithm.

• Its time complexity is comparable to that of the Brute Force algorithm.

From Right To Left

Boyer-Moore Algorithm

• Uses two functions: the good-suffix shift and the bad-character shift.

• The larger of the two shift values is applied.
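
The two shift functions and the way they are combined can be sketched in Python as follows. The helper names are hypothetical, and the suffix lengths are computed naively in O(m^2) time to keep the sketch short; this is an assumption of the example, not a property of the algorithm.

```python
def bad_character_table(pattern):
    """Distance from the rightmost occurrence of each character to the pattern end.

    The last pattern position is excluded; absent characters default to m.
    """
    m = len(pattern)
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


def good_suffix_table(pattern):
    """shift[i] = good-suffix shift to apply after a mismatch at position i."""
    m = len(pattern)
    # suff[i]: length of the longest suffix of pattern[:i + 1] that is also
    # a suffix of the whole pattern (computed naively here).
    suff = [0] * m
    for i in range(m):
        k = 0
        while k <= i and pattern[i - k] == pattern[m - 1 - k]:
            k += 1
        suff[i] = k
    shift = [m] * m
    j = 0
    for i in range(m - 1, -1, -1):      # case: a prefix of the pattern is a suffix
        if suff[i] == i + 1:
            while j < m - 1 - i:
                if shift[j] == m:
                    shift[j] = m - 1 - i
                j += 1
    for i in range(m - 1):              # case: the matched suffix reoccurs inside
        shift[m - 1 - suff[i]] = m - 1 - i
    return shift


def boyer_moore_search(pattern, text):
    """Return all starting positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    bad_char = bad_character_table(pattern)
    good_suffix = good_suffix_table(pattern)
    matches = []
    start = 0
    while start <= n - m:
        i = m - 1
        while i >= 0 and pattern[i] == text[start + i]:   # scan right to left
            i -= 1
        if i < 0:
            matches.append(start)
            start += good_suffix[0]
        else:
            bc_shift = bad_char.get(text[start + i], m) - (m - 1 - i)
            start += max(good_suffix[i], bc_shift)        # take the larger shift
    return matches
```

The bad-character shift can be zero or negative on its own; taking the maximum with the good-suffix shift, which is always at least 1, guarantees progress.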

Turbo-BM Algorithm

• A modified Boyer-Moore algorithm.

• Reduces the time complexity by remembering the factor of the text matched during the previous attempt, which allows it to jump over that factor and to perform a turbo-shift.

Apostolico-Giancarlo Algorithm

• A variant of the Boyer-Moore algorithm.

• Remembers the length of the longest suffix of the pattern matched at the end of each attempt and stores it in the table Skip.

• The Suff table, computed while pre-processing the good-suffix shift function, is reused during the search.

• Reduces the number of character comparisons.

Quick search Algorithm

• A simplified Boyer-Moore algorithm.

• Uses only the bad-character shift function.

• Has reduced space complexity.
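
A minimal Python sketch, assuming the shift is taken from the text character immediately to the right of the current window (the function name is hypothetical):

```python
def quick_search(pattern, text):
    """Return all starting positions of pattern in text using Quick Search."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # Shift for each pattern character; later (rightmost) occurrences win.
    shift = {c: m - i for i, c in enumerate(pattern)}
    matches = []
    start = 0
    while start <= n - m:
        if text[start:start + m] == pattern:   # comparison order does not matter
            matches.append(start)
        if start + m >= n:                     # no character right of the window
            break
        # Characters absent from the pattern allow a shift of m + 1.
        start += shift.get(text[start + m], m + 1)
    return matches
```

Only one table indexed by the alphabet is needed, which is where the space saving over Boyer-Moore comes from.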

SSABS Algorithm

• Combines the Quick Search bad-character shift function with the calculation of a skip value for the text window.

• Has reduced time complexity.

Zhu-Takaoka Algorithm

• A variation of the Boyer-Moore algorithm.

• Considers two consecutive characters to calculate the bad-character shift.

• Its search process is fast.

• The Skip table grows quadratically with the alphabet size.

• Increased pre-processing space and time complexity.

Berry-Ravindran Algorithm

• Derived from the Quick Search and Zhu-Takaoka algorithms.

• Uses the two characters immediately to the right of the text window to calculate the bad-character shift value.

• Reduces the number of character comparisons.

• Its space and time complexities are similar to those of the Zhu-Takaoka algorithm.
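
The two-character shift can be sketched in Python as below. This is an illustrative version under stated assumptions: the shift table is stored as a dictionary keyed by character pairs rather than a |Σ| x |Σ| array, positions past the end of the text are represented by `None`, and all names are hypothetical.

```python
def berry_ravindran_search(pattern, text):
    """Return all starting positions of pattern in text.

    The shift is derived from the two text characters that immediately
    follow the current window.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    pair_shift = {}
    for i in range(m - 1):              # rightmost occurrence of each pair wins
        pair_shift[(pattern[i], pattern[i + 1])] = m - i

    def shift_for(a, b):
        if a == pattern[-1]:            # a can align with the last pattern character
            return 1
        if (a, b) in pair_shift:        # the pair ab occurs inside the pattern
            return pair_shift[(a, b)]
        if b == pattern[0]:             # only b can align, with the first character
            return m + 1
        return m + 2                    # neither character can be aligned

    matches = []
    start = 0
    while start <= n - m:
        if text[start:start + m] == pattern:
            matches.append(start)
        a = text[start + m] if start + m < n else None
        b = text[start + m + 1] if start + m + 1 < n else None
        start += shift_for(a, b)
    return matches
```

The largest possible shift is m + 2, one more than Quick Search allows.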

TVSBS Algorithm

• A combination of the Berry-Ravindran and SSABS algorithms.

• Uses the bad-character shift function of the Berry-Ravindran algorithm, whereas its searching phase is similar to that of SSABS.

Reverse Factor Algorithm

• Preferred for long patterns and short texts.

• Improved shift lengths.

• Has quadratic worst-case time complexity, but is optimal on average.

In a Specific Order

Colussi Algorithm

• An enhancement of the Knuth-Morris-Pratt algorithm.

• The pattern positions are divided into two disjoint subsets: one is scanned from left to right and the other from right to left.

• Reduces the time complexity and the number of character comparisons.

Two Way Algorithm

• Requires an ordered alphabet.

• Its processing is similar to that of the Colussi algorithm.

String Matching On Ordered Alphabets Algorithm

• Also requires an ordered alphabet.

• It has no pre-processing phase.

• The characters of the string pattern are compared one by one.

In Any Order

Horspool Algorithm

• A simplified Boyer-Moore algorithm.

• Uses the Boyer-Moore bad-character shift function.

• Saves time during the searching phase by reducing the number of comparisons.
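
A minimal Python sketch, assuming the shift is looked up for the text character aligned with the last pattern position (the function name is hypothetical):

```python
def horspool_search(pattern, text):
    """Return all starting positions of pattern in text using Horspool."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    # Bad-character table built from every pattern position except the last.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    start = 0
    while start <= n - m:
        if text[start:start + m] == pattern:   # comparison order does not matter
            matches.append(start)
        # Shift by the table entry of the window's last character.
        start += shift.get(text[start + m - 1], m)
    return matches
```

Because the shift depends only on the last character of the window, the window itself may be compared in any order, which is why the algorithm falls in this category.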

Smith Algorithm

• Derived from the Horspool and Quick Search algorithms.

• Uses their bad-character shift functions to compute the shift values.

• Its time and space complexities do not differ from those of the algorithms it is derived from.

Raita Algorithm

• Uses the Boyer-Moore bad-character shift function.

• Performs its shifts like the Horspool algorithm.

• Has the same time and space complexities as the Horspool algorithm.

Evaluation

Table 1: Comparison of Exact String Matching Algorithms (ESMAs)

Algorithm               Time complexity (tc)   Space complexity (sc)   Pre-processing time (pt)   Character comparisons (cc)
Brute Force             O(mn)                  constant extra space    no pre-processing          2n
Morris-Pratt            O(n+m)                 O(m)                    O(m)                       2n-1
Apostolico-Crochemore   O(n)                   O(m)                    O(m)                       (3/2)n
Boyer-Moore             O(mn)                  O(m+|Σ|)                O(m+|Σ|)                   3n
Quick Search            O(mn)                  O(|Σ|)                  O(m+|Σ|)                   quadratic worst case
SSABS                   O([n/(m+1)])           -                       -                          O(m(n-m+1)) worst case
Zhu-Takaoka             O(mn)                  O(m+|Σ|^2)              O(m+|Σ|^2)                 quadratic worst case
Berry-Ravindran         O(mn)                  O(m+|Σ|^2)              O(m+|Σ|^2)                 -
TVSBS                   O([n/(m+2)])           O(|Σ|+k^|Σ|)            O(|Σ|+k^|Σ|)               O(m(n-m+1)) worst case
Colussi                 O(n)                   O(m)                    O(m)                       (3/2)n
Skip Search             O(mn)                  O(m+|Σ|)                O(m+|Σ|)                   O(n), quadratic worst case
...

Summary and Conclusion

• Among all the selected exact string matching algorithms (ESMAs), the most recent one, the TVSBS algorithm, is the best for exact string matching because it:

1. has the lowest space and time complexity, both during the pre-processing phase and overall;
2. produces results in fewer attempts;
3. makes fewer character comparisons, even when compared with SSABS.

• As with all surveys, the list of ESMAs here is comprehensive but not yet complete. Newly proposed algorithms are expected to be considered and evaluated in a similar fashion.

References

[1] AHO, A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical Computer Science, Volume A, Algorithms and Complexity, J. van Leeuwen ed., Chapter 5, pp. 255-300, Elsevier, Amsterdam.

[2] CHARRAS, C. and LECROQ, T., Handbook of Exact String Matching Algorithms, http://www-igm.univ-mlv.fr/~lecroq/string/

[3] CROCHEMORE, M. and LECROQ, T., 1996, Pattern matching and text compression algorithms, in CRC Computer Science and Engineering Handbook, A. Tucker ed., Chapter 8, pp. 162-202, CRC Press Inc., Boca Raton, FL.

[4] GONNET, G.H. and BAEZA-YATES, R.A., 1991, Handbook of Algorithms and Data Structures in Pascal and C, 2nd Edition, Chapter 7, pp. 251-288, Addison-Wesley Publishing Company.

[5] GUSFIELD, D., 1997, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press.

[6] THATHOO, R., VIRMANI, A., SAI LAKSHMI, S., BALAKRISHNAN, N. and SEKAR, K., 2006, TVSBS: A fast exact pattern matching algorithm for biological sequences, Current Science, Vol. 91, No. 1, 10 July 2006.

Thanks