
2nd Joint Advanced Student School 2004

Course 1

Complexity Analysis of String Algorithms

St. Petersburg, March 28th – April 7th 2004



Preface

The State University St. Petersburg, the Steklov Institute St. Petersburg, and the Technische Universität München organized the second Joint Advanced Student School (JASS 2004) in St. Petersburg from March 28th through April 7th. It was financed by the Bavarian Ministry of Economics, by Siemens and by Infineon. Four courses with different topics were offered. This booklet contains the papers prepared by the students of course 1, “Complexity Analysis of String Algorithms”.

The course covered advanced techniques for the analysis of string algorithms, applied to fundamental algorithms in this area. There was a special focus on average case analysis and exact analysis (as initiated by D. Knuth). The following topics were covered by St. Petersburg and Munich participants.

1. Data Structures for Pattern Matching. Suffix trees and suffix arrays are basic data structures in pattern matching.

Article: [Ukk95, KS03]. Book chapter: 6 of [Gus97].

Additionally: [MM93]

Presented by: Olga Sergeeva

2. Sub-linear Approximate String Matching. A classical algorithm based on the use of suffix trees and lowest common ancestor queries. The complexity analysis uses basic techniques such as the Chernoff bound.

Article: [CL94]. Book chapter: 2 of [Szp00], 6 and 8 of [Gus97]. Additionally: [BFC00] (LCA)

Presented by: Robert West

3. Approximate Text Indexing. Using simple mathematical arguments the matching probabilities in the suffix tree are bounded, and by a clever division of the search pattern sub-linear time is achieved.

Article: [NBY00]. Book chapter: 6 of [Gus97]

Presented by: Alexander Vahitov

4. Compressed Suffix Arrays. This space-efficient index structure is analyzed on a bit level to reveal linear size.

Article: [GV00, Sad00]. Book chapter: 6 of [Szp00].

Presented by: Fabian Pache


5. Asymptotic Properties of Suffix Trees. Analysis of height and feasible path length using probabilistic tools such as the 2nd Moment Method.

Article: [Szp93]. Book chapter: 4 of [Szp00].

Presented by: Ivan Kazmenko

6. Sequential Pattern Matching. Analysis of Knuth-Morris-Pratt type algorithms using the Subadditive Ergodic Theorem.

Article: [RS98a]. Book chapter: 5 of [Szp00].

Presented by: Tobias Reichl

7. Greedy Algorithms for the SCS Problem. Analysis of some greedy algorithms using tools from information theory such as the asymptotic equipartition property (AEP).

Article: [FS98]. Book chapter: 6 of [Szp00].

Presented by: Anton Nesterov

8. Analysis of Pattern Occurrences. Analysis of the number of occurrences of patterns in text using generating functions.

Article: [GO81, RS98b]. Book chapter: 7 of [Szp00].

Presented by: Roland Aydin

9. Rice’s Integrals. Rice’s integrals (already known to Nörlund) are another basic and successful technique for analyzing the asymptotic behavior of trees.

Article: [FS95]. Book chapter: 8 of [Szp00].

Presented by: Thomas Preu

10. Digital Search Trees. Analysis of different digital trees with Rice’s integrals.

Article: [FS86a]. Book chapter: 8 of [Szp00].

Presented by: Nicolai Baron von Hoyningen-Huene

11. The Mellin Transform. A very popular technique for the analysis of digital trees and the like.

Article: [FGD95]. Book chapter: 9 of [Szp00].

Presented by: Ilja Posov

12. Automaton Searching on Tries. Analysis of a general search technique using automata on tries, based on the eigenvalues of the automaton’s adjacency matrix and the Mellin transform.

Article: [BYG96]. Book chapter: 9 of [Szp00].

Presented by: Mikhail Lakunin


Contents

1 Data Structures for Pattern Matching 9

1.1 Suffixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2 Suffix trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
    1.2.1 Definitions, examples, background . . . . . . . . . . . . 10
    1.2.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 11
    1.2.3 Ukkonen’s algorithm . . . . . . . . . . . . . . . . . . . 11
1.3 Suffix arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
    1.3.1 The very first algorithm . . . . . . . . . . . . . . . . . 15
    1.3.2 Skew algorithm . . . . . . . . . . . . . . . . . . . . . . 16

2 Sublinear Approximate String Matching 19

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
    2.1.1 What? . . . . . . . . . . . . . . . . . . . . . . . . . . 19
    2.1.2 Why? . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
    2.1.3 How? . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 The Auxiliary Tools . . . . . . . . . . . . . . . . . . . . . . . 20
    2.2.1 Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . 20
    2.2.2 Matching Statistics . . . . . . . . . . . . . . . . . . . . 21
    2.2.3 Lowest Common Ancestor . . . . . . . . . . . . . . . . 23
    2.2.4 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 26
    2.3.1 Linear Expected Time . . . . . . . . . . . . . . . . . . 26
    2.3.2 Sublinear Expected Time . . . . . . . . . . . . . . . . . 29
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Approximate string indexing 31

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Our Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Basic Ideas and Algorithms . . . . . . . . . . . . . . . . . . . 32
    3.3.1 Lemma: dividing the pattern . . . . . . . . . . . . . . . 32
    3.3.2 Computing edit distance . . . . . . . . . . . . . . . . . 32
    3.3.3 NFA construction . . . . . . . . . . . . . . . . . . . . . 34
    3.3.4 Depth-First Search technique . . . . . . . . . . . . . . 35
3.4 Main algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 35
    3.4.1 Suffix Arrays . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


4 Compressed Suffix Arrays 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Suffix Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
    4.2.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 39
    4.2.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Compressed Suffix Arrays . . . . . . . . . . . . . . . . . . . . 40
    4.3.1 Compression of the Suffix Array . . . . . . . . . . . . . 40
    4.3.2 Compression of the Auxiliary Arrays . . . . . . . . . . . 42
    4.3.3 Complexity in general . . . . . . . . . . . . . . . . . . 43
    4.3.4 Complexity optimization . . . . . . . . . . . . . . . . . 43
4.4 Extensions of Compressed Suffix Arrays . . . . . . . . . . . . . 44
    4.4.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . 44
    4.4.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . 44

4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5 Asymptotic Properties of Suffix Trees 47
5.1 Suffix Tree Construction . . . . . . . . . . . . . . . . . . . . . 47
5.2 Depth of Insertion in a Suffix Tree . . . . . . . . . . . . . . . . 48
5.3 Height and Shortest Feasible Path in a Suffix Tree . . . . . . . 50
5.4 Proof Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6 Sequential Pattern Matching 53
6.1 Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . 53
    6.1.1 Conventions . . . . . . . . . . . . . . . . . . . . . . . . 53
    6.1.2 Defining Sequential Algorithms . . . . . . . . . . . . . . 54
    6.1.3 Naive / Brute Force Algorithm . . . . . . . . . . . . . . 54
    6.1.4 Knuth-Morris-Pratt . . . . . . . . . . . . . . . . . . . . 54
    6.1.5 Defining Complexity . . . . . . . . . . . . . . . . . . . 55
6.2 Subadditive Ergodic Theorem . . . . . . . . . . . . . . . . . . 55
    6.2.1 Fekete’s Theorem . . . . . . . . . . . . . . . . . . . . . 55
    6.2.2 Subadditive Ergodic Theorem . . . . . . . . . . . . . . 56
6.3 Martingales and Azuma’s Inequality . . . . . . . . . . . . . . . 57
    6.3.1 Basic Properties of Martingales . . . . . . . . . . . . . . 57
    6.3.2 Hoeffding’s Inequality and Azuma’s Inequality . . . . . 58
6.4 Application to KMP . . . . . . . . . . . . . . . . . . . . . . . 58
    6.4.1 Establishing m-Convergence . . . . . . . . . . . . . . . 58
    6.4.2 Establishing Subadditivity . . . . . . . . . . . . . . . . 60
    6.4.3 Applying the Subadditive Ergodic Theorem . . . . . . . 61
    6.4.4 Applying Azuma’s Inequality . . . . . . . . . . . . . . . 61

6.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 62

7 Greedy Algorithms for the SCS Problem 63
7.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2 Greedy algorithms . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.4 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.5 Graph processes . . . . . . . . . . . . . . . . . . . . . . . . . 66
    7.5.1 RGREEDY . . . . . . . . . . . . . . . . . . . . . . . . 66
    7.5.2 GREEDY . . . . . . . . . . . . . . . . . . . . . . . . . 66


8 Analysis of Pattern Occurrences 67
8.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.3 Generating functions of languages . . . . . . . . . . . . . . . . 69
8.4 Declaring languages . . . . . . . . . . . . . . . . . . . . . . . 71
8.5 Language relationships . . . . . . . . . . . . . . . . . . . . . . 71
8.6 Languages & Generating Functions . . . . . . . . . . . . . . . 73
8.7 Looking for Generating Functions . . . . . . . . . . . . . . . . 74
8.8 Main findings I . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.9 On to other shores . . . . . . . . . . . . . . . . . . . . . . . . 75

9 Rice’s integrals 79
9.1 Basics of Complex Analysis . . . . . . . . . . . . . . . . . . . 79
    9.1.1 Complex Differentiability . . . . . . . . . . . . . . . . . 79
    9.1.2 Integration . . . . . . . . . . . . . . . . . . . . . . . . 80
    9.1.3 Holomorphic functions and the Cauchy Integral Theorem . 82
    9.1.4 Cauchy integral formula and residue calculus . . . . . . 83
9.2 Other mathematical formulas . . . . . . . . . . . . . . . . . . 84
    9.2.1 The Gamma function . . . . . . . . . . . . . . . . . . . 84
    9.2.2 Zeta functions and Modified Bell Polynomials . . . . . . 85
9.3 Motivation and Basic Integrals . . . . . . . . . . . . . . . . . . 85
9.4 Integrals of Functions with Poles . . . . . . . . . . . . . . . . 87
    9.4.1 Rational functions . . . . . . . . . . . . . . . . . . . . 87
    9.4.2 Meromorphic functions . . . . . . . . . . . . . . . . . . 90
    9.4.3 Functions with Algebraic and Logarithmic Singularities . 93
9.5 Mellin Transforms and Rice’s Integrals . . . . . . . . . . . . . 97
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

10 Digital Search Trees 99
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.3 Digital Search Trees . . . . . . . . . . . . . . . . . . . . . . . 100
    10.3.1 Data Structure of Binary Search Trees . . . . . . . . . 100
    10.3.2 Data Structure of Digital Search Trees . . . . . . . . . 101
    10.3.3 Internal Path Length . . . . . . . . . . . . . . . . . . . 102
    10.3.4 External Internal Nodes . . . . . . . . . . . . . . . . . 107
10.4 Digital Search Tries . . . . . . . . . . . . . . . . . . . . . . . 111
    10.4.1 Data Structure of Tries . . . . . . . . . . . . . . . . . 111
    10.4.2 Data Structure of Patricia Tries . . . . . . . . . . . . . 112
    10.4.3 External Path Length . . . . . . . . . . . . . . . . . . 112
    10.4.4 External Internal Nodes . . . . . . . . . . . . . . . . . 113
10.5 Multiway Branching . . . . . . . . . . . . . . . . . . . . . . . 114
10.6 General Framework . . . . . . . . . . . . . . . . . . . . . . . 114

11 Mellin transforms and asymptotics 117
11.1 Mellin transform definition . . . . . . . . . . . . . . . . . . . 118
11.2 Mellin transform basic properties . . . . . . . . . . . . . . . . 119
11.3 Singularities . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.4 Direct mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 122
11.5 Converse mapping . . . . . . . . . . . . . . . . . . . . . . . . 125


11.6 Harmonic sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

12 Automaton Searching on Tries 129
12.1 Structure of the paper . . . . . . . . . . . . . . . . . . . . . . 129
12.2 Our task or what we want to do . . . . . . . . . . . . . . . . . 129
12.3 Algorithm in a “few words” . . . . . . . . . . . . . . . . . . . 130
12.4 Basic index structures and notation . . . . . . . . . . . . . . . 130
    12.4.1 Indexing structures . . . . . . . . . . . . . . . . . . . . 130
    12.4.2 Basic definitions of the Theory of Formal Languages . . 132
12.5 A restricted class of regular expressions . . . . . . . . . . . . . 132
12.6 General algorithm . . . . . . . . . . . . . . . . . . . . . . . . 135
    12.6.1 What we want to do there . . . . . . . . . . . . . . . . 135
12.7 Efficiency Estimation for the General Algorithm . . . . . . . . 135
    12.7.1 The structure of the proof . . . . . . . . . . . . . . . . 135
12.8 Applying the General Estimation Theorem . . . . . . . . . . . 137
12.9 Heuristic for optimizing the query . . . . . . . . . . . . . . . . 137
    12.9.1 What we mean by optimizing . . . . . . . . . . . . . . 137
    12.9.2 Substring graph . . . . . . . . . . . . . . . . . . . . . . 137
    12.9.3 How to Use . . . . . . . . . . . . . . . . . . . . . . . . 137
12.10 Open problems and conclusion . . . . . . . . . . . . . . . . . 138


Chapter 1

Data Structures for Pattern Matching

Olga Sergeeva

For many applications, efficient string processing is crucial. Searching for a substring or a subsequence in a string, searching for common substrings of a set of strings, or finding the number of direct repetitions are just a few important examples of the problems that often appear when working with strings. In applications, these problems need to be solved as efficiently (asymptotically) as possible, and the strings also need to be stored economically. So there arises the question of efficient string representations that are both compact and provide a good basis for algorithms.

This paper is concerned with two representations that in some sense reveal the structure of the initial string and thus meet many demands: suffix trees and suffix arrays.

1.1 Suffixes

Definition 1.1. Let s be a string. s′ is called a suffix of s if s = as′ for some string a. Note that the empty string is a suffix of any string.

It turns out that if the suffixes are well structured, the resulting construction can be very informative and a good basis for developing efficient algorithms. Also, both structures we are about to discuss can be built in time linear in the length of the initial string (in this paper, always n).

Why is it possible to build representations based on suffix relations efficiently? The very simple (but also very general) idea is that the suffixes are closely related, being parts of each other. It would be worth looking at how this idea, combined with other ideas, played out in the different algorithms as they historically appeared; in this paper, however, we mostly discuss those which made the most of it.

One of the string representations of interest is the suffix tree.


1.2 Suffix trees

1.2.1 Definitions, examples, background

Definition 1.2. A suffix tree for a string s is a rooted tree with edges marked with substrings of s, having the following properties:

1. The concatenation of the marks along each path from the root to a leaf forms a suffix, and every suffix appears exactly once.

2. The marks on edges leaving the same vertex begin with different symbols of the alphabet.

From the definition you can see that there must be as many leaves in the suffix tree as there are non-empty suffixes of s, i.e. n.

What is good about such a representation? The following example gives some idea of the advantages of suffix trees.

Example 1.1 (Substring search). Suppose that we have a string s and a (shorter) string p, for which we are to answer whether it occurs in s. In fact, even this is not a detailed enough description of the problem. The context is important: will we work with s and p only once? Or do we have a constant pattern which we are to search for in many strings? Or do we have a fixed string s and are to answer many queries about different p? The answers to such questions of course greatly influence the preferred structures and algorithms.

Suppose that it is s which is fixed (some encyclopedia, for example), so it is s that is to be preprocessed. How can we answer whether p is a substring of s, having its suffix tree S? If p is a substring, then it is the beginning of some suffix. In S, every suffix appears as a concatenation of marks from the root to a leaf, so if p begins with a symbol with which no branch from the root begins, we can definitely say that p is not a substring. Otherwise, we should search only in the subtree corresponding to the branch beginning with this symbol (there is only one such branch from the root!). Reasoning the same way for the subsequent vertices, we answer our question, making a constant number of comparisons in each vertex (the alphabet is constant), and there will be no more than |p| (the length of p) steps. So the overall time is O(|p|) operations plus O(|s|) to build the suffix tree.

Note that for our case the distribution of work is very fortunate: although for the first query we spend O(|p| + |s|) time, for all subsequent queries the time will be linear in the length of the pattern! So, although for the first query the time is the same as with, for example, the Knuth-Morris-Pratt algorithm, that algorithm spends O(|s|) time scanning the text for every new pattern, whereas here the work on s is done only once. So in fact we have an algorithm which, after initial preprocessing, answers queries in time linear in the length of p. It is worth mentioning that Donald Knuth did not believe that such an algorithm existed until suffix trees were invented.

Suffix trees have many other applications, for instance in the search for the longest common substring of a pair (and, further, a set) of strings, for repetitions, and for lowest common ancestors. They are also a ‘bridge’ to the much more difficult and important problem of inexact matching, where the given strings may contain errors, and in practice they will contain them!


Figure 1.1: Suffix tree and implicit tree for ’gamma’.

Note. We gave a definition of a suffix tree, but did not check the correctness of the definition: does a suffix tree exist for an arbitrary s? Indeed, if we want to have a leaf for each suffix, then it is impossible to build a suffix tree for a string in which one suffix s1 is a prefix of another suffix s2: we will inevitably spell out the shorter suffix while spelling out the longer one. This problem is easy to solve: if we add a symbol not from the alphabet (say, ‘$’) to the end of s, then no suffix will be a prefix of another suffix.

We will call the tree for a string without the additional symbol an implicit tree for s. In an implicit tree, there is exactly one path spelling out each suffix, but the ends of some suffixes are not marked in any way.

1.2.2 Algorithms

The suffix tree can be built naively, adding suffixes to the tree one by one, beginning with the longest suffix (which is the string itself). To add a new suffix, we try to spell it out in the tree, and at the point where this is no longer possible we create a new branch, marked with the unspelled remainder. This approach requires O(|s|²) time.

The first linear-time algorithm, by Weiner, appeared in 1973 in his article “Linear Pattern Matching Algorithms”. Although it was linear in time, it was space-consuming. A less complex and less space-consuming algorithm was invented in 1976 (McCreight, “A Space-Economical Suffix Tree Construction Algorithm”). Eventually, Ukkonen, in his “On-Line Construction of Suffix Trees” [Ukk95], introduced a simpler algorithm with several nice features.
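To make the naive construction concrete (and, with it, the O(|p|) substring query of Example 1.1), here is a minimal C++ sketch. It is illustrative code written for this report, not taken from the cited papers, and the names are arbitrary: each suffix is spelled out from the root, an edge is split where a mismatch occurs, and edge labels are stored as index pairs into s.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    using namespace std;

    struct Node {
        int l, r;                 // edge label: s[l..r)
        map<char, int> child;     // first character of an edge label -> child index
    };

    struct NaiveSuffixTree {
        string s;
        vector<Node> nodes;       // nodes[0] is the root

        explicit NaiveSuffixTree(const string& text) : s(text + '$') {
            nodes.push_back({0, 0, {}});                     // root (empty label)
            for (int i = 0; i < (int)s.size(); ++i) insertSuffix(i);
        }

        // Spell out suffix s[i..] from the root; branch where it is no longer possible.
        void insertSuffix(int i) {
            int v = 0, pos = i;
            while (true) {
                auto it = nodes[v].child.find(s[pos]);
                if (it == nodes[v].child.end()) {            // no branch starts with s[pos]
                    nodes.push_back({pos, (int)s.size(), {}});
                    nodes[v].child[s[pos]] = (int)nodes.size() - 1;
                    return;
                }
                int u = it->second, k = nodes[u].l;
                while (k < nodes[u].r && s[k] == s[pos]) { ++k; ++pos; }
                if (k == nodes[u].r) { v = u; continue; }    // whole edge matched, descend
                // Mismatch inside the edge: split it at position k.
                nodes.push_back({nodes[u].l, k, {}});        // new inner node
                int mid = (int)nodes.size() - 1;
                nodes[v].child[s[nodes[mid].l]] = mid;
                nodes[u].l = k;
                nodes[mid].child[s[k]] = u;
                nodes.push_back({pos, (int)s.size(), {}});   // leaf for the unmatched rest
                nodes[mid].child[s[pos]] = (int)nodes.size() - 1;
                return;
            }
        }

        // Answer "is p a substring of s?" in O(|p|) by spelling p out from the root.
        bool contains(const string& p) const {
            int v = 0, pos = 0;
            while (pos < (int)p.size()) {
                auto it = nodes[v].child.find(p[pos]);
                if (it == nodes[v].child.end()) return false;
                int u = it->second;
                for (int k = nodes[u].l; k < nodes[u].r && pos < (int)p.size(); ++k, ++pos)
                    if (s[k] != p[pos]) return false;
                v = u;
            }
            return true;
        }
    };

    int main() {
        NaiveSuffixTree st("mississippi");
        cout << st.contains("issip") << ' ' << st.contains("issix") << '\n';   // prints: 1 0
    }

Building the tree this way costs O(|s|²) in the worst case; this is exactly what Ukkonen's algorithm below improves to O(|s|).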

1.2.3 Ukkonen’s algorithm

Description

We will start with a simple but not efficient algorithm, and then in several steps, suggested by common sense (“how not to do excessive work?”), transform it into a linear one. Ukkonen's algorithm is on-line: it is split into |s| phases, after each of which there is an implicit(!) tree for a prefix of s.


Figure 1.2: Extending ’yea’ to ’year’ in different trees.

We begin with the string containing only the first symbol of s, and each phase increases the length of the processed string by one. Phase i+1 is itself split into i+1 extensions, one extension for each of the i+1 suffixes of s[1..i+1]. In extension j of phase i+1 the algorithm finds the end of the path marked with s[j..i]. It then extends this substring, adding s(i+1) to its end if it does not yet exist in the tree. So, in phase i+1 it inserts, one by one, the strings s[1..i+1], s[2..i+1], ..., s[i+1] into the tree, if they do not already exist. T1 consists of one edge, marked with s(1). After the last phase, we carry out one more phase, adding the symbol $. The resulting implicit tree for s$ will be a real suffix tree, as we already pointed out. The formal definition of the algorithm is as follows.

Algorithm.
    Build T1.
    for i from 1 to m − 1 do begin                // phase i + 1
        for j from 1 to i + 1 do begin            // extension j
            find in the current tree the end of the path with mark s[j..i];
            if needed, extend the path with s(i + 1), ensuring the existence
            of the string s[j..i + 1];
        end;
    end;

Let us look closely at the possible cases of extension (see Figure 1.2):

1. We just have to extend a mark on an edge.

2. We have to add an edge (and maybe split an existing edge into two).

3. Nothing needs to be done.

We will refer to these cases as the first, second and third rule of the algorithm. What do we spend time on? On looking for the end of the current suffix in the tree; after that, we spend constant time to prolong it. So it is essential how we search for the ends of the suffixes in Ti. We can find the end of a suffix in time linear in its length, walking from the root each time. In this case, we build Ti+1 from Ti in O(i²) time, so the final tree will appear


after O(n³) operations, compared to O(n²) in the naive algorithm! We will reduce this to O(n) using some observations and techniques. Each of them is (just!) a useful heuristic which by itself cannot qualitatively change the worst-case time estimate, but applied together they result in an essential speed-up.

Suffix links

Definition 1.3. A suffix link is a pointer from an inner vertex v with path mark xα to the vertex s(v) with mark α, if it exists in the tree, where x is a symbol and α is a string. Notice that if α is empty, the suffix link points to the root.

It is easy to see that there is a vertex s(v) for every inner vertex of the tree. Moreover, the following statement is true: if a vertex v with path mark xα is added to the tree in extension j of phase i+1 (that is, the second rule is applied), then the vertex s(v) either already exists in the tree or will be created in the next extension of this phase. In both cases we will find it incidentally, so adding the pointer does not require additional search.

Using suffix links can substantially reduce the amount of search. First, we maintain a pointer to the end of the longest suffix s[1..i], so for the first extension we only need to prolong a mark on the given edge; no search is required. To proceed with the next one, we will not move from the root every time we search for the end of the next suffix. Instead, we walk up from the available ending point to the first inner vertex v (if it is not an inner vertex itself), follow its suffix link to s(v) and search only in the subtree of s(v). If our current suffix can be written in the form xαβ, where xα is the mark of the path to v, then s(v) is marked with α, so trying to spell out the next suffix (which is αβ), we would surely have come through s(v). Adding new suffix links after the transformation of the tree in each extension (if needed), we easily maintain the tree in this form, which is pleasant for processing.

Note. This note will help us estimate the algorithm's complexity. Denote by the depth of a vertex the number of edges on the path from the root to that vertex. Then, at the moment we move along a suffix link, the depth of v exceeds the depth of s(v) by at most 1. The reason is that prefixes of the shorter string (α) occur in the string at least as often, so there are at least as many bifurcations along the path corresponding to α; and in the case that no bifurcations are added compared to the path corresponding to xα, d(xα) − d(α) = 1. So, moving along a suffix link, we do not get much nearer to the root. Later this will help us estimate how many times we can descend, and thus the overall processing time.

Not check, but search

Still, there are unnecessary steps. The end of the next suffix to prolong surely exists in the tree (in the subtree below s(v), as we established), so we do not need to check its existence, but only to find its end. We do not have to compare every symbol on an edge once we have chosen it. If the edge is shorter than the not yet spelled out remainder of the suffix, we can skip over it to the next vertex; if it is longer, we split it at the needed point. This way, we spend constant time descending each edge. All we need for this style of work is to know the number of


symbols on the edge (which is an implementation detail) and to be able to extract a symbol of s by its index in constant time. Along with the note on the correlation of vertex depths, this yields the following

Theorem 1.1. In the improved algorithm, each phase takes O(n) time.

Proof. Summing up, what steps do we perform during one extension? We walk up to the nearest inner vertex (at most one edge); go along the suffix link; descend some vertices; apply one of the rules of suffix extension; and maybe add a new suffix link. All these actions, except descending, take constant time. We need to estimate how many times we descend during one phase. Let us pay attention to how the current vertex depth changes. Walking up decreases it by at most 1, and so does following a suffix link, while each descent increases it; since the depth never exceeds n and it decreases by at most 2 per extension, the total increase, and hence the number of descents, over the phase is not more than 3n.

Corollary 1.1. The current version of Ukkonen’s algorithm terminates in O(n²).

Question. We spent so much effort, and gained nothing in comparison with the naive algorithm?

The last touches

Improving an algorithm, one can encounter the following problem: if the output is large, the algorithm cannot be very fast, for the working time is no less than the output size. This is the case here: if we do not change the representation format of the tree, we will not get further optimization, because in the general case the overall length of the labels on the edges does not have to be linear.

Example 1.2. The suffix tree for the string ‘abcd...xyz’ consists of 26 branches, with marks having 26, 25, ..., 1 letters on them. So the overall length of the marks is 26 · 27/2 = 351. Although for arbitrarily long strings no such example exists, because the size of the alphabet is considered constant, this example gives a notion of how much redundancy the marks can contain.

Example 1.3. The tree for a^n b^n a^(n−1) b^(n−1) ... a^2 b^2 a b.

We will modify the algorithm in a few touches, based on the following observations:

1st touch. If we replace the labels with indices (the beginning and the end of the substring in s), we will have two numbers corresponding to each edge, and as the number of edges is less than 2m − 1, linear space is spent. This also simplifies maintaining the length of an edge (a detail which we need), making it more consistent. Having mentioned that, we can immediately forget that we now store numbers instead of symbols, because working with them is the same.

2nd touch. If xαβ appears in the tree, then certainly αβ appears also. So when we are about to apply the third rule, we can finish the phase. A phase is therefore a sequence of extensions which use the first rule (prolonging a mark) and the second rule (branching off).


3rd touch. A leaf cannot become an inner vertex, because the three ‘rules’ the algorithm uses never transform leaves.

During a phase, we add the same symbol to edge marks. In terms of indices, we increment the ending indices by one, setting them to i in phase i.

We split or do nothing only for the suffixes not processed in the previous phase (those to which we applied the ‘do nothing’ rule).

If in the previous phase we ended in extension j, then at the beginning of the next phase we already know that all we have to do with the first j suffixes is to increment their end index by one. That is why it is efficient to write the mark ‘infinity’ on such edges, where in phase i infinity means i, and to replace these marks only after the tree is built. This means that we build the tree ‘in one pass’, moving only ‘forward’.

The last observation yields the

Theorem 1.2. Ukkonen’s algorithm terminates in O(n).

Difficulties

We saw that suffix trees can make string processing much simpler. At the same time, this is a complicated structure which is sometimes difficult to implement. The reasons are:

1. No “locality” – bad for paging.

2. Dependency on the size of the alphabet Σ: the ‘constant’ for choosing the right branch grows with |Σ|.

3. The number of “children” varies between vertices – there is no single good way of representation: the vertices near the root have many children (almost |Σ|), and for them arrays provide a good representation; the vertices near the leaves have almost no children, arrays would be too sparse, and for these vertices linked lists are preferred; and for ‘middle’ vertices balanced trees and hashing are better (linked lists would increase the search time in case |Σ| is large).

For these reasons, in a number of applications a simpler structure is preferred, although it is a bit less convenient for algorithms.

1.3 Suffix arrays

1.3.1 The very first algorithm

Definition 1.4. The suffix array for a string s is an array containing the suffixes of s in lexicographic order.

The idea to order the suffixes alphabetically belongs to Udi Manber and Gene Myers (1993, “Suffix Arrays: A New Method for On-Line String Searches”). They proposed an algorithm for constructing the array directly (i.e. not from a previously built suffix tree) in O(n log n) time. This algorithm not only


builds the array, but on the way gathers some additional information (useful for algorithms). In their article, Manber and Myers also presented a search algorithm which, using this information, finds a pattern P in O(|P| + log n) time. The clear advantage of suffix arrays in comparison with suffix trees is that they are a much less complicated structure. So they were preferred in several fields even though no linear-time algorithms for direct suffix array construction were known. Such algorithms (three of them, simultaneously and independently!) appeared in 2003, and we will touch upon one of them.

Note. The search time on suffix arrays (O(|P| + log n)) seems to be a great loss in comparison with O(|P|) for suffix trees. But in practice these two terms (O(|P|) and O(log n)) are usually comparable, and the search does not slow down dramatically.
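As a small illustration of querying a suffix array, here is a C++ sketch written for this report (it is not the algorithm of [MM93]): it uses a plain binary search over the array, which gives the weaker O(|P| log n) bound; the O(|P| + log n) bound additionally needs the longest-common-prefix information gathered in [MM93].

    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Substring search on a suffix array: O(|P| log n) binary search.
    bool containsPattern(const string& s, const vector<int>& sa, const string& p) {
        // Find the first suffix that is lexicographically >= p (truncated to |p| symbols).
        int lo = 0, hi = (int)sa.size();
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (s.compare(sa[mid], p.size(), p) < 0) lo = mid + 1;
            else hi = mid;
        }
        // p occurs in s iff that suffix starts with p.
        return lo < (int)sa.size() && s.compare(sa[lo], p.size(), p) == 0;
    }

    int main() {
        string s = "mississippi";
        // Suffix array of s, written out by hand for this small example.
        vector<int> sa = {10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
        cout << containsPattern(s, sa, "ssip") << ' '
             << containsPattern(s, sa, "ssix") << '\n';      // prints: 1 0
    }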

As we mentioned, a suffix array can be built (in linear time) from the corresponding suffix tree. This approach has obvious disadvantages. Here is the idea of an algorithm based on a different approach.

Algorithm. As with the trees, we build the array inductively, making great use of its structure. Initially, we have an array of unordered suffixes. Beginning by sorting the suffixes by their first symbol (which is linear in the string's length using radix sort), in every phase we double the number of symbols the suffixes are sorted on. After phase H, the suffixes are organized into buckets holding suffixes with the same first H symbols.

If A(i) is the suffix in the first bucket, then A(i−H) should be first in its 2H-bucket. We can move it to the beginning of its 2H-bucket and mark this fact. For every bucket, we need to know the number of suffixes in this bucket that have already been moved and placed in 2H-order. The algorithm basically scans the suffixes as they appear in H-order, and for each A(i) it moves A(i−H) (if it exists) to the next available place in its bucket. The number of phases is logarithmic, so the overall running time is O(n log n). The implementation is quite interesting; see [MM93].
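The following compact C++ sketch illustrates the doubling idea. It is a simplified O(n log² n) variant that re-sorts (rank, rank) pairs with std::sort instead of the linear-time bucket bookkeeping of [MM93]; it is meant only to make the "double the number of sorted symbols per phase" idea concrete.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    vector<int> suffixArray(const string& s) {
        int n = s.size();
        vector<int> sa(n), rank_(n), tmp(n);
        for (int i = 0; i < n; ++i) { sa[i] = i; rank_[i] = s[i]; }
        for (int h = 1; ; h <<= 1) {
            // Compare suffixes by (rank of first h symbols, rank of the next h symbols).
            auto cmp = [&](int a, int b) {
                if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
                int ra = a + h < n ? rank_[a + h] : -1;
                int rb = b + h < n ? rank_[b + h] : -1;
                return ra < rb;
            };
            sort(sa.begin(), sa.end(), cmp);
            tmp[sa[0]] = 0;
            for (int i = 1; i < n; ++i)
                tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
            rank_ = tmp;
            if (rank_[sa[n - 1]] == n - 1) break;   // all suffixes already distinguished
        }
        return sa;
    }

    int main() {
        for (int i : suffixArray("mississippi")) cout << i << ' ';
        cout << '\n';                                // prints: 10 7 4 1 0 9 8 6 3 5 2
    }

Replacing the comparison sort by the bucket technique described above yields the O(n log n) bound of Manber and Myers.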

1.3.2 Skew algorithm

In 2003, independently and in parallel, three different direct linear-time suffix array construction algorithms were introduced: by Kim et al.; by Ko and Aluru; and the one we are about to consider, the ‘skew’ algorithm by Juha Kärkkäinen and Peter Sanders. Before getting to the idea of the skew algorithm, we need to refer to some background, and thus take another glance at the history of suffix tree development. Alongside the line of algorithms of Weiner, McCreight and Ukkonen, in which the suffix tree is built inductively, exploiting the tight relations between suffixes to organize the induction very efficiently, there existed another line, introduced by Martin Farach. Farach's suffix tree construction was based on quite a different idea: construct separate trees for the suffixes


starting at odd positions (recursively) and for the remaining suffixes (using the results of the first step), then merge the two suffix trees into one. Merging, being a difficult procedure, relies on structural properties of suffix trees that are not available in suffix arrays. (It is worth mentioning that Farach's approach has the important advantage of not depending on the alphabet size.) Kim (an author of another linear algorithm) managed to perform a similar merging with suffix arrays, but the procedure is still very complicated. The skew algorithm has a similar structure:

1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.

2. Construct the suffix array of the remaining suffixes using the result of the first step.

3. Merge the two suffix arrays into one.

The use of two thirds instead of half of the suffixes in the first step makes the last step quite easy: a simple comparison-based merging is sufficient. For example, to compare suffixes starting at i and j with i mod 3 = 0 and j mod 3 = 1, we first compare the initial characters, and if they are the same, we compare the suffixes starting at i+1 and j+1, whose relative order is already known from the first step (the situation here is quite similar to the one in the algorithm of the previous section). Figure 1.3 gives an example of the algorithm. In [KS03], together with useful ideas and theoretical observations, there is a (short and easily understandable) implementation in C++.

Theorem 1.3. The skew algorithm can be implemented to run in time O(n).

Proof. The second step is easy, using the same idea as step 3. Again, the suffixes S_i with i mod 3 = 0 are sorted by sorting the pairs (s[i], S_{i+1}), where s is the initial string. So it is not difficult to see that the second and third steps require linear time, and the execution time obeys the recurrence T(n) = O(n) + T(2n/3), with T(n) = O(1) for n < 3. This recurrence has the solution T(n) = O(n).
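For completeness, here is the standard one-line justification of that solution: if the work outside the recursive call is at most cn for some constant c, unfolding the recurrence gives

    T(n) ≤ cn + c(2/3)n + c(2/3)²n + ... ≤ cn · (1 + 2/3 + (2/3)² + ...) = 3cn = O(n).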

Note. In the previous section we mentioned that generally the suffix array should be built together with some additional information (an array of the longest common prefixes of suffixes that are adjacent in the suffix array), which makes it much more valuable for algorithms. Referring to [KS03] for details, we will just mention that this information can be gathered in the course of the skew algorithm as well.


Figure 1.3: Skew algorithm, applied to string ’mississippi’.


Chapter 2

Sublinear Approximate String Matching

Robert West

The present paper deals with the subject of approximate string matching and demonstrates how Chang and Lawler [CL94] conceived a new sublinear time algorithm out of ideas that had previously been known. The problem is to find all locations in a text of length n over a b-letter alphabet where a pattern of length m occurs with up to k differences (substitutions, insertions, deletions). The algorithm will run in O((n/m) k log_b m) time when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n).

2.1 Introduction

I never have found the perfect quote. At best I have been able to find a string of quotations which merely circle the ineffable idea I seek to express.

Caldwell O’Keefe

2.1.1 What?

In order to be able to explain why and how we will solve a problem, we are bound to define exactly what the problem is.

Definition 2.1. Given a text string T of length n and a pattern string P of length m over a b-letter alphabet, the k-differences approximate string matching problem asks for all locations in T where P occurs with at most k differences (substitutions, insertions, deletions).

The following example will clarify the matter:


Example.

    TORTEL LINI
       YELTSIN
       *  **

“YELTSIN” matches in “TORTELLINI” with three differences: the ‘Y’ in “YELTSIN” is replaced by a ‘T’, the ‘T’ in “YELTSIN” is deleted and the ‘S’ in “YELTSIN” is replaced by an ‘L’.

2.1.2 Why?

Approximate string matching as a generalisation of exact string matching has been made necessary for one major reason: genetics is the science that has, in recent years, conjured up a set of new challenges in the field of string processing (e.g. the search for a string like “GCACTT...” in a huge gene database). Sequencing techniques, however, are not perfect: the experimental error is up to 5–10%. Moreover, gene mutation (leading to polymorphism; cf. Darwin's ideas) is a condicio sine qua non; it is the mother of evolution. Thus matching a piece of DNA against a database of many individuals must allow a small but significant error.

2.1.3 How?

We will first gather the ingredients (suffix trees, matching statistics, lowest common ancestor retrieval, edit distance) and then merge them to form the algorithm: we will develop the linear expected time algorithm (LET) in detail and will then obtain the sublinear expected time algorithm (SET) after some modifications. We will mainly stick to the paper of Chang and Lawler [CL94]; since it is very concise, I added material from further sources where it seemed necessary for ease of understanding.

2.2 The Auxiliary Tools

Hunger, work and sweat are the best spices.

Icelandic proverb

– Not in our case! Our spices are suffix trees, matching statistics, lowest common ancestor retrieval and edit distance. (An apt and detailed introduction to all of these concepts can be found in [Gus97].)

2.2.1 Suffix Trees

Suffix trees may be called the data structure for all sorts of string matching. Thus it is small wonder that they will be used extensively in our case, too. We will denote the suffix tree of a string P[1..m]$ ($ is the terminal symbol that appears nowhere but at the last position) as S_P. A string α is called a branching word (of P$) iff there are different letters x and


y such that both αx and αy are substrings of P$. Quite obviously, the following correspondences hold:

root ←→ ε (empty string)

internal nodes ←→ branching words

leaves ←→ suffixes

Next we define floor(α) to be the longest prefix of α that is a branching word, and ceil(α) to be the shortest extension of α that is a branching word or a suffix of P$. Note that α is a branching word iff floor(α) = ceil(α) = α. To make matters simpler, we introduce the ‘inverted string’: let β⁻¹α be the string α without its prefix β; of course this only makes sense if β really is a prefix of α. The nodes of the tree are labelled with the corresponding branching words or suffixes respectively, while an edge (β, α) is labelled with the triple (x, l, r) such that P$[l] = x and β⁻¹α = P$[l..r]. The following are intuitively clear: son(β, x) := α; len(β, x) := r − l + 1. Furthermore, let first(β, x) := l (the position of the first letter in P$, not the letter itself, which is already known to be x). At last, let shift(α) be α without its first letter (if α ≠ ε); applying the shift function thus means following a suffix link (cf. [Ukk95, Gus97]).

2.2.2 Matching Statistics

The next crucial data structure to be used is the matching statistics. It stores information about the occurrence of a pattern in a text; this is put more precisely by the following

Definition 2.2. The matching statistics of a text T[1..n] with respect to a pattern P[1..m] is an integer vector M_{T,P} together with a vector M′_{T,P} of pointers to the nodes of S_P, where M_{T,P}[i] = l if l is the length of the longest substring of P$ (anywhere in P$) matching exactly a prefix of T[i..n], and where M′_{T,P}[i] points to ceil(T[i..i+l−1]). More shortly, we will write M and M′.

Our goal is to find an O(n + m) time algorithm for computing the matching statistics of T and P in a single left-to-right scan of T using just S_P. (We will follow [CL94].) Brief description of the algorithm: the longest match starting at position 1 in T is found by walking down the tree, matching one letter at a time. Subsequent longest matches are found by following suffix links and carefully going down the tree (cf. Ukkonen's construction of the suffix tree: the “skip-and-count trick”, [Ukk95, Gus97]).

What follows now are some notes that shall serve to clarify the pseudo-code given later:

• i, j, k are indices into T :

– The i-th iteration computes M[i] and M′[i].

– Position k of T has just been scanned.


– j is some position between i and k.

• Invariants:

– At all times true:
  (1) T[i..k−1] is a substring of P; T[i..j−1] is a branching word of P.

– After step 3.1 the following becomes true:
  (2) T[i..j−1] = floor(T[i..k−1]) and corresponds to node α.

– After step 3.2 the following becomes true as well:
  (3) T[i..k] is not a word.

• If j < k after step 3.1, then T[i..k−1] is not a branching word (2), so neither is T[i−1..k−1]. So, as substrings of P, they must have the same single-letter extension. We know from iteration i−1 that T[i−1..k−1] is a substring of P (1) but T[i−1..k] is not (3), so T[k] cannot be this letter. Hence the match cannot be extended.

• Together invariants (1) and (3) imply M[i] = k − i.

• i, j, k never decrease and are bounded by n: i + j + k ≤ 3n. For every constant amount of work in step 3, at least one of i, j, k is increased. The running time is therefore O(n) for step 3, and O(m) for steps 1 and 2 (use e.g. Ukkonen's [Ukk95, Gus97] or McCreight's [Gus97] algorithm to construct the suffix tree), together yielding the desired O(n + m).

After the above explanations, the following code, which computes the matching statistics of T with respect to P, should be more easily understandable:


1    construct S_P in O(m) time
2    α := root; j := k := 1
3    for i := 1 to n do
3.1      while (j < k) ∧ (j + len(α, T[j]) ≤ k) do        // “skip and count”
             α := son(α, T[j]); j := j + len(α, T[j])
         elihw
3.2      if j = k then                                     // extend the match
             while son(α, T[j]) exists ∧ T[k] = P$[first(α, T[j]) + k − j] do
                 k := k + 1
                 if k = j + len(α, T[j]) then
                     α := son(α, T[j]); j := k
                 fi
             elihw
         fi
3.3      M[i] := k − i
         if j = k then M′[i] := α else M′[i] := son(α, T[j]) fi
3.4      if (α is root) ∧ (j = k) then j := j + 1; k := k + 1 fi
         if (α is root) ∧ (j < k) then j := j + 1 fi
         if (α is not root) then α := shift(α) fi
     rof

2.2.3 Lowest Common Ancestor

Having introduced the notion of suffix trees, we will at some points while deriving the main algorithm be interested in the node that is an ancestor of two given nodes of the tree and that is ‘minimal’ with that quality, meaning it is the furthest such node from the root. More formally:

Definition 2.3. For nodes u, v of a rooted tree T, lca(u, v) is the node furthest from the root that is an ancestor of both u and v.

Our goal is constant-time LCA retrieval after some preprocessing; it is achieved by reducing the LCA problem to the range minimum query (RMQ) problem, as proposed in [BFC00]. RMQ operates on arrays and is defined as follows.

Definition 2.4. For an array A and indices i and j, rmq_A(i, j) is the index of the smallest element in the subarray A[i..j].

For ease of notation we will say that, if an algorithm has preprocessing time p(n) and query time q(n), it has complexity ⟨p(n), q(n)⟩. The following main lemma states how closely the complexities of RMQ and LCA are connected.

Lemma 2.1. If there is a ⟨p(n), q(n)⟩-time solution for RMQ on a length-n array, then there is a ⟨O(n) + p(2n−1), O(1) + q(2n−1)⟩-time solution for LCA in a tree with n nodes.

The O(n) term will come from the time needed to create the soon-to-be-presented arrays. The O(1) term will come from the time needed to convert the RMQ answer on one of these arrays into the LCA answer in the tree.


Proof. The LCA of nodes u and v is the shallowest (i.e. closest to the root) node between the visits to u and v encountered during a depth-first search (DFS) traversal of T (n nodes; labels 1, ..., n).

Therefore, the reduction proceeds as follows:

1. Let the array D[1..2n−1] store the nodes visited in a DFS of T: D[i] is the label of the i-th node visited in the DFS.

2. Let the level of a node be its distance from the root. Compute the level array L[1..2n−1], where L[i] is the level of node D[i].

3. Let the representative of a node be the index of its first occurrence in the DFS. Compute the representative array R[1..n], where R[w] = min{j | D[j] = w}.

All of this is feasible during a single DFS; thus the running time is O(n). Now the LCA may be computed as follows (suppose u is visited before v): the nodes between the first visits to u and v are stored in D[R[u]..R[v]]. The shallowest node in this subtour is found at index rmq_L(R[u], R[v]). The node at this position, and thus the output of lca(u, v), is D[rmq_L(R[u], R[v])]. The time complexity is as claimed in the lemma: just L (of size 2n−1) must be preprocessed for RMQ, so the total preprocessing time is O(n) + p(2n−1). For the query, one RMQ in L and three constant-time array lookups are needed; in total we get O(1) + q(2n−1) query time. □

What about RMQ's complexity? After precomputing (at least a crucial part of) all possible queries, the lookup time is q(n) = O(1). The preprocessing time p(n) is O(n³) with a brute-force algorithm: for all possible index pairs, search for the minimum. It is O(n²) if we fill the table by dynamic programming. This is, however, still naive. A far better algorithm runs in O(n log n) time: precompute only the queries for blocks of power-of-two length; the remaining answers can then be inferred in constant time at query time. Finally, a really clever O(n) time solution was proposed by Bender and Farach-Colton: it makes use of the fact that adjacent elements in L differ by exactly ±1, so only solutions for the few generic ±1 patterns are precomputed. It can be shown that there are only O(√n) such patterns and that the preprocessing is then O(√n (log n)²) = O(n).

For details regarding the two latter cases, I refer to [BFC00]; dwelling upon them would prolong this report unduly.
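To make the reduction concrete, here is a small self-contained C++ sketch written for this report (it is not taken from [BFC00]): a DFS produces the arrays D, L and R, and RMQ on L is answered with a sparse table, i.e. the ⟨O(n log n), O(1)⟩ variant mentioned above.

    #include <iostream>
    #include <vector>
    using namespace std;

    struct LCA {
        vector<int> D, L, R;                 // Euler tour, levels, representatives
        vector<vector<int>> sparse;          // sparse[k][i]: index of the minimum of L[i..i+2^k)

        LCA(const vector<vector<int>>& children, int root) : R(children.size(), -1) {
            dfs(children, root, 0);
            int m = L.size();
            sparse.push_back(vector<int>(m));
            for (int i = 0; i < m; ++i) sparse[0][i] = i;
            for (int k = 1; (1 << k) <= m; ++k) {
                sparse.push_back(vector<int>(m - (1 << k) + 1));
                for (int i = 0; i + (1 << k) <= m; ++i) {
                    int a = sparse[k - 1][i], b = sparse[k - 1][i + (1 << (k - 1))];
                    sparse[k][i] = L[a] <= L[b] ? a : b;
                }
            }
        }

        void dfs(const vector<vector<int>>& children, int v, int depth) {
            R[v] = D.size(); D.push_back(v); L.push_back(depth);
            for (int c : children[v]) {
                dfs(children, c, depth + 1);
                D.push_back(v); L.push_back(depth);   // back at v after each child
            }
        }

        int rmq(int i, int j) {              // index of the minimum of L[i..j]
            int k = 31 - __builtin_clz(j - i + 1);
            int a = sparse[k][i], b = sparse[k][j - (1 << k) + 1];
            return L[a] <= L[b] ? a : b;
        }

        int query(int u, int v) {            // lca(u, v) = D[rmq_L(R[u], R[v])]
            int i = min(R[u], R[v]), j = max(R[u], R[v]);
            return D[rmq(i, j)];
        }
    };

    int main() {
        // Small example tree: node 0 is the root with children 1 and 2; 1 has children 3 and 4.
        vector<vector<int>> children = {{1, 2}, {3, 4}, {}, {}, {}};
        LCA lca(children, 0);
        cout << lca.query(3, 4) << ' ' << lca.query(3, 2) << '\n';   // prints: 1 0
    }

For the application in this chapter, the tree handed to such a structure would be the suffix tree S_P.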

2.2.4 Edit Distance

We will now introduce another notion, the ‘edit distance’, which will be needed in the main algorithm:

Definition 2.5. The edit distance (or Levenshtein distance) between two strings S1 and S2 is the minimum number of edit operations (insertions, deletions, substitutions) needed to transform S1 into S2.

Such a transformation may be coded in an edit transcript, i.e. a string over the alphabet {I, D, S, M}, meaning “insertion”, “deletion”, “substitution” or “match” respectively. The following example will clarify this (S1 is to be transformed into S2). Note that there may be more than one transformation – even more than one optimal transformation.

Example.

    transcript:   S I M D M D M M I
    S1:           v   i n t n e r
    S2:           w r i   t   e r s


In the ‘naive’ dynamic programming solution presented now, we do not need the auxiliary tools already gathered (suffix trees, matching statistics, LCA); later on – in the Landau–Vishkin algorithm – we will use them to obtain a lower complexity.

Lemma 2.2. The edit distance is computable using dynamic programming.

Proof. What we want to do is build the table E where E[i, j] denotes the edit distance between S1[1..i] and S2[1..j]. Then E[n1, n2] will be the edit distance between the complete strings S1[1..n1] and S2[1..n2]. The base conditions are E[i, 0] = i (all deletions) and E[0, j] = j (all insertions). The recurrence scheme is

    E[i, j] = min{ E[i, j−1] + 1, E[i−1, j] + 1, E[i−1, j−1] + I_ij },

where I_ij = 0 if S1[i] = S2[j], and I_ij = 1 otherwise.

The last letter of an optimal transcript is one of I, D, S, M. The recurrence selects the minimum over these possibilities. For example, if the last letter of an optimal transcript is I, then S2[n2] is appended to the end of S1, and what remains to be done is to transform S1 into S2[1..n2−1]. The edit distance between these two strings is already known, and the first of the three possibilities is chosen. □

Here is part of the table for the example given above (the arrows point to the cells from which a particular cell may be computed and thus represent one of I, D, S, M). That there is often more than one arrow makes clear once more that there may be more than one optimal way to transform one string into another. Each trace that follows a sequence of arrows from cell (n1, n2) back to cell (0, 0) represents an optimal transcript.

    E[i,j]        S2:  w    r    i    t    e    r    s
    S1             0   1    2    3    4    5    6    7
          0        0  ←1   ←2   ←3   ←4   ←5   ←6   ←7
      v   1       ↑1   1   ←2   ←3   ←4   ←5   ←6   ←7
      i   2       ↑2  ←2    2    2    ...
      n   3       ↑3
      t   4       ↑4
      n   5       ↑5
      e   6       ↑6
      r   7       ↑7

The complexity is O(|S1| · |S2|). Also note that adjacent entries in any row or column differ by at most one, and that along each diagonal the values are non-decreasing and also differ by at most one.
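A direct C++ implementation of this dynamic program (illustrative only; the names follow the text) reproduces the table above:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Edit distance by the dynamic program of Lemma 2.2:
    // E[i][j] is the distance between S1[1..i] and S2[1..j].
    int editDistance(const string& s1, const string& s2) {
        int n1 = s1.size(), n2 = s2.size();
        vector<vector<int>> E(n1 + 1, vector<int>(n2 + 1));
        for (int i = 0; i <= n1; ++i) E[i][0] = i;        // base: all deletions
        for (int j = 0; j <= n2; ++j) E[0][j] = j;        // base: all insertions
        for (int i = 1; i <= n1; ++i)
            for (int j = 1; j <= n2; ++j) {
                int sub = s1[i - 1] == s2[j - 1] ? 0 : 1;
                E[i][j] = min({E[i][j - 1] + 1,           // insertion
                               E[i - 1][j] + 1,           // deletion
                               E[i - 1][j - 1] + sub});   // substitution / match
            }
        return E[n1][n2];
    }

    int main() {
        cout << editDistance("vintner", "writers") << '\n';   // prints: 5
    }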

We need, however, a slightly different thing: the minimum number of operations to transform P[1..m] so that it occurs in T[1..n], not so that it actually equals T; i.e. we want starting spaces to be “free”. Thus we need a table D, where

    D[i, j] := min_{1≤l≤j} (edit distance between P[1..i] and T[l..j]).

We achieve this by simply changing the base conditions: D[i, 0] = i (as before, all deletions); D[0, j] = 0 (ε ends anywhere). There is a match if row m is reached and the value there is ≤ k.
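The same program with the changed base conditions gives a k-differences matcher. The following sketch (again written only for illustration) reports every end position j with D[m, j] ≤ k:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Table D with "free" starting spaces: D[0][j] = 0 lets an occurrence of P
    // begin anywhere in T; a value <= k in row m signals a match ending at column j.
    vector<int> approxMatchEnds(const string& P, const string& T, int k) {
        int m = P.size(), n = T.size();
        vector<vector<int>> D(m + 1, vector<int>(n + 1));
        for (int i = 0; i <= m; ++i) D[i][0] = i;        // all deletions, as before
        for (int j = 0; j <= n; ++j) D[0][j] = 0;        // the empty pattern ends anywhere
        for (int i = 1; i <= m; ++i)
            for (int j = 1; j <= n; ++j)
                D[i][j] = min({D[i][j - 1] + 1, D[i - 1][j] + 1,
                               D[i - 1][j - 1] + (P[i - 1] == T[j - 1] ? 0 : 1)});
        vector<int> ends;
        for (int j = 1; j <= n; ++j)
            if (D[m][j] <= k) ends.push_back(j);         // match ends at text position j
        return ends;
    }

    int main() {
        // For the YELTSIN/TORTELLINI example of Section 2.1 the output contains
        // position 9, the end of the occurrence with three differences.
        for (int j : approxMatchEnds("YELTSIN", "TORTELLINI", 3)) cout << j << ' ';
        cout << '\n';
    }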

It is possible to reduce the complexity from O(mn) to O(kn); the method is called the Landau–Vishkin algorithm (LV):


Call cell D[i, j] an entry of diagonal j − i (range: −m, ..., n).LV makes use of one fact: When computing D, we may not omit whole rows or wholecolumns, but we may in fact omit whole diagonals. That is why we will not computeD but, column by column, the (k+ 1)× (n+ 1) “meta table” L where L[x, y] (with xranging from 0 to k and y from 0 to n) is the row number of the last (i.e. deepest) xalong diagonal y − x.−k ≤ y − x ≤ n, so all relevant diagonals and thus solutions are represented becauseD[k + 1, 0] = k + 1 > k and diagonals are non-decreasing.There is a solution if row m is reached in D, i.e. if L[x, y] = m; then there is a matchending at position m+ y − x with x differences.

First, define L[x,−1] = L[x,−2] := −∞; this is sensible because every cell of diagonal−1− x is at least D[x+ 1, 0] = x+ 1 > x.Now fill row 0: L[0, y] = jump(1, y+1), where jump(i, j) is the longest common prefixof P [i..m] and T [j..n], i.e.

jump(i, j) := minM[j], length of word lca(M′[j], leaf P$[i..m]).

Consider some part of L:y→

x α β γ↓ L[x, y]

α = L[x− 1, y− 2], β = L[x− 1, y− 1] and γ = L[x− 1, y] denote respectively the rownumbers of the last x− 1 on diagonals y− x− 1, y− x and y− x+ 1. Along diagonaly−x the cells at rows α, β+1 and γ+1 are at most x (due to the non-decreasingnessand difference by at most one). The cell in row β + 1 must be exactly x. So it followsfrom the non-decreasingness of diagonals that the deepest of these three cells (in rowt := maxα, β + 1, γ + 1) must have value x, too. To find the row where the lastx occurs on diagonal y − x, we go down this diagonal as long as the value does notincrease to x+ 1. Thus cell (x, y) may be computed as follows:

L[x, y] = t+ jump(t+ 1, t+ 1 + y − x)

2.3 The Algorithm

Virtus in usu sui tota posita est.

Marcus Tullius Cicero, De re publica

– “The value of virtue is entirely in its use.” So let us apply all the “virtual” assetswe have gathered up to now and make use of them in the core algorithm.

2.3.1 Linear Expected Time

The algorithm me develop first will run in O(n) time when the following two conditionshold:

1. T [1..n] is a uniformly random string over a b-letter alphabet.

2. The number of differences allowed in a match is

k < k∗ =m

logb m+ c1− c2.

(the constants ci are to be specified later; m is the pattern length)

Page 27: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

2.3. THE ALGORITHM 27

Pattern P need not be random.

The algorithm is named after those who invented it – Chang and Lawler (CL) – andis given in pseudo-code below:

s1 := 1; j := 1do

sj+1 := sj + M[sj ] + 1; // compute the “start positions”j := j + 1

until sj > njmax := j − 1for j := 1 to jmax do

if (j + k + 2 ≤ jmax) ∧ (sj+k+2 − sj ≤ m− k) thenapply LV to T [sj ..sj+k+2 − 1] fi // “work at sj”

rof

It is not presently obvious that the algorithm is correct. So we will have a closer lookat it:

If T [p..p + d − 1] matches P and sj ≤ p ≤ sj+1, then this string can be written inthe form ζ1x1ζ2x2...ζk+1xk+1, where each xl is a letter or empty, and each ζl is asubstring of P . As may be shown by induction, for every 0 ≤ l ≤ k + 1, sj+l+1 ≥p + length(ζ1x1...ζlxl). So in particular sj+k+2 ≥ p + d, which implies sj+k+2 − sj ≥d ≥ m − k. So CL will perform work at start position sj and thereby detect there isa match ending at position p+ d− 1.

If we can show the probability to perform work at s1 is small, this will be true for allsj ’s because they are all stochastically independent and equally distributed (becauseany knowledge of all the letters before sj is of no use when “guessing” sj+1).Since sk∗+3− s1 ≥ sk+3− s1 and m−k ≥ m−k∗, the event sk+3− s1 ≥ m−k impliesthe event sk∗+3 − s1 ≥ m− k∗.So Pr[sk∗+3 − s1 ≥ m − k∗] ≥ Pr[sk+3 − s1 ≥ m − k] and it suffices to prove thefollowing lemma.

Lemma 2.3. For suitably chosen constants c1 and c2, and k∗ = mlogb m+c1

− c2,

Pr[sk∗+3 − s1 ≥ m− k∗] < 1/m3.

Proof. For the sake of easiness, let us assume (i) b = 2 (b > 2 gives slightly smallerci’s) and (ii) k∗ and logm are integers (logm := log2m).

Let Xj be the random variable sj+1 − sj .

Note that sk∗+3 − s1 = X1 + ... +Xk∗+2 (telescope sum).

There are m2d different strings of length logm+ d, but at most m such substrings ofP . Together with the fact that X1 = M[1] + 1, we have

Pr[X1 = logm+ d+ 1] < 2−d for all integer d ≥ 0 (2.1)

Note that all theXj are stochastically independent and equally distributed (you do notgain any knowledge if you know how big the gap between the previous start positionswas). So E[Xj ] = E[X1] for all j. We will now show that E[Xj ] < logm+ 3:

Page 28: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

28 CHAPTER 2. SUBLINEAR APPROXIMATE STRING MATCHING

E[Xj ] = 1 + E[M[1]] = 1 + logm+ E[M[1] − logm]

= 1 + logm+∞X

d=1

dPr[M[1] − logm = d]

= 1 + logm+∞X

d=1

dPr[X1 = logm+ d+ 1]

< 1 + logm+∞X

d=1

d

„1

2

«d

= 1 + logm+1

2/

1− 1

2

«2

= logm+ 3

Now let Yi := Xi − m−k∗

k∗+2and

apply Markov’s inequality: Pr[X ≥ h] ≤ E[X]/h, for all h > 0 (t > 0):

Pr[X1 + ... +Xk∗+2 ≥ m− k∗] = Pr[Y1 + ... + Yk∗+2 ≥ 0]

= Pr[et(Y1+...+Yk∗+2) ≥ et·0]

≤ E[et(Y1+...+Yk∗+2)]/1

= E[etY1 · ... · etYk∗+2 ]

= E[etY1 ] · ... ·E[etYk∗+2 ]

= E[etY1 ]k∗+2

Note that we used the independence and equal distribution of the variables Yj , whichfollows from the independence and equal distribution of the variables Xj .Inequality (2.1): Pr[X1 = logm+ d+ 1] < 2−d, is equivalent to

Pr[Y1 = logm+ d+ 1 − m− k∗k∗ + 2

] < 2−d for all integer d ≥ 0

So, the theorem of total expectation implies, for all t > 0 (α := logm+ 1− m−k∗

k∗+2),

E[etY1 ] = E[etY1 |Y1 ≤ α] · Pr[Y1 ≤ α]| z

≤1

+

+

∞X

d=1

E[etY1 |Y1 = α+ d] · Pr[Y1 = α+ d]

≤ etα +∞X

d=1

et(α+d) · Pr[Y1 = α+ d]

<∞X

d=0

et(α+d) · 2−d

So we have up to now:

Pr[sk∗+3 − s1 ≥ m− k∗] ≤ E[etY1 ]k∗+2

< (

∞X

d=0

et(α+d) · 2−d)k∗+2,

and choosing t = loge 2

2and doing some algebra will yield:

Page 29: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

2.3. THE ALGORITHM 29

= (√

∞X

d=0

√2−d

)k∗+2

≤ (√

2α+3.6

)k∗+2

≤√

2(k∗+2) log m−(m−k∗)+4.6(k∗+2)

<√

2(5.6−c1)(k∗+2)−(c2−2)(c1+log m)

which is less than 1/m3 if c1 = 5.6 and c2 = 8.So the probability to perform work at position s1 and thus at each position is lessthan 1/m3. Thus LV is applied with a probability of less than 1/m3. The text it isapplied to is supposed to have length (k+ 2)E[X1] < (k+ 2)(logm+ 3) = O(k logm),and LV has complexity O(kl), if l is the length of the input string. Also recall thatk = O( m

log m). So the average expected work for any start position sj is

m−3O(k2 logm) = m−3O(m2

(logm)2logm)

= O(1

m logm)

= O(λn.λm.1)

Hence the total expected work is O(n). 2

2.3.2 Sublinear Expected Time

Now an algorithm is derived from LET that is sublinear in n (when k < k∗/2 − 3; k∗

as before).

The trick is to partition T into regions of length m−k2

. Then any substring of T thatmatches P must contain the whole of at least one region:

Now, starting from the left end of each region R, compute k + 1 “maximum jumps”(using M, as in LV), say ending at position p. If p is within R, there can be no matchcontaining the whole of R. If p is beyond R, apply LV to a stretch of text beginningm+3k

2letters to the left of R and ending at p.

A variation of the proof for LET yields that

Pr[p is beyond R] < 1/m3.

So, similarly to the analysis of LET, the total expected work is:

m−3 2n

m− k| z

] regions

[(k + 1)(logm+O(1)) +O(m)]| z

exp. work at region examined

= ... = O(n/m3)

= o(n)

2

Page 30: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

30 CHAPTER 2. SUBLINEAR APPROXIMATE STRING MATCHING

To understand what an asset the algorithm is, consider the following facts:A combination of LET (for k ≥ k∗/2 − 3) and SET (for k < k∗/2 − 3) runs inO( n

mk logm) expected time. In a 16-letter alphabet, k∗ may be up to 25% of m, in a

64-letter alphabet even 35%.

2.4 Conclusion

Gut gekaut ist halb verdaut.

German proverb

The problem we have dwelt upon over the last few pages demonstrates in a beautifulmanner how important a thorough preprocessing is in the business of algorithm design:Having gathered all the ingredients (data structures and auxiliary algorithms), mergingthem into the core algorithm was (although not obvious) quite short a task. – But thathas been common knowledge for centuries: “A good chewing is half the digestion,” asgoes the translation of the above saying.

Page 31: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 3

Approximate stringindexing: comments toslideshowAlexander Vahitov

Using simple mathematical arguments the matching probabilities inthe suffix tree are bound and by a clever division of the search patternsub-linear time is achieved.

The report is based on the article of G. Navarro and R. Baeza-Yates ’AHybrid Indexing Method For Approximate String Matching’

3.1 Introduction

First Section 3.2 is about the task of the algorithm with some simple examples. NextSection 3.3 will tell you some basic ideas and algortihms used in the main resultingalgorithm of the report, which is presented in Section 3.4. Then comes Section 3.5where you will find some ideas to prove the average-case complexity of the algorithm(full complexity analysis can be found in the issue of G. Navarro and R. Baeza-Yates).The last part of my report is Section 3.6 with conclusions and future directions ofscientific work in this field.

3.2 Our Task

We have a long text T and a short pattern (P ). Our task is to find substrings from Twhich match our pattern approximately. Approximate matching means that we canaccept some errors during the searching process (you can imagine that these are trans-portation errors). These errors are differences between the pattern and the foundedtext piece (ocuurence). We define 3 kinds of differences between strings: insertion,replacement and deletion.

If we have a pattern abc and a text adbc, then there are some obvious examples ofdifferences between this strings :adc = a+ d+ b+ c - insertion

abd = a+ b+ c→ d - replacementab = a+ b+ c→ ∅ - deletion

31

Page 32: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

32 CHAPTER 3. APPROXIMATE STRING INDEXING

We call the minimum number of such changes needed to transform one string toanother as edit distance (abbreviated ed) between the strings. For example, all thementioned above strings have the edit distance equal to 1 (because we needed fortransformation only 1 change).If S can be transformed to S′ with x changes, then S′ can be transformed to S alsowith x changes (we remove deletions with insertions, and vice versa).The resulting algorithm, which will be presented later, has to solve the approximatesearching problem. There are some different approaches to this problem, and at firstI will tell you some of them. The first variant is to build a suffix trie for the text andsearch in O(n) time for the occurences. We can descend by the branches of the trietill the level where we can understand that this branch does not contain an occurence.If we use suffix array structure, we will free much memory (tis problem arises becausesuffix trees have very big memory requirements). This method of search is calledDepth-First Search. We can generate a set of viable prefixes (possible prefixes for thepattern occurence) and search for them in our trie.The second way is to build an on-line filtering algorithm. It can use some sort of index(for example, many algorithms are based on storing text q-grams - pieces of the textwith length equal to some number q).But it is obvious that the number of errors insuch algorithms is strongly bounded, nd it can be incompatible with many practicalsituations.There is another outstanding algorithm by Myers based on the mixture of the twoapproaches. It uses q-grams. It divides a pattern in such pieces that their length isless than q− k (where k is error-level, error number divided by pattern length). Thenit generates for each text piece all the strings that can appear from the pattern whensome errors are done. All this strings are searched then in the q-gram set, and thelast step is merging founded piece occurences and searching for the occurence of thewhole pattern.Our algorithm at first will divide our pattern into some pieces. Then it will useapproximate search to find the occurences of this pieces in the text, and then verifywhether the occurence of some pattern piece can be continued to the occurence of thewhole pattern.

3.3 Basic Ideas and Algoritms

3.3.1 Lemma: dividing the pattern

Dividing the pattern is useful for our algorithm. If we have two strings A and B,theiredit distance ed ≤ k and we divide A intoj substrings, then at least one of the sub-strings appears in B with at most bk/jc errors. This is obvious because we have tochange A k times to transform it to j. Each change is applied to one of the substrings.The average number of changes per substring is k

j. So it is easy to see that there is at

least one of Ai that has less than or equal to kj

changes.

Example:ed(′he likes′,′ they like′) = 3 = k;A1 =′ he ′, A2 =′ likes′ ⇒ j = 2;

ed(′he ′,′ they ′) = 2; ed(′likes′,′ like′) = 1 = b kjc.

3.3.2 Computing edit distance

Computing edit distance is also useful for approximate string matching. There is aclassical dynamic programming algorithm solving this problem. We have 2 strings: xand y with characters xi and yj . Let’s consider Cij as edit distance between x1..xi

Page 33: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

3.3. BASIC IDEAS AND ALGORITMS 33

and y1..yj . Let’s fill the matrix of Cij with such algorithm: Ci,0 = i because youneed i insertions to the empty string to change it to x1..xi. And also C0,j = j bythe same cause. Cij = Ci−1,j−1 if xi = yj . Another case is when xi 6= yj . Youneed one deletion of the character yj from the string y1...yj to make it matchingx1...xi with Ci,j−1 errors. Also you may delete xi to make Cij equal to Ci−1,j+1.And you can replace xi with yj to make Cij equal to Ci−1,j−1. So if xi 6= yj thened(x1..xi, y1..yj) = 1 +minCi−1,j ;Ci,j−1;Ci−1,j−1.An example shows how does the algorithm work with ′survey′ and ′surgery′ strings.The cases when xi = yj are marked with green, and other cases - with red. There arearrows from the minimal matrix element (left, upper or diagonal) which summed with1 gives us Cij to make it matching x1..xi with Ci,j−1 errors.

x = x1x2 . . . xm; y = y1y2 . . . yn;xp, yq ∈ ΣCij = ed(x1 . . . xi, y1 . . . yj);

(C) is a matrix filled with Cij

C0,j = j;Ci,0 = i;

Ci,j =

Ci−1,j−1 xi = yj

1 + minCi−1,j , Ci−1,j−1, Ci, j − 1 else;

Now let’s consider y as the text and x as the pattern. Let’s construct the algorithmto search the substrings from the text, approximately matching the pattern. Theonly different between this algorithm and previous one is that we initialize C0,j with0 instead of j because the pattern matching process can start from every text position.

These are fullfilled matrices by the algorithm that simply computed the edit distanceand the algorithm that searched the pattern in the text.

Page 34: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

34 CHAPTER 3. APPROXIMATE STRING INDEXING

3.3.3 NFA construction

Here is presented Nondeterministic Finite Automaton. We will use it to search forapproximate matches of pattern P in the text T . The initial state of the automatoncorresponds to 0 errors in matching process and to the first character of the pattern.The automaton will go by pattern and text characters simultaneously. When it reachesthe last pattern character, it means that we have found an occurence. It is presentedlower in the picture as a table. In columns pattern characters are written, and eachrow is used to represent errors in matching. We search for pattern occurence acceptingsome errors, and transition between rows means accepting one error.There are 4 kinds of transitions between states (we make changes with pattern andthe text is remaining the same):

• if current text and pattern characters are the same;

• if current pattern character is replaced with the text one;

• if current text character is inserted to the pattern;

• if current pattern character is deleted.

Look at the illustration of the automaton which searches the text for approximatematching the pattern ’survey’ with at most 2 errors. The rows correspond to theerrors which are already made, and the columns correspond to the characters of thepattern already reached by the automaton.The transitions are:

• horizontal, if current text and pattern characters are the same;

• solid diagonal, if current pattern character is replaced with the text one;

• vertical, if current text character is inserted to the pattern;

• dashed diagonal, if current pattern character is deleted.

The automaton finish states are the right ones.

Page 35: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

3.4. MAIN ALGORITHM 35

NFA structure

How NFA searches ’surga’ in ’surgery’

3.3.4 Depth-First Search technique

And the last is the algorithm of the depth-first search. Here we define the Uk(P )which is a set of the strings, matching to P with at most k errors. It is possible tosearch this string in the text, but the complexity of this search is quite large. We canuse a suffix tree and search in it the strings from U t

k(P ). These are the neighborhoodelements which are not prefixes of other neighborhood elements.

3.4 Main algorithm

Our algorithm, searching in the suffix tree, has to start from the root, consider somestring x incrementally,determine when ed(x,P ) ≤ k and determine when adding ofany character makes the ed greater than k.

In the picture you can see how the algorithm works with the input of the suffixesfrom the suffix tree. It fills the matrix like in the example above and analyzes it’s ele-ments.The algorithm increments x and updates the last column in O(m) time. Whenthe element in the last row (last column element) is ≤ k, the match is detected. Oth-erwise, if all the values in the column are ≥ k, the match cannot be detected.

This is the illustration of the algorithm working in the suffix tree with 2 suffixes,matching the pattern ′surgery′. It is shown the column of a matrix for the suffix′surga′.

The cost of the suffix tree search is exponential in m and k, so it’s better to perform jsearches of patterns of length [ m

j] and k

jerrors (remember the lemma about dividing!).

That’s why we divide patterns. So, we divide our pattern into j pieces and search themusing the above algorithm. Then, for each match found ending at text position i wecheck the text area [i −m − k..i + m + k]. But the larger j, the more text positions

Page 36: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

36 CHAPTER 3. APPROXIMATE STRING INDEXING

need to be verified, and the optimal j will be found soon.

Now we have to adapt our NFA for searching in the suffix tree. At first, we’ll search formatching from the beginning ot the suffix, so we don’t need initial self-loop. Second,we don’t need initial insertions to the pattern - because if te suffix matches with suchinsertions, we will find the suffix matching without these insertions. So, we remove thedown-left triangle of the automaton, under diagonal. And at last, we can start matchwith k + 1 first characters of the pattern (because we removed initial insertions, onlyinitial deletions remain, initial replacement is the same as initial deletion).

This is the illustration to the changes of our NFA from the previous example.

3.4.1 Suffix Arrays

Here I can say something about using suffix arrays instead of suffix trees. Suffix arrayshave less space requirements, but the time complexity of the search in the case ofsuffix array should be multiplied by log n. Suffix array replaces nodes with intervalsand traversing to the node is going to the interval. If there is a node and it’s children,then the node interval contains children intervals. More information on suffix arrayswas in Olga Sergeeva’s report.

3.5 Complexity Analysis

Let’s analyse the algorithm to determine it’s complexity and the best variant of par-titioning the pattern. Now we’ll find the average number of nodes at level l. If we areworking with random text, then the number of suffixes in level l is σl, and for smalll the number of suffixes longer than l is nearly n. In the probability model of n ballsthrown into σl urns we find that the average number of filled urnes is θ(minσl, n).If l ≤ m′, at least l− k text characters must match the pattern, and if l > m′, at leastm′−k pattern characters must match the text. There is no difference, which exactly isthe length of the pattern prefix. So, we sum all the probabilities for different patternprefix lengths:

lX

m′=l−k

1

σl−kCl−k

l Cl−km′ +

l+kX

m′=l+1

1

σm′−kCl

m′−kCm′−km′

Page 37: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

3.5. COMPLEXITY ANALYSIS 37

In the first sum the largest term is first one: 1σl−k

Ckl , and we can bound the whole

sum with (l − k) 1σl−k

Ckl . By Stirling’s approximation we have

Ckl =

el√

2πl

kk(l − k)l−k√

2πkp

2π(l − k)

!2 „

1 +O(1

l)

«

And the whole first sum is (l − k)γ(β)lO( 1l), where

γ(β) =1

σ1−xx2x(1− x)2(1−x)

Here you see that (l− k)γ(β)lO( 1l) = O(γ(β)l). The first sum exponentially decreases

when γ(β) < 1, it means that:

σ >

„1

β2β(1 − β)2(1−β)

« 11−β

=1

β2β

1−β (1 − β)2>

e2

(1 − β)2⇔ β < 1 − e√

σ,

because e−1 < ββ

1−β if β ∈ [0, 1]. The second summation can be also bounded withO(γ(β)l), and the probability of processing a given node at depth l is O(γ(β)l). Inpractice, e should be replaced by c = 1.09 (founded experimentally) because we havefound only upper bound for the probability, but not the exact upper bound.

Using the formulas bounding the probability of matching, let’s consider that in levels

l ≤ L(k) =k

1− c√σ

= O(k)

all the nodes are visited, and in levels l > L(k) nodes are visited with probability

O(γ( kl)l). Remember that the average number of visited nodes at the level l (for small

l) is θ(minn, σl).Now we will speak about searching of a single pattern in the text using our automatonwith depth-first search technique. We can define three cases of analysis of this searchprocess:

• L(k) ≥ logσn, n ≤ σL(k) - ’small n’, online search is preferable and no indexis needed (since the total work is n); It shows that the indexing technique doesnot work for very small texts.

• m+ k < logσn, n > σm+k - ’large n’, the total search cost is

σL(k) +σk(1 + β)2(m+k)

β2k,

independent of n;

• L(k) < logσn ≤ m + k, ’intermediate n’, the search is sublinear of n in time iferror level β < 1− e√

σ.

Now let’s add to our analysis the pattern partitioning mechanism. Remember that jhere is a number of pieces in which pattern is divided. After dividing, the mechanismanalised above is used. With pattern partitioning,

• First case conditions are:

k

1 − c√σ

≥ jlogσn, n ≤ σL( kj),

complexity is O(n).

Page 38: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

38 CHAPTER 3. APPROXIMATE STRING INDEXING

• Here m + k < jlogσn, n > σm+kj , if β = k

l< 1 − e√

σthe complexity is

O(n1−logσ

1γ(1+β) ). This is sublinear of n, we use j = m+k

logσn. This j is sim-

ply the smallest of possible j’s in this case (with j less than m+klogσn

we get into

the first case).

• Third case has it’s own j for the minimum of complexity, but it can get overthe bounds of this case, so in most cases we can simply use such j as in secondcase, and we will get sublinear of n time complexity.

3.6 Conclusions

• The splitting technique balances between traversing too many nodes of the suffixtree and verifying too many text positions

• The resulting index has sublinear retrieval time O(nλ), 0 < λ < 1 if the errorlevel is moderate.

• In future there can appear more exact algorithms to determine the correct num-ber of pieces in which the pattern is divided and there are (and may appear infuture) some better algorithms for verifying after matching a piece of pattern.

Page 39: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 4

Compressed Suffix ArraysFabian Pache

In this work I present Compressed Suffix Arrays from A to Z, start-ing with ordinary Suffix Arrays, covering Compressed Suffix Arrays asdescribed in [GV00] by Grossi and Vitter in-depth and finishing with anoutline of further improvements on Compressed Suffix Arrays developedby K. Sadakane described in [Sad00]

4.1 Introduction

Merely having a certain text usually is not very satisfying. Before long one wantsto find the occurences of a smaller text within the larger text. This is called anenumerative query. If we are interested only in the number of occurences it is ancounting query, while existential queries only return if there is at least one occurenceof the subtext. The terms larger text and smaller text can be interpreted very liberally.While one of the obvious applications would be using something like this paper asthe larger text and for instance ‘Suffix Array’ as the smaller text there are otherapplications. The human genome can be seen as a text, admittedly one with a rathersmall alphabet, with any subsequence being a word or rather a pattern.

4.2 Suffix Arrays

The entire idea of a suffix array is to find a certain pattern P within a text T as fastas possible, using as little additional space as possible. A suffix array SA for a textT has as many entries as T has characters. Each entry i of the suffix array pointsto the position of the i-smallest suffix of T . ‘Smallest suffix’ in this case refers tothe lexicographic ordering of all suffices of T . The order in turn is defined by thealphabet Σ which contains all characters of T . Note that the suffix array is createdindependant of the pattern P or the length of the pattern. Therefore a suffix array canbe used for multiple sequential queries using different patterns efficiently. Searchingonly once for a single instance of a certain pattern can be done more efficiently usingother algorithms. Suffix arrays excel for multiple queries on a static text.

4.2.1 Algorithms

A suffix array in its basic form provides no more than the information where thesmallest suffix, second smallest suffix, third smallest suffix and so on starts. It does

39

Page 40: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

40 CHAPTER 4. COMPRESSED SUFFIX ARRAYS

not, in itself, provide a function that given a certain pattern, returns the position ofthe pattern in the text.However such an alogrithm is quickly outlined once one remembers that the soughtpattern is a subset the text. Each subset can in turn be seen as a prefix of a suffix.This is where the suffix array comes in. Each and every suffix is referenced exactlyonce by the corresponding suffix array. The suffix array contains all suffixes in theirlexicographic order therefore all entries of the suffix array pointing to occurences ofthe pattern are in an unbroken sequence. Note that the entries in the suffix array aresorted, the suffixes in turn usually are not.Since the occurences of P are in sequence, finding all occurences of P boils down tofinding the first and the last occurence. Everything in between matches the patternas well.

4.2.2 Complexity

Since the only intrinsic function of a suffix array is lookup(i), returning the i-smallestsuffix in T by table lookup, time complexity for one operation is O(1). Space con-sumption at this point is considered O(n) for both the text and the suffix array

4.3 Compressed Suffix Arrays

Considering that every text will be stored binary in a digital environment it seemsprudent to reduce the alphabet to binary as well. This also gives the highest possibledegree of freedom for pattern seeking operations. But it introduces a problem in thesize analysis of the text and its accompanying suffix array. A text of n characterscan be seen as a bit sequence of O(m) bits where m = n log2 n. But for each bit, aentry in the suffix array is required and each entry must be able to adress one theO(m) suffixes. Therefore the suffix array ‘grows’ to O(m · logm) bits as logm bits arerequired to uniquely address one of m entries.Grossi and Vitter pointed out a clever way of reducing the space requirement back toO(m) without incuring a great speed penalty.

4.3.1 Compression of the Suffix Array

The basic idea of Grossi and Vitter is a recursive divide and conquer algorithm. Foreach step of the recursion, half the entries of the suffix array are retained for the nextstep while the other half are stored implicitly.For now I will only describe the compression of the suffix array itself. The observantreader might note that additional data is required to recover the implictly stored data.Compression of this information is not trivial and will be covered later in section 4.3.2.

Remember that each suffix of T , and therefore each index of 1, . . . , m is refered to onlyonce in SA, the suffix array can be interpreted as a permutation. It follows that theinitial steps are applicable to all permutations. However section 4.3.2 will show thatit is not a generic method that can be applied to all permutations.As mentioned before, we will compress the suffix array recursively. In each step ofthe recursion we remove half the entries. The original suffix array is stored at levelk = 0 and the recursion is applied often enough so that the suffix array shrinks backto m bits. Since the size of the initial suffix array SA0 was m entries of logm bits andthe number of bits for each entry is assumed constant Grossi and Vitter reduce the

Page 41: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

4.3. COMPRESSED SUFFIX ARRAYS 41

number of explicitly stored entries. The number of levels K required therefore is

m

2K· logm = m

logm = 2K

⇒K = log logm

Each step of the recursion removes the odd values and keeps the even values of SAk.For reasons discussed in section 4.3.2 the kept entries are divided by two in each step.As a side effect of this division the new array SAk+1 again contains odd and evenvalues, so the recursion can be applied again.

How to recover the elements removed from SAk? For now, do not consider the requiredspace, only keep in mind that we want to retain constain time access for each level.Therefore Grossi and Vitter introduce 3 new arrays on each level:

1. A bit vector Bk where

Bk[i] =

(

1 if SAk[i] is even

0 if SAk[i] is odd

2. A integer mapping ψk that is only required to be defined for all i where Bk[i] = 0i.e the entry will not be kept in the recursion. The mapping j = ψk[i] is usedto find the entry SAk[j] that is one less than SAk[i]. In other words

SAk[i] = SAk[ψk(i)] + 1 iff Bk[i] = 0

3. A integer vector rankk where rankk[i] contains the number of 1s in B[0..i]. Likeψ this array is not required to be defined for all entries, but only for those whereBk[i] = 1 i.e the entry will be kept in the recursion. This array denotes theposition of the halfed entry in the next level.

Using these arrays, it is possible to recover SAk, using SAk+1, Bk, rankk and ψk asfollows:

SAk(i) =

(

2 · SAk+1[rankk[i]] iff Bk[i] = 1

2 · SAk+1[rankk[ψk[i]]] + 1 iff Bk[i] = 0

The evaluation of rankk in the second statement is possible since Bk[ψ[i]] = 1 iffBk[i] = 0

Grossi and Vitter combine the above formulas by filling the unneccessary entries(rankk[i] for Bk[i] = 0 and ψk for Bk[i] = 1) with neutral operations and are thereforeable to put both cases in a single statement. While this is mathematically very so-phisticated it is not a representation that makes the compression and reconstructionscheme easier to understand. Considering modern CPU architectures that have a op-erations pipeline that can reach its full potential primarily on code that is executedwithout conditions a unified statement might have advantages in the implementation.Please refer to the original work of Grossi and Vitter [GV00] for details.

This scheme is obviously applied only until the last level of the compression is reached.At this point the values of SA are stored explicitly. Figure 4.1 shows a suffix array fora binary text of length 32. Therefore there are levels k = 0 . . . dlog log 32e = 0 . . . 3.Entries of rank and psi that are not required in the decompression are left blank.

Page 42: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

42 CHAPTER 4. COMPRESSED SUFFIX ARRAYS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

T a b b a b b a b b a b b a b a a a b a b a b b a b b b a b b a #

SA0 15 16 31 13 17 19 28 10 7 4 1 21 24 32 14 30 12 18 27 9 6 3 20 23 29 11 26 8 5 2 22 25B0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 1 0rank0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16ψ0 2 14 15 18 23 28 30 31 7 8 10 13 16 17 21 27

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

SA1 8 14 5 2 12 16 7 15 6 9 3 10 13 4 1 11B1 1 1 0 1 1 1 0 0 1 0 0 1 0 1 0 0rank1 1 2 3 4 5 6 7 8ψ1 9 1 6 12 14 2 4 5

1 2 3 4 5 6 7 8

SA2 4 7 1 6 8 3 5 2B2 1 0 0 1 1 0 0 1rank2 1 2 3 4ψ2 5 8 1 4

1 2 3 4

SA3 2 3 4 1

Figure 4.1: Example of a compressed suffix array

4.3.2 Compression of the Auxiliary Arrays

Meanwhile we compressed the suffix array down to an acceptable size of m bits. Onthe other hand we optained 3 new array of which only Bk can be stored in m bits.

rankk is larger by far, requiring m logm bits on each level. But there is no need tostore rankk explicitly as Guy Jacobson developed a method described in the thesispaper for [Jac89] that can be based on Bk and requires only o(m) bits. The basic ideais to store a two level dictionary that allows for constant access time in the cost modelused here.

Compression of ψk is more involved than compressing rankk. This is also the pointwhere the compression becomes inapplicable to ordinary permutations. For each levelk ∈ 0 . . . K − 1we create 2k+1 list. Each of the lists is labeled with a unique binarystring of length k+ 1. For each entry ψk(i) with Bk(i) = 0 we determine the array tostore the entry in by looking up a substring t of T . t is defined as a prefix of a suffix inT . The suffix is the one pointed to by the corresponding entry in SAk. The length ofthe prefix is determined by the current level of recursion. t(i) = T ((2k ·SA[i]) . . . (2k ·SA[i] + 2k − 1)). We append j = psik(i) to the list labeled t(i).

Continuing the example of figure 4.1 the lists are shown in figure 4.2. Note that eachof the lists is sorted and the maximum entry in Level k is m

2kdue to the division by

two of each entry in SA on each level of the recursion. Thus each of the 2k+1 lists canbe stored using a bit-vector of length m

2k. The space requirement (without assisting

structures to optimize access) is therefore

2k+1 · m2k

= O(m)

In order to access the compressed entry j = ψk(i) we have to determine the numberof 0s preceeding entry i in Bk as this is the index to the concatenated lists for level k.This can be done by calculating h = i− rankk(i) using the technique based on [Jac89]for rank outlined above. Finding the hth entry in the lexicographically ordered listsis not described in detail in [GV00], but Grossi and Vitter claim constant access timewhile using no more than O(m) bits for access optimzing structures

Page 43: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

4.3. COMPRESSED SUFFIX ARRAYS 43

Level 0:

a list: 2 14 15 18 23 28 30 31

b list: 7 8 10 13 16 17 21 27

Level 1:

aa list:

ab list: 9

ba list: 1 6 12 14

bb list: 2 4 5

Level 2:

aaaa list: aaab list:

aaba list: aabb list:

abaa list: abab list:

abba list: abbb list: 5 8

baaa list: baab list:

baba list: 1 babb list: 4

bbaa list: bbab list:

bbba list: bbbb list:

Figure 4.2: Lists for ψ of figure 4.1

4.3.3 Complexity in general

On each level 0 ≤ k ≤ log logm we store Bk, rankk and the tables for psik in O(m)bits. Therefore the entire structure can be stored in m log logm bits. Since time isconstant for each level SA(i) can be accessed in O(log logm).

4.3.4 Complexity optimization

With a small penalty in time requirement it is possible to further reduce the size of thestructure to O(m) bits. This is done by using only a subset of the levels. As long as thesize of the subset is constant the asymptotic size remains at O(m). Level k = log logmis always stored explicitly. Of the other array it suffices to store 1 ≤ L < log logm levelswith minor modifications if these L levels are l(j) =

˚jL

log logmˇ

for 0 ≤ j ≤ L − 1.To be able to reconstruct SAl(j+1) from SAl(j) it is neccessary modify Bl(j)(i) to be 1only if SAl(j)(i) can be found in SAl(j+1)(rankl(j)(i)) instead of SAl(j)+1(rankl(j)(i)).Note that rank needs not be modified if based solely on B.

ψl(j) still has to be defined for all entries where Bl(j) = 0, only now there are morethen half of the entries defined.

This modifactions enable us to use ψ to seek an entry of B = 1. The length of theseeking process in turn determines the time requirements. The process is bounded bylongest possible sequence that has to be investigated. For a compressed suffix arraywith L levels this sequence can be no longer than the sequence required to move fromlevel l(0) to l(1). Since all upper bound of sequences on the same level are of equal

Page 44: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

44 CHAPTER 4. COMPRESSED SUFFIX ARRAYS

length, the longest seeking process is

L−1X

j=0

‖l(j)‖‖l(j + 1)‖ =

L−1X

j=0

m

2l(j)

m

2l(j+1)

=L−1X

j=0

2l(j+1)

2l(j)

=

L−1X

j=0

2j+1L

log log m

2jL

log log m

=

L−1X

j=0

(2log log m)j+1L

(2log log m)jL

=“

2log log m” 1L

L−1X

j=0

1

= (logm)1L · L

= O(logε m) with ε =1

L

4.4 Extensions of Compressed Suffix Arrays

Leaving the most general case of suffix arrays on binary texts, Sadakane proposes someextensions of suffix arrays on human readable texts. For a more in-depth descriptionof Sadakanes datastructures and algorithms please refer to [Sad00]. Here I can onlygive a short summary of his work and what it might imply.

4.4.1 Operations

While the original suffix array offers no more functionality than a mere lexicagraphicordering new operations are included.

Inverse Suffix Array A suffix array answers the question ‘At which position i doesthe j-smallest suffix begin?’. Or more concise i = SA[j] To answer the inversequestion ‘What is the order j of the suffix starting at position i?’(j = SA−1[i])we have no methods available up to now. Sadakane proposes a structure thatcontains SA−1 that has the same time and space requirements as SA.

Searching There are algorithms that allow for searching operations in suffix arrays.Sadakane augments the structure so that searching is an intrinsic feature of thesuffix array.

Decompression Using only the suffix array and the introduced extensions but with-out the original text, recover a substring defined by first and last index in theoriginal text.

The last two functions are based on something Sadakane calls ‘the inverse of the arrayof cumulative frequencies’. The abstract does not elaborate enough on the subject tomake a further description possible in this paper.

4.4.2 Complexity

Obviously Sadakane requires more memory to store the suffix array itself, but on theother hand removes the need to store the text. His claim is that the entire searchindex can be reduced to below the size of the original text for certain texts.

Page 45: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

4.5. CONCLUSION 45

4.5 Conclusion

We started out with a large text and an even larger suffix array. In a first step Grossiand Vitter drastically reduced the size of the suffix array without unreasonable penal-ties in usability. Sadakane took this process even further and removed the neccessityfor storing the text, at the same time adding to the functionality of the datastructure.Meanwhile we are storing neither the text nor the suffix arrays at all. All we retainedis a set of hints and literally pointers to both the text and the suffix array. This is afascinating evolution of a once seeming unchangable structure.

Page 46: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

46 CHAPTER 4. COMPRESSED SUFFIX ARRAYS

Page 47: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 5

Asymptotic Properties ofSuffix Trees.Ivan Kazmenko

Unlike the previous chapters, this one is not going to introduce a newsophisticated suffix tree construction algorithm, dig into its propertiesand prove that it works fast and fine. Instead, we’ll consider one of themost dumb algorithms of suffix tree construction and find out that undercertain conditions on the text, it almost surely works rather well, meaningthat we can find almost sure upper and lower bound for the complexityof new suffix insertion while the size of the text tends to infinity.

This chapter is based on the article Asymptotic Properties of DataCompression And Suffix Trees by Wojciech Szpankowski [Szp93] and thebook Average case analysis of algorithms on sequences [Szp00] of the sameauthor.

5.1 Suffix Tree Construction

Let’s start with some common symbol definitions which we will use in the entirechapter.

Let Σ is a finite alphabet of size |Σ| = V , Xk∞k=1 be a stationary ergodic sequenceof symbols generated from Σ, and Xn

m = (Xm, ..., Xn) for m < n be a partial sequenceof the whole sequence Xk∞k=1.

We shall now consider a very simple algorithm of suffix tree construction. A node ofour tree can be either internal, i. e. branching node, or external node storing oneof the suffixes Si = Xk∞k=i. Each edge is labeled by some symbol from Σ. Whenadding a suffix, we start from the root of our tree and try to ’align’ the suffix to thetree, that is, move by the edge corresponding to the current symbol of our suffix andchange the current symbol to the next one in the suffix. That procedure continuesuntil we find no such edge at the vertex we are currently in. We then add that edgeand create a new vertex at its end storing the suffix we were adding.

More formally, consider a digital tree built in the following way:

Step 0. At the beginning, the tree consists of its root only.

Step 1. Consider a tree Tn built for the partial sequence Xn1 = (X1, ..., Xn).

Step 2. Set current vertex to root.

Step 3. Starting with j = n+ 1, we either

47

Page 48: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

48 CHAPTER 5. ASYMPTOTIC PROPERTIES OF SUFFIX TREES.

(A) move by the edge marked by Xj from the current vertex if it exists thus changingthe current vertex and increase j by 1, or

(B) construct a new edge marked with symbol Xj from the current vertex to a newvertex marked with our suffix X∞

n+1 and proceed to Step 1 with n increased by 1otherwise.

Note that j − n is the number of case (A) occurences during a single Step 3.

The picture shows an example of a single loop of our algorithm.

Let X101 = (0, 1, 0, 1, 1, 0, 1, 1, 1, 0).

S1 = 0101101110

S2 = 101101110

S3 = 01101110

S4 = 1101110

S5 = 101110

Four inserted suffixes.

Z

ZZ

JJ

JJ

S4

S1

JJ

S3

S2

Fifth suffix insertion.

Z

ZZ

JJ

JJ

S4

S1

JJ

S3

JJ

JJ

S2

JJ

S5

We do not formalize our ’splitting policy’, that is, the way how we split an externalnode that becomes internal during some other suffix insertion. The natural way to dothe ’splitting’ is shown on the picture. We can consider all previous suffix marks tobe infinite branches of our tree to make the algorithm formally correct.

We are interested in the complexity of a single loop of our algorithm. Formally, ourmain questions regarding the algorithm described will be the following:

What is the typical height of Tn?

What is the typical difference j − n when Step 3 is finished?

What is the typical minimal possible difference j − n at the end of Step 3 for the treeTn?

In the next section, we will present some assumptions on the sequence Xk∞k=1 that,being not too restrictive, will get us some bounds on the value in question.

5.2 Depth of Insertion in a Suffix Tree

As we study our sequence Xk∞k=1 in a probabilistic framework, its most importantcharacteristic is nth order probability distribution P (Xn

1 ) = PrXk = xk, 1 6 k 6

n, xk ∈ Σ. The entropy of our sequence is the limit h = limn→∞

E− log P (Xn1 )

n. It is

known that h 6 log V . All logarithms are natural ones in this chapter.

Another characteristic of much interest is the parameter Ln which is the smallestinteger L > 0 such that Xm+L−1

m 6= Xn+Ln+1 for all 1 6 m 6 n. Informally, it has the

following meaning: when we insert the suffix Sn+1, we will require exactly Ln steps(A) to do it.

Returning to our example, let X101 = (0, 1, 0, 1, 1, 0, 1, 1, 1, 0). Here L1 = 1, L2 = 3,

L3 = 2, and L4 = 5 since X85 = X5

2 = (1, 0, 1, 1) and therefore L4 > 4.

Page 49: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

5.2. DEPTH OF INSERTION IN A SUFFIX TREE 49

So, what will be our assumptions on the sequence Xk? Below we introduce themixing condition - a weakened form of independence.

Remember that Xk is called an independent sequence if for every set of indexesI = i1, . . . , ir the probablity of Xkk∈I being in

NAkrk=1 is equal to the prod-uct of the corresponding probabilities: PrXi1 ∈ A1, . . . , Xir ∈ Ar = PrXi1 ∈A1 . . . P rXir ∈ Ar. Somewhat weaker is pairwise independent condition whichtakes only the sets I of size 2 into consideration, stating that PrXi1 ∈ A1, Xi2 ∈A2 = PrXi1 ∈ A1PrXi2 ∈ A2. The independence itself can be also written inpairwise form with events being not subsets of a single copy of Σ, but elements of amore complex σ-field.

Let Fnm be a σ-field (also known as σ-algebra) generated by Xknk=m for m 6 n.

Independence means that for every pair of events A ∈ Fm0 and B ∈ F∞

m+1 it is true thatPrAB = PrAPrB. The mixing condition creates a gap of size d between our σ-fields so that A ∈ Fm

0 and B ∈ F∞m+d and transforms our equality into two inequalities

bounding the left term with the right one multiplied by some constants from bothsides. The strong α-mixing condition substitutes that constants by functions tendingto 1 from both sides while the gap size d tends to infinity. The formal definitionsfollow.

We say that Xk satisfies the mixing condition if and only if there exist constants 0 <c1 6 c2 and an integer d such that for all A ∈ Fm

0 , B ∈ F∞m+d and 0 6 m 6 m+ d 6 n

the following condition is true: c1PrAPrB 6 PrAB 6 c2PrAPrB.Now let α be a function of d such that α(d) −−−→

d→∞0. Xk satisfies the strong α-mixing

condition if and only if for all A ∈ Fm0 , B ∈ F∞

m+d and 0 6 m 6 m+d 6 n the followingcondition is true: (1− α(d))PrAPrB 6 PrAB 6 (1 + α(d))PrAPrB.We define two new parameters of Xk. They are parameters h1 and h2:

h1 = limn→∞

maxlog P−1(Xn1 ), P (Xn

1 )>0n

= limn→∞

log(1/ minP (Xn1 ), P (Xn

1 )>0)n

,

h2 = limn→∞

log(EP (Xn1 ))−1

2n= lim

n→∞

log(P

Xn1P2(Xn

1 ))−1

2n.

The relationship with entropy h is as follows: 0 6 h2 6 h 6 h1. The values h1 and h2

are also known as Renyi entropy of order −∞ and 2, respectively.

The formulas are complex, so we could use a simple example, Bernoulli model, to seewhat these values are like.

Assume that symbols Xi are generated indepenently, and ith symbol is generated

according to the probability pi. Thus, h =VP

i=1

pi log(p−1i ), h1 = log(1/pmin) and

h2 = 2 log(1/P ) where pmin = min16i6V

pi is the probability of least probable symbol

occurence and P =VP

i=1

p2i can be interpreted as a probability of a match between any

two symbols.

Now, we are ready to present our main result, Theorem 5.1. It proposes the conditionsunder which we can find almost sure lower and higher bounds for Ln, the value weare interested in. An important finding is that we not only know how it behaves (itsbehavior is logarithmic with respect to n), but also find the range of the constant bythat logarithm.

Theorem 5.1. Consider stationary ergodic sequence Xk∞k=−∞.

1. Assume strong α-mixing condition.

2. Let h1 <∞ and h2 > 0.

(B) ∃ρ : 0 < ρ < 1, ∃β such that α(d) = O(dβρd) for d→∞.

Then

(1) lim infn→∞

Lnlog n

= 1h1

(a.s.),

(2) lim supn→∞

Lnlog n

= 1h2

(a.s.).

Page 50: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

50 CHAPTER 5. ASYMPTOTIC PROPERTIES OF SUFFIX TREES.

How restrictive is the condition (B)? Many practically occuring cases fit it, for exam-ple, in Bernoulli model, α(d) = 0 because of independence of Xk, and if the sequenceXk is Markovian, α(d) decays exponentially fast. In general, statement (1) of The-orem 5.1 does not hold without the (B) condition.

5.3 Height and Shortest Feasible Path in a Suf-fix Tree

In this section, we will introduce yet another bundle of auxiliary definitions to formu-late our Theorem thm:kaz-2, and then prove Theorem 5.1 using Theorem 5.2. Theproof of Theorem 5.2 itself will not be given due to its complexity, however, a shortoverview of its proof techniques will be done in Section 5.4.Let us define some more depth characteristics. Let Tn be a suffix tree constructedfrom the first n suffixes of Xk. mth depth Ln(m) is the depth of the ith suffix in Tn;note that Ln = Ln+1(n + 1). Average depth Dn is the depth of a randomly selected

suffix, that is, Dn = 1n

nP

m=1

Ln(m).

Height and shortest feasible path are defined as follows. Height Hn is the length ofthe longest path in Tn; Hn = max

16m6nLn(m). Available node is a node which does

not belong to Tn but its predecessor does, that is, a node that could be inserted inTn+1 at the next insertion with no other nodes added. Shortest feasible path sn is thelength of the shortest path from the root to an available node.For each two suffixes, we can find their longest common prefix by walking down thetree along them till they part. Self-alignment Ci,j is the length of the longest commonprefix of Si and Sj .One can easily prove the following relations of self-alignment to other suffix tree pa-rameters:Ln(m) = max

16k6n,k 6=mCk,m+ 1;

Hn = max16i<j6n

Ci,j + 1;

Ln = max16m6n

Cm,n+1 + 1.

Returning to our example, let X101 = (0, 1, 0, 1, 1, 0, 1, 1, 1, 0). Consider suffix tree T4

built from first 4 suffixes. L4(1) = 3, L4(2) = 2, L4(3) = 3, L4(4) = 2. H4 = 3, s4 = 2.But L4 = L5(5) = 5.Note that the S5 node of T5 is not an available node in T4 since it requires auxiliaryinternal nodes to be inserted. In T5, H5 = 5, and s5 = 2 = s4.Digging into the properties of Ci,j gives the proof of Theorem 5.2 formulated below.It is a variant of Theorem 5.1 with Ln substituted by sn and Hn. As we alreadyobserved, the statement (2) of the theorem does not need (B) condition to hold.

Theorem 5.2. Consider stationary ergodic sequence Xk∞k=1.1. Assume strong α-mixing condition.2. Let h1 <∞ and h2 > 0.Then(1) lim

n→∞sn

log n= 1

h1(a.s.) when (B) holds,

(2) limn→∞

Hnlog n

= 1h2

(a.s.) when α(d) satisfies the following:∞P

d=0

α2(d) <∞.

Proof of Theorem 5.1 by Theorem 5.2: For each of the two statements, we will boundthe left side of equality by the right side from both sides.(1): lim sup

n→∞Ln

log n6 lim

n→∞Hnlog n

(a.s.) simply holds by definition as Ln 6 Hn; let’s prove

that lim supn→∞

Lnlog n

> limn→∞

Hnlog n

(a.s.). Note that Hn is a non-decreasing sequence;

Page 51: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

5.4. PROOF TECHNIQUES 51

Ln = Hn when Hn+1 > Hn, and that occurs infinitely often since Hn → ∞ andXk is an ergodic sequence, so PrLn = Hn i.o. = 1 and there exists a subsequencenk → ∞ such that Lnk = Hnk . It is clear now that the upper limit of Ln in not lessthan the limit of Hn with an arbitrary common denominator, which is equal to log nin our case.(2) can be proved in a similar way: sn is a non-decreasing sequence also.

5.4 Proof Techniques

In this section, we will throw a short glance on the tools used to prove Theorem 5.2itself. The whole proof is complex and technically hard.One of the methods used in the proof is a technique called String-Ruler Approach.According to it, the correlation between different substrings is measured using anotherstring ω called a string-ruler. To illustrate it, we shall find the longest common prefixof two independent strings Xk(1)∞k=1 and Xk(2)∞k=1. Let its length be C1,2. Thefollowing equivalence is obvious:C1,2 > k⇐⇒ ∃ω of length k: Xk

1 (1) = ω = Xk1 (2).

We then construct a set Wk = ω ∈ Σk : |ω| = k and estimate the probabilitiesP (ωk) = P (Xm+k

m+1 = ωk) for a fixed position m in our sequence Xk.Another important method is a probabilistic one, called Second Moment Method. Theversion by Chung and Erdos of this method states that for a sequence of events Ai wehave

PrnS

i=1

Ai >

(nP

i=1PrAi)2

nP

i=1PrAi+

P

i6=jPrAi∩Aj

.

We then apply it to the sets Ai,j = Ci,j > k.The reasoning of the latter method is elementary. Let us remember Markov’s inequalityPrX > t 6

EXt

and Chebyshev’s inequalityPr|X −EX| > t 6

V arXt2

.After some trivial calculations we get First Moment Method:for integer-valued nonnegative random variable XPrX > 0 6 EXand Second Moment Method (Chebyshev’s version):

PrX = 0 6V arX(EX)2 ,

respectively. The version by Chung and Erdos is derived from the latter one.

5.5 Summary

In our main result, Theorem 5.1, we have shown that, given a stationary ergodicsequence generated over a finite alphabet, under strong α-mixing condition on thesequence, the depth of the nth suffix insertion into a partial suffix tree of that sequenceusing simple and natural algorithm specified above can be described by the expressionc log n where c almost surely lies between 1/h1 and 1/h2 and the parameters h1 andh2 can be found explicitly.


Chapter 6

Sequential Pattern Matching – Analysis of Knuth-Morris-Pratt Type Algorithms Using the Subadditive Ergodic Theorem

Tobias Reichl

Based on an article by Mireille Regnier and Wojciech Szpankowski, this report outlines the complexity analysis of Knuth-Morris-Pratt type algorithms using the Subadditive Ergodic Theorem, martingales and Azuma's Inequality.

Using the Subadditive Ergodic Theorem we will prove the existence of a linearity constant for the worst and the average case. Although the Subadditive Ergodic Theorem does not indicate a way to compute the linearity constant, we may use Azuma's Inequality to show that the number of comparisons done is well concentrated around its mean value.

6.1 Pattern Matching

6.1.1 Conventions

Before starting we have to introduce some conventions in nomenclature: a pattern $p$ of length $m$, denoted $p_1^m$, is matched against a text $t$ of length $n$, denoted $t_1^n$.

We also have to define a counting function:

$M(l, k) = \begin{cases} 1 & \text{if } t[l] \text{ is compared to } p[k], \\ 0 & \text{otherwise.} \end{cases}$

A position in the text is called an alignment position (AP) if, starting from it, comparisons between text and pattern are done; more formally, $M(AP + (k-1), k) = 1$ for some $k$.

6.1.2 Defining Sequential Algorithms

We will classify algorithms by a property we call sequentiality.

1. Semi-sequential: The sequence of alignment positions used by the algorithm is non-decreasing.

2. Strongly semi-sequential: (1), and the comparisons $M(l_i, k_i)$ define non-decreasing text positions $l_i$.

3. Sequential: (1), and $M(l, k) = 1 \Rightarrow t_{l-(k-1)}^{l-1} = p_1^{k-1}$, so text-pattern comparisons $M(l, k)$ are only done as long as there is a prefix of the pattern to the left of the text position to be compared next.

4. Strongly sequential: (1), (2) and (3).

6.1.3 Naive / Brute Force Algorithm

In short we may outline the naive or brute force algorithm as follows:

• Every text position is an alignment position.

• The aligned pattern is matched against the text from left to right until either a mismatch occurs or the pattern is found.

• The pattern is then shifted by one and the next matching is started.

The brute force algorithm is a sequential algorithm: the APs are non-decreasing and the condition $M(l, k) = 1 \Rightarrow t_{l-(k-1)}^{l-1} = p_1^{k-1}$ holds: no more comparisons are done after a mismatch is found, so every alignment is used only as long as prefixes of the pattern are found in the text.

The sequence of text positions $l_i$ defined by the sequence of comparisons $M(l_i, k_i)$, however, may include 'jumping backwards', i.e. if a mismatch occurs, the AP is shifted by one and comparisons again start at the beginning of the pattern.
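As a concrete illustration of these conventions, the following Python sketch (our own illustration, not part of the original report) runs the brute force algorithm and counts the comparisons $M(l, k)$ it performs:

def brute_force_comparisons(text: str, pattern: str):
    """Brute force matching; returns (match positions, number of comparisons)."""
    n, m = len(text), len(pattern)
    matches, comparisons = [], 0
    for ap in range(n - m + 1):            # every text position is an alignment position
        k = 0
        while k < m:
            comparisons += 1               # one comparison M(ap + k, k)
            if text[ap + k] != pattern[k]:
                break                      # stop at the first mismatch
            k += 1
        if k == m:
            matches.append(ap)
    return matches, comparisons

print(brute_force_comparisons("abababab", "abab"))   # ([0, 2, 4], 14)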

6.1.4 Knuth-Morris-Pratt

Idea (Morris-Pratt): Disregard APs if we already know that there cannot be a prefix of the pattern there, namely the ones that satisfy $t_{l+i}^{l+k-1} \ne p_1^{k-i}$ for all $i$, or equivalently $p_{1+i}^{k-1} \ne p_1^{k-(i+1)}$, as the already processed text has to be identical to the corresponding prefix of the pattern.

This knowledge can be obtained by preprocessing the pattern. The specific shifting functions can formally be described as follows:

Morris-Pratt variant (MP):

$S = \min\{k - 1,\ \min\{s > 0 : p_{1+s}^{k-1} = p_1^{k-(s+1)}\}\}$

Knuth-Morris-Pratt variant (KMP):

$S = \min\{k,\ \min\{s : p_{1+s}^{k-1} = p_1^{k-(s+1)} \text{ and } p_k \ne p_{k-s}\}\}$

MP and KMP differ in the amount of information used from the pattern. Both are strongly sequential algorithms, because from the definition of the shift function it is automatic that sequentiality condition (2) holds: there is no 'jumping backwards'.
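In practice the shift function is tabulated from the pattern alone as a "failure function". The sketch below is our own hedged illustration of the Morris-Pratt preprocessing (variable names are not from the paper): for each prefix length $k$ it computes the length of the longest proper border, from which the shift $S$ follows directly.

def mp_failure(pattern: str):
    """border[k] = length of the longest proper border of pattern[:k] (Morris-Pratt table)."""
    m = len(pattern)
    border = [0] * (m + 1)
    border[0] = -1                       # convention: border of the empty prefix
    b = -1
    for k in range(1, m + 1):
        while b >= 0 and pattern[b] != pattern[k - 1]:
            b = border[b]                # fall back to the next shorter border
        b += 1
        border[k] = b
    return border

# After a mismatch at pattern position k, MP shifts the alignment by k - border[k].
print(mp_failure("abab"))   # [-1, 0, 0, 1, 2]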


6.1.5 Defining Complexity

The complexity of matching a pattern $p$ against a text portion $t_r^s$ can be defined as the number of comparisons needed:

$c_{r,s}(t, p) = \sum_{l \in [r,s],\, k \in [1,m]} M(l, k)$ (6.1)

The overall complexity $c_{1,n}$ is denoted $c_n$. If either the text or the pattern is a realization of a random sequence, we shall write $C_n$. To analyze KMP we have to introduce two probabilistic tools: the Subadditive Ergodic Theorem and Azuma's Inequality.

6.2 Subadditive Ergodic Theorem

6.2.1 Fekete’s Theorem

Assume a deterministic sequence $\{x_n\}_{n=0}^{\infty}$ satisfies the so-called subadditivity property, that is,

$x_{m+n} \le x_n + x_m$ (6.2)

for all integers $m, n \ge 0$. We may fix $m \ge 0$ and write

$n = km + l \iff \frac{k}{n} = \frac{1}{m} - \frac{l}{mn}$. (6.3)

Then by successive application of the subadditivity property we arrive at

$x_n = x_{km+l} \le x_m + x_m + \cdots + x_m + x_l = k x_m + x_l$. (6.4)

Now dividing by $n$ and letting $n \to \infty$, so that $k/n \to 1/m$, cf. (6.3), we get

$\limsup_{n\to\infty} \frac{x_n}{n} \le \inf_{m\ge 1} \frac{x_m}{m} =: \alpha$. (6.5)

To complete the derivation we use the definition of $\liminf$: since $\alpha \le x_k/k$ for every $k$,

$\liminf_{n\to\infty} \frac{x_n}{n} = \sup_{n\ge 0} \inf_{k\ge n} \frac{x_k}{k} \ge \alpha$, (6.6)

so together with (6.5) the limit exists and equals $\alpha$. Thus we have derived the theorem of Fekete.

Theorem 6.1 (Fekete 1923). If a sequence of real numbers satisfies the subadditivity property

$x_{m+n} \le x_n + x_m$ (6.7)

for all integers $m, n \ge 0$, then

$\lim_{n\to\infty} \frac{x_n}{n} = \inf_{m\ge 1} \frac{x_m}{m}$. (6.8)

If the subadditivity property (6.7) is replaced by the superadditivity property

$x_{m+n} \ge x_n + x_m$ (6.9)

for all integers $m, n \ge 0$, then

$\lim_{n\to\infty} \frac{x_n}{n} = \sup_{m\ge 1} \frac{x_m}{m}$. (6.10)


Example 6.1 (Longest Common Subsequence). The longest common subsequence (LCS) problem is a special case of the edit distance problem. Two stationary ergodic sequences $X = X_1, X_2, \ldots, X_n$ and $Y = Y_1, Y_2, \ldots, Y_n$ are given; let

$L_n = \max\{K : X_{i_k} = Y_{j_k} \text{ for } 1 \le k \le K, \text{ where } 1 \le i_1 < i_2 < \cdots < i_K \le n \text{ and } 1 \le j_1 < j_2 < \cdots < j_K \le n\}$

be the length of the longest common subsequence. Observe that

$L_{1,n} \ge L_{1,m} + L_{m,n}$. (6.11)

The LCS in the region $(1, n)$ may cross the boundary between $X_1^m, Y_1^m$ and $X_m^n, Y_m^n$. Hence it may be bigger than the sum of the LCSs in the subregions $(1, m)$ and $(m, n)$, and so $a_n = E[L_{1,n}]$ is superadditive:

$\lim_{n\to\infty} \frac{a_n}{n} = \alpha = \sup_{m\ge 1} \frac{E[L_m]}{m}$. (6.12)

But here you can already see the caveat: Fekete's Theorem¹ only states the existence of the linearity constant, but neither tells us its value nor even how to compute it. For the LCS problem, Steele conjectured in 1982 that $\alpha \approx 0.8284$.
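A small experiment makes the linear growth of $E[L_n]$ tangible. The following Python sketch is our own illustration (not from the original report); it computes the LCS length by the classic dynamic program and estimates $E[L_n]/n$ for random binary strings:

import random

def lcs_length(x, y):
    """Classic O(|x||y|) dynamic program for the LCS length."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def estimate_alpha(n=200, trials=30, seed=1):
    """Monte Carlo estimate of E[L_n]/n for i.i.d. uniform binary sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = [rng.randint(0, 1) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)

print(estimate_alpha())   # roughly 0.79-0.80 for n = 200; the limit is the constant alpha discussed above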

Theorem 6.2 (de Bruijn and Erdős 1952). The subadditivity property can be relaxed to include a sequence $c_n = o(n)$,

$x_{n+m} \le x_n + x_m + c_{n+m}$ (6.13)

where

$\sum_{k=1}^{\infty} \frac{c_k}{k^2} < \infty$. (6.14)

Then, too,

$\lim_{n\to\infty} \frac{x_n}{n} = \inf_{m\ge 1} \frac{x_m}{m}$. (6.15)

6.2.2 Subadditive Ergodic Theorem

As Fekete's Theorem only applies to deterministic sequences, effort has been made to generalize it to sequences of random variables.

Theorem 6.3 (Kingman and Liggett). Let $X_{m,n}$ ($m < n$) be a sequence of random variables satisfying the following properties:

1. $X_{0,n} \le X_{0,m} + X_{m,n}$ (subadditivity property)

2. For every $k$, $\{X_{nk,(n+1)k},\ n \ge 1\}$ is a stationary sequence.

3. The distribution of $\{X_{m,m+k},\ k \ge 1\}$ does not depend on $m$.

4. $E[X_{0,1}] < \infty$ and, for each $n$, $E[X_{0,n}] \ge c_0 n$ where $c_0 > -\infty$.

Then

$\lim_{n\to\infty} \frac{E[X_{0,n}]}{n} = \inf_{m\ge 1} \frac{E[X_{0,m}]}{m} =: \alpha$, (6.16)

$\lim_{n\to\infty} \frac{X_{0,n}}{n} = X$ (a.s.). (6.17)

¹And the Subadditive Ergodic Theorem, as we will see later.


Theorem 6.4 (Derriennic). Similar to subadditivity for deterministic sequences, subadditivity for random sequences can be relaxed to include a sequence $A_n$:

$X_{0,n} \le X_{0,m} + X_{m,n} + A_n$ (6.18)

such that $\lim_{n\to\infty} E[A_n/n] = 0$. Then, too,

$\lim_{n\to\infty} \frac{X_{0,n}}{n} = X$ (a.s.). (6.19)

6.3 Martingales and Azuma’s Inequality

6.3.1 Basic Properties of Martingales

Martingales are a standard tool in probabilistic analysis. A sequence

Yn = f(X1, X2, . . . , Xn), n > 0 (6.20)

is a martingale with respect to the filtration

Fn = (X1, X2, . . . , Xn) (6.21)

if for all n ≥ 0 the following hold:

1. E [|Yn|] <∞ and

2. E [Yn+1 | X0, X1, . . . , Xn] = E [Yn+1 | Fn] = Yn

So $E[Y_{n+1} \mid \mathcal{F}_n]$ defines a random variable depending on the knowledge contained in $(X_1, X_2, \ldots, X_n)$. Now let's define the martingale difference as

$D_n = Y_n - Y_{n-1}$ (6.22)

so that

$Y_n = Y_0 + \sum_{i=1}^{n} D_i \iff \sum_{i=1}^{n} D_i = Y_n - Y_0$. (6.23)

Then we may rewrite the martingale difference as

Di = Yi − Yi−1 = E [Yn | Fi] −E [Yn | Fi−1] (6.24)

This is possible as the realization of the martingale sequence $Y_n$ depends on the knowledge contained in $\mathcal{F}_i$, so the difference between neighbouring elements depends on the difference in knowledge about $X_i$. Now observe:

E [Yn | Fn] = Yn and E [Yn | F0] = E [Yn] .

Note: Fn completely defines Yn, while F0 contains no information about Yn.

Interestingly, we are now able to rewrite the martingale difference sum, cf. (6.23), as

$\sum_{i=1}^{n} D_i = Y_n - E[Y_0]$. (6.25)

And this is what we are interested in: the deviation of $Y_n$ from its mean value. To assess it further we will now introduce Hoeffding's Inequality.


6.3.2 Hoeffding’s Inequality and Azuma’s Inequality

Theorem 6.5 (Hoeffding's Inequality). Let $\{Y_n\}_{n=0}^{\infty}$ be a martingale and let there exist constants $c_n$ such that

$|Y_n - Y_{n-1}| = |D_n| \le c_n$. (6.26)

Then

$\Pr\{|Y_n - Y_0| \ge x\} = \Pr\left\{\left|\sum_{i=1}^{n} D_i\right| \ge x\right\} \le 2\exp\left(-\frac{x^2}{2\sum_{i=1}^{n} c_i^2}\right)$. (6.27)

By now we know how to use the martingale difference sum $\sum_{i=1}^{n} D_i$ to assess the deviation from the mean. We also know how to bound this martingale difference sum, provided the $D_i$ are bounded. What we still need is to establish bounds on $D_i$.

The trick: let $\tilde{X}_i$ be an independent copy of $X_i$. Then

$E[f_n(X_1, \ldots, X_i, \ldots, X_n) \mid \mathcal{F}_{i-1}] = E[f_n(X_1, \ldots, \tilde{X}_i, \ldots, X_n) \mid \mathcal{F}_{i-1}] = E[f_n(X_1, \ldots, \tilde{X}_i, \ldots, X_n) \mid \mathcal{F}_i]$

because $X_i$ and $\tilde{X}_i$ share the same distribution, but $\mathcal{F}_i$, compared to $\mathcal{F}_{i-1}$, contains no additional information about $\tilde{X}_i$. Hence we may rewrite the martingale difference again as

$D_i = E[Y_n \mid \mathcal{F}_i] - E[Y_n \mid \mathcal{F}_{i-1}] = E[f_n(X_1, \ldots, X_i, \ldots, X_n) \mid \mathcal{F}_i] - E[f_n(X_1, \ldots, X_i, \ldots, X_n) \mid \mathcal{F}_{i-1}] = E[f_n(X_1, \ldots, X_i, \ldots, X_n) \mid \mathcal{F}_i] - E[f_n(X_1, \ldots, \tilde{X}_i, \ldots, X_n) \mid \mathcal{F}_i]$.

Taking into account that both terms differ only in containing $X_i$ resp. $\tilde{X}_i$, we are able to postulate the existence of a constant $d_i$ with $|D_i| \le d_i$, and using Hoeffding's Inequality we finally arrive at Azuma's Inequality.

Theorem 6.6 (Azuma's Inequality). Let $\{Y_n\}_{n=0}^{\infty}$ be a martingale, with $Y_n = f_n(X_1, \ldots, X_n)$, and let there exist constants $c_i$ such that

$\left| f_n(X_1, \ldots, X_i, \ldots, X_n) - f_n(X_1, \ldots, \tilde{X}_i, \ldots, X_n) \right| \le c_i$ (6.28)

where $\tilde{X}_i$ is an independent copy of $X_i$. Then

$\Pr\left\{ \left| f_n(X_1, \ldots, X_i, \ldots, X_n) - E[f_n(X_1, \ldots, X_i, \ldots, X_n)] \right| \ge x \right\} = \Pr\{|Y_n - E[Y_n]| \ge x\} \le 2\exp\left(-\frac{x^2}{2\sum_{i=1}^{n} c_i^2}\right)$.
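A quick simulation shows the flavour of these bounds for the simplest martingale, a sum of independent $\pm 1$ steps, where $|D_i| \le 1$ and (6.27) applies with $c_i = 1$. The following Python sketch is our own illustration, not part of the original report:

import random, math

def tail_probability(n=1000, x=80, trials=20000, seed=7):
    """Empirical Pr{|Y_n| >= x} for Y_n = sum of n independent +-1 steps (a martingale, |D_i| <= 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sum(rng.choice((-1, 1)) for _ in range(n))
        hits += abs(y) >= x
    return hits / trials

n, x = 1000, 80
print(tail_probability(n, x))              # empirical tail probability, around 0.01
print(2 * math.exp(-x ** 2 / (2 * n)))     # Hoeffding/Azuma bound 2*exp(-x^2 / (2*sum c_i^2)) ~ 0.08

The bound is not tight, but it decays exponentially in $x^2/n$, which is exactly the concentration statement we need later.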

6.4 Application to KMP

6.4.1 Establishing m-Convergence

An alignment position in the text is called an unavoidable alignment position if for any $r \le i$ and any $l \ge i + m$ it is an alignment position when the algorithm is run on $t_r^l$. KMP-like algorithms share the same set of unavoidable alignment positions


$U = \bigcup_{l=1}^{n} U_l$ (6.29)

where

$U_l = \min\{\min\{1 \le k \le l : t_k^l \text{ is a prefix of } p\},\ l + 1\}$. (6.30)

This equation on the one hand specifies starting positions of pattern prefixes as unavoidable alignment positions (no positions inside those prefixes, as those would be jumped over), and on the other hand specifies steps of size one if there is no pattern prefix. Interestingly, this property seems to be uniquely limited to Morris-Pratt type algorithms – e.g. the Boyer-Moore algorithm does not have this property.

An algorithm is said to be l-convergent if there exists an increasing sequence of unavoidable alignment positions $\{U_i\}_{i=1}^{n}$ satisfying

$U_{i+1} - U_i \le l$. (6.31)

Thus, l-convergence indicates the maximum size of 'jumps' for an algorithm. For example, the brute force algorithm is 1-convergent and – what we are interested in more – KMP-like algorithms are m-convergent.

Proof. Let $l$ be a text position and let $r$ be any text position with $r \le U_l$. Let $\{A_i\}$ be the set of APs when the algorithm is run on the text portion starting at $r$ (and extending at least to position $l + m$). Note that $r \in \{A_i\}$, as the algorithm inevitably aligns at the starting position. Then we may define the last alignment position $A_J$ before $U_l$ as

$A_J = \max\{A_i : A_i < U_l\}$. (6.32)

So we have $A_{J+1} \ge U_l$. Using an adversary argument we will show that $A_{J+1} > U_l$ cannot be true, thus $A_{J+1} = U_l$. We define

$y = \max\{k : M(k, (k - A_J) + 1) = 1\}$, (6.33)

so $y$ is the rightmost position in the text at which we can do a comparison when starting at $A_J$. Observe that $y \le l$: otherwise, if comparisons were done further, $t_{A_J}^{l}$ would have to be a prefix of the pattern – and this in turn contradicts the definition of $U_l$.

Since KMP-like algorithms are strongly sequential, the text-pattern comparisons define non-decreasing sequences of text positions. For pattern-text comparisons at text position $y + 1$ the pattern cannot be aligned at $A_J$; it has to be aligned at the next alignment position $A_{J+1}$ with $A_{J+1} \le y + 1 \le l + 1$.

The definition of $U_l$ leaves two possibilities: $U_l \le l$ if there is a prefix of the pattern, or $U_l = l + 1$ if there is no prefix. The inequality $A_{J+1} \le l + 1$ together with the second possibility $U_l = l + 1$ contradicts the assumption $U_l < A_{J+1}$, so we may assume the first possibility $U_l \le l$ – this then implies that $t_{U_l}^{l}$ is a prefix of the pattern.

An occurrence of the whole pattern at $U_l$ is thus consistent with the available information. We – as we want to create a contradiction – may assume this is the case. As the sequence $\{A_i\}$ is non-decreasing and $A_{J+1} > U_l$, this occurrence would be 'jumped over' and not be detected by the algorithm. Thus $A_{J+1} = U_l$, as needed.

Taking this a bit further, we may combine $A_{J+1} \le y + 1$ and $y \le A_J + m - 1$ to get $A_{J+1} \le y + 1 \le A_J + m - 1 + 1 = A_J + m$. So we have shown $A_{J+1} - A_J \le m$ for any pair $(A_J, A_{J+1})$ of APs in the text; thus KMP-like algorithms are m-convergent.


6.4.2 Establishing Subadditivity

If $c_n$, the number of comparisons, is subadditive, we may use the Subadditive Ergodic Theorem to prove linear complexity of algorithms. To achieve this we have to show that $c_n$ is (almost) subadditive:

$c_{1,n} \le c_{1,r} + c_{r,n} + a$. (6.34)

After rearranging the equation, it suffices to prove the existence of an $a$ such that

$|c_{1,n} - (c_{1,r} + c_{r,n})| \le a$. (6.35)

Let $U_r$ be the smallest unavoidable alignment position greater than $r$. Then we are able to split $c_{1,n} - (c_{1,r} + c_{r,n})$ into $c_{1,n} - (c_{1,r} + c_{U_r,n})$ and $c_{r,n} - c_{U_r,n}$.

For the first part we have to count:

• $S_1$: comparisons done after position $r$ with alignment positions before $r$. Those only contribute to $c_{1,n}$, but neither to $c_{1,r}$ (the algorithm won't compare after $r$) nor to $c_{U_r,n}$ (the algorithm doesn't align before $U_r$, thus not before $r$ either).

• $S_2$: comparisons done with alignment positions between $r$ and $U_r$. Those also only contribute to $c_{1,n}$, but neither to $c_{1,r}$ nor to $c_{U_r,n}$ (the algorithm in those cases only aligns before $r$ resp. after $U_r$).

Summing up we arrive at

$S_1 = \sum_{AP < r} \sum_{i \ge r} M(i,\ i - AP + 1) \le m^2$. (6.36)

This sum is bounded, as there are at most $m$ alignment positions before $r$ with comparisons done after $r$, and for each alignment position at most $m$ comparisons are done.

$S_2 = \sum_{r \le AP < U_r} \sum_{i \ge r} M(AP + (i-1),\ i) \le lm$ (6.37)

This sum is bounded: because of the l-convergence of sequential algorithms there are at most $l$ text positions between $r$ and $U_r$, each with at most $m$ comparisons done. Note that with m-convergent KMP-like algorithms this would resolve to $m^2$, too.

For the second part we have to count comparisons done with alignment positions before $U_r$ (thus between $r$ and $U_r$). Those contribute to $c_{r,n}$ only, as $c_{U_r,n}$ starts comparing at position $U_r$.

$S_3 = \sum_{r \le AP < U_r} \sum_{i \ge r} M(AP + (i-1),\ i) \le lm$ (6.38)

This is the same sum as $S_2$, hence bounded for the same reasons. Finally we are able to put the parts together:

$|c_{1,n} - (c_{1,r} + c_{r,n})| \le |S_1 + S_2 - S_3| \le m^2 + lm = a$. (6.39)

So by now we have shown (almost-)subadditivity,

$c_{1,n} \le c_{1,r} + c_{r,n} + a$, (6.40)

and are able to apply the Subadditive Ergodic Theorem.


6.4.3 Applying the Subadditive Ergodic Theorem

Before continuing we have to develop some modelling assumptions about the structure of text and pattern.

• Deterministic Model: Both text and pattern are non-random.² In this case we have to maximize complexity over all possible texts.

• Semi-Random Model: The text is a realization of a stationary and ergodic sequence, the pattern is given, thus non-random. In this case we use the average complexity over all texts.

• Stationary Model: Both text and pattern are realizations of a stationary and ergodic sequence, so we use the average complexity over all texts and patterns.

Applying the Subadditive Ergodic Theorem yields similar results for worst and average case:

Deterministic Model: $\lim_{n\to\infty} \frac{\max_t(c_n(t, p))}{n} = \alpha_1(p)$

Semi-Random Model: $\lim_{n\to\infty} \frac{E_t[C_n(p)]}{n} = \alpha_2(p)$

Stationary Model: $\lim_{n\to\infty} \frac{E_{t,p}[C_n]}{n} = \alpha_3$

6.4.4 Applying Azuma’s Inequality

Even if we cannot determine the linearity constants $\alpha_1$ to $\alpha_3$, we can still show that $C_n$ is concentrated around its mean.

We may assume that the text $t$ is generated by a memoryless source, and $C_n$ is a function of this random text $t = t_1, t_2, \ldots, t_n$. By flipping a single character we may change $C_n$ by at most $2m^2$ comparisons, so $C_n$ satisfies the condition for applying Azuma's Inequality:

$|C_n(t_1, t_2, \ldots, t_i, \ldots, t_n) - C_n(t_1, t_2, \ldots, \tilde{t}_i, \ldots, t_n)| \le 2m^2$ (6.41)

Theorem 6.7. Let $t$ be a random text of length $n$ generated by a memoryless source and let the pattern $p$ of length $m$ be given. Then the number $C_n$ of comparisons made by the Knuth-Morris-Pratt algorithm is concentrated around its mean

$E[C_n] = \alpha_2 n\, (1 + o(1))$. (6.42)

More precisely,

$\Pr\{|C_n - \alpha_2 n| \ge \varepsilon n\} \le 2\exp\left(-\frac{(\varepsilon n)^2}{2 n (2m^2)^2}(1 + o(1))\right) = 2\exp\left(-\frac{\varepsilon^2 n}{8 m^4}(1 + o(1))\right)$ (6.43)

for any $\varepsilon > 0$.

²Applying Murphy's Law we may assume text and/or pattern to be exactly what you do not want them to be. . .
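To see this concentration in practice, the following Python sketch (our own experiment, not taken from the report) counts the text-pattern comparisons made by a Morris-Pratt style matcher over many random texts from an unbiased memoryless source; the empirical standard deviation is tiny compared to the mean, as the theorem predicts:

import random, statistics

def mp_comparisons(text, pattern):
    """Count character comparisons made by a Morris-Pratt style matcher."""
    m = len(pattern)
    border = [-1] + [0] * m                      # border[k]: longest proper border of pattern[:k]
    b = -1
    for k in range(1, m + 1):
        while b >= 0 and pattern[b] != pattern[k - 1]:
            b = border[b]
        b += 1
        border[k] = b

    comparisons, k = 0, 0                        # k: length of the currently matched prefix
    for c in text:
        while k >= 0:
            comparisons += 1                     # one text-pattern comparison
            if k < m and pattern[k] == c:
                break
            k = border[k]                        # shift the alignment via the border table
        k += 1
        if k == m:                               # full match; continue with the border
            k = border[k]
    return comparisons

random.seed(0)
pattern = "abab"
counts = [mp_comparisons("".join(random.choice("ab") for _ in range(2000)), pattern)
          for _ in range(200)]
print(statistics.mean(counts), statistics.pstdev(counts))   # std dev is small relative to the mean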


6.5 Concluding Remarks

The Subadditive Ergodic Theorem proves the existence of the linearity constant under quite general probabilistic assumptions. The main prerequisite is the existence of so-called unavoidable alignment positions, a property that seems to be uniquely limited to Knuth-Morris-Pratt like algorithms. Although we have not been able to compute this constant, we have been able to show that the number $C_n$ of comparisons done is concentrated around its mean value $\alpha_2 n$.


Chapter 7

Greedy Algorithms for the Shortest Common Superstring Problem

Anton Nesterov

This paper is based on an article by A. Frieze and W. Szpankowski. It presents some greedy algorithms that solve the Shortest Common Superstring Problem, together with their analysis in a probabilistic framework. Graph algorithms for equivalent problems are also presented.

Various versions of the Shortest Common Superstring (in short SCS) problem play an important role in data compression and DNA sequencing.

Problem Formulation. Given a collection of strings, say $x_1, x_2, \ldots, x_n$ over an alphabet $\Sigma$, find the shortest string $z$ such that each $x_i$ appears as a substring (a consecutive block) of $z$. In DNA sequencing another formulation of the problem may be of even greater interest. We call it the approximate SCS, and one asks for a superstring that contains the original strings $x_1, x_2, \ldots, x_n$ approximately (e.g. in the Hamming distance sense) as substrings.

Our results concern some greedy approximations of the SCS, but in a probabilistic framework. We prove that several greedy algorithms for the SCS problem are asymptotically optimal in the sense that they produce a total overlap of the SCS that differs from the optimal (maximum) overlap by a quantity that is an order of magnitude smaller than the leading term of the overlap.

We assume that the strings are generated independently. We first consider the so-called Bernoulli model, in which symbols of the alphabet $\Sigma$ are generated independently within a string. Later we extend our results to other models: the Markovian model and the mixing model, which is a generalization of the previous ones.

7.1 Definitions

Before presenting the results, we introduce some notation and a framework for describing our algorithms. Suppose $x = x_1 x_2 \ldots x_r$ and $y = y_1 y_2 \ldots y_s$ are strings over the same finite alphabet $\Sigma = (\omega_1, \omega_2, \ldots, \omega_M)$, where $M$ is the size of the alphabet. We define their overlap as

$o(x, y) = \max\{j : y_i = x_{r-j+i},\ 1 \le i \le j\}$.


If $x \ne y$ and $k = o(x, y)$, then

$x \oplus y = x_1 x_2 \ldots x_r\, y_{k+1} y_{k+2} \ldots y_s$.

Let $S$ be the set of all superstrings built over the strings $x_1, \ldots, x_n$. Then

$O_n^{opt} = \sum_{i=1}^{n} |x_i| - \min_{z \in S} |z|$.

We assume that the input strings are independently generated. We analyse the Bernoulli model, that is, each string $x_j = x_1 x_2 \ldots x_l$ has the same length $l$, and $x_i$ is generated independently of $x_1, x_2, \ldots, x_{i-1}$. Furthermore, $P(x_i = \omega_j) = p_j > 0$ for $1 \le j \le M$. Let

$H = -\sum_{i=1}^{M} p_i \log p_i$

be the associated entropy of the Bernoulli model (i.e., of a memoryless source).

7.2 Greedy algorithms

We study the following algorithm: its input is the strings $x_1, x_2, \ldots, x_n$ over $\Sigma$. It outputs a string $z$ which is a superstring of the input.


Generic greedy algorithm.
1. $I \leftarrow \{x_1, x_2, \ldots, x_n\}$; $O_n^{gr} \leftarrow 0$;
2. repeat
3. choose $x, y \in I$; $z \leftarrow x \oplus y$;
4. $I \leftarrow (I \setminus \{x, y\}) \cup \{z\}$;
5. $O_n^{gr} \leftarrow O_n^{gr} + o(x, y)$;
6. until $|I| = 1$

We consider two variants:

GREEDY. In Step 3 choose $x \ne y$ in order to maximize $o(x, y)$.

RGREEDY. In Step 3, $x$ is the string $z$ produced in the previous iteration, while $y$ is chosen in order to maximize $o(x, y) = o(z, y)$. Our initial choice for $x$ is $x_1$. Thus in RGREEDY we have one "long" string $z$ that grows by addition of strings at the right-hand end.
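As a hedged illustration (our own code, not from the paper), the following Python sketch implements the generic greedy scheme in both variants, using the overlap $o(x, y)$ and the merge $x \oplus y$ defined above:

def overlap(x: str, y: str) -> int:
    """o(x, y): length of the longest suffix of x that is a prefix of y."""
    for j in range(min(len(x), len(y)), 0, -1):
        if x[-j:] == y[:j]:
            return j
    return 0

def merge(x: str, y: str) -> str:
    """x (+) y: concatenation with the maximal overlap removed once."""
    return x + y[overlap(x, y):]

def rgreedy(strings):
    """RGREEDY: grow one long string z by merging the best-overlapping string on the right."""
    pool, z, total = list(strings[1:]), strings[0], 0
    while pool:
        y = max(pool, key=lambda s: overlap(z, s))
        total += overlap(z, y)
        z = merge(z, y)
        pool.remove(y)
    return z, total

def greedy(strings):
    """GREEDY: repeatedly merge the pair of distinct strings with the largest overlap."""
    pool, total = list(strings), 0
    while len(pool) > 1:
        i, j = max(((i, j) for i in range(len(pool)) for j in range(len(pool)) if i != j),
                   key=lambda ij: overlap(pool[ij[0]], pool[ij[1]]))
        total += overlap(pool[i], pool[j])
        merged = merge(pool[i], pool[j])
        pool = [s for k, s in enumerate(pool) if k not in (i, j)] + [merged]
    return pool[0], total

print(greedy(["abcde", "cdefg", "efghi"]))    # ('abcdefghi', 6)
print(rgreedy(["abcde", "cdefg", "efghi"]))   # ('abcdefghi', 6)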

7.3 Results

Consider the SCS problem under the Bernoulli model. Let $P = \sum_{j=1}^{M} p_j^2$. Then, with high probability,

$\lim_{n\to\infty} \frac{O_n^{opt}}{n \log n} = \frac{1}{H}$,

$\lim_{n\to\infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H}$,

provided

$|x_i| > -\frac{4}{\log P} \log n$ for all $1 \le i \le n$.

In many applications, notably for data compression and the DNA recombination problem, the Bernoulli model assumption is too unrealistic. Therefore we extend this theorem to the case when there is some dependency among symbols within a string. However, we still assume that the strings $x_1, x_2, \ldots, x_n$ are statistically independent. But we restrict somewhat the dependency among the symbols of each string; that is, we describe the main ideas of the mixing model.

Mixing Model. During the generation of a string, each symbol depends on all previous ones, but the farther away the symbol, the smaller the dependence on it.

7.4 Compression

The SCS can be used to compress strings. Indeed, instead of storing all strings of total length $nl$ we can store the SCS and $n$ pointers indicating the beginning of each original string, plus the lengths of all strings. However, this does not provide optimal compression (which is known to be the entropy $H$). To show this, compute the compression ratio $C_n$, which is defined as the ratio of the number of bits needed to transmit the compression code to the length of the original set of strings. It is easy to see that

$C_n = \frac{nl - (1/H) n \log n + n \log_2(nl - (1/H) n \log n)}{nl}$

where the first term of the numerator represents the length of the shortest superstring and the second term corresponds to the number of bits needed to encode the pointers. Observe that when the length of a string $l$ grows faster than $\log n$, then $C_n \to 1$, which means no compression. When $l = O(\log n)$ some compression might take place. The fact that the SCS does not compress well is hardly surprising: in the construction of the SCS we do not use all the available redundancy of the strings, but only that contained in suffixes/prefixes of the original strings.

7.5 Graph processes

In this section some graph algorithms that correspond to GREEDY and RGREEDY are presented. But before that, we first show that a pair $i, j$ such that $o(x_i, x_j) \ge l/2$ is unlikely to exist. Let $\varepsilon$ denote the event that there is no such pair. If $l = K \log n$, then

$P(\neg\varepsilon) \le \frac{n(n-1)}{2} \sum_{k=l/2}^{l} P^k = O\!\left(n^{2 + (K \log P)/2}\right) = o(1)$

provided $K \ge -4/\log P$.

7.5.1 RGREEDY

Consider a tree process that is equivalent to RGREEDY. Let $T$ be an infinite rooted $M$-ary tree. The $M$ (size of the alphabet) edges leading down from each vertex are labeled with $\omega_1, \omega_2, \ldots, \omega_M$. Thus, each vertex of depth $d$ is identified with a string of length $d$. Also, label each vertex $v$ with an integer $\nu(v)$, the number of strings that have the prefix associated with this vertex.

We model the process of RGREEDY in the following way: a particle $Z$ starts at the root. Then at a vertex $v$ it moves to $v$'s $\omega_j$ descendant with probability $p_j$. The particle stops at depth $l/2$. Let $\omega = s_k s_{k-1} \ldots s_1$ be the lowest vertex on the path traversed that has a nonzero $\nu$. This process models the computation of the largest suffix $s_k s_{k-1} \ldots s_1$ of $z$ which can be merged with a prefix of some $a_i$.

Then we model the deletion of that string $a_t = a_1 a_2 \ldots a_{l/2}$ which has the prefix $a_1 a_2 \ldots a_k$. Let $\omega_i = a_1 a_2 \ldots a_i$. Put $\nu(\omega_i) = \max\{0, \nu(\omega_i) - 1\}$ for $1 \le i \le l/2$. We iterate the above process $n - 1$ times.

7.5.2 GREEDY

Let $D$ be the digraph $([n], A)$ with edge weights $\omega_{i,j} = o(b_i, a_j)$ for $i, j \in [n]$.

Sort the edges $A$ into $e_1, e_2, \ldots, e_N$, where $N = n^2$, so that $\omega(e_i) \ge \omega(e_{i+1})$;
$SG \leftarrow \emptyset$;
for $i = 1$ to $N$ do: if $SG \cup \{e_i\}$ contains in $D$ neither a vertex of outdegree or indegree at least 2 in $SG$, nor a directed cycle, then $SG \leftarrow SG \cup \{e_i\}$.

After termination $SG$ contains the $n - 1$ edges of a Hamilton path of $D$ and corresponds to a superstring of $x_1, x_2, \ldots, x_n$. The selection of an edge with weight $o(b_i, a_j)$ corresponds to overlapping $x_i$ to the left of $x_j$.


Chapter 8

Analysis of Pattern Occurrences

Roland Aydin

This paper summarizes the proof of the formula for the expected number of occurrences of a given pattern $H$ in a text of size $n$. The intuitive solution $E[O_n(H)] = P(H)(n - m + 1)$ will be verified utilising generating functions. The frequency analysis relies on the decomposition of the text $T$ into languages, the so-called initial, minimal, and tail languages. Going from there to their generating functions, both for a Markovian and a Bernoulli environment, the formula will be shown to hold due to properties of the respective generating functions.

8.1 Preliminaries

Markov sequence

A sequence $X_1, X_2, \ldots$ of random variates is called a Markov sequence of order 1 iff, for any $n$,

$F(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1) = F(X_n \mid X_{n-1})$,

i.e., if the conditional distribution $F$ of $X_n$ given $X_{n-1}, X_{n-2}, \ldots, X_1$ equals the conditional distribution $F$ of $X_n$ given only $X_{n-1}$.

Markov chain

If a Markov sequence of random variates $X_n$ takes the discrete values $a_1, \ldots, a_N$, then

$P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}}, \ldots, x_1 = a_{i_1}) = P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}})$

and the sequence $\{x_n\}$ is called a Markov chain of order 1.

Correlation of patterns

A correlation of two patterns $X$ (of size $m$) and $Y$ is a string over the set $\Omega = \{0, 1\}$, denoted $XY$, with

$|XY| = |X|$.


Each position $i$ of the correlation is computed as follows: $(XY)_i = 1$ iff, when $Y$ is placed at position $i$ of $X$, all overlapping pairs of symbols are identical; otherwise $(XY)_i = 0$.

Example of pattern correlation

Let $\Omega = \{M, P\}$, $X = MPMPPM$ and $Y = MPPMP$. Sliding $Y$ along $X$ and checking the overlapping symbols position by position yields $XY = 001001$, whilst $YX$ can be shown to equal $00010$.

Representation of the correlation

Other representations of either string:

1. as a number in some base $t$; thus, e.g., $XY_2 = 9$.

2. as a polynomial; thus, e.g., $XY_t = t^3 + 1$.

Autocorrelation

Furthermore, the autocorrelation of $X$ can be defined as $XX$. It represents the periods of $X$, i.e. those shifts of $X$ that cause the pattern to overlap itself. Using $Y = MPPMP$ from our previous example, $YY$ evaluates to $10010$. Using $A = MMM$, $AA$ evaluates to $111$.
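A direct way to compute these correlation strings is to slide one pattern along the other and compare the overlapping symbols. The Python sketch below is our own illustration (the function name is not from the paper); it reproduces the examples above:

def correlation(x: str, y: str) -> str:
    """Correlation XY: bit i is 1 iff y, placed at position i of x, agrees on the whole overlap."""
    bits = []
    for i in range(len(x)):
        overlap = x[i:]                      # the part of x that y would cover
        bits.append('1' if y.startswith(overlap) or overlap.startswith(y) else '0')
    return ''.join(bits)

print(correlation("MPMPPM", "MPPMP"))   # XY = 001001
print(correlation("MPPMP", "MPMPPM"))   # YX = 00010
print(correlation("MPPMP", "MPPMP"))    # autocorrelation YY = 10010
print(correlation("MMM", "MMM"))        # AA = 111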

Autocorrelation set

Given a string $H$, the autocorrelation set $\mathcal{A}_{HH}$, or just $\mathcal{A}$, is defined as

$\mathcal{A}_{HH} = \{H_{k+1}^{m} : H_1^k = H_{m-k+1}^{m}\}$.

Example of an autocorrelation set

Let $H = SOS$. The autocorrelation turns out to be

$HH = 101$,

whereas the autocorrelation set in that case is

$\mathcal{A} = \{\varepsilon, OS\}$.


Let’s play a game

The Penney game – invented by Penney. Each player chooses a pattern. They then flip a coin until one of the patterns comes up consecutively. The player who chooses only one repeated symbol ($k$ times) has a chance to win of at most 0.5. This is because of that pattern's "optimal" autocorrelation.

8.2 Sources

Bernoulli

A Bernoulli Source, or memoryless source, generates text randomly. Every subsequent symbol (of a finite alphabet) is created independently of its predecessors, and the probability of each symbol is not necessarily the same. If it is, the Source is called a symmetric, or unbiased, Bernoulli Source. If text over an alphabet $S$ is generated by a Bernoulli Source, then each symbol $s \in S$ always occurs with probability $P(s)$.

Markovian Source

A Markovian Source does not generate symbols based merely on the a priori probability of each symbol. Instead, it heeds a (finite) set of predecessors to ascertain the probability of each next symbol. In order to do so, it requires a memory of previously emitted symbols.

Text generated by a Markovian Source is a realization of a Markov sequence of order $K$. $K$ denotes the number of previous symbols that the probability of the next symbol depends on. In our application, this sequence will be stationary and $K = 1$, i.e. a first-order Markov sequence. When computing the next symbol, we only need to observe the last symbol.

In our case ($K = 1$), the transition matrix is defined by

$P = \{p_{i,j}\}_{i,j \in S}$

where

$p_{i,j} = \Pr(t_{k+1} = j \mid t_k = i)$.

The matrix entry $(i, j)$ denotes the conditional probability of the next symbol being $j$ if the current symbol is $i$.

8.3 Generating functions of languages

What is a language, after all

A language L is a collection of words.

This collection must satisfy certain properties to belong to a specific language.

Thus, we can associate with a language L its generating function L(z).


Generating functions

Given a sequence $\{a_n\}_{n\ge 0}$, we know its generating function is defined as

$A(z) = \sum_{n\ge 0} a_n z^n$.

For sinister purposes, we represent it differently as

$A(z) = \sum_{\alpha \in S} z^{w(\alpha)}$

where $S$ is a set of objects (words, ...) and $w(\alpha)$ is a weight function. Henceforth we will interpret it as the size of $\alpha$, i.e. $w(\alpha) = |\alpha|$. The equivalence becomes evident when we set $a_n$ to be the number of objects $\alpha$ satisfying $w(\alpha) = n$. Now we have a more combinatorial view.

Generating function of a language

Now, for any language $L$, we define its generating function $L(z)$ as

$L(z) = \sum_{w \in L} P(w) z^{|w|}$

where $P(w)$ is the probability of the occurrence of word $w$ and $|w|$ is the length of $w$. So the coefficient of $z^k$ is the sum of the probabilities of all words of length $k$. In addition, we assume that $P(\varepsilon) = 1$, so every language includes the empty word (as we know).

Conditional generating function

In addition, the $H$-conditional generating function of $L$ is given as

$L_H(z) = \sum_{w \in L} P(w \mid w_{-m} = h_1, \ldots, w_{-1} = h_m)\, z^{|w|} = \sum_{w \in L} P(w \mid w_{-m}^{-1} = H)\, z^{|w|}$

where $w_{-i}$ is the symbol preceding the first character of $w$ at distance $i$. We use this definition for Markovian sources, where the probability depends on the previous symbols.

Example: autocorrelation generating function

In our previous example, the autocorrelation set was

$\mathcal{A} = \{\varepsilon, OS\}$.

The generating function of the set is

$A(z) = 1 + \frac{z^2}{4}$

given a (symmetric) Bernoulli source, and

$A_{SOS}(z) = 1 + p_{SO}\, p_{OS}\, z^2$

given a Markovian source of order one.


Formulating our objective

We will now formulate the special generating functions whose closed forms we will later strive to compute:

1. $T^{(r)}(z) = \sum_{n\ge 0} \Pr(O_n(H) = r)\, z^n$

2. $T(z, u) = \sum_{r=1}^{\infty} T^{(r)}(z)\, u^r = \sum_{r=1}^{\infty} \sum_{n=0}^{\infty} \Pr(O_n(H) = r)\, z^n u^r$

8.4 Declaring languages

Introduction

Let H be a given pattern.

• The initial language R is the set of words containing only one occurrence of $H$, located at the right end.

• The tail language U is defined as the set of words $u$ such that $Hu$ has exactly one occurrence of $H$, which occurs at the left end.

• The minimal language M is the set of words $w$ such that $Hw$ has exactly two occurrences of $H$, located at its left and right ends.

Component languages

We differentiate several special languages, given a pattern $H$ ("·" stands for concatenation of words):

1. $R = \{r : r \in T_1 \wedge H \text{ occurs at the right end of } r\}$
2. $U = \{u : H \cdot u \in T_1\}$
3. $M = \{w : H \cdot w \in T_2 \wedge H \text{ occurs at the right end of } H \cdot w\}$

8.5 Language relationships

Qualities of Tr

At first, we will try to describe the languages $T$ and $T_r$ in terms of $R$, $M$ and $U$:

$\forall r \ge 1: \quad T_r = R \cdot M^{r-1} \cdot U$

Composition proof (Tr)

Proof: The first occurrence of $H$ in a $T_r$ word determines the prefix $p$, which is in $R$. From that prefix on, we look onward until the next occurrence of $H$; the word found in between is in $M$. After $r - 1$ such iterations, we add an $H$-devoid suffix, which is in $U$, because the part preceding it ends with $H$. □


Qualities of T

The "extended" version of $T_r$, its words including an arbitrary number of $H$ occurrences, can be composed similarly:

$T = R \cdot M^{*} \cdot U$

where $M^{*} := \bigcup_{r=0}^{\infty} M^r$.

Composition proof (T )

Proof: A word belongs to $T$ if for some $1 \le r < \infty$ it belongs to $T_r$. As $\bigcup_{r=1}^{\infty} M^{r-1} = \bigcup_{r=0}^{\infty} M^r = M^{*}$, the assertion is proven. □

Four language relationships

Analyzing the relationships between $M$, $U$ and $R$ further, we introduce

1. $W$, the set of all words,
2. $S$, the alphabet set,
3. the operators "+" and "−", which denote disjoint union and language subtraction.

Four language relationships I

$\bigcup_{k\ge 1} M^k = W \cdot H + (\mathcal{A} - \varepsilon)$

Proof:
$\Leftarrow$: Let $k$ be the number of occurrences of $H$ in a word of $W \cdot H$; then $k \ge 1$. The last occurrence of $H$ in every included word is at the right end. That means that $W \cdot H \subseteq \bigcup_{k\ge 1} M^k$.
$\Rightarrow$: Let $w \in \bigcup_{k\ge 1} M^k$. If $|w| \ge |H|$, then surely the inclusion is correct. If $|w| < |H|$ (how can that be?), then $w \notin W \cdot H$. But then, necessarily, $w \in \mathcal{A} - \varepsilon$, because the second $H$ in $Hw$ overlaps with the first $H$ by definition ($w$ is an element of $M^k$), so $w$ must be in the autocorrelation set $\mathcal{A}$. □

Four language relationships II

$U \cdot S = M + U - \varepsilon$

Proof: All words of $S$ consist of a single character $s$. Given a word $u \in U$ and concatenating it with $s$, we differentiate two cases. If $Hus$ contains a second occurrence of $H$, it is clearly at the right end; then $us \in M$. If $Hus$ contains only a single $H$, then $us$ must be a non-empty word of $U$. □


Four language relationships III

$H \cdot M = S \cdot R - (R - H)$

Proof: $\Rightarrow$: Let $sw$ be a word in $H \cdot M$, $s \in S$ (we can write every such word in this way WLOG). $sw$ contains exactly two occurrences of $H$, one at its left and one at its right end. Thus $sw$ is also in $S \cdot R$.
$\Leftarrow$: If a word $swH$ from $S \cdot R$ is not in $R$, then this is because it contains a second $H$ starting at the left end of $sw$, since $wH \in R$. Of course, in that case it is in $H \cdot M$. □

Four language relationships IV

$T_0 \cdot H = R \cdot \mathcal{A}$

Proof: Let $wH$ be in $T_0 \cdot H$. Then there can be either one or more occurrences of $H$ in $wH$, one of which is at the right end. If there is no second one, then $wH \in R$ by definition of $R$. If, however, there is a second one, then it overlaps somehow with the first one. So we view the word up to the end of the first $H$, which is in $R$; due to the overlapping, the remaining part is in $\mathcal{A}$. □

One more

Combining relationships II and III yields

$H \cdot U \cdot S - H \cdot U = (S - \varepsilon) R$.

No separate proof is necessary, as we have validated both ingredients. Using II, the left side is $H(U \cdot S - U) = H \cdot M$. The right side is

$S \cdot R - R = S \cdot R - (R \cap S \cdot R) = S \cdot R - (R - H)$.

Together, that is just relationship III.

8.6 Languages & Generating Functions

In the Bernoulli environment

We will now transcend from languages to their generating functions. Given any language $L_1$, we know its generating function to be

$A_1(z) = \sum_{w \in L_1} P(w) z^{|w|}$.


So what is the result of multiplying two languages (i.e. concatenating them) with respect to their generating functions? What is $L_3 = L_1 \cdot L_2$?

$A_3(z) = \sum_{w \in L_3} P(w) z^{|w|} = \sum_{w_1 \in L_1 \wedge w_2 \in L_2} P(w_1) P(w_2) z^{|w_1| + |w_2|} = \sum_{w_1 \in L_1} P(w_1) z^{|w_1|} \sum_{w_2 \in L_2} P(w_2) z^{|w_2|} = A_1(z) A_2(z)$

Note that the assumption $P(wv) = P(w) P(v)$ only holds true with a memoryless source.

Special Cases

A few particular cases:

• $S$ (alphabet set) $\Rightarrow S(z) = \sum_{s \in S} P(s) z^{|s|} = z$

• $L = S \cdot L_1 \Rightarrow L(z) = z L_1(z)$

• $\{\varepsilon\} \Rightarrow E(z) = \sum_{w \in \{\varepsilon\}} P(w) z^{|w|} = 1 \cdot 1 = 1$

• $\{H\} \Rightarrow H(z) = \sum_{w = H} P(H) z^{|H|} = P(H) z^m$

• $W$ (the set of all words) $\Rightarrow W(z) = \sum_{w} P(w) z^{|w|} = \sum_{k \ge 0} z^k = \frac{1}{1 - z}$

8.7 Looking for Generating Functions

Translating I

We will now attempt to translate our known language relationships into generating functions. (In case I only, the formula we derive is correct just for a memoryless source.)

$\bigcup_{k\ge 1} M^k = W \cdot H + (\mathcal{A} - \varepsilon)$

$\sum_{k=1}^{\infty} M_H(z)^k = W(z) \cdot P(H) z^m + A_H(z) - 1$

$\sum_{k=0}^{\infty} M_H(z)^k - 1 = \frac{1}{1 - z} \cdot P(H) z^m + A_H(z) - 1$

$\frac{1}{1 - M_H(z)} = \frac{1}{1 - z} \cdot P(H) z^m + A_H(z)$

Translating II

$U \cdot S = M + U - \varepsilon$
$U \cdot S - U = M - \varepsilon$
$U_H(z)\, z - U_H(z) = M_H(z) - 1$
$U_H(z)(z - 1) = M_H(z) - 1$
$U_H(z) = \frac{M_H(z) - 1}{z - 1}$


Translating III

$H \cdot M = S \cdot R - (R - H)$
$H \cdot M - H = S \cdot R - R$
$P(H) z^m M_H(z) - P(H) z^m = S(z) R(z) - R(z)$
$P(H) z^m (M_H(z) - 1) = R(z)(z - 1)$
$R(z) = P(H) z^m \frac{M_H(z) - 1}{z - 1}$
$R(z) = P(H) z^m U_H(z)$

8.8 Main findings I

$T^{(r)}(z)$

We remember that, for $r \ge 1$,

$T_r = R \cdot M^{r-1} \cdot U.$

We have now gleaned every component, and can translate it (for $r \ge 1$) into

$T^{(r)}(z) = R(z)\, M_H^{r-1}(z)\, U_H(z)$.

$T(z, u)$

We also remember that

$T = R \cdot M^{*} \cdot U.$

As $T$ is the language with any number of occurrences of $H$, its generating function is indeed

$T(z, u) = R(z) \frac{u}{1 - u M_H(z)} U_H(z)$.

8.9 On to other shores

What is left to do?

We still have no formula for obtaining $O_n(H)$, i.e. the frequency of occurrences of $H$ ($|H| = m$) in a random text of length $n$ over an alphabet $S$ with $|S| = V$.

Let us make an educated guess, though. What we do not know is how important overlapping is. Assuming we may disregard that issue, the answer could be

$E[O_n(H)] = P(H)(n - m + 1)$.

It is.

But why?

Using derivatives

Looking at our bivariate generating function of $T$,

$T(z, u) = \sum_{r=1}^{\infty} \sum_{n=0}^{\infty} \Pr(O_n(H) = r)\, z^n u^r$,


we notice that we would like the two sums to be reversed. Differentiating it with respect to $u$,

$T_u(z, u) = \sum_{r=1}^{\infty} \sum_{n=0}^{\infty} \Pr(O_n(H) = r)\, z^n\, r\, u^{r-1}$

(the factor $r$ counts the occurrences), and setting $u$ to 1 leads to

$T_u(z, 1) = \sum_{n=0}^{\infty} \left( \sum_{r=1}^{\infty} r \Pr(O_n(H) = r) \right) z^n = \sum_{n=0}^{\infty} E[O_n(H)]\, z^n$.

Proof Preparations

To shorten things, we introduce

$D_H(z) = (1 - z) A_H(z) + z^m P(H)$

and rewrite $M_H(z)$ as

$M_H(z) = 1 + \frac{z - 1}{D_H(z)}$,

as well as

$U_H(z) = \frac{1}{D_H(z)}$

and

$R(z) = z^m P(H) \frac{1}{D_H(z)}$.

Deriving the closed form formula (1)

$T_u(z, u) = \frac{d}{du}\left( R(z)\, U_H(z)\, \frac{u}{1 - u M_H(z)} \right) = R(z)\, U_H(z)\, \frac{(1 - u M_H(z)) + u M_H(z)}{(1 - u M_H(z))^2} = R(z)\, U_H(z)\, \frac{1}{(1 - u M_H(z))^2}$

Deriving the closed form formula (2)

$u$ is now set to 1 due to the previous calculation:

$T_u(z, 1) = R(z)\, U_H(z)\, \frac{1}{(1 - M_H(z))^2} = R(z)\, U_H(z) \left( \frac{z - 1}{D_H(z)} \right)^{-2} = R(z)\, U_H(z)\, \frac{D_H(z)^2}{(z - 1)^2} = R(z)\, \frac{1}{D_H(z)}\, \frac{D_H(z)^2}{(z - 1)^2} = z^m P(H)\, \frac{1}{D_H(z)}\, \frac{D_H(z)}{(z - 1)^2} = \frac{z^m P(H)}{(z - 1)^2}$


Main findings II

As the text has length $n$, we extract the $n$th coefficient of $T_u(z, 1)$, and voilà:

$E[O_n] = [z^n]\, T_u(z, 1) = P(H)\, [z^n]\, z^m (1 - z)^{-2} = P(H)\, [z^{n-m}]\, (1 - z)^{-2} = (n - m + 1) P(H)$
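This closed form is easy to sanity-check by simulation. The short Python sketch below (our own check, not part of the original paper) counts overlapping occurrences of a pattern in random memoryless text and compares the empirical mean with $(n - m + 1)P(H)$:

import random

def count_occurrences(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    m = len(pattern)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern)

def empirical_mean(pattern="SOS", n=500, trials=2000, seed=42):
    """Average O_n(H) over random texts from an unbiased two-symbol memoryless source."""
    rng = random.Random(seed)
    alphabet = "SO"
    total = 0
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(n))
        total += count_occurrences(text, pattern)
    return total / trials

n, pattern = 500, "SOS"
prob_H = 0.5 ** len(pattern)                                   # P(H) for the unbiased source
print(empirical_mean(pattern, n), (n - len(pattern) + 1) * prob_H)   # both are about 62.25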

About certainty

The variance of $O_n(H)$ is, for some $r > 1$,

$\operatorname{Var}[O_n(H)] = n c_1 + c_2 + O(r^{-n})$

where

$c_1 = P(H)\left(2 A_H(1) - 1 - (2m - 1) P(H) + 2 P(H) E_1\right)$,

$c_2 = P(H)\left((m - 1)(3m - 1) P(H) - (m - 1)(2 A_H(1) - 1) - 2 A_H'(1)\right) - 2(2m - 1)\left(P(H)^2 E_1 + 2 E_2 P(H)^2\right)$,

and $E_1$, $E_2$ are

$E_1 = \frac{1}{\pi_{h_1}}\left[(P - \Pi) Z\right]_{h_m, h_1}, \qquad E_2 = \frac{1}{\pi_{h_1}}\left[(P^2 - \Pi) Z^2\right]_{h_m, h_1}$.

Without going into detail (cf. the literature references), we see that the variance depends mainly (linearly) on the length of the text, plus a constant.


Chapter 9

Rice's integrals – a method for solving generalized differences

Thomas Preu

Often in the analysis of algorithms and data structures we need to estimate the asymptotic growth of differences and sequences defined by recurrence equations. But often we don't know an explicit representation of the solutions. Therefore we need methods with which we can establish asymptotics without knowing the exact representation. Rice's integral is such a method. In this paper we will introduce basics of complex analysis and develop the mathematical foundations needed in theoretical computer science. The paper is based on the lecture "Analysis 4" held by Prof. W. Heise in 2003 at the TU Munchen and an article by Flajolet et al. [FS95].

9.1 Basics of Complex Analysis

9.1.1 Complex Differentiability

In the whole of this paper we will have to deal extensively with complex analysis. As complex analysis is often seen as a discipline of pure mathematics, most people working in computer science, even in theoretical computer science, have not heard much about it. So I will give a rough introduction to it and present the needed theorems. It is assumed that the reader has basic knowledge of real analysis, e.g., knows what a real function is, what continuity and differentiability mean, and what complex numbers are.

First of all we consider a complex mapping on an open set $E$:

$f : E \subset \mathbb{C} \to \mathbb{C}, \quad z = x + yi \mapsto f(z) = u(x, y) + i v(x, y)$ (9.1)

This mapping is said to be continuous if the corresponding 2-dimensional mapping

$g : E' \subset \mathbb{R}^2 \to \mathbb{R}^2, \quad (x, y) \mapsto (u(x, y), v(x, y))$ (9.2)

is continuous in the sense of multidimensional real analysis.


If $g$ is differentiable at $(x_0, y_0)$, we have the derivative

$A := \begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \end{pmatrix}_{x = x_0,\, y = y_0}$ (9.3)

and $A$ is a linear mapping with the property

$\forall (x, y) \in E' : \quad g(x, y) = g(x_0, y_0) + A \cdot \begin{pmatrix} x - x_0 \\ y - y_0 \end{pmatrix} + B(x - x_0, y - y_0)$ (9.4)

where $B$ is a mapping with $B(0, 0) = (0, 0)^T$ and $\lim_{(x,y)\to(0,0)} \frac{B(x,y)}{|(x,y)|} = (0, 0)^T$, which, of course, implies continuity at $(0, 0)$.

As a linear mapping $E \subset \mathbb{C} \to \mathbb{C}$ is in fact a complex multiplication, the natural way to introduce complex derivatives is to try to find a complex number $f'(z_0)$ which acts the same way as $A$ in (9.4):

$\forall z \in E : \quad f(z) = f(z_0) + f'(z_0) \cdot ((x - x_0) + i(y - y_0)) + b((x - x_0) + i(y - y_0))$ (9.5)

where $b$ is a complex mapping with $b(0) = 0$ and $\lim_{z\to 0} \frac{b(z)}{|z|} = 0$. If such a number $f'(z_0)$ exists, $f$ is said to be complex differentiable at $z_0$.

Now we compare the matrix-vector multiplication with the product of two complex numbers:

$\begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{1,1} x_1 + a_{1,2} x_2 \\ a_{2,1} x_1 + a_{2,2} x_2 \end{pmatrix}$ (9.6)

$(a_1 + i a_2)(x_1 + i x_2) = (a_1 x_1 - a_2 x_2) + i(a_2 x_1 + a_1 x_2)$ (9.7)

and we get the correspondence $a_{1,1} = a_1 = a_{2,2}$, $a_{1,2} = -a_2$ and $a_{2,1} = a_2$.

From this it is plausible that a real function $\mathbb{R}^2 \to \mathbb{R}^2$, understood as a complex function, is complex differentiable iff $u_x = v_y$ and $u_y = -v_x$. These equations are known as the Cauchy-Riemann Differential Equations (CRDE). As a result, every complex differentiable function is differentiable in the real sense, but a real differentiable function is only complex differentiable if the CRDEs are satisfied. So complex differentiability is somewhat stronger.

Definition 9.1. A complex mapping $f$, as introduced in (9.1), is said to be differentiable at a point $z_0 \in E$ with derivative $f'(z_0)$ iff (9.5) holds for some appropriate $b$.

Definition 9.2. The complex mapping $f$ is said to be holomorphic at a point $z_0$ of $E$ iff $f$ is complex differentiable at every point of a neighbourhood $F \subset E$ of $z_0$, which is more than just being complex differentiable at $z_0$.

Definition 9.3. The complex mapping $f$ is said to be holomorphic on an open set $F \subset E$ iff $f$ is holomorphic at every point $z_0 \in F$.

In fact, when evaluating complex derivatives, not many changes occur; e.g. $(z^n)' = n z^{n-1}$, $(e^z)' = e^z$ or $(\sin(z))' = \cos(z)$.

9.1.2 Integration

In real analysis we integrate over intervals, where the integral is a limit of sums. These sums take into account the values of the function and the lengths of the parts of the discretisation of the interval. The discretisation $Z$ gets "finer" in the sense that the longest part tends to 0:

$\int_a^b f(x)\, dx = \lim_{n\to\infty,\, |Z_n|\to 0}\ \sum_{k=0}^{n} f(x_{n,k})\,(z_{n,k+1} - z_{n,k}), \qquad x_{n,k} \in [z_{n,k}, z_{n,k+1}]$ (9.8)


where $z_{n,0} = a < z_{n,1} < \ldots < z_{n,n} < z_{n,n+1} = b$ for every $n$. In some way, we walk along the interval from $a$ to $b$, picking some points where we examine the function more closely, and get a number from this process. Of course, it shouldn't matter whether we are faster or slower while "walking". This is expressed by the substitution formula:

$\int_a^b g(\gamma(s))\, \gamma'(s)\, ds = \int_c^d g(t)\, dt$ (9.9)

where $\gamma : [a, b] \to [c, d]$ is piecewise differentiable and monotonous and $g : [c, d] \to \mathbb{R}$ is piecewise continuous. In $\gamma$ we have the information how fast we move along the interval at each point of the parametrisation.

In a similar way we can integrate "along" a piecewise differentiable curve $\gamma : \mathbb{R} \to \mathbb{C},\ t \mapsto \gamma(t)$; we can think of $\mathbb{C}$ as $\mathbb{R}^2$. In this situation $\gamma$ is called an integration path. The integral of a complex function is the sum of the integrals of the real and the imaginary part, so for a complex function $f$ we get¹:

Definition 9.4. The integral of a complex function $f$ along a curve $\gamma$ is:

$\int_\gamma f(z)\, dz = \int_a^b f(\gamma(t)) \cdot \gamma'(t)\, dt = \int_a^b \Re\!\left(f(\gamma(t)) \cdot \gamma'(t)\right) dt + i \int_a^b \Im\!\left(f(\gamma(t)) \cdot \gamma'(t)\right) dt$ (9.10)

For example, take the curve $\gamma : [0, 2\pi] \to \mathbb{C},\ t \mapsto e^{it}$, which is a circle of radius 1 and center 0, and $f(z) = \frac{1}{z}$:

$\int_\gamma \frac{dz}{z} = \int_0^{2\pi} \frac{1}{e^{it}} \cdot i e^{it}\, dt = \int_0^{2\pi} i\, dt = 2\pi i$ (9.11)
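This example is easy to check numerically by discretizing the parametrization. The Python sketch below is our own illustration (not part of the paper); it approximates $\int_a^b f(\gamma(t))\,\gamma'(t)\,dt$ on the unit circle by a Riemann sum:

import cmath

def contour_integral(f, n=100000):
    """Approximate the integral of f along the unit circle gamma(t) = exp(it), 0 <= t <= 2*pi."""
    total = 0j
    dt = 2 * cmath.pi / n
    for k in range(n):
        t = k * dt
        gamma = cmath.exp(1j * t)            # point on the circle
        dgamma = 1j * cmath.exp(1j * t)      # gamma'(t)
        total += f(gamma) * dgamma * dt      # Riemann sum for the path integral
    return total

print(contour_integral(lambda z: 1 / z))     # ~ 2*pi*i, matching (9.11)
print(contour_integral(lambda z: z ** 2))    # ~ 0, consistent with the Cauchy integral theorem below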

Some properties of real integrals carry over almost literally. The length of a path of integration is

$\Lambda(\gamma) = \int_a^b |\gamma'(t)|\, dt = \int_a^b \sqrt{(\Re\gamma'(t))^2 + (\Im\gamma'(t))^2}\, dt$ (9.12)

For a piecewise continuous $f : [a, b] \to \mathbb{C}$:

$\left|\int_a^b f(t)\, dt\right| \le \int_a^b |f(t)|\, dt$ (9.13)

For $f$ continuous on the image of a path of integration $\gamma$, and $M$ the maximum absolute value of $f$ on that image, we have:

$\left|\int_\gamma f(t)\, dt\right| \le M \cdot \Lambda(\gamma)$ (9.14)

So far we have always considered curves, and not their images, as the domain of integration. But the following result shows that, in a sense, only the image is important. For this purpose we call a surjective, piecewise continuously differentiable real function $\varphi$ a parameter transform iff at all points of its domain where $\varphi$ is differentiable, the derivative is greater than 0.

Theorem 9.1. If $\varphi : [c, d] \to [a, b]$ is a parameter transform and $\gamma_1 : [a, b] \to \mathbb{C}$ an integration path, then $\gamma_2 = \gamma_1 \circ \varphi : [c, d] \to \mathbb{C}$ is also a path of integration, and for a function $f$ continuous on the image of $\gamma_1$ we have:

$\int_{\gamma_1} f(z)\, dz = \int_{\gamma_2} f(z)\, dz$ (9.15)

¹Note that we integrate along a curve in the first place and not along the image of the curve.


Although the proof is simple, we won't state it here. The essence of this theorem is that the speed at which we follow the line defined by a curve is not important for the value of the integral – as long as we follow the line "smoothly" enough, i.e. we don't "jump" around and we don't stay too long at one point.

Definition 9.5. For a continuous function $f$ on an open set $E \subset \mathbb{C}$ we call $F : E \to \mathbb{C}$ an antiderivative of $f$ if $F$ is holomorphic on $E$ and $F' = f$. $f$ is said to have local antiderivatives if for each $z_0 \in E$ there exists a neighbourhood $G \subset E$ of $z_0$ such that there exists a holomorphic $F$ with $F' = f$ on $G$.

9.1.3 Holomorphic functions and the Cauchy Integral Theorem

Holomorphic functions have many nice properties. The proofs are often sophisticated and technical, so we will again omit them.

Definition 9.6. A complex function $f$ is said to be analytic at a point $z_0$ if there exists a neighbourhood $G$ of $z_0$, for example an open circle, such that $f$ has a power series expansion which converges for some radius $R$. $f$ is said to be analytic on an open set if it is analytic at every point of this set.

Theorem 9.2. For a continuous function $f$ on a domain² $E$ the following statements are equivalent:

1. $f$ is holomorphic

2. $f$ has local antiderivatives

3. $f$ is differentiable in the real sense and obeys the CRDEs

4. $f$ is analytic

Since power series can be differentiated without any loss in their domain of convergence, this shows that holomorphic functions are arbitrarily often differentiable – at least in a local sense; but since we can cover domains with circles, we can connect the domains of definition of these derivatives and get one derivative for the whole domain of a holomorphic function. This is an outstanding property of holomorphic functions: in real analysis a function can be differentiable but need not be differentiable a second time; in complex analysis a function which is differentiable on an open set is arbitrarily often differentiable.

Since power series are more or less Taylor series, a holomorphic function is completely defined by only one point – if you know all higher order derivatives at just that point. Another interesting property similar to that is: when you know the function at countably many points which cluster at one point, then the holomorphic function is also defined uniquely. All this has the consequence that whenever we have a holomorphic function on a certain domain, possible holomorphic extensions are unique – so restricting a function to a domain which is too small for our purpose is no problem, because we just extend the function to wherever we want, at least if no poles hinder us.

Theorem 9.3. Let $E$ be a convex domain and $Z \subset E$ a set of points without cluster points. If $f : E \to \mathbb{C}$ is continuous and holomorphic on $E \setminus Z$, then for every closed (meaning $\gamma(a) = \gamma(b)$) path of integration $\gamma : [a, b] \to \mathbb{C}$ the following holds:

$\int_\gamma f(z)\, dz = 0$ (9.16)

²A domain is a connected open set; those who have never heard of connectedness can imagine this as a set in which from every point to every other point there exists a path – although this is called path connectedness and is stronger than mere connectedness, it is enough to get an idea of it.


This is the Cauchy integral theorem³ and is something like the fundamental theorem of integral and differential calculus: if you have two paths from $x$ to $y$, you can put them together to form a closed path by reversing the direction of one of them; so their combined integral is zero. Since the reversal means a change in sign⁴, this means that the integrals along the two paths have the same value, and the integral itself depends only on the starting and end points of integration. At least if the function is holomorphic "enough" and the domain is convex – the latter can be extended to so-called simply connected domains, which means, roughly speaking, connected and "without holes".

9.1.4 Cauchy integral formula and residue calculus

Next we will state the Cauchy integral formulas⁵ and formulate the residue theorem, which will be the central tools used in solving Rice's integrals.

Theorem 9.4. Let $U$ be an open disc in a domain $E$ with $\overline{U} \subset E$⁶, and denote a positively oriented boundary of $U$ by $\partial U$⁷. Let $f : E \to \mathbb{C}$ be holomorphic; then the following statements hold:

$\forall z \in U : \quad f(z) = \frac{1}{2\pi i} \int_{\partial U} \frac{f(\zeta)}{\zeta - z}\, d\zeta$ (9.17)

$\forall z \in U : \quad f^{(n)}(z) = \frac{n!}{2\pi i} \int_{\partial U} \frac{f(\zeta)}{(\zeta - z)^{n+1}}\, d\zeta$ (9.18)

$\forall z \in E \setminus \overline{U} : \quad 0 = \frac{1}{2\pi i} \int_{\partial U} \frac{f(\zeta)}{\zeta - z}\, d\zeta$ (9.19)

There are some generalisations of the CIT and CIF for general paths (not only circles, and points can be encircled more than once). Their mathematically exact presentation would need some unnecessarily complicated definitions. The main result is, however, that the circle in the CIF can be deformed "continuously" and arbitrarily, as long as the path doesn't cross the singularity $z$. We can even encircle $z$ more than once, but the integral will then be an integer multiple of the integral whose path encircles $z$ only once, according to the number of times and the direction in which $z$ is encircled.

Until now we have only considered polynomial singularities with the CIF. But the concept of integrating around singularities can be extended. As stated above, every holomorphic function can be expressed as a power series. There exists a generalisation of this concept, called a Laurent series. Laurent series have the form

$\sum_{n=-\infty}^{\infty} a_n (z - z_0)^n$ (9.20)

It can be shown that under certain conditions $f$ can be expressed as a Laurent series within some disc that is partially contained in $E$ and whose boundary lies entirely in $E$. Here $f$ must be defined on a domain $E$, for every point $z_0$ for which a path in $E$ exists that encircles $z_0$. In the following we consider holomorphic functions $f$ which have isolated singularities; this means, roughly speaking, that the domain of definition doesn't have any holes besides some points, and these points do not cluster.

³further denoted by CIT
⁴this can easily be checked by considering the substitution formula
⁵further denoted by CIF
⁶$\overline{U}$ denotes the closure of $U$ in the standard topology on $\mathbb{C}$
⁷a positively oriented boundary of a disc is a path (this means a specific curve and not a set) that starts from one point on the boundary and encircles the disc counterclockwise until it returns for the first time to the starting point – this means it is a closed path

Page 84: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

84 CHAPTER 9. RICE’S INTEGRALS

Definition 9.7. The coefficient at n = −1 of a Laurent series of a holomorphicfunction f with isolated singularities of type (9.20) is called the residue at z0: Res

z0

(f) =

a−1

It can be shown8 that there exists an ε, such that a circle Uε(z0) with radius ε smallenough and center z0 has the property Uε(z0) \ z0 ⊂ E and that

Resz0

(f) =1

2πi

Z

∂Uε(z0)

f(z)dz (9.21)

Taking into account the CIT this yields the residue theorem9:

Theorem 9.5. Let E be an open set and U an open disc with E ⊂ U . Let, for somen ∈ N0, f be holomorphic on E \ z1, . . . , zn. Then

1

2πi

Z

∂U

f(z)dz =nX

k=0

Reszk

(f) (9.22)

This is an extension of the CIT. It can even be extended to the case, where U is nota circle in U and even if U is not bounded, as long as the singularities are isolated inthe encircled domain. This finishes our short introduction to complex analysis.

9.2 Other mathematical formulas

9.2.1 The Gamma function

The Gamma function is a generalisation of the factorial to complex numbers – one ofits definitions is

Γ(s) :=

Z ∞

0

e−tts−1dt (9.23)

It satisfies the relations:

∀n ∈ N0 : Γ(n) = (n− 1)! ∀x ∈ C \ −N0 : Γ(x+ 1) = xΓ(x) (9.24)

A direct consequence is:nY

i=0

(s− i) =Γ(s+ 1)

Γ(s− n)(9.25)

The Gamma function is holomorphic on its domain C \ −N0; at the negativ integersit diverges.An estimate for the growth of the Gamma function is the so called Stirling formula:

Γ(x) =√

2πxx− 12 e−x

1 +O(1

n)

«

(9.26)

The following formula, which is a direct consequence of the Stirling formula is thestandard estimate:

Γ(n+ 1)

Γ(n+ 1− α)= nα

1 +O(|a|2n

)

«

(9.27)

It can be shown that

limn→∞

nX

i=1

1

i− ln(n)

!

=: γ ≈ 0.577 (9.28)

8in fact, this is a direct consequence of the structure of the Laurent series and the fact that∀n ∈ Z \ −1 :

R

γ zn = 0 if γ is a closed path9again the proof is quiet obvious and the idea simple, if you are used to complex analysis,

but if exactly formulated very technical

Page 85: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.3. MOTIVATION AND BASIC INTEGRALS 85

This γ is known as Eulers constant.Using this, another result is:

Γ′(x)

Γ(x)= −γ − 1

x−

∞X

k=1

„1

x+ k− 1

k

«

(9.29)

The proofs of all the statements above are wonderful perls of mathematical analy-sis. But as with all perls, you have to dive deep to find them, so the proofs are farfrom trivial and we omit them all. The Gamma function does have many other niceproperties – but we won’t need them.

9.2.2 Zeta functions and Modified Bell Polynomials

The so called incomplete Hurwitz ζ function is:

ζn(r, β) =n−1X

i=0

1

(i+ β)r(9.30)

ζn(r, 1) defines the generalized harmonic numbers ζn(r) and their limit (n → ∞) isthe famous Riemann ζ function.From (9.29) it follows, that

ζn+1(1, β) = ln(n)− Γ′(β)

Γ(β)+O(

1

n) (9.31)

The modified Bell polynomials Lm = Lm(x1, x2, . . . , xm) are defined as

exp

∞X

k=1

xktk

k

!

= 1 +

∞X

m=1

Lmtm (9.32)

It is rather technical than difficult to proof that in general

Lm(x1, x2, . . .) =X

1m1+2m2+...=m

1

m1!m2! . . .

“x1

1

”m1“x2

2

”m2

. . . (9.33)

and to get an idea of them, we have

exp

∞X

k=1

xktk

k

!

= 1 + x1t+

„x2

2+x2

1

2

«

t2+

„x3

3+x1x2

2+x3

1

6

«

t3 +

„x4

4+x1x3

3+x2

2

8+x2x

21

4+x4

1

24

«

t4 + . . . (9.34)

9.3 Motivation and Basic Integrals

First we will introduce generalized differences for a sequence fkk∈N0:

∆fn = fn+1 − fn ∆nf0 =nX

k=0

„nk

«

(−1)n−kfk = (−1)nDn [f ] (9.35)

The differences Dn arise often in the average case analysis of some data structures assearch trees or tries. As a naive bound we get:

|Dn [f ]| ≤ 2n max0≤k≤n

|fn| (9.36)

Page 86: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

86 CHAPTER 9. RICE’S INTEGRALS

But for many sequences fn that come across in data structure analysis this boundis way to rough and polynomial bounds can be found – this phenomenon is calledexponential cancelation.In the analysis of recurrent sequences generating functions are often used to simplifyand solve the relations. For example the exponential generating is

f(z) =

∞X

n=0

fnzn

n!(9.37)

and Poisson generating function is defined by

f (z) =∞X

n=0

fne−z z

n

n!(9.38)

We consider the transform fn 7→ gn = Dn [f ]. Substitution in the exponetial generat-ing function or the Poisson generating function respectively yields the equations

g(z) = ezf(−z) g(z) = e−zf (−z) (9.39)

So it can be supposed, that when these transforms induce drastical simplifications ofrecurences or difference equations, high order differences as Dn may play a significantrole.We assume in the following, that a holomorphic function φ(x) interpolates the valuesof the sequence fn, which means ∀k ∈ N0 : fk = φ(k).

Lemma 9.1. Let φ be a holomorphic function in a domain that contains the half-line[n0,∞[ and C is a positivly oriented closed path in the domain of φ, which encircles[n0, n] and does not include any of the integers 0, 1, . . . , n0 − 1 nor a point, where φ isnot holomorphic. Then the following holds

nX

k=n0

„nk

«

(−1)kφ(k) =(−1)n

2πi

Z

Cφ(s)

n!

s(s− 1) . . . (s− n)ds (9.40)

Proof. We apply the residue theorem (9.22). Since the only points where the integrandis not holomorphic are the integers n0, n0 + 1, . . . , n, we only have to consider theseintegers. Let k be such an integer; the residues can be evaluated according to the CIF(9.17). Then we have:

Ress=k

φ(s)n!

s(s− 1) . . . (s− n)=

Ress=k

1

s− k

φ(s)n!

s(s− 1) . . . (s− k + 1)(s− k − 1) . . . (s− n)

«

= φ(k)(−1)n−kn!

k!(n− k)!(9.41)

Simple summation while taking the sign into account yields the proposition.

For the further discussion the definition of polynomial growth will be important:

Definition 9.8. A function φ in an unbounded domain Ω is said to have polynomialgrowth, if for some r the formula |φ(s)| = O(|s|r) holds as s→∞ in Ω. r is called thedegree of φ

If the function in (9.40) is of polynomial growth in the half-plain R(s) > n0−ε for someε > 0 and n is sufficiently large, then we have for some n0 > c > maxn0 − ε, n0 − 1the representation

nX

k=n0

„nk

«

(−1)kφ(k) = − (−1)n

2πi

Z c+i∞

c−i∞φ(s)

n!

s(s− 1) . . . (s− n)ds (9.42)

Page 87: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.4. INTEGRALS OF FUNCTIONS WITH POLES 87

Take as contour of integration the path

γj :

»

−j, j +1

2

→ C;

(

c+ ix −j ≤ x ≤ jc+ je−2πi( 1

4−x) j ≤ x ≤ j + 1

2

where j > n. This path is negatively oriented (so the sign changes) and we can apply(9.40). Since for any allowed j (9.40) holds, we consider the limit j → ∞. Thefirst part of the path yields the integral in (9.42) and the second can be estimated asO(|j|−n−1+r+1) using formula (9.14), since the integrand and therefor its maximumhas asymtotical growth O(|j|−n−1+r) and the length of the path is 2πj

2. So if n is

sufficiently large the second part of the integral vanishes.

9.4 Integrals of Functions with Poles and Rep-resentation of Sumes

9.4.1 Rational functions

Theorem 9.6. Let φ be a rational function holomorphic in a domain that containsthe half-line [n0,∞[. If n is big enough we have

nX

k=n0

„nk

«

(−1)kφ(k) = −(−1)nX

s

Ress

φ(s)n!

s(s− 1) . . . (s− n)

«

(9.43)

where the sum is taken over all poles of φ and over 0, 1, . . . , n0 − 1

Proof. First we use Lemma 9.1 and take as path of integration a circle of radius R bigenough to encircle all poles. When R → ∞ and n > r the integral on the right sideof (9.40) tends to 0 by a similar argument used for (9.42). Applying once again theresidue theorem (9.22), we find that the sum on the righthand side of (9.43) minus itslefthand side is 0, which directly yields (9.43)

As a next step we try to express the residues. For this purpose we will need theincomplete Hurwitz zeta function and the modified Bell polynomials introduced in(9.30) and (9.33). As every rational function can be expressed as a linear combinationof terms of the form A(x− a)−r, where r ∈ N0, we only have to consider functions φof this type.

Lemma 9.2. When α ∈ C \ N0, then

In(α) = (−1)nn! Ress=α

„1

(s− α)r

1

s(s− 1) . . . (s− n)

«

(9.44)

has the following asymptotic

In(α) = −Γ(−α)nα (lnn)r−1

(r − 1)!

1 +O

„1

lnn

««

(9.45)

In the following we use the symbol 〈φ(s)〉k,α to denote the kth coefficient in the Laurentseries at a certain point α ∈ C.

Page 88: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

88 CHAPTER 9. RICE’S INTEGRALS

Proof.

In(α) =− n!

fi

(s− α)r 1

(−s)(1− s) . . . (n− s)

fl

−1,α

=

− n!

fi1

(−s)(1− s) . . . (n− s)

fl

r−1,α

=

− n!

fi1

(−α− s)(1− α− s) . . . (n− α− s)

fl

r−1,0

=

− n!

fi

exp

− ln

nY

j=0

(j − α− s)!!fl

r−1,0

=

− n!

fi

exp

−nX

j=0

ln(j − α− s)!fl

r−1,0

(9.46)

Since we have ln(j − α− x) = ln“

(j − α)“

1 + −xj−α

””

= ln(j − α) + ln“

1 + −xj−α

, we

can apply the series expansion ln(1 + x) =∞P

m=1

(−1)m+1 xm

mto get

= −n! exp

−nX

j=0

ln(j − α)

!fi

exp

nX

j=0

∞X

m=1

1

m

„s

j − α

«m!!fl

r−1,0

=

− n!1

(−α)(1 − α) . . . (n− α)

fi

exp

∞X

m=1

nX

j=0

1

(j − α)m

sm

m

!!fl

r−1,0

(9.30)=

− n!1

(−α)(1 − α) . . . (n− α)

fi

exp

∞X

m=1

ζn+1(m,−α)sm

m

!fl

r−1,0

(9.33)=

− Γ(n+ 1)Γ(−α)

Γ(n + 1 − α)Lr-1(ζn+1(1,−α), ζn+1(2,−α), . . . ζn+1(r − 1,−α))

(9.27)=

− Γ(n+ 1)Γ(−α)

Γ(n+ 1− α)Lr-1(lnn− Γ′(−α)

Γ(−α)+O(1/n), ζn+1(2,−α), . . . ζn+1(r− 1,−α)) =

(9.47)

Since the incomplete Hurwitz zeta function fullfills ζn(r, β) = O(1) for n → ∞ andr ∈ N \ 1 and since beside the first coefficients of the modified Bell polynomials Lm

all coefficent are of degree smaller than m all other coefficient can be neglected.

= −Γ(n+ 1)Γ(−α)

Γ(n+ 1− α)

1

(r − 1)!(lnn− Γ′(−α)

Γ(−α)+O(1/n))r−1 =

− Γ(n + 1)Γ(−α)

Γ(n + 1 − α)

(lnn)r−1

(r − 1)!·„

1 +O

„1

lnn

««(9.31)=

− Γ(−α)nα

1−O„

1

n

««(lnn)r−1

(r − 1)!·„

1 +O

„1

lnn

««

=

− Γ(−α)nα (lnn)r−1

(r − 1)!·„

1 +O

„1

lnn

««

(9.48)

As a first example we analyze for an m ∈ N the asymtotic growth of the sum

Sn(m) =nX

k=1

„nk

«(−1)k

km(9.49)

Page 89: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.4. INTEGRALS OF FUNCTIONS WITH POLES 89

Here we can use the function φ(s) = 1sm

to interpolate the sequence and we set n0 = 1.We have only one pole not in n0, n0 +1, . . . , n, that is 0, which is of the order m+1.We have to modify our calculations yielding (9.2), since the pole is in N0:

Sn(m) = −Ress=0

„1

sm+1

n

s− nn− 1

s− n + 1. . .

2

s+ 2

1

s+ 1

«

=

−Ress=0

„1

sm+1

““

1 − s

1

”“

1 − s

2

. . .“

1− s

n

””−1«

=

−fi““

1− s

1

”“

1− s

2

. . .“

1 − s

n

””−1fl

m,0

(9.50)

Similar to the calculations taken out in (9.46) and (9.47) with the generalized harmonicnumbers ζn(k) this is

−fi

exp

∞X

k=1

ζn(k)sk

k

!fl

m,0

(9.51)

Now we can use once again the modified Bell polynomials and the facts, that ζn(k) =ζ(k) +O

`1

nk−1

´for k ≥ 2 and ζn(1) = ln(n) + γ +O(1/n), which follows from (9.31)

with β = 1 and Γ′(1) = γΓ(1) to get:

− Sn(m) =X

1m1+2m2+...=m

1

m1!m2! . . .

„ζn(1)

1

«m1„ζn(2)

2

«m2„ζn(3)

3

«m3

. . . =

1 +O

„1

n

««X

1m1+2m2+...=m

1

m1!m2! . . .·

ln(n) + γ +O

„1

n

««m1„ζ(2)

2

«m2„ζ(3)

3

«m3

. . . (9.52)

Since the ζ(k) are constants we have for a polynomial Pm of degree m the asymtotics

−Sn(m) = Pm(ln(n)) +O

„(ln(n))m

n

«

(9.53)

Using the values of the ζ function, we get for the first values of m

−Sn(1) = ln(n) + γ +O

„1

n

«

(9.54)

−Sn(2) =1

2(ln(n))2 + γ ln(n) +

γ

2+π2

12+O

„ln(n)

n

«

(9.55)

Moreover for m = 1 we get the exact result

nX

k=1

„nk

«(−1)k−1

k= −Sn(1) = ζn(1) (9.56)

The above asymptotic equation can be generalized for m 6∈ N, as we will see later.Another example is the sequence

Tn =nX

k=0

„nk

«(−1)k

k2 + 1(9.57)

This sequence obeys the recurrence

T0 = 1 T1 =1

2Tn =

n

n2 + 1((2n − 1)Tn−1 − (n− 1)Tn−2) (9.58)

Page 90: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

90 CHAPTER 9. RICE’S INTEGRALS

which seams hard to solve or even estimate by conventional methods.Since the sequence underlaying this differences can be interpolated by φ(s) = (1 +s2)−1 = 1

s−i+ 1

s+i, this allows us to directly applay (9.2) using trigonometric identities

and the fact that |Γ(z)| = |Γ(z)| to get

Γ(−i)ni

1 +O

„1

n

««

+ Γ(i)n−i

1 +O

„1

n

««

=

(Γ(−i)ei ln(n) + Γ(i)e−i ln(n))

1 +O

„1

n

««

= ρ · cos(ln(n) + θ) + o(1) (9.59)

for some θ and ρ = 2 |Γ(i)| = 2pπ/ sinh(π) ≈ 1.04313.

This example shows, that complex poles introduce periodic behavior in the asymtoticsof a sequence.

9.4.2 Meromorphic functions

Meromorphic functions are generalisations of rational function. Meromorphic func-tions are holomorphic on an certain domain except isolated singularities.

Theorem 9.7. Let φ be a function holomorphic in a domain that contains the half-line[n0,∞[. If n is big enough we have

1. If φ is meromorphic on C and of polynomial growth, then

nX

k=n0

„nk

«

(−1)kφ(k) = −(−1)nX

s

Ress

φ(s)n!

s(s− 1) . . . (s− n)

«

(9.60)

where the sum is taken over all poles of φ and over 0, 1, . . . , n0 − 1

2. If φ is meromorphic on the half-plane defined by R(s) ≥ d for some d < n0 andof polynomial growth in this set, then

nX

k=n0

„nk

«

(−1)kφ(k) = −(−1)nX

s

Ress

φ(s)n!

s(s− 1) . . . (s− n)

«

+O(nd)

(9.61)where the sum is taken over all poles but n0, n0 + 1, . . . , n

Proof. Since in both cases the function is meromorphic, the number of poles are count-able. So in the first case we can find positively oriented, concentric circles γj whoseradii tend to ∞ and do not come across any pole. In the second case we can find forany ε > 0 a d < d′ < d + ε, such that the pathes defined by [d′ − iRj , d

′ + iRj ] andthe half circle with center d′ and radius Rj don’t cross a pole and Rj tends to infinity.Since d′ is arbitrarily close to d, we can assume d = d′. Of course, if the theoremis used in practice, other types of paths can be used, when they have the essentialproperties stated and used here.Now we integrate along these curves and get using the residue theorem similar totheorem 9.6 the results above. Since φ is of polynomial growth10 we can use thearguments at (9.42) to get the asymtotics of the integrals over the ”infinite” paths.In the first case we get 0 for n big enough to overwhelm the polynomial growth. Inthe second case the O(nd) comes from the integral along the parallel to the imaginary

10in fact, it is only necessary to have polynomial growth on the union of the paths ofintegration – so we don’t have to bother about poles or even infinitly many poles on ourcompact set, because we just circumnavigate them; in the second case the polynomial growthis even only needed beside a compact set, because the path of integration is not allowed tocross a pole and then holomorphic functions do attain their maximum on a compact set, sothey can be estimated by a constant, which is trivially of polynomial growth

Page 91: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.4. INTEGRALS OF FUNCTIONS WITH POLES 91

axes, as this argument illustrates (It’s not a proof – it’s just a plausibility argument;we asume |φ(s)| = O(|s|r) for an integer r):

˛˛˛˛

(−1)n

2πi

Z d+i∞

d−i∞φ(s)

n!

s(s− 1) . . . (s− n)ds

˛˛˛˛

(9.13)

1

Z d+i∞

d−i∞

˛˛˛˛φ(s)

n!

s(s− 1) . . . (s− n)

˛˛˛˛ ds ≤

1

Z d+i∞

d−i∞|s|r

˛˛˛˛

n!

s(s− 1) . . . (s− n)

˛˛˛˛ds =

1

Z d+i∞

d−i∞

|s|r|s(s− 1) . . . (s− r + 1) · (s− r)(s− r − 1)| |(n(n− 1) . . . (n − d+ 1))|

˛˛˛˛

(n − d)(n− d− 1) . . . (r + 2− d)(n − s)(n− s− 1) . . . (r + 2− s)

˛˛˛˛ |(r + 2− d)(r + 1 − d)(r − d) . . . 2 · 1|−1 ds ≤

O(nd)

Z d+i∞

d−i∞

1

|s|2ds = O(nd) (9.62)

This rough approximation holds for r + 1 − d < 0. For the other case, this argumentis not applicable, although I think, that O(nd) holds even in this case.

As our next example we want to analyze the recurrence relation

fn = an + 2

nX

k=0

„nk

«1

2nfk (9.63)

Further we will assume that a0 = a1 = 0; this is without loss of generality. Then weuse exponential generating function introduced in (9.37) to get

f(z) =

∞X

n=0

fnzn

n!=

∞X

n=0

an + 2

nX

k=0

1

2n

„nk

«

fk

!

zn

n!=

∞X

n=0

anzn

n!+

∞X

n=0

2nX

k=0

1

2n

„nk

«

fkzn

n!

!

=

a(z) + 2∞X

n=0

nX

k=0

1

2n−k(n− k)!fk

2kk!zk+(n−k)

!

=

a(z) + 2

∞X

n=0

1

2n

zn

n!

! ∞X

n=0

fn

n!

“z

2

”n!

= a(z) + 2ez/2f“z

2

(9.64)

This easily translates into the Poisson generating function via multiplication the wholeequation by e−z to get

f (z) = a(z) + 2f“z

2

(9.65)

so for the coefficients fn = n!〈f (z)〉n,0 we have

fn = an + 21

2nfn ⇒ fn =

an

1− 21−n(9.66)

Since the equality

fn =

nX

k=0

„nk

«

fn (9.67)

holds, we get the identity respecting a0 = a1 = 0 and hence a0 = a1 = 0

fn =nX

k=0

„nk

«ak

1 − 21−k=

nX

k=2

„nk

«ak

1− 21−k(9.68)

Page 92: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

92 CHAPTER 9. RICE’S INTEGRALS

To proof (9.67) we first show:

fn = n!〈f 〉n,0 = n!

fi ∞X

n=0

fne−z z

n

n!

fl

n,0

= n!

fi ∞X

n=0

fn

∞X

k=0

(−1)k zk

k!

zn

n!

fl

n,0

=

n!

fi ∞X

i=0

iX

j=0

(−1)j 1

j!fi−j

zjzi−j

(i − j)!

fl

n,0

= n!

fi ∞X

n=0

iX

k=0

1

k!

1

n − k! (−1)kfn−kzn

fl

n,0

=

(−1)n

fi ∞X

n=0

iX

k=0

n!

k!(n− k)! (−1)kfkzn

fl

n,0

= (−1)niX

k=0

„nk

«

(−1)kfk (9.69)

Next we examine the righthand side of (9.67):

nX

k=0

„nk

«

fk =

nX

k=0

„nk

«

(−1)nkX

j=0

„nj

«

(−1)jfj =

nX

l=0

nX

m=l

(−1)l+m

„nm

«„ml

«

fl

(9.70)

Since we want this sum to be fn, we have to show, that the inner sum is δl,n11

nX

m=l

(−1)l+m

„nm

«„ml

«

=

nX

m=l

(−1)l+m n!

m!(n−m)!

m!

l!(l −m)!=

nX

m=l

(−1)l+m n!

(n−m)!(l −m)!l!=n!

l!

nX

m=l

(−1)l+m 1

(n−m)!(l −m)!=

n!

l!

n−lX

k=0

(−1)2l+k 1

(n − l − k)!k! =n!

l!(n− l)!n−lX

k=0

(−1)k (n− l)!(n− l − k)!k! =

„nl

« n−lX

k=0

„n − lk

«

(−1)k(1)n−l−k =

„nl

«

(1 + (−1))n−l =

„nl

«

δl,n = δl,n (9.71)

Since (9.63) appears in the analysis of tries, the asymptotics of

Un =nX

k=2

„nk

«ak

1 − 21−k(9.72)

are of great interest. The ak are usualy simple; for n ≥ 2 we get for the an = n − 1,which appears taking a closer look to tries, an = (−1)n. So we can apply theorem 9.7with fk = (2k−1 − 1)−1. We have infinitly many poles at χk = 1 + 2πik

ln 2.

We choose as path of integration circles centered at the origin, that avoid the polesand let the radius tend to ∞. The whole analysis is carried out in Knuth’s ”Art ofcomputer programming”; we don’t carry it out here, but the result is:

Un =n

ln 2

0

@ln(n) + γ − 1 − ln 2

2+

X

k∈Z\0Γ

−1− 2πik

ln 2

«

e2πik ln(n)/ ln 2

1

A+O(1)

(9.73)Since the Γ function decreases rapidly along the imaginary axes the effects of the sumcan almost be neglected and we get the simplyfied asymptotic

n log2(n) + nP (log2(n)) +O(1) (9.74)

More generally regular spaced poles introduce disturbances, which asymptoticaly be-have like Fourier series in ln(n), as seen her with P .

11This is the widly used Dirac δ symbol

Page 93: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.4. INTEGRALS OF FUNCTIONS WITH POLES 93

Another example is the sum Vn =n−1P

k=1

„nk

«

Bk2k−1

, which arises in the analysis of

Patricia tries. Since Bk = 0 for k ∈ 2N + 1, the sign is not necessary, and usingBk = −kζ(1− k) for k ∈ 2N ∪ 1, we have to analyze the integral

Vn =(−1)n

2πi

Z 1/2+i∞

1/2−i∞

n!

(s− 1)(s− 2) . . . (s− n)

ζ(1− s)2s − 1

ds (9.75)

The path of integration is the infinite rectangle from 12−∞ to 1

2+∞ and from n− 3

4+∞

to n − 34−∞; but it can be shown that the integral of the second path is identical

0 for each n and so the rectangle can be extended to ∞ therefor is a variant of thesecond part of theorem 9.7, where the residues are not yet evaluated.Now we have to consider the double pole at 0 (from the ζ function and from (2s−1)−1)and all the simple poles at χk = 2πik/ ln(2), which yields analog to the example befor

Vn =ln(n)

ln(2)− 1

2− 1

ln(2)

X

k∈Z\0ζ(1 − χk)Γ(1− χk)e2πik ln(n)/ ln(2) +O(1) (9.76)

There are also some other examples, where Rice’s integrals can be used succesfully;for example digital trees or quad trees used for multidimensional searching.In the last example the extrapolating function was just given to us, but you canimagine that this would otherwise be a difficult task – especially with such not everydayoccuring numbers as the Bernoulli numbers. If we have coefficients which are sums orproducts of other sequence αk and we can interpolate the elements of this sequenceby α(s), then we can use

An =

nY

k=1

αk ⇒ A(s) =

∞Y

k=1

α(k)

α(k + s)An =

nX

k=1

αk ⇒ A(s) =

∞X

k=1

(α(k)−α(k+s))

(9.77)

9.4.3 Functions with Algebraic and Logarithmic Singular-ities

Now we turn to general algebraic and logarithmic functions. The problem with thesefunction is that they can not to defined on C, even not with some pointwise exceptions,but only with some uncountable exceptions. For example the complex extension ofthe logarithm and the root are defined on C \ ]−∞, 0]12. For this reason we can notsimply integrate around the points, where the functions are not defined, because everycircle would be “slashed”, but we have to use the so called Hankel contours to get ourintegration done.Because of the difficulties that arise with this class of functions, we will only considerexamples; but the first in more detail. Since we are familiar with the sequence 1

km

from (9.49), where m was an integer, we try to generalise this to arbitrary λ, as weindicated before.

Theorem 9.8. For any nonintegral λ, the sum

Sn(λ) =

nX

k=1

„nk

«

(−1)kk−λ (9.78)

has an asymptotic expansion in descending powers of ln(n) of the form

−Sn(λ) = (ln(n))λ∞X

j=0

(−1)j Γ(j)(1)

j!Γ(1 + λ− j)1

(ln(n))j(9.79)

12People having done some complex analysis know, that there are some means to make thisrestriction a little bit more flexible, but you will never get rid of a malicious slash, which cutsinto the complex plane

Page 94: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

94 CHAPTER 9. RICE’S INTEGRALS

Proof. As seen in (9.50) we can represent the sum as an integral in an analoge way

Sn(λ) =1

2πi

Z

Cωn(s)

ds

sλ+1with ωn(s) =

““

1 − s

1

”“

1 − s

2

. . .“

1− s

n

””−1

(9.80)

Here C may be the vertical line R(s) = 12

as in (9.42). Despite all, these integralswould be hard to evaluate; since we are only interested to encircle the poles (hereonly 1, 2, . . . , n) we can deformate the path of integration as we like unless we don’tencounter any new pole or the slash. For this reason we introduce the Hankel contour:

C = C1 + C2 + C3 + C4

C1 =

s

˛˛˛˛ |s| = R ∧

|I(s)| ≥ 1

ln(n)∨R(s) > 0

«ff

C2 =

s

˛˛˛˛s =

i − tln(n)

∧ t ≥ 0 ∧ |s| ≤ Rff

C3 =

s

˛˛˛˛s =

eθi

ln(n)∧ θ ∈

h

−π2,π

2

iff

C4 =

s

˛˛˛˛s =

−i− tln(n)

∧ t ≥ 0 ∧ |s| ≤ Rff

This is a circle of radius R > n around the origin, that circumvents the slash by leavingsome space around it; but as n increases the slash is left less space, so that if n tendsto ∞ any point in the domain of s−λ will be encircled.Now we split the integral in three parts Sn(λ) = J1 + J2 + J3; we will see that for theasymptotic J1 and J2 can be neglected.

1. Let J1 be the part, that belongs to the outer “circle” C1. Of course s−λ−1 is ofpolynomial growth, so we can once more use our estimate for polynomial growth,to see that J1 is O(R−n−λ) and in the end, when R→∞, we get J1 = 0

2. Now we will estimate the parts of C2 and C4 with R(s) < − 1√ln(n)

=: −t0. For

simplification, we forget about the term s−λ−1 and asume we integrate alongthe negative real axes till we reach −t013. This simplifications do not touchthe character of our estimate, but make it simpler. Additionaly, we mirroreverything on the imaginary axes, to get:

µ(n) :=

Z ∞

t0

dt

(1 + t1) . . . (1 + t

n)

(9.81)

As next step we split the new path into the intervalls [t0, 1],h

1, n1/3i

andh

n1/3,∞i

, to get the integrals µ1(n), µ2(n) and µ3(n)

13We asume λ + 1 ≥ 0 and then it is quiet obvious, that we can neglect the powers of s

for the asymptotic analysis, because they would make the asymptotic only smaller; but forexample if λ are negative integers we get the Stirling numbers of second kind, which havenormaly exponential growth – our derivation can not be applayed in this case

Page 95: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.4. INTEGRALS OF FUNCTIONS WITH POLES 95

(a)

µ1(n) =

Z 1

t0

dt

(1 + t/1) . . . (1 + t/n)=

Z 1

t0

n!dt

(1 + t)(2 + t) . . . (n + t)=

Z 1

t0

Γ(n+ 1)Γ(t+ 1)

Γ(n+ t+ 1)dt

(9.26)=

Z 1

t0

√2π(n+ 1)n+1−1/2e−n−1

√2π(t+ 1)t+1−1/2e−t−1

√2π(n+ t+ 1)n+t+1−1/2e−n−t−1

1 +O

„1

n

««

dt ≤Z 1

t0

√2π(t+ 1)t+1−1/2

e(n+ t+ 1)t

1 +O

„1

n

««

dt = O(1)

Z 1

t0

(t+ 1)t+1−1/2

(n+ t+ 1)tdt =

O(1)

Z 1

t0

1

(n)tdt (9.82)

The last few transformation can be carried out because t ∈ [t0, 1], sois small compared to n and can be ignored or estimated by a constantrespectivly.

O(1)

Z 1

t0

1

(n)tdt = O(1)

„−1 + n1−t0

n ln(n)

«

= O(e−t0 ln(n)+ln(ln(n)))

= O(e−1/2√

ln(n)) (9.83)

(b) In the next part we use, that 1 ≤ t

µ2(n) =

Z n1/3

1

n!

(1 + t) . . . (n+ t)dt =

Z n1/3

1

1 · 2 · 3

2 + t

4

3 + t. . .

n

n− 1 + t

1

(1 + t)(n+ t)dt ≤

Z n1/3

1

O(1)1

(1 + t)(n+ t)dt = O

„ln(n1/3 + n)

n − 1

«

= O(n−2/3) (9.84)

(c) Similar to the estimates done in the formula above we get, for n largeenough

µ3(n) =

Z ∞

n1/3

1

(1 + t/1) . . . (1 + t/n)dt ≤

Z ∞

n1/3

1

e(1/2)tdt = O(e−(1/2)n1/3

) (9.85)

In the whole we have the result J2 = O(e(1/2)√

ln(n)), so it is of smaller orderthan any negativ power of ln(n)

3. Now we have to estimate J3; this is the integral along the contour of C2∪C3∪C4,for which R(s) ≥ −t0, denoted in the following by C0.Again we use Stirlings formula (9.26) to get the asymptotic:

ωn(s) = nsΓ(1− s)„

1 +O

„ln(n)

n

««

(9.86)

Since s is very small on C0 the part with O“

ln(n)n

can be estimated easily by

O(e(1/2)√

ln(n)) and this yields

J3 = J0 +O(e(1/2)√

ln(n)) with J0 =1

2πi

Z

C0

nsΓ(1− s) ds

sλ+1(9.87)

Page 96: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

96 CHAPTER 9. RICE’S INTEGRALS

Now we use the transformation z = s ln(n) with D0 is the image under thistransform of C0 and have

J0 = (ln(n))λ 1

2πi

Z

D0

ezΓ

1 − z

ln(n)

«dz

zλ+1(9.88)

By the transform we get |z| = O(p

ln(n)) on D0; this is the reason, we canexpand the Gamma function around 1 in a power series and after changingsummation and integration, we get

J0 = (ln(n))λ∞X

m=0

(−1)m Γ(m)(1)

m!

1

(ln(n))m

1

2πi

Z

D0

ezzm−λ−1dz (9.89)

Now we have to estimate the remaining integrals; this can be done by the socalled Laplace method, where we extend the contour to −∞ to get L. We willonly present the result here:

1

2πi

Z

Lezzm−λ−1dz =

1

Γ(1 −m+ λ)(9.90)

Now, taking together all the parts, we have prooven the theorem.

As an application, we can examine the sum

Xn =nX

k=1

„nk

«(−1)k

√1 + k2

(9.91)

We have the local behvior (s±i)−1/2 at the “problem points”, so we get an asymptoticgrowth of

pln(n). Similar to (9.59) we have for some ρ and θ0

Xn = ρp

ln(n) cos(ln(n) + θ0) +O((ln(n))−1/2) (9.92)

Other examples and direct applications are

−Sn(−1/2) =1

pπ ln(n)

− γ

2pπ(ln(n))3

+O((ln(n))−5/2) (9.93)

−Sn(1/2) = 2

r

ln(n)

π+

γpπ ln(n)

+O((ln(n))−3/2) (9.94)

In general the coefficients are rational expressions of terms as γ, Γ(−λ) and ζ(2), ζ(3),. . .

For the rest of this part, we will only state some more examples, where the method ofRice’s integrals (perhaps with use of the Hankel contour) can be applied succesfully.

Theorem 9.9. For the logarithmic differences we have the asymptotics

Yn =nX

k=1

„nk

«

(−1)k ln(k) = ln(ln(n))+γ+γ

ln(n)− π2 + 6γ2

12(ln(n))2+O

„1

(ln(n))3

«

(9.95)

The method can also be used for entire functions, which have no poles at all (despitethe artificial ones introduced by the kernel . . . ). For example for

Zn =nX

k=0

„nk

«(−1)k

k!(9.96)

Page 97: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

9.5. MELLIN TRANSFORMS AND RICE’S INTEGRALS 97

which obeys the recurrence Zn+2 = (2− 2/n)Zn+1 + (1− 1/n)Zn can be extrapolatedby the entire function 1/Γ(s), and after carrying out Rice’s method it reveals for someconstants c and θ

Zn = cn−1/4 sin(2n1/2 + θ) + o(n−1/4) (9.97)

We have seen many cases were heavy use of complex analysis can resolve the asymp-totics of recurrences, generalized differences and sums. We summarize all this in thetable below.

Some types of singularities and the asymptotics they introducein the corresponding difference

singularity asymptotics

singularity of φ(s) at s0 = σ0 + iτ approximatly ns0 = nσ0eiτ0 ln(n)

simple pole: (s− s0)−1 −Γ(−s0)ns0

multiple pole: (s− s0)−r −Γ(−s0)ns0 (ln(n))r−1

(r−1)!

algebraic singularity: (s− s0)λ −Γ(−s0)ns0 (ln(n))−λ−1

Γ(−λ)

logarithmic singularity: (s− s0)λ(ln(s− s0))r −Γ(−s0)ns0 (ln(n))−λ−1

Γ(−λ) ln(ln(n))r

9.5 Mellin Transforms and Rice’s Integrals

The Mellin transform of a function and its inverse have the form

φ(z) =

Z ∞

0

tz−1f(t)dt f(t) =1

2πi

Z c+i∞

c−i∞t−zφ(z)dz (9.98)

If we take a closer look to the integral (9.42) and take into account the asymptotic ofωn(s) (9.86)14 we get:

1

2πi

Z d+i∞

d−i∞φ(s)

(−1)nn!

s(s− 1) . . . (s− n)ds ≈ 1

2πi

Z d+i∞

d−i∞φ(s)Γ(−s)nsds (9.99)

If we would change the sign of the variable and then compare the result with theinverse Mellin transform we observe a definite analogy. Without stating it formally,in cases, were we want to evaluate Rice’s integrals and we get stuck with it, it canbe worth a try to evaluate the corresponding inverse Mellin transform to get an ideaabout the size of the asymptotics.But the similarity can be stated formally as the Poisson-Mellin-Newton cycle.

Theorem 9.10. The coefficients of a Poisson generating function are expressible asa Rice’s integral of a Mellin transform of the Poisson generating function.

fn Poisson GF−−−−−−−→ f(t) =∞X

n=0

fne−t t

n

n!

Mellin transform−−−−−−−−−−→ f∗(s) =

Z ∞

0

f (t)ts−1dtRice’s integral−−−−−−−−→ fn (9.100)

Proof. We take a closer look to the Mellin transform and state a Newton series to have

f∗(s) =

Z ∞

0

f (t)ts−1dt =

∞X

0

fn

n!

Z ∞

0

e−tts+n−1dt(9.23)=

Γ(s)

f0 + f1s

1!+ f2

s(s+ 1)

2!+ . . .

«

(9.101)

14don’t forget about the additional s in the denominater, so we realy get Γ(−s) and notΓ(1 − s), after changing sign

Page 98: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

98 CHAPTER 9. RICE’S INTEGRALS

By differencing15 we get to the following formula

fn =nX

k=0

„nk

«

(−1)k f∗(−s)

Γ(−s) (9.102)

But these differences are exactly of the Rice type, so we conclude the correspondingequations

fn =(−1)n

2πi

Z

C

f∗(s)

Γ(−s)n!

s(s− 1) . . . (s− n)

!

ds (9.103)

f∗(s) =

Z ∞

0

e−t∞X

n=0

fntn

n!

!

ts−1dt (9.104)

This are the relations, which were stated.

For example if we carry out the Mellin transform of the Poisson generating function(9.65) we have f∗(s) = a∗(s)

1−21+s and from the formulas above we get the result

fn =

nX

k=0

„nk

«a∗(−k)Γ(k)

(−1)

1− 21−k(9.105)

This is formally the same result as if we had carried out the Rice’s method.There are some other examples as digital search trees, were the Poisson-Mellin-Newtoncycle can be applied. This is a hint that the formal result we get from the cycle can bedeveloped to an actual result by the Rice’s integrals. The cycle is also an explanationfor some other phenomena, but this would lead too far here.

9.6 Summary

Despite complex analysis seems to be part of pure mathematics, it can be appliedfor finding asymptotics of solutions of difference equations and generalized differences,which are needed in the analysis of algorithms. By the method of Rice’s integralswe can tackle the average case analysis of tries, digital search trees, multidimensionalsearching and other datastructures and algorithms of great practical use. In mostcases a detailed calculation of the asymptotics is in fact much too complex, but youcan get a first estimate of the growth by comparing the problem with the examplesoutlined here and the table given at the end of the section befor the last.

15normaly we would differenciate, but here we have not a series in powers of s like sn but aseries in s(s + 1)(s + 2) . . .; since differencing leads to success with s(s− 1)(s− 2) . . ., we haveto play a little bit with the sign and get a result

Page 99: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 10

Digital Search TreesAverage case analysis ofdigital search trees and triesNicolai Baron von Hoyningen-Huene

10.1 Introduction

A very common problem in computer science is to search for the appearance of astring inside a text. There exist algorithms that use suffix trees, which are oftenimplemented as digital trees to optimize the performance of the search. In this articlesome properties of those data structures are analyzed, which can be used to calculatethe average performance and space complexity of algorithms using digital trees. For anintroduction to Rice’s method (see Chapter 9) is recommended, but the mathematicalderivation in this article is largely based on [FS86b].

In the first part basic terms and structures are defined. Subsequently there will bea detailed analysis of the internal path length and external internal nodes for digitalsearch trees. The following derivation for properties of tries are just sketched becauseof similarity to the precedent part. Then the binary trees are generalized to M -arytrees and examined. Concluding, a general framework for analysis of properties ofdigital trees is presented.

10.2 Trees

First of all we look at rudimentary definitions for the sake of completeness and laterusage, the definitions are taken from [CLR90]. In computer science data has to bestored in an intelligent way. Often there exist keys which are related to data, thus aspecial data structure is needed.

Definition 10.1. A dictionary is defined as an abstract data type storing items, orkeys, associated with values. Basic operations are insert, find, and delete.

The operations new(), insert(i, v, D), and find(i, D) may be defined as follows:

• new () returns a dictionary

99

Page 100: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

100 CHAPTER 10. DIGITAL SEARCH TREES

• find (i, insert (i, v,D)) = vfind (i, insert (j, v,D)) = find (i, D) if i 6= j where i and j are items or keys, vis a value, and D is a dictionary. The operation find (i, new ()) is not defined.

• The modifier function delete (i, D) may be defined as follows.delete (i, new ()) = new ()delete (i, insert (i, v,D)) = delete (i, D)delete (i, insert (j, v,D)) = insert (j, v, delete (i, D)) if i 6= j

We can define find (i, new ()) using a special value: fail. This only changes the returntype of find. find (i, new ()) = fail

A tree is a commonly used structure for implementation of dictionaries:

Definition 10.2. A tree is defined as a data structure accessed beginning at the rootnode. Each node is either a leaf or an internal node. An internal node has one or morechild nodes and is called the parent of its child nodes. All children of the same nodeare siblings. Contrary to a physical tree, the root is usually depicted at the top of thestructure, and the leaves are depicted at the bottom.

'& %$ ! "#root

gggggggggggggggggg

VVVVVVVVVVVVVVVV

/.-,()*+kkkkkkkkkkk

NNNNNNNNNN /.-,()*+NNNNNNNNNN /.-,()*+

KKKKKKKK

'& %$ ! "#rootA

wwww

wGG

GGG

/.-,()*+

1A 2A 3A

Figure 10.1: Example for a trinary tree

And we have a special property of a tree:

Definition 10.3. The depth of a node in a tree is the distance from this node to theroot of the tree.

We can constrain trees to optimize performance for some purpose:

Definition 10.4. A search tree is a tree where every subtree of a node has values lessthan any other subtree of the node to its right. The values in a node are conceptuallybetween subtrees and are greater than any values in subtrees to its left and less thanany values in subtrees to its right.

In the binary world of a computer, trees with binary properties are widely supported:

Definition 10.5. A binary tree is either empty (no nodes), or has a root node, aleft binary tree, and a right binary tree.

10.3 Digital Search Trees

10.3.1 Data Structure of Binary Search Trees

Definition 10.6. A binary search tree is a binary tree and also a search tree. Anew node is added as a leaf.

Page 101: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.3. DIGITAL SEARCH TREES 101

?>=<89:;Dless

nnnnnnnnnnngreater

PPPPPPPPPPP

?>=<89:;Bless

~~~~

~ greater

@@@@

@@?>=<89:;H

less

?>=<89:;A ?>=<89:;C ?>=<89:;G

Figure 10.2: Example for a binary search tree with lexical ordering

The worst case for search is in the order of the number of keys N stored in a binarysearch tree, since the tree can be degenerated to a linear list. This arise, when keys areinserted in a a- or descending order. Instead the average case for successful search in abinary search tree is logarithmic, because the worst case is very unlikely. It evaluatesto

2

1 +1

N

«

HN − 3 = (2 ln 2) lgN + 2γ − 3 +O

„logN

N

«

.

10.3.2 Data Structure of Digital Search Trees

Keys are always handled by a computer as binary data. Therefore we can use thedigital properties of the keys.

Definition 10.7. A digital tree is a tree for storing a set of strings where nodes areorganized by substrings common to two or more strings.

The ordering of the keys is intuitively: we follow the tree by the bits descending fromthe first bit of the key represented as a binary number until we come to a leaf, a zerodirects us to the left, a one to the right. In Figure 10.3 and 10.4 you can see an exampletree, the binary coding of each letter is written next to it. Note, that the structure ofthe digital search tree depends of the order of the input of the keys. We define N asthe number of keys stored in this dictionary. The number of nodes is limited by thenumber of bits in the keys and larger than lgN but likely less than a constant factorfor many natural situations.

Definition 10.8. A digital search tree is a dictionary implemented as a digital treewhich stores keys in internal nodes, so there is no need for extra leaf nodes to storethe keys.

ONMLHIJKD011

0

kkkkkkkkkkkkk1

NNNNNNNNN

ONMLHIJKB001

0

wwwwww 1

GGGGGG

ONMLHIJKG110

1

HHHHHH

ONMLHIJKA000ONMLHIJKC010

ONMLHIJKH111

Figure 10.3: Example of a digital search tree with internal path length = 8

The worst case is the same as for binary search trees in a similar pathological case.But the average case for successful search in a digital search tree is improved:

lgN +γ − 1

ln 2+

3

2− α + δ (N) +O

„logN

N

«

.

Page 102: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

102 CHAPTER 10. DIGITAL SEARCH TREES

ONMLHIJKA000

0

kkkkkkkkkkkkkk1

NNNNNNNNN

ONMLHIJKC010

0

wwwwww 1

HHHHHHONMLHIJKH111

1

HHHHHH

ONMLHIJKB001ONMLHIJKD011

ONMLHIJKG110

Figure 10.4: Another example of a digital search tree with same keys

10.3.3 Internal Path Length

In this chapter the first property of a digital search tree is analyzed.

Definition 10.9. The internal path length of a tree is the sum of the depth of everynode of the tree.

This property is directly related to the complexity of the data structure: The numberof nodes examined during a successful search in a search tree with N nodes is the pathlength of this node and in average case this counts as one plus the internal path lengthnormalized through division by N .

Let AN be the average internal path length of a digital search tree built from N(sufficiently long) keys comprised of random bits. Then we have the fundamentalrecurrence relation

AN = N − 1 +∞X

k=0

1

2N−1

N − 1

k

!

(Ak +AN−1−k) , N ≥ 1 (10.1)

with A0 := 0.

The internal path length of any tree of N nodes is the sum of the internal path lengthsof the subtrees of the node plus N − 1, that are the ones missing for the distance fromeach node to the root of the whole tree. We count for each possible partition of thenodes to both subtrees and weight the sum by the number of all possibilities. Thesubtrees are randomly built.

We strike now for the goal to approximate AN to get an explicit useful term. Bysymmetry Ak is equal to AN−1−k and we get

AN = N − 1 +∞X

k=0

1

2N−1

N − 1

k

!

(Ak +Ak)

This equation is transformed into a functional one on the exponential generating func-

tion A (z) =P∞

N=0ANzN

N!with A′ (z) =

P∞N=1AN

zN−1

(N−1)!by multiplying both sides

Page 103: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.3. DIGITAL SEARCH TREES 103

by zN−1

(N−1)!and summing for N ≥ 1:

∞X

N=1

ANzN−1

(N − 1)!=

∞X

N=1

(N − 1) zN−1

(N − 1)!+ 2

∞X

N=1

∞X

k=0

1

2N−1

N − 1

k

!

AkzN−1

(N − 1)!

= z∞X

N=2

zN−2

(N − 2)!+ 2

∞X

k=0

∞X

N=k+1

Ak

2N−1

(N − 1)!

k! (N − 1 − k)!zN−1

(N − 1)!

= z∞X

t=0

zt

t!+ 2

∞X

k=0

Ak

k!

∞X

N=k+1

zN−1

2N−1 (N − k − 1)!

= zez + 2

∞X

k=0

Ak

k!

∞X

N=k+1

“z

2

”N−1 1

(N − k − 1)!

= zez + 2∞X

k=0

Ak

k!

∞X

N=0

“z

2

”(N+k+1)−1 1

((N + k + 1)− k − 1)!

= zez + 2∞X

k=0

Ak

k!

“z

2

”k∞X

N=0

“z

2

”N 1

N !

= zez + 2∞X

k=0

Ak

k!

“z

2

”k

ez2

A′ (z) = zez + 2A“z

2

ez2

We simplify this by substituting ezB (z) for A (z) with

B (z) =∞X

N=0

BNzN

N !

and

B′ (z) =

∞X

N=1

BNzN−1

(N − 1)!

That is, A (z) = ezB (z) and A′ (z) = ezB′ (z) + ezB (z). One can say that B (z)is the expectation of the internal path length, if the number of keys is Poisson withparameter z. So the equation reads in terms of B(z) as

ezB′ (z) + ezB (z) = zez + 2B“z

2

ez2 e

z2

B′ (z) +B (z) = z + 2B“z

2

∞X

N=1

BNzN−1

(N − 1)!+

∞X

N=0

BNzN

N != z + 2

∞X

N=0

BN

`z2

´N

N !

This corresponds to a simple recurrence on the coefficients

BN +BN−1 =1

2N−2BN−1

BN = −„

1 − 1

2N−2

«

BN−1

for N ≥ 3 with B2 = 1.And leads us to an explicit formula for BN :

BN = (−1)NN−2Y

j=1

1− 1

2j

«

Page 104: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

104 CHAPTER 10. DIGITAL SEARCH TREES

So we can get an explicit formula for AN :

A (z) = ezB (z)

= ez∞X

N=0

BNzN

N !

=

∞X

N=0

zN

N !

! ∞X

N=0

BNzN

N !

!

=

∞X

N=0

NX

k=0

1

(N − k)!Bk

k!

!

zN

=∞X

N=0

NX

k=0

Bk

N

k

!!

zN

N !

AN =NX

k=0

N

k

!

Bk (10.2)

We want to analyse this sum by Rice’s integral resp. a theorem for meromorphicfunction, which is derived in 9:

Theorem 10.1. Let φ be a function holomorphic in a domain that contains the half-line [n0,∞[. If n is big enough we have:If φ is meromorphic on the half-plane defined by R(s) ≥ d for some d < n0 and ofpolynomial growth in this set, then

nX

k=n0

n

k

!

(−1)kφ(k) = −X

s

(−1)nRess

φ(s)n!

s(s− 1) . . . (s− n)

«

+O(nd) (10.3)

where the sum is taken over all poles but n0, n0 + 1, . . . , n.

Therefore we introduce a new series QN to come closer to the preceding equation:

QN =NY

j=1

1− 1

2j

«

So we can get BN = (−1)N QN−2and by substitution:

AN =NX

k=2

N

k

!

(−1)k Qk−2 (10.4)

QN is defined only for integers, so we have to find a meromorphic function to extendQN to the complex plane. We choose the following function

Q (x) =∞Y

j=1

1 − x

2j

with obviously Q (1) = Q∞ and you can see clearly that Q(1)

Q(2−N )is a correct extension:

QN =NY

j=1

1 − 1

2j

«

=

Q∞j=1

`1− 1

2j

´

Q∞j=N

`1 − 1

2j

´ =Q (1)

Q∞j=1

`1− 1

2j+N

´ =

Q (1)Q∞

j=1

1− 2−N

2j

” =Q (1)

Q (2−N )

Page 105: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.3. DIGITAL SEARCH TREES 105

Our equation now has the form:

AN =NX

k=2

N

k

!

(−1)k Q (1)

Q (2−k+2)

Q (x) is obviously meromorphic on the half-plane defined by R(s) ≥ d with d := 12

andthe function is also polynomial. So we get the following equation by theorem 10.1:

AN = −X

z

(−1)NResz

B (N + 1,−z) Q (1)

Q (2−z+2)

«

+O“

N12

(10.5)

We proceed to evaluate the poles at z ≤ 1 for B (N + 1,−z) Q(1)

Q(2−z+2):

• B (N + 1,−z) is singular at z = 0, 1 because the Γ function is zero in thedenominator.

• z = j ± 2πikln 2

for j = 1, 0,−1, ... and all k ≥ 0 are poles since at these points2−z+j = 1 which causes one of the factors of Q

`2−z+2

´to vanish.

Only the poles for z ≥ 12

are within the region of interest.So we can approximate the residues at the poles. Beginning with z = 1:

−B (N + 1,−z) Q (1)

Q (2−z+2)= −B (N + 1,−z) 1

1 − 2−z+1

Q (1)

Q (2−z+1)

First analyze −B (N + 1,−z):

−B (N + 1,−z) = −Γ (N + 1) Γ (−z)Γ (N + 1 − z)

= −N ! (−z − 1)!

(N − z)!

= (−1)N N !QN

k=0 (z − k)

= (−1)N N !

zQN

k=1 k`

zk− 1´

= (−1)N N !

zN !QN

k=1

`zk− 1´

=

(−1)N z

NY

k=1

−“

1 − z

k

”!−1

= −

zNY

k=1

1− z

k

”!−1

=

z (z − 1)NY

k=2

1− z

k

”!−1

We want to approximate this by a Taylor series expansion. To do this the followinglemma is helpful:

Lemma 10.1. If F (z) =Q

j∈R1

1−fj(z)for some index set R, then the Taylor series

expansion of F at a, if it exists, is given by

F (z) = F (a)

1 +X

j∈R

f ′j (a)

1 − fj (a)(z − a) +O

`(z − a)2

´

!

Page 106: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

106 CHAPTER 10. DIGITAL SEARCH TREES

Proof.

G (z) =Y

j∈R

gj (z)

G′ (z) =X

j∈R

g′j (z)Y

k∈R6=j

gk (z)

G′ (z)

G (z)=

P

j∈R g′j (z)

Q

k∈R6=j gk (z)Q

j∈R gj (z)=X

j∈R

g′j (z)

gj (z)

F (z) = F (a)

1 +F ′ (a)

F (a)(z − a) +O

`(z − a)2

´«

=

F (a) + F ′ (a) (z − a) +O`(z − a)2

´

At a = 1 we have the expansion for the Beta function:

−B (N + 1,−z) =1

1 − z z−1

NY

j=2

1− z

j

«−1

=1

1 − zN`1 + (HN−1 − 1) (z − 1) +O

`(z − 1)2

´´

=N

1 − z −N (HN−1 − 1) +O (z − 1)

= − N

z − 1−N (HN−1 − 1) +O (z − 1)

with HN−1 = γ + lnN −O`

1N

´. So we get

−B (N + 1,−z) = − N

z − 1−N (γ + lnN − 1) +O (z − 1)

Approximation of 11−2−z+1 with the series expansion for 1

ex−1leads to:

1

1 − 2−z+1=

1

1 − eln 2(−z+1)= − 1

(−z + 1) ln 2+

1

2− −z + 1

12+O

`(−z + 1)3

´

=1

(z − 1) ln 2+

1

2+O (z − 1)

The missing part Q(1)

Q(2−z+1)is analyzed similarly to −B (N + 1,−z) by using the Taylor

expansion for the Q function:

Q (1)

Q (2−z+1)= Q (1)

Y

j<1

1 − 2−z+j”−1

= 1 − ln 2X

j<1

2j−1

1 − 2j−1(z − 1) +O

`(z − 1)2

´

= 1 − α ln 2 (z − 1) +O`(z − 1)2

´

with α = 1 + 13

+ 17

+ 115

+ ... ≈ 1, 606695.Connecting the analyzed parts, the integrand is approximately:

−B (N + 1,−z) Q (1)

Q (2−z+2)=

− N

z − 1−N (HN−1 − 1) +O (z − 1)

«

ׄ

1

(z − 1) ln 2+

1

2+O (z − 1)

«

×`1− α ln 2 (z − 1) +O

`(z − 1)2

´´

Page 107: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.3. DIGITAL SEARCH TREES 107

The residue at z = 1 is the Laurent coefficient a−1 of 1z−1

in this product:

Resz=1 = − N

ln 2(HN−1 − 1) +N

α − 1

2

«

= −N lgN −N„γ − 1

ln 2− α+

1

2

«

+O (1)

Then approximate the residues at the other poles z = 1 ± 2πikln 2

:

Resz=1± 2πikln 2

= − 1

ln 2

X

k 6=0

B

N + 1,−1− 2πik

ln 2

«

B

N + 1,−1− 2πik

ln 2

«

=Γ (N + 1) Γ

`−1 − 2πik

ln 2

´

Γ`N − 2πik

ln 2

´

= NΓ

−1− 2πik

ln 2

«Γ (N)

Γ`N − 2πik

ln 2

´

Now the standard approximation formula for the Γ function is used to simplify thisterm.

Γ (N + 1)

Γ (N + 1− α)= Nα

1 +O

„1

N

««

Resz=1± 2πikln 2

= NΓ

−1− 2πik

ln 2

«

(N − 1)2πikln 2

1 +O

„1

N

««

= NΓ

−1− 2πik

ln 2

«

N2πikln 2

1 +O

„1

N

««

= NΓ

−1− 2πik

ln 2

«

e2πik lgN

ln 2

1 +O

„1

N

««

The sum of residues at the points z = z = 1 ± 2πIkln 2

is found to be

Resz=1± 2πIkln 2

= −Nδ (N) +O (1)

where

δ (N) =1

ln 2

X

k 6=0

Γ

−1 − 2πik

ln 2

«

e2πik lg N

is a small oscillatory term. So finally we insert these results in (10.5) and get thefollowing theorem:

Theorem 10.2. The average internal path length of a digital search tree built from Nrecords with keys from random bit stream is

AN = N lgN +N

„γ − 1

ln 2− α+

1

2+ δ (N)

«

+O“

N12

Proof. This follows from the derivation above.

10.3.4 External Internal Nodes

A property of trees of some interest is the number of internal nodes which have bothlinks null. An alternate storage representation could be used for such nodes to savespace.

Page 108: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

108 CHAPTER 10. DIGITAL SEARCH TREES

Theorem 10.3. The average number of nodes with both links null in a digital searchtree built from N records with keys from random bit streams is

N

β + 1 − 1

Q∞

„1

ln 2+ α2 − α

«

+ δ∗ (N)

«

+O“

N12

where the constants involved have the values

α = 1 + 13

+ 17

+ 115

+ ... ≈ 1.606695...,Q∞ = 1

2· 3

4· 7

8+ ... ≈ .288788... and

β = 1·22

1

`11

´+ 2·23

1·3`

11

+ 13

´+ 3·24

1·3·7`

11

+ 13

+ 17

´+ ... ≈ 7.74313...

The function δ∗ (N) is a periodic function in lgN , with |δ∗ (N) | < 10−6. The approx-imate value of the coefficient of the leading term is 0.372046812....

Proof. As before, we use a simple transform with generating functions to derive anexplicit sum, then use Rice’s method to evaluate this sum. The number of externalinternal nodes is

CN =∞X

k=0

1

2N−1

N − 1

k

!

(Ck + CN−1−k) , N ≥ 2 (10.6)

with C1 = 1 and C0 = 0. This follows from the fact that the number of nodes withboth links null in a tree is exactly the sum of the numbers of such nodes in the twosubtrees of the root while the tree has more than one node.In terms of the exponential generating function C (z) =

P∞N=0

CNzN

N!by multiplying

both sides by zN−1

(N−1)!and summing for N ≥ 1, we have:

C1 +∞X

N=2

CNzN−1

(N − 1)!= 1 +

∞X

N=2

2∞X

k=0

1

2N−1

N − 1

k

!

CkzN−1

(N − 1)!

!

∞X

N=1

CNzN−1

(N − 1)!= 1 +

∞X

N=2

2∞X

k=0

1

2N−1

N − 1

k

!

CkzN−1

(N − 1)!

!

This leads, similar to AN , to the equation

C′ (z) = 1 + 2C“z

2

ez2 (10.7)

Again we introduce a new generating functionD (z) =P∞

N=0DNzN

N!defined byD (z) =

e−zC (z) to get a somewhat more manageable form:

D′ (z) +D (z) = e−z + 2D“z

2

By the recurrence on the coefficients we get:

DN +DN−1 = (−1)N−1 +1

2N−2DN−1

DN = (−1)N−1 −“

1− qN−2”

DN−1, N ≥ 2 (10.8)

with D1 = 1, D0 = 0.We define the constant q := 1

2(we will see that only this constant changes for M-ary

trees). This recurrence is inhomogeneous, so we get a somewhat more complicatedexplicit form:

DN = (−1)N−1N−1X

i=1

N−2Y

j=i

1− qj”

Page 109: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.3. DIGITAL SEARCH TREES 109

Rewriting this in terms of

RN =

NX

i=1

NY

j=i

1− 1

2j

«

=NX

i=1

QNj=1

`1 − 1

2j

´

Qij=1

`1 − 1

2j

´

= QN

NX

i=2

QNj=1

`1 − 1

2j

´

Qij=1

`1 − 1

2j

´

= QN +NX

i=1

QN

Qi

= QN

1 +

NX

k=1

1

Qk

!

and transforming like (10.2) back:

DN = (−1)N RN−2CN

= N −NX

k=2

N

k

!

DN

We have the following explicit sum for the desired quantity:

CN = N −∞X

k=2

N

k

!

(−1)k Rk−2 (10.9)

This sum is similar to (10.4) but more difficult to evaluate because RN is more com-plicated than QN . Because R(z) does not extend RN for positive integers, we get byTaylor expansion that RN = N + 1− α +R∗

N . R∗N satisfies a simple recurrence, con-

verges very quickly and is polynomial bounded, so we extend R∗N by the meromorphic

function R∗ (z)

R∗N =

(N + 1− α) qN+1

1 − qN+1+

1

1− qN+1R∗

N+1

R∗ (z) =

∞X

i=2

(z + 1 + i − α) qz+1+i

Qij=0 (1 − qz+1+j)

Substituting, we have

CN = N −∞X

k=2

N

k

!

(−1)k (R∗k−2 + k + 1 − α)

After applying the elementary identities of Pascal’s triangle

∞X

k=0

N

k

!

(−1)k =

∞X

k=0

N

k

!

α (−1)k = 0,

we have the simplified result

CN = (N − 1) (α+ 1)−∞X

k=2

N

k

!

(−1)k R∗k−2 (10.10)

Page 110: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

110 CHAPTER 10. DIGITAL SEARCH TREES

Now, by Theorem 10.1 and again looking only at the half-plane to the right of the linez = 1

2, we know that

CN − (N − 1) (α + 1) = −X

z

(−1)NResz (B (N + 1,−z)R∗ (z − 2)) +O“

N12

(10.11)In this case we look at the poles for R∗ (z − 2) at 1 ± 2πik

ln 2and see that they are all

single poles. The main term is given by N limz→1R∗ (z − 2); the poles for k 6= 0 add

a small oscillatory term.

Lemma 10.2.∞X

n=0

un

Qnk=1 (1− qk)

=1

Q∞k=0 (1 − qku)

Proof. The coefficient of unqm on both sides is the number of ways to write n as thesum of m nonnegative integers.

The method of calculating R∗ (−1) is to express R∗ (z) in terms of generating functions,which generalizes the function of Lemma 10.2, then to expand that function and exploitcertain properties of its derivatives. Specifically, we define

F (u, v) =

∞X

j=1

qjuj

Qji=1 (1− qiv)

This is the generating function for restricted partitions, the coefficients of unvmqk isthe number of ways to partition k into m parts not exceeding n. By Lemma 10.2 weget:

F (u, 1) =∞X

n=0

qnun

Qnk=1 (1 − qk)

F (u, 1) + 1 =∞X

n=0

un

Qnk=1 (1 − qk)

F (u, 1) =1

Q∞k=0 (1− qku)

− 1

F (1, 1) = Q−1∞ − 1.

Also we have

F ′1 (u, 1) =

1Q∞

i=1 (1 − qiu)

∞X

(k=1)

qk

(1 − qku)

so that F ′1 (1, 1) = α

Q∞. Furthermore, we have

F`1, qz+1

´=

∞X

j=1

qj

Qji=1 (1− qz+1+i)

F ′1

`1, qz+1

´=

∞X

j=1

jqj

Qji=1 (1− qz+1+i)

,

which finally gives the following expansion:

R∗ (z) =qz+1

1− qz+1

`(z + 1 − α)

`F`1, qz+1´+ 1

´+ F ′

1

`1, qz+1´´

From this formulation, a Taylor expansion around z = −1 is straightforward:

qz+1

1− qz+1=

1

(z + 1) ln q− 1

2+O (z + 1)

Page 111: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.4. DIGITAL SEARCH TRIES 111

F`1, qz+1

´= F (1, 1) + (z + 1) ln qF ′

2 (1, 1) +O (z + 1)2

F ′1

`1, qz+1´ = F ′

1 (1, 1) + (z + 1) ln qF ′12 (1, 1) +O

`(z + 1)2

´,

so that

R (z) = −F (1, 1) + 1

ln q+ αF ′

2 (1, 1) − F ′′12 (1, 1) +O (z + 1)

F ′2 (1, 1) =

∞X

j=1

qj

Qji=1 (1− qi)

jX

k=1

qk

1− qk

!!

(10.12)

F ′′12 (1, 1) =

∞X

j=1

jqj

Qji=1 (1 − qi)

jX

k=1

qk

1− qk

!!

(10.13)

Actually, we can relate F ′2 (1, 1) to α and Q∞. Because

F ′1 (1, 1) = F ′

2 (1, 1) + F (1, 1)

there is F ′2 (1, 1) = α−1

Q∞+1. There does not seem to be an easy way to express F ′′

12 (1, 1)in terms of α and Q∞, so we denote that constant simply by β. Collecting terms, wehave shown that the residue of the integrand at z = 1 is

N

β + 1 − 1

Q∞

α2 − α− 1

ln q

««

It remains to calculate the residues of the integrand at the other singularities.This calculation is straightforward: the residue of 1

1−qz+1 at z = −1± 2πikln q

is − 1ln q

, and

the other terms in R∗ (z) contribute a factor of 2πikQ∞ ln q

. The factor for B (N + 1,−z)is expanded exactly as in the preceding Taylor expansion for the internal path length;thus we have the oscillatory term

δ∗ (N) =1

Q∞ ln q

X

k 6=0

2πik

ln qΓ

−1− 2πik

ln q

«

e2πik lg N

This completes the calculation of the coefficient of the linear term.

10.4 Digital Search Tries

10.4.1 Data Structure of Tries

Definition 10.10. A digital search trie is a digital tree for storing a set of strings inwhich there is one node for every prefix of every string in the set.

The name of this data structure comes from the word retrieval. The word retrieval isstressed, because a trie has a lookup time that is equivalent to the length of the stringbeing looked up. Again we represent the strings as keys in binary form. It may beconvenient to assume that the strings are all of same (binary) length, but the methodis also appropriate for varying length strings, if no string is a prefix of another.Digital search tries compared to trees have much improved worst case performance.Their average case performance is asymptotically optimal. If N records with keys fromrandom bit streams are inserted into an initially empty trie, then the average numberof nodes examined during successful search reads as

lgN +γ

ln 2+

1

2+ δ (N) +O

„1

N

«

Page 112: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

112 CHAPTER 10. DIGITAL SEARCH TREES

/.-,()*+0

ddddddddddddddddddd1

TTTTTTTT

/.-,()*+0

iiiiiiiii 1

UUUUUUUUU /.-,()*+1

UUUUUUUUU

/.-,()*+0 xx 1FF /.-,()*+

0 xx 1GG /.-,()*+0 xx 1GG

A000 B001 C010 D011 G110 H111

Figure 10.5: Example for a digital search trie with external path length = 18

10.4.2 Data Structure of Patricia Tries

You can optimize the performance of a trie constructed with N keys by ensuring thatthis trie has just N − 1 internal nodes by collapsing one-way branches on internalnodes and get the so called Patricia tries:

Definition 10.11. A Patricia tree is defined as a compact representation of a digitalsearch trie where all nodes with one child are merged with their parent.

/.-,()*+0

ddddddddddddddddddd11

XXXXXXXXXXXXX

/.-,()*+0

kkkkkkkkk1

SSSSSSSSS /.-,()*+0qqq

q 1MMMM

/.-,()*+0

1AAA

/.-,()*+0

1BBB G110 H111

A000 B001 C010 D011

Figure 10.6: Example for a Patricia trie

The average number of nodes examined during successful search is one less than forstandard tries.

10.4.3 External Path Length

Definition 10.12. The external path length of a tree is the sum of the depth of everyleaf of the tree.

The fundamental recurrence for the average external path length of a binary trie is

A[T ]N = N +

∞X

k=0

1

2N

N

k

!“

A[T ]k +A

[T ]N−k

, N ≥ 2 (10.14)

with A[T ]0 = A

[T ]1 = 0. This is the number of nodes examined during all successful

searches. Note that since no key is stored at the root, the subtrees have a total of Nkeys. The resulting functional equation on the exponential generating function is nota difference-differential but simply a difference equation:

A[T ] (z) = z (ez − 1) + 2A[T ]“z

2

ez−2

It is still convenient to transform the equation with A (z) = ezB (z) to get the equation

B[T ] (z) = z`1 − e−z´+ 2B[T ]

“z

2

This yields directly

B[T ] (z) =N (−1)N

1 −`

12

´N−1

Page 113: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.4. DIGITAL SEARCH TRIES 113

and

A[T ]N =

∞X

k=2

N

k

!

(−1)k k

1−`

12

´k−1

This can be handled directly by Rice’s method or Mellin transform techniques, asdescribed in full detail in [Szp00].The fundamental recurrence for the average external path length of a Patricia trie is

A[P ]N = N

1− 1

2N−1

«

+∞X

k=0

1

2N

N

k

!“

A[P ]k +A

[P ]N−k

, N ≥ 1 (10.15)

with A[P ]0 = 0. The external path length is the sum of the ones of the subtries of the

root plus the number of nodes in the subtries (N) unless one of the subtries is emptywhich has the probability 1

2N−1 . The resulting functional equation on the exponentialgenerating function is

A[P ] (z) = z“

ez − e z2”

+ 2A[P ]“z

2

ez2

with transformed version

B[P ] (z) = z“

1− e− z2

+ 2B[P ]“z

2

which yields directly

B[P ] (z) =N (−1)N

2N−1 − 1

and

A[P ]N =

∞X

k=2

N

k

!

k (−1)k

2k−1 − 1= A

[T ]N −N

Given the result for binary tries the average external path length for Patricia tries isobvious.

10.4.4 External Internal Nodes

The average number of internal nodes with both sons external are computed for Pa-tricia tries. The derivation is similar to the one for digital search trees, so we onlysketch it here. We start with the recurrence

C[P ]N =

X 1

2N

N

k

!“

C[P ]k + C

[P ]N−k

, N ≥ 3 (10.16)

with C[P ]0 = C

[P ]1 = 0 and C

[P ]2 = 1. This corresponds to the functional equation

C[P ] (z) =“z

2

”2

+ 2C[P ]“z

2

ez2

which transforms to

D[P ] (z) =“z

2

”2

e−z + 2D[P ]“z

2

and eventually gives the sum

C[P ]N =

1

4

NX

k=2

N

k

!

(−1)k k (k − 1)

1 − 12

k−1

Knuth gives specific evaluations of such sums. The eventual result is that the propor-tion of nodes in Patricia tries with both sons external is 1

4 ln 2= .3606... plus a small

oscillatory term. Thus, according to this measure, digital search trees are (slightly)more balanced than Patricia tries.

Page 114: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

114 CHAPTER 10. DIGITAL SEARCH TREES

10.5 Multiway Branching

Above only binary trees have been analyzed. But we can generalize the average analysisto M -ary trees, where each node contains M links to other nodes, numbered from 0 toM −1. It turns out that the analysis given above survives largely intact for the M -arycase. For example, to find the average number of nodes in a M -ary digital search treewith all links null, we begin with the fundamental recurrence:

C[M]N =

X

k1+k2+...+kM=N−1

1

MN−1

N − 1

k1, k2, ..., kM

!“

C[M]k1

+ C[M]k2

+ ... + C[M]kM

,

N ≥ 2 (10.17)

with C[M]1 = 1 and C

[M]0 = 0. The argumentation is the same as for the digital

search tree. The number of nodes with all links null in a tree is exactly the number ofsuch nodes in all the subtrees of the root, unless the tree has just one node. Again thepartitions of N−1 nodes without the root into M subtrees weighted by all possibilitiesare examined. All the subtrees are randomly built according to the same model.By symmetry, (10.17) is equivalent to

C[M]N = M

X

k1+k2+...+kM=N−1

1

MN−1

N − 1

k1, k2, ..., kM

!

C[M]k1

, N ≥ 2

with C[M]1 = 1 and C

[M]0 = 0. Now we introduce the exponential generating function

C[M] (z) =P∞

N=0

C[M]N

zN

N!and derive the following difference-differential equation:

C[M]′ (z) = 1+MC[M]“ z

M

”“

ezM

”M−1

= 1+MC[M]“ z

M

”“

e(1−1M )z

(10.18)

For M = 2, this is exactly the equation derived from (10.7); moreover none of themanipulations used for solving it depend in an essential way on the value of thatconstant.

Corollary 10.1. The average number of nodes with all links null in an M-ary searchtree (for M ≥ 2) built from N records with keys from random bit streams is

N

β[M] + 1− 1

Q[M]∞

„1

lnM+ α[M]2 − α[M]

«

+ δ[M] (N)

!

+O“

N12

where the constants involved are given byα[M] =

P∞k=1

1Mk−1

,

Q[M]∞ =

Q∞k=1

`1− 1

Mk

´,

β[M] =P∞

k=1kMk+1

Qki=1(Mi−1)

Pkj=1

1Mj−1

and the oscillatory term isδ[M] (N) = 1

Q∞ ln M

P

k 6=02πikln M

Γ`−1 + 2πik

lnM

´e2πik lg N .

10.6 General Framework

The methods that we have used in the previous sections can be applied to study manyother properties of digital trees. If X (T ) and x (T ) are parameters of trees satisfying

X (T ) =X

subtrees Tj of the root of T

X (Tj) + x (T ) (10.19)

Page 115: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

10.6. GENERAL FRAMEWORK 115

then the exponential generating functions for the expectations XN and xN for anM -ary digital search tree built from N records with keys from random bit streamssatisfy

X ′ (z) = MX“ z

M

e(1−1M )z + x (z)

This is derived in exactly the same manner as (10.18). Now in terms of the generatingfunctions Y (z) = e−zX (z) and y (z) = e−zx (z) this becomes

Y ′ (z) + Y (z) = MY“ z

M

+ y′ (z) + y (z) (10.20)

This leads to a nonlinear recurrence like (10.8) satisfied by YN , with the solution soughtgiven by XN =

PNk=0

`Nk

´Yk. If the quantity (−1)k Yk is sufficiently well behaved, we

can study its asymptotics and find a function Y ∗k which

(i) is simply related to Yk so thatPN

k=0

`Nk

´ “

Yk − (−1)k Y ∗k

is easily evaluated,

(ii) satisfies a recurrence of the form Y ∗N+1 = (1 − g (M,N))Y ∗

N + f (M,N),

(iii) goes to zero quickly as N →∞.

Depending on the nature of g (M,N) , f (M,N) and the speed of convergence, condi-tions (ii) and (iii) may allow the recurrence to be turned around to extend Y ∗

N to thecomplex plane and so allow the desired expectation to be computed by evaluating the

sumPN

k=0

`Nk

´ “

Yk − (−1)k Y ∗k

as detailed in the previous sections.

The same type of generalization applies to the study of tries (and Patricia tries), andthe simpler nature of the recurrences follows through the generalization. For example,the exponential generating functions for the expectations XN and xN of parametersof trees satisfying (10.19) for a random trie built from N records from random bitstream is

X (z) = MX“ z

M

ezM + x (z) (10.21)

which is considerably easier to deal with. The equation can be solved by Rice’s methodand also by Mellin transform techniques.This general framework allows quite full analysis of the types of trees considered, andthey clearly expose the fundamental differences and similarities among the analyses.

Page 116: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

116 CHAPTER 10. DIGITAL SEARCH TREES

Page 117: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 11

Mellin transforms andasymptotics: HarmonicsumsIlja Posov

This survey presents a unified and essentially self-contained approachto the asymptotic analysis of a large class of sums that arise in combi-natorial mathematics, discrete probabilistic models, and the average-caseanalysis of algorithms. It relies on the Mellin transform, a close relativeof the integral transforms of Laplace and Fourier. The method applies toharmonic sums that are superpositions of rather arbitrary ”harmonics” ofa common base function. Its principle is a precise correspondence betweenindividual terms in the asymptotic expansion of an original function andsingularities of the transformed function. Here no theorem is proved, andeven not every theorem is completely formulated. For precise presentationof the theory reader is refered to the original paper.

We have to deal a lot with complex variable functions and I’ll remind you some basicconcepts about them. The first concept about complex variable functions is holomor-phic function. Function is called holomorphic in some area, if it has complex derivativein every point of this area.

The second concept is ‘analitic function’. Function is analitic in some point z0 ofcomplex plane, if it can be expanded into Taylor series in this point, i.e. f(z) =P∞

n=0 cn(z− z0)n = c0 + c1(z− z0) + c2(z− z0)2 + · · · . Similary, the function is calledanalitic in an area, if it is analitic in every point of that area. One of the centralresult of the complex variable function theory is the theorem, that every holomorphicin some area function is analitic there. The converse statement holds too. We’ll usethe word ‘analitic’ a lot.

Except holomorphic functions there are meromorphic functions. Meromorphic in anarea function is a function, that is holomorphic there except discrete set of points thatare called poles. Discrete set means a set, every point of which can be isolated fromother points. For example, every finite set is discrete.

Consider a function f(z) = 1z(z−1)

. It is analitic (and therefore holomorphic) in C \0, 1, but it is meromorphic in entire C with poles z = 0 and z = 1.

The last concept is ‘open strip’. Open strip 〈a, b〉 = z = x + iy | a < y < b is a setof points in a complex plane that looks like:

117

Page 118: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

118 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

Open strip can be infinite, if a or b is infinity, and it’s obvious that 〈−∞,∞〉 = C

11.1 Mellin transform definition

Robert Hjalmar Mellin (1854-1933) was Finnish mathematitian who studied the trans-form which now bears his name and established its reciprocal properties. Now we arefinally going to learn what Mellin transform is.

Definition 11.1. Let f(x) be real function defined on (0,+∞). Then its Mellintransform is complex valued function that is defined by equality

M [f(x); s] = f∗(s) =

Z +∞

0

f(x)xs−1dx

Of course, integral from definition usually converges not for all s ∈ C, but it usuallyconverges for all s from some open strip, which in this case is called ‘fundamental’strip.

Proposition 11.1. If f(x) = O(xu) as x → 0 and f(x) = O(xv) as x → +∞, thenthe integral from Mellin transform definition converges for every s ∈ 〈−u,−v〉 anddefines an analitic function in this strip.

Example 11.1. If f(x) = xk, then f(x) = O(xk) as x → 0 and f(x) = O(xk) asx→ +∞. Proposition states that transform of f(x) (that is f ∗(s)) exists in the openstrip 〈−k,−k〉, but this strip is empty. In fact, transform of xk simply doesn’t exist,i.e.

R∞0xkxs−1dx doesn’t converge for every k ∈ R and s ∈ C. One can simply check

it.

Now I present examples of functions that do have Mellin transforms.

Example 11.2. Let f(x) = 11+x

. f(x) = O(1) = O(x0) as x→ 0, and f(x) = O(x−1)as x → +∞. Now we can make use of proposition 11.1, here u = 0 and v = −1.Proposition states, that in this case Mellin transform f∗(s) =

R +∞0

11+x

xs−1dx existsin the fundamental strip 〈0, 1〉. The integral can be evaluated and it occurs, thatf∗(s) = π

sin πs. But the equality holds only for s ∈ 〈0, 1〉, for other s integral simply

doesn’t converge. By the way, function πsin πs

by itself can be evaluated practically inentire C, except, may be, integer points.

Example 11.3 (Gamma function). Now we consider the function f(x) = e−x.f(x) = O(1) = O(x0) as x → 0, and ∀M>0 f(x) = O(x−M ) as x → +∞. Again,after using the proposition 11.1 we obtain, that Mellin transform of f(x) exists in theopen strip 〈0,M〉 for every M > 0. It means, that the fundamental strip of this Mellintransform is 〈0,+∞〉. Now let’s evaluate the transform. f∗(s) =

R +∞0

e−xxs−1dx.This integral is called Gamma function and notation is Γ(s). There is a well knownfunctional equation on gamma function which states that sΓ(s) = Γ(s + 1). Thisequation allows us to evalute gamma function not only in the right half of complexplane, but also in every other point of complex plane. For example, (− 1

2)Γ(− 1

2) =

Γ( 12) =

√π, so Γ(− 1

2) = −2

√π. The problem is only with nonpositive integers.

Page 119: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

11.2. MELLIN TRANSFORM BASIC PROPERTIES 119

Statement 0Γ(0) = Γ(1) doesn’t allow us to evaluate gamma function in zero, itdemonstrates, that there is a pole of gamma function in zero. Every negative integeris a pole of gamma function for the same reason.

Gamma function occurs frequently in Mellin transforms. But now we look on the lastexample of Mellin transform before going to learn basic properties of Mellin transform.

Example 11.4 (Transform of step function).

H(x) =

(

1, x ∈ (0, 1)

0, x ∈ (1,+∞)

As in the previous example and for the same reasons, fundamental strip of the trans-formed function is 〈0,+∞〉. Here the transform can be simply evaluated. H∗(s) =R +∞0

H(x)xs−1dx =R 1

0xs−1dx = 1

s. As in the all previous examples, we see that

transformed function is defined not only in the fundamental strip. Here fundamentalstrip is 〈0,+∞〉, but 1

smay be evaluated in C \ 0. Fundamental strip is only the

place where an integral converges.

If we had considered another step function H(x) = 1−H(x), we would have obtainedthe transform H

∗(s) = − 1

swith fundamental strip 〈−∞, 0〉.

11.2 Mellin transform basic properties

All basic properties of Mellin transform can be simply obtained by means of suchmethods as integration by parts and change of variable. Here they are summarized ina table.

f(x) f∗(s) 〈α, β〉

(1) xνf(x) f∗(s+ ν) 〈α− ν, β − ν〉

(2) f(xρ) 1ρf∗( s

ρ) 〈ρα, ρβ〉 ρ > 0

(3) f( 1x) −f∗(−s) 〈−β,−α〉

(4) f(µx) 1µsf∗(s) 〈α, β〉 µ > 0

(5)P

k λkf(µkx)“P

kλkµs

f∗(s)

(6) f(x) log x ddsf∗(s) 〈α, β〉

(7) Θf(x) −sf∗(s) 〈α′, β′〉 Θ = x ddx

(8) ddxf(x) −(s− 1)f∗(s− 1) 〈α′ + 1, β′ + 1〉

(9)R x

0f(t)dt − 1

sf∗(s+ 1)

The most interesting are the fourth and the fifth properties. The fourth one iscalled ‘separation property’ and the fifth property is its generalisation. If the sumP

k λkf(µkx) is finite, fifth property is obvious because of linearity of Mellin trans-form. But if the sum is infinite, the fifth property holds only if function f(x) and seriesP

kλkµs

satisfy some additional conditions. Anyway, we’ll use this property in this pa-per for infinite sums without paing attention to the problem. All studied functionsare good enough and fifth property holds for them.

Page 120: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

120 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

Example 11.5 (Zeta function). Here we’ll use the fifth property to introduce zetafunction. Consider the function

g(x) =e−x

1− e−x= e−x + e−2x + e−3x + · · ·

The series converges for every x > 0. Now we apply the fifth property. λk = 1, µk = kand f(x) = e−x, so

g∗(s) =

„1

1s+

1

2s+

1

3s+ · · ·

«

Mˆe−x; s

˜= ζ(s)Γ(s)

Series ζ(s) = 11s

+ 12s

+ 13s

+ · · · converges for every s ∈ 〈1,+∞〉. Fundamental stripof the transform is 〈1,+∞〉 too, and why it is so we’ll discuss later.

Now we’ll summarize in a table a number of Mellin transforms, some of them wereobtained earlier, some of them can be obtained by means of Mellin transform basicproperties.

f(x) f∗(s) 〈α, β〉e−x Γ(s) 〈0,+∞〉e−x − 1 Γ(s) 〈−1, 0〉

e−x2 12Γ( 1

2s) 〈0,+∞〉

e−x

1−e−xζ(s)Γ(s) 〈1,+∞〉

11+x

πsin πs

〈0, 1〉log(1 + x) π

s sin πs〈−1, 0〉

H(x) ≡ 10<x<11s

〈0,+∞〉

xα(log x)kH(x) (−1)kk!

(s+α)k+1 〈−α,+∞〉 k ∈ N

Here the most interesting is in the first two lines, we see two different functions havingthe same Mellin transform. The only diffence is in fundamential strips of the trans-forms. And now it’s a good moment to formulate a theorem about reconstruction ofinitial function having only its Mellin transform.

Theorem 11.1. Let f(x) have Mellin transform f ∗(s) with fundamental strip 〈α, β〉.Let α < c < β and f∗(c+ it) is integrable. Then the equality

1

2πi

Z c+i∞

c−i∞f∗(s)x−sdx = f(x)

holds almost everywhere.

The picture presents a patch of integration used in the theorem.

Page 121: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

11.3. SINGULARITIES 121

11.3 Singularities

Definition 11.2. Laurent expansion of function φ(s) in point s0 is an equality:

φ(s) =

+∞X

k≥−r

ck(s− s0)k

Here c−r 6= 0. If r > 0, then s0 is called a pole of order r. If r = 1, then pole is calledsimple. If r = 2, pole is double.

If r ≤ 0, then function is analitic in s0, because Laurent series in this case degeneratesinto Taylor series.

Example 11.6. Consider the function 1s2(s+1)

, it has two poles on the complex plane.

Double pole is at s0 = 0 and simple pole is at s0 = −1:

1

s2(s+ 1)=

1

s+ 1+ 2 + 3(s+ 1) + · · · s0 = −1

1

s2(s+ 1)=

1

s2− 1

s+ 1− s+ · · · s0 = 0

Definition 11.3. A singular element (s.e.) of φ(s) at s0 is an initial sum of Laurentexpansion truncated at terms of O(1) or smaller.

Example 11.7. We consider the same function φ(s) = 1s2(s+1)

as in the previous

example. Singular elements at s0 = 0 are:

»1

s2− 1

s

,

»1

s2− 1

s+ 1

, . . .

Here we can trancate the Laurant expansion wherever we want, but we are to includeterms with negative degree of s. The same is about singular elements in s0 = 1. Theyare: »

1

s+ 1

,

»1

s+ 1+ 2

,

»1

s+ 1+ 2 + 3(s+ 1)

, . . .

We are to include all items with negative degree of s+ 1, other items we can includeif we want, but usually there is no need to do it.

Definition 11.4. Let φ(s) be meromorphic in some area Ω with G including all thepoles of φ(s) in Ω. A singular expansion of φ(s) in Ω is a formal sum of singularelements of φ(s) at all points of G. Notation: φ(s) E.

Example 11.8.

1

s2(s+ 1)»

1

s+ 1

s=−1

+

»1

s2− 1

s

s=0

+

»1

2

s=1

s ∈ C

It is a singular expansion of function φ(s) = 1s2(s+1)

in all complex plane. G =

−1, 0, 1. There is no pole at point s0 = 1, but we can include a singular element atit in a singular expansion, if we want. G is to include all the poles of the function, butit also may include any other points. However there is usually no sense in it.

Singular expansion is only a formal sum, we are not trying to evaluate it or even tosimplify. This sum only shows us which poles does function have and what singularelements are there.

Page 122: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

122 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

Example 11.9 (Singular expansion of gamma function). I’ll remind you thatgamma function is defined by the equality

Γ(s) =

Z +∞

0

e−xxs−1dx

Integral converges for s ∈ 〈0,+∞〉, but functional equation sΓ(s) = Γ(s+ 1) allows tomake a continuation of the gamma function to all complex plane except nonpositiveintegers. (It has been already noticed in example 11.3) Now we are going to obtain asingular expansion of gamma function in C.First af all, functional equation on gamma function implies

Γ(s) =Γ(s+m+ 1)

s(s+ 1)(s+ 2) . . . (s+m), m ∈ N ∪ 0

It demonstrates again, that gamma function has poles in nonpositive integers, but nowwe can learn much more about them. After making a kind of substitution of −m fors we obtain

Γ(s) ∼(−1)m

m!

1

s+m, as s→ −m

This means that left side divided by right side tends to 1 as s tends to −m. But wecan understand it as that the right side is a singular element of gamma function inpoint s = −m. Now we know singular elements in all the poles of gamma functionand thus we can write a singular expansion:

Γ(s) +∞X

k=0

(−1)k

k!

1

s+ k, m ∈ C

As it was already said, we are not trying to evaluate the sum, we only look on it andsee what poles does Gamma function have, and what singular elements are there. Forexample we see that all poles of gamma function are simple.

Here the arrangement of gamma function poles is demonstrated at the picture.

11.4 Direct mapping

We have already seen in proposition 11.1, that asymptotics of function f(x) in zeroresults on the leftmost boundary of the fundamental strip of Mellin transform f ∗(s).The same is with the asimptotics in infinity. It results on the rightmost boundaryof the fundamental strip. Direct mapping is a theorem about what information canwe obtain about Mellin transform, if we know a more detailed asymptotics of initalfunction f(x) in zero and infinity.

Theorem 11.2 (Direct mapping). Let f(x) have a transform f ∗(s) with nonemtyfundamental strip 〈α, β〉. Let

f(x) =X

(ξ,k)∈A

cξ,kxξ(log x)k + O(xγ), x→ 0

Page 123: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

11.4. DIRECT MAPPING 123

where k ∈ N ∪ 0, −γ < −ξ ≤ α. Then f∗(s) is continuable to a meromorphicfunction in the strip 〈−γ, β〉, where it admits the singular expansion

f∗(s) X

(ξ,k)∈A

cξ,k(−1)kk!

(s+ ξ)k+1, s ∈ 〈−γ, β〉

If asymptotic expansion is given at infinity, then the similar result holds. The onlydifference is, that meromorphic continuation is to the right of the fundamental stripand there is an additional minus sign in singular expansion of transformed function.Look for explanation in two tables given below.

Singular expansion of transformed function presented in the theorem may seem to beconfusing, but in real life there are no logarithms in asymptotic expansions. If thereis one, it comes in the first degree. In this cases k equals 0 or 1, which makes singularexpansion much more simple.

Let’s put in a table some information that we know about connection between asymp-totic expansion of initial function and properties of transformed one.

f(x) f∗(s)

Order at 0: O(x−α) Leftmost boundary of f.s. at <(s) = α

Order at +∞: O(x−β) Rightmost boundary of f.s. at <(s) = βExpansion till O(xγ) at 0 Meromorphic continuation till <(s) = −γExpansion till O(xδ) at +∞ Meromorphic continuation till <(s) = −δ

The next table contains information about connections between terms in asymptoticexpansion of initial function and singularities of its Mellin transform.

f(x) f∗(s)

Term xa(log x)k at 0 Pole with s.e. (−1)kk!

(s+a)k+1

Term xa(log x)k at +∞ Pole with s.e. − (−1)kk!

(s+a)k+1

Term xa at 0 Pole with s.e. 1s+a

Term xa log x at 0 Pole with s.e. − 1(s+a)2

Information contained in two tables is visualized on these two pictures.

Example 11.10. We have already obtained the singular expansion of gamma functionby means of functional equation on it. Now we’ll obtain the same result, but byapplying the direct mapping theorem.

It was shown in example 11.3, that function f(x) = e−x has Mellin transform f∗(s) =Γ(s). Asymptotic expansion of f(x) at 0 is as follows:

f(x) = e−x = 1 − x+x2

2!− x3

3!+ · · · =

MX

j=0

(−1)j

j!xj + O(xM+1)

Theorem states, that f∗(s) is meromorphicaly continuable to 〈−M − 1,+∞〉. (And

Page 124: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

124 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

we do know it already) The asymptotic expansion of continued function is:

f∗(s) MX

j=0

(−1)j

j!

1

s+ js ∈ 〈−M − 1,+∞〉

The last holds for every positive M , so we can rewrite it in the following way:

f∗(s) +∞X

j=0

(−1)j

j!

1

s+ js ∈ C

and this is an asymptotic expansion in entire C. This expansion we have already seenin example 11.9.

Example 11.11. In example 11.5 we intoduced the Mellin pair

g(x) =e−x

1− e−x; g∗(s) = ζ(s)Γ(s)

with fundamental strip 〈1,+∞〉. The asymptotic expansion of g(x) at 0 is

g(x) =e−x

1 − e−x=

+∞X

j=−1

Bj+1xj

(j + 1)!

where Bj are so-called Bernoulli numbers, we can suppose that this asymptotic ex-pansion is a definition of Bernoully numbers. B0 = 1, B1 = −1/2. As in the previousexample, asymptotic expansion is complete, i.e. we can write, that g(x) = . . .+O(xM )for any positive M we want. So, direct mapping theorem states, that transform g∗(s)has meromorphic continuation to strip 〈−∞,+∞〉 = C. The singular expansion thereis

g∗(s) = ζ(s)Γ(s) +∞X

j=−1

Bj+1

(j + 1)!

1

s+ j, s ∈ C

If we compare it with singular expansion of gamma funcion

Γ(s) =

+∞X

j=0

(−1)j

j!

1

s+ j, m ∈ C

we can extract the singular expansion of zeta function

ζ(s) =

»1

s− 1

s=1

+

+∞X

j=0

»

(−1)j Bj+1

j + 1

s=−j

We see, that there are no poles in nonpositive integers, so we have obtained a result,that zeta function in meromorphic in entire C with the only pole in s0 = 1. Sincesingular expansion is a sum of initial sums of Laurent expansions, we can extractinformation about values of zeta function in nonpositive integers.

ζ(−m) = (−1)m Bm+1

m+ 1, m ∈ N ∪ 0

By the way, B2k+1 = 0 for k ∈ N, so zeta function is zero in even negative integers,ζ(0) = −1/2, and

ζ(−2m+ 1) = −B2m

2m, m ∈ N

Page 125: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

11.5. CONVERSE MAPPING 125

Example 11.12. There was another Mallin pair

f(x) =1

x+ 1; f∗(s) =

π

sinπs

with fundamental strip 〈0, 1〉. Asymptotic expansion at 0

f(x) =1

1 + x=

+∞X

n=0

(−1)nxn, x→ 0

implies posibility of continuation of transformed function to the left of fundamentalstrip, namly to 〈−∞, 1〉. Singular expansion there is

f∗(s) +∞X

n=0

(−1)n

s+ n, s ∈ 〈−∞, 1〉

If we consider asymptotic expansion at +∞

f(x) =1/x

1 + 1/x=

+∞X

n=0

(−1)n−1x−n, x→ +∞

we learn about meromorphic continuation of transformed function to the right offundumantal strip. Singular expansion to the right of fundamental strip is

f∗(s) −+∞X

n=1

(−1)n−1

s− n , s ∈ 〈0,∞〉

Minus sign, as it has been already said, comes from the fact, that we consider theasymptotic expansion at infinity.Now two singular expansions we can conbine into one and it gives us singular expansionof transformed function in entire C

f∗(s) ≡ π

sinπxX

n∈Z

(−1)n

s+ n, s ∈ C

As we see, expansion is right, but we could have obtained it much easier.

11.5 Converse mapping

Direct mapping theorem was a method of obtaining information about transformedfunction given the initial function. But usually it is not very interesting. We areinterested in information about initial function and not about its transform. So now weare ready to introduce converse mapping theorem. It will be given exact formulation,but further we will apply the theorem without checking if exemined function satisfiesall conditions.

Theorem 11.3 (Converse mapping). Let f(x) be continious on (0,+∞) function,that has a transform f∗(s) with nonempty fundammental strip 〈α, β〉. Let f ∗(s) bemeromorphically continuable to 〈−γ, β〉 with a fininte number of poles there, and beanalytic on <(s) = −γ. Let f∗(s) = O(|s|−r) with r > 1 when |s| → +∞ in 〈−γ, β〉.If

f∗(x) X

(ξ,k)∈A

dξ,k1

(s− ξ)k, s ∈ 〈−γ, β〉

Then an asymptotic expansion of f(x) at 0 is

f(x) =X

(ξ,k)∈A

dξ,k

„(−1)k−1

(k − 1)!x−ξ(log x)k−1

«

+ O(xγ)

Page 126: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

126 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

Converse mapping theorem is a theorem, that derives asymptotics of initial functionfrom singularities of transformed. Next table includes some explanation of the theo-rem.

f∗(f) f(x)

Pole at ξ Term in asymptotic expansion ≈ x−ξ

left of f.s. expansion at 0

right of f.s. expansion at +∞

Simple pole

left: 1s−ξ

x−ξ at 0

right: 1s−ξ

−x−ξ at +∞

Multiple pole logarifmic factor

left: 1(s−ξ)k+1

(−1)k

k!x−ξ(log x)k at 0

right: 1(s−ξ)k+1 − (−1)k

k!x−ξ(log x)k at +∞

Example 11.13. Consider a function φ(s) = Γ(s)Γ(ν−s)Γ(ν)

. It is analytic in strip 〈0, ν〉.We know the singular expansion of gamma function an we can use this knwoledge toobtain the singular expansion of φ(s) in the strip 〈−∞, ν〉:

φ(s) +∞X

j=0

(−1)j

j!

Γ(ν + j − 1)

Γ(ν)

1

s+ j, s ∈ 〈−∞, ν〉

This φ(s) is a Mellin transform of some function f(x), which can be obtained byapplying of theorem 11.1

f(x) =1

2πi

Z ν/2+i∞

ν/2−i∞φ(s)x−sds

Singularities of φ(s) in strip 〈−∞, ν〉 encode asymptotics for f(x) at 0

f(x) =+∞X

j=0

(−1)j

j!

Γ(ν + j − 1)

Γ(ν)xj , x→ 0

One could have remembed binomial theorem and noticed, that function f (x) = (1 +x)−ν has the same expansion at 0. This means, that difference $(x) = f(x) − f (x)decase to zero faster, than any power of x, i.e. $(x) = O(xM ), ∀M > 0. In fact,$(x) ≡ 0, and we have indirectly obtained a new Mellin pair

f(x) = (1 + x)−ν , f∗(s) =Γ(s)Γ(ν − s)

Γ(ν)

Example 11.14. Consider a function φ(s) = Γ(1 − s) πsin πs

. It is analitic in strip〈0, 1〉. We know singular expansion of every factor in this function and thus we canwrite singular expansion

φ(s) +∞X

0

(−1)nn!1

s+ n, s ∈ 〈−∞, 1〉

Page 127: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

11.6. HARMONIC SUMS 127

Applying of converse mapping theorem yields an asymptics for initial function f(x)

f(x) ∼

+∞X

n=0

(−1)nn!xn, x→ 0

Symbol ‘∼’ is written instead of ‘=’ to make an emphasis, that expansion is onlyasymptotic, i.e. series from the right side doesn’t converge for any x > 0, but we canwrite, that f(x) =

PMn=0(−1)nn!xn + O(xM ) as x → 0 for any positive M . In fact,

f(x) =R +∞0

e−t

1+xtdt.

11.6 Harmonic sums

Definition 11.5. A sum of the type G(x) =P

k λkf(µkx) is called harmonic sum

with base function g(x), frequencies µk and amplitudes λk. Series Λ(s) =P

kλkµs

iscalled the Dirichlet series.

Now we are going to discuss when the fifth basic property of Mellin transform holdsfor infinite harmonic sums. Next proposition is not formulated fully, base function andDirichlet series are to satisfy some certain conditions about speed of growth, but allthis is skipped here for simplicity.

Proposition 11.2. The Mellin transform of the harmonic sum G(x) =P

k λkf(µkx)is defined in the intersection of the the fundamental strip of g(x) and the domainof absolute convergence of the Dirichlet series Λ(s) =

P

kλkµs

which is of the form<(s) > σa (or <(s) < σa) for some real σa. In addition, G∗(s) = Λ(s)g∗(s).

In next (and last) two examples we’ll derive two well known asymptotics. One is anasymptotic of harmonic numbers, and the second is Stirling’s formula. But here we’llobtain complete asymptotics, which is not known so well.

Example 11.15 (Harmonic numbers). Consider the function

h(x) =+∞X

k=1

»1

k− 1

k + x

=+∞X

k=1

1

k

x/k

1 + x/k

It is usuall harmonic sum with frequencies µk = 1/k, amplitudes λk = 1/k and basefunction g(x) = x

1+x. Function h(x) is connected with harmonic numbers in very

simple way: h(n) = 1 + 12

+ · · · + 1n

= Hn for any n ∈ N. Now we are going toevaluate Mellin transform of G(x). To do this, we are to evaluate the Dirichlet seriesΛ(s) and the tranform of base functon g(x). Λ(s) =

P

kλkµs

=P+∞

k=1 k−1+s = ζ(1− s).

Transform of base function is g∗(s) = − πsin πs

with fundamental strip 〈−1, 0〉, which isthe result of applying of the first base property of Mellin transform. Notice, that theDirichlet series Λ(s) = ζ(1− s) converges absolutely in the strip 〈−1, 0〉Now we are ready to write the transform of h(x), which is

h∗(s) = Λ(s)g∗(s) = −ζ(1− s) π

sinπs, s ∈ 〈−1, 0〉

Before trying to write a singular expansion, we are to study zeta function some more.We know that ζ(s) ∼

1s−1

, it’s a begining of Laurent expansion of zeta functionat 1, but it’s not enough for us. We want to know a coeficient of the zero de-gree in the Laurant expansion. Let it be γ, so we can write ζ(s) = 1

s−1+ γ +

· · · as s → 1. This γ is so-called Euler constant, which is approximatly equal to0.577215664901532860606512090082402431042159. Keeping all of this in mind, wecan write singular expansion of h∗(s):

h∗(s) »

1

s2− γ

s

−+∞X

k=1

(−1)k ζ(1 − k)s− k , s ∈ 〈−1,+∞〉

Page 128: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

128 CHAPTER 11. MELLIN TRANSFORMS AND ASYMPTOTICS

Fundamental strip of the transform and its poles to the right of the fundamental stripare presented at the picture:

Double pole at zero is marked with a big circle.Now we can finally apply converse mapping theorem and we obtain wishful asympotics:

Hn ∼ log n+ γ +X

k≥1

(−1)kBk

k

1

nk= log n+ γ +

1

2n− 1

12n2+

1

120n4− · · ·

Example 11.16 (Stirling’s formula). In this example we are to use the productdecomposition of gamma function, that looks like

log Γ(x+ 1) + γx =

+∞X

n=1

hx

n− log

1 +x

n

”i

Let’s denote the right side as `(x). This is a harmonic sum with amplitudes λn = 1,frequencies µn = 1/n and a base function g(x) = x− log(1 + x). The Dirichlet seriesis Λ(s) =

P+∞n=1 λnµ

−sn =

P+∞n=1 n

s = ζ(−s). Transform of g(x) can be evaluatedby means of the first and the ninth basic properties of Mellin transform. The resultis g∗(s) = − π

s sin πswith fundamental strip 〈−2,−1〉. As in the previous example,

the Dirichlet series Λ(s) = ζ(−s) converges absolutely in the fundamental strip oftransform of base function. So, fundamental strip of transform of `(s) is 〈−2,−1〉 too.

`∗(s) = −ζ(−s) π

s sinπs, s ∈ 〈−2,−1〉

We want to obtain an asymptotics in infinity, so we look on meromorphic continuationto the right of the fundamental strip. Laurant expansion of zeta function at zero isζ(s) = 1

2− 1

2log(2π)+O(s), so singular expansion of `∗(s) to the right of fundamental

strip is

`∗(s) »

1

(s+ 1)2+

1− γ(s+ 1)

+

»1

2s2− log

√2π

s

+

+∞X

n=1

(−1)n−1ζ(−n)

n(s+ n), s ∈ 〈−2,+∞〉

Now we can apply converse mapping theorem and derive an asymptotics of function`(s), which is

`(x) =

»

x log x− (1 − γ)x–

+

»1

2log x+ log

√2π

++∞X

n=1

B2n

2n(2n − 1)

1

x2n−1, x→ +∞

Here every item from singular expansion was converted to an item in asymptoticexpansion without any simplification, but now we do some, keeping in mind, thatΓ(x+ 1) = x!, so

log(x!) = log“

xxe−x√

2πx”

++∞X

n=1

B2n

2n(2n − 1)

1

x2n−1, x→ +∞

This dazzling formula is named after Stirling and this is a good reason to finish thepaper right here.

Page 129: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

Chapter 12

Automaton Searching onTries.Mikhail Lakunin

Above there were considered a lot of algorithms that allow us to searchfor a pattern in a string and a few powerful methods to give asymptoticestimations of such algorithms. There is an extension of such algorithms,not for one pattern but for a regular expression. The algorithms that areto be reviewed there run on the preprocessed text (we use Patricia tree)and are of logarithmic expected time of the size of the text for the strictedclass of regular expressions and in a sublinear expected time for all regularexpressions.

It’s the first such algorithm to be found with this complexity.The main purpose of the text is to give understanding of what’s going

on, and therefore some of the technicals could be omitted. For the preciseexplanation the reader is referred to the original paper.

12.1 Structure of the paper

1. What we actually want to do or our task

2. The algorithms in a “few words”

3. Introducing indexing structures such as suffix tries and Patricia trees and anotation used in the Regular Languages theory(the Theory of formal languages)

4. Consider the simple algorithm for the stricted class of a regular expression thatruns in logarithmic time.

5. Consider more complicated general algorithm.

6. Give estimation of the general algorithm considering sketch of technicals.

7. How to apply “general estimation theorem” to the particular cases.

8. Consider some heuristic for improving size of the query.

9. Open problems and conclusion (generalization of what we has spoken about)

12.2 Our task or what we want to do

We want to find efficient algorithm of finding all occurrences of a query given in theform of a regular language searching on the static preprocessed text. By finding all

129

Page 130: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

130 CHAPTER 12. AUTOMATON SEARCHING ON TRIES.

the occurrences we mean that we are going to find all the positions of the text fromwhere one substring from our query set could start. In general queries are not of thebig length, so there and below we state that the size of the query is bounded by theconstant. We use there automaton that searches through the indexing tree and we statethat our algorithm is pretty efficient, i.e. works in sublinear expected time (in averagecase). Moreover there exist small stricted subset of such queries that the expectedtime will be logarithmic. Besides we consider a powerful tool for estimating expectedtime of given automaton, the theorem that allows it on basis of some knowledge aboutincidence matrix of the automaton. And after all of that we suggest an heuristic forenhancing or for clarifying the query that helps to work all the algorithm faster.

12.3 Algorithm in a “few words”

As in previous parts we’ve got a large input text that is static or in other words wecan construct indexing structure on it and we don’t bother about how to reconstructit if the text will change. So we construct a trie (from word retrieval) that is indexingbinary tree (we suppose our alphabet is binary; if it’s not we just use some encodingto obtain binary text and hence binary alphabet) was discussed by Olga Sergeeva. Itbecomes obvious to us that we can find arbitrary substring in the text in height ofindexing tree length. In the case of perfectly balanced tree it’s log2 n, where n is thelength of the string.Now we’re able to find a single substring in a large text in logarithmic of the size of thetext expected time (in average case, considering uniform distribution) Let’s complicatethe task. If set of all the substrings we are to find in the text could be representedas a prefix from minimal preword set concatenated with any continuation. For sucha set we can state that it’s possible to search for that query in a time o(

‚‚query

‚‚)

independently of the size of the answer, i.e. in logarithmic expected time of size of thetext.Then we consider general regular expression query, construct Deterministic Automatonfor it and traverse it on the index trie. It could be showed that Automaton don’t visitall the vertices of the index trie (the quantity of which is O(n)),but only a part ofthem in average case, that’s the main idea of sublinear expected time.

12.4 Basic index structures and notation

To approach crystal understanding we need to introduce or just to help to recall somebasics facts we are going to manage with soon after.As it has been said we consider only binary alphabet, any other alphabet could beeasily converted to the binary one with some encoding

12.4.1 Indexing structures

Definition 12.1. Trie is a binary index tree for a text(tree each edge of that ismarked with a character), that satisfy following conditions:

1. each path of it (from the root to a leaf) consist of the prefix of one and only oneof the suffixes of the text

2. paths are as short as possible satisfying first condition

3. each leaf of it is marked with the position of the first character of the corre-sponded suffix

The main problem of the pure trie is that the upper bound of the number of innernodes is O(n2) that approaches in unbalanced tries, that’s why in practice enhancedstructure is of often use. It’s so-called Patricia tree.

Page 131: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

12.4. BASIC INDEX STRUCTURES AND NOTATION 131

Figure 12.1: Binary trie (external node label indicates position in the text) forthe 01100100010111

Page 132: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

132 CHAPTER 12. AUTOMATON SEARCHING ON TRIES.

Definition 12.2. Patricia tree constructed on the basis of the trie, but with oneenhancement. We exclude such nodes that have only one descendant and one ancestor,in other words we exclude nodes that are not leafs and that are not nodes of the“bifurcation”. To not miss the correspondence with the initial text we remember thequantity of steps we need to omit to get to the next “bifurcation” node.

Note. Asymptotically in average case:Height of the trie is: 2log2(n) +O(log2(n)) [Reg81]Height of the Patricia tree: log2(n) +O(log2(n)) [Reg81]

12.4.2 Basic definitions of the Theory of Formal Languages

General definition and notation.

1. Σ is a set of all characters (Alphabet), in some cases one character from the set

2. ε is an empty string

3. xy is concatenation of strings x and y

4. w = xyz if w is concatenation of strings x, y, z then:

(a) x is a prefix of w

(b) z is a suffix of w

(c) y is a substring of w

Let’s define operations we are to use specifying regular expression (RE).

1. r,q are sets of strings

2. r + q is a unite of these sets (r + q = r ∪ q)3. r? is one or zero occurrences of r ( r? = ε+ r in our notation)

4. rk is k occurrences of r

5. r≤k is from zero to k occurrences of r ( r≤k =Pk

i=0 ri)

6. r∗ is Kleene closure,i.e. any number of occurrences of r (r∗ =P+∞

i=0 ri)

7. r+ is one or more occurrences of r (r+ =P+∞

i=1 ri)

12.5 Algorithm for a restricted class of regularexpression

There we consider algorithm of searching for the stricted class of queries. It was brieflyreviewed above, there we define it more exactly.Let’s introduce new expression class so-called prefixed regular expressions that is sub-class of regular expressions class that was defined above.

Definition 12.3. Prefixed regular expression (PRE) is:

1. ∅ is a PRE (the empty set)

2. ε is a PRE(ε the set of empty string)

3. for each a ∈ Σ, a is a PRE (a)4. if p, q ∈ PRE and r ∈ RE (such that ε ∈ r) and x ε Σ then:

(a) p+ q is a PRE (union)

(b) xp is a PRE (concatenation with a character on the left)

(c) pr is a PRE (concatenation with ε-regular expression on the right)

Page 133: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

12.5. A RESTRICTED CLASS OF REGULAR EXPRESSION 133

(d) p∗ is a PRE

Example 12.1. This is PRE

ab(bc∗ + d+ + f(a+ b))

Example 12.2. This is not PRE

aΣ∗b

Example 12.3. This is not PRE too

(a+ b)(c+ d)(e+ f)

Statement. There exist such unique and finite subset(we call it preword set) of the(given) PRE set that satisfy following conditions:

1. for every word in the PRE set there exist unique prefix from preword set

2. for every word from the preword set there’s no other prefix in the set

To search a PRE query, we construct a tree (so-called Complete Prefix Trie) in similarway as we construct trie, but we use not suffixes of the text as we do in case of triebut all the strings in preword set of the query. And then traverse through that treeand through our index trie simultaneously for an answer.

Definition 12.4. Complete Prefix Trie (CPT) is the trie of the set of the stringssuch that:

1. there are no truncated paths(as in Patricia tree), that is, every word correspondsto a complete path in the trie

2. if a word x is a prefix of another word w, then only x is stored in the trie.

The second rule has appeared since the search for x is sufficient to also find theoccurrences of w. We are going to find only starting positions, aren’t we?We’ve constructed the Complete Prefix Trie so we are now able to traverse throughthat.We traverse simultaneously, if possible, the complete prefix trie(CPT) and the trie(Patricia tree) of all suffixes of the text (the index). That is, follow at the same time a0 or 1 edge(we consider binary alphabet), if they exist in both the trie and the CPT.All the subtrees in the index associated with terminal nodes in the complete prefixtrie are the answer. As in prefix searching, because we may skip bits while traversingthe index, a final comparison with the actual text is needed to verify the result.Let’s formulate the algorithm.

Algorithm. Searching for a query:

1. Construct the Complete Prefix Trie from the query

2. Traverse simultaneously the complete prefix trie and Patricia tree to obtain ananswer.

Note. Since we may skip bits while traversing the index(if we used Patricia tree), afinal comparison with the actual text is needed to verify the result.

Because the number of nodes of the complete prefix trie of the preword set is O(|query|),the search time is also O(|query|).

Conclusion. It is possible to search a PRE-query in time O(|query|) independentlyof the size of the answer.

Page 134: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

134 CHAPTER 12. AUTOMATON SEARCHING ON TRIES.

Figure 12.2: Complete Prefix Trie for a query ab(bc∗ + d+ + f(a+ b))

Page 135: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

12.6. GENERAL ALGORITHM 135

12.6 General algorithm

The extension of the previously discussed idea leads to general algorithm of searchingquery given as regular expression.

12.6.1 What we want to do there

There we present the algorithm that can search for arbitrary regular expression intime sublinear in n on the average. For this, we simulate a DFA (Deterministic FiniteAutomaton) in a binary trie built from all the suffixes of a text.Since the situation is pretty similar with the previous algorithm. Let’s start frompresenting algorithm itself.

Algorithm. General Automaton Search

1. Convert the regular expression(query) into minimized DFA(independent of thesize of the text)

2. Eliminate outgoing transitions from final states

3. Convert character DFA into binary DFA (Each state will then have at most twooutgoing transitions, one labeled with 0 and one labeled with 1 ), i.e. applyingsome encoding

4. Simulate binary DFA on the binary trie of all suffixes we’ve constructed(Seepictures of DFA and binary index trie ).That is:

(a) root of the tree with initial state.

(b) for any internal node associated with state i, associate its left descendantwith state j, if i > j for a bit 0, and associate its right descendant withstate k, if i > k for a bit 1.

5. For every node of the index associated with a final state, accept the wholesubtree and halt.

6. On reaching an external node, run the remainder of the automaton on the singlestring determined by this external node.

12.7 Efficiency Estimation for the General Al-gorithm

The precise average case analysis of the above algorithm is not simple. Some explana-tions were given in “few words” in the start of the paper, There I’ll try to clarify thesituation a bit, for precise explanation reader is referred to the original paper.

12.7.1 The structure of the proof

The situation is as follows. We have Patricia indexing trie with O(n) nodes and theDFA(Deterministic finite automaton) and we want to calculate how many nodes ofthe Patricia indexing trie will be visited by automaton. We want to prove that thisquantity is less then O(n) (asymptotically).There I want to give a sketch of the proof on step by step basis.

1. We introduce N in, i.e. the quantity of nodes of our index Patricia tree that was

visited by Automaton when the quantity of suffixes in the tree is n and we’veengaged our automaton on the i-th node.

Page 136: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

136 CHAPTER 12. AUTOMATON SEARCHING ON TRIES.

Figure 12.3: DFA with corresponded index trie4

2. We note that our automaton from each node can go only in 2 directions in theindex tree and taking into account all the possible combinations of the wayssuffixes could go in i-th node we can say , therefore we can say that:

N in = 1 +

1

2n

nX

l=0

n

l

!“

N jl +Nk

n−l

”!

where j, k are states DFA goes on 0, 1 correspondingly (see picture with DFAand trie)

3. Now it looks like equation could be solvable with generating function. We applygenerate function,

4. and convert it to matrix equation writing it for all the starting nodes in a unit.

5. Then we “solve” equation and find the generating function. The generatingfunction equals to the matrix series, where the matrix is incidence matrix ofDFA.

6. Then we convert it to a Jordan form and decompose matrix equation looking atthe single values in a matrix.

7. Then we apply Mellin transform method to get some knowledge about this singlevalues series and about matrix series

8. Using that we got some estimation of eigenvalues and information about N , i.e.about number of nodes was visited and about number of comparison DFA made(in average case)

9. Finally we obtain following theorem(taking in account number of steps we needfor verifying in case of Patricia tree):

Theorem 12.1. The expected number of comparisons performed by a minimalDFA for a query q represented by its incidence matrix H while searching the trieof n random strings is sublinear, and given by:

O`(log2n)t nr

´

,where r = log2λ,λ = maxi(|λi|),t = maxi(mi − 1,such that |λ| = λi), and theλi are the eigenvalues of H with multiplicities mi.

Page 137: 2nd Joint Advanced Student School 2004 Course 1 Complexity ... · 2nd Joint Advanced Student School 2004 Course 1 Complexity Analysis of String Algorithms St. Petersburg, March 28th

12.8. APLLYING THE GENERAL ESTIMATION THEOREM 137

12.8 How to apply “general estimation theorem”to the particular cases

To analyze average time of searching for particular query or particular type of thequery ,we need to consider DFA of the query and its incidence matrix and eigenvaluesof the matrix. And then we can apply “the main theorem” to obtain estimation.

Example 12.4. For example, DFAs having only cycles of length 1, have a largesteigenvalue equal to 1, but with multiplicity proportional to the number of cycles,obtaining a complexity of O(polylog(n)).

Example 12.5. For the regular expression (0(011)0)∗1, the eigenvalues are:

21/3,−1

2(21/3 − 31/221/3i),−1

2(21/3 − 31/221/3i), 0

, and the first 3 have the same modulus. The solution in this case is N 1n = O(n1/3)

12.9 Heuristic for optimizing the query

12.9.1 What we mean by optimizing

In this section, we present a general heuristic, which we call substring analysis,to plan what algorithms and order of execution should be used for a generic patternmatching problem, which we apply to regular expressions.The aim of this section is to find from every query a set of necessary conditions thathave to be satisfied.

12.9.2 Substring graph

Definition 12.5. The substring graph of a regular expression is an acyclic directed graph in which each node is labeled by a string. It is defined recursively by the following rules (a small code sketch of the construction follows the rules):

1. G(ε) is a single node labeled ε.

2. G(x) for any x ∈ Σ is a single node labeled with x.

3. G(s+t) is the graph built from G(s) and G(t) with an ε-labeled node with edges to the source nodes and an ε-labeled node with edges from the sink nodes.

4. G(st) is the graph built by joining the sink node of G(s) with the source node of G(t), and relabeling the joined node with the concatenation of the sink label and the source label.

5. G(r+) consists of two copies of G(r) with an edge from the sink node of one to the source node of the other.

6. G(r∗) is two ε-labeled nodes connected by an edge.
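The rules above translate almost verbatim into code. The following Python sketch (names like SGraph, union and concat are mine; it assumes, as the rules imply, that every fragment has a single source and a single sink) builds the substring graph bottom-up:

from dataclasses import dataclass, field
from itertools import count

_ids = count()

@dataclass
class SGraph:
    """Substring graph fragment with a single source and a single sink."""
    labels: dict = field(default_factory=dict)   # node id -> label ('' stands for epsilon)
    edges: set = field(default_factory=set)      # set of (u, v) pairs
    source: int = -1
    sink: int = -1

def node(label):                      # rules 1 and 2: G(epsilon) and G(x)
    v = next(_ids)
    return SGraph({v: label}, set(), v, v)

def _merged(a, b):
    return {**a.labels, **b.labels}, a.edges | b.edges

def union(a, b):                      # rule 3: G(s + t)
    labels, edges = _merged(a, b)
    src, snk = next(_ids), next(_ids)
    labels[src] = labels[snk] = ''
    edges |= {(src, a.source), (src, b.source), (a.sink, snk), (b.sink, snk)}
    return SGraph(labels, edges, src, snk)

def concat(a, b):                     # rule 4: G(st), joining a's sink with b's source
    labels, edges = _merged(a, b)
    joined = a.sink
    labels[joined] = a.labels[a.sink] + b.labels[b.source]
    del labels[b.source]
    edges = {(joined if u == b.source else u, joined if v == b.source else v)
             for (u, v) in edges}
    sink = joined if b.sink == b.source else b.sink
    return SGraph(labels, edges, a.source, sink)

def plus(build_r):                    # rule 5: G(r+); build_r() must return a fresh copy of G(r)
    a, b = build_r(), build_r()
    labels, edges = _merged(a, b)
    edges.add((a.sink, b.source))
    return SGraph(labels, edges, a.source, b.sink)

def star():                           # rule 6: G(r*) -- two epsilon nodes and one edge
    u, v = next(_ids), next(_ids)
    return SGraph({u: '', v: ''}, {(u, v)}, u, v)

# Example: the substring graph of the hypothetical expression (ab + c)d.
g = concat(union(concat(node('a'), node('b')), node('c')), node('d'))
print(sorted(g.labels.values()))      # ['', 'ab', 'c', 'd']

For this hypothetical expression the node labels are ε, ab, c and d; these are exactly the substrings whose presence in the text is checked in Section 12.9.3.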

12.9.3 How to Use

reducing the size of the query

After building G(q), we search for all node labels of G(q) in our index of suffixes, determining whether or not each such string exists in the text (O(|q|) time). For all nonexistent labels, we remove:

1. the corresponding node,


2. adjacent edges,

3. and any adjacent nodes (recursively) from which all incoming edges or all outgoing edges have been deleted.

This reduces the size of the query.
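A possible implementation of this pruning step, on a plain labels-dict / edge-set representation (compatible with the sketch in Section 12.9.2), is given below. Keeping the global source and sink, and treating the ε-label as always present, are assumptions of this sketch; exists(label) stands for a lookup against the suffix index.

def prune(labels, edges, source, sink, exists):
    """Drop nodes whose label does not occur in the text, their incident edges,
    and then, recursively, interior nodes left without incoming or without
    outgoing edges (rule 3 above; such nodes had both kinds of edges in the
    original graph).  The global source and sink are kept, and the empty
    (epsilon) label counts as present -- both are assumptions of this sketch."""
    alive = {v for v in labels
             if v in (source, sink) or labels[v] == '' or exists(labels[v])}
    changed = True
    while changed:
        changed = False
        live_edges = {(u, v) for (u, v) in edges if u in alive and v in alive}
        for v in list(alive):
            if v in (source, sink):
                continue
            has_in = any(y == v for (_, y) in live_edges)
            has_out = any(x == v for (x, _) in live_edges)
            if not (has_in and has_out):
                alive.discard(v)
                changed = True
    live_edges = {(u, v) for (u, v) in edges if u in alive and v in alive}
    return {v: labels[v] for v in alive}, live_edges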

estimating final answer size

From the number of occurrences for each label we can obtain an upper bound on the size of the final answer to the query:

1. for adjacent nodes (serial, or “and” nodes) we multiply both numbers

2. for parallel nodes (“or” nodes) we add the number of occurrences

Note. ε-nodes are treated in a special way.

Managing ε-nodes:

1. consecutive serial ε-nodes are replaced by a single ε-node.

2. chains that are parallel to a single ε-node are deleted,

3. the number of occurrences in the remaining ε-nodes is defined as 1

After these simplifications, ε-nodes are always adjacent to non-ε-nodes, since ε was assumed not to be a member of the query.
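To combine the occurrence counts into a single upper bound, one consistent reading of the two rules above on a general DAG is to sum, over all source-to-sink paths, the product of the counts along each path (serial nodes multiply, parallel branches add). The following Python sketch does this by dynamic programming in topological order; this formulation is my own interpretation for illustration, not a formula quoted from the paper, and ε-nodes are given count 1 as stated above.

from collections import defaultdict

def answer_upper_bound(labels, edges, source, sink, occurrences):
    """Upper-bound the answer size: epsilon-nodes count as 1, serial ("and")
    nodes multiply, parallel ("or") branches add.  Both rules are realised at
    once by summing the products of per-node counts over all source-to-sink
    paths, via DP in topological order (an interpretation, used as a sketch)."""
    count = {v: 1 if labels[v] == '' else occurrences.get(labels[v], 0)
             for v in labels}
    succ, preds = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in labels}
    for u, v in edges:
        succ[u].append(v)
        preds[v].append(u)
        indeg[v] += 1
    # topological order (Kahn's algorithm)
    order, queue = [], [v for v in labels if indeg[v] == 0]
    while queue:
        v = queue.pop()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    weight = {v: 0 for v in labels}
    weight[source] = count[source]
    for v in order:
        if v != source and preds[v]:
            weight[v] = count[v] * sum(weight[p] for p in preds[v])
    return weight[sink]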

12.10 Open problems and conclusion

We have shown that, using a trie or Patricia tree, we can answer many types of string searching queries in logarithmic average time, independently of the size of the answer. We have also shown that automaton searching in a trie is sublinear in the size of the text on average for any regular expression, this being the first algorithm found to achieve this complexity. Similar ideas have since been used for approximate string searching by simulating dynamic programming over a digital tree [Gonnet et al. 1992; Ukkonen 1993], also achieving sublinear time on average. In particular, Gonnet et al. [1992] have used this algorithm for protein matching.

In general, however, the worst case of automaton searching is linear. For some regular expressions and a given algorithm it is possible to construct a text such that the algorithm is forced to inspect a linear number of characters. These pathological cases consist of periodic patterns or unusual pieces of text that, in practice, are rarely found.

Finding an algorithm with logarithmic search time for any RE query is still an open problem [Galil 1985]. Another open problem is to derive a lower bound for searching REs in preprocessed text.


Bibliography

[BFC00] Michael A. Bender and Martin Farach-Colton, The LCA problem revisited, Latin American Theoretical Informatics, May 2000, pp. 88–94.

[BYG96] Ricardo A. Baeza-Yates and Gaston H. Gonnet, Fast text searching for regular expressions or automaton searching on tries, Journal of the ACM 43 (1996), no. 6, 915–936.

[CL94] William I. Chang and Eugene L. Lawler, Sublinear approximate string matching and biological applications, Algorithmica 12 (1994), 327–344.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to algorithms, MIT Press, 1990.

[FGD95] Philippe Flajolet, Xavier Gourdon, and Philippe Dumas, Mellin transforms and asymptotics: Harmonic sums, Theoretical Computer Science 144 (1995), no. 1–2, 3–58.

[FS86a] Philippe Flajolet and Robert Sedgewick, Digital search trees revisited, SIAM J. Comput. 15 (1986), no. 3, 748–767.

[FS86b] ———, Digital search trees revisited, SIAM J. Comput. 15 (1986), no. 3, 748–767.

[FS95] ———, Mellin transforms and asymptotics: Finite differences and Rice's integral, Theoretical Computer Science 144 (1995), no. 1–2, 101–124.

[FS98] A. Frieze and W. Szpankowski, Greedy algorithms for the shortest common superstring that are asymptotically optimal, Algorithmica 21 (1998), 21–36.

[GO81] L. J. Guibas and A. M. Odlyzko, String overlaps, pattern matching, and nontransitive games, J. of Combinatorial Theory, Series A 30 (1981), 183–208.

[Gus97] Dan Gusfield, Algorithms on strings, trees, and sequences: Computer science and computational biology, Cambridge University Press, 1997.

[GV00] Roberto Grossi and Jeffrey Scott Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, 32nd Symp. on Theory of Comput., 2000, pp. 397–406.

[Jac89] Guy Jacobson, Space-efficient static trees and graphs, IEEE Symposium on Foundations of Computer Science (1989), 549–554.

[KS03] Juha Kärkkäinen and Peter Sanders, Simple linear work suffix array construction, Proc. 30th Int. Colloq. on Automata, Languages and Programming (ICALP), LNCS, vol. 2719, Springer, 2003, pp. 943–955.

[MM93] Udi Manber and Gene Myers, Suffix arrays: A new method for on-line string searches, SIAM J. Comput. 22 (1993), no. 5, 935–948.

[NBY00] G. Navarro and R. Baeza-Yates, A hybrid indexing method for approximate string matching, Journal of Discrete Algorithms (JDA) 1 (2000), no. 1, 205–209, Special issue on Matching Patterns.


[Reg81] M. Regnier, On the average height of trees in digital search and dynamic hashing, Inf. Proc. Lett. 13 (1981), 64–67.

[RS98a] M. Regnier and W. Szpankowski, Complexity of sequential pattern matching algorithms, Proc. RANDOM, LNCS, vol. 1518, Springer, 1998, pp. 187–199.

[RS98b] Mireille Regnier and Wojciech Szpankowski, On pattern frequency occurrences in a Markovian sequence, Algorithmica 22 (1998), 631–649.

[Sad00] Kunihiko Sadakane, Compressed text databases with efficient query algorithms based on the compressed suffix array, Proc. ISAAC, LNCS, vol. 1969, Springer, 2000, pp. 410–421.

[Szp93] Wojciech Szpankowski, Asymptotic properties of data compression and suffix trees, IEEE Transactions on Information Theory 39 (1993), no. 5, 1647–1659.

[Szp00] ———, Average case analysis of algorithms on sequences, 1st ed., Wiley-Interscience, 2000.

[Ukk95] E. Ukkonen, On-line construction of suffix trees, Algorithmica 14 (1995), 249–260.