26
Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis E1-215b [email protected]

Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

Advanced Algorithm Design and Analysis (Lecture 5)

SW5 fall 2005Simonas Šaltenis [email protected]

Page 2: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 2

Text-Search Data Structures

Goals of the lecture: Finish discussing text-searching algorithms:

• Boyer-Moore-Horspool

Dictionary ADT for strings: • to understand the principles of tries, compact tries,

Patricia tries

Text-searching data structures:• to understand and be able to analyze text searching

algorithm using the suffix tree and Pat tree

Page 3: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 3

Reverse naïve algorithm

Why not search from the end of P? Boyer and Moore

Reverse-Naive-Search(T,P) 01 for s ← 0 to n – m02 j ← m – 1 // start from the end 03 // check if T[s..s+m–1] = P[0..m–1]04 while T[s+j] = P[j] do05 j ← j - 106 if j < 0 return s07 return –1

Running time is exactly the same as of the naïve algorithm…

Page 4: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 4

Occurrence heuristic

Boyer and Moore added two heuristics to reverse naïve, to get an O(n+m) algorithm, but it is complex

Horspool suggested just to use the modified occurrence heuristic: After a mismatch, align T[s + m–1] with the

rightmost occurrence of that letter in the pattern P[0..m–2]

Examples: • T= “detective date” and P= “date”

• T= “tea kettle” and P= “kettle”

Page 5: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 5

Shift table

In preprocessing, compute the shift table of the size |Σ|.

Example: P = “kettle” shift[e] =4, shift[l] =1, shift[t] =2, shift[k] =5 shift[any other letter] = 6

Example: P = “pappar” What is the shift table?

shift [w]={m−1−max {im−1∣P [i ]=w} if w is in P [0. .m−2] ,m otherwise.

Page 6: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 6

Boyer-Moore-Horspool Alg.

BMH-Search(T,P) 01 // compute the shift table for P01 for c ← 0 to |Σ|- 102 shift[c] = m // default values03 for k ← 0 to m - 204 shift[P[k]] = m – 1 - k05 // search06 s ← 0 07 while s ≤ n – m do 08 j ← m – 1 // start from the end 09 // check if T[s..s+m–1] = P[0..m–1]10 while T[s+j] = P[j] do11 j ← j - 112 if j < 0 return s13 s ← s + shift[T[s + m–1]] // shift by last letter14 return –1

Page 7: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 7

BMH Analysis

Worst-case running time Preprocessing: O(|Σ|+m) Searching: O(nm)

• What input gives this bound?

Total: O(nm)

Space: O(|Σ|) Independent of m

On real-world data sets very fast

Page 8: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 8

Comparison

Let’s compare the algorithms. Criteria: Worst-case running time

• Preprocessing• Searching

Expected running time Space used Implementation complexity

Page 9: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 9

Dictionary ADT for Strings

Dictionary ADT for strings – stores a set of text strings: search(x) – checks if string x is in the set insert(x) – inserts a new string x into the set delete(x) – deletes the string equal to x from

the set of strings Assumptions, notation:

n strings, N characters in total m – length of x Size of the alphabet d = |Σ|

Page 10: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 10

BST of Strings

We can, of course, use binary search trees. Some issues: Keys are of varying length A lot of strings share similar prefixes

(beginnings) – potential for saving space Let’s count comparisons of characters.

• What is the worst-case running time of searching for a string of length m?

Page 11: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 11

Tries

Trie – a data structure for storing a set of strings (name from the word “retrieval”): Let’s assume, all strings end with “$” (not in Σ)

b s

ea

r$

i

d

$

u

l

k

$ $

l

un

da

y

$

$

Set of strings: {bear, bid, bulk, bull, sun, sunday}

Page 12: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 12

Tries II

Properties of a trie: A multi-way tree. Each node has from 1 to d children. Each edge of the tree is labeled with a

character. Each leaf node corresponds to the stored

string, which is a concatenation of characters on a path from the root to this node.

Page 13: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 13

Search and Insertion in Tries

The search algorithm just follows the path down the tree (starting with Trie-Search(root, P[0..m]))

Trie-Search(t, P[k..m]) //searches for P in t01 if t is leaf then return true 02 else if t.child(P[k])=nil then return false03 else return Trie-Search(t.child(P[k]), P[k+1..m])

How would the delete work?

Trie-Insert(t, P[k..m]) //inserts string P into t01 if t is not leaf then //otherwise P is already present02 if t.child(P[k])=nil then 03 Create a new child of t and a “branch” starting

with that chlid and storing P[k..m] 04 else Trie-Insert(t.child(P[k]), P[k+1..m])

Page 14: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 14

Trie Node Structure

“Implementation detail” What is the node structure? = What is the

complexity of the t.child(c) operation?:• An array of child pointers of size d: waist of space,

but child(c) is O(1)• A hash table of child pointers: less waist of space,

child(c) is expected O(1)• A list of child pointers: compact, but child(c) is O(d)

in the worst-case• A binary search tree of child pointers: compact and

child(c) is O(lg d) in the worst-case

Page 15: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 15

Analysis of the Trie

Size: O(N) in the worst-case

Search, insertion, and deletion (string of length m): depending on the node structure:

O(dm), O(m lg d), O(m) Compare with the string BST

Observation: Having chains of one-child nodes is wasteful

Page 16: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 16

Compact Tries

Compact Trie: Replace a chain of one-child nodes with an edge

labeled with a string Each non-leaf node (except root) has at least

two children

b s

ea

r$

i

d

$

u

l

k

$ $

l

un

da

y

$

$

b sunear$

id$ ul

k$l$

day$$

Page 17: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 17

Compact Tries II

Implementation: Strings are external to the structure in one

array, edges are labeled with indices in the array (from, to)

Can be used to do word matching: find where the given word appears in the text. Use the compact trie to “store” all words in the

text Each child in the compact trie has a list of

indices in the text where the corresponding word appears.

Page 18: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 18

Word Matching with Tries

To find a word P: At each node (after matching P[0..k–1]), follow edge

(i,j), such that P[k..k+i-j] = T[i..j] If there is no such edge, there is no P in T, otherwise,

find all starting indexes of P when a leaf is reached.

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

(31,34)

(14,16)

(1,2)(17,18)

(19,19)(22,24)

(3,3)(8,11)

they think that we were there and there

(28,30)(4,5)

31

12

6

25,35 1

17

20

T:

Page 19: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 19

Word Matching with Tries II

Building of a compact trie for a given text: How do you do that? Describe the compact trie

insertion procedure Running time: O(N)

Complexity of word matching: O(m) What if the text is in external memory?

In the worst-case we do O(m) I/O operations just to access single characters in the text – not efficient

Page 20: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 20

Patricia trie

Patricia trie: a compact trie where each edge’s label (from, to) is

replaced by (T[from], to – from + 1)

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

(a,4)

(a,3)

(t,2)(w,2)

(_,1)(r,3)

(e,1)(i,4)

they think that we were there and there

(r,3)(y,2)

31

12

6

25,35 1

17

20

T:

Page 21: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 21

Querying Patricia Trie

Word prefix query: find all words in T, which start with P[0..m-1]

Patricia-Search(t, P, k) // searches for P in t01 if t is leaf then02 j ← the first index in the t.list03 if T[j..j+m-1] = P[0..m-1] then 04 return t.list // exact match 05 else if there is a child-edge (P[k],s) then06 if k + s < m then 07 return Patricia-Search(t.child(P[k]), P, k+s)08 else go to any descendent leaf of t and do the

check of line 03, if it is true, return lists of all descendent leafs of t, otherwise return nil

09 else return nil // nothing is found

Page 22: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 22

Analysis of the Patricia Trie

Idea of patricia trie – postpone the actual comparison with the text to the end: If the text is in external memory only O(1) I/O are

performed (if the trie fits in main-memory)

Build a Patricia Trie for word matching:

Usually binary patricia tries are used: Consider binary encoding of text (and queries) Each node in the tree has two children (left for 0, right

for 1) Edges are labeled just with skip values (in bits)

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40føtex har haft en fødselsdagT:

Page 23: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 23

Text-Search Problem

Input: Text T = “carrara” Pattern P = “ar”

Output: All occurrences of P in T

Reformulate the problem: Find all suffixes of T that

has P as a prefix! We already saw how to do a

word prefix query.

carrara arrara rrara rara ara ra a

Page 24: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 24

Suffix Trees

Suffix tree – a compact trie (or similar structure) of all suffixes of the text Patricia trie of suffixes is sometimes called a Pat

tree

carrara$1 2 3 4 5 6 7 8

a r

r$

carrara$

rara$ a$

rara$a

$ ra$

2 5

71

6 4

3

(a,1) (r,1)

(r,1)($,1)

(c,8)

(r,5) (a,2)

2 5

71

6 4

3

(r,5)(a,1)

(r,3)($,1)

Page 25: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 25

Pat Trees: Analysis

Text search for P is then a prefix query. Running time: O(m+z), where z is the number

of answers Just O(1) I/Os if the text is in external-memory

(independent of z)! The size of the Pat tree: O(N)

Why? Advantage of compression: the size of the

simple trie of suffixes would be in the worst-case N + (N-1)+ (N-2) + … 1 = O(N2)

Page 26: Advanced Algorithm Design and Analysis (Lecture 1)people.cs.aau.dk/~simas/aalg05/slides/aalg5.pdf · Advanced Algorithm Design and Analysis (Lecture 5) SW5 fall 2005 Simonas Šaltenis

AALG, lecture 5, © Simonas Šaltenis, 2005 26

Constructing Suffix Trees

The naïve algorithm Insert all suffixes one after another: O(N2)

Clever algorithms: O(N) McCreight, Ukkonen Scan the text from left to right, use additional

suffix links in the tree Question: How does the the Pat tree looks

like after inserting the first five prefixes using the naïve algorithm?

Honolulu$1 2 3 4 5 6 7 8 9