Tree Pattern Matching to Subset Matching in Linear Time
R. Cole and R. Hariharan
Tree Pattern Matching
Input: An ordered binary tree T, |T| = n.
An ordered binary tree P, |P| = m.
Output: All nodes in T where P matches.
[Figure: an example text tree t and pattern tree p.]
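As a baseline for the problem statement above, here is a minimal sketch of the brute-force O(nm) approach. It assumes unlabeled trees, so "P matches at a node v of T" is read as "the shape of P fits into the subtree rooted at v"; the `Node` class and the function names are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical node of an ordered binary tree (shape only, no labels)."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def matches_at(t: Optional[Node], p: Optional[Node]) -> bool:
    """True if pattern p fits at text node t."""
    if p is None:
        return True    # nothing left to match
    if t is None:
        return False   # the pattern needs a node that the text lacks
    return matches_at(t.left, p.left) and matches_at(t.right, p.right)


def all_matches(t_root: Node, p_root: Node) -> List[Node]:
    """Collect every node of T where P matches, in preorder."""
    found, stack = [], [t_root]
    while stack:
        v = stack.pop()
        if matches_at(v, p_root):
            found.append(v)
        # push the right child first so that the left child is visited first
        for child in (v.right, v.left):
            if child is not None:
                stack.append(child)
    return found
```

Each of the n text nodes triggers a comparison that costs at most O(m), which gives the O(nm) bound.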
Subset Matching
Input: A set-string T and a set-string P.
Output: All occurrences of P in T.
T = {a,b,c} {a,c} {b,c} {a,c} {c,e,f} {b,f} {b}
P = {a,c} {c} {b}
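A minimal sketch of the brute-force solution to subset matching, using the usual definition that P occurs at position i of T when every set P[j] is a subset of T[i + j]. The function name and the sample set-strings are illustrative assumptions.

```python
from typing import List, Set


def subset_match(text: List[Set[str]], pattern: List[Set[str]]) -> List[int]:
    """Return every i such that pattern[j] <= text[i + j] for all j."""
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if all(pattern[j] <= text[i + j] for j in range(m))
    ]


# Illustrative set-strings; each set is written as a string of its elements.
T = [set("abc"), set("ac"), set("bc"), set("ac"), set("cef"), set("bf"), set("b")]
P = [set("ac"), set("c"), set("b")]
print(subset_match(T, P))  # prints [0, 3]
```

This runs in time proportional to the total size of the compared sets at every alignment; the faster algorithms in the history list improve on exactly this step.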
History
Hoffmann and O'Donnell, 1982, O(nm).
Kosaraju, 1989, O(nm^0.75 log m).
Dubiner, Galil, and Magen, 1994, O(nm^0.5 log m).
Cole and Hariharan, 1997, randomized O(n log^3 m).
Indyk, 1998, randomized O(n log n).
Cole, Hariharan, and Indyk, 1999, O(n log^3 m).
Cole and Hariharan, 2002, O(n log^2 m).
Period
Def: The period of a string s is the smallest positive integer j such that s[i] = s[i+j] for every i for which both positions exist.
Ex: S = 0 0 1 0 0 1 0 0 1 0 0 1, j = 3
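Informal terminology (1)
In the following slides, saying that a string has period θ means that it begins with θ and that its period is |θ|.
|θ| is sometimes abbreviated to θ.
S = 0 0 1 0 0 1 0 0 1 0 0 1, with θ marking the leading 0 0 1. Period θ? Yes
S = 0 1 0 0 1 0 0 1 0 0 1 0. Period θ? No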
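Classical Lemma (1)
s: a string with period θ.
Cut s into two parts. If the distance from the start of s to the cut is not an integer multiple of |θ|, then the second part does not begin with θ.
Ex: Let θ = 0 0 1, |θ| = 3.
s = 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1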
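The lemma can be checked on the example with θ = 0 0 1 by trying every cut position. This is only a sanity check on one string, not a proof; the helper name is hypothetical.

```python
from typing import List


def cuts_starting_with(s: str, theta: str) -> List[int]:
    """Cut positions c for which the second part s[c:] begins with theta."""
    return [c for c in range(len(s) - len(theta) + 1) if s.startswith(theta, c)]


theta = "001"
s = theta * 9  # a string with period theta, as in the example
cuts = cuts_starting_with(s, theta)
# Every cut whose second part begins with theta is a multiple of |theta|.
assert all(c % len(theta) == 0 for c in cuts)
```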
Period in linear time
Exercise 1: Design an algorithm to compute the period in linear time.
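One possible answer to Exercise 1, sketched with the prefix function (failure function) known from Knuth-Morris-Pratt: the period of s equals |s| minus the length of the longest proper border of s. The slides do not say which solution is intended, so treat this as one option among several.

```python
from typing import List


def prefix_function(s: str) -> List[int]:
    """pi[i] = length of the longest proper border of s[:i + 1]."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]   # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def period(s: str) -> int:
    """Smallest j >= 1 with s[i] == s[i + j] for every valid i (0 for '')."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]
```

The value k grows by at most one per character and only shrinks inside the while loop, so the total running time is linear in |s|.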
θ-Path
Def: A path p is a θ-path if its string representation has period θ.
Ex: Let θ = 0 0 1. The path p with string representation s = 0 0 1 0 0 1 is a θ-path.
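A small sketch of the θ-path test, assuming that the string representation of a downward path lists its edge labels with 0 for a left edge and 1 for a right edge (the slides show only the 0/1 strings). Following the informal terminology, it checks that the string begins with θ and repeats with step |θ|; it does not verify that |θ| is the smallest such step.

```python
from typing import Sequence


def is_theta_path(labels: Sequence[int], theta: Sequence[int]) -> bool:
    """labels: edge labels along a downward path (assumed 0 = left, 1 = right)."""
    if not theta:
        raise ValueError("theta must be non-empty")
    if len(labels) < len(theta):
        return False  # too short to begin with theta
    return all(labels[i] == theta[i % len(theta)] for i in range(len(labels)))
```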
Maximal θ-Path
Def: A θ-path p is maximal if it cannot be extended.
[Figure: with θ = 0 0 1, two θ-paths marked "not" maximal.]
Maximal θ-Paths in linear time
Exercise 2: Design a linear-time algorithm to find all maximal θ-paths in a tree.
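Informal terminology (2)
The size of a node = the size of the subtree rooted at that node.
[Figure: a tree whose nodes are labeled with their sizes 7, 4, 1, 1, 2, 2, 1.]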
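Node sizes in this sense can be computed for the whole tree in one bottom-up pass. A minimal sketch, with a hypothetical `Node` class that stores the size on the node itself:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical binary-tree node; size is filled in by compute_sizes."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    size: int = 0


def compute_sizes(root: Node) -> int:
    """Set v.size for every node v and return the size of the root."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)  # every parent is recorded before its descendants
        stack.extend(c for c in (v.left, v.right) if c is not None)
    for v in reversed(order):  # so children are finished before parents
        v.size = 1 + sum(c.size for c in (v.left, v.right) if c is not None)
    return root.size
```

The pass touches every node a constant number of times, so it is linear in the size of the tree.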
Centroid of a tree
m = 19
⌊m/2⌋ = 9
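Spine of a Tree
Spine = an extended version of the centroid.
[Figure: a spine with edge labels 0 1 0 1 0 1 0; nodes of size ≥ m/2 lie above nodes of size < m/2.]
Link node = the last node of the centroid
= the last node on the spine whose size is ≥ m/2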
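A sketch of how the spine and the link node might be computed. The slides do not spell out the definition of the spine, so this assumes it is the root-to-leaf path that always enters the child with the larger subtree (ties go left); the link node is then taken, as stated above, to be the last spine node of size ≥ m/2. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    size: int = 0


def fill_sizes(v: Optional[Node]) -> int:
    """Store the subtree size on every node; returns the size at v."""
    if v is None:
        return 0
    v.size = 1 + fill_sizes(v.left) + fill_sizes(v.right)
    return v.size


def spine(root: Node) -> List[Node]:
    """Assumed definition: always descend into the larger child."""
    fill_sizes(root)
    path, v = [], root
    while v is not None:
        path.append(v)
        ls = v.left.size if v.left is not None else 0
        rs = v.right.size if v.right is not None else 0
        if ls == 0 and rs == 0:
            break  # reached a leaf
        v = v.left if ls >= rs else v.right
    return path


def link_node(root: Node) -> Node:
    """Last spine node whose size is at least m/2, where m is the tree size."""
    path = spine(root)
    m = root.size
    return [v for v in path if 2 * v.size >= m][-1]
```

The root always qualifies, so the list in `link_node` is never empty.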
A Special Case of Tree Pattern Matching
Input: An ordered binary tree T, |T| = n.
An ordered binary tree P, |P| = m.
Output: All nodes in T where P matches.
Additional constraint: T has only one maximal θ-path, where θ is the period of the spine of P.
[Figure: T and P in the special case.]
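Reduce to Subset Matching (1)
[Figure: the pattern P with spine 0 0 1 0 0 1 0 and the subtrees B2 and B6 hanging off it; sets shown along the spine: 0, 0abce, 1.]
Reduce to Subset Matching (2)
[Figure: B2 and B6 with their nodes named a to f; spine labels 0 1 0 0; sets shown: 1abdef, 0.]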
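Reduce to Subset Matching (3)
[Figure: the text T with path 0 0 1 0 0 1 0 0 1 0 0 1 and ribs 2, 5, and 6 hanging off it.]
Reduce to Subset Matching (4)
[Figure: rib 2 with its nodes named a to f; path labels 0 0 0 0 1 0 0 1 0 0; rib 2 is encoded as the set 0abcef.]
Time: O(min{m, |rib 2|})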
Reduce to Subset Matching (5)
Total time: ∑_i O(min{m, |rib i|}) = O(n)
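The idea behind the encoding can be sketched as follows, under the assumption that each subtree hanging off the path (a rib) is turned into the set of its node positions, where a position is the 0/1 route from the root of the rib. A pattern subtree then fits into a text rib exactly when its position set is a subset of the rib's position set. The names are hypothetical, and this version visits the whole rib, so it does not reproduce the O(min{m, |rib|}) bound.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(eq=False)
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def position_codes(root: Optional[Node]) -> Set[str]:
    """Name every node by the 0/1 route leading to it ('' is the root)."""
    codes, stack = set(), [(root, "")]
    while stack:
        v, code = stack.pop()
        if v is None:
            continue
        codes.add(code)
        stack.append((v.left, code + "0"))
        stack.append((v.right, code + "1"))
    return codes


pattern_subtree = Node(Node(), None)          # root with a left child
text_rib = Node(Node(Node(), None), Node())   # a larger rib in T
fits = position_codes(pattern_subtree) <= position_codes(text_rib)
print(fits)  # prints True
```

In the slides the positions are named a to f and combined with the edge label of the path, for example 0abce.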
What about the general case?
If more than one maximal θ-path is found, how do we reduce?
Brute force: apply the method above to each maximal θ-path, reducing each one to a subset matching problem.
Time?
Where does the intuition come from?
Truncation lemma: If the first |θ| edges of each maximal θ-path are removed, then the truncated paths are disjoint.
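Truncation Lemma (1)
The distance between u and the point where the overlap begins cannot be an integer multiple of |θ|.
[Figure: two overlapping maximal θ-paths p and p', each divided into copies of θ, with the node u marked.]
Truncation Lemma (2)
By the Classical Lemma, p and p' overlap in fewer than |θ| edges.
[Figure: p and p' with the overlap marked < |θ|.]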
The warm-up is over!
What remains to be done:
Part 1: Prove that the truncated maximal θ-paths can be reduced to subset matching in linear time.
Part 2: Work out how to handle the parts that were cut off.
Step 1: Find all maximal θ-paths
[Figure: T with its maximal θ-paths, and P with θ and the link node marked.]
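Step 2: Filtering (1)
Filter out the maximal θ-paths that do not satisfy the following three properties.
Property 1:
[Figure: annotation "≥ m".]
Step 2: Filtering (2)
Property 2:
[Figure: annotation "≥ m/2" on the path, justified (∵) by "≥ m/2" in P.]
Step 2: Filtering (3)
Property 3:
[Figure: annotations "≥ m/2" and "≥ θ" on the path, justified (∵) by "≥ m/2" and "≥ θ" in P.]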
Step 3: Truncation
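Remove the first |θ| edges of each maximal θ-path that survives the filtering.
Step 4: Filtering again
Filter the truncated maximal θ-paths once more; the paths that remain are called truncated paths from here on.
Step 5: Reduce the truncated paths to subset matching, one by one
Time: ∑_{truncated paths} ∑_i O(min{m, |rib i|})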
Analysis of Step 5 (1)
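∑_{truncated paths} ∑_i O(min{m, |rib i|})
= ∑_{ribs} O(min{m, |rib|})
Analysis of Step 5 (2)
∑_{ribs} O(min{m, |rib|})
(big rib = a rib of size ≥ m; small rib = a rib of size < m)
= ∑_{big ribs} O(min{m, |rib|}) + ∑_{small ribs} O(min{m, |rib|})
= O(m · (# big ribs)) + ∑_{small ribs} O(|rib|)
Analysis of Step 5 (3)
O(m · (# big ribs)) + ∑_{small ribs} O(|rib|)
It remains to prove:
Part 1. (# big ribs) = O(n/m)
Part 2. ∑_{small ribs} O(|rib|) = O(n)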
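Written out as one chain, and taking the two remaining claims (Part 1 and Part 2) for granted, the bound for Step 5 reads as follows. Here r ranges over all ribs of all truncated paths; that index is notation introduced for this summary.

```latex
\begin{align*}
\sum_{r} O(\min\{m, |r|\})
  &= \sum_{r \text{ big}} O(m) + \sum_{r \text{ small}} O(|r|) \\
  &= O\bigl(m \cdot \#\{\text{big ribs}\}\bigr)
     + O\Bigl(\sum_{r \text{ small}} |r|\Bigr) \\
  &= O\bigl(m \cdot n/m\bigr) + O(n) \\
  &= O(n).
\end{align*}
```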
Analysis of Step 5 (4)
Part 2: ∑_{small ribs} O(|rib|) = O(n)
∵ the small ribs are disjoint.
[Figure: truncated paths with their ribs labeled small (< m) or big (≥ m).]
Marked nodes
Def: A node in T is marked if its left and right subtrees both contain ≥ m nodes.
# marked nodes is O(n/m)
[Figure: an example with m = 2; the subtrees below the marked nodes are labeled ≥ m.]
[Figure: the marked nodes as internal nodes of a binary tree whose external nodes are subtrees of size ≥ m.]
∵ (# external nodes) · m ≤ n
∴ # external nodes ≤ n/m
⇒ # marked nodes = # internal nodes ≤ n/m − 1
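The bound can be checked on small inputs by counting marked nodes directly. This is a minimal sketch; `count_marked` and the test-tree builder `complete_tree` are hypothetical helpers, and the check covers a single tree shape only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def count_marked(root: Optional[Node], m: int) -> Tuple[int, int]:
    """Return (subtree size, number of marked nodes) for the subtree at root."""
    if root is None:
        return 0, 0
    ls, lm = count_marked(root.left, m)
    rs, rm = count_marked(root.right, m)
    here = 1 if ls >= m and rs >= m else 0  # both subtrees hold >= m nodes
    return 1 + ls + rs, lm + rm + here


def complete_tree(depth: int) -> Optional[Node]:
    """Hypothetical helper: complete binary tree with 2**depth - 1 nodes."""
    if depth == 0:
        return None
    return Node(complete_tree(depth - 1), complete_tree(depth - 1))


n, marked = count_marked(complete_tree(4), 2)
print(n, marked)  # prints 15 3
```

For this tree, 3 ≤ 15/2 − 1, in line with the bound above.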
Analysis of Step 5 (5)
Part 1. (# big ribs) = O(n/m)
If a truncated path has k > 1 big ribs, then it has k − 1 marked nodes.
[Figure: a truncated path with several big ribs.]
Analysis of Step 5 (6)
The big ribs on the truncated paths that have k > 1 big ribs number O(n/m) in total.
Remaining question: how many truncated paths have exactly k = 1 big rib?
Analysis of Step 5 (7)
O(n/m) of them.
[Figure: a truncated path with four small ribs and one big rib; annotation "≥ m/2".]
An observation
Only O(n/m) truncated paths have k > 1 big ribs.
Only O(n/m) truncated paths have k = 1 big rib.
Only O(n/m) truncated paths have k = 0 big ribs.
Hence there are only O(n/m) truncated paths.
Disjoint Lemma
Let C be a set of disjoint θ-paths that satisfy Properties 1 to 3. Then there are O(n/m) θ-paths in C.
Pf:
Only O(n/m) θ-paths have k > 1 big ribs.
Only O(n/m) θ-paths have k = 1 big rib.
Only O(n/m) θ-paths have k = 0 big ribs.
Review
Step 1: Finding all maximal θ-paths
Step 2: Filtering
Step 3: Truncation
Step 4: Filtering again
Step 5: Reduce to subset matching
What about the removed parts?
[Figure: the removed θ prefixes and the pattern P.]
Time: O(m)
The Last Job
Step 1: Finding all maximal θ-paths
Step 2: Filtering (only O(n/m) paths left)
Step 3: Truncation
Step 4: Filtering again
Step 5: Reduce to subset matching
Tail Lemma
The tail of a path is not touched by any other path.
Chains
[Figure: a chain of paths.]
Chain Lemma
[Figure: a chain whose paths are numbered 0, 1, 2, 3.]
On a single chain:
(1) the paths numbered 1, 3, 5, 7, … are disjoint.
(2) the paths numbered 0, 2, 4, 6, 8, … are disjoint.
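Truncated-Chains Lemma (1)
Truncated-Chains lemma: If the first two paths of each chain are removed, then the truncated chains are disjoint.
Truncated-Chains Lemma (2)
[Figure: two chains ρ and ρ'.]
Truncated-Chains Lemma (3)
Case 1: [Figure: ρ and ρ' with an annotation "≥ θ".]
Truncated-Chains Lemma (4)
Case 2: [Figure: ρ and ρ' with an annotation "≥ θ".]
Truncated-Chains Lemma (5)
Case 3: [Figure: ρ and ρ' with an annotation "≥ θ".]
Truncated-Chains Lemma (6)
Case 4: [Figure: ρ and ρ' with an annotation "≥ θ".]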
Almost Over (1)
Chain lemma: On a single chain,
(1) the paths numbered 1, 3, 5, 7, … are disjoint;
(2) the paths numbered 0, 2, 4, 6, … are disjoint.
Truncated-Chains lemma: After the paths numbered 0 and 1 are removed, the chains are disjoint.
⇒ (1) the paths numbered 3, 5, 7, … are disjoint;
(2) the paths numbered 2, 4, 6, … are disjoint.
Almost Over (2)
(1) The paths numbered 3, 5, 7, … are disjoint.
(2) The paths numbered 2, 4, 6, … are disjoint.
By the Disjoint Lemma:
(1) there are O(n/m) paths numbered 3, 5, 7, …;
(2) there are O(n/m) paths numbered 2, 4, 6, ….
Almost Over (3)
If #chains = O(n/m), then
there are O(2n/m) = O(n/m) paths numbered 0 or 1.
Over (1)
[Figure: three groups of maximal connected chains.]
Over (2)
If a group of maximal connected chains contains k > 1 chains, then it contains k − 1 marked nodes.
Over (3)
The chains in the groups of maximal connected chains that contain k > 1 chains number O(n/m) in total.
Over (4)
By the Disjoint Lemma: there are O(n/m) groups of maximal connected chains that contain k = 1 chain.
Unproved Lemmas
Maximal θ-paths in linear time. (Lemma 2.2 in M. Dubiner, Z. Galil, and E. Magen. Faster Tree Pattern Matching, J. ACM, 1994.)
Chain Lemma. (Lemma 5.5 in R. Cole and R. Hariharan. Tree Pattern Matching to Subset Matching in Linear Time, SIAM J. Computing, 2003.)