TASM: Top-k Approximate Subtree Matching

Nikolaus Augsten (Free University of Bozen-Bolzano, Italy)
Denilson Barbosa (University of Alberta, Canada)
Michael Böhlen (University of Zurich, Switzerland)
Themis Palpanas (University of Trento, Italy)

ICDE 2010, March 3, Long Beach, CA, USA


Outline

1 Motivation and Problem Definition
2 TASM-Postorder
    Upper Bound on Subtree Size
    Prefix Ring Buffer Pruning
3 Experiments
4 Conclusion and Future Work

Motivation and Problem Definition


Motivation


Query (XML fragment):

article
  authors
    author
      Tim
    author
      John
  booktitle
    ICDE

Document (very large XML): DBLP, 28M nodes, 531MB

Rank the top-k matches for the article query in the DBLP document!

Example answer for k = 3:

1. (1 error)

inproceedings
  authors
    author
      Tim
    author
      John
  booktitle
    ICDE

2. (2 errors)

article
  author
    Tim
  author
    John
  booktitle
    TKDE

3. (3 errors)

inproceedings
  authors
    author
      Tim
    author
      John
    author
      Peter
  booktitle
    ICDE

TASM: Top-k Approximate Subtree Matching


Definition (TASM: Top-k Approximate Subtree Matching)

Given: query tree $Q$, document tree $T$, size $k$ of the ranking.
Goal: compute a top-k ranking $R = (T_1, T_2, \ldots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.

Subtree $T_i$:
  a node and all its descendants
  the largest subtree is the document itself

Top-k ranking $R = (T_1, T_2, \ldots, T_k)$:
  subtrees sorted by distance to the query
  best k subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$

Ranking Function: Tree Edit Distance (TED)


Query tree:

article
  authors
    author
      Tim
    author
      John
  booktitle
    ICDE

After del(authors):

article
  author
    Tim
  author
    John
  booktitle
    ICDE

After ren(ICDE):

article
  author
    Tim
  author
    John
  booktitle
    TKDE

Tree Edit Distance: minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.

TASM computes TED between query and document subtrees.

Size and number of computed subtrees define TASM complexity.
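The definitions above can be tried out on the example trees. The following is a minimal sketch, assuming unit costs for insert, rename, and delete, and using the textbook recursion on rightmost roots with memoization. It is not the TASM-Dynamic or TASM-Postorder algorithm of the talk. The names node, forest_dist, ted, subtrees, and tasm_naive are hypothetical helpers introduced for illustration, and the brute-force ranking is only practical for tiny documents.

```python
from functools import lru_cache

# A tree is a pair (label, children), where children is a tuple of trees.
def node(label, *children):
    return (label, tuple(children))

def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    return min(
        forest_dist(f[:-1] + v_kids, g) + 1,  # delete the rightmost root of f
        forest_dist(f, g[:-1] + w_kids) + 1,  # insert the rightmost root of g
        forest_dist(v_kids, w_kids)           # map the two roots onto each other
        + forest_dist(f[:-1], g[:-1])
        + (v_label != w_label),               # rename costs 1 if labels differ
    )

def ted(t1, t2):
    """Tree edit distance between two trees."""
    return forest_dist((t1,), (t2,))

def subtrees(t):
    """All subtrees of t (a node with all its descendants), in postorder."""
    for child in t[1]:
        yield from subtrees(child)
    yield t

def tasm_naive(query, doc, k):
    """Brute force: rank every document subtree by its distance to the query."""
    return sorted(subtrees(doc), key=lambda t: ted(query, t))[:k]

def authors(*names):
    return node("authors", *(node("author", node(n)) for n in names))

Q = node("article", authors("Tim", "John"), node("booktitle", node("ICDE")))
A1 = node("inproceedings", authors("Tim", "John"), node("booktitle", node("ICDE")))
A2 = node("article", node("author", node("Tim")), node("author", node("John")),
          node("booktitle", node("TKDE")))
A3 = node("inproceedings", authors("Tim", "John", "Peter"),
          node("booktitle", node("ICDE")))

# A tiny stand-in document containing the three example answers.
DOC = node("dblp", A3, A2, A1)
top3 = tasm_naive(Q, DOC, 3)
print([ted(Q, t) for t in top3])
```

The sketch computes a distance for every subtree of the document, including the document itself, which is exactly the cost that the rest of the talk sets out to avoid.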

State of the Art


TASM-Dynamic: dynamic programming solution [1]
  computes the distance to every subtree of the document
  uses smaller subtrees to compute larger ones
  ranks subtrees by visiting the memoization table
  space complexity: O(mn), m: query size, n: document size

Space complexity limits application to databases
  in database applications n is huge (database size!)
  TASM-Dynamic maintains two m × n matrices in RAM
  > 6GB RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \times 10^6$)

For database-size documents, dynamic programming is too expensive.

State-of-the-art algorithms do not scale!

[1] Zhang and Shasha 1989; Demaine et al. 2007

Problem Definition

Find a solution for TASM (Top-k Approximate Subtree Matching) that

scales to very large documents

runs in small memory

ranks subtrees correctly (no heuristics!)


TASM-Postorder


Upper Bound on Subtree Size

Subtree Size Upper Bound in Three Steps


1. Rank the first k subtrees of T in postorder: $R' = (T'_1, T'_2, \ldots, T'_k)$, where $T'_k$ is the worst match.
   Each of the first k subtrees in postorder has at most k nodes, so $|T'_k| \le k$.
   In the worst case, Q is turned into $T'_k$ by deleting all of Q and inserting all of $T'_k$ (via the empty tree $\emptyset$):
   (i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

2. Final ranking $R = (T_1, T_2, \ldots, T_k)$ (= TASM result).
   The $T_i$ in $R$ are at least as good as the worst match $T'_k$ of $R'$:
   (ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$

3. Size upper bound for subtree $T_i$.
   Turning Q into $T_i$ requires at least inserting the $|T_i| - |Q|$ missing nodes:
   $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$
   $|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$

Upper Bound on Subtree Size


Theorem (Upper Bound on Subtree Size)

TASM needs to consider only small document subtrees of size $\tau$ or less:

$\tau = 2|Q| + k$

Upper bound is very powerful:
  independent of document size and structure!
  linear in query size and k

Example: top-10 with example query $|Q| = 8$ on DBLP (28M nodes)
  with bound: max subtree size $\tau = 2 \cdot 8 + 10 = 26$
  without bound: maximum subtree size is 28M (whole document)!

Document-independent upper bound on subtree size!
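The arithmetic of the bound is easy to check in code. The sketch below only evaluates the formula $\tau = 2|Q| + k$ from the theorem and uses it as a size filter. The function names max_subtree_size and may_enter_ranking are hypothetical and not taken from the talk.

```python
def max_subtree_size(query_size: int, k: int) -> int:
    """Upper bound tau = 2|Q| + k on the size of subtrees TASM must consider."""
    return 2 * query_size + k

def may_enter_ranking(subtree_size: int, query_size: int, k: int) -> bool:
    """A subtree larger than tau cannot appear in the final top-k ranking."""
    return subtree_size <= max_subtree_size(query_size, k)

# Example from the slide: top-10 with a query of 8 nodes.
tau = max_subtree_size(8, 10)
print(tau)                                   # 26
print(may_enter_ranking(28_000_000, 8, 10))  # False: the whole DBLP document is pruned
```

Note that neither function takes the document as an argument, which mirrors the point that the bound is independent of document size and structure.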

Prefix Ring Buffer Pruning

Document Format: Postorder Queue


Example document:

dblp
  article
    auth
      John
    title
      X1
  proceedings
    conf
      VLDB
    article
      auth
        Peter
      title
        X3
    article
      auth
        Mike
      title
        X4
  book
    title
      X2

Postorder queue: queue of (label, size)-pairs

(John,1) (auth,2) (X1,1) (title,2) (article,5)
(VLDB,1) (conf,2) (Peter,1) (auth,2) (X3,1)
(title,2) (article,5) (Mike,1) (auth,2) (X4,1)
(title,2) (article,5) (proc,13) (X2,1) (title,2)
(book,3) (dblp,22)

  dequeue removes the leftmost element, e.g., (John, 1)
  no random access!

Relevant and state-of-the-art for XML parsing
  full subtree known only at closing tag
  closing tags appear in postorder

Implementation is efficient and heavily used for
  XML streams
  plain XML files (e.g., SAX)
  XML in database (Dewey, interval encoding, ...)
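The link between closing tags and postorder can be illustrated with a streaming parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree.iterparse, not the implementation used in the talk. It assumes that text occurs only in elements without child elements (no mixed content) and treats each non-empty text as a leaf node. The function name postorder_queue and the shortened sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def postorder_queue(xml_bytes):
    """Yield (label, size) pairs in postorder from an XML byte string.

    Assumption: text appears only in elements that have no child elements.
    """
    sizes = []  # sizes[-1] counts the nodes seen so far in the innermost open element
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            sizes.append(1)          # count the element itself
        else:                        # closing tag: the full subtree is now known
            size = sizes.pop()
            text = (elem.text or "").strip()
            if text:
                yield (text, 1)      # the text node is a leaf of size 1
                size += 1
            yield (elem.tag, size)
            if sizes:
                sizes[-1] += size    # add this subtree to its parent's size
            elem.clear()             # drop content that is no longer needed

DOC = (b"<dblp><article><auth>John</auth><title>X1</title></article>"
       b"<book><title>X2</title></book></dblp>")

for pair in postorder_queue(DOC):
    print(pair)
```

The generator hands out one pair at a time and never looks back at earlier pairs, which matches the dequeue-only access pattern of the postorder queue.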

Candidate Subtrees

Candidate subtrees are all subtrees Ti of the document with

|Ti | ≤ τ ANDTi is not contained in a larger subtree |Tj | ≤ τ

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 15 / 28

Candidate Subtrees

Candidate subtrees are all subtrees Ti of the document with
  |Ti| ≤ τ AND
  Ti is not contained in a larger subtree Tj with |Tj| ≤ τ

Pruning: find candidate subtrees
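The definition can be checked directly on a postorder list of (label, size) pairs. The sketch below is a reference check, not a streaming algorithm: it assumes the whole list fits in memory, and the name candidate_roots is hypothetical. It uses the fact that subtree sizes grow along the path to the root, so Ti is a candidate exactly when |Ti| ≤ τ and its parent's subtree (if any) is larger than τ.

```python
def candidate_roots(postorder, tau):
    """Return the 0-based postorder positions of the roots of all candidate subtrees."""
    parent_size = [None] * len(postorder)
    stack = []  # positions of completed subtrees whose parent has not been read yet
    for i, (_, size) in enumerate(postorder):
        remaining = size - 1  # number of descendants of node i
        while remaining > 0:
            child = stack.pop()
            parent_size[child] = size
            remaining -= postorder[child][1]
        stack.append(i)
    return [
        i
        for i, (_, size) in enumerate(postorder)
        if size <= tau and (parent_size[i] is None or parent_size[i] > tau)
    ]


# Postorder queue of the example document from the slides.
DBLP = [(label, int(size)) for label, size in (pair.split(",") for pair in (
    "John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 Peter,1 auth,2 X3,1 "
    "title,2 article,5 Mike,1 auth,2 X4,1 title,2 article,5 proc,13 X2,1 "
    "title,2 book,3 dblp,22").split())]

roots = candidate_roots(DBLP, tau=6)
print(roots)                          # [4, 6, 11, 16, 20]
print([DBLP[i][0] for i in roots])    # ['article', 'conf', 'article', 'article', 'book']
```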

Simple Pruning Approach

Example document (postorder number of each node in parentheses):

dblp (22)
  article (5)
    auth (2)
      John (1)
    title (4)
      X1 (3)
  proceedings (18)
    conf (7)
      VLDB (6)
    article (12)
      auth (9)
        Peter (8)
      title (11)
        X3 (10)
    article (17)
      auth (14)
        Mike (13)
      title (16)
        X4 (15)
  book (21)
    title (20)
      X2 (19)

Simple pruning approach (τ = 6 in the example above):
  add nodes to a memory buffer until a non-candidate (|Ti| > τ) is added
  subtrees of the non-candidate with |Ti| ≤ τ are candidate subtrees

Problem: the memory buffer can grow very large!
  must keep subtrees in memory until a non-candidate ancestor is read
  worst case: memory buffer stores O(n) nodes (frequent in data-centric XML!)

Example: DBLP, τ = 50
  99% of nodes are still in the buffer when the root node is read!

Simple pruning is not feasible for large documents!
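The buffer growth can be made concrete with a small sketch of the simple approach. This is one possible reading of the two bullet points, under stated assumptions: the function name simple_pruning and the bookkeeping with a stack of completed subtrees are hypothetical, and only maximal subtrees with |Ti| ≤ τ are reported.

```python
def simple_pruning(postorder, tau):
    """Return (candidates, peak): candidate subtrees and the peak number of buffered nodes."""
    tops = []  # (size, nodes) per completed subtree whose parent is still unread;
               # nodes is None for a non-candidate that was already handled
    candidates = []
    buffered = peak = 0
    for label, size in postorder:
        children = []
        remaining = size - 1
        while remaining > 0:  # collect the children of the current node
            child = tops.pop()
            children.append(child)
            remaining -= child[0]
        children.reverse()  # restore document order
        if size <= tau:
            # Still small: keep the whole subtree in the buffer.
            nodes = [n for _, sub in children for n in sub] + [(label, size)]
            tops.append((size, nodes))
            buffered += 1
            peak = max(peak, buffered)
        else:
            # Non-candidate: its buffered child subtrees are candidates.
            for _, sub in children:
                if sub is not None:
                    candidates.append(sub)
                    buffered -= len(sub)
            tops.append((size, None))
    # A document with at most tau nodes is itself the only candidate.
    candidates.extend(sub for _, sub in tops if sub is not None)
    return candidates, peak


DBLP = [(label, int(size)) for label, size in (pair.split(",") for pair in (
    "John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 Peter,1 auth,2 X3,1 "
    "title,2 article,5 Mike,1 auth,2 X4,1 title,2 article,5 proc,13 X2,1 "
    "title,2 book,3 dblp,22").split())]

cands, peak = simple_pruning(DBLP, tau=6)
print([c[-1][0] for c in cands], peak)
# ['conf', 'article', 'article', 'article', 'book'] 17

# A flat, data-centric document: one root with 1000 leaf children.
flat = [("rec", 1)] * 1000 + [("root", 1001)]
print(simple_pruning(flat, tau=6)[1])  # 1000
```

In the example document, 17 of 22 nodes are buffered before the first non-candidate (proc) is read; in the flat document every node except the root is buffered, which is the O(n) worst case named above.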

Efficient Pruning is Tricky!

Problem: when can we remove a node from the buffer?
  when we see |Ti| ≤ τ, we don't yet know about the parent (postorder!)
  the subtree of the parent might be smaller than τ!

Our solution does not wait for the parent
  prefix ring buffer: fixed-size buffer
  pruning rule: prune based on following nodes

Pruning in Small Memory

Prefix ring buffer of size τ + 1 (main memory)
  stores a prefix (τ nodes in postorder) of the document
  s↑ marks the start and e↑ the end of the buffer content

Two operations
  append new node
  remove leftmost subtree/node

Example (τ = 6), buffer content from s↑ to e↑:
  before append: John,1 auth,2 X1,1 title,2 article,5
  after appending VLDB,1 (buffer full): John,1 auth,2 X1,1 title,2 article,5 VLDB,1
  after removing the leftmost subtree: VLDB,1

Pruning rule: if the leftmost node in the full ring buffer is a
  leaf: the leftmost subtree is a candidate subtree
  non-leaf: the leftmost node is a non-candidate node
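The two buffer operations can be sketched as a small fixed-size data structure. The class below is an illustration only: the name PrefixRingBuffer, its methods, and the convention of keeping one of the τ + 1 slots free to tell a full buffer from an empty one are assumptions, not details taken from the slides.

```python
class PrefixRingBuffer:
    """Fixed-size ring buffer with tau + 1 slots that holds at most tau nodes."""

    def __init__(self, tau):
        self.slots = [None] * (tau + 1)
        self.s = 0  # start: slot of the leftmost node
        self.e = 0  # end: next free slot

    def __len__(self):
        return (self.e - self.s) % len(self.slots)

    def is_full(self):
        return len(self) == len(self.slots) - 1

    def append(self, node):
        if self.is_full():
            raise OverflowError("ring buffer is full")
        self.slots[self.e] = node
        self.e = (self.e + 1) % len(self.slots)

    def remove_leftmost(self, count=1):
        """Remove and return the `count` leftmost nodes (a node or a whole subtree)."""
        if count > len(self):
            raise IndexError("not enough nodes in the buffer")
        n = len(self.slots)
        removed = [self.slots[(self.s + k) % n] for k in range(count)]
        self.s = (self.s + count) % n
        return removed


buf = PrefixRingBuffer(tau=6)
for node in [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5), ("VLDB", 1)]:
    buf.append(node)
print(buf.is_full())                 # True
subtree = buf.remove_leftmost(5)     # the article subtree
print(subtree[-1], len(buf))         # ('article', 5) 1
buf.append(("conf", 2))              # the end pointer wraps around to slot 0
print(buf.s, buf.e)                  # 5 0
```

Both operations only move the two pointers, so no nodes are shifted in memory when a subtree is removed.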

Pruning Rule – Intuition

Candidate subtree: leftmost node is a leaf
  Ti: leftmost subtree, starts with the leftmost node
  Tj: smallest subtree that contains Ti
  due to postorder: Tj contains all nodes in the buffer
  since |Ti| ≤ τ and |Tj| > τ: Ti is a candidate

Non-candidate node: leftmost node is a non-leaf
  the leftmost non-leaf is the parent of previously removed nodes
  we remove either candidate subtrees or non-candidate nodes
  in both cases: the parent is a non-candidate

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)John,1 auth,2 X1,1 · · ·

prefix ring buffer (main memory)

s↑ e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X1,1 title,2 · · ·

prefix ring buffer (main memory)

s↑John,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X1,1 title,2 article,5 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)title,2 article,5 VLDB,1 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 VLDB,1 conf,2 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)VLDB,1 conf,2 Peter,1 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2 article,5

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)conf,2 Peter,1 auth,2 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2 article,5 VLDB,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)conf,2 Peter,1 auth,2 · · ·

prefix ring buffer (main memory)

s↑John,1 auth,2 X1,1 title,2 article,5 VLDB,1

e↑

append

candidate subtrees:(output)

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)conf,2 Peter,1 auth,2 · · ·

prefix ring buffer (main memory)

s↑VLDB,1

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)Peter,1 auth,2 X3,1 · · ·

prefix ring buffer (main memory)

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X3,1 title,2 · · ·

prefix ring buffer (main memory)Peter,1

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X3,1 title,2 article,5 · · ·

prefix ring buffer (main memory)Peter,1 auth,2

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)title,2 article,5 Mike,1 · · ·

prefix ring buffer (main memory)Peter,1 auth,2 X3,1

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 Mike,1 auth,2 · · ·

prefix ring buffer (main memory)Peter,1 auth,2 X3,1 title,2

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 Mike,1 auth,2 · · ·

prefix ring buffer (main memory)Peter,1 auth,2 X3,1 title,2

e↑ s↑VLDB,1 conf,2

append

candidate subtrees:(output)

article

auth

John

title

X1

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 Mike,1 auth,2 · · ·

prefix ring buffer (main memory)

s↑Peter,1 auth,2 X3,1 title,2

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)Mike,1 auth,2 X4,1 · · ·

prefix ring buffer (main memory)

s↑Peter,1 auth,2 X3,1 title,2 article,5

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X4,1 title,2 · · ·

prefix ring buffer (main memory)

s↑Peter,1 auth,2 X3,1 title,2 article,5 Mike,1

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X4,1 title,2 · · ·

prefix ring buffer (main memory)

s↑Peter,1 auth,2 X3,1 title,2 article,5 Mike,1

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)auth,2 X4,1 title,2 · · ·

prefix ring buffer (main memory)

s↑Mike,1

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X4,1 title,2 article,5 · · ·

prefix ring buffer (main memory)

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)title,2 article,5 proc,13 · · ·

prefix ring buffer (main memory)X4,1

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)article,5 proc,13 X2,1 · · ·

prefix ring buffer (main memory)X4,1 title,2

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)proc,13 X2,1 title,2 · · ·

prefix ring buffer (main memory)X4,1 title,2 article,5

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X2,1 title,2 book,3 · · ·

prefix ring buffer (main memory)X4,1 title,2 article,5 proc,13

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X2,1 title,2 book,3 · · ·

prefix ring buffer (main memory)X4,1 title,2 article,5 proc,13

e↑ s↑Mike,1 auth,2

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)X2,1 title,2 book,3 · · ·

prefix ring buffer (main memory)

s↑proc,13

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)title,2 book,3 dblp,22 · · ·

prefix ring buffer (main memory)

s↑proc,13 X2,1

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)book,3 dblp,22 · · ·

prefix ring buffer (main memory)

s↑proc,13 X2,1 title,2

e↑

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)dblp,22 · · ·

prefix ring buffer (main memory)

e↑ s↑proc,13 X2,1 title,2 book,3

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)(empty) · · ·

prefix ring buffer (main memory)dblp,22

e↑ s↑proc,13 X2,1 title,2 book,3

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)(empty) · · ·

prefix ring buffer (main memory)dblp,22

e↑ s↑proc,13 X2,1 title,2 book,3

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010 20 / 28

TASM-Postorder Prefix Ring Buffer Pruning

Prefix Ring Buffer Pruning – Exampledblp

article

auth

John

title

X1

proceedings

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4

book

title

X2

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to resultnon-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6 postorder queue (input)(empty) · · ·

prefix ring buffer (main memory)dblp,22

e↑ s↑X2,1 title,2 book,3

append

candidate subtrees:(output)

article

auth

John

title

X1

conf

VLDB

article

auth

Peter

title

X3

article

auth

Mike

title

X4


Prefix Ring Buffer Pruning – Example

document: dblp(article(auth(John), title(X1)), proceedings(conf(VLDB), article(auth(Peter), title(X3)), article(auth(Mike), title(X4))), book(title(X2)))

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to result
non-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6

postorder queue (input, appended to the ring buffer): (empty)

prefix ring buffer (main memory), from start s to end e: dblp,22

candidate subtrees (output):

article(auth(John), title(X1))
conf(VLDB)
article(auth(Peter), title(X3))
article(auth(Mike), title(X4))
book(title(X2))


Prefix Ring Buffer Pruning – Example

document: dblp(article(auth(John), title(X1)), proceedings(conf(VLDB), article(auth(Peter), title(X3)), article(auth(Mike), title(X4))), book(title(X2)))

1. fill ring buffer

2. check leftmost node

leaf: candidate subtree – to result
non-leaf: non-candidate – remove

3. until queue and buffer empty

τ = 6

postorder queue (input, appended to the ring buffer): (empty)

prefix ring buffer (main memory), from start s to end e: (empty)

candidate subtrees (output):

article(auth(John), title(X1))
conf(VLDB)
article(auth(Peter), title(X3))
article(auth(Mike), title(X4))
book(title(X2))
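The three steps of the example (fill the buffer, check the leftmost node, repeat until queue and buffer are empty) can be sketched in Python as follows. This is a simplified illustration, not the implementation from the talk: a `collections.deque` capped at τ + 1 entries stands in for the fixed-size ring buffer with its start and end pointers, and the names `candidate_subtrees` and `EXAMPLE` are assumptions. The buffer capacity of τ + 1 is also an assumption of this sketch.

```python
from collections import deque

# Postorder queue of the example document as (label, subtree_size) pairs.
EXAMPLE = [(label, int(size)) for label, size in (
    pair.split(",") for pair in
    "John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 "
    "Peter,1 auth,2 X3,1 title,2 article,5 "
    "Mike,1 auth,2 X4,1 title,2 article,5 proceedings,13 "
    "X2,1 title,2 book,3 dblp,22".split())]


def candidate_subtrees(postorder, tau):
    """Yield the maximal subtrees with at most tau nodes, each as a
    postorder list of (label, subtree_size) pairs."""
    queue = iter(postorder)
    buf = deque()          # stands in for the prefix ring buffer
    exhausted = False
    while True:
        # 1. fill the buffer (at most tau + 1 nodes are held at once)
        while not exhausted and len(buf) < tau + 1:
            try:
                buf.append(next(queue))
            except StopIteration:
                exhausted = True
        # 3. stop when queue and buffer are both empty
        if not buf:
            return
        # 2. check the leftmost node
        if buf[0][1] == 1:
            # Leaf: a node at buffer index i whose size is i + 1 roots a
            # subtree that starts exactly at this leaf. Take the largest
            # such subtree that does not exceed tau.
            root = max(i for i, (_, size) in enumerate(buf)
                       if size == i + 1 and size <= tau)
            yield [buf.popleft() for _ in range(root + 1)]
        else:
            # Non-leaf: its descendants were already emitted, so the
            # node roots a subtree larger than tau. Drop it.
            buf.popleft()


for subtree in candidate_subtrees(EXAMPLE, tau=6):
    label, size = subtree[-1]   # the root comes last in postorder
    print(label, size)
```

For τ = 6 this prints the roots article 5, conf 2, article 5, article 5 and book 3, matching the candidate subtrees listed in the example; proceedings (13 nodes) and dblp (22 nodes) are dropped as non-candidates.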


TASM-Postorder

TASM-postorder

1. empty ranking R, tightening upper bound τ′ = τ

2. for each candidate subtree Ti

a. if |R| = k: update τ′ = min(τ, max(R) + |Q|)
b. compute tree edit distance for all subtrees of Ti within τ′
c. update ranking R

Theorem (TASM-Postorder)

The space complexity of TASM-postorder is independent of the document size:

O(m² + mk)

(m: query size, k: result size)

TASM-postorder scales to very large documents!
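The ranking loop of TASM-postorder can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' code: the tree edit distance is passed in as a function `ted` (any implementation could be plugged in), "within τ′" is read as "subtrees with at most τ′ nodes", candidate subtrees are postorder lists of (label, size) pairs, and the name `tasm_postorder` is hypothetical. The bound max(R) + |Q| is plausible under unit edit costs, where the distance between two trees is at least the difference of their sizes.

```python
import heapq
from itertools import count


def tasm_postorder(candidates, query_size, k, tau, ted):
    """Return the k best (distance, subtree) pairs, closest first.

    candidates: iterable of postorder lists of (label, subtree_size)
    ted:        function mapping a subtree to its distance from the query
                (assumed to be supplied by the caller)
    """
    ranking = []        # max-heap via negated distance: (-dist, tie, subtree)
    tie = count()       # tie-breaker so subtrees are never compared
    tau_prime = tau     # 1. empty ranking, tau' = tau
    for cand in candidates:                       # 2. each candidate Ti
        if len(ranking) == k:                     # a. tighten tau'
            worst = -ranking[0][0]
            tau_prime = min(tau, worst + query_size)
        for i, (_, size) in enumerate(cand):      # b. subtrees within tau'
            if size > tau_prime:
                continue
            # In postorder, the subtree rooted at index i spans the
            # size entries that end at i.
            sub = cand[i - size + 1:i + 1]
            dist = ted(sub)
            entry = (-dist, next(tie), sub)
            if len(ranking) < k:                  # c. update ranking
                heapq.heappush(ranking, entry)
            elif dist < -ranking[0][0]:
                heapq.heapreplace(ranking, entry)
    return sorted(((-d, sub) for d, _, sub in ranking), key=lambda e: e[0])
```

Only the ranking (k subtrees) and one candidate subtree of bounded size are held at a time, which is consistent with the claim that the space requirement does not depend on the document size.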


Experiments

Pruning Effectiveness

Prefix ring buffer pruning is very effective!
Maximum subtree reduced from 37M to 18 nodes.

Dataset: PSD protein sequences, 37M nodes, 683MB

Compute TASM (|Q| = 4, k = 1)

TASM-dynamic (state of the art)
TASM-postorder (our solution)

[Figure: histogram of computed subtrees, number of subtrees vs. subtree size (nodes), both axes from 1e0 to 1e7. Panel TASM-Dynamic: largest subtree 37M nodes (entire document). Panel TASM-Postorder: largest subtree 18 nodes.]


Scalability: TASM-Postorder vs. TASM-Dynamic

TASM-postorder is much faster than TASM-dynamic.

Dataset: XMark (synthetic XML for benchmark)

Vary query size and document size

Compute TASM (k = 5)

TASM-dynamic (state of the art)
TASM-postorder (our solution)

Measure wall clock time

[Plot: time (seconds, 1e0 to 1e3) vs. query size (4 to 64 nodes); series dyn and pos, each for T: 112MB and T: 224MB.]

[Plot: time (seconds, 1e0 to 1e3) vs. document size (112 to 1792 MB); series dyn and pos, each for |Q| = 4 and |Q| = 8.]


Scalability with Result Size k

TASM-postorder scales well with k.
Increasing k by 4 orders of magnitude only doubles runtime.

Dataset: XMark (synthetic XML for benchmark)

Vary k (size of ranking)

Compute TASM (|Q| = 16)

TASM-dynamic (state of the art)
TASM-postorder (our solution)

Measure wall clock time

[Plot: time (seconds, 0 to 300) vs. k (1e0 to 1e4); series dyn and pos, each for T: 112MB and T: 224MB.]


Space complexity: TASM-Postorder vs. TASM-Dynamic

TASM-postorder: space independent of document!

Dataset: XMark (synthetic XML for benchmark)

Vary document size

Compute TASM (k = 5)

TASM-dynamic (state of the art)
TASM-postorder (our solution)

Measure main memory usage

[Plot: memory (MB, 1e0 to 4e3) vs. document size (112 to 1792 MB); series dyn and pos, each for |Q| = 4 and |Q| = 16; annotated values: 3GB and 8MB.]


Conclusion and Future Work

Conclusion

Prefix Ring Buffer for space-efficient pruning

Dynamic programming does not scale for database size solutions.

Upper bound τ: limit maximum subtree size for TASM

TASM-postorder: highly scalable TASM algorithm

TASM-postorder makes TASM feasible.

Future Work – New research opportunities:

tune tree edit distance to different applications

index the document: can we avoid a document scan?

parallel TASM algorithm: where to split document?

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. In ICALP, volume 4596 of LNCS, pages 146–157, Wroclaw, Poland, July 2007. Springer.

K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing, 18(6):1245–1262, 1989.