DIMACS Streaming Data Working Group II: On the Optimality of the Holistic Twig Join Algorithm. Speaker: Byron Choi (Upenn). Joint work with Susan Davidson (Upenn), Malika Mahoui (Upenn) and Derick Wood (HKUST).


Page 1: On the Optimality of the Holistic Twig Join Algorithm

DIMACS Streaming Data Working Group II

On the Optimality of the Holistic Twig Join Algorithm

Speaker: Byron Choi (Upenn)

Joint Work with Susan Davidson (Upenn), Malika Mahoui (Upenn) and Derick Wood (HKUST)

Page 2: On the Optimality of the Holistic Twig Join Algorithm

A Scenario

XML Doc. Server: produces streams of elements.

Small Devices: limited computing resources; memory is shared by many concurrent apps.

Task: picking up useful elements on the fly.

Page 3: On the Optimality of the Holistic Twig Join Algorithm

Background: The Model, Data Representation and Assumptions

Page 4: On the Optimality of the Holistic Twig Join Algorithm

The Model: Data Streaming Model

Spend constant time to process each element

An element in a stream is either discarded or stored in the main memory once it is processed

See the element in streams only once
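The three restrictions above can be sketched as a single loop. This is only an illustration of the model, not code from the talk: the function name `single_pass`, the `keep` predicate and the sample labels are all hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def single_pass(stream: Iterable[T], keep: Callable[[T], bool]) -> List[T]:
    """Consume a stream once; every element is either stored or discarded."""
    memory: List[T] = []
    for element in stream:      # each element is seen exactly once
        if keep(element):       # assumed to be an O(1) decision
            memory.append(element)
        # otherwise the element is dropped and cannot be revisited
    return memory


# Hypothetical stream of labels; only the "A" elements are stored.
stored = single_pass(iter(["A", "B", "A", "C"]), lambda label: label == "A")
print(stored)
```

The difficulty discussed in the rest of the talk is that, for twig queries, the `keep` decision cannot always be made correctly at the moment an element arrives.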

Page 5: On the Optimality of the Holistic Twig Join Algorithm

Node Representation

4-ary tuple: <preorder #, postorder #, depth, label>

Complexity of Desc, Child, Ances, Parent: O(1)

Desc(n1, n2) = true (n2 is a descendant of n1) if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder

Child(n1, n2) = true (n2 is a child of n1) if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder ∧ n1.depth + 1 = n2.depth
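The two predicates translate directly into code. The sketch below assumes a `Node` tuple type of our own naming; the three sample nodes reuse tuples from the talk's example document.

```python
from typing import NamedTuple


class Node(NamedTuple):
    preorder: int
    postorder: int
    depth: int
    label: str


def desc(n1: Node, n2: Node) -> bool:
    """True if n2 is a descendant of n1; two integer comparisons, so O(1)."""
    return n1.preorder < n2.preorder and n1.postorder > n2.postorder


def child(n1: Node, n2: Node) -> bool:
    """True if n2 is a child of n1: a descendant exactly one level deeper."""
    return desc(n1, n2) and n1.depth + 1 == n2.depth


a1 = Node(1, 9, 1, "A")
b1 = Node(2, 7, 2, "B")
a2 = Node(3, 6, 3, "A")

print(desc(a1, a2))   # a2 lies inside a1
print(child(a1, b1))  # b1 is one level below a1
print(child(a1, a2))  # a2 is two levels below a1
```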

Page 6: On the Optimality of the Holistic Twig Join Algorithm

Example Document

a1 (1, 9, 1, A)
  b1 (2, 7, 2, B)
    a2 (3, 6, 3, A)
      b2 (4, 4, 4, B)
      c1 (5, 5, 4, C)
  c2 (8, 8, 2, C)

Page 7: On the Optimality of the Holistic Twig Join Algorithm

Twig Queries

Syntax:

Step ::= / | //
NodeTest ::= symbol
Path ::= Step NodeTest | Step NodeTest Path
Twig ::= Path | Path (Twig, Twig, …, Twig)

Example: //A (//B, //C)

In English: find the A nodes that have a B descendant and a C descendant.

[Figure: query tree with root A and children B and C]
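To make the semantics of the example query concrete, here is a brute-force evaluator over the example document. It is a sketch for illustration only, not the TwigStack algorithm: it keeps the whole document in memory and tries every combination. The names `Twig`, `related` and `matches` are hypothetical, and treating a top-level `/` step as "root only" (depth 1) is an assumption.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    name: str
    preorder: int
    postorder: int
    depth: int
    label: str


@dataclass
class Twig:
    step: str                                   # "/" (child) or "//" (descendant)
    label: str                                  # the NodeTest
    branches: List["Twig"] = field(default_factory=list)


DOC = [
    Node("a1", 1, 9, 1, "A"), Node("b1", 2, 7, 2, "B"),
    Node("a2", 3, 6, 3, "A"), Node("b2", 4, 4, 4, "B"),
    Node("c1", 5, 5, 4, "C"), Node("c2", 8, 8, 2, "C"),
]


def related(parent: Node, node: Node, step: str) -> bool:
    """Check the edge between a query node's match and its parent's match."""
    inside = parent.preorder < node.preorder and parent.postorder > node.postorder
    return inside and (step == "//" or parent.depth + 1 == node.depth)


def matches(twig: Twig, doc: List[Node],
            context: Optional[Node] = None) -> List[Tuple[str, ...]]:
    """Return every binding of the twig's query nodes to document nodes."""
    results = []
    for node in doc:
        if node.label != twig.label:
            continue
        if context is None:
            # Assumption: a top-level "/" step matches only the root (depth 1).
            if twig.step == "/" and node.depth != 1:
                continue
        elif not related(context, node, twig.step):
            continue
        per_branch = [matches(b, doc, node) for b in twig.branches]
        for combo in product(*per_branch):
            results.append((node.name,) + tuple(n for part in combo for n in part))
    return results


query = Twig("//", "A", [Twig("//", "B"), Twig("//", "C")])
solutions = matches(query, DOC)
for s in solutions:
    print(s)
```

On the example document this yields five solutions: four rooted at a1 and one, (a2, b2, c1), rooted at a2.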

Page 8: On the Optimality of the Holistic Twig Join Algorithm

Twig Join Algorithms

Containment Join [Jiang et al.]: decompose a twig query into a set of steps; apply relational join algorithms to join the nodes of each step; use customized traditional indexes and estimation methods [SIGMOD03].

Path Join [Zhang et al.]: decompose a twig query into a set of paths; apply relational join algorithms to join the nodes of each path.

Holistic Twig Join [Bruno et al.]: evaluate the twig query as a whole.

Page 9: On the Optimality of the Holistic Twig Join Algorithm

Twig Join Algorithms (cont’)

The first two approaches may compute large intermediate results and are not suitable for data streaming.

In this talk we focus on the third approach: the TwigStack algorithm (Bruno et al., SIGMOD 02).

Page 10: On the Optimality of the Holistic Twig Join Algorithm

The TwigStack Algor. (Overview)

Associate a stream with each NodeTest; the nodes in the stream satisfy the NodeTest.

Asymptotically optimal among the algorithms that read the entire input: it scans the streams only once and spends constant memory only on the nodes that are useful, i.e. participate in at least one solution.

Optimality is guaranteed when the query contains descendant edges only.

Suboptimal when the query contains some child edges: memory is spent on possibly useless nodes.

Page 11: On the Optimality of the Holistic Twig Join Algorithm

Problem Statement

Given a twig query and the associated streams, is it possible to find all solutions …

by using a single forward scan of the streams,

by paying constant memory only to the useful nodes, and

by spending constant time on processing each node in the streams?

Page 12: On the Optimality of the Holistic Twig Join Algorithm

Main Results So Far

Assuming the data streaming model:

There is no optimal holistic twig join algorithm (Theorem 1).

The evaluation of twig queries is not memory bounded (Theorem 1).

By relaxing some restrictions on the data streaming model, we showed that the lower bounds of such relaxed models are still quite high (Theorem 2 and Theorem 3).

Page 13: On the Optimality of the Holistic Twig Join Algorithm

Outline: TwigStack By Examples; Offline Sorting; Multiple Scans; Discussion; Conclusion

Page 14: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Query: //A (//B, //C)

Document: the example document [figure]

Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]

pA, pB, pC are the anchors pointing to the “top” of the streams.

Useful nodes are stored in the main memory and can be read later.

Page 15: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Step 0: pA -> a1, pB -> b1, pC -> c1; a1 is useful, TA is advanced, pA -> a2

Step 1: b1 is useful, TB is advanced, pB -> b2

[Figures: the example document shown twice; the second copy is annotated with the stored node a1]

Page 16: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Step 2: a2 is useful, TA is advanced, pA -> null

Step 3: b2 is useful, TB is advanced, pB -> null

[Figures: the example document shown twice, annotated with the stored nodes a1, b1 and then a1, b1, a2]

Page 17: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Step 4: c1 is useful, TC is advanced, pC -> c2

Step 5: Printing

Step 6: c2 is useful, TC is advanced, pC -> null

[Figures: the example document with stored nodes a1, b1, a2, b2; then the document without c1, with stored nodes a1, b1]
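The order in which the anchors move in this walk-through can be replayed with a few lines of code. The sketch below is not TwigStack itself: it simply advances whichever anchor points at the node with the smallest preorder number, which for this particular document and query happens to reproduce the steps above. The preorder numbers are taken from the example document; the dictionary layout and the `trace` format are assumptions made for illustration.

```python
# Streams from the walk-through; each node is written as (preorder number, name).
streams = {
    "A": [(1, "a1"), (3, "a2")],
    "B": [(2, "b1"), (4, "b2")],
    "C": [(5, "c1"), (8, "c2")],
}

anchors = {q: 0 for q in streams}   # pA, pB, pC as indexes into the streams
trace = []

while True:
    # Streams whose anchor has not yet run off the end.
    live = [q for q in streams if anchors[q] < len(streams[q])]
    if not live:
        break
    # Advance the stream whose current head comes first in document order.
    q = min(live, key=lambda s: streams[s][anchors[s]][0])
    _, name = streams[q][anchors[q]]
    anchors[q] += 1
    rest = streams[q][anchors[q]:]
    nxt = rest[0][1] if rest else "null"
    trace.append(f"{name} consumed, p{q} -> {nxt}")

for line in trace:
    print(line)
```

Each stream is read front to back exactly once, which is the single forward scan the streaming model requires. The printing step and the stack bookkeeping of the real algorithm are left out.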

Page 18: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Query: //A (/B, /C)

Document: the example document [figure]

Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]

Page 19: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

Computation 1: pA -> a1, pB -> b1, pC -> c1; TA is advanced, pA -> a2; TB is advanced, pB -> b2; a2 is useful (a1 is discarded)

Computation 2: TC is advanced, pC -> c2; a1 is useful; a2 is useless because c1 is discarded

[Figures: the example document shown twice; the second copy omits c2]
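The dilemma between the two computations can be checked by brute force. The sketch below enumerates all solutions of //A (/B, /C) on the example document and then reports which solutions become unreachable if a given node is thrown away. It is an offline check for illustration, not a streaming algorithm; `lost_if_discarded` is a hypothetical helper name.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str
    preorder: int
    postorder: int
    depth: int
    label: str


DOC = [
    Node("a1", 1, 9, 1, "A"), Node("b1", 2, 7, 2, "B"),
    Node("a2", 3, 6, 3, "A"), Node("b2", 4, 4, 4, "B"),
    Node("c1", 5, 5, 4, "C"), Node("c2", 8, 8, 2, "C"),
]


def child(p: Node, n: Node) -> bool:
    """True if n is a child of p, using the O(1) test on the node tuples."""
    return (p.preorder < n.preorder and p.postorder > n.postorder
            and p.depth + 1 == n.depth)


# All (A, B, C) triples where B and C are both children of A.
solutions: List[Tuple[str, str, str]] = [
    (a.name, b.name, c.name)
    for a in DOC if a.label == "A"
    for b in DOC if b.label == "B" and child(a, b)
    for c in DOC if c.label == "C" and child(a, c)
]


def lost_if_discarded(name: str) -> List[Tuple[str, str, str]]:
    """Solutions that can no longer be produced once `name` is discarded."""
    return [s for s in solutions if name in s]


print(solutions)
print(lost_if_discarded("a1"))  # what Computation 1 gives up
print(lost_if_discarded("c1"))  # what Computation 2 gives up
```

Both a1 and c1 take part in a solution, so neither computation can discard its node safely; holding on to both means spending memory on a node before it is known to be useful.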

Page 20: On the Optimality of the Holistic Twig Join Algorithm

TwigStack By Examples

The Extreme Case: O(stream size)

[Figure: a document in which b1 appears alongside c4, b2 alongside c3, b3 alongside c2 and b4 alongside c1]

Page 21: On the Optimality of the Holistic Twig Join Algorithm

TwigStack Pseudo Code

We’ve only walked through the red boxes

Page 22: On the Optimality of the Holistic Twig Join Algorithm

Twig Queries over Streams

Theorem 1: There is no optimal holistic twig join algorithm, no matter how the nodes are sorted.

Memory must be spent on possibly useless nodes.

Given arbitrary streams, the memory requirement of exact algorithms is unbounded.

Page 23: On the Optimality of the Holistic Twig Join Algorithm

Proof of Theorem 1 (Sketch)

Fix a document.

Issue a few queries: //A//B, /A (/A, /A) and /A/A.

Optimality implies certain constraints on the streams.

No single stream can satisfy all the constraints.

Page 24: On the Optimality of the Holistic Twig Join Algorithm

Proof of Theorem 1 (cont’)

Reduce a twig query to an SPJ query: the twig query is memory bounded iff the SPJ query is memory bounded (Babcock et al., PODS 02).

Page 25: On the Optimality of the Holistic Twig Join Algorithm

Outline: TwigStack By Examples; Offline Sorting; Multiple Scans; Discussion; Conclusion

Page 26: On the Optimality of the Holistic Twig Join Algorithm

Variation 1: Offline Sorting

Pre-compute some intermediate results and collect the results in a scan.

Allow offline sorting on the nodes and keep all the necessary sorted nodes.

Allow the algorithm to scan the nodes in the correct orderings.

Page 27: On the Optimality of the Holistic Twig Join Algorithm

Motivation

The anchors are performing a depth-first traversal. But why?

How about an ordering in which recursions are removed?

[Figures: the example document, and a reordering that lists a1, a2 above b1, b2, c1, c2]

Page 28: On the Optimality of the Holistic Twig Join Algorithm

The Lower Bound

The number of necessary sortings performed offline is high (data redundancy).

m is the number of structurally recursive labels in the doc. DTD; d is the doc. depth.

The lower bound is m^d.

We identify a restricted case in which DTDs help to lower the lower bound.

Page 29: On the Optimality of the Holistic Twig Join Algorithm

Variation 2: Multiple Scans

Massive storage (tapes, disks) naturally produces a stream of items.

Sequential scanning is a vital requirement of such storage.

Only a small number of scans can be allowed due to the high volume of data.

Page 30: On the Optimality of the Holistic Twig Join Algorithm

The Lower Bound

Allow P scans on the data streams.

The lower bound of P is high: d^t, where d is the doc. depth and t is the number of simple child-edge queries in a twig query.

Page 31: On the Optimality of the Holistic Twig Join Algorithm

Discussion

Bruno et al. assign memory to possibly useless nodes and illustrate by experiments that such a computation model is practical.

No work on approximating twig queries with provable guarantees.

Constraints expressed in DTDs.

Our work assumes a certain representation of the nodes: ancestor, descendant, parent and child relationships can be determined in O(1).

Page 32: On the Optimality of the Holistic Twig Join Algorithm

Conclusion

The evaluation of twig queries in a data streaming context is tricky.

It is not memory bounded.

The optimal memory constraint cannot be satisfied in one pass over the streams.

Need to look for other solutions.