
XJoin: Faster Query Results Over Slow And Bursty Networks


Page 1: XJoin: Faster Query Results Over Slow And Bursty Networks

1

XJoin: Faster Query Results Over Slow And Bursty Networks

IEEE Bulletin, 2000, by T. Urhan and M. Franklin

Based on a talk prepared by Asima Silva & Leena Razzaq

Page 2: XJoin: Faster Query Results Over Slow And Bursty Networks

2

Motivation

Data delivery issues in terms of:
– unpredictable delay from some remote data sources
– wide-area networks with possibly slow communication links, congestion, failures, and overload

Goal:
– Not just overall query processing time matters
– Also when the initial data is delivered
– Overall throughput and delivery rate throughout query processing

Page 3: XJoin: Faster Query Results Over Slow And Bursty Networks

3

Overview

• Hash Join History
• 3 Classes of Delays
• Motivation of XJoin
• Challenges of Developing XJoin
• Three Stages of XJoin
• Handling Duplicates
• Experimental Results

Page 4: XJoin: Faster Query Results Over Slow And Bursty Networks

4

Hash Join

Only one table is hashed

[Diagram: R tuples are hashed into buckets by join key (1. BUILD); S tuples then probe the hash table one at a time (2. PROBE), and matches are output.]
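To make the BUILD/PROBE picture concrete, here is a minimal sketch of a classic in-memory hash join in Python; the relation names and key-extractor functions are illustrative, not from the paper.

```python
from collections import defaultdict

def hash_join(R, S, key_r, key_s):
    # 1. BUILD: hash only one table (R) on the join key.
    table = defaultdict(list)
    for r in R:
        table[key_r(r)].append(r)

    # 2. PROBE: stream the other table (S) past the hash table, one tuple at a time.
    return [(r, s) for s in S for r in table.get(key_s(s), [])]

# Example: join departments to employees on department id.
emps = [("alice", 1), ("bob", 2)]
depts = [(1, "eng"), (2, "ops")]
print(hash_join(depts, emps, key_r=lambda d: d[0], key_s=lambda e: e[1]))
# [((1, 'eng'), ('alice', 1)), ((2, 'ops'), ('bob', 2))]
```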

Page 5: XJoin: Faster Query Results Over Slow And Bursty Networks

5

Hybrid Hash Join

One table is hashed into partitions, kept partly in memory and partly on disk.
G. Graefe, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys, 1993.

[Diagram: R tuples are hashed into buckets; some buckets are kept in memory while the rest are written to disk. Arriving S tuples probe the memory-resident buckets.]
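A compact sketch of the hybrid hash join idea, under the simplifying assumption that partition 0 of R stays in memory while the remaining partitions (and the S tuples that hash to them) are spilled and joined in a second pass; plain Python lists stand in for disk files, and all names are illustrative.

```python
from collections import defaultdict

def hybrid_hash_join(R, S, key_r, key_s, num_partitions=4):
    # BUILD: partition R by hash; partition 0 stays in memory, the rest are "spilled".
    mem_table = defaultdict(list)                                  # in-memory partition 0 of R
    disk_R = [defaultdict(list) for _ in range(num_partitions)]    # spilled R partitions
    for r in R:
        k = key_r(r)
        p = hash(k) % num_partitions
        (mem_table if p == 0 else disk_R[p])[k].append(r)

    # PROBE: S tuples that hash to partition 0 are joined immediately; others are spilled.
    results = []
    disk_S = [[] for _ in range(num_partitions)]
    for s in S:
        k = key_s(s)
        p = hash(k) % num_partitions
        if p == 0:
            results.extend((r, s) for r in mem_table.get(k, []))
        else:
            disk_S[p].append(s)

    # Second pass: join each pair of spilled partitions.
    for p in range(1, num_partitions):
        for s in disk_S[p]:
            results.extend((r, s) for r in disk_R[p].get(key_s(s), []))
    return results
```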

Page 6: XJoin: Faster Query Results Over Slow And Bursty Networks

6

Symmetric Hash Join (Pipeline)

Both tables are hashed (both kept in main memory only)
Z. Ives, A. Levy, "An Adaptive Query Execution", VLDB 99

[Diagram: Source R and Source S each feed their own in-memory hash table. An arriving R tuple is inserted (BUILD) into R's table and probes (PROBE) S's table; an arriving S tuple is handled symmetrically. Matches go straight to the OUTPUT.]
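Below is a minimal sketch of a symmetric (pipelined) hash join over a single interleaved stream of tagged tuples; the stream format and function names are assumptions for illustration.

```python
from collections import defaultdict

def symmetric_hash_join(stream, key_r, key_s):
    """stream yields ('R', tuple) or ('S', tuple); matches are emitted as soon as possible."""
    table_r, table_s = defaultdict(list), defaultdict(list)
    for source, t in stream:
        if source == 'R':
            k = key_r(t)
            table_r[k].append(t)                 # BUILD into R's table
            for s in table_s.get(k, []):         # PROBE S's table
                yield (t, s)
        else:
            k = key_s(t)
            table_s[k].append(t)                 # BUILD into S's table
            for r in table_r.get(k, []):         # PROBE R's table
                yield (r, t)

# Example: results appear as soon as a matching pair has arrived.
stream = [('R', ('alice', 1)), ('S', (1, 'eng')), ('R', ('bob', 1))]
print(list(symmetric_hash_join(stream, key_r=lambda r: r[1], key_s=lambda s: s[0])))
# [(('alice', 1), (1, 'eng')), (('bob', 1), (1, 'eng'))]
```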

Page 7: XJoin: Faster Query Results Over Slow And Bursty Networks

7

Problem of SHJ:

Memory intensive:
– Won't work for large input streams
– Won't allow many joins to be processed in a pipeline (or even in parallel)

Page 8: XJoin: Faster Query Results Over Slow And Bursty Networks

8

New Problems: Three Delays

– Initial Delay: the first tuple arrives from a remote source more slowly than usual (we still want the initial answer out quickly)

– Slow Delivery: data arrives at a constant, but slower than expected, rate (at the end, we still want good overall throughput)

– Bursty Arrival: data arrives in a fluctuating manner (how do we avoid sitting idle during periods of low input rates?)

Page 9: XJoin: Faster Query Results Over Slow And Bursty Networks

9

Question:

Why are delays undesirable?
– They prolong the time to first output
– Processing is slowed if we wait for all data to arrive before acting
– If data arrives too fast, we want to avoid losing any of it
– Sitting idle while no data is incoming wastes time
– Delays are unpredictable, so one single strategy won't work

Page 10: XJoin: Faster Query Results Over Slow And Bursty Networks

10

Challenges for XJoin

Manage flow of tuples between memory and secondary storage (when and how to do it)

Control background processing when inputs are delayed (reactive scheduling idea)

Ensure the full answer is produced
Ensure duplicate tuples are not produced
Achieve both quick initial output and good overall throughput

Page 11: XJoin: Faster Query Results Over Slow And Bursty Networks

11

Motivation of XJoin

Produces results incrementally as they become available
– Tuples are returned as soon as they are produced
– Good for online processing

Allows progress to be made when one or more sources experience delays:
– Background processing is performed on previously received tuples, so results are produced even when both inputs are stalled

Page 12: XJoin: Faster Query Results Over Slow And Bursty Networks

12

Stages (in different threads)

M:M

M:D

D:D

Page 13: XJoin: Faster Query Results Over Slow And Bursty Networks

13

XJoin

[Diagram: tuples arriving from SOURCE-A and SOURCE-B are hashed into the memory-resident partitions 1..n of their own source (e.g. hash(Tuple A) = 1, hash(Tuple B) = n). When memory fills up, a partition is flushed and appended to the corresponding disk-resident partition, so each source has both memory-resident and disk-resident partitions.]

Page 14: XJoin: Faster Query Results Over Slow And Bursty Networks

14

1st Stage of XJoin

Memory-to-Memory Join
Tuples are stored in partitions:
– A memory-resident (m-r) portion
– A disk-resident (d-r) portion

Join processing continues as usual:
– If space permits, join memory to memory
– If memory is full, pick one partition as victim, flush it to disk, and append it to the end of its disk partition

The 1st Stage runs as long as one of the inputs is producing tuples. If there is no new input, Stage 1 blocks and Stage 2 starts.
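A simplified, single-threaded sketch of this first stage: insert the arriving tuple into its own memory-resident partition, probe the other source's matching partition, and flush a victim partition to disk when memory is full. The memory limit, victim policy, and data structures are illustrative assumptions, not the paper's exact design.

```python
from collections import defaultdict

NUM_PARTITIONS = 4
MEMORY_LIMIT = 10_000      # max tuples held in memory across both sources (assumed constant)

# Per-source memory-resident (m-r) and disk-resident (d-r) partitions; lists model disk files.
mem = {'R': [defaultdict(list) for _ in range(NUM_PARTITIONS)],
       'S': [defaultdict(list) for _ in range(NUM_PARTITIONS)]}
disk = {'R': [[] for _ in range(NUM_PARTITIONS)],
        'S': [[] for _ in range(NUM_PARTITIONS)]}
in_memory = 0

def pick_victim():
    # One plausible policy: flush the largest memory-resident partition.
    return max(((src, p) for src in ('R', 'S') for p in range(NUM_PARTITIONS)),
               key=lambda sp: sum(len(ts) for ts in mem[sp[0]][sp[1]].values()))

def stage1_insert(source, key, tup, emit):
    """Handle one arriving tuple: build into its own m-r partition, probe the other source's
    m-r partition, and flush a victim partition to disk when memory is full."""
    global in_memory
    other = 'S' if source == 'R' else 'R'
    p = hash(key) % NUM_PARTITIONS

    mem[source][p][key].append(tup)                     # BUILD
    in_memory += 1
    for match in mem[other][p].get(key, []):            # PROBE (memory to memory)
        emit((tup, match) if source == 'R' else (match, tup))

    if in_memory > MEMORY_LIMIT:                        # memory full: flush one partition
        v_src, v_p = pick_victim()
        victim = mem[v_src][v_p]
        disk[v_src][v_p].extend((k, t) for k, ts in victim.items() for t in ts)
        in_memory -= sum(len(ts) for ts in victim.values())
        mem[v_src][v_p] = defaultdict(list)
```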

Page 15: XJoin: Faster Query Results Over Slow And Bursty Networks

15

1st Stage Memory-to-Memory Join

[Diagram: an arriving Tuple A is hashed (hash(record A) = i), inserted into memory-resident partition i of source A, and used to probe partition i of source B; an arriving Tuple B is handled symmetrically (hash(record B) = j). Matching pairs go directly to the output.]

Page 16: XJoin: Faster Query Results Over Slow And Bursty Networks

16

Why Stage 1?

• Use memory whenever possible, as it is fastest

• Use newly arriving data while it is already in memory

• Don't stop to fetch data from disk to join with new arrivals

Page 17: XJoin: Faster Query Results Over Slow And Bursty Networks

17

Question:

– What does the Second Stage do?
– When does the Second Stage start?
– Hints:
  XJoin proposes a memory management technique.
  What happens when the input data (tuples) is too large for memory?

– Answer: the Second Stage joins memory to disk. It occurs when both inputs are blocked.

Page 18: XJoin: Faster Query Results Over Slow And Bursty Networks

18

2nd Stage of XJoin

Activated when the 1st Stage is blocked. Performs 3 steps:

1. Chooses a partition from one source according to its throughput and size

2. Uses tuples from the d-r portion to probe the m-r portion of the other source and outputs matches, until the d-r portion is completely processed

3. Checks if either input resumed producing tuples. If yes, resume the 1st Stage. If no, choose another d-r portion and continue the 2nd Stage.
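Continuing the Stage 1 sketch above (it reuses the same mem, disk, and NUM_PARTITIONS structures, with disk entries stored as (key, tuple) pairs), here is a rough sketch of the second stage's loop; emit and inputs_resumed are assumed callbacks, and the timestamp-based duplicate suppression described later is omitted.

```python
def stage2(emit, inputs_resumed):
    """Run while both inputs are blocked: probe m-r partitions with d-r ones.
    Timestamp-based duplicate suppression (see the later slides) is omitted from this sketch."""
    processed = set()
    while not inputs_resumed():
        # 1. Choose an unprocessed, non-empty disk-resident partition (here simply the largest).
        candidates = [(s, i) for s in ('R', 'S') for i in range(NUM_PARTITIONS)
                      if (s, i) not in processed and disk[s][i]]
        if not candidates:
            return
        src, p = max(candidates, key=lambda sp: len(disk[sp[0]][sp[1]]))
        other = 'S' if src == 'R' else 'R'

        # 2. Probe the other source's m-r partition with every d-r tuple, emitting matches,
        #    until the d-r portion is completely processed.
        for key, tup in disk[src][p]:
            for match in mem[other][p].get(key, []):
                emit((tup, match) if src == 'R' else (match, tup))
        processed.add((src, p))

        # 3. The loop condition re-checks whether an input resumed producing tuples; if so,
        #    control returns to Stage 1, otherwise another d-r portion is chosen.
```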

Page 19: XJoin: Faster Query Results Over Slow And Bursty Networks

19

Stage 2: Disk-to-Memory Joins

[Diagram: the disk-resident portion of partition i of source A (DPiA) is read from disk and used to probe the memory-resident portion of partition i of source B (MPiB); matches go to the output.]

Page 20: XJoin: Faster Query Results Over Slow And Bursty Networks

20

Controlling 2nd Stage

The cost of the 2nd Stage is hidden when both inputs experience delays.

Tradeoff: what are the benefits of using the second stage?
– Produces results while the input sources are stalled
– Allows variable input rates

What is the disadvantage?
– The second stage must complete a d-r portion before checking for new input (overhead)

To address the tradeoff, use an activation threshold:
– Pick a partition likely to produce many tuples right now

Page 21: XJoin: Faster Query Results Over Slow And Bursty Networks

21

3rd Stage of XJoin

Disk-to-Disk Join (clean-up stage)
– Assumes that all data for both inputs has arrived
– Assumes that the first and second stages have completed
– Makes sure that all tuples belonging in the result are produced

Why is this step necessary?
– Completeness of the answer

Page 22: XJoin: Faster Query Results Over Slow And Bursty Networks

22

Handling Duplicates

When could duplicates be produced?
– Duplicates could be produced in all 3 stages, as multiple stages may perform overlapping work.

How is it addressed?
– XJoin prevents duplicates with timestamps.

When is it addressed?
– During processing, as output is produced continuously.

Page 23: XJoin: Faster Query Results Over Slow And Bursty Networks

23

Time Stamping: part 1

2 fields are added to each tuple:
– Arrival TimeStamp (ATS): indicates when the tuple first arrived in memory
– Departure TimeStamp (DTS): indicates when the tuple was flushed to disk

[ATS, DTS] indicates when the tuple was resident in memory.

When did two tuples get joined (in the 1st stage)?
– If Tuple A's DTS is within Tuple B's [ATS, DTS]
Tuples that meet this overlap condition are not considered for joining by the 2nd or 3rd stages.
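As a sketch, this first-stage duplicate check can be written as a symmetric overlap test on the [ATS, DTS] windows (the symmetric form is needed because either tuple may be the one that departed first); the TsTuple representation is an assumption for illustration.

```python
from collections import namedtuple

# Illustrative tuple representation: join key plus the two timestamps (assumption).
TsTuple = namedtuple('TsTuple', 'key ats dts')

def joined_in_stage1(a, b):
    """True if the two tuples overlapped in memory, i.e. the pair was already produced by Stage 1."""
    return (b.ats <= a.dts <= b.dts) or (a.ats <= b.dts <= a.dts)

# Numbers from the next slide: A = [102, 234]; B1 = [178, 198] overlaps, B2 = [348, 601] does not.
print(joined_in_stage1(TsTuple('k', 102, 234), TsTuple('k', 178, 198)))   # True  (already joined)
print(joined_in_stage1(TsTuple('k', 102, 234), TsTuple('k', 348, 601)))   # False (later stages must join)
```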

Page 24: XJoin: Faster Query Results Over Slow And Bursty Networks

24

Detecting tuples joined in 1st stage

[Example: Tuple A has [ATS, DTS] = [102, 234].
Overlapping: Tuple B1 with [ATS, DTS] = [178, 198] arrived after A and before A was flushed to disk, so A and B1 were joined in the first stage.
Non-overlapping: Tuple B2 with [ATS, DTS] = [348, 601] arrived after A was flushed to disk, so A and B2 were not joined in the first stage.]

Page 25: XJoin: Faster Query Results Over Slow And Bursty Networks

25

Time Stamping: part 2

• For each partition, keep track of:
– ProbeTS: the time when a 2nd stage probe was done
– DTSlast: the latest DTS of all the tuples that were available on disk at that time

• Several such probes may occur:
– Thus keep an ordered history of such probe descriptors

• Usage:
– All disk tuples with DTS up to and including DTSlast were joined in Stage 2 with all tuples that were in main memory (their [ATS, DTS] window contains ProbeTS) at time ProbeTS
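A sketch of how a partition's probe history could be consulted to detect pairs that the second stage already produced; the ProbeDesc representation and field names are assumptions based on the slide.

```python
from collections import namedtuple

TsTuple = namedtuple('TsTuple', 'key ats dts')
ProbeDesc = namedtuple('ProbeDesc', 'dts_last probe_ts')     # one entry per 2nd-stage probe

def joined_in_stage2(disk_tuple, mem_tuple, history):
    """True if some earlier 2nd-stage probe of disk_tuple's partition already joined this pair:
    disk_tuple had been flushed by then (DTS <= DTSlast) and mem_tuple was memory-resident
    at the probe time (ATS <= ProbeTS <= DTS)."""
    return any(disk_tuple.dts <= pd.dts_last and
               mem_tuple.ats <= pd.probe_ts <= mem_tuple.dts
               for pd in history)

# Numbers from the next slide: A = [100, 200] on disk in Partition 2, B = [500, 600] in memory,
# and Partition 2's history contains (DTSlast = 250, ProbeTS = 550): the pair was already joined.
history_p2 = [ProbeDesc(dts_last=250, probe_ts=550)]
print(joined_in_stage2(TsTuple('k', 100, 200), TsTuple('k', 500, 600), history_p2))   # True
```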

Page 26: XJoin: Faster Query Results Over Slow And Bursty Networks

26

Detecting tuples joined in 2nd stage

[Example: Tuple A has [ATS, DTS] = [100, 200] and resides on disk in Partition 2; Tuple B has [ATS, DTS] = [500, 600] and is memory-resident. The history list for Partition 2 contains the probe descriptor (DTSlast = 250, ProbeTS = 550). All A tuples in Partition 2 with DTS up to DTSlast = 250 (including Tuple A) were joined in Stage 2, at time ProbeTS, with the m-r tuples whose [ATS, DTS] window contains ProbeTS = 550 (including Tuple B), so this pair overlaps and is not produced again. A partition's history may contain several such probe descriptors.]

Page 27: XJoin: Faster Query Results Over Slow And Bursty Networks

27

Experiments

• HHJ (Hybrid Hash Join)

• XJoin (with 2nd stage and with caching)

• XJoin (without 2nd stage)

• XJoin (with aggressive usage of 2nd stage)

Page 28: XJoin: Faster Query Results Over Slow And Bursty Networks

28

Case 1: Slow Network
Both sources are slow (bursty)

XJoin improves the delivery time of initial answers -> interactive performance

Reactive background processing is an effective way to exploit intermittent delays and keep up a continued output rate.

Shows that the 2nd stage is very useful if there is time for it

Page 29: XJoin: Faster Query Results Over Slow And Bursty Networks

29

Slow Network: both sources are slow

Page 30: XJoin: Faster Query Results Over Slow And Bursty Networks

30

Case 2: Fast Network
Both sources are fast

All XJoin variants deliver initial results earlier.
XJoin can also deliver the overall result in roughly the same time as HHJ.
HHJ delivers the 2nd half of the result faster than XJoin.
The 2nd stage cannot be used too aggressively if new data is coming in continuously.

Page 31: XJoin: Faster Query Results Over Slow And Bursty Networks

31

Case 2: Fast Network
Both sources are fast

Page 32: XJoin: Faster Query Results Over Slow And Bursty Networks

32

Conclusion

Can be conservative on space (small footprint)

Can be used in conjunction with online query processing to manage the streams

Resuming Stage 1 as soon as data arrives
Dynamically choosing techniques for producing results

Page 33: XJoin: Faster Query Results Over Slow And Bursty Networks

33

References

Urhan, Tolga and Franklin, Michael J. “XJoin: Getting Fast Answers From Slow and Bursty Networks.”

Urhan, Tolga, Franklin, Michael J. “XJoin: A Reactively-Scheduled Pipelined Join Operator.”

Hellerstein, Franklin, Chandrasekaran, Deshpande, Hildrum, Madden, Raman, and Shah. “Adaptive Query Processing: Technology in Evolution”. IEEE Data Engineering Bulletin, 2000.

Avnur, Ron and Hellerstein, Joseph M. "Eddies: Continuously Adaptive Query Processing."

Babu, Shivnath and Widom, Jennifer. "Continuous Queries over Data Streams."