15
ntroduction to Database Systems Join Algorithms Query Processing: Lecture 1

Join Algorithms

Embed Size (px)

DESCRIPTION

Join Algorithms. Query Processing: Lecture 1. Joins With One Join Column. SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid. In algebra: R S. Common! Must be carefully optimized. R S is large; so, R S followed by a selection is inefficient. - PowerPoint PPT Presentation

Citation preview

Page 1: Join Algorithms

Introduction to Database Systems 1

Join Algorithms

Query Processing: Lecture 1

Page 2: Join Algorithms

Introduction to Database Systems 2

Joins With One Join Column

In algebra: R S. Common! Must be carefully optimized. R S is large; so, R S followed by a selection is inefficient.

Assume: M tuples in R, pR tuples per page, N tuples in S, pS tuples per page.– In our examples, R is Reserves and S is Sailors.

Cost metric: # of I/Os. Ignore output costs.

SELECT *FROM Reserves R1, Sailors S1WHERE R1.sid=S1.sid

Page 3: Join Algorithms

Introduction to Database Systems 3

Cross-Product Based Joins

Apply selection on all pairs of tuples. Family of “nested-loops” joins

Page 4: Join Algorithms

Introduction to Database Systems 4

Tuple Nested Loops Join

– Cost: M + pR * M * N = 1000 + 100*1000*500 I/Os.

How can you use locality to improve performance?

foreach tuple r in R doforeach tuple s in S do

if ri == sj then add <r, s> to result

Page 5: Join Algorithms

Introduction to Database Systems 5

Page-Oriented NL Join

– Cost: M + M*N = 1000 + 1000*500– If smaller relation (S) is outer, cost = 500 +

500*1000– More improvements ?

foreach page pR in R do foreach page pS in S do foreach tuple r in pR do

foreach tuple s in pS doif ri == sj then add <r, s> to result

Page 6: Join Algorithms

Introduction to Database Systems 6

Block Nested Loops Join Use one page as an input buffer for scanning the

inner S, one page as the output buffer, and use all remaining pages to hold ``chunk’’ of outer R.– For each matching tuple r in R-chunk, s in S-page, add

<r, s> to result. Then read next R-chunk, scan S, etc.

. . . . . .

. . .

R & SHash table for chunk of R

(k < B-1 pages)

Input buffer for S Output buffer

Join Result

Page 7: Join Algorithms

Introduction to Database Systems 7

Examples of Block Nested Loops

Cost: Scan of outer + #outer chunks * scan of inner– #outer chunks =

With Reserves (R) as outer, and 100 pages per chunk:– Cost of scanning R is 1000 I/Os; a total of 10 chunks.– Per chunk of R, we scan Sailors (S); 10*500 I/Os.– If space for just 90 pages of R, we would scan S 12 times.

With 100-page chunk of Sailors as outer:– Cost of scanning S is 500 I/Os; a total of 5 chunks.– Per chunk of S, we scan Reserves; 5*1000 I/Os.

How does one exploit sequential I/O?

# /of pagesof outer chunksize

Page 8: Join Algorithms

Introduction to Database Systems 8

Smarter than Cross-Products

Cross-product based joins necessary! Improvements for special cases?

– Simple conditions (X > Y) ??

– Equality conditions (X = Y) ??

Page 9: Join Algorithms

Introduction to Database Systems 9

Index Nested Loops Join

Index on the join column of one relation? Make it the inner relation, and use it!– Cost: M + ( (M*pR) * cost of finding matching S

tuples Cost: For each R tuple, an “index probe” Clustered index: 1 I/O (typical), unclustered:

upto 1 I/O per matching S tuple.

foreach tuple r in R doforeach tuple s in S where ri == sj do

add <r, s> to result

Page 10: Join Algorithms

Introduction to Database Systems 10

Partitioned Joins

Step 1: break R and S into N partitions. Step 2: join N corresponding pairs of

partitions.

Why is this efficient? (do the math) Focus on equality partitioning What are common partitioning algorithms? How are partition sizes distributed?

Page 11: Join Algorithms

Introduction to Database Systems 11

Sort-Merge Join (R S) Sort R and S on the join column, then scan them to

do a ``merge’’ (on join col.), and output result tuples.– Mechanics of the algorithm?

What about skew? Cost: M log M + N log N + (M+N)

– The cost of scanning, M+N, could be M*N (very unlikely!)– how about sequential vs random I/O ?

With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500.

i=j

Page 12: Join Algorithms

Introduction to Database Systems 12

Refinement of Sort-Merge Join

We can combine the merging phases in the sorting of R and S with the merging required for the join.– Let B > , where L is the size of the larger relation– Mechanics?

In practice, cost of sort-merge join, like the cost of external sorting, is linear.

L

Page 13: Join Algorithms

Introduction to Database Systems 13

Hash-Join Partition both

relations using hash f’n h: R tuples in partition i will only match S tuples in partition i.

Original RelationOUTPUT

2

B main memory buffers DiskDisk

INPUT1

hashfunction

hB-1

Partitions

1

2

B-1

. . .

Read in a partition of R, hash it using h2. Scan matching partition of S, search for matches.

Partitionsof R & S

Input bufferfor Si

Hash table for partitionRi (k < B-1 pages)

B main memory buffersDisk

Output buffer

Disk

Join Result

hashf’nh2

h2

Page 14: Join Algorithms

Introduction to Database Systems 14

Observations on Hash-Join

#partitions k < B (why?), and B-1 > size of largest partition to be held in memory. Assuming uniformly sized partitions, and maximizing k, we get:– k= B-1, and M/(B-1) = B-1, i.e., B-1 must be >

If we build an in-memory hash table to speed up the matching of tuples, a little more memory is needed.

Skew ? In partitioning phase, read+write both relns;

2(M+N). In matching phase, read both relns; M+N I/Os.

M

Page 15: Join Algorithms

Introduction to Database Systems 15

Hash-Join vs Sort-Merge Join

Given a minimum amount of memory (what is this, for each?) both have a cost of 3(M+N) I/Os.

Hash Join superior if relation sizes differ greatly. Also, shown to be highly parallelizable.

Sort-Merge less sensitive to data skew; result is sorted;

Sequential and random I/O in different phases.