Multi-Way Hash Join Effectiveness


M.Sc. Thesis
Michael Henderson

Supervisor: Dr. Ramon Lawrence

Outline

• Motivation
• Database Terminology
• Background
• Joins
• Multi-Way Joins
• Thesis Questions
• Experimental Results
• Conclusions

Motivation

• Data is everywhere
• Governments collect data on citizens
• Facebook collects data on over 1 billion people
• Wal-Mart and Target collect sales data on all their customers

• The goal is to make answering the big questions
  – Possible
  – Faster

Database Terminology: Relations (Tables)

Part Relation

partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50

Lineitem Relation

linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50

• Tuple/Row: one line of a relation
• Attribute/Column: one column of a relation; the header row lists the Attribute Names
• The tables are related through their partkey attributes

Database Terminology II: SQL

• Structured Query Language
• Used to ask the database questions about the data
• Standardized
• Example: SQL for retrieving all rows from the Part table

SELECT * FROM Part;

Database Terminology III: Join

• Joins are used to combine the data in database tables
• Joins are slow
• We want joins to be faster

Background

What Makes Queries Slow?

• All the data must be read to give an accurate answer
• Data is usually much larger than what can fit in memory
• Operations such as filtering, ordering, and joins are costly
• A join is especially costly
  – May need to match every row in two tables: O(n^2) comparisons
  – May need to perform many slow disk operations (I/Os)

Background: Example Join Query

SQL

SELECT * FROM Part p, Lineitem l
WHERE p.partkey = l.partkey;

Relational Algebra

Part ⋈ Lineitem, with join condition p.partkey = l.partkey

Join Results

partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50

Nested Loop Join

Part

partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50

Lineitem

linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50

• Each Part tuple is compared with every Lineitem tuple; matching pairs are added to the results

Results

partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
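The nested loop join above can be sketched in a few lines of Python. This is a minimal illustration on the toy Part and Lineitem data from the slides, not the thesis implementation; the function name `nested_loop_join` and the tuple layouts are assumptions made for the example.

```python
# Toy relations copied from the slides (tuple layouts are an assumption).
PART = [  # (partkey, name, retailprice)
    (1, "Box", 0.50),
    (2, "Hat", 25.00),
    (3, "Bottle", 2.50),
]
LINEITEM = [  # (linenumber, partkey, quantity, saleprice)
    (1, 1, 1, 0.50),
    (2, 1, 1, 0.50),
    (3, 2, 3, 22.50),
    (4, 3, 15, 2.50),
]


def nested_loop_join(outer, inner, outer_key, inner_key):
    """Compare every outer tuple with every inner tuple.

    Returns the joined tuples and the number of comparisons made,
    which is always len(outer) * len(inner).
    """
    results = []
    comparisons = 0
    for o in outer:
        for i in inner:
            comparisons += 1
            if o[outer_key] == i[inner_key]:
                results.append(o + i)
    return results, comparisons


# Part.partkey is column 0, Lineitem.partkey is column 1.
rows, comparisons = nested_loop_join(PART, LINEITEM, outer_key=0, inner_key=1)
for row in rows:
    print(row)
print("comparisons:", comparisons)
```

With 3 Part tuples and 4 Lineitem tuples the loop performs 12 comparisons to find 4 matches, which is the quadratic cost noted earlier.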

Dynamic Hash Join

Step 1: Partition Part

Hash Function: partition = ((partkey - 1) mod 3) + 1
  partkey 1: ((1 - 1) mod 3) + 1 = 1
  partkey 2: ((2 - 1) mod 3) + 1 = 2
  partkey 3: ((3 - 1) mod 3) + 1 = 3

Three Part Partitions

Partition  partkey  name    retailprice
Part1      1        Box     0.50
Part2      2        Hat     25.00
Part3      3        Bottle  2.50

• Part1 is kept in memory; the other partitions are saved to disk

Step 2: Partition Lineitem with the same hash function

Three Lineitem Partitions

Partition  linenumber  partkey  quantity  saleprice
Lineitem1  1           1        1         0.50
Lineitem1  2           1        1         0.50
Lineitem2  3           2        3         22.50
Lineitem3  4           3        15        2.50

• Lineitem tuples that hash to partition 1 probe Part1 immediately and produce the first two result rows
• The remaining Lineitem tuples are written to Lineitem2 and Lineitem3

Step 3: Join the remaining partition pairs

• Part2 with Lineitem2 produces: 2 Hat 25.00 3 2 3 22.50
• Part3 with Lineitem3 produces: 3 Bottle 2.50 4 3 15 2.50

Results

partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
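The partitioning idea behind the hash join example can be sketched as follows. This is a simplified illustration, not the dynamic hash join from the thesis: both inputs are fully partitioned first and every partition lives in a Python list, so nothing is actually written to disk and no partition is kept memory-resident during partitioning. The names `partition_of`, `partition`, and `partitioned_hash_join` are assumptions for the example.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

PART = [  # (partkey, name, retailprice)
    (1, "Box", 0.50),
    (2, "Hat", 25.00),
    (3, "Bottle", 2.50),
]
LINEITEM = [  # (linenumber, partkey, quantity, saleprice)
    (1, 1, 1, 0.50),
    (2, 1, 1, 0.50),
    (3, 2, 3, 22.50),
    (4, 3, 15, 2.50),
]


def partition_of(key, n=NUM_PARTITIONS):
    """Hash function from the slides: ((key - 1) mod n) + 1."""
    return ((key - 1) % n) + 1


def partition(relation, key_index, n=NUM_PARTITIONS):
    """Split a relation into n partitions on the given key column."""
    parts = {p: [] for p in range(1, n + 1)}
    for row in relation:
        parts[partition_of(row[key_index], n)].append(row)
    return parts


def partitioned_hash_join(build, probe, build_key, probe_key, n=NUM_PARTITIONS):
    """Join matching partition pairs with an in-memory hash table."""
    build_parts = partition(build, build_key, n)
    probe_parts = partition(probe, probe_key, n)
    results = []
    for p in range(1, n + 1):
        # Build a hash table on the build partition only.
        table = defaultdict(list)
        for b in build_parts[p]:
            table[b[build_key]].append(b)
        # Probe it with the matching probe partition.
        for r in probe_parts[p]:
            for b in table.get(r[probe_key], []):
                results.append(b + r)
    return results


part_partitions = partition(PART, 0)
joined = partitioned_hash_join(PART, LINEITEM, build_key=0, probe_key=1)
for row in joined:
    print(row)
```

Because tuples with equal keys always land in the same partition number, each partition pair can be joined independently, and only one build partition needs to fit in memory at a time.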

Join Three Tables

SELECT A.a_key, B.b_key, C.c_key FROM A, B, C
WHERE A.a_key = B.a_key AND A.a_key = C.a_key;

[Figure: Left Deep Plan and Right Deep Plan. Each plan uses two binary joins, one on A.a_key = B.a_key and one on A.a_key = C.a_key]

Multi-way Hash Joins

• Joins multiple relations at the same time
• Shares memory across the entire join
• Produces a result by combining tuples from all relations
• Does not have to repartition intermediate results
• Fewer disk operations

[Figure: Multi-way Plan. A single join of A, B, and C on A.a_key = B.a_key and A.a_key = C.a_key]

Hash Teams

• Multi-way hash join
• Hash Teams joins relations on a common attribute

Hash Teams Example

A

a_key
1
2
3

B

b_key  a_key
1      1
2      2
3      3
4      1
5      2

C

c_key  a_key
1      3
2      1
3      2
4      2
5      1

SELECT A.a_key, B.b_key, C.c_key FROM A, B, C
WHERE A.a_key = B.a_key AND A.a_key = C.a_key;

Partitioning A and B

Hash Function: partition = ((a_key - 1) mod 3) + 1
  a_key 1: ((1 - 1) mod 3) + 1 = 1
  a_key 2: ((2 - 1) mod 3) + 1 = 2
  a_key 3: ((3 - 1) mod 3) + 1 = 3

Partitions of A

Partition  a_key
A1         1
A2         2
A3         3

Partitions of B

Partition  b_key  a_key
B1         1      1
B1         4      1
B2         2      2
B2         5      2
B3         3      3

Processing C

• A1 and B1 are held in memory; the other partitions of A and B are disk partitions
• C is partitioned on a_key with the same hash function
• C tuples that hash to partition 1 are joined with A1 and B1 immediately; the rest are written to C2 and C3

Partitions of C

Partition  c_key  a_key
C1         2      1
C1         5      1
C2         3      2
C2         4      2
C3         1      3

Results after processing C against partition 1

a_key  b_key  c_key
1      1      2
1      4      2
1      1      5
1      4      5

Results after joining all partitions

a_key  b_key  c_key
1      1      2
1      4      2
1      1      5
1      4      5
2      2      3
2      5      3
2      2      4
2      5      4
3      3      1
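The Hash Teams example can be sketched in Python as below. This is an illustrative simplification, not the thesis implementation: all three relations are partitioned on the shared a_key attribute and each group of matching partitions is joined in one pass, with plain lists standing in for memory and disk partitions. The name `hash_team_join` and the tuple layouts are assumptions.

```python
from collections import defaultdict

A = [(1,), (2,), (3,)]                          # (a_key,)
B = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]    # (b_key, a_key)
C = [(1, 3), (2, 1), (3, 2), (4, 2), (5, 1)]    # (c_key, a_key)


def partition_of(a_key, n=3):
    """Hash function from the slides: ((a_key - 1) mod n) + 1."""
    return ((a_key - 1) % n) + 1


def partition(relation, key_index, n=3):
    """Split a relation into n partitions on the join attribute."""
    parts = {p: [] for p in range(1, n + 1)}
    for row in relation:
        parts[partition_of(row[key_index], n)].append(row)
    return parts


def hash_team_join(a, b, c, n=3):
    """Three-way join on a common attribute, one partition group at a time."""
    a_parts = partition(a, 0, n)
    b_parts = partition(b, 1, n)
    c_parts = partition(c, 1, n)
    results = []
    for p in range(1, n + 1):
        # Hash B and C of this partition on a_key.
        b_by_key = defaultdict(list)
        for b_key, a_key in b_parts[p]:
            b_by_key[a_key].append(b_key)
        c_by_key = defaultdict(list)
        for c_key, a_key in c_parts[p]:
            c_by_key[a_key].append(c_key)
        # Combine tuples from all three relations without an intermediate result.
        for (a_key,) in a_parts[p]:
            for b_key in b_by_key.get(a_key, []):
                for c_key in c_by_key.get(a_key, []):
                    results.append((a_key, b_key, c_key))
    return results


team_results = hash_team_join(A, B, C)
for row in sorted(team_results):
    print(row)
```

No intermediate A-B result is ever partitioned again, which is the saving over a plan made of two binary hash joins.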

Generalized Hash Teams (GHT)

• Extends Hash Teams
• Does not need the join attributes to be the same
• Uses indirect partitioning
• Needs an in-memory map to indirectly join relations

GHT Partition Maps

• Uses join memory
• Uses a bitmap to approximate the mapping, which reduces memory requirements
• Needs a bitmap for each partition
• Bitmaps introduce mapping errors that cause tuples to be mapped to multiple partitions (false drops)
• False drops add I/O and processing cost

GHT Example

SELECT c.custkey, o.orderkey, l.partkey FROM Customer c, Orders o, Lineitem l
WHERE c.custkey = o.custkey AND o.orderkey = l.orderkey;

Customer

custkey
1
2
3

Orders

orderkey  custkey
1         1
2         2
3         3
4         1
5         2

Lineitem

orderkey  partkey
1         1
1         2
2         3
2         4
3         1
3         8
4         5
4         6
5         4

GHT Customer Partitions

Hash Function: partition = ((custkey - 1) mod 3) + 1

Partition  custkey
Customer1  1
Customer2  2
Customer3  3

Orders Partitions and Bitmap

Hash Function: partition = ((custkey - 1) mod 3) + 1
Index = (orderkey + 1) mod 4

Orders Partitions

Partition  orderkey  custkey
Orders1    1         1
Orders1    4         1
Orders2    2         2
Orders2    5         2
Orders3    3         3

Bitmaps (one 4-bit bitmap per partition, bits written for index 0 to 3, all starting at 0000)

B1: 0000, then 0010 (orderkey 1, index 2), then 0110 (orderkey 4, index 1)
B2: 0000, then 0001 (orderkey 2, index 3), then 0011 (orderkey 5, index 2)
B3: 0000, then 1000 (orderkey 3, index 0)

Index  B1  B2  B3
0      0   0   1
1      1   0   0
2      1   1   0
3      0   1   0

Lineitem Partitions with False Drops

Index = (orderkey + 1) mod 4
Bitmaps: B1 = 0110, B2 = 0011, B3 = 1000

• A Lineitem tuple is written to every partition whose bitmap has the bit at the tuple's index set

Lineitem1

orderkey  partkey
1         1
1         2
4         5
4         6
5         4        False Drop

Lineitem2

orderkey  partkey
1         1        False Drop
1         2        False Drop
2         3
2         4
5         4

Lineitem3

orderkey  partkey
3         1
3         8
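The bitmap mapping and the resulting false drops can be reproduced with a short Python sketch. It is an illustration of the example above under stated assumptions, not the thesis code: bitmaps are plain lists of 4 bits, and the false-drop counter looks directly into the Orders partitions, which is done here only to check the example (a real system would discover false drops when a probe finds no match). The names `partition_orders` and `partition_lineitem` are hypothetical.

```python
BITMAP_BITS = 4

ORDERS = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]          # (orderkey, custkey)
LINEITEM = [(1, 1), (1, 2), (2, 3), (2, 4), (3, 1),
            (3, 8), (4, 5), (4, 6), (5, 4)]                 # (orderkey, partkey)


def partition_of(custkey, n=3):
    """Hash function from the slides: ((custkey - 1) mod n) + 1."""
    return ((custkey - 1) % n) + 1


def bitmap_index(orderkey):
    """Bitmap position from the slides: (orderkey + 1) mod 4."""
    return (orderkey + 1) % BITMAP_BITS


def partition_orders(orders, n=3):
    """Partition Orders on custkey and record each orderkey in that partition's bitmap."""
    parts = {p: [] for p in range(1, n + 1)}
    bitmaps = {p: [0] * BITMAP_BITS for p in range(1, n + 1)}
    for orderkey, custkey in orders:
        p = partition_of(custkey, n)
        parts[p].append((orderkey, custkey))
        bitmaps[p][bitmap_index(orderkey)] = 1
    return parts, bitmaps


def partition_lineitem(lineitem, order_parts, bitmaps):
    """Send each Lineitem tuple to every partition whose bitmap bit is set."""
    parts = {p: [] for p in bitmaps}
    false_drops = 0
    for orderkey, partkey in lineitem:
        idx = bitmap_index(orderkey)
        for p, bits in bitmaps.items():
            if bits[idx]:
                parts[p].append((orderkey, partkey))
                # Checking only: a copy with no matching order is a false drop.
                if all(o != orderkey for o, _ in order_parts[p]):
                    false_drops += 1
    return parts, false_drops


order_parts, bitmaps = partition_orders(ORDERS)
bitmap_strings = {p: "".join(str(b) for b in bits) for p, bits in bitmaps.items()}
lineitem_parts, false_drops = partition_lineitem(LINEITEM, order_parts, bitmaps)
print(bitmap_strings)
print(lineitem_parts)
print("false drops:", false_drops)
```

Orderkeys 1 and 5 share bitmap index 2, so their Lineitem tuples are copied into both partition 1 and partition 2; the copies without a matching order are the three false drops marked in the example.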

Joining the Partitions

Customer1

custkey
1

Orders1

orderkey  custkey
1         1
4         1

Lineitem1

orderkey  partkey
1         1
1         2
4         5
4         6
5         4        False Drop

• The false drop (orderkey 5) finds no matching tuple in Orders1 and produces no result

Results

custkey  orderkey  partkey
1        1         1
1        1         2
1        4         5
1        4         6
2        2         3
2        2         4
2        5         4
3        3         1
3        3         8

SHARP

• Limited to star joins
  – Looks like a star
  – All tables related to a central table

Star schema

Fact (key, a_key, b_key, c_key, d_key, e_key)
A (a_key, data)
B (b_key, data)
C (c_key, data)
D (d_key, data)
E (e_key, data)

SHARP Example

Customer

id  name
1   Bob
2   Joe
3   Greg
4   Susan

Product

id  name
1   Hammer
2   Drill
3   Screwdriver
4   Scissors
5   Toolbox
6   Knife

Saleitem

c_id  p_id
1     1
1     2
2     3
2     6
3     1
3     5
2     5
4     1
3     6

SELECT * FROM Customer c, Product p, Saleitem s
WHERE c.id = s.c_id AND p.id = s.p_id;

SHARP Example Partitions

Customer Partitions

Hash Function: partition = ((id - 1) mod 2) + 1

Partition  id  name
Customer1  1   Bob
Customer1  3   Greg
Customer2  2   Joe
Customer2  4   Susan

Product Partitions

Hash Function: partition = ((id - 1) mod 3) + 1

Partition  id  name
Product1   1   Hammer
Product1   4   Scissors
Product2   2   Drill
Product2   5   Toolbox
Product3   3   Screwdriver
Product3   6   Knife

Saleitem Partitions

Saleitem is partitioned on both c_id and p_id. Tuples are shown as (c_id, p_id).

                c_id mod 2 = 1              c_id mod 2 = 0
p_id mod 3 = 1  Saleitem1,1: (1,1) (3,1)    Saleitem2,1: (4,1)
p_id mod 3 = 2  Saleitem1,2: (1,2) (3,5)    Saleitem2,2: (2,5)
p_id mod 3 = 0  Saleitem1,3: (3,6)          Saleitem2,3: (2,3) (2,6)

SHARP Partition Combinations

• Customer1, Product1, and Saleitem1,1

For each partition i of Customer
    For each partition j of Product
        probe with partition i,j of Saleitem
        output matches between Customeri, Productj, and Saleitemi,j

SHARP Join

Step 1: Customer1, Product1, and Saleitem1,1

Customer1: (1, Bob), (3, Greg)
Product1: (1, Hammer), (4, Scissors)
Saleitem1,1: (1,1), (3,1)

Results so far

c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer

Step 2: Customer1, Product2, and Saleitem1,2

Customer1: (1, Bob), (3, Greg)
Product2: (2, Drill), (5, Toolbox)
Saleitem1,2: (1,2), (3,5)

Results after all partition combinations

c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer
1     Bob     2     Drill
3     Greg    5     Toolbox
3     Greg    6     Knife
4     Susan   1     Hammer
2     Joe     5     Toolbox
2     Joe     3     Screwdriver
2     Joe     6     Knife
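The SHARP example can be sketched in Python as follows. This is a minimal illustration on the toy data, not the thesis implementation: partitions are in-memory dictionaries and lists, and the dimension tables are assumed to have unique ids so each can be stored as an id-to-name dictionary. The name `sharp_join` and its parameters are hypothetical.

```python
CUSTOMER = [(1, "Bob"), (2, "Joe"), (3, "Greg"), (4, "Susan")]       # (id, name)
PRODUCT = [(1, "Hammer"), (2, "Drill"), (3, "Screwdriver"),
           (4, "Scissors"), (5, "Toolbox"), (6, "Knife")]            # (id, name)
SALEITEM = [(1, 1), (1, 2), (2, 3), (2, 6), (3, 1),
            (3, 5), (2, 5), (4, 1), (3, 6)]                          # (c_id, p_id)


def partition_of(key, n):
    """Hash function from the slides: ((key - 1) mod n) + 1."""
    return ((key - 1) % n) + 1


def sharp_join(customers, products, saleitems, num_cust_parts=2, num_prod_parts=3):
    """Star join: partition each dimension once and the fact table on both keys."""
    # Dimension partitions, stored as id -> name (assumes unique ids).
    cust_parts = {i: {} for i in range(1, num_cust_parts + 1)}
    for cid, name in customers:
        cust_parts[partition_of(cid, num_cust_parts)][cid] = name
    prod_parts = {j: {} for j in range(1, num_prod_parts + 1)}
    for pid, name in products:
        prod_parts[partition_of(pid, num_prod_parts)][pid] = name

    # Fact table partitioned on the pair (customer partition, product partition).
    sale_parts = {(i, j): []
                  for i in range(1, num_cust_parts + 1)
                  for j in range(1, num_prod_parts + 1)}
    for c_id, p_id in saleitems:
        i = partition_of(c_id, num_cust_parts)
        j = partition_of(p_id, num_prod_parts)
        sale_parts[(i, j)].append((c_id, p_id))

    # For each Customer partition i and Product partition j, probe with Saleitem i,j.
    results = []
    for i in range(1, num_cust_parts + 1):
        for j in range(1, num_prod_parts + 1):
            for c_id, p_id in sale_parts[(i, j)]:
                if c_id in cust_parts[i] and p_id in prod_parts[j]:
                    results.append((c_id, cust_parts[i][c_id],
                                    p_id, prod_parts[j][p_id]))
    return results, sale_parts


sharp_results, sale_parts = sharp_join(CUSTOMER, PRODUCT, SALEITEM)
for row in sharp_results:
    print(row)
```

Each Saleitem tuple is written to exactly one partition, and each partition combination needs only one Customer partition and one Product partition in memory at a time.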

Multi-Way Join Summary

Algorithm               Relevant Queries
Hash Teams              Any query performing an inner join on identical
                        attributes in all relations.
Generalized Hash Teams  Any query performing an inner join on direct and
                        indirect attributes. Requires extra memory for
                        indirect queries.
SHARP                   Only star queries.

Thesis Questions

• The study seeks to answer the following questions:
  Q1: Does Hash Teams provide an advantage over DHJ?
  Q2: Does Generalized Hash Teams provide an advantage over DHJ?
  Q3: Does SHARP provide an advantage over DHJ?
  Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?

Multi-Way Join Implementation

• Performance is implementation dependent
• Multiple implementations were created
  – PostgreSQL http://www.postgresql.org/
  – Standalone C++
  – Verified the results in another environment

Experimental Results

PostgreSQL Results

• All experiments were performed by comparing the multi-way join against the built-in hash join
• The built-in hash join is a Hybrid Hash Join (HHJ)
• Data was based on 10GB TPC-H benchmark data
  – Generated using Microsoft’s TPC-H generator
  – ftp.research.microsoft.com/users/viveknar/tpcdskew

TPC-H Relations

Relation  Tuple Size  Number of Tuples  Relation Size
Customer  194 Bytes   1.5 Million       284 MB
Supplier  184 Bytes   100,000           18 MB
Part      173 Bytes   2 Million         323 MB
Orders    147 Bytes   15 Million        2097 MB
PartSupp  182 Bytes   8 Million         1392 MB
Lineitem  162 Bytes   60 Million        9270 MB

Hash Teams in PostgreSQL

• Performed 3-way join on the Orders relation using direct partitioning

[Figure: Time. Time (seconds) vs. Memory Size (MB) for Hash Teams and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB) for Hash Teams and HHJ]

Generalized Hash Teams in PostgreSQL

• Indirect partitioning with a join on Customer, Orders, and Lineitem
• Tested using multiple mappers
  – Bitmap
  – Exact

Generalized Hash Teams in PostgreSQL

[Figure: Time. Time (seconds) vs. Memory Size (MB) for GHT Exact, GHT Bitmap, and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB) for GHT Exact, GHT Bitmap, and HHJ]

SHARP in PostgreSQL

• Star join using Part, Orders, and Lineitem

[Figure: Time. Time (seconds) vs. Memory Size (MB) for SHARP and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB) for SHARP and HHJ]

Standalone C++ Results

• Uses same TPC-H data as the PostgreSQL experiments

Standalone C++ Hash Teams

• Performed 3-way join on the Orders relation using direct partitioning

[Figure: Time. Time (seconds) vs. Memory Size (MB) for DHJ Left, DHJ Right, and Hash Teams]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB) for DHJ Left, DHJ Right, and Hash Teams]

Standalone C++ Generalized Hash Teams

• Indirect partitioning with a join on Customer, Orders, and Lineitem
• Tested using bitmap mapper
• Tested GHT by
  – Not counting mapper memory
  – Counting mapper memory for small memory sizes
  – Varying the amount of memory available for the mapper

GHT Map Memory Not Counted

[Figure: Time. Time (seconds) vs. Memory Size (MB) for DHJ Left, DHJ Right, and GHT]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB)]

GHT at Small Memory Sizes

[Figure: Time. Time (seconds) vs. Memory Size (MB)]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB)]

GHT and Bitmap Size

[Figure: False Drops. False Drops (millions) vs. Bitmap Size Multiplier]
[Figure: Time. Time (seconds) vs. Bitmap Size Multiplier]

Standalone C++ SHARP

• Star join using Part, Orders, and Lineitem

[Figure: Time. Time (seconds) vs. Memory Size (MB) for SHARP and DHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB) for SHARP and DHJ]

Conclusions

Thesis Questions

Q1: Does Hash Teams provide an advantage over DHJ?
Q2: Does Generalized Hash Teams provide an advantage over DHJ?
Q3: Does SHARP provide an advantage over DHJ?
Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?

Does Hash Teams provide an advantage over DHJ?

• Yes
  – Performs fewer I/Os than DHJ
  – Evaluates queries faster
  – Uses memory more efficiently
  – Performs fewer partitioning steps

• Queries that can use Hash Teams are very limited in practice
• In many cases a traditional sort-merge join would be more efficient
• Hash Teams is much more complex to implement and maintain

Does Generalized Hash Teams provide an advantage over DHJ?

• Sometimes
  – When GHT performs fewer I/Os

• Performance is poor when there are many false drops
• Much more complex than DHJ or Hash Teams
• Mapper can hurt performance

Does SHARP provide an advantage over DHJ?

• Yes
  – Performs fewer I/Os
  – Evaluates queries faster
  – Uses memory more efficiently
  – Performs fewer partitioning steps

• Limited to star queries
• More complex to implement and maintain

Should these algorithms be implemented in a relational database system?

• Hash Teams should not be implemented.
  – Too limited in use
  – Microsoft removed support for Hash Teams from SQL Server 2003

• Generalized Hash Teams should not be implemented.
  – GHT can be much slower than DHJ
  – Mapper makes GHT much more complex to implement and maintain

• SHARP should be implemented.
  – Shows a significant performance advantage
  – Star queries are commonly used in data warehousing

Future Work

• Experiments with the algorithms on different data sets
• Experiments with larger numbers of relations
• Extend Hash Teams and GHT implementations to support GROUP BY to see if it makes them more useful

Thank You


Appendix

TPC-H Relations: http://www.tpc.org/tpch/