2
Multi-Way Hash Join Effectiveness
M.Sc. Thesis
Michael Henderson
Supervisor Dr. Ramon Lawrence
3
Outline
• Motivation
• Database Terminology
• Background
• Joins
• Multi-Way Joins
• Thesis Questions
• Experimental Results
• Conclusions
4
Motivation
• Data is everywhere
• Governments collect data on citizens
• Facebook collects data on over 1 billion people
• Wal-Mart and Target collect sales data on all their customers
• The goal is to make answering the big questions
– Possible
– Faster
5
Database Terminology: Relations (Tables)
Part Relation
partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50

Lineitem Relation
linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50

Labels: a Tuple/Row is one line of a relation; an Attribute/Column is one named field; the header line holds the Attribute Names.
The tables are related through their partkey attributes.
6
Database Terminology II: SQL
• Structured Query Language
• Used to ask the database questions about the data
• Standardized
• Example: SQL for retrieving all rows from the Part table
SELECT * FROM Part;
7
Database Terminology III: Join
• Joins are used to combine the data in database tables
• Joins are slow
• We want joins to be faster
8
Background
9
What Makes Queries Slow?
• All the data must be read to give an accurate answer
• Data is usually much larger than what can fit in memory
• Operations such as filtering, ordering, and joins are costly
• A join is especially costly
– May need to match every row in two tables: O(n²)
– May need to perform many slow disk operations (I/Os)
10
Background: Example Join Query

SQL
SELECT * FROM Part p, Lineitem l
WHERE p.partkey = l.partkey;

Relational Algebra
Part ⋈ Lineitem, with join condition p.partkey = l.partkey

Join Results
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
11
Nested Loop Join

Each Part row is compared against every Lineitem row; every pair with matching partkey values is added to Results.

Part
partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50

Lineitem
linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50

Results
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
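The nested loop join above can be sketched in a few lines of Python. This is only an illustration over the slide's sample rows, not the thesis implementation; the names `PART`, `LINEITEM`, and `nested_loop_join` are assumptions made for the example.

```python
# Illustrative nested loop join over the sample Part and Lineitem rows.
# All names here are hypothetical; rows are plain tuples.

PART = [  # (partkey, name, retailprice)
    (1, "Box", 0.50),
    (2, "Hat", 25.00),
    (3, "Bottle", 2.50),
]
LINEITEM = [  # (linenumber, partkey, quantity, saleprice)
    (1, 1, 1, 0.50),
    (2, 1, 1, 0.50),
    (3, 2, 3, 22.50),
    (4, 3, 15, 2.50),
]


def nested_loop_join(part, lineitem):
    """Compare every Part row with every Lineitem row.

    The work grows with len(part) * len(lineitem), which is the
    O(n^2) behaviour mentioned earlier in the slides.
    """
    results = []
    for p in part:                # outer loop over Part
        for l in lineitem:        # inner loop over Lineitem
            if p[0] == l[1]:      # p.partkey = l.partkey
                results.append(p + l)
    return results


RESULTS = nested_loop_join(PART, LINEITEM)
for row in RESULTS:
    print(row)
```

With three Part rows and four Lineitem rows the loop performs twelve comparisons to produce four result rows, which is why the approach scales poorly for large tables.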
12
Dynamic Hash Join

Part
partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50

Hash Function: partition = ((partkey - 1) mod 3) + 1
partkey 1: ((1 - 1) mod 3) + 1 = 1
partkey 2: ((2 - 1) mod 3) + 1 = 2
partkey 3: ((3 - 1) mod 3) + 1 = 3

Three Part Partitions
Part1: 1 Box 0.50
Part2: 2 Hat 25.00
Part3: 3 Bottle 2.50

Saved to disk
13
Dynamic Hash Join

Part1
partkey  name  retailprice
1        Box   0.50

Lineitem
linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50

Hash Function: partition = ((partkey - 1) mod 3) + 1
partkey 1: ((1 - 1) mod 3) + 1 = 1
partkey 2: ((2 - 1) mod 3) + 1 = 2
partkey 3: ((3 - 1) mod 3) + 1 = 3

Three Lineitem Partitions
Partition 1: linenumbers 1 and 2 (joined with Part1)
Lineitem2: linenumber 3
Lineitem3: linenumber 4

Results
partkey  name  retailprice  linenumber  partkey  quantity  saleprice
1        Box   0.50         1           1        1         0.50
1        Box   0.50         2           1        1         0.50
14
Dynamic Hash Join

Part2
partkey  name  retailprice
2        Hat   25.00

Lineitem2
linenumber  partkey  quantity  saleprice
3           2        3         22.50

Joining Part2 with Lineitem2 adds:
2        Hat     25.00        3           2        3         22.50

Results (after all three partitions)
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
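The partition-then-join idea on the Dynamic Hash Join slides can be sketched as follows. This is a simplified illustration only: it keeps every partition in Python lists and ignores the memory management and disk flushing that a real dynamic hash join performs. The names `partition_of`, `partition_rows`, and `partitioned_hash_join` are assumptions made for the example.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # the slides use three partitions

PART = [(1, "Box", 0.50), (2, "Hat", 25.00), (3, "Bottle", 2.50)]
LINEITEM = [(1, 1, 1, 0.50), (2, 1, 1, 0.50), (3, 2, 3, 22.50), (4, 3, 15, 2.50)]


def partition_of(partkey):
    """Slide hash function: ((partkey - 1) mod 3) + 1."""
    return ((partkey - 1) % NUM_PARTITIONS) + 1


def partition_rows(rows, key_index):
    """Split rows into partitions by hashing the join attribute."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_index])].append(row)
    return parts


def partitioned_hash_join(part, lineitem):
    """Join Part and Lineitem one partition pair at a time."""
    part_parts = partition_rows(part, 0)       # partkey is column 0 of Part
    line_parts = partition_rows(lineitem, 1)   # partkey is column 1 of Lineitem
    results = []
    for i in range(1, NUM_PARTITIONS + 1):
        # Build an in-memory hash table on the Part partition...
        table = defaultdict(list)
        for p in part_parts.get(i, []):
            table[p[0]].append(p)
        # ...and probe it with the matching Lineitem partition.
        for l in line_parts.get(i, []):
            for p in table.get(l[1], []):
                results.append(p + l)
    return results


RESULTS = partitioned_hash_join(PART, LINEITEM)
```

Only rows that hash to the same partition are ever compared, so each Lineitem row is probed against a small hash table instead of the whole Part relation.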
15
Join Three Tables

SELECT A.a_key, B.b_key, C.c_key FROM A, B, C
WHERE A.a_key = B.a_key AND A.a_key = C.a_key;

[Diagram: two binary join trees over A, B, and C using the conditions A.a_key = B.a_key and A.a_key = C.a_key, labeled Left Deep Plan and Right Deep Plan]
16
Multi-way Hash Joins
• Joins multiple relations at the same time
• Shares memory across the entire join
• Produces a result by combining tuples from all relations
• Does not have to repartition intermediate results
• Fewer disk operations

[Diagram: Multi-way Plan, a single join of A, B, and C on A.a_key = B.a_key and A.a_key = C.a_key]
17
Hash Teams
• Multi-way hash join
• Hash Teams joins relations on a common attribute
18
Hash Teams Example
A
a_key
1
2
3

B
b_key  a_key
1      1
2      2
3      3
4      1
5      2

C
c_key  a_key
1      3
2      1
3      2
4      2
5      1

SELECT A.a_key, B.b_key, C.c_key FROM A, B, C
WHERE A.a_key = B.a_key AND A.a_key = C.a_key;
19
Partitioning A and B

A
a_key
1
2
3

B
b_key  a_key
1      1
2      2
3      3
4      1
5      2

Hash Function: partition = ((a_key - 1) mod 3) + 1
a_key 1: ((1 - 1) mod 3) + 1 = 1
a_key 2: ((2 - 1) mod 3) + 1 = 2
a_key 3: ((3 - 1) mod 3) + 1 = 3

Partitions
A1: a_key 1
A2: a_key 2
A3: a_key 3
B1, B2, B3: (b_key, a_key), not yet filled
20
Partitioning A and B

Hash Function: partition = ((a_key - 1) mod 3) + 1
a_key 1: ((1 - 1) mod 3) + 1 = 1
a_key 2: ((2 - 1) mod 3) + 1 = 2
a_key 3: ((3 - 1) mod 3) + 1 = 3

Partitions
A1: a_key 1
A2: a_key 2
A3: a_key 3

B1
b_key  a_key
1      1
4      1

B2
b_key  a_key
2      2
5      2

B3
b_key  a_key
3      3
21
Processing C

Hash Function: partition = ((a_key - 1) mod 3) + 1

A1
a_key
1

B1
b_key  a_key
1      1
4      1

C
c_key  a_key
1      3
2      1
3      2
4      2
5      1

Disk Partitions
C2: (c_key, a_key)
C3: (c_key, a_key)

Results (after C tuple (2, 1) is matched against A1 and B1)
a_key  b_key  c_key
1      1      2
1      4      2
22
Processing C

Hash Function: partition = ((a_key - 1) mod 3) + 1

A1
a_key
1

B1
b_key  a_key
1      1
4      1

C
c_key  a_key
1      3
2      1
3      2
4      2
5      1

Disk Partitions
C2
c_key  a_key
3      2
4      2

C3
c_key  a_key
1      3

Results (before C tuple (5, 1) is processed)
a_key  b_key  c_key
1      1      2
1      4      2

Results (final)
a_key  b_key  c_key
1      1      2
1      4      2
1      1      5
1      4      5
2      2      3
2      5      3
2      2      4
2      5      4
3      3      1
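The Hash Teams example can be sketched as a three-way join in which A, B, and C are all partitioned on a_key and each partition is joined in a single step, with no intermediate result being repartitioned. This is a simplified in-memory illustration, not the thesis code; the function names are assumptions made for the example.

```python
from collections import defaultdict

N = 3  # number of partitions used on the slides

A = [1, 2, 3]                                   # a_key
B = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]    # (b_key, a_key)
C = [(1, 3), (2, 1), (3, 2), (4, 2), (5, 1)]    # (c_key, a_key)


def partition_of(a_key):
    """Slide hash function: ((a_key - 1) mod 3) + 1."""
    return ((a_key - 1) % N) + 1


def hash_team_join(a_rows, b_rows, c_rows):
    """Partition all three inputs on a_key, then join partition by partition."""
    a_parts = defaultdict(list)
    b_parts = defaultdict(list)
    c_parts = defaultdict(list)
    for a_key in a_rows:
        a_parts[partition_of(a_key)].append(a_key)
    for b in b_rows:
        b_parts[partition_of(b[1])].append(b)
    for c in c_rows:
        c_parts[partition_of(c[1])].append(c)

    results = []
    for i in range(1, N + 1):
        a_keys = set(a_parts[i])
        b_by_key = defaultdict(list)
        for b_key, a_key in b_parts[i]:
            b_by_key[a_key].append(b_key)
        # Probe with C: each C tuple is combined with all matching A and B tuples.
        for c_key, a_key in c_parts[i]:
            if a_key in a_keys:
                for b_key in b_by_key[a_key]:
                    results.append((a_key, b_key, c_key))
    return results


RESULTS = hash_team_join(A, B, C)
```

Because every input shares the same partitioning attribute, tuples that can join always land in the same partition number, which is the property Hash Teams relies on.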
23
Generalized Hash Teams (GHT)
• Extends Hash Teams
• Does not need the join attributes to be the same
• Uses indirect partitioning
• Needs an in-memory map to indirectly join relations
24
GHT Partition Maps
• Uses join memory
• Uses a bitmap to approximate the mapping and reduce memory requirements
• Needs a bitmap for each partition
• Bitmaps introduce mapping errors that cause tuples to be mapped to multiple partitions (false drops)
• False drops add I/O and processing cost
25
GHT Example

SELECT c.custkey, o.orderkey, l.partkey FROM Customer c, Orders o, Lineitem l
WHERE c.custkey = o.custkey AND o.orderkey = l.orderkey;

Customer
custkey
1
2
3

Orders
orderkey  custkey
1         1
2         2
3         3
4         1
5         2

Lineitem
orderkey  partkey
1         1
1         2
2         3
2         4
3         1
3         8
4         5
4         6
5         4
26
GHT Customer Partitions

Customer1: custkey 1
Customer2: custkey 2
Customer3: custkey 3

Hash Function: partition = ((custkey - 1) mod 3) + 1
27
Orders Partitions and Bitmap

Orders
orderkey  custkey
1         1
2         2
3         3
4         1
5         2

Hash Function: partition = ((custkey - 1) mod 3) + 1
Index = (orderkey + 1) mod 4

Orders1
orderkey  custkey
1         1
4         1

Orders2
orderkey  custkey
2         2
5         2

Orders3
orderkey  custkey
3         3

Bitmaps (one per partition; bit positions 0 to 3, left to right)
B1: 0000 → 0010 → 0110
B2: 0000 → 0001 → 0011
B3: 0000 → 1000
28
Orders Partitions and Bitmap
Index  B1  B2  B3
0      0   0   1
1      1   0   0
2      1   1   0
3      0   1   0

B1: 0110
B2: 0011
B3: 1000
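The bitmap construction on the two slides above can be reproduced with a short sketch. It assumes 4-bit bitmaps, the slide's index function (orderkey + 1) mod 4, and bit positions counted from the left; the helper names are hypothetical.

```python
N = 3            # number of partitions
BITMAP_BITS = 4  # bitmap length used on the slides

ORDERS = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]  # (orderkey, custkey)


def partition_of(custkey):
    """Slide hash function: ((custkey - 1) mod 3) + 1."""
    return ((custkey - 1) % N) + 1


def bit_index(orderkey):
    """Slide index function: (orderkey + 1) mod 4."""
    return (orderkey + 1) % BITMAP_BITS


def build_bitmaps(orders):
    """Set one bit per Orders tuple in the bitmap of its partition."""
    bitmaps = {i: [0] * BITMAP_BITS for i in range(1, N + 1)}
    for orderkey, custkey in orders:
        bitmaps[partition_of(custkey)][bit_index(orderkey)] = 1
    return bitmaps


def as_string(bits):
    """Render a bitmap with bit position 0 on the left, as on the slides."""
    return "".join(str(b) for b in bits)


BITMAPS = build_bitmaps(ORDERS)
for i in sorted(BITMAPS):
    print(f"B{i}: {as_string(BITMAPS[i])}")
```

Orderkeys 1 and 5 both map to bit index 2 but belong to different partitions, so bit 2 ends up set in both B1 and B2. That collision is what later produces false drops.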
29
Lineitem Partitions with False Drops

Lineitem
orderkey  partkey
1         1
1         2
2         3
2         4
3         1
3         8
4         5
4         6
5         4

Index = (orderkey + 1) mod 4
B1: 0110
B2: 0011
B3: 1000

Lineitem1
orderkey  partkey
1         1
1         2
4         5
4         6
5         4      (False Drop)

Lineitem2
orderkey  partkey
1         1      (False Drop)
1         2      (False Drop)
2         3
2         4
5         4

Lineitem3
orderkey  partkey
3         1
3         8
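How the bitmaps route Lineitem tuples, including the false drops, can be sketched as below. The bitmaps are copied from the slide as strings; `route_lineitem` is a hypothetical helper name, and the sketch keeps partitions in memory rather than writing them to disk.

```python
# Bitmaps from the slide, bit position 0 on the left.
BITMAPS = {1: "0110", 2: "0011", 3: "1000"}

LINEITEM = [  # (orderkey, partkey)
    (1, 1), (1, 2), (2, 3), (2, 4), (3, 1),
    (3, 8), (4, 5), (4, 6), (5, 4),
]


def bit_index(orderkey):
    """Slide index function: (orderkey + 1) mod 4."""
    return (orderkey + 1) % 4


def route_lineitem(lineitem, bitmaps):
    """Copy each Lineitem tuple to every partition whose bitmap has its bit set."""
    parts = {i: [] for i in bitmaps}
    for row in lineitem:
        idx = bit_index(row[0])
        for i, bits in bitmaps.items():
            if bits[idx] == "1":
                parts[i].append(row)
    return parts


PARTS = route_lineitem(LINEITEM, BITMAPS)

# Every extra copy beyond one per tuple is a false drop.
FALSE_DROPS = sum(len(rows) for rows in PARTS.values()) - len(LINEITEM)
print(FALSE_DROPS)
```

In this example nine Lineitem tuples become twelve partitioned tuples, so three of the writes are false drops that cost extra I/O and are filtered out only when the partitions are joined.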
30
Joining the Partitions

Customer1
custkey
1

Orders1
orderkey  custkey
1         1
4         1

Lineitem1
orderkey  partkey
1         1
1         2
4         5
4         6
5         4      (False Drop: no order with orderkey 5 in Orders1, so the tuple is discarded)

Results (all partitions)
custkey  orderkey  partkey
1        1         1
1        1         2
1        4         5
1        4         6
2        2         3
2        2         4
2        5         4
3        3         1
3        3         8
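Joining one GHT partition can be sketched as follows, using the partition 1 contents from the slide. The sketch assumes orderkey is unique within Orders (as it is in the sample data), so a dictionary keyed on orderkey is enough; `join_partition` is a hypothetical name.

```python
CUSTOMER1 = [1]                                        # custkey
ORDERS1 = [(1, 1), (4, 1)]                             # (orderkey, custkey)
LINEITEM1 = [(1, 1), (1, 2), (4, 5), (4, 6), (5, 4)]   # (orderkey, partkey)


def join_partition(customers, orders, lineitems):
    """Join Customer, Orders, and Lineitem tuples of a single partition.

    Lineitem tuples that reached this partition only because of a bitmap
    collision (false drops) find no matching order and are skipped.
    """
    cust = set(customers)
    # Assumption: orderkey is unique, so one custkey per orderkey.
    orders_by_key = {
        orderkey: custkey for orderkey, custkey in orders if custkey in cust
    }
    results = []
    false_drops = []
    for orderkey, partkey in lineitems:
        custkey = orders_by_key.get(orderkey)
        if custkey is None:
            false_drops.append((orderkey, partkey))
            continue
        results.append((custkey, orderkey, partkey))
    return results, false_drops


RESULTS, DROPPED = join_partition(CUSTOMER1, ORDERS1, LINEITEM1)
```

The false drop does not affect correctness, since it is rejected during the probe, but it was still written to and read back from the partition, which is the extra cost the slides refer to.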
31
SHARP
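• Limited to star joins
– Looks like a star
– All tables related to a central table

[Diagram: star schema with a central Fact table surrounded by tables A, B, C, D, and E]
Fact: key, a_key, b_key, c_key, d_key, e_key
A: a_key, data
B: b_key, data
C: c_key, data
D: d_key, data
E: e_key, data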
32
SHARP Example
Customer
id  name
1   Bob
2   Joe
3   Greg
4   Susan

Product
id  name
1   Hammer
2   Drill
3   Screwdriver
4   Scissors
5   Toolbox
6   Knife

Saleitem
c_id  p_id
1     1
1     2
2     3
2     6
3     1
3     5
2     5
4     1
3     6

SELECT * FROM Customer c, Product p, Saleitem s
WHERE c.id = s.c_id AND p.id = s.p_id;
33
SHARP Example Partitions
Customer
id  name
1   Bob
2   Joe
3   Greg
4   Susan

Hash Function: partition = ((id - 1) mod 2) + 1

Customer1
id  name
1   Bob
3   Greg

Customer2
id  name
2   Joe
4   Susan
34
SHARP Example Partitions
Product
id  name
1   Hammer
2   Drill
3   Screwdriver
4   Scissors
5   Toolbox
6   Knife

Hash Function: partition = ((id - 1) mod 3) + 1

Product1
id  name
1   Hammer
4   Scissors

Product2
id  name
2   Drill
5   Toolbox

Product3
id  name
3   Screwdriver
6   Knife
35
SHARP Example Partitions
Saleitem
c_id  p_id
1     1
1     2
2     3
2     6
3     1
3     5
2     5
4     1
3     6

Saleitem partitions (c_id, p_id), by Customer partition (columns) and Product partition (rows)

                 c_id mod 2 = 1              c_id mod 2 = 0
p_id mod 3 = 1   Saleitem1,1: (1,1) (3,1)    Saleitem2,1: (4,1)
p_id mod 3 = 2   Saleitem1,2: (1,2) (3,5)    Saleitem2,2: (2,5)
p_id mod 3 = 0   Saleitem1,3: (3,6)          Saleitem2,3: (2,3) (2,6)
36
SHARP Partition Combinations
• Customer1, Product1, and Saleitem1,1
• Customer1, Product2, and Saleitem1,2
• Customer1, Product3, and Saleitem1,3
• Customer2, Product1, and Saleitem2,1
• Customer2, Product2, and Saleitem2,2
• Customer2, Product3, and Saleitem2,3

For each partition i of Customer
    For each partition j of Product
        probe with partition i,j of Saleitem
        output matches between Customeri, Productj, and Saleitemi,j
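The partition-combination loop above can be sketched in Python over the sample Customer, Product, and Saleitem data. This is an in-memory illustration under the slide's hash functions, not the thesis implementation; `cust_part`, `prod_part`, and `sharp_join` are hypothetical names.

```python
from collections import defaultdict

CUSTOMER = {1: "Bob", 2: "Joe", 3: "Greg", 4: "Susan"}
PRODUCT = {1: "Hammer", 2: "Drill", 3: "Screwdriver",
           4: "Scissors", 5: "Toolbox", 6: "Knife"}
SALEITEM = [(1, 1), (1, 2), (2, 3), (2, 6), (3, 1),
            (3, 5), (2, 5), (4, 1), (3, 6)]  # (c_id, p_id)


def cust_part(c_id):
    """Customer hash function: ((id - 1) mod 2) + 1."""
    return ((c_id - 1) % 2) + 1


def prod_part(p_id):
    """Product hash function: ((id - 1) mod 3) + 1."""
    return ((p_id - 1) % 3) + 1


def sharp_join(customer, product, saleitem):
    """Partition each dimension once and the fact table on both attributes."""
    cust_parts = defaultdict(dict)
    for c_id, name in customer.items():
        cust_parts[cust_part(c_id)][c_id] = name
    prod_parts = defaultdict(dict)
    for p_id, name in product.items():
        prod_parts[prod_part(p_id)][p_id] = name
    sale_parts = defaultdict(list)
    for c_id, p_id in saleitem:
        sale_parts[(cust_part(c_id), prod_part(p_id))].append((c_id, p_id))

    results = []
    for i in (1, 2):              # each partition i of Customer
        for j in (1, 2, 3):       # each partition j of Product
            for c_id, p_id in sale_parts[(i, j)]:   # probe with Saleitem i,j
                if c_id in cust_parts[i] and p_id in prod_parts[j]:
                    results.append(
                        (c_id, cust_parts[i][c_id], p_id, prod_parts[j][p_id])
                    )
    return results


RESULTS = sharp_join(CUSTOMER, PRODUCT, SALEITEM)
```

Each Saleitem tuple is written to exactly one partition, so unlike the bitmap approach there are no false drops; the cost is that a dimension partition is needed for every combination it takes part in.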
37
SHARP Join

Customer1
id  name
1   Bob
3   Greg

Product1
id  name
1   Hammer
4   Scissors

Saleitem1,1
c_id  p_id
1     1
3     1

Results
c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer
38
SHARP Join

Customer1
id  name
1   Bob
3   Greg

Product2
id  name
2   Drill
5   Toolbox

Saleitem1,2
c_id  p_id
1     2
3     5

Results (before this combination)
c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer

Results (final, after all combinations)
c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer
1     Bob     2     Drill
3     Greg    5     Toolbox
3     Greg    6     Knife
4     Susan   1     Hammer
2     Joe     5     Toolbox
2     Joe     3     Screwdriver
2     Joe     6     Knife
39
Multi-Way Join Summary
Algorithm: Relevant Queries
Hash Teams: Any query performing an inner join on identical attributes in all relations.
Generalized Hash Teams: Any query performing an inner join on direct and indirect attributes. Requires extra memory for indirect queries.
SHARP: Only star queries.
40
Thesis Questions
• The study seeks to answer the following questions:
Q1: Does Hash Teams provide an advantage over DHJ?
Q2: Does Generalized Hash Teams provide an advantage over DHJ?
Q3: Does SHARP provide an advantage over DHJ?
Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?
41
Multi-Way Join Implementation
• Performance is implementation dependent
• Multiple implementations were created
– PostgreSQL http://www.postgresql.org/
– Standalone C++
– Verified the results in another environment
42
Experimental Results
43
PostgreSQL Results
• All experiments were performed by comparing the multi-way join against the built-in hash join
• Hybrid Hash Join (HHJ)
• Data was based on 10 GB TPC-H benchmark data
– Generated using Microsoft’s TPC-H generator
– ftp.research.microsoft.com/users/viveknar/tpcdskew
44
TPC-H Relations
Relation   Tuple Size   Number of Tuples   Relation Size
Customer   194 Bytes    1.5 Million        284 MB
Supplier   184 Bytes    100,000            18 MB
Part       173 Bytes    2 Million          323 MB
Orders     147 Bytes    15 Million         2097 MB
PartSupp   182 Bytes    8 Million          1392 MB
Lineitem   162 Bytes    60 Million         9270 MB
45
Hash Teams in PostgreSQL
• Performed 3-way join on the Orders relation using direct partitioning
[Figure: Time. Time (Seconds) vs. Memory Size (MB, 0 to 3000) for Hash Teams and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 3000) for Hash Teams and HHJ]
46
Generalized Hash Teams in PostgreSQL
• Indirect partitioning with a join on Customer, Orders, and Lineitem
• Tested using multiple mappers
– Bitmap
– Exact
47
Generalized Hash Teams in PostgreSQL
[Figure: Time. Time (Seconds) vs. Memory Size (MB, 0 to 3000) for GHT Exact, GHT Bitmap, and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 3000) for GHT Exact, GHT Bitmap, and HHJ]
48
SHARP in PostgreSQL
• Star join using Part, Orders, and Lineitem
[Figure: Time. Time (Seconds) vs. Memory Size (MB, 0 to 3000) for SHARP and HHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 3000) for SHARP and HHJ]
49
Standalone C++ Results
• Uses same TPC-H data as the PostgreSQL experiments
50
Standalone C++ Hash Teams
• Performed 3-way join on the Orders relation using direct partitioning
[Figure: Time. Time (Seconds) vs. Memory Size (MB, 0 to 5000) for DHJ Left, DHJ Right, and Hash Teams]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 5000) for DHJ Left, DHJ Right, and Hash Teams]
51
Standalone C++ Generalized Hash Teams
• Indirect partitioning with a join on Customer, Orders, and Lineitem
• Tested using bitmap mapper
• Tested GHT by
– Not counting mapper memory
– Counting mapper memory for small memory sizes
– Varying the amount of memory available for the mapper
52
GHT Map Memory Not Counted
[Figure: Time. Time (seconds) vs. Memory Size (MB, 0 to 5000) for DHJ Left, DHJ Right, and GHT]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 5000) for DHJ Left, DHJ Right, and GHT]
53
GHT at Small Memory Sizes
[Figure: Time. Time (seconds) vs. Memory Size (MB, 0 to 400) for DHJ Left, DHJ Right, and GHT]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 400) for DHJ Left, DHJ Right, and GHT]
54
GHT and Bitmap Size
[Figure: False Drops. False Drops (Millions) vs. Bitmap Size Multiplier (0.1, 0.25, 0.5, 1, 1.5, 2, 4)]
[Figure: Time. Time (seconds) vs. Bitmap Size Multiplier (0.1, 0.25, 0.5, 1, 1.5, 2, 4)]
55
Standalone C++ SHARP
• Star join using Part, Orders, and Lineitem
[Figure: Time. Time (seconds) vs. Memory Size (MB, 0 to 2500) for SHARP and DHJ]
[Figure: I/O Bytes. I/Os (MB) vs. Memory Size (MB, 0 to 2500) for SHARP and DHJ]
56
Conclusions
57
Thesis Questions
Q1: Does Hash Teams provide an advantage over DHJ?
Q2: Does Generalized Hash Teams provide an advantage over DHJ?
Q3: Does SHARP provide an advantage over DHJ?
Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?
58
Does Hash Teams provide an advantage over DHJ?
• Yes
– Performs fewer I/Os than DHJ
– Evaluates queries faster
– Uses memory more efficiently
– Performs fewer partitioning steps
• Queries that can use Hash Teams are very limited in practice
• In many cases a traditional sort-merge join would be more efficient
• Hash Teams is much more complex to implement and maintain
59
Does Generalized Hash Teams provide an advantage over DHJ?
• Sometimes
– When GHT performs fewer I/Os
• Performance is poor when there are many false drops
• Much more complex than DHJ or Hash Teams
• Mapper can hurt performance
60
Does SHARP provide an advantage over DHJ?
• Yes
– Performs fewer I/Os
– Evaluates queries faster
– Uses memory more efficiently
– Performs fewer partitioning steps
• Limited to star queries
• More complex to implement and maintain
61
Should these algorithms be implemented in a relational database system?
• Hash Teams should not be implemented.
– Too limited in use
– Microsoft removed support for Hash Teams from SQL Server 2003
• Generalized Hash Teams should not be implemented.
– GHT can be much slower than DHJ
– Mapper makes GHT much more complex to implement and maintain
• SHARP should be implemented.
– Shows a significant performance advantage
– Star queries are commonly used in data warehousing
62
Future Work
• Experiments with the algorithms on different data sets
• Experiments with larger numbers of relations
• Extend Hash Teams and GHT implementations to support GROUP BY to see if it makes them more useful
63
Thank You
64
Appendix
65
TPC-H Relations http://www.tpc.org/tpch/