Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Multi-Core, Main-Memory Joins: Sort vs. Hash
Revisited
Cagri Balkesen, Gustavo Alonso, Jens Teubner, M. Tamer Ozsu
Presented by William Qian
2020 April 166.886 Spring 2020
William Qian Sort vs. Hash 2020 April 16 1 / 32
Overview
1 Background
2 Parallel sort-merge joins
3 Parallel hash joins
4 Evaluation
5 Discussion
William Qian Sort vs. Hash 2020 April 16 2 / 32
Background
1 Background
2 Parallel sort-merge joins
3 Parallel hash joins
4 Evaluation
5 Discussion
William Qian Sort vs. Hash 2020 April 16 3 / 32
Background
Sort-merge joins
SELECT * FROM R, S WHERE F(R.key) = G(S.key)
Sort phase: sort R ’s keys according to F and S ’s keys according to GMerge phase: mergesort-style matching of keys from R and S
Works for any comparator
Requires sorting
Sorting is known to be parallelizable
Merging is much harder to parallelize
William Qian Sort vs. Hash 2020 April 16 4 / 32
Background
Hash joins
SELECT * FROM R, S WHERE F(R.key) = G(S.key)
Build phase: create base hashtable H from applying F to keys of RProbe phase: apply G to keys in S and find matches in H to join
Embarrassingly parallel
Requires lots of memory to store H
Frequently incurs cache misses for large tables
Requires equijoins (which are fairly common)
William Qian Sort vs. Hash 2020 April 16 5 / 32
Background
Non-uniform memory access
P1
P2
P3
P4
M1 M2
P1 can access M1 easily, but M2 is a little more costly
Lots of data movement to “farther” memory increasesbandwidth congestion
William Qian Sort vs. Hash 2020 April 16 6 / 32
Parallel sort-merge joins
1 Background
2 Parallel sort-merge joins
3 Parallel hash joins
4 Evaluation
5 Discussion
William Qian Sort vs. Hash 2020 April 16 7 / 32
Parallel sort-merge joins
Parallel run-generation
Sorting networks
Few data dependencies
No branching
Only sorts across vectors
e = min(a, b)
f = max(a, b)
g = min(c, d)
h = max(c, d)
i = min(e, g)
j = min(f, h)
w = min(e, g)
x = min(i, j)
y = max(i, j)
z = max(f, g)
William Qian Sort vs. Hash 2020 April 16 8 / 32
Parallel sort-merge joins
Parallel run-generation
5 28 1 13 20 8 2 15 12 14 19 17 6 7 22 21
5 7 1 13 6 8 2 15 12 14 19 17 20 28 22 21
5 6 12 20 7 8 14 28 1 2 19 22 13 15 17 21
a
b
Sorting network in (a) generates vectors sorted across positions
Shuffling in (b) transposes vectors so that each vector is sorted
William Qian Sort vs. Hash 2020 April 16 9 / 32
Parallel sort-merge joins
Parallel merge
Bitonic merge networks
Scales poorly
Used as a kernel sort
Adds branch predictions
Avoids scalar-vector registermovement
William Qian Sort vs. Hash 2020 April 16 10 / 32
Parallel sort-merge joins
Out-of-cache sorting
Multi-way merging
Two-way merge unitsconnected with FIFO buffers
External memory bandwidthonly at front of multi-waymerge tree
Helps combat NUMA
William Qian Sort vs. Hash 2020 April 16 11 / 32
Parallel sort-merge joins
Sort-merge: choose your fighter
m-way
NUMA-local partitions
Tables sorted symmetrically
Multiway merging for
Single-pass merge join
m-pass
Similar to m-pass
Two-way bitonic merginginstead of multiway merging
mpsm
Globally partitions & sortsone table
Partially sorts the other table
Keys in S are a subset of keysin R
First table merged w/ NUMAremote runs of second table
William Qian Sort vs. Hash 2020 April 16 12 / 32
Parallel hash joins
Radix partitioning
Problem: large hashtables result in many cache missesSolution: radix partitioning
Moves tuples to destination partitions (pages)
Reduces TLB miss effects during partitioning
TLB size limits the fan-out of the partitioning step
William Qian Sort vs. Hash 2020 April 16 13 / 32
Parallel hash joins
Software-managed buffers
Problem: radix partitioning is limited by TLB sizesSolution: buffer writes in cache
Extra copy step
TLB fetch only needed once every N tuples in a partition
More I/O reordering due to buffered writes & less TLB pressure
Cache line-sized buffers can enable blind writes, which are faster
William Qian Sort vs. Hash 2020 April 16 14 / 32
Parallel hash joins
Hash: choose your fighter
radix
Parallel radix-hash join
Partitioned according toradix-hash
Cache-local hash joins onpartition pairs
n-part
Emabarrassingly-parallelizedhash join
Tables sharded/striped acrossworkers
Build a shared hashtablebased on one table
Hash-and-match with thesecond table
William Qian Sort vs. Hash 2020 April 16 15 / 32
Evaluation
1 Background
2 Parallel sort-merge joins
3 Parallel hash joins
4 Evaluation
5 Discussion
William Qian Sort vs. Hash 2020 April 16 16 / 32
Evaluation
Setup
Benchmarks:
m-way (sort-merge)
m-pass (sort-merge)
mpsm (sort-merge)
radix (hash)
n-part (hash)
Workloads:
Column-store
4-byte keys and values, allintegers
Keys in S are a proper subsetof keys in R
Generally uniform keydistribution in S
William Qian Sort vs. Hash 2020 April 16 17 / 32
Evaluation
Environment
256-bit AVX (floating-point only)
64 threads = 4 sockets, 8 cores/socket, hyperthreading enabled
L1/L2/L3 cache sizes: 32KiB/256KiB/20MiB
L3 is socket-local
Cache line size: 64B
TLB1: 64 entries for 64KiB pages; 32 entries for 2MiB pages
TLB2: page size 4KiB, 512 entries per TLB1 entry
William Qian Sort vs. Hash 2020 April 16 18 / 32
Evaluation
Experiments
Sorting baseline Merging baseline Partitioning
Alternative merges m-way factors Input size
Data skew Scalability
William Qian Sort vs. Hash 2020 April 16 19 / 32
Evaluation
Sorting baselines
Evaluating single-threadedperformance
Confirm that AVX sorting isefficient
William Qian Sort vs. Hash 2020 April 16 20 / 32
Evaluation
Merging
Larger merging fan-ins lead tosmaller buffers
Software managed buffersperform stably
Idea: partition instead ofmerge
William Qian Sort vs. Hash 2020 April 16 21 / 32
Evaluation
Merging
Partition-then-sort: range-partition, sort, concatenate
Sort-then-merge: what we’ve been discussing
Partitioning doesn’t degrade like merging does!
William Qian Sort vs. Hash 2020 April 16 22 / 32
Evaluation
Sort-merge champion: m-way
Multi-way merge helps whenmemory is contended
AVX benefit is persistent
William Qian Sort vs. Hash 2020 April 16 23 / 32
Evaluation
Hash champion: radix-hash
Radix-hash with software-managed buffers [2]
William Qian Sort vs. Hash 2020 April 16 24 / 32
Evaluation
Sort vs. Hash: Input size
Radix-hash wins at smallersizes
Radix-hash degrades quicklywith larger sizes
m-way doesn’t degrade withtable size, but
m-way performs ≈radix-hashat best
William Qian Sort vs. Hash 2020 April 16 25 / 32
Evaluation
Sort vs. Hash: Skew
Radix-hash
Fine-granular taskdecomposition [2, 3]
Redistributes “hotter”partitions to all threads
m-way
Multi-way merging’s two-stepapproach:
1 Sub-task merges, split inNUMA region
2 Special handling for heavyhitters
William Qian Sort vs. Hash 2020 April 16 26 / 32
Evaluation
Sort vs. Hash: Scalability
Sort-merge algorithms allscale
Radix-hash scales as well
William Qian Sort vs. Hash 2020 April 16 27 / 32
Evaluation
Sort vs. Hash
Radix-hash works well
m-way is about similar forlarger joins
Hash joins are still the winners
William Qian Sort vs. Hash 2020 April 16 28 / 32
Evaluation
References
Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Ozsu.
Multi-core, main-memory joins: Sort vs. hash revisited.Proceedings of the VLDB Endowment, 7(1):85–96, 2013.
Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M Tamer Ozsu.
Main-memory hash joins on multi-core cpus: Tuning to the underlying hardware.In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 362–373. IEEE, 2013.
Changkyu Kim, Tim Kaldewey, Victor W Lee, Eric Sedlar, Anthony D Nguyen, Nadathur Satish, Jatin Chhugani, Andrea
Di Blas, and Pradeep Dubey.Sort vs. hash revisited: Fast join implementation on modern multi-core cpus.Proceedings of the VLDB Endowment, 2(2):1378–1389, 2009.
William Qian Sort vs. Hash 2020 April 16 29 / 32
Discussion
1 Background
2 Parallel sort-merge joins
3 Parallel hash joins
4 Evaluation
5 Discussion
William Qian Sort vs. Hash 2020 April 16 30 / 32
Discussion
Feedback
Positive:
Paper layout is very readable!
Lots of appropriate data visuals
Thorough work on minimizing effects of external factors
Good balance of self and cross-system comparisons
Constructive:
Throughput vs execution time graphs can be confusing
Hyperthread scaling cap for memory-restricted workloads iswell-known
Generally should avoid benchmarking with hyperthreads
William Qian Sort vs. Hash 2020 April 16 31 / 32
Discussion
Discussion
1 How could multi-way merging benefit from advances with(parallel) funnelsort?
2 How would a non-NUMA architecture affect these results?3 How could these results translate to other database data
layouts?
Delta encodingsBit vector layouts
William Qian Sort vs. Hash 2020 April 16 32 / 32