Relational Joins on Graphics Processors
Suman Karumuri, Jamie Jablin


DESCRIPTION

This paper discusses algorithms for performing database joins on a GPU. There is some interesting work here that may someday lead to databases implemented on a GPGPU platform such as CUDA.


Page 1: Gpu Join Presentation

Relational Joins on Graphics Processors

Suman Karumuri, Jamie Jablin

Page 2: Gpu Join Presentation

Background

Page 3: Gpu Join Presentation

Introduction

• Utilizing hardware features of the GPU:
  – Massive thread parallelism
  – Fast inter-processor communication
  – High memory bandwidth
  – Coalesced access

Page 4: Gpu Join Presentation

Relational Joins

• Non-indexed nested-loop join (NINLJ)
• Indexed nested-loop join (INLJ)
• Sort-merge join (SMJ)
• Hash join (HJ)

Page 5: Gpu Join Presentation

Block Non-indexed nested-loop join (NINLJ)

foreach r in R:
    foreach s in S:
        if condition(r, s):
            output <r, s>

Page 6: Gpu Join Presentation

Block Indexed nested-loop join (INLJ)

foreach r in R:
    foreach s in S.index.lookup(r.f1):    # probe the index on S instead of scanning S
        if condition(r, s):
            output <r, s>

Page 7: Gpu Join Presentation

Hash join (HJ)

Hr = Hashtable()
foreach r in R:
    Hr.add(r)
    if Hr.size() == MAX_MEMORY:
        foreach s in S:
            if Hr.contains(s):
                output the matching <r, s> pairs
        Hr.clear()
# a final pass over S probes the last, partially filled Hr

Page 8: Gpu Join Presentation

Sort-merge join (SMJ)

Sort(R); Sort(S)
i = 0; j = 0
while i < R.size() && j < S.size():
    if R[i] == S[j]:
        output <R[i], S[j]>
        i++; j++
    elif R[i] < S[j]:
        i++
    else:
        j++

Page 9: Gpu Join Presentation

Algorithms on GPU

• Tips for algorithm design:
  – Use the inherent concurrency.
  – Keep the SIMD nature in mind.
  – Algorithms should be side-effect free.
  – Memory properties:
    • High memory bandwidth.
    • Coalesced access (for spatial locality).
    • Cache in local memory (for temporal locality).
    • Access memory via indices and offsets.

Page 10: Gpu Join Presentation

GPU Primitives

Page 11: Gpu Join Presentation

Design and Implementation

• A complete set of parallel primitives:
  – Map, scatter, gather, prefix scan, split, and sort
• Low synchronization overhead.
• Scalable to hundreds of processors.
• Applicable to joins as well as other relational query operators.

Page 12: Gpu Join Presentation

Map

Page 13: Gpu Join Presentation

Scatter

• Indexed writes to a relation.
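As a minimal illustration (sequential Python standing in for one-thread-per-element GPU code; the function and parameter names are mine, not the paper's):

def scatter(values, indices, out_size):
    # Indexed write: element i of the input goes to position indices[i].
    out = [None] * out_size
    for i, v in enumerate(values):    # each iteration = one GPU thread
        out[indices[i]] = v
    return out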

Page 14: Gpu Join Presentation

Gather

• Performs indexed reads from a relation.
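Gather is the inverse access pattern of scatter; a matching sketch under the same assumptions:

def gather(values, indices):
    # Indexed read: output element i comes from position indices[i].
    return [values[j] for j in indices]    # each element = one GPU thread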

Page 15: Gpu Join Presentation

Prefix Scan

• A prefix scan applies a binary operator to an input of size n and produces an output of size n.
• Example: prefix sum, the cumulative sum of all elements to the left of the current element.
  – Exclusive (used in the paper)
  – Inclusive
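A sequential sketch of the exclusive variant; a GPU scan computes the same result in O(log n) parallel steps (see Sengupta et al. in the references):

def exclusive_prefix_sum(xs):
    # out[i] = sum of xs[0 .. i-1]; out[0] = 0.
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

# exclusive_prefix_sum([3, 1, 4]) -> [0, 3, 4]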

Page 16: Gpu Join Presentation

Split

Page 17: Gpu Join Presentation

Each thread constructs its histogram tHist of partition sizes from its portion of Rin

Page 18: Gpu Join Presentation

L[(p-1)*#thread + t] = tHist[t][p]

Page 19: Gpu Join Presentation

Prefix sum: L[i] = sum(L[0…i-1])

This gives the start location of each partition.

Page 20: Gpu Join Presentation

tOffset[t][p] = L[(p-1)*#thread + t]

Page 21: Gpu Join Presentation

Scatter tuples to Rout based on offset.
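Putting pages 17-21 together, here is a hedged sequential sketch of the split primitive. tHist, L, and tOffset follow the slides (with 0-based indices where the slides are 1-based); the tuple-to-partition function and thread-chunking scheme are simplifying assumptions of mine:

def split(Rin, part_of, num_threads, num_parts):
    # Partition Rin into num_parts contiguous partitions of Rout.
    n = len(Rin)
    chunk = (n + num_threads - 1) // num_threads    # each thread's share of Rin

    # 1. Each thread t builds a histogram tHist[t][p] over its chunk.
    tHist = [[0] * num_parts for _ in range(num_threads)]
    for t in range(num_threads):
        for r in Rin[t * chunk:(t + 1) * chunk]:
            tHist[t][part_of(r)] += 1

    # 2. Flatten so entries for one partition are adjacent: L[p * num_threads + t].
    L = [tHist[t][p] for p in range(num_parts) for t in range(num_threads)]

    # 3. An exclusive prefix sum gives each (thread, partition) its write offset.
    offsets, running = [], 0
    for c in L:
        offsets.append(running)
        running += c

    # 4. Each thread scatters its tuples using its private offsets (no locks needed).
    Rout = [None] * n
    for t in range(num_threads):
        tOffset = [offsets[p * num_threads + t] for p in range(num_parts)]
        for r in Rin[t * chunk:(t + 1) * chunk]:
            p = part_of(r)
            Rout[tOffset[p]] = r
            tOffset[p] += 1
    return Rout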

Page 22: Gpu Join Presentation

Sort

• Bitonic sort
  – Uses sorting networks; O(N log² N).
• Quick sort
  – Partition using a random pivot until each partition fits in local memory.
  – Sort each partition using bitonic sort.
  – Partitioning can be parallelized using split.
  – Complexity is O(N log N).
  – 30% faster than bitonic sort in their experiments.
• Quick sort is therefore used for sorting.
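For reference, a minimal sequential version of the bitonic sorting network (a sketch: len(a) must be a power of two, and the comparisons in each inner pass are independent, which is what makes the network GPU-friendly):

def bitonic_sort(a):
    # In-place bitonic sort; len(a) must be a power of two.
    n = len(a)
    k = 2
    while k <= n:                # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # compare-exchange distance
            for i in range(n):   # all n compare-exchanges can run in parallel
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2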

Page 23: Gpu Join Presentation

Spatial and Temporal locality

Page 24: Gpu Join Presentation

Memory Optimizations

• Coalesced memory access improves memory bandwidth utilization (spatial locality)

Page 25: Gpu Join Presentation

Local Memory Optimization

• Quick sort
  – Temporal locality
  – Uses bitonic sort to sort each chunk after the partitioning step.

Page 26: Gpu Join Presentation

Joins on GPGPU

Page 27: Gpu Join Presentation

NINLJ on GPU

• Block nested loops
• Uses the map primitive on both relations:
  – Partition R into R’ blocks and S into S’ blocks.
  – Create R’ x S’ thread groups.
  – A thread in a thread group processes one tuple from R’ and matches all tuples from S’.
  – All tuples in S’ are in the local cache.
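A sequential sketch of the blocked scheme (block sizes and the loop structure are illustrative assumptions; on the GPU, the two outer loops become the grid of thread groups):

def block_ninlj(R, S, condition, r_block=256, s_block=256):
    # Each (R-block, S-block) pair maps to one thread group; the S block
    # plays the role of data staged in local memory.
    out = []
    for rb in range(0, len(R), r_block):
        for sb in range(0, len(S), s_block):
            s_tile = S[sb:sb + s_block]      # "local memory" copy of S'
            for r in R[rb:rb + r_block]:     # one thread per R' tuple
                for s in s_tile:
                    if condition(r, s):
                        out.append((r, s))
    return out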

Page 28: Gpu Join Presentation

B+ Tree vs CSS Tree

• B+ tree imposes – Memory stalls when traversed (no spatial locality)– Can’t perform multiple searches ( loses temporal

locality).• CSS-Tree (Cache optimized search tree)

– One dimensional array where nodes are indexed.– Replaces traversal with computation.– Can also perform parallel key lookups.
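The paper's CSS-trees pack many keys per node; as a simplified, hedged illustration of "traversal replaced by computation", here is a one-key-per-node variant where a child's array index is computed arithmetically instead of followed through a pointer:

def build_implicit_bst(sorted_keys):
    # Lay a sorted array out as a pointer-free BST: node i's children
    # live at indices 2*i+1 and 2*i+2 (a simplification of a CSS-tree).
    tree = [None] * len(sorted_keys)
    it = iter(sorted_keys)
    def fill(i):
        if i < len(tree):
            fill(2 * i + 1)        # in-order fill yields a valid BST
            tree[i] = next(it)
            fill(2 * i + 2)
    fill(0)
    return tree

def lookup(tree, key):
    i = 0
    while i < len(tree):
        if tree[i] == key:
            return i
        i = 2 * i + 1 + (1 if key > tree[i] else 0)    # child index computed
    return -1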

Page 29: Gpu Join Presentation

Indexed Nested Loop Join (INLJ)

• Uses the map primitive on the outer relation.
• Uses a CSS-tree as the index.
• For each block in the outer relation R:
  – Start at the root node to find the next level.
    • Binary search within a node is shown to be better than sequential search.
  – Descend until you reach a data node.
    • Upper-level nodes are cached in local memory since they are frequently accessed.

Page 30: Gpu Join Presentation

Sort Merge Join

• Sort the relations R and S using the sort primitive.
• Merge phase:
  – Break S into chunks (s’) of size M.
  – Find the first and last key values of each chunk s’ and partition R into that many chunks.
  – Merge all chunk pairs in parallel using map:
    • Each thread group handles a pair.
    • Each thread compares one tuple in R with s’ using binary search.
  – Chunk size is chosen to fit in local memory.
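A sequential sketch of the merge phase's chunking idea, assuming R and S are sorted lists of comparable join keys; bisect stands in for each thread's binary search, and the chunking details follow the slide rather than the paper's exact code:

import bisect

def smj_merge(R, S, M=1024):
    # R and S are already sorted on the join key (here, the tuples themselves).
    out = []
    for sb in range(0, len(S), M):
        chunk = S[sb:sb + M]                    # s': sized to fit local memory
        lo = bisect.bisect_left(R, chunk[0])    # R tuples that can match s'
        hi = bisect.bisect_right(R, chunk[-1])
        for r in R[lo:hi]:                      # one thread per R tuple
            i = bisect.bisect_left(chunk, r)    # binary search inside s'
            while i < len(chunk) and chunk[i] == r:
                out.append((r, chunk[i]))
                i += 1
    return out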

Page 31: Gpu Join Presentation

Hash Join

• Uses the split primitive on both relations.
• Develops a parallel version of radix hash join:
  – Partitioning:
    • Split R and S into the same number of partitions, so that the S partitions fit into local memory.
  – Matching:
    • Choose the smaller of the R and S partitions as the inner partition to be loaded into local memory.
    • The larger partition is used as the outer relation.
    • Each tuple from the outer relation searches the inner relation for matches.
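A hedged sequential sketch of the partition-then-match scheme; the hash function, partition count, and helper names are illustrative assumptions of mine:

def radix_hash_join(R, S, key, num_parts=64):
    # Co-partition both relations on the same hash bits (the split primitive's job).
    def partition(rel):
        parts = [[] for _ in range(num_parts)]
        for t in rel:
            parts[hash(key(t)) % num_parts].append(t)
        return parts

    out = []
    for pr, ps in zip(partition(R), partition(S)):
        # Load the smaller partition ("local memory") and probe with the larger.
        inner, outer = (pr, ps) if len(pr) <= len(ps) else (ps, pr)
        table = {}
        for t in inner:
            table.setdefault(key(t), []).append(t)
        for t in outer:                      # one thread per outer tuple
            for m in table.get(key(t), []):
                out.append((t, m))           # pairs come out as (outer, inner)
    return out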

Page 32: Gpu Join Presentation

Lock-Free Scheme for Result Output

• Problems:
  – The join result size is unknown in advance, and the worst-case result size does not fit in memory.
  – Concurrent writes are not atomic.

Page 33: Gpu Join Presentation

Lock-Free Scheme for Result Output

• Solution: a three-phase scheme:
  – Each thread counts the number of join results it will produce.
  – Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join.
  – Host code allocates memory on the device.
  – Run the join again, writing the outputs.
• The join runs twice; that is acceptable because GPUs are fast.
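A minimal sequential model of the three phases; the count and write passes here are stand-ins for the two kernel launches in the real scheme:

def lock_free_join_output(R, S, condition):
    # Phase 1: each "thread" (here, each r in R) counts its matches.
    counts = [sum(1 for s in S if condition(r, s)) for r in R]

    # Phase 2: exclusive prefix sum -> a private write location per thread.
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c

    # Host allocates exactly `total` result slots on the device.
    out = [None] * total

    # Phase 3: re-run the join; each thread writes into its own range,
    # so no two threads ever touch the same slot (no locks needed).
    for r, pos in zip(R, starts):
        for s in S:
            if condition(r, s):
                out[pos] = (r, s)
                pos += 1
    return out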

Page 34: Gpu Join Presentation

Experimental Results

Page 35: Gpu Join Presentation

Hardware Configuration

• Theoretical memory bandwidth:
  – GPU: 86.4 GB/s
  – CPU: 10.4 GB/s
• Practical memory bandwidth:
  – GPU: 69.2 GB/s
  – CPU: 5.6 GB/s

Page 36: Gpu Join Presentation

Workload

• R and S tables with 2 integer columns.
• SELECT R.rid, S.rid FROM R, S WHERE <predicate>
• SELECT R.rid, S.rid FROM R, S WHERE R.rid = S.rid
• SELECT R.rid, S.rid FROM R, S WHERE R.rid <= S.rid <= R.rid + k
• Tested on all combinations:
  – Fix R, vary S; all values uniformly distributed. |R| = 1M.
  – Performance impact of varying join selectivity. |R| = |S| = 16M.
  – Non-uniform data-size distributions, also with varying join selectivity. |R| = |S| = 16M.
• Also tested with string columns.

Page 37: Gpu Join Presentation

Implementation Details on CPU

• Highly optimized primitives and join algorithms matched to the hardware architecture.
• Tuned for cache performance.
• Programs compiled with MSVC 8.0 with full optimizations.
• Used OpenMP for threading.
• 2-6X faster than their sequential counterparts.

Page 38: Gpu Join Presentation

Implementation Details on GPU

• CUDA parameters:
  – Number of thread groups: 128
  – Number of threads per thread group: 64
  – Block size: 4MB (main memory to device memory)

Page 39: Gpu Join Presentation
Page 40: Gpu Join Presentation
Page 41: Gpu Join Presentation

Memory Optimizations Work

Page 42: Gpu Join Presentation

Works when join selectivity is varied

Page 43: Gpu Join Presentation

Better than in-memory database

Page 44: Gpu Join Presentation

CUDA vs. DirectX10

• DirectX10 is difficult to program because the data is stored as textures.
• NINLJ and INLJ have similar performance on both platforms.
• HJ and SMJ are slower under DirectX10 because of texture decoding.
• Summary: low-level primitives on a GPGPU platform are better than graphics primitives on the GPU.

Page 45: Gpu Join Presentation

Criticisms

• Applications of skew handling are unclear.
• The primitives are sufficient to implement the given joins, but the authors do not show that the set of primitives is minimal.

Page 46: Gpu Join Presentation

Limitations and future research directions

• Lack of synchronization mechanisms for handling read/write conflicts on the GPU.
• More primitives.
• A more open GPGPU hardware spec to enable optimizations.
• Power consumption on the GPU.
• Lack of support for complex data types.
• An on-GPU in-memory database.
• Automatic detection of thread-group count and thread count using program-analysis techniques.

Page 47: Gpu Join Presentation

Conclusion

• GPU-based primitives and join algorithms achieve a speedup of 2-27X over optimized CPU-based counterparts.

• NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ, 1.9X

Page 48: Gpu Join Presentation

References

• Scan Primitives for GPU Computing, Sengupta et al.
• wikipedia.org
• monetdb.cwi.nl

Page 49: Gpu Join Presentation

Thank You.

Page 50: Gpu Join Presentation

Scan Primitives for GPU Computing, Sengupta et al

Page 51: Gpu Join Presentation

Skew Handling

• Skew in the data results in imbalanced partition sizes in partition-based algorithms (SMJ and HJ).
• Solution:
  – Identify partitions that do not fit into local memory.
  – Decompose those partitions into multiple chunks, each the size of local memory.

Page 52: Gpu Join Presentation

Implementation Details on GPU

• CUDA parameters:
  – Number of threads per thread group
  – Number of thread groups
• DirectX10:
  – Join algorithms implemented using the programmable pipeline:
    • Vertex shader, geometry shader, and pixel shader

Page 53: Gpu Join Presentation
Page 54: Gpu Join Presentation
Page 55: Gpu Join Presentation