6/1/2015 UNIVERSITY OF WISCONSIN

Toward GPUs being mainstream in analytic processing

An initial argument using simple scan-aggregate queries

Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel || David A. Wood

<[email protected]>

DaMoN 2015


Summary

▪ GPUs are energy efficient
  ▪ Discrete GPUs unpopular for DBMS
  ▪ New integrated GPUs solve the problems

▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload

▪ Up to 70% energy savings over multicore CPU
  ▪ Even more in the future

Analytic Data is Growing

▪ Data is growing rapidly

▪ Analytic DBs increasingly important

Want: High performance
Need: Low energy

Source: IDC’s Digital Universe Study. 2012.

GPUs to the Rescue?

▪ GPUs are becoming more general
  ▪ Easier to program
  ▪ Integrated GPUs are everywhere

▪ GPUs show great promise [Govindaraju ’04, He ’14, He ’14, Kaldewey ’12, Satish ’10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency

▪ Analytic DBs look like GPU workloads

GPU Microarchitecture

[Figure: a graphics processing unit containing compute units (CUs) and an L2 cache. Each compute unit has instruction fetch/scheduling (I-Fetch/Sched), SPs, a register file, an L1 cache, and a scratchpad cache.]

Discrete GPUs

[Figure: a CPU chip (cores) and a discrete GPU, each with its own memory bus, connected by a PCIe bus.]


➍ And repeat

Discrete GPUs

▪ Copy data over PCIe
  ▪ Low bandwidth
  ▪ High latency

▪ Small working memory

▪ High latency user→kernel calls

▪ Repeated many times

98% of time spent not computing

Integrated GPUs

[Figure: a heterogeneous chip containing CPU cores and GPU CUs, attached to a single memory bus.]

Heterogeneous System Architecture (HSA)

▪ API for tightly-integrated accelerators

▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA foundation (AMD, ARM, Qualcomm, others)

▪ No need for data copies
  ▪ Cache coherence and shared address space

▪ No OS kernel interaction
  ▪ User-mode queues

Outline

▪ Background

▪ Algorithms
  ▪ Scan
  ▪ Aggregate

▪ Results

Analytic DBs

▪ Resident in main-memory

▪ Column-based layout

▪ WideTable & BitWeaving [Li and Patel ’13 & ’14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scan by using sub-word parallelism
  ▪ Similar to industry proposals [SAP Hana, Oracle Exalytics, IBM DB2 BLU]

▪ Scan-aggregate queries

Running Example

Shirt Color   Shirt Amount
Green         1
Green         3
Blue          1
Green         5
Yellow        7
Red           2
Yellow        1
Blue          4
Yellow        2

Color    Code
Red      0
Blue     1
Green    2
Yellow   3

Shirt Color (encoded): 2 2 1 2 3 0 3 1 3

Count the number of green shirts in the inventory

Scan the color column for green (2)

Aggregate amount where there is a match
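As a rough illustration of the query on this slide, the scan and the aggregate can be written in a few lines of plain Python. This is only a sketch of the logic on the running example, not the implementation from the talk; the names `scan_aggregate`, `encoded_color`, and `green_total` are made up for the example.

```python
# Running example from the slide: one row per inventory entry.
shirt_color = ["Green", "Green", "Blue", "Green", "Yellow",
               "Red", "Yellow", "Blue", "Yellow"]
shirt_amount = [1, 3, 1, 5, 7, 2, 1, 4, 2]

# Dictionary encoding shown on the slide.
color_code = {"Red": 0, "Blue": 1, "Green": 2, "Yellow": 3}
encoded_color = [color_code[c] for c in shirt_color]


def scan_aggregate(codes, amounts, target_code):
    """Sum the amounts on rows whose encoded color equals target_code."""
    return sum(a for c, a in zip(codes, amounts) if c == target_code)


green_total = scan_aggregate(encoded_color, shirt_amount, color_code["Green"])
print(green_total)
```

On the nine rows above the matching amounts are 1, 3, and 5.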

Traditional Scan Algorithm

[Figure: each 2-bit code in the column data is compared, one at a time, against the compare code for Green; each comparison produces one bit of the result bit vector.]

Shirt Color (code): 2 (10), 2 (10), 1 (01), 2 (10), 3 (11), 0 (00), 3 (11), 1 (01), 3 (11)
Column data:          10 10 01 10 11 00 11 01 11
Compare code (Green): 10
Result bit vector:    11010000 0000...
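The code-at-a-time scan can be sketched as below, assuming the 2-bit codes are packed back to back with the first code in the highest bits. The helper names `pack_codes` and `traditional_scan` are hypothetical, and a Python int stands in for the packed column.

```python
CODE_BITS = 2  # each color code occupies 2 bits, as on the slide


def pack_codes(codes, code_bits=CODE_BITS):
    """Pack codes back to back into one int, first code in the highest bits."""
    packed = 0
    for code in codes:
        packed = (packed << code_bits) | code
    return packed


def traditional_scan(packed, n_codes, target, code_bits=CODE_BITS):
    """Extract and compare one code at a time; return the result bits as a string."""
    mask = (1 << code_bits) - 1
    bits = []
    for i in range(n_codes):
        shift = (n_codes - 1 - i) * code_bits
        code = (packed >> shift) & mask
        bits.append("1" if code == target else "0")
    return "".join(bits)


codes = [2, 2, 1, 2, 3, 0, 3, 1, 3]
packed = pack_codes(codes)
result = traditional_scan(packed, len(codes), target=0b10)  # Green = 10
print(result)
```

Each loop iteration handles exactly one code, which is the cost the bit-parallel layouts on the following slides try to avoid.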

Vertical Layout

Color
c0   2 (10)
c1   2 (10)
c2   1 (01)
c3   2 (10)
c4   3 (11)
c5   0 (00)
c6   3 (11)
c7   1 (01)
c8   3 (11)
c9   0 (00)

word   c0 c1 c2 c3 c4 c5 c6 c7
w0      1  1  0  1  1  0  1  0
w1      0  0  1  0  1  0  1  1

word   c8 c9
w2      1  0
w3      1  0

11011010 00101011 10000000 ...
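The transposition into bit slices can be sketched as follows, assuming 8-bit words as drawn on the slide and zero padding for the last, partly filled segment. `to_vertical` is a hypothetical helper written for this example only.

```python
def to_vertical(codes, code_bits, word_bits):
    """Return the bit-slice words, segment by segment, as bit strings.

    Word j of a segment holds bit j (most significant first) of every code
    in that segment, padded with zeros on the right.
    """
    words = []
    for start in range(0, len(codes), word_bits):
        segment = codes[start:start + word_bits]
        for j in range(code_bits):
            shift = code_bits - 1 - j
            bits = "".join(str((c >> shift) & 1) for c in segment)
            words.append(bits.ljust(word_bits, "0"))
    return words


colors = [2, 2, 1, 2, 3, 0, 3, 1, 3, 0]  # c0..c9 from the slide
words = to_vertical(colors, code_bits=2, word_bits=8)
print(words)
```

The first two words cover c0..c7 (w0 and w1) and the next two cover c8 and c9 (w2 and w3).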

CPU BitWeaving Scan

Column data:       11011010 00101011 10000000
Compare code:      11111111 00000000
Result bit vector: 11010000 0000...

CPU width: 64 bits, up to 256-bit SIMD
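A simplified equality scan over bit slices might look like the sketch below. It assumes 8-bit words and takes w0 and w1 from the codes c0..c7 of the vertical layout; `bitweaving_equal` is a hypothetical name, and the sketch leaves out the optimizations of the published BitWeaving method.

```python
def bitweaving_equal(slices, target, code_bits, word_bits):
    """Equality scan over one segment stored as bit slices.

    slices[j] is an int holding bit j (most significant first) of each code
    in the segment. Returns an int with a 1 wherever code == target.
    """
    all_ones = (1 << word_bits) - 1
    result = all_ones
    for j in range(code_bits):
        target_bit = (target >> (code_bits - 1 - j)) & 1
        compare_word = all_ones if target_bit else 0  # broadcast the target bit
        # XNOR: 1 where the column bit equals the target bit.
        result &= ~(slices[j] ^ compare_word) & all_ones
    return result


w0, w1 = 0b11011010, 0b00101011  # bit slices for c0..c7
match = bitweaving_equal([w0, w1], target=0b10, code_bits=2, word_bits=8)
print(format(match, "08b"))
```

For Green (10) the broadcast compare words are all ones and then all zeros, which is the compare code pattern shown on the slide. One pass over two words decides eight codes at once.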

GPU BitWeaving Scan

Column data:       11011010 00101011 10000000
Compare code:      11111111 11111111 11111111
Result bit vector: 11010000 0000...

GPU width: 16,384-bit SIMD

GPU Scan Algorithm

▪ GPU uses very wide “words”
  ▪ CPU: 64 bits, or 256 bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64 bits)

▪ Memory and caches optimized for bandwidth

▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
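To get a feel for what the wider "word" buys, the sketch below counts word-wide operations for a bit-sliced scan at each width. The column size and code width are hypothetical, and the count ignores memory bandwidth, so it is not a performance prediction.

```python
import math


def word_ops(n_codes, code_bits, word_bits):
    """Word-wide operations needed to touch every bit slice once."""
    return math.ceil(n_codes / word_bits) * code_bits


N_CODES = 1_000_000  # hypothetical column size
CODE_BITS = 2        # hypothetical code width (matches the shirt example)

ops = {width: word_ops(N_CODES, CODE_BITS, width) for width in (64, 256, 16_384)}
print(ops)
```

With these assumed sizes the operation count falls from tens of thousands at 64 bits to a little over a hundred at 16,384 bits.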

CPU Aggregate Algorithm

Shirt Amount:      1 3 1 5 7 2 1 4 2
Result bit vector: 11010000 0000...
Result:            1+3+5+...

GPU Aggregate Algorithm

On CPU
Result bit vector: 11010000 0000...
Column offsets:    0,1,3,...

On GPU
Column offsets: 0,1,3,...
Shirt Amount:   1 3 1 5 7 2 1 4 2
Result:         1+3+5+...

Aggregate Algorithm

▪ Two phases
  ▪ Convert from BitVector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)

▪ Two group-by algorithms (see paper)

▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload subset of computation
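The two phases can be sketched on the running example as below. Plain Python stands in for both the CPU phase and the GPU kernel, so this shows the data flow only; `bitvector_to_offsets` and `aggregate_at_offsets` are hypothetical names.

```python
def bitvector_to_offsets(bitvector):
    """Phase 1 (on the CPU in the talk): positions of the set bits."""
    return [i for i, bit in enumerate(bitvector) if bit == "1"]


def aggregate_at_offsets(column, offsets):
    """Phase 2 (offloaded to the GPU in the talk): gather the values and sum.

    Plain Python stands in for the GPU kernel here.
    """
    return sum(column[i] for i in offsets)


shirt_amount = [1, 3, 1, 5, 7, 2, 1, 4, 2]
result_bits = "110100000"  # scan result for Green over the nine rows

offsets = bitvector_to_offsets(result_bits)
total = aggregate_at_offsets(shirt_amount, offsets)
print(offsets, total)
```

Splitting the work at the offsets list is what lets only the second phase be offloaded.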

Outline

▪ Background

▪ Algorithms

▪ Results

Experimental Methods

▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
  ▪ Separate discrete GPU

▪ Watts-Up meter for full-system power

▪ TPC-H @ scale-factor 10

Scan Performance & Energy

Takeaway: Integrated GPU most efficient for scans

TPC-H Queries

[Figure panels: Query 12 Performance; Query 12 Energy]

Integrated GPU faster for both aggregate and scan computation

More energy: Decrease in latency does not offset power increase

Less energy: Decrease in latency AND decrease in power
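The two energy callouts follow from energy = power × time. The sketch below works through both cases with made-up ratios relative to a baseline of 1.0. The numbers are hypothetical and are not measurements from the talk.

```python
def relative_energy(relative_time, relative_power):
    """Energy = power x time, so relative energy is the product of the ratios."""
    return relative_time * relative_power


# Hypothetical ratios versus a baseline of 1.0; not measurements from the talk.
faster_but_hungrier = relative_energy(relative_time=0.8, relative_power=1.5)
faster_and_leaner = relative_energy(relative_time=0.8, relative_power=0.9)

print(round(faster_but_hungrier, 2), round(faster_and_leaner, 2))
```

In the first case a 20% shorter run still costs more energy because power rose by 50%. In the second case both factors shrink, so energy must fall.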

Future Die Stacked GPUs

▪ 3D die stacking

▪ Same physical & logical integration

▪ Increased compute

▪ Increased bandwidth

[Figure labels: Board, DRAM, CPU, GPU]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.

Conclusions

                   Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance        High ☺          Moderate          High ☺
Memory Bandwidth   High ☺          Low ☹             High ☺
Overhead           High ☹          Low ☺             Low ☺
Memory Capacity    Low ☹           High ☺            Moderate

?

HSA vs CUDA/OpenCL

▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language

▪ CUDA/OpenCL are a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all features

Scan Performance & Energy

Group-by Algorithms

All TPC-H Results

Average TPC-H Results

[Figure panels: Average Performance; Average Energy]

What’s Next?

▪ Developing cost model for GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient

▪ Future “database machines”
  ▪ GPUs are a good tradeoff between specialization and commodity

Conclusions

▪ Integrated GPUs viable for DBMS?
  ▪ Solve problems with discrete GPUs
  ▪ (Somewhat) better performance and energy

▪ Looking toward the future...
  ▪ CPUs cannot keep up with bandwidth
  ▪ GPUs perfectly designed for these workloads