Using CUDA Libraries with OpenACC


Page 1

Using CUDA Libraries with OpenACC

Page 2

3 Ways to Accelerate Applications

Diagram: the three approaches to accelerating applications:

Libraries ("Drop-in" Acceleration)
OpenACC Directives (Easily Accelerate Applications)
Programming Languages (Maximum Flexibility)

CUDA Libraries are interoperable with OpenACC

Page 3

3 Ways to Accelerate Applications

Diagram (repeated): Libraries ("Drop-in" Acceleration), OpenACC Directives (Easily Accelerate Applications), and Programming Languages (Maximum Flexibility).

CUDA languages are interoperable with OpenACC, too!

Page 4

CUDA Libraries Overview

Page 5

GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications

Library tiles shown on the slide:

NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP, NVIDIA cuFFT
Vector Signal Image Processing
GPU Accelerated Linear Algebra
Matrix Algebra on GPU and Multicore
C++ STL Features for CUDA
IMSL Library
Building-block Algorithms for CUDA

Page 6

CUDA Math Libraries

High performance math routines for your applications:

cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS Library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
NPP – Performance Primitives for Image & Video Processing
Thrust – Templated C++ Parallel Algorithms & Data Structures
math.h – C99 floating-point Library

Included in the CUDA Toolkit. Free download @ www.nvidia.com/getcuda

More information on CUDA libraries:

http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216

Page 7

cuFFT: Multi-dimensional FFTs

New in CUDA 4.1:

Flexible input & output data layouts for all transform types
Similar to the FFTW "Advanced Interface"
Eliminates extra data transposes and copies

API is now thread-safe & callable from multiple host threads
Restructured documentation to clarify data layouts
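The advanced layout is exposed through cufftPlanMany(). A minimal sketch, assuming a batch of strided 1D complex-to-complex transforms; the sizes, strides, and pointer names here are illustrative, not taken from the slides:

    #include <cufft.h>

    int rank = 1;
    int n[1]       = { 1024 };          // transform length
    int inembed[1] = { 1024 };          // storage dimensions of the input array
    int onembed[1] = { 1024 };
    int istride = 2, idist = 2 * 1024;  // element stride / distance between batches (input)
    int ostride = 2, odist = 2 * 1024;  // same for the output
    int batch = 8;                      // number of transforms in the batch

    cufftHandle plan;
    cufftPlanMany(&plan, rank, n,
                  inembed, istride, idist,
                  onembed, ostride, odist,
                  CUFFT_C2C, batch);
    // cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);  // d_in, d_out: device pointers
    cufftDestroy(plan);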

Page 8

FFTs up to 10x Faster than MKL

Measured on sizes that are exactly powers of 2
cuFFT 4.1 on Tesla M2090, ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz

1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs.

Performance may vary based on OS version and motherboard configuration.

Charts: cuFFT vs. MKL single-precision and double-precision 1D FFT performance, GFLOPS vs. log2(size) for sizes 2^1 through 2^25.

Page 9

CUDA 4.1 optimizes 3D transforms

Consistently faster than MKL
More than 3x faster than cuFFT 4.0 on average

Chart: single-precision 3D FFT performance (GFLOPS vs. size NxNxN) for all sizes from 2x2x2 to 128x128x128, comparing cuFFT 4.1, cuFFT 4.0, and MKL.

cuFFT 4.1 on Tesla M2090, ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 10

cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex precision

New in CUDA 4.1:

New batched GEMM API provides >4x speedup over MKL (a sketch follows below)
Useful for batches of 100+ small matrices, from 4x4 to 128x128
5%-10% performance improvement for large GEMMs
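As a rough sketch of the batched interface (cuBLAS v2 API; the matrix size, batch count, and pointer names are illustrative, not from the slides): d_Aarray, d_Barray, and d_Carray are device arrays of device pointers, one pointer per small matrix.

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    int n = 32;              // each matrix is 32x32 (illustrative)
    int batchCount = 1000;   // number of independent GEMMs

    // d_Aarray, d_Barray, d_Carray: device arrays of batchCount device pointers,
    // each pointing to an n x n matrix already resident in GPU memory
    float **d_Aarray, **d_Barray, **d_Carray;   /* assumed allocated and filled elsewhere */

    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,
                       &alpha, (const float **)d_Aarray, n,
                               (const float **)d_Barray, n,
                       &beta,  d_Carray, n,
                       batchCount);

    cublasDestroy(handle);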

Page 11

cuBLAS Level 3 Performance

Up to 1 TFLOPS sustained performance and >6x speedup over Intel MKL

Charts: GFLOPS and speedup over MKL for the Level 3 BLAS routines GEMM, SYMM, SYRK, TRMM, and TRSM in single (S), single complex (C), double (D), and double complex (Z) precision.

4Kx4K matrix size
cuBLAS 4.1, Tesla M2090 (Fermi), ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 12

ZGEMM Performance vs. Intel MKL

Chart: ZGEMM performance (GFLOPS vs. matrix size NxN, N from 0 to 2048) for cuBLAS vs. MKL.

cuBLAS 4.1 on Tesla M2090, ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 13

cuBLAS Batched GEMM API improves performance on batches of small matrices

Chart: batched GEMM performance (GFLOPS vs. matrix dimension NxN, N up to 128) for cuBLAS with batches of 100 and 10,000 matrices vs. MKL with 10,000 matrices.

cuBLAS 4.1 on Tesla M2090, ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 14

cuSPARSE: Sparse linear algebra routines

Sparse matrix-vector multiplication & triangular solve
APIs optimized for iterative methods

New in 4.1:

Tri-diagonal solver with speedups up to 10x over Intel MKL
ELL-HYB format offers 2x faster matrix-vector multiplication

Example sparse matrix-vector product, y = alpha*A*x + beta*y:

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
= \alpha
\begin{bmatrix}
1.0 & \cdots & \cdots & \cdots \\
2.0 & 3.0 & \cdots & \cdots \\
\cdots & \cdots & 4.0 & \cdots \\
5.0 & \cdots & 6.0 & 7.0
\end{bmatrix}
\begin{bmatrix} 1.0 \\ 2.0 \\ 3.0 \\ 4.0 \end{bmatrix}
+ \beta
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
\]

Page 15

cuSPARSE is >6x Faster than Intel MKL

Chart: Sparse Matrix x Dense Vector Performance, speedup over Intel MKL for csrmv* and hybmv* on test matrices including dense2, nd24k, crankseg_2, pdb1HYS, F1, cant, pwtk, ldoor, qcd5_4, cop20k_A, cage14, 2cubes_sphere, atmosmodd, mac_econ_fwd500, scircuit, shallow_water1, webbase-1M, and bcsstm38.

*Average speedup over single, double, single complex & double complex

cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 16

Up to 40x faster with 6 CSR Vectors

Chart: cuSPARSE Sparse Matrix x 6 Dense Vectors (csrmm), speedup over MKL in single, double, single complex, and double complex precision on the same matrix set as above. Useful for block iterative solve schemes.

cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 17

Tri-diagonal solver performance vs. MKL

Chart: Speedup for Tri-Diagonal solver (gtsv)* over Intel MKL in single, double, complex, and double complex precision, for system sizes from 16,384 to 4,194,304.

*The parallel GPU implementation does not include pivoting.

cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 six-core @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 18

cuRAND: Random Number Generation

Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results reported in documentation

New commonly used RNGs in CUDA 4.1:

MRG32k3a RNG
MTGP11213 Mersenne Twister RNG
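A minimal host-API sketch using the new MRG32k3a generator; the buffer name and sample count are illustrative, and the output buffer is assumed to already be device memory:

    #include <curand.h>

    curandGenerator_t gen;
    double *d_out;              // device buffer, assumed allocated with cudaMalloc elsewhere
    size_t  num = 1 << 20;      // number of samples to generate

    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);  // MRG32k3a, new in CUDA 4.1
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniformDouble(gen, d_out, num);              // fill d_out with uniform doubles
    curandDestroyGenerator(gen);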

Page 19

cuRAND Performance compared to Intel MKL

Charts: double-precision uniform and normal distribution generation throughput in giga-samples per second.

cuRAND 4.1, Tesla M2090 (Fermi), ECC on
MKL 10.2.3, TYAN FT72-B7015, Xeon X5680 @ 3.33 GHz
Performance may vary based on OS version and motherboard configuration

Page 20

1000+ New Imaging Functions in NPP 4.1

The NVIDIA Performance Primitives (NPP) library includes over 2200 GPU-accelerated functions for image & signal processing

Arithmetic, Logic, Conversions, Filters, Statistics, etc.

Most are 5x-10x faster than analogous routines in Intel IPP

NPP 4.1, NVIDIA C2050 (Fermi)
IPP 6.1, dual-socket Core™ i7 920 @ 2.67 GHz

Up to 40x speedups

http://developer.nvidia.com/content/graphcuts-using-npp

Page 21

Using CUDA Libraries with OpenACC

Page 22

Sharing data with libraries

CUDA libraries and OpenACC both operate on device arrays

OpenACC provides mechanisms for interop with library calls

deviceptr data clause
host_data construct

Note: same mechanisms useful for interop with custom CUDA C/C++/Fortran code

Page 23

deviceptr Data Clause

deviceptr( list )

Declares that the pointers in list are device pointers, so the data need not be allocated or moved between the host and device.

Example:

C:       #pragma acc data deviceptr(d_input)
Fortran: !$acc data deviceptr(d_input)
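A short sketch of the typical pattern, assuming d_input was allocated with cudaMalloc (for example by a CUDA library) and therefore already lives on the device; the size and the scaling loop are illustrative:

    #include <cuda_runtime.h>

    int    n = 1 << 20;
    float *d_input;
    cudaMalloc((void **)&d_input, n * sizeof(float));   // device allocation, no host copy exists

    #pragma acc data deviceptr(d_input)                 // tell OpenACC this is already a device pointer
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++)
            d_input[i] *= 2.0f;                         // operates directly on the device data
    }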

Page 24

host_data Construct

Makes the address of device data available on the host.

use_device( list )

Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct.

Example:

C:       #pragma acc host_data use_device(d_input)
Fortran: !$acc host_data use_device(d_input)
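A short sketch of the complementary pattern, assuming x and y are ordinary host arrays managed by an OpenACC data region and handed to the legacy cuBLAS SAXPY; the array names and scalar are illustrative:

    #include <cublas.h>

    /* float x[n], y[n] filled on the host */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc host_data use_device(x, y)
        {
            cublasSaxpy(n, 2.0f, x, 1, y, 1);   // inside host_data, x and y are device addresses
        }
    }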

Page 25

Example: 1D convolution using CUFFT

Perform convolution in frequency space:

1. Use CUFFT to transform input signal and filter kernel into the frequency domain
2. Perform point-wise complex multiply and scale on the transformed signal
3. Use CUFFT to transform the result back into the time domain

We will perform step 2 using OpenACC

Code walk-through follows. Code available with exercises.

In exercises/cufft-acc

Page 26

Source Excerpt

// Transform signal and kernel
error = cufftExecC2C(plan, (cufftComplex *)d_signal,
                     (cufftComplex *)d_signal, CUFFT_FORWARD);
error = cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                     (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

// Multiply the coefficients together and normalize the result
printf("Performing point-wise complex multiply and scale.\n");
complexPointwiseMulAndScale(new_size, (float *restrict)d_signal,
                            (float *restrict)d_filter_kernel);
// ^ this function must execute on device data

// Transform signal back
error = cufftExecC2C(plan, (cufftComplex *)d_signal,
                     (cufftComplex *)d_signal, CUFFT_INVERSE);

Page 27

OpenACC convolution code

void complexPointwiseMulAndScale(int n, float *restrict signal,
                                 float *restrict filter_kernel)
{
    // Multiply the coefficients together and normalize the result
    #pragma acc data deviceptr(signal, filter_kernel)
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++) {
            float ax = signal[2*i];
            float ay = signal[2*i+1];
            float bx = filter_kernel[2*i];
            float by = filter_kernel[2*i+1];
            float s  = 1.0f / n;
            float cx = s * (ax * bx - ay * by);
            float cy = s * (ax * by + ay * bx);
            signal[2*i]   = cx;
            signal[2*i+1] = cy;
        }
    }
}

Note: The PGI C compiler does not currently support structs in OpenACC loops, so we cast the Complex* pointers to float* pointers and use interleaved indexing.

Page 28

Linking CUFFT

#include "cufft.h"

Compiler command line options:

CUDA_PATH = /usr/local/pgi/linux86-64/2012/cuda/4.0
CCFLAGS   = -I$(CUDA_PATH)/include -L$(CUDA_PATH)/lib64 -lcudart -lcufft

Must use the PGI-provided CUDA toolkit paths.
Must link libcudart and libcufft.
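For reference, a hypothetical PGI compile line combining the OpenACC and CUFFT flags might look like this; the source file and output names are illustrative:

    pgcc -acc -Minfo=accel -I$(CUDA_PATH)/include cufft_acc.c \
         -L$(CUDA_PATH)/lib64 -lcudart -lcufft -o cufft_acc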

Page 29

Results

[harrism@kollman0 cufft-acc]$ ./cufft_acc
Transforming signal cufftExecC2C
Performing point-wise complex multiply and scale.
Transforming signal back cufftExecC2C
Performing Convolution on the host and checking correctness

Signal size: 500000, filter size: 33
Total Device Convolution Time: 11.461152 ms (0.242624 for point-wise convolution)
Test PASSED

Slide callouts: the total device time corresponds to the CUFFT calls plus cudaMemcpy; the point-wise portion is the OpenACC loop.

Page 30

Summary

Use the deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops.
Use host_data to get the device address for pointers inside acc data regions.

The same techniques shown here can be used to share device data between OpenACC loops and:

Your custom CUDA C/C++/Fortran/etc. device code
Any CUDA library that uses CUDA device pointers

Page 31

Thank you