Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, NVIDIA [email protected]


Page 1

Really Fast Introduction to CUDA and CUDA C

Jul 2013

Dale Southard, NVIDIA [email protected]

Page 2

About the Speaker

[Dale] is a senior solution architect with NVIDIA (I fix things). I primarily cover HPC in Gov/Edu/Research and cloud computing. In the past I was a HW architect in the LLNL systems group designing the vis/post-processing solutions.

Page 3

GeForce, Quadro, Tegra, Tesla

Page 4

The World Leader in Parallel Processing

Page 5

Why CUDA

Page 6

CUDA Accelerates Computing

Choose the right processor for the right task

CUDA GPU: thousands of parallel cores

CPU: several sequential cores

Page 7

Power is THE problem

Underlying changes in silicon technology scaling mean the “free ride” of the past several decades is over. Power will be the dominant constraint in computing.

Page 8

2 GIGAWATTS

Estimated amount of power for an exascale system using today’s x86 processors, equivalent to the maximum output of the Hoover Dam. DATA: U.S. Dept. of Energy

Page 9

GPU: <200 pJ/instruction. Optimized for throughput, with explicit management of on-chip memory.

CPU: >1000 pJ/instruction. Optimized for latency, with caches.

Page 10

The World’s Most Energy Efficient Supercomputer

3208 MFLOPS/Watt

128 Tesla K20 Accelerators

$100k Energy Savings / Yr

300 Tons of CO2 Saved / Yr

[Chart: MFLOPS/Watt (axis 0-3000): CINECA Eurora (Tesla K20) vs. NICS Beacon (greenest Xeon Phi system) vs. C-DAC (greenest CPU system)]

CINECA Eurora: “Liquid-Cooled” Eurotech Aurora Tigon, greener than the Xeon Phi and Xeon CPU systems

Page 11

Kepler - Greener: Twice The Science/Joule

Running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU). The green nodes contain dual E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K10, K20, or K20X GPUs (235W each).

Energy Expended = Power x Time. Lower is better.

Cut down energy usage by half with GPUs.

[Chart: Energy Expended (kJ) simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), comparing CPU Only, CPU + 2 K10s, CPU + 2 K20s, and CPU + 2 K20Xs]

Page 12

Tesla Kepler Family: World’s Fastest and Most Efficient HPC Accelerators

GPUs | Single Precision Peak (SGEMM) | Double Precision Peak (DGEMM) | Memory Size | Memory Bandwidth (ECC off) | System Solution | Target Workloads
K20X | 3.95 TF (2.90 TF) | 1.32 TF (1.22 TF) | 6 GB | 250 GB/s | Server only | Weather & Climate, Physics, BioChemistry, CAE, Material Science
K20 | 3.52 TF (2.61 TF) | 1.17 TF (1.10 TF) | 5 GB | 208 GB/s | Server + Workstation | Weather & Climate, Physics, BioChemistry, CAE, Material Science
K10 | 4.58 TF | 0.19 TF | 8 GB | 320 GB/s | Server only | Image, Signal, Video, Seismic

Page 13

GTX Titan: For High Performance Gaming Enthusiasts

CUDA Cores: 2688
Single Precision: ~4.5 Tflops
Double Precision: ~1.27 Tflops
Memory Size: 6 GB
Memory B/W: 288 GB/s

Page 14

What is CUDA

(six ways to saxpy)

Page 15

Programming GPUs

Three ways to accelerate applications: Libraries (the easiest approach, for 2x to 10x acceleration), Directives, and CUDA programming (maximum performance).

Page 16

Single precision Alpha X Plus Y (SAXPY)

Part of Basic Linear Algebra Subroutines (BLAS) Library

GPU SAXPY in multiple languages and libraries

A menagerie* of possibilities, not a tutorial

z = αx + y, where x, y, z are vectors and α is a scalar

*technically, a program chrestomathy: http://en.wikipedia.org/wiki/Chrestomathy

Page 17

Parallel Fortran Code:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
  !$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
  !$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

Parallel C Code:

void saxpy(int n, float a, float *x, float *y)
{
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

OpenACC Compiler Directives

http://developer.nvidia.com/openacc or http://openacc.org

Page 18

Serial BLAS Code:

int N = 1<<20;

...

// Use your choice of blas library

// Perform SAXPY on 1M elements

blas_saxpy(N, 2.0, x, 1, y, 1);

Parallel cuBLAS Code:

int N = 1<<20;

cublasInit();

cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);

cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

// Perform SAXPY on 1M elements

cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);

cublasShutdown();

CUBLAS Library


http://developer.nvidia.com/cublas

You can also call cuBLAS from Fortran, C++, Python, and other languages.
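The slide elides allocation of d_x and d_y. A minimal self-contained sketch, assuming the legacy cuBLAS v1 API shown above (cublasAlloc/cublasFree are the v1 device-allocation calls; error checking and input values are illustrative):

#include <cublas.h>
#include <stdlib.h>

int main(void) {
    int N = 1 << 20;
    float *x = (float *)malloc(N * sizeof(float));
    float *y = (float *)malloc(N * sizeof(float));
    float *d_x, *d_y;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();
    cublasAlloc(N, sizeof(float), (void **)&d_x);   // device allocation the slide elides
    cublasAlloc(N, sizeof(float), (void **)&d_y);
    cublasSetVector(N, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(N, sizeof(float), y, 1, d_y, 1);
    cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);           // d_y = 2*d_x + d_y on the device
    cublasGetVector(N, sizeof(float), d_y, 1, y, 1);
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    free(x); free(y);
    return 0;
}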

Page 19

Standard C Code:

void saxpy(int n, float a,

float *x, float *y)

{

for (int i = 0; i < n; ++i)

y[i] = a*x[i] + y[i];

}

int N = 1<<20;

// Perform SAXPY on 1M elements

saxpy(N, 2.0, x, y);

Parallel C Code (CUDA):

__global__

void saxpy(int n, float a,

float *x, float *y)

{

int i = blockIdx.x*blockDim.x + threadIdx.x;

if (i < n) y[i] = a*x[i] + y[i];

}

int N = 1<<20;

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

// Perform SAXPY on 1M elements

saxpy<<<4096,256>>>(N, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

CUDA C


http://developer.nvidia.com/cuda-toolkit
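A minimal runnable version of the parallel code above, filling in the allocations the slide elides with “...” (initial values are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {
    int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *x = (float *)malloc(bytes), *y = (float *)malloc(bytes);
    float *d_x, *d_y;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<4096,256>>>(N, 2.0f, d_x, d_y);   // 4096 blocks x 256 threads = 1M

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", y[0]); // 2*1 + 2
    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}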

Page 20

Serial C++ Code (STL + Boost.Lambda):

int N = 1<<20;

std::vector<float> x(N), y(N);

...

// Perform SAXPY on 1M elements

std::transform(x.begin(), x.end(),

y.begin(), y.begin(),

2.0f * _1 + _2);

Parallel C++ Code (Thrust):

int N = 1<<20;

thrust::host_vector<float> x(N), y(N);

...

thrust::device_vector<float> d_x = x;

thrust::device_vector<float> d_y = y;

// Perform SAXPY on 1M elements

thrust::transform(d_x.begin(), d_x.end(),

d_y.begin(), d_y.begin(),

2.0f * _1 + _2);

Thrust C++ Template Library


http://thrust.github.com www.boost.org/libs/lambda
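One detail the slide leaves implicit: the _1/_2 placeholders come from boost::lambda in the serial version and from thrust::placeholders in the Thrust version. A minimal compilable sketch of the Thrust path (initial values are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>   // thrust::placeholders

int main(void) {
    using namespace thrust::placeholders;
    int N = 1 << 20;
    thrust::host_vector<float> x(N, 1.0f), y(N, 2.0f);
    thrust::device_vector<float> d_x = x, d_y = y;   // host-to-device copies
    // d_y[i] = 2.0f * d_x[i] + d_y[i], fused into one kernel
    thrust::transform(d_x.begin(), d_x.end(),
                      d_y.begin(), d_y.begin(),
                      2.0f * _1 + _2);
    y = d_y;                                         // copy the result back
    return 0;
}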

Page 21

CUDA Fortran

Parallel Fortran (CUDA Fortran):

module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i<=n) y(i) = a*x(i)+y(i)
  end subroutine saxpy
end module mymodule

program main
  use cudafor; use mymodule
  real, device :: x_d(2**20), y_d(2**20)
  x_d = 1.0
  y_d = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
end program main

http://developer.nvidia.com/cuda-fortran

Standard Fortran:

module mymodule
contains
  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    do i=1,n
      y(i) = a*x(i)+y(i)
    enddo
  end subroutine saxpy
end module mymodule

program main
  use mymodule
  real :: x(2**20), y(2**20)
  x = 1.0
  y = 2.0
  ! Perform SAXPY on 1M elements
  call saxpy(2**20, 2.0, x, y)
end program main

Page 22

Python Copperhead: Parallel Python

http://copperhead.github.com

Parallel Python (Copperhead):

from copperhead import *
import numpy as np

@cu
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)

with places.gpu0:
    gpu_result = saxpy(2.0, x, y)

with places.openmp:
    cpu_result = saxpy(2.0, x, y)

Standard Python:

import numpy as np

def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)
cpu_result = saxpy(2.0, x, y)

http://numpy.scipy.org

Page 23

Enabling Endless Ways to SAXPY

Developers want to build front-ends for Java, Python, R, and DSLs, and to target other processors like ARM, FPGA, GPUs, and x86.

CUDA C, C++, and Fortran compile through the LLVM Compiler for CUDA to NVIDIA GPUs and x86 CPUs; the open compiler enables new language support and new processor support.

CUDA Compiler Contributed to Open Source LLVM

Page 24

CUDA Basics

Page 25

Heterogeneous Computing

Terminology:

Host: the CPU and its memory (host memory)

Device: the GPU and its memory (device memory)

Page 26

Heterogeneous Computing

#include <iostream>

#include <algorithm>

using namespace std;

#define N 1024

#define RADIUS 3

#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {

__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

int gindex = threadIdx.x + blockIdx.x * blockDim.x;

int lindex = threadIdx.x + RADIUS;

// Read input elements into shared memory

temp[lindex] = in[gindex];

if (threadIdx.x < RADIUS) {

temp[lindex - RADIUS] = in[gindex - RADIUS];

temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];

}

// Synchronize (ensure all the data is available)

__syncthreads();

// Apply the stencil

int result = 0;

for (int offset = -RADIUS ; offset <= RADIUS ; offset++)

result += temp[lindex + offset];

// Store the result

out[gindex] = result;

}

void fill_ints(int *x, int n) {

fill_n(x, n, 1);

}

int main(void) {

int *in, *out; // host copies of a, b, c

int *d_in, *d_out; // device copies of a, b, c

int size = (N + 2*RADIUS) * sizeof(int);

// Alloc space for host copies and setup values

in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);

out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

// Alloc space for device copies

cudaMalloc((void **)&d_in, size);

cudaMalloc((void **)&d_out, size);

// Copy to device

cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

// Launch stencil_1d() kernel on GPU

stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

// Copy result back to host

cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

// Cleanup

free(in); free(out);

cudaFree(d_in); cudaFree(d_out);

return 0;

}

(Slide annotations: serial host code, then parallel device code, then serial host code again; stencil_1d() is the parallel function)

Page 27

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

PCI Bus

Page 28

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

PCI Bus

Page 29

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

3. Copy results from GPU memory to CPU memory

PCI Bus
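In CUDA C, the three steps map directly onto runtime calls; a minimal sketch (the kernel name process and the buffers are hypothetical):

// 1. Copy input data from CPU memory to GPU memory
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
// 2. Load GPU program and execute
process<<<blocks, threads>>>(d_in, d_out);
// 3. Copy results from GPU memory to CPU memory
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);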

Page 30

CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU as an array of threads in parallel.

All threads execute the same code, but can take different paths.

Each thread has an ID, used to select input/output data and to make control decisions.

float x = input[threadIdx.x];

float y = func(x);

output[threadIdx.x] = y;
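Wrapped in a kernel, the fragment above might look like this (a hypothetical elementwise kernel; x*x stands in for func(x), and a single block is assumed so threadIdx.x alone is a unique ID):

__global__ void apply(const float *input, float *output) {
    float x = input[threadIdx.x];   // each thread selects its input by thread ID
    float y = x * x;                // stand-in for func(x)
    output[threadIdx.x] = y;        // each thread writes its own output element
}

// Launch with one block of 256 threads:
// apply<<<1, 256>>>(d_input, d_output);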

Page 31

CUDA Kernels: Subdivide into Blocks

Threads

Page 32

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Page 33

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Blocks are grouped into a grid

Page 34

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Blocks are grouped into a grid

A kernel is executed as a grid of blocks of threads

GPU

Page 35

Kernel Execution

• Each thread is executed by a CUDA core.

• Each kernel is executed on one device as a grid of blocks; multiple kernels can execute on a device at one time.

• Each block is executed by one SM (CUDA streaming multiprocessor) and does not migrate; several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources.

Page 36

Thread blocks allow cooperation

Threads may need to cooperate:

Cooperatively load/store memory that they all use

Share results with each other

Cooperate to produce a single result

Synchronize with each other (see the sketch below)
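A hedged sketch of all four kinds of cooperation in one kernel: a block cooperatively loads a tile into shared memory, synchronizes, and reduces it to a single result (assumes a power-of-two block size of 256):

__global__ void block_sum(const int *in, int *block_results) {
    __shared__ int temp[256];                       // shared by all threads in the block
    int tid = threadIdx.x;
    temp[tid] = in[blockIdx.x * blockDim.x + tid];  // cooperative load
    __syncthreads();                                // synchronize before sharing
    // Tree reduction: threads share partial results through shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            temp[tid] += temp[tid + stride];
        __syncthreads();
    }
    if (tid == 0)                                   // one thread writes the single result
        block_results[blockIdx.x] = temp[0];
}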

Page 37

Thread blocks allow scalability

Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs.

[Diagram: the same 8-block kernel grid launched on a device with 2 SMs (4 blocks queued per SM) and on a device with 4 SMs (2 blocks per SM)]

Page 38

Warps (extra credit)

Blocks are divided into 32-thread-wide units called warps. The size of a warp is implementation specific and can change in the future.

The SM creates, manages, schedules, and executes threads at warp granularity. Each warp consists of 32 threads with contiguous threadIds.

All threads in a warp execute the same instruction. If threads of a warp diverge, the warp serially executes each branch path taken.

When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into as few transactions as possible.
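A hedged illustration of divergence: below, even and odd threads of the same warp take different branches, so the warp runs both paths serially; branching on the warp index instead keeps each warp on a single path:

__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)          // neighbors land in different branches: divergent
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
    // Branching on ((i / 32) % 2) instead would give each warp one path.
}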

Page 39

Memory hierarchy

Thread:

Registers

Page 40

Memory hierarchy

Thread:

Registers

Local memory


Page 41

Memory hierarchy

Thread:

Registers

Local memory

Block of threads:

Shared memory

Page 42

Memory hierarchy

Thread:

Registers

Local memory

Block of threads:

Shared memory

All blocks:

Global memory
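A sketch of how these levels appear in CUDA C source (illustrative names; per-thread locals normally live in registers and spill to local memory only under pressure):

__device__ float g_table[256];          // global memory: visible to all blocks

__global__ void hierarchy_demo(float *out) {
    __shared__ float tile[256];         // shared memory: one copy per block
    float x;                            // register: private to each thread
    tile[threadIdx.x] = g_table[threadIdx.x];   // assumes blockDim.x == 256
    __syncthreads();
    x = tile[threadIdx.x] * 2.0f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}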

Page 43

CUDA C Basics (finally)

Page 44

Hello World!

int main(void) {

printf("Hello World!\n");

return 0;

}

Standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Output:

$ nvcc

hello_world.cu

$ a.out

Hello World!

$

Page 45

Hello World! with Device Code

__global__ void mykernel(void) {

}

int main(void) {

mykernel<<<1,1>>>();

printf("Hello World!\n");

return 0;

}

Two new syntactic elements…

Page 46

Hello World! with Device Code

__global__ void mykernel(void) {

}

CUDA C/C++ keyword __global__ indicates a function that:

Runs on the device

Is called from host code

nvcc separates source code into host and device components:

Device functions (e.g. mykernel()) are processed by the NVIDIA compiler

Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)

Page 47

Hello World! with Device Code

mykernel<<<1,1>>>();

Triple angle brackets mark a call from host code to device code

Also called a “kernel launch”

First parameter is the number of blocks

Second parameter is the number of threads in each block

Parameters can be scalars (int) or multidimensional (dim3)

That’s all that is required to execute a function on the GPU!
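A hedged sketch of a multidimensional launch with dim3 (a hypothetical 2D kernel over a width x height image):

__global__ void scale2d(float *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= 2.0f;
}

// Host-side launch (d_img previously allocated and filled):
int width = 1920, height = 1080;                 // example dimensions
dim3 threads(16, 16);                            // 256 threads per block, 16x16
dim3 blocks((width + 15) / 16, (height + 15) / 16);
scale2d<<<blocks, threads>>>(d_img, width, height);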

Page 48

Hello World! with Device Code

__global__ void mykernel(void) {

}

int main(void) {

mykernel<<<1,1>>>();

printf("Hello World!\n");

return 0;

}

mykernel() does nothing; somewhat anticlimactic!

Output:

$ nvcc hello.cu

$ a.out

Hello World!

$

Page 49

Parallel Programming in CUDA C/C++

But wait… GPU computing is about massive parallelism!

We need a more interesting example…

We’ll start by adding two integers and build up to vector addition.

[Diagram: a + b = c]

Page 50

Addition on the Device

A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {

*c = *a + *b;

}

As before, __global__ is a CUDA C/C++ keyword meaning add() will execute on the device and will be called from the host.

Page 51

Addition on the Device

Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {

*c = *a + *b;

}

add() runs on the device, so a, b, and c must point to device memory. We need to allocate memory on the GPU.

Page 52

Memory Management

Host and device memory are separate entities

Device pointers point to GPU memory

May be passed to/from host code

May not be dereferenced in host code

Host pointers point to CPU memory

May be passed to/from device code

May not be dereferenced in device code

Simple CUDA API for handling device memory

cudaMalloc(), cudaFree(), cudaMemcpy()

Similar to the C equivalents malloc(), free(), memcpy()
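A sketch of the round trip with the error checking the slides omit for brevity (every CUDA runtime call returns a cudaError_t):

int h_val = 42, h_out = 0;
int *d_val;
cudaError_t err = cudaMalloc((void **)&d_val, sizeof(int));
if (err != cudaSuccess)
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
cudaMemcpy(d_val, &h_val, sizeof(int), cudaMemcpyHostToDevice);
// ... launch kernels that read/write d_val ...
cudaMemcpy(&h_out, d_val, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_val);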

Page 53

Addition on the Device: add()

Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {

*c = *a + *b;

}

Let’s take a look at main()…

Page 54

Addition on the Device: main()

int main(void) {

int a, b, c; // host copies of a, b, c

int *d_a, *d_b, *d_c; // device copies of a, b, c

int size = sizeof(int);

// Allocate space for device copies of a, b, c

cudaMalloc((void **)&d_a, size);

cudaMalloc((void **)&d_b, size);

cudaMalloc((void **)&d_c, size);

// Setup input values

a = 2;

b = 7;

Page 55

Addition on the Device: main()

// Copy inputs to device

cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU

add<<<1,1>>>(d_a, d_b, d_c);

// Copy result back to host

cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

return 0;

}

Page 56

Moving to Parallel

GPU computing is about massive parallelism

So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

Instead of executing add() once, execute N times in parallel

Page 57

Vector Addition on the Device

With add() running in parallel we can do vector addition

Terminology: each parallel invocation of add() is referred to as a block

The set of blocks is referred to as a grid

Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {

c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

}

By using blockIdx.x to index into the array, each block handles a different index.

Page 58

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {

c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

}

On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Page 59

Vector Addition on the Device: add()

Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {

c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

}

Let’s take a look at main()…

Page 60

Vector Addition on the Device: main()

#define N 512

int main(void) {

int *a, *b, *c; // host copies of a, b, c

int *d_a, *d_b, *d_c; // device copies of a, b, c

int size = N * sizeof(int);

// Alloc space for device copies of a, b, c

cudaMalloc((void **)&d_a, size);

cudaMalloc((void **)&d_b, size);

cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values

a = (int *)malloc(size); random_ints(a, N);

b = (int *)malloc(size); random_ints(b, N);

c = (int *)malloc(size);

Page 61

Vector Addition on the Device: main()

// Copy inputs to device

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks

add<<<N,1>>>(d_a, d_b, d_c);

// Copy result back to host

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup

free(a); free(b); free(c);

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

return 0;

}

Page 62

Review (1 of 2)

Difference between host and device

Host CPU

Device GPU

Using __global__ to declare a function as device code

Executes on the device

Called from the host

Passing parameters from host code to a device function

Page 63

Review (2 of 2)

Basic device memory management cudaMalloc()

cudaMemcpy()

cudaFree()

Launching parallel kernels

Launch N copies of add() with add<<<N,1>>>(…);

Use blockIdx.x to access block index

Page 64

CUDA Threads

Terminology: a block can be split into parallel threads

Let’s change add() to use parallel threads instead of parallel blocks

We use threadIdx.x instead of blockIdx.x

Need to make one change in main()…

__global__ void add(int *a, int *b, int *c) {

c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];

}

Page 65

Vector Addition Using Threads: main()

#define N 512

int main(void) {

int *a, *b, *c; // host copies of a, b, c

int *d_a, *d_b, *d_c; // device copies of a, b, c

int size = N * sizeof(int);

// Alloc space for device copies of a, b, c

cudaMalloc((void **)&d_a, size);

cudaMalloc((void **)&d_b, size);

cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values

a = (int *)malloc(size); random_ints(a, N);

b = (int *)malloc(size); random_ints(b, N);

c = (int *)malloc(size);

Page 66

Vector Addition Using Threads: main()

// Copy inputs to device

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads

add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup

free(a); free(b); free(c);

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

return 0;

}

Page 67

Combining Blocks and Threads

We’ve seen parallel vector addition using:

Many blocks with one thread each

One block with many threads

Let’s adapt vector addition to use both blocks and threads

Why? We’ll come to that…

First let’s discuss data indexing…

Page 68: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Indexing Arrays with Blocks and Threads

No longer as simple as using blockIdx.x and threadIdx.x alone. Consider indexing an array with one element per thread (8 threads/block):

[Diagram: four consecutive blocks of 8 elements; threadIdx.x runs 0-7 within each block, blockIdx.x runs 0-3]

With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;

Page 69: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Indexing Arrays: Example

Which thread will operate on the red element?

[Diagram: 32-element array, global indices 0-31; the red element is index 21, i.e. threadIdx.x = 5 in the block with blockIdx.x = 2, with M = 8]

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;

Page 70: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Vector Addition with Blocks and Threads

What changes need to be made in main()?

Use the built-in variable blockDim.x for threads per block

int index = threadIdx.x + blockIdx.x * blockDim.x;

Combined version of add() using parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {

int index = threadIdx.x + blockIdx.x * blockDim.x;

c[index] = a[index] + b[index];

}

Page 71: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Addition with Blocks and Threads: main()

#define N (2048*2048)

#define THREADS_PER_BLOCK 512

int main(void) {

int *a, *b, *c; // host copies of a, b, c

int *d_a, *d_b, *d_c; // device copies of a, b, c

int size = N * sizeof(int);

// Alloc space for device copies of a, b, c

cudaMalloc((void **)&d_a, size);

cudaMalloc((void **)&d_b, size);

cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values

a = (int *)malloc(size); random_ints(a, N);

b = (int *)malloc(size); random_ints(b, N);

c = (int *)malloc(size);

Page 72: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Addition with Blocks and Threads: main()

// Copy inputs to device

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU

add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup

free(a); free(b); free(c);

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

return 0;

}

Page 73: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Handling Arbitrary Vector Sizes

Typical problems are not friendly multiples of blockDim.x, so pass n and avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {

int index = threadIdx.x + blockIdx.x * blockDim.x;

if (index < n)

c[index] = a[index] + b[index];

}

Update the kernel launch to round the block count up: add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
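For example, with N = 1000 and M = 256, the launch computes (1000 + 255) / 256 = 4 blocks, i.e. 1024 threads; the if (index < n) guard simply disables the final 24 threads.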

Page 74: Really Fast Introduction to CUDA and CUDA C · Really Fast Introduction to CUDA and CUDA C Jul 2013 Dale Southard, ... Library GPU SAXPY in ... CUBLAS Library

Why Bother with Threads?

Threads seem unnecessary; they add a level of complexity. What do we gain?

Unlike parallel blocks, threads have mechanisms to:

Communicate

Synchronize

To look closer, we need a new example… (the 1D stencil on Page 26 is one such example: its threads cooperatively load shared memory and synchronize with __syncthreads())

Page 75

Need More?

Get CUDA: www.nvidia.com/getcuda

Nsight: www.nvidia.com/nsight

Programming Guide/Best Practices… docs.nvidia.com

Questions: NVIDIA Developer forums devtalk.nvidia.com

Search or ask on www.stackoverflow.com/tags/cuda

General: www.nvidia.com/cudazone

Page 76

Page 77

BACKUP

Page 78

Quantum Chemistry Applications

Abinit
Features supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
GPU perf: 1.3-2.7X
Release status: Released; Version 7.0.5
Notes: Multi-GPU support; www.abinit.org

ACES III
Features supported: Integrating scheduling GPU into SIAL programming language and SIP runtime environment
GPU perf: 10X on kernels
Release status: Under development
Notes: Multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
Features supported: Fock Matrix, Hessians
GPU perf: TBD
Release status: Pilot project completed, under development
Notes: Multi-GPU support; www.scm.com

BigDFT
Features supported: DFT; Daubechies wavelets, part of Abinit
GPU perf: 5-25X (1 CPU core to GPU kernel)
Release status: Released June 2009, current release 1.6.0
Notes: Multi-GPU support; http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
Features supported: TBD
GPU perf: TBD
Release status: Under development, Spring 2013 release
Notes: Multi-GPU support; http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
Features supported: DBCSR (sparse matrix multiply library)
GPU perf: 2-7X
Release status: Under development
Notes: Multi-GPU support; http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
Features supported: Libqc with Rys Quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012
GPU perf: 1.3-1.6X; 2.3-2.9X HF
Release status: Released; next release Q4 2012
Notes: Multi-GPU support; http://www.msg.ameslab.gov/gamess/index.html

GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Page 79

Quantum Chemistry Applications (continued)

GAMESS-UK
Features supported: (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics
GPU perf: 8x
Release status: Release in 2012
Notes: Multi-GPU support; http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
Features supported: Joint PGI, NVIDIA & Gaussian collaboration
GPU perf: TBD
Release status: Under development; announced Aug. 29, 2011
Notes: Multi-GPU support; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
Features supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)
GPU perf: 8x
Release status: Released
Notes: Multi-GPU support; https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC)

Jaguar
Features supported: Investigating GPU acceleration
GPU perf: TBD
Release status: Under development
Notes: Multi-GPU support; Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
Features supported: CUBLAS support
GPU perf: 1.1x
Release status: Released, Version 7.8
Notes: Single GPU; additional GPU support coming in Version 8; www.molcas.org

MOLPRO
Features supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
GPU perf: 1.7-2.3X projected
Release status: Under development
Notes: Multiple GPUs; www.molpro.net; Hans-Joachim Werner

GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Page 80

Quantum Chemistry Applications (continued)

MOPAC2009
Features supported: Pseudodiagonalization, full diagonalization, and density matrix assembling
GPU perf: 3.8-14X
Release status: Under development
Notes: Single GPU; academic port; http://openmopac.net

NWChem
Features supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
GPU perf: 3-10X projected
Release status: Release targeting March 2013
Notes: Multiple GPUs; development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
Features supported: DFT and TDDFT
GPU perf: TBD
Release status: Released
Notes: http://www.tddft.org/programs/octopus/

PEtot
Features supported: Density functional theory (DFT) plane wave pseudopotential calculations
GPU perf: 6-10X
Release status: Released
Notes: Multi-GPU; first-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
Features supported: RI-MP2
GPU perf: 8x-14x
Release status: Released, Version 4.0
Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Page 81

Quantum Chemistry Applications (continued)

QMCPACK
Features supported: Main features
GPU perf: 3-4x
Release status: Released
Notes: Multiple GPUs; NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
Features supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
GPU perf: 2.5-3.5x
Release status: Released, Version 5.0
Notes: Multiple GPUs; created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
Features supported: “Full GPU-based solution”
GPU perf: 44-650X vs. GAMESS CPU version
Release status: Released, Version 1.5
Notes: Multi-GPU/single node; completely redesigned to exploit GPU parallelism; YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
Features supported: Hybrid Hartree-Fock DFT functionals including exact exchange
GPU perf: 2x (2 GPUs comparable to 128 CPU cores)
Release status: Available on request
Notes: Multiple GPUs; by Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
Features supported: Generalized Wang-Landau method
GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
Release status: Under development
Notes: Multi-GPU support; NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Page 82

Viz, “Docking” and Related Applications Growing

Amira 5®
Features supported: 3D visualization of volumetric data and surfaces
GPU perf: 70x
Release status: Released, Version 5.3.3
Notes: Single GPU; visualization from Visage Imaging; next release, 5.4, will use the GPU for general-purpose processing in some functions; http://www.visageimaging.com/overview.html

BINDSURF
Features supported: Allows fast processing of large ligand databases
GPU perf: 100X
Release status: Available upon request to authors
Notes: Single GPU; high-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
Features supported: Empirical free energy forcefield
GPU perf: 6.5-13.4X
Release status: Released
Notes: Single GPU; University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
Features supported: GPU-accelerated application
GPU perf: 3.75-5000X
Release status: Released, Suite 2011
Notes: Single and multi-GPUs; Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
Features supported: Real-time shape similarity searching/comparison
GPU perf: 800-3000X
Release status: Released
Notes: Single and multi-GPUs; OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMol
Features supported: Lines (460% increase), cartoons (1246%), surface (1746%), spheres (753%), ribbon (426%)
GPU perf: 1700x
Release status: Released, Version 1.5
Notes: Single GPU; http://pymol.org/

VMD
Features supported: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals
GPU perf: 100-125X or greater on kernels
Release status: Released, Version 1.9
Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

GPU perf compared against multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Page 83

Bioinformatics Applications

BarraCUDA
Features supported: Alignment of short sequencing reads
GPU speedup: 6-10x
Release status: Version 0.6.2, 3/2012; multi-GPU, multi-node
Website: http://seqbarracuda.sourceforge.net/

CUDASW++
Features supported: Parallel search of Smith-Waterman database
GPU speedup: 10-50x
Release status: Version 2.0.8, Q1/2012; multi-GPU, multi-node
Website: http://sourceforge.net/projects/cudasw/

CUSHAW
Features supported: Parallel, accurate long-read aligner for large genomes
GPU speedup: 10x
Release status: Version 1.0.40, 6/2012; multiple GPUs
Website: http://cushaw.sourceforge.net/

GPU-BLAST
Features supported: Protein alignment according to BLASTP
GPU speedup: 3-4x
Release status: Version 2.2.26, 3/2012; single GPU
Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
Features supported: Parallel local and global search of Hidden Markov Models
GPU speedup: 60-100x
Release status: Version 2.3.2, Q1/2012; multi-GPU, multi-node
Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
Features supported: Scalable motif discovery algorithm based on MEME
GPU speedup: 4-10x
Release status: Version 3.0.12; multi-GPU, multi-node
Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
Features supported: Hardware and software for reference assembly, BLAST, SW, HMM, de novo assembly
GPU speedup: 400x
Release status: Released; multi-GPU, multi-node
Website: http://www.seqnfind.com/

UGENE
Features supported: Fast short-read alignment
GPU speedup: 6-8x
Release status: Version 1.11, 5/2012; multi-GPU, multi-node
Website: http://ugene.unipro.ru/

WideLM
Features supported: Parallel linear regression on multiple similarly-shaped models
GPU speedup: 150x
Release status: Version 0.1-1, 3/2012; multi-GPU, multi-node
Website: http://insilicos.com/products/widelm

GPU perf compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

Page 84

Overview

GPU performance and scalability across multiple scientific domains and multiple algorithmic motifs

Different approaches taken (libraries, OpenACC, programming languages)

Lessons learned

Page 85

LSMS – Materials Science

LSMS Fe32 (32 atoms per node). Relative performance (FLOPS) vs. dual-socket E5-2687w 3.10 GHz Sandy Bridge.

[Chart, performance relative to 2xCPU: 1xCPU 0.6; 2xCPU 1.0; single-socket + K20X: 1xCPU+1xGPU 3.2, 1xCPU+2xGPU 6.1; dual-socket + M2090: 2xCPU+1xGPU 1.3, 2xCPU+2xGPU 2.6; dual-socket + K20X: 2xCPU+1xGPU 3.2, 2xCPU+2xGPU 6.2]

Page 86

LSMS – Materials Science

LSMS multi-node scalability, Fe32 (32 atoms per node). GFLOPS vs. dual-socket E5-2687w 3.10 GHz Sandy Bridge.

[Chart: GFLOP/s (log scale) vs. # nodes (log scale), 1 to 8 nodes; series: dual E5-2687w/node, 1 M2090/node, 2 M2090/node, 1 K20X/node, 2 K20X/node; 5x vs. dual-socket Sandy Bridge @ 8 nodes]

Page 87

gWL-LSMS – Materials Science

gWL-LSMS Fe1024 (1024 atoms per 32 nodes). Relative scaling (FLOPS) on Cray XK7.

[Chart: relative scaling (log scale) vs. # nodes (log scale), 16 to 16,384 nodes; series: CPU, M2090, K20X; 7.01x vs. CPU @ 64 nodes]

“CPU” node = XK7 (1x Interlagos); “M2090” node = XK6 (1x M2090 + 1x Interlagos); “K20X” node = XK7 (1x K20X + 1x Interlagos)

Page 88

Application Power Efficiency of the Cray XK7

WL-LSMS for CPU-only and accelerated computing:

• Runtime is 8.6X faster for the accelerated code
• Energy consumed is 7.3X less: the GPU-accelerated code consumed 3,500 kW-hr, the CPU-only code 25,700 kW-hr

Power consumption traces for identical WL-LSMS runs with 1024 Fe atoms, 16 atoms/node, on 18,561 Titan nodes (99% of Titan).

Page 89

SPECFEM3D – Earth Science

SPECFEM3D meshfed3D-256x128x15. Relative performance (solver time) vs. E5-2687w 3.10 GHz Sandy Bridge.

[Chart, performance relative to 2xCPU: 1xCPU 0.66; 2xCPU 1.00; single-socket + K20X: 1xCPU+1xGPU 8.85, 1xCPU+2xGPU 13.80; dual-socket + M2090: 2xCPU+1xGPU 5.41, 2xCPU+2xGPU 9.27; dual-socket + K20X: 2xCPU+1xGPU 8.84, 2xCPU+2xGPU 12.83]

Page 90

SPECFEM3D – Earth Science

SPECFEM3D on Sandy Bridge, 512x512x20, 5 M elastic cells, 1000 timesteps. Relative scaling (solver time).

[Chart: relative scaling vs. # nodes, 0 to 288; series: K20X, XE6, XC30; 2.75x vs. XC30 @ 256 nodes]

“K20X” node = XK7 (1x K20X + 1x Interlagos); “XE6” node = XE6 (2x Interlagos); “XC30” node = XC30 (2x Sandy Bridge)

Page 91

SPECFEM3D – Earth Science

SPECFEM3D on Titan, 2048x2048x15, 33 M cells, 2000 timesteps. Relative scaling (solver time) on Cray XK7.

[Chart: relative scaling vs. # nodes, up to 9,216; series: K20X, CPU; 2.45x vs. CPU @ 2048 nodes]

“CPU” node = XK7 (1x Interlagos); “K20X” node = XK7 (1x K20X + 1x Interlagos)

Page 92

Chroma (Lattice QCD) – High Energy & Nuclear Physics

Chroma 24^3 x 128 lattice. Relative performance (propagator) vs. E5-2687w 3.10 GHz Sandy Bridge.

[Chart, performance relative to 2xCPU: 1xCPU 0.5; 2xCPU 1.0; single-socket + K20X: 1xCPU+1xGPU 3.7, 1xCPU+2xGPU 6.8; dual-socket + M2090: 2xCPU+1xGPU 2.8, 2xCPU+2xGPU 5.5; dual-socket + K20X: 2xCPU+1xGPU 3.5, 2xCPU+2xGPU 6.7]

Page 93

Chroma (Lattice QCD) – High Energy & Nuclear Physics

Chroma 48^3 x 512 lattice. Relative scaling (application time).

[Chart: relative scaling vs. # nodes, 0 to 1280; series: XK7 (K20X) with BiCGStab, XK7 (K20X) with DD+GCR, XE6 (2x Interlagos); 3.58x vs. XE6 @ 1152 nodes]

“XK7” node = XK7 (1x K20X + 1x Interlagos); “XE6” node = XE6 (2x Interlagos)

Page 94

QMCPACK – Materials Science

QMCPACK Graphite 4x4x1. Relative performance vs. E5-2687w 3.10 GHz Sandy Bridge.

[Chart, performance relative to 2xCPU: 1xCPU 0.44; 2xCPU 1.00; single-socket: 1xCPU+1xM2090 2.86, 1xCPU+1xK20X 4.05; dual-socket + M2090: 2xCPU+1xGPU 2.84, 2xCPU+2xGPU 5.60; dual-socket + K20X: 2xCPU+1xGPU 4.01, 2xCPU+2xGPU 7.97]

Page 95

QMCPACK – Materials Science

QMCPACK 3x3x1 Graphite. Relative scaling (efficiency) on Cray XK7.

[Chart: relative scaling (log scale) vs. # nodes (log scale), 1 to 4096; series: K20X, CPU; 3.97x vs. CPU @ 2048 nodes]

“CPU” node = XK7 (1x Interlagos); “K20X” node = XK7 (1x K20X + 1x Interlagos)

Page 96

Single-Node Performance Summary

Relative performance vs. dual-socket E5-2687w 3.10 GHz Sandy Bridge (2xCPU = 1.0):

NAMD: 1xCPU + K20X = 2.5; 1xCPU + 2xK20X = 3.5
WL-LSMS: 1xCPU + K20X = 3.2; 1xCPU + 2xK20X = 6.1
Chroma: 1xCPU + K20X = 3.7; 1xCPU + 2xK20X = 6.8
AMBER: 1xCPU + K20X = 7.2; 1xCPU + 2xK20X = 8.5
SPECFEM3D: 1xCPU + K20X = 8.9; 1xCPU + 2xK20X = 13.8

Page 97


Summary

Performance

Scalability

Variety of programming approaches

Futures