HPC on Cloud (류현곤, Senior Manager, NVIDIA)
NVIDIA CONFIDENTIAL
Agenda
VDI, GPU on Cloud
CUDA
OpenACC
CUDA Kepler Architecture
CUDA on ARM
CUDA library
cuBLAS
cuFFT
pyCUDA
VDI, GPU on CLOUD
Night and Day Difference: without GPU vs. with GPU
Iray Photorealism
Iray Photograph
NVIDIA GRID Enabled Virtual Desktop (VDI)
[Diagram: virtual desktops run in virtual machines with the NVIDIA driver, on an NVIDIA GRID-enabled hypervisor backed by NVIDIA GRID GPUs]
VIRTUAL DESKTOP (virtualized GPU) vs. VIRTUAL REMOTE WORKSTATION (dedicated GPU)
DESIGNER: PTC, ANSYS, MSC Patran, Siemens NX 8.5, CATIA, DELMIA, SIMULIA
POWER USER: PLM, Factory Floor Work Instructions, TechPubs, SolidWorks, AutoDesk, Adobe CS Visualization
KNOWLEDGE WORKER: MS Office, Photoshop
Graphics Options in Virtualization
[Chart: application compatibility vs. number of CCUs]
SoftPC VDI (w/o GPU)
Shim Graphics (View sVGA, RemoteFX)
Dedicated GPU (HDX 3DPro, View vDGA)
NVIDIA GRID vGPU (XDT 7.1)
GRAPHICS ACCELERATED VDI
NVIDIA GRID
[Diagram: NVIDIA GRID architecture]
Hypervisor side: NVIDIA GRID GPU with GPU MMU, GPU hypervisor, hypervisor device emulation framework, Virtual GPU Manager, Resource Manager.
Virtual machine side: guest OS with the NVIDIA GRID USM driver, virtual desktop apps, remote display (remote protocol carrying state and graphics commands).
Per-VM dedicated channels connect each VM to the GPU.
vGPU (VGX hypervisor): partitions GPU memory and allocates it to VMs; time-slices GPU cores as needed for processing.
NVIDIA GRID K1 (shipping now)
GPU: 4 Kepler GPUs
CUDA cores: 768 (192 / GPU)
Memory size: 16 GB DDR3 (4 GB / GPU)
Power: 130 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 6-pin connector
PCIe: x16, Gen3 (Gen2 compatible)
Cooling solution: passive
# of users: 4 to 100 (1)
OpenGL: 4.x
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution.
GRID K1 = 4x Quadro K600
NVIDIA GRID K2 (shipping now)
GPU: 2 high-end Kepler GPUs
CUDA cores: 3072 (1536 / GPU)
Memory size: 8 GB GDDR5
Power: 225 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 8-pin connector
PCIe: x16, Gen3 (Gen2 compatible)
Cooling solution: passive
# of users: 2 to 64 (1)
OpenGL: 4.3
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution.
GRID K2 = 2x Quadro K5000
CUDA 101
The Era of Accelerated Computing is Here
[Timeline, 1980 to 2020: era of vector computing, era of distributed computing, era of accelerated computing]
GPU Roadmap
[Chart: DP GFLOPS per watt (0.5 to 32, log scale) by year, 2008 to 2014+]
Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: Unified Virtual Memory
Volta: Stacked DRAM
Agenda
OpenACC
Kepler Architecture
CUDA 5.5 features
OpenACC
Application Code
The GPU runs the parallelized compute-intensive functions; the rest of the sequential code stays on the CPU.
Directives: Easy, Open & Powerful
Real-Time Object Detection (global manufacturer of navigation systems): 5x in 40 hours
Valuation of Stock Portfolios using Monte Carlo (global technology consulting company): 2x in 4 hours
Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the global manufacturer of navigation systems
OpenACC Directives

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

Your original Fortran or C code plus simple compiler hints: the OpenACC compiler parallelizes the code. Works on many-core GPUs and multicore CPUs.
Can you mix CUDA and OpenACC? Yes; you can even use CUDA to manage memory.
2 Basic Steps to Get Started
Step 1: Annotate source code with directives:

!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i)
!$acc parallel loop
…
!$acc end parallel
!$acc end data

Step 2: Compile & run:

pgf90 -ta=nvidia -Minfo=accel file.f
OpenACC Directives Example

!$acc data copy(A,Anew)                    ! copy arrays into GPU memory for the data region
iter = 0
do while ( err > tol .and. iter < iter_max )
  iter = iter + 1
  err = 0._fp_kind
!$acc kernels                              ! parallelize the code inside the region
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * ( A(i+1,j) + A(i-1,j) &
                                + A(i,j-1) + A(i,j+1) )
      err = max( err, Anew(i,j) - A(i,j) )
    end do
  end do
!$acc end kernels                          ! close off the parallel region
  if (mod(iter,100) == 0 .or. iter == 1) print *, iter, err
  A = Anew
end do
!$acc end data                             ! close off the data region, copy data back
Kepler Architecture
Kepler GK110 Architecture (Block Diagram)
7.1B Transistors
15 SMX units
> 1 TFLOP FP64
1.5 MB L2 Cache
384-bit GDDR5
PCI Express Gen3
Power vs Clock Speed Example
Fermi, 2x clock: one unit processes A then B. Logic: area 1.0x, power 1.0x. Clocking: area 1.0x, power 1.0x.
Kepler, 1x clock: two units process A and B in parallel. Logic: area 1.8x, power 0.9x. Clocking: area 1.0x, power 0.5x.
GPU Performance
Theoretical peak depends on:
  ratio of DP (1/2, 1/3, 1/8, 1/16)
  # of CUDA cores / MP (8, 32, 48, 192)
  # of MPs / GPU (1, 2, 7, 8, 13, 14, 15)
  # of FMA (1, 2)
  clock in MHz (704, 870, 1200)
Example peak: DP 1.2 TFLOPS, SP 2.4 TFLOPS
Measured performance depends on:
  DGEMM ratio (75%, 85%, 95%)
  CPU + GPU DGEMM
  DTRSM ratio
  MPI ratio (50%, 90%)
Example measured: DGEMM 1117 GFLOPS, DP 880 GFLOPS
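Multiplying the listed factors gives the peak numbers. A sketch, using K20-class values as an assumed example (13 MPs, 192 cores per MP, 706 MHz, DP ratio 1/3):

```python
def peak_gflops(num_mp, cores_per_mp, clock_mhz, flops_per_clock=2):
    """Peak single-precision GFLOPS: MPs x cores/MP x FMA (2 flops) x clock."""
    return num_mp * cores_per_mp * flops_per_clock * clock_mhz / 1000.0

sp = peak_gflops(13, 192, 706)   # ~3524 GFLOPS SP
dp = sp / 3                      # Kepler's DP ratio is 1/3 -> ~1.2 TFLOPS DP
print(round(sp), round(dp))
```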
New CUDA-HPL (N = 50,000)
(>50 ms) memcpyH2D (1024 x 50000 doubles)
Loop ~50 times (50,000 / 1024):
  (>1 ms) Gather (1 kernel, read/write 1024x1024 doubles)
  (>2 ms) Scatter (1 kernel, read/write 1024x1024 doubles)
  (>4 ms) cublasDtrsm (31 kernels, A=1024x1024, B=1024x1024)
  (>1 ms) Transpose (1 kernel, read/write 1024x1024 doubles)
  (>110 ms) cublasDgemm (1 or 2 kernels, A=1024x50000, B=1024x1024, C=1024x50000)
(>50 ms) memcpyD2H (1024 x 50000 doubles)
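The transfer times above imply the PCIe bandwidth; a quick check of the host-to-device copy (treating the ">50 ms" lower bound as exactly 50 ms):

```python
# 1024 x 50000 doubles, copied host-to-device in ~50 ms
bytes_moved = 1024 * 50000 * 8             # = 409.6 MB
bandwidth_gb_s = bytes_moved / 0.050 / 1e9
print(round(bandwidth_gb_s, 1))            # ~8.2 GB/s, consistent with PCIe Gen2 x16
```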
New Instruction: SHFL
Data exchange between threads within a warp; avoids use of shared memory; one 32-bit value per exchange.
4 variants:
__shfl(): indexed any-to-any
__shfl_up(): shift right to nth neighbour
__shfl_down(): shift left to nth neighbour
__shfl_xor(): butterfly (XOR) exchange
[Diagram: an 8-lane warp a b c d e f g h permuted by each variant]
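The four variants can be emulated on a toy 8-lane warp to see the data movement (a pure-Python sketch of the semantics only; the real intrinsics exchange registers across the 32 threads of a warp, and an out-of-range lane returns its own value):

```python
def shfl(warp, src):        # __shfl: indexed any-to-any, all lanes read lane `src`
    return [warp[src] for _ in warp]

def shfl_up(warp, delta):   # __shfl_up: lane i reads lane i - delta
    return [warp[i - delta] if i >= delta else warp[i]
            for i in range(len(warp))]

def shfl_down(warp, delta): # __shfl_down: lane i reads lane i + delta
    n = len(warp)
    return [warp[i + delta] if i + delta < n else warp[i]
            for i in range(n)]

def shfl_xor(warp, mask):   # __shfl_xor: butterfly exchange, lane i reads lane i ^ mask
    return [warp[i ^ mask] for i in range(len(warp))]

warp = list("abcdefgh")
print(shfl_up(warp, 2))     # ['a', 'b', 'a', 'b', 'c', 'd', 'e', 'f']
print(shfl_down(warp, 2))   # ['c', 'd', 'e', 'f', 'g', 'h', 'g', 'h']
print(shfl_xor(warp, 1))    # ['b', 'a', 'd', 'c', 'f', 'e', 'h', 'g']
```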
Texture Cache Unlocked
Added a new path for compute: avoids the texture unit, allows a global address to be fetched and cached, eliminates texture setup.
Why use it? Separate pipeline from shared/L1; highest miss bandwidth; flexible, e.g. unaligned accesses.
Managed automatically by the compiler; "const __restrict" indicates eligibility.
[Diagram: SMX texture units front a read-only data cache backed by L2]
const __restrict Example

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use texture for "input"
    output[offset] = (input[offset] * x) + y;
}

Annotate eligible kernel parameters with const __restrict; the compiler automatically maps the loads to the read-only data cache path.
What is Dynamic Parallelism?
The ability to launch new grids from the GPU: dynamically, simultaneously, independently.
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
What does it mean? The GPU shifts from co-processor to autonomous, dynamic parallelism.
CUDA 5.5
ARM: Fastest Growing CPU
[Chart, 1993 to 2013: processor market share (0 to 100%) and processor shipments (0 to 5 billion) for x86 and Cortex processors shipped]
Source: Mercury Research, ARM, internal estimates
ARM + CUDA Development Kits
CARMA MXM: MXM slot, Q1000M compatible; shipping now
Kayla mITX: PCIe slot, compatible with any GPU (K20X compatible); shipping now
Kayla MXM: MXM slot; coming soon

Unpacking the Kayla
Order from SECO (Italy), about $600 each: board with Kepler GPU, mATX power supply, RS-232 cable to a PC.
Login ID: ubuntu, password: ubuntu
CUDA: Smarter Robots to Energy-Efficient Supercomputers
Building self-aware robots; 35% more energy efficient.
Hyper-Q
[Charts: GPU utilization (%) over time, without Hyper-Q vs. with Hyper-Q]
Multi-Process Server: Required for Hyper-Q / MPI
[Diagram: CUDA MPI ranks 0 to 3 share the GPU through one CUDA server process]
$ mpirun -np 4 my_cuda_app
No application re-compile to share the GPU; no user configuration needed; can be preconfigured by the SysAdmin.
MPI ranks using CUDA are clients; the server spawns on demand, one per user.
One job per user: no isolation between MPI ranks; exclusive-process mode enforces a single server.
One GPU per rank: no cudaSetDevice(); only CUDA device 0 is visible.
Strong Scaling of CP2K on Cray XK7
Hyper-Q with multiple MPI ranks leads to a 2.5x speedup over a single MPI rank using the GPU.
(Blog post by Peter Messmer of NVIDIA)
Stream Priorities Accelerate the Critical Path
[Timelines: with no priorities, Kernel X launched on stream 2 waits behind kernels A, B, and C on stream 1; with a high-priority stream 2, Kernel X runs as soon as it is launched]
Especially useful when Kernel X generates data for MPI_Send().
cuBLAS
CUDA-accelerated library of the BLAS routines for matrix operations
Included in recent CUDA releases
CUDA-HPL uses the cuBLAS library
cuBLAS Model
Allocate GPU memory, memcpy CPU to GPU, compute on the GPU, memcpy GPU back to CPU, free.
cuBLAS call steps
GPU memory alloc:
cublasAlloc( n_size, memSize, (void**)&GPU_ptr );
GPU memory upload (set):
cublasSetVector( n_size, memSize, CPU_ptr, 1, GPU_ptr, 1 );
GPU compute:
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
GPU memory download (get):
cublasGetVector( n_size, memSize, GPU_ptr, 1, CPU_ptr, 1 );
Build: nvcc mysrc.c -lcuda -lcublas
Main code in cuBLAS
h_A = (float*) malloc(n2 * sizeof(h_A[0]));
h_B = (float *) malloc(n2 * sizeof(h_B[0]));
h_C = (float *) malloc(n2 * sizeof(h_C[0]));
cublasAlloc (n2, sizeof(d_A[0]), (void**)&d_A );
cublasAlloc (n2, sizeof(d_B[0]), (void **)&d_B );
cublasAlloc (n2, sizeof(d_C[0]), (void **)&d_C );
cublasSetVector (n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );
cublasSetVector (n2, sizeof(h_B[0]), h_B, 1, d_B, 1 );
cublasSetVector (n2, sizeof(h_C[0]), h_C, 1, d_C, 1 );
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N );
cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );
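As a CPU reference for validating GPU results, the cublasSgemm('n','n', ...) call above computes C = alpha*A*B + beta*C; in numpy terms (sizes and values here are illustrative; note cuBLAS stores matrices column-major, so a C program must lay out its buffers accordingly):

```python
import numpy as np

N, alpha, beta = 4, 2.0, 0.5
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.random.rand(N, N).astype(np.float32)

# What SGEMM computes: C <- alpha * A @ B + beta * C
C_out = alpha * (A @ B) + beta * C
print(C_out.shape)   # (4, 4)
```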
cuFFT
CUDA-accelerated FFT library, modeled after FFTW
Included in recent CUDA toolkits
http://docs.nvidia.com/cuda/cufft/index.html
cuFFT Usage

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
cuFFT Sample

#include <stdio.h>
#include <math.h>
#include "cufft.h"

int main(int argc, char *argv[])
{
    cufftComplex *a_h, *a_d;
    cufftHandle plan;
    int N = 1024, batchSize = 10;
    int i, nBytes;
    double maxError;

    nBytes = sizeof(cufftComplex)*N*batchSize;
    a_h = (cufftComplex *)malloc(nBytes);
    for (i = 0; i < N*batchSize; i++)
    {
        a_h[i].x = sinf(i);
        a_h[i].y = cosf(i);
    }
    cudaMalloc((void **)&a_d, nBytes);
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

    cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

    // check error - normalize by N (cuFFT transforms are unnormalized)
    for (maxError = 0.0, i = 0; i < N*batchSize; i++)
    {
        maxError = max(fabs(a_h[i].x/N - sinf(i)), maxError);
        maxError = max(fabs(a_h[i].y/N - cosf(i)), maxError);
    }
    printf("Max fft error = %g\n", maxError);

    cufftDestroy(plan);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
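The sample's round trip can be reproduced on the CPU with numpy, mirroring cuFFT's convention that transforms are unnormalized (forward then inverse scales the signal by N, hence the normalize-by-N error check):

```python
import numpy as np

N, batch = 1024, 10
i = np.arange(N * batch)
a = (np.sin(i) + 1j * np.cos(i)).astype(np.complex64).reshape(batch, N)

fwd = np.fft.fft(a, axis=1)            # like CUFFT_FORWARD
inv = np.fft.ifft(fwd, axis=1) * N     # numpy normalizes ifft; undo it to mimic CUFFT_INVERSE

max_error = np.abs(inv / N - a).max()  # the sample's normalize-by-N check
print(max_error < 1e-3)                # True
```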
Indirect Debugging with cuFFT
Image Processing with pyCUDA
Image-Processing CUDA Kernels
[Diagram: pixels stored as consecutive R G B byte values in memory]
From original to result: resize, rotate, color conversion; histogram, sorting; filtering, thinning, compression; FFT, BLAS.
Image-Processing Tasks
Image file handling: many formats (BMP, JPG, PNG, DICOM, etc.); image libraries such as OpenCV and ITK
Screen output and debugging: consider Windows/Linux UI; Qt library, MITK library
Can we focus solely on CUDA kernel development?
Leave file I/O and the UI to Python, and use CUDA for GPU acceleration: pyCUDA!
PYTHON 101
http://www.python.org/
PIL: Python Imaging Library
NumPy: Python module for scientific computing (matrix operations)
pyCUDA: bridges Python and CUDA
Python Imaging Library (PIL)

from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image

# read image file
img = Image.open("6x3-pixel.png")
# array for image processing
arr = numpy.array(img)
print arr

Steps: initialize the image library, read the image file, convert it to an array for processing, show the plot.
pyCUDA import
########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc
########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule
UI Part

########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc

########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule

########### read image file ##################
img = Image.open("Fisheye-Nikkor 10.5mm-sample4-building.jpg")
# convert image to numpy array
arr = numpy.array(img)

# upload to GPU
# TODO kernel
# download to CPU

# result value
tmp = numpy.empty_like(arr)
tmp = arr

# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("defished")
plt.imshow(tmp)
plt.show()
pyCUDA Template

# pyCUDA initialize
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule

# define the CUDA kernel
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

# define the python function
multiply_them = mod.get_function("multiply_them")

# init on CPU memory
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# launch the kernel; drv.In/drv.Out define the cudaMemcpy direction,
# block=(400,1,1) defines the job index (one thread per element)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))

print a*b
print dest
print dest - a*b
CUDA Kernel Launch
bright = mod.get_function("bright")
# read image file
img = Image.open("papua.jpg")
# convert image to numpy array
arr = numpy.array(img)
tmp = numpy.empty_like(arr)
bright(
drv.In(arr), drv.Out(tmp) ,
grid=(40,60,1), block=(40,20,1)
)
Writing the CUDA Kernel

mod = SourceModule("""
__global__ void bright(unsigned char *in, unsigned char *out){
    // grid=(40,60,1), block=(40,20,1): ix covers 1600, iy covers 1200
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int px = ix * 3;
    int py = iy * 3;
    int pixel_R = (py * 1600) + px + 0;
    int pixel_G = (py * 1600) + px + 1;
    int pixel_B = (py * 1600) + px + 2;
    int R, G, B;
    R = in[pixel_R];
    G = in[pixel_G];
    B = in[pixel_B];
    out[pixel_R] = R / 2;
    out[pixel_G] = G / 2;
    out[pixel_B] = B / 2;
    return;
}
""")

Mind the mapping between the NumPy array layout and the CUDA thread index.
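The same halving can be checked against a numpy one-liner; the kernel's hand-computed offsets (row stride 1600 pixels x 3 channels) match numpy's default C-ordered layout for a 1200x1600x3 uint8 array (image size assumed from the grid/block launch above):

```python
import numpy as np

arr = np.random.randint(0, 256, size=(1200, 1600, 3), dtype=np.uint8)

# numpy equivalent of the bright kernel: halve every channel of every pixel
tmp = arr // 2

# the kernel's flat index for pixel (iy, ix), channel c, matches C order:
iy, ix, c = 7, 5, 1
flat = (iy * 3) * 1600 + ix * 3 + c
assert arr.reshape(-1)[flat] == arr[iy, ix, c]
print(tmp.max() <= 127)   # True
```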
NumPy Array Data Structure (PNG image example)

arr = np.array(img)
[
 [[  0   0   0] [  0   0   0] [255 255 255] [  0   0   0] [  0   0   0] [128   0 255]]
 [[  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [237  28  36]]
 [[  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [136   0  21]]
]
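A PIL image converted this way becomes a (rows, columns, channels) uint8 array, so the 6x3-pixel example above has shape (3, 6, 3). A stand-in built directly, no image file needed:

```python
import numpy as np

# a 6x3-pixel RGB image as numpy sees it: 3 rows, 6 columns, 3 channels
arr = np.zeros((3, 6, 3), dtype=np.uint8)
arr[0, 2] = [255, 255, 255]   # the white pixel in the first row above
arr[0, 5] = [128,   0, 255]
arr[1, 5] = [237,  28,  36]
arr[2, 5] = [136,   0,  21]
print(arr.shape)   # (3, 6, 3)
```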
Plot Output
out[pixel_R]=R/2;
out[pixel_G]=G/2;
out[pixel_B]=B/2;
# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("result")
plt.imshow(tmp)
plt.show()