HPC on Cloud (류현곤, Senior Manager, NVIDIA)
NVIDIA CONFIDENTIAL
Agenda
VDI, GPU on Cloud
CUDA
OpenACC
CUDA Kepler Architecture
CUDA on ARM
CUDA library
cuBLAS
cuFFT
pyCUDA
VDI, GPU on CLOUD
Night and Day Difference: without GPU vs. with GPU
Iray Photorealism
Iray Photograph
NVIDIA GRID Enabled Virtual Desktop (VDI)
[Diagram: virtual desktops run in virtual machines with the NVIDIA driver, on an NVIDIA GRID-enabled hypervisor backed by NVIDIA GRID GPUs]
VIRTUAL DESKTOP (virtualized GPU) vs. VIRTUAL REMOTE WORKSTATION (dedicated GPU)
DESIGNER: PTC, ANSYS, MSC Patran, Siemens NX 8.5, CATIA, DELMIA, SIMULIA
POWER USER: PLM, Factory Floor Work Instructions, TechPubs, SolidWorks, AutoDesk, Adobe CS Visualization
KNOWLEDGE WORKER: MS Office, Photoshop
Graphics Options in Virtualization
[Chart: application compatibility vs. number of CCUs]
SoftPC VDI (w/o GPU)
Shim Graphics (View sVGA, RemoteFX)
Dedicated GPU (HDX 3DPro, View vDGA)
NVIDIA GRID vGPU (XDT 7.1)
GRAPHICS ACCELERATED VDI
NVIDIA GRID
[Diagram: NVIDIA GRID architecture]
Hypervisor side: NVIDIA GRID GPU with GPU MMU, GPU hypervisor, hypervisor device emulation framework, Virtual GPU Manager, Resource Manager.
Virtual machine side: guest OS with the NVIDIA GRID USM driver, virtual desktop apps, remote display (remote protocol carrying state and graphics commands).
Per-VM dedicated channels connect each VM to the GPU.
vGPU (VGX hypervisor): partitions GPU memory and allocates it to VMs; time-slices GPU cores as needed for processing.
NVIDIA GRID K1 (shipping now)
GPU: 4 Kepler GPUs
CUDA cores: 768 (192 / GPU)
Memory size: 16 GB DDR3 (4 GB / GPU)
Power: 130 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 6-pin connector
PCIe: x16, Gen3 (Gen2 compatible)
Cooling solution: passive
# of users: 4 to 100 (1)
OpenGL: 4.x
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution.
GRID K1 = 4x Quadro K600
NVIDIA GRID K2 (shipping now)
GPU: 2 high-end Kepler GPUs
CUDA cores: 3072 (1536 / GPU)
Memory size: 8 GB GDDR5
Power: 225 W
Form factor: dual-slot ATX, 10.5"
Display IO: none
Aux power requirement: 8-pin connector
PCIe: x16, Gen3 (Gen2 compatible)
Cooling solution: passive
# of users: 2 to 64 (1)
OpenGL: 4.3
Microsoft DirectX: 11
(1) Number of users depends on software solution, workload, and screen resolution.
GRID K2 = 2x Quadro K5000
CUDA 101
The Era of Accelerated Computing is Here
[Timeline, 1980 to 2020: era of vector computing, era of distributed computing, era of accelerated computing]
GPU Roadmap
[Chart: DP GFLOPS per watt (0.5 to 32, log scale) by year, 2008 to 2014+]
Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: Unified Virtual Memory
Volta: Stacked DRAM
Agenda
OpenACC
Kepler Architecture
CUDA 5.5 features
OpenACC
Application Code
The GPU runs the parallelized compute-intensive functions; the rest of the sequential code stays on the CPU.
Directives: Easy, Open & Powerful
Real-Time Object Detection (global manufacturer of navigation systems): 5x in 40 hours
Valuation of Stock Portfolios using Monte Carlo (global technology consulting company): 2x in 4 hours
Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the global manufacturer of navigation systems
OpenACC Directives

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

Your original Fortran or C code plus simple compiler hints: the OpenACC compiler parallelizes the code. Works on many-core GPUs and multicore CPUs.
Can you mix CUDA and OpenACC? Yes; you can even use CUDA to manage memory.
2 Basic Steps to Get Started
Step 1: Annotate source code with directives:

!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i)
!$acc parallel loop
…
!$acc end parallel
!$acc end data

Step 2: Compile & run:

pgf90 -ta=nvidia -Minfo=accel file.f
OpenACC Directives Example

!$acc data copy(A,Anew)                    ! copy arrays into GPU memory for the data region
iter = 0
do while ( err > tol .and. iter < iter_max )
  iter = iter + 1
  err = 0._fp_kind
!$acc kernels                              ! parallelize the code inside the region
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * ( A(i+1,j) + A(i-1,j) &
                                + A(i,j-1) + A(i,j+1) )
      err = max( err, Anew(i,j) - A(i,j) )
    end do
  end do
!$acc end kernels                          ! close off the parallel region
  if (mod(iter,100) == 0 .or. iter == 1) print *, iter, err
  A = Anew
end do
!$acc end data                             ! close off the data region, copy data back
Kepler Architecture
Kepler GK110 Architecture (Block Diagram)
7.1B Transistors
15 SMX units
> 1 TFLOP FP64
1.5 MB L2 Cache
384-bit GDDR5
PCI Express Gen3
Power vs Clock Speed Example
Fermi, 2x clock: one unit processes A then B. Logic: area 1.0x, power 1.0x. Clocking: area 1.0x, power 1.0x.
Kepler, 1x clock: two units process A and B in parallel. Logic: area 1.8x, power 0.9x. Clocking: area 1.0x, power 0.5x.
GPU Performance
Theoretical peak depends on:
  ratio of DP (1/2, 1/3, 1/8, 1/16)
  # of CUDA cores / MP (8, 32, 48, 192)
  # of MPs / GPU (1, 2, 7, 8, 13, 14, 15)
  # of FMA (1, 2)
  clock in MHz (704, 870, 1200)
Example peak: DP 1.2 TFLOPS, SP 2.4 TFLOPS
Measured performance depends on:
  DGEMM ratio (75%, 85%, 95%)
  CPU + GPU DGEMM
  DTRSM ratio
  MPI ratio (50%, 90%)
Example measured: DGEMM 1117 GFLOPS, DP 880 GFLOPS
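Multiplying the listed factors gives the peak numbers. A sketch, using K20-class values as an assumed example (13 MPs, 192 cores per MP, 706 MHz, DP ratio 1/3):

```python
def peak_gflops(num_mp, cores_per_mp, clock_mhz, flops_per_clock=2):
    """Peak single-precision GFLOPS: MPs x cores/MP x FMA (2 flops) x clock."""
    return num_mp * cores_per_mp * flops_per_clock * clock_mhz / 1000.0

sp = peak_gflops(13, 192, 706)   # ~3524 GFLOPS SP
dp = sp / 3                      # Kepler's DP ratio is 1/3 -> ~1.2 TFLOPS DP
print(round(sp), round(dp))
```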
New CUDA-HPL (N = 50,000)
(>50 ms) memcpyH2D (1024 x 50000 doubles)
Loop ~50 times (50,000 / 1024):
  (>1 ms) Gather (1 kernel, read/write 1024x1024 doubles)
  (>2 ms) Scatter (1 kernel, read/write 1024x1024 doubles)
  (>4 ms) cublasDtrsm (31 kernels, A=1024x1024, B=1024x1024)
  (>1 ms) Transpose (1 kernel, read/write 1024x1024 doubles)
  (>110 ms) cublasDgemm (1 or 2 kernels, A=1024x50000, B=1024x1024, C=1024x50000)
(>50 ms) memcpyD2H (1024 x 50000 doubles)
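The transfer times above imply the PCIe bandwidth; a quick check of the host-to-device copy (treating the ">50 ms" lower bound as exactly 50 ms):

```python
# 1024 x 50000 doubles, copied host-to-device in ~50 ms
bytes_moved = 1024 * 50000 * 8             # = 409.6 MB
bandwidth_gb_s = bytes_moved / 0.050 / 1e9
print(round(bandwidth_gb_s, 1))            # ~8.2 GB/s, consistent with PCIe Gen2 x16
```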
New Instruction: SHFL
Data exchange between threads within a warp; avoids use of shared memory; one 32-bit value per exchange.
4 variants:
__shfl(): indexed any-to-any
__shfl_up(): shift right to nth neighbour
__shfl_down(): shift left to nth neighbour
__shfl_xor(): butterfly (XOR) exchange
[Diagram: an 8-lane warp a b c d e f g h permuted by each variant]
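The four variants can be emulated on a toy 8-lane warp to see the data movement (a pure-Python sketch of the semantics only; the real intrinsics exchange registers across the 32 threads of a warp, and an out-of-range lane returns its own value):

```python
def shfl(warp, src):        # __shfl: indexed any-to-any, all lanes read lane `src`
    return [warp[src] for _ in warp]

def shfl_up(warp, delta):   # __shfl_up: lane i reads lane i - delta
    return [warp[i - delta] if i >= delta else warp[i]
            for i in range(len(warp))]

def shfl_down(warp, delta): # __shfl_down: lane i reads lane i + delta
    n = len(warp)
    return [warp[i + delta] if i + delta < n else warp[i]
            for i in range(n)]

def shfl_xor(warp, mask):   # __shfl_xor: butterfly exchange, lane i reads lane i ^ mask
    return [warp[i ^ mask] for i in range(len(warp))]

warp = list("abcdefgh")
print(shfl_up(warp, 2))     # ['a', 'b', 'a', 'b', 'c', 'd', 'e', 'f']
print(shfl_down(warp, 2))   # ['c', 'd', 'e', 'f', 'g', 'h', 'g', 'h']
print(shfl_xor(warp, 1))    # ['b', 'a', 'd', 'c', 'f', 'e', 'h', 'g']
```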
Texture Cache Unlocked
Added a new path for compute: avoids the texture unit, allows a global address to be fetched and cached, eliminates texture setup.
Why use it? Separate pipeline from shared/L1; highest miss bandwidth; flexible, e.g. unaligned accesses.
Managed automatically by the compiler; "const __restrict" indicates eligibility.
[Diagram: SMX texture units front a read-only data cache backed by L2]
const __restrict Example

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // Compiler will automatically use texture for "input"
    output[offset] = (input[offset] * x) + y;
}

Annotate eligible kernel parameters with const __restrict; the compiler automatically maps the loads to the read-only data cache path.
What is Dynamic Parallelism?
The ability to launch new grids from the GPU: dynamically, simultaneously, independently.
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.
What does it mean? The GPU shifts from co-processor to autonomous, dynamic parallelism.
CUDA 5.5
ARM: Fastest Growing CPU
[Chart, 1993 to 2013: processor market share (0 to 100%) and processor shipments (0 to 5 billion) for x86 and Cortex processors shipped]
Source: Mercury Research, ARM, internal estimates
ARM + CUDA Development Kits
CARMA MXM: MXM slot, Q1000M compatible; shipping now
Kayla mITX: PCIe slot, compatible with any GPU (K20X compatible); shipping now
Kayla MXM: MXM slot; coming soon

Unpacking the Kayla
Order from SECO (Italy), about $600 each: board with Kepler GPU, mATX power supply, RS-232 cable to a PC.
Login ID: ubuntu, password: ubuntu
CUDA: Smarter Robots to Energy-Efficient Supercomputers
Building self-aware robots; 35% more energy efficient.
Hyper-Q
[Charts: GPU utilization (%) over time, without Hyper-Q vs. with Hyper-Q]
Multi-Process Server: Required for Hyper-Q / MPI
[Diagram: CUDA MPI ranks 0 to 3 share the GPU through one CUDA server process]
$ mpirun -np 4 my_cuda_app
No application re-compile to share the GPU; no user configuration needed; can be preconfigured by the SysAdmin.
MPI ranks using CUDA are clients; the server spawns on demand, one per user.
One job per user: no isolation between MPI ranks; exclusive-process mode enforces a single server.
One GPU per rank: no cudaSetDevice(); only CUDA device 0 is visible.
Strong Scaling of CP2K on Cray XK7
Hyper-Q with multiple MPI ranks leads to a 2.5x speedup over a single MPI rank using the GPU.
(Blog post by Peter Messmer of NVIDIA)
Stream Priorities Accelerate the Critical Path
[Timelines: with no priorities, Kernel X launched on stream 2 waits behind kernels A, B, and C on stream 1; with a high-priority stream 2, Kernel X runs as soon as it is launched]
Especially useful when Kernel X generates data for MPI_Send().
cuBLAS
CUDA-accelerated library of the BLAS routines for matrix operations
Included in recent CUDA releases
CUDA-HPL uses the cuBLAS library
cuBLAS Model
Allocate GPU memory, memcpy CPU to GPU, compute on the GPU, memcpy GPU back to CPU, free.
cuBLAS call steps
GPU memory alloc:
cublasAlloc( n_size, memSize, (void**)&GPU_ptr );
GPU memory upload (set):
cublasSetVector( n_size, memSize, CPU_ptr, 1, GPU_ptr, 1 );
GPU compute:
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
GPU memory download (get):
cublasGetVector( n_size, memSize, GPU_ptr, 1, CPU_ptr, 1 );
Build: nvcc mysrc.c -lcuda -lcublas
Main code in cuBLAS
h_A = (float*) malloc(n2 * sizeof(h_A[0]));
h_B = (float *) malloc(n2 * sizeof(h_B[0]));
h_C = (float *) malloc(n2 * sizeof(h_C[0]));
cublasAlloc (n2, sizeof(d_A[0]), (void**)&d_A );
cublasAlloc (n2, sizeof(d_B[0]), (void **)&d_B );
cublasAlloc (n2, sizeof(d_C[0]), (void **)&d_C );
cublasSetVector (n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );
cublasSetVector (n2, sizeof(h_B[0]), h_B, 1, d_B, 1 );
cublasSetVector (n2, sizeof(h_C[0]), h_C, 1, d_C, 1 );
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N );
cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );
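As a CPU reference for validating GPU results, the cublasSgemm('n','n', ...) call above computes C = alpha*A*B + beta*C; in numpy terms (sizes and values here are illustrative; note cuBLAS stores matrices column-major, so a C program must lay out its buffers accordingly):

```python
import numpy as np

N, alpha, beta = 4, 2.0, 0.5
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.random.rand(N, N).astype(np.float32)

# What SGEMM computes: C <- alpha * A @ B + beta * C
C_out = alpha * (A @ B) + beta * C
print(C_out.shape)   # (4, 4)
```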
cuFFT
CUDA-accelerated FFT library, modeled after FFTW
Included in recent CUDA toolkits
http://docs.nvidia.com/cuda/cufft/index.html
cuFFT Usage

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1);
cudaFree(data2);
cuFFT Sample

#include <stdio.h>
#include <math.h>
#include "cufft.h"

int main(int argc, char *argv[])
{
    cufftComplex *a_h, *a_d;
    cufftHandle plan;
    int N = 1024, batchSize = 10;
    int i, nBytes;
    double maxError;

    nBytes = sizeof(cufftComplex)*N*batchSize;
    a_h = (cufftComplex *)malloc(nBytes);
    for (i = 0; i < N*batchSize; i++)
    {
        a_h[i].x = sinf(i);
        a_h[i].y = cosf(i);
    }
    cudaMalloc((void **)&a_d, nBytes);
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

    cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

    // check error - normalize by N (cuFFT transforms are unnormalized)
    for (maxError = 0.0, i = 0; i < N*batchSize; i++)
    {
        maxError = max(fabs(a_h[i].x/N - sinf(i)), maxError);
        maxError = max(fabs(a_h[i].y/N - cosf(i)), maxError);
    }
    printf("Max fft error = %g\n", maxError);

    cufftDestroy(plan);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
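The sample's round trip can be reproduced on the CPU with numpy, mirroring cuFFT's convention that transforms are unnormalized (forward then inverse scales the signal by N, hence the normalize-by-N error check):

```python
import numpy as np

N, batch = 1024, 10
i = np.arange(N * batch)
a = (np.sin(i) + 1j * np.cos(i)).astype(np.complex64).reshape(batch, N)

fwd = np.fft.fft(a, axis=1)            # like CUFFT_FORWARD
inv = np.fft.ifft(fwd, axis=1) * N     # numpy normalizes ifft; undo it to mimic CUFFT_INVERSE

max_error = np.abs(inv / N - a).max()  # the sample's normalize-by-N check
print(max_error < 1e-3)                # True
```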
Indirect Debugging with cuFFT
Image Processing with pyCUDA
Image-Processing CUDA Kernels
[Diagram: pixels stored as consecutive R G B byte values in memory]
From original to result: resize, rotate, color conversion; histogram, sorting; filtering, thinning, compression; FFT, BLAS.
Image-Processing Tasks
Image file handling: many formats (BMP, JPG, PNG, DICOM, etc.); image libraries such as OpenCV and ITK
Screen output and debugging: consider Windows/Linux UI; Qt library, MITK library
Can we focus solely on CUDA kernel development?
Leave file I/O and the UI to Python, and use CUDA for GPU acceleration: pyCUDA!
PYTHON 101
http://www.python.org/
PIL: Python Imaging Library
NumPy: Python module for scientific computing (matrix operations)
pyCUDA: bridges Python and CUDA
Python Imaging Library (PIL)

from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image

# read image file
img = Image.open("6x3-pixel.png")
# array for image processing
arr = numpy.array(img)
print arr

Steps: initialize the image library, read the image file, convert it to an array for processing, show the plot.
pyCUDA import
########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc
########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule
UI Part

########### for image processing ###############
from __future__ import division
import numpy
import matplotlib.pyplot as plt
from PIL import Image
from scipy import misc

########### for pyCUDA ##############
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy.linalg as la
from pycuda.compiler import SourceModule

########### read image file ##################
img = Image.open("Fisheye-Nikkor 10.5mm-sample4-building.jpg")
# convert image to numpy array
arr = numpy.array(img)

# upload to GPU
# TODO kernel
# download to CPU

# result value
tmp = numpy.empty_like(arr)
tmp = arr

# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("defished")
plt.imshow(tmp)
plt.show()
pyCUDA Template

# pyCUDA initialize
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule

# define the CUDA kernel
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

# define the python function
multiply_them = mod.get_function("multiply_them")

# init on CPU memory
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# launch the kernel; drv.In/drv.Out define the cudaMemcpy direction,
# block=(400,1,1) defines the job index (one thread per element)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))

print a*b
print dest
print dest - a*b
CUDA Kernel Launch
bright = mod.get_function("bright")
# read image file
img = Image.open("papua.jpg")
# convert image to numpy array
arr = numpy.array(img)
tmp = numpy.empty_like(arr)
bright(
drv.In(arr), drv.Out(tmp) ,
grid=(40,60,1), block=(40,20,1)
)
Writing the CUDA Kernel

mod = SourceModule("""
__global__ void bright(unsigned char *in, unsigned char *out){
    // grid=(40,60,1), block=(40,20,1): ix covers 1600, iy covers 1200
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int px = ix * 3;
    int py = iy * 3;
    int pixel_R = (py * 1600) + px + 0;
    int pixel_G = (py * 1600) + px + 1;
    int pixel_B = (py * 1600) + px + 2;
    int R, G, B;
    R = in[pixel_R];
    G = in[pixel_G];
    B = in[pixel_B];
    out[pixel_R] = R / 2;
    out[pixel_G] = G / 2;
    out[pixel_B] = B / 2;
    return;
}
""")

Mind the mapping between the NumPy array layout and the CUDA thread index.
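The same halving can be checked against a numpy one-liner; the kernel's hand-computed offsets (row stride 1600 pixels x 3 channels) match numpy's default C-ordered layout for a 1200x1600x3 uint8 array (image size assumed from the grid/block launch above):

```python
import numpy as np

arr = np.random.randint(0, 256, size=(1200, 1600, 3), dtype=np.uint8)

# numpy equivalent of the bright kernel: halve every channel of every pixel
tmp = arr // 2

# the kernel's flat index for pixel (iy, ix), channel c, matches C order:
iy, ix, c = 7, 5, 1
flat = (iy * 3) * 1600 + ix * 3 + c
assert arr.reshape(-1)[flat] == arr[iy, ix, c]
print(tmp.max() <= 127)   # True
```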
NumPy Array Data Structure (PNG image example)

arr = np.array(img)
[
 [[  0   0   0] [  0   0   0] [255 255 255] [  0   0   0] [  0   0   0] [128   0 255]]
 [[  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [237  28  36]]
 [[  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [  0   0   0] [136   0  21]]
]
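A PIL image converted this way becomes a (rows, columns, channels) uint8 array, so the 6x3-pixel example above has shape (3, 6, 3). A stand-in built directly, no image file needed:

```python
import numpy as np

# a 6x3-pixel RGB image as numpy sees it: 3 rows, 6 columns, 3 channels
arr = np.zeros((3, 6, 3), dtype=np.uint8)
arr[0, 2] = [255, 255, 255]   # the white pixel in the first row above
arr[0, 5] = [128,   0, 255]
arr[1, 5] = [237,  28,  36]
arr[2, 5] = [136,   0,  21]
print(arr.shape)   # (3, 6, 3)
```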
Plot Output
out[pixel_R]=R/2;
out[pixel_G]=G/2;
out[pixel_B]=B/2;
# plot the numpy array
plt.subplot(121)
plt.title("original")
plt.imshow(arr)
plt.subplot(122)
plt.title("result")
plt.imshow(tmp)
plt.show()