Advancements in the NVIDIA GPU Ecosystem
Axel Koehler, Senior Solutions Architect HPC, NVIDIA
The Visual Computing Company
HPC Advisory Council Meeting, April 2014, Lugano
Outline
Tesla K40 and GPU Boost
Jetson TK1 Development Board for Embedded HPC
Pascal GPU
3D Memory
NVLINK
CUDA 6.0
Unified memory
Extended Library Interfaces
GPU Direct RDMA with OpenMPI
… and beyond
Tesla K40
FASTER: 1.4 TF | 2880 Cores | 288 GB/s
LARGER: 2x Memory Enables More Apps
SMARTER: Unlock Extra Performance Using Power Headroom
[Chart: AMBER Benchmark throughput in ns/day - CPU vs. Tesla K20X vs. Tesla K40]
AMBER Benchmark: SPFP-Nucleosome. CPU: Dual E5-2687W @ 3.10GHz, 64GB system memory, CentOS 6.2. GPU systems: single Tesla K20X or single Tesla K40
[Graphic: 12 GB on-board memory (vs. 6 GB) enables larger workloads such as fluid dynamics, seismic analysis, and rendering]
[Chart: Avg GPU Power in Watts for Real Applications on K20X - board power (Watts) for AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, and SPECFEM3D]
GPU Boost on Tesla K40
Convert power headroom to higher performance, each case running at the 235 W board power limit:
Base Clock (745 MHz): Workload #1 - worst-case reference app
Boost Clock #1 (810 MHz): Workload #2 - e.g. AMBER
Boost Clock #2 (875 MHz): Workload #3 - e.g. ANSYS Fluent
Compute Workload Behavior with GPU Boost
GPU Clock: Non-Tesla - automatic clock switching; Tesla K40 - deterministic clocks
Default clock: Non-Tesla - boost; Tesla K40 - base
Preset options: Non-Tesla - lock to base clock; Tesla K40 - 3 levels (base clock, boost clock #1, boost clock #2)
Boost interface: Non-Tesla - control panel; Tesla K40 - nvidia-smi, NVML:
nvidia-smi -q -d CLOCK,SUPPORTED_CLOCKS
nvidia-smi -ac <MEM clock,Graphics clock>
Target duration for boost clocks: Non-Tesla - ~50% of run-time; Tesla K40 - 100% of workload run time (a must-have for HPC workloads)
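The nvidia-smi commands above also have an API counterpart. Below is a minimal sketch (not from the slides) of doing the same thing through NVML; it assumes the K40's supported application clocks (3004 MHz memory, 875 MHz graphics) and that the process has the privileges needed to change clocks. Link with -lnvidia-ml.

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int sm_mhz = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* Equivalent to: nvidia-smi -ac 3004,875 (select the highest boost clock) */
    nvmlReturn_t rc = nvmlDeviceSetApplicationsClocks(dev, 3004, 875);
    if (rc != NVML_SUCCESS)
        printf("setting application clocks failed: %s\n", nvmlErrorString(rc));

    /* Read back the graphics application clock that will be used */
    nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_GRAPHICS, &sm_mhz);
    printf("application graphics clock: %u MHz\n", sm_mhz);

    nvmlShutdown();
    return 0;
}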
JETSON TK1 - THE WORLD'S 1st EMBEDDED SUPERCOMPUTER
Development platform for embedded computer vision, robotics, medical, ...
• Tegra K1 SoC
• Kepler GPU with 192 cores (Compute Capability 3.2)
• 4-Plus-1 quad-core ARM Cortex-A15 CPU
• 2 GB memory, 16 GB eMMC memory
• I/O options: miniPCI-e slot, GigE, HDMI, SD/MMC connector, USB 3.0, SATA data port, ...
• CUDA Toolkit 6.0, OpenGL 4.4, OpenGL ES 3.0
• Runs 32-bit Ubuntu 13.04 Linux for Tegra (L4T)
• 326 GFLOPS, 5 Watts
https://developer.nvidia.com/jetson-tk1
Pascal GPU
Optimized for double precision FP
Very high bandwidth, large capacity 3D memory on package
NVLINK for high-bandwidth CPU-GPU and GPU-GPU interconnect
Unified Memory (UM) HW support
New packaging allows much denser solutions (one-third the size of current PCIe boards)
Stacked Memory
3D chip-on-wafer integration
Multiple layers of DRAM components will be integrated vertically on the package along with the GPU
Compared to GDDR5 memory:
4x higher bandwidth
3x larger capacity
4x more energy efficient per bit
NVLINK
CPU-GPU communication is limited by the low-bandwidth PCIe connection
NVLINK is a high-speed interconnect between CPU and GPU, and between GPUs
The basic building block is an 8-lane, differential, dual-simplex bidirectional link
Multiple links can be aggregated to increase the bandwidth of a connection
NVLink will provide between 80 and 200 GB/s of bandwidth
Cache coherency will be provided with NVLINK 2.0
Preserves the PCIe programming model:
CPU-initiated transactions such as control and configuration still go over a PCIe connection
GPU-initiated transactions use NVLink, allowing the GPU full-bandwidth access to the CPU's memory system
NVLink is more than twice as energy efficient as a PCIe 3.0 connection
Unified Memory
Dramatically Lower Developer Effort
[Diagram: developer view today - separate System Memory and GPU Memory; developer view with Unified Memory - a single Unified Memory]
Super Simplified Memory Management Code

CPU Code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

CUDA 6 Code with Unified Memory:
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
Unified Memory Delivers
1. Simpler Programming & Memory Model
   Single pointer to data, accessible anywhere
   Tight language integration
   Greatly simplifies code porting
2. Performance Through Data Locality
   Migrate data to the accessing processor
   Guarantee global coherency
   Still allows cudaMemcpyAsync() hand tuning
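To make the single-pointer and coherency points concrete, here is a minimal sketch (not from the slides): one cudaMallocManaged allocation is initialized by the CPU, updated by a kernel, and read back by the CPU after cudaDeviceSynchronize(), with no explicit copies.

#include <cstdio>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                    // GPU writes through the shared pointer
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // CPU initializes, no cudaMemcpy

    scale<<<(n + 255) / 256, 256>>>(x, n);      // data migrates to the GPU at launch
    cudaDeviceSynchronize();                    // after sync the CPU sees the results

    printf("x[0] = %f\n", x[0]);                // prints 2.0
    cudaFree(x);
    return 0;
}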
Unified Memory Roadmap
CUDA 6 - Ease of Use: single pointer to data, no memcopy required, coherence @ launch & sync, shared C/C++ data structures
Next - Optimizations: prefetching, migration hints, additional OS support
Future GPUs: finer-grain migration, not limited to GPU memory size
Learn More: http://bit.ly/um-p4a
GPU Direct RDMA with OpenMPI
Starting with CUDA 6, Open MPI also supports GPUDirect RDMA
Requirements:
Kepler-class GPUs (K10, K20, K20X, K40)
Mellanox ConnectX-3, ConnectX-3 Pro, Connect-IB
CUDA 6.0 (EA, RC, Final), Open MPI 1.7.4 and Mellanox OFED 2.1 drivers
GPUDirect RDMA enabling software: http://www.mellanox.com/downloads/ofed/nvidia_peer_memory-1.0-0.tar.gz
GPU Direct RDMA with OpenMPI
Open MPI compilation: configure --with-cuda
Support is configured in if the CUDA 6.0 cuda.h header file is detected.
To check:
> ompi_info --all | grep btl_openib_have_cuda_gdr
  MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
> ompi_info --all | grep btl_openib_have_driver_gdr
  MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
Enable GPU Direct RDMA usage (off by default): --mca btl_openib_want_cuda_gdr 1
Adjust when transfers switch to pipelining through host memory (current default is 30,000 bytes): --mca btl_openib_cuda_rdma_limit 60000
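For illustration, a minimal sketch (not from the slides) of what a CUDA-aware Open MPI build lets you write: device pointers are passed straight to MPI_Send/MPI_Recv, and with the flags above small messages can move GPU-to-GPU via GPUDirect RDMA without staging through host memory. The buffer size, ranks, and tag are arbitrary.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));   /* device memory, never copied to the host here */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Launched, for example, with the flag from the slide: mpirun -np 2 --mca btl_openib_want_cuda_gdr 1 ./a.out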
GPU Direct RDMA with OpenMPI
Chipset implementation limits bandwidth at larger message sizes
Pipelining with host-memory staging is still used for large messages (the hybrid version uses asynchronous copies)
GPU Direct RDMA with OpenMPI
HOOMD-blue (git master 28Jan14), Lennard-Jones liquid dataset (16K and 512K particles)
System 1: Dual-socket Intel E5-2680 v2 @ 2.80 GHz, 64 GB memory, RHEL 6.2, MLNX_OFED 2.1-1.0.0, Mellanox FDR, 1x Tesla K40 per node, driver 331.20, Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
System 2: Dual-socket Intel E5-2630 v2 @ 2.60 GHz, 64 GB memory, Scientific Linux 6.4, MLNX_OFED 2.1-1.0.0, Mellanox FDR, 2x Tesla K20 per node, driver 331.20, Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
[Charts: HOOMD-blue performance with GPUDirect RDMA, 20% and 102% improvements shown; higher is better]
http://www.hpcadvisorycouncil.com/pdf/HOOMDblue_Analysis_and_Profiling.pdf
Extended (XT) Library Interfaces
Automatic Scaling to multiple GPUs per node
cuFFT 2D/3D & cuBLAS level 3
Operate directly on large datasets that reside in CPU memory
[Chart: 16K x 16K SGEMM on Tesla K10 - 2.2 TFLOPS with 1x K10, 4.2 TFLOPS with 2x K10, 6.0 TFLOPS with 3x K10, 7.9 TFLOPS with 4x K10]
developer.nvidia.com/cublasxt
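A minimal sketch (not from the slides) of the cuBLAS-XT host-interface call pattern behind such a run: the operands A, B, and C stay in CPU memory and the library tiles the SGEMM across the selected GPUs. The 16K problem size and the two-GPU selection are illustrative, and the matrices are left uninitialized here.

#include <stdlib.h>
#include <cublasXt.h>

int main(void)
{
    const size_t n = 16384;
    float *A = (float *)malloc(n * n * sizeof(float));   /* host memory, contents uninitialized in this sketch */
    float *B = (float *)malloc(n * n * sizeof(float));
    float *C = (float *)malloc(n * n * sizeof(float));
    const float alpha = 1.0f, beta = 0.0f;

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[2] = {0, 1};                  /* spread the work over two GPUs */
    cublasXtDeviceSelect(handle, 2, devices);

    /* Operands stay in host memory; the library streams tiles to the GPUs */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}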
New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically route BLAS3 calls to cuBLAS
Example: drop-in speedup for R

> LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so R
> A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> system.time(C <- A %*% B)
   user  system elapsed
  0.348   0.142   0.289

Use in any app that uses standard BLAS3 (Octave, Scilab, etc.)
[Chart: double-precision GFLOPS vs. matrix dimension (up to ~32K) for matrix-matrix multiplication in R - NVBLAS on 4x K20X GPUs vs. MKL on a 6-core Xeon E5-2667 CPU]
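One practical detail not on the slide (so treat this as an assumption to verify against the NVBLAS documentation): NVBLAS reads a small configuration file, located via the NVBLAS_CONFIG_FILE environment variable, that at minimum names the CPU BLAS library to fall back on for routines it does not accelerate. A minimal sketch, assuming OpenBLAS at a typical path:

# nvblas.conf (path taken from NVBLAS_CONFIG_FILE)
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so   # CPU BLAS used for non-accelerated routines
NVBLAS_GPU_LIST ALL                           # use every visible GPU
NVBLAS_LOGFILE nvblas.log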
Remote Development with Nsight Eclipse Edition
Local IDE, remote application
Edit locally, build & run remotely
Automatic sync via ssh
Cross-compilation to ARM
Full debugging & profiling via remote connection
[Diagram: edit and sync on the local system; build, run, debug, and profile on the remote target]
Goals for the CUDA Platform
• Simplicity: learn, adopt, & use parallelism with ease
• Productivity: quickly achieve feature & performance goals
• Portability: write code that can execute on all targets
• Performance: high absolute performance and scalability
Simpler Heterogeneous Applications
We want: homogeneous programs, heterogeneous execution
– Unified programming model includes parallelism in language
– Abstract heterogeneous execution via Runtime or Virtual Machine
[Diagram: Current - a hybrid program with separate parallel and serial parts targeting GPU and CPU; Ideal - a single program with a homogeneous programming model mixing parallel + serial]
Parallelism in Mainstream Languages
• Enable more programmers to write parallel software
• Give programmers the choice of language to use
• GPU support in key languages
C++ Parallel Algorithms Library Progress
• Complete set of parallel primitives:
for_each, sort, reduce, scan, etc.
• ISO C++ committee voted unanimously to accept as official technical specification working draft
N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554

std::vector<int> vec = ...

// previous standard sequential loop
std::for_each(vec.begin(), vec.end(), f);

// explicitly sequential loop
std::for_each(std::seq, vec.begin(), vec.end(), f);

// permitting parallel execution
std::for_each(std::par, vec.begin(), vec.end(), f);
Numba Python Compiler
• Free and open source compiler for array-oriented Python
• NEW numba.cuda module integrates CUDA directly into Python
• http://numba.pydata.org/
@cuda.jit("void(float32[:], float32, float32[:], float32[:])")
def saxpy(out, a, x, y):
    i = cuda.grid(1)
    out[i] = a * x[i] + y[i]

# Launch saxpy kernel
saxpy[griddim, blockdim](out, a, x, y)
GPU-Accelerated Hadoop
Extract insights from customer data
Data Analytics using clustering algorithms
Developed using CUDA-accelerated IBM Java
Compile Java for GPUs
• Approach: apply a closure to a set of arrays
• foreach iterations parallelized over GPU threads
– Threads run the closure's execute() method

// vector addition
float[] X = {1.0f, 2.0f, 3.0f, 4.0f, … };
float[] Y = {9.0f, 8.1f, 7.2f, 6.3f, … };
float[] Z = {0.0f, 0.0f, 0.0f, 0.0f, … };

jog.foreach(X, Y, Z, new jogContext(),
    new jogClosureRet<jogContext>() {
        public float execute(float x, float y) {
            return x + y;
        }
    });
[Chart: Java Black-Scholes options pricing - speedup vs. sequential Java as a function of millions of options]
The Massively Parallel Programming Blog
Technical posts on GPUs, CUDA, OpenACC, Libraries, C/C++/Python and more
In-depth articles and regular series:
CUDACasts: instructive videos
CUDA Pro Tips: useful techniques
CUDA Spotlight Interviews
Join the conversation by subscribing to email or RSS updates today!
http://devblogs.nvidia.com/parallelforall
NVIDIA, the NVIDIA logo, GeForce, Quadro, Tegra, Tesla, GeForce Experience, GRID, GTX, Kepler, ShadowPlay, GameStream, SHIELD, and The Way It’s Meant To Be Played are trademarks and/or
registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
© 2014 NVIDIA Corporation. All rights reserved.
Axel Koehler akoehler@nvidia.com