Analyzing CUDA Workloads Using a Detailed GPU Simulator
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt
University of British Columbia
• GPUs and CPUs on a collision course
– 1st GPUs with programmable shaders appeared in 2001
– Today: a TeraFlop on a single card. Turing complete. Highly accessible: senior undergraduate students can learn to program CUDA in a few weeks (though not performance-tuned code)
– Rapidly growing set of CUDA applications (209 listed on NVIDIA's CUDA website in February)
– With OpenCL, we can safely expect the number of non-graphics applications written for GPUs to explode
• GPUs are massively parallel systems:
– Multicore + SIMT + fine-grained multithreading
No detailed academic simulator for studying this?!?
GPGPU-Sim
• An academic, detailed ("cycle-level") timing simulator developed from the ground up at the University of British Columbia (UBC) for modeling a modern GPU running non-graphics workloads.
• Relatively accurate (no effort has been expended so far on making it match real hardware more closely).
GPGPU-Sim
• Currently supports CUDA version 1.1 applications "out of the box".
• Microarchitecture model
– Based on the notion of "shader cores", which approximate the "Streaming Multiprocessors" of the NVIDIA GeForce 8 series and above.
– Shader cores connect to memory controllers through a detailed network-on-chip simulator (Dally & Towles' booksim).
– Detailed DRAM timing model (everything except refresh).
• GPGPU-Sim v2.0b available: www.gpgpu-sim.org
Rest of this talk
• Obligatory brief introduction to CUDA
• GPGPU-Sim internals (100,000' view)
– Simulator software overview
– Modeled microarchitecture
• Some results from the paper
CUDA Example
main()
{
…
cudaMalloc((void**) &d_idata, bytes);                            // allocate input array on the GPU
cudaMalloc((void**) &d_odata, maxNumBlocks*sizeof(int));         // allocate output array on the GPU
cudaMemcpy(d_idata, h_idata, bytesin, cudaMemcpyHostToDevice);   // copy input data to the GPU
reduce<<< nthreads, nblocks, smemSize >>>(d_idata, d_odata);     // launch the kernel
cudaThreadSynchronize();                                         // wait for the GPU to finish
cudaMemcpy(h_odata, d_odata, bytesout, cudaMemcpyDeviceToHost);  // copy results back (destination first)
…
}
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];                  // per-CTA shared memory, size set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];                        // each thread loads one element
    __syncthreads();
    // tree-based reduction within the CTA
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if ((tid % (2*s)) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // thread 0 writes the CTA's partial sum
}
(main() runs on the CPU; nthreads x nblocks copies of the reduce kernel run in parallel on the GPU.)
Normal CUDA Flow
• Applications written in a mixture of C/C++ and CUDA.
• “nvcc” takes CUDA (.cu) files and generates host C code and “Parallel Thread eXecution” assembly language (PTX).
• PTX is passed to assembler / optimizer “ptxas” to generate machine code that is packed into a C array (not human readable).
• Everything is then combined and linked against the CUDA runtime API using a regular C/C++ compiler and linker.
• Run your app on the GPU.
GPGPU-Sim Flow
• Uses CUDA's nvcc to generate host (CPU) C code and PTX.
• A flex/bison parser reads in the PTX.
• The host (CPU) code and the simulator are linked together into one binary.
• CUDA API calls are intercepted using a custom libcuda that implements the functions declared in the header files that come with CUDA (a sketch of the idea follows below).
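To illustrate the interception idea, here is a minimal sketch, not GPGPU-Sim's actual code: a replacement library defines the standard CUDA runtime entry points (e.g. cudaMalloc, cudaMemcpy) and forwards them into the simulator. The simulator-side routines gpgpusim_alloc, gpgpusim_copy_to_device, and gpgpusim_copy_from_device are hypothetical names used only for this example.

#include <cuda_runtime_api.h>   // declarations of cudaMalloc, cudaMemcpy, ... from CUDA's headers

// Hypothetical simulator-side routines (names invented for this sketch).
extern void *gpgpusim_alloc(size_t size);
extern void  gpgpusim_copy_to_device(void *dst, const void *src, size_t count);
extern void  gpgpusim_copy_from_device(void *dst, const void *src, size_t count);

// The custom library provides the same symbols as NVIDIA's runtime library,
// so the unmodified application ends up calling into the simulator instead.
// (Simplified: attributes and error handling of the real declarations are omitted.)
cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    *devPtr = gpgpusim_alloc(size);          // allocate space in simulated GPU memory
    return cudaSuccess;
}

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
{
    if (kind == cudaMemcpyHostToDevice)
        gpgpusim_copy_to_device(dst, src, count);
    else if (kind == cudaMemcpyDeviceToHost)
        gpgpusim_copy_from_device(dst, src, count);
    return cudaSuccess;
}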
GPGPU-Sim Microarchitecture
• A set of "shader cores" connected to a set of memory controllers via a detailed interconnection network model (booksim).
• Memory controllers reorder requests to reduce activate/precharge overheads (a sketch of the scheduling idea follows below).
• Topology and bandwidth of the interconnect can be varied.
• Cache for global memory operations.
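The configuration table later in this deck lists the memory controller as out-of-order (FR-FCFS). The sketch below illustrates the general FR-FCFS idea only: prefer the oldest request that hits the currently open DRAM row, otherwise fall back to the oldest request. It is not GPGPU-Sim's implementation, and the types and names are invented for this example.

#include <deque>
#include <cstdint>

// Hypothetical request/bank types for this sketch.
struct Request { uint32_t row; /* ... address, data, ... */ };

struct Bank {
    uint32_t open_row;              // row currently held in the bank's row buffer
    std::deque<Request> queue;      // requests in arrival order (front = oldest)
};

// FR-FCFS: first-ready (row hit), then first-come-first-served.
// Returns the index of the request to issue next, or -1 if the queue is empty.
int pick_next_request(const Bank &bank)
{
    // 1) The oldest request that hits the open row avoids an activate/precharge.
    for (size_t i = 0; i < bank.queue.size(); ++i)
        if (bank.queue[i].row == bank.open_row)
            return (int)i;
    // 2) Otherwise, fall back to the oldest request (plain FCFS).
    return bank.queue.empty() ? -1 : 0;
}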
Shader Core Details
• A shader core corresponds roughly to a "Streaming Multiprocessor" in NVIDIA terminology.
• Scalar threads are grouped into SIMD units called "warps" (32 threads per warp on current NVIDIA hardware). Warps are grouped into CTAs, and CTAs into "grids" (see the small sketch below).
• The set of warps on a core is fine-grain interleaved on the pipeline to hide off-chip memory access latency.
• Threads in one CTA can communicate via an on-chip 16KB "shared memory".
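As a concrete illustration of this hierarchy (not taken from the talk), the kernel fragment below shows how a thread can locate itself within its warp, CTA, and grid; threadIdx, blockIdx, blockDim, and warpSize are standard CUDA built-ins.

__global__ void who_am_i(int *out)
{
    // Position within the CTA (thread block) and within the grid.
    unsigned int tid_in_cta  = threadIdx.x;                       // 0 .. blockDim.x-1
    unsigned int cta_id      = blockIdx.x;                        // 0 .. gridDim.x-1
    unsigned int global_tid  = cta_id * blockDim.x + tid_in_cta;  // unique scalar thread id

    // Threads execute in SIMD groups (warps) of warpSize (32) consecutive threads.
    unsigned int warp_in_cta = tid_in_cta / warpSize;
    unsigned int lane        = tid_in_cta % warpSize;             // position inside the warp

    out[global_tid] = warp_in_cta * 100 + lane;                   // arbitrary marker value
}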
Interconnection Network
• Baseline: Mesh
• Variations: Crossbar, Ring, Torus
• Baseline mesh memory controller placement (diagram shown on the slide)
Are more threads better?
• More CTAs on a core
– Helps hide the latency when some warps wait at barriers
– Can increase memory latency tolerance
– Needs more resources
• Fewer CTAs on a core
– Less contention in the interconnection network and memory system
Memory Access Coalescing
• Grouping accesses from multiple, concurrently issued scalar threads into a single access to a contiguous memory region (illustrated below).
• Always done within a single warp.
• Coalescing among multiple warps:
– We explore its performance benefits.
– It is more expensive to implement.
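For illustration (this kernel is not from the talk), the access pattern below is the kind that intra-warp coalescing captures: the 32 threads of a warp touch 32 consecutive words, so the hardware can merge them into one (or a few) wide memory transactions, whereas the strided pattern shown in the commented-out line cannot be merged.

__global__ void copy_kernel(const int *in, int *out, int stride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads in a warp read consecutive addresses,
    // so one warp's accesses fall in a contiguous memory region.
    int v = in[i];

    // Uncoalesced (for comparison): consecutive threads are 'stride' words apart,
    // so one warp's accesses scatter across many memory regions.
    // int v = in[i * stride];

    out[i] = v;
    (void)stride;   // silence the unused-parameter warning while the strided line is commented out
}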
Simulation setup
Number of shader cores: 28
Warp size: 32
SIMD pipeline width: 8
Threads / CTAs / registers per core: 1024 / 8 / 16384
Shared memory per core: 16KB (16 banks)
Constant cache per core: 8KB (2-way set assoc., 64B lines, LRU)
Texture cache per core: 64KB (2-way set assoc., 64B lines, LRU)
Memory channels: 8
Bandwidth per memory module: 8 bytes/cycle
DRAM request queue size: 32
Memory controller: out-of-order (FR-FCFS)
Branch divergence handling: immediate post-dominator
Warp scheduling policy: round-robin among ready warps
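To make the per-core limits concrete, here is a small worked example (not from the talk) of how many CTAs can run concurrently on one shader core under the configuration above; the kernel's resource usage (256 threads, 16 registers/thread, 4KB shared memory per CTA) is an assumed value chosen only for illustration.

#include <stdio.h>

// Per-core limits from the simulation setup above.
enum { MAX_THREADS = 1024, MAX_CTAS = 8, MAX_REGS = 16384, MAX_SMEM = 16384 };

static int min4(int a, int b, int c, int d)
{
    int m = a < b ? a : b;
    m = m < c ? m : c;
    return m < d ? m : d;
}

int main(void)
{
    // Assumed kernel resource usage (illustrative only).
    int threads_per_cta = 256, regs_per_thread = 16, smem_per_cta = 4096;

    int ctas = min4(MAX_THREADS / threads_per_cta,                  // 1024/256   = 4
                    MAX_CTAS,                                       //              8
                    MAX_REGS / (threads_per_cta * regs_per_thread), // 16384/4096 = 4
                    MAX_SMEM / smem_per_cta);                       // 16384/4096 = 4

    printf("Concurrent CTAs per shader core: %d\n", ctas);          // prints 4
    return 0;
}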
Benchmark Selection
• Applications developed by 3rd-party researchers
– Less than 50x reported speedups
• Plus some applications from the CUDA SDK
Benchmarks (more info in paper) - Benchmark (Abbr.): claimed speedup
AES Cryptography (AES): 12x
Breadth First Search (BFS): 2x-3x
Coulombic Potential (CP): 647x
gpuDG (DG): 50x
3D Laplace Solver (LPS): 50x
LIBOR Monte Carlo (LIB): 50x
MUMmerGPU (MUM): 3.5x-10x
Neural Network (NN): 10x
N-Queens Solver (NQU): 2.5x
Ray Tracing (RAY): 16x
StoreGPU (STO): 9x
Weather Prediction (WP): 20x
Interconnection Network Latency Sensitivity
• A slight increase in interconnection latency has no severe effect on overall performance.
– No need to overdesign the interconnect to decrease latency.
Interconnection Network Bandwidth Sensitivity
• Low bandwidth (8B/cycle) decreases performance significantly.
• Very high bandwidth simply moves the bottleneck.
Effects of varying the number of CTAs
• Most benchmarks do not benefit substantially.
• Some benchmarks even perform better with fewer concurrent threads (e.g. AES).
– Less contention in DRAM.
More insights and data in the paper…
Summary
• GPGPU-Sim: a novel GPU simulator
– Capable of simulating CUDA applications
– www.gpgpu-sim.org
• Performance of the simulated applications is:
– More sensitive to bisection bandwidth
– Less sensitive to (zero-load) latency
• Sometimes running fewer CTAs can improve performance (less DRAM contention)
Backup slides (figures from the paper):
• Interconnect Topology (Fig. 9)
• Interconnect Latency and Bandwidth Sensitivity (Fig. 10-11)
• Memory Controller Optimization Effects (Fig. 12)
• DRAM Utilization and Efficiency (Fig. 13-14)
• L1 / L2 Cache (Fig. 15)
• Varying CTAs (Fig. 16)
• Inter-Warp Coalescing (Fig. 17)