Evolution of the NVIDIA GPU Architecture
Jason Lowden
Advanced Computer Architecture
November 7, 2012
∗ Introduction of the NVIDIA GPU
∗ Graphics Pipeline
∗ GPU Terminology
∗ Architecture of a GPU
  ∗ Computing Elements
  ∗ Memory Types
∗ Fermi Architecture
∗ Kepler Architecture
∗ GPUs as a Computational Device
  ∗ CUDA Programming
  ∗ Performance Comparison
∗ Relation to SMT, Vector Processors, and DSPs
∗ Summary
Agenda
∗ First GPU is released in 1999
  ∗ Used for the purpose of graphics processing
  ∗ GeForce and Quadro
∗ CUDA Architecture released in 2006
  ∗ Designed for use by industry and academia as a computing device
  ∗ Move towards commodity parallel processing
∗ Tesla GPU series released in 2007
∗ Fermi Architecture released in 2009
∗ Kepler Architecture released in 2012
NVIDIA GPU History
Graphics Pipeline
∗ Thread – The smallest grain of the hierarchy of device computation
∗ Block – A group of threads
∗ Grid – A group of blocks
∗ Warp – A group of 32 threads that are executed simultaneously on the device
∗ Kernel – The function whose launch creates a grid for GPU execution (see the sketch below)
Terminology
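A minimal sketch of this hierarchy in CUDA C (not from the original slides; fillIds and n are placeholder names):

// One kernel launch creates a grid of blocks; each block holds threads,
// which the hardware groups into 32-wide warps.
__global__ void fillIds( int* data, int n ) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if( id < n ) data[ id ] = id;
}

int main() {
    const int n = 1024;
    int* d_data;
    cudaMalloc( &d_data, n * sizeof(int) );
    fillIds<<< 4, 256 >>>( d_data, n );  // grid of 4 blocks, 256 threads (8 warps) each
    cudaDeviceSynchronize();
    cudaFree( d_data );
    return 0;
}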
∗ Same components as a typical CPU
∗ However…
  ∗ More computing elements
  ∗ More types of memory
∗ Original GPUs had vertex and pixel shaders
  ∗ Specifically for graphics
∗ Modern GPUs are slightly different
  ∗ CUDA – Compute Unified Device Architecture
Architecture of a GPU
∗ Streaming Processor (SP) – Core of the design
  ∗ Place where all of the computation takes place
∗ Streaming Multiprocessor (SM)
  ∗ A group of streaming processors
  ∗ In addition to the SPs, each SM also contains the Special Function Units and Load/Store Units
∗ Instruction Schedulers
∗ Complex Control Logic
Computational Elements of a GPU
Streaming Multiprocessor Architecture
∗ Global
  ∗ DRAM
  ∗ Slowest Performance
∗ Texture
  ∗ Cached Global Memory
  ∗ “Bound” at runtime
∗ Constant
  ∗ Cached Global Memory
∗ Shared
  ∗ Local to a block of threads (see the sketch below)
Types of GPU Memory
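A hedged sketch of how these spaces appear in CUDA C (declarations and names are illustrative, not from the slides; texture memory is bound through the runtime API rather than declared this way):

// coeffs lives in constant memory: cached, read-only from the kernel.
__constant__ float coeffs[ 16 ];

// in and out point to global memory (DRAM); tile lives in on-chip shared memory.
// Assumes a 256-thread block.
__global__ void scale( const float* in, float* out ) {
    __shared__ float tile[ 256 ];                    // local to this block of threads
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    tile[ threadIdx.x ] = in[ id ];                  // stage global data in shared memory
    __syncthreads();                                 // make staged data visible block-wide
    out[ id ] = tile[ threadIdx.x ] * coeffs[ 0 ];
}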
Architectural Memory Hierarchy
Fermi Architecture
∗ Increased the number of SPs per SM
∗ Unified request path for load/store instructions
∗ Implementation of a cache hierarchy
  ∗ L1 cache per SM
    ∗ Configurable with shared memory (see the sketch below)
  ∗ L2 cache is shared globally
∗ Register spilling
  ∗ Occurs when the register requirements of a thread exceed what is available on the device
  ∗ Previous generation: spill to DRAM (global memory)
  ∗ Fermi: use of the L1 cache
Fermi Improvements
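The L1/shared split can be chosen per kernel through the CUDA runtime; a minimal sketch, assuming a placeholder kernel name myKernel:

__global__ void myKernel( float* data ) { /* ... */ }

int main() {
    // Prefer a 48 KB L1 / 16 KB shared split for this kernel
    // (the Fermi default is 16 KB L1 / 48 KB shared).
    cudaFuncSetCacheConfig( myKernel, cudaFuncCachePreferL1 );
    // ... allocate, launch myKernel, synchronize ...
    return 0;
}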
Summary
Kepler SM Overview
∗ Goal: Improve GPU performance and power efficiency
  ∗ Improved to 3 times the performance per watt of Fermi
∗ Increased to 192 SPs per SM
∗ 32 Special Function Units for floating-point operations
∗ Improved Warp Scheduling
Kepler SM Design
Warp Scheduler
∗ 4 warp schedulers per SM
∗ Each scheduler can issue up to 2 independent instructions from its warp when it is ready to issue
Kepler Memory Architecture
∗ Shared Memory and L1 are still physically shared
  ∗ New configuration: 32 KB L1, 32 KB shared
  ∗ Shared memory bandwidth is doubled compared with Fermi
∗ Increased the size of L2
  ∗ Doubled from Fermi's size, increasing it to 1536 KB
∗ Introduction of a read-only data cache (see the sketch below)
  ∗ Previously, this was used in Fermi as the texture cache
  ∗ 48 KB of storage
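On GK110 the compiler can route loads through this read-only cache when a kernel parameter is marked const __restrict__, or explicitly with the __ldg() intrinsic (sm_35 and later). A hedged sketch with a placeholder name:

// copyScaled is illustrative, not from the slides.
__global__ void copyScaled( const float* __restrict__ in, float* out, float s ) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[ id ] = s * __ldg( &in[ id ] );   // explicit load through the read-only cache
}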
Warp Shuffle Instructions
∗ In Fermi, data could only be exchanged between threads using shared memory
  ∗ Resulted in additional synchronization time
∗ Kepler adds the shuffle functions, which
  ∗ Exchange data between threads without using shared memory
  ∗ Handle the store-and-load operation as a single step
∗ Data can only be shared within the same warp
∗ In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction (see the sketch below)
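A minimal sketch of a warp-wide sum using the Kepler-era shuffle intrinsic (warpSum is a placeholder name; CUDA 9 and later spell the intrinsic __shfl_down_sync):

// Each step pulls a value from the lane 'offset' positions higher in the
// same warp: no shared memory, no __syncthreads().
__device__ float warpSum( float val ) {
    for( int offset = 16; offset > 0; offset /= 2 )
        val += __shfl_down( val, offset );
    return val;   // lane 0 ends up holding the warp's total
}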
Kepler Hardware Features
∗ Dynamic Parallelism
  ∗ Any kernel can launch more kernels from within itself (see the sketch below)
  ∗ Takes additional load off of the CPU
∗ Hyper-Q
  ∗ 32 hardware-managed work queues
  ∗ Fermi had 1 queue
∗ Grid Management Unit (GMU)
  ∗ Needed to manage the number of grids that are executed
  ∗ Introduced to handle all of the grids that can be active at one time
∗ NVIDIA GPUDirect™
  ∗ Ability for CUDA-enabled GPUs to interact without the need for CPU intervention
  ∗ The GPU can interact directly with the NIC
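A hedged sketch of dynamic parallelism (parent and child are placeholder names; requires sm_35 and compilation with relocatable device code, e.g. nvcc -rdc=true):

__global__ void child( int work ) {
    // ... each child grid processes one unit of work ...
}

__global__ void parent() {
    // One thread launches a child grid directly from the GPU: no CPU round trip.
    if( threadIdx.x == 0 )
        child<<< 1, 32 >>>( blockIdx.x );
}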
Comparison of Kepler and Fermi
∗ Historically, GPUs were used for graphics to offload CPU work
∗ Current trend – combine the CPU and GPU on a single chip
∗ Because the work consists of massively parallel computations, GPUs are ideal thanks to their number of processing cores
  ∗ However, they are only ideal when there are few data dependencies
∗ Introduction of CUDA and the Tesla GPUs
Use for Computation
∗ Extensions to the C language
  ∗ With some C++ support
∗ Programming support
  ∗ Windows – Visual Studio
  ∗ Linux/Mac – Eclipse
∗ Programming paradigm where each computation takes place on a separate thread
∗ Requires an NVIDIA GPU for acceleration
  ∗ Simulators are used for research purposes
CUDA Programming
Example – Vector Addition
C

// Sequential loop: one addition at a time.
for( int i = 0; i < SIZE; ++i ) {
    c[ i ] = a[ i ] + b[ i ];
}

CUDA

// One thread per element; this simple form assumes a single-block launch,
// so id is just the thread's index within the block.
__global__ void addVectors( float* a, float* b, float* c ) {
    int id = threadIdx.x;
    if( id < SIZE ) {
        c[ id ] = a[ id ] + b[ id ];
    }
}
∗ Explicit memory operations to allocate and copy data from the CPU to the GPU (see the sketch below)
  ∗ Some exceptions do apply
∗ All kernels execute asynchronously of the CPU
  ∗ Explicit synchronization barriers between the processors
Programming Requirements
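A hedged host-side sketch for the addVectors kernel above (error checking omitted; h_a, h_b, h_c are placeholder host arrays, and SIZE must fit within one block for this launch):

float *d_a, *d_b, *d_c;
size_t bytes = SIZE * sizeof(float);
cudaMalloc( &d_a, bytes );  cudaMalloc( &d_b, bytes );  cudaMalloc( &d_c, bytes );
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice );   // explicit CPU-to-GPU copies
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice );
addVectors<<< 1, SIZE >>>( d_a, d_b, d_c );              // asynchronous launch
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );   // acts as a barrier: waits for the kernel
cudaFree( d_a );  cudaFree( d_b );  cudaFree( d_c );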
∗ To meet data dependencies:
  ∗ Synchronization primitives
    ∗ __syncthreads() – synchronizes all threads in a block
  ∗ Atomic operations – depending on compute capability/CUDA version, these are possible on global and shared memory (see the sketch below)
∗ Performance is dictated by memory operations and synchronization cost
  ∗ Memory coalescence
  ∗ Warp divergence
Synchronization and Performance
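A hedged sketch combining both primitives (blockHistogram and the 64-bin layout are illustrative placeholders; assumes non-negative data and at least 64 threads per block):

__global__ void blockHistogram( const int* data, int* bins, int n ) {
    __shared__ int localBins[ 64 ];
    if( threadIdx.x < 64 ) localBins[ threadIdx.x ] = 0;
    __syncthreads();                                    // all threads see zeroed bins

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if( id < n )
        atomicAdd( &localBins[ data[ id ] % 64 ], 1 );  // shared-memory atomic
    __syncthreads();                                    // counts complete before publishing

    if( threadIdx.x < 64 )
        atomicAdd( &bins[ threadIdx.x ], localBins[ threadIdx.x ] );  // global atomic
}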
Performance Comparison
∗ SMT
  ∗ Many smaller cores, with less functionality, to compute results
  ∗ Each core has a hardware context for a thread that can be switched out
∗ Vector Processors
  ∗ Compute results in parallel that a CPU could only compute sequentially
  ∗ Ability to access large chunks of data from memory at a given time
  ∗ Banks of shared memory – could lead to bank conflicts
∗ Digital Signal Processors
  ∗ As with DSP algorithms, many applications could also use the MAC (multiply-accumulate) elements; these are built into the GPU by design
Relation to Other Architectures
∗ GPUs are massively parallel devices that can be used for general-purpose computing, in addition to graphics processing
∗ As the cost continues to decrease, these devices become off-the-shelf components that can be used to build larger systems
∗ In addition to compute capabilities, Kepler offers the benefit of additional performance per watt, making for a more power-efficient design
∗ When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms
Conclusions
∗ http://www.nvidia.com/page/corporate_timeline.html
∗ http://www.pcmag.com/encyclopedia_term/0,2542,t=graphics+pipeline&i=43933,00.asp
∗ S. L. Alarcon, “CUDA Memories,” unpublished.
∗ NVIDIA. (2012, April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
∗ NVIDIA. (2009). NVIDIA’s Next Generation CUDA™ Compute Architecture: Fermi. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
References