Evolution of the NVIDIA GPU Architecture Jason Lowden Advanced Computer Architecture November 7, 2012



Agenda

∗ Introduction of the NVIDIA GPU
∗ Graphics Pipeline
∗ GPU Terminology
∗ Architecture of a GPU
  ∗ Computing Elements
  ∗ Memory Types
∗ Fermi Architecture
∗ Kepler Architecture
∗ GPUs as a Computational Device
  ∗ CUDA Programming
  ∗ Performance Comparison
∗ Relation to SMT, Vector Processors, and DSPs
∗ Summary


NVIDIA GPU History

∗ First GPU released in 1999
  ∗ Used for the purpose of graphics processing
  ∗ GeForce and Quadro
∗ CUDA Architecture released in 2006
  ∗ Designed for use by industry and academia as a computing device
  ∗ Move towards commodity parallel processing
∗ Tesla GPU series released in 2007
∗ Fermi Architecture released in 2009
∗ Kepler Architecture released in 2012


Graphics Pipeline


Terminology

∗ Thread – The smallest unit in the hierarchy of device computation
∗ Block – A group of threads
∗ Grid – A group of blocks
∗ Warp – A group of 32 threads that are executed simultaneously on the device
∗ Kernel – The function run on the GPU; launching it creates a grid for GPU execution
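The terms above map onto the index variables that CUDA provides inside a kernel. The following is a minimal sketch rather than code from the presentation: the kernel name recordHierarchy, its parameters, and the launch sizes are illustrative assumptions. Each thread derives its global index from blockIdx, blockDim, and threadIdx, and its warp from threadIdx.x / warpSize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread records the block and warp it belongs to.
__global__ void recordHierarchy(int *blockOf, int *warpOf, int n) {
    // Global thread index: position of this thread within the whole grid.
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        blockOf[id] = blockIdx.x;
        warpOf[id]  = threadIdx.x / warpSize;  // warp index within the block
    }
}

int main() {
    const int n = 256;
    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up

    int *dBlock, *dWarp;
    cudaMalloc(&dBlock, n * sizeof(int));
    cudaMalloc(&dWarp, n * sizeof(int));

    // Launching the kernel creates the grid: `blocks` blocks of `threadsPerBlock` threads.
    recordHierarchy<<<blocks, threadsPerBlock>>>(dBlock, dWarp, n);

    int hBlock[n], hWarp[n];
    cudaMemcpy(hBlock, dBlock, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hWarp, dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);

    const int samples[] = {0, 31, 32, 127, 128, 255};
    for (int k = 0; k < 6; ++k) {
        int s = samples[k];
        printf("thread %3d -> block %d, warp %d\n", s, hBlock[s], hWarp[s]);
    }

    cudaFree(dBlock);
    cudaFree(dWarp);
    return 0;
}
```

With 128 threads per block, each block holds four warps, so threads 0 to 31 of a block report warp 0 and threads 32 to 63 report warp 1.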


Architecture of a GPU

∗ Same components as a typical CPU
∗ However,…
  ∗ More computing elements
  ∗ More types of memory
∗ Original GPUs had vertex and pixel shaders
  ∗ Specifically for graphics
∗ Modern GPUs are slightly different
  ∗ CUDA – Compute Unified Device Architecture


Computational Elements of a GPU

∗ Streaming Processor (SP) – Core of the design
  ∗ Place where all of the computation takes place
∗ Streaming Multiprocessor (SM)
  ∗ Groups of streaming processors
  ∗ In addition to the SPs, these also contain the Special Function Units and Load/Store Units
∗ Instruction Schedulers
∗ Complex Control Logic


Streaming Multiprocessor Architecture


Types of GPU Memory

∗ Global
  ∗ DRAM
  ∗ Slowest Performance
∗ Texture
  ∗ Cached Global Memory
  ∗ “Bound” at runtime
∗ Constant
  ∗ Cached Global Memory
∗ Shared
  ∗ Local to a block of threads
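Three of these memory types can be seen together in one small kernel. This is an illustrative sketch, not the presenter's code: the names reverseAndScale, scale, and tile, and the block size of 128, are assumptions. Input and output live in global memory, the scale factor lives in constant memory, and each block stages its elements in shared memory. Texture memory is left out to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128

// Constant memory: read-only inside kernels and cached.
__constant__ float scale;

// Hypothetical kernel: reverses each block-sized segment and scales it.
__global__ void reverseAndScale(const float *in, float *out) {
    __shared__ float tile[BLOCK];                // shared memory: local to this block
    int base = blockIdx.x * BLOCK;

    tile[threadIdx.x] = in[base + threadIdx.x];  // global memory -> shared memory
    __syncthreads();                             // the whole tile must be loaded before it is read

    out[base + threadIdx.x] = scale * tile[BLOCK - 1 - threadIdx.x];
}

int main() {
    const int n = 4 * BLOCK;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float hScale = 2.0f;
    cudaMemcpyToSymbol(scale, &hScale, sizeof(float));  // host -> constant memory

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));                // global memory allocations
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    reverseAndScale<<<n / BLOCK, BLOCK>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // out[0] is the last element of the first segment times the scale: 2 * 127.
    printf("out[0] = %.1f\n", hOut[0]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```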


Architectural Memory Hierarchy


Fermi Architecture


Fermi Improvements

∗ Increase the number of SPs per SM
∗ Unified Request Path for load/store instructions
∗ Implementation of a cache hierarchy
  ∗ L1 cache per SM
    ∗ Configurable with Shared Memory
  ∗ L2 cache is shared globally
∗ Register Spilling
  ∗ Occurs when the register requirements of a thread exceed what is available on the device
  ∗ Previous Generation: Spill to DRAM (global memory)
  ∗ Fermi: Use of the L1 cache


Summary


Kepler SM Overview

∗ Goal: Improve GPU performance and power efficiency
  ∗ Up to 3 times the performance per watt of Fermi
∗ Increased to 192 SPs per SM
∗ 32 Special Function Units
∗ Improved Warp Scheduling


Kepler SM Design


Warp Scheduler

∗ 4 warp schedulers
∗ Each scheduler can issue up to 2 independent instructions when it is ready to issue.


Kepler Memory Architecture

∗ Shared Memory and L1 are still physically shared
  ∗ New configuration: 32 KB L1, 32 KB Shared
  ∗ Shared memory bandwidth is doubled compared with Fermi
∗ Increased the size of L2
  ∗ Doubled the size of Fermi's L2, increasing it to 1536 KB
∗ Introduction of Read‐Only Cache
  ∗ In Fermi, this cache was used only as the texture cache
  ∗ 48 KB of storage
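Both the L1 / shared memory split and the read-only cache are visible from CUDA code. The sketch below is a hedged illustration: the kernel name doubleValues and the sizes are assumptions, and the cache configuration is only a preference that the runtime may ignore on some devices. It uses cudaFuncSetCacheConfig with cudaFuncCachePreferEqual to request the equal split, and __ldg() to load through the read-only data cache, which requires compute capability 3.5 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. `const` plus `__restrict__` marks the input as read-only
// and not aliased; __ldg() explicitly requests a load through the read-only cache.
__global__ void doubleValues(const float * __restrict__ in,
                             float * __restrict__ out, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        out[id] = 2.0f * __ldg(&in[id]);
    }
}

int main() {
    const int n = 1024;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Ask for the equal L1 / shared memory split for this kernel.
    // This is a hint, not a guarantee.
    cudaFuncSetCacheConfig(doubleValues, cudaFuncCachePreferEqual);

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    doubleValues<<<n / 256, 256>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[3] = %.1f\n", hOut[3]);  // 2 * 3 = 6.0

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The other two preferences, cudaFuncCachePreferL1 and cudaFuncCachePreferShared, select the larger-L1 and larger-shared-memory splits respectively.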


Warp Shuffle Instructions

∗ In Fermi, data could only be exchanged between threads using shared memory.
  ∗ Resulted in additional synchronization time
∗ Kepler adds shuffle instructions, which
  ∗ Exchange data between threads without using shared memory
  ∗ Handle the store‐and‐load operation as a single step
∗ Data can only be shared within the same warp
∗ In NVIDIA's example, an FFT algorithm saw a 6% performance increase when using this instruction.
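A common use of shuffle is a sum across one warp with no shared memory. The following is a minimal sketch under stated assumptions: the names warpSum and sumOneWarp are hypothetical, and it uses __shfl_down_sync, the form required by CUDA 9 and later. Kepler-era toolkits exposed the same operation as __shfl_down without the mask argument.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: tree reduction across the 32 lanes of a warp.
// After the loop, lane 0 holds the sum of all lanes' inputs.
__inline__ __device__ int warpSum(int value) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Each lane reads `value` from the lane `offset` positions above it.
        value += __shfl_down_sync(0xffffffff, value, offset);
    }
    return value;
}

// Launched with exactly one warp (32 threads).
__global__ void sumOneWarp(const int *in, int *out) {
    int total = warpSum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

int main() {
    int hIn[32], hOut = 0;
    for (int i = 0; i < 32; ++i) hIn[i] = i + 1;  // 1, 2, ..., 32

    int *dIn, *dOut;
    cudaMalloc(&dIn, 32 * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, hIn, 32 * sizeof(int), cudaMemcpyHostToDevice);

    sumOneWarp<<<1, 32>>>(dIn, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);

    printf("warp sum = %d\n", hOut);  // 1 + 2 + ... + 32 = 528

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

No __syncthreads() call and no shared memory array are needed, which is the saving the slide describes.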


Kepler Hardware Features

∗ Dynamic Parallelism
  ∗ Any kernel can launch more kernels from within itself
  ∗ Takes additional load off of the CPU
∗ Hyper‐Q
  ∗ 32 hardware managed work queues
  ∗ Fermi had 1 queue
∗ Grid Management Unit
  ∗ Needed to manage the number of grids that are executed
  ∗ Introduction of the GMU to handle all of the grids that can be active at one time
∗ NVIDIA GPUDirect™
  ∗ Ability for CUDA enabled GPUs to interact without the need for CPU intervention
  ∗ The GPU can interact directly with the NIC
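Dynamic Parallelism can be illustrated with a parent kernel that launches child kernels itself. This is a sketch under assumptions, not code from the presentation: the kernel names, chunk sizes, and data layout are hypothetical. It needs a device of compute capability 3.5 or later and relocatable device code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHUNKS 4
#define CHUNK_SIZE 32

// Child kernel (hypothetical): increments one chunk of the array.
__global__ void incrementChunk(int *data, int offset) {
    data[offset + threadIdx.x] += 1;
}

// Parent kernel (hypothetical): each parent thread launches a child grid
// for its own chunk, without returning control to the CPU.
__global__ void launchChildren(int *data) {
    int offset = threadIdx.x * CHUNK_SIZE;
    incrementChunk<<<1, CHUNK_SIZE>>>(data, offset);
}

int main() {
    const int n = CHUNKS * CHUNK_SIZE;
    int *dData;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemset(dData, 0, n * sizeof(int));

    launchChildren<<<1, CHUNKS>>>(dData);
    // The parent grid does not complete until all of its child grids have finished.
    cudaDeviceSynchronize();

    int hData[n];
    cudaMemcpy(hData, dData, n * sizeof(int), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; ++i) {
        if (hData[i] != 1) ok = 0;
    }
    printf("all elements incremented: %s\n", ok ? "yes" : "no");

    cudaFree(dData);
    return 0;
}
```

One way to build it, assuming a file named dynpar.cu and a suitable target architecture: nvcc -rdc=true -arch=sm_XX dynpar.cu -lcudadevrt, where XX is 35 or higher.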


Comparison of Kepler and Fermi


Use for Computation

∗ Historically, GPUs were used for graphics to offload CPU work
  ∗ Current trend – Combine CPU and GPU on a single chip
∗ Because the work is massively parallel, GPUs are well suited to it owing to their large number of processing cores.
  ∗ However, they are only ideal when there are few data dependencies.
∗ Introduction of CUDA and the Tesla GPUs


CUDA Programming

∗ Extensions to the C language
  ∗ With some C++ support
∗ Programming Support
  ∗ Windows – Visual Studio
  ∗ Linux/Mac – Eclipse
∗ Programming paradigm where each computation takes place on a separate thread
∗ Requires NVIDIA GPU for acceleration
  ∗ Simulators are used for research purposes


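Example – Vector Addition

C

for( int i = 0; i < SIZE; ++i ) {
    c[ i ] = a[ i ] + b[ i ];
}

CUDA

__global__ void addVectors( float* a, float* b, float* c ) {
    // One thread per element. threadIdx.x alone is the element index
    // only when the kernel is launched as a single block.
    int id = threadIdx.x;
    if( id < SIZE ) {
        c[ id ] = a[ id ] + b[ id ];
    }
}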


Programming Requirements

∗ Explicit Memory Operations to allocate and copy data from the CPU to GPU
  ∗ Some exceptions do apply
∗ All kernels execute asynchronously with respect to the CPU
  ∗ Explicit synchronization barriers between the processors
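These requirements can be sketched as a complete host program around a vector addition kernel. The code is a minimal illustration, not the presenter's: SIZE, the launch configuration, and the test values are assumptions, and the kernel indexes with blockIdx.x * blockDim.x + threadIdx.x so that it can be launched with more than one block. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define SIZE 1000

__global__ void addVectors(const float *a, const float *b, float *c) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (id < SIZE) {
        c[id] = a[id] + b[id];
    }
}

int main() {
    const size_t bytes = SIZE * sizeof(float);

    // Host (CPU) buffers.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < SIZE; ++i) {
        hA[i] = (float)i;
        hB[i] = 2.0f * i;
    }

    // Explicit allocation on the device (GPU).
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Explicit copies from CPU memory to GPU memory.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (SIZE + threads - 1) / threads;  // enough blocks to cover SIZE
    addVectors<<<blocks, threads>>>(dA, dB, dC);        // returns to the CPU immediately

    // Explicit barrier: wait for the kernel before using its results.
    cudaDeviceSynchronize();

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", hC[10]);  // 10 + 20 = 30.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The blocking cudaMemcpy back to the host would also wait for the kernel; the explicit cudaDeviceSynchronize() is kept to make the barrier visible.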


Synchronization and Performance

∗ To meet data dependencies, use synchronization primitives
  ∗ __syncthreads() – Synchronizes all threads in a block
  ∗ Atomic Operations – Depending on compute capability and CUDA version, these are possible on global and shared memory
∗ Performance is dictated by memory operations and synchronization cost
  ∗ Memory Coalescing
  ∗ Warp Divergence
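Both primitives appear in a typical block-level sum. The sketch below is an assumption-labelled example, not taken from the slides: the kernel name sumAll, the block size of 256, and the all-ones input are illustrative. Threads in a block combine values in shared memory with __syncthreads() between steps, then one thread per block adds the block's partial sum to a global total with atomicAdd.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Hypothetical kernel: sums n integers into *total.
__global__ void sumAll(const int *in, int *total, int n) {
    __shared__ int partial[BLOCK];
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Neighbouring threads read neighbouring addresses, so the loads coalesce.
    partial[threadIdx.x] = (id < n) ? in[id] : 0;
    __syncthreads();

    // Tree reduction. Active threads stay contiguous (threadIdx.x < stride),
    // so whole warps drop out together until the last few steps.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();  // outside the if: every thread in the block must reach it
    }

    // Blocks cannot synchronize with each other here, so combine atomically.
    if (threadIdx.x == 0) {
        atomicAdd(total, partial[0]);
    }
}

int main() {
    const int n = 1000;
    int hIn[n], hTotal = 0;
    for (int i = 0; i < n; ++i) hIn[i] = 1;

    int *dIn, *dTotal;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));

    sumAll<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dIn, dTotal, n);
    cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);

    printf("total = %d\n", hTotal);  // 1000 ones sum to 1000

    cudaFree(dIn);
    cudaFree(dTotal);
    return 0;
}
```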


Performance Comparison


Relation to Other Architectures

∗ SMT
  ∗ Many smaller cores, with less functionality, to compute results
  ∗ Each core has a hardware context for a thread that can be switched out
∗ Vector Processors
  ∗ Computation of results in parallel that could be done sequentially by a CPU
  ∗ Ability to access large chunks of data from memory at a given time
  ∗ Banks of shared memory – could lead to bank conflicts
∗ Digital Signal Processors
  ∗ As with DSP algorithms, many applications could also use the multiply‐accumulate (MAC) elements; these are built into the GPU by design


Conclusions

∗ GPUs are massively parallel devices that can be used for general purpose computing, in addition to graphics processing.
∗ As the cost continues to decrease, these devices become off‐the‐shelf components that can be used to build larger systems.
∗ In addition to compute capabilities, Kepler offers the benefit of additional performance per watt, making for a more power‐efficient design.
∗ When used with other technologies, like OpenCL, GPUs can be used in heterogeneous platforms.


References

∗ http://www.nvidia.com/page/corporate_timeline.html
∗ http://www.pcmag.com/encyclopedia_term/0,2542,t=graphics+pipeline&i=43933,00.asp
∗ S. L. Alarcon, “CUDA Memories,” unpublished.
∗ NVIDIA. (2012 April 16). NVIDIA CUDA C Programming Guide. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
∗ NVIDIA. (2009). NVIDIA’s Next Generation CUDA™ Compute Architecture: Fermi. [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
∗ NVIDIA. (2012). NVIDIA GeForce GTX 680. [Online]. Available: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf