
Page 1:

Graphics Processing Units

REFERENCES:

• Computer Architecture, 5th Edition, Hennessy and Patterson, 2012

• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

• http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1

• http://www.moderngpu.com/intro/performance.html

• http://heather.cs.ucdavis.edu/parprocbook

Page 2:

CPU vs. GPU

http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html

• CPU: only a small fraction of the chip is used for arithmetic

Page 4:

CPU vs GPU

Intel Haswell: 170 GFLOPS on a quad-core at 3.4 GHz

AMD Radeon R9 290: 4800 GFLOPS at 0.95 GHz

Nvidia GTX 970: 5000 GFLOPS at 1.05 GHz
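Peak numbers like these follow from cores × ops per cycle × clock. A back-of-the-envelope check (the 2560-core count and the 2 ops/cycle fused multiply-add factor are my assumptions, not from the slide):

    #include <stdio.h>

    // Peak GFLOPS ~= cores * flops per cycle per core * clock in GHz.
    // A fused multiply-add counts as 2 floating-point ops (assumption).
    int main(void) {
        double cores = 2560;       // R9 290 stream processors (assumed spec)
        double clock_ghz = 0.95;   // ~947 MHz core clock (assumed spec)
        printf("R9 290 peak: %.0f GFLOPS\n", cores * 2 * clock_ghz);  // ~4864
        return 0;
    }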

Page 5:

GPGPU

General-Purpose GPU programming

Massively parallel

Scientific computing, brain simulations, etc.

In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including the 2nd and 6th places

Page 6:

OpenCL vs CUDA

Both for GPGPU

OpenCL: open standard

Supported on AMD, NVIDIA, Intel, Altera, …

CUDA: proprietary (Nvidia)

Losing ground to OpenCL?

Similar performance

Page 7:

CUDA

Programming on Parallel Machines, Norm Matloff, Chapter 5

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Uses a thread hierarchy:

Thread

Block

Grid
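A minimal sketch (kernel name and launch sizes are illustrative, not from the slides) of how a thread locates itself in this hierarchy:

    __global__ void whoami(int *out) {
        // blockIdx: position in the grid; threadIdx: position in the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global id
        out[i] = i;   // out must hold gridDim.x * blockDim.x ints
    }

    // launch a grid of 16 blocks of 512 threads each:
    // whoami<<<16, 512>>>(d_out);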

Page 8:

Thread

Executes an instance of a kernel (program)

ThreadID (within block), program counter, registers, private memory, input and output parameters

Private memory for register spills, function calls, array variables

Nvidia Fermi Whitepaper pg 6

Page 9:

Block

Set of concurrently executing threads

Cooperate via barrier synchronization and shared memory (fast but small)

BlockID (within grid)

Nvidia Fermi Whitepaper pg 6
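A hedged sketch (kernel name and block size are mine, not from the slides) of threads in one block cooperating through fast shared memory and a barrier, computing one partial sum per block:

    #define BLOCK 512

    __global__ void blockSum(const double *in, double *out) {
        __shared__ double buf[BLOCK];            // fast per-block shared memory
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * BLOCK + tid];
        __syncthreads();                         // barrier: all of buf is written

        for (int s = BLOCK / 2; s > 0; s >>= 1) {  // tree reduction in the block
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];            // one partial sum per block
    }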

Page 10:

Grid

Array of thread blocks running same kernel

Read and write global memory (slow – hundreds of cycles)

Synchronize between dependent kernel calls

Nvidia Fermi Whitepaper pg 6
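A sketch of synchronization between dependent kernel calls (kernel names are illustrative): launches in the default stream execute in order on the device, and the host can block until both finish.

    #include <cuda_runtime.h>

    __global__ void step1(double *d) { d[blockIdx.x*blockDim.x + threadIdx.x] *= 2.0; }
    __global__ void step2(double *d) { d[blockIdx.x*blockDim.x + threadIdx.x] += 1.0; }

    void run(double *d_data, int nblocks) {
        step1<<<nblocks, 512>>>(d_data);  // writes results to global memory
        step2<<<nblocks, 512>>>(d_data);  // same stream, so it sees step1's writes
        cudaDeviceSynchronize();          // host waits for both kernels to finish
    }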

Page 11:

Hardware Mapping

GPU executes 1+ kernel (program) grids

Streaming Multiprocessor (SM) executes 1+ thread blocks

CUDA core executes a thread
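This mapping is visible through the runtime API; a small sketch (device 0 assumed) that reports how many SMs the GPU has:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: %d SMs, warp size %d\n",
               prop.name, prop.multiProcessorCount, prop.warpSize);
        return 0;
    }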

Page 12:

Fermi Architecture

Debuted in 2010

512 CUDA cores, each executing 1 FP or integer instruction per cycle

32 CUDA cores per SM

16 SMs per GPU

6 64-bit memory ports

PCI-Express interface to CPU

GigaThread scheduler distributes blocks to SMs

Each SM has a hardware thread scheduler with fast context switch

3 billion transistors

Page 13:

[Figure]

Nvidia Fermi Whitepaper pg 7

Page 14:

CUDA core

pipelined integer and FP units

IEEE 754-2008 FP fused multiply-add

integer unit: boolean, shift, move, compare, ...

Nvidia Fermi Whitepaper pg 8
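A small sketch of the FMA path (kernel name is illustrative): the fma() device function computes a*x+y with a single rounding step, per IEEE 754-2008.

    __global__ void fma_demo(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = fma(a, x[i], y[i]);  // one fused multiply-add, single rounding
    }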

Page 15:

Streaming Multiprocessor (SM)

32 CUDA cores

16 ld/st units calculate source/destination addresses

4 Special Function Units: sin, cosine, reciprocal, sqrt

Nvidia Fermi Whitepaper pg 8
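A hedged illustration (kernel name mine): the fast single-precision intrinsics such as __sinf() and __cosf() execute on the SFUs, trading some accuracy for throughput.

    __global__ void sfu_demo(int n, const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __sinf(in[i]) + __cosf(in[i]);  // SFU fast-path intrinsics
    }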

Page 16:

Warps

32 threads from a block are bundled into a warp, which executes the same instruction each cycle

This becomes the minimum size of SIMD data

Warps are implicitly synchronized; if threads branch in different directions, the warp steps through both paths using predicated instructions

Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
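A minimal sketch of divergence inside a warp (kernel name illustrative): even and odd lanes take different branches, so the warp executes both paths with the inactive lanes predicated off.

    __global__ void divergent(int n, float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            v[i] = v[i] * 2.0f;  // even lanes active here...
        else
            v[i] = v[i] + 1.0f;  // ...odd lanes active here; warp runs both
    }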

Page 17:

Maxwell Architecture

2014

16 streaming multiprocessors * 128 cores/SM = 2048 cores

Page 18:

Programming CUDA

C code

daxpy(n, 2.0, x, y);   // invoke

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

Page 19:

Programming CUDA

CUDA code

__host__
int nblocks = (n+511)/512;              // grid size: (n+511)/512 = ceiling(n/512)
daxpy<<<nblocks,512>>>(n, 2.0, x, y);   // 512 threads/block

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)                          // guard: last block may have extra threads
        y[i] = a*x[i] + y[i];
}

Page 20:

n = 8192, 512 threads/block

grid
    block 0
        warp 0:  Y[0] = A*X[0] + Y[0]  ...  Y[31] = A*X[31] + Y[31]
        ...
        warp 15: Y[480] = A*X[480] + Y[480]  ...  Y[511] = A*X[511] + Y[511]
    ...
    block 15
        warp 0:  Y[7680] = A*X[7680] + Y[7680]  ...  Y[7711] = A*X[7711] + Y[7711]
        ...
        warp 15: Y[8160] = A*X[8160] + Y[8160]  ...  Y[8191] = A*X[8191] + Y[8191]

Page 21:

Moving data between host and GPU

int main() {
    double *x, *y, a, *dx, *dy;
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y
    …
    cudaMalloc((void **)&dx, n*sizeof(double));   // allocate device memory
    cudaMalloc((void **)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);  // kernel reads y too
    …
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);       // pass device pointers
    cudaThreadSynchronize();
    cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
}
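Not on the slides, but worth noting: every CUDA runtime call returns a cudaError_t. A common idiom wraps each call in a checking macro (macro name is mine):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHECK(call)                                                \
        do {                                                           \
            cudaError_t err_ = (call);                                 \
            if (err_ != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                        cudaGetErrorString(err_), __FILE__, __LINE__); \
                exit(1);                                               \
            }                                                          \
        } while (0)

    // usage: CHECK(cudaMalloc((void **)&dx, n*sizeof(double)));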