CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2009, Jih-Kwon Peir
Computer & Information Science & Engineering, University of Florida
• Acknowledgement: Slides borrowed from
  o Accelerators for Science and Engineering Applications: GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of Illinois, 2006-2008 (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
  o Course material posted on the CUDA Zone (http://www.nvidia.com/object/cuda_education.html)
  o Intel Software Network (http://software.intel.com/en-us/academic/)
  o The Art of Multiprocessor Programming (http://software.intel.com/en-us/academic/)
  o Presentation slides from various papers
Course Goals
• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
  – principles and patterns of parallel programming
  – processor architecture features and constraints
  – programming APIs, tools, and techniques
• Learn new many-core general-purpose and GPU processor architectures
  – organization and memory systems
• Parallel programming basics: locking, synchronization, mutual exclusion, transactional memory, etc.
Course Outline
• Week 1-2: Introduction, GPU architectures, CUDA programming
• Week 3-6: CUDA threads, code blocks, grids, CUDA memory, synchronization, performance
• Week 7: Project selection and discussion
• Week 8-9: Intel many-core architectures
• Week 10-11: Parallel programming model, synchronization, mutual exclusion, conditional synchronization, locks, barriers, concurrency and correctness, sequential programs and consistency
• Week 12-13: Discussion of advanced issues in multi-core architecture and programming, including Fermi and Larrabee
• Week 14-16: In-depth discussion of project topics and project presentations
CUDA – GPU Programming
• Integrated host + device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code
[Figure: program flow — Serial Code (host) ... Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ... Serial Code (host) ... Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);]
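A minimal sketch of this structure, with an illustrative kernel body, sizes, and variable names that are assumptions rather than from the slides:

// Minimal host + device CUDA C sketch (illustrative only).
#include <cuda_runtime.h>

// Device: SPMD kernel code; every thread runs this function on its own data.
__global__ void KernelA(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= 2.0f;                     // highly parallel part
}

int main(void)
{
    const int n = 1024, nTid = 256, nBlk = n / nTid;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // ... serial code (host) ...
    KernelA<<<nBlk, nTid>>>(d_data, n);   // parallel kernel (device)
    cudaDeviceSynchronize();              // host waits for the kernel
    // ... serial code (host) resumes ...

    cudaFree(d_data);
    return 0;
}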
[Figure 3.2: An example of CUDA thread organization — the host launches Kernel 1 on the device as Grid 1 (Blocks (0,0), (1,0), (0,1), (1,1)), then Kernel 2 as Grid 2. Block (1,1) is expanded to show its 4×2×2 arrangement of threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA]
CUDA Thread Blocks and Threads
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – ...
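As a hedged illustration of how these IDs turn into memory addresses (the kernel name and image layout are assumptions, not from the slides):

// Illustrative 2D example: 2D block and thread IDs select a pixel (x, y)
// of a width x height image stored in row-major order.
__global__ void brighten(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against partial blocks
        img[y * width + x] += 0.1f;       // each thread owns one element
}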
Matrix Multiplication: A Simple Example
[Figure: matrices M, N, and P, each WIDTH × WIDTH; P = M × N]
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
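The slides give only the host version here; for comparison, a sketch of a naive device kernel (one thread per P element, reusing the Md/Nd/Pd naming of the tiled kernel later in the deck) could be:

// Naive CUDA matrix multiplication: each thread computes one Pd element
// directly from global memory (a sketch, assuming square Width x Width matrices).
__global__ void MatrixMulKernelNaive(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
}

Every iteration of the inner loop issues two global loads for one multiply-add, which is exactly the memory-bandwidth problem quantified on the "performance on G80" slide later in the deck.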
G80 Example: Thread Scheduling (cont.)
• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by an SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution by a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected
[Figure: warp scheduling timeline (TB = Thread Block, W = Warp) — TB1 W1 runs instructions 1-6 and stalls; TB2 W1 runs 1-2 and stalls; TB3 W1 runs 1-2; TB3 W2 runs 1-2 and stalls; TB2 W1 resumes with 3-4; TB1 W1 resumes with 7-8; TB1 W2 and TB1 W3 each run 1-2; TB3 W2 resumes with 3-4]
Thread Scheduling (cont.)
• Each code block is assigned to one SM; each SM can take up to 8 blocks
• Each block has up to 512 threads, divided into 32-thread warps; each warp is scheduled on 8 SPs, 4 threads per SP, and executed in SIMT mode
• The SP pipeline is ~30 stages; fetch, decode, gather, and write-back act on whole warps, so they have a throughput of 1 warp per slow clock
• Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks
• The fetch/decode/... stages have a higher throughput so they can feed both the MAD unit and the SFU/MUL unit alternately; hence the peak rate of 8 MADs + 8 MULs per (fast) clock cycle
• 6 warps (or 192 threads) per SM are needed to hide the read-after-write latencies
G80 Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
[Figure: G80 CUDA memory model — a grid contains Blocks (0,0) and (1,0); each block has its own shared memory, and each thread its own registers; all blocks share the per-grid global and constant memories, which the host can also access]
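A small sketch of how these spaces appear in CUDA C source (the variable names are illustrative):

__constant__ float coeffs[16];        // per-grid constant memory (read-only in kernels)

__global__ void scopes(float* out)    // out points into per-grid global memory
{
    int i = threadIdx.x;              // scalars live in per-thread registers
    __shared__ float tile[128];       // per-block shared memory
    float scratch[32];                // large or dynamically indexed per-thread
                                      // arrays may be placed in local memory
    scratch[i % 32] = coeffs[i % 16];
    tile[i] = scratch[i % 32];
    __syncthreads();
    out[i] = tile[i];                 // write back to global memory
}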
How about performance on G80?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 B of memory bandwidth per FLOP
  – 4 × 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  – The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
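The matching host-side launch would look roughly like this (a sketch; it assumes Width is a multiple of TILE_WIDTH and that Md, Nd, and Pd already hold device memory):

// One thread per Pd element; one thread block per TILE_WIDTH x TILE_WIDTH output tile.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);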
Today’s Intel PC Architecture: Single-Core System
• FSB connection between the processor and the Northbridge (82925X)
  – Memory Controller Hub
• The Northbridge handles the “primary” PCIe connection to video/GPU and DRAM
  – PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction)
• The Southbridge (ICH6RW) handles other peripherals
GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array is built from TPCs (Texture Processor Clusters); each TPC contains TEX units and two SMs (Streaming Multiprocessors); each SM contains instruction fetch/dispatch logic, an instruction L1 and a data L1 cache, shared memory, 8 SPs (streaming processors), and 2 SFUs (special function units)]
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution by a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
  – If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency (200 ÷ (4 instructions × 4 cycles) = 12.5, rounded up to 13)
[Figure: SM multithreaded warp scheduler — over time, it issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12, ..., warp 3 instruction 96]
CUDA Device Memory Space: Review
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant memory
  – Read-only per-grid texture memory
[Figure: (device) grid with Blocks (0,0) and (1,0); each block has shared memory, and each thread has registers and local memory; global, constant, and texture memories are shared per grid and accessible from the host]
• The host can R/W the global, constant, and texture memories using copy functions
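For example (a sketch; N and the array names are illustrative), global memory is copied with cudaMemcpy, and constant memory with cudaMemcpyToSymbol:

float h_A[N];                    // data on the host
float* d_A;
cudaMalloc((void**)&d_A, N * sizeof(float));                       // per-grid global memory
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that read/write d_A ...
cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_A);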
Memory Layout of a Matrix in C
[Figure: a 4×4 matrix M laid out linearly in row-major order — M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3. The access direction in the kernel code is shown for threads T1-T4 over time periods 1 and 2: each thread steps through a different row, so in any one time period the four threads touch elements that are far apart in the linear layout]
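A hedged sketch of the two access patterns the figure contrasts (the kernel names are illustrative; Width × Width row-major matrix as before):

// Pattern A: at each loop step, consecutive threads t read consecutive
// addresses M[i*Width + t] -- the accesses can be coalesced.
__global__ void rowStep(float* M, float* out, int Width)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    for (int i = 0; i < Width; ++i)
        sum += M[i * Width + t];      // threads walk down a column together
    out[t] = sum;
}

// Pattern B: at each loop step, consecutive threads t read addresses
// M[t*Width + i], which are Width elements apart -- not coalesced.
__global__ void colStep(float* M, float* out, int Width)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    for (int i = 0; i < Width; ++i)
        sum += M[t * Width + i];      // each thread walks its own row
    out[t] = sum;
}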
Bank Addressing Examples
• 2-way bank conflicts: linear addressing with stride == 2
• 8-way bank conflicts: linear addressing with stride == 8
[Figure: threads 0-15 mapped onto shared memory banks 0-15 — with stride 2, two threads land on each even bank; with stride 8 (x8), eight threads land on bank 0 and eight on bank 8]
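In code, these patterns correspond to strided shared-memory indexing; a fragment from inside a kernel (a sketch, assuming G80's 16 banks of 4-byte words):

__shared__ float shared[256];
int t = threadIdx.x;                 // consider a half-warp, t = 0..15

float a = shared[t];                 // stride 1: each thread hits its own bank
float b = shared[2 * t];             // stride 2: banks 0,2,...,14 each serve
                                     // two threads -> 2-way conflict
float c = shared[8 * t];             // stride 8: only banks 0 and 8 are used,
                                     // eight threads each -> 8-way conflict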
Control Flow Instructions
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – The different execution paths are serialized on G80: the control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path
Vector Reduction with Branch Divergence
[Figure: tree-based reduction over array elements 0-11 — iteration 1 forms pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 (handled by threads 0, 2, 4, 6, 8, 10); iteration 2 forms 0...3, 4...7, 8...11; iteration 3 forms 0...7 and 8...15]
No Divergence until < 16 Sub-sums
[Figure: the same reduction reorganized — in the first iteration, thread t adds element t+16 to element t (0+16, 1+17, ..., 15+31); subsequent iterations halve the stride, so the active threads stay contiguous and no warp diverges until fewer than 16 sub-sums remain]
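A hedged sketch of the two strategies these figures contrast, assuming each thread block has already loaded its elements into a shared array partialSum (the name is illustrative):

// Divergent version: active threads are interleaved (0, 2, 4, ...),
// so every warp contains both active and idle threads at each step.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (threadIdx.x % (2 * stride) == 0)
        partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
}

// Convergent version: active threads stay contiguous (0..stride-1),
// so whole warps retire early and divergence appears only once the
// stride drops below the warp size (the final sub-sums).
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    __syncthreads();
    if (threadIdx.x < stride)
        partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
}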
Fundamentals of Parallel Computing
• Parallel computing requires that
  – the problem can be decomposed into sub-problems that can be safely solved at the same time
  – the programmer structures the code and data to solve these sub-problems concurrently
• The goals of parallel computing are
  – to solve problems in less time, and/or
  – to solve bigger problems, and/or
  – to achieve better solutions
• The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency
Challenges of Parallel Programming
• Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle
  – Computational thinking (J. Wing)
• Dependences need to be identified and managed
  – The order of task execution may change the answers
    • Obvious: one step feeds its result to the next steps
    • Subtle: numeric accuracy may be affected by the ordering of steps that are logically parallel with each other
• Performance can be drastically reduced by many factors
  – Overhead of parallel processing
  – Load imbalance among processor elements
  – Inefficient data-sharing patterns
  – Saturation of critical resources such as memory bandwidth
Fermi Implements CUDA
• The definitions of memory scope, grid, thread block, and thread are the same as in Tesla
• Grid: array of thread blocks
• Thread block: up to 1,536 concurrent threads, communicating through shared memory
• The GPU has an array of SMs, each executing one or more thread blocks; each block is grouped into warps of 32 threads per warp
• Other resource constraints are implementation-specific
Fermi – GT300 Key Features
• 32 cores per SM, 512 cores in total
• Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA)
• Two warps from different thread blocks (even different kernels) can be issued and executed concurrently
• ECC protection from the registers to DRAM
• Linear addressing model with caching at all levels
• Large shared memory / L1 cache
• Double-precision performance 8x faster than GT200, reaching ~600 double-precision GFLOPS
Fermi – GT300 Key Features (cont.)
• Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs
• The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels
• Switching from one application to another is 20x faster on Fermi
• Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
• Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
• Fermi supports FMA for both single and double precision
Instruction Schedule Example
• A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of load/store units. [The accompanying figure, not reproduced here, shows how instructions are issued to the four execution blocks.]
• It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs.
• Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)
Multi-Core Architecture: Intel Quad-Core Technology of Today
Cache Structure
[Figure: quad-core cache structure — Core0 and Core1 share one 4 MB L2 cache, Core2 and Core3 share another; both connect through the bus interface to a 1066 MHz / 1333 MHz FSB]
• The L2 cache of today’s quad-core processors is not one cache shared by all 4 cores. Instead, there are two L2 caches, each shared by two cores.
Programming with OpenMP*
What Is OpenMP*?
[Figure: a collage of OpenMP constructs surrounds the answer — directives, runtime calls, and environment variables such as:]
  #pragma omp parallel for private(A, B)
  #pragma omp critical
  omp_set_lock(lck)
  call OMP_INIT_LOCK (ilok)
  call omp_test_lock(jlok)
  CALL OMP_SET_NUM_THREADS(10)
  Nthrds = OMP_GET_NUM_PROCS()
  setenv OMP_SCHEDULE "dynamic"
  C$OMP parallel do shared(a, b, c)
  C$OMP PARALLEL REDUCTION (+: A, B)
  C$OMP DO lastprivate(XX)
  C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
  C$OMP THREADPRIVATE(/ABC/)
  C$OMP PARALLEL COPYIN(/blk/)
  C$OMP ORDERED / SINGLE PRIVATE(X) / SECTIONS / MASTER / ATOMIC / FLUSH
  !$OMP BARRIER
http://www.openmp.org — the current spec is OpenMP 2.5 (250 pages, combined C/C++ and Fortran)
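A minimal sketch of the work-sharing style these constructs enable, in C (the array sizes and values are illustrative):

#include <omp.h>
#include <stdio.h>

#define N 1000000
static float a[N], b[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; ++i) { a[i] = 0.5f * i; b[i] = 2.0f; }

    /* Work-sharing loop: iterations are split among the team of threads.
       reduction(+:sum) gives each thread a private partial sum and
       combines them at the end (cf. C$OMP PARALLEL REDUCTION above). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += (double)a[i] * b[i];

    printf("dot = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}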
More Material
• Intel Larrabee Architecture
• Herlihy’s Book
  – Chapter 1: Introduction
  – Chapter 2: Mutual Exclusion