
Page 1

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Fall 2009, Jih-Kwon Peir

Computer & Information Science & Engineering, University of Florida

Page 2

• Acknowledgement: slides borrowed from
  o Accelerators for Science and Engineering Applications: GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of Illinois, 2006-2008 (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
  o Course material posted on CUDA Zone (http://www.nvidia.com/object/cuda_education.html)
  o Intel Software Network (http://software.intel.com/en-us/academic/)
  o The Art of Multiprocessor Programming (http://software.intel.com/en-us/academic/)
  o Presentation slides from various papers


Page 3

Course Goals

• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
  – principles and patterns of parallel programming
  – processor architecture features and constraints
  – programming APIs, tools, and techniques
• Learn new many-core general-purpose and GPU processor architectures
  – organization and memory systems
• Parallel programming basics: locking, synchronization, mutual exclusion, transactional memory, etc.

Page 4

Course Outline

Week 1-2: Introduction, GPU architectures, CUDA programming
Week 3-6: CUDA threads, code blocks, grids, CUDA memory, synchronization, performance
Week 7: Project selection and discussion
Week 8-9: Intel many-core architectures
Week 10-11: Parallel programming model, synchronization, mutual exclusion, conditional synchronization, locks, barriers, concurrency and correctness, sequential programs and consistency (add Fermi and Larrabee)
Week 12-13: Discussion of advanced issues in multi-core architecture and programming
Week 14-16: In-depth discussion of project topics and project presentations

Page 5

CUDA – GPU Programming

• Integrated host + device C application program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Execution alternates between host and device:
  Serial Code (host)
  Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
  Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
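A minimal end-to-end sketch of this structure (the kernel name, sizes, and the increment operation are illustrative, not from the slides):

    #include <cuda_runtime.h>

    // Device code: each thread increments one element (SPMD kernel).
    __global__ void KernelA(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main(void) {
        const int n = 1024;
        float* d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));

        // Serial host code runs here, then launches the parallel kernel.
        int nTid = 256;                      // threads per block
        int nBlk = (n + nTid - 1) / nTid;    // blocks in the grid
        KernelA<<<nBlk, nTid>>>(d_data, n);
        cudaDeviceSynchronize();             // back to serial host code

        cudaFree(d_data);
        return 0;
    }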

Page 6

[Figure 3.2: An Example of CUDA Thread Organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 is a 2×2 array of blocks, Block (0,0) through Block (1,1), and the expanded Block (1,1) contains a 4×2×2 arrangement of threads, Thread (0,0,0) through Thread (3,1,1).]

CUDA Thread Blocks and Threads

• Each thread uses its IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• This simplifies memory addressing when processing multidimensional data (see the sketch below)
  – Image processing
  – Solving PDEs on volumes
  – ...
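As an illustration of ID-based data selection (the kernel and array names are hypothetical, not from the slides), a thread can derive a unique 2D coordinate from its 2D block ID and 2D thread ID:

    // Illustrative kernel: each thread handles one element of a
    // Width x Height image stored in row-major order.
    __global__ void Scale2D(float* img, int Width, int Height)
    {
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        if (Row < Height && Col < Width)
            img[Row * Width + Col] *= 0.5f;   // row-major addressing
    }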

Page 7

Matrix Multiplication: A Simple Example

[Figure: P = M × N, where M, N, and P are all WIDTH × WIDTH matrices; row i of M and column j of N, indexed by k, produce element P(i, j).]

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

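For contrast with the host loop above, a straightforward (untiled) device version of the same computation, as a sketch (kernel name is illustrative; one thread computes one output element):

    // Naive CUDA kernel: one thread per element of Pd, all operands
    // fetched from global memory on every iteration.
    __global__ void MatrixMulKernelNaive(float* Md, float* Nd, float* Pd, int Width)
    {
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        if (Row < Width && Col < Width) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
            Pd[Row * Width + Col] = Pvalue;
        }
    }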

Page 8

G80 Example: Thread Scheduling (cont.)

• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by the SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

[Figure: warp scheduling timeline (TB = Thread Block, W = Warp). For example, TB1/W1 issues instructions 1-6 and stalls; the scheduler switches to TB2/W1, then TB3/W1 as each stalls in turn, later returning to TB1/W1 for instructions 7-8, and so on. Whenever a warp stalls on a long-latency operation, another eligible warp is selected with no scheduling overhead.]

Page 9

Thread Scheduling (cont.)

• Each code block is assigned to one SM; each SM can take up to 8 blocks
• Each block holds up to 512 threads, divided into 32-thread warps; each warp is scheduled on the 8 SPs, 4 threads per SP, and executes in SIMT mode
• The SP pipeline is ~30 stages deep; fetch, decode, gather, and write-back act on whole warps, so they have a throughput of 1 warp/slow clock
• Execute acts on groups of 8 threads, i.e. quarter-warps (there are only 8 SPs per SM), so its throughput is 1 warp/4 fast clocks, or 1 warp/2 slow clocks
• The fetch/decode/... stages have a higher throughput so they can feed both the MAD and the SFU/MUL units alternately; hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle
• About 6 warps (or 192 threads) per SM are needed to hide the read-after-write latencies
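To see where the 6-warp figure comes from, assume a register read-after-write latency of roughly 24 fast clocks (a commonly cited G80 value, not stated on this slide): with one warp issued every 4 fast clocks, 24 ÷ 4 = 6 warps, i.e. 6 × 32 = 192 threads per SM, keep the pipeline busy until each result is ready.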

Page 10

G80 Implementation of CUDA Memories

Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read-only per-grid constant memory

[Figure: memory hierarchy. A grid contains blocks; each block, e.g. Block (0,0) and Block (1,0), has its own shared memory; each thread, e.g. Thread (0,0) and Thread (1,0), has its own registers; all blocks share the per-grid global memory, and the host accesses global and constant memory.]

Page 11

[Same memory-hierarchy figure as Page 10.]

How about performance on G80?

• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – i.e. 4 bytes of memory bandwidth per FLOP
  – 4 × 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  – The actual 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
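Spelling out the bandwidth bound explicitly: at 4 B/FLOP, sustaining the peak 346.5 GFLOPS needs 4 × 346.5 = 1386 GB/s, while the available 86.4 GB/s supports at most 86.4 ÷ 4 = 21.6 GFLOPS.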

Page 12

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
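A sketch of how this kernel would be launched (the TILE_WIDTH value and variable names are illustrative; Width is assumed to be a multiple of TILE_WIDTH, as the tile loop above requires):

    #define TILE_WIDTH 16

    // One TILE_WIDTH x TILE_WIDTH thread block per output tile of Pd.
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);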

Page 13

Today’s Intel PC Architecture: Single-Core System

• FSB connection between the processor and the Northbridge (82925X), the Memory Controller Hub
• The Northbridge handles the “primary” PCIe connection to the video/GPU card and DRAM
  – PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction)
• The Southbridge (ICH6RW) handles other peripherals

Page 14

GeForce-8 Series HW Overview

[Figure: the Streaming Processor Array is built from Texture Processor Clusters (TPCs); each TPC contains a texture unit (TEX) and two Streaming Multiprocessors (SMs); each SM contains instruction fetch/dispatch logic, an instruction L1 and a data L1 cache, shared memory, 8 Streaming Processors (SPs), and 2 Special Function Units (SFUs).]

Page 15

SM Warp Scheduling

• SM hardware implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
  – If one global memory access is needed for every 4 instructions
  – A minimum of 13 warps is needed to fully tolerate the 200-cycle memory latency (see the arithmetic after the figure)

[Figure: the SM multithreaded warp scheduler interleaves instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then later warp 8 instruction 12 and warp 3 instruction 96, ...]
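The 13-warp figure follows from the slide's own numbers: each warp covers 4 instructions × 4 cycles = 16 cycles of execution per global access, so 200 ÷ 16 = 12.5, rounded up to 13 warps.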

Page 16

CUDA Device Memory Space: Review

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory

[Figure: device memory spaces. Within the (device) grid, each thread has registers and local memory; each block has shared memory; and the whole grid shares global, constant, and texture memory, which the host can also access.]

• The host can R/W global, constant, and texture memories using copy functions (see the sketch below)
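A minimal sketch of such host-side copies with the CUDA runtime API (array names and sizes are illustrative):

    // Illustrative host code: move data to and from per-grid global memory.
    float h_A[256];                                      // host array
    float* d_A;
    cudaMalloc((void**)&d_A, sizeof(h_A));               // device global memory
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write d_A ...
    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_A);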

Page 17

Memory Layout of a Matrix in C

[Figure: a 4×4 matrix M stored row-major in C, linearized as M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, and so on. The access direction in the kernel code runs against this layout: threads T1-T4 touch one set of elements in time period 1 and the next set in time period 2, with each thread's consecutive accesses separated by a full row in memory.]

Page 18

Bank Addressing Examples

• 2-way bank conflicts: linear addressing with stride == 2
• 8-way bank conflicts: linear addressing with stride == 8

[Figure: with 16 banks, stride-2 addressing maps threads t and t+8 to the same bank (2-way conflict, e.g. threads 0-11 folding onto banks 0-7); stride-8 addressing maps all even threads to bank 0 and all odd threads to bank 8, so threads 0-15 land 8-deep (x8) on just two banks.]
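A sketch of how such strides arise in shared memory code (an illustrative kernel, not from the slides; on G80 there are 16 banks of 32-bit words, so bank = word index mod 16; assume a single warp of 32 threads so all indices stay in range):

    __global__ void BankExamples(float* out) {
        __shared__ float shared[256];
        int tid = threadIdx.x;
        shared[tid] = (float)tid;
        __syncthreads();
        float a = shared[tid];        // stride 1: conflict-free
        float b = shared[2 * tid];    // stride 2: threads t and t+8 share a bank
        float c = shared[8 * tid];    // stride 8: a half-warp hits only 2 banks
        out[tid] = a + b + c;
    }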

Page 19

Control Flow Instructions

• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – The different execution paths are serialized in G80
• The control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of the thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0-2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path

Page 20

Vector Reduction with Branch Divergence

[Figure: a tree reduction over array elements 0-11. Iteration 1: threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms partial sums 0..3, 4..7, 8..11; iteration 3 forms 0..7 and 8..15. The active threads grow increasingly sparse within each warp, so warps stay divergent throughout.]

Page 21

No Divergence until < 16 Sub-sums

[Figure: the alternative reduction adds element i+16 to element i (0+16, 1+17, ..., 15+31), so in each iteration the active threads (starting from thread 0) stay contiguous; divergence appears only once fewer than 16 sub-sums remain.]
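A sketch of the two strategies as CUDA code (a minimal version assuming one block reducing a shared array; the name partialSum and the constant BLOCK_SIZE are illustrative, not from the slides):

    __shared__ float partialSum[BLOCK_SIZE];

    // Divergent version (Page 20): active threads are every 2nd, then
    // every 4th, ... thread, so most warps remain partially active.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (threadIdx.x % (2 * stride) == 0)
            partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
    }

    // Low-divergence version (Page 21): active threads stay contiguous
    // (0 .. stride-1), so whole warps drop out together and divergence
    // appears only in the last few iterations.
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (threadIdx.x < stride)
            partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
    }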

Page 22

Fundamentals of Parallel Computing

• Parallel computing requires that
  – The problem can be decomposed into sub-problems that can be safely solved at the same time
  – The programmer structures the code and data to solve these sub-problems concurrently
• The goals of parallel computing are
  – To solve problems in less time, and/or
  – To solve bigger problems, and/or
  – To achieve better solutions

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

Page 23

Challenges of Parallel Programming

• Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle
  – Computational thinking (J. Wing)
• Dependences need to be identified and managed
  – The order of task execution may change the answers
    • Obvious: one step feeds its result to the next steps
    • Subtle: numeric accuracy may be affected by the ordering of steps that are logically parallel with each other
• Performance can be drastically reduced by many factors
  – Overhead of parallel processing
  – Load imbalance among processor elements
  – Inefficient data-sharing patterns
  – Saturation of critical resources such as memory bandwidth

Page 24

Fermi Implements CUDA

• The definitions of memory scope, grid, thread block, and thread are the same as in Tesla
• Grid: array of thread blocks
• Thread block: up to 1536 concurrent threads, communicating through shared memory
• The GPU has an array of SMs; each executes one or more thread blocks, and each block is grouped into warps of 32 threads per warp
• Other resource constraints are implementation-specific

Page 25

Fermi – GT300 Key Features

• 32 cores per SM, 512 cores total
• Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA)
• Two warps from different thread blocks (even different kernels) can be issued and executed concurrently
• ECC protection from the registers to DRAM
• Linear addressing model with caching at all levels
• Large shared memory / L1 cache
• Double-precision performance 8x faster than GT200, reaching ~600 double-precision GFLOPS

Page 26

Fermi – GT300 Key Features (cont.)

• Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs
• The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels
• Switching from one application to another is 20x faster on Fermi
• Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
• Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
• Fermi supports FMA for both single and double precision

Page 27

Instruction Schedule Example

• A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of load/store units. The original slide's figure shows how instructions are issued to the four execution blocks.
• It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs.
• Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)

Page 28

Multi-Core Architecture: Intel Quad-Core Technology of Today

Cache Structure

[Figure: Core0 and Core1 share one 4MB L2 cache, and Core2 and Core3 share another 4MB L2 cache; both connect through the Bus Interface to a 1066 MHz / 1333 MHz FSB.]

The L2 cache of today's quad-core processors is not one cache shared by all 4 cores. Instead, there are two L2 caches, each shared by two cores.

Page 29

Programming with OpenMP*

What Is OpenMP*?

[Figure: a cloud of OpenMP constructs, e.g. #pragma omp parallel for private(A, B); #pragma omp critical; omp_set_lock(lck); call OMP_INIT_LOCK (ilok); call omp_test_lock(jlok); CALL OMP_SET_NUM_THREADS(10); setenv OMP_SCHEDULE "dynamic"; Nthrds = OMP_GET_NUM_PROCS(); C$OMP parallel do shared(a, b, c); C$OMP PARALLEL REDUCTION (+: A, B); C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C); C$OMP DO lastprivate(XX); C$OMP SINGLE PRIVATE(X); C$OMP SECTIONS; C$OMP MASTER; C$OMP ATOMIC; C$OMP FLUSH; C$OMP ORDERED; C$OMP THREADPRIVATE(/ABC/); C$OMP PARALLEL COPYIN(/blk/); !$OMP BARRIER]

http://www.openmp.org; the current spec is OpenMP 2.5 (250 pages, combined C/C++ and Fortran)
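As a small illustration of the directive style collected above (a minimal sketch; the loop and variable names are not from the slides):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int N = 1000;
        double a[1000], sum = 0.0;
        for (int i = 0; i < N; ++i) a[i] = i * 0.5;

        // Work-sharing loop with a reduction, the C analogue of the
        // C$OMP PARALLEL REDUCTION (+: ...) form shown in the figure.
        #pragma omp parallel for reduction(+: sum)
        for (int i = 0; i < N; ++i)
            sum += a[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }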

Page 30

More Material

• Intel Larrabee Architecture
• Herlihy's book (The Art of Multiprocessor Programming)
  – Chapter 1: Introduction
  – Chapter 2: Mutual Exclusion