CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2009, Jih-Kwon Peir
Computer & Information Science & Engineering, University of Florida
• Acknowledgement: Slides borrowed from
  o Accelerators for Science and Engineering Applications: GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of Illinois, 2006-2008 (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
  o Course material posted on the CUDA Zone (http://www.nvidia.com/object/cuda_education.html)
  o Intel Software Network (http://software.intel.com/en-us/academic/)
  o The Art of Multiprocessor Programming (http://software.intel.com/en-us/academic/)
  o Presentation slides from various papers
Course Goals
• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
  – principles and patterns of parallel programming
  – processor architecture features and constraints
  – programming APIs, tools, and techniques
• Learn new many-core general-purpose and GPU processor architectures
  – organization and memory systems
• Parallel programming basics: locking, synchronization, mutual exclusion, transactional memory, etc.
Course Outline
• Week 1-2: Introduction, GPU architectures, CUDA programming
• Week 3-6: CUDA threads, code blocks, grids, CUDA memory, synchronization, performance
• Week 7: Project selection and discussion
• Week 8-9: Intel many-core architectures
• Week 10-11: Parallel programming model, synchronization, mutual exclusion, conditional synchronization, locks, barriers, concurrency and correctness, sequential programs and consistency
• Week 12-13: Discussion of advanced issues in multi-core architecture and programming, including Fermi and Larrabee
• Week 14-16: In-depth discussion of project topics and project presentations
CUDA – GPU Programming
• Integrated host + device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code
[Figure: program flow — Serial Code (host) ... Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ... Serial Code (host) ... Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);]
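A minimal sketch of this structure, with an illustrative kernel body, sizes, and variable names that are assumptions rather than from the slides:

// Minimal host + device CUDA C sketch (illustrative only).
#include <cuda_runtime.h>

// Device: SPMD kernel code; every thread runs this function on its own data.
__global__ void KernelA(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= 2.0f;                     // highly parallel part
}

int main(void)
{
    const int n = 1024, nTid = 256, nBlk = n / nTid;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // ... serial code (host) ...
    KernelA<<<nBlk, nTid>>>(d_data, n);   // parallel kernel (device)
    cudaDeviceSynchronize();              // host waits for the kernel
    // ... serial code (host) resumes ...

    cudaFree(d_data);
    return 0;
}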
[Figure 3.2: An example of CUDA thread organization — the host launches Kernel 1 on the device as Grid 1 (Blocks (0,0), (1,0), (0,1), (1,1)), then Kernel 2 as Grid 2. Block (1,1) is expanded to show its 4×2×2 arrangement of threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA]
CUDA Thread Blocks and Threads
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – ...
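As a hedged illustration of how these IDs turn into memory addresses (the kernel name and image layout are assumptions, not from the slides):

// Illustrative 2D example: 2D block and thread IDs select a pixel (x, y)
// of a width x height image stored in row-major order.
__global__ void brighten(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against partial blocks
        img[y * width + x] += 0.1f;       // each thread owns one element
}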
Matrix Multiplication: A Simple Example
[Figure: matrices M, N, and P, each WIDTH × WIDTH; P = M × N]
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
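The slides give only the host version here; for comparison, a sketch of a naive device kernel (one thread per P element, reusing the Md/Nd/Pd naming of the tiled kernel later in the deck) could be:

// Naive CUDA matrix multiplication: each thread computes one Pd element
// directly from global memory (a sketch, assuming square Width x Width matrices).
__global__ void MatrixMulKernelNaive(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
}

Every iteration of the inner loop issues two global loads for one multiply-add, which is exactly the memory-bandwidth problem quantified on the "performance on G80" slide later in the deck.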
G80 Example: Thread Scheduling (cont.)
• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by an SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution by a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected
[Figure: warp scheduling timeline (TB = Thread Block, W = Warp) — TB1 W1 runs instructions 1-6 and stalls; TB2 W1 runs 1-2 and stalls; TB3 W1 runs 1-2; TB3 W2 runs 1-2 and stalls; TB2 W1 resumes with 3-4; TB1 W1 resumes with 7-8; TB1 W2 and TB1 W3 each run 1-2; TB3 W2 resumes with 3-4]
Thread Scheduling (cont.)
• Each code block is assigned to one SM; each SM can take up to 8 blocks
• Each block has up to 512 threads, divided into 32-thread warps; each warp is scheduled on 8 SPs, 4 threads per SP, and executed in SIMT mode
• The SP pipeline is ~30 stages; fetch, decode, gather, and write-back act on whole warps, so they have a throughput of 1 warp per slow clock
• Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks
• The fetch/decode/... stages have a higher throughput so they can feed both the MAD unit and the SFU/MUL unit alternately; hence the peak rate of 8 MADs + 8 MULs per (fast) clock cycle
• 6 warps (or 192 threads) per SM are needed to hide the read-after-write latencies
G80 Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
[Figure: G80 CUDA memory model — a grid contains Blocks (0,0) and (1,0); each block has its own shared memory, and each thread its own registers; all blocks share the per-grid global and constant memories, which the host can also access]
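A small sketch of how these spaces appear in CUDA C source (the variable names are illustrative):

__constant__ float coeffs[16];        // per-grid constant memory (read-only in kernels)

__global__ void scopes(float* out)    // out points into per-grid global memory
{
    int i = threadIdx.x;              // scalars live in per-thread registers
    __shared__ float tile[128];       // per-block shared memory
    float scratch[32];                // large or dynamically indexed per-thread
                                      // arrays may be placed in local memory
    scratch[i % 32] = coeffs[i % 16];
    tile[i] = scratch[i % 32];
    __syncthreads();
    out[i] = tile[i];                 // write back to global memory
}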
How about performance on G80?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 B of memory bandwidth per FLOP
  – 4 × 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  – The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
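The matching host-side launch would look roughly like this (a sketch; it assumes Width is a multiple of TILE_WIDTH and that Md, Nd, and Pd already hold device memory):

// One thread per Pd element; one thread block per TILE_WIDTH x TILE_WIDTH output tile.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);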
Today’s Intel PC Architecture: Single-Core System
• FSB connection between the processor and the Northbridge (82925X)
  – Memory Controller Hub
• The Northbridge handles the “primary” PCIe connection to video/GPU and DRAM
  – PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction)
• The Southbridge (ICH6RW) handles other peripherals
GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array is built from TPCs (Texture Processor Clusters); each TPC contains TEX units and two SMs (Streaming Multiprocessors); each SM contains instruction fetch/dispatch logic, an instruction L1 and a data L1 cache, shared memory, 8 SPs (streaming processors), and 2 SFUs (special function units)]
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution by a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
  – If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency (200 ÷ (4 instructions × 4 cycles) = 12.5, rounded up to 13)
[Figure: SM multithreaded warp scheduler — over time, it issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12, ..., warp 3 instruction 96]
CUDA Device Memory Space: Review
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant memory
  – Read-only per-grid texture memory
[Figure: (device) grid with Blocks (0,0) and (1,0); each block has shared memory, and each thread has registers and local memory; global, constant, and texture memories are shared per grid and accessible from the host]
• The host can R/W the global, constant, and texture memories using copy functions
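For example (a sketch; N and the array names are illustrative), global memory is copied with cudaMemcpy, and constant memory with cudaMemcpyToSymbol:

float h_A[N];                    // data on the host
float* d_A;
cudaMalloc((void**)&d_A, N * sizeof(float));                       // per-grid global memory
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that read/write d_A ...
cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_A);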
Memory Layout of a Matrix in C
[Figure: a 4×4 matrix M laid out linearly in row-major order — M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3. The access direction in the kernel code is shown for threads T1-T4 over time periods 1 and 2: each thread steps through a different row, so in any one time period the four threads touch elements that are far apart in the linear layout]
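A hedged sketch of the two access patterns the figure contrasts (the kernel names are illustrative; Width × Width row-major matrix as before):

// Pattern A: at each loop step, consecutive threads t read consecutive
// addresses M[i*Width + t] -- the accesses can be coalesced.
__global__ void rowStep(float* M, float* out, int Width)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    for (int i = 0; i < Width; ++i)
        sum += M[i * Width + t];      // threads walk down a column together
    out[t] = sum;
}

// Pattern B: at each loop step, consecutive threads t read addresses
// M[t*Width + i], which are Width elements apart -- not coalesced.
__global__ void colStep(float* M, float* out, int Width)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    for (int i = 0; i < Width; ++i)
        sum += M[t * Width + i];      // each thread walks its own row
    out[t] = sum;
}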
Bank Addressing Examples
• 2-way bank conflicts: linear addressing with stride == 2
• 8-way bank conflicts: linear addressing with stride == 8
[Figure: threads 0-15 mapped onto shared memory banks 0-15 — with stride 2, two threads land on each even bank; with stride 8 (x8), eight threads land on bank 0 and eight on bank 8]
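In code, these patterns correspond to strided shared-memory indexing; a fragment from inside a kernel (a sketch, assuming G80's 16 banks of 4-byte words):

__shared__ float shared[256];
int t = threadIdx.x;                 // consider a half-warp, t = 0..15

float a = shared[t];                 // stride 1: each thread hits its own bank
float b = shared[2 * t];             // stride 2: banks 0,2,...,14 each serve
                                     // two threads -> 2-way conflict
float c = shared[8 * t];             // stride 8: only banks 0 and 8 are used,
                                     // eight threads each -> 8-way conflict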
Control Flow Instructions
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – The different execution paths are serialized on G80: the control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path
Vector Reduction with Branch Divergence
[Figure: tree-based reduction over array elements 0-11 — iteration 1 forms pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 (handled by threads 0, 2, 4, 6, 8, 10); iteration 2 forms 0...3, 4...7, 8...11; iteration 3 forms 0...7 and 8...15]
No Divergence until < 16 Sub-sums
[Figure: the same reduction reorganized — in the first iteration, thread t adds element t+16 to element t (0+16, 1+17, ..., 15+31); subsequent iterations halve the stride, so the active threads stay contiguous and no warp diverges until fewer than 16 sub-sums remain]
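A hedged sketch of the two strategies these figures contrast, assuming each thread block has already loaded its elements into a shared array partialSum (the name is illustrative):

// Divergent version: active threads are interleaved (0, 2, 4, ...),
// so every warp contains both active and idle threads at each step.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (threadIdx.x % (2 * stride) == 0)
        partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
}

// Convergent version: active threads stay contiguous (0..stride-1),
// so whole warps retire early and divergence appears only once the
// stride drops below the warp size (the final sub-sums).
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    __syncthreads();
    if (threadIdx.x < stride)
        partialSum[threadIdx.x] += partialSum[threadIdx.x + stride];
}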
Fundamentals of Parallel Computing
• Parallel computing requires that
  – the problem can be decomposed into sub-problems that can be safely solved at the same time
  – the programmer structures the code and data to solve these sub-problems concurrently
• The goals of parallel computing are
  – to solve problems in less time, and/or
  – to solve bigger problems, and/or
  – to achieve better solutions
• The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency
Challenges of Parallel Programming
• Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle
  – Computational thinking (J. Wing)
• Dependences need to be identified and managed
  – The order of task execution may change the answers
    • Obvious: one step feeds its result to the next steps
    • Subtle: numeric accuracy may be affected by the ordering of steps that are logically parallel with each other
• Performance can be drastically reduced by many factors
  – Overhead of parallel processing
  – Load imbalance among processor elements
  – Inefficient data-sharing patterns
  – Saturation of critical resources such as memory bandwidth
Fermi Implements CUDA
• The definitions of memory scope, grid, thread block, and thread are the same as in Tesla
• Grid: array of thread blocks
• Thread block: up to 1,536 concurrent threads, communicating through shared memory
• The GPU has an array of SMs, each executing one or more thread blocks; each block is grouped into warps of 32 threads per warp
• Other resource constraints are implementation-specific
Fermi – GT300 Key Features
• 32 cores per SM, 512 cores in total
• Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA)
• Two warps from different thread blocks (even different kernels) can be issued and executed concurrently
• ECC protection from the registers to DRAM
• Linear addressing model with caching at all levels
• Large shared memory / L1 cache
• Double-precision performance 8x faster than GT200, reaching ~600 double-precision GFLOPS
Fermi – GT300 Key Features (cont.)
• Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs
• The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels
• Switching from one application to another is 20x faster on Fermi
• Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
• Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
• Fermi supports FMA for both single and double precision
Instruction Schedule Example
• A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of load/store units. [The accompanying figure, not reproduced here, shows how instructions are issued to the four execution blocks.]
• It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs.
• Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)
Multi-Core Architecture: Intel Quad-Core Technology of Today
Cache Structure
[Figure: quad-core cache structure — Core0 and Core1 share one 4 MB L2 cache, Core2 and Core3 share another; both connect through the bus interface to a 1066 MHz / 1333 MHz FSB]
• The L2 cache of today’s quad-core processors is not one cache shared by all 4 cores. Instead, there are two L2 caches, each shared by two cores.
Programming with OpenMP*
What Is OpenMP*?
[Figure: a collage of OpenMP constructs surrounds the answer — directives, runtime calls, and environment variables such as:]
  #pragma omp parallel for private(A, B)
  #pragma omp critical
  omp_set_lock(lck)
  call OMP_INIT_LOCK (ilok)
  call omp_test_lock(jlok)
  CALL OMP_SET_NUM_THREADS(10)
  Nthrds = OMP_GET_NUM_PROCS()
  setenv OMP_SCHEDULE "dynamic"
  C$OMP parallel do shared(a, b, c)
  C$OMP PARALLEL REDUCTION (+: A, B)
  C$OMP DO lastprivate(XX)
  C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
  C$OMP THREADPRIVATE(/ABC/)
  C$OMP PARALLEL COPYIN(/blk/)
  C$OMP ORDERED / SINGLE PRIVATE(X) / SECTIONS / MASTER / ATOMIC / FLUSH
  !$OMP BARRIER
http://www.openmp.org — the current spec is OpenMP 2.5 (250 pages, combined C/C++ and Fortran)
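A minimal sketch of the work-sharing style these constructs enable, in C (the array sizes and values are illustrative):

#include <omp.h>
#include <stdio.h>

#define N 1000000
static float a[N], b[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; ++i) { a[i] = 0.5f * i; b[i] = 2.0f; }

    /* Work-sharing loop: iterations are split among the team of threads.
       reduction(+:sum) gives each thread a private partial sum and
       combines them at the end (cf. C$OMP PARALLEL REDUCTION above). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += (double)a[i] * b[i];

    printf("dot = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}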
More Material
• Intel Larrabee Architecture
• Herlihy’s Book
  – Chapter 1: Introduction
  – Chapter 2: Mutual Exclusion