Final Exam Review


Page 1

Final Exam Review

Page 2

Announcements
•  Final examination
   Friday, March 22, 11:30a to 2:30p, PETER 104
   You may bring a single sheet of notebook-sized paper (8x10 inches) with notes
•  Office hours during examination week
   Monday and Wednesday, usual time (2p to 3p)
   Or by appointment

©2013 Scott B. Baden / CSE 160 / Winter 2013 2

Page 3

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 3

Page 4

©2013 Scott B. Baden / CSE 160 / Winter 2013 4

35 years of processor trends

Page 5

Trend: data motion costs are rising
•  Increase amount of computation performed per unit of communication
   Conserve locality, tolerate or avoid communication
•  Many threads

5 ©2013 Scott B. Baden / CSE 160 / Winter 2013

[Figure: improvement vs. year for processor, memory bandwidth, and latency, with Giga/Tera/Peta/Exa??? milestones]

Page 6

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 6

Page 7

Address Organization
•  Multiprocessors and multicomputers
•  Shared memory, message passing

©2013 Scott B. Baden / CSE 160 / Winter 2013 7

Page 8

Hybrid processing
•  Two types of processors: general purpose + accelerator
   AMD Fusion: 4 x86 cores + hundreds of Radeon GPU cores
•  Accelerator can perform certain tasks more quickly than the conventional cores
•  Accelerator amplifies relative cost of communication

©2013 Scott B. Baden / CSE 160 / Winter 2013 8 Hothardware.com

Page 9

NUMA Architectures
•  The address space is global to all processors, but memory is physically distributed
•  Point-to-point messages manage coherence
•  A directory keeps track of sharers, one for each block of memory
•  Stanford DASH; NUMA nodes of the Cray XE-6, SGI UV, Altix, Origin 2000

©2013 Scott B. Baden / CSE 160 / Winter 2013 9

en.wikipedia.org/wiki/Non-Uniform_Memory_Access

Page 10

Heterogeneous processing with Graphics Processing Units
•  Specialized many-core processor
•  Explicit data motion
   between host and device
   inside the device

[Figure: host with processors P0-P2, caches C0-C2, and memory (MEM), attached to a many-core device]

10 ©2013 Scott B. Baden / CSE 160 / Winter 2013

Page 11

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 11

Page 12

Interconnect
•  Toroidal mesh (end-around), ring
•  K-ary n-cube (Kⁿ nodes)
•  Hypercube
•  Diameter and bisection bandwidth (standard formulas are sketched in code below)

Sciencedirect.com

©2013 Scott B. Baden / CSE 160 / Winter 2013 12

[Figures: a ring on nodes 0-3 and a mapping of a ring that avoids long wires; a 3d toroidal mesh; 3d and 4d hypercubes. Credit: Natalie Enright Jerger]
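To make the last bullet concrete, here is a minimal C++ sketch of the standard diameter and bisection-width formulas for these topologies; the function names and sample sizes are illustrative, not from the slides, and bisection is counted in links.

#include <cstdio>

// d-dimensional hypercube: 2^d nodes, diameter d, bisection width 2^(d-1)
void hypercube(int d) {
    std::printf("hypercube d=%d: diameter %d, bisection %d\n", d, d, 1 << (d - 1));
}

// K-ary n-cube (torus with end-around links, K > 2): K^n nodes,
// diameter n*floor(K/2), bisection width 2*K^(n-1)
void kAryNCube(int k, int n) {
    long long bisection = 2;
    for (int i = 0; i < n - 1; ++i) bisection *= k;
    std::printf("%d-ary %d-cube: diameter %d, bisection %lld\n",
                k, n, n * (k / 2), bisection);
}

int main() {
    hypercube(4);        // the 4d hypercube in the figure
    kAryNCube(4, 1);     // a 4-node ring is a 4-ary 1-cube
    kAryNCube(8, 3);     // a 3d toroidal mesh
    return 0;
}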

Page 13

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 13

Page 14

©2013 Scott B. Baden / CSE 160 / Winter 2013

Algorithms
•  Sorting
   Odd-even sort, bucket sort, sample sort
•  Numerical integration (trapezoidal rule)
•  Mandelbrot set
•  Stencil methods
   Ghost cells, partitioning
•  Matrix multiplication: Cannon’s Algorithm

14

Page 15

©2013 Scott B. Baden / CSE 160 / Winter 2013 15

Bucket sort
•  Each process has 2 large buckets for input & output
•  P small buckets for routing
•  Worst and best cases?

B. Wilkinson
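A minimal C++ sketch of the routing step, assuming keys are uniformly distributed in [0, maxKey); the names routeKeys and sortLocal are illustrative and not from Wilkinson's figure.

#include <vector>
#include <algorithm>

// Each process routes its input keys into P small buckets, one per
// destination process (bucket i holds keys destined for process i).
std::vector<std::vector<int>> routeKeys(const std::vector<int>& myKeys,
                                        int P, int maxKey) {
    std::vector<std::vector<int>> smallBuckets(P);
    for (int key : myKeys) {
        long long dest = (long long)key * P / maxKey;   // owning process
        if (dest >= P) dest = P - 1;
        smallBuckets[dest].push_back(key);
    }
    return smallBuckets;
}

// After the small buckets are exchanged (e.g. with an all-to-all), each
// process sorts the keys it received, filling its output bucket.
void sortLocal(std::vector<int>& received) {
    std::sort(received.begin(), received.end());
}

Roughly, the best case is evenly spread keys (each process sorts about N/P of them); the worst case is all keys mapping to one bucket, so a single process does nearly all of the sorting.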

Page 16

©2013 Scott B. Baden / CSE 160 / Winter 2013

Why numerically intensive applications?
•  Highly repetitive computations are prime candidates for parallel implementation
•  Improve quality of life; economically and technologically important
   Data mining, image processing, simulations (financial modeling, weather, biomedical)
•  We can classify applications according to patterns of communication and computation that persist over time and across implementations: Phillip Colella’s 7 Dwarfs

Courtesy of Randy Bank

16

Page 17

©2013 Scott B. Baden / CSE 160 / Winter 2013

Classifying the application domains
Patterns of communication and computation that persist over time and across implementations:
•  Structured grids (Panfilov method)
•  Dense linear algebra (matrix multiply, vector-matrix multiply, Gaussian elimination)
•  N-body methods
•  Sparse linear algebra: we take advantage of knowledge about the locations of non-zeros, improving some aspect of performance
•  Unstructured grids
•  Spectral methods (FFT)
•  Monte Carlo

Courtesy of Randy Bank

17

C[i,j] += A[i,:] * B[:,j]

Page 18

Application-specific knowledge is important
•  There currently exists no tool that can convert a serial program into an efficient parallel program … for all applications … all of the time … on all hardware
•  The more we know about the application … specific problem … math/physics … initial data … context for analyzing the output … the more we can improve productivity
•  Issues
   Data motion and locality
   Load balancing
   Serial sections

©2013 Scott B. Baden / CSE 160 / Winter 2013 18

Page 19

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 19

Page 20

©2013 Scott B. Baden / CSE 160 / Winter 2013

Parallel program design and implementation

•  Code organization and re-use
•  Parallelizing serial code (code reorganization and restructuring)
•  Message passing, shared memory, hybrid
•  How to design with parallelism in mind
•  Performance

20

Page 21

SPMD Implementation techniques
•  Pthreads and OpenMP
•  MPI
•  Processor geometry and data distribution
   BLOCK, Block Cyclic
•  Owner computes rule

©2013 Scott B. Baden / CSE 160 / Winter 2013 21

Page 22

Owner computes rule

©2013 Scott B. Baden / CSE 160 / Winter 2013 22

#pragma omp parallel for schedule(static,CHUNK)
for (int j=1; j<=m+1; j++)
  for (int i=1; i<=n+1; i++)
    E[j][i] = E_prev[j][i] + alpha*(E_prev[j][i+1] + E_prev[j][i-1]
              - 4*E_prev[j][i] + E_prev[j+1][i] + E_prev[j-1][i]);

[Figure: 5-point stencil around (i, j): neighbors (i, j+1), (i, j-1), (i+1, j), (i-1, j)]

// slide shorthand: Eprev(i,j) stands for E′[i,j]; E_prev is stored with a one-cell ghost border
#define Eprev(i,j) E_prev[((j)+1)*(m+3) + ((i)+1)]

I = blockIdx.y*blockDim.y + threadIdx.y;
J = blockIdx.x*blockDim.x + threadIdx.x;
if ((I <= n) && (J <= m))
    // E[I,J] is shorthand for the analogous indexing of E
    E[I,J] = Eprev(I,J) + alpha*(Eprev(I-1,J) + Eprev(I+1,J)
                                 + Eprev(I,J-1) + Eprev(I,J+1) - 4*Eprev(I,J));

Page 23

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 23

Page 24

©2013 Scott B. Baden / CSE 160 / Winter 2013

Multithreading
•  Primitives
   Thread launching and joining
   Atomic variables, mutual exclusion, barriers
•  New storage class: shared
•  Work scheduling
   Static vs. dynamic (see the sketch below)

24
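A small OpenMP sketch of the static vs. dynamic scheduling distinction; the loop body, function names, and chunk size are illustrative only.

#include <omp.h>
#include <cmath>

void applyStatic(double* a, int n) {
    // static: iterations are divided into fixed, equal chunks up front
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) a[i] = std::sqrt(a[i]);
}

void applyDynamic(double* a, int n) {
    // dynamic: idle threads grab the next chunk of 64 iterations at run time,
    // which helps when the per-iteration cost varies (load balancing)
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++) a[i] = std::sqrt(a[i]);
}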

Page 25

©2013 Scott B. Baden / CSE 160 / Winter 2013

Multithreading in perspective
•  Benefits
   Harness parallelism to improve performance
   Ability to multitask to realize concurrency, e.g. a display
•  Pitfalls
   Program complexity
   •  Partitioning, synchronization, parallel control flow
   •  Data dependencies
   •  Shared vs. local state (globals like errno)
   •  Thread-safety
   New aspects of debugging
   •  Race conditions
   •  Deadlock

25

Page 26

©2013 Scott B. Baden / CSE 160 / Winter 2013

Message passing
•  No globally accessible storage
•  MPI
•  Blocking vs. non-blocking communication
•  Point-to-point vs. collective
•  Collectives distinguish short and long messages
   Gather/scatter, all-to-all, reduction, broadcast
   Allreduce, Gatherv, Alltoallv
•  Message filtering
•  Communicators

26

[Figure: processes P0-P15 arranged as a 4 x 4 grid; each row is assigned color 0-3 (myRank / √P), so the MPI_Comm_split below groups each row, holding pieces X0-X3 and Y0-Y3 of the distributed vectors, into its own communicator]

MPI_Comm rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myRank / √P, myRank, &rowComm);
MPI_Comm_rank(rowComm, &myRow);

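A self-contained version of the snippet above, as a hedged sketch: it assumes P is a perfect square, arranges the ranks as a √P x √P grid, and then uses the row communicator for a row-wise Allreduce; the variable names beyond those in the snippet are illustrative.

#include <mpi.h>
#include <cmath>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int myRank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    const int side = (int)std::lround(std::sqrt((double)P));  // √P

    MPI_Comm rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, myRank / side, myRank, &rowComm);
    int myRow;                               // rank within the row communicator
    MPI_Comm_rank(rowComm, &myRow);

    double x = myRank, rowSum;               // sum x over the processes in my row
    MPI_Allreduce(&x, &rowSum, 1, MPI_DOUBLE, MPI_SUM, rowComm);

    MPI_Comm_free(&rowComm);
    MPI_Finalize();
    return 0;
}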
Page 27

Scatter/Gather

[Figure: the root (P0) scatters pieces of an array to P1 … Pp-1 and gathers results back]

27 ©2013 Scott B. Baden / CSE 160 / Winter 2013

Short message algorithm
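A minimal MPI example of the pattern in the figure: the root scatters equal-size pieces of an array, each process works on its piece, and the root gathers the results. The chunk size and the doubling step are illustrative; the short- vs. long-message algorithm choice happens inside the library.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int chunk = 4;                       // elements per process
    std::vector<double> all;                   // significant only at the root
    if (rank == 0) all.assign(chunk * p, 1.0);

    std::vector<double> mine(chunk);
    MPI_Scatter(all.data(), chunk, MPI_DOUBLE,
                mine.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (double& v : mine) v *= 2.0;           // local work on the piece

    MPI_Gather(mine.data(), chunk, MPI_DOUBLE,
               all.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}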

Page 28

©2013 Scott B. Baden / CSE 160 / Winter 2013

Message passing in perspective
•  Benefits
   Processes communicate explicitly, no anonymous updates
   Message arrival provides synchronization
•  Pitfalls
   Must replace synchronization with explicit data motion (±)
   Can’t rely on the cache to move data between processes living on separate processing nodes
   Ghost cells, collectives
   Harder to incrementally parallelize code than with threads

28

Page 29

A classic message passing implementation of a stencil method
•  Decompose the domain into sub-regions, one per process
   Transmit halo regions between processes (see the sketch below)
   Compute the inner region after communication completes
•  Loop-carried dependences impose a strict ordering on communication and computation

©2013 Scott B. Baden / CSE 160 / Winter 2013 29
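A hedged sketch of the halo exchange for a 1-D (row-block) decomposition, using non-blocking point-to-point calls; the array layout (one ghost row at each end of the process's block) and the names exchangeHalo, rowLen are illustrative assumptions, not the course's code.

#include <mpi.h>
#include <vector>

// u holds this process's rows, including one ghost row at each end.
void exchangeHalo(std::vector<double>& u, int rowLen, int rank, int p,
                  MPI_Comm comm) {
    const int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    const int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Request req[4];
    // receive neighbors' edge rows into my ghost rows
    MPI_Irecv(&u[0],                     rowLen, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(&u[u.size() - rowLen],     rowLen, MPI_DOUBLE, down, 1, comm, &req[1]);
    // send my first and last interior rows to the neighbors
    MPI_Isend(&u[rowLen],                rowLen, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(&u[u.size() - 2*rowLen],   rowLen, MPI_DOUBLE, down, 0, comm, &req[3]);
    // the inner region could be computed here, before waiting, to overlap
    // communication with computation
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}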

Page 30

©2013 Scott B. Baden / CSE 160 / Winter 2013

Programming with GPUs
•  Benefits: simplified processor design enables higher performance for targeted workloads, lower power consumption per flop
•  Pitfalls
   Manage on-chip memory explicitly (shared memory), though caches on some designs are highly effective
   Branches are costly (when threads diverge within a warp)
   More complicated programming model

30

Page 31

Using Shared Memory
•  Create a 1D thread block to process a 2D data block
•  Iterate over rows in the y dimension
•  While the first and last threads read ghost cells, the others are idle

[Figure: an Nx x Ny domain tiled by dx x dy CUDA thread blocks]

Compared to 2D thread blocking, 1D thread blocks provide a 12% improvement in double precision and a 64% improvement in single precision

©2013 Scott B. Baden / CSE 160 / Winter 2013 31

Didem Unat

Page 32

32

CUDA Code for computational loop

__shared__ float block[DIM_Y + 2][DIM_X + 2];
int idx = threadIdx.x, idy = threadIdx.y;    // local indices
// global indices
int x = blockIdx.x * DIM_X + idx;
int y = blockIdx.y * DIM_Y + idy;
idy++; idx++;                                // shift past the ghost border
unsigned int index = y * N + x;              // interior points
float center = E_prev[index];
block[idy][idx] = center;
__syncthreads();

©2013 Scott B. Baden / CSE 160 / Winter 2013

Page 33

33

Ghost cell copying - thread divergence

Didem Unat

// top and bottom ghost rows
if (idy == 1 && y > 0)
    block[0][idx] = E_prev[index - N];
else if (idy == DIM_Y && y < N-1)
    block[DIM_Y+1][idx] = E_prev[index + N];
// left and right ghost columns
if (idx == 1 && x > 0)
    block[idy][0] = E_prev[index - 1];
else if (idx == DIM_X && x < N-1)
    block[idy][DIM_X+1] = E_prev[index + 1];
__syncthreads();

©2013 Scott B. Baden / CSE 160 / Winter 2013

When loading ghost cells, only some of the threads are active, some are idle

Page 34

Today’s lecture
•  Technology trends
•  Address space organization
•  Interconnect
•  Algorithms (applications)
•  Parallel program design and implementation
•  Programming models
•  Performance

©2013 Scott B. Baden / CSE 160 / Winter 2013 34

Page 35

Communication costs - stencil method
•  1-D decomposition: (16N²β/P) + 2(α + 8βN)
•  2-D decomposition: (16N²β/P) + 4(α + 8βN/√P)

35 ©2013 Scott B. Baden / CSE 160 / Winter 2013
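A small C++ sketch that simply evaluates the two expressions above so the decompositions can be compared; it assumes β is the time per byte (the reciprocal of the bandwidth measured on the next slide), and the sample N and P are illustrative.

#include <cmath>
#include <cstdio>

double cost1D(double N, double P, double alpha, double beta) {
    return 16.0 * N * N * beta / P + 2.0 * (alpha + 8.0 * beta * N);
}

double cost2D(double N, double P, double alpha, double beta) {
    return 16.0 * N * N * beta / P + 4.0 * (alpha + 8.0 * beta * N / std::sqrt(P));
}

int main() {
    const double N = 4096, P = 256;                    // illustrative sizes
    const double alpha = 3.2e-6, beta = 1.0 / 1.12e9;  // values from the Triton slide
    std::printf("1-D: %g s   2-D: %g s\n",
                cost1D(N, P, alpha, beta), cost2D(N, P, alpha, beta));
    return 0;
}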

Page 36

Messaging system design and implementation

•  Short vs. long messages, half-power point
•  Eager vs. rendezvous communication
•  Hypercube algorithms for collectives (see the recursive-doubling sketch below)
•  Message buffering, completion

©2013 Scott B. Baden / CSE 160 / Winter 2013 36

[Figure: sender and receiver network interfaces, with per-side overheads αsend and αrecv, link latency, and bandwidth β on each side]
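For the hypercube-collectives bullet, a hedged recursive-doubling sketch of an allreduce on a single double, assuming the number of ranks is a power of two; it is illustrative only and not necessarily how MPI_Allreduce is implemented.

#include <mpi.h>

double hypercubeAllreduceSum(double myValue, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    double sum = myValue;
    for (int mask = 1; mask < p; mask <<= 1) {       // lg P exchange steps
        int partner = rank ^ mask;                   // neighbor across one hypercube dimension
        double recvVal;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recvVal, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += recvVal;
    }
    return sum;                                      // every rank ends with the total
}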

Page 37

α-β model of communication time

©2013 Scott B. Baden / CSE 160 / Winter 2013 37

•  Triton.sdsc.edu
   β∞ = 1.12 GB/sec at N = 8 MB
   N½ ≈ 20 KB
   α = 3.2 μsec
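A tiny sketch of the α-β model with the numbers above, treating β∞ as the asymptotic bandwidth: the time for an n-byte message is T(n) = α + n/β∞, and under this simple model the half-power point (where achieved bandwidth is half of peak) works out to α·β∞, which need not match the measured value.

#include <cstdio>

int main() {
    const double alpha = 3.2e-6;     // startup cost, seconds
    const double bw    = 1.12e9;     // asymptotic bandwidth beta_inf, bytes/sec
    auto T = [&](double n) { return alpha + n / bw; };   // transfer time, seconds
    std::printf("T(8 MB) = %g s, achieved bandwidth = %g GB/s\n",
                T(8e6), 8e6 / T(8e6) / 1e9);
    std::printf("model half-power point alpha*bw = %g KB\n", alpha * bw / 1e3);
    return 0;
}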

Page 38

Questions
1.  Performance
2.  Cache coherence
3.  Ghost cells

©2013 Scott B. Baden / CSE 160 / Winter 2013 38

Page 39

1. Performance (Q1)
•  (Note: this is too hard; the 2nd phase should consume a fraction f on 2 processors)
•  We run on 16 processors
•  An application with 2 phases
   Phase 1: perfectly parallelizable, E16 = 100%
   Phase 2: E16 = 25% efficiency on 16 processors
   Serial section: a fraction f of the overall running time on one processor (best serial algorithm), i.e. fT1
•  Express T16 in terms of f, as a fraction of the form x/y

©2013 Scott B. Baden / CSE 160 / Winter 2013 39
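The slide itself flags the statement as too hard, so the following is only a hedged sketch under one reading: phase 2 accounts for a fraction f of the serial time T1 and runs at 25% efficiency on 16 processors, while phase 1 parallelizes perfectly.

T_{16} = \frac{(1-f)\,T_1}{16} + \frac{f\,T_1}{0.25 \times 16}
       = \frac{(1-f)\,T_1}{16} + \frac{f\,T_1}{4}
       = \frac{(1+3f)\,T_1}{16}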

Page 40

40

2. Cache coherence

const int BIG_NUMBER = …;
int sharedVar = 0;                // shared
while (sharedVar < BIG_NUMBER) {
    CRITICAL SECTION:
        sharedVar++;
}

•  We run with 2 threads
•  We run with 100 threads
•  What can prevent threads from updating the shared variable?

©2013 Scott B. Baden / CSE 160 / Winter 2013
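A hedged std::thread version of the loop above, with the critical section realized as a mutex; the value of BIG_NUMBER and the re-check of the condition inside the lock are illustrative choices, not part of the original question.

#include <mutex>
#include <thread>
#include <vector>

const int BIG_NUMBER = 1000000;   // illustrative value
int sharedVar = 0;                // shared
std::mutex m;

void worker() {
    while (true) {
        std::lock_guard<std::mutex> lock(m);   // CRITICAL SECTION
        if (sharedVar >= BIG_NUMBER) break;    // re-check under the lock
        sharedVar++;
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 100; i++) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return 0;
}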

Page 41

3. Ghost cells
•  Image smoother

   repeat until done:
       for i = 1 : N-2
           for j = 1 : N-2
               Unew[i,j] = (U[i+1,j] + U[i-1,j] + U[i,j+1] + U[i,j-1]) / 4;
       Swap Unew and U

•  What is the maximum number of ghost cells that a processor will need to transmit when P does not divide N evenly?

©2013 Scott B. Baden / CSE 160 / Winter 2013 41

[Figure: the N x N domain partitioned over processes P0-P15 in a √P x √P block decomposition; each block is roughly N/√P on a side]

Page 42

Fin

Page 43

5. Tree Summation
•  Input: an array x[], length N >> P
•  Output: the sum of the elements of x[]
•  Goal: compute the sum in lg P time
•  Serial code:

   sum = 0;
   for i = 0 to N-1
       sum += x[i]

•  Assume P is a power of 2, K = lg P
•  Starter code (a completed sketch follows the figure below):

   for m = 0 to K-1 {
   }

©2013 Scott B. Baden / CSE 160 / Winter 2013

[Figure: summation tree over the values 3 1 7 0 4 1 6 3, giving pairwise sums 4 7 5 9, then 11 14, then 25]

43
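A hedged completion of the starter code using std::thread and a C++20 barrier: each thread first sums a block of x, then the P partial sums are combined pairwise in K = lg P steps, as in the figure. The names treeSum and partial, and the choice of std::barrier, are illustrative assumptions.

#include <cstddef>
#include <vector>
#include <thread>
#include <barrier>   // C++20

double treeSum(const std::vector<double>& x, int P) {
    std::vector<double> partial(P, 0.0);
    std::barrier sync(P);
    auto worker = [&](int tid) {
        // local sum over this thread's block of x
        std::size_t lo = tid * x.size() / P, hi = (tid + 1) * x.size() / P;
        for (std::size_t i = lo; i < hi; ++i) partial[tid] += x[i];
        // K = lg P combining steps: at each step, threads whose id is a
        // multiple of 2*stride add in the partial sum stride away
        for (int stride = 1; stride < P; stride *= 2) {
            sync.arrive_and_wait();
            if (tid % (2 * stride) == 0 && tid + stride < P)
                partial[tid] += partial[tid + stride];
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < P; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return partial[0];               // thread 0 ends up with the total
}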

Page 44

Global Change, I[:,:], Inew[:,:]
Local mymin = 1 + ($TID * n/$NT), mymax = mymin + n/$NT - 1;
Local done = FALSE;
while (!done) do
    Local myChange = 0;
    BARRIER
    Only on thread 0: Change = 0;
    BARRIER
    update Inew and myChange
    CRITICAL SEC: Change += myChange
    BARRIER
    if (myChange < Tolerance) done = TRUE;
    Only on thread 0: Swap pointers: I ↔ Inew
end while

Correctness

Does this code use minimal synchronization?

44 ©2013 Scott B. Baden / CSE 160 / Winter 2013

Page 45

Visualizing the Summation

©2013 Scott B. Baden / CSE 160 / Winter 2013

[Figure: combining tree for values held by threads 0-7; level 1 forms 0+1, 2+3, 4+5, 6+7 (on threads 0, 2, 4, 6), level 2 forms 0..3 and 4..7, and the root forms 0..7]

45

Page 46

Extra

©2013 Scott B. Baden / CSE 160 / Winter 2013 46

Ian Foster

http://www.mcs.anl.gov/~itf/dbpp/text/node33.html