Programming for Performance CS433 Spring 2001 Laxmikant Kale




Page 1: Programming for Performance CS433 Spring 2001 Laxmikant Kale

Programming for Performance CS433

Spring 2001

Laxmikant Kale

Page 2: Programming for Performance CS433 Spring 2001 Laxmikant Kale

2

Causes of performance loss

• If each processor is rated at k MFLOPS, and there are p processors, why don't we see k·p MFLOPS performance?

– There are several causes.

– Each must be understood separately,

– but they interact with each other in complex ways:

• The solution to one problem may create another.

• One problem may mask another, which manifests itself under other conditions (e.g., increased p).

Page 3: Programming for Performance CS433 Spring 2001 Laxmikant Kale

3

Causes

• Sequential: cache performance

• Communication overhead

• Algorithmic overhead (“extra work”)

• Speculative work

• Load imbalance

• (Long) Critical paths

• Bottlenecks

Page 4: Programming for Performance CS433 Spring 2001 Laxmikant Kale

4

Algorithmic overhead

• Parallel algorithms may have a higher operation count

• Example: parallel prefix (also called “scan”)

– How to parallelize this?

B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];

Page 5: Programming for Performance CS433 Spring 2001 Laxmikant Kale

5

Parallel Prefix: continued

• How to do this operation in parallel?

– Seems inherently sequential

– Recursive doubling algorithm

– Operation count: N · log(P)

• A better algorithm:

– Take the blocking of data into account

– Each processor calculates its local sum, then participates in a parallel algorithm to get the sum of everything to its left, and then adds that to all of its elements (see the sketch below)

– N + log(P) + N: roughly a doubling of the operation count
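A minimal sketch of this blocked scheme, assuming MPI (the MPI calls are standard; the array names and block size are illustrative, not from the course code): each processor prefix-sums its own block, uses MPI_Exscan to obtain the total of all blocks to its left, and then adds that offset to every local element.

#include <mpi.h>

/* Blocked parallel prefix: A and B are this processor's local block of n elements. */
void parallel_prefix(const double *A, double *B, int n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Local prefix sums: ~n operations */
    B[0] = A[0];
    for (int i = 1; i < n; i++)
        B[i] = B[i-1] + A[i];

    /* Sum of all blocks strictly to my left: ~log(P) steps inside MPI_Exscan */
    double left = 0.0;
    MPI_Exscan(&B[n-1], &left, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rank == 0) left = 0.0;   /* Exscan leaves rank 0's result undefined */

    /* Add the offset to every local element: another ~n operations */
    for (int i = 0; i < n; i++)
        B[i] += left;
}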

Page 6: Programming for Performance CS433 Spring 2001 Laxmikant Kale

6

Bottleneck

• Consider the “primes” program (or the “pi” one)

– What happens when we run it on 1000 PEs?

• How to eliminate bottlenecks:

– Two structures are useful in most such cases:

• Spanning trees: organize the processors in a tree

• Hypercube-based dimensional exchange (see the sketch below)
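A minimal sketch of dimensional exchange for a global sum, assuming MPI and a power-of-two number of processors (that restriction is my simplifying assumption, not something the slide states): in each of the log(P) steps, every processor pairs with the neighbor that differs in one hypercube dimension, and the pair exchange and add their partial sums, so no single processor becomes a bottleneck.

#include <mpi.h>

/* Hypercube dimensional exchange: every rank ends up with the global sum.
   Assumes the number of ranks is a power of two. */
double hypercube_allsum(double myval, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;          /* neighbor along this dimension */
        double theirs;
        MPI_Sendrecv(&myval, 1, MPI_DOUBLE, partner, 0,
                     &theirs, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        myval += theirs;                    /* combine partial sums */
    }
    return myval;
}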

Page 7: Programming for Performance CS433 Spring 2001 Laxmikant Kale

7

Communication overhead

• Components:

– per message and per byte

– sending, receiving and network

– capacity constraints

• Grainsize analysis:

– How much computation per message

– Computation-to-communication ratio

Page 8: Programming for Performance CS433 Spring 2001 Laxmikant Kale

8

Communication overhead examples

• Usually, one must reorganize data or work to reduce communication

• Combining communication also helps

• Examples:

Page 9: Programming for Performance CS433 Spring 2001 Laxmikant Kale

9

Communication overhead

• Communication delay: the time interval between sending on one processor and receipt on another:

time = a + b·N

• Communication overhead: the time a processor is held up (both the sender and the receiver are held up); again of the form a + b·N

• Typical values: a = 10 - 100 microseconds, b = 2 - 10 nanoseconds per byte
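To make these numbers concrete (an illustrative calculation using the typical values above, not a figure from the slides): a 1000-byte message with a = 10 µs and b = 2 ns/byte costs roughly 10 + 1000 × 0.002 = 12 µs, so the per-message term dominates unless messages run to many kilobytes.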

Page 10: Programming for Performance CS433 Spring 2001 Laxmikant Kale

10

Grainsize control

• A simple definition of grainsize:

– Amount of computation per message

– Problem: this doesn't distinguish a short message from a long one

• More realistic:

– Computation-to-communication ratio

– computation time / (a + bN) for one message

Page 11: Programming for Performance CS433 Spring 2001 Laxmikant Kale

11

Example: matrix multiplication

• How to parallelize this?

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)      // C[i][j] is initialized to 0
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

Page 12: Programming for Performance CS433 Spring 2001 Laxmikant Kale

12

A simple algorithm:

• Distribute A by rows, B by columns

– So, any processor can request a row of A and get it (in two messages). Same for a column of B.

– Distribute the work of computing each element of C using some load balancing scheme

• So it works even on machines with varying processor capabilities (e.g., timeshared clusters)

– What is the computation-to-communication ratio?

• For each object: 2N ops, 2 messages with N bytes each

• 2N·t / (2a + 2N·b) = 2N × 0.01 / (2 × 10 + 2 × 0.002·N), taking t ≈ 0.01 µs per operation, a = 10 µs, and b = 0.002 µs per byte (the typical values from the earlier slide)
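As an illustrative check (my arithmetic, using the values above with N = 1000): 2 × 1000 × 0.01 / (2 × 10 + 2 × 0.002 × 1000) = 20 / 24 ≈ 0.8, i.e. each object spends more time communicating than computing, which is what the better algorithm on the next slide fixes by coarsening the grain.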

Page 13: Programming for Performance CS433 Spring 2001 Laxmikant Kale

13

A better algorithm:

• Store A as a collection of row-bunches

– each bunch stores g rows

– Same for B's columns

• Each object now computes a g × g section of C (see the sketch below)

• Computation-to-communication ratio:

– 2·g·g·N ops

– 2 messages, gN bytes each

– alpha ratio: 2·g·g·N / 2 = g·g·N ops per message; beta ratio: 2·g·g·N / (2·g·N) = g ops per byte
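A minimal sketch of what one such object computes, in plain C (the function and argument names are illustrative, not from the course code): it holds g rows of A and g columns of B and fills in the corresponding g × g block of C, doing 2·g·g·N operations for its two incoming messages.

/* One object's work: Arows is g x N (row-major), Bcols is N x g (row-major),
   and Cblock is the g x g result block.  2*g*g*N floating-point ops in total. */
void compute_block(int g, int N,
                   const double *Arows, const double *Bcols, double *Cblock)
{
    for (int i = 0; i < g; i++)
        for (int j = 0; j < g; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Arows[i * N + k] * Bcols[k * g + j];
            Cblock[i * g + j] = sum;
        }
}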

Page 14: Programming for Performance CS433 Spring 2001 Laxmikant Kale

14

Alpha vs beta

• The per-message cost is significantly larger than the per-byte cost

– by a factor of several thousand

– So, several optimizations are possible that trade off:

• accept a larger beta cost in return for a smaller alpha cost

– i.e., send fewer messages

– Applications of this idea:

• Examined in the last two lectures

Page 15: Programming for Performance CS433 Spring 2001 Laxmikant Kale

15

Programming for performance: steps

• Select/design Parallel algorithm

• Decide on Decomposition

• Select Load balancing strategy

• Plan Communication structure

• Examine synchronization needs

– global synchronizations, critical paths

Page 16: Programming for Performance CS433 Spring 2001 Laxmikant Kale

16

Design Philosophy:

• Parallel Algorithm design:

– Ensure good performance (total op count)

– Generate sufficient parallelism

– Avoid/minimize “extra work”

• Decomposition:

– Break into many small pieces:

• Smallest grain that sufficiently amortizes overhead

Page 17: Programming for Performance CS433 Spring 2001 Laxmikant Kale

17

Design principles: contd.

• Load balancing

– Select a static, dynamic, or quasi-dynamic strategy

• Measurement-based vs prediction-based load estimation

– Principle: let a processor idle, but avoid overloading one

• (think about this)

• Reduce communication overhead

– Algorithmic reorganization (change the mapping)

– Message combining

– Use efficient communication libraries

Page 18: Programming for Performance CS433 Spring 2001 Laxmikant Kale

18

Design principles: Synchronization

• Eliminate unnecessary global synchronization

– If T(i,j) is the time of the i'th phase on the j'th PE:

• With synchronization after every phase: sum_i ( max_j { T(i,j) } )

• Without it: max_j ( sum_i { T(i,j) } )

• (For example, with two PEs and per-phase times of 4,1 and then 1,4, synchronizing gives 4 + 4 = 8 while letting the phases overlap gives max(5, 5) = 5.)

• Critical paths:

– Look for long chains of dependences

• Draw timeline pictures with dependences

Page 19: Programming for Performance CS433 Spring 2001 Laxmikant Kale

19

Diagnosing performance problems

• Tools:

– Back-of-the-envelope (i.e., simple) analysis

– Post-mortem analysis, with performance logs

• Visualization of performance data

• Automatic analysis

• Phase-by-phase analysis (a program may have many phases)

– What to measure:

• load distribution, (communication) overhead, idle time

• Their averages, max/min, and variances

• Profiling: time spent in individual modules/subroutines

Page 20: Programming for Performance CS433 Spring 2001 Laxmikant Kale

20

Diagnostic techniques

• Tell-tale signs and likely diagnoses:

– max load >> average, and the number of PEs above the average is >> 1: load imbalance

– max load >> average, and the number of PEs above the average is ~ 1: possible bottleneck (if there is a dependence)

– profile shows the total time in a routine f increasing as the number of PEs increases: algorithmic overhead

– communication overhead: obvious

Page 21: Programming for Performance CS433 Spring 2001 Laxmikant Kale

21

Communication Optimization

• Example problem from an earlier lecture: molecular dynamics

– Each processor, assumed to house just one cell, needs to send 26 short messages to “neighboring” processors

– Assume the send and the receive each cost: alpha = 10 µs, beta = 2 ns per byte

– Time spent (notice: 26 sends and 26 receives):

• 26 × 2 × 10 µs = 520 µs

– If there is more than one cell on each PE, multiply this number accordingly!

– Can this be improved? How?

Page 22: Programming for Performance CS433 Spring 2001 Laxmikant Kale

22

Message combining

• If there are multiple cells per processor:

– Neighbors of a cell may be on the same neighboring processor

– Neighbors of two different cells may be on the same neighboring processor

– Combine messages going to the same processor (a sketch follows below)
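A minimal sketch of the combining idea, assuming MPI; the helper's name, the fixed-size buffers, and the way destinations are passed in are all illustrative, not from the course code. The point is simply that data for cells sharing a destination PE is appended into one buffer, so each neighboring processor receives a single message.

#include <mpi.h>
#include <string.h>

#define MAX_NBRS  26      /* at most 26 neighboring processors              */
#define MAX_BYTES 4096    /* illustrative per-destination buffer, unchecked */

/* Combine per-cell messages that share a destination PE into one send each. */
void send_combined(int n_items, const int dest_pe[],
                   const char *payload[], const int payload_len[])
{
    static char buf[MAX_NBRS][MAX_BYTES];
    int len[MAX_NBRS] = {0};
    int pe_of[MAX_NBRS];
    int n_pes = 0;

    for (int i = 0; i < n_items; i++) {
        int slot = -1;
        for (int s = 0; s < n_pes; s++)          /* find the buffer for this PE */
            if (pe_of[s] == dest_pe[i]) { slot = s; break; }
        if (slot < 0) { slot = n_pes++; pe_of[slot] = dest_pe[i]; }
        memcpy(buf[slot] + len[slot], payload[i], payload_len[i]);
        len[slot] += payload_len[i];
    }
    for (int s = 0; s < n_pes; s++)              /* one message per neighbor */
        MPI_Send(buf[s], len[s], MPI_BYTE, pe_of[s], 0, MPI_COMM_WORLD);
}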

Page 23: Programming for Performance CS433 Spring 2001 Laxmikant Kale

23

Communication Optimization I

• Take advantage of the structure of the communication, and do the communication in stages:

– If my coordinates are (x, y, z):

• Send to (x+1, y, z) anything that goes to (x+1, *, *)

• Send to (x-1, y, z) anything that goes to (x-1, *, *)

• Wait for the messages from the x neighbors, then

• Send the y neighbors a combined message (and similarly for z)

– A total of 6 messages instead of 26

– But apparently a longer critical path
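As an illustrative comparison using the same per-message cost as before (my arithmetic, not from the slide): 6 sends and 6 receives cost about 6 × 2 × 10 = 120 µs of overhead instead of 520 µs, at the price of larger combined messages and the extra staging that lengthens the critical path.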

Page 24: Programming for Performance CS433 Spring 2001 Laxmikant Kale

24

Communication Optimization II

• Send all migrating atoms to processor 0

– Let processor 0 sort them out and send one message to each processor

– Works OK if the number of processors is small

• Otherwise, processor 0 becomes a bottleneck

Page 25: Programming for Performance CS433 Spring 2001 Laxmikant Kale

25

Communication Optimization III

• Generalized problem:

– Each-to-all, individualized messages

– Apply all the previously learned techniques

Page 26: Programming for Performance CS433 Spring 2001 Laxmikant Kale

26

Intro to Load Balancing

• Example: 500 processors, 50000 units of work

• What should the objective of load balancing be?

Page 27: Programming for Performance CS433 Spring 2001 Laxmikant Kale

27

Causes of performance loss

• If each processor is rated at k MFLOPS, and there are p processors, why don't we see k·p MFLOPS performance?

– There are several causes.

– Each must be understood separately,

– but they interact with each other in complex ways:

• The solution to one problem may create another.

• One problem may mask another, which manifests itself under other conditions (e.g., increased p).

Page 28: Programming for Performance CS433 Spring 2001 Laxmikant Kale

28

Causes

• Sequential: cache performance

• Communication overhead

• Algorithmic overhead (“extra work”)

• Speculative work

• Load imbalance

• (Long) Critical paths

• Bottlenecks

Page 29: Programming for Performance CS433 Spring 2001 Laxmikant Kale

29

Algorithmic overhead

• Parallel algorithms may have a higher operation count

• Example: parallel prefix (also called “scan”)

– How to parallelize this?

B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];

Page 30: Programming for Performance CS433 Spring 2001 Laxmikant Kale

30

Parallel Prefix: continued

• How to do this operation in parallel?

– Seems inherently sequential

– Recursive doubling algorithm

– Operation count: N · log(P)

• A better algorithm:

– Take the blocking of data into account

– Each processor calculates its local sum, then participates in a parallel algorithm to get the sum of everything to its left, and then adds that to all of its elements

– N + log(P) + N: roughly a doubling of the operation count

Page 31: Programming for Performance CS433 Spring 2001 Laxmikant Kale

31

Bottleneck

• Consider the “primes” program (or the “pi” one)

– What happens when we run it on 1000 PEs?

• How to eliminate bottlenecks:

– Two structures are useful in most such cases:

• Spanning trees: organize the processors in a tree

• Hypercube-based dimensional exchange

Page 32: Programming for Performance CS433 Spring 2001 Laxmikant Kale

32

Communication overhead

• Components:

– per message and per byte

– sending, receiving and network

– capacity constraints

• Grainsize analysis:

– How much computation per message

– Computation-to-communication ratio

Page 33: Programming for Performance CS433 Spring 2001 Laxmikant Kale

33

Communication overhead examples

• Usually, one must reorganize data or work to reduce communication

• Combining communication also helps

• Examples:

Page 34: Programming for Performance CS433 Spring 2001 Laxmikant Kale

34

Communication overhead

• Communication delay: the time interval between sending on one processor and receipt on another:

time = a + b·N

• Communication overhead: the time a processor is held up (both the sender and the receiver are held up); again of the form a + b·N

• Typical values: a = 10 - 100 microseconds, b = 2 - 10 nanoseconds per byte

Page 35: Programming for Performance CS433 Spring 2001 Laxmikant Kale

35

Grainsize control

• A simple definition of grainsize:

– Amount of computation per message

– Problem: this doesn't distinguish a short message from a long one

• More realistic:

– Computation-to-communication ratio

Page 36: Programming for Performance CS433 Spring 2001 Laxmikant Kale

36

Example: matrix multiplication

• How to parallelize this?

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)      // C[i][j] is initialized to 0
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

Page 37: Programming for Performance CS433 Spring 2001 Laxmikant Kale

37

A simple algorithm:

• Distribute A by rows, B by columns

– So, any processor can request a row of A and get it (in two messages). Same for a column of B.

– Distribute the work of computing each element of C using some load balancing scheme

• So it works even on machines with varying processor capabilities (e.g., timeshared clusters)

– What is the computation-to-communication ratio?

• For each object: 2N ops, 2 messages with N bytes each

Page 38: Programming for Performance CS433 Spring 2001 Laxmikant Kale

38

A better algorithm:

• Store A as a collection of row-bunches

– each bunch stores g rows

– Same for B's columns

• Each object now computes a g × g section of C

• Computation-to-communication ratio:

– 2·g·g·N ops

– 2 messages, gN bytes each

– alpha ratio: 2·g·g·N / 2 = g·g·N ops per message; beta ratio: 2·g·g·N / (2·g·N) = g ops per byte

Page 39: Programming for Performance CS433 Spring 2001 Laxmikant Kale

39

Alpha vs beta

• The per-message cost is significantly larger than the per-byte cost

– by a factor of several thousand

– So, several optimizations are possible that trade off:

• accept a larger beta cost in return for a smaller alpha cost

– i.e., send fewer messages

– Applications of this idea:

• Message combining

• Complex communication patterns: each-to-all, ..

Page 40: Programming for Performance CS433 Spring 2001 Laxmikant Kale

40

Example:

• Each-to-all communication:

– each processor wants to send a distinct N-byte message to every other processor

– Simple implementation: alpha·P + N·beta·P per processor

• typical values?
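Plugging in the typical values from the earlier slide (an illustrative calculation; N = 1000 bytes and P = 1000 are chosen only for concreteness): 1000 × 10 µs + 1000 × 1000 × 0.002 µs = 10,000 µs + 2,000 µs = 12 ms per processor, with the per-message (alpha) term dominating; this is exactly the regime where message combining and the staged or hypercube-based patterns above pay off.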

Page 41: Programming for Performance CS433 Spring 2001 Laxmikant Kale

41

Programming for performance: steps

• Select/design Parallel algorithm

• Decide on Decomposition

• Select Load balancing strategy

• Plan Communication structure

• Examine synchronization needs

– global synchronizations, critical paths

Page 42: Programming for Performance CS433 Spring 2001 Laxmikant Kale

42

Design Philosophy:

• Parallel Algorithm design:

– Ensure good performance (total op count)

– Generate sufficient parallelism

– Avoid/minimize “extra work”

• Decomposition:

– Break into many small pieces:

• Smallest grain that sufficiently amortizes overhead

Page 43: Programming for Performance CS433 Spring 2001 Laxmikant Kale

43

Design principles: contd.

• Load balancing

– Select a static, dynamic, or quasi-dynamic strategy

• Measurement-based vs prediction-based load estimation

– Principle: let a processor idle, but avoid overloading one (think about this)

• Reduce communication overhead

– Algorithmic reorganization (change the mapping)

– Message combining

– Use efficient communication libraries

Page 44: Programming for Performance CS433 Spring 2001 Laxmikant Kale

44

Design principles: Synchronization

• Eliminate unnecessary global synchronization

– If T(i,j) is the time of the i'th phase on the j'th PE:

• With synchronization after every phase: sum_i ( max_j { T(i,j) } )

• Without it: max_j ( sum_i { T(i,j) } )

• Critical paths:

– Look for long chains of dependences

• Draw timeline pictures with dependences

Page 45: Programming for Performance CS433 Spring 2001 Laxmikant Kale

45

Diagnosing performance problems

• Tools:

– Back-of-the-envelope (i.e., simple) analysis

– Post-mortem analysis, with performance logs

• Visualization of performance data

• Automatic analysis

• Phase-by-phase analysis (a program may have many phases)

– What to measure:

• load distribution, (communication) overhead, idle time

• Their averages, max/min, and variances

• Profiling: time spent in individual modules/subroutines

Page 46: Programming for Performance CS433 Spring 2001 Laxmikant Kale

46

Diagnostic techniques

• Tell-tale signs and likely diagnoses:

– max load >> average, and the number of PEs above the average is >> 1: load imbalance

– max load >> average, and the number of PEs above the average is ~ 1: possible bottleneck (if there is a dependence)

– profile shows the total time in a routine f increasing as the number of PEs increases: algorithmic overhead

– communication overhead: obvious