Lecture 5.4: Multicore Programming Concept II


UCCD3213 MULTICORE PROGRAMMING

Multicore Programming Concept II: Performance Concepts


Introduction to Performance Concepts

Performance concepts covered:

    Simple Speedup
    Computing Speedup
    Efficiency
    Granularity
    Load Balance


Simple Speedup

Speedup measures the time required for a parallel program to execute versus the time the best serial code requires to accomplish the same task:

    Speedup = Serial Time / Parallel Time

According to Amdahl's Law, speedup is a function of the fraction of a program that is parallel and of how much that fraction is accelerated:

    Speedup = 1 / [S + (1 - S)/n + H(n)]


Computing Speedup Example

Example: Painting a Picket Fence

Painting a 300-picket fence requires:

    30 minutes of preparation (serial).
    One minute to paint a single picket.
    30 minutes to clean up (serial).


Computing Speedup

Consider how speedup is computed for different numbers of painters:

    Number of Painters   Time (minutes)          Speedup
    1                    30 + 300 + 30 = 360     1.0X
    2                    30 + 150 + 30 = 210     1.7X
    10                   30 +  30 + 30 =  90     4.0X
    100                  30 +   3 + 30 =  63     5.7X
    Infinite             30 +   0 + 30 =  60     6.0X


Parallel Efficiency

Parallel efficiency:

    Is a measure of how efficiently processor resources are used during parallel computation.
    Is equal to (Speedup / Number of Threads) * 100%.

Consider how efficiency is computed for different numbers of painters:

    Number of Painters   Time (minutes)          Speedup   Efficiency
    1                    360                     1.0X      100%
    2                    30 + 150 + 30 = 210     1.7X      85%
    10                   30 +  30 + 30 =  90     4.0X      40%
    100                  30 +   3 + 30 =  63     5.7X      5.7%
    Infinite             30 +   0 + 30 =  60     6.0X      very low
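Both tables can be reproduced mechanically. Here is a minimal C sketch (mine, not from the slides) that encodes the 30 + 300/painters + 30 minute model and prints time, speedup, and efficiency:

    #include <stdio.h>

    /* Fence-painting model from the slides: serial prep and cleanup,
     * plus 300 pickets at one minute each, split across the painters. */
    static double fence_time(double painters) {
        return 30.0 + 300.0 / painters + 30.0;
    }

    int main(void) {
        const double serial_time = fence_time(1.0);   /* 360 minutes */
        const double counts[] = {1, 2, 10, 100};      /* painter counts */
        for (int i = 0; i < 4; ++i) {
            double t = fence_time(counts[i]);
            double speedup = serial_time / t;
            double efficiency = speedup / counts[i] * 100.0;
            printf("%4.0f painters: %4.0f min, %.1fX speedup, %.1f%% efficiency\n",
                   counts[i], t, speedup, efficiency);
        }
        return 0;
    }

With infinitely many painters the middle term vanishes, leaving the 60 serial minutes and the 6.0X ceiling shown in the table.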


Granularity

Definition: an approximation of the ratio of computation to synchronization.

The two types of granularity are:

    Coarse-grained: concurrent calculations that have a large amount of computation between synchronization operations are known as coarse-grained.

    Fine-grained: cases where there is very little computation between synchronization events are known as fine-grained.

Example: Field and Farmers
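To make the contrast concrete, here is a hedged C sketch (mine, not from the slides) of fine-grained versus coarse-grained locking when threads accumulate into a shared sum; the function names are illustrative:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static double shared_sum = 0.0;

    /* Fine-grained: synchronize on every element, so there is very
     * little computation between synchronization events. */
    void add_fine(const double *a, int n) {
        for (int i = 0; i < n; ++i) {
            pthread_mutex_lock(&lock);
            shared_sum += a[i];
            pthread_mutex_unlock(&lock);
        }
    }

    /* Coarse-grained: do a large amount of computation privately,
     * then synchronize once. */
    void add_coarse(const double *a, int n) {
        double local = 0.0;
        for (int i = 0; i < n; ++i)
            local += a[i];
        pthread_mutex_lock(&lock);
        shared_sum += local;
        pthread_mutex_unlock(&lock);
    }

The coarse-grained version pays the synchronization cost once per thread instead of once per element.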


Load Balance

Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.

The most effective distribution is one in which:

    Threads perform equal amounts of work.
    Threads finish their work at close to the same time.

Threads that finish first sit idle, wasting processor resources.

Example: Cleaning Banquet Tables
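A minimal sketch (mine, not from the slides) of one common static balancing scheme: split n_items across n_threads so per-thread counts differ by at most one, with the first few threads taking one extra item. All names are illustrative:

    #include <stdio.h>

    /* Compute the half-open range [begin, end) of items for thread tid. */
    void balanced_range(int n_items, int n_threads, int tid,
                        int *begin, int *end) {
        int base = n_items / n_threads;   /* items every thread gets */
        int rem  = n_items % n_threads;   /* leftover items to spread */
        *begin = tid * base + (tid < rem ? tid : rem);
        *end   = *begin + base + (tid < rem ? 1 : 0);
    }

    int main(void) {
        int b, e;
        for (int t = 0; t < 3; ++t) {     /* e.g. 10 items, 3 threads */
            balanced_range(10, 3, t, &b, &e);
            printf("thread %d: items [%d, %d)\n", t, b, e);
        }
        return 0;
    }

This gives ranges of 4, 3, and 3 items, so no thread finishes long before the others.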


COMPUTING SPEEDUP

What speedup should I expect?

    Amdahl's Law
    Gustafson's Law
    Work and Span Laws


AMDAHL'S LAW

Amdahl started with the clear statement that program speedup is a function of the fraction of a program that is accelerated and by how much that fraction is accelerated.


AMDAHL'S LAW

So, if you could speed up half the program by 15 percent, you'd get:

    Speedup = 1 / ((1 - 0.50) + (0.50 / 1.15)) = 1 / (0.50 + 0.43) ≈ 1.08

This result is a speed increase of 8 percent, which is what you'd expect: if half of the program is improved 15 percent, then the whole program is improved by about half that amount.


AMDAHL'S LAW

In the equation Speedup = 1 / [S + (1 - S)/n + H(n)], S is the fraction of time spent executing the serial portion of the parallelized version, n is the number of processor cores, and H(n) is the overhead of parallelization.
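As a quick check of the formula, here is a small C sketch (mine, not from the slides; the function name and the zero-overhead assumption are illustrative). It applies Amdahl's Law to the picket-fence example, where S = 60/360:

    #include <stdio.h>

    /* Amdahl speedup for serial fraction S, n cores, and parallel
     * overhead H(n), passed in here as a plain number. */
    static double amdahl_speedup(double S, double n, double overhead) {
        return 1.0 / (S + (1.0 - S) / n + overhead);
    }

    int main(void) {
        /* Fence example: 60 of 360 minutes are serial, H(n) assumed 0. */
        printf("10 painters: %.1fX\n", amdahl_speedup(60.0/360.0, 10.0, 0.0));
        /* Huge n approximates the infinite-painter limit of 6.0X. */
        printf("limit:       %.1fX\n", amdahl_speedup(60.0/360.0, 1e12, 0.0));
        return 0;
    }

The 10-painter case prints 4.0X, matching the speedup table earlier.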


AMDAHL'S LAW

A parallel application cannot run faster than the sum of its sequential parts!

AMDAHL'S LAW

Assume a sequential program with serial execution time T.

[Slides 16-22: graphical derivation of Amdahl's Law; the figures are not recoverable in this transcript.]

WORK & SPAN LAW: AMDAHL'S LAW

Gene M. Amdahl: "If 50% of your application is parallel and 50% is serial, you can't get more than a factor of 2 speedup, no matter how many processors it runs on."*

*In general, if a fraction p of an application can be run in parallel and the rest must run serially, the speedup is at most 1/(1 - p).

But whose application can be decomposed into just a serial part and a parallel part? For my application, what speedup should I expect?


WORK & SPAN LAW: MEASUREMENTS MATTER

Q: What does the performance of a program on 1 and 2 cores tell you about its expected performance on 16 or 64 cores?

A: Almost nothing.

    Many parallel programs can't exploit more than a few cores.
    To predict the scalability of a program to many cores, you need to know the amount of parallelism exposed by the code.
    Parallelism is not a "gut feel" metric, but a computable and measurable quantity.


    int fib(int n) {
        if (n < 2) return n;              /* base case */
        int x = cilk_spawn fib(n - 1);    /* child strand runs in parallel */
        int y = fib(n - 2);               /* parent continues concurrently */
        cilk_sync;                        /* wait for the spawned child */
        return x + y;
    }


WORK & SPAN LAW: COMPUTATION DAG

A parallel instruction stream is a dag G = (V, E).

Each vertex v ∈ V is a strand: a sequence of instructions not containing a call, spawn, sync, or return (or thrown exception).

An edge e ∈ E is a spawn, call, return, or continue edge.

Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer, as sketched below.

[Figure: computation dag for fib, labeling the initial strand, final strand, and spawn, call, return, and continue edges.]
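A minimal sketch of that divide-and-conquer conversion (illustrative only; GRAIN, body, and loop_recur are hypothetical names, not from the slides):

    #include <cilk/cilk.h>

    void body(int i);     /* the loop body, declared elsewhere */

    #define GRAIN 512     /* hypothetical grain size */

    /* Conceptual lowering of: cilk_for (int i = lo; i < hi; ++i) body(i); */
    void loop_recur(int lo, int hi) {
        if (hi - lo <= GRAIN) {
            for (int i = lo; i < hi; ++i)
                body(i);                      /* small range: run serially */
        } else {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn loop_recur(lo, mid);   /* left half in a child strand */
            loop_recur(mid, hi);              /* right half in this strand */
            cilk_sync;                        /* join before returning */
        }
    }

Halving recursively keeps the control overhead along the span logarithmic in the number of iterations, rather than linear as spawning one iteration at a time would be.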

WORK & SPAN LAW: PERFORMANCE MEASURES

    TP = execution time on P processors
    T1 = work
    T∞ = span*

    *Also called critical-path length or computational depth.

    WORK LAW: TP ≥ T1/P
    SPAN LAW: TP ≥ T∞


WORK & SPAN LAW: SERIES COMPOSITION

A followed by B:

    Work: T1(A ∪ B) = T1(A) + T1(B)
    Span: T∞(A ∪ B) = T∞(A) + T∞(B)


WORK & SPAN LAW: PARALLEL COMPOSITION

A in parallel with B:

    Work: T1(A ∪ B) = T1(A) + T1(B)
    Span: T∞(A ∪ B) = max{T∞(A), T∞(B)}
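For concreteness, take illustrative numbers (mine, not the slides'): T1(A) = 10, T∞(A) = 4, T1(B) = 6, T∞(B) = 3.

    Series:   work = 10 + 6 = 16;  span = 4 + 3 = 7
    Parallel: work = 10 + 6 = 16;  span = max{4, 3} = 4

Composition changes the span but not the work, which is why parallel composition raises the parallelism T1/T∞ (here from 16/7 ≈ 2.3 to 16/4 = 4).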


WORK & SPAN LAW: SPEEDUP

Def. T1/TP = speedup on P processors.

If T1/TP = Θ(P), we have linear speedup;
    = P, we have perfect linear speedup;
    > P, we have superlinear speedup, which is not possible in this performance model, because of the Work Law TP ≥ T1/P.


WORK & SPAN LAW: PARALLELISM

Because the Span Law dictates that TP ≥ T∞, the maximum possible speedup given T1 and T∞ is

    T1/T∞ = parallelism = the average amount of work per step along the span.


WORK & SPAN LAW: EXAMPLE: FIB(4)

Assume for simplicity that each strand in fib(4) takes unit time to execute.

    Work: T1 = 17
    Span: T∞ = 8
    Parallelism: T1/T∞ = 17/8 = 2.125

[Figure: computation dag of fib(4), with the strands along the span numbered 1-8.]

Using many more than 2 processors can yield only marginal performance gains.


WORK & SPAN LAW: DEVELOPING SOCRATES

For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.

The developers had easy access to a similar 32-processor CM5 at MIT.

One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.

After a back-of-the-envelope calculation, the proposed improvement was rejected!


WORK & SPAN LAW: SOCRATES PARADOX

    TP ≈ T1/P + T∞

                Original program            Proposed program
    Work        T1 = 2048 seconds           T1 = 1024 seconds
    Span        T∞ = 1 second               T∞ = 8 seconds
    T32         2048/32 + 1 = 65 seconds    1024/32 + 8 = 40 seconds
    T512        2048/512 + 1 = 5 seconds    1024/512 + 8 = 10 seconds

The proposed version looked better on the 32-processor development machine (40 versus 65 seconds) but would have run slower on the 512-processor competition machine (10 versus 5 seconds).
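A quick sketch (mine, not the slides') that reproduces the back-of-the-envelope numbers using the greedy-scheduler approximation TP ≈ T1/P + T∞:

    #include <stdio.h>

    /* Greedy-scheduler running-time approximation: TP = T1/P + Tinf. */
    static double tp(double work, double span, double p) {
        return work / p + span;
    }

    int main(void) {
        printf("Original: T32 = %.0f s, T512 = %.0f s\n",
               tp(2048, 1, 32), tp(2048, 1, 512));
        printf("Proposed: T32 = %.0f s, T512 = %.0f s\n",
               tp(1024, 8, 32), tp(1024, 8, 512));
        return 0;
    }

At 32 processors the work term dominates, hiding the proposal's 8x larger span; at 512 processors the span dominates, and the proposal loses.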


WORK & SPAN LAW: MORAL OF THE STORY

Work and span predict performance better than running times alone can.


GUSTAFSON'S LAW

Amdahl's Law indicates that the speedup from parallelizing any computing problem is inherently limited by the presence of serial (non-parallelizable) portions.

Gustafson argues that, as processor power increases, the size of the problem set also tends to increase.


GUSTAFSON'S LAW

To cite one obvious example: as mainstream computational resources have increased, computer games have become far more sophisticated, both in terms of user-interface characteristics and in terms of the underlying physics and other logic.


GUSTAFSON'S LAW

Because Amdahl's Law cannot address this relationship, Gustafson modifies Amdahl's work according to the precept (based on experimental findings at Sandia) that the overall problem size should increase proportionally to the number of processor cores (N), while the size of the serial portion of the problem should remain constant as N increases.

GUSTAFSON'S LAW

[Slides 42-43 presented Gustafson's scaled-speedup formula and Table 2 (scaled speedup and per-core efficiency versus core count); neither survived extraction.]

GUSTAFSON'S LAW

Clearly, these calculations show that the performance result continues to scale upward as more processor cores are applied to the computational load.

It's also worth noting that the per-core efficiency trends downward as additional cores are added, although the data in Table 2 shows the decrease in per-core efficiency between the two-core case and the four-core case to be greater than the entire decrease between four cores and 1024 cores.
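Table 2 itself is missing from this transcript, but the trend is easy to reproduce. A minimal C sketch, assuming Gustafson's usual scaled-speedup form Speedup(N) = N + (1 - N) * s and an illustrative serial fraction s = 0.05 (both assumptions mine, not values taken from the article):

    #include <stdio.h>

    /* Gustafson's scaled speedup for N cores with serial fraction s. */
    static double scaled_speedup(double n, double s) {
        return n + (1.0 - n) * s;
    }

    int main(void) {
        const double s = 0.05;                /* illustrative serial fraction */
        const double cores[] = {2, 4, 1024};
        for (int i = 0; i < 3; ++i) {
            double sp = scaled_speedup(cores[i], s);
            printf("N = %4.0f: speedup = %7.2f, efficiency = %.3f%%\n",
                   cores[i], sp, sp / cores[i] * 100.0);
        }
        return 0;
    }

With these assumptions the efficiency drop from 2 to 4 cores (about 1.25 percentage points) indeed exceeds the entire drop from 4 to 1024 cores, matching the observation above.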


REFERENCES

Intel Software Network, "Amdahl's Law, Gustafson's Trend, and the Performance Limits of Parallel Applications":
http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications/

Work and Span Laws:
http://www.cprogramming.com/parallelism.html