Lecture 5.4: Multicore Programming Concept II


UCCD3213 MULTICORE PROGRAMMING

Multicore Programming Concept II: Performance Concepts


Introduction to Performance Concepts

Performance concepts covered:

    Simple Speedup
    Computing Speedup
    Efficiency
    Granularity
    Load Balance


Simple Speedup

Speedup measures the time required for a parallel program to execute versus the time the best serial code requires to accomplish the same task:

    Speedup = Serial Time / Parallel Time

According to Amdahl's Law, speedup is a function of the fraction of a program that is parallel and of how much that fraction is accelerated:

    Speedup = 1 / [S + (1 - S)/n + H(n)]


Computing Speedup Example

Example: Painting a Picket Fence

Painting a 300-picket fence requires:

    30 minutes of preparation (serial).
    One minute to paint a single picket.
    30 minutes to clean up (serial).


Computing Speedup

Consider how speedup is computed for different numbers of painters:

    Number of Painters   Time (minutes)          Speedup
    1                    30 + 300 + 30 = 360     1.0X
    2                    30 + 150 + 30 = 210     1.7X
    10                   30 +  30 + 30 =  90     4.0X
    100                  30 +   3 + 30 =  63     5.7X
    Infinite             30 +   0 + 30 =  60     6.0X


Parallel Efficiency

Parallel efficiency:

    Is a measure of how efficiently processor resources are used during parallel computation.
    Is equal to (Speedup / Number of Threads) * 100%.

Consider how efficiency is computed for different numbers of painters:

    Number of Painters   Time (minutes)          Speedup   Efficiency
    1                    360                     1.0X      100%
    2                    30 + 150 + 30 = 210     1.7X      85%
    10                   30 +  30 + 30 =  90     4.0X      40%
    100                  30 +   3 + 30 =  63     5.7X      5.7%
    Infinite             30 +   0 + 30 =  60     6.0X      very low
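Both tables can be reproduced mechanically. Here is a minimal C sketch (mine, not from the slides) that encodes the 30 + 300/painters + 30 minute model and prints time, speedup, and efficiency:

    #include <stdio.h>

    /* Fence-painting model from the slides: serial prep and cleanup,
     * plus 300 pickets at one minute each, split across the painters. */
    static double fence_time(double painters) {
        return 30.0 + 300.0 / painters + 30.0;
    }

    int main(void) {
        const double serial_time = fence_time(1.0);   /* 360 minutes */
        const double counts[] = {1, 2, 10, 100};      /* painter counts */
        for (int i = 0; i < 4; ++i) {
            double t = fence_time(counts[i]);
            double speedup = serial_time / t;
            double efficiency = speedup / counts[i] * 100.0;
            printf("%4.0f painters: %4.0f min, %.1fX speedup, %.1f%% efficiency\n",
                   counts[i], t, speedup, efficiency);
        }
        return 0;
    }

With infinitely many painters the middle term vanishes, leaving the 60 serial minutes and the 6.0X ceiling shown in the table.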


Granularity

Definition: an approximation of the ratio of computation to synchronization.

The two types of granularity are:

    Coarse-grained: concurrent calculations that have a large amount of computation between synchronization operations are known as coarse-grained.

    Fine-grained: cases where there is very little computation between synchronization events are known as fine-grained.

Example: Field and Farmers
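To make the contrast concrete, here is a hedged C sketch (mine, not from the slides) of fine-grained versus coarse-grained locking when threads accumulate into a shared sum; the function names are illustrative:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static double shared_sum = 0.0;

    /* Fine-grained: synchronize on every element, so there is very
     * little computation between synchronization events. */
    void add_fine(const double *a, int n) {
        for (int i = 0; i < n; ++i) {
            pthread_mutex_lock(&lock);
            shared_sum += a[i];
            pthread_mutex_unlock(&lock);
        }
    }

    /* Coarse-grained: do a large amount of computation privately,
     * then synchronize once. */
    void add_coarse(const double *a, int n) {
        double local = 0.0;
        for (int i = 0; i < n; ++i)
            local += a[i];
        pthread_mutex_lock(&lock);
        shared_sum += local;
        pthread_mutex_unlock(&lock);
    }

The coarse-grained version pays the synchronization cost once per thread instead of once per element.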


Load Balance

Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.

The most effective distribution is one in which:

    Threads perform equal amounts of work.
    Threads finish their work at close to the same time.

Threads that finish first sit idle, wasting processor resources.

Example: Cleaning Banquet Tables
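A minimal sketch (mine, not from the slides) of one common static balancing scheme: split n_items across n_threads so per-thread counts differ by at most one, with the first few threads taking one extra item. All names are illustrative:

    #include <stdio.h>

    /* Compute the half-open range [begin, end) of items for thread tid. */
    void balanced_range(int n_items, int n_threads, int tid,
                        int *begin, int *end) {
        int base = n_items / n_threads;   /* items every thread gets */
        int rem  = n_items % n_threads;   /* leftover items to spread */
        *begin = tid * base + (tid < rem ? tid : rem);
        *end   = *begin + base + (tid < rem ? 1 : 0);
    }

    int main(void) {
        int b, e;
        for (int t = 0; t < 3; ++t) {     /* e.g. 10 items, 3 threads */
            balanced_range(10, 3, t, &b, &e);
            printf("thread %d: items [%d, %d)\n", t, b, e);
        }
        return 0;
    }

This gives ranges of 4, 3, and 3 items, so no thread finishes long before the others.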


COMPUTING SPEEDUP

What speedup should I expect?

    Amdahl's Law
    Gustafson's Law
    Work and Span Laws


AMDAHL'S LAW

Amdahl started with the clear statement that program speedup is a function of the fraction of a program that is accelerated and by how much that fraction is accelerated.


AMDAHL'S LAW

So, if you could speed up half the program by 15 percent, you'd get:

    Speedup = 1 / ((1 - 0.50) + (0.50 / 1.15)) = 1 / (0.50 + 0.43) ≈ 1.08

This result is a speed increase of 8 percent, which is what you'd expect: if half of the program is improved 15 percent, then the whole program is improved by about half that amount.


AMDAHL'S LAW

In the equation Speedup = 1 / [S + (1 - S)/n + H(n)], S is the fraction of time spent executing the serial portion of the parallelized version, n is the number of processor cores, and H(n) is the overhead of parallelization.
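As a quick check of the formula, here is a small C sketch (mine, not from the slides; the function name and the zero-overhead assumption are illustrative). It applies Amdahl's Law to the picket-fence example, where S = 60/360:

    #include <stdio.h>

    /* Amdahl speedup for serial fraction S, n cores, and parallel
     * overhead H(n), passed in here as a plain number. */
    static double amdahl_speedup(double S, double n, double overhead) {
        return 1.0 / (S + (1.0 - S) / n + overhead);
    }

    int main(void) {
        /* Fence example: 60 of 360 minutes are serial, H(n) assumed 0. */
        printf("10 painters: %.1fX\n", amdahl_speedup(60.0/360.0, 10.0, 0.0));
        /* Huge n approximates the infinite-painter limit of 6.0X. */
        printf("limit:       %.1fX\n", amdahl_speedup(60.0/360.0, 1e12, 0.0));
        return 0;
    }

The 10-painter case prints 4.0X, matching the speedup table earlier.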


AMDAHL'S LAW

A parallel application cannot run faster than the sum of its sequential parts!

AMDAHL'S LAW

Assume a sequential program with serial execution time T.

[Slides 16-22: graphical derivation of Amdahl's Law; the figures are not recoverable in this transcript.]

WORK & SPAN LAW: AMDAHL'S LAW

Gene M. Amdahl: "If 50% of your application is parallel and 50% is serial, you can't get more than a factor of 2 speedup, no matter how many processors it runs on."*

*In general, if a fraction p of an application can be run in parallel and the rest must run serially, the speedup is at most 1/(1 - p).

But whose application can be decomposed into just a serial part and a parallel part? For my application, what speedup should I expect?


WORK & SPAN LAW: MEASUREMENTS MATTER

Q: What does the performance of a program on 1 and 2 cores tell you about its expected performance on 16 or 64 cores?

A: Almost nothing.

    Many parallel programs can't exploit more than a few cores.
    To predict the scalability of a program to many cores, you need to know the amount of parallelism exposed by the code.
    Parallelism is not a "gut feel" metric, but a computable and measurable quantity.


    int fib(int n) {
        if (n < 2) return n;              /* base case */
        int x = cilk_spawn fib(n - 1);    /* child strand runs in parallel */
        int y = fib(n - 2);               /* parent continues concurrently */
        cilk_sync;                        /* wait for the spawned child */
        return x + y;
    }


WORK & SPAN LAW: COMPUTATION DAG

A parallel instruction stream is a dag G = (V, E).

Each vertex v ∈ V is a strand: a sequence of instructions not containing a call, spawn, sync, or return (or thrown exception).

An edge e ∈ E is a spawn, call, return, or continue edge.

Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer, as sketched below.

[Figure: computation dag for fib, labeling the initial strand, final strand, and spawn, call, return, and continue edges.]
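A minimal sketch of that divide-and-conquer conversion (illustrative only; GRAIN, body, and loop_recur are hypothetical names, not from the slides):

    #include <cilk/cilk.h>

    void body(int i);     /* the loop body, declared elsewhere */

    #define GRAIN 512     /* hypothetical grain size */

    /* Conceptual lowering of: cilk_for (int i = lo; i < hi; ++i) body(i); */
    void loop_recur(int lo, int hi) {
        if (hi - lo <= GRAIN) {
            for (int i = lo; i < hi; ++i)
                body(i);                      /* small range: run serially */
        } else {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn loop_recur(lo, mid);   /* left half in a child strand */
            loop_recur(mid, hi);              /* right half in this strand */
            cilk_sync;                        /* join before returning */
        }
    }

Halving recursively keeps the control overhead along the span logarithmic in the number of iterations, rather than linear as spawning one iteration at a time would be.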

WORK & SPAN LAW: PERFORMANCE MEASURES

    TP = execution time on P processors
    T1 = work
    T∞ = span*

    *Also called critical-path length or computational depth.

    WORK LAW: TP ≥ T1/P
    SPAN LAW: TP ≥ T∞


WORK & SPAN LAW: SERIES COMPOSITION

A followed by B:

    Work: T1(A ∪ B) = T1(A) + T1(B)
    Span: T∞(A ∪ B) = T∞(A) + T∞(B)


WORK & SPAN LAW: PARALLEL COMPOSITION

A in parallel with B:

    Work: T1(A ∪ B) = T1(A) + T1(B)
    Span: T∞(A ∪ B) = max{T∞(A), T∞(B)}
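For concreteness, take illustrative numbers (mine, not the slides'): T1(A) = 10, T∞(A) = 4, T1(B) = 6, T∞(B) = 3.

    Series:   work = 10 + 6 = 16;  span = 4 + 3 = 7
    Parallel: work = 10 + 6 = 16;  span = max{4, 3} = 4

Composition changes the span but not the work, which is why parallel composition raises the parallelism T1/T∞ (here from 16/7 ≈ 2.3 to 16/4 = 4).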


WORK & SPAN LAW: SPEEDUP

Def. T1/TP = speedup on P processors.

If T1/TP = Θ(P), we have linear speedup;
    = P, we have perfect linear speedup;
    > P, we have superlinear speedup, which is not possible in this performance model, because of the Work Law TP ≥ T1/P.


WORK & SPAN LAW: PARALLELISM

Because the Span Law dictates that TP ≥ T∞, the maximum possible speedup given T1 and T∞ is

    T1/T∞ = parallelism = the average amount of work per step along the span.


WORK & SPAN LAW: EXAMPLE: FIB(4)

Assume for simplicity that each strand in fib(4) takes unit time to execute.

    Work: T1 = 17
    Span: T∞ = 8
    Parallelism: T1/T∞ = 17/8 = 2.125

[Figure: computation dag of fib(4), with the strands along the span numbered 1-8.]

Using many more than 2 processors can yield only marginal performance gains.


WORK & SPAN LAW: DEVELOPING SOCRATES

For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.

The developers had easy access to a similar 32-processor CM5 at MIT.

One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.

After a back-of-the-envelope calculation, the proposed improvement was rejected!


WORK & SPAN LAW: SOCRATES PARADOX

    TP ≈ T1/P + T∞

                Original program            Proposed program
    Work        T1 = 2048 seconds           T1 = 1024 seconds
    Span        T∞ = 1 second               T∞ = 8 seconds
    T32         2048/32 + 1 = 65 seconds    1024/32 + 8 = 40 seconds
    T512        2048/512 + 1 = 5 seconds    1024/512 + 8 = 10 seconds

The proposed version looked better on the 32-processor development machine (40 versus 65 seconds) but would have run slower on the 512-processor competition machine (10 versus 5 seconds).
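A quick sketch (mine, not the slides') that reproduces the back-of-the-envelope numbers using the greedy-scheduler approximation TP ≈ T1/P + T∞:

    #include <stdio.h>

    /* Greedy-scheduler running-time approximation: TP = T1/P + Tinf. */
    static double tp(double work, double span, double p) {
        return work / p + span;
    }

    int main(void) {
        printf("Original: T32 = %.0f s, T512 = %.0f s\n",
               tp(2048, 1, 32), tp(2048, 1, 512));
        printf("Proposed: T32 = %.0f s, T512 = %.0f s\n",
               tp(1024, 8, 32), tp(1024, 8, 512));
        return 0;
    }

At 32 processors the work term dominates, hiding the proposal's 8x larger span; at 512 processors the span dominates, and the proposal loses.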


WORK & SPAN LAW: MORAL OF THE STORY

Work and span predict performance better than running times alone can.


GUSTAFSON'S LAW

Amdahl's Law indicates that the speedup from parallelizing any computing problem is inherently limited by the presence of serial (non-parallelizable) portions.

Gustafson argues that, as processor power increases, the size of the problem set also tends to increase.


GUSTAFSON'S LAW

To cite one obvious example: as mainstream computational resources have increased, computer games have become far more sophisticated, both in terms of user-interface characteristics and in terms of the underlying physics and other logic.


GUSTAFSON'S LAW

Because Amdahl's Law cannot address this relationship, Gustafson modifies Amdahl's work according to the precept (based on experimental findings at Sandia) that the overall problem size should increase proportionally to the number of processor cores (N), while the size of the serial portion of the problem should remain constant as N increases.

GUSTAFSON'S LAW

[Slides 42-43 presented Gustafson's scaled-speedup formula and Table 2 (scaled speedup and per-core efficiency versus core count); neither survived extraction.]

GUSTAFSON'S LAW

Clearly, these calculations show that the performance result continues to scale upward as more processor cores are applied to the computational load.

It's also worth noting that the per-core efficiency trends downward as additional cores are added, although the data in Table 2 shows the decrease in per-core efficiency between the two-core case and the four-core case to be greater than the entire decrease between four cores and 1024 cores.
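Table 2 itself is missing from this transcript, but the trend is easy to reproduce. A minimal C sketch, assuming Gustafson's usual scaled-speedup form Speedup(N) = N + (1 - N) * s and an illustrative serial fraction s = 0.05 (both assumptions mine, not values taken from the article):

    #include <stdio.h>

    /* Gustafson's scaled speedup for N cores with serial fraction s. */
    static double scaled_speedup(double n, double s) {
        return n + (1.0 - n) * s;
    }

    int main(void) {
        const double s = 0.05;                /* illustrative serial fraction */
        const double cores[] = {2, 4, 1024};
        for (int i = 0; i < 3; ++i) {
            double sp = scaled_speedup(cores[i], s);
            printf("N = %4.0f: speedup = %7.2f, efficiency = %.3f%%\n",
                   cores[i], sp, sp / cores[i] * 100.0);
        }
        return 0;
    }

With these assumptions the efficiency drop from 2 to 4 cores (about 1.25 percentage points) indeed exceeds the entire drop from 4 to 1024 cores, matching the observation above.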


REFERENCES

Intel Software Network, "Amdahl's Law, Gustafson's Trend, and the Performance Limits of Parallel Applications":
http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications/

Work and Span Laws:
http://www.cprogramming.com/parallelism.html