Parallel Programming Patterns
Moreno Marzolla, Dip. di Informatica—Scienza e Ingegneria (DISI), Università di Bologna
http://www.moreno.marzolla.name/



  • Parallel Programming Patterns

    Moreno Marzolla
    Dip. di Informatica—Scienza e Ingegneria (DISI)
    Università di Bologna

    http://www.moreno.marzolla.name/

  • Parallel Programming Patterns 2

  • Parallel Programming Patterns 3

    What is a pattern?
    ● A design pattern is “a general solution to a recurring engineering problem”
    ● A design pattern is not a ready-made solution to a given problem...
    ● ...rather, it is a description of how a certain kind of problem can be solved

  • Parallel Programming Patterns 4

    Parallel Programming Patterns
    ● Embarrassingly Parallel
    ● Partition
    ● Master-Worker
    ● Stencil
    ● Reduce
    ● Scan

  • Parallel Programming Patterns 5

    Example
    ● Building a bridge across a river
    ● You do not “invent” a brand new type of bridge each time
      – Instead, you adapt an already existing type of bridge

  • Parallel Programming Patterns 6

    Example

  • Parallel Programming Patterns 7

    Example

  • Parallel Programming Patterns 8

    Example

  • Parallel Programming Patterns 9

    Embarrassingly Parallel
    ● Applies when the computation can be decomposed in independent tasks that require little or no communication
    ● Examples:
      – Vector sum
      – Mandelbrot set
      – 3D rendering
      – Brute force password cracking
      – ...

    [Figure: vector sum c[] = a[] + b[]; disjoint slices of the three arrays are assigned to Processor 0, Processor 1 and Processor 2]
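
    As a concrete illustration of the vector sum above, here is a minimal OpenMP sketch (the array size and the initialization are illustrative assumptions, not taken from the slides):

    #include <stdio.h>

    int main( void )
    {
        enum { N = 1000000 };
        static float a[N], b[N], c[N];
        int i;

        for (i=0; i<N; i++) { a[i] = i; b[i] = 2*i; } /* arbitrary input */

        /* Every iteration is independent: no communication is needed,
           so the loop can be split among threads in any way. */
    #pragma omp parallel for
        for (i=0; i<N; i++) {
            c[i] = a[i] + b[i];
        }

        printf("c[N-1] = %f\n", c[N-1]);
        return 0;
    }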

  • Parallel Programming Patterns 10

    Partition
    ● The input data space (in short, domain) is split in disjoint regions called partitions
    ● Each processor operates on one partition
    ● This pattern is particularly useful when the application exhibits locality of reference
      – i.e., when processors can refer to their own partition only and need little or no communication with other processors

  • Parallel Programming Patterns 11

    Example
    ● Matrix-vector product Ax = b
    ● Matrix A[][] is partitioned into P horizontal blocks
    ● Each processor
      – operates on one block of A[][] and on a full copy of x[]
      – computes a portion of the result b[]

    [Figure: A[][] split into horizontal blocks assigned to Proc 0 .. Proc 3, multiplied by x[] to produce the corresponding portions of b[]]
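
    In code, the work assigned to one processor might look like the following sketch (the block bounds, the row-major layout of A, and the function name are illustrative assumptions):

    /* Hypothetical sketch: processor p of P computes rows [start, end)
       of b = A*x, i.e., its horizontal block of A. */
    void matvec_block( const double *A, const double *x, double *b,
                       int n, int p, int P )
    {
        const int start = n * p / P;     /* first row of this block */
        const int end   = n * (p+1) / P; /* one past the last row   */
        int i, j;
        for (i=start; i<end; i++) {
            double s = 0.0;
            for (j=0; j<n; j++) {
                s += A[i*n + j] * x[j];  /* full copy of x[] is used */
            }
            b[i] = s;
        }
    }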

  • Parallel Programming Patterns 12

    Regular vs Irregular partitioning
    ● Regular
      – the domain is split into partitions of roughly the same size and shape
    ● Irregular
      – partitions do not necessarily have the same size or shape

    [Figure: a square domain split into four equal blocks P0..P3 (regular) vs an irregular mesh decomposition]
    Source: http://www.cdac.in/HTmL/events/beta-test/archives/promcore-2008/mpi-1x-promcore-2008/partial-diff-eqns-solvers-mpi.html

  • Parallel Programming Patterns 13

    Fine grained vs coarse grained partitioning
    ● Fine-grained partitioning
      – Better load balancing, especially if combined with the master-worker pattern (see later)
      – If granularity is too fine, the computation / communication ratio might become too low (communication dominates over computation)
    ● Coarse-grained partitioning
      – In general improves the computation / communication ratio
      – However, it might cause load imbalance
    ● The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use

    [Figure: timelines comparing the fraction of time spent in computation vs communication for fine- and coarse-grained partitionings]
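
    In OpenMP terms, this trade-off often reduces to choosing a chunk size; a minimal sketch, assuming a dynamically scheduled loop (the chunk value 64 and the sin() stand-in task are arbitrary examples):

    #include <math.h>
    #include <stdio.h>

    #define N 100000

    int main( void )
    {
        static double result[N];
        /* Smaller chunks = finer grain: better load balancing, more
           scheduling overhead. Larger chunks = coarser grain: less
           overhead, possible imbalance. */
    #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++) {
            result[i] = sin((double)i); /* stand-in for a real task */
        }
        printf("%f\n", result[N-1]);
        return 0;
    }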

  • Parallel Programming Patterns 14

    Example: Mandelbrot set
    ● The Mandelbrot set is the set of points c on the complex plane s.t. the sequence z_n(c) defined as

      z_n(c) = 0                      if n = 0
      z_n(c) = z_{n-1}(c)^2 + c       otherwise

      does not diverge when n → +∞

  • Parallel Programming Patterns 15

    Mandelbrot set in color
    ● If the modulus of z_n(c) does not exceed 2 after nmax iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set)
    ● Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2

  • Parallel Programming Patterns 16

    Pseudocode

    maxit = 1000
    for each point (cx, cy) {
        x = 0; y = 0; it = 0;
        while ( it < maxit AND x*x + y*y ≤ 2*2 ) {
            xnew = x*x - y*y + cx;
            ynew = 2*x*y + cy;
            x = xnew;
            y = ynew;
            it = it + 1;
        }
        plot(cx, cy, it);
    }

    Embarrassingly parallel structure: the color of each pixel can be computed independently from other pixels

    Source: http://en.wikipedia.org/wiki/Mandelbrot_set#For_programmers

  • Parallel Programming Patterns 17

    Mandelbrot set
    ● A regular partitioning can result in uneven load distribution
      – Black pixels require maxit iterations
      – Other pixels require fewer iterations

  • Parallel Programming Patterns 18

    Load balancing
    ● Ideally, each processor should perform the same amount of work
      – If the tasks synchronize at the end of the computation, the execution time will be that of the slowest task

    [Figure: timelines of Task 0 .. Task 3 with busy and idle periods; tasks that finish early sit idle until the slowest one completes]

  • Parallel Programming Patterns 19

    Load balancing howto
    ● The workload is balanced if each processor performs more or less the same amount of work
    ● How to achieve load balancing:
      – Use fine-grained partitioning
        ● ...but beware of the possible communication overhead if the tasks need to communicate
      – Use dynamic task allocation (master-worker paradigm)
        ● ...but beware that dynamic task allocation might incur higher overhead than static task allocation

  • Parallel Programming Patterns 20

    Master-worker paradigm (process farm, work pool)
    ● Apply a fine-grained partitioning
      – number of tasks >> number of cores
    ● The master assigns a task to the first available worker

    [Figure: a master process dispatching a bag of tasks of possibly different duration to Worker 0 .. Worker P-1]
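
    As a sketch of dynamic task allocation, the Mandelbrot computation can use one row per task, with the OpenMP runtime playing the master (image size, coordinate mapping and chunk size are illustrative assumptions, not from the slides):

    #include <stdio.h>

    enum { WIDTH = 800, HEIGHT = 600, MAXIT = 1000 };

    /* Iteration count for one pixel (same recurrence as the pseudocode). */
    static int iterations( double cx, double cy )
    {
        double x = 0.0, y = 0.0;
        int it = 0;
        while (it < MAXIT && x*x + y*y <= 2.0*2.0) {
            double xnew = x*x - y*y + cx;
            double ynew = 2.0*x*y + cy;
            x = xnew; y = ynew;
            it++;
        }
        return it;
    }

    int main( void )
    {
        static int img[HEIGHT][WIDTH];
        /* schedule(dynamic, 1): each row is a task, handed out to the
           first idle thread -- the runtime acts as the master. */
    #pragma omp parallel for schedule(dynamic, 1)
        for (int row = 0; row < HEIGHT; row++) {
            for (int col = 0; col < WIDTH; col++) {
                double cx = -2.5 + 3.5 * col / WIDTH;  /* map to [-2.5, 1]   */
                double cy = -1.5 + 3.0 * row / HEIGHT; /* map to [-1.5, 1.5] */
                img[row][col] = iterations(cx, cy);
            }
        }
        printf("img[0][0] = %d\n", img[0][0]);
        return 0;
    }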

  • Parallel Programming Patterns 21

    Choosing the partition size

    Too small = higher scheduling overhead
    Too large = unbalanced workload

  • Parallel Programming Patterns 22

    Stencils
    ● Stencil computations involve a grid whose values are updated according to a fixed pattern called stencil
      – Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5 × 5 neighborhood

    [Figure: a 5 × 5 Gaussian convolution kernel of integer weights applied to a pixel neighborhood]

  • Parallel Programming Patterns 23

    2D Stencils

    [Figure: three example stencils]
    ● 5-point 2-axis 2D stencil (von Neumann neighborhood)
    ● 9-point 2-axis 2D stencil
    ● 9-point 1-plane 2D stencil (Moore neighborhood)

  • Parallel Programming Patterns 24

    2D Stencils
    ● 2D stencil computations usually employ two grids to keep the current and next values
      – Values are read from the current grid
      – New values are written to the next grid
      – The current and next grids are exchanged at the end of each phase

  • Parallel Programming Patterns 25

    Ghost Cells
    ● How do we handle cells on the border of the domain?
      – We might assume that cells outside the border have some fixed, application-dependent value, or
      – We may assume periodic boundary conditions, where sides are “glued” together to form a torus
    ● In either case, we extend the domain with ghost cells, so that cells on the border do not require any special treatment

    [Figure: a rectangular domain surrounded by a frame of ghost cells; a plane rolled into a pipe and then a torus]
    https://blender.stackexchange.com/questions/39735/how-could-i-animate-a-plane-into-a-pipe-and-then-a-pipe-into-a-torus
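
    A minimal sketch of the idea in one dimension, assuming periodic boundary conditions and one ghost cell per side (the index convention is an illustrative choice):

    /* grid[1..n] holds the domain; grid[0] and grid[n+1] are ghost cells.
       With periodic boundaries the domain wraps around like a ring. */
    void fill_ghost_cells( double *grid, int n )
    {
        grid[0]   = grid[n]; /* left ghost  = rightmost true cell */
        grid[n+1] = grid[1]; /* right ghost = leftmost true cell  */
    }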

  • Parallel Programming Patterns 26

    Parallelizing stencil computations
    ● Computing the next grid from the current one has embarrassingly parallel structure

    Initialize current grid
    while (!terminated) {
        Fill ghost cells
        Compute next grid              ← embarrassingly parallel
        Exchange current and next grids
    }
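
    Putting the pieces together, a hedged 1D sketch of the loop above (the three-point averaging stencil, grid size and step count are illustrative assumptions):

    #include <stdio.h>

    enum { N = 10, NSTEPS = 100 };

    static void fill_ghost_cells( double *grid, int n )
    {
        grid[0]   = grid[n];   /* periodic boundaries, as above */
        grid[n+1] = grid[1];
    }

    int main( void )
    {
        static double cur[N+2], next[N+2];
        double *c = cur, *nx = next;

        for (int i = 1; i <= N; i++) c[i] = i; /* arbitrary initial data */

        for (int step = 0; step < NSTEPS; step++) {
            fill_ghost_cells(c, N);
            /* Embarrassingly parallel: each nx[i] depends only on the
               (read-only) current grid. */
    #pragma omp parallel for
            for (int i = 1; i <= N; i++) {
                nx[i] = (c[i-1] + c[i] + c[i+1]) / 3.0;
            }
            /* Exchange current and next grids by swapping pointers. */
            double *tmp = c; c = nx; nx = tmp;
        }
        printf("c[1] = %f\n", c[1]);
        return 0;
    }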

  • Parallel Programming Patterns 27

    Reduce
    ● A reduction is the application of an associative binary operator (e.g., sum, product, min, max...) to the elements of an array [x0, x1, … xn-1]
      – sum-reduce( [x0, x1, … xn-1] ) = x0 + x1 + … + xn-1
      – min-reduce( [x0, x1, … xn-1] ) = min { x0, x1, … xn-1 }
      – …
    ● A reduction can be realized in O(log2 n) parallel steps

  • Parallel Programming Patterns 28–32

    Example: sum-reduce

    [Figure: animation of a tree-based sum-reduction; at each step, pairs of partial sums are combined, halving their number until only the final total (34) remains]

  • Parallel Programming Patterns 33

    Example: sum-reduce

    int d, i;
    /* compute largest power of two < n */
    for (d=1; 2*d < n; d *= 2)
        ;
    /* do reduction */
    for ( ; d > 0; d /= 2 ) {
        for (i=0; i < d; i++) {
            if (i + d < n)
                x[i] += x[i + d];
        }
    }
    /* the result is in x[0]; the original content of x[] is destroyed */
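
    The same reduction as a self-contained, runnable sketch (the test array is the one used for the scan example later in the deck; whether the inner loop actually runs in parallel is left to the execution model):

    #include <stdio.h>

    /* Tree-based sum-reduction: destroys x[], returns the total. */
    double sum_reduce( double *x, int n )
    {
        int d, i;
        for (d = 1; 2*d < n; d *= 2)   /* largest power of two < n */
            ;
        for ( ; d > 0; d /= 2) {
            /* The iterations of this loop are independent and could
               be executed by different processors. */
            for (i = 0; i < d; i++) {
                if (i + d < n)
                    x[i] += x[i + d];
            }
        }
        return x[0];
    }

    int main( void )
    {
        double x[] = {1, -3, 12, 6, 2, -3, 7, -10};
        printf("sum = %f\n", sum_reduce(x, 8)); /* prints 12 */
        return 0;
    }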

  • Parallel Programming Patterns 34

    Scan (Prefix Sum)
    ● A scan computes all prefixes of an array [x0, x1, … xn-1] using a given associative binary operator op (e.g., sum, product, min, max...)

      [y0, y1, … yn-1] = inclusive-scan( op, [x0, x1, … xn-1] )

      where

      y0   = x0
      y1   = x0 op x1
      y2   = x0 op x1 op x2
      …
      yn-1 = x0 op x1 op … op xn-1

  • Parallel Programming Patterns 35

    Scan (Prefix Sum)
    ● A scan computes all prefixes of an array [x0, x1, … xn-1] using a given associative binary operator op (e.g., sum, product, min, max...)

      [y0, y1, … yn-1] = exclusive-scan( op, [x0, x1, … xn-1] )

      where

      y0   = 0   ← the neutral element of the binary operator (zero for sum, 1 for product, ...)
      y1   = x0
      y2   = x0 op x1
      …
      yn-1 = x0 op x1 op … op xn-2

  • Parallel Programming Patterns 36

    Example

    x[] =                    1  -3  12   6   2  -3   7 -10
    inclusive-scan(+, x) =   1  -2  10  16  18  15  22  12
    exclusive-scan(+, x) =   0   1  -2  10  16  18  15  22

  • Parallel Programming Patterns 39

    Serial implementation

    void inclusive_scan(int *x, int *s, int n) // n must be > 0
    {
        int i;
        s[0] = x[0];
        for (i=1; i<n; i++)
            s[i] = s[i-1] + x[i];
    }

    void exclusive_scan(int *x, int *s, int n) // n must be > 0
    {
        int i;
        s[0] = 0;
        for (i=1; i<n; i++)
            s[i] = s[i-1] + x[i-1];
    }

  • Parallel Programming Patterns 40

    Exclusive scan: Up-sweep

    x[0] x[1]     x[2] x[3]     x[4] x[5]     x[6] x[7]
    x[0] ∑x[0..1] x[2] ∑x[2..3] x[4] ∑x[4..5] x[6] ∑x[6..7]
    x[0] ∑x[0..1] x[2] ∑x[0..3] x[4] ∑x[4..5] x[6] ∑x[4..7]
    x[0] ∑x[0..1] x[2] ∑x[0..3] x[4] ∑x[4..5] x[6] ∑x[0..7]

    /* n assumed to be a power of two; the iterations of the inner
       loop are independent */
    for ( d=1; d<n; d *= 2 ) {
        for ( k=0; k<n; k += 2*d ) {
            x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
        }
    }

  • Parallel Programming Patterns 41

    Exclusive scan: Down-sweep

    x[0] ∑x[0..1] x[2] ∑x[0..3] x[4] ∑x[4..5] x[6] ∑x[0..7]
    x[0] ∑x[0..1] x[2] ∑x[0..3] x[4] ∑x[4..5] x[6] 0          ← zero the last element
    x[0] ∑x[0..1] x[2] 0        x[4] ∑x[4..5] x[6] ∑x[0..3]
    x[0] 0        x[2] ∑x[0..1] x[4] ∑x[0..3] x[6] ∑x[0..5]
    0    x[0] ∑x[0..1] ∑x[0..2] ∑x[0..3] ∑x[0..4] ∑x[0..5] ∑x[0..6]

    x[n-1] = 0;
    for ( d=n/2; d > 0; d >>= 1 ) {
        for ( k=0; k<n; k += 2*d ) {
            int t = x[k+d-1];
            x[k+d-1] = x[k+2*d-1];
            x[k+2*d-1] = t + x[k+2*d-1];
        }
    }
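
    For reference, a runnable sketch combining the two sweeps (assuming, as in the diagrams, that n is a power of two; the test array matches the earlier scan example, so the expected output is the exclusive sum-scan shown there):

    #include <stdio.h>

    /* In-place exclusive sum-scan (up-sweep + down-sweep); n must be
       a power of two. The iterations of each inner loop are independent,
       so every level of both sweeps can run in parallel. */
    void exclusive_scan_pow2( int *x, int n )
    {
        int d, k;
        for (d = 1; d < n; d *= 2) {          /* up-sweep */
            for (k = 0; k < n; k += 2*d)
                x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
        }
        x[n-1] = 0;
        for (d = n/2; d > 0; d /= 2) {        /* down-sweep */
            for (k = 0; k < n; k += 2*d) {
                int t = x[k+d-1];
                x[k+d-1] = x[k+2*d-1];
                x[k+2*d-1] = t + x[k+2*d-1];
            }
        }
    }

    int main( void )
    {
        int x[] = {1, -3, 12, 6, 2, -3, 7, -10};
        exclusive_scan_pow2(x, 8);
        for (int i = 0; i < 8; i++)
            printf("%d ", x[i]); /* expected: 0 1 -2 10 16 18 15 22 */
        printf("\n");
        return 0;
    }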

  • Parallel Programming Patterns 42

    Example: Line of Sight
    ● n peaks of heights h[0], … h[n - 1]; the distance between consecutive peaks is one
    ● Which peaks are visible from peak 0?

    [Figure: skyline of peaks h[0] … h[7]; from peak 0 some peaks are visible, others are hidden behind higher ones]

  • Parallel Programming Patterns 43–51

    Line of sight

    [Figure: animation sweeping from peak 0 across peaks h[1] … h[7], testing the visibility of each peak in turn]

  • Parallel Programming Patterns 52

    Serial algorithm
    ● For each i = 0, … n – 1
      – Let a[i] be the slope of the line connecting peak 0 to peak i
      – a[0] ← -∞
      – a[i] ← arctan( ( h[i] – h[0] ) / i ), if i > 0
    ● For each i = 0, … n – 1
      – amax[0] ← -∞
      – amax[i] ← max {a[0], a[1], … a[i – 1]}, if i > 0
    ● For each i = 0, … n – 1
      – If a[i] ≥ amax[i] then peak i is visible
      – otherwise peak i is not visible

  • Parallel Programming Patterns 53

    Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do
            a[i] ← arctan( ( h[i] – h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

  • Parallel Programming Patterns 54

    Serial algorithm
    ● The first loop (computing the slopes a[i]) and the last loop (computing the visibility flags v[i]) are embarrassingly parallel
    ● The middle loop computes amax[] as a running maximum of a[], i.e., an exclusive max-scan

  • Parallel Programming Patterns 55

    Parallel algorithm

    bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do in parallel
            a[i] ← arctan( ( h[i] – h[0] ) / i )
        endfor
        amax ← exclusive-scan( “max”, a )
        for i ← 0 to n-1 do in parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v
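
    A hedged serial C rendering of the same structure (exclusive_max_scan is a hypothetical helper, written serially here with -INFINITY as the neutral element of max; in a real parallel version it would be a parallel max-scan, and the two marked loops would run in parallel):

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helper: serial exclusive max-scan of a[] into amax[]. */
    static void exclusive_max_scan( const double *a, double *amax, int n )
    {
        double m = -INFINITY; /* neutral element of max */
        for (int i = 0; i < n; i++) {
            amax[i] = m;
            if (a[i] > m) m = a[i];
        }
    }

    void line_of_sight( const double *h, bool *v, int n )
    {
        double a[n], amax[n]; /* VLAs: fine for a sketch */
        a[0] = -INFINITY;
        for (int i = 1; i < n; i++)       /* embarrassingly parallel */
            a[i] = atan( (h[i] - h[0]) / i );
        exclusive_max_scan(a, amax, n);   /* parallelizable as a scan */
        for (int i = 0; i < n; i++)       /* embarrassingly parallel */
            v[i] = (a[i] >= amax[i]);
    }

    int main( void )
    {
        double h[] = {5, 3, 7, 2, 9, 4, 8, 1}; /* arbitrary example */
        bool v[8];
        line_of_sight(h, v, 8);
        for (int i = 0; i < 8; i++)
            printf("peak %d: %s\n", i, v[i] ? "visible" : "hidden");
        return 0;
    }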

  • Parallel Programming Patterns 56

    Conclusions
    ● A parallel programming pattern defines:
      – a partitioning of the input data
      – a communication structure among parallel tasks
    ● Parallel programming patterns can help to define efficient algorithms
      – Many problems can be solved by applying one or more known patterns