CSE 160/Berman
Programming Paradigms and Algorithms
W+A 3.1, 3.2, p. 178, 6.3.2, 10.4.1
H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman, "Heuristics for Scheduling Parameter Sweep Applications in Grid Environments",
Proceedings of the 2000 Heterogeneous Computing Workshop
(http://apples.ucsd.edu)
Parallel programs
• A parallel program is a collection of tasks which can communicate and cooperate to solve large problems.
• Over the last 2 decades, some basic program structures have proven successful on a variety of parallel architectures
• The next few lectures will focus on parallel program structures and programming issues.
Common Parallel Programming Paradigms
• Embarrassingly parallel programs
• Workqueue
• Master/Slave programs
• Monte Carlo methods
• Regular, Iterative (Stencil) Computations
• Pipelined Computations
• Synchronous Computations
Embarrassingly Parallel Computations
• An embarrassingly parallel computation is one that can be divided into completely independent parts that can be executed simultaneously.
  – (Nearly) embarrassingly parallel computations are those that require results to be distributed, collected, and/or combined in some minimal way.
  – In practice, nearly embarrassingly parallel and embarrassingly parallel computations are both called embarrassingly parallel.
• Embarrassingly parallel computations have potential to achieve maximal speedup on parallel platforms
Example: the Mandelbrot Computation
• Mandelbrot is an image computation and display application.
• Pixels of an image (the “mandelbrot set”) are stored in a 2D array.
• Each pixel is computed by iterating the complex function
  $z_{k+1} = z_k^2 + c$
  where c is the complex number (a + bi) giving the position of the pixel in the complex plane.
Mandelbrot
• Computation of a single pixel:
  – The subscript k denotes the kth iteration
  – The initial value of z is 0; the value of c is a free parameter
  – Iterations are continued until the magnitude of z is greater than 2 (which indicates that z will eventually become infinite) or the number of iterations reaches a given threshold
• Writing $z_k = a_k + b_k i$ and $c = c_{real} + c_{imag} i$, one step of the iteration $z_{k+1} = z_k^2 + c$ expands to
  $z_{k+1} = (a_k^2 - b_k^2 + c_{real}) + (2 a_k b_k + c_{imag})\, i$
• The magnitude of z is given by
  $length = \sqrt{a^2 + b^2}$
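Below is a minimal sketch in Python of the per-pixel computation just described. It is illustrative only (the function name, iteration cap, and the use of the iteration count as the pixel color are assumptions, not course code); note that Python's abs() on a complex number computes exactly the $\sqrt{a^2 + b^2}$ magnitude above.

def mandelbrot_pixel(c, max_iter=256):
    """Iterate z = z**2 + c from z = 0 and return the iteration count.

    The count determines the pixel color; reaching max_iter means the
    point is treated as a member of the Mandelbrot set (colored black).
    """
    z = 0 + 0j
    for k in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:  # |z| > 2 guarantees eventual divergence
            return k
    return max_iter

# Example: one pixel at position c = -0.5 + 0.5i in the complex plane
print(mandelbrot_pixel(complex(-0.5, 0.5)))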
Sample Mandelbrot Visualization
• Black points do not go to infinity
• Colors represent "lemniscates", which are basically sets of points that converge at the same rate
• http://library.thinkquest.org/3288/myomand.html lets you color your own Mandelbrot set
Mandelbrot Programming Issues
• Mandelbrot can be structured as a data parallel computation: the same computation is performed on all pixels, but with different complex numbers c
  – The differences in input parameters result in different numbers of iterations (and hence execution times) for different pixels
  – Mandelbrot is embarrassingly parallel: the computation of any two pixels is completely independent
• The computation is generally visualized in terms of a display, where pixel color corresponds to the number of iterations required to compute the pixel
  – The coordinate system of the Mandelbrot set is scaled to match the coordinate system of the display area
Static Mapping to Achieve Performance
• Pixels are generally organized into blocks, and the blocks are computed on processors
• The mapping of blocks to processors can greatly affect application performance
• Want to load-balance the work of computing the values of the pixels across all processors
• A good load-balancing strategy for Mandelbrot is to randomize the distribution of pixels (a code sketch follows)
  – Block decomposition can unbalance load by clustering long-running pixel computations
  – Randomized decomposition can balance load by distributing long-running pixel computations
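A small sketch contrasting the two static mappings, assuming for illustration that pixels are distributed by row; the function names, the row-based split, and the fixed seed are my assumptions:

import random

def block_mapping(num_rows, num_procs):
    """Assign contiguous blocks of rows to each processor.

    Can unbalance load when long-running rows cluster in one block.
    Assumes num_procs divides num_rows evenly, for brevity.
    """
    per_proc = num_rows // num_procs
    return {p: list(range(p * per_proc, (p + 1) * per_proc))
            for p in range(num_procs)}

def randomized_mapping(num_rows, num_procs, seed=0):
    """Shuffle the rows before dealing them out, spreading hot spots around."""
    rows = list(range(num_rows))
    random.Random(seed).shuffle(rows)
    return {p: rows[p::num_procs] for p in range(num_procs)}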
Dynamic Mapping: Using Workqueue to Achieve Performance
• Approach (a code sketch follows the figure):
  – Initially assign some blocks to processors
  – When processors complete their assigned blocks, they join a queue to wait for the assignment of more blocks
  – When all blocks have been assigned, the application concludes
[Figure: a queue of blocks; processors obtain block(s) from the front of the queue, perform the work, and return to get more block(s)]
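A minimal workqueue sketch. Python's multiprocessing is my choice of vehicle here, not something the slides prescribe, and the "block computation" is a stand-in:

import multiprocessing as mp

def worker(queue, results):
    """Take blocks off the shared queue until a sentinel arrives."""
    while True:
        block = queue.get()
        if block is None:  # sentinel: no more blocks
            break
        results.put((block, sum(range(block * 1000))))  # stand-in work

if __name__ == "__main__":
    queue, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(queue, results))
               for _ in range(4)]
    for w in workers:
        w.start()
    for block in range(32):  # enqueue all blocks up front
        queue.put(block)
    for _ in workers:        # one sentinel per worker
        queue.put(None)
    answers = [results.get() for _ in range(32)]  # drain before joining
    for w in workers:
        w.join()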
Workqueue Programming Issues
• How much work should be assigned initially to processors?
• How many blocks should be assigned to a given processor?
  – Should this always be the same for each processor? For all processors?
• Should the blocks be ordered in the workqueue in some way?
• Performance of the workqueue is optimized if
  – the computation performed by each processor amortizes the cost of obtaining the blocks
Master/Slave Computations
• Workqueue can be implemented as a master/slave computation
  – The master directs the allocation of work to slaves
  – The slaves perform the work
• Typical M/S interaction (a runnable sketch follows)
  – Slave:
      While there is more work to be done
          Request work from master
          Perform work
          (Provide results to master)
  – Master:
      While there is more work to be done
          (Receive results and process)
          Provide work to the requesting slave
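A runnable sketch of the request/reply interaction above, using one reply queue per slave so the master can answer the specific requester; the queue layout and the stand-in task are my assumptions:

import multiprocessing as mp

def slave(rank, request_q, reply_qs, result_q):
    """Slave loop: request work, perform it, send back the result."""
    while True:
        request_q.put(rank)                # ask the master for work
        task = reply_qs[rank].get()
        if task is None:                   # master signals: no work left
            break
        result_q.put((task, task * task))  # stand-in "perform work"

if __name__ == "__main__":
    NUM_SLAVES, TASKS = 3, list(range(10))
    request_q, result_q = mp.Queue(), mp.Queue()
    reply_qs = [mp.Queue() for _ in range(NUM_SLAVES)]
    slaves = [mp.Process(target=slave,
                         args=(r, request_q, reply_qs, result_q))
              for r in range(NUM_SLAVES)]
    for s in slaves:
        s.start()
    # Master loop: answer each request with the next task, then stop slaves.
    handed_out, stopped = 0, 0
    while stopped < NUM_SLAVES:
        rank = request_q.get()
        if handed_out < len(TASKS):
            reply_qs[rank].put(TASKS[handed_out])
            handed_out += 1
        else:
            reply_qs[rank].put(None)
            stopped += 1
    results = [result_q.get() for _ in TASKS]  # drain before joining
    for s in slaves:
        s.join()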
Flavors of M/S and Programming Issues
• “Flavors” of M/S– In some variations of M/S, master can also be a slave
– Typically slaves do not communicate
– Slave may return “results” to master or may just request more work
• Programming Issues
  – M/S is most efficient if the granularity of the tasks assigned to slaves amortizes the communication between master and slaves
  – The speed of a slave or the execution time of a task may warrant non-uniform assignment of tasks to slaves
  – The procedure for determining the task assignment should itself be efficient
More Programming Issues
• Master/Slave and Workqueue may also be used with a "work-stealing" approach, where slaves/processes communicate with one another to redistribute the work during execution
  – Processors A and B perform computation
  – If B finishes before A, B can ask A for work
Monte Carlo Methods
• Monte Carlo methods are based on the use of random selections in calculations that lead to the solution of numerical and physical problems
  – The term refers to the similarity of statistical simulation to games of chance
• A Monte Carlo simulation consists of multiple calculations, each of which utilizes a randomized parameter
Monte Carlo Example: Calculation of π
• Consider a circle of unit radius inside a square box of side 2
• The ratio of the area of the circle to the area of the square is
  $\frac{\pi \cdot 1^2}{2 \times 2} = \frac{\pi}{4}$
Monte Carlo Calculation of π
• Monte Carlo method for approximating π (a sketch follows):
  – Randomly choose a sufficient number of points in the square
  – For each point p, determine if p is in the circle
  – The ratio of points in the circle to points in the square will provide an approximation of π/4
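A serial sketch of the method, with all names mine (a parallel master/slave version appears on the next slide):

import random

def estimate_pi(num_points=1_000_000, seed=0):
    """Sample points in the 2-by-2 square; count hits inside the unit circle."""
    rng = random.Random(seed)
    in_circle = 0
    for _ in range(num_points):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:  # point lies inside the circle
            in_circle += 1
    return 4 * in_circle / num_points  # in_circle/num_points approximates pi/4

print(estimate_pi())  # prints roughly 3.14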
M/S Implementation of Monte Carlo Approximation of π
• Master code
  – While there are more points to calculate
      (Receive value from slave; update circlesum or boxsum)
      Generate a (pseudo-)random value p = (x, y) in the bounding box
      Send p to slave
• Slave code
  – While there are more points to calculate
      Receive p from master
      Determine if p is in the circle or the square
          [check to see if $x^2 + y^2 \le 1$]
      Send p's status to master; ask for more work
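A batched variant of this master/slave scheme. My simplification: instead of shipping one point per message as in the pseudocode above, each slave generates its own chunk of random points, which amortizes the master/slave communication (the chunk count, chunk size, and pool size are arbitrary):

import multiprocessing as mp
import random

def count_in_circle(args):
    """Slave task: sample n random points, return how many land in the circle."""
    seed, n = args
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            hits += 1
    return hits

if __name__ == "__main__":
    chunks = [(seed, 100_000) for seed in range(16)]  # 16 tasks of 100k points
    with mp.Pool(4) as pool:                          # 4 slave processes
        hits = sum(pool.map(count_in_circle, chunks))
    total = sum(n for _, n in chunks)
    print(4 * hits / total)                           # approximation of pi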
Using Monte Carlo for a Large-Scale Simulation: MCell
• MCell = general simulator for cellular microphysiology
• Uses a Monte Carlo diffusion and chemical reaction algorithm in 3D to simulate complex biochemical interactions of molecules
  – The molecular environment is represented as a 3D space in which the trajectories of ligands against cell membranes are tracked
• Researchers need huge runs to model entire cells at the molecular level
  – 100,000s of tasks
  – 10s of Gbytes of output data
  – Will ultimately perform execution-time computational steering, data analysis, and visualization
MCell Application Architecture
• Monte Carlo simulation performed on large parameter space
• In implementation, parameter sets stored in large shared data files
• Each task implements an “experiment” with a distinct data set
• Ultimately users will produce partial results during large-scale runs and use them to “steer” the simulation
MCell Programming Issues
• The application is nearly embarrassingly parallel and can target either MPPs or clusters
  – Could even target both if the implementation were developed in this way
• Although the application is nearly embarrassingly parallel, tasks share large input files
  – The cost of moving files can dominate computation time by a large factor
  – The most efficient approach is to co-locate data and computation
  – Workqueue does not consider data location in its allocation of tasks to processors
Scheduling MCell
• We’ll show several ways that MCell can be scheduled on a set of clusters and compare execution performance
[Figure: target Grid platform: the user's host and storage connected by network links to clusters, additional storage, and an MPP]
• Allocation developed by dynamically generating a Gantt chart for scheduling unassigned tasks between scheduling events
• Basic skeleton (a sketch in code follows the list)
  1. Compute the next scheduling event
  2. Create a Gantt chart G
  3. For each computation and file transfer currently underway, compute an estimate of its completion time and fill in the corresponding slots in G
  4. Select a subset T of the tasks that have not started execution
  5. Until each host has been assigned enough work, heuristically assign tasks to hosts, filling in slots in G
  6. Implement the schedule
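A highly simplified sketch of steps 2 through 6 for one scheduling event. Here the Gantt chart is just a per-host list of (task, start, end) slots; all names and the pluggable heuristic hook are illustrative, and a real chart would also contain the file-transfer slots mentioned in step 3:

def contingency_schedule(pending_tasks, hosts, predtime, heuristic):
    """Build a Gantt chart G and heuristically assign the pending tasks.

    predtime(task, host) -> predicted run time of task on host;
    heuristic(pending, hosts, avail, predtime) -> next (task, host) pair.
    """
    G = {h: [] for h in hosts}       # Gantt chart: slots per host
    avail = {h: 0.0 for h in hosts}  # time at which each host frees up
    pending = list(pending_tasks)
    while pending:                   # until all selected tasks are placed
        task, host = heuristic(pending, hosts, avail, predtime)
        start = avail[host]
        end = start + predtime(task, host)
        G[host].append((task, start, end))  # fill in the slot in G
        avail[host] = end
        pending.remove(task)
    return G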
Contingency Scheduling Algorithm
[Figure: Gantt chart G with time on the vertical axis and resources (network links, hosts of Cluster 1, hosts of Cluster 2) on the horizontal axis; computation slots are filled in between successive scheduling events]
MCell Scheduling Heuristics
• Many heuristics can be used in the contingency scheduling algorithm (a comparison sketch follows the formulas)
  – Min-min [the task/resource pair that can complete the earliest is assigned first]
  – Max-min [the longest of the tasks' earliest completion times is assigned first]
  – Sufferage [the task that would "suffer" most if given a poor schedule is assigned first]
  – Extended Sufferage [minimal completion times are computed for each task on each cluster, and the sufferage heuristic is applied to these]
  – Workqueue [a randomly chosen task is assigned first]
Min-min: $\min_i \{\min_j \{predtime(task_i, processor_j)\}\}$
Max-min: $\max_i \{\min_j \{predtime(task_i, processor_j)\}\}$
Sufferage: $\max_i \{nextpredtime(task_i) - predtime(task_i)\}$, where $predtime(task_i)$ is the best and $nextpredtime(task_i)$ the second-best value of $predtime(task_i, processor_j)$ over processors $j$
Extended Sufferage: the same, with $cluster_j$ in place of $processor_j$
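A minimal sketch of how Min-min, Max-min, and Sufferage choose the next task from a matrix of predicted completion times. This is illustrative only: the real implementations also model shared-file transfers, and the example matrix is made up:

def best_two(row):
    """Best and second-best completion times for one task's row."""
    s = sorted(row)
    return s[0], (s[1] if len(s) > 1 else s[0])

def pick_task(predtime, heuristic):
    """predtime[i][j] = predicted completion time of task i on processor j."""
    tasks = range(len(predtime))
    if heuristic == "min-min":    # earliest possible completion first
        return min(tasks, key=lambda i: min(predtime[i]))
    if heuristic == "max-min":    # longest of the best times first
        return max(tasks, key=lambda i: min(predtime[i]))
    if heuristic == "sufferage":  # largest gap between best and 2nd best
        return max(tasks, key=lambda i: best_two(predtime[i])[1]
                                      - best_two(predtime[i])[0])
    raise ValueError(heuristic)

pt = [[4, 9], [3, 3.5], [6, 14]]  # 3 tasks on 2 processors
print(pick_task(pt, "min-min"),   # task 1 (earliest completion: 3)
      pick_task(pt, "max-min"),   # task 2 (largest best time: 6)
      pick_task(pt, "sufferage")) # task 2 (would suffer most: 14 - 6 = 8)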
Which heuristic is best?
• How sensitive are the scheduling heuristics to the location of shared input files and the cost of data transmission?
• Used the contingency scheduling algorithm to compare
  – Min-min
  – Max-min
  – Sufferage
  – Extended Sufferage
  – Workqueue
• Ran the contingency scheduling algorithm on a simulator which reproduced the file sizes and task run-times of real MCell runs
MCell Simulation Results
• Comparison of the performance of the scheduling heuristics when it is up to 40 times more expensive to send a shared file across the network than it is to compute a task
• The "Extended Sufferage" scheduling heuristic takes advantage of file sharing to achieve good application performance
[Figure: simulated makespans for the Max-min, Workqueue, XSufferage, Sufferage, and Min-min heuristics]
Additional Programming Issues
• We almost never know completely accurately what the runtime will be
  – Resources may be shared
  – Computation may be data dependent
  – Task execution time may be hard to predict
• How sensitive are the scheduling heuristics to inaccurate performance information?
  – i.e., what if our estimate of the execution time of a task on a resource is not 100% accurate?
[Figure: MCell with a single scheduling event and task execution time predictions with between 0% and 100% error]
[Figure: the same results with a higher frequency of scheduling events]