37
1 Screen shots – Load imbalance Jacobi 2048 X 2048 Threshold 0.1 Chares 32 Processors 4

Screen shots – Load imbalance

  • Upload
    elle

  • View
    57

  • Download
    0

Embed Size (px)

DESCRIPTION

Screen shots – Load imbalance. Jacobi 2048 X 2048 Threshold 0.1 Chares 32 Processors 4. Timelines – load imbalance. Migration. Array objects can migrate from one PE to another To migrate, must implement pack/unpack or pup method pup combines 3 functions into one - PowerPoint PPT Presentation

Citation preview

Page 1: Screen shots – Load imbalance

1

Screen shots – Load imbalance

Jacobi 2048 X 2048

Threshold 0.1

Chares 32

Processors 4

Page 2: Screen shots – Load imbalance

2

Timelines – load imbalance

Page 3: Screen shots – Load imbalance

3

Migration Array objects can migrate from one PE to another To migrate, must implement pack/unpack or pup

method pup combines 3 functions into one

– Data structure traversal : compute message size, in bytes

– Pack : write object into message– Unpack : read object out of message

Basic Contract : here are my fields (types, sizes and a pointer)

Page 4: Screen shots – Load imbalance

4

Pup – How to write it?Class ShowPup {

double a; int x;

char y; unsigned long z;

float q[3]; int *r; // heap allocated memory

public:

… other methods …

void pup(PUP:er &p) {

p(a);

p(x); p(y); p(z);

p(q,3);

if(p.isUnpacking() )

r = new int[ARRAY_SIZE];

p(r,ARRAY_SIZE);

}

};

Page 5: Screen shots – Load imbalance

5

Load Balancing All you need is a working pup

link a LB module

– -module <strategy>

– CommLB, Comm1LB, GreedyLB, GreedyRefLB, MetisLB, NeighborLB, RandCentLB, RandRefLB, RecBisectLB, RefineLB

– EveryLB will include all load balancing strategies

runtime option

– +balancer CommLB

Page 6: Screen shots – Load imbalance

6

Centralized Load Balancing

Uses information about activity on all processors to make load balancing decisions

Advantage: since it has the entire object communication graph, it can make the best global decision

Disadvantage: Higher communication costs/latency, since this requires information from all running chares

Page 7: Screen shots – Load imbalance

7

Neighborhood Load Balancing

Load balances among a small set of processors (the neighborhood) to decrease communication costs

Advantage: Lower communication costs, since communication is between a smaller subset of processors

Disadvantage: Could leave a system which is globally poorly balanced

Page 8: Screen shots – Load imbalance

8

Centralized Load Balancing Strategies RandCentLB – randomly assign objects to

processors, with no reference to the object call graph GreedyLB – starting with no load on any processor,

places objects with highest load on processors with lowest load until all objects are allocated to a processor

RefineLB – move objects off overloaded processors to under-utilized processors to reach average load

RandRefLB – randomly assign objects to processors, then refine

GreedyRefLB – assign objects to processors using the greedy load balancer, then refine

Page 9: Screen shots – Load imbalance

9

Centralized Load Balancing Strategies, Part 2

RecBisectBfLB – recursively partition the object communication graph until there is one partition for each processor

MetisLB – use Metis to partition object communication graph

CommLB – similar to the greedy load balancer, but also takes communication graph into account

Comm1LB – variation of CommLB

Page 10: Screen shots – Load imbalance

10

Neighborhood Load Balancing Strategies

NeighborLB – neighborhood load balancer, currently uses a neighborhood of 4 processors

Page 11: Screen shots – Load imbalance

11

When to Re-balance Load?

AtSync method: enable load balancing at specific point– Object ready to migrate– Re-balance if needed– AtSync(),

ResumeFromSync()

Manual trigger: specify when to do load balancing

– All objects ready to migrate

– Re-balance now– TurnManualLBOn(),

StartLB()

Default: Load balancer will migrate when needed

Programmer Control

Page 12: Screen shots – Load imbalance

12

Processor Utilization: After Load Balance

Page 13: Screen shots – Load imbalance

13

Timelines: Before and After Load Balancing

Page 14: Screen shots – Load imbalance

14

Other tools: LiveViz

Page 15: Screen shots – Load imbalance

15

LiveViz – What is it?

Charm++ library Visualization tool Inspect your

program’s current state

Client runs on any machine (java)

You code the image generation

2D and 3D modes

Page 16: Screen shots – Load imbalance

16

LiveViz – Monitoring Your Application

LiveViz allows you to watch your application’s progress

Can use it from work or home

Doesn’t slow down computation when there is no client

Page 17: Screen shots – Load imbalance

17

LiveViz - Compilation

Compile the LiveViz library itself– Must have built charm++ first!

From the charm directory, run:– cd tmp/libs/ck-libs/liveViz– make

Page 18: Screen shots – Load imbalance

18

Running LiveViz Build and run the server

– cd pgms/charm++/ccs/liveViz/serverpush– Make– ./run_server

Or in detail…

Page 19: Screen shots – Load imbalance

19

Running LiveViz

Run the client– cd pgms/charm++/ccs/liveViz/client– ./run_client [<host> [<port>]]

Should get a result window:

Page 20: Screen shots – Load imbalance

20

LiveViz Request Model

LiveViz Server Code

Client

Parallel Application

Get Image

Poll for Request Poll Request Returns Work

Image Chunk Passed to Server Server Combines Image Chunks

Send Image to Client

Buffer Request

Page 21: Screen shots – Load imbalance

21

Main:

Setup worker array, pass data to them

Workers:

Start looping

Send messages to all neighbors with ghost rows

Wait for all neighbors to send ghost rows to me

Once they arrive, do the regular Jacobi relaxation

Calculate maximum error, do a reduction to compute

global maximum error

If timestep is a multiple of 64, load balance the

computation. Then restart the loop.

Jacobi 2D Example StructureMain:

Setup worker array, pass data to them

Workers:

Start looping

Send messages to all neighbors with ghost rows

Wait for all neighbors to send ghost rows to me

Once they arrive, do the regular Jacobi relaxation

Calculate maximum error, do a reduction to compute

global maximum error

If timestep is a multiple of 64, load balance the

computation. Then restart the loop.

Main:

Setup worker array, pass data to them

Workers:

Start looping

Send messages to all neighbors with ghost rows

Wait for all neighbors to send ghost rows to me

Once they arrive, do the regular Jacobi relaxation

Calculate maximum error, do a reduction to compute

global maximum error

If timestep is a multiple of 64, load balance the

computation. Then restart the loop.

Page 22: Screen shots – Load imbalance

22

#include <liveVizPoll.h>

void main::main(. . .) {

// Do misc initilization stuff

// Now create the (empty) jacobi 2D array

work = CProxy_matrix::ckNew(0);

// Distribute work to the array, filling it as you do

}

#include <liveVizPoll.h>

void main::main(. . .) {

// Do misc initilization stuff

// Create the workers and register with liveviz

CkArrayOptions opts(0); // By default allocate 0

// array elements.

liveVizConfig cfg(true, true); // color image = true and

// animate image = true

liveVizPollInit(cfg, opts); // Initialize the library

// Now create the jacobi 2D array

work = CProxy_matrix::ckNew(opts);

// Distribute work to the array, filling it as you do

}

LiveViz Setup

Page 23: Screen shots – Load imbalance

23

Adding LiveViz To Your Codevoid matrix::serviceLiveViz() {

liveVizPollRequestMsg *m;

while ( (m = liveVizPoll((ArrayElement *)this, timestep))

!= NULL ) {

requestNextFrame(m);

}

}

void matrix::startTimeSlice() {

// Send ghost row north, south, east, west, . . .

sendMsg(dims.x-2, NORTH, dims.x+1, 1, +0, -1);

}

void matrix::startTimeSlice() {

// Send ghost row north, south, east, west, . . .

sendMsg(dims.x-2, NORTH, dims.x+1, 1, +0, -1);

// Now having sent all our ghosts, service liveViz

// while waiting for neighbor’s ghosts to arrive.

serviceLiveViz();

}

Page 24: Screen shots – Load imbalance

24

Generate an Image For a Requestvoid matrix::requestNextFrame(liveVizPollRequestMsg *m) {

// Compute the dimensions of the image bit we’ll send

// Compute the image data of the chunk we’ll send –

// image data is just a linear array of bytes in row-major

// order. For greyscale it’s 1 byte, for color it’s 3

// bytes (rgb).

// The liveViz library routine colorScale(value, min, max,

// *array) will rainbow-color your data automatically.

// Finally, return the image data to the library

liveVizPollDeposit((ArrayElement *)this, timestep, m,

loc_x, loc_y, width, height, imageBits);

}

Page 25: Screen shots – Load imbalance

25

OPTS=-g

CHARMC=charmc $(OPTS)

LB=-module RefineLB

OBJS = jacobi2d.o

all: jacobi2d

jacobi2d: $(OBJS)

$(CHARMC) -language charm++ \

-o jacobi2d $(OBJS) $(LB) –lm

jacobi2d.o: jacobi2d.C jacobi2d.decl.h

$(CHARMC) -c jacobi2d.C

OPTS=-g

CHARMC=charmc $(OPTS)

LB=-module RefineLB

OBJS = jacobi2d.o

all: jacobi2d

jacobi2d: $(OBJS)

$(CHARMC) -language charm++ \

-o jacobi2d $(OBJS) $(LB) -lm \

-module liveViz

jacobi2d.o: jacobi2d.C jacobi2d.decl.h

$(CHARMC) -c jacobi2d.C

Link With The LiveViz Library

Page 26: Screen shots – Load imbalance

26

LiveViz Summary

Easy to use visualization library Simple code handles any number of

clients Doesn’t slow computation when there

are no clients connected Works in parallel, with load balancing,

etc.

Page 27: Screen shots – Load imbalance

27

Advanced Features: Groups

Groups are similar to arrays, except only one element is on each processor – the index to access the group is the processor ID

Advantage: Groups can be used to batch messages from chares running on a single processor, which cuts down on the message traffic

Disadvantage: Does not allow for effective load balancing, since groups are stationary (they are not virtualized)

Page 28: Screen shots – Load imbalance

28

Advanced Features: Node Groups Similar to groups, but only one per node,

instead of one per processor – the index is the node number

Similar advantages and disadvantages as well – node groups can be used to batch messages on a single node, but are not virtualized and do not participate in load balancing

Node groups can have exclusive entry methods – only one exclusive entry method may be running at once

Page 29: Screen shots – Load imbalance

29

Advanced Features: Priorities

Messages can be assigned different priorities The simplest priorities just specify that the message

should either go on the end of the queue (standard behavior) or the beginning of the queue

Specific priorities can also be assigned to messages Priorities can be specified with either numbers or bit

vectors For numeric priorities, lower numbers have a higher

priority

Page 30: Screen shots – Load imbalance

30

Advanced Features: Custom Array Indexes

Standard, system supplied indexes are available for 1D, 2D, and 3D arrays

You can create your own custom index for higher dimension arrays or for custom indexing information

Need to create a custom class that provides indexing functionality, supply this with the array definition

Page 31: Screen shots – Load imbalance

31

Advanced Features: Entry Method Attributes

entry [attribute1, ..., attributeN] void

EntryMethod(parameters);

Attributes: threaded

– entry methods which are run in their own non- preemptible threads

sync – methods return message as a result

Page 32: Screen shots – Load imbalance

32

Advanced features: Reductions Callbacks

– transfer control back to a client after a library has finished– Various pre-defined callbacks, eg: CkExit the program

Callbacks in reductions– Can be specified in the main chare on processor 0:

myProxy.ckSetReductionClient(new CkCallback(...));

– Can be specified in the call to contribute by specifying the callback method:

contribute(sizeof(x), &x, CkReduction::sum_int,processResult );

Page 33: Screen shots – Load imbalance

33

Reductions, Part 2

Predefined Reductions– Sum values or arrays with

CkReduction::sum_[int, float, double]– Calculate the product of values or arrays with

CkReduction:: product_[int,float,double]– Calculate the maximum contributed value with

CkReduction:: max_[int,float,double]– Calculate the minimum contributed value with

CkReduction:: min_[int,float,double]– Calculate the logical and of integer values with

CkReduction:: logical_and

Page 34: Screen shots – Load imbalance

34

Reductions, Part 3

Predefined Reductions, continued…– Calculate the logical or of contributed

integers with CkReduction::logical_or– Form a set of all contributed values with

CkReduction::set– Concatenate bytes of all contributed

values with CkReduction::concat

Page 35: Screen shots – Load imbalance

35

Reductions, Part 4

User defined reductions– performing a user-defined operation on user-

defined data– Defined as CkReductionMsg *reductionFn(int

nMsg, CkReductionMsg **msgs)– Registered using CkReducer::addReducer,

which returns a reduction type you can pass to Contribute

Page 36: Screen shots – Load imbalance

36

Benefits of Virtualization Better Software Engineering

– Logical Units decoupled from “Number of processors”

Message Driven Execution– Adaptive overlap between computation and

communication– Predictability of execution

Flexible and dynamic mapping to processors– Flexible mapping on clusters– Change the set of processors for a given job– Automatic Checkpointing

Principle of Persistence

Page 37: Screen shots – Load imbalance

37

More Information

http://charm.cs.uiuc.edu– Manuals– Papers– Download files– FAQs

[email protected]