Dynamic Load Balancing: Tree and Structured Computations (CS433, Laxmikant Kale, Spring 2001)


Page 1:

Dynamic Load Balancing: Tree and Structured Computations

CS433

Laxmikant Kale

Spring 2001

Page 2:

When to send work away:

• Consider a processor with k units of work and P other processors; assume that a message takes 100 microseconds to reach another processor:

• 20 microseconds send-processor overhead,

• 60 microseconds network latency,

• 20 microseconds receive-processor overhead

– If each task takes t units of time to complete, under what conditions should the processor send tasks out to others (vs. doing them itself)?

– E.g., if t = 100 microseconds? 50? 1000?

Key observation: the “master” spends only 40 microseconds of its own time (send + receive overhead) coordinating a task, although the round-trip latency is 200 microseconds
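The arithmetic above can be sketched as a quick check (the helper below is ours, not from the course; it assumes the slide's numbers):

```python
# Back-of-the-envelope check: offloading a task costs the master only its
# send + receive overhead, so farming work out pays off once a task's run
# time exceeds that overhead -- the network latency overlaps with other work.

SEND_OVERHEAD_US = 20   # master CPU time to send a task
RECV_OVERHEAD_US = 20   # master CPU time to receive the result
MASTER_COST_US = SEND_OVERHEAD_US + RECV_OVERHEAD_US  # 40 us per farmed task

def should_offload(task_time_us: float) -> bool:
    """Offload iff the master frees up more time than it spends coordinating."""
    return task_time_us > MASTER_COST_US

for t in (50, 100, 1000):
    print(t, should_offload(t))
```

With enough tasks in flight, the 200-microsecond round trip only delays the first results; the master's 40-microsecond coordination cost is what limits throughput.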

Page 3:

Tree structured computations

• Examples: – Divide-and-conquer

– State-space search

– Game-tree search

– Bidirectional search

– Branch-and-bound

• Issues:

– Grainsize control

– Dynamic Load Balancing

– Prioritization

Page 4:

Divide and Conquer

• Simplest situation among the above

– Given a problem, a recursive algorithm divides it into one, two, or more subproblems, and the solutions to the subproblems are composed to create a solution to the original problem

– Example: adaptive quadrature

– Consider a simpler setting:

• fib(n) = fib(n-1) + fib(n-2)

• Note: the Fibonacci algorithm itself is not important here

– Issues:

• subtrees are of unequal size, so work can't be assigned a priori

• firing every recursive call as a parallel task is too fine-grained
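To see how fine-grained naive task firing is, here is a small counting sketch (the cutoff value and helper name are our illustration, not from the slides): it counts how many parallel tasks fib(n) would fire with and without a sequential cutoff.

```python
# Subtrees smaller than `cutoff` are evaluated sequentially instead of
# being fired as tasks; each surviving call fires two child tasks.
def count_tasks(n, cutoff):
    """Number of parallel tasks fired while evaluating fib(n)."""
    if n < cutoff:
        return 0           # solved inline, no task created
    return 2 + count_tasks(n - 1, cutoff) + count_tasks(n - 2, cutoff)

print(count_tasks(25, 1))   # every call fires tasks: hundreds of thousands
print(count_tasks(25, 20))  # with a cutoff: a handful
```

The cutoff trades a little parallelism at the leaves for a drastic reduction in per-task overhead, which is the essence of grainsize control.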

Page 5:

Dynamic load balancing formulation:

• Each PE is creating work randomly

• How to redistribute work?

• Initial allocation

• Rebalancing

• Centralized vs. distributed

Page 6:

Reading Assignment

• Adaptive grainsize control:

– http://charm.cs.uiuc.edu go to publications, 95-05

• Prioritization and first-solution search:

– http://charm.cs.uiuc.edu go to publications, 93-06

• Dynamic load balancing for tree-structured computations:

– Vipin Kumar's papers (link to be added shortly)

– http://charm.cs.uiuc.edu go to publications, 93-13

• A few more papers will be posted soon.

Page 7:

Adaptive grainsize control:

• Strategy 1: cut-off depth

– (but must have an estimate of the size of the subtree)

• Strategy 2: stack splitting

– Each PE maintains a stack of nodes of the tree

– If my stack is empty, “steal” half the stack of someone else

– Which part of the stack? Top? Bottom?
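A minimal sketch of the stack split (the helper name and the bottom-half choice are our illustration, answering the slide's question one common way): nodes near the bottom of the stack are close to the root and tend to represent larger subtrees, so stealing from the bottom hands over bigger grains of work.

```python
# The victim's stack is ordered bottom -> top (root-side nodes first).
def steal_half(victim_stack):
    """Remove the bottom half of the victim's stack and return it."""
    half = len(victim_stack) // 2
    stolen, victim_stack[:] = victim_stack[:half], victim_stack[half:]
    return stolen

work = ["root", "depth1", "depth2", "depth3"]  # bottom -> top
print(steal_half(work))  # thief gets the nodes nearest the root
print(work)              # victim keeps the deeper, smaller subtrees
```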

Page 8:

Adaptive grainsize control:

• Strategy 3:

– Objects (tree nodes) decide whether to make children available to other processors by calling a function in the runtime

– the runtime monitors the size of its queue (stack), and possibly the sizes of other processors' queues

Page 9:

Adaptive grainsize control:

• Strategy 3: Objects decide how big they want to grow

– Monitor execution time (number of tree nodes evaluated)

– If the number is above a threshold:

• Fire some of my nodes as independent objects to be mapped somewhere else

– Problem: you sometimes get a “mother” object that just keeps firing off lots of smaller objects

• Solution: once above the threshold, split the rest of the work into two objects and fire both off.

Page 10:

Dynamic load balancing

• Centralized:

– maintain top levels of the tree on one processor

– serve requests for work on demand

• Variation:

– hierarchical

Page 11:

Fully Distributed strategies

• Keep track of neighbors

• Diffusion/Gradient model

• Neighborhood averaging

• What topology to use:

– Machine’s

– Hypercube

– Denser?

Page 12:

Gradient model

• Misnomer: the name is too broad

• Actual strategy:

– Processors arranged in a topology

• (the topology may be virtual, but the original purpose was to use the machine's real one)

– Each processor (tries to) maintain an estimate of how far it is from an idle processor

– Idle processors have a distance of 0

– Other processors: periodically send their numbers to neighbors

• My distance = 1 + min(neighbors' distances)

– If my distance is more than a neighbor’s, send some work to it

• Work will “flow” towards idle processor
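The distance update above can be sketched on a small ring of PEs (the topology, size, and idle PE below are made-up illustration data, not from the slides):

```python
# Gradient-model distance estimates on a ring of 8 PEs. Idle PEs have
# distance 0; every other PE repeatedly applies
#     my_distance = 1 + min(neighbors' distances)
# until the estimates stabilize.
P = 8
idle = {5}                      # PE 5 has run out of work
INF = P                         # "far away" initial estimate
dist = [0 if p in idle else INF for p in range(P)]

changed = True
while changed:                  # stands in for periodic exchanges
    changed = False
    for p in range(P):
        if p in idle:
            continue
        nbrs = [(p - 1) % P, (p + 1) % P]
        d = 1 + min(dist[n] for n in nbrs)
        if d < dist[p]:
            dist[p] = d
            changed = True

print(dist)  # distances decrease toward PE 5: work flows down the gradient
```

A loaded PE compares its distance with each neighbor's and pushes work toward any neighbor that is closer to the idle processor.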

Page 13:

Neighborhood averaging

• Assume a virtual topology

• Periodically send my own load (queue size) to neighbors

• Each processor:

– Calculates the average load of its neighborhood

– If I am above average, send pieces of work to underloaded neighbors so as to equalize them

• Estimate of work:

– Assume the same cost for each unit

– Use better estimate if known
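One rebalancing step for a single overloaded PE can be sketched as follows (the function and numbers are our illustration; it assumes every work unit costs the same, per the slide):

```python
# Neighborhood averaging: if my load exceeds my neighborhood's average,
# ship the excess to neighbors that are below the average.
def rebalance(my_load, nbr_loads):
    """Return (new_my_load, transfers): transfers[i] = units sent to nbr i."""
    avg = (my_load + sum(nbr_loads)) / (1 + len(nbr_loads))
    transfers = [0] * len(nbr_loads)
    for i, load in enumerate(nbr_loads):
        if my_load <= avg:
            break                      # no excess left to give away
        if load < avg:
            give = min(int(avg - load), int(my_load - avg))
            transfers[i] = give
            my_load -= give
    return my_load, transfers

print(rebalance(10, [2, 4, 8]))  # neighborhood average is 6
```

Each PE runs this locally and periodically, so load spreads outward over a few rounds rather than being equalized globally in one step.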

Page 14:

Randomized strategies

• Random initial assignment:

– As work is created, assign it to a randomly chosen PE

– Problem: no way to correct errors

– Every piece of work crosses processors: communication overhead

• Random demand:

– If I am idle, ask a randomly selected processor for work

– If I get a demand, send half of my nodes to the requestor

– Good theoretical properties

– In practice: somewhat high overhead
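The random-demand protocol can be sketched with simulated PEs (all of the structure below is our illustration, not an API from the course): an idle PE picks a random victim, and the victim ships half of its nodes back.

```python
import random

def request_work(queues, idle_pe, rng):
    """Idle PE asks one randomly selected other PE; the victim sends half
    of its node queue to the requestor."""
    victim = rng.choice([p for p in range(len(queues)) if p != idle_pe])
    half = len(queues[victim]) // 2
    queues[idle_pe], queues[victim] = queues[victim][:half], queues[victim][half:]
    return victim

rng = random.Random(0)               # seeded for repeatability
queues = [[], list(range(6)), list(range(4))]   # PE 0 is idle
victim = request_work(queues, 0, rng)
print(victim, queues)                # total work is conserved
```

The per-request message traffic is what makes this "somewhat high overhead" in practice, even though the randomization gives it good theoretical balance.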

Page 15:

Using Global Average

• Carry out a periodic global averaging to determine the average load across all processors

• If I am above average:

– send work “away”

– Alternatively, get a vector of overload via global averaging, and figure out whom to send what work

Page 16:

Using Global Loads

• Idea:

– For even a moderately large number of processors, collecting a vector of the load on each PE is not much more expensive than collecting just the total (per-message cost dominates)

– How can we use this vector without creating serial bottleneck?

– Each processor knows if it is overloaded compared with the average

• It also knows which PEs are underloaded

• But need an algorithm that allows each processor to decide whom to send work to without global coordination, beyond getting the vector

– Insight: everyone has the same vector

– Also, an assumption: there are sufficiently many fine-grained pieces of work

Page 17:

Global vector scheme: contd

• Global algorithm: if we were able to make the decision centrally:

receiver = nextUnderLoaded(0);

for (i = 0; i < P; i++) {

  if (load[i] > average) {

    assign processor i's excess work to receiver, advancing receiver to the next underloaded PE as needed;

  }

}

To make it a distributed algorithm: run the same algorithm on each processor, but ignore any reassignment that doesn't involve me.
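The same pass can be made runnable (function and variable names below are ours; it assumes work is divisible into unit-sized pieces). Because every PE holds the identical load vector and the pass is deterministic, each PE computes the same transfer list without any coordination and simply keeps the moves that involve it:

```python
# Global-vector scheme: plan all transfers from the shared load vector.
def plan_transfers(load):
    """Return (src, dst, amount) moves that bring `load` toward the average."""
    P = len(load)
    avg = sum(load) // P                 # integer average for unit-sized work
    deficits = [(p, avg - l) for p, l in enumerate(load) if l < avg]
    moves, d = [], 0
    for src in range(P):                 # scan overloaded PEs in order
        excess = load[src] - avg
        while excess > 0 and d < len(deficits):
            dst, need = deficits[d]
            amount = min(excess, need)
            moves.append((src, dst, amount))
            excess -= amount
            deficits[d] = (dst, need - amount)
            if deficits[d][1] == 0:      # receiver is full: advance
                d += 1
    return moves

# Identical on every PE; e.g. PE 2 acts only on moves where it is src or dst.
moves = plan_transfers([9, 1, 6, 0])
print(moves)
print([m for m in moves if 2 in m[:2]])
```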

Page 18:

Tree structured computations

• Examples: – Divide-and-conquer

– State-space search

– Game-tree search

– Bidirectional search

– Branch-and-bound

• Issues:

– Grainsize control

– Dynamic Load Balancing

– Prioritization

Page 19:

State Space Search

• Definition:

– start state, operators, goal state (implicit/explicit)

– Either search for a goal state or for a path leading to one

• If we are looking for all solutions:

– same as divide and conquer, except there is no backward communication

• Search for any solution:

– Use the same algorithm as above?

– Problems: inconsistent speedups that do not increase monotonically with the number of processors

Page 20:

State Space Search

• Using priorities:

– bitvector priorities

– Let the root have priority 0 (the empty bitvector)

– Priority of a child: the parent's priority followed by the child's rank

(Figure: a tree node with priority p; its children's priorities append their ranks, e.g. p01, p02, p03.)
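A tiny sketch of bitvector priorities (representing them as Python bit-strings with two-bit ranks is our choice, not the course's encoding): appending the rank at each level makes plain lexicographic comparison prefer the left part of the tree.

```python
def child_priority(parent_prio: str, rank: int) -> str:
    """Priority of a child = parent's bitvector followed by its rank."""
    return parent_prio + format(rank, "02b")  # 2 bits per level here

root = ""                                             # highest priority
kids = [child_priority(root, r) for r in range(3)]    # '00', '01', '10'
grandkids = [child_priority(kids[1], r) for r in range(2)]

# Lexicographically smaller strings win: both children of '01' come
# before the sibling subtree '10', so the search stays leftmost-first.
frontier = sorted(kids[2:] + grandkids)
print(frontier)
```

This ordering is what steers the parallel search toward the left of the tree, as described on the next slide.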

Page 21:

Effect of Prioritization

• Let us consider shared-memory machines for simplicity:

– Search is directed to the left part of the tree

– Memory usage: let B be the branching factor of the tree and D its depth:

• O(D*B + P) nodes in the queue at a time

• With per-PE stacks: O(D*P*B)

– Consistent and monotonic speedups

– Consistent and monotonic speedups

(Figure: search-tree snapshots comparing the Ideal, Stack-stealing, and Prioritized strategies, with nodes marked done, active, or unexplored.)

Page 22:

Need prioritized load balancing

• On non-shared-memory machines?

• Centralized solution:

– Memory bottleneck too!

• Fully distributed solutions:

• Hierarchical solution:

– Token idea

Page 23:

Bidirectional Search

• Goal state is explicitly known and operators can be inverted

– Sequential:

– Parallel?

Page 24:

Game tree search

• Tricky problem

• alpha-beta, negamax