On-line adaptative parallel prefix computation


Jean-Louis Roch, Daouda Traore, Julien Bernard

INRIA-CNRS Moais team - LIG Grenoble, France

Contents

I. Motivation
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation

EUROPAR’2006 - Dresden, Germany - 2006, August 29th,

• Prefix problem:
  • input: $a_0, a_1, \dots, a_n$
  • output: $\pi_1, \dots, \pi_n$ with $\pi_i = a_0 * a_1 * \dots * a_i$

Parallel prefix on fixed architecture

• Sequential algorithm: performs only n operations
  • for (pi[0] = a[0], i = 1; i <= n; i++) pi[i] = pi[i-1] * a[i];

• Fine-grain optimal parallel algorithm [Ladner-Fisher-81]: critical time = 2 log n, but performs 2n ops

• Tight lower bound on p identical processors [Nicolau&al. 1996]: optimal time Tp = 2n / (p+1), but performs 2np/(p+1) ops

Parallel requires twice as many operations as sequential!
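To make these operation counts concrete, here is a minimal Python sketch (Python is used only for illustration, and the function names are hypothetical). It compares the sequential loop with the classic two-pass block scheme. That scheme is neither the Ladner-Fisher circuit nor the optimal algorithm of [Nicolau&al. 1996]; it only illustrates why organizing the computation in independent blocks pushes the number of `*` operations towards 2n.

```python
import operator

def sequential_prefix(a, op=operator.mul):
    """Return (prefixes, ops) with pi[0] = a[0], pi[i] = pi[i-1] * a[i]."""
    pi = [a[0]]
    ops = 0
    for x in a[1:]:
        pi.append(op(pi[-1], x))
        ops += 1
    return pi, ops

def two_pass_block_prefix(a, p, op=operator.mul):
    """Two-pass block scheme on p blocks (illustrative, run sequentially here).

    Pass 1: local prefix inside each block (blocks are independent).
    Pass 2: every block after the first is shifted by the prefix of
            everything on its left.
    Returns (prefixes, ops) where ops counts applications of `op`.
    """
    n = len(a)
    bounds = [n * k // p for k in range(p + 1)]
    blocks = [a[bounds[k]:bounds[k + 1]] for k in range(p)]
    blocks = [b for b in blocks if b]  # drop empty blocks when p > n
    ops = 0
    local = []
    for b in blocks:
        pi, count = sequential_prefix(b, op)
        local.append(pi)
        ops += count
    out = list(local[0])
    for pi in local[1:]:
        carry = out[-1]                # prefix of everything on the left
        out.extend(op(carry, x) for x in pi)
        ops += len(pi)
    return out, ops
```

For 12 elements and 3 blocks, the sequential loop uses 11 operations while the block scheme uses 17, and the gap approaches a factor of two as n grows with p fixed.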

Dynamic architecture: non-fixed number of resources, variable speeds
e.g. grid, … but not only: SMP server in multi-user mode

The problem: to design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture

[Figure: candidate algorithms (sequential algorithm, parallel P=2, parallel P=100, ..., parallel P=max) facing the target platforms (multi-user SMP server, grid, heterogeneous network). Which algorithm to choose?]

Lower bound for prefix on processors with changing speeds

- Model of heterogeneous processors with changing speed [Bender&al 02]
  => $\Pi_i(t)$ = instantaneous speed of processor i at time t (in number of * operations per second)
  Assumption: $\Pi_{\max}(t) < \text{constant} \cdot \Pi_{\min}(t)$
  Def: $\Pi_{ave}$ = average speed per processor for a computation with duration T

- Theorem 2: lower bound for the time of prefix computation on p processors with changing speeds: [formula not preserved]
  Sketch of the proof:
  - extension of the lower bound on p identical processors [Faith82]
  - based on the analysis of the number of performed operations
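As an illustration of the speed model, the following Python sketch computes an average speed per processor from sampled speeds and checks the bounded-ratio assumption. The discretization into time steps, the function names and the reading of "average speed" as total operations divided by (p times T) are assumptions made for this example, not definitions taken from the slides.

```python
def average_speed(speeds):
    """speeds[i][t] = operations done by processor i during time step t.

    Returns the average speed per processor over the whole duration
    (assumed here to be total operations / (p * T)).
    """
    p = len(speeds)
    T = len(speeds[0])
    return sum(sum(row) for row in speeds) / (p * T)

def speed_ratio_bounded(speeds, constant):
    """Check that max speed < constant * min speed at every time step."""
    T = len(speeds[0])
    for t in range(T):
        column = [row[t] for row in speeds]
        if not max(column) < constant * min(column):
            return False
    return True
```

For instance, with two processors sampled over four steps, `average_speed([[2, 2, 2, 2], [1, 3, 1, 3]])` gives 2.0.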

Changing speeds and work-stealing

• Work-stealing schedule adapts on-line to processor availability and speeds [Bender-02]

• Principle of work-stealing = "greedy" schedule, but distributed and randomized

• Each processor locally manages the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
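The stealing rule can be sketched with a toy, round-based simulation in Python. It is only an illustration under simplifying assumptions: processors take turns inside one thread, a thief runs the stolen task immediately, and the names `work_stealing_sum`, `grain` and `seed` are hypothetical. It does not model any particular runtime.

```python
import random
from collections import deque

def work_stealing_sum(values, p, grain=4, seed=0):
    """Sum `values` with p simulated processors; returns (total, steals).

    A task is a half-open range (lo, hi). A range longer than `grain` is
    split in two and both halves stay in the local deque. The owner works
    on its newest task (right end); a thief takes the oldest (left end).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((0, len(values)))
    total, steals = 0, 0
    while any(deques):
        for me in range(p):
            if not deques[me]:
                # Idle: pick a random non-idle victim, steal its oldest task.
                victims = [v for v in range(p) if v != me and deques[v]]
                if not victims:
                    continue
                victim = rng.choice(victims)
                deques[me].append(deques[victim].popleft())
                steals += 1
            lo, hi = deques[me].pop()            # newest local task
            if hi - lo > grain:
                mid = (lo + hi) // 2
                deques[me].append((mid, hi))     # created tasks stay local
                deques[me].append((lo, mid))
            else:
                total += sum(values[lo:hi])
    return total, steals
```

With one processor no steal ever happens; with several, the first idle processor immediately takes the oldest (largest) range from processor 0.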

« Work »: $W_1$ = total number of operations performed

« Depth »: $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)

[Bender-Rabin02]

Work-stealing and adaptation


• Interest: if the work $W_1$ is fixed and the depth $W_\infty$ is small, the adaptive schedule is near-optimal, with good probability, on p processors with average speed $\Pi_{ave}$

• Moreover: #steals = #task migrations < $p \cdot W_\infty$ [Blumofe 98 Narlikar 01 Bender 02]

• But lower bounds for prefix:
  • Minimal work $W_1 = n \Rightarrow W_\infty = n$
  • Minimal depth $W_\infty < 2 \log n \Rightarrow W_1 > 2n$

• With work-stealing, how to reach the lower bound ?

How to get both work $W_1$ and depth $W_\infty$ small?

• General approach: couple two algorithms:
  • a sequential algorithm with optimal number of operations Ws
  • a fine-grain parallel algorithm with minimal critical time $W_\infty$ but parallel work >> Ws

• Folk technique: parallel, then sequential
  • Parallel algorithm until a certain « grain »; then use the sequential one
  • Drawback with changing speeds: either too many idle processors or too many operations

• Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel
  Cascading [Jaja92] = careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1$ = O(Wseq)
  • Use the work-optimal sequential algorithm to reduce the size
  • Then use the time-optimal parallel algorithm to decrease the time
  • Drawback: sequential at coarse grain and parallel at fine grain

Alternative : concurrently sequential and parallel

[Figure labels: SeqCompute, Extract_par, LastPartComputation, SeqCompute]

Based on work-stealing and the work-first principle: always execute a sequential algorithm to reduce parallelism overhead; use the parallel algorithm only if a processor becomes idle (i.e. work-stealing), by extracting parallelism from a sequential computation (i.e. adaptive granularity)

Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

– Self-adaptive granularity based on work-stealing
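One possible way to picture this coupling is a shared descriptor of the remaining sequential work, from which an idle processor can detach the last part at any time. The Python sketch below is an assumption-laden illustration: the class `AdaptiveRange`, its methods, the `min_size` parameter and the choice of splitting off the last half are all hypothetical and are not taken from the slides.

```python
import threading

class AdaptiveRange:
    """Remaining work [lo, hi) of a sequential loop, shareable with thieves."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self._lock = threading.Lock()

    def next_index(self):
        """Sequential side: take the next index, or None when nothing is left."""
        with self._lock:
            if self.lo >= self.hi:
                return None
            i = self.lo
            self.lo += 1
            return i

    def extract_last_part(self, min_size=2):
        """Thief side: detach the last half of what remains (assumed policy)."""
        with self._lock:
            remaining = self.hi - self.lo
            if remaining < min_size:
                return None          # too little left to be worth stealing
            mid = self.hi - remaining // 2
            stolen = AdaptiveRange(mid, self.hi)
            self.hi = mid            # the sequential side now stops at mid
            return stolen
```

The sequential loop never creates tasks on its own: granularity is decided only when a thief actually calls `extract_last_part`, so no parallelism overhead is paid while every processor is busy.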


[Figure: timeline of the self-adaptive scheme. Labels: SeqCompute, preempt, merge/jump, complete, Seq.]

Adaptive Prefix on 3 processors

[Figure, animation in several frames: the main sequential process (Main Seq.) and two work-stealers (Work-stealer 1, Work-stealer 2) process the input a1 ... a12. After steal requests, the work-stealers hold the blocks a5 ... a8 and a9 ... a12 and compute local products of the form a5*…*ai and a9*…*ai, while the main process advances from the left and preempts the work-stealers when it reaches their blocks. Row labels: Parallel, Sequential.]

Implicit critical path on the sequential process
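The behaviour of the adaptive prefix can be imitated with a small, deterministic simulation in Python. This is a simplified sketch, not the authors' implementation: every processor performs one operation per round, work-stealers steal only from the main process (always the last half of what it has left), and the names `adaptive_prefix`, `blocks`, `owned` and `finalize` are hypothetical. The operation `op` is assumed associative.

```python
import operator

def adaptive_prefix(a, p, op=operator.mul):
    """Simulate 1 main process + (p-1) work-stealers; returns (pi, ops, rounds)."""
    n = len(a)
    pi = [None] * n
    pi[0] = a[0]
    lo, hi = 1, n                  # main's interval [lo, hi)
    blocks = []                    # stolen blocks, sorted by start index
    owned = [None] * (p - 1)       # block currently held by each work-stealer
    finalize = []                  # pending (index, left_prefix, local_product)
    ops = rounds = 0
    while None in pi:
        rounds += 1
        for s in range(p - 1):
            b = owned[s]
            if b is not None and b["start"] + len(b["beta"]) < b["end"]:
                i = b["start"] + len(b["beta"])
                b["beta"].append(op(b["beta"][-1], a[i]))   # local product
                ops += 1
            elif finalize:
                i, left, beta = finalize.pop()
                pi[i] = op(left, beta)                      # second pass
                ops += 1
            elif hi - lo >= 2:
                mid = hi - (hi - lo) // 2                   # steal last half
                b = {"start": mid, "end": hi, "beta": [a[mid]]}
                hi = mid
                blocks.insert(0, b)
                owned[s] = b
        if lo < hi:                                         # main: sequential step
            pi[lo] = op(pi[lo - 1], a[lo])
            ops += 1
            lo += 1
        elif blocks:                                        # main: preempt
            b = blocks.pop(0)
            owned[:] = [None if o is b else o for o in owned]
            k = len(b["beta"])
            left = pi[b["start"] - 1]
            pi[b["start"] + k - 1] = op(left, b["beta"][-1])  # merge
            ops += 1
            finalize += [(b["start"] + j, left, b["beta"][j]) for j in range(k - 1)]
            lo, hi = b["start"] + k, b["end"]               # jump
        elif finalize:
            i, left, beta = finalize.pop()
            pi[i] = op(left, beta)
            ops += 1
    return pi, ops, rounds
```

With p = 1 the simulation degenerates to the sequential loop (n - 1 operations for n elements). With more processors it finishes in fewer rounds at the price of extra operations, because every element computed by a work-stealer is touched a second time when its local product is combined with the prefix on its left.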

Analysis of the algorithm

• Theorem 3: execution time [formula not preserved; one of its terms is labeled "Lower bound"]

• Sketch of the proof: analysis of the operations performed:
  – The sequential main process performs S operations on one processor
  – The (p-1) work-stealers perform X = 2(n-S) operations with depth log X
  – Each non-constant-time task can potentially be split (variable speeds)

The coupling ensures that both algorithms complete simultaneously: Ts = Tp - O(log X). This makes it possible to bound the total number X of operations performed and the overhead of parallelism = (S+X) - #ops_optimal
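As a sanity check of this proof sketch, consider the simplified case of p identical unit-speed processors and neglect the O(log X) term (both are assumptions made here for illustration). If every processor is busy until the common completion time T, the main process performs S = T operations and the p-1 work-stealers perform X = (p-1)T operations, so:

```latex
\begin{align*}
X = 2(n - S),\quad S = T,\quad X = (p-1)\,T
  &\;\Longrightarrow\; 2(n - T) = (p-1)\,T \\
  &\;\Longrightarrow\; T = \frac{2n}{p+1}, \\
S + X = p\,T &= \frac{2np}{p+1}.
\end{align*}
```

Both values agree with the optimal time Tp = 2n / (p+1) and the operation count 2np/(p+1) quoted for p identical processors.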

Adaptive prefix: experiments 1

Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically "optimal" off-line parallel algorithm on p processors

[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context. Time (s) versus #processors. Curves: pure sequential, optimal off-line on p procs, adaptive.]

Adaptive prefix : experiments 2

Multi-user context: additional external load: (9-p) additional external dummy processes are executed concurrently. Processor-adaptive prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule.

[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with external load (9-p external processes). Time (s) versus #processors. Curves: off-line parallel algorithm for p processors, adaptive.]

Conclusion

The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms

Application to prefix computation:
- theoretically, reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs

Generic adaptive scheme to implement parallel algorithms with provable performance

- work in progress : parallel 3D reconstruction [oct-tree scheme with deadline constraint]

Thank you !


Interactive Distributed Simulation [B Raffin & E Boyer]

- 5 cameras
- 6 PCs

3D-reconstruction + simulation + rendering

-> Adaptive scheme to maximize 3D-reconstruction precision within fixed timestamp [L Suares, B Raffin, JL Roch]

The Prefix race: sequential/parallel fixed/ adaptive

Race between 9 algorithms (44 processes) on an octo-SMP

[Figure: bar chart of execution time (seconds, axis from 0 to 25) for the 9 algorithms: Sequential; Parallel 2 proc. to Parallel 8 proc.; Adaptive 8 proc.]

On each of the 10 executions, adaptive completes first

Adaptive prefix : some experiments

Single-user context: adaptive is equivalent to:
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors

Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm

[Figure: prefix of 10000 elements on an 8-processor SMP (IA64 / Linux). Two panels, time (s) versus #processors: single user; processors with variable speeds (external load). Curves: parallel, adaptive.]

With * = double sum ( r[i] = r[i-1] + x[i] )

Remark for n = 4,096,000 doubles:
- "pure" sequential: 0.20 s
- minimal "grain" = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)

Finest "grain" limited to 1 page = 16384 bytes = 2048 doubles
