Parallel Algorithms and Computing
Selected topics
Parallel Architecture
2
References
An Introduction to Parallel Algorithms, Joseph JáJá
Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
Parallel Sorting Algorithms, Selim G. Akl
3
Models
Three models:
Graphs (DAG : Directed Acyclic Graph)
Parallel Random Access Machine (PRAM)
Network
4
Graphs
Not studied here
Parallel Architecture
Parallel random access machine
6
Parallel Random Access Machine
Flynn classifies parallel machines based on:
– Data flow
– Instruction flow
Each flow can be:
– Single
– Multiple
7
Parallel Random Access Machine
Flynn classification (instruction flow × data flow):

Instruction flow \ Data flow   SINGLE   MULTIPLE
SINGLE                         SISD     SIMD
MULTIPLE                       MISD     MIMD
8
Parallel Random Access Machine
Extends the traditional RAM (Random Access Machine) model
Multiple processors
Interconnection network between the global memory and the processors
[Figure: processors P1, P2, …, Pp connected through an interconnection network to a global (shared) memory]
9
Parallel Random Access Machine
Characteristics
Processors Pi (0 ≤ i ≤ p-1):
– each with a local memory
– i is a unique identity for processor Pi
A global shared memory:
– it can be accessed by all processors
10
Parallel Random Access Machine
Types of operation: Synchronous
– processors work in lockstep
– at each step, a processor is active or idle
– suited for SIMD and MIMD architectures
Asynchronous
– processors have local clocks
– the processors need to be synchronized explicitly
– suited for MIMD architectures
11
Parallel Random Access Machine
Example of synchronous operation
Algorithm: Processor i (i = 0 … 3)
Input: A, B; i is the processor id
Output: C
Begin
    If (B == 0) then C = A
    Else C = A / B
End
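
A minimal sketch (assuming four simulated processors and the sample A, B values used in the next slides) of how this synchronous two-step execution can be mimicked sequentially: in step 1 only the processors with B = 0 are active, in step 2 only those with B ≠ 0.

    /* Hypothetical lockstep simulation of the example above (C). */
    #include <stdio.h>
    #define P 4
    int main(void) {
        double A[P] = {7, 2, 4, 5}, B[P] = {0, 1, 2, 0}, C[P] = {0};
        /* Step 1: processors with B == 0 are active, the others are idle */
        for (int i = 0; i < P; i++)
            if (B[i] == 0) C[i] = A[i];
        /* Step 2: processors with B != 0 are active, the others are idle */
        for (int i = 0; i < P; i++)
            if (B[i] != 0) C[i] = A[i] / B[i];
        for (int i = 0; i < P; i++)
            printf("P%d: C = %g\n", i, C[i]);
        return 0;
    }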
12
Parallel Random Access Machine
Step 1

                Processor 3    Processor 2    Processor 1    Processor 0
Initial:        A=7 B=0 C=0    A=2 B=1 C=0    A=4 B=2 C=0    A=5 B=0 C=0
After step 1:   A=7 B=0 C=7    A=2 B=1 C=0    A=4 B=2 C=0    A=5 B=0 C=5
                (active, B=0)  (idle, B≠0)    (idle, B≠0)    (active, B=0)
13
Parallel Random Access Machine
Step 2

                Processor 3    Processor 2    Processor 1    Processor 0
After step 2:   A=7 B=0 C=7    A=2 B=1 C=2    A=4 B=2 C=2    A=5 B=0 C=5
                (idle, B=0)    (active, B≠0)  (active, B≠0)  (idle, B=0)
14
Parallel Random Access Machine
Read / Write conflicts
EREW: Exclusive-Read, Exclusive-Write
– no concurrent (read or write) operation on a variable
CREW: Concurrent-Read, Exclusive-Write
– concurrent reads allowed on the same variable
– exclusive write only
15
Parallel Random Access Machine
ERCW : Exclusive Read – Concurrent Write
CRCW : Concurrent – Read, Concurrent – Write
16
Parallel Random Access Machine
Concurrent write on a variable X
Common CRCW: allowed only if all processors write the same value on X
SUM CRCW: write the sum of all the written values on X
Random CRCW: choose one processor at random and write its value on X
Priority CRCW: the processor with the highest priority writes on X
17
Parallel Random Access Machine
Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X)
Common CRCW or ERCW: failure
SUM CRCW: X is the sum (180) of the written values
Random CRCW: final value of X ∈ { 50, 60, 70 }
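
A small sketch (a sequential simulation with the values from the example; the priority assignment to P1 is an assumption) of how the different CRCW write policies resolve these concurrent writes:

    /* Hypothetical CRCW write-conflict resolution (C). */
    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
        double writes[3] = {50, 60, 70};   /* values written concurrently by P1, P2, P3 */
        int n = 3;
        /* SUM CRCW: X receives the sum of all the written values */
        double x_sum = 0;
        for (int i = 0; i < n; i++) x_sum += writes[i];
        /* Random CRCW: one writer is picked at random */
        double x_random = writes[rand() % n];
        /* Priority CRCW: assume P1 (index 0) has the highest priority */
        double x_priority = writes[0];
        /* Common CRCW: succeeds only if all written values are equal */
        int common_ok = 1;
        for (int i = 1; i < n; i++) if (writes[i] != writes[0]) common_ok = 0;
        printf("SUM=%g Random=%g Priority=%g Common=%s\n",
               x_sum, x_random, x_priority, common_ok ? "ok" : "failure");
        return 0;
    }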
18
Parallel Random Access Machine
Basic input/output operations
On global memory:
– global read (X, x)
– global write (Y, y)
On local memory:
– read (X, x)
– write (Y, y)
19
Example 1: Matrix-Vector product
Matrix-Vector product
Y = AX
– A is an n×n matrix
– X = [x1, x2, …, xn] is a vector of n elements
– p processors (p ≤ n) and r = n/p
Each processor is assigned a block of r = n/p rows of A
20
Example 1: Matrix-Vector product
[Figure: Y = A X, with Y = (Y1, …, Yn), A = (Ai,j) an n×n matrix and X = (X1, …, Xn), all stored in global memory; the processors P1, P2, …, Pp access the global memory.]
21
Example 1: Matrix-Vector product
Partition A into p blocks Ai
Compute the p partial products in parallel
Processor Pi computes the partial product Yi = Ai * X
[Figure: A is partitioned row-wise into blocks A1, …, Ap of r = n/p rows each; A1 holds rows 1..r, Ap holds rows (p-1)r+1..pr.]
22
Example 1: Matrix-Vector product
Processor Pi computes Yi = Ai * X
[Figure: P1 computes Y(1:r) = A(1:r, 1:n) X, P2 computes Y(r+1:2r) = A(r+1:2r, 1:n) X, …, Pp computes Y((p-1)r+1:pr) = A((p-1)r+1:pr, 1:n) X.]
23
Example 1: Matrix-Vector product
The solution requires:
– p concurrent reads of the vector X
– each processor Pi makes an exclusive read of block Ai = A[((i-1)r + 1) : ir, 1:n]
– each processor Pi makes an exclusive write on block Yi = Y[((i-1)r + 1) : ir]
Required architecture: PRAM CREW
24
Example 1: Matrix-Vector product
Algorithm: processor Pi (i = 1, 2, …, p)
Input:
    A : n×n matrix in global memory
    X : a vector in global memory
Output:
    Y = AX (Y is a vector in global memory)
Local variables:
    i : processor Pi identity
    p : number of processors
    n : dimension of A and X
Begin
    1. global read (X, z)
    2. global read (A((i-1)r + 1 : ir, 1:n), B)
    3. compute w = Bz
    4. global write (w, Y((i-1)r + 1 : ir))
End
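
A minimal sequential sketch of what each processor computes (assuming n = 8, p = 4 and sample data); the outer loop over pid plays the role of the p processors, each handling its block of r rows:

    /* Row-block matrix-vector product, simulating the p PRAM processors (C). */
    #include <stdio.h>
    #define N 8
    #define P 4
    int main(void) {
        double A[N][N], X[N], Y[N];
        int r = N / P;                               /* rows per processor */
        for (int i = 0; i < N; i++) {                /* sample data */
            X[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = i + j;
        }
        for (int pid = 0; pid < P; pid++)            /* "processor" pid */
            for (int row = pid * r; row < (pid + 1) * r; row++) {
                Y[row] = 0.0;                        /* exclusive write on Y(row) */
                for (int j = 0; j < N; j++)
                    Y[row] += A[row][j] * X[j];      /* concurrent read of X */
            }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }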
25
Example 1: Matrix-Vector product
Analysis
Computation cost
– Line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations, with r = n/p)
Communication cost
– Line 1: O(n) numbers transferred from global to local memory by Pi
– Line 2: O(n²/p) numbers transferred from global to local memory by Pi
– Line 4: O(n/p) numbers transferred from local to global memory by Pi
Overall: the algorithm runs in O(n²/p) time
26
Example 1: Matrix-Vector product
Another way to partition the matrix is vertically: A and X are split into blocks
– A1, A2, …, Ap
– X1, X2, …, Xp
Solution in two phases:
– compute the partial products Z1 = A1X1, …, Zp = ApXp
– synchronize the processors
– add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp
27
Example 1: Matrix-Vector product
[Figure: column partitioning. Processor P1 multiplies the first r columns of A by X1 … Xr, …, processor Pp multiplies the last r columns by X(p-1)r+1 … Xpr; the partial vectors are added after a synchronization step.]
28
Example 1: Matrix-Vector product
Algorithm: processor Pi (i = 1, 2, …, p)
Input:
    A : n×n matrix in global memory
    X : a vector in global memory
Output:
    Y = AX (Y is a vector in global memory)
Local variables:
    i : processor Pi identity
    p : number of processors
    n : dimension of A and X
Begin
    1. global read (X((i-1)r + 1 : ir), z)
    2. global read (A(1:n, (i-1)r + 1 : ir), B)
    3. compute w = Bz
    4. synchronize the processors Pi (i = 1, 2, …, p)
    5. global write (w, Y((i-1)r + 1 : ir))
End
29
Example 1: Matrix-Vector product
Analysis: work out the details. Overall, the algorithm runs in O(n²/p) time.
30
Example 2: Sum on the PRAM model
An array A of n = 2^k numbers
A PRAM machine with n processors
Compute S = A(1) + A(2) + … + A(n)
Construct a binary tree to compute the sum in log2 n time
31
Example 2: Sum on the PRAM model
[Figure: binary summation tree over processors P1 … P8.
Level 1: each Pi sets B(i) = A(i).
Level h > 1: Pi computes B(i) = B(2i-1) + B(2i).
After log2 n levels, the sum S = B(1) is held by P1.]
32
Example 2: Sum on the PRAM model
Algorithm: processor Pi (i = 1, …, n)
Input:
    A : array of n = 2^k elements in global memory
Output:
    S : where S = A(1) + A(2) + … + A(n)
Local variables of Pi:
    n, i : processor Pi identity
Begin
    1. global read (A(i), a)
    2. global write (a, B(i))
    3. for h = 1 to log n do
           if (i ≤ n / 2^h) then
               begin
                   global read (B(2i-1), x)
                   global read (B(2i), y)
                   z = x + y
                   global write (z, B(i))
               end
    4. if i = 1 then global write (z, S)
End
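
A sequential sketch of the same log-depth reduction (assuming n = 8 and sample data); each pass of the loop corresponds to one parallel level h of the tree:

    /* Binary-tree sum as on the PRAM, simulated sequentially (C). */
    #include <stdio.h>
    #define N 8                              /* n = 2^k, here k = 3 */
    int main(void) {
        double A[N + 1], B[N + 1];           /* 1-based indexing as in the algorithm */
        for (int i = 1; i <= N; i++) { A[i] = i; B[i] = A[i]; }
        for (int active = N / 2; active >= 1; active /= 2)   /* level h: n/2^h active */
            for (int i = 1; i <= active; i++)                /* processors P1 .. P(n/2^h) */
                B[i] = B[2 * i - 1] + B[2 * i];
        printf("S = %g (expected %g)\n", B[1], (double)N * (N + 1) / 2);
        return 0;
    }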
Parallel Architecture
Network model
34
Network model
Characteristics
Communication structure is important. The network can be seen as a graph G = (N, E):
– node i ∈ N is a processor
– edge (i, j) ∈ E represents a two-way communication link between processors i and j
Basic communication operations:
– send (X, Pi)
– receive (X, Pi)
No global shared memory
35
Network model
[Figure: linear array of n processors P1 – P2 – P3 – … – Pn]
[Figure: ring of n processors P1 – P2 – P3 – … – Pn, with Pn connected back to P1]
36
Network model
[Figure: grid of n² processors P11 … Pnn]
[Figure: torus of n² processors: the columns and rows are n rings]
37
Network model
[Figure: hypercube of n = 2^k processors P0 … P7 (k = 3)]
38
39
Example 1: Matrix-Vector Product on a linear array
A = [aij] an n×n matrix, i, j ∈ [1, n]; X = [xj], j ∈ [1, n]
Compute Y = A X, where yi = Σ (j = 1..n) aij xj
40
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm for n = 4
[Figure: linear array P1 P2 P3 P4. The values x1, x2, x3, x4 flow from the left into P1; row i of A (ai1, ai2, …, ai4) is fed from the top into Pi, skewed so that aij arrives at step i+j-1.]
41
Example 1: Matrix-Vector Product on a linear array
• At step j, xj enters processor P1. At each step, processor Pi receives (when possible) a value from its left and a value from the top, and updates its partial sum as follows:
    Yi = Yi + aij * xj, j = 1, 2, 3, …
• The values xj and aij reach processor Pi at the same time, at step (i+j-1):
    (x1, a11) reach P1 at step 1 = (1+1-1); (x3, a13) reach P1 at step 3 = (1+3-1)
• In general, Yi is computed at step N+i-1
42
Example 1: Matrix-Vector Product on a linear array
• The computation is completed when x4 and a44 reach processor P4, at step N + N - 1 = 2N-1
• Conclusion: the algorithm requires 2N-1 steps. At each step, every active processor performs an addition and a multiplication
• Complexity of the algorithm: O(N)
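
A sequential sketch of this systolic schedule (assuming N = 4 and sample data): at step t, processor Pi is active exactly when 1 ≤ t - i + 1 ≤ N, and it then consumes x(t-i+1) and ai,(t-i+1):

    /* Systolic matrix-vector product on a linear array, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], X[N], Y[N] = {0};
        for (int i = 0; i < N; i++) {                 /* sample data */
            X[i] = i + 1;
            for (int j = 0; j < N; j++) A[i][j] = (i + 1) * 10 + (j + 1);
        }
        for (int t = 1; t <= 2 * N - 1; t++)          /* 2N-1 systolic steps */
            for (int i = 1; i <= N; i++) {
                int j = t - i + 1;                    /* aij and xj reach Pi at step i+j-1 */
                if (j >= 1 && j <= N)
                    Y[i - 1] += A[i - 1][j - 1] * X[j - 1];
            }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }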
43
Example 1: Matrix-Vector Product on a linear array
[Figure: snapshots of the array at steps 1 to 7. The xj values shift one processor to the right at each step; once all the data has passed, Pi holds yi = Σ (j = 1..4) aij xj.]
44
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm: time/cost analysis (n = 4)
Step 1: 1 add, 1 mult; active: P1; idle: P2, P3, P4
Step 2: 2 adds, 2 mults; active: P1, P2; idle: P3, P4
Step 3: 3 adds, 3 mults; active: P1, P2, P3; idle: P4
Step 4: 4 adds, 4 mults; active: P1, P2, P3, P4; idle: none
Step 5: 3 adds, 3 mults; active: P2, P3, P4; idle: P1
Step 6: 2 adds, 2 mults; active: P3, P4; idle: P1, P2
Step 7: 1 add, 1 mult; active: P4; idle: P1, P2, P3
45
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm: time/cost analysis
[Figure: the seven step snapshots of the array, annotated with the add/mult counts and the active/idle processors listed above.]
46
Example 2: Matrix multiplication on a 2-D n×n mesh
Given two n×n matrices A = [aik] and B = [bkj], i, j ∈ [1, n],
compute the product C = AB, where cij = Σ (k = 1..n) aik bkj
47
Example 2: Matrix multiplication on a 2-D n×n mesh
• At step i, row i of A (starting with ai1) is entered from the top into column i (into processor P1i)
• At step j, column j of B (starting with b1j) is entered from the left into row j (into processor Pj1)
• The values aik and bkj reach processor Pji at step (i+j+k-2). At the end of this step, aik is sent down and bkj is sent right.
48
Example 2: Matrix multiplication on a 2-D n×n mesh
Example: systolic mesh algorithm for n = 4
Step 1
[Figure: 4×4 mesh of processors (1,1) … (4,4); the rows of A are fed from the top, one per column, and the columns of B are fed from the left, one per row, each stream skewed by one extra step per index.]
49
Example 2: Matrix multiplication on a 2-D n×n mesh
Example: systolic mesh algorithm for n = 4
Step 5
[Figure: snapshot of the A and B values in flight inside the mesh at step 5.]
50
Example 2: Matrix multiplication on a 2-D n×n mesh
Analysis
To determine the number of steps needed to complete the multiplication of the matrices, we must find the step at which the terms ann and bnn reach processor Pnn.
– Values aik and bkj reach processor Pji at step i+j+k-2
– Substituting n for i, j, k yields: n + n + n - 2 = 3n - 2
Complexity of the solution: O(n)
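
A sequential sketch of the mesh schedule (assuming n = 4 and sample data): processor (i, j) consumes aik and bkj exactly at step t = i + j + k - 2, so the last useful step is 3n - 2:

    /* Systolic mesh matrix multiplication schedule, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)                   /* sample data */
            for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }
        for (int t = 1; t <= 3 * N - 2; t++)          /* 3N-2 systolic steps */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++) {
                    int k = t - i - j + 2;            /* the k arriving at (i,j) at step t */
                    if (k >= 1 && k <= N)
                        C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1];
                }
        printf("C[1][1] = %g\n", C[0][0]);            /* each cij = sum over k of aik*bkj */
        return 0;
    }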
51
Example 3: Matrix-Vector multiplication on a ring
N = 4
[Figure: ring of processors P1 P2 P3 P4. The vector values X4 X3 X2 X1 are pipelined in from the left, and row i of A is fed from the top into processor Pi.]
This algorithm requires N steps for a matrix-vector multiplication
52
Example 3: Matrix-Vector multiplication on a ring
Goal
Pipeline the data into the processors, so that n product terms are computed and added to the partial sums at each step.
Distribution of X on the processors:
– Xj, 1 ≤ j ≤ N, is assigned to processor N-j+1
This algorithm requires N steps for a matrix-vector multiplication
53
Example 3: Matrix-Vector multiplication on a ring
Another way to distribute the Xi over the processors and to input matrix A:
– row i of the matrix A is shifted (rotated) down i (mod n) times and entered into processor Pi
– Xi is assigned to processor Pi; at each step the Xi are shifted right (around the ring)
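
A sequential sketch of this rotated layout (assuming N = 4 and sample data): at step t, processor Pi holds the x value that has been shifted to it, and the rotation of row i supplies the matching matrix element, so every processor does useful work at each of the N steps:

    /* Matrix-vector product on a ring with rotated rows, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], X[N], Y[N] = {0};
        for (int i = 0; i < N; i++) {                 /* sample data */
            X[i] = i + 1;
            for (int j = 0; j < N; j++) A[i][j] = (i + 1) * 10 + (j + 1);
        }
        double x[N];                                  /* x[i] = value currently on Pi */
        for (int i = 0; i < N; i++) x[i] = X[i];      /* initially Xi is on Pi */
        for (int t = 0; t < N; t++) {                 /* N steps, all processors busy */
            for (int i = 0; i < N; i++) {
                int j = (i - t + N) % N;              /* index of the x currently on Pi */
                Y[i] += A[i][j] * x[i];               /* rotated row i supplies a(i,j) now */
            }
            double last = x[N - 1];                   /* shift the x values right on the ring */
            for (int i = N - 1; i > 0; i--) x[i] = x[i - 1];
            x[0] = last;
        }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }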
54
Example 3: Matrix-Vector multiplication on a ring
N = 4
[Figure: X1 X2 X3 X4 are assigned to P1 P2 P3 P4. Above each Pi, row i of A is stacked rotated down i times, so that the diagonal element aii is the first to enter Pi.]
55
Example 4: Sum of n = 2^d numbers on a d-hypercube
Assignment: xi is on processor Pi
Computation of S = Σ xi
[Figure: 3-dimensional hypercube with processors 0 … 7 holding X0 … X7.]
56
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 1: processors of the sub-cube 1XX send their data to the corresponding processors in sub-cube 0XX
[Figure: after step 1 the 0XX sub-cube holds the partial sums X0+X4, X1+X5, X2+X6, X3+X7.]
57
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 2: processors of the sub-cube 01X send their data to the corresponding processors in sub-cube 00X
[Figure: after step 2 processors 0 and 1 hold X0+X4+X2+X6 and X1+X5+X3+X7; active and idle processors are highlighted.]
58
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 3: processor 001 sends its data to processor 000
[Figure: S = X0+X4+X2+X6+X1+X5+X3+X7.]
The sum of the n numbers is stored on node P0
59
Example 4: Sum of n = 2^d numbers on a d-hypercube
Algorithm: Processor Pi
Input:  1) An array X of n = 2^d numbers; X[i] is assigned to processor Pi
        2) processor identity id
Output: S = X[0] + … + X[n-1], stored on processor P0
Processor Pi
Begin
    My_id = id   (My_id == i)
    S = X[i]
    For j = 0 to (d-1) do
        begin
            Partner = My_id XOR 2^j
            if (My_id AND 2^j) == 0 then begin receive(Si, Partner); S = S + Si end
            if (My_id AND 2^j) != 0 then begin send(S, Partner); exit end
        end
End
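
A sequential simulation of this reduction (assuming d = 3 and sample data); in each dimension j, the processor whose j-th bit is 0 receives and adds its partner's partial sum:

    /* Sum on a d-cube, simulated sequentially (C). */
    #include <stdio.h>
    #define D 3
    #define N (1 << D)
    int main(void) {
        double S[N];                                   /* S[i] = local sum of processor Pi */
        for (int i = 0; i < N; i++) S[i] = i + 1;      /* sample data X[i] = i + 1 */
        for (int j = 0; j < D; j++)                    /* one exchange per dimension */
            for (int i = 0; i < N; i++)
                if ((i & (1 << j)) == 0)               /* lower half of each pair receives */
                    S[i] += S[i ^ (1 << j)];           /* partner = i XOR 2^j sends its sum */
        printf("S = %g (expected %g)\n", S[0], (double)N * (N + 1) / 2);
        return 0;
    }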
Parallel Architecture
Message broadcast on network model
(ring, torus, hypercube)
61
Basic communication
Message broadcast
One-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube
All-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube
62
Communication cost
Message from Pi to Pj over l links:
    Communication cost = ts + tw * m * l
ts : message preparation (startup) time
m : message length
tw : unit (per byte) transfer time
l : number of links traversed by the message
63
Communication cost
Communication time bounds:
– Ring: (ts + tw m) * p/2
– Mesh: (ts + tw m) * (p^(1/2))/2
– Hypercube: (ts + tw m) * log2 p
The bound depends on the maximum number of links traversed by the message
64
One-to-All broadcast
Simple solution
P0 sends message M0 to processors P1, P2, …, Pp-1 successively.
[Figure: M0 goes from P0 to P1, then from P0 to P2, then from P0 to P3, …, finally from P0 to Pp-1; after the i-th send the set of processors holding M0 is (P0 P1 … Pi).]
Communication cost = (ts + tw m0) * Σ i = (ts + tw m0) * (p(p-1)/2)
65
One-to-all Broadcast
One processor sends a message M to all the other processors
[Figure: before, only processor 0 holds M; after the one-to-all broadcast, every processor 0 … p-1 holds M. The dual operation is accumulation.]
66
All-to-all broadcast
All-to-All broadcast: p simultaneous One-to-All broadcasts, where each processor Pi initiates a communication.
[Figure: before, processor i holds only Xi; after, every processor holds X0, X1, …, Xp-1. The dual operation is accumulation towards several nodes.]
Parallel Architecture
Examples of message broadcasts
68
Example 1: One-to-All broadcast on a ring
Each processor forwards the message to the next processor. Initially, the message is sent in both directions.
[Figure: ring of processors 0 … 7; the message leaves node 0 in both directions and needs at most ⌈p/2⌉ parallel steps to reach every node.]
Communication cost:
    T = (ts + tw * m) * ⌈p/2⌉, where p is the number of processors
Example 2: One-to-All broadcast on a Torus
Two phases : Phase 1 : One-to-All broacast on first row
0
4
8
12
1
5
9
13
2
6
10
14
3
7
11
15
1 2
2
70
Example 2: One-to-All broadcast on a Torus
Phase 2: parallel one-to-all broadcasts in the columns
[Figure: every node of the first row broadcasts down its column (steps 3 and 4), so that all 16 nodes receive the message.]
71
Example 2: One-to-All broadcast on a Torus
Communication cost:
Broadcast on the row: Tcom = (ts + tw m) * (p^(1/2))/2
Broadcast on the columns: Tcom = (ts + tw m) * (p^(1/2))/2
Total: T = 2 * (ts + tw * m) * (p^(1/2))/2, where p is the number of processors
72
Example 3: One-to-All broadcast on a Hypercube
Requires d steps. Each step doubles the number of active processors.
[Figure: 3-cube with nodes 0 … 7; the labels 1, 2, 3 on the edges indicate the step at which the message crosses that edge.]
Communication cost:
    T = (ts + tw * m) * log2 p, where p is the number of processors
73
Example 3: One-to-All broadcast on a Hypercube
Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube.
The broadcast can be performed in O(log n) steps as follows
[Figure: initial distribution of the data; only node 0 holds X.]
74
Example 3: One-to-All broadcast on a Hypercube
•Step 1: Processor Po sends X to processor P1•Step 2: Processors P0 and P1 send X to P2 and P3 respectively•Step 3: Processor P0, P1, P2 and P3 send X to P4, P5, P6 and P7
X
0 1
32
4 5
6 7
XStep 1 Step 2
0 1
32
4 5
6 7
X X
X X
X X
0 1
32
4 5
6 7
XX
X X
XX
Step 3
P Active processors P Idle processors
75
Example 3: One-to-All broadcast on a Hypercube
Algorithm for a broadcast of X on a d-hypercube (p = 2^d processors)
Input:  1) X assigned to processor P0  2) processor identity id
Output: all processors Pi contain X
Processor Pi
Begin
    If i = 0 then B = X
    My_id = id   (My_id == i)
    For j = 0 to (d-1) do
        if My_id < 2^(j+1) then begin
            Partner = My_id XOR 2^j
            if My_id > Partner then receive(B, Partner)
            if My_id < Partner then send(B, Partner)
        end
End
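
A sequential simulation of this broadcast (assuming d = 3): in dimension j, the processors whose index is below 2^(j+1) pair up, and the higher member of each pair receives X from the lower one:

    /* One-to-all broadcast on a d-cube, simulated sequentially (C). */
    #include <stdio.h>
    #define D 3
    #define N (1 << D)
    int main(void) {
        int has[N] = {0};                              /* has[i] != 0 iff Pi already holds X */
        double B[N];
        B[0] = 42.0; has[0] = 1;                       /* X initially on P0 */
        for (int j = 0; j < D; j++)                    /* one exchange per dimension */
            for (int id = 0; id < N; id++) {
                int partner = id ^ (1 << j);
                if (id < (1 << (j + 1)) && id > partner) {   /* receiver of this step */
                    B[id] = B[partner];
                    has[id] = 1;
                }
            }
        for (int i = 0; i < N; i++)
            printf("P%d holds X: %s\n", i, has[i] ? "yes" : "no");
        return 0;
    }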
76
All-to-all Broadcast on a ring
Step 1
[Figure: ring of processors 0 … 7. Each processor i sends its own message (i) to its neighbour; after step 1 each processor also holds the message of its predecessor.]
77
All-to-all Broadcast on a ring
Step 2
[Figure: state after step 1 — processor i holds (i, i-1); at step 2 each processor forwards the message it received at step 1.]
78
All-to-all Broadcast on a ring
Step 3
[Figure: state after step 2 — processor i holds (i, i-1, i-2); at step 3 each processor forwards the message it received at step 2.]
79
All-to-all Broadcast on a ring
Step 7
[Figure: after the last step (p-1 = 7), every processor holds all eight messages (i, i-1, …, i-7).]
80
All-to-all Broadcast on a 2-dimensional Torus
Two phases
– Phase 1: All-to-all broadcast on each row. After this phase, each processor Pi holds a message of size Mi = (p^(1/2)) m
– Phase 2: All-to-all broadcast in the columns
81
All-to-all Broadcast
[Figure: 3×3 mesh, processors 0 … 8, each holding its own message (i). Start of phase 1: all-to-all broadcast on the rows.]
82
All-to-all Broadcast
[Figure: after phase 1, the processors of each row all hold that row's messages: (0,1,2), (3,4,5), (6,7,8). Start of phase 2: all-to-all broadcast on the columns.]
83
All-to-all Broadcast
Communication cost = cost of phase 1 + cost of phase 2
    = (p^(1/2) - 1)(ts + tw m) + (p^(1/2) - 1)(ts + tw (p^(1/2)) m)
Parallel Algorithms and Computing
Selected topics
Sorting in Parallel
85
Performance measures
Speedup
Efficiency
Work-Time
Amdahl's law
86
Speedup
Speedup: S(p), where p is the number of processors used by the parallel solution
    S(p) = T(1) / T(p)
T(1): sequential execution time; T(p): parallel execution time with p processors
– S(p) < 1 : poor performance (the parallel solution is worse than the sequential one)
– 1 < S(p) ≤ p : normal speedup
– p < S(p) : hyper-speedup (not very frequent)
[Figure: S(p) versus p; the ideal speedup is the line S(p) = p.]
87
Speedup
Is hyper-speedup normal? It can come from a poor (non-optimal) sequential algorithm, or because storage space (memory) is a factor.
88
Efficiency
Efficiency: E(p) = S(p) / p
– 0 < E(p) ≤ 1 : normal
– 1 < E(p) : hyper-speedup
Interest: speedup is the user's point of view; efficiency is the manager's point of view; speedup and efficiency together are the designer's point of view.
89
Amdahl's law
A program consists of two parts: a sequential part and a parallel part.
    T(1) = Tseq = Ts + Tp
    T(p) = Ts + Tp / p
90
Amdahl's law
Bound on speedup
    S(p) = T(1) / T(p) = Tseq / (Ts + Tp / p)
    S(p) = 1 / (Ts/Tseq + (1 - Ts/Tseq) / p)
91
Amdahl's law
Bound on speedup: sequential fraction (fs) and parallel fraction (fp):
    fs = Ts / T(1), fp = Tp / T(1), 0 ≤ fs, fp ≤ 1, fs + fp = 1 (with T(1) = Tseq)
The speedup can be rewritten as
    S(p) = 1 / (fs + (1 - fs) / p)
92
Amdahl's law
Bound on speedup
    S(p) = 1 / (fs + (1 - fs) / p)
    lim (p → ∞) S(p) = lim (p → ∞) 1 / (fs + (1 - fs)/p) = 1 / fs
    S(p) ≤ 1 / fs
93
Amdahl's law
Bound on speedup
For example, if fs is equal to 1%, S(p) is less than 100.
[Figure: S(p) versus p; the curve saturates at the asymptote 1/fs.]
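
A quick numerical check of the bound (a small C sketch, with fs = 0.01 as in the example):

    /* Amdahl's speedup bound for a sequential fraction fs = 0.01. */
    #include <stdio.h>
    int main(void) {
        double fs = 0.01;                            /* sequential fraction */
        int procs[] = {1, 10, 100, 1000, 10000};
        for (int i = 0; i < 5; i++) {
            double p = procs[i];
            double s = 1.0 / (fs + (1.0 - fs) / p);  /* S(p) = 1 / (fs + (1 - fs)/p) */
            printf("p = %5d  S(p) = %7.2f\n", procs[i], s);
        }
        printf("Asymptotic bound 1/fs = %.1f\n", 1.0 / fs);
        return 0;
    }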
94
Amdahl's law
The above speedup bound does not take communication and synchronization overheads into account.
Overhead:
    T(p) = Ts + Tp / p + Toverhead
    Sreal(p) ≤ S(p)
95
Parallel sorting
Types of sorting algorithms: properties
– processor ordering determines the order of the final result
– where the input and output are stored
– basic compare-exchange operation
96
Issues in sorting algorithms
Internal / External sort:
– Internal: the data fits in the processor memory (RAM). Performance is based on comparisons and basic operations; complexity O(n log n).
– External: the data resides in memory and on disk. Performance is based on basic operations and on overlapping computation with I/O.
97
Issues in sorting algorithms
Comparison-based: executes comparisons and permutations.
Non-comparison-based: ordering based on properties of the keys.
98
Issues in sorting algorithms
Internal sort (shared memory: PRAM)
– the processors share the data and must minimize memory access conflicts
– each processor sorts part of the data in memory
99
Issues in sorting algorithms
Internal sort (distributed memory)
– each processor is assigned a block of N/P elements
– each processor locally sorts its assigned block (using any internal sort algorithm)
Input: distributed among the processors. Output: stored on the processors.
Final order: the processor order defines the final ordering of the list.
[Figure: the initial data is split into blocks of N/P elements per processor; P1 < P2 < P3 in the final ordering.]
100
Issues in sorting algorithms
Internal sort (distributed memory)
Example: the final order is defined by the Gray-code labelling of the processors
[Figure: 3-cube whose nodes (0) … (7) are visited in Gray-code order 1, 2, 3, 4, 5, ….]
101
Issues in sorting algorithms
Building block: the compare-exchange operation
– Sequential: a single CPU holds (ai, aj) in RAM, tests (ai < aj)?, and swaps ai ↔ aj if they are out of order.
– Parallel: ai is on processor P(i) and aj on P(i+1). The two processors exchange values; P(i) keeps ai = min(ai, aj) (Exchange-Compare-Min with its right neighbour) and P(i+1) keeps aj = max(ai, aj) (Exchange-Compare-Max with its left neighbour).
102
Issues in sorting algorithms
Compare-exchange with N/p elements per processor
[Example: P(i) holds the sorted block 1 6 8 11 13 62 and P(i+1) holds 2 7 9 10 12 63. The processors exchange their blocks and each merges the two; P(i) keeps the N/p smallest elements (Exchange-compare-min with P(i+1)) and P(i+1) keeps the N/p largest elements (Exchange-compare-max with P(i-1)).]
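
A minimal compare-split sketch (assuming a block size of 6 and the sample values above): each side merges the two sorted blocks, then the left side keeps the smaller half and the right side the larger half:

    /* Compare-split of two sorted blocks of N/p elements (C). */
    #include <stdio.h>
    #define B 6
    static void compare_split(double *lo, double *hi) {
        double merged[2 * B];
        int a = 0, b = 0;
        for (int k = 0; k < 2 * B; k++)          /* merge the two sorted blocks */
            merged[k] = (b >= B || (a < B && lo[a] <= hi[b])) ? lo[a++] : hi[b++];
        for (int k = 0; k < B; k++) {            /* split the merged sequence   */
            lo[k] = merged[k];                   /* smallest B stay on P(i)     */
            hi[k] = merged[B + k];               /* largest B go to P(i+1)      */
        }
    }
    int main(void) {
        double pi[B]  = {1, 6, 8, 11, 13, 62};   /* block on P(i)   */
        double pi1[B] = {2, 7, 9, 10, 12, 63};   /* block on P(i+1) */
        compare_split(pi, pi1);
        for (int k = 0; k < B; k++) printf("%g ", pi[k]);
        printf("| ");
        for (int k = 0; k < B; k++) printf("%g ", pi1[k]);
        printf("\n");
        return 0;
    }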
Example: Odd-Even Merge Sort
1. Start from an unsorted list of n elements.
2. Divide the list into two lists of n/2 elements: A0 A1 … AM-1 and B0 B1 … BM-1.
3. Sort each sub-list.
4. Divide each sorted sub-list into its even- and odd-index sub-lists: A0 A2 … AM-2 and A1 A3 … AM-1 (similarly for B).
5. Merge-sort the odd and even sub-lists into E0 E1 … EM-1 and O0 O1 … OM-1.
6. Interleave the two lists as E0 O0 E1 O1 … EM-1 OM-1 and exchange the out-of-position elements.
Where is the parallelism? The independent sub-lists (1×N, then 2×N/2, then 4×N/4, …) can be handled by different processors.
105
Example: Odd-Even Merge Sort
Key to the merge-sort algorithm: the method used to merge the sorted sub-lists.
Consider 2 sorted lists of m = 2^k elements:
A = a0, a1, …, am-1 and B = b0, b1, …, bm-1
Even(A) = a0, a2, …, am-2 , Odd(A) = a1, a3, …, am-1
Even(B) = b0, b2, …, bm-2 , Odd(B) = b1, b3, …, bm-1
106
Example: Odd-Even Merge Sort
Create 2 merged lists:
– merge Even(A) and Odd(B) into E = E0 E1 … Em-1
– merge Even(B) and Odd(A) into O = O0 O1 … Om-1
Merge E and O as follows to create a list L':
    L' = E0 O0 E1 O1 … Em-1 Om-1
Exchange the out-of-order elements of L' to obtain L.
107
Example: Odd-Even Merge Sort
A = 2, 3, 4, 8 and B = 1, 5, 6, 7
Even(A) = 2, 4 and Odd(A) = 3, 8; Even(B) = 1, 6 and Odd(B) = 5, 7
E = 2, 4, 5, 7 and O = 1, 3, 6, 8
L' = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8
L = 1, 2, 3, 4, 5, 6, 7, 8
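
A small C sketch of this merging scheme, using the example values above (the interleave-then-exchange step only compares adjacent pairs, as on the slide):

    /* Odd-even merge of two sorted lists of m = 4 elements (C). */
    #include <stdio.h>
    #define M 4
    static void merge(const double *x, const double *y, int n, double *out) {
        int a = 0, b = 0;
        for (int k = 0; k < 2 * n; k++)
            out[k] = (b >= n || (a < n && x[a] <= y[b])) ? x[a++] : y[b++];
    }
    int main(void) {
        double A[M] = {2, 3, 4, 8}, B[M] = {1, 5, 6, 7};
        double evenA[M/2], oddA[M/2], evenB[M/2], oddB[M/2];
        for (int i = 0; i < M / 2; i++) {
            evenA[i] = A[2*i]; oddA[i] = A[2*i + 1];
            evenB[i] = B[2*i]; oddB[i] = B[2*i + 1];
        }
        double E[M], O[M], L[2 * M];
        merge(evenA, oddB, M / 2, E);            /* E = 2 4 5 7 */
        merge(evenB, oddA, M / 2, O);            /* O = 1 3 6 8 */
        for (int i = 0; i < M; i++) {            /* interleave: E0 O0 E1 O1 ... */
            L[2*i] = E[i];
            L[2*i + 1] = O[i];
        }
        for (int i = 0; i + 1 < 2 * M; i += 2)   /* exchange out-of-position pairs */
            if (L[i] > L[i+1]) { double t = L[i]; L[i] = L[i+1]; L[i+1] = t; }
        for (int i = 0; i < 2 * M; i++) printf("%g ", L[i]);
        printf("\n");
        return 0;
    }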
Parallel sorting
Quicksort
109
Review: Quicksort
Recall: sequential Quicksort
Recursively:
• choose a pivot
• divide the list in two using the pivot
• sort the left and right sub-lists
Performance:
– best case: log n levels, O(n log n) steps
– worst case: n(n-1)/2 ≈ n²/2 comparisons, i.e. O(n²)
110
Review: Quicksort
Sequential Quicksort

/* swap A[i] and A[j] */
void exchange(double *A, int i, int j) { double t = A[i]; A[i] = A[j]; A[j] = t; }

void Quicksort(double *A, int q, int r)
{
    int s, i;
    double pivot;
    if (q < r) {
        /* divide A using the pivot */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* recursive calls to sort the two sub-lists */
        Quicksort(A, q, s - 1);
        Quicksort(A, s + 1, r);
    }
}
111
Review: Quicksort
Create a binary tree of processors, one new processor for each recursive call of Quicksort.
Easy to implement, but can be inefficient performance-wise.
112
Review: Quicksort
Shared-memory implementation (using the fork() primitive, from <unistd.h>)

double A[nmax];

void quicksort(int q, int r)
{
    int s, i, n;
    double pivot;
    if (q < r) {
        /* partition */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* create a new process for one of the recursive calls */
        n = fork();
        if (n == 0)
            quicksort(q, s - 1);      /* child sorts the left sub-list  */
        else
            quicksort(s + 1, r);      /* parent sorts the right sub-list */
    }
}
113
Quicksort on a d-hypercube
d steps: all processors are active in each step. A processor is assigned N/p elements (p = 2^d).
Steps of the solution:
– initially (step 0), one pivot is chosen and broadcast to all processors
– each processor partitions its elements into two sub-lists: one with the elements smaller (inferior) than the current pivot, the other with the elements greater or equal (superior)
– exchange the inferior and superior sub-lists along the current dimension (the highest dimension at step 0), creating two sub-cubes, one for the inferior lists and the other for the superior lists
– each processor merges the list it kept with the list it received
– repeat within each sub-cube
114
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 0: pivot P0
Division along dimension 3. Two blocks of elements are created:
• one block of elements less than pivot P0 (sub-cube 0XX)
• one block of elements greater than or equal to P0 (sub-cube 1XX)
[Figure: the cube 000 … 111 split into the sub-cubes 0XX and 1XX.]
115
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 1: pivot P10 in sub-cube 0XX, pivot P11 in sub-cube 1XX
Division along dimension 2: each sub-cube is divided into two smaller sub-cubes (00X / 01X and 10X / 11X), separating the elements < pivot from the elements ≥ pivot.
[Figure: the two sub-cubes after the step-1 exchanges.]
116
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 2: pivots P20, P21, P22, P23 (one per 2-node sub-cube)
Division along dimension 1. The final order is defined by the label ordering of the processors.
[Figure: each pair of neighbouring processors (000/001, 010/011, 100/101, 110/111) separates the elements < pivot from the elements ≥ pivot.]
117
Quicksort on a d-hypercube
Example on a 3-hypercube
Final step: each processor sorts its final list, using for example a sequential quicksort.
[Figure: each node applies a local sort; some nodes may end up with an empty list {}.]
118
Quicksort on a d-hypercube
Data exchange at the initial step: sub-cubes 0XX and 1XX
[Figure: the pivot P0 is broadcast to both sub-cubes; every processor splits its list into a part < P0 and a part ≥ P0; the processors of 0XX exchange their superior sub-lists against the inferior sub-lists of their partners in 1XX.]
Sort the sub-lists at the end of each step?
119
Quicksort on a d-hypercube
Algorithm: Processor k (k = 0, …, p-1)
Hypercube-Quicksort(B, d) {
    /* B contains the elements assigned to processor k */
    /* d is the hypercube dimension */
    int i; double x, B1[], B2[], T[];
    my-id = k;                         /* processor id */
    for (i = d-1 down to 0) {
        x = pivot(my-id, i);
        partition(B, x, B1, B2);       /* B1 inferior sub-list, B2 superior sub-list */
        if ((my-id AND 2^i) == 0) {    /* i-th bit is 0 */
            send(B2, my neighbour in dimension i);
            receive(T, my neighbour in dimension i);
            B = B1 ∪ T;
        } else {
            send(B1, my neighbour in dimension i);
            receive(T, my neighbour in dimension i);
            B = B2 ∪ T;
        }
    }
    Sequential-Quicksort(B);
} End Hypercube-Quicksort
120
Quicksort on a d-hypercube
Choice of pivot
More important for performance than in the sequential case. It has a great impact on:
– the load balance between processors
– the performance of the algorithm (the performance degrades quickly with a bad pivot)
121
Quicksort on a d-hypercube
Worst case: at step 0, the largest element of the list is selected as the pivot
    Pivot0 = x = max{ Xi }
[Figure: after the exchange, half of the processors hold almost all the elements and are overloaded, while the other half hold empty lists {} and stay idle.]
122
Quicksort on a d-hypercube
Choice of pivot: ideal case
In parallel do:
– sort the initial list assigned to each processor
– choose the median element of one of the processors of the cube
– assuming a uniform distribution of the elements of the list
[Figure: the median element of the list assigned to a processor Pi approximates the median element of the whole list.]
123
Quicksort on a d-hypercube
Steps of the algorithm and their time complexity:
– local sort of the assigned list: O((N/p) log(N/p))
– selection of the pivot by one processor: O(1 * d)
– broadcast of the pivot in the sub-hypercube of dimension d-i: O(Σ i, i = 1..d) = O(d²) = O(log² p)
– division based on the pivot (binary search): O(d * log(N/p))
– exchange of sub-lists between neighbours: O(d * N/p)
– merge of the sorted sub-lists: O(d * N/p)
– repeat (d = log p iterations in total)
124
Parallel Quicksort on a PRAM
Parallel QUICKSORT algorithm
/* The solution constructs a binary tree of processors which is traversed IN-ORDER to yield the sorted list */
Variables shared by all processors:
    root : root of the global binary tree
    A[n] : an array of n elements (1, 2, …, n)
    Leftchild[i] : the root of the left sub-tree of processor i (i = 1, 2, …)
    Rightchild[i] : the root of the right sub-tree of processor i (i = 1, 2, …)
125
Parallel Quicksort on a PRAM
Process /* do in parallel for each processor i */
begin
    Root := i;        /* concurrent write: one processor wins and becomes the root */
    Parent := Root;
    Leftchild[i] := Rightchild[i] := n+1;
end
Repeat for each processor i ≠ root do
begin
    if (A[i] < A[Parent]) then        /* ties can be broken using the processor index */
    begin
        Leftchild[Parent] := i        /* concurrent write */
        if i = Leftchild[Parent] then exit
        else Parent := Leftchild[Parent]
    end
    else
    begin
        Rightchild[Parent] := i       /* concurrent write */
        if i = Rightchild[Parent] then exit
        else Parent := Rightchild[Parent]
    end
end repeat
end process
126
Parallel Quicksort on a PRAM
Example
A = 33 21 13 54 82 33 40 72, held by processors 1 … 8; root = processor 4 (key 54)
Step 0: the binary tree contains only the root [4]{54}; Leftchild[i] = Rightchild[i] = 9 (= n+1) for every i.
127
Parallel Quicksort on a PRAM
Example (root = processor 4)
Step 1: processor 1 wins the competition for the left sub-tree of 4 and processor 5 wins it for the right sub-tree: Leftchild[4] = 1, Rightchild[4] = 5.
[Figure: tree with root [4]{54}, left child [1]{33} and right child [5]{82}; processors 2, 3, 6, 7, 8 keep competing.]
128
Parallel Quicksort on a PRAM
Example (root = processor 4)
[Figure: next step; processors 2, 3, 6, 7 and 8 compete for the sub-trees of the nodes inserted so far, updating the Leftchild and Rightchild arrays.]
129
Parallel Quicksort on a PRAM
Example
[Figure: binary tree after the following steps: root [4]{54} with left child [1]{33} and right child [5]{82}; [2]{21} attaches under [1]; [6]{33} and [8]{72} attach under [5]; processors 3 and 7 continue.]