
Parallel Algorithms and Computing

Selected topics

Parallel Architecture


References

An introduction to parallel algorithms, Joseph JaJa

Introduction to parallel computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis

Parallel sorting algorithms, Selim G. Akl

Models

Three models:

Graphs (DAG: Directed Acyclic Graph)

Parallel Random Access Machine (PRAM)

Network

Graphs

Not studied here

Parallel Architecture

Parallel Random Access Machine

Parallel Random Access Machine

Flynn classifies parallel machines based on:
– data flow
– instruction flow

Each flow can be:
– single
– multiple

Parallel Random Access Machine

Flynn classification

                              Data flow
                         SINGLE     MULTIPLE
Instruction   SINGLE      SISD        SIMD
flow          MULTIPLE    MISD        MIMD

Parallel Random Access Machine

Extends the traditional RAM (Random Access Machine) with:

Multiple processors

An interconnection network between the global memory and the processors

[Diagram: processors P1, P2, …, Pp connected to a global (shared) memory]

Parallel Random Access Machine

Characteristics

Processors Pi (0 ≤ i ≤ p-1)
– each with a local memory
– i is a unique identity for processor Pi

A global shared memory
– it can be accessed by all processors

Parallel Random Access Machine

Types of operation:

Synchronous
– processors work in lock step
– at each step, a processor is either active or idle
– suited for SIMD and MIMD architectures

Asynchronous
– processors have local clocks
– the processors need to be synchronized explicitly
– suited for MIMD architectures

Parallel Random Access Machine

Example of synchronous operation

Algorithm: processor i (i = 0 … 3)
Input: A, B; i = processor id
Output: C
Begin
  if (B == 0) then C = A
  else C = A / B
End
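A minimal C sketch of this lock-step behavior, using the per-processor values of the example that follows (the loop index stands in for the processor id; values and names are illustrative):

#include <stdio.h>

int main(void)
{
    double A[] = {7, 2, 4, 5};    /* per-processor inputs, as in the example below */
    double B[] = {0, 1, 2, 0};
    double C[] = {0, 0, 0, 0};
    int i, p = 4;

    for (i = 0; i < p; i++)       /* step 1: processors with B == 0 are active */
        if (B[i] == 0) C[i] = A[i];
    for (i = 0; i < p; i++)       /* step 2: processors with B != 0 are active */
        if (B[i] != 0) C[i] = A[i] / B[i];

    for (i = 0; i < p; i++)
        printf("P%d: C = %g\n", i, C[i]);
    return 0;
}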

Parallel Random Access Machine

Initial state:
  P3: A = 7, B = 0, C = 0
  P2: A = 2, B = 1, C = 0
  P1: A = 4, B = 2, C = 0
  P0: A = 5, B = 0, C = 0

Step 1 (then branch; processors with B = 0 are active):
  P3: A = 7, B = 0, C = 7 (active, B = 0)
  P2: A = 2, B = 1, C = 0 (idle, B ≠ 0)
  P1: A = 4, B = 2, C = 0 (idle, B ≠ 0)
  P0: A = 5, B = 0, C = 5 (active, B = 0)

Step 2 (else branch; processors with B ≠ 0 are active):
  P3: A = 7, B = 0, C = 7 (idle, B = 0)
  P2: A = 2, B = 1, C = 2 (active, B ≠ 0)
  P1: A = 4, B = 2, C = 2 (active, B ≠ 0)
  P0: A = 5, B = 0, C = 5 (idle, B = 0)

Parallel Random Access Machine

Read / write conflicts

EREW: Exclusive-Read, Exclusive-Write
– no concurrent operation (read or write) on a variable

CREW: Concurrent-Read, Exclusive-Write
– concurrent reads allowed on the same variable
– exclusive writes only

Parallel Random Access Machine

ERCW: Exclusive-Read, Concurrent-Write

CRCW: Concurrent-Read, Concurrent-Write

Parallel Random Access Machine

Concurrent write on a variable X:

Common CRCW: the write succeeds only if all processors write the same value on X

SUM CRCW: write the sum of all the written values on X

Random CRCW: choose one processor at random and write its value on X

Priority CRCW: the processor with the highest priority writes on X

Parallel Random Access Machine

Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X)

Common CRCW or ERCW: failure

SUM CRCW: X is the sum (180) of the written values

Random CRCW: final value of X ∈ { 50, 60, 70 }
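As a hedged illustration, the four write-resolution policies can be simulated sequentially in C (values, the random choice, and the priority rule are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double vals[] = {50, 60, 70};   /* values written by P1, P2, P3 */
    int n = 3, i, ok = 1;
    double X = 0;

    for (i = 1; i < n; i++)         /* Common CRCW: all values must agree */
        if (vals[i] != vals[0]) ok = 0;
    printf("Common CRCW: %s\n", ok ? "ok" : "failure");

    for (i = 0; i < n; i++) X += vals[i];
    printf("SUM CRCW: X = %g\n", X);          /* 180 */

    X = vals[rand() % n];                     /* Random CRCW: one writer wins */
    printf("Random CRCW: X = %g\n", X);

    X = vals[0];                              /* Priority CRCW: assume P1 has
                                                 the highest priority */
    printf("Priority CRCW: X = %g\n", X);
    return 0;
}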

Parallel Random Access Machine

Basic input/output operations

On global memory:
– global read (X, x)
– global write (Y, y)

On local memory:
– read (X, x)
– write (Y, y)

Example 1: Matrix-Vector product

Matrix-vector product

Y = AX
– A is an n×n matrix
– X = [x1, x2, …, xn] is a vector of n elements
– p processors (p ≤ n) and r = n/p

Each processor is assigned a block of r = n/p rows of A

Example 1: Matrix-Vector product

[Diagram: Y = AX in global memory, with Y = [y1, …, yn], A = [ai,j] (n×n) and X = [x1, …, xn]; processors P1, P2, …, Pp all access the global memory]

Example 1: Matrix-Vector product

Partition A into p blocks Ai of r = n/p consecutive rows each:
A1 = rows 1 … r, A2 = rows r+1 … 2r, …, Ap = rows (p-1)r+1 … pr

Compute the p partial products in parallel:
processor Pi computes the partial product Yi = Ai · X

Example 1: Matrix-Vector product

Processor Pi computes Yi = Ai · X:
– P1 computes [y1, …, yr] from rows 1 … r of A and X
– P2 computes [yr+1, …, y2r] from rows r+1 … 2r of A and X
– …
– Pp computes [y(p-1)r+1, …, ypr] from rows (p-1)r+1 … pr of A and X

Example 1: Matrix-Vector product

The solution requires:

p concurrent reads of the vector X

Each processor Pi makes an exclusive read of block Ai = A[((i-1)r + 1) : ir, 1:n]

Each processor Pi makes an exclusive write on block Yi = Y[((i-1)r + 1) : ir]

Required architecture: PRAM CREW

Example 1: Matrix-Vector product

Algorithm: processor Pi (i = 1, 2, …, p)
Input:
  A: n×n matrix in global memory
  X: a vector in global memory
Output:
  Y = AX (Y is a vector in global memory)
Local variables:
  i: processor id of Pi; p: number of processors; n: dimension of A and X
Begin
  1. global read (X, z)
  2. global read (A((i-1)r + 1 : ir, 1:n), B)
  3. compute w = B·z
  4. global write (w, Y((i-1)r + 1 : ir))
End
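A sequential C sketch of this block-row scheme (the outer loop plays the role of the p processors; the function name is illustrative, and it assumes p divides n and row-major storage of A):

void block_row_matvec(int n, int p, const double A[], const double X[], double Y[])
{
    int r = n / p;                       /* r = n/p rows per processor */
    for (int i = 0; i < p; i++)          /* one iteration = one processor Pi */
        for (int row = i * r; row < (i + 1) * r; row++) {
            double w = 0.0;
            for (int j = 0; j < n; j++)  /* concurrent read of X, exclusive read of Ai */
                w += A[row * n + j] * X[j];
            Y[row] = w;                  /* exclusive write on block Yi */
        }
}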

Example 1: Matrix-Vector product

Analysis

Computation cost
– Line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations, with r = n/p)

Communication cost
– Line 1: O(n) numbers transferred from global to local memory by Pi
– Line 2: O(n²/p) numbers transferred from global to local memory by Pi
– Line 4: O(n/p) numbers transferred from local to global memory by Pi

Overall: the algorithm runs in O(n²/p) time

Example 1: Matrix-Vector product

Another way to partition the matrix is vertically: A and X are split into blocks
– A1, A2, …, Ap
– X1, X2, …, Xp

Solution in two phases:
– compute the partial products Z1 = A1X1, …, Zp = ApXp
– synchronize the processors
– add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp

Example 1: Matrix-Vector product

[Diagram: processor P1 multiplies the first r columns of A by X1 … Xr; …; processor Pp multiplies the last r columns by X(p-1)r+1 … Xpr; a synchronization follows before the partial results are added]

Example 1: Matrix-Vector product

Algorithm: processor Pi (i = 1, 2, …, p)
Input:
  A: n×n matrix in global memory
  X: a vector in global memory
Output:
  Y = AX (Y: vector in global memory)
Local variables:
  i: processor id of Pi; p: number of processors; n: dimension of A and X
Begin
  1. global read (X((i-1)r + 1 : ir), z)
  2. global read (A(1:n, (i-1)r + 1 : ir), B)
  3. compute w = B·z
  4. synchronize the processors Pi (i = 1, 2, …, p)
  5. global write (w, Y((i-1)r + 1 : ir))
End

Note: after the synchronization, the partial products Z1 … Zp still have to be added to form Y, as described on the previous slide.
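A matching sequential sketch of the two-phase column version (the Z array holds the partial products Z1 … Zp; the comment marks where step 4's synchronization would sit; the name and layout are assumptions, and p is assumed to divide n):

#include <stdlib.h>

void block_col_matvec(int n, int p, const double A[], const double X[], double Y[])
{
    int r = n / p;
    double *Z = calloc((size_t)p * n, sizeof *Z);   /* Z + i*n holds Zi */
    for (int i = 0; i < p; i++)                     /* phase 1: Zi = Ai * Xi */
        for (int row = 0; row < n; row++)
            for (int j = i * r; j < (i + 1) * r; j++)
                Z[i * n + row] += A[row * n + j] * X[j];
    /* step 4: synchronization point -- all partial products are now ready */
    for (int row = 0; row < n; row++) {             /* phase 2: Y = Z1 + ... + Zp */
        Y[row] = 0.0;
        for (int i = 0; i < p; i++)
            Y[row] += Z[i * n + row];
    }
    free(Z);
}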

Example 1: Matrix-Vector product

Analysis: work out the details. Overall, the algorithm runs in O(n²/p) time

Example 2: Sum on the PRAM model

An array A of n = 2^k numbers

A PRAM machine with n processors

Compute S = A(1) + A(2) + … + A(n)

Construct a binary tree to compute the sum in log2 n time

Example 2: Sum on the PRAM model

[Figure: summation tree over processors P1 … P8.
Level 1: Pi sets B(i) = A(i).
Level h > 1: Pi computes B(i) = B(2i-1) + B(2i); the number of active processors halves at each level, until S = B(1) is obtained on P1]

Example 2: Sum on the PRAM model

Algorithm: processor Pi (i = 1, 2, …, n)
Input:
  A: array of n = 2^k elements in global memory
Output:
  S, where S = A(1) + A(2) + … + A(n)
Local variables:
  n; i: identity of processor Pi
Begin
  1. global read (A(i), a)
  2. global write (a, B(i))
  3. for h = 1 to log n do
       if (i ≤ n / 2^h) then
         global read (B(2i-1), x)
         global read (B(2i), y)
         z = x + y
         global write (z, B(i))
  4. if i = 1 then global write (z, S)
End
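A C sketch of the summation tree (a sequential simulation: the inner loop covers the processors active at the current level; B is used 1-indexed, so the caller passes an array with B[1..n] filled and n a power of two; the function name is illustrative):

double pram_tree_sum(double B[], int n)
{
    /* level h: processors 1 .. n/2^h compute B(i) = B(2i-1) + B(2i) */
    for (int active = n / 2; active >= 1; active /= 2)
        for (int i = 1; i <= active; i++)
            B[i] = B[2 * i - 1] + B[2 * i];
    return B[1];                           /* S = B(1) */
}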

Parallel Architecture

Network model


Network model

Characteristics

The communication structure is important

The network can be seen as a graph G = (N, E):
– a node i ∈ N is a processor
– an edge (i, j) ∈ E represents a two-way communication link between processors i and j

Basic communication operations:
– send (X, Pi)
– receive (X, Pi)

No global shared memory

Network model

[Diagram: linear array of n processors P1 – P2 – … – Pn]

[Diagram: ring of n processors: a linear array plus a link between P1 and Pn]

Network model

[Diagram: n×n grid of n² processors P11 … Pnn]

[Diagram: torus of n² processors: a grid whose rows and columns are closed into n rings]

Network model

[Diagram: hypercube of n = 2^k processors; for k = 3, nodes P0 … P7 with a link between nodes whose labels differ in exactly one bit]


Example 1: Matrix-Vector product on a linear array

A = [aij] an n×n matrix, i, j ∈ [1, n]; X = [xi], i ∈ [1, n]

Compute Y = AX, where yi = Σ (j = 1 … n) aij · xj

Example 1: Matrix-Vector product on a linear array

Systolic array algorithm for n = 4

[Figure: linear array P1 – P2 – P3 – P4; the stream x1, x2, x3, x4 enters P1 from the left, and row i of A (ai1 … ai4) feeds processor Pi from the top, staggered so that aij arrives at step i + j - 1]

Example 1: Matrix-Vector product on a linear array

At step j, xj enters processor P1. At each step, processor Pi receives (when possible) a value from its left neighbor and a value from the top, and updates its partial sum as follows:

yi = yi + aij · xj, j = 1, 2, 3, …

The values xj and aij reach processor Pi at the same time, at step (i + j - 1):
– (x1, a11) reach P1 at step 1 = (1 + 1 - 1)
– (x3, a13) reach P1 at step 3 = (1 + 3 - 1)

In general, yi is completed at step n + i - 1

Example 1: Matrix-Vector product on a linear array

The computation is complete when x4 and a44 reach processor P4, at step n + n - 1 = 2n - 1

Conclusion: the algorithm requires (2n - 1) steps; at each step, every active processor performs one addition and one multiplication

Complexity of the algorithm: O(n)
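A sequential C replay of the systolic schedule (processor Pi consumes aij and xj exactly at step i + j - 1; A is row-major and the indices are shifted to 0-based; the function name is illustrative):

void systolic_matvec(int n, const double a[], const double x[], double y[])
{
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int t = 1; t <= 2 * n - 1; t++)      /* the 2n - 1 steps */
        for (int i = 1; i <= n; i++) {
            int j = t - i + 1;                /* step t = i + j - 1 */
            if (j >= 1 && j <= n)             /* otherwise Pi is idle at step t */
                y[i - 1] += a[(i - 1) * n + (j - 1)] * x[j - 1];
        }
}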

Example 1: Matrix-Vector product on a linear array

[Figure: the seven steps of the systolic algorithm for n = 4; the xj values advance one processor to the right per step, and after step 7 each Pi holds yi = Σ (j = 1 … 4) aij · xj]

Example 1: Matrix-Vector product on a linear array

Systolic array algorithm: time-cost analysis

Step 1: 1 add, 1 mult; active: P1; idle: P2, P3, P4
Step 2: 2 adds, 2 mults; active: P1, P2; idle: P3, P4
Step 3: 3 adds, 3 mults; active: P1, P2, P3; idle: P4
Step 4: 4 adds, 4 mults; active: P1, P2, P3, P4; idle: none
Step 5: 3 adds, 3 mults; active: P2, P3, P4; idle: P1
Step 6: 2 adds, 2 mults; active: P3, P4; idle: P1, P2
Step 7: 1 add, 1 mult; active: P4; idle: P1, P2, P3


Example 2: Matrix multiplication on a 2-D n×n mesh

Given two n×n matrices A = [aij] and B = [bij], i, j ∈ [1, n],

compute the product C = AB, where cij = Σ (k = 1 … n) aik · bkj

Example 2: Matrix multiplication on a 2-D n×n mesh

At step i, row i of A (starting with ai1) is entered from the top into column i of the mesh (into processor P1i)

At step j, column j of B (starting with b1j) is entered from the left into row j of the mesh (into processor Pj1)

The values aik and bkj reach processor Pji at step (i + j + k - 2); at the end of this step, aik is sent down and bkj is sent right
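The schedule can be checked with a sequential C replay (aik and bkj are consumed on the same processor exactly at step i + j + k - 2; matrices are row-major, indices shifted to 0-based; the function name is illustrative):

void systolic_mesh_matmul(int n, const double A[], const double B[], double C[])
{
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int t = 1; t <= 3 * n - 2; t++)      /* the 3n - 2 steps */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                int k = t - i - j + 2;        /* step t = i + j + k - 2 */
                if (k >= 1 && k <= n)
                    C[(i - 1) * n + (j - 1)] +=
                        A[(i - 1) * n + (k - 1)] * B[(k - 1) * n + (j - 1)];
            }
}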

Example 2: Matrix multiplication on a 2-D n×n mesh

Example: systolic mesh algorithm for n = 4

[Figure, step 1: 4×4 mesh of processors (1,1) … (4,4); the staggered rows of A wait above the columns and the staggered columns of B wait to the left of the rows; a11 and b11 enter processor (1,1)]

Example 2: Matrix multiplication on a 2-D n×n mesh

Example: systolic mesh algorithm for n = 4

[Figure, step 5: the staggered streams of A (entering from the top) and B (entering from the left) have advanced five steps into the mesh, e.g. a24, a33, a42 meeting b41, b32, b23]

Example 2: Matrix multiplication on a 2-D n×n mesh

Analysis

To determine the number of steps needed to complete the matrix multiplication, we must find the step at which the terms ann and bnn reach processor Pnn:
– values aik and bkj reach processor Pji at step i + j + k - 2
– substituting n for i, j, k yields: n + n + n - 2 = 3n - 2

Complexity of the solution: O(n)

Example 3: Matrix-Vector multiplication on a ring

[Figure, n = 4: ring of processors P1 – P2 – P3 – P4; the stream x4, x3, x2, x1 enters P1 and circulates around the ring, while row i of A, suitably rotated, feeds processor Pi from the top]

This algorithm requires n steps for a matrix-vector multiplication

Example 3: Matrix-Vector multiplication on a ring

Goal

Pipeline the data into the processors, so that n product terms are computed and added to the partial sums at each step

Distribution of X on the processors:
– Xj (1 ≤ j ≤ n) is assigned to processor Pn-j+1

This algorithm requires n steps for a matrix-vector multiplication

Example 3: Matrix-Vector multiplication on a ring

Another way to distribute the Xi over the processors and to input matrix A:
– row i of the matrix A is shifted (rotated) down i (mod n) times and entered into processor Pi
– Xi is assigned to processor Pi; at each step the Xi are shifted right

Example 3: Matrix-Vector multiplication on a ring

[Figure, n = 4: ring P1 – P2 – P3 – P4 with X1, X2, X3, X4 assigned in order; the rotated column streams of A feed the processors from the top, with the diagonal elements a11, a22, a33, a44 arriving first]

Example 4: Sum of n = 2^d numbers on a d-hypercube

Assignment: xi is stored on processor Pi

Computation of S = Σ xi

[Figure: 3-hypercube with nodes 0 … 7 holding (X0) … (X7)]

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 1: processors of the sub-cube 1XX send their data to the corresponding processors in sub-cube 0XX

[Figure: the nodes of sub-cube 0XX now hold (X0+X4), (X1+X5), (X2+X6), (X3+X7)]

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 2: processors of the sub-cube 01X send their data to the corresponding processors in sub-cube 00X

[Figure: nodes 0 and 1 now hold (X0+X4+X2+X6) and (X1+X5+X3+X7); active and idle processors are marked]

Example 4: Sum of n = 2^d numbers on a d-hypercube

Step 3: processor 001 sends its data to processor 000

[Figure: node 0 now holds S = (X0+X4+X2+X6+X1+X5+X3+X7); active and idle processors are marked]

The sum of the n numbers is stored on node P0

Example 4: Sum of n = 2^d numbers on a d-hypercube

Algorithm: processor Pi
Input: 1) an array X of n = 2^d numbers, X[i] assigned to processor Pi; 2) the processor identity id
Output: S = X[0] + … + X[n-1], stored on processor P0

Processor Pi
Begin
  My_id = id (My_id = i)
  S = X[i]
  for j = 0 to (d-1) do
    Partner = My_id XOR 2^j
    if (My_id AND 2^j) = 0 then
      receive (Si, Partner)
      S = S + Si
    else
      send (S, Partner)
      exit
End
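A hedged MPI rendering of the same recursive halving (it assumes the number of ranks is a power of two; the value summed and the message tag are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int id, p;
    double S, Si;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);         /* assume p = 2^d */
    S = (double)id;                            /* stand-in for X[i] */
    for (int mask = 1; mask < p; mask <<= 1) { /* mask = 2^j, j = 0 .. d-1 */
        int partner = id ^ mask;
        if ((id & mask) == 0) {                /* receive and accumulate */
            MPI_Recv(&Si, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            S += Si;
        } else {                               /* send and drop out */
            MPI_Send(&S, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            break;
        }
    }
    if (id == 0) printf("S = %g\n", S);        /* the sum ends up on P0 */
    MPI_Finalize();
    return 0;
}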

Parallel Architecture

Message broadcast on network model

(ring, torus, hypercube)


Basic communication

Message broadcast

One-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube

All-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube

Communication cost

Message from Pi to Pj traversing l links:

Communication cost = ts + tw · m · l

where
  ts: message preparation (startup) time
  m: message length
  tw: unit (per-byte) transfer time
  l: number of links traversed by the message

Communication cost

Communication time bounds:
– Ring: ts + tw · m · (p/2)
– Mesh: ts + tw · m · (√p / 2)
– Hypercube: ts + tw · m · log2 p

The bound depends on the maximum number of links traversed by the message; for example, for p = 64 the message traverses at most p/2 = 32 links on a ring, √p/2 = 4 in each dimension of a torus, and log2 p = 6 on a hypercube.

One-to-All broadcast

Simple solution

P0 sends message M0 to processors P1, P2, …, Pp-1 successively:

P0 → P1 (M0), then P0 → P2 (M0), then P0 → P3 (M0), …, then P0 → Pp-1 (M0)

Communication cost = Σ (i = 1 … p-1) (ts + tw · m0) · i = (ts + tw · m0) · p(p-1)/2

One-to-all broadcast

One processor sends a message M to all processors

[Diagram: before, only processor 0 holds M; after the one-to-all broadcast, processors 0, 1, …, p-1 all hold M]

The dual operation is accumulation

All-to-all broadcast

All-to-all broadcast: several simultaneous one-to-all broadcasts, where each processor Pi initiates the communication with its own message Xi

[Diagram: before, processor i holds only Xi; after the all-to-all broadcast, every processor holds X0, X1, …, Xp-1]

The dual operation is accumulation to several nodes

Parallel Architecture

Examples of message broadcasts


Example 1: One-to-All broadcast on a ring

Each processor forwards the message to the next processor; initially, the message is sent in both directions

[Figure: ring of 8 nodes; the message from node 0 reaches nodes 1 and 7 at step 1, nodes 2 and 6 at step 2, nodes 3 and 5 at step 3, and node 4 at step 4]

Communication cost:
T = (ts + tw · m) · p/2, where p is the number of processors

Example 2: One-to-All broadcast on a Torus

Two phases. Phase 1: one-to-all broadcast on the first row

[Figure: 4×4 torus, nodes 0 … 15; in steps 1 and 2 the message spreads from node 0 along the first row (nodes 0, 4, 8, 12)]

Example 2: One-to-All broadcast on a Torus

Phase 2: parallel one-to-all broadcasts in the columns

[Figure: after the row broadcast (steps 1 and 2), each first-row node broadcasts down its column in steps 3 and 4]

Example 2: One-to-All broadcast on a Torus

Communication cost:

Broadcast on the rows: Tcom = (ts + tw · m) · √p / 2
Broadcast on the columns: Tcom = (ts + tw · m) · √p / 2

Total: T = 2 · (ts + tw · m) · √p / 2, where p is the number of processors

Example 3: One-to-All broadcast on a Hypercube

Requires d steps; each step doubles the number of active processors

[Figure: 3-hypercube, nodes 0 … 7; the message travels along one new dimension at each of steps 1, 2 and 3]

Communication cost:
T = (ts + tw · m) · log2 p, where p is the number of processors

Example 3: One-to-All broadcast on a Hypercube

Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube

The broadcast can be performed in O(log n) steps as follows

[Figure: initial distribution of the data: only node 0 of the 3-hypercube holds X]

Example 3: One-to-All broadcast on a Hypercube

•Step 1: Processor Po sends X to processor P1•Step 2: Processors P0 and P1 send X to P2 and P3 respectively•Step 3: Processor P0, P1, P2 and P3 send X to P4, P5, P6 and P7

X

0 1

32

4 5

6 7

XStep 1 Step 2

0 1

32

4 5

6 7

X X

X X

X X

0 1

32

4 5

6 7

XX

X X

XX

Step 3

P Active processors P Idle processors

75

Example 3: One-to-All broadcast on a Hypercube

Algorithm for a broadcast of X on a d-hypercube

Input: 1) X assigned to processor P0; 2) the processor identity id
Output: all processors Pi contain X

Processor Pi
Begin
  if i = 0 then B = X
  My_id = id (My_id = i)
  for j = 0 to (d-1) do
    if My_id < 2^(j+1) then
      Partner = My_id XOR 2^j
      if My_id > Partner then receive (B, Partner)
      if My_id < Partner then send (B, Partner)
End
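The same doubling pattern in a hedged MPI sketch (a power-of-two rank count is assumed; at round j, the ranks below 2^j already hold x and send it across dimension j; the function name and tag are illustrative):

#include <mpi.h>

/* One-to-all broadcast of *x from rank 0 on a hypercube of p = 2^d ranks */
void hypercube_bcast(double *x, int id, int p)
{
    for (int mask = 1; mask < p; mask <<= 1) {  /* mask = 2^j */
        if (id < mask)                          /* already has x: send it on */
            MPI_Send(x, 1, MPI_DOUBLE, id | mask, 0, MPI_COMM_WORLD);
        else if (id < 2 * mask)                 /* receives x this round */
            MPI_Recv(x, 1, MPI_DOUBLE, id ^ mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}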

All-to-all Broadcast on a ring

Step 1

[Figure: ring of 8 nodes; each node i sends its own message (i) to its right neighbor; node i still holds only (i)]

All-to-all Broadcast on a ring

Step 2

[Figure: each node forwards the message it received at the previous step; after step 2, node i holds (i, i-1), e.g. node 1 holds (1, 0) and node 0 holds (0, 7)]

All-to-all Broadcast on a ring

Step 3

[Figure: after step 3, node i holds (i, i-1, i-2), e.g. node 2 holds (2, 1, 0)]

All-to-all Broadcast on a ring

Step 7

[Figure: after step 7 (= p-1 steps), every node holds all eight messages, e.g. node 0 holds (0, 7, 6, 5, 4, 3, 2, 1)]

All-to-all Broadcast on a 2-dimensional Torus

Two phases:
– Phase 1: all-to-all broadcast on each row; afterwards each processor Pi holds a message of size √p · m
– Phase 2: all-to-all broadcast in the columns

All-to-all Broadcast

Start of Phase 1: all-to-all on the rows

[Figure: 3×3 torus; node i initially holds only its own message (i)]

All-to-all Broadcast

Start of Phase 2: all-to-all on the columns

[Figure: after Phase 1, the nodes of the first row each hold (0, 1, 2), those of the second row (3, 4, 5), and those of the third row (6, 7, 8)]

All-to-all Broadcast

Communication cost = cost of phase 1 + cost of phase 2
= (√p - 1)(ts + tw·m) + (√p - 1)(ts + tw·√p·m)

Parallel Algorithms and Computing

Selected topics

Sorting in Parallel


Performance measures

Speedup

Efficiency

Work-Time

Amdahl's law

Speedup

Speedup S(p), where p is the number of processors in the parallel solution:

S(p) = T(1) / T(p)

T(1): sequential execution time
T(p): parallel execution time with p processors

S(p) < 1: poor performance (the parallel solution is worse)
1 ≤ S(p) ≤ p: normal speedup (S(p) = p is the ideal)
S(p) > p: hyper-speedup, not very frequent

[Figure: S(p) versus p, showing the poor, normal and hyper-speedup regions around the ideal line S(p) = p]

Speedup

Is hyper-speedup normal?
– a poor, non-optimal sequential algorithm may be the cause
– storage space (e.g. the data fitting into the combined memories of the processors) is a factor

Efficiency

Efficiency E(p):

E(p) = S(p) / p

0 < E(p) ≤ 1: normal
E(p) > 1: hyper-speedup

Interest:
– speedup: the user's point of view
– efficiency: the manager's point of view
– speedup and efficiency: the designer's point of view

Amdhal’s law

A program consists of two part :

+…

Sequential part Parallel prt

psseq TTTT )1(p

TTTpT psseq )(

90

Amdhal’s law

Bound on Speedup

p

TT

TpS

ps

seq

)(

T(p)

T(1) )( pS

)1(1

1 )(

seq

s

seq

s

TT

pTT

pS

91

Amdhal’s law

Bound on Speedup Sequential fraction (fs) et Parallel fraction (fp):

Speed up can be rewritten as

seq

pp

s

T T(1) ,1

1,0 T(1)

T , 1,0

T(1)

T

ps

s

ff

ff

)1(1

1 )(

ss fp

fpS

92

Amdhal’s law

Bound on Speedup

)1(1

1 )(

ss fp

fpS

sss

ffp

fpSp

1

)1(1

1 )( lim

sfpS

1)(

93

Amdhal’s law

Bound on Speedup

For example if fs is equal to 1%, S(p) is less than 100.

1/fs

S(p)

p11

94
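A few lines of C make the bound concrete (the sequential fraction and the processor counts are sample values):

#include <stdio.h>

/* Amdahl's law: S(p) = 1 / (fs + (1 - fs) / p) */
static double amdahl(double fs, double p)
{
    return 1.0 / (fs + (1.0 - fs) / p);
}

int main(void)
{
    int ps[] = {10, 100, 1000, 100000};
    for (int i = 0; i < 4; i++)          /* fs = 1%: S(p) stays below 1/fs = 100 */
        printf("p = %6d   S(p) = %5.1f\n", ps[i], amdahl(0.01, ps[i]));
    return 0;
}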

Amdhal’s law

The above computation of speed up bound does not take into account communication and synchronization overheadsOverhead

overheadp

TTTpT psseq )(

S(p) )( pSreal

95

Parallel sorting

Types of sorting algorithms

Properties:
– the processor ordering determines the order of the final result
– where input and output are stored
– the basic compare-exchange operation

Issues in sorting algorithms

Internal / external sort:

Internal: the data fit in processor memory (RAM)
– performance based on comparisons and basic operations
– complexity O(n log n)

External: the data reside in memory and on disk
– performance based on basic operations and on the overlap of computing and I/O

Issues in sorting algorithms

Comparison-based:
– executions of comparisons and permutations

Non-comparison-based:
– ordering based on properties of the keys

Issues in sorting algorithms

Internal sort (shared memory: PRAM)
– each processor sorts part of the data in memory
– processors share data
– minimize memory access conflicts

Issues in sorting algorithms

Internal sort (distributed memory)
– each processor is assigned a block of N/p elements
– each processor locally sorts its assigned block (using any internal sort algorithm)

Input: distributed among the processors
Output: stored on the processors
Final order: the processor order defines the final ordering of the list

[Diagram: initial data split into N/p elements per processor, with P1 < P2 < P3]

Issues in sorting algorithms

Internal sort (distributed memory)

Example: the final order is defined by the Gray-code labelling of the processors

[Diagram: 3-hypercube whose nodes, visited in Gray-code order, carry the successive blocks of the sorted list]

Issues in sorting algorithms

Building block: the compare-exchange operation

Sequential: a single CPU holds (ai, aj) in RAM, tests (ai < aj)? and exchanges ai ↔ aj if they are out of order

Parallel: ai is on processor P(i) and aj on P(i+1); the two processors exchange their values, then
– P(i) keeps ai = min(ai, aj)  (exchange-compare-min with P(i+1))
– P(i+1) keeps aj = max(ai, aj)  (exchange-compare-max with P(i))

Issues in sorting algorithms

Compare-exchange with N/p elements per processor (compare-split):

P(i) holds the sorted block 1, 6, 8, 11, 13, 62 and P(i+1) holds 2, 7, 9, 10, 12, 63

The processors exchange their blocks and merge them:
– P(i) keeps the N/p smallest elements (exchange-compare-min with P(i+1)): 1, 2, 6, 7, 8, 9
– P(i+1) keeps the N/p largest elements (exchange-compare-max with P(i)): 10, 11, 12, 13, 62, 63
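A C sketch of one compare-split step under these assumptions (both blocks sorted, equal size k = N/p; the scratch buffer is a C99 variable-length array, and the function name is illustrative):

#include <string.h>

/* P(i) keeps the k smallest and P(i+1) the k largest of the 2k elements */
void compare_split(double lo[], double hi[], int k)
{
    double merged[2 * k];                     /* scratch for the full merge */
    int i = 0, j = 0, t = 0;
    while (i < k && j < k)                    /* merge the two sorted blocks */
        merged[t++] = (lo[i] <= hi[j]) ? lo[i++] : hi[j++];
    while (i < k) merged[t++] = lo[i++];
    while (j < k) merged[t++] = hi[j++];
    memcpy(lo, merged, k * sizeof *lo);       /* smallest half stays on P(i) */
    memcpy(hi, merged + k, k * sizeof *hi);   /* largest half goes to P(i+1) */
}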

Example: Odd-Even Merge Sort

Unsorted list of n elements: A0 A1 … AM-1 B0 B1 … BM-1

1. Divide the list into two lists of n/2 elements (1×N → 2×N/2)
2. Sort each sub-list
3. Divide each sorted sub-list into its even- and odd-index sub-lists:
   Even: A0 A2 … AM-2 and B0 B2 … BM-2; Odd: A1 A3 … AM-1 and B1 B3 … BM-1
4. Merge-sort the even and odd sub-lists into E0 E1 … EM-1 and O0 O1 … OM-1
5. Interleave the two lists as E0 O0 E1 O1 … EM-1 OM-1 and exchange the out-of-position elements

Where is the parallelism? The two sub-lists of step 2 and the even/odd merges of step 4 are independent; the recursion produces 2×N/2, then 4×N/4, … independent tasks

Example: Odd-Even Merge Sort

Key to the merge sort algorithm: the method used to merge the sorted sub-lists

Consider 2 sorted lists of m = 2^k elements:

A = a0, a1, …, am-1 and B = b0, b1, …, bm-1

Even(A) = a0, a2, …, am-2 and Odd(A) = a1, a3, …, am-1
Even(B) = b0, b2, …, bm-2 and Odd(B) = b1, b3, …, bm-1

Example: Odd-Even Merge Sort

Create 2 merged lists:

Merge Even(A) and Odd(B) into E = E0 E1 … Em-1

Merge Even(B) and Odd(A) into O = O0 O1 … Om-1

Interleave E and O to create a list L':

L' = E0 O0 E1 O1 … Em-1 Om-1

Exchange the out-of-order elements of L' to obtain the sorted list L

Example: Odd-Even Merge Sort

A = 2, 3, 4, 8 and B = 1, 5, 6, 7
Even(A) = 2, 4 and Odd(A) = 3, 8
Even(B) = 1, 6 and Odd(B) = 5, 7

E = 2, 4, 5, 7 and O = 1, 3, 6, 8

L' = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8

L = 1, 2, 3, 4, 5, 6, 7, 8
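A C sketch of this merge step, assuming two sorted inputs of m = 2^k elements (m ≥ 2); merge2 is an ordinary two-way merge, and in the parallel version the two merges and the final compare-exchanges run concurrently (function names are illustrative):

static void merge2(const double *x, int nx, const double *y, int ny, double *out)
{
    int i = 0, j = 0, t = 0;
    while (i < nx && j < ny) out[t++] = (x[i] <= y[j]) ? x[i++] : y[j++];
    while (i < nx) out[t++] = x[i++];
    while (j < ny) out[t++] = y[j++];
}

/* Odd-even merge of sorted A[0..m-1] and B[0..m-1] into L[0..2m-1] */
void odd_even_merge(const double A[], const double B[], int m, double L[])
{
    int h = m / 2;
    double EA[h], OA[h], EB[h], OB[h], E[m], O[m];   /* C99 VLAs */
    for (int i = 0; i < h; i++) {                    /* split by even/odd index */
        EA[i] = A[2 * i];  OA[i] = A[2 * i + 1];
        EB[i] = B[2 * i];  OB[i] = B[2 * i + 1];
    }
    merge2(EA, h, OB, h, E);                         /* E = merge(Even(A), Odd(B)) */
    merge2(EB, h, OA, h, O);                         /* O = merge(Even(B), Odd(A)) */
    for (int i = 0; i < m; i++) {                    /* L' = E0 O0 E1 O1 ... */
        L[2 * i] = E[i];  L[2 * i + 1] = O[i];
    }
    for (int i = 0; i < 2 * m; i += 2)               /* fix out-of-order pairs */
        if (L[i] > L[i + 1]) { double t = L[i]; L[i] = L[i + 1]; L[i + 1] = t; }
}

On the example above (A = 2, 3, 4, 8 and B = 1, 5, 6, 7), this produces exactly L' = 2, 1, 4, 3, 5, 6, 7, 8 and then the sorted L.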

Parallel sorting

Quicksort


Review: Quicksort

Recall: sequential Quicksort

Recursively:
– choose a pivot
– divide the list in two using the pivot
– sort the left and right sub-lists

Performance:
– best case: O(log n) levels of recursion, O(n log n) comparisons
– worst case: n(n-1)/2 comparisons, i.e. O(n²)

Review: Quicksort

Sequential Quicksort:

void Quicksort(double *A, int q, int r)
{
    int s, i;
    double pivot;
    if (q < r) {
        /* divide A using the pivot */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* recursive calls to sort the new sub-lists */
        Quicksort(A, q, s - 1);
        Quicksort(A, s + 1, r);
    }
}

Review: Quicksort

Create a binary tree of processors, one new processor for each recursive call of Quicksort

Easy to implement, but can be inefficient performance-wise

Review: Quicksort

Shared-memory implementation (with the fork() primitive):

double A[nmax];

void quicksort(int q, int r)
{
    int s, i, n;
    double pivot;
    if (q < r) {
        /* partition */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* create a new process for the left sub-list */
        n = fork();
        if (n == 0)
            quicksort(q, s - 1);   /* child sorts the left part */
        else
            quicksort(s + 1, r);   /* parent sorts the right part */
    }
}

Quicksort on a d-hypercube

d steps: all processors are active in each step

A processor is assigned N/p elements (p = 2^d)

Steps of the solution:
– initially (step 0), one pivot is chosen and broadcast to all processors
– each processor partitions its elements into two sub-lists: one with the elements less than the current pivot (inferior), the other with the elements greater than or equal to it (superior)
– exchange the inferior and superior sub-lists along dimension d-i at step i (dimension d at step 0), creating two sub-cubes, one holding the inferior lists and the other the superior lists
– each processor merges the (inferior and superior) lists it now holds
– repeat within each sub-cube

Quicksort on a d-hypercube

Example on a 3-hypercube

Step 0: pivot P0

Division along dimension 3, creating the sub-cubes 0XX and 1XX. Two blocks of elements are created:
– 1 block of elements less than the pivot P0 (in 0XX)
– 1 block of elements greater than or equal to P0 (in 1XX)

[Figure: 3-hypercube with nodes 000 … 111 split into the 0XX and 1XX halves]

Quicksort on a d-hypercube

Example on a 3-hypercube

Step 1: pivots P10 (for sub-cube 0XX) and P11 (for sub-cube 1XX)

Division along dimension 2; each sub-cube is divided into two smaller sub-cubes:
– 00X (< P10) and 01X (≥ P10)
– 10X (< P11) and 11X (≥ P11)

[Figure: the two halves of the 3-hypercube each split again]

Quicksort on a d-hypercube

Example on a 3-hypercube

Step 2: pivots P20, P21, P22, P23 (one per 2-node sub-cube)

Division along dimension 1. The final order is defined by the label ordering of the processors:
– pair (000, 001) with pivot P20: 000 gets < P20, 001 gets ≥ P20
– pair (010, 011) with pivot P21: 010 gets < P21, 011 gets ≥ P21
– pair (100, 101) with pivot P22: 100 gets < P22, 101 gets ≥ P22
– pair (110, 111) with pivot P23: 110 gets < P23, 111 gets ≥ P23

Quicksort on a d-hypercube

Example on a 3-hypercube

Final step: each processor sorts its final list, using for example a sequential quicksort

[Figure: each node holds its locally sorted sub-list; some nodes may end up with an empty list {}]

Quicksort on a d-hypercube

Data exchange at the initial step, between the sub-cubes P0XX and P1XX:

1. broadcast of the pivot P0
2. each processor partitions its list into the sub-lists < P0 and ≥ P0
3. each P0XX processor sends its superior sub-list to its P1XX neighbor, and each P1XX processor sends its inferior sub-list to its P0XX neighbor
4. the inferior lists end up in P0XX, the superior lists in P1XX

Sort (merge) the sub-lists at the end of each step?

Quicksort on a d-hypercube

Algorithm: processor k (k = 0, …, p-1)

Hypercube-Quicksort(B, d)
{ /* B contains the elements assigned to processor k */
  /* d is the hypercube dimension */
  int i; double x; lists B1, B2, T;
  my-id = k;                        /* processor id */
  for (i = d-1 down to 0) {
    x = pivot(my-id, i);
    partition(B, x, B1, B2);        /* B1: inferior sub-list, B2: superior sub-list */
    if (my-id AND 2^i == 0) {       /* i-th bit is 0 */
      send(B2, my neighbor in dimension i);
      receive(T, my neighbor in dimension i);
      B = B1 ∪ T;
    } else {
      send(B1, my neighbor in dimension i);
      receive(T, my neighbor in dimension i);
      B = B2 ∪ T;
    }
  }
  Sequential-Quicksort(B);
} End Hypercube-Quicksort

Quicksort on a d-hypercube

Choice of pivot

More important for performance than in the sequential case. It has a great impact on:
– the load balance between processors
– the performance of the algorithm (rapid performance degradation with a bad pivot)

Quicksort on a d-hypercube

Worst case: at step 0, the largest element of the list is selected as the pivot

Pivot0 = x = max{ Xi }

[Figure: 3-hypercube after the exchange; the processors of one half are overloaded with the whole list, while those of the other half hold empty lists {} and stay idle]

Quicksort on a d-hypercube

Choice of pivot: ideal case. In parallel, do:
– sort the initial list assigned to each processor
– choose the median element of one of the processors of the (sub-)cube as the pivot
– assuming a uniform distribution of the elements, the median of one processor's list approximates the median element of the whole list

Quicksort on a d-hypercube

Steps of the algorithm and their time complexity (d = log p):

1. local sort of the assigned list: O((N/p) · log(N/p))
2. selection of the pivot by one processor: O(d)
3. broadcast of the pivot in the sub-hypercube of dimension d-i: Σ (i = 1 … d) (d - i + 1) = O(d²) = O(log² p)
4. division based on the pivot (binary search): O(d · log(N/p))
5. exchange of the sub-lists between neighbors: O(d · N/p)
6. merge of the sorted sub-lists: O(d · N/p)

Steps 2-6 are repeated for each dimension

Parallel Quicksort on a PRAM

Parallel QUICKSORT algorithm
/* The solution constructs a binary tree of processors which is traversed IN-ORDER to yield the sorted list */

Variables shared by all processors:
  root: root of the global binary tree
  A[n]: an array of n elements (1, 2, …, n)
  Leftchild[i]: the root of the left sub-tree of processor i (i = 1, 2, …)
  Rightchild[i]: the root of the right sub-tree of processor i (i = 1, 2, …)

Parallel Quicksort on a PRAM

Process /* Do in parallel for each processor i */begin Root := I; Parent := i; Leftchild[i] := Rightchild[i] := n+1;End

Repeat for each processor i root do

begin

if (A[i] < A[Parent] …..) then

begin

Leftchild[Parent] := i

if i = Leftchild[Parent] then exit

else Parent := Leftchild[Parent]

end

else

begin

Rightchild[Parent] := i

If i = Rightchild[Parent] then exit

else Parent := Leftchild[Parent]

end

end repeat

end process

126

Parallel Quicksort on a PRAM

Example, step 0:

A = 33 21 13 54 82 33 40 72 (processors 1 … 8); root = processor 4

[Figure: binary tree with root [4] {54}; processors 1, 2, 3, 6, 7 compete for the left sub-tree and 5, 8 for the right]

Leftchild[1..8] and Rightchild[1..8] are all initialized to 9 (= n+1)

Parallel Quicksort on a PRAM

Example, step 1: root = processor 4

Processor 1 wins the competition for the left sub-tree of 4, and processor 5 wins at the right:
Leftchild[4] = 1, Rightchild[4] = 5

[Figure: binary tree with root [4] {54}, left child [1] {33} (with 2, 3, 6, 7 still competing below it) and right child [5] {82} (with 8 below it)]

Parallel Quicksort on a PRAM

Example, step 2: root = processor 4

[Figure: processors 2 and 3 descend below [1] {33}, and 6, 7, 8 below [5] {82}; the Leftchild and Rightchild arrays are updated accordingly]

Parallel Quicksort on a PRAM

Example, final tree:

[Figure: root [4] {54}; its left child is [1] {33}, with children [2] {21} and [6] {33}; its right child is [5] {82}, with child [8] {72}; processor 3 sits below [2] and processor 7 below [6]. An in-order traversal yields the sorted list 13, 21, 33, 33, 40, 54, 72, 82]