Parallel Algorithms and Computing
Selected topics
Parallel Architecture
2
References
An Introduction to Parallel Algorithms, Joseph JáJá
Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis
Parallel Sorting Algorithms, Selim G. Akl
3
Models
Three models:
Graphs (DAG : Directed Acyclic Graph)
Parallel Random Access Machine (PRAM)
Network
4
Graphs
Not studied here
Parallel Architecture
Parallel random access machine
6
Parallel Random Access Machine
Flynn classifies parallel machines based on:
– Data flow
– Instruction flow
Each flow can be:
– Single
– Multiple
7
Parallel Random Access Machine
Flynn classification (instruction flow × data flow):

Instruction flow \ Data flow   SINGLE   MULTIPLE
SINGLE                         SISD     SIMD
MULTIPLE                       MISD     MIMD
8
Parallel Random Access Machine
Extends the traditional RAM (Random Access Machine) model
Multiple processors
Interconnection network between the global memory and the processors
[Figure: processors P1, P2, …, Pp connected through an interconnection network to a global (shared) memory]
9
Parallel Random Access Machine
Characteristics
Processors Pi (0 ≤ i ≤ p-1):
– each with a local memory
– i is a unique identity for processor Pi
A global shared memory:
– it can be accessed by all processors
10
Parallel Random Access Machine
Types of operation: Synchronous
– processors work in lockstep
– at each step, a processor is active or idle
– suited for SIMD and MIMD architectures
Asynchronous
– processors have local clocks
– the processors need to be synchronized explicitly
– suited for MIMD architectures
11
Parallel Random Access Machine
Example of synchronous operation
Algorithm: Processor i (i = 0 … 3)
Input: A, B; i is the processor id
Output: C
Begin
    If (B == 0) then C = A
    Else C = A / B
End
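
A minimal sketch (assuming four simulated processors and the sample A, B values used in the next slides) of how this synchronous two-step execution can be mimicked sequentially: in step 1 only the processors with B = 0 are active, in step 2 only those with B ≠ 0.

    /* Hypothetical lockstep simulation of the example above (C). */
    #include <stdio.h>
    #define P 4
    int main(void) {
        double A[P] = {7, 2, 4, 5}, B[P] = {0, 1, 2, 0}, C[P] = {0};
        /* Step 1: processors with B == 0 are active, the others are idle */
        for (int i = 0; i < P; i++)
            if (B[i] == 0) C[i] = A[i];
        /* Step 2: processors with B != 0 are active, the others are idle */
        for (int i = 0; i < P; i++)
            if (B[i] != 0) C[i] = A[i] / B[i];
        for (int i = 0; i < P; i++)
            printf("P%d: C = %g\n", i, C[i]);
        return 0;
    }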
12
Parallel Random Access Machine
Step 1

                Processor 3    Processor 2    Processor 1    Processor 0
Initial:        A=7 B=0 C=0    A=2 B=1 C=0    A=4 B=2 C=0    A=5 B=0 C=0
After step 1:   A=7 B=0 C=7    A=2 B=1 C=0    A=4 B=2 C=0    A=5 B=0 C=5
                (active, B=0)  (idle, B≠0)    (idle, B≠0)    (active, B=0)
13
Parallel Random Access Machine
Step 2

                Processor 3    Processor 2    Processor 1    Processor 0
After step 2:   A=7 B=0 C=7    A=2 B=1 C=2    A=4 B=2 C=2    A=5 B=0 C=5
                (idle, B=0)    (active, B≠0)  (active, B≠0)  (idle, B=0)
14
Parallel Random Access Machine
Read / Write conflicts
EREW: Exclusive-Read, Exclusive-Write
– no concurrent (read or write) operation on a variable
CREW: Concurrent-Read, Exclusive-Write
– concurrent reads allowed on the same variable
– exclusive write only
15
Parallel Random Access Machine
ERCW : Exclusive Read – Concurrent Write
CRCW : Concurrent – Read, Concurrent – Write
16
Parallel Random Access Machine
Concurrent write on a variable X
Common CRCW: allowed only if all processors write the same value on X
SUM CRCW: write the sum of all the written values on X
Random CRCW: choose one processor at random and write its value on X
Priority CRCW: the processor with the highest priority writes on X
17
Parallel Random Access Machine
Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X)
Common CRCW or ERCW: failure
SUM CRCW: X is the sum (180) of the written values
Random CRCW: final value of X ∈ { 50, 60, 70 }
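
A small sketch (a sequential simulation with the values from the example; the priority assignment to P1 is an assumption) of how the different CRCW write policies resolve these concurrent writes:

    /* Hypothetical CRCW write-conflict resolution (C). */
    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
        double writes[3] = {50, 60, 70};   /* values written concurrently by P1, P2, P3 */
        int n = 3;
        /* SUM CRCW: X receives the sum of all the written values */
        double x_sum = 0;
        for (int i = 0; i < n; i++) x_sum += writes[i];
        /* Random CRCW: one writer is picked at random */
        double x_random = writes[rand() % n];
        /* Priority CRCW: assume P1 (index 0) has the highest priority */
        double x_priority = writes[0];
        /* Common CRCW: succeeds only if all written values are equal */
        int common_ok = 1;
        for (int i = 1; i < n; i++) if (writes[i] != writes[0]) common_ok = 0;
        printf("SUM=%g Random=%g Priority=%g Common=%s\n",
               x_sum, x_random, x_priority, common_ok ? "ok" : "failure");
        return 0;
    }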
18
Parallel Random Access Machine
Basic input/output operations
On global memory:
– global read (X, x)
– global write (Y, y)
On local memory:
– read (X, x)
– write (Y, y)
19
Example 1: Matrix-Vector product
Matrix-Vector product
Y = AX
– A is an n×n matrix
– X = [x1, x2, …, xn] is a vector of n elements
– p processors (p ≤ n) and r = n/p
Each processor is assigned a block of r = n/p rows of A
20
Example 1: Matrix-Vector product
[Figure: Y = A X, with Y = (Y1, …, Yn), A = (Ai,j) an n×n matrix and X = (X1, …, Xn), all stored in global memory; the processors P1, P2, …, Pp access the global memory.]
21
Example 1: Matrix-Vector product
Partition A into p blocks Ai
Compute the p partial products in parallel
Processor Pi computes the partial product Yi = Ai * X
[Figure: A is partitioned row-wise into blocks A1, …, Ap of r = n/p rows each; A1 holds rows 1..r, Ap holds rows (p-1)r+1..pr.]
22
Example 1: Matrix-Vector product
Processor Pi computes Yi = Ai * X
[Figure: P1 computes Y(1:r) = A(1:r, 1:n) X, P2 computes Y(r+1:2r) = A(r+1:2r, 1:n) X, …, Pp computes Y((p-1)r+1:pr) = A((p-1)r+1:pr, 1:n) X.]
23
Example 1: Matrix-Vector product
The solution requires:
– p concurrent reads of the vector X
– each processor Pi makes an exclusive read of block Ai = A[((i-1)r + 1) : ir, 1:n]
– each processor Pi makes an exclusive write on block Yi = Y[((i-1)r + 1) : ir]
Required architecture: PRAM CREW
24
Example 1: Matrix-Vector product
Algorithm: processor Pi (i = 1, 2, …, p)
Input:
    A : n×n matrix in global memory
    X : a vector in global memory
Output:
    Y = AX (Y is a vector in global memory)
Local variables:
    i : processor Pi identity
    p : number of processors
    n : dimension of A and X
Begin
    1. global read (X, z)
    2. global read (A((i-1)r + 1 : ir, 1:n), B)
    3. compute w = Bz
    4. global write (w, Y((i-1)r + 1 : ir))
End
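
A minimal sequential sketch of what each processor computes (assuming n = 8, p = 4 and sample data); the outer loop over pid plays the role of the p processors, each handling its block of r rows:

    /* Row-block matrix-vector product, simulating the p PRAM processors (C). */
    #include <stdio.h>
    #define N 8
    #define P 4
    int main(void) {
        double A[N][N], X[N], Y[N];
        int r = N / P;                               /* rows per processor */
        for (int i = 0; i < N; i++) {                /* sample data */
            X[i] = 1.0;
            for (int j = 0; j < N; j++) A[i][j] = i + j;
        }
        for (int pid = 0; pid < P; pid++)            /* "processor" pid */
            for (int row = pid * r; row < (pid + 1) * r; row++) {
                Y[row] = 0.0;                        /* exclusive write on Y(row) */
                for (int j = 0; j < N; j++)
                    Y[row] += A[row][j] * X[j];      /* concurrent read of X */
            }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }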
25
Example 1: Matrix-Vector product
Analysis
Computation cost
– Line 3: O(n²/p) arithmetic operations by Pi (r rows × n operations, with r = n/p)
Communication cost
– Line 1: O(n) numbers transferred from global to local memory by Pi
– Line 2: O(n²/p) numbers transferred from global to local memory by Pi
– Line 4: O(n/p) numbers transferred from local to global memory by Pi
Overall: the algorithm runs in O(n²/p) time
26
Example 1: Matrix-Vector product
Another way to partition the matrix is vertically: A and X are split into blocks
– A1, A2, …, Ap
– X1, X2, …, Xp
Solution in two phases:
– compute the partial products Z1 = A1X1, …, Zp = ApXp
– synchronize the processors
– add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp
27
Example 1: Matrix-Vector product
[Figure: column partitioning. Processor P1 multiplies the first r columns of A by X1 … Xr, …, processor Pp multiplies the last r columns by X(p-1)r+1 … Xpr; the partial vectors are added after a synchronization step.]
28
Example 1: Matrix-Vector product
Algorithm: processor Pi (i = 1, 2, …, p)
Input:
    A : n×n matrix in global memory
    X : a vector in global memory
Output:
    Y = AX (Y is a vector in global memory)
Local variables:
    i : processor Pi identity
    p : number of processors
    n : dimension of A and X
Begin
    1. global read (X((i-1)r + 1 : ir), z)
    2. global read (A(1:n, (i-1)r + 1 : ir), B)
    3. compute w = Bz
    4. synchronize the processors Pi (i = 1, 2, …, p)
    5. global write (w, Y((i-1)r + 1 : ir))
End
29
Example 1: Matrix-Vector product
Analysis: work out the details. Overall, the algorithm runs in O(n²/p) time.
30
Example 2: Sum on the PRAM model
An array A of n = 2^k numbers
A PRAM machine with n processors
Compute S = A(1) + A(2) + … + A(n)
Construct a binary tree to compute the sum in log2 n time
31
Example 2: Sum on the PRAM model
[Figure: binary summation tree over processors P1 … P8.
Level 1: each Pi sets B(i) = A(i).
Level h > 1: Pi computes B(i) = B(2i-1) + B(2i).
After log2 n levels, the sum S = B(1) is held by P1.]
32
Example 2: Sum on the PRAM model
Algorithm: processor Pi (i = 1, …, n)
Input:
    A : array of n = 2^k elements in global memory
Output:
    S : where S = A(1) + A(2) + … + A(n)
Local variables of Pi:
    n, i : processor Pi identity
Begin
    1. global read (A(i), a)
    2. global write (a, B(i))
    3. for h = 1 to log n do
           if (i ≤ n / 2^h) then
               begin
                   global read (B(2i-1), x)
                   global read (B(2i), y)
                   z = x + y
                   global write (z, B(i))
               end
    4. if i = 1 then global write (z, S)
End
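
A sequential sketch of the same log-depth reduction (assuming n = 8 and sample data); each pass of the loop corresponds to one parallel level h of the tree:

    /* Binary-tree sum as on the PRAM, simulated sequentially (C). */
    #include <stdio.h>
    #define N 8                              /* n = 2^k, here k = 3 */
    int main(void) {
        double A[N + 1], B[N + 1];           /* 1-based indexing as in the algorithm */
        for (int i = 1; i <= N; i++) { A[i] = i; B[i] = A[i]; }
        for (int active = N / 2; active >= 1; active /= 2)   /* level h: n/2^h active */
            for (int i = 1; i <= active; i++)                /* processors P1 .. P(n/2^h) */
                B[i] = B[2 * i - 1] + B[2 * i];
        printf("S = %g (expected %g)\n", B[1], (double)N * (N + 1) / 2);
        return 0;
    }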
Parallel Architecture
Network model
34
Network model
Characteristics
Communication structure is important. The network can be seen as a graph G = (N, E):
– node i ∈ N is a processor
– edge (i, j) ∈ E represents a two-way communication link between processors i and j
Basic communication operations:
– send (X, Pi)
– receive (X, Pi)
No global shared memory
35
Network model
[Figure: linear array of n processors P1 – P2 – P3 – … – Pn]
[Figure: ring of n processors P1 – P2 – P3 – … – Pn, with Pn connected back to P1]
36
Network model
[Figure: grid of n² processors P11 … Pnn]
[Figure: torus of n² processors: the columns and rows are n rings]
37
Network model
[Figure: hypercube of n = 2^k processors P0 … P7 (k = 3)]
38
39
Example 1: Matrix-Vector Product on a linear array
A = [aij] an n×n matrix, i, j ∈ [1, n]; X = [xj], j ∈ [1, n]
Compute Y = A X, where yi = Σ (j = 1..n) aij xj
40
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm for n = 4
[Figure: linear array P1 P2 P3 P4. The values x1, x2, x3, x4 flow from the left into P1; row i of A (ai1, ai2, …, ai4) is fed from the top into Pi, skewed so that aij arrives at step i+j-1.]
41
Example 1: Matrix-Vector Product on a linear array
• At step j, xj enters processor P1. At each step, processor Pi receives (when possible) a value from its left and a value from the top, and updates its partial sum as follows:
    Yi = Yi + aij * xj, j = 1, 2, 3, …
• The values xj and aij reach processor Pi at the same time, at step (i+j-1):
    (x1, a11) reach P1 at step 1 = (1+1-1); (x3, a13) reach P1 at step 3 = (1+3-1)
• In general, Yi is computed at step N+i-1
42
Example 1: Matrix-Vector Product on a linear array
• The computation is completed when x4 and a44 reach processor P4, at step N + N - 1 = 2N-1
• Conclusion: the algorithm requires 2N-1 steps. At each step, every active processor performs an addition and a multiplication
• Complexity of the algorithm: O(N)
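
A sequential sketch of this systolic schedule (assuming N = 4 and sample data): at step t, processor Pi is active exactly when 1 ≤ t - i + 1 ≤ N, and it then consumes x(t-i+1) and ai,(t-i+1):

    /* Systolic matrix-vector product on a linear array, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], X[N], Y[N] = {0};
        for (int i = 0; i < N; i++) {                 /* sample data */
            X[i] = i + 1;
            for (int j = 0; j < N; j++) A[i][j] = (i + 1) * 10 + (j + 1);
        }
        for (int t = 1; t <= 2 * N - 1; t++)          /* 2N-1 systolic steps */
            for (int i = 1; i <= N; i++) {
                int j = t - i + 1;                    /* aij and xj reach Pi at step i+j-1 */
                if (j >= 1 && j <= N)
                    Y[i - 1] += A[i - 1][j - 1] * X[j - 1];
            }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }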
43
Example 1: Matrix-Vector Product on a linear array
[Figure: snapshots of the array at steps 1 to 7. The xj values shift one processor to the right at each step; once all the data has passed, Pi holds yi = Σ (j = 1..4) aij xj.]
44
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm: time/cost analysis (n = 4)
Step 1: 1 add, 1 mult; active: P1; idle: P2, P3, P4
Step 2: 2 adds, 2 mults; active: P1, P2; idle: P3, P4
Step 3: 3 adds, 3 mults; active: P1, P2, P3; idle: P4
Step 4: 4 adds, 4 mults; active: P1, P2, P3, P4; idle: none
Step 5: 3 adds, 3 mults; active: P2, P3, P4; idle: P1
Step 6: 2 adds, 2 mults; active: P3, P4; idle: P1, P2
Step 7: 1 add, 1 mult; active: P4; idle: P1, P2, P3
45
Example 1: Matrix-Vector Product on a linear array
Systolic array algorithm: time/cost analysis
[Figure: the seven step snapshots of the array, annotated with the add/mult counts and the active/idle processors listed above.]
46
Example 2: Matrix multiplication on a 2-D n×n mesh
Given two n×n matrices A = [aik] and B = [bkj], i, j ∈ [1, n],
compute the product C = AB, where cij = Σ (k = 1..n) aik bkj
47
Example 2: Matrix multiplication on a 2-D n×n mesh
• At step i, row i of A (starting with ai1) is entered from the top into column i (into processor P1i)
• At step j, column j of B (starting with b1j) is entered from the left into row j (into processor Pj1)
• The values aik and bkj reach processor Pji at step (i+j+k-2). At the end of this step, aik is sent down and bkj is sent right.
48
Example 2: Matrix multiplication on a 2-D n×n mesh
Example: systolic mesh algorithm for n = 4
Step 1
[Figure: 4×4 mesh of processors (1,1) … (4,4); the rows of A are fed from the top, one per column, and the columns of B are fed from the left, one per row, each stream skewed by one extra step per index.]
49
Example 2: Matrix multiplication on a 2-D n×n mesh
Example: systolic mesh algorithm for n = 4
Step 5
[Figure: snapshot of the A and B values in flight inside the mesh at step 5.]
50
Example 2: Matrix multiplication on a 2-D n×n mesh
Analysis
To determine the number of steps needed to complete the multiplication of the matrices, we must find the step at which the terms ann and bnn reach processor Pnn.
– Values aik and bkj reach processor Pji at step i+j+k-2
– Substituting n for i, j, k yields: n + n + n - 2 = 3n - 2
Complexity of the solution: O(n)
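
A sequential sketch of the mesh schedule (assuming n = 4 and sample data): processor (i, j) consumes aik and bkj exactly at step t = i + j + k - 2, so the last useful step is 3n - 2:

    /* Systolic mesh matrix multiplication schedule, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)                   /* sample data */
            for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }
        for (int t = 1; t <= 3 * N - 2; t++)          /* 3N-2 systolic steps */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++) {
                    int k = t - i - j + 2;            /* the k arriving at (i,j) at step t */
                    if (k >= 1 && k <= N)
                        C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1];
                }
        printf("C[1][1] = %g\n", C[0][0]);            /* each cij = sum over k of aik*bkj */
        return 0;
    }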
51
Example 3: Matrix-Vector multiplication on a ring
N = 4
[Figure: ring of processors P1 P2 P3 P4. The vector values X4 X3 X2 X1 are pipelined in from the left, and row i of A is fed from the top into processor Pi.]
This algorithm requires N steps for a matrix-vector multiplication
52
Example 3: Matrix-Vector multiplication on a ring
Goal
Pipeline the data into the processors, so that n product terms are computed and added to the partial sums at each step.
Distribution of X on the processors:
– Xj, 1 ≤ j ≤ N, is assigned to processor N-j+1
This algorithm requires N steps for a matrix-vector multiplication
53
Example 3: Matrix-Vector multiplication on a ring
Another way to distribute the Xi over the processors and to input matrix A:
– row i of the matrix A is shifted (rotated) down i (mod n) times and entered into processor Pi
– Xi is assigned to processor Pi; at each step the Xi are shifted right (around the ring)
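
A sequential sketch of this rotated layout (assuming N = 4 and sample data): at step t, processor Pi holds the x value that has been shifted to it, and the rotation of row i supplies the matching matrix element, so every processor does useful work at each of the N steps:

    /* Matrix-vector product on a ring with rotated rows, simulated in C. */
    #include <stdio.h>
    #define N 4
    int main(void) {
        double A[N][N], X[N], Y[N] = {0};
        for (int i = 0; i < N; i++) {                 /* sample data */
            X[i] = i + 1;
            for (int j = 0; j < N; j++) A[i][j] = (i + 1) * 10 + (j + 1);
        }
        double x[N];                                  /* x[i] = value currently on Pi */
        for (int i = 0; i < N; i++) x[i] = X[i];      /* initially Xi is on Pi */
        for (int t = 0; t < N; t++) {                 /* N steps, all processors busy */
            for (int i = 0; i < N; i++) {
                int j = (i - t + N) % N;              /* index of the x currently on Pi */
                Y[i] += A[i][j] * x[i];               /* rotated row i supplies a(i,j) now */
            }
            double last = x[N - 1];                   /* shift the x values right on the ring */
            for (int i = N - 1; i > 0; i--) x[i] = x[i - 1];
            x[0] = last;
        }
        for (int i = 0; i < N; i++) printf("Y[%d] = %g\n", i, Y[i]);
        return 0;
    }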
54
Example 3: Matrix-Vector multiplication on a ring
N = 4
[Figure: X1 X2 X3 X4 are assigned to P1 P2 P3 P4. Above each Pi, row i of A is stacked rotated down i times, so that the diagonal element aii is the first to enter Pi.]
55
Example 4: Sum of n = 2^d numbers on a d-hypercube
Assignment: xi is on processor Pi
Computation of S = Σ xi
[Figure: 3-dimensional hypercube with processors 0 … 7 holding X0 … X7.]
56
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 1: processors of the sub-cube 1XX send their data to the corresponding processors in sub-cube 0XX
[Figure: after step 1 the 0XX sub-cube holds the partial sums X0+X4, X1+X5, X2+X6, X3+X7.]
57
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 2: processors of the sub-cube 01X send their data to the corresponding processors in sub-cube 00X
[Figure: after step 2 processors 0 and 1 hold X0+X4+X2+X6 and X1+X5+X3+X7; active and idle processors are highlighted.]
58
Example 4: Sum of n = 2^d numbers on a d-hypercube
Step 3: processor 001 sends its data to processor 000
[Figure: S = X0+X4+X2+X6+X1+X5+X3+X7.]
The sum of the n numbers is stored on node P0
59
Example 4: Sum of n = 2^d numbers on a d-hypercube
Algorithm: Processor Pi
Input:  1) An array X of n = 2^d numbers; X[i] is assigned to processor Pi
        2) processor identity id
Output: S = X[0] + … + X[n-1], stored on processor P0
Processor Pi
Begin
    My_id = id   (My_id == i)
    S = X[i]
    For j = 0 to (d-1) do
        begin
            Partner = My_id XOR 2^j
            if (My_id AND 2^j) == 0 then begin receive(Si, Partner); S = S + Si end
            if (My_id AND 2^j) != 0 then begin send(S, Partner); exit end
        end
End
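
A sequential simulation of this reduction (assuming d = 3 and sample data); in each dimension j, the processor whose j-th bit is 0 receives and adds its partner's partial sum:

    /* Sum on a d-cube, simulated sequentially (C). */
    #include <stdio.h>
    #define D 3
    #define N (1 << D)
    int main(void) {
        double S[N];                                   /* S[i] = local sum of processor Pi */
        for (int i = 0; i < N; i++) S[i] = i + 1;      /* sample data X[i] = i + 1 */
        for (int j = 0; j < D; j++)                    /* one exchange per dimension */
            for (int i = 0; i < N; i++)
                if ((i & (1 << j)) == 0)               /* lower half of each pair receives */
                    S[i] += S[i ^ (1 << j)];           /* partner = i XOR 2^j sends its sum */
        printf("S = %g (expected %g)\n", S[0], (double)N * (N + 1) / 2);
        return 0;
    }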
Parallel Architecture
Message broadcast on network model
(ring, torus, hypercube)
61
Basic communication
Message broadcast
One-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube
All-to-all broadcast
– Ring
– Mesh (Torus)
– Hypercube
62
Communication cost
Message from Pi to Pj over l links:
    Communication cost = ts + tw * m * l
ts : message preparation (startup) time
m : message length
tw : unit (per byte) transfer time
l : number of links traversed by the message
63
Communication cost
Communication time bounds:
– Ring: (ts + tw m) * p/2
– Mesh: (ts + tw m) * (p^(1/2))/2
– Hypercube: (ts + tw m) * log2 p
The bound depends on the maximum number of links traversed by the message
64
One-to-All broadcast
Simple solution
P0 sends message M0 to processors P1, P2, …, Pp-1 successively.
[Figure: M0 goes from P0 to P1, then from P0 to P2, then from P0 to P3, …, finally from P0 to Pp-1; after the i-th send the set of processors holding M0 is (P0 P1 … Pi).]
Communication cost = (ts + tw m0) * Σ i = (ts + tw m0) * (p(p-1)/2)
65
One-to-all Broadcast
One processor sends a message M to all the other processors
[Figure: before, only processor 0 holds M; after the one-to-all broadcast, every processor 0 … p-1 holds M. The dual operation is accumulation.]
66
All-to-all broadcast
All-to-All broadcast: p simultaneous One-to-All broadcasts, where each processor Pi initiates a communication.
[Figure: before, processor i holds only Xi; after, every processor holds X0, X1, …, Xp-1. The dual operation is accumulation towards several nodes.]
Parallel Architecture
Examples of message broadcasts
68
Example 1: One-to-All broadcast on a ring
Each processor forwards the message to the next processor. Initially, the message is sent in both directions.
[Figure: ring of processors 0 … 7; the message leaves node 0 in both directions and needs at most ⌈p/2⌉ parallel steps to reach every node.]
Communication cost:
    T = (ts + tw * m) * ⌈p/2⌉, where p is the number of processors
Example 2: One-to-All broadcast on a Torus
Two phases : Phase 1 : One-to-All broacast on first row
0
4
8
12
1
5
9
13
2
6
10
14
3
7
11
15
1 2
2
70
Example 2: One-to-All broadcast on a Torus
Phase 2: parallel one-to-all broadcasts in the columns
[Figure: every node of the first row broadcasts down its column (steps 3 and 4), so that all 16 nodes receive the message.]
71
Example 2: One-to-All broadcast on a Torus
Communication cost:
Broadcast on the row: Tcom = (ts + tw m) * (p^(1/2))/2
Broadcast on the columns: Tcom = (ts + tw m) * (p^(1/2))/2
Total: T = 2 * (ts + tw * m) * (p^(1/2))/2, where p is the number of processors
72
Example 3: One-to-All broadcast on a Hypercube
Requires d steps. Each step doubles the number of active processors.
[Figure: 3-cube with nodes 0 … 7; the labels 1, 2, 3 on the edges indicate the step at which the message crosses that edge.]
Communication cost:
    T = (ts + tw * m) * log2 p, where p is the number of processors
73
Example 3: One-to-All broadcast on a Hypercube
Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube.
The broadcast can be performed in O(log n) steps as follows
[Figure: initial distribution of the data; only node 0 holds X.]
74
Example 3: One-to-All broadcast on a Hypercube
•Step 1: Processor Po sends X to processor P1•Step 2: Processors P0 and P1 send X to P2 and P3 respectively•Step 3: Processor P0, P1, P2 and P3 send X to P4, P5, P6 and P7
X
0 1
32
4 5
6 7
XStep 1 Step 2
0 1
32
4 5
6 7
X X
X X
X X
0 1
32
4 5
6 7
XX
X X
XX
Step 3
P Active processors P Idle processors
75
Example 3: One-to-All broadcast on a Hypercube
Algorithm for a broadcast of X on a d-hypercube (p = 2^d processors)
Input:  1) X assigned to processor P0  2) processor identity id
Output: all processors Pi contain X
Processor Pi
Begin
    If i = 0 then B = X
    My_id = id   (My_id == i)
    For j = 0 to (d-1) do
        if My_id < 2^(j+1) then begin
            Partner = My_id XOR 2^j
            if My_id > Partner then receive(B, Partner)
            if My_id < Partner then send(B, Partner)
        end
End
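
A sequential simulation of this broadcast (assuming d = 3): in dimension j, the processors whose index is below 2^(j+1) pair up, and the higher member of each pair receives X from the lower one:

    /* One-to-all broadcast on a d-cube, simulated sequentially (C). */
    #include <stdio.h>
    #define D 3
    #define N (1 << D)
    int main(void) {
        int has[N] = {0};                              /* has[i] != 0 iff Pi already holds X */
        double B[N];
        B[0] = 42.0; has[0] = 1;                       /* X initially on P0 */
        for (int j = 0; j < D; j++)                    /* one exchange per dimension */
            for (int id = 0; id < N; id++) {
                int partner = id ^ (1 << j);
                if (id < (1 << (j + 1)) && id > partner) {   /* receiver of this step */
                    B[id] = B[partner];
                    has[id] = 1;
                }
            }
        for (int i = 0; i < N; i++)
            printf("P%d holds X: %s\n", i, has[i] ? "yes" : "no");
        return 0;
    }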
76
All-to-all Broadcast on a ring
Step 1
[Figure: ring of processors 0 … 7. Each processor i sends its own message (i) to its neighbour; after step 1 each processor also holds the message of its predecessor.]
77
All-to-all Broadcast on a ring
Step 2
[Figure: state after step 1 — processor i holds (i, i-1); at step 2 each processor forwards the message it received at step 1.]
78
All-to-all Broadcast on a ring
Step 3
[Figure: state after step 2 — processor i holds (i, i-1, i-2); at step 3 each processor forwards the message it received at step 2.]
79
All-to-all Broadcast on a ring
Step 7
[Figure: after the last step (p-1 = 7), every processor holds all eight messages (i, i-1, …, i-7).]
80
All-to-all Broadcast on a 2-dimensional Torus
Two phases
– Phase 1: All-to-all broadcast on each row. After this phase, each processor Pi holds a message of size Mi = (p^(1/2)) m
– Phase 2: All-to-all broadcast in the columns
81
All-to-all Broadcast
[Figure: 3×3 mesh, processors 0 … 8, each holding its own message (i). Start of phase 1: all-to-all broadcast on the rows.]
82
All-to-all Broadcast
[Figure: after phase 1, the processors of each row all hold that row's messages: (0,1,2), (3,4,5), (6,7,8). Start of phase 2: all-to-all broadcast on the columns.]
83
All-to-all Broadcast
Communication cost = cost of phase 1 + cost of phase 2
    = (p^(1/2) - 1)(ts + tw m) + (p^(1/2) - 1)(ts + tw (p^(1/2)) m)
Parallel Algorithms and Computing
Selected topics
Sorting in Parallel
85
Performance measures
Speedup
Efficiency
Work-Time
Amdahl's law
86
Speedup
Speedup: S(p), where p is the number of processors used by the parallel solution
    S(p) = T(1) / T(p)
T(1): sequential execution time; T(p): parallel execution time with p processors
– S(p) < 1 : poor performance (the parallel solution is worse than the sequential one)
– 1 < S(p) ≤ p : normal speedup
– p < S(p) : hyper-speedup (not very frequent)
[Figure: S(p) versus p; the ideal speedup is the line S(p) = p.]
87
Speedup
Is hyper-speedup normal? It can come from a poor (non-optimal) sequential algorithm, or because storage space (memory) is a factor.
88
Efficiency
Efficiency: E(p) = S(p) / p
– 0 < E(p) ≤ 1 : normal
– 1 < E(p) : hyper-speedup
Interest: speedup is the user's point of view; efficiency is the manager's point of view; speedup and efficiency together are the designer's point of view.
89
Amdahl's law
A program consists of two parts: a sequential part and a parallel part.
    T(1) = Tseq = Ts + Tp
    T(p) = Ts + Tp / p
90
Amdahl's law
Bound on speedup
    S(p) = T(1) / T(p) = Tseq / (Ts + Tp / p)
    S(p) = 1 / (Ts/Tseq + (1 - Ts/Tseq) / p)
91
Amdahl's law
Bound on speedup: sequential fraction (fs) and parallel fraction (fp):
    fs = Ts / T(1), fp = Tp / T(1), 0 ≤ fs, fp ≤ 1, fs + fp = 1 (with T(1) = Tseq)
The speedup can be rewritten as
    S(p) = 1 / (fs + (1 - fs) / p)
92
Amdahl's law
Bound on speedup
    S(p) = 1 / (fs + (1 - fs) / p)
    lim (p → ∞) S(p) = lim (p → ∞) 1 / (fs + (1 - fs)/p) = 1 / fs
    S(p) ≤ 1 / fs
93
Amdahl's law
Bound on speedup
For example, if fs is equal to 1%, S(p) is less than 100.
[Figure: S(p) versus p; the curve saturates at the asymptote 1/fs.]
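
A quick numerical check of the bound (a small C sketch, with fs = 0.01 as in the example):

    /* Amdahl's speedup bound for a sequential fraction fs = 0.01. */
    #include <stdio.h>
    int main(void) {
        double fs = 0.01;                            /* sequential fraction */
        int procs[] = {1, 10, 100, 1000, 10000};
        for (int i = 0; i < 5; i++) {
            double p = procs[i];
            double s = 1.0 / (fs + (1.0 - fs) / p);  /* S(p) = 1 / (fs + (1 - fs)/p) */
            printf("p = %5d  S(p) = %7.2f\n", procs[i], s);
        }
        printf("Asymptotic bound 1/fs = %.1f\n", 1.0 / fs);
        return 0;
    }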
94
Amdahl's law
The above speedup bound does not take communication and synchronization overheads into account.
Overhead:
    T(p) = Ts + Tp / p + Toverhead
    Sreal(p) ≤ S(p)
95
Parallel sorting
Types of sorting algorithms: properties
– processor ordering determines the order of the final result
– where the input and output are stored
– basic compare-exchange operation
96
Issues in sorting algorithms
Internal / External sort:
– Internal: the data fits in the processor memory (RAM). Performance is based on comparisons and basic operations; complexity O(n log n).
– External: the data resides in memory and on disk. Performance is based on basic operations and on overlapping computation with I/O.
97
Issues in sorting algorithms
Comparison-based: executes comparisons and permutations.
Non-comparison-based: ordering based on properties of the keys.
98
Issues in sorting algorithms
Internal sort (shared memory: PRAM)
– the processors share the data and must minimize memory access conflicts
– each processor sorts part of the data in memory
99
Issues in sorting algorithms
Internal sort (distributed memory)
– each processor is assigned a block of N/P elements
– each processor locally sorts its assigned block (using any internal sort algorithm)
Input: distributed among the processors. Output: stored on the processors.
Final order: the processor order defines the final ordering of the list.
[Figure: the initial data is split into blocks of N/P elements per processor; P1 < P2 < P3 in the final ordering.]
100
Issues in sorting algorithms
Internal sort (distributed memory)
Example: the final order is defined by the Gray-code labelling of the processors
[Figure: 3-cube whose nodes (0) … (7) are visited in Gray-code order 1, 2, 3, 4, 5, ….]
101
Issues in sorting algorithms
Building block: the compare-exchange operation
– Sequential: a single CPU holds (ai, aj) in RAM, tests (ai < aj)?, and swaps ai ↔ aj if they are out of order.
– Parallel: ai is on processor P(i) and aj on P(i+1). The two processors exchange values; P(i) keeps ai = min(ai, aj) (Exchange-Compare-Min with its right neighbour) and P(i+1) keeps aj = max(ai, aj) (Exchange-Compare-Max with its left neighbour).
102
Issues in sorting algorithms
Compare-exchange with N/p elements per processor
[Example: P(i) holds the sorted block 1 6 8 11 13 62 and P(i+1) holds 2 7 9 10 12 63. The processors exchange their blocks and each merges the two; P(i) keeps the N/p smallest elements (Exchange-compare-min with P(i+1)) and P(i+1) keeps the N/p largest elements (Exchange-compare-max with P(i-1)).]
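
A minimal compare-split sketch (assuming a block size of 6 and the sample values above): each side merges the two sorted blocks, then the left side keeps the smaller half and the right side the larger half:

    /* Compare-split of two sorted blocks of N/p elements (C). */
    #include <stdio.h>
    #define B 6
    static void compare_split(double *lo, double *hi) {
        double merged[2 * B];
        int a = 0, b = 0;
        for (int k = 0; k < 2 * B; k++)          /* merge the two sorted blocks */
            merged[k] = (b >= B || (a < B && lo[a] <= hi[b])) ? lo[a++] : hi[b++];
        for (int k = 0; k < B; k++) {            /* split the merged sequence   */
            lo[k] = merged[k];                   /* smallest B stay on P(i)     */
            hi[k] = merged[B + k];               /* largest B go to P(i+1)      */
        }
    }
    int main(void) {
        double pi[B]  = {1, 6, 8, 11, 13, 62};   /* block on P(i)   */
        double pi1[B] = {2, 7, 9, 10, 12, 63};   /* block on P(i+1) */
        compare_split(pi, pi1);
        for (int k = 0; k < B; k++) printf("%g ", pi[k]);
        printf("| ");
        for (int k = 0; k < B; k++) printf("%g ", pi1[k]);
        printf("\n");
        return 0;
    }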
Example: Odd-Even Merge Sort
1. Start from an unsorted list of n elements.
2. Divide the list into two lists of n/2 elements: A0 A1 … AM-1 and B0 B1 … BM-1.
3. Sort each sub-list.
4. Divide each sorted sub-list into its even- and odd-index sub-lists: A0 A2 … AM-2 and A1 A3 … AM-1 (similarly for B).
5. Merge-sort the odd and even sub-lists into E0 E1 … EM-1 and O0 O1 … OM-1.
6. Interleave the two lists as E0 O0 E1 O1 … EM-1 OM-1 and exchange the out-of-position elements.
Where is the parallelism? The independent sub-lists (1×N, then 2×N/2, then 4×N/4, …) can be handled by different processors.
105
Example: Odd-Even Merge Sort
Key to the merge-sort algorithm: the method used to merge the sorted sub-lists.
Consider 2 sorted lists of m = 2^k elements:
A = a0, a1, …, am-1 and B = b0, b1, …, bm-1
Even(A) = a0, a2, …, am-2 , Odd(A) = a1, a3, …, am-1
Even(B) = b0, b2, …, bm-2 , Odd(B) = b1, b3, …, bm-1
106
Example: Odd-Even Merge Sort
Create 2 merged lists:
– merge Even(A) and Odd(B) into E = E0 E1 … Em-1
– merge Even(B) and Odd(A) into O = O0 O1 … Om-1
Merge E and O as follows to create a list L':
    L' = E0 O0 E1 O1 … Em-1 Om-1
Exchange the out-of-order elements of L' to obtain L.
107
Example: Odd-Even Merge Sort
A = 2, 3, 4, 8 and B = 1, 5, 6, 7
Even(A) = 2, 4 and Odd(A) = 3, 8; Even(B) = 1, 6 and Odd(B) = 5, 7
E = 2, 4, 5, 7 and O = 1, 3, 6, 8
L' = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8
L = 1, 2, 3, 4, 5, 6, 7, 8
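
A small C sketch of this merging scheme, using the example values above (the interleave-then-exchange step only compares adjacent pairs, as on the slide):

    /* Odd-even merge of two sorted lists of m = 4 elements (C). */
    #include <stdio.h>
    #define M 4
    static void merge(const double *x, const double *y, int n, double *out) {
        int a = 0, b = 0;
        for (int k = 0; k < 2 * n; k++)
            out[k] = (b >= n || (a < n && x[a] <= y[b])) ? x[a++] : y[b++];
    }
    int main(void) {
        double A[M] = {2, 3, 4, 8}, B[M] = {1, 5, 6, 7};
        double evenA[M/2], oddA[M/2], evenB[M/2], oddB[M/2];
        for (int i = 0; i < M / 2; i++) {
            evenA[i] = A[2*i]; oddA[i] = A[2*i + 1];
            evenB[i] = B[2*i]; oddB[i] = B[2*i + 1];
        }
        double E[M], O[M], L[2 * M];
        merge(evenA, oddB, M / 2, E);            /* E = 2 4 5 7 */
        merge(evenB, oddA, M / 2, O);            /* O = 1 3 6 8 */
        for (int i = 0; i < M; i++) {            /* interleave: E0 O0 E1 O1 ... */
            L[2*i] = E[i];
            L[2*i + 1] = O[i];
        }
        for (int i = 0; i + 1 < 2 * M; i += 2)   /* exchange out-of-position pairs */
            if (L[i] > L[i+1]) { double t = L[i]; L[i] = L[i+1]; L[i+1] = t; }
        for (int i = 0; i < 2 * M; i++) printf("%g ", L[i]);
        printf("\n");
        return 0;
    }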
Parallel sorting
Quicksort
109
Review: Quicksort
Recall: sequential Quicksort
Recursively:
• choose a pivot
• divide the list in two using the pivot
• sort the left and right sub-lists
Performance:
– best case: log n levels, O(n log n) steps
– worst case: n(n-1)/2 ≈ n²/2 comparisons, i.e. O(n²)
110
Review: Quicksort
Sequential Quicksort

/* swap A[i] and A[j] */
void exchange(double *A, int i, int j) { double t = A[i]; A[i] = A[j]; A[j] = t; }

void Quicksort(double *A, int q, int r)
{
    int s, i;
    double pivot;
    if (q < r) {
        /* divide A using the pivot */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* recursive calls to sort the two sub-lists */
        Quicksort(A, q, s - 1);
        Quicksort(A, s + 1, r);
    }
}
111
Review: Quicksort
Create a binary tree of processors, one new processor for each recursive call of Quicksort.
Easy to implement, but can be inefficient performance-wise.
112
Review: Quicksort
Shared-memory implementation (using the fork() primitive, from <unistd.h>)

double A[nmax];

void quicksort(int q, int r)
{
    int s, i, n;
    double pivot;
    if (q < r) {
        /* partition */
        pivot = A[q];
        s = q;
        for (i = q + 1; i <= r; i++) {
            if (A[i] <= pivot) {
                s = s + 1;
                exchange(A, s, i);
            }
        }
        exchange(A, q, s);
        /* create a new process for one of the recursive calls */
        n = fork();
        if (n == 0)
            quicksort(q, s - 1);      /* child sorts the left sub-list  */
        else
            quicksort(s + 1, r);      /* parent sorts the right sub-list */
    }
}
113
Quicksort on a d-hypercube
d steps: all processors are active in each step. A processor is assigned N/p elements (p = 2^d).
Steps of the solution:
– initially (step 0), one pivot is chosen and broadcast to all processors
– each processor partitions its elements into two sub-lists: one with the elements smaller (inferior) than the current pivot, the other with the elements greater or equal (superior)
– exchange the inferior and superior sub-lists along the current dimension (the highest dimension at step 0), creating two sub-cubes, one for the inferior lists and the other for the superior lists
– each processor merges the list it kept with the list it received
– repeat within each sub-cube
114
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 0: pivot P0
Division along dimension 3. Two blocks of elements are created:
• one block of elements less than pivot P0 (sub-cube 0XX)
• one block of elements greater than or equal to P0 (sub-cube 1XX)
[Figure: the cube 000 … 111 split into the sub-cubes 0XX and 1XX.]
115
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 1: pivot P10 in sub-cube 0XX, pivot P11 in sub-cube 1XX
Division along dimension 2: each sub-cube is divided into two smaller sub-cubes (00X / 01X and 10X / 11X), separating the elements < pivot from the elements ≥ pivot.
[Figure: the two sub-cubes after the step-1 exchanges.]
116
Quicksort on a d-hypercube
Example on a 3-hypercube
Step 2: pivots P20, P21, P22, P23 (one per 2-node sub-cube)
Division along dimension 1. The final order is defined by the label ordering of the processors.
[Figure: each pair of neighbouring processors (000/001, 010/011, 100/101, 110/111) separates the elements < pivot from the elements ≥ pivot.]
117
Quicksort on a d-hypercube
Example on a 3-hypercube
Final step: each processor sorts its final list, using for example a sequential quicksort.
[Figure: each node applies a local sort; some nodes may end up with an empty list {}.]
118
Quicksort on a d-hypercube
Data exchange at the initial step: sub-cubes 0XX and 1XX
[Figure: the pivot P0 is broadcast to both sub-cubes; every processor splits its list into a part < P0 and a part ≥ P0; the processors of 0XX exchange their superior sub-lists against the inferior sub-lists of their partners in 1XX.]
Sort the sub-lists at the end of each step?
119
Quicksort on a d-hypercube
Algorithm: Processor k (k = 0, …, p-1)
Hypercube-Quicksort(B, d) {
    /* B contains the elements assigned to processor k */
    /* d is the hypercube dimension */
    int i; double x, B1[], B2[], T[];
    my-id = k;                         /* processor id */
    for (i = d-1 down to 0) {
        x = pivot(my-id, i);
        partition(B, x, B1, B2);       /* B1 inferior sub-list, B2 superior sub-list */
        if ((my-id AND 2^i) == 0) {    /* i-th bit is 0 */
            send(B2, my neighbour in dimension i);
            receive(T, my neighbour in dimension i);
            B = B1 ∪ T;
        } else {
            send(B1, my neighbour in dimension i);
            receive(T, my neighbour in dimension i);
            B = B2 ∪ T;
        }
    }
    Sequential-Quicksort(B);
} End Hypercube-Quicksort
120
Quicksort on a d-hypercube
Choice of pivot
More important for performance than in the sequential case. It has a great impact on:
– the load balance between processors
– the performance of the algorithm (the performance degrades quickly with a bad pivot)
121
Quicksort on a d-hypercube
Worst case: at step 0, the largest element of the list is selected as the pivot
    Pivot0 = x = max{ Xi }
[Figure: after the exchange, half of the processors hold almost all the elements and are overloaded, while the other half hold empty lists {} and stay idle.]
122
Quicksort on a d-hypercube
Choice of pivot: ideal case
In parallel do:
– sort the initial list assigned to each processor
– choose the median element of one of the processors of the cube
– assuming a uniform distribution of the elements of the list
[Figure: the median element of the list assigned to a processor Pi approximates the median element of the whole list.]
123
Quicksort on a d-hypercube
Steps of the algorithm and their time complexity:
– local sort of the assigned list: O((N/p) log(N/p))
– selection of the pivot by one processor: O(1 * d)
– broadcast of the pivot in the sub-hypercube of dimension d-i: O(Σ i, i = 1..d) = O(d²) = O(log² p)
– division based on the pivot (binary search): O(d * log(N/p))
– exchange of sub-lists between neighbours: O(d * N/p)
– merge of the sorted sub-lists: O(d * N/p)
– repeat (d = log p iterations in total)
124
Parallel Quicksort on a PRAM
Parallel QUICKSORT algorithm
/* The solution constructs a binary tree of processors which is traversed IN-ORDER to yield the sorted list */
Variables shared by all processors:
    root : root of the global binary tree
    A[n] : an array of n elements (1, 2, …, n)
    Leftchild[i] : the root of the left sub-tree of processor i (i = 1, 2, …)
    Rightchild[i] : the root of the right sub-tree of processor i (i = 1, 2, …)
125
Parallel Quicksort on a PRAM
Process /* do in parallel for each processor i */
begin
    Root := i;        /* concurrent write: one processor wins and becomes the root */
    Parent := Root;
    Leftchild[i] := Rightchild[i] := n+1;
end
Repeat for each processor i ≠ root do
begin
    if (A[i] < A[Parent]) then        /* ties can be broken using the processor index */
    begin
        Leftchild[Parent] := i        /* concurrent write */
        if i = Leftchild[Parent] then exit
        else Parent := Leftchild[Parent]
    end
    else
    begin
        Rightchild[Parent] := i       /* concurrent write */
        if i = Rightchild[Parent] then exit
        else Parent := Rightchild[Parent]
    end
end repeat
end process
126
Parallel Quicksort on a PRAM
Example
A = 33 21 13 54 82 33 40 72, held by processors 1 … 8; root = processor 4 (key 54)
Step 0: the binary tree contains only the root [4]{54}; Leftchild[i] = Rightchild[i] = 9 (= n+1) for every i.
127
Parallel Quicksort on a PRAM
Example (root = processor 4)
Step 1: processor 1 wins the competition for the left sub-tree of 4 and processor 5 wins it for the right sub-tree: Leftchild[4] = 1, Rightchild[4] = 5.
[Figure: tree with root [4]{54}, left child [1]{33} and right child [5]{82}; processors 2, 3, 6, 7, 8 keep competing.]
128
Parallel Quicksort on a PRAM
Example (root = processor 4)
[Figure: next step; processors 2, 3, 6, 7 and 8 compete for the sub-trees of the nodes inserted so far, updating the Leftchild and Rightchild arrays.]
129
Parallel Quicksort on a PRAM
Example
[Figure: binary tree after the following steps: root [4]{54} with left child [1]{33} and right child [5]{82}; [2]{21} attaches under [1]; [6]{33} and [8]{72} attach under [5]; processors 3 and 7 continue.]