Parallel Prefix and Data Parallel Operations
Motivation: basic parallel operations that occur repeatedly. Let ⊗ be an associative operation:
$(a_1 \otimes a_2) \otimes a_3 = a_1 \otimes (a_2 \otimes a_3)$
How can we compute $a_1 \otimes a_2 \otimes \cdots \otimes a_n$, together with all of its prefixes, in parallel in $O(\log n)$ time?
Approach 1
Notation: [i:j] denotes $a_i \otimes a_{i+1} \otimes \cdots \otimes a_j$.
start:  a0    a1    a2    a3    a4    a5    a6    a7
d=1:    [0:0] [0:1] [1:2] [2:3] [3:4] [4:5] [5:6] [6:7]
d=2:    [0:0] [0:1] [0:2] [0:3] [1:4] [2:5] [3:6] [4:7]
d=4:    [0:0] [0:1] [0:2] [0:3] [0:4] [0:5] [0:6] [0:7]
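Assume that $n = 2^k$.
for i = 0 to k-1
  for j = 0 to n-1-2^i do in parallel
    x[j + 2^i] = x[j] + x[j + 2^i]
(Within a step, all reads happen before any writes, so each update uses the values from the previous step.)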
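The doubling loop can be checked with a short sequential simulation. This is a sketch, assuming + as the associative operation; copying the array before each step models the synchronous "do in parallel" semantics, and the name `doubling_prefix` is our own.

```python
def doubling_prefix(a):
    """Inclusive +-prefix by recursive doubling (Approach 1), simulated sequentially."""
    x = list(a)
    n = len(x)
    d = 1  # d plays the role of 2^i in the pseudocode
    while d < n:
        old = x[:]  # synchronous step: every read sees the previous step's values
        for j in range(n - d):  # j = 0 .. n-1-2^i, all "in parallel"
            x[j + d] = old[j] + old[j + d]
        d *= 2
    return x

print(doubling_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

The simulation does not need n to be a power of two; the assumption only makes the step count exactly k.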
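How to do this on a tree architecture?
For each node:
• a leaf sends its value X to its parent
• if there is a signal from the left and right children, keep Sl and send St <- Sl + Sr to the parent
• if there is a signal R from the parent (the root starts with R = 0), send R to the left child and R + Sl to the right child
• if the node is a leaf and there is a signal R, X <- X + R
(Node diagram: Sl and Sr come up from the children, St goes up to the parent, R comes down from the parent.)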
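The two sweeps can be simulated on a complete binary tree stored as an array. This is a minimal sketch, assuming n is a power of two and + is the operation; the heap layout (root at index 1, children of v at 2v and 2v+1) and the name `tree_prefix` are our own choices, not taken from the slides.

```python
def tree_prefix(a):
    """Inclusive +-prefix on a simulated complete binary tree (n a power of two)."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "assume n is a power of two"
    # Heap layout: node 1 is the root, node v has children 2v and 2v+1,
    # and the leaves n .. 2n-1 hold the inputs.
    S = [0] * n + list(a)             # each leaf sends its value X upward
    Sl = [0] * n                      # left-child sum kept by each internal node
    for v in range(n - 1, 0, -1):     # upward sweep: children before parents
        Sl[v] = S[2 * v]
        S[v] = S[2 * v] + S[2 * v + 1]    # St <- Sl + Sr
    R = [0] * (2 * n)                     # the root receives R = 0
    for v in range(1, n):                 # downward sweep: parents before children
        R[2 * v] = R[v]                   # left child receives R
        R[2 * v + 1] = R[v] + Sl[v]       # right child receives R + Sl
    return [S[n + i] + R[n + i] for i in range(n)]   # leaf: X <- X + R

print(tree_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

On real tree hardware, each level of a sweep runs in parallel, giving 2 log n steps in total.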
How to do this on a hypercube
A complete binary tree can be embedded into a hypercube. A simpler solution: each node j computes its prefix x[j] and a total sum[j], both starting at a_j:
for i = 0 to k-1
  for j = 0 to n-1 do in parallel
    x[j] = x[j] + sum[j_i]   if the i-th bit of j = 1
    sum[j] = sum[j] + sum[j_i]
where j_i and j have the same binary representation except for their i-th bit: the i-th bit of j_i is the complement of the i-th bit of j.
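A sequential simulation of the hypercube prefix, as a sketch assuming + as the operation and n = 2^k nodes. The partner j_i is computed as `j ^ (1 << i)`; the name `hypercube_prefix` is hypothetical.

```python
def hypercube_prefix(a):
    """Simulate the hypercube prefix; returns (x, sum) after k dimension steps."""
    n = len(a)
    k = n.bit_length() - 1
    assert n == 1 << k, "assume n = 2^k nodes"
    x = list(a)   # x[j]: prefix held by node j
    s = list(a)   # s[j]: total of the subcube that node j currently belongs to
    for i in range(k):
        # every node reads its neighbor's sum across dimension i before any write
        partner_sum = [s[j ^ (1 << i)] for j in range(n)]
        for j in range(n):
            if (j >> i) & 1:              # neighbor j_i lies in the lower half
                x[j] = x[j] + partner_sum[j]
            s[j] = s[j] + partner_sum[j]
    return x, s

x, s = hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8])
print(x)  # [1, 3, 6, 10, 15, 21, 28, 36]
print(s)  # [36, 36, 36, 36, 36, 36, 36, 36]
```

Because + is commutative, the order of the operands does not matter here; for a non-commutative ⊗, the partner's sum would have to be placed on the left when bit i of j is 1.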
Prefix on Hypercube
Example with n = 8: node j starts with x[j] = sum[j] = a_j (a0, a1, ..., a7).
node:  0     1     2     3     4     5     6     7
After d=1
X:     [0:0] [0:1] [2:2] [2:3] [4:4] [4:5] [6:6] [6:7]
SUM:   [0:1] [0:1] [2:3] [2:3] [4:5] [4:5] [6:7] [6:7]
After d=2
X:     [0:0] [0:1] [0:2] [0:3] [4:4] [4:5] [4:6] [4:7]
SUM:   [0:3] [0:3] [0:3] [0:3] [4:7] [4:7] [4:7] [4:7]
After d=4
X:     [0:0] [0:1] [0:2] [0:3] [0:4] [0:5] [0:6] [0:7]
SUM:   [0:7] [0:7] [0:7] [0:7] [0:7] [0:7] [0:7] [0:7]
Applications of Data Parallel Operations
Any associative operation can be used. Examples:
– min, max, add
– adding two binary numbers
– finite state automata
– radix sort
– segmented prefix sum
– routing
  • packing
  • unpacking
  • broadcast (copy-scan)
– solving recurrence equations
– straight-line computation (parallel arithmetic evaluation)
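Adding two n-bit numbers as a parallel prefix
• $a = a_{n-1} \cdots a_0$
• $b = b_{n-1} \cdots b_0$
• $s = a + b$
• note that $s_i = a_i \oplus b_i \oplus c_{i-1}$, where $c_i$ is the carry out of bit i and $c_{-1} = 0$
• to compute $c_i$, define the generate and propagate bits g and p as:
  $g_i = a_i \wedge b_i, \quad p_i = a_i \vee b_i$
• define the operator $\circ$ as: $(g, p) \circ (g', p') = (g \vee (p \wedge g'),\ p \wedge p')$
Then the carry bit $c_i$ can be computed by: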
$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_0, p_0)$
and $G_i = c_i$.
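The carry computation can be written as a prefix with the operator ∘. This is a sketch, assuming $p_i = a_i \vee b_i$ and bit lists stored least significant bit first. `scan` is a sequential simulation of the doubling prefix; it calls `op(earlier, later)`, so the more significant pair goes on the left of ∘. All function names here are our own.

```python
def scan(xs, op):
    """Inclusive prefix by recursive doubling; op(earlier, later) must be associative."""
    x = list(xs)
    d = 1
    while d < len(x):
        x = [x[j] if j < d else op(x[j - d], x[j]) for j in range(len(x))]
        d *= 2
    return x

def circ(hi, lo):
    """(g, p) o (g', p') = (g or (p and g'), p and p'); hi is the more significant pair."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def add_bits(a_bits, b_bits):
    """Add two n-bit numbers given LSB-first bit lists; returns n + 1 bits, LSB first."""
    gp = [(ai & bi, ai | bi) for ai, bi in zip(a_bits, b_bits)]   # (g_i, p_i)
    GP = scan(gp, lambda earlier, later: circ(later, earlier))     # (G_i, P_i)
    c = [G for G, _ in GP]                                         # c_i = G_i
    s = [a_bits[i] ^ b_bits[i] ^ (c[i - 1] if i > 0 else 0) for i in range(len(gp))]
    return s + [c[-1]]

def to_bits(v, n):
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

print(from_bits(add_bits(to_bits(11, 4), to_bits(6, 4))))  # 17
```

Choosing $p_i = a_i \oplus b_i$ instead gives the same carries; the sum formula does not depend on which definition is used.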
Hardware circuit of a recursive look-ahead adder
(Figure omitted: circuit with 16-bit inputs a0 … a15 and b0 … b15.)
Parsing a regular language
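States q0, q1, q2 and a reject state qr over the alphabet {b, c}:
δ(q0, b) = q2, δ(q0, c) = q1, δ(q1, b) = q0, δ(q1, c) = qr, δ(q2, b) = qr, δ(q2, c) = q0 (qr: reject state)
Each input symbol is replaced by its transition vector, i.e. the states reached from (q0, q1, q2):
b ↦ (q2, q0, qr), c ↦ (q1, qr, q0)
Composition of transition vectors is associative, so a parallel prefix over the input gives the state after every prefix of the input. For example, bc ↦ (q0, q1, qr) and cb ↦ (q0, qr, q2).
For the input b c c b c started in q0, the prefixes reach q2, q0, q1, q0, q1.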
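A sketch of DFA parsing as a prefix over transition vectors. Assumptions: the reject state qr is absorbing (it maps to itself on every symbol), states are encoded as indices 0..3, and `compose`, `scan` and `states_after_prefixes` are hypothetical helper names.

```python
STATES = ["q0", "q1", "q2", "qr"]
# DELTA[symbol][s] = index of the next state from state index s.
# Assumption: the reject state qr (index 3) is absorbing.
DELTA = {
    "b": [2, 0, 3, 3],  # q0->q2, q1->q0, q2->qr, qr->qr
    "c": [1, 3, 0, 3],  # q0->q1, q1->qr, q2->q0, qr->qr
}

def scan(xs, op):
    """Inclusive prefix by recursive doubling; op(earlier, later) must be associative."""
    x = list(xs)
    d = 1
    while d < len(x):
        x = [x[j] if j < d else op(x[j - d], x[j]) for j in range(len(x))]
        d *= 2
    return x

def compose(first, then):
    """Transition vector for reading the symbols of `first`, then those of `then`."""
    return [then[s] for s in first]

def states_after_prefixes(word, start=0):
    prefixes = scan([DELTA[ch] for ch in word], compose)
    return [STATES[f[start]] for f in prefixes]

print(states_after_prefixes("bccbc"))  # ['q2', 'q0', 'q1', 'q0', 'q1']
```

Function composition is associative but not commutative, so the scan has to keep operands in input order.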
Segmented Prefix operation
before: 1 2 | 3 4 5 6 | 7 8
after:  1 3 | 3 7 12 18 | 7 15
(| marks a segment boundary)
Segmented Prefix computation
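Let ⊗ be any associative operation. For the segmented version of ⊗, write |b for an element b that starts a new segment, and define ⊗′ as follows:
a ⊗′ b = a ⊗ b
a ⊗′ |b = |b
|a ⊗′ b = |(a ⊗ b)
|a ⊗′ |b = |b
Then ⊗′ is associative, and we can compute the segmented operation in O(log n) time with the ordinary parallel prefix.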
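The four cases of ⊗′ can be encoded on pairs (starts_segment, value). This sketch assumes that encoding; `segmented_op`, `segmented_scan` and `scan` are our own names. It reproduces the before/after example of the segmented prefix sum.

```python
def scan(xs, op):
    """Inclusive prefix by recursive doubling; op(earlier, later) must be associative."""
    x = list(xs)
    d = 1
    while d < len(x):
        x = [x[j] if j < d else op(x[j - d], x[j]) for j in range(len(x))]
        d *= 2
    return x

def segmented_op(op):
    """Build the segmented operator from op, on pairs (starts_segment, value)."""
    def op_prime(x, y):
        fx, a = x
        fy, b = y
        if fy:                      # a (x)' |b = |b  and  |a (x)' |b = |b
            return (True, b)
        return (fx, op(a, b))       # a (x)' b = a (x) b  and  |a (x)' b = |(a (x) b)
    return op_prime

def segmented_scan(values, starts, op):
    pairs = scan(list(zip(starts, values)), segmented_op(op))
    return [v for _, v in pairs]

values = [1, 2, 3, 4, 5, 6, 7, 8]
starts = [True, False, True, False, False, False, True, False]
print(segmented_scan(values, starts, lambda a, b: a + b))  # [1, 3, 3, 7, 12, 18, 7, 15]
```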
Enumerating
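data = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated = [0 x 1 2 x x 3 x 4 x]
Each active processor receives its rank among the active processors (an exclusive +-prefix of the active flags); x marks an inactive processor.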
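Enumeration is an exclusive +-prefix over the active flags. In this sketch, a plain sequential loop stands in for the O(log n) parallel prefix, and `None` plays the role of x; the function names are hypothetical.

```python
def exclusive_plus_scan(xs):
    """Exclusive +-prefix; a sequential stand-in for the parallel prefix."""
    out, running = [], 0
    for v in xs:
        out.append(running)
        running += v
    return out

def enumerate_active(active):
    ranks = exclusive_plus_scan(active)
    return [r if flag else None for r, flag in zip(ranks, active)]

print(enumerate_active([1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))
# [0, None, 1, 2, None, None, 3, None, 4, None]
```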
Packing
data = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated = [0 x 1 2 x x 3 x 4 x]
packed data = [5 3 1 7 9 x x x x x]
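Packing sends each active element to the address given by its enumeration. A minimal sketch, assuming a sequential stand-in for the prefix and `None` for empty slots; `pack` is a hypothetical name.

```python
def exclusive_plus_scan(xs):
    """Exclusive +-prefix; a sequential stand-in for the parallel prefix."""
    out, running = [], 0
    for v in xs:
        out.append(running)
        running += v
    return out

def pack(data, active):
    """Move each active element to the slot given by its enumeration."""
    dest = exclusive_plus_scan(active)
    out = [None] * len(data)            # None marks an empty slot ("x")
    for i, flag in enumerate(active):   # one parallel write step
        if flag:
            out[dest[i]] = data[i]
    return out

print(pack([5, 6, 3, 1, 8, 3, 7, 5, 9, 2], [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))
# [5, 3, 1, 7, 9, None, None, None, None, None]
```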
Packing and Unpacking on Hypercube
Packing
• adjust bit 0
• adjust bit 1
• adjust bit 2
• ...
• adjust bit k-1
Unpacking
• adjust bit k-1
• adjust bit k-2
• ...
• adjust bit 1
• adjust bit 0
How about adjusting the bits in the order 0, 1, ..., k-1 for packing?
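The bit-adjustment orders can be checked by simulation. In this sketch, a packet crosses dimension i whenever bit i of its current node differs from bit i of its destination, and the code raises an error if two packets would ever occupy the same node. Assumptions: n = 2^k nodes, at most one packet per node, and the hypothetical helpers `route_by_bits`, `hypercube_pack` and `hypercube_unpack`.

```python
def route_by_bits(packets, order):
    """packets maps current node -> (destination, value). For each dimension i in
    `order`, a packet crosses dimension i if bit i of its node differs from its
    destination. Raises if two packets would land on the same node."""
    for i in order:
        moved = {}
        for node, (dest, val) in packets.items():
            if ((node ^ dest) >> i) & 1:
                node ^= 1 << i
            if node in moved:
                raise RuntimeError(f"collision at node {node}, dimension {i}")
            moved[node] = (dest, val)
        packets = moved
    return packets

def hypercube_pack(data, active):
    n = len(data)
    k = n.bit_length() - 1
    assert n == 1 << k, "assume n = 2^k nodes"
    packets, rank = {}, 0
    for node in range(n):           # enumeration (a +-prefix in the parallel version)
        if active[node]:
            packets[node] = (rank, data[node])
            rank += 1
    packets = route_by_bits(packets, range(k))            # bit 0, 1, ..., k-1
    out = [None] * n
    for node, (_, val) in packets.items():
        out[node] = val
    return out

def hypercube_unpack(packed, active):
    n = len(packed)
    k = n.bit_length() - 1
    assert n == 1 << k, "assume n = 2^k nodes"
    packets, rank = {}, 0
    for node in range(n):           # packed slot `rank` is destined for `node`
        if active[node]:
            packets[rank] = (node, packed[rank])
            rank += 1
    packets = route_by_bits(packets, reversed(range(k)))  # bit k-1, ..., 0
    out = [None] * n
    for node, (_, val) in packets.items():
        out[node] = val
    return out

print(hypercube_pack([5, 6, 3, 1, 8, 3, 7, 5], [1, 0, 1, 1, 0, 0, 1, 0]))
# [5, 3, 1, 7, None, None, None, None]
```

Packing with bits 0 to k-1 never collides. After bits 0 to i-1 are adjusted, two packets share a node only if their sources lie in the same block of 2^i nodes and their ranks agree mod 2^i. Two such ranks differ by at least 1 and by less than 2^i, so this cannot happen. Unpacking traces the same paths backwards.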
Unpacking
Address 0 1 2 3 4 5 6 7 8 9
data = [6 2 3 5 9 x x x x x]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated = [0 x 1 2 x x 3 x 4 x]
destination = [0 2 3 6 8 x x x x x]
unpacked data = [6 x 2 3 x x 5 x 9 x]
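Unpacking reverses packing. Each active processor tells packed slot enumerated[i] its own address (this builds `destination`), and slot r then sends its element to destination[r]. The sketch below assumes a sequential stand-in for the prefix and `None` for x; `unpack` is a hypothetical name.

```python
def exclusive_plus_scan(xs):
    """Exclusive +-prefix; a sequential stand-in for the parallel prefix."""
    out, running = [], 0
    for v in xs:
        out.append(running)
        running += v
    return out

def unpack(packed, active):
    n = len(packed)
    rank = exclusive_plus_scan(active)      # "enumerated"
    destination = [None] * n
    for i in range(n):                      # active proc i tells slot rank[i] its address
        if active[i]:
            destination[rank[i]] = i
    out = [None] * n
    for r in range(n):                      # slot r sends its element to destination[r]
        if destination[r] is not None:
            out[destination[r]] = packed[r]
    return out

print(unpack([6, 2, 3, 5, 9, None, None, None, None, None], [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))
# [6, None, 2, 3, None, None, 5, None, 9, None]
```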
Copy Scan (broadcast)
address 0 1 2 3 4 5 6 7 8 9
data = [ 6 2 3 5 9 4 1 7 8 10]
segmented bit = [ 1 0 1 1 0 0 1 0 1 0]
result = [ 6 6 3 5 5 5 1 1 8 8]
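Copy-scan is a segmented prefix whose operation keeps its left operand (a ⊗ b = a, which is associative). The sketch below assumes the (flag, value) pair encoding and the hypothetical names `scan` and `copy_scan`. It reproduces the example above.

```python
def scan(xs, op):
    """Inclusive prefix by recursive doubling; op(earlier, later) must be associative."""
    x = list(xs)
    d = 1
    while d < len(x):
        x = [x[j] if j < d else op(x[j - d], x[j]) for j in range(len(x))]
        d *= 2
    return x

def copy_scan(data, seg_start):
    def op(x, y):
        fx, a = x
        fy, b = y
        # keep the left value, but a segment start overrides what came before it
        return (fx or fy, b if fy else a)
    pairs = scan(list(zip(seg_start, data)), op)
    return [v for _, v in pairs]

print(copy_scan([6, 2, 3, 5, 9, 4, 1, 7, 8, 10], [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))
# [6, 6, 3, 5, 5, 5, 1, 1, 8, 8]
```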
Radix Sort
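for j = 0 to k-1   // x has k bits; start with bit 0, the least significant bit
  for all i in [0 .. n-1] do in parallel {
    c = number of elements whose j-th bit is 0
    if j-th bit of x[i] is 0: y[i] = enumerate
    if j-th bit of x[i] is 1: y[i] = enumerate + c
    x[y[i]] = x[i]
  }
(enumerate gives the rank of x[i] among the elements with the same j-th bit)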
Radix sort, another version
for j = 0 to k-1   // least significant bit first
  for all i in [0 .. n-1] do in parallel {
    pack left x[i] if the j-th bit of x[i] is 0
    pack right x[i] if the j-th bit of x[i] is 1
  }
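A sequential simulation of split-based radix sort, as a sketch assuming non-negative integers below 2^k. Each pass is a stable split on bit j, so the passes must go from the least significant bit upward. The sequential `exclusive_plus_scan` stands in for the parallel enumeration, and `split_radix_sort` is a hypothetical name.

```python
def exclusive_plus_scan(xs):
    """Exclusive +-prefix; a sequential stand-in for the parallel prefix."""
    out, running = [], 0
    for v in xs:
        out.append(running)
        running += v
    return out

def split_radix_sort(x, k):
    """Sort non-negative integers below 2**k, least significant bit first."""
    x = list(x)
    n = len(x)
    for j in range(k):
        bit = [(v >> j) & 1 for v in x]
        zero = [1 - b for b in bit]
        rank0 = exclusive_plus_scan(zero)   # enumerate among elements with bit j = 0
        rank1 = exclusive_plus_scan(bit)    # enumerate among elements with bit j = 1
        c = sum(zero)                       # count of elements with bit j = 0
        y = [rank1[i] + c if bit[i] else rank0[i] for i in range(n)]
        new_x = [None] * n
        for i in range(n):                  # x[y[i]] = x[i] as one parallel write
            new_x[y[i]] = x[i]
        x = new_x
    return x

print(split_radix_sort([5, 3, 7, 0, 2, 6, 1, 4], 3))  # [0, 1, 2, 3, 4, 5, 6, 7]
```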
Quick Sort
1. Pick a pivot p
2. Broadcast p
3. For all PE i, compare A[i] with p:
   if A[i] < p, pack left A[i] in its segment
   if A[i] >= p, pack right A[i] in its segment
4. Mark the segment boundary
5. Quick sort each segment recursively
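The steps can be simulated on a flat array with segment-boundary flags. This sketch assumes that the first element of each segment is the pivot. It also swaps in a three-way split (< p, == p, > p) instead of the two-way split above, so that every round makes progress even when p is the minimum of its segment. Every segment is split in the same round, and the loop plays the role of the recursion. `parallel_quicksort` is a hypothetical name.

```python
def parallel_quicksort(a):
    a = list(a)
    n = len(a)
    if n == 0:
        return a
    start = [True] + [False] * (n - 1)   # segment-boundary flags
    while True:
        # Steps 1-2: the first element of each segment is its pivot; a copy-scan broadcasts it.
        pivot = []
        for i in range(n):
            pivot.append(a[i] if start[i] else pivot[i - 1])
        if all(a[i] == pivot[i] for i in range(n)):
            return a   # every segment is constant and the segments are already in order
        new_a, new_start = [], []
        i = 0
        while i < n:                     # each segment is handled independently
            j = i + 1
            while j < n and not start[j]:
                j += 1
            seg, p = a[i:j], a[i]
            # Step 3: order-preserving split into < p, == p, > p (three-way variant)
            for part in ([v for v in seg if v < p],
                         [v for v in seg if v == p],
                         [v for v in seg if v > p]):
                if part:
                    new_start += [True] + [False] * (len(part) - 1)   # Step 4
                    new_a += part
            i = j
        a, start = new_a, new_start      # Step 5: repeat on every segment

print(parallel_quicksort([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))
# [1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 9]
```

Every element of an earlier segment is at most every element of a later one. Each round either finishes or increases the number of segments, so the loop terminates.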
Solving Linear Recurrence Equations
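$f_n = a_{n-1} f_{n-1} + a_{n-2} f_{n-2}$
which in matrix form is
$\begin{pmatrix} f_n \\ f_{n-1} \end{pmatrix} = \begin{pmatrix} a_{n-1} & a_{n-2} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} f_{n-1} \\ f_{n-2} \end{pmatrix}$
Matrix multiplication is associative, so every $(f_i, f_{i-1})$ follows from a parallel prefix of these 2×2 matrix products applied to $(f_1, f_0)$.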
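A sketch of the matrix-prefix approach, assuming f_0, f_1 and the coefficient list a are given. $M_m$ is the 2×2 matrix for step m, and the prefix products $M_m M_{m-1} \cdots M_2$ come from a doubling scan. The later matrix multiplies on the left, because matrix products do not commute. `scan`, `matmul` and `solve_recurrence` are hypothetical names.

```python
def scan(xs, op):
    """Inclusive prefix by recursive doubling; op(earlier, later) must be associative."""
    x = list(xs)
    d = 1
    while d < len(x):
        x = [x[j] if j < d else op(x[j - d], x[j]) for j in range(len(x))]
        d *= 2
    return x

def matmul(A, B):
    """2x2 matrix product A @ B."""
    return [[A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]],
            [A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]]]

def solve_recurrence(a, f0, f1, n):
    """Return [f_0, ..., f_n] for f_m = a[m-1] f_{m-1} + a[m-2] f_{m-2}."""
    mats = [[[a[m - 1], a[m - 2]], [1, 0]] for m in range(2, n + 1)]
    # prefixes[m-2] = M_m M_{m-1} ... M_2
    prefixes = scan(mats, lambda earlier, later: matmul(later, earlier))
    f = [f0, f1]
    for P in prefixes:
        f.append(P[0][0] * f1 + P[0][1] * f0)   # first row of P applied to (f_1, f_0)
    return f[: n + 1]

print(solve_recurrence([1] * 10, 0, 1, 10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

With all coefficients equal to 1 the recurrence is the Fibonacci sequence, which makes a convenient check.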
Pointer Jumping and Tree Computation
How to compute a prefix on a linked list?
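Initial X:     1 2 3 4 5 6 7
Repeat for all i in parallel until every NEXT[i] is NIL (⌈log2 n⌉ steps):
  if NEXT[i] != NIL then
    X[i] <- X[i] + X[NEXT[i]]
    NEXT[i] <- NEXT[NEXT[i]]
After step 1:  3 5 7 9 11 13 7
After step 2:  10 14 18 22 18 13 7
After step 3:  28 27 25 22 18 13 7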
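A sequential simulation of pointer jumping that records X after each synchronous step. This is a sketch assuming the list is stored in arrays with `None` as NIL; `pointer_jumping` is a hypothetical name. The example list links element i to element i + 1.

```python
NIL = None

def pointer_jumping(X, NEXT):
    """Run the pointer-jumping loop; return X after each synchronous step."""
    X, NEXT = list(X), list(NEXT)
    steps = []
    while any(p is not NIL for p in NEXT):
        # every processor reads the old X and NEXT before anyone writes
        X = [X[i] + X[NEXT[i]] if NEXT[i] is not NIL else X[i] for i in range(len(X))]
        NEXT = [NEXT[NEXT[i]] if NEXT[i] is not NIL else NIL for i in range(len(NEXT))]
        steps.append(X)
    return steps

for row in pointer_jumping([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, NIL]):
    print(row)
# [3, 5, 7, 9, 11, 13, 7]
# [10, 14, 18, 22, 18, 13, 7]
# [28, 27, 25, 22, 18, 13, 7]
```

The result is a suffix sum: each element ends up with the total from itself to the end of the list.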
How can we get the prefix order 1 3 6 10 15 21 28 instead?
Application: tree computation, pre-order numbering
(Figure omitted: the rule for each node and for a leaf node.)
The same technique can be applied to in-order and post-order numbering, number of children, depth, etc., and also to biconnected components.
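The slide's figure is lost. One common way to get pre-order numbers from pointer jumping is the Euler-tour technique, which we swap in here as an assumption rather than as the slides' exact construction. Each node gets a "down" arc (weight 1) and an "up" arc (weight 0), the arcs are linked in tour order, and pointer jumping computes suffix sums of the weights. The tree encoding (children lists, root 0) and all function names are hypothetical.

```python
NIL = None

def pointer_jump_suffix(x, nxt):
    """Suffix sums along a linked list by pointer jumping (synchronous rounds)."""
    x, nxt = list(x), list(nxt)
    while any(p is not NIL for p in nxt):
        x = [x[i] + x[nxt[i]] if nxt[i] is not NIL else x[i] for i in range(len(x))]
        nxt = [nxt[nxt[i]] if nxt[i] is not NIL else NIL for i in range(len(nxt))]
    return x

def preorder_numbers(children, root=0):
    """1-based pre-order numbers; children[v] lists v's children left to right."""
    n = len(children)
    parent = [NIL] * n
    for v, cs in enumerate(children):
        for c in cs:
            parent[c] = v
    # Euler-tour arcs: 2*v is the arc down into v, 2*v + 1 is the arc back up out of v.
    nxt = [NIL] * (2 * n)
    for v in range(n):
        cs = children[v]
        nxt[2 * v] = 2 * cs[0] if cs else 2 * v + 1
        if v != root:
            sibs = children[parent[v]]
            k = sibs.index(v)
            nxt[2 * v + 1] = 2 * sibs[k + 1] if k + 1 < len(sibs) else 2 * parent[v] + 1
    weight = [1 if e % 2 == 0 else 0 for e in range(2 * n)]   # 1 on down arcs only
    suffix = pointer_jump_suffix(weight, nxt)
    # down arcs up to and including down(v) = n - suffix + 1 = pre-order number of v
    return [n - suffix[2 * v] + 1 for v in range(n)]

print(preorder_numbers([[1, 2], [3, 4], [5], [], [], []]))  # [1, 2, 5, 3, 4, 6]
```

Changing the arc weights gives other quantities from the same tour. For example, putting +1 on down arcs and -1 on up arcs and taking prefix sums yields node depths.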
Recurrence Equation
Example: LU decomposition on a triangular matrix