Algorithms PART II: Partitioning and Divide & Conquer
HPC Fall 2012 Prof. Robert van Engelen
HPC Fall 2012 2 11/14/12
Overview
Partitioning strategies
Divide and conquer strategies
Further reading
Partitioning Strategies
Data partitioning
  Perform domain decomposition to run parallel tasks on subdomains
  “Scatter-compute-gather” where local computation may require communication and scatter/gather may involve computations
  [Figure: block partitioning of a 2D domain]
Task partitioning
  Decompose functions into independent subfunctions and execute the subfunctions in parallel
    function f(x,y)
      u = g(x)
      v = h(y)
      return u+v
    end
  [Figure: Thread 1 computes u=g(x) while Thread 2 computes v=h(y), then return u+v]
Partitioning Strategies
Partitioning strategy (data partitioning):
1. Break up a given problem into P subproblems
2. Solve the P subproblems concurrently
3. Collect and combine the P solutions
Embarrassingly parallel: a simple form of data partitioning into independent subproblems, with no initial work and no communication between tasks (workers)
Partitioning Example 1: Summation
Summation of n values X = [x1,…,xn]
1. Divide X into P equally sized sublists $X_p$, $p = 0,\dots,P-1$, and distribute the sublists $X_p$ to the P processors
2. The processors sum the local parts: $s_p = \sum X_p$
3. Combine the local sums: $s = \sum_p s_p$
Algorithms:
1. Scatter list X using a scatter-tree
2. Serial summation of parts
3. Reduce local sums
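The three steps above can be sketched in a few lines of Python. This is a sequential simulation under stated assumptions, not a message-passing implementation: the names `partition`, `tree_reduce`, and `partitioned_sum` are hypothetical, the "processors" are list entries, and the reduction is done pairwise so that the number of combine steps can be compared with log2(P).

```python
def partition(xs, P):
    """Split xs into P contiguous sublists of (nearly) equal size."""
    n = len(xs)
    bounds = [(p * n) // P for p in range(P + 1)]
    return [xs[bounds[p]:bounds[p + 1]] for p in range(P)]


def tree_reduce(values):
    """Add values pairwise, level by level; return (total, number of levels)."""
    steps = 0
    while len(values) > 1:
        # Each pair would be added by a different processor in one step.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # an unpaired value passes through unchanged
        values = paired
        steps += 1
    return values[0], steps


def partitioned_sum(xs, P):
    parts = partition(xs, P)               # 1. scatter
    local = [sum(part) for part in parts]  # 2. local sums (concurrent in practice)
    return tree_reduce(local)              # 3. reduce


total, steps = partitioned_sum(list(range(1, 101)), 8)
print(total, steps)
```

With P = 8 the reduction takes three pairwise levels, matching log2(8).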
Partitioning Example 1: Summation
[Figure: scatter (divide) tree and reduce (combine) tree over time]
Scatter (divide): log2(P) steps with message sizes n/2, then n/4, then n/8, …; total amount of data transferred: (n/2) log2(P)
Local summations: n/P steps
Reduce (combine): log2(P) steps with messages of size 1; total amount of data transferred: P-1
Partitioning Example 1: Summation
Communication time
  Scatter: $t_{comm1} = \sum_{k=1}^{\log_2 P} (t_{startup} + 2^{-k} n\, t_{data}) = \log_2(P)\, t_{startup} + \frac{n(P-1)}{P}\, t_{data}$
  Reduce: $t_{comm2} = \log_2(P)\, (t_{startup} + t_{data})$
  Total: $t_{comm} = 2 \log_2(P)\, t_{startup} + \left( \frac{n(P-1)}{P} + \log_2 P \right) t_{data}$
Computation time
  Local sum: $t_{comp1} = n/P$
  Global sum: $t_{comp2} = \log_2 P$
  Total: $t_{comp} = n/P + \log_2 P$
Speedup, assuming $t_{startup} = 0$
  Sequential time: $t_s = n-1$
  Parallel time: $t_P = \left( \frac{n(P-1)}{P} + \log_2 P \right) t_{data} + n/P + \log_2 P$
  Speedup: $S_P = t_s/t_P = O(n / (n + \log P))$
  Best speedup w/o communication: $S_P = O(P/\log P)$
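To get a feel for these formulas, the cost model can be evaluated numerically. The sketch below is only an illustration: the function name `summation_times` and the default parameter values are assumptions, and one addition is taken as the unit of time, as on the slide.

```python
import math


def summation_times(n, P, t_startup=0.0, t_data=1.0):
    """Evaluate the scatter / local sum / reduce cost model for n values on P processors."""
    log_p = math.log2(P)
    t_scatter = log_p * t_startup + n * (P - 1) / P * t_data
    t_reduce = log_p * (t_startup + t_data)
    t_comp = n / P + log_p
    t_seq = n - 1
    t_par = t_scatter + t_reduce + t_comp
    return {
        "t_seq": t_seq,
        "t_par": t_par,
        "speedup": t_seq / t_par,
        "speedup_no_comm": t_seq / t_comp,
    }


model = summation_times(n=1024, P=16)
print(model)
```

For n = 1024, P = 16 and t_data = 1 (moving one value costs as much as one addition), the scatter term dominates and the modeled speedup stays below 1, which is what the bound O(n / (n + log P)) expresses.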
General M-Ary Partitioning
[Figure: 3-level 4-ary partitioning for $4^3 = 64$ processors over time: first, second, and third division (divide), followed by compute and combine]
Example: partitioning an image, e.g. to compute a histogram by parallel reductions (summations to count color pixels)
Partitioning Example 2: Parallel Bucket Sort
Bucket sort of values [x1,…,xn] bounded within a range xi ∈ [lo…hi]
1. Partition the n values into P segments of n/P values each
2a. Sort each segment into P small buckets (local computation)
2b. Send the content of the small buckets to P large buckets
3. Sort the P large buckets and merge the lists
[Figure: unsorted values are split over P processors into small buckets; the small buckets are emptied into large buckets; the content of the large buckets is sorted and the lists are merged into the sorted values]
Partitioning Example 2: Parallel Bucket Sort
Input: list X of length n with minimum value L and maximum U
Output: sorted list X

def function bucket(x) = min(P-1, floor(P*(x-L)/(U-L)));   // clamp so that x = U falls in the last bucket

scatter list X to local Xp lists each of size n/P
forall processors p = 0,…,P-1
  for i = 0,…,n/P-1
    x = Xp[i]
    put x into small bucket bp[bucket(x)]
  all-to-all of small buckets bp into large buckets Bp
  sort values in Bp[0,…,P-1] using a sequential sort algorithm
gather X from Bp into a merged sorted list
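The pseudocode above can be tried out with a small sequential simulation. This is a sketch under assumptions, not the course's implementation: `bucket_index` and `parallel_bucket_sort` are hypothetical names, the P processors are simulated by list indices, and the maximum value is clamped into the last bucket so that the bucket index always stays in 0..P-1.

```python
def bucket_index(x, lo, hi, P):
    """Map x in [lo, hi] to a bucket 0..P-1 (x == hi is clamped into the last bucket)."""
    if hi == lo:
        return 0
    return min(P - 1, int(P * (x - lo) / (hi - lo)))


def parallel_bucket_sort(xs, P):
    if not xs:
        return []
    lo, hi = min(xs), max(xs)
    n = len(xs)
    # 1. Scatter: processor p receives a contiguous segment of about n/P values.
    segments = [xs[(p * n) // P:((p + 1) * n) // P] for p in range(P)]
    # 2a. Each processor p sorts its segment into P small buckets small[p][q].
    small = [[[] for _ in range(P)] for _ in range(P)]
    for p, segment in enumerate(segments):
        for x in segment:
            small[p][bucket_index(x, lo, hi, P)].append(x)
    # 2b. All-to-all: large bucket q collects small bucket q of every processor.
    large = [[x for p in range(P) for x in small[p][q]] for q in range(P)]
    # 3. Sort each large bucket (concurrent in practice) and gather in bucket order.
    result = []
    for q in range(P):
        result.extend(sorted(large[q]))
    return result


data = [5, 3, 9, 1, 7, 3, 8, 2, 6, 4, 0, 9]
print(parallel_bucket_sort(data, 4))
```

Because the bucket index is monotone in x, concatenating the sorted large buckets in order yields the sorted list without a further merge step.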
Partitioning Example 2: Parallel Bucket Sort
Communication time (assuming uniform distribution in X)
  Scatter: $t_{comm1} = \log_2(P)\, t_{startup} + \frac{n(P-1)}{P}\, t_{data}$
  All-to-all: $t_{comm2} = (P-1)\left(t_{startup} + \frac{n}{P^2}\, t_{data}\right)$
  Gather: $t_{comm3} = \log_2(P)\, t_{startup} + \frac{n(P-1)}{P}\, t_{data}$
Computation time (assuming uniform distribution in X)
  Small bucket sort: $t_{comp1} = n/P$
  Large bucket sort: $t_{comp2} = \frac{n}{P} \log_2(n/P)$
Speedup
  Sequential time: $t_s = n \log_2(n/P)$ (with P buckets)
  Parallel time: $t_P = 2 \log_2(P)\, t_{startup} + 2\, \frac{n(P-1)}{P}\, t_{data} + (P-1)\left(t_{startup} + \frac{n}{P^2}\, t_{data}\right) + \frac{n}{P} \left(1 + \log_2(n/P)\right)$
  Speedup w/o communication: $S_P = O(P)$
Partitioning Example 3: Barnes Hut Algorithm
Direction of the force between two bodies at points p and q
Partitioning Example 3: Barnes Hut Algorithm
[Figure: particles in 2D space and the corresponding quadtree]
Each particle is at (x,y) and has mass m
Mass of parent is the sum of the masses of its children
Parent computes total mass M and center of mass C
A square w/o particle is deleted
Partitioning Example 3: Barnes Hut Algorithm
for (t = 0; t < tmax; t++) {
  Build_tree();
  Compute_Total_Mass_Center();
  Compute_Force();
  Update_Positions();
}

Sequential time is O(n log n)
Assuming P = n then tP = O(log P)

Compute_Force() {
  for (i = 0; i < n; i++)
    Compute_Tree_Force(i, root)
}

Compute_Tree_Force(i, node) {
  if (box at node contains one particle)
    F = force using eq (**)
  else {
    r = distance from i to C (*) of box
    D = size of box at node
    if (D/r < theta)
      F = force using eq (**) with total M
    else {
      F = 0;
      for (all children c of box)
        F = F + Compute_Tree_Force(i, c);
    }
  }
  return F;
}

(*) center of mass C of the box
(**) force between two bodies
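A compact Python sketch of the tree build and of Compute_Tree_Force follows. Several things here are assumptions rather than content of the slides: the force law (Newtonian gravity with G = 1) stands in for eq (**), which did not survive in this text; all names (`build_tree`, `tree_force`, `pair_force`, `direct_force`) are hypothetical; particles are `(x, y, m)` tuples at distinct positions; and a particle exerts no force on itself, a case the pseudocode leaves implicit.

```python
import math


def build_tree(particles, cx, cy, half):
    """Quadtree over the square with center (cx, cy) and side length 2*half."""
    if not particles:
        return None  # a square without particles is deleted
    if len(particles) == 1:
        x, y, m = particles[0]
        return {"M": m, "C": (x, y), "D": 2 * half, "children": []}
    children = []
    for sx in (-1, 1):
        for sy in (-1, 1):
            sub = [p for p in particles
                   if (p[0] >= cx) == (sx > 0) and (p[1] >= cy) == (sy > 0)]
            child = build_tree(sub, cx + sx * half / 2, cy + sy * half / 2, half / 2)
            if child is not None:
                children.append(child)
    M = sum(ch["M"] for ch in children)  # parent mass = sum of child masses
    C = (sum(ch["M"] * ch["C"][0] for ch in children) / M,
         sum(ch["M"] * ch["C"][1] for ch in children) / M)
    return {"M": M, "C": C, "D": 2 * half, "children": children}


def pair_force(p, M, C, G=1.0):
    """Assumed force law: gravity on particle p from mass M located at C."""
    dx, dy = C[0] - p[0], C[1] - p[1]
    r = math.hypot(dx, dy)
    if r == 0.0:
        return (0.0, 0.0)  # no self-interaction
    f = G * p[2] * M / r ** 3
    return (f * dx, f * dy)


def tree_force(p, node, theta, G=1.0):
    r = math.hypot(node["C"][0] - p[0], node["C"][1] - p[1])
    is_leaf = not node["children"]
    if is_leaf or (r > 0.0 and node["D"] / r < theta):
        return pair_force(p, node["M"], node["C"], G)  # treat the box as one body
    fx = fy = 0.0
    for child in node["children"]:
        cfx, cfy = tree_force(p, child, theta, G)
        fx += cfx
        fy += cfy
    return (fx, fy)


def direct_force(p, particles, G=1.0):
    """O(n) reference: sum the pairwise forces on p from all particles."""
    fx = fy = 0.0
    for q in particles:
        gx, gy = pair_force(p, q[2], (q[0], q[1]), G)
        fx += gx
        fy += gy
    return (fx, fy)
```

With theta = 0 the criterion D/r < theta never holds, so the tree walk visits every leaf and should reproduce the direct pairwise sum; larger theta values trade accuracy for fewer force evaluations.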
Divide and Conquer
Divide and conquer strategy (definition by JáJá 1992)
1. Break up a given problem into independent subproblems
2. Solve the subproblems recursively and concurrently
3. Collect and combine the solutions into the overall solution
In contrast to the partitioning strategy, divide and conquer uses recursive partitioning with concurrent execution to divide the problem down into independent subproblems
In deeper levels of recursion the number of active processors may increase or decrease
Divide & Conquer Example 1: Parallel Recursive Matmul
Block matrix multiplication by recursion: decompose each matrix into 2×2 submatrices and compute the submatrix products recursively (index p denotes the first half of the rows or columns, q the second half)

Mat matmul(Mat A, Mat B, int s) {
  if (s == 1)
    C = A * B;
  else {
    s = s/2;
    // P0…P7 computed in parallel
    P0 = matmul(A[p,p], B[p,p], s);
    P1 = matmul(A[p,q], B[q,p], s);
    P2 = matmul(A[p,p], B[p,q], s);
    P3 = matmul(A[p,q], B[q,q], s);
    P4 = matmul(A[q,p], B[p,p], s);
    P5 = matmul(A[q,q], B[q,p], s);
    P6 = matmul(A[q,p], B[p,q], s);
    P7 = matmul(A[q,q], B[q,q], s);
    // the four sums can be computed in parallel
    C[p,p] = P0 + P1;
    C[p,q] = P2 + P3;
    C[q,p] = P4 + P5;
    C[q,q] = P6 + P7;
  }
  return C;
}

Level of parallelism increases with deepening recursion
Suitable for shared memory systems
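A runnable version of the recursion, written sequentially in Python, might look as follows. It is a sketch under assumptions: matrices are square lists of lists whose size is a power of two, the helper names `add` and `block` are hypothetical, and the eight products are computed one after another here although they are independent and could be spawned as parallel tasks.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def block(M, r, c, s):
    """The s-by-s submatrix of M whose top-left corner is at (r, c)."""
    return [row[c:c + s] for row in M[r:r + s]]


def matmul(A, B):
    s = len(A)
    if s == 1:
        return [[A[0][0] * B[0][0]]]
    h = s // 2
    App, Apq, Aqp, Aqq = block(A, 0, 0, h), block(A, 0, h, h), block(A, h, 0, h), block(A, h, h, h)
    Bpp, Bpq, Bqp, Bqq = block(B, 0, 0, h), block(B, 0, h, h), block(B, h, 0, h), block(B, h, h, h)
    # Eight independent subproblems (P0..P7 in the pseudocode above).
    P0, P1 = matmul(App, Bpp), matmul(Apq, Bqp)
    P2, P3 = matmul(App, Bpq), matmul(Apq, Bqq)
    P4, P5 = matmul(Aqp, Bpp), matmul(Aqq, Bqp)
    P6, P7 = matmul(Aqp, Bpq), matmul(Aqq, Bqq)
    # Four independent block sums.
    Cpp, Cpq, Cqp, Cqq = add(P0, P1), add(P2, P3), add(P4, P5), add(P6, P7)
    top = [rp + rq for rp, rq in zip(Cpp, Cpq)]
    bottom = [rp + rq for rp, rq in zip(Cqp, Cqq)]
    return top + bottom


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Copying blocks keeps the sketch short; an implementation aimed at performance would pass index ranges into the original arrays instead.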
Divide and Conquer Example 2: Parallel Convex Hull Algorithm
The planar convex hull of a set of points S={p1,p2,…,pn} of pi=(x,y) coordinates is the smallest convex polygon that encompasses all points S on the x-y plane
[Figure: planar convex hull of a set of points in the x-y plane]
Divide and Conquer Example 2: Parallel Convex Hull Algorithm
The upper convex hull spans points {q1,…,qs} ⊆ S from point q1 with minimum x to qs with maximum x
The convex hull = upper convex hull + lower convex hull
Problem: given points S = {p1,…,pn} such that x(p1) < x(p2) < … < x(pn), construct the upper convex hull in parallel
[Figure: upper convex hull from q1 to qs in the x-y plane]
Divide and Conquer Example 2: Parallel Convex Hull Algorithm
Parallel convex hull:
1. Divide the x-sorted points S into sets S1 and S2 of equal size
2. Compute the upper convex hull recursively on S1 and S2
3. Combine UCH(S1) and UCH(S2) by computing the upper common tangent a to b to form UCH(S)
[Figure: upper common tangent connecting UCH(S1) and UCH(S2)]
Divide and Conquer Example 2: Parallel Convex Hull Algorithm
Base case of recursion: two points, which are returned as UCH(S)
The line segment (a,b) can be computed sequentially in O(log n) time with n = |UCH(S1) + UCH(S2)| using a binary search method
Line segments can be implemented as a linked list of points, thus UCH(S1) and UCH(S2) can be connected using one pointer change of a to point to b in O(1) time
Parallel convex hull time complexity recurrence relation:
  $T(n) \le T(n/2) + a \log n$
  with solution: $T(n) = O(\log^2 n)$
Parallel convex hull operations recurrence relation:
  $W(n) \le 2W(n/2) + b\,n$
  with solution: $W(n) = O(n \log n)$
  which is cost optimal, since the sequential algorithm is O(n log n)
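The divide, recurse, and combine structure can be illustrated with a short sequential Python sketch. One substitution should be stated plainly: the upper common tangent is found here by a brute-force scan over candidate pairs, not by the O(log n) binary search method mentioned above, so this sketch does not achieve the stated complexity. The function names are hypothetical, and the points are assumed to be `(x, y)` tuples sorted by strictly increasing x.

```python
def cross(o, a, b):
    """Positive if b lies to the left of (above, for left-to-right) the directed line o -> a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_tangent(h1, h2):
    """Indices (i, j) such that no point of h1 or h2 lies above the line h1[i] -> h2[j]."""
    for i, a in enumerate(h1):
        for j in range(len(h2) - 1, -1, -1):
            b = h2[j]
            if all(cross(a, b, p) <= 0 for p in h1 + h2):
                return i, j
    raise ValueError("no upper common tangent found; are the points sorted by x?")


def upper_hull(points):
    """Upper convex hull of points sorted by strictly increasing x."""
    if len(points) <= 2:
        return list(points)              # base case: returned as UCH(S)
    mid = len(points) // 2
    h1 = upper_hull(points[:mid])        # the two recursive calls are independent
    h2 = upper_hull(points[mid:])        # and could run concurrently
    i, j = upper_tangent(h1, h2)
    return h1[:i + 1] + h2[j:]           # connect a = h1[i] to b = h2[j]


pts = [(0, 0), (1, 3), (2, 1), (3, 4), (4, 0), (5, 3), (6, 1), (7, 0)]
print(upper_hull(pts))
```

The final slice-and-concatenate step plays the role of the single pointer change in the linked-list representation.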
Divide and Conquer Example 3: First-Order Linear Recurrences
First-order linear recurrence
  $y_1 = b_1$
  $y_i = a_i y_{i-1} + b_i$, $2 \le i \le n$
Example applications:
  Prefix sum $y_i = \sum_{j=1}^{i} b_j$ is a special case of a first-order linear recurrence with $a_i = 1$ (the multiplicative unit element)
  Polynomial evaluation (degree n-1) using Horner's rule, $p(x) = (((b_1 x + b_2) x + b_3) x + \dots + b_{n-1}) x + b_n$, is a special case of a first-order linear recurrence with $a_i = x$
  Solving a bi-diagonal system $By = c$: let $a_i = -l_i/d_i$ and $b_i = c_i/d_i$, then solve the linear recurrence to obtain the solution y
$$
\begin{bmatrix}
d_1 \\
l_2 & d_2 \\
 & l_3 & d_3 \\
 & & \ddots & \ddots \\
 & & & l_n & d_n
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{bmatrix}
$$
Divide and Conquer Example 3: First-Order Linear Recurrences
Rewrite $y_i = a_i y_{i-1} + b_i$ into $y_i = a_i (a_{i-1} y_{i-2} + b_{i-1}) + b_i$
This equation defines a linear recurrence of size n/2 for even index i:
  $z_1 = b_1'$
  $z_i = a_i' z_{i-1} + b_i'$, $2 \le i \le n/2$
1. Let $a_i' = a_{2i}\, a_{2i-1}$ and $b_i' = a_{2i}\, b_{2i-1} + b_{2i}$
2. Solve $z_i$ recursively
3. For $1 \le i \le n$ set
   $y_i = z_{i/2}$ if i is even
   $y_i = a_i z_{(i-1)/2} + b_i$ if i is odd and i > 1
   $y_i = b_1$ if i = 1
Divide and Conquer Example 3: First-Order Linear Recurrences
Parallel algorithm:

linrecsolve(a[], b[], y[], n) {
  if (n==1) { y[1] = b[1]; return; }
  forall (i = 1 to n/2) {
    anew[i] = a[2*i]*a[2*i-1];
    bnew[i] = a[2*i]*b[2*i-1]+b[2*i];
  }
  linrecsolve(anew, bnew, z, n/2);
  forall (i = 1 to n) {
    if (i == 1) y[1] = b[1];
    else if (even(i)) y[i] = z[i/2];
    else y[i] = a[i]*z[(i-1)/2]+b[i];
  }
}

Recursion levels ($\log_2 n$ recursive steps):
  $b_1$
  $b_1' = a_2 b_1 + b_2$
  $b_1'' = a_2' b_1' + b_2' = ((a_2 b_1 + b_2) a_3 + b_3) a_4 + b_4$
  $b_1''' = a_2'' b_1'' + b_2'' = ((a_2' b_1' + b_2') a_3' + b_3') a_4' + b_4' = (((((((a_2 b_1 + b_2) a_3 + b_3) a_4 + b_4) a_5 + b_5) a_6 + b_6) a_7 + b_7) a_8 + b_8$
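The pseudocode translates almost line by line into Python. The sketch below makes a few assumptions of its own: it returns `y` instead of filling an output argument, it keeps the 1-based indexing of the slides by leaving index 0 as unused padding, `n` must be a power of two, and `a[1]` must be some number even though it never influences the result. The two `forall` loops are written as ordinary loops; their iterations are independent.

```python
def linrecsolve(a, b):
    """Solve y[1] = b[1], y[i] = a[i]*y[i-1] + b[i] for i = 2..n.

    Lists are 1-based: index 0 is padding and n = len(b) - 1 is a power of two.
    """
    n = len(b) - 1
    if n == 1:
        return [None, b[1]]
    half = n // 2
    # Build the half-size recurrence for the even-indexed unknowns z[i] = y[2i].
    a_new = [None] + [a[2 * i] * a[2 * i - 1] for i in range(1, half + 1)]
    b_new = [None] + [a[2 * i] * b[2 * i - 1] + b[2 * i] for i in range(1, half + 1)]
    z = linrecsolve(a_new, b_new)
    # Recover all y[i] from z.
    y = [None] * (n + 1)
    for i in range(1, n + 1):
        if i == 1:
            y[i] = b[1]
        elif i % 2 == 0:
            y[i] = z[i // 2]
        else:
            y[i] = a[i] * z[(i - 1) // 2] + b[i]
    return y


# Prefix sum: a[i] = 1 for all i.
print(linrecsolve([None] + [1] * 8, [None, 1, 2, 3, 4, 5, 6, 7, 8])[1:])

# Horner's rule for p(x) = x^3 + 1 at x = 2: a[i] = x, b = coefficients.
print(linrecsolve([None, 2, 2, 2, 2], [None, 1, 0, 0, 1])[-1])
```

The two calls correspond to the prefix-sum and Horner special cases listed earlier.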
Divide and Conquer Example 4: Triangular Matrix Inversion
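Consider Ax = b with n×n triangular matrix A
Partition A into (n/2) × (n/2) blocks
$$
A =
\begin{bmatrix}
a_{11} \\
a_{21} & a_{22} \\
a_{31} & a_{32} & a_{33} \\
\vdots & & & \ddots \\
a_{n1} & a_{n2} & \cdots & \cdots & a_{nn}
\end{bmatrix}
=
\begin{bmatrix}
A_1 & 0 \\
A_2 & A_3
\end{bmatrix}
$$
Then $A^{-1}$ is given by
$$
A^{-1} =
\begin{bmatrix}
A_1^{-1} & 0 \\
-A_3^{-1} A_2 A_1^{-1} & A_3^{-1}
\end{bmatrix}
$$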
Divide and Conquer Example 4: Triangular Matrix Inversion
Parallel algorithm:
1. Divide A into A1, A2, A3
2. Recursively compute inverses of A1 and A3 in parallel
3. Multiply $-A_3^{-1} A_2 A_1^{-1}$ and combine with $A_1^{-1}$ and $A_3^{-1}$ to get $A^{-1}$
Time complexity is given by the recurrence relation
  $T(n) = T(n/2) + c\,n$
with $P = n^2$ processors to compute $-A_3^{-1} A_2 A_1^{-1}$ in O(n) operations in parallel, thus T(n) = O(n) time
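A minimal sequential sketch of the recursive inversion, assuming a lower triangular matrix with nonzero diagonal stored as a list of rows, is given below. The names `matmul` and `invert_lower` are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used only to make the result easy to check; the two recursive calls and the block products are the parts that would run in parallel.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def invert_lower(A):
    """Invert a lower triangular matrix by recursive 2x2 block partitioning."""
    n = len(A)
    if n == 1:
        return [[1 / Fraction(A[0][0])]]
    h = n // 2
    A1 = [row[:h] for row in A[:h]]   # top-left block
    A2 = [row[:h] for row in A[h:]]   # bottom-left block
    A3 = [row[h:] for row in A[h:]]   # bottom-right block
    A1inv = invert_lower(A1)          # independent of the next call
    A3inv = invert_lower(A3)
    lower_left = [[-v for v in row] for row in matmul(matmul(A3inv, A2), A1inv)]
    top = [r1 + [Fraction(0)] * (n - h) for r1 in A1inv]
    bottom = [rl + r3 for rl, r3 in zip(lower_left, A3inv)]
    return top + bottom


A = [[2, 0, 0],
     [1, 1, 0],
     [4, 3, 5]]
Ainv = invert_lower(A)
print(matmul(A, Ainv))
```

Multiplying A by the computed inverse should give the identity matrix, which is a convenient check of the block formula.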
Divide and Conquer Example 5: Banded Triangular Systems
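Consider Ax = b with banded matrix A with m=3
$$
A =
\begin{bmatrix}
a_{11} \\
a_{21} & a_{22} \\
a_{31} & a_{32} & a_{33} \\
 & a_{42} & a_{43} & a_{44} \\
 & & a_{53} & a_{54} & a_{55} \\
 & & & a_{64} & a_{65} & a_{66} \\
 & & & & a_{75} & a_{76} & a_{77} \\
 & & & & & a_{86} & a_{87} & a_{88} \\
 & & & & & & a_{97} & a_{98} & a_{99}
\end{bmatrix}
$$
Define block diagonal D, formed from the m×m diagonal blocks $A_{ii}$ of A, and its inverse $D^{-1}$
$$
D =
\begin{bmatrix}
A_{11} \\
 & A_{22} \\
 & & \ddots \\
 & & & A_{n/m,n/m}
\end{bmatrix}
\qquad
D^{-1} =
\begin{bmatrix}
A_{11}^{-1} \\
 & A_{22}^{-1} \\
 & & \ddots \\
 & & & A_{n/m,n/m}^{-1}
\end{bmatrix}
$$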
Divide and Conquer Example 5: Banded Triangular Systems
Compute $d = D^{-1} b$ and $B = D^{-1} A$ where $B_{i,i-1} = A_{ii}^{-1} A_{i,i-1}$
$$
d = D^{-1} b =
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n/m} \end{bmatrix}
\qquad
B = D^{-1} A =
\begin{bmatrix}
I_m \\
B_{21} & I_m \\
 & B_{32} & I_m \\
 & & \ddots & \ddots \\
 & & & B_{n/m,n/m-1} & I_m
\end{bmatrix}
$$
Solve the first-order linear recurrence on m×m matrices $B_{i,i-1}$
  $x_1 = d_1$
  $x_i = -B_{i,i-1}\, x_{i-1} + d_i$, $2 \le i \le n/m$
Parallel time O(m + m log(n/m)) with P = nm processors
  Compute all $A_{ii}^{-1}$ (each requiring O(m) operations) in parallel with the parallel matrix inversion algorithm
  Compute all $B_{i,i-1} = A_{ii}^{-1} A_{i,i-1}$ in O(m) operations in parallel
  Recurrence depth is $\log_2(n/m)$, each step has O(m) operations
Divide and Conquer Example 6: LU of Tridiagonal Matrix
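Consider tridiagonal matrix LU decomposition
$$
\begin{bmatrix}
a_1 & c_1 \\
b_2 & a_2 & c_2 \\
 & b_3 & a_3 & c_3 \\
 & & \ddots & \ddots & \ddots \\
 & & & b_n & a_n
\end{bmatrix}
=
\begin{bmatrix}
1 \\
l_2 & 1 \\
 & l_3 & 1 \\
 & & \ddots & \ddots \\
 & & & l_n & 1
\end{bmatrix}
\begin{bmatrix}
d_1 & u_1 \\
 & d_2 & u_2 \\
 & & d_3 & u_3 \\
 & & & \ddots & \ddots \\
 & & & & d_n
\end{bmatrix}
$$
The LU decomposition A = L U satisfies
  $a_1 = d_1$, $c_i = u_i$, $a_i = d_i + l_i u_{i-1}$, $b_i = l_i d_{i-1}$
thus
  $d_1 = a_1$
  $d_i = a_i - l_i u_{i-1} = a_i - u_{i-1} b_i / d_{i-1} = [\, a_i d_{i-1} - b_i c_{i-1} \,] / d_{i-1}$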
Divide and Conquer Example 6: LU of Tridiagonal Matrix
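Let
$$
R_1 = \begin{bmatrix} a_1 & 0 \\ 1 & 0 \end{bmatrix}
\qquad
R_i = \begin{bmatrix} a_i & -b_i c_{i-1} \\ 1 & 0 \end{bmatrix}
\qquad
T_i = R_i R_{i-1} \cdots R_1
$$
From the Möbius transformation we have
$$
d_i = \frac{\begin{bmatrix} 1 & 0 \end{bmatrix} T_i \begin{bmatrix} 1 & 1 \end{bmatrix}^T}{\begin{bmatrix} 0 & 1 \end{bmatrix} T_i \begin{bmatrix} 1 & 1 \end{bmatrix}^T}
$$
Algorithm:
  Set up matrices R
  Solve first-order linear recurrence (prefix sum) of T
  Compute $d_i$
  From the solution of $d_i$ compute $l_i = b_i/d_{i-1}$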
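The algorithm can be checked on a small example. The following Python sketch is an illustration under assumptions: it uses 0-based lists (so `a[0]` plays the role of a1, `b[0]` and `c[n-1]` are unused), the function names are hypothetical, all pivots are assumed nonzero, and the matrix products T are accumulated in a sequential loop where a parallel version would use a prefix (scan) computation over the 2×2 matrices.

```python
from fractions import Fraction


def mat2_mul(X, Y):
    """Product of two 2x2 matrices."""
    return [[X[0][0] * Y[0][0] + X[0][1] * Y[1][0], X[0][0] * Y[0][1] + X[0][1] * Y[1][1]],
            [X[1][0] * Y[0][0] + X[1][1] * Y[1][0], X[1][0] * Y[0][1] + X[1][1] * Y[1][1]]]


def tridiagonal_lu_pivots(a, b, c):
    """Return (d, l) of A = L U for a tridiagonal A with diagonal a, sub-diagonal b, super-diagonal c."""
    n = len(a)
    T = [[Fraction(a[0]), Fraction(0)], [Fraction(1), Fraction(0)]]  # T_1 = R_1
    d = [(T[0][0] + T[0][1]) / (T[1][0] + T[1][1])]
    for i in range(1, n):
        R = [[Fraction(a[i]), Fraction(-b[i] * c[i - 1])], [Fraction(1), Fraction(0)]]
        T = mat2_mul(R, T)  # T_i = R_i T_{i-1}
        # d_i = ([1 0] T_i [1 1]^T) / ([0 1] T_i [1 1]^T)
        d.append((T[0][0] + T[0][1]) / (T[1][0] + T[1][1]))
    l = [None] + [b[i] / d[i - 1] for i in range(1, n)]
    return d, l


a = [4, 5, 6, 7]   # diagonal
b = [0, 1, 2, 3]   # sub-diagonal (b[0] unused)
c = [1, 2, 3, 0]   # super-diagonal (c[3] unused)
d, l = tridiagonal_lu_pivots(a, b, c)
print(d, l)
```

Each d value obtained from the matrix products should agree with the direct recurrence $d_i = a_i - b_i c_{i-1} / d_{i-1}$.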
Further Reading
[PP2] pages 106-131
[PSC] pages 321-337