Algorithms PART II: Partitioning and Divide & Conquer


HPC Fall 2012 Prof. Robert van Engelen

HPC Fall 2012 2 11/14/12

Overview

- Partitioning strategies
- Divide and conquer strategies
- Further reading


Partitioning Strategies

- Data partitioning
  - Perform domain decomposition to run parallel tasks on subdomains
  - "Scatter-compute-gather", where local computation may require communication and scatter/gather may involve computations
- Task partitioning
  - Decompose functions into independent subfunctions and execute the subfunctions in parallel

[Figure: block partitioning of a 2D domain]

[Figure: task partitioning of the function f below; thread 1 computes u=g(x), thread 2 computes v=h(y), and the results are combined in return u+v]

    function f(x,y)
      u = g(x)
      v = h(y)
      return u+v
    end
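The task partitioning of f can be sketched with two worker threads. This is a minimal illustration, not part of the lecture: the bodies of `g` and `h` are hypothetical stand-ins, and `ThreadPoolExecutor` from the Python standard library plays the role of "Thread 1" and "Thread 2".

```python
from concurrent.futures import ThreadPoolExecutor

def g(x):
    return x * x      # hypothetical stand-in for the slide's g

def h(y):
    return y + 1      # hypothetical stand-in for the slide's h

def f(x, y):
    # g(x) and h(y) are independent, so each is submitted as its own task;
    # the combine step u + v waits for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        u = pool.submit(g, x)
        v = pool.submit(h, y)
        return u.result() + v.result()

print(f(3, 4))
```

In standard CPython builds with the global interpreter lock, threads do not speed up CPU-bound functions; the sketch only shows the structure of the decomposition.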


Partitioning Strategies

- Partitioning strategy (data partitioning):
  1.  Break up a given problem into P subproblems
  2.  Solve the P subproblems concurrently
  3.  Collect and combine the P solutions
- Embarrassingly parallel
  - A simple form of data partitioning into independent subproblems, without initial work and without communication between tasks (workers)


Partitioning Example 1: Summation

- Summation of n values X = [x_1,…,x_n]
  1.  Divide X into P equally-sized sublists X_p, p = 0,…,P-1, and distribute the X_p sublists to the P processors
  2.  The processors sum the local parts s_p = ∑ X_p
  3.  Combine the local sums s = ∑ s_p
- Algorithms:
  1.  Scatter list X using a scatter-tree
  2.  Serial summation of parts
  3.  Reduce local sums
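The three steps can be simulated sequentially as follows. This is a sketch under stated assumptions: the function name `partition_sum` is illustrative, the scatter is modeled by plain list slicing rather than a scatter-tree, and the P "processors" are list slots in one process.

```python
def partition_sum(X, P):
    n = len(X)
    # 1. Scatter: split X into P contiguous sublists of (nearly) equal size.
    bounds = [n * p // P for p in range(P + 1)]
    parts = [X[bounds[p]:bounds[p + 1]] for p in range(P)]
    # 2. Local sums: each processor p sums its own part (serial summation).
    s = [sum(part) for part in parts]
    # 3. Reduce: pairwise tree combination in ceil(log2(P)) steps.
    stride = 1
    while stride < P:
        for p in range(0, P - stride, 2 * stride):
            s[p] += s[p + stride]   # processor p receives from p + stride
        stride *= 2
    return s[0]

print(partition_sum(list(range(1, 101)), 8))
```

Within one reduce step the additions touch disjoint slots, which is what allows them to run concurrently on a real machine.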



[Figure: timeline of the parallel summation. Scatter (divide): log2(P) steps with message sizes n/2, n/4, n/8, …; total amount of data transferred: (n/2) log2(P). Local summations: n/P steps. Reduce (combine): log2(P) steps with messages of size 1; total amount of data transferred: P-1.]



- Communication time
  - Scatter: t_comm1 = ∑_{k=1}^{log2(P)} (t_startup + 2^{-k} n t_data) = log2(P) t_startup + n(P-1)/P t_data
  - Reduce: t_comm2 = log2(P) (t_startup + t_data)
  - Total: t_comm = 2 log2(P) t_startup + ( n(P-1)/P + log2(P) ) t_data
- Computation time
  - Local sum: t_comp1 = n/P
  - Global sum: t_comp2 = log2(P)
  - Total: t_comp = n/P + log2(P)
- Speedup, assuming t_startup = 0
  - Sequential time: t_s = n-1
  - Parallel time: t_P = ( n(P-1)/P + log2(P) ) t_data + n/P + log2(P)
  - Speedup: S_P = t_s/t_P = O(n / (n + log(P)))
  - Best speedup w/o communication: S_P = O(P/log(P))
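Evaluating the model numerically shows how strongly the scatter term limits the speedup. The sketch below assumes t_startup = 0 as on the slide and measures time in units of one addition; the function name `summation_speedup` and the sample values of n, P and t_data are illustrative only.

```python
from math import log2

def summation_speedup(n, P, t_data=1.0):
    t_s = n - 1                                      # sequential time
    t_comm = (n * (P - 1) / P + log2(P)) * t_data    # scatter + reduce
    t_comp = n / P + log2(P)                         # local sum + global sum
    return t_s / (t_comm + t_comp)

# With communication as expensive as an addition, the model predicts no gain.
print(summation_speedup(1024, 16, t_data=1.0))
# Without communication cost, the speedup approaches P.
print(summation_speedup(1024, 16, t_data=0.0))
```

Under this model, scattering the data costs about as much as summing it sequentially, so the partitioned summation only pays off when the data is already distributed.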


General M-Ary Partitioning

[Figure: timeline of a 3-level 4-ary partitioning for 4^3 = 64 processors, with divide phases (first, second and third division), a compute phase, and combine phases]

Example: partitioning an image, e.g. to compute a histogram by parallel reductions (summations to count color pixels)


Partitioning Example 2: Parallel Bucket Sort

- Bucket sort of values [x_1,…,x_n] bounded within a range x_i ∈ [lo…hi]
  1.  Partition the n values into P segments of n/P values each
  2a. Sort each segment into P small buckets (local computation)
  2b. Send the content of the small buckets to P large buckets
  3.  Sort the P large buckets and merge the lists

[Figure: P processors take the unsorted values, sort them into small buckets, empty the small buckets into large buckets, then sort the content of the buckets and merge the lists into the sorted values]



Input: list X of length n with minimum value L and maximum value U
Output: sorted list X

    // clamp so that x = U maps to the last bucket P-1
    def function bucket(x) = min(P-1, floor(P*(x-L)/(U-L)));

    scatter list X to local Xp lists each of size n/P
    forall processors p = 0,…,P-1
      for i = 0,…,n/P-1
        x = Xp[i]
        put x into small bucket bp[bucket(x)]
      all-to-all of small buckets bp into large buckets Bp
      sort values in Bp[0,…,P-1] using a sequential sort algorithm
    gather X from Bp into a merged sorted list
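The bucket sort pseudocode can be simulated in a single process as follows. This is a sketch, not the lecture's implementation: `parallel_bucket_sort` is an illustrative name, the P processors are modeled as list indices, scatter and all-to-all become list operations, and the bucket index is clamped to P-1 so that the maximum value stays in range.

```python
def parallel_bucket_sort(X, P):
    if not X:
        return []
    L, U = min(X), max(X)

    def bucket(x):
        if U == L:                       # all values equal: one bucket suffices
            return 0
        return min(P - 1, int(P * (x - L) / (U - L)))

    n = len(X)
    # Scatter: processor p gets the segment Xp of about n/P values.
    bounds = [n * p // P for p in range(P + 1)]
    Xp = [X[bounds[p]:bounds[p + 1]] for p in range(P)]
    # Local computation: processor p fills its P small buckets b[p][0..P-1].
    b = [[[] for _ in range(P)] for _ in range(P)]
    for p in range(P):
        for x in Xp[p]:
            b[p][bucket(x)].append(x)
    # All-to-all: large bucket q collects small bucket q of every processor.
    B = [[x for p in range(P) for x in b[p][q]] for q in range(P)]
    # Each processor sorts its large bucket with a sequential sort.
    for q in range(P):
        B[q].sort()
    # Gather: bucket ranges are ordered, so concatenation is sorted.
    return [x for q in range(P) for x in B[q]]

print(parallel_bucket_sort([9, 3, 7, 1, 8, 2, 6, 4], 4))
```

Because `bucket` is monotone in x, every value in large bucket q is no larger than any value in bucket q+1, which is why the final gather needs no merge comparisons.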



- Communication time (assuming uniform distribution in X)
  - Scatter: t_comm1 = log2(P) t_startup + n(P-1)/P t_data
  - All-to-all: t_comm2 = (P-1)(t_startup + n/P^2 t_data)
  - Gather: t_comm3 = log2(P) t_startup + n(P-1)/P t_data
- Computation time (assuming uniform distribution in X)
  - Small bucket sort: t_comp1 = n/P
  - Large bucket sort: t_comp2 = (n/P) log2(n/P)
- Speedup
  - Sequential time: t_s = n log2(n/P) (with P buckets)
  - Parallel time: t_P = 2 log2(P) t_startup + 2 n(P-1)/P t_data + (P-1)(t_startup + n/P^2 t_data) + (n/P) (1 + log2(n/P))
  - Speedup w/o communication: S_P = O(P)


Partitioning Example 3: Barnes Hut Algorithm

Direction of the force between two bodies at points p and q



[Figure: particles in 2D space and the corresponding quadtree. Each particle has a position (x,y) and a mass m. Each parent computes the total mass M and the center of mass C of its children; the mass of a parent is the sum of the masses of its children. A square without a particle is deleted.]



    for (t = 0; t < tmax; t++) {
      Build_tree();
      Compute_Total_Mass_Center();
      Compute_Force();
      Update_Positions();
    }

    Compute_Force() {
      for (i = 0; i < n; i++)
        Compute_Tree_Force(i, root);
    }

    Compute_Tree_Force(i, node) {
      if (box at node contains one particle)
        F = force using eq (**);
      else {
        r = distance from i to C (*) of box;
        D = size of box at node;
        if (D/r < theta)
          F = force using eq (**) with total M;
        else {
          F = 0;   // accumulate the forces of the child boxes
          for (all children c of box)
            F = F + Compute_Tree_Force(i, c);
        }
      }
      return F;
    }

The markers (*) and (**) refer to the center-of-mass and force equations of the original slides, which are not reproduced here.

- Sequential time is O(n log n)
- Assuming P = n, t_P = O(log P)
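A compact version of the tree build, the mass/center pass and the force traversal is sketched below. Several things are assumptions rather than content of the slides: the force law (Newtonian gravity with G = 1, since the force equation is not reproduced here), all class and function names, the representation of a particle as a tuple (x, y, m), and distinct particle positions.

```python
import math

class Node:
    """Quadtree box with center (cx, cy) and half-width `half`."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.body = None      # the single particle (x, y, m) of a leaf
        self.children = None  # four child slots once the box is subdivided
        self.M, self.C = 0.0, (0.0, 0.0)  # total mass and center of mass

    def insert(self, body):
        if self.children is None and self.body is None:
            self.body = body              # empty box becomes a leaf
            return
        if self.children is None:         # leaf: subdivide, push old particle down
            old, self.body = self.body, None
            self.children = [None] * 4
            self._push(old)
        self._push(body)

    def _push(self, body):
        east, north = body[0] >= self.cx, body[1] >= self.cy
        q = east + 2 * north
        if self.children[q] is None:      # child squares exist only if occupied
            h = self.half / 2
            self.children[q] = Node(self.cx + (h if east else -h),
                                    self.cy + (h if north else -h), h)
        self.children[q].insert(body)

    def summarize(self):
        """Bottom-up pass: parent M is the sum of child masses, C their weighted mean."""
        if self.children is None:
            self.M, self.C = self.body[2], (self.body[0], self.body[1])
            return
        kids = [c for c in self.children if c is not None]
        for c in kids:
            c.summarize()
        self.M = sum(c.M for c in kids)
        self.C = (sum(c.M * c.C[0] for c in kids) / self.M,
                  sum(c.M * c.C[1] for c in kids) / self.M)

def pair_force(body, C, M):
    """Assumed force law: Newtonian gravity with G = 1."""
    dx, dy = C[0] - body[0], C[1] - body[1]
    r = math.hypot(dx, dy)
    f = body[2] * M / r ** 3
    return (f * dx, f * dy)

def tree_force(body, node, theta):
    if node.children is None:             # box contains one particle
        return (0.0, 0.0) if node.body is body else pair_force(body, node.C, node.M)
    r = math.hypot(node.C[0] - body[0], node.C[1] - body[1])
    if r > 0 and 2 * node.half / r < theta:   # D/r < theta: use total mass M at C
        return pair_force(body, node.C, node.M)
    fx = fy = 0.0
    for c in node.children:
        if c is not None:
            gx, gy = tree_force(body, c, theta)
            fx, fy = fx + gx, fy + gy
    return (fx, fy)

def barnes_hut_forces(bodies, theta=0.5):
    xs, ys = [b[0] for b in bodies], [b[1] for b in bodies]
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 or 1.0
    root = Node((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2, half)
    for b in bodies:
        root.insert(b)
    root.summarize()
    # The n traversals are independent of each other.
    return [tree_force(b, root, theta) for b in bodies]
```

With theta = 0 the test D/r < theta never succeeds, so the traversal reaches every leaf and reproduces the direct pairwise sum; larger theta trades accuracy for fewer visited nodes.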


Divide and Conquer

- Divide and conquer strategy (definition by JáJá 1992)
  1.  Break up a given problem into independent subproblems
  2.  Solve the subproblems recursively and concurrently
  3.  Collect and combine the solutions into the overall solution
- In contrast to the partitioning strategy, divide and conquer uses recursive partitioning with concurrent execution to divide the problem down into independent subproblems
- In deeper levels of recursion the number of active processors may increase or decrease


Divide & Conquer Example 1: Parallel Recursive Matmul

- Block matrix multiplication in recursion by decomposing the matrix into 2×2 submatrices and computing the submatrices recursively

    Mat matmul(Mat A, Mat B, int s) {
      Mat C;
      if (s == 1)
        C = A * B;
      else {
        s = s/2;
        // P0…P7 computed in parallel
        P0 = matmul(A_pp, B_pp, s);
        P1 = matmul(A_pq, B_qp, s);
        P2 = matmul(A_pp, B_pq, s);
        P3 = matmul(A_pq, B_qq, s);
        P4 = matmul(A_qp, B_pp, s);
        P5 = matmul(A_qq, B_qp, s);
        P6 = matmul(A_qp, B_pq, s);
        P7 = matmul(A_qq, B_qq, s);
        // the four block sums can be computed in parallel
        C_pp = P0 + P1;
        C_pq = P2 + P3;
        C_qp = P4 + P5;
        C_qq = P6 + P7;
      }
      return C;
    }

- Level of parallelism increases with deepening recursion
- Suitable for shared memory systems
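A runnable version of the recursive block multiplication is sketched below, assuming square matrices stored as lists of rows whose size is a power of two. The helper names `split` and `add` are illustrative. The recursion is executed sequentially here; the eight products and the four block sums are the independent units that the slide marks as parallel.

```python
def split(M):
    """Return the four (n/2) x (n/2) blocks M_pp, M_pq, M_qp, M_qq."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matmul(A, B):
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    App, Apq, Aqp, Aqq = split(A)
    Bpp, Bpq, Bqp, Bqq = split(B)
    # Eight independent subproblems (candidates for parallel execution).
    P = [matmul(App, Bpp), matmul(Apq, Bqp), matmul(App, Bpq), matmul(Apq, Bqq),
         matmul(Aqp, Bpp), matmul(Aqq, Bqp), matmul(Aqp, Bpq), matmul(Aqq, Bqq)]
    # Four independent block sums.
    Cpp, Cpq = add(P[0], P[1]), add(P[2], P[3])
    Cqp, Cqq = add(P[4], P[5]), add(P[6], P[7])
    # Stitch the blocks back into one matrix.
    return ([l + r for l, r in zip(Cpp, Cpq)] +
            [l + r for l, r in zip(Cqp, Cqq)])

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```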


Divide and Conquer Example 2: Parallel Convex Hull Algorithm

- The planar convex hull of a set of points S = {p_1,p_2,…,p_n}, with coordinates p_i = (x,y), is the smallest convex polygon that encompasses all points of S in the x-y plane

[Figure: convex hull of a point set in the x-y plane]



- The upper convex hull spans points {q_1,…,q_s} ⊆ S from point q_1 with minimum x to q_s with maximum x
- The convex hull = upper convex hull + lower convex hull
- Problem:
  - Given points S = {p_1,…,p_n} such that x(p_1) < x(p_2) < … < x(p_n), construct the upper convex hull in parallel

[Figure: upper convex hull from q_1 to q_s in the x-y plane]



- Parallel convex hull:
  1.  Divide the x-sorted points S into sets S_1 and S_2 of equal size
  2.  Compute the upper convex hull recursively on S_1 and S_2
  3.  Combine UCH(S_1) and UCH(S_2) by computing the upper common tangent from a to b to form UCH(S)

[Figure: upper common tangent joining UCH(S_1) and UCH(S_2)]



- Base case of recursion: two points, which are returned as UCH(S)
- The line segment (a,b) can be computed sequentially in O(log n) time, with n = |UCH(S_1)| + |UCH(S_2)|, using a binary search method
- Line segments can be implemented as a linked list of points, thus UCH(S_1) and UCH(S_2) can be connected in O(1) time with one pointer change that makes a point to b
- Parallel convex hull time complexity recurrence relation:
    T(n) ≤ T(n/2) + a log n
  with solution: T(n) = O(log^2 n)
- Parallel convex hull operations recurrence relation:
    W(n) ≤ 2W(n/2) + b n
  with solution: W(n) = O(n log n), which is cost optimal, since the sequential algorithm is O(n log n)
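The divide and conquer structure of the upper hull computation can be sketched as follows. Note the swapped-in technique: instead of the O(log n) binary search for the upper common tangent, the combine step here rescans the two sub-hulls linearly (a monotone-chain pass), which is simpler but costs O(n) per merge. The function names are illustrative, and points are assumed to be (x, y) tuples sorted by distinct x.

```python
def cross(o, a, b):
    """Positive if o -> a -> b turns counterclockwise, negative if clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def merge_upper(h1, h2):
    # Linear-scan combine (not the binary-search tangent of the slides):
    # walking left to right, the upper hull may only turn clockwise.
    hull = []
    for p in h1 + h2:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def upper_hull(S):
    if len(S) <= 2:                  # base case: the points are their own hull
        return list(S)
    mid = len(S) // 2
    # The two recursive calls are independent and could run concurrently.
    return merge_upper(upper_hull(S[:mid]), upper_hull(S[mid:]))

print(upper_hull([(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]))
```

The merge is valid because every point of UCH(S) already lies on UCH(S_1) or UCH(S_2), so scanning only the two sub-hulls loses nothing.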


Divide and Conquer Example 3: First-Order Linear Recurrences

- First-order linear recurrence
    y_1 = b_1
    y_i = a_i y_{i-1} + b_i,   2 ≤ i ≤ n
- Example applications:
  - Prefix sum y_i = ∑_{j=1}^{i} b_j is a special case of a first-order linear recurrence with a_i = 1 (the multiplicative unit element)
  - Polynomial evaluation of order n-1 using Horner's rule, p(x) = (((b_1 x + b_2) x + b_3) x + … + b_{n-1}) x + b_n, is a special case of a first-order linear recurrence with a_i = x
  - Solving a bi-diagonal system By = c: let a_i = -l_i/d_i and b_i = c_i/d_i, then solve the linear recurrence to obtain the solution y

\[
\begin{pmatrix}
d_1 \\
l_2 & d_2 \\
 & l_3 & d_3 \\
 & & \ddots & \ddots \\
 & & & l_n & d_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{pmatrix}
\]



- Rewrite y_i = a_i y_{i-1} + b_i into y_i = a_i (a_{i-1} y_{i-2} + b_{i-1}) + b_i
- This equation defines a linear recurrence of size n/2 over the even indices i, with z_i = y_{2i}:
    z_1 = b_1'
    z_i = a_i' z_{i-1} + b_i',   2 ≤ i ≤ n/2
  1.  Let a_i' = a_{2i} a_{2i-1} and b_i' = a_{2i} b_{2i-1} + b_{2i}
  2.  Solve z_i recursively
  3.  For 1 ≤ i ≤ n set
        y_i = z_{i/2}                    if i is even
        y_i = a_i z_{(i-1)/2} + b_i      if i is odd and i > 1
        y_i = b_1                        if i = 1



- Parallel algorithm:

    linrecsolve(a[], b[], y[], n) {
      if (n == 1) {
        y[1] = b[1];
        return;
      }
      // build the half-size recurrence over the even indices
      forall (i = 1 to n/2) {
        anew[i] = a[2*i]*a[2*i-1];
        bnew[i] = a[2*i]*b[2*i-1] + b[2*i];
      }
      linrecsolve(anew, bnew, z, n/2);
      // z[i] holds y[2*i]; recover the odd-indexed values
      forall (i = 1 to n) {
        if (i == 1)
          y[1] = b[1];
        else if (even(i))
          y[i] = z[i/2];
        else
          y[i] = a[i]*z[(i-1)/2] + b[i];
      }
    }

Value of b_1 at successive recursion levels (log2 n recursive steps):

    b_1
    b_1'   = a_2 b_1 + b_2
    b_1''  = a_2' b_1' + b_2' = ((a_2 b_1 + b_2) a_3 + b_3) a_4 + b_4
    b_1''' = a_2'' b_1'' + b_2'' = ((a_2' b_1' + b_2') a_3' + b_3') a_4' + b_4'
           = ((((((a_2 b_1 + b_2) a_3 + b_3) a_4 + b_4) a_5 + b_5) a_6 + b_6) a_7 + b_7) a_8 + b_8
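The recursive halving can be checked against the plain sequential recurrence with the sketch below. It keeps the slides' 1-based indexing by leaving list slot 0 unused and returns y instead of filling an output argument; both choices are illustrative. The two loops that correspond to the forall statements are written as list operations and are executed sequentially here.

```python
def linrecsolve(a, b):
    """Solve y[1] = b[1], y[i] = a[i]*y[i-1] + b[i] for 1-based lists a, b.

    Slot 0 of every list is unused. a[1] does not influence the result
    but must be a number, because it is multiplied into anew[1].
    """
    n = len(b) - 1
    if n == 1:
        return [None, b[1]]
    half = n // 2
    # forall i: coefficients of the half-size recurrence z[i] = y[2*i]
    anew = [None] + [a[2 * i] * a[2 * i - 1] for i in range(1, half + 1)]
    bnew = [None] + [a[2 * i] * b[2 * i - 1] + b[2 * i] for i in range(1, half + 1)]
    z = linrecsolve(anew, bnew)
    # forall i: even entries are copied from z, odd entries take one more step
    y = [None] * (n + 1)
    for i in range(1, n + 1):
        if i == 1:
            y[i] = b[1]
        elif i % 2 == 0:
            y[i] = z[i // 2]
        else:
            y[i] = a[i] * z[(i - 1) // 2] + b[i]
    return y

# Prefix sum as the special case a[i] = 1
print(linrecsolve([None, 1, 1, 1, 1], [None, 1, 2, 3, 4])[1:])
```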


Divide and Conquer Example 4: Triangular Matrix Inversion

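- Consider Ax = b with n×n (lower) triangular matrix A

\[
A = \begin{pmatrix}
a_{11} \\
a_{21} & a_{22} \\
a_{31} & a_{32} & a_{33} \\
\vdots & \vdots & & \ddots \\
a_{n1} & a_{n2} & \cdots & \cdots & a_{nn}
\end{pmatrix}
\]

- Partition A into (n/2) × (n/2) blocks

\[
A = \begin{pmatrix} A_1 & 0 \\ A_2 & A_3 \end{pmatrix}
\]

- Then A^{-1} is given by

\[
A^{-1} = \begin{pmatrix} A_1^{-1} & 0 \\ -A_3^{-1} A_2 A_1^{-1} & A_3^{-1} \end{pmatrix}
\]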


Divide and Conquer Example 4: Triangular Matrix Inversion

- Parallel algorithm:
  1.  Divide A into A_1, A_2, A_3
  2.  Recursively compute the inverses of A_1 and A_3 in parallel
  3.  Multiply -A_3^{-1} A_2 A_1^{-1} and combine with A_1^{-1} and A_3^{-1} to get A^{-1}
- Time complexity is given by the recurrence relation
    T(n) = T(n/2) + c n
  with P = n^2 processors to compute -A_3^{-1} A_2 A_1^{-1} in O(n) operations in parallel, thus T(n) = O(n) time
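The block formula translates directly into a recursive routine. The sketch below assumes a lower triangular matrix with nonzero diagonal, stored as a list of rows; `tri_inverse` and the helper `matmul` are illustrative names, and the recursion and the matrix products run sequentially here.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def tri_inverse(A):
    n = len(A)
    if n == 1:
        return [[1 / A[0][0]]]
    h = n // 2
    A1 = [r[:h] for r in A[:h]]      # upper-left triangular block
    A2 = [r[:h] for r in A[h:]]      # lower-left full block
    A3 = [r[h:] for r in A[h:]]      # lower-right triangular block
    # The two recursive inversions are independent of each other.
    I1, I3 = tri_inverse(A1), tri_inverse(A3)
    X = matmul(matmul(I3, A2), I1)   # A3^{-1} A2 A1^{-1}
    top = [row + [0] * (n - h) for row in I1]
    bottom = [[-v for v in xr] + ir for xr, ir in zip(X, I3)]
    return top + bottom

print(tri_inverse([[2.0, 0.0], [1.0, 4.0]]))
```

Splitting at h = n // 2 lets the routine handle sizes that are not powers of two, at the cost of slightly unbalanced blocks.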


Divide and Conquer Example 5: Banded Triangular Systems

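- Consider Ax = b with banded triangular matrix A with m = 3, partitioned into m×m blocks A_{ij}

\[
A = \begin{pmatrix}
a_{11} \\
a_{21} & a_{22} \\
a_{31} & a_{32} & a_{33} \\
 & a_{42} & a_{43} & a_{44} \\
 & & a_{53} & a_{54} & a_{55} \\
 & & & a_{64} & a_{65} & a_{66} \\
 & & & & a_{75} & a_{76} & a_{77} \\
 & & & & & a_{86} & a_{87} & a_{88} \\
 & & & & & & a_{97} & a_{98} & a_{99}
\end{pmatrix}
\]

- Define block diagonal D and inverse D^{-1}

\[
D = \begin{pmatrix}
A_{11} \\
 & A_{22} \\
 & & \ddots \\
 & & & A_{n/m,n/m}
\end{pmatrix}
\qquad
D^{-1} = \begin{pmatrix}
A_{11}^{-1} \\
 & A_{22}^{-1} \\
 & & \ddots \\
 & & & A_{n/m,n/m}^{-1}
\end{pmatrix}
\]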


Divide and Conquer Example 5: Banded Triangular Systems

- Compute d = D^{-1} b and B = D^{-1} A, where B_{i,i-1} = A_{ii}^{-1} A_{i,i-1}

\[
d = D^{-1} b = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n/m} \end{pmatrix}
\qquad
B = D^{-1} A = \begin{pmatrix}
I_m \\
B_{21} & I_m \\
 & B_{32} & I_m \\
 & & \ddots & \ddots \\
 & & & B_{n/m,n/m-1} & I_m
\end{pmatrix}
\]

- Solve first-order linear recurrence on m×m matrices B_{i,i-1}
    x_1 = d_1
    x_i = -B_{i,i-1} x_{i-1} + d_i,   2 ≤ i ≤ n/m
- Parallel time O(m + m log(n/m)) with P = nm processors
  - Compute all A_{ii}^{-1} (each requiring O(m) operations) in parallel with the parallel matrix inversion algorithm
  - Compute all B_{i,i-1} = A_{ii}^{-1} A_{i,i-1} in O(m) operations in parallel
  - Recurrence depth is log2(n/m), each step has O(m) operations


Divide and Conquer Example 6: LU of Tridiagonal Matrix

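- Consider tridiagonal matrix LU decomposition

\[
\begin{pmatrix}
a_1 & c_1 \\
b_2 & a_2 & c_2 \\
 & b_3 & a_3 & c_3 \\
 & & \ddots & \ddots & \ddots \\
 & & & b_n & a_n
\end{pmatrix}
=
\begin{pmatrix}
1 \\
l_2 & 1 \\
 & l_3 & 1 \\
 & & \ddots & \ddots \\
 & & & l_n & 1
\end{pmatrix}
\begin{pmatrix}
d_1 & u_1 \\
 & d_2 & u_2 \\
 & & d_3 & u_3 \\
 & & & \ddots & \ddots \\
 & & & & d_n
\end{pmatrix}
\]

- The LU decomposition A = L U satisfies
    a_1 = d_1,   c_i = u_i,   a_i = d_i + l_i u_{i-1},   b_i = l_i d_{i-1}
  thus
    d_1 = a_1
    d_i = a_i - l_i u_{i-1} = a_i - u_{i-1} b_i / d_{i-1} = [ a_i d_{i-1} - b_i c_{i-1} ] / d_{i-1}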


Divide and Conquer Example 6: LU of Tridiagonal Matrix

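- Let

\[
R_1 = \begin{pmatrix} a_1 & 0 \\ 1 & 0 \end{pmatrix}
\qquad
R_i = \begin{pmatrix} a_i & -b_i c_{i-1} \\ 1 & 0 \end{pmatrix}
\qquad
T_i = R_i R_{i-1} \cdots R_1
\]

- From the Möbius transformation we have

\[
d_i = \frac{\begin{pmatrix} 1 & 0 \end{pmatrix} T_i \begin{pmatrix} 1 & 1 \end{pmatrix}^T}
           {\begin{pmatrix} 0 & 1 \end{pmatrix} T_i \begin{pmatrix} 1 & 1 \end{pmatrix}^T}
\]

- Algorithm:
  - Set up matrices R
  - Solve first-order linear recurrence (prefix sum) of T
  - Compute d_i
  - From the solution of d_i compute l_i = b_i/d_{i-1}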
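The matrix formulation can be checked against the scalar recurrence d_i = a_i - b_i c_{i-1} / d_{i-1}. In the sketch below the products T_i = R_i T_{i-1} are accumulated sequentially; on a parallel machine the same products would come from a parallel prefix computation, which is possible because 2×2 matrix multiplication is associative. The names `tridiag_pivots` and `matmul2` and the 1-based list layout (unused slots set to None) are assumptions for illustration.

```python
def matmul2(X, Y):
    """Product of two 2x2 matrices."""
    return [[X[0][0] * Y[0][0] + X[0][1] * Y[1][0], X[0][0] * Y[0][1] + X[0][1] * Y[1][1]],
            [X[1][0] * Y[0][0] + X[1][1] * Y[1][0], X[1][0] * Y[0][1] + X[1][1] * Y[1][1]]]

def tridiag_pivots(a, b, c):
    """a[1..n]: diagonal, b[2..n]: subdiagonal, c[1..n-1]: superdiagonal."""
    n = len(a) - 1
    T = [[a[1], 0], [1, 0]]              # T_1 = R_1
    d = [None, a[1]]                     # d_1 = a_1
    for i in range(2, n + 1):
        R = [[a[i], -b[i] * c[i - 1]], [1, 0]]
        T = matmul2(R, T)                # T_i = R_i T_{i-1}
        # d_i = ([1 0] T_i [1 1]^T) / ([0 1] T_i [1 1]^T)
        d.append((T[0][0] + T[0][1]) / (T[1][0] + T[1][1]))
    l = [None, None] + [b[i] / d[i - 1] for i in range(2, n + 1)]
    return d, l

d, l = tridiag_pivots([None, 4, 4, 4, 4], [None, None, 1, 1, 1], [None, 1, 1, 1, None])
print(d[1:], l[2:])
```

The entries of T_i can grow quickly with i, so a practical implementation would likely need to rescale the products; that is outside the scope of this sketch.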


Further Reading

- [PP2] pages 106-131
- [PSC] pages 321-337
