
Analysis of Multithreaded Algorithms
6.172 Performance Engineering of Software Systems, Lecture 13
Charles E. Leiserson
October 29, 2009

OUTLINE
• Divide-&-Conquer Recurrences
• Cilk Loops
• Matrix Multiplication
• Merge Sort
• Tableau Construction

The Master Method

The Master Method for solving divide-and-conquer recurrences applies to recurrences of the form*

  T(n) = a T(n/b) + f(n),

where a ≥ 1, b > 1, and f is asymptotically positive.

*The unstated base case is T(n) = Θ(1) for sufficiently small n.


Recursion Tree: T(n) = a T(n/b) + f(n)

[Figure: the recursion tree. The root costs f(n) and has a children, each costing f(n/b); each of those has a children costing f(n/b²); and so on down to the T(1) leaves. The tree has height h = log_b n, so the number of leaves is a^{log_b n} = Θ(n^{log_b a}). The level sums are f(n), a f(n/b), a² f(n/b²), …, Θ(n^{log_b a}).]

IDEA: Compare n^{log_b a} with f(n).
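Summing the tree level by level gives the closed form behind the three cases. A brief restatement (not on the slides), in LaTeX:

\[
T(n) \;=\; \Theta\!\left(n^{\log_b a}\right) \;+\; \sum_{i=0}^{\log_b n - 1} a^i\, f\!\left(n/b^i\right).
\]

The three cases below classify whether this sum is dominated by the leaves (CASE 1), spread evenly across the levels (CASE 2), or dominated by the root (CASE 3).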

Master Method — CASE 1

n^{log_b a} ≫ f(n): the level sums are GEOMETRICALLY INCREASING from root to leaves, so the Θ(n^{log_b a}) leaves dominate.

Specifically, f(n) = O(n^{log_b a − ε}) for some constant ε > 0.

  T(n) = Θ(n^{log_b a})

Master Method — CASE 2

n^{log_b a} ≈ f(n): the level sums are ARITHMETICALLY INCREASING, so all the levels contribute about equally.

Specifically, f(n) = Θ(n^{log_b a} lg^k n) for some constant k ≥ 0.

  T(n) = Θ(n^{log_b a} lg^{k+1} n)

Master Method — CASE 3

n^{log_b a} ≪ f(n): the level sums are GEOMETRICALLY DECREASING from root to leaves, so the root cost f(n) dominates.

Specifically, f(n) = Ω(n^{log_b a + ε}) for some constant ε > 0.*

  T(n) = Θ(f(n))

*and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.


Master-Method Cheat Sheet

T(n) = a T(n/b) + f(n)

CASE 1: f(n) = O(n^{log_b a − ε}), constant ε > 0
  ⇒ T(n) = Θ(n^{log_b a}).
CASE 2: f(n) = Θ(n^{log_b a} lg^k n), constant k ≥ 0
  ⇒ T(n) = Θ(n^{log_b a} lg^{k+1} n).
CASE 3: f(n) = Ω(n^{log_b a + ε}), constant ε > 0 (and regularity condition)
  ⇒ T(n) = Θ(f(n)).

Master Method Quiz

• T(n) = 4 T(n/2) + n
  n^{log_b a} = n² ≫ n ⇒ CASE 1: T(n) = Θ(n²).
• T(n) = 4 T(n/2) + n²
  n^{log_b a} = n² = n² lg⁰n ⇒ CASE 2: T(n) = Θ(n² lg n).
• T(n) = 4 T(n/2) + n³
  n^{log_b a} = n² ≪ n³ ⇒ CASE 3: T(n) = Θ(n³).
• T(n) = 4 T(n/2) + n²/lg n
  Master method does not apply!

OUTLINE
• Divide-&-Conquer Recurrences
• Cilk Loops
• Matrix Multiplication
• Merge Sort
• Tableau Construction

Loop Parallelism in Cilk++

Example: In-place matrix transpose. Swap A[i][j] with A[j][i] for all i > j, turning A into Aᵀ.

The iterations of a cilk_for loop execute in parallel.

  // indices run from 0, not 1
  cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }


Implementation of Parallel Loops

The cilk_for transpose loop is implemented by divide-and-conquer recursion over the index range:

  void recur(int lo, int hi) { // iterations lo..hi, inclusive
    if (hi > lo) {
      int mid = lo + (hi - lo)/2;
      cilk_spawn recur(lo, mid);
      recur(mid+1, hi);
      cilk_sync;
    } else {
      int i = lo;
      for (int j=0; j<i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
      }
    }
  }
  recur(1, n-1);

In practice, the recursion is coarsened to minimize overheads: near the leaves, a single call handles a whole block of iterations serially rather than spawning.

Analysis of Parallel Loops

  cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }

Span of loop control = Θ(lg n) (the depth of the spawn tree).
Max span of body = Θ(n) (the i = n−1 iteration does n−1 swaps).

Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n + lg n) = Θ(n)
Parallelism: T₁(n)/T∞(n) = Θ(n)

Analysis of Nested Loops

  cilk_for (int i=1; i<n; ++i) {
    cilk_for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }

Span of outer loop control = Θ(lg n).
Max span of inner loop control = Θ(lg n).
Span of body = Θ(1).

Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(lg n)
Parallelism: T₁(n)/T∞(n) = Θ(n²/lg n)

A Closer Look at Cilk Loops

Vector addition:

  cilk_for (int i=0; i<n; ++i) {
    A[i] += B[i];
  }

Work: T₁ = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T₁/T∞ = Θ(n/lg n)

Includes substantial overhead!


Loop Grain Size

Vector addition:

  #pragma cilk_grainsize = G
  cilk_for (int i=0; i<n; ++i) {
    A[i] += B[i];
  }

Work: T₁ = n·t_iter + (n/G)·t_spawn
Span: T∞ = G·t_iter + lg(n/G)·t_spawn

Want G ≫ t_spawn/t_iter and G small.

Another Implementation

  void vadd(double *A, double *B, int n) {
    for (int i=0; i<n; i++) A[i] += B[i];
  }
  ⋮
  for (int j=0; j<n; j+=G) {
    cilk_spawn vadd(A+j, B+j, min(G,n-j));
  }
  cilk_sync;

Assume that G = 1.
Work: T₁ = Θ(n)
Span: T∞ = Θ(n)
Parallelism: T₁/T∞ = Θ(1)

Another Implementation

  void vadd(double *A, double *B, int n) {
    for (int i=0; i<n; i++) A[i] += B[i];
  }
  ⋮
  for (int j=0; j<n; j+=G) {
    cilk_spawn vadd(A+j, B+j, min(G,n-j));
  }
  cilk_sync;

Analyze in terms of G:
Work: T₁ = Θ(n)
Span: T∞ = Θ(G + n/G)
The G term is the serial work inside one chunk; the n/G term is the span of the serial spawning loop. Choose G = √n to optimize: T∞ = Θ(√n).
Parallelism: T₁/T∞ = Θ(√n)
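A one-line check (not on the slides) that G = √n is the right choice, in LaTeX:

\[
\frac{d}{dG}\left(G + \frac{n}{G}\right) \;=\; 1 - \frac{n}{G^2} \;=\; 0
\quad\Longrightarrow\quad G = \sqrt{n},
\qquad T_\infty \;=\; \Theta\!\left(\sqrt{n} + n/\sqrt{n}\right) \;=\; \Theta(\sqrt{n}).
\]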

Three Performance Tips

1. Minimize the span to maximize parallelism. Try to generate 10 times more parallelism than processors for near-perfect linear speedup.
2. If you have plenty of parallelism, try to trade some of it off to reduce work overhead.
3. Use divide-and-conquer recursion or parallel loops rather than spawning one small thing after another.

Do this:
  cilk_for (int i=0; i<n; ++i) {
    foo(i);
  }

Not this:
  for (int i=0; i<n; ++i) {
    cilk_spawn foo(i);
  }
  cilk_sync;


And Three More

4. Ensure that work/#spawns is sufficiently large.
   • Coarsen by using function calls and inlining near the leaves of recursion, rather than spawning.
5. Parallelize outer loops, as opposed to inner loops, if you're forced to make a choice.
6. Watch out for scheduling overheads.

Do this:
  cilk_for (int i=0; i<2; ++i) {
    for (int j=0; j<n; ++j) {
      f(i,j);
    }
  }

Not this:
  for (int j=0; j<n; ++j) {
    cilk_for (int i=0; i<2; ++i) {
      f(i,j);
    }
  }

(Cilkview will show a high burdened parallelism.)

OUTLINE
• Divide-&-Conquer Recurrences
• Cilk Loops
• Matrix Multiplication
• Merge Sort
• Tableau Construction

Square-Matrix Multiplication

C = A·B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices:

  c_ij = ∑_{k=1}^{n} a_ik b_kj

Assume for simplicity that n = 2^k.

Parallelizing Matrix Multiply

  cilk_for (int i=0; i<n; ++i) {
    cilk_for (int j=0; j<n; ++j) {
      for (int k=0; k<n; ++k) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }

Work: T₁ = Θ(n³)
Span: T∞ = Θ(n)
Parallelism: T₁/T∞ = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶.


Recursive Matrix Multiplication

Divide and conquer:

  [C11 C12; C21 C22] = [A11 A12; A21 A22] · [B11 B12; B21 B22]
                     = [A11B11 A11B12; A21B11 A21B12] + [A12B21 A12B22; A22B21 A22B22]

8 multiplications of n/2 × n/2 matrices.
1 addition of n × n matrices.

D&C Matrix Multiplication

  template <typename T>
  void MMult(T *C, T *A, T *B, int n, int size) {
    T *D = new T[n*n];
    // base case & partition matrices
    cilk_spawn MMult(C11, A11, B11, n/2, size);
    cilk_spawn MMult(C12, A11, B12, n/2, size);
    cilk_spawn MMult(C22, A21, B12, n/2, size);
    cilk_spawn MMult(C21, A21, B11, n/2, size);
    cilk_spawn MMult(D11, A12, B21, n/2, size);
    cilk_spawn MMult(D12, A12, B22, n/2, size);
    cilk_spawn MMult(D22, A22, B22, n/2, size);
    MMult(D21, A22, B21, n/2, size);
    cilk_sync;
    MAdd(C, D, n, size); // C += D;
    delete[] D;
  }

size is the row length of the matrices. Coarsen the base case for efficiency. The submatrices C11, A12, etc. are determined by index calculation.
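A minimal sketch (not from the slides) of the index calculation the code above assumes: with row-major storage and row length size, the four quadrants of an n × n submatrix starting at M begin at fixed offsets. The macro name SUB is hypothetical.

  // Hypothetical helper: pointer to the (r,c) quadrant (r, c in {0,1})
  // of the n x n submatrix starting at M, where size is the row length
  // of the full matrix in which all submatrices are embedded.
  #define SUB(M, r, c) ((M) + (r)*(n/2)*(size) + (c)*(n/2))
  // e.g., A11 = SUB(A,0,0); A12 = SUB(A,0,1);
  //       A21 = SUB(A,1,0); A22 = SUB(A,1,1);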

Matrix Addition

  template <typename T>
  void MAdd(T *C, T *D, int n, int size) {
    cilk_for (int i=0; i<n; ++i) {
      cilk_for (int j=0; j<n; ++j) {
        C[size*i+j] += D[size*i+j];
      }
    }
  }


Analysis of Matrix Addition

For the doubly nested cilk_for in MAdd:

Work: A₁(n) = Θ(n²)
Span: A∞(n) = Θ(lg n)

Work of Matrix Multiplication

Work: M₁(n) = 8 M₁(n/2) + A₁(n) + Θ(1)
            = 8 M₁(n/2) + Θ(n²)
            = Θ(n³)

CASE 1: n^{log_b a} = n^{log_2 8} = n³ ≫ f(n) = Θ(n²).

Span of Matrix Multiplication

The spawned multiplications run in parallel, so the span takes the maximum over them:

Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg²n)

CASE 2: n^{log_b a} = n^{log_2 1} = 1, f(n) = Θ(n^{log_b a} lg¹n).


Parallelism of Matrix Multiply

Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(lg²n)
Parallelism: M₁(n)/M∞(n) = Θ(n³/lg²n)

For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

Temporaries

The temporary T *D = new T[n*n] in MMult holds the second set of partial products so that all 8 recursive multiplications can run in parallel.

IDEA: Since minimizing storage tends to yield higher performance, trade off parallelism for less storage.

No-Temp Matrix Multiplication

  // C += A*B;
  template <typename T>
  void MMult2(T *C, T *A, T *B, int n, int size) {
    // base case & partition matrices
    cilk_spawn MMult2(C11, A11, B11, n/2, size);
    cilk_spawn MMult2(C12, A11, B12, n/2, size);
    cilk_spawn MMult2(C22, A21, B12, n/2, size);
    MMult2(C21, A21, B11, n/2, size);
    cilk_sync;
    cilk_spawn MMult2(C11, A12, B21, n/2, size);
    cilk_spawn MMult2(C12, A12, B22, n/2, size);
    cilk_spawn MMult2(C22, A22, B22, n/2, size);
    MMult2(C21, A22, B21, n/2, size);
    cilk_sync;
  }

Saves space, but at what expense?

Work of No-Temp Multiply

Work: M₁(n) = 8 M₁(n/2) + Θ(1)
            = Θ(n³)

CASE 1: n^{log_b a} = n^{log_2 8} = n³, f(n) = Θ(1).


Span of No-Temp Multiply

The two cilk_sync's put the two groups of four multiplications in series, and within each group the span is the maximum over the parallel branches:

Span: M∞(n) = 2 M∞(n/2) + Θ(1)
            = Θ(n)

CASE 1: n^{log_b a} = n^{log_2 2} = n, f(n) = Θ(1).

Parallelism of No-Temp Multiply

Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(n)
Parallelism: M₁(n)/M∞(n) = Θ(n²)

For 1000 × 1000 matrices, parallelism ≈ (10³)² = 10⁶.

Faster in practice!

OUTLINE
• Divide-&-Conquer Recurrences
• Cilk Loops
• Matrix Multiplication
• Merge Sort
• Tableau Construction

Merging Two Sorted Arrays

  template <typename T>
  void Merge(T *C, T *A, T *B, int na, int nb) {
    while (na>0 && nb>0) {
      if (*A <= *B) {
        *C++ = *A++; na--;
      } else {
        *C++ = *B++; nb--;
      }
    }
    while (na>0) {
      *C++ = *A++; na--;
    }
    while (nb>0) {
      *C++ = *B++; nb--;
    }
  }

Time to merge n elements = Θ(n).

[Figure: merging {3, 12, 19, 46} with {4, 14, 21, 23}.]


Merge Sort

  template <typename T>
  void MergeSort(T *B, T *A, int n) {
    if (n==1) {
      B[0] = A[0];
    } else {
      T C[n];
      cilk_spawn MergeSort(C, A, n/2);
      MergeSort(C+n/2, A+n/2, n-n/2);
      cilk_sync;
      Merge(B, C, C+n/2, n/2, n-n/2);
    }
  }

[Figure: merge-sort recursion: split the array, recursively sort each half, then merge (example keys 3, 4, 12, 14, 19, 21, 33, 46).]

Work of Merge Sort

Work: T₁(n) = 2 T₁(n/2) + Θ(n)
            = Θ(n lg n)

CASE 2: n^{log_b a} = n^{log_2 2} = n, f(n) = Θ(n^{log_b a} lg⁰n).

Span of Merge Sort

Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)

CASE 3: n^{log_b a} = n^{log_2 1} = 1 ≪ f(n) = Θ(n).

Parallelism of Merge Sort

Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T₁(n)/T∞(n) = Θ(lg n)

We need to parallelize the merge!


Parallel Merge

Assume na ≥ nb (otherwise swap the roles of A and B). Let ma = na/2, and binary-search B for the index mb such that B[mb−1] ≤ A[ma] ≤ B[mb]. Then recursively merge A[0..ma−1] with B[0..mb−1] and, in parallel, A[ma+1..na−1] with B[mb..nb−1].

KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n, since each recursive merge contains at most na/2 + nb ≤ n/2 + n/4 elements.

Parallel Merge

  template <typename T>
  void P_Merge(T *C, T *A, T *B, int na, int nb) {
    if (na < nb) {
      P_Merge(C, B, A, nb, na);
    } else if (na==0) {
      return;
    } else {
      int ma = na/2;
      int mb = BinarySearch(A[ma], B, nb);
      C[ma+mb] = A[ma];
      cilk_spawn P_Merge(C, A, B, ma, mb);
      P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
      cilk_sync;
    }
  }

Coarsen base cases for efficiency.
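The slides do not show the BinarySearch helper; a minimal sketch, assuming it returns the number of elements of B that precede the key (a lower-bound search):

  // Sketch (assumed interface): returns the smallest index mb in [0, nb]
  // such that key <= B[mb], i.e., B[0..mb-1] < key. Runs in Θ(lg nb) time,
  // which is where the Θ(lg n) terms in the work and span recurrences arise.
  template <typename T>
  int BinarySearch(T key, T *B, int nb) {
    int lo = 0, hi = nb;            // search the half-open range [lo, hi)
    while (lo < hi) {
      int mid = lo + (hi - lo)/2;
      if (B[mid] < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }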

Span of Parallel Merge

Span: T∞(n) = T∞(3n/4) + Θ(lg n)
            = Θ(lg²n)

CASE 2: n^{log_b a} = n^{log_{4/3} 1} = 1, f(n) = Θ(n^{log_b a} lg¹n).

Work of Parallel Merge

Work: T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

Claim: T₁(n) = Θ(n).


Analysis of Work Recurrence

Work: T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.

Substitution method: The inductive hypothesis is T₁(k) ≤ c₁k − c₂lg k, where c₁, c₂ > 0. Prove that the relation holds, and solve for c₁ and c₂.

  T₁(n) = T₁(αn) + T₁((1−α)n) + Θ(lg n)
        ≤ c₁(αn) − c₂lg(αn) + c₁(1−α)n − c₂lg((1−α)n) + Θ(lg n)
        ≤ c₁n − c₂( lg(αn) + lg((1−α)n) ) + Θ(lg n)
        ≤ c₁n − c₂( lg(α(1−α)) + 2 lg n ) + Θ(lg n)
        ≤ c₁n − c₂lg n − ( c₂(lg n + lg(α(1−α))) − Θ(lg n) )
        ≤ c₁n − c₂lg n,

by choosing c₂ large enough that c₂(lg n + lg(α(1−α))) dominates the Θ(lg n) term. Choose c₁ large enough to handle the base case.

Parallelism of P_Merge

Work: T₁(n) = Θ(n)
Span: T∞(n) = Θ(lg²n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg²n)


Parallel Merge Sort

  template <typename T>
  void P_MergeSort(T *B, T *A, int n) {
    if (n==1) {
      B[0] = A[0];
    } else {
      T C[n];
      cilk_spawn P_MergeSort(C, A, n/2);
      P_MergeSort(C+n/2, A+n/2, n-n/2);
      cilk_sync;
      P_Merge(B, C, C+n/2, n/2, n-n/2);
    }
  }

Work: T₁(n) = 2 T₁(n/2) + Θ(n)
            = Θ(n lg n)
CASE 2: n^{log_b a} = n^{log_2 2} = n, f(n) = Θ(n^{log_b a} lg⁰n).

Span: T∞(n) = T∞(n/2) + Θ(lg²n)
            = Θ(lg³n)
CASE 2: n^{log_b a} = n^{log_2 1} = 1, f(n) = Θ(n^{log_b a} lg²n).

Parallelism of P_MergeSort

Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg³n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg²n)

OUTLINE
• Divide-&-Conquer Recurrences
• Cilk Loops
• Matrix Multiplication
• Merge Sort
• Tableau Construction


Constructing a Tableau

Problem: Fill in an n × n tableau A, where
A[i][j] = f( A[i][j−1], A[i−1][j], A[i−1][j−1] ).

Dynamic programming:
• Longest common subsequence
• Edit distance
• Time warping

Work: Θ(n²).

I;

Parallel code

Recursive Construction

n

;cilk_spawn II;III;cilk_sync;IV;

I II

III IV

© 2009 Charles E. Leiserson 62

III IV

n

I;

Parallel code

Recursive Construction

n

;cilk_spawn II;III;cilk_sync;IV;

I II

III IVCASE 1:

© 2009 Charles E. Leiserson 63

III IV

4T1(n/2) + Θ(1)= Θ(n2)

Work: T1(n) =

nlogba = nlog24 = n2

f(n) = Θ(1)

Quadrant I, then II ∥ III, then IV execute in series, so the span counts three of the four subproblems:

Span: T∞(n) = 3 T∞(n/2) + Θ(1)
            = Θ(n^{lg 3})
CASE 1: n^{log_b a} = n^{log_2 3}, f(n) = Θ(1).
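A minimal sketch (not from the slides) of the four-way recursion, assuming the tableau's first row and column are already initialized so every cell filled here has valid west, north, and northwest neighbors; fill4 and its layout convention are hypothetical:

  // Fill an m x m block whose top-left cell is A[0] (row length `size`),
  // assuming all cells above and to the left of the block are done.
  template <typename F>
  void fill4(double *A, int m, int size, F f) {
    if (m == 1) {
      A[0] = f(A[-1], A[-size], A[-size-1]); // west, north, northwest
      return;
    }
    int h = m/2;
    fill4(A, h, size, f);                    // I   (top-left)
    cilk_spawn fill4(A + h, h, size, f);     // II  (top-right)
    fill4(A + h*size, h, size, f);           // III (bottom-left)
    cilk_sync;
    fill4(A + h*size + h, h, size, f);       // IV  (bottom-right)
  }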


Analysis of Tableau Construction

Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^{lg 3}) = O(n^{1.59})
Parallelism: T₁(n)/T∞(n) = Θ(n^{2−lg 3}) = Ω(n^{0.41})

A More-Parallel Construction

Divide the n × n tableau into nine n/3 × n/3 quadrants:

  I    II   IV
  III  V    VII
  VI   VIII IX

The quadrants are filled in five antidiagonal waves: {I}, {II, III}, {IV, V, VI}, {VII, VIII}, {IX}.

Parallel code:
  I;
  cilk_spawn II;
  III;
  cilk_sync;
  cilk_spawn IV;
  cilk_spawn V;
  VI;
  cilk_sync;
  cilk_spawn VII;
  VIII;
  cilk_sync;
  IX;

Work: T₁(n) = 9 T₁(n/3) + Θ(1)
            = Θ(n²)
CASE 1: n^{log_b a} = n^{log_3 9} = n², f(n) = Θ(1).

The five waves execute in series, and each wave's span is the maximum over its quadrants:

Span: T∞(n) = 5 T∞(n/3) + Θ(1)
            = Θ(n^{log₃5})
CASE 1: n^{log_b a} = n^{log_3 5}, f(n) = Θ(1).

Analysis of Revised Method

Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^{log₃5}) = O(n^{1.47})
Parallelism: T₁(n)/T∞(n) = Θ(n^{2−log₃5}) = Ω(n^{0.53})

Nine-way divide-and-conquer has about Θ(n^{0.12}) more parallelism than four-way divide-and-conquer, but it exhibits less cache locality.


Puzzle

What is the largest parallelism that can be obtained for the tableau-construction problem using Cilk++?

• You may only use basic parallel control constructs (cilk_spawn, cilk_sync, cilk_for) for synchronization.
• No using locks, atomic instructions, synchronizing through memory, etc.


MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.