49
PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Embed Size (px)

Citation preview

Page 1: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

PL/B: A Programming Language for the Blues

George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Page 2: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Objective of this Project

• To develop a programming system for the Blue Gene series (and distributed-memory machines in general) that is:– Convenient to use for program development

and maintenance. Resulting parallel programs should be very readable – easy to modify.

– Not too difficult to implement so that the system could be completed in a reasonable time.

• Main focus is numerical computing.

Page 3: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Most Popular Parallel Programming Models Today

• SPMD– Library-based

• MPI• BSP

– Compiler-based• Co-Array Fortran• UPC

• Single thread of execution - loop/array-based– Implicit code partitioning and communication

Sophisticated compiler transformations• HPF• ZPL

– Explicit code partitioning and communication.Simple compiler transformations.

• OpenMP

Page 4: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Design Principles (1 of 2)

• Wanted to avoid the SPMD programming model. – In its more general form can lead to 4D spaghetti

code. More structure is needed.– The most popular of the SPMD implementations, MPI,

is cumbersome to use, in part due to the lack of compiler support. It has been called the “assembly language of parallel computing”.

– It is difficult to reason about SPMD programs. When we reason about them, we rely on an (implicit) global view of communication and computation.

• Wanted to avoid the compiler complexity and limitations that led to the failure of HPF.

Page 5: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Design Principles (2 of 2)

• OpenMP is a good model, but Blue Gene is not a shared-memory machine. OpenMP could be implemented on top of TreadMarks, but efficient implementations would require untested, sophisticated compilers.

• We propose a few language extensions and a programming model based on a single thread of execution. We would like the extensions to require only simple transformations similar to those used by OpenMP compilers.

• We would like the extensions to be of a general nature so that they can be applied to any conventional language.

Page 6: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

MATLAB Implementation

• We will present next our solution in the context of MATLAB.

• There are at least two reasons why developing a parallel extension to MATLAB is a good idea. 1. MATLAB is an excellent language for prototyping conventional

algorithms. There is nothing equivalent for parallel algorithms.• Programmers of parallel machines should appreciate the MATLAB

environment as much as conventional programmers.• In fact, during a site visit to TJ Watson, several government

representatives requested that a parallel MATLAB be included in the PERCS project.

2. It is relatively simple to develop a prototype of the proposed extensions in MATLAB.

Page 7: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

PL/B Data Types and Statements

• The constructs of PL/B are– A new type of objects: Hierarchically tiled arrays.– Array (and hierarchical array) operations and

assignment statements similar to those found in MATLAB, Fortran 90, and APL.

• Hierarchically tiled arrays (HTAs) are used – to specify data distribution in parallel programs and– to facilitate the writing of blocked sequential

algorithms (for locality).

Page 8: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 1: Matrix Multiplication (1 of 4)

for I=1:q:n for J=1:q:n for K=1:q:n for i=I:I+q-1 for j=J:J+q-1 for k=K:K+q-1 c(i,j)=c(i,j)+a(i,k)*b(k,j); end end end end endend

1. Tiled Matrix Multiplication in a Conventional Language

Page 9: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

• Here c{i,j}, a{i,k}, b{k,j}, and T represent submatrices.

• The * operator represents matrix multiplication in MATLAB.

2. Tiled Matrix Multiplication Using HTAs

Example 1: Matrix Multiplication (2 of 4)

for i=1:m for j=1:m T=0; for k=1:m T=T+a{i,k}*b{k,j}; end c{i,j}=T; endend

Page 10: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

% block {i,j} is in processor (i,j)for j=1:m for k=1:m % copy kth column of a r{:,j}=a{:,k}; % broadcast b{k,j} s{:,j}=b{k,j}; % compute w/o communication c{:,j}=c{:,j}+r{:,j}*s{:,j}; endend

for j=1:m for k=1:m c{:,j}=c{:,j}+a{:,k}*b{k,j}; endend

3. Parallel Matrix Multiplication Using HTAs

Example 1: Matrix Multiplication (3 of 4)

• Use the middle product method to do parallel matrix multiply,

• Assume a two dimensional processor mesh.

• Assume {i,j} submatrices are stored in processor (i,j)

• All communications are explicit in separate array assignment statements.

• Parallel computations are also explicit in separate array assignment statements and involve no communication.

Page 11: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

for i=1:m for j=1:m c{i,j}=dotproduct(a,b,i,j); endend

function x = dotproduct(a,b,i,j) t{i,:}=b{:,j}; %communication t{i,:}=t{i,:}*a{i,:}; %computation s{i,:}=t{i,:}; for i=1:ceil(log2(size(a))) s{i,:}=eoshift(t{i,:},2^i,eye); t{i,:}=s{i,:}*t{i,:}; end x=t{i,1};end

4. A Second Version of Parallel Matrix Multiplication

Example 1: Matrix Multiplication (4 of 4)

• Use now the inner product method,

• Collective operations that involve both computations and communications can be represented using intrinsic functions or user functions.

Page 12: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

HTAs

• HTAs are n-dimensional arrays that have been tiled (tiles are d-dimensional, d ≤ n).

• The tiles could be recursive tiled with the tiles at each level having the same shape except for “boundary cases”.

• The topmost levels could be distributed across modules of a parallel system.

Page 13: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Examples of HTA

A(1:4,1:18) A{1:2}{1:4,1:3}(1:3)

A(1:4,1:17) A{1}{1:4,1:3}{1:3}, A{2}{1:4,1:3}(1:2)

Page 14: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Flattening

• Elements of a HTA are referenced using a tile index for each level in the hierarchy followed by an array index. Each tile index tuple is enclosed within {}s and the array index is enclosed within parentheses. – In the matrix multiplication code, c{i,j}(3,4) would represent

element 3,4 of submatrix i,j.– In the first example of the previous page A{2},{4,3}(3) represents

the element on the bottom right corner.

• Alternatively, the tiled array could be accessed as a flat array. – In the matrix multiplication example, assuming that each tile is

p × p, c(r,s) would represent element 3,4 of submatrix i,j if r=(i-1)*p+3 and s=(j-1)*p+4.

Page 15: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Two Ways of Referencing the Elements of an 8 x 8 Array.

C{1,1}(1,1)C(1,1)

C{1,1}(1,2)C(1,2)

C{1,1}(1,3)C(1,3)

C{1,1}(1,4)C(1,4)

C{1,2}(1,1)C(1,5)

C{1,2}(1,2)C(1,6)

C{1,2}(1,3)C(1,7)

C{1,2}(1,4)C(1,8)

C{1,1}(2,1)C(2,1)

C{1,1}(2,2)C(2,2)

C{1,1}(2,3)C(2,3)

C{1,1}(2,4)C(2,4)

C{1,2}(2,1)C(2,5)

C{1,2}(2,2)C(2,6)

C{1,2}(2,3)C(2,7)

C{1,2}(2,4)C(2,8)

C{1,1}(3,1)C(3,1)

C{1,1}(3,2)C(3,2)

C{1,1}(3,3)C(3,3)

C{1,1}(3,4)C(3,4)

C{1,2}(3,1)C(3,5)

C{1,2}(3,2)C(3,6)

C{1,2}(3,3)C(3,7)

C{1,2}(3,4)C(3,8)

C{1,1}(4,1)C(4,1)

C{1,1}(4,2)C(4,2)

C{1,1}(4,3)C(4,3)

C{1,1}(4,4)C(4,4)

C{1,2}(4,1)C(4,5)

C{1,2}(4,2)C(4,6)

C{1,2}(4,3)C(4,7)

C{1,2}(4,4)C(4,8)

C{2,1}(1,1)C(5,1)

C{2,1}(1,2)C(5,2)

C{2,1}(1,3)C(5,3)

C{2,1}(1,4)C(5,4)

C{2,2}(1,1)C(5,5)

C{2,2}(1,2)C(5,6)

C{2,2}(1,3)C(5,7)

C{2,2}(1,4)C(5,8)

C{2,1}(2,1)C(6,1)

C{2,1}(2,2)C(6,2)

C{2,1}(2,3)C(6,3)

C{2,1}(2,4)C(6,4)

C{2,2}(2,1)C(6,5)

C{2,2}(2,2)C(6,6)

C{2,2}(2,3)C(6,7)

C{2,2}(2,4)C(6,8)

C{2,1}(3,1)C(7,1)

C{2,1}(3,2)C(7,2)

C{2,1}(3,3)C(7,3)

C{2,1}(3,4)C(7,4)

C{2,2}(3,1)C(7,5)

C{2,2}(3,2)C(7,6)

C{2,2}(3,3)C(7,7)

C{2,2}(3,4)C(7,8)

C{2,1}(4,1)C(8,1)

C{2,1}(4,2)C(8,2)

C{2,1}(4,3)C(8,3)

C{2,1}(4,4)C(8,4)

C{2,2}(4,1)C(8,5)

C{2,2}(4,2)C(8,6)

C{2,2}(4,3)C(8,7)

C{2,2}(4,4)C(8,8)

Page 16: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Parallel Communication and Computation in PL/B (1 of 2)

• PL/B programs are single-threaded and contain array operations on HTAs. The sequential part of the program

• Like in HPF and MPI processors can be arranged in different forms. In MPF and MPI, processors are arranged into meshes, in PL/B we also consider (virtual) nodes so that processors can be arranged into hierarchical meshes.

• The top levels of HTAs can be distributed onto a subset of nodes.

Page 17: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Parallel Communication and Computation in PL/B (1 of 2)

• Array operations on HTAs can represent communication or computation. – Assignment statements where all HTA indices

are identical are computations executed in the home of each of the HTA elements involved.

– Assignment statements where this is not the case represent communication operations.

Page 18: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

HTA Distribution Examples (1 of 4)

• Consider a 3x6 processor arrangement.• A HTA with shape {1:3,1:6}(1:10) will be distributed on the

processor mesh by assigning each 10 element vector o a different processor.

• A HTA with shape {1:6,1:18}(1:10) will be distributed cyclically on both dimensions:

A{1,1} A{1,7} A{1,13}A{4,1} A{4,7} A{4,13}

A{1,6} A{1,12} A{1,18}

A{4,6} A{4,12} A{4,18}

A{2,1} A{2,7} A{2,13}A{5,1} A{5,7} A{5,13}

A{1,6} A{1,12} A{1,18}

A{4,6} A{4,12} A{4,18}

A{3,1} A{3,7} A{3,13}A{6,1} A{6,7} A{6,13}

A{1,6} A{1,12} A{1,18}

A{4,6} A{4,12} A{4,18}

A{1,2} A{1,8} A{1,14}A{4,2} A{4,8} A{4,14}

A{2,2} A{2,8} A{2,14}A{5,2} A{5,8} A{5,14}

A{3,2} A{3,8} A{3,14}A{6,2} A{6,8} A{6,14}

Page 19: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

HTA Distribution Examples (2 of 4)• Consider the two level processor

arrangement shown on the left. It is a 3x2 arrangement of 2x2 meshes of processors.

• A HTA with shape {1:3,1:2}{1:2,1:2}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor.

• Also, the processor arrangement shown on left could be flattened so that an HTA with shape {1:6,1:4}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor

Page 20: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

HTA Distribution Example (3 of 4)• A HTA, b, with shape {1:3,1:2}

{1:2,1:20}(1:5) would be distributed cyclically on the arrangement on the left. The pair of processors {i,j}{k,1:2} would host the following vectors:b{i,j}{k,1}(1:5)b{i,j}{k,3}(1:5)…b{i,j}{k,19}(1:5)

b{i,j}{k,2}(1:5)b{i,j}{k,4}(1:5)…b{i,j}{k,20}(1:5)

• A HTA, b, with shape {1:6,1:40}(1:5) would be distributed cyclically on the flattened arrangement so that processors {i,1:4} would host the following vectors:b{i,1}

(1:5)b{i,5}(1:5)…b{i,17}(1:5)

b{i,2}(1:5)b{i,6}(1:5)…b{i,18}(1:5)

b{i,3}(1:5)b{i,7}(1:5)…b{i,19}(1:5)

b{i,4}(1:5)b{i,8}(1:5)…b{i,20}(1:5)

Page 21: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Advantages of the PL/B Programming Model

• We believe that by relying on a single thread our programming model facilitates the transition from sequential to parallel form.

• In fact, a conventional program can be incrementally modified by converting one data structure at a time into distributed cell array form (cf. OpenMP).

Page 22: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Additional Examples• We now present several additional examples of

PL/B• Except where indicated, we assume that:

– Each HTA has a single level of tiling, – All HTAs have identical sizes and shapes,– The HTAs are identically distributed distributed on a

processor mesh that matches in size and shape all HTAs.

• Our objective in these examples is to show the expressiveness of the language for both dense and sparse computations.

Page 23: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: Cannon’s Algorithm (1 of 3)

Page 24: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: Cannon’s Algorithm (2 of 3)

c{1:n,1:n} = zeros(p,p); %communicationforall i=1:n

a{:,i} = cshift(a(i,:},i-1); %communicationendforall i=1:n

b{:,i} = cshift(b{:,i},i-1); %communicationendfor k=1:n

forall i=1:n, j=1:nc{i,j} = c{i,j}+a{i,j}*b{i,j}; %computation

endforall i=1:n

a{i,:} = cshift(a{i,:},1); %communication

endforall i=1:n

b{:,i} = cshift(b{:,i},1); %communication

endend

forall version

Page 25: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: Cannon’s Algorithm (3 of 3)

c{1:n,1:n} = zeros(p,p); %communication

for i=2:n a{i:n,:} = cshift(a(i:n,:},dim=2,shift=1); %communication b{:,i:n} = cshift(b{:,i:n},dim=1,shift=1) %communicationendfor k=1:n c{:,:} = c{:,:}+a{:,:}*b{:,:}; %computation a{:,:} = cshift(a{:,:},dim=2, shift=1); %communication b{:,:} = cshift(b{:,:},dim=1,shift=1); %communicationend

Triplet version

Page 26: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (1 of 6)

Use now the outer-product method (n2-parallelism)Interchanging the loop headers of the loop in Example 1 produce:

for k=1:n for i=1:n for j=1:n C{i,j}=C{i,j}+A{i,k}*B{k,j} end end

endTo obtain n2 parallelism, the inner two loops should take the form of a block operations:

for k=1:n C{:,:}=C{:,:}+A{:,k} B{k,:};

endWhere the operator represents the outer product operations

Page 27: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (2 of 6)

• The SUMMA AlgorithmA BC

b11 b12

a11

a21

a11b11

a21b11

a11b12

a21b12

Switch Orientation -- By using a column of A and a row of B broadcast to all, compute the “next” terms of the dot product

Switch Orientation -- By using a column of A and a row of B broadcast to all, compute the “next” terms of the dot product

Page 28: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (3 of 6)

c{1:n,1:n} = zeros(p,p); % communicationfor i=1:n t1{:,:}=spread(a(:,i),dim=2,ncopies=N); % communication t2{:,:}=spread(b(i,:),dim=1,ncopies=N); % communication c{:,:}=c{:,:}+t1{:,:}*t2{:,:}; % computationend

Page 29: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (4 of 6)

Page 30: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (5 of 6)

Page 31: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 2: The SUMMA Algorithm (6 of 6)

Page 32: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 4: Jacobi Relaxation

while dif > epsilon v{2:,:}(0,:) = v{:n-1,:}(p,:); % communication v{:n-1,:} (p+1,:) = v{2:,:} (1,:); % communication v{:,2:} (:,0) = v{:,1:n-1}(:,p); % communication v{:, :n-1}(:,p+1) = v{:,2:}(:,1); % communication

u{:,:}(1:p,1:p) = a * (v{:,:} (1:p,0:p-1) + v{:,:} (0:p-1,1:p)+ v{:,:} (1:p,2:p+1) + v{:,:} (2:p+1,1:p)) ; %computation

dif=max(max( abs ( v – u ) )); v = u ; end

Page 33: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 5: Sparse MVM/copy vector (1 of 2)

A b×

P1

P3

P2

P4

Page 34: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

% Distribute ac{1:n} = a(DIST(1:n):DIST(1:n+1)-1,:);

% Broadcast vector bv{1:n} = b;

% Local multiply t{:} = c{:} * v{:};

% Get resultforall i=1:N

v{i}= t(:); % flattened tend

Example 5: Sparse MVM/copy vector (1 of 2)

Important observation: In MATLAB sparse computations can berepresented as dense computations. The interpreter only performs the necessary operations.

Page 35: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 6: Sparse MVM/distribute vector (1 of 7)

A b×

P1

P3

P2

P4

Page 36: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 6: Sparse MVM/distribute vector (2 of 7)

P1

P3

P2

P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

Step 1: Compute set of indices needed in processor i from chunk of vector in processor j

Page 37: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 6: Sparse MVM/distribute vector (3 of 7)

P1

P3

P2

P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

Step 2: Perform collective communication bytransposing matrix of set of indices

P1 P1 P1 P1

P2 P2 P2 P2

P3 P3 P3 P3

P4 P4 P4 P4

Page 38: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 6: Sparse MVM/distribute vector (4 of 7)

P1

P3

P2

P4

Step 3: In each processor, collect the vectorelements needed by the other processors

P1 P1 P1 P1

P2 P2 P2 P2

P3 P3 P3 P3

P4 P4 P4 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

Page 39: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 6: Sparse MVM/distribute vector (5 of 7)

P1

P3

P2

P4

Step 4: Transpose again to communicate vector elements to the processor where they are needed

P1 P1 P1 P1

P2 P2 P2 P2

P3 P3 P3 P3

P4 P4 P4 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

P1 P2 P3 P4

Page 40: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

c{1:n} = a(DIST(1:n):DIST(2:n+1)-1,:);v{1:n}(DIST(1:n):DIST(2:n+1)-1,1) = b(DIST(1:n):DIST(2:n+1)-1);

% Step 1: P{I}{J} contains index of elements of vector v from% processor J needed by processor I. First component of P is distributed% and the second is a regular cell array.

forall I=1:NP{I} =fun(c{I},DIST);

end

Example 6: Sparse MVM/distribute vector (6 of 7)

Page 41: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

% Step 2: R{I}{J} contains the index of elements of vector v from% processor I needed by processor J

R{:}{:}= transpose(P{:}{:});% Step 3: Q{I}{J} contains all elements of vector v from% processor I needed by processor J

forall I=1:N, k=1:N, (k != I)Q{I}{k} = v{I}(R{I}{k}(:));

end% Step 4: R{I}{J} contains all elements of vector v from% processor J needed by processor I

R{:}{:} = transpose(Q{:}{:} );forall I=1:N

v{I}(P{I}(:))=R{I}(:);endv{:} = c{:} * v{:};

Example 6: Sparse MVM/distribute vector (7 of 7)

Page 42: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 7: SuperLU_DISTStep 1: Apply LU decomposition to corner block C{K,K}

Page 43: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Step 2: Factorize block column L{…,K} = C{…,K}/U{K,K}

Example 7: SuperLU_DIST

Page 44: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Step 3: Factorize block row U{K,…}

Example 7: SuperLU_DIST

Page 45: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Step 4: Update trailing Submatrix C{I,J}=C{I,J}-L{I,K}*U{K,J}

Example 7: SuperLU_DIST

Page 46: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 7: SuperLU_DIST

for K=1:N[L{K,K},U{K,K}]=lu(C{K,K}); %Step 1

L{K+1:N,K}=C{K+1:N,K}/U{K,K}; %Step 2U{K,K+1:N}=L{K,K}\C{K,K+1:N}; %Step 3

forall I=K+1:N;L{I,K+1:}=L{I,K};U{K+1:,I}=U{K+1:,I};

end

C{K+1:,K+1:}=C{K+1:,K+1:}-L{K+1:,K+1:}*U{K+1:,K+1:}; %Step 4end

Page 47: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 7: SuperLU_DIST

0

0

0

0

0

0

Avoiding unnecessary communication

Page 48: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

T

F

T

F

T

T

F

T

F

T

T

F

T

F

T

T

F

T

F

T

T

F

T

F

T

Example 7: SuperLU_DIST

Spread

TF

TF

T

0

0

0T

F

T

F

T

TF

TF

T

TF

TF

T

TF

TF

T

TF

TF

T

Tranpose & Send

Page 49: PL/B: A Programming Language for the Blues George Almasi, Luiz DeRose, Jose Moreira, and David Padua

Example 7: SuperLU_DISTfor K=1:N

[L{K,K},U{K,K}]=lu(C{K,K});%Step 1

T{K:N,K} = C{K:N,K} != 0;S{K:N,K}(K:N)=spread(T{K:N,K},…);R {K,K:N}=transpose(SL{K+1:N,K}{K:N});

L{K+1:N,K}=C{K+1:N,K}/U{K,K}; %Step 2

U{K,K+1:N}=L{K,K}\C{K,K+1:N};%Step 3

forall I=K+1:N;L{I,K+1:}=L{I,K};

end

forall I=K=1:NU{R{K,I},I}=U{K,I};

end

C{K+1:,K+1:}=C{K+1:,K+1:}-L{K+1:,K+1:}*U{K+1:,K+1:};%Step 4end