PROGRAMMING FOR PARALLELISM AND LOCALITY WITH HIERARCHICALLY TILED ARRAYS (HTAs)

PRESENTATION BY ROMAN FRIGG

Paper written at UIUC (1), Universidade da Coruña (2) and the IBM T.J. Watson Research Center (3) by Ganesh Bikshandi (1), Jia Guo, Daniel Hoeflinger (1), Gheorghe Almasi (3), Basilio B. Fraguela (2), María J. Garzarán (1), David Padua (1) and Christoph von Praun (3)

PAPER PUBLISHED AT PPoPP, MARCH 2006
PROGRAMMING TODAY'S SYSTEMS

SCALABILITY, PORTABILITY, PRODUCTIVITY

Parallelism, Locality, Abstractions: HTAs
CLASSIFICATION

LIBRARIES vs. LANGUAGES: HTA, MPI/PVM, GAS, POET, POOMA, X10, CAF, ZPL, Titanium, UPC, HPF

HTA:
‣ Library
‣ Matlab & C++
‣ Single threaded, global view
TALK OVERVIEW

1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS
2 HOW HTAs WORK

RECURSIVE TILING

[Figure (source: paper): a Hierarchically Tiled Array. The outermost level of tiles is distributed across processors; the inner levels of tiles are local to a processor.]
CONSTRUCT AN HTA FROM A 6x6 MATRIX

T1 = hta(M, {[1 3 5], [1 3 5]})
    % M is a 6x6 matrix; new tiles start at rows 1, 3, 5 and at columns 1, 3, 5,
    % so T1 is a 3x3 arrangement of 2x2 tiles.

T2 = hta(T1, {[1 2], [1 3]}, [2 2])
    % T1 is tiled again, with top-level tiles starting at tile rows 1, 2 and at
    % tile columns 1, 3; the resulting 2x2 top level is mapped onto a 2x2
    % processor mesh (P1 P2 / P3 P4).
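A minimal end-to-end sketch of the two constructor calls (assuming the HTA MATLAB toolbox described in the paper is installed; hta is its constructor, not a built-in MATLAB function):

    % Build a 6x6 matrix and tile it hierarchically, as on the slide.
    M  = reshape(1:36, 6, 6);             % any 6x6 matrix will do
    T1 = hta(M, {[1 3 5], [1 3 5]});      % 3x3 tiles, each tile 2x2
    T2 = hta(T1, {[1 2], [1 3]}, [2 2]);  % second tiling level, mapped to a 2x2 mesh
    % T1{2,3} addresses the tile at tile position (2,3), i.e. the elements M(3:4, 5:6).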
HTA ACCESS

C = [figure: a two-level tiled HTA used in the examples below]

C(1:2, 3:6)          % flattened access: plain element indices that ignore the tiling
C{2,1}{1,2}(2,2)     % hierarchical access: top-level tile, then inner tile, then element
C(6,4) = ...         % assignment through flattened indexing; the slide shows the
C{2,1}(2,4) = ...    % equivalent hierarchical form for the same element
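A small sketch of the two indexing styles (assumes the HTA MATLAB toolbox; the concrete tiling of C below is chosen for illustration and need not match the slide's figure):

    C = hta(magic(6), {[1 3 5], [1 3 5]});   % 6x6 elements, 3x3 tiles of 2x2
    x = C(1:2, 3:6);      % flattened indexing: rows 1:2, columns 3:6, across tiles
    y = C{2,1}(2,2);      % hierarchical indexing: tile (2,1), then element (2,2) in it
    C(6,4) = 0;           % scalar assignment through flattened indexing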
ASSIGNMENTS & BINARY OPERATORS

VALID OPERATIONS: an HTA can be combined (⊕) with a Scalar, with an Array, or with another HTA, provided the operands conform.

VALID OPERATION? Cases worked through on the slides:
‣ 4x4 HTA * 2x3 Array
‣ 4x4 HTA * 3x2 Array
‣ 4x4 HTA * Scalar
‣ 4x4 HTA * 4x4 HTA
‣ 4x4 HTA = 4x4 HTA
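A minimal sketch of the three operand kinds (assumes the HTA MATLAB toolbox, whose arithmetic operators are overloaded for HTAs; the exact conformability rules are those of the paper):

    H = hta(ones(4), {[1 3], [1 3]});   % 4x4 elements as 2x2 tiles of size 2x2
    S = H * 3;                          % HTA (op) scalar: the scalar is applied everywhere
    A = H + ones(2);                    % HTA (op) array: the 2x2 array conforms with each tile
    G = H + H;                          % HTA (op) HTA: both operands have the same tiling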
TALK OVERVIEW (next: 3 HTA OPERATIONS & APPLICATIONS)
TWO KINDS OF OPERATIONS

‣ GLOBAL COMPUTATIONS: a computation f(x) on a distributed HTA is carried out by having every processor apply g(x) to the tiles it owns, e.g. parHTA(@g, H).
‣ COMMUNICATION OPERATIONS: operations that move tiles between processors: assignments, repmat, circshift, permute.
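A minimal sketch of both kinds (assumes the HTA MATLAB toolbox; parHTA is the map-over-tiles primitive named on the slide and its exact signature follows that slide, circshift is the overloaded tile-level shift, and the anonymous function merely stands in for g):

    H = hta(magic(6), {[1 3 5], [1 3 5]}, [3 3]);  % 3x3 tiles distributed on a 3x3 mesh
    G = parHTA(@(t) t.^2, H);                      % global computation: g applied to every tile locally
    H = circshift(H, [0, -1]);                     % communication: rotate the tiles one position left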
CANNON'S ALGORITHM

function C = cannon(A, B, C)
    % A, B, C are m-by-m tiled (and distributed) HTAs; m is the number of tiles per
    % dimension and is assumed to be visible here (for instance m = size(A, 1), if
    % size() of an HTA reports tile counts as in the paper's toolbox).
    % Initialization: skew tile row i of A left by i-1, tile column i of B up by i-1.
    for i = 2:m
        A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
        B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
    end
    % Iteration: multiply the aligned tiles into C, then rotate A left and B up by one tile.
    for k = 1:m
        C = C + A * B;
        A = circshift(A, [0, -1]);
        B = circshift(B, [-1, 0]);
    end
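A hypothetical driver for the routine above (assumes the HTA MATLAB toolbox; the constructor form with a processor mesh follows the construction slide, and m inside cannon must refer to the same tile count, e.g. via a global or an added argument):

    n = 6;  m = 3;  b = n/m;                 % 6x6 matrices as 3x3 tiles of 2x2
    part = {1:b:n, 1:b:n};                   % tile boundaries at rows/columns 1, 3, 5
    A = hta(rand(n),  part, [m m]);          % distribute the tiles over a 3x3 mesh
    B = hta(rand(n),  part, [m m]);
    C = hta(zeros(n), part, [m m]);
    C = cannon(A, B, C);                     % tiled product of A and B, per the slide's code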
[Figure (slide 13, initialization animation): a 3x3 tiling with tiles A11..A33 and B11..B33. For i = 2 and i = 3, tile row i of A is rotated left by i-1 positions and tile column i of B is rotated up by i-1 positions, producing the skewed layout Cannon's algorithm starts from.]
[Figure (slide 14, iteration animation): tiles C11..C33 accumulate the products of the aligned A and B tiles for k = 1 and k = 2; after each step the tiles of A rotate one position left and the tiles of B one position up.]
TALK OVERVIEW (next: 4 EVALUATION)
NASA ADVANCED SUPERCOMPUTING (NAS) BENCHMARKS (table and text source: paper)

Table 1. Execution times in seconds for some of the NAS benchmark applications, Fortran+MPI versus MATLAB+HTA. The 1-processor row is the serial application in Fortran or MATLAB, without MPI or HTAs.

             EP (class C)       FT (class B)       CG (class C)       MG (class B)       LU (class B)
Nprocs    Fortran   Matlab   Fortran   Matlab   Fortran   Matlab   Fortran   Matlab   Fortran   Matlab
           +MPI      +HTA     +MPI      +HTA     +MPI      +HTA     +MPI      +HTA     +MPI      +HTA
   1       901.6   3556.9     136.8    657.4    3606.9   3812.0      26.9    828.0      15.7    245.1
   4       273.1    888.8     109.1    274.0     362.0   1750.9      17.0    273.8       6.3     60.5
   8       136.3    447.0      65.5    159.3     123.4    823.6       9.6    151.3       2.9     29.9
  16        68.6    224.8      37.2     87.2      89.5    375.2       4.8     87.0       1.2     16.0
  32        34.7    112.0      20.7     42.9      48.4    250.3       3.3     54.9       1.1      9.8
  64        17.1     56.7      10.4     24.0      44.5    148.0       1.6     50.4       1.3      7.1
 128         8.5     29.1       5.9     15.6      30.8    123.0       1.4     38.5       1.6      N/A
Excerpt from the paper (MATLAB implementation):

Since all data are replicated, the behavior in each processor is exactly the same as the behavior of the client would be, except that no communication is necessary to use data from the main thread in operations on distributed HTAs. On invocation of a method on a distributed HTA, each processor applies the corresponding operation to the tiles of the HTA it owns.

The incorporation of HTAs into MATLAB produced an explicitly parallel programming extension of MATLAB that integrates seamlessly with the language. Most other parallel MATLAB extensions either make use of extraneous primitives (MultiMATLAB [24]) or do not allow explicit parallel programming (Matlab*P [17]). Also, the incorporation of HTAs gives MATLAB a mechanism for accessing and operating on tiles that is much more powerful than what the language provides natively. The main disadvantage of the implementation is that the immense overhead of interpreted MATLAB limits the efficiency of many applications. The three main sources of this overhead are:

‣ Excessive creation of temporary variables. MATLAB creates temporaries to hold the partial results of expressions, which significantly slows down the programs.
‣ Frequent replication of data. MATLAB passes parameters by value, and assignment statements replicate the data.
‣ Interpretation of instructions. The overhead resulting from the interpretation of instructions is more pronounced when the computation relies mainly on scalar operations.

Table 1 presents the execution times for the Fortran+MPI and MATLAB+HTA implementations of most of the NAS benchmarks, run on a cluster of up to 128 processors. Each processor is a 3.2 GHz Intel Xeon connected through Gigabit Ethernet. The NAS benchmarks are version 3.1, compiled with the Intel ifort compiler, version 8.1, using the -O3 flag; MATLAB is version 7.0.1 (R14); the MPI library is MPI-LAM [6]. The execution time for 1 processor corresponds to the serial execution of the pure Fortran or MATLAB code without MPI or HTAs. Results in Table 1 correspond to the class C input for EP and CG, and class B for MG, FT and LU.

As can be seen in the table, for EP and FT the parallel MATLAB code takes advantage of parallelism, leading to execution times of the same magnitude as those of the Fortran+MPI code. For CG the parallel MATLAB does reasonably well, although not as well as the Fortran+MPI version, which obtains super-linear speedups when the number of processors is 64 or smaller. For MG and LU, however, the sequential MATLAB implementation was slow and, in the case of MG, the parallel MATLAB does not improve upon the serial Fortran version. Similarly, for BT (not shown) the serial MATLAB version runs so slowly that even the parallel version is not comparable with its sequential Fortran counterpart. Overall, for EP, FT and CG, where the sequential MATLAB version runs 1 to 5 times slower than the Fortran version, the parallel MATLAB implementation does reasonably well, improving upon the serial Fortran version; in these cases parallelism at least compensates for the interpretation overhead. On 128 processors the parallel MATLAB obtains speedups of 30.9, 8.8 and 29.3 over the sequential Fortran counterpart for EP, FT and CG, respectively.
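A quick check of those speedup figures (numbers taken from Table 1 above; speedup here means serial Fortran time divided by the 128-processor MATLAB+HTA time):

    % Cross-language speedups on 128 processors, from Table 1.
    t_serial_fortran = [901.6 136.8 3606.9];   % EP, FT, CG with 1 processor, Fortran
    t_matlab_hta_128 = [ 29.1  15.6  123.0];   % EP, FT, CG with 128 processors, Matlab+HTA
    speedup = t_serial_fortran ./ t_matlab_hta_128   % ~[31.0 8.8 29.3], matching the quoted figures up to rounding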
Excerpt from the paper (Section 4.2, the C++ implementation):

In the C++ implementation, HTAs are represented as composite objects with methods to operate on both distributed and non-distributed HTAs. As in the case of MATLAB, MPI is used for communication and, while the programming model is single threaded, HTA C++ programs execute in SPMD form. To facilitate programming, the C++ implementation enforces an allocation/deallocation policy through reference counting as follows: (1) HTAs are allocated on the heap through factory methods, which return a handle that is assigned to a (stack-allocated) variable. (2) All accesses to the HTA occur through this handle, which is small and typically passed by value across procedure boundaries. (3) Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory. This design permits sharing of sub-trees among HTAs and precludes deallocation errors. Moreover, the temporary arrays created, for instance, during the partial evaluation of expressions are handled through this mechanism and deleted automatically as early as possible.

Performance is one of the main goals of the C++ implementation. Methods were optimized and, whenever possible, specialized for specific cases. The user is also given control over the memory layout of non-distributed HTAs; in MATLAB the layout was in the hands of the system and the user had no way of influencing it. Finally, to enable efficient access to scalar components of HTAs, the implementation was organized to guarantee that hot methods are inlined. This strategy lets codes written with the library reach performance similar to that of traditional (non-HTA) implementations. For example, the code in Figure 13 of the paper implements the multiplication of two recursively tiled two-dimensional arrays, similar to the MATLAB code in its Figure 8; it declares the HTAs and creates them with a factory method that takes the complete tiling information for each HTA: the number of tiles in each dimension, the tile size, and the memory layout. The multiplication function is recursive.
Too many numbers!
EVALUATION: SPEEDUP PLOTS (128 3.2 GHz Intel Xeons, Gigabit Ethernet; each plot shows speedup factor vs. number of processors, 0-128, for Matlab+HTA and Fortran+MPI)

EP (embarrassingly parallel): LINEAR SPEEDUP; the sequential Matlab speed is 25% of the sequential Fortran speed (100%).
FT (fast Fourier transform): HTAs SCALE BETTER here; the sequential Matlab speed is 21% of the sequential Fortran speed (100%).
CG (conjugate gradient): Fortran+MPI shows SUPER-LINEAR SPEEDUP; the sequential Matlab speed is 95% of the sequential Fortran speed (100%).
MG (multigrid): HTAs are SLOW; the sequential Matlab speed is only 3% of the sequential Fortran speed (100%).
LU (LU factorization): SLOW AGAIN; the sequential Matlab speed is 6% of the sequential Fortran speed (100%); no Matlab+HTA data point for 128 processors.
PERFORMANCE OF C++ HTAs: MMM (matrix-matrix multiplication)

[Plot: MFLOPS vs. matrix size (504 to 4032) on an Intel Pentium 4, 3.0 GHz, 8 KB L1 cache, for six variants: Naive 3 loops, HTA naive, Tiled 6 loops, HTA+ATLAS, ATLAS, Intel MKL. Slide annotation: 8-13.5%.]
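For readers unfamiliar with the "Tiled 6 loops" variant, here is a minimal plain-MATLAB sketch of loop blocking (sizes are hypothetical; written in C, the innermost block product would itself be three loops, giving six loops in total):

    % Blocked (tiled) matrix-matrix multiplication: each b-by-b block of C is built
    % from b-by-b blocks of A and B that fit in cache.
    n = 504;  b = 56;                         % b must divide n
    A = rand(n);  B = rand(n);  C = zeros(n);
    for ii = 1:b:n
        for jj = 1:b:n
            for kk = 1:b:n
                C(ii:ii+b-1, jj:jj+b-1) = C(ii:ii+b-1, jj:jj+b-1) + ...
                    A(ii:ii+b-1, kk:kk+b-1) * B(kk:kk+b-1, jj:jj+b-1);
            end
        end
    end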
LINES OF CODE COMPARISON

[Bar chart (source: paper): lines of code (0 to 1200) for EP, CG, MG, FT and LU, one HTA bar and one MPI bar per benchmark, each broken down into Computation, Communication and Data Decomposition.]
TALK OVERVIEW (next: 5 CONCLUSIONS)
CONCLUSIONS

SCALABILITY, PORTABILITY, PRODUCTIVITY: HTAs
FURTHER INFORMATION

http://polaris.cs.uiuc.edu/hta/

THANKS FOR YOUR ATTENTION. Q&A: PUT YOUR QUESTIONS.