IBM Research
October 2003 | Languages and Compilers for Parallel Computing © 2003 IBM Corporation
PL/B: Programming for locality and large scale parallelism
George Almási, Luiz A. DeRose, José E. Moreira, David A. Padua
Overview
- Concepts
- Examples
- Thoughts about implementation
- Conclusions
PL/B at a glance
A programming system for distributed-memory machines:
- Focus: numerical computing
- Convenient to use: flat learning curve, short development cycle, easy debugging and maintenance
- Not too difficult to implement: no "heroic programming" needed in the compiler

Language extension:
- General in nature
- 1st implementation using MATLAB™

Programming model:
- Single thread of execution
- Explicit data layout and distribution: recursive tiling, data distribution primitives

Implementation:
- Master-slave model
Things we didn’t want to do
Not another programming language!

Avoid SPMD:
- Difficult to reason about: the global view of communication and computation is not explicit in the SPMD model ("4D spaghetti code")
- MPI is cumbersome: no compiler support; the "assembly language of parallel computing"

Avoid complex compilers (e.g., HPF)

Avoid OpenMP:
- Wrong abstraction for distributed-memory machines
- Could be implemented on top of TreadMarks™-like systems, but that is hard to make efficient, requires compiler support, and is untested and experimental
The Convergence of PL/B and MATLAB™
Technical simplicity: no compiler work needed for a prototype
Popularity: programmers of parallel machines are familiar with the MATLAB environment
Government interest: parallel MATLAB is part of the PERCS project
Evaluation strategy: IBM's BlueGene/L is an ideal testbed for scalability
Novelty: MATLAB™ is an excellent language for prototyping conventional algorithms; there is nothing equivalent for parallel algorithms
Hierarchically Tiled Arrays
HTAs are n-dimensional tiled arrays with tiles of dimension ≤ n; tiling is recursive.

Constructing HTAs:
- Bottom-up: impose an HTA shape onto a flat array; always homogeneous
- Top-down: structure first, contents later; may be non-homogeneous

MATLAB™ notation: similar to cell arrays { }

Homogeneity of HTAs:
- Adjacent tiles are "compatible" along dimensions of adjacency
- Not all tiles have to have the same shape

Tiles can be distributed across modules of a parallel system; distribution is always block-cyclic.
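The bottom-up construction above (imposing a tiling onto a flat array) can be sketched in Python/NumPy; the `tile` helper and its names are ours, not part of PL/B:

```python
import numpy as np

# Bottom-up HTA construction (illustrative sketch): impose a homogeneous
# tiling onto a flat 4x6 array, producing a 2x3 arrangement of 2x2 tiles.
def tile(flat, tile_rows, tile_cols):
    """Return a 4-D view: [tile_i, tile_j, row_in_tile, col_in_tile]."""
    R, C = flat.shape
    assert R % tile_rows == 0 and C % tile_cols == 0
    return (flat.reshape(R // tile_rows, tile_rows, C // tile_cols, tile_cols)
                .transpose(0, 2, 1, 3))

A = np.arange(24).reshape(4, 6)   # flat array
H = tile(A, 2, 2)                 # homogeneous 2x3 grid of 2x2 tiles

print(H.shape)    # (2, 3, 2, 2): tile grid first, tile contents last
print(H[1, 2])    # the bottom-right 2x2 tile, i.e. A[2:4, 4:6]
```

Flattening the view back recovers the original array, which mirrors the PL/B property that an HTA can always be accessed as if it were flat.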
Creating and Accessing HTAs

A = hta{1:2}{1:4,1:3}(1:3)

x1 = A{2}{4,3}(3)
x2 = A{:}{2:4,3}(1:2)
x3 = A{1}(1:4,1:6)
x4 = A(2,9:11)        % "flattened" access
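The distinction between tiled { } access and "flattened" ( ) access can be illustrated in Python/NumPy (0-based indexing here, whereas PL/B is 1-based); the arrangement below is our own example, not the slide's:

```python
import numpy as np

# Tiled vs flattened access on a 10x12 array viewed as 5x6 grid of 2x2 tiles.
A = np.arange(120).reshape(10, 12)               # flat array
H = A.reshape(5, 2, 6, 2).transpose(0, 2, 1, 3)  # [tile_i, tile_j, row, col]

# PL/B-style tiled access: tile (1,2), element (0,1) within the tile...
tiled = H[1, 2][0, 1]
# ...and the same element through "flattened" access on the underlying array:
flat = A[2 * 1 + 0, 2 * 2 + 1]

print(tiled == flat)   # True
```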
Distributing HTAs across processors: a 3x3 mesh of processors, 15x12 array

- Blocked: HTA shape {1:3,1:3}(1:5,1:4)
- Block-cyclic in the 2nd dimension: HTA shape {1:3,1:6}(1:5,1:2)
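The two layouts differ only in which mesh column owns a given tile column. A minimal owner-computation sketch (function names and the mesh size from the slide; the code itself is illustrative):

```python
# Which mesh processor owns tile (ti, tj)? 0-based indices, 3x3 mesh.
def owner_blocked(ti, tj):
    # {1:3,1:3}(1:5,1:4): one 5x4 tile per processor
    return (ti, tj)

def owner_block_cyclic(ti, tj, mesh_cols=3):
    # {1:3,1:6}(1:5,1:2): 6 tile columns dealt cyclically to 3 mesh columns
    return (ti, tj % mesh_cols)

print(owner_block_cyclic(0, 4))   # tile column 4 lands on mesh column 1
```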
Summary: Parallel Communication and Computation in PL/B
PL/B programs are single-threaded and contain array operations on HTAs.
The host running PL/B is a front end for a distributed machine.
Processors are arranged in hierarchical meshes.
The top levels of HTAs are distributed onto a subset of the existing nodes.

Computation statements: all HTA indices refer to the same (local) physical processor. In particular, when all HTA indices are identical, computations are guaranteed to be local.

Communication: all other statements. Some functions and operators encode both communication and computation, typically MPI-like collective operations.
Overview
- Concepts
- Examples
- Thoughts about 1st implementation
- Conclusions
Tiled Matrix Multiplication
for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            c(i,j)=c(i,j)+a(i,k)*b(k,j);
          end
        end
      end
    end
  end
end
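The tiled loop nest can be transcribed to Python/NumPy (0-based indexing) to check that it computes an ordinary matrix product; the array names follow the slide, the transcription is ours:

```python
import numpy as np

# Tiled matrix multiplication with scalar inner loops, q must divide n.
n, q = 6, 2
a = np.arange(n * n, dtype=float).reshape(n, n)
b = np.arange(n * n, dtype=float).reshape(n, n) + 1.0
c = np.zeros((n, n))

for I in range(0, n, q):
    for J in range(0, n, q):
        for K in range(0, n, q):
            for i in range(I, I + q):
                for j in range(J, J + q):
                    for k in range(K, K + q):
                        c[i, j] += a[i, k] * b[k, j]

print(np.allclose(c, a @ b))   # True: same result as a plain product
```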
Tiled Matrix Multiplication (PL/B)
c{i,j}, a{i,k}, b{k,j}, and T represent HTA tiles (submatrices).
The * operator represents matrix multiplication on HTA tiles.
for i=1:m
  for j=1:m
    T=0;
    for k=1:m
      T=T+a{i,k}*b{k,j};
    end
    c{i,j}=T;
  end
end
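The same computation at tile granularity, where each c{i,j}, a{i,k}, b{k,j} is a q x q submatrix and * is a whole-tile matrix multiply, can be sketched in Python/NumPy (the `tile_of` helper is ours):

```python
import numpy as np

# Tile-granularity matrix multiplication: T accumulates tile products.
n, q = 6, 2
m = n // q
a = np.random.default_rng(0).random((n, n))
b = np.random.default_rng(1).random((n, n))
c = np.zeros((n, n))

def tile_of(M, i, j):
    """View of tile (i, j) of M; illustrative stand-in for PL/B's M{i,j}."""
    return M[i*q:(i+1)*q, j*q:(j+1)*q]

for i in range(m):
    for j in range(m):
        T = np.zeros((q, q))
        for k in range(m):
            T += tile_of(a, i, k) @ tile_of(b, k, j)   # tile-by-tile multiply
        tile_of(c, i, j)[:] = T

print(np.allclose(c, a @ b))   # True
```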
Cannon’s Algorithm (parallel, tiled matrix multiplication)
Cannon's Algorithm written down in PL/B

function [c] = cannon(a,b)
  % a, b are assumed to be distributed on an n*n grid.
  % create an n*n distributed hta for matrix c.
  c{1:n,1:n} = zeros(p,p);                        % communication
  % "parallelogram shift" rows of a, columns of b
  for i=2:n
    a{i:n,:} = cshift(a{i:n,:},dim=2,shift=1);    % communication
    b{:,i:n} = cshift(b{:,i:n},dim=1,shift=1);    % communication
  end
  % main loop: parallel multiplications, column shift a, row shift b
  for k=1:n
    c{:,:} = c{:,:}+a{:,:}*b{:,:};                % computation
    a{:,:} = cshift(a{:,:},dim=2,shift=1);        % communication
    b{:,:} = cshift(b{:,:},dim=1,shift=1);        % communication
  end
end
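Cannon's algorithm can be simulated sequentially in Python/NumPy, with tiles held in a 4-D array and cshift played by np.roll along a grid axis; this is our own verification sketch, not PL/B code:

```python
import numpy as np

# Cannon's algorithm on a 3x3 grid of 2x2 tiles, simulated sequentially.
n, p = 3, 2
rng = np.random.default_rng(42)
A = rng.random((n, n, p, p))   # A[i, j] is the p x p tile at grid (i, j)
B = rng.random((n, n, p, p))
C = np.zeros((n, n, p, p))

def flat(T):
    """Reassemble a grid of tiles into an ordinary matrix."""
    return T.transpose(0, 2, 1, 3).reshape(n * p, n * p)

ref = flat(A) @ flat(B)        # reference answer, before any shifting

# Initial skew ("parallelogram shift"): row i of A left by i,
# column j of B up by j.
for i in range(n):
    A[i] = np.roll(A[i], -i, axis=0)
for j in range(n):
    B[:, j] = np.roll(B[:, j], -j, axis=0)

# Main loop: multiply resident tiles, then shift A left and B up by one.
for _ in range(n):
    C += np.einsum('ijab,ijbc->ijac', A, B)   # all tile multiplies at once
    A = np.roll(A, -1, axis=1)
    B = np.roll(B, -1, axis=0)

print(np.allclose(flat(C), ref))   # True
```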
Sparse Parallel Matrix-Vector Multiply with vector copy
[Figure: A × b on four processors P1-P4; A is distributed by row blocks, b is copied to every processor.]
% Distribute a
forall i=1:n
  c{i} = a(DIST(i):DIST(i+1)-1,:);
end

% Broadcast vector b
v{1:n} = b;

% Local multiply (sparse)
t{:} = c{:} * v{:};

% Everybody gets a copy of the result
forall i=1:n
  v{i} = t(:);   % flattened t
end
Sparse Parallel MVM with vector copy
Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
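The distributed matrix-vector product above can be simulated in Python/NumPy, with each "processor" holding one row block of A and a private copy of b; DIST marks the block boundaries as in the PL/B code, and this transcription is ours:

```python
import numpy as np

# Distributed MVM with vector copy, simulated on n = 4 "processors".
n = 4
A = np.arange(32, dtype=float).reshape(8, 4)
b = np.ones(4)
DIST = [0, 2, 4, 6, 8]   # processor i owns rows DIST[i]:DIST[i+1]

c = [A[DIST[i]:DIST[i+1], :] for i in range(n)]   # distribute A
v = [b.copy() for _ in range(n)]                  # broadcast b
t = [c[i] @ v[i] for i in range(n)]               # local multiplies
result = np.concatenate(t)                        # flattened t
v = [result.copy() for _ in range(n)]             # everybody gets a copy

print(np.allclose(result, A @ b))   # True
```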
Overview
- Concepts
- Examples
- Thoughts about implementation
- Conclusions
Implementation
A "Distributed Array Virtual Machine" (DAVM) implemented on the backend nodes:
- Multiple types of memory (local, shared, co-arrays, etc.)
- Similar to the UPC and OpenMP runtimes
- DAVM instruction set (bytecode?)

A MATLAB™-based frontend:
- The MATLAB interpreter runs the show
- HTA code can be compiled into DAVM code and distributed to the backend
- A MATLAB "toolbox" contains the new data types
- Possible changes to MATLAB syntax: as few as we can get away with (e.g., forall)
Implementation: MATLAB™-based frontend
The @hta class directory in MATLAB provides:
- constructors: hta
- indexing: subsref, subsasgn
- collectives: sum, spread, tile, cshift
- operators: *, /, \
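The frontend works by overloading indexing on a new class, which is what subsref/subsasgn do for @hta in MATLAB. A toy Python analogue of the same idea, using __getitem__/__setitem__ (the class and its names are purely illustrative):

```python
import numpy as np

# Toy analogue of the @hta class: tiled indexing overloaded on a wrapper
# around a flat array, so tiles can be read and assigned as units.
class Hta:
    def __init__(self, flat, q):
        self.data = np.asarray(flat, dtype=float)   # the flat storage
        self.q = q                                   # square tile size

    def __getitem__(self, ij):
        """H[i, j] -> view of tile (i, j); stands in for subsref."""
        i, j = ij
        q = self.q
        return self.data[i*q:(i+1)*q, j*q:(j+1)*q]

    def __setitem__(self, ij, value):
        """H[i, j] = v assigns a whole tile; stands in for subsasgn."""
        self[ij][:] = value

H = Hta(np.zeros((4, 4)), q=2)
H[0, 1] = 7                     # assign the whole (0, 1) tile
print(H.data[0, 2])             # 7.0: the tile write landed in flat storage
```

Because indexing is the only interface, existing array-style code can run unchanged on the wrapped type, which is the drop-in-replacement property the slides rely on.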
Anticipating questions:
Q: Is PL/B a toy language?
A: It is as expressive as SPMD. It subsumes a large part of MPI:
- a{1} = b{2} is a message sent from rank 2 to rank 1
- x = sum(a{:}) corresponds to MPI_Reduce
- x{:} = a corresponds to MPI_Bcast
Many important algorithms can be formulated "better".

Q: Is PL/B still MATLAB™ or a new beast?
A: PL/B defines a new data type and operators. MATLAB is a polymorphic language:
- The new data type is compatible with (a drop-in replacement for) existing data types
- New data types bring new functionality
- Think "toolbox": MATLAB users are familiar with the concept
Porting code to PL/B: changes are going to be fairly localized, and the code will keep working during the transition.
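The MPI correspondences above can be simulated with per-rank tiles held in a plain list (rank r owns a[r]); this sketch is ours and only illustrates the data movement each statement implies:

```python
# Simulated ranks: rank r owns a[r]. All "communication" is list access.
a = [10, 20, 30, 40]
b = list(a)

a[0] = b[1]        # a{1} = b{2}: a message sent from rank 2 to rank 1

x = sum(a)         # x = sum(a{:})  ~ MPI_Reduce
a = [x] * len(a)   # x{:} = a       ~ MPI_Bcast

print(a)           # every rank now holds the reduced value
```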
More questions
Q: Debugging and profiling PL/B?
A: Debugging PL/B should not be different from debugging a regular MATLAB program.

Q: Performance?
A: PL/B has a better chance of scaling than a regular MPI program:
- Most communication primitives are high-level and are going to be optimized
- Writing low-level communication code in PL/B is possible, but not a natural thing to do
- Implementation is easy for most primitives (on top of MPI)
Conclusion
A new and exciting paradigm:
- HTA arrays and operators express communication and computation
- Single-threaded code
- Master-slave execution model

We anticipate scalability, with minimal to no compiler work needed.

About to embark on the 1st implementation:
- Runtime (the Distributed Array Virtual Machine)
- Interpreted front end using (unchanged) MATLAB™