IBM Research
October 2003 | Languages and Compilers for Parallel Computing © 2003 IBM Corporation
PL/B: Programming for locality and large scale parallelism
George Almási, Luiz A. DeRose, José E. Moreira, David A. Padua
Overview
- Concepts
- Examples
- Thoughts about implementation
- Conclusions
PL/B at a glance
A programming system for distributed-memory machines:
- Focus: numerical computing
- Convenient to use: flat learning curve, short development cycle, easy debugging and maintenance
- Not too difficult to implement: no "heroic programming" needed in the compiler

Language extension:
- General in nature
- 1st implementation using MATLAB™

Programming model:
- Single thread of execution
- Explicit data layout and distribution: recursive tiling, data distribution primitives

Implementation:
- Master-slave model
Things we didn’t want to do
Not another programming language!

Avoid SPMD:
- Difficult to reason about: the global view of communication and computation is not explicit in the SPMD model ("4D spaghetti code")
- MPI is cumbersome: no compiler support; the "assembly language of parallel computing"

Avoid complex compilers (e.g., HPF)

Avoid OpenMP:
- Wrong abstraction for distributed-memory machines
- Could be implemented on top of TreadMarks™-like systems, but that is hard to make efficient, requires compiler support, and is untested and experimental
The Convergence of PL/B and MATLAB™
Technical simplicity: no compiler work needed for a prototype
Popularity: programmers of parallel machines are familiar with the MATLAB environment
Government interest: parallel MATLAB is part of the PERCS project
Evaluation strategy: IBM's BlueGene/L is an ideal testbed for scalability
Novelty: MATLAB™ is an excellent language for prototyping conventional algorithms; there is nothing equivalent for parallel algorithms
Hierarchically Tiled Arrays
HTAs are n-dimensional tiled arrays with tiles of dimension ≤ n; tiling is recursive.

Constructing HTAs:
- Bottom-up: impose an HTA shape onto a flat array; always homogeneous
- Top-down: structure first, contents later; may be non-homogeneous

MATLAB™ notation: similar to cell arrays { }

Homogeneity of HTAs:
- Adjacent tiles are "compatible" along dimensions of adjacency
- Not all tiles have to have the same shape

Tiles can be distributed across modules of a parallel system; distribution is always block-cyclic.
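The bottom-up construction above (imposing a tiling onto a flat array) can be sketched in Python/NumPy; the `tile` helper and its names are ours, not part of PL/B:

```python
import numpy as np

# Bottom-up HTA construction (illustrative sketch): impose a homogeneous
# tiling onto a flat 4x6 array, producing a 2x3 arrangement of 2x2 tiles.
def tile(flat, tile_rows, tile_cols):
    """Return a 4-D view: [tile_i, tile_j, row_in_tile, col_in_tile]."""
    R, C = flat.shape
    assert R % tile_rows == 0 and C % tile_cols == 0
    return (flat.reshape(R // tile_rows, tile_rows, C // tile_cols, tile_cols)
                .transpose(0, 2, 1, 3))

A = np.arange(24).reshape(4, 6)   # flat array
H = tile(A, 2, 2)                 # homogeneous 2x3 grid of 2x2 tiles

print(H.shape)    # (2, 3, 2, 2): tile grid first, tile contents last
print(H[1, 2])    # the bottom-right 2x2 tile, i.e. A[2:4, 4:6]
```

Flattening the view back recovers the original array, which mirrors the PL/B property that an HTA can always be accessed as if it were flat.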
Creating and Accessing HTAs

A = hta{1:2}{1:4,1:3}(1:3)

x1 = A{2}{4,3}(3)
x2 = A{:}{2:4,3}(1:2)
x3 = A{1}(1:4,1:6)
x4 = A(2,9:11)        % "flattened" access
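The distinction between tiled { } access and "flattened" ( ) access can be illustrated in Python/NumPy (0-based indexing here, whereas PL/B is 1-based); the arrangement below is our own example, not the slide's:

```python
import numpy as np

# Tiled vs flattened access on a 10x12 array viewed as 5x6 grid of 2x2 tiles.
A = np.arange(120).reshape(10, 12)               # flat array
H = A.reshape(5, 2, 6, 2).transpose(0, 2, 1, 3)  # [tile_i, tile_j, row, col]

# PL/B-style tiled access: tile (1,2), element (0,1) within the tile...
tiled = H[1, 2][0, 1]
# ...and the same element through "flattened" access on the underlying array:
flat = A[2 * 1 + 0, 2 * 2 + 1]

print(tiled == flat)   # True
```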
Distributing HTAs across processors: a 3x3 mesh of processors, 15x12 array

- Blocked: HTA shape {1:3,1:3}(1:5,1:4)
- Block-cyclic in the 2nd dimension: HTA shape {1:3,1:6}(1:5,1:2)
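The two layouts differ only in which mesh column owns a given tile column. A minimal owner-computation sketch (function names and the mesh size from the slide; the code itself is illustrative):

```python
# Which mesh processor owns tile (ti, tj)? 0-based indices, 3x3 mesh.
def owner_blocked(ti, tj):
    # {1:3,1:3}(1:5,1:4): one 5x4 tile per processor
    return (ti, tj)

def owner_block_cyclic(ti, tj, mesh_cols=3):
    # {1:3,1:6}(1:5,1:2): 6 tile columns dealt cyclically to 3 mesh columns
    return (ti, tj % mesh_cols)

print(owner_block_cyclic(0, 4))   # tile column 4 lands on mesh column 1
```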
Summary: Parallel Communication and Computation in PL/B
PL/B programs are single-threaded and contain array operations on HTAs.
The host running PL/B is a front end for a distributed machine.
Processors are arranged in hierarchical meshes.
The top levels of HTAs are distributed onto a subset of the existing nodes.

Computation statements: all HTA indices refer to the same (local) physical processor. In particular, when all HTA indices are identical, computations are guaranteed to be local.

Communication: all other statements. Some functions and operators encode both communication and computation, typically MPI-like collective operations.
Overview
- Concepts
- Examples
- Thoughts about 1st implementation
- Conclusions
Tiled Matrix Multiplication
for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            c(i,j)=c(i,j)+a(i,k)*b(k,j);
          end
        end
      end
    end
  end
end
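The tiled loop nest can be transcribed to Python/NumPy (0-based indexing) to check that it computes an ordinary matrix product; the array names follow the slide, the transcription is ours:

```python
import numpy as np

# Tiled matrix multiplication with scalar inner loops, q must divide n.
n, q = 6, 2
a = np.arange(n * n, dtype=float).reshape(n, n)
b = np.arange(n * n, dtype=float).reshape(n, n) + 1.0
c = np.zeros((n, n))

for I in range(0, n, q):
    for J in range(0, n, q):
        for K in range(0, n, q):
            for i in range(I, I + q):
                for j in range(J, J + q):
                    for k in range(K, K + q):
                        c[i, j] += a[i, k] * b[k, j]

print(np.allclose(c, a @ b))   # True: same result as a plain product
```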
Tiled Matrix Multiplication (PL/B)
c{i,j}, a{i,k}, b{k,j}, and T represent HTA tiles (submatrices).
The * operator represents matrix multiplication on HTA tiles.
for i=1:m
  for j=1:m
    T=0;
    for k=1:m
      T=T+a{i,k}*b{k,j};
    end
    c{i,j}=T;
  end
end
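The same computation at tile granularity, where each c{i,j}, a{i,k}, b{k,j} is a q x q submatrix and * is a whole-tile matrix multiply, can be sketched in Python/NumPy (the `tile_of` helper is ours):

```python
import numpy as np

# Tile-granularity matrix multiplication: T accumulates tile products.
n, q = 6, 2
m = n // q
a = np.random.default_rng(0).random((n, n))
b = np.random.default_rng(1).random((n, n))
c = np.zeros((n, n))

def tile_of(M, i, j):
    """View of tile (i, j) of M; illustrative stand-in for PL/B's M{i,j}."""
    return M[i*q:(i+1)*q, j*q:(j+1)*q]

for i in range(m):
    for j in range(m):
        T = np.zeros((q, q))
        for k in range(m):
            T += tile_of(a, i, k) @ tile_of(b, k, j)   # tile-by-tile multiply
        tile_of(c, i, j)[:] = T

print(np.allclose(c, a @ b))   # True
```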
Cannon’s Algorithm (parallel, tiled matrix multiplication)
Cannon's Algorithm written down in PL/B

function [c] = cannon(a,b)
  % a, b are assumed to be distributed on an n*n grid.
  % create an n*n distributed hta for matrix c.
  c{1:n,1:n} = zeros(p,p);                        % communication
  % "parallelogram shift" rows of a, columns of b
  for i=2:n
    a{i:n,:} = cshift(a{i:n,:},dim=2,shift=1);    % communication
    b{:,i:n} = cshift(b{:,i:n},dim=1,shift=1);    % communication
  end
  % main loop: parallel multiplications, column shift a, row shift b
  for k=1:n
    c{:,:} = c{:,:}+a{:,:}*b{:,:};                % computation
    a{:,:} = cshift(a{:,:},dim=2,shift=1);        % communication
    b{:,:} = cshift(b{:,:},dim=1,shift=1);        % communication
  end
end
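Cannon's algorithm can be simulated sequentially in Python/NumPy, with tiles held in a 4-D array and cshift played by np.roll along a grid axis; this is our own verification sketch, not PL/B code:

```python
import numpy as np

# Cannon's algorithm on a 3x3 grid of 2x2 tiles, simulated sequentially.
n, p = 3, 2
rng = np.random.default_rng(42)
A = rng.random((n, n, p, p))   # A[i, j] is the p x p tile at grid (i, j)
B = rng.random((n, n, p, p))
C = np.zeros((n, n, p, p))

def flat(T):
    """Reassemble a grid of tiles into an ordinary matrix."""
    return T.transpose(0, 2, 1, 3).reshape(n * p, n * p)

ref = flat(A) @ flat(B)        # reference answer, before any shifting

# Initial skew ("parallelogram shift"): row i of A left by i,
# column j of B up by j.
for i in range(n):
    A[i] = np.roll(A[i], -i, axis=0)
for j in range(n):
    B[:, j] = np.roll(B[:, j], -j, axis=0)

# Main loop: multiply resident tiles, then shift A left and B up by one.
for _ in range(n):
    C += np.einsum('ijab,ijbc->ijac', A, B)   # all tile multiplies at once
    A = np.roll(A, -1, axis=1)
    B = np.roll(B, -1, axis=0)

print(np.allclose(flat(C), ref))   # True
```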
Sparse Parallel Matrix-Vector Multiply with vector copy
[Figure: A × b on four processors P1-P4; A is distributed by row blocks, b is copied to every processor.]
% Distribute a
forall i=1:n
  c{i} = a(DIST(i):DIST(i+1)-1,:);
end

% Broadcast vector b
v{1:n} = b;

% Local multiply (sparse)
t{:} = c{:} * v{:};

% Everybody gets a copy of the result
forall i=1:n
  v{i} = t(:);   % flattened t
end
Sparse Parallel MVM with vector copy
Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
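The distributed matrix-vector product above can be simulated in Python/NumPy, with each "processor" holding one row block of A and a private copy of b; DIST marks the block boundaries as in the PL/B code, and this transcription is ours:

```python
import numpy as np

# Distributed MVM with vector copy, simulated on n = 4 "processors".
n = 4
A = np.arange(32, dtype=float).reshape(8, 4)
b = np.ones(4)
DIST = [0, 2, 4, 6, 8]   # processor i owns rows DIST[i]:DIST[i+1]

c = [A[DIST[i]:DIST[i+1], :] for i in range(n)]   # distribute A
v = [b.copy() for _ in range(n)]                  # broadcast b
t = [c[i] @ v[i] for i in range(n)]               # local multiplies
result = np.concatenate(t)                        # flattened t
v = [result.copy() for _ in range(n)]             # everybody gets a copy

print(np.allclose(result, A @ b))   # True
```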
Overview
- Concepts
- Examples
- Thoughts about implementation
- Conclusions
Implementation
A "Distributed Array Virtual Machine" (DAVM) implemented on the backend nodes:
- Multiple types of memory (local, shared, co-arrays, etc.)
- Similar to the UPC and OpenMP runtimes
- DAVM instruction set (bytecode?)

A MATLAB™-based frontend:
- The MATLAB interpreter runs the show
- HTA code can be compiled into DAVM code and distributed to the backend
- A MATLAB "toolbox" contains the new data types
- Possible changes to MATLAB syntax: as few as we can get away with (e.g., forall)
Implementation: MATLAB™-based frontend
The @hta class directory in MATLAB provides:
- constructors: hta
- indexing: subsref, subsasgn
- collectives: sum, spread, tile, cshift
- operators: *, /, \
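The frontend works by overloading indexing on a new class, which is what subsref/subsasgn do for @hta in MATLAB. A toy Python analogue of the same idea, using __getitem__/__setitem__ (the class and its names are purely illustrative):

```python
import numpy as np

# Toy analogue of the @hta class: tiled indexing overloaded on a wrapper
# around a flat array, so tiles can be read and assigned as units.
class Hta:
    def __init__(self, flat, q):
        self.data = np.asarray(flat, dtype=float)   # the flat storage
        self.q = q                                   # square tile size

    def __getitem__(self, ij):
        """H[i, j] -> view of tile (i, j); stands in for subsref."""
        i, j = ij
        q = self.q
        return self.data[i*q:(i+1)*q, j*q:(j+1)*q]

    def __setitem__(self, ij, value):
        """H[i, j] = v assigns a whole tile; stands in for subsasgn."""
        self[ij][:] = value

H = Hta(np.zeros((4, 4)), q=2)
H[0, 1] = 7                     # assign the whole (0, 1) tile
print(H.data[0, 2])             # 7.0: the tile write landed in flat storage
```

Because indexing is the only interface, existing array-style code can run unchanged on the wrapped type, which is the drop-in-replacement property the slides rely on.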
Anticipating questions:
Q: Is PL/B a toy language?
A: It is as expressive as SPMD. It subsumes a large part of MPI:
- a{1} = b{2} is a message sent from rank 2 to rank 1
- x = sum(a{:}) corresponds to MPI_Reduce
- x{:} = a corresponds to MPI_Bcast
Many important algorithms can be formulated "better".

Q: Is PL/B still MATLAB™ or a new beast?
A: PL/B defines a new data type and operators. MATLAB is a polymorphic language:
- The new data type is compatible with (a drop-in replacement for) existing data types
- New data types bring new functionality
- Think "toolbox": MATLAB users are familiar with the concept
Porting code to PL/B: changes are going to be fairly localized, and the code will keep working during the transition.
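The MPI correspondences above can be simulated with per-rank tiles held in a plain list (rank r owns a[r]); this sketch is ours and only illustrates the data movement each statement implies:

```python
# Simulated ranks: rank r owns a[r]. All "communication" is list access.
a = [10, 20, 30, 40]
b = list(a)

a[0] = b[1]        # a{1} = b{2}: a message sent from rank 2 to rank 1

x = sum(a)         # x = sum(a{:})  ~ MPI_Reduce
a = [x] * len(a)   # x{:} = a       ~ MPI_Bcast

print(a)           # every rank now holds the reduced value
```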
More questions
Q: Debugging and profiling PL/B?
A: Debugging PL/B should not be different from debugging a regular MATLAB program.

Q: Performance?
A: PL/B has a better chance of scaling than a regular MPI program:
- Most communication primitives are high-level and are going to be optimized
- Writing low-level communication code in PL/B is possible, but not a natural thing to do
- Implementation is easy for most primitives (on top of MPI)
Conclusion
A new and exciting paradigm:
- HTA arrays and operators express communication and computation
- Single-threaded code
- Master-slave execution model

We anticipate scalability, with minimal to no compiler work needed.

About to embark on the 1st implementation:
- Runtime (the Distributed Array Virtual Machine)
- Interpreted front end using (unchanged) MATLAB™