Optimization of Java-Like Languages for Parallel and Distributed Environments


U.C. Berkeley Computer Science Division

Kathy Yelick

http://www.cs.berkeley.edu/~yelick/talks.html

Katherine Yelick, Computer Science Division, EECS, University of California, Berkeley

What this tutorial is about
• Language and compiler support for:
  • Performance
  • Programmability
  • Scalability
  • Portability
• Some of this is specific to the Java language (not the JVM), but much of it applies to other parallel languages


Titanium
Titanium will be used as an example
• Based on Java
  • Has Java’s syntax, safety, memory management, etc.
  • Replaces Java’s thread model with static threads (SPMD)
  • Other extensions for performance and parallelism
• Optimizing compiler
  • Compiles to C (and from there to executable)
  • Synchronization analysis
  • Various optimizations
• Portable
  • Runs on uniprocessors, shared memory, and clusters


Organization
• Can we use Java for high performance on
  • 1 processor machines?
    • Java commercial compilers on some scientific applications
    • Java the language, compiled to native code (via C)
    • Extensions of Java to improve performance
  • 10-100 processor machines?
  • 1K-10K processor machines?
  • 100K-1M processor machines?


SciMark Benchmark

• Numerical benchmark for Java, C/C++
• Five kernels:
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • Dense LU factorization
• Results are reported in Mflops
• Download and run on your machine from:
  • http://math.nist.gov/scimark2
  • C and Java sources also provided

Roldan Pozo, NIST, http://math.nist.gov/~Rpozo






SciMark: Java vs. C (Sun UltraSPARC 60)

[Figure: bar chart of Mflops (axis 0 to 90) for the FFT, SOR, MC, Sparse, and LU kernels, comparing C and Java.]

* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7

Roldan Pozo, NIST, http://math.nist.gov/~Rpozo


SciMark: Java vs. C (Intel PIII 500MHz, Win98)

[Figure: bar chart of Mflops (axis 0 to 120) for the FFT, SOR, MC, Sparse, and LU kernels, comparing C and Java.]

* Sun JDK 1.2, javac -O; Microsoft VC++ 5.0, cl -O; Win98

Roldan Pozo, NIST, http://math.nist.gov/~Rpozo


Can we do better without the JVM?
• Pure Java with a JVM (and JIT)
  • Within 2x of C and sometimes better
  • OK for many users, even those using high end machines
  • Depends on quality of both compilers
• We can try to do better using a traditional compilation model
  • E.g., Titanium compiler at Berkeley
    • Compiles Java extension to C
    • Does not optimize Java arrays or for loops (prototype)


Java Compiled by Titanium Compiler
Performance on a Pentium IV (1.5GHz)

[Figure: bar chart of Mflops (axis 0 to 450) for Overall, FFT, SOR, MC, Sparse, and LU, comparing java, C (gcc -O6), Ti, and Ti -nobc.]


Java Compiled by Titanium Compiler
Performance on a Sun Ultra 4

[Figure: bar chart of Mflops (axis 0 to 70) for Overall, FFT, SOR, MC, Sparse, and LU, comparing Java, C, Ti, and Ti -nobc.]


Language Support for Performance
• Multidimensional arrays
  • Contiguous storage
  • Support for sub-array operations without copying
• Support for small objects
  • E.g., complex numbers
  • Called “immutables” in Titanium
  • Sometimes called “value” classes
• Unordered loop construct
  • Programmer specifies that iterations are independent
  • Eliminates need for dependence analysis – short term solution? Used by vectorizing compilers.
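The "small objects" bullet can be illustrated in plain Java. This is only a sketch: the class name `Complex` and its methods are hypothetical, and standard Java has no `immutable` keyword, so the example merely approximates a value class with final fields. A Titanium immutable could additionally be stored inline without a heap allocation, which plain Java does not promise.

```java
// Hypothetical value-style class (not Titanium syntax): all fields are final,
// so an instance never changes after construction.
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Each operation returns a fresh object instead of mutating an operand.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    public static void main(String[] args) {
        Complex p = new Complex(1.0, 2.0).times(new Complex(3.0, -1.0));
        System.out.println(p.re + " + " + p.im + "i");
    }
}
```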


HPJ Compiler from IBM
• HPJ Compiler from IBM Research
  • Moreira et al.
• Program using Array classes which use contiguous storage
  • e.g. A[i][j] becomes A.get(i,j)
  • No new syntax (worse for programming, but better portability – any Java compiler can be used)
• Compiler for IBM machines, exploits hardware
  • e.g., Fused Multiply-Add
• Result: 85+% of Fortran on RS/6000
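A rough sketch of the "Array classes with contiguous storage" idea follows. The class `DoubleArray2D` and its methods are hypothetical and are not the IBM Array package API; the sketch only shows how `A.get(i,j)` can index one flat row-major array instead of following a pointer per row.

```java
// Hypothetical 2D array class backed by a single contiguous double[].
final class DoubleArray2D {
    private final double[] data;
    private final int rows;
    private final int cols;

    DoubleArray2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    // Row-major addressing: element (i, j) lives at offset i * cols + j.
    double get(int i, int j) {
        return data[i * cols + j];
    }

    void set(int i, int j, double v) {
        data[i * cols + j] = v;
    }

    // Sum of the diagonal, written against get() as user code would be.
    double trace() {
        double t = 0.0;
        for (int i = 0; i < Math.min(rows, cols); i++) {
            t += get(i, i);
        }
        return t;
    }

    public static void main(String[] args) {
        DoubleArray2D a = new DoubleArray2D(2, 2);
        a.set(0, 0, 1.5);
        a.set(1, 1, 2.5);
        System.out.println(a.trace());
    }
}
```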


Java vs. Fortran Performance

[Figure: bar chart of Mflops (axis 0 to 250).]

* IBM RS/6000 67MHz POWER2 (266 Mflops peak), AIX Fortran, HPJC


Organization
• Can we use Java for high performance on
  • 1 processor machines?
  • 10-100 processor machines?
    • A correctness model
    • Cycle detection for reordering analysis
    • Synchronization analysis
  • 1K-10K processor machines?
  • 100K-1M processor machines?


Parallel Programming
Parallel programming models and languages are distinguished primarily by:
1. How parallel processes/threads are created
  • Statically at program startup time
    • The SPMD model, 1 thread per processor
  • Dynamically during program execution
    • Through fork statements or other features
2. How the parallel threads communicate
  • Through message passing (send/receive)
  • By reading and writing to shared memory

Implicit parallelism not included here

Java


Two Problems
• Compiler writers would like to move code around
• The hardware folks also want to build hardware that dynamically moves operations around
• When is reordering correct?
  • Because the programs are parallel, there are more restrictions, not fewer
  • The reason is that we have to preserve semantics of what may be viewed by other processors


Sequential Consistency
• Given a set of executions from n processors, each defines a total order P_i.
• The program order is the partial order given by the union of these P_i’s.
• The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order.

[Figure: example accesses with the values read: write x = 1, read y → 0; write y = 3, read z → 2; read x → 1, read y → 3. When this is serialized, the read and write semantics must be preserved.]


Sequential Consistency Intuition
• Sequential consistency says that:
  • The compiler may only reorder operations if another processor cannot observe it.
  • Writes (to variables that are later read) cannot result in garbage values being written.
  • The program behaves as if processors take turns executing instructions
• Comments:
  • In a legal execution, there will typically be many possible total orders – limited only by the reads and writes to shared variables
  • This is what you get if all reads and writes go to a single shared memory, and accesses are serialized at each memory cell


How Can Sequential Consistency Fail?
• The compiler saves a value in a register across multiple read accesses
  • This “moves” the later reads to the point of the first one
• The compiler saves a value in a register across writes
  • This “moves” the write until the register is written back from the standpoint of other processors.
• The compiler performs common subexpression elimination
  • As if the later expression reads are all moved to the first
  • Once contiguous in the instruction stream, they are merged
• The compiler performs other code motion
• The hardware has a write buffer
  • Reads may by-pass writes in the buffer (to/from different variables)
  • Some write buffers are not FIFO
• The hardware may have out-of-order execution


Weaker Correctness Models
• Many systems use weaker memory models:
  • Sun has TSO, PSO, and RMO
  • Alpha has its own model
• Some languages do as well
  • Java also has its own, currently undergoing redesign
  • C spec is mostly silent on threads – very weak on memory mapped I/O
• These are variants on the following, sequential consistency under proper synchronization:
  • All accesses to shared data must be protected by a lock, which must be a primitive known to the system
  • Otherwise, all bets are off (extreme)


Why Don’t Programmers Care?
• If these popular languages have used these weak models successfully, then what is wrong?
  • They don’t worry about what they don’t understand
  • Many people use compilers that are not very aggressive about reordering
  • The hardware reordering is non-deterministic, and may happen very infrequently in practice
• Architecture community is way ahead of us in worrying about these problems.
• Open problem: A hardware simulator and/or Java (or C) compiler that reorders things in the “worst possible way”


Using Software to Mask Hardware
• Recall our two problems:
  1. Compiler writers would like to move code around
  2. The hardware folks also want to build hardware that dynamically moves operations around
• The second can be viewed as a compiler problem
  • Weak memory models come with extra primitives, usually called fences or memory barriers
    • Write fence: wait for all outstanding writes from this processor to complete
    • Read fence: do not issue any read pre-fetches before this point


Use of Memory Fences
• Memory fences can turn a particular memory model into sequential consistency under proper synchronization:
  • Add a read fence to the acquire lock operation
  • Add a write fence to the release lock operation
• In general, a language can have a stronger model than the machine it runs on if the compiler is clever
• The language may also have a weaker model, if the compiler does any optimizations


Aside: Volatile
• Because Java and C have weak memory models at the language level, they give programmers a tool: volatile variables
  • These variables should not be kept in registers
  • Operations should not be reordered
  • Should have mem fences around accesses
• General problem
  • This is a big hammer which may be unnecessary
  • No fine-grained control over particular accesses or program phases (static notion)
  • To get SC using volatile, many variables must be volatile


How Can Compilers Help?
• To implement a stronger model on a weaker one:
  • Figure out what can legally be reordered
  • Do optimizations under these constraints
  • Generate necessary fences in resulting code
• Open problem: Can this be used to give Java a sequentially consistent semantics?
  • What about C?


Compiler Analysis Overview
• When compiling sequential programs, compute dependencies. The reordering

    x = expr1;            y = expr2;
    y = expr2;     =>     x = expr1;

  is valid if y not in expr1 and x not in expr2 (roughly)
• When compiling parallel code, we need to consider accesses by other processors.

    Initially flag = data = 0

    Proc A                Proc B
    data = 1;             while (flag == 0);
    flag = 1;             ... = ...data...;
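The flag/data handoff can be written in today's Java as a small sketch. The class and method names are hypothetical. It assumes the Java memory model as revised for Java 5 and later, where a `volatile` write to `flag` may not be reordered before the earlier write to `data`; without `volatile`, the compiler or hardware reorderings listed earlier could let Proc B observe `flag` set and still read a stale `data`.

```java
// Hypothetical demo of the slide's Proc A / Proc B example using two threads.
final class FlagData {
    static int data = 0;
    // volatile: a read that sees true also sees every write made before the write.
    static volatile boolean flag = false;

    static int runOnce() {
        data = 0;
        flag = false;
        int[] seen = new int[1];

        Thread procB = new Thread(() -> {
            while (!flag) {
                Thread.onSpinWait();   // spin until Proc A publishes
            }
            seen[0] = data;            // guaranteed to observe data == 1
        });
        Thread procA = new Thread(() -> {
            data = 1;
            flag = true;
        });

        procB.start();
        procA.start();
        try {
            procA.join();
            procB.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("Proc B read data = " + runOnce());
    }
}
```

Declaring only `flag` volatile is enough for this pattern; it is the targeted alternative to making every shared variable volatile.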


Cycle Detection
• Processors define a “program order” on accesses from the same thread
  P is the union of these total orders
• Memory system defines an “access order” on accesses to the same variable
  A is access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir]

[Figure: one processor performs write data, then write flag; the other performs read flag, then read data.]


Cycle Analysis Intuition
• Definition is based on execution model, which allows you to answer the question: Was this execution sequentially consistent?
• Intuition:
  • Time cannot flow backwards
  • Need to be able to construct total order
• Examples (all variables initially 0)

    write data 1          read flag 1
    write flag 1          read data 0

    write data 1          read data 1
    write flag 1          read flag 0
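The two examples can be checked mechanically. The sketch below assumes each access is a graph node, program-order edges follow each processor's statement order, and access-order edges follow the values read (a read that returns 0 precedes the write of 1; a read that returns 1 follows it). The class name and node numbering are hypothetical, and the cycle test is an ordinary depth-first search rather than any particular compiler's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker: an execution is not sequentially consistent
// if the union of program order and access order contains a cycle.
final class CycleCheck {
    static final int WRITE_DATA = 0, WRITE_FLAG = 1, READ_FLAG = 2, READ_DATA = 3;

    static boolean hasCycle(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        int[] state = new int[n];   // 0 = unvisited, 1 = on DFS stack, 2 = finished
        for (int v = 0; v < n; v++) {
            if (state[v] == 0 && dfs(v, adj, state)) return true;
        }
        return false;
    }

    private static boolean dfs(int v, List<List<Integer>> adj, int[] state) {
        state[v] = 1;
        for (int w : adj.get(v)) {
            if (state[w] == 1) return true;                       // back edge closes a cycle
            if (state[w] == 0 && dfs(w, adj, state)) return true;
        }
        state[v] = 2;
        return false;
    }

    // First example: read flag returns 1, then read data returns 0.
    static boolean firstExampleHasCycle() {
        return hasCycle(4, new int[][] {
            {WRITE_DATA, WRITE_FLAG}, {READ_FLAG, READ_DATA},   // program order
            {WRITE_FLAG, READ_FLAG},                            // flag read saw the write
            {READ_DATA, WRITE_DATA}                             // data read came before the write
        });
    }

    // Second example: read data returns 1, then read flag returns 0.
    static boolean secondExampleHasCycle() {
        return hasCycle(4, new int[][] {
            {WRITE_DATA, WRITE_FLAG}, {READ_DATA, READ_FLAG},   // program order
            {WRITE_DATA, READ_DATA},                            // data read saw the write
            {READ_FLAG, WRITE_FLAG}                             // flag read came before the write
        });
    }

    public static void main(String[] args) {
        System.out.println("first example cyclic:  " + firstExampleHasCycle());
        System.out.println("second example cyclic: " + secondExampleHasCycle());
    }
}
```

Under these assumptions the first execution contains a cycle (not sequentially consistent) and the second does not.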


Cycle Detection Generalization
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor
• Can simplify the analysis by assuming all processors run a copy of the same code

[Figure: example accesses: write x, write y, read y; read y, write x.]


Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “conflict” edges
  • Bi-directional edge between accesses to the same variable in which at least one is a write
  • It is still correct if the conflict edge set is a superset of the reality
• Let the “delay set” D be all edges from P that are part of a minimal cycle
  • The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)

[Figure: example accesses: write y, write z; read y, write z; read x.]


Cycle Detection in Practice
• Cycle detection was implemented in a prototype version of the Split-C and Titanium compilers.
  • Split-C version used many simplifying assumptions.
  • Titanium version had too many conflict edges.
• What is needed to make it practical?
  • Finding possibly-concurrent program blocks
    • Use SPMD model rather than threads to simplify
    • Or apply data race detection work for Java threads
  • Compute conflict edges
    • Need good alias analysis
    • Reduce size by separating shared/private variables
  • Synchronization analysis


Synchronization Analysis
• Enrich language with synchronization primitives
  • Lock/Unlock or “synchronized” blocks
  • Post/Wait or Wait/Notify on condition variables
  • Global barriers: all processors wait at barrier
• Compiler can exploit understanding of synchronization primitives to reduce cycles
  • Note: use of language primitives for synchronization may aid in optimization, but “rolling your own” is still correct


Edge Ordering
• Post-Wait operations on a variable can be ordered
  • Although correct to treat these as shared memory accesses, we can get leverage by ordering them
• Then turn edges
  • ? post c into delay edges
  • wait c ? into delay edges
• And orient corresponding conflict edges

post c

wait c…


Edge Deletion
• In SPMD programs, the most common form of synchronization is global barrier
• If we add to the delay set edges of the form
  • ? barrier
  • barrier ?
  then we can remove corresponding conflict edges

…barrier

barrier…


Synchronization in Cycle Detection
• Iterative algorithm
  • Compute delay set restrictions in which at least one operation is a synchronization operation
  • Perform edge orientation and deletion
  • Compute delay set on remaining conflict edges
• Two important details
  • For locks (and synchronized) we need good alias information about the lock variables. (Conservative would probably work…)
  • For barriers, need to line up corresponding barriers


Static Analysis for Barriers
• Lining up barriers is needed for cycle detection.
• Mis-aligned barriers are also a source of bugs inside branches or loops.
• Includes other global communication primitives: barrier, broadcast, reductions
• Titanium uses barrier analysis, based on the idea of single variables and methods:
  • A “single” method is one called by all procs
      public single static void allStep(...)
  • A “single” variable has same value on all procs
      int single timestep = 0;


Single Analysis
• The underlying requirement is that barriers only match the same textual instance
• Complication from conditionals:

    if (this processor owns some data) {
      compute on it
      barrier
    }

• Hence the use of “single” variables in Titanium
• If a conditional or loop block contains a barrier, all processors must execute it
  • expressions in such loop headers, if statements, etc. must contain only single variables


Single Variable Example in Titanium
• Barriers and single in N-body Simulation

    class ParticleSim {
      public static void main (String [] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++) {
          // read all particles and compute forces on mine
          computeForces(…);
          Ti.barrier();
          // write to my particles using new forces
          spreadParticles(…);
          Ti.barrier();
        }
      }
    }

• Single methods are automatically inferred, variables not
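The same two-barrier phase structure can be approximated in standard Java with `java.util.concurrent.CyclicBarrier`. This is a sketch, not Titanium: the class name, the per-thread arithmetic, and the use of one thread per "processor" are assumptions. Every thread runs the same loop bounds, which plays the role of the `single` loop variables: all threads reach each barrier the same number of times.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical SPMD-style emulation: write phase, barrier, read phase, barrier.
final class BarrierPhases {
    static int[] run(int procs, int steps) {
        int[] slots = new int[procs];   // slots[p] is written only by thread p
        int[] sums = new int[procs];    // sums[p] is thread p's result
        CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] threads = new Thread[procs];

        for (int p = 0; p < procs; p++) {
            final int me = p;
            threads[p] = new Thread(() -> {
                try {
                    for (int t = 0; t < steps; t++) {
                        slots[me] = (me + 1) * (t + 1);   // write my own piece
                        barrier.await();                  // all writes finished
                        int s = 0;
                        for (int v : slots) s += v;       // read everyone's piece
                        sums[me] = s;
                        barrier.await();                  // all reads finished before the next write
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[p].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        for (int s : run(4, 3)) System.out.println(s);
    }
}
```

If one thread skipped a barrier inside a conditional, the others would block forever at `await()`, which is the class of bug the single analysis is meant to rule out statically.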


Some Open Problems
• What is the right semantic model for shared memory parallel languages?
• Is cycle detection practical on real languages?
  • How well can synchronization be analyzed?
  • Aliases between non-synchronizing variables?
  • Can we distinguish between shared and private data?
  • What is the complexity on real applications?
• Analysis in programs with dynamic thread creation


Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
    • Programming model landscape
    • Global address space language support
    • Optimizing local pointers
    • Optimizing remote pointers
  • 100K-1M processor machine?


Programming Models at Scale
• Large scale machines are mostly
  • Clusters of uniprocessors or SMPs
  • Some have hardware support for remote memory access
    • Shmem in Cray T3E
    • GM layer in Myrinet
    • DSM on SGI Origin 2K
• Yet most programs are written in:
  • SPMD model
  • Message passing
• Can we use a simpler, shared memory model?
  • On Origin, yes, but what about large machines?


Global Address Space
• To run shared memory programs on distributed memory hardware, we replace references (pointers) by global ones:
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs (functionality is correct)
  • Uniform programming model across machines
  • Especially true for cluster of SMPs
• Usual implementation
  • Each reference contains:
    • Processor id (or process id on cluster of SMPs)
    • And a memory address on that processor


Use of Global / Local
• Global pointers are more expensive than local
  • When data is remote, an access turns into a remote read or write, which is a message call of some kind
  • When the data is not remote, there is still an overhead
    • space (processor number + memory address)
    • dereference time (check to see if local)
• Conclusion: not all references should be global -- use normal references when possible.
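A toy model of the "processor id plus address" representation makes the two overheads concrete. Everything here is hypothetical: the `GlobalRef` class, the simulated per-processor heaps, and the counter that stands in for a network message. It is a sketch of the idea, not the Titanium runtime.

```java
// Hypothetical global pointer model with simulated per-processor heaps.
final class GlobalRefDemo {
    // Space overhead: a processor number stored alongside the address.
    static final class GlobalRef {
        final int proc;
        final int addr;

        GlobalRef(int proc, int addr) {
            this.proc = proc;
            this.addr = addr;
        }
    }

    final int me;          // the processor this runtime instance represents
    final int[][] heaps;   // heaps[p] simulates processor p's local memory
    int remoteReads = 0;   // counts accesses that would need a message

    GlobalRefDemo(int me, int[][] heaps) {
        this.me = me;
        this.heaps = heaps;
    }

    int read(GlobalRef r) {
        if (r.proc == me) {
            return heaps[me][r.addr];   // time overhead: the locality check, then a plain load
        }
        remoteReads++;                  // a real runtime would send a request here
        return heaps[r.proc][r.addr];
    }

    public static void main(String[] args) {
        GlobalRefDemo rt = new GlobalRefDemo(0, new int[][] {{10, 11}, {20, 21}});
        System.out.println(rt.read(new GlobalRef(0, 1)) + " " + rt.read(new GlobalRef(1, 0)));
        System.out.println("remote reads: " + rt.remoteReads);
    }
}
```

A local reference would skip both the extra field and the `r.proc == me` test, which is why inferring or declaring locals pays off in inner loops.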


Explicit Approach to Global/Local
• A common approach in parallel languages is to distinguish between local and global (“possibly remote”) pointers in the language.
• Two variations are:
  • Make global the default – nice for porting shared memory programs
  • Make local the default – nice for calling libraries on a single processor that were built for uniprocessors
• Titanium uses global default, with local declarations in important sections


Global Address Space
• Processes allocate locally
• References can be passed to other processes

    class C { int val; ... }
    C gv;        // global pointer
    C local lv;  // local pointer

    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...; ... = gv.val;

[Figure: Process 0 and the other processes each hold lv and gv variables and a LOCAL HEAP.]


Local Pointer Analysis
• Compiler can infer locals using Local Qualification Inference
• Data structures must be well partitioned

[Figure: “Effect of LQI”: bar chart of running time in seconds (axis 0 to 250) for the applications cannon, lu, sample, gsrb, and poison, comparing Original and After LQI.]


Remote Accesses
• What about remote accesses? In this case, the cost of the storage and extra check is small relative to the message cost.
• Strategies for reducing remote accesses:
  • Use non-blocking writes – do not wait for them to be performed
  • Use prefetching for reads – ask before data is needed
  • Aggregate several accesses to the same processor together
• All of these involve reordering or the potential for reordering


Communication Optimizations
• Data on an old machine, UCB NOW, using a simple subset of C

[Figure: chart of normalized time.]


Example communication costs
Latency (α) and bandwidth (β) measured in units of flops; β measured per 8-byte word

Machine               Year    α       β     Mflop rate per proc
CM-5                  1992    1900    20    20
IBM SP-1              1993    5000    32    100
Intel Paragon         1994    1500    2.3   50
IBM SP-2              1994    7000    40    200
Cray T3D (PVM)        1994    1974    28    94
UCB NOW               1996    2880    38    180
UCB Millennium        2000    50000   300   500

SGI Power Challenge   1995    3080    39    308
SUN E6000             1996    1980    9     180
SGI Origin 2K         2000    5000    25    500
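These numbers show why aggregation matters. The sketch below assumes a simple linear cost model that the slides do not state explicitly: a message of n words costs latency + n × per-word cost, both in flop units. It also assumes the first numeric column after the year is the latency and the second is the per-word cost. The class and method names are hypothetical.

```java
// Hypothetical cost model: one message of n words costs alpha + n * beta flops.
final class MessageCost {
    // n separate one-word messages each pay the full latency.
    static long separate(long alpha, long beta, long words) {
        return words * (alpha + beta);
    }

    // One aggregated message pays the latency once.
    static long aggregated(long alpha, long beta, long words) {
        return alpha + words * beta;
    }

    public static void main(String[] args) {
        // UCB NOW row from the table: latency 2880, per-word cost 38.
        long alpha = 2880, beta = 38, words = 100;
        System.out.println("separate:   " + separate(alpha, beta, words) + " flops");
        System.out.println("aggregated: " + aggregated(alpha, beta, words) + " flops");
    }
}
```

Under this assumed model, 100 one-word messages on UCB NOW cost 291,800 flop-times, against 6,680 for a single 100-word message.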


Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
  • 100K-1M processor machine?
    • Kinds of machines
    • Open problems


Future Machines

• IBM is building a 1M processor Blue Gene machine
  • Expect a processor failure per day
  • Would like to run 1 job for a year
• “The grid” is made from harnessing unused cycles across the internet
  • Need to kill job if owner wants to use the machine
  • Frequent failures
• All of our high performance programming models assume the machine works


Possible Software Model
• System hides some faults at each layer
  • Lower levels send “hints” upward
  • Lower level has control, but upper level can optimize

Byzantine faults

Fail-stop faults

Performance faults (process pairs, checkpoints)

Uniform machine (dynamic load balancing)

Over-partitioned applications (Java,Titanium,…)


References
• Serial Java performance:
  • Roldan Pozo, Jose Moreira et al., Titanium group
• Java memory models
  • Bill Pugh, Jaejin Lee
• Cycle analysis
  • Dennis Shasha and Marc Snir, Arvind Krishnamurthy and Kathy Yelick, Jaejin Lee and Sam Midkiff and David Padua
• Synchronization analysis
  • Data race detection: many people
  • Barriers: Alex Aiken and David Gay
• Global pointers
  • See UPC, Split-C, AC, CC++, Titanium, and others
  • Local Qualification Inference: Ben Liblit and Alex Aiken
• Non-blocking communication
  • Active messages, Global Arrays (PNL), and others


Summary
• Opportunities to improve programmability
  • Simplify programmer’s model (e.g., Java with sequential consistency)
  • Solve harder compiler problem (use it on “the grid”)
• Basic requirements understood, but not
  • Usability in practice on real applications
  • Interaction with other analyses
  • Complexity
• Current and future machines are harder
  • More processors, more levels of hierarchy
  • Less reliable overall, because of scale

Backup Slides


Outline
• Java-like Languages
  • Language support for performance
  • Optimizations
  • Compilation models
• Parallel
  • Machine models
  • Language models
  • Memory models
  • Analysis
• Distributed
  • Remote access
  • Failures


Data from Dan
• Origin 2000 (128 CPU configuration):
  • local memory latency: 300 ns
  • remote memory latency: 900 ns avg.
  • bandwidth: 160 MB/sec per CPU
  • CPU: MIPS R10000 195MHz (390 MFLOPS) or 250MHz (500 MFLOPS)
  • note the hardware supports up to 4 outstanding non-blocking references to remote cache lines (SGI obviously agrees with you)
• Millennium cluster:
  • CPU: 4-way Intel P3-700
  • AMUDP performance (100Mbit half-duplex switched ethernet, kernel UDP driver):
    • 100 microsec latency, 12 MB/sec bandwidth (both actual measured)
  • Myrinet PCI64C performance through GM:
    • (SAN-2000 network) 240 MB/sec in each direction, 7-9 microsec latency
• Cray T3E-900 (the version NERSC has):
  • CPU: 450 MHz, 900 Mflops
  • local memory: 600 MB/sec actual, 280 ns latency
  • remote memory:
    • SHMEM: 2 microsec, 350 MB/sec
    • PVM: 11 microsec, 154 MB/sec
    • MPI: 14 microsec, 260 MB/sec


Data from Millennium Home page
• Poweredge 2-way SMPs (500 MHz Pentium IIIs) running Linux 2.2.5
• Each SMP has a Lanai 7.2 card:
  • Round trip time: 32-33 microseconds for small messages
  • BW: 59.5 MB/s for 16 KB msgs
  • Gap (time between msg sends in steady state): 18-19 microseconds
• Page: Dec 1999


Value of optimizations


Also for I/O (Dan’s stuff)


Parallel Language Landscape
• Two axes (2-d grid)
  • Parallelism (control) model
    • Static (SPMD)
    • Dynamic (threads)
  • Communication/Sharing model
    • Shared memory
    • Global address space
    • Message passing
• In the 2-100 processor range, one can buy shared memory machines


Parallel Language Landscape
• Implicitly parallel (serial semantics)
  • Sequential – compiler too hard
  • Data parallel – compiler too hard
• Explicitly parallel (parallel semantics)
  • OpenMP – compiler too hard (for large machines)
  • Threads – the sweet spot
    • People use it (java, vector supers)
  • Message passing (e.g., MPI) – programming too hard


The Economics of High Performance
• The failure (or delay) of compilers for data parallel languages in the 90s -> most programs for large scale machines written in MPI
• Programming community is elite
• Many applications with parallelism don’t use it, because it’s too hard

Backup Slides II


Titanium Group
• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls (SDSC)
• Peter McCorquodale (LBNL)
• Andrew Begel
• Dan Bonachea
• Tyson Condie
• David Gay
• Ben Liblit
• Chang Sun Lin
• Geoff Pike
• Siu Man Yau


Target Problems
• Many modeling problems in astrophysics, biology, material science, and other areas require
  • Enormous range of spatial and temporal scales
• To solve interesting problems, one needs:
  • Adaptive methods
  • Large scale parallel machines
• Titanium is designed for methods with
  • Structured grids
  • Locally-structured grids (AMR)


Common Requirements
• Algorithms for numerical PDE computations are
  • communication intensive
  • memory intensive
• AMR makes these harder
  • more small messages
  • more complex data structures
  • most of the programming effort is debugging the boundary cases
  • locality and load balance trade-off is hard


A Little History
• Most parallel programs are written using explicit parallelism, either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in C or Java
    • Usually for non-scientific applications
    • Easier to program
• Take the best features of both for Titanium


Why Java for Scientific Computing?
• Computational scientists use increasingly complex models
  • Popularized C++ features: classes, overloading, pointer-based data structures
• But C++ is very complicated
  • easy to lose performance and readability
• Java is a better C++
  • Safe: strongly typed, garbage collected
  • Much simpler to implement (research vehicle)
  • Industrial interest as well: IBM HP Java


Summary of Features Added to Java
• Multidimensional arrays with iterators
• Immutable (“value”) classes
• Templates
• Operator overloading
• Scalable SPMD parallelism
• Global address space
• Checked Synchronization
• Zone-based memory management
• Scientific Libraries


Lecture Outline
• Language and compiler support for uniprocessor performance
  • Immutable classes
  • Multidimensional Arrays
  • Foreach
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions


Java: A Cleaner C++
• Java is an object-oriented language
  • classes (no standalone functions) with methods
  • inheritance between classes
• Documentation on web at java.sun.com
• Syntax similar to C++

    class Hello {
      public static void main (String [] argv) {
        System.out.println("Hello, world!");
      }
    }

• Safe: strongly typed, auto memory management
• Titanium is (almost) strict superset


Sequential Performance

Ultrasparc:
                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.4s            6.8s          1.5s              7%
3D multigrid    12s                           22s               83%
2D multigrid    5.4s                          6.2s              15%
EM3D            0.7s            1.8s          1.0s              42%

Pentium II:
                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.8s                          2.3s              27%
3D multigrid    23.0s                         20.0s             -13%
2D multigrid    7.3s                          5.5s              -25%
EM3D            1.0s                          1.6s              60%

Performance results from 98; new IR and optimization framework almost complete.



Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
  • Templates
  • Operator overloading
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions

Example later



Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
  • SPMD execution
  • Barriers and single
  • Explicit Communication
  • Implicit Communication (global and local references)
  • More on Single
  • Synchronized methods and blocks (as in Java)
• Applications and application-level libraries
• Summary and future directions


SPMD Execution Model
• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world

    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

• Any non-trivial program will have communication and synchronization


SPMD Model
• All processors start together and execute same code, but not in lock-step
• Basic control done using
  • Ti.numProcs() total number of processors
  • Ti.thisProc() number of executing processor
• Bulk-synchronous style

    read all particles and compute forces on mine
    Ti.barrier();
    write to my particles using new forces
    Ti.barrier();

• This is neither message passing nor data-parallel


Explicit Communication: Broadcast

• Broadcast is a one-to-all communication

    broadcast <value> from <processor>

• For example:

    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;

• The processor number in the broadcast must be single; all constants are single.
• The allCount variable could be declared single.


Example of Data Input
• Same example, but reading from keyboard
• Shows use of Java exceptions

    int myCount = 0;
    int single allCount = 0;
    if (Ti.thisProc() == 0)
      try {
        DataInputStream kb = new DataInputStream(System.in);
        myCount = Integer.valueOf(kb.readLine()).intValue();
      } catch (Exception e) {
        System.err.println("Illegal Input");
      }
    allCount = broadcast myCount from 0;


Example: A Distributed Data Structure

Proc 0 Proc 1

local_grids

• Data can be accessed across processor boundaries

all_grids


Example: Setting Boundary Conditions

    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }


Explicit Communication: Exchange
• To create shared data structures
  • each processor builds its own piece
  • pieces are exchanged (for objects, just exchange pointers)
• Exchange primitive in Titanium

    int [1d] single allData;
    allData = new int [0:Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);

• E.g., on 4 procs, each will have copy of allData:

    0 2 4 6
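The effect of `exchange` can be mimicked sequentially in plain Java. This is only a sketch of the resulting data layout: the class name is hypothetical and the loop simulates the processors one after another rather than performing any real communication.

```java
import java.util.Arrays;

// Hypothetical sequential simulation of an all-to-all exchange.
final class ExchangeDemo {
    // Returns allData, where allData[p] is processor p's copy after the exchange.
    static int[][] exchange(int numProcs) {
        int[][] allData = new int[numProcs][numProcs];
        for (int src = 0; src < numProcs; src++) {
            int contribution = src * 2;           // mirrors Ti.thisProc()*2 in the slide
            for (int dst = 0; dst < numProcs; dst++) {
                allData[dst][src] = contribution; // every processor receives src's value
            }
        }
        return allData;
    }

    public static void main(String[] args) {
        for (int[] copy : exchange(4)) {
            System.out.println(Arrays.toString(copy));
        }
    }
}
```

With four simulated processors, every copy holds the values 0, 2, 4, 6, matching the slide.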


Building Distributed Structures
• Distributed structures are built with exchange:

    class Boxed {
      public Boxed (int j) { val = j; }
      public int val;
    }

    Object [1d] single allData;
    allData = new Object [0:Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));


Distributed Data Structures
• Building distributed arrays:

    RectDomain <1> single allProcs = [0:Ti.numProcs()-1];
    RectDomain <1> myParticleDomain = [0:myPartCount-1];
    Particle [1d] single [1d] allParticle = new Particle [allProcs][1d];
    Particle [1d] myParticle = new Particle [myParticleDomain];
    allParticle.exchange(myParticle);

• Now each processor has array of pointers, one to each processor’s chunk of particles


Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
  • Gene sequencing application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
• Summary and future directions


Unstructured Mesh Kernel
• EM3D: Relaxation on a 3D unstructured mesh
• Speedup on Ultrasparc SMP
• Simple kernel: mesh not partitioned.

[Figure: em3d speedup (axis 0 to 8) on 1, 2, 4, and 8 processors.]


AMR Poisson
• Poisson Solver [Semenzato, Pike, Colella]
  • 3D AMR
  • finite domain
  • variable coefficients
  • multigrid across levels
• Performance of Titanium implementation
  • Sequential multigrid performance +/- 20% of Fortran
  • On fixed, well-balanced problem of 8 patches, each 72^3
  • parallel speedups of 5.5 on 8 processors

[Figure: AMR grid hierarchy with Level 0, Level 1, and Level 2.]


Scalable Poisson Solver
• MLC for Finite-Differences by Balls and Colella
• Poisson equation with infinite boundaries
  • arise in astrophysics, some biological systems, etc.
• Method is scalable
  • Low communication
• Performance on
  • SP2 (shown) and T3E
  • scaled speedups
  • nearly ideal (flat)
• Currently 2D and non-adaptive

[Figure: Time/fine-patch-iter/proc (axis 0 to 1.2) on 1, 4, and 16 processors, for the series 129x129/65x65, 129x129/33x33, 257x257/129x129, and 257x257/65x65.]


AMR Gas Dynamics
• Developed by McCorquodale and Colella
• Merge with Poisson underway for self-gravity
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package


Distributed Array Libraries
• There are some “standard” distributed array libraries associated with Titanium
• They hide the details of exchange, indirection within the data structure, etc.
• Libraries benefit from support for templates


Distributed Array Library Fragment

    template <class T, int single arity> public class DistArray {
      RectDomain <arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain <arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point <arity> p, T value) {
        getHomingSubMatrix (p) [p] = value;
      }
    }

    template DistArray <double, 2> single A =
      new template DistArray <double, 2> ([[0, 0] : [aHeight, aWidth]]);


Immersed Boundary Method (future)
• Immersed boundary method [Peskin, McQueen]
  • Used in heart model, platelets, and others
  • Currently uses FFT for Navier-Stokes solver
  • Begun effort to move solver and full method into Titanium


Implementation
• Strategy
  • Titanium into C
  • Solaris or Posix threads for SMPs
  • Lightweight communication for MPPs/Clusters
• Status: Titanium runs on
  • Solaris or Linux SMPs and uniprocessors
  • Berkeley NOW
  • SDSC Tera, SP2, T3E (also NERSC)
  • SP3 port underway


Using Titanium on NPACI Machines
• Send mail to us if you are interested: [email protected]
• Has been installed in individual accounts
  • T3E and BH: upgrade needed
• On uniprocessors and SMPs
  • available from the Titanium home page
  • http://www.cs.berkeley.edu/projects/titanium
  • other documentation available as well


Calling Other Languages
• We have built interfaces to
  • PETSc: scientific library for finite element applications
  • Metis: graph partitioning library
  • KeLP: starting work on this
• Two issues with cross-language calls
  • accessing Titanium data structures (arrays) from C
    • possible because Titanium arrays have the same format internally
  • having a common message layer
    • Titanium is built on lightweight communication


Future Plans
• Improved compiler optimizations for scalar code
  • large loops are currently within +/- 20% of Fortran
  • working on small loop performance
• Packaged solvers written in Titanium
  • Elliptic and hyperbolic solvers, both regular and adaptive
• New application collaboration
  • Peskin and McQueen (NYU) with Colella (LBNL)
  • Immersed boundary method, currently used for heart simulation, platelet coagulation, and others

Backup Slides


Example: Domain

Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> r = [lb : ub : [2, 2]];
...
Domain<2> red = r + (r + [1, 1]);
foreach (p in red) {
  ...
}

[Figure: r spans (0, 0) to (6, 4); r + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]

• Domains in general are not rectangular
• Built using set operations
  • union, +
  • intersection, *
  • difference, -

• Example is red-black algorithm
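To make the set-union construction concrete, here is a minimal plain-Java sketch that enumerates the same points as the slide's `red` domain. It is not Titanium's `Domain` type: the class `DomainSketch`, the helper `rect`, and the choice to encode points as "x,y" strings are assumptions made purely for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java sketch of the strided-domain union from the slide. */
final class DomainSketch {
    /** Points of [lb : ub : stride], upper bound inclusive, encoded as "x,y" strings. */
    static Set<String> rect(int lx, int ly, int ux, int uy, int sx, int sy) {
        Set<String> pts = new TreeSet<>();
        for (int x = lx; x <= ux; x += sx) {
            for (int y = ly; y <= uy; y += sy) {
                pts.add(x + "," + y);
            }
        }
        return pts;
    }

    /** The slide's red domain: r + (r + [1, 1]), where the outer + is set union. */
    static Set<String> red() {
        Set<String> r = rect(0, 0, 6, 4, 2, 2);
        Set<String> shifted = rect(1, 1, 7, 5, 2, 2);  // r translated by [1, 1]
        Set<String> union = new TreeSet<>(r);
        union.addAll(shifted);
        return union;
    }

    public static void main(String[] args) {
        Set<String> red = red();
        System.out.println(red.size() + " points; contains (3,3): " + red.contains("3,3"));
    }
}
```

The union holds every point whose coordinates are both even or both odd, which is why the result is not a rectangle even though both operands are.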


Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid

void gsrb() {
  boundary (phi);
  for (Domain<2> d = red; d != null;
       d = (d == red ? black : null)) {
    foreach (q in d)    // unordered iteration
      res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)])*4
                + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                - 20.0*phi[q] - k*rhs[q]) * 0.05;
    foreach (q in d) phi[q] += res[q];
  }
}
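For readers without a Titanium compiler, the same sweep can be sketched in plain Java over 2D arrays. This is an illustrative translation under stated assumptions, not the authors' code: the class `GsrbSketch`, the square grid with a one-cell boundary, and the coloring rule (red where `i + j` is even) are assumptions, and the slide's `boundary(phi)` call is omitted.

```java
/** Plain-Java sketch of the red-black sweep on a square grid with a one-cell boundary. */
final class GsrbSketch {
    /** One red pass followed by one black pass, using the slide's 9-point weights. */
    static void gsrb(double[][] phi, double[][] rhs, double k) {
        int n = phi.length;
        double[][] res = new double[n][n];
        for (int color = 0; color < 2; color++) {   // 0 = red, 1 = black
            // First pass: compute the correction for every interior point of this color.
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    if ((i + j) % 2 != color) continue;
                    res[i][j] = ((phi[i - 1][j] + phi[i + 1][j] + phi[i][j + 1] + phi[i][j - 1]) * 4
                            + (phi[i - 1][j + 1] + phi[i - 1][j - 1]
                               + phi[i + 1][j + 1] + phi[i + 1][j - 1])
                            - 20.0 * phi[i][j] - k * rhs[i][j]) * 0.05;
                }
            }
            // Second pass: apply the corrections. Because updates are deferred,
            // the visiting order within a color does not affect the result.
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    if ((i + j) % 2 == color) phi[i][j] += res[i][j];
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] phi = new double[5][5];
        double[][] rhs = new double[5][5];
        rhs[2][2] = 1.0;
        gsrb(phi, rhs, 1.0);
        System.out.println("phi at center after one sweep: " + phi[2][2]);
    }
}
```

Splitting each color into a compute pass and an update pass mirrors the two `foreach` loops on the slide, and it is what allows the iteration to be unordered.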


Recent Progress in Titanium

• Distributed data structures built with global refs
  • communication may be implicit, e.g.: a[j] = a[i].dx;
  • used extensively in AMR algorithms
• Runtime layer optimizes
  • bulk communication
  • bulk I/O
• Runs on
  • T3E, SP2, and Tera
• Compiler analysis optimizes
  • global references converted to local ones when possible


Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: Accesses to shared variables that are not synchronized have undefined behavior.
• Use synchronization to control access to shared variables.
  • barriers
  • synchronized methods and blocks
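The two synchronization mechanisms named above can be sketched in standard Java. Note the assumptions: Titanium provides its own barrier for SPMD processes, whereas this sketch stands in for it with `java.util.concurrent.CyclicBarrier` and ordinary threads; the class `ConsistencySketch` and its methods are hypothetical names.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

/** Sketch: synchronized methods guard a shared variable; a barrier separates phases. */
final class ConsistencySketch {
    private int sum = 0;   // shared; only accessed through synchronized methods

    private synchronized void add(int v) { sum += v; }
    private synchronized int read() { return sum; }

    /** Runs nThreads SPMD-style workers and returns what each one observed after the barrier. */
    static int[] run(int nThreads) {
        ConsistencySketch shared = new ConsistencySketch();
        CyclicBarrier barrier = new CyclicBarrier(nThreads);
        int[] seen = new int[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                shared.add(id + 1);          // phase 1: synchronized update
                try {
                    barrier.await();         // every update is complete before anyone reads
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                seen[id] = shared.read();    // phase 2: all workers observe the same total
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                    // join makes the workers' writes to seen[] visible
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

Without the barrier, a worker could read `sum` before the others had added their contribution; without `synchronized`, the concurrent updates themselves would race.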


Compiler Techniques Outline
• Analysis and Optimization of parallel code
  • Tolerate network latency: Split-C experience
  • Hardware trends and reordering
  • Semantics: sequential consistency
  • Cycle detection: parallel dependence analysis
  • Synchronization analysis: parallel flow analysis
• Summary and future directions


Parallel Optimizations
• Titanium compiler performs parallel optimizations
  • communication overlap and aggregation
• Two new analyses
  • synchronization analysis: the parallel analog to control flow analysis for serial code [Gay & Aiken]
  • shared variable analysis: the parallel analog to dependence analysis [Krishnamurthy & Yelick]
• local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]


Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
  • global address space
  • SPMD parallelism
• But, Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
• Also one-way communication
• Conclusion: useful, but complicated

int *global p;
x := *p;          /* get */
*p := 3;          /* put */
sync;             /* wait for my puts/gets */

*p :- x;          /* store */
all_store_sync;   /* wait globally */
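The split-phase pattern (initiate a remote access, do independent work, then wait) has a loose analogue in standard Java futures. The sketch below is an assumption-laden illustration rather than Split-C semantics: the "remote" read is simulated by a task on another thread, and `SplitPhaseSketch`, `get`, and `overlapped` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;

/** Sketch: overlap a simulated remote read with independent local work. */
final class SplitPhaseSketch {
    /** Stand-in for a non-blocking remote read; here it just runs on another thread. */
    static CompletableFuture<Integer> get(int[] remote, int index) {
        return CompletableFuture.supplyAsync(() -> remote[index]);
    }

    static int overlapped(int[] remote) {
        CompletableFuture<Integer> pending = get(remote, 2);  // initiate, like x := *p
        int local = 0;
        for (int i = 0; i < 1000; i++) {
            local += i;                                       // independent work hides the latency
        }
        int x = pending.join();                               // complete, like sync
        return local + x;
    }

    public static void main(String[] args) {
        System.out.println(overlapped(new int[] {10, 20, 30}));
    }
}
```

The programmer has to find work that does not depend on `x` and place the wait correctly, which hints at why the slide calls the approach useful but complicated.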


Sources of Memory/Comm. Overlap
• Would like compiler to introduce put/get/store.
• Hardware also reorders
  • out-of-order execution
  • write buffers with read bypass
  • non-FIFO write buffers
  • weak memory models in general
• Software already reorders too
  • register allocation
  • any code motion
• System provides enforcement primitives
  • e.g., memory fence, volatile, etc.
  • tend to be heavyweight, with unpredictable performance
• Can the compiler hide all this?
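As one concrete instance of an enforcement primitive, the following Java sketch uses a `volatile` flag to order a plain write before a flag write. It relies on the Java memory model's guarantee for `volatile`, which is stronger than C's `volatile`; the class `FenceSketch` and method `handoff` are hypothetical names chosen for this illustration.

```java
/** Sketch: a volatile flag prevents the data write from being reordered past the flag write. */
final class FenceSketch {
    private int data = 0;                    // plain field, no ordering guarantees by itself
    private volatile boolean ready = false;  // volatile accesses act as ordering points

    static int handoff() {
        FenceSketch s = new FenceSketch();
        Thread writer = new Thread(() -> {
            s.data = 42;       // may not be moved after the volatile write below
            s.ready = true;    // publishes data to any thread that later reads ready == true
        });
        writer.start();
        while (!s.ready) {     // volatile read; loops until the writer has published
            Thread.yield();
        }
        return s.data;         // sees 42 under the Java memory model
    }

    public static void main(String[] args) {
        System.out.println(handoff());
    }
}
```

If `ready` were a plain field, the compiler or hardware would be free to reorder the two writes or to hoist the read out of the loop, which is the kind of reordering the slide lists.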

End of Compiling Parallel Code
