Optimization of Java-Like Languages for Parallel and Distributed Environments
U.C. Berkeley Computer Science Division
Kathy Yelick
http://www.cs.berkeley.edu/~yelick/talks.html
Katherine Yelick, Computer Science Division, EECS, University of California, Berkeley
What this tutorial is about
• Language and compiler support for:
  • Performance
  • Programmability
  • Scalability
  • Portability
• Some of this is specific to the Java language (not the JVM), but much of it applies to other parallel languages

Titanium
Titanium will be used as an example
• Based on Java
  • Has Java’s syntax, safety, memory management, etc.
  • Replaces Java’s thread model with static threads (SPMD)
  • Other extensions for performance and parallelism
• Optimizing compiler
  • Compiles to C (and from there to executable)
  • Synchronization analysis
  • Various optimizations
• Portable
  • Runs on uniprocessors, shared memory, and clusters

Organization
• Can we use Java for high performance on
  • 1 processor machines?
    • Java commercial compilers on some scientific applications
    • Java the language, compiled to native code (via C)
    • Extensions of Java to improve performance
  • 10-100 processor machines?
  • 1K-10K processor machines?
  • 100K-1M processor machines?

SciMark Benchmark
• Numerical benchmark for Java, C/C++
• Five kernels:
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • Dense LU factorization
• Results are reported in Mflops
• Download and run on your machine from:
  • http://math.nist.gov/scimark2
  • C and Java sources also provided
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Sun UltraSPARC 60)
[Bar chart: Mflops (axis 0-90) for C and Java on FFT, SOR, MC, Sparse, LU]
* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Intel PIII 500MHz, Win98)
[Bar chart: Mflops (axis 0-120) for C and Java on FFT, SOR, MC, Sparse, LU]
* Sun JDK 1.2, javac -O; Microsoft VC++ 5.0, cl -O; Win98
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

Can we do better without the JVM?
• Pure Java with a JVM (and JIT)
  • Within 2x of C and sometimes better
  • OK for many users, even those using high end machines
  • Depends on quality of both compilers
• We can try to do better using a traditional compilation model
  • E.g., Titanium compiler at Berkeley
    • Compiles Java extension to C
    • Does not optimize Java arrays or for loops (prototype)

Java Compiled by Titanium Compiler
Performance on a Pentium IV (1.5GHz)
[Bar chart: Mflops (axis 0-450) for java, C (gcc -O6), Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, LU]

Java Compiled by Titanium Compiler
Performance on a Sun Ultra 4
[Bar chart: Mflops (axis 0-70) for Java, C, Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, LU]

Language Support for Performance
• Multidimensional arrays
  • Contiguous storage
  • Support for sub-array operations without copying
• Support for small objects
  • E.g., complex numbers
  • Called “immutables” in Titanium
  • Sometimes called “value” classes
• Unordered loop construct
  • Programmer specifies that iterations are independent
  • Eliminates need for dependence analysis – short term solution? Used by vectorizing compilers.
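The "small objects" bullet can be approximated in standard Java. The sketch below is not Titanium syntax, and Titanium's own support for immutables is not shown. The class name `Complex` and its methods are hypothetical. It uses a final class with final fields, so every operation returns a new value instead of mutating the receiver.

```java
// Hypothetical illustration in plain Java; not Titanium's immutable classes.
final class Complex {
    private final double re;
    private final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    double re() { return re; }
    double im() { return im; }

    // Operations build new values; the receiver is never modified.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    // Value semantics: equality is defined by contents, not identity.
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof Complex)) return false;
        Complex c = (Complex) obj;
        return Double.compare(re, c.re) == 0 && Double.compare(im, c.im) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * Double.hashCode(re) + Double.hashCode(im);
    }

    @Override
    public String toString() {
        return re + " + " + im + "i";
    }
}
```

In plain Java each `Complex` is still a heap object. The point of language-level value classes is that a compiler may store such objects inline, for example inside an array, without a pointer per element.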

HPJ Compiler from IBM
• HPJ Compiler from IBM Research
  • Moreira et al.
• Program using Array classes which use contiguous storage
  • e.g. A[i][j] becomes A.get(i,j)
  • No new syntax (worse for programming, but better portability – any Java compiler can be used)
• Compiler for IBM machines, exploits hardware
  • e.g., Fused Multiply-Add
• Result: 85+% of Fortran on RS/6000
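To make the "A[i][j] becomes A.get(i,j)" idea concrete, here is a minimal sketch of an array class backed by contiguous storage. It is not the API of IBM's Array package. The name `DoubleArray2D` and its methods are assumptions chosen for illustration.

```java
// Hypothetical 2D array class over one contiguous block; not IBM's actual API.
final class DoubleArray2D {
    private final int rows;
    private final int cols;
    private final double[] data; // row-major: element (i, j) lives at i * cols + j

    DoubleArray2D(int rows, int cols) {
        if (rows < 0 || cols < 0) throw new IllegalArgumentException("negative extent");
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    int rows() { return rows; }
    int cols() { return cols; }

    double get(int i, int j) {
        return data[index(i, j)];
    }

    void set(int i, int j, double value) {
        data[index(i, j)] = value;
    }

    // Explicit bounds check, since a flat index could otherwise wrap into a neighboring row.
    private int index(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return i * cols + j;
    }

    // Example traversal written against get(i, j) instead of A[i][j].
    static double sum(DoubleArray2D a) {
        double total = 0.0;
        for (int i = 0; i < a.rows(); i++) {
            for (int j = 0; j < a.cols(); j++) {
                total += a.get(i, j);
            }
        }
        return total;
    }
}
```

Unlike a Java `double[][]`, whose rows are separate objects that may have different lengths, this layout is rectangular by construction, which is the property an optimizing compiler can exploit.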

Java vs. Fortran Performance
[Bar chart: Mflops (axis 0-250); benchmark labels not recoverable]
* IBM RS/6000 67MHz POWER2 (266 Mflops peak), AIX Fortran, HPJC

Organization
• Can we use Java for high performance on
  • 1 processor machines?
  • 10-100 processor machines?
    • A correctness model
    • Cycle detection for reordering analysis
    • Synchronization analysis
  • 1K-10K processor machines?
  • 100K-1M processor machines?

Parallel Programming
Parallel programming models and languages are distinguished primarily by:
1. How parallel processes/threads are created
  • Statically at program startup time
    • The SPMD model, 1 thread per processor
  • Dynamically during program execution
    • Through fork statements or other features
2. How the parallel threads communicate
  • Through message passing (send/receive)
  • By reading and writing to shared memory
Implicit parallelism not included here
Java

Two Problems
• Compiler writers would like to move code around
• The hardware folks also want to build hardware that dynamically moves operations around
• When is reordering correct?
  • Because the programs are parallel, there are more restrictions, not fewer
    • The reason is that we have to preserve semantics of what may be viewed by other processors

Sequential Consistency
• Given a set of executions from n processors, each defines a total order Pi.
• The program order is the partial order given by the union of these Pi’s.
• The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order.
    write x =1 read y 0
    write y =3 read z 2
    read x 1 read y 3
When this is serialized, the read and write semantics must be preserved

Sequential Consistency Intuition
• Sequential consistency says that:
  • The compiler may only reorder operations if another processor cannot observe it.
  • Writes (to variables that are later read) cannot result in garbage values being written.
  • The program behaves as if processors take turns executing instructions
• Comments:
  • In a legal execution, there will typically be many possible total orders – limited only by the reads and writes to shared variables
  • This is what you get if all reads and writes go to a single shared memory, and accesses are serialized at each memory cell

How Can Sequential Consistency Fail?
• The compiler saves a value in a register across multiple read accesses
  • This “moves” the later reads to the point of the first one
• The compiler saves a value in a register across writes
  • This “moves” the write until the register is written back from the standpoint of other processors.
• The compiler performs common subexpression elimination
  • As if the later expression reads are all moved to the first
  • Once contiguous in the instruction stream, they are merged
• The compiler performs other code motion
• The hardware has a write buffer
  • Reads may by-pass writes in the buffer (to/from different variables)
  • Some write buffers are not FIFO
• The hardware may have out-of-order execution

Weaker Correctness Models
• Many systems use weaker memory models:
  • Sun has TSO, PSO, and RMO
  • Alpha has its own model
• Some languages do as well
  • Java also has its own, currently undergoing redesign
  • C spec is mostly silent on threads – very weak on memory mapped I/O
• These are variants on the following, sequential consistency under proper synchronization:
  • All accesses to shared data must be protected by a lock, which must be a primitive known to the system
  • Otherwise, all bets are off (extreme)

Why Don’t Programmers Care?
• If these popular languages have used these weak models successfully, then what is wrong?
  • They don’t worry about what they don’t understand
  • Many people use compilers that are not very aggressive about reordering
  • The hardware reordering is non-deterministic, and may happen very infrequently in practice
• Architecture community is way ahead of us in worrying about these problems.
• Open problem: A hardware simulator and/or Java (or C) compiler that reorders things in the “worst possible way”

Using Software to Mask Hardware
• Recall our two problems:
  1. Compiler writers would like to move code around
  2. The hardware folks also want to build hardware that dynamically moves operations around
• The second can be viewed as a compiler problem
  • Weak memory models come with extra primitives, usually called fences or memory barriers
    • Write fence: wait for all outstanding writes from this processor to complete
    • Read fence: do not issue any read pre-fetches before this point

Use of Memory Fences
• Memory fences can turn a particular memory model into sequential consistency under proper synchronization:
  • Add a read fence to the acquire lock operation
  • Add a write fence to the release lock operation
• In general, a language can have a stronger model than the machine it runs on if the compiler is clever
• The language may also have a weaker model, if the compiler does any optimizations

Aside: Volatile
• Because Java and C have weak memory models at the language level, they give programmers a tool: volatile variables
  • These variables should not be kept in registers
  • Operations should not be reordered
  • Should have memory fences around accesses
• General problem
  • This is a big hammer which may be unnecessary
  • No fine-grained control over particular accesses or program phases (static notion)
  • To get SC using volatile, many variables must be volatile
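As an illustration of the volatile "hammer", here is a minimal Java sketch of a producer/consumer handshake on a flag. It assumes the Java memory model as revised in Java 5, which postdates these slides. Under that model a write to a volatile field happens-before a later read of it. The class and method names are hypothetical.

```java
// Hypothetical sketch; assumes Java 5+ volatile semantics.
final class FlagHandshake {
    static int data = 0;                   // ordinary shared field
    static volatile boolean flag = false;  // volatile: not cached in a register, not reordered

    // Runs one producer (the caller) and one consumer thread; returns what the consumer saw.
    static int run() {
        data = 0;
        flag = false;
        final int[] seen = new int[1];
        Thread consumer = new Thread(() -> {
            while (!flag) {
                // spin until the producer publishes
            }
            seen[0] = data; // ordered after the volatile read of flag
        });
        consumer.start();
        data = 1;     // ordinary write ...
        flag = true;  // ... published by the volatile write
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Without `volatile` on `flag`, a compiler would be free to keep the flag in a register, so the loop might never terminate, and the consumer would not be guaranteed to see `data == 1`. Only the flag is volatile here. Making every shared variable volatile is the coarse approach the slide warns about.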

How Can Compilers Help?
• To implement a stronger model on a weaker one:
  • Figure out what can legally be reordered
  • Do optimizations under these constraints
  • Generate necessary fences in resulting code
• Open problem: Can this be used to give Java a sequentially consistent semantics?
• What about C?

Compiler Analysis Overview
• When compiling sequential programs, compute dependencies:
    x = expr1;          y = expr2;
    y = expr2;   ==>    x = expr1;
  Valid if y not in expr1 and x not in expr2 (roughly)
• When compiling parallel code, we need to consider accesses by other processors.
  Initially flag = data = 0
    Proc A              Proc B
    data = 1;           while (flag == 0);
    flag = 1;           ... = ...data...;

Cycle Detection
• Processors define a “program order” on accesses from the same thread
  P is the union of these total orders
• Memory system defines an “access order” on accesses to the same variable
  A is access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P U A [Shasha & Snir]
    write data          read flag
    write flag          read data

Cycle Analysis Intuition
• Definition is based on execution model, which allows you to answer the question: Was this execution sequentially consistent?
• Intuition:
  • Time cannot flow backwards
  • Need to be able to construct total order
• Examples (all variables initially 0)
    write data 1        read flag 1
    write flag 1        read data 0

    write data 1        read data 1
    write flag 1        read flag 0
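The two examples can be checked mechanically. The sketch below encodes each execution as a directed graph whose edges are the program order P plus the access order A, then looks for a cycle with a depth-first search. The class names and the node numbering are assumptions for illustration. The access-order edges are derived by hand from the values read: a read that returns 1 comes after the write, and a read that returns the initial 0 comes before it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a small directed graph with DFS-based cycle detection.
final class ExecutionGraph {
    private final List<List<Integer>> succ = new ArrayList<>();

    ExecutionGraph(int nodes) {
        for (int i = 0; i < nodes; i++) succ.add(new ArrayList<>());
    }

    void addEdge(int from, int to) {
        succ.get(from).add(to);
    }

    boolean hasCycle() {
        int[] state = new int[succ.size()]; // 0 = unvisited, 1 = on DFS stack, 2 = finished
        for (int v = 0; v < state.length; v++) {
            if (state[v] == 0 && dfs(v, state)) return true;
        }
        return false;
    }

    private boolean dfs(int v, int[] state) {
        state[v] = 1;
        for (int w : succ.get(v)) {
            if (state[w] == 1) return true;            // back edge closes a cycle
            if (state[w] == 0 && dfs(w, state)) return true;
        }
        state[v] = 2;
        return false;
    }
}

final class CycleExamples {
    // Nodes: 0 = write data, 1 = write flag (Proc A); 2 and 3 = Proc B's two reads, in order.

    // First example: B reads flag == 1, then data == 0.
    static ExecutionGraph firstExample() {
        ExecutionGraph g = new ExecutionGraph(4);
        g.addEdge(0, 1); // P: write data before write flag
        g.addEdge(2, 3); // P: read flag before read data
        g.addEdge(1, 2); // A: read flag saw 1, so it follows write flag
        g.addEdge(3, 0); // A: read data saw 0, so it precedes write data
        return g;
    }

    // Second example: B reads data == 1, then flag == 0.
    static ExecutionGraph secondExample() {
        ExecutionGraph g = new ExecutionGraph(4);
        g.addEdge(0, 1); // P: write data before write flag
        g.addEdge(2, 3); // P: read data before read flag
        g.addEdge(0, 2); // A: read data saw 1, so it follows write data
        g.addEdge(3, 1); // A: read flag saw 0, so it precedes write flag
        return g;
    }
}
```

The first graph contains the cycle write data, write flag, read flag, read data, back to write data, so no total order exists and that execution is not sequentially consistent. The second graph is acyclic, and one valid total order is write data, read data, read flag, write flag.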

Cycle Detection Generalization
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor
• Can simplify the analysis by assuming all processors run a copy of the same code
write x write y read y
read y write x

Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “conflict” edges
  • Bi-directional edge between accesses to the same variable in which at least one is a write
  • It is still correct if the conflict edge set is a superset of the reality
• Let the “delay set” D be all edges from P that are part of a minimal cycle
  • The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)
    write y write z
    read y write z
    read x

Cycle Detection in Practice
• Cycle detection was implemented in a prototype version of the Split-C and Titanium compilers.
  • Split-C version used many simplifying assumptions.
  • Titanium version had too many conflict edges.
• What is needed to make it practical?
  • Finding possibly-concurrent program blocks
    • Use SPMD model rather than threads to simplify
    • Or apply data race detection work for Java threads
  • Compute conflict edges
    • Need good alias analysis
    • Reduce size by separating shared/private variables
  • Synchronization analysis

Synchronization Analysis
• Enrich language with synchronization primitives
  • Lock/Unlock or “synchronized” blocks
  • Post/Wait or Wait/Notify on condition variables
  • Global barriers: all processors wait at barrier
• Compiler can exploit understanding of synchronization primitives to reduce cycles
  • Note: use of language primitives for synchronization may aid in optimization, but “rolling your own” is still correct

Edge Ordering
• Post-Wait operations on a variable can be ordered
  • Although correct to treat these as shared memory accesses, we can get leverage by ordering them
• Then turn edges
  • ? post c into delay edges
  • wait c ? into delay edges
• And orient corresponding conflict edges
    post c
    wait c …

Edge Deletion
• In SPMD programs, the most common form of synchronization is global barrier
• If we add to the delay set edges of the form
  • ? barrier
  • barrier ?
  then we can remove corresponding conflict edges
    … barrier
    barrier …

Synchronization in Cycle Detection
• Iterative algorithm
  • Compute delay set restrictions in which at least one operation is a synchronization operation
  • Perform edge orientation and deletion
  • Compute delay set on remaining conflict edges
• Two important details
  • For locks (and synchronized) we need good alias information about the lock variables. (Conservative would probably work…)
  • For barriers, need to line up corresponding barriers

Static Analysis for Barriers
• Lining up barriers is needed for cycle detection.
• Mis-aligned barriers are also a source of bugs inside branches or loops.
• Includes other global communication primitives: barrier, broadcast, reductions
• Titanium uses barrier analysis, based on the idea of single variables and methods:
  • A “single” method is one called by all procs
      public single static void allStep(...)
  • A “single” variable has same value on all procs
      int single timestep = 0;

Single Analysis
• The underlying requirement is that barriers only match the same textual instance
• Complication from conditionals:
    if (this processor owns some data) {
      compute on it
      barrier
    }
• Hence the use of “single” variables in Titanium
• If a conditional or loop block contains a barrier, all processors must execute it
  • Expressions in such loop headers, if statements, etc. must contain only single variables

Single Variable Example in Titanium
• Barriers and single in N-body Simulation
    class ParticleSim {
      public static void main (String [] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++){
          // read all particles and compute forces on mine
          computeForces(…);
          Ti.barrier();
          // write to my particles using new forces
          spreadParticles(…);
          Ti.barrier();
        }
      }
    }
• Single methods are automatically inferred, variables not
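The same two-barrier structure can be imitated in standard Java. The sketch below is an analogy, not Titanium. It assumes one Java thread per "processor" and uses `java.util.concurrent.CyclicBarrier` in place of `Ti.barrier()`. The class name, the toy "force" computation, and the one-particle-per-processor layout are all hypothetical.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical SPMD-style simulation: every thread runs the same loop with aligned barriers.
final class SpmdBarrierSketch {
    static long[] run(int numProcs, int steps) {
        final long[] pos = new long[numProcs]; // one "particle" per processor
        for (int p = 0; p < numProcs; p++) pos[p] = p + 1;

        final CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    // The trip count is identical on every thread, like a "single" loop bound.
                    for (int t = 0; t < steps; t++) {
                        long force = 0;
                        for (long x : pos) force += x; // read all particles
                        barrier.await();               // nobody writes until all have read
                        pos[me] += force;              // write only my particle
                        barrier.await();               // nobody reads until all have written
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }
        for (Thread t : procs) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return pos;
    }
}
```

If `steps` differed between threads, some threads would wait at a barrier that others never reach. Titanium's single analysis is meant to rule out that kind of misalignment at compile time. Plain Java offers no such static check.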

Some Open Problems
• What is the right semantic model for shared memory parallel languages?
• Is cycle detection practical on real languages?
  • How well can synchronization be analyzed?
  • Aliases between non-synchronizing variables?
  • Can we distinguish between shared and private data?
  • What is the complexity on real applications?
• Analysis in programs with dynamic thread creation

Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
    • Programming model landscape
    • Global address space language support
    • Optimizing local pointers
    • Optimizing remote pointers
  • 100K-1M processor machine?

Programming Models at Scale
• Large scale machines are mostly
  • Clusters of uniprocessors or SMPs
  • Some have hardware support for remote memory access
    • Shmem in Cray T3E
    • GM layer in Myrinet
    • DSM on SGI Origin 2K
• Yet most programs are written in:
  • SPMD model
  • Message passing
• Can we use a simpler, shared memory model?
  • On Origin, yes, but what about large machines?

Global Address Space
• To run shared memory programs on distributed memory hardware, we replace references (pointers) by global ones:
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs (functionality is correct)
  • Uniform programming model across machines
  • Especially true for cluster of SMPs
• Usual implementation
  • Each reference contains:
    • Processor id (or process id on cluster of SMPs)
    • And a memory address on that processor

Use of Global / Local
• Global pointers are more expensive than local
  • When data is remote, a dereference turns into a remote read or write, which is a message call of some kind
  • When the data is not remote, there is still an overhead
    • space (processor number + memory address)
    • dereference time (check to see if local)
• Conclusion: not all references should be global -- use normal references when possible.
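The representation and the dereference check can be sketched in a few lines of Java. This is a single-process simulation under stated assumptions, not how the Titanium runtime is implemented. The names `GlobalRef` and `GlobalHeapSketch` are hypothetical. Each "processor" owns one `int[]` standing in for its local memory, and a counter stands in for message traffic.

```java
// Hypothetical global reference: (processor id, address on that processor).
final class GlobalRef {
    final int procId;
    final int address;

    GlobalRef(int procId, int address) {
        this.procId = procId;
        this.address = address;
    }
}

// Hypothetical simulation of per-processor heaps with a locality check on each access.
final class GlobalHeapSketch {
    private final int[][] heaps;   // heaps[p] stands in for processor p's local memory
    private int remoteAccesses = 0; // each remote access stands in for one message

    GlobalHeapSketch(int numProcs, int wordsPerProc) {
        heaps = new int[numProcs][wordsPerProc];
    }

    int read(int thisProc, GlobalRef ref) {
        if (ref.procId == thisProc) {
            return heaps[thisProc][ref.address]; // local: only the check is paid
        }
        remoteAccesses++;                        // remote: would be a message round trip
        return heaps[ref.procId][ref.address];
    }

    void write(int thisProc, GlobalRef ref, int value) {
        if (ref.procId != thisProc) {
            remoteAccesses++;                    // remote: would be a message
        }
        heaps[ref.procId][ref.address] = value;
    }

    int remoteAccesses() {
        return remoteAccesses;
    }
}
```

Both overheads from the slide are visible here: `GlobalRef` carries two fields instead of one address, and every access pays the `procId` comparison even when the data turns out to be local.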

Explicit Approach to Global/Local
• A common approach in parallel languages is to distinguish between local and global (“possibly remote”) pointers in the language.
• Two variations are:
  • Make global the default – nice for porting shared memory programs
  • Make local the default – nice for calling libraries on a single processor that were built for uniprocessors
• Titanium uses global default, with local declarations in important sections

Global Address Space
• Processes allocate locally
• References can be passed to other processes
    class C { int val; ... }
    C gv;        // global pointer
    C local lv;  // local pointer

    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...; ... = gv.val;
[Diagram: Process 0 and the other processes, each holding lv and gv pointers into a LOCAL HEAP]

Local Pointer Analysis
• Compiler can infer locals using Local Qualification Inference
• Data structures must be well partitioned
[Bar chart "Effect of LQI": running time (sec, axis 0-250), Original vs. After LQI, for applications cannon, lu, sample, gsrb, poison]

Remote Accesses
• What about remote accesses? In this case, the cost of the storage and extra check is small relative to the message cost.
• Strategies for reducing remote accesses:
  • Use non-blocking writes – do not wait for them to be performed
  • Use prefetching for reads – ask before data is needed
  • Aggregate several accesses to the same processor together
• All of these involve reordering or the potential for reordering

Communication Optimizations
• Data on an old machine, UCB NOW, using a simple subset of C
[Chart: Time (normalized); data not recoverable]

Example communication costs
latency (α) and bandwidth (β) measured in units of flops; β measured per 8-byte word

Machine               Year       α      β   Mflop rate per proc
CM-5                  1992    1900     20     20
IBM SP-1              1993    5000     32    100
Intel Paragon         1994    1500    2.3     50
IBM SP-2              1994    7000     40    200
Cray T3D (PVM)        1994    1974     28     94
UCB NOW               1996    2880     38    180
UCB Millennium        2000   50000    300    500

SGI Power Challenge   1995    3080     39    308
SUN E6000             1996    1980      9    180
SGI Origin 2K         2000    5000     25    500
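These numbers suggest why aggregating accesses pays off. The sketch below assumes a simple linear cost model that the slide does not state explicitly: a message carrying n 8-byte words costs latency + n × per-word cost, both in flops. It also assumes that the two numeric columns after the year are the latency and the per-word cost. The class and method names are hypothetical.

```java
// Hypothetical cost model: cost(n words) = alpha + beta * n, in units of flops.
final class MessageCostSketch {
    // Cost of one message carrying `words` 8-byte words.
    static double messageCost(double alpha, double beta, long words) {
        return alpha + beta * words;
    }

    // Cost of `accesses` separate one-word messages: the latency is paid every time.
    static double separateCost(double alpha, double beta, long accesses) {
        return accesses * messageCost(alpha, beta, 1);
    }

    // Cost of one aggregated message carrying all `accesses` words: the latency is paid once.
    static double aggregatedCost(double alpha, double beta, long accesses) {
        return messageCost(alpha, beta, accesses);
    }
}
```

With the UCB NOW row (latency 2880, per-word cost 38), 100 separate one-word accesses cost 100 × (2880 + 38) = 291,800 flops under this model, while one aggregated 100-word message costs 2880 + 100 × 38 = 6,680 flops, roughly 44 times less.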

Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
  • 100K-1M processor machine?
    • Kinds of machines
    • Open problems

Future Machines
• IBM is building a 1M processor Blue Gene machine
  • Expect a processor failure per day
  • Would like to run 1 job for a year
• “The grid” is made from harnessing unused cycles across the internet
  • Need to kill job if owner wants to use the machine
  • Frequent failures
• All of our high performance programming models assume the machine works

Possible Software Model
• System hides some faults at each layer
  • Lower levels send “hints” upward
  • Lower level has control, but upper level can optimize
Byzantine faults
Fail-stop faults
Performance faults (process pairs, checkpoints)
Uniform machine (dynamic load balancing)
Over-partitioned applications (Java,Titanium,…)

References
• Serial Java performance:
  • Roldan Pozo, Jose Moreira et al., Titanium group
• Java memory models
  • Bill Pugh, Jaejin Lee
• Cycle analysis
  • Dennis Shasha and Marc Snir, Arvind Krishnamurthy and Kathy Yelick, Jaejin Lee and Sam Midkiff and David Padua
• Synchronization analysis
  • Data race detection: many people
  • Barriers: Alex Aiken and David Gay
• Global pointers
  • See UPC, Split-C, AC, CC++, Titanium, and others
  • Local Qualification Inference: Ben Liblit and Alex Aiken
• Non-blocking communication
  • Active messages, Global Arrays (PNL), and others

Summary
• Opportunities to improve programmability
  • Simplify programmer’s model (e.g., Java with sequential consistency)
  • Solve harder compiler problem (use it on “the grid”)
• Basic requirements understood, but not
  • Usability in practice on real applications
  • Interaction with other analyses
  • Complexity
• Current and future machines are harder
  • More processors, more levels of hierarchy
  • Less reliable overall, because of scale

Backup Slides

Outline
• Java-like Languages
  • Language support for performance
  • Optimizations
  • Compilation models
• Parallel
  • Machine models
  • Language models
  • Memory models
  • Analysis
• Distributed
  • Remote access
  • Failures

Data from Dan
• Origin 2000 (128 CPU configuration):
  • local memory latency: 300 ns
  • remote memory latency: 900 ns avg.
  • bandwidth: 160 MB/sec per CPU
  • CPU: MIPS R10000 195MHz (390 MFLOPS) or 250MHz (500 MFLOPS)
  • note the hardware supports up to 4 outstanding non-blocking references to remote cache lines (SGI obviously agrees with you)
• Millennium cluster:
  • CPU: 4-way Intel P3-700
  • AMUDP performance (100Mbit half-duplex switched ethernet, kernel UDP driver):
    • 100 microsec latency, 12 MB/sec bandwidth (both actual measured)
  • Myrinet PCI64C performance through GM:
    • (SAN-2000 network) 240 MB/sec in each direction, 7-9 microsec latency
• Cray T3E-900 (the version NERSC has):
  • CPU: 450 MHz, 900 Mflops
  • local memory: 600 MB/sec actual, 280 ns latency
  • remote memory:
    • SHMEM: 2 microsec, 350 MB/sec
    • PVM: 11 microsec, 154 MB/sec
    • MPI: 14 microsec, 260 MB/sec

Data from Millennium Home page
• Poweredge 2-way SMPs (500 MHz Pentium IIIs) running Linux 2.2.5
• Each SMP has a Lanai 7.2 card:
  • Round trip time: 32-33 microseconds for small messages
  • BW: 59.5 MB/s for 16 KB msgs
  • Gap (time between msg sends in steady state): 18-19 microseconds
• Page: Dec 1999
Value of optimizations
Also for I/O (Dan’s stuff)

Parallel Language Landscape
• Two axes (2-d grid)
  • Parallelism (control) model
    • Static (SPMD)
    • Dynamic (threads)
  • Communication/Sharing model
    • Shared memory
    • Global address space
    • Message passing
• In the 2-100 processor range, one can buy shared memory machines

Parallel Language Landscape
• Implicitly parallel (serial semantics)
  • Sequential – compiler too hard
  • Data parallel – compiler too hard
• Explicitly parallel (parallel semantics)
  • OpenMP – compiler too hard (for large machines)
  • Threads – the sweet spot
    • People use it (java, vector supers)
  • Message passing (e.g., MPI) – programming too hard

The Economics of High Performance
• The failure (or delay) of compilers for data parallel languages in the 90s -> most programs for large scale machines written in MPI
• Programming community is elite
• Many applications with parallelism don’t use it, because it’s too hard

Backup Slides II

Titanium Group
• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls (SDSC)
• Peter McCorquodale (LBNL)
• Andrew Begel
• Dan Bonachea
• Tyson Condie
• David Gay
• Ben Liblit
• Chang Sun Lin
• Geoff Pike
• Siu Man Yau

Target Problems
• Many modeling problems in astrophysics, biology, material science, and other areas require
  • Enormous range of spatial and temporal scales
• To solve interesting problems, one needs:
  • Adaptive methods
  • Large scale parallel machines
• Titanium is designed for methods with
  • Structured grids
  • Locally-structured grids (AMR)

Common Requirements
• Algorithms for numerical PDE computations are
  • communication intensive
  • memory intensive
• AMR makes these harder
  • more small messages
  • more complex data structures
  • most of the programming effort is debugging the boundary cases
  • locality and load balance trade-off is hard

A Little History
• Most parallel programs are written using explicit parallelism, either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in C or Java
    • Usually for non-scientific applications
    • Easier to program
• Take the best features of both for Titanium

Why Java for Scientific Computing?
• Computational scientists use increasingly complex models
  • Popularized C++ features: classes, overloading, pointer-based data structures
• But C++ is very complicated
  • easy to lose performance and readability
• Java is a better C++
  • Safe: strongly typed, garbage collected
  • Much simpler to implement (research vehicle)
  • Industrial interest as well: IBM HP Java

Summary of Features Added to Java
• Multidimensional arrays with iterators
• Immutable (“value”) classes
• Templates
• Operator overloading
• Scalable SPMD parallelism
• Global address space
• Checked Synchronization
• Zone-based memory management
• Scientific Libraries

Lecture Outline
• Language and compiler support for uniprocessor performance
  • Immutable classes
  • Multidimensional Arrays
  • Foreach
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions

Java: A Cleaner C++
• Java is an object-oriented language
  • classes (no standalone functions) with methods
  • inheritance between classes
• Documentation on web at java.sun.com
• Syntax similar to C++
    class Hello {
      public static void main (String [] argv) {
        System.out.println("Hello, world!");
      }
    }
• Safe: strongly typed, auto memory management
• Titanium is (almost) strict superset

Sequential Performance

Ultrasparc:
               C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY          1.4s            6.8s          1.5s              7%
3D multigrid   12s             (blank)       22s               83%
2D multigrid   5.4s            (blank)       6.2s              15%
EM3D           0.7s            1.8s          1.0s              42%

Pentium II:
               C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY          1.8s            (blank)       2.3s              27%
3D multigrid   23.0s           (blank)       20.0s             -13%
2D multigrid   7.3s            (blank)       5.5s              -25%
EM3D           1.0s            (blank)       1.6s              60%

Performance results from 98; new IR and optimization framework almost complete.

Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
  • Templates
  • Operator overloading
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions
Example later

Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
  • SPMD execution
  • Barriers and single
  • Explicit Communication
  • Implicit Communication (global and local references)
  • More on Single
  • Synchronized methods and blocks (as in Java)
• Applications and application-level libraries
• Summary and future directions

SPMD Execution Model
• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world
    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }
• Any non-trivial program will have communication and synchronization

SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Basic control done using
  • Ti.numProcs(): total number of processors
  • Ti.thisProc(): number of the executing processor
• Bulk-synchronous style
    read all particles and compute forces on mine
    Ti.barrier();
    write to my particles using new forces
    Ti.barrier();
• This is neither message passing nor data-parallel
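The bulk-synchronous pattern can be approximated in plain Java for readers without a Titanium compiler. The sketch below is an emulation only: Java threads stand in for Titanium processes, a CyclicBarrier stands in for Ti.barrier(), and the names SpmdSketch, run, position, and force, along with the toy force formula, are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;

/** Emulates Titanium-style SPMD execution with plain Java threads (illustration only). */
final class SpmdSketch {
    /** Runs one bulk-synchronous step on numProcs threads and returns the new positions. */
    static double[] run(int numProcs) {
        final double[] position = new double[numProcs]; // one "particle" per processor
        final double[] force = new double[numProcs];
        for (int i = 0; i < numProcs; i++) position[i] = i;
        final CyclicBarrier barrier = new CyclicBarrier(numProcs); // stands in for Ti.barrier()

        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int thisProc = p; // stands in for Ti.thisProc()
            procs[p] = new Thread(() -> {
                try {
                    // Phase 1: read all particles, compute the force on mine.
                    double f = 0.0;
                    for (int j = 0; j < numProcs; j++) f += position[j] - position[thisProc];
                    force[thisProc] = f;
                    barrier.await(); // nobody writes until everybody has finished reading
                    // Phase 2: write only my own particle.
                    position[thisProc] += 0.5 * force[thisProc];
                    barrier.await(); // all updates are complete before the next step
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return position;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4)));
    }
}
```

The first barrier separates the read phase from the write phase, so no thread sees a half-updated set of particles. Without it, the result would depend on thread timing.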

Explicit Communication: Broadcast
• Broadcast is a one-to-all communication
    broadcast <value> from <processor>
• For example:
    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;
• The processor number in the broadcast must be single; all constants are single.
• The allCount variable could be declared single.
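As a rough illustration of what broadcast means, the following Java sketch emulates the collective with threads. It does not show how Titanium implements broadcast. The class BroadcastSketch, its broadcast and demo methods, and the constant 42 standing in for computeCount() are all hypothetical.

```java
import java.util.concurrent.CyclicBarrier;

/** Emulates "broadcast <value> from <processor>" with Java threads (illustration only). */
final class BroadcastSketch {
    private final CyclicBarrier barrier;
    private volatile int slot; // value published by the root

    BroadcastSketch(int numProcs) {
        barrier = new CyclicBarrier(numProcs);
    }

    /** Collective call: every thread passes its own value, all receive the root's value. */
    int broadcast(int value, int root, int thisProc) {
        try {
            if (thisProc == root) slot = value;
            barrier.await();   // the root's write is complete
            int result = slot;
            barrier.await();   // everyone has read before the slot can be reused
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Mirrors the slide: only processor 0 computes count, all end up with allCount. */
    static int[] demo(int numProcs) {
        final BroadcastSketch comm = new BroadcastSketch(numProcs);
        final int[] allCount = new int[numProcs];
        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int thisProc = p;
            procs[p] = new Thread(() -> {
                int count = 0;
                if (thisProc == 0) count = 42; // stands in for computeCount()
                allCount[thisProc] = comm.broadcast(count, 0, thisProc);
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return allCount;
    }
}
```

Every thread must make the call, including those whose local value is ignored. That is the sense in which the operation is collective.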

Example of Data Input
• Same example, but reading from keyboard
• Shows use of Java exceptions
    int myCount = 0;
    int single allCount = 0;
    if (Ti.thisProc() == 0)
      try {
        DataInputStream kb = new DataInputStream(System.in);
        myCount = Integer.valueOf(kb.readLine()).intValue();
      } catch (Exception e) {
        System.err.println("Illegal Input");
      }
    allCount = broadcast myCount from 0;

Example: A Distributed Data Structure
• Data can be accessed across processor boundaries
[Figure: Proc 0 and Proc 1, with local_grids and all_grids]

Example: Setting Boundary Conditions
    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }

Explicit Communication: Exchange
• To create shared data structures
  • each processor builds its own piece
  • pieces are exchanged (for objects, just exchange pointers)
• Exchange primitive in Titanium
    int [1d] single allData;
    allData = new int [0:Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);
• E.g., on 4 procs, each will have a copy of allData:
    0 2 4 6
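The effect of exchange can be imitated in Java to make the all-to-all semantics concrete. This is a sketch under stated assumptions, not Titanium's implementation: threads play the role of processors, a shared staging array plays the role of the network, and ExchangeSketch and exchangeDemo are hypothetical names.

```java
import java.util.concurrent.CyclicBarrier;

/** Emulates allData.exchange(Ti.thisProc()*2) with Java threads (illustration only). */
final class ExchangeSketch {
    /** Returns, for each emulated processor, its private copy of allData. */
    static int[][] exchangeDemo(int numProcs) {
        final int[] staging = new int[numProcs];     // slot p holds processor p's contribution
        final int[][] allData = new int[numProcs][]; // allData[p] is processor p's copy
        final CyclicBarrier barrier = new CyclicBarrier(numProcs);

        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int thisProc = p;
            procs[p] = new Thread(() -> {
                try {
                    staging[thisProc] = thisProc * 2;    // contribute my piece
                    barrier.await();                     // wait until every piece is in place
                    allData[thisProc] = staging.clone(); // take my own copy of the full array
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return allData;
    }
}
```

With four emulated processors, every row of the returned array holds 0, 2, 4, 6, matching the slide.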

Building Distributed Structures
• Distributed structures are built with exchange:
    class Boxed {
      public Boxed (int j) { val = j; }
      public int val;
    }

    Object [1d] single allData;
    allData = new Object [0:Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));

Distributed Data Structures
• Building distributed arrays:
    RectDomain <1> single allProcs = [0:Ti.numProcs()-1];
    RectDomain <1> myParticleDomain = [0:myPartCount-1];
    Particle [1d] single [1d] allParticle = new Particle [allProcs][1d];
    Particle [1d] myParticle = new Particle [myParticleDomain];
    allParticle.exchange(myParticle);
• Now each processor has an array of pointers, one to each processor’s chunk of particles

Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
  • Gene sequencing application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
• Summary and future directions

Unstructured Mesh Kernel
• EM3D: Relaxation on a 3D unstructured mesh
• Speedup on Ultrasparc SMP
• Simple kernel: mesh not partitioned.
[Figure: em3d speedup (axis 0 to 8) on 1, 2, 4, and 8 processors]

AMR Poisson
• Poisson Solver [Semenzato, Pike, Colella]
  • 3D AMR
  • finite domain
  • variable coefficients
  • multigrid across levels
• Performance of Titanium implementation
  • Sequential multigrid performance +/- 20% of Fortran
  • On fixed, well-balanced problem of 8 patches, each 72^3
  • parallel speedups of 5.5 on 8 processors
[Figure: grid hierarchy with Level 0, Level 1, and Level 2]

Scalable Poisson Solver
• MLC for Finite-Differences by Balls and Colella
• Poisson equation with infinite boundaries
  • arises in astrophysics, some biological systems, etc.
• Method is scalable
  • Low communication
• Performance on
  • SP2 (shown) and T3E
  • scaled speedups
  • nearly ideal (flat)
• Currently 2D and non-adaptive
[Figure: Time/fine-patch-iter/proc (axis 0 to 1.2) on 1, 4, and 16 processors; series 129x129/65x65, 129x129/33x33, 257x257/129x129, 257x257/65x65]

AMR Gas Dynamics
• Developed by McCorquodale and Colella
• Merge with Poisson underway for self-gravity
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package

Distributed Array Libraries
• There are some “standard” distributed array libraries associated with Titanium
• They hide the details of exchange, indirection within the data structure, etc.
• Libraries benefit from support for templates

Distributed Array Library Fragment
    template <class T, int single arity> public class DistArray {
      RectDomain <arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain <arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point <arity> p, T value) {
        getHomingSubMatrix (p) [p] = value;
      }
    }
    template DistArray <double, 2> single A =
      new template DistArray <double, 2> ([[0, 0] : [aHeight, aWidth]]);
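The idea behind getHomingSubMatrix, which is to find the piece that owns an index and then index into it, can be shown in a simplified Java form. The sketch below is one-dimensional and single-process, and it assumes a block distribution that the slide does not specify. DistArraySketch, ownerOf, and getHomingSubArray are hypothetical names and not part of the Titanium library.

```java
import java.util.ArrayList;
import java.util.List;

/** 1-D block-distributed array: a list of per-"processor" blocks (illustration only). */
final class DistArraySketch<T> {
    private final int blockSize;
    private final List<List<T>> subArrays = new ArrayList<>(); // one block per processor

    /** Splits indices 0..length-1 into numProcs contiguous blocks (length must be positive). */
    DistArraySketch(int length, int numProcs) {
        blockSize = (length + numProcs - 1) / numProcs; // ceiling division
        for (int p = 0; p < numProcs; p++) {
            int lo = Math.min(length, p * blockSize);
            int hi = Math.min(length, lo + blockSize);
            List<T> block = new ArrayList<>();
            for (int i = lo; i < hi; i++) block.add(null);
            subArrays.add(block);
        }
    }

    /** Index of the block (processor) that owns global index i. */
    int ownerOf(int i) {
        return i / blockSize;
    }

    /** Plays the role of getHomingSubMatrix on the slide. */
    private List<T> getHomingSubArray(int i) {
        return subArrays.get(ownerOf(i));
    }

    /** Sets the element at global index i to value. */
    void set(int i, T value) {
        getHomingSubArray(i).set(i % blockSize, value);
    }

    /** Reads the element at global index i. */
    T get(int i) {
        return getHomingSubArray(i).get(i % blockSize);
    }
}
```

Callers use only global indices. The block lookup is hidden inside set and get, which is the kind of detail the slide says the libraries hide.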

Immersed Boundary Method (future)
• Immersed boundary method [Peskin, McQueen]
  • Used in heart model, platelets, and others
  • Currently uses FFT for Navier-Stokes solver
  • Begun effort to move solver and full method into Titanium

Implementation
• Strategy
  • Titanium into C
  • Solaris or Posix threads for SMPs
  • Lightweight communication for MPPs/Clusters
• Status: Titanium runs on
  • Solaris or Linux SMPs and uniprocessors
  • Berkeley NOW
  • SDSC Tera, SP2, T3E (also NERSC)
  • SP3 port underway

Using Titanium on NPACI Machines
• Send mail to us if you are interested: [email protected]
• Has been installed in individual accounts
  • t3e and BH: upgrade needed
• On uniprocessors and SMPs
  • available from the Titanium home page
  • http://www.cs.berkeley.edu/projects/titanium
  • other documentation available as well

Calling Other Languages
• We have built interfaces to
  • PETSc: scientific library for finite element applications
  • Metis: graph partitioning library
  • KeLP: starting work on this
• Two issues with cross-language calls
  • accessing Titanium data structures (arrays) from C
    • possible because Titanium arrays have the same format on the inside
  • having a common message layer
    • Titanium is built on lightweight communication

Future Plans
• Improved compiler optimizations for scalar code
  • large loops are currently +/- 20% of Fortran
  • working on small loop performance
• Packaged solvers written in Titanium
  • Elliptic and hyperbolic solvers, both regular and adaptive
• New application collaboration
  • Peskin and McQueen (NYU) with Colella (LBNL)
  • Immersed boundary method, currently used for heart simulation, platelet coagulation, and others
Backup Slides

Example: Domain
    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) {
      ...
    }
[Figure: r spans (0, 0) to (6, 4); r + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5)]
• Domains in general are not rectangular
• Built using set operations
  • union, +
  • intersection, *
  • difference, -
• Example is red-black algorithm
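To see which points end up in red, the domain operations can be mimicked in Java with ordinary sets. This is only a sketch: it assumes inclusive upper bounds, as the corner coordinates in the figure suggest, and the class DomainSketch with its helpers pt, rect, shift, and union is invented here. It says nothing about how Titanium represents domains internally.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Models 2-D Titanium domains as sets of points (illustration only). */
final class DomainSketch {
    /** Packs a 2-D point into one long so that it can be stored in a Set. */
    static long pt(int x, int y) {
        return ((long) x << 32) | (y & 0xffffffffL);
    }

    /** Points of the strided rectangle [lb : ub : stride], bounds inclusive. */
    static Set<Long> rect(int lbx, int lby, int ubx, int uby, int sx, int sy) {
        Set<Long> s = new LinkedHashSet<>();
        for (int x = lbx; x <= ubx; x += sx)
            for (int y = lby; y <= uby; y += sy)
                s.add(pt(x, y));
        return s;
    }

    /** Translation by (dx, dy), like r + [1, 1] on the slide. */
    static Set<Long> shift(Set<Long> d, int dx, int dy) {
        Set<Long> s = new LinkedHashSet<>();
        for (long p : d) s.add(pt((int) (p >> 32) + dx, (int) p + dy));
        return s;
    }

    /** Set union, the + operator on two domains. */
    static Set<Long> union(Set<Long> a, Set<Long> b) {
        Set<Long> s = new LinkedHashSet<>(a);
        s.addAll(b);
        return s;
    }

    public static void main(String[] args) {
        Set<Long> r = rect(0, 0, 6, 4, 2, 2);       // [lb : ub : [2, 2]]
        Set<Long> red = union(r, shift(r, 1, 1));   // r + (r + [1, 1])
        System.out.println(r.size() + " points in r, " + red.size() + " points in red");
    }
}
```

Every point in red has coordinates of the same parity, which makes it one color of a checkerboard. It cannot be written as a single strided rectangle, so it illustrates the claim that domains in general are not rectangular.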

Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid
    void gsrb() {
      boundary (phi);
      for (Domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d)    // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)])*4
                    + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                    - 20.0*phi[q] - k*rhs[q]) * 0.05;
        foreach (q in d) phi[q] += res[q];
      }
    }
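For comparison, the same sweep can be written over plain Java arrays. This is a sketch, not the Titanium multigrid code. It assumes that red means points with i + j even, that only interior points are updated, and that the caller has already filled the boundary cells (the job of boundary(phi) on the slide). GsrbSketch is a hypothetical class name.

```java
/** Red-black Gauss-Seidel sweep with the 9-point stencil from the slide (illustration only). */
final class GsrbSketch {
    /** One sweep over the interior of phi: red points first, then black points. */
    static void gsrb(double[][] phi, double[][] rhs, double k) {
        int n = phi.length, m = phi[0].length;
        double[][] res = new double[n][m];
        for (int color = 0; color < 2; color++) { // 0 = red (i + j even), 1 = black
            // Compute the correction at every point of this color. The order does
            // not matter, because a point reads phi but writes only res.
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < m - 1; j++)
                    if ((i + j) % 2 == color)
                        res[i][j] = ((phi[i - 1][j] + phi[i + 1][j]
                                      + phi[i][j + 1] + phi[i][j - 1]) * 4
                                     + (phi[i - 1][j + 1] + phi[i - 1][j - 1]
                                        + phi[i + 1][j + 1] + phi[i + 1][j - 1])
                                     - 20.0 * phi[i][j] - k * rhs[i][j]) * 0.05;
            // Apply the corrections for this color before moving to the next one.
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < m - 1; j++)
                    if ((i + j) % 2 == color)
                        phi[i][j] += res[i][j];
        }
    }
}
```

Splitting each color into a compute pass and an update pass is what makes unordered iteration safe: no pass both reads and writes phi.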

Recent Progress in Titanium
• Distributed data structures built with global refs
  • communication may be implicit, e.g.: a[j] = a[i].dx;
  • used extensively in AMR algorithms
• Runtime layer optimizes
  • bulk communication
  • bulk I/O
• Runs on
  • t3e, SP2, and Tera
• Compiler analysis optimizes
  • global references converted to local ones when possible

Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: Accesses to shared variables that are not synchronized have undefined behavior.
• Use synchronization to control access to shared variables.
  • barriers
  • synchronized methods and blocks
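Since synchronized methods work in Titanium as they do in Java, a plain Java example shows the second option. The class SyncCounterSketch and its demo method are made up for illustration; the point is that every access to the shared field goes through a synchronized method.

```java
/** Shared counter whose accesses are all synchronized (illustration only). */
final class SyncCounterSketch {
    private int count = 0;

    /** Only one thread at a time can run a synchronized method on this object. */
    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }

    /** Starts numThreads threads that each increment the counter perThread times. */
    static int demo(int numThreads, int perThread) {
        final SyncCounterSketch counter = new SyncCounterSketch();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }
}
```

If the synchronized keyword were removed from increment, the increments could interleave and the final count could come out lower than numThreads * perThread.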

Compiler Techniques Outline
• Analysis and Optimization of parallel code
  • Tolerate network latency: Split-C experience
  • Hardware trends and reordering
  • Semantics: sequential consistency
  • Cycle detection: parallel dependence analysis
  • Synchronization analysis: parallel flow analysis
• Summary and future directions

Parallel Optimizations
• Titanium compiler performs parallel optimizations
  • communication overlap and aggregation
• Two new analyses
  • synchronization analysis: the parallel analog to control flow analysis for serial code [Gay & Aiken]
  • shared variable analysis: the parallel analog to dependence analysis [Krishnamurthy & Yelick]
• local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]

Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
  • global address space
  • SPMD parallelism
• But, Split-C had non-blocking accesses built in to tolerate network latency on remote read/write
    int *global p;
    x := *p;          /* get */
    *p := 3;          /* put */
    sync;             /* wait for my puts/gets */
• Also one-way communication
    *p :- x;          /* store */
    all_store_sync;   /* wait globally */
• Conclusion: useful, but complicated
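The split-phase idea (start a remote read, keep computing, wait only when the value is needed) has a loose analogue in Java's CompletableFuture. The sketch below is an analogy, not Split-C semantics: the remote memory is just a static field read on another thread, and SplitPhaseSketch, remoteValue, and demo are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;

/** Loose Java analogue of a Split-C split-phase get followed by sync (illustration only). */
final class SplitPhaseSketch {
    static int remoteValue = 7; // pretend this lives on another processor

    static int demo() {
        // x := *p;   start the get, but do not wait for the result
        CompletableFuture<Integer> pendingGet =
                CompletableFuture.supplyAsync(() -> remoteValue);

        // Independent local work overlaps with the outstanding get.
        int local = 0;
        for (int i = 1; i <= 10; i++) local += i;

        // sync;      block until the get has completed
        int x = pendingGet.join();
        return x + local;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The programmer has to place the wait correctly. Reading x before the sync point would be a bug, which is part of why the slide calls the feature useful but complicated.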

Sources of Memory/Comm. Overlap
• Would like compiler to introduce put/get/store.
• Hardware also reorders
  • out-of-order execution
  • write buffers with read bypass
  • non-FIFO write buffers
  • weak memory models in general
• Software already reorders too
  • register allocation
  • any code motion
• System provides enforcement primitives
  • e.g., memory fence, volatile, etc.
  • tend to be heavyweight, with unpredictable performance
• Can the compiler hide all this?
End of Compiling Parallel Code