Upload
valencia-cruz
View
48
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Open-source compilers and tools for scalable global address space computing Kathy Yelick University of California, Berkeley and Lawrence Berkeley National Laboratory. UPC and Titanium. Outline. Global Address Languages in General UPC Language overview - PowerPoint PPT Presentation
Citation preview
UPC and Titanium
Open-source compilers and tools for scalable global address space computing
Kathy YelickUniversity of California, Berkeley and
Lawrence Berkeley National Laboratory
Center for Programming Models for Scalable Parallel Computing 2Review, March 13, 2003
Outline
• Global Address Languages in General
• UPC• Language overview• Berkeley UPC compiler status and microbenchmarks• Application benchmarks and plans
• Titanium• Language overview• Berkeley Titanium compiler status• Application benchmarks and plans
Center for Programming Models for Scalable Parallel Computing 3Review, March 13, 2003
Global Address Space Languages• Explicitly-parallel programming model with SPMD parallelism
• Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory• Allows programmer to directly represent distributed data structures
• Address space is logically partitioned• Local vs. remote memory (two-level hierarchy)
• Programmer control over performance critical decisions• Data layout and communication
• Performance transparency and tunability are goals• Initial implementation can use fine-grained shared memory
• Suitable for current and future architectures• Either shared memory or lightweight messaging is key
• Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)
Center for Programming Models for Scalable Parallel Computing 4Review, March 13, 2003
Global Address Space
• The languages share the global address space abstraction• Shared memory is partitioned by processors• Remote memory may stay remote: no automatic caching implied• One-sided communication through reads/writes of shared variables• Both individual and bulk memory copies
• Differ on details• Some models have a separate private memory area• Distributed arrays generality and how they are constructed
Shared
Glo
bal
ad
dre
ss
sp
ace
X[0]
Privateptr: ptr: ptr:
X[1] X[P]
Center for Programming Models for Scalable Parallel Computing 5Review, March 13, 2003
UPC Programming Model Features
• SPMD parallelism• fixed number of images during execution• images operate asynchronously
• Several kinds of array distributions• double a[n] a private n-element array on each processor• shared double a[n] a n-element shared array, with cyclic mapping • shared [4] double a[n] a block cyclic array with 4-element blocks • shared [0] double *a = (shared [0] double *) upc_alloc(n);
a shared array with all elements local
• Pointers for irregular data structures• shared double *sp a pointer to shared data• double *lp a pointers to private data
Center for Programming Models for Scalable Parallel Computing 6Review, March 13, 2003
UPC Programming Model Features
• Global synchronization• upc_barrier traditional barrier• upc_notify/upc_wait split-phase global synchronization
• Pair-wise synchronization• upc_lock/upc_unlock traditional locks
• Memory consistence has two types of accesses• Strict: must be performed immediately and atomically: typically
a blocking round-trip message if remote• Relaxed: still must preserve dependencies, but other
processors may view these as happening out of order
• Parallel I/O• Based on ideas in MPI I/O• Specification for UPC by Thakur, El Ghazawi et al
Center for Programming Models for Scalable Parallel Computing 7Review, March 13, 2003
Berkeley UPC Compiler
UPC
Higher WHIRL
Lower WHIRL
• Compiler based on Open64• Recently merged Rice sources• Multiple front-ends, including gcc• Intermediate form called WHIRL
• Current focus on C backend• IA64 possible in future
• UPC Runtime • Pointer representation• Shared/distribute memory
• Communication in GASNet• Portable • Language-independent
Optimizingtransformations
C + Runtime
Assembly: IA64, MIPS,… + Runtime
Center for Programming Models for Scalable Parallel Computing 8Review, March 13, 2003
Design for Portability & Performance• UPC to C translator:
• Translates UPC to C; insert runtime calls for parallel features
• UPC runtime: • Allocate shared data; implement pointers-to-shared
• GASNet: • A uniform interface for low-level communication primitives
• Portability: • C is our intermediate language
• GASNet is itself layered with a small core as the essential part
• High-Performance: • Native C compiler optimizes serial code
• Translator can perform communication optimizations
• GASNet can access network directly
Center for Programming Models for Scalable Parallel Computing 9Review, March 13, 2003
Berkeley UPC Compiler Status• UPC Extensions added to front-end• Code-generation complete
• Some issues related to code quality (hints to backend compilers)
• GASNet communication layer• Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM, and MPI• Optimized for small non-blocking messages and compiled code• Next step: strided and indexed put/get leveraging ARMCI work
• UPC Runtime layer • Developed and tested on all GASNet implementations• Supports multiple pointer representations• Next step: direct shared memory support
• Release scheduled for later this month• Glitch related to include files and usability to iron out
Center for Programming Models for Scalable Parallel Computing 10Review, March 13, 2003
Pointer-to-Shared Representation
• UPC has three difference kinds of pointers:• Block-cyclic, cyclic, and indefinite (always local)
• A pointer needs a “phase” to keep track of where it is in a block• Source of overhead for updating and de-referencing• Consumes space in the pointer
• Our runtime has special cases for:• Phaseless (cyclic and indefinite) – skip phase update• Indefinite – skip thread id update
• Pointer size/representation easily reconfigured• 64 bits on small machines, 128 on large, word or struct
Address Thread Phase
Center for Programming Models for Scalable Parallel Computing 11Review, March 13, 2003
Preliminary Performance• Testbed
• Compaq AlphaServer, with Quadrics GASNet conduit• Compaq C compiler for the translated C code
• Microbenchmarks• Measures the cost of UPC language features and construct• Shared pointer arithmetic, barrier, allocation, etc• Vector addition: no remote communication
• NAS Parallel Benchmarks • EP: no communication• IS: large bulk memory operations• MG: bulk memput • CG: fine-grained vs. bulk memput
Center for Programming Models for Scalable Parallel Computing 12Review, March 13, 2003
Performance of Shared Pointer Arithmetic
• Phaseless pointers are an important optimization Indefinite pointers almost as fast as regular C pointers General blocked cyclic pointer 7x slower for addition
• Competitive with HP compiler, which generates native code Both compiler have known opportunities for improvement
Pointer-to-shared operations
05
101520253035404550
generic cyclic indefinite
type of pointer
# o
f c
yc
les
(1
.5 n
s/c
yc
le)
ptr + int -- HP
ptr + int -- Berkeley
ptr == ptr -- HP
ptr == ptr-- Berkeley
Center for Programming Models for Scalable Parallel Computing 13Review, March 13, 2003
Cost of Shared Memory Access
Cost of Shared Local Memory Access
750
8
240
70
100
200
300
400
500
600
700
800
HP read Berkeleyread
HP write Berkeleywrite
# of
cyc
les
(1.5
ns/
cycl
e)
Cost of shared remote access
0
1000
2000
3000
4000
5000
6000
HP read Berkeleyread
HP write Berkeleywrite
# o
f cyc
les
(1.5
n
s/cy
cle)
• Local shared accesses somewhat slower than private ones HP has improved local performance in newer version
• Remote accesses worse than local, as expected Runtime/GASNet layering for portability is not a problem
Center for Programming Models for Scalable Parallel Computing 14Review, March 13, 2003
NAS PB: EP
EP -- class A
0
10
20
30
40
50
60
70
0 5 10 15 20
number of threads
Mo
ps
/se
co
nd HP
Berkeley
• EP = Embarrassingly Parallel has no communication• Serial performance via C code generation is not a problem
Center for Programming Models for Scalable Parallel Computing 15Review, March 13, 2003
NAS PB: IS
• IS = Integer Sort is dominated by Bulk Communication• GASNet bulk communication adds no measurable overhead
IS -- Class B
0
5
10
15
20
25
30
35
40
0 2 4 6 8 10
number of threads
Mo
ps/
seco
nd
s
HP
Berkeley
Center for Programming Models for Scalable Parallel Computing 16Review, March 13, 2003
NAS PB: MG
• MG = Multigrid involves medium bulk copies• “Berkeley” reveals a slight serial performance degradation due to casts• Berkeley-C uses the original C code for the inner loops
MG Class B
0100020003000400050006000700080009000
1 2 4 8 16 32Processors
Per
form
ance
Mfl
op/s Berkeley
Berkeley-C
F77+MPI
Center for Programming Models for Scalable Parallel Computing 17Review, March 13, 2003
Scaling MG on the T3E
• Scalability of the language shown here for the T3E compiler• Directly shared memory support is probably needed to be competitive
on most current machines
MG Class C -- T3E
0
5000
10000
15000
20000
25000
0 50 100 150 200 250 300
Processors
Mo
ps/s UPC
MPI
linear
Center for Programming Models for Scalable Parallel Computing 18Review, March 13, 2003
Mesh Generation in UPC
• Parallel Mesh Generation in UPC• 2D Delaunay triangulation
• Based on Triangle software by Shewchuk (UCB)
• Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting
Center for Programming Models for Scalable Parallel Computing 19Review, March 13, 2003
UPC Interactions• UPC consortium
• Tarek El-Ghazawi is coordinator: semi-annual meetings, ~daily e-mail• Revised UPC Language Specification (IDA,GWU,…)• UPC Collectives (MTU)• UPC I/O Specifications (GWU, ANL-PModels)
• Other Implementations• HP (Alpha cluster and C+MPI compiler (with MTU))• MTU (C+MPI Compiler based on HP compiler, memory model)• Cray (X1 implementation)• Intrepid (SGI implementation based on gcc)• Etnus (debugging)
• UPC Book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick• Goal is proofs by SC03
• HPC HPCS Effort• Recent interest from Sandia
Center for Programming Models for Scalable Parallel Computing 20Review, March 13, 2003
Titanium
• Based on Java, a cleaner C++• classes, automatic memory management, etc.• compiled to C and then native binary (no JVM)
• Same parallelism model as UPC and CAF• SPMD with a global address space• Dynamic Java threads are not supported
• Optimizing compiler• static (compile-time) optimizer, not a JIT• communication and memory optimizations• synchronization analysis (e.g. static barrier analysis)• cache and other uniprocessor optimizations
Center for Programming Models for Scalable Parallel Computing 21Review, March 13, 2003
Summary of Features Added to Java
1. Scalable parallelism (Java threads replaced)
2. Immutable (“value”) classes
3. Multidimensional arrays with iterators
4. Checked Synchronization
5. Operator overloading
6. Templates
7. Zone-based memory management (regions)
8. Libraries for collective communication, distributed arrays, bulk I/O
Center for Programming Models for Scalable Parallel Computing 22Review, March 13, 2003
Immutable Classes in Titanium
• For small objects, would sometimes prefer• to avoid level of indirection • pass by value (copy entire object)• especially when immutable -- fields never modified
• Example: immutable class Complex { Complex () {real=0; imag=0; } ... } Complex c1 = new Complex(7.1, 4.3); c1 = c1.add(c1);
• Addresses performance and programmability• Similar to structs in C (not C++ classes) in terms of performance
• Adds support for complex types
Center for Programming Models for Scalable Parallel Computing 23Review, March 13, 2003
Multidimensional Arrays
• Arrays in Java are objects
• Array bounds are checked
• Multidimensional arrays are arrays-of-arrays
Safe and general, but potentially slow
• New kind of multidimensional array added to Titanium• Sub-arrays are supported (interior, boundary, etc.)• Indexed by Points (tuple of ints)
• Combined with unordered iteration to enable optimizations foreach (p within A.domain()) {
A[p]... } • “A” could be multidimensional, an interior region, etc.
Center for Programming Models for Scalable Parallel Computing 24Review, March 13, 2003
Communication
• Titanium has explicit global communication:• Broadcast, reduction, etc.• Primarily used to set up distributed data structures
• Most communication is implicit through the shared address space• Dereferencing a global reference, g.x, can generate
communication• Arrays have copy operations, which generate bulk communication:
A1.copy(A2) • Automatically computes the intersection of A1 and A2’s index
set or domain
Center for Programming Models for Scalable Parallel Computing 25Review, March 13, 2003
Distributed Data Structures
• Building distributed arrays:
Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs-1][1d]; Particle [1d] myParticle = new Particle [0:myParticleCount-1]; allParticle.exchange(myParticle);
• Now each processor has array of pointers, one to each processor’s chunk of particles
P0 P1 P2
All to all broadcast
Center for Programming Models for Scalable Parallel Computing 26Review, March 13, 2003
Titanium Compiler Status
• Titanium compiler runs on almost any machine• Requires a C compiler (and decent C++ to compile translator)• Pthreads for shared memory• Communication layer for distributed memory (or hybrid)
• Recently moved to live on GASNet: obtained GM, Elan, and improved LAPI implementation
• Leverages other PModels work for maintenance
• Recent language extensions• Indexed array copy (scatter/gather style)• Non-blocking array copy under development
• Compiler optimizations• Cache optimizations, for loop optimizations• Communication optimizations for overlap, pipelining, and scatter/gather
under development
Center for Programming Models for Scalable Parallel Computing 27Review, March 13, 2003
Applications in Titanium
• Several benchmarks• Fluid solvers with Adaptive Mesh Refinement (AMR)• Conjugate Gradient• 3D Multigrid• Unstructured mesh kernel: EM3D• Dense linear algebra: LU, MatMul• Tree-structured n-body code• Finite element benchmark• Genetics: micro-array selection• SciMark serial benchmarks
• Larger applications• Heart simulation• Ocean modeling with AMR (in progress)
Center for Programming Models for Scalable Parallel Computing 28Review, March 13, 2003
Serial Performance (Pure Java)
Performance on a Pentium IV (1.5GHz)
050
100150200250300350400450
Overall FFT SOR MC Sparse LU
MF
lop
s
java C (gcc -O6) Ti Ti -nobc
• Several optimizations in Titanium compiler (tc) over the past year• These codes are all written in pure Java without performance extensions
SciMark Small - Linux, 1.8GHz Athlon, 256 KB L2, 1GB RAM
0
100
200
300
400
500
600
700
800
900
CompositeScore
FFT SOR Monte Carlo Sparse matmul LU
sunjdkibmjdktc1.1078tc2.87gcc
Center for Programming Models for Scalable Parallel Computing 29Review, March 13, 2003
AMR for Ocean Modeling
• Ocean Modeling [Wen, Colella]• Require embedded boundaries to model the ocean floor and
coastline• Results in irregular data structures and array accesses
• Starting with AMR solver this year for ocean flow• Compiler and language support for irregular problem under
design
Graphics from Titanium AMR Gas Dynamics [McCorquodale,Colella
Center for Programming Models for Scalable Parallel Computing 30Review, March 13, 2003
• Immersed Boundary Method [Peskin/MacQueen]• Fibers (e.g., heart muscles)
modeled by list of fiber points
• Fluid space modeled by a regular lattice
• Irregular fiber lists need to interact with regular fluid lattice• Trade-off between load balancing
of fibers and minimizing communication
• Memory and communication intensive
Heart Simulation
• Random array access is key problem in the performance• Developed compiler optimizations to improve their performance
• Application effort funded by NSF/NPACI
Center for Programming Models for Scalable Parallel Computing 31Review, March 13, 2003
Parallel Performance and Scalability• Poisson solver using “Method of Local Corrections” [Balls, Colella]
• Communication < 5%; Scaled speedup nearly ideal (flat)
IBM SP Cray T3E
Center for Programming Models for Scalable Parallel Computing 32Review, March 13, 2003
Titanium Interactions
• GASNet interactions• In addition to the
• Application collaborators• Charles Peskin and Dave McQueen and Courant Institute• Phil Colella and Tong Wen and LBNL• Scott Baden and Greg Balls and UCSD
• Involved in Sun HPCS Effort
• The GASNet work is common to UPC and Titanium• Joint effort between U.C. Berkeley and LBNL• (UPC project is primarily at LBNL; Titanium is U.C. Berkeley)• Collaboration with Nieplocha on communication runtime
• Participation in Global Address Space tutorials
Center for Programming Models for Scalable Parallel Computing 33Review, March 13, 2003
The End
•http://upc.nersc.gov•http://titanium.cs.berkeley.edu/
Center for Programming Models for Scalable Parallel Computing 34Review, March 13, 2003
NAS PB: CG
• CG = Conjugate Gradient can be written naturally with fine-grained communication in the sparse matrix-vector product• Worked well on the T3E (and hopefully will on the X1)• For other machines, a bulk version is required
NAS CG Performance
0
200
400
600
800
1000
1200
2 4 8
Pe
rfo
rma
nce
Mo
ps/s
ec
compaq
berkeley2
berkeley3
f77+MPI