Architectural Trends and Programming Model Strategies for Large-Scale Machines
Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Lab
http://titanium.cs.berkeley.edu
http://upc.lbl.gov
Architecture Trends
• What happened to Moore's Law?
• Power density and system power
• Multicore trend
• Game processors and GPUs
• What this means to you
Moore’s Law is Alive and Well
2X transistors/chip every 1.5 years: called "Moore's Law"
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
But Clock Scaling Bonanza Has Ended
• Processor designers forced to go "multicore":
  • Heat density: a faster clock means hotter chips
  • More cores with lower clock rates burn less power
• Declining benefits of "hidden" Instruction Level Parallelism (ILP)
  • Last generation of single-core chips probably over-engineered
  • Lots of logic/power spent finding ILP, but it wasn't in the apps
• Yield problems
  • Parallelism can also be used for redundancy
  • IBM Cell processor has 8 small cores; a blade system with all 8 sells for $20K, whereas a PS3 is about $600 and uses only 7
Clock Scaling Hits Power Density Wall
[Figure: power density (W/cm²) vs. year, 1970-2010, for processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through Pentium and P6, rising toward reference levels labeled Hot Plate, Nuclear Reactor, Rocket Nozzle, and Sun's Surface.]
Source: Patrick Gelsinger, Intel®
Scaling clock speed (business as usual) will not work
Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Petaflop with ~1M Cores By 2008
[Figure: Top500 performance trend (data from top500.org), 1993-2014, on a log scale from 1 Gflop/s to 1 Eflop/s, showing the SUM, #1, and #500 lines. Annotations: "1 PFlop system in 2008", "6-8 years", "Common by 2015?".]
Slide source: Horst Simon, LBNL
NERSC 2005 Projections for Computer Room Power (System + Cooling)
[Figure: projected computer-room power in megawatts, 2001-2017, rising toward ~70 MW, stacked by contribution: Cooling, N8, N7, N6, N5b, N5a, NGF, Bassi, Jacquard, N3E, N3, PDSF, HPSS, Misc.]
Power is a system-level problem, not just a chip-level problem.
Concurrency for Low Power
• Highly concurrent systems are more power efficient
  • Dynamic power is proportional to V²fC
  • Increasing frequency (f) also increases supply voltage (V): a more-than-linear effect
  • Increasing cores increases capacitance (C) but has only a linear effect
• Hidden concurrency burns power
  • Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software (compilers and application programmers) to save power
• Challenge: Can you double the concurrency in your algorithms every 18 months?
Cell Processor (PS3) on Scientific Kernels
• Very hard to program: explicit control over memory hierarchy
Game Processors Outpace Moore’s Law
• Traditionally too specialized (no scatter, no inter-core communication, no double-precision floating point), but the trend is toward generalization
• Still have control over memory hierarchy/partitioning
[Figure: speedup of game/graphics processors on scientific kernels.]
What Does this Mean to You?
• The world of computing is going parallel
  • Number of cores will double every 12-24 months
  • Petaflop (1M processor) machines common by 2015
• More parallel programmers to hire ☺
• Climate codes must have more parallelism
  • Need for fundamental rewrites and new algorithms
  • Added challenge of combining with adaptive algorithms
• New programming model or language likely ☺
  • Can HPC benefit from investments in parallel languages and tools?
  • One programming model for all?
Performance on Current Machines
Oliker, Wehner, Mirin, Parks and Worley
• Current state-of-the-art systems attain around 5% of peak at the highest available concurrencies
  • Note: the current algorithm uses OpenMP where possible to increase parallelism
• Unless we can do better, peak performance of the system must be 10-20x the sustained requirement (see the check below)
• Limitations (from separate studies of scientific kernels)
  • Not just memory bandwidth: latency, compiler code generation, ...
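A quick check of the 10-20x figure: if a machine sustains only a fraction e of its peak, the peak must be 1/e times the sustained requirement, so at the ~5% observed here,

    \[
    \text{Peak}_{\text{required}} \;=\; \frac{\text{Sustained requirement}}{e}
    \;=\; \frac{\text{Sustained requirement}}{0.05}
    \;=\; 20 \times \text{Sustained requirement}.
    \]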
[Figure: percent of theoretical peak (3%-11%) at concurrencies P=256, P=336, and P=672 for Power3, Itanium2, Earth Simulator, X1, and X1E.]
Strawman 1km Climate Computer
Oliker, Shalf, Wehner
• Cloud-system-resolving global atmospheric model
  • 0.015° x 0.02° horizontally with 100 vertical levels
• Caveat: a back-of-the-envelope calculation, with many assumptions about scaling the current model (an arithmetic check follows below)
• To achieve simulation time 1000x faster than real time:
  • ~10 Petaflops sustained performance requirement
  • ~100 Terabytes total memory
  • ~2 million horizontal subdomains
  • ~10 vertical domains
  • ~20 million processors at 500 Mflop/s each sustained, inclusive of communication costs
  • 5 MB memory per processor
  • ~20,000 nearest-neighbor send-receive pairs per subdomain per simulated hour, of ~10KB each
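The headline numbers above are mutually consistent; a quick check using only figures from the slide:

    \[
    \begin{aligned}
    (2\times 10^{6}\ \text{horizontal subdomains}) \times (10\ \text{vertical domains}) &\approx 2\times 10^{7}\ \text{processors},\\
    (2\times 10^{7}\ \text{processors}) \times (500\ \text{Mflop/s each}) &= 10^{16}\ \text{flop/s} = 10\ \text{Pflop/s sustained},\\
    (100\ \text{TB total memory}) \div (2\times 10^{7}\ \text{processors}) &= 5\ \text{MB per processor}.
    \end{aligned}
    \]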
A Programming Model Approach:
Partitioned Global Address Space (PGAS) Languages
What, Why, and How
Parallel Programming Models
• Easy parallel software is still an unsolved problem!
• Goals:
  • Making parallel machines easier to use
  • Ease algorithm experiments
  • Make the most of your machine (network and memory)
  • Enable compiler optimizations
• Partitioned Global Address Space (PGAS) Languages
  • Global address space like threads (programmability)
  • One-sided communication
  • Currently static (SPMD) parallelism, like MPI
  • Local/global distinction, i.e., layout matters (performance)
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global (a short UPC sketch follows this slide)
[Figure: a global address space spanning processes p0 ... pn; each process holds private local variables (l) and pointers/data in the shared space (g, and objects x, y with per-process values such as x: 1, x: 5, x: 7).]
By default:
• Object heaps are shared
• Program stacks are private
• 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk on UPC and Titanium (based on Java)
• 3 emerging languages: X10, Fortress, and Chapel
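To make the model concrete, here is a minimal UPC sketch (illustrative, not taken from the talk): the shared array is partitioned one element per thread, the ordinary int stays private to each thread, and remote elements are read with plain one-sided accesses.

    #include <upc.h>
    #include <stdio.h>

    shared int x[THREADS];   /* shared: element i lives in thread i's partition */
    int l;                   /* private: every thread has its own copy */

    int main(void) {
        l = MYTHREAD;                 /* purely local work */
        x[MYTHREAD] = l + 1;          /* write my element of the shared array */
        upc_barrier;                  /* make all writes visible */
        if (MYTHREAD == 0) {
            int sum = 0;
            for (int i = 0; i < THREADS; i++)
                sum += x[i];          /* one-sided reads; remote ones become gets */
            printf("sum over %d threads = %d\n", THREADS, sum);
        }
        return 0;
    }

With the Berkeley UPC compiler this would be built and launched with upcc and upcrun; the data layout (one element per thread here) is what gives the local/global distinction its performance meaning.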
PGAS Languages for Hybrid Parallelism
• PGAS languages are a good fit to shared memory machines
  • Global address space implemented as reads/writes
  • Current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• PGAS languages are a good fit to distributed memory machines and networks
  • Good match to modern network hardware
  • Decoupling data transfer from synchronization can improve bandwidth
PGAS Languages on Clusters:
One-Sided vs. Two-Sided Communication
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (pre-posts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put (see the sketch below)
  • Offloaded to the network interface in networks like Quadrics
  • Need to download match tables to the interface (from the host)
[Figure: a one-sided put message carries the destination address plus the data payload and can be deposited into memory directly by the network interface; a two-sided message carries a message id plus the data payload and must be matched with a receive involving the host CPU.]
Joint work with Dan Bonachea
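As a simplified contrast, here is a hedged C/MPI sketch using MPI-2 one-sided operations as a stand-in for the GASNet/UPC puts measured on the following slides; the function name exchange, the buffers buf and remote, and the fixed ranks 0 and 1 are assumptions for illustration, and MPI_Init/MPI_Finalize are assumed to be handled by the caller.

    #include <mpi.h>

    /* Illustrative only: buf and remote are valid N-element buffers on both ranks. */
    void exchange(double *buf, double *remote, int N, int rank) {
        /* Two-sided: the receiver must post a matching recv that names the address. */
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(remote, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: the sender names the remote window and offset itself;
           no receive is posted, so an RDMA-capable NIC can deposit the data directly. */
        MPI_Win win;
        MPI_Win_create(remote, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, N, MPI_DOUBLE, /*target_rank=*/1,
                    /*target_disp=*/0, N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }

The essential difference is visible in the calls: MPI_Recv supplies the destination address on the receiving side, whereas MPI_Put (like a GASNet put) names the remote location from the sending side, so the remote CPU need not be involved in placing the payload.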
One-Sided vs. Two-Sided: Practice
[Figure: flood bandwidth (MB/s, up is good) vs. message size (10 bytes to 1,000,000 bytes) for GASNet put (non-blocking) vs. MPI on the NERSC Jacquard machine (Opteron processors), with an inset of relative bandwidth (GASNet/MPI) ranging from about 1.0 to 2.4.]
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!
Joint work with Paul Hargrove and Dan Bonachea
GASNet: Portability and High-Performance
GASNet is better for latency across machines.
[Figure: 8-byte roundtrip latency in microseconds (down is good) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; reported values range from about 4.5 to 24.2 usec, with GASNet lower than MPI on each platform.]
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
GASNet is at least as high (comparable) for large messages.
[Figure: flood bandwidth for 2MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; absolute bandwidths of roughly 225-1504 MB/s are annotated on the bars.]
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
GASNet excels at mid-range sizes, which is important for overlap.
[Figure: flood bandwidth for 4KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; absolute bandwidths of roughly 152-763 MB/s are annotated on the bars.]
Joint work with UPC Group; GASNet design by Dan Bonachea
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Decompositions: chunk = all rows with the same destination; slab = all rows in a single plane with the same destination; pencil = 1 row.
• Three approaches:
  • Chunk:
    • Wait for 2nd-dimension FFTs to finish
    • Minimize number of messages
  • Slab:
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  • Pencil (sketch below):
    • Send each row as it completes
    • Maximize overlap and match the natural layout
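A minimal sketch of the pencil strategy, written here with non-blocking MPI sends rather than the one-sided UPC puts used in the actual study; row_fft_1d and dest_of_row are hypothetical placeholders for the local 1-D FFT (e.g., an FFTW call) and the transpose's destination map.

    #include <mpi.h>
    #include <complex.h>

    /* Placeholder helpers (hypothetical): a real code would call FFTW here
       and use the transpose's actual layout map. */
    void row_fft_1d(double complex *row, int len) { (void)row; (void)len; }
    int  dest_of_row(int r, int nprocs)           { return r % nprocs; }

    /* Pencil strategy: as soon as a row's 1-D FFT finishes, send that single row,
       overlapping the remaining FFT computation with communication. */
    void pencil_transpose(double complex *rows, int nrows, int rowlen,
                          int nprocs, MPI_Request *reqs) {
        for (int r = 0; r < nrows; r++) {
            row_fft_1d(&rows[r * rowlen], rowlen);            /* compute this row */
            MPI_Isend(&rows[r * rowlen], 2 * rowlen, MPI_DOUBLE,
                      dest_of_row(r, nprocs), /*tag=*/r, MPI_COMM_WORLD, &reqs[r]);
        }
        MPI_Waitall(nrows, reqs, MPI_STATUSES_IGNORE);        /* drain outstanding sends */
    }

The trade-off named on the next slide is visible here: each row becomes its own small message, which maximizes overlap but is only profitable when per-message costs are low, as with one-sided puts.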
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
[Figure: best MFlop rates per thread for all NAS FT benchmark versions (Chunk with NAS FT + FFTW, best NAS Fortran/MPI, best MPI (always slabs), best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; an annotation marks 0.5 Tflop/s.]
Making PGAS Real: Applications and Portability
Coding Challenges: Block-Structured AMR
• Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from boundaries
  • Mixed global/local view is useful
AMR Titanium work by Tong Wen and Philip Colella
Titanium AMR benchmark available
Language Support Helps Productivity
C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous communication:
  • Pack boundary data between procs
  • All optimizations done by the programmer
Titanium AMR
• Entirely in Titanium
• Finer-grained communication
  • No explicit pack/unpack code
  • Automated in the runtime system
General approach
• Language allows programmer optimizations
• Compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; Communication optimizations joint with Jimmy Su
[Figure: lines of code (0-30,000) for Titanium vs. C++/Fortran/MPI (Chombo), broken down by component: AMRElliptic, AMRTools, Util, Grid, AMR, Array.]
Conclusions
• Future hardware performance improvements
  • Mostly from concurrency, not clock speed
  • New commodity hardware coming (sometime) from games/GPUs
• PGAS Languages
  • Good fit for shared and distributed memory
  • Control over locality and (for better or worse) SPMD
  • Available for download
    • Berkeley UPC compiler: http://upc.lbl.gov
    • Titanium compiler: http://titanium.cs.berkeley.edu
• Implications for Climate Modeling
  • Need to rethink algorithms to identify more parallelism
  • Need to match the needs of future (efficient) algorithms
Extra Slides
Parallelism Saves Low Power
• Exploit explicit parallelism to reduce power
    Power = C * V² * F            Performance = Cores * F
• Using additional cores
  • Allows reduction in frequency and supply voltage without decreasing (original) performance
  • Power supply reduction leads to large power savings (worked check below)
    Power = 2C * V² * F           Performance = 2 Cores * F
    Power = 4C * (V²/4) * (F/2)   Performance = 4 Lanes * (F/2)
    Power = (C * V² * F)/2        Performance = 2 Cores * F
• Additional benefits
  • Small/simple cores give more predictable performance
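A worked check of the third case above, assuming (as the slide does) that dynamic power scales as C·V²·F and that halving the frequency allows the supply voltage to be roughly halved as well:

    \[
    P_{\mathrm{new}} = (4C)\left(\frac{V}{2}\right)^{2}\left(\frac{F}{2}\right)
    = \frac{1}{2}\,C V^{2} F = \frac{1}{2}\,P_{\mathrm{old}},
    \qquad
    \mathrm{Perf}_{\mathrm{new}} = 4\ \mathrm{cores}\times\frac{F}{2}
    = 2\times \mathrm{Perf}_{\mathrm{old}}.
    \]

So four slower, lower-voltage cores deliver twice the original performance for half the power, which is the "concurrency for low power" argument in equation form.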
Where is Fortran?
Source: Tim O’Reilly