Featured attraction:
Computers for Doing Big Science
Bill Camp, Sandia Labs
2nd Feature: Hints on MPP Computing
Sandia MPPs (since 1987)
- 1987: 1024-processor nCUBE10 [512 Mflops]
- 1990--1992+: 2 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
- 1988--1990: 16384-processor CM-200
- 1991: 64-processor Intel iPSC-860
- 1993--1996: ~3700-processor Intel Paragon [180 Gflops]
- 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
- 1997--present: 400 --> 2800 processors in Cplant Linux cluster [~3 Tflops]
- 2003: 1280-processor IA-32 Linux cluster [~7 Tflops]
- 2004: Red Storm: ~11,600-processor Opteron-based MPP [>40 Tflops]
- 2005: ~1280-processor 64-bit Linux cluster [~10 TF]
- 2006: Red Storm upgrade, ~20K nodes, 160 TF
- 2008--09: Red Widow, ~50K nodes, 1000 TF (?)
Computing domains at Sandia
Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up)
[Chart: computing domains at Sandia vs. processor count (1 to 10^4). Desktop machines occupy the single-processor, volume end; Beowulf clusters and the Cplant Linux supercluster each span about three decades from the volume into the mid-range market; Red Storm spans from the mid-range up to the 10^4-processor peak / Big Science end.]
RS Node architecture
• CPU: AMD Opteron
• DRAM: 1 (or 2) Gbyte or more
• ASIC: NIC + router (ASIC = Application-Specific Integrated Circuit, or a "custom chip")
• Six links to other nodes in X, Y, and Z
3-D Mesh topology (Z direction is a torus)
• 10,368-node compute mesh: X = 27, Y = 16, Z = 24
• Torus interconnect in Z
• 640 visualization, service & I/O nodes
Comparison of ASCI Red and Red Storm (ASCI Red / Red Storm)
- Full system operational time frame: June 1997 (processor and memory upgrade in 1999) / August 2004
- Theoretical peak, compute partition alone: 3.15 TF / 41.47 TF
- MP-Linpack performance: 2.38 TF / >30 TF (estimated)
- Architecture: distributed memory MIMD / distributed memory MIMD
- Number of compute node processors: 9,460 / 10,368
- Processor: Intel Pentium II @ 333 MHz / AMD Opteron @ 2 GHz
- Total memory: 1.2 TB / 10.4 TB (up to 80 TB)
- System memory bandwidth: 2.5 TB/s / 55 TB/s
- Disk storage: 12.5 TB / 240 TB
- Parallel file system bandwidth: 2.0 GB/s / 100.0 GB/s
- External network bandwidth: 0.4 GB/s / 50 GB/s
Comparison of ASCI Red and Red Storm, continued (ASCI Red / Red Storm)
- Interconnect topology: 3D mesh (x, y, z), 38 x 32 x 2 / 3D mesh (x, y, z), 27 x 16 x 24
- MPI latency: 15 µs 1 hop, 20 µs max / 2.0 µs 1 hop, 5 µs max
- Bi-directional bandwidth: 800 MB/s / 6.4 GB/s
- Minimum bi-section bandwidth: 51.2 GB/s / 2.3 TB/s
- Full-system RAS network: 10 Mbit Ethernet / 100 Mbit Ethernet
- RAS processors: 1 for each 32 CPUs / 1 for each 4 CPUs
- Compute node OS: Cougar / Catamount
- Service and I/O node OS: TOS (OSF1 UNIX) / LINUX
- RAS node OS: VxWorks / LINUX
- Red/Black switching: 2260 -- 4940 -- 2260 / 2688 -- 4992 -- 2688
- System footprint: ~2500 ft2 / ~3000 ft2
- Power requirement: 850 KW / 1.7 MW
Red Storm Project
• Goal: 23 months, design to First Product Shipment!
• System software is a joint project between Cray and Sandia:
  - Sandia is supplying the Catamount LWK and the service node run-time system
  - Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the Totalview port
• Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involved an FPGA implementation of the SEASTAR NIC/Router (Starfish); final checkout is on the real SEASTAR-based system
• System engineering is wrapping up! Cabinets exist; the SEASTAR NIC/Router RTAT came back from fabrication at IBM late last month
• Full system to be installed and turned over to Sandia in stages, culminating in August--December 2004
Designing for scalable scientific supercomputing
Challenges in: design, integration, management, and use
Design SUREty for Very Large Parallel Computer Systems
- Scalability: full-system hardware and system software
- Usability: required functionality only
- Reliability: hardware and system software
- Expense minimization: use commodity, high-volume parts
SURE poses computer system requirements. SURE architectural tradeoffs:
• Processor and memory sub-system balance
• Compute vs. interconnect balance
• Topology choices
• Software choices
• RAS
• Commodity vs. custom technology
• Geometry and mechanical design
Sandia strategies:
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in RS there is basically one truly custom part!)
- Leverage experience with previous scalable supercomputers
System Scalability Driven Requirements
Overall System Scalability - Complex scientific applications such as molecular dynamics, hydrodynamics, & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).
- System software scalability: system software performance scales nearly perfectly with the number of processors, up to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with system size.
- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic with the number of processors.
- Parallel I/O performance is not sensitive to the number of PEs doing I/O.
- Communication network software must be scalable:
  - prefer no connection-based protocols among compute nodes;
  - message buffer space independent of the number of processors;
  - the compute node OS gets out of the way of the application.
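For concreteness, the 50% target above uses the usual weak-scaling (scaled-speedup) definition of parallel efficiency; the slides do not spell it out, but the standard form is

\[
E_{\text{scaled}}(P) \;=\; \frac{T(1)}{T(P)}, \qquad \text{with the work per processor held fixed as } P \text{ grows,}
\]

so perfectly scalable hardware and system software keep the run time T(P), and hence the overhead, essentially constant as processors are added.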
Hardware scalability
• Balance in the node hardware:
  - Memory BW must match CPU speed (ideally 24 Bytes/flop; never yet done)
  - Communications speed must match CPU speed
  - I/O must match CPU speeds
• Scalable system SW (OS and libraries)
• Scalable applications
Usability
• Application code support: software that supports scalability of the computer system
  - Math libraries
  - MPI support for the full system size
  - Parallel I/O library
  - Compilers
• Tools that scale to the full size of the computer system
  - Debuggers
  - Performance monitors
• Full-featured LINUX OS support at the user interface
Reliability
• Light Weight Kernel (LWK) OS on the compute partition: much less code fails much less often
• Monitoring of correctable errors: fix soft errors before they become hard
• Hot swapping of components: the overall system keeps running during maintenance
• Redundant power supplies & memories
• Completely independent RAS system monitors virtually every component in the system
Economy
1. Use high-volume parts where possible
2. Minimize power requirements: cuts operating costs and reduces the need for new capital investment
3. Minimize system volume: reduces the need for large new capital facilities
4. Use standard manufacturing processes where possible; minimize customization
5. Maximize reliability and availability per dollar
6. Maximize scalability per dollar
7. Design for integrability
Economy
• Red Storm leverages economies of scale: AMD Opteron microprocessor & standard memory; air cooling; electrical interconnect based on Infiniband physical devices; Linux operating system
• Selected use of custom components:
  - System chip ASIC: critical for communication-intensive applications
  - Light Weight Kernel: truly custom, but we already have it (4th generation)
Cplant on a slide
[Diagram: Cplant system architecture. Compute nodes; service nodes for sys admin and users; file I/O nodes serving /home and other file systems; net I/O nodes connecting to Ethernet, ATM, and HiPPI; and system-support nodes for the operator(s).]
Goal: MPP “look and feel”
• Start ~1997, upgrade ~1999--2001
• Alpha & Myrinet, mesh topology
• ~3000 procs (3Tf) in 7 systems
• Configurable to ~1700 procs
• Red/Black switching
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
IA-32 Cplant on a slide
[Diagram: IA-32 Cplant system architecture, same layout as the Cplant diagram above.]
Goal: Mid-range capacity
• Started 2003, upgrade annually
• Pentium-4 & Myrinet, Clos network
• 1280 procs (~7 Tf) in 3 systems
• Currently configurable to 512 procs
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
Observation: For most large scientific and engineering applications, performance is determined more by parallel scalability than by the speed of individual CPUs.
There must be balance between processor, interconnect, and I/O performance to achieve overall performance.
To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.
Let’s Compare Balance In Parallel Systems
Machine        Node speed rating (Mflops)   Network link BW (Mbytes/s)   Communications balance (Bytes/flop)
ASCI RED       400                          800 (533)                    2 (1.33)
ASCI RED**     666                          800 (533)                    (1.2) 0.67
T3E            1200                         1200                         1
Cplant         1000                         140                          0.14
Blue Mtn*      500                          800                          1.6
BlueMtn**      64000                        1200 (9600*)                 0.02 (0.16*)
Blue Pacific   2650                         300 (132)                    0.11 (0.05)
White          24000                        2000                         0.083
Q*             2500                         650                          0.26
Q**            10000                        400                          0.04
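The balance column is simply the ratio of the two columns to its left; as a quick check (plain arithmetic, not part of the original table):

\[
\text{balance (Bytes/flop)} \;=\; \frac{\text{network link BW (Mbytes/s)}}{\text{node speed (Mflops)}},
\qquad \text{e.g. ASCI Red: } \frac{800}{400} = 2, \quad \text{Cplant: } \frac{140}{1000} = 0.14 .
\]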
Comparing Red Storm and BGL
Blue Gene Light** Red Storm*
Node Speed 5.6 GF 5.6 GF (1x)
Node Memory 0.25--0.5 GB 2 (1--8) GB (4x nom.)
Network latency 7 µsecs 2 µsecs (2/7 x)
Network link BW 0.28 GB/s 6.0 GB/s (22x)
BW Bytes/Flops 0.05 1.1 (22x)
Bi-Section B/F 0.0016 0.038 (24x)
#nodes/problem 40,000 10,000 (1/4 x)
*100 TF version of Red Storm
* * 360 TF version of BGL
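The bandwidth-per-flop rows follow directly from the link-bandwidth and node-speed rows above them; shown here only as a sanity check:

\[
\text{Red Storm: } \frac{6.0\ \text{GB/s}}{5.6\ \text{GF/s}} \approx 1.1\ \text{Bytes/flop},
\qquad
\text{BGL: } \frac{0.28}{5.6} = 0.05\ \text{Bytes/flop}.
\]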
Fixed problem performance
Molecular dynamics problem (LJ liquid)
Scalable computing works
ASCI Red efficiencies for major codes
[Plot: scaled parallel efficiency (%) vs. processors (1 to 10,000) for QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (1B cells, 17M, 80M, 168M, and 532M cells), Finite Element, Zapotec, Reactive Fluid Flow, Salinas, and CTH.]
Basic Parallel Efficiency Model
[Plot: parallel efficiency vs. communication/computation load (0 to 1) for Red Storm (B=1.5), ASCI Red (B=1.2), a reference machine (B=1.0), the Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), and a standard Linux cluster (B=0.04). Balance is critical to scalability: peak and Linpack measurements sit at the low-communication end of the curves, while scientific and engineering codes sit at higher communication loads.]
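The slide does not give the formula behind these curves, but they are consistent with a minimal model in which a code's communication intensity c (bytes communicated per flop of useful work) competes with the machine balance B (bytes/s of link bandwidth per flop/s of node speed):

\[
E(c, B) \;\approx\; \frac{1}{1 + c/B},
\]

so a machine whose balance exceeds the code's communication intensity stays near 100% efficiency, while an unbalanced machine (B much smaller than c) loses efficiency roughly in proportion to B.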
Relating scalability and cost
[Plot: efficiency ratio (ASCI Red / Cplant), averaged over the five codes that consume >80% of Sandia's cycles, vs. processors (1 to 4096), with an extrapolation beyond the measured range. Above the line where the efficiency ratio equals the cost ratio (= 1.8), the MPP is more cost effective; below it, the cluster is more cost effective.]
Scalability determines cost effectiveness
Sandia's top priority computing workload:
[Histogram: total node-hours of jobs vs. number of nodes (1 to 10,000), with a crossover near 256 nodes. Jobs in the region where the MPP is more cost effective account for ~380M node-hours; jobs in the region where the cluster is more cost effective account for ~55M node-hours.]
Scalability also limits capability
ITS speedup curves
[Plot: speedup (0 to 1200) vs. processors (0 to 1408) for ASCI Red and Cplant, with polynomial fits; the annotation "~3x processors" marks the gap between the two curves.]
Commodity nearly everywhere-- Customization drives cost
• Earth Simulator and Cray X-1 are fully custom vector systems with good balance
  - This drives their high cost (and their high performance)
• Clusters are nearly entirely high-volume with no truly custom parts
  - Which drives their low cost (and their low scalability)
• Red Storm uses custom parts only where they are critical to performance and reliability
  - High scalability at minimal cost/performance
“Honey, it’s not one of those…” or
Hints on MPP Computing
[Excerpted from a talk with this title given by Bill Camp at CUG-Tours in October 1994]
Issues in MPP Computing:
1. Physically shared memory does not scale
2. Data must be distributed
3. No single data layout may be optimal
4. The optimal data layout may change during the computation
5. Communications are expensive
6. The single control stream in SIMD computing makes it simple-- at the cost of severe loss in performance-- due to load balancing problems
7. In data parallel computing (à la CM-5) there can be multiple control streams-- but with global synchronization
Less simple but overhead remains an issue
8. In MIMD computing there are many control streams loosely synchronized (e.g., with messages)
Powerful, flexible and complex
Why doesn’t shared memory scale?
[Diagram: many CPUs, each with its own cache, sharing memory banks through a switch.]
- Bank conflicts: about a 40% hit for large numbers of banks and CPUs
- Memory coherency: who has the data, and can I access it?
- High, non-deterministic latencies
Amdahl’s Law
Time on a single processor:
T1 = Tser + Tpar
Time on P processors:
Tp = Tser + Tpar/P + Tcomm's
Ignoring communications, the speedup Sp (= T1/Tp) is then
Sp = { fser + [ 1 - fser ] / P }^-1
where fser = Tser / T1
So, Sp < 1 / fser
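A quick worked instance of the bound (plain arithmetic, not from the slide): with only 1% serial work,

\[
f_{\text{ser}} = 0.01,\ P = 10{,}000:\qquad
S_P = \frac{1}{0.01 + 0.99/10{,}000} \approx 99 \;<\; \frac{1}{f_{\text{ser}}} = 100,
\]

so even a 10,000-processor machine delivers at most about a 99x speedup on that code.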
The Axioms
Axiom 1: Amdahl’s Law is inviolate (Sp < 1 / fser )
Axiom 2: Amdahl’s Law doesn’t matter for MPP if you know what you are doing (Comm’s dominate)
Axiom 3 : Nature is parallel
Axiom 4 : Nature is (mostly) local
Axiom 5 : Physical shared memory does not scale
Axiom 6 : Physically distributed memory does
Axiom 7 : Nevertheless, a global address space is nice to have
The Axioms
Axiom 8: Like solar energy, automatic parallelism is the technology of the future
Axiom 9: successful parallelism requires the near total suppression of serialism
Axiom 10 : The best thing you can do with a processor is serial execution
Axiom 11 : Axioms 9 &10 are not contradictory
Axiom 12 : MPPs are for doing large problems fast (if you need to do a small problem fast, look
elsewhere).
Axiom 13 : Generals build weapons to win the last war (so do computer scientists)
The Axioms
Axiom 14 : first find coarse-grained, then medium-grained, then fine-grained parallelism
Axiom 15: done correctly, the gain from these is multiplicative
Axiom 16 : Life’s a balancing act; so’s MPP computing
Axiom 17 : Be an introvert-- never communicate needlessly
Axiom 18 : Be independent; never synchronize needlessly
Axiom 19 : Parallel computing is a cold world, bundle up well
The Axioms
Axiom 20 : I/O should only be done under medical supervision
Axiom 21: If MPP computin’ is easy it ain’t cheap
Axiom 22 : If MPP computin’ is cheap, it ain’t easy
Axiom 23 : The difficulty of programming an MPP effectively is directly proportional to latency
Axiom 24 : The parallelism is in the problem, not in the code
The Axioms
Axiom 25 : There are an infinite number of parallel algorithms
Axiom 26 : There are no parallel algorithms (Simon’s theorem)-- it’s almost true
Axiom 27: The best parallel algorithm is almost always a parallel implementation of the best serial algorithm (what Horst really meant)
Axiom 28 : Amdahl’s Law DOES limit vector speedup!
Axiom 18’ : Work in teams (sometimes SIMD constructs are just what the Doctor ordered)!
Axiom 29 : Do try this at home!
(Some of) the Hints
Hint 1:
Any amount of serial computing is death
So… 1) make the problem large
2) Look everywhere for serialism and purge it from your code
3) Never, ever, ever add serial statements
(Some of) the Hints
Hint 2:
Keep communications in the noise!
So… 1) Don’t do little problems on big computers
2) Change algorithms when profitable
3) Bundle up!-- avoid small messages on high-latency interconnects
4) Don’t waste memory-- using all the memory on a node minimizes the ratio of communications to useful work
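As a concrete illustration of “bundle up” (a minimal sketch in C with MPI, using a hypothetical array of per-item values; this is not code from the talk):

```c
/* Minimal sketch (hypothetical example, not from the talk): avoid many small
 * messages by packing items into one buffer and sending it once. */
#include <mpi.h>

#define N_ITEMS 1000

/* N_ITEMS separate sends: each message pays the full per-message latency. */
static void send_items_naive(double *items, int dest, MPI_Comm comm)
{
    for (int i = 0; i < N_ITEMS; i++)
        MPI_Send(&items[i], 1, MPI_DOUBLE, dest, 0, comm);
}

/* One bundled send: the latency is amortized over all N_ITEMS values. */
static void send_items_bundled(double *items, int dest, MPI_Comm comm)
{
    MPI_Send(items, N_ITEMS, MPI_DOUBLE, dest, 1, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double items[N_ITEMS];
    for (int i = 0; i < N_ITEMS; i++)
        items[i] = (double)i;

    if (rank == 0) {
        send_items_bundled(items, 1, MPI_COMM_WORLD);  /* prefer this...  */
        (void)send_items_naive;                        /* ...not this     */
    } else if (rank == 1) {
        MPI_Recv(items, N_ITEMS, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

On a 15 µs-latency network like ASCI Red's, the naive version pays roughly 15 ms of latency for 1,000 items before any bandwidth is used; the bundled version pays that cost once.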
(Some of) the Hints
Hint 3:
The parallelism is in the problem!
E.G. SAR, Monte Carlo, Direct Sparse solvers, Molecular Dynamics
So,… 1) Look first at the problem
2) Look second at algorithms
3) Look at data structures in the code
4) don’t waste cycles on line-by-line parallelism
(Some of) the Hints
Hint 4:
Incremental Parallelism Is Too Inefficient! Don’t fiddle with the Fortran
Look at the Problem:
-- Identify the kinds of parallelism it contains
1) Multi-program
2) Multi-task
4) data parallelism
5) inner-loop parallelism (e.g. vectors)
(Some of) the Hints
Hint 5:
Often: with Explicit Message Passing (EMP) or gets/puts
You can re-use virtually all of your code
(changes and additions ~ few%)
-- With data parallel languages, you re-write your code
It can be easy
but
Performance is usually unacceptable
(Some of) the Hints
Hint 6:
Load Balancing (Use existing libraries and technology)
-Easy in EMP!
-Hard (or impossible) in HPF, F90, CMF, …
-Only load balance if Tnew + Tbal < Told
Static or Dynamic:
Graph-based
Geometry-based
Particle-based
Hierarchical Master-Slave
(Some of) the Hints
Hint 7:
Synchronization is expensive
So, … Don’t do it unless you have to
Never, ever put in synchronization just to get rid of a bug
else you’ll be stuck with it for the life of the code!
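A small sketch of what “only synchronize when you have to” looks like in practice (a hypothetical C/MPI example, not from the talk): a nearest-neighbor exchange needs no global barrier, because the message matching itself orders exactly the ranks that must wait on each other.

```c
/* Minimal sketch (hypothetical, not from the talk): neighbor exchange on a
 * 1-D ring using MPI_Sendrecv.  The message matching provides all the
 * synchronization the update needs; no MPI_Barrier is required. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double my_edge = (double)rank;   /* local boundary value            */
    double left_ghost, right_ghost;  /* ghost values from the neighbors */

    /* Exchange with both neighbors; each receive completes exactly when the
     * needed data has arrived, so ranks stay only loosely coupled. */
    MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, right, 0,
                 &left_ghost, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, left, 1,
                 &right_ghost, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... advance the local computation using left_ghost and right_ghost ... */

    MPI_Finalize();
    return 0;
}
```

Wrapping such an exchange in an MPI_Barrier would force every rank to wait for the slowest one, which is exactly the needless synchronization this hint warns against.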
(Some of) the Hints
Hint 8:
I/O can ruin your whole afternoon:
It is amazing how many people will create wonderfully scalable codes only to spoil them with needless, serial, or unbalanced I/O
Use I/O sparingly
Stage I/O carefully
(Some of) the Hints
Hint 9:
Religious prejudice is the bane of computing
Caches aren’t inherently bad
Vectors aren’t inherently good
Small SMP’s will not ruin your life
Single processor nodes are not killers
…
La Fin (The End)
Scaling data for some key engineering codes
Performance on Engineering Codes
[Plot: scaled parallel efficiency vs. processors (1 to 1024) for ITS and ACME on ASCI Red and Cplant.]
- Random variation at small proc. counts
- Large differential in efficiency at large proc. counts
Scaling data for some key physics codes
Los Alamos’ radiation transport code: parallel Sn neutronics (provided by LANL)
[Plot: PARTISN diffusion solver sizeup study (S6P2, 12 groups, 13,800 cells/PE). Parallel efficiency vs. number of processor elements (1 to 2048) for ASCI Red, Blue Mountain, White, and QSC.]
[Plot: PARTISN transport solver sizeup study (S6P2, 12 groups, 13,800 cells/PE). Parallel efficiency vs. number of processor elements (1 to 2048) for ASCI Red, Blue Mountain, White, and QSC.]