Featured attraction:
Computers for Doing Big Science
Bill Camp, Sandia Labs
2nd Feature: Hints on MPP Computing
Sandia MPPs (since 1987)
- 1987: 1024-processor nCUBE10 [512 Mflops]
- 1990--1992+: 2 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
- 1988--1990: 16384-processor CM-200
- 1991: 64-processor Intel iPSC-860
- 1993--1996: ~3700-processor Intel Paragon [180 Gflops]
- 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
- 1997--present: 400 --> 2800 processors in Cplant Linux cluster [~3 Tflops]
- 2003: 1280-processor IA-32 Linux cluster [~7 Tflops]
- 2004: Red Storm: ~11,600-processor Opteron-based MPP [>40 Tflops]
- 2005: ~1280-processor 64-bit Linux cluster [~10 TF]
- 2006: Red Storm upgrade, ~20K nodes, 160 TF
- 2008--09: Red Widow, ~50K nodes, 1000 TF (?)
Computing domains at Sandia
Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up)
[Chart: computing domains at Sandia vs. processor count (1 to 10^4). Desktop machines occupy the single-processor, volume end; Beowulf clusters and the Cplant Linux supercluster each span about three decades from the volume into the mid-range market; Red Storm spans from the mid-range up to the 10^4-processor peak / Big Science end.]
RS Node architecture
• CPU: AMD Opteron
• DRAM: 1 (or 2) Gbyte or more
• ASIC: NIC + router (ASIC = Application-Specific Integrated Circuit, or a "custom chip")
• Six links to other nodes in X, Y, and Z
3-D Mesh topology (Z direction is a torus)
• 10,368-node compute mesh: X = 27, Y = 16, Z = 24
• Torus interconnect in Z
• 640 visualization, service & I/O nodes
Comparison of ASCI Red and Red Storm (ASCI Red / Red Storm)
- Full system operational time frame: June 1997 (processor and memory upgrade in 1999) / August 2004
- Theoretical peak, compute partition alone: 3.15 TF / 41.47 TF
- MP-Linpack performance: 2.38 TF / >30 TF (estimated)
- Architecture: distributed memory MIMD / distributed memory MIMD
- Number of compute node processors: 9,460 / 10,368
- Processor: Intel Pentium II @ 333 MHz / AMD Opteron @ 2 GHz
- Total memory: 1.2 TB / 10.4 TB (up to 80 TB)
- System memory bandwidth: 2.5 TB/s / 55 TB/s
- Disk storage: 12.5 TB / 240 TB
- Parallel file system bandwidth: 2.0 GB/s / 100.0 GB/s
- External network bandwidth: 0.4 GB/s / 50 GB/s
Comparison of ASCI Red and Red Storm, continued (ASCI Red / Red Storm)
- Interconnect topology: 3D mesh (x, y, z), 38 x 32 x 2 / 3D mesh (x, y, z), 27 x 16 x 24
- MPI latency: 15 µs 1 hop, 20 µs max / 2.0 µs 1 hop, 5 µs max
- Bi-directional bandwidth: 800 MB/s / 6.4 GB/s
- Minimum bi-section bandwidth: 51.2 GB/s / 2.3 TB/s
- Full-system RAS network: 10 Mbit Ethernet / 100 Mbit Ethernet
- RAS processors: 1 for each 32 CPUs / 1 for each 4 CPUs
- Compute node OS: Cougar / Catamount
- Service and I/O node OS: TOS (OSF1 UNIX) / LINUX
- RAS node OS: VxWorks / LINUX
- Red/Black switching: 2260 -- 4940 -- 2260 / 2688 -- 4992 -- 2688
- System footprint: ~2500 ft2 / ~3000 ft2
- Power requirement: 850 KW / 1.7 MW
Red Storm Project
• Goal: 23 months, design to First Product Shipment!
• System software is a joint project between Cray and Sandia:
  - Sandia is supplying the Catamount LWK and the service node run-time system
  - Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the Totalview port
• Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involved an FPGA implementation of the SEASTAR NIC/Router (Starfish); final checkout is on the real SEASTAR-based system
• System engineering is wrapping up! Cabinets exist; the SEASTAR NIC/Router RTAT came back from fabrication at IBM late last month
• Full system to be installed and turned over to Sandia in stages, culminating in August--December 2004
Designing for scalable scientific supercomputing
Challenges in: design, integration, management, and use
Design SUREty for Very Large Parallel Computer Systems
- Scalability: full-system hardware and system software
- Usability: required functionality only
- Reliability: hardware and system software
- Expense minimization: use commodity, high-volume parts
SURE poses computer system requirements. SURE architectural tradeoffs:
• Processor and memory sub-system balance
• Compute vs. interconnect balance
• Topology choices
• Software choices
• RAS
• Commodity vs. custom technology
• Geometry and mechanical design
Sandia strategies:
- Build on commodity
- Leverage Open Source (e.g., Linux)
- Add to commodity selectively (in RS there is basically one truly custom part!)
- Leverage experience with previous scalable supercomputers
System Scalability Driven Requirements
Overall System Scalability - Complex scientific applications such as molecular dynamics, hydrodynamics, & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).
- System software scalability: system software performance scales nearly perfectly with the number of processors, up to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with system size.
- Full re-boot time scales logarithmically with the system size.
- Job loading is logarithmic with the number of processors.
- Parallel I/O performance is not sensitive to the number of PEs doing I/O.
- Communication network software must be scalable:
  - prefer no connection-based protocols among compute nodes;
  - message buffer space independent of the number of processors;
  - the compute node OS gets out of the way of the application.
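For concreteness, the 50% target above uses the usual weak-scaling (scaled-speedup) definition of parallel efficiency; the slides do not spell it out, but the standard form is

\[
E_{\text{scaled}}(P) \;=\; \frac{T(1)}{T(P)}, \qquad \text{with the work per processor held fixed as } P \text{ grows,}
\]

so perfectly scalable hardware and system software keep the run time T(P), and hence the overhead, essentially constant as processors are added.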
Hardware scalability
• Balance in the node hardware:
  - Memory BW must match CPU speed (ideally 24 Bytes/flop; never yet done)
  - Communications speed must match CPU speed
  - I/O must match CPU speeds
• Scalable system SW (OS and libraries)
• Scalable applications
Usability
• Application code support: software that supports scalability of the computer system
  - Math libraries
  - MPI support for the full system size
  - Parallel I/O library
  - Compilers
• Tools that scale to the full size of the computer system
  - Debuggers
  - Performance monitors
• Full-featured LINUX OS support at the user interface
Reliability
• Light Weight Kernel (LWK) OS on the compute partition: much less code fails much less often
• Monitoring of correctable errors: fix soft errors before they become hard
• Hot swapping of components: the overall system keeps running during maintenance
• Redundant power supplies & memories
• Completely independent RAS system monitors virtually every component in the system
Economy
1. Use high-volume parts where possible
2. Minimize power requirements: cuts operating costs and reduces the need for new capital investment
3. Minimize system volume: reduces the need for large new capital facilities
4. Use standard manufacturing processes where possible; minimize customization
5. Maximize reliability and availability per dollar
6. Maximize scalability per dollar
7. Design for integrability
Economy
• Red Storm leverages economies of scale: AMD Opteron microprocessor & standard memory; air cooling; electrical interconnect based on Infiniband physical devices; Linux operating system
• Selected use of custom components:
  - System chip ASIC: critical for communication-intensive applications
  - Light Weight Kernel: truly custom, but we already have it (4th generation)
Cplant on a slide
[Diagram: Cplant system architecture. Compute nodes; service nodes for sys admin and users; file I/O nodes serving /home and other file systems; net I/O nodes connecting to Ethernet, ATM, and HiPPI; and system-support nodes for the operator(s).]
Goal: MPP “look and feel”
• Start ~1997, upgrade ~1999--2001
• Alpha & Myrinet, mesh topology
• ~3000 procs (3Tf) in 7 systems
• Configurable to ~1700 procs
• Red/Black switching
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
IA-32 Cplant on a slide
[Diagram: IA-32 Cplant system architecture, same layout as the Cplant diagram above.]
Goal: Mid-range capacity
• Started 2003, upgrade annually
• Pentium-4 & Myrinet, Clos network
• 1280 procs (~7 Tf) in 3 systems
• Currently configurable to 512 procs
• Linux w/ custom runtime & mgmt.
• Production operation for several yrs.
Observation: For most large scientific and engineering applications, performance is determined more by parallel scalability than by the speed of individual CPUs.
There must be balance between processor, interconnect, and I/O performance to achieve overall performance.
To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.
Let’s Compare Balance In Parallel Systems
Machine        Node speed rating (Mflops)   Network link BW (Mbytes/s)   Communications balance (Bytes/flop)
ASCI RED       400                          800 (533)                    2 (1.33)
ASCI RED**     666                          800 (533)                    (1.2) 0.67
T3E            1200                         1200                         1
Cplant         1000                         140                          0.14
Blue Mtn*      500                          800                          1.6
BlueMtn**      64000                        1200 (9600*)                 0.02 (0.16*)
Blue Pacific   2650                         300 (132)                    0.11 (0.05)
White          24000                        2000                         0.083
Q*             2500                         650                          0.26
Q**            10000                        400                          0.04
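The balance column is simply the ratio of the two columns to its left; as a quick check (plain arithmetic, not part of the original table):

\[
\text{balance (Bytes/flop)} \;=\; \frac{\text{network link BW (Mbytes/s)}}{\text{node speed (Mflops)}},
\qquad \text{e.g. ASCI Red: } \frac{800}{400} = 2, \quad \text{Cplant: } \frac{140}{1000} = 0.14 .
\]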
Comparing Red Storm and BGL
Blue Gene Light** Red Storm*
Node Speed 5.6 GF 5.6 GF (1x)
Node Memory 0.25--0.5 GB 2 (1--8) GB (4x nom.)
Network latency 7 µsecs 2 µsecs (2/7 x)
Network link BW 0.28 GB/s 6.0 GB/s (22x)
BW Bytes/Flops 0.05 1.1 (22x)
Bi-Section B/F 0.0016 0.038 (24x)
#nodes/problem 40,000 10,000 (1/4 x)
*100 TF version of Red Storm
* * 360 TF version of BGL
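The bandwidth-per-flop rows follow directly from the link-bandwidth and node-speed rows above them; shown here only as a sanity check:

\[
\text{Red Storm: } \frac{6.0\ \text{GB/s}}{5.6\ \text{GF/s}} \approx 1.1\ \text{Bytes/flop},
\qquad
\text{BGL: } \frac{0.28}{5.6} = 0.05\ \text{Bytes/flop}.
\]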
Fixed problem performance
Molecular dynamics problem (LJ liquid)
Scalable computing works
ASCI Red efficiencies for major codes
[Plot: scaled parallel efficiency (%) vs. processors (1 to 10,000) for QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (1B cells, 17M, 80M, 168M, and 532M cells), Finite Element, Zapotec, Reactive Fluid Flow, Salinas, and CTH.]
Basic Parallel Efficiency Model
[Plot: parallel efficiency vs. communication/computation load (0 to 1) for Red Storm (B=1.5), ASCI Red (B=1.2), a reference machine (B=1.0), the Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), and a standard Linux cluster (B=0.04). Balance is critical to scalability: peak and Linpack measurements sit at the low-communication end of the curves, while scientific and engineering codes sit at higher communication loads.]
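The slide does not give the formula behind these curves, but they are consistent with a minimal model in which a code's communication intensity c (bytes communicated per flop of useful work) competes with the machine balance B (bytes/s of link bandwidth per flop/s of node speed):

\[
E(c, B) \;\approx\; \frac{1}{1 + c/B},
\]

so a machine whose balance exceeds the code's communication intensity stays near 100% efficiency, while an unbalanced machine (B much smaller than c) loses efficiency roughly in proportion to B.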
Relating scalability and cost
[Plot: efficiency ratio (ASCI Red / Cplant), averaged over the five codes that consume >80% of Sandia's cycles, vs. processors (1 to 4096), with an extrapolation beyond the measured range. Above the line where the efficiency ratio equals the cost ratio (= 1.8), the MPP is more cost effective; below it, the cluster is more cost effective.]
Scalability determines cost effectiveness
Sandia's top priority computing workload:
[Histogram: total node-hours of jobs vs. number of nodes (1 to 10,000), with a crossover near 256 nodes. Jobs in the region where the MPP is more cost effective account for ~380M node-hours; jobs in the region where the cluster is more cost effective account for ~55M node-hours.]
Scalability also limits capability
ITS speedup curves
[Plot: speedup (0 to 1200) vs. processors (0 to 1408) for ASCI Red and Cplant, with polynomial fits; the annotation "~3x processors" marks the gap between the two curves.]
Commodity nearly everywhere-- Customization drives cost
• Earth Simulator and Cray X-1 are fully custom vector systems with good balance
  - This drives their high cost (and their high performance)
• Clusters are nearly entirely high-volume with no truly custom parts
  - Which drives their low cost (and their low scalability)
• Red Storm uses custom parts only where they are critical to performance and reliability
  - High scalability at minimal cost/performance
“Honey, it’s not one of those…” or
Hints on MPP Computing
[Excerpted from a talk with this title given by Bill Camp at CUG-Tours in October 1994]
Issues in MPP Computing:
1. Physically shared memory does not scale
2. Data must be distributed
3. No single data layout may be optimal
4. The optimal data layout may change during the computation
5. Communications are expensive
6. The single control stream in SIMD computing makes it simple-- at the cost of severe loss in performance-- due to load balancing problems
7. In data parallel computing (à la CM-5) there can be multiple control streams-- but with global synchronization
Less simple but overhead remains an issue
8. In MIMD computing there are many control streams loosely synchronized (e.g., with messages)
Powerful, flexible and complex
Why doesn’t shared memory scale?
[Diagram: many CPUs, each with its own cache, sharing memory banks through a switch.]
- Bank conflicts: about a 40% hit for large numbers of banks and CPUs
- Memory coherency: who has the data, and can I access it?
- High, non-deterministic latencies
Amdahl’s Law
Time on a single processor:
T1 = Tser + Tpar
Time on P processors:
Tp = Tser + Tpar/P + Tcomm's
Ignoring communications, the speedup Sp (= T1/Tp) is then
Sp = { fser + [ 1 - fser ] / P }^-1
where fser = Tser / T1
So, Sp < 1 / fser
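A quick worked instance of the bound (plain arithmetic, not from the slide): with only 1% serial work,

\[
f_{\text{ser}} = 0.01,\ P = 10{,}000:\qquad
S_P = \frac{1}{0.01 + 0.99/10{,}000} \approx 99 \;<\; \frac{1}{f_{\text{ser}}} = 100,
\]

so even a 10,000-processor machine delivers at most about a 99x speedup on that code.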
The Axioms
Axiom 1: Amdahl’s Law is inviolate (Sp < 1 / fser )
Axiom 2: Amdahl’s Law doesn’t matter for MPP if you know what you are doing (Comm’s dominate)
Axiom 3 : Nature is parallel
Axiom 4 : Nature is (mostly) local
Axiom 5 : Physical shared memory does not scale
Axiom 6 : Physically distributed memory does
Axiom 7 : Nevertheless, a global address space is nice to have
The Axioms
Axiom 8: Like solar energy, automatic parallelism is the technology of the future
Axiom 9: successful parallelism requires the near total suppression of serialism
Axiom 10 : The best thing you can do with a processor is serial execution
Axiom 11 : Axioms 9 &10 are not contradictory
Axiom 12 : MPPs are for doing large problems fast (if you need to do a small problem fast, look
elsewhere).
Axiom 13 : Generals build weapons to win the last war (so do computer scientists)
The Axioms
Axiom 14 : first find coarse-grained, then medium-grained, then fine-grained parallelism
Axiom 15: done correctly, the gain from these is multiplicative
Axiom 16 : Life’s a balancing act; so’s MPP computing
Axiom 17 : Be an introvert-- never communicate needlessly
Axiom 18 : Be independent; never synchronize needlessly
Axiom 19 : Parallel computing is a cold world, bundle up well
The Axioms
Axiom 20 : I/O should only be done under medical supervision
Axiom 21: If MPP computin’ is easy it ain’t cheap
Axiom 22 : If MPP computin’ is cheap, it ain’t easy
Axiom 23 : The difficulty of programming an MPP effectively is directly proportional to latency
Axiom 24 : The parallelism is in the problem, not in the code
The Axioms
Axiom 25 : There are an infinite number of parallel algorithms
Axiom 26 : There are no parallel algorithms (Simon’s theorem)-- it’s almost true
Axiom 27: The best parallel algorithm is almost always a parallel implementation of the best serial algorithm (what Horst really meant)
Axiom 28 : Amdahl’s Law DOES limit vector speedup!
Axiom 18’ : Work in teams (sometimes SIMD constructs are just what the Doctor ordered)!
Axiom 29 : Do try this at home!
(Some of) the Hints
Hint 1:
Any amount of serial computing is death
So… 1) make the problem large
2) Look everywhere for serialism and purge it from your code
3) Never, ever, ever add serial statements
(Some of) the Hints
Hint 2:
Keep communications in the noise!
So… 1) Don’t do little problems on big computers
2) Change algorithms when profitable
3) Bundle up!-- avoid small messages on high-latency interconnects
4) Don’t waste memory-- using all the memory on a node minimizes the ratio of communications to useful work
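As a concrete illustration of “bundle up” (a minimal sketch in C with MPI, using a hypothetical array of per-item values; this is not code from the talk):

```c
/* Minimal sketch (hypothetical example, not from the talk): avoid many small
 * messages by packing items into one buffer and sending it once. */
#include <mpi.h>

#define N_ITEMS 1000

/* N_ITEMS separate sends: each message pays the full per-message latency. */
static void send_items_naive(double *items, int dest, MPI_Comm comm)
{
    for (int i = 0; i < N_ITEMS; i++)
        MPI_Send(&items[i], 1, MPI_DOUBLE, dest, 0, comm);
}

/* One bundled send: the latency is amortized over all N_ITEMS values. */
static void send_items_bundled(double *items, int dest, MPI_Comm comm)
{
    MPI_Send(items, N_ITEMS, MPI_DOUBLE, dest, 1, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double items[N_ITEMS];
    for (int i = 0; i < N_ITEMS; i++)
        items[i] = (double)i;

    if (rank == 0) {
        send_items_bundled(items, 1, MPI_COMM_WORLD);  /* prefer this...  */
        (void)send_items_naive;                        /* ...not this     */
    } else if (rank == 1) {
        MPI_Recv(items, N_ITEMS, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

On a 15 µs-latency network like ASCI Red's, the naive version pays roughly 15 ms of latency for 1,000 items before any bandwidth is used; the bundled version pays that cost once.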
(Some of) the Hints
Hint 3:
The parallelism is in the problem!
E.G. SAR, Monte Carlo, Direct Sparse solvers, Molecular Dynamics
So,… 1) Look first at the problem
2) Look second at algorithms
3) Look at data structures in the code
4) don’t waste cycles on line-by-line parallelism
(Some of) the Hints
Hint 4:
Incremental Parallelism Is Too Inefficient! Don’t fiddle with the Fortran
Look at the Problem:
-- Identify the kinds of parallelism it contains
1) Multi-program
2) Multi-task
4) data parallelism
5) inner-loop parallelism (e.g. vectors)
(Some of) the Hints
Hint 5:
Often: with Explicit Message Passing (EMP) or gets/puts
You can re-use virtually all of your code
(changes and additions ~ few%)
-- With data parallel languages, you re-write your code
It can be easy
but
Performance is usually unacceptable
(Some of) the Hints
Hint 6:
Load Balancing (Use existing libraries and technology)
-Easy in EMP!
-Hard (or impossible) in HPF, F90, CMF, …
-Only load balance if Tnew + Tbal < Told
Static or Dynamic:
Graph-based
Geometry-based
Particle-based
Hierarchical Master-Slave
(Some of) the Hints
Hint 7:
Synchronization is expensive
So, … Don’t do it unless you have to
Never, ever put in synchronization just to get rid of a bug
else you’ll be stuck with it for the life of the code!
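A small sketch of what “only synchronize when you have to” looks like in practice (a hypothetical C/MPI example, not from the talk): a nearest-neighbor exchange needs no global barrier, because the message matching itself orders exactly the ranks that must wait on each other.

```c
/* Minimal sketch (hypothetical, not from the talk): neighbor exchange on a
 * 1-D ring using MPI_Sendrecv.  The message matching provides all the
 * synchronization the update needs; no MPI_Barrier is required. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double my_edge = (double)rank;   /* local boundary value            */
    double left_ghost, right_ghost;  /* ghost values from the neighbors */

    /* Exchange with both neighbors; each receive completes exactly when the
     * needed data has arrived, so ranks stay only loosely coupled. */
    MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, right, 0,
                 &left_ghost, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, left, 1,
                 &right_ghost, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... advance the local computation using left_ghost and right_ghost ... */

    MPI_Finalize();
    return 0;
}
```

Wrapping such an exchange in an MPI_Barrier would force every rank to wait for the slowest one, which is exactly the needless synchronization this hint warns against.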
(Some of) the Hints
Hint 8:
I/O can ruin your whole afternoon:
It is amazing how many people will create wonderfully scalable codes only to spoil them with needless, serial, or unbalanced I/O
Use I/O sparingly
Stage I/O carefully
(Some of) the Hints
Hint 9:
Religious prejudice is the bane of computing
Caches aren’t inherently bad
Vectors aren’t inherently good
Small SMP’s will not ruin your life
Single processor nodes are not killers
…
La Fin (The End)
Scaling data for some key engineering codes
Performance on Engineering Codes
[Plot: scaled parallel efficiency vs. processors (1 to 1024) for ITS and ACME on ASCI Red and Cplant.]
- Random variation at small proc. counts
- Large differential in efficiency at large proc. counts
Scaling data for some key physics codes
Los Alamos’ radiation transport code: parallel Sn neutronics (provided by LANL)
[Plot: PARTISN diffusion solver sizeup study (S6P2, 12 groups, 13,800 cells/PE). Parallel efficiency vs. number of processor elements (1 to 2048) for ASCI Red, Blue Mountain, White, and QSC.]
[Plot: PARTISN transport solver sizeup study (S6P2, 12 groups, 13,800 cells/PE). Parallel efficiency vs. number of processor elements (1 to 2048) for ASCI Red, Blue Mountain, White, and QSC.]