1
CS 594, Spring 2003
Lecture 1: Overview of High-Performance Computing
Jack Dongarra
Computer Science Department, University of Tennessee
2
Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
Limitations:
Too difficult -- build large wind tunnels.
Too expensive -- build a throw-away passenger jet.
Too slow -- wait for climate or galactic evolution.
Too dangerous -- weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon
» based on known physical laws and efficient numerical methods.
3
Computational Science Definition
Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science fuses three distinct elements:
numerical algorithms and modeling and simulation software developed to solve science (e.g., biological, physical, and social), engineering, and humanities problems;
advanced system hardware, software, networking, and data management components developed through computer and information science to solve computationally demanding problems;
the computing infrastructure that supports both science and engineering problem solving and developmental computer and information science.
4
Some Particularly Challenging Computations
Science
Global climate modeling
Astrophysical modeling
Biology: genomics; protein folding; drug design
Computational chemistry
Computational material sciences and nanosciences
Engineering
Crash simulation
Semiconductor design
Earthquake and structural modeling
Computational fluid dynamics (airplane design)
Combustion (engine design)
Business
Financial and economic modeling
Transaction processing, web services and search engines
Defense
Nuclear weapons -- test by simulation
Cryptography
5
Why Turn to Simulation?
When the problem is too . . .
Complex
Large / small
Expensive
Dangerous
to do any other way.
(movie: Taurus_to_Taurus_60per_30deg.mpeg)
6
Complex Systems Engineering
[Diagram: NASA's integrated environment -- supercomputers, storage, and networks; analysis and visualization; computation management (AeroDB, ILab); Grand Challenges; next-generation codes and algorithms (OVERFLOW, Honorable Mention, NASA Software of the Year, STS-107; INS3D, NASA Software of the Year, turbopump analysis; CART3D, NASA Software of the Year, STS-107); modeling environment (experts and tools): compilers, scaling and porting, parallelization tools.]
R&D team (Grand Challenge driven): Ames Research Center, Glenn Research Center, Langley Research Center
Engineering team (operations driven): Johnson Space Center, Marshall Space Flight Center, Industry Partners
Source: Walt Brooks, NASA
7
Economic Impact of HPC
Airlines:
System-wide logistics optimization systems on parallel systems.
Savings: approx. $100 million per airline per year.
Automotive design:
Major automotive companies use large systems (500+ CPUs) for:
» CAD-CAM, crash testing, structural integrity and aerodynamics.
» One company has a 500+ CPU parallel system.
Savings: approx. $1 billion per company per year.
Semiconductor industry:
Semiconductor firms use large systems (500+ CPUs) for:
» device electronics simulation and logic validation.
Savings: approx. $1 billion per company per year.
Securities industry:
Savings: approx. $15 billion per year for U.S. home mortgages.
8
Pretty Pictures
9
Why Turn to Simulation?
Climate / weather modeling
Data-intensive problems (data mining, oil reservoir simulation)
Problems with large length and time scales (cosmology)
10
Titov’s Tsunami Simulation
(movie: tsunami-nw10.mov)
Global model
11
Cost (Economic Loss) to Evacuate 1 Mile of Coastline: $1M
We now over-warn by a factor of 3.
Average over-warning is 200 miles of coastline, or $200M per event.
12
This problem demands a complete, STABLE environment (hardware and software)
100 TF to stay a factor of 10 ahead of the weather
Streaming observations
Massive storage and metadata query
Fast networking
Visualization
Data mining for feature detection
24 Hour Forecast at Fine Grid Spacing
13
Units of High Performance Computing
1 Mflop/s   1 Megaflop/s   10^6 Flop/sec
1 Gflop/s   1 Gigaflop/s   10^9 Flop/sec
1 Tflop/s   1 Teraflop/s   10^12 Flop/sec
1 Pflop/s   1 Petaflop/s   10^15 Flop/sec
1 MB        1 Megabyte     10^6 Bytes
1 GB        1 Gigabyte     10^9 Bytes
1 TB        1 Terabyte     10^12 Bytes
1 PB        1 Petabyte     10^15 Bytes
14
High-Performance Computing Today
In the past decade, the world has experienced one of the most exciting periods in computer development.
Microprocessors have become smaller, denser, and more powerful.
The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.
15
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years
Called "Moore's Law"
Moore's Law
Microprocessors have become smaller, denser, and more powerful.
Not just processors: bandwidth, storage, etc.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
16
Eniac and My Laptop
                       Eniac (1945)     My Laptop (2002)
Performance (FP/sec)   800              5,000,000,000
Memory (bytes)         ~200             1,073,741,824
Cost (1999 dollars)    4,630,000        1,000
Power (watts)          20,000           60
Size (m^3)             68               0.0028
Weight (kg)            27,200           0.9
Devices                18,000           6,000,000,000
17
No Exponential is Forever, But Perhaps We Can Delay it Forever
Processor                       Year of Introduction   Transistors
4004                            1971                   2,250
8008                            1972                   2,500
8080                            1974                   5,000
8086                            1978                   29,000
286                             1982                   120,000
Intel386™ processor             1985                   275,000
Intel486™ processor             1989                   1,180,000
Intel® Pentium® processor       1993                   3,100,000
Intel® Pentium® II processor    1997                   7,500,000
Intel® Pentium® III processor   1999                   24,000,000
Intel® Pentium® 4 processor     2000                   42,000,000
Intel® Itanium® processor       2002                   220,000,000
Intel® Itanium® 2 processor     2003                   410,000,000
18
Today's Processors
Some equivalences for the microprocessors of today:
Voltage level
» A flashlight (~1 volt)
Current level
» An oven (~250 amps)
Power level
» A light bulb (~100 watts)
Area
» A postage stamp (~1 square inch)
19
Moore's "Law"
Something doubles every 18-24 months.
That something was originally the number of transistors.
That something is now also taken to be performance.
Moore's Law is an exponential.
Exponentials cannot last forever
» However, Moore's Law has held remarkably true for ~30 years.
BTW: it is really an empiricism rather than a law (not a derogatory comment).
20
Percentage of Peak
A rule of thumb that often applies:
A contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance.
There are exceptions to this rule, in both directions.
Why such low efficiency?
There are two primary reasons behind the disappointing percentage of peak:
IPC (in)efficiency
Memory (in)efficiency
21
IPC
Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in the Itanium).
Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2-1.4.
We are leaving ~75% of the possible performance on the table... (see the sketch below)
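To make that arithmetic concrete, here is a minimal sketch in C. The 1 GHz clock rate is an assumed, illustrative figure; the issue width of 4 and the sustained IPC of 1.2 are the numbers quoted above.

```c
/* Rough percent-of-peak estimate from IPC, assuming a hypothetical
 * 1 GHz RISC processor that can issue 4 instructions per cycle. */
#include <stdio.h>

int main(void) {
    double clock_hz     = 1.0e9;  /* assumed clock rate              */
    double issue_width  = 4.0;    /* theoretical instructions/cycle  */
    double measured_ipc = 1.2;    /* typical sustained IPC (1.2-1.4) */

    double peak      = clock_hz * issue_width;   /* instructions/s */
    double sustained = clock_hz * measured_ipc;

    printf("fraction of peak:  %.0f%%\n", 100.0 * sustained / peak);
    printf("left on the table: %.0f%%\n", 100.0 * (1.0 - sustained / peak));
    return 0;
}
```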
22
Why Fast Machines Run Slow
Latency
Waiting for access to memory or other parts of the system.
Overhead
Extra work that has to be done to manage program concurrency and parallel resources, instead of the real work you want to perform.
Starvation
Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
Contention
Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.
23
Extra Transistors
With the increasing number of transistors per chip from reduced design rules, do we:
Add more functional units?
» Little gain, owing to poor IPC for today's codes, compilers and ISAs.
Add more cache?
» This generally helps but does not solve the problem.
Add more processors?
» This helps somewhat.
» This hurts somewhat.
24
Processor vs. Memory Speed
In 1986:
processor cycle time ~120 nanoseconds
DRAM access time ~140 nanoseconds
» 1:1 ratio
In 1996:
processor cycle time ~4 nanoseconds
DRAM access time ~60 nanoseconds
» 20:1 ratio
In 2002:
processor cycle time ~0.6 nanoseconds
DRAM access time ~50 nanoseconds
» 100:1 ratio
25
Latency in a Single System
[Chart: CPU clock period (ns) and memory-system access time (ns), 1997-2009, with the memory-to-CPU ratio climbing from roughly 100 toward 500 -- "THE WALL".]
26
Memory Hierarchy
Typical latencies for today's technology:

Hierarchy         Processor clocks
Register          1
L1 cache          2-3
L2 cache          6-12
L3 cache          14-40
Near memory       100-300
Far memory        300-900
Remote memory     O(10^3)
Message-passing   O(10^3)-O(10^4)
27
Memory Hierarchy
Most programs have a high degree of locality in their accesses:
spatial locality: accessing things nearby previous accesses
temporal locality: reusing an item that was previously accessed
The memory hierarchy tries to exploit locality (see the sketch after the diagram below).
[Diagram: the memory hierarchy from the processor (registers, datapath, control) through on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape); sizes grow from bytes and KB to MB, GB, and TB while speeds fall from ~1 ns and ~10 ns to ~100 ns, ~10 ms, and ~10 s.]
28
Memory Bandwidth
To provide bandwidth to the processor, the bus either needs to be faster or wider.
Busses are limited to perhaps 400-800 MHz.
Links are faster:
Single-ended: 0.5-1 GT/s
Differential: 2.5-5.0 GT/s (future)
Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size without helping bandwidth.
Making things wider requires pin-out (Si real estate) and power.
Both power and pin-out are serious issues.
29
What's Needed for Memory?
This is a physics problem.
"Money can buy bandwidth, but latency is forever!"
There are possibilities being studied.
They generally involve putting "processing" at the memory:
» reduces the latency and increases the bandwidth
Fundamental architectural research is lacking.
Government funding is a necessity.
30
Processor in Memory (PIM)
PIM merges logic with memory:
Wide ALUs next to the row buffer
Optimized for memory throughput, not ALU utilization
PIM has the potential of riding Moore's law while
greatly increasing effective memory bandwidth,
providing many more concurrent execution threads,
reducing latency, reducing power, and increasing overall system efficiency.
It may also simplify programming and system design.
[Diagram: PIM nodes, each a memory stack over sense amps, decode logic, and node logic.]
31
Internet -- 4th Revolution in Telecommunications
Telephone, radio, television
Growth in the Internet outstrips the others
Exponential growth since 1985
Traffic doubles every 100 days
Growth of Internet Hosts, Sept. 1969 - Sept. 2002
[Chart: number of hosts (domain names) vs. time, growing from a handful in 9/69 to over 200,000,000 by 08/02.]
32
The Web Phenomenon
90 - 93: Web invented
U of Illinois Mosaic released March 94, ~0.1% of traffic
September 93: ~1% of traffic w/ 200 sites
June 94: ~10% of traffic w/ 2,000 sites
Today: 60% of traffic w/ 2,000,000 sites
Every organization, company, school
33
Peer-to-Peer Computing
Peer-to-peer is a style of networking in which a group of computers communicate directly with each other.
Wireless communication
Home computer in the utility room, next to the water heater and furnace
Web tablets
BitTorrent
http://en.wikipedia.org/wiki/Bittorrent
Embedded computers in things, all tied together
Books, furniture, milk cartons, etc.
Smart appliances
Refrigerator, scale, etc.
34
Internet On Everything
35
SETI@home: Global Distributed Computing
Running on 500,000 PCs, ~1000 CPU years per day
485,821 CPU years so far
Sophisticated data & signal processing analysis
Distributes datasets from the Arecibo radio telescope
36
SETI@home
Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
When a computer is idle (its cycles otherwise wasted), the software downloads a 300-kilobyte chunk of data for analysis, performing about 3 Tflop of work per client over 15 hours.
The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
Largest distributed computation project in existence
Averaging 40 Tflop/s
Today a number of companies are trying this for profit.
Google Query Attributes
150M queries/day (2,000/second)
100 countries
8.0B documents in the index
Data centers
100,000 Linux systems in data centers around the world
» 15 Tflop/s and 1,000 TB total capability
» 40-80 1U/2U servers per cabinet
» 100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
growth from 4,000 systems (June 2000)
» 18M queries then
Performance and operation
simple reissue of failed commands to new servers
no performance debugging
» problems are not reproducible
Source: Monika Henzinger, Google, & Cleve Moler
Forward links are referred to in the rows; back links are referred to in the columns.
Eigenvalue problem: Ax = λx, with n = 8×10^9.
(see: MathWorks, Cleve's Corner)
The matrix is the transition probability matrix of the Markov chain; the ranking vector satisfies Ax = x.
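The slide only states the eigenvalue problem; one toy sketch of how such a ranking vector can be computed is power iteration on a column-stochastic link matrix, shown below in C. The 3-page "web" and its link probabilities are invented purely to illustrate the Ax = x computation at miniature scale.

```c
/* Toy power iteration on a tiny column-stochastic matrix: the same
 * kind of eigenvector computation (Ax = x) described above, on a
 * 3-node "web" instead of n = 8e9 pages. */
#include <stdio.h>

#define N 3

int main(void) {
    /* A[i][j] = probability of following a link from page j to page i */
    double A[N][N] = { {0.0, 0.5, 0.5},
                       {0.5, 0.0, 0.5},
                       {0.5, 0.5, 0.0} };
    double x[N] = {1.0 / N, 1.0 / N, 1.0 / N}, y[N];

    for (int iter = 0; iter < 100; iter++) {
        double norm = 0.0;
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
            norm += y[i];
        }
        for (int i = 0; i < N; i++) x[i] = y[i] / norm;  /* keep sum = 1 */
    }
    for (int i = 0; i < N; i++) printf("rank[%d] = %f\n", i, x[i]);
    return 0;
}
```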
Next Generation Web
To treat CPU cycles and software like commodities.
Enable the coordinated use of geographically distributed resources -- in the absence of central control and existing trust relationships.
Computing power is produced much like utilities such as power and water are produced for consumers.
Users will have access to "power" on demand.
This is one of our efforts at UT.
39
Where Has This Performance Improvement Come From?
Technology?
Organization?
Instruction set architecture?
Software?
Some combination of all of the above?
40
Impact of Device Shrinkage
What happens when the feature size (transistor size) shrinks by a factor of x?
Clock rate goes up by x, because wires are shorter
actually less than x, because of power consumption
Transistors per unit area go up by x^2
Die size also tends to increase
typically another factor of ~x
Raw computing power of the chip goes up by ~x^4!
of which x^3 is devoted either to parallelism or locality
41
How fast can a serial computer be?
Consider a 1 Tflop/s sequential machine (checked in the sketch below):
data must travel some distance, r, to get from memory to the CPU
to get 1 data element per cycle, this means 10^12 times per second
at the speed of light, c = 3×10^8 m/s
so r < c/10^12 = 0.3 mm
Now put 1 TB of storage in a 0.3 mm^2 area:
each word occupies about 3 square Angstroms, the size of a small atom
r = 0.3 mm for a 1 Tflop/s, 1 TB sequential machine
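The speed-of-light bound above is easy to check numerically; a minimal C sketch using only the constants stated on the slide:

```c
/* Back-of-the-envelope check of the speed-of-light argument:
 * to fetch one operand per cycle at 1 Tflop/s, data can travel at
 * most c / 1e12 meters per cycle. */
#include <stdio.h>

int main(void) {
    double c    = 3.0e8;    /* speed of light, m/s            */
    double rate = 1.0e12;   /* one memory reference per cycle */

    double r_m = c / rate;                /* max distance per cycle, meters */
    printf("r = %.1f mm\n", r_m * 1e3);   /* prints r = 0.3 mm */
    return 0;
}
```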
42
Processor-Memory Problem
Processors issue instructions roughly every nanosecond.
DRAM can be accessed roughly every 100 nanoseconds (!).
DRAM cannot keep processors busy! And the gap is growing:
processors getting faster by 60% per year
DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)
43
[Chart: processor vs. DRAM performance, 1980-2004, log scale. µProc improves 60%/yr (2X/1.5 yr, "Moore's Law"); DRAM improves 9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.]
Processor-DRAM Memory Gap
44
Why Parallel Computing?
Desire to solve bigger, more realistic application problems.
Fundamental limits are being approached.
More cost-effective solution.
45
Principles of Parallel Computing
Parallelism and Amdahl's Law
Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling
All of these things make parallel programming even harder than sequential programming.
46
“Automatic” Parallelism in Modern Machines
Bit-level parallelism
within floating point operations, etc.
Instruction-level parallelism (ILP)
multiple instructions execute per clock cycle
Memory system parallelism
overlap of memory operations with computation
OS parallelism
multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high performance, the user must identify, schedule and coordinate parallel tasks.
47
Finding Enough Parallelism
Suppose only part of an application seems parallel.
Amdahl's law:
let fs be the fraction of work done sequentially, so (1 - fs) is the fraction parallelizable
let N = number of processors
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
48
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

t_N = (f_p/N + f_s) t_1    Effect of multiple processors on run time
S = 1 / (f_s + f_p/N)      Effect of multiple processors on speedup

Where:
f_s = serial fraction of code
f_p = parallel fraction of code = 1 - f_s
N = number of processors
(see the sketch below)
Amdahl’s Law
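A small C sketch that evaluates the speedup expression S = 1/(f_s + f_p/N) given above for a few serial fractions and processor counts; the particular values are chosen only to echo the curves on the chart that follows.

```c
/* Amdahl's Law exactly as written above: speedup S = 1 / (fs + fp/N). */
#include <stdio.h>

double amdahl_speedup(double fs, int n) {
    double fp = 1.0 - fs;               /* parallel fraction */
    return 1.0 / (fs + fp / (double)n);
}

int main(void) {
    double fs_values[] = {0.0, 0.001, 0.01, 0.1};   /* serial fractions   */
    int    procs[]     = {10, 100, 1000};           /* processor counts N */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("fs = %.3f, N = %4d -> S = %7.1f\n",
                   fs_values[i], procs[j],
                   amdahl_speedup(fs_values[i], procs[j]));
    return 0;
}
```

Even 1% serial content caps the speedup on 1000 processors at roughly 91, which is the point the next chart makes graphically.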
49
[Chart: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, and 0.900; only fp = 1.000 tracks the ideal line.]
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Illustration of Amdahl's Law
50
Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting the desired speedup.
Parallelism overheads include:
cost of starting a thread or process
cost of communicating shared data
cost of synchronizing
extra (redundant) computation
Each of these can be in the range of milliseconds (= millions of flops) on some systems.
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work; see the sketch below.
Locality and Parallelism
Large memories are slow; fast memories are small.
Storage hierarchies are large and fast on average.
Parallel processors, collectively, have large, fast caches.
The slow accesses to "remote" data we call "communication".
The algorithm should do most work on local data.
[Diagram: conventional storage hierarchy (processor + cache, L2 cache, L3 cache, memory) replicated per processor, with potential interconnects between the memories.]
52
Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
insufficient parallelism (during that phase)
unequal size tasks
Examples of the latter:
adapting to "interesting parts of a domain"
tree-structured computations
fundamentally unstructured problems
The algorithm needs to balance the load.
53
Performance Trends Revisited (Microprocessor Organization)
[Chart: transistors per chip, 1970-2000, log scale, from the i4004 and i8080 through the i8086, i80286, i80386, r3010, r4000, and r4400.]
• Bit-level parallelism
• Pipelining
• Caches
• Instruction-level parallelism
• Out-of-order execution
• Speculation
• . . .
54
Greater instruction-level parallelism?
Bigger caches?
Multiple processors per chip?
Complete systems on a chip? (portable systems)
High-performance LAN, interface, and interconnect?
What is Ahead?
55
Directions
Move toward shared memory
SMPs and distributed shared memory
Shared address space with deep memory hierarchy
Clustering of shared memory machines for scalability
Efficiency of message passing and data parallel programming
Helped by standards efforts such as MPI and HPF
56
High Performance Computers
~20 years ago:
1×10^6 floating point ops/sec (Mflop/s)
» scalar based
~10 years ago:
1×10^9 floating point ops/sec (Gflop/s)
» vector & shared memory computing, bandwidth aware
» block partitioned, latency tolerant
~Today:
1×10^12 floating point ops/sec (Tflop/s)
» highly parallel, distributed processing, message passing, network based
» data decomposition, communication/computation
~5 years away:
1×10^15 floating point ops/sec (Pflop/s)
» many more levels of memory hierarchy, combination of grids & HPC
» more adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
57
Top 500 Computers
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark
  Ax = b, dense problem (see the sketch below)
Updated twice a year:
SC'xy in the States in November
Meeting in Germany in June
[Chart: TPP performance (rate) vs. problem size.]
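As a sketch of how the yardstick turns into a number: the standard LINPACK operation count for a dense Ax = b solve is 2/3·n^3 + 2·n^2 flops, so a timed run converts to an Rmax figure as below. The problem size and wall-clock time here are placeholder values, not taken from any real Top500 run.

```c
/* The Top500 yardstick Rmax comes from timing a dense Ax = b solve.
 * The standard LINPACK operation count is 2/3*n^3 + 2*n^2 flops. */
#include <stdio.h>

double linpack_flops(double n) {
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

int main(void) {
    double n       = 1.0e6;     /* problem size used in the run (example) */
    double seconds = 9400.0;    /* measured wall-clock time (example)     */

    double rmax = linpack_flops(n) / seconds;   /* flop/s */
    printf("Rmax = %.2f Tflop/s\n", rmax / 1.0e12);
    return 0;
}
```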
58
What is a Supercomputer?
A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
Over the last 10 years the range for the Top500 has increased faster than Moore's Law:
1993:
#1 = 59.7 GFlop/s
#500 = 422 MFlop/s
2004:
#1 = 70 TFlop/s
#500 = 850 GFlop/s
Why do we need them?
Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways:
computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
59
24th List: The TOP10
1.  BlueGene/L β-System, IBM -- 70.72 TF/s -- DOE/IBM, USA, 2004, 32,768 processors
2.  Columbia (Altix, Infiniband), SGI -- 51.87 TF/s -- NASA Ames, USA, 2004, 10,160 processors
3.  Earth-Simulator, NEC -- 35.86 TF/s -- Earth Simulator Center, Japan, 2002, 5,120 processors
4.  MareNostrum (BladeCenter JS20, Myrinet), IBM -- 20.53 TF/s -- Barcelona Supercomputer Center, Spain, 2004, 3,564 processors
5.  Thunder (Itanium2, Quadrics), CCD -- 19.94 TF/s -- Lawrence Livermore National Laboratory, USA, 2004, 4,096 processors
6.  ASCI Q (AlphaServer SC, Quadrics), HP -- 13.88 TF/s -- Los Alamos National Laboratory, USA, 2002, 8,192 processors
7.  X (Apple XServe, Infiniband), self-made -- 12.25 TF/s -- Virginia Tech, USA, 2004, 2,200 processors
8.  BlueGene/L DD1 (500 MHz), IBM/LLNL -- 11.68 TF/s -- Lawrence Livermore National Laboratory, USA, 2004, 8,192 processors
9.  pSeries 655, IBM -- 10.31 TF/s -- Naval Oceanographic Office, USA, 2004, 2,944 processors
10. Tungsten (PowerEdge, Myrinet), Dell -- 9.82 TF/s -- NCSA, USA, 2003, 2,500 processors
399 systems > 1 TFlop/s; 294 machines are clusters; top10 average 8K processors
60
Performance Development
[Chart: Top500 performance, 1993-2004, log scale from 100 Mflop/s to 1 Pflop/s. The summed performance (SUM) grows from 1.167 TF/s to 1.127 PF/s; #1 (N=1) grows from 59.7 GF/s (Fujitsu 'NWT' NAL) through Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator to 70.72 TF/s (IBM BlueGene/L); #500 (N=500) grows from 0.4 GF/s to 850 GF/s, with "My Laptop" shown for reference.]
61
Performance Projection
[Chart: extrapolation of the Top500 trend lines (N=1, N=500, SUM) from 1993 out to 2015, log scale from 100 Mflop/s to 1 Eflop/s, with BlueGene/L and the DARPA HPCS program marked.]
62
Performance Projection
[Chart: the same projection with "My Laptop" added for reference.]
63
Customer Segments / Systems
[Chart: number of Top500 systems by customer segment (industry, research, academic, classified, vendor, government), 1993-2004.]
64
Manufacturers / Systems
[Chart: number of Top500 systems by manufacturer (Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others), 1993-2004.]
65
Processor Types
[Chart: number of Top500 systems by processor type (SIMD, vector, and scalar: Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2004.]
66
Interconnects / Systems
[Chart: number of Top500 systems by interconnect (N/A, crossbar, SP Switch, Myrinet, Cray interconnect, Gigabit Ethernet, Quadrics, Infiniband, others), 1993-2004.]
67
Top500 Performance by Manufacturer (11/04)
IBM 49%
HP 21%
others 14%
SGI 7%
NEC 4%
Fujitsu 2%
Cray 2%
Hitachi 1%
Sun 0%
Intel 0%
68
Clusters (NOW) / Systems
[Chart: number of Top500 cluster (NOW) systems, 1997-2004, by type: IBM Cluster, NOW - Pentium, HP AlphaServer, HP Cluster, NOW - AMD, Dell Cluster, NOW - Sun, NOW - Alpha, Sun Fire - Intel, others.]
69
Top500 Conclusions
Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
MPPs continue to account for more than half of all installed high-performance computers worldwide.
70
Performance Numbers on RISC Processors

Processor            Cycle Time (MHz)   Linpack n=100 (Mflop/s)   Linpack n=1000 (Mflop/s)   Peak (Mflop/s)
Intel P4             2540               1190 (23%)                2355 (46%)                 5080
Intel/HP Itanium 2   1000               1102 (27%)                3534 (88%)                 4000
Compaq Alpha         1000               824 (41%)                 1542 (77%)                 2000
AMD Athlon           1200               558 (23%)                 998 (42%)                  2400
HP PA                550                468 (21%)                 1583 (71%)                 2200
IBM Power 3          375                424 (28%)                 1208 (80%)                 1500
Intel P3             933                234 (25%)                 514 (55%)                  933
PowerPC G4           533                231 (22%)                 478 (45%)                  1066
SUN Ultra 80         450                208 (23%)                 607 (67%)                  900
SGI Origin 2K        300                173 (29%)                 553 (92%)                  600
Cray T90             454                705 (39%)                 1603 (89%)                 1800
Cray C90             238                387 (41%)                 902 (95%)                  952
Cray Y-MP            166                161 (48%)                 324 (97%)                  333
Cray X-MP            118                121 (51%)                 218 (93%)                  235
Cray J-90            100                106 (53%)                 190 (95%)                  200
Cray 1               80                 27 (17%)                  110 (69%)                  160
71
High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
COTS PC nodes
Pentium, Alpha, PowerPC, SMP
COTS LAN/SAN interconnect
Ethernet, Myrinet, Giganet, ATM
Open source Unix
Linux, BSD
Message-passing computing
MPI, PVM, HPF

Advantages:
Best price-performance
Low entry-level cost
Just-in-place configuration
Vendor invulnerable
Scalable
Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message-passing libraries. However, it is much more of a contact sport.
72
Peak performance, interconnection: http://clusters.top500.org
Benchmark results to follow in the coming months.
Distributed and Parallel Systems
[Spectrum of systems, from distributed, heterogeneous systems (Grid computing, SETI@home, Entropia) through Beowulf and Berkeley NOW clusters to massively parallel, homogeneous systems (SNL Cplant, ASCI Tflops, parallel distributed-memory machines).]

Distributed systems (heterogeneous):
Gather (unused) resources
Steal cycles
System SW manages resources
System SW adds value
10% - 20% overhead is OK
Resources drive applications
Time to completion is not critical
Time-shared

Massively parallel systems (homogeneous):
Bounded set of resources
Apps grow to consume all cycles
Application manages resources
System SW gets in the way
5% overhead is maximum
Apps drive purchase of equipment
Real-time constraints
Space-shared

74
Virtual Environments
[Slide shows several hundred raw floating-point output values, e.g. 0.32E-08, 0.38E-06, 0.13E-05, 0.50E-02, ...]
Do they make any sense?
76
Performance Improvements for Scientific Computing Problems
Derived from Computational Methods
[Chart: speed-up factor (1 to 10,000, log scale) vs. year, 1970-1995, for successive numerical methods: Sparse GE, Gauss-Seidel, SOR, Conjugate Gradient, Multi-grid.]
77
Different Architectures
Parallel computing: single systems with many processors working on the same problem
Distributed computing: many systems loosely coupled by a scheduler to work on related problems
Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
78
Types of Parallel Computers
The simplest and most useful way to classify modern parallel computers is by their memory model:
shared memory
distributed memory
79
[Diagram: shared memory -- several processors on a bus to a single memory; distributed memory -- processor-memory pairs connected by a network.]
Shared memory: single address space. All processors have access to a pool of shared memory. (Examples: SGI Origin, Sun E10000)
Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors, as in the sketch below. (Examples: CRAY T3E, IBM SP, clusters)
Shared vs. Distributed Memory
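A minimal message-passing sketch in C with MPI (the standard mentioned elsewhere in these slides): because each processor owns only its local memory, rank 0 must explicitly send a value for rank 1 to see it. The value and the two-process setup are purely illustrative.

```c
/* Minimal message-passing sketch (distributed memory): each process
 * owns its own data, so rank 0 must explicitly send a value to rank 1.
 * Compile with mpicc and run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;                                   /* local data on rank 0 */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```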
80
[Diagram: UMA -- processors sharing one bus and one memory; NUMA -- several bus-memory groups joined by a network.]
Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs). (Example: Sun E10000)
Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Example: SGI Origin)
Shared Memory: UMA vs. NUMA
81
Distributed Memory: MPPs vs. Clusters
Processor-memory nodes are connected by some type of interconnect network:
Massively Parallel Processor (MPP): tightly integrated, single system image.
Cluster: individual computers connected by software.
[Diagram: CPU-memory nodes attached to an interconnect network.]
82
Processors, Memory, & Networks
Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
We will now begin to describe these pieces in detail, starting with definitions of terms.
83
Processor-Related Terms
Clock period (cp): the minimum time interval between successive actions in the processor. Fixed; depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
84
Processor-Related Terms
Functional unit: a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
85
Processor-Related Terms
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.
TLB: Translation Lookaside Buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).
86
Memory-Related Terms
SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.
87
Interconnect-Related Terms
Latency: how long does it take to start sending a "message"? Measured in microseconds.
(Also in processors: how long does it take to output results of some operations, such as floating point add, divide, etc., which are pipelined?)
Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.
88
Interconnect-Related Terms
Topology: the manner in which the nodes are connected. The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a grid, torus, or hypercube.
3-d hypercube, 2-d mesh, 2-d torus
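As a small aside (not from the slide), hypercube topologies have a particularly tidy description in code: label the 2^d nodes with d-bit integers, and each node's d neighbors are obtained by flipping one bit of its label.

```c
/* In a d-dimensional hypercube, each of the 2^d nodes connects to the d
 * nodes whose binary labels differ in exactly one bit, so neighbors are
 * found by flipping one bit at a time. */
#include <stdio.h>

int main(void) {
    int d = 3;                     /* 3-d hypercube: 8 nodes, 3 links each */
    for (int node = 0; node < (1 << d); node++) {
        printf("node %d:", node);
        for (int k = 0; k < d; k++)
            printf(" %d", node ^ (1 << k));   /* flip bit k */
        printf("\n");
    }
    return 0;
}
```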
89
Highly Parallel Supercomputing: Where Are We?
Performance:
Sustained performance has dramatically increased during the last year.
On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But...
Conventional systems are still faster on some applications.
Languages and compilers:
Standardized, portable, high-level languages such as HPF, PVM and MPI are available. But...
Initial HPF releases are not very efficient.
Message-passing programming is tedious and hard to debug.
Programming difficulty remains a major obstacle to usage by mainstream scientists.
90
Highly Parallel Supercomputing: Where Are We?
Operating systems:
Robustness and reliability are improving.
New system management tools improve system utilization. But...
Reliability is still not as good as conventional systems.
I/O subsystems:
New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But...
I/O remains a bottleneck on some systems.
91
The Importance of Standards - Software
Writing programs for MPPs is hard ...
But ... it is a one-off effort if written in a standard language.
The past lack of parallel programming standards ...
... has restricted uptake of the technology (to "enthusiasts")
... reduced portability (over a range of current architectures and between future generations)
Now standards exist (PVM, MPI & HPF), which ...
... allow users & manufacturers to protect their software investment
... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes
92
The Importance of Standards - Hardware
Processors
commodity RISC processors
Interconnects
high bandwidth, low latency communications protocol
no de-facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)
Growing demand for a total solution:
robust hardware + usable software
HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops
93
The Future of HPC
The expense of being different is being replaced by the economics of being the same.
HPC needs to lose its "special purpose" tag.
It still has to bring about the promise of scalable general-purpose computing ...
... but it is dangerous to ignore this technology.
Final success will come when MPP technology is embedded in desktop computing.
Yesterday's HPC is today's mainframe is tomorrow's workstation.
94
Achieving TeraFlops
In 1991: 1 Gflop/s
A 1000-fold increase since, from:
Architecture
» exploiting parallelism
Processor, communication, memory
» Moore's Law
Algorithm improvements
» block-partitioned algorithms (see the sketch below)
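A sketch of what "block-partitioned" means in practice, not taken from any particular library: a matrix multiply tiled into BS × BS blocks so that each block of A, B, and C is reused while it is resident in cache. The matrix and block sizes are arbitrary illustrative choices.

```c
/* Sketch of a block-partitioned algorithm: multiply in BS x BS tiles so
 * each tile of A, B, and C stays in cache while it is reused. */
#include <stdio.h>

#define N  512
#define BS 64                              /* block size (tunable) */

static double A[N][N], B[N][N], C[N][N];

void blocked_matmul(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS tile */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += aik * B[k][j];
                    }
}

int main(void) {
    blocked_matmul();
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}
```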
95
Future: Petaflops (10^15 fl pt ops/s)
A Pflop for 1 second ≈ a typical workstation computing for 1 year.
From an algorithmic standpoint:
concurrency
data locality
latency & synchronization
floating point accuracy
dynamic redistribution of workload
new languages and constructs
role of numerical libraries
algorithm adaptation to hardware failure
Today ≈ 10^15 flops for a year of computing on our workstations.
96
A Petaflops Computer System
1 Pflop/s sustained computing
Between 10,000 and 1,000,000 processors
Between 10 TB and 1 PB main memory
Commensurate I/O bandwidth, mass store, etc.
If built today, it would cost $40 B and consume 1 TWatt.
May be feasible and "affordable" by the year 2010