1
CS 594, Spring 2003
Lecture 1: Overview of High-Performance Computing
Jack Dongarra
Computer Science Department, University of Tennessee
2
Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
Limitations:
Too difficult -- build large wind tunnels.
Too expensive -- build a throw-away passenger jet.
Too slow -- wait for climate or galactic evolution.
Too dangerous -- weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon
» based on known physical laws and efficient numerical methods.
3
Computational Science Definition
Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science fuses three distinct elements:
numerical algorithms and modeling and simulation software developed to solve science (e.g., biological, physical, and social), engineering, and humanities problems;
advanced system hardware, software, networking, and data management components developed through computer and information science to solve computationally demanding problems;
the computing infrastructure that supports both science and engineering problem solving and developmental computer and information science.
4
Some Particularly Challenging Computations
Science
Global climate modeling
Astrophysical modeling
Biology: genomics; protein folding; drug design
Computational chemistry
Computational material sciences and nanosciences
Engineering
Crash simulation
Semiconductor design
Earthquake and structural modeling
Computational fluid dynamics (airplane design)
Combustion (engine design)
Business
Financial and economic modeling
Transaction processing, web services and search engines
Defense
Nuclear weapons -- test by simulation
Cryptography
5
Why Turn to Simulation?
When the problem is too . . .
Complex
Large / small
Expensive
Dangerous
to do any other way.
(movie: Taurus_to_Taurus_60per_30deg.mpeg)
6
Complex Systems Engineering
[Diagram: NASA's integrated environment -- supercomputers, storage, and networks; analysis and visualization; computation management (AeroDB, ILab); Grand Challenges; next-generation codes and algorithms (OVERFLOW, Honorable Mention, NASA Software of the Year, STS-107; INS3D, NASA Software of the Year, turbopump analysis; CART3D, NASA Software of the Year, STS-107); modeling environment (experts and tools): compilers, scaling and porting, parallelization tools.]
R&D team (Grand Challenge driven): Ames Research Center, Glenn Research Center, Langley Research Center
Engineering team (operations driven): Johnson Space Center, Marshall Space Flight Center, Industry Partners
Source: Walt Brooks, NASA
7
Economic Impact of HPC
Airlines:
System-wide logistics optimization systems on parallel systems.
Savings: approx. $100 million per airline per year.
Automotive design:
Major automotive companies use large systems (500+ CPUs) for:
» CAD-CAM, crash testing, structural integrity and aerodynamics.
» One company has a 500+ CPU parallel system.
Savings: approx. $1 billion per company per year.
Semiconductor industry:
Semiconductor firms use large systems (500+ CPUs) for:
» device electronics simulation and logic validation.
Savings: approx. $1 billion per company per year.
Securities industry:
Savings: approx. $15 billion per year for U.S. home mortgages.
8
Pretty Pictures
9
Why Turn to Simulation?
Climate / weather modeling
Data-intensive problems (data mining, oil reservoir simulation)
Problems with large length and time scales (cosmology)
10
Titov’s Tsunami Simulation
(movie: tsunami-nw10.mov)
Global model
11
Cost (Economic Loss) to Evacuate 1 Mile of Coastline: $1M
We now over-warn by a factor of 3.
Average over-warning is 200 miles of coastline, or $200M per event.
12
This problem demands a complete, STABLE environment (hardware and software)
100 TF to stay a factor of 10 ahead of the weather
Streaming observations
Massive storage and metadata query
Fast networking
Visualization
Data mining for feature detection
24 Hour Forecast at Fine Grid Spacing
13
Units of High Performance Computing
1 Mflop/s   1 Megaflop/s   10^6 Flop/sec
1 Gflop/s   1 Gigaflop/s   10^9 Flop/sec
1 Tflop/s   1 Teraflop/s   10^12 Flop/sec
1 Pflop/s   1 Petaflop/s   10^15 Flop/sec
1 MB        1 Megabyte     10^6 Bytes
1 GB        1 Gigabyte     10^9 Bytes
1 TB        1 Terabyte     10^12 Bytes
1 PB        1 Petabyte     10^15 Bytes
14
High-Performance Computing Today
In the past decade, the world has experienced one of the most exciting periods in computer development.
Microprocessors have become smaller, denser, and more powerful.
The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.
15
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years
Called "Moore's Law"
Moore's Law
Microprocessors have become smaller, denser, and more powerful.
Not just processors: bandwidth, storage, etc.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
16
Eniac and My Laptop
                       Eniac (1945)     My Laptop (2002)
Performance (FP/sec)   800              5,000,000,000
Memory (bytes)         ~200             1,073,741,824
Cost (1999 dollars)    4,630,000        1,000
Power (watts)          20,000           60
Size (m^3)             68               0.0028
Weight (kg)            27,200           0.9
Devices                18,000           6,000,000,000
17
No Exponential is Forever, But Perhaps We Can Delay it Forever
Processor                       Year of Introduction   Transistors
4004                            1971                   2,250
8008                            1972                   2,500
8080                            1974                   5,000
8086                            1978                   29,000
286                             1982                   120,000
Intel386™ processor             1985                   275,000
Intel486™ processor             1989                   1,180,000
Intel® Pentium® processor       1993                   3,100,000
Intel® Pentium® II processor    1997                   7,500,000
Intel® Pentium® III processor   1999                   24,000,000
Intel® Pentium® 4 processor     2000                   42,000,000
Intel® Itanium® processor       2002                   220,000,000
Intel® Itanium® 2 processor     2003                   410,000,000
18
Today's Processors
Some equivalences for the microprocessors of today:
Voltage level
» A flashlight (~1 volt)
Current level
» An oven (~250 amps)
Power level
» A light bulb (~100 watts)
Area
» A postage stamp (~1 square inch)
19
Moore's "Law"
Something doubles every 18-24 months.
That something was originally the number of transistors.
That something is now also taken to be performance.
Moore's Law is an exponential.
Exponentials cannot last forever
» However, Moore's Law has held remarkably true for ~30 years.
BTW: it is really an empiricism rather than a law (not a derogatory comment).
20
Percentage of Peak
A rule of thumb that often applies:
A contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance.
There are exceptions to this rule, in both directions.
Why such low efficiency?
There are two primary reasons behind the disappointing percentage of peak:
IPC (in)efficiency
Memory (in)efficiency
21
IPC
Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in the Itanium).
Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2-1.4.
We are leaving ~75% of the possible performance on the table... (see the sketch below)
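To make that arithmetic concrete, here is a minimal sketch in C. The 1 GHz clock rate is an assumed, illustrative figure; the issue width of 4 and the sustained IPC of 1.2 are the numbers quoted above.

```c
/* Rough percent-of-peak estimate from IPC, assuming a hypothetical
 * 1 GHz RISC processor that can issue 4 instructions per cycle. */
#include <stdio.h>

int main(void) {
    double clock_hz     = 1.0e9;  /* assumed clock rate              */
    double issue_width  = 4.0;    /* theoretical instructions/cycle  */
    double measured_ipc = 1.2;    /* typical sustained IPC (1.2-1.4) */

    double peak      = clock_hz * issue_width;   /* instructions/s */
    double sustained = clock_hz * measured_ipc;

    printf("fraction of peak:  %.0f%%\n", 100.0 * sustained / peak);
    printf("left on the table: %.0f%%\n", 100.0 * (1.0 - sustained / peak));
    return 0;
}
```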
22
Why Fast Machines Run Slow
Latency
Waiting for access to memory or other parts of the system.
Overhead
Extra work that has to be done to manage program concurrency and parallel resources, instead of the real work you want to perform.
Starvation
Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
Contention
Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.
23
Extra Transistors
With the increasing number of transistors per chip from reduced design rules, do we:
Add more functional units?
» Little gain, owing to poor IPC for today's codes, compilers and ISAs.
Add more cache?
» This generally helps but does not solve the problem.
Add more processors?
» This helps somewhat.
» This hurts somewhat.
24
Processor vs. Memory Speed
In 1986:
processor cycle time ~120 nanoseconds
DRAM access time ~140 nanoseconds
» 1:1 ratio
In 1996:
processor cycle time ~4 nanoseconds
DRAM access time ~60 nanoseconds
» 20:1 ratio
In 2002:
processor cycle time ~0.6 nanoseconds
DRAM access time ~50 nanoseconds
» 100:1 ratio
25
Latency in a Single System
[Chart: CPU clock period (ns) and memory-system access time (ns), 1997-2009, with the memory-to-CPU ratio climbing from roughly 100 toward 500 -- "THE WALL".]
26
Memory Hierarchy
Typical latencies for today's technology:

Hierarchy         Processor clocks
Register          1
L1 cache          2-3
L2 cache          6-12
L3 cache          14-40
Near memory       100-300
Far memory        300-900
Remote memory     O(10^3)
Message-passing   O(10^3)-O(10^4)
27
Memory Hierarchy
Most programs have a high degree of locality in their accesses:
spatial locality: accessing things nearby previous accesses
temporal locality: reusing an item that was previously accessed
The memory hierarchy tries to exploit locality (see the sketch after the diagram below).
[Diagram: the memory hierarchy from the processor (registers, datapath, control) through on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape); sizes grow from bytes and KB to MB, GB, and TB while speeds fall from ~1 ns and ~10 ns to ~100 ns, ~10 ms, and ~10 s.]
28
Memory Bandwidth
To provide bandwidth to the processor, the bus either needs to be faster or wider.
Busses are limited to perhaps 400-800 MHz.
Links are faster:
Single-ended: 0.5-1 GT/s
Differential: 2.5-5.0 GT/s (future)
Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size without helping bandwidth.
Making things wider requires pin-out (Si real estate) and power.
Both power and pin-out are serious issues.
29
What's Needed for Memory?
This is a physics problem.
"Money can buy bandwidth, but latency is forever!"
There are possibilities being studied.
They generally involve putting "processing" at the memory:
» reduces the latency and increases the bandwidth
Fundamental architectural research is lacking.
Government funding is a necessity.
30
Processor in Memory (PIM)
PIM merges logic with memory:
Wide ALUs next to the row buffer
Optimized for memory throughput, not ALU utilization
PIM has the potential of riding Moore's law while
greatly increasing effective memory bandwidth,
providing many more concurrent execution threads,
reducing latency, reducing power, and increasing overall system efficiency.
It may also simplify programming and system design.
[Diagram: PIM nodes, each a memory stack over sense amps, decode logic, and node logic.]
31
Internet -- 4th Revolution in Telecommunications
Telephone, radio, television
Growth in the Internet outstrips the others
Exponential growth since 1985
Traffic doubles every 100 days
Growth of Internet Hosts, Sept. 1969 - Sept. 2002
[Chart: number of hosts (domain names) vs. time, growing from a handful in 9/69 to over 200,000,000 by 08/02.]
32
The Web Phenomenon
90 - 93: Web invented
U of Illinois Mosaic released March 94, ~0.1% of traffic
September 93: ~1% of traffic w/ 200 sites
June 94: ~10% of traffic w/ 2,000 sites
Today: 60% of traffic w/ 2,000,000 sites
Every organization, company, school
33
Peer-to-Peer Computing
Peer-to-peer is a style of networking in which a group of computers communicate directly with each other.
Wireless communication
Home computer in the utility room, next to the water heater and furnace
Web tablets
BitTorrent
http://en.wikipedia.org/wiki/Bittorrent
Embedded computers in things, all tied together
Books, furniture, milk cartons, etc.
Smart appliances
Refrigerator, scale, etc.
34
Internet On Everything
35
SETI@home: Global Distributed Computing
Running on 500,000 PCs, ~1000 CPU years per day
485,821 CPU years so far
Sophisticated data & signal processing analysis
Distributes datasets from the Arecibo radio telescope
36
SETI@home
Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
When a computer is idle (its cycles otherwise wasted), the software downloads a 300-kilobyte chunk of data for analysis, performing about 3 Tflop of work per client over 15 hours.
The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
Largest distributed computation project in existence
Averaging 40 Tflop/s
Today a number of companies are trying this for profit.
Google Query Attributes
150M queries/day (2,000/second)
100 countries
8.0B documents in the index
Data centers
100,000 Linux systems in data centers around the world
» 15 Tflop/s and 1,000 TB total capability
» 40-80 1U/2U servers per cabinet
» 100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
growth from 4,000 systems (June 2000)
» 18M queries then
Performance and operation
simple reissue of failed commands to new servers
no performance debugging
» problems are not reproducible
Source: Monika Henzinger, Google, & Cleve Moler
Forward links are referred to in the rows; back links are referred to in the columns.
Eigenvalue problem: Ax = λx, with n = 8×10^9.
(see: MathWorks, Cleve's Corner)
The matrix is the transition probability matrix of the Markov chain; the ranking vector satisfies Ax = x.
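The slide only states the eigenvalue problem; one toy sketch of how such a ranking vector can be computed is power iteration on a column-stochastic link matrix, shown below in C. The 3-page "web" and its link probabilities are invented purely to illustrate the Ax = x computation at miniature scale.

```c
/* Toy power iteration on a tiny column-stochastic matrix: the same
 * kind of eigenvector computation (Ax = x) described above, on a
 * 3-node "web" instead of n = 8e9 pages. */
#include <stdio.h>

#define N 3

int main(void) {
    /* A[i][j] = probability of following a link from page j to page i */
    double A[N][N] = { {0.0, 0.5, 0.5},
                       {0.5, 0.0, 0.5},
                       {0.5, 0.5, 0.0} };
    double x[N] = {1.0 / N, 1.0 / N, 1.0 / N}, y[N];

    for (int iter = 0; iter < 100; iter++) {
        double norm = 0.0;
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
            norm += y[i];
        }
        for (int i = 0; i < N; i++) x[i] = y[i] / norm;  /* keep sum = 1 */
    }
    for (int i = 0; i < N; i++) printf("rank[%d] = %f\n", i, x[i]);
    return 0;
}
```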
Next Generation Web
To treat CPU cycles and software like commodities.
Enable the coordinated use of geographically distributed resources -- in the absence of central control and existing trust relationships.
Computing power is produced much like utilities such as power and water are produced for consumers.
Users will have access to "power" on demand.
This is one of our efforts at UT.
39
Where Has This Performance Improvement Come From?
Technology?
Organization?
Instruction set architecture?
Software?
Some combination of all of the above?
40
Impact of Device Shrinkage
What happens when the feature size (transistor size) shrinks by a factor of x?
Clock rate goes up by x, because wires are shorter
actually less than x, because of power consumption
Transistors per unit area go up by x^2
Die size also tends to increase
typically another factor of ~x
Raw computing power of the chip goes up by ~x^4!
of which x^3 is devoted either to parallelism or locality
41
How fast can a serial computer be?
Consider a 1 Tflop/s sequential machine (checked in the sketch below):
data must travel some distance, r, to get from memory to the CPU
to get 1 data element per cycle, this means 10^12 times per second
at the speed of light, c = 3×10^8 m/s
so r < c/10^12 = 0.3 mm
Now put 1 TB of storage in a 0.3 mm^2 area:
each word occupies about 3 square Angstroms, the size of a small atom
r = 0.3 mm for a 1 Tflop/s, 1 TB sequential machine
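The speed-of-light bound above is easy to check numerically; a minimal C sketch using only the constants stated on the slide:

```c
/* Back-of-the-envelope check of the speed-of-light argument:
 * to fetch one operand per cycle at 1 Tflop/s, data can travel at
 * most c / 1e12 meters per cycle. */
#include <stdio.h>

int main(void) {
    double c    = 3.0e8;    /* speed of light, m/s            */
    double rate = 1.0e12;   /* one memory reference per cycle */

    double r_m = c / rate;                /* max distance per cycle, meters */
    printf("r = %.1f mm\n", r_m * 1e3);   /* prints r = 0.3 mm */
    return 0;
}
```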
42
Processor-Memory Problem
Processors issue instructions roughly every nanosecond.
DRAM can be accessed roughly every 100 nanoseconds (!).
DRAM cannot keep processors busy! And the gap is growing:
processors getting faster by 60% per year
DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)
43
[Chart: processor vs. DRAM performance, 1980-2004, log scale. µProc improves 60%/yr (2X/1.5 yr, "Moore's Law"); DRAM improves 9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.]
Processor-DRAM Memory Gap
44
Why Parallel Computing?
Desire to solve bigger, more realistic application problems.
Fundamental limits are being approached.
More cost-effective solution.
45
Principles of Parallel Computing
Parallelism and Amdahl's Law
Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling
All of these things make parallel programming even harder than sequential programming.
46
“Automatic” Parallelism in Modern Machines
Bit-level parallelism
within floating point operations, etc.
Instruction-level parallelism (ILP)
multiple instructions execute per clock cycle
Memory system parallelism
overlap of memory operations with computation
OS parallelism
multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high performance, the user must identify, schedule and coordinate parallel tasks.
47
Finding Enough Parallelism
Suppose only part of an application seems parallel.
Amdahl's law:
let fs be the fraction of work done sequentially, so (1 - fs) is the fraction parallelizable
let N = number of processors
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
48
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

t_N = (f_p/N + f_s) t_1    Effect of multiple processors on run time
S = 1 / (f_s + f_p/N)      Effect of multiple processors on speedup

Where:
f_s = serial fraction of code
f_p = parallel fraction of code = 1 - f_s
N = number of processors
(see the sketch below)
Amdahl’s Law
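A small C sketch that evaluates the speedup expression S = 1/(f_s + f_p/N) given above for a few serial fractions and processor counts; the particular values are chosen only to echo the curves on the chart that follows.

```c
/* Amdahl's Law exactly as written above: speedup S = 1 / (fs + fp/N). */
#include <stdio.h>

double amdahl_speedup(double fs, int n) {
    double fp = 1.0 - fs;               /* parallel fraction */
    return 1.0 / (fs + fp / (double)n);
}

int main(void) {
    double fs_values[] = {0.0, 0.001, 0.01, 0.1};   /* serial fractions   */
    int    procs[]     = {10, 100, 1000};           /* processor counts N */

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("fs = %.3f, N = %4d -> S = %7.1f\n",
                   fs_values[i], procs[j],
                   amdahl_speedup(fs_values[i], procs[j]));
    return 0;
}
```

Even 1% serial content caps the speedup on 1000 processors at roughly 91, which is the point the next chart makes graphically.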
49
[Chart: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, and 0.900; only fp = 1.000 tracks the ideal line.]
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Illustration of Amdahl's Law
50
Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting the desired speedup.
Parallelism overheads include:
cost of starting a thread or process
cost of communicating shared data
cost of synchronizing
extra (redundant) computation
Each of these can be in the range of milliseconds (= millions of flops) on some systems.
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work; see the sketch below.
Locality and Parallelism
Large memories are slow; fast memories are small.
Storage hierarchies are large and fast on average.
Parallel processors, collectively, have large, fast caches.
The slow accesses to "remote" data we call "communication".
The algorithm should do most work on local data.
[Diagram: conventional storage hierarchy (processor + cache, L2 cache, L3 cache, memory) replicated per processor, with potential interconnects between the memories.]
52
Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
insufficient parallelism (during that phase)
unequal size tasks
Examples of the latter:
adapting to "interesting parts of a domain"
tree-structured computations
fundamentally unstructured problems
The algorithm needs to balance the load.
53
Performance Trends Revisited (Microprocessor Organization)
[Chart: transistors per chip, 1970-2000, log scale, from the i4004 and i8080 through the i8086, i80286, i80386, r3010, r4000, and r4400.]
• Bit-level parallelism
• Pipelining
• Caches
• Instruction-level parallelism
• Out-of-order execution
• Speculation
• . . .
54
Greater instruction-level parallelism?
Bigger caches?
Multiple processors per chip?
Complete systems on a chip? (portable systems)
High-performance LAN, interface, and interconnect?
What is Ahead?
55
Directions
Move toward shared memory
SMPs and distributed shared memory
Shared address space with deep memory hierarchy
Clustering of shared memory machines for scalability
Efficiency of message passing and data parallel programming
Helped by standards efforts such as MPI and HPF
56
High Performance Computers
~20 years ago:
1×10^6 floating point ops/sec (Mflop/s)
» scalar based
~10 years ago:
1×10^9 floating point ops/sec (Gflop/s)
» vector & shared memory computing, bandwidth aware
» block partitioned, latency tolerant
~Today:
1×10^12 floating point ops/sec (Tflop/s)
» highly parallel, distributed processing, message passing, network based
» data decomposition, communication/computation
~5 years away:
1×10^15 floating point ops/sec (Pflop/s)
» many more levels of memory hierarchy, combination of grids & HPC
» more adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
57
Top 500 Computers
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark
  Ax = b, dense problem (see the sketch below)
Updated twice a year:
SC'xy in the States in November
Meeting in Germany in June
[Chart: TPP performance (rate) vs. problem size.]
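As a sketch of how the yardstick turns into a number: the standard LINPACK operation count for a dense Ax = b solve is 2/3·n^3 + 2·n^2 flops, so a timed run converts to an Rmax figure as below. The problem size and wall-clock time here are placeholder values, not taken from any real Top500 run.

```c
/* The Top500 yardstick Rmax comes from timing a dense Ax = b solve.
 * The standard LINPACK operation count is 2/3*n^3 + 2*n^2 flops. */
#include <stdio.h>

double linpack_flops(double n) {
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

int main(void) {
    double n       = 1.0e6;     /* problem size used in the run (example) */
    double seconds = 9400.0;    /* measured wall-clock time (example)     */

    double rmax = linpack_flops(n) / seconds;   /* flop/s */
    printf("Rmax = %.2f Tflop/s\n", rmax / 1.0e12);
    return 0;
}
```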
58
What is a Supercomputer?
A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
Over the last 10 years the range for the Top500 has increased faster than Moore's Law:
1993:
#1 = 59.7 GFlop/s
#500 = 422 MFlop/s
2004:
#1 = 70 TFlop/s
#500 = 850 GFlop/s
Why do we need them?
Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways:
computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
59
24th List: The TOP10
1.  BlueGene/L β-System, IBM -- 70.72 TF/s -- DOE/IBM, USA, 2004, 32,768 processors
2.  Columbia (Altix, Infiniband), SGI -- 51.87 TF/s -- NASA Ames, USA, 2004, 10,160 processors
3.  Earth-Simulator, NEC -- 35.86 TF/s -- Earth Simulator Center, Japan, 2002, 5,120 processors
4.  MareNostrum (BladeCenter JS20, Myrinet), IBM -- 20.53 TF/s -- Barcelona Supercomputer Center, Spain, 2004, 3,564 processors
5.  Thunder (Itanium2, Quadrics), CCD -- 19.94 TF/s -- Lawrence Livermore National Laboratory, USA, 2004, 4,096 processors
6.  ASCI Q (AlphaServer SC, Quadrics), HP -- 13.88 TF/s -- Los Alamos National Laboratory, USA, 2002, 8,192 processors
7.  X (Apple XServe, Infiniband), self-made -- 12.25 TF/s -- Virginia Tech, USA, 2004, 2,200 processors
8.  BlueGene/L DD1 (500 MHz), IBM/LLNL -- 11.68 TF/s -- Lawrence Livermore National Laboratory, USA, 2004, 8,192 processors
9.  pSeries 655, IBM -- 10.31 TF/s -- Naval Oceanographic Office, USA, 2004, 2,944 processors
10. Tungsten (PowerEdge, Myrinet), Dell -- 9.82 TF/s -- NCSA, USA, 2003, 2,500 processors
399 systems > 1 TFlop/s; 294 machines are clusters; top10 average 8K processors
60
Performance Development
[Chart: Top500 performance, 1993-2004, log scale from 100 Mflop/s to 1 Pflop/s. The summed performance (SUM) grows from 1.167 TF/s to 1.127 PF/s; #1 (N=1) grows from 59.7 GF/s (Fujitsu 'NWT' NAL) through Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator to 70.72 TF/s (IBM BlueGene/L); #500 (N=500) grows from 0.4 GF/s to 850 GF/s, with "My Laptop" shown for reference.]
61
Performance Projection
[Chart: extrapolation of the Top500 trend lines (N=1, N=500, SUM) from 1993 out to 2015, log scale from 100 Mflop/s to 1 Eflop/s, with BlueGene/L and the DARPA HPCS program marked.]
62
Performance Projection
[Chart: the same projection with "My Laptop" added for reference.]
63
Customer Segments / Systems
[Chart: number of Top500 systems by customer segment (industry, research, academic, classified, vendor, government), 1993-2004.]
64
Manufacturers / Systems
[Chart: number of Top500 systems by manufacturer (Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others), 1993-2004.]
65
Processor Types
[Chart: number of Top500 systems by processor type (SIMD, vector, and scalar: Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2004.]
66
Interconnects / Systems
[Chart: number of Top500 systems by interconnect (N/A, crossbar, SP Switch, Myrinet, Cray interconnect, Gigabit Ethernet, Quadrics, Infiniband, others), 1993-2004.]
67
Top500 Performance by Manufacturer (11/04)
IBM 49%
HP 21%
others 14%
SGI 7%
NEC 4%
Fujitsu 2%
Cray 2%
Hitachi 1%
Sun 0%
Intel 0%
68
Clusters (NOW) / Systems
[Chart: number of Top500 cluster (NOW) systems, 1997-2004, by type: IBM Cluster, NOW - Pentium, HP AlphaServer, HP Cluster, NOW - AMD, Dell Cluster, NOW - Sun, NOW - Alpha, Sun Fire - Intel, others.]
69
Top500 Conclusions
Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
MPPs continue to account for more than half of all installed high-performance computers worldwide.
70
Performance Numbers on RISC Processors

Processor            Cycle Time (MHz)   Linpack n=100 (Mflop/s)   Linpack n=1000 (Mflop/s)   Peak (Mflop/s)
Intel P4             2540               1190 (23%)                2355 (46%)                 5080
Intel/HP Itanium 2   1000               1102 (27%)                3534 (88%)                 4000
Compaq Alpha         1000               824 (41%)                 1542 (77%)                 2000
AMD Athlon           1200               558 (23%)                 998 (42%)                  2400
HP PA                550                468 (21%)                 1583 (71%)                 2200
IBM Power 3          375                424 (28%)                 1208 (80%)                 1500
Intel P3             933                234 (25%)                 514 (55%)                  933
PowerPC G4           533                231 (22%)                 478 (45%)                  1066
SUN Ultra 80         450                208 (23%)                 607 (67%)                  900
SGI Origin 2K        300                173 (29%)                 553 (92%)                  600
Cray T90             454                705 (39%)                 1603 (89%)                 1800
Cray C90             238                387 (41%)                 902 (95%)                  952
Cray Y-MP            166                161 (48%)                 324 (97%)                  333
Cray X-MP            118                121 (51%)                 218 (93%)                  235
Cray J-90            100                106 (53%)                 190 (95%)                  200
Cray 1               80                 27 (17%)                  110 (69%)                  160
71
High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
COTS PC nodes
Pentium, Alpha, PowerPC, SMP
COTS LAN/SAN interconnect
Ethernet, Myrinet, Giganet, ATM
Open source Unix
Linux, BSD
Message-passing computing
MPI, PVM, HPF

Advantages:
Best price-performance
Low entry-level cost
Just-in-place configuration
Vendor invulnerable
Scalable
Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message-passing libraries. However, it is much more of a contact sport.
72
Peak performance, interconnection: http://clusters.top500.org
Benchmark results to follow in the coming months.
Distributed and Parallel Systems
[Spectrum of systems, from distributed, heterogeneous systems (Grid computing, SETI@home, Entropia) through Beowulf and Berkeley NOW clusters to massively parallel, homogeneous systems (SNL Cplant, ASCI Tflops, parallel distributed-memory machines).]

Distributed systems (heterogeneous):
Gather (unused) resources
Steal cycles
System SW manages resources
System SW adds value
10% - 20% overhead is OK
Resources drive applications
Time to completion is not critical
Time-shared

Massively parallel systems (homogeneous):
Bounded set of resources
Apps grow to consume all cycles
Application manages resources
System SW gets in the way
5% overhead is maximum
Apps drive purchase of equipment
Real-time constraints
Space-shared

74
Virtual Environments
[Slide shows several hundred raw floating-point output values, e.g. 0.32E-08, 0.38E-06, 0.13E-05, 0.50E-02, ...]
Do they make any sense?
76
Performance Improvements for Scientific Computing Problems
Derived from Computational Methods
[Chart: speed-up factor (1 to 10,000, log scale) vs. year, 1970-1995, for successive numerical methods: Sparse GE, Gauss-Seidel, SOR, Conjugate Gradient, Multi-grid.]
77
Different Architectures
Parallel computing: single systems with many processors working on the same problem
Distributed computing: many systems loosely coupled by a scheduler to work on related problems
Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
78
Types of Parallel Computers
The simplest and most useful way to classify modern parallel computers is by their memory model:
shared memory
distributed memory
79
[Diagram: shared memory -- several processors on a bus to a single memory; distributed memory -- processor-memory pairs connected by a network.]
Shared memory: single address space. All processors have access to a pool of shared memory. (Examples: SGI Origin, Sun E10000)
Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors, as in the sketch below. (Examples: CRAY T3E, IBM SP, clusters)
Shared vs. Distributed Memory
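A minimal message-passing sketch in C with MPI (the standard mentioned elsewhere in these slides): because each processor owns only its local memory, rank 0 must explicitly send a value for rank 1 to see it. The value and the two-process setup are purely illustrative.

```c
/* Minimal message-passing sketch (distributed memory): each process
 * owns its own data, so rank 0 must explicitly send a value to rank 1.
 * Compile with mpicc and run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;                                   /* local data on rank 0 */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}
```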
80
[Diagram: UMA -- processors sharing one bus and one memory; NUMA -- several bus-memory groups joined by a network.]
Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs). (Example: Sun E10000)
Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Example: SGI Origin)
Shared Memory: UMA vs. NUMA
81
Distributed Memory: MPPs vs. Clusters
Processor-memory nodes are connected by some type of interconnect network:
Massively Parallel Processor (MPP): tightly integrated, single system image.
Cluster: individual computers connected by software.
[Diagram: CPU-memory nodes attached to an interconnect network.]
82
Processors, Memory, & Networks
Both shared and distributed memory systems have:
1. processors: now generally commodity RISC processors
2. memory: now generally commodity DRAM
3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
We will now begin to describe these pieces in detail, starting with definitions of terms.
83
Processor-Related Terms
Clock period (cp): the minimum time interval between successive actions in the processor. Fixed; depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
84
Processor-Related Terms
Functional unit: a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
85
Processor-Related Terms
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.
TLB: Translation Lookaside Buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).
86
Memory-Related Terms
SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.
87
Interconnect-Related Terms
Latency: how long does it take to start sending a "message"? Measured in microseconds.
(Also in processors: how long does it take to output results of some operations, such as floating point add, divide, etc., which are pipelined?)
Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.
88
Interconnect-Related Terms
Topology: the manner in which the nodes are connected. The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a grid, torus, or hypercube.
3-d hypercube, 2-d mesh, 2-d torus
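As a small aside (not from the slide), hypercube topologies have a particularly tidy description in code: label the 2^d nodes with d-bit integers, and each node's d neighbors are obtained by flipping one bit of its label.

```c
/* In a d-dimensional hypercube, each of the 2^d nodes connects to the d
 * nodes whose binary labels differ in exactly one bit, so neighbors are
 * found by flipping one bit at a time. */
#include <stdio.h>

int main(void) {
    int d = 3;                     /* 3-d hypercube: 8 nodes, 3 links each */
    for (int node = 0; node < (1 << d); node++) {
        printf("node %d:", node);
        for (int k = 0; k < d; k++)
            printf(" %d", node ^ (1 << k));   /* flip bit k */
        printf("\n");
    }
    return 0;
}
```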
89
Highly Parallel Supercomputing: Where Are We?
Performance:
Sustained performance has dramatically increased during the last year.
On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But...
Conventional systems are still faster on some applications.
Languages and compilers:
Standardized, portable, high-level languages such as HPF, PVM and MPI are available. But...
Initial HPF releases are not very efficient.
Message-passing programming is tedious and hard to debug.
Programming difficulty remains a major obstacle to usage by mainstream scientists.
90
Highly Parallel Supercomputing: Where Are We?
Operating systems:
Robustness and reliability are improving.
New system management tools improve system utilization. But...
Reliability is still not as good as conventional systems.
I/O subsystems:
New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But...
I/O remains a bottleneck on some systems.
91
The Importance of Standards - Software
Writing programs for MPPs is hard ...
But ... it is a one-off effort if written in a standard language.
The past lack of parallel programming standards ...
... has restricted uptake of the technology (to "enthusiasts")
... reduced portability (over a range of current architectures and between future generations)
Now standards exist (PVM, MPI & HPF), which ...
... allow users & manufacturers to protect their software investment
... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes
92
The Importance of Standards - Hardware
Processors
commodity RISC processors
Interconnects
high bandwidth, low latency communications protocol
no de-facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)
Growing demand for a total solution:
robust hardware + usable software
HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops
93
The Future of HPC
The expense of being different is being replaced by the economics of being the same.
HPC needs to lose its "special purpose" tag.
It still has to bring about the promise of scalable general-purpose computing ...
... but it is dangerous to ignore this technology.
Final success will come when MPP technology is embedded in desktop computing.
Yesterday's HPC is today's mainframe is tomorrow's workstation.
94
Achieving TeraFlops
In 1991: 1 Gflop/s
A 1000-fold increase since, from:
Architecture
» exploiting parallelism
Processor, communication, memory
» Moore's Law
Algorithm improvements
» block-partitioned algorithms (see the sketch below)
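A sketch of what "block-partitioned" means in practice, not taken from any particular library: a matrix multiply tiled into BS × BS blocks so that each block of A, B, and C is reused while it is resident in cache. The matrix and block sizes are arbitrary illustrative choices.

```c
/* Sketch of a block-partitioned algorithm: multiply in BS x BS tiles so
 * each tile of A, B, and C stays in cache while it is reused. */
#include <stdio.h>

#define N  512
#define BS 64                              /* block size (tunable) */

static double A[N][N], B[N][N], C[N][N];

void blocked_matmul(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS tile */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += aik * B[k][j];
                    }
}

int main(void) {
    blocked_matmul();
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}
```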
95
Future: Petaflops (10^15 fl pt ops/s)
A Pflop for 1 second ≈ a typical workstation computing for 1 year.
From an algorithmic standpoint:
concurrency
data locality
latency & synchronization
floating point accuracy
dynamic redistribution of workload
new languages and constructs
role of numerical libraries
algorithm adaptation to hardware failure
Today ≈ 10^15 flops for a year of computing on our workstations.
96
A Petaflops Computer System
1 Pflop/s sustained computing
Between 10,000 and 1,000,000 processors
Between 10 TB and 1 PB main memory
Commensurate I/O bandwidth, mass store, etc.
If built today, it would cost $40 B and consume 1 TWatt.
May be feasible and "affordable" by the year 2010