1
Today
• About the class
• Introductions
• Any new people?
• Start of first module: Parallel Computing
2
Goals for This Module
• Overview of Parallel Architecture and Programming Models
• Drivers of Parallel Computing (applications, technology trends, architecture, economics)
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture

• Parallel programs
• Process of parallelization
• What parallel programs look like in major programming models

• Programming for performance
• Key performance issues and architectural interactions
Overview of Parallel Architecture and Programming Models
4
What is a Parallel Computer?
A collection of processing elements that cooperate to solve large problems fast
Some broad issues that distinguish parallel computers:
• Resource allocation:
– how large a collection?
– how powerful are the elements?
– how much memory?
• Data access, communication, and synchronization:
– how do the elements cooperate and communicate?
– how are data transmitted between processors?
– what are the abstractions and primitives for cooperation?
• Performance and scalability:
– how does it all translate into performance?
– how does it scale?
5
Why Parallelism?
• Provides alternative to faster clock for performance
• Assuming effective per-node performance doubles every 2 years, a 1024-CPU system can deliver today the performance a single-CPU system would take 20 years to reach (1024 = 2^10, i.e. 10 doublings at 2 years each)
• Applies at all levels of system design
• Is increasingly central in information processing
• Scientific computing: simulation, data analysis, data storage and management, etc.
• Commercial computing: Transaction processing, databases
• Internet applications: Search; Google operates at least 50,000 CPUs, many as part of large parallel systems
6
How to Study Parallel Systems
History: diverse and innovative organizational structures, often tied to novel programming models
Rapidly matured under strong technological constraints
• The microprocessor is ubiquitous
• Laptops and supercomputers are fundamentally similar!
• Technological trends cause diverse approaches to converge

Technological trends make parallel computing inevitable
• In the mainstream
Need to understand fundamental principles and design tradeoffs, not just taxonomies
• Naming, Ordering, Replication, Communication performance
7
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
8
Drivers of Parallel Computing
Application Needs: our insatiable need for computing cycles
• Scientific computing: CFD, biology, chemistry, physics, ...
• General-purpose computing: video, graphics, CAD, databases, TP, ...
• Internet applications: search, e-commerce, clustering, ...
Technology Trends
Architecture Trends
Economics
Current trends:
• All microprocessors have multiprocessor support
• Servers and workstations are often MP: Sun, SGI, Dell, COMPAQ, ...
• Microprocessors are multiprocessors: SMP on a chip
9
Application Trends
Demand for cycles fuels advances in hardware, and vice versa
• This cycle drives the exponential increase in microprocessor performance
• Drives parallel architecture harder: the most demanding applications

Range of performance demands
• Need a range of system performance with progressively increasing cost
• Platform pyramid
Goal of applications in using parallel machines: Speedup
Speedup(p processors) = Performance(p processors) / Performance(1 processor)

For a fixed problem size (input data set), performance = 1/time, so:

Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
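To make the definition concrete, here is a hypothetical worked example (numbers chosen only for illustration): if a fixed-size problem takes 400 seconds on 1 processor and 25 seconds on 32 processors, then Speedup(32) = 400 / 25 = 16, i.e. half of the ideal 32×, with the rest lost to communication, synchronization, and serial sections.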
10
Scientific Computing Demand
11
Engineering Computing Demand
Large parallel machines are a mainstay in many industries
• Petroleum (reservoir analysis)
• Automotive (crash simulation, drag analysis, combustion efficiency)
• Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
• Computer-aided design
• Pharmaceuticals (molecular modeling)
• Visualization
– in all of the above
– entertainment (movies), architecture (walk-throughs, rendering)
• Financial modeling (yield and derivative analysis)
• etc.
12
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on the Cray90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
13
Commercial Computing
Also relies on parallelism for the high end
• Scale not so large, but use much more widespread
• Computational power determines the scale of business that can be handled

Databases, online transaction processing, decision support, data mining, data warehousing ...

TPC benchmarks (TPC-C order entry, TPC-D decision support)
• Explicit scaling criteria provided
• Size of enterprise scales with size of system
• Problem size no longer fixed as p increases, so throughput is used as the performance measure (transactions per minute, or tpm)
14
TPC-C Results for Wintel Systems
• Parallelism is pervasive
• Small- to moderate-scale parallelism is very important
• Difficult to obtain a snapshot to compare across vendor platforms
• 4-way Cpq PL 5000, Pentium Pro 200 MHz: 6,751 tpmC, $89.62/tpmC (avail. 12-1-96, TPC-C v3.2, withdrawn)
• 6-way Unisys AQ HS6, Pentium Pro 200 MHz: 12,026 tpmC, $39.38/tpmC (avail. 11-30-97, TPC-C v3.3, withdrawn)
• 4-way IBM NF 7000, PII Xeon 400 MHz: 18,893 tpmC, $29.09/tpmC (avail. 12-29-98, TPC-C v3.3, withdrawn)
• 8-way Cpq PL 8500, PIII Xeon 550 MHz: 40,369 tpmC, $18.46/tpmC (avail. 12-31-99, TPC-C v3.5, withdrawn)
• 8-way Dell PE 8450, PIII Xeon 700 MHz: 57,015 tpmC, $14.99/tpmC (avail. 1-15-01, TPC-C v3.5, withdrawn)
• 32-way Unisys ES7000, PIII Xeon 900 MHz: 165,218 tpmC, $21.33/tpmC (avail. 3-10-02, TPC-C v5.0)
• 32-way NEC Express5800, Itanium2 1 GHz: 342,746 tpmC, $12.86/tpmC (avail. 3-31-03, TPC-C v5.0)
• 32-way Unisys ES7000, Xeon MP 2 GHz: 234,325 tpmC, $11.59/tpmC (avail. 3-31-03, TPC-C v5.0)

[Chart: performance (tpmC, 0–400,000) and price-performance ($/tpmC, $0–$100) of these systems, 1996–2002]
15
Summary of Application Trends
Transition to parallel computing has occurred for scientific and engineering computing
Rapid progress in commercial computing
• Databases and transactions, as well as financial applications
• Usually smaller scale, but large-scale systems are also used

Desktops also run multithreaded programs, which are a lot like parallel programs

Demand for improving throughput on sequential workloads
• Greatest use of small-scale multiprocessors
Solid application demand, keeps increasing with time
16
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
17
Technology Trends: Rise of the Micro
The natural building block for multiprocessors is now also about the fastest!
[Chart: relative performance (0.1–100, log scale) of supercomputers, mainframes, minicomputers, and microprocessors, 1965–1995]
18
General Technology Trends
• Microprocessor performance increases 50%–100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by the huge commodity market
• Not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
19
Clock Frequency Growth Rate (Intel family)
• 30% per year
20
Transistor Count Growth Rate (Intel family)
• 100 million transistors on a chip by the early 2000s
• Transistor count grows much faster than clock rate
– 40% per year, an order of magnitude more contribution over 2 decades
21
Technology: A Closer Look

Basic advance is decreasing feature size (λ)
• Circuits become either faster or lower in power

Die size is growing too
• Clock rate improves roughly in proportion to the improvement in λ
• Number of transistors improves like λ² (or faster)

Performance > 100× per decade; clock rate accounts for ~10×, the rest comes from transistor count
How to use more transistors?
• Parallelism in processing
– multiple operations per cycle reduce CPI
• Locality in data access (see the sketch after this list)
– avoids latency and reduces CPI
– also improves processor utilization
• Both need resources, so there is a tradeoff
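To make the locality bullet concrete, here is a minimal C sketch (my example, not from the slides; the function names are hypothetical). Since C stores arrays row-major, the row-by-row traversal touches memory contiguously and typically reuses each fetched cache line, while the column-by-column traversal strides by a whole row and wastes most of each line on large matrices:

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so each cache line fetched is fully used. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations stride by a whole
   row (N * sizeof(double) bytes), so cache lines are barely reused. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```

Both functions compute the same sum; only the data-access order, and hence the cache behavior and effective latency, differs.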
Fundamental issue is resource distribution, as in uniprocessors

[Diagram: processor (Proc), cache ($), and interconnect]
22
Similar Story for Storage

Divergence between memory capacity and speed is even more pronounced
• Capacity increased 1000× from 1980–95, and increases 50% per year
• Latency reduces only 3% per year (only 2× from 1980–95)
• Bandwidth per memory chip increases 2× as fast as latency reduces

Larger memories are slower, while processors get faster
• Need to transfer more data in parallel
• Need deeper cache hierarchies
• How to organize caches?
23
Similar Story for Storage
Parallelism increases effective size of each level of hierarchy, without increasing access time
Parallelism and locality within memory systems too
• New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
• Buffer caches the most recently accessed data
Disks too: Parallel disks plus caching
Overall, the dramatic growth of processor speed, storage capacity, and bandwidth, relative to latency (especially) and clock speed, points toward parallelism as the desirable architectural direction
24
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
25
Architectural Trends
Architecture translates technology’s gifts to performance and capability
Resolves the tradeoff between parallelism and locality
• Recent microprocessors: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances

Four generations of architectural history: tube, transistor, IC, VLSI
• Here focus only on the VLSI generation
Greatest delineation in VLSI has been in type of parallelism exploited
26
Architectural Trends in Parallelism
Greatest trend in the VLSI generation is the increase in parallelism
• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
– slows after 32 bits
– adoption of 64-bit well under way; 128-bit is far off (not a performance issue)
– great inflection point when a 32-bit micro and cache fit on one chip
• Mid 80s to mid 90s: instruction-level parallelism (see the sketch after this list)
– pipelining and simple instruction sets, plus compiler advances (RISC)
– on-chip caches and functional units => superscalar execution
– greater sophistication: out-of-order execution, speculation, prediction
– to deal with control transfer and latency problems
• Next step: thread-level parallelism
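As a rough illustration of the instruction-level-parallelism bullet above (my example, not from the slides): in the first loop every addition depends on the previous result, so a superscalar or out-of-order core can rarely keep more than one of these adds in flight; in the second loop the four partial sums are independent, giving the hardware several instructions it can issue in the same cycle. Whether the speedup is actually realized depends on the compiler and the particular core.

```c
#include <stddef.h>

/* Dependent chain: each addition waits for the previous result,
   so instruction-level parallelism is limited. */
double sum_serial(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: the additions within one iteration
   do not depend on each other, so a superscalar core can overlap them. */
double sum_unrolled(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* handle any leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```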
27
Phases in VLSI Generation
• How good is instruction-level parallelism (ILP)?
• Is thread-level parallelism needed in microprocessors?
• SMT, Intel Hyperthreading
[Chart: transistor counts (1,000–100,000,000, log scale) of microprocessors, 1970–2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000), showing the bit-level, instruction-level, and thread-level (?) parallelism phases]
28
Can ILP get us there?

• Reported speedups for superscalar processors:
– Horst, Harris, and Jardine [1990]: 1.37
– Wang and Wu [1988]: 1.70
– Smith, Johnson, and Horowitz [1989]: 2.30
– Murakami et al. [1989]: 2.55
– Chang et al. [1991]: 2.90
– Jouppi and Wall [1989]: 3.20
– Lee, Kwok, and Briggs [1991]: 3.50
– Wall [1991]: 5
– Melvin and Patt [1991]: 8
– Butler et al. [1991]: 17+
• Large variance due to differences in
– application domain investigated (numerical versus non-numerical)
– capabilities of the processor modeled
29
ILP Ideal Potential
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
– but real caches and non-zero miss latencies

[Charts: fraction of total cycles (%) by number of instructions issued per cycle (0 to 6+), and speedup (0–3) vs. instructions issued per cycle (0–15)]
30
Results of ILP Studies
[Chart: speedups (1×–4×) reported by Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91, comparing one branch unit with real prediction vs. perfect branch prediction]

• Concentrate on parallelism for 4-issue machines
• Realistic studies show only about 2-fold speedup
• More recent work examines ILP that looks across threads for parallelism
31
Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate the bus, then bus technology advanced
– today, a range of sizes for bus-based systems, from desktop to large servers
[Chart: number of processors (0–70) in bus-based shared-memory systems, 1984–1998, including Sequent B8000/B2100, Symmetry 81/21, Power, SS690MP 120/140, SS10/SS20, SS1000/SS1000E, SC2000/SC2000E, SE10/SE30/SE60/SE70, AS2100/AS8400, HP K400, SGI PowerSeries/Challenge/PowerChallenge XL, CRAY CS6400, P-Pro, Sun E6000/E10000]
32
Bus Bandwidth
[Chart: shared bus bandwidth (MB/s, 10–100,000, log scale) for the same bus-based systems, 1984–1998]
33
Do Buses Scale?
[Diagram: scalable node with processor (P), cache ($), memory (Mem), and a memory controller with network interface (NI), connected through a switch to external I/O and the interconnection network]
Buses are a convenient way to extend architecture to parallelism, but they do not scale
• bandwidth doesn’t grow as CPUs are added
Scalable systems use physically distributed memory
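A back-of-the-envelope illustration of the scaling limit (hypothetical numbers, not from the slides): if a shared bus sustains about 1 GB/s and each processor's cache misses generate about 100 MB/s of memory traffic, then roughly 10 processors saturate the bus; an 11th processor adds demand but no bandwidth, so every processor simply waits longer. Physically distributing memory lets each added node bring its own memory bandwidth.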
34
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
35
Finally, Economics
Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars ($5–100 million typical)
• BUT, many more are sold compared with supercomputers
• Crucial to take advantage of the investment, and use the commodity building block
• Exotic parallel architectures amount to no more than special-purpose machines

Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
Standardization by Intel makes small, bus-based SMPs commodity
Desktop: a few smaller processors versus one larger one?
• Multiprocessor on a chip
36
Summary: Why Parallel Architecture?
Increasingly attractive
• Economics, technology, architecture, application demand

Increasingly central and mainstream

Parallelism exploited at many levels
• Instruction-level parallelism
• Multiprocessor servers
• Large-scale multiprocessors (“MPPs”)

Focus of this class: the multiprocessor level of parallelism

Same story from the memory (and storage) system perspective
• Increase bandwidth, reduce average latency with many local memories

A wide range of parallel architectures make sense
• Different cost, performance, and scalability
37
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
38
Scientific Supercomputing
Proving ground and driver for innovative architecture and techniques
• Market smaller relative to commercial as MPs become mainstream
• Dominated by vector machines starting in the 70s
• Microprocessors have made huge gains in floating-point performance
– high clock rates
– pipelined floating-point units (e.g. multiply-add)
– instruction-level parallelism
– effective use of caches
• Plus economics
Large-scale multiprocessors replace vector supercomputers
39
Raw Uniprocessor Performance: LINPACK
[Chart: LINPACK MFLOPS (1–10,000, log scale), 1975–2000, for CRAY (n = 100 and n = 1,000) vs. microprocessors (n = 100 and n = 1,000): CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94; Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, DEC Alpha, HP 9000/750, DEC Alpha AXP, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200]
40
Raw Parallel Performance: LINPACK
• Even vector Crays became parallel: X-MP (2–4 processors), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)

[Chart: LINPACK GFLOPS (0.1–10,000, log scale), 1985–1996, CRAY peak vs. MPP peak: Xmp/416(4), Ymp/832(8), C90(16), T932(32); iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP(1024), Paragon XP/S MP(6768), T3D, ASCI Red]
41
500 Fastest Computers
42
Top 500 as of 2003
• Earth Simulator, built by NEC, remains the unchallenged #1, at 38 Tflop/s
• ASCI Q at Los Alamos is #2 at 13.88 TFlop/s.
• The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X: a cluster built from Apple G5 nodes with the new InfiniBand interconnect.

• #4 is also a cluster, at NCSA: a Dell PowerEdge system with Myrinet interconnect.

• #5 is also a cluster: an upgraded Itanium2-based HP system at DOE's Pacific Northwest National Lab, with Quadrics interconnect.

• #6 is based on AMD's Opteron chip; it was installed by Linux Networx at Los Alamos National Laboratory and also uses a Myrinet interconnect.
• The list of clusters in the TOP10 has grown to seven systems. The Earth Simulator and two IBM SP systems at Livermore and LBL are the non-clusters.
• The performance of the #10 system is 6.6 TFlop/s.
43
44
Another View of Performance Growth
45
Another View of Performance Growth
46
Another View of Performance Growth
47
Another View of Performance Growth
48
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
49
History
Historically, parallel architectures were tied to programming models
• Divergent architectures, with no predictable pattern of growth

[Diagram: application software and system software built on divergent architectures: SIMD, message passing, shared memory, dataflow, systolic arrays]

• Uncertainty of direction paralyzed parallel software development!
50
Today
Extension of “computer architecture” to support communication and cooperation
• OLD: Instruction Set Architecture
• NEW: Communication Architecture

Defines
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces (hw or sw)
Compilers, libraries and OS are important bridges between application and architecture today
51
Modern Layered Framework
[Layered diagram:
– Parallel applications: CAD, database, scientific modeling
– Programming models: multiprogramming, shared address, message passing, data parallel
– Communication abstraction (user/system boundary)
– Compilation or library; operating systems support
– Communication hardware (hardware/software boundary)
– Physical communication medium]
52
Parallel Programming Model
What the programmer uses in writing applications
Specifies communication and synchronization
Examples:
• Multiprogramming: no communication or synchronization at the program level
• Shared address space: like a bulletin board (see the sketch after this list)
• Message passing: like letters or phone calls; explicit point-to-point
• Data parallel: more regimented, global actions on data
– implemented with shared address space or message passing
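A minimal sketch of the shared-address-space model referenced above (my example, not from the slides), using POSIX threads: both threads read and write the same variable `total`, like posting to a shared bulletin board, and a mutex provides the synchronization. In a message-passing version, each process would instead keep a private partial sum and send it explicitly to another process.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double x[N];              /* shared data: visible to all threads   */
static double total = 0.0;       /* shared accumulator ("bulletin board") */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums half of the array, then adds its partial sum
   to the shared total under the lock (synchronization). */
static void *worker(void *arg) {
    long id = (long)arg;
    double partial = 0.0;
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        partial += x[i];
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        x[i] = 1.0;

    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < 2; id++)
        pthread_join(t[id], NULL);

    printf("total = %f\n", total);   /* expect 1000000.0 */
    return 0;
}
```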
53
Communication Abstraction
User-level communication primitives provided by the system
• Realize the programming model
• A mapping exists between the language primitives of the programming model and these primitives
Supported directly by hw, or via OS, or via user sw
Lots of debate about what to support in software and about the gap between layers

Today:
• Hw/sw interface tends to be flat, i.e. complexity roughly uniform
• Compilers and software play important roles as bridges
• Technology trends exert strong influence

Result is convergence in organizational structure
• Relatively simple, general-purpose communication primitives
54
Communication Architecture = User/System Interface + Implementation

User/System Interface:
• Communication primitives exposed to user level by hardware and system-level software
• (May be additional user-level software between this and the programming model)

Implementation:
• Organizational structures that implement the primitives: hardware or OS
• How optimized are they? How integrated into the processing node?
• Structure of the network

Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low cost