MACI - University of Alberta - April 2001
1
High-Performance Computing
José Nelson Amaral
Department of Computing Science
University of Alberta
[email protected]
MACI - University of Alberta - April 2001
2
Why High Performance Computing?
Many important problems cannot be solved yet even with the fastest machines available.
faster computers enable the formulation of more interesting questions
when a problem is solved, researchers find bigger problems to tackle!
MACI - University of Alberta - April 2001
3
Grand Challenges
weather forecasting
economic modeling
computer-aided design
drug design
exploring the origins of the universe
searching for extra-terrestrial life
computer vision
MACI - University of Alberta - April 2001
4
Grand Challenges
To simulate the folding of a 300-amino-acid protein in water:
# of atoms: ~32,000
folding time: 1 millisecond
# of FLOPs: 3 × 10^22
Machine speed: 1 PetaFLOP/s
Simulation time: 1 year
(Source: IBM Blue Gene Project)
IBM's answer: the Blue Gene Project, with US$100 M of funding to build a 1 PetaFLOP/s computer
Ken Dill and Kit Lau's protein folding model.
Charles L Brooks III, Scripps Research Institute
MACI - University of Alberta - April 2001
5
Grand Challenges
In 1996 the GeneCrunch project demonstrated that a cluster of SGI Challenges (64 processors) delivers near-linear speedup for multiple sequence alignment.
MACI - University of Alberta - April 2001
6
Commercial Applications
In October 2000, SGI and ESI (France) revealed a crash simulator to be used in the future BMW Series 5.
Sustained performance: 12 GFLOPS
Processors: 96 × 400 MHz MIPS
Machine: SGI Origin 3000 series
MACI - University of Alberta - April 2001
7
Powerful Computers
Increased computing power enables increasing problem dimensions:
adding more particles to a system
increasing the accuracy of the result
improving experiment turnaround time
MACI - University of Alberta - April 2001
8
Speed and Storage
MACI - University of Alberta - April 2001
9
Solution?
Instead of using a single processor …
use multiple processors
combine their efforts to solve a problem
benefit from the aggregate of their processing speed, memory, cache and disk storage
MACI - University of Alberta - April 2001
10
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
11
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
[Timeline figure: 1980, 1988, 1990, 1994, 1998, 2000]
MACI - University of Alberta - April 2001
13
Distributed Memory Machine Architecture
[Diagram: processors, each with caches, local memory and I/O, connected by an interconnection network]
Non-Uniform Memory Access (NUMA): accessing local memory is faster than accessing remote memory
MACI - University of Alberta - April 2001
14
Centralized Shared Memory Multiprocessor
[Diagram: processors with caches sharing a single main memory and I/O system through an interconnection network]
MACI - University of Alberta - April 2001
15
Centralized Shared Memory Multiprocessor
[Diagram: processors with caches connected through an interconnection network to multiple memory modules and I/O controllers]
Uniform Memory Access (UMA): the "Dance Hall" approach
MACI - University of Alberta - April 2001
16
Distributed Shared Memory(Clusters of SMPs)
[Diagram: SMP nodes, each with several processors and caches plus local memory and I/O connected by a node interconnection network; the nodes are joined by a cluster interconnection network]
Typically: Shared Address Space with Non-Uniform Memory Access (NUMA)
MACI - University of Alberta - April 2001
17
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
18
What’s Next?
What is Next in High-Performance Computing? (Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)
Thesis:
1. Clusters are becoming ubiquitous, and even traditional data centers are migrating to clusters;
2. Grid communities are beginning to provide significant advantages for addressing parallel problems and sharing vast numbers of files.
“Dark Side of Clusters: Clusters perform poorly on applications that require large shared memory.”
MACI - University of Alberta - April 2001
19
Beowulf
Project started at NASA in 1993 with the goal of:
“Implementing a 1 GFLOP/s workstation costing less than US$50,000 using commercial off-the-shelf (COTS) hardware and software.”
In 1994 a US$ 40,000 cluster, with 16 Intel 486s reached the goal.
In 1997 a Beowulf cluster won the Gordon Bell performance/price prize.
In June 2001, 28 Beowulfs were in the Top500 list of the fastest computers in the world.
MACI - University of Alberta - April 2001
20
“The Dark Side of Clusters”
What is Next in High-Performance Computing? (Gordon Bell and Jim Gray, Comm. of the ACM, Feb. 2002)
“Clusters perform poorly on applications that require large shared memory.”
PAP = Peak Advertised Performance
RAP = Real Application Performance
Shared-memory computers deliver a RAP of 30-50% of the PAP, while clusters deliver 5-15% of the PAP.
MACI - University of Alberta - April 2001
21
Non-Shared Address Space
Clusters require an explicit message passing programming model:
MPI is the most widely used parallel programming model today
PVM is used in some engineering departments.
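To make the message-passing model concrete, here is a minimal MPI sketch in C: two processes, one MPI_Send and one matching MPI_Recv. The payload value and the tag are arbitrary choices for this illustration, not part of the original slides.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. "mpirun -np 2 ./a.out"; every byte that moves between the two address spaces is moved by an explicit send/receive pair.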
MACI - University of Alberta - April 2001
22
Large and Expensive Clusters
ASCI White
8,192 PowerPC processors
6 TB of memory
160 TB of disk space
12.3 Teraops (peak)
28 tractor trailers to transport (July 2000)
Supplier: IBM
Client: US Department of Energy
Main application: simulated testing of the nuclear weapons stockpile
MACI - University of Alberta - April 2001
23
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
24
Programming Model Requirements
What data can be named by the threads?
What operations can be performed on the named data?
What ordering exists among these operations?
MACI - University of Alberta - April 2001
25
Programming Model Requirements
Naming:
Global Physical Address Space
Independent Local Physical Address Spaces
Ordering:
Mutual Exclusion
Events
Communication vs. Synchronization
MACI - University of Alberta - April 2001
26
Parallel Framework
Layers: Programming Model:
Multiprogramming: lots of jobs, no communication
Shared address space: communicate via memory
Message passing: send and receive messages
MACI - University of Alberta - April 2001
27
Message Passing Model
Communicate through explicit I/O operations
Essentially NUMA, but integrated at I/O devices rather than at the memory system
Send specifies a local buffer + the receiving process on the remote computer
Receive specifies the sending process on the remote computer + a local buffer to place the data
Synchronization: when the send completes, when the buffer is free, when the request is accepted, the receive waits for the send
Send + receive => memory-to-memory copy, where each side supplies a local address, AND does pair-wise synchronization!
MACI - University of Alberta - April 2001
28
Shared Address Model Summary
Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ... or cache blocks
Uses virtual memory to map virtual addresses to local or remote physical addresses
Memory hierarchy model applies: communication moves data to the local processor cache (as a load moves data from memory to cache)
MACI - University of Alberta - April 2001
29
Shared Address/Memory Multiprocessor Model
Communicate via Load and Store
Oldest and most popular model
process: a virtual address space and ~1 thread of control
Multiple processes can overlap (share), but all threads share the process address space
Writes to the shared address space by one thread are visible to reads by other threads
MACI - University of Alberta - April 2001
30
Advantages of the shared-memory communication model
Compatibility with SMP hardware
Ease of programming
• for complex communication patterns; or
• for dynamic communication patterns
Uses the familiar SMP model
• attention only on performance-critical accesses
Lower communication overhead
• better use of bandwidth for small items
• memory mapping implements protection in hardware
HW-controlled caching
• reduces remote communication by caching all data, both shared and private
MACI - University of Alberta - April 2001
31
Advantages of the message-passing communication model
The hardware can be simpler
Communication is explicit =>
• simpler to understand
• focuses attention on the costly aspect of parallel computation
Synchronization is associated with messages
• reduces the potential for errors introduced by incorrect synchronization
Easier to implement sender-initiated communication models, which may have some advantages in performance
MACI - University of Alberta - April 2001
32
Programming Models
[Diagram relating programming models to architectures, spanning the Data Parallel and Task Parallel models:
SIMD - Single Instruction, Multiple Data → SIMD architectures
SPMD - Single Program, Multiple Data → MIMD architectures
MPMD - Multiple Programs, Multiple Data → MIMD architectures]
MACI - University of Alberta - April 2001
33
OpenMP (1)
OpenMP gives programmers a “simple” and portable interface for developing shared-memory parallel programs
OpenMP supports C/C++ and Fortran on “all” architectures, including Unix platforms and Windows NT platforms
may become the industry standard
MACI - University of Alberta - April 2001
34
OpenMP (2) - C
#pragma omp parallel for shared(A) private(i)
for( i = 1; i <= 100; i++ ) {
    ... + A[i];          /* use A[i]     */
    A[i] = ... ;         /* compute A[i] */
}
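For context, a complete, compilable variant of this kind of loop might look as follows; the array contents and the reduction are made-up details for this sketch, not the slide's original computation.

    #include <stdio.h>

    #define N 100

    int main(void)
    {
        double A[N + 1];
        double sum = 0.0;
        int i;

        /* Iterations are independent, so OpenMP can split them across
           threads; the reduction combines the per-thread partial sums. */
        #pragma omp parallel for shared(A) private(i) reduction(+:sum)
        for (i = 1; i <= N; i++) {
            A[i] = 2.0 * i;      /* compute A[i] */
            sum += A[i];         /* use A[i]     */
        }

        printf("sum = %f\n", sum);
        return 0;
    }

Built with an OpenMP-aware compiler (e.g. "gcc -fopenmp"), the same source also compiles and runs sequentially when the pragma is ignored, which is one of OpenMP's main attractions.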
MACI - University of Alberta - April 2001
35
OpenMP (3) - Fortran
c$omp parallel do schedule(static)
c$omp& shared(omega,error,uold,u)
c$omp& private(i,j,resid)
c$omp& reduction(+:error)
      do j = 2, m-1
        do i = 2, n-1
          resid = calcerror(uold,i,j)
          u(i,j) = uold(i,j) - omega * resid
          error = error + resid*resid
        end do
      end do
c$omp end parallel do
MACI - University of Alberta - April 2001
36
Vector Processing (1)
Cray, NEC computers
multiple functional units, each with multiple stages
replication and pipelined parallelism at the instruction level
MACI - University of Alberta - April 2001
37
Vector Processing (2)
for( i=0; i<=N; i++ )
    A[i] = B[i] * C[i];
[Diagram: successive multiplications flow through pipelined functional units Mult1, Mult2, Mult3]
MACI - University of Alberta - April 2001
38
Multi-threading
OS-level multi-threading: P-threads
Programming Language-level multi-threading: Java
Fine Grain Multi-threading: Threaded-C, Cilk, TAM
Hardware Supported Multi-threading: Tera
Instruction Level Multi-threading: Simultaneous Multi-threading (Compaq-Intel)
MACI - University of Alberta - April 2001
39
Other Issues: Debugging
Debugging parallel programs can be frustrating
non-deterministic execution
probe effect
difficult to "stop" a parallel program
multiple core files
difficult to visualize parallel activity
tools are barely adequate
MACI - University of Alberta - April 2001
40
Other Issues:Performance Tuning
Use available performance tuning tools (perfex, Speedshop on SGI) to know where the program spends time.
Re-tune code for performance when hardware changes.
MACI - University of Alberta - April 2001
41
Other Issues: Fault Tolerance
Consider a job running on 40 processors for a week; then there is a power outage, and all the work is lost.
Long-running jobs must be able to save the program's state and restart from that state. This is called checkpointing.
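A minimal sketch of checkpointing in C, assuming the entire program state fits in a single struct and using a made-up file name; production checkpointing (and parallel checkpointing in particular) has to deal with much more, such as open files, in-flight messages, and atomic replacement of the previous checkpoint.

    #include <stdio.h>

    /* Hypothetical program state: one struct that fully describes progress. */
    struct state {
        long   iteration;
        double result;
    };

    /* Write the state to disk so the job can be restarted after a crash. */
    int checkpoint(const char *path, const struct state *s)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t n = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    /* Try to resume from a previous checkpoint; return 0 on success. */
    int restore(const char *path, struct state *s)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        size_t n = fread(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    int main(void)
    {
        struct state s = { 0, 0.0 };

        if (restore("job.ckpt", &s) == 0)
            printf("resuming at iteration %ld\n", s.iteration);

        for (; s.iteration < 1000000; s.iteration++) {
            s.result += 1.0;                      /* the real work goes here */
            if (s.iteration % 100000 == 0)
                checkpoint("job.ckpt", &s);       /* save state periodically */
        }
        printf("result = %f\n", s.result);
        return 0;
    }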
MACI - University of Alberta - April 2001
42
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
43
What Does Coherency Mean?
Informally:
• "Any read must return the most recent write"
• Too strict and too difficult to implement
Better:
• "Any write must eventually be seen by a read"
• All writes are seen in proper order ("serialization")
Two rules to ensure this:
• "If P writes x and P1 reads x, P's write will be seen by P1 if the read and write are sufficiently far apart"
• Writes to a single location are serialized: seen in the same order
MACI - University of Alberta - April 2001
44
Potential HW Coherency Solutions
Snooping Solution (Snoopy Bus):
• Send all requests for data to all processors
• Each processor snoops to see if it has a copy
• Requires broadcast
• Works well with a bus (natural broadcast medium)
• Preferred scheme for small-scale machines
Directory-Based Schemes:
• Keep track of what is being shared in one centralized place
• Distributed memory => distributed directory
• Sends point-to-point requests
• Scales better than snooping
MACI - University of Alberta - April 2001
45
Basic Snoopy Protocols
Write Invalidate Protocol:
• Multiple readers, single writer
• Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
• Read miss:
  • Write-through: memory is always up-to-date
  • Write-back: snoop in caches to find the most recent copy
Write Broadcast Protocol:
• Write to shared data: broadcast on the bus; processors snoop and update any copies
• Read miss: memory is always up-to-date
Write serialization: the bus serializes requests!
• The bus is a single point of arbitration
MACI - University of Alberta - April 2001
46
Basic Snoopy Protocols
Write Invalidate versus Broadcast:
• Invalidate requires one transaction per write-run
• Invalidate uses spatial locality: one transaction per block
• Broadcast has lower latency between write and read
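As an illustration of the write-invalidate idea (not a real coherence controller), the toy C sketch below tracks the state of a single block in two caches with MSI-like states and prints the transitions for the read/read/write/read sequence illustrated on the following slides; the names and the event sequence are invented for this example.

    #include <stdio.h>

    /* Toy write-invalidate protocol for ONE memory block and two caches. */
    enum state { INVALID, SHARED, MODIFIED };
    static const char *name[] = { "Invalid", "Shared", "Modified" };

    static enum state cache[2] = { INVALID, INVALID };

    static void read_block(int p)
    {
        if (cache[p] == INVALID) {
            /* Read miss: fetch the block; a Modified copy elsewhere is
               written back and both copies become Shared.              */
            if (cache[1 - p] == MODIFIED)
                cache[1 - p] = SHARED;
            cache[p] = SHARED;
        }
    }

    static void write_block(int p)
    {
        /* Write: invalidate the other copy; own copy becomes Modified. */
        cache[1 - p] = INVALID;
        cache[p] = MODIFIED;
    }

    static void show(const char *event)
    {
        printf("%-10s  P0:%-8s  P1:%-8s\n",
               event, name[cache[0]], name[cache[1]]);
    }

    int main(void)
    {
        read_block(0);  show("P0 reads");
        read_block(1);  show("P1 reads");
        write_block(0); show("P0 writes");   /* P1's copy is invalidated   */
        read_block(1);  show("P1 reads");    /* P0 downgrades, both Shared */
        return 0;
    }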
MACI - University of Alberta - April 2001
47
Snoopy Cache Invalidation Protocol (Example)
[Diagram: main memory holds x (value o); Processor 0 issues "read x", causing a read miss on the interconnection network]
MACI - University of Alberta - April 2001
48
Snoopy Cache Invalidation Protocol (Example)
[Diagram: the block containing x (value o) is now in the reading processor's cache in the shared state]
MACI - University of Alberta - April 2001
49
Snoopy Cache Invalidation Protocol (Example)
[Diagram: a second processor issues "read x", causing another read miss]
MACI - University of Alberta - April 2001
50
Snoopy Cache Invalidation Protocol (Example)
[Diagram: both processors now hold x (value o) in the shared state]
MACI - University of Alberta - April 2001
51
Snoopy Cache Invalidation Protocol (Example)
[Diagram: one of the processors issues "write x"; an invalidate is broadcast on the interconnection network]
MACI - University of Alberta - April 2001
52
Snoopy Cache Invalidation Protocol (Example)
[Diagram: the writing processor now holds x (value 1) in the exclusive state; the other copies have been invalidated and memory still holds the old value o]
MACI - University of Alberta - April 2001
53
Programmer's Abstraction for a Sequential Consistency Model
[Diagram: processors P1 … Pn connected through a single switch to memory; the switch is randomly set after each memory reference]
(See Culler, Singh, and Gupta, p. 287)
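To make the abstraction concrete, the classic litmus test below shows what sequential consistency forbids: if every access went through the single switch above, at least one of the two observed values must be 1. This is a sketch using C11 threads and relaxed atomics (availability of <threads.h> varies by platform); on hardware and compilers with weaker orderings the "forbidden" outcome can actually be observed.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int x, y;          /* shared flags, reset to 0 each round   */
    int r1, r2;               /* values observed by the two threads    */

    int t1(void *arg) {       /* thread 1: write x, then read y */
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return 0;
    }

    int t2(void *arg) {       /* thread 2: write y, then read x */
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return 0;
    }

    int main(void)
    {
        int violations = 0;
        for (int i = 0; i < 100000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            thrd_t a, b;
            thrd_create(&a, t1, NULL);
            thrd_create(&b, t2, NULL);
            thrd_join(a, NULL);
            thrd_join(b, NULL);
            /* Under sequential consistency r1 == 0 && r2 == 0 cannot
               happen; with relaxed ordering on a real machine it can. */
            if (r1 == 0 && r2 == 0)
                violations++;
        }
        printf("non-SC outcomes observed: %d\n", violations);
        return 0;
    }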
MACI - University of Alberta - April 2001
54
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
55
Top 500 (November 10, 2001)
Manuf.  | Computer                      | Rmax   | Site                      | Year | Proc. | Rpeak | Rmax/Rpeak
IBM     | ASCI White, Power3, 375 MHz   | 7226   | Lawrence Livermore        | 2000 | 8192  | 12288 | 0.59
Compaq  | AlphaServer SC ES45/1 GHz     | 4059   | Pittsburgh Superc. Center | 2001 | 6048  | 6048  | 0.67
IBM     | SP Power3, 375 MHz, 16-way    | 3052   | NERSC/LBNL                | 2001 | 3328  | 4992  | 0.61
Intel   | ASCI Red                      | 2379   | Sandia Nat. Lab.          | 1999 | 9632  | 3207  | 0.74
IBM     | ASCI Blue-Pacific, SP 604e    | 2144   | Lawrence Livermore        | 1999 | 5808  | 3868  | 0.55
Compaq  | AlphaServer SC ES45/1 GHz     | 2096   | Los Alamos Nat. Lab.      | 2001 | 1536  | 3072  | 0.68
Hitachi | SR8000/MPP                    | 1709.1 | Univ. of Tokyo            | 2001 | 1152  | 2074  | 0.82
…
NEC (12)| SX-5/128M8 3.2ns              | 1192   | Osaka University          | 2001 | 128   | 1280  | 0.93

Rmax = maximal LINPACK performance achieved [GFLOPS]
Rpeak = theoretical peak performance [GFLOPS]
Canada appears in ranks 123, 144, 183, 255, 266, 280, 308, 311, 315, 414, 419.
MACI - University of Alberta - April 2001
56
Top500 Statistics
[Charts: breakdown of the Top 500 by computer family (HP SPP, IBM SP, SGI Origin, T3E/T3D, NOW, Sun UltraHPC, others), by type of organization (industry, research, academic, classified, vendor, government), by machine organization (MPP, constellations, SMP, clusters), and by country (USA, Germany, Japan, UK, France, Korea, Italy, Canada, others)]
MACI - University of Alberta - April 2001
57
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
58
Intel Architecture 64
Itanium, the first one, is out….
… but we are still waiting for McKinley...
Will we get Yamhill (Intel’s Plan B) instead?
MACI - University of Alberta - April 2001
59
Alpha is gone...
So is Compaq! EV8 is scrapped.
Compaq designers split between AMD and Intel.
Intel converts to the Simultaneous Multithreading religion, but renames it: Hyperthreading!!
MACI - University of Alberta - April 2001
60
IBM’s POWER4 is the 2001/2002 winner
Best floating-point and integer performance available on the market.
Highest memory bandwidth in the industry.
Well integrated cache coherence mechanism.
MACI - University of Alberta - April 2001
61
Implicit vs. Explicit Instruction-Level Parallelism
EPIC: Explicitly Parallel Instruction Computing
Superscalar: Instruction-Level Parallelism discovered implicitly by the compiler or by the hardware.
MACI - University of Alberta - April 2001
62
Instruction Level Parallelism
Most ILP is implicit, i.e., instructions that can be executed in parallel are automatically discovered by the hardware at runtime.
Intel launched Itanium, the first IA-64 processor, which explores explicit parallelism at the instruction level; in this processor the compiler encodes parallel operations in the assembly code.
But applications are still written in standard sequential programming languages.
MACI - University of Alberta - April 2001
63
Instruction Level Parallelism
For example, in the IA-64 an instruction group, identified by the compiler, is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies (they can execute in parallel).
Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8  r1=[r5] ;;   // First group
add  r3=r1,r4     // Second group
sub  r6=r8,r9 ;;  // Second group
st8  [r6]=r12     // Third group
MACI - University of Alberta - April 2001
64
IA-64 Innovations
if-conversion: execute both sides of a branch

if (r1 == 0)
    r2 = r3 + r4
else
    r7 = r6 - r5

     cmp.eq p1,p2 = r1,0 ;;  // set predicate registers
(p1) add r2 = r3,r4
(p2) sub r7 = r6,r5

data speculation: load a value before knowing if the address is correct.
control speculation: execute a computation before knowing if it has to be executed.
rotating registers: support for software pipelining.
MACI - University of Alberta - April 2001
65
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
66
Below / Above the Line
for(n=0 ; …)
  for(f=0 ; …)
    for(t=0 ; …)
      for(x=0 ; …)
        for(y=0 ; …)
          for(z=0 ; …) { …….. }
Application-Level Parallelism
Automatic Parallelism
MACI - University of Alberta - April 2001
67
Some Common Loop Optimizations
Unswitching
Loop Peeling
Loop Alignment
Index Set Splitting
Scalar Expansion
Loop Fusion
Loop Fission
Loop Reversal
Loop Interchange
MACI - University of Alberta - April 2001
68
Unswitching
Remove loop independent conditionals from a loop.
Before Unswitching:
for i=1 to N do
  for j=2 to N do
    if T[i] > 0 then
      A[i,j] = A[i,j-1]*T[i] + B[i]
    else
      A[i,j] = 0.0
    endif
  endfor
endfor

After Unswitching:
for i=1 to N do
  if T[i] > 0 then
    for j=2 to N do
      A[i,j] = A[i,j-1]*T[i] + B[i]
    endfor
  else
    for j=2 to N do
      A[i,j] = 0.0
    endfor
  endif
endfor
MACI - University of Alberta - April 2001
69
Loop Peeling
Remove the first (last) iteration of the loop into separate code.
Before Peeling:
for i=1 to N do
  A[i] = (X+Y)*B[i]
endfor

After Peeling:
if N >= 1 then
  A[1] = (X+Y)*B[1]
  for i=2 to N do
    A[i] = (X+Y)*B[i]
  endfor
endif
MACI - University of Alberta - April 2001
70
Index Set Splitting
Divides the index set into two portions.
Before Set Splitting:
for i=1 to 100 do
  A[i] = B[i] + C[i]
  if i > 10 then
    D[i] = A[i] + A[i-10]
  endif
endfor

After Set Splitting:
for i=1 to 10 do
  A[i] = B[i] + C[i]
endfor
for i=11 to 100 do
  A[i] = B[i] + C[i]
  D[i] = A[i] + A[i-10]
endfor
MACI - University of Alberta - April 2001
71
Scalar Expansion
Breaks anti-dependence relations by expanding, or promoting a scalar into an array.
Before Scalar Expansion:
for i=1 to N do
  T = A[i] + B[i]
  C[i] = T + 1/T
endfor

After Scalar Expansion:
if N >= 1 then
  allocate Tx(1:N)
  for i=1 to N do
    Tx[i] = A[i] + B[i]
    C[i] = Tx[i] + 1/Tx[i]
  endfor
  T = Tx[N]
endif
MACI - University of Alberta - April 2001
72
Loop Fusion
Takes two adjacent loops and generates a singleloop.
Before Loop Fusion:
(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to N do
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor

After Loop Fusion:
(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(5)   C[i] = A[i] / 2
(6) endfor
(7) for i=1 to N do
(8)   D[i] = 1 / C[i+1]
(9) endfor
MACI - University of Alberta - April 2001
73
Loop Fusion (Another Example)
(1) for i=1 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(2) A[1] = B[1] + 1
(1) for i=2 to 99 do
(2)   A[i] = B[i] + 1
(3) endfor
(4) for i=1 to 98 do
(5)   C[i] = A[i+1] * 2
(6) endfor

(1) i = 1
(2) A[i] = B[i] + 1
    for ib=0 to 97 do
(1)   i = ib+2
(2)   A[i] = B[i] + 1
(4)   i = ib+1
(5)   C[i] = A[i+1] * 2
(6) endfor
MACI - University of Alberta - April 2001
74
Loop Fission
Breaks a loop into two or more smaller loops.
Original Loop:
(1) for i=1 to N do
(2)   A[i] = A[i] + B[i-1]
(3)   B[i] = C[i-1]*X + Z
(4)   C[i] = 1/B[i]
(5)   D[i] = sqrt(C[i])
(6) endfor

After Loop Fission:
(1) for ib=0 to N-1 do
(3)   B[ib+1] = C[ib]*X + Z
(4)   C[ib+1] = 1/B[ib+1]
(6) endfor
(1) for ib=0 to N-1 do
(2)   A[ib+1] = A[ib+1] + B[ib]
(6) endfor
(1) for ib=0 to N-1 do
(5)   D[ib+1] = sqrt(C[ib+1])
(6) endfor
(1) i = N+1
MACI - University of Alberta - April 2001
75
Loop Reversal
Run a loop backward. All dependence directions are reversed.
It is only legal for loops that have no loop-carried dependences.
Can be used to allow fusion:

(1) for i=1 to N do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=1 to N do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(4) endfor
(5) for i=N downto 1 do
(6)   D[i] = 1/C[i+1]
(7) endfor

(1) for i=N downto 1 do
(2)   A[i] = B[i] + 1
(3)   C[i] = A[i]/2
(6)   D[i] = 1/C[i+1]
(7) endfor
MACI - University of Alberta - April 2001
76
Loop Interchanging
(1) for j=2 to M do
(2)   for i=1 to N do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor

(1) for i=1 to N do
(2)   for j=2 to M do
(3)     A[i,j] = A[i,j-1] + B[i,j]
(4)   endfor
(5) endfor
MACI - University of Alberta - April 2001
77
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
78
Speedup
goal is to use N processors to make a program run N times faster
speedup is the factor by which the program’s speed improves
Speedup(p processors) = Performance(p processors) / Performance(1 processor)

Speedup(p processors) = Time(1 processor) / Time(p processors)
MACI - University of Alberta - April 2001
79
Absolute vs. Relative Speedup
Careful: the execution time depends on what the program does!
A parallel program spends time in:
Work
Synchronization
Communication
Extra work
A program implemented for a parallel machine is likely to do extra work (compared with a sequential program) even when running on a single-processor machine!
MACI - University of Alberta - April 2001
80
Absolute vs. Relative Speedup
When talking about execution time, ask what algorithm is implemented!
Relative Speedup(p processors) = Time(Parallel Alg., 1 processor) / Time(Parallel Alg., p processors)

Absolute Speedup(p processors) = Time(Sequential Alg., 1 processor) / Time(Parallel Alg., p processors)
MACI - University of Alberta - April 2001
81
Speedup
MACI - University of Alberta - April 2001
82
Which is Better?
programs A & B solve the same problem using different algorithms
both are run on a 100-processor computer
program A gets a 90-fold speedup
program B gets a 10-fold speedup
Which one would you prefer to use?
MACI - University of Alberta - April 2001
83
It Depends!
all that matters is overall execution time
what if A runs sequentially 1,000 times slower than B?
always use the best sequential time (over all algorithms) for computing speedups!
and the best compiler!
MACI - University of Alberta - April 2001
84
Superlinear Speedups
sometimes N processors can achieve a speedup > N
usually the result of improving an inferior sequential algorithm
can legitimately occur because of cache and memory effects
MACI - University of Alberta - April 2001
85
Amdahl’s Law (1)
seq: sequential portion of a program
par: parallel portion of a program
N: number of processors

Speedup = (seq + par) / (seq + par / N)

The maximum speedup that can be obtained is:
MaxSpeedup = (seq + par) / seq = 1.0 / seq     (since seq + par = 1.0)
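A small worked example of these formulas in C (the 5% sequential fraction is an arbitrary assumption, not from the slides):

    #include <stdio.h>

    /* Amdahl's law: speedup achievable with n processors when a fraction
       'seq' of the program is inherently sequential (seq + par = 1).     */
    static double amdahl(double seq, double n)
    {
        return 1.0 / (seq + (1.0 - seq) / n);
    }

    int main(void)
    {
        double seq = 0.05;                    /* assume 5% sequential */
        for (int n = 1; n <= 1024; n *= 4)
            printf("N = %4d  speedup = %6.2f\n", n, amdahl(seq, n));
        printf("limit as N -> infinity: %.1f\n", 1.0 / seq);   /* = 20 */
        return 0;
    }

With seq = 0.05 the speedup is bounded by 20 no matter how many processors are added.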
MACI - University of Alberta - April 2001
86
Amdahl’s Law (2)
[Graph: speedup vs. number of processors N, with one curve per value of the sequential fraction (% seq time)]
MACI - University of Alberta - April 2001
87
Scalability
desirable property of algorithm is scalability, regardless of speedup
problem of size P using N processors takes time T
problem is scalable if problem of size 2P on 2N processors still takes time T
MACI - University of Alberta - April 2001
88
Choose Right Algorithm
understand strengths and weaknesses of hardware being used
choose an algorithm to exploit strengths and avoid weaknesses
Example: there are many parallel sorting algorithms, each valid for different hardware/application properties
MACI - University of Alberta - April 2001
89
This Talk
Motivation
Parallel Machine Organizations
Cluster Computing
Programming Models
Cache Coherence and Memory Consistency
The Top 500: Who is the fastest?
Processor Architecture: What is new?
The Role of Compilers
Speedup and Scalability
Final Remarks
MACI - University of Alberta - April 2001
90
The Reality Is ...
software – more issues to be addressed beyond sequential case
hardware – need to understand machine architecture
Parallel programming is hard!
The Reward Is ...
high performance – the motivation in the first place!
MACI - University of Alberta - April 2001
91
Do You Need Parallelism?
Consider the trade-offs:
+ Potentially faster execution turnaround
- Longer software development time
- Obtaining machine access
- Cost = f(development) + g(execution time)
Do the benefits outweigh the costs?
MACI - University of Alberta - April 2001
92
Resistance to Parallelism
software inertia – cost of converting existing software
hardware inertia – waiting for faster machines
lack of education
limited access to resources
MACI - University of Alberta - April 2001
93
Starting Out...
parallel program design is black magic
software tools are primitive
experience is an asset
All is not lost!
many problems are amenable to simple parallel solutions
MACI - University of Alberta - April 2001
94
Starting Out...
sequential world is simple: one architectural model
in parallel world, need to choose the algorithm to suit the architecture
parallel algorithm may only perform well on one class of machine
MACI - University of Alberta - April 2001
95
Granularity (1)
Granularity is the relation between the amount of computation and the amount of communication.
It is a measure of how much work gets done before processes have to communicate.
MACI - University of Alberta - April 2001
96
Granularity (2)
Problem: shopping for 100 items of grocery.
Scenario 1: (small granularity)
You are told an item to buy, go to the store, purchase the item, then return home to find out the next item to buy.
Scenario 2: (large granularity)
You are told all 100 items, and then you make a single trip to purchase everything.
MACI - University of Alberta - April 2001
97
Granularity (3)
As in the real world, communication and synchronization take time
MACI - University of Alberta - April 2001
98
Architectures
Match problem granularity to the parallel architecture:
fine-grained: vector/array processors
medium-grained: shared-memory multiprocessors
coarse/large-grained: networks of workstations
MACI - University of Alberta - April 2001
99
Program Design
1) Identify the hardware platforms available
2) Identify parallelism in the application
3) Choose the right type of algorithmic parallelism to match the architecture
4) Implement the algorithm, being wary of performance and correctness issues
MACI - University of Alberta - April 2001
100
Vector Processing (3)
this class of machines is very effective at striding through arrays
some parallelism can be automatically detected by compiler
there is a "right" way and a "wrong" way to code loops to allow parallelism
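A hedged sketch of what "right" and "wrong" can mean here: the first loop below has independent, unit-stride iterations that a vectorizing compiler can handle, while the second carries a dependence from one iteration to the next and cannot be vectorized as written. The function names and the array size are made up for this illustration.

    #define N 1024

    void scale_ok(double *a, const double *b, const double *c)
    {
        /* Independent iterations, unit stride: easy to vectorize. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] * c[i];
    }

    void recurrence_bad(double *a)
    {
        /* a[i] depends on a[i-1] computed in the previous iteration:
           a loop-carried dependence that blocks straightforward
           vectorization.                                            */
        for (int i = 1; i < N; i++)
            a[i] = a[i] - a[i - 1];
    }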
MACI - University of Alberta - April 2001
101
Distributed Memory (1)
Loosely coupled processors
IBM SP-2, networks of computers
PCs connected by a fast network
MACI - University of Alberta - April 2001
102
Distributed Memory (2)
communication between processes by sending messages
Overhead of parallelism includes:
cost of preparing a message
cost of sending a message
cost of waiting for a message
MACI - University of Alberta - April 2001
103
Communication
distributed memory (loosely coupled)
Explicitly send messages between processes
MACI - University of Alberta - April 2001
104
Synchronization
MACI - University of Alberta - April 2001
105
Message Passing (1)
Process 1                  Process 2
compute                    compute
Send( P2, info );          compute
compute                    Receive( P1, info );
idle                       compute
idle                       Send( P1, reply );
Receive( P2, reply );
The Send/Receive pairs both communicate and synchronize.
MACI - University of Alberta - April 2001
106
Message Passing (2)
Two popular message passing libraries:
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
MPI will likely be the industry standard
Both are easy to use, but verbose
MACI - University of Alberta - April 2001
107
Master/Slave (1)
many distributed memory programs are structured as one master and N slaves
the master generates work to do
when idle, a slave asks for a piece of work, gets it, does the task, and reports the result
MACI - University of Alberta - April 2001
108
Master/Slave (2)
[Diagram: one master process M connected to several slave processes S]

Master                              Slave
worklist = Make( work );            data = NEED_WORK;
while( worklist != empty )          while( true )
{                                   {
  Receive( slave, result );           Send( master, data );
  to_do = Head( worklist );           /* wait */
  Send( slave, to_do );               Receive( master, work );
  Process( result );                  data = Process( work );
}                                   }
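The same structure expressed as a hedged MPI sketch in C; the tags, the number of work items, the squaring "work", and the termination convention are all invented for this illustration (it also assumes at least one slave and no more slaves than work items).

    #include <mpi.h>
    #include <stdio.h>

    #define NITEMS   100   /* total pieces of work (arbitrary) */
    #define WORK_TAG 1
    #define STOP_TAG 2

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* assumes 2 <= size <= NITEMS+1 */

        if (rank == 0) {                       /* master */
            int next = 0, result;
            MPI_Status st;

            /* hand one initial work item to every slave */
            for (int w = 1; w < size; w++) {
                MPI_Send(&next, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
                next++;
            }
            /* collect one result per work item; reply with more work or STOP */
            for (int done = 0; done < NITEMS; done++) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                int tag = (next < NITEMS) ? WORK_TAG : STOP_TAG;
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
                if (tag == WORK_TAG) next++;
            }
        } else {                               /* slave */
            int work, result;
            MPI_Status st;

            while (1) {
                MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == STOP_TAG) break;
                result = work * work;          /* the "Process(work)" step */
                MPI_Send(&result, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }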
MACI - University of Alberta - April 2001
109
Pitfall: Deadlock
scenario by which no further progress can be made, since each process is waiting on a cyclic dependency
MACI - University of Alberta - April 2001
110
Pitfall: Deadlock
A real-world problem!
MACI - University of Alberta - April 2001
111
Pitfall: Load Balancing
need to assign a roughly equal portion of work to each processor
load imbalances can result in poor speedups
MACI - University of Alberta - April 2001
112
Shared Memory (1)
Tightly coupled multiprocessors
SGI Origin 2400, Sun E10000
Classified by memory access times:
Same (SMP - symmetric multiprocessor)
Different (NUMA - non-uniform memory access)
[Diagram: processors P sharing a single memory]
MACI - University of Alberta - April 2001
113
Shared Memory (2)
communicate between processes through reading/writing shared variables (instead of sending and receiving messages)
Overhead of parallelism includes:
contention for shared resources
protecting the integrity of shared resources
MACI - University of Alberta - April 2001
114
Communication
shared memory (tightly coupled)
Read from and write to shared data
MACI - University of Alberta - April 2001
115
Synchronization
May have to prevent simultaneous access!
Process 1                 Process 2
B = BankBalance;          B = BankBalance;
B = B + 100;              B = B + 150;
BankBalance = B;          BankBalance = B;
Print Statement;          Print Statement;
What is the value of BankBalance?
MACI - University of Alberta - April 2001
116
Pitfall: Shared Data Access
Need to restrict access to data!
Avoid race conditions!
Process 1                 Process 2
Lock( access );           Lock( access );
B = BankBalance;          B = BankBalance;
B = B + 100;              B = B + 150;
BankBalance = B;          BankBalance = B;
Unlock( access );         Unlock( access );
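The same idea expressed with POSIX threads in C (a sketch; the amounts and the initial balance are arbitrary): the mutex makes the read-modify-write of the shared balance atomic, so the final value is always 1250.

    #include <pthread.h>
    #include <stdio.h>

    static double bank_balance = 1000.0;                 /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *deposit(void *arg)
    {
        double amount = *(double *)arg;
        pthread_mutex_lock(&lock);       /* enter critical section */
        double b = bank_balance;         /* read                   */
        b = b + amount;                  /* modify                 */
        bank_balance = b;                /* write                  */
        pthread_mutex_unlock(&lock);     /* leave critical section */
        return NULL;
    }

    int main(void)
    {
        double a1 = 100.0, a2 = 150.0;
        pthread_t t1, t2;

        pthread_create(&t1, NULL, deposit, &a1);
        pthread_create(&t2, NULL, deposit, &a2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* With the lock the result is always 1250.0, regardless of
           which thread runs first.                                  */
        printf("BankBalance = %.2f\n", bank_balance);
        return 0;
    }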
MACI - University of Alberta - April 2001
117
Multi-threading
MACI - University of Alberta - April 2001
118
Simultaneous Multi-threading
http://www.eet.com/story/0EG19991008S0014 by Rick Merritt
MACI - University of Alberta - April 2001
119
Top 500
http://www.top500.org
[Several slides of charts from the Top 500 list]
MACI - University of Alberta - April 2001
126
Conclusions
some problems require extensive computational resources
parallelism allows you to decrease experiment turnaround time
the tools are adequate but still have a long way to go
“simple” parallelism gets some performance but maximum performance requires effort.
Performance commensurate with effort!
MACI - University of Alberta - April 2001
127
Reminders
understand your computational needs
understand the hardware and software resources available to you
match parallelism to the architecture
maximize utilization, don’t waste cycles!
granularity, granularity, granularity
develop, test and debug small data sets before trying large ones
be wary of the many pitfalls
MACI - University of Alberta - April 2001
128
We Want You!
For help with parallel programming, contact
Get parallel! Become part of MACI