CS 267 Applications of Parallel Computers
Lecture 5: Sources of Parallelism (continued)
Shared-Memory Multiprocessors
Kathy Yelick
http://www.cs.berkeley.edu/~dmartin/cs267/
Outline
° Recap
° Parallelism and Locality in PDEs
• Continuous Variables Depending on Continuous Parameters
• Example: The heat equation
• Euler’s method
• Indirect methods
° Shared Memory Machines
• Historical Perspective: Centralized Shared Memory
• Bus-based Cache-coherent Multiprocessors
• Scalable Shared Memory Machines
Recap: Source of Parallelism and Locality
° Discrete event systems
• model is discrete space with discrete interactions
• synchronous and asynchronous versions
• parallelism over graph of entities; communication for events
° Particle systems
• discrete entities moving in continuous space and time
• parallelism between particles; communication for interactions
° ODEs
• systems of lumped (discrete) variables, continuous parameters
• parallelism in solving (usually sparse) linear systems
• graph partitioning for parallelizing the sparse matrix computation
° PDEs (today)
Continuous Variables, Continuous Parameters
Examples of such systems include
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or gravitational potential: Potential(position)
° Fluid flow: Velocity, Pressure, Density(position, time)
° Quantum mechanics: Wave-function(position, time)
° Elasticity: Stress, Strain(position, time)
Example: Deriving the Heat Equation
Consider a simple problem:

[Figure: a one-dimensional bar from 0 to 1, with points x and x+h marked]
° A bar of uniform material, insulated except at ends
° Let u(x,t) be the temperature at position x at time t
° Heat travels from x to x+h at a rate proportional to the temperature difference, so the temperature at x changes at rate

    d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

° As h → 0, we get the heat equation:

    d u(x,t)/dt = C * d^2 u(x,t)/dx^2
Explicit Solution of the Heat Equation
° For simplicity, assume C=1
° Discretize both time and position
° Use finite differences, with xt[i] as the heat at time t and position i
• initial conditions on x0[i]
• boundary conditions on xt[0] and xt[n-1]
° At each timestep:

    xt+1[i] = z * xt[i-1] + (1 - 2*z) * xt[i] + z * xt[i+1]

    where z = k/h^2

° This corresponds to
• a matrix-vector multiply
• nearest neighbors on the grid

[Figure: space-time grid with positions x[0]..x[5] and timesteps t = 0..5; each point at time t+1 depends on its three neighbors at time t]
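To make the update concrete, here is a minimal serial sketch in C (not from the slides; the grid size, timestep count, and value of z are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define N 100        /* number of grid points */
    #define STEPS 1000   /* number of timesteps */

    int main(void)
    {
        double x[N], xnew[N];
        double z = 0.4;  /* z = k/h^2; must stay <= 0.5 for stability */

        /* initial conditions: a hot spot in the middle of the bar */
        memset(x, 0, sizeof x);
        x[N / 2] = 1.0;

        for (int t = 0; t < STEPS; t++) {
            /* boundary conditions: both ends held at 0 */
            xnew[0] = xnew[N - 1] = 0.0;
            /* nearest-neighbor update at each interior point */
            for (int i = 1; i < N - 1; i++)
                xnew[i] = z * x[i - 1] + (1 - 2 * z) * x[i] + z * x[i + 1];
            memcpy(x, xnew, sizeof x);
        }
        printf("u(midpoint) after %d steps: %g\n", STEPS, x[N / 2]);
        return 0;
    }

The stability constraint on z is exactly the "timesteps must be very small" problem discussed on the next slide.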
Parallelism in Explicit Method for PDEs
° Partitioning the space (x) into p large chunks (see the sketch below) gives
• good load balance (assuming a large number of points relative to p)
• minimized communication (only at the p chunk boundaries)
° Generalizes to
• multiple dimensions
• arbitrary graphs (= sparse matrices)
° Problem with the explicit approach
• numerical instability
• need to make the timesteps very small
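As a sketch of this owner-computes partitioning (a hypothetical OpenMP version, not from the slides; schedule(static) hands each thread one contiguous chunk):

    /* One explicit timestep over n points, partitioned into contiguous
     * chunks, one per thread.  The only values a thread needs from a
     * neighboring chunk are the single points on either side of its
     * block boundary, so communication is minimal. */
    void heat_step(int n, double z, const double *x, double *xnew)
    {
        xnew[0] = xnew[n - 1] = 0.0;          /* boundary conditions */
    #pragma omp parallel for schedule(static)
        for (int i = 1; i < n - 1; i++)
            xnew[i] = z * x[i - 1] + (1 - 2 * z) * x[i] + z * x[i + 1];
    }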
Implicit Solution
° As with many (stiff) ODEs, need an implicit method
° This turns into solving the following equation:

    (I + (z/2)*T) * xt+1 = (I - (z/2)*T) * xt

° where I is the identity matrix and T is the tridiagonal matrix (shown here for five grid points):

    T = [  2 -1          ]
        [ -1  2 -1       ]
        [    -1  2 -1    ]
        [       -1  2 -1 ]
        [          -1  2 ]

° I.e., essentially solving Poisson's equation
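Because the 1D system is tridiagonal, each implicit step can be solved in O(n) time. A sketch of the Thomas algorithm in C (illustrative; the slides do not prescribe a particular solver):

    /* Solve A*u = d where A is tridiagonal with constant diagonal a and
     * constant sub/superdiagonal b, as in A = I + (z/2)*T
     * (a = 1 + z, b = -z/2).  Forward elimination + back substitution,
     * O(n) total.  Overwrites d with the solution; c is scratch space. */
    void tridiag_solve(int n, double a, double b, double *d, double *c)
    {
        c[0] = b / a;
        d[0] = d[0] / a;
        for (int i = 1; i < n; i++) {
            double m = a - b * c[i - 1];   /* pivot after elimination */
            c[i] = b / m;
            d[i] = (d[i] - b * d[i - 1]) / m;
        }
        for (int i = n - 2; i >= 0; i--)   /* back substitution */
            d[i] -= c[i] * d[i + 1];
    }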
2D Implicit Method
° Similar to the 1D case, but the matrix T is now:

    T = [  4 -1  .  . -1  .  .  . ]
        [ -1  4 -1  .  . -1  .  . ]
        [  . -1  4 -1  .  . -1  . ]
        [  .  . -1  4 -1  .  . -1 ]
        [ -1  .  . -1  4 -1  .  . ]
        [  . -1  .  . -1  4 -1  . ]
        [  .  . -1  .  . -1  4 -1 ]
        [  .  .  . -1  .  . -1  4 ]

    (4 on the diagonal; -1 entries couple nearest grid neighbors in x (offset 1) and in y (offset 4, the grid width here); dots are zeros)

° Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation (see the sketch below)
° To solve this system, there are several techniques
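T never needs to be stored explicitly; applying it is just the 5-point stencil. A hypothetical C sketch (row-major n-by-n grid, zero Dirichlet boundary assumed):

    /* y = T*x for the 2D 5-point stencil on an n-by-n grid stored
     * row-major; points outside the grid are treated as zero. */
    void apply_T(int n, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double v = 4.0 * x[i * n + j];
                if (i > 0)     v -= x[(i - 1) * n + j];  /* north */
                if (i < n - 1) v -= x[(i + 1) * n + j];  /* south */
                if (j > 0)     v -= x[i * n + j - 1];    /* west  */
                if (j < n - 1) v -= x[i * n + j + 1];    /* east  */
                y[i * n + j] = v;
            }
    }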
Algorithms for Solving the Poisson Equation
    Algorithm      Serial     PRAM            Mem       #Procs
    Dense LU       N^3        N               N^2       N^2
    Band LU        N^2        N               N^(3/2)   N
    Jacobi         N^2        N               N         N
    Conj. Grad.    N^(3/2)    N^(1/2) log N   N         N
    RB SOR         N^(3/2)    N^(1/2)         N         N
    Sparse LU      N^(3/2)    N^(1/2)         N log N   N
    FFT            N log N    log N           N         N
    Multigrid      N          log^2 N         N         N
    Lower bound    N          log N           N

PRAM is an idealized parallel model with zero-cost communication.
Administrative
° HW2 extended to Monday, Feb. 16th
° Break
° On to shared memory machines
Programming Recap and History of SMPs
Relationship of Architecture and Programming Model
[Diagram: layers from top to bottom: Parallel Application; Programming Model; User/System Interface (compiler, library); HW/SW interface (comm. primitives, operating system); Hardware]
Shared Address Space Programming Model
° Collection of processes
° Naming:
• each process can name data in a private address space, and
• all can name all data in a common "shared" address space
° Operations
• uniprocessor operations, plus synchronization operations on shared addresses
- lock, unlock, test&set, fetch&add, ... (see the sketch below)
° Operations on the shared address space appear to be performed in program order
• a process's own operations appear to it in program order
• all processes see a consistent interleaving of each other's operations
• like timesharing on a uniprocessor
• explicit synchronization operations are used when program ordering is not sufficient
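A minimal POSIX-threads sketch of a lock protecting a shared address (the names shared_sum and add_partial are illustrative, not from the slides):

    #include <pthread.h>

    double shared_sum = 0.0;   /* lives in the shared address space */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread adds its private partial result to the shared total;
     * the lock makes the read-modify-write appear atomic to all others. */
    void add_partial(double my_part)   /* my_part is private */
    {
        pthread_mutex_lock(&lock);
        shared_sum += my_part;
        pthread_mutex_unlock(&lock);
    }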
Example: shared flag indicating full/empty
° Intuitively clear that the intention was to convey meaning by the order of stores
° No data dependences
° Sequential compiler / architecture would be free to reorder them!
    P1                    P2
    A = 1;                a: while (flag is 0) do nothing;
    b: flag = 1;          print A;
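In modern C this intent can be stated explicitly. A sketch with C11 atomics (which postdate this lecture) that forbids the problematic reordering:

    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    void producer(void)   /* P1 */
    {
        A = 1;
        /* release: the store to A cannot move after the flag store */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(void)   /* P2 */
    {
        /* acquire: reads after the loop cannot move before the flag load */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;  /* spin */
        printf("%d\n", A);   /* guaranteed to print 1 */
    }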
Historical Perspective
° Diverse spectrum of parallel machines designed to implement a particular programming model directly
° Technological convergence on collections of microprocessors on a scalable interconnection network
° Map any programming model to simple hardware, with some specialization
• Shared Address Space: centralized shared memory
• Message Passing: hypercubes and grids
• Data Parallel: SIMD

[Diagram: the convergence architecture: nodes of processor (P), cache ($), memory (M), and communication assist (CA), each essentially a complete computer, connected by a scalable interconnection network]
60s Mainframe Multiprocessors
° Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices
° How do you enhance processing capacity?
• add processors
° Already need an interconnect between slow memory banks and processor + I/O channels
• cross-bar or multistage interconnection network
[Diagrams: (a) a multiprocessor with processors, memory modules, and I/O channels on a shared interconnect; (b) cross-bar; (c) multistage interconnection network]
Caches: A Solution and New Problems
70s breakthrough
° Caches!
[Diagram: a fast processor with a cache holding a copy of location A (value 17), connected over the interconnect to slow memory and to I/O devices or other processors]
Technology Perspective
DRAM capacity and speed:

    Year    Size      Cycle Time
    1980    64 Kb     250 ns
    1983    256 Kb    220 ns
    1986    1 Mb      190 ns
    1989    4 Mb      165 ns
    1992    16 Mb     145 ns
    1995    64 Mb     120 ns

    Capacity: 1000:1!    Speed: 2:1!

Trends:

              Capacity         Speed
    Logic:    2x in 3 years    2x in 3 years
    DRAM:     4x in 3 years    1.4x in 10 years
    Disk:     2x in 3 years    1.4x in 10 years

[Plot: SPECint and SPECfp performance vs. year, 1986-1994]
Bus Bottleneck and Caches
[Diagram: processors with caches sharing a bus to memory modules and I/O]

Assume a 100 MB/s bus and a 50 MIPS processor without a cache:
=> 200 MB/s instruction bandwidth per processor
=> 60 MB/s data bandwidth at 30% load-store
=> 260 MB/s combined, so even one processor saturates the bus

Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte blocks):
=> 4 MB/s instruction bandwidth per processor
=> 12 MB/s data bandwidth per processor
=> 16 MB/s combined bandwidth per processor
=> 8 processors will saturate the bus

The cache provides a bandwidth filter, as well as reducing average access time.
Cache Coherence: The Semantic Problem
° Scenario:
• p1 and p2 both have cached copies of x (as 0)
• p1 writes x=1 and then the flag, f=1, pulling f into its cache
- both of these writes may write through to memory
• p2 reads f (bringing it into its cache) to see if it is 1, which it is
• p2 therefore reads x, but gets the stale cached copy (x=0)

[Diagram: memory holds x = 1, f = 1; p1's cache holds x 1, f 1; p2's cache holds the stale x 0 and f 1]
Snoopy Cache Coherence
° The bus is a broadcast medium
• all caches can watch each other's memory operations
° All processors write through
• a write updates the local cache and issues a global bus write, which
- updates main memory
- invalidates/updates all other caches holding that item
• examples: early Sequent and Encore machines
° Caches stay coherent
° Consistent view of memory!
• one shared write at a time
° Performance is much worse than a uniprocessor with write-back caches
• since ~15-30% of references are writes, this scheme consumes tremendous bus bandwidth; few processors can be supported
Write-Back/Ownership Schemes
° When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
• reads by others cause it to return to “shared” state
° Most bus-based multiprocessors today use such schemes.
° Many variants of ownership-based protocols
Programming SMPs
° Consistent view of shared memory
° All addresses equidistant
• don't worry about data partitioning
° Automatic replication of shared data close to the processor
° If the program concentrates on a block of the data set that no one else updates => very fast
° Communication occurs only on cache misses
• cache misses are slow
° The processor cannot distinguish communication misses from regular cache misses
° Cache blocks may introduce artifacts
• two distinct variables in the same cache block
• false sharing (see the sketch below)
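A sketch of the false-sharing artifact in C (the 64-byte block size and the counter names are assumptions): two threads update distinct counters, but if the counters share a cache block, the block ping-pongs between the two caches; padding gives each counter its own block.

    #include <pthread.h>
    #include <stdio.h>

    enum { ITERS = 100000000 };

    /* Problematic layout: two distinct counters in one cache block, so
     * each write by one thread invalidates the other's cached copy. */
    struct { volatile long a, b; } together;

    /* Padding to a (typical) 64-byte block gives each counter its own
     * block; the threads no longer interfere. */
    struct { volatile long a; char pad[56]; volatile long b; } apart;

    void *bump_a(void *arg) { for (long i = 0; i < ITERS; i++) apart.a++; return NULL; }
    void *bump_b(void *arg) { for (long i = 0; i < ITERS; i++) apart.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", apart.a, apart.b);
        return 0;
    }

Switching the bumpers to the "together" layout typically slows this program down by a large factor, even though no datum is ever shared.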
Scalable Cache-Coherence
90s: Scalable, Cache-Coherent Multiprocessors

[Diagram: processors with caches on an interconnection network; for each memory block, the directory keeps presence bits (one per node, 1..n) and a dirty bit]
SGI Origin 2000
[Diagram: each SGI Origin 2000 node has two processors, each with a 1-4 MB L2 cache, connected through a Hub to main memory (1-4 GB) with its directory and to the Xbow I/O crossbar; Hubs attach to the interconnection network]
90’s Pushing the bus to the limit: Sun Enterprise
Gigaplane(TM) bus (256-bit data, 41-bit address, 83 MHz)

[Diagram: CPU/memory cards (two processors with L2 caches plus a memory controller, behind a bus interface/switch) and I/O cards (behind a bus interface) plugged into the Gigaplane bus]
90’s Pushing the SMP to the masses
P-Pro bus (64-bit data, 36-bit address, 66 MHz)

[Diagram: four P-Pro modules (each a CPU with 256 KB L2 cache, IC, and bus interface unit) on the P-Pro bus, together with a memory controller and MIC driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI I/O cards]
Caches and Scientific Computing
° Caches tend to perform worst on demanding applications that operate on large data sets
• transaction processing
• operating systems
• sparse matrices
° Modern scientific codes use tiling/blocking to become cache friendly (see the sketch after this list)
• easier for dense codes than for sparse
• tiling and parallelism are similar transformations
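A sketch of tiling for a dense kernel (the tile size B is illustrative and machine-dependent): the loop nest is reordered so that each tile of the operands stays resident in cache while it is reused.

    #define B 32   /* tile size: chosen so a few B*B tiles fit in cache */

    /* C += A * Bm for n-by-n row-major matrices, tiled for cache reuse;
     * the bounds checks handle n not being a multiple of B. */
    void matmul_tiled(int n, const double *A, const double *Bm, double *C)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    /* work on one tile: operands stay in cache */
                    for (int i = ii; i < ii + B && i < n; i++)
                        for (int k = kk; k < kk + B && k < n; k++) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < jj + B && j < n; j++)
                                C[i * n + j] += aik * Bm[k * n + j];
                        }
    }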
Scalable Global Address Space
Structured Shared Memory: SPMD
[Diagram: each process P0, P1, ..., Pn has a private portion of its address space (P0 private, P1 private, ...) plus a shared portion mapped to common physical addresses in the machine's physical address space, so a store by one process and a load by another to the same shared address x reach the same location]

Each process is the same program, with the same address space layout.
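A rough POSIX-threads sketch of the SPMD picture (illustrative; threads stand in for the processes): globals land in the shared portion at the same address for everyone, while each thread's stack is private.

    #include <pthread.h>
    #include <stdio.h>

    double x = 0.0;   /* shared: one copy, same address for every thread */

    void *spmd_body(void *arg)
    {
        int my_id = (int)(long)arg;   /* private: on this thread's stack */
        if (my_id == 0) x = 42.0;     /* store by P0 to the shared x ... */
        /* (a real SPMD program would synchronize before others load x) */
        printf("P%d sees x at %p\n", my_id, (void *)&x);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, spmd_body, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }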
Large Scale Shared Physical Address
° Processor performs load
° Pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data.
° Examples: BBN Butterfly, Cray T3D
[Diagram: the processor at the source node issues "Ld R <- Addr"; the pseudo-memory controller turns it into a read-request message (addr, dest, tag) over the scalable network; the pseudo-processor at the remote node performs the memory read and returns a read response (src, rrsp, tag, data)]
Cray T3D
[Diagram: T3D processing element: 150 MHz DEC Alpha (64-bit) with 8 KB instruction + 8 KB data caches, 43-bit virtual addresses, 32-bit physical addresses (5 + 27), 32- and 64-bit memory + byte operations, non-blocking stores + memory barrier, prefetch, and load-lock/store-conditional; DTB annex; prefetch queue (16 x 64); message queue (4080 x 4 x 64); special registers (swaperand, fetch&add, barrier); PE# + FC; block transfer engine (BLT); request/response paths into a 3D torus of pairs of PEs, which share the network interface and BLT; up to 2048 PEs, 64 MB of DRAM each]
The Cray T3D
° 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
• 43-bit virtual address space, 32-bit physical
• 32-bit and 64-bit load/store + byte manipulation on regs.
• no L2 cache
• non-blocking stores, load/store re-ordering, memory fence
• load-lock / store-conditional
° Direct global memory access via external segment registers
• DTB annex, 32 entries, holding remote processor number and mode
• atomic swap between a special local register and memory
• special fetch&inc register
• global-OR, global-AND barriers (see the sketch after this list)
° Prefetch Queue
° Block Transfer Engine
° User-level Message Queue
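As a rough modern analogue of the T3D's fetch&inc and barrier support, here is a sense-reversing barrier built from fetch&add in C11 atomics (generic C, not the T3D hardware interface):

    #include <stdatomic.h>

    /* Initialize with count = 0, sense = 0, nprocs = number of
     * participants.  Built from fetch&add, in the spirit of the T3D's
     * special fetch&inc register and hardware barriers. */
    typedef struct {
        atomic_int count;
        atomic_int sense;
        int nprocs;
    } barrier_t;

    void barrier_wait(barrier_t *b)
    {
        int my_sense = !atomic_load(&b->sense);
        /* fetch&add: the last arriver resets the count and flips the
         * sense, releasing everyone spinning below */
        if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
            atomic_store(&b->count, 0);
            atomic_store(&b->sense, my_sense);
        } else {
            while (atomic_load(&b->sense) != my_sense)
                ;  /* spin until released */
        }
    }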
T3D Local Read (average latency)
[Plot: T3D average local read latency (ns, 0-600) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB]

• L1 cache size: 8 KB; line size: 32 bytes
• cache access time: 6.7 ns (1 cycle)
• memory access time: 155 ns (23 cycles)
• DRAM page miss: 100 ns (15 cycles)
• no TLB!
T3D Remote Read Uncached
[Plot: T3D average uncached remote read latency (ns, 0-1000) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB, with local T3D and DEC Alpha curves for comparison]

• remote read: 610 ns (91 cycles), 3-4x a local memory read!
• a DRAM page miss adds 100 ns
• network latency: an additional 13-20 ns (2-3 cycles) per hop
Bulk Read Options
[Plot: T3D bulk read bandwidth (MB/s, 0-160) vs. transfer size (1 byte to ~100 KB) for uncached read, cached read, prefetch, BLT, and Split-C]
Where are things going?
° High-end
• collections of almost complete workstations/SMPs on a high-speed network
• with a specialized communication assist integrated with the memory system to provide global access to shared data
° Mid-end
• almost all servers are bus-based cache-coherent SMPs
• high-end servers are replacing the bus with a network
- Sun Enterprise 10000, IBM J90, HP/Convex SPP
• the volume approach is the Pentium Pro quad pack + SCI ring
- Sequent, Data General
° Low-end
• the SMP desktop is here
° Major change ahead
• SMP on a chip as a building block