
CS267 Lecture 5 1

CS 267 Applications of Parallel Computers

Lecture 5: Sources of Parallelism (continued)

Shared-Memory Multiprocessors

Kathy Yelick

http://www.cs.berkeley.edu/~dmartin/cs267/


CS267 Lecture 5 2

Outline

° Recap

° Parallelism and Locality in PDEs

• Continuous Variables Depending on Continuous Parameters

• Example: The heat equation

• Euler’s method

• Indirect methods

° Shared Memory Machines

• Historical Perspective: Centralized Shared Memory

• Bus-based Cache-coherent Multiprocessors

• Scalable Shared Memory Machines


CS267 Lecture 5 3

Recap: Source of Parallelism and Locality

° Discrete event systems

• model is discrete space with discrete interactions

• synchronous and asynchronous versions

• parallelism over graph of entities; communication for events

° Particle systems

• discrete entities moving in continuous space and time

• parallelism between particles; communication for interactions

° ODEs

• systems of lumped (discrete) variables, continuous parameters

• parallelism in solving (usually sparse) linear systems

• graph partitioning for parallelizing the sparse matrix computation

° PDEs (today)


CS267 Lecture 5 4

Continuous Variables, Continuous Parameters

Examples of such systems include

° Heat flow: Temperature(position, time)

° Diffusion: Concentration(position, time)

° Electrostatic or Gravitational Potential: Potential(position)

° Fluid flow: Velocity, Pressure, Density(position, time)

° Quantum mechanics: Wave-function(position, time)

° Elasticity: Stress, Strain(position, time)


CS267 Lecture 5 5

Example: Deriving the Heat Equation

Consider a simple problem (figure: a bar from 0 to 1, with interior points x and x+h marked)

° A bar of uniform material, insulated except at ends

° Let u(x,t) be the temperature at position x at time t

° Heat travels from x to x+h at a rate proportional to the temperature difference, so

d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h  -  (u(x,t) - u(x+h,t))/h ] / h

° As h → 0, we get the heat equation:

d u(x,t)/dt = C * d²u(x,t)/dx²


CS267 Lecture 5 6

Explicit Solution of the Heat Equation

° For simplicity, assume C=1

° Discretize both time and position

° Use finite differences, with x[i,t] as the heat at

• time t and position i

• initial conditions on x[i,0]

• boundary conditions on x[0,t] and x[n-1,t] (the two ends of the bar)

° At each timestep

° This corresponds to

• matrix vector multiply

• nearest neighbors on grid

[Figure: space-time grid with rows t=0 through t=5 and columns x[0] through x[5]; each point at time t+1 is computed from its three neighbors at time t]

x[i,t+1] = z*x[i-1,t] + (1 - 2*z)*x[i,t] + z*x[i+1,t]

where z = k/h²
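A minimal serial sketch of this explicit update in C (the sizes, boundary values, and names below are illustrative, not from the slides):

#include <stdio.h>

#define N     100      /* number of grid points */
#define STEPS 1000     /* number of timesteps  */

int main(void) {
    double x[N], xnew[N];
    double z = 0.25;   /* z = k/h^2; must be small enough for stability */

    for (int i = 0; i < N; i++) x[i] = 0.0;    /* initial condition            */
    x[0] = 1.0;                                /* boundary conditions held fixed */
    x[N-1] = 0.0;

    for (int t = 0; t < STEPS; t++) {
        for (int i = 1; i < N - 1; i++)        /* nearest-neighbor update      */
            xnew[i] = z * x[i-1] + (1 - 2*z) * x[i] + z * x[i+1];
        xnew[0] = x[0];
        xnew[N-1] = x[N-1];
        for (int i = 0; i < N; i++) x[i] = xnew[i];
    }

    printf("temperature at the midpoint after %d steps: %f\n", STEPS, x[N/2]);
    return 0;
}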


CS267 Lecture 5 7

Parallelism in Explicit Method for PDEs

° Partitioning the space (x) into p large chunks (sketched at the end of this slide)

• good load balance (assuming large number of points relative to p)

• minimized communication (only p chunks)

° Generalizes to

• multiple dimensions

• arbitrary graphs (= sparse matrices)

° Problem with explicit approach

• numerical instability

• need to make the timesteps very small
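A sketch of the partitioning idea using OpenMP (an assumed API; the slide does not name one): a static schedule gives each of the p threads one contiguous chunk of points, and only the values at chunk boundaries are read by a neighboring thread.

#include <omp.h>

/* One explicit timestep over n points; x and xnew are shared arrays. */
void heat_step(const double *x, double *xnew, int n, double z) {
    #pragma omp parallel for schedule(static)   /* static = p contiguous chunks */
    for (int i = 1; i < n - 1; i++)
        xnew[i] = z * x[i-1] + (1 - 2*z) * x[i] + z * x[i+1];
    xnew[0] = x[0];                             /* boundaries unchanged */
    xnew[n-1] = x[n-1];
}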


CS267 Lecture 5 8

Implicit Solution

° As with many (stiff) ODEs, need an implicit method

° This turns into solving the following equation:

(I + (z/2)*T) * x_{t+1} = (I - (z/2)*T) * x_t

° Where I is the identity matrix and T is the tridiagonal matrix:

        [  2  -1                ]
        [ -1   2  -1            ]
T  =    [     -1   2  -1        ]
        [         -1   2  -1    ]
        [             -1   2    ]

° I.e., essentially solving Poisson’s equation
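Each implicit timestep then requires solving a tridiagonal system with coefficient matrix I + (z/2)*T. A minimal sketch in C using the standard Thomas algorithm (function and variable names are illustrative, not from the slides):

#include <stdlib.h>

/* Thomas algorithm: solve a tridiagonal system in O(n).
   a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal
   (c[n-1] unused), d = right-hand side.  b and d are overwritten. */
static void thomas_solve(int n, double *a, double *b, double *c,
                         double *d, double *x) {
    for (int i = 1; i < n; i++) {            /* forward elimination */
        double m = a[i] / b[i-1];
        b[i] -= m * c[i-1];
        d[i] -= m * d[i-1];
    }
    x[n-1] = d[n-1] / b[n-1];                /* back substitution */
    for (int i = n - 2; i >= 0; i--)
        x[i] = (d[i] - c[i] * x[i+1]) / b[i];
}

/* One step of (I + (z/2)T) x_new = (I - (z/2)T) x_old for the 1D T above. */
void implicit_step(int n, double z, const double *x_old, double *x_new) {
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c), *d = malloc(n * sizeof *d);
    for (int i = 0; i < n; i++) {
        a[i] = c[i] = -z / 2;                /* off-diagonals of I + (z/2)T   */
        b[i] = 1.0 + z;                      /* diagonal: 1 + (z/2)*2         */
        /* right-hand side (I - (z/2)T)*x_old is a nearest-neighbor product  */
        double left  = (i > 0)     ? x_old[i-1] : 0.0;
        double right = (i < n - 1) ? x_old[i+1] : 0.0;
        d[i] = (1.0 - z) * x_old[i] + (z / 2) * (left + right);
    }
    thomas_solve(n, a, b, c, d, x_new);
    free(a); free(b); free(c); free(d);
}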


CS267 Lecture 5 9

2D Implicit Method

° Similar to the 1D case, but the matrix T is now

° Multiplying by this matrix (as in the explicit case) is simply a nearest neighbor computation (sketched after the matrix)

° To solve this system, there are several techniques

For an n-by-n grid of unknowns (ordered row by row), T has 4 on the diagonal and -1 in the position of each nearest grid neighbor, giving a block tridiagonal structure:

        [  B  -I          ]              [  4  -1          ]
T  =    [ -I   B  -I      ]    with B =  [ -1   4  -1      ]
        [     -I   B  -I  ]              [     -1   4  -1  ]
        [         -I   B  ]              [         -1   4  ]
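A minimal sketch of that nearest-neighbor computation, y = T*x, on an n-by-n grid (illustrative names; points outside the grid are treated as zero):

/* y = T*x for the 2D grid matrix above, x stored row-major: x[i*n + j]. */
void apply_T_2d(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double v = 4.0 * x[i*n + j];
            if (i > 0)     v -= x[(i-1)*n + j];   /* north neighbor */
            if (i < n - 1) v -= x[(i+1)*n + j];   /* south neighbor */
            if (j > 0)     v -= x[i*n + j - 1];   /* west neighbor  */
            if (j < n - 1) v -= x[i*n + j + 1];   /* east neighbor  */
            y[i*n + j] = v;
        }
    }
}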


CS267 Lecture 5 10

Algorithms for Solving the Poisson Equation

Algorithm        Serial      PRAM             Mem         #Procs

° Dense LU       N^3         N                N^2         N^2
° Band LU        N^2         N                N^(3/2)     N
° Jacobi         N^2         N                N           N
° Conj. Grad.    N^(3/2)     N^(1/2) log N    N           N
° RB SOR         N^(3/2)     N^(1/2)          N           N
° Sparse LU      N^(3/2)     N^(1/2)          N log N     N
° FFT            N log N     log N            N           N
° Multigrid      N           log^2 N          N           N
° Lower bound    N           log N            N

PRAM is an idealized parallel model with zero cost communication


CS267 Lecture 5 11

Administrative

° HW2 extended to Monday, Feb. 16th

° Break

° On to shared memory machines


CS267 Lecture 5 12

Programming Recap and History of SMPs


CS267 Lecture 5 13

Relationship of Architecture and Programming Model

[Figure: layers, from top to bottom — Parallel Application; Programming Model; User/System Interface (compiler, library); operating system; HW/SW interface (communication primitives); Hardware]


CS267 Lecture 5 14

Shared Address Space Programming Model

° Collection of processes

° Naming:

• each process can name data in a private address space, and

• all can name all data in a common “shared” address space

° Operations

• uniprocessor operations, plus synchronization operations on shared addresses (see the sketch at the end of this slide)

- lock, unlock, test&set, fetch&add, ...

° Operations on the shared address space appear to be performed in program order

• each process’s own operations appear to be in program order

• all see a consistent interleaving of each other’s operations

• like timesharing on a uniprocessor

– explicit synchronization operations used when program ordering is not sufficient.
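A small sketch of the synchronization operations listed above, written with POSIX threads plus a GCC-style atomic (an assumed choice of API; the slide names the operations but not a library):

#include <pthread.h>

double shared_sum = 0.0;                     /* lives in the shared address space */
long   shared_count = 0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void add_contribution(double local) {
    pthread_mutex_lock(&sum_lock);           /* "lock"   */
    shared_sum += local;                     /* protected update of shared data */
    pthread_mutex_unlock(&sum_lock);         /* "unlock" */

    __sync_fetch_and_add(&shared_count, 1);  /* a fetch&add-style atomic */
}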


CS267 Lecture 5 15

Example: shared flag indicating full/empty

° Intuitively clear that the intention was to convey meaning by the order of the stores

° No data dependences

° Sequential compiler / architecture would be free to reorder them!

P1                       P2

A = 1;                   a: while (flag is 0) do nothing;
b: flag = 1;             print A;
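A hedged sketch of how the intended ordering can be stated explicitly with C11 atomics (release/acquire); this is one modern way to express the synchronization, not the mechanism assumed on the slide:

#include <stdatomic.h>
#include <stdio.h>

int A = 0;
atomic_int flag = 0;

/* P1 */
void producer(void) {
    A = 1;                                                   /* write the data   */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* b: flag = 1      */
}

/* P2 */
void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                    /* a: spin on flag  */
    printf("%d\n", A);                                       /* must print 1     */
}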


CS267 Lecture 5 16

Historical Perspective

° Diverse spectrum of parallel machines designed to implement a particular programming model directly

° Technological convergence on collections of microprocessors on a scalable interconnection network

° Map any programming model to simple hardware

• with some specialization

[Figure: convergence — Shared Address Space machines (centralized shared memory), Message Passing machines (hypercubes and grids), and Data Parallel machines (SIMD) all converge to nodes that are essentially complete computers (P, $, M, plus a communication assist, CA) on a scalable interconnection network]


CS267 Lecture 5 17

60s Mainframe Multiprocessors

° Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices

° How do you enhance processing capacity?

• add processors

° Already need an interconnect between slow memory banks and processor + I/O channels

• cross-bar or multistage interconnection network

[Figure: 60s multiprocessor organizations — processors (with caches), memory modules, and I/O channels connected by (b) a cross-bar or (c) a multistage interconnection network]


CS267 Lecture 5 18

Caches: A Solution and New Problems


CS267 Lecture 5 19

70s breakthrough

° Caches!

[Figure: a fast processor with a cache holding a copy of location A (value 17), connected over the interconnect to slow memory and to other processors or I/O devices]


CS267 Lecture 5 20

Technology Perspective

DRAM

Year Size Cycle Time

1980 64 Kb 250 ns

1983 256 Kb 220 ns

1986 1 Mb 190 ns

1989 4 Mb 165 ns

1992 16 Mb 145 ns

1995 64 Mb 120 ns

1980 → 1995: capacity improved ~1000:1, cycle time only ~2:1

          Capacity         Speed

Logic:    2x in 3 years    2x in 3 years

DRAM:     4x in 3 years    1.4x in 10 years

Disk:     2x in 3 years    1.4x in 10 years

[Figure: SPECint and SPECfp benchmark performance vs. year, 1986-1994]


CS267 Lecture 5 21

Bus Bottleneck and Caches

[Figure: several processors, each with a cache, sharing a single bus with memory modules and I/O]

Assume 100 MB/s bus

50 MIPS processor w/o cache

=> 200 MB/s inst BW per processor

=> 60 MB/s data BW at 30% load-store

Suppose 98% inst hit rate and 95% data hit rate (16 byte block)

=> 4 MB/s inst BW per processor

=> 12 MB/s data BW per processor

=> 16 MB/s combined BW

8 processors will saturate bus

Cache provides bandwidth filter – as well as reducing average access time

(260 MB/s demanded per processor without caches vs. 16 MB/s with caches)


CS267 Lecture 5 22

Cache Coherence: The Semantic Problem

° Scenario:

• p1 and p2 both have cached copies of x (as 0)

• p1 writes x=1 and then the flag f=1, pulling f into its cache

- both of these writes may write through to memory

• p2 reads f (bringing it into cache) to see if it is 1, which it is

• p2 therefore reads x, but gets the stale cached copy

[Figure: p1's cache holds x=1, f=1; p2's cache holds the stale x=0 along with f=1; main memory holds x=1, f=1]


CS267 Lecture 5 23

Snoopy Cache Coherence

° Bus is a broadcast medium

• all caches can watch others’ memory ops

° All processors write through:

• update local cache and global bus write:

- updates main memory

- invalidates/updates all other caches with that item

• Examples: Early Sequent and Encore machines.

° Caches stay coherent

° Consistent view of memory!

• one shared write at a time

° Performance is much worse than with uniprocessor write-back caches

• since ~15-30% of references are writes, this scheme consumes tremendous bus bandwidth; few processors can be supported


CS267 Lecture 5 24

Write-Back/Ownership Schemes

° When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.

• reads by others cause it to return to “shared” state

° Most bus-based multiprocessors today use such schemes.

° Many variants of ownership-based protocols
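A rough sketch of the per-block state such a protocol tracks, as a MESI-style simplification (illustrative, not the protocol of any particular machine):

/* Per-cache-block state in a write-back, ownership-based protocol. */
enum block_state {
    INVALID,   /* no valid copy in this cache                                    */
    SHARED,    /* clean copy; other caches may also hold it; reads hit locally   */
    MODIFIED   /* this cache owns the block: writes stay local (no bus traffic)  */
               /* until another processor's read returns it to the shared state  */
};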


CS267 Lecture 5 25

Programming SMPs

° Consistent view of shared memory

° All addresses equidistant

• don’t worry about data partitioning

° Automatic replication of shared data close to processor

° If program concentrates on a block of the data set that no one else updates => very fast

° Communication occurs only on cache misses

• cache misses are slow

° Processor cannot distinguish communication misses from regular cache misses

° Cache blocks may introduce artifacts (sketched below)

• two distinct variables in the same cache block

• false sharing
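A small sketch of the false-sharing artifact in C with OpenMP (sizes and names are illustrative): each thread increments only its own counter, yet the packed counters share one cache block and ping-pong between caches; padding gives each counter its own block.

#include <omp.h>

#define NTHREADS 2
#define ITERS    100000000L
#define LINE     64                          /* assumed cache-block size in bytes */

long counters_packed[NTHREADS];              /* both counters in one cache block  */
struct padded { long v; char pad[LINE - sizeof(long)]; };
struct padded counters_padded[NTHREADS];     /* one cache block per counter       */

void count(int use_padding) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) {
            if (use_padding) counters_padded[id].v++;  /* no block is shared       */
            else             counters_packed[id]++;    /* false sharing: the block */
                                                        /* bounces between caches   */
        }
    }
}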


CS267 Lecture 5 26

Scalable Cache-Coherence


CS267 Lecture 5 27

90s Scalable, Cache-Coherent Multiprocessors

[Figure: processors with caches on a scalable interconnection network; memory is distributed, and each memory block has a directory entry holding presence bits (one per node, 1..n) and a dirty bit]
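A sketch of what each directory entry holds, matching the figure (field names and the node limit are illustrative):

#include <stdint.h>

#define MAX_NODES 64                /* illustrative limit: one presence bit per node */

/* One directory entry per memory block. */
struct dir_entry {
    uint64_t presence;              /* bit i set => node i's cache holds a copy       */
    unsigned dirty : 1;             /* set => exactly one cache holds the block dirty */
};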


CS267 Lecture 5 28

SGI Origin 2000

[Figure: two Origin nodes; each node has two processors with 1-4 MB L2 caches, a Hub chip connecting them to 1-4 GB of main memory with its directory and to the Xbow I/O crossbar; the Hubs attach to the interconnection network]


CS267 Lecture 5 29

90’s Pushing the bus to the limit: Sun Enterprise

Gigaplane(TM) bus (256-bit data, 41-bit address, 83 MHz)

[Figure: CPU/memory cards (each with two processors and their caches, a memory controller, and a bus interface/switch) and I/O cards (bus interface) plugged into the Gigaplane bus]


CS267 Lecture 5 30

90’s Pushing the SMP to the masses

P-Pro bus (64-bit data, 36-bit address, 66 MHz)

[Figure: a quad Pentium Pro — four P-Pro modules (each a CPU with a 256 KB L2 cache, interrupt controller, and bus interface unit) on the P-Pro bus, plus a memory controller (MIC) driving 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI I/O cards]


CS267 Lecture 5 31

Caches and Scientific Computing

° Caches tend to perform worst on demanding applications that operate on large data sets

• transaction processing

• operating systems

• sparse matrices

° Modern scientific codes use tiling/blocking to become cache friendly

• easier for dense codes than for sparse

• tiling and parallelism are similar transformations


CS267 Lecture 5 32

Scalable Global Address Space


CS267 Lecture 5 33

Structured Shared Memory: SPMD

[Figure: each process P0..Pn has a private portion of its address space plus a shared portion; the shared portions map to common physical addresses, so a load or store of a shared variable x by any process reaches the same location in the machine's physical address space]

Each process is the same program, with the same address space layout.
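A minimal SPMD-flavored sketch (OpenMP used here as an assumption): every thread runs the same code; x is in the shared portion of the address space, while each thread's locals are private.

#include <omp.h>
#include <stdio.h>

double x = 0.0;                        /* shared: every thread names the same x   */

int main(void) {
    #pragma omp parallel               /* every thread runs the same program text */
    {
        int me = omp_get_thread_num(); /* private to this thread                  */
        #pragma omp critical
        x += me;                       /* store into the shared portion           */
    }
    printf("x = %f\n", x);             /* load from the shared portion            */
    return 0;
}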


CS267 Lecture 5 34

Large Scale Shared Physical Address

° Processor performs load

° Pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data.

° Examples: BBN butterfly, Cray T3D

[Figure: a load (Ld R <- Addr) that misses locally is turned by the pseudo-memory controller into a read request message (addr, dest, tag, src) on the scalable network; the remote node's pseudo-processor performs the memory access and returns a read response (tag, data) to the requester]


CS267 Lecture 5 35

Cray T3D

[Figure: T3D node — 150 MHz DEC Alpha (64-bit), 8 KB instruction + 8 KB data caches, 43-bit virtual addresses, 32- and 64-bit memory operations plus byte operations, non-blocking stores + memory barrier, prefetch, load-lock/store-conditional; DTB annex translating to 32-bit physical addresses (5 + 27 bits); prefetch queue (16 x 64), message queue (4080 x 4 x 64), special registers (swaperand, fetch&add, barrier), PE# + FC, block transfer engine (BLT), local DRAM, and request/response ports into a 3D torus; PEs come in pairs that share the network interface and BLT, up to 2048 PEs with 64 MB each]


CS267 Lecture 5 36

The Cray T3D

° 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network

• 43-bit virtual address space, 32-bit physical

• 32-bit and 64-bit load/store + byte manipulation on regs.

• no L2 cache

• non-blocking stores, load/store re-ordering, memory fence

• load-lock / store-conditional

° Direct global memory access via external segment regs

• DTB annex, 32 entries, remote processor number and mode

• atomic swap between special local reg and memory

• special fetch&inc register

• global-OR, global-AND barriers

° Prefetch Queue

° Block Transfer Engine

° User-level Message Queue


CS267 Lecture 5 37

T3D Local Read (average latency)

[Figure: average local read latency (ns, 0-600) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB]

L1 cache size: 8 KB, line size: 32 bytes

Cache access time: 6.7 ns (1 cycle)

Memory access time: 155 ns (23 cycles)

DRAM page miss: +100 ns (15 cycles)

No TLB!


CS267 Lecture 5 38

T3D Remote Read Uncached

[Figure: average uncached remote read latency (ns, 0-1000) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB, compared with a local T3D read and a DEC Alpha workstation]

Remote read: 610 ns (91 cycles), plus 100 ns on a DRAM page miss

Network latency: an additional 13-20 ns (2-3 cycles) per hop

3-4x the cost of a local memory read!


CS267 Lecture 5 39

Bulk Read Options

[Figure: achieved bandwidth (MB/s, 0-160) vs. transfer size (1 to 100000 bytes) for uncached reads, cached reads, prefetch, the BLT, and Split-C bulk transfers]


CS267 Lecture 5 40

Where are things going?

° High-end

• collections of almost complete workstations/SMPs on a high-speed network

• with specialized communication assist integrated with memory system to provide global access to shared data

° Mid-end

• almost all servers are bus-based CC SMPs

• high-end servers are replacing the bus with a network

- Sun Enterprise 10000, IBM J90, HP/Convex SPP

• volume approach is the Pentium Pro quad pack + SCI ring

- Sequent, Data General

° Low-end

• the SMP desktop is here

° Major change ahead

• SMP on a chip as a building block