Goal: Consider some of the parallel opportunities on the horizon
Define a cluster computer as a connected set of independent computers … it usually implies "connected by a dedicated network"
Grid: the Internet is the only network; "resources not in a single administrative unit"
▪ Greater latency -- locality is a bigger deal
▪ Interference from other traffic
▪ Nodes are more autonomous -- configuration, security, heterogeneity, etc. are all issues
Bottom line: the Grid has all of the problems of || architectures, but worse and more of them
SETI- and BOINC-type problems work well because (a) the problems are independent, and (b) they have limited need for communication
Other computations are proportionally tougher -- it is best to consider the Grid as a distributed resource rather than as a || computer
Several forms of attached processors are popular: FPGA, Cell, GPGPU
These machines do not match the CTA model -- why? -- and require different programming concepts. There is a machine model (type architecture) for them by Ebeling, van Essen, and Ylvisaker.
These engines benefit from Si advances more than standard processors
They exploit …
▪ Kernel processing … only speeding the inner loop; leave all other processing/orchestrating to the CPU
▪ Data streaming from memory to get high throughput … moving lots of data fast permits cheap computations on each item
▪ Specialized circuit designs at the expense of generality, making VLSI pay big
Field Programmable Gate Array (Xilinx, Altera): "programmable hardware"
▪ Ideal for ints, bit-twiddling, etc.
▪ Very dramatic "regime change" between CPU and attached engine
▪ FPGAs are supported "well" for circuit designers, but the tools are low level compared to an IDE
▪ Drop into an "Opteron slot" for fast system build
▪ Mostly for special purpose
A Type Architecture for Hybrid Micro-Parallel Computers (search UW)
When programming a family of machines that departs significantly from the “usual suspects,” work out a type architecture
The TA will be the simplest logical machine that exhibits the important costs
Notice how the CTA does this, and how the EVY machine does this
Though IBM canceled Cell development in November 2009, the ideas in it are crucial to our parallel computation discussion
(And it’s still a fire breather)
[Figure: Cell processor -- basic floor plan and photo]
Blachford describes Cell programming:
"Cell however, has gone against the grain and actually removed a level of abstraction. The programming model for the Cell will be concrete, when you program an SPE you will be programming what is in the SPE itself, not some abstraction. You will be 'hitting the hardware' so to speak. The SPEs programming model will include 256K of local store and 128 registers, the SPE itself will include 128 registers and 256K of local store, no less, no more." [Feb 2005]
Complexity seems dramatic
[Figure: CPU vs. GPU floor plans -- the CPU spends its area on control logic and cache serving a few ALUs, while the GPU devotes most of its area to a large array of ALUs with minimal control and cache; both sit atop DRAM]
The GPU is viewed as a compute device that:
▪ Is a coprocessor to the CPU or host
▪ Has its own DRAM (device memory)
▪ Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
Differences between GPU and CPU threads:
▪ GPU threads are extremely lightweight -- very little creation overhead
▪ The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
A kernel is executed as a grid of thread blocks
▪ All threads share data memory space
▪ A thread block is a batch of threads that can cooperate with each other by:
  ▪ Synchronizing their execution, for hazard-free shared memory accesses
  ▪ Efficiently sharing data through a low-latency shared memory
▪ Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 and then Kernel 2 on the device; each kernel runs as a grid of thread blocks (e.g., Grid 1 with Blocks (0,0)…(2,1)), and each block, such as Block (1,1), contains an array of threads (0,0)…(4,2). Courtesy: NVIDIA]
Threads and blocks have IDs, so each thread can decide what data to work on
▪ Block ID: 1D or 2D
▪ Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, …
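As a concrete sketch (assuming CUDA C; the kernel name, array names, and tile sizes are illustrative, not from the slides), a kernel uses its block and thread IDs to choose the pixel it works on, while the host merely orchestrates:

// Each thread brightens one pixel; blockIdx/threadIdx select which one.
__global__ void brighten(float *img, int width, int height, float delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)                     // guard the ragged edges
        img[y * width + x] += delta;
}

// Host side: allocate, copy, launch thousands of threads, copy back.
void brighten_on_gpu(float *host_img, int width, int height) {
    float *dev_img;
    size_t bytes = (size_t)width * height * sizeof(float);
    cudaMalloc(&dev_img, bytes);
    cudaMemcpy(dev_img, host_img, bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    brighten<<<grid, block>>>(dev_img, width, height, 10.0f);
    cudaMemcpy(host_img, dev_img, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_img);
}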
Each thread can:
▪ R/W per-thread registers
▪ R/W per-thread local memory
▪ R/W per-block shared memory
▪ R/W per-grid global memory
▪ Read only per-grid constant memory
▪ Read only per-grid texture memory
[Figure: CUDA device memory model -- the (device) grid holds global, constant, and texture memory; each block has its own shared memory; each thread has its own registers and local memory]
The host can R/W global, constant, and texture memories
Global memory
▪ Main means of communicating R/W data between host and device
▪ Contents visible to all threads
Texture and constant memories
▪ Constants initialized by the host
▪ Contents visible to all threads
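A minimal sketch of how these memory spaces appear in CUDA source (kernel and variable names are illustrative; assumes the kernel is launched with 256-thread blocks):

__constant__ float scale;                     // per-grid constant memory, set by the host

__global__ void blockSum(const float *in, float *out, int n) {   // in/out live in global memory
    __shared__ float tile[256];               // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;                      // i and t live in per-thread registers
    tile[t] = (i < n) ? in[i] : 0.0f;         // stage global data into shared memory
    __syncthreads();                          // block-wide barrier: hazard-free sharing
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) tile[t] += tile[t + stride];   // cooperate through shared memory
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = scale * tile[0];     // one result per block, to global memory
}

The host would set scale with cudaMemcpyToSymbol and read out back from global memory with cudaMemcpy.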
Attached processors provide enormous power for low price, but they …
▪ Use a different programming paradigm than our CTA-based view
▪ Have no pretense of being general purpose
▪ Should be programmed as though configuring hardware or building a hybrid machine
▪ Strengths are fast data streaming and leveraging VLSI
▪ Liabilities are programming challenges and rigidity
A goal of || abstractions is to specify efficient computation without specifying unnecessary order
Problem Space Promotion is One Way
Consider a counting sort (assuming distinct keys; T is a scratch array of the same size as A):

for (j=0; j<n; j++) {
  B[j] = 0;              // Init counts
}
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    if (A[i]<A[j])       // Count # smaller than A[j],
      B[j]++;            //   i.e., A[j]'s sorted position
  }
}
for (j=0; j<n; j++) {
  T[B[j]] = A[j];        // Reorder: scatter A[j] to its position
}
for (j=0; j<n; j++) {
  A[j] = T[j];           // Restore into A
}
Specify the operations w/o ordering
The solution "promotes" the operand to be 2D, allowing all pairs to be tested at once
▪ This technique works a lot … 3D matrix multiplication is an instance
▪ Promotion is logical … no actual copy is needed
C ← (Sᵀ < S)       compare all pairs (1 if true, else 0)
P ← sum_cols(C)    column sums give sorted index positions
S ← S[P]           permute elements of S into order
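A hedged CUDA sketch of the promoted formulation (kernel and array names are illustrative; rank[] must be zeroed first, and distinct keys are assumed): the 2D grid of threads is the logical promotion -- every pair (i,j) is compared at once, with no copy of the data and no ordering specified.

// Thread (i,j) does one all-pairs comparison; rank[j] accumulates the column sum,
// i.e., the number of elements smaller than A[j].
__global__ void countPairs(const int *A, int *rank, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n && A[i] < A[j])
        atomicAdd(&rank[j], 1);          // unordered accumulation
}

// One thread per element scatters it to its sorted position.
__global__ void permute(const int *A, const int *rank, int *sorted, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) sorted[rank[j]] = A[j];
}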
Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder. Problem space promotion and its evaluation as a technique for efficient parallel computation. In Proceedings of the ACM International Conference on Supercomputing, 1999.
You all know parallel programming is tough
How to make it easier? Raise the abstractions! Transactional Memory: the return of an old idea
Databases concurrently manage external data consistently using multiple computers; well studied
Apply idea to concurrent management of the internal memory image
Transaction: Atomically change memory to new state or do nothing at all
Say the goal, not how to achieve it
Idea: David Lomet in 1977
An easier-to-use and harder-to-implement primitive
Lock version: acquire/release. Atomic version: (behaves as if) no interleaved computation; no unfair starvation.
void deposit(int x) {
  synchronized(this) {
    int tmp = balance;
    tmp += x;
    balance = tmp;
  }
}

void deposit(int x) {
  atomic {
    int tmp = balance;
    tmp += x;
    balance = tmp;
  }
}
The code executing atomically is everything (dynamically) between braces, including foo().
atomic {
if (x != null) x.foo();
y = true;
}
Three choices: commit; abort; or never terminate
Optimistic: little overhead if there is no conflict
Avoids races and deadlocks due to lock acquisition
In DBs, transactions have the ACID properties
▪ A = atomicity … the sequence of operations is never interrupted or left incomplete; commit or abort
▪ C = consistency … changes leave memory in a consistent state relative to the application; for example, new_balance == old_balance + deposit
▪ I = isolation … a transaction works correctly with any combination of other transactions
▪ D = durability … the result persists; not appropriate for the multithreaded memory case
DBs use disks, meaning the SW support for DB transactions is not time-critical; referencing memory is too brief (and frequent) to allow for heavy-weight protection
TM need not be durable (last) since the data doesn't outlast the execution; a simplification
TM must retrofit into a rich world of legacy code … must coexist with all other mechanisms; pervasive changes not feasible
It’s not difficult to mess up using atomic
Thread 1                    Thread 2
atomic {                    atomic {
  while (!flagA);             flagA = true;
  flagB = true;               while (!flagB);
}                           }

For the threads to finish, the flags would have to be true at the start of the blocks; this code is not serializable, i.e., there is no correct serial execution.
Transactions are no panacea
▪ Neither hardware nor software implementations have proved themselves
▪ Very nice monograph: Larus & Rajwar, http://www.morganclaypool.com/doi/abs/10.2200/S00070ED1V01Y200611CAC002
▪ A fundamental problem TM will not solve: how to scale shared memory computations to architectures with much larger λ, which are inevitable
Rather than using the Test&Set to guard shared data, use Fetch&Add
▪ Fetch&Add is an atomic read-modify-write operation on memory -- requires special hardware, to be discussed
▪ Use Fetch&Add as a semaphore and as a scheduler
Operation: Fetch&Add(V,e)
▪ V is a memory location
▪ e is an integer expression
▪ The contents of V are returned
▪ The new value of V is V+e
▪ The operation is atomic
Example: with V holding 0, Fetch&Add(V,1) returns 0 and leaves V holding 1
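Fetch-and-add survives in current APIs; for instance, CUDA's atomicAdd and C11's atomic_fetch_add have exactly this semantics. A minimal sketch (the variable and kernel names are illustrative):

__device__ int V = 0;               // shared location in (device) global memory

__global__ void bump() {
    int old = atomicAdd(&V, 1);     // Fetch&Add(V,1): returns V's old contents
                                    // and leaves V holding old + 1, atomically
    // every arriving request gets a distinct 'old', even when many collide on V
}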
When multiple Fetch&Adds are executed simultaneously, they are serializable
Assume Fetch&Add(V, e1), Fetch&Add(V, e2) execute simultaneously
▪ Assuming an initial value of e0
▪ The final value is e0+e1+e2
▪ The 1st process receives either e0 or e0+e2, implying it was first (e0) or second (e0+e2)
▪ The 2nd process receives either e0 or e0+e1, implying it was second (e0+e1) or first (e0)
Suppose both execute Fetch&Add(I,1); then one gets I back, the other gets I+1, and the final value is I+2
Though earlier solutions attempt to reduce sharing to reduce the amount of invalidation and acknowledgment, Fetch&Add does better with greater sharing
Sharing is used to schedule or allocate work, which is then independent activity (see the sketch below)
▪ Sharing is concentrated in a few variables
▪ Fine grain size is possible
Since load/store, Test&Set, etc. are implementable, it is a "sufficient" primitive
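A minimal CUDA sketch of the "sharing to schedule" idea (kernel and variable names are illustrative, not from the slides): all threads share one counter, but each Fetch&Add hands out a distinct piece of work, after which each thread proceeds independently.

__device__ unsigned int next = 0;                // the one shared variable

__global__ void worker(float *items, unsigned int numItems) {
    while (true) {
        unsigned int my = atomicAdd(&next, 1u);  // Fetch&Add allocates the next item index
        if (my >= numItems) break;               // no work left
        items[my] = items[my] * items[my];       // stand-in for independent per-item work
    }
}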
Fetch&Add assumes a flat shared memory as implemented by a “dance hall architecture”
[Figure: "dance hall" architecture -- processors P0…P7, each with a processor-network interface (PNI), connect through an interconnection network to memory modules M0…M7, each with a memory-network interface (MNI)]
The interconnection network is an Ω-network
A connection between processor 2 and memory 6 … follow the destination's bits, lsb to msb
[Figure: an 8x8 Ω-network of 2x2 routers; processor IDs 000…111 on the left, memory ports 000…111 on the right; the path from processor 2 to memory 6 is selected by the destination's bits at successive stages]
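A small sketch of the destination-tag routing rule in plain C++ (illustrative; it follows the slide's convention of consuming the destination's bits lsb to msb, one bit per 2x2 stage):

#include <cstdio>

// Print the output port (0 or 1) taken at each stage when routing to 'dest'
// in an Omega network with 'stages' = log2(number of ports) stages.
void route(unsigned dest, unsigned stages) {
    for (unsigned s = 0; s < stages; ++s) {
        unsigned port = (dest >> s) & 1u;   // lsb first, one destination bit per stage
        printf("stage %u: take output %u\n", s, port);
    }
}

int main() {
    route(6, 3);    // e.g., the path to memory port 6 (binary 110) in the 8x8 network
    return 0;
}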
The Ω-network requires O(P log P) routers
The given network uses 2x2 routers, but 2^b x 2^b routers also work
Wiring is consistent at each stage
[Figure: the same 8x8 Ω-network, showing that the shuffle wiring pattern is identical at every stage]
Long wires are necessary
The network is pipelined
▪ There is a unique path between any processor and memory port pair
▪ Conflicts are possible because there exist permutations in which packets collide
What happens when two packets collide at a router?
▪ One packet is delayed, leaving its "file"
▪ Pipelining is affected, and more traffic keeps arriving behind it
The separate packets must be serialized
Simultaneous requests collide in the network
[Figure: two Fetch&Add(V) requests entering the Ω-network from different processors converge on the memory module holding V]
Fetch&Add increases the potential for collisions -- the shared location becomes a hot spot
Idea: combine requests for the same destination; in the limit, all nodes could reference the same location
[Figure: the Ω-network again, with requests for the same location V meeting, and being combined, at the switches along the way to V's memory module]
At a switch, combine loads and stores to a common location as follows:
▪ Load/Load -- forward one of the loads towards the memory, and when the value is returned, satisfy both
▪ Load/Store -- forward the store, and when the ACK arrives back at the switch, return the value to satisfy the load
▪ Store/Store -- forward one of the stores, and when the ACK arrives back at the switch, return it for both
Processors are restricted to having only one outstanding request at a time to a given location
Include an adder with the Memory Network Interface chips
For Fetch&Add(V,e):
▪ Fetch the value of V, say e0
▪ Return e0 to the requesting processor
▪ Add e0+e
▪ Store e0+e back into V
It is probably necessary to do these concurrently
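A tiny C++ sketch of the memory-side operation (illustrative, not any particular machine's MNI):

// The adder at the memory module performs Fetch&Add in place.
int memory_fetch_add(int *V, int e) {
    int e0 = *V;       // fetch the current value
    *V = e0 + e;       // add and store back -- done as one indivisible memory operation
    return e0;         // returned to the requesting processor
}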
Suppose Fetch&Add(V,e) and Fetch&Add(V,f) arrive at a switch together …
▪ Form the sum f+e
▪ Send Fetch&Add(V,f+e) on to the memory
▪ Store e locally
▪ When g0 is returned by the memory:
  ▪ Return g0 as the response to Fetch&Add(V,e)
  ▪ Return g0+e as the response to Fetch&Add(V,f)
[Figure: a combining switch receives Fetch&Add(V,e) and Fetch&Add(V,f), forwards Fetch&Add(V,e+f) to memory, and when g0 comes back returns g0 to the first requester and g0+e to the second]
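A hedged C++ sketch of this combining rule (names are illustrative; real combining switches such as Dickey's were hardware, not code):

struct FetchAddReq { int addend; int reply_to; };   // a request waiting at the switch

struct CombiningSwitch {
    int saved_e = 0;        // first request's addend, kept until memory replies
    int first_reply = -1;   // answer this requester with g0
    int second_reply = -1;  // answer this requester with g0 + e

    // Two Fetch&Adds to the same V meet here: forward a single Fetch&Add(V, e+f).
    int combine(const FetchAddReq &a, const FetchAddReq &b) {
        saved_e = a.addend;
        first_reply = a.reply_to;
        second_reply = b.reply_to;
        return a.addend + b.addend;          // the forwarded addend e + f
    }

    // Memory returns g0, the value V held before the combined add.
    void on_memory_reply(int g0) {
        send(first_reply, g0);               // response to Fetch&Add(V, e)
        send(second_reply, g0 + saved_e);    // response to Fetch&Add(V, f)
    }

    void send(int dest, int value) { /* deliver the reply toward processor 'dest' */ }
};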
Combining can apply to all memory traffic to a location V
Consider the following cases:
▪ Fetch&Add/Fetch&Add -- as just described
▪ Fetch&Add/Load -- treat the Load as Fetch&Add(V,0)
▪ Fetch&Add/Store -- if Fetch&Add(V,e) meets Store(V,f), send Store(V,e+f) to memory; when the ACK is received, return f as the value of the F&A
Conclusion -- it is possible to combine all requests to the same memory location
Potential problems …
▪ Network routing is driven entirely by performance, so a complicated switch is usually a problem
▪ Routers typically forward non-blocked packets in <= 3 ticks
▪ Matching to recognize that two requests collide is an "add" operation
▪ Combining is an "add" operation after the previous add
▪ Combining relies on the requests reaching the switch simultaneously, or at worst before the forwarded packet leaves … this is improbable
▪ Most traffic is non-combinable -- headed for different places
▪ A combining router was created by Susan Dickey
If the network switch is too slow, then …
▪ Do not combine at every stage … so that some stages can be fast
▪ Use two networks, one fast and one that does combining -- the latter can handle the sharing requests
▪ Combine only like requests, e.g. loads with loads
▪ Limit combining at a node to two requests
As it happened …
▪ Only combining of like requests was ever implemented in a switch
▪ IBM used the two-network solution in the RP3
Norton and Pfister discovered in simulation for the RP3 computer that the Ω-network develops hot spots
It was thought that combining would remove the hot spots … it seemed to for a 64-way network
The problem is that once a node becomes hot, a “back-up” tree forms “behind” the node
Lee, Kruskal, and Kuck studied the problem by simulation and analysis
▪ LKK discovered and named the "back-up tree"
▪ Showed in simulation that the 64-way network result was lucky
▪ Combining doesn't help, because it is the other (non-combinable) traffic that backs up
▪ Only 0.125% of traffic directed at a hot spot is enough to cause trouble
It was a good idea, but it didn't work
Good:
▪ Fetch&Add is clever -- a primitive with good properties
▪ Shifting from protecting data to allocating work is better
▪ Computation at memory is powerful, worth doing
Bad:
▪ Pipelined multistage networks probably just don't work
▪ Complexity in a switch is wrong -- speed is essential
▪ Failed to exploit locality -- caching is basically impossible
Goal: Summarize what we’ve covered and what you might have learned
To write successful parallel programs, we need a model of how a || computer works
▪ The CTA is a generalization of || hardware
▪ The CTA favors …
  ▪ Locality, because of high λ
  ▪ Minimizing inter-thread dependences, because of high λ plus wait time and sometimes contention
▪ CTA algorithms were successful, and usually they were the only ones that worked well
Lambda (λ) is always relevant, usually very significant
In parallel algorithm design and parallel programming, be guided by the CTA
We acquired parallel programming skills using small exercises: thread programming, Peril-L, MPI, Chapel
Experiences -- much of it is low-level and grotty, and it is not always easy to get performance
Approach problems "top down", maximizing the highest level of parallelism first
▪ Corollary: don't try to be too smart at the low level
Granularity: the coarser the better
Communication load: less is better
Bandwidth: use it wisely
Minimize "programming to the machine"
What in your view was the high-order bit?
Using parallel computers is tough
▪ Parallel computers generally behave like the CTA, so program to it … it won't disappoint
▪ Parallel algorithms often require fresh thinking -- the sequential case may not be a good place to begin
▪ CMPs are the sweet spot now, but for how long?
▪ Reduce/Scan are basic building blocks
Programming tools are all over the place
▪ OpenMP is very simple, but not too expressive
▪ Pthreads, a standard library, but very low level
▪ MPI is universal, low level, and abstraction-free
▪ ZPL leaves all parallelism to the compiler, but gives WYSIWYG as a guide to writing good programs
▪ Chapel and Transactional Memory have promise, but NRFPT (not ready for prime time)
Hardware design is very volatile at the moment
Need a volunteer at each site; fill out both forms
▪ Bubble in the white form
▪ Give me suggestions on the yellow form
Volunteer:
▪ On campus, drop the completed forms in the box across the street
▪ In Redmond, FedEx the completed forms to campus
© 2010 Larry Snyder, CSE