CS 2200

Presentation 18a

Parallel Processors

Questions?

Our Road Map

Processor

Networking

Parallel Systems

I/O Subsystem

Memory Hierarchy

The Next Step

• Create more powerful computers simply by interconnecting many small computers
  – Should be scalable
  – Should be fault tolerant
  – More economical
• Multiprocessors
  – High throughput running independent tasks
• Parallel Processing
  – Single program on multiple processors

Key Questions

• How do parallel processors share data?

• How do parallel processors communicate?

• How many processors?

Sharing Data I

[Figure: two processors connected to one memory, in the same box]

Single address space

Communication with memory via loads and stores

Problems?

Sharing Data I has Two Flavors!

• Uniform Memory Access (UMA)
  – Symmetric Multiprocessors (SMP)

• Non-Uniform Memory Access (NUMA)

Sharing Data I

Uniform Memory Access (UMA): Symmetric Multiprocessor (SMP)

[Figure: three processors, each with its own cache, sharing a single memory]

Sharing Data I

Non-Uniform Memory Access (NUMA)

[Figure: four nodes, each with CPU x 4, channel, cache, memory, and I/O]

Sharing Data II

[Figure: computers with private memory, connected by a Local Area Network]

Use Message Passing

Each machine capable of
• Send
• Receive

Connection Schemes

• Single Bus
  – Improved feasibility due to microprocessors
  – Caches can reduce bus traffic
  – Need to worry about cache coherency
• Network

[Figure: three nodes, each with a processor, cache, and memory, connected by a network]

Programming

• As contrasted to instruction-level parallelism, which may be largely ignored by the programmer...
• Writing efficient multiprocessor programs is hard.
  – Wizards write programs with sequential interface (e.g. databases, file servers, CAD)
  – Communications overhead becomes a factor
  – Requires a lot of knowledge of the hardware!!!

Speedup Challenge

• To get the full benefit of parallelism we need to be able to parallelize the entire program!
• Amdahl’s Law
  – Time_after = (Time_affected / Improvement) + Time_unaffected
  – Example: We want 100 times speedup with 100 processors
  – That requires Time_unaffected = 0!!!
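The Amdahl's Law formula above can be checked numerically. The following is a minimal sketch in C; the function names `time_after` and `speedup` are made up for illustration and are not from the slides. With 100 time units fully parallelizable and an improvement of 100, the speedup is exactly 100, but if only 1 unit out of 100 stays serial, the speedup drops to roughly 50.

```c
#include <assert.h>

/* Amdahl's Law as stated on the slide:
 * time_after = time_affected / improvement + time_unaffected
 * (function and parameter names are illustrative assumptions). */
double time_after(double time_affected, double improvement, double time_unaffected)
{
    assert(improvement > 0.0);
    return time_affected / improvement + time_unaffected;
}

/* Overall speedup = time before the improvement / time after it. */
double speedup(double time_affected, double improvement, double time_unaffected)
{
    double before = time_affected + time_unaffected;
    return before / time_after(time_affected, improvement, time_unaffected);
}
```

For example, `speedup(100.0, 100.0, 0.0)` models a fully parallel program on 100 processors, while `speedup(99.0, 100.0, 1.0)` models the same program with 1% serial work.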

Back to the Bus

Multiprocessor Cache Coherency

• Means that values in cache and memory are consistent, or that we know they are different and can act accordingly
• Considered to be a good thing.
• Becomes more difficult with multiple processors and multiple caches!
• Popular technique: Snooping!
  – Write-invalidate
  – Write-update

Multi-Processor Cache Coherency

P: one of many processors.

P: 000000 R W (Addr)
This indicates what operation the processor is trying to perform and with what address.

The processor’s cache: tag (4 bits), 4 lines (ID), and valid (V), dirty (D), and shared (S) bits.

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
0000 10 0 0 0
0000 11 0 0 0

Note: For this somewhat simplified example we won’t concern ourselves with how many bytes (or words) are in each line. Assume that it’s more than one.

Bus: 000000 R W (Addr)
The bus, with indication of address and operation. These bus operations are coming from other processors, which aren’t shown.

MEMORY: main memory.

Processor issues a read.

P: 101010 R W
Bus: 000000 R W

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
0000 10 0 0 0
0000 11 0 0 0

Cache reports... MISS, because the tags don’t match!
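The hit/miss decision in this example can be sketched in C. This is a simplified illustration under the slides' own assumptions (6-bit addresses, a 4-bit tag, a 2-bit line ID, and no byte offset within the line); the names `cache_line`, `addr_tag`, `addr_line`, and `is_hit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LINES 4   /* 2-bit line ID, as in the slides */

struct cache_line {
    unsigned tag;   /* upper 4 bits of the 6-bit address */
    bool valid;     /* V */
    bool dirty;     /* D */
    bool shared;    /* S */
};

/* Split a 6-bit address: the upper 4 bits are the tag,
 * the lower 2 bits select the cache line. */
unsigned addr_tag(unsigned addr)  { return (addr >> 2) & 0xFu; }
unsigned addr_line(unsigned addr) { return addr & 0x3u; }

/* A hit needs a valid line whose stored tag matches the address tag. */
bool is_hit(const struct cache_line cache[NUM_LINES], unsigned addr)
{
    assert(addr < 64u);
    const struct cache_line *line = &cache[addr_line(addr)];
    return line->valid && line->tag == addr_tag(addr);
}
```

Address 101010 (0x2A) selects line 10 with tag 1010. In the empty cache that line is invalid and holds tag 0000, so the read misses.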

Data read from memory.

P: 101010 R W
Bus: 000000 R W

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

The S bit indicates that this line is “shared”, which means other caches might have the same value.

From now on we will show these as 2-step operations.

Step 1: the request.

P: 101010 R W
Bus: 000000 R W

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
0000 10 0 0 0
0000 11 0 0 0

Step 2: what was the result and the change to the cache. MISS

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

A write...

P: 111100 R W
Bus: 000000 R W

Tag ID V D S
0000 00 0 0 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

Write miss

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

Keep in mind that since most cache configurations have multiple bytes per line, a write miss will actually require us to get the line from memory into the cache first, since we are only writing one byte into the line.

Note: The dirty bit signifies that the data in the cache is not the same as in memory.

Another read... this time a hit!

P: 101010 R W
Bus: 000000 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

HIT!

Now another write... to a dirty line!

P: 111100 R W
Bus: 000000 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

This is a write hit, and since the shared bit is 0 we know we are in the exclusive state.

Now another processor, failing to find what it needs in its cache, goes to the bus: a “bus read miss”.

P: 000000 R W
Bus: 010101 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

Our cache, which is monitoring the bus (snooping), sees the miss but can’t help.

Another bus request...

P: 000000 R W
Bus: 101010 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

Since we have this value in our cache, we can satisfy the request from our cache, assuming that this will be quicker than from memory.

And another request. This time to a dirty line.

P: 000000 R W
Bus: 111100 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

We have to supply the value out of our cache since it is more current than the value in memory.

We also mark it as shared. Why?

Tag ID V D S
1111 00 1 1 1
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

If, for example, our next operation was a write to this line...

P: 111100 R W
Bus: 000000 R W

Tag ID V D S
1111 00 1 1 1
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

We would have to note that it was again exclusive and let the other caches know. ZAP

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

We could then write repeatedly to this line, and since we have exclusive ownership no one has to know!

In a similar way we must respond to write misses by other caches.

P: 000000 R W
Bus: 101010 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0

In this case we know that some other processor is going to have a newer value, so we must mark this line as invalid.

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0

Now assume some other processor requests a byte from the 111100 line of its cache.

P: 000000 R W
Bus: 111100 R W

Tag ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0

Since our line is marked valid and exclusive, the other caches should be marked as invalid.

So the first thing that the other cache will do is a read, to get the correct value for all the bytes in the line before it writes the one new byte.

Our cache supplies the value and goes to the shared state.

P: 000000 R W
Bus: 111100 R W

Tag ID V D S
1111 00 1 1 1
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0

Sometime later, for whatever reason, the other cache writes back the value.

This requires us to mark our line as invalid, since we no longer have the most current value for this line.

P: 000000 R W
Bus: 111100 R W

Tag ID V D S
1111 00 0 0 1
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0

We don’t need to worry about the dirty bit since we already supplied that value to the other cache. Its entry should now be marked as dirty.

This concludes our demonstration of some basic multiprocessor techniques for guaranteeing cache coherency and consistency.
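The transitions walked through above can be summarized as a small state machine for one cache line. The following C sketch is only an illustration of a write-invalidate snooping scheme using the V, D, and S bits from the slides; the type and function names (`line`, `bus_msg`, `cpu_read`, `cpu_write`, `snoop`) are assumptions, and writing back an evicted dirty line is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

struct line { unsigned tag; bool valid, dirty, shared; };

/* What a cache places on the bus (names are illustrative). */
enum bus_msg { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE };

static bool holds(const struct line *l, unsigned tag)
{
    return l->valid && l->tag == tag;
}

/* Local processor read: a miss loads the line and marks it shared. */
enum bus_msg cpu_read(struct line *l, unsigned tag)
{
    assert(l != 0);
    if (holds(l, tag))
        return BUS_NONE;                       /* read hit */
    l->tag = tag; l->valid = true; l->dirty = false; l->shared = true;
    return BUS_READ_MISS;
}

/* Local processor write: the line ends up dirty and exclusive. */
enum bus_msg cpu_write(struct line *l, unsigned tag)
{
    assert(l != 0);
    if (holds(l, tag)) {
        bool was_shared = l->shared;
        l->dirty = true; l->shared = false;
        /* A shared line must be "zapped" in the other caches first. */
        return was_shared ? BUS_INVALIDATE : BUS_NONE;
    }
    l->tag = tag; l->valid = true; l->dirty = true; l->shared = false;
    return BUS_WRITE_MISS;
}

/* Snooped bus traffic from another processor.
 * Returns true when this cache supplies the data. */
bool snoop(struct line *l, unsigned tag, enum bus_msg msg)
{
    assert(l != 0);
    if (msg == BUS_NONE || !holds(l, tag))
        return false;                          /* we can't help */
    if (msg == BUS_READ_MISS) {
        l->shared = true;                      /* someone else has it now */
        return true;
    }
    /* Write miss or invalidate: the other cache will hold the newer value. */
    bool supplies = (msg == BUS_WRITE_MISS) && l->dirty;
    l->valid = false; l->dirty = false;
    return supplies;
}
```

A write-update scheme would instead broadcast the new data to the sharers rather than invalidating their copies.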

Questions?

“One if by Bus, Two if by Network”

Multiprocessors connected by network




Local Area Network

A Parallel Program

• Task: Sum 100,000 numbers using 100 processors
• Processors can communicate via a message passing system
    send(x,y);   /* sends unit x the value y */
    receive();   /* receives a message from the net */
• Step one: Send 1000 numbers out to each processor
• Each processor knows its number: Pn

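Code

sum = 0;
for (i = 0; i < 1000; i++)            /* sum the 1000 local numbers */
    sum += A[i];

limit = 100;
half = 100;
do {
    half = (half + 1) / 2;            /* dividing line between senders and receivers */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);         /* upper part sends its partial sum */
    if (Pn <= (limit/2 - 1))
        sum = sum + receive();        /* lower part adds in a received partial sum */
    limit = half;                     /* senders drop out of the next round */
} while (half > 1);                   /* the final total ends up on processor 0 */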
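To see that the loop above really leaves the total on processor 0, the rounds can be simulated in a single address space. This is a sketch only: the `mailbox` array stands in for `send()`/`receive()`, and the names `local_sums` and `tree_reduce` are made up for illustration.

```c
#include <assert.h>

#define NPROCS 100

/* Step one: processor pn sums its own slice of per_proc numbers. */
void local_sums(const int *a, int per_proc, long long partial[NPROCS])
{
    assert(per_proc > 0);
    for (int pn = 0; pn < NPROCS; pn++) {
        partial[pn] = 0;
        for (int i = 0; i < per_proc; i++)
            partial[pn] += a[pn * per_proc + i];
    }
}

/* Run the reduction rounds for all processors at once.
 * mailbox[x] holds the value that was "sent" to processor x this round. */
long long tree_reduce(long long partial[NPROCS])
{
    long long mailbox[NPROCS];
    int limit = NPROCS, half = NPROCS;
    do {
        half = (half + 1) / 2;
        for (int pn = half; pn < limit; pn++)     /* upper part sends */
            mailbox[pn - half] = partial[pn];
        for (int pn = 0; pn < limit / 2; pn++)    /* lower part receives */
            partial[pn] += mailbox[pn];
        limit = half;
    } while (half > 1);
    return partial[0];                            /* total is on processor 0 */
}
```

The active set shrinks 100, 50, 25, 13, 7, 4, 2, 1, so the reduction takes 7 rounds. When `limit` is odd, the middle processor neither sends nor receives in that round.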

Snooping?

• The preceding example only used message passing for communication
• But can we do loads and stores over the network?
• Yes, but lacking a single bus, snooping protocols no longer work.
• Solution: Directories, which know where each block is located and its status.

Clusters

• Simple idea
• Use off-the-shelf whole computers
• Plus high speed network technology
• Wait! What’s the difference?

NW vs. Cluster

• Networked Multiprocessors
  – Cost to administer 1
  – Connect via Memory Bus
  – Shared memory: 1 O/S
  – Not as robust
  – More costly
• Clusters
  – Cost to administer N
  – Connect via I/O Bus
  – N Memories: N O/S’s
  – Easier to replace 1
  – Mass market

Network Topologies

[Figure: a. 2D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3, so n = 3)]

[Figure: a. Crossbar; b. Omega network; c. Omega network switch box with ports A, B, C, D. The crossbar and Omega network each connect processors P0 through P7.]
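For the n-cube, one common way to think about the wiring is to give each node an n-bit ID and link two nodes exactly when their IDs differ in one bit. The C sketch below assumes that numbering; the function names `ncube_neighbor` and `ncube_hops` are hypothetical and not from the slides.

```c
#include <assert.h>

/* Neighbor of `node` across dimension `dim` in an n-cube:
 * flip one bit of the n-bit node ID. */
unsigned ncube_neighbor(unsigned node, unsigned dim, unsigned n)
{
    assert(n < 32u && dim < n && node < (1u << n));
    return node ^ (1u << dim);
}

/* Hops on a shortest path between two nodes:
 * the number of bit positions in which their IDs differ. */
unsigned ncube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b, hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

With n = 3 each of the 8 nodes has 3 neighbors, and the farthest pair (for example 000 and 111) is 3 hops apart.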

More?

http://www.cc.gatech.edu/projects/ihpcl/index.html

Questions?
