CS 2200 Presentation 18a Parallel Processors


Page 1: CS 2200

CS 2200

Presentation 18a

Parallel Processors

Page 2: CS 2200

Questions?

Page 3: CS 2200

Our Road Map

Processor

Networking

Parallel Systems

I/O Subsystem

Memory Hierarchy

Page 4: CS 2200

The Next Step

• Create more powerful computers simply by interconnecting many small computers
  – Should be scalable
  – Should be fault tolerant
  – More economical

• Multiprocessors
  – High throughput running independent tasks

• Parallel Processing
  – Single program on multiple processors

Page 5: CS 2200

Key Questions

• How do parallel processors share data?

• How do parallel processors communicate?

• How many processors?

Page 6: CS 2200

Sharing Data I

Processor Processor

Memory

Single Address Space

Communication with memory via loads and stores

Same box

Page 7: CS 2200

Problems?

Processor Processor

Memory

Page 8: CS 2200

Sharing Data I has Two Flavors!

• Uniform Memory Access (UMA)
  – Symmetric Multiprocessors (SMP)

• Non-Uniform Memory Access (NUMA)

Page 9: CS 2200

Sharing Data I

[Figure: three processors, each with its own cache, sharing one memory]

Uniform Memory Access - UMA

Symmetric Multiprocessor - SMP

Page 10: CS 2200

Sharing Data I

Non-Uniform Memory Access - NUMA

[Figure: four nodes, each containing CPU x 4, Channel, Cache, Memory, and I/O]

Page 11: CS 2200

Sharing Data II

[Figure: three computers, each with private memory, connected by a Local Area Network]

Use Message Passing

Each machine capable of
• Send
• Receive

Page 12: CS 2200

Connection Schemes

• Single Bus
  – Improved feasibility due to microprocessors
  – Caches can reduce bus traffic
  – Need to worry about cache coherency

• Network

[Figure: three nodes, each with a processor, a cache, and a memory, connected by a network]

Page 13: CS 2200

Programming

• As contrasted to instruction level parallelism which may be largely ignored by the programmer...

• Writing efficient multiprocessor programs is hard.
  – Wizards write programs with sequential interface (e.g. databases, file servers, CAD)
  – Communications overhead becomes a factor
  – Requires a lot of knowledge of the hardware!!!

Page 14: CS 2200

Speedup Challenge

• To get full benefit of parallelism need to be able to parallelize the entire program!

• Amdahl’s Law
  – Time_after = (Time_affected / Improvement) + Time_unaffected
  – Example: We want 100 times speedup with 100 processors
  – Time_unaffected = 0!!!
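
The step from the example to its conclusion can be worked out as follows. This is a sketch of the reasoning, assuming the 100 processors give an Improvement of exactly 100 on the affected part; the symbol Time_before (the running time on one processor) is introduced here for illustration and does not appear on the slide.

```latex
\begin{align*}
\text{Time}_{\text{before}} &= \text{Time}_{\text{affected}} + \text{Time}_{\text{unaffected}} \\
\text{Time}_{\text{after}} &= \frac{\text{Time}_{\text{affected}}}{100} + \text{Time}_{\text{unaffected}} \\
\text{100 times speedup:}\quad
\frac{\text{Time}_{\text{affected}} + \text{Time}_{\text{unaffected}}}{100}
  &= \frac{\text{Time}_{\text{affected}}}{100} + \text{Time}_{\text{unaffected}} \\
\frac{\text{Time}_{\text{unaffected}}}{100} &= \text{Time}_{\text{unaffected}}
  \;\Longrightarrow\; \text{Time}_{\text{unaffected}} = 0
\end{align*}
```

Any nonzero sequential portion therefore keeps the speedup below 100.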

Page 15: CS 2200

Back to the Bus

Page 16: CS 2200

Multiprocessor Cache Coherency

• Means that values in cache and memory are consistent or that we know they are different and can act accordingly

• Considered to be a good thing.

• Becomes more difficult with multiple processors and multiple caches!

• Popular technique: Snooping!
  – Write-invalidate
  – Write-update

Page 17: CS 2200

Multi-Processor Cache Coherency

Page 18: CS 2200

P One of many processors.

Page 19: CS 2200

P

This indicates what operation the processor is trying to perform and with what address.

000000 R W

Addr

Page 20: CS 2200

The processor’s cache:

Tag (4 bits), 4 lines (ID), Valid, Dirty and Shared bits.

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0

P

000000 R W

Addr

Page 21: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0

P

000000 R W

Addr

Note: For this somewhat simplified example we won’t concern ourselves with how many bytes (or words) are in each line. Assume that it’s more than one.

Page 22: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0

The Bus with indication of address and operation.

000000 R W

Addr

P

000000 R W

Addr

Page 23: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0

These bus operations are coming from other processors which aren’t shown.

000000 R W

Addr

P

000000 R W

Addr

Page 24: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0

Main Memory

000000 R W

Addr

P

000000 R W

Addr

MEMORY

Page 25: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Processor issues a read

MEMORY

Page 26: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Cache reports...

MEMORY

MISS

Page 27: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Cache reports...

MEMORY

MISS

Because the tags don’t match!
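
The hit/miss check on these slides can be sketched in C. This is an illustration only, assuming the layout read off the cache table: a 6-bit address whose upper 4 bits are the tag and whose lower 2 bits select one of the 4 lines. The names `struct line`, `addr_tag`, `addr_id`, and `is_hit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: 6-bit address = 4-bit tag (high bits) + 2-bit line ID (low bits). */
enum { ID_BITS = 2, NUM_LINES = 1 << ID_BITS };

struct line {
    unsigned tag;
    bool valid, dirty, shared;   /* the V, D and S columns of the table */
};

static unsigned addr_tag(unsigned addr) { return (addr >> ID_BITS) & 0xFu; }
static unsigned addr_id(unsigned addr)  { return addr & (NUM_LINES - 1u); }

/* A lookup hits only if the selected line is valid and its tag matches. */
static bool is_hit(const struct line cache[NUM_LINES], unsigned addr)
{
    assert(addr < 64u);                       /* addresses are 6 bits wide */
    const struct line *l = &cache[addr_id(addr)];
    return l->valid && l->tag == addr_tag(addr);
}
```

With this split, address 101010 selects line 10 with tag 1010, and address 111100 selects line 00 with tag 1111, which matches the rows that change in the tables that follow.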

Page 28: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Data read from memory

MEMORY

Page 29: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Data read from memory

MEMORY

This bit indicates that this line is “shared” which means other caches might have the same value.

Page 30: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

0000 10 0 0 0

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

From now on we will show these as 2 step operations… step 1, the request.

MEMORY

Page 31: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

Step 2… what was the result and the change to the cache.

MEMORY

MISS

Page 32: CS 2200

Tag ID V D S

0000 00 0 0 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

A write...

Page 33: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

Write Miss

Page 34: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

Write Miss

Keep in mind that since most cache configurations have multiple bytes per line, a write miss will actually require us to get the line from memory into the cache first, since we are only writing one byte into the line.

Page 35: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

Note: The dirty bit signifies that the data in the cache is not the same as in memory.

Page 36: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

MEMORY

Another read...

Page 37: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

101010 R W

Addr

MEMORY

…this time a hit!

HIT!

Page 38: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

Now another write...

Page 39: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

To a dirty line!

This is a write hit and since the shared bit is 0 we know we are in the exclusive state.

Page 40: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 010101 R W

Addr

P

000000 R W

Addr

MEMORY

Now another processor, failing to find what it needs in its cache, goes to the bus… a “bus read miss”

Page 41: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 010101 R W

Addr

P

000000 R W

Addr

MEMORY

Our cache, which is monitoring the bus (snooping), sees the miss but can’t help.

Page 42: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 101010 R W

Addr

P

000000 R W

Addr

MEMORY

Another bus request...

Page 43: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 101010 R W

Addr

P

000000 R W

Addr

MEMORY

Since we have this value in our cache we can satisfy the request from our cache, assuming that this will be quicker than from memory.

Page 44: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

And another request. This time to a dirty line.

Page 45: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

We have to supply the value out of our cache since it is more current than the value in memory.

Page 46: CS 2200

Tag ID V D S

1111 00 1 1 1

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

000000 R W

Addr

MEMORY

We also mark it as shared. Why?

Page 47: CS 2200

Tag ID V D S

1111 00 1 1 1

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

If, for example, our next operation was a write to this line...

Page 48: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

We would have to note that it was again exclusive and let the other caches know.

ZAP

Page 49: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 000000 R W

Addr

P

111100 R W

Addr

MEMORY

We could then write repeatedly to this line, and since we have exclusive ownership no one has to know!

Page 50: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 1 0 1

0000 11 0 0 0 101010 R W

Addr

P

000000 R W

Addr

MEMORY

In a similar way we must respond to write misses by other caches.

Page 51: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 101010 R W

Addr

P

000000 R W

Addr

MEMORY

In this case we know that some other processor is going to have a newer value so we must mark this line as invalid.

Page 52: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

Now assume some other processor requests a byte from the 111100 line of its cache.

Page 53: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

Since our line is marked valid and exclusive the other caches should be marked as invalid.

Page 54: CS 2200

Tag ID V D S

1111 00 1 1 0

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

So the first thing that the other cache will do is a read to get the correct value for all the bytes in the line before it writes the one new byte.

Page 55: CS 2200

Tag ID V D S

1111 00 1 1 1

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

Our cache supplies the value and goes to the shared state.

Page 56: CS 2200

Tag ID V D S

1111 00 1 1 1

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

Sometime later, for whatever reason, the other cache writes back the value.

Page 57: CS 2200

Tag ID V D S

1111 00 0 0 1

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

This requires us to mark our line as invalid since we no longer have the most current value for this line.

Page 58: CS 2200

Tag ID V D S

1111 00 0 0 1

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

We don’t need to worry about the dirty bit since we already supplied that value to the other cache. Its entry should now be marked as dirty.

Page 59: CS 2200

Tag ID V D S

1111 00 0 0 1

0000 01 0 0 0

1010 10 0 0 1

0000 11 0 0 0 111100 R W

Addr

P

000000 R W

Addr

MEMORY

This concludes our demonstration of some basic multiprocessor techniques for guaranteeing cache coherency and consistency.
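
The bit changes shown in the walkthrough can be collected into a small C sketch of a write-invalidate snooping cache line. It is a simplification under stated assumptions, not the exact protocol of any real machine: it tracks one line at a time, does not model writing a dirty line back to memory on eviction, and treats a snooped write miss or invalidate as an immediate invalidation. All names (`struct line`, `enum bus_op`, `cpu_read`, `cpu_write`, `snoop`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct line { unsigned tag; bool valid, dirty, shared; };   /* Tag, V, D, S */

/* What this cache puts on the bus as a result of a processor request. */
enum bus_op { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE };

static bool matches(const struct line *l, unsigned tag)
{
    return l->valid && l->tag == tag;
}

/* Local read: a miss loads the line as valid, clean and shared (V D S = 1 0 1). */
static enum bus_op cpu_read(struct line *l, unsigned tag)
{
    if (matches(l, tag))
        return BUS_NONE;                       /* read hit, nothing changes */
    l->tag = tag; l->valid = true; l->dirty = false; l->shared = true;
    return BUS_READ_MISS;
}

/* Local write: the line ends up valid, dirty and exclusive (V D S = 1 1 0). */
static enum bus_op cpu_write(struct line *l, unsigned tag)
{
    if (matches(l, tag)) {
        bool was_shared = l->shared;
        l->dirty = true; l->shared = false;
        /* Only a write to a shared line needs to tell the other caches ("ZAP"). */
        return was_shared ? BUS_INVALIDATE : BUS_NONE;
    }
    l->tag = tag; l->valid = true; l->dirty = true; l->shared = false;
    return BUS_WRITE_MISS;
}

/* Request seen on the bus from another processor.
   Returns true when this cache must supply the data because memory is stale. */
static bool snoop(struct line *l, unsigned tag, enum bus_op op)
{
    assert(op != BUS_NONE);
    if (!matches(l, tag))
        return false;                          /* we see the request but can't help */
    bool must_supply = l->dirty;
    if (op == BUS_READ_MISS)
        l->shared = true;                      /* someone else now has a copy */
    else {
        l->valid = false;                      /* someone else will hold a newer value */
        l->dirty = false;
    }
    return must_supply;
}
```

Replaying the walkthrough against this sketch: a read of tag 1010 misses and leaves 1 0 1, a write of tag 1111 misses and leaves 1 1 0, a snooped read miss on the dirty line sets its shared bit, and the next local write to it returns `BUS_INVALIDATE`.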

Page 60: CS 2200

Questions?

Page 61: CS 2200

“One if by Bus, Two if by Network”

Multiprocessors connected by network

[Figure: three computers, each with private memory, connected by a Local Area Network]

Page 62: CS 2200

A Parallel Program

• Task: Sum 100,000 numbers using 100 processors

• Processors can communicate via message passing system
    send(x,y);  /* sends unit x the value y */
    receive();  /* receives a message from the net */

• Step one: Send 1000 numbers out to each processor

• Each processor knows its number: Pn

Page 63: CS 2200

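Code

    /* Each processor first sums its own 1000 numbers. */
    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += A[i];

    /* The partial sums are then combined, halving the active set each round. */
    limit = 100;
    half = 100;
    do {
        half = (half + 1) / 2;           /* dividing line between senders and receivers */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);        /* upper half sends its partial sum down */
        if (Pn <= (limit / 2 - 1))
            sum = sum + receive();       /* lower half adds the value it receives */
        limit = half;                    /* the senders drop out of the next round */
    } while (half > 1);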
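
To see that the loop above leaves the total on processor 0, the rounds can be simulated sequentially in plain C. This is a sketch, not a real message-passing program: an array slot per processor stands in for `send` and `receive`, and the names `tree_reduce`, `parallel_sum`, `mailbox`, `NPROC`, and `PER_PROC` are hypothetical.

```c
#include <assert.h>

enum { NPROC = 100, PER_PROC = 1000 };

/* sum[pn] holds processor pn's partial sum on entry; on return sum[0] holds
   the total. Returns the number of communication rounds. */
static int tree_reduce(long long sum[NPROC])
{
    long long mailbox[NPROC];            /* mailbox[x] models a message sent to unit x */
    int limit = NPROC, half = NPROC, rounds = 0;
    do {
        half = (half + 1) / 2;
        for (int pn = half; pn < limit; pn++)
            mailbox[pn - half] = sum[pn];         /* send(Pn - half, sum) */
        for (int pn = 0; pn <= limit / 2 - 1; pn++)
            sum[pn] += mailbox[pn];               /* sum = sum + receive() */
        limit = half;
        rounds++;
    } while (half > 1);
    return rounds;
}

/* Step one plus the local loop: each simulated processor sums its own slice. */
static long long parallel_sum(const long long *a)
{
    long long sum[NPROC];
    assert(a != 0);
    for (int pn = 0; pn < NPROC; pn++) {
        sum[pn] = 0;
        for (int i = 0; i < PER_PROC; i++)
            sum[pn] += a[pn * PER_PROC + i];
    }
    tree_reduce(sum);
    return sum[0];
}
```

With 100 processors the dividing line moves through 50, 25, 13, 7, 4, 2, 1, so the combine phase takes 7 rounds instead of 99 sequential additions. When `limit` is odd, the middle processor neither sends nor receives in that round and simply carries its sum forward.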

Page 64: CS 2200

Snooping?

• The preceding example only used message passing for communication

• But can we do loads and stores over the network?

• Yes, but lacking a single bus, snooping protocols no longer work.

• Solution: Directories which know where each block is located and the status thereof.
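
One common way to picture a directory entry is a state plus a bitmask of the nodes that hold a copy. The slides give no detail on the directory format, so everything below is an assumption made for illustration: the three states, the 64-node limit, and the names `struct dir_entry`, `dir_read`, and `dir_write`.

```c
#include <assert.h>
#include <stdint.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

/* One entry per memory block: its status and where the copies are. */
struct dir_entry {
    enum dir_state state;
    uint64_t sharers;        /* bit n set: node n holds a copy (up to 64 nodes) */
};

/* Node reads the block. Returns the node that must supply the data because it
   holds the only up-to-date copy, or -1 if memory can supply it. */
static int dir_read(struct dir_entry *e, int node)
{
    assert(node >= 0 && node < 64);
    uint64_t me = UINT64_C(1) << node;
    int owner = -1;
    if (e->state == DIR_EXCLUSIVE) {
        if (e->sharers == me)
            return -1;                       /* requester already owns the block */
        for (int n = 0; n < 64; n++)
            if ((e->sharers >> n) & 1)
                owner = n;
    }
    e->sharers |= me;
    e->state = DIR_SHARED;
    return owner;
}

/* Node writes the block. Returns the set of other nodes whose copies must be
   invalidated, sent as point-to-point messages instead of a bus broadcast. */
static uint64_t dir_write(struct dir_entry *e, int node)
{
    assert(node >= 0 && node < 64);
    uint64_t me = UINT64_C(1) << node;
    uint64_t to_invalidate = e->sharers & ~me;
    e->sharers = me;
    e->state = DIR_EXCLUSIVE;
    return to_invalidate;
}
```

The point of the sketch is that the directory replaces snooping: instead of every cache watching every bus transaction, only the nodes recorded in `sharers` are contacted.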

Page 65: CS 2200

Clusters

• Simple idea

• Use off-the-shelf whole computers

• Plus high speed network technology

• Wait! What’s the difference?

Page 66: CS 2200

NW vs. Cluster

• Networked Multiprocessors
  – Cost to administer 1
  – Connect via Memory Bus
  – Shared memory: 1 O/S
  – Not as robust
  – More costly

• Clusters
  – Cost to administer N
  – Connect via I/O Bus
  – N Memories: N O/S’s
  – Easier to replace 1
  – Mass market

Page 67: CS 2200

Network Topologies

[Figure: a. 2D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3 so n = 3)]
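
The n-cube wiring can be expressed compactly in C, assuming the usual numbering in which each of the 2^n nodes gets an n-bit address and two nodes are linked when their addresses differ in exactly one bit. The function names `ncube_neighbors` and `ncube_hops` are hypothetical.

```c
#include <assert.h>

/* Fills out[0..n-1] with the neighbors of `node` in an n-cube of 2^n nodes:
   flipping each address bit in turn gives one neighbor per dimension. */
static void ncube_neighbors(unsigned node, unsigned n, unsigned out[])
{
    assert(n < 32u && node < (1u << n));
    for (unsigned d = 0; d < n; d++)
        out[d] = node ^ (1u << d);
}

/* Minimum number of hops between two nodes: the number of address bits that differ. */
static unsigned ncube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b, hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

For the 8-node case (n = 3), node 5 (binary 101) is linked to nodes 4, 7, and 1, and the farthest pair of nodes, such as 0 and 7, is 3 hops apart.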

Page 68: CS 2200

[Figure: a. Crossbar and b. Omega network, each connecting processors P0 through P7; c. Omega network switch box with ports A, B, C, and D]

Page 69: CS 2200

More?

http://www.cc.gatech.edu/projects/ihpcl/index.html

Page 70: CS 2200

Questions?

Page 71: CS 2200