CS 2200
Presentation 18a
Parallel Processors
Questions?
Our Road Map
• Processor
• Memory Hierarchy
• I/O Subsystem
• Parallel Systems
• Networking
The Next Step
• Create more powerful computers simply by interconnecting many small computers
  – Should be scalable
  – Should be fault tolerant
  – More economical
• Multiprocessors
  – High throughput running independent tasks
• Parallel Processing
  – Single program on multiple processors
Key Questions
• How do parallel processors share data?
• How do parallel processors communicate?
• How many processors?
Sharing Data I
[Diagram: Processor, Processor, Memory; same box]
• Single address space
• Communication with memory via loads and stores
• Problems?
Sharing Data I has Two Flavors!
• Uniform Memory Access (UMA)
  – Symmetric Multiprocessors (SMP)
• Non-Uniform Memory Access (NUMA)
Sharing Data I
Uniform Memory Access - UMA
Symmetric Multiprocessor SMP
[Diagram: processors, each with a cache, sharing one memory]
Sharing Data I
Non-Uniform Memory Access - NUMA
[Diagram: four nodes, each with CPU x 4, Channel, Cache, Memory, I/O]
Sharing Data II
[Diagram: three computers with private memory connected by a Local Area Network]
Use Message Passing
Each machine capable of
• Send
• Receive
Connection Schemes
• Single Bus
  – Improved feasibility due to microprocessors
  – Caches can reduce bus traffic
  – Need to worry about cache coherency
• Network
[Diagram: Network; three nodes, each with Cache, Processor, Memory]
Programming
• As contrasted to instruction level parallelism, which may be largely ignored by the programmer...
• Writing efficient multiprocessor programs is hard.
  – Wizards write programs with sequential interface (e.g. Databases, file servers, CAD)
  – Communications overhead becomes a factor
  – Requires a lot of knowledge of the hardware!!!
Speedup Challenge
• To get full benefit of parallelism need to be able to parallelize the entire program!
• Amdahl’s Law
  – Time_after = (Time_affected / Improvement) + Time_unaffected
  – Example: We want 100 times speedup with 100 processors
  – Time_unaffected = 0!!!
Back to the Bus
Multiprocessor Cache Coherency
• Means that values in cache and memory are consistent or that we know they are different and can act accordingly
• Considered to be a good thing.
• Becomes more difficult with multiple processors and multiple caches!
• Popular technique: Snooping!
  – Write-invalidate
  – Write-update
Multi-Processor Cache Coherency
[Diagram legend]
• P: One of many processors.
• Processor request (Addr, R/W): This indicates what operation the processor is trying to perform and with what address.
• The processor’s cache: Tag (4 bits), 4 lines (ID), Valid, Dirty and Shared bits.
• The Bus, with indication of address and operation. These bus operations are coming from other processors which aren’t shown.
• Main Memory.
Note: For this somewhat simplified example we won’t concern ourselves with how many bytes (or words) are in each line. Assume that it’s more than one.
Initial state (processor request 000000, bus 000000):
Tag  ID V D S
0000 00 0 0 0
0000 01 0 0 0
0000 10 0 0 0
0000 11 0 0 0
Processor issues a read (processor request 101010, bus 000000)
Cache reports... MISS, because the tags don’t match!
Data read from memory:
Tag  ID V D S
0000 00 0 0 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0
The S bit indicates that this line is “shared”, which means other caches might have the same value.
From now on we will show these as 2 step operations:
• Step 1: the request.
• Step 2: what was the result and the change to the cache.
A write... (processor request 111100, bus 000000)
Write Miss:
Tag  ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0
Keep in mind that since most cache configurations have multiple bytes per line, a write miss will actually require us to get the line from memory into the cache first, since we are only writing one byte into the line.
Note: The dirty bit signifies that the data in the cache is not the same as in memory.
Another read... (processor request 101010)
…this time a hit! The cache is unchanged.
Now another write... (processor request 111100)
To a dirty line! This is a write hit and since the shared bit is 0 we know we are in the exclusive state.
Now another processor, failing to find what it needs in its cache, goes to the bus… a “bus read miss” (bus 010101, processor request 000000)
Our cache, which is monitoring the bus or snooping, sees the miss but can’t help.
Another bus request... (bus 101010)
Since we have this value in our cache we can satisfy the request from our cache, assuming that this will be quicker than from memory.
And another request. This time to a dirty line. (bus 111100)
We have to supply the value out of our cache since it is more current than the value in memory.
We also mark it as shared. Why?
Tag  ID V D S
1111 00 1 1 1
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0
If, for example, our next operation was a write to this line... (processor request 111100)
We would have to note that it was again exclusive and let the other caches know. ZAP
Tag  ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 1 0 1
0000 11 0 0 0
We could then write repeatedly to this line and since we have exclusive ownership no one has to know!
In a similar way we must respond to write misses by other caches. (bus 101010, processor request 000000)
In this case we know that some other processor is going to have a newer value so we must mark this line as invalid.
Tag  ID V D S
1111 00 1 1 0
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0
Now assume some other processor requests a byte from the 111100 line of its cache. (bus 111100)
Since our line is marked valid and exclusive the other caches should be marked as invalid.
So the first thing that the other cache will do is a read to get the correct value for all the bytes in the line before it writes the one new byte.
Our cache supplies the value and goes to the shared state.
Tag  ID V D S
1111 00 1 1 1
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0
Sometime later, for whatever reason, the other cache writes back the value.
This requires us to mark our line as invalid since we no longer have the most current value for this line.
Tag  ID V D S
1111 00 0 0 1
0000 01 0 0 0
1010 10 0 0 1
0000 11 0 0 0
We don’t need to worry about the dirty bit since we already supplied that value to the other cache. Its entry should now be marked as dirty.
This concludes our demonstration of some basic multiprocessor techniques for guaranteeing cache coherency and consistency.
Questions?
“One if by Bus, Two if by Network”
Multiprocessors connected by network
[Diagram: three computers with private memory connected by a Local Area Network]
A Parallel Program
• Task: Sum 100,000 numbers using 100 processors
• Processors can communicate via message passing system
    send(x,y);   /* sends unit x the value y */
    receive();   /* receives a message from the net */
• Step one: Send 1000 numbers out to each processor
• Each processor knows its number: Pn
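Code
sum = 0;
for (i = 0; i < 1000; i++)         /* each processor sums its own 1000 numbers */
    sum += A[i];
limit = 100;                       /* processors 0..limit-1 still hold partial sums */
half = 100;
do {
    half = (half + 1) / 2;         /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)  /* upper half sends its partial sum */
        send(Pn - half, sum);
    if (Pn <= (limit / 2 - 1))     /* lower half receives and accumulates */
        sum = sum + receive();
    limit = half;                  /* upper limit of senders for the next round */
} while (half > 1);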
Snooping?
• The preceding example only used message passing for communication
• But can we do loads and stores over the network?
• Yes, but lacking a single bus, snooping protocols no longer work.
• Solution: Directories which know where each block is located and the status thereof.
Clusters
• Simple idea
• Use off-the-shelf whole computers
• Plus high speed network technology
• Wait! What’s the difference?
NW vs. Cluster
• Networked Multiprocessors
  – Cost to administer 1
  – Connect via Memory Bus
  – Shared memory: 1 O/S
  – Not as robust
  – More costly
• Clusters
  – Cost to administer N
  – Connect via I/O Bus
  – N Memories: N O/S’s
  – Easier to replace 1
  – Mass market
Network Topologies
[Figure: a. 2D grid or mesh of 16 nodes; b. n-cube tree of 8 nodes (8 = 2^3 so n = 3)]
[Figure: a. Crossbar; b. Omega network; c. Omega network switch box (ports labeled A, B, C, D). Each network connects processors P0 through P7.]
More?
http://www.cc.gatech.edu/projects/ihpcl/index.html
Questions?