
Computer Architecture

Chapter 8

Multiprocessors

Shared Memory Architectures

Prof. Jerry Breecher

CSCI 240

Fall 2003


Chapter Overview

We're going to cover only one section from this chapter: the part related to how caches from multiple processors interact with each other.

8.1 Introduction – the big picture

8.3 Centralized Shared Memory Architectures


Introduction

8.1 Introduction

8.3 Centralized Shared Memory Architectures

The Big Picture: Where are We Now?

The major issue is this:

We’ve taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data?

How do we keep all copies of the data in sync with each other?


The Multiprocessor Picture

[Figure: example Pentium system organization. Processors and memory sit on a processor/memory bus, which is bridged to a PCI bus and then to the I/O busses.]


Shared Memory Multiprocessor

[Figure: four processors, each with its own registers and caches, connected through a chipset to shared memory and to disk and other I/O.]

• Memory: centralized with Uniform Memory Access time ("UMA") and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro


Shared Memory Multiprocessor (Conceptual Model)

• Several processors share one address space
– conceptually a shared memory
– often implemented just like a multicomputer: address space distributed over private memories
• Communication is implicit
– read and write accesses to shared memory locations
• Synchronization via shared memory locations
– spin waiting for non-zero (see the sketch below)
– barriers

[Figure: conceptual model. Processors (P), each with a memory (M), connected by a network/bus.]
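A minimal sketch of the "spin waiting for non-zero" idiom, assuming C11 atomics; the variable names and the value 42 are made up for illustration and are not from the slides:

#include <stdatomic.h>

atomic_int flag = 0;        /* shared memory location used for synchronization */
int shared_value;           /* data protected by the flag */

void producer(void)         /* runs on one processor */
{
    shared_value = 42;                  /* write the shared data ...      */
    atomic_store(&flag, 1);             /* ... then set the flag non-zero */
}

void consumer(void)         /* runs on another processor */
{
    while (atomic_load(&flag) == 0)
        ;                               /* spin waiting for non-zero */
    int v = shared_value;               /* safe to read once the flag is seen */
    (void)v;
}

Note that all communication here is just loads and stores to shared locations; no explicit messages are exchanged.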


Message Passing Multicomputers

• Computers (nodes) connected by a network
– Fast network interface: send, receive, barrier (see the sketch below)
– Nodes are no different from a regular PC or workstation
• Cluster of conventional workstations or PCs with a fast network
– cluster computing
– Berkeley NOW
– IBM SP2

[Figure: nodes, each a processor (P) with its own memory (M), connected by a network.]
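On a message-passing multicomputer the communication is explicit. The sketch below uses MPI purely as a familiar example of the send/receive/barrier style; the slides do not name MPI, so treat the specific calls as an illustrative assumption.

#include <mpi.h>
#include <stdio.h>

/* run with at least two processes, e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* node 0 sends the data explicitly */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {           /* node 1 receives it over the network */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d\n", value);
    }

    MPI_Barrier(MPI_COMM_WORLD);      /* all nodes wait here before continuing */
    MPI_Finalize();
    return 0;
}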


Large-Scale MP Designs

Memory: distributed with nonuniform memory access time ("NUMA") and scalable interconnect (distributed memory).

[Figure: a distributed-memory organization built around a low-latency, high-reliability interconnect, with access latencies annotated as 1 cycle, 40 cycles, and 100 cycles.]


Shared Memory Architectures

In this section we will understand the issues around:

• Sharing one memory space among several processors.

• Maintaining coherence among several copies of a data item.

8.1 Introduction

8.3 Centralized Shared Memory Architectures


The Problem of Cache Coherency

[Figure: a CPU whose cache holds copies A' and B' of memory locations A and B, with an I/O device that also reads and writes memory, shown in three situations.]

a) Cache and memory coherent: A' = A (100), B' = B (200).

b) The CPU writes 550 into its cached copy A', but memory still holds A = 100, so an I/O output of A gives the stale value 100. Cache and memory incoherent: A' ≠ A.

c) I/O inputs 440 into B in memory, but the cache still holds B' = 200. Cache and memory incoherent: B' ≠ B.

Shared Memory Architectures
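A minimal sketch of scenario (b), assuming a write-back cache. The two plain variables stand in for memory location A and the cached copy A'; none of this code comes from the slides.

#include <stdio.h>

int memory_A = 100;    /* location A in main memory */
int cached_A = 100;    /* A', the CPU's cached copy of A */

int main(void)
{
    cached_A = 550;    /* CPU writes A; with write back, only the cache changes */

    /* an I/O device reads main memory directly and sees the stale value */
    printf("I/O output of A: %d\n", memory_A);   /* prints 100 */
    printf("CPU's view of A: %d\n", cached_A);   /* prints 550 */
    return 0;
}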


Some Simple Definitions
Shared Memory Architectures

Mechanism     | How It Works                                                  | Performance                                   | Coherency Issues
Write Back    | Write modified data from cache to memory only when necessary. | Good - doesn't tie up memory bandwidth.       | Can have problems with various copies containing different values.
Write Through | Write modified data from cache to memory immediately.         | Not so good - uses a lot of memory bandwidth. | Modified values always written to memory; data always matches.
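A toy sketch of the two policies, with made-up structure and function names (an assumed illustration, not code from the text). Under write through, memory is updated on every store; under write back, it is updated only when the dirty line is eventually evicted:

#include <string.h>

#define LINE_SIZE 64

struct cache_line {
    int dirty;                          /* set when the cached copy differs from memory */
    unsigned char data[LINE_SIZE];
};

unsigned char memory[1 << 20];          /* toy main memory */

/* store one byte through the cache; write_through selects the policy */
void cache_write(struct cache_line *line, unsigned long addr,
                 unsigned char value, int write_through)
{
    line->data[addr % LINE_SIZE] = value;    /* always update the cached copy          */
    if (write_through)
        memory[addr] = value;                /* write through: memory updated now      */
    else
        line->dirty = 1;                     /* write back: memory stays stale for now */
}

/* eviction is the "only when necessary" moment for a write-back cache */
void evict(struct cache_line *line, unsigned long line_base_addr)
{
    if (line->dirty)
        memcpy(&memory[line_base_addr], line->data, LINE_SIZE);
    line->dirty = 0;
}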


What Does Coherency Mean?

• Informally:

– “Any read must return the most recent write”

– Too strict and too difficult to implement

• Better:

– “Any write must eventually be seen by a read”

– All writes are seen in proper order (“serialization”)

• Two rules to ensure this:

– “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”

– Writes to a single location are serialized: seen in one order

• Latest write will be seen

• Otherwise could see writes in illogical order (could see older value after a newer value)

Shared Memory Architectures


There are Different Types of Memory In The Cache

What kinds of memory are there in the cache?

Shared Memory Architectures

Test_and_set(lock);
shared_data = xyz;
Clear(lock);

TYPE           | Shared?   | Writable? | How Kept Coherent
Code           | Shared    | No        | No need.
Private Data   | Exclusive | Yes       | Write Back
Shared Data    | Shared    | Yes       | Write Back *
Interlock Data | Shared    | Yes       | Write Through **

* Write back gives good performance here; using write through instead would degrade performance.

** Write through means the new lock value is pushed out of the cache to memory immediately, so the lock state is seen by everyone right away.
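The Test_and_set / Clear pseudocode above is typically realized as a spin loop on an atomic read-modify-write. A hedged C11 sketch, with names assumed for illustration:

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;   /* the interlock: clear means "free" */
int shared_data;

void update(int xyz)
{
    while (atomic_flag_test_and_set(&lock))
        ;                              /* spin until test-and-set finds the lock free */
    shared_data = xyz;                 /* critical section */
    atomic_flag_clear(&lock);          /* Clear(lock): release */
}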


Potential HW Coherency Solutions

• Snooping Solution (Snoopy Bus):

– Send all requests for data to all processors

– Processors snoop to see if they have a copy and respond accordingly

– Requires broadcast, since caching information is at processors

– Works well with bus (natural broadcast medium)

– Dominates for small scale machines (most of the market)

• Directory-Based Schemes

– Keep track of what is being shared in one centralized place

– Distributed memory => distributed directory for scalability(avoids bottlenecks)

– Send point-to-point requests to processors via network

– Scales better than Snooping

– Actually existed BEFORE Snooping-based schemes

Shared Memory Architectures


An Example Snoopy Protocol (Maintained by Hardware)

Invalidation protocol, write-back cache

Each block of memory is in one state:

Clean in all caches and up-to-date in memory (Shared)

OR Dirty in exactly one cache (Exclusive)

OR Not in any caches

Each cache block is in one state (track these):

Shared : block can be read

OR Exclusive : cache has the only copy, it is writeable, and it is dirty

OR Invalid : block contains no data

Read misses: cause all caches to snoop bus

Writes to a clean line are treated as misses

Shared Memory Architectures


Snoopy-Cache State Machine-I

• State machine for CPU requests, for each cache block (applies to write-back data)

Cache block states: Invalid, Shared (read only), Exclusive (read/write)

[State diagram for CPU requests:]
Invalid   -- CPU read  --> Shared    : place read miss on bus
Invalid   -- CPU write --> Exclusive : place write miss on bus
Shared    -- CPU read hit --> Shared
Shared    -- CPU read miss --> Shared : place read miss on bus
Shared    -- CPU write --> Exclusive  : place write miss on bus
Exclusive -- CPU read hit / CPU write hit --> Exclusive
Exclusive -- CPU read miss --> Shared : write back block, place read miss on bus
Exclusive -- CPU write miss --> Exclusive : write back cache block, place write miss on bus

Shared Memory Architectures
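The CPU-request half of the protocol can be written as a small transition function. This is an illustrative sketch, not code from the text; the type and function names are assumptions, and the function reports the bus transaction the cache controller must issue along with the next state.

typedef enum { INVALID, SHARED, EXCLUSIVE } State;

typedef enum {
    BUS_NONE,
    BUS_READ_MISS,                  /* "place read miss on bus"                */
    BUS_WRITE_MISS,                 /* "place write miss on bus"               */
    BUS_WRITE_BACK_READ_MISS,       /* write back dirty block, then read miss  */
    BUS_WRITE_BACK_WRITE_MISS       /* write back dirty block, then write miss */
} BusAction;

/* next state of one cache block after a CPU access (state machine I) */
State cpu_request(State s, int is_write, int hit, BusAction *bus)
{
    *bus = BUS_NONE;
    switch (s) {
    case INVALID:                                 /* always a miss */
        *bus = is_write ? BUS_WRITE_MISS : BUS_READ_MISS;
        return is_write ? EXCLUSIVE : SHARED;
    case SHARED:
        if (is_write) {                           /* write to a clean line is treated as a miss */
            *bus = BUS_WRITE_MISS;
            return EXCLUSIVE;
        }
        if (!hit)                                 /* read miss: a different address maps here */
            *bus = BUS_READ_MISS;
        return SHARED;                            /* read hit needs no bus traffic */
    case EXCLUSIVE:
        if (hit)                                  /* read hit or write hit */
            return EXCLUSIVE;
        if (is_write) {                           /* write miss: write back, then write miss */
            *bus = BUS_WRITE_BACK_WRITE_MISS;
            return EXCLUSIVE;
        }
        *bus = BUS_WRITE_BACK_READ_MISS;          /* read miss: write back, then read miss */
        return SHARED;
    }
    return s;                                     /* unreachable */
}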


Snoopy-Cache State Machine-II

• State machine for bus requests, for each cache block
• Appendix E gives details of bus requests

[State diagram for bus requests:]
Shared    -- write miss for this block --> Invalid
Exclusive -- read miss for this block  --> Shared  : write back block (abort memory access)
Exclusive -- write miss for this block --> Invalid : write back block (abort memory access)

Shared Memory Architectures
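The bus-request half can be sketched the same way. This companion function is again an assumption, reusing the State enum from the previous sketch; it describes how a cache reacts when it snoops another processor's miss for a block it currently holds.

/* reaction of one cache block to a snooped bus request (state machine II) */
State bus_request(State s, int remote_is_write, int *write_back_block)
{
    *write_back_block = 0;           /* set when this cache must supply the dirty block */
    switch (s) {
    case INVALID:
        return INVALID;              /* nothing held here; ignore the request */
    case SHARED:
        /* a remote read can be served by memory; a remote write invalidates our copy */
        return remote_is_write ? INVALID : SHARED;
    case EXCLUSIVE:
        /* the only up-to-date copy is here: write it back, aborting the memory access */
        *write_back_block = 1;
        return remote_is_write ? INVALID : SHARED;
    }
    return s;                        /* unreachable */
}

Together with cpu_request() above, this is enough to replay the five-reference example worked through on the following slides.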


Example

Trace the following sequence of references. For each step, record the block state, address, and value in P1's cache and in P2's cache, the bus activity (action, processor, address, value), and the memory contents (address, value):

P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2

Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block, but A1 ≠ A2.

[Figure: the combined state diagram (Invalid, Shared, Exclusive) showing both the CPU-request and bus-request transitions; this is the cache for P1.]

Shared Memory Architectures


Example: Step 1

P1 writes 10 to A1. P1's block is Invalid, so this is a write miss: P1 places a write miss on the bus and the block becomes Exclusive in P1's cache.

Step               | P1 (state, addr, value) | P2 (state, addr, value) | Bus (action, proc., addr, value) | Memory (addr, value)
P1: Write 10 to A1 | Excl., A1, 10           | --                      | WrMs, P1, A1                     | --

[Figure: the highlighted transition is Invalid -> Exclusive via a write miss on the bus.]

Shared Memory Architectures


Example: Step 2

P1 reads A1. This is a read hit in P1's cache, so there is no bus activity and no state change.

Step        | P1 (state, addr, value) | P2 (state, addr, value) | Bus (action, proc., addr, value) | Memory (addr, value)
P1: Read A1 | Excl., A1, 10           | --                      | --                               | --

[Figure: the highlighted transition is the Exclusive self-loop on a CPU read hit.]

Shared Memory Architectures


Example: Step 3

P2 reads A1. P2 places a read miss on the bus; P1, which holds the block dirty, writes it back (updating memory) and drops to Shared; P2 then receives the data and also holds the block Shared.

Step        | P1 (state, addr, value) | P2 (state, addr, value) | Bus (action, proc., addr, value) | Memory (addr, value)
P2: Read A1 | --                      | Shar., A1               | RdMs, P2, A1                     | --
            | Shar., A1, 10           | --                      | WrBk, P1, A1, 10                 | A1, 10
            | --                      | Shar., A1, 10           | RdDa, P2, A1, 10                 | A1, 10

[Figure: the highlighted transitions are Invalid -> Shared via a read miss on the bus (P2's cache) and Exclusive -> Shared with a write back on the remote read (P1's cache).]

Shared Memory Architectures


Example: Step 4

P2 writes 20 to A1. P2 places a write miss on the bus; the remote write invalidates P1's copy, and P2's block becomes Exclusive with the new value. Memory still holds the old value 10.

Step               | P1 (state, addr, value) | P2 (state, addr, value) | Bus (action, proc., addr, value) | Memory (addr, value)
P2: Write 20 to A1 | Inv.                    | Excl., A1, 20           | WrMs, P2, A1                     | A1, 10

[Figure: the highlighted transition is Shared -> Invalid on the remote write (P1's cache).]

Shared Memory Architectures


Example: Step 5

P2 writes 40 to A2. A2 maps to the same cache block as A1, so P2 places a write miss on the bus for A2, writes back the dirty A1 block (memory now holds 20), and ends up holding A2 Exclusive. The complete trace:

Step               | P1 (state, addr, value) | P2 (state, addr, value) | Bus (action, proc., addr, value) | Memory (addr, value)
P1: Write 10 to A1 | Excl., A1, 10           | --                      | WrMs, P1, A1                     | --
P1: Read A1        | Excl., A1, 10           | --                      | --                               | --
P2: Read A1        | --                      | Shar., A1               | RdMs, P2, A1                     | --
                   | Shar., A1, 10           | --                      | WrBk, P1, A1, 10                 | A1, 10
                   | --                      | Shar., A1, 10           | RdDa, P2, A1, 10                 | A1, 10
P2: Write 20 to A1 | Inv.                    | Excl., A1, 20           | WrMs, P2, A1                     | A1, 10
P2: Write 40 to A2 | --                      | --                      | WrMs, P2, A2                     | A1, 10
                   | --                      | Excl., A2, 40           | WrBk, P2, A1, 20                 | A1, 20

Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block, but A1 ≠ A2.

Shared Memory Architectures


Summary

8.1 Introduction – the big picture

8.3 Centralized Shared Memory Architectures

We’ve looked at what happens to caches when we have multiple processors or devices looking at memory.