28
Copyright Josep Torrellas 2003 1 Cache Coherence Instr uct or: Josep Torr ellas CS533

Cache Coherence 1

Embed Size (px)

Citation preview

Page 1: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 1/28

Copyright Josep Torrellas 2003 1

Cache Coherence

Instructor: Josep Torrellas

CS533

Page 2: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 2/28

Copyright Josep Torrellas 2003 2

The Cache Coherence Problem

• Caches are critical to modern high-speed processors• Multiple copies of a block can easily get inconsistent

– processor writes. I/O writes,..

P P

Cache CacheA = 5 A = 53

A = 7

Memory

A = 51 2

Page 3: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 3/28

Copyright Josep Torrellas 2003 3

Cache Coherence Solutions• Software based vs hardware based

• Software-based:

– Compiler based or with run-time system support

– With or without hardware assist– Tough problem because perfect information is needed

in the presence of memory aliasing and explicit

parallelism

• Focus on hardware based solutions as they are morecommon

Page 4: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 4/28

Copyright Josep Torrellas 2003 4

Hardware Solutions• The schemes can be classified based on :

– Shared caches vs Snoopy schemes vs. Directory

schemes

– Write through vs. write-back (ownership-based)protocols

– update vs. invalidation protocols

– dirty-sharing vs. no-dirty-sharing protocols

Page 5: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 5/28

Copyright Josep Torrellas 2003 5

Snoopy Cache Coherence Schemes

• A distributed cache coherence scheme based on the notionof a snoop that watches all activity on a global bus, or is

informed about such activity by some global broadcast

mechanism.

• Most commonly used method in commercial

multiprocessors

Page 6: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 6/28

Copyright Josep Torrellas 2003 6

Write Through Schemes

• All processor writes result in :– update of local cache and a global bus write that :

• updates main memory

• invalidates/updates all other caches with that item

• Advantage : Simple to implement

• Disadvantages : Since ~15% of references are writes, this

scheme consumes tremendous bus bandwidth . Thus only a

few processors can be supported.⇒ Need for dual tagging caches in some cases

Page 7: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 7/28

Copyright Josep Torrellas 2003 7

Write-Back/Ownership Schemes

• When a single cache has ownership of a block, processorwrites do not result in bus writes thus conservingbandwidth.

• Most bus-based multiprocessors nowadays use suchschemes.

• Many variants of ownership-based protocols exist:

– Goodman’s write -once scheme

– Berkley ownership scheme

– Firefly update protocol

– …

• We will discuss a few of these

Page 8: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 8/28

Copyright Josep Torrellas 2003 8

Invalidation vs. Update Strategies

1. Invalidation : On a write, all other caches with a copy are invalidated

2. Update : On a write, all other caches with a copy are updated

• Invalidation is bad when :

– single producer and many consumers of data.

• Update is bad when :

– multiple writes by one PE before data is read by another PE.

– Junk data accumulates in large caches (e.g. process migration).

• Overall, invalidation schemes are more popular as the default.

Page 9: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 9/28

Copyright Josep Torrellas 2003 9

Dirty

SharedInvalid

Bus Write Miss

Bus invalidate

P-read

Bus-read

P- Read

P-read

   P  -  w  r   i   t  e

   B  u  s   W  r   i   t  e   M   i  s  s

  B  u s - r

 e a d

P-write

P-write

P- Read

P-write

Page 10: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 10/28

Copyright Josep Torrellas 2003 10

Illinois Scheme

• States: I, VE (valid-exclusive), VS (valid-shared), D (dirty)• Two features :

– The cache knows if it has an valid-exclusive (VE) copy. In VE

state no invalidation traffic on write-hits.

– If some cache has a copy, cache-cache transfer is used.

• Advantages:

– closely approximates traffic on a uniprocessor for sequential pgms.

– In large cluster-based machines, cuts down latency

• Disadvantages:

– complexity of mechanism that determines exclusiveness– memory needs to wait before sharing status is determined

Page 11: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 11/28

Copyright Josep Torrellas 2003 11

Dirty

SharedInvalid

Bus Write Miss

Bus invalidate

P-read [someone has it]

Bus-read

P- Read

P-read

   P  -  w  r   i   t  e

   B  u  s   W

  r   i   t  e   M   i  s  s

  B  u s - r

 e a d

P-write

  P - w r i t e

P- Read [someone has it]

P-write

Valid

Exclusive

B  u s  W  r  i  t e  M  i  s s 

P-read

[no one else has it]

Bus-read

P-write

P-read

[no one else has it]

P- Read

Page 12: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 12/28

Copyright Josep Torrellas 2003 12

DEC Firefly Scheme

• Classification: Write-back, update, no-dirty-sharing.• States :

– VE (valid exclusive): only copy and clean

– VS (valid shared) : shared -clean copy. Write hits result

in updates to memory and other caches and entry remainsin this state

– D(dirty): dirty exclusive (only copy)

• Used special “shared line” on bus to detect sharing status of 

cache line

• Supports producer-consumer model well

• What about sequential processes migrating between CPU’s?

Page 13: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 13/28

Copyright Josep Torrellas 2003 13

Dirty

Shared

Valid

Exclusive

Bus Read/Write

Bus write-miss

P-Write and not SL

Bus-read/write

P- Write and SL

P-read

P-read

   P  -  w  r   i   t  e

  B  u s -  w r

  i t e  m  i s s

P-write

Bus Read

[update MM]

P- Read and SL

P-write Miss

and not SL

P-Read

[no one else has it]

Bus Write miss

P-Write M

and SL

P Read

Page 14: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 14/28

Copyright Josep Torrellas 2003 14

Directory Based Cache Coherence

Key idea :keep track in a global directory (in main

memory) of which processors are caching a

location and the state.

Page 15: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 15/28

Copyright Josep Torrellas 2003 15

Motivation

• Snoopy schemes do not scale because they rely onbroadcast

• Hierarchical snoopy schemes have the root as a bottleneck 

• Directory based schemes allow scaling

– They avoid broadcasts by keeping track of all Pescaching a memory block, and then using point-to-point

messages to maintain coherence

– They allow the flexibility to use any scalable point-to-

point network 

Page 16: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 16/28

Copyright Josep Torrellas 2003 16

Basic Scheme (Censier and Feautrier)

p p

Interconnection network 

cache cache

Directory

Dirty bitPresence bits

memory

•Assume K processors

•With each cache-block in

memory: K presence bits and 1

dirty bit

•With each cache-block in cache :

1 valid bit and 1 dirty (owner) bit

Page 17: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 17/28

Copyright Josep Torrellas 2003 17

Read Miss

Read from main-memory by PE_i

– If dirty bit is off then {read from main memory;turn p[i]ON; }

– If dirty bit is ON then {recall line from dirty PE (cachestate to shared); update memory; turn dirty-bit OFF;turnp[i] ON; supply recalled data to PE_i;}

Page 18: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 18/28

Copyright Josep Torrellas 2003 18

If dirty bit ON then

{recall the data from owner PE which invalidates itself;

(update memory); clear bit of previous owner; forward

data to PE i; turn bit PE[I] on; (dirty bit ON all the time) }

If dirty-bit OFF then{supply data to PE_i; send invalidations to all PE’scaching that block and clear their P[k] bits; turn dirty bitON; turn P[i] ON; .. }

Write Miss

Page 19: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 19/28

Copyright Josep Torrellas 2003 19

Write- hit to data valid (not owned ) in cache:

{access memory-directory; send invalidations to all PE’s

caching block; clear their P[k] bits; supply data to PE i ;turn dirty bit ON ; turn PE[i] ON }

Write Hit to Non-Owned Data

Page 20: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 20/28

Copyright Josep Torrellas 2003 20

Key Issues

• Scaling of memory and directory bandwidth– Cannot have main memory or directory memory

centralized

– Need a distributed cache coherence protocol

• As shown, directory memory requirements do not scale

well

– Reason is that the number of presence bits needed

grows as the number of Pes. --> But how many bitsreally needed?

– Also: the larger the main memory is, the larger the

directory

Page 21: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 21/28

Copyright Josep Torrellas 2003 21

Directory Organizations

• Memory-based schemes (DASH) vs Cache-based schemes(SCI)

• Cache-based schemes (or linked-list based)

– Singly linked

– Doubly-linked (SCI)• Memory-based schemes (or pointer-based)

– Full map (Dir-N) vs Partial-map schemes (Dir-i-B, Dir-

i-CV-r,…)

– Dense (DASH) vs Sparse directory schemes

Page 22: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 22/28

Copyright Josep Torrellas 2003 22

Pointer-Based Coherence Schemes

• The Full Bit Vector Scheme• Limited Pointer Schemes

• Sparse Directories (Caching)

• LimitLess (Software Assistance)

Page 23: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 23/28

Copyright Josep Torrellas 2003 23

The Full Bit Vector Scheme

• One bit of directory memory per main-memory block perPE

• Memory requirements are P x (P x M/B), where P is the

number of PE, M is main memory per PE, and B is cache

block size (not counting the dirty bit)

• Invalidation traffic is best

• One way to reduce the overhead is to increase B

– Can result in false sharing and increased coherence

traffic• Overhead not too large for medium-scale mps

– Example: 256 PE organized as 64 4-PE clusters with

64-byte cache blocks ---> 12% memory overhead

Page 24: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 24/28

Copyright Josep Torrellas 2003 24

Limited Pointer Schemes

• Since data is expected to be in only a few caches at anyone time, a limited number of pointers per directory entry

should suffice

• Overflow strategy: what to do when the number of sharers

exceeds the number of pointers?

• Many different schemes based on different overflow

strategies

Page 25: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 25/28

Copyright Josep Torrellas 2003 25

Some Examples

• Dir-i-B– Beyond i-pointers, set the inval-broadcast bit ON

– Storage needed is: i x log(P) x PM/B (in addition to inval-broadcast

bit)

– Expected to do well since widely shared data is not written often

• Dir-i-NB

– When sharers exceed i, invalidate one of the existing sharers

– Significant degradation expected for widely-shared mostly-read data

• Dir-i-CV-r

– When sharers exceed i, use bits allocated to i poiters as a coarseresolution vector (each bit points to multiple PE)

– Always results in less coherence traffic than Dir-i-B

• LimitLess directories: Handle overflow using SW traps

Page 26: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 26/28

Copyright Josep Torrellas 2003 26

Performance of Directories

• Figure10 in Gupta et al paper• Figure 7 in Gupta et al paper

Page 27: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 27/28

Copyright Josep Torrellas 2003 27

LimitLess Directories

• Limit number of pointers• On overflow:

– Memory module interrupts the local processor

– Processor emulates the full-map directory for block 

Page 28: Cache Coherence 1

8/8/2019 Cache Coherence 1

http://slidepdf.com/reader/full/cache-coherence-1 28/28

Copyright Josep Torrellas 2003 28

LimitLess Directories Require...

• Rapid trap handler: trap code executes within 5-10 cyclesfrom trap initiation)

• Software has complete access to coherence controller

• Interface to the network that allows the processor to launch

and intercept coherence protocol packets