
Page 1:

ECE 669 Parallel Computer Architecture

Lecture 18: Scalable Parallel Caches

April 6, 2004

Page 2:

Overview

° Most cache protocols are more complicated than two-state

° Snooping is not effective for network-based systems

• Consider three alternative cache coherence approaches

• Full-map, limited directory, chained directory

° Caching will affect network performance

° LimitLESS protocol

• Gives the appearance of a full map

° Practical issues of processor – memory system interaction

Page 3:

Context for Scalable Cache Coherence

[Diagram: nodes, each with processor P, cache $, memory M, and communication assist CA, connected by switches in a scalable network]

° Scalable networks

- many simultaneous transactions

° Scalable distributed memory

° Realizing pgm models through net transaction protocols

- efficient node-to-net interface

- interprets transactions

° Caches naturally replicate data

- coherence through bus snooping protocols

- consistency

° Need cache coherence protocols that scale!

- no broadcast or single point of order

Page 4:

Generic Solution: Directories

° Maintain state vector explicitly

• associate with memory block

• records state of block in each cache

° On miss, communicate with directory

• determine location of cached copies

• determine action to take

• conduct protocol to maintain coherence

[Diagram: nodes, each with a processor, cache, and communication assist, plus memory with a per-block directory, connected by a scalable interconnection network]

Page 5:

A Cache Coherent System Must:

° Provide set of states, state transition diagram, and actions

° Manage coherence protocol

• (0) Determine when to invoke coherence protocol

• (a) Find info about state of block in other caches to determine action

- whether need to communicate with other cached copies

• (b) Locate the other copies

• (c) Communicate with those copies (inval/update)

° (0) is done the same way on all systems

• state of the line is maintained in the cache

• protocol is invoked if an “access fault” occurs on the line

° Different approaches distinguished by (a) to (c)

Page 6:

Coherence in small machines: Snooping Caches

• Broadcast address on shared write

• Everyone listens (snoops) on bus to see if any of their own addresses match

• How do you know when to broadcast, invalidate...

- State associated with each cache line

[Diagram: two processors, each with a dual-ported cache and cache directory, on a shared bus to memory M; (1) one processor writes address a, (2) the address is broadcast on the bus, (3) the other cache snoops, (4) matches a in its directory, and (5) purges its copy]
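As a rough illustration of the snoop step above, here is a C sketch; the names, the direct-mapped organization, and the cache size are assumptions, not taken from the lecture:

#define CACHE_LINES 256                    /* assumed cache size in lines */

typedef struct {
    unsigned tag[CACHE_LINES];
    int      valid[CACHE_LINES];
} cache_dir_t;                             /* the per-cache "cache directory" */

/* Called by every cache (via its dual-ported directory) for each block
   address broadcast on the bus by another processor's write. */
void snoop(cache_dir_t *c, unsigned addr)
{
    unsigned index = addr % CACHE_LINES;   /* direct-mapped for simplicity */
    unsigned tag   = addr / CACHE_LINES;
    if (c->valid[index] && c->tag[index] == tag)
        c->valid[index] = 0;               /* purge (invalidate) the matching line */
}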

Page 7:

State diagram for ownership protocols

• For each shared data cache block

• In ownership protocol: writer owns exclusive copy

[State diagram, one per shared data cache block, with states invalid, read-clean, and write-dirty:

- invalid -> read-clean on a local read

- invalid or read-clean -> write-dirty on a local write (broadcast address a)

- write-dirty -> read-clean on a remote read

- read-clean or write-dirty -> invalid on a remote write (or replacement)]
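The diagram can be read as a transition function over these three states. The following C sketch is a hypothetical rendering; the state, event, and helper names are mine, not from the lecture:

/* Cache-side states for one shared data block in an ownership protocol. */
typedef enum { INVALID, READ_CLEAN, WRITE_DIRTY } block_state_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

void broadcast_address(void);   /* assumed bus helper: invalidates other copies */

block_state_t next_state(block_state_t s, event_t e)
{
    switch (e) {
    case LOCAL_READ:                     /* invalid -> read-clean */
        return (s == INVALID) ? READ_CLEAN : s;
    case LOCAL_WRITE:                    /* claim ownership, invalidate others */
        if (s != WRITE_DIRTY)
            broadcast_address();
        return WRITE_DIRTY;
    case REMOTE_READ:                    /* write-dirty -> read-clean */
        return (s == WRITE_DIRTY) ? READ_CLEAN : s;
    case REMOTE_WRITE:                   /* also used for replacement */
        return INVALID;
    }
    return s;
}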

Page 8:

Maintaining coherence in large machines

• Software

• Hardware - directories

° Software coherence typically yields weak coherence

i.e., coherence at sync points (or fence points)

° E.g.: When using critical sections for shared ops...

° Code

° How do you make this work?

Shared data: foo1, foo2, foo3, foo4

GET_FOO_LOCK
/* MUNGE WITH FOOs */
Foo1 =
X = Foo2
Foo3 =
RELEASE_FOO_LOCK
...

Page 9:

Situation

° Flush foo* from cache, wait till done

° Issues

• Lock?

• Must be conservative

- Lose some locality

• But, can exploit application characteristics

- e.g. TSP, Chaotic: allow some inconsistency

• Need special processor instructions (see the sketch below)

[Diagram: the critical section reads and writes shared locations (a1, a2, ...), then flushes them and waits before the unlock; two processors' caches write the foo lines back to the home memory]
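A concrete (hypothetical) C rendering of the critical section from the previous page under this scheme; cache_flush and flush_wait are made-up intrinsics standing in for the special processor instructions, and the stored values are placeholders:

/* assumed lock and flush primitives (hypothetical names) */
void get_foo_lock(void);
void release_foo_lock(void);
void cache_flush(void *addr);   /* write back and invalidate one cache line */
void flush_wait(void);          /* stall until all outstanding flushes complete */

extern int foo1, foo2, foo3, foo4;

void munge_with_foos(void)
{
    get_foo_lock();

    /* MUNGE WITH FOOs (placeholder values) */
    foo1 = 1;
    int x = foo2;
    foo3 = x + 1;

    /* Weak (software) coherence: before releasing the lock, conservatively
       flush every foo the critical section may have touched, then wait. */
    cache_flush(&foo1);
    cache_flush(&foo2);
    cache_flush(&foo3);
    cache_flush(&foo4);
    flush_wait();

    release_foo_lock();
}

Symmetrically, a processor entering the critical section would invalidate its stale foo lines before the first read, which is where some locality is lost.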

Page 10:

Scalable Approach: Directories

° Every memory block has associated directory information

• keeps track of copies of cached blocks and their states

• on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary

• in scalable networks, communication with directory and copies is through network transactions

° Many alternatives for organizing directory information

Page 11:

Basic Operation of Directory

• k processors.

• With each cache-block in memory: k presence-bits, 1 dirty-bit

• With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit

[Diagram: processors with caches connected by an interconnection network to memory; each memory block's directory entry holds the presence bits and dirty bit]

• Read from main memory by processor i:

• If dirty-bit OFF then { read from main memory; turn p[i] ON; }

• if dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i;}

• Write to main memory by processor i:

• If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
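These rules translate almost directly into memory-controller code. The sketch below is a hedged C rendering: NUM_PROCS and the helper functions are assumed names for the network transactions involved, and the dirty-bit-ON write case, elided on the slide, is filled in only as a plausible guess:

#include <stdbool.h>

#define NUM_PROCS 64                      /* k processors (assumed value) */

/* assumed network-transaction helpers, not defined here */
void send_data_from_memory(int proc);
void send_invalidate(int proc);
void recall_line(int owner);              /* downgrade or flush the owner's copy */
void update_memory_from(int owner);

typedef struct {
    bool present[NUM_PROCS];              /* k presence bits */
    bool dirty;                           /* 1 dirty bit */
    int  owner;                           /* meaningful only while dirty is set */
} dir_entry_t;

/* Read from main memory by processor i */
void dir_read(dir_entry_t *d, int i)
{
    if (!d->dirty) {
        send_data_from_memory(i);         /* read from main memory */
    } else {
        recall_line(d->owner);            /* dirty proc's cache state -> shared */
        update_memory_from(d->owner);
        d->dirty = false;
        send_data_from_memory(i);         /* supply recalled data to i */
    }
    d->present[i] = true;                 /* turn p[i] ON */
}

/* Write to main memory by processor i */
void dir_write(dir_entry_t *d, int i)
{
    if (!d->dirty) {
        for (int p = 0; p < NUM_PROCS; p++)
            if (d->present[p] && p != i) {
                send_invalidate(p);       /* invalidate all other cached copies */
                d->present[p] = false;
            }
        send_data_from_memory(i);         /* supply data to i */
    } else {
        /* not shown on the slide: recall and invalidate the old owner's copy */
        recall_line(d->owner);
        d->present[d->owner] = false;
        send_data_from_memory(i);
    }
    d->dirty = true;                      /* turn dirty-bit ON */
    d->owner = i;
    d->present[i] = true;                 /* turn p[i] ON */
}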

Page 12:

Scalable dynamic schemes

• Limited directories

• Chained directories

• LimitLESS schemes

- use software

Other approach: Full Map (not scalable)

Page 13:

General directories:

• On write, check directory

- if shared, send invalidate messages

• Distribute directories with MEMs

- Directory bandwidth scales in proportion to memory bandwidth

• Most directory accesses occur during memory accesses -- so not too many extra network requests (except a write to a variable others are reading)

Scalable hardware schemes

Limited directories

Chained directories

LimitLESS directories

[Diagram: processors with caches; each node pairs its memory MEM with a directory DIR, distributed across the machine]

Page 14:

Memory controller - (directory) state diagram for memory block

[State diagram for each memory block, with states: uncached; 1 or more read copies (directory holds pointers); 1 write copy at cache i (directory holds one pointer). A read takes the block to the read-copies state; a write sends invalidate requests and takes it to the single-write-copy state; a read by processor i while a write copy exists triggers an update request; a replacement or update returns the block to the uncached state.]

Page 15:

[Diagram: processors P, each with cache C and memory M, connected by a network]

Page 16:


Limited directories: Exploit worker set behavior

• Invalidate one existing copy if a 5th processor comes along (sometimes can instead set a broadcast-invalidate bit)

• Rarely do more than 2 processors share

• Insight: the set of 4 pointers can be managed like a fully-associative 4-entry cache on the virtual space of all pointers (see the sketch below)

• But what do you do about widely shared data?
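A minimal sketch of that insight in C, treating the pointer set as a tiny fully-associative cache of sharer IDs; add_sharer, DIR_PTRS, and the eviction policy are assumptions, not the actual hardware:

#define DIR_PTRS 4                         /* hardware pointers per block */

void send_invalidate(int proc);            /* assumed network helper */

typedef struct {
    int ptr[DIR_PTRS];                     /* sharer IDs currently tracked */
    int count;
} limited_dir_t;

/* Record a new sharer; if all pointers are in use (the "5th processor"),
   evict one existing sharer by invalidating its copy. */
void add_sharer(limited_dir_t *d, int proc)
{
    for (int i = 0; i < d->count; i++)
        if (d->ptr[i] == proc)
            return;                        /* already tracked */

    if (d->count == DIR_PTRS) {
        send_invalidate(d->ptr[0]);        /* victim choice is a policy decision */
        d->ptr[0] = d->ptr[--d->count];    /* compact the pointer array */
    }
    d->ptr[d->count++] = proc;
}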

Page 17:

° LimitLESS directories: Limited directories Locally Extended through Software Support

• Trap the processor when the 5th request comes

• Processor extends the directory into local memory (see the sketch below)

[Diagram: processors P with caches C and memories M on a network; (1) the 5th request arrives at the home node, (2) the memory controller traps the home node's processor, (3) software extends the directory]
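A rough C sketch of what the software side of that trap might do; the structures, the use of a simple linked list in local memory, and all names are assumptions rather than the actual LimitLESS implementation:

#include <stdlib.h>

#define HW_PTRS 4                          /* hardware directory pointers */

typedef struct sw_sharer {
    int proc;
    struct sw_sharer *next;
} sw_sharer_t;

typedef struct {
    int ptr[HW_PTRS];                      /* hardware pointer array */
    int count;
    sw_sharer_t *overflow;                 /* software extension in local memory */
} limitless_dir_t;

/* Trap handler run on the home node's processor when the 5th request
   arrives and the hardware pointer array is already full. */
void limitless_overflow_trap(limitless_dir_t *d, int requester)
{
    sw_sharer_t *s = malloc(sizeof *s);    /* extend directory into local memory */
    if (s == NULL)
        return;                            /* a real system must handle this case */
    s->proc = requester;
    s->next = d->overflow;
    d->overflow = s;
    /* A later write must invalidate both the hardware pointers and
       every sharer recorded on this software list. */
}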

Page 18:

Zero pointer LimitLESS: All software coherence

[Diagram: local, home (directory), and remote nodes, each with processor, cache, memory, and communication assist; with zero hardware pointers the home node's directory traps to software on every request]

Page 19:


° Chained directories: Simply a different data structure for the directory

• Link all cached copies in a chain (see the sketch below)

• But longer latencies

• Also more complex hardware

• Must handle replacements of elements in the chain due to misses!

[Diagram: processors P with caches C and memories M on a network; on a write, invalidations propagate cache-to-cache along the chain of sharers]
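In code, the invalidation in the diagram amounts to walking the sharer chain. The sketch below (hypothetical names) drives the walk from one place for clarity; in hardware the invalidation would typically be forwarded cache-to-cache, which is where the extra latency comes from:

/* One entry per cached copy; each copy points at the next sharer. */
typedef struct chain_node {
    int proc;                              /* cache holding a copy */
    struct chain_node *next;
} chain_node_t;

void send_invalidate(int proc);            /* assumed network helper */

/* On a write, invalidations visit the sharers one at a time; the final
   acknowledgement grants write permission to the writer. */
void chain_invalidate(chain_node_t *head)
{
    for (chain_node_t *n = head; n != NULL; n = n->next)
        send_invalidate(n->proc);
    /* latency grows with the chain length, hence the "longer latencies" bullet */
}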

Page 20:

Doubly linked chains

Of course, these data structures can also be maintained through software + messages, as in LimitLESS

[Diagram: processors P with caches C and memories M on a network; the cached copies form a doubly linked chain]

Page 21:

[Diagram: processors P with caches C and memories M on a network; each memory's directory DIR holds a full map of presence bits]

Page 22:

° Full map: Problem

• Does not scale -- need N pointers

• Directories distributed, so not a bottleneck
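For a rough sense of the cost (illustrative numbers, not from the slides): with 64-byte (512-bit) blocks and 1024 processors, a full map needs 1024 presence bits per block, about 200% overhead on top of the data itself, and the overhead grows linearly with N.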

[Diagram: full-map write sequence: (1) the writer sends a write request to the directory, (2) the directory sends invalidations to the sharers, (3) the sharers acknowledge, (4) write permission is granted to the writer]

Page 23:

° Hierarchical - E.g. KSR (actually has rings...)

[Diagram: processors P at the leaves of a hierarchy of caches, with memory (disks?) at the root]

Page 24:

Hierarchical - E.g. KSR (actually has rings...)

[Diagram: a read of A (RD A) travels up the cache hierarchy until a copy of A is found, leaving copies of A along the path]

Page 25:

Hierarchical - E.g. KSR (actually has rings...)

[Diagram: continued example: the search for A (A?) proceeds through the hierarchy, which already holds copies of A in several caches]

Page 26:

Widely shared data

° 1. Synchronization variables

- e.g., a flag that is written and then spun on; can use special mechanisms such as a software combining tree

° 2. Read-only objects

- instructions, read-only data

- can mark these and bypass the coherence protocol

° 3. But, biggest problem:

- frequently read, but rarely written data which does not fall into known patterns like synchronization variables

Page 27:

All software coherence

[Bar chart: execution time (Mcycles, roughly 0 to 1.6) for the All software, LimitLESS1, LimitLESS2, LimitLESS4, and All hardware protocols]

Page 28:

All software coherence

[Chart: Cycle Performance for Block Matrix Multiplication, 16 processors; x-axis: matrix size (4x4 to 128x128), y-axis: execution time (megacycles, 0 to 9); compares the LimitLESS4 protocol with LimitLESS0 (all software)]

Page 29:

All software coherence

[Chart: Performance for Single Global Barrier (first INVR to last RDATA); x-axis: 2 to 64 nodes, y-axis: execution time (units of 10,000 cycles, 0 to 5); compares the LimitLESS4 protocol with LimitLESS0 (all software)]

Page 30:

Summary

° Tradeoffs in caching are an important issue

° LimitLESS protocol provides a software extension to hardware caching

° Goal: maintain coherence but minimize network traffic

° Full map not scalable and too costly

° Distributed memory makes caching more of a challenge