20
QuickTime™ and a TIFF (Uncompressed) decompre are needed to see this pic Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University of Illinois at Urbana- Champaign *IBM Research http://iacoma.cs.uiuc.edu

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

  • Upload
    soo

  • View
    32

  • Download
    0

Embed Size (px)

DESCRIPTION

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors. Karin Strauss, Xiaowei Shen*, Josep Torrellas University of Illinois at Urbana-Champaign *IBM Research. http://iacoma.cs.uiuc.edu. Motivation. CMPs are becoming standard components - PowerPoint PPT Presentation

Citation preview

Page 1: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Flexible Snooping:

Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Karin Strauss, Xiaowei Shen*, Josep Torrellas

University of Illinois at Urbana-Champaign

*IBM Research

http://iacoma.cs.uiuc.edu

Page 2: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 2QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Motivation

• CMPs are becoming standard components

• cheaper to build medium size machines– 32 to 128 cores (multi-CMP)

• shared memory, cache coherent– easier to program, easier to manage

• supporting cache coherence is difficult

Page 3: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 3QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Cache coherence solutions

long latenciessimplenosnoopy

embedded ring

difficult to scale

simpleyessnoopy

broadcast bus

indirection,

extra hardwarescalableno

directory based protocol

consprosordered

network?strategy

• other proposals (e.g. token coherence)

Page 4: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 4QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Contributions

compared to fastest state-of-the-art scheme

performance energyconsumption

Superset Aggressive

performance energyconsumption

Superset Conservative

• family of adaptive coherence protocols for rings

• two were chosen as best options

high performance scheme energy conscious scheme

Page 5: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 5QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Multi-CMP multiprocessor

local network

CMP Proc + L1 + L2

memory

• coherence protocol used: only one supplier if line is cached

Page 6: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 6QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Ring in actionR

S

R

S

R

S

supplierpredictor

snoop

request

cmp

Lazy Eager Oracle

response

datadata

data

Page 7: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 7QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Ring in actionR

S

R

S

R

S

latency

snoops

messages

• goal: adaptive schemes that approximate Oracle’s behavior

Lazy Eager Oracle

Page 8: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 8QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Primitive snooping actions

X X

• snoop and then forward

• forward and then snoop

• forward only

+ fewer messages

+ shorter latency

+ fewer snoops+ shorter latency– false negative predictions not allowed

Page 9: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 9QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Predictors and algorithms

snoopforwardExact

forward

then snoopAgg

forward

snoopforward

then snoopSubset

action on positive

prediction

action on negative

prediction

predictor / algorithm

Superset

Consnoop then

forward

node can supply

in predictor

set of addresses:

Page 10: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 10QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Eager

Subset

Lazy

SupersetAgg

SupersetCon

Oracle

Algorithms

/ Exact

number of snoops

snoop messagelatency

number of messages

Per miss service:

algorithm negative positive

Subsetforward

then snoop

snoop

S

u

p

e

r

set

C

o

n forward

snoop then

forward

A

gg

forward then

snoop

Exact forward snoop

Page 11: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 11QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Predictor implementation• Subset

– associative table:

subset of addresses that can be supplied by node

• Superset– bloom filter: superset of addresses that can be supplied by node– associative table (exclude cache):

addresses that recently suffered false positives

• Exact– associative table: all addresses that can be supplied by node

– downgrading: if address has to be evicted from predictor table,

corresponding line in node has to be downgraded

Page 12: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 12QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Downgrading

AB

ES Negative effects:

• writes by this node need to snoop other nodes

• reads and writes by other nodes need to fetch line from memory

A

Page 13: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 13QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Experiments• 8 CMPs, 4 ooo cores each = 32 cores

– private L2 caches

• on-chip bus interconnect

• off-chip 2D torus interconnect with embedded unidirectional ring

• per node predictors: latency of 3 processor cycles

• sesc simulator (sesc.sourceforge.net)

• SPLASH-2, SPECjbb, SPECweb

Page 14: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 14QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Execution time

0

0.2

0.4

0.6

0.8

1

1.2

SPLASH-2 SPECjbb SPECweb

Normalized

execution time

Lazy Eager Oracle Subset SupersetCon

SupersetAggExact

• the fastest of all algorithms is SupersetAgg

• performance of most flexible snooping algorithms is similar to Eager

Page 15: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 15QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Miss service energy

0

0.5

1

1.5

2

SPLASH-2 SPECjbb SPECweb

Normalized

energy consumption Lazy

Eager Oracle Subset SupersetCon

SupersetAggExact

3.22

• SupersetCon is least energy-hungry algorithm

• algorithms that eagerly forward messages use more energy

Page 16: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 16QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Most cost-effective algorithms

00.20.40.60.81

1.2

SPLASH-2 SPECjbb SPECweb

Normalized

execution time

Lazy Eager Oracle Subset SupersetCon SupersetAggExact

0

0.5

1

1.5

2

SPLASH-2 SPECjbb SPECweb

Normalized

energy consumption

3.22Lazy Eager Oracle Subset SupersetCon SupersetAggExact

• high performance: Superset Aggressive • faster than Eager at lower energy consumption

• energy conscious: Superset Conservative• slightly slower than Eager at much lower energy consumption

Page 17: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 17QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Most cost-effective algorithmscompared to fastest state-of-the-art scheme (Eager)

can be combined by only changing forwarding policy

performance energyconsumption

performanceenergy

consumption

Superset Aggressive high performance scheme

Superset Conservative energy conscious scheme

Page 18: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 18QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Conclusions

• proposed flexible snooping, a family of adaptive protocols for embedded rings

• two chosen protocols– high performance: Superset Aggressive – energy conservation: Superset Conservative – can be selected dynamically

• embedded-ring protocols more attractive

Page 19: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

Karin Strauss Flexible Snooping 19QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Arch map

Google: architecture conference map(1st hit)

http://iacoma.cs.uiuc.edu/students/archmap/archmap.html

Page 20: Flexible Snooping: Adaptive Forwarding and Filtering of Snoops  in Embedded-Ring Multiprocessors

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Flexible Snooping:

Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Karin Strauss, Xiaowei Shen*, Josep Torrellas

University of Illinois at Urbana-Champaign

*IBM Research

http://iacoma.cs.uiuc.edu