Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors


Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee

2Ballapuram, Sharif, and Lee

Concurrent Execution in CMP

[Figure: a single-threaded program (Thread 0) with code, data, registers, and a local stack, next to a multi-threaded program (Threads 0, 1, and 2) with code and data plus per-thread registers and a local stack. Both sit on a Shared Last Level Cache.]

Self-Modifying Code (SMC) Snoop

[Figure: four cores (Core 0 to Core 3), each with an IL1 and a DL1. An SMC snoop is marked in every core.]

Snoop for Core 0 DL1 Miss

[Figure: four cores (Core 0 to Core 3), each with IL1, DL1, and an SMC snoop, attached to the CMP core interconnect. Also on the interconnect: L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect.]

External Snoop Request

[Figure: same block diagram as for the Core 0 DL1 miss.]

Modified L2 Eviction, External Request, etc.

[Figure: same block diagram, annotated "As # of cores increases: Power, Performance".]

Number of Snoop Probes

• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.

[Figure: bar chart. Y-axis: number of snoop probes in millions (0 to 12). X-axis: to_lsb, to_dcache, and to_icache for each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded apps. Series: 2C, 4C, 2 x 4C, 8C. One data label reads 16.4M.]

Snoop Probe and Snoop Rate

• % of data snoop > % of instruction cache snoop

[Figure: combined chart by processor configuration (2C, 4C, 2Px4C, 8C, 8C-MT, 2Px4C-MT). Left axis: number of snoops in millions (0 to 30). Right axis: % of snoop increase (0% to 2400%). Series: to_lsb, to_dcache, to_icache, total snoops, % of data snoop increase, % of SMC snoop increase, % of total snoop increase. Annotations: ~22x increase and ~12x increase.]

We propose two techniques to reduce the power consumed by snoop probes:

1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)

Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses

Selective Snoop Probe (SSP)
- SSP for SMC

Normal Operation: To Support SMC

[Figure: Core 0. Addresses dispatched from the RS or LSB go to the L1 D-cache and MSHR, and an SMC snoop probe goes to the L1 I-Cache.]

SSP (SMC) – No SMC Snoop if BF1 Miss

[Figure: Core 0. All store addresses dispatched from the RS or LSB pass through HASH to the counters (cntr) of Bloom filter BF1, which sits in front of the SMC snoop probe to the L1 I-Cache. The L1 D-cache and MSHR are also shown. BF1 is labeled "To filter SMC/XMC snoops".]

Legend: r1 = read Bloom filter; u1 = update Bloom filter; cntr = counting Bloom filter.

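SSP (SMC) – SMC Snoop if BF1 Hit

[Figure: Core 0, same datapath as on the BF1 miss slide. All store addresses dispatched from the RS or LSB pass through HASH to the counters (cntr) of Bloom filter BF1; on a BF1 hit the SMC snoop probe goes to the L1 I-Cache. The L1 D-cache and MSHR are also shown.]

Legend: r1 = read Bloom filter; u1 = update Bloom filter; cntr = counting Bloom filter.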
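
The BF1 filter on the two SSP (SMC) slides can be illustrated with a small counting Bloom filter. This is only a sketch under stated assumptions: it assumes BF1 summarizes the lines resident in the L1 I-cache (counter incremented on a fill, decremented on an eviction) and is read with every store address. The class name, table size, and hash function are hypothetical and are not taken from the slides. Because a Bloom filter has no false negatives, a miss can safely skip the probe, while a hit may be a false positive and still requires it.

```python
class CountingBloomFilter:
    """Counting Bloom filter with a single hash function (hypothetical sizes)."""

    def __init__(self, num_counters=512, line_bytes=64):
        self.counters = [0] * num_counters
        self.line_bytes = line_bytes

    def _index(self, addr):
        # Hash the cache-line address; a simple XOR fold stands in for HASH.
        line = addr // self.line_bytes
        n = len(self.counters)
        return (line ^ (line // n)) % n

    def insert(self, addr):
        # u1 (assumed): a line is filled into the I-cache.
        self.counters[self._index(addr)] += 1

    def remove(self, addr):
        # u1 (assumed): a line is evicted from the I-cache.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def may_contain(self, addr):
        # r1: read the filter with a store address.
        return self.counters[self._index(addr)] > 0


def needs_smc_snoop(bf1, store_addr):
    """Send the SMC snoop probe only when BF1 reports a possible hit."""
    return bf1.may_contain(store_addr)


bf1 = CountingBloomFilter()
bf1.insert(0x4000)                     # a code line is cached in the I-cache
print(needs_smc_snoop(bf1, 0x4008))    # store to the same line: probe is sent
print(needs_smc_snoop(bf1, 0x9000))    # filter miss: probe is skipped
bf1.remove(0x4000)
print(needs_smc_snoop(bf1, 0x4008))    # line evicted: probe is skipped
```

Counters rather than single bits are what allow evictions to be removed from the filter without rebuilding it.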

Selective Snoop Probe (SSP)
- SSP for Stack Accesses

Normal Operation: Always Snoop for All Accesses

[Figure: Core 0. Addresses dispatched from the RS or LSB go to the L1 D-cache and MSHR. A dL1 miss enters the L2 queue of the Last Level Cache, and the snoop controller and snoop queue send snoop probes.]

SSP – Stack Accesses

[Figure: Core 0. All addresses dispatched from the RS or LSB carry an S-bit annotation, annotated by the front-end. They go to the L1 D-cache and MSHR; a dL1 miss enters the L2 queue of the Last Level Cache. The snoop controller and snoop queue are shown with the bit pattern 0100.]

Selective Snoop Probe (SSP)
- SSP for Non-Stack Accesses

SSP – Non-stack Accesses Update BF2

[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB update (u2) Bloom filter BF2 (HASH, cntr), labeled "Filter snoops to non-stack region". L1 D-cache lines are marked ME or SI, next to the MSHR. Also shown: L2 queue, Last Level Cache, snoop controller, and snoop queue with the bit pattern 1000.]

Legend: r2 = read Bloom filter; u2 = update Bloom filter; cntr = counting Bloom filter.

SSP – Non-stack Accesses Read BF2

[Figure: same diagram. A dL1 miss enters the L2 queue; all addresses carry the S-bit annotation, and BF2 is read (r2).]

SSP - Selectively Send Snoop Probes

[Figure: same diagram, with the snoop paths labeled "Selectively send snoops".]
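
The stack and non-stack SSP slides can be read together as one filtering decision. The sketch below is an illustration under assumptions, not the authors' design: it assumes a set S-bit marks a stack access that needs no probe (following the slides' premise that stack data is thread-local), and that BF2 is a per-core counting filter over the non-stack lines held in that core's L1 D-cache. The names `SnoopRequest` and `CoreSnoopFilter`, the table size, and the modulo hash are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnoopRequest:
    line_addr: int   # cache-line address of the dL1 miss
    s_bit: bool      # assumed encoding: True means a stack access


class CoreSnoopFilter:
    """Per-core stand-in for BF2 (hypothetical model)."""

    def __init__(self, num_counters=256):
        self.cntr = [0] * num_counters

    def _hash(self, line_addr):
        return line_addr % len(self.cntr)

    def on_fill(self, line_addr, is_stack):
        # u2: only non-stack lines update the filter.
        if not is_stack:
            self.cntr[self._hash(line_addr)] += 1

    def on_evict(self, line_addr, is_stack):
        i = self._hash(line_addr)
        if not is_stack and self.cntr[i] > 0:
            self.cntr[i] -= 1

    def should_probe(self, req):
        # Stack accesses are never probed; others are probed on a BF2 hit (r2).
        if req.s_bit:
            return False
        return self.cntr[self._hash(req.line_addr)] > 0


core1 = CoreSnoopFilter()
core1.on_fill(0x120, is_stack=False)
requests = [
    SnoopRequest(0x120, s_bit=False),   # non-stack, BF2 hit
    SnoopRequest(0x121, s_bit=False),   # non-stack, BF2 miss
    SnoopRequest(0x120, s_bit=True),    # stack access
]
print([core1.should_probe(r) for r in requests])
```

Only the first request reaches the D-cache tags in this model; the other two are filtered before any probe is sent.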

Essential Snoop Probe (ESP)
- ESP for SMC
- ESP for all variables

Essential Snoop Probe (ESP)
- ESP for SMC

SMC – Normal Operation

[Figure: Core 0 with L1 I-$, L1 D-$, and other pipe stages, fed from RS or LSB dispatch. Every store snoops the I-cache.]

ESP – Essential Snoop Probe

• OS sets a control register bit (SMC-CR)
• SMC-CR=1: Non Self-Modifying Code
• SMC-CR=0: Self-Modifying Code

[Figure: Core 0 with L1 I-$, L1 D-$, and other pipe stages, fed from RS or LSB dispatch, shown with SMC-CR=1.]
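
The SMC-CR bit can be sketched as a simple gate on the per-store I-cache snoop. This is a minimal illustration, assuming the hardware consults SMC-CR on every store dispatch; the function names and the store count are hypothetical.

```python
def smc_snoop_required(smc_cr):
    """SMC-CR=1 marks non-self-modifying code, so stores skip the I-cache snoop."""
    if smc_cr not in (0, 1):
        raise ValueError("SMC-CR is a single bit")
    return smc_cr == 0


def count_icache_snoops(num_stores, smc_cr):
    """Number of SMC snoop probes the I-cache sees for a run of stores."""
    return num_stores if smc_snoop_required(smc_cr) else 0


print(count_icache_snoops(1000, smc_cr=0))   # self-modifying code: every store snoops
print(count_icache_snoops(1000, smc_cr=1))   # non-self-modifying code: no SMC snoops
```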

Essential Snoop Probe (ESP)
- ESP for all variables

Normal Operation – Snoop for All Variables

[Figure: Core 0 with L1 I-$, L1 D-$, and other pipe stages, fed from RS or LSB dispatch. A dL1 miss enters the L2 queue in the CMP interconnect domain, which also holds the Last Level Cache, snoop controller, and snoop queue. Snoop probes are sent for every miss.]

Essential Snoop Probe (ESP) – SMN bit 1

[Figure: Core 0 with L1 I-$, L1 D-$, and other pipe stages, fed from RS or LSB dispatch. A dL1 miss with the SMN bit annotation enters the L2 queue in the CMP interconnect domain, which also holds the Last Level Cache, snoop controller, and snoop queue (bit pattern 1100).]

SMN bit – Snoop-Me-Not bit is 0/1

Essential Snoop Probe (ESP) – SMN bit 0

[Figure: same diagram, with the snoop queue bit pattern 0100 and the probe paths labeled ESP.]
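
The role of the SMN annotation can be sketched as a filter between the L2 queue and the snoop queue. This is an assumption-laden illustration: it assumes SMN=1 means the miss must not trigger snoop probes and SMN=0 means the probe is essential, as the bit's name and the two slides suggest. The function name and the queue contents are hypothetical.

```python
from collections import deque


def fill_snoop_queue(l2_queue):
    """Forward only misses whose Snoop-Me-Not bit is 0 to the snoop queue."""
    snoop_queue = deque()
    for line_addr, smn_bit in l2_queue:
        if smn_bit == 0:
            snoop_queue.append(line_addr)   # essential snoop probe
        # smn_bit == 1: the miss is served without snooping other cores
    return snoop_queue


# Hypothetical dL1 misses as (line address, SMN bit) pairs.
l2_queue = deque([(0x10, 1), (0x11, 1), (0x12, 0), (0x13, 0)])
print(list(fill_snoop_queue(l2_queue)))
```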

Energy Savings in D-Cache Using SSP

• In the 2C configuration, SSP achieves 5% to 10% data cache energy savings; in the 8C configuration, it achieves 30% to 65%.

• The data cache energy savings increase with the number of cores on the die, because the number of snoops to all the cores increases.

[Figure: bar chart. Y-axis: % of data cache energy savings per core (0% to 70%). X-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]

Energy Savings in I-Cache Using SSP

• SSP achieves 50% to 70% instruction cache tag energy savings across all processor configurations.

[Figure: bar chart. Y-axis: % of icache tag energy savings per core (0% to 100%). X-axis: processor configuration (2C, 4C, 2Px4C, 8C). Series: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application.]

Performance Impact with SSP

• On average, SSP achieves a 1% to 2% performance improvement across the benchmark categories and processor configurations.

[Figure: bar chart. Y-axis: 0% to 120%. X-axis groups: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, harmonic mean across benchmarks, min performance observed, max performance observed. Series: 2C, 4C, 2Px4C, 8C.]

Energy Savings with ESP

• Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.

• Also, 85% of the instruction cache tag energy spent on snoops can be eliminated using ESP.

[Figure: bar chart. Y-axis: % of cache energy spent on non-essential snoops per core (0% to 100%). X-axis: dcache and icache for each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, and the harmonic mean across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]

Conclusion

• Semantics and program behavior are useful indicators
• They are exploited to reduce power due to snoops
• We proposed
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)
• Energy Reduction Results
  – 5% to 65% in D-cache per core
  – 50% to 70% in I-cache per core
• 1% - 2% performance improvement
• Extensible to optimize integrated platforms with graphics processor

Georgia Tech
Electrical and Computer Engineering
MARS Labs
http://arch.ece.gatech.edu

Thank You!

BACKUP

Simulation Infrastructure

Execution Engine: 4-wide, Out-of-Order
Load buf / Store buf / RS / ROB: 96 / 64 / 128 / 256 entries
L1 / L2 latency: 4 / 8 cycles
L1 I, L1 D cache size: 32KB, 8 way, 64B
L2 Cache: 4MB, 16 way, 64B
L1 TLB entries: 128, 4 way
Memory: 2GB, DDR 2 timings
CACTI 4.2: 70nm power model

Benchmark class: Example applications
Server: specJBB, TPCC
SPEC FP 2006: wrf, namd, lbm, soplex
SPEC INT 2006: hmmer, gobmk, omnetpp, gcc
Games and multi-media: shooters, realtime strategy, raytracer
Multi-threaded applications: ray tracer, cinebench

Number of Modified Lines

• The chart shows the number of modified lines that need to be evicted to the last level cache.

[Figure: bar chart. Y-axis: number of modified lines at completion (0 to 220). X-axis groups: SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, average across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]

Cache Access vs. Snoop Access

• Cache access – Read one sub-bank (8 bytes)
• Snoop access – Need to read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system
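
The difference between the two access types can be put into numbers with the sizes given on the slide. This is a back-of-the-envelope sketch only: it counts sub-bank reads, and energy need not scale linearly with that count. The access counts in the usage lines are hypothetical.

```python
LINE_BYTES = 64       # cache line size from the slide
SUB_BANK_BYTES = 8    # bytes read by a normal cache access


def sub_banks_read(kind):
    """Return how many sub-banks one access of the given kind reads."""
    if kind == "cache":
        return 1
    if kind == "snoop":
        return LINE_BYTES // SUB_BANK_BYTES
    raise ValueError(f"unknown access kind: {kind}")


def total_sub_bank_reads(cache_accesses, snoop_accesses):
    """Total sub-bank reads for a mix of cache and snoop accesses."""
    return (cache_accesses * sub_banks_read("cache")
            + snoop_accesses * sub_banks_read("snoop"))


print(sub_banks_read("snoop"))           # 8 sub-banks per snoop that ships data
print(total_sub_bank_reads(1000, 100))   # 1000 + 100 * 8 = 1800
```

In this example the snoops are a tenth as frequent as the cache accesses but account for nearly half of the sub-bank reads.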

Hash functions

[Figure: a cache line (physical address, 48 bits) with its MESI state, tag + index bits, and data. The tag + index bits feed HASH3, which indexes the counters (cntr): one set of counters if the line is in the M/E state and another if it is in the S state. The address is drawn as unused bits followed by fields B, C, and A, with tag + index bits [6-32]; bit positions 6, 15, 33, and 47 are marked.]

If bit-10 is 0, HASH3 = A ^ B ^ C
If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
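
The HASH3 rule can be sketched in a few lines. The field boundaries below are assumptions: the slide gives tag + index bits [6-32] (27 bits) and three fields A, B, and C, so the sketch splits them into three 9-bit fields with A in the lowest position. The exact boundaries and the function name `hash3` are hypothetical.

```python
FIELD_BITS = 9                        # assumed: 27 tag + index bits / 3 fields
FIELD_MASK = (1 << FIELD_BITS) - 1


def hash3(phys_addr):
    """XOR-fold the tag + index bits [6-32] of a physical address."""
    a = (phys_addr >> 6) & FIELD_MASK     # assumed field A: bits 6-14
    b = (phys_addr >> 15) & FIELD_MASK    # assumed second field: bits 15-23
    c = (phys_addr >> 24) & FIELD_MASK    # assumed third field: bits 24-32
    if (phys_addr >> 10) & 1:
        # If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C.
        a ^= 0x22
    return a ^ b ^ c


print(hash3(0x0000_0040))   # only bit 6 set: A = 1, so the hash is 1
print(hash3(0x0000_0400))   # bit 10 set: A = 0x10, twisted to 0x10 ^ 0x22 = 0x32 = 50
```

Since XOR is commutative, swapping the positions of B and C would not change the result; only the placement of A matters for the bit-10 case.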

Incoming Events to LLC

Incoming events to the last level cache:
- RFO
- Data Read
- Code fetch
- Shared L2 evict

Incoming Events to LLC and Sources of Snoop Triggers

[Table build: adds the columns iL1 of this core and dL1 of this core. RFO and Data Read have their event trigger in the dL1 column; Code fetch has its event trigger in the iL1 column.]

Snooped Units in the Triggered Core

[Table build: adds the columns LSB of this core and MSHR, WBB of this core.]

Snoop Probes for Incoming Data Read

[Table build: adds the columns for the other 3 cores and the shared L2 queue.]

Snoop Triggers and Snoop Units

Incoming events to the last level cache, with the entry for each unit:

RFO
  This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
  Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB snoop load buffer only, to invalidate; MSHR, WBB snoop to invalidate pending requests
  Shared L2 queue: snoop to invalidate

Data Read
  This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
  Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB -; MSHR, WBB snoop
  Shared L2 queue: snoop

Code fetch
  This core: iL1 event trigger; dL1 SMC snoop; LSB snoop store buffer only (updated writes); MSHR, WBB snoop (update writes)
  Other 3 cores: iL1 -; dL1 XMC snoop; LSB snoop store buffer only (update writes); MSHR, WBB snoop
  Shared L2 queue: SMC snoop

Shared L2 evict
  This core: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
  Other 3 cores: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
  Shared L2 queue: snoop

(Row without an event label)
  This core: iL1 SMC snoop to iL1; dL1 on all store addr disp; LSB -; MSHR, WBB -
  Other 3 cores: iL1 SMC snoop to iL1; dL1 on all store addr disp; LSB -; MSHR, WBB -
  Shared L2 queue: -
