Moshovos © 1
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos
[email protected]
www.eecg.toronto.edu/aenao
Improving Snoop Coherence
[Figure: three CPUs, each with an I$ and a D$, connected through an interconnect to main memory]
Conventional considerations: complexity and correctness, not power or bandwidth.
Can we (1) reduce power and bandwidth and (2) still leverage snoop coherence? Snoop coherence remains attractive: it is simple and allows design re-use.
Yes: exploit program behavior to dynamically identify requests that do not need snooping.
RegionScout: Avoid Some Snoops
[Figure: three CPUs, each with an I$ and a D$, connected through an interconnect to main memory]
Frequent case: no sharing, even at a coarse (region) level.
RegionScout: dynamically identify non-shared regions.
The first request to a region identifies it as not shared.
Subsequent requests do not need to be broadcast.
Uses imprecise information.
Small structures.
Layered on top of conventional coherence.
No additional constraints.
Roadmap
Conventional Coherence: The need for power-aware designs
Potential: Program Behavior
RegionScout: What and How
Implementation
Evaluation
Summary
Coherence Basics
Given a request for memory block X (an address), detect where its current value resides.
[Figure: a request for block X is snooped by the other CPUs; one of them reports a hit. Main memory sits above the three CPUs.]
Conventional Coherence not Power-Aware/Bandwidth-Effective
All L2 tags see all accesses.
Performance and complexity: the L2 tags are already there, so why not use them?
Power: all L2 tags consume power on all accesses.
Bandwidth: all coherent requests are broadcast.
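[Figure: a request from one CPU's L2 is broadcast to the other two CPUs, and both lookups miss. Main memory sits above the three CPUs.]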
RegionScout Motivation: Sharing is Coarse
Region: a large contiguous memory area whose size is a power of 2.
A CPU asks for data block X in region R:
1. No one else has X
2. No one else has any block in R
RegionScout exploits this behavior.
Layered extension over snoop coherence.
[Figure: typical memory space snapshot, with addresses colored by owner(s)]
Optimization Opportunities
Power and bandwidth:
Originating node: avoid asking others.
Remote node: avoid the tag lookup.
[Figure: three CPUs, each with an I$ and a D$, connected through a switch to memory]
Potential: Region Miss Frequency
[Figure: global region misses as a percentage of all requests (y-axis, 0% to 100%) versus region size (x-axis, 256 to 16K); series p4.512K, p4.1M, p8.512K, p8.1M]
Even with a 16K region, about 45% of requests miss in all remote nodes.
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Figure: three CPUs and main memory, with steps numbered 1 to 3. Two nodes are labeled "Region Miss" and the outcome is labeled "Global Region Miss". Each node keeps two records: non-shared regions and locally cached regions.]
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Figure: three CPUs and main memory, with steps numbered 1 and 2 and the label "Global Region Miss". Each node keeps two records: non-shared regions and locally cached regions.]
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Figure: three CPUs and main memory, with steps numbered 1 and 2. Each node keeps two records: non-shared regions and locally cached regions.]
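The three "RegionScout at Work" steps (discovery, snoop avoidance, self-correction) can be sketched as a toy model. This is only an illustration under stated assumptions: the `Node` and `Bus` classes are hypothetical names, regions are assumed to be 16K, and each node tracks its cached blocks and non-shared regions in unbounded, exact Python sets rather than the small imprecise structures the slides describe.

```python
REGION_BITS = 14  # assumed 16K regions (2**14 bytes)


class Node:
    def __init__(self, name):
        self.name = name
        self.cached_blocks = set()       # block addresses held locally
        self.non_shared_regions = set()  # regions discovered as not shared

    def caches_region(self, region):
        return any(block >> REGION_BITS == region for block in self.cached_blocks)


class Bus:
    def __init__(self, nodes):
        self.nodes = nodes
        self.broadcasts = 0

    def request(self, requester, address):
        """Handle a miss at `requester`; return True if it was broadcast."""
        region = address >> REGION_BITS
        if region in requester.non_shared_regions:
            requester.cached_blocks.add(address)  # no snoops needed
            return False
        self.broadcasts += 1
        others = [n for n in self.nodes if n is not requester]
        # Self-correction: remote nodes see the request and drop any
        # non-shared record they hold for this region.
        for node in others:
            node.non_shared_regions.discard(region)
        # Global region miss: no remote node caches any block of the region.
        if not any(node.caches_region(region) for node in others):
            requester.non_shared_regions.add(region)
        requester.cached_blocks.add(address)
        return True


a, b, c = Node("A"), Node("B"), Node("C")
bus = Bus([a, b, c])
first = bus.request(a, 0x10000)   # broadcast; A discovers a non-shared region
second = bus.request(a, 0x10040)  # same region: the broadcast is avoided
third = bus.request(b, 0x10080)   # B's broadcast invalidates A's record
fourth = bus.request(a, 0x100C0)  # A has to broadcast again
```

In this trace only the second request skips the broadcast, so `bus.broadcasts` ends at 3.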
Implementation: Requirements
The requesting node provides the address:
At the originating node (from the CPU): have I discovered that this region is not shared?
At remote nodes (from the interconnect): do I have a block in the region?
[Figure: the CPU address is split into a region tag and an offset of lg(Region Size) bits]
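The address split above can be sketched as follows. The function names are hypothetical, and the 16K region size and the sample addresses are illustrative values only.

```python
def region_offset_bits(region_size: int) -> int:
    """Return lg(region_size); the region size must be a power of two."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    return region_size.bit_length() - 1


def split_address(address: int, region_size: int) -> tuple[int, int]:
    """Split an address into (region tag, offset within the region)."""
    bits = region_offset_bits(region_size)
    return address >> bits, address & (region_size - 1)


REGION_SIZE = 16 * 1024  # assumed 16K region
tag_a, off_a = split_address(0x12345678, REGION_SIZE)
tag_b, off_b = split_address(0x12347FFF, REGION_SIZE)
same_region = tag_a == tag_b  # both addresses fall in the same 16K region
```

Both lookups (at the originating node and at remote nodes) only need the region tag, which is why the offset bits can be ignored by the RegionScout structures.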
Remembering Non-Shared Regions
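Records non-shared regions.
Looked up by the region portion of the address prior to issuing a request.
Snoops requests and invalidates matching entries.
[Figure: the region tag of the address is used to look up the Non-Shared Region Table; each entry has a valid bit]
Few entries: 16x4 in most experiments.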
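A minimal sketch of such a table follows. It reads "16x4" as 16 sets of 4 ways, which is an assumption, as are the modulo set index, the LRU replacement policy, and the class and method names.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    """Small set-associative table of regions believed to be non-shared."""

    def __init__(self, num_sets: int = 16, ways: int = 4):
        # 16 sets x 4 ways is an assumed reading of "16x4".
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, region_tag: int) -> OrderedDict:
        return self.sets[region_tag % self.num_sets]

    def lookup(self, region_tag: int) -> bool:
        """Checked before issuing a request: True means skip the broadcast."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.move_to_end(region_tag)  # refresh LRU position (assumed policy)
            return True
        return False

    def record(self, region_tag: int) -> None:
        """Called after a broadcast came back as a global region miss."""
        entries = self._set_for(region_tag)
        entries[region_tag] = True
        entries.move_to_end(region_tag)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # evict the least recently used entry

    def snoop_invalidate(self, region_tag: int) -> None:
        """Called when another node's request to this region is observed."""
        self._set_for(region_tag).pop(region_tag, None)


nsrt = NonSharedRegionTable()
nsrt.record(0x48D1)
hit_before = nsrt.lookup(0x48D1)
nsrt.snoop_invalidate(0x48D1)
hit_after = nsrt.lookup(0x48D1)
```

Losing an entry to replacement is harmless in this sketch: the next request to that region is simply broadcast again.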
What Regions are Locally Cached?
If we had as many counters as regions:
Block allocation: counter[region]++
Block eviction: counter[region]--
A region is cached only if counter[region] is non-zero.
Not practical: e.g., 16K regions and 4G of memory require 256K counters.
[Figure: the region tag of the address selects a counter]
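The counter arithmetic and the exact (one counter per region) scheme can be written out as a short sketch. The class name is hypothetical, and a Python `Counter` stands in for the hardware counter array.

```python
from collections import Counter

MEMORY_BYTES = 4 * 2**30   # 4G of memory, as in the example above
REGION_BYTES = 16 * 2**10  # 16K regions

num_counters = MEMORY_BYTES // REGION_BYTES  # one counter per region


class ExactRegionCounters:
    """One counter per region: precise, but too large to be practical."""

    def __init__(self):
        self.count = Counter()

    def allocate(self, region: int) -> None:
        self.count[region] += 1

    def evict(self, region: int) -> None:
        if self.count[region] == 0:
            raise ValueError("eviction without a matching allocation")
        self.count[region] -= 1

    def is_cached(self, region: int) -> bool:
        return self.count[region] > 0


exact = ExactRegionCounters()
exact.allocate(7)
exact.allocate(7)
exact.evict(7)
still_cached = exact.is_cached(7)  # one block of region 7 remains
```

Here `num_counters` evaluates to 2^32 / 2^14 = 2^18, i.e. 256K.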
What Regions are Locally Cached?
Use few counters.
Imprecise: records a superset of the locally cached regions.
False positives: a lost opportunity, but correctness is preserved.
[Figure: the region tag of the address is hashed to select an entry of the Cached Region Hash (CRH); each entry has a counter and a p-bit]
Counter: incremented on block allocation, decremented on block eviction.
Few entries, e.g., 256.
P-bit: 1 if the counter is non-zero; used for lookups.
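A minimal sketch of a Cached Region Hash along these lines is shown below. The hash function (low-order bits of the region tag) and the class and method names are assumptions made for illustration; the p-bit is derived from the counter on each lookup rather than stored separately.

```python
class CachedRegionHash:
    """Imprecise record of which regions may have blocks cached locally."""

    def __init__(self, num_entries: int = 256):
        self.counters = [0] * num_entries

    def _index(self, region_tag: int) -> int:
        # Assumed hash: low-order bits of the region tag.
        return region_tag % len(self.counters)

    def on_block_allocation(self, region_tag: int) -> None:
        self.counters[self._index(region_tag)] += 1

    def on_block_eviction(self, region_tag: int) -> None:
        index = self._index(region_tag)
        if self.counters[index] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counters[index] -= 1

    def p_bit(self, region_tag: int) -> bool:
        """True if the entry's counter is non-zero: the region may be cached."""
        return self.counters[self._index(region_tag)] != 0


crh = CachedRegionHash()
crh.on_block_allocation(5)
present = crh.p_bit(5)           # region 5 really is cached
false_positive = crh.p_bit(261)  # 261 % 256 == 5, so it aliases with region 5
absent = crh.p_bit(6)            # untouched entry: definitely not cached
crh.on_block_eviction(5)
after_eviction = crh.p_bit(5)
```

The aliasing case shows why the structure records a superset: region 261 is reported as possibly cached, which costs a snoop that could have been avoided but never hides a block that is actually cached.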
Roadmap
Conventional Coherence
Program Behavior: Region Miss Frequency
RegionScout
Evaluation
Summary
Evaluation Overview
Methodology
Filter rates: practical filters can capture many region misses
Interconnect bandwidth reduction
Methodology
In-house simulator based on Simplescalar:
Execution driven
All instructions simulated; MIPS-like ISA
System calls faked by passing them to the host OS
Synchronization using load-linked/store-conditional
Simple in-order processors
Memory requests complete instantaneously
MESI snoop coherence
1- or 2-level memory hierarchy
WATTCH power models
SPLASH II benchmarks:
Scientific workloads
Feasibility study
Filter Rates
[Figure: identified global region misses (y-axis, 0% to 100%) versus CRH size (x-axis, 256 to 2K); series p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]
For a small CRH it is better to use large regions.
Practical RegionScout filters capture a lot of the potential.
Bandwidth Reduction
[Figure: messages (y-axis, 0% to 100%) versus region size (x-axis, 2K to 16K); series p4.512K, p8.512K, p4.64K, p8.64K; one annotation reads "CMP"]
Moderate bandwidth savings for SMP (15%-22%).
More so for CMP (>25%).
Related Work
RegionScout: Technical Report, Dec. 2003
Jetty: Moshovos, Memik, Falsafi, and Choudhary, HPCA 2001
PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
Coarse-Grain Coherence: Cantin, Lipasti, and Smith, ISCA 2005
Summary
Exploit program behavior and optimize a frequent case:
Many requests result in a global region miss
RegionScout: a practical filter mechanism
Dynamically detects would-be region misses
Avoids broadcasts
Saves tag lookup power and interconnect bandwidth
Small structures
Layered extension over existing mechanisms
Invisible to the programmer and the OS
RegionScout and Directories
Different information:
Directory: block-level sharing
RegionScout: region-level sharing
One could build a region-level directory; this work serves as motivation.
Directories use precise information; RegionScout does not have to.
Directories/implementation: RegionScout could approximate a directory if remote nodes sent sharing information as opposed to a single bit.