1
Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood
University of Wisconsin-Madison
2
3
Where does energy go?
Communication vs. Computation [Keckler, Micro 2011]
[Figure: energy of communication vs. computation; labeled gap: ~200X]
Improving cache utilization is critical for energy-efficiency!
Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
6
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Non-Contiguous Sub-Blocks
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
Decoupled Super-Blocks
7
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Outperforms a 2X LLC
1.08X LLC area
14% higher performance
12% lower energy
8
Outline
• Motivation
• Compressed caching
• Our Proposals: Decoupled compressed cache
• Experimental Results
• Conclusions
9
Uncompressed Caching
A fixed one-to-one tag/data mapping
[Figure: tag array and data array holding uncompressed blocks A, B, and C]
10
Compressed Caching
Compress cache blocks.
[Figure: tag array and data array holding compressed blocks A, B, and C]
Compact compressed blocks to make room.
Add more tags to increase effective capacity.
11
Compression
(1) Compression: how to compress blocks?
• There are different compression algorithms.
• Not the focus of this work.
• But the choice of algorithm matters!
[Figure: a compressor reduces block C from 64 bytes to 20 bytes]
12
Compression Potentials
A high compression ratio potentially yields a large normalized effective cache capacity.
Compression Ratio = Original Size / Compressed Size
[Figure: compression ratio and cycles to decompress for each compression algorithm; labeled compression ratios: 1.5, 2.8, 3.9]
We use C-PACK+Z for the rest of the talk!
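As a quick worked example of the compression ratio definition, the sketch below plugs in the 64-byte block that compresses to 20 bytes in the compressor example. The function name is hypothetical and only illustrates the formula.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compression Ratio = Original Size / Compressed Size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Block C from the compressor example: 64 bytes compressed to 20 bytes.
ratio_c = compression_ratio(64, 20)
print(f"{ratio_c:.1f}")  # prints 3.2
```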
13
Compaction
(2) Compaction: how to store and find blocks?
• Critical to achieve the compression potentials.
• This work focuses on compaction.
Fixed Sized Compressed Cache (FixedC) [Kim’02, WMPI, Yang Micro 02]
[Figure: tag array and data array holding compressed blocks A, B, and C in fixed-size slots]
Internal Fragmentation!
14
Compaction
(2) Compaction: how to store and find blocks?
Variable Sized Compressed Cache (VSC) [Alameldeen, ISCA 2002]
[Figure: tag array and data array holding compressed blocks A, B, C, and D, each stored as a variable number of sub-blocks]
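To make the difference between the two compaction schemes concrete, here is a minimal sketch that compares the space allocated by fixed-size slots and by variable numbers of sub-blocks. The half-block FixedC slot, the 16-byte sub-block, and the example compressed sizes are assumptions chosen for illustration, not figures from the talk.

```python
import math

BLOCK_BYTES = 64       # uncompressed block size, as in the compressor example
HALF_BLOCK_BYTES = 32  # assumption: a FixedC block takes half a line or a full line
SUB_BLOCK_BYTES = 16   # assumption: illustrative sub-block size for VSC


def fixedc_bytes(compressed: int) -> int:
    """Space allocated when a block must occupy a half or a full line."""
    return HALF_BLOCK_BYTES if compressed <= HALF_BLOCK_BYTES else BLOCK_BYTES


def vsc_bytes(compressed: int) -> int:
    """Space allocated when a block occupies a whole number of sub-blocks."""
    return math.ceil(compressed / SUB_BLOCK_BYTES) * SUB_BLOCK_BYTES


# Hypothetical compressed sizes (in bytes) for three blocks.
sizes = [20, 40, 10]
fixedc_total = sum(fixedc_bytes(s) for s in sizes)
vsc_total = sum(vsc_bytes(s) for s in sizes)

# Internal fragmentation = allocated bytes that hold no compressed data.
fixedc_fragmentation = fixedc_total - sum(sizes)
vsc_fragmentation = vsc_total - sum(sizes)
print(fixedc_total, fixedc_fragmentation)  # prints 128 58
print(vsc_total, vsc_fragmentation)        # prints 96 26
```

Under these assumptions, finer allocation units waste less space, but some internal fragmentation remains in the last sub-block of each block.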
15
Previous Compressed Caches
(Limit 1) Limited Tag/Metadata
– High Area Overhead by Adding 4X/more Tags
(Limit 2) Internal Fragmentation
– Low Cache Capacity Utilization
[Figure: normalized effective capacity of previous compressed caches; labeled values: 2.6, 2.3, 2.0, 1.7, 3.1; potential: 3.9]
Normalized Effective Capacity = LLC Number of Valid Blocks / MAX Number of (Uncompressed) Blocks
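The two limits interact: the number of valid blocks is capped by both the tag count and the data space after rounding to the allocation unit. The sketch below counts valid blocks in one set and normalizes by the uncompressed capacity. The 16-way set of 64-byte lines, the block sizes, and the helper names are assumptions for illustration only.

```python
def valid_blocks_in_set(compressed_sizes, data_bytes, num_tags, alloc_unit):
    """Count blocks that fit in one set, limited by tags and by data space."""
    used = 0
    count = 0
    for size in compressed_sizes:
        # Round each block up to a whole number of allocation units.
        need = -(-size // alloc_unit) * alloc_unit
        if count == num_tags or used + need > data_bytes:
            break
        used += need
        count += 1
    return count


def normalized_effective_capacity(valid_blocks, max_uncompressed_blocks):
    """Valid blocks divided by the maximum number of uncompressed blocks."""
    return valid_blocks / max_uncompressed_blocks


WAYS = 16                 # assumption: uncompressed blocks per set
DATA_BYTES = WAYS * 64    # assumption: 64-byte lines
sizes = [16] * 64         # hypothetical: every block compresses to 16 bytes

tag_limited = normalized_effective_capacity(
    valid_blocks_in_set(sizes, DATA_BYTES, num_tags=32, alloc_unit=16), WAYS)
unlimited = normalized_effective_capacity(
    valid_blocks_in_set(sizes, DATA_BYTES, num_tags=64, alloc_unit=16), WAYS)
fragmented = normalized_effective_capacity(
    valid_blocks_in_set(sizes, DATA_BYTES, num_tags=64, alloc_unit=32), WAYS)
print(tag_limited, unlimited, fragmented)  # prints 2.0 4.0 2.0
```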
16
(Limit 3) Energy-Expensive Re-Compaction
3X higher LLC dynamic energy!
VSC requires energy-expensive re-compaction.
[Figure: tag array and data array holding blocks A, B, C, and D stored contiguously]
Update B: B needs 2 sub-blocks
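A minimal sketch of why an update triggers re-compaction when blocks are stored contiguously: if a block changes size, every block stored after it has to move. The layout, the sizes, and the function name are hypothetical, and eviction on overflow is not modeled.

```python
def vsc_update(layout, block, new_sub_blocks):
    """Update one block in a contiguous layout.

    layout is a list of (name, sub_blocks) in storage order.
    Returns the new layout and how many sub-blocks had to be moved.
    """
    names = [name for name, _ in layout]
    idx = names.index(block)
    old_sub_blocks = layout[idx][1]
    new_layout = layout[:idx] + [(block, new_sub_blocks)] + layout[idx + 1:]
    # A size change shifts every block stored after the updated one.
    if new_sub_blocks == old_sub_blocks:
        moved = 0
    else:
        moved = sum(size for _, size in layout[idx + 1:])
    return new_layout, moved


# Hypothetical set: A, B, and C take one sub-block each, D takes two.
layout = [("A", 1), ("B", 1), ("C", 1), ("D", 2)]
new_layout, moved = vsc_update(layout, "B", 2)  # B now needs 2 sub-blocks
print(moved)  # prints 3
```

In this example, growing B by one sub-block moves the three sub-blocks of C and D, which is the extra data-array traffic the slide refers to.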
17
Outline
• Motivation
• Compressed caching
• Our Proposals: Decoupled compressed cache
• Experimental Results
• Conclusions
18
Decoupled Compressed Cache
(1) Exploiting Spatial Locality
– Low Area Overhead
(2) Decoupling tag/data mapping
– Eliminate energy-expensive re-compaction
– Reduce internal fragmentation
(3) Co-DCC: Dynamically co-compacting super-blocks
– Further reduce internal fragmentation
19
(1) Exploiting Spatial Locality
Neighboring blocks co-reside in LLC.
89%
20
(1) Exploiting Spatial Locality
DCC tracks LLC blocks at Super-Block granularity.
[Figure: tag arrays (4X tags, 2X tags, super tags) and data array for blocks A, B, C, D, and E]
Quad (Q): A, B, C, D
Singleton (S): E
Super-block tag entry for Q: Super-Block Tag, state A, state B, state C, state D
Super Tags: Q, S
Up to 4X blocks with low area overheads!
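One way to picture super-block tracking is as an address split: neighboring blocks share one super-block tag and differ only in a small block ID. The sketch below assumes 64-byte blocks and aligned super-blocks of four contiguous blocks; the function name and addresses are hypothetical.

```python
BLOCK_BYTES = 64             # assumption: 64-byte cache blocks
BLOCKS_PER_SUPER_BLOCK = 4   # a quad, as in the slide


def super_block_fields(addr: int):
    """Split a byte address into (super-block number, block ID in super-block)."""
    block_number = addr // BLOCK_BYTES
    blk_id = block_number % BLOCKS_PER_SUPER_BLOCK
    super_block_number = block_number // BLOCKS_PER_SUPER_BLOCK
    return super_block_number, blk_id


# Four neighboring blocks map to one super-block tag with block IDs 0..3.
quad = [super_block_fields(a) for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
print(quad)                        # prints [(16, 0), (16, 1), (16, 2), (16, 3)]
print(super_block_fields(0x1100))  # prints (17, 0)
```

Because the four blocks share the address tag, one tag entry plus per-block state can track up to four blocks.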
21
(2) Decoupling tag/data mapping
DCC decouples mapping to eliminate re-compaction.
Quad (Q): A, B, C, D
Singleton (S): E
[Figure: super tags Q and S; data array holding sub-blocks of A, B, C, D, and E, with B stored in non-contiguous sub-blocks B1 and B2]
Flexible Allocation
Update B
22
(2) Decoupling tag/data mapping
Back pointers identify the owner block of each sub-block.
Quad (Q): A, B, C, D
Singleton (S): E
[Figure: super tags Q and S; data array holding sub-blocks of A, C, D, and E plus B1 and B2, each with a back pointer]
Back pointer fields: Tag ID, Blk ID
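The decoupled mapping can be sketched as one back pointer per data sub-block that names the owner (Tag ID, Blk ID). An updated block then takes any free sub-blocks and nothing else moves. The class below is a simplified model under assumptions: eight sub-blocks per set, hypothetical tag IDs 0 for Q and 1 for S, and no eviction.

```python
class DccSet:
    """One cache set: each data sub-block has a back pointer to its owner."""

    def __init__(self, num_sub_blocks: int) -> None:
        # None marks a free sub-block; otherwise (tag_id, blk_id).
        self.back_pointers = [None] * num_sub_blocks

    def sub_blocks_of(self, tag_id: int, blk_id: int):
        owner = (tag_id, blk_id)
        return [i for i, bp in enumerate(self.back_pointers) if bp == owner]

    def write(self, tag_id: int, blk_id: int, needed: int):
        """Store a block in any free sub-blocks; other blocks never move."""
        owner = (tag_id, blk_id)
        reusable = [i for i, bp in enumerate(self.back_pointers)
                    if bp is None or bp == owner]
        if len(reusable) < needed:
            raise RuntimeError("not enough free sub-blocks (eviction not modeled)")
        for i in self.sub_blocks_of(tag_id, blk_id):
            self.back_pointers[i] = None
        chosen = reusable[:needed]
        for i in chosen:
            self.back_pointers[i] = owner
        return chosen


Q, S = 0, 1  # hypothetical tag IDs for the quad and the singleton
cache_set = DccSet(num_sub_blocks=8)
cache_set.write(Q, 0, 1)  # A
cache_set.write(Q, 1, 1)  # B
cache_set.write(Q, 2, 1)  # C
cache_set.write(Q, 3, 2)  # D
cache_set.write(S, 0, 1)  # E
updated = cache_set.write(Q, 1, 2)  # Update B: B now needs 2 sub-blocks
print(updated)                          # prints [1, 6]
print(cache_set.sub_blocks_of(Q, 2))    # prints [2]
```

B ends up in non-contiguous sub-blocks while C, D, and E stay where they were, so no re-compaction is needed in this model.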
23
(3) Co-compacting super-blocks
Co-DCC dynamically co-compacts super-blocks.
– Reducing internal fragmentation
[Figure: blocks A, B, C, and D of one super-block compacted together into shared sub-blocks]
Quad (Q): A, B, C, D
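A small arithmetic sketch of why co-compaction helps: rounding each block up to whole sub-blocks wastes space in every block, while compacting the blocks of a super-block together rounds up only once. The 16-byte sub-block and the compressed sizes are assumptions for illustration.

```python
import math

SUB_BLOCK_BYTES = 16  # assumption: illustrative sub-block size


def sub_blocks_separate(sizes):
    """Each block is rounded up to whole sub-blocks on its own."""
    return sum(math.ceil(s / SUB_BLOCK_BYTES) for s in sizes)


def sub_blocks_co_compacted(sizes):
    """Blocks of one super-block are packed back to back, then rounded up."""
    return math.ceil(sum(sizes) / SUB_BLOCK_BYTES)


# Hypothetical compressed sizes (bytes) for A, B, C, and D of one quad.
quad_sizes = [20, 30, 10, 40]
separate = sub_blocks_separate(quad_sizes)
together = sub_blocks_co_compacted(quad_sizes)
print(separate, together)  # prints 8 7
```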
24
Outline
• Motivation
• Compressed caching
• Our Proposals: Decoupled compressed cache
• Experimental Results
• Conclusions
25
Experimental Methodology
• Integrated DCC with AMD Bulldozer Cache.
– We model the timing and allocation constraints of sequential regions at LLC in detail.
– No need for an alignment network.
• Verilog implementation and synthesis of the tag match and sub-block selection logic.
– One additional cycle of latency due to sub-block selection.
26
Experimental Methodology
• Full-system simulation with a simulator based on GEMS.
• Wide range of applications with different levels of cache sensitivity:
– Commercial workloads: apache, jbb, oltp, zeus
– Spec-OMP: ammp, applu, equake, mgrid, wupwise
– Parsec: blackscholes, canneal, freqmine
– Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Cores         Eight OOO cores, 3.2 GHz
L1I$/L1D$     Private, 32-KB, 8-way
L2$           Private, 256-KB, 8-way
L3$           Shared, 8-MB, 16-way, 8 banks
Main Memory   4GB, 16 Banks, 800 MHz bus frequency DDR3
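For orientation, the L3 configuration above implies the following baseline geometry. This is only arithmetic on the listed parameters; the 64-byte block size is taken from the compressor example, and an even split of sets across banks is an assumption.

```python
LLC_BYTES = 8 * 1024 * 1024  # L3$: 8 MB
WAYS = 16                    # 16-way
BANKS = 8                    # 8 banks
BLOCK_BYTES = 64             # assumption: 64-byte blocks

uncompressed_blocks = LLC_BYTES // BLOCK_BYTES
sets = uncompressed_blocks // WAYS
sets_per_bank = sets // BANKS  # assumption: sets are split evenly across banks
print(uncompressed_blocks, sets, sets_per_bank)  # prints 131072 8192 1024
```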
27
Effective LLC Capacity
[Figure: normalized LLC area vs. normalized effective LLC capacity for Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC]

Components            FixedC/VSC-2X   DCC     Co-DCC
Tag Array             6.3%            2.1%    11.3%
Back Pointer Array    0               4.4%    5.4%
(De-)Compressors      1.8%            1.8%    1.8%
Total Area Overhead   8.1%            8.3%    18.5%
28
(Co-)DCC Performance
[Figure: (Co-)DCC performance results, normalized; labeled values: 0.93, 0.96, 0.95, 0.90, 0.86]
(Co-)DCC boost system performance significantly.
29
(Co-)DCC Energy Consumption
[Figure: (Co-)DCC system energy, normalized; labeled values: 0.93, 0.96, 0.97, 0.91, 0.88]
(Co-)DCC reduce system energy by reducing the number of accesses to main memory.
30
Summary
Analyze the limits of compressed caching
• Limited number of tags
• Internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache
• Improving performance and energy of compressed caching
• Decoupled super-blocks
• Non-contiguous sub-blocks
Co-DCC further reduces internal fragmentation
Practical designs [details in the paper]
31
Backup
• (De-)Compression overhead
• DCC data array organization with AMD Bulldozer
• DCC Timing
• DCC Lookup
• Applications
• Co-DCC design
• LLC effective capacity
• LLC miss rate
• Memory dynamic energy
• LLC dynamic energy
32
(De-)Compression Overhead
Parameters               Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
Area ()                  0.016        0.016
Power Consumption (mW)   25.84        19.01
33
DCC Data Array Organization
AMD Bulldozer
A0: uncompressed; B1 and C2 are compressed to 2 sub-blocks
[Figure: data array with regions SR0 to SR3 holding sub-blocks A0.0 to A0.3, B1.0, B1.1, C3.0, and C3.1; A-phase and B-phase flops; inputs: Set Addr and SR0 Addr to SR3 Addr; output: Read Data]
34
DCC Timing
35
DCC Lookup
1. Access Super Tags and Back Pointers in parallel
2. Find the matched Back Pointers
3. Read corresponding sub-blocks and decompress
Quad (Q): A, B, C, D
Singleton (S): E
[Figure: Read C; super tags Q and S, back pointers, and data array holding sub-blocks of A, B, D, and E plus C1 and C0]
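The three lookup steps can be sketched in software as follows. This is a simplified model, not the hardware design: `zlib` stands in for the C-PACK+Z compressor, sub-blocks are joined in data-array order, and the super-block numbers, slot choices, and function name are hypothetical.

```python
import zlib

SUB_BLOCK_BYTES = 16  # assumption: illustrative sub-block size


def dcc_lookup(super_tags, back_pointers, sub_blocks,
               super_block_number, blk_id, decompress=zlib.decompress):
    """Return the decompressed block, or None on a miss."""
    # Step 1: match the super tag (hardware reads back pointers in parallel).
    if super_block_number not in super_tags:
        return None
    tag_id = super_tags.index(super_block_number)
    # Step 2: find the back pointers that name this (Tag ID, Blk ID).
    matched = [i for i, bp in enumerate(back_pointers) if bp == (tag_id, blk_id)]
    if not matched:
        return None
    # Step 3: read the matching sub-blocks and decompress them.
    compressed = b"".join(sub_blocks[i] for i in matched)
    return decompress(compressed)


# Hypothetical set: tag 0 is quad Q (super-block 100), tag 1 is singleton S.
super_tags = [100, 200]
back_pointers = [None] * 8
sub_blocks = [b""] * 8
back_pointers[0] = (0, 0)  # a sub-block of A
back_pointers[2] = (1, 0)  # a sub-block of E

block_c = bytes(range(16)) * 4  # a 64-byte block C (Blk ID 2 in Q)
compressed_c = zlib.compress(block_c)
chunks = [compressed_c[i:i + SUB_BLOCK_BYTES]
          for i in range(0, len(compressed_c), SUB_BLOCK_BYTES)]
# Place C's sub-blocks in free, non-contiguous slots.
for slot, chunk in zip([1, 3, 4, 5, 6, 7], chunks):
    back_pointers[slot] = (0, 2)
    sub_blocks[slot] = chunk

hit = dcc_lookup(super_tags, back_pointers, sub_blocks, 100, 2)
print(hit == block_c)  # prints True
```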
36
Applications
Spec2006 (m1-m8)
bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
Sensitive to Cache Capacity and Latency
Sensitive to Cache Capacity
Cache Insensitive
Sensitive to Cache Latency
37
Co-DCC Design
[Figure: Co-DCC data layout and tag format. Data array: sub-blocks 0 to 7 holding A0.0, A0.1, A1, A2.0, A2.1, and A2.2, with markers A0-Begin, A1-Begin, and A-END; A: <A2,A1,A0>. Back pointer fields: Tag ID, Sharers (4b). Tag entry fields: Super-Block Tag; Cstate, Comp, and Begin for each of blocks 0 to 3; END. Labeled field widths: 7b, 3b, 1b]
38
LLC Effective Cache Capacity
[Figure: normalized LLC effective capacity; y-axis from 1.0 to 4.0]
39
LLC Miss Rate
[Figure: normalized LLC miss rate; y-axis from 0.20 to 1.00]
40
Memory Dynamic Energy
[Figure: normalized memory dynamic energy; y-axis from 0.40 to 1.00]
41
LLC Dynamic Energy
[Figure: normalized LLC dynamic energy; y-axis from 0 to 6]