Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

1

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Somayeh Sardashti and David A. WoodUniversity of Wisconsin-Madison

2

3

Where does energy go?

Communication vs. Computation

Keckler Micro 2011

Improving cache utilization is critical for energy-efficiency!

~200X

Compressed Cache: Compress and Compact Blocks+ Higher effective cache size+ Small area overhead+ Higher system performance+ Lower system energy

Previous work limit compression effectiveness:- Limited number of tags- High internal fragmentation- Energy expensive re-compaction

6

Decoupled Compressed Cache (DCC)

Saving system energy by improving LLC utilization through cache compression.

Non-Contiguous Sub-Blocks

Previous work limit compression effectiveness:- Limited number of tags- High Internal Fragmentation- Energy expensive re-compaction

Decoupled Super-Blocks

7

Decoupled Compressed Cache (DCC)

Saving system energy by improving LLC utilization through cache compression.

Outperform 2X LLC1.08X LLC area14% higher performance12% lower energy

8

Outline Motivation Compressed caching Our Proposals: Decoupled compressed cache Experimental Results Conclusions

9

Uncompressed Caching

A fixed one-to-one tag/data mapping

Tags DataBAB

C

C A

10

Compressed Caching

Compress cache blocks.

Tags DataA A BB A B CC

Compact compressed blocks, to make room.Add more tags to increase effective capacity.

11

Compression

(1) Compression: how to compress blocks? • There are different compression algorithms.• Not the focus of this work. • But, which algorithm matters!

C64 bytes

C20 bytes

Compressor

12

Compression Potentials

High compression ratio potentially large normalized effective cache capacity.

1.5

2.8

3.9

Compression Ratio = Original Size / Compressed Size

Cycles to Decompress

Compression Algorithm

We use C-PACK+Z for the rest of the talk!

13

Compaction

(2) Compaction: how to store and find blocks?• Critical to achieve the compression potentials.• This work focuses on compaction.

Tags DataAB

Fixed Sized Compressed Cache (FixedC) [Kim’02, WMPI, Yang Micro 02]

Internal Fragmentation!

CBAC

14

Compaction

(2) Compaction: how to store and find blocks?

Tags DataAB

Variable Sized Compressed Cache (VSC) [Alameldeen, ISCA 2002]

C

A B CSub-block

D D

15

Previous Compressed Caches

(Limit 1) Limited Tag/Metadata– High Area Overhead by Adding 4X/more Tags

(Limit 2) Internal Fragmentation– Low Cache Capacity Utilization

A10B

A16B 2.6

2.3 2.01.7

Potential: 3.9

3.1

Normalized Effective Capacity = LLC Number of Valid Blocks / MAX Number of (Uncompressed) Blocks

16

(Limit 3) Energy-Expensive Re-Compaction

3X higher LLC dynamic energy!

Tags DataABC AD B DC

VSC requires energy-expensive re-compaction.

B

Update BB needs 2 sub-blocks

17


18

Decoupled Compressed Cache(1) Exploiting Spatial Locality

Low Area Overhead(2) Decoupling tag/data mapping

Eliminate energy expensive re-compactionReduce internal fragmentation

(3) Co-DCC: Dynamically co-compacting super-blocksFurther reduce internal fragmentation

19

(1) Exploiting Spatial Locality

Neighboring blocks co-reside in LLC.

89%

20

(1) Exploiting Spatial Locality

DCC tracks LLC blocks at Super-Block granularity.

E4X Tags

Tags DataABCD AB C D2X Tags

EQuad (Q): A, B, C, DSingleton (S): E

Super-Block Tag Q stat

e A

stat

e B

stat

e C

stat

e D

Super TagsQ S

Up to 4X blocks with low area overheads!

21

(2) Decoupling tag/data mappingDCC decouples mapping to eliminate re-compaction.

Quad (Q): A, B, C, DSingleton (S): E

Super TagsQ S AB D E


Flexible Allocation

CB1 B2

Update B

22

(2) Decoupling tag/data mappingBack pointers identify the owner block of each sub-block.


Super TagsQ S A C D E


DataBack

Pointers

Tag ID Blk ID

B2B1

23

(3) Co-compacting super-blocks

Co-DCC dynamically co compacts super-blocks.‑ Reducing internal fragmentation

A B DCA sub-blockABCD

B

Quad (Q): A, B, C, D

24


25

Experimental Methodology Integrated DCC with AMD Bulldozer Cache.

– We model the timing and allocation constraints of sequential regions at LLC in detail.

– No need for an alignment network. Verilog implementation and synthesis of the tag

match and sub-block selection logic.– One additional cycle of latency due to sub-block selection.

26

Experimental Methodology Full-system simulation with a simulator based on GEMS. Wide range of applications with different level of cache

sensitivities:– Commercial workloads: apache, jbb, oltp, zeus– Spec-OMP: ammp, applu, equake, mgrid, wupwise– Parsec: blackscholes, canneal, freqmine– Spec 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-

bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Cores Eight OOO cores, 3.2 GHzL1I$/L1D$ Private, 32-KB, 8-wayL2$ Private, 256-KB, 8-wayL3$ Shared, 8-MB, 16-way, 8 banksMain Memory 4GB, 16 Banks, 800 MHz bus frequency DDR3

Effective LLC Capacity

27

Components FixedC/VSC-2X DCC Co-DCC

Tag Array 6.3% 2.1% 11.3%

Back Pointer Array 0 4.4% 5.4%

(De-)Compressors 1.8% 1.8% 1.8%

Total Area Overhead 8.1% 8.3% 18.5%

1

2

1 2 3

Nor

mal

ized

LLC

Area

Baseline

2X Baseline

VSC DCCCo-DCC

FixedC

Normalized Effective LLC Capacity

Components FixedC/VSC-2X

Tag Array 6.3%

Back Pointer Array 0

(De-)Compressors 1.8%

Total Area Overhead 8.1%

Components FixedC/VSC-2X DCC

Tag Array 6.3% 2.1%

Back Pointer Array 0 4.4%

(De-)Compressors 1.8% 1.8%

Total Area Overhead 8.1% 8.3%

28

(Co-)DCC Performance

0.930.96 0.95

0.900.86

(Co-)DCC boost system performance significantly.

29

(Co-)DCC Energy Consumption

0.930.96 0.97

0.91 0.88

(Co-)DCC reduce system energy by reducing number of accesses to the main memory.

30

SummaryAnalyze the limits of compressed caching

• Limited number of tags• Internal fragmentation• Energy-expensive re-compaction

Decoupled Compressed Cache• Improving performance and energy of compressed caching• Decoupled super-blocks• Non-contiguous sub-blocks

Co-DCC further reduces internal fragmentation

Practical designs [details in the paper]

31

(De-)Compression overhead DCC data array organization with AMD Bulldozer DCC Timing DCC Lookup Applications Co-DCC design LLC effective capacity LLC miss rate Memory dynamic energy LLC dynamic energy

Backup

32

(De-)Compression Overhead

Parameters Compressor DecompressorPipeline Depth 6 2

Latency (cycles) 16 9

Area () 0.016 0.016

Power Consumption (mW) 25.84 19.01

33

DCC Data Array OrganizationAMD Bulldozer

B Ph

ase

Flop

A Ph

ase

Flop

A Ph

ase

Flop

B Ph

ase

Flop

A0: uncompressed; B1 and C2 are compressed to 2 sub-blocks

SR0SR 1SR 2SR3

A0.3C3.0

A0.2

B1.1

A0.1B1.0

A0.0C3.1

NSet Addr

4SR0 Addr

SR3 Addr

4

44

SR1 AddrSR2 Addr

Read Data

34

DCC Timing

35

DCC Lookup

1. Access Super Tags and Back Pointers in parallel2. Find the matched Back Pointers3. Read corresponding sub-blocks and decompress


Super Tags

Data

AB EDBack Pointers

C1 C0

Read C

Q1

0 S

11

11

11

36

Applications

Spec2006 (m1-m8)

bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Sensitive to Cache Capacity and Latency

Sensitive to Cache Capacity

Cache Insensitive

Sensitive to Cache

Latency

37

Co-DCC Design

A2.1

A: <A2,A1,A0>A-ENDA1-Begin

Sub-block 7

…Sub-block 6 Sub-block 5

…Sub-blocks 4-2

A0.0

Sub-block 1 Sub-block 0

A0-Begin

A2.2A0.1A1A2.0

Tag ID Sh

arer

s

4b

Super-Block Tag Cs

tate

3Co

mp3

Csta

te2

Com

p2

Begi

n3

7b

Begi

n2Cs

tate

1Co

mp1

Csta

te0

Com

p0

Begi

n1

Begi

n0

END

7b3b 1b 1b7b3b1b 7b3b 1b 7b3b1b

38

LLC Effective Cache Capacity

1.0

1.5

2.0

2.5

3.0

3.5

4.0N

orm

LLC

Effec

tive

Capa

city

39

LLC Miss Rate

0.20

0.40

0.60

0.80

1.00N

orm

LLC

Miss

Rat

e

40

Memory Dynamic Energy

0.40

0.50

0.60

0.70

0.80

0.90

1.00

Nor

m M

emor

y Dy

nam

ic E

nerg

y

41

LLC Dynamic Energy

0

1

2

3

4

5

6N

orm

LLC

Dyna

mic

Ene

rgy

Documents

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching