
Using Compression to Improve Chip Multiprocessor Performance


1

Using Compression to Improve Chip Multiprocessor Performance

Alaa R. Alameldeen

Dissertation Defense

Wisconsin Multifacet Project

University of Wisconsin-Madison

http://www.cs.wisc.edu/multifacet

2

Motivation

• Architectural trends
  – Multi-threaded workloads
  – Memory wall
  – Pin bandwidth bottleneck

• CMP design trade-offs
  – Number of cores
  – Cache size
  – Pin bandwidth

• Are these trade-offs zero-sum?
  – No: compression helps both cache size and pin bandwidth

• However, hardware compression raises a few questions

3

Thesis Contributions

• Question: Is compression's overhead too high for caches?

• Contribution #1: Simple compressed cache design
  – Compression scheme: Frequent Pattern Compression
  – Cache design: Decoupled Variable-Segment Cache

• Question: Can cache compression hurt performance?
  + Reduces miss rate
  ‒ Increases hit latency

• Contribution #2: Adaptive compression
  – Adapts to program behavior
  – Applies cache compression only when it helps

4

Thesis Contributions (Cont.)

• Question: Does compression help CMP performance?

• Contribution #3: Evaluate CMP cache and link compression
  – Cache compression improves CMP throughput
  – Link compression reduces pin bandwidth demand

• Question: How do compression and prefetching interact?

• Contribution #4: Compression interacts positively with prefetching
  – Speedup(Compr, Pref) > Speedup(Compr) × Speedup(Pref)

• Question: How do we balance CMP cores and caches?

• Contribution #5: Model CMP cache and link compression
  – Compression improves the optimal CMP configuration

5

Outline

• Background
  – Technology and Software Trends
  – Compression Addresses CMP Design Challenges
• Compressed Cache Design
• Adaptive Compression
• CMP Cache and Link Compression
• Interactions with Hardware Prefetching
• Balanced CMP Design
• Conclusions

6

Technology and Software Trends

• Technology trends:
  – Memory Wall: increasing gap between processor and memory speeds
  – Pin Bottleneck: bandwidth demand > bandwidth supply

7

Pin Bottleneck: ITRS 04 Roadmap

• Annual rates of increase: transistors 26%, pins 10%

[Chart: Transistors and Pins Per Chip (ITRS 2004 roadmap), log scale, 2004–2019; series: #Transistors (millions) and #Pins]

8

Technology and Software Trends

• Technology trends:
  – Memory Wall: increasing gap between processor and memory speeds
  – Pin Bottleneck: bandwidth demand > bandwidth supply
  ⇒ Favor a bigger cache

• Software application trends:
  – Higher throughput requirements
  ⇒ Favor more cores/threads
  ⇒ Demand higher pin bandwidth

⇒ Contradictory goals

9

Using Compression

• On-chip compression
  – Cache compression: increases effective cache size
  – Link compression: increases effective pin bandwidth

• Compression requirements
  – Lossless
  – Low decompression (and compression) overhead
  – Efficient for small block sizes
  – Minimal additional complexity

⇒ Thesis addresses CMP design with compression support

10

Outline

• Background
• Compressed Cache Design
  – Compressed Cache Hierarchy
  – Compression Scheme: FPC
  – Decoupled Variable-Segment Cache
• Adaptive Compression
• CMP Cache and Link Compression
• Interactions with Hardware Prefetching
• Balanced CMP Design
• Conclusions

11

Compressed Cache Hierarchy (Uniprocessor)

[Diagram: the instruction fetcher and load-store queue access uncompressed L1 I- and D-caches (plus an L1 victim cache); a compression pipeline sits on the path into the compressed L2, a decompression pipeline and an uncompressed-line bypass sit on the path back to L1; the L2 connects to/from memory]

12

Frequent Pattern Compression (FPC)

• A significance-based compression algorithm
  – Compresses each 32-bit word separately
  – Suitable for short (32-256 byte) cache lines
  – Compressible patterns: zeros; sign-extended 4-, 8-, and 16-bit values; zero-padded half-word; two sign-extended half-words; repeated byte
  – Pattern detected ⇒ store pattern prefix + significant bits
  – A 64-byte line is decompressed in a five-stage pipeline
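A minimal C++ sketch of the word-level pattern match, assuming an illustrative classification only; the actual prefix encoding, bit packing, and zero-run handling in the thesis are not reproduced here:

#include <cstdint>

// Illustrative sketch of FPC pattern detection for one 32-bit word.
// The pattern set follows the slide; the exact prefix encoding and bit
// layout are not modeled here.
enum Pattern { ZERO_WORD, SIGN_EXT_4, SIGN_EXT_8, SIGN_EXT_16,
               ZERO_PADDED_HALF, TWO_SE_HALVES, REPEATED_BYTE, UNCOMPRESSED };

static bool fitsSignExtended(int32_t v, int bits) {
  // True if v is representable as a sign-extended `bits`-bit value.
  int32_t lo = -(1 << (bits - 1));
  int32_t hi =  (1 << (bits - 1)) - 1;
  return v >= lo && v <= hi;
}

Pattern classifyWord(uint32_t w) {
  int32_t s = static_cast<int32_t>(w);
  if (w == 0)                  return ZERO_WORD;        // zeros (runs handled separately)
  if (fitsSignExtended(s, 4))  return SIGN_EXT_4;       // 4 significant bits
  if (fitsSignExtended(s, 8))  return SIGN_EXT_8;       // 8 significant bits
  if (fitsSignExtended(s, 16)) return SIGN_EXT_16;      // 16 significant bits
  if ((w & 0xFFFF) == 0)       return ZERO_PADDED_HALF; // one half-word is zero (here: the lower half)
  if (fitsSignExtended(static_cast<int16_t>(w >> 16), 8) &&
      fitsSignExtended(static_cast<int16_t>(w & 0xFFFF), 8))
                               return TWO_SE_HALVES;    // two sign-extended half-words
  if (((w >> 24) & 0xFF) == ((w >> 16) & 0xFF) &&
      ((w >> 16) & 0xFF) == ((w >> 8) & 0xFF) &&
      ((w >> 8) & 0xFF)  == (w & 0xFF))
                               return REPEATED_BYTE;    // the same byte repeated four times
  return UNCOMPRESSED;                                  // store the full 32 bits
}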


13

Decoupled Variable-Segment Cache

• Each set contains twice as many tags as uncompressed lines
• Data area divided into 8-byte segments
• Each tag is composed of:
  – Address tag
  – Permissions
  – CStatus: 1 if the line is compressed, 0 otherwise
  – CSize: size of the compressed line in segments
  – LRU/replacement bits
  (Address tag, permissions, and LRU bits are the same as in an uncompressed cache.)
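A sketch of one tag entry and the size bookkeeping it enables, following the fields above and the actual-size rule on the backup slide; field widths are illustrative assumptions:

#include <cstdint>

// One tag entry in the decoupled variable-segment cache (illustrative widths).
struct TagEntry {
  uint64_t addrTag;      // address tag (as in an uncompressed cache)
  uint8_t  permissions;  // M, S, I, or NP
  bool     cStatus;      // 1 if the line is stored compressed
  uint8_t  cSize;        // compressed size in 8-byte segments (1..8)
  uint8_t  lruBits;      // replacement state
};

// Actual data-area footprint of a line, in segments: an uncompressed 64-byte
// line always occupies 8 segments, a compressed line occupies CSize segments.
inline unsigned actualSizeInSegments(const TagEntry& t) {
  return t.cStatus ? t.cSize : 8;
}

// A line's starting segment is the sum of the actual sizes of the lines ahead
// of it in the set's data area (the segment_offset relation on the backup slide).
inline unsigned segmentOffset(const TagEntry* set, unsigned k) {
  unsigned off = 0;
  for (unsigned i = 0; i < k; ++i) off += actualSizeInSegments(set[i]);
  return off;
}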

14

Decoupled Variable-Segment Cache

• Example cache set (tag area / data area):

  Tag      Compression status   Compressed size (segments)
  Addr A   uncompressed         3
  Addr B   compressed           2
  Addr C   compressed           6
  Addr D   compressed           4   (tag is present, but the line isn't in the data area)

15

Outline

• Background
• Compressed Cache Design
• Adaptive Compression
  – Key Insight
  – Classification of Cache Accesses
  – Performance Evaluation
• CMP Cache and Link Compression
• Interactions with Hardware Prefetching
• Balanced CMP Design
• Conclusions

16

Adaptive Compression

• Use the past to predict the future
• Key insight:
  – The LRU stack [Mattson et al., 1970] indicates, for each reference, whether compression helps or hurts

  Benefit(Compression) > Cost(Compression)?
    Yes ⇒ compress future lines
    No  ⇒ do not compress future lines

17

Cost/Benefit Classification

• Classify each cache reference
• Four-way set-associative cache with space for two uncompressed 64-byte lines
  – Total of 16 available segments

  (Example set: Addr A uncompressed, CSize 3; Addr B compressed, 2; Addr C compressed, 6; Addr D compressed, 4)

18

An Unpenalized Hit

• Read/Write Address A
  – LRU stack order = 1 ≤ 2 ⇒ hit regardless of compression
  – Uncompressed line ⇒ no decompression penalty
  – Neither cost nor benefit

  (Example set: A uncompressed 3, B compressed 2, C compressed 6, D compressed 4)

19

A Penalized Hit

• Read/Write Address B
  – LRU stack order = 2 ≤ 2 ⇒ hit regardless of compression
  – Compressed line ⇒ decompression penalty incurred
  – Compression cost

  (Example set: A uncompressed 3, B compressed 2, C compressed 6, D compressed 4)

20

An Avoided Miss

• Read/Write Address C
  – LRU stack order = 3 > 2 ⇒ hit only because of compression
  – Compression benefit: eliminated an off-chip miss

  (Example set: A uncompressed 3, B compressed 2, C compressed 6, D compressed 4)

21

An Avoidable Miss

• Read/Write Address D
  – Line is not in the cache, but its tag exists at LRU stack order = 4
  – Missed only because some lines are not compressed: Sum(CSize) = 15 ≤ 16
  – Potential compression benefit

  (Example set: A uncompressed 3, B compressed 2, C compressed 6, D compressed 4)

22

An Unavoidable Miss

• Read/Write Address E
  – Line is not in the cache and no tag exists (LRU stack order > 4)
  – Compression wouldn't have helped
  – Neither cost nor benefit

  (Example set: A uncompressed 3, B compressed 2, C compressed 6, D compressed 4)

23

Compression Predictor

• Estimate: Benefit(Compression) − Cost(Compression)
• Single counter: Global Compression Predictor (GCP)
  – Saturating up/down 19-bit counter
• GCP updated on each cache access
  – Benefit: increment by the memory latency
  – Cost: decrement by the decompression latency
  – Optimization: normalize to (memory_lat / decompression_lat, 1)
• Cache allocation
  – Allocate a compressed line if GCP ≥ 0
  – Allocate an uncompressed line if GCP < 0

24

Simulation Setup

• Workloads:
  – Commercial workloads [Computer'03, CAECW'02]:
    • OLTP: IBM DB2 running a TPC-C-like workload
    • SPECjbb
    • Static web serving: Apache and Zeus
  – SPEC2000 benchmarks:
    • SPECint: bzip, gcc, mcf, twolf
    • SPECfp: ammp, applu, equake, swim
• Simulator:
  – Simics full-system simulator, augmented with the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) [Martin et al., 2005, http://www.cs.wisc.edu/gems/]

25

System configuration

• Configuration parameters:

  Processor   Dynamically scheduled SPARC V9, 4-wide superscalar, 64-entry instruction window, 128-entry reorder buffer
  L1 Cache    Split I&D, 64KB each, 4-way SA, 64B line, 3-cycle access
  L2 Cache    Unified 4MB, 8-way SA, 64B line, 15-cycle access latency + 5-cycle decompression latency (if needed)
  Memory      4GB DRAM, 400-cycle access time, 16 outstanding requests

26

Simulated Cache Configurations

• Always: all compressible lines are stored in compressed format
  – Decompression penalty for all compressed lines
• Never: all cache lines are stored in uncompressed format
  – Cache is 8-way set-associative with half the number of sets
  – Incurs no decompression penalty
• Adaptive: the adaptive compression scheme

27

Performance

[Chart: normalized runtime (0–1.2) for Never, Always, and Adaptive across SPECint (bzip, gcc, mcf, twolf), SPECfp (ammp, applu, equake, swim), and commercial (apache, zeus, oltp, jbb) benchmarks]

28

Performance

[Chart: normalized runtime for Never, Always, and Adaptive (same data as the previous slide)]

29

Performance

[Chart: normalized runtime for Never, Always, and Adaptive]

• Always compression: up to 34% speedup on some benchmarks, up to 16% slowdown on others

30

Performance

[Chart: normalized runtime for Never, Always, and Adaptive]

• Adaptive performs similar to the best of Always and Never

31

Cache Miss Rates

[Chart: normalized miss rate for Never, Always, and Adaptive; values shown above the bars: bzip 1.3, gcc 2.9, mcf 39.9, twolf 5.0, ammp 0.2, applu 15.7, equake 11.1, swim 41.6, apache 15.7, zeus 15.3, oltp 3.6, jbb 3.9]

32

Optimal Adaptive Compression?

• Optimal: Always with no decompression penalty

[Chart: normalized runtime for Never, Always, Adaptive, and Optimal across bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb]

33

Adapting to L2 Size

[Chart: normalized runtime for Never, Always, and Adaptive on ammp and mcf as the L2 size varies from 256KB to 16MB (8-way); the label under each configuration gives the penalized hits per avoided miss]

34

Adaptive Compression: Summary

• Cache compression increases cache capacity but increases cache hit latency
  – Helps some benchmarks (e.g., apache, mcf)
  – Hurts other benchmarks (e.g., gcc, ammp)
• Adaptive compression
  – Uses the (LRU) replacement stack to determine whether compression helps or hurts
  – Updates a single global saturating counter on cache accesses
• Adaptive compression performs similar to the better of Always Compress and Never Compress

35

Outline

• Background
• Compressed Cache Design
• Adaptive Compression
• CMP Cache and Link Compression
• Interactions with Hardware Prefetching
• Balanced CMP Design
• Conclusions

36

Compressed Cache Hierarchy (CMP)

[Diagram: n processors, each with an uncompressed L1 cache, share a compressed L2; compression/decompression logic and an uncompressed-line bypass sit between the L1s and the L2; an L3/memory controller (which could also compress/decompress) connects to/from other chips and memory]

37

Link Compression

• The on-chip L3/memory controller transfers compressed messages
• Data messages
  – 1-8 sub-messages (flits), 8 bytes each
• The off-chip memory controller combines flits and stores the data to memory

[Diagram: CMP with processors/L1 caches, L2 cache, and on-chip L3/memory controller, connected over the pins to an off-chip memory controller and memory]
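A small illustration of the flit packing implied above; the structure and helper names are hypothetical, and the header/control encoding is not shown:

#include <cstdint>

// A data message carries a (possibly compressed) 64-byte line as 1-8 flits
// of 8 bytes each; the flit count is a simple ceiling division.
struct DataMessage {
  uint16_t compressedBytes;  // size of the line after FPC, 1..64
};

inline unsigned flitsNeeded(const DataMessage& m) {
  unsigned flits = (m.compressedBytes + 7) / 8;  // round up to whole 8-byte flits
  return flits == 0 ? 1 : flits;                 // at least one flit per message
}

// Example: a line compressed to 26 bytes ships in 4 flits instead of 8,
// roughly halving the pin bandwidth consumed by that transfer.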

38

Hardware Stride-Based Prefetching

• L2 prefetching
  + Hides memory latency
  − Increases pin bandwidth demand
• L1 prefetching
  + Hides L2 latency
  − Increases L2 contention and on-chip bandwidth demand
  − Triggers L2 fill requests ⇒ increases pin bandwidth demand
• Questions (a sketch of the stride mechanism follows below):
  – Does compression interfere positively or negatively with hardware prefetching?
  – How does a system with both compare to a system with only compression or only prefetching?
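For reference, a generic sketch of the kind of stride-stream training such a prefetcher performs; this is an illustration, not the Power4/Power5 mechanism, and the structure names and training policy are hypothetical:

#include <cstdint>
#include <vector>

// Generic stride-stream prefetcher sketch: 8 streams per cache per core and a
// prefetch degree of 6 follow the evaluation parameters; everything else is
// simplified for illustration.
class StridePrefetcher {
  struct Stream { uint64_t lastAddr; int64_t stride; };
  std::vector<Stream> streams_;
  static constexpr unsigned kMaxStreams = 8;
  static constexpr unsigned kDegree = 6;

 public:
  // Called on a demand miss; returns block addresses to prefetch.
  std::vector<uint64_t> onMiss(uint64_t blockAddr) {
    std::vector<uint64_t> out;
    for (auto& s : streams_) {
      int64_t delta = static_cast<int64_t>(blockAddr) - static_cast<int64_t>(s.lastAddr);
      if (delta == 0) return out;
      if (s.stride != 0 && delta == s.stride) {
        // Stream confirmed: prefetch the next kDegree blocks along the stride.
        for (unsigned i = 1; i <= kDegree; ++i)
          out.push_back(blockAddr + static_cast<uint64_t>(i * s.stride));
        s.lastAddr = blockAddr;
        return out;
      }
      if (s.stride == 0) {
        // Second miss trains an untrained stream (unit, negative, or non-unit stride).
        s.stride = delta;
        s.lastAddr = blockAddr;
        return out;
      }
    }
    // No stream matched: allocate a new one (evict the oldest if all are busy).
    if (streams_.size() >= kMaxStreams) streams_.erase(streams_.begin());
    streams_.push_back({blockAddr, 0});
    return out;
  }
};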


39

Interactions Terminology

• Assume a base system S with two architectural enhancements A and B; all systems run program P

  Speedup(A)   = Runtime(P, S) / Runtime(P, A)
  Speedup(B)   = Runtime(P, S) / Runtime(P, B)
  Speedup(A,B) = Speedup(A) × Speedup(B) × (1 + Interaction(A,B))
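A worked example of this definition with hypothetical runtimes (not measured values from the thesis):

#include <iostream>

int main() {
  double base = 100.0, withA = 80.0, withB = 90.0, withAB = 68.0;  // hypothetical runtimes
  double speedupA  = base / withA;    // 1.25
  double speedupB  = base / withB;    // ~1.11
  double speedupAB = base / withAB;   // ~1.47
  // Interaction(A,B) = Speedup(A,B) / (Speedup(A) * Speedup(B)) - 1
  double interaction = speedupAB / (speedupA * speedupB) - 1.0;
  std::cout << "Interaction(A,B) = " << interaction * 100 << "%\n";  // positive: A and B reinforce each other
  return 0;
}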

40

Compression and Prefetching Interactions

• Positive interactions:
  + L1 prefetching hides part of the decompression overhead
  + Link compression reduces the extra bandwidth demand caused by prefetching
  + Cache compression increases effective L2 size; L2 prefetching increases the working set size
• Negative interactions:
  − L2 prefetching and L2 compression can eliminate the same misses
⇒ Is Interaction(Compression, Prefetching) positive or negative?

41

Evaluation

• 8-core CMP
• Cores: single-threaded, out-of-order superscalar with a 64-entry instruction window, 128-entry ROB, 5 GHz clock frequency
• L1 caches: 64K instruction, 64K data, 4-way SA, 320 GB/sec total on-chip bandwidth (to/from L1), 3-cycle latency
• Shared L2 cache: 4 MB, 8-way SA (uncompressed), 15-cycle uncompressed latency, 128 outstanding misses
• Memory: 400-cycle access latency, 20 GB/sec memory bandwidth
• Prefetching:
  – Similar to prefetching in IBM's Power4 and Power5
  – 8 unit/negative/non-unit stride streams for L1 and L2 per processor
  – Issue 6 L1 prefetches on an L1 miss
  – Issue 25 L2 prefetches on an L2 miss

42

Performance

[Chart: speedup (0–1.6) for No Compression or Prefetching, Cache Compression, Link Compression, and Cache and Link Compression across commercial (apache, zeus, oltp, jbb) and SPEComp (art, apsi, fma3d, mgrid) benchmarks]

43

Performance

[Chart: same speedup comparison as the previous slide]

44

Performance

[Chart: speedup for the compression configurations]

• Cache compression provides speedups of up to 18%

45

Performance

[Chart: speedup for the compression configurations]

• Link compression speeds up bandwidth-limited applications

46

Performance

[Chart: speedup for the compression configurations]

• Cache and link compression together provide speedups of up to 22%

47

Performance

[Chart: speedup (0–1.6) for No Compression or Prefetching, Compression, Prefetching, and Compression and Prefetching across apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

48

Performance

[Chart: same speedup comparison]

• Prefetching speeds up all benchmarks except jbb (up to 21%)

49

Performance

[Chart: same speedup comparison]

• Compression and prefetching together achieve speedups of up to 51%

50

Interactions Between Prefetching and Compression

[Chart: Interaction(Compression, Prefetching) in percent (−5 to 25) for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

• The interaction is positive for seven of the eight benchmarks

51

Positive Interaction: Pin Bandwidth

[Chart: normalized pin bandwidth demand (0–3.5) for No Compression or Prefetching, L1 and L2 Prefetching, Cache and Link Compression, and Both, across apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid; baseline pin bandwidth (GB/sec) without compression or prefetching: 8.8, 7.3, 5.0, 6.6, 7.6, 21.5, 27.7, 14.4]

• Compression saves the bandwidth consumed by prefetching

52

Negative Interaction: Avoided Misses

[Chart: percentage of misses (stacked, 0–200%) broken into Unavoidable Misses, Avoided by Compression Only, Avoided by Prefetching Only, Avoided by Both, Extra Unavoidable Prefetches, and Extra Avoidable Prefetches, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

• Only a small fraction of misses (<9%) is avoided by both compression and prefetching

53

Sensitivity to #Cores

[Chart: performance improvement (%) on zeus for 1-16 cores, comparing PF Only, Compr Only, PF+Compr, PF+2x BW, and PF+2x L2]


59

Sensitivity to #Cores

[Chart: performance improvement (%) on jbb for 1-16 cores, same configurations; improvements range from about −30% to +10%]

60

Sensitivity to Pin Bandwidth

[Chart: Interaction(PF, Compr) in percent (0–20) for pin bandwidths of 10, 20, 40, and 80 GB/sec across apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

• Interactions are positive for most configurations

61

Compression and Prefetching: Summary

• More cores on a CMP increase demand for:
  – On-chip (shared) caches
  – Off-chip pin bandwidth
• Prefetching further increases demand on both resources
• Cache and link compression alleviate that demand
• Compression interacts positively with hardware prefetching

62

Outline

• Background
• Compressed Cache Design
• Adaptive Compression
• CMP Cache and Link Compression
• Interactions with Hardware Prefetching
• Balanced CMP Design
  – Analytical Model
  – Simulation
• Conclusions

63

Balanced CMP Design

• Compression can shift this balance
  – Increases effective cache size (small area overhead)
  – Increases effective pin bandwidth
  – Can we afford more cores in a CMP?
• Explore via an analytical model and simulation

[Diagram: two CMP floorplans trading cores for cache (e.g., 2 cores with a large L2 vs. 8 cores with a small L2)]

64

Simple Analytical Model

• Provides intuition on the core vs. cache trade-off
• Model simplifying assumptions:
  – Pin bandwidth demand follows an M/D/1 queuing model
  – Miss rate decreases with the square root of the increase in cache size
  – Blocking, in-order processor
  – Some parameters are fixed as the number of processors changes
  – Uses IPC instead of a work-related metric

65

Throughput (IPC)

[Chart: modeled throughput (IPC) vs. number of processors for no compression, cache compression, link compression, and cache + link compression; 20 GB/sec bandwidth, 400-cycle memory latency, 10 misses per 1K instructions]

• Cache compression provides speedups of up to 26% (29% when combined with link compression)
• The speedup is higher at the optimal configuration

66

Simulation (20 GB/sec bandwidth)

[Chart: simulated zeus throughput (transactions per 1B cycles) vs. number of processors at 20 GB/sec, for No PF or Compr, PF Only, Compr Only, and Both]

• Compression and prefetching combine to significantly improve throughput

67

Compression & Prefetching Interaction

[Chart: Interaction(PF, Compr) in percent (0–30) vs. number of processors for apache, zeus, oltp, jbb]

• The interaction is positive for most configurations (and for all "optimal" configurations)

68

Balanced CMP Design: Summary

• The analytical model can qualitatively predict throughput
  – Provides intuition into the trade-off
  – Quickly analyzes sensitivity to CMP parameters
  – Not accurate enough to estimate absolute throughput
• Compression improves throughput across all configurations
  – Larger improvement for the "optimal" configuration
• Compression can shift the balance towards more cores
• Compression interacts positively with prefetching for most configurations

69

Related Work (1/2)

• Memory compression
  – IBM MXT technology
  – Compression schemes: X-Match, X-RL
  – Significance-based compression: Ekman and Stenstrom
• Virtual memory compression
  – Wilson et al.: varying compression cache size
• Cache compression
  – Selective compressed cache: compress blocks to half size
  – Frequent value cache: frequent L1 values stored in the cache
  – Hallnor and Reinhardt: indirect-indexed cache used for compression

70

Related Work (2/2)

• Link compression
  – Farrens and Park: address compaction
  – Citron and Rudolph: table-based approach for addresses and data
• Prefetching in CMPs
  – IBM's Power4 and Power5 stride-based prefetching
  – Beckmann and Wood: prefetching improves 8-core performance
  – Ganusov and Burtscher: one CMP core dedicated to prefetching
• Balanced CMP design
  – Huh et al.: pin bandwidth is a first-order constraint
  – Davis et al.: simple chip multi-threaded cores maximize throughput

71

Conclusions

• CMPs increase demand on caches and pin bandwidth
  – Prefetching further increases that demand
• Cache compression
  + Increases effective cache size
  − Increases cache access time
• Link compression decreases bandwidth demand
• Adaptive compression
  – Helps programs that benefit from compression
  – Does not hurt programs that are hurt by compression
• CMP cache and link compression
  – Improve CMP throughput
  – Interact positively with hardware prefetching
⇒ Compression improves CMP performance

72

Backup Slides

• Moore's Law: CPU vs. Memory Speed
• Moore's Law (1965)
• Software Trends
• Decoupled Variable-Segment Cache
• Classification of L2 Accesses
• Compression Ratios
• Seg. Compression Ratios (SPECint, SPECfp, Commercial)
• Frequent Pattern Histogram
• Segment Histogram
• (LRU) Stack Replacement
• Cache Bits Read or Written
• Sensitivity to L2 Associativity
• Sensitivity to Memory Latency
• Sensitivity to Decompression Latency
• Sensitivity to Cache Line Size
• Phase Behavior
• Commercial CMP Designs
• CMP Compression – Miss Rates
• CMP Compression: Pin Bandwidth Demand
• CMP Compression: Sensitivity to L2 Size
• CMP Compression: Sensitivity to Memory Latency
• CMP Compression: Sensitivity to Pin Bandwidth
• Prefetching Properties (8p)
• Sensitivity to #Cores – OLTP
• Sensitivity to #Cores – Apache
• Analytical Model: IPC
• Model Parameters
• Model - Sensitivity to Memory Latency
• Model - Sensitivity to Pin Bandwidth
• Model - Sensitivity to L2 Miss Rate
• Model - Sensitivity to Compression Ratio
• Model - Sensitivity to Decompression Penalty
• Model - Sensitivity to Perfect CPI
• Simulation (20 GB/sec bandwidth) – apache
• Simulation (20 GB/sec bandwidth) – oltp
• Simulation (20 GB/sec bandwidth) – jbb
• Simulation (10 GB/sec bandwidth) – zeus
• Simulation (10 GB/sec bandwidth) – apache
• Simulation (10 GB/sec bandwidth) – oltp
• Simulation (10 GB/sec bandwidth) – jbb
• Compression & Prefetching Interaction – 10 GB/sec pin bandwidth
• Model Error (apache, zeus, oltp, jbb)
• Online Transaction Processing (OLTP)
• Java Server Workload (SPECjbb)
• Static Web Content Serving: Apache

73

Moore’s Law: CPU vs. Memory Speed

[Chart: CPU cycle time vs. DRAM latency (ns, log scale), 1982–2006]

• CPU cycle time: ~500 times faster since 1982
• DRAM latency: only ~5 times faster since 1982

74

Moore’s Law (1965)

[Chart: transistors per chip (Intel), 1971–2004, log scale from 1,000 to 10,000,000,000]

• Almost 75% increase per year

75

Software Trends

• Software trends favor more cores and higher off-chip bandwidth

[Chart: TPC-C throughput (tpmC, thousands), Sep 2000 – Mar 2005]

76

Decoupled Variable-Segment Cache

[Figure: tag area holds Tag 0 … Tag 7 plus LRU state; the data area is divided into 8-byte segments (16 even and 16 odd), read through a 16-byte-wide 2-input multiplexor]

Each tag holds: address tag, permissions, CStatus, and CSize (the compression tag).

Permissions: M (modified), S (shared), I (invalid), NP (not present)
CStatus: 1 if the line is compressed, 0 otherwise
CSize: size of the compressed line (in segments) if compressed

CStatus and CSize determine the actual size (in segments) of the cache line:

  CStatus   CSize   Actual size
  0         s       8
  1         s       s

The starting segment of line k is the sum of the actual sizes of the preceding lines:

  segment_offset(k) = Σ_{i=1}^{k−1} actual_size(i)

77

Classification of L2 Accesses

• Cache hits:
  • Unpenalized hit: hit to an uncompressed line that would have hit without compression
  − Penalized hit: hit to a compressed line that would have hit without compression
  + Avoided miss: hit to a line that would NOT have hit without compression
• Cache misses:
  + Avoidable miss: miss to a line that would have hit with compression
  • Unavoidable miss: miss to a line that would have missed even with compression
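To make the classification concrete, a sketch that combines the decoupled tags, the LRU stack depth, and the Sum(CSize) test from the adaptive-compression slides; the geometry matches the 4-tag, 16-segment example set, and the helper names are illustrative:

#include <cstdint>

enum AccessClass { UNPENALIZED_HIT, PENALIZED_HIT, AVOIDED_MISS,
                   AVOIDABLE_MISS, UNAVOIDABLE_MISS };

// One tag in the example set: 4 tags, data space of 16 eight-byte segments
// (room for two uncompressed 64-byte lines).
struct Tag {
  bool     dataPresent;  // line currently stored in the data area
  bool     compressed;   // CStatus
  unsigned cSize;        // compressed size in segments (tracked even if stored uncompressed)
  unsigned lruDepth;     // 1 = most recently used
};

constexpr unsigned kTags = 4, kSegments = 16, kUncompressedDepth = 2;

// Classify an access that matched tag `t` (t == nullptr means no tag matched).
AccessClass classify(const Tag* set, const Tag* t) {
  if (t == nullptr) return UNAVOIDABLE_MISS;            // no tag: compression couldn't help
  if (t->dataPresent) {
    if (t->lruDepth <= kUncompressedDepth)              // would have hit without compression
      return t->compressed ? PENALIZED_HIT : UNPENALIZED_HIT;
    return AVOIDED_MISS;                                 // hit only because lines are compressed
  }
  // Tag present but data evicted: would compression have kept it resident?
  unsigned needed = 0;
  for (unsigned i = 0; i < kTags; ++i)
    if (set[i].lruDepth <= t->lruDepth) needed += set[i].cSize;
  return (needed <= kSegments) ? AVOIDABLE_MISS : UNAVOIDABLE_MISS;
}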

78

Compression Ratios

[Chart: compression ratio (0–6) for FPC, X-RL, and BRCL across gzip, bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb]

79

Seg. Compression Ratios - SPECint

[Chart: compression ratio (0–3) for FPC, Segmented FPC, X-RL, Segmented X-RL, BRCL, and Segmented BRCL on bzip, gcc, mcf, twolf]

80

Seg. Compression Ratios - SPECfp

[Chart: compression ratio (0–2) for the same schemes on ammp, applu, equake, swim]

81

Seg. Compression Ratios - Commercial

[Chart: compression ratio (0–2) for the same schemes on apache, zeus, oltp, jbb]

82

Frequent Pattern Histogram

[Chart: breakdown of 32-bit words (%) into Zeros, Compressible to 4 bits, 8 bits, 16 bits, and Uncompressible, for bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb]

83

Segment Histogram

[Chart: distribution (%) of compressed line sizes from 1 to 8 segments for bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb]

84

(LRU) Stack Replacement

• How do we differentiate penalized hits from avoided misses?
  – Only hits to the top half of the tags in the LRU stack are penalized hits
• How do we differentiate avoidable from unavoidable misses?

  Avoidable_Miss(k)  ⇔  Σ_{i : LRU(i) ≤ LRU(k)} CSize(i) ≤ 16

• The scheme does not depend on LRU replacement
  – Any replacement algorithm can manage the top half of the tags
  – Any stack algorithm can manage the remaining tags

85

Cache Bits Read or Written

[Chart: normalized cache bits read or written (0–2) for Never, Always, Adaptive across bzip, gcc, mcf, twolf, ammp, applu, equake, swim, apache, zeus, oltp, jbb]

86

Sensitivity to L2 Associativity

[Chart: normalized runtime for Never, Always, Adaptive on apache and mcf as the L2 associativity varies from 2- to 16-way (4MB cache)]

87

Sensitivity to Memory Latency

[Chart: normalized runtime for Never, Always, Adaptive on mcf and apache as the memory latency varies from 200 to 800 cycles]

88

Sensitivity to Decompression Latency

[Chart: normalized runtime vs. decompression latency (0–25 cycles) for never, always, and adaptive on ammp and mcf]

89

Sensitivity to Cache Line Size

[Chart: normalized runtime for Never, Always, Adaptive on zeus and mcf as the cache line size varies from 16 to 256 bytes; the value under each line size is the pin bandwidth demand]

90

Phase Behavior

[Chart: two panels vs. time (billions of cycles): the GCP predictor value (×1000, −200 to +200) and the effective cache size (0–8 MB)]

91

Commercial CMP Designs

• IBM Power5 chip:
  – Two processor cores, each 2-way multi-threaded
  – ~1.9 MB on-chip L2 cache
    • < 0.5 MB per thread with no sharing
    • Compare with 0.75 MB per thread in Power4+
  – Est. ~16 GB/sec maximum pin bandwidth
• Sun's Niagara chip:
  – Eight processor cores, each 4-way multi-threaded
  – 3 MB L2 cache
    • < 0.4 MB per core, < 0.1 MB per thread with no sharing
  – Est. ~22 GB/sec pin bandwidth

92

CMP Compression – Miss Rates

[Chart: miss rate with L2 compression, normalized to no compression, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid; values shown above the bars: 15.2, 11.3, 3.3, 3.5, 75.2, 9.3, 17.5, 5.1]

93

CMP Compression: Pin Bandwidth Demand

[Chart: pin bandwidth demand (GB/sec, 0–25) for No Compression, L2 Compression, Link Compression, and Both across apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

94

CMP Compression: Sensitivity to L2 Size

[Chart: normalized runtime for No Compression, L2 Compression, Link Compression, and Both on jbb and apache as the L2 size varies from 256KB to 16MB (8-way)]

95

CMP Compression: Sensitivity to Memory Latency

[Chart: normalized runtime for No Compression, L2 Compression, Link Compression, and Both on apache and fma3d as the memory latency varies from 200 to 800 cycles]

96

CMP Compression: Sensitivity to Pin Bandwidth

[Chart: normalized runtime for No Compression, L2 Compression, Link Compression, and Both on fma3d and apache as the pin bandwidth varies from 10 to 80 GB/sec]

97

Prefetching Properties (8p)

Benchmark | L1 I Cache: PF rate, coverage, accuracy | L1 D Cache: PF rate, coverage, accuracy | L2 Cache: PF rate, coverage, accuracy
apache    | 4.9, 16.4, 42.0   | 6.1, 8.8, 55.5    | 10.5, 37.7, 57.9
zeus      | 7.1, 14.5, 38.9   | 5.5, 17.7, 79.2   | 8.2, 44.4, 56.0
oltp      | 13.5, 20.9, 44.8  | 2.0, 6.6, 58.0    | 2.4, 26.4, 41.5
jbb       | 1.8, 24.6, 49.6   | 4.2, 23.1, 60.3   | 5.5, 34.2, 32.4
art       | 0.05, 9.4, 24.1   | 56.3, 30.9, 81.3  | 49.7, 56.0, 85.0
apsi      | 0.04, 15.7, 30.7  | 8.5, 25.5, 96.9   | 4.6, 95.8, 97.6
fma3d     | 0.06, 7.5, 14.4   | 7.3, 27.5, 80.9   | 8.8, 44.6, 73.5
mgrid     | 0.06, 15.5, 26.6  | 8.4, 80.2, 94.2   | 6.2, 89.9, 81.9

98

Sensitivity to #Cores - OLTP

[Chart: performance improvement (%) on oltp for 1-16 cores, same configurations as the zeus sensitivity chart, ranging from about −10% to +30%]

99

Sensitivity to #Cores - Apache

[Chart: performance improvement (%) on apache for 1-16 cores, same configurations, ranging from 0% to about +80%]

100

Analytical Model: IPC

The model computes chip throughput as IPC(N_p) = N_p / CPI(N_p), where the per-core CPI adds, to the perfect-cache CPI, the decompression penalty d_p and the L2 misses per instruction (a function of the per-core cache share and of sharing, via α and N_sharers) multiplied by the average L2 miss latency:

  MissLatency_L2 = MemoryLatency + LinkLatency

  LinkLatency = X + [ IPC(N_p) · Missrate_L2(S) · X² ] / [ 2 · (1 − IPC(N_p) · Missrate_L2(S) · X) ]

The link term is the M/D/1 waiting time on the pins for a deterministic per-miss transfer time X at an offered load of IPC(N_p) · Missrate_L2(S) misses per cycle.

101

Model Parameters

• Divide chip area between cores and caches
  – Area of one (in-order) core = 0.5 MB of L2 cache
  – Total chip area = 16 cores, or 8 MB of cache
  – Core frequency = 5 GHz
  – Available bandwidth = 20 GB/sec
• Model parameters (hypothetical benchmark)
  – Compression ratio = 1.75
  – Decompression penalty = 0.4 cycles per instruction
  – Miss rate = 10 misses per 1000 instructions for 1 processor with an 8 MB cache
  – IPC for one processor with a perfect cache = 1
  – Average #sharers per block = 1.3 (for #proc > 1)
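A small sketch that plugs these parameters into the stated assumptions (square-root miss-rate scaling and an M/D/1 pin model); the CPI expression here is a simplified stand-in for the thesis equation, so the numbers are only indicative of the core vs. cache trend:

#include <cmath>
#include <iostream>

int main() {
  const double areaUnits     = 16.0;   // total chip area: 16 cores or 8 MB of L2
  const double cpiPerfect    = 1.0;    // CPI with a perfect cache
  const double baseMissPer1K = 10.0;   // misses per 1K instructions at 1 core, 8 MB
  const double memLatency    = 400.0;  // cycles
  const double bytesPerMiss  = 64.0;   // one cache line per miss
  const double bwBytesPerCyc = 20e9 / 5e9;                     // 20 GB/s at 5 GHz = 4 bytes/cycle
  const double X             = bytesPerMiss / bwBytesPerCyc;   // per-miss pin transfer time (cycles)

  for (int cores = 1; cores <= 15; ++cores) {
    double cacheMB = (areaUnits - cores) * 0.5;                // remaining area spent on L2
    if (cacheMB <= 0) break;
    // sqrt rule: miss rate scales with 1/sqrt(cache size relative to 8 MB)
    double missesPerInst = (baseMissPer1K / 1000.0) * std::sqrt(8.0 / cacheMB);

    // Fixed point: chip IPC sets the pin load, which sets the link latency,
    // which feeds back into CPI (damped iteration for stability).
    double ipc = cores * 0.5;
    for (int it = 0; it < 100; ++it) {
      double load = ipc * missesPerInst * X;                   // pin utilization
      double linkLat = (load < 1.0)
          ? X + (ipc * missesPerInst * X * X) / (2.0 * (1.0 - load))  // M/D/1 delay
          : 1e9;                                               // saturated pins
      double cpi = cpiPerfect + missesPerInst * (memLatency + linkLat);
      ipc = 0.5 * ipc + 0.5 * (cores / cpi);
    }
    std::cout << cores << " cores, " << cacheMB << " MB L2: IPC ~ " << ipc << "\n";
  }
  return 0;
}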

102

Model - Sensitivity to Memory Latency

• Compression's impact is similar at both extremes of memory latency
• Compression can shift the optimal configuration towards more cores (though not significantly)

[Chart: modeled throughput (IPC) vs. number of processors for memory latencies of 200 to 800 cycles, with and without cache + link compression]

103

Model - Sensitivity to Pin Bandwidth

[Chart: modeled throughput (IPC) vs. number of processors for pin bandwidths of 10, 20, 40, 80, and 10000 GB/sec, with and without cache + link compression]

104

Model - Sensitivity to L2 Miss rate

[Chart: modeled throughput (IPC) vs. number of processors for miss rates of 1, 5, 10, 20, 40, and 100 misses per 1K instructions, with and without cache + link compression]

105

Model-Sensitivity to Compression Ratio

[Chart: modeled throughput (IPC) vs. number of processors for compression ratios of 1.1, 1.25, 1.5, 1.75, and 2.0, compared to no compression]

106

Model - Sensitivity to Decompression Penalty

[Chart: modeled throughput (IPC) vs. number of processors for decompression penalties of 0.0, 0.2, 0.4, 0.8, and 2.0 cycles per instruction, compared to no compression]

107

Model - Sensitivity to Perfect CPI

[Chart: modeled throughput (IPC) vs. number of processors for perfect-cache CPIs of 0.25, 0.5, 1.0, 2.0, and 3.0, with and without cache + link compression]

108

Simulation (20 GB/sec bandwidth) - apache

[Chart: apache throughput (transactions per 1B cycles) vs. number of processors, for No PF or Compr, PF Only, Compr Only, and Both]

109

Simulation (20 GB/sec bandwidth) - oltp

[Chart: oltp throughput (transactions per 1B cycles) vs. number of processors, same configurations]

110

Simulation (20 GB/sec bandwidth) - jbb

[Chart: jbb throughput (transactions per 1B cycles) vs. number of processors, same configurations]

111

Simulation (10 GB/sec bandwidth) - zeus

[Chart: zeus throughput (transactions per 1B cycles) vs. number of processors at 10 GB/sec, same configurations]

• Prefetching can degrade throughput for many systems
• Compression alleviates this performance degradation

112

Simulation (10 GB/sec bandwidth) - apache

[Chart: apache throughput (transactions per 1B cycles) vs. number of processors at 10 GB/sec, same configurations]

113

Simulation (10 GB/sec bandwidth) - oltp

[Chart: oltp throughput (transactions per 1B cycles) vs. number of processors at 10 GB/sec, same configurations]

114

Simulation (10 GB/sec bandwidth) - jbb

[Chart: jbb throughput (transactions per 1B cycles) vs. number of processors at 10 GB/sec, same configurations]

115

Compression & Prefetching Interaction

– 10 GB/sec pin bandwidth

�� Interaction is positive for most configurations (and all Interaction is positive for most configurations (and all

““optimaloptimal”” configurations)configurations)

0 2 4 6 8 10 12 14

#Processors

0

10

20

30

Interaction (PF, Compr) %

apache

zeus

oltp

jbb

116

Model Error – apache, zeus

[Chart: modeled vs. simulated transactions per 1B cycles for apache and zeus, with and without cache & link compression, as the number of processors varies]

117

Model Error – oltp, jbb

[Chart: modeled vs. simulated transactions per 1B cycles for oltp and jbb, with and without cache & link compression, as the number of processors varies]

118

Online Transaction Processing (OLTP)

• DB2 with a TPC-C-like workload
  – Based on the TPC-C v3.0 benchmark
  – We use IBM's DB2 V7.2 EEE database management system and an IBM benchmark kit to build the database and emulate users
  – 5 GB, 25000-warehouse database on eight raw disks and an additional dedicated database log disk
  – We scaled down the size of each warehouse while maintaining reduced ratios of 3 sales districts per warehouse, 30 customers per district, and 100 items per warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C specification)
  – Think and keying times for users are set to zero
  – 16 users per processor
  – Warmup interval: 100,000 transactions

119

Java Server Workload (SPECjbb)

• SPECjbb
  – We use Sun's HotSpot 1.4.0 Server JVM and Solaris's native thread implementation
  – The benchmark includes driver threads to generate transactions
  – The system heap size is set to 1.8 GB and the new-object heap size to 256 MB to reduce the frequency of garbage collection
  – 24 warehouses, with a data size of approximately 500 MB

120

Static Web Content Serving: Apache

• Apache
  – We use Apache 2.0.39 for SPARC/Solaris 9, configured to use pthread locks and minimal logging at the web server
  – We use the Scalable URL Request Generator (SURGE) as the client
  – SURGE generates a sequence of static URL requests that exhibit representative distributions for document popularity, document sizes, request sizes, temporal and spatial locality, and embedded document count
  – We use a repository of 20,000 files (totaling ~500 MB)
  – Clients have zero think time
  – We compiled both Apache and SURGE using Sun's WorkShop C 6.1 with aggressive optimization