A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang, Vahid F., Lysecky R. Proceedings of Design, Automation and Test in Europe Conference and Exhibition Volume: 1 Pages: 142 – 147 Feb. 2004

Page 1

A Self-Tuning Cache Architecture for Embedded Systems

Chuanjun Zhang, Vahid F., Lysecky R.

Proceedings of Design, Automation and Test in Europe Conference and Exhibition

Volume: 1

Pages: 142 – 147

Feb. 2004

Page 2

112/04/21 A Self-Tuning Cache Architecture for Embedded Systems 2/18

Abstract

Memory accesses can account for about half of a microprocessor system’s power consumption. Customizing a microprocessor cache’s total size, line size and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is, however, still a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids.

We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is power and performance costly. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.

Page 3

What’s the Problem?

  • Tuning a configurable cache to an application is beneficial for power and performance
    • How can the best cache configuration be obtained?
    • Sometimes increasing the cache size (or associativity) improves performance only slightly but increases energy greatly
  • Determining the best cache configuration via simulation
    • Straightforward, but slow, and cannot capture runtime behavior
  • Thus, it is essential to tune a configurable cache automatically and dynamically as an application executes

Page 4

Introduction

  • Previous work of this team
    • A highly configurable cache architecture [13], [14]
    • Four parameters that designers can configure:
      1) Cache total size: 8, 4 or 2 KB
      2) Associativity: 4, 2 or 1 way for 8 KB; 2 or 1 way for 4 KB; 1 way for 2 KB
      3) Cache line size: 64, 32 or 16 bytes
      4) Way prediction: ON or OFF
  • The proposed dynamic cache tuning method
    • A cache tuning heuristic implemented in on-chip hardware
    • Does not exhaustively try all possible cache configurations
    • Dynamically tunes the cache to an executing program
    • Automates the process of finding the best cache configuration
  • The configuration space may be much larger

Page 5

Energy Evaluation

  • Equations for total memory access energy consumption
    • Ehit: cache hit energy per cache access (related to cache size and associativity)
    • Emiss: cache miss energy (related to cache line size)
    • Estatic_per_cycle: static energy dissipation per cycle (related to cache size)
  • Equation for the heuristic cache tuner energy consumption
    • Timetotal: the total time used to finish one cache configuration search
    • NumSearch: the number of cache configurations searched
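The equation images on this slide did not survive extraction. The following is a minimal sketch of an energy model built from the three listed terms. It assumes that total energy is a dynamic part (hits and misses) plus a static part (cycles), and that the tuner energy is its power times the time per search times the number of searches. The tuner form agrees with the arithmetic shown on the area-and-power slide. All function and argument names are illustrative and do not come from the slides.

```python
def total_memory_energy(hits, misses, cycles,
                        e_hit, e_miss, e_static_per_cycle):
    """Assumed form: dynamic energy (hits, misses) plus static energy (cycles)."""
    dynamic = hits * e_hit + misses * e_miss
    static = cycles * e_static_per_cycle
    return dynamic + static


def tuner_energy(p_tuner_watts, time_total_seconds, num_search):
    """Assumed form: tuner power * time per configuration search * searches."""
    return p_tuner_watts * time_total_seconds * num_search


if __name__ == "__main__":
    # Hypothetical counts and per-event energies, for illustration only.
    print(total_memory_energy(hits=10, misses=2, cycles=100,
                              e_hit=1.0, e_miss=5.0, e_static_per_cycle=0.5))
```

Under this form, a larger cache raises e_hit and e_static_per_cycle but lowers the miss count, which is the tradeoff the tuner has to resolve.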

Page 6

Problem Overview

  • A naive tuning approach
    • Exhaustively tries all possible cache configurations
  • Two main drawbacks
    • Involves too many configurations
    • Requires too many cache flushes
      • Searching in an arbitrary order may require flushing the cache
  • Goal: develop a self-tuning heuristic that
    • Minimizes the number of cache configurations examined
    • Minimizes cache flushing
    • While still finding a near-optimal cache configuration

  1) Tunes the cache dynamically during execution
  2) Can be enabled or disabled by software

Page 7

Heuristic Development Through Analysis

  • Energy dissipation for benchmark parser at cache sizes from 1 KB to 1 MB
    [Figure: energy dissipation versus cache size, annotated with the three observations below]
    • Energy dissipation of off-chip memory decreases rapidly
    • An increase in cache performance and a decrease in total energy are observed
    • Performance improves slightly but energy increases significantly
  • However, this tradeoff point differs between applications and exists not only for cache size but also for cache associativity and line size
  • Therefore, the goal of the search heuristic is to find the configuration at this tradeoff point

Page 8

Determine the Impact of Each Parameter

  • The parameter with the greatest impact is configured first
    • Varying cache size has the biggest impact on miss rate and energy
    • Varying line size causes little energy variation for the I$ but more variation for the D$
    • Varying associativity has the smallest impact on energy consumption
    [Figure panels: different line sizes; different associativities]
  • Develop a search heuristic that finds the best cache size first, then the best line size, and finally the best associativity

Page 9

Minimizing Cache Flushing

  • The order in which the values of each parameter are varied matters
    • One order may require flushing; a different order may not
  • Cache flush analysis when changing cache size
    • Increasing the cache size is preferable to decreasing it
    • When decreasing the cache size, an original hit may turn into a miss
      • EX: addresses 000 (index=00) and 110 (index=10) are misses after shutdown
      • For the D$, a write-back is needed when the data in the shut-down ways is dirty
    • Increasing the cache size doesn’t require flushing
      • EX: addresses 100 (index=0) and 010 (index=0)
      • No write-back is needed, so flushing is avoided
    [Figure: example with an 8-byte memory]

Page 10

Minimizing Cache Flushing

  • Cache flush analysis when changing associativity
    • Increasing the associativity is preferable to decreasing it
    • Decreasing the associativity may turn a hit into a miss
      • EX: addresses 000 (index=0) and 100 (index=0)
    • Increasing the associativity causes no extra misses
      • EX: addresses 000 (index=00) and 010 (index=10)
      • Both are still hits after the associativity is increased
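The index values quoted in the flushing examples can be reproduced with a small sketch. It assumes 3-bit addresses (the 8-byte memory of the figure), one-byte lines, and an index taken from the low-order address bits. The helper names are hypothetical. The sketch shows only how an address maps to a set under one or two index bits. It does not model hits, misses, or the way-shutdown hardware itself.

```python
def cache_index(address: int, index_bits: int) -> int:
    """Low-order index bits of an address (one-byte lines assumed)."""
    return address & ((1 << index_bits) - 1)


def bits(value: int, width: int) -> str:
    """Zero-padded binary string, e.g. bits(2, 2) gives '10'."""
    return format(value, f"0{width}b")


if __name__ == "__main__":
    for address in (0b000, 0b110, 0b100, 0b010):
        print(bits(address, 3),
              "| 2 index bits:", bits(cache_index(address, 2), 2),
              "| 1 index bit:", bits(cache_index(address, 1), 1))
```

With two index bits, 110 lives at index 10. With one index bit, the same address maps to index 0, so a line left behind in a shut-down region can no longer be reached by a lookup.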

Page 11

Search Heuristic for Determining the Best Cache Configuration

  • Inputs to the heuristic
    • Cache size: C[i], 1 ≤ i ≤ n; n=3 in our configurable cache
      • C[1]=2 KB, C[2]=4 KB, C[3]=8 KB
    • Line size: L[j], 1 ≤ j ≤ p; p=3 in our configurable cache
      • L[1]=16 bytes, L[2]=32 bytes, L[3]=64 bytes
    • Associativity: A[k], 1 ≤ k ≤ m; m=3 in our configurable cache
      • A[1]=1 way, A[2]=2 way, A[3]=4 way
    • Way prediction: W[1]=OFF, W[2]=ON
  [Figure: flowchart of the search heuristic. Starting from energy E[1], the cache size is increased as long as doing so decreases total energy. The search tunes cache size first, then line size, then associativity, and finally way prediction.]
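The flowchart can be approximated by a greedy search over one parameter at a time. The sketch below is an illustration under stated assumptions, not the authors' hardware. It assumes that tuning starts from the smallest configuration, that each parameter is increased only while the measured energy keeps dropping, and that way prediction is tried only for set-associative caches. `measure_energy` is a hypothetical callback standing in for the tuner's energy calculation.

```python
from typing import Callable, Dict

SIZES_KB = [2, 4, 8]            # C[1..3]
LINE_BYTES = [16, 32, 64]       # L[1..3]
WAYS = [1, 2, 4]                # A[1..3]
MAX_WAYS = {2: 1, 4: 2, 8: 4}   # associativity allowed per cache size

Config = Dict[str, int]


def tune(measure_energy: Callable[[Config], float]) -> Config:
    """Greedy search: size first, then line size, associativity, way prediction."""
    best = {"size_kb": 2, "line_bytes": 16, "ways": 1, "way_pred": 0}
    best_energy = measure_energy(best)

    def climb(key, values):
        nonlocal best, best_energy
        for value in values:
            candidate = dict(best, **{key: value})
            energy = measure_energy(candidate)
            if energy >= best_energy:
                break                      # stop at the first non-improvement
            best, best_energy = candidate, energy

    climb("size_kb", SIZES_KB[1:])
    climb("line_bytes", LINE_BYTES[1:])
    climb("ways", [w for w in WAYS[1:] if w <= MAX_WAYS[best["size_kb"]]])
    if best["ways"] > 1:                   # assumed: prediction needs 2+ ways
        climb("way_pred", [1])
    return best
```

Because each parameter only ever grows during the search, the ordering matches the flush analysis: sizes and associativities are visited from smallest to largest.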

Page 12

The Efficiency of the Search Heuristic

  • Suppose there are n configurable parameters, and each parameter has m values
    • A total of m^n different combinations
    • Our heuristic searches at most m*n combinations
  • EX: 10 configurable parameters, each with 10 values
    • Brute-force search: 10^10 combinations
    • Our search heuristic: 100 combinations instead
  • Thus, our search heuristic
    • Minimizes the number of cache configurations examined
    • Avoids most of the cache flushing
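The two counts on this slide are easy to check directly. This short sketch uses hypothetical function names and only restates the m^n and m*n expressions from the slide.

```python
def brute_force_count(m: int, n: int) -> int:
    """Combinations tried by exhaustive search: m values for each of n parameters."""
    return m ** n


def heuristic_upper_bound(m: int, n: int) -> int:
    """Most combinations tried when parameters are tuned one at a time."""
    return m * n


if __name__ == "__main__":
    print(brute_force_count(10, 10), heuristic_upper_bound(10, 10))
```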

Page 13

Implementing the Heuristic in Hardware

  • A hardware-based approach is preferable to software
    • A software approach would not only change the runtime behavior of the application but also affect the cache behavior
  • FSMD of the cache tuner
    • Ehit: values correspond to 8 KB 4-way, 2-way and 1-way; 4 KB 2-way and 1-way; 2 KB 1-way
    • Emiss: values correspond to line sizes of 16, 32 and 64 bytes
    • Estatic_per_cycle: values correspond to cache sizes of 8 KB, 4 KB and 2 KB
    • Configure register (7 bits wide): 2 bits for cache size, 2 bits for line size, 2 bits for associativity and 1 bit for way prediction
  [Figure: FSMD datapath of the cache tuner, with registers for runtime information, application-independent information, the result of the energy calculation, the lowest energy of the configurations tested, and the configure register used to configure the cache]
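The 7-bit configure register can be illustrated with a pack and unpack pair. The field widths come from the slide. The bit ordering (size in the top two bits, way prediction in bit 0) and the encoding of each field as an index into its value list are assumptions made only for this sketch.

```python
SIZES_KB = (2, 4, 8)
LINE_BYTES = (16, 32, 64)
WAYS = (1, 2, 4)


def pack_config(size_kb: int, line_bytes: int, ways: int, way_pred: bool) -> int:
    """Pack a configuration into 7 bits: [6:5] size, [4:3] line, [2:1] ways, [0] prediction."""
    return ((SIZES_KB.index(size_kb) << 5)
            | (LINE_BYTES.index(line_bytes) << 3)
            | (WAYS.index(ways) << 1)
            | int(way_pred))


def unpack_config(reg: int):
    """Inverse of pack_config; returns (size_kb, line_bytes, ways, way_pred)."""
    return (SIZES_KB[(reg >> 5) & 0b11],
            LINE_BYTES[(reg >> 3) & 0b11],
            WAYS[(reg >> 1) & 0b11],
            bool(reg & 1))
```

For example, `pack_config(8, 64, 4, True)` sets every field to its largest code and still fits in 7 bits.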

Page 14

Implementing the Heuristic in Hardware

  • FSM of the cache tuner
    • Composed of three smaller state machines
      • PSM: tunes each cache parameter (determines the best cache size, line size, associativity and way prediction)
      • VSM: determines the energy for the possible values of each parameter (e.g. 2 KB, 4 KB, 8 KB)
      • CSM: controls the calculation of energy
    • EX: if the current state of the PSM is P1, state V1 of the VSM determines the energy of the 2 KB cache, V2 that of the 4 KB cache, and V3 that of the 8 KB cache
    • Why do we need the CSM?
      • Because there are three multiplications but only one multiplier
      • Four states are used to compute the energy
    • PSM states depend on the VSM, and VSM states depend on the CSM
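One way to picture the CSM is as a four-step sequence that reuses a single multiplier. In the sketch below, the first three states each perform one multiplication and accumulate the product, and the fourth state marks the result as ready. The state names C1 to C4 and the exact work done in each state are assumptions, since the slide gives only the counts (three multiplications, one multiplier, four states).

```python
def csm_compute_energy(hits, misses, cycles,
                       e_hit, e_miss, e_static_per_cycle):
    """Step through four assumed CSM states, using one multiply per state."""
    operands = [(hits, e_hit), (misses, e_miss), (cycles, e_static_per_cycle)]
    energy = 0
    trace = []
    state = 0
    while state < 4:
        trace.append(f"C{state + 1}")
        if state < 3:
            a, b = operands[state]
            energy += a * b        # the single shared multiplier
        state += 1                 # C4 only signals that energy is ready
    return energy, trace
```

The VSM would invoke such a sequence once per candidate value, and the PSM would move to the next parameter once the VSM has finished.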

Page 15

Results of the Search Heuristic

  • Searches 5.8 configurations on average, compared to 27 configurations for exhaustive search
  • Finds the optimal configuration in nearly all cases, except
    • D-cache configuration of pjpeg
    • D-cache configuration of mpeg2
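The figure of 27 configurations can be reproduced by enumerating the parameter values from the Introduction slide. The sketch assumes that way prediction is meaningful only for set-associative configurations (two or more ways). That assumption is not stated on the slides, but it yields exactly 27 valid configurations.

```python
from itertools import product

MAX_WAYS = {2: 1, 4: 2, 8: 4}   # associativity allowed per cache size in KB


def valid_configs():
    """Enumerate (size_kb, line_bytes, ways, way_pred) tuples the cache supports."""
    configs = []
    for size_kb, line_bytes, ways, way_pred in product(
            (2, 4, 8), (16, 32, 64), (1, 2, 4), (False, True)):
        if ways > MAX_WAYS[size_kb]:
            continue               # e.g. no 4-way configuration at 4 KB
        if way_pred and ways == 1:
            continue               # assumed: no prediction for direct-mapped
        configs.append((size_kb, line_bytes, ways, way_pred))
    return configs


if __name__ == "__main__":
    print(len(valid_configs()))
```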

Page 16

The Reason for the Inaccuracy

  • A larger cache consumes more dynamic and static energy
    • A larger cache is only preferable if the reduction in Eoff_chip_mem overcomes the energy increase due to the larger cache
  • For mpeg2, with an 8 KB cache the reduction in Eoff_chip_mem is not large enough to overcome the energy added by the larger cache
    • Therefore, the heuristic selects a cache size of 4 KB
  • When associativity is considered (increased from 1 way to 2 way), the miss rate of the 8 KB cache is significantly reduced
  • The heuristic doesn’t choose the optimal configuration because
    • When it is determining the best cache size, it doesn’t predict what will happen when associativity is increased

Page 17

Area and Power of the Tuning Hardware

  • The area of the cache tuner is about 4000 gates, or 0.039 mm² in 0.18 µm technology
    • An increase in area of just 3% over a MIPS 4Kp with cache
  • The power consumption of the cache tuner is 2.69 mW at 200 MHz
    • Only 0.5% of the power consumed by a MIPS processor
  • The average energy consumption of the cache tuner
    • 164 cycles are used to evaluate one cache configuration
    • The average number of configurations searched is 5.4
    • Energy = 2.69 mW * (164 / 200 MHz) * 5.4 = 11.9 nJ
    • The average energy dissipation of the benchmarks is 2.34 J, so the tuner energy is negligible
  • Impact of avoiding flushing by careful ordering of the search
    • When the cache size is configured in the order of 8 KB down to 2 KB
      • The average energy consumption due to writing back dirty data is 5.38 mJ
    • Thus, if we searched the possible cache sizes from largest to smallest, the energy due to cache flushes would be 480,000 times that of the cache tuner
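The tuner-energy arithmetic on this slide can be reproduced in a few lines. The inputs are the rounded values quoted on the slide and the variable names are illustrative. With these rounded inputs, the ratio of flush energy to tuner energy comes out near 450,000, which is the same order of magnitude as the figure quoted on the slide.

```python
P_TUNER_W = 2.69e-3         # 2.69 mW
CLOCK_HZ = 200e6            # 200 MHz
CYCLES_PER_SEARCH = 164
AVG_SEARCHES = 5.4
E_FLUSH_J = 5.38e-3         # 5.38 mJ of dirty-data write-back energy


def tuner_energy_joules() -> float:
    """Power * (cycles / clock frequency) * number of configurations searched."""
    time_per_search = CYCLES_PER_SEARCH / CLOCK_HZ
    return P_TUNER_W * time_per_search * AVG_SEARCHES


if __name__ == "__main__":
    energy = tuner_energy_joules()
    print(f"tuner energy: {energy * 1e9:.1f} nJ")
    print(f"flush / tuner ratio: {E_FLUSH_J / energy:,.0f}")
```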

Page 18

Conclusions

  • Proposed a self-tuning on-chip CAD method that finds the best configuration automatically
    • Relieves designers of the burden of determining the best configuration
    • Increases the usefulness and acceptance of a configurable cache
  • Our cache tuning heuristic
    • Minimizes the number of configurations examined
    • Minimizes the need for cache flushing
    • Reduces memory-access energy by 40% on average, compared to a standard cache