High-Performance Low-Power Cache Memory
Architectures
Koji Inoue
Kyushu University
January 2001
Abstract
Recent remarkable advances in VLSI technology have been increasing processor speed and
DRAM capacity. However, these advances have also introduced a large and growing performance
gap between the processor and main memory. Cache memories have long been employed on
processor chips in order to bridge this processor-memory performance gap, and researchers
have made great efforts to improve cache performance.
However, the environment surrounding processor-chip design has been changing. 1) The rapidly
growing mobile market strongly demands not only high performance but also low energy dissipation
in order to extend battery life. 2) Recent VLSI technology has made it possible to integrate the
processor and main memory on the same chip, so that the chip boundary between the cache
and main memory can be eliminated. These changes suggest that we need to keep reconsidering
cache architectures for high-performance, low-energy computer systems.
Reducing the frequency of off-chip accesses has two main advantages: it reduces memory-access
latency and it reduces the energy dissipated for driving external I/O pins. The most
straightforward way to improve the performance/energy efficiency of memory systems is to
invest the increasing transistor budget in the cache memories (i.e., to increase cache capacity).
Increasing cache capacity improves cache-hit rates, so that more memory accesses can be
confined on-chip. However, it also leads to increases in cache-access latency, the time required
to access the cache, and cache-access energy, the energy dissipated per cache access. Since
almost all memory accesses are served by cache memories, improving the
performance/energy efficiency of cache memories is one of the most important challenges.
This thesis introduces adaptive cache-management techniques for high-performance, low-energy
processor chips. The caches proposed in this thesis attempt to eliminate unnecessary
operations in order to reduce energy dissipation and improve performance.
In the first part of this thesis, we introduce a cache architecture for reducing cache-access
energy, called the way-predicting set-associative cache. In conventional set-associative caches,
all ways are searched in parallel because the cache-access time is critical. In fact, on a
cache hit, only one way holds the data desired by the processor, so accessing the
remaining ways is unnecessary. The way-predicting set-associative cache attempts to avoid
this unnecessary way activation, and has the following features:
• The cache has a way-prediction table. Each entry in the table is used for speculative
way selection.
• Before a cache access is started, the way-prediction table is accessed to get a hint for
the speculative way selection.
• Only the predicted way is activated (searched).
• If the way-prediction is correct, the cache access is completed in one cycle. As the
remaining ways are not activated, the energy dissipation for the cache access can be
reduced.
• If the cache makes a wrong way prediction, the remaining ways are searched in
the same manner as in conventional set-associative caches. In this case, the cache achieves
no energy reduction; in addition, it wastes one extra cycle to access the remaining ways.
We evaluate the performance/energy efficiency of way-predicting set-associative caches. The
way-predicting scheme reduces cache-access energy by more than 70 %, while incurring less
than 10 % cache-access-time overhead. In addition, we evaluate the effects of hardware
constraints on the way-predicting set-associative cache, and conclude that our scheme is
promising for future processor chips that employ large on-chip caches.
In the second part of this thesis, we introduce a cache architecture for reducing cache-access
energy, called the history-based tag-comparison cache. In conventional caches, tag comparison is
performed on every access in order to determine whether the access hits the cache. The content
of the cache is updated only when a cache miss takes place. Therefore, if an instruction block was
executed before, and if there have been no cache misses since the previous execution of the
instruction block, then it is guaranteed that the instruction block is currently cache resident.
In this case, we do not need to perform tag comparison, so the energy dissipated for
performing tag comparison can be completely eliminated. The history-based tag-comparison
cache attempts to avoid unnecessary tag comparisons, and has the following features:
• Execution footprints are recorded in an extended BTB (Branch Target Buffer).
• When a branch is executed, a corresponding footprint is recorded. The footprint denotes
that the target instruction block of the branch is currently cache resident.
• When the branch is executed again, the corresponding footprint is checked. If the cache
detects the recorded footprint, then tag comparisons for cache accesses to the target
instruction block are omitted.
• When a cache miss takes place, all execution footprints are erased.
• Since the hardware components for the history-based tag-comparison cache do not appear
on cache critical paths, the cache-access time of the conventional organization is maintained.
We evaluate the energy efficiency of history-based tag-comparison caches. In the best case, a
history-based tag-comparison cache reduces tag-comparison energy by 99 % for the execution
of a program. Since the tag-omission scheme relies on the loop structure of programs,
our cache works well for floating-point and media programs, which have relatively
well-structured loops. Although our cache does not significantly reduce tag-comparison
energy for some integer programs, increasing the cache capacity improves the effectiveness
of the tag-omission scheme. Therefore, we conclude that the history-based tag-comparison cache
is promising for future processor chips that employ large on-chip caches.
In the last part of this thesis, we introduce a high-performance, low-energy technique for on-
chip memory systems, called the dynamically variable line-size cache. For merged DRAM/logic
LSIs with a memory hierarchy including cache memory, we can exploit high on-chip memory
bandwidth by replacing a whole cache line at a time on cache misses. This approach
tends to increase the cache-line size if we attempt to improve the attainable memory
bandwidth. Although larger cache lines give a prefetching effect, they may worsen cache-hit
rates if programs do not have enough spatial locality. The dynamically variable line-size
cache attempts to avoid the unnecessary data replacements caused by large cache lines
by adjusting the cache-line size according to the degree of spatial locality. The cache has the
following features:
• A large cache line is partitioned into small cache lines (sublines).
• When rich spatial locality is observed, a large number of sublines are involved in cache
replacements (assembling a large cache line). In contrast, when poor spatial locality is
observed, a small number of sublines are involved (assembling a small cache line).
• Since conflict misses are reduced by decreasing the cache-line size rather than by increasing
the associativity, high-speed cache access can be maintained.
• Data transfer between the cache and main memory can be completed in a constant
time regardless of the cache-line sizes because of the high on-chip memory bandwidth
on merged DRAM/logic LSIs.
• Only the DRAM subarrays corresponding to the sublines to be replaced are activated,
thereby saving the main-memory-access energy for cache replacements.
We evaluate the performance/energy efficiency of dynamically variable line-size caches hav-
ing 32-byte, 64-byte, and 128-byte cache-line sizes. For a benchmark set which consists of
two integer programs and one floating-point program, a dynamically variable line-size cache
reduces the average memory-access time by 20 % and the average memory-access energy by
35 %, compared with a conventional cache having a fixed 128-byte cache-line size. In
addition, we investigate the effects of on-chip DRAM characteristics, which depend strongly
on device technology, and observe that the dynamically variable line-size cache achieves
significant performance/energy improvements over a wide range of on-chip DRAM access
speeds and energies. Therefore, we conclude that the dynamically variable line-size cache is
promising for future processor chips using merged DRAM/logic LSIs.
When we consider portable computing in worldwide network systems, software portability is
also required. Our caches monitor the behavior of memory references and attempt to avoid
unnecessary operations at run time. Since the caches do not require any modification of the
instruction-set architecture, full compatibility with existing object code is maintained.
Contents
Abstract i
Contents v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Memory Systems Employing Cache Memories 9
2.1 Principle of Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Memory-Access Time and Energy Definitions . . . . . . . . . . . . . . . . . . . 10
2.3 Conventional Cache Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 High-Speed Memory-Access Techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Making Cache Access Faster . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Making Cache-Miss Rate Lower . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Making Cache-Miss Penalty Smaller . . . . . . . . . . . . . . . . . . . . 24
2.5 Low-Energy Memory-Access Techniques . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Reducing Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . 25
2.5.2 Reducing Data-Transfer/Main-Memory-Access Energy . . . . . . . . . 30
2.5.3 Reducing DRAM-Static Energy . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Way-Predicting Set-Associative Cache Architecture 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Conventional Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.2 Phased Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Way-Predicting Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Way Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Evaluations: Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Evaluations: Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Way-Prediction Hit Rates . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.3 Cache-Access Time and Energy . . . . . . . . . . . . . . . . . . . . . . 46
3.5.4 Energy-Delay Product . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.5 Performance/Energy Overhead . . . . . . . . . . . . . . . . . . . . . . 51
3.5.6 Effects of Other Parameters . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 History-Based Tag-Comparison Cache Architecture 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Breakdown of Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Interline Tag-Comparison Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 History-Based Tag-Comparison Cache . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.3 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.2 Energy Reduction for Tag Comparisons . . . . . . . . . . . . . . . . . . 73
4.5.3 Effects of Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . 76
4.5.4 Energy Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Variable Line-Size Cache Architecture 83
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Conventional Approaches to Exploiting High Memory-Bandwidth . . . . . . . 85
5.3 Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Concept and Principle of Operations . . . . . . . . . . . . . . . . . . . 88
5.3.3 Line Size Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Statically Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.3 Line-Size Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5 Dynamically Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5.3 Line-Size Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.2 Cache-Access Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.3 Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.6.4 Cache-Miss Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6.5 Main-Memory-Access Time and Energy . . . . . . . . . . . . . . . . . . 106
5.6.6 Average Memory-Access Time . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.7 Average Memory-Access Energy . . . . . . . . . . . . . . . . . . . . . . 110
5.6.8 Energy–Delay Product . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.9 Hardware Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6.10 Effects of Other Parameters . . . . . . . . . . . . . . . . . . . . . . . . 116
5.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 Conclusions 125
Acknowledgment 129
Bibliography 131
List of Publications by the Author 143
Chapter 1
Introduction
1.1 Motivation
VLSI technologies have been increasing processor speed and DRAM capacity dramatically.
For example, implementations of 1 GHz processors and 1 Gbit DRAMs have been reported
[24], [32], [10], [55]. However, these advances have also introduced a large and growing performance
gap between processors and main memory (DRAM). By improving not only the clock speed but
also the instruction-level parallelism (ILP), processor performance has been improving at a
rate of 60 % per year. On the other hand, the access time of DRAM has been improving at
a rate of less than 10 % per year [72]. Moreover, current memory systems suffer from a lack
of memory bandwidth caused by the I/O-pin bottleneck. This problem is known as the “Memory
Wall” [12], [97]. The inadequacy of memory systems causes poor total system performance in
spite of higher processor performance.
Cache memories have been playing an important role in bridging the performance gap between
high-speed processors and low-speed off-chip main memory, because confining memory
accesses on-chip reduces memory-access latency. Much research has focused on improving
cache performance, and many high-performance cache architectures have been proposed.
However, the processor–memory performance gap is still growing. Patterson et al. [72] ana-
lyzed the breakdown of execution time (ET ) for benchmark programs as shown in Figure 1.1.
The memory hierarchy in the Alpha system includes up to level-3 caches. In their results,
it can be observed that the database and matrix-computation programs spend about 75 % of
their time in memory accesses.
Figure 1.1: Fraction of Time Spent in Each Component on the Alpha 21164. (Bar chart showing, for the SPECint92, SPECfp92, DataBase, and Sparse benchmark programs, the percentage of execution time spent in the processor and in I-cache, D-cache, L2-cache, and L3-cache misses.)
In this case, a 20 % improvement in memory performance yields a 15 % reduction in ET; in
other words, a 20 % degradation in memory-system performance worsens the total system
performance by 15 %. Of course, the effect of memory-system performance on ET depends on
the characteristics of the target programs, for example, the total number of load/store
instructions executed, the instruction issue rate, and so on. Actually, the time spent in
memory accesses for the SPEC programs ranges from 20 % to 30 % on the Alpha 21164 system.
Nevertheless, the inadequacy of memory systems will widen the processor-memory performance
gap, and is clearly a serious problem for future processor-based computer systems.
Accordingly, we still need to make a great effort to improve memory-system performance
by developing efficient cache memories.
The most straightforward approach to improving the memory-system performance is to
increase the cache size. In order to alleviate the inadequacy of memory systems, the trend is to
invest the increasing transistor budget in cache capacity. Increasing the cache capacity reduces
the frequency of off-chip accesses by improving cache-hit rates. From an energy point of view,
this approach seems useful because the energy dissipated for driving external I/O pins
can be reduced. However, this approach also increases the energy dissipated in cache accesses.
Several studies have examined the power (the amount of energy consumed per unit time) of
caches. The power consumption of the on-chip caches of the StrongARM SA-110
accounts for 43 % of the total chip power [77]. In the 300 MHz bipolar CPU reported by Jouppi
et al. [44], 50 % of the power is dissipated by the caches. The rapidly growing mobile market
strongly requires not only high performance but also low energy dissipation. One of the
uncompromising requirements of portable computing is energy efficiency, because it directly
affects battery life. Therefore, from these studies, we believe that considering low-energy
cache architectures is worthwhile for future processor systems.
1.2 Contributions
Cache memories are indispensable for high-performance, low-energy processor chips. How-
ever, it is difficult to improve the performance/energy efficiency of cache memories by relying
only on advances in VLSI technology. As explained in Section 1.1, although increasing the
cache capacity improves cache-hit rates, it wastes a lot of energy per cache access. Moreover,
it also makes the cache-access time longer [96]. Therefore, if the benefit of the increased
cache-hit rate is smaller than the penalty caused by the cache-access-time and cache-access-energy
overheads, we cannot obtain any improvement in performance/energy efficiency.
There is another example for merged DRAM/logic LSIs. Eliminating the chip boundary
between the processor (with cache) and main memory makes it possible to exploit high on-chip
memory bandwidth. However, exploiting the maximum available memory bandwidth is not
always beneficial. Thrashing may occur between the cache and main memory
because of unnecessary data replacements, thereby wasting time and energy.
Since cache memories affect all memory references, we have to pay significant attention to
microarchitectures for improving performance/energy efficiency of cache memories. The goal
of this thesis is to propose and develop high-performance, low-energy cache architectures.
The role of the cache memory is to serve read and write requests from the processor as quickly
as possible. However, as conventional caches employ conservative mechanisms, there are
many unnecessary operations, and these unnecessary operations waste much energy and time. In
order to eliminate them, our caches are based on the following strategy:
1. Monitor memory-reference behavior at run-time.
2. Predict and detect unnecessary operations for future accesses by analyzing the monitored
memory-reference behavior at run-time.
3. Eliminate the unnecessary operations at run-time.
The key to our approach is to optimize the cache operation for the characteristics of the target
program at run-time. As our scheme does not require any modification of the instruction-set
architecture, full compatibility with existing object code is maintained.
Three major contributions of this thesis are described below:
• Way-predicting (WP) set-associative cache: A cache architecture for low energy
dissipation is proposed and evaluated. The cache attempts to eliminate unnecessary
way activation in set-associative caches. In conventional set-associative caches, all ways
are searched in parallel because the cache-access time is critical. The way-predicting
set-associative cache predicts which way has the data desired by the processor before
starting the cache access. The way prediction is performed based on memory-access
history. As the way-predicting set-associative cache maintains the cache-hit rate
of the conventional organization, no additional latency or energy overhead for next-level
memory accesses is incurred.
• History-based tag-comparison (HTC) cache: A cache architecture for low energy
dissipation is proposed and evaluated. The cache predicts whether the instructions to be
fetched are currently cache resident, and attempts to eliminate unnecessary tag comparisons.
In conventional caches, tag comparison has to be performed on every cache access
in order to test whether the memory reference hits the cache. Execution footprints
recorded in a BTB (branch target buffer) are used for the prediction. As the history-based
tag-comparison cache affects neither the cache-hit rate nor the cache-access time, the
memory-system performance of the conventional organization is maintained.
• Dynamically variable line-size (DVLS) cache: A cache architecture for high performance
and low energy on merged DRAM/logic LSIs is proposed and evaluated. The
cache predicts the degree of spatial locality, and attempts to avoid unnecessary data
replacements. For merged DRAM/logic LSIs with a memory hierarchy
Table 1.1: Characteristics of Proposed Cache Architectures.

Caches   What to Monitor                  What to Predict                                What to Eliminate
WP       MRU (Most-Recently-Used) ways    a way to be accessed                           unnecessary way activation
HTC      program execution sequence       whether or not the instruction to be fetched   unnecessary tag comparisons
                                          next already resides in the cache
DVLS     cache-line reference history     the degree of spatial locality                 unnecessary data replacements
including cache memory, we can exploit high on-chip memory bandwidth by means of
replacing a whole cache line at a time on cache misses. This approach tends to in-
crease the cache-line size if we attempt to improve the attainable memory bandwidth.
In general, large cache lines can benefit some programs through a prefetching effect.
Larger cache lines, however, might worsen system performance if programs do not
have enough spatial locality, because cache-conflict misses frequently take place. As
a result, the wide on-chip buses and DRAM array waste not only time but also a
great deal of energy because of the larger number of main-memory accesses. Although conflict
misses can be reduced by increasing the cache associativity, this approach usually makes the
cache-access time longer. In the dynamically variable line-size cache, the large cache
line is partitioned into multiple small cache lines (sublines), and the cache attempts
to adjust the number of sublines involved in each cache replacement. Namely, the
cache tries to optimize the cache-line size according to the degree of spatial locality
observed (a conceptual sketch follows this list). Reducing the cache-line size alleviates the
negative effects of large cache lines without cache-access-time overhead. In addition, selective
activation of the on-chip buses and DRAM subarrays corresponding to the replaced sublines
reduces energy dissipation for cache replacements.
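The following is a minimal conceptual sketch of this adjustment, written in Python for exposition only; the subline size, the selectable line sizes, and the simple usage-count rule are illustrative assumptions rather than the actual hardware policy evaluated in Chapter 5.

# Conceptual sketch of dynamically variable line-size (DVLS) replacement.
# The constants and the decision rule are illustrative assumptions.
SUBLINE = 32                    # bytes per subline (assumed)

def choose_line_size(referenced_subline_flags):
    """Pick the next replacement size from how many sublines of the
    evicted 128-byte line were actually referenced while resident."""
    used = sum(referenced_subline_flags)    # 0..4 sublines touched
    if used >= 3:
        return 128                          # rich spatial locality
    elif used == 2:
        return 64                           # moderate spatial locality
    else:
        return 32                           # poor locality: one subline

def sublines_to_replace(miss_address, line_size):
    """Subline addresses fetched on a miss; only the DRAM subarrays
    backing these sublines would be activated."""
    base = miss_address - (miss_address % line_size)
    return [base + i * SUBLINE for i in range(line_size // SUBLINE)]

For instance, sublines_to_replace(0x1234, 64) returns the two 32-byte subline addresses 0x1200 and 0x1220, so only two of the four DRAM subarrays would be driven for that replacement.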
Table 1.1, Table 1.2, and Table 1.3 summarize the characteristics, usability, and effects of
the proposed cache architectures, respectively.
Table 1.2: Usability of Proposed Cache Architectures.

         Instruction Cache               Data Cache
Caches   Direct-Map   Set-Associative    Direct-Map   Set-Associative
WP       –            √                  –            √
HTC      √            –                  –            –
DVLS     √            √                  √            √
Table 1.3: Effects of Proposed Cache Architectures.

            Cache Accesses           Cache-Miss    Main-Memory Accesses
Caches      Time        Energy       Rate          Time        Energy
WP *)       ↗ (5%)      ↘ (72%)      →             →           →
HTC **)     →           ↘ (30%)      →             →           →
DVLS ***)   →           →            ↘ (37%)       →           ↘ (52%)

*) Compared with a conventional four-way set-associative data cache for 124.m88ksim.
**) Compared with a conventional direct-mapped (DM) instruction cache for 107.mgrid.
***) Compared with a conventional DM data cache with 128-byte lines for MIX-IntFp.
1.3 Overview
This thesis introduces adaptive cache memory architectures for high performance and low
energy dissipation, and is organized as follows. Chapter 2 briefly explains the principle of
memory hierarchy to confirm the most important characteristics of memory-reference behav-
ior, and defines metrics to evaluate the performance/energy efficiency of memory systems.
In addition, Chapter 2 surveys high-performance techniques and low-energy techniques for
cache memories. The way-predicting set-associative cache architecture and the history-based
tag-comparison cache architecture for low energy consumption are introduced in Chapter
3 and Chapter 4, respectively. Chapter 5 presents the dynamically variable line-size cache
architecture. Finally, Chapter 6 concludes this thesis.
Chapter 2
Memory Systems Employing Cache
Memories
2.1 Principle of Memory Hierarchy
Total system performance suffers from the inadequacy of memory systems, as explained in Chapter
1. If we could employ an ideal memory system with infinite memory space in which any memory
access completes within one processor clock cycle, total system performance would
improve dramatically. However, this assumption is impracticable in real memory systems
due to the restricted hardware budget, the limits of process technology, and so on. Employing
a memory hierarchy is a well-known technique for making a real memory system approach the
ideal one.
There is a rule of thumb for program-execution behavior called the 90/10 Locality Rule: a program
executes about 90 % of its instructions in 10 % of its code. From this rule, we can see
that some portions of the program-address space are executed frequently. Thus, programs
exhibit locality as follows [28]:
• Temporal locality: If an item is referenced, it will tend to be referenced again soon.
• Spatial locality: If an item is referenced, nearby items will tend to be referenced soon.
The principle of memory hierarchy is based on the locality of memory references. There are
many levels in a memory hierarchy, and data replacements are performed between adjacent levels.
Upper levels are smaller, faster, and closer to the processor than lower levels. An upper level
holds a part of the memory space of the next-lower level. The processor first tries to obtain the
referenced data from the closest level, because that memory access can be completed faster.
If the processor cannot find the data at that level, the next-lower level is searched. When
the required data is found at the lower level, a copy of the data is stored in the
upper level. After that, accesses to the stored data can be completed at the upper level
until the data is evicted.
Here, we consider the locality of references again. If programs have rich locality of memory
references, almost all accesses can be completed at the upper levels of the memory hierarchy;
only when an access misses an upper level is the next-lower level searched. Usually, accesses
to the highest level, the level-1 cache, can be completed in one clock cycle of a high-speed
processor. Therefore, a real memory system can behave like the ideal memory system if
almost all memory accesses are confined to the level-1 cache.
2.2 Memory-Access Time and Energy Definitions
We consider a memory hierarchy which consists of a cache memory implemented with static
RAM (SRAM) and a main memory implemented with dynamic RAM (DRAM); the lowest level of
this hierarchy is thus level 2, the main memory. The cache-miss rate is the most popular metric of cache performance.
However, it is very important to consider not only the cache-miss rate but also cache-access
time. Since the cache-access time affects all load/store operations, it has a great impact
on total memory-system performance. In this thesis, we use average memory-access time
(AMAT ), which is the average latency per memory reference [28]. The average memory-
access time can be expressed by the following equations:
AMAT = T_{Cache} + CMR \times 2 \times T_{MainMemory},    (2.1)

T_{MainMemory} = T_{DRAMarray} + \frac{LineSize}{BandWidth},    (2.2)
where CMR is the cache-miss rate (note that the cache-hit rate is denoted CHR in this thesis).
The cache-access time, denoted TCache, is the latency for determining whether a memory access
hits the cache and for providing the referenced data to the processor on a cache hit. The main-
memory-access time, denoted TMainMemory, or miss penalty, is the latency of an access to
the main memory. On a cache miss, if the cache employs a write-back policy for cache-line
replacement, two main-memory accesses take place (one for write-back and one for refill)
in the worst case. The main-memory-access time (TMainMemory) consists of two factors: the
latency of an access to the DRAM array (TDRAMarray) and the latency for transferring a cache line
between the cache and the main memory (LineSize/BandWidth). LineSize and BandWidth are the
size of the cache line to be replaced and the memory bandwidth between the cache and the main memory,
respectively.
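As a purely illustrative example with assumed (not measured) parameter values, let TCache = 2 ns, CMR = 0.05, TDRAMarray = 40 ns, LineSize = 32 bytes, and BandWidth = 8 bytes/ns. Equation (2.2) then gives TMainMemory = 40 + 32/8 = 44 ns, and Equation (2.1) gives AMAT = 2 + 0.05 × 2 × 44 = 6.4 ns; the miss term dominates even at a 5 % miss rate.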
On the other hand, the total energy consumed for the execution of a program consists
of two parts: the energy dissipated in the CPU core and that dissipated in the memory hierarchy,
denoted EMemoryHierarchy. We assume that the total count of load/store instructions in the execution
of a program is constant. Therefore, EMemoryHierarchy depends only on the energy efficiency
of the memory system, and can be approximated by the following equation:
E_{MemoryHierarchy} = \sum_{i=1}^{N} E_{MA_i},    (2.3)
where N is the count of load/store instructions executed, and EMAi is the memory-access
energy dissipated by the memory system to serve the i-th memory-access operation. In
this thesis, we use average memory-access energy (AMAE), which is the average energy
dissipated per memory reference (i.e., EMemoryHierarchy = N × AMAE). AMAE can be
expressed by the following equations:
AMAE = E_{Cache} + CMR \times 2 \times E_{MainMemory},    (2.4)

E_{MainMemory} = E_{DRAMarray} + E_{DataTransfer},    (2.5)
where ECache denotes the cache-access energy, which is the average energy dissipation per cache
access, and EMainMemory denotes the main-memory-access energy, which is the average energy
dissipation per main-memory access. On a cache miss, two main-memory accesses, one for
write-back and one for refill, consume an energy of 2 × EMainMemory in the worst case. The
main-memory-access energy consists of two factors: the energy for accessing the DRAM array
(EDRAMarray) and that for transferring a cache line between the cache and the main memory
(EDataTransfer). Moreover, the cache-access energy (ECache) can be approximated by the
following equation [85]:
E_{Cache} = E_{Decode} + E_{SRAMarray},    (2.6)
where EDecode is the average energy consumed for decoding the memory address, and ESRAMarray
is the average energy consumed for accessing the SRAM array (tag memory and data memory), per cache
access. The energy model described in [85] includes the energy consumed for driving external
I/O pins; that energy is included in EDataTransfer in Equation (2.5). EDecode depends
on the switching activity of the memory addresses generated by the processor, and is negligible
compared to ESRAMarray. Previous papers reported that the energy consumption of the
address decoder is about three orders of magnitude smaller than that of the other components
[4], [54]. Therefore, we assume that the cache-access energy (ECache) is determined by the energy
consumed for accessing the SRAM array (ESRAMarray).
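The following short Python sketch shows how Equations (2.1) through (2.6) combine into the two metrics; all numerical parameters are assumed example values, not results from the evaluations in later chapters.

# Sketch: computing AMAT and AMAE from Equations (2.1)-(2.6).
# All numeric parameters below are illustrative assumptions.

def amat(t_cache, cmr, t_dram_array, line_size, bandwidth):
    """Average memory-access time, Equations (2.1) and (2.2).
    The factor 2 models the worst case of write-back plus refill."""
    t_main = t_dram_array + line_size / bandwidth
    return t_cache + cmr * 2 * t_main

def amae(e_decode, e_sram_array, cmr, e_dram_array, e_transfer):
    """Average memory-access energy, Equations (2.4)-(2.6)."""
    e_cache = e_decode + e_sram_array      # Equation (2.6)
    e_main = e_dram_array + e_transfer     # Equation (2.5)
    return e_cache + cmr * 2 * e_main      # Equation (2.4)

# Assumed example: a 2 ns / 0.8 nJ cache backed by a 40 ns / 10 nJ DRAM
# array over an 8 bytes/ns on-chip bus, with 32-byte lines and CMR = 0.05.
print(amat(t_cache=2.0, cmr=0.05, t_dram_array=40.0,
           line_size=32, bandwidth=8.0))        # ~6.4 (ns)
print(amae(e_decode=0.01, e_sram_array=0.8, cmr=0.05,
           e_dram_array=10.0, e_transfer=4.0))  # ~2.21 (nJ)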
There are many levels at which we can work to improve memory-system performance
and energy dissipation: the device level, circuit level, architecture level, algorithm level,
and so on. In the following sections, we briefly survey architectural techniques for high-performance,
low-energy memory systems. Before presenting these techniques, we describe conventional
cache architectures in Section 2.3. Then, high-performance techniques and energy-reduction
techniques are introduced in Section 2.4 and Section 2.5, respectively.
2.3 Conventional Cache Architectures
There are mainly two kinds of cache architectures: the direct-mapped cache and
the set-associative cache. Figure 2.1 depicts the conventional organization of a
direct-mapped cache and a two-way set-associative cache. The block of data which can be
replaced between the cache and the main memory is called a cache line (or line). A set
consists of the cache lines which have the same cache-index address. We can regard the direct-
mapped cache as a one-way set-associative cache. Each way consists of a tag subarray and
a data subarray for storing tags and cache lines, respectively. An n-way set-associative
cache works as follows [28] (a simplified behavioral sketch is given after the list):
1. As soon as an effective memory address is generated by the processor, the cache starts
to decode the memory address and determines the set to be searched.
2. The cache starts simultaneously to read both the tag and the cache line designated
by the cache-index address from each way. Then the tags are compared with the tag-
portion of the memory address in order to test whether at most one of the stored tags
matches it. All tag comparisons are performed in parallel because speed is critical.
Figure 2.1: Conventional Cache Architectures. ((a) Direct-Mapped Cache and (b) 2-Way Set-Associative Cache. Each way consists of a tag subarray and a data subarray; the address is split into tag, index, and offset fields, tag comparison produces the hit/miss signal, and a multiplexer selects the load/store data.)
3. If a match is found (i.e., on a cache hit), the cache provides the word data in the
associated cache line to the processor (for read). Otherwise, a cache-line replacement
takes place.
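The Python sketch below mirrors these three steps behaviorally; the geometry, the random victim choice, and the tag-only bookkeeping are simplifying assumptions, and neither timing nor energy is modeled.

# Behavioral sketch of an n-way set-associative lookup (steps 1-3 above).
import random

class SetAssociativeCache:
    def __init__(self, num_sets, ways, line_size):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        # tags[set][way] holds the stored tag, or None for an invalid line.
        self.tags = [[None] * ways for _ in range(num_sets)]

    def _decode(self, address):
        # Step 1: split the address into offset, index, and tag fields.
        block = address // self.line_size
        return block % self.num_sets, block // self.num_sets  # (index, tag)

    def access(self, address):
        index, tag = self._decode(address)
        # Step 2: compare the stored tags of every way in the selected set
        # (done in parallel in hardware, sequentially in this sketch).
        for way in range(self.ways):
            if self.tags[index][way] == tag:
                return "hit", way              # step 3: provide the word
        victim = random.randrange(self.ways)   # miss: replace a line
        self.tags[index][victim] = tag
        return "miss", victim

# Example: a 2-way cache with 128 sets of 32-byte lines (8 KB of data).
cache = SetAssociativeCache(num_sets=128, ways=2, line_size=32)
print(cache.access(0x1234))   # first touch of the line -> miss
print(cache.access(0x1238))   # same 32-byte line       -> hit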
Compared to direct-mapped caches (i.e., one-way set-associative caches), n-way set-associative
caches (n ≥ 2) usually can produce higher cache-hit rates (reduce CMR in Equation (2.1)),
because higher associativity reduces conflict misses. However, increasing the cache associa-
tivity (i.e., increasing n) suffers from the following drawbacks:
• The cache-access time (TCache in Equation (2.1)) tends to be larger because the n-way
set-associative cache incurs an additional delay for way selection [30], [96]. The way
selection has to be performed after tag-comparison results are available. Therefore, if
the delay for the tag comparison is larger than that for reading the cache-line data, the
cache-access-time overhead due to the way selection appears.
• The cache-access energy (ECache in Equation (2.4)) tends to be larger [4], [29]. Although
at most only one way has the data desired by the processor, all the ways are accessed in
parallel. Increasing the cache associativity decreases the total number of word-lines in
the SRAM array (the height of the SRAM array), as shown in Figure 2.1. However, the
total number of bit-lines to be activated is increased. As a result, activating the peripheral
circuits attached to each bit-line, for example the bit-line precharge circuits, sense
amplifiers, and so on, increases the cache-access energy.
2.4 High-Speed Memory-Access Techniques
As explained in Section 2.3, there is a trade-off between the cache-access time and the cache-
hit rate: fast access but a low hit rate for direct-mapped caches versus slow access but a high
hit rate for set-associative caches. From Equation (2.1), it can be understood that there are at
least three approaches to improving memory-system performance (i.e., to reducing the average
memory-access time), as follows.
• Reducing the cache-access time (TCache) while maintaining the cache-miss rate and the
miss penalty as much as possible.
• Reducing the cache-miss rate (CMR) while maintaining the cache-access time and the
miss penalty as much as possible.
• Reducing the miss penalty (TMainMemory) while maintaining the cache-access time and
the cache-miss rate as much as possible.
In this section, we introduce techniques to satisfy the above requirements for high-performance
memory systems. Section 2.4.1 and Section 2.4.2 show techniques to improve the cache-access
time and the cache-hit rate, respectively. Then, Section 2.4.3 focuses on how to reduce the
cache-miss penalty for cache-line replacements.
2.4.1 Making Cache Access Faster
The most significant disadvantage of set-associative caches is that they suffer from a longer access
time due to way selection. Since the way selection can be performed only after the tag-comparison
results are available, the critical path becomes long. The key to the techniques introduced in this
section is to complete the way selection as early as possible.
2.4.1.1 Speculative Way Selection: Exploiting Locality
There are two methods for finding the desired way in set-associative caches: parallel search and
sequential search. The parallel search examines all ways in parallel; thus, the delay of the
tag-comparison-based way selection makes the cache-access time longer. The sequential search
examines the ways one by one until it finds the desired way. Therefore, the way-selection overhead
can be eliminated if the first probe finds the desired way, in which case the cache-access
time is as fast as that of a direct-mapped cache. However, in the worst case, the sequential search
may require as many clock cycles as the associativity. Namely, the cache-access
time depends on how fast the cache can find the desired way.
Kessler et al. [52] proposed a set-associative MRU cache which uses hardware similar to
a direct-mapped cache. The MRU cache employs an MRU-order-based sequential search, with the
MRU information stored in a mapping table. Chang et al. [16] proposed another MRU
cache, which is employed in System/370, to improve the access time of parallel-search set-
associative caches. Chang et al. reported that for a 128 KB cache with 64-way associativity,
more than 80 % of all memory references hit the MRU region, even though its size is only
2 KB (128 KB / 64 ways). The MRU information for each set is used to select one way before
the tag comparison is completed. When a cache access is issued, the way designated by the
corresponding MRU information is selected. When the cache selects a wrong way, two cycles
are required because the remaining ways must be accessed. Kessler et al. also reported that
the MRU scheme achieves more than 30 % cache-access-time improvement over a conventional
four-way set-associative cache.
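A minimal Python sketch of this MRU-based speculative way selection follows; the one-entry-per-set table and the one-cycle/two-cycle accounting follow the description above, while everything else (including the restriction to cache hits) is an illustrative assumption.

# Sketch of MRU-based speculative way selection (cache hits only).
class MRUWayPredictor:
    def __init__(self, num_sets):
        self.mru = [0] * num_sets        # most-recently-used way per set

    def access(self, index, hit_way):
        """Return the cycles spent: 1 if the MRU prediction was right,
        2 if the remaining ways had to be probed afterwards."""
        predicted = self.mru[index]
        self.mru[index] = hit_way        # update the MRU history
        return 1 if predicted == hit_way else 2

# Repeated hits to the same way of a set cost a single cycle; switching
# to another way costs the extra probe cycle once.
p = MRUWayPredictor(num_sets=128)
print(p.access(17, hit_way=1))   # -> 2 (the table still pointed at way 0)
print(p.access(17, hit_way=1))   # -> 1 (the table now points at way 1)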
2.4.1.2 Speculative Way Selection: Partial Tag Comparison
Another approach to improving the access time of set-associative caches is to obtain the
tag-comparison results as soon as possible. If the control signals for the way selection are
available before cache-lines are completely read out, the cache-access-time overhead caused
by the way selection can be hidden.
Partial address matching, proposed by Liu [60], is one approach to reducing the tag-
comparison time. The cache has two memory arrays: the MD (Main Directory) and the PAD
(Partial Address Directory). The MD contains the complete tag information, whereas the PAD
contains only a part of the tag bits (e.g., 5 bits). First, the tag-comparison results from the PAD are
used for the way selection. The complete tag comparison on the MD is also performed in parallel,
but only to verify the partial-address tag comparison. If the cache detects a wrong
way selection, the incorrectly accessed data is canceled. The timing advantage of partial
address matching comes from the simpler comparators and the fast read of a small number of
bits. Liu reported that reading 5 partial-address bits from the cache directory can be almost
twice as fast as reading 18 full-address bits.
Juan et al. [45] proposed the difference-bit cache. The idea is based on the fact that
the two stored tags corresponding to a set have to differ in at least one bit. By using the
1-bit (difference-bit) comparison result, the way selection can be performed. Diff memory
is employed to record the position and value of the difference-bit in the tag. Note that the
difference-bit comparison can be used for the way selection, but not for testing the cache hit
or miss. In case of two-way set-associative caches, two tags are read in parallel. After that,
one of the tags is selected by using the difference-bit comparison result, and it is then compared with
the tag portion of the memory address in order to determine whether the memory access hits the cache.
The selection of the data to be provided to the processor is also performed using the
difference-bit comparison result, instead of the complete tag-comparison result.
2.4.2 Making Cache-Miss Rate Lower
Memory-access behavior varies within and among program executions. However, conventional
caches expect all memory references to have a high degree of temporal and spatial locality.
Thus, conventional organizations have fixed hardware parameters: cache size, associativity,
mapping function, replacement policy, cache-line size, and so on. Therefore, it is difficult
for conventional caches to follow the varying behavior of memory references. To improve
cache-hit rates, many researchers have proposed cache architectures that attempt to adapt
the cache parameters, dynamically or statically, to the varying memory-access behavior.
2.4.2.1 Making Good Use of Cache Space
Unfortunately, conventional caches have only one mapping function for data placement. The
mapping function determines which set the data designated by a memory address should
be placed in. In particular, a set in a direct-mapped cache can hold only one cache line.
Therefore, data items that compete for a set cause a large number of conflict misses. The
key of the techniques introduced in this section is to employ several mapping functions to make
good use of the limited cache space, thereby reducing conflict misses.
(1) Employing Different Mapping Functions
The direct-mapped hash-rehash cache proposed by Agarwal et al. [2] attempts to avoid
conflict misses by using two different mapping functions. Conflicting data can be located
in a different set. When a cache access is issued, the first mapping function, which is the
same as that of a conventional direct-mapped cache, is used to search the first entry. If the first search
finds a hit (a first hit), the cache behaves as a direct-mapped cache. Otherwise, the other
mapping function is used to search the second entry. Namely, the hash-rehash cache looks
like a two-way set-associative cache employing a sequential-search scheme. If both the first and
second searches miss (a cache miss), the missed data is filled into the second
entry, and the first and second entries are swapped to keep the MRU cache line
in the first location. The column-associative cache proposed by Agarwal et al. [1] has the same
configuration as the hash-rehash cache, except for a rehash bit in each set. The rehash bit
inhibits a rehash access in order to avoid secondary thrashing.
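The Python sketch below illustrates the hash-rehash lookup order and the swap on a rehash hit; the specific second hash function (XORing high address bits into the index) is only an assumed placeholder, not the function used in [2].

# Behavioral sketch of a direct-mapped hash-rehash lookup.
NUM_SETS = 256

def hash1(block):
    return block % NUM_SETS

def hash2(block):
    # Assumed placeholder rehash function, not the one from [2].
    return (block ^ (block // NUM_SETS)) % NUM_SETS

def access(cache, block):
    """cache maps a set index to the stored block number."""
    first, second = hash1(block), hash2(block)
    if cache.get(first) == block:                  # first probe
        return "first hit"
    if cache.get(second) == block:                 # rehash probe
        # Swap so the MRU block sits in its first-hash location.
        cache[first], cache[second] = cache.get(second), cache.get(first)
        return "rehash hit"
    cache[second] = block                          # miss: fill second entry
    return "miss"

cache = {}
print(access(cache, 0x140))   # -> miss (filled into its second entry)
print(access(cache, 0x140))   # -> rehash hit (then swapped to first entry)
print(access(cache, 0x140))   # -> first hit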
The hash-rehash cache and the column-associative cache produce worse cache-miss rates than con-
ventional two-way set-associative caches with an LRU replacement strategy, because the mechanism
for the hash and rehash operations cannot implement true LRU replacement. Calder
et al. [14] proposed the predictive sequential associative cache, which has a steering-bit table in
order to indicate which entry has to be searched first. In addition, the MRU information is
used to implement a complete LRU replacement strategy. Calder et al. also proposed predicting
the cache index in an earlier pipeline stage by using prediction sources such as register contents,
offsets, register numbers, and so on, in order to hide the steering-bit-table access penalty.
The skewed-associative cache proposed by Seznec [80] improves the hit rate of 2-way set-
associative caches. A 2-way skewed-associative cache has the same configuration as a 2-way
conventional set-associative cache (i.e., there are two memory subarrays), but it has two
mapping functions. The different mapping functions operate on different memory subarrays in
parallel. As the two mapping functions are more complex than those of the hash-rehash
cache, the skewed-associative cache can achieve higher cache-hit rates. Seznec reported that
the cache-miss rate produced by a 2-way skewed-associative cache is comparable with that
achieved by a 4-way conventional set-associative cache.
Other related studies include the sequential multi-column cache and the parallel multi-column cache [99].
(2) Employing an Adaptive Mapping Function
The adaptive group-associative cache proposed by Peir et al. [74] attempts to use
the cache space intelligently. In conventional caches, a number of empty frames, or holes, exist in the
cache. The authors measured the average percentage of holes in various cache configurations
during the execution of a program, and observed that between 37.5 % and 42.6 % of
the cache consists of holes. In fact, these holes will be filled by rarely reused data. The idea of the
adaptive group-associative cache is to identify the existing holes and allocate them to
frequently reused data. On a cache miss, frequently reused data that is about to be evicted from the cache
is moved into a hole instead of to the main memory. In other words, the cache optimizes
the mapping function by detecting the holes at run-time. Peir et al. reported that the cache-
miss rate produced by an adaptive group-associative cache is comparable with that of a
fully associative conventional cache for some workloads.
2.4.2.2 Inhibiting Rarely Reused Data from Polluting Cache Space
As conventional caches load every data item into the cache regardless of its reuse behavior, rarely
reused data pollutes the limited cache space. Cache bypassing is one approach to solving
this problem: missed data with poor temporal locality is provided directly from the
main memory to the processor without being loaded into the cache.
Johnson et al. [40] proposed a run-time adaptive cache-management scheme to improve cache-
hit rates. Their cache employs a memory address table (MAT), in which the memory-reference
behavior is recorded at run-time. Each entry in the MAT contains a counter used to
identify the amount of temporal locality of the corresponding memory block. When the
value of the counter is smaller than a threshold, the referenced data in the corresponding
memory block bypasses the cache.
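A minimal sketch of such counter-based bypassing is shown below; the tracking granularity, the threshold, and the update rule are assumptions made for illustration and are not taken from [40].

# Sketch of counter-based cache bypassing in the spirit of a memory
# address table (MAT). Granularity, threshold, and update rule assumed.
from collections import defaultdict

BLOCK_SIZE = 4096      # bytes tracked per table entry (assumed)
THRESHOLD = 2          # below this reuse count, misses bypass the cache
counters = defaultdict(int)

def should_bypass(miss_address):
    block = miss_address // BLOCK_SIZE
    bypass = counters[block] < THRESHOLD   # little temporal locality so far
    counters[block] += 1                   # record that the block was touched
    return bypass

# The first misses to a block are served around the cache; once the block
# has shown reuse, later misses are allowed to allocate in the cache.
print(should_bypass(0x12345))   # -> True
print(should_bypass(0x12345))   # -> True
print(should_bypass(0x12345))   # -> False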
McFarling [61] proposed a dynamic exclusion replacement policy in order to reduce the num-
ber of conflict misses in direct-mapped instruction caches. The cache presented in the paper
measures reference patterns. When two instructions compete for the same cache line, the
dynamic exclusion approach attempts to prohibit loading one instruction into the cache, so
that the other instruction can be kept in the cache. For the dynamic exclusion control, a
simple finite-state machine for each cache line is used. McFarling also proposed an instruc-
tion reordering technique based on compiler optimization in order to exclude less frequently
executed instructions.
Another approach to avoiding cache pollution is to secure a part of the cache space for
frequently reused data. Scratch-pad memory has been proposed to realize this kind of memory
management. The scratch-pad memory holds a part of the main-memory space and is
located at level 1 of the memory hierarchy. An access to the scratch-pad memory can be completed
in one clock cycle, the same as the level-1 cache. The difference between the scratch-pad memory
and the level-1 cache is that no data replacement takes place in the scratch-pad memory.
Namely, a 100 % hit rate for scratch-pad-memory accesses is guaranteed, whereas the level-
1 cache-hit rate depends on compulsory, capacity, and conflict misses. Which data should
be allocated to the scratch-pad memory space has to be determined before the program is
executed. Panda et al. [69] presented a technique for exploiting the scratch-pad
memory effectively: a careful partitioning of scalar and array variables between the main memory and the
scratch-pad memory improves the memory performance. Chiou et al. [17] proposed a column
caching strategy that allows data replacement to be restricted at a column, or way, granularity.
Cache-line replacement in a restricted column is prohibited, so that missed data
which maps to the restricted column bypasses the cache. We can regard the restricted
column as a scratch-pad memory. A software-controllable bit-vector specifies the replacement
restriction. Nakamura et al. [65] also proposed a software technique, called SCIMA: Software
Controlled Integrated Memory Architecture, for high performance computing. An on-chip
memory space is divided into two portions: a cache and an on-chip memory. The cache is under
hardware control, as in conventional caches, while data replacement in the on-chip memory
is controlled by software; that is, the on-chip memory works as a scratch-pad memory.
Since the cache and the on-chip memory share the same hardware memory structure, software can
attempt to change and optimize the ratio of their sizes.
2.4.2.3 Exploiting Different-Characteristics Memories
Researchers have proposed many cache architectures consisting of several memory modules
in order to improve cache-hit rates. The memory modules are used for different purposes
so as to follow the varying behavior of memory references.
(1) Keeping and Filtering: Attaching a High-Associative Cache
There are many approaches that employ a small, highly associative cache. The roles of the
attached set-associative cache are 1) to keep frequently reused data close to the level-1
cache instead of in the next-level memory, and 2) to filter out rarely reused data that would pollute the
cache. If a data item has rich temporal locality, it should not be evicted from the cache. In
contrast, a data item having poor temporal locality should not be loaded into the cache.
Jouppi [43] proposed the victim cache, which is a small fully associative cache located
between the direct-mapped level-1 cache (main cache) and the next-level memory (main
memory). When a cache line in the main cache is evicted, it is moved to the victim cache.
In the case of a miss in the main cache that hits in the victim cache, the cache lines are swapped
between the main cache and the victim cache. Namely, the victim cache attempts to keep
data that has been evicted from the main cache, but that probably has rich temporal locality,
close to the processor.
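A behavioral Python sketch of the main-cache/victim-cache interaction follows; the sizes, the FIFO eviction from the victim cache, and the tag-only bookkeeping are assumptions for illustration.

# Sketch of a direct-mapped main cache backed by a small fully
# associative victim cache (four entries, FIFO eviction assumed).
from collections import deque

NUM_SETS = 256
main = [None] * NUM_SETS           # main[index] = stored block, or None
victim = deque(maxlen=4)

def access(block):
    index = block % NUM_SETS
    if main[index] == block:
        return "main hit"
    if block in victim:
        # Swap: the requested block returns to the main cache, and the
        # block it displaces moves into the victim cache.
        victim.remove(block)
        if main[index] is not None:
            victim.append(main[index])
        main[index] = block
        return "victim hit"
    if main[index] is not None:    # miss: evicted line goes to the victim
        victim.append(main[index])
    main[index] = block
    return "miss"

# Two blocks that conflict in the main cache ping-pong between the main
# cache and the victim cache instead of causing repeated off-chip misses.
print(access(0x100), access(0x200), access(0x100), access(0x200))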
Theobald et al. [86] discussed the design space of hybrid-access caches (the combination of
a direct-mapped main cache and a set-associative cache like the victim cache), and proposed
the half-and-half cache. The access time to the direct-mapped main-cache is faster than
that to the attached small set-associative cache. For example, the main-cache access can be
completed in one cycle, while the associative cache requires two cycles: one for normal access
and one for swapping between the main cache and the set-associative cache. Thus, there is
a trade-off between the cache-access time and the cache-hit rate when we consider how to
distribute the cache capacity between the direct-mapped region and the set-associative region. Although
increasing the direct-mapped region (and thus decreasing the set-associative region) increases conflict
misses, it may improve the average cache-access time by increasing the number of hits to
the direct-mapped region. The half-and-half cache uses half of the total cache capacity for
the direct-mapped region and the remaining half for the set-associative region.
In contrast to the victim cache and the half-and-half cache, the annex cache proposed by John
et al. [39] and the pollution control cache proposed by Walsh et al. [93] attempt to filter
the data to be loaded into the main cache. Both the annex cache and the pollution control
cache are small high-associative caches attached to the main cache. On a cache miss, the
missed data is loaded into the small associative cache, instead of the main cache. Then the
cache lines in the main cache and the small associative cache are swapped when the filled
data in the small associative cache is referenced again. Therefore, data that is never reused
is evicted from the small associative cache without ever being loaded into the main cache.
(2) Exploiting Different Types of Locality
Spatial locality can be exploited by increasing the cache-line size. On the other hand,
decreasing the cache-line size is a good approach to exploiting temporal locality, because the
total number of entries, or cache lines, in the cache is increased. Unfortunately, conventional
caches have a fixed cache-line size, so it is impossible to satisfy both of the above-mentioned
requirements. The most straightforward approach to solving this problem is to employ two
types of caches: one with a small cache-line size and the other with a large cache-line size.
The dual data cache proposed by Gonzalez et al. [23] consists of two memory modules: a spatial
cache and a temporal cache. These caches have the same organization, but their cache-line sizes
are different: the spatial cache has a larger cache-line size, whereas the temporal cache has a
smaller one. A locality prediction table in the dual data cache determines where
the missed data should be loaded. Each entry in the table corresponds to a recently executed
load/store instruction. In contrast to this dynamic optimization, Sanchez et al. [76] discussed a
static locality analysis for the dual data cache.
Park [71] proposed the co-operative cache which consists of the spatial-oriented cache (SOC)
having a larger cache-line size and the temporal-oriented cache (TOC) having a smaller cache-
line size. Another example is the split temporal/spatial cache proposed by Milutinovic et al.
[62] which also has a spatial cache having a usual cache-line size and a temporal cache having
a small cache-line size.
(3) Prohibiting Non-Critical Data from Polluting Cache Space
So far, we have introduced many techniques to improve cache-hit rates. However, improving
the cache-hit rate does not necessarily translate into an advantage in total system performance.
When we consider the total execution time of a program, the most important thing is to
reduce the total number of clock cycles required. From the memory-system point of view, we
need to consider the total number of processor stalls caused by the real memory system.
Recent processors exploit increased instruction level parallelism (ILP), thereby achieving
higher performance. In other words, lack of ILP degrades the total processor performance.
The cache-hit rate may not be an appropriate metric for evaluating total memory-system
performance, because it does not capture how much each load/store operation affects the
total number of processor stalls. Processor stalls depend on the data dependences in the
program, so some cache misses that affect those dependences are more critical than others.
In addition, if there are enough instructions which can be issued, a cache miss might not
affect the ILP. Actually, Srinivasan et al. [83] showed that not all data accesses need to occur
immediately if there are enough ready instructions for the processor to execute.
The Non-Critical Buffer proposed by Fisk et al. [21] is a small associative buffer (for example,
16 entries) which works in parallel with the level-1 data cache (main cache). The Non-Critical
Buffer is used to prevent non-critical data from polluting the cache space. As a result, the large
main cache can be reserved for critical data, whose misses significantly damage processor
performance (i.e., ILP). Two mechanisms to identify the non-critical data at run-time were
performance (i.e., ILP). Two mechanisms to identify the non-critical data at run-time were
proposed. One of the mechanisms tracks the processor performance by monitoring issue rate
or functional unit usage, and the other mechanism uses the Load/Store Queue (LSQ). Fisk
et al. reported that the non-critical buffer can achieve processor performance improvements
even if it worsens the cache-hit rates.
2.4.2.4 Data Prefetching by Larger Cache-Line Sizes
If we can perform perfect prefetching, the ideal memory system can be realized because all
main-memory accesses are overlapped with other computations. Increasing cache-line size
is one method of performing data prefetching. If memory references have rich spatial
locality, larger cache-line sizes give a prefetching effect. However, the following drawbacks
prevent cache designers from increasing the cache-line size.
• Increase in conflict misses: increasing the cache-line size results in reducing the total
number of cache lines which can be held in the cache. Thus, large cache-line sizes
increase conflict misses when programs have poor spatial locality (increase in CMR in
Equation (2.1)), thereby degrading the memory system performance.
• Increase in the memory bandwidth requirement: On cache misses, large cache lines need
to be replaced between the cache and the main memory. Therefore, increasing cache-
line size increases the memory bandwidth required (increase in LineSize in Equation
(2.2)). However, the I/O pin bottleneck between the cache and the main memory in
conventional systems limits the attainable memory bandwidth. As a result, increasing
the cache-line size increases miss penalty (increase in TMainMemory in Equation (2.1)),
thereby degrading the memory system performance.
For the cache-line-size optimization, the following processes are required: 1) detect the
amount of spatial locality inherent in programs, and 2) modify the cache-line size. We can
consider the following approaches to detecting the locality and to modifying the cache-line
size:
• Hardware detection and hardware modification: The amount of spatial locality is mea-
sured at run-time. In this method, a mechanism to record memory-reference history
will be required. The cache-line size is modified according to the memory-reference
history [20], [41], [37], [92], [57].
• Software detection and software modification: The amount of spatial locality is analyzed
at compile-time. In this method, loop structures inherent in programs will be exploited.
Special instructions are inserted in program codes by compiler in order to modify the
cache-line size [91], [92].
• Hardware detection and software modification: The amount of spatial locality is mea-
sured at run-time. A compiler inserts special instructions in program codes. The special
instruction says “if a condition is satisfied, then increase (or decrease) the cache-line
size”. The run-time measurement is exploited to determine the condition of the special
instruction for modifying the cache-line size.
The details of the line-size optimization techniques are discussed in Chapter 5; a minimal sketch of the hardware-detection approach is given below.
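As a concrete illustration of the hardware-detection step, the following Python sketch tracks which words of each fetched cache line are actually touched before the line is evicted, and derives a crude grow/shrink suggestion for the line size. The class names, thresholds, and update policy are illustrative assumptions for this sketch only, not the mechanisms of the proposals cited above.

# Sketch: estimate spatial locality by tracking which words of each fetched
# cache line are touched before the line is evicted (illustrative only).

class LineUsageMonitor:
    def __init__(self, line_size_words=8):
        self.line_size_words = line_size_words
        self.used = {}                 # line address -> set of touched word offsets

    def on_fill(self, line_addr):
        self.used[line_addr] = set()

    def on_access(self, line_addr, word_offset):
        if line_addr in self.used:
            self.used[line_addr].add(word_offset)

    def on_evict(self, line_addr):
        touched = len(self.used.pop(line_addr, ()))
        return touched / self.line_size_words     # fraction of the line actually used

def suggest_line_size(avg_used_fraction, current_words,
                      grow_threshold=0.75, shrink_threshold=0.25):
    """Grow the line when most words are used, shrink it when few are (assumed thresholds)."""
    if avg_used_fraction > grow_threshold:
        return current_words * 2
    if avg_used_fraction < shrink_threshold:
        return max(1, current_words // 2)
    return current_words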
2.4.2.5 Optimizing Data Placement
Conflict misses take place when two data compete for a cache location. If we can re-allocate
the address of one of the competing data, the conflict miss can be avoided. Data-placement
optimization is a static approach to reducing conflict misses [87], [68], [33], [47].
2.4.3 Making Cache-Miss Penalty Smaller
The final approach to the ideal memory system is to reduce the miss penalty which is wasted
on cache misses. As shown in Equation (2.2), there are at least three approaches to minimizing
the miss penalty: 1) improving the DRAM access time, 2) reducing the cache-line size, and
3) increasing the memory bandwidth. The DRAM access time can be improved by advanced
process technology; however, process-level optimization is outside the scope of this thesis.
Conventional caches exploit the spatial locality by employing larger cache-line size. There-
fore, there is a trade-off between improving the cache-hit rate and reducing the miss penalty.
Although increasing cache-line size improves cache-hit rates due to the effect of prefetching
(decrease in CMR in Equation (2.1)), it also increases the memory bandwidth requirement for
cache-line replacements (increase in TMainMemory in Equation (2.1)). Adapting the cache-line
size introduced in Section 2.4.2.4 attempts to find appropriate trade-off points. Employing
a cache-bypass mechanism is also a good approach to reducing the memory bandwidth requirement.
The idea of Tyson et al. [90] is based on the fact that almost all cache misses are caused
by a small number of instructions called troublesome instructions; they reported that less
than 5% of the total load instructions are responsible for over 99% of all cache misses.
When a troublesome instruction causes a cache miss, the referenced data bypasses the cache
instead of being loaded into it. In this case, only the data requested by the troublesome
instruction is transferred from the main memory to the processor.
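A minimal sketch of this bypassing idea is given below: it marks a load instruction as troublesome once it has caused a given number of misses, after which its accesses bypass the cache. The threshold and the per-PC bookkeeping are illustrative assumptions, not the exact mechanism of Tyson et al. [90].

# Sketch: marking "troublesome" load instructions that cause many misses
# and bypassing the cache for them (threshold is an illustrative assumption).

from collections import defaultdict

class BypassPredictor:
    def __init__(self, miss_threshold=8):
        self.miss_count = defaultdict(int)   # load PC -> observed cache misses
        self.threshold = miss_threshold

    def should_bypass(self, load_pc):
        return self.miss_count[load_pc] >= self.threshold

    def record_miss(self, load_pc):
        self.miss_count[load_pc] += 1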
Improving the memory bandwidth can be achieved by integrating the cache and the main
memory into the same chip, i.e., a merged DRAM/logic LSI. Eliminating the chip boundary
between the cache and the main memory solves the I/O-pin bottleneck problem, thereby
dramatically improving the memory bandwidth [64], [78], [72], [73].
2.5 Low-Energy Memory-Access Techniques
From Equation (2.4), it can be understood that there are at least three approaches to reducing
the average memory-access energy as follows:
• Reducing the cache-access energy (ECache), while maintaining the cache-miss rate and
the main-memory-access energy as much as possible.
• Reducing the cache-miss rate (CMR), while maintaining the cache-access energy and
the main-memory-access energy as much as possible.
• Reducing the main-memory-access energy (EMainMemory), while maintaining the cache-
access energy and the cache-miss rate as much as possible.
The techniques introduced in Section 2.4.2 for reducing conflict misses can be used for the
second approach [4]. In the following sections, we focus on the first and the third approaches.
Energy dissipation in CMOS circuits is mainly due to the charging and discharging of load
capacitances. While a cache access is performed, the following energy is dissipated:

ESRAMarray = 0.5 × C × Vdd²,   (2.7)

where Vdd is the supply voltage as well as the output voltage swing, and C is the total switched
load capacitance of all cache components (bit-lines, word-lines, memory cells, and so on). It
can be understood from Equation (2.7) that we can reduce the energy dissipation by reducing
C or Vdd. Reducing the supply voltage (Vdd) has a great impact on the energy dissipation,
because Equation (2.7) is a function of the square of the supply voltage. However, it also
makes the access time longer [15], [66], so we do not consider this approach in this thesis.
In the following sections, we introduce energy reduction techniques for cache and main-
memory accesses by reducing the switched load capacitance (C). Section 2.5.1 shows tech-
niques to reduce the cache-access energy: structural approach and behavioral approach.
Section 2.5.2 presents energy reduction techniques for main-memory accesses. DRAM (main
memory) consumes static energy not for main-memory accesses but for refresh operations.
Although the static energy is not included in Equation (2.4), some techniques to reduce that
energy are introduced in Section 2.5.3.
2.5.1 Reducing Cache-Access Energy
Basically, the energy dissipation for a memory-array access depends on the array size (or
the number of words held in the memory array) [38]. Let us consider the case where a 32-bit
word needs to be read from a 16-Kbit (128 × 128 bit cells) cache space. Here, we refer to the
fraction of the cache space that needs to be activated for a cache access as the activated-area.
In addition, we refer to CBLcell and CWLcell as the switched load capacitances associated with
a single bit cell on a bit-line and on a word-line, respectively. In order to simplify the explanation
of the activated-area, in this section we assume that the cache-access energy is determined
only by the charging/discharging of bit-lines and word-lines. In fact, other circuits, for example
bit-line precharging circuits, also dissipate some energy. In the case of the original memory-array
organization, the activated-area is equal to the whole cache space. Thus, the switched load
capacitance in Equation (2.7) is 16384 × CBLcell + 128 × CWLcell. However, if the memory array
is divided into four modules (128 × 32 bit cells × 4), and if it is possible to activate only
one module, the activated-area becomes a quarter of the whole cache. In this case, the total
switched load capacitance is 4096 × CBLcell + 32 × CWLcell, thereby saving energy.
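The arithmetic above follows directly from Equation (2.7); the short Python sketch below reproduces it. The per-cell capacitances and the supply voltage are placeholder values (only the ratio between the two organizations matters), not figures taken from any particular design.

# Sketch of Equation (2.7) applied to the 128 x 128-bit array example above.
# CBL_CELL, CWL_CELL, and VDD are placeholder values, not design data.

CBL_CELL = 1.0e-15   # bit-line capacitance per cell (illustrative)
CWL_CELL = 1.0e-15   # word-line capacitance per cell (illustrative)
VDD = 2.5            # supply voltage in volts (illustrative)

def sram_access_energy(switched_capacitance, vdd=VDD):
    """Equation (2.7): E = 0.5 * C * Vdd^2."""
    return 0.5 * switched_capacitance * vdd ** 2

# Non-divided array: 128 x 128 cells on the bit-lines, 128 cells on the word-line.
c_whole = 128 * 128 * CBL_CELL + 128 * CWL_CELL

# Divided into four 128 x 32 modules with only one module activated.
c_quarter = 128 * 32 * CBL_CELL + 32 * CWL_CELL

print(sram_access_energy(c_quarter) / sram_access_energy(c_whole))  # 0.25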
The key idea of the techniques introduced in this section is to make the activated-area small, and the
following two steps are required:
1. Module Partitioning: Divide the cache into at least two modules, or attach at least one
small cache module. As a result, small areas, which are candidates of the activated-area,
are generated.
2. Selective Activation: Activate only one small area for performing the cache access.
We classify energy reduction techniques for cache memories into two approaches: a structural
one and a behavioral one. The structural approach changes the memory organization (the cache
is divided), but the cache-access operation is not modified. In contrast, the behavioral approach
attempts to optimize the cache-access operation for low energy dissipation while the memory
organization is maintained (such caches originally have a multi-module organization).
2.5.1.1 Structural Approaches
(1) Horizontal Partitioning
The techniques introduced in this section partition the cache module horizontally. A well-known
technique for memory-array partitioning is word-line partitioning [75]. In conventional
memory arrays, a large number of transfer gates are connected to each word-line. Word-line
partitioning reduces the total number of memory cells connected to a word-line.
Cache subbanking [85], [48] is a horizontal partitioning scheme for low-energy caches. Usu-
ally, a cache line includes several words (for example 8 words) in order to exploit the spatial
locality of memory references. In conventional caches, the referenced data is selected from the
corresponding cache line, which is read from the data memory; thus, the remaining contents of
the cache line are unused. In cache subbanking, the data memory is partitioned horizontally
into subarrays, and only the subarray designated by the offset field of the memory address is
activated.
Region-Based Caching proposed by Lee et al. [59] is another implementation of horizontal
partitioning. The region-based caching exploits the different characteristics of data type, and
consists of three cache modules: a small module for stack data, a small module for global
data, and a larger main module for other data. For example, a 4 KB direct-mapped stack-cache, a
4 KB direct-mapped global-cache, and a 32 KB direct-mapped main-cache are implemented.
Which module has to be searched is determined from the memory address. When
only the stack-cache or the global-cache is activated, the energy dissipated for the cache access
can be reduced due to the small value of C in Equation (2.7), compared with a conventional cache
organization. Lee et al. reported that about 70 % of memory references hit the stack-cache
or the global-cache.
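To illustrate how the module to be activated can be chosen from the memory address alone, the sketch below selects among the stack-cache, global-cache, and main-cache. The address ranges are illustrative assumptions for this sketch and do not correspond to any particular system described in [59].

# Sketch: selecting the cache module from the memory address in a
# region-based cache.  The address ranges are assumptions for illustration.

STACK_BASE  = 0x7FF0_0000   # assumed stack region
GLOBAL_BASE = 0x1000_0000   # assumed global/static data region
GLOBAL_TOP  = 0x2000_0000

def select_module(addr):
    """Return which cache module should be activated for this access."""
    if addr >= STACK_BASE:
        return "stack-cache"      # small module, low access energy
    if GLOBAL_BASE <= addr < GLOBAL_TOP:
        return "global-cache"     # small module, low access energy
    return "main-cache"           # larger module, higher access energy

print(select_module(0x7FFF_FF10))  # stack-cache
print(select_module(0x1234_5678))  # global-cache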
(2) Vertical Partitioning
In contrast to word-line partitioning, bit-line partitioning is another well-known technique
[75]. Ghose et al. [22] evaluated the effects of bit-line partitioning for caches employed
by superscalar processors.
Employing a level-1 cache reduces the energy consumed for memory accesses because of
the small activated-area (i.e., accessing not to the large main memory but to the small level-1
cache). Similarly, adding a small level-0 cache between the level-1 cache and the processor
can make a significant energy reduction.
Su et al. [85] and Kamble et al. [48] evaluated the energy efficiency of cache-line buffering,
or block buffering, which uses a single-entry buffer. The previously accessed cache line is held
in the cache-line buffer. When a memory access is issued, the cache-line buffer is searched
first. If the memory access hits the cache-line buffer, the desired data is provided from the
cache-line buffer to the processor. In this case, the activated-area is the small cache-line
buffer instead of the main cache. When memory references have rich temporal (and spatial)
locality, the buffer-hit rate is high, thereby saving more energy. Ghose et al. [22]
proposed the multiple line buffer for superscalar processors, which has several (for
example, four) entries. Kin et al. [54] proposed the filter cache, in which a level-1 cache access occurs
only on a filter-cache miss. Kin et al. reported that the filter cache reduces the energy-delay
product by 51% across a set of multimedia and communication applications compared with a
conventional cache organization. Another study of this kind of approach was presented by
Bajwa et al. [5], in which a small level-0 cache is employed to reduce the energy consumed by
an instruction cache.
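The following sketch illustrates the vertical-partitioning idea discussed above: a single-entry cache-line buffer is checked first, and the level-1 cache is activated only on a buffer miss. The relative energy values are placeholders chosen only to show the effect of the smaller activated-area.

# Sketch: a single-entry cache-line buffer placed in front of the level-1
# cache.  Only the buffer is activated on a buffer hit; energies are placeholders.

LINE_SIZE = 32          # bytes
E_BUFFER  = 0.05        # relative energy of a line-buffer access (illustrative)
E_L1      = 1.00        # relative energy of a level-1 cache access (illustrative)

class LineBuffer:
    def __init__(self):
        self.tag = None     # address of the currently buffered line
        self.data = None

    def access(self, addr, l1_read):
        """l1_read is a callable that returns the line from the level-1 cache."""
        line_addr = addr // LINE_SIZE
        if line_addr == self.tag:              # buffer hit: small activated-area
            return self.data, E_BUFFER
        self.tag = line_addr                   # buffer miss: activate level-1 cache
        self.data = l1_read(line_addr)
        return self.data, E_BUFFER + E_L1

buf = LineBuffer()
print(buf.access(0x100, lambda line: "line-%d" % line))  # miss, then hit below
print(buf.access(0x104, lambda line: "line-%d" % line))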
The effectiveness of vertical partitioning depends largely on how much the memory accesses
can be concentrated on the small level-0 cache. Bellas et al. [6] proposed a dynamic cache-management
scheme that allocates the most frequently executed instruction blocks to the small level-0
cache. A branch-prediction unit is exploited to detect the frequently executed blocks.
(3) Horizontal and Vertical Partitioning
Ko et al. [56] proposed the MDM (multi-divided module) cache architecture. The cache
is divided horizontally and vertically into small modules. Each small module includes its own
peripheral circuits, so that it can operate as a stand-alone cache. Only the single small module
designated by the memory address is activated. When the MDM cache has M independently
selectable modules, the switched load capacitance per access becomes almost 1/M of that of
a non-divided conventional organization.
(4) Static and Dynamic Regions Partitioning
Adding a small level-0 cache between the level-1 cache and the processor can be seen as an
extension of the memory hierarchy, because data replacements between the level-0 and level-1
caches are required on level-0 cache misses. Another approach to reducing cache-access energy
is to partition the cache module into a small static module and a large dynamic module. Data
allocation to the static module is determined by prior program analysis, and the static module
works as the scratch-pad memory explained in Section 2.4.2.2, whereas the dynamic module
behaves as a normal cache. If we can concentrate memory accesses on the small static module,
a large amount of energy can be saved due to the small value of C in Equation (2.7).
Many techniques for data allocation to the static module have been proposed. These
techniques are based on the profile data from the execution of programs. Panwar et al. [70]
proposed S-cache. Frequently executed basic-blocks are placed in the S-cache (a small static
module). Jump instructions which control the execution flow between the S-cache and the
level-1 main cache are inserted in program codes. Although the S-cache has been proposed
for reducing the frequency of tag comparisons, it can also be used to reduce the energy consumed
for cache-line accesses. Bellas et al. [7], [8] proposed the loop cache (L-cache) for instruction
caches. The compiler lays out the target program to maximize the number of accesses to the
L-cache, and inserts special instructions to identify the boundary between the code placed in
the L-cache and the code that is not.
Increasing the static-module size increases the static-module hit rate. However, it also
increases the energy dissipation for an access to the static module. Ishihara et al. [38]
discussed the trade-off between the size of the main level-1 cache and that of the static
module. Kawabe et al. [50] presented an implementation example of the static and dynamic
region partitioning. Of course, the scratch-pad memory introduced in Section 2.4.2.2 can be
employed for this kind of low-energy technique.
2.5.1.2 Behavioral Approaches
As explained in Section 2.3, all ways in set-associative caches are searched in parallel because
the cache-access time is critical. Thus, the energy for a tag-subarray access and that for a
data-subarray access are dissipated in every way. However, since only one way has the data
desired by the processor on a cache hit, conventional set-associative caches waste a lot of
energy. Some techniques have been proposed to alleviate this negative effect of set-associative
caches by optimizing the cache-access behavior.
(1) Selective Way Activation
The activated-area in conventional set-associative caches includes all ways. However, all
way accesses but one are unnecessary. One approach to achieving energy reduction for
set-associative caches is to restrict the activated-area to the single way that includes the
desired data.
The Hitachi SH microprocessor employs a phased cache in order to avoid unnecessary data-
subarray accesses [26]. In the phased cache, tag comparison and cache-line access are per-
formed sequentially. First, tag comparisons are performed without data-subarray activation.
Then, only a single data-subarray which includes the desired data is accessed if at most one
tag matches. Otherwise, a cache-line replacement is performed without any data-subarray
access. Although this approach reduces the energy consumed for data-subarray accesses
(cache-line accesses), the cache-access time will be increased due to the sequential flow. If we
know which way includes the desired data before starting the cache access (i.e., without per-
forming the tag comparison), the unnecessary way-accesses can be eliminated without cache-
access-time overhead. Thus, the way-prediction techniques introduced in Section 2.4.1.1 can
be used for reducing cache-access energy [35], [53]. A correct way-prediction makes it possible
to activate only the desired way without using the tag-comparison results. The detail of the
way-prediction techniques for low-energy dissipation is explained in Chapter 3.
As shown in Figure 2.1 in Section 2.3, conventional set-associative caches consist of several
memory-subarrays (i.e., ways). Albonesi [3] proposed the selective cache ways which allows
software to optimize the cache size and associativity. Each way can be enabled or disabled
by a Cache Way Select Register (CWSR). For example, a 32 KB four-way set-associative cache
can operate as an 8 KB direct-mapped cache, a 16 KB two-way set-associative cache, or a 32
KB four-way set-associative cache. Namely, the activated-area corresponds to the cache size
specified by the CWSR. Software such as the operating system can determine the trade-off
between performance and energy dissipation by modifying the CWSR.
(2) Omitting Tag Comparison
In conventional caches, a tag comparison is performed on every access to determine whether
the access hits the cache. Panwar et al. [70] proposed a conditional tag-comparison scheme
which attempts to reduce the total number of tag comparisons required during program execution.
If two successive instructions i and j reside in the same cache line, the tag comparison
for j can be omitted. Another approach to omitting the tag comparison is to exploit execution
footprints. The condition for performing the tag comparison is determined based on
the history of program execution [34]. The tag comparison for instruction j can be omitted
even if instructions i and j reside in different cache lines. The details of this technique are
discussed in Chapter 4.
2.5.2 Reducing Data-Transfer/Main-Memory-Access Energy
As introduced in Section 2.5.1.1(1), cache-access energy can be reduced by employing the
cache subbanking. The idea of subbanking comes from the fact that the referenced data is only one
word, not a whole cache line. As the offset of the referenced data within a cache line is
determined by the memory address, the selective activation of a subbank can be implemented.
However, this kind of technique cannot be employed directly for the main memory: since the
cache-line size is fixed, a DRAM-array access with a fixed cache-line size takes place on every
main-memory access. Main-memory subbanking can be achieved by reducing the size of the
data to be replaced between the cache and the main memory. Therefore, the techniques for low
memory traffic introduced in Section 2.4.3, the cache bypassing introduced in Section 2.4.2.2,
and the cache-line-size reduction introduced in Section 2.4.2.4 are useful for main-memory
subbanking. The details of the main-memory subbanking approach based on line-size optimization
are discussed in Chapter 5.
Another approach to reducing the energy consumed for data transfer is bus coding. Coded
data is transferred from the sender to the receiver instead of the raw data. The data to
be transferred is encoded in order to reduce the number of bus transitions, thereby saving
energy [84], [9]. Hardware components for encoding and decoding are required. Tomiyama et al.
[88] proposed a technique to reduce bus transitions based on instruction scheduling. Since the
number of bus transitions is reduced only by re-ordering instructions, their approach
does not require any hardware overhead (i.e., no encoder or decoder is required).
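One well-known transition-reducing code is bus-invert coding, sketched below for illustration; it is not necessarily the scheme used in [84] or [9]. If transmitting a word would toggle more than half of the bus lines, the complemented word is sent together with an extra invert signal.

# Sketch: bus-invert coding on an N-bit bus (illustrative; not necessarily
# the coding scheme of the cited papers).

BUS_WIDTH = 32

def bus_invert_encode(prev_word, word):
    """Return (value driven on the bus, invert-line value)."""
    mask = (1 << BUS_WIDTH) - 1
    toggles = bin((prev_word ^ word) & mask).count("1")
    if toggles > BUS_WIDTH // 2:
        return (~word) & mask, 1     # send complement, assert invert line
    return word & mask, 0            # send raw word

def bus_invert_decode(received, invert_bit):
    mask = (1 << BUS_WIDTH) - 1
    return (~received) & mask if invert_bit else received

sent, inv = bus_invert_encode(0x0000_0000, 0xFFFF_FF0F)
print(hex(bus_invert_decode(sent, inv)))  # 0xffffff0f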
2.5.3 Reducing DRAM-Static Energy
DRAM consumes not only the dynamic energy caused by DRAM accesses but also static
energy for refresh operations. The static energy is not included in Equation (2.4). However,
this energy consumption is also important for low-energy memory systems. One approach
to reducing the static energy consumption is to reduce the total number of DRAM refresh
operations performed.
Ohsawa et al. [67] proposed a selective refresh scheme to optimize the number of DRAM refreshes
required during program execution. The compiler analyzes data lifetimes, and then it inserts a
hint into store instructions indicating whether the stored data needs to be refreshed. Another
approach to reducing standby power is to exploit DRAM power states [58], [19].
2.6 Conclusions
In this chapter, we have surveyed the techniques for high speed, low energy memory systems.
The best way to improve performance/energy efficiency is to achieve fast and low-energy
access at each level of memory hierarchy and to concentrate memory accesses on the closest
level to the processor. In order to approach an ideal memory system, almost all techniques
introduced in this chapter rely on the locality of memory references: temporal locality and
spatial locality.
With respect to Equations (2.1) and (2.4), the cache architectures proposed in this thesis have
the following effects:
• The way-predicting set-associative cache introduced in Chapter 3 reduces the cache-
access energy (ECache in Equation (2.4)), and maintains the cache-miss rate of a con-
ventional organization with the same cache size and associativity. However, our cache
incurs an acceptable cache-access-time overhead due to wrong way-predictions. The
main-memory-access time and energy are also maintained. The way-predicting set-
associative cache belongs to the behavioral approach explained in Section 2.5.1.2.
• The history-based tag-comparison cache introduced in Chapter 4 reduces the cache-
access energy (ECache in Equation (2.4)) of direct-mapped instruction caches, and main-
tains the cache-miss rate of a conventional organization with the same cache size. The
cache-access time, main-memory-access time, and main-memory-access energy are also
maintained. The history-based tag-comparison cache belongs to the behavioral ap-
proach explained in Section 2.5.1.2.
• The dynamically variable-line size cache introduced in Chapter 5 reduces the cache-miss
rate (CMR in Equation (2.1)), and maintains the cache-access time of a conventional
organization with the same cache size and associativity. Our cache also reduces the
main-memory-access energy (EMainMemory in Equation (2.4)). In addition, exploiting
the high on-chip memory bandwidth of merged DRAM/logic LSIs reduces the main-
memory-access time (TMainMemory in Equation (2.1)). The dynamically variable-line size
cache belongs to the data-prefetching high-performance technique explained in Section
2.4.2.4, the high-performance technique using high memory bandwidth explained in
Section 2.4.3, and energy reduction techniques explained in Section 2.5.2. We can
regard the energy reduction technique of the dynamically variable-line size cache (i.e.,
optimizing cache-line size at run-time) as a behavioral approach.
Although many cache architectures for improving memory-system performance have been
proposed, one of the most important goals in future memory systems is to achieve high
performance and low energy dissipation at the same time. From the energy point of view, we have
introduced two approaches to reducing energy dissipation: the structural approach explained
in Section 2.5.1.1 and the behavioral approach explained in Section 2.5.1.2. The structural
approach attempts to reduce energy dissipation by improving the hierarchical memory organization
(the memory module is partitioned vertically and/or horizontally). We believe that it is
promising to develop innovative cache architectures based on the behavioral approach and
to combine them with the structural approach.
Chapter 3
Way-Predicting Set-Associative
Cache Architecture
3.1 Introduction
Many modern processors employ set-associative caches as L1 or L2 caches. Since an n-way
set-associative cache has n locations where a cache line can be placed, it can offer higher hit
rates than direct-mapped caches. However, increasing cache associativity makes the cache
access time longer due to the delay for way selection based on tag-comparison results. To
compensate for this disadvantage, several researchers have proposed way-predictable set-
associative caches [14],[16],[52],[99]. In fact, the way-prediction technique has been employed
in commercial processors [89],[98].
The cited papers have focused only on the performance improvement achieved by the way
prediction. However, we believe that way prediction can offer a significant energy reduc-
tion in set-associative caches. In this chapter, we propose a low power cache architecture
using the way prediction, called way-predicting set-associative cache. The way-predicting
set-associative cache speculatively selects one way, which is likely to contain the data desired
by the processor, from the set designated by the memory address, before it starts the normal
cache access. In conventional set-associative caches, all ways are accessed in parallel because
the cache-access time is critical. Since only one way has the referenced data, however, the other
way accesses are unnecessary. A correct way-prediction makes it possible to eliminate the
unnecessary way activation, so that the energy can be saved.
The rest of this chapter is organized as follows. Section 2 summarizes the energy con-
sumption and the cache-access time of a conventional set-associative cache. In addition, a
low-power set-associative cache is described as a counterpart to our architecture. Section 3
discusses the way-predicting set-associative cache in detail. Sections 4 and 5 evaluate the
way-predicting cache qualitatively and quantitatively in terms of both energy and
performance. Section 6 shows related work, and Section 7 gives some concluding remarks.
3.2 Set-Associative Cache
3.2.1 Conventional Caches
The cache-access energy depends on the energy dissipated for the SRAM access, as ex-
plained in section 2.2. In this chapter, we simplify the cache-access energy as follows:
ECache ≈ ESRAMarray (3.1)
= NTag × ETag + NData × EData (3.2)
• NTag, NData: The average numbers of tag-subarrays and data-subarrays, respectively, activated
for a cache access.
• ETag, EData: Energy dissipated for a tag-subarray access and that for a data-subarray
access, respectively.
In conventional set-associative caches, all the ways are activated regardless of hits or misses,
and the cache access can be completed in one cycle. Accordingly, average cache-access energy
(ECache) and time (TCache) of a conventional four-way set-associative cache (4SACache) can
be expressed by the following equations:
E4SACache = 4ETag + 4EData. (3.3)
T4SACache = 1Cycle. (3.4)
3.2.2 Phased Set-Associative Cache
Although at most only one way has the data desired by the processor, all the ways are
accessed in parallel, as shown in Figure 3.1(a). Thus, a lot of energy will be wasted in
conventional set-associative caches. To solve this issue, Hasegawa et al. proposed a low-power
set-associative cache architecture [26], which we refer to as the phased set-associative cache. As
shown in Figure 3.1(b), the phased set-associative cache divides the cache-access process into
the following two phases:
• Cycle 1: All the tags in the set indexed by the memory address are read out from tag-
subarrays in parallel. Then the tags are compared with the tag-portion in the memory
address for cache lookup. No data accesses occur during this phase.
• Cycle 2: If one of the tag-comparison results is a match, the matching way includes the
data desired by the processor. In this case, only the data-subarray in the matching way
is accessed. The remaining ways are not activated, so that the phased set-associative
cache can reduce the energy consumption. If no tag matches, the referenced
data does not reside in the cache. Accordingly, the cache access is terminated without
any data-subarray access, and a cache replacement is performed.
As explained above, the phased set-associative cache reduces the energy consumption by elim-
inating unnecessary way accesses. The phased four-way set-associative cache (P4SACache)
saves 3EData and 4EData, compared with the conventional four-way set-associative
cache (4SACache), on cache hits and cache misses, respectively. However, the cache suffers
from a longer cache-access time. There is no access-time penalty on cache misses, because
the data accesses are not performed; on cache hits, however, the two phases have to be
performed sequentially. The average energy consumption for a cache access (EP4SACache),
and the average cache-access time (TP4SACache), of the phased four-way set-associative cache
can be expressed as follows:
EP4SACache = 4ETag + CHR × EData. (3.5)
TP4SACache = 1Cycle + CHR × 1Cycle. (3.6)
Here, CHR is the cache hit rate.
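Equations (3.3) through (3.6) can be evaluated directly, as in the following Python sketch. The relative tag energy (ETag = 0.078 EData) is the assumption used later in Section 3.4; the printed values are in units of EData and cycles.

# Sketch of Equations (3.3)-(3.6): average energy (in units of E_Data) and
# access time (in cycles) for the conventional and phased 4-way caches.

E_TAG, E_DATA = 0.078, 1.0     # relative energies (assumption from Section 3.4)

def conventional_4way():
    return 4 * E_TAG + 4 * E_DATA, 1.0            # Eqs (3.3), (3.4)

def phased_4way(chr_):
    energy = 4 * E_TAG + chr_ * E_DATA            # Eq (3.5)
    time = 1.0 + chr_ * 1.0                       # Eq (3.6)
    return energy, time

print(conventional_4way())      # (4.312, 1.0)
print(phased_4way(chr_=0.95))   # (1.262, 1.95)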
Figure 3.1: Phased Set-Associative Cache. (a) conventional 4-way set-associative cache; (b) phased 4-way set-associative cache.
3.3 Way-Predicting Set-Associative Cache
3.3.1 Concept
The phased set-associative cache explained in Section 3.2.2 attempts to eliminate unnecessary
data-subarray accesses by accepting a cache-hit time penalty. For almost all programs, cache-hit
rates are very high, even for data caches, because of the high locality of memory references. As
the memory-system performance strongly affects the total program execution time, it is very
important to maintain fast cache accesses, especially on hits.
The way-predicting set-associative cache speculatively chooses one way before starting the
normal cache-access process. Then the cache divides the cache-access process into two phases,
similar to, but not the same as, the phased set-associative cache, as follows:
• Cycle 1: Both of a tag and a cache line from only the predicted-way are read out, and
then the tag comparison is performed. If the tag-comparison result is a match, the
data desired by the processor is provided from the cache line read out, and the cache
access is completed successfully. In this case, the way-predicting set-associative cache
behaves as a direct-mapped cache, as shown in Figure 3.2(a). If the tag-comparison
result is not a match, the second phase is performed.
• Cycle 2: The cache searches the remaining ways in parallel, as shown in Figure 3.2(b).
If one of the tag-comparison results is a match, the data from the hit way is provided
to the processor. Otherwise, a cache replacement takes place. Namely, the way-predicting
set-associative cache behaves as a “three-way” set-associative cache in this phase.
Figure 3.2: Way-Predicting Set-Associative Cache. (a) prediction-hit; (b) prediction-miss.
On a prediction-hit, as shown in Figure 3.2(a), the way-predicting set-associative cache
consumes energy only for activating the predicted way. In addition, the cache access can be
completed in one cycle. On prediction-misses (or cache misses), however, the cache-access
time increases due to the successive process of two phases as shown in Figure 3.2(b). Since all
the remaining ways are activated in the same manner as conventional set-associative caches,
the way-predicting set-associative cache cannot reduce energy consumption in this scenario.
The performance/energy efficiency of the way-predicting set-associative cache largely depends
on the accuracy of the way prediction.
The average energy consumption for an access (ECache), and the average cache-access
time (TCache), of the way-predicting four-way set-associative cache (WP4SACache) can be
expressed as follows:
EWP4SACache = (ETag + EData) + (1 − PHR) × (3ETag + 3EData) (3.7)
TWP4SACache = 1Cycle + (1 − PHR) × 1Cycle (3.8)
Here, PHR is the prediction-hit rate. The phased set-associative cache pays a cache-access-time
penalty on cache hits, while the way-predicting set-associative cache pays it on prediction-misses
(including cache misses). The total number of cache hits is much larger than that of cache misses, so that the
way-predicting set-associative cache has a significant advantage in cache-access time,
compared with the phased set-associative cache.
3.3.2 Way Prediction
Many application programs have high locality of memory references. This means that a
cache line referenced by the processor will be referenced again in the near future.
Here, it is assumed that seti is accessed by the processor for cache look-up, and wayj
(0 ≤ j ≤ AS − 1, where AS is the cache associativity) causes a cache hit. In this case, the
data required by the processor will likely reside in wayj on a near-future access to seti if the
program has high locality of memory references. Accordingly, we have decided to employ
a way-prediction policy based on the MRU (Most Recently Used) algorithm. The way predictor
determines the predicted way for the set being accessed by the processor as follows:
• On prediction-hits, the way predictor does not do anything because the current way-
prediction is correct.
• On prediction-misses (but cache hits), the way predictor regards the way having the
data desired by the processor as the predicted way. The predicted way can be deter-
mined by tag comparison results.
• On cache-misses, the way predictor regards the way to be filled on cache replacement
as the predicted way. The predicted way can be determined by the results of tag
comparisons (hit or miss) and the status flags indicating which way is to be replaced.
Figure 3.3: Organization of Way-Predicting Four-Way Set-Associative Cache.
3.3.3 Organization
Figure 3.3 gives an organization of the way-predicting four-way set-associative cache (WP4SACache).
Compared to the conventional four-way set-associative cache (4SACache), only the following
additional components are required:
• Way-prediction table, which contains a two-bit flag (way-prediction flag) for each set.
The two-bit flag is used to speculatively choose one way from the corresponding set.
• Way predictor, which determines the value of each way-prediction flag according to the
MRU (most-recently used) algorithm explained in Section 3.3.2.
The way-predicting four-way set-associative cache (WP4SACache) works as follows:
1. The way-prediction flag associated with a given set is accessed, and is read from the
way-prediction table immediately after an effective memory address is generated. The
predicted way is determined by the way-prediction flag read out.
2. When a memory access takes place, the WP4SACache starts to decode the memory
address in the same manner as conventional set-associative caches.
3. Only the predicted way is activated, and the tag and the cache-line associated with the
predicted way are read simultaneously. The tag is then compared with the tag-portion
of the memory address. If the tag-comparison result is a match (prediction-hit), the
cache access completes successfully. Otherwise, steps 4 and 5 are performed.
4. The remaining three ways are activated, and all the tags and the cache-lines are read
out in parallel. Then, the three tags are compared with the tag-portion of the memory
address. If one of the three tags matches (prediction-miss), the WP4SACache provides the
referenced data to the processor. Otherwise (cache-miss), a cache replacement takes
place.
5. The way predictor modifies the way-prediction flag based on the results of the tag com-
parison or the status flags as explained in Section 3.3.2. The modified way-prediction
flag is written back to the way-prediction table. Note that this write-back operation
does not cause any cache-access-time penalty, because it can be performed during the data
transfer from the cache to the processor or during the cache-replacement process. (A behavioral sketch of this procedure is given below.)
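A behavioral sketch of the access procedure (steps 1 to 5) follows. It models only hits, prediction-misses, and misses; the replacement victim is chosen at random here purely for brevity, whereas the real cache uses the LRU status flags as explained in Section 3.3.2.

# Behavioral sketch of the WP4SACache lookup (steps 1-5 above).
# Victim selection is simplified to random choice; real hardware uses LRU flags.

import random

class WPSetAssociativeCache:
    def __init__(self, num_sets=128, ways=4):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.predict = [0] * num_sets            # way-prediction table (MRU way per set)

    def access(self, set_index, tag):
        pred = self.predict[set_index]           # step 1: read the way-prediction flag
        if self.tags[set_index][pred] == tag:    # step 3: probe only the predicted way
            return "prediction-hit", 1           # 1 cycle, 1 way activated
        for way in range(self.ways):             # step 4: probe the remaining ways
            if way != pred and self.tags[set_index][way] == tag:
                self.predict[set_index] = way    # step 5: MRU update on prediction-miss
                return "prediction-miss", 2      # 2 cycles, all ways activated
        victim = random.randrange(self.ways)     # cache miss: fill a victim way
        self.tags[set_index][victim] = tag
        self.predict[set_index] = victim         # step 5: predict the filled way
        return "cache-miss", 2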
3.4 Evaluations: Theoretical Analysis
To clarify the upper and lower bounds of the performance/energy improvements achieved
by the way-predicting set-associative cache architecture, we performed a qualitative analysis
using the energy and performance equations defined in the previous sections. Figure 3.4
shows the average energy consumption and the average cache-access time based on equations
from (3.3) to (3.8) for:
• a conventional four-way set-associative cache (4SACache),
• a phased four-way set-associative cache (P4SACache), and
• a way-predicting four-way set-associative cache (WP4SACache).
For every cache, the cache size, cache-line size, and associativity (the number of ways) are
16 K bytes, 32 bytes, and 4, respectively. Because the same replacement algorithm (usually
LRU) is used for every cache, the cache-hit rate (CHR) is common to all the caches.
Figure 3.4: Average Energy Consumption per Cache Access and Average Cache-Access Time. (a) energy and cache-access time as functions of the prediction-hit rate and the cache-miss rate; (b) average energy consumption and average cache-access time when CMR = 5%.
Figure 3.4(a) plots the average energy consumption per cache access, and the cache-access
time, as a function of the prediction-hit rate (PHR) and the cache-miss rate (CMR =
1 − CHR) for each cache (4SACache, P4SACache, and WP4SACache). The following as-
sumptions are made:
• The address size, set-index size, and byte-offset size are 32 bits, 7 bits (= log2 128), and
5 bits (= log2 32), respectively. Thus the tag size is 20 bits.
• ETag = 0.078EData because the ratio of the tag size to the cache-line size is 20 : 256 (in
terms of bits), or 0.078 : 1.
We can see the following two results from Figure 3.4(a). First, the way-predicting cache
performs best when PHR = 100% (i.e., CHR = 100%). In this case, the average energy
consumption is reduced by 75% without any cache-access-time degradation, compared to the
conventional set-associative cache. On the other hand, although the phased cache also achieves
about a 75% energy reduction, the cache-access time becomes twice as long due to the
sequential accesses on cache hits. Second, the way-predicting cache performs worst, even
if CHR = 100%, when PHR = 0%. Compared to the conventional set-associative cache,
the average cache-access time increases by 100% while the average energy consumption is
unchanged. Though the cache-access time is the same, the average energy consumption is
greater by 229%, compared to the phased cache.
Figure 3.4(b) is a cross section of Figure 3.4(a) at CMR = 5%, and plots the
average energy consumption and the cache-access time as functions of the prediction-hit rate
(PHR). From Figure 3.4(b), the following observations are made (they can be reproduced
directly from the equations, as sketched after the list):
• Phased cache vs. conventional cache: Compared to the conventional cache, the phased
cache can reduce the average energy consumption by 71%, but it increases the average
cache-access time by 95%.
• Way-predicting cache vs. conventional cache: If the way-predicting cache achieves a
95% prediction-hit rate, the average energy consumption can be reduced by 71% while
maintaining performance comparable to that of the conventional cache. When PHR is
80%, the way-predicting cache can achieve a 60% energy reduction with a 20% average
cache-access-time overhead.
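These percentages follow directly from Equations (3.3) through (3.8) with ETag = 0.078 EData and CMR = 5%; the short sketch below reproduces them.

# Reproducing the numbers in Section 3.4 from Equations (3.3)-(3.8),
# with E_Tag = 0.078 E_Data and CMR = 5% (CHR = 95%).

E_TAG, E_DATA, CHR = 0.078, 1.0, 0.95

e_conv, t_conv = 4 * E_TAG + 4 * E_DATA, 1.0                  # Eqs (3.3), (3.4)
e_phased, t_phased = 4 * E_TAG + CHR * E_DATA, 1.0 + CHR      # Eqs (3.5), (3.6)

def wp(phr):                                                  # Eqs (3.7), (3.8)
    e = (E_TAG + E_DATA) * (1 + 3 * (1 - phr))
    return e, 1.0 + (1 - phr)

print(1 - e_phased / e_conv, t_phased / t_conv - 1)   # ~0.71 reduction, +0.95 time
for phr in (0.95, 0.80):
    e, t = wp(phr)
    print(phr, 1 - e / e_conv, t / t_conv - 1)        # ~0.71/+0.05 and ~0.60/+0.20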
3.5 Evaluations: Experimental Analysis
3.5.1 Simulation Environment
To evaluate the effectiveness of the way-predicting set-associative cache architecture on real
workloads, we performed a quantitative analysis using benchmark programs. We carried out
experiments using a cache simulator. The cache simulator takes an address trace generated by
QPT [31] as its input, and simulates the LRU cache-replacement algorithm and the MRU way-
prediction algorithm. The simulator then reports the prediction-hit rate (PHR),
the prediction-miss rate (PMR), and the cache-miss rate (CMR) as its outputs. All benchmark
programs were compiled by GNU CC (-O2) for the UltraSPARC. We used the programs
listed in Table 3.1 from the SPEC95 benchmark suite [82].
Table 3.1: Benchmark Programs.
SPECint95 (input: training): 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
SPECfp95 (input: test): 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d
3.5.2 Way-Prediction Hit Rates
Table 3.2 shows the benchmark results. As can be seen from the table, I-caches of all the
programs achieve quite high prediction-hit rates (PHR) of over 90%, and the average PHR
is about 96%. These results can be understood by considering the behavior of instruction references:
program execution basically proceeds through successive instructions, and a cache line
contains several instructions to exploit the spatial locality of references, so the MRU
way-prediction algorithm works very well. For the D-caches, more than half of the
programs also achieve high prediction-hit rates (PHR) of over 90%, and the average PHR
is about 86%, which is lower than that of the I-caches. Data references also have spatial
locality, but it is not as high as that of instruction references. Since the data-reference behavior
depends on program characteristics, the accuracy of the MRU-based way prediction is highly
application-dependent.
Table 3.2: Benchmark Results: Prediction-Hit Rate (PHR), Prediction-Miss Rate (PMR),
and Cache-Miss Rate (CMR)
Benchmarks I-Cache D-Cache
PHR(%) PMR(%) CMR(%) PHR(%) PMR(%) CMR(%)
099.go 94.55 4.04 1.41 81.31 17.45 1.24
124.m88ksim 95.76 4.05 0.19 95.47 3.63 0.91
126.gcc 92.32 5.09 2.59 87.40 9.59 3.01
129.compress 99.98 0.02 0.00 91.64 3.63 4.73
130.li 97.28 2.71 0.00 92.82 3.91 3.27
132.ijpeg 99.74 0.25 0.01 92.60 6.38 1.02
134.perl 94.93 4.65 0.42 92.64 5.78 1.58
147.vortex 91.65 7.11 1.25 89.38 9.16 1.46
101.tomcatv 91.61 7.30 1.09 87.96 9.96 2.08
102.swim 97.96 2.04 0.00 50.27 31.71 18.03
103.su2cor 96.48 3.23 0.28 85.22 8.14 6.64
104.hydro2d 98.28 1.43 0.29 89.41 3.55 7.04
Average 95.87 3.49 0.62 86.34 9.41 4.25
Among all the benchmark results, the PHR of D-cache for 102.swim is the lowest at 50%,
and the cache-miss rate (CMR) is also the highest at 18%. If we consider that the CMR
of a direct-mapped cache with the same cache size and cache-line size for 102.swim is about
13.8%, it seems that the LRU cache-replacement algorithm and the MRU way-prediction
algorithm do not match the memory-reference pattern of 102.swim.
3.5.3 Cache-Access Time and Energy
Based on the models of energy consumption and cache-access time expressed by Equations
(3.3) to (3.8), and on the benchmark results reported in Section 3.5.2, Figures 3.5 and 3.6
show the average energy consumption per cache access and the average cache-access time of
the I-cache and the D-cache, respectively. All results for each program are
normalized to the conventional four-way set-associative cache (4SACache).
Figure 3.5: Average Energy Consumption and Average Cache-Access Time for I-Cache.
For many programs, the way-predicting I-cache produces better results than the phased I-
cache. The way-predicting cache activates only one tag-subarray on prediction-hits, while the
phased cache activates all tag-subarrays regardless of cache-hits or cache-misses. Accordingly,
the way-predicting cache can save more energy than the phased cache when
the prediction-hit rate is high.
Figure 3.6: Average Energy Consumption and Average Cache-Access Time for D-Cache.
Compared to the conventional cache, for most of the programs, the phased cache reduces
the average energy consumption by about 70%, but it increases the average cache-access
time by about 100%. On the other hand, the way-predicting cache achieves almost the same
energy reduction as the phased cache without a significant performance drawback.
Figure 3.7: Total I-Cache Energy Dissipated for the Execution of Programs.
Figure 3.7 and Figure 3.8 show the total energy consumed in the caches during the execution
of each program, i.e., the total number of memory references × the average cache-access energy
(ECache), where the energy dissipated for a data-subarray access (EData) is 2.5 nanojoules.
EData is calculated based on Kamble's model [48], and the 0.8-micron CMOS cache design
described in [95] is assumed. In order to obtain the value of EData using Kamble's
model, the following parameters are assumed:
• Total number of rows, or Nrow, is 128 (16 KB / (32 B × 4 ways) = 128 sets).
• Tag bits, or T , is 0 because EData does not include the energy dissipated in tag-
subarrays.
• Associativity, or M , is 1 because EData is the energy dissipated in one data-subarray
(not in all data-subarrays).
A more detailed explanation of the calculation is presented in Section 5.6.3.
Figure 3.8: Total D-Cache Energy Dissipated for the Execution of Programs.
3.5.4 Energy-Delay Product
To evaluate both energy and performance at the same time, we measured the energy-delay
(ED) product for each cache. Figure 3.9 shows the mean ED product (= average energy
consumption per cache access × average cache-access time) over all of the benchmark programs.
Again, these values are normalized to those of the conventional four-way set-associative cache
(4SACache). The I-Cache&D-Cache in the figure shows the average ED product per instruc-
tion execution. Figure 3.9 indicates that the way-predicting cache improves the mean ED
product by about 70% and 60% for the I-cache and the D-cache, respectively, compared with
the conventional set-associative cache.
Here, it is assumed from simulation results that the average number of instruction-memory
accesses and that of data-memory accesses per instruction execution are 1 and 0.278, respec-
tively. When we consider the instruction cache and the data cache together, the way-predicting
cache (WP4SACache) produces about a 70% ED-product reduction, while the phased cache
(P4SACache) reduces it by only about 40%, compared with the conventional cache (4SACache).
Figure 3.9: ED Product.
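The combined I-cache & D-cache figure is obtained by weighting the per-cache ED products with the access frequencies quoted above (1 instruction access and 0.278 data accesses per executed instruction). The sketch below shows that calculation; the example per-access ED values passed in are placeholders, not measured results.

# Sketch: combined ED product per executed instruction, weighting the
# I-cache and D-cache by their access frequencies.  The example per-access
# ED values are placeholders, not measured results.

I_ACCESSES_PER_INSN = 1.0
D_ACCESSES_PER_INSN = 0.278

def combined_ed_per_instruction(ed_icache, ed_dcache):
    """ed_icache / ed_dcache: energy x delay per access of each cache."""
    return (I_ACCESSES_PER_INSN * ed_icache +
            D_ACCESSES_PER_INSN * ed_dcache)

# Example: normalize a candidate cache against the conventional one.
ratio = (combined_ed_per_instruction(0.3, 0.4) /
         combined_ed_per_instruction(1.0, 1.0))
print(ratio)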
3.5.5 Performance/Energy Overhead
Thus far, we have ignored the energy and performance overhead caused by the way prediction. In reality,
this overhead needs to be taken into account, as follows.
• Energy-consumption overhead: Activating the way predictor and accessing the
way-prediction table dissipate extra energy.
• Cache-access-time overhead: The delay for reading the way-prediction flag will directly
increase the cache-access time, because the read must be completed before the normal
cache access starts.
First, we consider the energy overhead. The way-predictor can be implemented with a small
combinational logic, because the MRU algorithm is very simple.
Figure 3.10: The Effect of Cache Access-Time Overhead on ED Products.
Moreover, the way-predictor
predicts and writes back the modified way-prediction flag into the way-prediction table when
incorrect way-predictions take place. Since the prediction-hit rates of many programs are
very high, as shown in Section 3.5.2, this energy overhead can be ignored. For every cache
access, the two-bit way-prediction flag is read from the way-prediction table. This process
may also increase the energy overhead. When the way-prediction table is implemented
with flip-flops, the energy dissipation depends on the switching activity of the way-prediction
flags being read. In our simulations, we observed that the average number of switching bits in
the two-bit way-prediction flag per access is 0.4 (I-cache) and 0.8 (D-cache). This energy
consumption overhead is insignificant.
Next, we discuss the cache access-time overhead. Here, we re-define the average cache
access-time of the way-predicting set-associative cache to include the way-prediction-table
access overhead as follows:
TWP4SACache = 1Cycle + (1 − PHR) × 1Cycle + Toverhead × 1Cycle (3.9)
Here, Toverhead is the average way-prediction-table access-time. Figure 3.10 shows the average
ED product of twelve benchmark programs for the conventional four-way set-associative cache
(4SACache), the phased four-way set-associative cache (P4SACache), and the way-predicting
four-way set-associative cache (WP4SACache) with access-time overhead Toverhead. For the
instruction cache, the way-predicting cache produces about a 42% ED-product improvement
over the conventional cache, which is better than the phased cache, even if the cache has
a 100% access-time overhead (Toverhead = 1.0 cycle). For the data cache, the effectiveness of the
way-predicting cache with a 50% access-time overhead (Toverhead = 0.5 cycle) is almost the same as
that of the phased cache. In addition, when the access-time overhead is 100% (Toverhead = 1.0
cycle), the way-predicting cache can still achieve about a 25% ED-product reduction over the
conventional cache.
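Equation (3.9) and the resulting normalized ED product can be evaluated as in the sketch below. The prediction-hit rate and the relative tag energy (ETag = 0.078 EData) are assumptions for illustration; the actual results in Figure 3.10 are averaged over the twelve benchmark programs.

# Sketch of Equation (3.9): access time of the way-predicting cache including
# the way-prediction-table overhead, and the ED product normalized to the
# conventional 4-way cache.  PHR and the tag energy ratio are assumptions.

E_TAG, E_DATA = 0.078, 1.0

def normalized_ed_wp(phr, t_overhead):
    e_wp = (E_TAG + E_DATA) * (1 + 3 * (1 - phr))      # Eq (3.7)
    t_wp = 1.0 + (1 - phr) + t_overhead                # Eq (3.9)
    e_conv, t_conv = 4 * (E_TAG + E_DATA), 1.0         # Eqs (3.3), (3.4)
    return (e_wp * t_wp) / (e_conv * t_conv)

for t_ov in (0.0, 0.5, 1.0):
    print(t_ov, normalized_ed_wp(phr=0.95, t_overhead=t_ov))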
Actually, there are some methods to solve the access-time overhead problem. For example,
the delay of the way-prediction-table access can be reduced by an implementation choice, using
flip-flops rather than an SRAM array. In addition, the access-time overhead can be hidden by
calculating the cache-index address at an earlier pipeline stage [14].
3.5.6 Effects of Other Parameters
The effectiveness of the way-predicting set-associative cache depends on the prediction-hit
rates. In this section, we evaluate the effects of the hardware constraints: cache size, cache-line
size, associativity, and way-prediction-table size. We use three benchmarks for I-caches
(129.compress, 147.vortex, and 101.tomcatv) and four benchmarks for D-caches (099.go, 124.m88ksim,
102.swim, and 104.hydro2d). The prediction-hit rates of 129.compress on the I-cache and 124.m88ksim
on the D-cache are the best of all programs; the other programs used in this section produce
lower prediction-hit rates, as shown in Section 3.5.2. For all figures in this section, solid lines
and broken lines show the calculation results for the way-predicting four-way set-associative
cache (WP4SACache) and the phased four-way set-associative cache (P4SACache), re-
spectively. All results for each parameter are normalized to the conventional four-way set-
associative cache (4SACache); the figures therefore show how much the caches improve
or degrade energy and performance relative to the conventional cache. Unless stated otherwise,
we assume that the cache size, the cache-line size, and the associativity are 16 KB, 16 B,
and 4, respectively.
3.5.6.1 Cache Size
We measured the average energy consumption per cache access (ECache), and the average
cache-access time (TCache), with various cache sizes. Figure 3.11 shows simulation results.
Figure 3.11: Effects of Cache Size. (Normalized ECache and TCache versus cache size, 2 KB to 128 KB, with a 16-byte line size; panel (A) instruction caches, panel (B) data caches; curves for P4SACache and WP4SACache on the selected benchmarks.)
As can be seen from the figure, increasing the cache size reduces the energy consumption of
the way-predicting set-associative cache, while there is no significant difference for the phased
cache. The cache-miss rate affects the accuracy of the way prediction, because a cache miss
means that the desired data does not reside in the cache, so it is impossible to make a correct
prediction on a cache miss. In other words, increasing the cache size improves the cache-hit
rate and, as a result, the prediction-hit rate is also improved. The phased cache sacrifices the
cache-hit time, while the way-predicting cache sacrifices the prediction-miss time.
Table 3.3: Energy and Access Time with a 128-Byte Cache-Line Size.
Benchmarks I-Cache D-Cache
P4SACache WP4SACache P4SACache WP4SACache
TCache ECache TCache ECache TCache ECache TCache ECache
099.go 199.5% 26.3% 102.1% 26.6% 197.0% 25.7% 130.1% 47.6%
124.m88ksim 199.8% 26.4% 102.7% 27.0% 199.7% 26.4% 107.5% 30.6%
126.gcc 198.1% 26.0% 104.3% 28.2% 197.5% 25.8% 115.6% 36.7%
129.compress 200.0% 26.4% 100.0% 25.1% 196.6% 25.6% 109.9% 32.4%
130.li 200.0% 26.4% 101.4% 26.1% 198.2% 26.0% 108.2% 31.1%
132.ijpeg 200.0% 26.4% 100.6% 25.4% 199.6% 26.3% 112.1% 34.1%
134.perl 199.5% 26.3% 103.5% 27.7% 199.0% 26.2% 111.9% 33.9%
147.vortex 199.3% 26.3% 104.4% 28.3% 198.1% 26.0% 117.3% 37.9%
101.tomcatv 198.0% 25.9% 103.9% 28.0% 199.4% 26.2% 119.4% 39.5%
102.swim 200.0% 26.4% 101.0% 25.8% 168.6% 18.7% 152.4% 64.3%
103.su2cor 199.6% 26.3% 101.7% 26.3% 197.6% 25.8% 134.3% 50.7%
104.hydro2d 199.6% 26.4% 100.8% 25.6% 198.2% 26.0% 113.2% 34.9%
Therefore, the performance gap between the way-predicting cache and the phased
cache becomes larger and larger as the cache size increases.
3.5.6.2 Cache-Line Size
As explained in Chapter 1, we can exploit the high on-chip memory bandwidth of the
merged DRAM/logic LSI, which will be one of the core devices in future system LSIs. The
high bandwidth makes it possible to increase the cache-line size without increasing the cache-
replacement penalty. Therefore, it is important to evaluate the way-predicting set-associative
cache architecture with larger cache-line sizes.
To find the effects of the cache-line size on the energy and performance of the way-
predicting cache, we measured the prediction-hit rates of caches with various cache-line
sizes, and then calculated the average energy consumption (ECache) and the average cache-
access time (TCache). Figure 3.12 shows the calculation results.
Figure 3.12: Effects of Cache-Line Size. (Normalized ECache and TCache versus cache-line size, 16 to 256 bytes; panel (A) instruction caches, panel (B) data caches; curves for P4SACache and WP4SACache on the selected benchmarks.)
The incremental instruction accesses within the large cache lines improve the MRU-based
prediction-hit rate. Therefore, the energy gap between the way-predicting cache and the
phased cache becomes smaller and smaller as the cache-line size increases. On the other hand,
the best cache-line size for the D-cache is highly application dependent [36]. Accordingly, the
energy reduction achieved by the way-predicting cache also depends on the characteristics
of the programs. For instance, the energy consumption for 099.go increases with the
cache-line size. For 104.hydro2d, the energy is reduced up to a 64-byte cache-line size, and then
it starts to increase. For almost all programs, it is observed that the energy consumption
increases for very large cache-line sizes. This comes from the following two
reasons. First, much larger cache-line sizes worsen the cache-hit rates when the program
has poor spatial locality; as explained in Section 3.5.6.1, lower cache-hit rates bring lower
prediction-hit rates. Second, increasing the cache-line size reduces the total number of sets
in the cache. A number of memory references then share a set, so the way accesses in the
set can be distributed, and these memory references also have to share a way-prediction flag.
As a result, the accuracy of the way prediction is degraded. However, there is still a large
performance gap between the way-predicting cache and the phased cache.
Table 3.3 shows the average energy consumption per cache access and the average cache-
access time for all benchmark programs when the caches have a 128-byte cache-line size. The
way-predicting cache reduces a large amount of energy and also maintains fast cache accesses,
as with the small 32-byte cache-line size reported in Section 3.5.3. The advantage of
way-predicting I-caches is very clear. For all programs except 099.go, 102.swim, and
103.su2cor, the difference in energy-reduction rates between the way-predicting cache and
the phased cache is less than 15%, while the performance difference is much larger. For
caches with a large cache-line size, we can summarize the above results as follows.
If we do not care about the performance degradation, the phased cache should be employed;
otherwise, it is better to employ the way-predicting cache. In particular, the way-predicting
cache produces significant performance/energy improvements when the cache-line size is equal
to or smaller than 128 bytes.
3.5.6.3 Cache Associativity
We measured the average energy consumption per cache access (ECache) and the average
cache-access time (TCache) with various cache associativities. Figure 3.13 shows the simulation
results.
For both I-caches and D-caches, the amount of energy reduction decreases as the cache
associativity increases, because the energy consumed for activating the tag subarrays becomes
relatively larger. For example, the tag width is 25 bits, the index width is 3 bits (2^3 sets),
and the offset width is 4 bits (2^4 bytes) when the associativity is 128, the cache size is 16 KB,
and the cache-line size is 16 bytes.
Figure 3.13: Effects of Associativity. (Normalized ECache and TCache versus associativity, 2 to 128 ways, with a 16-byte line size; panel (A) instruction caches, panel (B) data caches.)
Accordingly, 3,200 bits (25 bits × 128 ways) are activated
for the tag comparison, while only 128 bits (one 16-byte cache line) are activated for the cache-line
access.
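The bit counts in this example follow directly from the address breakdown; the small helper below (assuming a 32-bit address) reproduces them for any cache configuration.

import math

def activated_bits(cache_bytes, line_bytes, ways, addr_bits=32):
    """Bits activated per access: tags of all ways versus one cache line of data."""
    sets = cache_bytes // (line_bytes * ways)
    tag_bits = addr_bits - int(math.log2(sets)) - int(math.log2(line_bytes))
    return tag_bits * ways, line_bytes * 8

# 16 KB cache, 16-byte lines, 128 ways -> (3200, 128), as in the example above.
print(activated_bits(16 * 1024, 16, 128))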
On the other hand, the way-predicting cache produces significant energy reductions up
to an associativity of 16. Increasing the cache associativity helps to reduce the energy consumption
due to the subbanking effect. However, a higher-associativity cache has more candidates for the
predicted way, which might worsen the accuracy of the way prediction.
Figure 3.14: Effects of Way-Prediction Table Size. (Normalized ECache and TCache versus the number of sets sharing a way-prediction flag, from 1 (each) to 256 (all); panel (A) instruction caches, panel (B) data caches.)
When the loss in way-prediction accuracy outweighs the subbanking effect, the energy efficiency of
the way-predicting cache is reduced. Accordingly, when a highly associative cache is employed,
the phased cache should be chosen; otherwise, the way-predicting cache achieves significant
energy reductions and should be employed.
3.5.6.4 Way-Prediction Table Size
We measured the average energy consumption per cache access (ECache) and the average
cache-access time (TCache) with various way-prediction-table sizes. Figure 3.14 shows the simulation
results. The x-axis is the number of sets sharing a way-prediction flag. The rightmost
point, denoted 256 (all), is the result when the cache has only one way-prediction flag, i.e.,
the total number of way-prediction flags is one. Decreasing the total number of way-prediction
flags alleviates the performance/energy overhead discussed in Section 3.5.5.
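One plausible way to realize this sharing is to drop low-order bits of the set index when the way-prediction table is accessed, so that a group of consecutive sets maps to the same flag. The indexing below is an illustrative assumption, not a detail taken from the evaluated design.

def prediction_table_entry(set_index, sets_per_flag):
    """Map a cache-set index to its (possibly shared) way-prediction-table entry."""
    return set_index // sets_per_flag

# 256 sets: sharing degrees 1, 4, 16, 64, 256 leave 256, 64, 16, 4, 1 flags in total.
for sharing in (1, 4, 16, 64, 256):
    print(sharing, 256 // sharing, prediction_table_entry(200, sharing))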
For I-caches, the energy-reduction rate decreases when the number of sets sharing a way-
prediction flag is changed from 1 to 4; after that, the cache maintains its energy
efficiency. For D-caches, on the other hand, the energy efficiency degrades in proportion to the increase
in the number of sharing sets. The same behavior can be seen for the cache-access time. From
these results, we conclude that sharing the way-prediction flag is a good way to alleviate the
performance/energy overhead of accessing the way-prediction table for I-caches, but not for
D-caches.
3.6 Related Work
There have been several proposals for reducing the power consumption of on-chip caches.
MDM (Multiple-Divided Module) cache [56] attempts to reduce the power consumption by
means of partitioning the cache into several small sub-caches. MDM cache requires a great
amount of hardware modification. Block buffering [48], [85], filter cache [54], and L-cache
[25] achieve low power consumption by adding a very small L0-cache between the processor
and the L1-cache. The advantage of L0-cache approaches decreases when memory reference
locality is low and cache replacement happens frequently between the L0 and L1 caches.
Hasegawa et al. [26] proposed a low-power set-associative cache architecture, which we have
compared with our way-predicting cache. Their cache is referred to as the phased cache in this
chapter, and it suffers from a longer cache-hit time.
On the other hand, the way-predicting set-associative cache architecture can be imple-
mented with small hardware overhead, because the cache structure and memory hierarchy
of the conventional memory system are maintained. Assuming the same associativity, the cache-
miss rate of the way-predicting set-associative cache is the same as that of the conventional
set-associative cache. The way-predicting set-associative cache can offer significant energy
reduction without large performance degradation when the way-prediction-hit rate is high.
3.7 Conclusions
In this chapter, the way-predicting set-associative cache for low energy consumption has been
proposed. The way-predicting cache speculatively selects one way from the set designated by
a memory address before beginning a normal cache access. By accessing only the predicted
way, instead of all the ways, the energy consumption can be reduced.
For the way-predicting cache to perform well, the accuracy of way prediction is important.
The experimental results show that the accuracy of the MRU-based way prediction is higher
than 90% for most of the benchmark programs. It is also observed that the way-predicting
cache improves the ED product by 60–70% over the conventional set-associative cache.
To implement the MRU-based way prediction, the way-prediction table has to be added to
the conventional cache organization. In particular, the performance penalty caused by access-
ing the way-prediction table cannot be ignored. We have evaluated the performance/energy
efficiency of the way-predicting set-associative cache including this performance penalty, and we
have also discussed some approaches to solving this performance problem.
In addition, we have evaluated the effects of other parameters on the improvement achieved
by the way-predicting set-associative cache: cache size, cache-line size, associativity, and
way-prediction-table size. It is observed that increasing the cache size produces better
results. The trend is to increase the cache size so that more memory accesses are confined on-chip for
high performance and low power consumption. Therefore, we believe that the way-
predicting set-associative cache architecture is usable for future LSIs. Moreover, we have
shown that decreasing the size of the way-prediction table is a promising way to alleviate the
performance/energy overhead of accessing the way-prediction table.
There are many alternatives for the way prediction, such as hash-rehash caches and column-
associative caches [52], [1], [14], which have been proposed for performance reasons.
Kim et al. [53] reported that the MRU approach consumes the least energy. In this chapter,
we have assumed that the cache-access time of the way-predicting cache on a prediction hit is the same as
the cache-hit time of a conventional set-associative cache. In fact, the cache-access time
on a prediction hit can be faster, as in the hash-rehash and column-associative caches. No
matter which way-prediction algorithm is employed, this kind of behavioral approach is very
promising for future high-performance/low-power system LSIs.
Chapter 4
History-Based Tag-Comparison
Cache Architecture
4.1 Introduction
On-chip caches have been playing an important role in achieving high-performance processors.
In particular, much higher performance is required for instruction caches because one or more
instructions have to be issued on every clock cycle. From the energy point of view, this also means that
the instruction cache consumes a lot of energy. Therefore, it is strongly required to reduce
the energy consumption of instruction-cache accesses.
In cache accesses, the tag indexed by the memory address is read from tag memory. Then
the tag is compared with the tag-portion in the memory address to determine whether the
entry in the cache corresponds to the requested address. If the tag is equal to the tag-portion,
then the access hits in the cache. Otherwise, a cache miss occurs. Therefore, the energy for
the tag comparison and that for the data access (data read or write) are consumed on every
cache access.
In this chapter, we focus on the energy consumed for the tag comparison, and propose
a novel architecture for low-power direct-mapped instruction caches, called “history-based
tag-comparison cache”. The cache predicts the residence of instructions to be fetched before
the tag comparison is performed. If the prediction is correct, the tag comparison can be
omitted. In this case, the cache does not need to waste the energy for the tag comparison.
Our method guarantees completely correct predictions.
The rest of this chapter is organized as follows. Section 2 shows the effect of tag compar-
ison on the total cache energy. In addition, another technique to omit the tag comparison
proposed in [70] is explained as a comparative method. Section 3 presents the concept and
mechanism of our history-based tag-comparison cache. Section 4 reports evaluation results
for the energy efficiency of the history-based tag-comparison cache. Moreover, the effects of
hardware constraints are analyzed. Section 5 shows related work, and Section 6 concludes this
chapter.
4.2 Breakdown of Cache-Access Energy
In direct-mapped instruction caches, tag comparison and data read are performed in parallel.
Thus, the total energy consumed for a cache access has two factors: the energy for the tag
comparison and that for the data read. Here, we assume that the logic portion, comparators
for the tag comparison and multiplexors for the data read, does not dissipate any energy.
Therefore, we need to consider the energy for tag-memory accesses and data-memory accesses.
In conventional caches, the tag memory and the data memory have the same height but
different widths, because the memory width depends on the tag size and the cache-line size,
and the tag size is usually much smaller than the cache-line size. For example,
for a 16 KB direct-mapped cache with 32-byte lines, the cache-line size is 256 bits
(32 × 8), while the tag size is 18 bits (32-bit address − 9 index bits − 5 offset bits). Thus, the total
cache energy is dominated by data-memory accesses.
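The 18-bit versus 256-bit breakdown above can be checked with a few lines of arithmetic (the 32-bit address width is the assumption used throughout this example):

import math

cache_bytes, line_bytes, addr_bits = 16 * 1024, 32, 32
index_bits = int(math.log2(cache_bytes // line_bytes))    # 9 (512 sets)
offset_bits = int(math.log2(line_bytes))                  # 5
tag_bits = addr_bits - index_bits - offset_bits           # 18
line_bits = line_bytes * 8                                 # 256
print(tag_bits, line_bits, round(tag_bits / (tag_bits + line_bits), 3))  # tags are ~7% of the bits read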
Cache subbanking is one approach to reducing the data-access energy. The data-memory
array is partitioned into several subbanks, and only the subbank that includes the desired
data is activated [85]. Since the bank address can be derived from the memory address, the cache
has no access-time overhead. Figure 4.1 depicts the breakdown of energy consumption for
a 16 KB direct-mapped cache with various numbers of subbanks. We have calculated the
energy consumption based on Kamble's model [48]; the energy for the I/O drivers and the address
decoder in the model is not included. The horizontal axis in the figure shows the total
number of subbanks. Ebit_data and Ebit_tag are the energies consumed on the bit-lines in the data
memory and the tag memory during a cache access, respectively. All results are
normalized to the base configuration denoted as "8(1)", in which there is no subbanking. It
is clear from the figure that increasing the number of subbanks greatly reduces the energy consumed in
the data memory.
Figure 4.1: The Energy Effect of Tag Comparison. (Breakdown of normalized energy consumption into Ebit_tag, Ebit_data, and others, versus the number of subbanks, for 32-bit and 64-bit words.)
However, since the tag-access energy remains unchanged, the effect of the tag
comparison on the total energy consumption becomes significant. When the word size is
32 bits, Ebit_tag accounts for about 30% of the energy. If the word size is 64 bits, Ebit_tag occupies almost
half of the total energy.
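The growing weight of the tag access can be seen from the bit counts alone; the sketch below is a crude proxy for Kamble's model (it simply counts activated bit-lines and selects the subbank from low-order word-address bits), so the exact percentages differ from Figure 4.1.

def subbank_for(address, word_bytes=4, num_subbanks=8):
    """The data subbank holding a word, selected directly from the address bits."""
    return (address // word_bytes) % num_subbanks

def tag_bitline_fraction(line_bytes, tag_bits, num_subbanks):
    """Fraction of activated bit-lines that belong to the tag memory."""
    data_bits = line_bytes * 8 // num_subbanks   # only one subbank of the line is read
    return tag_bits / (tag_bits + data_bits)

for subbanks in (1, 2, 4, 8):
    print(subbanks, round(tag_bitline_fraction(32, 18, subbanks), 2))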
4.3 Interline Tag-Comparison Cache
As explained in Section 4.2, it is important to reduce the tag-access energy in order to obtain
further energy reductions in low-power caches. A technique to reduce the frequency of tag
comparisons has been proposed [70]; it is referred to as the interline tag-comparison cache
in this chapter.
When two instructions i and j are executed successively, we can consider the following
cases:
• Intraline sequential flow: i and j reside in the same cache line, and their addresses are
sequential.
• Intraline non-sequential flow: i and j reside in the same cache line, and their addresses
are not sequential. Therefore, i is a taken-branch or jump instruction.
• Interline sequential flow: i and j reside in different cache lines, and their addresses are
sequential.
• Interline non-sequential flow: i and j reside in different cache lines, and their ad-
dresses are not sequential. Therefore, i is a taken-branch or jump instruction.
All instructions in a cache line are filled into the cache together when one of the instructions
causes a cache miss. Therefore, the residence of instruction j is guaranteed when i and j
are in an intraline sequential flow or an intraline non-sequential flow. In this case, the tag
comparison for instruction j can be omitted. Namely, the cache needs to perform the tag
comparison only when two successive instructions are in an interline sequential flow or an
interline non-sequential flow. The interline flows can be detected by comparing the current
PC with the previous one, as sketched below. Another approach to finding the interline flows is to analyze the
compiled program code; in this case, unused space in the operation code can be used as a
compiler hint.
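The comparison reduces to a check on line addresses; the line size and the fetch trace in this sketch are illustrative assumptions.

LINE_BYTES = 32

def needs_tag_comparison(prev_pc, cur_pc):
    """Interline-flow test: the tag is compared only when the fetch leaves
    the cache line of the previous fetch (sequentially or via a branch)."""
    return (prev_pc // LINE_BYTES) != (cur_pc // LINE_BYTES)

# With 4-byte instructions, only the fetch at 0x1020 crosses a line boundary here.
pcs = [0x1000, 0x1004, 0x1008, 0x1010, 0x1020]
print([needs_tag_comparison(a, b) for a, b in zip(pcs, pcs[1:])])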
4.4 History-Based Tag-Comparison Cache
4.4.1 Concept
The content of a cache memory is updated when cache misses take place. Instruction caches
achieve very high cache-hit rates due to the rich locality of memory references. This means
that the content of instruction caches is rarely updated.
There are many loops in programs, so some instruction blocks are executed many times.
In this chapter, we call such a run-time instruction block a "dynamic basic-block".
A dynamic basic-block consists of one or more successive basic blocks. The top of the
dynamic basic-block is addressed by a branch-target address, and the end of it is addressed
by a taken-branch or jump address. Therefore, not-taken conditional branches might be
included in the dynamic basic-block.
Consider the case where a dynamic basic-block is executed many times during program ex-
ecution. On the first execution of the dynamic basic-block, the tag comparison has to be performed
for all instructions. However, on the second execution, if no cache miss has
occurred since the first execution, the dynamic basic-block is guaranteed to reside in the
cache. Hence, we can determine that the indexed cache entry corresponds to the requested
address without performing the tag comparison.
When a dynamic basic-block is executed, the history-based tag-comparison cache attempts
to avoid unnecessary tag comparisons by detecting the following conditions:
1. the dynamic basic-block has been executed, and
2. no cache miss has occurred since the previous execution of the dynamic basic-block.
The history-based tag-comparison cache omits the tag comparison when the above conditions
are satisfied, regardless of whether the instruction flow is intraline or interline.
4.4.2 Organization
To detect the conditions for omitting the tag comparison, as explained in Section 4.4.1, ex-
ecution footprints are recorded. The footprint indicates whether the corresponding dynamic
basic-block resides in the cache. If a dynamic basic-block left its footprint at the previ-
ous execution, then the tag comparisons for the current execution are omitted. All footprints
are erased when a cache miss takes place, because a dynamic basic-block (or a part of one)
might have been evicted from the cache.
Recent high-performance processors employ a branch-prediction unit to solve control-
hazard problems. A conventional branch-prediction unit consists of a BPT (branch predictor
table) and a BTB (branch target buffer). Each entry of the BTB has a branch-address field
and a target-address field. If a matching entry to the current program counter (PC) is found
on BTB lookup, the corresponding target address is stored to the PC for the next instruction
fetch. When an unregistered taken branch is performed, a new entry will be added.
The execution footprints for the history-based tag-comparison cache are implemented in
the BTB with additional information. Figure 4.2 depicts an organization of the extended
Figure 4.2: Organization. (The extended BTB with branch-address and target-address fields plus the RCT and RCN flags; the TCO flag drives the tag-comparison enable of the direct-mapped instruction cache.)
BTB. The following flags are added to each BTB entry:
• RCT (Residing in Cache on Taken) 1-bit flag per entry : This is an execution footprint
for a dynamic basic-block, the top of which is addressed by the corresponding target
address. This flag is set to 1 when the corresponding branch is taken, and is reset to 0
whenever a cache miss occurs.
• RCN (Residing in Cache on Not-taken) 1-bit flag per entry : This is an execution
footprint for the fall-through instructions. This flag is set to 1 when the corresponding
branch is not taken, and is reset to 0 whenever a cache miss occurs.
In addition, a flag to enable the tag comparison in the cache is required:
• TCO (Tag-Comparison Omit) 1-bit flag : This flag indicates whether the tag compar-
ison can be omitted. If this flag is 1, the tag comparison is not performed.
The TCO does not appear on the cache critical-paths. Hence, the history-based tag-comparison
cache does not have any cache-access-time overhead.
4.4.3 Operation
The execution footprints (i.e., the RCT and RCN flags) are left according to run-time program-
execution behavior, and are erased whenever a cache miss occurs. In addition, all footprints
have to be erased when BTB replacements take place, because the scope of the RCT and
RCN flags is defined by the target address in the corresponding BTB entry and the branch
address in another BTB entry; this scope information might be evicted from the BTB by
the replacements. In that case, the cache loses the ability to detect how long the tag
comparison can be omitted. Figure 4.3 (A) presents the operation flow on BTB lookups, and
the extended BTB behaves as follows:
1. When a cache miss occurs, all footprints in the BTB are erased (all RCT and RCN
flags are reset to 0). Note that no cache miss can occur while the TCO flag is 1.
2. If a matching entry is found in the BTB, the next step is performed. Otherwise,
the operation returns to the initial state, and the cache starts to fetch the next
instruction.
3. The return address stack (RAS) improves the accuracy of branch prediction [46]. How-
ever, we have not extended the RAS to record execution footprints. Therefore,
the TCO has to be reset to 0 whenever the target address is provided by the RAS.
4. There is a matching entry in the BTB, so the TCO flag and the footprint are
modified. If the branch-prediction result is taken, the RCT flag in the matching entry
is stored to the TCO flag, and then the RCT flag is set to 1 as the execution footprint.
Otherwise, the RCN flag is treated in the same manner as the RCT flag.
When a wrong branch-prediction is detected, the PC has to be recovered to guarantee
the correct execution. This recovery might cause BTB updates. Figure 4.3 (B) shows the
extended BTB operation on the wrong-prediction recovery, and the extended BTB works as
follows:
5. There are two cases for the BTB update: one is the registration of a new entry, and
the other is the modification of the target-address field in an existing entry. If the registration
evicts another BTB entry, all footprints are erased. Then the TCO flag is reset to 0
Figure 4.3: Operation State Diagram. ((A) operation on BTB lookups; (B) operation on wrong-branch recovery.)
because the dynamic basic-block addressed by the new target address may not reside
in the cache. In addition, the RCT flag is set to 1 as the execution footprint.
6. Only the branch direction is wrong. Thus, the TCO is stored back into the RCT or RCN
flag, the correct footprint is then stored to the TCO, and the execution footprint is
recorded.
Figure 4.4 (A) shows an example of a program execution flow that iterates seven times.
The size of the dynamic basic-block is varied by the three conditional branches at addresses
B, C, and D. The solid and broken lines in the figure represent the control flow of the loop
execution. Figure 4.4 (B) shows the state of the extended BTB. A pair of a number
and a capital letter in the figure denotes when the BTB is updated; for instance, 1-C
means that the BTB is updated on the branch-C execution in iteration 1. In the first
iteration, the tag comparisons are performed. The history-based tag-comparison cache then
works as follows:
1-C : A new entry for branch-C is registered in the BTB. Since the TCO is still 0,
the tag comparison continues to be performed in iteration 2. The RCT flag is set to 1 as the
execution footprint.
Figure 4.4: Example of Operation. ((A) the execution flow of the loop with branches B, C, and D over seven iterations; (B) the BTB conditions, RCT, RCN, and TCO, at each update point.)
2-C : The footprint (RCT flag) recorded at 1-C is stored to the TCO. Thus, no tag
comparison is performed in iteration 3.
3-C : The footprint (RCT flag) recorded at 2-C is stored to the TCO. Therefore, no tag
comparison is performed in iteration 4.
4-C : The condition of branch-C is not taken. In this case, the size of the dynamic basic-
block is increased. The execution footprint for the fall-through instructions (i.e., the RCN
flag) is stored to the TCO, and then the RCN flag is set to 1. The tag comparison is resumed.
4-D : As a new entry for branch-D is registered in the BTB, the TCO is reset to 0.
Therefore, the tag comparison continues to be performed in iteration 5.
5-C : The RCN flag corresponding to branch-C recorded at 4-C is stored to the TCO.
As the TCO is set to 1, the tag comparisons for the remaining instructions in iteration
5 are omitted.
Table 4.1: Benchmark Programs.
Programs (input):
SPECint95 (training): 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
SPECfp95 (test): 102.swim, 107.mgrid, 110.applu, 125.turb3d, 141.apsi
5-D : The footprint (RCT flag) recorded at 4-D is stored to the TCO, so the tag
comparison is again omitted.
6-C : The RCN flag recorded at 5-C is stored to the TCO, and the tag comparisons are
omitted.
6-D : The RCT flag recorded at 5-D is stored to the TCO, and the tag comparisons are
omitted.
7-B : The condition of branch-B is taken. In this case, the size of the dynamic basic-block
is decreased. Since this is the first taken condition for branch-B, a new entry is registered
in the BTB. Therefore, the TCO is reset to 0, and the tag comparison is resumed for the
target dynamic basic-block.
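The rules illustrated by this example can be condensed into a short behavioral sketch; the structure below ignores BTB associativity, replacements, and the RAS, so it is a simplified restatement of the mechanism rather than the simulator implementation.

class ExtendedBTBEntry:
    def __init__(self, target):
        self.target = target
        self.rct = 0      # footprint: taken-path dynamic basic-block resides in the cache
        self.rcn = 0      # footprint: fall-through dynamic basic-block resides in the cache

class HistoryBasedTagComparison:
    def __init__(self):
        self.btb = {}     # branch address -> ExtendedBTBEntry (conflicts/evictions ignored)
        self.tco = 0      # 1 => the tag comparison may be omitted for the current block

    def on_cache_miss(self):
        for entry in self.btb.values():    # any miss may have evicted part of a block
            entry.rct = entry.rcn = 0
        self.tco = 0

    def on_btb_hit(self, branch_pc, taken):
        entry = self.btb[branch_pc]
        if taken:                          # use the taken footprint, then leave a new one
            self.tco, entry.rct = entry.rct, 1
        else:                              # same treatment for the fall-through footprint
            self.tco, entry.rcn = entry.rcn, 1

    def on_new_taken_branch(self, branch_pc, target):
        self.btb[branch_pc] = ExtendedBTBEntry(target)
        self.tco = 0                       # the target block may not reside in the cache
        self.btb[branch_pc].rct = 1        # leave the execution footprint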
4.5 Evaluations
In this section, we evaluate the energy efficiency of the history-based tag-comparison cache
by comparing it with a conventional cache and the interline tag-comparison cache. The combi-
nation of the history-based tag-comparison cache and the interline tag-comparison cache is
also evaluated.
4.5.1 Simulation Environment
In this evaluation, eight integer programs using the training input and five floating-point programs
using the test input from the SPEC95 benchmark suite are used [82], as shown in Table 4.1. The
benchmark programs are executed on the SimpleScalar simulator [11], which we have modified to
implement the history-based tag-comparison cache. For each program, the total
count of tag comparisons on the following caches is measured:
• C-TC (Conventional Tag-Comparison cache) : The tag comparison is performed on
every instruction fetch. This is the base model in this evaluation.
• IL-TC (InterLine Tag-Comparison cache) : The tag comparison is performed only on
interline flows, as explained in Section 4.3.
• H-TC (History-based Tag-Comparison cache) : The tag comparison is performed ac-
cording to the TCO flag, as explained in section 4.4.
• H-TC ideal : This is the same as the H-TC cache except that it has a perfect in-
struction cache (i.e., no cache misses) and a fully associative BTB (i.e., no BTB-conflict
misses).
• HIL-TC (History-based InterLine Tag-Comparison cache) : This is a combination of
the IL-TC cache and the H-TC cache. The tag comparison is performed if the TCO
flag is 0 and the fetched instruction is on the interline flows.
Unless stated otherwise, the following configuration is assumed: the cache size is 32 KB, the
cache-line size is 32 bytes, the direct-mapped BPT has 2048 entries, the BTB has 512 sets with
an associativity of 4, and the RAS size is 8.
4.5.2 Energy Reduction for Tag Comparisons
Table 4.2 shows the total count of tag comparisons normalized to the conventional cache
(C-TC). First, we compare the history-based tag-comparison cache (H-TC) with the conven-
tional cache (C-TC) and the interline tag-comparison cache (IL-TC). Since there are many
incremental accesses in almost all programs, the interline tag-comparison cache works well for
all programs, while the effectiveness of the history-based tag-comparison cache is application
Table 4.2: Normalized Tag-Comparison Counts.
Benchmark   C-TC   IL-TC   H-TC   H-TC ideal   HIL-TC
099.go 1.000 0.3203 0.7604 0.4027 0.2378
124.m88ksim 1.000 0.3302 0.4217 0.1856 0.1361
129.compress 1.000 0.3528 0.1751 0.1718 0.0706
126.gcc 1.000 0.3343 0.6810 0.2812 0.2278
130.li 1.000 0.3515 0.4500 0.1811 0.1684
132.ijpeg 1.000 0.2992 0.1062 0.0560 0.0311
134.perl 1.000 0.3436 0.6643 0.1361 0.2249
147.vortex 1.000 0.3213 0.8838 0.2141 0.2837
102.swim 1.000 0.2957 0.0623 0.0622 0.0278
107.mgrid 1.000 0.2600 0.0008 0.0002 0.0002
110.applu 1.000 0.2657 0.0252 0.0248 0.0070
125.turb3d 1.000 0.2813 0.0849 0.0727 0.0266
141.apsi 1.000 0.2801 0.1050 0.0476 0.0307
dependent. The history-based tag-comparison cache produces a larger reduction than the
interline tag-comparison cache for two integer programs, 129.compress and 132.ijpeg, and for
all floating-point programs. In particular, the cache eliminates more than 90% of the tag comparisons
for the floating-point programs. This result can be understood by considering the charac-
teristics of the programs: the floating-point programs and media application programs
have relatively well-structured loops, and the history-based tag-comparison cache avoids
unnecessary tag comparisons by exploiting the iterative execution of these programs.
Figure 4.5 shows the total count of tag comparisons and the total energy dissipated for them during
the execution of each program. In the figure, we assume that the average energy dissipated
for a tag comparison (ETag) is 1.1 nanojoules. ETag is calculated based on Kamble's
model [48], assuming the 0.8-micron CMOS cache design described in [95]. To
obtain the value of ETag using Kamble's model, the following parameters are assumed
(a more detailed explanation of the calculation is presented in Section 5.6.3):
Figure 4.5: Total Energy Dissipated for Tag Comparisons for The Execution of Programs. (Total count of tag comparisons and total energy, assuming ETag = 1.1 nJ, for the conventional cache and the history-based tag-comparison cache on each benchmark.)
• The total number of rows, Nrow, is 1024 (32 KB / 32 B = 1024 sets).
• The cache-line bits, L, are 0, because ETag does not include the energy dissipated in the data
subarray.
• The tag bits, T, are 17, because the index is 10 bits and the offset is 5 bits (32 − 10 − 5 = 17).
• The associativity, M, is 1.
It is observed from Figure 4.5 that the history-based tag-comparison cache achieves signif-
icant energy reductions, in particular for the floating-point programs.
Next, we compare the ideal history-based tag-comparison cache (H-TC ideal) with the
realistic one (H-TC). It can be seen from the results of the ideal cache that the history-based
tag-comparison cache has the potential to achieve better results than the interline tag-
comparison cache for all programs, with the exception of 099.go. However, the realistic cache
with hardware constraints (H-TC) does not make significant improvements for some integer
programs: 099.go, 126.gcc, 134.perl, and 147.vortex. For these programs, the tag-comparison
reduction achieved by the interline tag-comparison cache is about 70%, while that produced
by the realistic history-based tag-comparison cache is only from 12% to 34%. The difference
between the ideal cache (H-TC ideal) and the realistic cache (H-TC) is analyzed in Section
4.5.3.
Finally, we discuss the efficiency of the history-based interline tag-comparison cache. We
can see from the simulation results that the combination of the history-based tag-comparison
and the interline tag-comparison makes a significant reduction. In the best case, 107.mgrid,
the total count of tag comparisons is reduced to less than 0.01%. Even in the worst case,
147.vortex, the combination eliminates more than 70% of the tag comparisons.
4.5.3 Effects of Hardware Constraints
When cache misses or BTB replacements take place, all footprints in the BTB are erased. In
this section, we analyze the effects of the cache size and the BTB associativity. Four integer
programs, 132.ijpeg, 099.go, 126.gcc, and 147.vortex, are used in this analysis. The history-
based tag-comparison cache works well for 132.ijpeg, but not for the other programs, as
reported in Section 4.5.2.
4.5.3.1 BTB Associativity
Figure 4.6 shows the total count of tag comparisons for the history-based tag-comparison
cache (H-TC) with various BTB associativities. Note that the total BTB size is kept constant. In
addition, the cache size is 32 KB, except that the ideal cache (H-TC ideal) has a perfect
instruction cache. All results are normalized to the conventional cache (C-TC).
It is clear from the figure that there is no significant improvement even if the BTB associativity is increased.
Figure 4.6: Effect of the BTB Associativity. (Normalized number of tag comparisons for H-TC with 2-way through fully associative BTBs, and for H-TC ideal, on 132.ijpeg, 099.go, 126.gcc, and 147.vortex.)
The gap between the realistic cache (H-TC) and the ideal cache (H-TC
ideal) is still large. This trend can be seen for all benchmark programs. We consider that the
BTB has enough capacity for these programs, so BTB conflicts rarely take place
even if the BTB has low associativity.
4.5.3.2 Cache Size
The cache-hit rates directly affect the efficiency of the history-based tag-comparison cache.
Figure 4.7 shows the simulation results of the history-based tag-comparison cache (H-TC)
with various cache sizes. All results are normalized to the conventional cache (C-TC). Note
that the basic BTB configuration is maintained, except that the ideal cache (H-TC ideal) has
a fully associative BTB.
For all programs, increasing the cache size improves the efficiency of the history-based tag-comparison cache.
Figure 4.7: Effects of Cache Capacity. (Normalized number of tag comparisons for H-TC with 4 KB to 128 KB and perfect instruction caches, and for H-TC ideal, on 132.ijpeg, 099.go, 126.gcc, and 147.vortex.)
In particular, when the cache size exceeds 64 KB, the realistic history-
based tag-comparison cache reduces the total count of tag comparisons as effectively as the ideal
cache for 132.ijpeg. For the other programs, the gap between the realistic cache and the
ideal cache decreases as the cache size increases. We have measured the breakdown
of the footprint-erase operations for 099.go: 98% of the footprint-erase operations
are caused by cache misses, and the remaining 2% are caused by BTB replacements.
Therefore, the efficiency of the history-based tag-comparison is affected much more strongly by the cache
size than by the BTB replacements. Since the trend has been to increase on-chip cache sizes,
the efficiency of the history-based tag-comparison cache will also increase.
4.5.4 Energy Overhead
In this section, we discuss the energy overhead of the extended BTB for the history-based
tag-comparison cache. As explained in section 4.4.2, only 2 bits per entry are added: one
for the RCT flag and one for the RCN flag. We need to consider the energy consumed for
reading or writing the footprints and that for erasing all footprints.
The energy consumed for reading or writing the footprints depends on the implementation
of the BTB. If pipelined access is employed, the BTB lookup is performed first, and then the
target address and the footprints in the matching entry are accessed. If there is no matching
entry, the BTB access finishes without accessing the target address and the footprints. With
this implementation, the footprints are accessed only on BTB hits, and a BTB hit occurs only when
the instruction being fetched is a branch or jump. Therefore, the
energy for accessing the footprints is consumed only on the execution of branches or jumps that
are already registered in the BTB. On the other hand, the tag comparison in conventional
caches is performed on every clock cycle. Therefore, the energy overhead is trivial.
In a high-performance implementation of a set-associative BTB, the branch addresses
and the target addresses in the indexed set are read in parallel, so the RCT and RCN
flags are also read in parallel. In this case, the energy overhead appears on every cycle.
The total number of footprints to be accessed depends on the BTB associativity: when
the BTB has an n-way set-associative organization, (RCT + RCN) × n = 2n bits are accessed.
If the BTB has a much higher associativity (i.e., a large n), this energy overhead
becomes a serious problem.
Next, we consider the energy consumed for erasing all footprints (i.e., for resetting all RCT
and RCN flags to 0). This energy overhead depends on how many footprints are reset from
1 to 0. Table 4.3 shows the energy overhead of the extended BTB in terms of the number
of erased footprints. The column labeled "Total" is the total number of erased
footprints during the program execution. The columns labeled "per erase" and "per i-fetch"
are the average number of erased footprints per footprint-erase operation and per
instruction fetch, respectively. For almost all programs, fewer than 6 footprints are erased
per footprint-erase operation on average. Conventional caches need to read the whole tag data
on every clock cycle, while the history-based tag-comparison cache erases less than 0.1 footprint
bit per cycle. Therefore, we believe that this energy
Table 4.3: Total Number of Erased Footprints.
Benchmark Total per erase per i-fetch
099.go 44,765,053 1.576 0.082
124.m88ksim 5,004,912 5.780 0.042
129.compress 21,114 5.504 0.001
126.gcc 113,901,538 1.696 0.088
130.li 8,602,721 9.374 0.047
132.ijpeg 4,045,877 2.322 0.003
134.perl 201,247,564 3.234 0.084
147.vortex 181,134,297 2.270 0.072
102.swim 7,482 1.167 0.000
107.mgrid 256,370 0.996 0.000
110.applu 109,144 0.565 0.000
125.turb3d 22,937,475 5.612 0.001
141.apsi 23,492,530 1.525 0.003
overhead can be ignored.
4.6 Related Work
Panwar et al. have proposed the concept of conditional tag comparison to reduce the fre-
quency of the tag comparisons, and have presented the interline tag-comparison [70], as
explained in Section 4.3. The interline tag-comparison cache omits the tag comparison if
instructions have intraline flows, whereas the history-based tag-comparison cache can omit
it not only on the intraline flows but also on the interline flows.
The S-cache has also been proposed in [70]. The S-cache is a small memory added to the
L1 cache, and it has a statically allocated address space. No cache replacement occurs in the S-
cache; therefore, the tag comparison is unnecessary because S-cache accesses always hit.
The scratchpad-memory [69], the loop-cache [8], [7], and the decompressor-memory [38] also
employ this kind of small memory, and have the same effect as the S-cache. For the scratchpad-
memory and the loop-cache, the compiler analyzes the programs and allocates frequently executed
instructions to the small memory. For the S-cache and the decompressor-memory, prior
simulations using an input-data set are required to optimize the code allocation. These works
differ from ours in two aspects. First, these caches require a static analysis. Second, the cache
module has to be separated into a dynamically allocated memory space (i.e., the main cache) and a
statically allocated memory space (i.e., the small cache). The history-based tag-comparison
cache does not require these arrangements.
4.7 Conclusions
In this chapter, we have proposed the history-based tag-comparison cache for low energy
consumption. The history-based tag-comparison cache exploits the following two facts: first,
instruction-cache hit rates are very high; second, almost all programs have many loops.
The cache records execution footprints and determines whether the instructions to be
fetched currently reside in the cache without a tag lookup. Therefore, the cache can reduce the
energy consumed for the tag comparisons. The branch target buffer (BTB) is extended to
record the execution footprints.
We have evaluated the efficiency of the history-based tag-comparison cache. It is observed
that more than 90% of the tag comparisons are eliminated in many benchmark programs. In
addition, the combination of our history-based tag-comparison cache and the interline tag-
comparison cache makes a remarkable reduction: in half of the benchmark programs, the total
count of tag comparisons is reduced by more than 95%. Moreover, we have analyzed the effects
of hardware constraints, namely the cache size and the BTB associativity. As a result, it is observed that
the efficiency of the history-based tag-comparison cache is improved by increasing the cache
size. On-chip cache sizes have certainly been increasing. Therefore, we believe
that the history-based tag-comparison is a superior approach to achieving low-power
instruction caches for future processor chips.
Chapter 5
Variable Line-Size Cache
Architecture
5.1 Introduction
Recent remarkable advances of VLSI technology have been increasing processor speed and
DRAM capacity dramatically. However, the advances have also introduced a large and
growing performance gap between the processor and DRAM. This problem is referred to
as the "Memory Wall" [12], [97], and it results in poor total system performance in spite of higher
processor performance. Integrating processors and DRAM on the same chip, as a merged
DRAM/logic LSI, is a good approach to solving the "Memory Wall" problem [72]. Merged
DRAM/logic LSIs provide high on-chip memory bandwidth by interconnecting the processors
and DRAM with wide on-chip busses. In addition, the design space of the memory hierarchy
for merged DRAM/logic LSIs becomes so broad that designers can choose from various
on-chip memory-path architectures. We can classify the on-chip memory-path
architectures as shown in Figure 5.1:
• DRCM (Datapath–Register–Cache–MainMemory) architecture: This architecture comes
straight from the common memory hierarchy, which is widely employed in recent com-
mercial processor chips. The on-chip memory-path consists of datapath, registers, cache
memory, and main memory. The high on-chip memory bandwidth is exploited between
the cache and the main memory on cache replacements [64], [81], [13], [78].
Figure 5.1: On-Chip Memory-Path Architectures. (Discrete LSIs versus merged DRAM/logic LSIs: combinations of datapath, registers, SRAM cache, and DRAM main memory on one chip.)
• DRM (Datapath–Register–MainMemory) architecture: This architecture is based on
vector processing. The on-chip memory-path consists of datapath, vector registers, and
main memory. The high on-chip memory bandwidth is exploited on vector load/store
operations [73].
• DM (Datapath–MainMemory) architecture: This architecture is based on the direct
calculations, in which all operands reside in the main memory. The on-chip memory-
path consists of datapath, and main memory. The high on-chip memory bandwidth is
exploited on ALU operations [27].
Which memory-path architecture should be employed depends largely on the characteristics
of target programs. Among these candidates for on-chip memory path architectures, we focus
on the memory hierarchy including cache memory, the DRCM architecture. On-chip SRAM
caches are still necessary for most programs to hide large DRAM-access latency even if the
processors and DRAM are integrated on the same chip.
This chapter introduces the concept of a novel cache architecture for merged DRAM/logic
LSIs, called the "Variable Line-Size Cache" (VLS cache). The VLS cache attempts to make good
use of the attainable high on-chip memory bandwidth by optimizing the cache-line size.
A large cache-line size can benefit programs with rich spatial locality of references
due to the effect of prefetching, while a small cache-line size makes it possible to reduce the
frequency of cache-line conflicts without any access-time overhead. In addition, decreasing
the cache-line size reduces the energy consumed for on-chip main-memory accesses. This
chapter also introduces two VLS caches: a statically variable line-size cache (S-VLS cache)
and a dynamically variable line-size cache (D-VLS cache). In addition, we evaluate the
performance/energy efficiency of the VLS caches using many benchmark programs.
The rest of this chapter is organized as follows. Section 2 shows a conventional approach
to exploiting the high on-chip memory bandwidth and clarifies its advantages and disadvantages.
Section 3 gives the concept of the VLS cache architecture as an approach
to overcoming the disadvantages. Section 4 and Section 5 propose two types of VLS cache:
the statically variable line-size cache and the dynamically variable line-size cache. Section 6 presents some simulation
results, evaluates the performance/energy efficiency of the VLS caches, and analyzes the
dynamically variable line-size cache in detail. Section 7 shows related work, and Section 8
concludes this chapter.
5.2 Conventional Approaches to Exploiting High Memory-
Bandwidth
In merged DRAM/logic LSIs with a memory hierarchy including cache memory, the high
on-chip memory bandwidth can be exploited on cache replacements.
We restate here the definition of the average memory-access time, Equations (2.1) and (2.2),
explained in Chapter 2:
AMAT = T_{Cache} + CMR \times 2 \times T_{MainMemory} \qquad (2.1)

T_{MainMemory} = T_{DRAMarray} + \frac{LineSize}{BandWidth} \qquad (2.2)
Even if LineSize increases within the range of BandWidth, the miss penalty is not increased,
assuming a constant DRAM access time. Since BandWidth in traditional computer
systems is very small due to the I/O-pin bottleneck, the miss penalty increases if we
increase LineSize. On the other hand, the BandWidth of merged DRAM/logic LSIs can be
enlarged dramatically because there is no I/O-pin limitation; the high bandwidth is easily
realized by widening the on-chip busses. Therefore, designers can increase the cache-line size
within the range of the enlarged BandWidth while keeping TDRAMarray constant. Generally, large cache
lines can benefit programs with rich spatial locality due to the effect of prefetching.
Figure 5.2: The Effects of Cache-Line Size to Cache-Miss Rates. (Miss ratio (%) versus cache-line size, 16 to 256 bytes, in panels (a), (b), and (c), for 052.alvinn, 072.sc, 104.hydro2d, 099.go, 134.perl, 126.gcc, 132.ijpeg, 103.su2cor, and 101.tomcatv.)
Consequently, in merged DRAM/logic LSIs, the designer can take full advantage
of the spatial locality inherent in programs. For example, since instruction references have rich
spatial locality in almost all programs, increasing the cache-line size makes a significant
performance improvement [78].
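Equation (2.2) makes the benefit of the wide on-chip bus explicit; in the sketch below, the DRAM-array latency and the two bus widths are illustrative figures, not measured parameters.

def miss_penalty(line_bytes, bus_bytes_per_cycle, t_dram_array=30):
    """T_MainMemory = T_DRAMarray + LineSize / BandWidth, in cycles (equation (2.2))."""
    return t_dram_array + line_bytes / bus_bytes_per_cycle

for line in (16, 32, 64, 128, 256):
    narrow = miss_penalty(line, bus_bytes_per_cycle=8)     # pin-limited off-chip bus
    wide = miss_penalty(line, bus_bytes_per_cycle=256)     # wide on-chip bus of a merged DRAM/logic LSI
    print(line, narrow, wide)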
Unfortunately, since conventional caches employ a single cache-line size, increasing the
cache-line size is the only approach to exploiting the high on-chip memory bandwidth. How-
ever, increasing the cache-line size results in reducing the total number of cache lines which
can be held in the cache memory. Thus, large cache lines might worsen the cache-miss rates
due to frequent cache-line evictions if programs have poor spatial locality. Actually, the
spatial locality of data references depends on the characteristics of programs, and general
purpose processors have to execute a number of programs. Figure 5.2 shows how the cache-
miss rate is affected by the cache-line size in a 16 KB direct-mapped data cache. 052.alvinn
and 072.sc, from the SPEC92 benchmark program suite, are executed using the reference in-
put. The other integer programs and floating-point programs, from the SPEC95 benchmark
program suite, are executed using the training input and the test input, respectively. These
programs are compiled by GNU CC with the “–O2” option, and are executed on an Ultra
SPARC processor. It is clear from Figure 5.2 that the best cache-line size varies widely from
program to program. For example, the best cache-line size in Figure 5.2 (a) is equal to or larger
than 128 bytes, that in Figure 5.2 (b) is equal to or smaller than 32 bytes, and that in Figure
5.2 (c) is just 64 bytes. If programs do not have enough spatial locality, as shown in Figure
5.2 (b) or (c), we will have the following problems:
1. A number of conflict misses will take place due to frequent evictions.
2. As a result, a lot of time and energy will be wasted by a number of main-memory
accesses.
3. Activating the wide on-chip bus and the DRAM array will also dissipate a lot of energy.
Employing a set-associative cache is a conventional approach to solving the first and second
problems, because it can improve the cache-hit rates. However, since increasing the cache
associativity makes the access time longer, it might worsen the memory performance [30][96]. In
addition, we still have the third problem due to a fixed large cache-line size.
5.3 Variable Line-Size Cache
5.3.1 Terminology
In the VLS cache, an SRAM (cache) cell array and a DRAM (main memory) cell array are
divided into several subarrays. Data transfer for cache replacements is performed between
corresponding SRAM and DRAM subarrays. Figure 5.3 summarizes the definition of terms.
Subline, or address-block, is a block of data associated with a single tag in the cache. Line,
or transfer-block, is a block of data transferred at once between the cache and the main
memory. The sublines from every SRAM subarray, which have the same cache-index, form
a cache-sector. A cache-sector and a subline which are being accessed during a cache lookup
are called a reference-sector and a reference-subline, respectively. When a memory reference
is a cache hit, the desired data resides in the reference-subline. Otherwise, the desired data
is not in the reference-subline but only in the main memory. A memory-sector is a block of
data in the main-memory, and corresponds to the cache-sector. Adjacent-subline is defined
as follows:
1. It resides in the reference-sector, but is not the reference-subline. For the example
depicted in Figure 5.3, the sublines which include addresses 32, 2, and 3 satisfy this
condition.
[Figure 5.3: Terminology for VLS Caches — a VLS cache with four SRAM (cache) subarrays and four corresponding DRAM (main memory) subarrays, illustrating the address-block (subline), transfer-block (line), cache-sector, memory-sector, reference-sector, reference-subline, and adjacent-subline for references to addresses 1 and 3.]
2. Its main-memory home location is in the same memory-sector as that of the data
which is currently being referenced by the processor. For the example depicted in
Figure 5.3, the sublines which include addresses 2 and 3 satisfy this condition.
3. It has been referenced at least once since it was fetched into the cache. For the example
depicted in Figure 5.3, the subline which includes address 3 satisfies this condition.
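As a concrete restatement of the three conditions, the following C sketch tests whether a given subline qualifies as an adjacent-subline. The structure, its fields, and the function name are our own illustrative choices, not hardware signals defined by this chapter.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative per-subline state; field names are ours. */
struct subline {
    uint32_t tag;        /* tag currently stored for this subline            */
    bool     valid;
    bool     referenced; /* set once the processor has accessed the subline  */
};

/* A subline of the reference-sector is an adjacent-subline when
 * (1) it is not the reference-subline itself,
 * (2) its home memory-sector matches that of the current reference
 *     (i.e., its stored tag equals the tag of the reference address), and
 * (3) it has been referenced since it was fetched into the cache.          */
static bool is_adjacent_subline(const struct subline *s,
                                int subarray, int ref_subarray,
                                uint32_t ref_tag)
{
    if (subarray == ref_subarray)            /* condition (1) */
        return false;
    if (!s->valid || s->tag != ref_tag)      /* condition (2) */
        return false;
    return s->referenced;                    /* condition (3) */
}

int main(void)
{
    struct subline s = { 0x1234, true, true };
    /* Subline in subarray 2; the reference goes to subarray 0 with the same tag. */
    printf("adjacent-subline: %d\n", is_adjacent_subline(&s, 2, 0, 0x1234)); /* prints 1 */
    return 0;
}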
5.3.2 Concept and Principle of Operations
To make good use of the high on-chip memory bandwidth, the VLS cache optimizes its line
size according to the characteristics of programs. When programs have rich spatial locality,
the VLS cache uses larger lines, each of which consists of many sublines. Conversely, when
programs have poor spatial locality, the VLS cache uses smaller lines, each of which consists of a
single subline or a few sublines, and thereby tries to avoid cache conflicts. In addition, activating
only the DRAM subarrays corresponding to the small lines (i.e., the small number of sublines)
yields a significant energy reduction.
[Figure 5.4: Three Different Transfer-Block Sizes on Cache Replacements — (a) replacement with a minimum 32-byte line, (b) replacement with a medium 64-byte line, and (c) replacement with a maximum 128-byte line; shaded sublines indicate where data transfer occurs between the cache and the main memory.]
The construction of the direct-mapped VLS cache illustrated in Figure 5.4 is similar to that
of a conventional 4-way set-associative cache. However, the conventional 4-way set-associative
cache has four locations where a subline can be placed, while the direct-mapped VLS cache
has only one location for a subline, just like a conventional direct-mapped cache. Since the
VLS cache attempts to reduce conflict misses without increasing the cache associativity, the
fast access of a direct-mapped cache can be maintained.
For the VLS cache shown in Figure 5.3, there are three possible line sizes as follows:
• Minimum line size, where only the designated subline is involved in cache replacements
(see Figure 5.4 (a)).
• Medium line size, where the designated subline and one of its neighbors in the
corresponding cache-sector are involved (see Figure 5.4 (b)).
• Maximum line size, where the designated subline and all of its neighbors in the
corresponding cache-sector are involved (see Figure 5.4 (c)).
Because the VLS cache keeps the direct-mapped organization, its access time is shorter than that
of conventional caches with higher associativity. Since the medium line is not allowed to misalign
with a 64-byte boundary within the 128-byte cache-sector, the number of possible combinations of
sublines involved in a cache replacement is just seven (four for the minimum, two for the medium,
and one for the maximum line size) rather than fifteen (= 2^4 − 1).
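The following C fragment, a sketch of our own rather than part of the thesis simulator, enumerates the seven legal subline combinations by representing each candidate as a 4-bit mask over the sublines of a 128-byte cache-sector and keeping only contiguous, size-aligned groups.

#include <stdio.h>

/* Sublines 0..3 of a 128-byte cache-sector (32 bytes each).
 * A replacement pattern is legal when it is a contiguous, size-aligned
 * group of 1, 2, or 4 sublines: {0},{1},{2},{3},{0,1},{2,3},{0,1,2,3}.  */
int main(void)
{
    int count = 0;
    for (unsigned mask = 1; mask < 16; mask++) {
        for (unsigned size = 1; size <= 4; size <<= 1) {          /* 1, 2, 4 sublines */
            for (unsigned start = 0; start < 4; start += size) {  /* aligned start    */
                unsigned group = ((1u << size) - 1u) << start;
                if (mask == group) {
                    printf("legal pattern: 0x%x (%u sublines)\n", mask, size);
                    count++;
                }
            }
        }
    }
    printf("total legal patterns: %d\n", count);   /* prints 7 */
    return 0;
}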
5.3.3 Line Size Optimization
The effectiveness of the VLS cache depends heavily on how well the cache replacement is
performed with appropriate line size. We need to consider how and when the line size should
be changed. At least, there are the following three methods for determining the appropriate
line sizes.
1. Static determination based on prior simulations : Application programs are analyzed
using a cache simulator in advance. We determine suitable line sizes based on the
results of the simulation using input sets.
2. Static determination based on compiler analysis : Source programs are analyzed by a
special compiler. Then the compiler determines suitable line sizes.
3. Dynamic determination using a hardware assist : Special hardware determines suitable
line sizes at run-time.
In addition, we can consider the following granularity for line size modification.
• program by program : each program has an appropriate line size. Thus, the line size is
changed when context switches take place.
• procedure by procedure : each procedure has its own appropriate line size. Therefore, the
line size is changed at procedure calls.
• code by code : each load (or store) instruction has its own appropriate line size. The line
size is changed at load/store operations.
• data by data : each datum located in main memory has its own appropriate line size. The
line size depends on the memory-reference address.
We will introduce two VLS caches. One adopts the static line-size determination based on
prior simulations, and changes the line size program by program (section 5.4). The other
employs the dynamic determination by a hardware assist, and optimizes the line size data
by data (section 5.5).
5.4 Statically Variable Line-Size Cache
5.4.1 Organization
The most straightforward way to determine an appropriate line size is to test various fixed line
sizes by means of prior simulations. This method is applicable if a program's behavior is largely
independent of its input data. We refer to this kind of VLS cache as a Statically Variable Line-Size
Cache (S-VLS cache). Figure 5.5 illustrates the block diagram of a direct-mapped S-VLS
cache. As the subline size is 32 bytes, the S-VLS cache can provide 32-byte, 64-byte, and
128-byte line sizes.
The number and the size of the tag fields are equal to those of a conventional direct-mapped
cache with fixed 32-byte lines. A status register in the processor has a field in order to indicate
the current line size. The line-size field can be modified by a special instruction inserted at
the top of each program; that is, a program is executed with a single line size specified by the
special instruction. When a task switch occurs, the line-size information is saved and restored,
along with the machine state, by means of the conventional context saving/restoring sequence.
Therefore, changing the line size does not incur any extra performance overhead.
[Figure 5.5: A Direct-Mapped S-VLS Cache with 32-byte, 64-byte, and 128-byte Lines — the address is divided into Tag, Index, SA (subarray), and Offset fields; four 32-byte SRAM subarrays each have their own tag and comparator, and a line-size mode field in the processor's status register, set by a special instruction, selects the line size for the running program.]
5.4.2 Operation
The S-VLS cache works as follows:
1. When a memory access takes place, the cache tag array is looked up in the same manner
as normal caches, except that every SRAM subarray has its own tag memory and the
lookup is performed on every tag memory.
2. On a cache hit, the hit subline has the required data, and the cache access is performed
on this subline in the same manner as normal caches.
3. On a cache miss, a cache refill takes place as follows:
(a) According to the designated line size, one or more sublines are written back from
the indexed cache-sector into their home locations in the DRAM main memory.
(b) According to the designated line size, one or more sublines (one of which contains
the required data) are fetched from the memory-sector into the cache-sector.
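As an illustration of step 3, the following C sketch performs a cache refill for the designated line size. The helper functions (write_back_subline, fetch_subline) and the aligned-group computation are hypothetical stand-ins for simulator internals that the thesis does not spell out.

#include <stdio.h>

/* Hypothetical stand-ins for the simulator's DRAM-transfer routines. */
static void write_back_subline(int index, int sa)
{ printf("write back  set %d, subarray %d\n", index, sa); }
static void fetch_subline(int index, int sa)
{ printf("fetch       set %d, subarray %d\n", index, sa); }

/* Refill for a direct-mapped S-VLS cache; line_size_sublines is 1, 2, or 4
 * (32-, 64-, or 128-byte lines). Only the aligned group of subarrays that
 * contains the missing subline is involved in the replacement.            */
static void svls_refill(int index, int ref_subarray, int line_size_sublines)
{
    int start = ref_subarray - (ref_subarray % line_size_sublines);
    for (int sa = start; sa < start + line_size_sublines; sa++) {
        write_back_subline(index, sa);   /* step (a): write back to the DRAM subarray   */
        fetch_subline(index, sa);        /* step (b): fetch from the memory-sector      */
    }
}

int main(void)
{
    svls_refill(5, 3, 2);   /* 64-byte line: replaces subarrays 2 and 3 of set 5 */
    return 0;
}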
5.4.3 Line-Size Determination
In the case where a direct-mapped S-VLS cache provides 32-byte, 64-byte, and 128-byte lines,
for example, we can determine the suitable line size for a program in the following manner.
First, the program is simulated three times to measure hit rates, assuming three direct-mapped
caches with fixed line sizes of 32 bytes, 64 bytes, and 128 bytes. Then we regard the line size
that gives the highest hit rate of the three simulations as the suitable line size for the program.
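A minimal sketch of this selection step is shown below; the miss-rate values are placeholders chosen for illustration, and the helper function is our own, not part of the thesis tool chain.

#include <stdio.h>

/* Pick the line size whose prior simulation gave the lowest miss rate.
 * sizes[] and miss[] come from three runs of the cache simulator with
 * fixed 32-, 64-, and 128-byte lines.                                   */
static int best_line_size(const int sizes[3], const double miss[3])
{
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (miss[i] < miss[best])
            best = i;
    return sizes[best];
}

int main(void)
{
    int    sizes[3] = { 32, 64, 128 };
    double miss[3]  = { 0.052, 0.047, 0.061 };   /* illustrative values only */
    printf("selected S-VLS line size: %d bytes\n", best_line_size(sizes, miss));
    return 0;
}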
5.5 Dynamically Variable Line-Size Cache
5.5.1 Organization
The statically variable line-size cache (S-VLS cache) explained in Section 5.4 attempts to improve
cache-hit rates by exploiting differences in spatial locality among programs. The static method
may be adequate when target programs have regular access patterns within well-structured loops.
However, a number of programs have irregular access patterns, so the amount of spatial locality
may vary both within and among program executions. In contrast to the static method, the
dynamically variable line-size cache (D-VLS cache) selects adequate line sizes based on recently
observed data-reference behavior at run time. The cache adds a few hardware components to
optimize the line size.
Figure 5.6 illustrates the block diagram of a direct-mapped D-VLS cache having three line
sizes, 32 bytes, 64 bytes, and 128 bytes. The D-VLS cache has the following components for
optimizing the line size at run time:
• A reference-flag bit per subline : This flag bit is reset to 0 when the corresponding
subline is fetched into the cache, and is set to 1 when the subline is accessed by the processor.
Of course, the reference-flag bit corresponding to the subline which has caused the cache miss
is set to 1 when the cache replacement is performed. The reference-flag bit is used for determining
whether the corresponding subline is an adjacent-subline: on a cache lookup, if the tag of a subline
other than the reference-subline matches the tag field of the memory address, and if its reference-flag
bit is 1, then that subline is an adjacent-subline.
[Figure 5.6: Block Diagram of a Direct-Mapped D-VLS Cache — the S-VLS organization of Figure 5.5 extended with a reference-flag bit per subline, an LSS-table holding a Line-Size Specifier (LSS) per cache-sector, and a Line-Size Determiner (LSD) that reads the current line size and writes back the next line size.]
• A Line-Size Specifier (LSS) per cache-sector : This specifies the line size of the
corresponding cache-sector. As described in Section 5.3.2, each cache-sector is in one of three
states: the minimum, medium, and maximum line-size states. To identify these states, every
LSS holds 2 bits of state information. This means that the cache replacement is performed
according to the line size which is specified by the LSS corresponding to the reference-sector.
The LSS is stored in the LSS-table, as shown in Figure 5.6.
• Line-Size Determiner (LSD) : On every cache lookup, the LSD determines the state
of the line-size specifier of the reference-sector. The details of the determination algorithm are
explained in Section 5.5.3.
5.5.2 Operation
The D-VLS cache works as follows:
1. The memory address generated by the processor is divided into the byte offset within
a subline, subarray field designating the subarray, index field used for indexing the tag
memory, and tag field.
2. Each cache subarray has its own tag memory and comparator, and it can perform the
tag-memory lookup using the index and tag fields independently of the others. At
the same time, the LSS corresponding to the reference-sector is read from the LSS-table
using the index field.
3. One of the tag-comparison results is selected by the subarray field in the memory
address, and then the cache hit or miss is detected.
4. On a cache miss, a cache replacement is performed according to the state of the LSS.
5. Regardless of hits or misses, the LSD determines the state of the LSS. After that, the
LSD writes back the modified LSS to the LSS-table.
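To make steps 1-3 concrete, the following C sketch decomposes an address and performs the parallel tag lookup for the 16 KB D-VLS cache of Figure 5.6. The field widths (5-bit offset, 2-bit subarray field, 7-bit index, remaining bits of tag) follow from the 32-byte sublines, four subarrays, and 128 cache-sectors described in this chapter, but the structure and function names are ours, not the thesis hardware interface.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SUBARRAYS 4     /* four 32-byte SRAM subarrays per cache-sector */
#define NUM_SECTORS   128   /* 16 KB / 128-byte cache-sectors               */

struct dvls_fields {
    uint32_t offset;    /* bits [4:0]  : byte offset within a 32-byte subline   */
    uint32_t subarray;  /* bits [6:5]  : SA field selecting one of 4 subarrays  */
    uint32_t index;     /* bits [13:7] : cache-sector index                     */
    uint32_t tag;       /* bits [31:14]: tag                                    */
};

static struct dvls_fields split_address(uint32_t addr)
{
    struct dvls_fields f;
    f.offset   =  addr        & 0x1f;
    f.subarray = (addr >> 5)  & 0x3;
    f.index    = (addr >> 7)  & 0x7f;
    f.tag      =  addr >> 14;
    return f;
}

/* Illustrative tag memories: one per subarray, indexed by cache-sector. */
static uint32_t tag_mem[NUM_SUBARRAYS][NUM_SECTORS];
static bool     valid  [NUM_SUBARRAYS][NUM_SECTORS];

static bool dvls_lookup(uint32_t addr)
{
    struct dvls_fields f = split_address(addr);
    bool match[NUM_SUBARRAYS];

    /* Step 2: every subarray compares its own tag in parallel. */
    for (int sa = 0; sa < NUM_SUBARRAYS; sa++)
        match[sa] = valid[sa][f.index] && (tag_mem[sa][f.index] == f.tag);

    /* Step 3: the SA field selects which comparison decides hit or miss. */
    return match[f.subarray];
}

int main(void)
{
    uint32_t addr = 0x00012345;
    struct dvls_fields f = split_address(addr);
    tag_mem[f.subarray][f.index] = f.tag;     /* pretend the subline is cached */
    valid[f.subarray][f.index]   = true;
    printf("hit = %d\n", dvls_lookup(addr));  /* prints 1 */
    return 0;
}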
5.5.3 Line-Size Determination
The algorithm for determining adequate line sizes is very simple. This algorithm is based
not on memory-access history but on the current state of the reference-sector. This means
that no information about data evicted from the cache needs to be maintained. On every cache
lookup, the LSD determines the state of the LSS of the reference-sector, as follows:
1. The LSD investigates how many adjacent-sublines exist in the reference-sector using
all the reference-flag bits and the tag-comparison results.
2. Based on the above-mentioned investigation result and the current state of the LSS of
the reference-sector, the LSD determines the next state of the LSS. The state-transition
diagram is shown in Figure 5.7.
[Figure 5.7: State Transition Diagram — the three states Minimum Line, Medium Line, and Maximum Line; transitions are triggered by the pattern of reference-sublines and adjacent-sublines observed in the reference-sector, and all other patterns leave the state unchanged.]
If there are many neighboring adjacent-sublines, the reference-sector has rich spatial lo-
cality. This is because the data currently being accessed by the processor and the adjacent-
sublines are fetched from the same memory-sector, and these sublines have been accessed
by the processor recently. In this case, the line size should become larger. Thus the state
depicted in Figure 5.7 moves from the minimum state (32-byte line) to the medium state
(64-byte line) or from the medium (64-byte line) state to the maximum state (128-byte line)
when the reference-subline and adjacent-sublines construct a larger line size than the current
line size.
In contrast, if the reference-sector has been accessed sparsely before the current access,
there should be few adjacent-sublines in the reference-sector. This means that the reference-
sector has poor spatial locality at that time. In this case, the line size should become smaller.
So the state depicted in Figure 5.7 moves from the maximum state (128-byte line) to the
medium state (64-byte line) when the reference-subline and adjacent-sublines construct equal
or smaller line-size than the medium line-size (64-byte or 32-byte line). Similarly, the state
moves from the medium state (64-byte line) to the minimum state (32-byte line) when the
reference-subline and adjacent-sublines construct minimum line-size (32-byte line).
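The following C sketch captures this determination rule as we understand it from the description above and Figure 5.7: the LSD first computes which aligned group the reference-subline and the adjacent-sublines construct, and then moves the LSS one step toward that size. The encoding and helper names are ours; the thesis describes the LSD only at this behavioral level.

#include <stdbool.h>
#include <stdio.h>

enum lss_state { LINE_32 = 0, LINE_64 = 1, LINE_128 = 2 };  /* minimum/medium/maximum */

/* Size (in sublines: 1, 2, or 4) of the aligned group constructed by the
 * reference-subline and the adjacent-sublines. active[i] is true if
 * subline i is the reference-subline or an adjacent-subline.              */
static int constructed_sublines(const bool active[4], int ref)
{
    int half = ref & ~1;                               /* 64-byte-aligned partner pair */
    if (active[0] && active[1] && active[2] && active[3])
        return 4;                                      /* whole 128-byte sector        */
    if (active[half] && active[half + 1])
        return 2;                                      /* the reference-subline's half */
    return 1;
}

static enum lss_state next_lss(enum lss_state cur, const bool active[4], int ref)
{
    int built = constructed_sublines(active, ref);
    switch (cur) {
    case LINE_32:  return built  > 1 ? LINE_64  : LINE_32;    /* grow one step   */
    case LINE_64:  return built  > 2 ? LINE_128 :
                          built == 1 ? LINE_32  : LINE_64;    /* grow or shrink  */
    case LINE_128: return built <= 2 ? LINE_64  : LINE_128;   /* shrink one step */
    }
    return cur;
}

int main(void)
{
    bool active[4] = { true, true, false, false };   /* sublines 0 and 1 in use */
    enum lss_state s = next_lss(LINE_32, active, 0);
    printf("next state: %d (0=32B, 1=64B, 2=128B)\n", s);   /* prints 1 */
    return 0;
}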
5.6 Evaluations
In this section, we discuss the performance/energy efficiency of the VLS caches, S-VLS and D-
VLS. Before presenting the performance/energy improvements achieved by the VLS caches,
we consider the access time and access energy of the cache and main memory, respectively.
Then we show simulation results for cache-hit rates and cache-line sizes, and evaluate the
performance in term of the average memory-access time (AMAT ) and the energy in term of
the average memory-access energy (AMAE).
5.6.1 Simulation Environment
In this evaluation, we compare the VLS caches with some conventional caches. Each cache
model is represented as follows:
• Fix128 : Conventional 16 KB direct-mapped cache with fixed 128-byte line size.
• Fix128W2 : Conventional 16 KB two-way set-associative cache with fixed 128-byte line
size.
• Fix128W4 : Conventional 16 KB four-way set-associative cache with fixed 128-byte line
size.
• Fix128db : Conventional 32 KB direct-mapped cache with fixed 128-byte line size.
• SVLS128-32 : 16 KB direct-mapped S-VLS cache having three line sizes of 32 bytes,
64 bytes, and 128 bytes. The cache changes the line size program by program. The
adequate line size of each program is determined based on prior simulations.
• DVLS128-32 : 16 KB direct-mapped D-VLS cache having three line sizes of 32 bytes,
64 bytes, and 128 bytes. The line-size determiner optimizes the line size at run time.
For the cache-access time (TCache), we use the CACTI 2.0 model in Section 5.6.2. CACTI
estimates the cache-access time through a detailed analysis of several components, for example,
sense amplifiers, output drivers, and so on [96] [42]. In addition, we calculate the cache-access
energy (ECache) based on Kamble's model [48]. Then, we measure cache-miss rates using two
kinds of cache simulators written in C: one for conventional caches with a fixed 128-byte line size
and the other for the VLS caches with 32-byte, 64-byte, and 128-byte line sizes. The line size
Table 5.1: Benchmark Programs.
Programs Inputs
SPECint92 026.compress, 072.sc ref
SPECfp92 052.alvinn ref
SPECint95 099.go, 124.m88ksim, 126.gcc, 130.li,
132.ijpeg, 134.perl, 147.vortex training
SPECfp95 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d test
MPEG2 encoder, decoder verification
Mix-Int1 124.m88ksim, 130.li, 147.vortex –
Mix-Int2 072.sc, 126.gcc, 134.perl –
Mix-Fp 052.alvinn, 101.tomcatv, 103.su2cor –
Mix-IntFp 132.ijpeg, 099.go, 104.hydro2d –
(i.e., the number of sublines involved in cache replacements) is also measured for the D-VLS
cache. In our experiments, eleven integer programs and five floating-point programs from the
SPEC92/95 benchmark suite [82] are used. We also simulate the mpeg2encode and mpeg2decode
programs from [63] using verification pictures as media applications. Furthermore, to model more
realistic execution on general-purpose processors, four benchmark sets are used: Mix-Int1,
Mix-Int2, Mix-Fp, and Mix-IntFp. The programs in each benchmark set are assumed to run in a
multiprogrammed manner on a uniprocessor system, and a context switch occurs after every one
million instructions. Mix-Int1 and Mix-Int2 contain integer programs only, and Mix-Fp consists of
three floating-point programs. Mix-IntFp is formed by two integer programs and one floating-point
program. For each benchmark set, three billion instructions are executed.
All of the programs are compiled by GNU CC with the “–O2” option, and are executed on
an Ultra SPARC architecture. The address traces are captured by QPT [31].
5.6.2 Cache-Access Time
Cache-access time, or cache-hit time, is very sensitive to the cache organization. Figure
5.8 illustrates critical timing paths of the conventional caches and the S-VLS cache.
[Figure 5.8: Cache Critical Path — (a) a conventional direct-mapped cache with 128-byte lines, (b) a conventional 4-way set-associative cache with 32-byte lines, and (c) a direct-mapped statically variable line-size cache with 32-, 64-, and 128-byte lines; each panel shows the TagSide-path, DataSide-path, and (where present) MuxSide-path through the decoder, tag and data reads, comparators, multiplexor drivers, and output drivers.]
MatchOut
and DataOut are outputs of the caches, both of which are driven by tri-state buffers. We
assume that the multiplexors to select a word data are implemented by the tri-state buffers.
The cache-access time consists of the delay for decoder, tag read, data read, comparators,
multiplexor drivers, and output drivers [96]. The cache-access time of the conventional direct-
mapped cache is determined by either the TagSide-path or the DataSide-path, while that of
the conventional set-associative cache is determined by the longer path of the MuxSide-path
and the DataSide-path, as shown in Figure 5.8 (a) and (b).
The structure of the S-VLS cache is similar to that of the conventional set-associative cache
having 32-byte small line size, as shown in Figure 5.8 (b) and (c). In the conventional set-
associative cache, the MuxSide-path often determines the cache-access time because the control
signals for selecting a word of data are generated only after the tag comparison is performed.
However, this critical path does not appear in the S-VLS cache because the control signals for the data
selection are made from the reference address directly. In the D-VLS cache, reference-flag,
LSS-table, and LSD for run-time line-size optimization are added to the S-VLS cache orga-
nization. As these components are not on the critical-path of the S-VLS cache, the D-VLS
cache also does not have extra overhead for the cache-access time.
Larger cache lines have two effects on the cache-access time. First, the delay for decoder
is reduced by the decreased number of cache lines in the SRAM array. Second, the delay
Table 5.2: Cache Access Time.
Cache Access Time [s] Normalized Access Time [Tunit]
Fix128 1.12129e-09 1.000
Fix128W2 1.64826e-09 1.470
Fix128W4 2.11147e-09 1.883
Fix128db 1.34006e-09 1.195
SVLS128-32 1.12129e-09 1.000
DVLS128-32 1.12129e-09 1.000
for data drivers becomes longer because the number of drivers which share an output line is
increased and there is more loading at the output of each driver [96]. These features appear
not only on conventional caches but also on the VLS caches. Therefore, we can assume
that the DataSide-path delay of the VLS caches is the same as that of Fix128 which is the
conventional cache with the same cache size and the same associativity. On the other hand,
the TagSide-path of the VLS caches might be slightly longer than that of Fix128 because one
of the four tag-comparison results has to be chosen for the MatchOut signal. However, control signals
for this selection are made from the reference address directly. Thus, the TagSide-path of
the VLS caches is longer than that of Fix128 by only the delay for a single tri-state buffer.
We consider that the latency of a single tri-state buffer has hardly any adverse effect on the
cache-access time. Consequently, it is assumed that the cache-access time of the VLS
caches is the same as that of Fix128.
Table 5.2 shows the cache-access time based on the CACTI model [42]. It is assumed
that the process technology is 0.18 um. Here, we regard the cache-access time of the 16 KB
conventional direct-mapped cache (Fix128) as Tunit. The dynamic line-size optimization
in the D-VLS cache requires two LSS-table accesses per cache access: one for reading and one for
writing. From the cache-cycle-time point of view, the two accesses might make the cache-cycle
time longer, because if the LSS-table is implemented as an SRAM array, it is very hard to complete
the two SRAM accesses in a single processor clock cycle. There are two methods to resolve this
problem: one is to pipeline the LSS-table accesses, and the other is to implement
the LSS-table using flip-flops. The latter method is employed in this evaluation, because
the former method makes the structure and control for implementing the LSS-table more
complex.
5.6.3 Cache-Access Energy
To obtain the values of the capacitances C in Equation (2.7), we refer to [49], which follows
the model given by [95]. We formulate the energy dissipated in a conventional M-way set-
associative cache, in which the total number of sets (the total number of cache-sectors) and
the tag size are denoted by Nrow and T bits, respectively. The cache has an St-bit status
flag for each subline. The cache-access energy (ECache) can be approximated by the sum of
the energy dissipated in the bit lines (Ebit) and that in the word lines (Eword) [48]:
ECache ≈ ESRAMarray ≈ Ebit + Eword,
Ebit = 0.5 × Vdd^2 × [Nbl,prch × Cbl,prch + Nbl,w × Cbl,w + Nbl,r × Cbl,r
       + m × (8 × L + T + St) × (Cg,qpa + Cg,qpb + Cg,qp)],
Eword = Vdd^2 × m × (8 × L + T + St) × (2 × Cg,q1 + Cwordwire),
Cbl,prch = Nrow × (0.5 × Cd,q1 + Cbitwire),
Cbl,w = Cbl,r = Nrow × (0.5 × Cd,q1 + Cbitwire) + Cd,qp + Cd,qpa,
Nbl,prch = 0.5 × (T × M + St + 8 × L × M) × 2,
Nbl,r = 0.5 × (T × M + St + 8 × L × M) × 2,
Nbl,w = 0.5 × WPA × (St + Wavg,data) × 2.
Here, we assume a 3.3-volt power supply and a 1/2 Vdd voltage swing on the bit lines.
WPA (writes per access) denotes the number of write operations per cache access, and is assumed
to be 0.3. Wavg,data is the average data width of a write request, and is assumed to be 19 bits.
We also assume that all signal values are independent and have a uniform switching probability
of 0.5. Nbl,prch, Nbl,w, and Nbl,r are the total numbers of bit-line transitions, and Cbl,prch,
Cbl,w, and Cbl,r are the bit-line load capacitances, due to precharging, writing, and reading,
respectively. Cg,X and Cd,X are the gate and drain capacitances of a transistor X, respectively.
The transistors qp, qpa, and qpb are used for the bit-line precharging circuits, and q1
is the pass gate of an SRAM cell. Cbitwire is the bit-line wire capacitance, and Cwordwire is the
word-line wire capacitance, per SRAM cell. Based on [49], we use the following capacitance values:
Cd,q1 = 2.737 fF; Cg,q1 = 0.401 fF; Cbitwire = 4.4 fF/bit-cell; Cd,qp = Cd,qpa = Cd,qpb = 80.89 fF;
Cg,qp = Cg,qpa = Cg,qpb = 38.08 fF; Cwordwire = 1.8 fF/bit-cell. These values are based on the
0.8 micron CMOS cache design described in [95].
Table 5.3: Cache Access Energy.
Cache Access Energy [fJ] Normalized Access Energy [Eunit]
Fix128 10,013,100 1.000
Fix128W2 11,611,419 1.160
Fix128W4 14,818,110 1.480
Fix128db 18,406,193 1.838
SVLS128-32 10,529,301 1.051
DVLS128-32 10,916,142 1.090
Table 5.3 shows the cache-access energy (ECache) for each cache. The energy consumed
for write operations for cache refill is ignored. We regard the cache-access energy of the 16
KB conventional direct-mapped cache (Fix128) as Eunit. Increasing the cache associativity
consumes more energy, because it increases the total number of bit-lines, precharging circuits,
and so on. Similarly, increasing the cache size consumes more energy due to the increase in
the bit-line capacitance (i.e., increase in Nrow). Thus, the cache-access energy of Fix128W4
and Fix128db are larger than that of Fix128. On the other hand, the VLS caches do not
have this kind of energy overhead, because the cache size and associativity of Fix128 are
maintained. The VLS caches do consume slightly more energy due to the extra tag comparisons;
they perform a tag comparison at every subarray, as explained in Section 5.4.2 and in Section
5.5.2. However, the total number of bit-lines to be activated for the tag-memory accesses is
much smaller than that for the data-memory accesses. Therefore, the energy overhead for
the extra tag comparison is small. In addition, although the D-VLS cache needs to read
the 2-bit LSS and four 1-bit reference flags for run-time line-size optimization, this energy
overhead is also trivial.
Table 5.4: Cache-Miss Rates.
Program Fix128 Fix128W2 Fix128W4 Fix128db SVLS128-32 DVLS128-32
026.compress 0.1871 0.1755 0.1732 0.1634 0.1718 0.1724
072.sc 0.037 0.0285 0.0263 0.0276 0.0364 0.0465
052.alvinn 0.0224 0.0087 0.0080 0.0175 0.0224 0.0181
099.go 0.1024 0.0695 0.0302 0.0541 0.0571 0.0638
124.m88ksim 0.0202 0.0045 0.0028 0.0068 0.0167 0.0153
126.gcc 0.0611 0.0344 0.0254 0.0349 0.0535 0.0526
130.li 0.0341 0.0203 0.0182 0.0226 0.0341 0.0358
132.ijpeg 0.0244 0.0048 0.0036 0.0068 0.0195 0.0175
134.perl 0.0542 0.0230 0.0105 0.0295 0.0332 0.0286
147.vortex 0.050 0.0292 0.0195 0.030 0.036 0.0374
101.tomcatv 0.0633 0.0182 0.0062 0.0546 0.0633 0.0578
102.swim 0.2612 0.3007 0.3137 0.1016 0.1381 0.1419
103.su2cor 0.2600 0.0840 0.0242 0.2396 0.0887 0.0758
104.hydro2d 0.0481 0.0217 0.0179 0.0259 0.0481 0.0295
mpeg2encoder 0.0840 0.0033 0.0007 0.0326 0.0468 0.0476
mpeg2decoder 0.0265 0.0045 0.0036 0.0131 0.0105 0.0197
Mix-Int1 0.0348 0.0187 0.0145 0.0211 0.0278 0.0285
Mix-Int2 0.0515 0.0269 0.0192 0.0309 0.0384 0.0414
Mix-Fp 0.1119 0.0370 0.0132 0.1005 0.0468 0.0385
Mix-IntFp 0.0597 0.0327 0.0188 0.0311 0.0452 0.0377
5.6.4 Cache-Miss Rate
We have measured cache-miss rates for all benchmark programs using event-driven cache-
simulators. Table 5.4 shows the simulation results. For some programs, the VLS caches achieve
nearly the same or even lower miss rates than the double-size conventional direct-mapped cache
(Fix128db). However, increasing the associativity produces much better results.
For all programs except for 026.compress and 102.swim, the conventional four-way set-
associative cache (Fix128W4) achieves the lowest cache-miss rates of all the caches.
[Figure 5.9: Miss Rates for Benchmarks — miss rates of Fix32, Fix64, Fix128, and DVLS128-32 for the integer and floating-point programs, normalized per benchmark to the conventional cache with the best line size; bars that exceed the plotted range are annotated with their values.]
To evaluate the accuracy of dynamic line-size optimization of the D-VLS cache, we have
executed the SPEC benchmark programs and MPEG2 programs on 16 KB conventional
direct-mapped caches, each of which has 32-byte lines (Fix32), 64-byte lines (Fix64), and
128-byte lines (Fix128). Figure 5.9 presents simulation results. The left three bars for
each benchmark are cache-miss rates produced by the conventional caches. The remaining
bar to the right is the result of the D-VLS cache (DVLS128-32). For each benchmark,
simulation results are normalized to the cache-miss rate produced by the conventional cache
with the best line size. It is clear that the best line size is highly application-dependent.
In a number of programs, however, the D-VLS cache gives nearly equal or lower miss rates
than the conventional cache with the best line size. In particular, for 132.ijpeg, 134.perl,
052.alvinn, and 104.hydro2d, the D-VLS cache has significant performance advantages over
the conventional caches. For all other programs but one (072.sc), the D-VLS cache produces
better results than the conventional cache with the second-best line size.
[Figure 5.10: Amount of Spatial Locality at a Cache-Sector — for 072.sc, 134.perl, and 104.hydro2d, the number of referenced 32-byte sublines (1 to 4) in the most frequently accessed 128-byte line, plotted over the cache-replacement sequence.]
Although the D-VLS cache gives good results for almost all programs, it does not work
well for 072.sc. To clarify the cause, we have analyzed the transition of the amount of spatial
locality at the cache-sector that is most frequently accessed by the processor on Fix128. In this
analysis, we have measured the number of 32-byte sublines referenced by the processor in a
128-byte fixed line while the 128-byte line resides in the
cache. We regard the number of the referenced 32-byte sublines as the amount of spatial
locality at the cache-sector. Figure 5.10 presents the simulation results; the horizontal axis
shows cache-replacement sequence, and the vertical axis shows the number of the referenced
32-byte sublines in the 128-byte fixed line. It is clear that the amount of spatial locality
in 134.perl and 104.hydro2d is stable, whereas that in 072.sc varies frequently. On every
cache lookup, the line-size determiner (LSD) tries to detect the amount of spatial locality at
the reference-sector based on the number of adjacent-sublines. When the amount of spatial
locality of each cache-sector varies frequently, as in 072.sc, the LSD lacks the accuracy needed to
determine the adequate line size.
5.6.5 Main-Memory-Access Time and Energy
The main-memory-access time (TMainMemory) and energy (EMainMemory) depend on the mem-
ory size, organization, process technology, and so on. In this evaluation, we assume that the
main-memory-access time, including the delay for data transfer between the cache and the
main memory (i.e., TDRAMarray + LineSize / BandWidth), is ten times longer than the access
time of the 16 KB direct-mapped conventional cache having 128-byte lines (i.e., TMainMemory =
10 × Tunit).
For the main-memory-access energy, we assume that there is no energy dissipation for
DRAM refresh operations in order to simplify the evaluation. Thus, for the on-chip memory-
path architectures with a conventional cache, the main-memory-access energy (EMainMemory)
depends only on the total number of main-memory accesses. In other words, only cache-miss
rates affect the energy consumption. Since the VLS caches activate only the DRAM subarrays
corresponding to replaced sublines, the energy consumed for accessing to the on-chip main
memory depends not only on cache-miss rates but also on cache-line sizes (i.e., the number of
sublines to be involved in cache replacements). Accordingly, the main-memory-access energy
(EMainMemory) in Equation (2.4) can be expressed as follows:
EMainMemory = (EDRAMarray + EDataTransfer) × AverageLineSize / 128 bytes. (5.1)
Here, we assume that the average main-memory-access energy of conventional caches is ten
times larger than the cache-access energy of Fix128 (i.e., EDRAMarray + EDataTransfer = 10 × Eunit).
The right factor (AverageLineSize / 128 bytes) in Equation (5.1) denotes the average fraction of
the memory-sector's 32-byte DRAM subarrays activated per cache-line replacement.
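For illustration, under these assumptions a program whose average line size is 34.69 bytes (026.compress in Table 5.5) activates only 34.69 / 128 ≈ 0.27 of a memory-sector per replacement, so each main-memory access consumes roughly 0.27 × 10 × Eunit ≈ 2.7 Eunit, whereas a program with a 90.22-byte average line size (052.alvinn) consumes about 7.0 Eunit.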
Table 5.5 shows the average line size on the S-VLS cache (SVLS128-32) and the D-VLS
cache (DVLS128-32). The table also reports the breakdown of cache-replace count for line
Table 5.5: Average Line Size and Replace Count on VLS caches.
Program, S-VLS Ave. Line Size [B], D-VLS Replace Count (32 B, 64 B, 128 B lines), D-VLS Ave. Line Size [B]
026.compress 32.00 3,164,502 243,979 14,498 34.69
072.sc 128.00 1,038,520 492,007 352,181 58.32
052.alvinn 128.00 11,546,415 1,465,880 18,806,730 90.22
099.go 32.00 6,445,160 1,724,674 389,746 42.82
124.m88ksim 64.00 317,746 53,858 68,353 50.83
126.gcc 64.00 10,092,540 3,463,487 1,468,861 48.76
130.li 128.00 1,190,072 426,488 189,570 49.63
132.ijpeg 64.00 3,530,649 1,179,064 1,246,695 58.43
134.perl 32.00 7,987,886 5,250,134 3,849,457 63.46
147.vortex 32.00 19,805,372 3,593,130 1,416,595 42.11
101.tomcatv 128.00 23,539,313 2,608,352 2,650,269 43.73
102.swim 32.00 32,465,163 4,163,613 884,142 37.81
103.su2cor 32.00 15,340,954 6,701,837 3,315,895 53.01
104.hydro2d 128.00 3,784,227 860,802 6,175,600 89.34
mpeg2encoder 32.00 1,764,231 79,783 10,182 33.90
mpeg2decoder 32.00 30,770 2,968 2,047 40.12
Mix-Int1 81.15 14,705,908 4,618,322 2,022,767 60.24
Mix-Int2 74.12 18,492,295 8,250,953 4,620,521 48.02
Mix-Fp 55.44 21,632,285 8,565,636 8,541,197 54.56
Mix-IntFp 82.60 17,005,515 4,564,846 7,577,526 61.97
sizes in the D-VLS cache. The average line size of the D-VLS cache depends on the char-
acteristics of memory-reference behavior in programs. It is observed that the D-VLS cache
attempts to use the small line size for 026.compress, and the average line size is 34.69 bytes.
In contrast, the cache aggressively chooses the large line size for 052.alvinn in order to exploit
the rich spatial locality, and the average line size is 90.22 bytes.
[Figure 5.11: Energy Consumed for On-Chip DRAM Accesses (CMR × 2 × EMainMemory) — normalized energy consumption of Fix128, Fix128W2, Fix128W4, Fix128db, SVLS128-32, and DVLS128-32 for each benchmark program and benchmark set.]
Figure 5.11 depicts the energy consumption for accessing the on-chip main memory.
All results are normalized to the conventional direct-mapped cache having 128-byte lines
(Fix128). As explained earlier, the energy consumption of the conventional caches depends only
on the cache-miss rates. Therefore, the conventional four-way set-associative cache (Fix128W4)
can achieve large energy reductions. For some programs, 052.alvinn, 130.li, 101.tomcatv, and
104.hydro2d, the S-VLS cache cannot reduce any energy relative to Fix128, because the appropriate
line size is 128 bytes. Although the cache-miss rates of the VLS caches are higher than those
of Fix128W4, the VLS caches gain a significant energy advantage from the selective
activation of the on-chip DRAM subarrays. Actually, for a number of programs, the energy
reduction achieved by the VLS caches is comparable to that achieved by Fix128W4.
5.6.6 Average Memory-Access Time
We have calculated the average memory-access time (AMAT) as the performance metric, based on
the cache-access time explained in Section 5.6.2, the cache-miss rates reported in Section
5.6.4, and the main-memory-access time defined in Section 5.6.5. Figure 5.12 depicts the
average memory-access time for each program in terms of Tunit, which is the access time of
Fix128. The upper dark-gray box of each bar is the delay for the cache replacement, which
is formulated by CMR × 2 × TMainMemory.
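As a concrete illustration, combining Table 5.4 with TMainMemory = 10 × Tunit, the AMAT of Fix128 for 103.su2cor is approximately 1.000 + 0.2600 × 2 × 10 ≈ 6.2 Tunit, whereas that of DVLS128-32 is approximately 1.000 + 0.0758 × 2 × 10 ≈ 2.5 Tunit.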
First, we compare the conventional caches. Increasing the cache associativity (Fix128W2,
Fix128W4) yields significant improvements in the cache-miss rates, as reported in Section
5.6.4. However, the improvement is negated by the longer cache-access time. As a result,
the conventional set-associative caches cannot improve the average memory-access time
for many programs. On the other hand, the double-size conventional direct-mapped cache
(Fix128db) achieves higher performance than Fix128 for many programs because of its small
access-time overhead.
Next, we discuss the performance improvements achieved by the VLS caches (SVLS128-32
and DVLS128-32). The VLS caches have no cache-access-time overhead, so the cache-miss-rate
improvement obtained by optimizing the line size translates directly into the average memory-access
time. The S-VLS cache changes the line size program by program. Thus, the performance
of SVLS128-32 is the same as that of Fix128 for 052.alvinn, 072.sc, 130.li, 101.tomcatv,
and 104.hydro2d, whose appropriate line size is 128 bytes. For the other programs, the
S-VLS cache improves the performance over Fix128. This can be understood easily by considering
how the appropriate line size is determined: it is chosen based on prior simulations assuming three
direct-mapped caches with fixed 32-byte, 64-byte, and 128-byte lines, so at worst the cache-miss
rate of Fix128 is guaranteed. On the other hand, when the dynamic line-size optimization in the
D-VLS cache lacks accuracy, as for 072.sc explained in Section 5.6.4, the cache worsens the performance.
However, most of the programs see the performance improvements from the dynamic line-
size optimization, with the exception of 072.sc. The performance improvements achieved by
the VLS caches (SVLS128-32 and DVLS128-32) are comparable to that achieved by the
double-size conventional direct-mapped cache (Fix128db).
[Figure 5.12: Average Memory-Access Time (AMAT) — AMAT in units of Tunit for the six cache models on each benchmark; the lower portion of each bar is TCache and the upper portion is the cache-replacement delay CMR × 2 × TMainMemory, with bars exceeding the plotted range annotated with their values.]
5.6.7 Average Memory-Access Energy
We have measured the average memory-access energy (AMAE) based on the cache-access
energy explained in Section 5.6.3, cache-miss rates reported in Section 5.6.4, and the main-
memory-access energy evaluated in Section 5.6.5. Figure 5.13 depicts the average memory-access
energy for the benchmark programs in terms of Eunit, which is the access energy of Fix128.
[Figure 5.13: Average Memory-Access Energy (AMAE) — AMAE in units of Eunit for the six cache models on each benchmark; the lower portion of each bar is ECache and the upper portion is CMR × 2 × EMainMemory, with bars exceeding the plotted range annotated with their values.]
The upper dark-gray box of each bar is the energy consumed for the cache replacement, which
is formulated by CMR × 2 × EMainMemory.
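As a concrete illustration, combining Tables 5.3, 5.4, and 5.5 with EDRAMarray + EDataTransfer = 10 × Eunit, the AMAE of Fix128 for 103.su2cor is approximately 1.000 + 0.2600 × 2 × 10 ≈ 6.2 Eunit, whereas that of DVLS128-32 is approximately 1.090 + 0.0758 × 2 × 10 × (53.01 / 128) ≈ 1.7 Eunit, a reduction of more than 70 %.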
As explained in Section 5.6.5, the conventional four-way set-associative cache (Fix128W4)
can greatly reduce the energy consumed for main-memory accesses because it achieves much lower
cache-miss rates. However, Fix128W4 consumes more total energy than the 16 KB conven-
tional direct-mapped cache (Fix128) for some programs, because the cache-access-energy overhead
of the increased associativity exceeds the reduction in main-memory-access energy achieved by
improving the cache-miss rates. Similarly, since the double-size conventional direct-mapped cache
(Fix128db) consumes much more energy per access, it is not efficient.
[Figure 5.14: Total Energy Dissipated in Memory Systems (AMAE × total number of memory references) — total energy for the six cache models on the benchmark sets Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp, expressed in units of Eunit and, on the right axis, in Joules assuming Eunit = 10.0 nJ.]
On the other hand, the energy overhead for changing the line size in the VLS caches
(SVLS128-32 and DVLS128-32) is trivial. Therefore, the cache-miss improvement and the
DRAM-subarray subbanking based on small line sizes in the VLS caches contribute to the
total energy reduction. In the best case of 103.su2cor, the VLS caches achieve more than 70
% reduction of the average memory-access energy. Figure 5.14 depicts total energy dissipated
in memory systems, i.e., average memory-access energy × total number of memory references,
for each benchmark set.
5.6.8 Energy–Delay Product
To evaluate the performance and the energy at the same time, we have calculated the energy-
delay products (AMAE × AMAT ) based on Section 5.6.6 and Section 5.6.7. Figure 5.15
shows the results. For each program, all results are normalized to Fix128.
In the conventional caches, the performance improvement achieved by increasing the cache size
(Fix128db) is negated by the increased energy consumption. Conversely, the energy improvement
produced by increasing the cache associativity (Fix128W2 and Fix128W4) is negated by the
performance loss caused by the longer cache-access time. The VLS caches do not suffer from this
kind of trade-off, because they produce both performance and energy improvements.
For the benchmark set Mix-IntFp, the highest performance of conventional caches is given
by the double-size direct-mapped cache (Fix128db), whereas the two-way set-associative
cache (Fix128W2) is the most efficient for energy consumption. However, the ED product
reductions achieved by Fix128db and Fix128W2 are only from 8 % to 20 %, compared with
the 16 KB conventional direct-mapped cache (Fix128). On the other hand, the S-VLS cache
(SVLS128-32) and the D-VLS cache (DVLS128-32) can reduce the ED product by 35 %
and 47 %, respectively. For most of the benchmarks, the S-VLS cache or the D-VLS cache
can make the most significant ED product reduction by optimizing the line size, with the
exception of 101.tomcatv and mpeg2encode.
For many programs, the D-VLS cache gives better results than the S-VLS cache. The reason
can be understood by considering the frequency of line-size modification. Since the amount of
spatial locality varies both within and among program executions, the appropriate line size is not
constant. The S-VLS cache optimizes the line size program by program, while the D-VLS cache
modifies it data by data. Therefore, the D-VLS cache can adapt to the changes in spatial locality
inherent in programs.
[Figure 5.15: Energy-Delay Product — ED product (AMAE × AMAT) of the six cache models for each benchmark, normalized to Fix128.]
5.6.9 Hardware Cost
Generally, a cache consists of an SRAM portion (data-array and tag-array) and logic
portions (decoder, comparator, and multiplexors). Additionally, the D-VLS cache requires
special hardware components: the reference-flag bits, the LSS-table, and the LSD. We have
calculated the size of the SRAM portion and have designed the logic portions in order to find
the number of transistors for each cache. In this design, we have described the logic portions
Table 5.6: Hardware Costs.
Cache Model, SRAM portion (Data [bits], Tag [bits], Total [bits]), Logic portion (Logic [Tr], LSD [Tr], LSS [Tr], Total [Tr]), Total [Tr]
Fix128 131,072 2,304 133,376 17,988 – – 17,988 84,676
Fix32W4 131,072 10,240 141,312 18,968 – – 18,968 89,624
SVLS128-32 131,072 9,216 140,288 18,922 – – 18,922 89,066
DVLS128-32 131,072 9,728 140,800 18,922 230 14,020 33,172 103,572
at the RT level using VHDL (VHSIC Hardware Description Language), and have translated it
into a gate-level description using the Synopsys VHDL Compiler.
For the D-VLS cache (DVLS128-32), each tag includes the 1-bit reference-flag. The LSS-
table is implemented by flip-flops in order to keep the cache-cycle time, as explained earlier
in Section 5.6.2. Since the 16 KB D-VLS cache with 32-byte, 64-byte, and 128-byte lines has
128 cache-sectors (= 16KB / 128bytes), DVLS128-32 requires 256 (= 2bits×128) flip-flops
for the LSS-table. We can implement the LSD with small combinational logic due to the
simple algorithm for determining the adequate line sizes. Table 5.6 shows the size of the
SRAM portion and the number of transistors for the logic portions. The right-most column
gives the total number of transistors including the SRAM portion, where 2 bits of SRAM are
counted as one transistor. This assumption comes from a roadmap [79], which shows that the
ratio of logic transistors/cm2 to cache SRAM bits/cm2 from 2001 to 2007 is
approximately 1:2. The column labeled “LSS” includes both the flip-flops for the LSS-table
and multiplexors for selecting the LSS corresponding to the reference-sector.
The construction of the direct-mapped S-VLS cache (SVLS128-32) is similar to that of a
conventional four-way set-associative cache having 32-byte lines, with the exception of the
tag size. The tag size is affected by the cache associativity, but not by the cache line size.
It is observed that the hardware overheads of the S-VLS cache and the D-VLS cache over
Fix128 are 5 % and 22 %, respectively. Although the D-VLS cache requires more transistors
than the conventional caches, this hardware overhead is trivial for the area of the entire chip
of merged DRAM/logic LSIs which have not only the on-chip cache but also a large on-chip
main memory.
[Figure 5.16: The Effect of Cache Size — (A) miss rates of Fix32, Fix128, and DVLS128-32 as the cache size varies from 4 KB to 128 KB; (B) breakdown of the miss rates into compulsory, capacity, and conflict misses for 16 KB and 128 KB caches.]
5.6.10 Effects of Other Parameters
It is important to analyze the proposed cache architecture under various conditions. In this
section, we evaluate the effectiveness of the dynamically variable line-size (D-VLS) cache in
detail: the effects of the cache size, the on-chip main-memory-access time and energy, and the
size of the LSS-table. The 16 KB direct-mapped D-VLS cache having 32-byte, 64-byte, and 128-byte
lines is compared with the three conventional caches having 128-byte fixed line size: 16 KB
direct-mapped cache (Fix128), 16 KB four-way set-associative cache (Fix128W4), and 32 KB
direct-mapped cache (Fix128db). The four benchmark sets, Mix-Int1, Mix-Int2, Mix-Fp, and
Mix-IntFp, are used in this analysis.
5.6.10.1 Cache Size
In order to investigate the effect of cache size on the D-VLS cache performance, we have
simulated the conventional caches and the D-VLS cache varying the cache sizes from 4 KB
to 128 KB. Figure 5.16 (A) presents the average cache-miss rates of the four benchmark sets.
DVLS128-32 is superior to the conventional caches over the whole range of cache sizes from
4 KB to 128 KB. When the cache size exceeds 64 KB, however, the D-VLS cache does
not make a significant improvement.
Figure 5.16 (B) shows the breakdown of the cache-miss rates for Mix-IntFp benchmark set.
From the figure, it is clear that increasing the cache size reduces the conflict misses even if
the fixed large line size is employed. When the cache size is very small, the total number
of large lines in the cache is very small. In this case, the negative effect of frequent evictions
caused by large lines exceeds the positive effect of prefetching. In contrast, increasing cache
size increases the total number of large lines in the cache. As a result, the conflict misses
can be reduced even if programs do not have enough spatial locality. The D-VLS cache
attempts to improve the performance by reducing the conflict misses. When the cache has
enough capacity for the working set of the programs, the conventional cache can already avoid
the frequent evictions. Therefore, the effectiveness of the D-VLS cache is degraded as the cache
size increases.
Although on-chip cache sizes have certainly been increasing, the working sets of target
application programs have also been growing. Hence, we believe that the D-VLS cache will
produce large performance improvements even as the cache size increases.
5.6.10.2 On-Chip Main-Memory-Access Time and Energy
In merged DRAM/logic LSIs, the on-chip main memory will occupy a large area of the
whole chip. The main-memory-access time (TMainMemory) and energy (EMainMemory) depend
on the on-chip DRAM size, process technology, and so on. Therefore, it is very important to
consider the effect of the on-chip main-memory performance and energy on the total memory-
system performance and energy. To evaluate the robustness of the D-VLS cache, we have
simulated the conventional caches and the D-VLS cache under various conditions.
Figure 5.17 shows the average memory-access time (AMAT ) when the main-memory-
access time (TMainMemory) is changed from 2Tunit to 22Tunit, where Tunit is the cache-
access time of Fix128. The conventional cache with higher associativity (Fix128W4) or
larger size (Fix128db) produces lower cache-miss rates than the D-VLS cache, as reported
in Section 5.6.4. Therefore, the conventional caches achieve higher performance than the D-VLS
cache when the main-memory-access time is large, because increasing the associativity or the
cache size reduces the miss-penalty component (CMR × 2 × TMainMemory) through the improved
cache-miss rate by more than it increases the cache-access time (TCache). Nevertheless, the
performance efficiency of the D-VLS cache is still comparable to that of the conventional caches
even if the main-memory-access time is as large as 22 Tunit.
[Figure 5.17: The Effect of Main-Memory-Access Time — AMAT of Fix128, Fix128W4, Fix128db, and DVLS128-32 for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as TMainMemory varies from 2 Tunit to 22 Tunit; each bar is split into TCache and CMR × 2 × TMainMemory.]
Figure 5.18 depicts the average memory-access energy (AMAE) when the main-memory-
access energy (EMainMemory) is changed from 2Eunit to 22Eunit, where Eunit is the cache-
access energy of Fix128. When the main-memory-access energy exceeds 10 Eunit, the set-
associative cache (Fix128W4) is superior to the other conventional caches because of its lowest
cache-miss rates. The cache-miss improvement reduces the total number of main-memory accesses,
so the total energy is reduced, and this trend becomes clearer as the main-memory-access energy
increases. The cache-miss rates of the D-VLS cache (DVLS128-32)
are higher than those of the set-associative cache (Fix128W4). However, the D-VLS cache has
two ways to reduce the main-memory-access energy: one is to improve the cache-miss rates, and
the other is to obtain the DRAM subbanking effect based on the optimized line size.
[Figure 5.18: The Effect of Main-Memory-Access Energy — AMAE of Fix128, Fix128W4, Fix128db, and DVLS128-32 for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as EMainMemory varies from 2 Eunit to 22 Eunit; each bar is split into ECache and CMR × 2 × EMainMemory.]
Thus, the D-VLS cache can make the most significant energy reduction for all benchmark
sets even if the main-memory-access energy is increased.
5.6.10.3 LSS-Table Size
Thus far, we have assumed that each cache-sector in the 16 KB D-VLS cache has its own
line-size specifier (LSS). Namely, the LSS-table has the same number of entries as the total
number of cache-sectors in the cache. To evaluate the accuracy of the cache-sector-based run-time
line-size optimization, we compare it with a memory-sector-based run-time optimization that
ignores the hardware cost. In addition, we simulate the benchmark sets with various granularities
of line-size specification in order to find the effect of the LSS-table size on the D-VLS cache
performance. Sharing an LSS among many cache-sectors reduces the hardware cost of the D-VLS
cache, as reported in Section 5.6.9.
[Figure 5.19: The Effect of the LSS-Table Size — cache-miss rates of the D-VLS cache for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as the number of cache-sectors sharing an LSS varies from 1 (128 LSS-table entries) to 128 (1 entry), with an additional ideal configuration providing an LSS per memory-sector.]
Figure 5.19 depicts the cache-miss rates for the benchmark sets. The horizontal axis shows the total number of cache-sectors sharing an LSS. For example, “8(16)” means that eight cache-sectors share an LSS, so that the total number of entries in the LSS-table is sixteen (the total number of cache-sectors in the 16 KB D-VLS cache is 128 = 16 × 1024 / 128). The right-most plot, denoted as “memory-sector”, means that the D-VLS cache has an LSS for each memory-sector rather than for each cache-sector. This is an ideal D-VLS cache that ignores the hardware cost of implementing the LSS-table.
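As an illustration of the parameter varied in Figure 5.19, the sketch below shows one plausible way to map an address to an entry of a shared LSS-table; the grouping of consecutive cache-sector indices is an assumption made for this example, not a detail taken from the hardware design.

# A minimal sketch (an assumption for illustration, not the thesis hardware) of
# indexing a shared LSS-table: 16 KB cache, 128-byte cache-sectors -> 128 sectors.

SECTOR_SIZE = 128                          # bytes per cache-sector (maximum line size)
NUM_SECTORS = 16 * 1024 // SECTOR_SIZE     # 128 cache-sectors in the 16 KB D-VLS cache

def lss_index(address, sectors_per_lss):
    """LSS-table entry used for `address` when `sectors_per_lss` sectors share one LSS."""
    sector_index = (address // SECTOR_SIZE) % NUM_SECTORS   # which cache-sector is indexed
    return sector_index // sectors_per_lss                  # which LSS entry it falls into

# The "8(16)" configuration: eight sectors share an LSS, so the table has 16 entries.
print(lss_index(0x1A40, 8))   # cache-sector 52 -> LSS entry 6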
First, we compare the cache-sector-based realistic D-VLS cache denoted as “1(128)” with
the memory-sector-based ideal D-VLS cache denoted as “memory-sector” in the figure. In all
but one benchmark set (Mix-Int2), the difference between the improvements given by the realistic model and the ideal model is small. This means that the line-size determiner can select adequate line sizes even if it does not accurately track the amount of spatial locality of
individual memory-sectors.
Next, we discuss the effect of reducing the LSS-table size on the D-VLS cache performance.
The cache-miss rates of the 16 KB conventional cache having fixed 128-byte lines (Fix128) for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp are 0.0348, 0.0515, 0.1119, and 0.0597, respectively, as reported in Section 5.6.4. Although decreasing the LSS-table size (i.e., increasing the number of cache-sectors sharing an LSS) tends to increase the cache-miss rate, the D-VLS cache can still achieve better results than Fix128. For Mix-Int1 and Mix-Int2, sharing an LSS among a few cache-sectors even improves the cache-miss rates. For example, the cache-miss rates given by “4(32)”, in which four cache-sectors share an LSS, are lower than those given by the completely cache-sector-based configuration denoted as “1(128)”. This can be understood by considering the behavior of the line-size determiner (LSD) explained in Section 5.5.3. The LSD updates the state of the LSS corresponding to the reference-sector, which is the cache-sector accessed by the processor. Therefore, an LSS is updated more frequently when it is shared by several cache-sectors. As a result, the LSS may converge more rapidly to an appropriate line size.
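The direction of these updates can be pictured with the toy line-size specifier below: it grows the line size when recent references indicate rich spatial locality and shrinks it otherwise. The threshold and the reference-count heuristic are deliberate simplifications for illustration; the actual determiner of Section 5.5.3 operates on the reference flags of the accessed cache-sector.

# A toy line-size specifier (LSS) update rule, simplified for illustration:
# grow the line size under rich spatial locality, shrink it otherwise.
# The threshold below is an assumption, not the algorithm of Section 5.5.3.

LINE_SIZES = (32, 64, 128)        # selectable line sizes in bytes

def update_lss(current_size, neighbour_sublines_referenced):
    """Return the next line size for this sector.

    `neighbour_sublines_referenced` counts how many of the other 32-byte
    sublines of the 128-byte sector were touched while the line was resident.
    """
    i = LINE_SIZES.index(current_size)
    if neighbour_sublines_referenced >= 2:       # rich spatial locality -> prefetch more
        i = min(i + 1, len(LINE_SIZES) - 1)
    elif neighbour_sublines_referenced == 0:     # no spatial locality -> avoid conflicts
        i = max(i - 1, 0)
    return LINE_SIZES[i]

print(update_lss(64, 3))   # -> 128
print(update_lss(64, 0))   # -> 32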
5.7 Related Work
Saulsbury et al.[78] and Wilson et al.[94] discussed cache architectures having large cache-line
size (512 bytes) with high on-chip memory bandwidth. They tried to avoid frequent cache
conflicts caused by the large cache lines by increasing the cache associativity. Since the D-VLS cache resolves the conflict problem by means of a variable cache-line size, the fast access time of a direct-mapped cache can be maintained.
Several studies have proposed coherent caches in order to improve the performance of shared-memory multiprocessor systems [18], [20]. The cache proposed in [20]
can adjust the amount of data stored in a cache line, and aims to produce fewer invalidations
of shared data and reduce bus or network transactions. On the other hand, the VLS cache
aims at improving the system performance of merged DRAM/logic LSIs by partitioning a large cache line into multiple small cache sublines that can be handled independently, and by adjusting the number of sublines involved in cache replacements. The fixed and adaptive sequential prefetching proposed in [18] allows more than one consecutive cache line to be fetched. This approach needs a counter indicating the number of lines to be fetched. Regardless of the memory reference addresses, the counter is always used for fetching cache lines on read misses. On the other hand, the D-VLS cache has several flags indicating the cache-line size, and which flag is used depends on the memory reference address. In other words, the D-VLS cache can change the cache-line size not only as program execution advances but also across data located at different memory addresses.
Excellent cache architectures exploiting spatial locality have been proposed in [23], [57]
and [41]. The caches presented in [41] and [57] need tables for recording the memory access history of not only cached data but also data evicted from the cache. Similarly, the cache presented in [23] uses a table for recording the behavior of past load/store operations. In addition, the detection of spatial locality in [23] relies on the memory access behavior derived from constant-stride vector accesses. On the other hand, the D-VLS cache determines a suitable cache-line size based only on the state of the cache line that is currently being accessed by the processor. Consequently, the D-VLS cache needs no large tables for storing the memory access history; just a single bit is added to each cache-tag for this purpose.
Furthermore, the above studies have focused only on performance. Our VLS cache attempts to achieve not only high performance but also low power consumption by making good use
of the high on-chip memory bandwidth available on merged DRAM/logic LSIs.
5.8 Conclusions
In this chapter, we have described the variable line-size cache (VLS cache), which is a novel
cache architecture suitable for merged DRAM/logic LSIs. The purpose of the VLS cache is to
make good use of the attainable high on-chip memory bandwidth. The VLS cache attempts
to alleviate the negative effects of large cache line size by changing the cache line size. As the
line-size modification does not require any access-time overhead, the VLS cache can improve the memory performance. Moreover, activating only the DRAM subarrays corresponding to the replaced line size yields a significant energy reduction.
We have proposed two VLS caches: the statically variable line-size cache (S-VLS cache)
and the dynamically variable line-size cache (D-VLS cache). The S-VLS cache determines
an appropriate line size based on prior simulations. The D-VLS cache tries to optimize the
cache line size using a hardware assist. The line-size determiner detects the varying amount
of spatial locality within and among programs based on recently observed data reference
behavior at run time.
To evaluate the performance/energy efficiency of the VLS caches, we have simulated many
benchmark programs on the VLS caches and on conventional caches. The results show that the VLS caches achieve a significant performance/energy improvement. In addition, we have designed the caches to evaluate the hardware overhead. For the Mix-IntFp benchmark set, which includes two integer programs and one floating-point program, the S-VLS cache and the D-VLS cache reduce the energy-delay product by 35 % and 47 %, respectively, while their hardware overheads are only 5 % and 22 %, compared with a conventional cache having the same cache size and associativity.
The D-VLS cache is the more promising of the two. Since the D-VLS cache does not require any modification of the instruction set architecture, full compatibility with existing object code can be kept. In addition, the cache adapts to the varying amount of spatial locality within and among programs. Therefore, we have analyzed the D-VLS cache in detail: the effect of the cache size, the on-chip main-memory-access time and energy, and the size of the LSS-table.
Employing merged DRAM/logic LSIs is one of the most important approaches for future
computer systems, because it can achieve high-performance/low-power by eliminating the
chip boundaries between processors and main memory. It is possible to obtain more perfor-
mance/energy improvements by exploiting the attainable high on-chip memory bandwidth
effectively. Since the VLS cache is applicable to any merged DRAM/logic LSIs, we believe
that the cache management using variable line-size is a very useful approach to improving
the performance/energy efficiency.
Chapter 6
Conclusions
The mobile market is likely to continue growing in the future. One uncompromising requirement of portable computing is energy efficiency, because it directly affects the battery life. At the same time, portable computing will target more demanding applications, for example moving pictures, so that higher performance is also required.
Cache memories have been employed as one of the most important components of com-
puter systems, because they confine memory accesses on-chip. Reducing the frequency of
off-chip memory accesses produces significant advantages: reducing memory-access latency
and reducing I/O driving energy. In order to achieve higher performance, designers have
invested the increasing transistor budget in the cache memories (increasing cache capacity).
However, increasing the cache capacity also increases the cache-access time and energy. Since memory references exhibit temporal and spatial locality, memory accesses concentrate on the cache memory. Therefore, the performance/energy efficiency of cache memories strongly affects the total system performance and energy dissipation. This fact suggests that we need to keep developing high-performance, low-energy cache memories.
In this thesis, we have proposed the following three cache architectures for high performance
and low energy dissipation.
• Way-predicting set-associative cache: A history table in the cache records MRU infor-
mation of each set. When a cache access is issued, only the MRU way is activated.
If the way prediction is correct, there is no activation in the remaining ways, thereby
saving the energy. Namely, the way-predicting set-associative cache attempts to elimi-
nate unnecessary way activation in set-associative caches. It has been observed in our evaluation that a way-predicting set-associative cache reduces the cache-access energy (ECache in Equation (2.4)) by more than 70 %, while it incurs less than 10 % of cache-access-time overhead (TCache in Equation (2.1)), compared with a conventional set-associative cache. A minimal sketch of this access flow is given after this list.
• History-based tag-comparison cache: A history table implemented in a BTB (Branch
Target Buffer) records execution footprints of each instruction block. The corresponding
footprint is left at the first execution of the instruction block. When the instruction
block is executed again, the corresponding footprint is tested. If the footprint is found, the tag comparisons for cache accesses within the instruction block can be omitted. The execution footprints remain valid until a cache miss takes place. Namely, the history-based tag-comparison cache attempts to eliminate unnecessary tag comparisons in order to reduce energy dissipation. It has been observed in our evaluation that a history-based tag-comparison cache reduces the tag-comparison energy by more than 99 % for a program (107.mgrid), compared with a conventional cache.
• Dynamically variable line-size cache: A history table implemented as reference flags
records recently observed memory-access patterns. The dynamically variable line-size
cache adjusts the cache-line size according to the amount of spatial locality at run-
time. If rich spatial locality is observed, the cache increases the cache-line size in order
to obtain the effect of prefetching. Otherwise, the cache decreases the cache-line size to avoid conflict misses. Namely, the dynamically variable line-size cache attempts to eliminate unnecessary data replacement and bandwidth utilization. It has been observed in our evaluation that a dynamically variable line-size cache improves the energy-delay product (AMAT × AMAE, i.e., Equation (2.1) × Equation (2.4)) by more than 45 % for a benchmark set (Mix-IntFp), compared with a conventional organization.
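As referenced in the first item above, the following is a behavioural sketch of the way-predicting access flow. It is an illustration only: the cycle and energy costs are symbolic counters, and the MRU-update policy is the straightforward one implied by the description, not a transcription of the actual design.

# A behavioural sketch of the way-predicting set-associative access flow
# described in the first bullet above (illustration only; costs are symbolic).

class WayPredictingCache:
    def __init__(self, num_sets, num_ways):
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.mru = [0] * num_sets           # way-prediction table: MRU way per set
        self.cycles = 0                     # accumulated access cycles
        self.ways_activated = 0             # proxy for cache-access energy

    def access(self, set_index, tag):
        predicted = self.mru[set_index]
        self.cycles += 1
        self.ways_activated += 1            # first, only the predicted way is searched
        if self.tags[set_index][predicted] == tag:
            return "predicted hit"          # one cycle, one way activated
        # Misprediction: spend one more cycle and search the remaining ways.
        self.cycles += 1
        self.ways_activated += len(self.tags[set_index]) - 1
        for way, t in enumerate(self.tags[set_index]):
            if way != predicted and t == tag:
                self.mru[set_index] = way   # remember the new MRU way
                return "non-predicted hit"
        return "miss"                       # replacement and MRU update would follow

cache = WayPredictingCache(num_sets=4, num_ways=4)
cache.tags[1][2] = 0xABC
print(cache.access(1, 0xABC))   # non-predicted hit (2 cycles, 4 ways activated)
print(cache.access(1, 0xABC))   # predicted hit     (1 cycle, 1 way activated)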
Our caches attempt to improve performance/energy efficiency by eliminating unnecessary
operations at run-time. Dynamic measurement makes it possible to adapt the caches to the
characteristics of programs. Although we have discussed the cache architectures individually, it is also possible to combine them: for example, a way-predicting set-associative cache with a set-associative dynamically variable line-size cache, or a history-based tag-comparison cache with a direct-mapped dynamically variable line-size cache. Therefore, we conclude that our cache architectures are promising for improving the
performance/energy efficiency of memory systems in future processor systems.
We believe that more space in future processor chips will be invested in the cache memories (not only level-1 but also level-2, level-3, and so on). Thus, the cache memories will remain an important component of processor chips. The following are our future challenges.
• The most effective approach to reducing energy dissipation is to reduce the supply
voltage (Vdd in Equation (2.7)). However, a low supply voltage brings with it leakage power that is dissipated across the whole cache memory [51]. Reducing this leakage power consumption is an attractive challenge.
• We believe that the behavioral approaches to improving performance/energy efficiency explained in Section 2.5.1.2 are the more promising. However, clever cache control complicates logic verification. Mature verification techniques for cache memories, for example formal verification techniques, are therefore very important.
• Increasing the cache area may have the undesirable effect of reducing manufacturing yield. Although adding redundancy circuits (and memory cells) improves the yield, it also degrades performance (i.e., the cache-access time becomes longer). The cache-access time directly affects the memory-access latency, as shown in
Equation (2.1). Thus, fault-tolerant techniques suitable for high-speed cache memories
are very important.
• In a future society built on world-wide network systems, one of the most serious problems is the security of information, for example credit-card numbers, phone numbers, and other personal data. This kind of information will be stored and handled by memory systems (disk, main memory, cache memory, and so on). Accordingly, another challenge is to develop high-security memory systems for the society of the twenty-first century.
Acknowledgment
I would like to express my sincere appreciation to my advisor, Professor Hiroto Yasuura, for
his insight, advice, and support during my studies. My future career will benefit greatly from
his guidance.
I wish to acknowledge valuable discussions with Professor Kazuaki Murakami. His stern
evaluation of my work enhanced the quality of my research. I would like to express my
gratitude to Professor Toshinori Sueyoshi. I learned the attitude as a researcher under his
guidance. I would like to thank Professor Itsujiro Arita for giving me an opportunity to work
in computer science. I also would like to thank Professor Mizuho Iwaihara and Mr. Sunao
Sawada for valuable discussions in our laboratory seminars.
I am very grateful to Professor Makoto Amamiya and Professor Kazuaki Murakami for
serving as members of the committee for this thesis and providing thoughtful suggestions.
I would like to acknowledge numerous, past and present, people at Kyushu Institute of
Technology for supporting me. In particular, I would like to thank Professor Morihiro Kuga,
Mr. Koichiro Tanaka, Mr. Hidetomo Shibamura, Dr. Masaru Okumura, Mr. Masahide
Ouchi, and Mr. Munehiro Iida for sharing so much technical knowledge. I would like to
thank my past and present colleagues of Kyushu University, Dr. Hiroyuki Tomiyama, Mr.
Hiroshi Miyajima, Mr. Kenjiro Ike, Dr. Kei Hirose, Dr. Tohru Ishihara, Dr. Akihiko Inoue,
Mr. Eko Fajar Nurprasetyo, Mr. Makoto Sugihara, Mr. Koji Hashimoto, Mr. Katsuhiko
Metsugi, Mr. Takanori Okuma, Ms. Yun Cao, and other members of laboratory, for giving
me helpful suggestions. I also thank Ms. Kazumi Matsuoka, Ms. Kaori Kuga, Ms. Noriko
Usuki, Ms. Kyoko Matsuda, Ms. Kyoko Kubota, Ms. Rika Shudo, and Ms. Naoko Taketomi
for supporting my activities.
I am grateful to Dr. Seiki Ogura, Dr. Yutaka Hayashi, Mr. Seitoku Ogura, Ms. Tomoko
Ogura, and Ms. Betty Bhudhikanok of Halo LSI Design & Device Technology, Inc. for giving
me so much knowledge for circuit and layout design. In particular, I would like to express
my sincere appreciation to Dr. Seiki Ogura. He gave me an opportunity to work for a new company at the forefront of VLSI technology. I have learned a great deal from that experience. I also would like to thank Ms. Ichie Ogura for helping with my life in
the USA. I am also grateful to Mr. Makoto Kojima of Matsushita Electric Industrial Corp.
for giving me so much worthwhile knowledge for circuit design.
I would like to thank the past and present members of the ISIT/KYUSHU (Institute of
Systems & Information Technologies / KYUSHU). Special thanks are due to Dr. Hiroshi
Data, Mr. Koji Kai, and Mr. Hideaki Fujikake for their cooperation.
I would like to thank my uncle, Takayuki Kurihara, and his wife, Hiroko Kurihara for
supporting my life. I also would like to thank my parents, Haruo Inoue and Chizuko Inoue,
for many years of love, care, and support.
Last, but not least, thanks to my wife Tomomi and our two children, Sakura and Gaku,
for encouraging me.
Bibliography
[1] Agarwal, A., and Pudar, S. D., “Column-associative caches: A technique for reducing
the miss rate of direct-mapped caches, ” In Proc. of the 20th International Symposium
on Computer Architecture, pp. 179–180, May 1993.
[2] Agarwal, A., Hennessy, J., and Horowitz, M., “Cache performance of operating systems
and multiprogramming, ” In ACM Transactions on Computer Systems, volume 6, pp.
393–431, Nov. 1988.
[3] Albonesi, D. H., “Selective cache ways: On-demand cache resource allocation, ” In Proc.
of the International Symposium on Microarchitecture, pp. 248–259, Nov. 1999.
[4] Bahar, R. I., Albera, G., and Manne, S., “Power and performance tradeoffs using
various caching strategies, ” In Proc. of the 1998 International Symposium on Low
Power Electronics and Design, pp. 64–69, Aug. 1998.
[5] Bajwa, R. S., Hiraki, M., Kojima, H., Gorny, D. J., Nitta, K., Shridhar, A., Seki, K., and
Sasaki, K., “Instruction buffering to reduce power in processors for signal processing, ”
In IEEE Transaction on Very Large Scale Integration Systems, volume 5, pp. 417–424,
Dec. 1997.
[6] Bellas, N., Hajj, I., and Polychronopoulos, C., “Using dynamic cache management
techniques to reduce energy in a high-performance processor, ” In Proc. of the 1999
International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1999.
[7] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G., “Architectural and com-
piler support for energy reduction in the memory hierarchy of high performance micro-
processors, ” In Proc. of the 1998 International Symposium on Low Power Electronics
and Design, pp. 70–75, Aug. 1998.
[8] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G., “Energy and performance
improvements in microprocessor design using a loop cache, ” In Proc. of the International
Conference on Computer Design: VLSI in Computers & Processors, pp. 378–383, Oct.
1999.
[9] Benini, L., De Micheli, G., Macii, E., Sciuto, D., and Silvano, C., “Asymptotic zero-
transition activity encoding for address busses in low-power microprocessor-based sys-
tems, ” In Proc. of the 7th Great Lakes Symposium on VLSI, pp. 77–82, Mar. 1997.
[10] Benschneider, B. J., Park, S., Allmon, R., Anderson, W., Arneborn, M., Cho, J., Choi, C.,
Clouser, J., Han, S., Hokinson, R., Hwang, G., Jung, D., Kim, J., Krause, J., Kwack, J.,
Meier, S., Seok, Y., Thierauf, S., and Zhou, C., “A 1ghz alpha microprocessor, ” In
Proc. of the 2000 International Solid-State Circuits Conference, pp. 86–87, Feb. 2000.
[11] Burger, D. C., Austin, T. M., and Bennett, S., “Evaluating future microprocessors - the simplescalar toolset.”
[12] Burger, D., Goodman, J. R., and Kagi, A., “Memory bandwidth limitations of future
microprocessors, ” In Proc. of the 23rd Annual International Symposium on Computer
Architecture, pp. 78–89, May 1996.
[13] Burger, D., Kaxiras, S., and Goodman, J. R., “Datascalar architectures, ” In Proc. of
the 23rd Annual International Symposium on Computer Architecture, June 1997.
[14] Calder, B., Grunwald, D., and Emer, J., “Predictive sequential associative cache, ” In
Proc. of the 2nd International Symposium on High-Performance Computer Architecture,
pp. 244–253, Feb. 1996.
[15] Caravella, J. S., “A low voltage sram for embedded applications, ” In IEEE Journal of
Solid-State Circuits, volume 32, pp. 428–432, Mar. 1997.
[16] Chang, J. H., Chao, H., and So, K., “Cache design of a sub-micron cmos system/370, ”
In Proc. of the 14th International Symposium on Computer Architecture, pp. 208–213,
June 1987.
[17] Chiou, D., Jain, P., Rudolph, L., and Devadas, S., “Application-specific memory man-
agement for embedded systems using software-controlled caches, ” In Proc. of 37th
Design Automation Conference, pp. 416–419, June 2000.
[18] Dahlgren, F., Dubois, M, and Stenstrom, P., “Fixed and adaptive sequential prefetching
in shared memory multiprocessors, ” In Proc. of the 1993 International Conference on
Parallel Processing, pp. 56–63, Aug. 1993.
[19] Delaluz, V., Kandemir, M., Vijaykrishnan, N., and Irwin, M. J., “Energy-oriented com-
piler optimizations for partitioned memory architectures, ” In Proc. of the International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 138–
147, Nov. 2000.
[20] Dubnicki, C., and LeBlanc, T. J., “Adjustable block size coherent caches, ” In Proc. of
the 19th Annual International Symposium on Computer Architecture, pp. 170–180, May
1992.
[21] Fisk, B. R., and Bahar, R. I., “The non-critical buffer: Using load latency tolerance to
improve data cache efficiency, ” In Proc. of the International Conference on Computer
Design: VLSI in Computers & Processors, pp. 538–545, Oct. 1999.
[22] Ghose, K., and Kamble, M. B., “Reducing power in superscalar processor caches using
subbanking, multiple line buffers and bit-line segmentation, ” In Proc. of the 1999
International Symposium on Low Power Electronics and Design, pp. 70–75, Aug. 1999.
[23] Gonzalez, A., Aliagas, C., and Valero, M., “A data cache with multiple caching strate-
gies tuned to different types of locality, ” In Proc. of the International Conference on
Supercomputing, pp. 338–347, July 1995.
[24] Green, P. K., “A ghz ia-32 architecture microprocessor implemented on 0.18um tech-
nology with aluminum interconnect, ” In Proc. of the 2000 International Solid-State
Circuits Conference, pp. 98–99, Feb. 2000.
[25] Haji, N. B. I., Polychronopoulos, C., and Stamoulis, G., “Architectural and compiler
support for energy reduction in the memory hierarchy of high performance micropro-
cessors, ” In Proc. of the 1998 International Symposium on Low Power Electronics and
Design, pp. 70–75, Aug. 1998.
[26] Hasegawa, A., et al., “Sh3: High code density, low power, ” In IEEE Micro, pp. 11–19,
Dec 1995.
[27] Hashimoto, K., Tomita, H., Inoue, K., Metsugi, K., Murakami, K., Miyakawa, N., In-
abata, S., Yamada, S., Takashima, H., Kitamura, K., Obara, S., Amisaki, T., Tanabe, K.,
Nagashima, U., and Hayakawa, K., “Moe: A special-purpose parallel computer for high-
speed, large scale molecular orbital calculation, ” In SuperComputing (SC99), Nov.
1999.
[28] Hennessy, J. L., and Patterson, D. A., “Computer architecture: A quantitative ap-
proach, ” In Morgan Kaufmann Publishers, Inc, 1990.
[29] Hicks, P., Walnock, M., Owens, R. M., “Analysis of power consumption in memory
hierarchies, ” In Proc. of the 1997 International Symposium on Low Power Electronics
and Design, pp. 239–242, Aug. 1997.
[30] Hill, M. D., “A case for direct-mapped caches, ” In IEEE Computer, volume 21, pp.
25–40, Dec. 1988.
[31] Hill, M. D., Larus, J. R., Lebeck, A. R., Talluri, M., and Wood, D. A., “Warts: Wisconsin
architectural research tool set, ” In http://www.cs.wisc.edu/larus/warts.html.
[32] Hofstee, P., Aoki, N., Boerstler, D., Coulman, P., Dhong, S., Flachs, B., Kojima, N.,
Kwon, O., Lee, K., Meltzer, D., Kowka, K., Park, J., Peter, J., Posluszny, S., Shapiro, M.,
Silberman, J., Takahashi, O., and Weinberger, B., “A 1ghz single-issue 64b powerpc
processor, ” In Proc. of the 2000 International Solid-State Circuits Conference, pp.
92–93, Feb. 2000.
[33] Hwu, W. W., and Chang, P. P., “Achieving high instruction cache performance with
an optimizing compiler, ” In Proc.of the 16th Annual International Symposium on
Microarchitecture, pp. 242–251, May 1989.
[34] Inoue, K., and Murakami, K., “Tag comparison omitting for low-power instruction
caches (in Japanese), ” In IPSJ Technical Report, volume ARC140-6, pp. 25–30, Nov.
2000.
[35] Inoue, K., Ishihara, T., and Murakami, K., “Way-predicting set-associative cache for
high performance and low energy consumption, ” In Proc. of the 1999 International
Symposium on Low Power Design, pp. 273–275, Aug. 1999.
[36] Inoue, K., Kai, K., and Murakami, K., “High bandwidth, variable line-size cache architecture for merged dram/logic lsis, ” In IEICE Transactions on Electronics, volume E81-C, pp. 1438–1447, Sep. 1998.
[37] Inoue, K., Kai, K., and Murakami, K, “Dynamically variable line-size cache exploiting
high on-chip memory bandwidth of merged dram/logic lsis, ” In Proc. of the 5th In-
ternational Symposium on High-Performance Computer Architecture, pp. 218–222, Jan.
1999.
[38] Ishihara, T., and Yasuura, H., “A power reduction technique with object code merging
for application specific embedded processors, ” In Proc. of Design, Automation and Test
in Europe Conference 2000, pp. 617–623, Mar. 2000.
[39] John, L. K., and Subramanian, A., “Design and performance evaluation of a cache assist
to implement selective caching, ” In Proc. of the International Conference on Computer
Design: VLSI in Computers & Processors, pp. 510–518, Oct. 1997.
[40] Johnson, T., L., and Hwu, W. W., “Run-time adaptive cache hierarchy management via
reference analysis, ” In Proc. of the 19th Annual International Symposium on Computer
Architecture, pp. 315–326, June 1997.
[41] Johnson, T. L., Merten, M. C, and Hwu, W. W., “Run-time spatial locality detection
and optimization, ” In Proc. of the 30th Annual International Symposium on Microar-
chitecture, pp. 57–64, Dec. 1997.
[42] Jouppi, N. P., “Cacti home page, ” In
http://www.research.digital.com/wrl/people/jouppi/CACTI.html.
[43] Jouppi, N. P., “Improving direct-mapped cache performance by the addition of a small
fully-associative cache and prefetch buffers, ” In Proc. of the 17th Annual International
Symposium on Computer Architecture, pp. 364–373, June 1990.
[44] Jouppi, N. P., Boyle, P., Dion, J., Doherty, M. J., Eustace, A., Haddad, R. W., Mayo, R.,
Menon, S., Monier, L. M., Stark, D., Turrini, S., Yang, J. L., Hamburgen, W. R.,
Fitch, J. S., and Kao, R., “A 300-mhz 115-w 32-b bipolar ecl microprocessor, ” In
IEEE Journal of Solid-State Circuits, volume 28, pp. 1152–1166, Nov. 1993.
[45] Juan, T., Lang, T., and Navarro, J. J., “The difference-bit cache, ” In Proc. of the 23rd
Annual International Symposium on Computer Architecture, pp. 114–119, May 1996.
[46] Kaeli, D. R., and Emma, P. G., “Branch history table prediction of moving target
branches due to subroutine returns, ” In Proc. of the 18th Annual International Sym-
posium on Computer Architecture, pp. 34–42, May 1991.
[47] Kalamatianos, J., and Kaeli, D. R., “Temporal-based procedure reordering for improved
instruction cache performance, ” In Proc. of the 4th International Symposium on High-
Performance Computer Architecture, pp. 244–253, Jan./Feb. 1998.
[48] Kamble, M. B. and Ghose, K., “Analytical energy dissipation models for low power
caches, ” In Proc. of the 1997 International Symposium on Low Power Electronics and
Design, pp. 143–148, Aug. 1997.
[49] Kamble, M. B. and Ghose, K., “Energy-efficiency of vlsi caches: A comparative study, ”
In Proc. of the 10th International Conference on VLSI Design, pp. 261–267, Jan. 1997.
[50] Kawabe, N., and Usami, K., “Low power technique for on-chip memory using biased
partitioning and access concentration (in Japanese), ” In IPSJ DA Symposium ’00, pp.
191–196, July 2000.
[51] Kaxiras, S., Hu, Z., Narlikar, G., and McLellan, R, “Cache-line decay: A mechanism
to reduce cache leakage power, ” In Proc. of Workshop on Power-Aware Computer
Systems, Nov. 2000.
[52] Kessler, R. E, Jooss, R., Lebeck, A., and Hill, M. D, “Inexpensive implementations of set-
associativity, ” In Proc. of the 16th International Symposium on Computer Architecture,
pp. 131–139, 1989.
[53] Kim, H. S., Vijaykrishnan, N., Kandemir, M., and Irwin, M. J., “Multiple access caches:
Energy implications, ” In Proc. of the IEEE CS Annual Workshop on VLSI, Apr. 2000.
[54] Kin, J., Gupta, M., and Mangione-Smith, W. H., “The filter cache: An energy ef-
ficient memory stucture, ” In Proc. of the 30th Annual International Symposium on
Microarchitecture, pp. 184–193, Dec. 1997.
[55] Kirihata, T., Mueller, G., Ji, B., Frankowsky, G., Ross, J., Terletzki, H., Netis, D., Wein-
furtner, O., Hanson, D., Daniel, G., Hsu, L., Storaska, D., Reith, A., Hug, M., Guay, K.,
Selz, M., Poechmueller, P., Hoenigschmid, H., and Wordeman, M., “A 390mm2 16 bank
1gb ddr sdram with hybrid bitline architecture, ” In Proc. of the 1999 International
Solid-State Circuits Conference, pp. 422–423, Feb. 1999.
[56] Ko, U., Balsara, P. T., and Nanda, A. K., “Energy optimization of multi-level processor
cache architecture, ” In Proc. of the 1995 International Symposium on Low Power
Design, pp. 45–49, Apr. 1995.
[57] Kumar, S. and Wilkerson, C., “Exploiting spatial locality in data caches using spa-
tial footprints, ” In Proc. of the 25th Annual International Symposium on Computer
Architecture, pp. 357–368, June 1998.
[58] Lebeck, A. R., Fan, X., Zeng, H., and Ellis, C., “Power aware page allocation, ” In Proc.
of the 9th International Conference on Architectural Support for Programming Language
and Operating Systems, pp. 105–116, Nov. 2000.
[59] Lee, H. S., and Tyson, G. S., “Region-based caching: an energy-delay efficient memory
architecture for embedded processors, ” In Proc. of the International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems, pp. 120–127, Nov. 2000.
[60] Liu, L., “Cache design with partial address matching, ” In Proc. of the 27th Annual
International Symposium on Microarchitecture, pp. 128–136, Nov./Dec. 1994.
[61] McFarling, S., “Cache replacement with dynamic exclusion, ” In Proc. of the 19th
Annual International Symposium on Computer Architecture, pp. 191–200, May 1992.
[62] Milutinovic, V., Markovic, B., Tomasevic, M., and Tremblay, M., “The split tempo-
ral/spatial cache: A complexity analysis, ” In Proc. of SCIzzL-5, Mar. 1996.
[63] MPEG Software Simulation Group, “Free mpeg software: mpeg-2 encoder/decoder,
version 1.2, ” In http://www.mpeg.org/ tristan/MPEG/MSSG/, 1996.
[64] Murakami, K., Shirakawa, S., and Miyajima, H., “Parallel processing ram chip with
256mb dram and quad processors, ” In Proc. of the 1997 International Solid-State
Circuits Conference, pp. 228–229, Feb. 1997.
[65] Nakamura, H., Kondo, M., and Boku, T., “Software controlled reconfigurable on-chip
memory for high performance computing, ” In Proc. of the 2nd Workshop on Intelligent
Memory Systems, Nov. 2000.
[66] Nii, K., Makino, H., Tujihashi, Y., Morishima, C., Hayakawa, Y., Ninogami, H.,
Arakawa, T., and Hamano, H., “A low power sram using auto-backgate-controlled
mt-cmos, ” In Proc. of the 1998 International Symposium on Low Power Design, pp.
293–298, Aug. 1998.
[67] Ohsawa, T., Kai, K., and Murakami, K, “Optimizing the dram refresh count for merged
dram/logic lsis, ” In Proc. of the 1998 International Symposium on Low Power Design,
pp. 82–87, Aug. 1998.
[68] Panda, P. R., Dutt, N. D., and Nicolau, A., “Memory organization for improved data
cache performance in embedded processors, ” In Proc. of the International Symposium
on System Synthesis, pp. 90–95, Nov. 1996.
[69] Panda, P. R., Dutt, N. D., and Nicolau, A., “Efficient utilization of scratch-pad memory
in embedded processor applications, ” In Proc. of European Design & Test Conference,
Mar. 1997.
[70] Panwar, R., and Rennels, D., “Reducing the frequency of tag compares for low power i-
cache design, ” In Proc. of the 1995 International Symposium on Low Power Electronics
and Design, pp. 57–62, Apr. 1995.
[71] Park, G-H., “Design and analysis of an adaptive memory system for deep-submicron and processor-memory integration technologies, ” PhD thesis, Department of Computer Science, The Graduate School, Yonsei University, Nov. 1999.
[72] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.,
Thomas, R., and Yelick, K., “A case for intelligent ram, ” In IEEE Micro, volume 17,
pp. 34–44, Mar./Apr. 1997.
[73] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.,
Thomas, R., and Yelick, K., “Intelligent ram(iram) chips that remember and com-
pute, ” In Proc. of the 1997 International Solid-State Circuits Conference, pp. 224–225,
Feb. 1997.
[74] Peir, J. K., Lee, Y., and Hsu, W. W., “Capturing dynamic memory reference behavior
with adaptive cache topology, ” In Proc. of the 8th International Conference on Archi-
tectural Support for Programming Language and Operating Systems, pp. 240–250, Oct.
1998.
[75] Sakurai, T., et al., “Low-power high-speed lsi circuits & technology (in Japanese), ” In
Realize, Inc, 1998.
[76] Sanchez, F. J, Gonzalez, A., and Valero, M., “Static locality analysis for cache manage-
ment, ” In Proc. of the International Conference on Parallel Architectures and Compi-
lation Techniques, Nov. 1997.
[77] Santhanam, S., “Strongarm sa-110: a 160mhz 32b 0.5w cmos arm processor, ” In Hot
Chips 8: A Symposium on High-Performance Chips, Aug. 1996.
[78] Saulsbury, A., Pong, F., and Nowatzyk, A., “Missing the memory wall: The case for
processor/memory integration, ” In Proc. of the 23rd Annual International Symposium
on Computer Architecture, pp. 90–101, May 1996.
[79] Semiconductor Industry Association, “The national technology roadmap for semicon-
ductors, ” 1994.
[80] Seznec, A., “A case for two-way skewed-associative caches, ” In Proc. of the 20th Annual
International Symposium on Computer Architecture, pp. 169–178, May 1993.
[81] Shimizu, T., Korematu, J., Satou, M., Kondo, H., Iwata, S., Sawai, K., Okumura, N.,
Ishimi, K., Nakamoto, Y., Kumanoya, M., Dosaka, K., Yamazaki, A., Ajioka, Y., Tsub-
ota, H., Nunomura, Y., Urabe, T., Hinata, J., and Saitoh, K., “A multimedia 32b
risc microprocessor with 16mb dram, ” In Proc. of the 1996 International Solid-State
Circuits Conference, pp. 216–217, Feb. 1996.
[82] SPEC (Standard Performance Evaluation Corporation), In http://www.specbench.org/.
[83] Srinivasan, S. T., and Lebeck, A. R., “Load latency tolerance in dynamically scheduled
processors, ” In Proc. of the 31st Annual International Symposium on Microarchitecture,
Nov.–Dec. 1998.
[84] Stan, M. R., and Burleson, W. P., “Bus-invert coding for low-power i/o, ” In IEEE
Transaction on Very Large Scale Integration Systems, volume 3, pp. 49–58, Mar. 1995.
[85] Su, C. L., and Despain, A. M., “Cache design trade-offs for power and performance
optimization: a case study, ” In Proc. of the 1995 International Symposium on Low
Power Design, pp. 69–74, Apr. 1995.
[86] Theobald, K. B., Hum, H. H. J., and Gao, G. R., “A design framework for hybrid-access
caches, ” In Proc. of the 1st International Symposium on High-Performance Computer
Architecture, pp. 144–153, Jan. 1995.
[87] Tomiyama, H., and Yasuura, H., “Code placement techniques for cache miss rate reduc-
tion, ” In ACM Transactions on Design Automation of Electronic Systems, volume 2,
pp. 410–429, Oct. 1997.
[88] Tomiyama, H., Ishihara, T., Inoue, A., and Yasuura, H., “Instruction scheduling for
power reduction in processor-based system design, ” In Proc. of Design Automation and
Test in Europe, pp. 855–860, Feb. 1998.
[89] Tremblay, M., and O’Connor, J. M., “Ultrasparc-i: A four-issue processor supporting
multimedia, ” In IEEE Micro, volume 16, pp. 42–50, Apr. 1996.
[90] Tyson, G., Farrens, M., Matthews, J., and Pleszkun, A. R., “A modified approach to
data cache management, ” In Proc.of the 28th Annual International Symposium on
Microarchitecture, pp. 93–103, Nov./Dec. 1995.
[91] Veidenbaum, A. V, Tang, W., Gupta, R., Nicolau, A., and Ji, X., “Adapting cache line
size to application behavior, ” In The International Conference on SuperComputing,
Nov. 1999.
[92] Vleet, P. V., Anderson, E., Brown, L., Baer, J., and Karlin, A, “Pursuing the perfor-
mance potential of dynamic cache line sizes, ” In Proc. of the International Conference
on Computer Design: VLSI in Computers & Processors, pp. 528–537, Oct. 1999.
[93] Walsh, S. J., and Board, J. A., “Pollution control caching, ” In Proc. of the International
Conference on Computer Design: VLSI in Computers & Processors, pp. 300–306, Oct.
1995.
[94] Wilson, K. M. and Olukotun, K., “Designing high bandwidth on-chip caches, ” In Proc.
of the 24th Annual International Symposium on Computer Architecture, pp. 121–132,
June 1997.
[95] Wilton, S. J. E. and Jouppi, N. P., “An enhanced access and cycle time model for
on-chip caches, ” In Digital WRL Research Report 93/5, July 1994.
[96] Wilton, S. J. E. and Jouppi, N. P., “Cacti: An enhanced cache access and cycle time
model, ” In IEEE Journal of Solid-State Circuits, volume 31, pp. 677–688, May 1996.
[97] Wulf, W. A. and McKee, S. A., “Hitting the memory wall: Implications of the obvious, ”
In ACM Computer Architecture News, volume 23, Mar. 1995.
[98] Yeager, K. C., “The mips r10000 superscalar microprocessor, ” In IEEE Micro, vol-
ume 16, pp. 28–40, Apr. 1996.
[99] Zhang, C., Zhang, X., and Yan, Y., “Two fast and high-associativity cache schemes, ”
In IEEE Micro, volume 17, pp. 40–49, Sep./Oct. 1997.
List of Publications by the Author
Journal Publications
[J-1] Inoue, K., Kai, K., and Murakami, K., “High Bandwidth, Variable Line-Size Cache
Architecture for Merged DRAM/Logic LSIs,” IEICE Transactions on Electronics, vol.
E81-C, no.9, pp.1438–1447, Sep. 1998.
[J-2] Inoue, K., Ishihara, T., and Murakami, K., “A High-Performance and Low-Power
Cache Architecture with Speculative Way-Selection,” IEICE Transactions on Elec-
tronics, vol. E83-C, no.2, Feb. 2000.
[J-3] Inoue, K., Kai, K., and Murakami, K., “Dynamically Variable Line-Size Cache Ar-
chitecture for Merged DRAM/Logic LSIs,” IEICE Transactions on Information and
Systems, vol. E83-D, no.5, pp.1048–1057, May 2000.
[J-4] Inoue, K., Kai, K., and Murakami, K., “A High-Performance / Low-Power On-chip
Memory-Path Architecture with Variable Cache-Line Size,” IEICE Transactions on
Electronics, vol. E83-C, no. 11, pp.1716–1723, Nov. 2000.
[J-5] Inoue, K., Ishihara, T., Kai, K., and Murakami, K., “High-Performance/Low-Power
Cache Architectures for Merged DRAM/Logic LSIs (in Japanese),” To appear in IPSJ
Journal, vol. 42, no. 3, Mar. 2001.
International Conference Publications
[C-1] Nakagaki, K., Ouchi, M., Inoue, K., Apduhan, B. O., Kuga, M., Sueyoshi, T., “Design
and Implementation of the Educational Microprocessor DLX–FPGA Using VHDL,”
Proceedings of the Second Asian Pacific Conference on Hardware Description Lan-
guages, pp.147-150, Oct. 1994.
[C-2] Miyajima, H., Inoue, K., and Murakami, K., “On-Chip Memorypath Architecture
for Parallel Processing RAM (PPRAM),” Workshop on Mixing Logic and DRAM
(http://iram.CS.Berkeley.EDU/isca97-workshop/), June 1997.
[C-3] Murakami, K., Inoue, K., and Miyajima, H., “PPRAM (Parallel Processing RAM): A
Merged-DRAM/Logic System-LSI Architecture,” Proc. of The International Confer-
ence on Solid State Devices and Materials, pp.274–275, Sep. 1997.
[C-4] Inoue, K., Kai, K., and Murakami, K., “Dynamically Variable Line-Size Cache Ex-
ploiting High On-Chip Memory Bandwidth of Merged DRAM/Logic LSIs,” Proc.
of The Fifth International Symposium on High-Performance Computer Architecture
(HPCA-5), pp.218–222, Jan. 1999.
[C-5] Inoue, K., Ishihara, T., and Murakami, K., “Way-Predicting Set-Associative Cache
for High Performance and Low Energy Consumption,” Proc. of 1999 International
Symposium on Low Power Electronics and Design (ISLPED’99), pp.273–275, Aug.
1999.
[C-6] Hashimoto, K., Tomita, H., Inoue, K., Metsugi, K., Murakami, K., Miyakawa, N.,
Inabata, S., Yamada, S., Takashima, H., Kitamura, K., Obara, S., Amisaki, T., Tan-
abe, K., Nagashima, U., and Hayakawa, K., “MOE: A Special-Purpose Parallel Com-
puter for High-Speed, Large Scale Molecular Orbital Calculation,” SuperComputing
(SC99), Nov. 1999.
[C-7] Inoue, K., Kai, K., and Murakami, K., “An On-chip Memory-Path Architecture on
Merged DRAM/Logic LSIs for High-Performance/Low-Energy Consumption,” Proc.
of International Symposium on Low-Power and High-Speed Chips (COOL Chips III),
pp.283, Apr. 2000.
[C-8] Inoue, K., Kai, K., and Murakami, K., “Performance/Energy Efficiency of Variable
Line-Size Caches on Intelligent Memory Systems,” Proc. of The 2nd Workshop on
Intelligent Memory Systems, Nov. 2000.
[C-9] Inoue, K., and Murakami, K., “A Low-Power Instruction Cache Architecture Ex-
ploiting Program Execution Footprints,” To appear in Work-in-progress Session at
(not included in the proceedings of) The Seventh International Symposium on High-
Performance Computer Architecture (HPCA-7), Jan. 2001.
Technical Society Meeting and Domestic Conference
Publications
[T-1] Nakagaki, K., Inoue, K., Kuga, M., and Sueyoshi, T.,“Design and Implementation of
the Educational Microprocessor DLX–FPGA for Advanced Computer Architecture
Course,” IEICE Technical Report, CPSY94–57, Sep. 1994.
[T-2] Inoue, K., Nakagaki, K., Ouchi, M., Kuriyama, T., Kuga, M., and Sueyoshi, T.,
“Implementation of the Floating-Point Pipeline for the DLX–FPGA Microprocessor
(in Japanese),” IPSJ SIG Notes, ARC-110-19, DA-73-19, Jan. 1995.
[T-3] Inoue, K., Nakagaki, K., Ouchi, M., Kuga, M., and Sueyoshi, T.,“Design and
Rapid System Prototyping of the Educational RISC Microprocessor DLX-FPGA (in
Japanese),” IEICE Technical Report, CPSY95-20, FTS95-20, ICD95-20, Apr. 1995.
[T-4] Sueyoshi, T., Inoue, K., Okumura, M., and Kuga, M., “Development of an FPGA
board for the 32-bit Educational RISC Microprocessor DLX–FPGA (in Japanese),”
Proc. of The Third Japanese FPGA/PLD Design Conference & Exhibit, pp.579–588,
July 1995.
[T-5] Inoue, K., Okumura, M., Kuga, M., and Sueyoshi, T., “Rapid Prototyping of the Edu-
cational 32-bit RISC Microprocessor DLX-FPGA,” Proc. of IPSJ General Conference,
vol. 6, 6P-2, Sep. 1995.
[T-6] Inoue, K., Iida, M., Ouchi, M., Kuga, M., and Sueyoshi, T., “A Feasibility Study for
Design Education Using 32bit RISC Microprocessor DLX-FPGA,” IPSJ SIG Notes,
ARC-115-18, DA-78-18, pp. 109-114, Dec. 1995.
[T-7] Inoue, K., Miyajima, H., Kai, K., and Murakami, K.,“An examination of On-chip
Memorypath Architecture for PPRAM-type LSI (in Japanese),” IEICE Technical Re-
port, ICD97-10, CPSY97-10, FTS97-10, pp. 25-32, Apr. 1997.
[T-8] Murakami, K., Inoue, K., and Miyajima, H., “PPRAM: A Merged Memory/Logic
System LSI Architecture (in Japanese),” Society Symposium Plan: New Trend of
VLSI Architecture, 55th IPSJ General Conference, Sep. 1997.
[T-9] Inoue, K., Kai, K., and Murakami, K.,“Dynamically Variable Line-Size Caches
Exploiting High On-Chip Memory Bandwidth of Merged DRAM/Logic LSIs (in
Japanese),” IEICE Technical Report, ICD98-25, CPSY98-25, FTS98-25, pp. 109-116,
Apr. 1998.
[T-10] Inoue, K., Ishihara, T., and Murakami, K.,“A High-Performance/Low-Energy Cache
Architecture with Way-Prediction Technique (in Japanese),” IEICE Technical Report,
VLD98-44, ICD98-147, FTS98-71, pp. 1-8, Sep. 1998.
[T-11] Inoue, K., Ishihara, T., and Murakami, K.,“A High-Performance Set-Associative
Cache Architecture with Speculative Way-Selection (in Japanese),” IEICE Techni-
cal Report, DSP98-94, ICD98-181, CPSY98-96, pp. 35-42, Oct. 1998.
[T-12] Inoue, K., Kai, K., and Murakami, K.,“Performance and Energy Evaluation of a Dy-
namically Variable Line-Size Cache (in Japanese),” IEICE Technical Report, ICD2000-
5, pp. 25-30, Apr. 2000.
[T-13] Inoue, K., and Murakami, K.,“Tag Comparison Omitting for Low-Power Instruction
Caches (in Japanese),” IPSJ SIG Notes, ARC140-6, pp. 25–30, Nov. 2000.